Wednesday, July 25, 2012

IDAscope update: WinAPI browsing

A week has passed since my last blog post, so it's time to give an update on the current status of development for IDAscope. The title mentions WinAPI browsing, which I am introducing later in the post. First I want to give a follow up on the need for data flow analysis I explained in the last post.

IDAscope + Data flow analysis?

In my last post, I mentioned that one of the next steps would be data flow analysis of parameters to get a better interpretation of API calls. While I am still pursuing this, I realized that it will not come as easily as I had hoped, at least if I want to do it properly. Having studied CS without trying to dodge the theory lectures, I went back to basics and started reading up on data flow analysis. I soon realized that I have become a little rusty, having done more practical work than I probably should have (at least for someone in an academic environment). However, the first two chapters were pretty illuminating and helped me to better grasp the message of Rolf Rolles' keynote at REcon.
I took the following lessons from my peek into this book:
  • There are well-defined, nice mathematical frameworks to perform data flow analysis.
    Efficient algorithms are available in pseudo code, so most of the work has been done already.
  • Intraprocedural data flow analysis is enough for what we need here.
    Having Def/Use-chains would be great.
  • Implementing this generically will take a lot of time. ;)
So for now I feel that implementing other functionality and ideas for easing the reverse engineering workflow has more value than putting together a full data flow framework. I still have this on the agenda, but I will probably come up with a very simplified version (say: hack) that will at least show references the way IDA already does (including clickable references):

Possible substitution for full data flow analysis.
However, I still see the potential of data flow analysis and will pick this up later on, I guess.
If you have a hint on how to integrate data flow analysis at this point without introducing many external dependencies, let me know.
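
To make the Def/Use-chain idea a bit more concrete, here is a minimal sketch for a single basic block. It is not IDAscope code: the instruction format (address, defined registers, used registers) and the function name build_def_use_chains are assumptions made up for illustration, and a real implementation would have to derive these sets from the disassembly.

```python
from collections import defaultdict

# Hypothetical, simplified instruction format:
# (address, set of defined registers, set of used registers).
def build_def_use_chains(instructions):
    """Map each (def_address, register) to the addresses that use that definition."""
    last_def = {}               # register -> address of its most recent definition
    chains = defaultdict(list)  # (def_address, register) -> [use_address, ...]
    for address, defs, uses in instructions:
        # Resolve uses first: an instruction reads its operands before writing.
        for reg in uses:
            if reg in last_def:
                chains[(last_def[reg], reg)].append(address)
        for reg in defs:
            last_def[reg] = address
            chains.setdefault((address, reg), [])
    return dict(chains)

block = [
    (0x401000, {"eax"}, set()),         # mov eax, 2
    (0x401005, {"ebx"}, {"eax"}),       # mov ebx, eax
    (0x401007, {"eax"}, set()),         # xor eax, eax (treated as a pure definition)
    (0x401009, set(), {"eax", "ebx"}),  # an instruction reading both registers
]
chains = build_def_use_chains(block)
```

With such chains, a parameter shown as "eax" at a push could be traced back to the instruction that last defined it. Anything crossing basic block boundaries needs the proper frameworks mentioned above.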

IDAscope: WinAPI browsing

Reading my last blog post again, it turns out the "long-term targeted functionality to have MSDN browsing embedded in IDAscope" was a lot easier to do than initially assumed.

The starting point for this was given to me by Alex. As you might know, a very handy information database comes along with a Windows SDK installation. In its program files folder, there is a subfolder "help/<version number>" containing roughly 250 *.hxs files, which are basically "Microsoft Help Compiled Storage" files. Extracting them with 7-Zip results in about 130,000 files, consuming 1.4 GB of space. Most of them are simple HTML files, probably similar to the MSDN documentation available online.
What is great about those files is the indexing/keyword scheme used by Microsoft, explained here and here.
Just to show you what I am talking about, example "createfile.htm":
[...]
<MSHelp:Keyword Index="F" Term="CreateFile"/>
<MSHelp:Keyword Index="F" Term="CreateFileA"/>
<MSHelp:Keyword Index="F" Term="CreateFileW"/>
<MSHelp:Keyword Index="F" Term="0"/>
<MSHelp:Keyword Index="F" Term="FILE_SHARE_DELETE"/>
<MSHelp:Keyword Index="F" Term="FILE_SHARE_READ"/>
<MSHelp:Keyword Index="F" Term="FILE_SHARE_WRITE"/>
<MSHelp:Keyword Index="F" Term="CREATE_ALWAYS"/>
[...]

Parsing this information was trivial, leaving us with a dictionary of 110,000 keywords (API names, structures, symbolic constants, parameter names, ...) pointing to the corresponding files.
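
For illustration, such a parser could look roughly like the following sketch. This is not the actual IDAscope code: the function names and the regular expression are my assumptions, based only on the tag format shown in the excerpt above.

```python
import re

# Matches tags such as <MSHelp:Keyword Index="F" Term="CreateFile"/>
KEYWORD_PATTERN = re.compile(
    r'<MSHelp:Keyword\s+Index="([^"]*)"\s+Term="([^"]*)"\s*/>')

def extract_keywords(html_text):
    """Return the Term values of all MSHelp:Keyword tags in one help page."""
    return [term for _index, term in KEYWORD_PATTERN.findall(html_text)]

def build_keyword_index(pages):
    """Build a dictionary: keyword -> list of file names that declare it.

    `pages` maps a file name to its HTML content (hypothetical input format).
    """
    index = {}
    for file_name, html_text in pages.items():
        for term in extract_keywords(html_text):
            index.setdefault(term, []).append(file_name)
    return index

sample = '''
<MSHelp:Keyword Index="F" Term="CreateFile"/>
<MSHelp:Keyword Index="F" Term="CreateFileA"/>
<MSHelp:Keyword Index="F" Term="FILE_SHARE_READ"/>
'''
keyword_index = build_keyword_index({"createfile.htm": sample})
```

Storing a list of files per keyword accounts for terms that show up on more than one page, such as the "0" in the excerpt.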

Now we just need a way to visualize the data/HTML. I decided to use QtGui's QTextBrowser instead of QWebView, which would basically have been full WebKit, mainly because QWebView requires a full installation of Qt instead of only PySide as shipped with IDA Pro. Furthermore, QTextBrowser fully suffices for our needs, as it is able to render the basic HTML of which the Windows SDK API documentation is composed anyway.

The result looks like this:

WinAPI browsing via IDAscope.





The links you see in the picture are all functional, which is really nice for getting some context around the API you are currently reading about. And because we of course want to be hip, I used QCompleter to give search suggestions based on the keywords:

Clicking on the API names in the first picture shown above in the data flow section will also bring up the respective API page, changing window focus to the browser.
As a possible future feature, I am thinking of extending the context menu (right click) of the IDA View with a "search in WinAPI" entry in order to ease use and also cover names that are not targeted by the set of semantic definitions. From my own usage experience, having a "back" button in the browser will also be essential, so I will add that soon, too.

A downside of using Windows SDK as exclusive data source is that information about ntdll and CRT functions is not included. Maybe I will add a switch for "online" mode, so you can still surf MSDN from within the window. But this has lower priority right now.

So not much technical stuff in today's post, but I am positive that we can change that in the next one. I hope I will have implemented the data flow "hack" by then. But the next main goal is to bring the subroutine exploration explained by Alex in his blog post into IDAscope. I feel that there is more to gain from the structural information generated through his scripts.

Wednesday, July 18, 2012

Introducing: IDAscope

About a week ago, I announced on Twitter the progress on "IDAscope", the IDA plugin Alex and I are currently working on, and showed a screenshot. In this post, I want to roll out some basic thoughts on the idea behind the plugin and its motivation.

I feel that there is still a lot of potential for visually exploring the data contained in a binary under analysis, even if it is just by providing certain overviews that are not available in the stock versions of our analysis tools.
About a year ago, I started off with a little script that tagged unexplored (i.e. not renamed) functions with a short semantic description of what I assume is happening inside, based on API calls. If there are calls to, let's say, ws2_32!connect, ws2_32!send, and ws2_32!recv, the default name "sub_c0ffee" is extended by "net", yielding the name "net_sub_c0ffee". However, sorting by function names with the standard Functions window of IDA is unsatisfactory, as sorting by tags is just not possible. That made me realize that I would need some kind of custom table visualization, like the one you might have already seen in my tweet. Here is the screenshot, so you don't have to click anything:
Introducing IDAscope.
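
The tagging idea described above can be sketched in a few lines of plain Python. The tag table and the function name tag_function_name are hypothetical and only serve as illustration; the real script works on IDA's database rather than on plain lists.

```python
# Hypothetical tag definitions: semantic tag -> API names that trigger it.
SEMANTIC_TAGS = {
    "net": {"connect", "send", "recv"},
    "file": {"CreateFileA", "ReadFile", "WriteFile"},
    "reg": {"RegOpenKeyExA", "RegSetValueExA"},
}

def tag_function_name(name, called_apis):
    """Prefix a default name (sub_...) with all tags matching its API calls."""
    if not name.startswith("sub_"):
        return name  # already renamed by the analyst, leave it untouched
    called = set(called_apis)
    # Sort the tags so that the resulting name is deterministic.
    tags = [tag for tag, apis in sorted(SEMANTIC_TAGS.items()) if apis & called]
    return "_".join(tags + [name])

print(tag_function_name("sub_c0ffee", ["connect", "send", "recv"]))
# prints: net_sub_c0ffee
```

A function touching several categories simply collects several prefixes, which is exactly why sorting by tag needs a custom table instead of a plain name sort.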

I read a MindShaRE blog post by Aaron Portnoy on his journey with IDA/PySide, and it was a door opener for me, as it showed me what would actually be possible by building your own GUI extensions. Around that time, I started working on the plugin but was set back when Aaron and Brandon announced Toolbag, which already in its beta seemed to be a powerful implementation extending IDA with a lot of features that come in handy.
REcon put me back on track, and now I am motivated again to pursue my plugin, as I noticed that its focus is different from theirs. The feedback from Alex also added a lot of motivation, helping me to continue.

So after the REstart, the next step was to take the basic existing script mentioned before and embed it in an optimized graphical front end, resulting in the GUI shown here:

Current state of "Function Inspection".

Having an overview of the tagged functions was just one step; showing the relevant API calls responsible for each tag was a logical consequence. Right now, I am working on extracting the parameters to these function calls. For this, some basic data flow analysis is needed, of course.

To support my point, I want to introduce you to my favorite malware sample: 92a1ad5bb921d59d5537aa45a2bde798. This is a very simple Spybot variant with timestamp of 2003, which I believe to be its true date. It's one of my standard samples used to teach RE at university. The sample is a good read and nice to study if you are new to malware analysis. Funny sidenote: it is only detected by 37/42 AVs on VirusTotal, despite having no protection, obfuscation, whatsoever.

From the 231 API calls tagged by IDAscope, the parameters to these API calls have pushes of the following type:
  • General Register -> 287
  • Immediate Value -> 263
  • Memory Reg [Base Reg + Index Reg + Displacement] -> 83
  • Direct Memory Reference to Data -> 21

This means that about 60% of the parameters can potentially be resolved via data flow analysis, providing a more interesting value than "eax" or "[0x405004]" as in the current state of development. While this is only one example, I am confident that putting the effort into data flow analysis is worth it, as it opens doors to other interesting use cases.
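
For reference, the percentage can be reproduced from the counts above. The assumption here is that everything except the immediate values counts as a candidate for data flow analysis, which is the only grouping of the four numbers that yields roughly 60%.

```python
push_types = {
    "General Register": 287,
    "Immediate Value": 263,
    "Memory Reg [Base Reg + Index Reg + Displacement]": 83,
    "Direct Memory Reference to Data": 21,
}

total = sum(push_types.values())
# Assumption: all non-immediate pushes are candidates for data flow analysis.
resolvable = total - push_types["Immediate Value"]
share = 100.0 * resolvable / total

print("%d of %d parameters (%.1f%%)" % (resolvable, total, share))
# prints: 391 of 654 parameters (59.8%)
```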

But even for the immediates there are more possibilities. Many of them can be further resolved as shown in the following example.
Think of:
push 0
push 1
push 2
call socket
a typical constellation as shown to you by IDA Pro.

By knowing the type of the parameter and the immediate value, we can directly resolve those to:
push IPPROTO_IP
push SOCK_STREAM
push AF_INET
call socket
which tells us that it is a TCP connection based on IP. While these are probably values you know by heart anyway, there are still a lot of moments where I find myself looking into MSDN in order to figure out what exactly is happening with this or that API call.
Long-term, I want to have some functionality for looking up APIs, structs and types via MSDN directly integrated into the plugin. I know that there are scripts by others that do this already, but the combination of features often leads to emergent benefits.
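
A minimal sketch of how such a lookup could work for the socket example above follows. The numeric values are the ones from the Windows SDK headers; the table layout and the name resolve_pushes are assumptions for illustration and not how IDAscope stores its definitions.

```python
# Constant values as defined in the Windows SDK headers.
AF = {2: "AF_INET", 23: "AF_INET6"}
SOCK = {1: "SOCK_STREAM", 2: "SOCK_DGRAM", 3: "SOCK_RAW"}
IPPROTO = {0: "IPPROTO_IP", 6: "IPPROTO_TCP", 17: "IPPROTO_UDP"}

# Hypothetical table: API name -> lookup table per parameter, in declaration
# order, i.e. socket(int af, int type, int protocol).
API_PARAMETERS = {"socket": [AF, SOCK, IPPROTO]}

def resolve_pushes(api_name, pushed_values):
    """Resolve immediates in push order (the last parameter is pushed first)."""
    tables = API_PARAMETERS[api_name]
    # Reverse the parameter tables so that they line up with the push order.
    return [table.get(value, hex(value))
            for table, value in zip(reversed(tables), pushed_values)]

print(resolve_pushes("socket", [0, 1, 2]))
# prints: ['IPPROTO_IP', 'SOCK_STREAM', 'AF_INET']
```

Unknown values fall back to their hexadecimal representation, so nothing gets hidden when a constant is missing from the tables.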

Another feature that is already integrated and that was shown in the tweet was the coloring of basic blocks based on the semantic type of the tag. Once you are used to the colors, this can really speed up navigation in a function using the Graph overview.

For my config, I use the following six colors:
  • yellow for memory manipulation
  • orange for file manipulation
  • red for registry manipulation
  • violet for execution manipulation
  • blue for network operations
  • green for cryptography

Right now, the highlighting is implemented in a 3-way cycle: use 6 colors, use standard color (all red), disable. Disabling is important because I noticed that you can also get to a point where you focus too hard on the colors and might miss other important spots.
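
The color scheme and the 3-way cycle could be modeled as in the sketch below. The tag keys, RGB values and function names are hypothetical placeholders; only the pairing of colors with semantic types and the order of the three modes are taken from the description above.

```python
# Hypothetical RGB values; only the tag/color pairing comes from the list above.
TAG_COLORS = {
    "memory": 0xFFFF00,     # yellow
    "file": 0xFFA500,       # orange
    "registry": 0xFF0000,   # red
    "execution": 0x8A2BE2,  # violet
    "network": 0x0000FF,    # blue
    "crypto": 0x00FF00,     # green
}
STANDARD_COLOR = 0xFF0000   # the "all red" mode
MODES = ("six_colors", "standard", "disabled")

def next_mode(current):
    """Advance to the next highlighting mode, wrapping around after the last."""
    return MODES[(MODES.index(current) + 1) % len(MODES)]

def color_for(tag, mode):
    """Return the color for a tagged basic block, or None for no highlighting."""
    if mode == "disabled":
        return None
    if mode == "standard":
        return STANDARD_COLOR
    return TAG_COLORS.get(tag)
```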

We will not commit to any kind of release date, as there are still a lot of ideas that might find their way into the first official release. However, if you are interested or want to share ideas for features, let us know and we will see what we can do.

Alex will probably blog in the next days about another aspect of functionality that will find its way into the plugin, introducing a second tab.

Stay tuned for more news on IDAscope. :)

Tuesday, July 10, 2012

PNX.TF now with blog

I decided to start this blog as a test-balloon in order to complement my recently launched site pnx.tf.
There are multiple reasons for this.

First, I feel that it will help me to publish content easier, which is something I had in mind for some time already. Each blog entry is defined by its publishing time as well as optional tags, which allows for at least two comfortable ways of exploring it.
Furthermore, a notification on updates in form of feeds comes for free with the provided infrastructure, which is really handy. You readers can directly comment on the posts, too, another plus.

I didn't have any plans on integrating all that functionality into pnx.tf because I like my main website static. But as an addition, I think it's the right decision. This means that I will keep the main site reduced to some kind of documentary file vault, a showcase of my work. It will provide the space for references I might want to include here.

That's it for the introduction. I hope that I can follow up with actual content soon.

Alex already blogged on a project that I put some effort in some time ago and published a really nice script in Python for IDA that allows automatic renaming of functions containing certain API calls.
Over the last week, I picked up development speed for the GUI-driven version Alex mentioned. It contains some bonus features, so if you like the renaming script, chances are good you will like my plugin, too.