Tuesday, December 3, 2013

AnalyzePDF - Bringing the Dirt Up to the Surface

This has been ported over to my GitHub site and is no longer being maintained here. For any issues, comments or updates head here

What is that thing they call a PDF?

The Portable Document Format (PDF) is an old format ... it was created by Adobe back in 1993 as an open standard but wasn't officially released as an open standard (ISO 32000-1) until 2008 - right @nullandnull ?  I can't take credit for the nickname that I call it today, Payload Delivery Format, but I think it's clever and applicable enough to mention.  I did a lot of painful reading through the PDF specifications in the past and if you happen to do the same I'm sure you'll also have a lot of "hm, that's interesting" thoughts as well as many "wtf, why?" thoughts.  I truly encourage you to go out and do the same... it's a great way to learn about the internals of something, what to expect and what would be abnormal.  The PDF has become a de facto standard for transferring files, presentations, whitepapers etc.

<rant> How about we stop releasing research/whitepapers about PDF 0-days/exploits via a PDF file... seems a bit backwards</rant>

We've all had those instances where you wonder if that file is malicious or benign ... do you trust the sender or was it downloaded from the Internet?   Do you open it or not?  We might be a bit more paranoid than most people when it comes to this type of thing, but since PDFs are so common they're still a reliable delivery method for malicious actors.  As the PDF contains many 'features', these features often turn into 'vulnerabilities' (Do we really need to embed an exe into our PDF? or play a SWF game?).  Good thing it doesn't contain any vulnerabilities, right? (to be fair, the sandboxed versions and other security controls these days have helped significantly)

What does a PDF consist of?

In its most basic format, a PDF consists of four components: header, body, cross-reference table (Xref) and trailer:

(sick M$ Paint skillz, I know)

If we create a simple PDF (this example only contains a single word in it) we can get a better idea of the contents we'd expect to see:
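For readers who want to reproduce a file like that, here is a minimal sketch (not the file from the original screenshot) that assembles a one-page, one-word PDF by hand so the four components are easy to spot.  The function name `build_minimal_pdf` and the object layout are my own choices for illustration; the text must not contain unescaped parentheses.

```python
def build_minimal_pdf(text: str = "Hello") -> bytes:
    """Assemble a one-page PDF by hand and return its raw bytes."""
    # Page content: draw `text` in 24pt Helvetica near the top of the page.
    stream = f"BT /F1 24 Tf 72 720 Td ({text}) Tj ET".encode("ascii")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")                      # 1. header
    offsets = []
    for number, body in enumerate(objects, start=1):    # 2. body
        offsets.append(len(out))                        # byte offset of each object
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_offset = len(out)                              # 3. cross-reference table
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"                     # entry for free object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset             # each entry is 20 bytes

    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)  # 4. trailer
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)
```

Writing the result to disk (for example `open("hello.pdf", "wb").write(build_minimal_pdf())`) gives a small file you can open in a text editor and compare against the four components described above.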

 What else is out there?

Since PDF files are so common these days there's no shortage of tools to rip them apart and analyze them.  Some of the information contained in this post and within the code I'm releasing may be an overlap of others out there but that's mainly because the results of our research produced similar results or our minds think alike...I'm not going to touch on every tool out there but there are some that are worth mentioning as I either still use them in my analysis process or some of their functionality/lack of functionality is what sparked me to write AnalyzePDF.  By mentioning the tools below my intentions aren't to downplay them and/or their ability to analyze PDF's but rather helping to show reasons I ended up doing what I did.


Didier Stevens created some of the first analysis tools in this space, which I'm sure you're already aware of.  Since they're bundled into distros like BackTrack/REMnux already they seem like good candidates to leverage for this task.  Why recreate something if it's already out there?  Like some of the other tools, it parses the file structure and presents the data to you... but it's up to you to be able to interpret that data.  Because these tools are commonly available on distros and get the job done I decided they were the best to wrap around.

Did you know that pdfid has a lot more capability/features that most aren't aware of?  If you run it with the (-h) switch you'll see some other useful options such as (-e), which displays extra information. Of particular note here is the mention of "%%EOF", "After last %%EOF", create/mod dates and the entropy calculations.  During my data gathering I encountered a few hiccups that I hadn't previously experienced.  This is expected as I was testing a large data set of who knows what kind of PDF's.  Again, I'm not noting these to put down anyone's tools but I feel it's important to be aware of what the capabilities and limitations of something are - and also in case anyone else runs into something similar so they have a reference.  Because of some of these, I am including a slightly modified version of pdfid as well.  I haven't tested if the newer version fixed anything so I'd rather give everyone the files that I know work with it.

  • I first experienced a similar error as mentioned here when using the (-e) option on a few files (e.g. - cbf76a32de0738fea7073b3d4b3f1d60).  It appears it doesn't count multiple '%%EOF's: if the '%%EOF' is the last thing in the file without a '\r' or '\n' behind it, it doesn't seem to count it.
  • I've had cases where the '/Pages' count was incorrect - there were (15) PDF's that showed '0' pages during my tests.  One way I tried to get around this was to use the (-a) option and test between the '/Page' and '/Pages' values. (e.g. - ac0487e8eae9b2323d4304eaa4a2fdfce4c94131)
  • There were times when the number of characters after the last '%%EOF' was incorrect
  • Won't flag on JavaScript if it's written like "<script contentType="application/x-javascript">" (e.g - cbf76a32de0738fea7073b3d4b3f1d60) :


Peepdf has gone through some great development over the course of me using it and definitely provides some great features to aid in your analysis process.  It has some intelligence built into it to flag on things and also allows one to decode things like JavaScript from the current shell.  Even though it has a batch/automated mode, it still feels like more of a tool that I want to use to analyze a single PDF at a time and dig deep into the file's internals.

  • Originally, this tool didn't match keywords if they had spaces after them but it was a quick and easy fix... glad this testing could help improve another user's work.


PDFStreamDumper is a great tool with many sweet features but it has its uses and limitations like all things.  It's a GUI and built for analysis on Windows systems, which is fine, but its power comes from analyzing a single PDF at a time - and again, it's still mostly a manual process.


Pdfxray was originally an online tool but Brandon created a lite version so it could be included in REMnux (used to be publicly accessible but at the time of writing this looks like that might have changed).  If you look back at some of Brandon's work historically he's also done a lot in this space, and since I encountered some issues with other tools and noticed he did as well in the past, I know he's definitely dug deep and used that knowledge for his tools.  Pdfxray_lite has the ability to query VirusTotal for the file's hash and produce a nice HTML report of the file's structure - which is great if you want to include that in an overall report, but again this requires the user to interpret the parsed data.


Pdfcop is part of the Origami framework.  There're some really cool tools within this framework but I liked the idea of analyzing a PDF file and alerting on badness.  This particular tool in the framework has that ability, however, I noticed that if it flagged on one cause then it wouldn't continue analyzing the rest of the file for other things of interest (e.g. - I've had it close the file out right away if there was an invalid Xref without looking at anything else.  This is because PDF's are read from the bottom up, meaning their Xref tables are read first in order to determine where to go next).  I can see the argument of saying why continue to analyze the file if it already was flagged bad but I feel like that's too much tunnel vision for me.  I personally prefer to know more rather than less...especially if I want to do trending/stats/analytics.

So why create something new?

While there are a wealth of PDF analysis tools these days, there was a noticeable gap of tools that have some intelligence built into them in order to help automate certain checks or alert on badness.  In fairness, some (try to) detect exploits based on keywords or flag suspicious objects based on their contents/names but that's generally the extent of it.  I use a lot of those above mentioned tools when I'm in the situation where I'm handed a file and someone wants to know if it's malicious or not... but what about when I'm not around?  What if I'm focused/dedicated to something else at the moment?  What if there's wayyyy too many files for me to manually go through each one?  Those are the kinds of questions I had to address and as a result I felt I needed to create something new.  Not necessarily write something from scratch... I mean why waste that time if I can leverage other things out there and tweak them to fit my needs?  

Thought Process

What do people typically do when trying to determine if a PDF file is benign or malicious?  Maybe scan it with A/V and hope something triggers, run it through a sandbox and hope the right conditions are met to trigger or take them one at a time through one of the above mentioned tools?  They're all fine work flows but what if you discover something unique or come across it enough times to create a signature/rule out of it so you can trigger on it in the future?  We tend to have a lot to remember, so doing one-off analyses may result in us forgetting something that we previously discovered.  Additionally, this doesn't scale too well in the sense that everyone on your team might not have the same knowledge that you do... so we need some consistency/intelligence built in to try and compensate for these things.

 I felt it was better to use the characteristics of a malicious file (either known or observed from combinations within malicious files) to evaluate what would indicate a malicious file, instead of just adding points for every questionable attribute observed. e.g. - instead of adding a point for being a one page PDF, make a condition to say if you see an invalid xref and a one page PDF then give it a score of X.  This makes the conditions more accurate in my eyes; since, for example:
  1. A single paged PDF by itself isn't malicious but if it also contains other things of question then it should have a heavier weight of being malicious.  
  2. Another example is JavaScript within a PDF.  While statistics show JavaScript within a PDF are a high indicator that it's malicious, there're still legitimate reasons for JavaScript to be within a PDF (e.g. - to calculate a purchase order form or verify that you correctly entered all the required information the PDF requires).
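To make the idea of scoring on combinations a bit more concrete, here is a minimal sketch.  It is not the logic that ships in AnalyzePDF; the attribute names (`pages`, `js`, `open_action`, `xref_error`) and the point values are hypothetical and only illustrate how a condition fires when several attributes line up.

```python
def score_pdf(attrs: dict) -> int:
    """Score a PDF from combinations of attributes rather than single hits."""
    # Each rule is (description, predicate, points). All values are
    # hypothetical and would need tuning against a real data set.
    rules = [
        ("invalid xref on a one-page PDF",
         lambda a: a.get("xref_error") and a.get("pages") == 1, 5),
        ("JavaScript together with an automatic action",
         lambda a: a.get("js", 0) > 0 and a.get("open_action", 0) > 0, 4),
        ("one page with JavaScript",
         lambda a: a.get("pages") == 1 and a.get("js", 0) > 0, 2),
    ]
    # Every rule is evaluated, so several smaller findings can add up.
    return sum(points for _, test, points in rules if test(attrs))
```

With these rules a one-page PDF on its own scores 0, and so does a multi-page PDF that merely contains JavaScript; the score only climbs once the attributes start to combine.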

Gathering Stats

At the time I was performing my PDF research and determining how I wanted to tackle this task I wasn't really aware of machine learning.  I feel this would be a better path to take in the future but the way I gathered my stats/data was in a similar (less automated/cool AI) way.  There's no shortage of PDF's out there which is good for us as it can help us to determine what's normal, malicious, or questionable and leverage that intelligence within a tool.

If you need some PDF's to gather some stats on, contagio has a pretty big bundle to help get you started.  Another resource is Govdocs from Digital Corpora ... or a simple Google dork.

Note : Spidering/downloading these will give you files but they still need to be classified as good/bad for initial testing.  Be aware that you're going to come across files that someone may have marked as good but that actually show signs of badness... always interesting to detect these types of things during testing!

Stat Gathering Process

So now that I have a large set of files, what do I do now?  I can't just rely on their file extensions or someone else saying they're malicious or benign so how about something like this:
  1. Verify it's a PDF file.  
    • When reading through the PDF specs I noticed that the PDF header can be within the first 1024 bytes of the file as stated in "3.4.1, 'File Header' of Appendix H - 'Acrobat viewers require only that the header appear somewhere within the first 1024 bytes of the file.'"... that's a long way down compared to the traditional header which is usually right at the beginning of a file.  So what's that mean for us?  Well if we rely solely on something like file or TRiD they _might_ not properly identify/classify a PDF that has the header that far into the file as most only look within the first 8 bytes (unfair example is from corkami).  We can compensate for this within our code/create a YARA rule etc.... you don't believe me you say?  Fair enough, I don't believe things unless I try them myself either:
    The file to the left is properly identified as a PDF file but when I created a copy of it and modified it so the header was a bit lower, the tools failed.  The PDF on the right is still in accordance with the PDF specs and PDF viewers will still open it (as shown)... so this needs to be taken into consideration.

  2. Get rid of duplicates (based on SHA256 hash) for both files in the same category (clean vs. dirty) then again via the entire data set afterwards to make sure there're no duplicates between the clean and dirty sets.
  3. Run pdfid & pdfinfo over the file to parse out their data.  

    • These two are already included in REMnux so I leveraged them. You can modify them to other tools but this made it flexible for me and I knew the tool would work when run on this distro; pdfinfo parsed some of the data better during tests so getting the best of both of them seemed like the best approach.

  4. Run scans for low hanging fruit/known badness with local A/V||YARA
  5. Now that we have a more accurately classified data set:
    • Are all PDFs classified as benign really benign?
    • Are all PDFs classified as malicious really malicious? 
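The de-duplication in step 2 can be sketched roughly as follows.  This is an illustration rather than the script I actually ran; the helper names (`sha256_of`, `unique_by_hash`, `cross_set_duplicates`) and the one-directory-per-class layout are assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large samples don't need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unique_by_hash(directory: Path) -> dict:
    """Map SHA256 -> first file seen with that hash under `directory`."""
    seen = {}
    for path in sorted(directory.rglob("*")):
        if path.is_file():
            seen.setdefault(sha256_of(path), path)
    return seen


def cross_set_duplicates(clean: dict, dirty: dict) -> set:
    """Hashes that show up in both the clean and the dirty set."""
    return clean.keys() & dirty.keys()
```

Running `unique_by_hash` once per class handles the first pass, and `cross_set_duplicates` covers the second pass across the entire data set.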


Files analyzed (no duplicates found between clean & dirty):

Class   Type       Count
Dirty   Pre-Dup    22,342
Dirty   Post-Dup   11,147
Clean   Pre-Dup     2,530
Clean   Post-Dup    2,529
Total Files Analyzed: 13,676

I've collected more than enough data to put together a paper or presentation but I feel that's been played out already so if you want more than what's outlined here just ping me.  Instead of dragging this post on for a while showing each and every stat that was pulled I feel it might be more useful to show a high level comparison of what was detected the most in each set and some anomalies.


  • None of the clean files had incorrect file headers/versions
  • There wasn't a single keyword/attribute parsed from the clean files that covered more than 4.55% of its entire data set class.  This helps show the uniqueness of these files vs. malicious actors reusing things.
  • The dates within the clean files were generally unique while the date fields on the dirty files were more clustered together - again, reuse?
  • None of the values for the keywords/attributes of the clean files were flagged as trying to be obfuscated by pdfid
  • Clean files never had '/Colors > 2^24' above 0 while some dirty files did 
  • Rarely did a clean file have a high count of JavaScript in it while dirty files ranged from 5-149 occurrences per file
  • '/JBIG2Decode' was never above '0' in any clean file
  • '/Launch' wasn't used much in either of the data sets but still more common in the dirty ones
  • Dirty files have far more characters after the last %%EOF (starting from 300+ characters is a good check)
  • Single page PDF's have a higher likelihood of being malicious - no duh
  • '/OpenAction' is far more common in malicious files
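The '%%EOF' observation above is easy to check directly against the raw bytes.  Here's a minimal sketch (not pdfid's implementation); the function name `eof_stats` is made up, and treating a line ending right after the marker as "nothing" is my own assumption.

```python
def eof_stats(data: bytes):
    """Return (number of %%EOF markers, bytes trailing the last one).

    The second value is None when the file has no %%EOF at all.  Line
    endings directly after the marker are not counted as trailing data.
    """
    marker = b"%%EOF"
    count = data.count(marker)
    if count == 0:
        return 0, None
    tail = data[data.rfind(marker) + len(marker):]
    return count, len(tail.lstrip(b"\r\n"))
```

A file could then be flagged when the second value crosses whatever threshold you settle on, such as the 300+ characters mentioned above.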

YARA signatures

I've also included some PDF YARA rules that I've created as a separate file so you can use those to get started.  YARA isn't really required but I'm making it that way for the time being because it's helpful... so I have the default rules location pointing to REMnux's copy of MACB's rules unless otherwise specified.

Clean data set:

Dirty data set:

Signatures that triggered across both data sets:

Cool... so we know we have some rules that work well and others that might need adjusting, but they still help!

What to look for

So we have some data to go off of... what are some additional things we can take away from all of this and incorporate into our analysis tool so we don't forget about them and/or stop repetitive steps?

  1. Header
    • In addition to checking whether the header sits past the first 8 bytes, I found it useful to look at the specific version within the header.  This should normally look like "%PDF-M.N." where M.N is the Major/Minor version .. however, the above mentioned 'low header' needs to be looked for as well.

      Knowing this we can look for invalid PDF version numbers or, digging deeper, we can correlate the PDF's features/elements to the version number and flag on mismatches. Here're some examples of what I mean, and more reasons why reading those dry specs is useful:
      • If FlateDecode was introduced in v1.2 then it shouldn't be in any version below
      • If JavaScript and EmbeddedFiles were introduced in v1.3 then they shouldn't be in any version below
      • If JBIG2 was introduced in v1.4 then it shouldn't be in any version below
  2. Body
    • This is where all of the data is (supposed to be) stored; objects (strings, names, streams, images etc.).  So what kinds of semi-intelligent things can we do here?
      • Look for object/stream mismatches.  e.g - Indirect Objects must be represented by 'obj' and 'endobj' so if the number of 'obj' is different than the number of  'endobj' mentions then it might be something of interest
      • Are there any questionable features/elements within the PDF? 
      • JavaScript doesn't immediately make the file malicious as mentioned earlier, however, it's found in ~90% of malicious PDF's based on others' and my own research.
      • '/RichMedia'  - indicates the use of Flash (could be leveraged for heap sprays)
      • '/AA', '/OpenAction', '/AcroForm' - indicate that an automatic action is to be performed (often used to execute JavaScript)
      • '/JBIG2Decode', '/Colors' - could indicate the use of vulnerable filters; Based on the data above maybe we should look for colors with a value greater than 2^24
      • '/Launch', '/URL', '/Action', '/F', '/GoToE', '/GoToR' - opening external programs, places to visit and redirection games
      • Obfuscation
      • Multiple filters ('/FlateDecode', '/ASCIIHexDecode', '/ASCII85Decode', '/LZWDecode', '/RunLengthDecode')
      •  The streams within a PDF file may have filters applied to them (usually for compressing/encoding the data).  While this is common, it's not common within benign PDF files to have multiple filters applied.  This behavior is commonly associated with malicious files to try and thwart A/V detection by making them work harder.
      • Separating code over multiple objects
      • Placing code in places it shouldn't be (e.g. - Author, Keywords etc.)
      • White space randomization
      • Comment randomization
      • Variable name randomization
      • String randomization
      • Function name randomization
      • Integer obfuscation
      • Block randomization
      • Any suspicious keywords that could mean something malicious when seen with others?
      •  eval, array, String.fromCharCode, getAnnots, getPageNumWords, getPageNthWords, this.info, unescape, %u9090
  3. Xref
    • The first object has an ID 0 and always contains one entry with generation number 65535. This is at the head of the list of free objects (note the letter ‘f’ that means free). The last object in the cross reference table uses the generation number 0.

    Translation please?  Take a look at the following Xref:
    Knowing how it's supposed to look we can search for Xrefs that don't adhere to this structure.
  4. Trailer
      • Provides the offset of the Xref (startxref)
      • Contains the EOF, which is supposed to be a single line with "%%EOF" to mark the end of the trailer/document.  Each trailer will be terminated by these characters and, in a file that has been updated, should also contain the '/Prev' entry which will point to the previous Xref.
      • Any updates to the PDF usually result in appending additional elements to the end of the file

        This makes it pretty easy to determine PDF's with multiple updates or additional characters after what's supposed to be the EOF
  5. Misc.
      • Creation dates (both format and if a particular one is known to be used)
      • Title
      • Author
      • Producer
      • Creator
      • Page count
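As one example from the list above, the header/version check can be sketched like this.  The version table only contains the examples named under "Header", and the function name `version_mismatches` is hypothetical.  Note that a raw byte search like this will miss names that are hex-obfuscated or tucked inside compressed streams, so treat it as a first pass only.

```python
import re

# Minimum PDF version per feature, taken from the examples in the list above.
MIN_VERSION = {
    b"/FlateDecode": (1, 2),
    b"/JavaScript": (1, 3),
    b"/EmbeddedFiles": (1, 3),
    b"/JBIG2Decode": (1, 4),
}
HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def version_mismatches(data: bytes) -> list:
    """List features present in a file whose header claims an older version."""
    match = HEADER.search(data[:1024])
    if match is None:
        return ["no valid %PDF-M.N header in the first 1024 bytes"]
    version = (int(match.group(1)), int(match.group(2)))
    return [
        f"{name.decode()} needs {need[0]}.{need[1]}, "
        f"header says {version[0]}.{version[1]}"
        for name, need in MIN_VERSION.items()
        if version < need and name in data
    ]
```

An empty list means no mismatch was spotted; anything else is worth holding on to as one more attribute for the scoring stage.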

    The Code

    So what now?  We have plenty of data to go on - some previously known, but some extremely new and helpful.  It's one thing to know that most files with JavaScript or that are (1) page have a higher tendency of being malicious... but what about some of the other characteristics of these files?  By themselves, a single keyword/attribute might not stick out that much but what happens when you start to combine them together?  Welp, hang on because we're going to put this all together.

    File Identification

    In order to account for the header issue, I decided the tool itself would look within the first 1024 bytes instead of relying on other file identification tools:
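A minimal sketch of that kind of check is below.  It illustrates the idea rather than reproducing the exact code in AnalyzePDF; `header_offset` and `is_pdf` are names I'm using for the example.

```python
def header_offset(data: bytes, window: int = 1024) -> int:
    """Offset of '%PDF-' within the first `window` bytes, or -1 if absent."""
    return data[:window].find(b"%PDF-")


def is_pdf(path: str) -> bool:
    """True if the file carries a PDF header anywhere in its first 1024 bytes."""
    with open(path, "rb") as handle:
        return header_offset(handle.read(1024)) >= 0
```

Returning the offset rather than just True/False is handy because an offset greater than 0 is itself worth noting: the file is still a PDF, but the header isn't where most tools expect it.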

    Another way, so this could be detected whether this tool was used or not, was to create a YARA rule such as:

    Wrap pdfinfo

    Through my testing I found this tool to be more reliable in some areas as opposed to pdfid such as:

    • Determining if there're any Xref errors produced when trying to read the PDF
    • Look for any unterminated hex strings etc.
    • Detecting EOF errors

    Wrap pdfid

    • Read the header.  *pdfid will show exactly what's there and not try to convert it*
    • _attempt_ to determine the number of pages
    • Look for object/stream mismatches
    • Not only look for JavaScript but also determine if there's an abnormally high amount
    • Look for other suspicious/commonly used elements for malicious purposes (AcroForm, OpenAction, AdditionalAction, Launch, Embedded files etc.)
    • Look for data after EOF
    • Calculate a few different entropy scores
    Next, perform some automagical checks and hold on to the results for later calculations.
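To give a feel for what wrapping pdfid involves, here's a rough sketch that parses its plain-text report into counts.  The report layout it expects (a "PDF Header:" line followed by "name   count" lines, with an optional second number in parentheses for obfuscated occurrences) is my assumption of what the default output looks like, so adjust it to whatever your copy prints.  `parse_pdfid` and `structural_flags` are hypothetical names.

```python
import re

# "name   count" with an optional "(n)" suffix; the name may contain spaces.
COUNT_LINE = re.compile(r"^\s*(.+?)\s+(\d+)(?:\((\d+)\))?\s*$")


def parse_pdfid(report: str) -> dict:
    """Turn a pdfid text report into a header string plus keyword counts."""
    result = {"header": None, "counts": {}, "obfuscated": {}}
    for line in report.splitlines():
        stripped = line.strip()
        if stripped.startswith("PDFiD"):            # banner line, skip it
            continue
        if stripped.startswith("PDF Header:"):
            result["header"] = stripped.split(":", 1)[1].strip()
            continue
        match = COUNT_LINE.match(line)
        if match:
            name, count, obfuscated = match.groups()
            result["counts"][name] = int(count)
            if obfuscated:
                result["obfuscated"][name] = int(obfuscated)
    return result


def structural_flags(counts: dict) -> list:
    """Flag object/stream mismatches from the parsed counts."""
    flags = []
    if counts.get("obj") != counts.get("endobj"):
        flags.append("obj/endobj mismatch")
    if counts.get("stream") != counts.get("endstream"):
        flags.append("stream/endstream mismatch")
    return flags
```

The report text itself would come from running pdfid over the file (for example via `subprocess`) and capturing its standard output; the results then get held on to for the scoring stage.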

    Scan with YARA

    While there are some pre-populated conditions that score a ranking built into the tool already, the ability to add/modify your own is extremely easy.  Additionally, since I'm a big fan of YARA I incorporated it into this as well.  There're many benefits of this such as being able to write a rule for header evasion, version number mismatching to elements or even flagging on known malicious authors or producers.  The biggest strength, however, is the ability to add a 'weight' field in the meta section of the YARA rules.  What this does is allow the user to determine how good of a rule it is and, if the rule triggers on the PDF, hold on to its weighted value and incorporate it later in the overall calculation process, which might increase its maliciousness score.  Here's what the YARA parsing looks like when checking the meta field:

    And here's another YARA rule with that section highlighted for those who aren't sure what I'm talking about:

    If the (-m) option is supplied then if _any_ YARA rule triggers on the PDF file it will be moved to another directory of your choosing.  This is important to note because one of your rules may hit on the file but it may not be displayed in the output, especially if it doesn't have a weight field.

    Once the analysis has completed the calculation process starts.  This is two phase -

    1. Anything noted from pdfinfo and pdfid is evaluated against some pre-determined combinations I configured.  These are easy enough to modify as needed but they've been very reliable in my testing...but hey, things change!  Instead of moving on once one of the combination sets is met I allow the scoring to go through each one and add the additional points to the overall score, if warranted.  This allows several 'smaller' things to bundle up into something of interest rather than passing them up individually.
    2. Any YARA rule that triggered on the PDF file has its weighted value parsed from the rule and added to the overall score.  This helps bump up a file's score or immediately flag it as suspicious if you have a rule you really want to alert on.
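The second phase can be sketched as below.  This is an illustration, not the tool's actual code: it assumes the match objects expose a `meta` dictionary the way yara-python's match objects do, and the function names are hypothetical.  Rules without a 'weight' field (or with a value that isn't a number) simply contribute nothing.

```python
def yara_weight(matches) -> int:
    """Sum the 'weight' meta values of every rule that matched."""
    total = 0
    for match in matches:
        meta = getattr(match, "meta", None) or {}
        try:
            total += int(meta.get("weight", 0))
        except (TypeError, ValueError):
            pass  # a non-numeric weight is ignored rather than fatal
    return total


def total_score(combination_score: int, matches) -> int:
    """Phase 1 (combination checks) plus phase 2 (weighted YARA hits)."""
    return combination_score + yara_weight(matches)
```

Keeping the two phases separate makes it easy to see afterwards whether a file scored high because of its structure, because of a heavily weighted rule, or both.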

    So what's it look like in action?  Here's a picture I tweeted a little while back of it analyzing a PDF exploiting CVE-2013-0640 :


    I've had this code for quite a while and haven't gotten around to writing up a post to release it with but after reading a former coworker's blog post last night I realized it was time to just write something up and get this out there as there are still people asking for something that employs some of the capabilities (e.g. - weight ranking).  Is this 100% right all the time? No... let's be real.  I've come across situations where a file that was benign was flagged as malicious based on its characteristics and that's going to happen from time to time.  Not all PDF creators adhere to the required specifications and some users think it's fun to embed or add things to PDF's when it's not necessary.  What this helps to do is give a higher ranking to files that require closer attention or help someone determine if they should open a file right away vs. send it to someone else for analysis (e.g. - deploy something like this on a web server somewhere and let the user upload their questionable file to it and get back a "yes it's ok -or- no, sending it for analysis").

    AnalyzePDF can be downloaded on my github

    Further Reading

    Monday, November 11, 2013

    OMFW & OSDFC recap

    This has been ported over to my GitHub site and is no longer being maintained here. For any issues, comments or updates head here.

    General Notes

    I attended both the Open Memory Forensics Workshop (OMFW) and the Open Source Digital Forensics Conference (OSDFC) for the first time last year and just like I said last year - they're both set as recurring events on my calendar now.  I was told that my tweets and recap post of last year's activities were helpful to those who couldn't attend so I figured I'd write up something again since I took notes anyway.  I really like that both conferences have ~30-40 minute talks so you're not stuck listening to anyone ramble about anything and you also get the benefit of getting more presentations.  If you haven't been able to make either of these yet or are still debating if you should attend - go for it.  They're both 1 day each (well, if you just go to the presentations) and I have yet to be let down by the overall quality of presentations and, better yet, the networking that you're able to do at them.

    Best Quotes of the Cons

    • "They can tunnel faster than you can image" - @williballenthin
    • "Brian Carrier just virtually twerked the audience"- @bbaskin
    • "What one man can invent,  another man can discover" - Sherlock Holmes (on someone's t-shirt)

    Disclaimer - I didn't make it to every talk at OSDFC so if I don't have notes on it, sorry.  Also - these are notes that I jotted down so if something is wrong or there are slides uploaded for ones I didn't link please contact me so I can update the post.


    The first thing I want to say about this conference was how glad I was that it was at the same venue as OSDFC this year - this makes it really convenient for those attending both so hopefully it stays that way next year {nudge @volatility}.

    The State of Volatility

    Presenter : AAron Walters
    Notes : Went over where Volatility currently stands, major updates/changes and what's on their roadmap.
    Highlights :
    • The Volatility Foundation has officially become a 501(c)(3)
    • Version 2.3.1 of Volatility is officially released and includes full Mac support, Android/ARM support, new address spaces and new/updated plugins.
    • AAron also touched on a new plugin he created, dumpfiles, which is extremely useful as it reconstructs files from the Windows cache manager and share section objects. 

    Stabilizing Volatility

    Presenter : Mike Auty
    Notes : Went over a lot of the questions that need to be addressed/answered moving forward with the framework and discussed some of the code layout/structure that needs to be modified
    Highlights :
    • Version 2.4 of Volatility is pretty much done already but the real focus is version 3 of Volatility.
    • The big thing I took away here is that it will be written in Python v3... so I guess it's time to start writing in it too :/

    Mastering Truecrypt and Windows 8/Server 2012 Memory Forensics

    Presenter : MHL
    Notes : MHL talked on the research he's recently done regarding Truecrypt and the support that Volatility now has in order to help recover Truecrypt keys in memory.  His slides go into more detail about the structure of Truecrypt'ed data and where to look for it etc. so hopefully those will pop-up online as there was some good information on them.
    Highlights :
    • The older versions aren't currently supported but that doesn't mean they can't be, just that most people are probably using the newer versions anyway so why waste time on it?
    • Did I mention Volatility can analyze Windows 8 and Server 2012 dumps?  The true beauty of open source showed here... just after new releases came to the market there were Volatility profiles to analyze them.  This is pure awesomeness because it means you don't have to wait for a vendor to implement it into a new release of the tool you're using... you can go home and analyze it today!
    • Two new plugins were mentioned, and are said to be committed in v2.4: Truecryptpassphrase and Truecryptsummary

    All Your Social Media are Belong to Volatility

    Presenter : Jeff Bryner
    Notes : Gave a presentation about the recent plugins he contributed to Volatility regarding extracting social media artifacts within memory.  Jeff's only scratched the surface of this and hopefully he or someone else can also take a look at the other social media sites he hasn't yet gotten around to - except MySpace... no one uses that anymore, honestly.
    Highlights :
    • The first thing about his presentation that caught my eyes was his slide deck.  After digging a little into his source code I saw it was all being done with reveal.js - cool thing to bookmark and also gives you the ability to say "my slides are online right now" so people don't have to bug you about where to find them.
    • After watching Jeff demo his plugins some discussions started to spark.  When you visit these social media pages you get a huge JSON file returned and while you may not realize it - there're some real gems in there.  You have the possibility of determining who a user's friends are, what they 'like'/'favorite', what they've viewed etc.  This can be significant if you need to say they've communicated with someone or viewed something they're denying.

    "All the things you think only exist in movies and sci-fi books" 

    ...OK, I made up the title because I don't remember what it was... but I think this one is fitting anyway
    Presenter : George M. Garner Jr.
    Notes : This talk wasn't listed on the schedule but this made up title is right on point.  George seems to either have a presentation that is extremely technical and will make you feel dumb on several occasions or he'll talk about things that some think only happen in the movies... the latter in this instance. Most of his content was just speaking so unfortunately I don't think having his slides would be of more use.
    Highlights :
    • scary, fun, exciting
    • George went into detail about engagements he's been on where there was malware in the BIOS and optical drives.... and of course the recent buzz around 'airgapped' malware wasn't left out.  The only difference... this wasn't fabricated in a Hollywood studio.  This got me thinking, as I'm sure it did many others in attendance and reading this... how the hell do you even detect these types of things?  I know for sure I'm not looking for this type of malware in my routine investigations but I guess if there's some suspicion then this type of deeper analysis could be started.

    Memory, Volatility and the Threat Intel life Cycle

    Presenters : Steven Adair and Sean Koessel
    Notes : While this was probably the least technical presentation of the conference, it still added value.  I enjoy hearing about what others have faced while in this field, what worked, what didn't work etc.
    Highlights :

    • While I'm sure some of those reading this post already do similar things within their analysis process, I figured it would still be worth mentioning a good tactic they covered - making YARA signatures for all archive utilities, Microsoft tools (e.g. - net, copy, xcopy, ftp, psexec, sticky keys etc.).  Useful for many things but in this talk they mentioned leveraging these rules with yarascan to run across memory dumps.
    • They also discussed some of the things they've encountered during engagements and some of the things they've needed to recommend to customers.  I feel these are worth pointing out here because I may or may not also come across a lot of these too often and feel they need to be changed as well : 2 factor authentication, flat networks, ability to change all passwords and ability to perform DNS sink-holing.

    Dalvik Memory Analysis and a Call to ARMs

    Presenter : Joe Sylve
    Notes : Joe touched on some of the work he's been doing to add ARM support to Volatility, went over the tool 'Dalvik Inspector' and put out a call for people who are interested in this space to help out as there's still a lot to be tackled/uncovered.
    Highlights :
    • The tool referenced above may or may not sound familiar to you... but in case it doesn't or you forgot where you heard it from, check out the related blog post for it. The tool looks pretty slick and the auto creation of Volatility plugins will surely help others during their Android investigations.  I didn't hear an exact date on its release but it's supposed to be soon so be on the lookout!

    Bringing Mac Memory Forensics to the Mainstream

    Presenter : Andrew Case
    Notes : One of the big things with the latest Volatility release was the Mac support.  Some of the Mac support/plugins have been around for a bit but if you look now you'll see the number of plugins specifically for Mac is over 30!
    Highlights :
    • There are some Mac profiles in a .zip file on the wiki - don't copy them all to the Volatility directory or upon execution it will load each of them and slow things down.  Only copy the one that's applicable.
    • launchd shouldn't be a child process
    • lsmod may show modules with a size of 0 that aren't found on disk - that doesn't mean it's malware
    • slide 4 on Mac userland rootkits shows how to detect them with plugins (these slides would help, nudge @attrc).
    • 10.9.x of Mac compresses free pages so running strings over a dump won't show anything

    Memoirs of a Hindsight Hero: Detecting Rootkits in OS X

    Presenter : Cem Gurkok
    Notes : Don't try and write a book about Mac rootkits or Cem will make it his hobby to disprove your data before you get to publish it
    Highlights :
    • There was some really good information here showing how to detect every new method some authors were saying couldn't be detected, but I think the slides would explain it better.

    Every Step You Take: Profiling the System

    Presenter : Jamie Levy
    Notes : I always tend to find the stuff Jamie talks on to be the most relevant to my daily operations.  Last year she talked on MBR/MFT stuff and this year she showed off some plugins related to profiling/intelligence.
    Highlights :
    • Jamie touched on a plugin she created a little bit ago, CybOX, which checks for threat indicators in memory samples
    • There was also mention of profiling memory dumps.  I can't specifically recall if there was a plugin called 'profiler' but nevertheless, it was sweet.  Think about generating profiles of memory dumps so you can detect either good stuff or malicious stuff.  In one way of thinking, you can create your golden profile - a baseline of a clean system so you can diff that against another memory dump and see what's different. This can help detect new software, processes etc.  Another thought is creating a memory dump while the system is infected and then using some of those artifacts to later determine if they exist in another memory dump.  This is something that can scale and I'm really excited to start playing with it.
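
The golden profile idea boils down to a set difference.  Here's a minimal sketch of it, assuming you've already pulled a list of artifacts (process names in this case) out of each memory dump by whatever means you like.  The function name and the sample process lists are made up for illustration; this is not Jamie's plugin.

```python
def diff_profiles(baseline, sample):
    """Compare two artifact lists and report what was added or went missing."""
    baseline, sample = set(baseline), set(sample)
    added = sorted(sample - baseline)      # present in the sample only
    missing = sorted(baseline - sample)    # present in the baseline only
    return added, missing


# Hypothetical process names from a clean baseline and a suspect system.
golden = ["explorer.exe", "lsass.exe", "services.exe", "svchost.exe"]
suspect = ["explorer.exe", "lsass.exe", "services.exe", "svchost.exe", "svch0st.exe"]

new_items, missing_items = diff_profiles(golden, suspect)
print("new since baseline:", new_items)
print("missing from sample:", missing_items)
```

The same comparison works the other way around: feed it artifacts from a known infected dump and intersect instead of subtract to see which of them show up elsewhere.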

    Honorable Mention

    • ethscan - This plugin was a runner up in Volatility's plugin contest but it's definitely something I can start to leverage on engagements right away.  I'm not sure how the author's blog post managed to slip through the cracks but it's linked above so give it a look.  I like the fact that it will work for any OS's memory dump and can utilize dpkt to save the network traffic to a PCAP file.


    First... I'm glad the official conference page had a Twitter hashtag to use this year but I still ran into the same issue as last year - people using a variety of hashtags... stick to the default! One of the first observations this year was that it appeared the attendance was double that of last year.  Additionally, I noticed there were a lot of younger attendees this year so it's great to see them getting involved and starting to network.  On the disappointing side - I did feel like I was seeing a noticeable amount of people doing the same things as others have already done.  I know it's useful from a learning perspective to do things yourself but why spend so much time re-doing something that's already out there to use?

    Forensics Visualizations with Open Source Tools

    Presenter : Simson Garfinkel
    Notes : Simson has spoken at every OSDFC, he hates pie graphs and likes PDFs
    Highlights :
    • No seriously, doesn't like pie graphs... and when he rotated them around it kind of made sense.  When you rotate the pie chart the focus of what you're trying to show changes.  He referenced another presentation, Save the Pies for Dessert, that's worth a read.
    • Simson brought up an interesting point that some graphing tools (graphviz etc.) will produce different graphs when run more than once.  This happens when randomized algorithms are being used and the seed keeps changing when producing the graphs.  Not good when you need things to be repeatable by others.
    • Have you ever visualized network traffic?  It can certainly be helpful... what about creating some stats/reports?  Sometimes looking at graphs instead of lines within Wireshark can help show things you might have otherwise missed.  A quick, high-level overview can be generated with 'netviz' (slide 46).  It's currently within tcpflow and creates some histograms for you.
    • He made some valid points for PDFs having a high resolution that could be zoomed in on and also they have the ability to be text searchable

    Autopsy 3: Extensible Desktop Forensics

    Presenter : Brian Carrier
    Notes : Brian twerked it
    Highlights :
    • Brian had a great transition in his slides to incorporate a Miley Cyrus picture (related quote later in this post)
    • The keyword searches within Autopsy refresh every 5 mins by default
    • The searches for specific locations (e.g. - user's folders) are prioritized so their results show first.  (can this be modified??)
    • The video triage module does periodic screenshots of the video so you don't have to sit there and watch the entire thing to see if it changes at any point
    • The text gisting module helps translate text into English
    • Future things - will use SQLite for hash DB, carve using Scalpel and will have Mac and *nix installers

    "Challenge Results" - Autopsy Module Contest

    Notes : I was surprised there were only two submissions to this contest and just as surprised that both of them were more on the complex side of things.  Someone could have just created a module to periodically show a cat picture and won some dinero.  Of the two submissions, one was a remote submission and only had a video to show it off.  It looked useful, but just didn't cut it - Willi B took the gold.
    Highlights :
    • Willi B - Registry Module; wrote an entire library in Java
    • someone else - Fuzzy Hash Module

    A Tool for Answering the Question: What Changed on Disk?

    Presenter : Stuart Maclean
    Notes : Tool to do some diffing (waiting for github for code)
    Highlights :
    • Armour - shell program to compare TSK bodyfiles
    • slide 15 - cmd's
    • slide 19 - cuckoo report
    • Not just used for VM diffing, slide 21 - can do physical machine disk diffing w/ external drive and *nix live cd
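
Since the code wasn't out yet, here's a rough sketch of what diffing two bodyfiles can look like.  This is not how Armour does it; it just assumes the pipe-delimited TSK 3.x body format (MD5|name|inode|mode|UID|GID|size|atime|mtime|ctime|crtime) and the function names and sample entries are made up.  File names containing a '|' would trip it up.

```python
def load_bodyfile(lines):
    """Map each file name to its (size, atime, mtime, ctime, crtime) fields."""
    entries = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("|")
        if len(fields) != 11:
            continue  # skip anything that doesn't match the assumed layout
        entries[fields[1]] = (fields[6],) + tuple(fields[7:11])
    return entries


def diff_bodyfiles(before, after):
    """Return (added, removed, changed) file names between two bodyfiles."""
    old, new = load_bodyfile(before), load_bodyfile(after)
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(n for n in set(old) & set(new) if old[n] != new[n])
    return added, removed, changed


# Made-up entries standing in for two fls-style bodyfiles.
before = [
    "0|/Windows/System32/sethc.exe|100-128-1|r/rrwxrwxrwx|0|0|272384|1370000000|1370000000|1370000000|1370000000",
    "0|/Users/a/notes.txt|200-128-1|r/rrwxrwxrwx|0|0|10|1370000001|1370000001|1370000001|1370000001",
]
after = [
    "0|/Windows/System32/sethc.exe|100-128-1|r/rrwxrwxrwx|0|0|345088|1370000000|1372000000|1372000000|1370000000",
    "0|/Temp/new.exe|300-128-1|r/rrwxrwxrwx|0|0|4096|1372000000|1372000000|1372000000|1372000000",
]

added, removed, changed = diff_bodyfiles(before, after)
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```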

    Bulk_Extract Like a Boss

    Presenter : Jon Stewart
    Notes : Lightgrep FTW
    Highlights :
    • Unicode shout out to U+1F4A9 (you know you want to look this up now)
    • With lightgrep incorporated into bulk_extractor, if you disable the normal 'find' scanner (-x find) you'll have blazingly fast searches - slide 11
    • Bulk_extractor contains recursive scanners to extract files then scan them (defaults to recursing 7 times to make sure you don't fall into zip bombs)
    • There's a couple of new scanners - xor and hiberfil
    • "useful options" - slide 8
    • paper in last year's DFRWS on its unicode support
    • Lightgrep is incorporated into the Windows installer and the source can be downloaded and installed yourself for other flavors

    Making Molehills Out of Mountains: Data Reduction Using Sleuth Kit Tools

    Presenter : Tobin Craig
    Notes : The speaker saw a gap and tackled it but I do think some of it is repetitive to what's already out there.
    Highlights :
    • Built to work on DEFT v8
    • Created a bash script that leverages TSK
    • limitations: limited to FAT/NTFS partitions and relies on file extensions to determine file types

    Doing More with Less: Triaging Compromised Systems with Constrained Resources

    Presenter : Willi Ballenthin
    Notes : Willi showed that you don't always need to have the entire disk in order to answer the key questions to your investigation.  He also let us into his analysis process and a peek into all of the sweet things he's written.
    Highlights :
    • 'pareto principle' - get 20% of artifacts to answer 80% of the questions
    • The key data to grab is generally the $MFT, Registry files and Event Logs (others, depending on the questions you need to answer, could be memory, Internet history etc.)
    • These key files compress extremely well and generally end up being under 100MB

    • list_mft.py - creates timeline and can also pull resident INDX records
    • MFTView.py - pulls resident data if it's there in the 'Data' tab and, if not, tells you what sectors to pull from disk to get its contents ; the right pane shows Unicode/ASCII strings so you can see remnants of what was previously there
    • get_file_info.py - CLI that's scriptable and creates a mini timeline
    • reg_view.py - R/O GUI registry viewer
    • findkey.py - search keys/values/paths etc. to feed it keywords to search for
    • timeline.py - create timeline from key modification time stamps
    • forensicating.py - some functions I put together to show how to utilize this library for forensics (got a sweet shout out for it, w00t w00t... now your turn)
    • full documentation can be found : here.
    • Lfle.py - carve for records
    • Willi also mentioned a GUI Event Log Viewer which has the ability to index records for easier searching and puts the event IDs in categories/sub-categories that are sortable.  This is something I had talked to a few about over the years and I'm really glad to see someone finally doing it, thanks Willi!  This currently isn't publicly released yet but be on the lookout.
    • full documentation can be found : here.

    Computer Forensic Triage using Manta Ray

    Presenter : Doug Koster & Kevin Murphy
    Notes : "Automated Triage" - looks to be the same thing as Tapeworm was.  It looks like there are still some things that need to be ironed out/finished.  In my investigations I don't need to run every tool every time and that's kind of what I feel this does... maybe useful for others but doesn't fit into my process flow.
    Highlights :
    • Going to be in SIFT v3.0 but for the time being it's at mantarayforensics.com

    Honorable Mention

    • Noriben - Brian Baskin gave a quick demo of his latest version ; useful for quick analysis
    • SIFTER - "the Google of digital forensics"... I unfortunately didn't make it to this talk but I heard it was great so I'm putting it here as it's something I want to look into and feel others might want to as well.
    • MassScan - This was described as an internal VirusTotal tool.  Unfortunately the display on the projectors wasn't working properly so it wasn't easy to follow along but I'm eager to see the code and what it can really do. (anyone have a link?)

    Wednesday, June 26, 2013

    Don't Get Locked Out

    This has been ported over to my GitHub site and is no longer being maintained here. For any issues, comments or updates head here.


    The system had Full Disk Encryption (FDE) via McAfee SafeBoot.  I had recently changed my Windows password but apparently fat-fingered it, so it wasn't what I thought I had changed it to, which left me unable to authenticate to Windows.  The OS and SafeBoot were working properly and I had valid credentials to log in to the SafeBoot file system (SBFS)... this is because it used separate credentials from my Windows credentials.


    Even though I could authenticate to SafeBoot and decrypt the OS, I wasn’t able to boot off of anything else (Kon-boot, Ophcrack etc.) after authenticating to SafeBoot or prior to entering the SafeBoot environment.

    Since my Windows passphrase was over 18 characters (don’t ask me why) a dictionary attack wasn’t on the list of possible solutions.  While rainbow tables were next on my list, LM was turned off and the key space for my passphrase would have been too big to tackle.  There was the option to try and unlock it via FireWire (Inception) but since this was a Windows 7 x64 with SP1 and 8 GB of memory it was unlikely to work in its release at that time.


    In order to recover/troubleshoot SafeBoot you can use the WinTech CD.  Once you boot your system from the WinTech CD the first thing that you must do is open up WinTech (start > Programs > SafeBoot WinTech) and enter the daily access code.  

    After successfully authorizing yourself the next step is to authenticate to SafeBoot.  This can be done three different ways:

    Since I had valid credentials for this particular SafeBoot group I chose the first option - to "Authenticate From SBFS".  If all goes well you’ll see authorized and authenticated in the bottom of the program.

    You now have the ability to mount your decrypted file system and browse it with an explorer within the BartPE environment or from cmd.  My first thought was to copy off the SAM and SECURITY files but again, lack of LM hashes and my long passphrase were telling me nope, try another way.

    As such, I decided to try the old Sticky Keys trick.  For those of you who are unaware of what I mean, Sticky Keys is an accessibility feature within Windows meant to allow a user to be able to hold down two or more keys at a time when they would otherwise be unable to.  This feature is enabled by default on Windows installations and is therefore highly reliable as another option.  To make sure this was a possible solution I hit the ‘Shift’ key five times once I was at the Windows login screen.  If your settings haven’t been altered and Sticky Keys is enabled you’ll be presented with:

    By switching the Sticky Keys application with a command prompt on the system you can take advantage of this feature and reset a local user’s password or create a new local user.  Usually, this trick would be carried out by either booting the system from a Windows installation disk and utilizing the recovery console or by mounting the file system within a live Linux instance.  The issue that came up again is that neither of them would have sufficed since the OS file system would still be encrypted.


    Once I was authorized and authenticated to the SBFS I opened cmd within WinTech and did the following:
    1. Created a copy of the Sticky Keys application:
      > copy c:\Windows\system32\sethc.exe c:\Windows\system32\sethc.bak
      1 file(s) copied.
    2. Tried to replace the Sticky Keys application with a copy of the command prompt:
      > copy /y c:\Windows\system32\cmd.exe c:\Windows\system32\sethc.exe
      Access is denied.
      0 file(s) copied.

      The first time around I received an "Access Denied" error in this step, as depicted above.  This is something I hadn't run into before because every time I had previously performed this trick I was working on a Windows XP system - but this time it was a Windows 7 system.  After some troubleshooting I realized this error was due to enhanced protections on the System32 files that Windows 7 has over Windows XP...so the ownership/permissions on this file need to be modified.
    3. Within SafeTech, Start > Programs > File Management > MS Explorer.
      • Right click on sethc.exe > Properties > Security > Advanced
      • Change the current owner (TrustedInstaller) to Administrators (in my case)
      • Change the permissions of who you changed ownership to (Administrators in this example) to "full control"
    4. I attempted to replace the Sticky Keys application again with a copy of command prompt:
      > copy /y c:\Windows\system32\cmd.exe c:\Windows\system32\sethc.exe
      1 file(s) copied.

    5. Then, after a system restart I pressed the Shift key 5x at the Windows login.  If all went well then the command prompt should now pop up and allow us to add a new user or reset an existing user's password:

    6. At which point I could just do the second of those two, reset my Windows password:
      > net user <username> <new password>

    While not a super exciting post, it was something that I had to think about for a sec. and hopefully these little notes will help someone else out there if they ever run into the same situation.

    Tuesday, January 22, 2013


    This has been ported over to my GitHub site and is no longer being maintained here. For any issues, comments or updates head here.

    Update 04/09/2013 - NoMoreXOR is now included in REMnux as of version 4.

    Have you ever been faced with a file that was XOR'ed with a 256 byte key? While it may not be the most common length for an XOR key, it's still something that has popped up enough over the last few months to make it on my to-do list.  If you take a look at the first two links mentioned above you'll see they both include some in-house tool(s) which do some magic and provide you with the XOR key.  Even though they both state that at some point their tools will be released, that doesn't help me now.

    Most of the tools I came across can handle single byte - four byte XOR keys no problem (xortool, xortools, XORBruteForcer, xorsearch etc.) but other than that I didn't notice any that would handle (or actually work with) a large XOR key besides okteta, converter and cryptam_unxor.

    I noticed Cryptam's online document analysis tool had the ability to do this as well so I sent them a few questions on their process and received a quick, informative response which pointed me to a post on their site.  Within the post/email they said that they don't perform any bruteforcing on the XOR key but rather perform cryptanalysis and then brute force the ROL1-7 (if present).  As shown in the dispersion graphs they provide, they appear to essentially be looking for high frequencies of repetitive data then using whatever appears the most to test as the key(s).

    So how do you know if the file is XOR'ed with a 256 byte key in the first place?  Well... you could always try to reverse it but you may also be lucky enough to have some YARA rules which have some pre-calculated rules to help aid in this situation.  A good start would be to look at MACB's xortools (previously linked) and also consider what it is you might want to look for (i.e. - "This program cannot be run") and XOR it with some permutations.
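
To give a rough idea of what "XOR it with some permutations" can mean, here's a small sketch that pre-calculates XOR'ed versions of the DOS stub string which could then be turned into signatures.  The helper names are made up for illustration and this isn't how MACB's xortools builds its rules; it just covers single byte keys and an incrementing key as two example permutations.

```python
MARKER = b"This program cannot be run"


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR data with a repeating key."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def single_byte_variants(marker: bytes) -> dict:
    """Pre-calculate the marker XOR'ed with every non-zero single byte key."""
    return {k: xor_bytes(marker, bytes([k])) for k in range(1, 256)}


def incrementing_variant(marker: bytes, start: int) -> bytes:
    """Marker XOR'ed with a key that goes up by one for each byte."""
    key = bytes((start + i) % 256 for i in range(len(marker)))
    return xor_bytes(marker, key)


variants = single_byte_variants(MARKER)
# Print a couple as hex, ready to paste into a hex string of a signature.
print(variants[0x41].hex())
print(incrementing_variant(MARKER, 0x00).hex())
```

Keep in mind that for an incrementing key the resulting bytes depend on where in the key the string happens to start, so one variant per possible starting value (0-255) is needed to cover them all.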

    Manual process

    If we open that file within a hex editor and go to the offset flagged (0x25C8) we'll see what is supposedly "This program cannot be run" = 26 bytes :

    If we take that original file and convert it to hex we'll essentially just get a big hex blob:

    ...but that hex blob helps to try and guess the XOR key:

    From my initial tests, the XOR key has always been in the top returned results, but even if you're having some difficulties for whatever reason you can always modify the code to fit your needs - gotta love that.

    So if we now try to unxor the original file with the first guessed XOR key (remember XOR is symmetric) hopefully we'll get the original content that was XOR'ed:
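
For anyone who wants to see the symmetric part in action, here's a bare-bones sketch.  It assumes the guessed key is handed over as a hex string (like the chunks pulled from the hex blob) and uses a made-up 4 byte key plus a fake file header just to keep it short; it's not the code from the tool.

```python
def unxor(data: bytes, key_hex: str) -> bytes:
    """XOR data with a repeating key given as a hex string."""
    key = bytes.fromhex(key_hex)
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


# Hypothetical short key and content, for illustration only.
key_hex = "86878889"
original = b"MZ\x90\x00" + b"This program cannot be run in DOS mode."

xored = unxor(original, key_hex)     # what we'd find embedded in the document
restored = unxor(xored, key_hex)     # same operation, same key, undoes it

print(restored == original)
```

Running the same function twice with the same key gets you back to where you started, which is why there's no separate "decrypt" routine.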

    After the original file was unxored and scanned with YARA we see that it was flagged for having an embedded EXE within it (this rule can be found within MACB's capabilities.yara file) so it looks like it worked.

    Now while all this hex may look like a bunch of garbage at times, the human eye is very good at recognizing patterns - and when you look more and more at things like this you'll start to recognize them.  Do you recall the YARA hit that triggered? It stated that the XOR key was incremented.  What this means is that each byte of the key is one higher than the byte before it, so every byte of data is XOR'ed with the next value in the sequence until the key wraps back around to the beginning.  That may be confusing to grasp at first so let's visualize it by breaking down the previously found 256 byte XOR key in its respective order:


    As you see, it started with 86 and looped all the way around till it reached 85 - you should also notice the patterns on each line.  This is just an example of incremental/decremental XOR, which wasn't as commonly observed in my testing, but it's useful to know because it's quite easy to spot if you look at the original file in a hex editor again:

    ... and that's a pattern that was observed repeating ~56 times.
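
Here's a quick sketch of why that pattern jumps out.  It assumes the 86 and 85 above are hex values and rebuilds an incrementing 256 byte key from them (the function names are mine).  Anywhere the original content is a run of null bytes, XOR'ing with the key just leaves the key itself sitting in the file, over and over.

```python
def incrementing_key(start: int, length: int = 256) -> bytes:
    """Build a key where each byte is one higher than the last, wrapping at 0xFF."""
    return bytes((start + i) % 256 for i in range(length))


def xor_with_key(data: bytes, key: bytes) -> bytes:
    """XOR data with a repeating key."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


key = incrementing_key(0x86)
print(key[:4].hex(), "...", key[-4:].hex())

# Null padding (common inside an embedded PE) XOR'ed with the key exposes the key.
padding = bytes(1024)
xored = xor_with_key(padding, key)
print(xored == key * 4)
```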

    Automated process

    So now we can kind of put together a process flow of what we want to do:
    1. Convert the original, XOR'ed file to hex
    2. Conduct some slight frequency analysis of the newly created hex file and look for the most common characters as well as the most commonly observed hex chunks.  
      1. The first part may help in determining if there's an embedded PE file (usually a lot of \x00's) or possibly help deduce if certain bytes should be skipped.  
      2. The latter essentially reads 512 bytes at a time, stores it and continues till the end of the file.  Once complete it does some simple checking to try and weed out meaningless possible keys then presents the top five most observed 512 bytes or characters in this sense  (i.e. - 512 characters = 1 possible 256 byte key(s))
    3. For each possible XOR key guessed from the previous step, XOR (the entire file for right now) the original file, save it to a new file and scan it with YARA.  
      1. I chose to perform YARA scans here to help determine the likelihood that the key used was correct - you may choose to implement something else such as just a check for an embedded PE file etc.  If there are YARA hits then I stop attempting the other possible XOR keys (if any other were still to be processed) and assume the previous XOR key was the correct one.
    * If you stick with the YARA scanning, it will continue to process all of the possible key(s) it outlined as the top, in terms of frequency, so your YARA rules should include something that might be present in the original XOR'ed file.  If not, you might already have the correct XOR key but aren't aware.  Embedded exe's are a good start to look for since they're common - but remember if we XOR the entire file at once instead of a specific section that you might find the embedded content but that doesn't mean the original file will be readable afterwards (i.e - won't be a Word document anymore since it was XOR'ed) 
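
The key guessing in step 2 can be sketched in a few lines.  This is a simplified take and not the actual NoMoreXOR code: the function name is mine, it only looks at chunks aligned to the start of the file, and the only "weeding out" it does is dropping chunks made up of a single repeated byte.

```python
import binascii
from collections import Counter


def guess_keys(data: bytes, key_len: int = 256, top: int = 5):
    """Return the most frequently seen hex chunks as (chunk, count) pairs."""
    hex_blob = binascii.hexlify(data).decode("ascii")
    chunk = key_len * 2  # 2 hex characters per byte, so 512 characters per key
    counts = Counter(
        hex_blob[i:i + chunk]
        for i in range(0, len(hex_blob) - chunk + 1, chunk)
    )
    # Drop chunks that are one byte repeated (e.g. all 00), they aren't useful keys.
    candidates = [
        (c, n) for c, n in counts.most_common()
        if len({c[j:j + 2] for j in range(0, chunk, 2)}) > 1
    ]
    return candidates[:top]


# Made-up test data: mostly null bytes XOR'ed with an incrementing key.
key = bytes((0x86 + i) % 256 for i in range(256))
plain = b"\x00" * (256 * 8) + b"A" * 256
data = bytes(b ^ key[i % 256] for i, b in enumerate(plain))

best, count = guess_keys(data)[0]
print(count, bytes.fromhex(best) == key)
```

Each candidate returned would then be used to XOR the file and the result handed to YARA (or whatever check you prefer) as described in step 3.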

    Let's try out that process flow in a more automated way (on a new file):

    As you can see, it worked like a charm :)

    As always, I'm sure there's a better way to code some of the stuff I did but hey, it works for me at the moment.  There's a to-do list of things that I want to further implement into this tool, some of which is already included in other tools.  I've been asked before how this tool will work with smaller XOR keys and that's up to you to test and tell me - I created this solely to tackle the problem of the 256 byte key files I was observing so I'd recommend using one of the earlier mentioned tools for that situation, at least for the time being.

    Example To-do's:
    • ROL(1-7)/ROT(1-25) - either brute forcing or via YARA scans
    • Add ability to skip \x00 & other chosen bytes (ref)
    • more is outlined within the file....


    NoMoreXOR can be found on my github