
Wednesday, March 12, 2014

Bruteforcing XOR with YARA

This has been ported over to my GitHub site and is no longer being maintained here. For any issues, comments or updates head here.


In a previous post I looked at coming up with a process for determining 256 byte XOR keys.  I've received and read some great feedback/posts regarding the tool, and even though I wrote it in such a way as to still possibly show patterns/repetitive bytes for smaller XOR keys, that wasn't its purpose.  There are plenty of other tools out there to assist when dealing with XOR'ed files; however, a co-worker and I were recently left unsuccessful after exhausting those resources.

I'm often asked to look at some artifact that's believed to be encoded in some fashion, or I hear that even if something is XOR'ed, they wouldn't know how to go about decrypting/decoding it.  I'm by no means an expert and sometimes find myself just as lost as you might feel, but I thrive on learning and challenges - hence why I decided to work in the DFIR space.

I believe this type of scenario is just like most others - the more time you spend doing it, the easier it becomes.  Additionally, pattern recognition is key when it comes to XOR (pun intended).  Determining the XOR key and any other skips etc. that might be used can be quite trivial, but let's look at a few ways that make this type of scenario harder:
  • You don't have access to the source code of the file responsible for performing the XOR 
  • You don't have access to the binary responsible for performing the XOR 
  • You don't have the knowledge/skills/resources
  • The key you think should work isn't working
So you just have a file that you believe is encoded but you're not sure how (e.g. - you try to open it and you don't see any plain text). One of the easiest ways to determine if it's XOR'ed is if, while scrolling through it, you start to see patterns emerging.  This could be horizontally, vertically or maybe just repetitive characters constantly appearing - it all depends on the key length and any other skips that might be in play.  When I say skips I'm referring to the XOR routine skipping null bytes, line feeds, carriage returns, not XOR'ing itself (e.g. - if the key is A5 then maybe when it sees A5 it skips it instead of XOR'ing it) or some other trick.  Again, these are easier to determine if you have either of the first two bullet points listed above... but unfortunately that's not always the case.

In a recent blog post there was mention of the malware named XtremeRAT along with a few tools to help in scenarios where you're investigating incidents involving it.  One of the scripts listed there is for decrypting a keylog file created by XtremeRAT with the two byte XOR key '3fa5'.  While it's helpful to know that a two byte XOR key is used, what if it doesn't work on your file (bullet point number 4 mentioned above)?  Or what if there's a new variant using a different XOR key that you now need to figure out?  To solve these questions I decided to leverage a combination of YARA, the script xortools from the Malware Analyst's Cookbook (the book that keeps on giving) and use case examples from some others within the YaraExchange.  Xortools has some useful functions for creating different XORs and permutations and then spitting them out into YARA rules... sweet, right?

The functions within xortools didn't quite have a solution for what I was trying to do but some quick modifications to a couple of them was easy enough to implement.  Let's break down the thought process:

  1. I wanted to generate a list of all possible combinations of two byte XOR keys (e.g. - 1010, 1011, 1012 etc.).
  2. Using those combinations I then wanted to XOR a string of my choosing
  3. With the resulting XOR'ed string I wanted to create a YARA rule for their hex bytes.  
  4. I also wanted to keep track of the two byte XOR key being used for each rule and add it to the rule's name so if/when a rule triggers, the XOR key is easily identifiable - this wasn't included in xortools, so see my modified functions
  5. Wash, Rinse, Repeat.... this would entail creating different strings that you want XOR'ed.  I have a list that I usually feed to xorsearch such as 'http', 'explorer', 'kernel32' but in this particular instance I needed a list of strings that were likely to appear in a keylog file, such as:
  • Backspace
  • Delete
  • CLIPBOARD
  • Arrow Left
  • Arrow Right
  • Caps Lock
  • Left Ctrl
  • Right Ctrl

    For some additional hints on what you might see within a keylog file, check out Ian's YARA rule for DarkComet_Keylogs_Memory.
Good thought process thus far, but what if those strings aren't contained within the keylog file?  You wouldn't necessarily know unless you've previously dealt with this malware or have come across an example online... so another approach is to think about what is likely to be recorded on the system.  Here are some examples I've found helpful:

  • Company name (most likely keylogged email and/or Internet browsing)
  • The persons name/user name
  • Microsoft
This should help make things more flexible and help tackle the unknown aspect.

First things first... create a function to generate every combo of two byte XOR keys:


The top is the original and the bottom is an example of how to generate the pair by adding another loop and, at the end, saving the two byte key for use in the rule name.  Note: Doing it this way may produce hex characters that are only a nibble, and YARA will not like that if you're trying to match on hex characters, so to circumvent it I decided to add a wild card (?) as the other nibble.
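Since the screenshot is gone, here's a minimal sketch of what such a modified generator might look like (function and variable names are mine, not xortools'):

    def two_byte_keys():
        # yield every two byte XOR key (0x0000 - 0xffff) along with a
        # label such as '3fa5' that gets embedded in the rule name
        for first in range(256):
            for second in range(256):
                yield (first, second), '%02x%02x' % (first, second)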

Next, we need to feed those two bytes to an XOR function and XOR the string we passed it.  Finally, leverage the 'yaratize' function to create the YARA rule.  I got things working, but when I went to scan the XOR'ed keylog files I received 'Error 25' from YARA (sad face).  After some troubleshooting I was told this issue was being caused by having too many strings in a single rule.  Essentially, Error 25 'ERROR_EXEC_STACK_OVERFLOW' meant I was hitting a hard limit on the stack size.  No bueno... My options were to tweak line 24 in libyara/exec.c or create better YARA rules.  By creating so many strings and using the pre-existing 'yaratize' function within xortools, my rule followed this structure:



You'll notice it's the standard rule format most of you are probably familiar with seeing: rule name followed by the strings to match and, at the bottom (not shown), the condition.  After some testing I determined that ~16k strings to match on seemed to be the limit YARA would accept in a single rule (that's based on my system's config, length of the string to match, etc.).  Back to my options - I could tweak that setting in YARA (which I didn't want to), keep a counter and only add X amount of strings to match per rule, or the third option of creating one rule per string.  The third might not be familiar to some of you but that's what I opted to go with.  It creates a larger file because of all the extra characters you're adding but with the new version of YARA, performance shouldn't really be too much of a factor.  An example of this type of format is:
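With the screenshot missing, the one-rule-per-string layout looks roughly like this (rule/string names are illustrative; the hex is 'Microsoft' XOR'ed with the repeating two byte key '3fa5', and the key in the rule name is what identifies a hit):

    rule Microsoft_xor_3fa5
    {
        strings:
            $a = { 72 cc 5c d7 50 d6 50 c3 4b }
        condition:
            $a
    }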



Now that this hurdle was bypassed, I was able to use the YARA rules generated.  On a test file that I XOR'ed with the key '3fa5' the YARA rules worked... however, they still weren't working on the keylog files from XtremeRAT - Err!



Note: the (-s) switch to YARA tells it to print out what matched, which is important here because our string name has the XOR key in it and the (-f) switch tells it to use fast matching mode, which only prints out the first match in the file instead of every time it's matched.

Alright, so let's pop open the XOR'ed test file I created and check out its hex and compare it to what I was seeing in the XtremeRAT files:

Here's what the test file looks like XOR'ed and in plain text, respectively:



And here is an image of the first 10 lines of two keylog files from XtremeRAT.  If you scroll through this example you'll notice the first file consistently has a second byte of '00' while the second file consistently has a second byte of 'a5':




If you've read anything on XOR'ing before you may be aware that XOR keys can present themselves based on what they're XOR'ing (hence why skips/checks are sometimes implemented).  Focusing on the bottom file, I'd say 'a5' is part of the XOR key - if not the key itself (depends on the length you're dealing with).  Circling back to the XtremeRAT blog post, we know a common key is '3fa5', so it appears we're being presented with half the key when we browse through the XOR'ed keylog file.  Now if you recall the previous YARA rules being created, I was producing a straight two byte XOR without any skips... if you look at the above files you'll realize (maybe after some troubleshooting) that this won't work in this instance because the keylog file doesn't have each byte sequentially (e.g. - if the word within the keylog file we're looking for is 'Microsoft', the keylog file doesn't show that word XOR'ed in order, but rather with 'a5' in between each XOR'ed character).  Hm, what's happening?  According to the blog post,

"XtremeRAT's key scheduling algorithm (KSA) implementation contains a bug wherein it only considers the length of the key string, not including the null bytes between each character, as found in these Unicode strings".

Now without having the binary or source code to make that determination (which I didn't), it should still become evident if you try and do a comparison:




On the left hand side of the above image is another look at the previously shown test file I created with some common keywords typically found in a keylogger file and on the right hand side is a sanitized copy of one created by XtremeRAT.  In each of the panes, the word 'Microsoft' is highlighted in the format of the particular file it's part of.  For a visual guide of what's going on and what should be expected I put together a quick image:



The top section shows the string 'Microsoft' in its native form, converted to other formats followed by what its representation would be if that particular character was XOR'ed by each half of the two byte XOR key '3fa5' by themselves.  The bottom section again shows the same string but separated by 'a5' as shown when viewing the keylog file XOR'ed followed by what would be required in a YARA rule to match on this particular string as it's seen within the XOR'ed file (hope this makes sense).

When stuck or first starting off with something like this you can reference online tables or use online systems to see binary/decimal/hex conversions, but it might be worthwhile figuring out how to do it programmatically in something you feel comfortable with - python, perl, bash, M$ Excel etc. - to try and see what's going on.

Below is another copy of the same exact table shown above, but this time with two columns highlighted.  The top column helps show each character within the string 'Microsoft' as its value in hex once it's XOR'ed with the single byte key '3f'.  The bottom column contains the same information, but has the second half of the XOR key, 'a5', inserted in between each of the string's characters.



In other words: because XtremeRAT uses a two byte XOR key and has null bytes in between each character, the second part of the two byte XOR key, 'a5', is always displayed.  Essentially it becomes a one byte XOR key, as each character is always XOR'ed with the first half of the XOR key, '3f'.  So how do we compensate for this?  After generating the permutations for every two byte XOR key, we read each character one at a time from the string we supply, XOR it with the first half of the two byte key, and add the second half of the two byte key right after it as itself (represented in the bottom blue column above).
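As a rough sketch (again, my names rather than xortools'), the adjusted routine looks like:

    def xor_keylog_string(s, first, second):
        # XOR each character with the first key byte only, then emit the second
        # key byte as-is - mirroring XtremeRAT's buggy KSA over Unicode strings,
        # where the null byte between characters absorbs the second key byte
        out = []
        for ch in s:
            out.append('%02x' % (ord(ch) ^ first))
            out.append('%02x' % second)
        return ' '.join(out)

    # 'Microsoft' with key 3fa5 -> '72 a5 56 a5 5c a5 ...'
    print(xor_keylog_string('Microsoft', 0x3f, 0xa5))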

Once we do that, bingo! :



We first see what the new YARA rule for '3fa5' looks like (which has the second byte as itself, 'a5'), then see that it doesn't match on a file that's XOR'ed normally with the two byte key '3fa5', and lastly that it now matches on a keylog file XOR'ed by XtremeRAT with the added null byte routine.

So how easy is it to code?  Pretty easy, since the majority of it already existed - just some slight modifications and you're good to go.  You just need to modify the permutations function to generate combos of two byte XOR keys:



and push them over to the XOR routine needed:





and finally to a function to create the YARA rules:


Other than that, just import the required functions and supply them with the required data; so for the modified functions I created, I could just say something like:
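Since that screenshot is also gone, here's a hedged example of what the driver could look like, reusing the hypothetical names from the sketches above:

    # one YARA rule per keyword per candidate two byte key;
    # yaratize_one() is a stand-in for a yaratize-style helper that
    # wraps the hex bytes in a single rule named after the key
    for word in ['Backspace', 'CLIPBOARD', 'Microsoft']:
        for (first, second), label in two_byte_keys():
            hex_bytes = xor_keylog_string(word, first, second)
            print(yaratize_one('%s_xor_%s' % (word, label), hex_bytes))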



and voila, game over.  This hopefully helped explain a little more about what XOR is, how to go about detecting it, and gave you another resource you can use in the future for trying to brute force a possible XOR key based on some common strings that might be present.  Since xortools is hosted on Google Code I opted to put up a modified version on my github instead of just a patch.  I'm not the original author of all the code, just a guy modifying as needed.

Tuesday, December 3, 2013

AnalyzePDF - Bringing the Dirt Up to the Surface

This has been ported over to my GitHub site and is no longer being maintained here. For any issues, comments or updates head here.

What is that thing they call a PDF?

The Portable Document Format (PDF) is an old format... it was created by Adobe back in 1993 as an open standard but wasn't officially released as an open standard (ISO 32000-1) until 2008 - right @nullandnull?  I can't take credit for the nickname I call it today, Payload Delivery Format, but I think it's clever and applicable enough to mention.  I did a lot of painful reading through the PDF specifications in the past and if you happen to do the same I'm sure you'll also have a lot of "hm, that's interesting" thoughts as well as many "wtf, why?" thoughts.  I truly encourage you to go out and do the same... it's a great way to learn about the internals of something, what to expect and what would be abnormal.  The PDF has become a de facto standard for transferring files, presentations, whitepapers etc.

<rant> How about we stop releasing research/whitepapers about PDF 0-days/exploits via a PDF file... seems a bit backwards</rant>

We've all had those instances where you wonder if that file is malicious or benign... do you trust the sender or was it downloaded from the Internet?  Do you open it or not?  We might be a bit more paranoid than most people when it comes to this type of thing, but since PDF's are so common they're still a reliable delivery method for malicious actors.  As the PDF contains many 'features', these features often turn into 'vulnerabilities' (do we really need to embed an exe into our PDF? or play a SWF game?).  Good thing it doesn't contain any vulnerabilities, right? (to be fair, the sandboxed versions and other security controls these days have helped significantly)


What does a PDF consist of?

In its most basic format, a PDF consists of four components: header, body, cross-reference table (Xref) and trailer:

(sick M$ Paint skillz, I know)

If we create a simple PDF (this example only contains a single word in it) we can get a better idea of the contents we'd expect to see:
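The screenshot is gone, but a stripped-down single-word PDF looks roughly like this (lengths, offsets and the font resource are abbreviated/approximate):

    %PDF-1.1
    1 0 obj
    << /Type /Catalog /Pages 2 0 R >>
    endobj
    2 0 obj
    << /Type /Pages /Kids [3 0 R] /Count 1 >>
    endobj
    3 0 obj
    << /Type /Page /Parent 2 0 R /Contents 4 0 R >>
    endobj
    4 0 obj
    << /Length 40 >>
    stream
    BT /F1 24 Tf 100 700 Td (Hello) Tj ET
    endstream
    endobj
    xref
    0 5
    0000000000 65535 f
    ...
    trailer
    << /Size 5 /Root 1 0 R >>
    startxref
    ...
    %%EOF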


 What else is out there?

Since PDF files are so common these days there's no shortage of tools to rip them apart and analyze them.  Some of the information contained in this post and within the code I'm releasing may overlap with others out there, but that's mainly because our research produced similar results or our minds think alike... I'm not going to touch on every tool out there but there are some worth mentioning, as I either still use them in my analysis process or some of their functionality/lack of functionality is what sparked me to write AnalyzePDF.  By mentioning the tools below my intention isn't to downplay them and/or their ability to analyze PDF's but rather to help show the reasons I ended up doing what I did.

pdfid/pdf-parser

Didier Stevens created some of the first analysis tools in this space, which I'm sure you're already aware of.  Since they're bundled into distros like BackTrack/REMnux already they seem like good candidates to leverage for this task.  Why recreate something if it's already out there?  Like some of the other tools, it parses the file structure and presents the data to you... but it's up to you to be able to interpret that data.  Because these tools are commonly available on distros and get the job done I decided they were the best to wrap around.

Did you know that pdfid has a lot more capability/features than most are aware of?  If you run it with the (-h) switch you'll see some other useful options such as (-e), which displays extra information.  Of particular note here is the mention of "%%EOF", "After last %%EOF", create/mod dates and the entropy calculations.  During my data gathering I encountered a few hiccups that I hadn't previously experienced.  This is expected as I was testing a large data set of who knows what kind of PDF's.  Again, I'm not noting these to put down anyone's tools but I feel it's important to be aware of what the capabilities and limitations of something are - and also in case anyone else runs into something similar so they have a reference.  Because of some of these, I am including a slightly modified version of pdfid as well.  I haven't tested whether the newer version fixed anything so I'd rather give the files that I know work with it.

  • I first experienced a similar error as mentioned here when using the (-e) option on a few files (e.g. - cbf76a32de0738fea7073b3d4b3f1d60).  It appears it doesn't count multiple '%%EOF's: if the '%%EOF' is the last thing in the file without a '\r' or '\n' behind it, it doesn't seem to count it.
  • I've had cases where the '/Pages' count was incorrect - there were (15) PDF's that showed '0' pages during my tests.  One way I tried to get around this was to use the (-a) option and test between the '/Page' and '/Pages' values. (e.g. - ac0487e8eae9b2323d4304eaa4a2fdfce4c94131)
  • There were times when the number of characters after the last '%%EOF' were incorrect
  • Won't flag on JavaScript if it's written like "<script contentType="application/x-javascript">" (e.g - cbf76a32de0738fea7073b3d4b3f1d60) :



peepdf

Peepdf has gone through some great development over the course of me using it and definitely provides some great features to aid in your analysis process.  It has some intelligence built into it to flag on things and also allows one to decode things like JavaScript from the current shell.  Even though it has a batch/automated mode to it, it still feels like more of a tool that I want to use to analyze a single PDF at a time and dig deep into the files internals.

  • Originally, this tool didn't match keywords if they had spaces after them, but it was a quick and easy fix... glad this testing could help improve another user's work.

PDFStreamDumper

PDFStreamDumper is a great tool with many sweet features but it has its uses and limitations like all things.  It's a GUI built for analysis on Windows systems, which is fine, but its power comes from analyzing a single PDF at a time - and again, it's still mostly a manual process.

pdfxray/pdfxray_lite

Pdfxray was originally an online tool but Brandon created a lite version so it could be included in REMnux (it used to be publicly accessible but at the time of writing that looks like it might have changed).  If you look back at some of Brandon's work historically, he's done a lot in this space as well, and since I encountered some issues with other tools and noticed he did too in the past, I know he's definitely dug deep and used that knowledge for his tools.  Pdfxray_lite has the ability to query VirusTotal for the file's hash and produce a nice HTML report of the file's structure - which is great if you want to include that in an overall report, but again this requires the user to interpret the parsed data.

pdfcop

Pdfcop is part of the Origami framework.  There're some really cool tools within this framework and I liked the idea of analyzing a PDF file and alerting on badness.  This particular tool in the framework has that ability; however, I noticed that if it flagged on one cause then it wouldn't continue analyzing the rest of the file for other things of interest (e.g. - I've had it close the file out right away if there was an invalid Xref without looking at anything else.  This is because PDF's are read from the bottom up, meaning their Xref tables are read first in order to determine where to go next).  I can see the argument of saying why continue to analyze the file if it was already flagged bad, but I feel that's too much tunnel vision for me.  I personally prefer to know more rather than less... especially if I want to do trending/stats/analytics.

So why create something new?

While there are a wealth of PDF analysis tools these days, there was a noticeable gap of tools that have some intelligence built into them in order to help automate certain checks or alert on badness.  In fairness, some (try to) detect exploits based on keywords or flag suspicious objects based on their contents/names but that's generally the extent of it.  I use a lot of those above mentioned tools when I'm in the situation where I'm handed a file and someone wants to know if it's malicious or not... but what about when I'm not around?  What if I'm focused/dedicated to something else at the moment?  What if there's wayyyy too many files for me to manually go through each one?  Those are the kinds of questions I had to address and as a result I felt I needed to create something new.  Not necessarily write something from scratch... I mean why waste that time if I can leverage other things out there and tweak them to fit my needs?  

Thought Process


What do people typically do when trying to determine if a PDF file is benign or malicious?  Maybe scan it with A/V and hope something triggers, run it through a sandbox and hope the right conditions are met to trigger, or take them one at a time through one of the above mentioned tools?  They're all fine workflows, but what if you discover something unique or come across it enough times to create a signature/rule out of it so you can trigger on it in the future?  We tend to have a lot to remember, so doing the analysis as one-offs may result in us forgetting something we previously discovered.  Additionally, this doesn't scale too well in the sense that everyone on your team might not have the same knowledge you do... so we need some consistency/intelligence built in to try and compensate for these things.

I felt it was better to use the characteristics of a malicious file (either known, or observed combinations within malicious files) to evaluate what would indicate a malicious file, instead of just adding points for every questionable attribute observed - e.g. instead of adding a point for being a one page PDF, make a condition that says if you see an invalid xref and a one page PDF, then give it a score of X (see the sketch after this list).  This makes the conditions more accurate in my eyes since, for example:
  1. A single paged PDF by itself isn't malicious, but if it also contains other questionable things then it should carry a heavier weight of being malicious.
  2. Another example is JavaScript within a PDF.  While statistics show JavaScript within a PDF is a high indicator that it's malicious, there are still legitimate reasons for JavaScript to be within a PDF (e.g. - to calculate a purchase order form or verify that you correctly entered all the required information the PDF requires).
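Here's a minimal sketch of that combination-based idea (conditions and weights are illustrative, not AnalyzePDF's actual ones):

    def score_pdf(counts, xref_valid):
        # score combinations of attributes rather than single attributes
        score = 0
        js = counts.get('/JS', 0) + counts.get('/JavaScript', 0)
        if counts.get('/Page', 0) == 1 and not xref_valid:
            score += 2    # one page alone is fine; paired with a broken xref it isn't
        if js and counts.get('/OpenAction', 0):
            score += 3    # JavaScript that fires automatically on open
        return score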

Gathering Stats

At the time I was performing my PDF research and determining how I wanted to tackle this task, I wasn't really aware of machine learning.  I feel that would be a better path to take in the future, but the way I gathered my stats/data was similar (in a less automated/cool-AI way).  There's no shortage of PDF's out there, which is good for us as it can help us determine what's normal, malicious, or questionable and leverage that intelligence within a tool.

If you need some PDF's to gather some stats on, contagio has a pretty big bundle to help get you started.  Another resource is Govdocs from Digital Corpora ... or a simple Google dork.

Note: Spidering/downloading these will give you files, but they still need to be classified as good/bad for initial testing.  Be aware that you're going to come across files that someone may mark as good but that actually show signs of badness... always interesting to detect these types of things during testing!

Stat Gathering Process

So now that I have a large set of files, what do I do now?  I can't just rely on their file extensions or someone else saying they're malicious or benign so how about something like this:
  1. Verify it's a PDF file.  
    • When reading through the PDF specs I noticed that the PDF header can be within the first 1024 bytes of the file, as stated in 3.4.1, 'File Header' of Appendix H - 'Acrobat viewers require only that the header appear somewhere within the first 1024 bytes of the file.'... that's a long way down compared to the traditional header, which is usually right at the beginning of a file.  So what's that mean for us?  Well, if we rely solely on something like file or TRiD they _might_ not properly identify/classify a PDF that has the header that far into the file, as most only look within the first 8 bytes (unfair example is from corkami).  We can compensate for this within our code/create a YARA rule etc.... you don't believe me, you say?  Fair enough, I don't believe things unless I try them myself either:
    The file to the left is properly identified as a PDF file but when I created a copy of it and modified it so the header was a bit lower, the tools failed.  The PDF on the right is still in accordance with the PDF specs and PDF viewers will still open it (as shown)... so this needs to be taken into consideration.




  2. Get rid of duplicates (based on SHA256 hash), first for files in the same category (clean vs. dirty) and then again across the entire data set to make sure there are no duplicates between the clean and dirty sets.
  3. Run pdfid & pdfinfo over the file to parse out their data.  

    • These two are already included in REMnux so I leveraged them.  You can swap in other tools, but this made it flexible for me and I knew the tool would work when run on this distro; pdfinfo parsed some of the data better during tests, so getting the best of both seemed like the best approach.


  4. Run scans for low hanging fruit/known badness with local A/V||YARA
  5. Now that we have a more accurate data set classified:



  6. Are all PDFs classified as benign really benign?
  7. Are all PDFs classified as malicious really malicious? 

Stats

Files analyzed (no duplicates found between clean & dirty):

Class            Count
Dirty Pre-Dup    22,342
Dirty Post-Dup   11,147
Clean Pre-Dup    2,530
Clean Post-Dup   2,529

Total Files Analyzed: 13,676



I've collected more than enough data to put together a paper or presentation but I feel that's been played out already so if you want more than what's outlined here just ping me.  Instead of dragging this post on for a while showing each and every stat that was pulled I feel it might be more useful to show a high level comparison of what was detected the most in each set and some anomalies.



Ah-Ha's

  • None of the clean files had incorrect file headers/versions
  • There wasn't a single keyword/attribute parsed from the clean files that covered more than 4.55% of its entire data set class.  This helps show the uniqueness of these files vs. malicious actors reusing things.
  • The dates within the clean files were generally unique while the date fields on the dirty files were more clustered together - again, reuse?
  • None of the values for the keywords/attributes of the clean files were flagged as trying to be obfuscated by pdfid
  • Clean files never had '/Colors > 2^24' above 0 while some dirty files did 
  • Rarely did a clean file have a high count of JavaScript in it while dirty files ranged from 5-149 occurrences per file
  • '/JBIG2Decode' was never above '0' in any clean file
  • '/Launch' wasn't used much in either of the data sets but still more common in the dirty ones
  • Dirty files have far more characters after the last %%EOF (starting from 300+ characters is a good check)
  • Single page PDF's have a higher likelihood of being malicious - no duh
  • '/OpenAction' is far more common in malicious files

YARA signatures

I've also included some PDF YARA rules that I've created as a separate file so you can use those to get started.  YARA isn't really required but I'm making it that way for the time being because it's helpful... so the default rules location points to REMnux's copy of MACB's rules unless otherwise specified.

Clean data set:


Dirty data set:


Signatures that triggered across both data sets:



Cool... so we know we have some rules that work well and others that might need adjusting, but they still help!

What to look for

So we have some data to go off of... what are some additional things we can take away from all of this and incorporate into our analysis tool so we don't forget about them and/or stop repetitive steps?

  1. Header
    • In addition to possibly sitting beyond the first 8 bytes, I found it useful to look at the specific version within the header.  This should normally look like "%PDF-M.N" where M.N is the Major/Minor version... however, the above mentioned 'low header' needs to be looked for as well.

      Knowing this, we can look for invalid PDF version numbers or, digging deeper, we can correlate the PDF's features/elements to the version number and flag on mismatches.  Here're some examples of what I mean, and more reasons why reading those dry specs is useful:
      • If FlateDecode was introduced in v1.2 then it shouldn't be in any version below
      • If JavaScript and EmbeddedFiles were introduced in v1.3 then they shouldn't be in any version below
      • If JBIG2 was introduced in v1.4 then it shouldn't be in any version below
  2. Body
    • This is where all of the data is (supposed to be) stored; objects (strings, names, streams, images etc.).  So what kinds of semi-intelligent things can we do here?
      • Look for object/stream mismatches.  e.g. - Indirect objects must be delimited by 'obj' and 'endobj', so if the number of 'obj' mentions differs from the number of 'endobj' mentions then it might be something of interest
      • Are there any questionable features/elements within the PDF? 
      • JavaScript doesn't immediately make the file malicious, as mentioned earlier; however, it's found in ~90% of malicious PDF's based on others' research and my own.
      • '/RichMedia'  - indicates the use of Flash (could be leveraged for heap sprays)
      • '/AA', '/OpenAction', '/AcroForm' - indicate that an automatic action is to be performed (often used to execute JavaScript)
      • '/JBIG2Decode', '/Colors' - could indicate the use of vulnerable filters; Based on the data above maybe we should look for colors with a value greater than 2^24
      • '/Launch', '/URL', '/Action', '/F', '/GoToE', '/GoToR' - opening external programs, places to visit and redirection games
      • Obfuscation
      • Multiple filters ('/FlateDecode', '/ASCIIHexDecode', '/ASCII85Decode', '/LZWDecode', '/RunLengthDecode')
      • The streams within a PDF file may have filters applied to them (usually for compressing/encoding the data).  While this is common, it's not common within benign PDF files to have multiple filters applied.  This behavior is commonly associated with malicious files trying to thwart A/V detection by making it work harder.
      • Separating code over multiple objects
      • Placing code in places it shouldn't be (e.g. - Author, Keywords etc.)
      • White space randomization
      • Comment randomization
      • Variable name randomization
      • String randomization
      • Function name randomization
      • Integer obfuscation
      • Block randomization
      • Any suspicious keywords that could mean something malicious when seen with others?
      •  eval, array, String.fromCharCode, getAnnots, getPageNumWords, getPageNthWords, this.info, unescape, %u9090
  3. Xref
    • The first object has an ID 0 and always contains one entry with generation number 65535.  This is at the head of the list of free objects (note the letter 'f', which means free).  The last object in the cross-reference table uses the generation number 0.

    Translation please?  Take a look at the following Xref:
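    The screenshot is gone, but a well-formed Xref for a small file looks something like this (offsets vary per file):

      xref
      0 6
      0000000000 65535 f
      0000000017 00000 n
      0000000081 00000 n
      0000000153 00000 n
      0000000229 00000 n
      0000000312 00000 n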
    Knowing how it's supposed to look we can search for Xrefs that don't adhere to this structure.
  4. Trailer
      • Provides the offset of the Xref (startxref)
      • Contains the EOF, which is supposed to be a single line with "%%EOF" to mark the end of the trailer/document.  Each trailer will be terminated by these characters and should also contain the '/Prev' entry which will point to the previous Xref.
      • Any updates to the PDF usually result in appending additional elements to the end of the file

        This makes it pretty easy to spot PDF's with multiple updates or additional characters after what's supposed to be the EOF
  5. Misc.
      • Creation dates (both format and if a particular one is known to be used)
      • Title
      • Author
      • Producer
      • Creator
      • Page count

    The Code

    So what now?  We have plenty of data to go on - some previously known, but some extremely new and helpful.  It's one thing to know that files with JavaScript, or that are only one page, have a higher tendency of being malicious... but what about some of the other characteristics of these files?  By themselves, a single keyword/attribute might not stick out that much, but what happens when you start to combine them?  Welp, hang on because we're going to put this all together.

    File Identification

    In order to account for the header issue, I decided the tool itself would look within the first 1024 bytes instead of relying on other file identification tools:
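    Roughly, that check boils down to something like this sketch (not AnalyzePDF's exact code):

        def is_pdf(path):
            # the spec allows '%PDF-' anywhere in the first 1024 bytes,
            # so read that far instead of sniffing only the first few bytes
            with open(path, 'rb') as f:
                return b'%PDF-' in f.read(1024)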



    Another way, so this could be detected whether this tool was used or not, was to create a YARA rule such as:
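    The rule itself is gone from the page, but a reconstruction along those lines (my wording, not the original rule) would be:

        rule pdf_low_header
        {
            strings:
                $pdf = "%PDF-"
            condition:
                $pdf in (0..1024) and not $pdf at 0
        }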

    Wrap pdfinfo

    Through my testing I found this tool to be more reliable than pdfid in some areas, such as:

    • Determining if there're any Xref errors produced when trying to read the PDF
    • Look for any unterminated hex strings etc.
    • Detecting EOF errors


    Wrap pdfid

    • Read the header.  *pdfid will show exactly what's there and not try to convert it*
    • _attempt_ to determine the number of pages
    • Look for object/stream mismatches
    • Not only look for JavaScript but also determine if there's an abnormally high amount
    • Look for other suspicious/commonly used elements for malicious purposes (AcroForm, OpenAction, AdditionalAction, Launch, Embedded files etc.)
    • Look for data after EOF
    • Calculate a few different entropy scores
    Next, perform some automagical checks and hold on to the results for later calculations.

    Scan with YARA

    While there are some pre-populated conditions that score a ranking built into the tool already, the ability to add/modify your own is extremely easy.  Additionally, since I'm a big fan of YARA, I incorporated it into this as well.  There are many benefits to this, such as being able to write a rule for header evasion, version number mismatches or even flagging on known malicious authors or producers.  The biggest strength, however, is the ability to add a 'weight' field in the meta section of the YARA rules.  This allows the user to indicate how good a rule is; if the rule triggers on the PDF, its weighted value is held on to and incorporated later in the overall calculation process, which might increase the file's maliciousness score.  Here's what the YARA parsing looks like when checking the meta field:
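    In yara-python terms, that parsing boils down to roughly this sketch (file names are illustrative; the 'weight' meta field is the convention the tool relies on):

        import yara

        rules = yara.compile('pdf_rules.yar')    # illustrative rule file
        score = 0
        for match in rules.match('sample.pdf'):
            # rules without a 'weight' meta field simply don't affect the score
            score += int(match.meta.get('weight', 0))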




    And here's another YARA rule with that section highlighted for those who aren't sure what I'm talking about:
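    For reference, a rule carrying that meta field might look like this (illustrative names/values):

        rule pdf_known_bad_producer
        {
            meta:
                weight = 3
            strings:
                $p = "SomeBadProducer"    // placeholder for a producer string you trust to be bad
            condition:
                $p
        }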



    If the (-m) option is supplied, then if _any_ YARA rule triggers on the PDF file it will be moved to another directory of your choosing.  This is important to note because one of your rules may hit on the file without being displayed in the output, especially if it doesn't have a weight field.

    Once the analysis has completed, the calculation process starts.  This is two-phase:

    1. Anything noted from pdfinfo and pdfid is evaluated against some pre-determined combinations I configured.  These are easy enough to modify as needed but they've been very reliable in my testing... but hey, things change!  Instead of moving on once one of the combination sets is met, I allow the scoring to go through each one and add the additional points to the overall score, if warranted.  This allows several 'smaller' things to bundle up into something of interest rather than being passed up individually.
    2. Any YARA rule that triggered on the PDF file has its weighted value parsed from the rule and added to the overall score.  This helps bump up a file's score or immediately flag it as suspicious if you have a rule you really want to alert on.



    So what's it look like in action?  Here's a picture I tweeted a little while back of it analyzing a PDF exploiting CVE-2013-0640 :



    Download

    I've had this code for quite a while and hadn't gotten around to writing up a post to release it with, but after reading a former coworker's blog post last night I realized it was time to just write something up and get this out there, as there are still people asking for something that employs some of these capabilities (e.g. - weight ranking).  Is this 100% right all the time?  No... let's be real.  I've come across situations where a benign file was flagged as malicious based on its characteristics, and that's going to happen from time to time.  Not all PDF creators adhere to the required specifications and some users think it's fun to embed or add things to PDF's when it's not necessary.  What this helps to do is give a higher ranking to files that require closer attention, or help someone determine if they should open a file right away vs. send it to someone else for analysis (e.g. - deploy something like this on a web server somewhere and let the user upload their questionable file to it and get back a "yes it's ok" -or- "no, sending it for analysis").

    AnalyzePDF can be downloaded on my github

    Further Reading

    Tuesday, January 22, 2013

    NoMoreXOR

    This has been ported over to my GitHub site and is no longer being maintained here. For any issues, comments or updates head here.


    Update 04/09/2013 - NoMoreXOR is now included in REMnux as of version 4.

    Have you ever been faced with a file that was XOR'ed with a 256 byte key?  While it may not be the most common length for an XOR key, it's still something that has popped up enough over the last few months to make it onto my to-do list.  If you take a look at the first two links mentioned above you'll see they both include some in-house tool(s) which do some magic and provide you with the XOR key.  Even though they both state that at some point their tools will be released, that doesn't help me now.

    Most of the tools I came across can handle single byte to four byte XOR keys no problem (xortool, xortools, XORBruteForcer, xorsearch etc.), but beyond that I didn't notice any that would handle (or actually work) with a large XOR key besides okteta, converter and cryptam_unxor.

    I noticed Cryptam's online document analysis tool had the ability to do this as well so I sent them a few questions on their process and received a quick, informative response which pointed me to a post on their site.  Within the post/email they said that they don't perform any bruteforcing on the XOR key but rather perform cryptanalysis and then brute force the ROL1-7 (if present).  As shown in the dispersion graphs they provide, they appear to essentially be looking for high frequencies of repetitive data then using whatever appears the most to test as the key(s).

    So how do you know if the file is XOR'ed with a 256 byte key in the first place?  Well... you could always try to reverse it, but you may also be lucky enough to have some YARA rules with pre-calculated signatures to help aid in this situation.  A good start would be to look at MACB's xortools (previously linked) and also consider what it is you might want to look for (i.e. - "This program cannot be run") and XOR it with some permutations.

    Manual process



    If we open that file within a hex editor and go to the offset flagged (0x25C8) we'll see what is supposedly "This program cannot be run" = 26 bytes:

    If we take that original file and convert it to hex we'll essentially just get a big hex blob:
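    In Python that conversion is a one-liner (sketch; the file name is made up):

        import binascii

        with open('suspect.bin', 'rb') as f:
            hex_blob = binascii.hexlify(f.read())    # one big blob of hex characters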





    ...but that hex blob helps to try and guess the XOR key:


    From my initial tests, the XOR key has always been in the top returned results, but even if you're having some difficulties for whatever reason you can always modify the code to fit your needs - gotta love that.

    So if we now try to unxor the original file with the first guessed XOR key (remember, XOR is symmetric), hopefully we'll get the original content that was XOR'ed:
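    Under the hood, unxoring is just cycling the guessed key across the file; a minimal sketch:

        from itertools import cycle

        def unxor(data, key):
            # XOR is symmetric - re-applying the 256 byte key recovers the plaintext
            return bytes(b ^ k for b, k in zip(data, cycle(key)))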





    After the original file was unxored and scanned with YARA we see that it was flagged for having an embedded EXE within it (this rule can be found within MACB's capabilities.yara file) so it looks like it worked.


    Now while all this hex may look like a bunch of garbage at times, the human eye is very good at recognizing patterns - and the more you look at things like this, the more you'll start to recognize them.  Do you recall the YARA hit that triggered?  It stated that the XOR key was incremented.  What this means is that each byte is XOR'ed with the next byte value in an incremental fashion until it wraps back around to the beginning.  That may be confusing to grasp at first, so let's visualize it by breaking down the previously found 256 byte XOR key in its respective order:


    868788898a8b8c8d8e8f
    90919293949596979899
    9a9b9c9d9e9f
    a0a1a2a3a4a5a6a7a8a9
    aaabacadaeaf
    b0b1b2b3b4b5b6b7b8b9
    babbbcbdbebf
    c0c1c2c3c4c5c6c7c8c9
    cacbcccdcecf
    d0d1d2d3d4d5d6d7d8d9
    dadbdcdddedf
    e0e1e2e3e4e5e6e7e8e9
    eaebecedeeef
    f0f1f2f3f4f5f6f7f8f9
    fafbfcfdfeff
    00010203040506070809
    0a0b0c0d0e0f
    10111213141516171819
    1a1b1c1d1e1f
    20212223242526272829
    2a2b2c2d2e2f
    30313233343536373839
    3a3b3c3d3e3f
    40414243444546474849
    4a4b4c4d4e4f
    50515253545556575859
    5a5b5c5d5e5f
    60616263646566676869
    6a6b6c6d6e6f
    70717273747576777879
    7a7b7c7d7e7f
    808182838485

    As you can see, it started with 86 and looped all the way around until it reached 85 - you should also notice the patterns on each line.  This is just an example of incremental/decremental XOR (not as commonly observed in my testing, but useful to be aware of) and it's quite easy to spot if you look at the original file in a hex editor again:






    ... and that's a pattern that was observed repeating ~56 times.


    Automated process

    So now we can kind of put together a process flow of what we want to do:
    1. Convert the original, XOR'ed file to hex
    2. Conduct some slight frequency analysis of the newly created hex file and look for the most common characters as well as the most commonly observed hex chunks.  
      1. The first part may help in determining if there's an embedded PE file (usually a lot of \x00's) or possibly help deduce if certain bytes should be skipped.  
      2. The latter essentially reads the hex file 512 characters at a time, stores each chunk and continues till the end of the file.  Once complete, it does some simple checking to try and weed out meaningless possible keys, then presents the top five most observed chunks (i.e. - 512 hex characters = 1 possible 256 byte key)
    3. For each possible XOR key guessed from the previous step, XOR (the entire file for right now) the original file, save it to a new file and scan it with YARA.  
      1. I chose to perform YARA scans here to help determine the likelihood that the key used was correct - you may choose to implement something else such as just a check for an embedded PE file etc.  If there are YARA hits then I stop attempting the other possible XOR keys (if any other were still to be processed) and assume the previous XOR key was the correct one.
    * If you stick with the YARA scanning, it will continue to process all of the possible key(s) it outlined as the top in terms of frequency, so your YARA rules should include something that might be present in the original XOR'ed file.  If not, you might already have the correct XOR key but not be aware of it.  Embedded exe's are a good thing to look for since they're common - but remember, if we XOR the entire file at once instead of a specific section, you might find the embedded content but that doesn't mean the original file will be readable afterwards (i.e. - it won't be a Word document anymore since the whole thing was XOR'ed)
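    Step 2 above is the heart of it; as a rough sketch (my names, not NoMoreXOR's actual internals):

        from collections import Counter

        def guess_keys(hex_blob, key_len=256, top=5):
            # a repeating 256 byte key tends to surface as the most
            # frequent key-sized chunk across the hex blob
            size = key_len * 2    # two hex characters per byte
            chunks = [hex_blob[i:i + size] for i in range(0, len(hex_blob) - size + 1, size)]
            return [c for c, _ in Counter(chunks).most_common(top)]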


    Let's try out that process flow in a more automated way (on a new file):


    As you can see, it worked like a charm :)

    As always, I'm sure there's a better way to code some of the stuff I did but hey, it works for me at the moment.  There's a to-do list of things I want to further implement into this tool, some of which are already included in other tools.  I've been asked how this tool works with smaller XOR keys and that's up to you to test and tell me - I created this solely to tackle the problem of the 256 byte key files I was observing, so I'd recommend using one of the earlier mentioned tools for that situation, at least for the time being.

    Example To-do's:
    • ROL(1-7)/ROT(1-25) - either brute forcing or via YARA scans
    • Add ability to skip \x00 & other chosen bytes (ref)
    • more is outlined within the file....

    Download

    NoMoreXOR can be found on my github

    Tuesday, July 17, 2012

    Customizing cuckoo to fit your needs

    This has been ported over to my GitHub site and is no longer being maintained here. For any issues, comments or updates head here.


    With the talk of the .4 release of cuckoo being publicly released shortly, I figured I should get this post out, as some of the things I talk about here are said to be addressed and included in that release.  If you don't want to wait for that release, or something I touch on here isn't included in it, then hopefully the information below will be of use to you.  In full disclosure, I'm not a python guru, so if you see something that could have been done an easier way or something turns out not to be working for you, please let me know... I found out the hard way that python is strict on spacing.  Throughout my testing it all seemed to work fine for me, but there may be some scenario I didn't test or think of.

    (patches available on my github)

    General Notes

    The installation notes are pretty straightforward to get you up and running and after you successfully do it the first time, any subsequent installation process should be even faster for you.  There are a couple of notes worth mentioning though:
    • The first user you create during your Ubuntu installation is an admin user.  This is important to remember if you want your cuckoo user to be a limited user.
    • When you add the cuckoo user to its group, you need to log out and log back in for it to take effect.
    • To ensure there are no permission issues, you should do the virtualbox setup as the cuckoo user instead of another admin/root account.
    • If during your analysis the VM isn't able to be restored, or you need to kill cuckoo.py, then you need to run virtualbox afterwards and take the VM out of 'saved' mode by discarding it.
    • If you are installing 3rd party applications (and you should be if you want to test exploitation), make sure you're properly pointed to them within their appropriate analyzer file "/path/to/cuckoo/setup/packages"
    • There's a default list of hashes for common programs that are automatically discarded in the dropped files section so be aware of them "/path/to/cuckoo/shares/setup/conf/analyzer.conf"

    Patching

    Instead of re-posting all of the files in the cuckoo repo I decided the easiest way to go about releasing these patches/modifications was to utilize the diff & patch commands in *nix. To create the patches:

    diff -u 'original' 'new' > 'file.patch'

    and once the patches are downloaded from my github, all you need to do is run:

    patch '/path/to/original/cuckoo/file' < 'file.patch'

    Customization

    Web Reports/Portal

    At first I couldn't understand why I was able to continuously reanalyze a sample, but when I thought about it, it made sense.  Since cuckoo gives you the ability to analyze a file in multiple VM's, it has to be processed more than once (duh)... maybe a better approach would be to only have that sample analyzed once per VM.

    In the main web portal page you are presented with a single search box to search for a file's MD5 hash.  For convenience and as a time saver, I hyperlinked the file's MD5 hash in the general information section as well as the dropped files section so you can quickly see if/when it was analyzed previously, instead of having to copy and paste it into the main search box every time.

    I didn’t want to clutter up the general information section of the report with all of the scans and lookups I was adding to the report so I created two other sections for the report (signatures & lookups).


    Signatures


    Within the signatures section I added the following: ClamAV (two versions) and YARA.  If you have other scan engines you wish to run against your files then the same type of method can be re-used.  With all three of these features you need to configure the location of their corresponding signatures within "/path/to/cuckoo/processing/file.py".

    ClamAV

    (besides the above noted change, you also need to edit the path to your clamscan)
    I'm a fan of ClamAV and the numerous ways it can be leveraged, which makes it ideal to have included in my automated processes.  If you've read the Malware Analyst's Cookbook (MACB) you might recall that there's some really handy code made available, one piece of which shows how to do exactly what I wanted to do - scan the files with ClamAV and show the results.  I don't like to re-do what someone else has done if it works how I need it to, so I made one or two modifications and plugged it in as necessary.

    Custom ClamAV

    Using the traditional signatures database from ClamAV is good but it can also be worthwhile to create some of your own signatures (remember how logical signatures can be a big help) so I also added a section where you can point it to your custom ClamAV database so it can pickup on other signatures you’ve personally written/acquired.

    YARA

    On the cuckoo mailing list I came across another user who said he had patches for implementing YARA into cuckoo.  If you've read any of my past posts or follow me on twitter you'll know that I'm a fan of YARA's capabilities, and as such I contacted him to see what he had written.  The patches themselves were very straightforward and since they worked I didn't see a need to change them.  He provided me a link to them on his personal GDrive, so if you only want to implement that feature into cuckoo then you can use his files; however, the files I'm releasing have that already implemented, so there's no need to do double the work otherwise.  When/if more than one YARA rule is matched, they'll be comma separated within brackets.  The additional files needed besides the ones in my github that you'll need to download and install are:
    • http://yara-project.googlecode.com/files/yara-1.6.tar.gz
    • http://yara-project.googlecode.com/files/yara-python-1.6.tar.gz

      Lookups


      The lookups section only contains two actual lookups at the moment but also contains what I refer to as 'future linkage'.  I didn't add the lookups section to the dropped files section because I plan on analyzing those files automatically with the modifications mentioned earlier, and that would just be too repetitive and a waste of time.  As far as actual lookups, I put in Cymru and VirusTotal for right now, so if there's Internet connectivity they will pull the last time the sample was scanned/seen with their services and the A/V detection rate (note - I'm only querying for the hashes, I don't like submitting for a few reasons).

      Team Cymru 

      Team Cymru offers a couple of very useful services, one of which I use during investigations: their Malware Hash Registry (MHR).  MHR takes the hash(es) you supply and tells you if it's a known bad file, the last time they've seen it and an approximate percentage for A/V detection.  MACB also had a recipe for adding this to a script, so once again I just modified as necessary and inserted it to fit cuckoo.

      VirusTotal

      There are a few scripts online to utilize VirusTotal's API and submit/query their site but I decided to use this script.  You can use any method you'd like, but if you use the patches I provided just install that script and supply your API key in "/path/to/cuckoo/processing/file.py".  I didn't want to overly insert code into the existing cuckoo files so I opted to build this file and then import it from within cuckoo.  Essentially I take the file's hash and try to get a report on it, and if one exists just pull the last scan date and detection rate.  While it can be useful to see what the A/V's detected it as, I didn't want to waste time making a collapsible table including all of this information if the new release of cuckoo will already do this.  If it doesn't, then I'll re-visit it.
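      The hash-only lookup is simple enough; here's a hedged sketch against the v2 API (not the exact script I used):

          import requests

          def vt_lookup(file_hash, api_key):
              # query-only against the v2 'file/report' endpoint - nothing gets submitted
              try:
                  resp = requests.get('https://www.virustotal.com/vtapi/v2/file/report',
                                      params={'apikey': api_key, 'resource': file_hash})
                  report = resp.json()
              except Exception:
                  return 'VT lookup error'    # no Internet, bad key, etc. - don't kill the report
              if report.get('response_code') == 1:
                  return '%s (%d/%d)' % (report['scan_date'], report['positives'], report['total'])
              return 'not found on VT'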

      If the sample doesn't have any VT detection, or doesn't exist there, I have it just state that; if there's no current Internet connection it states an error.  The latter is very important because I've seen others trying to stuff this capability into their code while failing to address the scenario where there's no Internet connectivity - their report will therefore fail to be created because they don't handle the resulting error.  I wrote it to be generic in catching an error because I don't want my report to fail because of this; so if there's no Internet connection or another error occurs (note that this will also suppress the error that your API key may be wrong!) and the rest of the report is fine to generate, then it can still generate.  The same holds true for the snippet for the Cymru check.

      Internet connection and results found :



      No Internet connection :



      Internet connection and no results found :


      Future Linkage

      I thought it was useful to pre-link the samples to common online sites people use for additional reference/analysis (malwr, shadowserver and threatexpert).  Instead of slowing down the analysis by trying to pull down all of these reports (if they exist) and then parse them, I decided it was just easier to create a link for them based on the sample's hash.  That way, even if the sample hasn't been analyzed on any of these sites at the time of my analysis, I can go back at a later time and check if a report exists.  Just another way to save some time and make life easier.

      Dropped Files

      Cuckoo will take any files dropped during the analysis of the sample and copy them back over to the host machine under the structure "/path/to/cuckoo/analysis/<#>/files".  By default those files are just left in that subfolder and not analyzed (they will have basic information such as file type and hash in the report, though), but I felt it didn't make sense to just leave them in that sub-directory (at least for my goals), so I opted to change "/path/to/cuckoo/processing/data.py" so it would take those files and move them to my samples directory (/opt/samples):
       shutil.move(cur_path,'/opt/samples')
      This samples folder is the folder I'm going to monitor for new/created files and automatically process them to be analyzed, as mentioned later, via the watcher.rb script.  Once I did that I noticed another side effect... if there was a queue in the samples directory and a file being moved from the dropped files folder to the samples folder was the same, then it would crap out.  I thought the move command would overwrite it, but it didn't.  I figured this could be fixed by either copying instead or, what I chose to do, checking if it exists and if so just deleting it from the dropped files folder since it was going to be processed anyway:

              check = os.path.join('/opt/samples/', cur_file)
              if os.path.exists(check):
                  # Already queued in the samples directory, so just
                  # remove the duplicate from the dropped files folder
                  os.remove(cur_path)
              else:
                  # Copy into the watched samples directory, then clean up
                  # the original (a move that tolerates the destination existing)
                  shutil.copy(cur_path,'/opt/samples')
                  os.remove(cur_path)

              return dropped
      This may not be something everyone wants to do.  One obvious consequence is that since every file is moved out of the dropped files directory, any special configuration file etc. that you might be interested in won't be there, unless you do file type identification and only move files that can actually be processed, moving anything that can't (html files, js etc.) out of the samples directory into another folder instead; a sketch of that idea follows this paragraph.  Another reason is that it may create a continual loop: some malware will go out and download another copy of itself, and automatically analyzing those copies will just loop forever.  This will of course vary by sample, by whether the Internet is connected and by what you want out of your analysis.  Other than that, your analysis task numbers might rise quickly, but that shouldn't be a concern since you aren't going to have a sequential set anyway; there will be times when a file can't be processed.
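      If you did want to gate the move on file type, a rough sketch of the idea might look like the following (it leans on python-magic, which is in the install list below; note the API differs between python-magic variants, and the "processable" prefixes and function name here are just examples):

          import os
          import shutil
          import magic  # python-magic; some variants use magic.open()/file() instead

          PROCESSABLE = ("PE32", "MS-DOS executable")  # example prefixes only

          def route_dropped_file(cur_path, samples_dir, holding_dir):
              filetype = magic.from_file(cur_path)
              dest = samples_dir if filetype.startswith(PROCESSABLE) else holding_dir
              target = os.path.join(dest, os.path.basename(cur_path))
              if os.path.exists(target):
                  os.remove(cur_path)  # already queued; avoids the clobber issue above
              else:
                  shutil.move(cur_path, dest)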


      Samples Directory Watcher

      Melissa wrote a post a little while ago on integrating cuckoo with NTR, and in it she touched upon the usefulness of having a script running that automatically notices when a new file is created in or moved to a certain directory and then takes action on that file.  I thought it was nifty, and since it was already built in Ruby, I wasn't going to hack something else together and see how it held up.  I've read that inotify can be a memory hog, so that's something to keep an eye on, although I haven't had any noticeable issues thus far.  If you read the original post you'll soon notice some typos; Melissa pointed one out, but there are a couple of others that might frustrate you when troubleshooting, so to make things easier I took care of them already.  To get this directory watcher up and running, do the following:

      sudo apt-get install ruby rubygems 
      sudo gem install rb-inotify  

      Update 07-18-12 : ... had the wrong command for installing rb-inotify

      Download the modified watcher.rb script (on my github too) and edit it to point to the directory you want to watch and the script you want executed when an action/event occurs.  Instead of having an interim script here you could just pass the new sample to "/path/to/cuckoo/submit.py", but I realized I needed an interim script because the sample might be password protected or in a format that cuckoo won't take (i.e. an archive file).

      That's the basic customization you need to do for this script; you can change it as you see fit.  When I initially talked to some Ruby gurus, they said that using the IO.popen method was overkill for what I wanted to do, since all I'm essentially doing is passing a string (the new file created/moved) to another file to process.  For testing purposes I changed it to use exec instead, which worked but would kill the watcher script after each event, and that basically defeated the purpose of having it running at all, so I opted to keep the original method.  Once you have all of the pre-reqs installed and the script modified to your needs, just open another tab in your shell and let it fly (you don't need the '&' at the end but I like to get my terminal back):

      ruby watcher.rb &

      Archive Parser


      If you're like me then you might have some emails which contain malware samples as attachments, or you download/get sent password protected archives with possible malware.  If you hand cuckoo an archive or email file (pst etc.) then nothing will happen, as it doesn't have a default module to handle them.  As far as the email situation goes, the sheer thought of saving each sample one by one doesn't sound like fun, so I figured the interim script called from the watcher script could check for a Microsoft Outlook data file and, if found, run pffexport against it.  The thought process is to recursively extract everything out of the email messages and attempt to process the results with cuckoo (if you install libpff, remember to sudo ldconfig after you install it).
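      A hedged sketch of that check (the file(1) match string and the pffexport -t flag are from memory, so verify them against pffexport -h on your install):

          import subprocess

          def maybe_extract_pst(path, out_base):
              # Use file(1) to spot Outlook data files before handing off to pffexport
              filetype = subprocess.check_output(["file", "-b", path])
              if "Outlook" in filetype:
                  subprocess.call(["pffexport", "-t", out_base, path])
                  return True
              return False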


      To address the archives/pw protected archives issue, I try to identify the file as an archive and, if so, try to unzip it both with and without a password.  I wasn't aware that if you supply a wrong password to unzip a file with 7zip, it will still unzip the archive if it turns out there isn't even a password protecting it (thanks Pär).  I also have a little array set up containing some of the common password schemes used on malware archives, so I can add to it in the future (sort of like a dictionary).
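      A rough sketch of that loop, assuming the 7z binary name (it may be 7zr depending on which p7zip package you install) and an example password list:

          import subprocess

          COMMON_PASSWORDS = ["infected", "malware", "virus"]  # example dictionary

          def try_unzip(archive, out_dir):
              # 7z still extracts an unprotected archive even when handed a wrong
              # password, so looping over candidates covers both cases
              for pw in COMMON_PASSWORDS:
                  if subprocess.call(["7z", "x", "-y", "-p" + pw,
                                      "-o" + out_dir, archive]) == 0:
                      return True
              return False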

      Additional Software

      Depending on the installation you're performing and what additional features you're going to be installing, there might be some additional software required, which could include:
      • YARA
        • sudo apt-get install libpcre3 libpcre3-dev
      • python
        • sudo apt-get install python python2.7-dev python-magic python-dpkt python-mako
      • ssdeep
        • http://sourceforge.net/projects/ssdeep/files/ssdeep-2.8/ssdeep-2.8.tar.gz/download
        • svn checkout http://pyssdeep.googlecode.com/svn/trunk/ pyssdeep
        • g++
          • sudo apt-get install g++
        • subversion
          • sudo apt-get install subversion
      • 7zip
        • sudo apt-get install p7zip

          To-do/Wish List

          • The cuckoo DB that's created at "/path/to/cuckoo/db/cuckoo.db" only stores a limited amount of information.  Even though a file's SHA1/SHA256 hash, ssdeep hash, mutexes, IPs/domains etc. are included in the sample's report, they aren't stored in the DB.  This keeps the DB small but doesn't help if I want to search my repository of analyzed samples for, say, every sample that called a particular IP/host.  I didn't want to start changing big chunks of the code to implement this at this point because updates might kill it, so I think the better solution is to only change the snippet that defines which fields to create in the DB and to store other selected fields there after analysis.  A separate script could then be used to query that DB, as it's a common task many of us do anyway.
          • The file identification process for determining what type of file the sample is, and whether it should be processed, is pretty basic at this point.  It does the job but could use a boost at times.  Similarly, I noticed that if there are certain characters in the sample's file name then it won't get processed.  This looks like a one or two line fix with something like Python's string.printable (see the sketch after this list).
          • After talking with one of my friends about cuckoo, he noted he's observed that not all of the dropped files from the sample being analyzed were copied back over to the host after the analysis.  This is no bueno... and while I haven't verified it yet, a simple solution looks to be installing CaptureBAT on the Windows VM and using something (xcopy or robocopy) to copy all of the files caught by CaptureBAT back over to the host after analysis.
          • I'm debating whether to add a switch so I can choose for the analysis to either run wild on the Internet or be fed something like INetSim for simulation.  There are pros and cons to each scenario, and maybe a better solution is to use something like Tor, but I'm up in the air.  As a side note, installing INetSim can be a pain and I'm spoiled since I'm used to it already being installed, so another option to look at could be something like HoneyD.
          • I'd like to modify some of the existing analyzers to run additional programs against a sample and report on their results (e.g. hachoir-subfile, pdfextract).
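          As a rough illustration of the filename fix mentioned above (the helper below is hypothetical, not code from cuckoo):

              import string

              def sanitize_filename(name):
                  # Keep only printable characters so the sample's name
                  # can't trip up submission/processing
                  return "".join(c for c in name if c in string.printable)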