A Novel Way of Detecting Malicious PDF Documents

For some time now, the Portable Document Format (PDF) standard has posed a considerable risk to corporate as well as private information security. Some work has been done on classifying PDF documents as malicious or benign, but less on clustering the malicious documents by the techniques used. Such clustering would, in automated analysis, provide insight into how sophisticated an attack is and who staged it.
A dataset of 100,000 unique PDF documents was supplied by the Shadowserver Foundation. Analysis of the experiment results showed that 97% of the documents contained JavaScript. This and other sources revealed that most exploits are delivered through such, or similar, object types. Based on that, JavaScript object labeling gets a thorough focus in this paper.

The scope of the paper is limited to extending the attribution research already done on PDF documents, so that a feature vector may be used to label a given PDF (or a batch of them) to a relevant cluster, in an attempt to recognize different techniques and threat agents.

JavaScript is currently one of the most exploited PDF object types. How can the PDF feature vector be extended to include a JavaScript subvector correctly describing the technique/style, sophistication and similarity to previously seen malicious PDF documents? And how does it relate to the term digital evidence?

— Thesis statement

The thesis statement considers the coding styles and obfuscation techniques used and the related sophistication of the coding style. Last but most important, the statement involves how the current PDF document measures against others previously labeled. These are all essential problems when it comes to automated data mining and clustering.

A. Related Work

Proposed solutions for malicious-versus-benign classification of PDF documents have been explicitly documented in several papers. Classification using support vector machines (SVM) was handled by Jarle Kittilsen in his recent Master's thesis [1].

Further, the author of this paper in his bachelor's thesis [2] investigated the possibility of detecting obfuscated malware by analyzing HTTP data traffic known to contain malware. The findings were designed, implemented and tested in Snort. Some of those detection techniques will be used as a foundation for labeling in this paper.

Even though much good work has been done in the area of analyzing malicious PDF documents, many of the resulting tools are based on manual analysis. Didier Stevens deserves mention for developing several practical tools, such as pdf-parser and PDFiD. These are not merely tools; they were also the beginning of a structured way of looking at suspicious objects in PDF documents. Credit is also due to Paul Baccas at Sophos, who did considerable work on characterizing malicious versus benign PDF documents [3].

The paper will research the JavaScript feature subvector of malicious PDF documents. To be able to determine an effective vector (in this experimental phase), it is essential that the dataset is filtered, meaning that the files must be known to be malicious. As Kittilsen has done for PDF documents, Al-Tharwa et al. [2] have done interesting work on detecting malicious JavaScript in browsers.


A.1. The Feature Vector in Support of Digital Evidence

Carrier and Spafford defined "digital evidence" as any digital data that contains reliable information that supports or refutes a hypothesis about the incident [7]. Formally, the investigation process consists of five phases and is specially crafted for maintaining evidence integrity, the order of volatility (OOV) and the chain of custody. This all leads up to the term forensic soundness.


Fig. 1: The investigation process. The investigation process consists of five phases [9]. Note the identification and analysis phases

In this paper, forensic soundness is a notion previously defined [10] as meaning: no alteration of source data has occurred. Traditionally this means that every bit of data is copied and no data added. The previous paper stated two elementary questions:

  • Can one trust the host where the data is collected from?
  • Does the information correlate to other data?

When it comes to malicious documents, they are typically collected in two places:

  1. In the security monitoring logging, the pre-event phase
  2. When an incident has occurred, as part of the reaction to that incident (the collection phase)

Now, the ten-thousand-dollar question: when a malicious document gets executed on a computer, how is it possible to get indications that alteration of evidence has occurred? The answer potentially lies in the first collection point, the pre-event logging.

In many cases, especially considering targeted attacks, it is not possible to classify a PDF document as malicious in the pre-event phase. The reason is often the way the threat agent crafts his attack, using collected intelligence to evade the security mechanisms of the target. In accordance with local legislation, most systems must then delete the content data. A proposition, though, is to store the feature vector.

The reasoning behind storing a feature vector is quite simple: when storing hashes, object counts and the JavaScript subvector (which we will return to later in the paper), it will be possible to indicate whether the document features have changed. At the same time, no identifiable data invading privacy is stored.

It is reasonable to argue that the measure of how similar one PDF document is to another is also the measure of how forensically sound the evidence collected in a post-event phase is. How likely it is that the document acquired in the collection phase is the same as the one from the pre-event phase is decided by the characteristics supplied by the feature vectors of both. Hence, the feature vector should be as rich and relevant as possible.
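As a minimal sketch of the idea (the field names and counted object types below are hypothetical, not the format of any monitoring product), a pre-event logger could retain only a digest and object counts rather than the document content itself:

```python
import hashlib
import re

def pdf_feature_record(data: bytes) -> dict:
    """Build a privacy-preserving feature record for pre-event logging.

    Only a digest and object-type counts are kept; the document
    content itself can then be deleted."""
    counted = [b"/JavaScript", b"/RichMedia", b"/Encrypt", b"/FlateDecode"]
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "object_counts": {
            name.decode(): len(re.findall(re.escape(name), data))
            for name in counted
        },
    }

record = pdf_feature_record(b"%PDF-1.4 1 0 obj << /JavaScript (x) >> endobj")
```

A sample acquired post-event can then be compared field by field against the stored record to indicate whether the document features have changed.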

Fig. 2: Correlation by using the feature vector of the PDF document. Illustration of a possible pre/post incident scenario

A.2. Identification as an Extension of Similarity

The notion of similarity largely relates to the feature vector: how is it possible, in large quantities of data, to tell whether a new PDF document carries characteristics similar to others in a larger dataset?

In his work on semantic similarity and similarity-preserving hashing, M. Pittalis [11] defined similarity from the Merriam-Webster dictionary:

Similarity: The existence of comparable aspects between two elements

— Merriam-Webster Dictionary

The measure of similarity is important for clustering or grouping the documents. When clustering datasets, the procedure usually follows six steps; finding the similarity measure is step 2.

  1. Feature selection
  2. Proximity/similarity measure
  3. Clustering criterion
  4. Clustering algorithm
  5. Validation
  6. Interpretation

In this paper the k-means unsupervised learning clustering algorithm was considered. This simple algorithm groups the n observations into k clusters [22]. Each observation relates to the cluster with the nearest mean.
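As a sketch of that assignment step (pure Python for illustration, not the WEKA implementation used later in the paper), a one-dimensional k-means pass over e.g. string-length observations could look like:

```python
import random

def kmeans_1d(points, k, iterations=50, seed=1):
    """Minimal 1-D k-means sketch: repeatedly assign each observation
    to the cluster with the nearest mean, then recompute the means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # recompute each mean; keep the old centroid if a cluster is empty
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# two well-separated groups of hypothetical observations
centroids, clusters = kmeans_1d([60, 67, 70, 10849, 10850, 11000], k=2)
```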

As will be seen over the next two sections, prior work on the subject mostly fails to give a valid similarity measure when it comes to classifying PDF documents as anything other than malicious or benign. So, to be able to cluster the PDF documents, the feature vector will need a revision.

As Pittalis introduced the concept of similarity, it is important to define one more term: identification. According to the American Heritage Dictionary, identification is:

Proof or Evidence of Identity. — The American Heritage Dictionary

In our context this means being able to identify a PDF document and attribute it to e.g. a certain type of botnet or, perhaps more correctly, a coding or obfuscation technique. Ideally this will give an indication of which threat agent is behind the attack. This is something that has not previously been researched extensively for PDF documents.

C. The Portable Document Format

When it comes to the feature vector of the Portable Document Format (PDF), it is reasonable to have a look at how PDF documents are structured. A PDF consists of objects, and each object is of a certain type. As much research has been done on the topic previously, the format itself will not be treated any further in this paper [12].

Fig. 3: A simplified illustration of the Portable Document Format

When considering malicious PDF documents, relevant statistics have shown the following distribution of resource objects:

Known Malicious Dataset Objects. A table showing a number of selected and interesting features in malicious as opposed to clean PDF documents. Baccas used two datasets, where one indicated slightly different results.

Dataset                                                Object Type    Clean (%)   Malicious (%)
The Shadowserver 100k PDF malicious dataset            /JavaScript    NA          97%
Paul Baccas' Sophos 130k malicious/benign dataset [3]  /JavaScript    2%          94%
                                                       /RichMedia     0%          0.26%
                                                       /FlateDecode   89%         77%
                                                       /Encrypt       0.91%       10.81%

As can be seen from the table above, when it comes to the distribution of objects in malicious files, most of them contain JavaScript. This makes it very hard to distinguish and find the similarity between the documents without considering a JavaScript subvector. The author would argue that this makes a JavaScript subvector a requirement for the PDF feature vector to be valid. In previous work, where the aim has been to distinguish between malicious and benign, this has not been an issue.
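As an illustration of how such a distribution can be derived (a rough byte-level scan for name objects; a real parser such as Jsunpack-n would decode filters first), consider:

```python
def object_distribution(documents, names=("/JavaScript", "/RichMedia", "/Encrypt")):
    """Percentage of documents containing each object name at least once."""
    totals = {n: 0 for n in names}
    for data in documents:
        for n in names:
            if n.encode() in data:
                totals[n] += 1
    return {n: 100.0 * c / len(documents) for n, c in totals.items()}

# three toy stand-ins for raw PDF files
docs = [b"<< /JavaScript (a) >>", b"<< /JavaScript /Encrypt >>", b"<< /Pages 2 >>"]
dist = object_distribution(docs)
```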

D. Closing in on the Core: The PDF Javascript Feature Subvector

JavaScript is a client-side scripting language primarily offering greater interactivity with webpages. Specifically, JavaScript is not a compiled language, is weakly typed [4] and has first-class functions [5]. For rapid development these features give great advantages; from a security perspective they are problematic. The following Snort signature detects a JavaScript "unescape" obfuscation technique [2] (we will return to the concept of obfuscation later on):

alert tcp any any -> any any (msg:"Obfuscated unescape"; sid:1337003; content:"replace"; pcre:"/u.{0,2}n.{0,2}e.{0,2}s.{0,2}c.{0,2}a.{0,2}p.{0,1}e' ?.replace(/"; rev:4;)
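The PCRE in the signature can be approximated in Python for offline analysis; the pattern below is a sketch that, like the signature, tolerates a couple of junk characters between the letters of "unescape":

```python
import re

# Sketch of the signature's PCRE: up to two interleaved characters are
# allowed between each letter, which catches e.g. the classic
# "unZescZapZe".replace(/Z/g, "") construction.
OBFUSCATED_UNESCAPE = re.compile(
    r"u.{0,2}n.{0,2}e.{0,2}s.{0,2}c.{0,2}a.{0,2}p.{0,2}e")

hits = [bool(OBFUSCATED_UNESCAPE.search(s)) for s in (
    '"unZescZapZe".replace(/Z/g,"")',   # obfuscated: matches
    "unescape('%u9090%u9090')",         # plain call: matches
    "var result = parseInt(n, 10);",    # benign: no match
)]
```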

Traditionally, JavaScript is integrated as part of a browser. Seen from a security perspective, this opens the door to what is commonly known as client-side attacks. More formally: JavaScript enables programmatic access to computational objects within a host environment. This is complicated by JavaScript coming in different flavors, making general parsing and evaluation complex [6], as may be seen from the above signature. The flavors are often specific to the application. Today most browsers are becoming more aligned due to the requirements of interoperability. Some applications, such as the widely deployed Adobe Reader, have some extended functionality though, which we will be focusing on in this paper.

Even though JavaScript may pose challenges to security, it is important to realize that this is due to complexity. JavaScript (which is implemented through SpiderMonkey in Mozilla products [18] and in Adobe Reader as well) builds on a standard named ECMA-262. Ecma is a standards organization for Information and Communication Technology (ICT) and Consumer Electronics (CE) [17]. Thus, JavaScript is built on the ECMAScript scripting language standard. To fully understand which functions are essential in regard to malicious JavaScript, this paper will rely on the ECMAScript Language Specification [19] combined with expert knowledge.

E. Introducing Obfuscation

Harawa et al. [8] describe JavaScript obfuscation by six elements:

  • Identifier reassignment or randomization
  • Block randomization
  • White space and comment randomization
  • String encoding
  • String splitting
  • Integer obfuscation

Further, Kittilsen [1] documented a JavaScript feature vector which states the following functions as potentially malicious: [function, evallength, maxstring, stringcount, replace, substring, eval, fromCharCode]. Even though his confusion matrix shows good results, there are some problems with evaluating these as-is: such functions are usually obfuscated. The following is an example from sample SHA256: d3874cf113fa6b43e7f6e2c438bd500edea5cae7901e2bf921b9d0d2bf081201:

if((String+'').substr(1,4)==='unct'){e="".indexOf;}c='var _l1="4c206f5783eb9d;pnwAy()utio{.VsSg',h<+I}*/DkR%x-W[]mCj^?:LBKQYEUqFM';l='l';e=e()[((2+3)?'e'+'v':"")+"a"+l];s=[];a='pus'+'h';z=c's'+"ubstr";sa;z=c's'+"ubstr";sa;z=c['s'+"ubstr"] [...]e(s.join(""));}

The above example tells an interesting story about the attacker's awareness of complexity. With respect to Kittilsen's JavaScript feature vector, the above would yield the following result: [0,x,x,x,0,0,0,0] (considerable results on the second to fourth features, plus one count if we shorten substring to substr). In other words, the features are to be found in the embedded, obfuscated JavaScript, but not in clear text. When it comes to evallength, maxstring and stringcount, we will return to those later in the paper.

Deobfuscated, the script would look like:

var _l1="[...]";_l3=app;_l4=new Array();function _l5(){var _l6=_l3.viewerVersion.toString();_l6=_l6.replace('.','');while(_l6.length&4)_l6l='0';return parsetnt(_l6,10);function _l7(_l8,_l9){while(_l8.length+2&_l9)_l8l=_l8;return _l8.substring(0,_l9I2);function _t0(_t1){_t1=unescape(_t1);rote}a*=_t1.length+2;da*/ote=unescape('Du9090');spray=_l7(da*/ote,0k2000Rrote}a*);lok%hee=_t1lspray;lok%hee=_l7(lok%hee,524098);for(i=0; i & 400; ill)_l4xi-=lok%hee.substr(0,lok%hee.lengthR1)lda*/ote;;function _t2(_t1,len){while(_t1.length&len)_t1l=_t1;return _t1.substring(0,len);function _t3(_t1){ret='';for(i=0;i&_t1.length;il=2){b=_t1.substr(i,2);c=parsetnt(b,16);retl=String.froW[har[ode(c);;return ret;function _]i1(_t1,_t4){_t5='';for(_t6=0;_t6&_t1.length;_t6ll){_l9=_t4.length;_t7=_t1.char[odeAt(_t6);_t8=_t4.char[odeAt(_t6D_l9);_t5l=String.froW[har[ode(_t7m_t8);;return _t5;function _t9(_t6){_]0=_t6.toString(16);_]1=_]0.length;_t5=(_]1D2)C'0'l_]0j_]0;return _t5;function _]2(_t1){_t5='';for(_t6=0;_t6&_t1.length;_t6l=2){_t5l='Du';_t5l=_t9(_t1.char[odeAt(_t6l1));_t5l=_t9(_t1.char[odeAt(_t6));return _t5;function _]3(){_]4=_l5();if(_]4&9000){_]5='oluAS]ggg*pu^4?:IIIIIwAAAA?AAAAAAAAAAAALAAAAAAAAfhaASiAgBA98Kt?:';_]6=_l1;_]7=_t3(_]6);else{_]5='*?lAS]iLhKp9fo?:IIIIIwAAAA?AAAAAAAAAAAALAAAAAAAABk[ASiAgBAIfK4?:';_]6=_l2;_]7=_t3(_]6);_]8='SQ*YA}ggAA??';_]9=_t2('LQE?',10984);_ll0='LLcAAAK}AAKAAAAwtAAAALK}AAKAAAA?AAAAAwK}AAKAAAA?AAAA?gK}AAKAAAA?AAAAKLKKAAKAAAAtAAAAEwKKAAKAAAAwtAAAQAK}AUwAAA[StAAAAAAAAAAU}A]IIIII';_ll1=_]8l_]9l_ll0l_]5;_ll2=_]i1(_]7,'');if(_ll2.lengthD2)_ll2l=unescape('D00');_ll3=_]2(_ll2);with({*j_ll3;)_t0(*);Ywe123.rawValue=_ll1;_]3();

This, run through the simple Python JavaScript feature vector generator script (appendix 1), yields:

['function: 9', 'eval_length: x', 'max_string: x', 'stringcount: x', 'replace: 1', 'substring|substr: 4', 'eval: 0', 'fromCharCode: 0']
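A minimal sketch of the counting step behind such output (assumed behaviour for illustration; the actual generator script is found in the appendix):

```python
import re

SUSPICIOUS = ["function", "replace", "substring", "substr", "eval", "fromCharCode"]

def javascript_feature_vector(script: str) -> dict:
    """Count suspicious identifiers plus simple string statistics.

    Note: a plain count of "substr" also hits inside "substring",
    which is why the vector above reports them as substring|substr."""
    strings = re.findall(r"'[^']*'|\"[^\"]*\"", script)
    vector = {name: script.count(name) for name in SUSPICIOUS}
    vector["stringcount"] = len(strings)
    vector["max_string"] = max((len(s) - 2 for s in strings), default=0)
    return vector

v = javascript_feature_vector("eval(unescape('%u9090')); s.replace('a','b');")
```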

Harawa et al.'s six elements of JavaScript obfuscation are probably a better, or a necessary supplemental, approach compared to Kittilsen's work.

There is a notable difference between deobfuscation and detection of obfuscation techniques. The difference lies in the depth of insight: actually deobfuscating a JavaScript will reveal completely different code, while the obfuscation routines may be based on a generic obfuscator routine used by several threat agents. This is much like the issue of packers in regard to executables [23].

This section has shown the difficulty of balancing deobfuscation, for a more detailed coding-style analysis, against a less specific feature vector using abstract obfuscation detection.

Extracting and Analysing a PDF Feature Vector

A. Deobfuscation - Emerging Intentions

Usually the most pressing questions when an incident involving a PDF document occurs are: who did it, and what are his intentions? This is also a consideration when further evolving the PDF feature vector. The next figure shows a model describing three groups of threat agents, where one usually stands out. For instance, if a Stuxnet-scale attack [24] involving a PDF document is perceived, it will be associated with a cluster containing "group 1" entities.

While Al-Tharwa et al. [2] argue that there is no need for deobfuscation for classification, deobfuscation is an important step towards finding a distinct feature vector. The issue is that in most situations it isn't good enough to tell whether a document is malicious; who, what, where and how it was created matter as well. For being defined as valid digital evidence, a rich feature vector (in addition to the on-the-fly network hash sum) is part of the telling. The latter also becomes relevant for large quantities of data, where an analyst is not capable of manually analyzing and identifying hundreds to tens of thousands of PDF documents each day.

Fig. 4: The threat agent model. A model describing three groups of attackers. These are necessary to filter and detect in the collection phase

B. Technical Problems During Deobfuscation

Most JavaScript engines, such as Mozilla's SpiderMonkey [15], Google's V8 [16] and others, tend to be JavaScript libraries for browsers and miss some basic functionality present in Adobe Reader, which is the most used PDF reader. These engines are most often used for dynamic analysis of JavaScript and are a prerequisite for being able to completely deobfuscate JavaScript.

To prove the concepts of this article, a static Python feature vector generator engine based on a rewritten version of the Jsunpack-n [14] project is used. The application used in the paper provides a vector-based interpretation of the static script, meaning the script is not run dynamically.

Reliably detecting malicious PDF documents is a challenge due to the obfuscation routines often used. This makes it necessary to perform some kind of deobfuscation to reveal more functionality. Even if one manages to deobfuscate the script once, several more rounds may be needed before it is in clear text. This was a challenge not solvable within the scope of this article.

Due to parsing errors, under half of the Shadowserver 100k dataset was processed by the custom Jsunpack-n module.

C. Introducing Two Techniques: Feature Vector Inversion and Outer Loop Obfuscation Variable Computation

As has been documented so far in the paper, it is more or less impossible to completely automate a deobfuscation process for the PDF format. Obfuscation leaves many distinct characteristics though, so the threat agent on the other hand must be careful not to trigger anomaly alarms. There is a balance. This part of the article introduces two novel techniques, proposed applied to the JavaScript subvector to improve its reliability.

C.1. Outer Loop Obfuscation Variable Computation (OLOVC)

When the threat agent implements obfuscation, one of his weaknesses is that the obfuscation itself may be detected. When it comes to PDF documents, using JavaScript at all is a trigger. Now, the threat agent is probably using every trick in the book, meaning the six elements of JavaScript obfuscation [8]. The job of an analyst in such a matter will be to predict new obfuscation attempts and implement anomaly alerts using the extended PDF feature vector.

Throughout this paper we will name this technique "Outer Loop Obfuscation Variable Computation". The term "outer loop" most often refers to round zero, or the first of the deobfuscation routines. Variable computation is, as its name states, a matter of computing the original JavaScript variable. As we have seen, this may be done either by deobfuscating the script as a whole, including its near-impossible-to-automate complexity, or by using the original obfuscated data. We will have a further look at the latter option.

Take for instance this excerpt from the "Introducing Obfuscation"-section:


Harawa et al. defined the above obfuscation technique as "string splitting" (as seen in the section "Introducing Obfuscation"). The following two obfuscation-extraction regular expressions were previously stated in the author's bachelor's thesis [2]:



Keep the two statements above and the previous code excerpt in mind. When breaking down the above expressions, we introduce one more regular expression:


While searching for "substr" in the plain text will certainly fail, the above expression will match e.g.:


Recall Kittilsen's JavaScript feature vector: [function, eval_length, max_string, stringcount, replace, substring, eval, fromCharCode]. If extended by the above techniques, the results are somewhat different.

Without string splitting detection:

['function: 9', 'eval_length: x', 'max_string: 10849', 'stringcount: 1', 'replace: 1', 'substring|substr: 4', 'eval: 0', 'fromCharCode: 0']

With outer loop obfuscation variable computation:

['function: 0', 'eval_length: x', 'max_string: 67', 'stringcount: 2', 'replace: 0', 'substring: 0', 'substr: 3663', 'eval: 1', 'fromCharCode: 0']

Additionally, rewriting and extending Kittilsen's feature vector with several other typically suspicious functions should give preferable results: [max_string, stringcount, function, replace, substring, substr, eval, fromCharCode, indexof, push, unescape, split, join, sort, length, concat]
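The split-string-tolerant matching described above can be sketched as follows (a hypothetical reconstruction of the idea, not the exact expressions from [2]):

```python
import re

def split_tolerant_pattern(word: str) -> "re.Pattern":
    """Build a regex matching `word` even when split across string
    concatenations, e.g. 's'+"ubstr"; a sketch of the OLOVC idea."""
    # allow an optional quote, '+', and quote between consecutive letters
    gap = r"['\"]?\s*\+?\s*['\"]?"
    return re.compile(gap.join(re.escape(c) for c in word))

SUBSTR = split_tolerant_pattern("substr")
found = [bool(SUBSTR.search(s)) for s in (
    "z=c['s'+\"ubstr\"]",   # split-string obfuscated form
    "x.substr(0, 2)",       # plain form
    "no match here",
)]
```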

This gives the following results for two random, but related, samples:

[SHA256:5a61a0d5b0edecfb58952572addc06f2de60fcb99a21988394926ced4bbc8d1b]:{'function': 0, 'sort': 0, 'unescape': 0, 'indexof': 0, 'max_string': 10849, 'stringcount': 2, 'replace': 0, 'substring': 0, 'substr': 1, 'length': 1, 'split': 2, 'eval': 0, 'push': 0, 'join': 1, 'concat': 0, 'fromCharCode': 0}

[SHA256:d3874cf113fa6b43e7f6e2c438bd500edea5cae7901e2bf921b9d0d2bf081201]:{'function': 0, 'sort': 0, 'unescape': 0, 'indexof': 0, 'max_string': 67, 'stringcount': 1, 'replace': 0, 'substring': 0, 'substr': 3663, 'length': 0, 'split': 0, 'eval': 0, 'push': 1, 'join': 1, 'concat': 0, 'fromCharCode': 0}

It may perhaps not need a comment, but in the above results we see that two types of elements in the feature vector stand out: max_string and two of the suspicious functions.

Summarized, "Outer Loop Obfuscation Variable Computation" may be used to, at least partially, defeat the malware author's obfuscation attempts. By running the somewhat complex regular expressions against known malicious obfuscation routines, the implementation results on the 100,000-document PDF dataset may be seen in the following table:
Dataset generalization by "outer loop obfuscation variable computation".
Dataset aggregated by counting JavaScript variables and functions, OLOVC applied (due to errors in Jsunpack-n, the total number of entities calculated is 42736).

Word            Count
function          651
sort             7579
unescape            4
toLowerCase         1
indexof             8
max_string      42346
stringcount     41979
replace            70
substring          91
substr          38952
length           1512
split            9621
eval               77
push              260
join               91
inverse_vector  41423
concat             86
fromCharCode       45

From the counts in the table above, it is shown that the selected feature vector has several very interesting features. On a side note: even though some features have larger quantities than others, this is not necessarily the measure of how good a feature is; such is especially the case with the inverse vector, which we will become more familiar with in the next section. Also, as previously mentioned, it is interesting to see the composition of multiple features to determine the origin of the script (or the script style, if you'd like). The aggregation script is attached in appendix 2.

The "Outer Loop Obfuscation Variable Computation" will require a notable amount of computational resources in high-quantity networks due to the high workload. In a way this is unavoidable, since the threat agent's objective of running client-side scripts is to stress the resources of such systems.

Fig. 5: Illustration of Computational Complexity. The illustration shows the computational load on a network sensor in regard to different obfuscation techniques

C.2. Feature Vector Inversion

Threat agents go a long way in evading detection algorithms. The following thought is derived from a common misconception in database security:

A group of ten persons, whose names are not to be revealed, is listed among a couple of thousand entries in an organization's LDAP directory. The group, let us name it X, is not to be revealed and is therefore not named in the department field.

While the public may not search and filter directly on the department name X, an indirect search would successfully reveal the group, since the ten persons are the only ones not associated with a department.

The concept of searching indirectly may be applied to evaluating JavaScript in PDF documents as well. We might start off with some of the identifiers expected in benign JavaScript documents:


The above is found by expert knowledge to be the variables and functions probably used in a benign JavaScript or other object. Many of these functions are used in interactive PDF documents, e.g. providing print buttons.

A weight is added to each clear-text function/variable. After counting the words in the document, a summarizing variable named the inverted feature vector gives an integer. The higher the integer, the higher the probability of the JavaScript being benign.
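A sketch of this weighting (the whitelist and weights below are hypothetical stand-ins; the real list is built from expert knowledge as described above):

```python
# Hypothetical clear-text whitelist with a weight per identifier.
BENIGN_WEIGHTS = {"print": 3, "getField": 3, "this": 1, "value": 1}

def inverse_feature_vector(script: str) -> int:
    """Sum the weights of clear-text benign identifiers found in the
    script; the higher the integer, the more likely the script is benign."""
    return sum(w * script.count(name) for name, w in BENIGN_WEIGHTS.items())

interactive = inverse_feature_vector('this.getField("Button1"); this.print();')
obfuscated = inverse_feature_vector('e=eval;e(unescape("%u9090"));')
```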

The inverse feature vector may be used as a signature, and a whitelist indication database may be built from datasets. In the 100k malicious dataset, the statistics showed that 41423 out of 42475 files had more than one occurrence of a known benign variable. This might seem like a weak feature, but the quantity is not the issue here; it is the weight of each variable. So: one may say that the higher the inverse vector, the more likely it is that the PDF or JavaScript is benign. To clarify, the next table shows the instances fragmented by weight:
Inverse vector separated by weight interval, the Shadowserver 100k dataset. The table shows that most malicious PDF files in the 100k Shadowserver dataset have low-weighted scores on the inverted vector as a measure of how benign the scripts are.

Weight interval   Instances   Instance percentage
<10                   15232   35.6%
10-19                 26852   62.8%
20-29                   136   ~0%
30-39                   148   ~0%
40-49                    87   ~0%
50-59                    28   ~0%
>60                     253   ~0%
Total                 42736   -

The inverse vector may as well be seen as a measure of the likelihood that the script is obfuscated. A quick look at the table shows that the characteristics of obfuscation are found in most PDF documents in the Shadowserver 100k dataset.

Even though this part of the vector should be seen as an indication, analysts should be aware that threat agents may adapt to the detection technique and insert clear-text variables, such as the ones listed above, alongside their malicious JavaScript. This would function as a primitive feature vector inversion jammer. In other words, the inverse vector should be seen in context with the other items of the JavaScript feature vector. Further, the concept should be evolved to avoid such evasion. One technique is to segment the code before analyzing it (giving each code segment a score and finally generating an overall probability score), making it more difficult for the threat agent to utilize noise in his obfuscation.
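The proposed segmentation could be sketched as follows (hypothetical segment size and weights): benign clear-text noise inserted as a jammer inflates only the segments it occupies, leaving the obfuscated region with a low, suspicious score of its own:

```python
# Hypothetical clear-text whitelist with a weight per identifier.
BENIGN = {"print": 3, "this": 1}

def segment_scores(script: str, size: int = 20):
    """Score fixed-size code segments separately, so that a jammer
    cannot mask an obfuscated region elsewhere in the script."""
    score = lambda s: sum(w * s.count(n) for n, w in BENIGN.items())
    return [score(script[i:i + size]) for i in range(0, len(script), size)]

# benign words prepended as a jammer, obfuscated payload at the end
scores = segment_scores("this.print(); " * 3 + 'e=eval;e(unescape(x));')
```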

D. Clustering

Experience shows that in practically oriented environments, security analysis is, at least partially, done manually. That is, detection is based on indicators or anomalies, and the analysis of the detection results is performed manually by an analyst. Though this may be the approach resulting in the fewest false positives, it is overwhelming when all potentially malicious PDF documents in a larger organization must be analyzed. The 100k PDF dataset used in this paper is evidence of this. So, how is it possible to automatically detect the interesting parts of the 100k PDF dataset? This question leads to the concept of data mining.

The definition of data mining is the transformation of data into "meaningful patterns and rules".

Michael Abernethy at IBM developerWorks20 covers data mining quite extensively.

D.1. A Narrow Experiment and Results

In this paper the goal is to achieve a view of the dataset through what is named "undirected" data mining: trying to find patterns or rules in existing data. This is achieved through the feature vector previously presented.

Up until now this paper has discussed how to generate a satisfactory feature vector and what makes the measure of similarity. Let us do an experiment using WEKA (Waikato Environment for Knowledge Analysis) to analyze our feature vector.

Appendix 3 describes the ARFF format derived from our feature vector, using the two previously presented feature vectors (SHA256: 5a61a0d5b0edecfb58952572addc06f2de60fcb99a21988394926ced4bbc8d1b, d3874cf113fa6b43e7f6e2c438bd500edea5cae7901e2bf921b9d0d2bf081201) and a random selection of 2587 parseable PDF documents from the dataset.

In this experiment, the feature vectors were produced from 200 random samples of the 100k dataset. Interestingly, the subdataset they were loaded from originally contained 6214 samples, while our application only handled the decoding of under half. The feature vector was extracted in CSV format, converted by the following WEKA Java class and loaded into WEKA:

java -classpath /Applications/weka-3-6-6.app/Contents/Resources/Java/weka.jar weka.core.converters.CSVLoader dataset.csv

In the WEKA preprocessing, the results may be visualized:

Fig. 6: Results 1; PDF Feature Vector Distribution. A model showing the PDF feature vector object distribution using the 2587 parsable PDF documents

D.2. The complete dataset

Next, loading the complete feature vector dataset consisting of 42736 entities showed interesting results when clustering.

Fig. 7: Stringcount vs. anomalies in the inverse_vector, using the k-means algorithm with k=5. Medium jitter to emphasize the clusters

The cluster process above also enables looking at the anomalies where the inverse vector is high. For instance, for entity 9724 (the highest one on the Y-axis) the inverse vector is 21510, which is a very clear anomaly compared to the rest of the clusters (the distance is large). This should encourage a closer look at the file, based on its hash.

The Shadowserver 100k ARFF dataset will be further evolved and may be found at the project GitHub page25.

E. Logging and Interpreting Errors

Again and again while analyzing the 100k dataset, the interpreter ran into parsing errors. Bad code, one may say, but the fact is that threat agents are adapting their code to evade known tools and frameworks. An example of this is a recent bug [21] in Stevens' PDF parser, where empty PDF objects in fact created an exception in the application.

So, what does this have to do with this paper? Creative threat agents, writing malicious code that avoids the detection routines, can never be ruled out. This makes an important point: the application implemented should use strict deobfuscation and interpretation routines. When an error occurs, which will happen sooner or later, the file should be traceable and manually analyzed. This in turn should lead to an adaptation of the application. Where the routines fail will also be a characteristic of the threat agent: what part of the detection routines does he try to evade? E.g., in the 100k dataset an error in the ASCII85 filter occurred. The parsing error made the parser module not output a feature vector, and it was detected by error monitoring in log files.

Discussion and Conclusions

In regard to being used standalone as evidence, the feature vector will have its limitations; especially since it is hard to connect it to an event, it should be considered circumstantial.

The PDF and ECMA standards are complex and difficult to interpret, especially when it comes to automation. As has been shown in this article, a really hard problem is dynamically and generically executing JavaScript for deobfuscation. This shows even within Adobe Reader itself, where e.g. Adobe Reader X uses SpiderMonkey 1.8, while previous, more prevalent versions use SpiderMonkey 1.7. This often resulted in parsing errors, and in turn it will potentially cause a larger error rate in next-generation intrusion detection systems.

It has been shown that a static analysis through a Jsunpack-n modification recovers good enough round-zero data, from a little less than half of the Shadowserver 100k dataset, to generate a characteristic of each file. The results were somewhat disappointing in regard to the extensive parsing errors. Parsing optimization and error correction, making the script more robust and reliable, should be covered in a separate report. Despite the latter, a good foundation and enough data were given to give a clue of what to expect from the extended PDF feature vector. Also, the inverse vector, with its weighting, gives an individual score to each document, making it exceptionally promising for further research.

In regard to OLOVC, a certain enhancement would be to combine it with Franke and Petrović's work "Improving the efficiency of digital forensic search by means of constrained edit distance". Their concept seems quite promising and might provide valuable input to OLOVC.

The dataset used in this article may contain certain flaws in its scientific foundation. No outright dataset flaws have been seen throughout this article, but there are indications that some data originates from the same source. The reason is most probably that the dataset was collected over three continuous days. As for the behaviour of malware, it is known that certain attacks, such as drive-by attacks, have peaks in their spread as a function of time. It is therefore natural to assume that there are larger occurrences of PDF documents originating from the same threat agent. On the other side, in further research, this could serve as a measure of the algorithms' ability to group the data.

The Shadowserver 100k dataset only contains distinct files. It would be interesting to recollect a similar dataset with non-distinct hash-entries, and to cluster it by fuzzy hashing as well.

Even though clustering is mentioned in the last part of this article, further extensive research should be done to completely explore the potential of the current feature vector. In other words, the scope of the article permitted only a manual selection of a feature vector and a more or less defined measure of similarity through the extended PDF feature vector.

The project has a maintained GitHub page, as introduced in the last section. This page should encourage further development of the extended PDF feature vector.

If you'd like, please have a look at the GuC Testimon Forensic Laboratory.


Tommy is an analyst and incident handler with more than seven years of experience from government and private industry. He holds an M.Sc. in Digital Forensics and a B.Tech. in Information Security.