#### Indicators: Our Worst Nightmare?

##### February 21, 2018
ioc communications storage format standards

Over the past several years, cyber security professionals have been working on optimising the sharing of information and knowledge. Many of the recent efforts have focused on intelligence- and data-driven teams. Today, many of these discussions end up revolving around something related to the STIX format.

Don’t use a lot where a little will do. (Unknown origin)

This post offers a perspective on the potential of today’s standards-oriented approach to documenting indicator sets related to cyber security threat actors and incidents. It turns out we have a longer way to go than expected.

For the purpose of this article, an indicator is a characteristic or evidence of something unwanted, or hostile if you’d like. I like to refer to the military term “Indicators & Warnings” in this regard. In other words, an indicator isn’t necessarily limited to the cyber domain alone either. Physical security could be in an even worse condition than cyber security when it comes to expressing threat indicators. I’ll leave the cross-domain discussion for another time.

## Up Until Today

Multiple standards have evolved and disappeared, and one that I have been in favor of previously is the OpenIOC 1.1 standard. However, times are changing, and so are the terminology and breadth of how we are able to express the intrusion sets.

Even though OpenIOC was a very good start, and still is as far as I am concerned, it has been far surpassed in popularity by Cybox and ultimately STIX.

STIX is a container, a quite verbose XML format (which is turning into JSON in 2.0). Cybox is the artefact format (example here); for malware you have MAEC, and so on. Basically it’s a set of collaborating projects.

This all sounds good, right? Not quite. Have a look at the OpenIOC to STIX repository on GitHub and you will find that stuxnet.stix.xml is 202 lines of XML code for 18 atomic indicators. The OpenIOC version, on the other hand, is 91 lines, and that is a verbose format as well. In fact, the overhead ratio (lines of markup per atomic indicator) of the STIX file is about 10:1, while OpenIOC is about 5:1.
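The ratios are easy to check. The snippet below is only a sketch of the arithmetic, assuming that "overhead ratio" means lines of markup per atomic indicator; the function name `overhead_ratio` is mine and not part of any of the standards.

```python
def overhead_ratio(lines: int, indicators: int) -> float:
    """Lines of markup spent per atomic indicator (assumed definition)."""
    return lines / indicators


# Line and indicator counts quoted above for the Stuxnet example.
stix = overhead_ratio(202, 18)
openioc = overhead_ratio(91, 18)

print(f"STIX: {stix:.1f}:1, OpenIOC: {openioc:.1f}:1")
```

With the counts from the repository, STIX lands slightly above the rounded 10:1 figure and OpenIOC right at 5:1.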

To add to the mind-blowing inefficiency, I have yet to see, on a regular basis, complex and nested expressions of an actor or a campaign in the STIX format.

Before you continue, do a simple Google search for “STIX editor” and “cybox editor”. Do it now, and while you are at it, google for “openioc editor” as well. Hello guys, these standards have been going around for many years. So, how should we interpret that there aren’t any user-friendly approaches to using them? The closest I’ve come is through MISP, and that, generally speaking, does not use these standards for its internal workings either. This one on the MISP GitHub issue tracker says it all: STIX 2.x support (MISP).

I’m sure that some may disagree with the above statements, calling out the infancy of these formats. However, they can’t be said to be new standards anymore. They are just too complex. One example is the graph-oriented relations built into the formats. Why not just let a graph take care of these?

This is not just a post to establish the current state. How would a better approach look?

## What Is The Problem to Be Solved?

Back to where things have gone since the OpenIOC 1.1/atomic indicator days. The most promising addition, in my opinion, is the MITRE PRE-ATT&CK and ATT&CK frameworks. The two frameworks build on a less structured approach than seen for atomic indicators (Lockheed’s Kill-Chain). The latter can be viewed in the form of the Intelligence Pyramid shown below.

The Intelligence Pyramid’s abstraction levels can be mapped against what it is supposed to support when it comes to indicators like the following:

| Level of abstraction  | Supports    |
|-----------------------|-------------|
| Behavior              | Knowledge   |
| Derived               | Information |
| Atomic                | Data        |


The purpose of the abstraction layer is in this case to support assessments and measures at the corresponding contextual level. For instance, a technical report tailored to an Incident Response Team (IRT) generally concerns Derived and Atomic indicators, while an intelligence report would usually be based on the Behavior level.

Having covered the abstraction layers, we can recognize that OpenIOC (or Cybox and MAEC) covers the bottom layers of abstraction, while MITRE (PRE-)ATT&CK in its current form is mostly about the Behavior level.

For Derived indicators there are primarily two well-established, seasoned and successful formats that have become standards through their widespread usage. This is, among other things, because the indicators and rules are effective, rapid, easy and pleasing to write.

First we have Snort/Suricata rules and Lua scripts, which were designed for network detection. For Snort/Suricata I’d say that most of the metadata detected today is probably expressible in OpenIOC (except for the magic that can be done with Lua). Second, there is the Yara format, which has become known for its applicability against malicious files. The simplicity of both formats is obviously due to their power of expression. Thus, I’d say that the Yara and Snort/Suricata formats are the ones to look to when it comes to content and pattern detection.

Indicators should be easy and pleasing to write.

To summarize the above, each of the formats can be mapped to an abstraction level:

| Level of abstraction  | Formats            |
|-----------------------|--------------------|
| Behavior              | MITRE (PRE-)ATT&CK |
| Derived               | Suricata+Lua, Yara |
| Atomic                | OpenIOC 1.1        |


Going through my notes on how I document my own indicators, I also found that I use the CVE database, datetimes, confidence, analyst comments for context, and classification as well (the latter being irrelevant for detection).

One of the major problems is: everything that is currently out there breaks the analyst workflow. You either need to log in to some fancy web interface, edit XML files (god forbid) or just jot down everything in a text file. The text file seems to be the natural fallback in almost any instance. I have even attempted to use the very good initiative by Yahoo, PyIOCe, and Mandiant’s long-forgotten IOC Editor. These projects have both lost traction, as has almost every other initiative in this space. So that is right folks, the text editor is still the preferred tool in 2018, and let’s face it: indicators should be pleasing to design and create - like putting your signature to an incident or a job well done.

an indicator set should be for humans and machines by humans

After all, the human is the one that is going to have to deal with the indicator sets at some point, and we are the slowest link. So let us not slow ourselves down more than necessary. At this point I would like to propose the golden rule of creating golden rules: an indicator set should be for humans and machines by humans.

You may also have noticed that when all these standards are suddenly combined into one standard, they become less user-friendly. In other words, let us rather return to our common *NIX roots, where each tool had a limited set of tasks.

Graphs are essential when writing indicators. Almost everything in the world around us can be modelled as a network, and infiltration and persistence in cyberspace are no exception. Thus, an indicator format needs to be representable in a graph, and guess what? Almost everything is, as long as it maintains some kind of structure.

For graphs there are two ways of going about the problem:

1) Implement the graph in the format
2) Make sure that you have a good graph backend and an automatable and traversable format available

For option 1, the graph in the format will increase the complexity significantly. Option 2 results in the opposite, but that does not mean that the format can’t be converted to a graph. To make a long discussion short, this is what we have graph databases for, such as JanusGraph.

## A Conceptual View

Summarizing the above, I’d like to propose the following requirements for indicator formats:

1) Indicator sets should be easy and inviting to create
2) You should be able to start writing at any time, when you need it
3) Unnecessary complexity should be avoided
4) The format should be human readable and editable
5) A machine should be able to interpret the format
6) Indicator sets should be graph compatible

With a basis in this article, I believe that the best approach is to provide a basic plain text format specification that inherits from the OpenIOC 1.1 and MITRE frameworks and references other formats where necessary.

Let us imagine that we found an IP address in one situation. The IP address was connected to a domain that we found using passive DNS. Further, it was found that a specific file was associated with that domain through a Twitter comment. Representing the given information in its purest (readable) form looks like the following:

// a test file
class                  tlp:white
date                   2018/02/18
ipv4          low      188.226.130.166
  domain      med      secdiary.com
  technique            PRE-T1146
    filename  med      some_filename.docx
    comment            found in open sources


To recap some of the previous points: the above format is simple, and it can be written at any time based on knowledge of well-known standards. The best of it all is that if you are heavily invested in specific formats, it can be converted to them all using a simple interpreter traversing the format.

Further, such a format is easily converted into a tree and can be loaded into a graph for traversing and automated assessments. Each confidence value can be quantified (low=0.33, med=0.66, high=1.0). That said, simplicity in this case equals actionable indicators.
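To make the "simple interpreter" idea concrete, here is a minimal sketch of a parser for the format. It assumes that nesting is written as two spaces of indentation per level, that `//` starts a comment line, and that a confidence keyword (low, med, high) is optional. The names `Indicator` and `parse`, and the shortened sample, are mine and purely illustrative.

```python
from typing import List, NamedTuple, Optional

# Quantified confidence levels as given in the text.
CONFIDENCE = {"low": 0.33, "med": 0.66, "high": 1.0}


class Indicator(NamedTuple):
    depth: int                   # nesting level, derived from indentation
    kind: str                    # e.g. "ipv4", "domain", "comment"
    confidence: Optional[float]  # None when no confidence keyword is present
    value: str


def parse(text: str, indent: int = 2) -> List[Indicator]:
    """Parse indicator lines; `indent` spaces per level is an assumption."""
    result = []
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("//"):
            continue  # skip blank lines and // comment lines
        depth = (len(line) - len(line.lstrip(" "))) // indent
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue  # a keyword without a value carries no indicator
        kind, rest = parts
        first, _, remainder = rest.partition(" ")
        if first in CONFIDENCE:
            result.append(Indicator(depth, kind, CONFIDENCE[first], remainder.strip()))
        else:
            result.append(Indicator(depth, kind, None, rest.strip()))
    return result


SAMPLE = """\
// a test file
class                  tlp:white
date                   2018/02/18
ipv4          low      188.226.130.166
  domain      med      secdiary.com
    filename  med      some_filename.docx
"""

for item in parse(SAMPLE):
    print(item)
```

A free-text value that happens to start with low, med or high would be misread as a confidence here, so a real interpreter would need a stricter column rule.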

| v: 188.226.130.166 (0.33)    | match    |
| e                            |          |
| v: secdiary.com (0.66)       | no match | (0.33+0.66)/2=0.5
| e                            |          |
| v: some_filename.docx (0.66) | match    |
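One reading of the figure above is that the score of a path is the average confidence of the vertices that matched. The sketch below works under that assumption only; `match_score` is a hypothetical helper, not part of any existing tool.

```python
def match_score(vertices):
    """Average the confidence of matched vertices (assumed scoring rule)."""
    matched = [confidence for _value, confidence, hit in vertices if hit]
    return sum(matched) / len(matched) if matched else 0.0


# (value, quantified confidence, matched?) for the path in the figure.
path = [
    ("188.226.130.166", 0.33, True),
    ("secdiary.com", 0.66, False),
    ("some_filename.docx", 0.66, True),
]

score = match_score(path)
print(score)
```

For the example path this comes out at roughly 0.5, the value shown in the figure.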


On networks versus hierarchies: a drawback of the latter, as mentioned in the previous section, is that a hierarchy cannot directly express e.g. a domain that is connected to several different vertices. A practical solution goes as follows:

ipv4      low    188.226.130.166
  domain  med    secdiary.com
domain    low    secdiary.com
  ipv4    low    128.199.56.232


The graph receiving the above indicator file should identify the domain as being a unique entity and link the two IP addresses to the same domain:

| v: 188.226.130.166 (0.33)
| e: 0.5
| v: secdiary.com (0.5)
| e: 0.33
| v: 128.199.56.232 (0.33)
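The numbers in the figure are consistent with two simple rules: a vertex that is reported more than once gets the average of its confidences, and an edge gets the average of its two endpoints. That is my inference from the figure rather than a stated rule, so treat the following as a sketch; `build_graph` is a hypothetical name.

```python
from collections import defaultdict


def build_graph(pairs):
    """Merge duplicate vertices and weight edges (assumed averaging rules).

    pairs: iterable of ((value_a, conf_a), (value_b, conf_b)) parent/child links.
    """
    seen = defaultdict(list)  # value -> every confidence it was reported with
    edges = {}
    for (a, conf_a), (b, conf_b) in pairs:
        seen[a].append(conf_a)
        seen[b].append(conf_b)
        edges[(a, b)] = (conf_a + conf_b) / 2  # edge weight: mean of endpoints
    vertices = {value: sum(c) / len(c) for value, c in seen.items()}
    return vertices, edges


# The two parent/child links from the indicator file above.
pairs = [
    (("188.226.130.166", 0.33), ("secdiary.com", 0.66)),
    (("secdiary.com", 0.33), ("128.199.56.232", 0.33)),
]

vertices, edges = build_graph(pairs)
for value, confidence in vertices.items():
    print(value, round(confidence, 2))
```

The domain ends up as a single vertex linked to both IP addresses, which is the behaviour the receiving graph is expected to have.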


As for how a machine would structure the indicator format in practice, consider the following pseudocode:

# each entry: (depth, type, confidence, value)
indicators = [(0,'ipv4','low','188.226.130.166'),...]
_tree = tree(root_node)
for indicator in indicators:
    depth = indicator[0]              # nesting level decides the parent
    _tree.insert(indicator, depth)


Now that we have the tree represented in code, it is trivially traversable when loading it into some graph:

method load_indicators(node, depth):
    graph.insert(node.parent, edge_label, node)   # link the node to its parent
    for child in node.children:
        load_indicators(child, depth + 1)         # recurse one level down
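The two pseudocode fragments can be turned into a small runnable sketch. The `Node` class, `build_tree`, and the use of a plain edge list in place of a real graph backend are my assumptions; the second and third tuples are illustrative and follow the earlier example.

```python
class Node:
    def __init__(self, indicator=None, parent=None):
        self.indicator = indicator  # (depth, type, confidence, value) or None for the root
        self.parent = parent
        self.children = []


def build_tree(indicators):
    """Build a tree from (depth, type, confidence, value) tuples in file order."""
    root = Node()
    latest = [root]  # latest[d] is the most recent node one level above depth d
    for indicator in indicators:
        depth = indicator[0]
        node = Node(indicator, parent=latest[depth])
        latest[depth].children.append(node)
        latest[depth + 1:] = [node]  # forget deeper levels, remember this node
    return root


def load_indicators(node, edges):
    """Depth-first walk that records a (parent, child) edge for every link."""
    for child in node.children:
        if node.indicator is not None:  # the synthetic root is not a vertex
            edges.append((node.indicator[3], child.indicator[3]))
        load_indicators(child, edges)
    return edges


indicators = [
    (0, "ipv4", "low", "188.226.130.166"),
    (1, "domain", "med", "secdiary.com"),
    (2, "filename", "med", "some_filename.docx"),
]

edges = load_indicators(build_tree(indicators), [])
for parent, child in edges:
    print(parent, "->", child)
```

The sketch expects depths to grow by at most one level at a time; a line that skips a level would raise an IndexError and should be rejected by the parser beforehand.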



## Summary

Hopefully I did not kill too many kittens with this post. You may or may not agree, but I do believe that most analysts share at least parts of my purist views on the matter.

We are currently too focused on supporting standards and having everyone use as few of them as possible. I believe that energy is better spent on becoming more consistent in the way we document, and on actually exchanging more developed indicator sets than the MD5 hash and domain lists that are typically shared today (“not looking at these kinds of files at all” - even though it’s not the worst I’ve seen: MAR-10135536-F_WHITE_stix.xml).

In the conceptual part of this article I propose a simple yet effective way of representing indicators in a practical manner. Frankly, it is even too simple to be novel. It is just consistent and intuitive.

PS! For the STIX example above, have a look at the following to get a feel for the actual content of the file (I used one of the mentioned specimens to show the point):

class             tlp:white
date              2018/02/05

sha1          high    4efb9c09d7bffb2f64fc6fe2519ea85378756195
  comment             NCCIC:Observable-724f9bfe-1392-456e-8d9b-c143af15f8d4
  comment             did not convert all attributes
  compiler            Microsoft Visual C++ 6.0
  md5         high    3dae0dc356c2b217a452b477c4b1db06
  date                2016-01-29T09:21:46Z
  entropy     med     6.65226708818
  #sections   low     5
  intname     med     ProxyDll.dll


The original document states those same indicators in no fewer than 119 lines, with an overhead ratio of about 1:5 (it looks completely insane):

<stix:Observables cybox_major_version="2" cybox_minor_version="1" cybox_update_version="0">
<cybox:Observable id="NCCIC:Observable-724f9bfe-1392-456e-8d9b-c143af15f8d4">
<cybox:Object id="NCCIC:WinExecutableFile-bb9e38d1-d91c-4727-ab6a-514ecc0c02a2">
<cybox:Properties xsi:type="WinExecutableFileObj:WindowsExecutableFileObjectType">
<FileObj:File_Name>3DAE0DC356C2B217A452B477C4B1DB06</FileObj:File_Name>
<FileObj:Size_In_Bytes>336073</FileObj:Size_In_Bytes>
<FileObj:File_Format>PE32 executable (DLL) (console) Intel 80386, for MS Windows</FileObj:File_Format>
<FileObj:Hashes>
<cyboxCommon:Hash>
<cyboxCommon:Type xsi:type="cyboxVocabs:HashNameVocab-1.0">MD5</cyboxCommon:Type>
<cyboxCommon:Simple_Hash_Value>3dae0dc356c2b217a452b477c4b1db06</cyboxCommon:Simple_Hash_Value>
</cyboxCommon:Hash>
<cyboxCommon:Hash>
<cyboxCommon:Type xsi:type="cyboxVocabs:HashNameVocab-1.0">SHA1</cyboxCommon:Type>
<cyboxCommon:Simple_Hash_Value>4efb9c09d7bffb2f64fc6fe2519ea85378756195</cyboxCommon:Simple_Hash_Value>
</cyboxCommon:Hash>
<cyboxCommon:Hash>
<cyboxCommon:Type xsi:type="cyboxVocabs:HashNameVocab-1.0">SHA256</cyboxCommon:Type>
<cyboxCommon:Simple_Hash_Value>8acfe8ba294ebb81402f37aa094cca8f914792b9171bc62e758a3bbefafb6e02</cyboxCommon:Simple_Hash_Value>
</cyboxCommon:Hash>
<cyboxCommon:Hash>
<cyboxCommon:Type xsi:type="cyboxVocabs:HashNameVocab-1.0">SHA512</cyboxCommon:Type>
</cyboxCommon:Hash>
<cyboxCommon:Hash>
<cyboxCommon:Type xsi:type="cyboxVocabs:HashNameVocab-1.0">SSDEEP</cyboxCommon:Type>
<cyboxCommon:Simple_Hash_Value>3072:jUdidTaC07zIQt9xSx1pYxHvQY06emquSYttxlxep0xnC:jyi1XCzcbpYdvQ2e9g3kp01C</cyboxCommon:Simple_Hash_Value>
</cyboxCommon:Hash>
</FileObj:Hashes>
<FileObj:Packer_List>
<FileObj:Packer>
<FileObj:Name>Microsoft Visual C++ 6.0</FileObj:Name>
</FileObj:Packer>
<FileObj:Packer>
<FileObj:Name>Microsoft Visual C++ 6.0 DLL (Debug)</FileObj:Name>
</FileObj:Packer>
</FileObj:Packer_List>
<FileObj:Peak_Entropy>6.65226708818</FileObj:Peak_Entropy>
<WinExecutableFileObj:Number_Of_Sections>5</WinExecutableFileObj:Number_Of_Sections>
<WinExecutableFileObj:Time_Date_Stamp>2016-01-29T09:21:46Z</WinExecutableFileObj:Time_Date_Stamp>
<WinExecutableFileObj:Hashes>
<cyboxCommon:Hash>
<cyboxCommon:Type xsi:type="cyboxVocabs:HashNameVocab-1.0">MD5</cyboxCommon:Type>
<cyboxCommon:Simple_Hash_Value>e14dca360e273ca75c52a4446cd39897</cyboxCommon:Simple_Hash_Value>
</cyboxCommon:Hash>
</WinExecutableFileObj:Hashes>
<WinExecutableFileObj:Entropy>
<WinExecutableFileObj:Value>0.672591739631</WinExecutableFileObj:Value>
</WinExecutableFileObj:Entropy>
<WinExecutableFileObj:Sections>
<WinExecutableFileObj:Section>
<WinExecutableFileObj:Name>.text</WinExecutableFileObj:Name>
<WinExecutableFileObj:Size_Of_Raw_Data>49152</WinExecutableFileObj:Size_Of_Raw_Data>
<WinExecutableFileObj:Entropy>
<WinExecutableFileObj:Value>6.41338619924</WinExecutableFileObj:Value>
</WinExecutableFileObj:Entropy>
<cyboxCommon:Hash>
<cyboxCommon:Type xsi:type="cyboxVocabs:HashNameVocab-1.0">MD5</cyboxCommon:Type>
<cyboxCommon:Simple_Hash_Value>076cdf2a2c0b721f0259de10578505a1</cyboxCommon:Simple_Hash_Value>
</cyboxCommon:Hash>
</WinExecutableFileObj:Section>
<WinExecutableFileObj:Section>
<WinExecutableFileObj:Name>.rdata</WinExecutableFileObj:Name>
<WinExecutableFileObj:Size_Of_Raw_Data>8192</WinExecutableFileObj:Size_Of_Raw_Data>
<WinExecutableFileObj:Entropy>
<WinExecutableFileObj:Value>3.293891672</WinExecutableFileObj:Value>
</WinExecutableFileObj:Entropy>
<cyboxCommon:Hash>
<cyboxCommon:Type xsi:type="cyboxVocabs:HashNameVocab-1.0">MD5</cyboxCommon:Type>
<cyboxCommon:Simple_Hash_Value>4a6af2b49d08dd42374deda5564c24ef</cyboxCommon:Simple_Hash_Value>
</cyboxCommon:Hash>
</WinExecutableFileObj:Section>
<WinExecutableFileObj:Section>
<WinExecutableFileObj:Name>.data</WinExecutableFileObj:Name>
<WinExecutableFileObj:Size_Of_Raw_Data>110592</WinExecutableFileObj:Size_Of_Raw_Data>
<WinExecutableFileObj:Entropy>
<WinExecutableFileObj:Value>6.78785911234</WinExecutableFileObj:Value>
</WinExecutableFileObj:Entropy>
<cyboxCommon:Hash>
<cyboxCommon:Type xsi:type="cyboxVocabs:HashNameVocab-1.0">MD5</cyboxCommon:Type>
<cyboxCommon:Simple_Hash_Value>c797dda9277ee1d5469683527955d77a</cyboxCommon:Simple_Hash_Value>
</cyboxCommon:Hash>
</WinExecutableFileObj:Section>
<WinExecutableFileObj:Section>
<WinExecutableFileObj:Name>.reloc</WinExecutableFileObj:Name>
<WinExecutableFileObj:Size_Of_Raw_Data>8192</WinExecutableFileObj:Size_Of_Raw_Data>
<WinExecutableFileObj:Entropy>
<WinExecutableFileObj:Value>3.46819043887</WinExecutableFileObj:Value>
</WinExecutableFileObj:Entropy>