From Maltego to a Distributed Graph Environment
While I am certainly a fan of Maltego, the lack of client APIs and scalable collaboration features in the Maltego client made me start looking for alternatives. I was actually in touch with Paterva to have a look at using the XMPP interface for integrating with my decision support applications. They weren't exactly forthcoming about opening up the AES implementation. I am a fan of neither non-transparency in cryptography nor the use of features for locking in customers. Anyway, this isn't a post about that. It is a post on how to start migrating your existing Maltego graphs to something more collaborative: the graph database Titan. Note that Titan doesn't come with a front-end other than the Dog House, which you get if you run Rexster on top, as I do. We will get back to that in a later post.
Titan is basically a graph database layer built on top of a variety of storage back-ends. To get started you can use the local storage back-end, but for this post we will use HBase. Further, Titan implements Tinkerpop Blueprints, a Java-based API for graph databases.
In this post I have been using:
- The development branch of Titan 0.5 from GitHub, with local modifications (pull request #539)
- Rexster Server 2.4.0
- Hortonworks Data Platform 2.0.6, Sandbox
Petter has previously had a look at the storage format in his Maltego importer, and figured out that the graph is stored in the GraphML XML format. You can easily verify this by unzipping a graph archive and looking inside the decompressed Graphs directory. An mtgx file has the following decompressed directory structure:
- Entities: Contains all entities used in the graph, in an XML MaltegoEntity
- Files: Files contained in the graph entities. Each file is referenced in an mtg:Property in the graph.
- Icons: All icons used for entity types.
- Graphs: Contains two files, typically Graph1.graphml and Graph1.properties. The first is the graph itself, and the latter is a property file containing metadata about the graph, such as the created date.
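To check the layout of your own archive, a minimal Python sketch like the following can group the members of an mtgx file by top-level directory. It relies on the archive being a plain zip file, as described above. The demo archive is built in memory, and the entity file name in it is a made-up placeholder, not something taken from a real Maltego export.

```python
import io
import zipfile
from collections import defaultdict


def list_mtgx(archive):
    """Group the members of an mtgx (zip) archive by top-level directory."""
    layout = defaultdict(list)
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.endswith("/"):
                continue  # skip explicit directory entries
            top, _, rest = name.partition("/")
            layout[top].append(rest)
    return dict(layout)


def build_demo_archive():
    """Build a tiny in-memory stand-in for a real mtgx file (hypothetical content)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("Graphs/Graph1.graphml", "<graphml/>")
        zf.writestr("Graphs/Graph1.properties", "placeholder")
        # Hypothetical entity file name, for illustration only.
        zf.writestr("Entities/maltego.Domain.entity", "<MaltegoEntity/>")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for directory, files in sorted(list_mtgx(build_demo_archive()).items()):
        print(directory, files)
```

For a real graph, pass the path to the mtgx file to list_mtgx instead of the demo archive.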
Importing Graph1.graphml Into a Titan Graph
So I'm going to cut to the chase here. To get started you will need to somehow convert the XML file containing the graph to the storage back-end. Guess what: the guys behind Tinkerpop have made sure to support it natively. The language we are talking about here is Gremlin, which is used for graph querying. You can inspect the other methods here. The two we are going to try first are the following (you will see later on that they won't work as expected):
- Graph.loadGraphML(String or File): load GraphML file into graph
- Graph.saveGraphML(String or File): save graph to a GraphML file
So when you have finished setting up Titan, which is left as an exercise for the reader, enter the following into a Gremlin shell:
conf = new BaseConfiguration();
conf.setProperty("storage.backend","hbase");
conf.setProperty("storage.hostname","sandbox.hortonworks.com");
conf.setProperty("storage.port","2181");
g = TitanFactory.open(conf);
You have now opened your online graph. In order to load your Maltego graph into it, use the following command:
Turns out it crashes and burns (NullPointer), so we will have to inspect the graph to figure out where it fails. If you try to load a Maltego graph without the following in the header:
<?xml version="1.1" encoding="UTF-8" standalone="no"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:y="http://www.yworks.com/xml/graphml" xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://www.yworks.com/xml/schema/graphml/1.1/ygraphml.xsd">
  <VersionInfo createdBy="Maltego Tungsten" subtitle="" version="126.96.36.19925"/>
  <key for="graphml" id="d0" yfiles.type="resources"/>
  <key for="port" id="d1" yfiles.type="portgraphics"/>
  <key for="port" id="d2" yfiles.type="portgeometry"/>
  <key for="port" id="d3" yfiles.type="portuserdata"/>
  <key attr.name="MaltegoEntity" for="node" id="d4"/>
  <key for="node" id="d5" yfiles.type="nodegraphics"/>
  <key attr.name="MaltegoLink" for="edge" id="d6"/>
  <key for="edge" id="d7" yfiles.type="edgegraphics"/>
  <graph edgedefault="directed" id="G">
  </graph>
  <data key="d0">
    <y:Resources/>
  </data>
</graphml>
It will take you pretty far. I just used the standard headers from the Gephi support pages. That took me a step further, and everything loaded successfully. Dog House graph:
Probably you'd be pretty happy that it imported successfully at this point, but it will become clear quite soon that everything is property-less. That is because of the custom schemas and XML namespaces used in Maltego. From what I could tell, it's the nested objects and the namespace that are the showstoppers.
To keep it a bit simpler, let's take a sample from the graph:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:y="http://www.yworks.com/xml/graphml" xsi:schemaLoction="ht$
  <key attr.name="MaltegoEntity" for="node" id="d4"/>
  <node id="n0">
    <data key="d4">
      <mtg:MaltegoEntity xmlns:mtg="http://maltego.paterva.com/xml/mtgx" id="981jsam13laj2" type="maltego.Domain">
        <mtg:Properties>
          <mtg:Property displayName="Domain Name" hidden="false" name="fqdn" nullable="true" readonly="false" type="string">
            <mtg:Value>example.com</mtg:Value>
          </mtg:Property>
          <mtg:Property displayName="WHOIS Info" hidden="false" name="whois-info" nullable="true" readonly="false" type="string">
            <mtg:Value/>
          </mtg:Property>
        </mtg:Properties>
      </mtg:MaltegoEntity>
    </data>
  </node>
</graphml>
That gives an error like this: "Message: elementGetText() function expects text only element but START_ELEMENT was encountered." That pretty much means that the GraphML loader in Gremlin expects plain text inside the data element at that point, not a nested element. So why is that?
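To see which data elements would trip a text-only loader, a small Python sketch can walk a GraphML document and report the keys whose data elements contain child elements. This is only a diagnostic aid under my own assumptions, not how the Gremlin loader works internally. The sample is a trimmed version of the node above, and the second node with key "d9" is a hypothetical text-only entry added for contrast.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"

# Trimmed sample; node n1 and key "d9" are hypothetical, added for contrast.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key attr.name="MaltegoEntity" for="node" id="d4"/>
  <node id="n0">
    <data key="d4">
      <mtg:MaltegoEntity xmlns:mtg="http://maltego.paterva.com/xml/mtgx" type="maltego.Domain">
        <mtg:Properties>
          <mtg:Property name="fqdn" type="string">
            <mtg:Value>example.com</mtg:Value>
          </mtg:Property>
        </mtg:Properties>
      </mtg:MaltegoEntity>
    </data>
  </node>
  <node id="n1">
    <data key="d9">plain text value</data>
  </node>
</graphml>"""


def nested_data_keys(graphml_text):
    """Return the keys of all <data> elements that contain child elements."""
    root = ET.fromstring(graphml_text)
    nested = []
    for data in root.iter("{%s}data" % GRAPHML_NS):
        if len(data) > 0:  # len() of an element counts its direct children
            nested.append(data.get("key"))
    return nested


if __name__ == "__main__":
    print(nested_data_keys(SAMPLE))
```

Any key reported here holds a nested structure, like the mtg:MaltegoEntity above, rather than a plain text value.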
Nothing Really Works
I guess at some point I got sick and tired of messing around with GraphML. It's just XML, right? Since I am only migrating the data and have no need to go back, I was keen on getting the vertex and edge data over to Titan without losing the essentials of the data.
A quick walk-through:
- Get the vertex objects, which are specified by mtg:MaltegoEntity
- Get the edge elements, specified by mtg:MaltegoLink
- Parse the vertices in a dynamic way to account for future changes. Remember that we are going to do this with a large-scale graph, so store the IDs that get returned instead of operating with the Maltego ones. I found that the most essential information is the sets of properties, with values of course.
- Parse the edges, and use the stored IDs to create the relationships between the vertices. It is also important to keep the labels here in addition to the source and destination vertex.
Along the way I also filtered out some of the elements that I found to be non-relevant for our setup (such as whois data that I don't use in this context, ipaddress-internal and so on).
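The walk-through above can be sketched roughly as follows in Python. This is not the script from the Gist, just an illustration under a few assumptions: the InMemoryGraph class is a hypothetical stand-in for the Titan graph, the edge sample assumes that mtg:MaltegoLink carries a type attribute like mtg:MaltegoEntity does, and the link type, the second node and the names in SKIPPED are placeholders chosen for the example.

```python
import xml.etree.ElementTree as ET

NS = {"g": "http://graphml.graphdrawing.org/xmlns",
      "mtg": "http://maltego.paterva.com/xml/mtgx"}
SKIPPED = {"whois-info", "ipaddress-internal"}  # assumed names of filtered properties

MTG = 'xmlns:mtg="http://maltego.paterva.com/xml/mtgx"'
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <node id="n0"><data key="d4"><mtg:MaltegoEntity %(m)s type="maltego.Domain"><mtg:Properties>
    <mtg:Property name="fqdn"><mtg:Value>example.com</mtg:Value></mtg:Property>
    <mtg:Property name="whois-info"><mtg:Value/></mtg:Property>
  </mtg:Properties></mtg:MaltegoEntity></data></node>
  <node id="n1"><data key="d4"><mtg:MaltegoEntity %(m)s type="maltego.IPv4Address"><mtg:Properties>
    <mtg:Property name="ipv4-address"><mtg:Value>192.0.2.1</mtg:Value></mtg:Property>
  </mtg:Properties></mtg:MaltegoEntity></data></node>
  <edge id="e0" source="n0" target="n1"><data key="d6">
    <mtg:MaltegoLink %(m)s type="maltego.link.transform-link"/>
  </data></edge>
</graphml>""" % {"m": MTG}


class InMemoryGraph:
    """Hypothetical stand-in for the target graph; it hands out its own IDs."""

    def __init__(self):
        self.vertices = {}
        self.edges = []

    def add_vertex(self, properties):
        vertex_id = len(self.vertices) + 1
        self.vertices[vertex_id] = properties
        return vertex_id

    def add_edge(self, source_id, target_id, label):
        self.edges.append((source_id, target_id, label))


def read_properties(element):
    """Collect name/value pairs, skipping filtered and empty properties."""
    props = {}
    for prop in element.iterfind("mtg:Properties/mtg:Property", NS):
        name = prop.get("name")
        value = prop.findtext("mtg:Value", default="", namespaces=NS).strip()
        if name not in SKIPPED and value:
            props[name] = value
    return props


def migrate(graphml_text, graph):
    """Two passes: vertices first (storing returned IDs), then edges."""
    root = ET.fromstring(graphml_text)
    id_map = {}  # Maltego node id -> ID returned by the target graph
    for node in root.iter("{%s}node" % NS["g"]):
        entity = node.find("g:data/mtg:MaltegoEntity", NS)
        if entity is None:
            continue
        props = read_properties(entity)
        props["type"] = entity.get("type")
        id_map[node.get("id")] = graph.add_vertex(props)
    for edge in root.iter("{%s}edge" % NS["g"]):
        link = edge.find("g:data/mtg:MaltegoLink", NS)
        label = (link.get("type") if link is not None else None) or "link"
        source, target = edge.get("source"), edge.get("target")
        if source in id_map and target in id_map:
            graph.add_edge(id_map[source], id_map[target], label)
    return id_map


if __name__ == "__main__":
    graph = InMemoryGraph()
    print(migrate(SAMPLE, graph), graph.edges)
```

Keeping the ID mapping in a dictionary means the edge pass never has to look vertices up in the target graph by their Maltego IDs.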
You can see the complete script in this Gist. Here's an example showing that it works, from one of my graphs: