Merge and Transform

We reconcile differences in terms as we merge information from extracted files and make connections between those elements where important relations exist.

Our strategy is to build hash tables as we read one extracted file after another. When this is complete we write each hash as a CSV file, the preferred format for importing nodes and relations into Neo4j.

# Methods

We've found it convenient to define and use a handful of helper methods. Our examples are in Ruby.

n = node('LABEL', params)

Create or retrieve a node based on 'name' or 'id' in params. Other params will be recorded and merged with params on subsequent retrievals.
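A minimal sketch of how such a method might work, assuming nodes are kept as plain hashes in a table keyed by label and then by 'id' (falling back to 'name'). The `NODES` constant and the choice to ignore nil values when merging are our assumptions, not a description of the actual helper.

```ruby
# Hypothetical store: label => { key => properties hash }
NODES = Hash.new { |hash, label| hash[label] = {} }

# Create or retrieve a node, keyed by 'id' when present, otherwise 'name'.
def node(label, params)
  props = params.transform_keys(&:to_s)
  key = props['id'] || props['name']
  raise ArgumentError, "node needs 'id' or 'name'" if key.nil?

  found = NODES[label][key] ||= {}
  # Merge new values, skipping nils so a sparse retrieval never erases a value.
  found.merge!(props.compact)
end

a = node('EMPLOYEE', { id: 'erlee@email.com' })
b = node('EMPLOYEE', { id: 'erlee@email.com', name: 'E. R. Lee' })
puts a.equal?(b)  # both calls return the same underlying hash
```

Because the second call merges into the hash created by the first, it does not matter which source mentions a node first.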

r = rel('TYPE', f, t, params)

Create or retrieve a relation from node f to node t. Record params and merge them with params on subsequent retrievals. Automatically add a 'source' parameter naming the EXPLAIN node managed by the file-accessing methods below.
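One way this could be sketched, assuming nodes are hashes carrying an 'id' and that the current EXPLAIN node is held in a global. The names `RELS` and `$source`, and the 'from'/'to' fields, are hypothetical choices made for this illustration.

```ruby
# Hypothetical store: type => { [from_id, to_id] => properties hash }
RELS = Hash.new { |hash, type| hash[type] = {} }

# Assumed to be set by the file-reading helpers; set by hand in this sketch.
$source = nil

# Create or retrieve the relation of this type between two nodes.
def rel(type, from, to, params = {})
  key = [from['id'], to['id']]
  found = RELS[type][key] ||= { 'from' => from['id'], 'to' => to['id'] }
  found.merge!(params.transform_keys(&:to_s).compact)
  # Annotate the relation with the source that is currently being read.
  found['source'] = $source['id'] if $source
  found
end

e = { 'id' => 'hrcollins@email.com' }
m = { 'id' => 'erlee@email.com' }
$source = { 'id' => 'org-chart/org-data.json' }
r = rel('MANAGER', e, m, {})
puts r['source']
```

Keying on the pair of node ids means a relation reported by two sources is stored once, with the most recent source recorded.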

json('dir/file_of_things.json').each do |thing| ... end

yaml('dir/file_of_things.yaml').each do |thing| ... end

Open and read a file in json or yaml format. Read source metadata from the same directory and create 'EXPLAIN' nodes describing each source. Establish this as the context to annotate relations created from this source.
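The following sketch shows one plausible shape for these readers. The metadata file name `source.yaml`, the `EXPLAINS` table and the `$source` global are all assumptions made for illustration; the text above says only that metadata is read from the same directory.

```ruby
require 'json'
require 'yaml'
require 'tmpdir'

EXPLAINS = {}  # hypothetical store of EXPLAIN nodes, keyed by source path
$source = nil  # the EXPLAIN node for the file currently being read

# Record an EXPLAIN node for this file and make it the current context.
# Assumption: metadata lives in 'source.yaml' beside the data file.
def explain(path)
  meta_path = File.join(File.dirname(path), 'source.yaml')
  meta = File.exist?(meta_path) ? YAML.safe_load(File.read(meta_path)) : {}
  $source = EXPLAINS[path] ||= { 'id' => path }.merge(meta || {})
end

def json(path)
  explain(path)
  JSON.parse(File.read(path))
end

def yaml(path)
  explain(path)
  YAML.safe_load(File.read(path))
end

# Usage with throwaway files in a temporary directory.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'source.yaml'), "title: Org Chart\n")
  File.write(File.join(dir, 'things.json'), '[{"name": "E. R. Lee"}]')
  json(File.join(dir, 'things.json')).each { |thing| puts thing['name'] }
  puts $source['title']
end
```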

# Example

We consider how we might use these methods to create nodes and relations from an organization chart extracted as JSON, as in our Data in Context example.

[
  { "name": "E. R. Lee", "email": "erlee@email.com", "manager": null, "start": "2004-6-7" },
  { "name": "H. R. Collins", "email": "hrcollins@email.com", "manager": "erlee@email.com", "start": "2002-10-11" },
  ...
]

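We use email as a property and as the foreign key 'id' for subsequent relations. We depend on node merging props so that we can process employees in any order: a manager referenced before their own record has been read is created with only an id and filled in later.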

json('org-chart/org-data.json').each do |employee|
  props = {
    id: employee['email'],
    name: employee['name'],
    email: employee['email'],
    start: employee['start']
  }
  e = node('EMPLOYEE', props)
  # The top of the chart has a null manager, so there is nothing to relate.
  next if employee['manager'].nil?
  m = node('EMPLOYEE', { id: employee['manager'] })
  rel('MANAGER', e, m, {})
end

Notice that some inputs are saved as properties of EMPLOYEE nodes while another, the manager's email, is used to identify a second node to be joined by a relation.

In many cases additional logic will be required to construct consistent ids for nodes found in different sources. Clever defaults for the node and rel methods will simplify this logic.
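As a small illustration of such logic, suppose two sources write the same email address with different case or stray whitespace. A hypothetical `employee_id` helper (not part of the methods described above) could reduce both to one id before calling node:

```ruby
# Hypothetical helper: derive one canonical id from an email address,
# however a particular source happens to write it.
def employee_id(email)
  return nil if email.nil?

  email.strip.downcase
end

puts employee_id(' ERLee@Email.com ')  # => erlee@email.com
```

Real sources will often need more than this, for example mapping between employee numbers and email addresses.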

# Output

We write a separate CSV file for each node label and relationship type. For each we scan the recorded items to discover which columns will be required. We write these as column heads and then write each item's fields in the corresponding order.
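The scan-then-write step can be sketched with Ruby's standard csv library. This assumes each recorded item is a hash of properties; `to_csv` is a name we chose for the example, and the plain column heads here do not attempt any import-specific header conventions.

```ruby
require 'csv'

# Build CSV text for the items of one label or type.
# The header row is the union of every key seen, in first-seen order.
def to_csv(items)
  columns = items.flat_map(&:keys).uniq
  CSV.generate do |csv|
    csv << columns
    # values_at yields nil for a missing key, which CSV writes as an empty field.
    items.each { |item| csv << item.values_at(*columns) }
  end
end

employees = [
  { 'id' => 'erlee@email.com', 'name' => 'E. R. Lee' },
  { 'id' => 'hrcollins@email.com', 'name' => 'H. R. Collins', 'start' => '2002-10-11' }
]
puts to_csv(employees)
```

Writing one file per label then amounts to calling this for each entry of the node and relation tables and saving the result with File.write.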

See GitHub Example