Match with Heuristics

We will find that the names of things we care about vary from source to source, so it falls to us to reconcile these variations and produce reliable node identifiers during the transform step.

Consider the case where we want to retrieve an employee's id number given some subset of information of questionable consistency. Often a human can recognize a correspondence based on partial matches and familiar substitutions.

John Doe ≃ Jonathan Doe

@jdoe ≃ jdoe@example.com
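The two correspondences above could be written as small predicates. This is only a sketch: the `NICKNAMES` table and the method names `canonical_given`, `same_person?` and `same_account?` are made up for illustration, and a real nickname table would be much longer.

```ruby
# Hypothetical nickname table mapping familiar short forms to a canonical name.
NICKNAMES = {
  'john' => 'jonathan',
  'jon'  => 'jonathan',
  'bob'  => 'robert'
}.freeze

# Reduce a given name to its canonical form, or leave it alone if unknown.
def canonical_given(name)
  key = name.downcase
  NICKNAMES.fetch(key, key)
end

# True when family names agree and given names agree after substitution.
def same_person?(a, b)
  given_a, family_a = a.split(' ', 2)
  given_b, family_b = b.split(' ', 2)
  return false if family_a.nil? || family_b.nil?
  family_a.casecmp?(family_b) &&
    canonical_given(given_a) == canonical_given(given_b)
end

# True when a handle such as "@jdoe" equals the local part of an email address.
def same_account?(handle, email)
  handle.delete_prefix('@').casecmp?(email.split('@').first.to_s)
end
```

Predicates like these only suggest a correspondence. Two different people can share a handle prefix or a nickname, which is why the chosen rule is worth recording alongside the match.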

Our approach will be to read the more definitive sources first, match new names by applying rules, and record the chosen rule in each relation thus constructed.

We often make a new class to manage rule evaluation, but here is serviceable Ruby code in which each rule returns true when it succeeds, having read and modified @props and identified itself in @rule. The result of the heuristic match is a tuple [node, rule], where rule will be added to subsequent relations.

def match_employee props
  @props = props
  # rules run in order; the first to succeed leaves its name in @rule
  if rule1 || rule2 || rule3 || rule4
    [node('EMPLOYEE', @props), @rule]
  else
    [node('STRANGER', @props), 'no match']
  end
end
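For concreteness, here is one way the rules themselves might look when gathered into a class. Everything beyond match_employee is an assumption made for this sketch: the `EmployeeMatcher` class, the `accept` helper, the three rule names, the directory of hashes standing in for the more definitive source, and a `node` method that simply returns a hash in place of whatever the graph library provides. The id 17 is invented sample data.

```ruby
class EmployeeMatcher
  # directory: employee records read from the most definitive source,
  # e.g. { id: 17, name: 'Jonathan Doe', email: 'jdoe@example.com' }
  def initialize(directory)
    @directory = directory
  end

  def match_employee(props)
    @props = props.dup
    @rule = nil
    # Rules run from most to least trustworthy; the first success wins.
    if exact_email || handle_matches_email || exact_name
      [node('EMPLOYEE', @props), @rule]
    else
      [node('STRANGER', @props), 'no match']
    end
  end

  private

  # Stand-in for the graph library's node constructor (assumption).
  def node(label, props)
    { label: label, props: props }
  end

  # Shared bookkeeping: adopt the directory's values and name the rule.
  def accept(found, rule)
    return false unless found
    @props = @props.merge(found)
    @rule = rule
    true
  end

  def exact_email
    email = @props[:email]
    accept(email && @directory.find { |e| e[:email].casecmp?(email) },
           'exact email')
  end

  def handle_matches_email
    handle = @props[:handle]
    local = handle && handle.delete_prefix('@').downcase
    accept(local && @directory.find { |e| e[:email].split('@').first == local },
           'handle matches email')
  end

  def exact_name
    name = @props[:name]
    accept(name && @directory.find { |e| e[:name].casecmp?(name) },
           'exact name')
  end
end

matcher = EmployeeMatcher.new(
  [{ id: 17, name: 'Jonathan Doe', email: 'jdoe@example.com' }]
)
employee, rule = matcher.match_employee({ handle: '@jdoe' })
```

Here `rule` names the heuristic that succeeded, ready to be stored as a property on each relation built from `employee`.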

We can query for strangers, an obvious failure.

match (s:STRANGER) return s.name as name, s.email as email

We can also query for matches for each rule to look for less obvious failures of individual heuristics.

match (e:EMPLOYEE)-[r]-(x)
where exists(r.rule)
return r.rule as rule,
  collect(distinct e.name + e.email) as employee,
  collect(distinct r.source) as source

When the heuristics are complex and the edge cases obscure, we often write multiple queries that drill down into individual cases. This might seem tedious, but it is really a privileged view into how work happens.