graph - Record deduplication (linkage) algorithms


I have a standard record deduplication task: I have a lot of records with text (or other) fields, and some of them correspond to the same entity. Merging such records is the goal of the task.

There are widely used, simple statistical approaches to this kind of task, known as "probabilistic record linkage". Some of them are more precise and more complicated but exploit the same ideas, for example https://github.com/datamade/dedupe: they try to weight each field's measure of similarity somehow, and a linear combination of the weighted differences serves as the measure of whole-record similarity.
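As a rough illustration of that idea, here is a minimal sketch of a weighted linear similarity score. The weights, the field names, and the use of `difflib.SequenceMatcher` as the per-field measure are all assumptions made for this example; they are not how dedupe or any particular library computes its score.

```python
from difflib import SequenceMatcher

# Hypothetical per-field weights; a real system would learn or tune these.
WEIGHTS = {"propa": 0.5, "propb": 0.3, "propc": 0.2}


def field_similarity(x, y):
    """Similarity of two strings in [0, 1], using difflib's matching ratio."""
    return SequenceMatcher(None, x, y).ratio()


def record_similarity(r1, r2, weights=WEIGHTS):
    """Linear combination of the weighted per-field similarities."""
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items())
```

Two records would then be treated as the same entity when `record_similarity` exceeds some threshold, which is itself a tuning parameter.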

But my task has a lot of unknown fields, while the number of similar fields is rather large:

record1 : propa = ; propb = unknown ; propc = unknown ; ....
record2 : propa = ; propb = b ; propc = unknown ; ....
record3 : propa = unknown ; propb = b ; propc = d ; ....
record4 : propa = a2 ; propb = unknown ; propc = unknown ; ....
record5 : propa = a2 ; propb = b2 ; propc = unknown ; ....
record6 : propa = x2 ; propb = b2 ; propc = d2 ; ....

In this case record1 can be linked to record3 via record2, and likewise record4 can be linked to record6 (via record5).

This means I need something similar to graph clustering, with a lot of skips and a huge number of nodes and edges. I don't need a precise solution, but something better than classical statistical deduplication must exist.
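The transitive linking described here can be sketched as connected components over records that share a known field value. This is only an illustration under stated assumptions: unknown values are represented as `None`, the shared `propa` value of record1 and record2 is assumed to be `"a"` (the listing leaves it blank), and a single exact match on any field is treated as a link. An index keyed by (field, value) avoids comparing every pair of records.

```python
UNKNOWN = None  # assumed marker for a missing field value

# The six example records; "a" for record1/record2 is an assumed value.
records = {
    "record1": {"propa": "a", "propb": UNKNOWN, "propc": UNKNOWN},
    "record2": {"propa": "a", "propb": "b", "propc": UNKNOWN},
    "record3": {"propa": UNKNOWN, "propb": "b", "propc": "d"},
    "record4": {"propa": "a2", "propb": UNKNOWN, "propc": UNKNOWN},
    "record5": {"propa": "a2", "propb": "b2", "propc": UNKNOWN},
    "record6": {"propa": "x2", "propb": "b2", "propc": "d2"},
}


def find(parent, x):
    """Return the root of x's set, halving the path along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def cluster(records):
    """Group records that are connected through shared known field values."""
    parent = {name: name for name in records}
    first_seen = {}  # (field, value) -> first record holding that value
    for name, fields in records.items():
        for field, value in fields.items():
            if value is UNKNOWN:
                continue  # a missing value never creates a link
            key = (field, value)
            if key in first_seen:
                # Union the two sets: this record and the earlier holder.
                parent[find(parent, name)] = find(parent, first_seen[key])
            else:
                first_seen[key] = name
    groups = {}
    for name in records:
        groups.setdefault(find(parent, name), []).append(name)
    return sorted(groups.values())


print(cluster(records))
```

On the example data this groups record1, record2 and record3 together, and record4, record5 and record6 together. Note that it ignores conflicting values (record6 has `propa = x2` while record4 and record5 have `a2`); a more careful variant would score each candidate link, for instance with a weighted field similarity, before merging.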

dedupe handles missing data and does graph clustering. There are other paradigms of record linkage, but your data does not seem to demand them.

If you want to investigate newer paradigms, look at the work of Beka Steorts or Michael Wick.

