graph - Record deduplication (linkage) algorithms
I have a standard record deduplication task: I have a lot of records with text (or other) fields, and some of them correspond to the same entity. Merging such records is the goal of the task.
There are commonly used, simple statistical approaches to this kind of task, known as "probabilistic record linkage". Some of them are more precise and more complicated but exploit the same ideas, for example https://github.com/datamade/dedupe: try to weight each field's measure of similarity somehow, and take a linear combination of the weighted differences as the measure of whole-record similarity.
But in my task the records have a lot of unknown fields, and the number of similar fields is rather large:
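To make the weighted-combination idea concrete, here is a minimal sketch in Python. The field names (`name`, `city`), the weights, and the use of `difflib.SequenceMatcher` as the per-field similarity are all assumptions chosen for illustration; this is not how dedupe itself computes or learns its weights.

```python
from difflib import SequenceMatcher

# Hypothetical field weights; in practice they would be estimated from data.
WEIGHTS = {"name": 0.6, "city": 0.4}


def field_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] based on difflib's matching-block ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: dict, r2: dict, weights: dict = WEIGHTS) -> float:
    """Weighted linear combination of per-field similarities, normalised to [0, 1]."""
    total = sum(weights.values())
    score = sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items())
    return score / total
```

Two records would then be treated as the same entity when `record_similarity` exceeds some threshold, which is another parameter that has to be chosen or learned.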
record1 : propa = ; propb = unknown ; propc = unknown ; ....
record2 : propa = ; propb = b ; propc = unknown ; ....
record3 : propa = unknown ; propb = b ; propc = d ; ....
record4 : propa = a2 ; propb = unknown ; propc = unknown ; ....
record5 : propa = a2 ; propb = b2 ; propc = unknown ; ....
record6 : propa = x2 ; propb = b2 ; propc = d2 ; ....
In this case record1 can be linked to record3 via record2, and likewise record4 can be linked to record6.
This means I need something similar to graph clustering, with a lot of skips (missing values) and a huge number of nodes and edges. I don't need a precise solution, but something better than classical statistical deduplication must exist.
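The transitive linking described above can be sketched as connected components over a graph in which two records are joined whenever they share a known value in the same field. The sketch below makes several assumptions: unknown values are represented by `None`, exact equality stands in for a real similarity test, and the function names are hypothetical. It uses a union-find structure and an index of first-seen values, so it never enumerates all record pairs.

```python
from collections import defaultdict

UNKNOWN = None  # assumed marker for a missing field value


def find(parent, x):
    """Return the root of x's set, shortening the path along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def cluster(records):
    """Group record indices that are connected through shared known values."""
    parent = list(range(len(records)))
    first_seen = {}  # (field, value) -> index of the first record holding it
    for i, rec in enumerate(records):
        for field, value in rec.items():
            if value is UNKNOWN:
                continue  # unknown fields create no edges
            key = (field, value)
            if key in first_seen:
                parent[find(parent, i)] = find(parent, first_seen[key])
            else:
                first_seen[key] = i
    groups = defaultdict(list)
    for i in range(len(records)):
        groups[find(parent, i)].append(i)
    return sorted(groups.values())


# Illustrative data shaped like the six records above; the value "a" is an assumption.
records = [
    {"propa": "a", "propb": None, "propc": None},
    {"propa": "a", "propb": "b", "propc": None},
    {"propa": None, "propb": "b", "propc": "d"},
    {"propa": "a2", "propb": None, "propc": None},
    {"propa": "a2", "propb": "b2", "propc": None},
    {"propa": "x2", "propb": "b2", "propc": "d2"},
]
print(cluster(records))  # [[0, 1, 2], [3, 4, 5]]
```

Note that pure transitive closure can over-merge: a single spurious shared value is enough to join two otherwise unrelated groups, so in practice the edges would usually be scored and thresholded first.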
dedupe handles missing data and graph clustering. There are other paradigms for record linkage, but your data does not seem to demand them.
If you want to investigate newer paradigms, look at the work of Beka Steorts or Michael Wick.