information retrieval - Using Jaccard Coefficient for measuring string similarity

I have a test and training dataset that should be used for string similarity measurement. Here are a few lines of the dataset:

brandon bass ||| hell brandon bass thinking ||| brandon bass has 5 personal fouls ||| false
sac ||| congrats sac kings fans ||| why yall forcing kings stay in sac town smh ||| false
stella ||| hello stella can follow me please ||| stella u hate me ||| false

The data file has 50 entries of the form topic ||| tweet_sent_1 ||| tweet_sent_2 ||| have_similar_meaning

topic – Twitter topic

tweet_sent_1 – tweet sentence 1

tweet_sent_2 – tweet sentence 2

have_similar_meaning – binary label (true – the 2 sentences are similar, false – the 2 sentences are not similar) assigned by a human annotator
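A record in this format could be parsed roughly as follows. This is only a sketch: the `Pair` type and the `parse_line` function are hypothetical names introduced here, and it assumes every line has exactly four fields separated by `|||`.

```python
from typing import NamedTuple


class Pair(NamedTuple):
    """One record of the data file (hypothetical container name)."""
    topic: str
    tweet_sent_1: str
    tweet_sent_2: str
    have_similar_meaning: bool


def parse_line(line: str) -> Pair:
    # Split on the '|||' separator and trim the surrounding whitespace.
    parts = [part.strip() for part in line.split("|||")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
    topic, sent_1, sent_2, label = parts
    # Assumption: the label is the literal text 'true' or 'false'.
    return Pair(topic, sent_1, sent_2, label.lower() == "true")


record = parse_line("stella ||| hello stella can follow me please ||| stella u hate me ||| false")
print(record.topic, record.have_similar_meaning)
```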

We need to divide the data set in two: a training set (35 samples) and a test set (15 samples). We have to use the training set for parameter tuning of the algorithms, and then test on the test set using the best tuned parameters.
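One way to get the 35/15 split is to shuffle the samples once with a fixed seed and cut the list. The function name `split_train_test` and the seed value are assumptions made for this sketch; the question does not say how the split should be drawn.

```python
import random


def split_train_test(samples, n_train=35, seed=0):
    """Shuffle a copy of the samples and cut it into train and test parts."""
    shuffled = list(samples)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Stand-in for the 50 parsed records; integers are used only to show the sizes.
train, test = split_train_test(range(50))
print(len(train), len(test))
```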

If the algorithm is the Jaccard coefficient, how can I perform this task? Please let me know which approach I can use.

Jaccard similarity is a measure of how similar two sets (of n-grams, in this case) are. There is no "tuning" to be done here, except for the threshold at which you decide whether two strings are similar or not.

For example, if you have the two strings abcde and abdcde, it works as follows:

n-grams (n=2) of A = 'abcde' and B = 'abdcde':

     ab  bc  cd  de  dc  bd
A     1   1   1   1   0   0
B     1   0   1   1   1   1

J(A, B) = |A ∩ B| / |A ∪ B|

J(A, B) = 3 / 6 = 0.5
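The same calculation can be sketched in Python with sets of character bigrams. The helper names `char_ngrams` and `jaccard` are made up for this example, and returning 1.0 when both strings yield no n-grams is an assumption, since the coefficient is undefined in that case.

```python
def char_ngrams(text, n=2):
    """Return the set of character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b, n=2):
    """Jaccard coefficient of the n-gram sets of two strings."""
    set_a, set_b = char_ngrams(a, n), char_ngrams(b, n)
    union = set_a | set_b
    if not union:
        # Assumption: two strings without any n-grams count as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


print(sorted(char_ngrams("abcde")))   # ['ab', 'bc', 'cd', 'de']
print(jaccard("abcde", "abdcde"))     # 0.5
```

For tweets, word n-grams (or single words) may be a better choice of set elements than character bigrams; which one works best is something to check on the training set.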

There is also the Jaccard distance, which captures the dissimilarity between two sets and is calculated by taking one minus the Jaccard coefficient (in this case, 1 - 0.5 = 0.5).

So, for your problem, use the training set labels to define a proper threshold at or above which two strings are considered similar and below which they are considered dissimilar.
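A minimal sketch of that threshold search, assuming accuracy is the criterion and a grid of 0.00 to 1.00 in steps of 0.01 is fine enough (neither is specified in the question). The names `jaccard`, `accuracy` and `best_threshold` are hypothetical, and the four training pairs are made up for illustration; they are not taken from the dataset.

```python
def jaccard(a, b, n=2):
    """Jaccard coefficient over character n-grams (sketch, see assumptions above)."""
    set_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    set_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def accuracy(pairs, threshold):
    """Fraction of (sent_1, sent_2, label) pairs classified correctly."""
    hits = sum((jaccard(s1, s2) >= threshold) == label for s1, s2, label in pairs)
    return hits / len(pairs)


def best_threshold(train_pairs):
    """Try thresholds 0.00, 0.01, ..., 1.00 and keep the most accurate one."""
    best_t, best_acc = 0.0, -1.0
    for step in range(101):
        t = step / 100
        acc = accuracy(train_pairs, t)
        if acc > best_acc:  # on ties the lowest threshold is kept
            best_t, best_acc = t, acc
    return best_t, best_acc


# Made-up training pairs, for illustration only.
train_pairs = [
    ("abcde", "abcde", True),
    ("abcde", "abdcde", True),
    ("abcde", "vwxyz", False),
    ("hello", "help", False),
]
threshold, train_acc = best_threshold(train_pairs)
print(threshold, train_acc)
```

The tuned threshold is then fixed and `accuracy(test_pairs, threshold)` is computed once on the 15 held-out samples, so the test set plays no part in choosing it.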
