information retrieval - Using Jaccard Coefficient for measuring string similarity
I have a training and a test dataset that should be used for string similarity measurement. Here are a few lines of the dataset:

brandon bass ||| hell brandon bass thinking ||| brandon bass has 5 personal fouls ||| false
sac ||| congrats sac kings fans ||| why yall forcing kings stay in sac town smh ||| false
stella ||| hello stella can follow me please ||| stella u hate me ||| false

The data file has 50 entries of the form topic ||| tweet_sent_1 ||| tweet_sent_2 ||| have_similar_meaning
topic – Twitter topic
tweet_sent_1 – tweet sentence 1
tweet_sent_2 – tweet sentence 2
have_similar_meaning – binary label (true: the two sentences are similar; false: the two sentences are not similar) assigned by a human annotator
We need to divide the data set in two: a training set (35 samples) and a test set (15 samples). We have to use the training set for parameter tuning of the algorithms, and then evaluate on the test set using the best tuned parameters.
If the algorithm is the Jaccard coefficient, how can I perform this task? Can you please let me know an approach I can use?
Jaccard similarity is a measure of how similar two sets (of n-grams in this case) are. There is no "tuning" to be done here, except for the threshold at which you decide that two strings are similar or not.
For example, if you have the two strings abcde and abdcde, it works as follows:
n-grams (n=2) for A = 'abcde' and B = 'abdcde':

     ab  bc  cd  de  dc  bd
A     1   1   1   1   0   0
B     1   0   1   1   1   1
J(A, B) = |A ∩ B| / |A ∪ B|
J(A, B) = 3 / 6 = 0.5
There is also the Jaccard distance, which captures the dissimilarity between two sets and is calculated by taking one minus the Jaccard coefficient (in this case, 1 - 0.5 = 0.5).
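The worked example above can be reproduced with a few lines of Python. This is only a minimal sketch: the function names are made up for illustration, it uses character bigrams as in the example, and treating two strings with no n-grams at all as identical is an assumption, not part of the definition.

```python
def char_ngrams(text, n=2):
    """Return the set of character n-grams of `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_similarity(a, b, n=2):
    """Jaccard coefficient of the n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        # Assumption: strings too short to have any n-gram count as identical.
        return 1.0
    return len(grams_a & grams_b) / len(union)


def jaccard_distance(a, b, n=2):
    """Jaccard distance: one minus the Jaccard coefficient."""
    return 1.0 - jaccard_similarity(a, b, n)


print(jaccard_similarity("abcde", "abdcde"))  # 0.5
print(jaccard_distance("abcde", "abdcde"))    # 0.5
```

For tweets you may prefer sets of word tokens over character n-grams; the set arithmetic stays the same.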
So, for your problem, use the training set labels to choose a proper threshold above which two strings are considered similar and below which they are considered dissimilar.
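One way to pick that threshold is a simple sweep over candidate values, keeping the one that classifies the most training pairs correctly. The sketch below assumes accuracy as the selection criterion and a grid of 0.00 to 1.00 in steps of 0.01; both are choices made here for illustration. The helper names are hypothetical, and the toy rows are made up in the question's format, not taken from the real data file.

```python
def jaccard(a, b, n=2):
    """Jaccard coefficient over character n-grams (bigrams by default)."""
    ga = {a[i:i + n] for i in range(len(a) - n + 1)}
    gb = {b[i:i + n] for i in range(len(b) - n + 1)}
    union = ga | gb
    return len(ga & gb) / len(union) if union else 1.0


def parse_line(line):
    """Split 'topic ||| sent1 ||| sent2 ||| label' into (sent1, sent2, bool)."""
    topic, s1, s2, label = [part.strip() for part in line.split("|||")]
    return s1, s2, label.lower() == "true"


def accuracy(samples, threshold):
    """Fraction of pairs where (similarity >= threshold) matches the label."""
    hits = sum((jaccard(s1, s2) >= threshold) == label
               for s1, s2, label in samples)
    return hits / len(samples)


def tune_threshold(train, candidates=None):
    """Return the candidate threshold with the highest training accuracy."""
    if candidates is None:
        candidates = [i / 100 for i in range(101)]
    return max(candidates, key=lambda t: accuracy(train, t))


# Made-up toy rows in the question's format (NOT from the real data file).
toy_lines = [
    "t1 ||| the cat sat on the mat ||| the cat sat on a mat ||| true",
    "t2 ||| good morning everyone ||| good morning every one ||| true",
    "t3 ||| i love this song ||| rain is expected tomorrow ||| false",
    "t4 ||| kings stay in town ||| what a terrible game ||| false",
]
train = [parse_line(line) for line in toy_lines]
best = tune_threshold(train)
print(best, accuracy(train, best))
```

With the real file you would parse all 50 lines, call tune_threshold on the 35 training samples, and then report accuracy(test, best) once on the 15 held-out test samples.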