Improving a fuzzy matching algorithm in Python


Task: take two text files and output the 100% matches and the 75% matches.

Solution:

    import difflib
    import csv

    # read both input files into lists of lines
    filea = open("h:/comm.names.txt", 'r')
    try:
        seta = filea.readlines()
    finally:
        filea.close()

    fileb = open("h:/acad.names.txt", 'r')
    try:
        setb = fileb.readlines()
    finally:
        fileb.close()

    # 100% matches: lines that appear in both files
    setmatch100 = set(seta).intersection(setb)

    match100 = open("h:/match100.txt", 'w')
    try:
        for item in setmatch100:
            match100.write(item)
    finally:
        match100.close()

    # remove the 100% matches from the two lists
    seta_leftover = set(seta).difference(setmatch100)
    setb_leftover = set(setb).difference(setmatch100)

    # for each item in seta_leftover, return the best match in
    # setb_leftover that is at least 75% similar
    fmatch75 = open("h:/match75.csv", 'w')
    match75 = csv.writer(fmatch75)
    try:
        match75.writerow(['file a', 'file b'])
        for item in seta_leftover:
            match = difflib.get_close_matches(item, setb_leftover, 1, 0.75)
            if len(match) > 0:
                row = [item.rstrip(), match[0].rstrip()]
                match75.writerow(row)
    finally:
        fmatch75.close()

Problem: this works, but the results aren't good. Here is an example of a bad match:

"fovea pharmaceuticals sa" was matched with "kobe pharmaceutical univ"

I can't turn up the minimum percentage in difflib, because I need to be able to match "univ" with "university". Also, I can't just make sure the first words match, because some strings start with "the" and need to be matched with strings that exclude "the". Can anyone point me in a direction to throw out matches that are technically 75% similar but, to a human, aren't similar at all?
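One possible direction, sketched here under assumptions rather than taken from the original post: keep the 75% difflib pass, then reject any pair whose individual words do not line up. The helper names (`tokens`, `token_score`, `names_agree`), the stop-word set, and the 0.8 per-word threshold are all hypothetical choices for illustration. A word counts as matching when one is a prefix of the other (so "univ" lines up with "university"), and a leading "the" is dropped before comparing.

```python
import difflib

# Assumption: words to ignore when comparing names; extend as needed.
STOPWORDS = {"the"}


def tokens(name):
    """Lowercase a name, split it on whitespace, and drop stop words."""
    return [w for w in name.lower().split() if w not in STOPWORDS]


def token_score(a, b):
    """Score two words from 0.0 to 1.0; a prefix of 3+ characters counts as 1.0."""
    shorter, longer = sorted((a, b), key=len)
    if len(shorter) >= 3 and longer.startswith(shorter):
        return 1.0
    return difflib.SequenceMatcher(None, a, b).ratio()


def names_agree(name_a, name_b, min_token=0.8):
    """Return True if every word of the shorter name has a close partner
    in the other name. min_token is an illustrative threshold."""
    ta, tb = tokens(name_a), tokens(name_b)
    if not ta or not tb:
        return False
    short, long_ = (ta, tb) if len(ta) <= len(tb) else (tb, ta)
    return all(
        max(token_score(s, l) for l in long_) >= min_token
        for s in short
    )
```

In the loop above, this could be used as an extra filter, for example writing the row only when `names_agree(item, match[0])` is true. With this sketch the pair from the example is rejected because "fovea" has no close partner among "kobe", "pharmaceutical", and "univ", while "the kobe pharmaceutical univ" and "kobe pharmaceutical university" would still be accepted.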

I would try comparing the strings with a tool such as pylevenshtein. It allows fuzzy string comparisons.
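To illustrate what an edit-distance comparison computes, here is a minimal pure-Python sketch of the Levenshtein distance. It is a stand-in for illustration only, not pylevenshtein's own API; the function names `edit_distance` and `similarity` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Scale the distance into a 0.0 to 1.0 score, where 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

Note that a raw edit distance still treats "univ" and "university" as six edits apart, so it would probably need to be combined with some normalization of abbreviations and of a leading "the" before it helps with the problem described in the question.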

