Improving a fuzzy matching algorithm in Python
Task: take two text files and output the 100% matches and the 75% matches.
Solution:
    import difflib
    import csv

    # Import and parse the two input files
    with open("h:/comm.names.txt", "r") as filea:
        seta = filea.readlines()
    with open("h:/acad.names.txt", "r") as fileb:
        setb = fileb.readlines()

    # 100% matches
    setmatch100 = set(seta).intersection(setb)
    with open("h:/match100.txt", "w") as match100:
        for item in setmatch100:
            match100.write(item)

    # Remove the 100% matches from the two lists
    seta_leftover = set(seta).difference(setmatch100)
    setb_leftover = set(setb).difference(setmatch100)

    # For each item in seta_leftover, return the best match in
    # setb_leftover that is at least 75% matching
    with open("h:/match75.csv", "w", newline="") as fmatch75:
        match75 = csv.writer(fmatch75)
        match75.writerow(["file a", "file b"])
        for item in seta_leftover:
            match = difflib.get_close_matches(item, setb_leftover, 1, 0.75)
            if len(match) > 0:
                row = [item.rstrip(), match[0].rstrip()]
                match75.writerow(row)
Problem: this works, but the results aren't good. Here is an example of a match: "fovea pharmaceuticals sa" and "kobe pharmaceutical univ".
I can't turn up the minimum percent in diff because I need to be able to match univ to university. Also, I can't make sure the first words match, because some strings start with "the" and need to be matched to strings that exclude "the". Can anyone point me in a direction to throw out matches that are technically 75% similar but to a human aren't similar at all?
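The two constraints described here (matching univ to university, and ignoring a leading "the") could be handled by normalizing each name before comparing, then checking that the first significant words agree. What follows is a minimal sketch under stated assumptions, not a definitive fix: the abbreviation table, the stopword list, the helper names `normalize` and `plausible_match`, and the 0.8 per-word cutoff are all hypothetical choices made for illustration.

```python
import difflib

# Hypothetical abbreviation table; extend it for your own data.
ABBREVIATIONS = {"univ": "university"}
# Words ignored when comparing names (an assumed list).
STOPWORDS = {"the"}


def normalize(name):
    """Lowercase, drop stopwords, and expand known abbreviations."""
    words = name.lower().split()
    words = [ABBREVIATIONS.get(w, w) for w in words if w not in STOPWORDS]
    return " ".join(words)


def plausible_match(a, b, word_cutoff=0.8):
    """Accept a pair only if the first significant words are similar."""
    wa, wb = normalize(a).split(), normalize(b).split()
    if not wa or not wb:
        return False
    first = difflib.SequenceMatcher(None, wa[0], wb[0]).ratio()
    return first >= word_cutoff
```

A check like `plausible_match` could be applied to each candidate returned by `difflib.get_close_matches` before a row is written, so that pairs whose leading words differ (such as "fovea" and "kobe") are dropped even when the whole strings score 75%.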
I would try comparing the strings with a tool such as pylevenshtein. It allows fuzzy string comparisons.
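To make the suggestion concrete, here is a minimal pure-Python sketch of Levenshtein (edit) distance and a similarity score derived from it. It is a stand-in written for illustration and is not the pylevenshtein API; the function names `levenshtein` and `similarity`, and the choice to scale by the longer string's length, are assumptions. A C-backed library would likely be much faster on large files.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale edit distance to a 0..1 score (1.0 means identical)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

Scores from `similarity` are not on the same scale as difflib's ratio, so a cutoff would need to be tuned separately against known good and bad pairs.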