Cleaning dirty data – Matching similar text strings

I once had the dubious pleasure of attempting to combine datasets where the only field in common was a free-text field containing slightly different representations of the same entity (e.g. Company Name). The existing solution was to eyeball the records and manually create a mapping table to link the two datasets. Mapping records were created only after the system failed on a record (i.e. the process was reactive in nature). Here are some of my findings on how to auto-magically generate this mapping table, together with working code in Python. Hopefully this helps to save some eyeballs, especially for those currently working on data migration projects.

GitHub repo here: https://github.com/amosang/datawrangling/blob/master/fuzzywuzzy/Illustrate%20-%20fuzzywuzzy%20library.ipynb
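
To give a feel for the idea without opening the notebook, here is a minimal sketch of generating a mapping table by fuzzy-matching names. It is not the notebook's code: it uses `difflib.SequenceMatcher` from the Python standard library rather than the fuzzywuzzy library, so that it runs with no installs. The helper names (`normalise`, `similarity`, `build_mapping`), the 0.8 threshold, and the sample company names are all assumptions made up for illustration.

```python
from difflib import SequenceMatcher


def normalise(name):
    """Lower-case, replace punctuation with spaces, collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def similarity(a, b):
    """Similarity between 0.0 and 1.0 after normalising both strings."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def build_mapping(left_names, right_names, threshold=0.8):
    """Pair each left name with its most similar right name.

    Returns a list of (left, match_or_None, score) tuples. The match is
    None when the best score is below the (hypothetical) threshold, so
    those rows can still be sent for manual review.
    Assumes right_names is not empty.
    """
    mapping = []
    for left in left_names:
        scored = [(similarity(left, right), right) for right in right_names]
        score, best = max(scored, key=lambda pair: pair[0])
        match = best if score >= threshold else None
        mapping.append((left, match, round(score, 2)))
    return mapping


# Made-up sample data, for illustration only.
system_a = ["Acme Corp.", "Globex Corporation", "Zzz Qqq"]
system_b = ["ACME Corp", "Globex Corp", "Initech"]

for row in build_mapping(system_a, system_b):
    print(row)
```

The threshold is the main thing to tune: set it too low and unrelated names get linked, set it too high and abbreviations such as "Corp" versus "Corporation" fall through to manual review. Rows with no match above the threshold are kept rather than dropped, so the eyeballing effort shrinks to just those records.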
