Transcription

Fuzzy Matching with SAS: DataAnalysts Tool to Cleaner DataJosh Fogarasi

Agenda What is Fuzzy Matching Anyways? Why is it relevantto a Data Professional? Introducing some useful SAS Text Functions Fuzzy Matching Cycle Simple Business Case: Required to verify CustomerData and have mistyped information (Dirty Data) Example: Small Scale Data Merge Example: Bulk Processing-Overload

Things to Consider Computer Science“: Fuzzy String Searching” Approximate join or a linkage between observationsthat is not an exact 100% one to one match Applies to strings/character arrays There is no one direct method or algorithm thatsolves the problem of joining mismatched data Fuzzy Matching is often an iterative process

Evaluating External Data Data Auditing: Access how clean your organizationsquality level of the data that exists Marketing: Generate a Lead List from data from anexternal source with relevant contact informationand exclude pre-existing customers Validating Data between two different databases(Access,Sybase,DB2,Excel Files,flat files) Correcting mistyped data fields among data sources(Manually Maintained Spreadsheets)

Evaluating Internal DataData Stewardship-Maintain the Quality of Data forinternal stakeholders Comparing Historical Data-Names, contactinformation, Addresses change over time Technology Migration between systems Different Data Sources that do not communicate withone another

Useful SAS Text FunctionsText Parsing Functions: Scan, SubstringText Positioning-IndexString Modification:Compress-removes all blanks, special charactersStrip-removes leading/trailing blank spacesTrim-removes trailing , used in concatenation functionsCase ManipulationUpcase()Lowcase()Propcase()

Useful SAS Text FunctionsText Extraction Functions: Scan() , Substring()Index Function-gives position within a string for a specific textString Modification:Compress()- removes all blanks, special charactersStrip() -removes leading/trailing blank spacesTrim() -removes trailing , used in concatenation functionsCase ManipulationUpcase()Lowcase()Propcase()Other useful text FunctionsLeft(),right(),length()

Useful SAS Text FunctionsText Extraction Functions: Scan() , Substring()Index Function-gives position within a string for a specific textString Modification:Compress()- removes all blanks, special charactersStrip() -removes leading/trailing blank spacesTrim() -removes trailing , used in concatenation functionsCase ManipulationUpcase()Lowcase()Propcase()Other useful text FunctionsLeft(),right(),length()

Functions useful for Fuzzy MatchingSOUNDEX generates a unique key/code for the string Phonetic coding system can be used with combination of the “* “ sounds likeoperator for both Proc SQL or within the wherestatement of a data step Useful in simplifying long character strings and iscomputationally less expensive during the mergeprocessing stage

Functions useful for Fuzzy MatchingSOUNDEX-generates a unique key/code for the string

Functions useful for Fuzzy MatchingCOMPGED Computes the Levenshtein Edit Distance between twostrings Scoring algorithm for (Replacement, deletion, or insertion)of characters within the stringCOMPLEV Computes special case of the Levenshtein Distance Not as versatile as Compged, good for small stringsSPEDIS Measures the propensity of two strings matchingCOMPARE Evaluates two strings and returns the left most character ifthey are different or a zero if they are the exact same

Functions useful for Fuzzy MatchingLevenshtein Edit Distance Algorithm

Fuzzy Matching Cycle1) Identifythe datafields5) Use theMatchingset tocombinedata sets4)Evaluatethe FuzzyMatches2) Simplifythe data3) Cleanthe data

Simple Example: Identify the DataFieldsProblem: We have two customer lists with no unique key tomatch them on in order to combine the data setsData Set 1- Name, Mailing Address, Postal code, CityData Set 2- Name and E-mail, Phone NumberResult- Data Set that contains All ParametersSolution: Without a unique key such as a client id or account idthen we are required to join these some howTwist-None of the Names were entered the same

Simple Example: Step 2- Simply the Data-I prefer to use column vectors for Fuzzy Matching-saves computational time and allows the process to runcleaner

Simple Example: Step 3- Clean the Data-Generally before any matching begins the fields should bemade to resemble one another

Simple Example: Step 4- Evaluate the Fuzzy Matches-Using the COMPGED after the Match is Complete

Simple Example: Step 5- Using the Matches to combine the two data sets

Useful SAS Papers 1) Matching Data Using Sounds-Like Operators and SAS Compare Functions Amanda Roesch, Educational Testing Service, Princeton, NJ 2)Fuzzy Merges - A Guide to Joining Data sets with Non-ExactKeys Using the SAS SQL Procedure Robert W. Graebner, Quintiles, Overland Park, KS, USAWebsitesa) -perform-afuzzy-match-using-sas-functions/b) www.lexjansen.com/nesug/nesug11/ps/ps07.pdfc) www.lexjansen.com/nesug/nesug07/ap/ap23.pd

APPENDIX: CODE

APPENDIX

Questions?

What is Fuzzy Matching Anyways? Why is it relevant to a Data Professional? Introducing some useful SAS Text Functions Fuzzy Matching Cycle Simple Business Case: Required to verify Customer Data and have mistyped information (Dirty Data) Example: Small Scale Data Merge Example: Bulk Processing-Overload AgendaFile Size: 281KBPage Count: 22Explore furtherFuzzy match in sas" Keyword Found Websites Listing .www.keyword-suggest-tool.c How to perform a fuzzy match using SAS functions - SAS U blogs.sas.comHot Fuzz: using SAS “fuzzy” data matching techniques .www.sas.comHow to fuzzy match by firm names - SAS Support Communit communities.sas.com4 Functions For Fuzzy Matching In SAS - 9TO5SASwww.9to5sas.comRecommended to you b