Transcription

A Quick Look at Fuzzy Matching Programming TechniquesUsing SAS SoftwareStephen Sloan; AccentureKirk Paul Lafler; SAS Consultant, Application Developer, Programmer, Educator and AuthorABSTRACTData comes in all forms, shapes, sizes and complexities. Stored in files and data sets, SAS users across industriesknow all too well that data can be, and often is, problematic and plagued with a variety of issues. Two data files canbe joined without a problem when they have identifiers with unique values. However, many files do not have uniqueidentifiers, or “keys”, and need to be joined by character values, like names or E-mail addresses. These identifiersmight be spelled differently or use different abbreviation or capitalization protocols. This paper illustrates data setscontaining a sampling of data issues, popular data cleaning and user-defined validation techniques, datatransformation techniques, traditional merge and join techniques, the introduction to the application of different SAScharacter-handling functions for phonetic matching, including SOUNDEX, SPEDIS, COMPLEV, and COMPGED, and anassortment of SAS programming techniques to resolve key identifier issues and to successfully merge, join and matchless than perfect, or “messy” data. Although the programming techniques are illustrated using SAS code, many, if notmost, of the techniques can be applied to any software platform that supports character-handling.Keywords: Fuzzy matching, SAS, character-handling functions, phonetic matching, SOUNDEX, SPEDIS, edit distance,Levenshtein, COMPLEV, COMPGEDINTRODUCTIONWhen data sources and data sets contain consistent and valid data values, share common unique identifier(s), andhave no missing data, the matching process rarely presents any problems. But, when data originating from multiplesources contain duplicate observations, duplicate and/or unreliable keys, missing values, invalid values, capitalizationand punctuation issues, inconsistent matching variables, and imprecise text identifiers, the matching process is oftencompromised by unreliable and/or unpredictable results. When issues like these exist, SAS users must first clean andstandardize any and all data irregularities before any attempts to match data records are performed. To assist in thistime-consuming and costly process, users often utilize special-purpose programming techniques including theapplication of one or more SAS functions, the use of approximate string matching, and/or an assortment ofconstructive programming techniques to standardize and combine data sets together.DATA SETS USED IN EXAMPLESThe examples presented in this paper illustrate two data sets, Movies with Messy Data andActors with Messy Data. The Movies with Messy Data data set, illustrated in Figure 1a, consists of 31 observations,a data structure of six variables where Title, Category, Studio, and Rating are defined as character variables; andLength and Year are defined as numeric variables. After careful inspection several data issues can be found in this dataset including the existence of missing data, duplicate observations, spelling errors, punctuation inconsistencies, andinvalid values.The Actors with Messy Data data set, illustrated in Figure 1b, contains 15 observations and a data structureconsisting of three character variables: Title, Actor Leading and Actor Supporting. As with theMovies with Messy Data data set, several data issues are found including missing data, spelling errors, punctuationinconsistencies, and invalid values.Page 1

Fuzzy Matching Programming Techniques Using SAS Software, continuedPharmaSUG 2020Figure 1b: Actors with Messy Data data set.Figure 1a: Movies with Messy Data data set.THE MATCHING PROCESS EXPLAINEDIn an age of endless spreadsheets, apps and relational database management systems (RDBMS), it’s unusual to find asingle sheet, file, table or data set that contains all the data needed to answer an organization’s questions. Today’sdata exists in many forms and all too often involves matching two or more data sources to create a combined file. Thematching process typically involves combining two or more data sets, spreadsheets and/or files possessing a shared,common and reliable, identifier (or key) to create a single data set, spreadsheet and/or file. The matching process,illustrated in the following diagram, shows two tables with a key, Title, to combine the two tables together.MOVIES TitleLengthCategoryYearStudioRatingACTORS TitleActor LeadingActor SupportingBut, when a shared and reliable key is associated with input data sources that are nonexistent, inexact, or unreliable,the matching process often becomes more involved and problematic. As cited in Sloan and Hoicowitz (2016), specialprocesses are needed to successfully match the names and addresses from different files when they are similar, butnot exactly the same. In a constructive and systematic way the authors of this paper describe a six step approach tocleansing data and performing fuzzy matching techniques.Page 2

Fuzzy Matching Programming Techniques Using SAS Software, continuedSIX-STEP FUZZY MATCHING PROCESSWe adhere and recommend the following six step fuzzy matching process.Step 1: Determine the Likely Matching Variables using Metadata.Step 2: Understand the Distribution of Data Values.Step 3: Perform Data Cleaning.Step 4: Perform Data Transformations.Step 5: Process Exact Matches.Step 6: Match Key Fields using Fuzzy Matching Techniques.Page 3PharmaSUG 2020

Fuzzy Matching Programming Techniques Using SAS Software, continuedPharmaSUG 2020STEP #1: DETERMINE THE LIKELY MATCHING VARIABLES.This first step determines whether any variables exist for matching purposes. Using the PROC CONTENTS listing,shown in Figure 2, we examine the results of the TITLE variable along with a sampling of values in greater detail todetermine whether it can serve as the key for matching observations in both data sets, as well as the distribution ofdata values for other categorical variables.PROC CONTENTS Code:PROC CONTENTS DATA mydata.Movies with Messy Data ;RUN ;PROC CONTENTS DATA mydata.Actors with Messy Data ;RUN ;The Movies with Messy Data data set’s data structure consists of six variables where Title, Category, Studio, andRating are defined as character variables; and Length and Year are defined as numeric variables. After carefulinspection several data issues can be found in this data set including the existence of missing data, duplicateobservations, spelling errors, punctuation inconsistencies, and invalid values.The Actors with Messy Data data set’s data structure consists of three character variables: Title, Actor Leading andActor Supporting. As with the Movies with Messy Data data set, several data issues are found including missingdata, spelling errors, punctuation inconsistencies, and invalid values.The PROC CONTENTS output for both data sets are illustrated in Figure 2, below.Results:Figure 2: CONTENTS procedure Output for Movies with Messy Data and Actors with Messy Data data sets.Page 4

Fuzzy Matching Programming Techniques Using SAS Software, continuedPharmaSUG 2020STEP #2: UNDERSTAND THE DISTRIBUTION OF DATA VALUES.To derive a more accurate picture of the data sources, we suggest that users conduct extensive data analysis byidentifying missing values, outliers, invalid values, minimum and maximum values, averages, value ranges, duplicateobservations, distribution of values, and the number of distinct values a categorical variable contains. This importantstep provides an understanding of the data, while leveraging the data cleaning and standardizing activities that will beperformed later. One of the first things data wranglers will want to do is explore the data using the SAS FREQprocedure, or an equivalent approach like Excel Pivot Tables.PROC FREQ Code:PROC FREQ DATA mydata.Movies with Messy Data ;TABLES ALL / NOCUM NOPERCENT MISSING ;RUN ;Reviewing the results, we see an assortment of data issues including “key” values and/or record duplication, dataaccuracy, inconsistent values, missing values, validation, capitalization versus mixed case, and incomplete (partial)data issues, as shown in Figure 3.Results:Figure 3: Distribution of Values from the FREQ Procedure.Determining the number of distinct values a categorical variable has is critical to the fuzzy matching process. Acquiringthis information helps everyone involved better understand the number of distinct variable levels, the unique valuesand the number of occurrences for developing data-driven programming constructs and elements. The following SAScode provides us with the number of By-group levels for each variable of interest we see in Figure 4.Page 5

Fuzzy Matching Programming Techniques Using SAS Software, continuedPROC FREQ Code:TITLE "By-group NLevels in Movies with Messy Data" ;PROC FREQ DATA mydata.Movies with Messy Data NLEVELS ;RUN ;Results:Figure 4: The number of By-group levels for each variable of interestPage 6PharmaSUG 2020

Fuzzy Matching Programming Techniques Using SAS Software, continuedPharmaSUG 2020Figure 4: The number of By-group levels for each variable of interest (continued)Figure 4: The number of By-group levels for each variable of interest (continued)Page 7

Fuzzy Matching Programming Techniques Using SAS Software, continuedPharmaSUG 2020STEP #3: PERFORM DATA CLEANING.Data cleaning, referred to as data scrubbing, is the process of identifying and fixing data quality issues includingmissing values, invalid character and numeric values, outlier values, value ranges, duplicate observations, and otheranomalies found in data sets. SAS provides many powerful ways to perform data cleaning tasks (Cody, 2017).Use SAS Functions to Modify DataSAS functions are an essential component of the SAS Base software. Representing a variety of built-in and callableroutines, functions serve as the “work horses” in the SAS software providing users with “ready-to-use” tools designedto ease the burden of writing and testing often lengthy and complex code for a variety of programming tasks. Theadvantage of using SAS functions is evidenced by their relative ease of use, and their ability to provide a moreefficient, robust and scalable approach to simplifying a process or programming task.It is sometimes necessary to concatenate fields when matching files, because the fields could be concatenated in onefile while separate in another. SAS functions span many functional categories, and this paper focuses on those thatare integral to the fuzzy matching process. The following is a list of alternative methods of concatenating stringsand/or variables together. Use the STRIP function to eliminate leading and trailing blanks, and then concatenate the stripped fields usingthe concatenation operator, and insert blanks between the stripped fields.Use one of the following CAT functions to concatenate fields: CAT, the simplest of concatenation functions, joins two or more strings and/or variables together, endto-end producing the same results as with the concatenation operator. CATQ is similar to the CATX function, but the CATQ function adds quotation marks to any concatenatedstring or variable. CATS removes leading and trailing blanks and concatenates two or more strings and/or variablestogether. CATT removes trailing blanks and concatenates two or more strings and/or variables together. CATX, perhaps the most robust CAT function, removes leading and trailing blanks and concatenatestwo or more strings and/or variables together with a delimiter between each.Explore Data Issues with SAS’ PROC FORMATProblems with inaccurately entered data often necessitate time-consuming validation activities. A popular techniqueused by many to identify data issues is to use the FORMAT procedure. In the next example, a user-defined format iscreated with PROC FORMAT, a SAS DATA step identifies data issues associated with the Category variable, and a SASPROC PRINT is specified to display the Category variable’s data issues, which are displayed in Figure 5.PROC FORMAT, DATA Step and PROC PRINT Code:PROC FORMAT LIBRARY WORK ;VALUE Category Validation'Action' 'Action Adventure' 'Action Cops & Robber' 'Action Sci-Fi' 'Adventure' 'Comedy' 'Drama' 'Drama Mysteries' 'Drama Romance' 'Drama Suspense' 'Horror' 'Action''Action Adventure''Action Cops & Robber''Action Sci-Fi''Adventure''Comedy''Drama''Drama Mysteries''Drama Romance''Drama Suspense''Horror'Page 8

Fuzzy Matching Programming Techniques Using SAS Software, continuedPharmaSUG 2020Other 'ERROR - Invalid Category'/* Other identified categories not listed */;RUN ;DATA Validate Category ;SET mydata.Movies with Messy Data ;Check Category PUT(Category, Category Validation.) ;IF Check Category 'ERROR - Invalid Category' THENDO ;OUTPUT ;END ;RUN ;PROC PRINT DATA work.Validate CategoryNOOBS N ;TITLE "Validation Report for Movie Category Variable" ;VAR Category Title Rating Length Studio Year ;RUN ;Results:Figure 5: Validation Report isolating Issues with the Movie Category Variable.Once the invalid movie categories are identified with the validation report, users have the option of using one or moredata cleaning techniques to manually correct, automating the process, or applying fuzzy matching techniques tocorrect (or handle) each invalid movie category.Add Categories, if Available, to the Start of the NameDoing this can eliminate matches that might occur if two businesses in the same general geographic area have thesame name (for example: Smith’s could describe a hardware store, a restaurant, or another type of business.) This isdone in Figure 1, where Category is in the third column.Remove Special or Extraneous CharactersPunctuation can differ even when names or titles are the same. Therefore, we remove the following characters: ‘ “ &? – from the movie title. For example, “National Lampoon’s Vacation” and “National Lampoons Vacation” refer to thesame movie title even though the former contains an apostrophe and the latter does not. Although the specialcharacters can be removed in a number of ways, the next example shows their removal from the Title variable in bothdata sets using the COMPRESS function. The results are displayed in Figure 6.Code to Remove Special Characters from Title and Perform Matching Process:data work.Movies Cleaned ;set mydata.Movies with messy data ;where title NE ‘’ ;title Page 9

Fuzzy Matching Programming Techniques Using SAS Software, continuedPharmaSUG 2020compress(Title,"'""&?-") ;/*Remove special chars from Title*/run ;title "Movies Data Set After Removing Special Characters" ;proc print data work.Movies Cleaned ;run ;data work.Actors Cleaned ;set mydata.Actors with messy data ;where title NE ‘’ ;title compress(Title,"'""&?-") ;/*Remove special chars from Title*/run ;title "Actors Data Set After Removing Special Characters" ;proc print data work.Actors Cleaned ;run ;proc sql ;title "Matched Rows from Movies and Actors" ;select DISTINCT M.Title, Rating, Length, Actor Leadingfrom work.Movies Cleaned M,work.Actors Cleaned Awhere M.Title A.Title ;quit ;Results:Figure 6: After the Removal of Special Characters and the Results from an Inner Join.Page 10

Fuzzy Matching Programming Techniques Using SAS Software, continuedPharmaSUG 2020Put All Characters in Upper-case Notation and Remove Leading BlanksDifferent data bases could have different standards for capitalization, and some character strings can be copied inwith leading blanks. As found in our example data sets the value contained in the Title variable can be stored as alllower-case, upper-case, or in mixed-case which can impact the success of traditional merge and join matchingtechniques. Consequently, to remedy the issue associated with case and leading blanks, we recommend using theSTRIP function to remove leading and trailing blanks along with the UPCASE function to convert all Title values touppercase characters. For users of other popular programming languages, there is generally an equivalent function, ormethod, available to handle these types of issues.Remove Words that might or might not Appear in Key FieldsCommonly used words in language, referred to as stop words, are frequently ignored by many search and retrievalprocesses. Stop words are classified as irrelevant and, as a result, are inserted into stop lists and are ignored. Examplesinclude The, .com, Inc, LTD, LLC, DIVISION, CORP, CORPORATION, CO., and COMPANY. Some data base tables mightinclude these, while others might not.Choose a Standard for AddressesAddress fields can present a challenge when analyzing and processing data sources. To help alleviate comparisonissues, decide whether to use Avenue or Ave, Road or Rd, Street or St, etc, and then convert the address fieldsaccordingly or create a user-defined lookup process using PROC FORMAT to match the standard values.Rationalize Zip Codes when Matching Addresses, Use Geocodes when AvailableWe found it useful to remove the last 4 digits of 9-digit zip codes, because some files might only have 5-digit zip codes.Since some files might have zip codes as numeric fields, and other files might have zip codes as character fields, makesure to include leading zeroes. For example, zip codes with a leading zero, as in 08514, would appear in a numericfield as 8514 requiring the leading zero to be inserted along with the specification of a Z5. informat and format beingassigned to the zip code variable.If working with US zip codes, make sure they are all numeric. This may not apply for other countries. One commonmistake to watch for is that sometimes Canada, with abbreviation CA, is put in as the state CA (California) instead ofthe country CA. Since Canada has an alphanumeric 6-character zip code, this, hopefully, will be caught when checkingfor numeric zip codes.If the program has access to geocodes, or if they are in the input data bases, geocodes can provide a further level ofvalidation in addition to the zip codes.Specify the DUPOUT , NODUPRECS, or NODUPKEYS OptionsA popular and frequently used procedure, PROC SORT, identifies and removes duplicate observations from a data set.By specifying one or more of the SORT procedure’s three options: DUPOUT , NODUPRECS, and NODUPKEYS, usersare able to control how duplicate observations are identified and removed.PROC SORT’s DUPOUT option is often used to identify duplicate observations before removing them from a data set.A DUPOUT option, often specified when a data set is too large for visual inspection, can be used with theNODUPKEYS or NODUPRECS options to name a data set that contains duplicate keys or entire observations. In the nextexample, the DUPOUT , OUT and NODUPKEY options are specified to identify duplicate keys. The NODUPKEY optionremoves observations that have the same key values, so that only one remains in the output data set. The PROC SORTis followed by the PROC PRINT procedure so that the results can be examined.Page 11

Fuzzy Matching Programming Techniques Using SAS Software, continuedPharmaSUG 2020PROC SORT and PROC PRINT Code:PROC SORT DATA mydata.Movies with Messy DataDUPOUT work.Movies Dupout NoDupkeyOUT work.Movies Sorted Cleaned NoDupkeyNODUPKEY ;BY Title ;WHERE Title NE “” ;RUN ;PROC PRINT DATA work.Movies Dupout NoDupkey ;TITLE “Observations Slated for Removal” ;RUN ;PROC PRINT DATA work.Movies Sorted Cleaned NoDupkey ;TITLE “Cleaned Movies Data Set” ;RUN ;The results of the above SAS code are shown in Figure 7. The NODUPKEY option retains only one observation from anygroup of observations with duplicate keys. When Observations with identical key values are not adjacent to eachother, users may first need to specify the NODUPKEY or NODUPKEYS option and sort the data set by all the variables(BY ALL ;) to ensure the observations are in the correct order to remove all duplicates (SAS Usage Note 1566, 2000;Lafler, 2017).Results:Figure 7: Observations Slated for Removal and the Cleaned Movies Data Set.Page 12

Fuzzy Matching Programming Techniques Using SAS Software, continuedPharmaSUG 2020Although the removal of duplicates using PROC SORT is a popular technique among many SAS users, an element ofcare should be given to using this method when processing large data sets. Since sort operations can often be CPUintensive, the authors of this paper recommend comparing PROC SORT to procedures like SAS PROC SQL with theSELECT DISTINCT keyword and/or SAS PROC SUMMARY with the CLASS statement to determine the performanceimpact of one method versus another.STEP #4: PERFORM DATA TRANSFORMATIONS.Data transformations can be required to compare files. Data set structures sometimes need to be converted fromwide to long or long to wide and files may need to be reconciled by having their variables grouped in different ways.When a data set’s structure and data is transformed, we typically recommend that a new data set be created from theoriginal one. SAS’ PROC TRANSPOSE is handy for restructuring data in a data set, and is typically used in preparationfor special types of processing like array processing. In its simplest form, data can be transformed with or withoutgrouping. In the next example, the Movies data set is first sorted in ascending order by the variable RATING then thesorted data set is transposed using the RATING variable as the by-group variable. The result is shown in Figure 8, and itgives all of the titles within each rating.PROC TRANSPOSE Code:PROC SORT DATA mydata.Movies with Messy DataOUT work.Movies Sorted ;BY Rating ; /* BY-Group to Transpose */WHERE Title NE “” ;RUN ;PROC TRANSPOSE DATA work.Movies SortedOUT work.Movies Transposed ;VAR Title ; /* Variable to Transpose */BY Rating ; /* BY-Group to Transpose */RUN ;PROC PRINT DATA work.Movies Transposed ;RUN ;Results:Figure 8: Results from Performing a Data Transform with the TRANSPOSE Procedure.STEP 5: PROCESS EXACT MATCHES.Since we are trying to match entries that do not have an exact match, we can save processing time by immediatelyeliminating the observations (or rows) with missing key information. This can be accomplished in a number of ways,including constructing IF-THEN/ELSE or WHERE logic to bypass processing observations with missing movie titles.Another approach to bypass processing observations with missing movie titles could be to use the NODUP orNODUPKEY parameter with SAS’ PROC SORT (more detail on these options will be presented later). Once missingobservations with missing keys are eliminated, the focus can then be turned to processing observations that haveexact matches on name, address, and as with our example data sets, the Title variable, as shown in Figure 9. We alsoPage 13

Fuzzy Matching Programming Techniques Using SAS Software, continuedPharmaSUG 2020process and retain the observations that have mismatches on the Title variable, as shown in Figure 10; theobservations that did not have exact matches on the Title variable from the Movies data set, as shown in Figure 11;and the observations that did not have exact matches on the Title variable from the Actors data set, as shown inFigure 12.PROC SORT, DATA Step and PROC PRINT Code:proc sort data mydata.Actors with messy dataout work.Actors Sorted ;where Title NE "" ;by Title ;run ;proc sort data mydata.Movies with messy dataout work.Movies Sorted ;where Title NE "" ;by Title ;run ;data work.Matches(DROP Title)work.MisMatches(DROP Title)work.Movies with Unmatched Obs(KEEP Title Length CategoryYear Studio Rating)work.Actors with Unmatched Obs(KEEP Title Actor LeadingActor Supporting) ;merge work.Movies Sorted (IN M)work.Actors Sorted (IN A) ;by Title ;if M then Title from Movies Title ;if A then Title from Actors Title ;if M and A then output work.Matches ;else if NOT M or NOT A then output work.MisMatches ;if M and NOT A then output work.Movies with Unmatched Obs ;else if A and NOT M thenoutput work.Actors with Unmatched Obs ;run ;proc print data work.Matches N ;title "Matched Observations with Missing Keys Eliminated" ;var Title from Movies Title from Actors Length Category YearStudio Rating Actor Leading Actor Supporting ;run ;proc print data work.MisMatches N ;title "MisMatched Observations with Missing Keys Eliminated" ;var Title from Movies Title from Actors Length Category YearStudio Rating Actor Leading Actor Supporting ;run ;proc print data work.Movies with Unmatched Obs N ;title "Movies with UnMatched Observations" ;var Title Length Category Year Studio Rating ;run ;proc print data work.Actors with Unmatched Obs N ;title "Actors with UnMatched Observations" ;var Title Actor Leading Actor Supporting ;run ;Page 14

Fuzzy Matching Programming Techniques Using SAS Software, continuedResults:Figure 9: Matched Observations.Figure 10: Mismatched Observations.Page 15PharmaSUG 2020

Fuzzy Matching Programming Techniques Using SAS Software, continuedPharmaSUG 2020Figure 11: UnMatched Movies Observations.Figure 12: UnMatched Actors Observations.STEP 6: MATCH KEY FIELDS USING FUZZY MATCHING TECHNIQUES.Once the data has been cleaned and transformed, a variety of fuzzy matching techniques are available for use. Asmentioned in (Dunn, 2014), these techniques are designed to be used in a systematic way when a reliable keybetween data sources is nonexistent, inexact, or unreliable.Fuzzy matching techniques are available with most, if not all, the leading software languages including R, Python, Java,and others (RosettaCode, 2018). SAS Institute offers four techniques for its users: the Soundex (phonetic matching)algorithm, and the SPEDIS, COMPLEV, and COMPGED functions to help make fuzzy matching easier and more effective(Sloan and Lafler, 2018).APPLY THE SOUNDEX ALGORITHMThe Soundex (phonetic matching) algorithm involves matching files on words that sound alike. As one of the earliestfuzzy matching techniques, Soundex was invented and patented by Margaret K. Odell and Robert C. Russell in 1918and 1922 to help match surnames that sound alike. It is limited to finding phonetic matches and adheres to thefollowing rules when performing a search: Ignores case (case insensitive);Ignores embedded blanks and punctuations;Is better at finding English-sounding names.Page 16

Fuzzy Matching Programming Techniques Using SAS Software, continuedPharmaSUG 2020Although the Soundex algorithm does a fairly good job with English-sounding names, it frequently falls short whendealing with the multitude of data sources found in today’s world economy where English- and non-English soundingnames are commonplace. It also has been known to miss similar-sounding surnames like Rogers and Rodgers whilematching dissimilar surnames such as Smith, Snthe and Schmitt (Foley, 1999).So, how does the Soundex algorithm work? As implemented, SAS determines whether a name (or a variable’scontents) sounds like another by converting each word to a code. The value assigned to the code consists of the firstletter in the word followed by one or more digits. Vowels, A, E, I, O and U, along with H, W, Y, and non-alphabeticalcharacters do not receive a coded value and are ignored; and double letters (e.g., ‘TT’) are assigned a single code valuefor both letters. The codes derived from each word conform to the letters and values are found in Table 1.SOUNDEX Algorithm RulesLetterValueB, P, F, V1C, S, G, J, K, Q, X, Z2D, T3L4M, N5R6Table 1: Soundex Algorithm RulesThe general syntax of the Soundex algorithm takes the form of:Variable * “character-string”To examine how the movie title, Rocky, is assigned a value of R22, R has a value of 6 but is retained as R, O is ignored,C is assigned a value of 2, K is assigned a value of 2, and Y is ignored. The converted code for “Rocky” is then matchedwith any other name that has the same assigned code.In the next example, we use the Soundex algorithm’s * operator in a simple DATA step WHERE statement with thework.Movies with Unmatched Obs data set created in Step #5 earlier, to find similar sounding Movie Titles.DATA Step Code with SOUNDEX Algorithm:DATA work.Soundex Matches ;SET work.Movies with Unmatched Obs ;WHERE Title * “Michael” ;RUN ;PROC PRINT DATA work.Soundex Matches NOOBS ;TITLE “Soundex Algorithm Matches” ;RUN ;In the next example, the Soundex algorithm is illustrated using the * operator in a simple SAS PROC SQL step with aWHERE-clause to find similar sounding Movie Titles.Page 17

Fuzzy Matching Programming Techniques Using SAS Software, continuedPharmaSUG 2020PROC SQL Code with SOUNDEX Algorithm:proc sql ;select *from work.Movies with Unmatched Obswhere Title * "Michael" ;quit ;The results from both SOUNDEX algorithm examples are displayed in Figure 13.Results:Figure 13: The result of the Soundex match for “Michael”APPLY THE SPEDIS FUNCTIONThe SPEDIS, or Spelling Distance, function and its two arguments evaluate possible matching scenarios by translating akeyword into a query containing the smallest distance value. Because the SPEDIS function evaluates numerousscenarios, it can experience varying performance issues in comparison to other matching techniques. The SPEDISfunction evaluates query and keyword arguments returning non-negative spelling distance values. A derived value ofzero indicates an exact match. Generally, derived values are less than 100, but, on occasion, can exceed 200. Theauthors have used and recommend using the SPEDIS function to control the matching process by specifying spellingdistance values greater than zero and in increments of 10 (e.g., 10, 20, etc.).So, how does the SPEDIS function work? As implemented, the SPEDIS function determines whether two names (orvariables’ contents) are alike by computing an asymmetric spelling distance between two words. The SPEDIS functioncomputes the costs associated with converting the keyword to the query, as illustrated in Table 2.SPEDIS Cost RulesOperationCostDescriptionMatch0No changeSinglet25Delete one of a double letterDoublet50Double a letterSwap50Reverse the order of t

Fuzzy Matching Programming Techniques Using SAS Software, continued PharmaSUG 2020 Page 8 STEP #3: PERFORM DATA CLEANING. Data cleaning, referred to as data scrubbing, is the process of identifying and fixing data quality issues including