Market Report

‘Free’ Data Profiling Tools

A Market Report by Bloor Research
Author: Philip Howard
Publish date: January 2014
“... you should be able to install, get running and start profiling within less than a day ... so it is possible to try (products) out quickly to see if they suit your needs.”
Philip Howard
Introduction

There are a number of data profiling and discovery tools that are available for free download. Needless to say, a number of these are open source products but there are also, notably, proprietary vendors that make their products available as free-to-use offerings, albeit sometimes with restrictions on who can use them and how. In addition, of course, there are also suppliers that make their products available on a try-before-you-buy basis. In other words, you can use the product for free for a limited period of time.

Such offers are commonplace across the software industry but they are especially apposite within the context of data profiling and discovery. It is worth considering why that is before we compare the various free-to-use products that are available.
Why profile?

Profiling and discovery software does three things:

1. It can analyse a database, or subset of a database, for errors.
2. It can monitor a dataset for errors on an on-going basis, typically presenting the results via a dashboard.
3. It can establish (discover) the relationships that exist between data elements, not just within a single database but also across and between heterogeneous data sources.

The second of these capabilities is unlikely to be used by people just trying out the software, as opposed to those that are already committed to a data quality or governance initiative and, for this reason, we will confine our discussion to the first and third of the three things that profiling and discovery software can do. Moreover, we will discuss these capabilities from both a business and an IT point of view, although we should advise readers of either persuasion to read both of the following sections and not just that which pertains to their role.

The business perspective

Despite the fact that data quality products have been in existence for the best part of 20 years, it is still the case that a great many companies do not have any data quality or governance initiatives in place or, where they do exist, they are limited to a single (usually marketing) department.

According to “The data revolution – liberating lost budget”, a report published in 2012 by Experian, “nearly 90% of companies admitted to wasting departmental budget as a result of duplicated mailings, lost contacts and missed sales opportunities, which is all down to inaccurate data. Departments such as Marketing, Sales, Operations and Customer Services report wasting 15% of their budget on average, while in IT and Data Management this rises to 18%, or about a sixth of the overall budget.

In customer-focused areas, the cost of poor data quality is particularly high. More than 80% of the companies quizzed operate customer loyalty programmes, and two thirds of these report that inaccurate data has had a negative impact on their programmes. In terms of lost custom and deteriorating reputation, this has a significant impact.” Conversely, “companies investing in improving data accuracy believe that they generate an average of nearly 1 million in additional profit.”

There are endless such examples of various bodies estimating the costs of poor quality data. The result of another, conducted by Forbes on behalf of SAP, is illustrated in Figure 1. Unfortunately, despite this wealth of evidence there remain significant numbers of executives who think that this is just an IT problem and not a business issue. As the examples quoted demonstrate, this is not the case.

Figure 1: What do you estimate that data-related issues cost your company annually?

A major potential benefit of free profiling is therefore to demonstrate to an unwilling management that there are issues with corporate data that are directly affecting the performance of your organisation. You should be able to install, get running and start profiling within less than a day so that, within a very short period of time and with negligible cost (just some time), data irregularities can be demonstrated and, with appropriate software, you can assign values to data quality, not only so that you prioritise remediation but also in order to estimate the value of such a process. Your measurement of real data quality issues should allow you to justify the establishment of appropriately funded, business-focused, business-sponsored data quality and data governance processes.

The IT perspective

One of the difficulties with justifying investment in data quality is precisely that business executives often deny the possibility of error, or under-estimate its extent, or decry its potential benefits. They are also inclined to think that it is an IT problem. As the preceding discussion demonstrates, this is not the case. However, there is some truth in it, as some IT functions will fail or run over-budget because of data quality issues.

The most well-known example of an IT function in which data quality is an issue is data warehousing.
TDWI (The Data Warehousing Institute), back in 2002, estimated that poor data quality in data warehouses and data marts cost American businesses $600bn per annum. More recent estimates suggest that that figure is now around $700bn. Now, this may be an IT function but we would argue that with figures that big it is actually a business issue.

Another IT function that is really a business issue but is often relegated to IT, and which depends on data profiling and discovery, is data migration. That is, where you are migrating from one version of an SAP, Oracle or similar application to another, when you are migrating from one database to another or when you are consolidating databases. Such projects can fail or significantly overrun their budgets if data quality is not up to scratch. It is important to understand why.

There are two types of data error. Data may be incorrect or invalid. For example, an email address may be wrong but it may still be in the correct format. Alternatively, that email address may not be in the right format at all (for instance, using “&” instead of “@”). If you load an email address that is valid but incorrect into your new system then that will not impact the running of that system. It may depreciate its value, which is what we have previously been discussing, but it will still run. If you load an invalid value, however, then the load may fail or, in the worst instance, it could actually crash the new system.

For this reason it is best practice to profile your data before you even start your migration (or archival: the same principles apply) project in order to determine the scale of the data quality problem you face. It is only after you have done that that you can reasonably estimate the resources and costs that the migration project is likely to require. Thus, profiling your data using a ‘free’ tool prior to a project of this type is another good use case.

One final consideration relates to the third (discovery) capability associated with profiling tools. In data migration (and archival) projects it is important to migrate ‘business entities’ as a whole, by which we mean, for example, a customer with his invoices, service history, delivery addresses and so on. However, discovering what constitutes a business entity is a non-trivial exercise and it is something that is enabled by this discovery aspect of profiling tools. It should be noted that some vendors (see next section) have put much more effort into extending their discovery capabilities than others, and this becomes especially critical in distributed environments where you are, perhaps, consolidating data from multiple different sources and understanding relationships across those sources becomes more complex.
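The distinction drawn in this section between invalid and incorrect values can be made concrete with a minimal sketch. It is an illustration only, not how any of the products discussed in this report work: the regular expression is a deliberately loose assumption about what counts as a well-formed email address, and the function name `profile_email_column` is hypothetical. A check like this can only detect invalid values; a well-formed address may still be incorrect.

```python
import re
from collections import Counter

# Assumed, deliberately loose format rule: exactly one "@", a non-empty local
# part and a dotted domain. This is not a full RFC 5322 validator.
EMAIL_FORMAT = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def profile_email_column(values):
    """Count missing, invalid-format and well-formed values in one column."""
    counts = Counter()
    invalid_samples = []
    for value in values:
        if value is None or value.strip() == "":
            counts["missing"] += 1
        elif EMAIL_FORMAT.match(value.strip()):
            # Well-formed only: the address may still be incorrect
            # (wrong person, closed mailbox), which profiling cannot see.
            counts["well_formed"] += 1
        else:
            counts["invalid_format"] += 1
            invalid_samples.append(value)
    return counts, invalid_samples


# Hypothetical sample column, including the "&" instead of "@" case.
sample = ["jane@example.com", "john&example.com", "", None, "bob@example"]
counts, invalid = profile_email_column(sample)
```

Run against a source table before a migration, counts of this kind give a first estimate of how many rows would be rejected by, or could break, the load into the target system.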
The products

We should re-iterate that we are here concerned with free-to-use software as opposed to a limited period free trial offer. While they might appear to be comparable, we do not believe that is the case. The advantage of free software is that you can download the software when you want to and you can trial it when you want to. If you have a limited period offer then you have to be committed to use the software within that period; if something else comes up that you urgently need to attend to, you can’t put your data profiling software aside to be used when you feel like it: you have to use it now or you will lose it. You can of course extend a trial by contacting the sales team of the software vendor, but that may involve a lot more than a simple phone call or email. With free-to-use software you can trial it whenever you need to. Moreover, if you are using the software to demonstrate to executives that you have a data quality problem that needs to be addressed then they will take more convincing than a single run through of the data. It is entirely possible that once the matter has been raised then it will be escalated to different levels within the organisation, which may require additional evidence. This whole process, like it or not, may take months. Thus we believe that limited time, try-before-you-buy offers are most suitable when you have already decided that you are going to go ahead with a data quality, migration or governance project and you want to decide which product to use, as opposed to establishing that you have a data quality issue in the first place.

Having said all of this, it would be disingenuous not to at least mention the vendors offering a free limited time trial version of their products and these are, most notably, CloverETL, Datiris and DataLynx. However, we will focus on the free-to-use market, which consists of:

• Talend, which is the leading open source vendor in this market.

• Ataccama, a proprietary vendor that makes its data profiling software free-to-use as an encouragement for those users to license its data quality software. Note that this (but not the free download) is also available from iWay (a division of Information Builders), which OEMs Ataccama’s software.

• X88. A proprietary supplier of data profiling, discovery, quality and migration software. Its free-to-use software is a constrained version of the Pandora Profiling Edition: it only runs on Windows, the licence is for a single user, it is limited to 50 tables with no more than one million rows per table and the licence is, in fact, limited to an automatically renewable six-month period. One further point is that the repository created with the free profiler can be carried forward and used by the full-function product. This is because they are actually the same product but controlled by licence key.

• SQL Power, which is an open source provider of tools to support data warehousing. As we shall see, SQL Power Architect is a data modelling tool that has some data profiling features but the emphasis is very much on data modelling with profiling as a ‘nice to have’.

• DataCleaner is another open source tool, which, as its name suggests, covers the full gamut of data quality capabilities. Like Talend and SQL Power it offers Professional and Enterprise Editions in addition to its Community Edition. However, the company does not provide commercial support for its product, which is instead provided by Human Inference, the Dutch MDM specialist. One notable downside of this product is that in the Community Edition information is stored in a file-based repository whereas the other two editions use a database-based repository. This will impair performance in the Community Edition and make migration to one of the paid-for options more complex.

• Open Source Data Quality and Profiling. This is downloadable from SourceForge. In other words, there is no commercial version or support for this product. That makes it difficult to get information about the product (indeed, experience has proved that the founders and developers of small open source projects do not have the time or inclination to respond to analysts’ requests for information), so we are having to rely on publicly available information here. In fact, we had initially thought that this was the same product as DataCleaner but since the product version numbers are different we have had to assume that this is not the case.

• AMB Data Profiling. This is another open source project downloadable from SourceForge. It was started in 2010 but its web site has not had an update in nine months, which suggests that nothing much is happening with it.
In the Bullseye diagram that follows, we have relied on Bloor Research’s recently published Market Update on Data Profiling and Discovery for the positioning of Talend, Ataccama and X88. However, it is worth briefly outlining the strengths and weaknesses of these products:

• Ataccama: there is no feature of this product that is not also available in one or other, or both, of the other two products.

• Talend: has extensive support for NoSQL, will run on a Hadoop platform, has support for non-normalised and specialised data types (such as nested tables and small integers respectively, both in Oracle databases) and support for COBOL copybooks. In terms of functionality the product is not as rich as X88’s Pandora and, where it does have equivalent features, these may be in its data quality rather than its data profiling product.

• X88: lacks the source support offered by Talend and relies wholly on JDBC. Offers a number of features not included in either of the other two products, such as overlap and precedence analysis, discovery of matching keys, the ability to discover personally identifiable information (PII) and similar data, support for the definition and use of business terms and the ability to construct reference data based on profiling results. We would also expect Pandora to out-perform either of the other two, based on its architecture.

However, the remaining products were not included in that paper, nor has Bloor Research previously published any material relating to any of these products, so it is appropriate to also briefly discuss these. The first thing to say is that all of these lack cross-database discovery capabilities. They also lack support for joins, patterns (in some cases; DataCleaner is an exception) and drill-down/navigation, and there is no support for multi-column key analysis or relationship discovery/analysis/validation. DataCleaner does have some nice visualisations. Both this and the Open Source Data Quality and Profiling software have support for some NoSQL sources but it is unclear whether this includes data profiling as well as cleansing. In the case of the latter product it is SQL-based, which suggests not. It would also imply that profiling of non-relational sources such as CSV or Excel files (which are supported for cleansing purposes) will be [...]

Figure 2: The highest scoring companies are nearest the centre. The analyst then defines a benchmark score for a domain leading company from their overall ratings and all those above that are in the champions segment. Those that remain are placed in the Innovator segment if their innovation rating is over 2.5 and Challenger if it is less than 2.5. The exact position in each segment is calculated based on their combined innovation and overall score.
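To give a sense of what overlap analysis and the discovery of matching keys involve, the following is a minimal sketch. It makes no claim about how Pandora or any other product implements these features. The function names, the 0.9 coverage threshold and the sample tables are all assumptions made for illustration. The idea is that a column whose distinct values are almost entirely contained in a unique column of another table is a candidate foreign key.

```python
from itertools import permutations


def distinct(values):
    """Distinct non-null values of a column."""
    return {v for v in values if v is not None}


def is_unique(values):
    """True if the non-null values contain no duplicates (a candidate key)."""
    non_null = [v for v in values if v is not None]
    return len(non_null) == len(set(non_null))


def candidate_relationships(columns, threshold=0.9):
    """Return (child, parent, coverage) tuples where the child's distinct
    values are mostly contained in a unique column of a different table.

    `columns` maps (table_name, column_name) to a list of values; the
    threshold is an assumed cut-off, not a recommended setting.
    """
    results = []
    for (child, child_vals), (parent, parent_vals) in permutations(columns.items(), 2):
        # Skip pairs within one table and parents that cannot be keys.
        if child[0] == parent[0] or not is_unique(parent_vals):
            continue
        child_set, parent_set = distinct(child_vals), distinct(parent_vals)
        if not child_set:
            continue
        coverage = len(child_set & parent_set) / len(child_set)
        if coverage >= threshold:
            results.append((child, parent, coverage))
    return results


# Hypothetical data from two separate sources.
columns = {
    ("crm.customer", "id"): [1, 2, 3, 4],
    ("billing.invoice", "cust_ref"): [1, 1, 2, 4, 4],
    ("billing.invoice", "amount"): [10.0, 25.5, 10.0, 99.0, 5.0],
}
matches = candidate_relationships(columns)
```

Comparing every pair of columns in this way grows quadratically with the number of columns, which is one reason why cross-source discovery at scale is a distinguishing capability and not a trivial add-on.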
Conclusion

As mentioned previously, you should be able to install, get running and start profiling within less than a day and most of the products offer training videos and documents, so it is possible to try them out quickly to see if they suit your needs. More specifically, a great deal will depend on how likely you think your company is to adopt a data quality programme. If that is a real possibility then being able to carry forward your results into production systems could be a significant factor, as could commercial support.

These considerations will tend to mean that one of the three major vendors (Talend, X88 and Ataccama) should be preferred. Conversely, if the chances are remote then you can only go on the features of the products as they stand. Of course, Pandora will be ruled out if you want to exceed the constraints that X88 places on its free-to-use version or if you want to work with NoSQL databases, but otherwise X88 is one of the leading vendors in this market and it will come as no surprise that Pandora is the highest rated of the products covered here.

Nor should it be a shock that Talend and Ataccama (in that order) are more richly featured than any of the other products featured in this paper. More generally, Talend may be more familiar and reassuring to a technical audience whereas X88’s ease of use is appreciated by less-technical data analysts. If we were going to try a product other than from the three major vendors, it would be DataCleaner. SQL Power Architect is an interesting product but not for data profiling per se.

Further Information

Further information is available from http://www.BloorResearch.com/update/2203
Bloor Research overview

Bloor Research is one of Europe’s leading IT research, analysis and consultancy organisations. We explain how to bring greater Agility to corporate IT systems through the effective governance, management and leverage of Information. We have built a reputation for ‘telling the right story’ with independent, intelligent, well-articulated communications content and publications on all aspects of the ICT industry. We believe the objective of telling the right story is to:

• Describe the technology in context to its business value and the other systems and processes it interacts with.
• Understand how new and innovative technologies fit in with existing ICT investments.
• Look at the whole market and explain all the solutions available and how they can be more effectively evaluated.
• Filter “noise” and make it easier to find the additional information or news that supports both investment and implementation.
• Ensure all our content is available through the most appropriate channel.

Founded in 1989, we have spent over two decades distributing research and analysis to IT user and vendor organisations throughout the world via online subscriptions, tailored research services, events and consultancy projects. We are committed to turning our knowledge into business value for you.

About the author

Philip Howard
Research Director - Data Management

Philip started in the computer industry way back in 1973 and has variously worked as a systems analyst, programmer and salesperson, as well as in marketing and product management, for a variety of companies including GEC Marconi, GPT, Philips Data Systems, Raytheon and NCR.

After a quarter of a century of not being his own boss Philip set up his own company in 1992 and his first client was Bloor Research (then ButlerBloor), with Philip working for the company as an associate analyst. His relationship with Bloor Research has continued since that time and he is now Research Director focused on Data Management.

Data management refers to the management, movement, governance and storage of data and involves diverse technologies that include (but are not limited to) databases and data warehousing, data integration (including ETL, data migration and data federation), data quality, master data management, metadata management and log and event management. Philip also tracks spreadsheet management and complex event processing.

In addition to the numerous reports Philip has written on behalf of Bloor Research, Philip also contributes regularly to IT-Director.com and IT-Analysis.com and was previously editor of both “Application Development News” and “Operating System News” on behalf of Cambridge Market Intelligence (CMI). He has also contributed to various magazines and written a number of reports published by companies such as CMI and The Financial Times. Philip speaks regularly at conferences and other events throughout Europe and North America.

Away from work, Philip’s primary leisure activities are canal boats, skiing, playing Bridge (at which he is a Life Master), dining out and walking Benji the dog.
Copyright & disclaimer

This document is copyright 2014 Bloor Research. No part of this publication may be reproduced by any method whatsoever without the prior consent of Bloor Research.

Due to the nature of this material, numerous hardware and software products have been mentioned by name. In the majority, if not all, of the cases, these product names are claimed as trademarks by the companies that manufacture the products. It is not Bloor Research’s intent to claim these names or trademarks as our own. Likewise, company logos, graphics or screen shots have been reproduced with the consent of the owner and are subject to that owner’s copyright.

Whilst every care has been taken in the preparation of this document to ensure that the information is correct, the publishers cannot accept responsibility for any errors or omissions.
2nd Floor, 145–157 St John Street
LONDON, EC1V 4PY, United Kingdom
Tel: +44 (0)207 043 9750
Fax: +44 (0)207 043 9748
Web: www.BloorResearch.com