Transcription

i

The Data Warehouse Lifecycle ToolkitTable of ContentsChapter 1- The Chess PiecesSection 1 - Project Management and RequirementsChapter 2Chapter 3Chapter 4- The Business Dimensional Lifecycle- Project Planning and Management- Collecting the RequirementsSection 2 - Data DesignChapter 5Chapter 6Chapter 7- A First Course on Dimensional Modeling- A Graduate Course on Dimensional Modeling- Building Dimensional ModelsSection 3 - ArchitectureChapter 8Chapter 9Chapter 10Chapter 11Chapter 12Chapter 13- Introducing Data Warehouse Architecture- Back Room Technical Architecture- Architecture for the Front Room- Infrastructure and Metadata- A Graduate Course on the Internet and Security- Creating the Architecture Plan and Selecting ProductsSection 4 – ImplementationChapter 14Chapter 15Chapter 16Chapter 17- A Graduate Course on Aggregates- Completing the Physical Design- Data Staging- Building End User ApplicationsSection 5 - Deployment and GrowthChapter 18 - Planning the DeploymentChapter 19 - Maintaining and Growing the Data Warehouseii

The Purpose of Each Chapter1. The Chess Pieces. As of the writing of this book, a lot of vague terminology wasbeing tossed around in the data warehouse marketplace. Even the term datawarehouse has lost its precision. Some people are even trying to define the datawarehouse as a nonqueryable data resource! We are not foolish enough to think wecan settle all the terminology disputes in these pages, but within this book we willstick to a very specific set of meanings. This chapter briefly defines all the importantterms used in data warehousing in a consistent way. Perhaps this is something likestudying all the chess pieces and what they can do before attempting to play a chessgame. We think we are pretty close to the mainstream with these definitions.Section 1: Project Management and Requirements2. The Business Dimensional Lifecycle. We define the complete BusinessDimensional Lifecycle from 50,000 feet. We briefly discuss each step and giveperspective on the lifecycle as a whole.3. Project Planning and Management. In this chapter, we define the project and talkabout setting its scope within your environment. We talk extensively about thevarious project roles and responsibilities. You won’t necessarily need a full headcountequivalent for each of these roles, but you will need to fill them in almost anyimaginable project. This is a chapter for managers.4. Collecting the Requirements. Collecting the business and data requirements is thefoundation of the entire data warehouse effort—or at least it should be. Collecting therequirements is an art form, and it is one of the least natural activities for an ISorganization. We give you techniques to make this job easier and hope to impressupon you the necessity of spending quality time on this step.Section 2: Data Design5. A First Course on Dimensional Modeling. We start with an energetic argument forthe value of dimensional modeling. We want you to understand the depth of ourcommitment to this approach. After performing hundreds of data warehouse designsand installations over the last 15 years, we think this is the only approach you canuse to achieve the twin goals of understandability and performance. We then revealthe central secret for combining multiple dimensional models together into a coherentwhole. This secret is called conformed dimensions and conformed facts. We call thisapproach the Data Warehouse Bus Architecture. Your computer has a backbone,called the computer bus, that everything connects to, and your data warehouse has abackbone, called the data warehouse bus, that everything connects to. Theremainder of this chapter is a self-contained introduction to the science ofdimensional modeling for data warehouses. This introduction can be viewed as anappendix to the full treatment of this subject in Ralph Kimball’s earlier book, The DataWarehouse Toolkit.6. A Graduate Course on Dimensional Modeling. Here we collect all the hardestdimensional modeling situations we can think of. Most of these examples come fromspecific business situations, such as dealing with a monster customer list.7. Building Dimensional Models. In this chapter we tackle the issue of how to createthe right model within your organization. You start with a matrix of data marts anddimensions, and then you design each fact table in each data mart according to thetechniques described in Chapter 5. The last half of this chapter describes the real-lifemanagement issues in applying this methodology and building all the dimensionalmodels needed in each data mart.Section 3: Architecture8. Introducing Data Warehouse Architecture. In this chapter we introduce all thecomponents of the technical architecture at a medium level of detail. This paints theoverall picture. The remaining five chapters in this section go into the specific areasof detail. We divide the discussion into data architecture, application architecture, andinfrastructure. If you follow the Data Warehouse Bus Architecture we developed inChapter 5, you will be able to develop your data marts one at a time, and you will endiii

up with a flexible, coherent overall data warehouse. But we didn’t say it would beeasy.9. Technical Back Room Architecture. We introduce you to the system componentsin the back room: the source systems, the reporting instance, the data staging area,the base level data warehouse, and the business process data marts. We tell youwhat happened to the operational data store (ODS). We also talk about all theservices you must provide in the back room to get the data ready to load into yourdata mart presentation server.10. Architecture for the Front Room. The front room is your publishing operation. Youmake the data available and provide an array of tools for different user needs. Wegive you a comprehensive view of the many requirements you must support in thefront room.11. Infrastructure and Metadata. Infrastructure is the glue that holds the datawarehouse together. This chapter covers the nuts and bolts. We deal with the detailwe think every data warehouse designer and manager need to know abouthardware, software, communications, and especially metadata.12. A Graduate Course on the Internet and Security. The Internet has a potentiallyhuge impact on the life of the data warehouse manager, but many data warehousemanagers are either not aware of the true impact of the Internet or they are avoidingthe issues. This chapter will expose you to the current state of the art on Internetbased data warehouses and security issues and give you a list of immediate actionsto take to protect your installation. The examples throughout this chapter are slantedtoward the exposures and challenges faced by the data warehouse owner.13. Creating the Architecture Plan and Selecting Products. Now that you are asoftware, hardware, and infrastructure expert, you are ready to commit to a specificarchitecture plan for your organization and to choose specific products. We talk aboutthe selection process and which combination of product categories you need. Bear inmind this book is not a platform for talking about specific vendors, however.Section 4: Implementation14. A Graduate Course on Aggregations. Aggregations are prestored summaries thatyou create to boost performance of your database systems. This chapter divesdeeply into the structure of aggregations, where you put them, how you use them,and how you administer them. Aggregations are the single most cost-effective way toboost performance in a large data warehouse system assuming that the rest of yoursystem is constructed according to the Data Warehouse Bus Architecture.15. Completing the Physical Design. Although we don’t know which DBMS and whichhardware architecture you will choose, there are a number of powerful ideas at thislevel that you should understand. We talk about physical data structures, indexingstrategies, specialty databases for data warehousing, and RAID storage strategies.16. Data Staging. Once you have the major systems in place, the biggest and riskieststep in the process is getting the data out of the legacy systems and loading into thedata mart DBMSs. The data staging area is the intermediate place where you bringthe legacy data in for cleaning and transforming. We have a lot of strong opinionsabout what should and should not happen in the data staging area.17. Building End User Applications. After the data is finally loaded into the DBMS, westill have to arrange for a soft landing on the users’ desktops. The end userapplications are all the query tools and report writers and data mining systems forgetting the data out of the DBMS and doing something useful. This chapter describesthe starter set of end user applications you need to provide as part of the initial datamart implementation.Section 5: Deployment and Growth18. Planning the Deployment. When everything is ready to go, you still have to roll thesystem out and behave in many ways like a commercial software vendor. You needto install the software, train the users, collect bug reports, solicit feedback, andrespond to new requirements. You need to plan carefully so that you can deliveraccording to the expectations you have set.iv

19. Maintaining and Growing the Data Warehouse. Finally, when your entire datamart edifice is up and running, you have to turn around to do it again! As we saidearlier, the data warehouse is more of a process than a project. This chapter is anappropriate end for the book, if only because it leaves you with a valuable lastimpression: You are never done.Supporting Tools Appendix A. This appendix summarizes the entire project plan for the BusinessDimensional Lifecycle in one place and in one format. All of the project tasks and rolesare listed. Appendix B. This appendix is a guided tour of the contents of the CD-ROM. All of theuseful checklists, templates, and forms are listed. We also walk you through how to useour sample design of a Data Warehouse Bus Architecture. CD-ROM. The CD-ROM that accompanies the book contains a large number of actualchecklists, templates, and forms for you to use with your data warehouse development. Italso includes a sample design illustrating the Data Warehouse Bus ArchitectureThe Goals of a Data WarehouseOne of the most important assets of an organization is its information. This asset isalmost always kept by an organization in two forms: the operational systems of recordand the data warehouse. Crudely speaking, the operational systems of record are wherethe data is put in, and the data warehouse is where we get the data out. In The DataWarehouse Toolkit, we described this dichotomy at length. At the time of this writing, it isno longer so necessary to convince the world that there are really two systems or thatthere will always be two systems. It is now widely recognized that the data warehousehas profoundly different needs, clients, structures, and rhythms than the operationalsystems of record.Ultimately, we need to put aside the details of implementation and modeling, andremember what the fundamental goals of the data warehouse are. In our opinion, thedata warehouse: Makes an organization’s information accessible. The contents of the datawarehouse are understandable and navigable, and the access is characterized by fastperformance. These requirements have no boundaries and no fixed limits.Understandable means correctly labeled and obvious. Navigable means recognizingyour destination on the screen and getting there in one click. Fast performance meanszero wait time. Anything else is a compromise and therefore something that we mustimprove. Makes the organization’s information consistent. Information from one part of theorganization can be matched with information from another part of the organization. Iftwo measures of an organization have the same name, then they must mean the samething. Conversely, if two measures don’t mean the same thing, then they are labeleddifferently. Consistent information means high-quality information. It means that all ofthe information is accounted for and is complete. Anything else is a compromise andtherefore something that we must improve. Is an adaptive and resilient source of information. The data warehouse is designedfor continuous change. When new questions are asked of the data warehouse, theexisting data and the technologies are not changed or disrupted. When new data isadded to the data warehouse, the existing data and the technologies are not changedor disrupted. The design of the separate data marts that make up the data warehousemust be distributed and incremental. Anything else is a compromise and thereforesomething that we must improve. Is a secure bastion that protects our information asset. The data warehouse notonly controls access to the data effectively, but gives its owners great visibility into theuses and abuses of that data, even after it has left the data warehouse. Anything elseis a compromise and therefore something that we must improve.v

Is the foundation for decision making. The data warehouse has the right data in it tosupport decision making. There is only one true output from a data warehouse: thedecisions that are made after the data warehouse has presented its evidence. Theoriginal label that predates the data warehouse is still the best description of what we aretrying to build: a decision support system.The Goals of This BookIf we succeed with this book, you—the designers and managers of large datawarehouses—will achieve your goals more quickly. You will build effective datawarehouses that match well against the goals outlined in the preceding section, and youwill make fewer mistakes along the way. Hopefully, you will not reinvent the wheel anddiscover “previously owned” truths.We have tried to be as technical as this large subject allows, without getting waylaid byvendor-specific details. Certainly, one of the interesting aspects of working in the datawarehouse marketplace is the breadth of knowledge needed to understand all of the datawarehouse responsibilities. We feel quite strongly that this wide perspective must bemaintained because of the continuously evolving nature of data warehousing. Even ifdata warehousing leaves behind such bedrock notions as text and number data, or thereliance on relational database technology, most of the principles of this book wouldremain applicable, because the mission of a data warehouse team is to build a decisionsupport system in the most fundamental sense of the words.We think that a moderate amount of structure and discipline helps a lot in building a largeand complex data warehouse. We want to transfer this structure and discipline to youthrough this book. We want you to understand and anticipate the whole BusinessDimensional Lifecycle, and we want you to infuse your own organizations with thisperspective. In many ways, the data warehouse is an expression of information systems’fundamental charter: to collect the organization’s information and make it useful.The idea of a lifecycle suggests an endless process where data warehouses sprout andflourish and eventually die, only to be replaced with new data warehouses that build on thelegacies of the previous generations. This book tries to capture that perspective and helpyou get it started in your organization.Visit the Companion Web SiteThis book is necessarily a static snapshot of the data warehouse industry and themethodologies we think are important. For a dynamic, up-to-date perspective on theseissues, please visit this book’s Web site at www.wiley.com/compbooks/kimball, or log on tothe mirror site at www.lifecycle-toolkit.com. We, the authors of this book, intend to maintainthis Web site personally and make it a useful resource for data warehouse professionals.vi

1. 1

OverviewAll of the authors of this book worked together at Metaphor Computer Systems over aperiod that spanned more than ten years, from 1982 to 1994. Although the real value ofthe Metaphor experience was the building of hundreds of data warehouses, there was anancillary benefit that we sometimes find useful. We are really conscious of metaphors.How could we avoid metaphors, with a name like that?A useful metaphor to get this book started is to think about studying the chess piecesvery carefully before trying to play the game of chess. You really need to learn theshapes of the pieces and what they can do on the board. More subtly, you need to learnthe strategic significance of the pieces and how to wield them in order to win the game.Certainly, with a data warehouse, as well as with chess, you need to think way ahead.Your opponent is the ever-changing nature of the environment you are forced to work in.You can’t avoid the changing user needs, the changing business conditions, thechanging nature of the data you are given to work with, and the changing technicalenvironment. So maybe the game of data warehousing is something like the game ofchess. At least it’s a pretty good metaphor.If you intend to read this book, you need to read this chapter. We are fairly precise in thisbook with our vocabulary, and you will get more out of this book if you know where westand. We begin by briefly defining the basic elements of the data warehouse. As weremarked in the introduction, there is not universal agreement in the marketplace overthese definitions. But our use of these words is as close to mainstream practice as wecan make them. Here in this book, we will use these words precisely and consistently,according to the definitions we provide in the next section.We will then list the data warehouse processes you need to be concerned about. This list isa declaration of the boundaries for your job. Perhaps the biggest insight into yourresponsibilities as a data warehouse manager is that this list of data warehouse processesis long and somewhat dauntingBasic Elements of the Data WarehouseAs you read through the definitions in this section, please refer to Figure 1.1. We willmove through Figure 1.1 roughly in left to right order.Figure 1.1 The basic elements of the data warehouse.Source SystemAn operational system of record whose function it is to capture the transactions of thebusiness. A source system is often called a “legacy system” in a mainframe environment.The main priorities of the source system are uptime and availability. Queries againstsource systems are narrow, “account-based” queries that are part of the normaltransaction flow and severely restricted in their demands on the legacy system. Weassume that the source systems maintain little historical data and that managementreporting from source systems is a burden on these systems. We make the strongassumption that source systems are not queried in the broad and unexpected ways thatdata warehouses are typically queried. We also assume that each source system is a1. 2

natural stovepipe, where little or no investment has been made to conform basicdimensions such as product, customer, geography, or calendar with other legacysystems in the organization. Source systems have keys that make certain things unique,like product keys or customer keys. We call these source system keys production keys,and we treat them as attributes, just like any other textual description of something. Wenever use the production keys as the keys within our data warehouse. (Hopefully that gotyour attention. Read the chapters on data modeling.)Data Staging AreaA storage area and set of processes that clean, transform, combine, de-duplicate,household, archive, and prepare source data for use in the data warehouse. The datastaging area is everything in between the source system and the presentation server.Although it would be nice if the data staging area were a single centralized facility on onepiece of hardware, it is far more likely that the data staging area is spread over a numberof machines. The data staging area is dominated by the simple activities of sorting andsequential processing and, in some cases, the data staging area does not need to bebased on relational technology. After you check your data for conformance with all theone-to-one and many-to-one business rules you have defined, it may be pointless to takethe final step of building a full blown entity-relation-based physical database design.However, there are many cases where the data arrives at the doorstep of the datastaging area in a third normal form relational database. In other cases, the managers ofthe data staging area are more comfortable organizing their cleaning, transforming, andcombining steps around a set of normalized structures. In these cases, a normalizedstructure for the data staging storage is certainly acceptable. The key defining restrictionon the data staging area is that it does not provide query and presentation services. Assoon as a system provides query and presentation services, it must be categorized as apresentation server, which is described next.Presentation ServerThe target physical machine on which the data warehouse data is organized and storedfor direct querying by end users, report writers, and other applications. In our opinion,three very different systems are required for a data warehouse to function: the sourcesystem, the data staging area, and the presentation server. The source system should bethought of as outside the data warehouse, since we assume we have no control over thecontent and format of the data in the legacy system. We have described the data stagingarea as the initial storage and cleaning system for data that is moving toward thepresentation server, and we made the point that the data staging area may well consistof a system of flat files. It is the presentation server where we insist that the data bepresented and stored in a dimensional framework. If the presentation server is based ona relational database, then the tables will be organized as star schemas. If thepresentation server is based on nonrelational on-line analytic processing (OLAP)technology, then the data will still have recognizable dimensions, and most of therecommendations in this book will pertain. At the time this book was written, most of thelarge data marts (greater than a few gigabytes) were implemented on relationaldatabases. Thus, most of the specific discussions surrounding the presentation serverare couched in terms of relational databases.Dimensional ModelA specific discipline for modeling data that is an alternative to entity-relationship (E/R)modeling. A dimensional model contains the same information as an E/R model butpackages the data in a symmetric format whose design goals are user understandability,query performance, and resilience to change. The rationale for dimensional modeling ispresented in Chapter 5.This book and its predecessor, The Data Warehouse Toolkit, are based on the disciplineof dimensional modeling. We, the authors, are committed to this approach because wehave seen too many data warehouses fail because of overly complex E/R designs. Wehave successfully employed the techniques of dimensional modeling in hundreds ofdesign situations over the last 15 years.The main components of a dimensional model are fact tables and dimension tables,which are defined carefully in Chapter 5. But let’s look at them briefly.A fact table is the primary table in each dimensional model that is meant to contain1. 3

measurements of the business. Throughout this book, we will consistently use the wordfact to represent a business measure. We will reduce terminology confusion by not usingthe words measure or measurement. The most useful facts are numeric and additive.Every fact table represents a many-to-many relationship and every fact table contains aset of two or more foreign keys that join to their respective dimension tables.A dimension table is one of a set of companion tables to a fact table. Each dimension isdefined by its primary key that serves as the basis for referential integrity with any givenfact table to which it is joined. Most dimension tables contain many textual attributes(fields) that are the basis for constraining and grouping within data warehouse queries.Business ProcessA coherent set of business activities that make sense to the business users of our datawarehouses. This definition is purposefully a little vague. A business process is usually aset of activities like “order processing” or “customer pipeline management,” but businessprocesses can overlap, and certainly the definition of an individual business process willevolve over time. In this book, we assume that a business process is a useful grouping ofinformation resources with a coherent theme. In many cases, we will implement one ormore data marts for each business process.Data MartA logical subset of the complete data warehouse. A data mart is a complete “pie-wedge”of the overall data warehouse pie. A data mart represents a project that can be broughtto completion rather than being an impossible galactic undertaking. A data warehouse ismade up of the union of all its data marts. Beyond this rather simple logical definition, weoften view the data mart as the restriction of the data warehouse to a single businessprocess or to a group of related business processes targeted toward a particularbusiness group. The data mart is probably sponsored by and built by a single part of thebusiness, and a data mart is usually organized around a single business process.We impose some very specific design requirements on every data mart. Every data martmust be represented by a dimensional model and, within a single data warehouse, allsuch data marts must be built from conformed dimensions and conformed facts. This isthe basis of the Data Warehouse Bus Architecture. Without conformed dimensions andconformed facts, a data mart is a stovepipe. Stovepipes are the bane of the datawarehouse movement. If you have any hope of building a data warehouse that is robustand resilient in the face of continuously evolving requirements, you must adhere to thedata mart definition we recommend. We will show in this book that, when data martshave been designed with conformed dimensions and conformed facts, they can becombined and used together. (Read more on this topic in Chapter 5.)We do not believe that there are two “contrasting” points of view about top-down vs.bottom-up data warehouses. The extreme top-down perspective is that a completelycentralized, tightly designed master database must be completed before parts of it aresummarized and published as individual data marts. The extreme bottom-up perspectiveis that an enterprise data warehouse can be assembled from disparate and unrelateddata marts. Neither approach taken to these limits is feasible. In both cases, the onlyworkable solution is a blend of the two approaches, where we put in place a properarchitecture that guides the design of all the separate pieces.When all the pieces of all the data marts are broken down to individual physical tables onvarious database servers, as they must ultimately be, then the only physical way tocombine the data from these separate tables and achieve an integrated enterprise datawarehouse is if the dimensions of the data mean the same thing across these tables. Wecall these conformed dimensions. This Data Warehouse Bus Architecture is afundamental driver for this book.Finally, we do not adhere to the old data mart definition that a data mart is comprised ofsummary data. Data marts are based on granular data and may or may not containperformance enhancing summaries, which we call “aggregates” in this book.Data WarehouseThe queryable source of data in the enterprise. The data warehouse is nothing more thanthe union of all the constituent data marts. A data warehouse is fed from the data stagingarea. The data warehouse manager is responsible both for the data warehouse and thedata staging area.1. 4

Please understand that we (and the marketplace) have departed in a number of waysfrom the original definition of the data warehouse dating from the early 1990s.Specifically, the data warehouse is the queryable presentation resource for anenterprise’s data and this presentation resource must not be organized around an entityrelation model because, if you use entity-relation modeling, you will loseunderstandability and performance. Also, the data warehouse is frequently updated on acontrolled load basis as data is corrected, snapshots are accumulated, and statuses andlabels are changed. Finally, the data warehouse is precisely the union of its constituentdata marts.Operational Data Store (ODS)The term “operational data store” has taken on too many definitions to be useful to thedata warehouse. We have seen this term used to describe everything from the databasethat underlies the operational system to the data warehouse itself. There are two primarydefinitions that are worth exploring in the context of the data warehouse. Originally, theODS was meant to serve as the point of integration for operational systems. This wasespecially important for legacy systems that grew up independent of each other. Banks,for example, typically had several independent systems set up to support differentproducts—loans, checking accounts, savings accounts, and so on. The advent of tellersupport computers and the ATM helped push many banks to create an operational datastore to integrate current balances and recent history from these separate accountsunder one customer number. This kind of operational lookup is a perfect example of theuseful role an ODS can play. In fact, this need for integration has been the driving forcebehind the success of the client/server ERP business.Since this kind of ODS needs to support constant operational access and updates, itshould be housed outside the warehouse. That is, any system structured to meetoperational needs and performance requirements will be hard pressed to meet decisionsupport needs and performance requirements. For example,

our sample design of a Data Warehouse Bus Architecture. CD-ROM. The CD-ROM that accompanies the book contains a large number of actual checklists, templates, and forms for you to use with your data warehouse development. It also includes a sample design illustrating the Data Warehouse Bus Ar