
Enterprise Strategy Group | Getting to the bigger truth.

White Paper
Google Streaming Analytics Platform
End-to-end, Cloud-based Streaming Analytics
By Kerry Dolan, ESG Senior IT Validation Analyst
December 2019

This ESG White Paper was commissioned by Google and is distributed under license from ESG.
© 2019 by The Enterprise Strategy Group, Inc. All Rights Reserved.

Contents

Real-time Analytics: A Business Priority
Infrastructure and Skills Challenges
Google Streaming Analytics: An End-to-end Platform
Ingest
Transform/Process
Google Streaming Solution Accessibility Demonstration
Data Warehouse/Analyze
Google Alternatives to Its Prescribed Streaming Pattern
Cloud Dataproc
Cloud Data Fusion
Advanced Analytics
Customer Successes Prove the Power of Google Streaming Analytics
Results
The Bigger Truth

Real-time Analytics: A Business Priority

The ability to collect and use data in real time is transforming and empowering organizations in ways they never imagined. Real-time data is generated from a growing variety of sources across customer, supplier, partner, and market interactions. Messaging applications are now real-time, and sensor-enabled machines deliver a constant stream of data. Social media delivers real-time feedback and insight into consumers. Clickstream data from digital commerce can deliver predictive value. All of these data sources present the opportunity to add significant business value.

The value of mining, analyzing, and acting on this data cannot be overstated; organizations are using this data to understand customers, identify trends, design products, and head off problems. As a result, real-time data analytics has become a key business priority. When asked what business initiatives they believed would drive the most technology spending in 2019, 31% of ESG survey respondents cited improving data analytics for real-time business intelligence and customer insight, making it the second most-cited initiative, behind strengthening cybersecurity (see Figure 1).[1]

Figure 1. Business Initiatives Driving Technology Spending

Which of the following business initiatives do you believe will drive the most technology spending in your organization over the next 12 months? (Percent of respondents, N=810, five responses accepted)

  37%  Strengthening cybersecurity
  31%  Improving data analytics for real-time business intelligence and customer insight
  30%  Cost reduction
  29%  Improving our customer experience (CX)
  28%  Improving internal collaboration capabilities
  26%  Regulatory compliance assurance
  23%  Business continuity/disaster recovery programs
  22%  Developing strategies to ensure we interact with our customers on their mobile devices
  21%  Business growth via mergers, acquisitions, or organic expansion
  21%  Providing our employees with the mobile devices and applications they need to maximize productivity
  20%  New product research and development

Source: Enterprise Strategy Group

Streaming analytics provides a key opportunity to analyze data in real time. To be clear, batch remains an important part of data analytics, and to the surprise of some, it's a critical component of stream analytics too, as historical and batch data can strengthen the analysis offered by real-time systems.

[1] Source: ESG Master Survey Results, 2019 Technology Spending Intentions Survey, March 2019.

Data can be used for other strategic purposes in batch mode, but there is a window of new opportunity if organizations can collect, process, analyze, and act on a continuing stream of data in real time, particularly in industries like retail, financial services, media/advertising, healthcare, and utilities. The ability to collect data, instantly analyze it, and take immediate action can strengthen ties with customers and partners and enable organizations to shift more quickly in response to business conditions. Consumer and business transactions are conducted online and provide data on profiles, purchases, finances, and delivery. In retail, you can only impact behavior if you can predict in real time; if you take too long, the customer will move on. Online gaming is another example: once a user logs in, the gaming app collects a steady stream of data about play, progress made, in-game purchases, social interactions, etc. This data is used to advance the user through the experience and can be analyzed immediately by financial and marketing systems, which can then present additional real-time experiences to the gamer. Fraud detection must take place in real time for prevention. IoT data from sensors is much cheaper to analyze and maintain in real time. Event-driven data can be used for real-time responsiveness, process automation, instant interactions, targeted marketing, and myriad organization-specific processes.

Infrastructure and Skills Challenges

The primary challenges of streaming analytics are having the infrastructure and expertise to collect, store, and analyze the data in real time. These are similar challenges to batch processing, so what makes streaming analytics more difficult?

- It deals with very small ingest data (1 KB at a time) compared with files that will be ingested at rates of more like 20 MB at a time. Each event demands processing and delivery as soon as it happens for immediate action.
- The system needs to store and persist a continuous stream of these small events, distributing the load among clients without introducing latency.
- Jobs may run for days, months, or even years, continually rather than with a clear start and finish to the processing; with streaming data, it’s hard to know when all the data has been collected. Systems must be stable and self-healing to avoid interrupted processing.
- Data must be fresh to be useful, so the analysis engine must have instant access.
- Administrators must know immediately when something is wrong so they can troubleshoot and resume operations.
- Late-arriving data causes issues when aggregate statistics need to be produced for time intervals that might not have all of the data. Performing complex aggregations over the time dimension is difficult.
- The arrival time of an event is not the same as the business transaction time of the event, making analytics more difficult.

It takes significant compute, storage, and networking infrastructure to deal with the heavy flow of data ingestion from a growing number of data sources, to be able to scale up and down as needed for efficiency and growth, and to process, store, and analyze it all. Plus, organizations in regulated industries must be certain that they capture every event without losing any data. Data processing engines must be flexible enough to handle different data types, and various types of application expertise are required for both infrastructure and processing/analysis applications, all using different toolsets. Organizations want to focus on using all this data to inform decisions and actions, not on buying, building, and managing a lifecycle of applications and infrastructure for each segment of the streaming analytics process.
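The late-arriving-data and arrival-time-versus-transaction-time challenges listed above can be made concrete with a small sketch. The Python below is illustrative only and is not taken from any Google product. It groups events into fixed five-minute windows by business transaction time rather than arrival time, and sets aside events that arrive too long after their window has ended. The names Event, window_start, and aggregate_by_event_time, the 300-second window, and the 60-second allowed lateness are all assumptions chosen for the example. Production engines typically track progress with watermarks rather than comparing raw arrival times, so treat this as a simplification.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical event record used only for this illustration."""
    event_time: int    # seconds; when the business transaction happened
    arrival_time: int  # seconds; when the system received the event
    amount: float


def window_start(event_time: int, window_size: int) -> int:
    """Return the start of the fixed window that contains event_time."""
    return event_time - (event_time % window_size)


def aggregate_by_event_time(events, window_size=300, allowed_lateness=60):
    """Sum amounts per event-time window, handling events in arrival order.

    An event is counted in the window of its transaction time as long as it
    arrives no later than window end + allowed_lateness. Later arrivals are
    returned separately instead of being dropped or misattributed.
    """
    totals = defaultdict(float)
    too_late = []
    for event in sorted(events, key=lambda e: e.arrival_time):
        start = window_start(event.event_time, window_size)
        window_end = start + window_size
        if event.arrival_time > window_end + allowed_lateness:
            too_late.append(event)
        else:
            totals[start] += event.amount
    return dict(totals), too_late


if __name__ == "__main__":
    sample = [
        Event(event_time=10, arrival_time=12, amount=5.0),
        Event(event_time=290, arrival_time=310, amount=7.0),  # late, within grace
        Event(event_time=320, arrival_time=321, amount=1.5),
        Event(event_time=100, arrival_time=900, amount=2.0),  # beyond grace
    ]
    totals, too_late = aggregate_by_event_time(sample)
    print(totals)
    print(too_late)
```

In the sample, the event with transaction time 290 arrives at second 310, after its window has nominally ended, yet it is still counted in the first window because it falls within the grace period. The event arriving at second 900 is set aside. Choosing the grace period is a trade-off between data freshness and completeness of the aggregates.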

Additionally, organizations are struggling to find staff with the skills they need to plan, deploy, and manage the end-to-end process. When asked in what areas they believed their IT organizations had a problematic shortage of skills, IT architecture and planning, artificial intelligence (AI)/machine learning (ML), data analytics/data science, and IT orchestration and automation were among the top five most-cited areas, surpassed only by cybersecurity (see Figure 2).[2] These contribute to why many organizations struggle to take full advantage of their data. The infrastructure costs are already daunting; when combined with the complexity of processes and applications and a lack of IT skills, many are unable to take advantage.

Figure 2. Top Ten Problematic Skills Shortages

In which of the following areas do you believe your IT organization currently has a problematic shortage of existing skills? (Percent of respondents, N=586, multiple responses accepted)

  53%  Cybersecurity
  38%  IT architecture/planning
  35%  Artificial intelligence/machine learning (AI/ML)
  34%  Data analytics/data science
  33%  IT orchestration and automation
  26%  Application development/DevOps
  26%  Data protection
  24%  Database administration
  23%  Network administration
  22%  Enterprise mobility management

Source: Enterprise Strategy Group

[2] Ibid.

Google Streaming Analytics: An End-to-end Platform

Google offers a complete, cloud-based streaming analytics platform that provides automation, scalability, and ease of use so organizations can focus on analyzing and operationalizing their data, not on infrastructure and administration. This lets organizations quickly, easily, and cost-efficiently start using their data to drive insights. Google’s services require no hardware to deploy and maintain, and no upfront costs; they include simple administration and automated scaling. Costs are limited to exactly what is needed for a job’s execution, as Google’s autoscaling eliminates the need to overprovision for unexpected spikes in data creation/ingestion.

The Google Streaming Analytics platform provides:

- Robust ingestion services. Cloud Pub/Sub takes in data and events reliably and publishes them out to multiple subscribers while reducing redundancy.

- Unified stream and batch processing. Cloud Dataflow changes events and data into actionable insights; there is no separate infrastructure for stream versus batch, simplifying management and reducing costs.
- Serverless architecture. This service-based offering can automatically scale up to handle spikes of data and back down when event volumes subside, while managing all resource-intensive provisioning and tuning tasks.
- Comprehensive set of analysis tools. It includes an integrated toolset across ingest, processing, and analysis versus stitching together disparate tools.
- Flexibility for users. Apache Beam, Dataflow’s SDK, is an open source programming model that enables portability and language choice.

To make the best use of a managed service platform for streaming analytics, customers need a way to get data into the platform that’s reliable, scalable, and fast. They need a processing engine that can transform that data the instant it reaches the platform. They need services on top of that data, such as SQL semantics, AI/ML, or custom application logic. They need data to be protected and secure. In regulated industries, they need controls for privacy and data governance, and they need the ability to create a duplicate of untransformed data as a full fidelity record. The Google platform takes care of all these concerns so customers can concentrate on analysis and insights. In contrast, traditional platforms require organizations to handle configuring and deploying infrastructure; monitoring; tuning; and ensuring reliability, protection, security, and provisioning for scale, and they need to revisit these constantly throughout the lifecycle (see Figure 3).

Figure 3. DIY versus Google Serverless Streaming Analytics
Source: Enterprise Strategy Group

Figure 4 shows the products that make up Google’s end-to-end streaming analytics platform, from ingest at any scale through analysis.
All parts of this platform are delivered as fully managed and integrated services, relieving customers of the burdensome infrastructure tasks required. These services are available in every Google region around the globe.

Figure 4. Google Cloud Analytics: Comprehensive, Cloud-based Platform
Source: Enterprise Strategy Group

Ingest

The entry point for stream analytics pipelines, Cloud Pub/Sub, takes in event and data streams from media, messaging, sensors, IoT, etc., organizes them into topics, and makes them available to subscribers by topic. Organizations can publish and subscribe to events in any geography. Data is replicated synchronously across zones for up to seven days for availability. Cloud IoT Core provides device connection and management for IoT use cases.

For real-time enrichment of streaming data with broader legacy/batch data sets, Data Transfer Service and Storage Transfer Service enable organizations to ingest data from on-premises locations, SaaS applications, IoT devices, and between clouds.

Cloud Pub/Sub benefits include:

- The topic-based system enables a one-to-many model of data publication so that multiple pipelines can be built from a single stream. For example, forecasting, inventory, staffing, and billing pipelines can all subscribe to the latest sales data to use for specific analyses. Real-time and batch processing can be mixed, reducing the numbers of patterns and pipelines to deploy, and simplifying architecture.
- The system scales automatically and rebalances workloads to support the data without additional provisioning. This is a huge benefit in terms of time savings, agility, and cost. Compute and storage can scale at different rates. In addition, customers set throughput quotas for publishing, pulling/pushing data, and administrative operations (e.g., Get, List, Create, Delete, etc.). Topics can expand globally.
- Pub/Sub supports containerization and microservices for multiple inputs and data streams within an application.
- Enterprise security is built in, including end-to-end encryption, identity and access management (IAM), and audit logging.
- Native client libraries provide data engineers with a choice of languages in which to work, and an open source API supports cross-cloud and hybrid deployments.
- Simple pricing lets customers pay only for what they consume. There is no need to estimate or monitor usage, or to pay for overhead that you might not use.

Transform/Process

Cloud Dataflow is a unified stream and batch processing engine that leverages Apache Beam as its SDK, enabling organizations to build processing pipelines with the languages they choose. Apache Beam also offers freedom from lock-in, as code written to Beam can be executed on Dataflow, Apache Spark, Apache Flink, and other “runners.”

Customer Success
“Google Cloud enables us to operate at scale with ease. . . We can focus on creating new things rather than on maintaining systems and worrying about things like Black Friday [scalability]. Instead we focus on our vision, where every customer has a personal experience with the brands they love.” - Qubit, customer experience/personalization company

Using the same code for batch and stream reduces both costs and complexity; the latter is especially important in light of the skills shortages in data analytics, AI/ML, and infrastructure mentioned previously. Dataflow ensures “exactly once” processing (eliminating both duplication and missed inputs) with fault-tolerant execution.

The Dataflow service includes and calls upon both compute and storage hardware, but customers don’t need to know anything about them, other than that they are decoupled to save the customer money. Dataflow manages resources and schedules, scales and rebalances workloads, monitors, self-heals, and collects logs. Customers simply know their data is going to the right places for processing, and Google handles the rest. Benefits include:

- Dataflow automatically manages performance, scale, availability, security, and compliance. It offers both administrative and cost efficiency, supporting parallel data processing and charging customers only for what they consume.
- Dataflow tracks small data bits and assigns them to processing nodes with a focus on workflow scheduling and dynamic rebalancing. This is particularly helpful for stream data, which often comes in peaks and valleys. Dataflow will automatically add new workers and assign data to them if that will improve execution time, and spin down workers when the volume subsides. As shards take longer to process and begin to affect worker execution time, Dataflow will reallocate load from the struggling workers to others.

- Flexibility is built in. With Apache Beam as the SDK, Dataflow job code can be deployed in other clouds, on premises, or using Apache Spark, Apache Flink, or other runtimes. This flexibility extends to languages as well, giving engineers choices including Java, Python, and Go.
- Dataflow and Beam also support SQL. Data analysts who are not data engineers often use SQL for streaming pipelines, so this support eliminates or reduces their reliance on data engineers to build and adjust pipelines.
- Dataflow’s batch and streaming flexibility extends beyond the ability to execute both paradigms. For batch jobs that can be run overnight, Dataflow’s flexible resource scheduling enables these jobs to be done at a lower cost with guaranteed start windows. This gives organizations the flexibility to move processes between stream, batch, and overnight batch as needed while optimizing for cost. For example, “first draft” route and delivery scheduling for a logistics company may occur overnight for the following business day, after which real-time inputs on delays, traffic, and package priority will reroute drivers in real time. This enables flexibility and cost savings.

Google Streaming Solution Accessibility Demonstration

As mentioned earlier, infrastructure and skills are two of the largest barriers to adopting stream analytics. We’ve explored how Google’s autoscaling capabilities address the infrastructure barrier, but Google has also addressed the streaming skills gap through product development. Google’s Dataflow SQL gives data analysts the ability to create new streaming pipelines using SQL semantics from within BigQuery, Google’s data warehouse. As a result, staff members with a wide variety of skillsets can access streaming data within Google’s platform, removing bottlenecks and opening up engineering resources for other jobs.

ESG viewed a demonstration of the Dataflow SQL capabilities, which showcased Pub/Sub, Dataflow, and BigQuery. The demo featured creation of a streaming job that joins data from a streaming Pub/Sub topic with a BigQuery table, populating the output table in BigQuery for immediate analysis and dashboarding. The context for this demonstration was examining sales transactions, and specifically retrieving real-time sales transactions by time-based windows.

Customer Success
“With our new data pipeline and warehouse, we are able to personalize access to large volumes of data that were not previously there. That means new insights and correlations and, therefore, better decisions and increased revenue for our customers.” - AB Tasty, personalization and A/B testing company

- In the Pub/Sub UI, we created a Transactions topic (to include a continual stream of sales data such as who purchased a product, where, when, the price, etc.), and with a right-click, selected New subscription/Cloud Storage text file.
- Next, we used the Dataflow Create job from template UI to name the job and select the destination in which to store the data stream from Pub/Sub. Instructions for creating a job were viewable in the right navigation bar.
- Dataflow created a three-step job immediately: read the events, bundle them into five-minute chunks, and write them to a BigQuery table. Once in BigQuery, the data was immediately available for analysis and dashboarding. While the job ran, system latency and data freshness graphics were generated (see Figure 5).

- We clicked on Create alerting, which took us to the Stackdriver monitoring UI, and created an alert to notify the administrator when latency exceeded 20 seconds.

Figure 5. Pub/Sub and Dataflow: Create Job with Latency and Data Freshness Graphs
Source: Enterprise Strategy Group

Data Warehouse/Analyze

Google BigQuery is a cloud-native, fully managed enterprise data warehouse supporting large-scale analytics and is a common target for streaming pipelines. It offers high-performance analysis of large data sets, with automatic scaling up and down to optimize query performance and cost. It eliminates the overhead and complexity of maintaining on-premises hardware and administration. As a cloud-native data warehouse, BigQuery also decouples compute and storage to provide cost-effective resources that are unavailable to on-prem users or to cloud data warehouses based on legacy technology.

BigQuery streaming enables transformed data to be streamed in from Dataflow one record at a time with immediate querying. BigQuery also offers a BI engine for fast, in-memory analysis of data stored in BigQuery with sub-second latency. This provides fast dashboards and is required for real-time systems with human interaction endpoints such as driving instructions or in-store alerts.

BigQuery also supports direct stream ingestion via the BigQuery Streaming API. This enables customers to deploy an ELT model that takes advantage of broadly available SQL skills for analytic processing, which can help customers generate faster analysis or unburden data engineering resources. This can be used for automated processes, interactive querying of real-time data, or real-time BI dashboards.

Benefits include:

- Fast time to value. Customers can get their data warehouse environment up and running quickly and easily without expert system and database administrative skills.
- Speed. BigQuery speeds ingest, query, and export of petabyte-scale data sets for faster insight.

- Ease of use. Simple management includes an intuitive interface and automated scaling to petabyte scale, so that customers don’t need to throttle data streaming into it. Queries finish efficiently and resources are then reallocated to other projects/users.
- Reliability and data security. Google handles geo-replication for always-on data availability and delivers continual uptime. Data protection, recovery, encryption, and IAM are also provided.
- Cost optimization. This includes predictable costs with flat-rate or pay-as-you-go pricing. Because compute and storage are separated, storage can be offered at a lower cost, and customers can establish project/user resource quotas.

ESG Economic Value Validation
ESG validated a 52% reduction in three-year TCO, including cost reduction and economic benefits, for migrating an enterprise data warehouse to BigQuery versus on-premises infrastructure.[3]

Google Alternatives to Its Prescribed Streaming Pattern

Based on the company’s extensive experience running search and advertising businesses with real-time inputs, Google believes the best architecture for streaming is one that can do as much to automate infrastructure (particularly stream ingestion and stream processing) as is technically possible. ESG’s survey respondents seem to agree, having listed various components of infrastructure management as areas of problematic skills shortages at their organization (see Figure 2).

However, there are users and organizations for which infrastructure and skills are not an issue. These companies may be interested in the configurability that Apache Spark provides. They may have existing Apache Kafka streaming solutions on-premises that they’re looking to extend to the cloud. They may have an organizational philosophy that locks them to open source technology. Or they may want a GUI provided by a data integration tool through which they compose their streaming pipelines.

Google’s broader compilation of streaming-enabled products gives customers the flexibility to mix and match technologies to achieve the optimal combination.

Figure 6. Customer Choice for Stream Data Analytics
Source: Enterprise Strategy Group

[3] For details, including information on customer successes, please see the ESG Economic Value Validation, The Economic Advantages of Migrating Enterprise Data Warehouse Workloads to Google BigQuery.

Cloud Dataproc

Cloud Dataproc provides cost-effective, managed processing for Hadoop and Spark environments, enabling customers to retain their familiar on-premises architecture and tools while using Google cloud storage. This high-performance, cost-effective service is easy to deploy and scale; clusters can be spun up and down as needed in minutes and are easily customizable for optimal resources on a per-job basis. For example, Dataproc offers customizable machine types, such as compute-intensive for machine learning versus standard for ad hoc analysis; these machines can read and write from the same Google cloud storage but don’t compete for resources. Clusters can be tailored to use cases. For ephemeral jobs, Dataproc creates right-sized clusters, runs the jobs, and breaks down the clusters, saving data to Stackdriver to keep a record. Long-standing clusters run continuously for jobs such as streaming analytics, and also for BI, web notebooks, and ad hoc analysis. Features include auto-scaling, workflow templates, high availability mode, stable back-end storage, and low TCO.[4]

Dataproc specifically accomplishes stream processing with Apache Spark, the popular open source framework for data processing. Given the ubiquity of Spark within enterprises, both from an architecture and skills perspective, some users may prefer to accomplish stream processing within Google Cloud through Dataproc, particularly for migrations. Customers with heavy streaming requirements may then opt to transition their workloads to Dataflow after gaining more experience with the platform.

[4] For additional Cloud Dataproc TCO information, see the ESG Economic Value Audit, Analyzing the Economic Benefits of Google Cloud Dataproc Cloud-native Hadoop and Spark Platform.

Flexibility
With Kafka, Dataflow, Dataproc, and Data Fusion, Google Cloud Platform offers an alternative to its proprietary streaming platform through an open source, fully managed solution.

Cloud Data Fusion

In Google Cloud, streaming pipelines can be deployed directly via Apache Spark from Cloud Dataproc, or indirectly through Cloud Data Fusion, Google Cloud Platform’s ETL offering. Cloud Data Fusion provides code-free, visual drag-and-drop connectors to simplify data migration and transformation from on-premises and hybrid/multi-cloud environments. Customers simply point and click through sources, sinks, and transformations without any coding.

For users with limited engineering experience and a need to develop streaming pipelines, Cloud Data Fusion is a welcome tool. For example, data analysts or ETL developers can quickly build high-quality streaming pipelines to tackle their use cases without ever having to write a single line of code. Data Fusion also has the ability to publish and call upon private libraries of transformation code, meaning that even difficult tasks that require data engineers can be written once and called upon countless times by analysts.

Advanced Analytics

Real-time data must be analyzed in real time and immediately acted upon; otherwise, batch jobs would suffice. Google tools make it easy for users of any skill level to predict, perform, and take action on analysis in real time.

- Google offers APIs to drive processed data to analysis and action plans.
- Cloud AutoML simplifies training of custom machine learning models. It delivers data to automated models that have been trained without code for business-specific purposes.

- Cloud Machine Learning Engine scales to meet the needs of ML models and data feeding them, removing a complex logistical burden.
- Google supports TensorFlow, designed for data experts, a
