Transcription

Data Engineering and Streaming Analytics

Welcome and Housekeeping
You should have received instructions on how to participate in the training session.
If you have questions, you can use the Q&A window in GoToWebinar.
The recording of the session will be made available after the event.

About Your Instructor
Doug Bateman is Director of Training and Education at Databricks. Prior to this role he was Director of Training at NewCircle.

Apache Spark - Genesis and Open Source
Spark was originally created at the AMP Lab at UC Berkeley. The original creators went on to found Databricks.
Spark was created to bring data and machine learning together.
Spark was donated to the Apache Software Foundation to create the Apache Spark open source project.

VISION: Accelerate innovation by unifying data science, engineering and business
SOLUTION: Unified Analytics Platform
WHO WE ARE: Original creators of Apache Spark; 2,000 global companies use our platform across the big data & machine learning lifecycle

Apache Spark: The 1st Unified Analytics Engine
Uniquely combines Data & AI technologies:
- Runtime: Spark Core Engine, Delta
- Big Data Processing: ETL, SQL, Streaming
- Machine Learning: MLlib, SparkR

Introducing Delta Lake
A New Standard for Building Data Lakes:
- Open format based on Parquet
- With transactions
- Apache Spark APIs

Apache Spark - A Unified Analytics Engine

Apache Spark
“Unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing”
- Research project at UC Berkeley in 2009
- APIs: Scala, Java, Python, R, and SQL
- Built by more than 1,200 developers from more than 200 companies

HOW TO PROCESS LOTS OF DATA?

M&Ms

Spark Cluster
One Driver and many Executor JVMs
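To make the driver/executor split concrete, here is a minimal PySpark sketch, assuming a local SparkSession (in local mode, threads stand in for executor JVMs); the app name and partition count are illustrative.

from pyspark.sql import SparkSession

# The driver is the JVM that runs this program and plans the job.
spark = SparkSession.builder.appName("cluster-demo").getOrCreate()

# The range is split into 8 partitions; each partition is counted on
# an executor, and the driver sums the partial counts.
print(spark.range(0, 1_000_000, numPartitions=8).count())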

Data Lakes - A Key Enabler of Analytics
Data lakes enable data science and ML use cases:
- Recommendation Engines
- Risk, Fraud, & Intrusion Detection
- Customer Analytics
- IoT & Predictive Maintenance
- Genomics & DNA Sequencing

Data Lake Challenges
65% of big data projects fail, per Gartner. Unreliable, low-quality data and slow performance keep data lakes from delivering the data science and ML use cases listed above.

1. Data Reliability Challenges
- Failed production jobs leave data in a corrupt state, requiring tedious recovery
- Lack of schema enforcement creates inconsistent and low-quality data
- Lack of consistency makes it almost impossible to mix appends and reads, batch and streaming

2. Performance Challenges
- Too many small or very big files: more time is spent opening & closing files than reading their contents (worse with streaming)
- Partitioning, aka “poor man's indexing”, breaks down if you picked the wrong fields or when data has many dimensions or high-cardinality columns
- No caching: cloud storage throughput is low (S3 is 20-50 MB/s/core vs 300 MB/s/core for local SSDs)

Databricks Delta
Next-generation engine built on top of Spark. Delta stores versioned Parquet files plus a transactional log and indexes & stats.
- Leverages your cloud blob storage
- Co-designed compute & storage
- Compatible with Spark APIs
- Built on open standards (Parquet)

Delta Makes Data Reliable
Streaming, batch, and update/delete workloads all flow through the Delta table's transactional log and versioned Parquet files, so reliable data is always ready for analytics.
Key Features:
- ACID Transactions
- Schema Enforcement
- Upserts
- Data Versioning
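As a quick illustration of schema enforcement, here is a hedged PySpark sketch assuming a SparkSession already configured for Delta; the /tmp/reliable path and column name are hypothetical. An append with a mismatched schema is rejected instead of silently corrupting the table.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes delta-spark is configured

# Create a Delta table whose `id` column is a long (illustrative path).
spark.range(5).write.format("delta").save("/tmp/reliable")

# Appending a DataFrame whose `id` is a string violates the table schema,
# so Delta raises an AnalysisException rather than writing bad data.
bad = spark.createDataFrame([("oops",)], ["id"])
try:
    bad.write.format("delta").mode("append").save("/tmp/reliable")
except Exception as e:
    print("Rejected by schema enforcement:", type(e).__name__)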

Delta Makes Data More Performant
The Delta engine layers I/O & query optimizations behind open Spark APIs, delivering fast, highly responsive queries at scale over the Delta table's transactional log and versioned Parquet files.
Key Features:
- Compaction
- Caching
- Data skipping
- Z-ordering

Get Started with Delta using Spark APIs
Instead of parquet, simply say delta:

CREATE TABLE ... USING parquet   ->   CREATE TABLE ... USING delta

dataframe.write.format("parquet").save("/data")   ->   dataframe.write.format("delta").save("/data")
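A minimal end-to-end sketch of that swap, assuming the delta-spark package is installed (e.g. pip install delta-spark) or a Databricks runtime, where this configuration is already done; the /tmp/delta-demo path is illustrative.

from pyspark.sql import SparkSession

# Configure a SparkSession for Delta (not needed on Databricks).
spark = (SparkSession.builder
         .appName("delta-quickstart")
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Same write call as Parquet -- only the format string changes.
spark.range(0, 5).write.format("delta").save("/tmp/delta-demo")

# Reading back follows the same pattern.
spark.read.format("delta").load("/tmp/delta-demo").show()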

Using Delta with your Existing Parquet Tables

Step 1: Convert Parquet to Delta Tables

CONVERT TO DELTA parquet.`path/to/table` [NO STATISTICS]
[PARTITIONED BY (col_name1 col_type1, col_name2 col_type2, ...)]

Step 2: Optimize Layout for Fast Queries

OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)
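The same two steps can be driven from Python; a sketch assuming the Delta-configured SparkSession from the quickstart above, with a hypothetical /data/events path partitioned by date.

from delta.tables import DeltaTable

# Step 1: convert an existing Parquet directory in place
# (backticks are required around the path in the identifier).
DeltaTable.convertToDelta(spark, "parquet.`/data/events`")

# Step 2: compact and Z-order recent data via SQL
# (OPTIMIZE is the Databricks Delta command shown above).
spark.sql("""
    OPTIMIZE delta.`/data/events`
    WHERE date >= current_timestamp() - INTERVAL 1 day
    ZORDER BY (eventType)
""")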

Upsert/Merge: Fine-grained Updates

MERGE INTO customers  -- Delta table
USING updates
ON customers.customerId = updates.customerId
WHEN MATCHED THEN
  UPDATE SET address = updates.address
WHEN NOT MATCHED THEN
  INSERT (customerId, address) VALUES (updates.customerId, updates.address)
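The same upsert via the DeltaTable Python API, as a sketch: the /data/customers path is hypothetical, and `updates` is assumed to be an existing DataFrame of incoming rows.

from delta.tables import DeltaTable

customers = DeltaTable.forPath(spark, "/data/customers")  # hypothetical path

# `updates` is an existing DataFrame with customerId and address columns.
(customers.alias("customers")
    .merge(updates.alias("updates"),
           "customers.customerId = updates.customerId")
    .whenMatchedUpdate(set={"address": "updates.address"})
    .whenNotMatchedInsert(values={
        "customerId": "updates.customerId",
        "address": "updates.address"})
    .execute())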

Time Travel
Reproduce experiments & reports. Roll back accidental bad writes.

SELECT count(*) FROM events TIMESTAMP AS OF timestamp
SELECT count(*) FROM events VERSION AS OF version

spark.read.format("delta").option("timestampAsOf", timestamp_string).load("/events/")

INSERT INTO my_table
SELECT * FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
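A small sketch of version-based time travel in Python, assuming /events/ is a Delta table and reusing the SparkSession from earlier; the version number is illustrative, and history() shows which versions and timestamps exist.

from delta.tables import DeltaTable

# Read the table as it existed at an earlier version.
old = (spark.read.format("delta")
       .option("versionAsOf", 0)
       .load("/events/"))

# Inspect the commit history to find available versions/timestamps.
DeltaTable.forPath(spark, "/events/").history().show()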

Apple: Threat Detection at Scale with Delta
Detect signals across user, application, and network logs; quickly analyze the blast radius with ad hoc queries; respond quickly in an automated fashion; scale across petabytes of data and hundreds of security analysts.
Pipeline: 100 TB of new data/day, 300B events/day, streaming into Databricks Delta for refinement, then alerts, machine learning, and data science.
Before Delta: took 20 engineers 24 weeks to build; only able to analyze a 2-week window of data.
With Delta: took 2 engineers 2 weeks to build; analyze 2 years of batch and streaming data.
(Keynote talk)

Spark References
- Databricks
- Apache Spark ML Programming Guide
- Scala API Docs
- Python API Docs
- Spark Key Terms

Questions?
Further Training Options: http://bit.ly/DBTrng
- Live Onsite Training
- Live Online
- Self Paced
Meet one of our Spark experts: http://bit.ly/ContactUsDB
