Transcription

Data Processing
Venkatesh Vinayakarao (Vv)
Chennai Mathematical Institute
"Data is the new oil." - Clive Humby, 2006.

What Comes Next?
byte, kilobyte, megabyte, gigabyte, ?

Sizes
Name        Size
Byte        8 bits
Kilobyte    1024 bytes
Megabyte    1024 kilobytes
Gigabyte    1024 megabytes
Terabyte    1024 gigabytes
Petabyte    1024 terabytes
Exabyte     1024 petabytes
Zettabyte   1024 exabytes
Yottabyte   1024 zettabytes
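
Each row of the Sizes slide multiplies the previous unit by 1024, so every unit is a power of 1024 bytes. A minimal Python sketch of that arithmetic follows; the names `UNITS` and `bytes_in` are illustrative and do not come from the slides.

```python
# Unit names in the order they appear on the Sizes slide.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]


def bytes_in(unit: str) -> int:
    """Return the number of bytes in one `unit`, using the 1024 multiplier."""
    return 1024 ** UNITS.index(unit)


for unit in UNITS:
    print(f"1 {unit} = {bytes_in(unit):,} bytes")
```

Python integers have arbitrary precision, so even the yottabyte value (2^80 bytes) is computed exactly.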

Recap
- Challenges
- Data Storage
- STaaS


Big Data Characteristics
- Volume: petabytes, exabytes, ...
- Variety: pdf, json, text, images, ...
- Velocity: real-time, near real-time, batch
- Veracity: trustworthiness, correctness and consistency

"Where there is data smoke, there is business fire." - Thomas Redman, Author.
I see God in you,
Data, what am I to do?
Whenever I see any data,
my smitten heart sings: ole ole ole

Quiz
Which is right?
- accommodate
- acommodate
- accomodate
- acomodate

Google n-gram Viewer
https://books.google.com/ngrams

Quiz
Long-term or long term?


Data Processing
- Microprocessors
- Multi-core Processors
- Supercomputers
All roads lead to Rome ... Cloud!

Microprocessors
Processing unit on an integrated circuit (IC).
What are ICs made of?

Transistors
- Basic electronic component that alters the flow of current.
- Form the basic building block of an integrated circuit.
- Think of it as an electronic switch.

Logic Gates
- Implement Boolean functions (and thus perform logical operations)
- Implemented using transistors
Microprocessors contain millions of logic gates.
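
To make the "gates implement Boolean functions" point concrete: a NAND gate can be built from a handful of transistors, and every other Boolean function can be composed from NAND alone. The sketch below models gates as Python functions on 0/1 values. It illustrates the composition only; it is not a circuit simulation, and the function names are illustrative.

```python
def nand(a: int, b: int) -> int:
    """NAND: output is 0 only when both inputs are 1."""
    return 0 if (a and b) else 1


def not_(a: int) -> int:
    """NOT built from a single NAND with both inputs tied together."""
    return nand(a, a)


def and_(a: int, b: int) -> int:
    """AND is NAND followed by NOT."""
    return not_(nand(a, b))


def or_(a: int, b: int) -> int:
    """OR via De Morgan: a OR b = NAND(NOT a, NOT b)."""
    return nand(not_(a), not_(b))


def xor(a: int, b: int) -> int:
    """XOR from four NAND gates."""
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))


def half_adder(a: int, b: int) -> tuple:
    """Add two bits; returns (sum_bit, carry_bit)."""
    return xor(a, b), and_(a, b)


# Print the truth table for each two-input gate.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", and_(a, b), "OR:", or_(a, b), "XOR:", xor(a, b))
```

The half adder shows how a few gates already perform arithmetic, which is the step from logic gates toward a processing unit.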

Processor Performance

Moore's Law
The number of transistors on a microchip doubles every two years, though the cost of computers is halved.
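
Doubling every two years is exponential growth: after t years the count is multiplied by 2^(t/2). The sketch below projects that idealized curve. It assumes the commonly cited figure of about 2,300 transistors for the Intel 4004 (1971) as a starting point; that figure is not from the slides, and the output is a projection, not measured data.

```python
def projected_transistors(start_count: float, start_year: int, year: int,
                          doubling_period_years: float = 2.0) -> float:
    """Idealized Moore's-law projection: one doubling per period."""
    doublings = (year - start_year) / doubling_period_years
    return start_count * 2 ** doublings


# Assumption (not from the slides): roughly 2,300 transistors in 1971.
START_COUNT = 2300
START_YEAR = 1971

for year in (1971, 1981, 1991, 2001, 2011, 2019):
    count = projected_transistors(START_COUNT, START_YEAR, year)
    print(f"{year}: about {count:,.0f} transistors")
```

Ten years is five doublings, so each decade multiplies the count by 32.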

Source: s-law-in-action-1971-2019/

Multi-Core Processors
- Two or more separate processing units (called cores)
- Enhances parallel processing
Intel Core Duo (has 2 cores, 2.66 GHz)
Intel Core i7 (has 4 cores, 4 GHz)

Quality Up
What do we achieve when we use p processors?
$$\text{Quality Up} = \frac{\text{quality on } p \text{ processors}}{\text{quality on 1 processor}}$$
Read Section 1.1 of Jan Verschelde's book on "Introduction to Supercomputing".

Can we use multiple processors?
Amdahl's law
- Let $R$ be the fraction of the operations which cannot be parallelized. The speedup with $p$ processors is bounded by $\dfrac{1}{R + \frac{1 - R}{p}}$.
Example
- Say 10% cannot be parallelized, and we have 8 processors. Best speedup $= \dfrac{1}{\frac{1}{10} + \frac{1 - 1/10}{8}} \approx 4.7\text{x}$.

Multiple Processors for Speedup
Speed up in terms of problem size.
Amdahl's law
- Let $R$ be the fraction of the operations which cannot be parallelized. The speedup with $p$ processors is bounded by $\dfrac{1}{R + \frac{1 - R}{p}}$.
Example
- Say 10% cannot be parallelized, and we have 8 processors. Best speedup $= \dfrac{1}{\frac{1}{10} + \frac{1 - 1/10}{8}} \approx 4.7\text{x}$.
- What happens if we had infinite processors?

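Quiz
Say 10% of the operations cannot be parallelized. What happens if we had infinite processors?
Answer: 10x. As $p$ grows without limit, the term $\frac{1 - R}{p}$ vanishes and the bound $\frac{1}{R + \frac{1 - R}{p}}$ approaches $\frac{1}{R} = \frac{1}{0.1} = 10$.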
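
The Amdahl bound and its limit can be checked numerically. The sketch below evaluates 1 / (R + (1 - R) / p) for the 10% serial fraction used in the slides; the function name `amdahl_speedup` and the chosen processor counts are illustrative.

```python
def amdahl_speedup(serial_fraction: float, processors: float) -> float:
    """Upper bound on speedup under Amdahl's law: 1 / (R + (1 - R) / p)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


R = 0.10  # fraction of operations that cannot be parallelized

for p in (1, 2, 8, 64, 1024):
    print(f"p = {p:5d}: speedup bound = {amdahl_speedup(R, p):.2f}x")

# With unlimited processors the parallel part costs nothing,
# so the bound tends to 1 / R.
print(f"p -> infinity: speedup bound = {1.0 / R:.1f}x")
```

The values climb quickly at first and then flatten out below 10x, which is why the serial fraction, not the processor count, ends up limiting the speedup.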

Speedup
Speed up in terms of time.
Gustafson's Law
- If $s$ is the fraction of serial operations in a parallel program run on $p$ processors, then the scaled speedup is bounded by $p + (1 - p)s$.
Example
- Say the other seven processors are kept idle while one processor completes 5% of the work: scaled speedup $= 8 + (1 - 8) \times 0.05 = 7.65$.
Our ability to parallelize determines the successful use of multi-core processors.
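
A short sketch of the Gustafson bound p + (1 - p) * s, reproducing the 8-processor, 5% serial example from the slide. The function name `gustafson_speedup` is illustrative.

```python
def gustafson_speedup(serial_fraction: float, processors: int) -> float:
    """Scaled speedup bound under Gustafson's law: p + (1 - p) * s."""
    return processors + (1 - processors) * serial_fraction


s = 0.05  # fraction of serial operations in the parallel program

for p in (1, 2, 8, 64, 1024):
    print(f"p = {p:5d}: scaled speedup = {gustafson_speedup(s, p):.2f}")
```

Unlike the Amdahl bound, this expression keeps growing linearly with p, because the problem size is assumed to scale with the number of processors.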

Supercomputer
A computing system that provides close to the best currently achievable sustained performance on demanding computational problems.
How do supercomputers achieve such performance levels?

GPUs and GPGPUs
Graphics Processing Unit (GPU)
- Massive parallelization
- Thousands of cores
- Originally created for the gaming industry
General Purpose GPU (GPGPU)
- Architecture allows for programming (example: Compute Unified Device Architecture (CUDA) on NVIDIA GPGPUs).
- Performance is measured in FLOPS (floating-point operations per second); a FLOP is a single floating-point operation.

CPU vs. GPU
- Say two floating-point operations can be performed per clock cycle: a 3 GHz processor then delivers 6 gigaflops (6 billion floating-point operations per second).
- Top GPUs achieve petaflop-per-second performance.
- Achieved through an array of cores (the V100 has 5120 cores).
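
The CPU figure above is the product of clock rate and operations per cycle, and the same product scaled by the core count gives a theoretical peak for a many-core device. The sketch below assumes two floating-point operations per cycle per core throughout, and a boost clock of roughly 1.53 GHz for the V100; the clock value is an assumption, not from the slides. Real sustained performance is lower than such peak numbers.

```python
def peak_flops(clock_hz: float, flops_per_cycle: int, cores: int = 1) -> float:
    """Theoretical peak: clock rate x operations per cycle x number of cores."""
    return clock_hz * flops_per_cycle * cores


# CPU example from the slide: 3 GHz, two operations per cycle, one core.
cpu_peak = peak_flops(3e9, 2)
print(f"CPU peak: {cpu_peak / 1e9:.1f} gigaflops")

# GPU example: 5120 cores (from the slide) at an ASSUMED 1.53 GHz clock.
gpu_peak = peak_flops(1.53e9, 2, cores=5120)
print(f"GPU peak: {gpu_peak / 1e12:.1f} teraflops")
```

The GPU runs at a lower clock than the CPU, yet its peak is far higher because the work is spread across thousands of cores.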

My System
My Lenovo X390 uses an Intel Core i7-8565U CPU @ 1.8 GHz.
4 cores only!

Deep Blue
- Beat chess world champion Garry Kasparov in 1997.
- 259th most powerful supercomputer.
- Achieved 11.38 GFLOPS.

IBM Watson and Jeopardy Game, 2011
- Cluster of 90 servers, each with a 3.5 GHz eight-core processor, and 16 TB of RAM in total.
- Equivalent to 80 teraflops (a slow supercomputer by today's standards).

Trivia
- Can you name the fastest supercomputer to date?
- How much data can it store?
- How fast is it?

Trivia
- Can you name the fastest supercomputer to date? IBM Summit
- How much data can it store? 250 PB
- How fast is it? 200 petaflops
For more info, see puter.html

How fast is 200 petaflops?
Uses NVIDIA Tesla V100 GPUs. How fast is its 200 petaflops?
"If every person on Earth completed one calculation per second, it would take the world population 305 days to do what Summit can do in 1 second." - Oak Ridge National Laboratory.
That is 200 quadrillion calculations in one second!
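
The 305-day figure can be sanity-checked with a few lines of arithmetic. The sketch below assumes a world population of about 7.6 billion; that number is an assumption, not from the slides, and the result shifts with it.

```python
SUMMIT_FLOPS = 200e15       # 200 petaflops = 200 quadrillion operations per second
WORLD_POPULATION = 7.6e9    # ASSUMPTION: roughly 7.6 billion people
SECONDS_PER_DAY = 24 * 60 * 60


def days_for_humanity(machine_flops: float, population: float) -> float:
    """Days needed for `population` people, each doing one calculation
    per second, to match one second of work on the machine."""
    seconds_needed = machine_flops / population
    return seconds_needed / SECONDS_PER_DAY


days = days_for_humanity(SUMMIT_FLOPS, WORLD_POPULATION)
print(f"About {days:.0f} days")
```

Under this population assumption the result lands within a day of the quoted 305 days.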

Limitations and Opportunities
Supercomputers
- are too expensive
- are still far away from achieving desirable speedups
- need skilled programming (distributed computing algorithms, parallelizable code)
But,
- GPUs are becoming commonplace
- High Performance Clusters are increasingly available

The Central Question!
Instead of using supercomputers, can we put commodity hardware into a cluster and achieve speedup?

Computing with Commodity Hardware - Distributed Computing
Sun et al., Dynamic Task Flow Scheduling for Heterogeneous Distributed Computing, 2007.

Cluster Computing
- Multiple nodes acting as a single node
- High Performance Computing (HPC) clusters are becoming increasingly popular
Source: chrisdag, flickr.
Sun Microsystems, Solaris Cluster

Grid vs. Cluster Computing
- Clusters have a homogeneous set of nodes.
- Grid refers to heterogeneous systems.

All Roads Lead To ... Cloud
We are in the Big Data era!
Two kinds of Big Data Opportunities
- Storage
- Processing

Imperatives for Big Data Platform
Source: a-platform-manifesto

Key Questions
- How to set up and manage such clusters?
- How to achieve reliability, availability, scalability, ...?
- How to build services on cloud?
Apache Hadoop
Open source platform for reliable, scalable, distributed processing of large data sets, built on clusters of commodity computers.

Remember
- Presentation registration deadline is approaching. Register yourself on Moodle.
- Do not ignore the readings.
