Resistive RAM based Memory Hierarchy for an Ultra-low Power Data-parallel Processor Platform
Francky Catthoor
With use of MSc and PhD thesis results, in cooperation with the ReRAM team at IMEC
Also based on ULP-DSIP PhD team work
© imec 2013

Secure, Trustworthy Computing and Communication
Ambient intelligence: a pervasive, context-aware ambient, embedded in everything and everybody, sensitive and responsive to the presence of people.

Personal Healthcare
(figure: WLAN connectivity and implants)

Connected Everywhere Vision
- Teleconference
- Other terminals
- Connectivity to server
- Wireless terminal
- Interaction with sensor network
- and much more?

Current Architectures: Energy-Flexibility Conflict
Goal: a programmable DSIP as good as an ASIC (courtesy: Engel Roza, Philips).
Note: more than 1000 MOPS/mW is reachable now due to smaller subword lengths than 32 bit and non-standard-cell-based layout schemes for critical components.

DSIP vs Other Styles for Loop-dominated Irregular Applications: Scaling Support
(chart: energy/area/production cost vs. 1/flexibility and 1/market volume)
- uProc/FPGA: not efficient enough (a quantifiable model can verify this!); energy and area go up due to older technology (market volume too low!)
- DSP/RISC, ASIP, ASIC: not flexible enough
- DSIP is the best trade-off when energy or timing constraints exclude more flexible platforms (a quantifiable model can verify this!) and the market volume is such that the NRE impact is lower than the production cost

Data-Level Parallelism in DSPs
Single Instruction Multiple Data (SIMD) or vector processing: an instruction memory, local data memory and control unit drive N identical execution units (EU1..EUN) over common control lines.
+ highly efficient for data-parallel algorithms
- difficult to handle for the compiler
- special treatment needed for data-dependent branches and for out-of-object data
- extra measures required for global operations or data exchange
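The common-control-lines idea above can be sketched in a few lines of Python: one decoded instruction drives N identical lanes in lockstep. This is a toy model; the operation names and lane count are illustrative, not from the slides.

```python
# Toy SIMD model: a single control unit decodes one instruction and all
# N identical execution units apply it to their own data lane in lockstep.
def simd_execute(instruction, lanes_a, lanes_b):
    # Common control lines: every lane performs the same operation.
    op = {"add": lambda x, y: x + y, "mul": lambda x, y: x * y}[instruction]
    return [op(a, b) for a, b in zip(lanes_a, lanes_b)]

# One vector-add instruction across 4 lanes
print(simd_execute("add", [1, 2, 3, 4], [10, 20, 30, 40]))  # [11, 22, 33, 44]
```

Note that a data-dependent branch cannot be expressed this way without per-lane masking, which is exactly why it needs the special treatment listed above.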

Typical Application: Energy Breakdown for MPEG-2 Decoding
Optimized ITSE mapping (no ITSE transformation yet) vs. non-optimized (no advanced DTSE):
- instruction memory hierarchy: 32%
- level-1 data cache: 28%
- data register file: 35%
- global memory (optimized): 1%
- optimized communication architecture: 1%
- datapath (module generation): 3%
Source: P. Raghavan and F. Catthoor

Decoder + Wordline Contribute Significantly to Memory Delay and Energy: Wide Word Access
For a typical small SRAM (64 kb) sized in the conventional way, the breakdown of delay and energy shows:
- the decoder + wordline contribute nearly 60% of the SRAM delay
- the decoder + wordline contribute about 40-50% of the SRAM energy
Sources: energy: Evans, Circ. 1995; delay: Horowitz et al., Trans. Solid-State Circ. 2002.

Proposed DSIP Architecture Template Exploiting Wide L1D/L1I Memory Access
(block diagram: FEENECS, 1 tile in a platform)
- External memory (SDRAM) at Width-1, with programmable DMA and a DMA loop buffer
- Wide scratchpad (Level-1 DM) accessed over a very wide word (Width-2)
- Very Wide Registers (VWRs, Level-0 DM) with VWR loop buffers
- SWP shifter feeding complex FUs (ComplxFU1, ComplxFU2) at the datapath word width (Width-3)
- MMU loop buffer, AGU and LD/ST unit
- Level-0 instruction memory, Level-1 I-cache and DP loop buffer

Data Memory Hierarchy Operation (Generic Read)
The DMA and its loop buffer fill the scratchpad; the programmable load/store unit then transfers a very wide word (960 bits) from the scratchpad into one of the Very Wide Registers (VWR1..VWRN).

Data Memory Hierarchy Operation (Generic Write)
Symmetrically to the read, the programmable load/store unit transfers a very wide word (960 bits) from one of the Very Wide Registers (VWR1..VWRN) back into the scratchpad, from where the DMA and its loop buffer drain it towards external memory.
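The two generic transfers can be summarized in a small Python sketch. The 960-bit width and the VWR naming come from the slides; the class and function names are hypothetical, not IMEC's simulator API.

```python
# Toy model of the very-wide-register (VWR) transfer path: a whole
# 960-bit line moves between the wide scratchpad and one VWR per access.
VWR_WIDTH_BITS = 960

class Scratchpad:
    def __init__(self, n_lines):
        self.lines = [0] * n_lines    # each entry models one very wide word

class VWRFile:
    def __init__(self, n_regs):
        self.regs = [0] * n_regs      # VWR1..VWRN

def vwr_load(sp, vwrs, line, reg):
    """Generic read: one very wide word, scratchpad -> VWR."""
    vwrs.regs[reg] = sp.lines[line]

def vwr_store(sp, vwrs, reg, line):
    """Generic write: one very wide word, VWR -> scratchpad."""
    sp.lines[line] = vwrs.regs[reg]
```

The point of the wide transfer is that a single decoder/wordline activation is amortized over all 960 bits, which is where the energy gain over word-by-word load/store traffic comes from.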

BG Memory for Data-parallel Accesses: Transparent Memory
- Most of the known approaches are datapath-oriented, e.g. CoolFlux, EVP.
- We expose the whole memory hierarchy to the architecture (ISA), i.e. the ISA and the compiler are aware of the memory hierarchy.
- Conventional and other state-of-the-art approaches "hide" the memory hierarchy; it is accessible mainly through simple load/store instructions.
- 6x gain from wide accesses and complete usage of all bits retrieved (energy breakdown over register file, instruction memory, data memory and datapath).

FG Memory for Data Parallelism (Vector): Irregular Architecture
- Registers have dedicated ports, with separate interfaces to the BG memories and the datapaths.
- Low hardware overhead: single-ported cells.
- Such an FG memory organization does not exist in any state-of-the-art processor, and neither does compilation support to handle such irregularity.
- 500x gain over a VLIW for FEENECS: 12x from port reduction, 20x from the compiler, MAC, branch reduction and fewer address calculations.

Nonvolatile Memory Roadmap
(chart: cell size in F2 vs. technology node, from 180 nm down to the 1x node; copyright Jan Van Houdt, IMEC, 2010)
- Evolutionary, for code: NOR-Flash (FG/NROM) in production at 8 F2
- Evolutionary, for data: NAND-Flash (FG) at 4 F2, TANOS, 3D-NAND Flash at 2 F2
- Disruptive: PCM at 6 F2, RRAM at 2 F2
Conclusion: the main focus now is on stand-alone applications. But for embedded L1-L2 there are not many options (ReRAM, STT-MRAM, maybe domain-wall memory): a strong need!

BG Memory for Data-parallel Access: Remaining Issue is Leakage Energy in DDSM
Proposed solution: use embedded NVM for both L1 (e.g. 32-128 Kb) and L2 (e.g. 1 Mb).
- Modeling/optimisation is ongoing at IMEC for ReRAM and STT-MRAM.
- Read delay: 1 ns; dynamic energy per bit read (32 Kb): 4 fJ.
- Leakage energy occurs only in the periphery circuits, which have a limited device count and non-minimal transistor sizes, so it is much less of a problem in DDSM.
- Mitigate the long write latency and low write endurance by the VWR combination. Conventional and other state-of-the-art approaches would not allow this because of their too-random access requirements.

PhD Matthias Hartmann (IMEC): eNVM-based L1D Organisation
- Develop "glue logic" to mask the xRAM problems
- Develop a "higher-level" ReRAM model for performance, energy and area
- Benchmark xRAM as an SRAM replacement in terms of energy, performance and area

Write Frequency
(chart: write-frequency distribution, 0-50%)

Various Possible Memories to Replace in a Wireless SoC
Each memory needs a unique policy for SRAM replacement:
- the instruction memory can exploit non-volatility
- high-speed data needs efficient latency masking
- much slower memories could handle drop-in replacement
(SoC diagram: two ARM A-15 cores with L1D/L1I, an MPEG4 accelerator, L2 memory, an LTE receiver and a Turbo decoder)

Banking Policy to Hide the xRAM Write Latency
(figure: reads and writes alternating between two xRAM banks)
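A minimal sketch of such a policy, assuming (as later slides do) a multi-cycle xRAM write: addresses are interleaved over two banks, so a read can be served from the idle bank with no stall while the other bank finishes a slow write. All names and the latency value are illustrative, not IMEC's design.

```python
# Two-bank xRAM sketch: writes occupy one bank for WRITE_LATENCY cycles;
# reads to the other bank proceed unstalled, masking the slow write.
class TwoBankXram:
    WRITE_LATENCY = 10  # cycles, matching the slides' 10 ns xRAM write assumption

    def __init__(self, lines_per_bank):
        self.banks = [[0] * lines_per_bank, [0] * lines_per_bank]
        self.busy_until = [0, 0]   # cycle at which each bank's write completes

    def write(self, now, addr, value):
        bank = addr % 2            # interleave addresses across the banks
        self.banks[bank][addr // 2] = value
        self.busy_until[bank] = now + self.WRITE_LATENCY

    def read(self, now, addr):
        bank = addr % 2
        stall = max(0, self.busy_until[bank] - now)  # 0 if the bank is idle
        return self.banks[bank][addr // 2], stall
```

In this model a read only stalls when it targets the bank that is still busy writing; a compiler or line buffer that steers accesses accordingly never pays the write latency.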

Scenario: Wireless SoC
ADRES baseband processor, clock frequency 1 GHz:
- supports current and future LTE and WLAN standards
- 256-bit vector processing
- strict latency requirements
- SystemC model available
- 2 dedicated L1 data memories
Task: implement a SystemC model for xRAM and potential micro-architectural solutions.
See the DSD Euromicro conference, 2013.

Can SRAM Performance be Met?
- A simple "drop-in" solution lowers performance by up to a factor of 5.
- Special masking solutions can eliminate the performance penalty. Area and energy cost?
(chart: performance penalty, 100-500%, for drop-in, 2 banks, 2 banks optimized, line buffer, and line buffer with read bypass)
Assumption: xRAM write takes 10 ns instead of the 1 ns write for SRAM.

Effects of Write Latency
- Different masking solutions might scale differently.
- Decreasing the write latency (in terms of clock cycles) can enable "cheaper" solutions.
(chart: performance penalty, 100-300%, at write latencies of 10, 8, 6 and 4 cycles, for ReRAM drop-in, ReRAM 2 banks, ReRAM line buffer, and ReRAM line buffer with read bypass)
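A back-of-the-envelope model (not from the slides) of why the drop-in curve is the steepest: if every write stalls the memory for the full NVM write time, the slowdown grows linearly with both the write fraction of the access mix and the latency in cycles.

```python
# Hypothetical drop-in penalty model: reads take read_cycles; each write
# takes the full NVM write latency, with no masking whatsoever.
def dropin_slowdown(write_fraction, write_latency_cycles, read_cycles=1):
    per_access = ((1 - write_fraction) * read_cycles
                  + write_fraction * write_latency_cycles)
    return per_access / read_cycles

# e.g. 30% writes and a 10-cycle write: 3.7x slower than the SRAM baseline
print(round(dropin_slowdown(0.30, 10), 2))
```

With a write fraction approaching 50%, this crude model lands near the factor-5 penalty reported for the drop-in solution; masking solutions break the linear dependence on the write fraction.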

Instruction Memory Hierarchy: Executing Loops in Parallel
Original loop kernel:
  for (i = 1..N) for (j = 1..M) for (k = 1..L) { loop body }
After transformations (intermediate loop structure within the compiler IR), the nest is split into smaller nests with their own bodies:
  for (i = 1..N') for (j = 1..M') { loop body 1 }
  for (i' = 1..N'') for (j = 1..M'') { loop body 2 }
  for (k = 1..L') { loop bodies 3 and 4 }
These are mapped onto distributed loop buffers: L0 buffers for the datapath (complex FUs) and for the AGUs, steered by the OCLC and LCs.
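The restructuring above is essentially loop fission: one deep nest is split into smaller independent nests, each small enough to be issued from its own distributed L0 loop buffer. A language-neutral Python sketch, with placeholder bodies and bounds:

```python
# Loop fission sketch: the original three-deep nest is split into smaller
# nests so each body can be issued from a separate L0 loop buffer.
def original_kernel(N, M, L, body):
    for i in range(N):
        for j in range(M):
            for k in range(L):
                body(i, j, k)

def transformed_kernel(N1, M1, body1, N2, M2, body2):
    # After transformation: two independent nests that distributed loop
    # buffers could execute in parallel on separate FUs/AGUs.
    for i in range(N1):
        for j in range(M1):
            body1(i, j)
    for i in range(N2):
        for j in range(M2):
            body2(i, j)
```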

PhD Manu Komalan (UC Madrid): Instruction Memory Organisation - Motivation
- Embedded memories increasingly dominate System-on-Chip designs in terms of chip area, performance, power consumption and manufacturing yield.
- On-chip memories today occupy more than 50% of the total die area and are responsible for more than 40% of the total power consumption. Cache memory alone accounts for 30% of the on-chip area in state-of-the-art microprocessors.

Sim 1
Performance-penalty variation for different latencies (delays) in the original system with an NVM-based I-cache; no modification to the cache organization.
Conclusion: simple substitution of SRAM by NVM is not feasible. Architectural modifications are a must.

Performance Penalty Variation
(chart)

The MSHR is modified to hold an instruction block with tags. A new control bit S is added to signal that the entry holds valid instructions.
An MSHR instruction block is promoted to the NVM array whenever the number of accesses to the block reaches a certain promotion threshold.
This enhanced MSHR effectively acts like a small fully associative cache between the IL1 and the next L2 cache.
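The promotion mechanism can be sketched as follows. The threshold value and all identifiers are illustrative; the slides leave the promotion threshold as a tuning parameter.

```python
# Sketch of the enhanced MSHR: entries hold instruction blocks (S bit set)
# and are promoted into the NVM array only after enough accesses, so
# short-lived blocks never cost an NVM write.
PROMOTION_THRESHOLD = 4   # hypothetical tuning parameter

class EnhancedMSHR:
    def __init__(self):
        self.entries = {}     # tag -> [block, access_count, S_bit]
        self.nvm_array = {}   # tag -> block (the NVM I-cache array)
        self.nvm_writes = 0   # writes that actually reach the NVM

    def fill(self, tag, block):
        """A miss response lands in the MSHR; S=1 marks valid instructions."""
        self.entries[tag] = [block, 0, 1]

    def access(self, tag):
        """A hit in the MSHR counts toward promotion; returns the block."""
        entry = self.entries.get(tag)
        if entry is None or not entry[2]:
            return None                      # not held as valid instructions
        entry[1] += 1
        if entry[1] >= PROMOTION_THRESHOLD:  # promote to the NVM array
            self.nvm_array[tag] = entry[0]
            self.nvm_writes += 1
            del self.entries[tag]
        return entry[0]
```

The filtering effect is visible in `nvm_writes`: a block evicted from the MSHR before reaching the threshold never touches the write-limited NVM cells.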

Performance Penalty Reduction
(chart)

Sim 6
Endurance plays a huge part in the feasibility of these proposed caches.
An important aspect of the modification to the instruction cache is the reduction of writes to the I-cache, which has a significant bearing on endurance and lifetime.
We explore the influence of the I-cache size on endurance; endurance here can roughly be gauged by the write reduction (%).
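Because endurance is effectively a fixed per-cell write budget, the write reduction translates directly into lifetime: cutting writes by a fraction r stretches the expected lifetime by 1/(1-r). A rough illustration with hypothetical numbers; neither the endurance figure nor the write rate comes from the slides.

```python
# Rough lifetime model: a cell with a fixed write-endurance budget lasts
# endurance / effective_write_rate seconds; filtering writes by a
# fraction `write_reduction` scales the lifetime by 1 / (1 - reduction).
def lifetime_years(endurance_writes, writes_per_second, write_reduction):
    effective_rate = writes_per_second * (1 - write_reduction)
    return endurance_writes / effective_rate / (3600 * 24 * 365)

# e.g. 1e8-write endurance, 10 writes/s to the hottest line, 60% reduction
print(round(lifetime_years(1e8, 10, 0.60), 2))
```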

Write Reduction Variation
(chart)

Conclusions
- Tuning of selected parameters can reduce the performance penalty due to the NVM to extremely tolerable levels (around 1%).
- For ReRAM, we can obtain significant reductions in energy consumption, given that the technology has a favourable read energy per access compared to SRAM.
- The Pareto-optimal values for the different parameters are application- and platform-dependent.
- The architectural modifications proposed here also offer endurance gains, as a result of the filtering of some writes to the NVM array.
- Selective banking can be explored to increase endurance further.

www.imec.be
Worldwide collaboration with more than 500 companies and institutes.
IMEC - Kapeldreef 75 - B-3001 Leuven - Belgium - Tel. +32 16 281211 - Fax +32 16 229400