Ahn et al. EURASIP Journal on Image and Video Processing 2014, 4/1/16RESEARCHOpen AccessImplementation of fast HEVC encoder based onSIMD and data-level parallelismYong-Jo Ahn1, Tae-Jin Hwang1, Dong-Gyu Sim1* and Woo-Jin Han2AbstractThis paper presents several optimization algorithms for a High Efficiency Video Coding (HEVC) encoder based onsingle instruction multiple data (SIMD) operations and data-level parallelism. Based on the analysis of thecomputational complexity of HEVC encoder, we found that interpolation filter, cost function, and transform takearound 68% of the total computation, on average. In this paper, several software optimization techniques, includingframe-level interpolation filter and SIMD implementation for those computationally intensive parts, are presentedfor a fast HEVC encoder. In addition, we propose a slice-level parallelization and its load-balancing algorithm onmulti-core platforms from the estimated computational load of each slice during the encoding process. Theencoding speed of the proposed parallelized HEVC encoder is accelerated by approximately ten times compared tothe HEVC reference model (HM) software, with minimal loss of coding efficiency.Keywords: HEVC; HEVC encoder; SIMD implementation; Slice-level parallelism; Load balancing1 IntroductionAlong with the development of multimedia and hardware technologies, the demand for high-resolution videoservices with better quality has been increasing. Thesedays, the demand for ultrahigh definition (UHD) videoservices is emerging, and its resolution is higher thanthat of full high definition (FHD), by a factor of 4 ormore. Based on the market demands, ISO/IEC MovingPicture Experts Group (MPEG) and ITU-T Video CodingExperts Group (VCEG) have organized Joint CollaborativeTeam on Video Coding (JCT-VC) and standardized HighEfficiency Video Coding (HEVC), whose target codingefficiency was twice better than that of H.264/AVC [1].In the near future, HEVC is expected to be employedfor many video applications, such as video broadcastingand video communications.Historically, MPEG-x and H.26x video compressionstandards employ the macro-block (MB) as one basicprocessing unit [2], and its size is 16 16. However,HEVC supports larger sizes of the basic processing unit,called coding tree unit (CTU), from 8 8 to 64 64. ACTU is split into multiple coding units (CU), in a quad* Correspondence: [email protected] of Computer Engineering, Kwangwoon University,Wolgye-dong, Nowon-gu, Seoul 447-1, South KoreaFull list of author information is available at the end of the articletree fashion [3]. Along with the CU, the prediction unit(PU) and transform unit (TU) are defined, and theirsizes and shapes are more diverse than the prior standard technologies [4,5]. On top of them, many advancedcoding tools that improve prediction, transform, andloop filtering are employed to double the compressionperformance compared with H.264/AVC. However, thecomputation requirement of HEVC is known to be significantly higher than that of H.264/AVC because HEVChas more prediction modes, larger block size, longerinterpolation filter, and so forth.Typically, a huge number of rate-distortion (RD) costcomputations are required to find the best mode from64 64 to 8 8 block sizes in the encoder side for HEVC.With respect to applications, HEVC would be employedfor ultrahigh-resolution video services. For such cases, fastvideo coders are required to process more data with agiven processing power. Thus, parallelization techniqueswould be crucial, with multiple low-power processorsor platforms. The single instruction multiple data (SIMD)implementation of the most time-consuming moduleson HM 6.2 encoders was proposed [6]. This workimplemented the cost functions, transformation, andinterpolation filter with SIMD, and it reported that theaverage time saving obtained is approximately 50% to80%, depending on the modules. Wavefront parallel 2014 Ahn et al.; licensee Springer. This is an open access article distributed under the terms of the Creative CommonsAttribution License (, which permits unrestricted use, distribution, and reproductionin any medium, provided the original work is properly cited.

Ahn et al. EURASIP Journal on Image and Video Processing 2014, 4/1/16Page 2 of 19Finally, Section 7 concludes the work, with further research topics.processing (WPP) for HEVC encoders and decoderswas introduced [7]. For the decoder case, they achievedparallel speed-up by a factor of 3. The acceleration factor of the wavefront parallelism is in general saturatedinto 2 or 3 due to data communication overhead, epilog,and prolog parts. There are no works that incorporateall the parallel algorithms, with maximum harmonizationfor fast HEVC encoders. In this paper, we focus onload-balanced slice parallelization, with optimizationimplementation of HEVC. This paper presents severaloptimization techniques using SIMD operations for theRD cost computations and transforms for variable blocksizes. In addition, motion estimation is also efficientlyimplemented with a frame-based processing to reducethe number of redundant filtering. For data-level parallelization, this paper demonstrates how to allocateencoding jobs to all the available cores through the useof complexity estimation. As a result, it is possible toachieve load-balanced slice parallelism in HEVC encoders to significantly reduce the average encodingtime. With all the proposed techniques, the optimizedHEVC encoder achieves a 90.1% average time savingwithin 3.0% Bjontegaard distortion (BD) rate increasescompared to HM 9.0 reference software.The paper is organized as follows. Section 2 presents acomplexity analysis of HEVC encoder, and Section 3 introduces basic data-level parallelisms for video encoders.In Section 4, the SIMD optimization for cost functionsand transform, as well as frame-level implementation ofinterpolation filter, is explained in detail. A slice-levelparallelization technique with a load-balancing propertyis proposed in Section 5. Section 6 shows the performance and numerical analysis of the proposed techniques.2 HEVC and its complexity analysisFigure 1 shows a block diagram of HM encoder [8]. TheHEVC encoder consists of prediction, transformation,loop filters, and entropy coder, which are the same coresas the prior hybrid video coders. However, HEVC employs more diverse block sizes and types, named CU,PU, and TU. The CU sizes range from 8 8 to 64 64,when the CTU is set to 64 64. The CU structure ispartitioned in a quad-tree fashion, and each CU canhave one of several PU types inside it. This allows eachCU to be predicted with diverse block sizes and shapes.In addition, advanced motion vector prediction (AMVP)[9] and block merging techniques [10,11] are adopted toeffectively represent the estimated motion vectors. Forresidual coding, the transform block sizes and shapes aredetermined based on a rate-distortion optimization (RDO).The quad-tree transform is used for each CU, and it isindependent of the PU type. Transform coefficients arequantized with a scalar quantizer. To improve codingefficiency, the rate-distortion optimized quantization(RDOQ) is often employed, during the quantizationprocess. Finally, the reconstructed blocks are filteredwith two-stage loop filters: the de-blocking filter (DBF)and sample adaptive offset (SAO).As mentioned before, HEVC supports hierarchicalblock partitioning. Figure 2 shows an example of diverseCU and PU realizations in a slice, when the CTU size isset to 64 64. A slice is divided into multiple CTUs,and each CTU is again partitioned into multiple CUs.The quad-tree structure is effectively represented byTransformRnFn- TU size:32 32 - 4 4- Residual quad tree(RQT)Entropy codingQuantizationPictureBufferF’n-1Inter prediction- Motionestimation- AMVPF’n-2- DCT basedinterpolationfilter (DCT-IF)- Delta QP- Motioncompensation- Block merge- Rate-distortionoptimizedquantization (RDOQ)- Context adaptive binaryarithmetic coding (CABAC)Intra prediction- Referencesamplepadding- Mode dependentintra smoothing(MDIS)- Planar- DC- AngularQuantization-1Transform-1Loop filterF’n- Sampleadaptiveoffset (SAO)- De-blockingfilter R’nFigure 1 Block diagram of HEVC reference model (HM) encoder.Bitstream

Ahn et al. EURASIP Journal on Image and Video Processing 2014, 4/1/16Page 3 of 19CTU64CU16 16Prediction unit (PU)CU16 16CU32 32CU8 8CU8 82N 2N2N NN 2NN N2N nU2N nDnL 2NnR 2NCU16 16CU16 16CU16 16CTU64 64CU8 8CU8 8CU8 8CU8 8CU8 8CU8 8CU8 8CU8 8CU8 8CU8 8CU8 8CU8 8CU8 8CU8 8CU16 16CU16 16Transform unit (TU)CU16 16TU depth 0TU depth 1CTU64 64CTU64 TU depth 2Figure 2 Example of CU, PU, and TU partitions for a CTU in HEVC.hierarchical flag syntaxes. HEVC has one identical syntax for diverse CU block sizes. The size of a CU blockcan be derived with quad-tree split flags. In addition,one CU can be coded with one of the several PUs. For acurrent CU of 2 N 2 N, the CU can have one of sevenPU splitting types: 2 N 2 N, 2 N N, N 2 N, or fourasymmetric shapes (2 N nU, 2 N nD, nL 2 N, andnR 2 N) [12]. The residual signal is transformed inTU, which can be recursively split into multiple levelquadrants. HEVC supports diverse TU sizes, from 4 4up to 32 32, and the maximum TU size depends onthe CU size. However, quadtree-based TU partitioningis conducted independent of PU partitioning.In this section, the complexity of HEVC encoder is investigated, and critical modules can be identified based onthe complexity analysis. In this work, HM 9.0 referencesoftware [13] was used for HEVC encoder analysis. Notethat it was used as the base software for our optimization.A HEVC encoder can be mainly modularized into fiveparts: entropy coding, intra prediction, inter prediction,transform quantization, and loop filter. The cycle analyzer,Intel VTune Amplifier XE 2013 [14] on Intel Core i73960 K processor, was employed to measure the numberof cycles for each module, in cases of the random access(RA) and low-delay (LD) test configurations, under thecommon test conditions [15]. Note that class B (1,920 1,080) and class C (832 480) sequences were used.Table 1 shows the percentages of computation cyclesof all the key modules, for two configurations, RA andLD. We found that the inter prediction takes around68.4% to 89.1% of the total cycles, whereas the intra prediction takes only 1.2% to 3.3%. For the transformquantization and entropy coding, the percentages ofcycles are counted as 10.1% to 20.7% and 0.3% to 6.6%,respectively. Note that quantization is one of the keymodules in terms of functionality, but its computationalcycle is not significant for measuring it independently.In this paper, the total computational cycles of transform and quantization were measured. As shown inTable 1, the inter prediction, which consists of motionestimation and compensation, is the dominant moduleTable 1 Percentages of computational cycles of HM 9.0encoderModuleEntropy codingRA (%)LD (%)Average (%)2.982.402.69Intra prediction2.251.952.10Inter prediction79.0382.2380.63Transform quantization14.4812.5013.49In-loop filter (de-blocking filter) filter (sample adaptive offset)

Ahn et al. EURASIP Journal on Image and Video Processing 2014, 4/1/16in computational complexity. For HEVC inter prediction coding, a large number of coding modes for varioussize blocks are evaluated and coded by computing ratedistortion costs. In addition, the hierarchical block partition is traversed in a recursive fashion in HM.Table 2 shows percentages of the cycles for intra prediction, inter prediction, and skip modes, depending onthe CU sizes. Regardless of the CU sizes, the percentageof inter prediction coding is about 73.7% to 83.4%. Forthe CU of 8 8, the cycle percentage of the intra prediction coding is approximately 2 to 3 times higher than theothers, because 4 4 and 8 8 intra prediction modes aretested for RDO.Table 3 shows the percentiles of the numbers of cyclesfor the top four functions. The interpolation filter, sum ofabsolute differences (SAD), sum of absolute transformeddifferences (SATD), and discrete cosine transform (DCT)/inverse DCT (IDCT) take approximately 67% to 71% inthe total cycles. The interpolation filter is the most complex function of HEVC encoders and is used for motionestimation and motion compensation. SAD and SATD arecost functions to calculate distortions between originaland prediction blocks. In particular, SAD and SATD arethe metrics for motion estimation of integer-pels andfractional-pels, respectively. In addition, SATD is also usedfor intra prediction. DCT/IDCT is applied to the residualsignal from intra or inter prediction for data compactionin the DCT domain. We can say that optimization of thefour functions is inevitable in order to accelerate HEVCencoder.3 Data-level parallelization of video encodersData-level and function-level parallelization approachesare widely used for high-speed video codecs. In particular,function-level parallel processing is frequently used forTable 2 Percentages of computational cycles, dependingon CU sizes and modesSize64 6432 3216 168 8ModeRA (%)LD (%)Average (%)Ratio in eachCU size 73.7Skip1. 4 of 19Table 3 Percentages of computational cycles of top fourfunctionsModuleRA (%)LD (%)Interpolation filter35.7436.00Average form/inverse transform3.523.083.30Total (%)67.5870.9169.25hard-wired implementations. Note that function-levelparallel processing is not easily implemented mainlydue to difficulties of load balancing and longer development period. Data-level parallel processing is relativelyeasy to be employed for video encoders because thedata processing flows are the same for all the data. Thedata-level parallelism for HEVC can be conducted interms of CU-, slice-, and frame-level ones. In addition,HEVC contains a parallel tool, called tile, which dividesa picture into multiple rectangles [16]. In tile partitioning, the number of CTUs adjacent to boundaries of tilepartitions is less than that of slices. From this fact, tilepartitioning can yield slightly lower coding loss in compression efficiency compared to an implementation withthe same number of slices [17].For parallel implementations, we need to consider several factors, such as throughput and core scalability, aswell as coding efficiency. Note that the core scalabilitymeans how much we need to change an implementation,depending on an increasing or decreasing number ofcores. In addition, the throughput can be improved withparallel processing as compared with the single processing unit. However, many video coding algorithms, ingeneral, have dependencies among neighboring codingunits, neighboring frame, earlier-coded syntaxes, and soon. At the same time, we need to consider the codingefficiency degradation from the parallelization. Eventhough the throughput can be improved with parallelprocessing, it is not desirable that the coding efficiencyis significantly degraded. Regarding the core scalability,it is better to employ a scalable parallelization methodthat can be easily extended for an increasing number ofcores. If not, we are required to change the implementation, depending on the number of cores.The 2D wavefront algorithm [18] has been used for theparallelization of video coding in CTU level. This codingtool does not impact the coding gain, but there is a limitation in the parallelization factor, even with many cores,due to coding dependence. Frame-level parallelization canbe also used for non-reference bidirectional pictures; however, it depends on the encoding of reference structures.The slice-level parallelism is widely used because we canassume any dependencies among multiple slices. However,we need to realize that the coding gain can be degraded

Ahn et al. EURASIP Journal on Image and Video Processing 2014, 4/1/16with increased number of slices. In this paper, we evaluated bitrate increases in terms of the number of slices inthe HM encoder software. Figure 3 shows the bitrate increases in terms of the number of slices, for four sequenceclasses. In our evaluation, class A (2,560 1,600), class B(1,920 1,080), class C (832 480), and class D (416 240)of the HEVC common test sequences [15] were used,under HEVC common test conditions. As shown inFigure 3, we can see the bitrate increases in terms ofthe number of slices. The bitrate increase becomes lesssignificant as video resolution increases. In general,multiple slices are widely used for larger sequences dueto parallel processing and error resilience. Regardingmany commercial applications, the slice-level parallelprocessing is one of the good approaches for such largeresolution videos.As mentioned before, the slice-level parallelism hasrelatively high coding losses of around 2% to 4% compared to tile-level parallelism and wavefront processing[19]. However, slice-level parallelism has an advantage thatthe slice partitioning is more flexible and accurate for picture partitioning, by adjusting the number of CTUs, compared to the tile partitioning. Note that the tiles withinthe same row and column should use the same tilewidth and height, respectively. Slice-level parallelismof a fine-grained load balancing can yield additionalPage 5 of 19encoding speed-up compared to the tile levels. WPPhas the advantage that the loss of parallelization is relatively small compared to other parallelization methods.However, the acceleration factor of WPP is not so highcompared to slice- or tile-level parallelism because WPPhas prolog and epilog so that parts of the cores are inactivated. It is not easy to utilize all the cores with WPP onaverage. In our work, slice-level parallelism was chosen forthe acceleration of parallelization. In addition, slice partitioning is widely used for the packetizing of bitstreams forerror resiliencies, in practical video encoders and services.There are two main criteria to divide a picture intomultiple slices. One is an equal bitrate, and the other isthe same number of CTUs for all the slices. The first onecannot be easily employed for parallel encoding becausewe cannot define the target bit prior to actual encoding.For the second method, we can easily use the same number of CTUs at a time.4 Optimization for fast HEVC encoderIn this section, two software optimization methods, framelevel processing and SIMD implementation, for three mostcomplex functions at the function-level are presented.The proposed software optimization methods have several advantages to accelerate HEVC encoders withoutany bitrate increase.130ClassDClassCClassBClassA125Ratio of bitrate increments12011511010510012841632Number of slicesFigure 3 Bitrate increment in terms of the number of slices, for four sequence classes. HEVC common test sequences [15]; class A, 2,560 1,600; class B, 1,920 1,080; class C, 832 480; class D, 416 240.

Ahn et al. EURASIP Journal on Image and Video Processing 2014, 4/1/16integer sample in an entire frame. In addition, SIMD instructions and multi-thread processes using OpenMPand GPU can be easily used for fast encoding.4.1 Frame-level interpolation filter in HEVC encoderThe HEVC DCT-based interpolation filter (DCT-IF),which is used for obtaining fractional sample positions,is the most complex function, especially with motion estimation in encoders. Instead of using 6-tap and bilinearinterpolation filters of H.264/AVC, HEVC adopts 8(7)-tapDCT-IF for luminance components, and 4-tap DCT-IF forchrominance components [20]. Furthermore, all of thefractional position samples are derived by increasing thenumber of filter taps without intermediate rounding operations which can reduce potential rounding errors compared to H.264/AVC. In order to determine the optimalCU size and coding modes, HM encoder uses a recursivescheme for the RD optimization process. In particular,the PU-level interpolation filter causes iterative memoryaccesses for the same positions redundantly. Excessivememory accesses significantly increase encoding timedue to the limit of memory bandwidth. Actually, theDCT-IF occupies approximately 30% to 35% of thetotal cycles in the HM encoder. We adopt a framelevel interpolation filter to reduce redundant memoryaccesses. The frame-level interpolation filter avoids redundant execution that occurs in the RD optimizationprocess and enables parallel process with independencyamong neighboring blocks. However, it requires the additional amount of memory for 15 factional samples per12711195i7127i6111111j7PACKUSWB 14763j9jjH ði; jÞjj 2i2i11i879!31i363i9I;JX47j1279ð1Þwhere i and j are the pixel indices, and their ranges aredetermined by a block size. O(i,j) and P(i,j) are the original and predicted pixel values, respectively. Note that63j13jjOði; jÞ Pði; jÞjji;jj479j14SATD ¼63j5I;JXi;ji1279j6SAD ¼63i1395SAD, SATD, and DCT are the most complex functionsin the HEVC encoder, except for DCT-IF. Several costfunctions are used to decide the best coding mode andits associated parameters. SAD and SATD are the twomain metrics to find integer and quarter-pel motion vectors in the motion estimation process, respectively. SADtakes around 10% to 12% of the total cycles in HEVC encoder, and SATD takes 15% to 16%. These two costfunctions are defined byi479i144.2 SIMD implementation of cost function andtransformation63i595i1512779Page 6 of 190i1i015j3j20j1j0PSADBW1271110 0950 0790 063SAD[15-8]Figure 4 Computation steps of SAD using SIMD instructions.470 0310 0150 00SAD[7-0]PACKUSWB

Ahn et al. EURASIP Journal on Image and Video Processing 2014, 4/1/16Page 7 of 19c0 c1 c2 c3c8 c9 c10 c11c4 c5 c6 c7c12 c13 c14 c15PUNPCKLWDPUNPCKLWDc0 c4 c1 c5 c2 c6 c3 c7c8 c12 c9 c13 c10 c14 c11 c15PUNPCKLDQPUNPCKHDQc0 c4 c8 c12 c1 c5 c9 c13c2 c6 c10 c14 c3 c7 c11 c15PUNPCKLQDQPUNPCKHQDQPUNPCKLQDQPUNPCKHQDQc0 c4 c8 c12 c0 c4 c8 c12c1 c5 c9 c13 c1 c5 c9 c13c2 c6 c10 c14 c2 c6 c10 c14c3 c7 c11 c15 c3 c7 c11 c15k0 k4 k8 k12 k1 k5 k9 k13k0 k4 k8 k12 k1 k5 k9 k13k0 k4 k8 k12 k1 k5 k9 k13k0 k4 k8 k12 k1 k5 k9 k13PMADDWDPMADDWDc0*k0 c4*k4c8*k8 c12*k12c0*k1 c4*k5c8*k9 c12*k13k2 k6 k10 k14 k3 k7 k11 k15c8*k10 c12*k14c0*k3 c4*k7c8*k11 c12*k15c0*k1 c4*k5 c8*k9 c12*k13c0*k2 c4*k6 c8*k10 c12*k14c0*k3 c4*k7 c8*k11 c12*k15c1*k1 c5*k5c9*k9 c13*k13k2 k6 k10 k14 k3 k7 k11 k15c1*k2 c5*k6c1*k0 c5*k4 c9*k8 c13*k12c2*k0 c6*k4c10*k8 c14*k12c2*k1 c6*k5c10*k9 c14*k13k2 k6 k10 k14 k3 k7 k11 k15c1*k3 c5*k7c9*k11 c13*k15c1*k1 c5*k5 c9*k9 c13*k13c1*k2 c5*k6 c9*k10 c13*k14c1*k3 c5*k7 c9*k11 c13*k15c2*k2 c6*k6c10*k10 c14*k14c2*k3 c6*k7c10*k11 c14*k15c2*k1 c6*k5 c10*k9 c14*k13c2*k2 c6*k6 c10*k10 c14*k14c2*k3 c6*k7 c10*k11 c14*k15c11*k8 c15*k12c3*k1 c7*k5c11*k9 c15*k13k2 k6 k10 k14 k3 k7 k11 k15c3*k2 c7*k6c11*k10 c15*k14c3*k3 c7*k7c11*k11 c15*k15c3*k1 c7*k5 c11*k9 c15*k13c3*k2 c7*k6 c11*k10 c15*k14c3*k3 c7*k7 c11*k11 c15*k15PHADDDPHADDDc2*k0 c6*k4 c10*k8 c14*k12c3*k0 c7*k4PMADDWDPMADDWDc9*k10 c13*k14PHADDDPHADDDc0*k0 c4*k4 c8*k8 c12*k12c9*k8 c13*k12PMADDWDPMADDWDc0*k2 c4*k6c1*k0 c5*k4PMADDWDPMADDWDc3*k0 c7*k4 c11*k8 c15*k12Figure 5 Computation steps of HEVC inverse transform using SIMD instructions.H(i,j) is the Hadamard transformation of the predictionerror, O(i,j) P(i,j) [8]. Because only addition and subtraction operations are involved for the cost functions,SATD can yield an accurate cost in the transform domain with relatively small complexity compared to DCT.Since both apply the same operations on multiple data,vector instructions are quite useful to reduce the requiredclock cycles. This work uses SSE2 and SSE3 instructionsdefined in Intel SIMD architecture, which are widelyemployed for many DSP processors [21]. In the case ofthe SAD operation, we employed PSADBW (packed sumof absolute differences), PACKUSWB (packed with unsigned saturation), and PADDD (add packed double wordintegers) instructions. Sixteen SAD values can be computed by PSADBW instruction at once. Figure 4 showshow to compute SAD with SIMD instructions. Data packing is conducted with sixteen 16-bit original pixels (ix) andsixteen 16-bit reference pixels (jx) using PACKUSWB instruction. For 8-bit internal bit depth, the data packing isconducted to form 16-bit short data. For 10-bit internalbit depth, the data packing process is not required. Sixteenoriginal pixels and reference pixels are packed into two128-bit registers, and PSADBW is performed. The computed SAD from i0-j0 to i7-j7 is stored in the lower 16 bits,and the SAD from i8-j8 to i15-j15 is stored at bit position64 to 79. Acceleration of 4 4 to 64 64 SADcomputations can be achieved using the aforementionedinstructions based on instruction-level parallelism. The4 4 and 8 8 SATD operations are implemented usinginterleaving instructions, such as PUNPCKLQDQ (unpacklow-order quad-words), PUNPCKHWD (unpack highorder words), PUNPCKLWD (unpack low-order words),and arithmetic instructions, such as PADDW (add packedword integers), PSUBW (subtract packed word integers),and PABSW (packed absolute value).Not only the cost function but also the forward transform and inverse transform can be implemented withSIMD instructions. The forward transform and inversetransform in HEVC are implemented by partial butterflyor matrix multiplication. In this paper, the matrix multiplication is chosen due to its simplicity and regularity.Transform and inverse transform are accelerated usinginterleaving instructions such as PUNPCKLDQ (unpackTable 4 Normalized complexity for variable CU size andmodeCU sizeSkipInterIntra64 641097605232 32422801616 1697138 82191

Ahn et al. EURASIP Journal on Image and Video Processing 2014, 4/1/16Page 8 of 19(a)Slice 0Slice 1Slice 2Slice 30.5Ratio of actual encoding time0. number300350400450500(b)Slice 0Slice 1Slice 2Slice 3Ratio of predicted encoding time0. number300350400450Figure 6 Ratios of actual encoding time and predicted complexity for BasketballDrive sequence. (a) Ratios of actual encoding time forthe four slices. (b) Ratios of predicted complexity for four slices.500

Ahn et al. EURASIP Journal on Image and Video Processing 2014, 4/1/16low-order double words), PUNPCKHDQ (unpack highorder double words), PUNPCKLQDQ, PUNPCKHQDQ(unpack high-order quad words), and arithmetic instructions such as PMADDWD (packed multiply andadd), PADDD, and shift instruction such as PSRAD(packed shift right arithmetic). For HEVC forward andbackward transformation, we need to consider the datarange and the center value in computing matrix multiplications, unlike SAD and SATD implementations. Figure 5shows how to compute the HEVC inverse transform usingSIMD instructions. Data packing is conducted withsixteen 16-bit coefficients (cx) using PUNPCKLWDinstruction. The 16 coefficients are packed into two128-bit registers. For reordering coefficients, the packedcoefficient signals are repacked using PUNPCKLQDQ andPUNPCKHQDQ instructions. Repacked coefficients andthe kernel (kx) of the inverse transform are multipliedfor eight 16-bit data in 128-bit registers. Then, the results of multiplications are added into 128-bit registersusing PMADDWD instruction. Finally, the results ofPMADDWD are added into the 128-bit destinationregister to compute inverse-transformed residuals usingPHADD instruction. Input data for transformation rangefrom 255 to 255. As a result, the data should be represented by at least 9 bits. Data ranges of coefficients ofHEVC transform kernels depend on the size of the transform kernels. However, they can be represented in 8 bitsfor the 32 32 kernel because they range from 90 to 90[22]. For computation of one transform coefficient, the required number of addition and multiplication operationsis as many as the size of the transform kernel along thehorizontal and vertical directions. A downscale shouldbe employed to keep 16 bits in every operation for eachdirection. To avoid overflow and underflow, four 32bit data should be packed into the 128-bit integerregister of SSE2. In addition, the transform matrix istransposed in advance to reduce memory read/writeoperations.5 Proposed slice-level parallelism with loadbalanceTo reduce the computational load of the RD optimization,early termination and mode competition algorithms havebeen adopted in HM reference software [23-25]. However,these fast encoding algorithms cause different encodingcomplexities among different slices. To maximize parallelism of the data-level task partition, an accurate load balance for slice parallelization is required. Several works[26,27] have been conducted to achieve accurate load balance for slice parallelization. In Zhang's algorithm [26],the adaptive data partitioning for MPEG-2 video encoderswas proposed by adjusting computational loads based onthe complexity of a previously encoded frame of the samepicture type. In Jung's algorithm [27], the adaptive slicePage 9 of 19partition algorithm was proposed to use early-decidedcoding mode for macro-blocks in H.264/AVC. In the conventional algorithm, a quantitative model was designed toestimate the computational load associate

the number of redundant filtering. For data-level pa-rallelization, this paper demonstrates how to allocate encoding jobs to all the available cores through the use of complexity estimation. As a result, it is possible to achieve load-balanced slice parallelism in HEVC en-