I. INTRODUCTION
As the demand for higher video quality has grown over many years, new video codec standards with improved compression performance have been developed. The best-known standards, such as MPEG-2, MPEG-4 AVC/H.264 and HEVC, were standardized by the ISO/IEC Moving Picture Experts Group (MPEG), the ITU-T Video Coding Experts Group (VCEG), or both, under a royalty-bearing intellectual property rights policy. Recently, the need for a royalty-free video codec has emerged for several reasons: most patents on the core technologies adopted in widely used standards (e.g., MPEG-2) have either expired or will expire soon; many codec-related companies would like to support a royalty-free codec; and studies of royalty-free codecs have recently been receiving attention in the literature [1-5].
Recognizing the diversified needs of the Internet, MPEG issued the Call for Proposals (CfP) for Internet Video Coding (IVC) technologies [6]. The IVC standard should achieve three goals: 1) the baseline profile will be granted in a free of charge license (i.e., Type-1 license) by patent owners according to the ISO/IEC Common Patent Policy [7], 2) the baseline profile will achieve better compression performance than MPEG-2 and be comparable to AVC Baseline Profile, and 3) the complexity will be feasible for real-time encoding/decoding on generally available personal computers and mobile devices [6]. Responding to the CfP, several influential industry leaders and universities proposed three codecs [8]: Web Video Coding (WVC), Video Coding for Browsers (VCB) and IVC.
So far, MPEG experts have investigated and verified that the coding efficiency of IVC is better than that of AVC Constrained Baseline Profile and is even comparable to AVC High Profile in terms of subjective quality [8]; additional results show that IVC is mostly better than WVC and VCB. As a result of these efforts, a preliminary version of the final draft international standard (FDIS) of IVC was published in January 2017. One important issue, however, remains to be resolved: the decoding complexity for various real-time internet applications (e.g., video chat and internet streaming).
In this paper, we briefly review IVC technologies, focusing on their differences from conventional video codecs, and analyze the decoder modules in terms of computational complexity. We measure time complexity (i.e., running time) to investigate the complexity of IVC coding tools precisely, just as other conventional video codecs have been investigated in the literature [9-11]. In addition, we evaluate how much each IVC-specific tool affects decoding time by turning the tool on and off. Through the experimental results, we present how complex an IVC decoder is and which module is critical in IVC decoding. We also compare the decoding complexity of IVC to that of AVC/H.264 Baseline and High Profiles, which are commonly used in many streaming applications. The results of this analysis should help the reader estimate the time complexity for a variety of processors and optimize the decoding speed of IVC.
The remaining sections are organized as follows. Section 2 briefly describes the different features of IVC decoding from other existing video codecs. Section 3 identifies the time complexity of IVC decoder with experimental analysis. Section 4 shows experimental results of comparison between IVC and a widely-used codec AVC/H.264 [12] in terms of time complexity. Finally, Section 5 concludes this paper.
II. DISTINCTIVE FEATURES OF INTERNET VIDEO CODING (IVC) DECODER
IVC is a codec with a coding structure similar to that of the MPEG-2 standard [13], enhanced with several effective techniques. Some contributors declared their own patented techniques to be Type-1 for the IVC codec. Other contributors mined prior art techniques, that is, techniques whose patents have expired or that were disclosed in the literature without patents. In the big picture, as in conventional video codecs, an IVC bitstream is decoded through an inverse transform/quantization process (if needed according to the syntax of each macroblock), motion compensation and entropy decoding. Except for the group of pictures (GOP) layer, IVC has the same hierarchical layers as MPEG-2: sequences, pictures, slices and macroblocks. However, since many new and old (i.e., prior art) techniques have been adopted in IVC to further improve compression performance, the aspects that differ are described in the remaining subsections with emphasis on the basic principles. The details of how to parse the bitstream, how to interpret the coded symbols and how to reconstruct the video are fully specified by the FDIS of IVC [14], as is usual for popular video coding standards.
There are three picture types in IVC: intra-coded frame (I-frame), predictive-coded frame (P-frame) and bidirectional predictive-coded frame (B-frame). Since IVC adopted a prior art technique that uses multiple reference frames, blocks of a P-frame can refer not only to the most recent P-frame, but also to earlier P-frames or I-frames.
IVC has been developed with two coding configuration targets: random access and low-delay scenarios. For the random access scenario, IVC uses an IBBP coding structure. As described in Fig. 1, a B-frame can only refer to the nearest I- or P-frame, while a P-frame can refer to multiple previous P-frames or I-frames, provided they are stored in the reference frame buffer.
For low-delay applications, an IVC bitstream can be encoded with an IPPP coding structure. In this case, a P-frame must refer to one of the previous frames whose distances from the current frame are pre-defined. For coding efficiency, IVC has an additional P-frame sub-type, called the non-reference P-frame, which is not stored in the reference frame buffer and is therefore never used as a reference.
A macroblock (MB) is partitioned by a quadtree-based approach within 16 x 16 pixels, as shown in Fig. 2. Of the five partitions, an inter-predicted block can be coded with four {16 x 16, 8 x 16, 16 x 8 and 8 x 8}, whereas an intra-predicted block can be coded with three square partitions {16 x 16, 8 x 8 and 4 x 4}. According to the MB partition type, each block can be predicted by various modes and transformed/quantized with different kernel sizes separately, as described in the following subsections.
A block in a partitioned MB can be encoded by several inter-prediction modes depending on the frame type. In a P-frame, three inter-prediction modes are possible: forward prediction, skip and multiple-hypotheses prediction. The multiple-hypotheses prediction mode builds a virtual block by combining two reference blocks in previous frames (the detailed process was presented in [15]). This last mode therefore needs an additional motion compensation process, which can increase decoder complexity.
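To make the extra cost of the multiple-hypotheses mode concrete, the following minimal sketch combines two reference blocks into one prediction. The equal-weight rounded average and the function name `multi_hypothesis_block` are assumptions for illustration only; the actual combination rule is the one presented in [15].

```python
def multi_hypothesis_block(ref_a, ref_b):
    """Combine two equally sized reference blocks sample by sample.

    Assumption: an equal-weight average with rounding. Each input block
    would come from its own motion compensation pass, which is why this
    mode costs roughly two fetches (and interpolations) per block.
    """
    if len(ref_a) != len(ref_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(ref_a, ref_b)
    ):
        raise ValueError("reference blocks must have the same shape")
    # (a + b + 1) >> 1 is the usual integer average that rounds ties up.
    return [
        [(a + b + 1) >> 1 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(ref_a, ref_b)
    ]


# Two hypothetical 2 x 2 reference blocks.
block_a = [[100, 102], [104, 106]]
block_b = [[110, 111], [100, 107]]
predicted = multi_hypothesis_block(block_a, block_b)
print(predicted)
```

The combination itself is cheap; the decoding cost comes from producing `ref_a` and `ref_b`, each of which may require sub-pel interpolation.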
In a B-frame, the basic concept of the forward prediction and skip modes is shared with the P-frame. In addition, backward and bidirectional (also called symmetrical) prediction modes are allowed in a B-frame. The backward mode predicts the current block from future reference frames. The bidirectional mode, on the other hand, refers to both past and future reference frames and makes two blocks, one which is suitable enough to predict the current block.
IVC also increases motion accuracy by adopting an interpolation filtering technique that generates half-pel and quarter-pel samples. The interpolation filter in IVC differs from those of other recent video codecs in that its tap size varies with the video resolution. For the luma component the filter has 4, 6 or 10 taps, whereas a 4-tap filter is used for chroma. The larger the filter tap size, the heavier the burden on decoder complexity.
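The relation between tap size and decoding cost can be sketched as follows. This is an illustration under stated assumptions, not the IVC filter: the kernel used is the well-known 6-tap half-pel filter of AVC/H.264, standing in for IVC's coefficients (which are specified in the FDIS [14]), and the names `half_pel_row` and `macs_per_block` are hypothetical.

```python
# Stand-in kernel: the AVC/H.264 6-tap half-pel filter, NOT IVC's coefficients.
H264_6TAP = [1, -5, 20, 20, -5, 1]


def half_pel_row(samples, taps):
    """Interpolate the half-pel position between each pair of neighbours.

    `taps` is an FIR kernel with an even number of coefficients. Row edges
    are handled by repeating the border sample (an assumption).
    """
    n = len(taps)
    if n == 0 or n % 2:
        raise ValueError("expected a non-empty kernel with an even tap count")
    norm = sum(taps)
    half = n // 2
    last = len(samples) - 1
    out = []
    for i in range(last):
        acc = 0
        for k, coeff in enumerate(taps):
            # Tap k reads integer sample i - half + 1 + k, clamped to the row.
            j = min(max(i - half + 1 + k, 0), last)
            acc += coeff * samples[j]
        # Round to nearest, then clip to the 8-bit sample range.
        out.append(min(max((acc + norm // 2) // norm, 0), 255))
    return out


def macs_per_block(width, height, n_taps):
    """Rough multiply-accumulate count for a separable 2-D filter pass.

    Counts one horizontal and one vertical pass and ignores the extra
    border rows that a real implementation must also filter.
    """
    return 2 * width * height * n_taps


print(half_pel_row([0, 32, 64, 96], H264_6TAP))
for n_taps in (4, 6, 10):
    print(n_taps, macs_per_block(16, 16, n_taps))
```

Under this rough count, a 10-tap filter needs 2.5 times the multiply-accumulate operations of a 4-tap filter for the same block, which is the effect described above.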
The concept of intra-prediction (predicting pixels from information in the same frame) was already present in MPEG-2 [13]; in IVC, however, an intra-predicted block can also exist in P- or B-frames, an approach that has been widely used in recent video codecs. Instead of having only a DC mode as in MPEG-2, intra-prediction in IVC has several additional modes depending on the MB partition and on the color component. For the luma component, there are one DC and four directional modes (i.e., horizontal, vertical, down left (↙) and down right (↘)), based on the availability of the upper and/or left neighbor samples. These five modes are supported in 16 x 16, 8 x 8 and 4 x 4 MB partitions.
For the chroma components, on the other hand, there are four modes in total (i.e., DC, horizontal, vertical and plane), supported in an 8 x 8 MB partition only. The DC, horizontal and vertical modes for chroma intra-prediction operate in the same way as those for luma, but the plane mode is different. The plane mode takes neighbor samples from both directions (upper and left samples) and applies summation, shift and clipping operations to them. The plane mode might place a burden on decoder complexity because of those operations, whereas the other directional modes can directly assign neighbor pixels to the target pixels without them.
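The three simple modes can be sketched as below to show why they are cheap compared with the plane mode. The sketch follows the conventional definitions of DC, horizontal and vertical prediction and assumes that both neighbor sets are available; the exact IVC rules, including the handling of unavailable neighbors and the plane-mode formula, are those of the FDIS [14]. The function name `intra_predict` is hypothetical.

```python
def intra_predict(mode, top, left):
    """Predict an N x N block from the row above and the column to the left.

    `top` and `left` each hold N reconstructed neighbour samples.
    """
    n = len(top)
    if n == 0 or len(left) != n:
        raise ValueError("top and left must be non-empty and of equal length")
    if mode == "vertical":
        # Every row is a copy of the samples above the block.
        return [list(top) for _ in range(n)]
    if mode == "horizontal":
        # Every row is filled with its left neighbour.
        return [[left[y]] * n for y in range(n)]
    if mode == "dc":
        # Rounded mean of all 2N neighbours, used for the whole block.
        dc = (sum(top) + sum(left) + n) // (2 * n)
        return [[dc] * n for _ in range(n)]
    raise ValueError(f"unsupported mode: {mode}")


top = [10, 20, 30, 40]
left = [50, 60, 70, 80]
for mode in ("vertical", "horizontal", "dc"):
    print(mode, intra_predict(mode, top, left))
```

The vertical and horizontal modes only copy samples, and the DC mode needs one sum per block, whereas a plane-style predictor evaluates an expression with a shift and a clip for every target pixel.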
An integer discrete cosine transform is used with a quadtree-based variable kernel size [16]; the sizes supported in IVC are 16 x 16, 8 x 8 and 4 x 4. Unless a block is encoded in skip mode, an inverse transform is performed on the premise that the block has been quantized. In the current IVC test model (ITM), a butterfly structure supporting a 1-D 8-point forward transform is used, and the irrational numbers in this structure are approximated by rational numbers.
The order of the transform and quantization processes at the decoder is as follows. The input values are scanned in a zigzag order and the scanned values are inverse transformed. Afterwards, the inverse-transformed values are dequantized according to a given quantization parameter (QP) value. The dequantization table and the associated shift table are described in the FDIS of IVC [14].
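The scanning step can be illustrated with a short sketch. It generates the conventional zigzag order (the one familiar from JPEG and MPEG-2) for an N x N block and places a one-dimensional coefficient list back into the block; whether IVC uses exactly this order for each kernel size is an assumption here, and the normative scan is given in the FDIS [14]. The function names are hypothetical.

```python
def zigzag_order(n):
    """Return the (row, col) visiting order of a conventional zigzag scan."""
    order = []
    for s in range(2 * n - 1):  # s = row + col identifies one anti-diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        # Even diagonals run from bottom-left to top-right, odd ones back.
        for r in (reversed(rows) if s % 2 == 0 else rows):
            order.append((r, s - r))
    return order


def inverse_scan(coeffs, n):
    """Place a 1-D list of scanned values back into an n x n block."""
    block = [[0] * n for _ in range(n)]
    for value, (r, c) in zip(coeffs, zigzag_order(n)):
        block[r][c] = value
    return block


for row in inverse_scan(list(range(16)), 4):
    print(row)
```

Because low-frequency coefficients come first in this order, a decoder can stop filling the block at the last nonzero coefficient, which is one place where coefficient-aware optimizations can save time.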
For entropy coding, IVC uses logarithmic-domain arithmetic coding, which takes the following steps: 1) initialization of the context model, 2) binarization if the syntax element is non-binary, and 3) binary arithmetic decoding of the bin string (including context model selection if necessary). The arithmetic entropy coder in IVC is a logarithmic binary arithmetic coder (LBAC), which avoids multiplication operations and look-up tables. By using LBAC, the decoder can avoid the redundant memory costs and path delays incurred in context adaptive binary arithmetic coding [17].
Within the decoding loop of IVC, a filter that conditionally filters the boundaries between blocks can be applied, except at image boundaries and slice boundaries (the basic concept can be found in an expired patent [18]). This loop filter, called the deblocking filter, comes in three types (weak, normal and strong loop filtering), selected according to conditions that judge how much compensation is needed for subjective visual quality. In brief, weak loop filtering filters only two pixels per horizontal or vertical boundary line, normal loop filtering filters four pixels and strong filtering filters six pixels. The stronger the filtering, the more decoding time is needed. The detailed information on how to filter pixels is described in [8], and the associated parameters such as threshold values are presented in [19].
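The pixel counts above translate directly into the amount of work per boundary line, as the following small sketch shows. It assumes that the filtered pixels are split evenly between the two sides of the block edge, which the text does not state; the actual filter equations are in [8]. The names `FILTERED_PIXELS` and `touched_positions` are hypothetical.

```python
# Pixels modified per boundary line for each filter type, as described above.
FILTERED_PIXELS = {"weak": 2, "normal": 4, "strong": 6}


def touched_positions(strength, boundary):
    """Sample indices modified on one line crossing a block edge.

    `boundary` is the index of the first sample after the edge.
    Assumption: the same number of pixels is filtered on each side.
    """
    half = FILTERED_PIXELS[strength] // 2
    return list(range(boundary - half, boundary + half))


for strength in FILTERED_PIXELS:
    print(strength, touched_positions(strength, 8))
```

For an edge between samples 7 and 8, the weak filter touches samples 7 and 8 only, while the strong filter touches samples 5 to 10, three times as many writes per line.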
III. COMPLEXITY ANALYSIS OF INTERNET VIDEO CODING DECODER
In this section, the complexity of the IVC decoder is analyzed using a profiling tool. For the analysis, IVC bitstream files are generated by the IVC test model (ITM) 14.0 and then decoded by the IVC decoder. The test material (i.e., video sequences) and the test environment used to decode the bitstreams are presented in the following subsection. In addition, specific coding conditions are described, including the parameters used in the encoding process to generate the IVC bitstream files. To analyze the complexity of the IVC decoder, a well-known profiling tool, the Intel VTune performance analyzer [20], is used in this paper. Finally, the results of the analysis are described according to a classification of the major coding tools, so that the tools that are critical in terms of complexity can be identified.
We chose four test sequences from the recommended video sequences specified in the IVC exploration experiment document [21]. The detailed information on each video sequence is shown in Table 1, including the number of frames to be encoded. All the sequences were tested under both constraint set 1 (CS1) and constraint set 2 (CS2) conditions. CS1 and CS2 are similar to the random access and low-delay coding structures, respectively, which are commonly used configurations in recent video codecs. To evaluate the time complexity, the following development environment was employed: quad-core CPUs running at 2.40 GHz, 8 GB random-access memory (RAM) and a 64-bit Windows operating system (OS). Each bitstream file was decoded in a single thread, and no parallelization techniques were used during decoding.
Table 1. Test sequences.

Sequence name | Resolution | Total frame number | FPS |
---|---|---|---|
Kimono | 1920x1080 | 240 | 24 |
ParkScene | 1920x1080 | 240 | 24 |
BasketballDrill | 832x480 | 500 | 50 |
PartyScene | 832x480 | 500 | 50 |
It is important to describe the encoding conditions specifically, as the characteristics of bitstream files, including decoding complexity, can vary with encoding conditions such as the quantization parameter. Table 2 shows the general encoding parameters for CS1 and Table 3 shows the sequence-specific encoding parameters for CS1. Similarly, Table 4 shows the general encoding parameters for CS2 and Table 5 shows the sequence-specific encoding parameters for CS2. In general, the ITM encoder description [19] describes some of the encoding conditions and parameters, but a few parameters, such as QP, are set differently in this paper. Those parameters are set to fit the given target range of bitstream size, which was agreed by MPEG experts for conducting the visual assessment of Type-1 codecs [22].
Table 3. Sequence-specific encoding parameters for CS1.

Sequence | Intra Period | QP First Frame | Number B Frames |
---|---|---|---|
Kimono | 8 (24)* | 24 | 2 |
ParkScene | 6 (24)* | 27 | 3 |
BasketballDrill | 13 (52)* | 32 | 3 |
PartyScene | 13 (52)* | 35 | 3 |

Table 5. Sequence-specific encoding parameters for CS2.

Sequence | QP First Frame (QP for I-frame) |
---|---|
Kimono | 23 |
ParkScene | 23 |
BasketballDrill | 29 |
PartyScene | 33 |
We measured the time consumed by each function using the performance analyzer and classified the functions used in decoding into six categories: motion compensation (MC), entropy decoding (ED), intra-prediction (IP), loop filtering (LF), inverse transform/quantization (T/Q) and others. This classification is common in research on the decoding complexity of recent video codecs, including the analyses of HEVC [10] and AVC/H.264 [9]. Under the CS1 condition, Fig. 3 shows the share of decoding time taken by each of the six categories for the two video resolutions, 1920 x 1080 and 832 x 480. The most time-consuming category is MC. This trend has also been seen in other recent video codecs [9-10] because of the highly complex interpolation filtering. There are three reasons why MC consumes most of the decoding time. First, all the motion vectors in a B-frame are derived by multiplying by the distances between frames. Thus, a motion vector can point to a half-pel or quarter-pel position depending not only on the motion vector difference (MVD) value but also on the distance. Second, the multiple-hypotheses prediction mode in P-frames must use interpolation filtering, as this mode takes the average value of two motion vectors. Finally, because the filter tap size adapts to the video resolution, the share of MC can increase at low video resolution. If the frame height is less than 720, the interpolation filter has 10 taps, which is larger than the filter tap size of HEVC. Note that IVC uses the same filter tap size for the half-pel and quarter-pel interpolation processes.
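The aggregation from per-function timings to category shares can be sketched as follows. The function names, the mapping and the timings are hypothetical placeholders, not measurements from this paper and not identifiers from the ITM source; only the six category labels come from the text.

```python
from collections import defaultdict

# Hypothetical mapping from decoder function names to the categories above.
CATEGORY_OF = {
    "interpolate_luma": "MC",
    "decode_bins": "ED",
    "intra_pred": "IP",
    "deblock_edge": "LF",
    "inverse_dct": "T/Q",
}


def category_shares(function_times, category_of):
    """Sum per-function times by category and return percentages.

    Functions missing from `category_of` are counted as "Others".
    """
    totals = defaultdict(float)
    for name, seconds in function_times.items():
        totals[category_of.get(name, "Others")] += seconds
    grand_total = sum(totals.values())
    if grand_total <= 0:
        raise ValueError("no decoding time recorded")
    return {cat: 100.0 * t / grand_total for cat, t in totals.items()}


# Placeholder timings in seconds, as a profiler export might list them.
sample_times = {
    "interpolate_luma": 4.0,
    "decode_bins": 2.0,
    "inverse_dct": 2.0,
    "deblock_edge": 1.0,
    "intra_pred": 0.5,
    "memcpy": 0.5,
}
shares = category_shares(sample_times, CATEGORY_OF)
for category, percent in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{category}: {percent:.1f}%")
```

Keeping the mapping separate from the timings makes it easy to reuse the same classification for every sequence and for both constraint sets.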
Under the CS2 condition, Fig. 4 shows the share of decoding time taken by each of the six categories for the two video resolutions, 1920 x 1080 and 832 x 480. MC is still the most time-consuming category under CS2. One difference from the CS1 results is that the percentage of MC is lower under CS2. One possible explanation is that there are no B-frames in CS2. The other noticeable difference from CS1 is that the percentage of LF is slightly higher. Since CS2 has a special P-frame type, called the non-reference P-frame, which is usually encoded with a much higher QP value than other frames, we conjecture that those frames tend to need deblocking filtering to compensate for coding errors.
As shown in Fig. 3 and Fig. 4, MC was the most time-consuming category. Thus, we believe that interpolation filtering should be the main target for reducing IVC decoding complexity. In addition, inverse transform/quantization and loop filtering should be targeted as well. Possible solutions include decoder-side optimization, such as a software-based coefficient-aware fast algorithm [23], and hardware-based acceleration. A different approach is an encoder-side filtering restriction: an encoder may choose not to use deblocking filtering and/or interpolation filtering, although bitrate and frame quality may be compromised. A similar approach exists in the restriction method for the adaptive loop filter (ALF) that was tried in HEVC [24].
IV. COMPARISON RESULTS OF TIME COMPLEXITY
To compare the time complexity of IVC decoding with that of other codecs, we selected AVC/H.264 as an anchor, since it has been widely used in many video applications such as video streaming. Specifically, two profiles of AVC/H.264 were chosen: High Profile (HP), which shows the best coding efficiency among all the AVC profiles, and constrained Baseline Profile (cBP), which is one of the targets of the IVC project. Since decoding complexity can vary with the encoding configuration, we generated the bitstreams of the codecs according to encoding conditions agreed by MPEG experts [22]. Table 6 describes the test materials, including the frames per second (FPS). To satisfy the rate points as closely as possible, the video codecs used in this paper were allowed to increase the QP by one after a certain frame number during encoding. With this allowance, all bitstream files satisfied the rate points in Table 6 within the range of -3% to +3%. To evaluate the decoding time, the following development environment was employed: quad-core CPUs running at 4.00 GHz, more than 16 GB random-access memory (RAM) and a 64-bit Windows operating system (OS).
The decoding time results of IVC and AVC/H.264 (cBP and HP) are shown in Table 7 and Table 8. Here, the sequence names are abbreviated as SXX (where XX is a two-digit number denoting each sequence) and the target rate points as RX (where X is a one-digit number denoting each rate point). The notation DTm stands for the decoding time of codec m. On average, IVC decoded more slowly than both AVC cBP and AVC HP. Under CS1, IVC was 3.65 times slower than AVC cBP and 2.84 times slower than AVC HP, on average. Note that, as IVC uses the smallest tap size for interpolation at high resolution, the percentage difference between the decoding times of IVC and AVC cBP could be up to 194% under CS1. For the 832 x 480 bitstreams, however, IVC had a much longer decoding time than the AVC codec, with a time difference of almost 400%. Under CS2, IVC showed results similar to those under CS1. On average, IVC was 3.13 times slower than AVC cBP and 2.9 times slower than AVC HP, as shown in Table 8. Table 8 also shows that the difference in decoding time between IVC and the others could be small at high resolution, whereas the difference could be large at low resolution. In conclusion, IVC showed higher decoding complexity than the two profiles of AVC/H.264, and this complexity should be reduced significantly for real-time video decoding applications. In particular, in the low-resolution case, optimization should focus on the interpolation filtering process to substantially decrease the overall decoding complexity.
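The two kinds of figures quoted above, "times slower" and percentage difference, can be related with a small sketch. The definitions below are assumptions, since the formulas are not spelled out in the text: the slowdown factor is taken as the ratio of decoding times and the percentage difference as the relative excess over the anchor. The timings are placeholders, not values from Table 7 or Table 8.

```python
def slowdown_factor(dt_ivc, dt_anchor):
    """Ratio of IVC decoding time to the anchor decoding time (assumed definition)."""
    if dt_anchor <= 0:
        raise ValueError("anchor decoding time must be positive")
    return dt_ivc / dt_anchor


def percent_difference(dt_ivc, dt_anchor):
    """Relative excess of IVC decoding time over the anchor, in percent (assumed definition)."""
    return 100.0 * (slowdown_factor(dt_ivc, dt_anchor) - 1.0)


# Placeholder decoding times in seconds for one sequence and rate point.
dt_ivc, dt_avc_cbp = 7.3, 2.0
print(f"{slowdown_factor(dt_ivc, dt_avc_cbp):.2f}x")
print(f"{percent_difference(dt_ivc, dt_avc_cbp):.0f}%")
```

Under these assumed definitions, a percentage difference of 400% corresponds to a decoding time five times that of the anchor, so the two kinds of figures should not be compared directly.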
V. CONCLUSION
In this paper, we briefly presented IVC coding techniques, focusing on computational time complexity. The relative importance of each coding tool in terms of decoding time was investigated using profiling software, and the experimental results showed that the motion compensation and transform/quantization processes consume most of the decoding time. In particular, one IVC-specific coding tool (i.e., resolution-adaptive interpolation filtering) has a critical impact at low video resolution because of its large filter tap size, which should be addressed to reduce the decoding complexity. In addition to the complexity analysis of IVC itself, we compared its decoding time with those of AVC/H.264 cBP and HP, two widely used profiles. As demonstrated in the experiments, the decoding complexity of IVC should be significantly reduced for real-time video decoding applications. Possible solutions for reducing the decoding complexity of an IVC bitstream include 1) parallelization techniques for the motion compensation and transform/quantization processes, 2) decoding complexity-aware RD optimization during encoding and 3) hardware-based decoder acceleration.