A FRAMEWORK FOR VIDEO QUALITY ASSESSMENT

K, Manasa and Channappayya, Sumohana (2016) A FRAMEWORK FOR VIDEO QUALITY ASSESSMENT. PhD thesis, Indian Institute of Technology Hyderabad.

EE12P1002.pdf - Submitted Version (24MB)
Restricted to Repository staff only until August 2024.

Abstract

The explosive growth of video content over the past decade has led to an urgent need to effectively manage this content [1]. This includes better compression, storage and transport of video data so as to best utilize available storage and communication resources. To achieve this goal, it is important to note that in a majority of cases, the ultimate consumer of the video content is a human subject. Therefore, video processing algorithms must be optimized with respect to the quality of human experience. This, however, is not a trivial task, for two primary reasons. First and foremost, subjective experiments are time consuming and expensive. Secondly, the volume of content is simply too overwhelming for subjective evaluation. These drawbacks call for objective quality measures that correlate well with subjective scores. Objective quality assessment algorithms have been a subject of significant research interest over the past decade and are still being actively investigated, both for image and video quality assessment. Quality assessment algorithms are broadly classified into full-reference (FR), reduced-reference (RR), and no-reference (NR) techniques depending on the reference signal being fully available, partially available and not available for evaluation, respectively. Mean squared error (MSE) is perhaps the most widely used full-reference quality metric. While MSE and the related peak signal to noise ratio (PSNR) are popularly used for quality assessment, they do not necessarily correlate well with subjective judgment [2, 3]. Quantitative evidence to this effect (for natural image data) has been provided by Sheikh et al. [4] by evaluating the correlation of MSE (PSNR) with mean opinion scores (MOS) of human subjects. The same has been shown to be true for video data as well [5].
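For readers unfamiliar with the two baseline metrics named above, the following is a minimal sketch of MSE and PSNR for a pair of grayscale images. It is an illustration and not code from the thesis: the function names, the list-of-rows image representation, and the default peak value of 255 (8-bit images) are assumptions made here.

```python
import math


def mse(reference, distorted):
    """Mean squared error between two equally sized grayscale images.

    Images are given as lists of rows of pixel intensities
    (a hypothetical representation chosen for this sketch).
    """
    if len(reference) != len(distorted) or any(
        len(r) != len(d) for r, d in zip(reference, distorted)
    ):
        raise ValueError("images must have the same shape")
    total = 0.0
    count = 0
    for ref_row, dis_row in zip(reference, distorted):
        for r, d in zip(ref_row, dis_row):
            total += (r - d) ** 2
            count += 1
    if count == 0:
        raise ValueError("images must not be empty")
    return total / count


def psnr(reference, distorted, peak=255.0):
    """Peak signal to noise ratio in dB; peak=255 assumes 8-bit images."""
    err = mse(reference, distorted)
    if err == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(peak ** 2 / err)
```

Both quantities depend only on pixel-wise differences, which is one way to see why they need not track perceived quality: two distortions with the same MSE can look very different to a human observer.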
The poor correlation of MSE and PSNR with subjective scores has long been recognized by the image processing community, but they continued to be used due to their ease of implementation and the lack of better algorithms. The invention of the structural similarity (SSIM) index [6] gave a significant fillip to quality assessment (QA) algorithms research, both for image and video data and in all the classes of QA algorithms (FR, RR, and NR). The algorithms developed since the advent of the SSIM index have far surpassed the methods developed in the previous three decades. The SSIM index and its variants continue to be among the state-of-the-art in full-reference image quality assessment (FR IQA). The premise of the SSIM index is that natural images are highly structured (i.e., have high local correlation) and distortions affect this correlation. The SSIM index quantifies the loss in correlation due to distortion and uses it as a measure of perceptual quality. Other significant FR IQA algorithms include information theoretic approaches invented by Sheikh et al. [7] and the recent feature similarity index [8]. The information theoretic approach uses a data communication setting to solve the IQA problem. Image perception is considered to be a noisy communication process and the quality of the image is hypothesized to be the mutual information between the reference and distorted image sources. This technique uses a Gaussian Scale Mixture (GSM) model to represent image statistics. The feature similarity index measures changes in local phase coherency of a distorted image relative to the source in addition to measuring the change in local gradient information. NR IQA algorithms have also seen a significant improvement in performance over the past half a decade or so. The philosophy behind several popular NR approaches is to learn a relationship between the features of a training set of images and corresponding subjective scores.
This learned relationship is then used to estimate the subjective score of a test image from its features. The key to the success of NR IQA algorithms is the strength of the features. Saad et al. [9] provide a discrete cosine transform (DCT) domain solution where the DCT coefficients of the image are the feature points. Moorthy et al. [10] used the statistics of wavelet coefficients as feature points instead. Mittal et al. [11, 12] significantly reduced the complexity of the algorithms by operating directly in the pixel domain and used the parameters of a generalized Gaussian density model as feature points. All these methods significantly outperform MSE and PSNR and are competitive with state-of-the-art FR IQA algorithms. Video quality assessment (VQA) algorithms have lagged IQA algorithms primarily due to the significant complexity introduced by the temporal axis. This is true for all classes of VQA algorithms: FR, NR, and RR. A recent algorithm in the FR class has taken significant strides forward. The motion-based video integrity evaluator (MOVIE) index [5] is an FR VQA algorithm that uses the optical flow plane to weight the outputs of Gabor filters applied to the reference and distorted inputs. The central idea is that distortions cause the optical flow plane of the distorted video to move away from the reference video's optical flow plane. The weights are designed such that filters close to the reference video's optical flow plane are given excitatory weights and those away from it are given inhibitory weights. The MOVIE index is shown to perform significantly better than previous FR IQA methods. The proposed framework for video quality assessment is motivated by the responses of the neurons in area 18 of the visual cortex, which are shown to be almost separable in the spatial and temporal fields [13].
This motivated us to propose a three-stage approach, where the spatial and temporal features are computed individually and later pooled to obtain a single quality score for the entire video. The temporal field involves exploiting the motion based properties of the human visual system (HVS). The significance of motion in a video is understood from the fact that most of the objects in the visual world are static [14]. Hence, visual attention is concentrated on the trajectory of the moving objects and distortion in this trajectory causes annoyance. The motion information is measured by computing optical flow, which measures the instantaneous motion of the image intensities. Optical flow gives the finest representation of motion, down to the pixel level. Additionally, the HVS based motivation to use optical flow to measure motion lies in the premise that the medial superior temporal (MST) area of the brain functions by computing optical flow to perceive motion [15]. The FR VQA algorithm is based on the premise that local optical flow statistics are affected by distortions and that the deviation from pristine flow statistics is proportional to the amount of distortion. The local flow statistics are characterized using the mean, the standard deviation, the coefficient of variation (CV), and the minimum eigenvalue (λ_min) of the local flow patches. Temporal distortion is estimated as the change in the CV of the distorted flow with respect to the reference flow, and the correlation between the λ_min of the reference and of the distorted patches. Spatial quality estimation is done with the robust Multi-scale Structural SIMilarity (MS-SSIM) index. The temporal and spatial distortions thus computed are then pooled using a perceptually motivated heuristic to generate a spatio-temporal quality score. The proposed method is competitive with the state-of-the-art when evaluated on the LIVE SD database, the EPFL Polimi SD database, and the LIVE Mobile HD database.
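The local flow statistics named above (mean, standard deviation, CV, and minimum eigenvalue of a flow patch) can be sketched as follows. This is only an illustration under stated assumptions, not the thesis implementation: the abstract does not say which quantities the statistics are computed on, so this sketch assumes the mean, standard deviation and CV are taken over the flow magnitudes, and that the minimum eigenvalue is that of the 2x2 covariance matrix of the horizontal and vertical flow components. The function name, the list-of-(u, v) patch representation, and the small `eps` guard are hypothetical.

```python
import math


def patch_flow_stats(flow_patch, eps=1e-12):
    """Statistics of one optical flow patch, given as a list of (u, v) vectors.

    Assumptions of this sketch (not confirmed by the abstract):
    - mean / std / CV are computed on the flow magnitude;
    - min_eig is the smaller eigenvalue of the covariance of (u, v).
    """
    n = len(flow_patch)
    if n == 0:
        raise ValueError("flow patch must not be empty")

    # First and second order statistics of the flow magnitude.
    mags = [math.hypot(u, v) for u, v in flow_patch]
    mean_mag = sum(mags) / n
    std_mag = math.sqrt(sum((m - mean_mag) ** 2 for m in mags) / n)
    cv = std_mag / (mean_mag + eps)  # eps avoids division by zero for static patches

    # 2x2 covariance [[a, b], [b, c]] of the horizontal and vertical components.
    mean_u = sum(u for u, _ in flow_patch) / n
    mean_v = sum(v for _, v in flow_patch) / n
    a = sum((u - mean_u) ** 2 for u, _ in flow_patch) / n
    c = sum((v - mean_v) ** 2 for _, v in flow_patch) / n
    b = sum((u - mean_u) * (v - mean_v) for u, v in flow_patch) / n

    # Closed-form smaller eigenvalue of a symmetric 2x2 matrix,
    # clamped at zero to absorb floating-point round-off.
    min_eig = max(0.0, (a + c) / 2.0 - math.sqrt(((a - c) / 2.0) ** 2 + b * b))

    return {"mean": mean_mag, "std": std_mag, "cv": cv, "min_eig": min_eig}
```

Under these assumptions, a per-patch temporal distortion cue could be formed by comparing the `cv` values of co-located reference and distorted patches; the exact comparison and the pooling heuristic used in the thesis are not given in the abstract.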
The distortions considered in these databases include those due to compression, packet loss, wireless channel errors, and rate-adaptation. The algorithm is flexible enough to allow for any robust FR spatial distortion metric for spatial distortion estimation. Additionally, the proposed method is not only parameter-free but also independent of the choice of the optical flow algorithm. Finally, it was observed that the replacement of the optical flow vectors in the proposed method with the much coarser block motion vectors also results in an acceptable FR VQA algorithm. The algorithm is called the FLOw SIMilarity (FLOSIM) index. The FR VQA algorithm was extended to an NR VQA framework and the features were modified to represent a no-reference setting. It is based on the hypothesis that distortions affect flow statistics both locally and globally. To capture the effects of distortion on optical flow, the irregularities are measured at the patch level and at the frame level. At the patch level, intra- and inter-patch irregularities are measured with the flow magnitude's variance and mean. Also, the correlation in the patch level flow randomness between successive frames is measured. At the frame level, the normalized mean flow magnitude difference between successive frames is measured. The robust NIQE [12] algorithm is used for no-reference spatial quality assessment of the frames. These temporal and spatial features are averaged over all the frames to arrive at a video level feature vector. The video level features and the corresponding DMOS scores are used to train a support vector machine for regression (SVR). This machine is used to estimate the quality score of a test video. The competence of the proposed method is clearly demonstrated on SD and HD video databases that include common distortion types such as compression artifacts, packet loss artifacts, additive noise, and blur.
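As an illustration of the frame level temporal feature and the averaging over frames described above, here is a minimal sketch. The abstract does not define the normalization, so the form used here, the absolute difference of mean flow magnitudes divided by their sum, is an assumption of this sketch, as are the function names and the representation of a flow frame as a flat list of (u, v) vectors. The patch level features, NIQE, and the SVR training stage are not shown.

```python
import math


def mean_flow_magnitude(flow_frame):
    """Mean magnitude of a flow frame given as a flat list of (u, v) vectors."""
    if not flow_frame:
        raise ValueError("flow frame must not be empty")
    return sum(math.hypot(u, v) for u, v in flow_frame) / len(flow_frame)


def frame_level_feature(prev_frame, curr_frame, eps=1e-12):
    """Normalized mean flow magnitude difference between successive frames.

    The normalization by the sum of the two means is a hypothetical choice
    made for this sketch; it keeps the feature in the range [0, 1].
    """
    m_prev = mean_flow_magnitude(prev_frame)
    m_curr = mean_flow_magnitude(curr_frame)
    return abs(m_curr - m_prev) / (m_curr + m_prev + eps)


def video_level_feature(flow_frames):
    """Average the frame level feature over all successive frame pairs."""
    if len(flow_frames) < 2:
        raise ValueError("need at least two flow frames")
    feats = [
        frame_level_feature(prev, curr)
        for prev, curr in zip(flow_frames, flow_frames[1:])
    ]
    return sum(feats) / len(feats)
```

In a full pipeline, a value like this would be one entry of the video level feature vector that is paired with a DMOS score for regressor training.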
Additionally, a more sophisticated NR VQA algorithm was proposed by modeling the joint statistics of the horizontal and vertical flow components in a flow frame by a bivariate Gaussian mixture model (GMM). It is shown that the statistics of the GMM parameters are very effective not only in capturing salient distortions in a video but also in localizing distortions spatially and temporally. This localization allows for the generation of a spatio-temporal distortion map. While the GMM parameters represent frame-level flow statistics, local statistics are measured using the mean, variance and minimum eigenvalues of non-overlapping flow patches obtained from the previously proposed NR VQA metric. Finally, the model parameters and the local statistics are pooled to form features that are employed for supervised learning. The performance of the algorithm is competitive even when made opinion unaware by replacing the subjective scores with an equivalent objective metric during the training stage. The algorithm is flexible enough to allow for any robust NR spatial distortion metric as one of the components of the feature vector. Additionally, the proposed method is not only parameter-free but also independent of the choice of the optical flow algorithm. In summary, we have proposed a framework for FR and NR video quality assessment as part of this thesis. The highlights of the contributions are:
- simplicity: first and second order optical flow statistics are employed for temporal quality assessment while off-the-shelf image quality assessment algorithms are used for spatial quality assessment
- robustness: the framework is independent of the choice of the optical flow algorithm
- parameter free: the framework is free of trained or predefined parameters
- resolution independent: works consistently across video resolutions
- distortion localization: able to localize distortions in space and time
- state-of-the-art performance

IITH Creators: Channappayya, Sumohana (ORCiD: unspecified)
Item Type: Thesis (PhD)
Subjects: Electrical Engineering
Divisions: Department of Electrical Engineering
Depositing User: Team Library
Date Deposited: 30 Aug 2019 03:51
Last Modified: 30 Aug 2019 03:52
URI: http://raiithold.iith.ac.in/id/eprint/6085