Objective measurement of video quality is performed using mathematical models and algorithms that measure the noise introduced and the structural similarity of video objects.
Several mathematical models, such as PSNR (Peak Signal-to-Noise Ratio) and
SSIM (Structural Similarity), are traditionally used for these calculations.
The complexity resides in the fact that a mathematical difference from one
pixel to another, or from one frame to another, does not necessarily translate
into an equal difference as perceived by the human eye.
PSNR is a measure that has medium to low accuracy but is quite economical to
compute. It represents possibly up to 10% of the CPU effort necessary to
perform a transcoding operation. This means that although it provides a result
that is not fully accurate, the model can be used to perform the calculations
as the file is being optimized. A vendor can use PSNR as a basis to provide a
Mean Opinion Score (MOS) on the quality of a video file.
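As a rough illustration of why PSNR is cheap to compute, here is a minimal sketch in plain Python. It assumes 8-bit luminance frames stored as lists of rows (hence max_value=255); the function names mse and psnr are chosen for this example and do not come from any vendor's product. It does not show how a vendor would map PSNR to a MOS, since that mapping is proprietary.

```python
import math

def mse(reference, distorted):
    """Mean squared error between two frames given as lists of pixel rows."""
    if len(reference) != len(distorted):
        raise ValueError("frames must have the same height")
    total = 0
    count = 0
    for ref_row, dist_row in zip(reference, distorted):
        if len(ref_row) != len(dist_row):
            raise ValueError("frames must have the same width")
        for r, d in zip(ref_row, dist_row):
            total += (r - d) ** 2
            count += 1
    if count == 0:
        raise ValueError("frames must not be empty")
    return total / count

def psnr(reference, distorted, max_value=255):
    """PSNR in dB; max_value=255 assumes 8-bit samples (an assumption here)."""
    error = mse(reference, distorted)
    if error == 0:
        return math.inf  # identical frames: no noise was introduced
    return 10 * math.log10(max_value ** 2 / error)

# Tiny 2x2 example: two pixels differ by 2, so the MSE is 2.
reference = [[52, 55], [61, 59]]
distorted = [[54, 55], [61, 57]]
print(round(psnr(reference, distorted), 2), "dB")
```

The whole computation is one pass over the pixels with a subtraction, a multiplication and an addition per sample, which is consistent with the low CPU cost described above.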
Video quality of experience measurement can be performed with full
reference (FR), reduced reference (RR) or no reference (NR).
Full reference
Full reference video measurement means that every pixel of a distorted
video is compared to the original video. It implies that both the original and
the optimized video have the same number of frames, are encoded in the same
format, with the same aspect ratio, etc. It is utterly impractical in most
cases and requires enormous CPU capacity to process, in many cases more than
what is necessary for the actual transcoding / optimization.
Here is an example of a full reference video quality measurement
method under evaluation and being submitted to ITU-T.
As a full reference approach, the model compares the input or
high-quality reference video and the associated degraded video sequence under test.
Score estimation is based on the following steps:
1) First, the video sequences are
preprocessed. In particular, noise is removed by filtering the frames and the
frames are subsampled.
2) A temporal frame alignment between
reference and processed video sequence is performed.
3) A spatial frame alignment between
processed video frame and the corresponding reference video frame is performed.
4) Local spatial quality features are
computed: a local similarity and a local difference measure, inspired by visual
perception.
5) An analysis of the distribution of the
local similarity and difference features is performed.
6) A global spatial degradation is
measured using a Blockiness feature.
7) A global temporal degradation is
measured using a Jerkiness feature.
The jerkiness measure is computed by evaluating local and global motion intensity
and frame display time.
8) The quality score is estimated based on
a non-linear aggregation of the above features.
9) To avoid misprediction in case of
relatively large spatial misalignment between reference and processed video
sequence, the above steps are computed for three different horizontal and
vertical spatial alignments of the video sequence, and the maximum predicted
score among all spatial positions is the final estimated quality score.
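Step 9 can be sketched as follows. This is only an illustration under stated assumptions: the real model's score is a non-linear aggregation of several features, whereas this sketch substitutes a plain PSNR over the overlapping region as a stand-in score. The function names, the shift of one pixel and the frame layout (lists of rows) are all hypothetical.

```python
import math

def overlap_score(reference, processed, dx, dy, max_value=255):
    """Stand-in score: PSNR over the region where the shifted frames overlap.

    processed[y][x] is compared with reference[y + dy][x + dx].
    """
    height, width = len(reference), len(reference[0])
    total, count = 0, 0
    for y in range(height):
        for x in range(width):
            ry, rx = y + dy, x + dx
            if 0 <= ry < height and 0 <= rx < width:
                total += (reference[ry][rx] - processed[y][x]) ** 2
                count += 1
    if count == 0:
        raise ValueError("shift leaves no overlapping pixels")
    error = total / count
    return math.inf if error == 0 else 10 * math.log10(max_value ** 2 / error)

def best_alignment_score(reference, processed, shift=1):
    """Score three horizontal x three vertical alignments and keep the maximum."""
    candidates = {}
    for dy in (-shift, 0, shift):
        for dx in (-shift, 0, shift):
            candidates[(dx, dy)] = overlap_score(reference, processed, dx, dy)
    best_offset = max(candidates, key=candidates.get)
    return candidates[best_offset], best_offset

# A processed frame that is the reference shifted left by one pixel.
reference = [[10 * y + x for x in range(4)] for y in range(4)]
processed = [[reference[y][x + 1] if x + 1 < 4 else 0 for x in range(4)]
             for y in range(4)]
print(best_alignment_score(reference, processed))
```

Taking the maximum means a video that is merely shifted is not penalized as if it were heavily degraded, which is the misprediction the step is meant to avoid.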
Reduced reference
Reduced reference video measurement performs the same evaluation
as in the full reference model but only on a subset of the media. It is not
widely used because frames need to be synchronized and recognized before
evaluation.
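To make the "subset of the media" idea concrete, here is a minimal sketch that evaluates only every fifth frame. It assumes the two sequences have already been synchronized frame for frame, which is precisely the hard part mentioned above, and it uses mean squared error as a simple stand-in for the evaluation. Frames are flat lists of pixel values; the names and the step of 5 are hypothetical.

```python
def frame_mse(reference_frame, processed_frame):
    """Mean squared error between two frames given as flat pixel lists."""
    if len(reference_frame) != len(processed_frame) or not reference_frame:
        raise ValueError("frames must be non-empty and the same size")
    squared = ((r - p) ** 2 for r, p in zip(reference_frame, processed_frame))
    return sum(squared) / len(reference_frame)

def sampled_mse(reference_frames, processed_frames, step=5):
    """Average per-frame MSE over every `step`-th frame only."""
    if len(reference_frames) != len(processed_frames):
        raise ValueError("sequences must be synchronized to the same length")
    indices = list(range(0, len(reference_frames), step))
    if not indices:
        raise ValueError("sequences must contain at least one frame")
    errors = [frame_mse(reference_frames[i], processed_frames[i]) for i in indices]
    return sum(errors) / len(errors), indices

# Ten 4-pixel frames; frame 5 is slightly degraded, frame 3 heavily degraded.
reference_frames = [[i] * 4 for i in range(10)]
processed_frames = [list(frame) for frame in reference_frames]
processed_frames[5] = [value + 2 for value in processed_frames[5]]
processed_frames[3] = [value + 100 for value in processed_frames[3]]
print(sampled_mse(reference_frames, processed_frames))
```

Note that the heavy degradation in frame 3 goes unnoticed because that frame is not sampled, which illustrates the accuracy that is traded away for the lower processing cost.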
No reference
No reference video measurement is the most popular method in video
optimization and is usually used when the encoding method is known. The method
relies on the tracking of artefacts in the video, such as blockiness,
jerkiness, blurring, ringing, etc., to derive a score.
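As an illustration of artefact tracking without a reference, the sketch below estimates blockiness from the degraded frame alone. It assumes the encoder works on 8-pixel-wide blocks (this is where knowing the encoding method helps) and only looks at vertical block edges. It is a simplified, hypothetical indicator and not any vendor's actual model.

```python
def blockiness(frame, block_size=8):
    """Ratio of the average luminance jump across vertical block edges
    to the average jump between neighbouring pixels inside blocks.

    `frame` is a list of pixel rows. Values well above 1 suggest visible
    block edges; values near 1 suggest none.
    """
    edge_total = edge_count = 0
    inner_total = inner_count = 0
    for row in frame:
        for x in range(1, len(row)):
            jump = abs(row[x] - row[x - 1])
            if x % block_size == 0:
                edge_total += jump   # pixel pair straddles a block boundary
                edge_count += 1
            else:
                inner_total += jump  # pixel pair lies inside one block
                inner_count += 1
    if edge_count == 0 or inner_count == 0:
        raise ValueError("frame is too narrow for the chosen block size")
    edge_mean = edge_total / edge_count
    inner_mean = inner_total / inner_count
    # Small constant avoids division by zero on perfectly flat block interiors.
    return edge_mean / (inner_mean + 1e-6)

# Two flat 8-pixel blocks side by side versus a smooth gradient.
blocky = [[50 * (x // 8) for x in range(16)] for _ in range(4)]
smooth = [[2 * x for x in range(16)] for _ in range(4)]
print(blockiness(blocky) > blockiness(smooth))
```

A complete no reference model would combine several such indicators (jerkiness, blurring, ringing) into a single score; how they are weighted is where the proprietary work lies.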
Most vendors will create a MOS from a proprietary no reference
video measurement derived from mathematical models. The good vendors constantly
update the mathematical model with comparative subjective measurements to
ensure that the objective MOS tracks the subjective testing as closely as
possible. You can find out who is performing which type of measurement and their method in my report, here.