Learning Representations for Image and Video Understanding

Das, B. K. and Mohan, C. K. (2019) Learning Representations for Image and Video Understanding. Master's thesis, Indian Institute of Technology Hyderabad.

Abstract

Data representation is at the core of every machine learning algorithm, whose performance depends largely on the features, or representations, of the input to which it is applied. Consequently, deploying a machine learning model requires considerable time spent designing data preprocessing pipelines and transformations that represent the data efficiently. Such feature engineering is costly yet essential, and it highlights a key shortcoming of machine learning algorithms: their lack of ability to extract abstract information from the input data. Feature engineering leverages human ingenuity and prior knowledge to compensate for this shortcoming. To make machine learning models easily deployable and application-ready, it is therefore highly desirable to reduce the dependence of learning algorithms on engineered features, so that new algorithms can be constructed much faster. This thesis proposes a novel approach that represents a video as a graph for action recognition and localization using only class-label information. In addition, it proposes a novel subspace attention mechanism that learns to capture long-range interdependencies in visual data. This attention mechanism is implemented as a block that can be incorporated into any backbone convolutional neural network.

