EFFICIENT APPROACHES FOR HUMAN ACTION RECOGNITION USING DEEP LEARNING

Ijjina, E P and C, Krishna Mohan (2017) EFFICIENT APPROACHES FOR HUMAN ACTION RECOGNITION USING DEEP LEARNING. PhD thesis, Indian Institute of Technology Hyderabad.

Thesis_Phd_CS_5213.pdf - Submitted Version
Restricted to Repository staff only until May 2020.


Abstract

The objective of this research work is to address some of the issues affecting vision-based human action recognition. The subject's gait characteristics, appearance, execution speed, the capturing conditions, and the modality of the observations all affect the spatial and temporal visual information captured in an observation used for action recognition. To address these distortions in the subject's visual information, deep learning approaches are used to learn discriminative features for human action recognition.

The thesis begins with an action recognition approach that is unaffected by visual factors because visual markers are used to obtain accurate motion information. The motion characteristics of human actions are considered in computing a new motion capture (MOCAP) action representation, which is in turn used by a stacked autoencoder for action recognition. This representation is computed from the skeletal information in a MOCAP observation after it is normalized by the corresponding subject's T-pose (i.e., reference pose). This work addresses the inconsistency in speed and limb motion across observations through a subject-independent representation of actions and the tolerance of the stacked autoencoder to noise and distortions in the input representation.

As obtaining accurate motion information by tracking visual markers may not be feasible in many scenarios, the next approach considers motion information captured by a depth camera to recognize the fall action. A new temporal template capturing the subject's pose over a given period of time is used as input to a convolutional neural network (CNN) for the detection of the fall action. The illumination invariance of depth information and the ability of the CNN to learn the local patterns associated with each action minimize the impact of the subject's gait characteristics and execution speed on the overall performance.

Since the existing temporal templates cannot assign higher significance to motion in the beginning and middle frames of an observation, we propose new motion history images emphasizing motion in these temporal regions. The convolutional neural network (ConvNet) features extracted from these motion history images, computed from depth and RGB video streams, are used to recognize human actions. By considering multi-modal features, the illumination invariance of depth and the precise subject pose information from RGB video are utilized for action recognition. Finally, evidence across classifiers using different temporal templates is combined for efficient recognition of human actions irrespective of the location of their key poses in the temporal regions.

Even though the earlier approaches run in real time and perform well, the sensitivity of temporal templates to the angle of view limits their application to observations captured in an unconstrained environment. We therefore propose a view-independent approach to action recognition for videos captured by a regular digital camera using a convolutional neural network. The action bank representations of videos, which contain similar local patterns for videos of the same action, are given as input to a CNN to learn the patterns associated with each class. Since the initial weights of a CNN affect its performance after training with the back-propagation algorithm (BPA), we combine evidence across multiple CNN classifiers to minimize the impact of the solution being stuck in a local minimum.
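The temporal templates mentioned above belong to the motion-history-image family. As a rough illustrative sketch only (not the thesis's exact formulation), the code below computes a pixel-wise template from a stack of depth or grayscale frames; with monotonically increasing weights it reduces to the classical motion history image, while other weight profiles can emphasize motion in the beginning or middle of an observation, which is the kind of emphasis the proposed templates aim for. The function name, motion threshold, and weight profile are illustrative assumptions.

```python
import numpy as np

def weighted_motion_template(frames, weights=None, diff_thresh=30.0):
    """Pixel-wise temporal template for a short video clip.

    frames  : array of shape (T, H, W) holding depth or grayscale frames.
    weights : per-frame weights in [0, 1]; each pixel keeps the weight of
              the last frame in which it moved. None -> classical MHI
              profile (recent motion appears brightest).
    """
    frames = np.asarray(frames, dtype=np.float32)
    T = len(frames)
    if weights is None:
        weights = np.linspace(1.0 / T, 1.0, T)   # classical MHI weighting
    template = np.zeros(frames.shape[1:], dtype=np.float32)
    for t in range(1, T):
        # Mark pixels whose intensity/depth changed noticeably since the
        # previous frame, and stamp them with this frame's weight.
        moved = np.abs(frames[t] - frames[t - 1]) > diff_thresh
        template = np.where(moved, weights[t], template)
    return template  # (H, W) image, usable as CNN input after resizing
```

A weight profile peaking near T/2, for instance, yields a template that highlights mid-observation motion instead of motion near the end of the clip.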
We consider the outputs of the binary-coded classifier as the evidence associated with each prediction, thereby assigning high confidence (≈ 1) to accurate predictions and low confidence (≈ 0) to incorrect predictions. As a result, combining evidence across classifiers favours predictions with high confidence, thereby improving the overall performance.

Since the effectiveness of the above technique depends on the complementary information obtained from the implicit diversity of the CNN classifiers, we propose an approach that uses genetic algorithms (GA) to initialize the weights so that the convolutional neural network is better optimized after training with the back-propagation algorithm. The convolution masks of the CNN architecture form the GA string, whose fitness is the accuracy of the CNN classifier after training with the back-propagation algorithm for a fixed number of epochs. Thus, a CNN training algorithm that combines the global and local search capabilities of the GA and the back-propagation algorithm, respectively, is proposed to identify initial weights that achieve better performance. Near-ideal performance is achieved when evidence across the classifiers (of the candidate solutions) is combined using fusion rules for action recognition, owing to the high mean and low standard deviation of the accuracies of these CNN classifiers compared with random weight initialization.

In summary, this thesis proposes new methods for human action recognition that use domain-specific action representations as input to deep learning models. A MOCAP action representation generated from the characteristics of the recognized actions is used by a stacked autoencoder to recognize human actions. A new temporal template of depth video, capturing the subject's pose over a given time period, is used by a CNN to detect the fall event and recognize human actions from the local patterns associated with each action. Convolutional neural network (ConvNet) features extracted from RGB and depth temporal templates emphasizing motion in the beginning and middle frames of video observations are used for human action recognition. Finally, a view-independent action recognition model using action bank features is optimized by (a) increasing the complementary information across multiple CNN classifiers through unique weight initialization and (b) combining the global and local search capabilities of the GA and the back-propagation algorithm, respectively, to identify initial weights that achieve better performance.
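The combination of evidence described above is score-level fusion. As a minimal sketch, assuming each classifier produces per-class scores in [0, 1] (the exact fusion rules and membership functions used in the thesis are not reproduced here), the following combines several classifiers' score matrices with common sum, product, or max rules; all names are illustrative.

```python
import numpy as np

def fuse_classifier_scores(score_matrices, rule="sum"):
    """Score-level fusion of several classifiers.

    score_matrices : list of (num_samples, num_classes) arrays, one per
                     classifier, with class-membership scores in [0, 1].
    rule           : 'sum' (average), 'product', or 'max'.
    Returns the fused per-class scores and the predicted class indices.
    """
    scores = np.stack(score_matrices)          # (num_classifiers, N, C)
    if rule == "sum":
        fused = scores.mean(axis=0)            # average (sum) rule
    elif rule == "product":
        fused = scores.prod(axis=0)            # product rule
    elif rule == "max":
        fused = scores.max(axis=0)             # max rule
    else:
        raise ValueError(f"unknown fusion rule: {rule}")
    return fused, fused.argmax(axis=1)

# Example with two hypothetical classifiers and three classes:
# a = np.array([[0.9, 0.1, 0.0]]); b = np.array([[0.2, 0.7, 0.1]])
# fuse_classifier_scores([a, b], rule="sum") predicts class 0.
```

Because a confident, correct classifier contributes a score near 1 for the true class and near 0 elsewhere, the fused score tends to be dominated by the more confident classifiers, which is the effect the abstract describes.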

IITH Creators: C, Krishna Mohan (ORCiD: UNSPECIFIED)
Item Type: Thesis (PhD)
Uncontrolled Keywords: Human action recognition, motion capture (MOCAP) information, RGBD video, gesture recognition, multi-modal action recognition, deep learning, stacked autoencoder, convolutional neural networks, extreme learning machines, genetic algorithms, fuzzy membership functions
Subjects: Computer science
Divisions: Department of Computer Science & Engineering
Depositing User: Team Library
Date Deposited: 17 May 2019 06:01
Last Modified: 17 May 2019 06:01
URI: http://raiithold.iith.ac.in/id/eprint/5213
