AAP-MIT: Attentive Atrous Pyramid Network and Memory Incorporated Transformer for Multisentence Video Description

Prudviraj, Jeripothula and Reddy, Malipatel Indrakaran and Vishnu, Chalavadi and Mohan, C Krishna (2022) AAP-MIT: Attentive Atrous Pyramid Network and Memory Incorporated Transformer for Multisentence Video Description. IEEE Transactions on Image Processing, 31. pp. 5559-5569. ISSN 1057-7149

Abstract

Generating multi-sentence descriptions for video is considered to be the most complex task in computer vision and natural language understanding due to the intricate nature of video-text data. With recent advances in deep learning, multi-sentence video description has made impressive progress. However, learning rich temporal context representations of visual sequences and modelling long-term dependencies in natural language descriptions remain challenging problems. To address these challenges, we propose an Attentive Atrous Pyramid network and Memory Incorporated Transformer (AAP-MIT) for multi-sentence video description. The proposed AAP-MIT builds an effective representation of the visual scene by distilling the most informative and discriminative spatio-temporal features of video data at multiple granularities, and then generates highly summarized descriptions. Specifically, we construct AAP-MIT with three major components: i) a temporal pyramid network, which builds a temporal feature hierarchy at multiple scales by convolving local features along the temporal dimension, ii) a temporal correlation attention, which learns the relations among the various temporal video segments, and iii) a memory incorporated transformer, which augments the language transformer with a new memory block to generate highly descriptive natural language sentences. Finally, extensive experiments on the ActivityNet Captions and YouCookII datasets demonstrate the substantial superiority of AAP-MIT over existing approaches.
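
The abstract names its components without giving implementation details, so the following is only a minimal illustrative sketch of two of the ideas it mentions: a multi-scale temporal hierarchy built with atrous (dilated) convolutions, and attention across temporal positions. It is not the authors' model. The function names, the fixed averaging kernel, the dilation rates (1, 2, 4), and the projection-free attention are all assumptions made for illustration; a real network would learn its weights.

```python
import math

def dilated_temporal_conv(features, kernel, dilation):
    """1-D atrous (dilated) convolution along the time axis with zero padding.

    features: list of T feature vectors (each a list of D floats)
    kernel:   list of K scalar taps shared across all D channels (K odd)
    """
    T, D, K = len(features), len(features[0]), len(kernel)
    half = K // 2
    out = []
    for t in range(T):
        acc = [0.0] * D
        for k, w in enumerate(kernel):
            src = t + (k - half) * dilation  # taps are `dilation` steps apart
            if 0 <= src < T:                 # out-of-range taps act as zeros
                for d in range(D):
                    acc[d] += w * features[src][d]
        out.append(acc)
    return out

def temporal_pyramid(features, dilations=(1, 2, 4)):
    """Return one feature sequence per dilation rate (hypothetical rates)."""
    # Assumption: fixed averaging taps stand in for learned weights.
    kernel = [1 / 3, 1 / 3, 1 / 3]
    return [dilated_temporal_conv(features, kernel, r) for r in dilations]

def temporal_self_attention(features):
    """Scaled dot-product self-attention over time steps, without learned projections."""
    D = len(features[0])
    out = []
    for q in features:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in features]
        m = max(scores)                       # subtract max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append([sum(w * v[d] for w, v in zip(weights, features)) for d in range(D)])
    return out

# Toy input: 8 time steps with 2-dimensional features.
video = [[float(t), 1.0] for t in range(8)]
levels = temporal_pyramid(video)
attended = [temporal_self_attention(level) for level in levels]
```

Each pyramid level keeps the original sequence length, but a larger dilation rate lets every position aggregate features from a wider temporal window, which is the general motivation for atrous convolution in multi-scale temporal modelling.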

IITH Creators: Mohan, C Krishna (ORCiD: https://orcid.org/0000-0002-7316-0836)
Item Type: Article
Uncontrolled Keywords: Multi-sentence video description, dense video captioning, atrous pyramid network, temporal correlation attention, transformers
Subjects: Computer science
Divisions: Department of Computer Science & Engineering
Depositing User: . LibTrainee 2021
Date Deposited: 08 Sep 2022 08:54
Last Modified: 08 Sep 2022 08:54
URI: http://raiithold.iith.ac.in/id/eprint/10480
Publisher URL: http://doi.org/10.1109/TIP.2022.3195643
OA policy: https://v2.sherpa.ac.uk/id/publication/3474