Detecting offensive speech in conversational code-mixed dialogue on social media: A contextual dataset and benchmark experiments

Satapara, Shrey (2023) Detecting offensive speech in conversational code-mixed dialogue on social media: A contextual dataset and benchmark experiments. Expert Systems with Applications, 215. p. 119342. ISSN 0957-4174

Abstract

The spread of Hate Speech on online platforms is a severe problem for societies and requires platforms to identify offensive content. Prior research has modeled Hate Speech recognition as a text classification problem that predicts the class of a message from its text alone. However, context plays a major role in communication. For short messages in particular, the text of the preceding tweets can completely change how a message is interpreted within a discourse. This work extends previous efforts to classify Hate Speech by considering the current and previous tweets jointly, and introduces a clearly defined way of extracting context. We present the first dataset for conversation-based Hate Speech classification in code-mixed Hindi (the ICHCL dataset), together with an approach for collecting context from long conversations. Our benchmark experiments show that including context can improve classification performance over a baseline. Furthermore, we develop a novel pipeline for processing the context. The best-performing pipeline uses a fine-tuned SentBERT paired with an LSTM classifier and achieves a macro F1 score of 0.892 on the ICHCL test dataset. Another pipeline, based on KNN, SentBERT, and ABC weighting, yields a macro F1 of 0.807, the best result among traditional classifiers. Thus, even a KNN model paired with an optimized BERT gives better results than a vanilla BERT model.
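
The two ideas the abstract relies on, classifying a tweet together with its preceding tweets and scoring with macro F1, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names `build_context_input` and `macro_f1`, the `[SEP]` separator, the `max_context` parameter, and the label strings are all hypothetical choices made for illustration only.

```python
from typing import List, Optional, Sequence


def build_context_input(thread: Sequence[str],
                        max_context: Optional[int] = None,
                        sep: str = " [SEP] ") -> str:
    """Join the preceding messages of a conversation with the target message.

    `thread` is ordered from the oldest message to the message being
    classified (the last element). `max_context` optionally keeps only the
    most recent preceding messages. The separator is an assumption, not
    the format used in the paper.
    """
    if not thread:
        raise ValueError("thread must contain at least the target message")
    context: List[str] = list(thread[:-1])
    if max_context is not None:
        # Guard against max_context == 0, since list[-0:] returns the whole list.
        context = context[-max_context:] if max_context > 0 else []
    return sep.join(context + [thread[-1]])


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Placeholder conversation and labels, not taken from the ICHCL dataset.
    thread = ["original tweet", "first reply", "reply being classified"]
    print(build_context_input(thread, max_context=1))
    gold = ["offensive", "offensive", "not_offensive", "not_offensive"]
    pred = ["offensive", "not_offensive", "not_offensive", "not_offensive"]
    print(round(macro_f1(gold, pred), 3))
```

Macro averaging gives each class equal weight regardless of how many examples it has, which matters when offensive messages are a minority of the data.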

Item Type: Article
Uncontrolled Keywords: Benchmark; Conversational Analysis; Evaluation; Hate Speech; Natural Language Processing; Transformer; Benchmarking; Character recognition; Classification (of information); Codes (symbols); Long short-term memory; Natural language processing systems; Pipeline processing systems; Social networking (online); Speech recognition; Statistical tests; Text processing; Benchmark experiments; Language processing; Natural languages; Social media; Pipelines
Subjects: Artificial Intelligence
Divisions: Department of Artificial Intelligence
Depositing User: Mr Nigam Prasad Bisoyi
Date Deposited: 12 Sep 2023 05:48
Last Modified: 12 Sep 2023 05:48
URI: http://raiithold.iith.ac.in/id/eprint/11666
Publisher URL: https://doi.org/10.1016/j.eswa.2022.119342
OA policy: https://v2.sherpa.ac.uk/id/publication/4628