Use of Natural Language Processing Techniques for Detection of Cyberbullying Tweets

Project associated with:

Lancaster University

Context

Social media platforms such as Twitter have made cyberbullying easier, in part because users can remain anonymous. Cyberbullying takes several forms, targeting characteristics such as age, gender, ethnicity, and religion. This project aims to identify and flag cyberbullying tweets. It explores natural language processing (NLP) and machine learning techniques that classify a tweet as cyberbullying or not and, for cyberbullying tweets, determine the specific type. Automated detection of this kind can help social media platforms act against accounts that engage in cyberbullying.

Requirements

Prediction Models: Develop models to predict whether a given tweet is an instance of cyberbullying.

Category Classification: Classify cyberbullying tweets into specific categories (age, gender, ethnicity, religion).

High Accuracy: Achieve at least 90% accuracy in tweet classification.

Scalability: Process and analyze large volumes of tweet data.

Preprocessing & Feature Extraction: Handle text data preprocessing and feature extraction.

Algorithm Comparison: Compare the performance of different machine learning algorithms.

Implementation: Implement both traditional ML and neural network approaches.

Approach

Data Collection:

Dataset: Obtain a dataset of 47,000+ labeled tweets from Kaggle. Balance the dataset to 8,000 samples per class.
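
The balancing step can be sketched as follows. This is a minimal illustration that assumes balancing means randomly downsampling every class to the same size; the source does not say how the 8,000 samples per class were obtained. The column names `tweet_text` and `cyberbullying_type` and the helper `balance_classes` are hypothetical.

```python
import pandas as pd


def balance_classes(df: pd.DataFrame, label_col: str,
                    n_per_class: int, seed: int = 42) -> pd.DataFrame:
    """Randomly downsample every class to exactly n_per_class rows.

    Assumes each class has at least n_per_class rows; pandas raises
    a ValueError otherwise.
    """
    return (
        df.groupby(label_col)
        .sample(n=n_per_class, random_state=seed)
        .reset_index(drop=True)
    )


# Toy frame standing in for the Kaggle data (column names are assumptions).
toy = pd.DataFrame({
    "tweet_text": [f"tweet {i}" for i in range(10)],
    "cyberbullying_type": ["age"] * 6 + ["gender"] * 4,
})

# The project would use n_per_class=8000; 3 keeps the toy example small.
balanced = balance_classes(toy, "cyberbullying_type", n_per_class=3)
print(balanced["cyberbullying_type"].value_counts().to_dict())
```

Fixing the random seed keeps the sampled subset reproducible between runs.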

Data Preprocessing:

Text Cleaning: Clean text by removing links, special characters, punctuation, etc. Remove duplicate tweets.
Label Encoding: Encode class labels.
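
A minimal sketch of these preprocessing steps using only the standard library. The exact cleaning rules are assumptions: lowercasing is added here for convenience, and the helper names `clean_tweet`, `clean_and_dedupe`, and `encode_labels` are hypothetical.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Lowercase, drop links, then drop punctuation and special characters."""
    text = text.lower()                # assumption: case is not informative
    text = URL_RE.sub(" ", text)       # remove links first, while still intact
    text = NON_ALNUM_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


def clean_and_dedupe(pairs):
    """Clean (text, label) pairs, dropping empty and duplicate cleaned texts."""
    seen = set()
    result = []
    for text, label in pairs:
        cleaned = clean_tweet(text)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append((cleaned, label))
    return result


def encode_labels(labels):
    """Map class names to integers in alphabetical order."""
    mapping = {name: idx for idx, name in enumerate(sorted(set(labels)))}
    return [mapping[name] for name in labels], mapping


raw = [
    ("Check THIS out!!! https://t.co/abc #wow", "religion"),
    ("check this out wow", "religion"),  # duplicate once cleaned
    ("You're too old for this...", "age"),
]
cleaned_pairs = clean_and_dedupe(raw)
encoded, mapping = encode_labels([label for _, label in cleaned_pairs])
print(cleaned_pairs, encoded, mapping)
```

Deduplicating after cleaning, rather than before, also catches tweets that differ only in links or punctuation.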

Feature Extraction:

Tokenization: Tokenize text using the TF-IDF vectorizer and the Keras tokenizer.
Matrix Creation: Create document-term matrices.
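
As an illustration of the TF-IDF part of this step, the sketch below builds a small document-term matrix with scikit-learn's `TfidfVectorizer` using its default settings. The example sentences are invented, and the project's actual vectorizer parameters are not given in the source.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented, already-cleaned example texts.
docs = [
    "you are so dumb",
    "have a great day",
    "so dumb and so mean",
]

vectorizer = TfidfVectorizer()        # default settings; an assumption
dtm = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

print(dtm.shape)                      # (documents, vocabulary terms)
print(sorted(vectorizer.vocabulary_))
```

With the default token pattern, single-character tokens such as "a" are dropped from the vocabulary, and each row is L2-normalized.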

Model Development:

Classical ML Algorithms: Implement Logistic Regression, Random Forest, Gradient Boosting, LightGBM, XGBoost, Naive Bayes, and SVM.
Neural Network Model: Develop a model using an LSTM architecture.
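
One way to set up the classical models for comparison is to wrap each classifier in the same TF-IDF pipeline, as sketched below for three of the listed algorithms. The toy texts, the hyperparameters, and the use of `LinearSVC` for the SVM are all assumptions, not details from the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy data; the real project trains on the labeled tweets.
texts = [
    "old people are so slow", "grandpa should retire already",
    "too old to understand phones", "boomers ruin everything",
    "girls cannot play football", "women belong in the kitchen",
    "he throws like a girl", "boys are better at math",
]
labels = ["age"] * 4 + ["gender"] * 4

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB(),
    "linear_svm": LinearSVC(),  # assumption: linear kernel
}

fitted = {}
for name, classifier in candidates.items():
    # Identical feature extraction for every model keeps the comparison fair.
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    pipeline.fit(texts, labels)
    fitted[name] = pipeline

for name, pipeline in fitted.items():
    print(name, pipeline.predict(["old people are slow"])[0])
```

Keeping the vectorizer inside the pipeline means it is fitted only on the training data passed to `fit`, which avoids leaking test vocabulary into the features.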

Model Evaluation:

Data Splitting: Split data into 80% training and 20% testing sets.
Metrics: Evaluate models on accuracy, precision, recall, and F1-score. Analyze confusion matrices.
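
The split and the metrics can be sketched with scikit-learn as follows. The stratified split, the random seed, macro averaging, and the hand-written labels are assumptions made for illustration; the source does not say how the multi-class metrics were averaged.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)
from sklearn.model_selection import train_test_split

# 80/20 split on placeholder data; stratifying keeps class ratios equal.
samples = [f"tweet {i}" for i in range(20)]
targets = [0] * 10 + [1] * 10
train_x, test_x, train_y, test_y = train_test_split(
    samples, targets, test_size=0.2, stratify=targets, random_state=42
)
print(len(train_x), len(test_x))

# Metrics on hand-written labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro"
)
matrix = confusion_matrix(y_true, y_pred)  # rows: true, columns: predicted

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
print(matrix)
```

Off-diagonal cells of the confusion matrix show which categories are mistaken for each other, which a single accuracy figure hides.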

Optimization:

Hyperparameter Tuning: Perform hyperparameter tuning using grid search.
Model Optimization: Optimize the top-performing models (Random Forest, LightGBM).
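
Grid search over a Random Forest might look like the sketch below, which uses scikit-learn's `GridSearchCV`. The parameter grid, the scoring metric, the fold count, and the synthetic data are illustrative assumptions; the grid actually searched in the project is not given in the source.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic features standing in for the TF-IDF matrix.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# Hypothetical grid: 2 x 2 = 4 candidate settings.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation per candidate
    scoring="f1_macro",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because every combination is cross-validated, the cost grows with the product of the grid sizes, so a small grid is a sensible starting point.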

Technologies Used

Programming Language: Python.
Libraries & Frameworks:
NLP & Data Manipulation: NLTK, Pandas, NumPy.
Machine Learning: Scikit-learn, LightGBM, XGBoost.
Deep Learning: Keras/TensorFlow.
Visualization: Matplotlib.

Prathamesh Kulkarni