Multiclass Text Classification
NLP benchmark classifying Q&A text into 10 topic categories
An NLP project that classifies question-and-answer text into 10 balanced topic categories (Business & Finance, Computers & Internet, Education & Reference, Entertainment & Music, and six more). It benchmarks classical machine learning against deep learning approaches on more than 93,000 training samples.
Key Features:
- Comprehensive text preprocessing pipeline with stop-word removal and tokenization
- Multiple feature representations: TF-IDF, Word2Vec, and GloVe embeddings
- Systematic comparison across model families: Logistic Regression, Dense Networks, and RNN/GRU/LSTM (including bidirectional) architectures
- Hyperparameter tuning with early stopping
- Best result: Logistic Regression on TF-IDF features, with a Macro F1 of 0.6479
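
The best-performing combination listed above (TF-IDF features fed to Logistic Regression, scored with Macro F1) can be sketched with scikit-learn as follows. This is a minimal illustration, not the project's actual code: the toy texts, the three-label subset, and the settings (`stop_words="english"`, `max_iter=1000`) are assumptions chosen so the example runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the real Q&A dataset.
train_texts = [
    "how do I invest in stocks and bonds",
    "what is a good interest rate for a bank loan",
    "best way to save money on taxes",
    "my laptop will not connect to the wifi network",
    "how to install software on a windows computer",
    "why is my internet browser so slow",
    "what is the capital city of australia",
    "how do I cite a book in an essay",
    "where can I find a dictionary of synonyms",
]
train_labels = (
    ["Business & Finance"] * 3
    + ["Computers & Internet"] * 3
    + ["Education & Reference"] * 3
)
test_texts = [
    "should I put my money in stocks or a bank",
    "my computer cannot find the wifi",
    "how to cite a dictionary in an essay",
]
test_labels = ["Business & Finance", "Computers & Internet", "Education & Reference"]

# TF-IDF vectorization (with English stop-word removal) followed by a
# multiclass Logistic Regression classifier.
model = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(train_texts, train_labels)

predictions = model.predict(test_texts)

# Macro F1 is the unweighted mean of the per-class F1 scores, so every
# category counts equally regardless of how many samples it has.
macro_f1 = f1_score(test_labels, predictions, average="macro")
print(f"Macro F1: {macro_f1:.4f}")
```

Macro averaging suits a balanced 10-class setup like this one because no single category dominates the score. The value printed for this toy data says nothing about the 0.6479 reported for the full dataset.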
Tech Stack: Python · scikit-learn · TensorFlow/Keras · NLTK · Gensim · pandas · NumPy · Matplotlib · Seaborn