Multiclass Text Classification

NLP benchmark classifying Q&A text into 10 topic categories

An NLP project that classifies question-and-answer text into 10 balanced topic categories (Business & Finance, Computers & Internet, Education & Reference, Entertainment & Music, and six more), benchmarking classical machine learning against deep learning approaches on a dataset of over 93,000 training samples.

Key Features:

  • Comprehensive text preprocessing pipeline with stop-word removal and tokenization
  • Multiple feature representations: TF-IDF, Word2Vec, and GloVe embeddings
  • Systematic comparison across model families: Logistic Regression, Dense Networks, and RNN/GRU/LSTM (including bidirectional) architectures
  • Hyperparameter tuning with early stopping; the best result came from Logistic Regression on TF-IDF features, with a Macro F1 of 0.6479
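
The best-performing combination named above (TF-IDF features fed to Logistic Regression, scored with Macro F1) can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the toy sentences, the three-label subset, scikit-learn's built-in English stop-word list (standing in for the NLTK-based preprocessing), and the `max_iter` value are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Toy stand-in data, invented for illustration (not the project's dataset).
train_texts = [
    "how do I invest in stocks and bonds",
    "which bank offers the best savings interest rates",
    "my laptop will not connect to the wifi network",
    "how to install python on a windows computer",
    "who sang the most popular song this year",
    "easy guitar chords for a beginner musician",
]
train_labels = [
    "Business & Finance",
    "Business & Finance",
    "Computers & Internet",
    "Computers & Internet",
    "Entertainment & Music",
    "Entertainment & Music",
]
test_texts = [
    "are bonds safer than stocks",
    "wifi keeps dropping on my computer",
    "what is the best song to learn on guitar",
]
test_labels = [
    "Business & Finance",
    "Computers & Internet",
    "Entertainment & Music",
]

# TF-IDF vectorization (lowercasing + stop-word removal) followed by a
# multiclass Logistic Regression classifier. Hyperparameters are assumed.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)
predictions = model.predict(test_texts)

# Macro F1 averages the per-class F1 scores with equal weight per class,
# which suits a dataset with balanced categories.
macro_f1 = f1_score(test_labels, predictions, average="macro", zero_division=0)
print(f"Macro F1: {macro_f1:.4f}")
```

On the real dataset the same pipeline shape would be fitted on the full training split and evaluated on a held-out split; scores from toy data like this say nothing about the reported 0.6479.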

Tech Stack: Python · scikit-learn · TensorFlow/Keras · NLTK · Gensim · pandas · NumPy · Matplotlib · Seaborn