This project applies Natural Language Processing (NLP) to sentiment analysis of movie reviews. The dataset contains 49,459 user reviews, each either positive or negative. The goal is to build an NLP model that predicts whether a given review is positive or negative.
Project Overview:
Initial Approach:
- Model: Logistic Regression;
- Accuracy: Approximately 65% (without preprocessing);
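
A baseline of this kind can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes scikit-learn, a plain bag-of-words representation via `CountVectorizer`, and a handful of made-up reviews standing in for the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy reviews standing in for the real dataset (hypothetical examples).
reviews = [
    "A wonderful film with great acting",
    "I loved this movie, it was great",
    "Great story and wonderful characters",
    "A terrible film with awful acting",
    "I hated this movie, it was awful",
    "Awful story and terrible characters",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = positive, 0 = negative

# Raw word counts fed straight into Logistic Regression, with no
# stopword removal, stemming, or other text normalization.
baseline = make_pipeline(CountVectorizer(), LogisticRegression())
baseline.fit(reviews, labels)

predictions = baseline.predict(["a wonderful movie", "a terrible movie"])
print(predictions)
```

On a real dataset the reviews would be split into training and test sets (for example with `sklearn.model_selection.train_test_split`) before measuring accuracy.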
Improved Approach:
- Preprocessing with NLTK:
  - Removal of stopwords;
  - Removal of punctuation and accents;
  - Stemming;
- TF-IDF Vectorization;
- N-grams;
- Final Accuracy: Approximately 88%.
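
The steps above could be combined roughly as shown below. This is a hedged sketch under several assumptions: the stopword list is a tiny hand-written set (NLTK's `nltk.corpus.stopwords` provides full lists once its data is downloaded), the stemmer is NLTK's English `PorterStemmer` (the project's language and stemmer are not stated here), the n-gram range of unigrams plus bigrams is illustrative, and the reviews are made-up examples. The helper names `strip_accents` and `normalize` are hypothetical.

```python
import string
import unicodedata

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative stopword set (an assumption, not NLTK's full list).
STOPWORDS = {"a", "the", "was", "and", "is", "it", "this", "i", "with"}

stemmer = PorterStemmer()
# Map every punctuation character to a space so "great,movie" still splits.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def strip_accents(text):
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(text):
    # Lowercase, strip accents and punctuation, drop stopwords, stem.
    text = strip_accents(text.lower()).translate(PUNCT_TABLE)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(stemmer.stem(tok) for tok in tokens)


cleaned = normalize("The café was AMAZING, and the acting was amazing!")
print(cleaned)

# Toy reviews standing in for the real dataset (hypothetical examples).
reviews = [
    "A wonderful film with great acting",
    "I loved this movie, it was great",
    "Great story and wonderful characters",
    "A terrible film with awful acting",
    "I hated this movie, it was awful",
    "Awful story and terrible characters",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = positive, 0 = negative

# TF-IDF over unigrams and bigrams of the normalized text.
vectorizer = TfidfVectorizer(preprocessor=normalize, ngram_range=(1, 2))
model = make_pipeline(vectorizer, LogisticRegression())
model.fit(reviews, labels)

predictions = model.predict(["a wonderful movie", "a terrible movie"])
print(predictions)
```

Passing `normalize` as the vectorizer's `preprocessor` means new reviews go through the same cleaning at prediction time, so training and inference stay consistent.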
The project covers the following learning goals:
- Learn fundamental concepts of Natural Language Processing (NLP).
- Perform automated Sentiment Analysis.
- Develop an architecture for sentiment classification.
- Create visualizations to facilitate the analysis of textual data.
- Start using NLTK, one of the main Python libraries for NLP.
- Learn best practices for NLP.
- Improve classification results by normalizing texts.
- Learn how to use TF-IDF and n-grams to improve classification.
- Understand how text normalization improves data visualization.
- Build more advanced skills with the NLTK library.
- Learn to use scikit-learn tools to optimize classification.