This project explores Natural Language Processing (NLP) using regular expressions (regex) and language models. It works with datasets of programming-related questions posted by users on forums, written in English, Portuguese, and Spanish. The goal is to train models on these datasets and build a system that can recognize the language of a given text.
Project Overview
- Handled code snippets in the questions using regex
- Unified and tokenized the texts
- Trained three Maximum Likelihood Estimator (MLE) models, one for each language
- Used these models to calculate the perplexity of a given text (lower perplexity means the text is a closer match to the model being tested)
- Identified a limitation: MLE assigns zero probability to any event not seen in the training data, which makes the perplexity of such a text infinite
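The first two steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the forum questions arrive as HTML with snippets wrapped in `<pre>` or `<code>` tags, and the names `CODE_BLOCK`, `HTML_TAG`, `NON_LETTER`, `strip_code`, `normalize`, and `tokenize` are hypothetical.

```python
import re

# Assumption: code snippets are wrapped in <pre>...</pre> or <code>...</code>.
CODE_BLOCK = re.compile(r"<pre>.*?</pre>|<code>.*?</code>", re.DOTALL | re.IGNORECASE)
# Any remaining HTML tag, such as <p> or </p>.
HTML_TAG = re.compile(r"<[^>]+>")
# Punctuation, digits, and underscores; accented letters are kept.
NON_LETTER = re.compile(r"[^\w\s]|\d|_")


def strip_code(text):
    """Remove code snippets first, then any leftover HTML tags."""
    text = CODE_BLOCK.sub(" ", text)
    return HTML_TAG.sub(" ", text)


def normalize(text):
    """Lowercase, drop punctuation and digits, collapse whitespace."""
    text = NON_LETTER.sub(" ", text.lower())
    return " ".join(text.split())


def tokenize(text):
    """Return the list of word tokens of a raw forum question."""
    return normalize(strip_code(text)).split()


question = "<p>How do I sort a list?</p><pre><code>sorted(xs)</code></pre>"
print(tokenize(question))
```

Removing the snippets before anything else keeps programming keywords, which look the same in every language, from leaking into the language models.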
Improved Approach:
- Implemented Laplace smoothing to address the zero-probability issue
- Classified each text as the language whose model gives it the lowest perplexity
Achieved high accuracy on the test data:
- Portuguese: 100%
- English: 100%
- Spanish: 97%
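The difference between plain MLE and Laplace smoothing, and the perplexity-based classification, can be sketched in a few lines. NLTK's `nltk.lm` module provides `MLE` and `Laplace` classes for this; the sketch below deliberately uses only the standard library so the arithmetic is visible. It assumes character bigrams over padded words, which may differ from the project's setup, and the class `BigramModel`, the function `detect_language`, and the toy training sentences are hypothetical, made up for illustration.

```python
import math
from collections import Counter

BOS, EOS, UNK = "<s>", "</s>", "<UNK>"


def char_bigrams(word):
    """Pad a word with boundary markers and return its character bigrams."""
    chars = [BOS] + list(word) + [EOS]
    return list(zip(chars, chars[1:]))


class BigramModel:
    """Character bigram model: k=0 is plain MLE, k=1 is Laplace (add-one)."""

    def __init__(self, words, k=0):
        self.k = k
        self.bigrams = Counter()
        self.contexts = Counter()
        self.vocab = {BOS, EOS, UNK}
        for word in words:
            self.vocab.update(word)
            for prev, cur in char_bigrams(word):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def prob(self, prev, cur):
        # Characters never seen in training are mapped to the <UNK> symbol.
        prev = prev if prev in self.vocab else UNK
        cur = cur if cur in self.vocab else UNK
        # P(cur | prev) = (count(prev, cur) + k) / (count(prev) + k * |V|)
        num = self.bigrams[(prev, cur)] + self.k
        den = self.contexts[prev] + self.k * len(self.vocab)
        return num / den if den else 0.0

    def perplexity(self, words):
        grams = [g for w in words for g in char_bigrams(w)]
        if not grams:
            raise ValueError("cannot compute perplexity of an empty text")
        log_sum = 0.0
        for prev, cur in grams:
            p = self.prob(prev, cur)
            if p == 0.0:
                return math.inf  # a single zero probability ruins the score
            log_sum += math.log2(p)
        return 2 ** (-log_sum / len(grams))


def detect_language(words, models):
    """Pick the language whose model gives the lowest perplexity."""
    return min(models, key=lambda lang: models[lang].perplexity(words))


# Toy training data, invented for this example only.
train = {
    "english": "how do i sort a list in python".split(),
    "portuguese": "como eu faço para ordenar uma lista em python".split(),
    "spanish": "cómo puedo ordenar una lista en python".split(),
}
models = {lang: BigramModel(words, k=1) for lang, words in train.items()}

mle_english = BigramModel(train["english"], k=0)
print(mle_english.perplexity(["zebra"]))  # inf: "z" never occurs in the toy training words
print(models["english"].perplexity(["zebra"]))  # finite thanks to add-one smoothing
print(detect_language("como ordenar uma lista".split(), models))
```

With `k=0` every unseen bigram has probability zero, so no comparison between languages is possible; with `k=1` every bigram keeps a small probability and the three perplexity scores can be ranked.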
The project has the following learning goals:
- Advance studies in Natural Language Processing (NLP).
- Learn how regex can assist in processing textual data.
- Understand language models and their applications.
- Create a model that automatically detects languages.
- Practice using Python libraries such as NLTK and Scikit-Learn.