Word2Vec: Interpreting Human Language with Word Embedding Models

This project focuses on using Word2Vec to interpret human language through word embeddings. It uses a dataset of titles and texts of news articles in Brazilian Portuguese, collected from various websites and classified into six categories: columns, daily life, sports, illustrated, market, and world. The challenge proposed in the project is to build a news classifier that receives the title of a news article as input and determines its category.

The word embedding models come from NILC (Interinstitutional Center for Computational Linguistics), a group of researchers from Brazilian universities such as USP and UFSCar. Pre-trained models for Brazilian Portuguese are available for download on the NILC website.

Word Embeddings Models Used:
- CBOW (Continuous Bag of Words) with 100 and 300 dimensions: predicts the target word from its context and is generally faster to train.
- SKIP-GRAM with 300 dimensions: predicts the context from the target word and performs better on rare words.

Vector Representation: The vectors of the words in each news title are summed. Summation combines the semantic information of the individual words into a single vector that captures the collective meaning of the phrase.
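The summation step can be sketched as below. The helper name `title_vector`, the toy embedding table, and the choice to skip out-of-vocabulary words are illustrative assumptions; the real pipeline would look words up in the loaded NILC model.

```python
import numpy as np

def title_vector(title, wv, dim):
    """Sum the embedding of each known word in the title.

    Words absent from the vocabulary are skipped (an assumed OOV policy),
    so an all-unknown title yields the zero vector.
    """
    vec = np.zeros(dim)
    for word in title.lower().split():
        if word in wv:
            vec += wv[word]
    return vec

# Tiny 2-dimensional embedding table standing in for a loaded NILC model.
wv = {"governo": np.array([1.0, 0.0]), "anuncia": np.array([0.0, 1.0])}
v = title_vector("Governo anuncia medidas", wv, dim=2)
print(v)  # [1. 1.]
```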

Logistic Regression Model Training: A LogisticRegression model is trained on the resulting title vectors, reaching an accuracy of 80% with the CBOW model and 81% with the SKIP-GRAM model.
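A minimal sketch of this training step, using synthetic two-cluster data in place of the real summed title vectors (the data, dimensions, and two-class setup are assumptions made so the snippet is self-contained; the project's reported 80-81% accuracy comes from the real dataset, not this toy example):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic "title vectors": two separable clusters standing in for
# summed word embeddings from two news categories.
X = np.vstack([rng.normal(0, 1, (100, 10)), rng.normal(3, 1, (100, 10))])
y = np.array([0] * 100 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit the classifier on the training vectors and score it on held-out data.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```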

The project focuses on the following:
- Learn how to represent words with One-hot encoding, including its advantages and disadvantages.
- Understand what Word2Vec is and its benefits.
- Use pre-trained Word2Vec models.
- Comprehend the impacts of biases in Word2Vec models.
- Combine word vectors to represent and classify texts.
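To make the first learning goal concrete, here is a minimal one-hot sketch (the vocabulary is an illustrative assumption). It also shows the representation's main drawbacks: the vector length grows with the vocabulary, and every pair of distinct words is equally dissimilar, which is what motivates dense embeddings like Word2Vec.

```python
import numpy as np

vocab = ["bola", "jogo", "mercado", "governo"]  # toy vocabulary
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

print(one_hot("mercado"))  # [0. 0. 1. 0.]

# Drawback: any two distinct one-hot vectors have dot product 0,
# so one-hot encoding carries no notion of semantic similarity.
print(one_hot("bola") @ one_hot("jogo"))  # 0.0
```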

Developed: October 2023

Published: July 15, 2024