Word2Vec: Word Embedding Training

This project trains Word2Vec models on a dataset of news titles and texts in Brazilian Portuguese, collected from various websites and classified into six categories: columns, daily life, sports, illustrated, market, and world.

The project is organized in two parts:
Part 1 - Training Word Embedding Models:
The first part uses the news titles to train Word Embedding models, which represent each word as a vector that captures its meaning from the contexts in which it appears. Two Word2Vec models are trained in this project:
- CBOW (Continuous Bag of Words) with 300 dimensions: predicts the target word from its surrounding context and is generally faster to train.
- SKIP-GRAM with 300 dimensions: predicts the surrounding context from the target word and tends to perform better with rare words.
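The difference between the two objectives can be illustrated with a minimal pure-Python sketch that lists the training examples each one would draw from a single tokenized title. The function names, the example title, and the window size are hypothetical, and the sketch ignores details of real Word2Vec training such as subsampling and randomly shrunk windows.

```python
def cbow_examples(tokens, window=2):
    """Return (context_words, target_word) examples, one per token position."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:  # a one-word title has no context to learn from
            examples.append((context, target))
    return examples


def skipgram_examples(tokens, window=2):
    """Return (target_word, context_word) pairs, one per neighbouring word."""
    pairs = []
    for context, target in cbow_examples(tokens, window):
        pairs.extend((target, word) for word in context)
    return pairs


# Hypothetical tokenized title, used only for illustration.
title = ["bolsa", "fecha", "em", "alta"]

cbow = cbow_examples(title, window=1)          # context -> target
skipgram = skipgram_examples(title, window=1)  # target -> each context word

for context, target in cbow:
    print("CBOW:", context, "->", target)
for target, word in skipgram:
    print("SKIP-GRAM:", target, "->", word)
```

CBOW produces one example per word position, while SKIP-GRAM produces one example per (word, neighbour) pair, which is one reason CBOW is usually the faster of the two to train.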

Part 2 - Classifying News Titles:
The second part uses the models trained in the first part to classify news titles by category. This follows the approach of the project Word2Vec: Interpreting Human Language with Word Embedding, which used Word Embedding models from NILC (Interinstitutional Center for Computational Linguistics).

Using the same method as the previous project (summing the word vectors of each title and classifying the result with a LogisticRegression model), but with the CBOW and SKIP-GRAM models trained in the first part of this project, the classification accuracy is:
- CBOW model: 79%
- SKIP-GRAM model: 79%
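The vector-summing step can be sketched as follows. This is a minimal illustration with NumPy and a toy 3-dimensional embedding table, not the project's actual code: the function name, the toy vectors, and the choice to skip out-of-vocabulary words are all assumptions.

```python
import numpy as np


def sum_title_vectors(tokens, vectors, dim):
    """Sum the embeddings of the tokens of one title into a single vector."""
    total = np.zeros(dim, dtype=np.float32)
    for token in tokens:
        # Assumption: words missing from the embedding table are skipped.
        if token in vectors:
            total += vectors[token]
    return total


# Toy 3-dimensional embeddings (hypothetical values; the project uses 300).
toy_vectors = {
    "bolsa": np.array([1.0, 0.0, 0.0], dtype=np.float32),
    "alta": np.array([0.0, 2.0, 0.0], dtype=np.float32),
    "gol": np.array([0.0, 0.0, 3.0], dtype=np.float32),
}

# Hypothetical tokenized titles; "em" and "decisivo" are out of vocabulary.
titles = [["bolsa", "em", "alta"], ["gol", "decisivo"]]

# One row per title: this matrix is what a classifier would be trained on.
features = np.vstack(
    [sum_title_vectors(title, toy_vectors, dim=3) for title in titles]
)
print(features)
```

Each row of `features` is the fixed-length representation of one title. Together with the category labels, a matrix like this is the kind of input a classifier such as scikit-learn's LogisticRegression would be fitted on.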

Compared with the NILC models used in the project Word2Vec: Interpreting Human Language with Word Embedding, both models reach almost the same accuracy. However, the models developed in this project, trained on news titles to classify news titles, are 98.3% smaller.

The project covers the following learning goals:
- Learn how to use spaCy for text data preprocessing, including its advantages and disadvantages.
- Learn to configure the hyperparameters of the Word2Vec model.
- Train your own Word2Vec model using Gensim.
- Create a text classifier using your own Word2Vec model.
- Deploy your model in a web application.

Developed: Oct 2023

Published: Jul 15, 2024