
Project 7 OpenClassrooms

Tweet classification

Mission

The client wants a prototype of an AI product capable of predicting the sentiment associated with a tweet. Since they don’t yet have customer-specific data, we will work with open-source datasets.

These datasets include not only the tweet text, but also the user ID, the timestamp of the post, and a binary label indicating sentiment (negative vs. positive).

I will need to prepare a functional prototype of the model. The model will be exposed via an API deployed in the cloud, which is called by a local interface that sends a tweet to the API and receives the sentiment prediction in return.

This project is an opportunity to explore several approaches:

  • A “basic custom model” approach, to quickly develop a traditional model (e.g., logistic regression) that can predict the sentiment of a tweet.

  • An “advanced custom model” approach, involving deep neural networks to predict tweet sentiment. This is the model I will deploy and present to the client. I plan to experiment with two different word embeddings and even try a BERT-family model.

I’ve also been asked to provide, for an internal presentation, a solid example of an MLOps-oriented workflow.

Data

Available for download here, the dataset includes 1,600,000 English-language tweets, each annotated with a sentiment score — 0 for negative, 2 for neutral, and 4 for positive — along with a unique ID, date, flag, and username.

[Figure: first rows of the dataset (head of the dataframe)]

Alas, data is rarely as clean as one would hope, and there are always a few hurdles to overcome before processing — which inevitably means making some decisions. In our specific case, for instance, we see that only 1,581,466 tweet texts are actually unique:

[Figure: summary statistics of the dataset (describe of the dataframe)]

So what does this mean for us? Caution is required: will identical texts, for example, always be labeled the same way?

The answer is no — here are a few charts to illustrate:

This also helps us notice that there may actually be only two scores present: 0 and 4. That is indeed the case.

In the end, for duplicate texts, we’ll retain only the majority label (which, based on the examples above, seems to be the correct one), and we’ll convert the labels so that 0 represents "negative" and 1 represents "positive."
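For illustration, here is a minimal pandas sketch of that preparation step. The file and column names are assumptions based on the standard Sentiment140 export, not necessarily the exact ones used in the project:

    import pandas as pd

    # Assumed Sentiment140 layout: label first, tweet text last.
    df = pd.read_csv(
        "training.1600000.processed.noemoticon.csv",
        encoding="latin-1",
        names=["target", "id", "date", "flag", "user", "text"],
    )

    # For duplicated texts, keep the majority label.
    majority = (
        df.groupby("text")["target"]
          .agg(lambda labels: labels.value_counts().idxmax())
          .reset_index()
    )

    # Map the original labels (0 = negative, 4 = positive) to 0/1.
    majority["label"] = majority["target"].map({0: 0, 4: 1})

    # Keep only what the models need: the text and its binary label.
    dataset = majority[["text", "label"]]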

We can see that our tweets remain perfectly balanced between the two sentiment categories:

[Figure: distribution of tweets across the two sentiment classes]

 

Finally, we will keep only the tweet texts and their associated labels — the rest of the data is not relevant to our task.

Modeling Approaches

Principles

When training an artificial intelligence model, there are parameters called hyperparameters.

These are settings defined outside the model itself — for example, batch size, learning rate, number of epochs, etc.

Hyperparameters are not learned by the model; they are decisions made before training, and their values can have a major impact on final performance.

Grid search is a systematic method for finding the best hyperparameters.

You define several possible values for each hyperparameter in advance, and then train and evaluate the model for every possible combination of these values.

This helps identify the configuration that yields the best results on your data.

In our case, the dataset is perfectly balanced: there are as many examples from each class.
In such a situation, accuracy (the percentage of correct predictions) is a reliable metric to compare the performance of different models.

So we’ll use accuracy to evaluate hyperparameter performance during the grid search — but we’ll also rely on other metrics, especially to compare the different model architectures we implement.

Metrics

  • Accuracy: the percentage of correct predictions across all outputs.

    Risky with imbalanced datasets, but appropriate here, precisely because our dataset is perfectly balanced.

  • F1-score: the harmonic mean of precision (true positives among predicted positives) and recall (true positives among all actual positives).

    If the model has perfect precision (never wrong when predicting “positive”) but only finds 1 out of 100 positive tweets, the F1-score will still be low.
    The F1-score is only high when both precision and recall are strong.

    This metric balances the model’s ability to catch all positive cases without generating too many false alarms.

  • ROC AUC: the probability that the model ranks a positive example higher than a negative one, across all possible thresholds.
    Generally, the model predicts a tweet as positive if it estimates the probability to be over 50% — that’s the default threshold. This metric goes beyond that fixed 50% cutoff.

    If the model consistently assigns higher scores to true positives than to negatives, the ROC AUC will be 1.0 (perfect).
    If the model is guessing randomly, the ROC AUC will be 0.5 (as bad as a coin flip).

    A very robust metric, especially useful when comparing models or selecting the optimal detection threshold.

  • Inference time: the time required for a model to generate a prediction from new input, once trained.

    It’s a good indicator of the model’s complexity and computational load.
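To keep these measures consistent across models, they can be computed with a small helper. Here is a minimal sketch built on scikit-learn, assuming a model that exposes predict_proba; the metric names are illustrative:

    import time
    from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

    def evaluate(model, X_test, y_true):
        # Probability of the "positive" class, plus the time taken to produce it.
        start = time.time()
        y_proba = model.predict_proba(X_test)[:, 1]
        inference_time = time.time() - start

        y_pred = (y_proba >= 0.5).astype(int)  # default 50% decision threshold
        return {
            "accuracy": accuracy_score(y_true, y_pred),
            "f1": f1_score(y_true, y_pred),
            "roc_auc": roc_auc_score(y_true, y_proba),
            "inference_time_s": inference_time,
        }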

MLFlow

In this project, we use MLflow via the DagsHub platform to manage and track our machine learning experiments in a professional and collaborative way.

What is MLflow?

MLflow is an open-source tool that enables you to:

  • Track and log all training experiments (models, hyperparameters, metrics, etc.)

  • Easily compare the performance of different models and runs

  • Centralize results so they’re accessible to the entire team

  • Save and version models for future reuse or deployment

With MLflow, every training run is automatically recorded: model settings, scores, model files, and more.


You can roll back, replay a run, or share the exact configuration that led to a given result.
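In practice, the logging pattern is short. Below is a minimal sketch: the tracking URI is a placeholder (the real one is shown on the DagsHub repository page), and the parameter and metric values are purely illustrative:

    import mlflow
    import mlflow.sklearn
    from sklearn.linear_model import LogisticRegression

    # Placeholder URI: the real one is displayed on the DagsHub repository page.
    mlflow.set_tracking_uri("https://dagshub.com/<user>/<repo>.mlflow")
    mlflow.set_experiment("tweet-sentiment")

    with mlflow.start_run(run_name="tfidf-logreg"):
        params = {"min_df": 2, "ngram_range": (1, 2), "C": 1.0}
        mlflow.log_params(params)

        model = LogisticRegression(C=params["C"], max_iter=1000)
        # ... fit and evaluate the model here ...

        # Placeholder scores; in practice they come from the evaluation step.
        mlflow.log_metrics({"accuracy": 0.79, "f1": 0.79, "roc_auc": 0.87})
        mlflow.sklearn.log_model(model, artifact_path="model")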

DagsHub: The Collaborative Platform

DagsHub is an online platform that provides a convenient interface to visualize and organize everything tracked by MLflow. It also offers collaborative features (comments, versioning, Git integration, etc.).

With DagsHub, all MLflow experiments are accessible from a centralized dashboard.

You can:

  • View and compare all metrics from tested models

  • Download or share models

  • Ensure full traceability and transparency of the scientific workflow

In real-world projects, you often run dozens — even hundreds — of experiments with different settings and data.
Without proper tracking, it’s impossible to reproduce the “best” model or collaborate efficiently.

With MLflow and DagsHub:

  • Everything is logged, documented, and comparable

  • Any result can be reproduced at any time

  • You avoid time loss and errors caused by disorganization or forgotten experiments

It’s an essential tool for any team-based data science project, bringing rigor and professionalism to machine learning workflow management.

"Classical" Model

 

We’ll use a classic and effective approach to text-based machine learning: the TF-IDF + logistic regression pipeline.

  • TF-IDF (Term Frequency–Inverse Document Frequency)

This is a method for transforming text into numerical values.
It calculates the importance of each word in a text, taking into account its frequency across the entire corpus.
This allows us to convert raw text into numerical matrices that can be processed by machine learning models.

  • Logistic Regression

A supervised classification model that predicts the probability of belonging to a given class (e.g., positive or negative).
It’s widely used for binary classification tasks on vectorized text data.
Put simply, it finds the line (more generally, a hyperplane) that best separates the data into two distinct regions.

In this case, the grid search will focus on the following hyperparameters:

  • Minimum Document Frequency

    This is the minimum threshold (expressed as a proportion or integer) for a word to be included in the TF-IDF vocabulary.

    If a word appears in less than this fraction of documents (e.g., 2%), it’s considered too rare and is ignored.

    The goal is to avoid including ultra-rare words, which are often just noise or typos.

  • Maximum Document Frequency

    This is the maximum threshold (as a proportion) for a word to be retained in the vocabulary.

    If a word appears in more than this fraction of documents (e.g., 80%), it’s considered too frequent and is excluded.

    The goal is to remove overly common words (often stopwords not filtered by default, like “the”, “and”, etc.) that don’t provide useful discriminative information.

  • The size of the n-grams used.

    (1,1) = unigrams: individual words only
    (1,2) = unigrams + bigrams: individual words and all pairs of consecutive words

    This parameter controls whether or not we account for context, expressions, and relationships between nearby words.

  • Regularization parameter for logistic regression (C = 1/λ).

    The smaller the value of C, the stronger the regularization (the model is more constrained, which helps prevent overfitting but may lead to underfitting).

    The larger the value of C, the weaker the regularization (the model can better adapt to dataset specifics, but with a higher risk of overfitting).

    The goal here is to find the right balance between bias and variance.
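A minimal scikit-learn version of this pipeline and its grid search might look like the following; the candidate values are illustrative, not the exact grid used in the project:

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    param_grid = {
        "tfidf__min_df": [2, 5],
        "tfidf__max_df": [0.8, 0.95],
        "tfidf__ngram_range": [(1, 1), (1, 2)],
        "clf__C": [0.1, 1.0, 10.0],
    }

    # Accuracy is a sound scoring choice here because the classes are balanced.
    search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=3, n_jobs=-1)
    # search.fit(texts_train, labels_train)
    # print(search.best_params_, search.best_score_)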

Convolutional Neural Network (CNN)

We’ll use a very simple Convolutional Neural Network (CNN) architecture. The underlying principle is straightforward:

Text goes in, it’s transformed into vectors, passed through filters that detect important expressions, these are summarized, and then the model makes a final decision: “Yes or no?”

Here’s a more detailed breakdown of the architecture:

1. Input
The model receives a sequence of words (e.g., a sentence or a tweet). Each word is represented by an integer (an index in the vocabulary).

2. Embedding Layer
This layer transforms each word index into a numeric vector. It’s like translating each word into a space where word similarities are captured numerically.

3. Dropout
Randomly “drops” some elements during training, to make the model more robust and prevent overfitting on the training data.

4. Two Convolutional Layers (Conv1D)
These act as pattern detectors across the text (e.g., recognizing expressions like “not bad at all” or “really awful”), by applying multiple filters that slide over subparts of the sentence to identify characteristic word sequences.

5. Global Max Pooling
This step summarizes the information extracted by the filters: for each filter, it keeps only the maximum activation across the sequence.
It’s like asking each detector, “Did you see this pattern, yes or no?”
The result is a compact and robust summary of all the important patterns found.

6. Dense Layer
A fully connected layer that combines the summarized information to learn more complex interactions.

 

7. Dropout
Additional regularization to stabilize learning and reduce overfitting.

8. Output Layer
A final layer reduces everything to a single value between 0 and 1, representing the probability that the tweet belongs to the target class (e.g., positive or negative).
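Put together, a minimal Keras sketch of this architecture could look like the following. Vocabulary size, sequence length, and layer sizes are illustrative; several of them are precisely the hyperparameters tuned in the grid search described next:

    from tensorflow.keras import layers, models

    VOCAB_SIZE, MAX_LEN, EMBED_DIM = 10_000, 50, 128  # illustrative values

    model = models.Sequential([
        layers.Input(shape=(MAX_LEN,)),                        # 1. sequence of word indices
        layers.Embedding(VOCAB_SIZE, EMBED_DIM),               # 2. embedding layer
        layers.Dropout(0.3),                                   # 3. dropout
        layers.Conv1D(128, kernel_size=3, activation="relu"),  # 4. first convolution
        layers.Conv1D(128, kernel_size=3, activation="relu"),  #    second convolution
        layers.GlobalMaxPooling1D(),                           # 5. global max pooling
        layers.Dense(64, activation="relu"),                   # 6. dense layer
        layers.Dropout(0.3),                                   # 7. dropout
        layers.Dense(1, activation="sigmoid"),                 # 8. probability of "positive"
    ])

    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])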

For this model, the hyperparameter grid search will focus on: 

  • The size of the word embedding vectors.

    Larger = each word carries more information, potentially capturing finer semantic nuances, but with increased risk of overfitting or slower training.

    Smaller = a lighter model, but sometimes less expressive.

  • The number of filters in the convolutional layer (Conv1D).

    More filters = the model can learn more diverse patterns in the text sequences.

    Too many filters = heavier model, slower training, higher risk of overfitting.

  • The kernel size used in the convolution (how many words at a time).

    Small kernel = detects short patterns (e.g., 2–3 words together)

    Large kernel = captures longer sequences/patterns (e.g., 5–7 words)

  • The proportion of neurons “dropped” at each pass (to prevent overfitting).

    High dropout = useful for regularization on small datasets.

    Low dropout = better suited for large datasets, with less risk of underfitting.

  • The number of samples processed simultaneously during one training pass (batch size).

    Small batch = more noise, finer convergence, but slower.

    Large batch = faster, better for large datasets, but may generalize less effectively.

  • The learning rate: the step size used to update the model's weights at each iteration of the optimizer.
    It’s one of the most important hyperparameters in deep learning.

    With a learning rate that’s too high: the model updates too aggressively, may “overshoot” the minima of the loss function, and fail to converge.

    With a learning rate that’s too low: the model learns very slowly, may stagnate for a long time, or get stuck in a local minimum.

Preprocessing

We can work with raw data, but we can also choose to apply preprocessing steps before training the models.

Cleaning: We used a cleaning script to remove elements that could interfere with analysis: mentions (@user), hashtags, URLs, special characters, etc.

In addition to a model trained on the raw text, we ran one training session with just this cleaning step, and another where cleaning was combined with lemmatization and advanced tokenization (using spaCy), giving a third trained model.
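As an illustration, here is a rough sketch of both steps; the exact cleaning rules and the spaCy pipeline used in the project may differ:

    import re
    import spacy

    # Assumes the small English model is installed: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

    def clean(text: str) -> str:
        # Remove mentions, URLs, hashtag marks, and non-alphabetic characters.
        text = re.sub(r"@\w+", " ", text)
        text = re.sub(r"http\S+|www\.\S+", " ", text)
        text = re.sub(r"[^A-Za-z\s]", " ", text)
        return re.sub(r"\s+", " ", text).strip().lower()

    def lemmatize(text: str) -> str:
        # spaCy tokenization + lemmatization, dropping stopwords.
        doc = nlp(text)
        return " ".join(tok.lemma_ for tok in doc if not tok.is_stop)

    print(lemmatize(clean("@user I loved this movie!! http://t.co/xyz #cinema")))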

Determining Vocabulary Size

Before choosing the vocabulary size for the model, it is essential to understand how many words “cover” most of the corpus, to avoid overloading the model unnecessarily and to retain only the most meaningful words.

We used the tokenizer to count all unique words present in the texts and determine how many times each one appears. The words were sorted by frequency, and we then calculated, cumulatively, the total proportion of occurrences represented by the most frequent words (for example, the most frequent word may represent 5% of the corpus, the top two 8%, etc.).

We plotted the curve “number of words vs. % of coverage,” which shows how many words are needed to cover a large portion of the corpus.

In general, a very small part of the vocabulary covers the vast majority of word occurrences (Zipf’s law).

Here, the goal is to determine how many words are needed to reach 95% or 90% coverage (i.e., for 95% or 90% of the words in the corpus to be recognized by the model). This allows us to rationally set the num_words parameter. As an example: before preprocessing, the corpus contains 690,487 different words.

[Figures: cumulative vocabulary coverage curves for the raw corpus, 95% and 90% thresholds]

But a tiny portion — 8,147 words, just over 1% — is enough to cover 90% of the corpus. (So for this case, I chose a num_words value of 10,000.)
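The coverage computation itself takes only a few lines with the Keras tokenizer. Here is a minimal sketch, where `texts` stands for the list of tweets prepared earlier:

    import numpy as np
    from tensorflow.keras.preprocessing.text import Tokenizer

    texts = dataset["text"].tolist()  # the tweets prepared earlier

    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(texts)

    # Word counts sorted from most to least frequent, then cumulative coverage.
    counts = np.array(sorted(tokenizer.word_counts.values(), reverse=True))
    coverage = np.cumsum(counts) / counts.sum()

    # Number of words needed to cover 90% and 95% of all occurrences.
    n_90 = int(np.searchsorted(coverage, 0.90)) + 1
    n_95 = int(np.searchsorted(coverage, 0.95)) + 1
    print(f"90% coverage: {n_90} words, 95% coverage: {n_95} words")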

With a Pre-trained Embedding

With a custom embedding layer trained within my model, the word vectors are optimized specifically for our task and dataset. However, this comes with a risk: the resulting language representation may become overly idiosyncratic. Fortunately, pre-trained embeddings already exist. In these, words are represented by vectors optimized on massive corpora (like Wikipedia, Common Crawl, etc.).

This allows the model to benefit from the general semantics of language: words like “king” and “queen”, or “cat” and “dog”, will be close in the embedding space. We can replace our embedding layer with one of these pre-trained embeddings and freeze the vectors during training, meaning they won’t be updated — as we did in our initial neural network versions.

The two embeddings I tested are:

GloVe.twitter.27B.200d


A pre-trained embedding from the GloVe (Global Vectors) family, trained specifically on a massive Twitter corpus (27 billion tokens).
Each word is represented by a 200-dimensional vector.

  • Well-suited to Twitter language: understands hashtags, abbreviations, emojis, and common expressions.

  • Ideal for sentiment analysis, social media NLP, or tasks involving informal writing.

  • Excellent for handling short, casual, or social media texts.

GoogleNews-vectors-negative300


A Word2Vec embedding pre-trained on a huge Google News corpus (100 billion words).
Each word is represented by a 300-dimensional vector.

  • Good at capturing general semantics of written English (news articles, formal documents).

  • Less suited to “spoken” language or abbreviations, but highly effective for a wide and diverse vocabulary.

  • Popular in many NLP applications, known for generating strong analogies (e.g., "king" – "man" + "woman" ≈ "queen").
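To plug such an embedding into the network, we build a matrix aligned with our own vocabulary and freeze it. Below is a minimal sketch for the GloVe Twitter vectors; it assumes the GloVe text file has been downloaded and reuses the tokenizer fitted earlier:

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers

    EMBED_DIM = 200  # glove.twitter.27B.200d

    # Load the pre-trained vectors into a word -> vector dictionary.
    glove = {}
    with open("glove.twitter.27B.200d.txt", encoding="utf-8") as f:
        for line in f:
            values = line.split()
            glove[values[0]] = np.asarray(values[1:], dtype="float32")

    # Build the matrix for our own vocabulary (tokenizer fitted earlier).
    num_words = 10_000
    embedding_matrix = np.zeros((num_words, EMBED_DIM))
    for word, index in tokenizer.word_index.items():
        if index < num_words and word in glove:
            embedding_matrix[index] = glove[word]

    # Frozen embedding layer: the pre-trained vectors are not updated during training.
    embedding_layer = layers.Embedding(
        num_words, EMBED_DIM,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False,
    )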

Recurrent Neural Network (RNN)

Convolutional neural networks are designed to detect patterns — but those patterns are decontextualized.
Recurrent neural networks, on the other hand, are better suited for processing longer sequences thanks to a form of memory.

Here's the architecture we implemented:

1. Input

2. Embedding

3. Two Bidirectional LSTM Layers
LSTM (Long Short-Term Memory):
A type of recurrent neural network layer specialized in remembering sequences.
It can retain the context of a sentence — even in relatively long texts — while forgetting less relevant information.

Bidirectional:
Each LSTM processes the sequence in both directions.
This allows the model to take into account both the preceding and the following context of each word.
(For instance, with an ambiguous word like “bank”, the model can tell from the surrounding context whether it refers to a financial institution or a river bank.)

  • The first bidirectional LSTM layer processes the full sequence and returns enriched information for each word.

  • The second bidirectional LSTM layer summarizes the entire sequence into a single global vector (based on both directions).

 

4. Output layer

For this model, the hyperparameter grid search will cover: batch_size, learning_rate, number of LSTM units (i.e., the memory size in each LSTM layer). Too few LSTM units, and the model may fail to capture the complexity of the text. Too many, and we face increased computational cost and a higher risk of overfitting.
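A minimal Keras sketch of this bidirectional LSTM architecture follows; the values are illustrative, and the number of units is one of the grid-searched hyperparameters:

    from tensorflow.keras import layers, models

    VOCAB_SIZE, MAX_LEN, EMBED_DIM, LSTM_UNITS = 10_000, 50, 128, 64  # illustrative

    model = models.Sequential([
        layers.Input(shape=(MAX_LEN,)),                # 1. sequence of word indices
        layers.Embedding(VOCAB_SIZE, EMBED_DIM),       # 2. embedding (custom or pre-trained)
        layers.Bidirectional(
            layers.LSTM(LSTM_UNITS, return_sequences=True)
        ),                                             # 3a. enriched vector for each word
        layers.Bidirectional(layers.LSTM(LSTM_UNITS)), # 3b. single global vector
        layers.Dense(1, activation="sigmoid"),         # 4. probability of "positive"
    ])

    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])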

Transformer Model

Transformers are a family of deep learning models introduced in 2017 that revolutionized natural language processing (NLP).
Their main innovation: they use an attention mechanism to understand the context of each word in a sentence, regardless of its position. This mechanism makes it easy to capture long-range dependencies, complex relationships, and the true meaning of sentences.

In this project, we used DistilBERT, a compact and optimized version of BERT — one of the most well-known models, developed by Google.

We fine-tuned it on our own dataset, following these steps:

1. Subword-based Tokenization
We used DistilBERT’s tokenizer, which breaks each text into subwords to better handle rare or unknown words.
This makes it highly effective for dealing with variations, typos, and neologisms — common on social media.

2. Contextual Embedding
Each text sequence is encoded into vectors using the pre-trained model, which understands the context of each word in the sentence.
Unlike static embeddings, these representations change depending on surrounding words, allowing the model to grasp subtle meanings.

3. Fine-tuning for Classification
DistilBERT is then fine-tuned on our dataset — that is, its weights are slightly adjusted for the specific task of binary sentiment classification on tweets. The final output is again a probability indicating whether the tweet belongs to the target class.

Once more, we perform a grid search on batch_size and learning_rate.
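A minimal fine-tuning sketch with the Hugging Face transformers library is shown below; the texts, labels, and training settings are placeholders:

    import tensorflow as tf
    from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = TFAutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2
    )

    # Placeholders for the real training split.
    texts = ["i love this", "this is awful"]
    labels = [1, 0]

    encodings = tokenizer(texts, truncation=True, padding=True, max_length=64,
                          return_tensors="tf")
    train_ds = tf.data.Dataset.from_tensor_slices((dict(encodings), labels)).batch(32)

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),  # grid-searched in practice
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    model.fit(train_ds, epochs=2)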

Model Performance and Selection

Now comes the time to choose between all these models, and that decision will be based both on performance and on the model’s size when deployed.

We can get a rough sense of a model’s heaviness from its inference time, but ultimately we’ll check the actual file size (which MLflow conveniently allows us to save and track).

And speaking of MLflow, it’s exactly what will help us retrieve all the necessary data to guide our decision. We can query it using a small program that pulls the relevant metrics and lets us visualize everything in a clear, synthetic way:
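Such a program can be as short as a call to mlflow.search_runs. Here is a sketch; the tracking URI, experiment name, and metric names are assumptions matching the earlier logging sketch:

    import mlflow

    mlflow.set_tracking_uri("https://dagshub.com/<user>/<repo>.mlflow")  # placeholder

    # One row per run, with parameters, metrics, and tags as columns.
    runs = mlflow.search_runs(experiment_names=["tweet-sentiment"])
    cols = ["tags.mlflow.runName", "metrics.accuracy", "metrics.f1",
            "metrics.roc_auc", "metrics.inference_time_s"]
    summary = runs[cols].sort_values("metrics.roc_auc", ascending=False)
    print(summary.to_string(index=False))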

[Figures: results table, performance comparison, and model sizes]

DistilBERT’s performance is clearly superior, without requiring a large amount of disk space…
So we’ll go ahead and deploy our API using this model.

API and UI

1. The Backend API (with FastAPI)

This is the component that handles predictions. It exposes an HTTP endpoint that accepts a POST request with a tweet’s text and returns the prediction (class — positive or negative — and the probabilities for each class). I use it with my frontend UI, but it could just as well be queried by any other application or external service.

FastAPI is a modern, open-source web framework designed for building APIs quickly and cleanly in Python.
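A stripped-down sketch of such an endpoint is shown below; model_predict is a hypothetical helper standing in for the DistilBERT model loaded from MLflow at startup:

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI(title="Tweet sentiment API")

    class Tweet(BaseModel):
        text: str

    def model_predict(text: str) -> float:
        # Hypothetical helper: in the real service this calls the DistilBERT model;
        # a constant is returned here only to keep the sketch self-contained.
        return 0.5

    @app.post("/predict")
    def predict(tweet: Tweet):
        proba_positive = model_predict(tweet.text)
        label = "positive" if proba_positive >= 0.5 else "negative"
        return {
            "label": label,
            "probabilities": {"positive": proba_positive,
                              "negative": 1 - proba_positive},
        }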

2. The Frontend UI (with Streamlit)

A web interface that allows a human user to interact with the backend API — but not only that: an alert button lets the user send information to an online data collection service, which in turn can notify me if recurrent misclassifications are reported.

Streamlit is an open-source Python framework for building interactive web apps, specifically designed for data science and machine learning.
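A minimal sketch of the interface follows; the API URL is a placeholder (in production it points to the Azure deployment), and the alert button is only hinted at here:

    import requests
    import streamlit as st

    API_URL = "http://localhost:8000/predict"  # placeholder for the deployed API URL

    st.title("Tweet sentiment")
    text = st.text_area("Enter a tweet")

    if st.button("Predict") and text:
        response = requests.post(API_URL, json={"text": text}, timeout=10)
        st.session_state["prediction"] = response.json()

    if "prediction" in st.session_state:
        st.write(f"Predicted sentiment: {st.session_state['prediction']['label']}")
        # Alert button: in the real app this notifies the online collection service.
        if st.button("Report a wrong prediction"):
            st.info("Thanks, the misclassified tweet has been reported.")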

First, I download the model weights and dictionary from MLflow. Then I set up unit tests to ensure the components of my programs continue to work as expected, even as the code evolves.

I synchronize my local directory with a GitHub repository — the world’s leading platform for hosting, versioning, and collaborating on source code.

This platform also lets me trigger automated actions with every push to the repository. As a first step, I automatically launch unit tests to verify that my code still behaves as intended.
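For instance, a unit test for the prediction endpoint can run with pytest both locally and in the GitHub Actions workflow; the import path of the FastAPI app below is an assumption:

    # test_api.py -- run with `pytest`
    from fastapi.testclient import TestClient
    from app.main import app  # assumed module path of the FastAPI application

    client = TestClient(app)

    def test_predict_returns_label_and_probabilities():
        response = client.post("/predict", json={"text": "I love this!"})
        assert response.status_code == 200
        body = response.json()
        assert body["label"] in {"positive", "negative"}
        assert 0.0 <= body["probabilities"]["positive"] <= 1.0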

But how can I ensure that my code works just as well on the internet as it does on my own machine (where I’ve built a virtual environment with all the right Python libraries, etc.)?

For that, I use Docker, which, as its name suggests, creates containers that encapsulate a minimal environment required to run my APIs properly.

Even better: GitHub’s automated actions allow me to rebuild container images with every code update, store them on DockerHub, and then deploy everything via Azure.

Azure also lets me receive notifications about tweets flagged by users as misclassified. One of its services, Application Insights, even allows me to set up an alert system. In my example, if three tweets are flagged as misclassified within five minutes, I receive both an email and an SMS.


What’s next?

Monitoring — although not implemented in this project — is of course essential.

First, it's important to remember that the alert button could be used in ways other than originally intended — for example, to report a different kind of issue, such as a server error.

So, in case of a significant increase in alerts, it’s crucial to react quickly by checking the overall health of the API — even though unit tests are already supposed to provide some level of protection here.

Next, collecting tweets that were flagged as misclassified and reviewing them manually appears to be indispensable.

Again, paying attention to alert volume can be insightful. A spike might point to emerging linguistic patterns (slang, recent social phenomena, etc.).

Naturally, if we identify truly misclassified tweets — or significant new language uses worth integrating — we’ll want to retrain the model. But caution: our current dataset contains around 1,600,000 tweets — a few dozen more won’t change much.

Therefore, we should label incoming flagged tweets continuously, but wait until we have a numerically significant set (typically 0.1% to 1%, i.e., 1,600 to 16,000 tweets) before considering retraining.
We can also choose to oversample certain tweets to give greater weight to the specific patterns they represent.

Once retrained, we can reuse the existing pipeline (GitHub, DockerHub, Azure) to deploy the new model quickly and efficiently.
