
Computer Vision Projects

  • Writer: Fabien Cappelli
  • Feb 15
  • 3 min read

Semantic Image Segmentation (Projects 8 & 9)

Overview


These two projects focus on semantic image segmentation for intelligent vision systems, using real-world urban imagery from the Cityscapes dataset.

The objective was to design, evaluate, and deploy deep learning models capable of assigning a semantic class to each pixel in an image, while carefully balancing segmentation accuracy, robustness, and inference performance.

Across both projects, several architectures were implemented, compared, and deployed, ranging from CNN-based encoder–decoder models to Transformer-based segmentation (SegFormer).


Project Context & Objectives

Context


Semantic segmentation is a critical computer vision task for applications such as:

  • autonomous driving,

  • urban scene understanding,

  • intelligent transportation systems.


Unlike object detection, segmentation requires dense pixel-level predictions, making model evaluation and deployment particularly sensitive to:

  • class imbalance,

  • boundary precision,

  • inference latency.


Objectives

  • Train and compare multiple segmentation architectures on urban scenes.

  • Evaluate models using domain-specific metrics (IoU, Dice coefficient).

  • Measure inference-time trade-offs for production use.

  • Deploy best-performing models through a REST API and interactive web application.


Dataset

| Item | Description |
| --- | --- |
| Dataset | Cityscapes |
| Images | Urban street scenes |
| Classes | 8 semantic classes |
| Input size | Resized and normalized RGB images |
| Splits | Training / Validation / Test |

Project 8 — Baseline & Encoder–Decoder Models

Methodology

Project 8 focused on establishing strong CNN-based baselines using encoder–decoder architectures.


Key steps included:

  • Data preprocessing and label encoding

  • Extensive data augmentation

  • Baseline model training

  • Hyperparameter tuning

  • Quantitative evaluation on validation and test sets
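
The label-encoding step can be sketched as a simple lookup-table remap from raw Cityscapes label ids to the 8 macro categories. The exact id-to-category grouping below follows the usual Cityscapes category convention (flat, construction, object, nature, sky, human, vehicle, void), but it is an illustrative assumption, not the project's actual code:

```python
import numpy as np

# Assumed grouping of raw Cityscapes label ids into 8 macro categories:
# 0 flat, 1 construction, 2 object, 3 nature, 4 sky, 5 human, 6 vehicle, 7 void.
ID_TO_CATEGORY = {
    7: 0, 8: 0,            # road, sidewalk            -> flat
    11: 1, 12: 1, 13: 1,   # building, wall, fence     -> construction
    17: 2, 19: 2, 20: 2,   # pole, traffic light/sign  -> object
    21: 3, 22: 3,          # vegetation, terrain       -> nature
    23: 4,                 # sky                       -> sky
    24: 5, 25: 5,          # person, rider             -> human
    26: 6, 27: 6, 28: 6,   # car, truck, bus           -> vehicle
}
VOID = 7  # every unmapped id falls into the void class

def encode_labels(mask: np.ndarray) -> np.ndarray:
    """Remap a raw Cityscapes id mask (H, W) to 8 category ids via a LUT."""
    lut = np.full(256, VOID, dtype=np.uint8)
    for raw_id, cat_id in ID_TO_CATEGORY.items():
        lut[raw_id] = cat_id
    return lut[mask]
```

A vectorized lookup table keeps this remap O(1) per pixel, which matters when preprocessing thousands of high-resolution masks.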


Models Evaluated


| Architecture | Encoder | Description |
| --- | --- | --- |
| U-Net | Custom | Baseline encoder–decoder |
| FPN | EfficientNet-B0 | Multi-scale feature aggregation |
| LinkNet | ResNet | Lightweight architecture with skip connections |

Training Configuration

  • Loss functions: Dice Loss combined with Cross-Entropy

  • Optimizers: Adam / AdamW

  • Learning rate scheduling

  • Data augmentation to improve generalization
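
The combined loss can be written as a weighted sum of a soft Dice term and pixel-wise cross-entropy. The numpy sketch below illustrates the idea; the weights and shapes are assumptions, not the project's actual implementation:

```python
import numpy as np

def dice_ce_loss(probs, onehot, w_dice=0.5, w_ce=0.5, eps=1e-7):
    """Weighted sum of soft Dice loss and cross-entropy.

    probs:  (N, H, W, C) softmax outputs
    onehot: (N, H, W, C) one-hot ground-truth masks
    """
    # Soft Dice, averaged over classes
    inter = (probs * onehot).sum(axis=(0, 1, 2))
    denom = probs.sum(axis=(0, 1, 2)) + onehot.sum(axis=(0, 1, 2))
    dice = 1.0 - ((2.0 * inter + eps) / (denom + eps)).mean()
    # Pixel-wise cross-entropy
    ce = -(onehot * np.log(probs + eps)).sum(axis=-1).mean()
    return w_dice * dice + w_ce * ce
```

The Dice term counteracts class imbalance (small classes contribute as much as large ones), while cross-entropy keeps per-pixel gradients well behaved early in training.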


Evaluation Metrics


| Metric | Purpose |
| --- | --- |
| IoU (Jaccard Index) | Region overlap accuracy |
| Dice Coefficient | Boundary-sensitive similarity |
| Inference Time | Deployment feasibility |
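
Both overlap metrics follow directly from per-class intersection and union counts; a minimal sketch, assuming integer class masks:

```python
import numpy as np

def iou_and_dice(pred, target, num_classes):
    """Per-class IoU and Dice from integer masks (H, W); NaN if a class is absent."""
    ious, dices = [], []
    for c in range(num_classes):
        p, t = pred == c, target == c
        inter = np.logical_and(p, t).sum()
        union = np.logical_or(p, t).sum()
        if union == 0:  # class absent from both masks: undefined
            ious.append(np.nan)
            dices.append(np.nan)
            continue
        ious.append(inter / union)
        dices.append(2 * inter / (p.sum() + t.sum()))
    return np.array(ious), np.array(dices)
```

Mean IoU is then `np.nanmean(ious)`, so absent classes do not distort the average.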


Key Results — Project 8

| Model | Mean IoU (Test) | Inference Time (relative) |
| --- | --- | --- |
| U-Net (baseline) | ~0.69 | Fast |
| FPN + EfficientNet (no aug.) | ~0.70 | Medium |
| FPN + EfficientNet (with aug.) | ~0.81 | Medium |
| LinkNet | ~0.73 | Fast–Medium |

Key observations

  • Data augmentation increased mean IoU by 10–12 points over non-augmented baselines.

  • FPN with EfficientNet backbone provided the best accuracy / performance trade-off.

  • U-Net remained a solid baseline but struggled on complex urban scenes.




Project 9 — Transformer-Based Segmentation (SegFormer)

Motivation

While CNN-based models capture local spatial patterns effectively, they can struggle with long-range dependencies in large urban scenes.


Project 9 explored SegFormer, a Transformer-based architecture designed to combine:

  • local feature extraction,

  • global contextual awareness,

  • efficient multi-scale fusion.


Architecture Highlights

  • Hierarchical Transformer encoder

  • Multi-scale feature representations

  • Lightweight MLP decoder

  • No positional encoding (flexible input resolution)
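
The decoder's fusion step can be illustrated with a toy numpy sketch: each stage's feature map is upsampled to the finest resolution, the maps are concatenated along channels, and a single linear projection mixes them. This is a deliberately simplified stand-in for SegFormer's all-MLP decoder, not its real implementation:

```python
import numpy as np

def upsample(feat, factor):
    """Nearest-neighbour upsampling of a (H, W, C) feature map."""
    return feat.repeat(factor, axis=0).repeat(factor, axis=1)

def fuse_multiscale(feats, proj):
    """Toy multi-scale fusion: upsample every stage to the finest
    resolution, concatenate channels, apply one linear projection."""
    h = max(f.shape[0] for f in feats)
    stacked = np.concatenate(
        [upsample(f, h // f.shape[0]) for f in feats], axis=-1)
    return stacked @ proj  # (H, W, C_total) @ (C_total, C_out)
```

Because the heavy lifting is done by the hierarchical encoder, this lightweight fusion keeps the decoder's parameter count small compared with CNN decoders.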


Training Strategy

| Component | Choice |
| --- | --- |
| Backbone | SegFormer B5 (ADE20K pretraining) |
| Loss | Sparse Categorical Cross-Entropy + Dice / Tversky |
| Metric | Mean IoU |
| Optimizer | AdamW + learning rate scheduler |


Results — Project 9

| Model | Mean IoU (Test) | Inference Time |
| --- | --- | --- |
| FPN + EfficientNet | ~0.81 | Medium |
| SegFormer B5 | ~0.77–0.78 | Slower |

Interpretation

  • SegFormer achieved strong global consistency and better handling of large structures.

  • Performance was competitive with CNN-based models but came with higher computational cost.

  • The model is well suited to accuracy-critical scenarios, less so to real-time constraints.



Model Comparison Summary

| Model | Mean IoU | Strengths | Trade-offs |
| --- | --- | --- | --- |
| U-Net | ~0.69 | Simple, fast | Lower accuracy |
| FPN + EfficientNet | ~0.81 | Best trade-off | Moderate latency |
| SegFormer | ~0.77 | Global context | Higher cost |

Deployment & Applications

API & Web Application

  • Backend: FastAPI

  • Frontend: Streamlit

  • Deployment: Cloud-based, containerized with Docker


Users can:

  • upload an image,

  • run segmentation inference,

  • visualize predicted masks alongside original images.


You can test the interactive demonstrations of both projects below.


In Project 9, particular attention was given to accessibility and to the clarity of the user interface.




Limitations & Perspectives

Limitations

  • Transformer-based models introduce higher inference latency.

  • Explainability remains limited for dense prediction tasks.

  • Training constrained by available hardware.


Future Improvements

  • Model distillation for faster inference

  • Quantization / pruning for edge deployment

  • Improved monitoring and production observability

  • Advanced explainability methods for segmentation


