Computer Vision Projects
- Fabien Cappelli
- Feb 15
- 3 min read
Semantic Image Segmentation (Projects 8 & 9)
Overview
These two projects focus on semantic image segmentation for intelligent vision systems, using real-world urban imagery from the Cityscapes dataset.
The objective was to design, evaluate, and deploy deep learning models capable of assigning a semantic class to each pixel in an image, while carefully balancing segmentation accuracy, robustness, and inference performance.
Across both projects, several architectures were implemented, compared, and deployed, ranging from CNN-based encoder–decoder models to Transformer-based segmentation (SegFormer).
Project Context & Objectives
Context
Semantic segmentation is a critical computer vision task for applications such as:
- autonomous driving
- urban scene understanding
- intelligent transportation systems
Unlike object detection, segmentation requires dense pixel-level predictions, making model evaluation and deployment particularly sensitive to:
- class imbalance
- boundary precision
- inference latency
Objectives
- Train and compare multiple segmentation architectures on urban scenes.
- Evaluate models using domain-specific metrics (IoU, Dice coefficient).
- Measure inference-time trade-offs for production use.
- Deploy the best-performing models through a REST API and an interactive web application.
Dataset
| Item | Description |
| --- | --- |
| Dataset | Cityscapes |
| Images | Urban street scenes |
| Classes | 8 semantic classes |
| Input | Resized and normalized RGB images |
| Splits | Training / Validation / Test |
Project 8 — Baseline & Encoder–Decoder Models
Methodology
Project 8 focused on establishing strong CNN-based baselines using encoder–decoder architectures.
Key steps included:
- Data preprocessing and label encoding
- Extensive data augmentation
- Baseline model training
- Hyperparameter tuning
- Quantitative evaluation on the validation and test sets
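As an illustration of the label-encoding step, the sketch below groups the raw Cityscapes label IDs into 8 macro-categories (void, flat, construction, object, nature, sky, human, vehicle). The ID ranges follow the official cityscapesScripts label table, but the exact grouping used in the project is an assumption here:

```python
import numpy as np

# Lookup table from the 34 raw Cityscapes label IDs to 8 macro-classes.
# IDs 0-6 (unlabeled, ego vehicle, ...) fall into class 0 (void).
ID_TO_CATEGORY = np.zeros(34, dtype=np.uint8)
ID_TO_CATEGORY[7:11]  = 1   # flat: road, sidewalk, parking, rail track
ID_TO_CATEGORY[11:17] = 2   # construction: building, wall, fence, ...
ID_TO_CATEGORY[17:21] = 3   # object: pole, traffic light, traffic sign
ID_TO_CATEGORY[21:23] = 4   # nature: vegetation, terrain
ID_TO_CATEGORY[23]    = 5   # sky
ID_TO_CATEGORY[24:26] = 6   # human: person, rider
ID_TO_CATEGORY[26:34] = 7   # vehicle: car, truck, bus, train, ...

def encode_labels(raw_mask: np.ndarray) -> np.ndarray:
    """Convert a raw Cityscapes ID mask (H, W) to 8 macro-class labels."""
    return ID_TO_CATEGORY[raw_mask]
```

A single vectorized lookup like this keeps the encoding fast even on full-resolution Cityscapes masks.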
Models Evaluated
| Architecture | Encoder | Description |
| --- | --- | --- |
| U-Net | Custom | Baseline encoder–decoder |
| FPN | EfficientNet-B0 | Multi-scale feature aggregation |
| LinkNet | ResNet | Lightweight architecture with skip connections |
Training Configuration
- Loss function: Dice loss combined with cross-entropy
- Optimizers: Adam / AdamW
- Learning-rate scheduling
- Data augmentation to improve generalization
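The combined Dice + cross-entropy objective can be sketched as a NumPy forward pass. The actual training used framework-native losses, and the `alpha` weighting below is an illustrative assumption, not the project's setting:

```python
import numpy as np

def dice_ce_loss(y_true, y_pred, eps=1e-7, alpha=0.5):
    """Combined Dice + cross-entropy loss (forward computation only).

    y_true: one-hot ground truth, shape (N, H, W, C)
    y_pred: softmax probabilities, shape (N, H, W, C)
    alpha:  weight between the two terms (0.5 here for illustration)
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    # Pixel-wise cross-entropy, averaged over all pixels
    ce = -np.mean(np.sum(y_true * np.log(y_pred), axis=-1))
    # Soft Dice per class, averaged over classes
    inter = np.sum(y_true * y_pred, axis=(0, 1, 2))
    union = np.sum(y_true, axis=(0, 1, 2)) + np.sum(y_pred, axis=(0, 1, 2))
    dice = np.mean((2.0 * inter + eps) / (union + eps))
    return alpha * ce + (1.0 - alpha) * (1.0 - dice)
```

Mixing the two terms lets the region-overlap gradient of Dice compensate for cross-entropy's insensitivity to class imbalance.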
Evaluation Metrics
| Metric | Purpose |
| --- | --- |
| IoU (Jaccard index) | Region-overlap accuracy |
| Dice coefficient | Boundary-sensitive similarity |
| Inference time | Deployment feasibility |
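Both overlap metrics are straightforward to compute from predicted and ground-truth class masks. The sketch below skips classes absent from both masks, which is one of several common conventions and not necessarily the one used in the project:

```python
import numpy as np

def iou_score(pred, target, num_classes=8, eps=1e-7):
    """Mean IoU over classes present in the prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        p, t = (pred == c), (target == c)
        union = np.logical_or(p, t).sum()
        if union > 0:                                   # skip absent classes
            ious.append(np.logical_and(p, t).sum() / (union + eps))
    return float(np.mean(ious))

def dice_score(pred, target, num_classes=8, eps=1e-7):
    """Mean Dice coefficient: 2|A ∩ B| / (|A| + |B|) per class."""
    scores = []
    for c in range(num_classes):
        p, t = (pred == c), (target == c)
        denom = p.sum() + t.sum()
        if denom > 0:
            scores.append(2.0 * np.logical_and(p, t).sum() / (denom + eps))
    return float(np.mean(scores))
```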
Key Results — Project 8
| Model | Mean IoU (Test) | Inference Time (relative) |
| --- | --- | --- |
| U-Net (baseline) | ~0.69 | Fast |
| FPN + EfficientNet (no aug.) | ~0.70 | Medium |
| FPN + EfficientNet (with aug.) | ~0.81 | Medium |
| LinkNet | ~0.73 | Fast–Medium |
Key observations
- Data augmentation increased mean IoU by 10–12 points compared with the non-augmented baselines.
- FPN with an EfficientNet backbone provided the best accuracy/performance trade-off.
- U-Net remained a solid baseline but struggled on complex urban scenes.


Project 9 — Transformer-Based Segmentation (SegFormer)
Motivation
While CNN-based models capture local spatial patterns effectively, they can struggle with long-range dependencies in large urban scenes.
Project 9 explored SegFormer, a Transformer-based architecture designed to combine:
- local feature extraction
- global contextual awareness
- efficient multi-scale fusion
Architecture Highlights
- Hierarchical Transformer encoder
- Multi-scale feature representations
- Lightweight MLP decoder
- No positional encoding (flexible input resolution)
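These properties can be reproduced with a SegFormer instantiated from Hugging Face `transformers`. The snippet below uses a small random-weight config instead of the B5/ADE20K checkpoint so it runs without downloading weights; note that the decoder emits logits at 1/4 of the input resolution, which are upsampled to full size afterwards:

```python
import torch
from transformers import SegformerConfig, SegformerForSemanticSegmentation

# Small random-weight config standing in for the pretrained B5 backbone.
config = SegformerConfig(num_labels=8)      # 8 Cityscapes macro-classes
model = SegformerForSemanticSegmentation(config)

# No positional encodings, so arbitrary input resolutions are accepted:
x = torch.randn(1, 3, 128, 256)             # (batch, channels, H, W)
with torch.no_grad():
    logits = model(pixel_values=x).logits   # (1, 8, H/4, W/4)
print(logits.shape)
```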
Training Strategy
| Component | Choice |
| --- | --- |
| Backbone | SegFormer B5 (ADE20K pretraining) |
| Loss | Sparse categorical cross-entropy + Dice / Tversky |
| Metric | Mean IoU |
| Optimizer | AdamW + learning-rate scheduler |
Results — Project 9
| Model | Mean IoU (Test) | Inference Time |
| --- | --- | --- |
| FPN + EfficientNet | ~0.81 | Medium |
| SegFormer B5 | ~0.77–0.78 | Slower |
Interpretation
- SegFormer achieved strong global consistency and handled large structures better.
- Its accuracy was competitive with the CNN-based models but came at a higher computational cost.
- The model is well suited to accuracy-critical scenarios, less so to real-time constraints.
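Inference-time comparisons like the ones above can be made with a simple stdlib timing harness. Here `infer` stands in for any model's predict call; warm-up runs are discarded so one-off costs (graph compilation, caches) do not skew the measurement:

```python
import time
import statistics

def benchmark(infer, inputs, warmup=3, runs=20):
    """Return the median wall-clock latency of `infer` over `runs` calls."""
    for _ in range(warmup):                 # discard warm-up iterations
        infer(inputs)
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        infer(inputs)
        times.append(time.perf_counter() - t0)
    return statistics.median(times)
```

The median is preferred over the mean because occasional scheduler hiccups produce heavy-tailed latency distributions.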


Model Comparison Summary
| Model | Mean IoU | Strengths | Trade-offs |
| --- | --- | --- | --- |
| U-Net | ~0.69 | Simple, fast | Lower accuracy |
| FPN + EfficientNet | ~0.81 | Best trade-off | Moderate latency |
| SegFormer | ~0.77 | Global context | Higher cost |
Deployment & Applications
API & Web Application
- Backend: FastAPI
- Frontend: Streamlit
- Deployment: cloud-based, containerized with Docker
Users can:
- upload an image
- run segmentation inference
- visualize predicted masks alongside the original images
You can test the interactive demonstrations of both projects below.
In Project 9, particular attention was paid to accessibility and the clarity of the user interface.
Limitations & Perspectives
Limitations
- Transformer-based models introduce higher inference latency.
- Explainability remains limited for dense prediction tasks.
- Training was constrained by the available hardware.
Future Improvements
- Model distillation for faster inference
- Quantization / pruning for edge deployment
- Improved monitoring and production observability
- Advanced explainability methods for segmentation

