Computer Vision Projects
- Fabien Cappelli
- Feb 15
- 3 min read
Semantic Image Segmentation (Projects 8 & 9)
Overview
These two projects focus on semantic image segmentation for intelligent vision systems, using real-world urban imagery from the Cityscapes dataset.
The objective was to design, evaluate, and deploy deep learning models capable of assigning a semantic class to each pixel in an image, while carefully balancing segmentation accuracy, robustness, and inference performance.
Across both projects, several architectures were implemented, compared, and deployed, ranging from CNN-based encoder–decoder models to Transformer-based segmentation (SegFormer).
Project Context & Objectives
Context
Semantic segmentation is a critical computer vision task for applications such as:
- autonomous driving
- urban scene understanding
- intelligent transportation systems
Unlike object detection, segmentation requires dense pixel-level predictions, making model evaluation and deployment particularly sensitive to:
- class imbalance
- boundary precision
- inference latency
Objectives
- Train and compare multiple segmentation architectures on urban scenes.
- Evaluate models using domain-specific metrics (IoU, Dice coefficient).
- Measure inference-time trade-offs for production use.
- Deploy the best-performing models through a REST API and an interactive web application.
Dataset
| Item | Description |
| --- | --- |
| Dataset | Cityscapes |
| Images | Urban street scenes |
| Classes | 8 semantic classes |
| Input | Resized and normalized RGB images |
| Splits | Training / Validation / Test |
Project 8 — Baseline & Encoder–Decoder Models
Methodology
Project 8 focused on establishing strong CNN-based baselines using encoder–decoder architectures.
Key steps included:
- Data preprocessing and label encoding
- Extensive data augmentation
- Baseline model training
- Hyperparameter tuning
- Quantitative evaluation on the validation and test sets
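As an illustration of the label-encoding step, the sketch below groups the raw Cityscapes label IDs into 8 macro-categories (void, flat, construction, object, nature, sky, human, vehicle). The ID ranges follow the official cityscapesScripts label table, but the exact grouping used in the project is an assumption here:

```python
import numpy as np

# Lookup table from the 34 raw Cityscapes label IDs to 8 macro-classes.
# IDs 0-6 (unlabeled, ego vehicle, ...) fall into class 0 (void).
ID_TO_CATEGORY = np.zeros(34, dtype=np.uint8)
ID_TO_CATEGORY[7:11]  = 1   # flat: road, sidewalk, parking, rail track
ID_TO_CATEGORY[11:17] = 2   # construction: building, wall, fence, ...
ID_TO_CATEGORY[17:21] = 3   # object: pole, traffic light, traffic sign
ID_TO_CATEGORY[21:23] = 4   # nature: vegetation, terrain
ID_TO_CATEGORY[23]    = 5   # sky
ID_TO_CATEGORY[24:26] = 6   # human: person, rider
ID_TO_CATEGORY[26:34] = 7   # vehicle: car, truck, bus, train, ...

def encode_labels(raw_mask: np.ndarray) -> np.ndarray:
    """Convert a raw Cityscapes ID mask (H, W) to 8 macro-class labels."""
    return ID_TO_CATEGORY[raw_mask]
```

A single vectorized lookup like this keeps the encoding fast even on full-resolution Cityscapes masks.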
Models Evaluated
| Architecture | Encoder | Description |
| --- | --- | --- |
| U-Net | Custom | Baseline encoder–decoder |
| FPN | EfficientNet-B0 | Multi-scale feature aggregation |
| LinkNet | ResNet | Lightweight architecture with skip connections |
Training Configuration
- Loss function: Dice loss combined with cross-entropy
- Optimizers: Adam / AdamW
- Learning-rate scheduling
- Data augmentation to improve generalization
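The combined Dice + cross-entropy objective can be sketched as a NumPy forward pass. The actual training used framework-native losses, and the `alpha` weighting below is an illustrative assumption, not the project's setting:

```python
import numpy as np

def dice_ce_loss(y_true, y_pred, eps=1e-7, alpha=0.5):
    """Combined Dice + cross-entropy loss (forward computation only).

    y_true: one-hot ground truth, shape (N, H, W, C)
    y_pred: softmax probabilities, shape (N, H, W, C)
    alpha:  weight between the two terms (0.5 here for illustration)
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    # Pixel-wise cross-entropy, averaged over all pixels
    ce = -np.mean(np.sum(y_true * np.log(y_pred), axis=-1))
    # Soft Dice per class, averaged over classes
    inter = np.sum(y_true * y_pred, axis=(0, 1, 2))
    union = np.sum(y_true, axis=(0, 1, 2)) + np.sum(y_pred, axis=(0, 1, 2))
    dice = np.mean((2.0 * inter + eps) / (union + eps))
    return alpha * ce + (1.0 - alpha) * (1.0 - dice)
```

Mixing the two terms lets the region-overlap gradient of Dice compensate for cross-entropy's insensitivity to class imbalance.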
Evaluation Metrics
| Metric | Purpose |
| --- | --- |
| IoU (Jaccard index) | Region-overlap accuracy |
| Dice coefficient | Boundary-sensitive similarity |
| Inference time | Deployment feasibility |
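Both overlap metrics are straightforward to compute from predicted and ground-truth class masks. The sketch below skips classes absent from both masks, which is one of several common conventions and not necessarily the one used in the project:

```python
import numpy as np

def iou_score(pred, target, num_classes=8, eps=1e-7):
    """Mean IoU over classes present in the prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        p, t = (pred == c), (target == c)
        union = np.logical_or(p, t).sum()
        if union > 0:                                   # skip absent classes
            ious.append(np.logical_and(p, t).sum() / (union + eps))
    return float(np.mean(ious))

def dice_score(pred, target, num_classes=8, eps=1e-7):
    """Mean Dice coefficient: 2|A ∩ B| / (|A| + |B|) per class."""
    scores = []
    for c in range(num_classes):
        p, t = (pred == c), (target == c)
        denom = p.sum() + t.sum()
        if denom > 0:
            scores.append(2.0 * np.logical_and(p, t).sum() / (denom + eps))
    return float(np.mean(scores))
```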
Key Results — Project 8
| Model | Mean IoU (Test) | Inference Time (relative) |
| --- | --- | --- |
| U-Net (baseline) | ~0.69 | Fast |
| FPN + EfficientNet (no aug.) | ~0.70 | Medium |
| FPN + EfficientNet (with aug.) | ~0.81 | Medium |
| LinkNet | ~0.73 | Fast–Medium |
Key observations
- Data augmentation increased mean IoU by 10–12 points compared with the non-augmented baselines.
- FPN with an EfficientNet backbone provided the best accuracy/performance trade-off.
- U-Net remained a solid baseline but struggled on complex urban scenes.


Project 9 — Transformer-Based Segmentation (SegFormer)
Motivation
While CNN-based models capture local spatial patterns effectively, they can struggle with long-range dependencies in large urban scenes.
Project 9 explored SegFormer, a Transformer-based architecture designed to combine:
- local feature extraction
- global contextual awareness
- efficient multi-scale fusion
Architecture Highlights
- Hierarchical Transformer encoder
- Multi-scale feature representations
- Lightweight MLP decoder
- No positional encoding (flexible input resolution)
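These properties can be reproduced with a SegFormer instantiated from Hugging Face `transformers`. The snippet below uses a small random-weight config instead of the B5/ADE20K checkpoint so it runs without downloading weights; note that the decoder emits logits at 1/4 of the input resolution, which are upsampled to full size afterwards:

```python
import torch
from transformers import SegformerConfig, SegformerForSemanticSegmentation

# Small random-weight config standing in for the pretrained B5 backbone.
config = SegformerConfig(num_labels=8)      # 8 Cityscapes macro-classes
model = SegformerForSemanticSegmentation(config)

# No positional encodings, so arbitrary input resolutions are accepted:
x = torch.randn(1, 3, 128, 256)             # (batch, channels, H, W)
with torch.no_grad():
    logits = model(pixel_values=x).logits   # (1, 8, H/4, W/4)
print(logits.shape)
```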
Training Strategy
| Component | Choice |
| --- | --- |
| Backbone | SegFormer B5 (ADE20K pretraining) |
| Loss | Sparse categorical cross-entropy + Dice / Tversky |
| Metric | Mean IoU |
| Optimizer | AdamW + learning-rate scheduler |
Results — Project 9
| Model | Mean IoU (Test) | Inference Time |
| --- | --- | --- |
| FPN + EfficientNet | ~0.81 | Medium |
| SegFormer B5 | ~0.77–0.78 | Slower |
Interpretation
- SegFormer achieved strong global consistency and handled large structures better.
- Its accuracy was competitive with the CNN-based models but came at a higher computational cost.
- The model is well suited to accuracy-critical scenarios, less so to real-time constraints.
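Inference-time comparisons like the ones above can be made with a simple stdlib timing harness. Here `infer` stands in for any model's predict call; warm-up runs are discarded so one-off costs (graph compilation, caches) do not skew the measurement:

```python
import time
import statistics

def benchmark(infer, inputs, warmup=3, runs=20):
    """Return the median wall-clock latency of `infer` over `runs` calls."""
    for _ in range(warmup):                 # discard warm-up iterations
        infer(inputs)
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        infer(inputs)
        times.append(time.perf_counter() - t0)
    return statistics.median(times)
```

The median is preferred over the mean because occasional scheduler hiccups produce heavy-tailed latency distributions.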


Model Comparison Summary
| Model | Mean IoU | Strengths | Trade-offs |
| --- | --- | --- | --- |
| U-Net | ~0.69 | Simple, fast | Lower accuracy |
| FPN + EfficientNet | ~0.81 | Best trade-off | Moderate latency |
| SegFormer | ~0.77 | Global context | Higher cost |
Deployment & Applications
API & Web Application
- Backend: FastAPI
- Frontend: Streamlit
- Deployment: cloud-based, containerized with Docker
Users can:
- upload an image
- run segmentation inference
- visualize predicted masks alongside the original images
You can test the interactive demonstrations of both projects below.
In Project 9, particular attention was paid to accessibility and the clarity of the user interface.
Limitations & Perspectives
Limitations
- Transformer-based models introduce higher inference latency.
- Explainability remains limited for dense prediction tasks.
- Training was constrained by the available hardware.
Future Improvements
- Model distillation for faster inference
- Quantization / pruning for edge deployment
- Improved monitoring and production observability
- Advanced explainability methods for segmentation

