Reflection on MP
Notes and reflections on the COMP3007 Machine Perception assignment.
Purpose and Background
The goal of this assignment was to build an end-to-end system that can read building numbers on the Curtin campus from raw images. Instead of just classifying a whole image, the system had to:
- Detect where the building number sign is.
- Segment out the individual digits.
- Recognise each digit and reconstruct the final building number.
What I liked about this project is that it combined almost every part of the COMP3007 pipeline:
- feature extraction (binary shape analysis, HOG);
- classical machine learning (SVM);
- modern deep learning (YOLOv8 for detection);
- plus practical tooling (CLI pipeline, config files, experiment tracking).
It felt like a nice capstone for the unit: taking things from lectures and practicals and wiring them into something that actually works on real photos.
Dataset and Annotation
All images came from Curtin campus scenes, where the building number is printed on plates attached to walls, pillars, or building facades. To make the system robust, the dataset included variations in:
- viewpoint (front-on vs angled shots),
- lighting (harsh sun, shadows, cloudy),
- background clutter (trees, people, vehicles),
- sign styles (different fonts, contrast, size).
I used Roboflow to:
- upload and organise the raw images,
- draw bounding boxes around building-number signs,
- export annotations in COCO format (Microsoft’s Common Objects in Context format),
- perform basic augmentations (flips, slight rotations, brightness/contrast changes).
Having COCO-style JSON annotations made it straightforward to plug the dataset into YOLOv8’s training pipeline.
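As an illustration of why the COCO export is convenient: the annotation file is a single JSON document with `images` and `annotations` arrays, so collecting the boxes per image takes only a few lines. This is a minimal sketch (the helper name is my own, not from the assignment code):

```python
import json

def load_coco_boxes(path):
    """Read a COCO annotation file and return {image_filename: [bbox, ...]},
    where each bbox is COCO's [x, y, width, height] in pixels."""
    with open(path) as f:
        coco = json.load(f)
    # COCO stores image metadata and annotations separately, linked by image id.
    id_to_name = {img["id"]: img["file_name"] for img in coco["images"]}
    boxes = {}
    for ann in coco["annotations"]:
        boxes.setdefault(id_to_name[ann["image_id"]], []).append(ann["bbox"])
    return boxes
```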
System Overview
The final system is a three-stage pipeline:
1. Detection (YOLOv8): take a campus image and detect the building-number sign as a bounding box.
2. Segmentation & preprocessing: crop the sign region, binarise it, clean it up, and isolate individual characters using connected components and simple geometric rules.
3. Recognition (HOG + SVM): for each character crop, extract HOG features and classify them with an SVM trained on digit patches. Concatenate the digits to form the final building number (e.g., “204”, “410A”).
Everything is wrapped in a CLI pipeline, so running the whole assignment is essentially:
python assignment.py --task all --input_dir path/to/images --output_dir results/
This runs detection, segmentation, and recognition, and writes out the expected result files for marking.
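The entry point above can be sketched with argparse; the argument names mirror the command, while the actual stage functions are placeholders (the real dispatch logic is the assignment's own code):

```python
import argparse

def build_parser():
    """Build the command-line interface for the pipeline."""
    parser = argparse.ArgumentParser(description="Building-number reading pipeline")
    parser.add_argument("--task", choices=["detect", "segment", "recognise", "all"],
                        default="all", help="which stage(s) to run")
    parser.add_argument("--input_dir", required=True, help="directory of input images")
    parser.add_argument("--output_dir", required=True, help="where result files are written")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    # Dispatch to the requested stage(s) here; stage functions omitted.
    print(f"Running task '{args.task}' on {args.input_dir} -> {args.output_dir}")
```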
Detection: YOLOv8 Fine-Tuning
For detection I used YOLOv8 (Ultralytics), starting from a pre-trained checkpoint that was originally trained on Microsoft COCO. COCO pretraining gives the model strong general visual features, so fine-tuning on my small campus dataset converged reasonably quickly.
Key steps:
- Converted the Roboflow annotations to the exact format YOLOv8 expects.
- Split the dataset into train / val sets to monitor generalisation.
- Fine-tuned a lightweight YOLOv8 model on the “building-number sign” class.
- Logged training runs to Weights & Biases (wandb) to track:
  - training/validation loss,
  - mAP@0.5 and mAP@0.5:0.95,
  - learning rate schedules and run comparisons.
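The annotation-conversion step boils down to a box-format change: COCO stores absolute `[x, y, width, height]` in pixels, while YOLO expects `[x_center, y_center, width, height]` normalised by the image size. A minimal sketch (function name is my own):

```python
def coco_to_yolo(bbox, img_w, img_h):
    """Convert a COCO [x, y, w, h] pixel box to YOLO's normalised
    [x_center, y_center, w, h] format."""
    x, y, w, h = bbox
    return [(x + w / 2) / img_w,  # x of box centre, as fraction of width
            (y + h / 2) / img_h,  # y of box centre, as fraction of height
            w / img_w,
            h / img_h]
```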
In practice, the detector was very good at finding the sign region even under challenging lighting and clutter, which is crucial because any missed sign means the whole pipeline fails for that image.
Segmentation and Character Extraction
Once YOLOv8 finds the sign, I crop that region and switch to more classical image processing:
1. Grayscale + thresholding: convert to grayscale and apply Otsu/adaptive thresholding to get a clean binary image of the digits against the background.
2. Morphological operations: use basic morphology (opening/closing) to remove small noise and connect slightly broken strokes.
3. Connected component labelling (CCL): run CCL on the binary mask to find separate blobs. For each blob I compute:
   - bounding box,
   - area,
   - width/height,
   - aspect ratio.
4. Character filtering and ordering:
   - Filter out blobs that are too small or have unrealistic aspect ratios for digits/letters.
   - Sort remaining blobs left-to-right (by x-coordinate) to get the reading order.
The output of this stage is a set of normalised character crops, ready to be fed into a recogniser.
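The filtering-and-ordering step can be sketched as follows; the thresholds here are illustrative defaults, not the exact values used in the assignment:

```python
def order_character_blobs(blobs, min_area=50, min_aspect=0.2, max_aspect=1.2):
    """Filter connected-component blobs down to plausible characters and
    sort them into reading order. Each blob is an (x, y, w, h) bounding box."""
    chars = []
    for (x, y, w, h) in blobs:
        area = w * h
        aspect = w / h  # width/height; digits are usually taller than wide
        if area >= min_area and min_aspect <= aspect <= max_aspect:
            chars.append((x, y, w, h))
    # Left-to-right by x-coordinate gives the reading order.
    return sorted(chars, key=lambda b: b[0])
```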
Recognition: HOG Features + SVM Classifier
For character recognition, I went with a classical combination: HOG + SVM.
1. Normalisation: resize each character patch to a fixed size (e.g., 32×32 or 40×40), pad if necessary, and keep a small margin so strokes are not cut off.
2. Feature extraction (HOG): for each normalised patch:
   - compute gradient orientations,
   - build histograms over small spatial cells,
   - normalise across overlapping blocks to get some invariance to contrast and illumination.
   The final HOG descriptor is a single feature vector per character patch.
3. Multi-class SVM:
   - Train a linear SVM (one-vs-rest) on labelled character patches.
   - Classes include digits 0–9 (and letters, if needed for building suffixes like “410A”).
At test time, each character patch is mapped to a HOG feature vector and passed through the SVM, returning the most likely digit/letter. Concatenating these predictions (in left-to-right order) yields the final building number.
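A stripped-down illustration of that test-time path: a simplified HOG (per-cell orientation histograms, omitting the overlapping block normalisation used in the real descriptor) and a one-vs-rest decision with a hypothetical trained weight matrix `W` and bias `b`:

```python
import numpy as np

def simple_hog(patch, cell=8, bins=9):
    """Simplified HOG: weighted orientation histograms over square cells.
    Illustrative only; omits block normalisation."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180  # unsigned orientations
    h, w = patch.shape
    feats = []
    for i in range(0, h - cell + 1, cell):
        for j in range(0, w - cell + 1, cell):
            m = mag[i:i + cell, j:j + cell].ravel()
            a = ang[i:i + cell, j:j + cell].ravel()
            hist, _ = np.histogram(a, bins=bins, range=(0, 180), weights=m)
            feats.append(hist)
    return np.concatenate(feats)

def svm_predict(W, b, features, labels):
    """One-vs-rest linear SVM decision: pick the class with the highest score."""
    scores = W @ features + b
    return labels[int(np.argmax(scores))]
```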
Evaluation and Failure Cases
I evaluated the system in two layers:
- Detection quality (YOLOv8)
- Looked at mAP metrics and qualitative results on a validation set.
- Checked if signs were detected at reasonable IoU thresholds (e.g., ≥ 0.5).
- End-to-end building number accuracy
- Took held-out images and compared the predicted building string (e.g., “204”) to the ground truth.
- Manually inspected failure cases.
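Both evaluation layers reduce to small, self-contained computations: box overlap for detection quality and exact string match for end-to-end accuracy. A sketch of each (function names are my own):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def exact_match_accuracy(predictions, ground_truth):
    """Fraction of images whose predicted building string (e.g. "204")
    exactly matches the ground truth."""
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)
```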
Common failure modes included:
- Low contrast or reflective signs where thresholding produced broken or merged characters.
- Extreme perspective causing characters to touch or warp, making CCL messy.
- Small, distant signs where the detection box was correct but the cropped resolution was too low for reliable recognition.
Despite these, the pipeline performed reliably on most typical campus shots where the sign was relatively clear and central.
Tooling and Reproducibility
One of the most valuable parts of this assignment was building something that is reproducible and inspectable:
- CLI pipeline with clear arguments for input/output paths and which task(s) to run.
- Config-driven training for YOLOv8 (image size, epochs, augmentations, etc.).
- Roboflow as a single source of truth for images + labels.
- Weights & Biases (wandb) dashboards to compare different training runs and hyperparameters.
Compared to my earlier ML assignments, this workflow felt much closer to how real computer vision projects are run in practice.
Key Learnings
A few things I took away from this project:
1. Deep + classical is a powerful combo. YOLOv8 handles spatial localisation extremely well, while classical segmentation and HOG+SVM provide simple, interpretable character recognition.
2. Good annotations matter more than fancy models. Cleaning labels, handling edge cases, and exporting proper COCO-format JSON was just as important as tweaking YOLO hyperparameters.
3. Experiment tracking is worth the effort. Using wandb to monitor loss curves and mAP across runs made it much easier to iterate and avoid “mystery” performance changes.
4. Error handling and pipeline design are not optional. A brittle script that crashes on one weird image is useless. Building a robust CLI that can skip bad cases and still produce valid outputs was a big mindset shift.
Reflection and Future Work
I really enjoyed this assignment and the Machine Perception unit overall. It was a rare chance to:
- work hands-on with YOLOv8 instead of just reading about it;
- see how tools like Roboflow and wandb streamline the workflow: Roboflow made dataset preparation and augmentation surprisingly smooth, while wandb gave me a clear view of training runs, metrics, and experiment history;
- learn how COCO-format datasets and modern tooling fit together in practice;
- connect lecture topics (feature extraction, classical ML, deep learning, detection, tracking) in one coherent project.
If I revisit this project, I’d like to explore:
- replacing the HOG+SVM recogniser with a small CNN or CRNN for characters;
- making the segmentation step more robust to perspective and lighting (e.g., using learned segmentation instead of pure thresholding);
- deploying the detector in a simple demo app (e.g., a phone or web interface) to read building numbers in real time on campus.
Overall, this assignment was a fun and meaningful way to wrap up the unit and made me much more confident about applying both classical and modern computer vision techniques to real-world problems.