Computer Vision and Natural Language Processing

May 19, 2025

Notes from the summary of COMP3010 Lecture 11, labs, and online resources.

Overview

Computer Vision (CV) and Natural Language Processing (NLP) are at the forefront of artificial intelligence (AI). They are driving many recent advances in areas like self-supervised learning, generative models, and cross-modal architectures that integrate vision and language.

Computer Vision (CV)

What is Computer Vision?

Computer Vision is the study of building algorithms that enable machines to perceive and understand visual data—such as images and videos—in a way similar to humans.

“Every picture tells a story.”
We want computers to extract and understand that story.

Challenges in Computer Vision

Semantic Gap: Images are numerical pixel grids; interpreting them as semantic entities is non-trivial.
Viewpoint Variation: The same object looks different from different angles.
Intraclass Variation: Objects from the same class may look vastly different.
Fine-grained Categories: Subtle differences between visually similar categories.
Background Clutter: Irrelevant information in images that confuses the model.
Illumination Changes: Light can drastically alter how objects appear.
Deformation: Objects can change shape (e.g., animals, clothing).
Occlusion: Parts of objects can be hidden from view.
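To make the semantic gap and illumination challenges concrete, here is a minimal sketch in plain Python. The 3x3 "image", the `brighten` helper, and the `pixel_distance` helper are all made up for illustration. It shows that a uniform brightness change alters every pixel value even though the shape in the image stays the same.

```python
# A tiny 3x3 grayscale "image": each value is a pixel intensity in [0, 255].
# The bright pixels form a plus sign; the numbers are invented for illustration.
image = [
    [10, 200, 10],
    [200, 200, 200],
    [10, 200, 10],
]


def brighten(img, delta):
    """Simulate an illumination change: add delta to every pixel, clipped to 255."""
    return [[min(255, p + delta) for p in row] for row in img]


def pixel_distance(a, b):
    """Sum of absolute per-pixel differences (L1 distance) between two images."""
    return sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


brighter = brighten(image, 40)

# The plus sign is still there, yet all nine pixel values have changed.
print(pixel_distance(image, image))     # distance of an image to itself
print(pixel_distance(image, brighter))  # distance after the lighting change
```

A model that compares raw pixels would treat the two images as different, which is one reason learned features are preferred over raw intensities.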

Core CV Tasks

Image Classification
Assign a single label to the entire image.
Object Detection
Identify objects in an image and draw bounding boxes around them.
- Predict both what (label) and where (bounding box).
Semantic Segmentation
Label each pixel in the image with a class (e.g., cat, sky, grass).
Instance Segmentation
Like semantic segmentation but differentiates between instances of the same class.
Generative Modeling
- Style Transfer
- Diffusion models for image synthesis
- Text-to-image generation with vision-language models like DALL·E
3D Vision
- Reconstructing 3D shapes from images
- Mesh R-CNN, point cloud models
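The difference between semantic segmentation, instance segmentation, and detection is easiest to see in the shape of their outputs. The sketch below uses a hypothetical hand-written mask and separates instances with connected-component labeling. That is only a stand-in: real instance segmentation models learn the separation, and connected components would fail when two objects touch. The helper names `label_instances` and `bounding_box` are assumptions for this example.

```python
from collections import deque

# Hypothetical semantic mask: every pixel has a class (0 = background, 1 = "cat").
# Two separate cats share the same class label.
semantic_mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 1],
]


def label_instances(mask, target):
    """Give each 4-connected region of `target` pixels its own id (1, 2, ...)."""
    rows, cols = len(mask), len(mask[0])
    instances = [[0] * cols for _ in range(rows)]
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != target or instances[r][c]:
                continue
            next_id += 1
            instances[r][c] = next_id
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one region
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] == target and not instances[ny][nx]:
                        instances[ny][nx] = next_id
                        queue.append((ny, nx))
    return instances, next_id


def bounding_box(instances, instance_id):
    """Detection-style output: (x_min, y_min, x_max, y_max) for one instance."""
    coords = [(r, c) for r, row in enumerate(instances)
              for c, v in enumerate(row) if v == instance_id]
    ys = [r for r, _ in coords]
    xs = [c for _, c in coords]
    return (min(xs), min(ys), max(xs), max(ys))


instance_mask, count = label_instances(semantic_mask, target=1)
print(count)                           # number of separate "cat" regions
print(bounding_box(instance_mask, 1))  # box around the first cat
print(bounding_box(instance_mask, 2))  # box around the second cat
```

Classification would return a single label for the whole grid, the semantic mask gives one class per pixel, the instance mask additionally tells the two cats apart, and the bounding boxes are what a detector would report.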

Natural Language Processing (NLP)

What is NLP?

NLP is the field focused on enabling machines to understand, generate, and interact using human language. This involves processing languages like English, Mandarin, Arabic, etc., through:

Analysis: Text → meaning
Generation: Meaning → text
Acquisition: Learning meaning from data

Components of NLP Systems

Speech Recognition
Language Understanding
Dialogue Management
Information Retrieval
Text-to-Speech Synthesis

Challenges in NLP

Ambiguity: Words with multiple meanings (e.g., “bank”).
Scale: Enormous vocabulary sizes.
Sparsity: Rare or unseen word combinations.
Variation: Dialects, informal usage.
Expressivity: Subtle meanings and emotions.
Unmodeled variables: Context and world knowledge.

Evolution of NLP Approaches

Era	Approach
~1980s	Symbolic/Rule-Based
~1990s	Statistical NLP
~2010s	Deep Learning + Word Embeddings
Today	Pretrained LLMs & Transformers

Representing Meaning in NLP

From WordNet to Word Embeddings

Early approaches like WordNet relied on structured thesauri and synonym graphs.
Limitations: Incomplete, subjective, and manually intensive.

Distributional Semantics

“You shall know a word by the company it keeps.” – J.R. Firth
Words are represented by their co-occurring contexts.
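A minimal sketch of this idea is a co-occurrence count: each word is described by how often other words appear next to it. The toy corpus, the window size of 1, and the function name `cooccurrence_vectors` are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for this example.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
]


def cooccurrence_vectors(sentences, window=1):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    vectors[word][tokens[j]] += 1
    return vectors


vectors = cooccurrence_vectors(corpus)
print(dict(vectors["cat"]))  # contexts of "cat"
print(dict(vectors["dog"]))  # contexts of "dog"
```

In this corpus "cat" and "dog" share the contexts "the" and "drinks", so their count vectors overlap, which is the signal distributional methods use to treat them as related.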

Word2Vec

Learns dense word vectors by predicting the context words around a center word (the Skip-gram model); the CBOW variant instead predicts the center word from its context.
Replaces one-hot representations with semantic vectors:
- hotel ≈ motel
- good ≈ nice
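The sketch below illustrates two pieces of this: how Skip-gram turns a sentence into (center, context) training pairs, and why dense vectors can express similarity while one-hot vectors cannot. It does not train a model. The three-dimensional vectors are invented numbers, not real Word2Vec output, and the helper names are assumptions.

```python
import math


def skipgram_pairs(tokens, window=1):
    """List the (center, context) pairs a Skip-gram model would be trained on."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(skipgram_pairs("the quick brown fox".split()))

# One-hot vectors: every pair of distinct words is equally unrelated.
one_hot = {"hotel": [1, 0, 0], "motel": [0, 1, 0], "banana": [0, 0, 1]}
print(cosine(one_hot["hotel"], one_hot["motel"]))

# Hypothetical dense vectors (made-up values standing in for learned embeddings).
dense = {
    "hotel": [0.9, 0.1, 0.3],
    "motel": [0.8, 0.2, 0.3],
    "banana": [0.0, 0.9, -0.4],
}
print(cosine(dense["hotel"], dense["motel"]))   # close to 1
print(cosine(dense["hotel"], dense["banana"]))  # close to 0
```

With one-hot vectors the similarity between "hotel" and "motel" is exactly zero, whereas dense vectors can place them close together.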

Language Modeling

Goal: Predict the next word given a context.
n-gram models: Early statistical models (e.g., trigrams).
RNNs: Improved modeling of sequential dependencies, but suffer from vanishing gradients.
Transformers: Process all positions of a sequence in parallel during training and use self-attention to capture long-range dependencies directly.
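As a minimal sketch of the n-gram approach, the code below counts which word follows each context and turns the counts into maximum-likelihood probabilities. The toy corpus, the `<s>` and `</s>` padding tokens, and the function names are assumptions for illustration; practical n-gram models also add smoothing, which is omitted here.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for this example.
corpus = ["the cat sat", "the cat ran", "the dog sat"]


def train_ngram(sentences, n=2):
    """Count next words for each context of n-1 preceding words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Pad so the first real word also has a full context.
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            counts[context][tokens[i]] += 1
    return counts


def next_word_probs(counts, context):
    """Maximum-likelihood estimate of P(next word | context); {} if unseen."""
    following = counts.get(tuple(context), Counter())
    total = sum(following.values())
    return {word: c / total for word, c in following.items()}


bigram = train_ngram(corpus, n=2)
trigram = train_ngram(corpus, n=3)

print(next_word_probs(bigram, ["the"]))          # "cat" is twice as likely as "dog"
print(next_word_probs(trigram, ["the", "cat"]))  # "sat" and "ran" are equally likely
print(next_word_probs(trigram, ["dog", "ran"]))  # unseen context: no estimate
```

The empty result for the unseen context is the sparsity problem in miniature: a count-based model has nothing to say about word combinations it never observed.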

Key Takeaways

Machine Learning is central to both CV and NLP today.
Self-supervised pretraining followed by supervised fine-tuning is the dominant paradigm.
Cross-modal models unify vision and language, enabling tasks like image captioning, VQA, and multimodal generation.
Transformer-based models represent the state of the art across modalities: Large Language Models (LLMs) like ChatGPT for text, CLIP for joint image-text representations, and DALL·E for text-to-image generation.
