Computer Vision and Natural Language Processing
Notes from the summary of COMP3010 Lecture 11, labs, and online resources.
Overview
Computer Vision (CV) and Natural Language Processing (NLP) are at the forefront of artificial intelligence (AI). They are driving many recent advances in areas like self-supervised learning, generative models, and cross-modal architectures that integrate vision and language.
Computer Vision (CV)
What is Computer Vision?
Computer Vision is the study of building algorithms that enable machines to perceive and understand visual data—such as images and videos—in a way similar to humans.
“Every picture tells a story.”
We want computers to extract and understand that story.
Challenges in Computer Vision
- Semantic Gap: Images are numerical pixel grids; interpreting them as semantic entities is non-trivial.
- Viewpoint Variation: Same object looks different from different angles.
- Intraclass Variation: Objects from the same class may look vastly different.
- Fine-grained Categories: Subtle differences between visually similar categories.
- Background Clutter: Irrelevant information in images that confuses the model.
- Illumination Changes: Light can drastically alter how objects appear.
- Deformation: Objects can change shape (e.g., animals, clothing).
- Occlusion: Parts of objects can be hidden from view.
Core CV Tasks
-
Image Classification
Assign a single label to the entire image. - Object Detection
Identify objects in an image and draw bounding boxes around them.- Predict both what (label) and where (bounding box).
-
Semantic Segmentation
Label each pixel in the image with a class (e.g., cat, sky, grass). -
Instance Segmentation
Like semantic segmentation but differentiates between instances of the same class. - Generative Modeling
- Style Transfer
- Diffusion models for image synthesis
- Vision-language models like DALL·E
- 3D Vision
- Reconstructing 3D shapes from images
- Mesh R-CNN, point cloud models
Natural Language Processing (NLP)
What is NLP?
NLP is the field focused on enabling machines to understand, generate, and interact using human language. This involves processing languages like English, Mandarin, Arabic, etc., through:
- Analysis: Text → meaning
- Generation: Meaning → text
- Acquisition: Learning meaning from data
Components of NLP Systems
- Speech Recognition
- Language Understanding
- Dialogue Management
- Information Retrieval
- Text-to-Speech Synthesis
Challenges in NLP
- Ambiguity: Words with multiple meanings (e.g., “bank”).
- Scale: Enormous vocabulary sizes.
- Sparsity: Rare or unseen word combinations.
- Variation: Dialects, informal usage.
- Expressivity: Subtle meanings and emotions.
- Unmodeled variables: Context and world knowledge.
Evolution of NLP Approaches
| Era | Approach |
|---|---|
| ~1980s | Symbolic/Rule-Based |
| ~1990s | Statistical NLP |
| ~2010s | Deep Learning + Word Embeddings |
| Today | Pretrained LLMs & Transformers |
Representing Meaning in NLP
From WordNet to Word Embeddings
- Early approaches like WordNet relied on structured thesauri and synonym graphs.
- Limitations: Incomplete, subjective, and manually intensive.
Distributional Semantics
- “You shall know a word by the company it keeps.” – J.R. Firth
- Words are represented by their co-occurring contexts.
Word2Vec
- Learns dense word vectors by predicting context words (Skip-gram model).
- Replaces one-hot representations with semantic vectors:
- hotel ≈ motel
- good ≈ nice
Language Modeling
- Goal: Predict the next word given a context.
- n-gram models: Early statistical models (e.g., trigrams).
- RNNs: Improved modeling of sequential dependencies, but suffer from vanishing gradients.
- Transformers: Fully parallel, capture long-range dependencies efficiently.
Key Takeaways
- Machine Learning is central to both CV and NLP today.
- Self-supervised pretraining followed by supervised fine-tuning is the dominant paradigm.
- Cross-modal models unify vision and language, enabling tasks like image captioning, VQA, and multimodal generation.
- Transformers and Large Language Models (LLMs) like ChatGPT, CLIP, and DALL·E represent the state-of-the-art across modalities.
Leave a comment