

Data preparation is a critical step in transforming raw, messy real-world data into a format suitable for data mining algorithms. Poor preparation leads to unreliable models.

Key concepts: feature extraction, data portability, data cleaning, and dimensionality reduction.

Why Data Preparation?

  • Real-world data is often messy, inconsistent, and heterogeneous.
  • Preparation ensures data is:
    • Compatible with mining algorithms (numeric/structured format)
    • Clean and consistent (handle missing, incorrect, or scaled values)
    • Reduced in size/complexity while retaining useful information

Feature Extraction

  • Features are measurable properties used for analysis.
  • Raw data may be high-dimensional, not invariant to irrelevant variations, or hard to interpret directly.

High Dimensionality Challenges

  • This is commonly referred to as the curse of dimensionality.
  • For example, 100-dimensional data is mathematically manageable, but difficult to visualize.
  • In high-dimensional spaces:
    • Most of the volume lies near the corners of the space, so points spread far from the center.
    • Pairwise distances concentrate: every point looks almost equally far from every other, making distances deceptively uninformative.
  • This makes it harder to assess similarity or distance, posing challenges for accurate analysis (see the sketch after this list).
  • Therefore, reducing dimensionality is often beneficial.
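
As a quick illustration (a minimal NumPy sketch, not part of the original notes): with random points in a unit hypercube, the gap between the nearest and the farthest neighbour shrinks as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)

for dim in (2, 10, 100, 1000):
    points = rng.random((500, dim))   # 500 random points in the unit hypercube
    query = rng.random(dim)           # a random query point
    dists = np.linalg.norm(points - query, axis=1)
    # Relative contrast: how much farther the farthest neighbour is than the nearest
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:4d}  relative contrast = {contrast:.3f}")
```

The contrast collapses toward zero in high dimensions, which is why nearest-neighbour style reasoning degrades there.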

  • Application-specific:
    • Structured tables → Columns like age, salary
    • Images → Edges, textures, shapes
    • Text → Bag-of-Words, TF-IDF, embeddings
    • Time series → Frequency patterns, wavelet coefficients
  • Output is usually a feature vector for ML or DM algorithms.

Data Portability (Type Transformation)

Many algorithms require numeric inputs, so transforming data types is often necessary:

Common Type Conversions

  • Numeric → Categorical (Discretization)
    • Converts continuous data into intervals or bins
    • Example: Age → {Low, Mid, High} (see the pandas sketch after this list)
  • Categorical → Numeric (One-hot / Binarization)
    • Encodes each category as separate binary features
  • Text → Numeric
    • Represented through techniques like BoW, TF-IDF, and BPE
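
A minimal pandas sketch of the first two conversions; the column names, bin edges, and labels are made-up assumptions for illustration:

```python
import pandas as pd

# Toy table (columns and values are illustrative assumptions)
df = pd.DataFrame({"age": [15, 34, 52, 71],
                   "city": ["Paris", "Rome", "Paris", "Oslo"]})

# Numeric -> categorical: bin continuous ages into labelled intervals
df["age_group"] = pd.cut(df["age"], bins=[0, 25, 60, 120],
                         labels=["Low", "Mid", "High"])

# Categorical -> numeric: one-hot encode the city column as binary features
df = pd.get_dummies(df, columns=["city"])

print(df)
```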

Bag-of-Words (BoW)

  • Counts word occurrences in a document
  • Ignores grammar, syntax, and word order
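
A minimal scikit-learn sketch of BoW counting (the two example documents are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["data mining finds patterns", "data preparation cleans data"]

vectorizer = CountVectorizer()           # grammar and word order are ignored
counts = vectorizer.fit_transform(docs)  # sparse document-term count matrix

print(vectorizer.get_feature_names_out())
print(counts.toarray())
```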

TF-IDF (Term Frequency–Inverse Document Frequency)

  • Weighs words based on their frequency and rarity
  • Still disregards context and word order
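
A corresponding scikit-learn sketch for TF-IDF, using the same made-up documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["data mining finds patterns", "data preparation cleans data"]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)  # rows: documents, columns: TF-IDF weights per term

# Terms that appear in every document (e.g. "data") are down-weighted by the IDF factor
print(tfidf.get_feature_names_out())
print(weights.toarray().round(2))
```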

Byte-Pair Encoding (BPE)

  • Subword tokenization technique used in models like GPT
  • Splits rare or unknown words into frequent subword units
    • Example: unhappiness → ["un", "happi", "ness"]
  • Each subword token is mapped to a unique integer ID using a predefined vocabulary
    • Example: ["un", "happi", "ness"] → [502, 8143, 1092]
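
A toy sketch of the core BPE training step, repeatedly merging the most frequent adjacent symbol pair; this is an illustration only, not GPT's actual tokenizer, merge rules, or vocabulary:

```python
from collections import Counter

def bpe_merge_step(words):
    """One BPE training step: merge the most frequent adjacent symbol pair."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    if not pairs:
        return words
    best = max(pairs, key=pairs.get)
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                out.append(symbols[i] + symbols[i + 1])  # merge the best pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# Start from characters and repeatedly merge frequent pairs into subword units
corpus = [list("unhappiness"), list("happiness"), list("unhappy")]
for _ in range(6):
    corpus = bpe_merge_step(corpus)
print(corpus)
```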

Time Series → Numeric Conversion

Many machine learning models require numeric inputs, even for temporal data. Two common transformation techniques are:

Wavelet-Based Feature Extraction

  • Decomposes a time series into components at multiple scales (resolutions) and time positions (locations)
  • Captures both local and global features across time
  • Especially useful for non-stationary signals and localized events
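
A minimal sketch assuming the PyWavelets package (the notes do not prescribe a library): decompose a toy non-stationary signal and use the coefficients as a feature vector.

```python
import numpy as np
import pywt  # PyWavelets; assumed available for illustration

# Toy non-stationary signal: a slow oscillation with a short localized burst
t = np.linspace(0, 1, 256)
signal = np.sin(2 * np.pi * 3 * t)
signal[120:136] += np.sin(2 * np.pi * 40 * t[120:136])

# Multi-level discrete wavelet decomposition: approximation + detail bands per scale
coeffs = pywt.wavedec(signal, "db4", level=3)

# Flatten the coefficients into a single feature vector for a downstream model
features = np.concatenate(coeffs)
print([c.shape for c in coeffs], features.shape)
```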

Fourier-Based Feature Extraction

  • Represents a time series as a sum of sinusoidal components
  • Produces a vector of frequency-domain coefficients:
    • Magnitude: strength of each frequency
    • Phase: timing shift of each frequency component
  • Best suited for stationary signals and cyclic patterns
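
A minimal NumPy sketch of Fourier-based features for a toy cyclic signal (the signal and the number of coefficients kept are illustrative assumptions):

```python
import numpy as np

# Toy stationary signal: two cyclic components at 5 Hz and 12 Hz
t = np.linspace(0, 1, 256, endpoint=False)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

spectrum = np.fft.rfft(signal)   # frequency-domain coefficients
magnitude = np.abs(spectrum)     # strength of each frequency
phase = np.angle(spectrum)       # timing shift of each frequency component

# Keep the leading coefficients as a fixed-length feature vector
features = np.concatenate([magnitude[:16], phase[:16]])
print(features.shape)
```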

Data Cleaning

Real-world datasets often have missing, incorrect, or inconsistent values:

  • Handle Missing Data:
    • Discard, Impute (mean/neighbor), or Accept (use algorithms tolerant to missingness)
  • Handle Inconsistent Data:
    • Cross-check sources, use domain rules (e.g., City–Country mismatch), or apply outlier detection
  • Scaling & Normalization:
    • Standardization (Z-score): Mean 0, standard deviation 1 (works well for distance-based methods such as k-NN and SVM)
    • Min-Max Normalization: Scale to a fixed range such as [0, 1], useful for neural networks
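
A minimal scikit-learn sketch of the imputation and scaling steps above; the toy age/salary matrix is made up:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[25.0, 50_000.0],
              [32.0, np.nan],       # missing salary
              [47.0, 81_000.0],
              [np.nan, 62_000.0]])  # missing age

# Impute missing values with the column mean
X = SimpleImputer(strategy="mean").fit_transform(X)

# Standardization: mean 0, standard deviation 1 per feature
X_std = StandardScaler().fit_transform(X)

# Min-max normalization: rescale each feature to [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std.round(2))
print(X_minmax.round(2))
```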

Dimensionality Reduction (Data Reduction)

Simplifies data to reduce computation and avoid the curse of dimensionality:

  • Sampling: Random, stratified, or streaming (e.g., reservoir sampling)
  • Feature Selection:
    • Supervised: Based on label relevance (filter, wrapper, Lasso)
    • Unsupervised: Remove redundant features using clustering/separation
  • Feature Reduction with Axis Rotation:
    • PCA, SVD, LSA – Compress correlated features into fewer components (see the PCA sketch after this list)
  • Type Transformation for Compression:
    • Combine transformation with dimensionality reduction for compact representation
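
A minimal scikit-learn PCA sketch of the axis-rotation idea; the synthetic correlated data is an assumption for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data with correlated features: the last two columns are
# noisy copies of the first two, so the intrinsic dimensionality is ~2
base = rng.normal(size=(200, 2))
X = np.hstack([base, base + 0.05 * rng.normal(size=(200, 2))])

# Rotate the axes and keep only the components that explain most of the variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```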
