

Data preparation is a critical step in transforming raw, messy real-world data into a format suitable for data mining algorithms. Poor preparation leads to unreliable models.

Key concepts: feature extraction, data portability, data cleaning, and dimensionality reduction.

Why Data Preparation?

  • Real-world data is often messy, inconsistent, and heterogeneous.
  • Preparation ensures data is:
    • Compatible with mining algorithms (numeric/structured format)
    • Clean and consistent (handle missing, incorrect, or scaled values)
    • Reduced in size/complexity while retaining useful information

Feature Extraction

  • Features are measurable properties used for analysis.
  • Raw data may be high-dimensional, not invariant to irrelevant variations, or hard to interpret directly.

High Dimensionality Challenges

  • This is commonly referred to as the curse of dimensionality.
  • For example, 100-dimensional data is mathematically manageable, but difficult to visualize.
  • In high-dimensional spaces:
    • Most of the volume lies near the corners of the space, so points spread far from the center.
    • Pairwise distances concentrate: every point looks almost equally far from every other, making distances deceptively uninformative.
  • This makes it harder to assess similarity or distance, posing challenges for accurate analysis (see the sketch after this list).
  • Therefore, reducing dimensionality is often beneficial.
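
As a quick illustration (a minimal NumPy sketch, not part of the original notes): with random points in a unit hypercube, the gap between the nearest and the farthest neighbour shrinks as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)

for dim in (2, 10, 100, 1000):
    points = rng.random((500, dim))   # 500 random points in the unit hypercube
    query = rng.random(dim)           # a random query point
    dists = np.linalg.norm(points - query, axis=1)
    # Relative contrast: how much farther the farthest neighbour is than the nearest
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:4d}  relative contrast = {contrast:.3f}")
```

The contrast collapses toward zero in high dimensions, which is why nearest-neighbour style reasoning degrades there.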

  • Application-specific:
    • Structured tables → Columns like age, salary
    • Images → Edges, textures, shapes
    • Text → Bag-of-Words, TF-IDF, embeddings
    • Time series → Frequency patterns, wavelet coefficients
  • Output is usually a feature vector for ML or DM algorithms.

Data Portability (Type Transformation)

Many algorithms require numeric inputs, so transforming data types is often necessary:

Common Type Conversions

  • Numeric → Categorical (Discretization)
    • Converts continuous data into intervals or bins
    • Example: Age → {Low, Mid, High} (see the pandas sketch after this list)
  • Categorical → Numeric (One-hot / Binarization)
    • Encodes each category as separate binary features
  • Text → Numeric
    • Represented through techniques like BoW, TF-IDF, and BPE
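
A minimal pandas sketch of the first two conversions; the column names, bin edges, and labels are made-up assumptions for illustration:

```python
import pandas as pd

# Toy table (columns and values are illustrative assumptions)
df = pd.DataFrame({"age": [15, 34, 52, 71],
                   "city": ["Paris", "Rome", "Paris", "Oslo"]})

# Numeric -> categorical: bin continuous ages into labelled intervals
df["age_group"] = pd.cut(df["age"], bins=[0, 25, 60, 120],
                         labels=["Low", "Mid", "High"])

# Categorical -> numeric: one-hot encode the city column as binary features
df = pd.get_dummies(df, columns=["city"])

print(df)
```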

Bag-of-Words (BoW)

  • Counts word occurrences in a document
  • Ignores grammar, syntax, and word order
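
A minimal scikit-learn sketch of BoW counting (the two example documents are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["data mining finds patterns", "data preparation cleans data"]

vectorizer = CountVectorizer()           # grammar and word order are ignored
counts = vectorizer.fit_transform(docs)  # sparse document-term count matrix

print(vectorizer.get_feature_names_out())
print(counts.toarray())
```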

TF-IDF (Term Frequency–Inverse Document Frequency)

  • Weighs words based on their frequency and rarity
  • Still disregards context and word order
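
A corresponding scikit-learn sketch for TF-IDF, using the same made-up documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["data mining finds patterns", "data preparation cleans data"]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)  # rows: documents, columns: TF-IDF weights per term

# Terms that appear in every document (e.g. "data") are down-weighted by the IDF factor
print(tfidf.get_feature_names_out())
print(weights.toarray().round(2))
```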

Byte-Pair Encoding (BPE)

  • Subword tokenization technique used in models like GPT
  • Splits rare or unknown words into frequent subword units
    • Example: unhappiness → ["un", "happi", "ness"]
  • Each subword token is mapped to a unique integer ID using a predefined vocabulary
    • Example: ["un", "happi", "ness"] → [502, 8143, 1092]
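
A toy sketch of the core BPE training step, repeatedly merging the most frequent adjacent symbol pair; this is an illustration only, not GPT's actual tokenizer, merge rules, or vocabulary:

```python
from collections import Counter

def bpe_merge_step(words):
    """One BPE training step: merge the most frequent adjacent symbol pair."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    if not pairs:
        return words
    best = max(pairs, key=pairs.get)
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                out.append(symbols[i] + symbols[i + 1])  # merge the best pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# Start from characters and repeatedly merge frequent pairs into subword units
corpus = [list("unhappiness"), list("happiness"), list("unhappy")]
for _ in range(6):
    corpus = bpe_merge_step(corpus)
print(corpus)
```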

Time Series → Numeric Conversion

Many machine learning models require numeric inputs, even for temporal data. Two common transformation techniques are:

Wavelet-Based Feature Extraction

  • Decomposes a time series into components at multiple scales (resolutions) and time positions (locations)
  • Captures both local and global features across time
  • Especially useful for non-stationary signals and localized events
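
A minimal sketch assuming the PyWavelets package (the notes do not prescribe a library): decompose a toy non-stationary signal and use the coefficients as a feature vector.

```python
import numpy as np
import pywt  # PyWavelets; assumed available for illustration

# Toy non-stationary signal: a slow oscillation with a short localized burst
t = np.linspace(0, 1, 256)
signal = np.sin(2 * np.pi * 3 * t)
signal[120:136] += np.sin(2 * np.pi * 40 * t[120:136])

# Multi-level discrete wavelet decomposition: approximation + detail bands per scale
coeffs = pywt.wavedec(signal, "db4", level=3)

# Flatten the coefficients into a single feature vector for a downstream model
features = np.concatenate(coeffs)
print([c.shape for c in coeffs], features.shape)
```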

Fourier-Based Feature Extraction

  • Represents a time series as a sum of sinusoidal components
  • Produces a vector of frequency-domain coefficients:
    • Magnitude: strength of each frequency
    • Phase: timing shift of each frequency component
  • Best suited for stationary signals and cyclic patterns
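
A minimal NumPy sketch of Fourier-based features for a toy cyclic signal (the signal and the number of coefficients kept are illustrative assumptions):

```python
import numpy as np

# Toy stationary signal: two cyclic components at 5 Hz and 12 Hz
t = np.linspace(0, 1, 256, endpoint=False)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

spectrum = np.fft.rfft(signal)   # frequency-domain coefficients
magnitude = np.abs(spectrum)     # strength of each frequency
phase = np.angle(spectrum)       # timing shift of each frequency component

# Keep the leading coefficients as a fixed-length feature vector
features = np.concatenate([magnitude[:16], phase[:16]])
print(features.shape)
```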

Data Cleaning

Real-world datasets often have missing, incorrect, or inconsistent values:

  • Handle Missing Data:
    • Discard, Impute (mean/neighbor), or Accept (use algorithms tolerant to missingness)
  • Handle Inconsistent Data:
    • Cross-check sources, use domain rules (e.g., City–Country mismatch), or apply outlier detection
  • Scaling & Normalization:
    • Standardization (Z-score): Mean 0, standard deviation 1 (works well for distance-based methods such as k-NN and SVM)
    • Min-Max Normalization: Scale to a fixed range such as [0, 1], useful for neural networks
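
A minimal scikit-learn sketch of the imputation and scaling steps above; the toy age/salary matrix is made up:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[25.0, 50_000.0],
              [32.0, np.nan],       # missing salary
              [47.0, 81_000.0],
              [np.nan, 62_000.0]])  # missing age

# Impute missing values with the column mean
X = SimpleImputer(strategy="mean").fit_transform(X)

# Standardization: mean 0, standard deviation 1 per feature
X_std = StandardScaler().fit_transform(X)

# Min-max normalization: rescale each feature to [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std.round(2))
print(X_minmax.round(2))
```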

Dimensionality Reduction (Data Reduction)

Simplifies data to reduce computation and avoid the curse of dimensionality:

  • Sampling: Random, stratified, or streaming (e.g., reservoir sampling)
  • Feature Selection:
    • Supervised: Based on label relevance (filter, wrapper, Lasso)
    • Unsupervised: Remove redundant features using clustering/separation
  • Feature Reduction with Axis Rotation:
    • PCA, SVD, LSA – Compress correlated features into fewer components (see the PCA sketch after this list)
  • Type Transformation for Compression:
    • Combine transformation with dimensionality reduction for compact representation
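
A minimal scikit-learn PCA sketch of the axis-rotation idea; the synthetic correlated data is an assumption for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data with correlated features: the last two columns are
# noisy copies of the first two, so the intrinsic dimensionality is ~2
base = rng.normal(size=(200, 2))
X = np.hstack([base, base + 0.05 * rng.normal(size=(200, 2))])

# Rotate the axes and keep only the components that explain most of the variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```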
