Data Processing
Data preparation is a critical step in transforming raw, messy real-world data into a format suitable for data mining algorithms. Poor preparation leads to unreliable models.
Key concepts: feature extraction, data type portability, data cleaning, and dimensionality reduction.
Why Data Preparation?
- Real-world data is often messy, inconsistent, and heterogeneous.
- Preparation ensures data is:
- Compatible with mining algorithms (numeric/structured format)
- Clean and consistent (handle missing, incorrect, or scaled values)
- Reduced in size/complexity while retaining useful information
Feature Extraction
- Features are measurable properties used for analysis.
- Raw data may be high-dimensional, non-invariant, or hard to interpret.
High Dimensionality Challenges
- The difficulties that arise as the number of dimensions grows are commonly referred to as the curse of dimensionality.
- For example, 100-dimensional data is mathematically manageable but nearly impossible to visualize.
- In high-dimensional spaces:
- Data points tend to concentrate near the corners of the space.
- Pairwise distances concentrate around a common value, so all points appear nearly equidistant.
- This makes it harder to assess similarity or distance, posing challenges for accurate analysis (see the illustration below).
- Therefore, reducing dimensionality is often beneficial.
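A quick numerical sketch of the distance-concentration effect, assuming only NumPy: as dimensionality grows, the gap between the nearest and farthest pairwise distances shrinks relative to the distances themselves.

```python
import numpy as np

rng = np.random.default_rng(0)

# As dimensionality grows, the relative gap between the nearest and
# farthest pairwise distances shrinks: points look nearly equidistant.
for d in (2, 10, 100, 1000):
    X = rng.random((100, d))                   # 100 uniform points in d dimensions
    diffs = X[:, None, :] - X[None, :, :]      # all pairwise difference vectors
    dists = np.linalg.norm(diffs, axis=-1)
    dists = dists[np.triu_indices(100, k=1)]   # keep each unordered pair once
    spread = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:4d}  relative spread of distances: {spread:.3f}")
```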
- Feature extraction is application-specific:
- Structured tables → Columns like age, salary
- Images → Edges, textures, shapes
- Text → Bag-of-Words, TF-IDF, embeddings
- Time series → Frequency patterns, wavelet coefficients
- Output is usually a feature vector for ML or DM algorithms.
Data Portability (Type Transformation)
Many algorithms require numeric inputs, so transforming data types is often necessary:
Common Type Conversions
- Numeric → Categorical (Discretization)
- Converts continuous data into intervals or bins (see the binning sketch after this list)
- Example: Age → {Low, Mid, High}
- Categorical → Numeric (One-hot / Binarization)
- Encodes each category as a separate binary feature (see the one-hot sketch after this list)
- Text → Numeric
- Represented through techniques like BoW, TF-IDF, and BPE
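A minimal binning sketch with pandas; the cut points and labels are illustrative assumptions, not fixed rules:

```python
import pandas as pd

ages = pd.Series([15, 22, 37, 45, 63, 71, 29])

# Bin continuous ages into intervals using hand-picked, illustrative cut points.
age_bins = pd.cut(ages, bins=[0, 25, 50, 120], labels=["Low", "Mid", "High"])
print(age_bins.tolist())  # ['Low', 'Low', 'Mid', 'Mid', 'High', 'High', 'Mid']
```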
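And a one-hot encoding sketch, also with pandas; the toy column and categories are invented:

```python
import pandas as pd

colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Each category becomes its own 0/1 indicator column.
one_hot = pd.get_dummies(colors, columns=["color"], dtype=int)
print(one_hot)
#    color_blue  color_green  color_red
# 0           0            0          1
# 1           0            1          0
# 2           1            0          0
# 3           0            1          0
```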
Bag-of-Words (BoW)
- Counts word occurrences in a document
- Ignores grammar, syntax, and word order
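A minimal bag-of-words sketch using scikit-learn's CountVectorizer; the two toy documents are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)         # sparse document-term count matrix

print(vectorizer.get_feature_names_out())  # ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
print(X.toarray())
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 1 2]]
```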
TF-IDF (Term Frequency–Inverse Document Frequency)
- Weighs words based on their frequency and rarity
- Still disregards context and word order
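A minimal TF-IDF sketch with scikit-learn's TfidfVectorizer, reusing the same invented documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Words shared by both documents (e.g., "the", "sat") receive lower weights;
# words unique to one document (e.g., "cat", "dog") receive higher weights.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

for word, score in zip(tfidf.get_feature_names_out(), X.toarray()[0]):
    print(f"{word:>4}: {score:.3f}")
```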
Byte-Pair Encoding (BPE)
- Subword tokenization technique used in models like GPT
- Splits rare or unknown words into frequent subword units
- Example: unhappiness → ["un", "happi", "ness"]
- Each subword token is mapped to a unique integer ID using a predefined vocabulary
- Example: ["un", "happi", "ness"] → [502, 8143, 1092]
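A toy sketch of the token-to-ID lookup step; the vocabulary and IDs below are hand-written to match the example above (real BPE vocabularies and merge rules are learned from corpus statistics):

```python
# Hypothetical, hand-written vocabulary; real BPE vocabularies are learned.
vocab = {"un": 502, "happi": 8143, "ness": 1092, "<unk>": 0}

tokens = ["un", "happi", "ness"]

# Map each subword token to its integer ID, falling back to <unk> if absent.
ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
print(ids)  # [502, 8143, 1092]
```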
- Time Series → Numeric Conversion
- Many machine learning models require numeric inputs, even for temporal data. Two common transformation techniques are:
Wavelet-Based Feature Extraction
- Decomposes a time series into components at multiple scales (resolutions) and time positions (locations)
- Captures both local and global features across time
- Especially useful for non-stationary signals and localized events
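A minimal sketch using the PyWavelets package (pywt); the toy signal, the wavelet family ("db4"), and the decomposition level are illustrative assumptions:

```python
import numpy as np
import pywt

# Toy signal: a slow oscillation plus a brief localized burst.
t = np.linspace(0, 1, 512)
signal = np.sin(2 * np.pi * 3 * t)
signal[200:210] += 2.0  # localized event

# Multi-level discrete wavelet decomposition: one coarse approximation
# array plus detail arrays at successively finer scales.
coeffs = pywt.wavedec(signal, "db4", level=4)

# Concatenate coefficients into a single feature vector.
features = np.concatenate(coeffs)
print([len(c) for c in coeffs], "->", features.shape)
```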
Fourier-Based Feature Extraction
- Represents a time series as a sum of sinusoidal components
- Produces a vector of frequency-domain coefficients:
- Magnitude: strength of each frequency
- Phase: timing shift of each frequency component
- Best suited for stationary signals and cyclic patterns
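A minimal sketch using NumPy's real FFT; the toy signal (two steady sinusoids) is invented to show how cyclic structure surfaces in the magnitude spectrum:

```python
import numpy as np

# Toy signal: two steady sinusoids (stationary, cyclic), sampled for 1 second.
t = np.linspace(0, 1, 256, endpoint=False)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)

# Real FFT yields one complex coefficient per frequency bin.
spectrum = np.fft.rfft(signal)

magnitude = np.abs(spectrum)    # strength of each frequency
phase = np.angle(spectrum)      # timing shift of each frequency component

# The two injected frequencies dominate the magnitude spectrum.
top = np.argsort(magnitude)[-2:]
print("dominant frequency bins:", sorted(top))  # [5, 20]
```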
Data Cleaning
Real-world datasets often have missing, incorrect, or inconsistent values:
- Handle Missing Data:
- Discard, Impute (mean or nearest neighbor), or Accept (use algorithms tolerant of missingness); see the imputation sketch after this list
- Handle Inconsistent Data:
- Cross-check sources, use domain rules (e.g., City–Country mismatch), or apply outlier detection
- Scaling & Normalization:
- Standardization (Z-score): mean 0, SD 1; good for distance-based methods such as k-NN and SVM
- Min-Max Normalization: scale to a fixed range such as [0, 1]; useful for neural networks (see the scaling sketch after this list)
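A minimal mean-imputation sketch with pandas; the column names and values are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 40, 35],
    "salary": [50_000, 62_000, None, 58_000],
})

# Fill each numeric column's missing values with that column's mean.
df_imputed = df.fillna(df.mean(numeric_only=True))
print(df_imputed)
```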
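And a minimal scaling sketch showing both transformations side by side with plain NumPy:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Standardization (Z-score): zero mean, unit standard deviation.
z = (x - x.mean()) / x.std()

# Min-max normalization: rescale to the range [0, 1].
mm = (x - x.min()) / (x.max() - x.min())

print(z)   # mean ~0, std ~1
print(mm)  # [0.   0.25 0.5  0.75 1.  ]
```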
Dimensionality Reduction (Data Reduction)
Simplifies data to reduce computation and avoid the curse of dimensionality:
- Sampling: Random, stratified, or streaming (e.g., reservoir sampling; see the sketch after this list)
- Feature Selection:
- Supervised: Based on label relevance (filter, wrapper, Lasso)
- Unsupervised: Remove redundant features using clustering/separation
- Feature Reduction with Axis Rotation:
- PCA, SVD, LSA – Compress correlated features into fewer components (see the PCA sketch after this list)
- Type Transformation for Compression:
- Combine transformation with dimensionality reduction for compact representation
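A sketch of reservoir sampling (the classic Algorithm R variant), which maintains a uniform random sample of fixed size k over a stream whose length is unknown in advance:

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return k items drawn uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)   # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)    # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5, seed=42))
```

Each pass keeps every item with probability k/n overall, so one linear scan suffices and the full stream never needs to fit in memory.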
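A minimal PCA sketch with scikit-learn; the toy data are invented so that three correlated features compress well into two components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Toy data: 3 features, but the third is nearly a copy of the first,
# so the data really vary along ~2 directions.
base = rng.normal(size=(100, 2))
X = np.column_stack([base[:, 0], base[:, 1],
                     base[:, 0] + 0.1 * rng.normal(size=100)])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)  # rotate axes, keep top-variance components

print(X.shape, "->", X_reduced.shape)          # (100, 3) -> (100, 2)
print(pca.explained_variance_ratio_.round(3))  # most variance in 2 components
```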