Data Similarity and Distance
Similarity and distance measures are central to clustering, classification, and retrieval tasks.
The choice of measure strongly affects algorithm performance and depends on data type and application.
- Similarity → higher = more alike.
- Distance → lower = more alike.
Multidimensional Data
Quantitative Data
Lp Norms: The Lp norm defines a family of distance metrics parameterized by $p$, used to measure the distance between two vectors:
$x = (x_1, x_2, \dots, x_n), \quad y = (y_1, y_2, \dots, y_n)$
- L1 (Manhattan): $\sum_{i=1}^{n} |x_i - y_i|$
- L2 (Euclidean): $\sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
- L∞ (Chebyshev): $\max_{i} |x_i - y_i|$
Lp Norm (Minkowski Distance)
$\left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
This generalizes the above norms:
- $p$ = 1 → Manhattan
- $p$ = 2 → Euclidean
- $p \to \infty$ → Chebyshev
Example calculation
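A minimal worked example with NumPy (the vectors are toy values chosen for illustration, not from the notes):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
diff = np.abs(x - y)                # |x_i - y_i| = [3, 2, 0]

l1 = diff.sum()                     # Manhattan: 3 + 2 + 0 = 5
l2 = np.sqrt((diff ** 2).sum())     # Euclidean: sqrt(9 + 4) = sqrt(13) ≈ 3.606
linf = diff.max()                   # Chebyshev: 3
p = 3
lp = (diff ** p).sum() ** (1 / p)   # Minkowski, p = 3: (27 + 8)^(1/3) ≈ 3.271
```

Note that L1 ≥ L2 ≥ L∞ always holds, since larger $p$ puts more weight on the single biggest coordinate difference.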
Other considerations:
- Domain relevance: weight features differently (application-specific).
- High dimensionality:
  - Distance concentration → all pairwise distances become nearly equal.
  - Curse of dimensionality → apply feature selection / dimensionality reduction.
- Locally irrelevant features: reduce similarity quality → use local weighting.
- Match-based similarity: cosine similarity (direction not magnitude).
- Data distribution impact: skewed/clustered data → distances misleading.
- Nonlinear structures: ISOMAP handles manifolds.
- Local distribution: kernel density or adaptive distances.
- Computational issues: need indexing, approximation for large datasets.
Categorical Data
- Simple Matching Coefficient (SMC):
  (# matches) / (total attributes)
- Weighted versions exist if attributes differ in importance.
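The SMC is straightforward to compute; a sketch with hypothetical categorical records:

```python
def smc(a, b):
    """Simple Matching Coefficient: fraction of attributes that agree."""
    assert len(a) == len(b)
    matches = sum(ai == bi for ai, bi in zip(a, b))
    return matches / len(a)

# Two records over attributes (color, size, shape) -- toy data
r1 = ["red", "small", "round"]
r2 = ["red", "large", "round"]
smc(r1, r2)  # 2 matches out of 3 attributes → 0.666...
```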
Mixed Data
- Combine quantitative + categorical (e.g., Gower’s similarity).
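Gower's idea is to score each attribute on a 0–1 scale (numeric attributes via range-normalized difference, categorical via exact match) and then average. A minimal unweighted sketch, with made-up records:

```python
def gower_similarity(x, y, ranges):
    """Gower's similarity for mixed records.
    ranges[i] is the observed range of numeric attribute i, or None if categorical."""
    scores = []
    for xi, yi, r in zip(x, y, ranges):
        if r is None:                          # categorical: 1 if exact match, else 0
            scores.append(1.0 if xi == yi else 0.0)
        else:                                  # numeric: 1 - normalized abs difference
            scores.append(1.0 - abs(xi - yi) / r)
    return sum(scores) / len(scores)

# attributes: age (numeric, range 50), color (categorical) -- toy data
gower_similarity((25, "red"), (35, "blue"), ranges=[50, None])  # (0.8 + 0.0) / 2 = 0.4
```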
Text Similarity
Representations:
- Binary (word present/absent).
- Set-based (unique words).
- Frequency vectors (TF, TF-IDF).
Measures:
- Cosine similarity:
  (X·Y) / (||X||·||Y||)
- Jaccard coefficient:
  |X∩Y| / |X∪Y|
- Hamming distance for binary strings.
Important: normalization (stopwords, stemming, weighting).
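Both measures are a few lines of code; a sketch over toy term-frequency vectors and word sets (vocabulary invented for illustration):

```python
import math

def cosine(x, y):
    """Cosine similarity: (X·Y) / (||X|| ||Y||) -- compares direction, not magnitude."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def jaccard(A, B):
    """Jaccard coefficient on word sets: |A ∩ B| / |A ∪ B|."""
    return len(A & B) / len(A | B)

# term-frequency vectors over vocabulary ["data", "mining", "fun"]
d1 = [2, 1, 0]
d2 = [1, 1, 1]
cosine(d1, d2)                                # 3 / sqrt(15) ≈ 0.775
jaccard({"data", "mining"}, {"data", "fun"})  # 1 / 3
```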
Temporal Similarity
Time-Series
Challenges: scaling, translation, noise.
Measures:
- Euclidean distance: direct point-by-point comparison, but sensitive to shifts, scaling, and noise.
- Lp-norms: generalization.
- Dynamic Time Warping (DTW): elastic alignment of sequences.
- Window-based: sliding subsequence comparison.
Preprocessing: z-normalization, detrending.
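DTW's elastic alignment can be sketched with the standard O(nm) dynamic program (full DP table, no warping-window constraint):

```python
import math

def dtw(s, t):
    """Dynamic Time Warping distance with |a - b| as the local cost."""
    n, m = len(s), len(t)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            # extend the cheapest of: match, repeat s[i-1], repeat t[j-1]
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]

# the same peak shifted in time aligns at zero cost,
# whereas Euclidean distance would penalize the shift
dtw([0, 0, 1, 2, 1, 0], [0, 1, 2, 1, 0, 0])  # 0.0
```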
Discrete Sequences (Strings)
- Edit distance (Levenshtein): min # edits (insert, delete, substitute).
- Longest Common Subsequence (LCS): length of max subsequence shared.
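Edit distance is the classic dynamic program over insert/delete/substitute operations; a sketch:

```python
def levenshtein(a, b):
    """Levenshtein edit distance: min # of inserts, deletes, substitutions."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                           # delete all of a[:i]
    for j in range(m + 1):
        D[0][j] = j                           # insert all of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,      # delete
                          D[i][j - 1] + 1,      # insert
                          D[i - 1][j - 1] + sub)  # substitute / match
    return D[n][m]

levenshtein("kitten", "sitting")  # 3
```

LCS uses the same DP table shape with a different recurrence (extend on match, otherwise take the max of the two neighbors).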
Graph Similarity
Node-to-Node in One Graph
- Structural distance: shortest path lengths.
- Random walk similarity: probability of reaching one node from another.
Between Two Graphs
- Graph edit distance: # edits to transform one graph → another.
- Spectral measures: compare eigenvalues/eigenvectors of adjacency matrices.
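For node-to-node structural distance, shortest-path length on an unweighted graph is just a BFS hop count; a sketch on a toy adjacency list:

```python
from collections import deque

def shortest_path_length(adj, src, dst):
    """Structural distance between two nodes: BFS hop count (unweighted graph)."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return float("inf")  # dst unreachable from src

# small undirected path graph: 0 - 1 - 2 - 3
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
shortest_path_length(adj, 0, 3)  # 3
```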
Supervised Similarity
- Learn task-specific similarity using labeled data.
- Metric learning: adapt distances (e.g., Mahalanobis metric learning).
- Improves classification & clustering.
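The Mahalanobis distance generalizes Euclidean distance with a learned (or estimated) matrix $\Sigma^{-1}$; metric learning fits this matrix from labeled data. A sketch with a fixed covariance rather than a learned one:

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance: sqrt((x - y)^T Σ^{-1} (x - y))."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

# with Σ = I this reduces to plain Euclidean distance
cov = np.eye(2)
mahalanobis([0, 0], [3, 4], cov)  # 5.0
```

With a non-identity $\Sigma$, directions of high variance are down-weighted, which is exactly the knob metric learning turns.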
Key Takeaways
- No universal best measure → depends on data type & task.
- Quantitative data: Lp norms, cosine, distribution-aware.
- Categorical/mixed: matching coefficient, Gower’s measure.
- Text: cosine, Jaccard, TF-IDF weighting.
- Temporal: DTW, edit distance, subsequence matching.
- Graph: structural paths, random walks, edit distance.
- Supervised metric learning = powerful for specialized tasks.
- High dimensions & irrelevant features are dangerous → reduce or weight features.