
Similarity and distance measures are central to clustering, classification, and retrieval tasks.

The choice of measure strongly affects algorithm performance and depends on data type and application.

  • Similarity → higher = more alike.
  • Distance → lower = more alike.

Multidimensional Data

Quantitative Data

Lp Norms: The Lp norm defines a family of distance metrics parameterized by $p$, used to measure the distance between two vectors:

$x = (x_1, x_2, \dots, x_n), \quad y = (y_1, y_2, \dots, y_n)$

  • L1 (Manhattan): $\sum_{i=1}^{n} |x_i - y_i|$

  • L2 (Euclidean): $\sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

  • L∞ (Chebyshev): $\max_{i} |x_i - y_i|$

Lp Norm (Minkowski Distance)

$\left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$

This generalizes the above norms:

  • $p = 1$ → Manhattan
  • $p = 2$ → Euclidean
  • $p \to \infty$ → Chebyshev

Example calculation
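As a quick worked sketch of the formulas above (the vectors here are illustrative, not from the text):

```python
def minkowski(x, y, p):
    """General Lp (Minkowski) distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def chebyshev(x, y):
    """L-infinity norm: the limit of the Lp distance as p grows."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(minkowski(x, y, 1))  # Manhattan: |1-4| + |2-6| + |3-3| = 7.0
print(minkowski(x, y, 2))  # Euclidean: sqrt(9 + 16 + 0) = 5.0
print(chebyshev(x, y))     # Chebyshev: max(3, 4, 0) = 4.0
```

Note how the same pair of vectors yields a different distance under each norm; larger $p$ puts more weight on the single largest coordinate difference.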

Other considerations:

  • Domain relevance: weight features differently (application-specific).
  • High-dimensionality:
    • Distance concentration → all distances ≈ same.
    • Curse of dimensionality → feature selection / dimensionality reduction.
  • Locally irrelevant features: reduce similarity quality → use local weighting.
  • Match-based similarity: cosine similarity (direction not magnitude).
  • Data distribution impact: skewed/clustered data → distances misleading.
  • Nonlinear structures: ISOMAP handles manifolds.
  • Local distribution: kernel density or adaptive distances.
  • Computational issues: need indexing, approximation for large datasets.

Categorical Data

  • Simple Matching Coefficient (SMC):
    (# matches) / (total attributes)
  • Weighted versions if attributes differ in importance.
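Both variants can be sketched in a few lines (the example records are made up for illustration):

```python
def smc(a, b):
    """Simple Matching Coefficient: fraction of attributes with equal values."""
    matches = sum(1 for u, v in zip(a, b) if u == v)
    return matches / len(a)

def weighted_smc(a, b, w):
    """Weighted variant: attribute i contributes weight w[i] when it matches."""
    return sum(wi for u, v, wi in zip(a, b, w) if u == v) / sum(w)

# Two records over attributes (colour, size, in_stock): 2 of 3 match.
print(smc(["red", "M", "yes"], ["red", "L", "yes"]))
```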

Mixed Data

  • Combine quantitative + categorical (e.g., Gower’s similarity).
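A minimal sketch of the idea behind Gower's similarity, assuming the common convention of range-normalised closeness for numeric attributes and simple matching for categorical ones (attribute names and ranges here are invented for the example):

```python
def gower_similarity(x, y, kinds, ranges):
    """Per-attribute similarity averaged over mixed attributes.
    kinds[i] is 'num' or 'cat'; ranges[i] is the numeric attribute's
    range over the whole dataset (unused for 'cat' attributes)."""
    scores = []
    for xi, yi, kind, rng in zip(x, y, kinds, ranges):
        if kind == "num":
            scores.append(1 - abs(xi - yi) / rng)    # range-normalised closeness
        else:
            scores.append(1.0 if xi == yi else 0.0)  # simple matching
    return sum(scores) / len(scores)

# Two records (age, colour); assume age spans 0..50 in the dataset.
print(gower_similarity((30, "red"), (40, "red"), ("num", "cat"), (50, None)))
```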

Text Similarity

Representations:

  • Binary (word present/absent).
  • Set-based (unique words).
  • Frequency vectors (TF, TF-IDF).

Measures:

  • Cosine similarity: (X·Y) / (||X||·||Y||)
  • Jaccard coefficient: |X∩Y| / |X∪Y|
  • Hamming distance for binary strings.

Important: normalization (stopwords, stemming, weighting).
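The two set/vector measures above, sketched on toy documents (tokenisation here is a bare `split`; real pipelines would apply the normalization steps just mentioned):

```python
import math

def cosine(x, y):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def jaccard(a, b):
    """Jaccard coefficient between two sets of words."""
    return len(a & b) / len(a | b)

d1 = set("the cat sat on the mat".split())
d2 = set("the cat ate the rat".split())
print(jaccard(d1, d2))  # |{the, cat}| / |7-word union| = 2/7
```

Cosine compares direction only, so scaling a frequency vector (e.g. a document repeated twice) leaves the similarity unchanged.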

Temporal Similarity

Time-Series

Challenges: scaling, translation, noise.

Measures:

  • Euclidean distance: direct but sensitive.
  • Lp-norms: generalization.
  • Dynamic Time Warping (DTW): elastic alignment of sequences.
  • Window-based: sliding subsequence comparison.

Preprocessing: z-normalization, detrending.
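DTW's elastic alignment can be sketched with the classic dynamic-programming recurrence (the series below are invented to show the effect of a time shift):

```python
def dtw(s, t):
    """Dynamic Time Warping distance via the O(len(s)*len(t)) DP table.
    One point of s may align with several points of t (elastic alignment)."""
    n, m = len(s), len(t)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            # extend the cheapest of: diagonal match, insertion, deletion
            d[i][j] = cost + min(d[i - 1][j - 1], d[i - 1][j], d[i][j - 1])
    return d[n][m]

# A time-shifted copy of a series: pointwise Euclidean distance would
# penalise every position, but DTW warps the time axis and reports 0.
print(dtw([0, 1, 2, 3], [0, 0, 1, 2, 3]))  # 0.0
```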

Discrete Sequences (Strings)

  • Edit distance (Levenshtein): min # edits (insert, delete, substitute).
  • Longest Common Subsequence (LCS): length of max subsequence shared.
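Both string measures follow the same dynamic-programming pattern; a compact sketch:

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                               # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                               # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + sub)  # substitute / match
    return d[m][n]

def lcs_length(a, b):
    """Length of the longest common subsequence (not necessarily contiguous)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                d[i][j] = d[i - 1][j - 1] + 1
            else:
                d[i][j] = max(d[i - 1][j], d[i][j - 1])
    return d[m][n]

print(levenshtein("kitten", "sitting"))  # 3
print(lcs_length("ABCBDAB", "BDCABA"))   # 4 (e.g. "BCBA")
```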

Graph Similarity

Node-to-Node in One Graph

  • Structural distance: shortest path lengths.
  • Random walk similarity: probability of reaching one node from another.
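Structural distance is simplest to illustrate: shortest-path length by breadth-first search over an adjacency list (the small graph below is made up for the example):

```python
from collections import deque

def shortest_path_length(adj, src, dst):
    """Hop count of the shortest path from src to dst, via BFS."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return float("inf")  # dst unreachable from src

# Undirected 4-cycle: A-B, B-C, C-D, D-A
adj = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["A", "C"]}
print(shortest_path_length(adj, "A", "C"))  # 2 hops, via B or via D
```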

Between Two Graphs

  • Graph edit distance: # edits to transform one graph → another.
  • Spectral measures: compare eigenvalues/eigenvectors of adjacency matrices.

Supervised Similarity

  • Learn task-specific similarity using labeled data.
  • Metric learning: adapt distances (e.g., Mahalanobis metric learning).
  • Improves classification & clustering.
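The Mahalanobis form $\sqrt{(x-y)^\top M (x-y)}$ is easy to evaluate once a positive-definite matrix $M$ has been learned; the sketch below takes $M$ as given (the matrices are illustrative, not the output of any particular learner):

```python
import math

def mahalanobis(x, y, m):
    """Mahalanobis-style distance sqrt((x-y)^T M (x-y)) for a given
    positive-definite matrix M; metric-learning methods would fit M
    from labelled pairs instead of hand-picking it."""
    d = [a - b for a, b in zip(x, y)]
    md = [sum(m[i][j] * d[j] for j in range(len(d))) for i in range(len(d))]
    return math.sqrt(sum(di * mdi for di, mdi in zip(d, md)))

# With M = identity this reduces to plain Euclidean distance.
identity = [[1.0, 0.0], [0.0, 1.0]]
print(mahalanobis((0.0, 0.0), (3.0, 4.0), identity))  # 5.0

# A learned M can stretch one axis, penalising differences in feature 0 more.
stretched = [[4.0, 0.0], [0.0, 1.0]]
print(mahalanobis((0.0, 0.0), (3.0, 4.0), stretched))  # sqrt(36 + 16) ≈ 7.21
```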

Key Takeaways

  • No universal best measure → depends on data type & task.
  • Quantitative data: Lp norms, cosine, distribution-aware.
  • Categorical/mixed: matching coefficient, Gower’s measure.
  • Text: cosine, Jaccard, TF-IDF weighting.
  • Temporal: DTW, edit distance, subsequence matching.
  • Graph: structural paths, random walks, edit distance.
  • Supervised metric learning = powerful for specialized tasks.
  • High dimensions & irrelevant features are dangerous → reduce or weight features.
