Table 2 Comparison of distance measures for clustering immunoglobulin gene variable sequences

From: Clustering-based identification of clonally-related immunoglobulin gene sequence sets

| Clustering method | Number of clusters below the threshold | Number of sequences in clusters below the threshold | Number of clusters different from benchmark set | Number of incorrectly assigned sequences | Correctly clustered sequences (%) |
|---|---|---|---|---|---|
| (a) Expert inspection | 67 | 184 | 4 | 16 | 95.1 |
| (b) LD | 117 | 364 | 71 | 182 | 50.0 |
| (c) PNED | 93 | 258 | 36 | 76 | 70.5 |
| (d) NED | 78 | 211 | 15 | 29 | 85.9 |
| (e) NED_VJ | 70 | 190 | 4 | 8 | 95.8 |
  1. Sequences in the benchmark PNG dataset were clustered using the following five methods. (a) Expert inspection: visual inspection of the partitioned gene segments, without automated clustering. (b) LD: automated clustering based on the pairwise Levenshtein distance between CDR3 sequences. (c) PNED: automated clustering based on the post-normalized edit distance, in which the Levenshtein distance of each sequence pair is normalized by the square root of the length of the longer sequence in the comparison. (d) NED: automated clustering based on the Normalized Edit Distance. (e) NED_VJ: automated clustering based on the Normalized Edit Distance, incorporating germline gene identity. A gap penalty of 3 was applied in each automated method. The resulting clusterings were evaluated relative to the “benchmark” clustering obtained by a combination of automated clustering and visual inspection.
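
The LD and PNED measures described in the footnote can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the gap penalty of 3 is the cost of each insertion or deletion and that a substitution costs 1 (the footnote does not state the substitution cost). The function names `levenshtein` and `pned` and the example sequences are hypothetical. NED and NED_VJ are not covered here.

```python
import math


def levenshtein(a: str, b: str, gap: int = 3, mismatch: int = 1) -> int:
    """Weighted edit distance between two sequences.

    Assumption: each insertion or deletion costs `gap` and each
    substitution costs `mismatch`; the source only states the gap penalty.
    """
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]
        for j, cb in enumerate(b, start=1):
            substitute = prev[j - 1] + (0 if ca == cb else mismatch)
            delete = prev[j] + gap
            insert = curr[j - 1] + gap
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


def pned(a: str, b: str, gap: int = 3, mismatch: int = 1) -> float:
    """Post-normalized edit distance: the edit distance divided by the
    square root of the length of the longer sequence."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 0.0  # two empty sequences are treated as identical
    return levenshtein(a, b, gap, mismatch) / math.sqrt(longer)


if __name__ == "__main__":
    # Hypothetical CDR3-like nucleotide fragments, for illustration only.
    s1, s2 = "ACGT", "ACCT"
    print(levenshtein(s1, s2), pned(s1, s2))
```

Dividing by the square root of the longer length makes a fixed number of edits count for less between long CDR3 sequences than between short ones, whereas the raw LD applies the same threshold at every length.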