
Table 2 Comparison of distance measures for clustering immunoglobulin gene variable sequences

From: Clustering-based identification of clonally-related immunoglobulin gene sequence sets

| Clustering method | Number of clusters below the threshold | Number of sequences in clusters below the threshold | Number of clusters different from benchmark set | Number of incorrectly assigned sequences | Correctly clustered sequences (%) |
| --- | --- | --- | --- | --- | --- |
| (a) Expert inspection | 67 | 184 | 4 | 16 | 95.1 |
| (b) LD | 117 | 364 | 71 | 182 | 50.0 |
| (c) PNED | 93 | 258 | 36 | 76 | 70.5 |
| (d) NED | 78 | 211 | 15 | 29 | 85.9 |
| (e) NED_VJ | 70 | 190 | 4 | 8 | 95.8 |

  1. Sequences in the benchmark PNG dataset were clustered using the following five methods. (a) Expert inspection: visual inspection of the partitioned gene segments, without automated clustering. (b) LD: automated clustering based on the pairwise Levenshtein distance between CDR3 sequences. (c) PNED: automated clustering based on the post-normalized edit distance, in which the Levenshtein distance of each sequence pair is normalized by the square root of the length of the longer sequence in the comparison. (d) NED: automated clustering based on the Normalized Edit Distance. (e) NED_VJ: automated clustering based on the Normalized Edit Distance, incorporating germline gene identity. A gap penalty of 3 was applied in each automated method. The resulting clusterings were evaluated relative to the “benchmark” clustering obtained by a combination of automated clustering and visual inspection.
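The LD and PNED measures described in the footnote can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `levenshtein` and `pned`, the mismatch cost of 1, and the way the gap penalty enters the calculation (as the cost of each insertion or deletion) are all assumptions made for illustration. NED and NED_VJ are not covered here.

```python
import math


def levenshtein(a: str, b: str, gap: int = 1, mismatch: int = 1) -> int:
    """Weighted Levenshtein distance between two sequences.

    Assumption: `gap` is charged per inserted or deleted character and
    `mismatch` per substitution. With both set to 1 this is the classic
    Levenshtein distance.
    """
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]
        for j, cb in enumerate(b, start=1):
            substitute = prev[j - 1] + (0 if ca == cb else mismatch)
            delete = prev[j] + gap
            insert = curr[j - 1] + gap
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


def pned(a: str, b: str, gap: int = 1, mismatch: int = 1) -> float:
    """Post-normalized edit distance: the Levenshtein distance divided by
    the square root of the length of the longer sequence."""
    longer = max(len(a), len(b))
    if longer == 0:
        # Two empty sequences are treated as identical (an assumption).
        return 0.0
    return levenshtein(a, b, gap, mismatch) / math.sqrt(longer)
```

For example, under these assumptions `pned("ACGT", "ACGA")` divides one substitution by the square root of 4, and `levenshtein(x, y, gap=3)` mirrors the gap penalty of 3 mentioned in the footnote. A clustering step would then group sequences whose pairwise distance falls below a chosen threshold.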