Clustering-based identification of clonally-related immunoglobulin gene sequence sets

Immunome Research

Table 2 Comparison of distance measures for clustering immunoglobulin gene variable sequences

Clustering method	Number of clusters below the threshold	Number of sequences in clusters below threshold	Number of clusters different from benchmark set	Number of incorrectly assigned sequences	Correctly clustered sequences (%)
(a) Expert inspection	67	184	4	16	95.1
(b) LD	117	364	71	182	50.0
(c) PNED	93	258	36	76	70.5
(d) NED	78	211	15	29	85.9
(e) NED_VJ	70	190	4	8	95.8

Sequences in the benchmark PNG dataset were clustered using the following five methods: (a) Expert inspection: visual inspection of the partitioned gene segments without automated clustering. (b) LD: automated clustering based on the pairwise Levenshtein distance between CDR3 sequences. (c) PNED: automated clustering based on the post-normalized edit distance, in which the Levenshtein distance of each sequence pair is normalized by the square root of the length of the longer sequence in the comparison. (d) NED: automated clustering based on the Normalized Edit Distance. (e) NED_VJ: automated clustering based on the Normalized Edit Distance, incorporating germline gene identity. A gap penalty of 3 was applied in each automated method. The resulting clusterings were evaluated relative to the “benchmark” clustering obtained by a combination of automated clustering and visual inspection.
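
The LD and PNED measures described in the legend can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `levenshtein` and `pned`, the mismatch cost of 1, and the example sequences are assumptions made for illustration. Only the gap penalty of 3 and the normalization by the square root of the longer sequence length come from the legend. NED and NED_VJ are not reproduced because the legend does not define their normalization.

```python
import math


def levenshtein(a: str, b: str, gap: int = 3, mismatch: int = 1) -> int:
    """Weighted Levenshtein distance between two sequences.

    gap=3 follows the table legend; mismatch=1 is an assumption,
    since the legend does not state the substitution cost.
    """
    # prev[j] holds the cost of aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]  # aligning a[:i] with an empty prefix of b
        for j, cb in enumerate(b, start=1):
            substitute = prev[j - 1] + (0 if ca == cb else mismatch)
            delete = prev[j] + gap        # gap opposite a[i-1]
            insert = curr[j - 1] + gap    # gap opposite b[j-1]
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


def pned(a: str, b: str, gap: int = 3, mismatch: int = 1) -> float:
    """Post-normalized edit distance: LD divided by sqrt(length of longer sequence)."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 0.0  # two empty sequences are treated as identical
    return levenshtein(a, b, gap, mismatch) / math.sqrt(longer)


if __name__ == "__main__":
    # Hypothetical CDR3-like nucleotide strings, not taken from the dataset.
    s1, s2 = "ACGT", "AGGT"
    print(levenshtein(s1, s2), pned(s1, s2))
```

Under these assumptions, a single mismatch costs less than a single gap, so sequences of equal length that differ by point mutations score as closer than sequences that differ by an insertion or deletion.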