similarity.Rd
Multiple measures of similarity and distance between pairs of binary matrices, as listed in Choi et al. (2010).
similarity(
  M,
  ...,
  statistic,
  normalized = TRUE,
  firstonly = FALSE,
  include_self = FALSE,
  exclude_j = FALSE
)

# S3 method for list
similarity(
  M,
  ...,
  statistic,
  normalized = TRUE,
  firstonly = FALSE,
  include_self = FALSE,
  exclude_j = FALSE
)

# S3 method for matrix
similarity(
  M,
  ...,
  statistic,
  normalized = TRUE,
  firstonly = FALSE,
  include_self = FALSE,
  exclude_j = FALSE
)
M | Either a list of matrices of size |
---|---|
... | More matrices to be passed to the function. |
statistic | Character. Name of the similarity index to use. |
normalized | Logical. When |
firstonly | Logical. When |
include_self | Logical. When set to |
exclude_j | Logical. When |
A matrix with n*(n - 1)/2 rows and 2 + length(statistic) columns, where columns 1 and 2 indicate the ids of i and j, and the remaining columns are the corresponding distances/similarities.
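To make the shape of the returned matrix concrete, the following base R sketch enumerates the unordered pairs (i, j) for n matrices. It does not call the package; `n` and `pairs` are illustrative names, and the row ordering is only assumed to match what `similarity()` returns, judging from the example output on this page.

```r
# Illustrative only: enumerate unordered pairs (i, j), with i < j,
# for n matrices. This mirrors the first two columns of the output.
n <- 4
pairs <- t(combn(n, 2))            # one row per pair
colnames(pairs) <- c("i", "j")

nrow(pairs) == n * (n - 1) / 2     # the number of pairs is n*(n - 1)/2
pairs
```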
All of the available statistics are based on a 2x2 contingency matrix counting matches and mismatches between each pair of matrices (R1, R2).
 | R2 = 1 | R2 = 0 |
---|---|---|
R1 = 1 | a | b |
R1 = 0 | c | d |
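As a rough illustration of how the four counts can be obtained, here is a minimal base R sketch. The helper `contingency()` is hypothetical and not part of the package; its `include_self` argument only imitates the documented argument of the same name, under the assumption that it controls whether diagonal cells are counted.

```r
# Hypothetical helper (not part of the package): count the cells of the
# 2x2 contingency table for two binary matrices of equal size.
contingency <- function(R1, R2, include_self = FALSE) {
  stopifnot(identical(dim(R1), dim(R2)))
  keep <- matrix(TRUE, nrow(R1), ncol(R1))
  if (!include_self) diag(keep) <- FALSE   # assumption: drop the diagonal
  x <- R1[keep]
  y <- R2[keep]
  c(a = sum(x == 1 & y == 1),   # tie present in both
    b = sum(x == 1 & y == 0),   # present in R1 only
    c = sum(x == 0 & y == 1),   # present in R2 only
    d = sum(x == 0 & y == 0))   # absent in both
}

R1 <- matrix(c(0, 1, 0,
               1, 0, 1,
               0, 0, 0), nrow = 3, byrow = TRUE)
R2 <- matrix(c(0, 1, 1,
               0, 0, 1,
               0, 0, 0), nrow = 3, byrow = TRUE)

contingency(R1, R2)   # a = 2, b = 1, c = 1, d = 2
```

With the diagonal excluded, the four counts add up to the 6 off-diagonal cells of a 3x3 matrix.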
The complete list of available statistics is given in the Similarity and Distance sections.
distance is just an alias for similarity.
Jaccard (1): "sjaccard" or "jaccard"
Sørensen–Dice coefficient (2), Czekanowski (3), Nei & Li (5): "sdice" or "sczekanowsk" or "sneili"
3w-jaccard (4): "s3wjaccard"
Sokal & Michener (7): "ssokmich" or "sokmich"
Sokal & Sneath II (8): "ssoksne" or "soksne"
Roger & Tanimoto (9): "roger&tanimoto" or "sroger&tanimoto"
Faith (10): "sfaith" or "faith"
Gower and Legendre (11): "sgl" or "gl"
Russell & Rao (14): "srusrao"
Tarwid (54): "starwid" or "tarwid".
Pearson & Heron 1 (54): "sph1" or "ph1" or "s14". This is also known as S14 in Gower and Legendre (1986).
In the case of the S14 function, following Krackhardt (1989), it is computed as:
$$ \sqrt{\left(\frac{a}{(a + c)} - \frac{b}{(b + d)}\right)\times\left(\frac{a}{(a + b)} - \frac{c}{(c + d)}\right)} $$
which is a statistic lying between 0 and 1.
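The expression above can be checked numerically with a small base R sketch. Both factors under the square root share the sign of ad - bc, so their product is never negative and the result equals the absolute value of the phi coefficient. The helpers `s14_formula()` and `phi()` are hypothetical names used only for this check; they are not the package's implementation, which may differ (for instance, the example output on this page includes negative values, which suggests the sign is kept there).

```r
# Hypothetical helper: the S14 expression exactly as written above.
s14_formula <- function(a, b, c, d) {
  sqrt((a / (a + c) - b / (b + d)) * (a / (a + b) - c / (c + d)))
}

# Phi coefficient, included only for comparison.
phi <- function(a, b, c, d) {
  (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))
}

# Positive association: both agree
s14_formula(2, 1, 1, 2)
phi(2, 1, 1, 2)

# Negative association: the formula returns the absolute value of phi
s14_formula(1, 2, 2, 1)
phi(1, 2, 2, 1)
```

Note that the expression is undefined (NaN) whenever a row or column total of the contingency table is zero, for example when one of the two matrices is empty.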
Dennis (44): "sdennis" or "dennis"
Yuleq (61): "syuleq"
Yuleq similarity (63): "syuleqw"
Michael (68): "smichael" or "michael"
$$ S_{michael} = \frac{4(ad-bc)}{(a+d)^2 + (b+c)^2} $$
Dispersion (66): "sdisp" or "disp"
$$ S_{Dispersion} = \frac{ad - bc}{(a + b + c + d)^2} $$
Hamann (67): "shamann" or "hamann"
$$ S_{Hamann} = \frac{(a + d) - (b + c)}{a + b + c + d} $$
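The Michael, Dispersion, and Hamann expressions translate directly into R. The sketch below is a plain transcription of those three formulas, with hypothetical function names (`s_michael()`, `s_disp()`, `s_hamann()`) that are not part of the package, and it takes the contingency counts a, b, c, d as inputs.

```r
# Hypothetical helpers transcribing the formulas above; inputs are the
# counts of the 2x2 contingency table.
s_michael <- function(a, b, c, d) {
  4 * (a * d - b * c) / ((a + d)^2 + (b + c)^2)
}

s_disp <- function(a, b, c, d) {
  (a * d - b * c) / (a + b + c + d)^2
}

s_hamann <- function(a, b, c, d) {
  ((a + d) - (b + c)) / (a + b + c + d)
}

# Example counts: a = 2, b = 1, c = 1, d = 2
s_michael(2, 1, 1, 2)   # 12/20
s_disp(2, 1, 1, 2)      # 3/36
s_hamann(2, 1, 1, 2)    # 2/6
```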
Goodman & Kruskal (69): "sgk" or "gk"
$$ S_{Goodman \& Kruskal} = \frac{\sigma - \sigma'}{2n - \sigma'} $$
where \(\sigma = \max(a,b) + \max(c,d) + \max(a,c) + \max(b,d)\), \(\sigma' = \max(a + c, b + d) + \max(a + b, c + d)\), and \(n = a + b + c + d\).
Anderberg (70): "sanderberg" or "anderberg"
$$ S_{Anderberg} = \frac{\sigma - \sigma'}{2n} $$
where \(\sigma\) and \(\sigma'\) are defined as in (69).
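Since the Goodman & Kruskal and Anderberg statistics share the same two sums, a small base R sketch may help to see how they relate. The function names are hypothetical, and n is assumed to be the total count a + b + c + d, following Choi et al. (2010); neither is taken from the package source.

```r
# Hypothetical helper: the two sums shared by (69) and (70).
# Assumption: n is the total number of compared cells, a + b + c + d.
sigmas <- function(a, b, c, d) {
  list(
    sigma  = max(a, b) + max(c, d) + max(a, c) + max(b, d),
    sigma1 = max(a + c, b + d) + max(a + b, c + d),   # sigma prime
    n      = a + b + c + d
  )
}

s_gk <- function(a, b, c, d) {
  s <- sigmas(a, b, c, d)
  (s$sigma - s$sigma1) / (2 * s$n - s$sigma1)
}

s_anderberg <- function(a, b, c, d) {
  s <- sigmas(a, b, c, d)
  (s$sigma - s$sigma1) / (2 * s$n)
}

# Example counts: a = 2, b = 1, c = 1, d = 2
# sigma = 8, sigma prime = 6, n = 6
s_gk(2, 1, 1, 2)          # (8 - 6) / (12 - 6)
s_anderberg(2, 1, 1, 2)   # (8 - 6) / 12
```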
Peirce (73): "speirce" or "peirce"
$$ S_{Peirce} = \frac{ab + bc}{ab + 2bc + cd} $$
In the case of fscore, ask Kyosuke Tanaka.
FScore (00): "fscore" or "sfscore"
Vari (23): "dvari" or "vari"
Size Difference (24): "dsizedif" or "sizedif"
Shape Difference (25): "dsphd" or "sphd"
Pattern Difference (26): "dpattdif" or "pattdif"
Hamming (15): "dhamming" or "hamming"
Mean Manhattan (20): "dmeanman" or "meamman"
$$ D_{Mean-manhattan} = \frac{b + c}{a + b + c + d} $$
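The Mean Manhattan distance is simply the share of compared cells on which the two matrices disagree. The base R sketch below illustrates this with a hypothetical helper, `d_meanman()`, and two made-up binary vectors standing in for the compared cells; it does not call the package.

```r
# Hypothetical helper transcribing the Mean Manhattan formula above.
d_meanman <- function(a, b, c, d) {
  (b + c) / (a + b + c + d)
}

# Example counts: a = 2, b = 1, c = 1, d = 2
d_meanman(2, 1, 1, 2)   # 2/6

# Equivalent view: proportion of cells where the two matrices differ.
# x and y are illustrative vectors with the same counts as above.
x <- c(1, 0, 1, 1, 0, 0)
y <- c(1, 1, 0, 1, 0, 0)
mean(x != y)            # 2/6
```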
Yuleq distance (62): "dyuleq"
Choi, S. S., Cha, S. H., & Tappert, C. C. (2010). A survey of binary similarity and distance measures. Journal of Systemics, Cybernetics and Informatics, 8(1), 43-48.
Krackhardt, D. (1990). Assessing the political landscape: Structure, cognition, and power in organizations. Administrative Science Quarterly, 342-369.
Gower, J. C., & Legendre, P. (1986). Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3(1), 5-48.
The statistics object contains a list with the available statistics for convenience.
# Getting all networks of size 3
data(powerset03)

# We can compute it over the entire set
head(similarity(powerset03, statistic="s14"))
#>      i j        s14
#> [1,] 1 2  0.6324555
#> [2,] 1 3 -0.2000000
#> [3,] 1 4  0.6324555
#> [4,] 1 5  0.4472136
#> [5,] 1 6 -0.3162278
#> [6,] 1 7 -0.2000000

# Or over a few matrices passed individually
head(similarity(powerset03[[1]], powerset03[[2]], powerset03[[3]], statistic="s14"))
#>      i j        s14
#> [1,] 1 2  0.6324555
#> [2,] 1 3 -0.2000000
#> [3,] 2 3  0.6324555

# We can compute multiple distances at the same time
ans <- similarity(powerset03, statistic=c("hamming", "dennis", "jaccard"))
head(ans)
#>      i j   hamming     dennis   jaccard
#> [1,] 1 2 0.1666667  1.6329932 0.5000000
#> [2,] 1 3 0.3333333 -0.5773503 0.0000000
#> [3,] 1 4 0.1666667  1.6329932 0.5000000
#> [4,] 1 5 0.3333333  1.0000000 0.3333333
#> [5,] 1 6 0.5000000 -0.8164966 0.0000000
#> [6,] 1 7 0.3333333 -0.5773503 0.0000000