Similarity and Distance between pairs of binary matrices

Multiple measures of similarity and distance between pairs of binary matrices, as listed in Choi et al. (2010).

similarity(
  M,
  ...,
  statistic,
  normalized = TRUE,
  firstonly = FALSE,
  include_self = FALSE,
  exclude_j = FALSE
)

# S3 method for list
similarity(
  M,
  ...,
  statistic,
  normalized = TRUE,
  firstonly = FALSE,
  include_self = FALSE,
  exclude_j = FALSE
)

# S3 method for matrix
similarity(
  M,
  ...,
  statistic,
  normalized = TRUE,
  firstonly = FALSE,
  include_self = FALSE,
  exclude_j = FALSE
)

Arguments

M	Either a list of matrices of size `n` (they need not be square), or a single matrix of size `n` (see details).
...	More matrices to be passed to the function.
statistic	Character. Name(s) of the similarity or distance index to be used.
normalized	Logical. When `TRUE`, it returns the normalized Hamming distance, which ranges between 0 and 1 (currently only used with `statistic="hamming"`).
firstonly	Logical. When `TRUE`, only the first matrix is compared to all the others.
include_self	Logical. When set to `TRUE`, the diagonal is considered in the calculations. Since most calculations are done in the context of social networks, the default is set to `FALSE`.
exclude_j	Logical. When `TRUE`, the comparison between matrices `i` and `j` is done after the `j`th row and column are removed from each.

Value

A matrix of size n*(n - 1)/2 by 2 + length(statistic), where columns 1 and 2 hold the ids of i and j, and the remaining columns are the corresponding distances/similarities.
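As a rough illustration of where the n*(n - 1)/2 rows come from, the sketch below enumerates all unordered pairs (i, j) with i < j using base R. The ordering it produces is consistent with the output shown in the Examples, but how the package enumerates pairs internally is an assumption here.

```r
# Enumerate every unordered pair (i, j), i < j, among n matrices
n <- 4
pairs <- t(utils::combn(n, 2))
colnames(pairs) <- c("i", "j")

pairs
nrow(pairs) == n * (n - 1) / 2
```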

Details

All of the available statistics are based on a 2x2 contingency matrix counting matches and mismatches between each pair of matrices (R1, R2).

		R2
		1	0
R1	1	a	b
	0	c	d
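The counts a, b, c and d can be tabulated with a few lines of base R. The helper below, `contingency_2x2`, is hypothetical and not part of the package; its `include_self` argument only mimics the behavior described under Arguments, and the two matrices are made-up inputs.

```r
# Hypothetical helper (not part of the package): tabulate the 2x2 counts
# for two binary matrices with the same dimensions.
contingency_2x2 <- function(R1, R2, include_self = FALSE) {
  stopifnot(all(dim(R1) == dim(R2)))
  # Cells to compare; the diagonal is dropped unless include_self is TRUE
  keep <- matrix(TRUE, nrow(R1), ncol(R1))
  if (!include_self) keep[row(keep) == col(keep)] <- FALSE
  x <- R1[keep]
  y <- R2[keep]
  c(
    a = sum(x == 1 & y == 1),  # tie present in both
    b = sum(x == 1 & y == 0),  # tie only in R1
    c = sum(x == 0 & y == 1),  # tie only in R2
    d = sum(x == 0 & y == 0)   # tie absent in both
  )
}

R1 <- matrix(c(0, 1, 0,
               0, 0, 1,
               1, 0, 0), nrow = 3, byrow = TRUE)
R2 <- matrix(c(0, 1, 1,
               0, 0, 0,
               1, 0, 0), nrow = 3, byrow = TRUE)

counts <- contingency_2x2(R1, R2)
counts
```

For 3x3 matrices without the diagonal, the four counts add up to 6 compared cells.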

The complete list of available statistics is given in the Similarity and Distance sections.

distance is just an alias for similarity.

Similarity

Jaccard (1): "sjaccard" or "jaccard"

Sørensen–Dice coefficient (2), Czekanowski (3), Nei & Li (5): "sdice" or "sczekanowsk" or "sneili"

3w-jaccard (4): "s3wjaccard"

Sokal & Michener (7): "ssokmich" or "sokmich"

Sokal & Sneath II (8): "ssoksne" or "soksne"

Rogers & Tanimoto (9): "roger&tanimoto" or "sroger&tanimoto"

Faith (10): "sfaith" or "faith"

Gower and Legendre (11): "sgl" or "gl"

Russell & Rao (14): "srusrao"

Tarwid (40): "starwid" or "tarwid".

Pearson & Heron 1 (54): "sph1" or "ph1" or "s14". This is also known as S14 in Gower and Legendre (1986).

For S14, the function follows Krackhardt (1990):

$$ S_{14} = \sqrt{\left(\frac{a}{(a + c)} - \frac{b}{(b + d)}\right)\times\left(\frac{a}{(a + b)} - \frac{c}{(c + d)}\right)} $$

As written, this is a statistic lying between 0 and 1. Note, however, that the Examples show negative values returned for "s14".
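To make the S14 expression concrete, the sketch below evaluates it directly and compares it with the algebraically equivalent form |ad - bc| divided by the square root of the product of the four margins. Both functions are hypothetical helpers written for illustration, and the counts are made up. Whether the package applies a sign to the result is not covered here.

```r
# S14 exactly as written in the formula above (hypothetical helper)
s14_formula <- function(a, b, c, d) {
  sqrt((a / (a + c) - b / (b + d)) * (a / (a + b) - c / (c + d)))
}

# Equivalent form: |ad - bc| over the root of the product of the margins
s14_margins <- function(a, b, c, d) {
  abs(a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))
}

v1 <- s14_formula(a = 2, b = 1, c = 1, d = 2)
v2 <- s14_margins(a = 2, b = 1, c = 1, d = 2)
all.equal(v1, v2)
```

The two factors inside the square root always share the sign of ad - bc, so their product is never negative.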

Dennis (44): "sdennis" or "dennis"

Yuleq (61): "syuleq"

Yuleq similarity (63): "syuleqw"

Michael (68): "smichael" or "michael"

$$ S_{Michael} = \frac{4(ad-bc)}{(a+d)^2 + (b+c)^2} $$

Dispersion (66): "sdisp" or "disp"

$$ S_{Dispersion} = \frac{ad - bc}{(a + b + c + d)^2} $$

Hamann (67): "shamann" or "hamann"

$$ S_{Hamann} = \frac{(a + d) - (b + c)}{a + b + c + d} $$
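The Michael, Dispersion and Hamann formulas translate almost literally into R. The function `sim_from_counts` below is a hypothetical helper, not the package's implementation, and the counts passed to it are made up.

```r
# Hypothetical helper: three of the statistics above, from the 2x2 counts
sim_from_counts <- function(a, b, c, d) {
  n <- a + b + c + d  # total number of compared cells
  list(
    michael    = 4 * (a * d - b * c) / ((a + d)^2 + (b + c)^2),
    dispersion = (a * d - b * c) / n^2,
    hamann     = ((a + d) - (b + c)) / n
  )
}

res <- sim_from_counts(a = 2, b = 1, c = 1, d = 2)
str(res)
```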

Goodman & Kruskal (69): "sgk" or "gk"

$$ S_{Goodman \& Kruskal} = \frac{\sigma - \sigma'}{2n - \sigma'} $$

where $\sigma = \max(a,b) + \max(c,d) + \max(a,c) + \max(b,d)$, $\sigma' = \max(a + c, b + d) + \max(a + b, c + d)$, and $n = a + b + c + d$.

Anderberg (70): "sanderberg" or "anderberg"

$$ S_{Anderberg} = \frac{\sigma - \sigma'}{2n} $$

where $\sigma$ and $\sigma'$ are defined as in (69).
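Since (69) and (70) share the same two sums, a small sketch can compute both at once. The helper `gk_anderberg` is hypothetical, the counts are made up, and n is assumed to be the total a + b + c + d.

```r
# Hypothetical helper: Goodman & Kruskal (69) and Anderberg (70)
gk_anderberg <- function(a, b, c, d) {
  n      <- a + b + c + d                                   # assumed total
  sigma  <- max(a, b) + max(c, d) + max(a, c) + max(b, d)
  sigma2 <- max(a + c, b + d) + max(a + b, c + d)           # sigma'
  list(
    gk        = (sigma - sigma2) / (2 * n - sigma2),
    anderberg = (sigma - sigma2) / (2 * n)
  )
}

res <- gk_anderberg(a = 3, b = 1, c = 0, d = 2)
str(res)
```

With these counts, sigma is 10, sigma' is 7 and n is 6.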

Peirce (73): "speirce" or "peirce"

$$ S_{Peirce} = \frac{ab + bc}{ab + 2bc + cd} $$

FScore (00): "fscore" or "sfscore"

For details on the fscore statistic, ask Kyosuke Tanaka.

Distance

Vari (23): "dvari" or "vari"

Size Difference (24): "dsizedif" or "sizedif"

Shape Difference (25): "dsphd" or "sphd"

Pattern Difference (26): "dpattdif" or "pattdif"

Hamming (15): "dhamming" or "hamming"

Mean Manhattan (20): "dmeanman" or "meamman"

$$ D_{Mean-manhattan} = \frac{b + c}{a + b + c + d} $$
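The Mean Manhattan distance is simply the share of compared cells on which the two matrices disagree. The sketch below uses a hypothetical helper and made-up counts. Reading the normalized Hamming distance as the same ratio is an assumption, although it is consistent with the value 0.1666667 = 1/6 reported in the Examples for 3x3 matrices without the diagonal.

```r
# Hypothetical helper: share of mismatching cells among all compared cells
mean_manhattan <- function(a, b, c, d) (b + c) / (a + b + c + d)

# One mismatch out of six off-diagonal cells of a 3x3 matrix
dist_one <- mean_manhattan(a = 2, b = 1, c = 0, d = 3)
dist_one

# Identical matrices have no mismatches
dist_zero <- mean_manhattan(a = 3, b = 0, c = 0, d = 3)
dist_zero
```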

Yuleq distance (62): "dyuleq"

References

Choi, S. S., Cha, S. H., & Tappert, C. C. (2010). A survey of binary similarity and distance measures. Journal of Systemics, Cybernetics and Informatics, 8(1), 43-48.

Krackhardt, D. (1990). Assessing the political landscape: Structure, cognition, and power in organizations. Administrative Science Quarterly, 342-369.

Gower, J. C., & Legendre, P. (1986). Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3(1), 5-48.

Examples

# Getting all networks of size 3
data(powerset03)

# We can compute it over the entire set
head(similarity(powerset03, statistic="s14"))
#>      i j        s14
#> [1,] 1 2  0.6324555
#> [2,] 1 3 -0.2000000
#> [3,] 1 4  0.6324555
#> [4,] 1 5  0.4472136
#> [5,] 1 6 -0.3162278
#> [6,] 1 7 -0.2000000

# Or over three matrices passed as separate arguments (three pairs)
head(similarity(powerset03[[1]], powerset03[[2]], powerset03[[3]], statistic="s14"))
#>      i j        s14
#> [1,] 1 2  0.6324555
#> [2,] 1 3 -0.2000000
#> [3,] 2 3  0.6324555

# We can compute multiple distances at the same time
ans <- similarity(powerset03, statistic=c("hamming", "dennis", "jaccard"))
head(ans)
#>      i j   hamming     dennis   jaccard
#> [1,] 1 2 0.1666667  1.6329932 0.5000000
#> [2,] 1 3 0.3333333 -0.5773503 0.0000000
#> [3,] 1 4 0.1666667  1.6329932 0.5000000
#> [4,] 1 5 0.3333333  1.0000000 0.3333333
#> [5,] 1 6 0.5000000 -0.8164966 0.0000000
#> [6,] 1 7 0.3333333 -0.5773503 0.0000000