Multiple measures of similarity and distance between pairs of binary matrices, as listed in Choi et al. (2010).

similarity(
  M,
  ...,
  statistic,
  normalized = TRUE,
  firstonly = FALSE,
  include_self = FALSE,
  exclude_j = FALSE
)

# S3 method for list
similarity(
  M,
  ...,
  statistic,
  normalized = TRUE,
  firstonly = FALSE,
  include_self = FALSE,
  exclude_j = FALSE
)

# S3 method for matrix
similarity(
  M,
  ...,
  statistic,
  normalized = TRUE,
  firstonly = FALSE,
  include_self = FALSE,
  exclude_j = FALSE
)

Arguments

M

Either a list of matrices of size n (they need not be square), or a single matrix of size n (see details).

...

More matrices to be passed to the function.

statistic

Character. Name of the similarity index to be used.

normalized

Logical. When TRUE, returns the normalized Hamming distance, which ranges between 0 and 1 (currently only used when statistic="hamming").

firstonly

Logical. When TRUE, only the first matrix is compared with all the others.

include_self

Logical. When set to TRUE, the diagonal is considered in the calculations. Since most calculations are done in the context of social networks, the default is set to FALSE.

exclude_j

Logical. When TRUE, the comparison between matrices i and j is done after the jth row and column are removed from each.

Value

A matrix with n*(n - 1)/2 rows and 2 + length(statistic) columns, where columns 1 and 2 hold the ids of matrices i and j, and the remaining columns hold the corresponding distances/similarities.
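
The row layout can be sketched with base R. This is an illustration only: pair_index is a hypothetical helper, not part of the package, and it assumes that pairs are enumerated in the same order as in the Examples section.

```r
# Hypothetical helper: enumerate the n*(n - 1)/2 unordered pairs (i, j), i < j
pair_index <- function(n) {
  ans <- t(utils::combn(n, 2))
  colnames(ans) <- c("i", "j")
  ans
}

pairs <- pair_index(4)
nrow(pairs)   # 4 * 3 / 2 = 6 rows
pairs[1:3, ]  # (1, 2), (1, 3), (1, 4)
```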

Details

All of the available statistics are based on a 2x2 contingency matrix counting matches and mismatches between each pair of matrices (R1, R2).

            R2
           1   0
    R1  1  a   b
        0  c   d
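
As a minimal sketch of how such counts could be obtained, the base R code below tallies the four cells for two binary matrices of equal dimensions. The helper contingency is hypothetical and is not the package's implementation; it assumes that include_self = FALSE simply drops the diagonal cells.

```r
# Hypothetical helper: 2x2 contingency counts for two binary matrices
contingency <- function(R1, R2, include_self = FALSE) {
  stopifnot(all(dim(R1) == dim(R2)))
  keep <- matrix(TRUE, nrow(R1), ncol(R1))
  if (!include_self)
    keep[row(keep) == col(keep)] <- FALSE # drop the diagonal cells
  x <- R1[keep]
  y <- R2[keep]
  c(
    a = sum(x == 1 & y == 1), # 1 in both R1 and R2
    b = sum(x == 1 & y == 0), # 1 in R1 only
    c = sum(x == 0 & y == 1), # 1 in R2 only
    d = sum(x == 0 & y == 0)  # 0 in both
  )
}

R1 <- matrix(c(0, 1, 0,
               1, 0, 0,
               0, 0, 0), nrow = 3, byrow = TRUE)
R2 <- matrix(c(0, 1, 0,
               0, 0, 0,
               0, 0, 0), nrow = 3, byrow = TRUE)

contingency(R1, R2) # a = 1, b = 1, c = 0, d = 4
```

With include_self = TRUE the three diagonal zeros are counted as well, so d becomes 7 in this example.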

The Similarity and Distance sections give a complete list of the available statistics.

distance is just an alias for similarity.

Similarity

  • Jaccard (1): "sjaccard" or "jaccard"

  • Sørensen–Dice coefficient (2), Czekanowski (3), Nei & Li (5): "sdice" or "sczekanowsk" or "sneili"

  • 3w-jaccard (4): "s3wjaccard"

  • Sokal & Michener (7): "ssokmich" or "sokmich"

  • Sokal & Sneath II (8): "ssoksne" or "soksne"

  • Roger & Tanimoto (9): "roger&tanimoto" or "sroger&tanimoto"

  • Faith (10): "sfaith" or "faith"

  • Gower and Legendre (11): "sgl" or "gl"

  • Russell & Rao (14): "srusrao"

  • Tarwid (54): "starwid" or "tarwid"

  • Pearson & Heron 1 (54): "sph1" or "ph1" or "s14". This is also known as S14 in Gower and Legendre (1986).

    For S14, the function follows Krackhardt (1990):

    $$ S_{14} = \sqrt{\left(\frac{a}{(a + c)} - \frac{b}{(b + d)}\right)\times\left(\frac{a}{(a + b)} - \frac{c}{(c + d)}\right)} $$

    As written, this statistic lies between 0 and 1. The negative values in the Examples section suggest that the implementation keeps the sign of ad - bc.

  • Dennis (44): "sdennis" or "dennis"

  • Yuleq (61): "syuleq"

  • Yuleq similarity (63): "syuleqw"

  • Michael (68): "smichael" or "michael"

    $$ S_{Michael} = \frac{4(ad-bc)}{(a+d)^2 + (b+c)^2} $$

  • Dispersion (66): "sdisp" or "disp"

    $$ S_{Dispersion} = \frac{ad - bc}{(a + b + c + d)^2} $$

  • Hamann (67): "shamann" or "hamann"

    $$ S_{Hamann} = \frac{(a + d) - (b + c)}{a + b + c + d} $$

  • Goodman & Kruskal (69): "sgk" or "gk"

    $$ S_{Goodman \& Kruskal} = \frac{\sigma - \sigma'}{2n - \sigma'} $$

    where \(\sigma = \max(a,b) + \max(c,d) + \max(a,c) + \max(b,d)\), and \(\sigma' = \max(a + c, b + d) + \max(a + b, c + d)\).

  • Anderberg (70): "sanderberg" or "anderberg"

    $$ S_{Anderberg} = \frac{\sigma - \sigma'}{2n} $$

    where \(\sigma\) and \(\sigma'\) are defined as in (69).

  • Peirce (73): "speirce" or "peirce"

    $$ S_{Peirce} = \frac{ab + bc}{ab + 2bc + cd} $$

  • FScore (00): "fscore" or "sfscore"

    For details on fscore, ask Kyosuke Tanaka.
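
The formulas in this section can be checked by hand with a few lines of base R. The sketch below is not the package's implementation: the function names are hypothetical, the counts passed to them are made up for illustration, and n is assumed to be a + b + c + d as in Choi et al. (2010).

```r
# Hypothetical helpers; a, b, c, d are the cells of the 2x2 contingency table
s14 <- function(a, b, c, d) {
  sqrt((a / (a + c) - b / (b + d)) * (a / (a + b) - c / (c + d)))
}

s_michael <- function(a, b, c, d) {
  4 * (a * d - b * c) / ((a + d)^2 + (b + c)^2)
}

s_hamann <- function(a, b, c, d) {
  ((a + d) - (b + c)) / (a + b + c + d)
}

# sigma and sigma' as defined for Goodman & Kruskal (69)
sigmas <- function(a, b, c, d) {
  list(
    sigma       = max(a, b) + max(c, d) + max(a, c) + max(b, d),
    sigma_prime = max(a + c, b + d) + max(a + b, c + d)
  )
}

s_gk <- function(a, b, c, d) {
  s <- sigmas(a, b, c, d)
  n <- a + b + c + d # assumption: n is the total number of cells compared
  (s$sigma - s$sigma_prime) / (2 * n - s$sigma_prime)
}

s_anderberg <- function(a, b, c, d) {
  s <- sigmas(a, b, c, d)
  n <- a + b + c + d # same assumption as above
  (s$sigma - s$sigma_prime) / (2 * n)
}

# Made-up counts: a = 1, b = 1, c = 0, d = 4
s14(1, 1, 0, 4)         # sqrt(0.8 * 0.5), about 0.632
s_michael(1, 1, 0, 4)   # 16 / 26
s_hamann(1, 1, 0, 4)    # 4 / 6
s_gk(1, 1, 0, 4)        # (10 - 9) / (12 - 9) = 1 / 3
s_anderberg(1, 1, 0, 4) # (10 - 9) / 12
```

Because the two factors under the square root of the S14 expression always share the sign of ad - bc, their product is never negative and the square root is defined whenever no margin of the table is zero.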

Distance

  • Vari (23): "dvari" or "vari"

  • Size Difference (24): "dsizedif" or "sizedif"

  • Shape Difference (25): "dsphd" or "sphd"

  • Pattern Difference (26): "dpattdif" or "pattdif"

  • Hamming (15): "dhamming" or "hamming"

  • Mean Manhattan (20): "dmeanman" or "meamman"

    $$ D_{Mean-manhattan} = \frac{b + c}{a + b + c + d} $$

  • Yuleq distance (62): "dyuleq"
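
A short base R sketch of two of these distances follows. The function names are hypothetical and this is not the package's code. The Hamming distance is assumed to be b + c, as in Choi et al. (2010), and normalization is assumed to divide by the number of cells compared, which makes it coincide with the Mean Manhattan formula above.

```r
# Hypothetical helpers; a, b, c, d are the cells of the 2x2 contingency table
d_meanman <- function(a, b, c, d) {
  (b + c) / (a + b + c + d)
}

d_hamming <- function(a, b, c, d, normalized = TRUE) {
  ans <- b + c # number of mismatching cells (assumed definition)
  if (normalized) ans / (a + b + c + d) else ans
}

# Made-up counts: a = 1, b = 1, c = 0, d = 4
d_meanman(1, 1, 0, 4)                    # 1 / 6
d_hamming(1, 1, 0, 4)                    # 1 / 6
d_hamming(1, 1, 0, 4, normalized = FALSE) # 1
```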

References

Choi, S. S., Cha, S. H., & Tappert, C. C. (2010). A survey of binary similarity and distance measures. Journal of Systemics, Cybernetics and Informatics, 8(1), 43-48.

Krackhardt, D. (1990). Assessing the political landscape: Structure, cognition, and power in organizations. Administrative Science Quarterly, 35(2), 342-369.

Gower, J. C., & Legendre, P. (1986). Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3(1), 5-48.

See also

The statistics object contains a list with the available statistics for convenience.

Examples

# Getting all networks of size 3
data(powerset03)

# We can compute it over the entire set
head(similarity(powerset03, statistic="s14"))
#>      i j        s14
#> [1,] 1 2  0.6324555
#> [2,] 1 3 -0.2000000
#> [3,] 1 4  0.6324555
#> [4,] 1 5  0.4472136
#> [5,] 1 6 -0.3162278
#> [6,] 1 7 -0.2000000

# Or over just three matrices, passed one by one
head(similarity(powerset03[[1]], powerset03[[2]], powerset03[[3]], statistic="s14"))
#>      i j        s14
#> [1,] 1 2  0.6324555
#> [2,] 1 3 -0.2000000
#> [3,] 2 3  0.6324555

# We can compute multiple distances at the same time
ans <- similarity(powerset03, statistic=c("hamming", "dennis", "jaccard"))
head(ans)
#>      i j   hamming     dennis   jaccard
#> [1,] 1 2 0.1666667  1.6329932 0.5000000
#> [2,] 1 3 0.3333333 -0.5773503 0.0000000
#> [3,] 1 4 0.1666667  1.6329932 0.5000000
#> [4,] 1 5 0.3333333  1.0000000 0.3333333
#> [5,] 1 6 0.5000000 -0.8164966 0.0000000
#> [6,] 1 7 0.3333333 -0.5773503 0.0000000