History of the Logical Entropy Formula

The logical entropy formula h(p) = 1-\sum_{i}p_{i}^{2}
Given a partition \pi = \{B,B',\ldots\} on a finite universe set U, the set of distinctions or dits dit(\pi) is the set \cup_{B\in \pi}B \times (U-B) of ordered pairs of elements in distinct blocks of the partition. The logical entropy h(\pi) of the partition is the normalized cardinality of the dit set: h(\pi) = \frac{|dit(\pi)|}{|U\times U|}. The logical entropy can be interpreted probabilistically as the probability that the random drawing of a pair of elements (with replacement between draws) from U (with equiprobable elements) gives a distinction of the partition. In terms of the block probabilities p_{B} = \frac{|B|}{|U|}, the logical entropy can be computed as:

h(\pi) = \sum_{B \in \pi}p_{B}(1-p_{B}) = 1-\sum_{B \in \pi}p_{B}^{2}.


The set-complement of the dit set dit(\pi) within the Cartesian product U \times U is the set indit(\pi) of indistinctions or indits, i.e., the ordered pairs of elements in the same block of the partition. The complement 1-h(\pi)= \sum_{B \in \pi}p_{B}^{2} of the logical entropy is the normalized count of the indistinctions. It is the probability that a randomly drawn pair will give two elements in the same block.
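
As a minimal sketch (not from the original text, using a small hypothetical universe and partition), the following Python fragment builds the dit and indit sets directly and checks that the normalized dit count agrees with the formula 1-\sum_{B}p_{B}^{2}:

```python
# Minimal sketch: dit/indit sets for a hypothetical partition of a 5-element universe.
from itertools import product

U = {0, 1, 2, 3, 4}                 # universe set, |U| = 5
pi = [{0, 1}, {2, 3}, {4}]          # a partition of U into blocks

# dit(pi): ordered pairs whose elements lie in distinct blocks
dits = {(u, v) for B in pi for u in B for v in U - B}
# indit(pi): ordered pairs whose elements lie in the same block
indits = set(product(U, repeat=2)) - dits

h_from_dits = len(dits) / len(U) ** 2                         # |dit(pi)| / |U x U|
h_from_probs = 1 - sum((len(B) / len(U)) ** 2 for B in pi)    # 1 - sum_B p_B^2

print(h_from_dits, h_from_probs, len(indits) / len(U) ** 2)   # 0.64, 0.64, 0.36
```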

Replacing each block probability p_{B} by a probability p_{i} from a finite probability distribution p = \{p_{1},\ldots,p_{n}\}, the logical entropy h(p) of the finite probability distribution (or finite random variable) is:

h(p) = 1- \sum_{i=1}^{n}p_{i}^{2}.

It is the probability that in two independent trials, the random variable will take on different values.
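
For instance, here is a short sketch (with an assumed example distribution, not from the original text) that computes h(p) and checks the two-independent-draws interpretation by simulation:

```python
# Minimal sketch: h(p) as the probability of different values in two independent draws.
import random

p = [0.5, 0.25, 0.25]                      # an assumed example distribution
h = 1 - sum(p_i ** 2 for p_i in p)         # logical entropy, here 0.625

trials = 100_000
outcomes = range(len(p))
different = sum(
    random.choices(outcomes, weights=p)[0] != random.choices(outcomes, weights=p)[0]
    for _ in range(trials)
)
print(h, different / trials)               # the simulated frequency should be close to 0.625
```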

The relative shares interpretation

The logical entropy formula makes perfectly good sense when the p_{i} are interpreted nonprobabilistically as the relative shares in some total where \sum_{i}p_{i}= 1. Then it is a measure of heterogeneity or diversity, with the maximum value being obtained for equal shares.

The complement 1-h(p)= \sum_{i=1}^{n}p_{i}^{2} is the probability that in two independent trials, the random variable will have the same value. If the p_{i} are interpreted as relative shares, then the complement is a measure of homogeneity or concentration where the maximum value is obtained when one of the p_{i} is 1 and the others are 0.

According to the late statistician I. J. Good, the formula (in the complementary form) has a certain naturalness: “If p_{1},\ldots,p_{t} are the probabilities of t mutually exclusive and exhaustive events, any statistician of this century who wanted a measure of homogeneity would take about two seconds to suggest \sum p_{i}^{2} which I shall call \rho.” [Good 1982, p. 561]

History of the formula

This logical entropy formula and its complement have a long and varied interdisciplinary history. The formula goes back at least to Corrado Gini who suggested 1-\sum_{i}p_{i}^{2} as an index of “mutability” or diversity in the same 1912 paper, Variabilità e mutabilità, where he defined his far more famous index of inequality now known as the Gini coefficient.

The formula in cryptography

But another development of the formula (in the complementary form) in the early twentieth century was in cryptography. The American cryptologist William F. Friedman devoted a 1922 book to the “index of coincidence” (i.e., \sum p_{i}^{2}) which, according to one author on the topic, “must be regarded as the most important single publication in cryptology.” [Kahn 1967] Two mathematicians, Solomon Kullback and Abraham Sinkov, who at one time worked as assistants to Friedman, also wrote books on cryptology which used the index [Kullback 1976; Sinkov 1968].

During World War II, Alan M. Turing worked for a time in the Government Code and Cypher School at the Bletchley Park facility in England. Probably unaware of the earlier work, Turing used \rho = \sum_{i} p_{i}^{2} in his cryptanalysis work and called it the repeat rate since it is the probability of a repeat in a pair of independent draws from a population with those probabilities (i.e., the identification probability 1-h(p)). Polish cryptanalysts had independently used the repeat rate in their work on the Enigma machine [Rejewski 1981].

The formula in biostatistics

After the war, Edward H. Simpson, a British statistician, proposed \sum_{B \in \pi}p_{B}^{2} as a measure of species concentration (the opposite of diversity) where \pi is the partition of animals or plants according to species and where each animal or plant is considered as equiprobable. And Simpson gave the interpretation of this homogeneity measure as “the probability that two individuals chosen at random and independently from the population will be found to belong to the same group.” [Simpson 1949] Hence 1-\sum_{B \in \pi}p_{B}^{2} is the probability that a random pair will belong to different species, i.e., will be distinguished by the species partition. In the biodiversity literature [Ricotta and Szeidl 2006], the formula is known as “Simpson’s index of diversity” or sometimes, the “Gini-Simpson diversity index.”

However, Simpson, along with I. J. Good, worked at Bletchley Park during WWII, and, according to Good, “E. H. Simpson and I both obtained the notion [the repeat rate] from Turing.” [Good 1979] When Simpson published the index in 1949, he (again, according to Good) did not acknowledge Turing, “fearing that to acknowledge him would be regarded as a breach of security.” [Good 1982]

The formula in economics

There is at least a third independent development of the formula. In 1945, Albert O. Hirschman [1945, 1964] suggested using \sqrt{\sum_{i}p_{i}^{2}} as an index of trade concentration (where p_{i} is the relative share of trade in a certain commodity or with a certain partner). A few years later, Orris Herfindahl [1950] independently suggested using \sum_{i}p_{i}^{2} as an index of industrial concentration (where p_{i} is the relative share of the i^{th} firm in an industry). In the industrial economics literature, the index H = \sum_{i}p_{i}^{2} is variously called the Hirschman-Herfindahl index, the HH index, or just the H index of concentration. If all the relative shares were equal (i.e., p_{i}=1/n), then the identification or repeat probability is just the probability of drawing any element, i.e., H=1/n, so 1/H = n is the number of equal elements. This led to the “numbers equivalent” interpretation of the reciprocal of the H index [Adelman 1969]. The basic idea of the numbers-equivalent was introduced by Robert MacArthur [1965] slightly earlier in the biodiversity literature as a way to interpret the antilog of the Shannon entropy (i.e., the block-count entropy in Ellerman 2009).

The numbers-equivalent interpretation

In general, given an event with probability p_{i}, the numbers-equivalent interpretation of the event is that it is ‘as if’ an element were drawn out of a set of 1/p_{i} equiprobable elements (it is ‘as if’ since 1/p_{i} need not be an integer). When drawing from a population of n equiprobable elements, the probability of drawing (with replacement) two distinct elements is the logical entropy h(p) = 1-(1/n) (i.e., the probability that the second draw is different from the first). The complementary form 1-h(p)= 1/n is the identification probability of drawing the same element twice. Hence in general, given a probability distribution p = \{p_{1},\ldots,p_{n}\}, the numbers-equivalent interpretation is that it is ‘as if’ the random drawing of a pair were from a set of \frac{1}{1-h(p)} equiprobable elements, i.e., such an equiprobable distribution would have the same logical entropy as p. The numbers-equivalent concept shows up again in the interpretation of the (antilog of) Shannon entropy, so it is one way to construct the conceptual bridge between the concepts of logical entropy and Shannon entropy.
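
A brief sketch (with illustrative numbers, not from the original text) of the numbers-equivalent computation \frac{1}{1-h(p)} = \frac{1}{\sum_{i}p_{i}^{2}}:

```python
# Minimal sketch: the numbers-equivalent 1/(1 - h(p)) = 1 / sum p_i^2 of a distribution.
def logical_entropy(p):
    """h(p) = 1 - sum p_i^2: probability that two independent draws differ."""
    return 1 - sum(p_i ** 2 for p_i in p)

def numbers_equivalent(p):
    """1/(1 - h(p)): the 'as if' number of equiprobable elements (need not be an integer)."""
    return 1 / (1 - logical_entropy(p))

p = [0.5, 0.25, 0.125, 0.125]              # an assumed example distribution
n_eq = numbers_equivalent(p)               # about 2.91 'equivalent' equal elements
# An equiprobable distribution on n_eq elements has logical entropy 1 - 1/n_eq,
# which equals h(p) by construction:
print(n_eq, logical_entropy(p), 1 - 1 / n_eq)   # 2.909..., 0.65625, 0.65625
```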

Many independent discoveries of the formula

In view of the multiple independent discoveries of the formula \rho = \sum p_{i}^{2} or its complement 1- \sum p_{i}^{2} by Gini, Friedman, Turing, Hirschman, Herfindahl, and no doubt others, I. J. Good wisely advises that “it is unjust to associate ρ with any one person.” [Good 1982] The name “logical entropy” for h(p) = 1- \sum_{i=1}^{n}p_{i}^{2} not only denotes the basic logical status of the formula, it follows “Stigler’s Law of Eponymy”: “No scientific discovery is named after its original discoverer.”[Stigler 1999]

C.R. Rao’s quadratic entropy

From the logical viewpoint, two elements from U = \{u_{1},\ldots,u_{n}\} are either identical or distinct. Gini [1912] introduced d_{ij} as the “distance” between the i^{th} and j^{th} elements where d_{ij}=1 for i \not= j and d_{ii}=0. Since

1= (p_{1}+\ldots+p_{n})(p_{1}+\ldots+p_{n}) = \sum_{i}p_{i}^{2}+ \sum_{i\not=j}p_{i}p_{j},

the logical entropy, i.e., Gini’s index of mutability,

h(p)=1-\sum_{i}p_{i}^{2}=\sum_{i\not=j}p_{i}p_{j},

is the average logical distance between a pair of independently drawn elements. But one might generalize by allowing other distances d_{ij}=d_{ji} for i \not= j (but always d_{ii}=0) so that Q=\sum_{i\not=j}d_{ij}p_{i}p_{j} would be the average generalized distance between a pair of independently drawn elements from U. In 1982, C. R. (Calyampudi Radhakrishna) Rao introduced this concept as quadratic entropy [1982] (which was later rediscovered in the biodiversity literature as the “Avalanche Index” by Ganeshaiah et al. [1997]). In many domains, it is quite reasonable to move beyond the bare-bones logical distance of d_{ij}=1 for i \not= j, so that Rao’s quadratic entropy is a useful and easily interpreted generalization of logical entropy.
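
A minimal sketch (with hypothetical distances and probabilities, not from the original text) of Rao’s quadratic entropy, showing that it reduces to the logical entropy when d_{ij}=1 for i \not= j:

```python
# Minimal sketch: Rao's quadratic entropy Q = sum_{i != j} d_ij p_i p_j.
def quadratic_entropy(d, p):
    """Average distance between two independently drawn elements (d has zero diagonal)."""
    n = len(p)
    return sum(d[i][j] * p[i] * p[j] for i in range(n) for j in range(n))

p = [0.5, 0.3, 0.2]                                    # hypothetical probabilities

# bare-bones logical distance: d_ij = 1 for i != j, d_ii = 0
d_logical = [[0 if i == j else 1 for j in range(3)] for i in range(3)]
# a hypothetical symmetric distance matrix with zero diagonal
d_general = [[0.0, 0.4, 1.0],
             [0.4, 0.0, 0.7],
             [1.0, 0.7, 0.0]]

print(quadratic_entropy(d_logical, p))   # 0.62 = 1 - sum p_i^2 = h(p)
print(quadratic_entropy(d_general, p))   # 0.404, the generalized average distance
```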

Bibliography

Adelman, M. A. 1969. Comment on the H Concentration Measure as a Numbers-Equivalent. Review of Economics and Statistics. 51: 99-101.

Ellerman, David 2009. Counting Distinctions: On the Conceptual Foundations of Shannon’s Information Theory. Synthese. 168 (1 (May)): 119-149.

Friedman, William F. 1922. The Index of Coincidence and Its Applications in Cryptography. Geneva IL: Riverbank Laboratories.

Ganeshaiah, K. N., K. Chandrashekara and A. R. V. Kumar 1997. Avalanche Index: A new measure of biodiversity based on biological heterogeneity of communities. Current Science. 73: 128-33.

Gini, Corrado 1912. Variabilità e mutabilità. Bologna: Tipografia di Paolo Cuppini.

Good, I. J. 1979. A.M. Turing’s statistical work in World War II. Biometrika. 66 (2): 393-6.

Good, I. J. 1982. Comment (on Patil and Taillie: Diversity as a Concept and its Measurement). Journal of the American Statistical Association. 77 (379): 561-3.

Herfindahl, Orris C. 1950. Concentration in the U.S. Steel Industry. Unpublished doctoral dissertation: Columbia University.

Hirschman, Albert O. 1945. National power and the structure of foreign trade. Berkeley: University of California Press.

Hirschman, Albert O. 1964. The Paternity of an Index. American Economic Review. 54 (5): 761-2.

Kahn, David 1967. The Codebreakers: The Story of Secret Writing. New York: Macmillan.

Kullback, Solomon 1976. Statistical Methods in Cryptanalysis. Walnut Creek CA: Aegean Park Press.

MacArthur, Robert H. 1965. Patterns of Species Diversity. Biol. Rev. 40: 510-33.

Rao, C. Radhakrishna 1982. Diversity and Dissimilarity Coefficients: A Unified Approach. Theoretical Population Biology. 21: 24-43.

Rejewski, M. 1981. How Polish Mathematicians Deciphered the Enigma. Annals of the History of Computing. 3: 213-34.

Ricotta, Carlo and Laszlo Szeidl 2006. Towards a unifying approach to diversity measures: Bridging the gap between the Shannon entropy and Rao’s quadratic index. Theoretical Population Biology. 70: 237-43.

Simpson, Edward Hugh 1949. Measurement of Diversity. Nature. 163: 688.

Sinkov, Abraham 1968. Elementary Cryptanalysis: A Mathematical Approach. New York: Random House.

Stigler, Stephen M. 1999. Statistics on the Table. Cambridge: Harvard University Press.
