Brian Wichmann is the Editor of Voting matters and a visiting professor of The Open University.
An obvious question is whether the information in a set of preferential ballots can be condensed while keeping its essential content. This paper proposes a simple model which appears to capture that essential information.
To illustrate, consider the case with four candidates: Albert, Bernard, Clare and Diana, with the votes cast as follows:
     20  AB
     15  CDA
      4  ADC
      1  B
From this data, we count how often each ordered pair of consecutive
preferences occurs, adding both a starting position (s) and a terminating
position (e) to every ballot. For instance, the number of times the
preference for A is followed by B is 20, and the number of times the
starting position is 'followed by' A is 20+4=24. The complete table, with
rows giving the current position and columns the one that follows, is
therefore:
    A   B   C   D   e
s  24   1  15   0   -
A   -  20   0   4  15
B   0   -   0   0  21
C   0   0   -  15   4
D  15   0   4   -   0
Obviously, a preference for X cannot be followed by X, resulting in the
diagonal of dashes. The entry under s-e could represent the invalid
votes.
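The counting step can be sketched in a few lines of Python. This is only an illustration: the function name `transition_counts` and the dictionary layout are assumptions made here, not part of the original analysis, and each candidate is assumed to be identified by a single letter.

```python
from collections import Counter

START, END = "s", "e"  # the added starting and terminating positions

def transition_counts(ballots):
    """Count consecutive pairs; `ballots` maps a preference string to its votes."""
    counts = Counter()
    for prefs, votes in ballots.items():
        path = [START, *prefs, END]
        # Each adjacent pair on the path contributes `votes` to one table cell.
        for current, following in zip(path, path[1:]):
            counts[current, following] += votes
    return counts

ballots = {"AB": 20, "CDA": 15, "ADC": 4, "B": 1}
table = transition_counts(ballots)

# Print one row per current position, columns in the order A B C D e.
for row in "sABCD":
    print(row, [table[row, col] for col in "ABCDe"])
```

The printed rows show 0 where the table has a dash, since a `Counter` returns zero for pairs that never occur.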
Having computed this table, we can use it to characterise voting behaviour. For instance, 24 out of 40, or 60%, of voters gave A as their first preference. More than this, we can use the table to construct ballot papers having the same statistical properties. For example, if the first preference was A, then the second row of the table shows that the subsequent preference should be B, D or e in the proportions 20:4:15. Because the table fortunately contains a large number of zeros, we can easily compute the distribution of all the possible ballot papers which can be constructed this way. Listing these in decreasing order of frequency, we have:
Ballot  Constructed  Original
AB         30.8%     (50.0%)
A          23.1%
CDAB       16.9%
CDA        12.7%     (37.5%)
C           7.9%
ADC         6.1%     (10.0%)
B           2.5%     ( 2.5%)
The figures in brackets are the frequencies from the original data, which can be seen to be quite different.
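The distribution can be reproduced by walking the table from s to e. The sketch below rests on an assumption that is not spelled out in the text: a candidate already on the paper cannot be chosen again, and the remaining entries in the row are rescaled to sum to one. With that rule the computed percentages agree with those quoted to within 0.1 of a percentage point (ADC comes out at 6.15%). The function name and data layout are illustrative only.

```python
from fractions import Fraction

# Pair counts copied from the table; zero entries and dashes are omitted.
COUNTS = {
    "s": {"A": 24, "B": 1, "C": 15},
    "A": {"B": 20, "D": 4, "e": 15},
    "B": {"e": 21},
    "C": {"D": 15, "e": 4},
    "D": {"A": 15, "C": 4},
}

def ballot_distribution(counts):
    """Return {ballot: probability} for every paper the table can generate."""
    result = {}

    def extend(state, ballot, prob):
        # Assumption: candidates already ranked are excluded and the
        # remaining proportions are renormalised.
        options = {nxt: n for nxt, n in counts[state].items() if nxt not in ballot}
        if not options:  # dead end: treat it as the end of the paper
            options = {"e": 1}
        total = sum(options.values())
        for nxt, n in options.items():
            p = prob * Fraction(n, total)
            if nxt == "e":
                result[ballot] = result.get(ballot, 0) + p
            else:
                extend(nxt, ballot + nxt, p)

    extend("s", "", Fraction(1))
    return result

dist = ballot_distribution(COUNTS)
for ballot, p in sorted(dist.items(), key=lambda kv: -kv[1]):
    print(f"{ballot:5s}{float(p):6.1%}")
```

Exact fractions are used so that the probabilities can be checked to sum to one before they are rounded for display.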
A number of points arise from this example:
The conclusion so far is that the model characterises some aspects of voter behaviour, but does not mirror other aspects. However, from the point of view of preferential voting systems, we need to know whether the characterisation influences the results obtained by a variety of STV algorithms. This can be checked by comparing sets of ballot papers constructed by the above process against those produced by random selection of ballot papers from the original data.
We take the ballot papers from a real election to select 7 candidates from 14, namely election R33 from the STV database. From this data, which consists of 194 ballot papers, we construct two sets of 100 elections of 25 votes each: a) by producing random subsets of the actual ballots, and b) by the process described above.
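The two ways of building a 25-vote election might be sketched as follows. Since the R33 ballots are not reproduced here, the four-candidate example stands in for them. The function names are hypothetical, candidates are assumed to have single-letter names, and the rule that a candidate cannot appear twice on a constructed paper is an assumption.

```python
import random
from collections import Counter

def subset_election(papers, size, rng):
    """Method (a): a random subset of the actual ballot papers."""
    return rng.sample(papers, size)

def pair_counts(papers):
    """Table of consecutive-preference counts, with 's' and 'e' added."""
    counts = {}
    for paper in papers:
        path = ["s", *paper, "e"]
        for cur, nxt in zip(path, path[1:]):
            counts.setdefault(cur, Counter())[nxt] += 1
    return counts

def process_paper(counts, rng):
    """Method (b): one paper drawn step by step from the pair counts."""
    paper, state = "", "s"
    while True:
        # Assumption: a candidate already on the paper is not eligible again.
        options = {c: n for c, n in counts[state].items() if c not in paper}
        if not options:
            return paper
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        if nxt == "e":
            return paper
        paper, state = paper + nxt, nxt

# Stand-in data: the 40 papers of the four-candidate example.
papers = ["AB"] * 20 + ["CDA"] * 15 + ["ADC"] * 4 + ["B"]
rng = random.Random(2024)  # fixed seed so the sketch is repeatable

election_a = subset_election(papers, 25, rng)
counts = pair_counts(papers)
election_b = [process_paper(counts, rng) for _ in range(25)]
print(Counter(election_a))
print(Counter(election_b))
```

Method (a) can only return papers that were actually cast, whereas method (b) can also produce papers such as A or CDAB that appear nowhere in the original data.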
For each of the 200 elections we determine 4 properties as follows:
               Subset   Process  Number
Condorcet (G)    75       67      100
Meek (C)         42       34      100
ERS (N)          56       47      100
Tideman (E)      14       20       50
I believe that the four properties above are sufficiently independent, and
the elections themselves independent enough to undertake the chi-squared
test to see if the two sets of elections could be regarded as having come
from the same population. Passing this test would indicate that the
statistical construction process is effective in providing 'election' data
for research purposes.
The statistical testing is best done as a separate 2 × 2 table test of each line. The first line, for example, gives the table
        Condorcet Analysis
          (G)  other
Subset    75    25     100
Process   67    33     100 
         -----------------
         142    58     200
The four tables give P = 0.28, 0.31, 0.26 and 0.29 respectively, using a
two-tailed test. As far as this test goes, the results show no significant
difference between the two methods.
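The four P values can be checked with a short calculation. The sketch below assumes the chi-squared test was applied with Yates's continuity correction. The text does not say so, but that choice reproduces the quoted values to two decimal places. The function name is illustrative.

```python
import math

def yates_chi2_p(a, b, c, d):
    """Chi-squared (1 d.o.f.) and P for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    # Yates's continuity correction (an assumption, see the note above).
    diff = max(abs(a * d - b * c) - n / 2, 0)
    chi2 = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of the chi-squared distribution with one degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2))

# (subset count, process count, elections per method) for each property.
rows = {
    "Condorcet (G)": (75, 67, 100),
    "Meek (C)": (42, 34, 100),
    "ERS (N)": (56, 47, 100),
    "Tideman (E)": (14, 20, 50),
}

for name, (subset, process, number) in rows.items():
    chi2, p = yates_chi2_p(subset, number - subset, process, number - process)
    print(f"{name:14s} chi2 = {chi2:.2f}  P = {p:.2f}")
```

Without the correction the first table gives a chi-squared of about 1.55 rather than 1.19, so the P values would come out smaller than those quoted.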