In Search of the Black Swan: Analysis of the Statistical Evidence of Electoral Fraud in Venezuela

Ricardo Hausmann, Harvard University
Roberto Rigobón, Massachusetts Institute of Technology

September 3, 2004
*This study was requested by Súmate, which also provided the databases we used. We appreciate the great information-gathering effort carried out by this organization. We are equally indebted to a hard-working collaborator who, for institutional reasons, must remain anonymous. We also thank Andrés Velasco for his useful comments. The opinions expressed in this report, and any errors we may have made, are our responsibility alone and do not compromise either Súmate or the universities to which we belong.
Abstract
This study analyzes diverse hypotheses of electronic fraud in the Recall Referendum held in Venezuela on August 15, 2004. We define fraud as the difference between the electors' intent and the official vote tally. Our null hypothesis is that there was no fraud, and we search for evidence that would allow us to reject this hypothesis. We reject the hypothesis that fraud was committed by applying numerical caps to machines in some precincts. Likewise, we discard any hypothesis that implies altering some machines and not others at each electoral precinct, because the variation patterns between machines at each precinct are normal. However, the statistical evidence is compatible with fraud that affected every machine in a given precinct, but differentially more in some precincts than in others. We find that the pattern of deviations between precincts in the relationship between the signatures from the November 2003 Reafirmazo and the YES votes on August 15 is positively and significantly correlated with the pattern of deviations in the relationship between the exit polls and the votes. In other words, the precincts in which, according to the number of signatures, there is an unusually low number of YES votes are also those where, according to the exit polls, the same thing occurs. Using statistical techniques, we rule out the possibility that this is due to spurious errors in the data or to random coefficients in those relationships. Our interpretation is that both the signatures and the exit polls are imperfect measurements of the electors' intent but not of the possible fraud, and therefore what causes their correlation is precisely the presence of fraud. Moreover, we find that the sample used for the audit conducted on August 18 was neither random nor representative of the entire universe of precincts. In this sample, the Reafirmazo signatures are associated with 10 percent more votes than in the non-audited precincts.
We built 1,000 random samples in non-audited precincts and found that this result occurs with a frequency lower than 1 percent. This result is compatible with the hypothesis that the sample for the audit was chosen only among those precincts whose results had not been altered.
Introduction
This study presents a statistical evaluation of the results of the August 15, 2004 Recall Referendum on President Hugo Chávez's mandate. From the morning of August 16, 2004, when the CNE (Consejo Nacional Electoral) announced the results, opposition spokespersons expressed doubts about the validity of these results and argued that an electronic fraud had been committed. These doubts have not been cleared up with the passing of time, and the opposition has yet to acknowledge President Chávez's alleged victory. In this context, Súmate requested that we do a statistical analysis to verify whether the available information is compatible with the hypothesis of fraud or whether, on the contrary, it rejects this hypothesis. Súmate provided the data used in this study but gave us complete autonomy over the conduct of our research. We were informed that the presumption of fraud is based on the following elements:

1. A new automated voting system was used, in spite of the fact that the opposition had requested a manual tally.

2. The voting machines left a paper trail by printing ballots that allowed each elector to verify that the machine had counted his vote adequately. These ballots were collected in boxes. However, the CNE did not allow the boxes to be opened and counted. Instead, it performed a so-called "hot" audit of 1 percent of the machines on the evening of the election. Moreover, the CNE decided that the boxes to be opened would be chosen by a random number generator program run on its own computer.

3. After a difficult negotiation, the CNE allowed the OAS and the Carter Center to participate as observers in every phase of the process except for access to the central computer server that communicated with each machine in each voting precinct. No witness from the opposition was granted access to that room either. Only two people were allowed in that room until the results were ready.

4. The adopted technology allowed --in fact required-- bidirectional communication between the central servers and the voting machines. This bidirectional communication occurred. This is different from the information that was provided to opposition negotiators about the nature of the technology involved.

5. Contrary to what was initially stipulated, the voting machines communicated with the central server before printing the results in a document called Acta. This opens the possibility that the machines were instructed to print a result different from the one expressed by the voters.

6. On August 15, 2004, different organizations, including Súmate, conducted exit polls in a number of precincts. To assure its quality, Súmate's poll was conducted with the assistance of the firm Penn, Schoen and Berland. Its results were radically different from the official figures. The same thing occurred with the exit poll conducted by "Primero Justicia," a political party. The databases of both surveys were given to us to conduct this study.
7. The "hot audit" conducted at dawn on August 16, 2004 was not carried out to the satisfaction of either the opposition or the international observers. Only 78 of the 192 boxes stipulated were counted. The opposition attended only 28 counts, and the international observers were present at fewer than 20.

8. As requested by the international observers, a second audit was conducted on August 18. The opposition did not participate in this audit because its conditions were not met; for example, the electoral materials were not delivered to a centralized location before choosing the boxes to be opened, and there was no verification that the selected boxes had not been tampered with. Instead, the boxes were chosen 24 hours before they were opened, which in theory left time for them to be altered. Notably, the CNE did not use the random number generator program proposed by the Carter Center, and instead insisted on using its own program, run on its own computer and started with a seed defined by a pro-government member of the CNE. This raises doubts over whether the sample selected was truly random.

All these facts raise the possibility of an electronic fraud in which the machines printed outcomes different from the real count. This could in theory have been done through software alterations or through electronic communication with the computer hub.

Our main findings are the following. First, the paper finds that the sample used for the audit of August 18, which was observed by the OAS and the Carter Center, was not randomly chosen. In that sample, the relationship between the votes obtained by the opposition on August 15 and the signatures gathered requesting the Referendum in November 2003 was 10 percent higher than in the rest of the boxes. We calculate the probability of this occurring by pure chance at less than 1 percent; to establish this, we construct 1,000 samples of non-audited precincts.
This result opens the possibility that the fraud was committed only in a subset of the 4,580 automated precincts, say 3,000, and that the audit appeared clean because the sample selection steered it to the 1,580 unaltered precincts. This sheds new light on the fact that the Electoral Council did not accept the use of the random number generator proposed by the Carter Center, and under these conditions one can see why the Carter Center could not identify the fraud with the audit it observed. In addition, we develop a statistical technique to identify whether there are signs of fraud in the data. Here we depart from previous work on the subject, which was based on finding patterns in the number of votes per machine or precinct. Instead, we look for two independent variables that are imperfect correlates of the intention of voters. Fraud is nothing other than a deviation between the voters' intention and the actual count. Since each variable used is correlated with the intention, but not with the fraud, we can develop a test of whether fraud is present. In other words, each of our two independent measures of the intention to vote predicts the actual number of votes imperfectly. If there is no fraud, the errors these two measures generate should not be correlated, as each makes mistakes for different reasons. However, if there is fraud, both variables would make larger mistakes where the fraud was bigger, and hence the errors would be positively correlated. The paper shows these errors to be highly correlated, and the probability that this is pure chance is again less than 1 percent.
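The resampling logic behind this audit-sample test can be sketched as follows. All magnitudes here are illustrative assumptions, not the actual CNE data: we simulate a votes-per-signature ratio per precinct, plant a 10 percent gap in a hypothetical 200-precinct audited subset, and compute the empirical p-value from 1,000 same-size random draws of non-audited precincts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in for the real data: a votes-per-signature ratio for
# each automated precinct. The sizes and the 10 percent gap are assumptions.
n_precincts, n_audited = 4580, 200
ratio = rng.normal(1.0, 0.15, size=n_precincts)
audited = rng.choice(n_precincts, size=n_audited, replace=False)
ratio[audited] += 0.10   # audited precincts: 10 percent more votes per signature

observed = ratio[audited].mean()
non_audited = np.delete(ratio, audited)

# Draw 1,000 same-size random samples from the non-audited precincts and
# record how often their mean ratio is at least as large as the observed one.
means = np.array([rng.choice(non_audited, size=n_audited, replace=False).mean()
                  for _ in range(1000)])
p_value = (means >= observed).mean()
print(f"observed mean ratio: {observed:.3f}  empirical p-value: {p_value:.3f}")
```

A small empirical p-value says a gap this large essentially never arises among randomly chosen non-audited precincts, which is the sense in which the paper puts the probability below 1 percent.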
The first variable we use is the number of registered voters in each precinct that signed the recall petition in November 2003. Signing clearly signals an intent to vote YES in a future election, but it does so imperfectly. Our second measure is the exit poll conducted by Penn, Schoen and Berland, complemented with an independent exit poll conducted by Primero Justicia. This is also an imperfect measure, as it depends on potential biases in the sample, differences in the skill of the interviewers, and so on. But this source of error should not be correlated, at the precinct level, with the one that affects the signatures. It is therefore very telling that the precincts where the Penn, Schoen and Berland exit poll makes bigger mistakes are also those where the number of petitioners suggests that the YES vote should have been higher. This evidence is troubling because it resonates with three facts about the conduct of the election. First, contrary to the agreed procedure, the voting machines were ordered to communicate with the election computer server before printing the results. Second, contrary to what had been stated publicly, the technology used to connect the machines with the computer hub allowed two-way communication, and this communication actually took place. This raises the possibility that the hub could have told the machines what numbers to print, instead of the other way around. Finally, after an arduous negotiation, the Electoral Council allowed the OAS and the Carter Center to observe all aspects of the election process except for the central computer hub, a place where witnesses from the opposition were also prohibited. At the time, this appeared to be an insignificant detail. Now it looks much more meaningful. The structure of the paper is as follows. First, we describe the evidence coming from the exit polls.
We show that the difference between the exit polls and the actual vote is not caused by a sampling error, due for example, to an over-representation of anti-Chavez precincts, but instead to a generalized but variable difference, precinct by precinct. Second, we test the validity of the popular so-called “topes” hypothesis. According to this theory, machines were ordered not to surpass a certain maximum number of Yes votes. If this was the case, there should be an unusually large number of repeated Yes totals in each precinct and the repeated number should also be the maximum Yes vote total in the precinct. We do not check whether the number of repeats is unusually high, but we do show that the frequency with which the repeated number is also the maximum Yes vote of the precinct is consistent with a random event. We then move on to study whether the variance of results at the precinct level is unusual. This would be the case if some but not all machines were manipulated at the precinct level. We find the variance at the precinct level to be if anything smaller than would be expected by pure chance. The next section develops our test for fraud using our two independent but correlated measures of voters’ intent. We then move on to test whether the sample used for the audit of August 18 was random. The final section concludes.
Exit Polls vs. Votes: Analysis of the Differences
The first evidence of potential irregularities in the election count derives from the exit polls conducted independently by Súmate and Primero Justicia (PJ). As shown in Table
1, according to the CNE, 41.1 percent of voters supported the YES. In the Súmate and PJ surveys, by contrast, the weighted projections were 62.0 and 61.6 percent respectively, a difference of more than 20 points. We check whether this difference is due to the sample chosen by Súmate and Primero Justicia not being representative of the electoral universe; in other words, whether the problem arises from an over-representation of precincts favoring the YES vote relative to those favoring the NO. We show that this is not the source of the problem. As shown in Table 1, according to the CNE the YES obtained 45.0 percent in the precincts surveyed by Súmate, and 42.7 percent in PJ's sample. In other words, in the sample chosen by both organizations, the result they report differs from the official one by more than 17 percentage points. Hence, the difference in results is not principally due to the sample composition but to a systematic difference across the precincts where the exit polls were conducted.

Table 1: Comparison between Electoral Results and Súmate's and Primero Justicia's Exit Polls

                                                             Unweighted   Weighted
Percentage of YES votes at the precinct level                   37.0%       41.1%
Percentage of YES in Súmate's exit poll                         59.5%       62.0%
Percentage of YES votes where Súmate did their exit poll        42.9%       45.0%
Percentage of YES in PJ's exit poll                             62.6%       61.6%
Percentage of YES votes where PJ did their exit poll            42.9%       42.7%
Percentage of YES in Súmate+PJ exit polls                       61.3%       62.2%
Percentage of YES votes where Súmate+PJ did their exit polls    43.1%       44.2%
To illustrate this problem more clearly, Figure 1 shows the percentage of votes and the survey results for the 340 precincts surveyed by both groups. If the surveys were perfect, the points would align along a ray from the origin with a 45-degree slope (drawn in the graph). In other words, where the YES option received 10, 50 or 80 percent, the surveys would show the same result. Points above the 45-degree line mean the poll overestimates the result in that precinct; points below mean it underestimates it. As Figure 1 clearly shows, the bulk of the precincts polled lie above the 45-degree line. Moreover, the graph indicates that the differences between the votes and the surveys vary widely across precincts. The distances to the 45-degree line are largest where the YES option garnered between 20 and 40 percent. This analysis has the following implications. First, it indicates that the difference between the surveys and the votes is not due, in any important way, to problems in the selection of the precincts included in the survey. Second, it implies that the difference may be due to one of two reasons, or a combination of both: a generalized failure of both surveys in each precinct, or a quite generalized and non-linear manipulation of the results. The challenge of the statistical work will be to distinguish between these two hypotheses and establish which is the right one.
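The 45-degree-line diagnostic is mechanical enough to state in a few lines of code. The paired shares below are simulated (we assume a roughly 17-point poll-vote gap with precinct-level noise); only the mechanics of the comparison, not the numbers, mirror the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated paired shares for 340 precincts: official YES share and an
# exit-poll YES share assumed to run about 17 points above the vote.
official = rng.uniform(0.10, 0.80, size=340)
poll = np.clip(official + 0.17 + rng.normal(0, 0.05, size=340), 0, 1)

above = (poll > official).mean()   # points above the 45-degree line
gap = (poll - official).mean()     # average poll-vote difference
print(f"share of precincts above the 45-degree line: {above:.2f}")
print(f"mean poll-vote gap: {gap:.3f}")
```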
Graph 1. Exit polls vs. electoral result: percentage of the YES by precinct

[Scatter plot, one point per precinct labeled with its precinct code: official YES share (SHSI_V) against exit-poll YES share (shsi_ex), both axes running from 0 to 1, with the 45-degree line drawn.]
The Caps or “Topes” Hypothesis
The fraud hypothesis most discussed in Venezuela is based on the idea that numerical caps were imposed on the number of YES votes allowed in a precinct, with the overflow of YES votes switched into NO votes. In this section we evaluate this hypothesis. To analyze its feasibility, we examine how many times the YES and NO vote totals are repeated at the precinct level in the CNE's database, which covers 19,062 automated machines.
Table 2. Number of YES and NO vote totals per machine that are repeated in the same precinct

                        YES       NO
Number of machines    19,062   19,062
Numbers repeated       1,875    1,472
Frequency (%)            9.8      7.7
The repetition of the YES count occurs with a frequency of 9.8 percent, while that of the NO occurs with a frequency of 7.7 percent. We do not test whether this frequency is unusually high or low.1 However, the relatively high frequency is at least in part due to the fact that the number of electors, as well as the turnout rate, tends to be very similar among machines in the same precinct. The fact that repeated YES totals occur with a slightly higher frequency than repeated NO totals is at least in part due to the YES having a lower share of the votes. Let us illustrate this point with an example. Suppose the preference for the YES in a precinct is approximately 40 percent and the number of voters at each machine is 100. A 5 percent variation would imply 2 votes, so the expected result in each machine would lie between 38 and 42, that is, among 5 possible numbers. The same percentage variation for the NO would yield between 57 and 63 votes, that is, 7 possible numbers. Since the number of possible values is higher for the NO than for the YES, it is logical that the NO repeats less frequently. More importantly, the caps hypothesis implies that the number that repeats is also the maximum of the precinct, with the difference assigned to the NO. For this, the repeated number must also be the maximum YES vote in the precinct. We study this hypothesis in Table 3. If the repeated number were randomly placed, it would be the maximum with a frequency equal to 1/(Number of machines - 1). For example, in precincts with 2 machines the repeated number is simultaneously the maximum and the minimum, since there is only one value. With three machines, the probability that the repeated number is the maximum is 50 percent. As Table 3 shows, 66 is not very far from being half of 124. In the case of 5 machines, 54 is not far from one fourth of 198. We conclude that if there was fraud, it was not carried out by imposing numerical caps on the YES votes of the machines in a precinct.
1 Jonathan Taylor from Stanford University has argued that it is unusually high. See http://wwwstat.stanford.edu/~jtaylo/venezuela/
Table 3. Repeated numbers per precinct that are and are not the precinct maximum, by number of machines per precinct

Machines per precinct:   2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   Total
Non-maximum              0   58  161  144  230  221  197  151   97   85   52   36   18   20    7    6    6   1,489
Maximum                 64   66   80   54   46   46   14    4    8    2    2    0    0    0    0    0    0     386
Total                   64  124  241  198  276  267  211  155  105   87   54   36   18   20    7    6    6   1,875
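The 1/(machines - 1) benchmark can be checked by simulation. The sketch below uses an assumed precinct configuration (5 machines of 400 voters each, a 40 percent YES share), not the CNE data; with discrete binomial counts, ties cluster near the mode, so the simulated rate need not hit the benchmark exactly.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(2)

def repeated_is_max_rate(precincts):
    """Among repeated YES totals, the share that are also the precinct maximum."""
    repeated = repeated_max = 0
    for totals in precincts:
        counts = Counter(totals)
        for value, c in counts.items():
            if c > 1:  # this YES total appears on more than one machine
                repeated += 1
                repeated_max += (value == max(totals))
    return repeated_max / repeated

# Assumed setup: 20,000 precincts, 5 machines of 400 voters, YES share 40%.
# Under the null, the paper's benchmark for 5 machines is 1/(5-1) = 0.25.
precincts = [rng.binomial(400, 0.40, size=5) for _ in range(20000)]
print(repeated_is_max_rate(precincts))
```

The same function applied to the actual per-machine YES totals would reproduce the comparison in Table 3 (e.g. 54 maxima out of 198 repeats for 5-machine precincts, against the 0.25 benchmark).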
Variance Analysis of the Within-precinct Results
The caps hypothesis, if true, would also affect the percentage differences in the results of the machines belonging to the same precinct. This is because the number of voters per machine varies due to differences in the abstention rate or in the number of electors assigned to each machine. This variation would show up in the number of NO votes, and would therefore create a source of variation in the results across machines of the same precinct. This hypothesis, and any other hypothesis based on the idea of altering some machines more than others within a precinct, can be tested. In each precinct, voters are assigned to machines according to the last two digits of their identity card (cédula) number. This makes each machine a random sample of the precinct's voters, because the last digits of the identity card are not correlated with any variable relevant to the voting decision. This limits the possible distance between the results of two machines from the same precinct. To illustrate this, consider how opinion surveys are done in any country. A random sample is chosen - usually a thousand or two thousand people - and the outcome is used to predict the behavior of millions of voters. In other words, a representative sample composed of a minuscule fraction of the electorate is used to predict the outcome of the whole. In the case of a precinct we are dealing with a much smaller and more homogeneous universe than a country, and we are dividing the population randomly according to the number of machines in the precinct. For example, in a precinct with five machines, each machine represents approximately 20 percent of the precinct's total population. In addition, in this referendum the options were limited to two: YES or NO. This imposes a condition on the standard deviation of the number of votes per machine. Suppose that N people vote at a machine and that each votes YES with probability p. Probability theory then requires the number of YES votes to follow a binomial distribution, with standard deviation

    Standard Deviation = sqrt( N * p * (1 - p) )

To illustrate, take the case in which p, the probability that an elector votes YES in a given precinct, is 50 percent and N is 400. The standard deviation is then 10 votes. The coefficient of variation (the standard deviation of the vote share) is 10 divided by 400, that is, 2.5 percent. Given this, the typical deviation among machines in the same precinct must be compatible with this rule. If, for example, within a precinct the results of some machines were changed by 10 percent while the others were left unaltered, we would see deviations among machines of about 4 times the expected standard deviation of 2.5 percent. This would be abnormal. One implication is that the caps or "topes" theory would also violate the binomial distribution: if numerical caps were assigned to each machine in a precinct, the variation in the number of voters per machine would be absorbed by the NO votes, altering the percentage results in a way that increases the dispersion of results and violates the binomial rule. To verify whether the CNE vote data complies with the standard deviation predicted by probability theory, we calculate each machine's deviation with respect to its precinct average and divide it by the standard deviation that would correspond to a precinct with that number of voters and machines. Figure 2 presents our results. It shows a histogram of the percentage differences among machines of the same precinct, expressed in units of the standard deviation implied by the binomial distribution. The curve is the theoretical distribution; the bars are the frequencies computed from the actual data. As can be seen, the match is quite close.
The graph indicates that only close to 1 percent of the machines have deviations higher than 2 times the expected standard deviation. This frequency is consistent with the theoretical distribution. In fact, if there is anything surprising about the graph, it is that the deviations of the results are if anything too small, as can be seen by the large concentration of results near zero variation. This result has two possible interpretations. One is that there was no fraud. The other is that if fraud was committed, it must have been done by changing every machine in the precinct by a similar percentage. In fact, a fraud of this kind would not be detected with the analysis done so far for it would not alter the variance results among machines. Any hypothesis of fraud that does not comply with this condition would violate the restriction imposed on the deviation of the results by the binomial distribution.
Figure 2. Distribution of the deviation of results of machines relative to the precinct mean (relative to the predicted standard deviation)
[Histogram of machine deviations from the precinct mean, normalized by the predicted standard deviation (variable ttest, horizontal axis from -10 to 10; vertical axis: density, 0 to .8), with the expected theoretical distribution overlaid.]
A Statistical Strategy to Detect the Presence of Fraud
To detect whether the data is compatible with the presence of fraud, we need to develop a model and confront it with the data. We define fraud as the difference between the voters' intent and what the electoral system registered:

    Votes = Intent + Fraud = I + F

We take as our null hypothesis the assumption that there was no fraud, i.e. that F = 0, and develop a test to see whether this hypothesis can be rejected. The problem is that we cannot observe the voters' intent directly. The statistical strategy we adopt begins with finding two sets of independent variables that are correlated with the voters' intent, but not with the fraud. For our purposes, it is not essential that these variables predict the voters' intent perfectly. Even if they do so imperfectly, they may still allow us to reject the hypothesis of no fraud. Notice that the worse the quality of the data, the harder it is to reject the null hypothesis: bad information makes it harder, not easier, to reject the hypothesis of no fraud. To illustrate what we do, we start with a simplified presentation of our approach. In practice, things are a bit more complicated, but the sources of complexity are easier to explain once the fundamental intuition is presented.
Let us take two variables that are correlated with the elector's intent: the November-December 2003 Reafirmazo signature drive for the Recall Referendum petition, and the exit polls. Each of these variables is an imperfect measure of the voters' intent on August 15, 2004. Some people who signed the petition may have changed their opinion in the intervening months. Others might have decided not to sign because signing was not secret, but may have decided to vote given the vote's secrecy. Others may not have been registered in November and hence could not sign, but were registered by August and hence could vote. The lines in the August election were particularly long and slow, which may have reduced the number of voters, and so on. Equally, the exit polls are an imperfect measure of the voters' intent. Pollsters may have, consciously or unconsciously, gathered a biased sample. People may have been more or less willing to cooperate with the interview, and so on. However, these errors are of a quite different nature from the errors in the relationship between signatures and votes, and hence should not be correlated with them. Suppose we have an imperfect measure of the voters' intent in each precinct and we build a graph relating this variable, say the signatures, to the voters' intent. As the signatures are an imperfect measure of the voters' intent, the graph will look like a cloud of dots around some basic relationship (Figure 3a).2 Regression analysis can identify the line that relates the signatures to the voters' intent. The true relationship is 0.7, because that is how we built the data. The estimated relationship using the simulated data is 0.71 +- 0.014, as indicated in the graph.
2 This graph was built with simulated data using a random number generator. The data was created supposing that each signature generates 0.7 votes, with an error distributed between +0.1 and -0.1.
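A simulation in the spirit of footnote 2 can be reproduced along these lines. This is our own sketch: the sample size (580 precincts) and the uniform error distribution are assumptions layered on top of the footnote's description.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated precincts: signature shares, and intended-vote shares built as
# 0.7 votes per signature plus an error between -0.1 and +0.1 (assumed uniform).
n = 580
firma = rng.uniform(0.1, 0.9, size=n)
intent = 0.7 * firma + rng.uniform(-0.1, 0.1, size=n)

# OLS of intent on the signatures (with intercept) recovers a slope near
# the true 0.7 used to generate the data.
X = np.column_stack([np.ones(n), firma])
beta, *_ = np.linalg.lstsq(X, intent, rcond=None)
print(f"estimated slope: {beta[1]:.3f}")
```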
Figure 3a Simulated relationship between signatures and voters’ intent
[Residuals plot: e(int_vote | X) on the vertical axis (-.4 to .6) against e(firma | X) on the horizontal axis (-.5 to .5); coef = .71490196, se = .01417772, t = 50.42]
We cannot observe the voters' actual intent, only the registered votes, and these, in theory, could be influenced by fraud. Suppose fraud takes place and is proportional to the number of YES votes in each precinct. For example, suppose fraud is committed by multiplying the total number of YES votes in a machine by 0.7 and adding the difference to the NO votes. Figure 3b illustrates this case. The estimated slope is no longer 0.7 but roughly 0.5 (0.7 times the true 0.7). In addition, the pattern of errors, that is, the distances with respect to the regression line, looks similar to before: it reveals no evidence of fraud. If fraud were committed this way, we would be unable to detect it with our method. In fact, a fraud that removes a fixed percentage of YES votes across all machines would be practically impossible to detect by purely statistical methods. It could only be detected using another source of information, such as counting the paper ballots.
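The invisibility of a purely proportional fraud can be seen in a small simulation (same assumed setup as in the previous sketch): the slope shrinks by exactly the fraud factor, but the residuals are just a rescaled copy of the no-fraud residuals, so their pattern gives nothing away.

```python
import numpy as np

rng = np.random.default_rng(5)

n = 580
firma = rng.uniform(0.1, 0.9, size=n)
intent = 0.7 * firma + rng.uniform(-0.1, 0.1, size=n)
votes = 0.7 * intent   # proportional fraud: 30% of YES votes removed everywhere

X = np.column_stack([np.ones(n), firma])
b_intent, *_ = np.linalg.lstsq(X, intent, rcond=None)
b_votes, *_ = np.linalg.lstsq(X, votes, rcond=None)
print(b_intent[1], b_votes[1])   # slope drops by the fraud factor 0.7

# The residual patterns are indistinguishable up to scale: correlation 1.
r_intent = intent - X @ b_intent
r_votes = votes - X @ b_votes
print(np.corrcoef(r_intent, r_votes)[0, 1])
```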
Figure 3b Simulated ratio between signatures and votes with fraud proportional to 30 percent of the Yes votes.
[Residuals plot: e(vote0 | X) on the vertical axis (-.4 to .4) against e(firma | X) on the horizontal axis (-.5 to .5); coef = .50043138, se = .0099244, t = 50.42]
Now suppose the fraud was not committed in a proportional manner; for example, suppose it was committed in some precincts and not in others. Specifically, suppose the fraud consists of eliminating 30 percent of the YES votes in precincts where the signatures were less than 30 percent or more than 70 percent of the registered voters. In this case, the pattern of errors has a peculiar shape, as shown in Figure 3c. This peculiarity is due not to the imperfection of the signatures as a predictor of votes, but to the fraud itself.
Figure 3c. Non proportional fraud
[Residuals plot: e(vote1 | X) on the vertical axis (-.4 to .6) against e(firma | X) on the horizontal axis (-.5 to .5); coef = .57435291, se = .01600038, t = 35.9]
What happens if we now use a second measure of the voters' intended vote, for example the exit polls? This is also an imperfect measure of the voters' intent, and as such it will generate errors in a regression analysis. Nevertheless, if there is non-proportional fraud, it will also generate an irregularity in these errors that will look similar, i.e. will be correlated with the errors of the other relationship. A positive and significant correlation would therefore identify non-proportional fraud.

Note that each measure – signatures and exit polls – is imperfect. However, the factors that make each of them imperfect are different and independent of each other. The exit poll is not influenced by the abstention rate, since people are interviewed after they vote. The signatures do not depend on the ability or bias of the interviewer. People could have changed their minds between November and August, but people do not change their minds for the same reason between the act of voting and the exit interview. Signing is a public act while voting is secret, and so on. Therefore, the errors made by each measure may be larger or smaller, but they should not be correlated. If there is non-proportional fraud, however, it will influence each of these measures in the same way, so the errors made by both should be positively correlated. This is the essence of the method we used.
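The logic can be sketched in a simulation (hypothetical shares and noise levels, not the paper's data). With plain OLS, the two residual series share a small correlation even without fraud, because errors in variables leave a trace of the common intent in both regressions; that is the bias the instrumental-variables step of the next section is meant to remove. Non-proportional fraud, however, pushes the correlation far above that baseline:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000
intent = rng.uniform(0.2, 0.8, n)              # unobserved Yes intent per precinct
firma = 0.9 * intent + rng.normal(0, 0.06, n)  # signatures: one noisy, biased proxy
exitp = intent + rng.normal(0, 0.06, n)        # exit poll: an independently noisy proxy

hit = rng.random(n) < 0.3                      # precincts hit by non-proportional fraud
votes_clean = intent
votes_fraud = np.where(hit, 0.7 * intent, intent)

def ols_resid(x, y):
    """OLS residual of y on x, with an intercept."""
    xd = x - x.mean()
    b = (xd * (y - y.mean())).sum() / (xd * xd).sum()
    return y - y.mean() - b * xd

# correlate the residuals of votes-on-signatures with those of votes-on-exit-poll
for v, label in [(votes_clean, "no fraud"), (votes_fraud, "fraud")]:
    r = np.corrcoef(ols_resid(firma, v), ols_resid(exitp, v))[0, 1]
    print(label, round(r, 2))
```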
Instrumental Variable Approach
In this section we formally derive the technique we use. In particular, we show that, for a variety of increasingly complex assumptions about the nature of the fraud, the
covariance between the errors of the instrumental variables regressions is an appropriate test of the absence of fraud.

Define the fraud as the difference between the SI votes actually recorded and an unobservable variable: the voting intention of the voters who showed up. Denote the recorded votes by Vi, the intention of voters by Xi, and the fraud by Fi:

Vi = Xi + Fi

There are also two additional measures of the intention of voters: the exit polls (Ei) and the signatures (Si) in each precinct. These measures, however, are imperfect. We assume a very general form of that imperfection – a random coefficient model – but to make the point clear we start with a simpler error structure and then generalize it. Assume that

Ei = a*Xi + epsi
Si = b*Xi + etai

where the exit polls are possibly a biased estimate of the intention to vote (a can be smaller than one), and the signatures (Si) could likewise be a biased measure. Both equations have an error term (epsi and etai) that captures the fact that the exit polls and the signatures are very imperfect measures of the voters' intentions – even of the biased measured intentions. We assume that these errors are uncorrelated with each other and with the fraud.3

How can we detect the fraud? The fraud can affect only the actual votes, not the exit polls or the signatures. In other words, the fraud is a displacement of the distribution of votes that is not present in the other two measures. Statistically, this means that the fraud can be detected by using the exit polls and the signatures as predictors of the voting process and analyzing the correlation structure of the residuals. Under the assumption that all residuals are uncorrelated – which makes sense given the definitions we have adopted – the correlation of residuals is an indication of the magnitude of the fraud.

The procedure used to detect the fraud is the following:

1. Estimate the regression of Vi on Ei plus controls and recover the residual. This residual has two components: the fraud and the errors-in-variables residual due to the fact that the exit polls are noisy.
2. Estimate the regression of Vi on Si plus controls and recover the residual. This residual has two components: the fraud and the errors-in-variables residual due to the fact that the signatures are an imperfect measure of the intention of voters.

Notice that these two residuals are correlated: first, because both have the fraud as an unobservable component, and second, because the right-hand-side variables are correlated and there are errors in variables in the regressions.

3 This is a reasonable assumption, considering that the signatures were collected at different times and under different conditions than the exit polls.
3. Estimate the regression of Vi on Ei plus controls using Si as an instrument. Recover the residual. Notice that in our model, because etai is uncorrelated with epsi and Fi, we can use Si as an instrument to correct for the error in variables.
4. Using the same logic, estimate Vi on Si plus controls, using Ei as the instrument, and recover the residual.

Because the instrumented coefficients are supposed to have solved the errors-in-variables problem, the residuals can now be correlated only if they share a common component – which, in our case, is by definition the fraud. The procedure therefore measures how important the fraud is. The next section first explains why this procedure is indeed able to identify the fraud. We then analyze the possibility that the fraud is correlated with the signatures – which is likely, given what we have argued about the stochastic properties of the votes per machine and precinct. Finally, we present the evidence.
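The four steps can be sketched on simulated data (a hypothetical illustration, without the controls that the actual estimation includes). With no fraud, the IV residuals are essentially uncorrelated; with a non-proportional fraud that is independent of intent, their correlation is clearly positive:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000
X = rng.uniform(0.2, 0.8, n)              # Xi: intent (unobservable in practice)
E = X + rng.normal(0, 0.06, n)            # Ei: exit-poll measure (a = 1)
S = 0.9 * X + rng.normal(0, 0.06, n)      # Si: signature measure (b = 0.9)
hit = rng.random(n) < 0.3                 # precincts hit by the fraud
F = np.where(hit, -0.15, 0.0)             # Fi: 15 points of Yes share removed

def iv_resid(y, x, z):
    """Residual of y regressed on x, instrumenting x with z (intercepts included)."""
    zd, xd, yd = z - z.mean(), x - x.mean(), y - y.mean()
    b = (zd * yd).sum() / (zd * xd).sum() # IV slope: cov(z, y) / cov(z, x)
    return yd - b * xd

for fraud, label in [(np.zeros(n), "no fraud"), (F, "fraud")]:
    V = X + fraud                         # Vi = Xi + Fi
    r1 = iv_resid(V, E, S)                # step 3: V on E, instrumented by S
    r2 = iv_resid(V, S, E)                # step 4: V on S, instrumented by E
    print(label, round(np.corrcoef(r1, r2)[0, 1], 2))
```

The instrumenting purges each residual of its own errors-in-variables component, so any remaining correlation is attributable to the common component, the fraud.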
OLS estimation with no correlation between fraud and intention to vote
Consider the OLS regression of votes on the exit poll. The model is

Vi = Xi + Fi
Ei = a*Xi + epsi

so that

Xi = (1/a)*Ei - (1/a)*epsi

Substituting into the voting equation,

Vi = c1*Ei + psi1

where psi1 = Fi - (1/a)*epsi. In this model, the OLS coefficient is

c1ols = a*var(Xi) / (a^2*var(Xi) + var(epsi))

which is always smaller than 1/a, the true coefficient. This means that the residual from the regression (psi1) is

psi1 = Fi + (1/a - c1ols)*Ei - (1/a)*epsi

We can do the same for the signatures. Everything is symmetric, so the equations are almost identical:

Vi = Xi + Fi
Si = b*Xi + etai

which means that

Xi = (1/b)*Si - (1/b)*etai

Substituting Xi into the Vi equation,

Vi = c2*Si + psi2

where psi2 = Fi - (1/b)*etai. In this model, the OLS coefficient is

c2ols = b*var(Xi) / (b^2*var(Xi) + var(etai))

which is always smaller than 1/b; it equals 1/b only when the variance of etai is zero. The resulting residual is

psi2 = Fi + (1/b - c2ols)*Si - (1/b)*etai

Notice that the two residuals are correlated. Under the assumption that epsi and etai are uncorrelated with each other and with the fraud, two components create the correlation between these residuals: the fraud, and the errors-in-variables bias:

cov(psi1, psi2) = var(Fi) + (1/a - c1ols)*(1/b - c2ols)*cov(Ei, Si)

The first term is the variance coming from the fraud, while the second comes from the errors in variables present in both Ei and Si. Notice that we are assuming that the errors in variables are independent. The covariance arises because the error-in-variables downward biases both coefficients (c1ols