Quote:
Originally Posted by woj
I wouldn't call the data set "unusable", just have to be more careful about performing statistical analysis on it and reaching conclusions based on the results...
but some conclusions can still be drawn from it, for example that the 6% or whatever he found to illegally vote is likely the lower bound of the actual fraud... so at LEAST 6% of illegals vote... as the biases could only reasonably be expected to exclude "criminals", legal voters would have little reason to opt out of the poll or to lie...
No.
You can't have it both ways. Is the data flawed because the respondents lie and self-select, OR can you use the data to draw conclusions?
IF you accept the data is relevant to your cause, then I will direct you back to the article:
"We begin with an example. Suppose a survey question is asked of 20,000 respondents, and that, of these persons, 19,500 have a given characteristic (e.g., are citizens) and 500 do not. Suppose that 99.9 percent of the time the survey question identifies correctly whether people have a given characteristic, and 0.1 percent of the time respondents who have a given characteristic incorrectly state that they do not have that characteristic. (That is, they check the wrong box by mistake.) That means, 99.9 percent of the time the question correctly classifies an individual as having a characteristic?such as being a citizen of the United States?and 0.1 percent of the time it classifies someone as not having a characteristic, when in fact they do.
This rate of misclassification or measurement error is extremely low and would be tolerated by any survey researcher. It implies, however, that one expects 19 people out of 20,000 to be incorrectly classified as not having a given characteristic, when in fact they do.
Normally, this is not a problem. In the typical survey of 1,000 to 2,000 persons, such a low level of measurement error would have no detectable effect on the sample. Even in very large sample surveys, survey practitioners expect that such a low level of measurement error would have effects that wash out between the two categories.
The non-citizen voting example highlights a potential pitfall with very large databases in the study of low frequency categories. Continuing with the example of citizenship and voting, the problem is that the citizen group is very large compared to the non-citizen group in the survey. So even if the classification is extremely reliable, a small classification error rate will cause the bigger category to influence analysis of the low frequency category in substantial ways. Misclassification of 0.1 percent of 19,500 respondents leads us to expect that 19 respondents who are citizens will be classified as non-citizens and 1 non-citizen will be classified as a citizen. (This is a statistical expectation; the actual numbers will vary slightly.) The one non-citizen classified as a citizen will have trivial effects on any analyses of the overall pool of people categorized as citizens, as that individual will be 1 of 19,481 respondents.
However, the 19 citizens incorrectly classified as non-citizens can have significant effects on analyses, as they are 3.7 percent (19 of 519) of respondents who said they are non-citizens.
Such misclassifications can explain completely the observed low rate of a behavior, such as voting, among a relatively rare or low-frequency group, such as non-citizens. Suppose that 70 percent of those with a given characteristic (e.g., citizens) engage in a behavior (e.g., voting). Suppose, further, that none of the people without the characteristic (e.g., non-citizens) are allowed to engage in the behavior in question (e.g., vote in federal elections). Based on these suppositions, of the 19 misclassified people, we expect 13 (70%) to be incorrectly determined to be non-citizen voters while 0 correctly classified non-citizens would be voters.
Hence, a 0.1 percent rate of misclassification, a very low level of measurement error, would lead researchers to expect to observe that 13 of 519 (2.8 percent) people classified as non-citizens voted in the election, when those results are due entirely to measurement error, and no non-citizens actually voted."
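To make the article's arithmetic concrete, here's a quick back-of-the-envelope sketch in Python using only the hypothetical numbers from the quoted example. The variable names are mine, and the exact percentage shifts slightly depending on how you round the expected counts, but the conclusion is the same either way: a tiny error rate in the huge group swamps the tiny group.

```python
# Back-of-the-envelope reproduction of the quoted example.
# All figures are the article's hypotheticals, not real survey data.

total = 20_000                      # survey respondents
citizens = 19_500                   # true citizens
non_citizens = total - citizens     # 500 true non-citizens
error_rate = 0.001                  # 0.1% of citizens tick the wrong box
turnout = 0.70                      # share of citizens who actually vote

# Expected citizens misclassified as "non-citizens" (the article rounds 19.5 to 19)
misclassified = citizens * error_rate

# Pool of respondents the survey labels "non-citizens"
labeled_non_citizens = non_citizens + misclassified     # about 519

# Misclassified citizens who voted show up as "non-citizen voters",
# even though, by assumption, no actual non-citizen voted at all.
phantom_voters = misclassified * turnout                # about 13

apparent_rate = phantom_voters / labeled_non_citizens
print(f"Apparent non-citizen voting rate: {apparent_rate:.1%}")
# -> roughly 2-3%, produced entirely by a 0.1% measurement error
```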