Northern Exposure: Bayesian Stats and the NSA

One of the things that always amazes my Environmental Science students is the incredibly high number of false positives you can get when doing tests for diseases that are very uncommon in nature, even if the test itself is relatively accurate. The same idea, using Bayesian statistics, can be applied to the questionably-legal wiretapping/data mining done by the NSA to show that the old saw “if you’re not doing anything illegal, you don’t have anything to worry about” is total bunk.

To illustrate, we’ll need to use some numbers (sorry…exams are over and I’ve actually got time to do this now!). Let’s assume that the test has a 99% accuracy rate for finding terrorist activity when it’s there – in other words, the false negative rate is 1% (1 in 100). Now let’s also assume that the false positive rate is 1 in 1000 (i.e. the test correctly clears 99.9% of innocent conversations). [Sounds like a pretty accurate test, right?] Finally, in order to apply Bayes’ Theorem, we need to make an assumption about the background incidence of what we’re looking for (this is known as the prior probability). For the sake of argument, let’s use a very aggressive prior: we’ll assume that 1 in 1,000,000 conversations are actually discussing terrorist plots (surely, given the sheer number of conversations going on in the U.S. in any given day, this is much higher than the actual rate, but it will prove the point very well). Now we can apply a formulation of Bayes’ Theorem to find the probability that a positive result has actually found a terrorist.

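[P(TD) x prior] / [(P(TD) x prior) + (P(FD) x (1-prior))]

where P(TD) is the chance that the test flags a conversation that really is about a terrorist plot, and P(FD) is the chance that it flags an innocent one,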

which is:

(.99 x .000001) / [(.99 x .000001) + (.001 x .999999)]

like you cared about those details.

Now, if the test raises a red flag for a certain person, what is the probability that the person is actually involved in terrorist activity? (Hint: it’s not 99%). By solving the above formula, we get this answer: It’s 0.000989, or 0.0989%. It’s incredibly small. Less than 0.1% of conversations giving a positive result actually involve terrorist activity. In other words, 99.9011% of the people flagged by such a super-accurate system are completely innocent. To put it another way, for every one person they successfully arrest using such a program, they’ll have arrested over 1000 innocent people. And that’s using some pretty optimistic numbers for accuracy and a fairly high rate of incidence of threatening conversations. In reality, the numbers are probably much, much more damning…like one in millions and millions if you lower the background rate to a more realistic number (there's probably actually something like a trillion pieces of data to get mined every day).
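If you'd rather check the arithmetic than take my word for it, here is a minimal Python sketch of the same calculation. The function and variable names are my own invention for illustration, and it assumes the numbers used in the formula above: the test catches 99% of real plots, flags 1 in 1000 innocent conversations, and the prior is 1 in 1,000,000.

```python
def posterior(prior, true_detection_rate, false_detection_rate):
    """Probability that a flagged conversation is a real plot (Bayes' Theorem)."""
    true_hits = true_detection_rate * prior               # flagged and guilty
    false_alarms = false_detection_rate * (1.0 - prior)   # flagged but innocent
    return true_hits / (true_hits + false_alarms)


# Assumed inputs, taken from the formula in the post.
prior = 1e-6                  # 1 in 1,000,000 conversations is a real plot
true_detection_rate = 0.99    # test flags 99% of real plots
false_detection_rate = 0.001  # test flags 1 in 1000 innocent conversations

p = posterior(prior, true_detection_rate, false_detection_rate)
innocent_per_hit = (1.0 - p) / p  # innocent flags for each genuine one

print(f"P(terrorist | flagged) = {p:.6f} ({p:.4%})")
print(f"Innocent people flagged per real terrorist: {innocent_per_hit:.0f}")
```

Change the three inputs and rerun it to see how little the answer cares about the accuracy figures compared with the prior.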

This is an inherent problem when testing for something that is really rare – your test, no matter how accurate it is, is going to turn in a lot of false positives. This raises all sorts of questions about both the expense of such a data-mining system, and whether the results it churns out are worth the price. Furthermore, it puts the problem of civil liberties in a whole new light, as it’s clear that many, many people will be harassed completely unnecessarily (at least let’s hope it’s just an inconvenience, and not time in Guantanamo Bay).
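To see how the rarity of the thing you're testing for drives the result, here is a rough Python sketch that holds the test's accuracy fixed and shrinks the prior. The list of priors and the batch size of one million conversations are purely hypothetical values picked for illustration, not real figures about anyone's data mining.

```python
def posterior(prior, true_detection_rate=0.99, false_detection_rate=0.001):
    """Probability that a flagged conversation is a real plot (Bayes' Theorem)."""
    true_hits = true_detection_rate * prior
    false_alarms = false_detection_rate * (1.0 - prior)
    return true_hits / (true_hits + false_alarms)


def expected_flags(n_items, prior, true_detection_rate=0.99,
                   false_detection_rate=0.001):
    """Expected (genuine, innocent) flags when n_items conversations are scanned."""
    genuine = n_items * prior * true_detection_rate
    innocent = n_items * (1.0 - prior) * false_detection_rate
    return genuine, innocent


# Hypothetical priors, from "common" down to "vanishingly rare".
priors = [1e-3, 1e-6, 1e-9, 1e-12]
batch = 1_000_000  # hypothetical number of conversations scanned

for prior in priors:
    genuine, innocent = expected_flags(batch, prior)
    print(f"prior={prior:.0e}  P(real | flagged)={posterior(prior):.3e}  "
          f"genuine flags={genuine:.3g}  innocent flags={innocent:.3g}")
```

The number of innocent flags barely moves as the prior shrinks, while the number of genuine ones collapses, which is the whole problem in one loop.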

What all this means is that perhaps computer-guided data mining is not the right method to discover terrorist activity. It’s a great idea for credit card companies, where the costs of all those false positives are small (a simple phone call from the company to check if your recent purchases are legit), or with medical testing, where you are willing to put up with a lot of false positives if it means you can have a very low incidence of false negatives (which you definitely don’t want). However, in this case the resources of the police will be spent investigating thousands of false alarms, wasting time trampling on the civil liberties of ordinary Americans, rather than using their human experience and intelligence to uncover potential terrorist plots.

[Hat tip: Bruce Schneier's article in the Minneapolis-St. Paul Star-Tribune.]

Thursday, June 01, 2006
