Monday, August 18, 2008

The ethics of real world data samples

Another interesting thread on the Yahoo litsupport group over the past few days.

Originally this was a thread about near-duping (near-deduplication): the identification of documents, in the broad sense of the word, that are similar to, but not identical to, other documents. (I'll do a post on near-duping some other time.)

Someone on this thread posed the perennial problem of eDiscovery vendors and consultants: "Where do you get your test data from?" This is important to vendors and consultants because, in order to truly test any given eDiscovery tool, you really need some "real world data" to run through it. A lot of eDiscovery software companies use the publicly available Enron dataset, which has a couple of million emails in it (as I recall), but the problem with everyone using that one dataset is that it's all too easy to tailor your tool to that one publicly available data set.

So for vendors and consultants who are going to spend maybe $20,000 on one license for one tool, having some real-world, gnarly, unpredictable data is a good thing. The problem is actually getting hold of it.

Back to the litsupport thread . . .

One person on the thread had the creative solution of buying used hard drives off eBay (as job lots), forensically recovering whatever data he could on them (as good a test for forensic tools as any), and then running the resulting data through whatever eDiscovery processing tools he wanted to test.

A creative solution for sure. But is it ethical?

Bil Kellerman argued that no, it was not. You do not have rights to the data, even if you have ownership of the media. And it's doubtful that the original owners of the data intended that the information on that hard drive be used that way. An ethical can of worms.

R. Sam Gilchrist, however, argued that it really wasn't a big deal and that it is no more unethical to use a drive you bought than it is unethical or illegal to read papers you bought. His argument was that a person who sells a drive has no reason to think that the information on the drive is proprietary or private once it is sold.

An interesting conundrum . . .
