Monday, April 6, 2009

Great article in the ABA Journal Magazine this month on a research project being conducted by the Text Retrieval Conference (TREC) Legal Track. In short, as the article's title suggests, they're "in search of the perfect search", which translates to "just how many documents are missed by the various search technologies?"

The answer (so far) is sobering:
Legal Track showed Boolean keyword searches using commands such as and, or
and within so many words across a range of different hypothetical topics
found only between 22 and 57 percent of all relevant documents cumulatively
retrieved through a variety of alternative search methods. But the Boolean
search was no better or worse than other more sophisticated search methods
tested, and it still represents the current standard.

Wow. That's pretty bad.

Slightly better news is that different searches find different documents, so if you combine several different types of searches, you're able to find up to 78% of the documents. As it says in the article (and as I've said to clients): ". . . No one off-the-shelf method will solve all of your e-discovery efforts."
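To make the "different searches find different documents" point concrete, here is a minimal sketch in Python. The document IDs and the two result sets are made up purely for illustration (they are not figures from the Legal Track study), and `recall` is a hypothetical helper name. It shows why the union of two imperfect searches can have better recall than either one alone.

```python
def recall(found, relevant):
    """Fraction of the relevant documents that a search actually retrieved."""
    if not relevant:
        return 0.0
    return len(found & relevant) / len(relevant)

# Hypothetical document IDs, invented for this example only.
relevant = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}   # documents we should have found
boolean_hits = {1, 2, 3, 4, 11}              # results of a Boolean keyword search
concept_hits = {3, 4, 5, 6, 7, 12}           # results of a concept search

# Combining searches is a set union of their results.
combined_hits = boolean_hits | concept_hits

boolean_recall = recall(boolean_hits, relevant)    # 4 of 10 relevant docs
concept_recall = recall(concept_hits, relevant)    # 5 of 10 relevant docs
combined_recall = recall(combined_hits, relevant)  # 7 of 10 relevant docs

print(boolean_recall, concept_recall, combined_recall)
```

Note that the combined search still misses documents 8, 9 and 10, which neither method found. Combining methods helps, but it does not get you to 100%.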

So, in practical terms, what does this mean?

Well, for smaller data sets where you can feasibly (if expensively) look at every single document, you are not as reliant on search technology to find the documents. You can categorize them as you see fit (usually by issue coding), and that, along with the bibliographic information coded into the review database, is usually going to be enough to find your documents.

The big issue with this approach, of course, is consistency. Many highly trained lawyers and legal staff just don't code the same (or similar) documents the same way. Near-duplicate technology, for example, can help with this, but you will still find variance in coding between documents that are similar in theme, if not in content.

For larger data sets where it would take years for even a large team of people to look at every document, you have to rely on searches of the electronic data (or even OCR, which is a whole other post in itself).

The key takeaway is that you shouldn't rely on just Boolean searches, or just concept searches, or any other single kind of searching to do it all. Expect to use several kinds of search technology. Expect to come up with a smart set of search terms in the first place. If you're looking for documents about cars, don't just search on "autos" and "cars"; search on "Ford" and "Goodyear" and "mechanic" and "battery" and "gas" as well. These terms will pull up documents that are all about cars but may never actually mention the magic word "car".
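Here is a rough sketch of the search-term point, again in Python. The three documents are invented for the example, and `search` is a hypothetical helper that does a simple case-insensitive whole-word match. It is not how any particular review platform works.

```python
import re

# Made-up documents, for illustration only.
documents = {
    "doc1": "The Ford was towed to the mechanic after the battery died.",
    "doc2": "Quarterly budget review for the marketing team.",
    "doc3": "He traded in both cars last spring.",
}

narrow_terms = ["autos", "cars"]
expanded_terms = narrow_terms + ["Ford", "Goodyear", "mechanic", "battery", "gas"]

def search(docs, terms):
    """Return the sorted IDs of documents containing any term as a whole word."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return sorted(doc_id for doc_id, text in docs.items() if pattern.search(text))

narrow_hits = search(documents, narrow_terms)      # only the document that says "cars"
expanded_hits = search(documents, expanded_terms)  # also picks up the Ford/mechanic/battery document

print(narrow_hits)
print(expanded_hits)
```

The first document is plainly about a car, but only the expanded term list finds it. The flip side is that broad terms like "gas" can also pull in documents that have nothing to do with cars, so expect to review and refine the list.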

Thanks to Ralph Losey's tweet, which led me to the article in the first place.
