Large Data and Document Production — Keyword Search and Predictive Coding
The Biomet hip implant litigation (In Re Biomet M2a Magnum Hip Implant Products Liability Litigation) involves over 19 million documents and is a strong example of why plaintiffs must get involved early in document discovery.
The defendants applied a combination of keyword searches and predictive coding to cull the more than 19 million documents with which they started down to a collection of under three million. The search and culling methods were apparently neither agreed upon nor discussed by the parties beforehand. The defendants did, however, offer to let the plaintiffs provide additional keyword search criteria (after the search had already been performed) and invited them to review samples of the predictive coding output.
Here is the problem, as I see it, from the plaintiffs’ perspective. If they knew the methods being used, they should have objected before the culling began. By the time Biomet had completed all the keyword isolation and predictive coding, it had already expended a large amount of money and resources.
The problem with Biomet’s approach is twofold. First, keyword searching is only as good as the keywords and search phrases used, and the plaintiffs’ exclusion from keyword selection is a significant detriment to them. Involvement would have let them evaluate how Biomet did business with respect to its hip implants, and would have let them learn about the corporate structure, terminology, and other important details of the hip manufacturing division.
Second, performing keyword searching before predictive coding may have narrowed the document group too much. That keyword searching alone reduced the collection from over 19 million documents to three million suggests, I think, how aggressively it must have been applied.
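To see why keyword culling can over-narrow a collection, it helps to remember that it is conceptually just a filter: any responsive document that happens not to contain one of the chosen terms is silently dropped. The terms and documents below are invented for illustration; Biomet’s actual search criteria were not disclosed in advance.

```python
# Minimal sketch of a keyword cull. Terms and documents are hypothetical.
def keyword_cull(documents, keywords):
    """Keep only documents containing at least one keyword (case-insensitive)."""
    terms = [k.lower() for k in keywords]
    return [doc for doc in documents
            if any(term in doc.lower() for term in terms)]

docs = [
    "Revision surgery report for the M2a Magnum device",
    "Quarterly sales figures, orthopedics division",
    "Email re: metal ion levels in patient follow-up",
]

# A narrow term list silently drops the metal-ion email,
# even though it may be highly relevant to the litigation.
survivors = keyword_cull(docs, ["M2a Magnum", "recall"])
```

The point of the sketch is that the filter has no notion of relevance, only of literal term matches, which is why the choice of terms matters so much.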
Predictive coding (also called predictive culling) is a process in which human reviewers examine a cross-section of the document collection and code each document’s relevant fields (document type, dates, authors, relevance, etc.). Once a sufficient number of documents have been reviewed and coded, the software is tested to see whether it can proceed with additional culling of the documents on its own. The predictive coding software has been watching and (hopefully) learning from the human-coded data, and at some point it becomes competent to make the same decisions the human coders were making. This learning process requires test batches and evaluation of the error rates in those batches.
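The train-then-test cycle described above can be sketched in miniature. This is a toy stand-in (commercial predictive coding platforms use far more sophisticated classifiers); the seed and test documents are invented, but the shape of the process is the same: humans code a seed set, the software learns from it, and a human-coded test batch measures the error rate.

```python
from collections import Counter

def train(seed_docs):
    """Learn word counts per label from human-coded seed documents."""
    counts = {"relevant": Counter(), "irrelevant": Counter()}
    for text, label in seed_docs:
        counts[label].update(text.lower().split())
    return counts

def predict(counts, text):
    """Label a document by which label's vocabulary it overlaps more."""
    words = text.lower().split()
    scores = {label: sum(c[w] for w in words) for label, c in counts.items()}
    return max(scores, key=scores.get)

def error_rate(counts, test_batch):
    """Fraction of a human-coded test batch the model gets wrong."""
    wrong = sum(1 for text, label in test_batch
                if predict(counts, text) != label)
    return wrong / len(test_batch)

# Hypothetical human-coded seed set.
seed = [
    ("hip implant failure metal debris", "relevant"),
    ("implant revision surgery complaint", "relevant"),
    ("cafeteria menu holiday schedule", "irrelevant"),
    ("parking lot maintenance notice", "irrelevant"),
]
model = train(seed)

# Human-coded test batch used to decide whether the model may proceed.
test_batch = [("patient implant complaint", "relevant"),
              ("holiday parking notice", "irrelevant")]
rate = error_rate(model, test_batch)
```

In practice, review proceeds in rounds: if the error rate on a test batch is too high, the humans code more documents and the model is retrained before it is trusted with the rest of the collection.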
The predictive coding process gives all parties an opportunity to be involved in the culling of production documents in greater detail without disclosing confidential material. So all the parties should be able to reach a point at which they are satisfied to proceed with the discovery and production process.
The research seems clear that, when done properly, predictive coding is significantly more reliable than keyword searching or even human review. In this case, the judge has allowed a method that can be fraught with flaws (keyword searching) to produce the group to which predictive coding is applied. The defendants are then left with a much smaller selection and testing pool than they would otherwise have, and poorly chosen keywords could let them shape the ultimate group on which predictive coding operates.
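Reliability claims of this kind are normally backed by statistical sampling rather than taken on faith. One common check is an elusion test: draw a random sample from the discarded pile, have humans label it, and estimate how many responsive documents the cull left behind. The sketch below is a minimal illustration with invented numbers, not a description of what was done in the Biomet case.

```python
import random

def estimate_elusion(discard_pile, is_relevant, sample_size, seed=0):
    """Estimate the fraction of relevant documents in the discard pile
    by human-labeling a random sample (an 'elusion' test)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(discard_pile, sample_size)
    relevant = sum(1 for doc in sample if is_relevant(doc))
    return relevant / sample_size

# Hypothetical discard pile: 90 truly irrelevant documents and 10
# relevant ones that the cull wrongly excluded.
pile = ["irrelevant"] * 90 + ["relevant"] * 10
rate = estimate_elusion(pile, lambda d: d == "relevant", sample_size=50)

# 'rate' approximates the true 10% elusion; multiplied by the pile size,
# it estimates how many responsive documents were left behind.
```

Had the plaintiffs been involved early, negotiating the sample sizes and acceptable elusion rates for a test like this is exactly the kind of input they could have provided.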
Simply put, the judge may have unknowingly allowed the defendants to succeed in digitally concealing discoverable documents, but the plaintiffs sound as if they sat on their haunches instead of actively participating, until it was too late.