Testing for Junk Science in the Discovery Process
What is the “Daubert” standard supposed to test? Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579 (1993).
Daubert and Federal Rule of Evidence 702 are intended to let the court act as a gatekeeper, keeping “junk science” away from jurors. The theory, I suppose, is that jurors could be swayed by a slick presentation of junk science, and that trial time should not be spent requiring the opposing party to present contrary evidence that the science is junk and should be ignored.
If a Daubert standard should or will apply to evidence in a case, shouldn’t it also apply to speculative or new technology used in what is possibly the most important aspect of litigation: the discovery process?
Since Judge Peck’s ruling in Da Silva Moore v. Publicis Groupe, et al., No. 11-cv-1279 (S.D.N.Y.), predictive coding and technology assisted review (TAR) have become e-discovery vendors’ catharsis. From national seminars touting the fantastic virtues of predictive coding to judicial endorsements and orders virtually requiring its use, we have seen a relatively new science flower into a front-runner among the methods claiming to save costs and carve discovery down to a manageable size.
So do I believe predictive coding (only one tool within TAR) is bad, flawed or not the best alternative? Not necessarily.
The principal obstacle to using predictive coding is not the software; it does what it is written and trained to do. The problem comes in the teaching of the software and the selection of the training sets, which require human review and analysis. It is this back-room human manipulation that forms the basis for most of my criticism of predictive coding.
In predictive coding, seed sets of documents are selected by humans either randomly or specifically, coded and tested through the software. The seed set of documents is reviewed by “experts” having knowledge in the subject matter of the case. The determinations made on the seed set comprise the primary reference data to teach the predictive coding machine how to recognize patterns of relevance in the larger document set.
This is where the problem can develop. You know your case and I know my case, but do we know each other’s cases well enough to make the relevance judgments on seed sets that will set the stage for the software to pare down the universe of documents? Shouldn’t this be a process in which all parties participate?
How can a party be assured that seed sets are not garbage in, which will result in garbage out?
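To make the concern concrete, here is a toy sketch of how reviewer coding of a seed set can steer what the software later treats as relevant. It is not how any vendor's product works: the word-weight scoring, the function names `train` and `score`, the sample documents and the two reviewers are all hypothetical, chosen only to illustrate the garbage-in, garbage-out risk.

```python
from collections import Counter

def train(seed_set):
    """Build per-word weights from (text, is_relevant) pairs coded by reviewers."""
    weights = Counter()
    for text, is_relevant in seed_set:
        for word in set(text.lower().split()):
            weights[word] += 1 if is_relevant else -1
    return weights

def score(weights, text):
    """Sum the learned weights of the distinct words in a document."""
    return sum(weights[word] for word in set(text.lower().split()))

# Hypothetical seed documents (illustration only).
seed_texts = [
    "brake defect complaint from dealer",
    "rotor recall cost estimate",
    "holiday party catering menu",
]
# Reviewer A knows "rotor" is in-house jargon for the disputed part.
coding_a = [True, True, False]
# Reviewer B does not, and codes the rotor memo as not relevant.
coding_b = [True, False, False]

unseen = "rotor failure report"
for name, coding in (("A", coding_a), ("B", coding_b)):
    weights = train(list(zip(seed_texts, coding)))
    print(f"Reviewer {name}: score for unseen document = {score(weights, unseen)}")
```

The same unseen document scores positive under Reviewer A's coding and negative under Reviewer B's, even though the software and the seed documents are identical. Only the human coding changed.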
Parties will maintain that statistical sampling should tell us the relevance level of the global document collection. If, after statistical sampling, we conclude that only 1% of the documents are relevant, how can I have confidence in that figure without understanding how we arrived at it? If the sample size is insufficient, won’t that skew the statistical outcome? Is the defined global collection actually that? Have all relevant custodians been collected? Has alternate jargon been accounted for in the selection of keywords?
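The sample-size question can be illustrated with a short calculation. The sketch below uses the Wilson score interval, one common textbook method for putting a range around a sampled proportion; it is an assumption on my part that a given review would use this method, and the sample sizes and the function name `wilson_interval` are hypothetical.

```python
import math

def wilson_interval(relevant, sample_size, z=1.96):
    """Approximate 95% Wilson score interval for a sampled proportion."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = relevant / sample_size
    denom = 1 + z**2 / sample_size
    center = (p + z**2 / (2 * sample_size)) / denom
    half = z * math.sqrt(
        p * (1 - p) / sample_size + z**2 / (4 * sample_size**2)
    ) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical samples in which exactly 1% of the reviewed documents are coded relevant.
for n in (100, 1500, 10000):
    hits = n // 100
    low, high = wilson_interval(hits, n)
    print(f"sample={n:>6}  observed=1%  plausible range={low:.2%} to {high:.2%}")
```

With a 100-document sample the plausible range is several percentage points wide; with 10,000 documents it is well under one point. Even a tight range speaks only to the collection that was actually sampled. It says nothing about custodians who were never collected or about whether the reviewers' relevance calls were sound.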
These are some of the reasons the refusal to disclose the details of the seed sets is frustrating to parties.
- How were the seed sets constructed?
- Who constructed the seed sets?
- What criteria were used to gauge relevance?
These are all things that go into a process that the receiving party is expected to blindly accept.
The courts have largely foreclosed opposing parties from discovering information about the discovery process. And, although many claim to have the right score or statistical measure that should assure everyone that a given predictive coding project went well, each case is different and each set of reviewers is different.
We now have the revisions to the federal rules, particularly Rule 26(b)(1), with its proportionality restrictions and its virtual burying of traditional discovery yardsticks like “reasonably calculated”. The rule now caters to what supporters interpret as a narrower description: “Information within this scope of discovery need not be admissible in evidence to be discoverable”.
In the same rule, proportionality has now been elevated to the top of the considerations in whether discovery will even be permitted. This is nothing more than a corporate welfare rule. Now, the court may look at discovery in terms of it being “relevant to any party’s claim or defense and proportional to the needs of the case”, considering six factors:
- The importance of the issues at stake in the action;
- The amount in controversy;
- The parties’ relative access to relevant information;
- The parties’ resources;
- The importance of the discovery in resolving the issues; and
- Whether the burden or expense of the proposed discovery outweighs its likely benefit.
Hope for the future now lies, in part, in the hands of judges, and in the hope that they will give careful consideration before restraining justified discovery or accepting what we can expect will be the same boilerplate argument: that the discovery is difficult to obtain, largely because of the storage methods chosen by the producing party.
If the court is understandably concerned about proportionality, let us hope judges will see the wisdom in compelling greater and closer cooperation and transparency in a process that is difficult even when using known methods and tools.
Let’s hope that the changes to the Rules, particularly in Rule 26, are interpreted with the idea in mind that it is not the 1930s, and that parties today should take it as well established that they can discover information about documents and the identity of people who know of discoverable information.
Let us hope for judicial insight.