We have now written about the general environment of e-discovery and some of the rules surrounding e-discovery and electronically stored information (ESI). Now, let’s talk a little about document/data location (search) and collection.
First, know that collecting (and searching for) document and data evidence looks completely different depending on whether you are the producing or the requesting party. Producing parties want to keep their time and expense as low as possible and to produce only the bare minimum required by law. Requesting parties want their cake, they want to eat it, and they want to take all the leftovers home with them.
In the early days of ESI discovery, targeted collections were the norm. In this process, records custodians would locate the “relevant” documents and hand them to counsel, who would then produce them to the receiving party. There are a number of obvious problems with this method. First, the human custodians may well have reasons for not wanting to produce certain records, or for wanting to destroy them. Second, this type of collection can itself lead to data destruction: when a record is accessed on a computer, its metadata can change, which spoils valuable evidence.
Typically, the alternative response to a requesting party’s demand for documents and data is to tell them that a “keyword” search will be done of the responding party’s “computers”: so, opposing party, give me a list of keywords you want me to search for. There are a few problems with this approach:
- The scope of the search is uncertain. What computers will be searched? Will servers, laptops and other devices be searched? Will other locations be searched?
- The requesting party is unlikely to understand enough about the responding party’s data to construct a meaningful set of keywords.
- Keyword searches that do not use keyword plus (see below) miss misspellings, abbreviations and code words.
- Specific custodians. What custodians are key custodians? Who decides key custodians?
- It does not address how documents and data are stored or where they are stored.
- What measures will be taken to preserve data when doing the search process?
- It does not address the possible limitations of the search software employed to look at the documents.
- It suggests a mythical “all connected”, cohesive and fully integrated computer system that can be magically searched from beginning to end. A construct that does not exist in the real world.
Craig Ball, a noted e-discovery expert and lawyer, describes what he calls “the myth of the enterprise search”:
“Consider ‘The Myth of the Enterprise Search.’ Counsel within and without companies and lawyers on both sides of the docket believe that companies have the ability to run keyword searches against their myriad silos of data: mail systems, archives, local drives, network shares, portable devices, removable media and databases. They imagine that finding responsive ESI hinges on the ability to incant magic keywords like Harry Potter. Documentum Relevantus!”
As Laurie Briggs pointed out in Beginning Electronic Discovery – Considering the Basics, knowing how the responding party maintains its information and who the key players within the organization are, based on the subject matter of the case, is essential to formulating the collection method you will want used.
Another important issue is: what are we searching, what type of data are we searching, and what format of documents will we need to search?
What is wrong with keyword searching? Maybe nothing, depending on what you are searching for, the collection you are searching within and the ultimate goals of your search. Usually, though, the search method(s) are dictated by the ESI you are searching through. Here are some search methodologies available for consideration today:
- Keyword – keyword search is a linguistic method that requires a word to be in the data set to retrieve data containing that word. It also requires the entire data set to be searchable. Documents with bad OCR (optical character recognition, a process that makes documents full-text searchable) can be problematic.
- Keyword plus – keyword plus is keywords with Boolean connectors (And, Or, Not), stemming (hous*) or proximity searching (house w/ 5 of mate).
- Clustering – clustering methods are statistically based and group (or cluster) similar data together based upon common factors. This is sometimes referred to as concept or conceptual searching. For example, airplane, aircraft, and plane might all appear together in one cluster.
- Ontologies – ontological methods are linguistically based and group things together based upon a query expanded using lexicons and other techniques. Ontologies are especially effective on foreign language, cryptic language use, and other code-based data sets.
- Predictive coding – this is a search method where software is used to predict what data is relevant. This requires someone to “train” the software for search criteria, test the collections and continue a process of further refining the software’s training.
- Determinative coding – this is a search method where the software extrapolates the relevancy decisions made on data samples by reviewers to an entire data set (addressing the volume problem).
- Pattern analysis – pattern analysis methods look at patterns in the data – these patterns may be word based (linguistic) or similarity/likeness based (statistical). Social network analysis is an example of a statistical based pattern analysis – how often do you communicate with Person A or Person B.
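To make the difference between plain keyword and keyword plus searching concrete, here is a minimal Python sketch. The function names and the three sample documents are invented for illustration; real review platforms have their own query syntax and indexing, so treat this as a toy model of the idea rather than how any particular product works.

```python
import re

def tokenize(text):
    """Lowercase a document and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def has_keyword(tokens, word):
    """Plain keyword search: the exact word must appear."""
    return word.lower() in tokens

def has_stem(tokens, prefix):
    """Wildcard/stemming search, e.g. hous* matches house and housing."""
    return any(t.startswith(prefix.lower()) for t in tokens)

def within(tokens, a, b, distance):
    """Proximity search: word a occurs within `distance` words of word b."""
    pos_a = [i for i, t in enumerate(tokens) if t == a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == b.lower()]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)

def search(docs, predicate):
    """Return the names of the documents whose tokens satisfy the predicate."""
    return sorted(name for name, text in docs.items() if predicate(tokenize(text)))

# Hypothetical sample documents.
docs = {
    "doc1": "The house mate signed the lease in March.",
    "doc2": "Housing costs rose sharply this year.",
    "doc3": "The mate left; weeks later, after many long delays, "
            "someone finally sold the house.",
}

keyword_hits = search(docs, lambda t: has_keyword(t, "house"))
# ['doc1', 'doc3']  (doc2 says "housing", so plain keyword misses it)

stem_hits = search(docs, lambda t: has_stem(t, "hous"))
# ['doc1', 'doc2', 'doc3']

proximity_hits = search(docs, lambda t: within(t, "house", "mate", 5))
# ['doc1']  (in doc3 the two words are too far apart)

boolean_hits = search(
    docs, lambda t: has_keyword(t, "house") and not has_keyword(t, "lease")
)
# ['doc3']
```

Note that none of these searches would find a document that misspells the word or refers to the house by a code name, which is where the other methods on the list come in.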
Why not just deploy all these various search methodologies on a given ESI collection? First, some of these methods are wholly duplicative, and the costs involved would be unnecessarily inflated. Second, no method of searching will be foolproof or perfect. The goal is to settle on the search methodology that comes closest to perfect.
Search and collection of discovery is really about validation of the search results. Regardless of the methods employed to find relevant ESI, the whole process is dominated by the need to validate both the ESI that was retrieved and the ESI that was not. Sometimes what is not being retrieved by the search methodology can tell you more about the flaws or success of the search process than anything else.
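One common way to put numbers on that validation is to have a reviewer code a random sample from the retrieved set and another from the set that was not retrieved. The sketch below is one assumed way of turning those samples into estimates; the terms precision, recall and elusion are standard information retrieval vocabulary rather than anything specific to this series, and the function name and figures are made up for the example.

```python
def validation_metrics(retrieved_total, retrieved_sample, null_total, null_sample):
    """Estimate search quality from two reviewer-coded random samples.

    retrieved_sample / null_sample are lists of booleans, one per sampled
    document: True means the reviewer judged the document relevant.
    retrieved_total / null_total are the sizes of the full sets the
    samples were drawn from (retrieved and not retrieved).
    """
    # Share of retrieved documents that are actually relevant.
    precision = sum(retrieved_sample) / len(retrieved_sample)
    # Share of NOT retrieved documents that are relevant (what eluded the search).
    elusion = sum(null_sample) / len(null_sample)

    est_found = precision * retrieved_total
    est_missed = elusion * null_total
    total_relevant = est_found + est_missed
    # Share of all relevant documents that the search actually found.
    recall = est_found / total_relevant if total_relevant else 0.0

    return {
        "precision": precision,
        "elusion": elusion,
        "estimated_missed": est_missed,
        "recall": recall,
    }

# Hypothetical project: 2,000 documents retrieved, 8,000 left behind.
metrics = validation_metrics(
    retrieved_total=2000,
    retrieved_sample=[True] * 80 + [False] * 20,   # 100 sampled, 80 relevant
    null_total=8000,
    null_sample=[True] * 10 + [False] * 190,       # 200 sampled, 10 relevant
)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In this made-up example, a search that looks good on the retrieved side still leaves an estimated 400 relevant documents behind, which is exactly the kind of thing a look at the non-retrieved set reveals. These are point estimates only; a real protocol would also consider sample size and margin of error.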
In recent months, a federal judge has expressed a fondness for predictive coding and vendors have been promoting that methodology as a “court-endorsed process”. Proponents of predictive coding have written articles suggesting it may be the holy grail of search. Predictive coding essentially works like this: after documents are loaded into the program, a lawyer manually reviews a batch to train the program to recognize what is relevant to a case. The manual review is repeated until the program has developed a model that can accurately predict relevance in the rest of the documents. Based on what I have read so far (and predictive coding is really new), this particular technology, although a very useful tool, will not be the “end all” for ESI search issues yet.
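The train-then-predict idea can be illustrated with a toy classifier. The sketch below uses a small naive Bayes model written from scratch as a stand-in; it is an assumption chosen for brevity and is not a claim about how any vendor's predictive coding engine works. The class name, seed documents and relevance calls are all hypothetical.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lowercase a document and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

class TinyRelevanceModel:
    """Toy naive Bayes model; a stand-in for a real predictive coding engine."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text, relevant):
        """Record one lawyer-reviewed document and its relevance call."""
        self.doc_counts[relevant] += 1
        self.word_counts[relevant].update(tokens(text))

    def score(self, text):
        """Log-odds of relevance: positive leans relevant, negative leans not."""
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total_docs = sum(self.doc_counts.values())
        log_odds = 0.0
        for label, sign in ((True, 1), (False, -1)):
            total_words = sum(self.word_counts[label].values())
            # Add-one smoothing so unseen words do not zero out the score.
            log_p = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            for word in tokens(text):
                count = self.word_counts[label][word]
                log_p += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_odds += sign * log_p
        return log_odds

# Round one: a lawyer codes a small seed batch (hypothetical documents).
model = TinyRelevanceModel()
seed_batch = [
    ("Brake failure reported during widget safety test", True),
    ("Safety analysis shows widget brake defect", True),
    ("Lunch menu for the office party", False),
    ("Parking garage closed for the holiday party", False),
]
for text, relevant in seed_batch:
    model.train(text, relevant)

# The model then ranks documents nobody has reviewed yet.
unreviewed = {
    "a": "Engineer flagged a brake defect in the safety report",
    "b": "Reminder: office party on Friday",
}
ranked = sorted(unreviewed, key=lambda name: model.score(unreviewed[name]), reverse=True)
print(ranked)
```

In practice the loop continues: the lawyer reviews another batch, corrects the model's mistakes, calls `train` again, and repeats until the predictions hold up on test samples.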
So what are some of the obvious pitfalls encountered in any search method?
- Documents in the system that were not OCR’d or in which the OCR is flawed. These documents will not show up in a keyword, keyword plus, or most other searches, other than possibly ontologies. OCR is a process that makes documents that are essentially just pictures (such as PDFs and TIFFs) readable by a machine as full text.
- Documents formatted as TIFFs, GIFs, JPGs and similar graphics files do not lend themselves well to any type of search other than human review.
- Handwriting has always been a problem, since no programs do a good job of interpreting handwriting and most ignore it by treating it as a graphic.
- Use of code words or project codes. Suppose that during the safety analysis stage of a widget, the company referred to the project as the Zebra Project and to the widget as the Zebra. A search of relevant ESI is unlikely to turn up many relevant documents without first knowing about the code reference.
- Calendar entries can be tricky because they are very similar in general, yet users often have their own unique ways of entering them, which means they have to be examined minutely.
- Data with large numbers of “near duplicates” can be nearly impossible to deal with on a keyword search basis and may need a clustering approach.
- Locked or password protected files.
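The near-duplicate pitfall above lends itself to a short illustration. The sketch below groups documents by comparing overlapping three-word sequences ("shingles") with Jaccard similarity. This is one simple, assumed technique, far cruder than the clustering engines in commercial tools, and the threshold, function names and sample documents are invented for the example.

```python
import re

def shingles(text, size=3):
    """Return the set of overlapping word sequences of length `size`."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}

def jaccard(a, b):
    """Similarity of two sets: size of the overlap divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def group_near_duplicates(docs, threshold=0.5):
    """Greedily place each document in the first group whose lead document
    is at least `threshold` similar; otherwise start a new group."""
    groups = []  # each group is a list of document names
    for name, text in docs.items():
        signature = shingles(text)
        for group in groups:
            lead_signature = shingles(docs[group[0]])
            if jaccard(signature, lead_signature) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

# Hypothetical documents: two drafts of the same email and one unrelated note.
docs = {
    "v1": "Please review the attached safety report for the Zebra before Friday",
    "v2": "Please review the attached safety report for the Zebra before Monday",
    "other": "Lunch is at noon in the main conference room",
}

groups = group_near_duplicates(docs)
print(groups)  # [['v1', 'v2'], ['other']]
```

Grouping drafts this way lets a reviewer make one decision per group instead of wading through dozens of keyword hits that differ by a single word.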
Regardless of the search method or methods to be used in any project, information about how the data is kept, who the key custodians are and what characteristics are unique to the data is crucial to evaluating how to approach both collection and review of ESI.
(Our next installment will discuss in more detail the collection and review process.)