Electronic Discovery — Myths, Fables and Folklore
Next in our Electronically Stored Information (ESI) series I had planned to discuss the collection, review and culling of ESI.
I changed my mind after reading a few articles and recently decided cases that discussed some crazy myths and fables about ESI and e-discovery.
Let’s talk about myths, legends and folklore in the world of ESI.
Portable Document Format (PDF) documents are always “full text searchable”:
Wrong. A PDF document is searchable only if it contains a text layer. A PDF generated directly from a word processor usually carries that text; a PDF made by scanning paper is just a picture until optical character recognition (OCR) is run on it. OCR is simply a program that reads the image of a page and interprets what it sees as text, so that a search for “dog” can find the word when the program examines the document. You can create a PDF document (or series of documents) that is NOT full text searchable, and determining whether a PDF is full text searchable is not always as obvious as one may think. To search a scanned PDF, you must have the OCR data.
Tagged Image File Format (TIFF) documents are always “full text searchable”:
Wrong. TIFF is a “picture” of the document. See answer above concerning OCR.
Paper Documents are the same, but just a copy of the electronic file:
Wrong. First, the sheer magnitude of ESI significantly outweighs paper documents. The different types of ESI also vary greatly. A picture of a dog printed on a piece of paper is little different from text on a piece of paper – it is a paper document. In the ESI world, though, the picture of the dog may be a GIF, JPEG, BMP, RAW, PNG, TIFF, PDF, RGBE, CGM, or another of over 25 different formats. A text paper document as an ESI file might come from Word, WordPerfect, WordPad, OpenOffice, Notepad, WordStar, TextEdit, or any of over 75 additional text editors.
Now add to this equation “metadata”, or what is referred to in ESI as “data about data”.
Also consider “portability” in the distinction of paper vs. electronic. A single gigabyte of information you could carry around on a flash drive might equate to over 150,000 pieces of paper and take up over 30 cubic feet of space; about 200 pounds of paper. If a party anticipates producing a terabyte of ESI, that’s 150,000,000 pieces of paper and 30,000 cubic feet of space.
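The scaling in the paragraph above is simple multiplication. Here is a minimal sketch of it, taking the rough estimates in the text (150,000 pages and 30 cubic feet per gigabyte) as given and assuming a terabyte is treated as 1,000 gigabytes. The function and constant names are illustrative only.

```python
# Rough paper equivalents per gigabyte, taken from the estimates in the text.
PAGES_PER_GB = 150_000
CUBIC_FEET_PER_GB = 30


def paper_equivalent(gigabytes):
    """Return (pages, cubic_feet) for a given volume of ESI in gigabytes."""
    return gigabytes * PAGES_PER_GB, gigabytes * CUBIC_FEET_PER_GB


# One terabyte, treated here as 1,000 gigabytes.
pages, cubic_feet = paper_equivalent(1_000)
print(f"{pages:,} pages, {cubic_feet:,} cubic feet")
```

Real page counts per gigabyte vary widely by file type, so these numbers are best read as an order-of-magnitude illustration.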
When asking for the production of ESI, you should always ask for and get “metadata”:
Wrong. First, you should know what metadata really is as it relates to given formats of documents. The extent, amount and type of metadata you can recover from program-produced documents varies greatly from format to format. A quick answer to what metadata is: “data about data”. An answer that is absolutely accurate and could not be more useless.
What metadata is, whether it is important and how it can be used is a subject about which entire books are written. Craig Ball, noted e-discovery guru, has written a comprehensive 36-page paper on it that should be read.
Consider your computer contains only ones and zeroes for all the data it processes. The various programs help your computer learn how to read and interpret those ones and zeroes. The metadata tells the computer where the data you need is located and how to retrieve and sort it by various bits of information, including the metadata basics:
- Original file path;
- File name;
- Last modified date;
- Last modified time.
There are many, many types of metadata and many are unique to the program creating the particular data. The Windows Shell includes over 280 types of properties (metadata).
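To make the metadata basics concrete, here is a minimal sketch that reads those four fields for a file using only the Python standard library. It is an illustration, not a collection tool. The function name and dictionary keys are made up for this example. The path reported is simply where the file sits now (the "original" path has to be recorded at collection time), and application-level metadata stored inside a document is not touched at all.

```python
import os
import tempfile
from datetime import datetime


def basic_metadata(path):
    """Collect the basic file-system metadata fields for one file."""
    full_path = os.path.abspath(path)
    # st_mtime is the last-modified timestamp; shown here in local time.
    modified = datetime.fromtimestamp(os.stat(full_path).st_mtime)
    return {
        "file_path": full_path,
        "file_name": os.path.basename(full_path),
        "last_modified_date": modified.strftime("%Y-%m-%d"),
        "last_modified_time": modified.strftime("%H:%M:%S"),
    }


# Demonstrate on a throwaway file so the example runs anywhere.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as handle:
    handle.write(b"sample contents")
    sample_path = handle.name

for field, value in basic_metadata(sample_path).items():
    print(f"{field}: {value}")

os.remove(sample_path)
```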
Is metadata essential to your ability to use a given electronic production set? Maybe or maybe not, but the guiding principles of e-discovery, the Sedona Principles, provide a reasonable standard for when metadata should be produced:
“Absent party agreement or court order specifying the form or forms of production, production should be made in the form or forms in which the information is ordinarily maintained or in a reasonably usable form, taking into account the need to produce reasonably accessible metadata that will enable the receiving party to have the same ability to access, search, and display the information as the producing party where appropriate or necessary in light of the nature of the information and the needs of the case.”
Deleted ESI magically disappears into the “ethernet bucket in the sky”:
Wrong. What happens when most users delete an electronic file?
Now, we are not talking about a user taking a disk wiping program and forensically deleting information, although even that will leave you with potentially useful information. We are talking about a user deleting information by pressing the delete key.
When a Microsoft Word document, for example, is deleted, the reference data pointing to the document is moved to the recycle bin. So, to recover that file, the recycle bin would be the first place to look. When the user empties the recycle bin, the data comprising the document is still there on the disk, but it looks different to the system.
An analogy. Let’s say you want to find a book in a library on the history of computers; you might go to the card catalog and look it up. If the card is there, it will tell you where in the library the book is located. Let’s say someone removes the card from the catalog. The book is still in the library, but you have no idea how or where to find it. The card in the catalog is the book’s metadata.
That, essentially, is deleting. The computer changes the identifying path for the document and allows the space occupied by that document to be used and overwritten by other data and programs. Until the data comprising that particular document is overwritten, it is still there on the disk and still entirely recoverable.
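The card catalog analogy can be acted out in a few lines of code. What follows is a toy model only, not how any real file system is implemented: a "catalog" maps file names to numbered blocks, deleting removes the catalog entry and marks the block reusable, and the data itself stays put until something new is written over it. The class and method names are invented for this illustration.

```python
class ToyDisk:
    """A toy model only: a catalog (file table) pointing at numbered blocks."""

    def __init__(self, block_count):
        self.blocks = [None] * block_count   # raw storage
        self.catalog = {}                    # file name -> block index
        self.free = set(range(block_count))  # blocks available for writing

    def write(self, name, data):
        index = min(self.free)               # lowest-numbered free block
        self.free.remove(index)
        self.blocks[index] = data            # any old contents are overwritten
        self.catalog[name] = index

    def delete(self, name):
        # Remove the "catalog card" and mark the space reusable.
        # The data in the block itself is left untouched.
        self.free.add(self.catalog.pop(name))

    def carve(self):
        """Return leftover data in blocks the catalog no longer points to."""
        return [self.blocks[i] for i in sorted(self.free)
                if self.blocks[i] is not None]


disk = ToyDisk(block_count=2)
disk.write("memo.doc", "quarterly figures")
disk.delete("memo.doc")
print("memo.doc" in disk.catalog)    # False: the catalog no longer lists it
print(disk.carve())                  # ['quarterly figures']: still on the disk
disk.write("new.doc", "lunch menu")  # reuses the freed block
print(disk.carve())                  # []: the old contents are now overwritten
```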
So, when you burn or shred all copies of a document it may be lost forever; not so much with digital files.
Once files are deleted they are undiscoverable:
Wrong. Whether the deleted data “should be” discoverable or not is driven largely by the time that has passed since deletion and the amount of additional data added to the drive upon which the deleted data resides. In addition, because electronic data is so ubiquitous, multiple copies of the same documents may reside in other locations, or be recoverable by other means, including:
- E-mail records as attachments (e-mail server, sent box, file server, other users’ boxes);
- Removable drives;
- Cloud storage;
- Laptops, home computers;
- Through the use of undelete recovery programs;
- Through the use of computer forensic techniques, including slack space analysis and file table recovery.
Should the court allow you to employ these approaches? That depends on the case that can be made for destruction (whether inadvertent or otherwise), relevance, discoverability, and which party should bear the costs.
Attorneys need not know anything about an organization’s IT structure:
Wrong. Without a reasonably detailed understanding of how an adversary produces, processes and stores information, an attorney cannot possibly do the job necessary to formulate good and responsible discovery. A lawyer who knows little about her adversary’s data is much less likely to know the specifics of what to ask for or to be able to verify a complete production.
Optimally, before any actual discovery is undertaken, you will have discussed in detail the way in which your adversary’s organization handles its data and in what formats they maintain it.
When receiving production of ESI, you always want “native production”:
Wrong. In some productions you may want the files produced in a native format, but native production has its own set of problems.
What is “native format”? When you receive production in the same file formats in which the authoring programs created the files, that is native. So, if I send you a collection of documents produced in WordStar, you might need the WordStar program to view those files. If I send you a production set of Excel spreadsheets, you will need the Microsoft program, Excel, to open and view them. If I send you a production set that was produced in a proprietary database programmed particularly for my use, you will need that program. So, the availability of the authoring programs and whether you, as the receiving party, have access to them are driving concerns for native production.
Another native production issue, one that can be worked around, is metadata. When you open the native file, you may change some of the metadata associated with that file. The “workaround” is to have metadata produced as a separate piece of data.
Keyword searching is the end-all for finding relevant ESI:
Wrong. There is probably no end-all method for searching ESI for relevant documents. Keyword searches are still a valid tool if used properly, but they are not the only effective method. As skeptical as I have been about it, predictive coding looks like it could be one of the more viable approaches. But caution should be taken: regardless of the judges who have recently become enamored with predictive coding, it, too, has its problems.
Probably the first approach to ESI search was custodial review. You determine the key custodians most likely to possess relevant data and documents and you review each of their computers, and other locations to which each custodian had access, for relevant and responsive ESI.
Then keyword searching became popular, but it still employed the isolation of data and documents from key custodians. Keyword searching has many pitfalls. One of the classic problems is “code words”. If, while doing my business, I use trade words or code words unique to my business, is it likely that, unless I disclose that fact, much of the relevant information will be missed in a keyword search? Yes. Misspellings, code words and synonyms are all obvious problems. Finally, keyword searching completely removes context from the search, and this often results in far too many documents that require manual, human review.
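These pitfalls are easy to see in a small sketch. The documents below are invented for illustration, and “Project Bluebird” is a hypothetical stand-in for an internal code word. The search is a plain whole-word match, which is an assumption about how a basic keyword tool behaves rather than a description of any particular product.

```python
import re


def keyword_hits(documents, keyword):
    """Return the ids of documents containing the keyword as a whole word."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [doc_id for doc_id, text in documents.items() if pattern.search(text)]


# Hypothetical documents, made up for this example.
documents = {
    "doc1": "The merger closes on Friday.",
    "doc2": "Keep the murger quiet until Friday.",           # misspelling
    "doc3": "The acquisition paperwork is ready.",           # synonym
    "doc4": "Project Bluebird is a go.",                     # code word
    "doc5": "Our merger of the two mailing lists is done.",  # out of context
}

print(keyword_hits(documents, "merger"))  # ['doc1', 'doc5']
```

The search misses the misspelling, the synonym and the code word, and it returns a document about mailing lists because the keyword carries no context.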
Image-only documents in a system or on a computer that have not been run through optical character recognition (OCR) and, thus, are not full text searchable will all be missed in a keyword search approach; so may pictures, graphics, PowerPoint presentations, CAD drawings, spreadsheets, and databases.
Can these problems be overcome? Things can be done to improve accuracy. Sampling of smaller data sets and testing for accuracy can cut missed data and documents way down.
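One way to picture sampling and testing is this: draw a random sample from the documents a search did NOT return, have a person review the sample, and use the share of relevant documents found to estimate how much was missed overall. The sketch below is a simplified illustration under that assumption. The `is_relevant` argument stands in for a human reviewer's judgment, and all names and numbers are hypothetical.

```python
import random


def estimate_missed(unreturned_ids, is_relevant, sample_size, seed=0):
    """Estimate how many relevant documents a search left behind.

    unreturned_ids: ids of documents the search did NOT return
    is_relevant:    stand-in for a human reviewer's call on one document
    Returns (miss_rate_in_sample, estimated_missed_documents).
    """
    population = list(unreturned_ids)
    rng = random.Random(seed)  # fixed seed so the same sample can be redrawn
    sample = rng.sample(population, min(sample_size, len(population)))
    if not sample:
        return 0.0, 0
    found = sum(1 for doc_id in sample if is_relevant(doc_id))
    miss_rate = found / len(sample)
    return miss_rate, round(miss_rate * len(population))


# Hypothetical set: 1,000 unreturned documents, every tenth one relevant.
relevant_ids = set(range(0, 1000, 10))
rate, missed = estimate_missed(range(1000), relevant_ids.__contains__, 200)
print(f"sample miss rate: {rate:.1%}, estimated missed documents: {missed}")
```

If the estimate is too high, the search terms can be revised and the test repeated. A real protocol would also address sample size and margin of error, which this sketch leaves out.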
Conceptual and emotive searching have also been employed with keyword search to improve accuracy. Concept searching was well described by Chaplin and Jytyla at Kroll Ontrack:
For example, a conceptual search for “cellular” will return documents containing the words “mobile” and “Federal Communications Commission” in a document set involving telecommunications. Moreover, this intelligent search technology will return documents containing the words “genetics” and “molecular” in response to an identical conceptual search for “cellular” if its analysis reveals that the document set contains documents regarding biology. Abbreviations, acronyms, text and email slang, along with industry and corporate specific terminology are continually progressing. Conceptual search can adapt to changes in the way language is used and the ever growing amount of information.
Conceptual searching has its drawbacks that are similar to keyword searching in many respects.
Emotive searching is similar to conceptual searching, except that it combines the framework of keywords and concepts with an effort to give the search results an emotional context.
Predictive coding, though, is the newest 800-pound gorilla on the block, largely because of a few judges who see it as a real cost saving measure as opposed to the other methods. I suggest that if predictive coding is done properly, the cost savings will be much less than its proponents predict. That said, predictive coding may well still be a solution long in waiting.
Predictive coding combines machine-learning technology and work flow processes that use keyword search, filtering and sampling to automate portions of an e-discovery document review. The process relies on sampling and human review: as human reviewers code the sampled documents, the software “learns” the difference between responsive and non-responsive documents. The problem here is obvious: who will be involved in and control sampling and teaching the software? In some new litigation using predictive coding, the parties have agreed to cooperate by using a competent, expert third party to do the initial setup and training work. Possibly cooperation and collaboration will make predictive coding the best we have had so far – only time will tell.
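To give a feel for what it means for software to “learn” from human coding, here is a deliberately tiny sketch. It uses a naive Bayes word-count classifier, which is a textbook technique chosen for this illustration and not a claim about what any commercial predictive coding product does. The seed documents and labels are invented, and the code assumes the reviewers have coded at least one document of each kind.

```python
import math
from collections import Counter

LABELS = ("responsive", "non-responsive")


def tokenize(text):
    return text.lower().split()


def train(coded_documents):
    """Learn word counts per label from human-coded (text, label) pairs."""
    word_counts = {label: Counter() for label in LABELS}
    doc_counts = Counter()
    for text, label in coded_documents:
        word_counts[label].update(tokenize(text))
        doc_counts[label] += 1
    return word_counts, doc_counts


def predict(model, text):
    """Return the more probable label for an unreviewed document."""
    word_counts, doc_counts = model
    vocabulary = set().union(*word_counts.values())
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in LABELS:
        counts = word_counts[label]
        total_words = sum(counts.values())
        # Log prior plus log likelihood with add-one (Laplace) smoothing.
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical seed set coded by human reviewers.
seed_set = [
    ("pricing agreement with competitor discussed", "responsive"),
    ("competitor pricing call scheduled", "responsive"),
    ("lunch order for friday", "non-responsive"),
    ("office parking reminder", "non-responsive"),
]
model = train(seed_set)
print(predict(model, "call about competitor pricing"))  # responsive
print(predict(model, "friday lunch reminder"))          # non-responsive
```

Everything the model knows comes from the seed set, which is why the question of who selects and codes those documents matters so much.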