The current state of e-discovery has clients and attorneys caught in a dangerous Catch-22. Courts continue to impose crippling sanctions on clients who fail to adequately preserve relevant company data in anticipation of litigation. For example, technology giant Samsung recently received a devastating adverse inference sanction in the Apple v. Samsung lawsuit and, in a separate matter, a $400 million fine due to its spoliation of relevant data. To avoid such sanctions, many companies now take the broadest steps possible to preserve all potentially relevant data when they anticipate claims. This has led to an explosion in the volume of data that attorneys must collect, process, host, review, and produce during litigation. Although thorough data preservation may prevent case-ending sanctions, the Catch-22 arises when clients face resulting legal bills of hundreds of thousands (and sometimes millions) of dollars for teams of attorneys to review every document in the massive collection for relevance and privilege, on top of the substantial charges that e-discovery vendors invariably impose for processing and hosting these enormous amounts of data. Quite simply, the economics of traditional attorney review of electronically stored information (ESI) are unsustainable.

The growth of such unsustainable e-discovery costs has resulted in the development of a series of technologies broadly known as Technology-Assisted Review (TAR). TAR tools provide attorneys and clients with different methods to review and analyze data without the expense of attorneys manually reviewing every single file. No form of TAR is more hotly discussed than “predictive coding.”

Predictive coding is a form of document review that combines attorney analysis with a computer program that runs complex algorithms to sample and predict relevance across large collections of ESI. The process begins with a senior attorney reviewing a small random sample (e.g., 1,500 documents) drawn from the entire data pool and coding each document as “relevant” or “irrelevant.” The predictive coding program analyzes the common characteristics and concepts across this sample – also known as the “seed set” or “training set” – and applies those criteria to the entire data collection to find documents conceptually related to the representative documents identified in the seed set. For purposes of quality control, the process repeats several times. With each subsequent round of coding random samples of documents, the attorney “trains” the program to recognize relevant documents with increasing precision. Ultimately, the program leaves the attorney with the entire data collection automatically segregated into “relevant” and “irrelevant” buckets.
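
For readers who want a more concrete sense of the mechanics, the workflow described above can be loosely approximated with a general-purpose text classifier. The short Python sketch below is purely illustrative and does not reflect any vendor’s actual product: the documents, the keyword rule standing in for the attorney’s coding decisions, and the choice of an off-the-shelf TF-IDF model are all hypothetical assumptions.

```python
# Illustrative sketch only: commercial predictive coding engines are proprietary,
# but the general seed-set-and-training workflow can be approximated with an
# off-the-shelf text classifier. All documents and labels here are hypothetical.
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical document collection; real matters involve millions of files of extracted text.
collection = [
    "change order request for hangar steel framing delays",
    "holiday party catering menu and RSVP list",
    "structural engineer report on roof truss deflection",
    "fantasy football league standings for week nine",
    "email regarding snow load calculations for hangar roof design",
    "travel itinerary for sales conference in Orlando",
]

random.seed(42)

# Round 1: an attorney codes a small random "seed set" as relevant (1) or irrelevant (0).
# The keyword test below is only a stand-in for the attorney's judgment.
seed_indices = random.sample(range(len(collection)), 4)
seed_docs = [collection[i] for i in seed_indices]
seed_labels = [1 if ("hangar" in d or "truss" in d) else 0 for d in seed_docs]

# The engine learns the common characteristics of the coded examples...
vectorizer = TfidfVectorizer()
X_seed = vectorizer.fit_transform(seed_docs)
model = LogisticRegression()
model.fit(X_seed, seed_labels)

# ...and applies those criteria to the entire collection.
X_all = vectorizer.transform(collection)
scores = model.predict_proba(X_all)[:, 1]  # estimated probability of relevance

for doc, score in sorted(zip(collection, scores), key=lambda pair: -pair[1]):
    bucket = "relevant" if score >= 0.5 else "irrelevant"
    print(f"{score:.2f}  {bucket:10s}  {doc}")

# In practice the cycle repeats: the attorney codes additional samples, the model is
# retrained, and the process stops once quality-control metrics are acceptable.
```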

Of course, the attorney still must review all “relevant” documents for privilege before producing them to opposing counsel, but the attorney avoids reviewing the “irrelevant” files that are now segregated from the data pool. Thus, the population of documents requiring attorney review is drastically reduced, saving significant time and money. Some predictive coding programs produce a simple “yes/no” relevance determination for each document, and attorneys simply review everything identified in the “relevant” bucket. Other programs generate a relevance score for each document, e.g., on a scale of 0 to 100 with 100 being the most relevant. This function allows attorneys to prioritize review and to decide that documents below a certain score are so unlikely to be relevant that no further attorney review is necessary. The bottom line is that predictive coding automatically isolates the irrelevant data and may provide significant savings to a client in terms of legal fees for document review.
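
To make the scoring concept concrete, the following hypothetical snippet shows how a 0-to-100 relevance score might be used to prioritize review and set aside low-scoring documents. The scores, document descriptions, and the cutoff of 20 are invented for illustration; in a real matter the cutoff would be negotiated and validated, not assumed.

```python
# Hypothetical scores standing in for the 0-100 relevance scores an engine might assign.
scored_docs = [
    ("email regarding snow load calculations for hangar roof design", 94),
    ("structural engineer report on roof truss deflection", 88),
    ("change order request for hangar steel framing delays", 71),
    ("travel itinerary for sales conference in Orlando", 12),
    ("holiday party catering menu and RSVP list", 6),
]

REVIEW_CUTOFF = 20  # assumed cutoff; in practice chosen and validated case by case

# Attorneys review the highest scorers first and may set aside anything below the cutoff.
review_queue = [(doc, score) for doc, score in scored_docs if score >= REVIEW_CUTOFF]
set_aside = [(doc, score) for doc, score in scored_docs if score < REVIEW_CUTOFF]

for doc, score in sorted(review_queue, key=lambda pair: -pair[1]):
    print(f"{score:3d}  queue for privilege/production review: {doc}")
print(f"{len(set_aside)} documents scored below {REVIEW_CUTOFF}; no further attorney review planned")
```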

The potential savings offered by predictive coding come at their own steep price. Some vendors charge $150 – $250/GB just to run the predictive coding technology on a data collection. Attorneys and clients must conduct a cost-benefit analysis of their case to determine whether the high price tag of predictive coding will yield even greater savings by reducing the number of attorney hours necessary for document review. Generally, predictive coding generates the highest savings in cases that involve large sets of data and many data custodians. Small cases involving limited data sets (i.e., fewer than 150,000 documents) and few custodians are not appropriate candidates for predictive coding. Additionally, the technology is not suited for non-text files such as photographs, drawings, and encrypted or password-protected files. Because construction cases typically involve large volumes of photographs and drawings, clients and attorneys should exclude those files from the predictive coding sessions. If these special file types constitute the bulk of the data collection, leaving only a small number of text files for review, predictive coding may not be a cost-effective option.
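
A rough, back-of-the-envelope calculation illustrates the kind of cost-benefit analysis involved. Apart from the per-gigabyte vendor rate quoted above, every figure in the sketch below (collection size, documents per gigabyte, review speed, billing rate, and the percentage of documents the tool sets aside) is a hypothetical assumption chosen only to show the arithmetic, not a benchmark for any actual case.

```python
# Back-of-the-envelope comparison under stated assumptions. Only the per-GB vendor
# rate range comes from the discussion above; every other figure is hypothetical.
DATA_GB = 250            # assumed collection size after de-duplication and filtering
DOCS_PER_GB = 5_000      # assumed average number of documents per gigabyte
DOCS = DATA_GB * DOCS_PER_GB

DOCS_PER_HOUR = 50       # assumed pace of traditional linear attorney review
HOURLY_RATE = 300        # assumed blended hourly billing rate for review attorneys

PC_RATE_PER_GB = 200     # within the $150-$250/GB range quoted above
PC_CULL_RATE = 0.70      # assume the tool sets aside 70% of documents as irrelevant

linear_review_cost = DOCS / DOCS_PER_HOUR * HOURLY_RATE
predictive_coding_fee = DATA_GB * PC_RATE_PER_GB
remaining_docs = DOCS * (1 - PC_CULL_RATE)
predictive_coding_total = predictive_coding_fee + remaining_docs / DOCS_PER_HOUR * HOURLY_RATE

print(f"Linear review of {DOCS:,} documents:  ${linear_review_cost:,.0f}")
print(f"Predictive coding vendor fee:         ${predictive_coding_fee:,.0f}")
print(f"Review of remaining {remaining_docs:,.0f} documents: "
      f"${predictive_coding_total - predictive_coding_fee:,.0f}")
print(f"Estimated total with predictive coding: ${predictive_coding_total:,.0f}")
```

Under these assumed numbers the technology would pay for itself many times over; with a small collection or a low culling rate, however, the vendor fee can outweigh the savings, which is why the case-by-case analysis matters.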

Of course, many clients and attorneys remain skeptical of predictive coding and the prospect of a computer determining which documents they receive in a production. Many parties fear that relevant documents will be miscoded and excluded from production. Two recent cases illustrate the contexts in which parties have considered using predictive coding and show that courts may allow – or even require – parties to proceed with the technology despite an opposing party’s concerns regarding the accuracy of the production.

A gender discrimination lawsuit made headlines this year when it produced the first judicial opinion recognizing that “computer-assisted review is an acceptable way to search for relevant ESI in appropriate cases.” Da Silva Moore v. Publicis Groupe, No. 11 Civ. 1279 (ALC) (AJP), 2012 WL 607412 (S.D.N.Y. Feb. 24, 2012). The parties in Da Silva Moore initially agreed that the defendants could use predictive coding to review their collection of three million documents before production to plaintiffs. The parties also agreed to an initial protocol in which senior attorneys for both sides would work together to establish an accurate, agreed-upon “seed set” to train the predictive coding program. Critically, defendants agreed to show plaintiffs all sample sets of documents they reviewed in each iterative round of coding (excluding privileged documents) so that plaintiffs could monitor the defendants’ “relevant” and “irrelevant” coding decisions. The parties submitted their “final” ESI protocol to the court, which the court “so ordered.”

Shortly thereafter, however, plaintiffs filed objections to the court-ordered protocol. Plaintiffs argued that they could not determine the accuracy and reliability of the defendants’ proposed method of predictive coding because it lacked a transparent, accessible standard of relevance. Plaintiffs objected to the method of validating the predictive coding results and to the number of review iterations required before a party could stop the predictive coding cycle. Simply put, the parties could not agree on how to measure and define relevancy success. United States Magistrate Judge Andrew Peck concluded that the proposed protocol was adequately transparent and that the plaintiffs’ objections were premature. He approved the defendants’ use of predictive coding but emphasized that the technology is not appropriate in every case.

Similarly, in April 2012 a Virginia circuit court judge ordered that “Defendants shall be allowed to proceed with the use of predictive coding for purposes of processing and production of electronically stored information.” Global Aerospace Inc. v. Landow Aviation LP, Consolidated Case No. CL 61040, 2012 WL 1431215 (Va. Cir. Ct. Loudoun County Apr. 23, 2012). The Global Aerospace litigation stemmed from the collapse of three hangars at the Dulles Jet Center during a major snowstorm in February 2010. The defendants, who owned and operated the hangars, immediately implemented data preservation measures in anticipation of litigation, which resulted in the preservation and collection of more than eight terabytes (TB) (8,000 GB) of potentially relevant data. (By way of comparison, this data would fill over 1,700 DVDs, which hold approximately 4.7 GB each.) The defendants enlisted a vendor to de-duplicate and filter the massive amount of data down to a more manageable volume of 250 GB, but initial reviews of even this smaller data set revealed that a significant portion of the ESI was “wholly unrelated” to the construction, operation or collapse of the jet hangars. Defendants estimated that it would take 20,000 attorney hours to review all documents for responsive, non-responsive, and privileged information. Thus, defendants proposed the use of predictive coding to reduce the burden of their document review and to generate a smaller but more accurate set of data potentially relevant to the case.

The plaintiff flatly objected to the proposed use of predictive coding, insisting that the technology would fail to produce all responsive documents to which the plaintiff was entitled. The court permitted the defendants to proceed with the technology but noted that its order did not limit the plaintiff’s ability to raise future issues as to the completeness of the production or the ongoing use of predictive coding generally.

Predictive coding and other e-discovery technology developments continue to generate headlines on a near-daily basis. Attorneys and clients should conduct a thorough strategy analysis at the initial stages of a lawsuit to assess the anticipated scope of discovery and to determine whether these new technologies are appropriate for the case.