Statistical Sampling Techniques in eDiscovery Workflows: Part II
Technical Execution. The technical execution of a random sampling process is simple and quick, and it is fully supported in some eDiscovery platforms, such as Relativity.
Let’s take a look at some of the main considerations and caveats to be aware of prior to utilizing this technique for validating your workflow.
A random sample is only going to be worth its salt as a predictor for the whole population if every document in that population had an equal chance of being drawn into the sample. Sometimes it might be tempting to project results from a sampled dataset to one that hasn’t yet been sampled, especially if the two datasets are highly alike (for example, using the richness sample results from one custodian to estimate the proportion of responsive documents in another custodian’s collection), but that is a road fraught with potential errors. This might be especially dangerous if the custodians hold significantly different jobs and responsibilities in their company, or if the underlying data is expected to be substantially different for other reasons.
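As a minimal sketch of what "an equal chance of being drawn" means in practice, the snippet below draws a simple random sample without replacement using Python's standard library. The document identifiers, population size, and sample size are hypothetical, and a review platform's built-in sampling feature would normally do this for you.

```python
import random


def draw_simple_random_sample(doc_ids, sample_size, seed=None):
    """Draw a simple random sample without replacement.

    Every document in doc_ids has the same chance of being selected.
    A fixed seed makes the draw repeatable, which can help when the
    sampling step needs to be documented.
    """
    rng = random.Random(seed)
    return rng.sample(list(doc_ids), sample_size)


# Hypothetical population of 50,000 document identifiers.
population = [f"DOC-{i:06d}" for i in range(1, 50001)]

# Hypothetical sample size of 400 documents.
sample = draw_simple_random_sample(population, 400, seed=42)
```

Note that the sample is drawn from the full population it is meant to describe. Drawing from one custodian and projecting onto another would not satisfy the equal-chance condition.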
Generally speaking, the bigger the sample size, the more accurately the properties of the overall population will be reflected in the sample. For a given confidence level, the size of the random sample determines the margin of error (and therefore the width of the confidence interval) of the observations you make based on the sample. This topic is quite complex, and the exact math behind proper size selection is outside the scope of this post, but most vendors today have a data scientist or an analytics expert on staff who can advise on the most appropriate sample size for your needs.
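For rough orientation only, one common textbook approximation for sizing a sample that estimates a proportion is Cochran's formula with a finite population correction. The sketch below is not taken from this post or from any particular platform. The function name and the default values (95% confidence, 5% margin of error, 50% expected proportion as the most conservative assumption) are illustrative, and it is not a substitute for advice from an analytics expert.

```python
import math
from statistics import NormalDist


def required_sample_size(population_size, confidence=0.95,
                         margin_of_error=0.05, expected_proportion=0.5):
    """Approximate sample size for estimating a proportion.

    Uses Cochran's formula with a finite population correction.
    All defaults are illustrative assumptions.
    """
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = expected_proportion
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    # Shrink the requirement for a finite population.
    n = n0 / (1 + (n0 - 1) / population_size)
    return math.ceil(n)


# Hypothetical population of 100,000 documents.
size_at_5_percent = required_sample_size(100_000)
size_at_2_percent = required_sample_size(100_000, margin_of_error=0.02)
```

Tightening the margin of error from 5% to 2% increases the required sample considerably, which is the trade-off between precision and review cost described above.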
III. Highly stratified population
Sometimes the population being sampled might already be separated into multiple, easily defined groups that have a significantly uneven distribution of the sought-after variable. An easy example would be a document review QC process. Here, the population is neatly divided into (usually) exclusive groups – “Relevant”, “Not Relevant”, and perhaps something like “Further Review Necessary” or “Technical Issues”. We would expect the rate of reviewer error in the “Not Relevant” population to be significantly lower than in the “Relevant” population. On top of that, the actual population sizes are likely to be highly uneven – the “Not Relevant” population can contain 10 to 100 times as many documents as the “Relevant” population, or more.
A random sample based on the whole reviewed set, then, is going to be mostly full of “Not Relevant” documents and show a very small percentage of error. A more appropriate process would be treating these coding choices as distinct and separate groups and sampling them individually to get a more meaningful measurement of the reviewer error rates.
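Sampling each coding group separately can be sketched as follows. This is a minimal illustration, not a platform feature: the record layout, the "coding" field name, the group sizes, and the per-group sample size of 300 are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(documents, stratum_key, per_stratum_size, seed=None):
    """Draw a separate simple random sample from each stratum.

    documents is a list of dicts; stratum_key names the field that holds
    the coding choice. Groups smaller than per_stratum_size are taken whole.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for doc in documents:
        strata[doc[stratum_key]].append(doc)

    samples = {}
    for name, docs in strata.items():
        k = min(per_stratum_size, len(docs))
        samples[name] = rng.sample(docs, k)
    return samples


# Hypothetical reviewed set with highly uneven group sizes.
documents = (
    [{"id": f"R-{i}", "coding": "Relevant"} for i in range(1_000)]
    + [{"id": f"N-{i}", "coding": "Not Relevant"} for i in range(50_000)]
    + [{"id": f"T-{i}", "coding": "Technical Issues"} for i in range(120)]
)

qc_samples = stratified_sample(documents, "coding", 300, seed=7)
```

Each group's error rate can then be measured on its own sample, so the large “Not Relevant” group no longer drowns out the smaller groups.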
Non-probability Sampling – Judgmental Sampling
The biggest differentiator between a simple random sample, as described earlier, and a judgmental sample is that in a judgmental sample not every member of the sampled population is equally likely to be selected for the sample set. In fact, the whole point of this sampling method is for the eDiscovery practitioners to exercise their judgment on which documents would be of greatest value when included in their sample.
Since the selection criteria are biased and the process of selection is non-random, these types of samples are not typically used for identifying population characteristics or testing the efficiency of various culling methods. They can, however, be far more cost-effective and efficient than random sampling in some specific use-case scenarios:
- Quality control – Sometimes, instead of performing blind random sampling on all of the reviewed documents, a QC manager might choose to target his or her efforts on a specific set of documents that is most likely to produce errors, or could potentially result in the most damaging mistakes. Such a QC sample might, for example, be heavily skewed toward documents reviewed by a specific individual, documents reviewed on a certain date, documents of a particular file-type or issue tag, or a combination of all of the above. While these QC samples won’t tell you much about the accuracy of the overall review, they can catch the most damaging mistakes much more quickly and efficiently than a randomized QC approach.
- Machine Learning / Technology Assisted Review – Although randomized samples have earned their well-deserved place in training machine learning algorithms, a lot of eDiscovery practitioners today tend to favor a judgmental sampling approach, where the documents selected as a seed set for machine learning are primarily composed of high-value documents (e.g. documents that have already been identified as relevant, or documents that are highly likely to contain relevant material). The main argument against using the randomized samples is that of time and money. In document sets with very low prevalence, using randomized training samples might entail reviewing ten times as many Non-Relevant examples as Relevant examples, resulting in high redundancy and potentially a very low return on investment. There are also examples of other approaches to machine learning where non-random selection of seed documents creates efficiencies above and beyond that which can be achieved with a simple random sample.
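The cost argument against random training samples comes down to simple arithmetic: in a random sample, the expected share of Relevant documents equals the prevalence. The sketch below works this through for a hypothetical sample of 2,000 documents at two hypothetical prevalence levels. The function name and figures are illustrative only.

```python
def expected_training_mix(sample_size, prevalence):
    """Expected make-up of a simple random training sample.

    Returns (expected relevant, expected non-relevant,
    non-relevant documents reviewed per relevant document).
    """
    if not 0.0 < prevalence <= 1.0:
        raise ValueError("prevalence must be in (0, 1]")
    relevant = sample_size * prevalence
    non_relevant = sample_size - relevant
    return relevant, non_relevant, non_relevant / relevant


# Hypothetical scenarios: 10% prevalence versus 1% prevalence.
mix_at_10_percent = expected_training_mix(2_000, 0.10)
mix_at_1_percent = expected_training_mix(2_000, 0.01)
```

The lower the prevalence, the more Non-Relevant documents a reviewer has to code for every Relevant training example, which is why a judgmental seed set can look attractive on cost grounds.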
While judgmental sampling is not as ubiquitous as simple random sampling among the various eDiscovery workflows, when used properly it can be a very powerful addition to your toolkit.
The old adage “you don’t know what you don’t know” has never been truer than when it comes to selecting your sample documents. Since the sample is, by its nature, biased based on your selection criteria, it tends to confirm and reinforce any preconceptions that you might have had about the data.
In QC processes, a sample biased toward a certain type of privilege might show a 0% error rate, but at production time the real problem will be something that your team did not anticipate and could not screen for. A random sample might have revealed that issue ahead of time due to its probabilistic selection process, whereas a biased sample would be doomed from the start.
This can also be a significant problem in machine learning / TAR workflows. A biased sample will tend to predispose machine learning algorithms to discovering the types of documents found in the seed set first and foremost. If you are utilizing recall rate as your main indicator of project completion, you might get to that target recall rate without ever having discovered other types of relevant documents in the workspace. To give an extreme example, if a particular workspace is saturated with PowerPoint decks talking about Company A’s financials, and has a very small number of critical emails about a different issue, a biased seed set composed of those financials might create a situation where your review team will get through all of the PowerPoint documents and declare the review to be complete based on their estimates and projections, while never having discovered a single one of those critical emails.
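The extreme example above can be put into numbers. In the sketch below, all counts are hypothetical and chosen purely for illustration. It shows how an overall recall figure can look acceptable while recall for one document type is zero.

```python
def recall(found, total):
    """Fraction of the relevant documents that were actually found."""
    return found / total if total else 0.0


# Hypothetical counts of relevant documents in the workspace, by type.
relevant_totals = {"PowerPoint decks": 9_900, "Critical emails": 100}

# Hypothetical counts found by a review trained on a PowerPoint-only seed set.
relevant_found = {"PowerPoint decks": 8_000, "Critical emails": 0}

overall_recall = recall(sum(relevant_found.values()),
                        sum(relevant_totals.values()))

recall_by_type = {
    doc_type: recall(relevant_found[doc_type], relevant_totals[doc_type])
    for doc_type in relevant_totals
}
```

With these made-up figures the overall recall is 80%, yet not one of the critical emails has been found. In a real matter the true totals are unknown and have to be estimated, which is where a random validation sample earns its keep.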
The above list is by no means exhaustive, but it should give you a quick glimpse into some of the things that you need to be aware of prior to utilizing this easy but powerful technique for evaluating your eDiscovery workflows.
We hope that you have found this information helpful and, as always, we would encourage you to reach out to us if you would like any additional advice on how to streamline your document review and reduce your litigation expenses in a safe and defensible way.