Horseshoes, hand grenades and near deduplication

The old cliché is that close only counts in horseshoes and hand grenades. Sometimes document review can feel like both, and "close" can represent millions of dollars and countless attorney hours. The eDiscovery world spends much of its analytics discussion on predictive coding, but there are additional analytical processes that can reduce cost, time, and risk. One of these technology-assisted services is Near Deduplication.

When preparing electronically stored information (ESI) for document review, a common practice is Deduplication. This is the process of identifying exact copies of documents in a data collection by comparing the unique hash values of electronic records. Once documents are identified as exact duplicates, they can be suppressed from the full data set at either the custodian level (one copy of each exact duplicate is kept for each custodian to whom the data belongs) or the global level (one copy is kept across the entire universe of documents). Because this process compares hash values derived either from the binary content of files (for loose files and attachments) or from email metadata values (for emails), it produces absolute results.
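The hash-comparison step described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation; the file contents and names are invented, and SHA-256 stands in for whichever hash algorithm a given processing tool actually uses:

```python
import hashlib

def file_hash(data: bytes) -> str:
    """Return the SHA-256 digest of a file's binary content."""
    return hashlib.sha256(data).hexdigest()

def deduplicate(files: dict) -> dict:
    """Group file names by identical hash. Within each group, every
    member after the first is an exact duplicate that can be suppressed."""
    groups = {}
    for name, data in files.items():
        groups.setdefault(file_hash(data), []).append(name)
    return groups

# Hypothetical collection: two identical memos and one distinct file
files = {
    "memo_v1.txt": b"Quarterly results attached.",
    "memo_copy.txt": b"Quarterly results attached.",
    "notes.txt": b"Call opposing counsel Monday.",
}
for digest, names in deduplicate(files).items():
    print(digest[:8], names)
```

Because two byte-identical files always produce the same digest, and any difference in content produces a different one, this comparison yields the absolute results the article describes.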

As a separate issue, it is sometimes important to know which documents have nearly the same content within one or more data collections. The key documents in a case may differ only in minor details. A party to litigation may hold documents containing a large amount of form or boilerplate language. Custodians may have documents that were slightly altered and then stored or shared as multiple files. Documents may even contain fonts that match their background color, rendering the text invisible to visual review. Whether for one of these reasons or another, grouping documents by similar text content can represent a strategic advantage in review. This is the analysis Near Deduplication provides.

Near Deduplication works on the searchable text content of a document. This text is extracted during standard eDiscovery processing or, for documents without extractable text, generated via Optical Character Recognition (OCR, the conversion of document images into searchable text). The extracted and/or OCR text is analyzed, a percentage similarity score is assigned to each document based on its text content, and the documents are grouped by that similarity. The higher the percentage, the more closely a document's text resembles the Pivot or Master document (the document with the highest word/character count in a near-duplicate grouping). The output is delivered as a CSV load file (viewable in any text editor or Microsoft Excel) and may contain additional fields, including the number of near duplicates, near-duplicate associations, and even the word count of each document.
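The pivot-and-score workflow above can be sketched as follows. This is an illustrative approximation only: commercial tools use their own proprietary similarity algorithms, whereas this sketch substitutes Python's standard-library `difflib.SequenceMatcher` ratio, and the document names and texts are invented. It picks the document with the highest word count as the pivot and emits a small CSV load file:

```python
import csv
import difflib
import io

def near_dup_report(docs: dict) -> str:
    """Score each document's text similarity against the pivot (the
    document with the most words) and return a CSV load file as text."""
    # Pivot = document with the highest word count in the group
    pivot = max(docs, key=lambda name: len(docs[name].split()))
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["DocID", "Pivot", "SimilarityPct", "WordCount"])
    for name, text in docs.items():
        # SequenceMatcher.ratio() is 1.0 for identical text, lower otherwise
        pct = round(100 * difflib.SequenceMatcher(None, docs[pivot], text).ratio(), 1)
        writer.writerow([name, pivot, pct, len(text.split())])
    return out.getvalue()

# Hypothetical documents: two near-duplicate contracts and an unrelated invoice
docs = {
    "contract_a.txt": "This agreement is made between Acme and Beta Corp.",
    "contract_b.txt": "This agreement is made between Acme and Gamma LLC.",
    "invoice.txt": "Invoice 1042: amount due on receipt.",
}
print(near_dup_report(docs))
```

Here the two contracts score close to each other (and the pivot scores 100%), while the invoice scores far lower, mirroring how a review platform would cluster the contracts into one near-duplicate group.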

Please note that because Near Deduplication can rely on the accuracy of OCR, it does not produce absolute results. Issues such as handwriting and variations in font and size can affect OCR accuracy and therefore alter Near Deduplication scores. Because of this, there is risk in making coding decisions based on Near Deduplication output alone. It is recommended that near duplicates be reviewed before final decisions are made regarding document production. Remember that while one document may share 98% of its text with another, the remaining 2% can be the difference between confirmation and contradiction.

The main advantage of Near Deduplication is that it groups documents based on their text. This can allow documents to be batched to specific reviewers and/or experts, the review of a near-duplicate group to be prioritized based on the relevance of its topic, and a more efficient review than traditional linear review tactics would provide. For these reasons, along with its ability to solve the data problems described earlier in this article, Near Deduplication has the potential to save time, and therefore money, in the most expensive phase of eDiscovery: document review.

Whether Near Deduplication is the right analytical process for your dataset will be situation specific, and should be determined in collaboration with the legal team, litigation support, and/or the service provider preparing the data for review. As with any eDiscovery process, understanding the nature and types of electronic files the parties have provided will help determine which tools are suitable. Whether you receive server data, loose files, horseshoes, or hand grenades, you can add new dynamics to your case management by exploring the data analytics resources available to you and putting them to good use.