
Michael Cizmar
President, Managing Director @ MC+A
Part 1 - Similarity can find duplicates
Last week, the National Archives released a series of documents said to be the final release of the JFK assassination files. This release contains many files that were previously released in redacted form; those re-released documents now appear unredacted. We thought this would be an excellent opportunity to demonstrate the document intelligence technology we have been developing to assist in investigative use cases for our clients.
Background
As we previously discussed, AI is a terrific way to begin the process with objective coding and document enrichment. We can use this process with the Kennedy files too:
- Determine which documents were previously released.
- Determine what is different about those documents.
- Enrich these documents so they are searchable.
- Extract key information from these documents to build a knowledge graph.
- Store these in a vector database for retrieval by AI Agents (a minimal sketch follows this list).
- Enable AI Agents to perform investigations.
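To make the storage-and-retrieval step concrete, here is a minimal sketch. The embed_text function is a toy stand-in for a real embedding model, and the in-memory list stands in for a production vector database; both are illustrative assumptions, not our actual pipeline:

```python
import numpy as np

def embed_text(text: str, dim: int = 256) -> np.ndarray:
    """Toy stand-in for a real embedding model: hash word counts into a
    fixed-size unit vector. (Python's hash() is stable only within one
    process; swap in a real embedding model in practice.)"""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Each enriched document carries its OCRed text plus objective-coding metadata.
documents = [
    {"id": "104-10123-1040", "text": "...OCRed page text...", "date": "...", "doc_type": "..."},
    # ...one entry per file in the release
]

# Index the collection: keep each document's metadata next to its embedding.
index = [(doc, embed_text(doc["text"])) for doc in documents]

def search(query: str, k: int = 5):
    """Return the k documents whose embeddings are most similar to the query."""
    q = embed_text(query)
    scored = [(float(np.dot(q, vec)), doc) for doc, vec in index]  # vectors are unit-normalized
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

Keeping the extracted metadata alongside each vector is what lets the AI agents in the final step filter and ground their retrieval rather than searching raw text alone.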
We will demonstrate the technology we have developed over the past few years using the Kennedy Files dataset. This technology is available right now for our clients and can be leveraged in a variety of use cases such as Legal Litigation Support, Intelligence-Driven Policing, and any workflow that needs to organize unstructured information. By leveraging AI, we can automate the tedious and time-consuming tasks involved in document analysis. This not only speeds up the investigation process but also ensures a higher level of accuracy and consistency. Our AI technology can identify subtle changes in documents that might be missed by human reviewers and, more importantly, can do so at scale, providing a comprehensive understanding of the information.
Whether it is analyzing financial records, legal documents, or intelligence reports, our solution can help uncover hidden patterns and insights that can aid in decision-making and problem-solving.
Performing this at Scale
There is no shortage of examples out there that claim to do something similar. It’s important to treat the contextual information about a document, such as its date, who authored it, and what type of document it is, as the starting point rather than simply a stream of text. Furthermore, most techniques rely on Large Language Models and are demonstrated on a very small set of data. As an example, running Microsoft’s GraphRAG on a dataset like the Enron corpus could cost tens of thousands of dollars. That is not scalable; for the approach to be practical, it must be much less expensive. Otherwise you are forced to limit the data you ingest or accept delays in ingestion, and when time is of the essence, such as when you are investigating a crime and need to respond immediately or are rushing to a court filing, having the answer a few minutes late is the same as not having it at all.
Contrary to some claims, OCRing PDFs, which are really just images, is not a time-consuming affair, nor is it as costly as some in the community have suggested. Yes, there are costs associated with OCRing a document. For example, OCRing the 60,000 pages of the latest release with Google’s Vision API would cost about $90, which is hardly significant; but if the run is repeated, or if you add additional attributes to be extracted, it can quickly exceed $1,000 per ingestion run.
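As a rough illustration, here is a minimal sketch of that OCR step using the google-cloud-vision client library. It assumes each PDF page has already been rasterized to an image (with pdf2image or similar), that Google Cloud credentials are configured, and that the file path shown is purely illustrative:

```python
from google.cloud import vision

def ocr_page(image_path: str):
    """Run document text detection (OCR) on one rasterized page image."""
    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.document_text_detection(image=image)
    if response.error.message:
        raise RuntimeError(response.error.message)
    return response

# Illustrative path; in practice you would loop over every page in the release.
response = ocr_page("jfk_release/104-10123-1040_page_01.png")
print(response.full_text_annotation.text)  # plain text, ready for lexical (keyword) indexing
```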
Document context, which OCR and bounding boxes can help provide, aids understanding of a document in much the same way a human’s brain does. This was outlined in Microsoft’s LayoutLM paper. With this approach, you can achieve some interesting things. Additionally, OCRing the document produces text that can be searched easily with traditional lexical (keyword) search.
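To give a sense of what that document context looks like, here is a sketch that pairs each OCRed word from the Vision response above with its bounding box, normalized to the 0-1000 grid that LayoutLM-style layout-aware models expect; the function names are ours, chosen for illustration:

```python
def normalize_box(box, width, height):
    """Scale a pixel-space box (x0, y0, x1, y1) to the 0-1000 grid used by
    LayoutLM-style layout-aware models."""
    x0, y0, x1, y1 = box
    return [int(1000 * x0 / width), int(1000 * y0 / height),
            int(1000 * x1 / width), int(1000 * y1 / height)]

def words_with_boxes(response):
    """Walk a Vision OCR response and yield (word, normalized_box) pairs."""
    for page in response.full_text_annotation.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    text = "".join(s.text for s in word.symbols)
                    xs = [v.x for v in word.bounding_box.vertices]
                    ys = [v.y for v in word.bounding_box.vertices]
                    yield text, normalize_box(
                        (min(xs), min(ys), max(xs), max(ys)), page.width, page.height
                    )
```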
What's Different about the Kennedy Files in this Release
We took the current dataset and compared it to the two previous datasets, looking for documents that were previously released. We are primarily interested not in exact duplicates but in near duplicates, since the redactions would prevent the documents from being considered exact duplicates.
When we create a signature of each document based on a variety of proprietary metrics, we can perform a similarity search across the dataset to find similar documents. With this set, we can use AI to detect and show the differences. This process can be handed to an agentic AI loop that produces a report for human analysis.
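The metrics behind our signatures are proprietary, but the overall pattern can be sketched with a public stand-in such as MinHash with locality-sensitive hashing, shown here via the datasketch library. Redactions change only a fraction of a document's shingles, so the redacted and unredacted versions still score above the similarity threshold. The previous_releases and current_release dictionaries are assumed inputs mapping record numbers to OCRed text:

```python
from datasketch import MinHash, MinHashLSH

def signature(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from word 5-gram shingles of the OCRed text."""
    words = text.lower().split()
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(words) - 4, 1)):
        shingle = " ".join(words[i:i + 5])
        m.update(shingle.encode("utf8"))
    return m

# Index the previous releases once...
lsh = MinHashLSH(threshold=0.7, num_perm=128)
for doc_id, text in previous_releases.items():      # assumed: {record number: OCRed text}
    lsh.insert(doc_id, signature(text))

# ...then, for each newly released document, find its near-duplicate ancestors.
for doc_id, text in current_release.items():
    matches = lsh.query(signature(text))
    if matches:
        print(doc_id, "was previously released as", matches)
```

The candidate pairs returned by the query step are what the AI in the loop then compares to produce the difference report.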
If you are interested in the report, please request it via our online form. Here is what the deltas our process detects look like visually:

These are the highlighted differences in document 104-10123-1040 between releases. If you are interested in seeing the full report, register to download it: