An Untested Algorithm for Matching Edges of Shredded Documents

Law enforcement and intelligence agencies reportedly invest a lot of man-hours in poring over shredded documents and piecing them together. If I were shredding documents I didn’t want anyone to read, I would take a few extra steps to make it harder to match up the pieces, but you can’t assume you would be able to prevent full capture of all the pieces.

Incineration of shredded documents is probably the most reliable method for ensuring they don’t fall into the right hands, but most people don’t have access to an incinerator any more.

If I were on the law enforcement/intelligence agency side of the equation I would want the best possible scans of each shredded piece. I would want them to be flat (not wrinkled), but I believe there is commercial document-scanning software that attempts to flatten wrinkled documents. It has been years since I used a feature like that, but I suspect the technology has improved since then.

Assuming you have the simplest form of shredded documents, long strips of paper, I would want to scan them in such a way that I could detect color changes along the edges. I would then use a binary encoding to track the color changes along each edge as a simple vector.
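Here is a minimal sketch of what I mean, assuming each strip has already been scanned into a plain grid of grayscale intensities (a real pipeline would produce something richer, but the idea is the same):

```python
def edge_vectors(pixels):
    """Extract the four edge vectors from a scanned strip.

    `pixels` is assumed to be a rectangular grid of grayscale
    intensities (a list of rows, one int per pixel) -- a stand-in
    for whatever the real scanning pipeline produces.
    """
    top = list(pixels[0])
    bottom = list(pixels[-1])
    left = [row[0] for row in pixels]
    right = [row[-1] for row in pixels]
    return {"top": top, "bottom": bottom, "left": left, "right": right}


# A 3x4 toy "strip": each number is a pixel intensity.
strip = [
    [10, 11, 12, 13],
    [20, 21, 22, 23],
    [30, 31, 32, 33],
]
print(edge_vectors(strip)["left"])   # [10, 20, 30]
```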

Every scanned piece would have a unique identifier, a “top” and a “bottom” edge vector, and a “left” and “right” edge vector. The vectors would serve as composite data keys. I would then write a simple sorting algorithm to match left-side vectors with right-side vectors.
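A rough sketch of that matching pass, with a made-up Piece record and a quantization step standing in for whatever tolerance a real scanner would require:

```python
import bisect
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Piece:
    piece_id: str
    top: List[int]
    bottom: List[int]
    left: List[int]
    right: List[int]

def quantize(vec: List[int], step: int = 16) -> Tuple[int, ...]:
    # Coarsen intensities so near-identical edges compare equal.
    return tuple(v // step for v in vec)

def match_left_to_right(pieces: List[Piece]) -> List[Tuple[str, str]]:
    # Sort all right-edge keys once, then binary-search each left-edge
    # key against them; equal keys become "probable hits".
    right_keys = sorted((quantize(p.right), p.piece_id) for p in pieces)
    keys_only = [k for k, _ in right_keys]
    hits = []
    for p in pieces:
        key = quantize(p.left)
        i = bisect.bisect_left(keys_only, key)
        while i < len(keys_only) and keys_only[i] == key:
            other_id = right_keys[i][1]
            if other_id != p.piece_id:
                hits.append((p.piece_id, other_id))
            i += 1
    return hits
```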

All probable hits would be dumped into a first-pass file. I would then sort the first-pass file by piece identifier to see how many probable matches each piece has.
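Something like this, with the file name and CSV format being arbitrary placeholders of my own:

```python
import csv
from collections import Counter

def write_first_pass(hits, path="first_pass.csv"):
    # Dump the probable hits (pairs of piece identifiers) to a file.
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(hits)

def rank_by_match_count(hits):
    # Count how many probable matches each piece identifier appears in;
    # pieces with only one candidate on each side are the easy wins.
    counts = Counter()
    for left_id, right_id in hits:
        counts[left_id] += 1
        counts[right_id] += 1
    return counts.most_common()
```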

In this way you reduce the amount of time spent sorting the pieces of the puzzle. You just let basic data sorting logic match similar-value vectors.

Taking this a step further, you could scan a full picture and then divide it into smaller pieces that are matched by edge vectors. In this way you can identify objects more quickly and easily by temporarily classifying certain color ranges as “white noise”. In my first pass I would identify all the pieces that appear to contain the edge of an object. I would then connect the edges and finally fill in the center with the remaining pieces (they are already tagged and sorted).
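A toy version of that tiling-and-tagging pass; the “white noise” band and the edge threshold below are numbers I am making up for illustration, not anything tested:

```python
def tile_grid(pixels, tile=8):
    # Split a pixel grid into tile x tile blocks, tagged by (row, col).
    tiles = {}
    for r in range(0, len(pixels) - tile + 1, tile):
        for c in range(0, len(pixels[0]) - tile + 1, tile):
            block = [row[c:c + tile] for row in pixels[r:r + tile]]
            tiles[(r // tile, c // tile)] = block
    return tiles

def classify(block, noise_band=(240, 255), edge_spread=40):
    # "noise": everything sits inside the background band (set aside).
    # "edge":  a large intensity spread suggests an object boundary.
    # "fill":  anything else -- placed later, after the edges connect.
    flat = [v for row in block for v in row]
    lo, hi = min(flat), max(flat)
    if noise_band[0] <= lo and hi <= noise_band[1]:
        return "noise"
    if hi - lo >= edge_spread:
        return "edge"
    return "fill"
```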

But if this is an untested algorithm, how do I know it will work? I don’t. But I do know that if you can convert images to data then the data can be sorted very, very fast by multiple algorithms. You just have to use a classification system that keeps connections between data “records” to a minimum while you match them up.

By using vectors of color-intensity designations (say 16-bit values from 0 to 65,535) you create strings of binary data that can be mapped and sorted by long-established data indexing and analysis algorithms. You don’t have to invent an entirely new technology just to cope with strange shapes.
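For example, packing a vector of 16-bit intensities big-endian produces a byte string whose lexicographic order matches its numeric order, so plain sorting and indexing work on it unchanged:

```python
import struct

def edge_key(vec):
    # Pack 16-bit intensities (0-65535) big-endian; byte-wise
    # (lexicographic) order then equals numeric order, so ordinary
    # sorting and index structures apply without modification.
    return struct.pack(">%dH" % len(vec), *vec)

keys = sorted(edge_key(v) for v in ([256, 1], [255, 65535], [256, 0]))
# [255, 65535] sorts before [256, 0], which sorts before [256, 1].
```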

Conceptually this process should work very well with small sets of vectors but I believe it would scale out very smoothly by several orders of magnitude. You would probably run into limitations when you exceed a relatively small amount of storage, but once you assemble a sub-set of vectors into a coherent map you could then create new edge vectors that work at a larger scale (by hashing or compressing the data for each point in the atomic vectors).
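Two crude illustrations of that compression step, either of which is lossy in exactly the way the next paragraph describes (these are sketches, not a worked-out design):

```python
import hashlib

def coarsen(vec, block=4):
    # Average each block of values into one; the result is a shorter
    # edge vector usable at the next scale up, but the detail is gone.
    return [sum(vec[i:i + block]) // len(vec[i:i + block])
            for i in range(0, len(vec), block)]

def hash_key(vec):
    # Or hash the whole vector into a short fixed-size key: cheap to
    # store and compare, and completely irreversible.
    return hashlib.sha256(repr(vec).encode()).hexdigest()[:16]
```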

In theory this vector expansion should scale out infinitely. You could map the entire known universe on a single hard-drive, starting out with a few million images. But you would not be able to reverse the process because too much data would be lost during each stage of compression.

The number of compression stages your edge vectors have been through would determine the resolution, or precision, with which you can evaluate the data. High resolution would be achievable only for small images because of the immense amount of data you would need to map their edge vectors.

However, fractal geometry might lead us to develop some heuristics that could sharpen edge vector sets, possibly allowing a reversal of the compression in at least some regions of visual data.

I just wanted to get that all down somewhere to refer to in the future, in case I ever find myself in a high-end consulting contract on how to deshred large volumes of documents for all the right reasons. You know how it is.