Madison, Wisconsin
Extract Systems

Common Redaction Pitfalls to Avoid

July 30, 2024

Not every “redacted” document is as secure as its creators intended. In August 2023, AP News reported a mishap in the legal battle over the proposed merger between JetBlue and Spirit Airlines (i). A JetBlue court filing contained improperly redacted information that seemed to imply JetBlue Airways planned to increase fares after the merger. That was potentially problematic because the airline was arguing in court that the merger would benefit consumers (JetBlue argues the revealed information was misinterpreted without additional context). Regardless of the legal outcome, neither side disputes the facts about the redaction itself: although the information looked redacted, the redaction could be defeated by simply copying the text and pasting it into another document!

This common redaction mistake isn’t the only thing that can go wrong. Improper redaction can expose Social Security numbers, state secrets, crime victim names, and more. At Extract, we specialize in secure, accurate redactions in the government and healthcare spaces, and we’re here to help. This post covers a few common mistakes and how to avoid them.

The first category of mistake to avoid is confusing a redaction technique that makes information harder to read with one that makes it impossible to read (you almost always want the latter!). We’ll talk about some ways a human adversary might get around your redactions, but your first and primary concern is computer software. Just because you, a human, can’t see the text doesn’t mean software can’t! Software “reads” the underlying data, which can differ from the visual representation you see on your computer monitor.

A simple example is “invisible” text. Here’s a bad strategy: making text the same color as its background so that it is “invisible” to the naked eye. If this is done digitally, the text can trivially be revealed by highlighting it with the mouse cursor. You might be surprised to learn that printing the document and going over the text with a black marker may not be effective either: modern optical character recognition can often read text that humans can’t make out. Even if the text is genuinely invisible (did you know some Unicode characters have zero width or are otherwise natively invisible?), software can easily grab the text and make it readable. Similarly, don’t trust “blurring” tools that scramble the visible pixels so that people can’t easily read them. A dedicated adversary can use statistical tools like hidden Markov models to undo these kinds of visual transformations (ii).
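To make the same-color problem concrete, here is a minimal sketch in Python. It assumes an HTML page as a stand-in for the document (the snippet, the sample number, and the helper names are made up for illustration) and uses the standard library's html.parser to pull out text. The extractor never looks at colors, so white-on-white text comes out like any other text.

```python
from html.parser import HTMLParser


class TextGrabber(HTMLParser):
    """Collects every run of text in the markup, ignoring all styling."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; color and visibility play no part here.
        self.chunks.append(data)


def extract_text(html: str) -> str:
    """Hypothetical helper: return the text a program would read from `html`."""
    grabber = TextGrabber()
    grabber.feed(html)
    grabber.close()
    return "".join(grabber.chunks)


# Illustrative page: the number is white text on a white background.
page = (
    '<p style="background:#fff">SSN: '
    '<span style="color:#fff">123-45-6789</span></p>'
)

print(extract_text(page))  # SSN: 123-45-6789
```

A reader looking at the rendered page would see only “SSN:”, while the program recovers the full value. Real document formats differ in the details, but the principle is the same: hiding text visually does not remove it from the data.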

So, what should you do instead? The black bar over text is a classic for a reason, but there is still a danger to watch out for. The black bar needs to be “burned” into the image so that no data related to the original text remains. “Searchable” PDFs have an invisible “text layer” in addition to the visual representation you see when you look at the PDF. If that text layer is still present, the redacted text can be copied and pasted elsewhere even though a black bar covers it. This is likely what occurred in the scenario that opened this blog post. Another way to run into the same problem is covering sensitive information with Microsoft Word’s insert shape tool (iii): the text remains in the document behind the shape. Word can also keep a history of changes to a document (for example, through Track Changes), so even fully deleting or replacing text isn’t necessarily safe! A dedicated redaction tool, like Extract’s ID Shield, will remove any redacted text from the text layer while keeping PDFs otherwise searchable; it will leave no “metadata” or history by which a redaction can be undone.
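The difference between covering text and removing it can be sketched with a toy model. The Page, Span, cover_only, and redact names below are hypothetical and do not reflect the real PDF format or how ID Shield works internally; the model only assumes that a page has a list of drawn boxes (what you see) and a separate text layer (what copy/paste reads).

```python
from dataclasses import dataclass, field

Box = tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class Span:
    text: str
    box: Box


@dataclass
class Page:
    text_layer: list[Span]
    drawn_boxes: list[Box] = field(default_factory=list)

    def extract_text(self) -> str:
        # Copy/paste and search read the text layer, not the pixels.
        return " ".join(span.text for span in self.text_layer)


def overlaps(a: Box, b: Box) -> bool:
    """True when two rectangles share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def cover_only(page: Page, box: Box) -> None:
    """Unsafe: draws a black bar but leaves the text layer untouched."""
    page.drawn_boxes.append(box)


def redact(page: Page, box: Box) -> None:
    """Safer: draws the bar and drops every text span the bar touches."""
    page.drawn_boxes.append(box)
    page.text_layer = [s for s in page.text_layer if not overlaps(s.box, box)]


def make_page() -> Page:
    # Illustrative content only.
    return Page([Span("SSN:", (0, 0, 30, 10)), Span("123-45-6789", (35, 0, 110, 10))])


BAR: Box = (33, 0, 112, 10)

covered = make_page()
cover_only(covered, BAR)
print(covered.extract_text())   # SSN: 123-45-6789

redacted = make_page()
redact(redacted, BAR)
print(redacted.extract_text())  # SSN:
```

Both pages would look identical on screen, since each has the same black bar. Only the second one has nothing left underneath for software to pull. A real tool also has to replace the underlying image pixels and strip metadata, which this sketch does not attempt.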

One final point to consider: even if you have a “burned-in” black bar over sensitive information with no hidden data for software to pull, there is still at least one vulnerability someone might exploit: what you don’t redact. Here the issue is partial or missed redactions. Partial redactions are black boxes that don’t fully cover the sensitive data. Even the smallest slivers of character edges still visible outside the redaction box can give clues about the most likely letters underneath. The bottom of a ‘G’ looks very different from the bottom of an ‘L.’ That might seem obvious, but if you are making hundreds of redactions by hand, it is easy to occasionally draw a box smaller than you intended. A dedicated redaction tool like ours offers several ways to help you make large numbers of redactions at once, including finding the sensitive data for you, letting you redact words or phrases in multiple places at once, and automatically snapping the redaction box to the full outer edges of the text.

A more pernicious error is the fully “missed” redaction. Part of this can be chalked up to human error, as just discussed, and mitigated in the same way, but a subtler point is that you need to fully understand the data you are redacting. Make sure the SSN you’re looking for isn’t also incorporated into a number elsewhere in the document. Make sure an ID number isn’t a slightly disguised birth date (e.g., birth date + 5 random digits tacked on). Spend some time thinking through how you’d use the remaining information in a redacted document to guess at what is being redacted, and use that to judge whether you’re being thorough enough.
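One way to check for values hiding inside other numbers is to compare digits only, ignoring dashes, slashes, and spaces. The sketch below is a hypothetical helper with made-up sample data, not a description of how any particular redaction product searches. It flags every number-like token whose digits contain the digits of a value you already know is sensitive.

```python
import re


def digits_only(value: str) -> str:
    """Strip everything except the digits 0-9."""
    return re.sub(r"\D", "", value)


def find_embedded(known_value: str, text: str) -> list[str]:
    """Return each number-like token in `text` that contains the digits of
    `known_value`, including tokens where it is buried in a longer number."""
    needle = digits_only(known_value)
    hits = []
    # A token is a run of digits, allowing one separator between digits.
    for match in re.finditer(r"\d(?:[\s./-]?\d)*", text):
        if needle and needle in digits_only(match.group()):
            hits.append(match.group())
    return hits


# Illustrative document text; all values are fictitious.
doc = "SSN 123-45-6789 on file. Member ID: 12345678901. Phone 555-0100."
print(find_embedded("123-45-6789", doc))
# ['123-45-6789', '12345678901']

record = "DOB 03/14/1985; Patient ID 0314198547291"
print(find_embedded("03/14/1985", record))
# ['03/14/1985', '0314198547291']
```

This only catches values whose digits appear in the same order. A birth date stored as year-month-day, or an ID built from a reshuffled or encoded value, would slip past it, which is why understanding how each field is constructed still matters.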

Redaction serves an important role in safeguarding sensitive information in a variety of contexts. To ensure your redactions are successful, we advise using redaction methods that aren’t susceptible to statistical methods of undoing visual transformations like blurring, don’t leave behind a hidden text layer or sensitive metadata, and don’t leave behind clues that a human could use to reason their way to the data despite the redaction. Consider using dedicated redaction software that can assist you with many of these points. In general, make sure you redact thoroughly, accurately, and permanently so that the information you’re trying to protect stays safe!

i. https://apnews.com/article/jetblue-spirit-airlines-higher-fares-lawsuit-be63934fdb73c68a4beccf5f01863905

ii. https://www.researchgate.net/publication/305423573_On_the_Ineffectiveness_of_Mosaicing_and_Blurring_as_Tools_for_Document_Redaction

iii. https://irstore.blob.core.windows.net/materials/617b8c62-d677-48f5-8598-45a93721d10b.pdf

Meet The Author
Chris Mack
Chris is a Marketing Manager at Extract with experience in product development, data analysis, and both traditional and digital marketing. Chris received his bachelor’s degree in English from Bucknell University and has an MBA from the University of Notre Dame. A passionate marketer, Chris strives to make complex ideas more accessible to those around him in a compelling way.