@[email protected] to

No Stupid [email protected] • 3 months ago

Is there a way to digitally mark up a PDF so it’s not OCR-readable?

26

I want to ensure financial documents can’t be parsed by automated systems.


@[email protected]
28•3 months ago
PDF scanning is done by both OCR and PDF analysis, so no. If you, a human, can read it, a bot can read it too.

Your best bet is the classic trick of inserting BS text in a font color that’s one hex step off white.
- @umt
  16•3 months ago
  This is correct. There are also typefaces that are designed to be difficult for OCR, e.g. https://www.librarystack.org/zxx/
  
  These are, however, difficult for humans to read as well.
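
To put a number on the off-white trick mentioned above, here is a minimal Python sketch. It assumes "one hex step off white" means `#FEFEFE` text on a `#FFFFFF` page (that reading is an assumption, as are the function names) and computes a WCAG-style contrast ratio. The ratio comes out barely above 1.0, so a human won't see the junk text, while a text extractor still reads it like any other text.

```python
def srgb_to_linear(channel: int) -> float:
    """Convert one 8-bit sRGB channel to linear light."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a colour given as '#RRGGBB'."""
    hex_color = hex_color.lstrip("#")
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
    return (0.2126 * srgb_to_linear(r)
            + 0.7152 * srgb_to_linear(g)
            + 0.0722 * srgb_to_linear(b))


def contrast_ratio(fg: str, bg: str) -> float:
    """Contrast ratio between two colours: 1.0 is identical, 21.0 is black on white."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


# Assumed colours: junk text one step off white, on a pure white page.
ratio = contrast_ratio("#FEFEFE", "#FFFFFF")
print(f"off-white on white: {ratio:.4f}")
print(f"black on white:     {contrast_ratio('#000000', '#FFFFFF'):.1f}")
```

Note that this only pollutes the extracted text with decoy content; it does not hide the real text from a parser or from OCR.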
@[email protected]
9•3 months ago
Zip them into a password-protected file, or encrypt them with PGP.
Natanael
7•3 months ago
There are DRM solutions, but they’re by definition not perfect: if it can be read, then a photo can be taken.
Nomecks
2•3 months ago
Adobe makes a whole DRM platform to do exactly this: Digital Editions.
@[email protected]
English
2•3 months ago
I would OCR it myself, but edit the metadata in the file so that the text in the OCR text layer is lorem ipsum.

That way, any bot that assumes the OCR text matches what’s on the image in the PDF (and why wouldn’t it) will only read useless junk. Only someone reading the text from the image would “see” the real content, and only a bot programmed to OCR a file that already has OCR metadata would realize that there’s any inconsistency.

I’m not entirely sure how to accomplish that, but I’d figure it out if I was worried about the data being compromised.

Personally, I would simply keep the file in an encrypted container, then I wouldn’t worry about what can scan the file since it would be entirely unreadable ciphertext without the correct security key or passphrase.
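
The decoy text layer idea above can be illustrated with a toy Python sketch. Everything in it is an assumption made for illustration: the content stream is hand-written and uncompressed, and the regex "extractor" is far cruder than a real PDF parser. It relies on one real PDF feature, text rendering mode 3 (`3 Tr`), which draws text invisibly and is how OCR text layers are commonly hidden behind a page image.

```python
import re

# Hypothetical, uncompressed content stream for one page:
# the scanned image (/Im0) holds the real figures, while the
# invisible text layer (3 Tr) holds only decoy text.
CONTENT_STREAM = b"""
q 612 0 0 792 0 0 cm /Im0 Do Q
BT
3 Tr
/F1 12 Tf
72 720 Td
(Lorem ipsum dolor sit amet) Tj
0 -14 Td
(consectetur adipiscing elit) Tj
ET
"""

# Matches simple "(string) Tj" show-text operators only.
TEXT_SHOW = re.compile(rb"\((.*?)\)\s*Tj")


def naive_text_layer(stream: bytes) -> list[str]:
    """Return the strings a bot gets if it trusts the embedded text layer."""
    return [m.decode("latin-1") for m in TEXT_SHOW.findall(stream)]


def uses_invisible_text(stream: bytes) -> bool:
    """True if the stream switches to text rendering mode 3 (invisible)."""
    return re.search(rb"\b3\s+Tr\b", stream) is not None


layer = naive_text_layer(CONTENT_STREAM)
print(layer)
print("invisible layer present:", uses_invisible_text(CONTENT_STREAM))
```

A bot that stops at the text layer sees only the lorem ipsum. A bot that renders the page and runs its own OCR on the image still recovers the real content, which is why the encrypted container is the stronger option.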
@[email protected]
-2•3 months ago
OCR cannot scan documents that have been certified or digitally signed.

Note that once you certify a document it can no longer be edited, combined with another PDF, or have pages inserted or extracted.

Once a PDF has been digitally signed it is locked and you can no longer add pages, delete pages, or read it via OCR.
- @[email protected]
  English
  8•3 months ago
  What? If the document is accessible and human-readable, it’s parsable by OCR.
  - @[email protected]
    -1•3 months ago
    I don’t know what to tell you, dude. A certified or digitally signed document can’t/won’t be read by OCR. A digital signature legally certifies that the document has not been modified. PDF editors such as Bluebeam or Adobe will not or cannot process a certified or digitally signed document.
    
    I’m not sure if that limitation is due to the process by which the document is certified or if it is a feature of software conforming for legal reasons. I’m not going to research this for OP; I’m just providing a simple answer that’s as accurate as I can make it.
    
    Maybe current AI has better abilities to process document text? I’m not sure, maybe. But you’d think this would be a shared concern with groups wanting to protect documents for the same reason and therefore encryption would match.
    
    If it’s just the legality of it stopping a company from providing the feature, you would think most companies would want to keep out of legal hot water and would then disallow OCR processing. In this case sure there could be software that doesn’t conform, but for most application purposes I don’t think you’d have to worry too much.
    - @[email protected]
      4•3 months ago
      It’s 100% a software limitation and you absolutely can screen capture and OCR it.
    - @[email protected]
      3•3 months ago (edited)
      Lots of software can manipulate PDFs. Open the PDF in LibreOffice Draw, change the pages, then print or export as PDF. A system that skims content is purposely going to bypass any signed restriction.
      
      Edit: Here’s how to bypass the restriction in Paperless OCR.
      
      The parameter PAPERLESS_OCR_USER_ARGS: '{"invalidate_digital_signatures": true}' in the context of Paperless-ngx and OCRmyPDF allows OCR processing of PDF documents that have been digitally signed by intentionally invalidating those signatures. In its standard configuration, OCRmyPDF does not process documents with digital signatures so as not to compromise their integrity. Setting this parameter to true allows OCR on such documents.
    - @[email protected]
      English
      3•3 months ago
      Many alternative OCR tools now simply screenshot the page. This is a cracked issue.
- @[email protected]
  English
  4•3 months ago
  This works, right up until you introduce PDF-compatible software that doesn’t give a shit about your rules, of which there’s plenty.
  
  You can also print/scan, or even print to PDF to get around such limitations. The original document cannot be altered since that would invalidate the digital signature on the file, but you can create a perfect digital copy, omitting the signature, and modify it however you want.
  
  Online systems that skim documents for their contents don’t give a shit about what the signature is; they simply take a copy and OCR it to train an AI or amalgamate the information for data harvesting or other purposes.
  
  I get what you’re saying, and in concept it should be fine. The problem is that it’s a software lock/restriction on a file type that isn’t closed source or obscure, and the PDF format wasn’t built to be secure from the ground up. So we’re applying security to a system that wasn’t built for it.
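
For the `PAPERLESS_OCR_USER_ARGS` setting quoted earlier in the thread, the value has to be valid JSON, and the curly quotes that forum software tends to substitute will break it. As a small hedged sketch, the Python below builds the value with `json.dumps` and checks that it parses back. The variable names are made up for illustration, and where the resulting line goes (environment file, compose file, and so on) depends on your own Paperless-ngx setup.

```python
import json

# The option named in the thread; True becomes JSON `true` when serialized.
user_args = {"invalidate_digital_signatures": True}

# Serialize with straight double quotes, as JSON requires.
env_value = json.dumps(user_args)

# Hypothetical env-file style line; single quotes protect the JSON from the shell.
env_line = f"PAPERLESS_OCR_USER_ARGS='{env_value}'"
print(env_line)

# Round-trip check: the value must parse back to the same mapping.
parsed = json.loads(env_value)
print(parsed == user_args)
```

Pasting the curly-quoted version into `json.loads` raises a `JSONDecodeError`, which is an easy way to catch a copy-paste problem before restarting the service.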
