There's nothing worse than opening a PDF and realizing you can't use the search function or even highlight text. This usually happens when a PDF was created by scanning a paper document: it's just a collection of images. Most modern scanning software uses Optical Character Recognition (OCR) so that words are both searchable and selectable, but sometimes you'll run into documents where this didn't happen.
In these cases, the free and open source OCRmyPDF is great to have around. It's a command line tool that quickly converts any PDF file into a PDF/A file complete with optical character recognition, meaning you can search the text. Even better, it's completely free.
Installing the application is best done using your package manager on Linux devices and using Homebrew on Mac. Windows users can technically install the application by installing Python and a few other dependencies. Look into that if you're willing to do some digging.
Once the application is set up, you can use it by typing ocrmypdf followed by the name of the document you want to add OCR to, and then the name of the document you'd like to create. So, for example, ocrmypdf before.pdf after.pdf would take "before.pdf", add character recognition, then create a new document called "after.pdf".
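If you have a whole folder of scans, the same two-argument command can be wrapped in a loop. The following is only a rough sketch, not something from the OCRmyPDF documentation: the helper name ocr_output_name and the "-ocr.pdf" suffix are made up for illustration, and the script assumes ocrmypdf is already on your PATH.

```shell
#!/bin/sh
# Hypothetical helper (not part of OCRmyPDF): "scan.pdf" -> "scan-ocr.pdf"
ocr_output_name() {
    printf '%s-ocr.pdf\n' "${1%.pdf}"
}

if command -v ocrmypdf >/dev/null 2>&1; then
    for f in *.pdf; do
        # If no PDFs exist, the glob stays literal, so skip it
        [ -f "$f" ] || continue
        # Skip files that an earlier run of this loop produced
        case "$f" in *-ocr.pdf) continue ;; esac
        # Same form as above: input file first, output file second
        ocrmypdf "$f" "$(ocr_output_name "$f")"
    done
else
    echo "ocrmypdf is not installed" >&2
fi
```

Writing to a new file name each time leaves the original scans untouched, so you can compare the results before deleting anything.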
The process will take a while, depending on the size of the document, and it might not be entirely accurate if the image quality is low. Even so, I found this did a pretty good job even with the most ancient and poorly compressed PDFs I could dig up.

Credit: Justin Pot
And there's more you can do here. In fact, the Cookbook in the OCRmyPDF documentation outlines a bunch of things you could do. You can compress the images in the PDF, for example, by adding --pdfa-image-compression jpeg to your command. You can automatically re-orient any pages with sideways text by adding --rotate-pages to the command. Or maybe the PDF you're processing already has OCR that you think is poor quality; you can add --redo-ocr to the command, which will strip out the existing OCR information and start over.
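Here is roughly how those options might look in full commands. Treat this as a sketch: the file names are placeholders, and the small run wrapper with its DRY_RUN variable is an invention for this example so you can preview the commands before running them. Only the ocrmypdf flags themselves come from the options described above.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints each command.
# Set DRY_RUN=0 to actually run ocrmypdf.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical wrapper: print the command, then run it unless in dry-run mode
run() {
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Straighten sideways pages and use JPEG compression for images in the PDF/A
run ocrmypdf --rotate-pages --pdfa-image-compression jpeg before.pdf after.pdf

# Separate pass: throw away an existing poor-quality OCR layer and start over
run ocrmypdf --redo-ocr before.pdf redone.pdf
```

The --redo-ocr pass is kept as its own command here because it does not combine with every other option; check the documentation before mixing it with anything else.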
You get the idea: there's a lot here. Check out the documentation to see what else this tool can do.