How to do OCR on a Mac using the CLI or just Python

Nice post, OP! I was super impressed with Apple's Vision framework. I used it on a personal project that involved OCRing tens of thousands of spreadsheet screenshots and ingesting them into a Postgres database. I tried other CPU-based OCR methods (since macOS and Nvidia still don't play nice together) such as Tesseract but found the output to be incorrect too often. The Vision framework not only gave the highest quality output I had seen, it also used the least compute. It was fairly unstable, but I can chalk that up to user error w/ my implementation.

I used a combination of RhetTbull's vision.py (for the actual implementation) [1] + ocrmac (for experimentation) [2] and was pleasantly surprised by the performance on my i7 6700k hackintosh.

I wouldn't call myself a programmer, but I can generally troubleshoot anything given enough time. It did cost time, though.

[1]: https://gist.github.com/RhetTbull/1c34fc07c95733642cffcd1ac5...

[2]: https://github.com/straussmaximilian/ocrmac
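For anyone wanting to try the ocrmac route from [2], here is a minimal sketch. It assumes `pip install ocrmac` on macOS and that `recognize()` returns `(text, confidence, bounding_box)` tuples, as the project's README describes. The `confident_text` helper and its `min_confidence` cutoff are made up for illustration and are not part of ocrmac.

```python
from typing import List, Tuple

# One ocrmac result as described in its README: (text, confidence, bounding box).
Annotation = Tuple[str, float, List[float]]


def confident_text(annotations: List[Annotation], min_confidence: float = 0.5) -> List[str]:
    """Keep the text of annotations at or above min_confidence (a hypothetical cutoff)."""
    return [text for text, confidence, _bbox in annotations if confidence >= min_confidence]


def ocr_image(path: str) -> List[Annotation]:
    """Run Apple's Vision OCR on one image via ocrmac (macOS only)."""
    # Imported lazily so the helper above still works on machines without ocrmac.
    from ocrmac import ocrmac

    return ocrmac.OCR(path).recognize()
```

On a Mac, `confident_text(ocr_image("screenshot.png"))` would then give the recognized lines (the filename is a placeholder). Dropping low-confidence lines is just one way to cut noise before loading results into a database.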

Tesseract alone is widely known to be "meh" at this point.

If you look at RAG frameworks as one example, they'll typically use/support a variety of implementations. Tesseract is almost always supported but it's rarely ideal, with projects like Unstructured[0] and DocTR[1] being preferred. By leveraging more-or-less SOTA vision models[2][3] they embarrass Tesseract.

I haven't compared them to the Apple Vision framework but they're absolutely better than Tesseract and potentially even Apple Vision.

There are also various approaches to use these in conjunction but that gets involved.

[0] - https://github.com/Unstructured-IO/unstructured-inference

[1] - https://github.com/mindee/doctr

[2] - https://github.com/mindee/doctr#models-architectures

[3] - https://github.com/Unstructured-IO/unstructured-inference#mo...

https://github.com/mindee/doctr/issues/1049

https://github.com/JaidedAI/EasyOCR#whats-coming-next

Happy to see OCR is advancing lately, but I really need HWR.

I am looking for something this polished and reliable for handwriting, does anyone have any pointers? I want to integrate it in a workflow with my eink tablet I take notes on. A few years ago, I tried various models, but they performed poorly (around 80% accuracy) on my handwriting, which I can read almost 90% of the time.

This is maybe not a solution, but how does ChatGPT do on your handwriting if you upload a photo? If that works well then maybe you can use the API?

AWS Textract is by far the best OCR engine we've used, it does great with handwritten text

Reading https://heartbeat.comet.ml/comparing-apples-and-google-s-on-... (2017), I expect this code to work for handwritten text.

How well it works on your handwriting is for you to test, but if you, having all kinds of contextual information, can’t read it well, I guess it won’t, either.

Does anyone know what languages Apple supports? The docs don't have a list. Tesseract might be "meh" but it is probably the best open source option available for Devanagari scripts or Persian, for example.

I've used it on a number of Cyrillic languages (Russian, Bulgarian, etc), Hungarian, Turkish, along with the typical ones (Spanish, German, French, Italian, Portuguese). I've heard it supports Chinese. I just tried Persian and Devanagari samples on my Mac and it could not do either.

I found this detailed comparison of OCRs (both open source and cloud services) super helpful: https://source.opennews.org/articles/our-search-best-ocr-too...

docTR comes out as strongest open solution.

Looks nice! Do you know if they can do table structuring as well? Something similar to what Amazon Textract does[0].

[0]https://docs.aws.amazon.com/textract/latest/dg/how-it-works-...

I have found Tesseract to be both better than I expect (it feels great when it works most of the time) and worse than I expect (not quite enough correct data to fully rely on).

It's better than Tesseract? That's really impressive.

Could you run a farm of macOS machines and turn this into an API for profit? Would that be legal?

You could run a farm of iphones to OCR memes if you felt so inclined

https://findthatmeme.com/blog/2023/01/08/image-stacks-and-ip...

That blog post is glorious. Thanks for sharing.

In my experience using it constantly, it is far beyond Tesseract’s.

I have never gotten truly garbled output from Apple’s, whereas Tesseract will frequently produce random Unicode characters from text.

Apple’s also handles things like overlapping text or changing font sizes and typefaces far better than any open-source OCR I’ve used.

IMO it goes head to head with the Amazon/Google cloud OCR services. It works superbly.

Yes, as long as you pay for the mac hardware it’s yours to do with as you please. I’m not an attorney and this is not legal advice.

Way, way better than Tesseract!

Is there a tutorial on how to extract a table from a PDF or image with the Apple Vision framework? I tried the two links in your post and it just extracts the text without maintaining the table structure.

AWS textract provides sample python code to extract tables into csv which works great.

I had good repeated success extracting tables from PDFs using Camelot (Python, https://github.com/camelot-dev/camelot)
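A rough sketch of what that looks like with Camelot, assuming `pip install camelot-py` and a text-based PDF (Camelot does not OCR scanned pages, so scans would need a text layer first). The helper names and the `out_prefix` naming scheme are illustrative, not part of Camelot.

```python
from typing import Iterable


def pages_arg(pages: Iterable[int]) -> str:
    """Camelot takes pages as a comma-separated string, e.g. "1,2,5"."""
    return ",".join(str(page) for page in pages)


def pdf_tables_to_csv(pdf_path: str, pages: Iterable[int], out_prefix: str) -> int:
    """Extract tables from the given pages and write one CSV per table."""
    import camelot  # lazy import so the helper above works without Camelot installed

    tables = camelot.read_pdf(pdf_path, pages=pages_arg(pages))
    for index, table in enumerate(tables):
        # Each extracted table exposes a pandas DataFrame as `.df`.
        table.df.to_csv(f"{out_prefix}-{index}.csv", index=False, header=False)
    return len(tables)
```

Calling `pdf_tables_to_csv("report.pdf", [1, 2], "report")` would write `report-0.csv`, `report-1.csv`, and so on, one file per detected table (the PDF name is a placeholder).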

Thanks will check it out.

Have you compared it with Textract?

The best way I've found for extracting tables from PDFs in a well formatted way is Adobe's free online service:

https://www.adobe.com/acrobat/online/pdf-to-excel.html

I'm a huge fan of this little OCR tool, installed through brew onto my MacBook: https://github.com/schappim/macOCR

Same, and for my purposes, I just wrap that utility in a macOS Shortcut I can click from my menu bar, or launch from Quicksilver.

Great to hear! Shottr also has nice OCR these days.

Quicksilver, now there’s a blast from the past! I don’t think I’ve installed it on any Mac in the past 5 years, but I used to love it.

What are the advantages over native macOS shortcuts these days?

Awesome to hear!

I did notice that many Mac apps, including Safari and Preview and Notes, do OCR on images automatically. It's pretty neat that I can easily select text in an image and copy and paste it somewhere else.

It’s kinda ridiculous how good it is, you can even select text from inside a YouTube video while it’s playing (or pause if needed).

Also if it’s text of a URL/domain or a QR code (eg in a photo of a poster, or in a video) you can hold-press/hold-click to open the link directly from the image.

Thanks for sharing this! I had no clue about it.

The Photos app too. It’s just so good at conferences or when you need a long string digitised (iso default router password!). Photo > select > copy > then paste on phone or Mac (via that actually awesome handoff feature).

I'll throw my solution into the mix: https://skaplanofficial.github.io/PyXA/tutorial/images.html#...

PyXA uses the Vision framework to extract text from one or more images at a time. It's only a small part of the package, so it might be overkill for a one-off operation, but it's an option.

fyi you're using the old and less accurate api, VNRecognizeTextRequest

ImageAnalyzer is newer and much better

I bet this shortcut from OP is also using the older API under the hood

ImageAnalyzer is Swift-only and has no corresponding Objective-C method, so it's not available in PyObJC. I can look into bridging it at some point.

This would probably be pretty easy to do with swift and python processes running side by side with grpc.

The article was posted.. yesterday, and the entire reason given for not using the builtin Shortcuts sharing feature is... an article from 2 years ago, about a bug in the shortcuts hosting service, which has obviously been fixed.

I get that some people will want to create it from scratch themselves or incorporate the actual meat of it into a larger shortcut... but not sharing one that does what the article says, because of a bug 2 years ago, is a bit of a weird take.

sorry, that link may have been a cheap shot... but I did try to export the shortcut I created, and kept getting an error about not being signed in to icloud...! and I am signed in to icloud. it's just so confusing.

why can't shortcuts be exported as ... shortcut files?

it's not ideal to have people recreate the shortcut step by step (which is what I ended up describing in my post) but... I couldn't find a better way..! :-)

if you'd be able to recreate the shortcut and share it, and post the link here (and/or email it to me), I'd love to place that in the blog article! thank you

It seemed to work on iOS (https://www.icloud.com/shortcuts/cd7d2c5e63d8482ab0618e163bb...)

I'll try it again on macOS when I'm back at my desk.

Edit: also works on macOS Sonoma (https://www.icloud.com/shortcuts/6216aa9072144846adcaae69a5a...) - this one has all input sources selected, the iOS created one has only images/media/pdfs/files/rich text selected for input.

you can use clipboard with pbpaste/pbcopy commands

ocr-text "$1" && pbpaste

It also outputs to the command line if you pipe it to cat

    shortcuts run ocr-text -i new-haven-pizza.jpg | cat

Oddly enough if you enable it as a "quick action", when you run it, Finder creates a file in the same directory as the image containing the OCRed text (and named according to the first line of OCRed text).

I went back into my shortcut and Shortcuts added a pseudo-action "Stop and output <copy to clipboard>; if there's nowhere to output: <Do Nothing>", and I would think that "Do Nothing" would mean don't create a file, but I guess Quick Actions has some kind of special meaning given that all the other ones seem to be intransitive actions, implying that the user wants a file as the output.

I've built an opensource tool that gives you both CLI and a nice UI. It is free.

https://trex.ameba.co

+1000 for Trex!! I use it daily, thank you for creating it!

I am impressed how it handles handwriting and crappy screen grabs.

It's not so well known that one of the original rationales for "offside rule" programming languages is that it works just as easily for handwritten code as it does for typed.

Will we ever have programming languages that are primarily designed to take input from whiteboard grabs? (ie where not only handwriting, but also placement, connectivity, and maybe shape are meaningful?)

To place contents in a file (not claiming this is the most efficient way but it works)

    OCRTHISFILE="ocr-test.jpg"
    shortcuts run ocr-text -i "${OCRTHISFILE}"
    pbpaste > "${OCRTHISFILE}.txt"

or to view output and place in file:

    OCRTHISFILE="ocr-test.jpg"
    shortcuts run ocr-text -i "${OCRTHISFILE}"
    pbpaste | tee "${OCRTHISFILE}.txt"
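Since the title mentions Python, the same steps can be sketched with `subprocess`. This assumes, like the shell version above, that the shortcut is named `ocr-text` and that it copies its result to the clipboard. The function names are made up for illustration.

```python
import subprocess
from pathlib import Path
from typing import List


def shortcut_command(image: str, shortcut: str = "ocr-text") -> List[str]:
    """Build the `shortcuts run` invocation used in the comment above."""
    return ["shortcuts", "run", shortcut, "-i", image]


def text_path_for(image: str) -> Path:
    """Mirror the `${OCRTHISFILE}.txt` naming: ocr-test.jpg -> ocr-test.jpg.txt."""
    return Path(image + ".txt")


def ocr_to_file(image: str, shortcut: str = "ocr-text") -> Path:
    """Run the shortcut, then save whatever it left on the clipboard (macOS only)."""
    subprocess.run(shortcut_command(image, shortcut), check=True)
    clipboard = subprocess.run(
        ["pbpaste"], check=True, capture_output=True, text=True
    ).stdout
    out = text_path_for(image)
    out.write_text(clipboard, encoding="utf-8")
    return out
```

`ocr_to_file("ocr-test.jpg")` would leave the text in `ocr-test.jpg.txt`. Because it goes through the clipboard, anything copied in between the two steps would end up in the file instead.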

Or use macOS Shortcuts to output the OCR text as a file (Action: "Append to Text File")

Yes took a bit of fiddling but that does work thanks.

Weird, I couldn't get it to work on a bunch of different files, even using very simple file names. Kept getting this error:

Error: The operation couldn’t be completed. (WFBackgroundShortcutRunnerErrorDomain error 1.)

I suppose you haven't renamed the new shortcut to `ocr-text`

I did do that.

Are iOS and macOS shortcuts cross-compatible? I didn't know there were shortcuts for the Mac; it seems pretty powerful to be able to run them from the terminal too. Thanks OP

Yes, they are compatible as long as you use actions available on both platforms. For example, you can use AppleScript or shell in macOS but it will not work on iOS. However, if you use shortcuts from cross-platform apps, it works even when you write files into the iCloud folder. For example, I made a shortcut that takes today’s events from the Calendar and appends the list to a Markdown file in an Obsidian vault on iCloud. I use it to scaffold meeting notes, and it works on my phone too.

I would really love an `ocrmypdf` like tool which uses Apple Vision to create searchable PDFs from scanned images. I've been searching every week or so for some kind of project but so far haven't found anything. Perhaps it's time to make it myself...

that sounds bonkers useful!! you should definitely prototype the smallest version possible and publish it (and post it here as a Show HN!)

I know that I'd definitely use it!

Very cool. Anyone know how this compares to AWS Textract in general? Does the Apple Vision framework support table recognition?

It looks like it does, but you need to handle it at a pretty low level, this shortcut won't get you there: https://developer.apple.com/videos/play/wwdc2019/234?time=19...

Surprisingly, the Extract Text from Image action is available on Intel Macs: normally, features like automatic image OCR are limited to Apple Silicon Macs.

It's almost as if the constant clucking about "planned obsolescence" and deliberately withholding features is a load of bollocks.

Awesome! Is there a similar technique for the Apple vision ‘Copy Subject’ feature? I’ve become extremely reliant on it, but it feels very limited in access.

I had to Google this, do you mean the feature in Photos on mobile where you can "extract" items from a picture and make them into stickers? Apple seems to call it "lifting subjects" [0] [1].

0: https://support.apple.com/guide/iphone/lift-a-subject-from-t...

1: https://developer.apple.com/videos/play/wwdc2023/10176/

EDIT: Try replacing the "Extract text" action with "Remove background". When running the shortcut, use "-o" to specify output image filename.

   shortcuts run remove-background -i ~/Downloads/portrait-beard.avif -o beard.jpg

Very cool, and seems handy!

I’ve always had good results from the Preview.app. I wonder how this engine compares for number of errors in a difficult source versus Free alternatives.

Yeah preview app is everything. I take screenshots now for deliverables.

I tried doing something similar on Windows, and realized that PowerToys[1], a Microsoft project I already had installed, actually contains a very good OCR tool[2]. Just press Win+Shift+T and select the area to scan, and the text will be copied to the clipboard.

[1] https://learn.microsoft.com/en-us/windows/powertoys/

[2] https://learn.microsoft.com/en-us/windows/powertoys/text-ext...

I use AutoHotkey + PowerToys to append screenshot data to a CSV; it works great with its own key mapping

This works great for local files. I can't seem to modify the shortcut correctly for an image hosted at a public URL.
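One workaround, sketched here rather than tested against every setup: leave the shortcut alone, download the image to a temporary file, and pass that path in. This assumes the shortcut is named `ocr-text` and, as noted elsewhere in this thread, that it prints its result when stdout is piped. If it only writes to the clipboard, read it back with `pbpaste` instead. The function names are hypothetical.

```python
import subprocess
import tempfile
import urllib.request
from pathlib import PurePosixPath
from urllib.parse import urlparse


def suffix_from_url(url: str, default: str = ".jpg") -> str:
    """Guess a file extension from the URL path so the shortcut sees an image file."""
    return PurePosixPath(urlparse(url).path).suffix or default


def ocr_from_url(url: str, shortcut: str = "ocr-text") -> str:
    """Download the image to a temp file, run the shortcut on it, return its stdout."""
    with tempfile.NamedTemporaryFile(suffix=suffix_from_url(url)) as tmp:
        with urllib.request.urlopen(url) as response:
            tmp.write(response.read())
        tmp.flush()  # make sure the bytes are on disk before the shortcut reads them
        result = subprocess.run(
            ["shortcuts", "run", shortcut, "-i", tmp.name],
            check=True,
            capture_output=True,
            text=True,
        )
    return result.stdout
```

The extension guess matters only because the shortcut's input type is usually inferred from the file name. URLs without an extension fall back to `.jpg`, which is an arbitrary default.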

use LLMs (gpt-4-vision or LLaVA) with aichat

`aichat -f tmp/test.png -- output only text in the image`

https://github.com/sigoden/aichat

If you want to do this a lot easier use: https://github.com/schappim/macOCR

On Windows, A9T9 does a great job of OCR'ing scanned JPEG files (and any JPEG file). It's also free.

I scanned about 100 A4 documents in just a couple of minutes.

I'm using https://xclippy.com/ app. It also has an OCR feature.

On Windows I recommend text extractor from powertoys:

https://learn.microsoft.com/en-us/windows/powertoys/text-ext...

CleanShot X (which is great) also allows you to OCR from your screen ("Capture Text")

I made a Shortcut + PHP to get text from a screenshot, ask ChatGPT to make a task name from the text, create a new task in ClickUp, and attach the screenshot. I use it often.

Raycast (macOS only) is also nice as it's able to search images by text. It also allows you to copy text from those images. Quick official demo here: https://www.youtube.com/watch?v=c96IXGOo6E4

Are there any benchmarks on speed/compute/accuracy anywhere comparing it to Tesseract v5?

How do you interact with the built-in OCR via the CLI? To me, "doing" something means which OCR tooling, which fonts it recognises, and all the associated package management and tuning, not "how I configure the GUI and UI to let me use the tool they shipped with the OS".

Have you guys tried ChatGPT or another alternative?

Speaking of the need for OCR, I found this comment relevant and quite funny:

we already have a common, portable data format for social media. It's screenshots of tweets

https://news.ycombinator.com/item?id=38841569

I don't know why but instead of pasting the text it copied to make sure it worked, I made it read it:

shortcuts run ocr-text -i <A PATH TO SOME IMAGE> | say -v Fred

I have played around with the OCR on my mac, and have been very impressed. It has been consistently better than tesseract for my purposes.

However, when creating a PDF from images using Preview and exporting using ‘Embed Text’ option to OCR, I have noticed the text is worse than if you OCR the exact same images using the shortcut above or using a script. Presumably Preview is using the Vision framework’s less accurate fast path when preparing the PDF. https://developer.apple.com/documentation/vision/recognizing...
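If someone wants to check the fast-vs-accurate theory, a small comparison sketch: OCR the same image at both Vision recognition levels and list the lines that differ. It assumes `pip install ocrmac` and that `ocrmac.OCR` accepts `recognition_level="fast"` or `"accurate"`, as its README describes. The helper names are made up, and it says nothing about what Preview actually does internally.

```python
import difflib
from typing import List, Tuple


def recognized_lines(path: str, level: str) -> List[str]:
    """OCR one image with ocrmac at the given Vision recognition level (macOS only)."""
    from ocrmac import ocrmac  # lazy import; only needed when actually running OCR

    results = ocrmac.OCR(path, recognition_level=level).recognize()
    return [text for text, _confidence, _bbox in results]


def changed_lines(fast: List[str], accurate: List[str]) -> List[Tuple[List[str], List[str]]]:
    """Pair up the runs of lines that differ between the two OCR passes."""
    matcher = difflib.SequenceMatcher(a=fast, b=accurate, autojunk=False)
    return [
        (fast[i1:i2], accurate[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

Running `changed_lines(recognized_lines(img, "fast"), recognized_lines(img, "accurate"))` on a page image, then comparing the "fast" side against the text Preview embedded, would show whether the errors line up.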

Does anyone know of a straightforward library or setup to scan newspapers and/or magazines and detect and extract images and advertisements?

It doesn't work for Chinese characters :(

macOS Ventura and newer actually have basic OCR functionality integrated into the Image Capture UI. When using an AirPrint-compatible scanner and scanning to PDF, the checkbox "OCR" is shown in the right pane.

Python is quite basic and might not be very helpful for advanced users. It seems overly detailed for such a simple task.