Nice post, OP! I was super impressed with Apple's Vision framework. I used it on a personal project that involved OCRing tens of thousands of spreadsheet screenshots and ingesting them into a Postgres database. I tried other CPU-based OCR methods (since macOS and Nvidia still don't play nice together), such as Tesseract, but found the output to be incorrect too often. The Vision framework not only produced the highest-quality output I had seen, it also used the least compute. It was fairly unstable, but I can chalk that up to user error w/ my implementation.
I used a combination of RhetTbull's vision.py (for the actual implementation) [1] + ocrmac (for experimentation) [2] and was pleasantly surprised by the performance on my i7 6700k hackintosh.
I wouldn't call myself a programmer, but I can generally troubleshoot anything given enough time, and this did cost time.
[1]: https://gist.github.com/RhetTbull/1c34fc07c95733642cffcd1ac5...
Tesseract alone is widely known to be "meh" at this point.
If you look at RAG frameworks as one example, they'll typically use/support a variety of implementations. Tesseract is almost always supported, but it's rarely ideal, with projects like Unstructured[0] and DocTR[1] being preferred. By leveraging more-or-less SOTA vision models[2][3] they embarrass Tesseract.
I haven't compared them to the Apple Vision framework but they're absolutely better than Tesseract and potentially even Apple Vision.
There are also various approaches to use these in conjunction but that gets involved.
[0] - https://github.com/Unstructured-IO/unstructured-inference
[1] - https://github.com/mindee/doctr
[2] - https://github.com/mindee/doctr#models-architectures
[3] - https://github.com/Unstructured-IO/unstructured-inference#mo...
https://github.com/mindee/doctr/issues/1049
https://github.com/JaidedAI/EasyOCR#whats-coming-next
Happy to see OCR is advancing lately, but I really need HWR.
I am looking for something this polished and reliable for handwriting. Does anyone have any pointers? I want to integrate it into a workflow with the eink tablet I take notes on. A few years ago, I tried various models, but they performed poorly (around 80% accuracy) on my handwriting, which I can read almost 90% of the time.
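If anyone wants to put a number on "accuracy" when testing engines against their own samples, here is a minimal sketch of character-level accuracy based on edit distance. The function names are made up for illustration, and this is only one common way to score OCR output, not how any engine mentioned in this thread reports its own quality.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep if equal)
            ))
        prev = curr
    return prev[-1]


def char_accuracy(truth: str, ocr: str) -> float:
    """1 minus the character error rate, clamped at 0."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - edit_distance(truth, ocr) / len(truth))
```

Transcribe a handful of pages by hand, run each engine over the same pages, and compare `char_accuracy(hand_typed, engine_output)` across engines.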
This is maybe not a solution, but how does ChatGPT do on your handwriting if you upload a photo? If that works well then maybe you can use the API?
AWS Textract is by far the best OCR engine we've used, and it does great with handwritten text.
Reading https://heartbeat.comet.ml/comparing-apples-and-google-s-on-... (2017), I expect this code to work for handwritten text.
How well it works on your handwriting is for you to test, but if you, having all kinds of contextual information, can’t read it well, I guess it won’t, either.
Does anyone know what languages Apple supports? The docs don't have a list. Tesseract might be "meh" but it is probably the best open source option available for Devanagari scripts or Persian, for example.
I've used it on a number of Cyrillic languages (Russian, Bulgarian, etc.), Hungarian, Turkish, along with the typical ones (Spanish, German, French, Italian, Portuguese). I've heard it supports Chinese. I just tried Persian and Devanagari samples on my Mac and it could not do either.
I found this detailed comparison of OCRs (both open source and cloud services) super helpful: https://source.opennews.org/articles/our-search-best-ocr-too...
docTR comes out as the strongest open solution.
Looks nice! Do you know if they can do table structuring as well? Something similar to what Amazon Textract does[0].
[0] - https://docs.aws.amazon.com/textract/latest/dg/how-it-works-...
I have found Tesseract to be both better than I expect (it feels great when it works most of the time) and worse than I expect (not quite enough correct data to fully rely on).
It's better than Tesseract? That's really impressive.
Could you run a farm of macOS machines and turn this into an API for profit? Would that be legal?
You could run a farm of iphones to OCR memes if you felt so inclined
https://findthatmeme.com/blog/2023/01/08/image-stacks-and-ip...
That blog post is glorious. Thanks for sharing.
In my experience using it constantly, it is far beyond Tesseract’s.
I have never gotten truly garbled output from Apple’s, whereas Tesseract will frequently produce random Unicode characters from text.
Apple’s also handles things like overlapping text or changing font sizes and typefaces far better than any open-source OCR I’ve used.
IMO it goes head to head with the Amazon/Google cloud OCR services. It works superbly.
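For batch jobs, one cheap way to catch the "random Unicode characters" failure mode is to flag output with too many characters you would never expect in the source. This is a rough sketch that assumes the documents are plain English/ASCII; the names and the 10% threshold are arbitrary choices for illustration and would need adjusting for other languages.

```python
import string

# Assumption: the scanned material only contains ASCII letters, digits,
# punctuation and whitespace. Widen this set for other scripts.
EXPECTED_CHARS = set(string.ascii_letters + string.digits + string.punctuation + " \t\n")


def unexpected_ratio(text: str, expected=EXPECTED_CHARS) -> float:
    """Fraction of characters in text that fall outside the expected set."""
    if not text:
        return 0.0
    return sum(ch not in expected for ch in text) / len(text)


def looks_garbled(text: str, threshold: float = 0.1) -> bool:
    """True when more than `threshold` of the characters are unexpected."""
    return unexpected_ratio(text) > threshold
```

Anything flagged can be re-run through a second engine or set aside for manual review.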
Yes, as long as you pay for the Mac hardware it’s yours to do with as you please. I’m not an attorney and this is not legal advice.
Way, way better than Tesseract!
Is there a tutorial on how to extract a table from a PDF or image with the Apple Vision framework? I tried the two links in your post and they just extract the text without maintaining the table structure.
AWS Textract provides sample Python code to extract tables into CSV, which works great.
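Not a tutorial, but if the OCR step gives you bounding boxes, you can rebuild rows yourself. A minimal sketch, assuming each result is a `(text, confidence, (x, y, w, h))` tuple with coordinates normalized to the image and the origin at the bottom-left (my understanding of what Vision wrappers such as ocrmac return, so check your own output). The function names and the `row_tol` value are hypothetical, and this does not handle empty cells or merged columns.

```python
import csv
import io


def boxes_to_rows(results, row_tol=0.01):
    """Group OCR boxes into rows by vertical center, then order cells left to right."""
    # Larger y means higher on the page (bottom-left origin), so sort descending.
    ordered = sorted(results, key=lambda r: -(r[2][1] + r[2][3] / 2))
    rows = []  # each entry: (center_y of the row's first box, [(x, text), ...])
    for text, _conf, (x, y, w, h) in ordered:
        center_y = y + h / 2
        if rows and abs(rows[-1][0] - center_y) <= row_tol:
            rows[-1][1].append((x, text))
        else:
            rows.append((center_y, [(x, text)]))
    return [[text for _x, text in sorted(cells, key=lambda c: c[0])]
            for _cy, cells in rows]


def rows_to_csv(rows) -> str:
    """Serialize rows to CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

For clean spreadsheet screenshots this kind of row grouping may be enough; for ruled or irregular tables a dedicated table extractor will likely do better.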
I had good repeated success extracting tables from PDFs using Camelot (Python, https://github.com/camelot-dev/camelot)
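A minimal sketch of what that looks like, assuming the `camelot-py` package is installed and the PDF contains real text (as far as I know Camelot does not OCR scanned pages). `pdf_tables_to_csv` and the output naming are hypothetical; `flavor="lattice"` is for tables with ruled lines, and `"stream"` is the alternative for whitespace-separated ones.

```python
def pdf_tables_to_csv(pdf_path, out_prefix="table", pages="1", flavor="lattice"):
    """Write every table Camelot finds to <out_prefix>_<n>.csv and return the paths."""
    import camelot  # third-party; imported here so this module loads without it

    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
    paths = []
    for i in range(len(tables)):
        path = f"{out_prefix}_{i}.csv"
        # Each table exposes its cells as a pandas DataFrame via .df
        tables[i].df.to_csv(path, index=False, header=False)
        paths.append(path)
    return paths
```

Each table also carries a `parsing_report` that is worth checking before trusting the output.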
Thanks, will check it out.
Have you compared it with Textract?
The best way I've found for extracting tables from PDFs in a well formatted way is Adobe's free online service:
https://www.adobe.com/acrobat/online/pdf-to-excel.html