Show HN: Benchmarking VLMs vs. Traditional OCR

146 points by themanmaran 5 months ago

Vision models have been gaining popularity as a replacement for traditional OCR, especially with Gemini 2.0 becoming cost-competitive with the cloud platforms.

We've been continuously evaluating different models since we released the Zerox package last year (https://github.com/getomni-ai/zerox). And we wanted to put some numbers behind it. So we’re open sourcing our internal OCR benchmark + evaluation datasets.

Full writeup + data explorer here: https://getomni.ai/ocr-benchmark

GitHub: https://github.com/getomni-ai/benchmark

Hugging Face: https://huggingface.co/datasets/getomni-ai/ocr-benchmark

A couple of notes on the methodology:

1. We are using JSON accuracy as our primary metric. The end goal is to evaluate how well each OCR provider can prepare the data for LLM ingestion.

2. This methodology differs from a lot of OCR benchmarks, because it doesn't rely on text similarity. We believe text similarity measurements are heavily biased towards the exact layout of the ground truth text, and penalize correct OCR that has slight layout differences.

3. Every document goes Image => OCR => Predicted JSON, and we compare the predicted JSON against the annotated ground truth JSON. The VLMs are capable of Image => JSON directly, but we are primarily trying to measure OCR accuracy here. We're planning to release a separate report on direct JSON accuracy next week.
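The JSON accuracy metric described in these notes can be sketched roughly as below. This is a minimal illustration, not the benchmark's actual code: the helper names `flatten` and `json_accuracy` are hypothetical, and it assumes accuracy is one minus the share of ground-truth leaf fields whose predicted value differs.

```python
from typing import Any, Dict


def flatten(value: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts/lists into a {path: leaf_value} mapping."""
    if isinstance(value, dict):
        items = [(f"{prefix}.{k}" if prefix else str(k), v) for k, v in value.items()]
    elif isinstance(value, list):
        items = [(f"{prefix}[{i}]", v) for i, v in enumerate(value)]
    else:
        return {prefix: value}
    flat: Dict[str, Any] = {}
    for path, child in items:
        flat.update(flatten(child, path))
    return flat


def json_accuracy(truth: Any, predicted: Any) -> float:
    """Share of ground-truth leaf fields that the prediction got right.

    Assumption: extra fields that appear only in the prediction are ignored.
    """
    truth_flat = flatten(truth)
    pred_flat = flatten(predicted)
    if not truth_flat:
        return 1.0
    missing = object()  # sentinel so an absent field never equals a real value
    wrong = sum(
        1
        for path, expected in truth_flat.items()
        if pred_flat.get(path, missing) != expected
    )
    return 1 - wrong / len(truth_flat)


# Made-up example document: one of four leaf fields is wrong.
truth = {"vendor": "Acme", "total": 1000, "items": [{"name": "a", "amount": 500}]}
predicted = {"vendor": "Acme", "total": 1001, "items": [{"name": "a", "amount": 500}]}
score = json_accuracy(truth, predicted)
```

Because the comparison is per field rather than per character, a transcript with a different layout but the same extracted values scores the same.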

This is a continuous work in progress! There are at least 10 additional providers we plan to add to the list.

The next big roadmap items are:

- Comparing OCR vs. direct extraction. Early results here show a slight accuracy improvement, but it varies a lot with page length.

- A multilingual comparison. Right now the evaluation data is English only.

- A breakdown of the data by type (best model for handwriting, tables, charts, photos, etc.)

simonw 5 months ago

The benchmark I most want to see around OCR is one that covers risks from accidental (or deliberate) prompt injection - I want to know how likely it is that a model might OCR a page and then accidentally act on instructions in that content rather than straight transcribing it as text.

I'm interested in the same thing for audio transcription too, for models like Gemini or GPT-4o audio accepting audio input.

themanmaran 5 months ago

We've tested basic prompt injections within images, but have not been able to reliably trigger any adverse effects.
However, there are two big bugs we've found with VLMs:
1. Correcting the document. Say you have an income statement where all the line items add up to $1,001, but the total says $1,000. The model will frequently correct the final output, which would be terrible if you were trying to build an "identify mistakes in these documents" type of tool.
2. Infinite loops. Sometimes the models will get hung up on a particular token and repeat it until the request times out. This gets triggered a lot in markdown tables |---|---|----------------->
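One way to catch the second failure mode early is to check whether the output so far ends in a long run of the same short unit. The sketch below is only an illustration: the function names, the 8-character unit limit, and the threshold of 50 repeats are all assumptions, not anything taken from Zerox or the benchmark.

```python
def trailing_repeats(text: str, max_unit: int = 8) -> int:
    """Largest number of back-to-back repeats of a short unit at the end of text.

    Tries every unit length from 1 to max_unit characters.
    """
    best = 0
    for size in range(1, min(max_unit, len(text)) + 1):
        unit = text[-size:]
        count = 0
        pos = len(text)
        # Walk backwards one unit at a time while the same unit keeps appearing.
        while pos >= size and text[pos - size:pos] == unit:
            count += 1
            pos -= size
        best = max(best, count)
    return best


def looks_stuck(text: str, threshold: int = 50, max_unit: int = 8) -> bool:
    """True when the tail of the output looks like a runaway repetition."""
    return trailing_repeats(text, max_unit) >= threshold


# Made-up outputs: a normal table versus a separator row that never ends.
normal = "| a | b |\n|---|---|\n| 1 | 2 |"
runaway = "| a | b |\n|---|---|" + "-" * 200
```

When streaming, a check like this could run on the accumulated text every few chunks so the request is aborted instead of waiting for the timeout. The threshold should be larger than the widest legitimate table separator you expect.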
mentalgear 5 months ago

An interesting OCR aspect indeed; hence it's great that their OCR Benchmark is open source, allowing for the addition of such a category. Or maybe there are already separate OCR prompt-injection benchmarks.
Also, it'd be useful to understand how an OCR context differs from standard injection attacks. One thing I can think of is potential tabular injection attacks. But image-based attacks, especially for VLMs, are also relevant. So an OCR injection attack benchmark might just be a combination of different domain-specific benchmarks formatted as images.
th0ma5 5 months ago

Due to the nature of how the technology works this risk shouldn't ever be possible to eliminate without breaking fundamental parts of what we find useful. In band vs out of band signalling. So far the out of band analysis hasn't helped either.
- falcor84 5 months ago
  
  I recall reading somewhere that traditionally Hebrew religious scrolls (prepared by manual copying) would be compared against the original by young children who know the letters but can't really read well. In this vein, I wonder if we could have a VLM intentionally made to not understand the actual words.
magicalhippo 5 months ago

Playing with local llama vision and minicpm-v models, they do seem resistant to what one might call blatant prompt injection, i.e. just inserting one of the classic "ignore previous instructions" lines or similar.
So yeah, would be curious how susceptible they are to more refined approaches. Are there some known examples?

blindriver 5 months ago

As far as I'm concerned, all of these specialty services are dead compared to a generalized LLM like OpenAI or Gemini.

I wrote in a previous post about how NLP services were dead because of LLMs and obviously people in NLP took great offense to that. But I was able to use the NLP abilities of an LLM without needing to know anything about the intricacies of NLP or any APIs and it worked great. This post on OCR pretty much shows exactly what I meant. Gemini does OCR almost as well as OmniAI (granted I've never heard of it), but at 1/10th the cost. OpenAI will only get better very quickly. Kudos to OmniAI for releasing honest data, though.

Sure you might get an additional 5% accuracy from OmniAI vs Gemini but a generalized LLM can do so much more than just OCR. I've been playing with OpenAI this entire weekend and literally the sky's the limit. Not only can you OCR images, you can ask the LLM to summarize it, transform it into HTML, classify it, give a rating based on whatever parameters you want, get a lexile score, all in a single API call. Plus it will even spit out the code to do all of the above for you to use as well. And if it doesn't do what you need it to do right now, it will pretty soon.

I think the future of AI is going to be pretty bleak for everyone except the extremely big players that can afford to invest hundreds of billions of dollars. I also think there's going to be a real battle over copyright in less than 5 years, which will also favor the big rich players.

Eridrus 5 months ago

5% accuracy can be worth a lot.
The price of any of these services pales in comparison to getting a human involved in any fraction of cases.
It is likely reasonable to expect the base LMs to keep getting better and for there to not be a moat on accuracy in the long term, but businesses are not just built on benchmark accuracy and have plenty of other ways to survive, even if the technology under the hood changes.
- toss1 5 months ago
  
  YES
  >>5% accuracy can be worth a lot.
  Most surprising to me about these results is the BEST error rate was over 8% errors (91.7% accuracy) and the worst was 40%.
  Their method of calculating errors seems quite good:
  >> Accuracy is measured by comparing the JSON output from the OCR/Extraction to the ground truth JSON. We calculate the number of JSON differences divided by the total number of fields in the ground truth JSON. We believe this calculation method lines up most closely with a real world expectation of accuracy.
  >> Ex: if you are tasked with extracting 31 values from a document, and make 4 mistakes, that results in an 87% accuracy.
  Especially when dealing with numbers and money, having 10% of them being wrong seems unusable, often worse than doing nothing.
  Having humans check the results instead of doing the transcriptions would be better, but humans are notoriously bad at maintaining vigilance doing the same task over many documents.
  What would be interesting is finding which two OCR/AI systems make the most different mistakes and running documents against both. Flagging only the disagreements for human verification would reduce the task substantially.
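The "flag only the disagreements" idea can be sketched as follows. This is a minimal illustration assuming each engine's output has already been reduced to a flat field-to-value mapping; the function name and the sample values are made up.

```python
def fields_to_review(engine_a: dict, engine_b: dict) -> list:
    """Field names where two extractions disagree, including fields only one has."""
    missing = object()  # sentinel so an absent field never matches a real value
    return sorted(
        key
        for key in engine_a.keys() | engine_b.keys()
        if engine_a.get(key, missing) != engine_b.get(key, missing)
    )


# Made-up outputs from two hypothetical OCR engines for the same invoice.
engine_a = {"invoice_no": "A-17", "total": "1000.00", "date": "2024-01-05"}
engine_b = {"invoice_no": "A-17", "total": "1001.00", "tax": "80.00"}
review_queue = fields_to_review(engine_a, engine_b)
```

Only the fields in `review_queue` would go to a human. Agreement is not proof of correctness, though: if both engines misread a field the same way, it passes unflagged.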
  
  sgc 5 months ago
  
  > What would be interesting is finding which two OCR/AI systems make the most different mistakes and running documents against both. Flagging only the disagreements for human verification would reduce the task substantially.
  There have been OCR products that do that for decades, and I would hope all the OCR startups are doing the same already. Oftentimes something is objectively difficult to read and the various models will all fail in the same place, reducing the expected utility of this method. It still helps of course. I forget the name of the product, but there was one that used about 5 OCR engines and would use consensus to optimize its output. It could never beat ABBYY FineReader though; it was a distant second place.
- blindriver 5 months ago
  
  I think 87% to 92% accuracy really isn't much difference. You're still going to get errors to the point where the level and amount of checking you need to do isn't affected. Even at 98-99% you still have to do a lot of error checking.
  But you get most of the bang for the buck for 1/10th the cost so I think overall it's far, far superior.
modo_mario 5 months ago

Wouldn't an issue be that, with LLMs replacing NLP, you often don't care about the super rare hiccup or hallucination, whereas in the places where OCR tends to be used it's often a no-go? Just saying this while trying to remember all the places where I've implemented it or seen it implemented. A common one was billing stuff.

jasonjmcghee 5 months ago

A big takeaway for me is that Gemini Flash 2.0 is a great solution to OCR, considering accessibility, cost, accuracy, and speed.

It also has a 1M token context window, though from personal experience it seems to work better the smaller the context window is.

Seems like Google models have been slowly improving. It wasn't so long ago I completely dismissed them.

shawabawa3 5 months ago

And from my personal experience, Gemini 2.0 Flash vs. 2.0 Pro is not even close.
I had Gemini 2.0 Pro read my entire handwritten, stain-covered, half English, half French family cookbook perfectly the first time.
It's _crazy_ good. I had it output the whole thing in LaTeX format to generate a printable document immediately too.
anon373839 5 months ago

I’m definitely not getting that takeaway. This wasn’t even an OCR benchmark: the task was structured data extraction, and deterministic metrics were set aside in favor of GPT-as-a-judge.
VLMs are every bit as susceptible to the (unsolved) hallucination problem as regular LLMs are. I would not use them to do OCR on anything important because the failure modes are totally unbounded (unlike regular OCR).
- michaelt 5 months ago
  
  > This wasn’t even an OCR benchmark: the task was structured data extraction, and deterministic metrics were set aside in favor of GPT-as-a-judge.
  Looks like they've got deterministic metrics to me: For each document they've got a ground truth set of JSON extracted data, and they use json-diff to calculate the fields that disagree.
  There is GPT-4o in their evaluation pipeline - but only as a means of converting the OCRed document into their target JSON schema.
- bayindirh 5 months ago
  
  Also, what's strange is that no free or paid OCR engine was added to the mix for the evaluation. Tesseract is built specifically for OCR'ing scanned documents, and it has both traditional and neural-network-based modes, to boot.
  
  michaelt 5 months ago
  
  > Also, what's strange is that no free or paid OCR engine was added to the mix for the evaluation.
  The article says they evaluated "Traditional OCR providers (Azure, AWS Textract, Google Document AI, etc.)"
  Are those not paid OCR engines?
  
  bayindirh 5 months ago
  
  You're absolutely correct. I read the article quite fast, and assumed they are AI, albeit not LLM powered systems as well.
  I've been using computers since I could read, and when somebody says "traditional OCR", I think about the older systems like Tesseract or ABBYY's FineReader, which can also be automated for batch processing, albeit mostly locally.
  Sending a huge number of PDFs to a cloud server to get them processed is still a bit alien to me, since it can be done on-premises (or on a VPS with said software) very efficiently from my perspective.
pzo 5 months ago

I'm wondering how Gemini can OCR a big image correctly and with good quality. They charge ~250 tokens for an image as input, always the same no matter the size of the image you send. 250 tokens is ~200 words. Will OCR work if you send a 4K image that has a lot of text in a small font? What if a page has more than 200 words? Is Google selling it at cost?

gbertb 5 months ago

How does this compare to Marker https://github.com/VikParuchuri/marker?

marcotac 5 months ago

I'm curious as well

bn-l 5 months ago

What is the privacy of the documents for the cloud service? There’s nothing in the privacy policy about data sent over the api.

dcreater 5 months ago

Then you have to assume by default that your data is visible to their employees, can be monetized, used to improve their models, etc.
- bn-l 5 months ago
  
  I’ll stick with the devil I know that is only slightly worse in their own benchmarks (Google)
cess11 5 months ago

"San Francisco, California"
There is none, because CLOUD Act.

EarlyOom 5 months ago

OCR seems to be mostly solved for 'normal' text laid out according to Latin alphabet norms (left to right, normal spacing etc.), but would love to see more adversarial examples. We've seen lots of regressions around faxed or scanned documents where the text boxes may be slightly rotated (e.g. https://www.cad-notes.com/autocad-tip-rotate-multiple-texts-...) not to mention handwriting and poorly scanned docs. Then there's contextually dependent information like X-axis labels that are implicit from a legend somewhere, so it's not clear even with the bounding boxes what the numbers refer to. This is where VLMs really shine: they can extract text then use similar examples from the page to map them into their output values when the bounding box doesn't provide this for free.

codelion 5 months ago

That's a great point about the limitations of traditional OCR with rotated or poorly scanned documents. I agree that VLMs really shine when it comes to understanding context and extracting information beyond just the text itself. It's pretty cool how they can map implicit relationships, like those X-axis labels you mentioned.

betula_ai 5 months ago

Thank you for sharing this. Some of the other public models that we can host ourselves may perform in practice better than the models listed - e.g. Qwen 2.5 VL https://github.com/QwenLM/Qwen2.5-VL?tab=readme-ov-file

alkh 5 months ago

What is the best solution for recognizing handwritten text that combines multiple languages, especially in cases where certain letters look the same but represent different sounds? For example, the letter 'p' in English versus 'р' in Cyrillic languages, which sounds more like the English 'r'.

alok-g 5 months ago

Looking at the sample documents, this seems more focused on tables and structured data extraction and not long-form texts. The ground truth JSON has so much less information than the original document image. I would love to see a similar benchmark for full contents including long-form text and tables.

lyu07282 5 months ago

Indeed, from their conclusions:
> They [VLMs] are generally more capable of "looking past the noise" of scan lines, creases, watermarks. Traditional models tend to outperform on high-density pages (textbooks, research papers) as well as common document formats like tax forms.
Which is a bit confusing? Did they test that or what? It doesn't seem that way from their limited dataset.

banditelol 5 months ago

Has anyone tried comparing with a Qwen VL based model? I've heard good things about its OCR performance compared to other self-hostable models, but haven't really tried benchmarking it.

jimmySixDOF 5 months ago

Yes, I'd like to see this repeated with any of the small VLMs like IBM Granite or the HF Smols. Pretty much anything in the sub-7B range.

hasibzunair 4 months ago

What kind of VLMs are being used in OmniAI?

I fine-tuned a Llama 3.2 Vision on a small dataset I created for extracting text without heavy cropping. Results are simply amazing in comparison with OCR-based approaches. It can be tried here: https://news.ycombinator.com/item?id=43192417

westurner 5 months ago

Harmonic Loss converges more efficiently on MNIST OCR: https://github.com/KindXiaoming/grow-crystals .. "Harmonic Loss Trains Interpretable AI Models" (2025) https://news.ycombinator.com/item?id=42941954

danielcampos93 5 months ago

GPT-4o as a judge to evaluate the quality of something which GPT-4o is not inherently that good at. Red flag.

fzysingularity 5 months ago

What VLMs do you use when you're listing OmniAI - is this mostly wrapping the model providers like your zerox repo?

jll29 5 months ago

Does anyone have good experience with a particular pipeline for OCR-ing C source code?

default_ 4 months ago

Wondering why Florence-2 is not on the list of models?