Thursday, January 11, 2007

OCR in Preview

Maybe everyone knows this already, but it was news to me. Today I opened a PDF document in Preview. It contained a URL that I wanted to open in my browser. Unfortunately, the document was a full-page bitmap of a scanned paper page. This was obvious just by looking at the document, but my mousing hand went on autopilot. I told my hand, "Err, this is not going to work. It's a bitmap of a scanned page." "Yeah, I know," replied my hand, "but I can't stop myself trying." Oddly, the text was selectable. I copied it. I pasted it. A few characters were wrong, but it was a neat idea. A better-quality scan would no doubt have helped. This:

http://www.apple.com/ilife/video/ilife04_32C.html

became this:

tttp: //www. apple. oom/ilife/videc/ilifeO432C.htm.l

Nice try. The URL was a 404 anyway.
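
For the curious, here is a rough way to put a number on "a few characters were wrong". It is only a sketch using Python's standard difflib module; the variable names are my own, and stripping the spaces is an assumption that happens to suit URLs, not something Preview does.

```python
import difflib

# The URL as printed on the page, and what came out of the copy and paste.
original = "http://www.apple.com/ilife/video/ilife04_32C.html"
ocr_text = "tttp: //www. apple. oom/ilife/videc/ilifeO432C.htm.l"

# Assumption: a URL has no literal spaces, so any space is an OCR artifact.
cleaned = ocr_text.replace(" ", "")

# SequenceMatcher.ratio() returns a similarity score between 0.0 and 1.0.
matcher = difflib.SequenceMatcher(None, original, cleaned)
ratio = matcher.ratio()

# Collect the spans where the two strings disagree.
differences = [
    (tag, original[i1:i2], cleaned[j1:j2])
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag != "equal"
]

print(f"similarity: {ratio:.2f}")
for tag, expected, got in differences:
    print(f"{tag}: expected {expected!r}, got {got!r}")
```

The score only says how close the strings are. It cannot tell you which of the two is right, so it is no substitute for a cleaner scan.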

7 comments:

  1. Were you actually able to do this? I was unable to recreate it, but if Automator could access this power, a new application could be on its way...

  2. It is quite possible the PDF had been created in Acrobat and processed by the Paper Capture plugin, which does OCR. The plugin overlays the text it finds on the bitmap it uses as the source.

  3. I confirm that Preview does that automatically. I tried it on multiple scanned PDF documents from 3 different scanners.
    I think this is a great feature!

  4. Also, if you search in Spotlight, you can get results from scanned text, with the search result highlighted. Amazing.

  5. How about from image files such as .tiff? Any ideas?

  6. Sorry, folks. This just isn't going to work. The PDF was already OCR'd with the text in place. Preview does not OCR automatically. My wife scans about 100 pages a week for school, without OCR. They have never been searchable, neither in Preview (Lion) nor in PDFExpert (iOS 5).

  7. I'm not sure what is doing the OCR (perhaps the Canon all-in-one scanner I'm using), but the scan gets sent to my Mac as a .pdf and I can then search the text very nicely in Preview.
