Puppy Linux Discussion Forum Puppy home page: puppylinux.com
All times are UTC - 4
disciple
Joined: 20 May 2006 Posts: 3808 Location: Auckland, New Zealand
Posted: Thu 25 Sep 2008, 08:00
Subject: OCRopus 0.2 optical character recognition + layout analysis
Subtitle: Uses tesseract engine. Much better than gocr
This is an OCR program that uses the Tesseract engine but adds layout analysis. In my tests, "layout analysis" seems to mean only that it recognizes columns of text, reads each of them and then strings them all together, including headers, footers etc. It is noticeably less accurate than Tesseract on its own, and probably only a little better than the OCR engine in Microsoft Office. That is still much better than gocr.
I recommend using Tesseract itself (here) unless you intend to scan pages with parallel columns.
It produces an HTML file that copies and pastes nicely from a browser into a word processor (probably not from Dillo, as I don't think it has UTF-8 support).
1. Install from here (1121kb).
2. Download your language file (English, French, German, Dutch, Spanish, Italian, Portuguese, Vietnamese or Old German) from http://code.google.com/p/tesseract-ocr/downloads/list (900 to 1400 kb).
YOU WANT THE "LANGUAGE DATA" file, not the "SOURCE TRAINING DATA" file. If you need a different language, you will have to make the language data yourself.
3. Extract the language file into /usr/local/share (so that the files end up in /usr/local/share/tessdata).
4. Run with e.g.
    ocroscript rec-tess /path/some_scan.png > /other_path/scan.html
N.B. This doesn't work with TIFFs, as I disabled libtiff support in tesseract because of a bug that they tell me will be fixed in the next version. Convert TIFF scans to another format (e.g. PNG) first.
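If you have a whole folder of scans, steps 3 and 4 plus the TIFF workaround can be wrapped up roughly like this. It is only a sketch: the function names and the TESSDATA_DIR variable are made up for illustration, and it assumes ImageMagick's `convert` is installed for the TIFF-to-PNG step (any other converter would do). The only command taken from the post above is `ocroscript rec-tess`.

```shell
#!/bin/sh
# Sketch: OCR every scan in a directory with ocroscript, converting TIFFs first.
# Assumptions (not from the original post): ImageMagick's `convert` is available,
# and all function/variable names here are hypothetical.

TESSDATA_DIR="${TESSDATA_DIR:-/usr/local/share/tessdata}"

# Succeeds only if the tessdata directory exists and is not empty (step 3).
have_language_data() {
    [ -d "$TESSDATA_DIR" ] && [ -n "$(ls -A "$TESSDATA_DIR" 2>/dev/null)" ]
}

# Print the HTML output path for a scan: same base name, .html extension.
# $1 = scan path, $2 = output directory
html_name_for() {
    base=$(basename "$1")
    printf '%s/%s.html\n' "$2" "${base%.*}"
}

# OCR one scan into the output directory (step 4).
# TIFFs are converted to PNG first, because this build cannot read them.
# $1 = scan path, $2 = output directory
ocr_scan() {
    scan=$1
    out_dir=$2
    case "$scan" in
        *.tif|*.tiff|*.TIF|*.TIFF)
            png="$out_dir/$(basename "${scan%.*}").png"
            convert "$scan" "$png" || return 1
            scan=$png
            ;;
    esac
    ocroscript rec-tess "$scan" > "$(html_name_for "$scan" "$out_dir")"
}

# OCR every regular file in a directory.
# $1 = directory of scans, $2 = output directory (created if missing)
ocr_dir() {
    if ! have_language_data; then
        echo "no language data found in $TESSDATA_DIR" >&2
        return 1
    fi
    mkdir -p "$2" || return 1
    for f in "$1"/*; do
        if [ -f "$f" ]; then
            ocr_scan "$f" "$2" || return 1
        fi
    done
}
```

Source the file and run e.g. `ocr_dir /path/scans /other_path/html`. Each scan ends up as an HTML file with the same base name in the output directory, and any converted PNGs are left there too. Use a separate output directory so the results don't get picked up as scans on the next run.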
Ocropus can also be compiled against a language modelling program and a program for making vector images of diagrams in a scan. I didn't look hard, but there doesn't seem to be a ready-to-go way to use these (or aspell, which I think I did compile against), so I didn't bother.
There were also two HUGE files produced by the install that I didn't include for the same reason - a US dictionary and a file for neural network modelling.
BTW unlike tesseract, I think ocropus converts to black-and-white, so there is no advantage in colour images.
_________________ Probably posting from Colinux (if I'm at work)
Root forever! (Link courtesy of Nathan F)
disciple
Joined: 20 May 2006 Posts: 3808 Location: Auckland, New Zealand
Posted: Sat 11 Apr 2009, 08:19
Subject: Extra OCR related tools
Also check out the extra tools I posted in the Tesseract thread, in this post and the one following it:
http://www.murga-linux.com/puppy/viewtopic.php?p=279332#279332