Puppy Linux Discussion Forum Puppy home page: puppylinux.com
All times are UTC - 4
disciple
Joined: 20 May 2006 Posts: 3808 Location: Auckland, New Zealand
Posted: Tue 23 Sep 2008, 02:00
Post subject: tesseract-ocr optical character recognition (subtitle: MUCH more accurate than gocr)
Tesseract is the most accurate open-source OCR engine, but it has no layout analysis. If you intend to scan pages with parallel columns, you should use Ocropus (here), which uses the Tesseract engine.
If you use Tesseract on its own, you will have to remove all the unnecessary line breaks from its output, but the actual character recognition is better than with Ocropus.
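Removing the unnecessary line breaks can be scripted. Here is a minimal Python sketch, not part of Tesseract. It assumes that a blank line marks a real paragraph break and that every other newline is just a wrapped line; the function name `unwrap_paragraphs` is made up for illustration.

```python
import re


def unwrap_paragraphs(text):
    """Join hard-wrapped lines; keep blank lines as paragraph breaks."""
    # Assumption: one or more blank lines separate real paragraphs.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Inside a paragraph, collapse every run of whitespace (including
    # newlines) into a single space.
    joined = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(joined)


if __name__ == "__main__":
    sample = "This is a line\nthat was wrapped.\n\nSecond paragraph\nhere."
    print(unwrap_paragraphs(sample))
```

Words that were hyphenated across a line end are not rejoined by this sketch, so those still need a manual pass.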
In my tests Tesseract was almost 100% accurate, except that it missed a few spaces, and a few symmetrical apostrophes ' were turned into left-hand single quotes ‘.
This is much better than the OCR engine included in Microsoft Office, and MUCH better than gocr.
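The apostrophe problem is easy to patch up afterwards. A small Python sketch, assuming you are happy for every typographic single quote in the output to become a plain ASCII apostrophe (the name `straighten_single_quotes` is just for illustration):

```python
# Map typographic single quotes to the plain ASCII apostrophe.
QUOTE_MAP = str.maketrans({"\u2018": "'", "\u2019": "'"})


def straighten_single_quotes(text):
    """Replace left and right single quotation marks with '."""
    return text.translate(QUOTE_MAP)


if __name__ == "__main__":
    print(straighten_single_quotes("It\u2018s the dog\u2019s bone"))
```

This also flattens any genuine opening quotes in the original, so it only suits texts where that does not matter.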
1. Install from here (512 kb)
2. Copy everything from /local to /usr/local, then you can delete /local (I made a mistake packaging it... I'll package version 3 sometime and get it right).
3. Download your language file (English, French, German, Dutch, Spanish, Italian, Portuguese, Vietnamese or Old German) from http://code.google.com/p/tesseract-ocr/downloads/list (900 to 1400 kb)
YOU WANT THE "LANGUAGE DATA" file, not the "SOURCE TRAINING DATA" file. If you need a different language, you will have to make the data files yourself.
4. Extract the language file into /usr/local/share (so the files should end up in /usr/local/share/tessdata).
5. Run with e.g.
Code:
tesseract /path/scan.tif /path/output_file
or pipe it to something with a spellcheck just in case
It will automatically append a .txt extension to the output.
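If you want to call it from a script, the command above can be wrapped like this. It is only a sketch: it assumes `tesseract` is on your PATH, and the helper names are invented for illustration.

```python
import subprocess
from pathlib import Path


def tesseract_command(image, output_base):
    """Build the argument list for: tesseract <image> <output_base>."""
    return ["tesseract", str(image), str(output_base)]


def expected_text_file(output_base):
    """Tesseract appends .txt to the output name it is given."""
    return Path(str(output_base) + ".txt")


def run_ocr(image, output_base):
    """Run tesseract and return the path of the text file it should write."""
    subprocess.run(tesseract_command(image, output_base), check=True)
    return expected_text_file(output_base)
```

For example, `run_ocr("/path/scan.tif", "/path/output_file")` should leave the text in `/path/output_file.txt`, which you can then open in an editor with a spellchecker.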
It ONLY works with uncompressed and G3-compressed TIFFs, because I disabled libtiff support to avoid a bug that the developers tell me will be fixed in the next version. XnView, nconvert, ImageMagick's convert, and probably other tools can make these. I'm guessing XSane can too. The GIMP can't (or at least couldn't).
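To check a file before feeding it to Tesseract, you can read the Compression tag straight out of the TIFF header. This Python sketch follows the baseline TIFF layout (tag 259, where 1 means uncompressed and 3 means CCITT Group 3). Treating exactly those two values as "supported" is my reading of the restriction above, not something taken from the Tesseract source, and only the first image in the file is examined.

```python
import struct

# Values of the TIFF Compression tag (tag 259) that matter here.
UNCOMPRESSED = 1
CCITT_GROUP3 = 3


def tiff_compression(data):
    """Return the Compression tag value of the first image in a TIFF."""
    if data[:2] == b"II":
        endian = "<"  # little-endian file
    elif data[:2] == b"MM":
        endian = ">"  # big-endian file
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file")
    (count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    for i in range(count):
        start = ifd_offset + 2 + 12 * i  # each directory entry is 12 bytes
        tag, _type, _n = struct.unpack(endian + "HHI", data[start:start + 8])
        if tag == 259:
            # A single SHORT sits in the first 2 bytes of the value field.
            (value,) = struct.unpack(endian + "H", data[start + 8:start + 10])
            return value
    return UNCOMPRESSED  # the TIFF default when the tag is absent


def looks_supported(path):
    """True if the file is uncompressed or Group 3 compressed (assumption)."""
    with open(path, "rb") as f:
        return tiff_compression(f.read()) in (UNCOMPRESSED, CCITT_GROUP3)
```

If `looks_supported` returns False, resave the image uncompressed with one of the tools listed above.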
FYI I had compile problems with 2.03, so we're waiting for 2.04
_________________ Probably posting from Colinux (if I'm at work)
Root forever! (Link courtesy of Nathan F)
WhoDo

Joined: 11 Jul 2006 Posts: 4181 Location: Lake Macquarie NSW Australia
Posted: Tue 23 Sep 2008, 05:03
Post subject: Re: tesseract-ocr optical character recognition (subtitle: MUCH more accurate than gocr)
disciple wrote: I compiled tesseract 2.01, and can upload a package if someone wants to host it (It's over 11MB, and I'm sick of trying to find decent free file hosts).
Why not get Tom or Will to upload it at puppylinux.org, or PM caneri to host it at puppylinux.ca ... either way.
Both locations have plenty of free space and don't charge for downloads in .pet or .pup formats. No need to bother with the adware hosts for trusted Puppy developers/compilers like yourself these days.
_________________ Actions speak louder than words ... and they usually work when words don't!
SIP:whodo@proxy01.sipphone.com; whodo@realsip.com
HairyWill

Joined: 26 May 2006 Posts: 2940 Location: Southampton, UK
Posted: Tue 23 Sep 2008, 07:20
Puppylinux.org doesn't host packages at the moment; I think we decided this to ensure that the site would not run out of transfer quota.
I'm sure Caneri can help.
_________________ Will
contribute: community website, screenshots, puplets, wiki, rss
disciple
Joined: 20 May 2006 Posts: 3808 Location: Auckland, New Zealand
Posted: Tue 23 Sep 2008, 07:37
OK, we'll see about that.
Oops. That was a pretty bad typo. I meant "over 1MB"
Dingo

Joined: 11 Dec 2007 Posts: 923
Posted: Tue 23 Sep 2008, 13:23
Post subject: Re: tesseract-ocr optical character recognition (subtitle: MUCH more accurate than gocr)
disciple wrote: I compiled tesseract 2.01, and can upload a package if someone wants to host it (It's over 1MB, and I'm sick of trying to find decent free file hosts).
http://www.filefront.com/
_________________ OpenOffice for Puppy Linux - puppy linux packages wiki
lluamco
Joined: 16 Mar 2007 Posts: 142 Location: Banyoles, Spain
Posted: Wed 24 Sep 2008, 03:59
Post subject: Re: tesseract-ocr optical character recognition (subtitle: MUCH more accurate than gocr)
disciple wrote: I compiled tesseract 2.01, and can upload a package if someone wants to host it (It's over 1MB, and I'm sick of trying to find decent free file hosts).
Hello disciple.
MU is always kind enough to host large files. Please read
http://www.murga-linux.com/puppy/viewtopic.php?p=99400#99400
to find out how to proceed.
Cheers,
Lluis
disciple
Joined: 20 May 2006 Posts: 3808 Location: Auckland, New Zealand
Posted: Thu 25 Sep 2008, 08:25
Post subject: tesseract-ocr optical character recognition (subtitle: MUCH more accurate than gocr)
OK, I uploaded it and updated the first post.
It turned out I COULD get it under 1MB, but not OCRopus (see link), so thanks Caneri.
Dingo

Joined: 11 Dec 2007 Posts: 923
Posted: Thu 25 Sep 2008, 08:29
Thanks. I have linked both topics and mirrored them on dokupuppy:
http://puppylover.netsons.org/dokupuppy/programs:ocr
disciple
Joined: 20 May 2006 Posts: 3808 Location: Auckland, New Zealand
Posted: Thu 25 Sep 2008, 08:30
BTW I was mistaken. Ocropus does not have a GUI, but does have a complex set of Lua scripts.
disciple
Joined: 20 May 2006 Posts: 3808 Location: Auckland, New Zealand
Posted: Sat 28 Feb 2009, 19:51
Here are some OCR proofing aids that should be useful; probably more so if you are doing a lot of ocr:
http://gutcheck.sourceforge.net/
Quote: Gutcheck is a plain-text checking program that specializes in reporting the problems that spellcheckers don't--errors like mismatched quotes, misplaced punctuation, unintended blank lines. It is specifically tuned for checking texts for submission to Project Gutenberg, though I hope it can be useful elsewhere as well.
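To give an idea of what a "mismatched quotes" check involves, here is a toy Python sketch. It is not Gutcheck's code or method; it simply assumes paragraphs are separated by blank lines and flags any paragraph containing an odd number of straight double quotes.

```python
def paragraphs_with_unbalanced_quotes(text):
    """Return 1-based numbers of paragraphs holding an odd count of '"'."""
    flagged = []
    # Assumption: a blank line separates paragraphs.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    for number, paragraph in enumerate(paragraphs, start=1):
        if paragraph.count('"') % 2 == 1:
            flagged.append(number)
    return flagged


if __name__ == "__main__":
    sample = 'He said "hello".\n\nShe said "goodbye.\n\nFine.'
    print(paragraphs_with_unbalanced_quotes(sample))
```

Dialogue that deliberately leaves a quote open across paragraphs would be flagged too, so treat the result as a list of places to look rather than a list of errors.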
Quote: The common OCR error of mistaking a "b" for a "h" and vice versa used to lead to horrible things with the words "he" and "be". With the vast improvement in OCR programs in the last few years, this is not the nightmare it used to be.
jeebies detects common he/be errors by a simple lookup table. I really need to add some extra intelligence; I have a set of heuristics that I used previously, and I will probably get the time to plug them in at some point. For now, it's quick and does have some value, especially in checking older texts. It needs its lookup table, which is in the files he.jee and be.jee
Quote: Gutspell: I made a very enthusiastic start on this, but I need a big dictionary with possible parts of speech listed for every word to do the next thing with it, and I never got around to doing that.
Now, it simply lists every word that isn't in its dictionary that occurs only once. Still, as a superfast check, it does still catch some typos. It has a bad habit of obsessing on one word sometimes, and reporting lots of instances. I must fix that one day. Its dictionary is the file gutspell.dic
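The check described in that last quote (words that are not in the dictionary and occur only once) is simple enough to sketch in Python. This is an illustration of the idea only, not Gutspell's code. The dictionary here is any collection of words you supply, and nothing is assumed about the format of gutspell.dic.

```python
import re
from collections import Counter


def rare_unknown_words(text, dictionary):
    """List words that occur exactly once and are not in the dictionary."""
    # Lower-case words, allowing internal apostrophes (e.g. "don't").
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    counts = Counter(words)
    known = {w.lower() for w in dictionary}
    return sorted(w for w, n in counts.items() if n == 1 and w not in known)


if __name__ == "__main__":
    known_words = {"the", "cat", "sat", "slept", "on"}
    print(rare_unknown_words("The cat sat. The cat slept on the rnat.", known_words))
```

An unknown word that appears more than once is skipped, since a repeated oddity is more likely a name or a real word missing from the dictionary than an OCR slip.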
If someone is keen, it would be worth getting Guiguts working, which is a Perl/tk gui for these tools and aspell/ispell. In spite of what the gutcheck site implies, Guiguts is not Windows-only. The trickier part would be packaging perl/tk for Puppy.
Attachments:
gutspell.zip (1.12 MB, downloaded 105 times)
jeebies.zip (563.26 KB, downloaded 102 times)
gutcheck.zip (35.44 KB, downloaded 102 times)
disciple
Joined: 20 May 2006 Posts: 3808 Location: Auckland, New Zealand
Posted: Sat 11 Apr 2009, 08:13
Post subject: unpaper - post-processing scanned and photocopied book pages (subtitle: Straighten pages and remove black edges)
This should also be useful before you do the OCR.
Unpaper is a tool for straightening pages and removing black edges, including in the middle, where you have photocopied an open book!
I haven't tested it, and it is at an early stage of development, but it certainly looks good.
You'll need to figure out how to convert your images to and from .pnm.
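One possible way to handle the .pnm round trip is ImageMagick's convert. The Python sketch below only builds and runs the three commands. It assumes `convert` and `unpaper` are on your PATH and that unpaper is called as `unpaper <input> <output>`; I have not checked that against unpaper 0.3. The intermediate file names are made up. The last step writes an uncompressed TIFF, since that is what the Tesseract package in the first post needs.

```python
import subprocess


def unpaper_pipeline(scan_tif, clean_tif, work_pnm="page.pnm", out_pnm="page-clean.pnm"):
    """Return the commands: TIFF to PNM, unpaper, PNM back to TIFF."""
    return [
        ["convert", scan_tif, work_pnm],
        # Assumption: unpaper takes an input file then an output file.
        ["unpaper", work_pnm, out_pnm],
        # Write the result uncompressed so Tesseract can read it.
        ["convert", out_pnm, "-compress", "None", clean_tif],
    ]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline(unpaper_pipeline("scan.tif", "clean.tif"))` would then leave a cleaned-up `clean.tif` ready for OCR.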
Attachments:
unpaper-0.3.pet (29.67 KB, downloaded 112 times)
unpaper_DOC-0.3.pet (490.46 KB, downloaded 112 times)
jrb
Joined: 11 Dec 2007 Posts: 657 Location: Smithers, BC, Canada
Posted: Thu 23 Apr 2009, 17:03
I have built ch-tesseract-2.01-OCR-en.sfs, an English version of tesseract. Tesseract_OCR is placed on the right-click menu. If you right-click on a .tif file it will produce a text file with the same name in a few seconds. However, it is very fussy about these .tif files. You may have to open them in mtpaint or another graphics program and resave them. Even the training files required this. After that, however, it seems to work very well.
I have also placed a menu item on the Documents menu which opens a text file with these same instructions.
Packages for other major languages are available and can be easily built.
Let me know how it works for you. J
disciple
Joined: 20 May 2006 Posts: 3808 Location: Auckland, New Zealand
Posted: Fri 24 Apr 2009, 02:42
To download that sfs use the username "puppy" and password "linux" - I had to fill it in several times for some reason (unless the last time I changed it and put a capital or something?).
Quote: However it is very fussy about these .tif files
That should change in 2.04 or 3, which were both expected to be out already... so they should be out soon
Dromeno
Joined: 12 Sep 2008 Posts: 186
Posted: Fri 24 Apr 2009, 04:46
Post subject: Scansoft Omnipage via wine in puppy (subtitle: not open source unfortunately)
OCR is one of those fields where Windows applications still outshine the Linux ones. But fortunately for us Puppy users, Scansoft Omnipage (my favorite) works via Wine. It even works as a 'portable' application (just copy the Omnipage files from C:\Program files to some external device).
disciple
Joined: 20 May 2006 Posts: 3808 Location: Auckland, New Zealand
Posted: Fri 24 Apr 2009, 20:13
So I gather that is better because it deals with layout? Tesseract is noticeably more accurate than most of the Windows products I've tried (some are as accurate, and I suspect that one would be); where it is lacking is layout analysis.
They say that produces perfectly formatted documents, but how editable are they really? I've never tried any software that produces output that is formatted to match the original well and is also readily editable - it tends to be like copying text from a pdf.