Puppy Linux Discussion Forum
Puppy HOME page : puppylinux.com
"THE" alternative forum : puppylinux.info
 

The time now is Mon 24 Nov 2014, 03:10
All times are UTC - 4
tesseract-ocr optical character recognition
disciple

Joined: 20 May 2006
Posts: 6455
Location: Auckland, New Zealand

Posted: Tue 23 Sep 2008, 02:00    Post subject: tesseract-ocr optical character recognition
Subject description: MUCH more accurate than gocr
 

Tesseract is the most accurate open source character recognition engine, but it has no layout analysis. If you intend to scan pages with parallel columns, you should use Ocropus (here), which uses the Tesseract engine.
If not, use Tesseract directly: you will have to get rid of all the unnecessary line breaks in its output, but the actual character recognition is better than with Ocropus.
In my tests tesseract was almost 100% accurate, except that it missed a few spaces, and a few symmetrical apostrophes ' were turned into left-hand single quotes ‘.
This is much better than the OCR engine included in Microsoft Office, and MUCH better than gocr :)
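
A minimal sketch of cleaning up those line breaks and quotes afterwards, assuming the paragraphs in the output are separated by blank lines. The function names are made up for illustration; only awk and sed are used.

```shell
# Join the hard line breaks inside each paragraph into spaces.
# Assumption: paragraphs are separated by one or more blank lines.
unwrap_paragraphs() {
    awk 'BEGIN { RS = ""; ORS = "\n\n" } { gsub(/\n/, " "); print }'
}

# Turn left-hand single quotes back into plain apostrophes.
straighten_quotes() {
    sed "s/‘/'/g"
}
```

Something like `unwrap_paragraphs < output_file.txt | straighten_quotes > clean.txt` would then do both passes. Be aware that it also joins line breaks you might have wanted to keep (verse, lists, tables), and it replaces every ‘ even where one really belongs.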

1. Install from here (512 kb)
2. Copy everything from /local to /usr/local, then you can delete /local (I made a mistake packaging it... I'll package version 3 sometime and get it right).
3. Download your language file (English, French, German, Dutch, Spanish, Italian, Portuguese, Vietnamese or Old German) from http://code.google.com/p/tesseract-ocr/downloads/list (900 to 1400 kb)
YOU WANT THE "LANGUAGE DATA" file, not the "SOURCE TRAINING DATA" file. If you need a different language, you'll have to make the data yourself :)
4. Extract the language file into /usr/local/share (so the files end up in /usr/local/share/tessdata).
5. Run with e.g.
Code:
tesseract /path/scan.tif /path/output_file

or pipe it to something with a spellcheck just in case :)
It will automatically append a .txt extension to the output file name.
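
A rough sketch of doing the OCR and a spellcheck pass in one go. The function names are hypothetical, the default language code eng is an assumption, and the aspell step only runs if aspell is installed.

```shell
# Strip the final extension, because tesseract adds .txt itself.
# Assumption: the file name has an extension (scan.tif, not scan).
output_base() {
    printf '%s\n' "${1%.*}"
}

# OCR one image, then list the words aspell does not recognise.
# $1 = image path, $2 = language code (assumed default: eng)
ocr_and_check() {
    base=$(output_base "$1")
    tesseract "$1" "$base" -l "${2:-eng}" || return 1
    if command -v aspell >/dev/null 2>&1; then
        aspell list < "$base.txt" | sort -u
    fi
}
```

Called as `ocr_and_check /path/scan.tif`, this should leave the text in /path/scan.txt and print the suspect words, which are a starting point for proofreading rather than a list of definite errors.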

It ONLY works with uncompressed and G3-compressed TIFFs: I disabled libtiff support because of a bug that I'm told will be fixed in the next version. XnView, nconvert, ImageMagick convert, and probably other tools can make these. I'm guessing XSane can too. The GIMP can't (or at least couldn't :) )
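
For example, with ImageMagick something along these lines should produce an uncompressed TIFF. This is only a sketch: the function names and the DRY_RUN switch are made up for illustration, and it assumes convert is on your PATH.

```shell
# Print the command instead of running it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Re-save any image ImageMagick can read as an uncompressed TIFF.
# $1 = input image, $2 = output .tif
to_plain_tiff() {
    run convert "$1" -compress None "$2"
}
```

So `to_plain_tiff scan.png scan.tif` followed by `tesseract scan.tif scan` would be the whole job. Set DRY_RUN=1 first if you only want to see the command.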

FYI I had compile problems with 2.03, so we're waiting for 2.04 :)

_________________
DEATH TO SPREADSHEETS
- - -
Classic Puppy quotes
- - -
Beware the demented serfers!

Last edited by disciple on Tue 12 Jan 2010, 08:54; edited 3 times in total
WhoDo


Joined: 11 Jul 2006
Posts: 4441
Location: Lake Macquarie NSW Australia

Posted: Tue 23 Sep 2008, 05:03    Post subject: Re: tesseract-ocr optical character recognition
 

disciple wrote:
I compiled tesseract 2.01, and can upload a package if someone wants to host it (It's over 11MB, and I'm sick of trying to find decent free file hosts).

Why not get Tom or Will to upload it at puppylinux.org, or PM caneri to host it at puppylinux.ca ... either way.

Both locations have plenty of free space and don't charge for downloads in .pet or .pup formats. No need to bother with the adware hosts for trusted Puppy developers/compilers like yourself these days.

_________________
Actions speak louder than words ... and they usually work when words don't!
SIP:whodo@proxy01.sipphone.com; whodo@realsip.com
HairyWill


Joined: 26 May 2006
Posts: 2949
Location: Southampton, UK

Posted: Tue 23 Sep 2008, 07:20

Puppylinux.org doesn't host packages at the moment; I think we did this to ensure that the site would not run out of transfer quota.
I'm sure Caneri can help.

_________________
Will
contribute: community website, screenshots, puplets, wiki, rss
disciple

Joined: 20 May 2006
Posts: 6455
Location: Auckland, New Zealand

Posted: Tue 23 Sep 2008, 07:37

OK, we'll see about that.
Oops. That was a pretty bad typo. I meant "over 1MB" :)

Dingo


Joined: 11 Dec 2007
Posts: 1423
Location: somewhere at the end of rainbow...

Posted: Tue 23 Sep 2008, 13:23    Post subject: Re: tesseract-ocr optical character recognition
 

disciple wrote:
I compiled tesseract 2.01, and can upload a package if someone wants to host it (It's over 1MB, and I'm sick of trying to find decent free file hosts).


http://www.filefront.com/

_________________
replace .co.cc with .info to get access to stuff I posted in forum
dropbox 2GB free
OpenOffice for Puppy Linux
lluamco

Joined: 16 Mar 2007
Posts: 207
Location: Banyoles, Spain

Posted: Wed 24 Sep 2008, 03:59    Post subject: Re: tesseract-ocr optical character recognition
 

disciple wrote:
I compiled tesseract 2.01, and can upload a package if someone wants to host it (It's over 1MB, and I'm sick of trying to find decent free file hosts).

Hello disciple.
MU is always very kind to host large files. Please read
http://www.murga-linux.com/puppy/viewtopic.php?p=99400#99400
to know how to proceed.
Cheers,
Lluis
disciple

Joined: 20 May 2006
Posts: 6455
Location: Auckland, New Zealand

Posted: Thu 25 Sep 2008, 08:25    Post subject: tesseract-ocr optical character recognition
 

OK I uploaded it and updated the first post.
It turned out I COULD get it under 1MB, but not OCRopus (see link), so thanks Caneri :)

Dingo


Joined: 11 Dec 2007
Posts: 1423
Location: somewhere at the end of rainbow...

Posted: Thu 25 Sep 2008, 08:29

Thanks. I linked both topics and mirrored them on dokupuppy:

http://puppylover.netsons.org/dokupuppy/programs:ocr

disciple

Joined: 20 May 2006
Posts: 6455
Location: Auckland, New Zealand

Posted: Thu 25 Sep 2008, 08:30

BTW I was mistaken. Ocropus does not have a GUI, but does have a complex set of Lua scripts :)
disciple

Joined: 20 May 2006
Posts: 6455
Location: Auckland, New Zealand

Posted: Sat 28 Feb 2009, 19:51

Here are some OCR proofing aids that should be useful, probably more so if you are doing a lot of OCR:

http://gutcheck.sourceforge.net/

Quote:
Gutcheck is a plain-text checking program that specializes in reporting the problems that spellcheckers don't--errors like mismatched quotes, misplaced punctuation, unintended blank lines. It is specifically tuned for checking texts for submission to Project Gutenberg, though I hope it can be useful elsewhere as well.


Quote:
The common OCR error of mistaking a "b" for a "h" and vice versa used to lead to horrible things with the words "he" and "be". With the vast improvement in OCR programs in the last few years, this is not the nightmare it used to be.

jeebies detects common he/be errors by a simple lookup table. I really need to add some extra intelligence; I have a set of heuristics that I used previously, and I will probably get the time to plug them in at some point. For now, it's quick and does have some value, especially in checking older texts. It needs its lookup table, which is in the files he.jee and be.jee


Quote:
Gutspell: I made a very enthusiastic start on this, but I need a big dictionary with possible parts of speech listed for every word to do the next thing with it, and I never got around to doing that.

Now, it simply lists every word that isn't in its dictionary that occurs only once. Still, as a superfast check, it does still catch some typos. It has a bad habit of obsessing on one word sometimes, and reporting lots of instances. I must fix that one day. Its dictionary is the file gutspell.dic


If someone is keen, it would be worth getting Guiguts working, which is a Perl/Tk GUI for these tools and aspell/ispell. In spite of what the gutcheck site implies, Guiguts is not Windows-only. The trickier part would be packaging Perl/Tk for Puppy.
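
To give an idea of the Gutspell check quoted above (words that are not in the dictionary and occur only once), here is a rough equivalent with standard tools. This is not how gutspell itself is implemented; the function name and the word-list file are hypothetical, and the word list is assumed to hold one lowercase word per line.

```shell
# List words that occur exactly once in the text on stdin
# and are not in the word list given as $1.
# $1 = word list, one lowercase word per line (hypothetical file)
rare_unknown_words() {
    tr -cs 'A-Za-z' '\n' |
        tr 'A-Z' 'a-z' |
        sed '/^$/d' |
        sort | uniq -u |
        grep -vxFf "$1"
}
```

Run it as `rare_unknown_words words.txt < output_file.txt`. It only knows plain ASCII letters, so contractions and accented words get split into pieces.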
Attachments:
gutspell.zip (1.12 MB, downloaded 793 times)
jeebies.zip (563.26 KB, downloaded 853 times)
gutcheck.zip (35.44 KB, downloaded 789 times)

disciple

Joined: 20 May 2006
Posts: 6455
Location: Auckland, New Zealand

Posted: Sat 11 Apr 2009, 08:13    Post subject: unpaper - post-processing scanned and photocopied book pages
Subject description: Straighten pages and remove black edges
 

This should also be useful before you do the OCR.

Unpaper is a tool for straightening pages and removing black edges, including in the middle, where you have photocopied an open book!

I haven't tested it, and it is at an early stage of development, but it certainly looks good :)

You'll need to figure out how to convert your images to and from .pnm.
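
One way the round trip through .pnm might look, using ImageMagick for the conversions. This is an untested sketch: the function names and the DRY_RUN switch are made up, and it assumes unpaper accepts an input file followed by an output file with its default settings.

```shell
# Print the command instead of running it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Convert a scan to .pnm, clean it with unpaper, and write an
# uncompressed TIFF that tesseract can read.
# $1 = scanned page in any format ImageMagick reads
clean_page() {
    base=${1%.*}
    run convert "$1" "$base.pnm" &&
    run unpaper "$base.pnm" "$base-clean.pnm" &&
    run convert "$base-clean.pnm" -compress None "$base-clean.tif"
}
```

With DRY_RUN=1 set, `clean_page page1.tif` just prints the three commands, which is a safe way to check the file names before letting it loose on real scans.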
Attachments:
unpaper-0.3.pet (29.67 KB, downloaded 838 times)
unpaper_DOC-0.3.pet (490.46 KB, downloaded 829 times)

jrb


Joined: 11 Dec 2007
Posts: 1040
Location: Smithers, BC, Canada

Posted: Thu 23 Apr 2009, 17:03

I have built ch-tesseract-2.01-OCR-en.sfs, an English version of tesseract. Tesseract_OCR is placed on the right-click menu. If you right-click on a .tif file it will produce a text file with the same name in a few seconds. However, it is very fussy about these .tif files. You may have to open them in mtpaint or another graphics program and resave them. Even the training files required this. After that, however, it seems to work very well.

I have also placed a menu item on the Documents menu which opens a text file with these same instructions.

Packages for other major languages are available and can be easily built.

Let me know how it works for you. J
disciple

Joined: 20 May 2006
Posts: 6455
Location: Auckland, New Zealand

Posted: Fri 24 Apr 2009, 02:42

To download that sfs use the username "puppy" and password "linux" - I had to fill it in several times for some reason (unless the last time I changed it and put a capital or something?).

Quote:
However it is very fussy about these .tif files

That should change in 2.04 or 3, which were both expected to be out already... so they should be out soon :)

Dromeno

Joined: 12 Sep 2008
Posts: 538

Posted: Fri 24 Apr 2009, 04:46    Post subject: Scansoft Omnipage via wine in puppy
Subject description: not open source unfortunately
 

OCR is one of those fields where Windows applications still outshine the Linux ones. But fortunately for us Puppy users, Scansoft Omnipage (my favorite) works via Wine. And it even works as 'portable' (just copy the Omnipage files from C:\Program Files to some external device).
disciple

Joined: 20 May 2006
Posts: 6455
Location: Auckland, New Zealand

Posted: Fri 24 Apr 2009, 20:13

So I gather that is better because it deals with layout? Tesseract is noticeably more accurate than any of the Windows products I've tried (some products are as accurate, and I suspect that one would be); where it is lacking is layout analysis.
They say it produces perfectly formatted documents, but how editable are they really? I've never tried any software that produces output that is formatted to match the original well and is also readily editable - it tends to be like copying text from a PDF.
