Puppy Linux Discussion Forum
Puppy home page: puppylinux.com
 
All times are UTC - 4
 Forum index » Advanced Topics » Additional Software (PETs, n' stuff)
OCRopus 0.2 optical character recognition + layout analysis
disciple

Joined: 20 May 2006
Posts: 3808
Location: Auckland, New Zealand

Posted: Thu 25 Sep 2008, 08:00    Subject: OCRopus 0.2 optical character recognition + layout analysis
Subtitle: Uses tesseract engine. Much better than gocr
 

This is an OCR program that uses the Tesseract engine but adds layout analysis. In my tests, "layout analysis" seems to mean only that it recognizes columns of text, reads each of them, and then strings them all together, including headers, footers etc. It is noticeably less accurate than Tesseract on its own, and probably only a little better than the OCR engine in Microsoft Office. That is still much better than gocr :)
I recommend using Tesseract itself (here) unless you intend to scan pages with parallel columns.
It produces an HTML file that copies and pastes nicely from a browser into a word processor (probably not from Dillo, as I don't think it supports UTF-8).
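
If you only want the plain text and would rather skip the browser, a rough tag stripper may be enough. This is just a sketch: `html_to_text` is a name I made up, it only uses standard `sed`, and it is naive (it will not decode entities such as `&amp;` or handle a tag that is split across two lines).

```shell
#!/bin/sh
# Sketch only: html_to_text is a hypothetical helper, not part of OCRopus.

# Print the text of an HTML file (or stdin) with the markup removed.
html_to_text() {
    # 1st expression: delete anything that looks like <a tag>.
    # 2nd expression: drop lines that are empty once the tags are gone.
    sed -e 's/<[^>]*>//g' -e '/^[[:space:]]*$/d' "$@"
}
```

Usage would be something like `html_to_text /other_path/scan.html > /other_path/scan.txt`.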

1. Install from here (1121kb).
2. Download your language file (English, French, German, Dutch, Spanish, Italian, Portuguese, Vietnamese or Old German) from http://code.google.com/p/tesseract-ocr/downloads/list (900 to 1400 kb).
You want the "LANGUAGE DATA" file, not the "SOURCE TRAINING DATA" file. If you need a different language, you'll have to make the data yourself ;)
3. Extract the language file into /usr/local/share (the files should end up in /usr/local/share/tessdata).
4. Run with e.g.
Quote:
ocroscript rec-tess /path/some_scan.png > /other_path/scan.html
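
Steps 3 and 4 could be scripted roughly like this. It is only a sketch under a few assumptions: the language download is taken to be a gzipped tarball with a tessdata/ directory inside, and the helper names `install_lang` and `ocr_dir` are made up for illustration. The only OCRopus command used is the `ocroscript rec-tess` call from step 4.

```shell
#!/bin/sh
# Sketch only: install_lang and ocr_dir are hypothetical helper names.

# Step 3: unpack a language archive so its files land in <prefix>/tessdata.
# Assumes the download is a .tar.gz that contains a tessdata/ directory.
install_lang() {
    archive=$1
    prefix=${2:-/usr/local/share}
    tar -xzf "$archive" -C "$prefix" || return 1
    # Sanity check: the files should have ended up in tessdata/.
    [ -d "$prefix/tessdata" ]
}

# Step 4: OCR every PNG in a directory, writing one HTML file per scan.
ocr_dir() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for scan in "$in_dir"/*.png; do
        [ -e "$scan" ] || continue      # no PNGs: the glob stayed literal
        name=$(basename "$scan" .png)
        ocroscript rec-tess "$scan" > "$out_dir/$name.html"
    done
}
```

For example: `install_lang /path/language_data.tar.gz && ocr_dir /path/scans /other_path/html`, where the archive name is a placeholder for whatever file you downloaded in step 2.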


N.B. It doesn't work with TIFFs, because I disabled libtiff support in Tesseract on account of a bug that they tell me will be fixed in the next version. Convert your scans to another format first.
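
One way to do the conversion, assuming ImageMagick's `convert` is installed (it is not part of this package, and any other converter would do just as well). `tiffs_to_png` is a made-up helper name.

```shell
#!/bin/sh
# Sketch only: tiffs_to_png is a hypothetical helper; it assumes
# ImageMagick's `convert` is on the PATH.

# Convert every .tif/.tiff in a directory to a PNG next to the original.
tiffs_to_png() {
    dir=$1
    for tif in "$dir"/*.tif "$dir"/*.tiff; do
        [ -e "$tif" ] || continue       # skip glob patterns that matched nothing
        png=${tif%.*}.png               # scan.tif -> scan.png
        convert "$tif" "$png" || return 1
    done
}
```

After `tiffs_to_png /path/scans` the resulting PNGs can be fed to ocroscript as in step 4. Multi-page TIFFs will probably come out as several numbered PNGs, so check the output names.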

Ocropus can also be compiled against a language modelling program and a program for making vector images of diagrams in a scan. I didn't look hard, but there doesn't seem to be a ready-to-go way to use these (or aspell, which I think I did compile against), so I didn't bother.

There were also two HUGE files produced by the install that I didn't include for the same reason - a US dictionary and a file for neural network modelling.

BTW unlike Tesseract, I think OCRopus converts images to black-and-white, so there is no advantage in scanning in colour.

_________________
Probably posting from Colinux (if I'm at work)
Root forever! (Link courtesy of Nathan F)
disciple

Joined: 20 May 2006
Posts: 3808
Location: Auckland, New Zealand

Posted: Sat 11 Apr 2009, 08:19    Subject: Extra OCR related tools

Also check out the extra tools I posted in the Tesseract thread.
http://www.murga-linux.com/puppy/viewtopic.php?p=279332#279332
and the following post.
