Puppy Linux Discussion Forum
PDF OCR on Precise 5.7.1 [solved]
Moderators: Flash, Ian, JohnMurga
Saladin

Joined: 27 Aug 2011
Posts: 87

Posted: Mon 30 May 2016, 20:37    Post subject: PDF OCR on Precise 5.7.1 [solved]

I have some large scanned PDFs (100-400 MB) that I'd like to convert to text. They're simple black text on white pages, all English, very little font variation. Online tools don't want to work with files that large, and none of the offline tools I've tried so far has worked out of the box.

Adobe will apparently do this for me for just a few dollars. I'd rather have a permanent solution, but I also don't want to spend hours on something I could have done in ten minutes. So is there a way to do PDF OCR in Precise 5.7.1 that doesn't require me to install multiple apps, scripts, dependencies and language packs?

Last edited by Saladin on Tue 31 May 2016, 10:11; edited 1 time in total
rcrsn51


Joined: 05 Sep 2006
Posts: 12627
Location: Stratford, Ontario

Posted: Mon 30 May 2016, 22:56

Are these PDFs of text or PDFs of pictures of text?

For #1, PeasyPDF can extract to text.

For #2, PeasyPDF can extract the graphics. Then you could try pic2txt and Tesseract OCR.

To do extractions in Precise, you may need the Ghostscript upgrade as described in the notes.
Saladin

Joined: 27 Aug 2011
Posts: 87

Posted: Tue 31 May 2016, 09:07

Sorry I've taken a while to respond. I've been working with this.

They are images of text, by the way, not actual text. PeasyPDF works well for extracting the images. pic2txt works well for extracting the text. The problem is that pic2txt is a GUI tool -- it works great for a handful of pages, but I have over 500, and I don't want to drag all of them into the window. Is there a way to automate this?
rcrsn51


Joined: 05 Sep 2006
Posts: 12627
Location: Stratford, Ontario

Posted: Tue 31 May 2016, 09:59

Here is a stripped-down version of pic2txt that runs from the command line.

1. Download and unpack it.

2. Put the script somewhere like /usr/bin.

3. Collect the image files in a folder.

4. Run: pic2txt-batch name_of_image_folder
Attachment: pic2txt-batch.tar.gz (566 Bytes)
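
For anyone following along, steps 1 and 2 above can be sketched roughly as below. This is only an illustration: the function name install_from_tarball is made up, and it assumes the archive contains a single script whose name you pass in (the thread does not say what the file inside the archive is called).

```shell
#!/bin/sh
# Sketch only: unpack a tarball and install one script from it.
# install_from_tarball is a hypothetical helper; the script name inside
# the archive is an assumption, not something the thread confirms.
install_from_tarball() {
    archive=$1      # e.g. pic2txt-batch.tar.gz
    script=$2       # assumed name of the script inside the archive
    dest=$3         # e.g. /usr/bin
    tmp=$(mktemp -d) || return 1
    # Step 1: unpack into a scratch directory.
    tar -xzf "$archive" -C "$tmp" || { rm -rf "$tmp"; return 1; }
    # Step 2: locate the script and copy it into place.
    found=$(find "$tmp" -type f -name "$script" | head -n 1)
    if [ -z "$found" ]; then
        echo "no file named $script in $archive" >&2
        rm -rf "$tmp"
        return 1
    fi
    cp "$found" "$dest/$script" && chmod 755 "$dest/$script"
    status=$?
    rm -rf "$tmp"
    return $status
}
```

Something like `install_from_tarball pic2txt-batch.tar.gz pic2txt-batch /usr/bin` would then cover steps 1 and 2, after which step 4 is run exactly as written: pic2txt-batch name_of_image_folder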

Last edited by rcrsn51 on Tue 31 May 2016, 10:37; edited 1 time in total
saladinsmith

Joined: 29 Nov 2013
Posts: 5

Posted: Tue 31 May 2016, 10:10

Thanks! That works great.

This produces a text file for each image. Just in case someone else is doing this later on and doesn't know how to combine them, run this from the command line:

Code:
cat *.txt > output.txt
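
One caveat with the plain glob: it sorts names as text, so a file like page-10.txt comes before page-2.txt. If your image names are not zero-padded, a variant along these lines keeps the pages in numeric order. It is a sketch, not a tested recipe for every setup: combine_pages is a made-up name, and it assumes GNU sort (for the -V option) and file names without newlines.

```shell
#!/bin/sh
# Sketch only: concatenate per-page text files in natural (numeric) order.
# Assumes GNU sort (-V) and file names that contain no newlines.
combine_pages() {
    dir=$1    # folder holding the per-image .txt files
    out=$2    # combined file; keep it outside "$dir" or give it another extension
    for f in "$dir"/*.txt; do
        [ -f "$f" ] && printf '%s\n' "$f"
    done | sort -V | while IFS= read -r f; do
        cat "$f"
    done > "$out"
}
```

For example, `combine_pages name_of_image_folder /root/output.txt` writes the combined text outside the image folder, so a second run does not pick up the previous output as an input.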
rcrsn51


Joined: 05 Sep 2006
Posts: 12627
Location: Stratford, Ontario

Posted: Tue 31 May 2016, 10:21

Excellent. If you look at the script, you will see how it turns JPEGs into TIFFs before sending them to tesseract.

Your batch procedure would work faster if PeasyPDF could also extract images to TIFF.

I may add that option.
Dorothée


Joined: 27 Nov 2012
Posts: 253

Posted: Thu 29 Nov 2018, 04:39

Hello rcrsn51,

Since I know you understand French, I am writing in that language because my English is terrible.

How would one choose a language before running pic2txt-batch? It is possible with the GUI, but that only handles one image at a time.

Would it really be complicated to make a GUI that processes a whole folder while still offering a choice of language?

Thank you in advance,

ciaozinho,

PS: you can answer me in English....

_________________

PIPOCA_Z_v3 + zdrv_v2
https://drive.google.com/open?id=0B_YYahskVg4qR3BIVG9kbFlqeUE
rcrsn51


Joined: 05 Sep 2006
Posts: 12627
Location: Stratford, Ontario

Posted: Thu 29 Nov 2018, 08:48

Line 28 of the script is
Code:
tesseract /tmp/out.tif "$BF" -l eng

Change "eng" to "fra".
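
If you switch languages often, the language could be passed as an argument instead of editing the script each time. The following is a rough sketch of that idea, not the actual pic2txt-batch script: it assumes ImageMagick's convert does the JPEG-to-TIFF step (the real script may use something else), and the names run, ocr_folder and OCR_DRYRUN are invented for illustration.

```shell
#!/bin/sh
# Sketch only, not the actual pic2txt-batch script.
# Assumptions: ImageMagick's "convert" does the JPEG-to-TIFF step, and
# OCR_DRYRUN=1 (a hypothetical switch) prints commands instead of running them.
run() {
    if [ "${OCR_DRYRUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

ocr_folder() {
    dir=$1
    lang=${2:-eng}          # tesseract language code, e.g. eng or fra
    for img in "$dir"/*.jpg; do
        [ -f "$img" ] || continue
        base=${img%.jpg}    # tesseract appends .txt to this output base
        run convert "$img" /tmp/out.tif || return 1
        run tesseract /tmp/out.tif "$base" -l "$lang" || return 1
    done
}
```

With this, `ocr_folder name_of_image_folder fra` would process the folder in French and `ocr_folder name_of_image_folder` would fall back to English. Setting OCR_DRYRUN=1 first lets you see the commands without running them.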
Dorothée


Joined: 27 Nov 2012
Posts: 253

Posted: Fri 30 Nov 2018, 04:00

Hello rcrsn51,

Thank you for your answer. I had already tried this, but tesseract does not work with any language other than French. I don't understand why.

I am attaching two screenshots of the console, one of them showing the error message for English.

There is also a screenshot of the tessdata folder. Why does the French file look different?

Of course, I have the same problem with the pic2txt interface, and with Ocrgui when I choose tesseract as the main program (it works, very badly but it works, when I choose Gocr as the main program).

Do you understand this mystery?

Thank you in advance.
Attachments: fra_OK.jpg (35.55 KB), eng_does_not_work.jpg (52.08 KB), Tessdata_folder.jpg (33.92 KB)
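
When one language works and another does not, one thing that can be checked from the console is whether each language has a regular, non-empty data file in the tessdata folder. Tesseract looks for files named <lang>.traineddata. The sketch below only illustrates that check and is not a diagnosis of this particular case: have_lang and list_langs are made-up helper names, and the tessdata location differs between installs, so it is passed in as an argument.

```shell
#!/bin/sh
# Sketch only: inspect a tessdata folder for usable language files.
# Tesseract language files are named <lang>.traineddata; the folder
# location varies by install, so it is passed in rather than assumed.
have_lang() {
    file=$1/$2.traineddata
    # Must be a regular file (not a folder) and must not be empty.
    [ -f "$file" ] && [ -s "$file" ]
}

list_langs() {
    for f in "$1"/*.traineddata; do
        [ -f "$f" ] || continue
        name=${f##*/}
        echo "${name%.traineddata}"
    done
}
```

For example, `have_lang /path/to/tessdata eng && echo present` prints "present" only if eng.traineddata is a non-empty regular file in that folder.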

rcrsn51


Joined: 05 Sep 2006
Posts: 12627
Location: Stratford, Ontario

Posted: Fri 30 Nov 2018, 08:22

Try the trained data files from here.

Scroll down to the middle of the list.
Dorothée


Joined: 27 Nov 2012
Posts: 253

Posted: Fri 30 Nov 2018, 13:47

Thank you rcrsn51,

I downloaded two types of file: the .gz ones, whose icon looks like a press, and the .tar.gz ones, whose icon looks like a zipped folder.

The files unpacked from the .gz downloads worked fine and look like a picture (like the French file); the files unpacked from the .tar.gz downloads did not work.

Problem solved, thank you!
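
In case it helps someone later, unpacking one of the .gz downloads from the command line could look roughly like this. It is a sketch under assumptions: the download is taken to be named <lang>.traineddata.gz, install_traineddata is a made-up helper name, and the tessdata folder depends on your install, so it is passed in.

```shell
#!/bin/sh
# Sketch only: decompress a <lang>.traineddata.gz download into tessdata.
# The <lang>.traineddata.gz naming is an assumption about the download.
install_traineddata() {
    gz=$1           # e.g. eng.traineddata.gz
    tessdata=$2     # tessdata folder used by your tesseract install
    name=${gz##*/}  # strip the leading path
    name=${name%.gz}
    # gunzip -c leaves the download untouched and writes the data file.
    gunzip -c "$gz" > "$tessdata/$name"
}
```

Running `install_traineddata eng.traineddata.gz /path/to/tessdata` would leave eng.traineddata in that folder.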