Puppy Linux Discussion Forum
Puppy HOME page : puppylinux.com
"THE" alternative forum : puppylinux.info
 
All times are UTC - 4
 Forum index » Advanced Topics » Additional Software (PETs, n' stuff) » Documents
pdf compressing software?
Dromeno

Joined: 12 Sep 2008
Posts: 538

Posted: Mon 18 Nov 2013, 07:30    Post subject: pdf compressing software?

I have a couple of 300 MB+ PDF files (scanned books, OCR'd) which I would like to shrink while maintaining text quality. I do not care much about the photos. What would be the best approach?
Dingo


Joined: 11 Dec 2007
Posts: 1422
Location: somewhere at the end of rainbow...

Posted: Mon 18 Nov 2013, 09:33    Post subject: Re: pdf compressing software?

Dromeno wrote:
I have a couple of 300 MB+ PDF files (scanned books, OCR'd) which I would like to shrink while maintaining text quality. I do not care much about the photos. What would be the best approach?

So what you have is really a PDF wrapper around scanned images.

A good way to reduce file size while keeping quality and readability is to reduce the color depth.

If your scanned pages are in color or grayscale, you can shrink them dramatically with the encoder Adam Langley designed for the Google Books project. It compresses scanned text in black and white, alongside JPEG 2000 for grayscale details:

Jbig2enc
- http://dokupuppylinux.info/programs:encoders

You need:

- Python (I use Python 2.5 in Puppy 3.01: http://dokupuppylinux.info/programs:python)
- pdf.py (a small Python script that puts all the JBIG2-encoded images into a PDF):
http://dokupuppylinux.info/programs:encoders



HOWTO:

1° - extract all images from the PDF at their original native resolution, if you don't have the originals outside the PDF (this can be done with pdfimages from xpdfutils or from poppler-utils)

2° - encode all images with jbig2enc

Code:
jbig2 -s -p -v *.fileextension && pdf.py output > file.pdf
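
The two steps above can be scripted. What follows is only a sketch, assuming pdfimages, jbig2 and pdf.py are all on the PATH and that pdf.py is executable; the function names, the "page" prefix and the working directory are made up for illustration and do not come from jbig2enc itself.

```python
import glob
import os
import subprocess


def pdfimages_cmd(pdf_path, prefix):
    """Step 1: dump every image in pdf_path as prefix-NNN files (PBM/PPM by default)."""
    return ["pdfimages", pdf_path, prefix]


def jbig2_cmd(image_names):
    """Step 2, as in the post: -s symbol mode, -p PDF-ready data, -v verbose."""
    return ["jbig2", "-s", "-p", "-v"] + list(image_names)


def compress_scanned_pdf(pdf_path, out_pdf, workdir):
    """Extract the page images, JBIG2-encode them, and wrap them in a new PDF."""
    os.makedirs(workdir, exist_ok=True)
    # "page" is a hypothetical prefix; pdfimages appends a page counter to it.
    subprocess.run(pdfimages_cmd(os.path.abspath(pdf_path), "page"),
                   cwd=workdir, check=True)
    names = sorted(os.path.basename(p)
                   for p in glob.glob(os.path.join(workdir, "page-*")))
    # jbig2 writes its output.* files into the current directory (workdir here).
    subprocess.run(jbig2_cmd(names), cwd=workdir, check=True)
    # pdf.py prints the finished PDF on standard output, hence the redirect.
    with open(out_pdf, "wb") as out:
        subprocess.run(["pdf.py", "output"], cwd=workdir, stdout=out, check=True)
```

Calling compress_scanned_pdf("book.pdf", "book-small.pdf", "work") would then stand in for typing the two commands by hand. Note that the result contains only the images; any text layer from the original PDF is not carried over.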


A sample:

Original scanned image (taken with a Nikon D3200): 539 KB, 2230x3777

Black-and-white image encoded to PDF with jbig2enc: 28 KB, at the same dimensions (2230x3777)

http://ge.tt/7jxwM301/v/0 (encoded pdf)
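
To put the sample numbers in perspective, here is a rough back-of-the-envelope calculation. It assumes, purely for illustration, that every page of a 300 MB book compresses like this one sample page, which will not hold exactly (pages with photos behave differently); the function names are hypothetical.

```python
def compression_ratio(original_kb, compressed_kb):
    """How many times smaller the compressed file is."""
    return original_kb / compressed_kb


def projected_size_mb(source_mb, ratio):
    """Projected size if the whole file shrank by the same ratio."""
    return source_mb / ratio


ratio = compression_ratio(539, 28)          # sample page from the post
projected = projected_size_mb(300, ratio)   # 300 MB book from the question
print("ratio: %.2f, projected size: %.1f MB" % (ratio, projected))
```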

_________________
replace .co.cc with .info to get access to stuff I posted in forum
dropbox 2GB free
OpenOffice for Puppy Linux
Dromeno

Joined: 12 Sep 2008
Posts: 538

Posted: Mon 18 Nov 2013, 11:34    Post subject: epub

Dingo

Thanks for your fast help, but I do not understand everything yet.

Yes, it is easy to split the PDFs into text and images and then compress the images. But how do I recombine those into a new compressed PDF with embedded text? Of course I can OCR that PDF again, but I expect to lose text quality then.

BTW, my end goal is to produce an EPUB from that PDF (with Calibre).
Dingo


Joined: 11 Dec 2007
Posts: 1422
Location: somewhere at the end of rainbow...

Posted: Mon 18 Nov 2013, 15:04    Post subject: Re: epub

Dromeno wrote:
how do I recombine those back into a new compressed PDF with embedded text?

With hocr2pdf from the ExactImage utilities
- http://www.exactcode.com/site/open_source/exactimage/hocr2pdf/

in this case a possible workflow is:

1° - extract all scanned images from wrapping pdf with pdfimages
Code:
pdfimages file.pdf 0

2° - convert to black-and-white TIFF without dithering (this can be done with GraphicsMagick, which is lighter and faster than ImageMagick)
Code:
gm mogrify -format tiff -dither None -compress Group4 -threshold *value*  *.fileext

3° - perform OCR on the resulting TIFF images with hOCR-capable software such as tesseract, which can produce hOCR output for reuse with hocr2pdf
4° - combine the hOCR output and the TIFF files into one multipage PDF with hocr2pdf
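
Steps 2 to 4 could be scripted roughly as follows. This is a sketch under several assumptions: gm, tesseract and hocr2pdf are on the PATH; tesseract names its hOCR file either out_base.hocr or out_base.html depending on its version; and, as far as I remember, hocr2pdf handles one image per call, so the sketch produces one PDF per page and leaves joining them to another tool. The function names are invented for illustration, the threshold is whatever value you settle on, and the dither option from the command above is left out.

```python
import os
import subprocess


def threshold_cmd(images, threshold):
    """Step 2: write a bilevel Group 4 TIFF next to each input image."""
    return ["gm", "mogrify", "-format", "tiff", "-compress", "Group4",
            "-threshold", threshold] + list(images)


def ocr_cmd(tiff_path, out_base):
    """Step 3: ask tesseract for hOCR output instead of plain text."""
    return ["tesseract", tiff_path, out_base, "hocr"]


def hocr2pdf_cmd(tiff_path, pdf_path):
    """Step 4: hocr2pdf takes the image via -i and reads the hOCR from stdin."""
    return ["hocr2pdf", "-i", tiff_path, "-o", pdf_path]


def ocr_page_to_pdf(tiff_path, out_base):
    """Run steps 3 and 4 for one already-thresholded TIFF page."""
    subprocess.run(ocr_cmd(tiff_path, out_base), check=True)
    hocr_path = out_base + ".hocr"
    if not os.path.exists(hocr_path):
        # Assumption: older tesseract releases write out_base.html instead.
        hocr_path = out_base + ".html"
    pdf_path = out_base + ".pdf"
    with open(hocr_path, "rb") as hocr:
        subprocess.run(hocr2pdf_cmd(tiff_path, pdf_path), stdin=hocr, check=True)
    return pdf_path
```

For example, ocr_page_to_pdf("0-000.tiff", "0-000") would leave a searchable 0-000.pdf next to the TIFF.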

If you want to try hocr2pdf, you need to find precompiled binaries: I tried many times to build the ExactImage utilities in Puppy Linux, but the resulting executables, although they built without errors, failed to open any file I submitted for testing.

Dromeno wrote:
Of course I can OCR that pdf but I expect to lose text quality then.

This is generally the use case for Adobe ClearScan, proprietary software that shrinks scanned PDFs by vectorizing the raster text in book scans: it creates a custom font from the recognized text and uses this subsetted font to represent the text.
http://acrobatusers.com/tutorials/better-pdf-ocr-clearscan-smaller-looks-better

But if you dislike Adobe or, like me, hate this monopolizing company, a possible alternative is
smoothscan
https://natecraun.net/projects/smoothscan/
Quote:
smoothscan is a tool to convert scanned text into a vectorized output form.


The source is available for building. The last time I tried to build smoothscan I ran into problems with the fontforge Python dependencies; it seems these dependencies have since been removed, so there is a chance it will now build fine.

Flash
Official Dog Handler


Joined: 04 May 2005
Posts: 11118
Location: Arizona USA

Posted: Mon 18 Nov 2013, 16:02

I just tried compressing a 923 kB pdf file with Xarchive. The result was a 405 kB .tar.gz file which seemed to uncompress to the original pdf file just fine.

Instructions on how I used Xarchive to do this are here.
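
The same kind of archive can be made without a GUI. The sketch below uses Python's standard tarfile module rather than Xarchive, and its function names are made up for illustration. Keep in mind that the result is a .tar.gz archive, which has to be unpacked before the PDF can be read again.

```python
import os
import tarfile


def make_tar_gz(src_path, archive_path):
    """Pack one file into a gzip-compressed tar archive; return the archive size in bytes."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(src_path, arcname=os.path.basename(src_path))
    return os.path.getsize(archive_path)


def extract_member(archive_path, member_name):
    """Read one member back out of the archive as bytes."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.extractfile(member_name).read()
```

Comparing the bytes returned by extract_member with the original file is a simple way to confirm that it "uncompresses to the original just fine".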