Tuesday, 21 May 2013

Powerful image manipulation with ImageMagick, Ghostscript, and others.

I kept this post in my "draft" folder for a long time. Originally, I wrote it because I had found a scanned copy of a very old, out-of-print scientific monograph that was quite valuable to my research at the time (and also interesting from a historical perspective). Of course, no official electronic version of this book existed, and the second-hand copies sold on eBay were grossly overpriced.
Copyright issues aside (the author has unfortunately just passed away), reading from a set of separate scanned files is really not convenient, and not really compatible with Calibre, the e-book management software that I currently use in combination with an "old" Kindle 4.

Thus, I had to find a way to generate a sufficiently small electronic version of the set, say a .pdf. The open-source software ImageMagick is well suited to the task, more specifically through its command-line tool Convert. Note that Convert depends on Ghostscript to read PDF files.

After a few years of using this routinely, and of routinely searching my "draft" folder for the information contained here, I decided to release it as it is. Hope this helps.

---

So, you have this set of black-and-white scanned pictures in .png format, numbered from 001.png to, say, 300.png, and you want to convert it into a PDF. You can actually do this with the following command (in bash):

convert *.png <name>.pdf
The result is a 300-page document <name>.pdf. Yeah, it is that simple.
However, BEWARE: this operation has a huge memory requirement; this is most probably because every picture is first fully uncompressed in memory before the PDF is generated. As an example, in my case, it required about 10 GB of RAM. On most systems, this command will probably freeze your computer [1].

Thus, you will have to optimize the conversion. For that, Convert has a huge number of options. The first one you want to use systematically when converting from/to PDF is the -verbose option, to see how it is going. It prints the size of each file during processing; if it slows down dramatically, it is probably time to hit Ctrl+C and try another way.
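Besides -verbose, ImageMagick has a -limit option that caps the resources Convert may use; beyond the cap, pixels are cached on disk, which is slower but should not exhaust the RAM. Here is a minimal sketch. The function name capped_convert and the 1GiB/2GiB defaults are only illustrative assumptions, to be tuned to your machine.

```shell
# capped_convert OUTPUT.pdf INPUT...
# Runs convert with a cap on heap memory and on memory-mapped files.
# MEM_LIMIT and MAP_LIMIT are hypothetical variable names; the default
# values below are illustrative, not recommendations.
capped_convert() {
    local out="$1"
    shift
    convert -limit memory "${MEM_LIMIT:-1GiB}" \
            -limit map "${MAP_LIMIT:-2GiB}" \
            -verbose "$@" "$out"
}
```

Usage: capped_convert name.pdf *.png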


A less memory-hungry way is to work in two steps. First, convert each picture to PDF format; in bash, you can use the one-liner
for i in *.png ; do f="${i%.png}.pdf" ; convert "$i" "$f" ; echo "$f" ; done
Then, use Ghostscript directly to merge all the files together:
gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=<name>.pdf *.pdf
Et voilà. Note that the final PDF is a bit bigger than the one obtained directly with Convert.
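The two steps above can be wrapped in a small function. This is only a sketch: the names png_to_pdf_name and merge_scans are made up for illustration, and writing the per-page PDFs to a temporary directory is a choice made here so that the final glob never picks up an older output file.

```shell
# png_to_pdf_name NAME.png: print the matching one-page PDF name.
png_to_pdf_name() {
    printf '%s\n' "${1%.png}.pdf"
}

# merge_scans OUTPUT.pdf [DIR]
# Converts every PNG in DIR (default: current directory) to a one-page
# PDF, one file at a time, then merges the pages with Ghostscript.
merge_scans() {
    local out="$1" dir="${2:-.}" tmp img page status
    tmp=$(mktemp -d) || return 1
    for img in "$dir"/*.png; do
        [ -e "$img" ] || continue      # no PNG found: skip the literal glob
        page="$tmp/$(png_to_pdf_name "$(basename "$img")")"
        convert "$img" "$page" || { rm -rf "$tmp"; return 1; }
    done
    # The glob expands in alphabetical order, hence the zero-padded names.
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$out" "$tmp"/*.pdf
    status=$?
    rm -rf "$tmp"
    return "$status"
}
```

Usage: merge_scans name.pdf /path/to/scans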



Of course, the resulting file size and quality depend on the initial set of pictures. In most cases, for black-and-white pictures, you will want to use the .png format rather than .jpg.



if you want to convert a pdf into a set of pictures:
- convert name.pdf name.png
It will split the pdf into as many pictures as there are pages, with numbers like name-0.png, name-1.png...
if you want jpg instead of png:
- convert name.pdf name.jpg
The resolution of the pictures is 72 dpi by default; you can change it to any value with -density:
- convert -density 300 name.pdf name.jpg
BEWARE: it will create huge raw files in memory so it can be very slow.
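As a sketch of the PDF-to-pictures recipe, the function below (pdf_to_pages is a hypothetical name) takes the resolution and the output format as parameters. It also uses a %03d pattern in the output name, so that the pages come out as name-000.png, name-001.png, and so on, and therefore sort correctly in a later *.png glob. The 300 dpi default is just the value used in the example above.

```shell
# pdf_to_pages INPUT.pdf [DPI] [FORMAT]
# Splits a PDF into one picture per page, with zero-padded page numbers.
pdf_to_pages() {
    local pdf="$1" dpi="${2:-300}" fmt="${3:-png}"
    local base="${pdf%.pdf}"
    # -density is given before the input file, because it controls how
    # the PDF pages are rasterized when they are read.
    convert -verbose -density "$dpi" "$pdf" "${base}-%03d.${fmt}"
}
```

Usage: pdf_to_pages name.pdf 150 jpg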

if you want to convert a set of pictures to a pdf:
1. name the pictures in the order you want them to be packaged (00.png, 01.png, 02.png, etc.)
2. run:
- convert *.png name.pdf
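Since *.png is expanded in alphabetical order, 10.png would come before 2.png; zero-padded names avoid that. Below is a minimal sketch of a renaming helper, assuming the files have purely numeric names (7.png, 08.png...). The names pad_page_name and pad_all are made up for this example.

```shell
# pad_page_name NAME [WIDTH]: print NAME with its number zero-padded.
pad_page_name() {
    local name="$1" width="${2:-3}"
    local ext="${name##*.}"
    local num="${name%.*}"
    # Strip leading zeros so that printf does not read e.g. 08 as octal.
    while [ "${#num}" -gt 1 ] && [ "${num#0}" != "$num" ]; do
        num="${num#0}"
    done
    printf "%0${width}d.%s\n" "$num" "$ext"
}

# pad_all: rename every PNG of the current directory to its padded name.
pad_all() {
    local f new
    for f in *.png; do
        [ -e "$f" ] || continue
        new=$(pad_page_name "$f") || continue   # skip non-numeric names
        # Never overwrite an existing file.
        if [ "$f" != "$new" ] && [ ! -e "$new" ]; then
            mv -- "$f" "$new"
        fi
    done
}
```

Usage: run pad_all in the folder containing the pictures, then convert *.png name.pdf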

if you want to convert a set of pictures from one format to another
- for i in *.png ; do convert "$i" "${i%.png}.jpg" ; done

if you want to change their size, for example to give them all a height of 1200 pixels:
- for i in *.png ; do convert -resize x1200 "$i" "${i%.png}_resized.png" ; done
BEWARE: resizing tends to degrade the images noticeably; the underlying algorithms are not perfect...

Finally, if you prefer DjVu to PDF:
(NB: for black-and-white files, it is probably a bad idea, as the result will be significantly smaller in PDF if you start from .png pictures)
1. convert all your pictures to DjVu using c44 (c44 reads PNM or JPEG input, so the .png files go through .ppm first):
- for i in *.png ; do convert "$i" "${i%.png}.ppm" ; c44 "${i%.png}.ppm" "${i%.png}.djvu" ; done
2. create the DjVu file using djvm:
- djvm -c name.djvu *.djvu
3. you can then add an outline using a text file (e.g. outline.txt) with the following format:

(bookmarks
  ("Chapter1" "#page_number"
    ("Subchapter1" "#page_number")
    ("Subchapter2" "#page_number")
  )
  ("Conclusion" "#page_number")
)

And then set it in your file with:
- djvused -e 'set-outline outline.txt' -s name.djvu
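For a long book, the outline file can be generated rather than typed by hand. Here is a minimal sketch that only produces a flat outline (no sub-chapters). The function name make_outline and the input format (one "page<TAB>title" pair per line) are assumptions made for this example, and titles must not contain double quotes.

```shell
# make_outline: read "page<TAB>title" lines on stdin and print a flat
# outline in the (bookmarks ...) format expected by djvused.
make_outline() {
    local page title
    printf '(bookmarks\n'
    while IFS="$(printf '\t')" read -r page title; do
        [ -n "$page" ] || continue     # ignore empty lines
        printf '  ("%s" "#%s")\n' "$title" "$page"
    done
    printf ')\n'
}
```

Usage: make_outline < chapters.tsv > outline.txt, where chapters.tsv is a hypothetical file holding the page/title pairs.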

[1] but for the lucky few that use supercomputers, this should not really be a problem. Yes, I tried.