Using Wand to extract images from PDFs in Python
ImageMagick is a command-line tool that developers commonly use to convert images between formats. It supports many image formats and is fairly easy to work with once you get the command-line arguments down.
I needed to extract images from PDFs, and although I could do it from the command line alone, I wanted to do the extraction in Python. There are a few Python libraries that can do it, so I wanted to compare them. Here are the candidates:
- Wand is a ctypes-based ImageMagick binding library for Python.
- PythonMagick is an object-oriented Python interface to ImageMagick.
- PythonMagickWand is an object-oriented Python interface to MagickWand based on ctypes.
The Comparison
After a short review of the current documentation for all three projects above, it seems that only Wand is still supported and has up-to-date documentation. Based on that, I decided to continue with Wand alone.
Installing ImageMagick on a Mac with Brew
If you do not already have brew installed, you can follow my how-to on setting up a new Mac, or just visit the brew home page. Once you have brew, you can install ImageMagick with the following:
$ brew install freetype imagemagick
==> Installing dependencies for imagemagick: freetype
==> Installing imagemagick dependency: freetype
==> Downloading https://homebrew.bintray.com/bottles/freetype-2.6.3.yosemite.bottle.tar.gz
######################################################## 100.0%
==> Pouring freetype-2.6.3.yosemite.bottle.tar.gz
🍺 /usr/local/Cellar/freetype/2.6.3: 60 files, 2.5M
==> Installing imagemagick
==> Downloading https://homebrew.bintray.com/bottles/imagemagick-6.9.3-0_2.yosemite.bottle.1.tar.gz
######################################################## 100.0%
==> Pouring imagemagick-6.9.3-0_2.yosemite.bottle.1.tar.gz
🍺 /usr/local/Cellar/imagemagick/6.9.3-0_2: 1,453 files, 17.6M
Assuming you do not already have ImageMagick installed, you will see the installation progress. Be sure to follow any instructions at the end of the installation; you may be asked to link the binaries when it completes.
To verify the installation went as planned, try running convert.
$ convert
Version: ImageMagick 6.9.3-0 Q16 x86_64 2016-01-08 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2016 ImageMagick Studio LLC
License: http://www.imagemagick.org/script/license.php
Features: Cipher DPC Modules
Delegates (built-in): bzlib freetype jng jpeg ltdl lzma png tiff xml zlib
Usage: convert [options ...] file [ [options ...] file ...] [options ...] file
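The same check can be scripted from Python, which is handy if the extraction will later run on a machine you did not set up yourself. This is only a minimal sketch using the standard library: the helper name find_imagemagick is mine, it assumes Python 3.7 or newer, and it assumes the binary is called either convert (ImageMagick 6, as installed above) or magick (ImageMagick 7).

```python
import shutil
import subprocess


def find_imagemagick(candidates=("magick", "convert")):
    """Return the path of the first candidate binary found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    binary = find_imagemagick()
    if binary is None:
        print("ImageMagick not found on PATH")
    else:
        # -version prints the version banner and exits.
        result = subprocess.run([binary, "-version"], capture_output=True, text=True)
        print(result.stdout)
```

If the function returns None, the brew step above either did not finish or the binaries still need to be linked.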
Installing Wand
You can easily install Wand with pip:
$ pip install wand
Collecting wand
Downloading Wand-0.4.2.tar.gz (63kB)
100% |████████████████████████████████| 71kB 133kB/s
Building wheels for collected packages: wand
Running setup.py bdist_wheel for wand ... done
Stored in directory: /Users/home/Library/Caches/pip/wheels/f2/71/f5/0627b
Successfully built wand
Installing collected packages: wand
Successfully installed wand-0.4.2
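A quick way to confirm that Wand installed and can see ImageMagick is to ask it for its version information. The sketch below is hedged: wand_status is a name I made up, and it assumes that wand.version exposes VERSION and, only when the ImageMagick library loaded successfully, MAGICK_VERSION.

```python
def wand_status():
    """Describe the Wand install without raising if something is missing."""
    try:
        import wand.version as wand_version
    except (ImportError, OSError) as exc:
        return "Wand is not importable: {}".format(exc)
    # Assumption: MAGICK_VERSION is only defined when Wand could load
    # the ImageMagick shared library.
    magick = getattr(wand_version, "MAGICK_VERSION", None)
    if magick is None:
        return "Wand {} found, but ImageMagick could not be loaded".format(
            wand_version.VERSION
        )
    return "Wand {} using {}".format(wand_version.VERSION, magick)


if __name__ == "__main__":
    print(wand_status())
```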
Converting PDF to PNGs
Even before using Wand, it helps to convert the PDF with the command-line tool, just to verify that the installation is correct and that the PDF you are using works. Here is a command that converts the pages of a PDF to PNG images:
$ convert -density 300 source.pdf s.png
This will create one PNG for each page of the source.pdf file, rendered at 300 DPI; for a multi-page PDF, ImageMagick numbers the output files (s-0.png, s-1.png, and so on). For the file I am using, the background would be better off as white, and that can be fixed using the command line, but I am going to keep going to get this running from Python.
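If you want to drive the command-line tool from Python before switching to Wand, a thin subprocess wrapper is enough. This is a sketch, not the approach used in the rest of this post: build_convert_command and pdf_to_pngs are hypothetical helper names, and the white-background option assumes an ImageMagick build recent enough to support -alpha remove.

```python
import subprocess


def build_convert_command(pdf_path, out_path, density=300, white_background=False):
    """Build the argument list for ImageMagick's convert."""
    cmd = ["convert", "-density", str(density), pdf_path]
    if white_background:
        # Flatten transparency onto white. These options act on images that
        # have already been read, so they go after the input file.
        cmd += ["-background", "white", "-alpha", "remove"]
    cmd.append(out_path)
    return cmd


def pdf_to_pngs(pdf_path, out_path, **options):
    """Run convert and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_convert_command(pdf_path, out_path, **options), check=True)
```

Calling pdf_to_pngs("source.pdf", "s.png", white_background=True) would run the same command as above with the background flattened to white. Keeping the command construction in its own function makes it easy to print and inspect the arguments before running anything.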
Wand
PDFs open in Wand as images with multiple sequences. Here is a very simple script that opens source.pdf and converts it to PNG.
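from __future__ import print_function

from wand.image import Image

# Open the PDF; each page becomes one entry in img.sequence.
with Image(filename='source.pdf') as img:
    print('pages =', len(img.sequence))
    # convert() returns a new image in the target format.
    with img.convert('png') as converted:
        # The pyout directory must already exist.
        converted.save(filename='pyout/page.png')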
img.sequence holds the individual images in the file, which for a PDF are its pages, so len(img.sequence) is the page count. convert returns a copy of the image in the new format, and save writes that copy to disk.
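If you want control over each page's file name, you can walk img.sequence yourself and save the pages one at a time. The following is a sketch under a few assumptions: it targets Python 3, the helper names page_filename and save_pages are hypothetical, and the resolution argument is used here as Wand's counterpart to the -density 300 option from the command line.

```python
import os


def page_filename(out_dir, index, prefix="page"):
    """Build a zero-padded, 1-based PNG file name inside out_dir."""
    return os.path.join(out_dir, "{}-{:03d}.png".format(prefix, index + 1))


def save_pages(pdf_path, out_dir, resolution=300):
    """Render every page of pdf_path to its own PNG and return the paths."""
    # Imported here so page_filename stays usable without Wand installed.
    from wand.image import Image

    os.makedirs(out_dir, exist_ok=True)
    written = []
    with Image(filename=pdf_path, resolution=resolution) as pdf:
        for index, page in enumerate(pdf.sequence):
            # Copy the sequence entry into a standalone Image before saving.
            with Image(image=page) as single:
                single.format = "png"
                target = page_filename(out_dir, index)
                single.save(filename=target)
                written.append(target)
    return written
```

Calling save_pages("source.pdf", "pyout") would write page-001.png, page-002.png, and so on into pyout, creating the directory if needed. Zero-padding the page number keeps the files in page order when they are sorted by name.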
Notes and Comments
32-bit Python
If you followed some of my other tutorials, or are simply using a 32-bit version of Python, you will get the following error when you try to use the library:
$ python pdfToPNGWand.py
Traceback (most recent call last):
  File "pdfToPNGWand.py", line 2, in <module>
    from wand.image import Image
  File "/Users/home/.virtualenvs/netvantage/lib/python2.7/site-packages/wand/image.py", line 20, in <module>
    from .api import MagickPixelPacket, libc, libmagick, library
  File "/Users/home/.virtualenvs/netvantage/lib/python2.7/site-packages/wand/api.py", line 205, in <module>
    'Try to install:\n ' + msg)
ImportError: MagickWand shared library not found.
You probably had not installed ImageMagick library.
Try to install:
brew install freetype imagemagick
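The error can be confusing, because it suggests that ImageMagick is missing even when it is installed. The real problem is the mismatch between a 32-bit Python and a 64-bit ImageMagick: you need to either install a 32-bit version of ImageMagick or use a 64-bit version of Python.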
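To find out which kind of Python you are running, you can check the interpreter's pointer size. This is a small standard-library sketch; the function name python_bits is just an illustration.

```python
import platform
import struct


def python_bits():
    """Return 32 or 64 depending on the pointer size of this interpreter."""
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    print("Python is {}-bit on {}".format(python_bits(), platform.machine()))
```

If this reports 32 while the convert banner shows x86_64, the two do not match and the import error above is the likely result.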
Source
All files used here can be found in this Git repository: https://github.com/nycynik/PythonPDFtoPNG