Batch-convert PDFs to JPEGs and extract raw text from PDFs
To formally submit a PhD thesis, my university requires an overview of all published papers and their abstracts. I had PDF versions of all my papers and wanted to automate the process. This post documents the steps, in the hope that they are useful to others.
Requirements: This write-up assumes you are working on a Mac, but the same commands should work on any platform where ImageMagick and Xpdf are available.
- Install ImageMagick. I use MacPorts: port install ImageMagick.
- Install Xpdf. Using MacPorts, this boils down to: port install Xpdf.
PDF to JPG / JPEG Process: First, put all PDFs in one separate folder; this will make your life easier. Second, convert all PDFs to JPEGs using this command:
for i in *.pdf; do convert -density 300 "$i" "$i".jpg; done
You may want to rescale the images (this could probably be done more elegantly in one step, I just did it in two).
for i in *.jpg; do convert "$i" -geometry x128 "$i"; done
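The two steps above can probably be folded into one. Here is a minimal sketch, assuming that only the first page of each PDF is wanted as a preview (that is what the [0] suffix selects) and that convert is ImageMagick's command (on ImageMagick 7 it is called magick instead). The function name pdf_thumbnails is made up for this example.

```shell
#!/bin/sh
# Render the first page of each given PDF at 300 dpi and scale it to a
# height of 128 pixels in a single convert call.
# Output naming follows the commands above: paper.pdf -> paper.pdf.jpg
pdf_thumbnails() {
    for pdf in "$@"; do
        [ -f "$pdf" ] || continue   # skip arguments that are not files
        convert -density 300 "${pdf}[0]" -resize x128 "$pdf.jpg"
    done
}
```

From inside the folder holding the papers, this would be called as pdf_thumbnails *.pdf. Passing the files as arguments keeps names with spaces intact.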
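PDF to Raw Text Process: For my paper index page, I just needed the first page (ACM style), so I used the following command (the binary is called xpdf-pdftotext in my MacPorts install; on other setups it is usually plain pdftotext):
for i in *.pdf; do xpdf-pdftotext -f 1 -l 1 "$i" "$i".txt; done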
This gives you a more or less correct raw text version of the first page of each PDF. Your mileage may vary; footnotes, small caps, and unusual formatting are the main problem cases. I had to go through the text files manually and delete everything except the paper title and abstract.
Paper Index Page Process: I wrote a small script that loops over all PDFs, fetches the corresponding abstract, title, and image preview, and returns the HTML. The resulting paper index page is online now. If you spot any conversion errors, please let me know; I am sure there are quite a few.
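The following is a minimal sketch of such a script, not the original one. It assumes the file layout produced by the commands above (paper.pdf, paper.pdf.jpg, paper.pdf.txt) and, as a further assumption, that each cleaned-up text file holds the title on its first line and the abstract on the remaining lines. The function names and the HTML structure are made up for illustration.

```shell
#!/bin/sh
# Escape the characters that are special in HTML text.
html_escape() {
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Print an HTML index for all PDFs in the given directory (default: .).
# Assumed layout: X.pdf, X.pdf.jpg (preview), X.pdf.txt (title + abstract).
make_index() {
    dir=${1:-.}
    printf '%s\n' '<html><body>'
    for pdf in "$dir"/*.pdf; do
        [ -f "$pdf" ] || continue        # no PDFs: the glob stays literal
        txt="$pdf.txt"
        [ -f "$txt" ] || continue        # skip papers without a text file
        name=$(basename "$pdf")
        title=$(head -n 1 "$txt" | html_escape)      # line 1: title
        abstract=$(tail -n +2 "$txt" | html_escape)  # rest: abstract
        printf '%s\n' '<div class="paper">'
        printf '<img src="%s.jpg" alt="preview">\n' "$name"
        printf '<h2><a href="%s">%s</a></h2>\n' "$name" "$title"
        printf '<p>%s</p>\n' "$abstract"
        printf '%s\n' '</div>'
    done
    printf '%s\n' '</body></html>'
}
```

Running make_index . > index.html in the paper folder would write the page next to the PDFs. File names are inserted into the links unescaped, so this only suits simple names.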
My write-up is inspired by these two resources:
http://www.medicalnerds.com/batch-converting-pdf-to-jpgjpeg-using-free-software/ and http://www.cyberciti.biz/faq/converter-pdf-files-to-text-format-command/