A Look Inside the Think Tank...Technical by Thomas Steiner
Bots vs. Wikipedians? Who edits more?
I have just released an app called Bots vs. Wikipedians that displays live stats on Wikipedia and Wikidata bot activity. Bots vs. Wikipedians has a public API that sends out Wikipedia and Wikidata edits as Server-Sent Events. You can learn everything about Server-Sent Events (SSE) from an amazing HTML5 Rocks article by Eric Bidelman. The code of Bots vs. Wikipedians is available under the terms of the Apache 2.0 license and published on GitHub.
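To get a feel for the stream format: SSE delivers each event as one or more `data:` lines followed by a blank line. The snippet below is a minimal sketch, not the app's actual API — the endpoint URL is a placeholder (check the GitHub repository for the real one), and the sample payloads are made up for illustration.

```shell
# Consuming an SSE stream with curl (-N disables buffering).
# The URL below is hypothetical:
#   curl -N -H "Accept: text/event-stream" https://example.com/api/sse
#
# Each event arrives as "data: <payload>"; the payloads can be
# extracted from the stream like this (sample input piped in):
printf 'data: {"site":"Wikipedia"}\n\ndata: {"site":"Wikidata"}\n\n' \
  | sed -n 's/^data: //p'
```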
Batch-convert PDFs to JPEGs and extract raw text from PDFs
One of the requirements my university imposes in order to formally submit a PhD thesis is to provide an overview of all published papers and their abstracts. I had PDF versions of all papers and wanted to automate the process. Here is documentation of the steps, in the hope that it may be useful for others.
Requirements: This write-up assumes you are working on a Mac, but the steps should work on all platforms.
- Install ImageMagick; using MacPorts: port install ImageMagick.
- Install Xpdf; using MacPorts: port install xpdf.
PDF to JPG / JPEG Process: First, put all PDFs in one separate folder; this will make your life easier. Second, convert all PDFs to JPEGs using this command:
for i in *.pdf; do convert -density 300 "$i" "$i".jpg; done
You may want to rescale the images (this could probably be done more elegantly in one step; I did it in two):
for i in *.jpg; do convert "$i" -geometry x128 "$i"; done
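For what it's worth, the two steps above can indeed be collapsed into one: render each page at 300 dpi and scale it to 128 px height in a single convert invocation. This is a sketch, not what I actually ran; the `${i%.pdf}.jpg` output naming is my own choice here, so the filenames differ slightly from the two-step version.

```shell
# One-step variant: rasterize at 300 dpi and resize to 128 px height.
# "${i%.pdf}.jpg" strips the .pdf suffix before appending .jpg,
# so "paper.pdf" becomes "paper.jpg" (multi-page PDFs get -0, -1, ...).
for i in *.pdf; do
  convert -density 300 "$i" -geometry x128 "${i%.pdf}.jpg"
done
```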
PDF to Raw Text Process: For my paper index page, I just needed the first page (ACM style), so I used the following command:
for i in *.pdf; do pdftotext -f 1 -l 1 "$i" "$i".txt; done
This gives you a more or less correct raw text version of the PDF. Your mileage may vary; footnotes, small caps, and unusual formatting are especially problematic. I had to go through the text files manually and delete everything besides the paper title and abstract.
Paper Index Page Process: I wrote a small script that loops over all PDFs, fetches the corresponding abstract, title, and image preview, and outputs the HTML. The resulting paper index page is online now (if you spot any conversion errors, please let me know; I am sure there are quite a few).
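A minimal sketch of such an index generator, under the file-naming conventions from the earlier steps ("$i".jpg previews and "$i".txt extracts); the HTML layout and the index.html filename are assumptions, not my actual script:

```shell
# Build a simple HTML list: one entry per PDF, with its JPEG preview,
# a link to the PDF itself, and the extracted title/abstract text.
{
  echo '<ul>'
  for i in *.pdf; do
    echo "  <li><img src=\"$i.jpg\" alt=\"\"> <a href=\"$i\">$i</a>"
    echo "  <pre>$(cat "$i".txt)</pre></li>"
  done
  echo '</ul>'
} > index.html
```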
My write-up is inspired by these two resources:
http://www.medicalnerds.com/batch-converting-pdf-to-jpgjpeg-using-free-software/ and http://www.cyberciti.biz/faq/converter-pdf-files-to-text-format-command/
PDF Archive of All My Published Papers
I have created a folder on my Web space containing all the published papers listed on my Google Scholar profile. Hopefully Google Scholar will index them correctly and match them with my profile. Providing this list is part of the preparatory steps for thesis submission at my university.
This blog post is (mostly) for search engines, not so much humans. Sorry.
Disabling Blog Comments
I have given up. The (barely ever) used commenting function of my blog got repeatedly spammed. As a consequence, I have removed it from my home-brew blog software. It was fun while it lasted. Looking back, I had added a simple Turing test back in 2005, but apparently the spammers have caught up and care enough to even spam my super low traffic blog. If you're a spammer and you read this: you win. Also, fuck you!
If you care enough to comment on any of the items on this blog, you know how to reach me: @tomayac or +tomayac.
What I do for a living in my PhD
Benvingut a Munic, Pep! ("Welcome to Munich, Pep!") Guardiola joins FC Bayern. Preview of an automatically generated social media gallery. That's the stuff I work on in my PhD.
Finally something to show to people :-).