A Look Inside the Think Tank, by Thomas Steiner
PhD thesis successfully defended
I have finally defended my PhD thesis. A raw, unedited recording of the defense is available on YouTube.
You can check out the slide deck I used at http://tomayac.com/phd, and the PDF of the thesis itself is available at http://tomayac.com/phd/thesis.pdf. The source code of the thesis is available in the GitHub repository https://github.com/tomayac/phd. I guess this makes me officially Dr. Thomas Steiner from now on.
Weaving the Web(VTT) of Data
This week, I'm attending the World Wide Web conference (WWW2014) in Seoul, Korea. Yesterday, I co-ran the 5th International Workshop on Web APIs and RESTful Design (WS-REST2014). It was a great workshop: great papers, a great keynote by my fellow Google colleague Sam Goto, and, above all, great people.
Today, I presented work of ours in the workshop Linked Data on the Web (LDOW2014). The title of our paper is Weaving the Web(VTT) of Data; you can see the slides that I used for my talk below.
Telling Breaking News Stories from Wikipedia with Social Multimedia: A Case Study of the 2014 Winter Olympics
Created on Wednesday, April 02, 2014 at 08:57:09 and categorized as Technical by Thomas Steiner
This week, I am attending the International Conference on Multimedia Retrieval (ICMR 2014) in Glasgow, Scotland. I contributed a paper to the workshop on Social Multimedia and Storytelling (SoMuS 2014) based on my work around combining Wikipedia Live Monitor and Social Media Illustrator. The title of the paper is Telling Breaking News Stories from Wikipedia with Social Multimedia: A Case Study of the 2014 Winter Olympics; you can read it on arXiv.org (deep link to the PDF). My slides are available online and also embedded below.
This research was also featured by the MIT Technology Review in an article titled The Evolution of Automated Breaking News Stories.
Bots vs. Wikipedians? Who edits more?
I have just released an app called Bots vs. Wikipedians that displays live stats on Wikipedia and Wikidata bot activity. Bots vs. Wikipedians has a public API that sends out Wikipedia and Wikidata edits as Server-Sent Events. You can learn everything about Server-Sent Events (SSE) from an amazing HTML5 Rocks article by Eric Bidelman. The code of Bots vs. Wikipedians is available under the terms of the Apache 2.0 license and published on GitHub.
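For readers unfamiliar with the format: Server-Sent Events are a plain-text stream where each message consists of `data:` lines followed by a blank line. As a rough illustration only (the endpoint URL and the payload shown here are made up, not the app's actual API), such a stream can be consumed with curl and its data fields extracted with sed:

```shell
# A live stream would be consumed along these lines (hypothetical URL):
#   curl -N -H 'Accept: text/event-stream' https://example.com/sse | sed -n 's/^data: //p'
# Below, a canned sample stream stands in for the live feed so the pipe is runnable.
sample='data: {"wiki":"enwiki","title":"Sochi","bot":false}

data: {"wiki":"wikidata","title":"Q15869","bot":true}
'
# Print only the payload of each "data:" line, stripping the field name.
printf '%s' "$sample" | sed -n 's/^data: //p'
```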
Batch-convert PDFs to JPEGs and extract raw text from PDFs
One of the requirements my university imposes in order to formally submit a PhD thesis is to provide an overview of all published papers and their abstracts. I had PDF versions of all papers and wanted to automate the process. Here is a write-up of the steps, in the hope that it may be useful for others.
Requirements: This write-up assumes you are working on a Mac, but the tools are available on all platforms.
- Install ImageMagick; using MacPorts: port install ImageMagick.
- Install Xpdf; using MacPorts: port install xpdf.
PDF to JPG / JPEG Process: First, put all PDFs in a separate folder; this will make your life easier. Second, convert all PDFs to JPEGs using this command:
for i in *.pdf; do convert -density 300 "$i" "$i.jpg"; done
You may want to rescale the images (this could probably be done more elegantly in one step; I just did it in two):
for i in *.jpg; do convert "$i" -geometry x128 "$i"; done
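For reference, the two loops above can likely be collapsed into one convert invocation per file. This is an untested sketch; the echo makes it a dry run that only prints each command, so drop the echo to actually execute it (a throwaway sample file stands in for real PDFs here):

```shell
# Work in a scratch directory with a stand-in input so the glob matches something.
cd "$(mktemp -d)"
touch sample-paper.pdf
# Dry run: print the combined convert-and-rescale command for each PDF.
for i in *.pdf; do
  echo convert -density 300 "$i" -geometry x128 "$i.jpg"
done
```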
PDF to Raw Text Process: For my paper index page, I just needed the first page (ACM style), so I used the following command:
for i in *.pdf; do pdftotext -f 1 -l 1 "$i" "$i.txt"; done
This gives you a more or less correct raw text version of the PDF. Your mileage may vary; problem cases are especially footnotes, small caps, and strange formatting. I had to go through the text files manually and delete everything besides the paper title and abstract.
Paper Index Page Process: I wrote a small script that loops over all PDFs, fetches the corresponding abstract, title, and image preview, and returns the HTML. The resulting paper index page is online now (if you spot any conversion errors, please let me know; I am sure there are quite a few).
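The script itself is not included in this post, but its core loop can be sketched in shell. Everything here is an illustrative assumption, not the actual script: it assumes each paper has a sibling "$i.txt" file from pdftotext whose first line is the title, plus a "$i.jpg" preview, and a sample input stands in for real files so the sketch is runnable:

```shell
set -e
# Work in a scratch directory with a stand-in paper: the PDF plus its
# extracted text (title on the first line), mimicking the pdftotext output.
cd "$(mktemp -d)"
touch paper1.pdf
printf 'Weaving the Web(VTT) of Data\nAbstract goes here.\n' > paper1.pdf.txt
# Emit one HTML list item per paper, linking the PDF and its JPEG preview.
{
  echo '<ul>'
  for i in *.pdf; do
    title=$(head -n 1 "$i.txt")
    printf '<li><a href="%s"><img src="%s.jpg" alt="">%s</a></li>\n' "$i" "$i" "$title"
  done
  echo '</ul>'
} > index.html
cat index.html
```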
My write-up is inspired by these two resources:
- http://www.medicalnerds.com/batch-converting-pdf-to-jpgjpeg-using-free-software/
- http://www.cyberciti.biz/faq/converter-pdf-files-to-text-format-command/