Der aktuelle Klugschiss




WS-REST 2014 and Weaving the Web(VTT) of Data

Created on Tuesday, April 08, 2014 at 19:27:24 and categorized as Technical by Thomas Steiner


This week, I'm attending the World Wide Web conference (WWW2014) in Seoul, Korea. Yesterday, I co-ran the 5th International Workshop on Web APIs and RESTful Design (WS-REST2014). It was a great workshop, thanks to the papers, the keynote by my Google colleague Sam Goto, and, above all, the people:



Today, I presented our work at the Linked Data on the Web workshop (LDOW2014). The title of our paper is Weaving the Web(VTT) of Data; you can see the slides I used for my talk below.


Telling Breaking News Stories from Wikipedia with Social Multimedia: A Case Study of the 2014 Winter Olympics

Created on Wednesday, April 02, 2014 at 08:57:09 and categorized as Technical by Thomas Steiner


This week, I am attending the International Conference on Multimedia Retrieval (ICMR 2014) in Glasgow, Scotland. I contributed a paper to the workshop on Social Multimedia and Storytelling (SoMuS 2014), based on my work combining Wikipedia Live Monitor and Social Media Illustrator. The title of the paper is Telling Breaking News Stories from Wikipedia with Social Multimedia: A Case Study of the 2014 Winter Olympics; you can read it on arXiv.org (deep link to the PDF). My slides are available online and also embedded below.



This research was also featured by the MIT Technology Review in an article titled The Evolution of Automated Breaking News Stories.


Bots vs. Wikipedians? Who edits more?

Created on Monday, October 14, 2013 at 16:49:46 and categorized as Technical by Thomas Steiner


I have just released an app called Bots vs. Wikipedians that displays live stats on Wikipedia and Wikidata bot activity. Bots vs. Wikipedians has a public API that sends out Wikipedia and Wikidata edits as Server-Sent Events. You can learn everything about Server-Sent Events (SSE) from an amazing HTML5 Rocks article by Eric Bidelman. The code of Bots vs. Wikipedians is available under the terms of the Apache 2.0 license and published on GitHub.
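
If you just want to watch the raw stream, curl works fine from the command line. The endpoint below is only a placeholder; the actual URL is documented in the GitHub repository:

# Consume a Server-Sent Events stream from the command line.
# -N (--no-buffer) makes curl print events as they arrive.
# The URL is a placeholder; see the GitHub README for the real endpoint.
curl -N -H "Accept: text/event-stream" "https://bots-vs-wikipedians.example.org/sse"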


Batch-convert PDFs to JPEGs and extract raw text from PDFs

Created on Monday, September 16, 2013 at 00:14:22 and categorized as Technical by Thomas Steiner


One of the requirements my university imposes for formally submitting a PhD thesis is an overview of all published papers and their abstracts. I had PDF versions of all papers and wanted to automate the process. Here is a write-up of the steps, in the hope that it may be useful for others.

Requirements: This write-up assumes you are working on a Mac, but the steps should work on any platform that has ImageMagick's convert and xpdf's pdftotext installed.



PDF to JPG / JPEG Process: First, put all PDFs into one separate folder; this will make your life easier. Second, convert all PDFs to JPEGs using this command:

for i in *.pdf; do convert -density 300 "$i" "$i.jpg"; done

You may want to rescale the images (this could probably be done more elegantly in one step, as sketched below; I just did it in two).

for i in *.jpg; do convert "$i" -geometry x128 "$i"; done
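
For completeness, the one-step variant should look something like this (a sketch I have not battle-tested; -geometry is applied to the rasterized pages, so the full-size intermediate JPEGs are never written):

# Rasterize at 300 DPI and scale to 128 pixels height in one pass.
for i in *.pdf; do convert -density 300 "$i" -geometry x128 "$i.jpg"; done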

PDF to Raw Text Process: For my paper index page, I only needed the first page (which, in ACM style, contains the title and abstract), so I used the following command:

for i in *.pdf; do pdftotext -f 1 -l 1 "$i" "$i.txt"; done

This gives you a more or less correct raw text version of the PDF. Your mileage may vary; problem cases are especially footnotes, small caps, and strange formatting. I had to go through the text files manually and delete everything besides the paper title and abstract.

Paper Index Page Process: I wrote a small script that loops over all PDFs, picks up the corresponding title, abstract, and image preview, and emits the HTML. The resulting paper index page is online now (if you spot any conversion errors, please let me know; I am sure there are quite a few).
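
The actual script is tied to my setup, but a minimal sketch of the idea could look like the following (file names match the conversion commands above; note that convert numbers its output files for multi-page PDFs, so the preview naming is simplified here):

#!/bin/bash
# Sketch: build an HTML paper index from the files produced above.
# Assumes each paper.pdf sits next to paper.pdf.txt (first line: title,
# rest: abstract) and paper.pdf.jpg (single-page preview).
{
  echo '<ul>'
  for i in *.pdf; do
    title=$(head -n 1 "$i.txt")     # paper title
    abstract=$(tail -n +2 "$i.txt") # abstract text
    echo "<li><img src=\"$i.jpg\" alt=\"Preview of $title\">"
    echo "<h2>$title</h2><p>$abstract</p>"
    echo "<a href=\"$i\">PDF</a></li>"
  done
  echo '</ul>'
} > index.html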

My write-up is inspired by these two resources:
http://www.medicalnerds.com/batch-converting-pdf-to-jpgjpeg-using-free-software/ and http://www.cyberciti.biz/faq/converter-pdf-files-to-text-format-command/


PDF Archive of All My Published Papers

Created on Monday, September 02, 2013 at 18:10:08 and categorized as Technical by Thomas Steiner


I have created a folder on my Web space containing all the published papers from my Google Scholar profile. Hopefully, Google Scholar will index them correctly and match them with my profile. Providing this list is part of the preparatory steps for thesis submission at my university.

This blog post is (mostly) for search engines, not so much for humans. Sorry.
