
Progressive Web Apps in the HTTP Archive

Created on and categorized as Technical.
Written by Thomas Steiner.

Thomas Steiner, Google Hamburg, Germany

tomac@google.com • @tomayac • tomayac

Abstract

In this document, we present three different approaches for extracting data about Progressive Web Apps (PWA) from the HTTP Archive and discuss their particular pros and cons. Approach 1 is based on data that is tracked in the context of runs of the Lighthouse tool, Approach 2 is based on use counters in the Chrome browser that record per-page anonymous aggregated metrics on feature usage, and Approach 3 is based on parsing the source code of web pages for traces of service worker registrations and Web App Manifest references. We find that, according to all three approaches, the popularity of PWAs increases roughly linearly over time, and we provide further research ideas based on the extracted data, whose underlying queries we share publicly.

Introduction to Progressive Web Apps

Progressive Web Apps (PWA) are a new class of web applications, enabled for the most part by the Service Worker APIs. Service workers allow apps to support network-independent loading by intercepting network requests and delivering programmatic or cached responses. They can receive push notifications and synchronize data in the background even when the corresponding app is not running, and—together with Web App Manifests—they allow users to install PWAs to their devices’ home screens. Service workers were first implemented in Chrome 40 Beta, released in December 2014, and the term Progressive Web Apps was coined by Frances Berriman and Alex Russell in 2015.
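To make this more concrete, a page typically references its Web App Manifest with a <link rel="manifest" href="manifest.json"> tag and registers its service worker with a short script along the lines of the following sketch (the file names are hypothetical and purely illustrative):

// Hypothetical example of registering a service worker from a page script.
// "sw.js" is an illustrative file name, not a required convention.
if ('serviceWorker' in navigator) {
  navigator.serviceWorker.register('sw.js')
      .then((registration) => {
        console.log('Service worker registered with scope', registration.scope);
      })
      .catch((error) => {
        console.error('Service worker registration failed:', error);
      });
}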

Research Questions and Problem Statement

As service workers are now finally implemented in all major browsers, we at the Google Web Developer Relations team were wondering: “How many PWAs are actually out there in the wild, and how do they make use of these new technologies?” Certain advanced APIs like Background Sync are currently still only available in Chromium-based browsers, so as an additional question we looked into “What features do these PWAs actually use—or, in the sense of progressive enhancement, try to use?” Our first idea was to check some of the curated PWA catalogues, for example, PWA.rocks, PWA Directory, Outweb, or PWA Stats. A first problem with such catalogues is that they suffer from what we call submission bias: anecdotal evidence shows that authors of PWAs want to be included in as many catalogues as possible, but oftentimes the listed examples are not very representative of the web and rather long-tail. For example, at the time of writing, the first listed PWA on PWA Directory is feuerwehr-eisolzried.de, a PWA on the "latest news, dates and more from [the] fire department in Eisolzried, Bavaria." Second, while PWA Stats offers tags, for example, on the use of notifications, not all PWA features are covered by its tagging system. In short, PWA catalogues are not well suited to answering our research questions.

The HTTP Archive to the Rescue

The HTTP Archive tracks how the web is built and provides historical data to quantitatively illustrate how the web is evolving. The archive’s crawlers process 500,000 URLs for both desktop and mobile twice a month. These URLs come from the most popular 500,000 sites in the Alexa Top 1,000,000 list and are mostly homepages that may or may not be representative of the rest of the site. The data in the HTTP Archive can be queried through BigQuery, where multiple tables are available in the httparchive project. As these tables tend to get fairly big, they are partitioned, but multiple associated tables can be queried using the wildcard symbol '*'. For our purposes, three families of tables are relevant, leading to three different approaches:

  • httparchive.lighthouse.*, which contains data about Lighthouse runs.
  • httparchive.pages.*, which contains the JSON-encoded parent documents’ HAR data.
  • httparchive.response_bodies.*, which contains the raw response bodies of all resources and sub-resources of all sites in the archive.

In the following, we will discuss all three approaches and their particular pros and cons, as well as present the extractable data and ideas for further research. All queries are also available on GitHub and are released under the terms of the Apache 2.0 license.

Warning: while BigQuery grants everyone a certain amount of free quota per month, on-demand pricing kicks in once the free quota is consumed. Currently, this is $5 per terabyte, and some of the queries shown below process 70+(!) terabytes. You can see the amount of data a query will process by clicking on the validator icon in the BigQuery UI.
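If in doubt, you can also estimate the cost of a query programmatically before running it. The sketch below uses the BigQuery Node.js client library’s dry run mode; the query string and table name are just examples, and the exact client API may differ between library versions:

// Sketch: estimate how many bytes a query would process without running it.
// Assumes the @google-cloud/bigquery client library and default credentials.
const {BigQuery} = require('@google-cloud/bigquery');

async function estimateBytes(query) {
  const bigquery = new BigQuery();
  // With dryRun set, BigQuery validates the query and returns statistics only.
  const [job] = await bigquery.createQueryJob({query, dryRun: true});
  const bytes = Number(job.metadata.statistics.totalBytesProcessed);
  console.log(`This query would process ${(bytes / 1e12).toFixed(2)} TB.`);
}

estimateBytes(
    'SELECT COUNT(*) FROM `httparchive.lighthouse.2018_09_01_mobile`');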

Approach 1: httparchive.lighthouse.* Tables

Description

Lighthouse is an automated open-source tool for improving the quality of web pages. One can run it against any web page, public or requiring authentication. It has audits for Performance, Accessibility, Progressive Web App, and more. The httparchive.lighthouse.* tables contain JSON dumps (example) of past reports that can be extracted via BigQuery.
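As a side note, one does not need the HTTP Archive to obtain a Lighthouse report for a single page. A rough sketch of running Lighthouse programmatically with a recent version (which reports category scores on a 0–1 scale rather than the 0–100 scale used in the archived v2 reports) might look like this; treat the exact package and option names as assumptions:

// Sketch: run Lighthouse programmatically against a single URL.
// Assumes the "lighthouse" and "chrome-launcher" npm packages.
const chromeLauncher = require('chrome-launcher');
const lighthouse = require('lighthouse');

async function getPwaScore(url) {
  const chrome = await chromeLauncher.launch({chromeFlags: ['--headless']});
  // Only run the Progressive Web App category to keep the audit fast.
  const result = await lighthouse(url, {port: chrome.port, onlyCategories: ['pwa']});
  await chrome.kill();
  // Recent Lighthouse versions score categories from 0 to 1.
  return result.lhr.categories.pwa.score;
}

getPwaScore('https://example.com/').then((score) => console.log(score));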

Cons

The biggest con is that the tables obviously only contain data for web pages that were ever run through the tool, so there is a blind spot. Additionally, while the latest versions of Lighthouse can process both mobile and desktop pages, the version currently used by the HTTP Archive only processes mobile pages, so there are no results for desktop. One pitfall when working with these tables is that in a past version of Lighthouse, Progressive Web App was the first category shown in the tool; in the current version the order was flipped so that Performance now comes first. The query needs to take this corner case into account.

Pros

On the positive side, Lighthouse has clear scoring guidelines based on the Baseline PWA Checklist for each version of the tool (v2, v3), so by requiring a minimum Progressive Web App score of ≥75 we can, to some extent, determine what PWA features we want to see included; namely, we can require offline capabilities and make sure the app can be added to the home screen.

Query and Results

Running the query below and then selecting distinct PWA URLs returns 799 unique PWA results that are known to work offline and to be installable to the user’s home screen.

#standardSQL
CREATE TEMPORARY FUNCTION
  getPWAScore(report STRING)
  RETURNS FLOAT64
  LANGUAGE js AS """
var $ = JSON.parse(report);
return $.reportCategories.find(i => i.name === 'Progressive Web App').score;
""";
CREATE TABLE IF NOT EXISTS
  `progressive_web_apps.lighthouse_pwas` AS
SELECT
  DISTINCT url AS pwa_url,
  IFNULL(rank,
    1000000) AS rank,
  date,
  platform,
  CAST(ROUND(score) AS INT64) AS lighthouse_pwa_score
FROM (
  SELECT
    REGEXP_REPLACE(JSON_EXTRACT(report,
        "$.url"), '"', "") AS url,
    getPWAScore(report) AS score,
    REGEXP_REPLACE(REGEXP_EXTRACT(_TABLE_SUFFIX, r"\d{4}(?:_\d{2}){2}"), "_", "-") AS date,
    REGEXP_EXTRACT(_TABLE_SUFFIX, r".*_(\w+)$") AS platform
  FROM
    `httparchive.lighthouse.*`
  WHERE
    report IS NOT NULL
    AND JSON_EXTRACT(report,
      "$.audits.service-worker.score") = 'true' )
LEFT JOIN (
  SELECT
    Alexa_rank AS rank,
    Alexa_domain AS domain
  FROM
    # Hard-coded due to https://github.com/HTTPArchive/bigquery/issues/42
    `httparchive.urls.20170315`
  WHERE
    Alexa_rank IS NOT NULL
    AND Alexa_domain IS NOT NULL ) AS urls
ON
  urls.domain = NET.REG_DOMAIN(url)
WHERE
  # Lighthouse "Good" threshold
  score >= 75
GROUP BY
  url,
  date,
  score,
  platform,
  rank
ORDER BY
  rank ASC,
  url,
  date DESC;
Research Ideas

An interesting analysis we can run based on this data is the development of the average Lighthouse PWA score and the number of PWAs over time (note that this naive approach does not take the growth of the HTTP Archive itself over the same period into account, but purely counts absolute numbers).

#standardSQL
SELECT
  date,
  count (DISTINCT pwa_url) AS total_pwas,
  round(AVG(lighthouse_pwa_score), 1) AS avg_lighthouse_pwa_score
FROM
  `progressive_web_apps.lighthouse_pwas`
GROUP BY
  date
ORDER BY
  date;

Approach 2: httparchive.pages.* Tables

Description

Another straightforward way of estimating the number of PWAs (albeit one that completely neglects Web App Manifests) is to look for so-called use counters in the httparchive.pages.* tables. Particularly interesting is the ServiceWorkerControlledPage use counter, which, according to Chrome engineer Matt Falkenhagen, “is counted whenever a page is controlled by a service worker, which typically happens only on subsequent loads.”

Cons

No qualitative attributes can be extracted other than the plain fact that a service worker controlled the loading of the page. More importantly, as the counter is typically triggered on subsequent loads only (and not on the first load that the crawler sees), this method undercounts: it only captures sites whose service workers claim their clients (self.clients.claim()) already on the first load.
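To make the distinction concrete, a service worker that wants to control the page already on its very first load might do something along these lines (a minimal sketch):

// Minimal service worker sketch: take control of uncontrolled clients
// immediately after activation instead of waiting for the next navigation.
self.addEventListener('install', (event) => {
  // Activate the new service worker as soon as it has finished installing.
  self.skipWaiting();
});

self.addEventListener('activate', (event) => {
  // Claim all open clients so that even the very first page load ends up
  // being controlled by the service worker.
  event.waitUntil(self.clients.claim());
});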

Pros

On the bright side, the precision is high due to the browser-level tracking, so we can be sure the page actually registered a service worker. The query also covers both desktop and mobile.

Query and Results

This approach, at the time of writing, turns up 5,368 unique results; however, as mentioned before, not all of these results necessarily qualify as PWAs, since a potentially missing Web App Manifest affects the installability of the app.

#standardSQL
CREATE TABLE IF NOT EXISTS
  `progressive_web_apps.usecounters_pwas` AS
SELECT
  DISTINCT REGEXP_REPLACE(url, "^http:", "https:") AS pwa_url,
  IFNULL(rank,
    1000000) AS rank,
  date,
  platform
FROM (
  SELECT
    DISTINCT url,
    REGEXP_REPLACE(REGEXP_EXTRACT(_TABLE_SUFFIX, r"\d{4}(?:_\d{2}){2}"), "_", "-") AS date,
    REGEXP_EXTRACT(_TABLE_SUFFIX, r".*_(\w+)$") AS platform
  FROM
    `httparchive.pages.*`
  WHERE
    # From https://cs.chromium.org/chromium/src/third_party/blink/public/platform/web_feature.mojom
    JSON_EXTRACT(payload,
      '$._blinkFeatureFirstUsed.Features.ServiceWorkerControlledPage') IS NOT NULL)
LEFT JOIN (
  SELECT
    Alexa_domain AS domain,
    Alexa_rank AS rank
  FROM
    # Hard-coded due to https://github.com/HTTPArchive/bigquery/issues/42
    `httparchive.urls.20170315` AS urls
  WHERE
    Alexa_rank IS NOT NULL
    AND Alexa_domain IS NOT NULL )
ON
  domain = NET.REG_DOMAIN(url)
ORDER BY
  rank ASC,
  date DESC,
  pwa_url;
Research Ideas

Similarly to the second query in Approach 1 above, we can also track the number of pages controlled by a service worker over time (the gap in the September 1, 2017 dataset is due to a parsing issue in the data collection pipeline).

#standardSQL
SELECT
  date,
  count (DISTINCT pwa_url) AS total_pwas
FROM
  `progressive_web_apps.usecounters_pwas`
GROUP BY
  date
ORDER BY
  date;

Approach 3: httparchive.response_bodies.* Tables

Description

A third, less obvious way to answer our research questions is to look at the actual response bodies. The httparchive.response_bodies.* tables contain the raw data of all resources and sub-resources of all sites in the archive, so we can use full-text search to find patterns that indicate the presence of PWA features: on the one hand, variations of the string navigator.serviceWorker.register(" provide a clue that the page might be registering a service worker; on the other hand, variations of <link rel="manifest" point to a potential Web App Manifest.

Cons

The downside of this approach is that we are trying to parse HTML with regular expressions to begin with, which is commonly known to be impossible and bad practice. One example of where things can go wrong is that we might detect commented-out code or struggle with incorrectly nested code.
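For example, a response body like the following constructed snippet would still match a regular expression looking for the registration call, even though no service worker is actually registered:

// Constructed example: the registration call only exists inside a comment,
// yet a substring or regex search over the response body still matches it.
// navigator.serviceWorker.register('/sw.js'); // TODO: re-enable after launch
console.log('No service worker is registered on this page.');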

Pros

Despite all challenges, as the service worker JavaScript files and the Web App Manifest JSON files are subresources of the page and therefore stored in the httparchive.response_bodies.* tables, we can still bravely attempt to examine their contents and try to gain an in-depth understanding of the PWAs’ capabilities. By checking the service worker JavaScript code for the events the service worker listens to, we can see if a PWA—at least in theory—deals with Web Push notifications, handles fetches, etc., and by looking at the Web App Manifest JSON document, we can see if the PWA specifies a start URL, provides a name, and so on.

Query and Results

We have split the analysis of service workers and Web App Manifests, and use a common helper table to extract PWA candidates from the large response body tables. As references to service worker script files and Web App Manifest JSON files may be relative or absolute, we need a User-Defined Function to resolve paths like ../../manifest.json relative to their base URL. Our function is a hacky simplification based on path.resolve([...paths]) in Node.js and not very elegant. We deliberately ignore references that would require executing JavaScript, for example, URLs like window.location.href + 'sw.js', so our regular expressions are a bit involved to make sure we exclude these cases.

PWA Candidates Helper Table
#standardSQL
CREATE TEMPORARY FUNCTION
  pathResolve(path1 STRING,
    path2 STRING)
  RETURNS STRING
  LANGUAGE js AS """
  function normalizeStringPosix(e,t){for(var n="",r=-1,i=0,l=void 0,o=!1,h=0;h<=e.length;++h){if(h<e.length)l=e.charCodeAt(h);else{if(l===SLASH)break;l=SLASH}if(l===SLASH){if(r===h-1||1===i);else if(r!==h-1&&2===i){if(n.length<2||!o||n.charCodeAt(n.length-1)!==DOT||n.charCodeAt(n.length-2)!==DOT)if(n.length>2){for(var g=n.length-1,a=g;a>=0&&n.charCodeAt(a)!==SLASH;--a);if(a!==g){n=-1===a?"":n.slice(0,a),r=h,i=0,o=!1;continue}}else if(2===n.length||1===n.length){n="",r=h,i=0,o=!1;continue}t&&(n.length>0?n+="/..":n="..",o=!0)}else{var f=e.slice(r+1,h);n.length>0?n+="/"+f:n=f,o=!1}r=h,i=0}else l===DOT&&-1!==i?++i:i=-1}return n}function resolvePath(){for(var e=[],t=0;t<arguments.length;t++)e[t]=arguments[t];for(var n="",r=!1,i=void 0,l=e.length-1;l>=-1&&!r;l--){var o=void 0;l>=0?o=e[l]:(void 0===i&&(i=getCWD()),o=i),0!==o.length&&(n=o+"/"+n,r=o.charCodeAt(0)===SLASH)}return n=normalizeStringPosix(n,!r),r?"/"+n:n.length>0?n:"."}var SLASH=47,DOT=46,getCWD=function(){return""};if(/^https?:/.test(path2)){return path2;}if(/^\//.test(path2)){return path1+path2.substr(1);}return resolvePath(path1, path2).replace(/^(https?:\/)/, '$1/');
""";
CREATE TABLE IF NOT EXISTS
  `progressive_web_apps.pwa_candidates` AS
SELECT
  DISTINCT REGEXP_REPLACE(page, "^http:", "https:") AS pwa_url,
  IFNULL(rank,
    1000000) AS rank,
  pathResolve(REGEXP_REPLACE(page, "^http:", "https:"),
    REGEXP_EXTRACT(body, r'''navigator\.serviceWorker\.register\s*\(\s*["']([^\),\s"']+)''')) AS sw_url,
  pathResolve(REGEXP_REPLACE(page, "^http:", "https:"),
    REGEXP_EXTRACT(REGEXP_EXTRACT(body, r'''(<link[^>]+rel=["']?manifest["']?[^>]+>)'''), r'''href=["']?([^\s"'>]+)["']?''')) AS manifest_url
FROM
  `httparchive.response_bodies.*`
LEFT JOIN (
  SELECT
    Alexa_domain AS domain,
    Alexa_rank AS rank
  FROM
    # Hard-coded due to https://github.com/HTTPArchive/bigquery/issues/42
    `httparchive.urls.20170315` AS urls
  WHERE
    Alexa_rank IS NOT NULL
    AND Alexa_domain IS NOT NULL )
ON
  domain = NET.REG_DOMAIN(page)
WHERE
  (REGEXP_EXTRACT(body, r'''navigator\.serviceWorker\.register\s*\(\s*["']([^\),\s"']+)''') IS NOT NULL
    AND REGEXP_EXTRACT(body, r'''navigator\.serviceWorker\.register\s*\(\s*["']([^\),\s"']+)''') != "/")
  AND (REGEXP_EXTRACT(REGEXP_EXTRACT(body, r'''(<link[^>]+rel=["']?manifest["']?[^>]+>)'''), r'''href=["']?([^\s"'>]+)["']?''') IS NOT NULL
    AND REGEXP_EXTRACT(REGEXP_EXTRACT(body, r'''(<link[^>]+rel=["']?manifest["']?[^>]+>)'''), r'''href=["']?([^\s"'>]+)["']?''') != "/")
ORDER BY
  rank ASC,
  pwa_url;
Web App Manifests Analysis

Based on this helper table, we can then run the analysis of the Web App Manifests. We check for the existence of properties defined in the WebAppManifest dictionary, combined with non-standard but well-known properties like "gcm_sender_id" from the deprecated Google Cloud Messaging service or "share_target" from the currently in-flux Web Share Target API. It turns out that not many manifests are in the archive: from 2,823 candidate manifest URLs in the helper table, we actually only find 30 unique Web App Manifests, and thus PWAs, in the response bodies, but these are at least archived in several versions.

#standardSQL
CREATE TABLE IF NOT EXISTS
  `progressive_web_apps.web_app_manifests` AS
SELECT
  pwa_url,
  rank,
  manifest_url,
  date,
  platform,
  REGEXP_CONTAINS(manifest_code,
    r'"dir"\s*:') AS dir_property,
  REGEXP_CONTAINS(manifest_code,
    r'"lang"\s*:') AS lang_property,
  REGEXP_CONTAINS(manifest_code,
    r'"name"\s*:') AS name_property,
  REGEXP_CONTAINS(manifest_code,
    r'"short_name"\s*:') AS short_name_property,
  REGEXP_CONTAINS(manifest_code,
    r'"description"\s*:') AS description_property,
  REGEXP_CONTAINS(manifest_code,
    r'"scope"\s*:') AS scope_property,
  REGEXP_CONTAINS(manifest_code,
    r'"icons"\s*:') AS icons_property,
  REGEXP_CONTAINS(manifest_code,
    r'"display"\s*:') AS display_property,
  REGEXP_CONTAINS(manifest_code,
    r'"orientation"\s*:') AS orientation_property,
  REGEXP_CONTAINS(manifest_code,
    r'"start_url"\s*:') AS start_url_property,
  REGEXP_CONTAINS(manifest_code,
    r'"serviceworker"\s*:') AS serviceworker_property,
  REGEXP_CONTAINS(manifest_code,
    r'"theme_color"\s*:') AS theme_color_property,
  REGEXP_CONTAINS(manifest_code,
    r'"related_applications"\s*:') AS related_applications_property,
  REGEXP_CONTAINS(manifest_code,
    r'"prefer_related_applications"\s*:') AS prefer_related_applications_property,
  REGEXP_CONTAINS(manifest_code,
    r'"background_color"\s*:') AS background_color_property,
  REGEXP_CONTAINS(manifest_code,
    r'"categories"\s*:') AS categories_property,
  REGEXP_CONTAINS(manifest_code,
    r'"screenshots"\s*:') AS screenshots_property,
  REGEXP_CONTAINS(manifest_code,
    r'"iarc_rating_id"\s*:') AS iarc_rating_id_property,
  REGEXP_CONTAINS(manifest_code,
    r'"gcm_sender_id"\s*:') AS gcm_sender_id_property,
  REGEXP_CONTAINS(manifest_code,
    r'"gcm_user_visible_only"\s*:') AS gcm_user_visible_only_property,
  REGEXP_CONTAINS(manifest_code,
    r'"share_target"\s*:') AS share_target_property,
  REGEXP_CONTAINS(manifest_code,
    r'"supports_share"\s*:') AS supports_share_property
FROM
  `progressive_web_apps.pwa_candidates`
JOIN (
  SELECT
    url,
    body AS manifest_code,
    REGEXP_REPLACE(REGEXP_EXTRACT(_TABLE_SUFFIX, r"\d{4}(?:_\d{2}){2}"), "_", "-") AS date,
    REGEXP_EXTRACT(_TABLE_SUFFIX, r".*_(\w+)$") AS platform
  FROM
    `httparchive.response_bodies.*`
  WHERE
    body IS NOT NULL
    AND body != ""
    AND url IN (
    SELECT
      DISTINCT manifest_url
    FROM
      `progressive_web_apps.pwa_candidates`) ) AS manifest_bodies
ON
  manifest_bodies.url = manifest_url
ORDER BY
  rank ASC,
  pwa_url,
  date DESC,
  platform,
  manifest_url;
Research Ideas

With this data at hand, we can extract all (well, not really all, but all known according to our query) PWAs that still use the deprecated Google Cloud Messaging service.

#standardSQL
SELECT
  DISTINCT pwa_url,
  manifest_url
FROM
  `progressive_web_apps.web_app_manifests`
WHERE
  gcm_sender_id_property;
Service Workers Analysis

Similarly to the analysis of Web App Manifests, the analysis of the various ServiceWorkerGlobalScope events is based on regular expressions. Events can be listened to using two JavaScript syntaxes: (i) the property syntax (e.g., self.oninstall = […]) or (ii) the event listener syntax (e.g., self.addEventListener('install', […])); both are illustrated in the short sketch below. As an additional data point, we extract potential uses of the increasingly popular library Workbox by looking for telling traces of various Workbox versions in the code. Running the query that follows the sketch, we obtain 1,151 unique service workers and thus PWAs.
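For illustration, the two syntaxes might look as follows inside a service worker (a minimal sketch; the handler bodies are placeholders):

// (i) Property syntax: assign a handler function to the corresponding "on" property.
self.oninstall = (event) => {
  // ... precache resources here ...
};

// (ii) Event listener syntax: register a handler via addEventListener.
self.addEventListener('fetch', (event) => {
  event.respondWith(fetch(event.request));
});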

#standardSQL
CREATE TABLE IF NOT EXISTS
  `progressive_web_apps.service_workers` AS
SELECT
  pwa_url,
  rank,
  sw_url,
  date,
  platform,
  REGEXP_CONTAINS(sw_code, r'''\.oninstall\s*=|addEventListener\(\s*["']install["']''') AS install_event,
  REGEXP_CONTAINS(sw_code, r'''\.onactivate\s*=|addEventListener\(\s*["']activate["']''') AS activate_event,
  REGEXP_CONTAINS(sw_code, r'''\.onfetch\s*=|addEventListener\(\s*["']fetch["']''') AS fetch_event,
  REGEXP_CONTAINS(sw_code, r'''\.onpush\s*=|addEventListener\(\s*["']push["']''') AS push_event,
  REGEXP_CONTAINS(sw_code, r'''\.onnotificationclick\s*=|addEventListener\(\s*["']notificationclick["']''') AS notificationclick_event,
  REGEXP_CONTAINS(sw_code, r'''\.onnotificationclose\s*=|addEventListener\(\s*["']notificationclose["']''') AS notificationclose_event,
  REGEXP_CONTAINS(sw_code, r'''\.onsync\s*=|addEventListener\(\s*["']sync["']''') AS sync_event,
  REGEXP_CONTAINS(sw_code, r'''\.oncanmakepayment\s*=|addEventListener\(\s*["']canmakepayment["']''') AS canmakepayment_event,
  REGEXP_CONTAINS(sw_code, r'''\.onpaymentrequest\s*=|addEventListener\(\s*["']paymentrequest["']''') AS paymentrequest_event,
  REGEXP_CONTAINS(sw_code, r'''\.onmessage\s*=|addEventListener\(\s*["']message["']''') AS message_event,
  REGEXP_CONTAINS(sw_code, r'''\.onmessageerror\s*=|addEventListener\(\s*["']messageerror["']''') AS messageerror_event,
  REGEXP_CONTAINS(sw_code, r"new Workbox|new workbox|workbox\.precaching\.|workbox\.strategies\.") AS uses_workboxjs
FROM
  `progressive_web_apps.pwa_candidates`
JOIN (
  SELECT
    url,
    body AS sw_code,
    REGEXP_REPLACE(REGEXP_EXTRACT(_TABLE_SUFFIX, r"\d{4}(?:_\d{2}){2}"), "_", "-") AS date,
    REGEXP_EXTRACT(_TABLE_SUFFIX, r".*_(\w+)$") AS platform
  FROM
    `httparchive.response_bodies.*`
  WHERE
    body IS NOT NULL
    AND body != ""
    AND url IN (
    SELECT
      DISTINCT sw_url
    FROM
      `progressive_web_apps.pwa_candidates`) ) AS sw_bodies
ON
  sw_bodies.url = sw_url
ORDER BY
  rank ASC,
  pwa_url,
  date DESC,
  platform,
  sw_url;
Research Ideas

Having detailed service worker data allows for interesting analyses. For example, we can use this data to track Workbox usage over time.

#standardSQL
SELECT
  date,
  count (uses_workboxjs) AS total_uses_workbox
FROM
  `progressive_web_apps.service_workers`
WHERE
  uses_workboxjs
  AND platform = 'mobile'
GROUP BY
  date
ORDER BY
  date;

Lines of code (LOC) is a great metric (not) to estimate a team’s productivity and to predict a task’s complexity. Let’s analyze the development of a given site’s service worker in terms of string length. Seems like the team deserves a raise…

#standardSQL
SELECT
  DISTINCT pwa_url,
  sw_url,
  date,
  CHAR_LENGTH(body) AS sw_length
FROM
  `progressive_web_apps.service_workers`
JOIN
  `httparchive.response_bodies.*`
ON
  sw_url = url
  AND date = REGEXP_REPLACE(REGEXP_EXTRACT(_TABLE_SUFFIX, r"\d{4}(?:_\d{2}){2}"), "_", "-")
  AND platform = REGEXP_EXTRACT(_TABLE_SUFFIX, r".*_(\w+)$")
WHERE
  # Redacted
  pwa_url = "https://example.com/"
  AND platform = "mobile"
ORDER BY
  date ASC;

A final idea is to examine service worker events over time and see if there are interesting developments. Something that stands out in the analysis is how the fetch event, as well as the message event, is increasingly being listened to. Both are indicators of more complex offline handling scenarios.

#standardSQL
SELECT
  date,
  COUNT(IF (install_event,
      TRUE,
      NULL)) AS install_events,
  COUNT(IF ( activate_event,
      TRUE,
      NULL)) AS activate_events,
  COUNT(IF ( fetch_event,
      TRUE,
      NULL)) AS fetch_events,
  COUNT(IF ( push_event,
      TRUE,
      NULL)) AS push_events,
  COUNT(IF ( notificationclick_event,
      TRUE,
      NULL)) AS notificationclick_events,
  COUNT(IF ( notificationclose_event,
      TRUE,
      NULL)) AS notificationclose_events,
  COUNT(IF ( sync_event,
      TRUE,
      NULL)) AS sync_events,
  COUNT(IF ( canmakepayment_event,
      TRUE,
      NULL)) AS canmakepayment_events,
  COUNT(IF ( paymentrequest_event,
      TRUE,
      NULL)) AS paymentrequest_events,
  COUNT(IF ( message_event,
      TRUE,
      NULL)) AS message_events,
  COUNT(IF ( messageerror_event,
      TRUE,
      NULL)) AS messageerror_events
FROM
  `progressive_web_apps.service_workers`
WHERE
  NOT uses_workboxjs
  AND date LIKE "2018-%"
GROUP BY
  date
ORDER BY
  date;

Meta Approach: Approaches 1–3 Combined

An interesting meta analysis is to combine all approaches to get a feeling for the overall landscape of PWAs in the HTTP Archive (with all the aforementioned pros and cons regarding precision and recall applied). If we run the query below, we find exactly 6,647 unique PWAs. They may not necessarily still be PWAs today; some of the formerly very prominent showcase PWAs are known to have regressed, and some sites were only very briefly experimenting with the technologies, but in the HTTP Archive we have evidence of the moment in history when all of these pages fulfilled at least one of our three approaches’ criteria for being counted as a PWA.

#standardSQL
SELECT
  DISTINCT pwa_url,
  rank
FROM (
  SELECT
    DISTINCT pwa_url,
    rank
  FROM
    `progressive_web_apps.lighthouse_pwas`
  UNION ALL
  SELECT
    DISTINCT pwa_url,
    rank
  FROM
    `progressive_web_apps.service_workers`
  UNION ALL
  SELECT
    DISTINCT pwa_url,
    rank
  FROM
    `progressive_web_apps.usecounters_pwas`)
ORDER BY
  rank ASC;

If we aggregate by dates and ignore some outliers, we can see linear growth in the total number of PWAs, with a slight decline at the end of our observation period that we will keep an eye on in future research.

#standardSQL
SELECT
  DISTINCT date,
  COUNT(pwa_url) AS pwas
FROM (
  SELECT
    DISTINCT date,
    pwa_url
  FROM
    `progressive_web_apps.lighthouse_pwas`
  UNION ALL
  SELECT
    DISTINCT date,
    pwa_url
  FROM
    `progressive_web_apps.service_workers`
  UNION ALL
  SELECT
    DISTINCT date,
    pwa_url
  FROM
    `progressive_web_apps.usecounters_pwas`)
GROUP BY
  date
ORDER BY
  date;

Future Work and Conclusions

In this document, we have presented three different approaches to extracting PWA data from the HTTP Archive. Each has its individual pros and cons, but especially Approach 3 has proven very interesting as a basis for further analyses. All presented queries are “evergreen” in the sense that they are not tied to a particular crawl’s tables, allowing for ongoing analyses in the future. Depending on people’s interest, we will see to what extent the data can be made generally available as part of the HTTP Archive’s public tables. There are likewise interesting research opportunities in combining our results with the Chrome User Experience Report, which is also accessible via BigQuery. In conclusion, the overall trends point in the right direction: more and more pages are controlled by a service worker, leading to PWAs with a generally increasing Lighthouse PWA score. Something to watch out for is the decline in PWAs observed in the meta approach, which, however, is not reflected in the most precise and neutral Approach 2, where rather the opposite is the case. We look forward to learning about new ways people make use of our research and to PWAs becoming more and more mainstream.

Acknowledgements

In no particular order we would like to thank Mathias Bynens for help with shaping one of the initial queries, Kenji Baheux for pointers that led to Approach 2, Rick Viscomi and Patrick Meenan for general HTTP Archive help and the video series, Jeff Posnick, Ade Oshineye, Ilya Grigorik, John Mueller, Cheney Tsai, Miguel Carlos Martínez Díaz, and Eric Bidelman for editorial comments, as well as Matt Falkenhagen and Matt Giuca for providing technical background on use counters.

Linux on a Mid 2007 iMac

Created on and categorized as Technical.
Written by Thomas Steiner.

We have a 24-inch Mid 2007 iMac in our living room that is clearly showing its age, but it still works well enough (albeit slowly) as a TV (SD channels on DVB-C via a USB stick), DVD player, and Web streaming station. The machine is stuck on macOS El Capitan (OS X version 10.11.6) and doesn't get updates anymore. I was wondering if putting Linux on it might help.

I ended up installing Xubuntu 18.04 (64 bit), mainly because it uses the lightweight Xfce window manager. For some reason USB boot never worked (the same USB pen booted on a MacBook Air just fine), so in 2018 I went and bought some DVD+RWs (probably the last time I will ever do that), and from disc it booted.

Pros

All hardware was immediately recognized and ready to go, including the (proprietary) Wi-Fi drivers, the multimedia keys on the iMac keyboard, the mouse wheel (I had to compile a kernel for that back in Debian 2.0 "Hamm" times), and even the USB TV stick (a Hauppauge 930C; I needed to copy its firmware file to /lib/firmware). The system overall boots a lot faster (it's still a spinning HDD and still only has 2 GB of RAM) and, once up, feels really snappy again (Firefox Quantum is a lot faster to start than Chrome Beta, though). Dual boot works fine; everyone recommends rEFInd, so I just went with that.

Cons

The system still doesn't decode HD DVB-C channels smoothly (it does now, see the update below), despite being at only ~25% CPU load (compared to 99% on macOS, where it doesn't work either). If there's someone from Linux TV reading along, I'd love to talk to you. I don't particularly like GNOME Software, which Xubuntu ships with instead of the Ubuntu Software Center, but I fixed this by installing the Synaptic Package Manager.

2018. The year of the Linux desktop :-)

Update

I learned about the VDR project yesterday, and after some fiddling now have HD DVB-C TV on Xubuntu, something that never even worked on macOS. The setup was super involved and old-school Linux, but thanks a ton to the Debian folks for the helpful instructions to install VDR.

AMP cache adding an external css style sheet?

Created on and categorized as Technical.
Written by Thomas Steiner.

I dug into the AMP Toolbox Optimizer a bit and realized it replaces the boilerplate CSS with an externally referenced file called v0.css and was like 🤔 hmmm, why is that? Here's a real-world example; check for the second request, v0.css. So I tried to understand what was going on; if you care, read on.

Introduction

Initially, the browser doesn't know what AMP components like <amp-something> are. It only knows after the AMP runtime (and each component's library) is loaded. The problem is that browsers are forgiving, so they just assume unknown tags are there in error and ignore them (while still rendering their contents):

<dif>

  <strong>Yeah!</strong>

</dif>

Note the error? I wrote <dif>, and thereby created an unknown tag. The browser will still display my Yeah!, and ignore everything else.

It gets worse when you add AMP, as some AMP components alter the box model of things. You can see the effect in this example file (view source):

The fake <amp-karussell> I have created simulates this issue. You can't visually differentiate the two side-by-side AMP images from my fake carousel.

Adding AMP

The AMP boilerplate essentially makes sure that for a grace period nothing is shown (white screen). You can simulate this by request-blocking the AMP runtime on a real AMP page: you will notice that after the grace period of 8 seconds is over, the layout is messed up, and the <div>s from the last example show up as block elements, as HTML's creators wanted them to appear.

Toolbox Optimizations

Now to the actual question: why the CSS file? It's not the boilerplate, but sort of an AMP CSS normalizer (more on that in the next section). With the AMP Toolbox, you can already apply the optimizations on your own site that the cache would otherwise apply on the CDN (a short usage sketch follows after the quote). I'm quoting directly from the documentation:

“In order to avoid Flash of Unstyled Content (FOUC) and reflows resulting from to the usage of web-components, AMP requires websites to add the amp-boilerplate in the header.

The amp-boilerplate renders the page invisible by changing it’s opacity, while the fonts and the AMP Runtime load. Once the AMP runtime is loaded, it is able to correctly set the sizes of the custom elements and once that happens, the runtimes makes the page visible again.

As a consequence, the first render of the page doesn’t happen until the AMP Runtime is loaded.

To improve this, AMP server-side rendering applies the same rules as the AMP Runtime on the server. This ensures that the reflow will not happen and the AMP boilerplate is no longer needed. The first render no longer depends on the AMP Runtime being loaded, which improves load times.

Caveats: it’s important to note that, even though the text content and layout will show faster, content that depends on the custom AMP elements (eg: any element in the page that starts with ’amp-’) will only be visible after the AMP Runtime is loaded.”
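For completeness, applying these optimizations with the AMP Toolbox Optimizer on your own server or at build time might look roughly like the sketch below; the package name and the transformHtml() call reflect my understanding of the current Node.js API and may differ between versions:

// Sketch: run the AMP Toolbox Optimizer over an AMP document at build time.
const AmpOptimizer = require('@ampproject/toolbox-optimizer');
const {readFileSync, writeFileSync} = require('fs');

async function optimize() {
  const optimizer = AmpOptimizer.create();
  const originalHtml = readFileSync('amp-page.html', 'utf8');
  // Server-side-renders the AMP layout so the boilerplate is no longer needed.
  const optimizedHtml = await optimizer.transformHtml(originalHtml);
  writeFileSync('amp-page.optimized.html', optimizedHtml);
}

optimize();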

Looking into the CSS File

So now what does the CSS file that I called AMP CSS Normalizer do? If we look at the beautified source code here, we can see this beauty:

.i-amphtml-layout-container,.i-amphtml-layout-fixed-height,.i-amphtml-layout-responsive,[layout=container],[layout=fixed-height][height],[layout=responsive][width][height]:not(.i-amphtml-layout-responsive),[width][height][sizes]:not(.i-amphtml-layout-responsive) {

  display: block;

  position: relative

}

The toolbox optimizes…

<amp-img width=360 height=200 layout=responsive src=image.png></amp-img>

…to…

<amp-img width="360" height="200" layout="responsive" src="image.png" class="i-amphtml-layout-responsive i-amphtml-layout-size-defined" i-amphtml-layout="responsive"></amp-img>

What this means is that even if the browser has no clue what an <amp-img> is, it will still display a responsive <amp-img> as a block element.

Styling actually (and maybe surprisingly) still works, even if the browser otherwise ignores the unknown tag:

<style>

  beer {

    display: block;

    border: solid red 1px;

  }

</style>

Wooohoo, <beer>beer!</beer> Cheers!

This makes sure that the <beer> tag does what I told it to do.

Hope this was helpful.

Service Worker Detector Chrome Extension Released

Created on and categorized as Technical.
Written by Thomas Steiner.

I've released a new Chrome extension today that detects Service Workers in modern websites.

Why would you want this? If you aren't into Web development, most probably you wouldn't. However, if you are into Web development, the extension helps you identify (unexpected) Service Worker registrations in the wild and lets you analyze their code and learn from them.

Why would you use this extension and not just the amazing Chrome Developer Tools? The answer is that the extension proactively detects Service Workers before you even have to open the Developer Tools (which you probably eventually will end up doing anyway).

World Wide Web Conference (WWW2016): Trip Report

Created on and categorized as Work.
Written by Thomas Steiner.

Last week, I attended the 25th International World Wide Web Conference (WWW2016) that took place from April 11 to 15, 2016 in Montréal, Canada. The main proceedings and the companion proceedings are both available online. Google was one of the gold sponsors and Google Director of Research Peter Norvig delivered one of the main keynotes. This is my trip report with personal highlights and observations.

Workshops, Day 1

I started the conference with the Making Sense of Microposts workshop, which began with an invited talk by Yahoo Research Scientist Mihajlo Grbovic on Leveraging Blogging Activity on Tumblr to Infer Demographics and Interests of Users for Advertising Purposes. As ground truth for their gender prediction, they used US Census data on popular baby names and reached a precision of 0.806 (recall 0.838) for female users and a precision of 0.794 (recall 0.689) for male users. I spent the rest of the day session-hopping between the Microposts workshop and the Computational Social Science for the Web tutorial.

Workshops, Day 2

My second day was fully dedicated to the Wiki Workshop that started with surprise guest and Wikipedia co-founder Jimmy Wales, which led to a short discussion of, among other topics, payment and reward models for authors on Wikipedia and Wikia.

The workshop had an interesting concept: invited talks filled the day, and the actual papers were presented at a poster session during lunch. I want to highlight the paper With a Little Help from my Neighbors: Person Name Linking Using the Wikipedia Social Network by J. Geiß and M. Gertz on named entity linking and disambiguation based on co-occurrence in Wikipedia pages, and the paper Finding Structure in Wikipedia Edit Activity: An Information Cascade Approach by R. Tinati et al. My own paper Wikipedia Tools for Google Spreadsheets introduces a Google Spreadsheets add-on that facilitates working with data from Wikipedia and Wikidata from within a spreadsheet context.

My invited talk at the workshop covered The Wiki(pedia|data) Edit Streams Firehose, which you can see visualized and audiolized in my Wikipedia Screensaver that I have developed for the talk and released as open source.

Main Conference, Day 1

The main conference started with a keynote by Sir Tim Berners-Lee, whose talk touched on the topic of mobile Web apps—which he prefers over native apps, because when [one goes] native, [one] become[s] part of a value chain—and on the need for Web apps to get closer to the capabilities of native apps (he did not mention Service Worker specifically, but it was somewhat clear from the context that he was aiming at this API).

After the keynote, I saw the presentation of a paper titled Immersive Recommendation: News and Event Recommendations Using Personal Digital Traces on an approach to leverage cross-platform user profiles for news and event recommendations. The authors' demo worked very well when I tested it with my YouTube and Twitter accounts.

Next, I learned how the team at YouTube deal with spammy comments by analyzing the temporal graph based on the engagement behavior pattern between users and videos from the paper presentation of In a World That Counts: Clustering and Detecting Fake Social Engagement at Scale.

An interesting idea to prevent online trackers from tracking personally identifiable information was shown in the paper Tracking the Trackers by the makers of the Web browser CLIQZ. Their approach leverages concepts from k-anonymity by—rather than working with fixed block lists—having users collectively identify unsafe tracking elements in the background that have the potential to uniquely identify individual users, and by then removing such information from tracking requests.

The paper Crowdsourcing Annotations for Websites' Privacy Policies: Can It Really Work? tackles the issue of lengthy and hard-to-read privacy policies and whether crowdsourcing their annotation can help. The authors come to the conclusion that, if carefully deployed, crowdsourcing can indeed result in the generation of non-trivial annotations and can also help identify elements of ambiguity in policies. A demo with annotated privacy policies shows some examples.

From the poster session, I especially liked Visual Positions of Links and Clicks on Wikipedia that looked at the visual positions of clicked links on Wikipedia based on the Wikipedia clickstream dataset and Travel the World: Analyzing and Predicting Booking Behavior using E-Mail Travel Receipts that examined more than 25 million travel receipts from Yahoo Mail users to predict their booking behavior.

Main Conference, Day 2

Day 2 started with a keynote by Mary Ellen Zurko, Principal Engineer at Cisco Systems, in which she provided a tour down memory lane through security from S-HTTP to Experimenting At Scale With Google Chrome's SSL Warning.

From the research track, I first want to highlight a Yahoo Labs Research paper on Predicting Pre-click Quality for Native Advertisements. Native ads are defined as a specific form of online advertising where ads replicate the look-and-feel of their serving platform. The authors introduce the notion of bad ads that have a high Offensive Feedback Rate (OFR), i.e., the ratio of the number of times an ad was rated offensive to the number of impressions. According to the paper, the OFR metric is more reliable than the commonly used click-through rate (CTR).

One of my favorite papers of the conference was Disinformation on the Web: Impact, Characteristics, and Detection of Wikipedia Hoaxes (lighter reading: slides, project homepage) that aims at identifying hoaxes on Wikipedia, i.e., deliberately fabricated falsehood made to masquerade as truth. Some famous hoaxes survived for more than nine years and were widely cited in the media.

I continued with the presentation of our Industry Track paper From Freebase to Wikidata: The Great Migration, in which we describe our ongoing data transfer project for migrating the (now shut-down) structured knowledge base Freebase to Wikidata. We further report on the data mapping challenges, provide an analysis of the progress so far, and also describe the Primary Sources Tool that aims to facilitate this—and future—data migrations. The tool has been released as open source.

For me, the day ended with an interesting paper on The QWERTY Effect on the Web—How Typing Shapes the Meaning of Words in Online Human-Computer Interaction. I had never heard of the QWERTY effect before, but it is based on the hypothesis that on average words typed with more letters from the right side of the keyboard are more positive in meaning than words typed with more letters from the left. According to the paper, there is some evidence that this hypothesis also holds true for the Web.

Main Conference, Day 3

In the paper Tell Me About Yourself: The Malicious CAPTCHA Attack, the authors show how fake CAPTCHAs (Completely Automated Public Turing tests to tell Computers and Humans Apart) can be used to trick users into unwillingly disclosing private information, such as one's Facebook name. The attack pages embed (social widget) iframes that display this private data, which the pages themselves cannot read due to the Same Origin Policy, disguise the many iframes with CSS as a CAPTCHA, and have users "solve" it, thereby leaking the information.

Google runs a service called Safe Browsing that alerts users when websites get compromised. In the paper Remedying Web Hijacking: Notification Effectiveness and Webmaster Comprehension, the authors provide a study that captures the life cycle of 760,935 hijacking incidents from July 2014 to June 2015, as identified by Google Safe Browsing and Search Quality. They observe that direct communication with webmasters increases the likelihood of cleanup by over 50% and reduces infection lengths by at least 62%.

Another paper on Wikipedia looked at Growing Wikipedia Across Languages via Recommendation by detecting missing articles, ranking them by local importance, and finally contacting potential Wikipedia editors via email to suggest that they write the article in question. The authors have deployed the Wikipedia GapFinder, which shows the approach in practice.

Other Observations

The Social Media Research Foundation provides a NodeXL-based visualization of the network of tweets that used the #WWW2016 hashtag, including all my #WWW2016 tweets.

One thing I noticed at the conference is that we (and I fully include myself here) from time to time still tend to unconsciously use stereotyped, gendered language where it is inadequate in the general case ("so easy my mom or grandma could use it", "to pass the 'mom test'", etc.). I called this out in a tweet. You may want to follow the interesting conversation it has started on Twitter or Facebook (if you are friends with me). This tweet led Christopher Gutteridge to create the imaginary naive Web user Rube.

Oh, and in the old days, there used to be more bananas… Next conference!