How to extract domains in AWS Athena / Presto SQL
When you work with URLs, you might want to extract domains (for example, to match web visits to a list of domains that signify a certain type of browsing behaviour). By “domain”, people usually mean the second-level domain (SLD) plus the top-level domain (TLD). For example, in www.example.com, the SLD is example, which is the part that makes a website “memorable”; the TLD is com; and the domain is thus example.com. Or consider www.sub.example.co.uk: the SLD is example, the TLD is co.uk, and hence the domain is example.co.uk.
The latter case also illustrates why we cannot simply extract the domain with a naïve regular expression: TLDs have a varying number of parts. For example, both com and co.uk are TLDs, and so is api.gov.uk, and, believe it or not, chirurgiens-dentistes.fr. Put simply, extracting domains properly requires a list of all TLDs (or “suffixes”), as explained here. Fortunately, the Public Suffix List, maintained by the Mozilla Foundation, is just that list.
In many programming languages, there are packages based on this list (tldextract in R and tld in Python), but there is no equivalent for Presto SQL, which is what AWS Athena works with. I assume the problem exists for other SQL flavours, too. So I developed my own solution, heavily based on the logic of R’s tldextract.
The first step is to get the actual list of TLDs in order to import it as a table into AWS Athena. (Note that the list of TLDs updates periodically.) Doing this in R:
# get list of TLDs; this code is from the tldextract package
library(httr)
raw_results <- GET("https://publicsuffix.org/list/effective_tld_names.dat")
parsed_results <- unlist(strsplit(content(raw_results, as = "text"), "\n+"))
# drop comment lines and normalise the encoding
tldnames <- parsed_results[grep(pattern = "^//", x = parsed_results, invert = TRUE)]
tldnames <- iconv(tldnames, to = "UTF-8")
# separate wildcard entries (e.g. *.ck) from static ones
wilds <- grepl("^\\*", tldnames)
wildcard <- sub("\\*\\.", "", tldnames[wilds])
static <- tldnames[!wilds]
# export a csv with TLDs
static_df <- data.frame(tld = static, type = "static")
wilds_df <- data.frame(tld = wildcard, type = "wildcard")
tld_df <- rbind(static_df, wilds_df)
write.csv(tld_df, paste0(path, "/tld_df.csv"), row.names = FALSE)
I import this file as domains_tld into Athena.
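One way to register the exported CSV is as an external table over its S3 location. A minimal sketch, assuming tld_df.csv was uploaded to a bucket of your choice (the LOCATION is a placeholder, and the OpenCSVSerde settings are assumptions you may need to adjust to your export):
CREATE EXTERNAL TABLE domains_tld (
tld string,
type string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '"')
LOCATION 's3://your-bucket/tld/'
TBLPROPERTIES ('skip.header.line.count' = '1')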
To illustrate how to parse domains out of a data set of URLs, let’s first make some toy data in SQL, immediately extracting the host with Presto’s inbuilt function:
CREATE TABLE dat AS
SELECT url, url_extract_host(url) AS url_host FROM
(WITH temp AS (
SELECT * FROM (VALUES
('https://subsub.sub.domain.co.uk/page?q=1234#abcd'),
('https://domain.api.gov.uk/page?q=1234#abcd'),
('https://domain.co.uk/page?q=1234#abcd'),
('https://www.domain.com/page?q=1234#abcd'),
('https://domain.com/page?q=1234#abcd')) AS tab (url))
SELECT url FROM temp)
Now, the algorithm of the tldextract package is the following: (1) Split up the URL host at every dot into “parts”: for example, domain.co.uk is split up into domain, co and uk. (2) Put the parts together incrementally, starting from the last part: in the example, this gives us uk, co.uk and domain.co.uk. (3) Match the resulting combinations with TLDs from the Public Suffix List.
To do (1) in SQL, we first need to know the maximum number of parts in our data. In the toy data, subsub.sub.domain.co.uk has 5 parts. We therefore need five new variables:
CREATE TABLE dat2 AS
SELECT url_host,
SPLIT(url_host, '.')[1] AS part_1,
TRY(SPLIT(url_host, '.')[2]) AS part_2,
TRY(SPLIT(url_host, '.')[3]) AS part_3,
TRY(SPLIT(url_host, '.')[4]) AS part_4,
TRY(SPLIT(url_host, '.')[5]) AS part_5
FROM dat
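Two quick notes on this step. First, the maximum number of parts in your own data need not be five; Presto can count it for you (a quick check against the dat table from above):
SELECT MAX(CARDINALITY(SPLIT(url_host, '.'))) AS max_parts
FROM dat
Second, the TRY wrappers are needed because subscripting past the end of an array fails in Presto; element_at(SPLIT(url_host, '.'), 2) is an equivalent alternative that simply returns NULL for an out-of-range index.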
For (2), we just concatenate the parts, starting from the last. Note that the resulting combinations do not need to integrate the first part, because that one can never be part of the TLD.
CREATE TABLE dat3 AS
SELECT url_host, part_1, part_2, part_3, part_4, part_5,
CASE
WHEN part_5 IS NOT NULL THEN part_5
WHEN part_5 IS NULL AND part_4 IS NOT NULL THEN part_4
WHEN part_4 IS NULL AND part_3 IS NOT NULL THEN part_3
WHEN part_3 IS NULL AND part_2 IS NOT NULL THEN part_2
END AS last,
CASE
WHEN part_5 IS NOT NULL THEN concat(part_4, '.', part_5)
WHEN part_5 IS NULL AND part_4 IS NOT NULL THEN concat(part_3, '.', part_4)
WHEN part_4 IS NULL AND part_3 IS NOT NULL THEN concat(part_2, '.', part_3)
END AS two_last,
CASE
WHEN part_5 IS NOT NULL THEN concat(part_3, '.', part_4, '.', part_5)
WHEN part_5 IS NULL AND part_4 IS NOT NULL THEN concat(part_2, '.', part_3, '.', part_4)
END AS three_last,
CASE
WHEN part_5 IS NOT NULL THEN concat(part_2, '.', part_3, '.', part_4, '.', part_5)
END AS four_last
FROM dat2
Now we bring this table into “long” format so that each original row (i.e., host) is repeated as many times as there are combinations for that row.
CREATE TABLE dat4 AS
SELECT * FROM
(WITH temp AS (
SELECT url_host, array[last, two_last, three_last, four_last] last_combs
FROM dat3)
SELECT url_host, pattern FROM temp
CROSS JOIN UNNEST(last_combs) as t(pattern))
WHERE pattern IS NOT NULL
Step (3): We match the combinations to the list of TLDs. If there are multiple matches per host, we only take the longest TLD. For example, when a host matches both uk and co.uk, co.uk has to be the TLD because there cannot be a domain called “co.uk”.
CREATE TABLE dat5 AS
SELECT url_host, max_by(tld, tld_length) tld FROM
(SELECT url_host, pattern, d.tld, length(d.tld) tld_length
FROM dat4 h
LEFT JOIN domains_tld d ON h.pattern = d.tld
WHERE d.tld IS NOT NULL)
GROUP BY url_host
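(A small aside: the WHERE d.tld IS NOT NULL filter effectively turns the LEFT JOIN into an inner join, so JOIN domains_tld d ON h.pattern = d.tld would behave the same here.)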
Finally, we extract the domain out of the host! (Note that the dots in the TLD are escaped so they match literally, and the character class allows for hyphens, which may occur in the SLD.)
SELECT url_host, tld,
regexp_extract(url_host, concat('[\p{Alnum}-]+\.', replace(tld, '.', '\.'))) AS domain
FROM dat5
All of this seems a bit tedious, but is essentially what tldextract and similar packages do. (Note that I have not taken care of wildcard TLDs — see the Public Suffix List for details.)
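Incidentally, if the five hardcoded part columns feel too tedious, Presto’s array functions can generate the suffix combinations for any number of parts in one pass. A compact sketch of steps (1)–(3), assuming the same dat and domains_tld tables as above and hosts with at least two parts:
SELECT url_host, max_by(pattern, length(pattern)) AS tld
FROM (
-- for each host, join the last k parts with dots, for k = 1 .. n - 1
-- (part_1 is never part of the TLD, so k stops at n - 1)
SELECT url_host, array_join(slice(p, -k, k), '.') AS pattern
FROM (
-- split each host into its parts, skipping single-label hosts
SELECT url_host, SPLIT(url_host, '.') AS p
FROM dat
WHERE CARDINALITY(SPLIT(url_host, '.')) > 1
) parts
CROSS JOIN UNNEST(sequence(1, CARDINALITY(p) - 1)) AS t(k)
) combs
JOIN domains_tld d ON combs.pattern = d.tld
GROUP BY url_host
The logic is identical to the step-by-step version: slice(p, -k, k) takes the last k parts, array_join glues them back together, and max_by keeps the longest suffix that appears in domains_tld.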