How to extract domains in AWS Athena / Presto SQL

Bernhard Clemm
4 min read · Nov 9, 2022

When you work with URLs, you might want to extract domains (for example, to match web visits to a list of domains that signify a certain type of browsing behaviour). By “domain”, people usually mean the second-level domain (SLD) plus the top-level domain (TLD). For example, in example.com, the SLD is example, which is the part that makes a website “memorable”; the TLD is com; and the domain is thus example.com. Or consider example.co.uk: the SLD is example, the TLD is co.uk, and hence the domain is example.co.uk.

The latter case also illustrates why we cannot simply extract the domain with a naïve regular expression: TLDs have a varying number of parts. For example, both com and co.uk are TLDs, and so is gov.uk, and, believe it or not, so is pvt.k12.ma.us. Put simply, extracting domains properly must rely on a list of all TLDs (or “suffixes”), as explained here. Fortunately, the Public Suffix List, maintained by the Mozilla Foundation, is just that list.
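To see why a fixed rule fails, here is a quick illustration (a Python sketch with made-up hosts) of the common “take the last two labels” heuristic:

```python
# Naive heuristic: take the last two dot-separated labels as the domain.
# Works for one-part TLDs like "com", breaks for multi-part ones like "co.uk".
def naive_domain(host):
    return ".".join(host.split(".")[-2:])

print(naive_domain("www.example.com"))    # example.com -- correct
print(naive_domain("www.example.co.uk"))  # co.uk -- wrong: the domain is example.co.uk
```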

In many programming languages, there are packages based on this list (tldextract in R and tld in Python), but there is no equivalent for Presto SQL, which is what AWS Athena works with. I assume the problem exists for other SQL flavours, too. So I developed my own solution, heavily based on the logic of R’s tldextract.

The first step is to get the actual list of TLDs in order to import it as a table into AWS Athena. (Note that the list of TLDs updates periodically.) Doing this in R:

# Get the list of TLDs; this code is adapted from the tldextract package
library(httr)
raw_results <- GET("https://publicsuffix.org/list/public_suffix_list.dat")
parsed_results <- unlist(strsplit(content(raw_results, as = "text"), "\n+"))
# Drop comment lines (starting with "//")
tldnames <- parsed_results[grep(pattern = "^//", x = parsed_results, invert = TRUE)]
tldnames <- iconv(tldnames, to = "UTF-8")
# Separate wildcard entries (e.g. *.ck) from static ones
wilds <- grepl("^\\*", tldnames)
wildcard <- sub("\\*\\.", "", tldnames[wilds])
static <- tldnames[!wilds]
# Export a CSV with TLDs
static_df <- data.frame(tld = static, type = "static")
wilds_df <- data.frame(tld = wildcard, type = "wildcard")
tld_df <- rbind(static_df, wilds_df)
write.csv(tld_df, paste0(path, "/tld_df.csv"), row.names = FALSE)
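The same parsing logic can be sketched outside R. Here is a minimal Python version that, instead of downloading the list, runs on a small inline excerpt of the Public Suffix List; the snippet contents are illustrative (com, uk, co.uk and the wildcard *.ck are real entries):

```python
# Parse an inline excerpt of the Public Suffix List into (tld, type) rows,
# mirroring the R code: drop comments, split wildcard from static entries.
psl_snippet = """\
// ===BEGIN ICANN DOMAINS===
com
uk
co.uk
*.ck
"""

rows = []
for line in psl_snippet.splitlines():
    line = line.strip()
    if not line or line.startswith("//"):  # skip blanks and comment lines
        continue
    if line.startswith("*."):              # wildcard entry, e.g. *.ck
        rows.append((line[2:], "wildcard"))
    else:
        rows.append((line, "static"))

print(rows)
# [('com', 'static'), ('uk', 'static'), ('co.uk', 'static'), ('ck', 'wildcard')]
```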

I import this file as domains_tld into Athena. To illustrate how to parse domains out of a data set of URLs, let’s first make some toy data in SQL, immediately extracting the host with Presto’s inbuilt function:

-- Toy data (the URLs here are illustrative)
WITH temp AS (
SELECT url FROM (VALUES
('https://www.example.com/page'),
('https://a.b.example.co.uk/page')) AS tab (url))
SELECT url, url_extract_host(url) AS url_host
FROM temp

Now, the algorithm of the tldextract package is the following: (1) Split up the URL host at every dot into “parts”: for example, domain.co.uk is split up into domain, co and uk. (2) Put the parts together incrementally, starting from the last part: in the example, this gives us uk, co.uk and domain.co.uk. (3) Match the resulting combinations with TLDs from the Public Suffix List.
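The three steps can be sketched compactly outside SQL. Here is a minimal Python version, assuming the suffix list has already been loaded into a set (the tiny `suffixes` set stands in for the full Public Suffix List):

```python
def extract_tld(host, suffixes):
    """Find the longest trailing label combination that is a known suffix."""
    parts = host.split(".")                                           # (1) split at every dot
    candidates = [".".join(parts[i:]) for i in range(1, len(parts))]  # (2) trailing combinations
    matches = [c for c in candidates if c in suffixes]                # (3) match against the list
    return max(matches, key=len) if matches else None

suffixes = {"com", "uk", "co.uk"}  # tiny stand-in for the full list
print(extract_tld("www.domain.co.uk", suffixes))  # co.uk
print(extract_tld("example.com", suffixes))       # com
```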

To do (1) in SQL, we first need to know the maximum number of parts in our data. In the toy data, a host such as a.b.example.co.uk has five parts. We therefore need five new variables:

SELECT url_host,
SPLIT(url_host, '.')[1] AS part_1,
TRY(SPLIT(url_host, '.')[2]) AS part_2,
TRY(SPLIT(url_host, '.')[3]) AS part_3,
TRY(SPLIT(url_host, '.')[4]) AS part_4,
TRY(SPLIT(url_host, '.')[5]) AS part_5
FROM dat
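In Python terms, the TRY(...) around the out-of-range indexes corresponds to padding missing parts with None; a small sketch, assuming hosts of at most five parts:

```python
MAX_PARTS = 5  # maximum number of dot-separated parts in the data

def split_host(url_host):
    # Split at every dot and pad with None, like TRY(SPLIT(...)[i])
    # returning NULL for out-of-range indexes.
    parts = url_host.split(".")
    return parts + [None] * (MAX_PARTS - len(parts))

print(split_host("domain.co.uk"))  # ['domain', 'co', 'uk', None, None]
```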

For (2), we just concatenate the parts, starting from the last. Note that the resulting combinations do not need to integrate the first part, because that one can never be part of the TLD.

SELECT url_host, part_1, part_2, part_3, part_4, part_5,
CASE
WHEN part_5 IS NOT NULL THEN part_5
WHEN part_5 IS NULL AND part_4 IS NOT NULL THEN part_4
WHEN part_4 IS NULL AND part_3 IS NOT NULL THEN part_3
WHEN part_3 IS NULL AND part_2 IS NOT NULL THEN part_2
END AS last,
CASE
WHEN part_5 IS NOT NULL THEN concat(part_4, '.', part_5)
WHEN part_5 IS NULL AND part_4 IS NOT NULL THEN concat(part_3, '.', part_4)
WHEN part_4 IS NULL AND part_3 IS NOT NULL THEN concat(part_2, '.', part_3)
END AS two_last,
CASE
WHEN part_5 IS NOT NULL THEN concat(part_3, '.', part_4, '.', part_5)
WHEN part_5 IS NULL AND part_4 IS NOT NULL THEN concat(part_2, '.', part_3, '.', part_4)
END AS three_last,
CASE
WHEN part_5 IS NOT NULL THEN concat(part_2, '.', part_3, '.', part_4, '.', part_5)
END AS four_last
FROM dat2

Now we bring this table into “long” format so that each original row (i.e., host) is repeated as many times as there are combinations for that row.

WITH temp AS (
SELECT url_host, array[last, two_last, three_last, four_last] AS last_combs
FROM dat3)
SELECT url_host, pattern FROM temp
CROSS JOIN UNNEST(last_combs) AS t(pattern)
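CROSS JOIN UNNEST is the SQL way of “exploding” an array column; in Python terms it is a nested loop that repeats the host once per candidate combination (here skipping the NULL entries for brevity):

```python
# One row per host, carrying its array of candidate suffix combinations;
# four_last is None because the example host has only four parts.
rows = [("www.domain.co.uk", ["uk", "co.uk", "domain.co.uk", None])]

# "Unnest": repeat the host for each non-null candidate.
long_format = [(host, pattern)
               for host, combs in rows
               for pattern in combs
               if pattern is not None]
print(long_format)
# [('www.domain.co.uk', 'uk'), ('www.domain.co.uk', 'co.uk'), ('www.domain.co.uk', 'domain.co.uk')]
```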

Step (3): We match the combinations to the list of TLDs. If there are multiple matches per host, we only take the longest TLD. For example, when a host matches both uk and co.uk, co.uk has to be the TLD, because there cannot be a domain called “co.uk”.

SELECT url_host, max_by(tld, tld_length) AS tld FROM
(SELECT url_host, pattern, d.tld, length(d.tld) AS tld_length
FROM dat4 h
LEFT JOIN domains_tld d ON h.pattern = d.tld)
GROUP BY url_host
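Presto’s max_by(tld, tld_length) picks, per group, the tld value with the greatest length. The equivalent group-and-take-longest logic, sketched in Python with made-up matches:

```python
from collections import defaultdict

# (url_host, matched_tld) pairs, as produced by the join above
matches = [
    ("www.domain.co.uk", "uk"),
    ("www.domain.co.uk", "co.uk"),
    ("www.example.com", "com"),
]

# GROUP BY url_host, then max_by(tld, length(tld)) within each group
by_host = defaultdict(list)
for host, tld in matches:
    by_host[host].append(tld)
longest = {host: max(tlds, key=len) for host, tlds in by_host.items()}
print(longest)
# {'www.domain.co.uk': 'co.uk', 'www.example.com': 'com'}
```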

Finally, we extract the domain out of the host!

SELECT url_host, tld,
regexp_extract(url_host, concat('\p{Alnum}+\.', tld)) AS domain
FROM dat5
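The same extraction can be checked with Python’s re module. This sketch mirrors the regexp_extract call (one label, a dot, then the TLD found above), with two small additions of my own: hyphens are allowed in the label, and the pattern is anchored at the end of the host:

```python
import re

def extract_domain(url_host, tld):
    # One label (letters, digits, hyphens), a dot, then the known TLD,
    # anchored at the end of the host.
    match = re.search(r"[A-Za-z0-9-]+\." + re.escape(tld) + r"$", url_host)
    return match.group(0) if match else None

print(extract_domain("www.domain.co.uk", "co.uk"))  # domain.co.uk
print(extract_domain("www.example.com", "com"))     # example.com
```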

All of this seems a bit tedious, but it is essentially what tldextract and similar packages do. (Note that I have not taken care of wildcard TLDs; see the Public Suffix List for details.)