Interface to the libcurl URL parser.
URLs are automatically normalized where possible, such as in the case of
relative paths or url-encoded queries (see examples).
When parsing hyperlinks from an HTML document, it is possible to set baseurl
to the location of the document itself so that relative links can be resolved.
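As a rough sketch of the hyperlink use case described above: the document location and the hrefs below are made-up placeholders, and resolve_link is a hypothetical helper, not part of the package. It only relies on curl_parse_url() and its baseurl argument.

```r
library(curl)

# Hypothetical document location and hrefs, as they might be scraped from its HTML
doc_url <- "https://example.org/docs/index.html"
hrefs <- c("intro.html", "../about", "/contact", "https://other.example/page")

# Hypothetical helper: resolve one link against the document location
resolve_link <- function(href, base) curl_parse_url(href, baseurl = base)$url

# Apply it to every href; absolute links should pass through unchanged
resolved <- vapply(hrefs, resolve_link, character(1), base = doc_url, USE.NAMES = FALSE)
print(resolved)
```

Using vapply() with character(1) keeps the result a plain character vector and fails early if any link does not resolve to a single string.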
curl_parse_url(url, baseurl = NULL, decode = TRUE, params = TRUE)
url: a character string of length one
baseurl: use this as the parent if url may be a relative path
decode: automatically url-decode output. Set to FALSE to get output in url-encoded format.
params: parse individual parameters assuming query is in application/x-www-form-urlencoded format.
A valid URL contains at least a scheme and a host; other pieces are optional. If either is missing, the parser raises an error. Otherwise it returns a list with the following elements:
url: the normalized input URL
scheme: the protocol part before the :// (required)
host: name of host without port (required)
port: decimal between 0 and 65535
path: normalized path up to the ? of the url
query: search query, the part between the ? and # of the url. Use params below to get individual parameters from the query.
fragment: the hash part after the # of the url
user: authentication username
password: authentication password
params: named vector with parameters from query if set
Each element above is either a string or NULL, except for params, which is always a character vector with length equal to the number of parameters.
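A minimal sketch of how the return value might be handled in practice, assuming the error behaviour described above. The helper try_parse_url and the example URLs are hypothetical and only illustrate catching the parser error and reading individual fields.

```r
library(curl)

# Hypothetical helper: return the parsed list, or NULL if the input is not a valid URL
try_parse_url <- function(x, ...) {
  tryCatch(curl_parse_url(x, ...), error = function(e) NULL)
}

ok <- try_parse_url("https://example.org/search?q=curl&page=2")
bad <- try_parse_url("/no/scheme/or/host")  # relative path without a baseurl

is.null(bad)          # the parser error was caught and turned into NULL
ok$params[["page"]]   # look up a single query parameter by name
is.null(ok$fragment)  # pieces absent from the URL come back as NULL
```

Returning NULL on failure is just one possible convention; callers that need the error message could return the condition object instead.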
Note that the params field is only usable if the query is in the usual application/x-www-form-urlencoded format, which is technically not part of the RFC. Some services may use e.g. a json blob as the query, in which case the parsed params field here can be ignored. There is no way for the parser to automatically infer or validate the query format; this is up to the caller.
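To illustrate the json-blob case, here is a small sketch with a made-up endpoint whose query is a url-encoded JSON object. It assumes the caller already knows the query format, skips parameter parsing with params = FALSE, and decodes the raw query with curl_unescape() from the same package.

```r
library(curl)

# Hypothetical endpoint: the query is a url-encoded JSON object, not key=value pairs
u <- "https://example.org/api?%7B%22a%22%3A1%7D"

# Skip parameter parsing and keep the query in its url-encoded form
res <- curl_parse_url(u, decode = FALSE, params = FALSE)

# Decode the raw query ourselves; the result is JSON text for a JSON parser of choice
json_text <- curl_unescape(res$query)
print(json_text)
```

The decoded text can then be handed to whichever JSON parser the caller prefers.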
For more details on the URL format see RFC 3986 or the steps explained in the WHATWG basic URL parser.
On platforms that do not have a recent enough curl version (basically only RHEL-8) the Ada URL library is used as a fallback. Results should be identical, though curl has nicer error messages. This is a temporary solution; we plan to remove the fallback when old systems are no longer supported.
url <- "https://jerry:secret@google.com:888/foo/bar?test=123#bla"
curl_parse_url(url)
#> $url
#> [1] "https://jerry:secret@google.com:888/foo/bar?test=123#bla"
#>
#> $scheme
#> [1] "https"
#>
#> $host
#> [1] "google.com"
#>
#> $port
#> [1] "888"
#>
#> $path
#> [1] "/foo/bar"
#>
#> $query
#> [1] "test=123"
#>
#> $fragment
#> [1] "bla"
#>
#> $user
#> [1] "jerry"
#>
#> $password
#> [1] "secret"
#>
#> $params
#> test
#> "123"
#>
# Resolve relative links from a baseurl
curl_parse_url("/somelink", baseurl = url)
#> $url
#> [1] "https://jerry:secret@google.com:888/somelink"
#>
#> $scheme
#> [1] "https"
#>
#> $host
#> [1] "google.com"
#>
#> $port
#> [1] "888"
#>
#> $path
#> [1] "/somelink"
#>
#> $query
#> NULL
#>
#> $fragment
#> NULL
#>
#> $user
#> [1] "jerry"
#>
#> $password
#> [1] "secret"
#>
#> $params
#> character(0)
#>
# Paths get normalized
curl_parse_url("https://foobar.com/foo/bar/../baz/../yolo")$url
#> [1] "https://foobar.com/foo/yolo"
# Also normalizes URL-encoding (these URLs are equivalent):
url1 <- "https://ja.wikipedia.org/wiki/\u5bff\u53f8"
url2 <- "https://ja.wikipedia.org/wiki/%e5%af%bf%e5%8f%b8"
curl_parse_url(url1)$path
#> [1] "/wiki/寿司"
curl_parse_url(url2)$path
#> [1] "/wiki/寿司"
curl_parse_url(url1, decode = FALSE)$path
#> [1] "/wiki/%e5%af%bf%e5%8f%b8"