vignettes/rmd/anytime-intro.Rmd
Abstract
The package provides functions which convert from a number of different input variable types (integer, numeric, character, factor) and from different input formats, which are tried heuristically. The result is a powerful and versatile date and time converter that (generally) requires no user input and operates autonomously.
R excels at computing with dates and times. Using a typed representation for your data is highly recommended, not only because of the functionality offered but also because of the added safety stemming from proper representation.
But there is a small nuisance cost in interactive work as well as in programming. Users must have told as.POSIXct() about a million times that the origin is (of course) the epoch. Do we really have to say it a million more times? Similarly, when parsing dates that are some variant of the common YYYYMMDD format, do we really have to manually convert from integer or numeric or factor or ordered to character? Having one of several common separators and/or date formats (YYYY-MM-DD, YYYY/MM/DD, YYYYMMDD, YYYY-mon-DD and so on, with or without times), do we really need a format string? Or could a smart converter function do this for us?
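To make the nuisance concrete, here is a small base R sketch of the manual steps described above. It uses only base functions; the variable names are illustrative, and the origin is spelled out explicitly even though recent R versions no longer require it.

```r
# Epoch seconds: origin and time zone are spelled out by hand
secs <- 1451628000
dt <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
format(dt, "%Y-%m-%d %H:%M:%S")

# An integer in YYYYMMDD form: convert to character, then supply a format
ymd <- 20160101L
d <- as.Date(as.character(ymd), format = "%Y%m%d")
format(d)

# A factor needs the same detour through character
f <- factor("20160102")
d2 <- as.Date(as.character(f), format = "%Y%m%d")
format(d2)
```

Each case needs a different combination of type conversion and format string, which is the repetition a general converter is meant to remove.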
The anytime() function aims to provide such a general purpose converter returning a proper POSIXct (or Date) object no matter the input (provided it was parseable), relying on Boost Date_Time for the (efficient, performant) conversion. anydate() is an additional wrapper returning a Date object instead. utctime() and utcdate() are two variants which interpret input as coordinated universal time (UTC), i.e. free of any timezone.
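As a quick orientation, the following sketch calls the four functions on the same input and inspects the class of each result. It assumes the anytime package is installed; the input string is just an example.

```r
library(anytime)

x <- "2016-01-01 10:11:12"

dt  <- anytime(x)   # POSIXct, input interpreted as local time
d   <- anydate(x)   # Date
udt <- utctime(x)   # POSIXct, input interpreted as UTC
ud  <- utcdate(x)   # Date, input interpreted as UTC

class(dt)
class(d)
class(udt)
class(ud)
```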
We set up the R environment and display for the examples below. Note that the package caches the (local) timezone information (and anytime:::setTZ() can be used to reset this value later).
Sys.setenv(TZ=anytime:::getTZ()) # TZ helper
library(anytime) # caches TZ info
options(width=50, # column width
digits.secs=6) # fractional secs
For numeric dates in the range of the (numeric) yyyymmdd format, we use anydate().
## integer
anydate(20160101L + 0:2)
## [1] "2016-01-01" "2016-01-02" "2016-01-03"
## numeric
anydate(20160101 + 0:2)
## [1] "2016-01-01" "2016-01-02" "2016-01-03"
Numeric input also works for datetimes if its range corresponds to the range of as.numeric() values of POSIXct variables:
## integer
anytime(1451628000L + 0:2)
## [1] "2016-01-01 06:00:00 UTC"
## [2] "2016-01-01 06:00:01 UTC"
## [3] "2016-01-01 06:00:02 UTC"
## numeric
anytime(1451628000 + 0:2)
## [1] "2016-01-01 06:00:00 UTC"
## [2] "2016-01-01 06:00:01 UTC"
## [3] "2016-01-01 06:00:02 UTC"
This is a change from version 0.3.0; the old behaviour (which was not fully consistent in how it treated numeric input values, but convenient for input in the ranges shown here) can be enabled via either an argument to the function or a global option, see help(anytime) for details:
## integer
anytime(20160101L + 0:2, oldHeuristic=TRUE)
## [1] "2016-01-01 UTC" "2016-01-02 UTC"
## [3] "2016-01-03 UTC"
## numeric
anytime(20160101 + 0:2, oldHeuristic=TRUE)
## [1] "2016-01-01 UTC" "2016-01-02 UTC"
## [3] "2016-01-03 UTC"
In general, it is now preferred to use anydate() on values in this range (or resort to using oldHeuristic=TRUE as shown).
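The two numeric ranges can be contrasted side by side. This is a minimal sketch assuming the anytime package is installed and the default (new) heuristic is active; the variable names are illustrative.

```r
library(anytime)

# Values that look like YYYYMMDD: use anydate()
d <- anydate(20160101L + 0:2)
format(d)

# Values in the range of as.numeric(<POSIXct>) are taken
# as seconds since the epoch by anytime()
secs <- 1451628000
dt <- anytime(secs)
as.numeric(dt) == secs
```

The comparison on the last line avoids printing the datetime, so the check does not depend on the local timezone.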
Factor variables and their ordered variant are also supported directly.
## factor
anytime(as.factor(20160101 + 0:2))
## [1] "2016-01-01 UTC" "2016-01-02 UTC"
## [3] "2016-01-03 UTC"
## ordered
anytime(as.ordered(20160101 + 0:2))
## [1] "2016-01-01 UTC" "2016-01-02 UTC"
## [3] "2016-01-03 UTC"
Note that while factor and ordered variables may appear to be like numeric variables, they are in fact converted to character first and treated just like character input (described in the next section).
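The reason for going through character can be seen with base R alone. The sketch below contrasts the internal integer codes of a factor with its labels; only the labels carry the date information.

```r
f <- as.factor(20160101 + 0:2)

codes  <- as.integer(f)    # internal level codes, not the dates
labels <- as.character(f)  # the text that a parser can work with

codes
labels
```

Converting a factor with as.integer() or as.numeric() would silently yield the codes, so a converter has to use the character labels.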
Character input is supported in a variety of formats. We first show simple formats.
## Dates: Character
anytime(as.character(20160101 + 0:2))
## [1] "2016-01-01 UTC" "2016-01-02 UTC"
## [3] "2016-01-03 UTC"
## [1] "2016-01-01 UTC" "2016-01-02 UTC"
## [3] "2016-01-03 UTC"
ISO 8601 date(time) formats are supported with both ‘T’ and a space as separator of date and time.
## Datetime: ISO with/without fractional seconds
anytime(c("2016-01-01 10:11:12",
"2016-01-01T10:11:12.345678"))
## [1] "2016-01-01 10:11:12.000000 UTC"
## [2] "2016-01-01 10:11:12.345678 UTC"
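As a small check of the separator handling, the two spellings of the same instant should parse to the same value. This sketch assumes the anytime package is installed.

```r
library(anytime)

a <- anytime("2016-01-01 10:11:12")   # space separator
b <- anytime("2016-01-01T10:11:12")   # 'T' separator

a == b
```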
Date formats with month abbreviations are supported in a number of common orderings.
## [1] "2016-09-01 10:11:12 UTC"
## [2] "2016-09-01 10:11:12 UTC"
## [3] "2016-09-01 10:11:12 UTC"
## Datetime: Mixed format
## (cf http://stackoverflow.com/questions/39259184)
anytime(c("Thu Sep 01 10:11:12 2016",
"Thu Sep 01 10:11:12.345678 2016"))
## [1] "2016-09-01 10:11:12.000000 UTC"
## [2] "2016-09-01 10:11:12.345678 UTC"
This shows an important aspect. When not working in local time (by overriding to UTC), the change in difference to UTC is correctly covered (which the underlying Boost Date_Time library does not do by itself).
## [1] "2016-01-31 12:13:14 UTC"
## [2] "2016-08-31 12:13:14 UTC"
## [1] "2016-01-31 12:13:14 UTC"
## [2] "2016-08-31 12:13:14 UTC"
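One way to examine the UTC handling is to compare against base R with an explicit UTC timezone. The sketch below assumes the anytime package is installed; it uses utctime() with tz="UTC" so that both parsing and display refer to UTC, with one winter and one summer timestamp.

```r
library(anytime)

x <- c("2016-01-31 12:13:14", "2016-08-31 12:13:14")

u   <- utctime(x, tz = "UTC")       # parse as UTC, display as UTC
ref <- as.POSIXct(x, tz = "UTC")    # base R reference

all.equal(as.numeric(u), as.numeric(ref))
```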
The actual parsing and conversion is done by two different Boost libraries. First, the top-level R function checks the input argument type and branches on date or datetime types. All other types get handed to a function using Boost lexical_cast to convert from anything numeric to a string representation. This textual representation is then parsed by Boost Date_Time to create the corresponding date, or datetime, type. (There are also a number of special cases where numeric values are directly converted; see below for a discussion.) We use one package to access these Boost libraries, and rely on another for a seamless C++ interface to and from R.
The Boost Date_Time library addresses the need for parsing dates and datetimes from text. It permits us to loop over a suitably large number of candidate formats with considerable ease. The formats are generally variants of the ISO 8601 date format, i.e., of the YYYY-MM-DD ordering. We also allow for textual representation of months, e.g., ‘Jan’ for January. This feature is not internationalised.
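The idea of looping over candidate formats can be illustrated in plain R. The following is only a conceptual sketch using base R's strptime(), not the package's actual C++ implementation; the function name parse_first_match and the short format list are hypothetical.

```r
# Try each candidate format in turn and return the first successful parse.
# Illustration only: the package does this in C++ with Boost Date_Time.
parse_first_match <- function(x, formats) {
    for (fmt in formats) {
        res <- as.POSIXct(strptime(x, fmt, tz = "UTC"))
        if (!is.na(res)) return(res)
    }
    as.POSIXct(NA, tz = "UTC")      # nothing matched
}

candidates <- c("%Y-%m-%d", "%Y/%m/%d", "%Y%m%d")

r1 <- parse_first_match("2016/01/02", candidates)
r2 <- parse_first_match("20160103", candidates)
r3 <- parse_first_match("garbage", candidates)

format(r1, "%Y-%m-%d")
format(r2, "%Y-%m-%d")
is.na(r3)
```

The order of the candidates matters, because the first format that parses wins.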
The list of current formats can be retrieved by the getFormats() function. Users can also add to this list at run-time by calling addFormats(), as well as remove formats. User-provided formats are tried before the formats supplied by the package.
fmts <- getFormats()
length(fmts)
## [1] 89
head(fmts,10)
## [1] "%Y-%m-%d %H:%M:%S%f" "%Y-%m-%e %H:%M:%S%f"
## [3] "%Y-%m-%d %H%M%S%f" "%Y-%m-%e %H%M%S%f"
## [5] "%Y/%m/%d %H:%M:%S%f" "%Y/%m/%e %H:%M:%S%f"
## [7] "%Y%m%d %H%M%S%f" "%Y%m%d %H:%M:%S%f"
## [9] "%m/%d/%Y %H:%M:%S%f" "%m/%e/%Y %H:%M:%S%f"
tail(fmts,10)
## [1] "%d-%b-%Y" "%e-%b-%Y" "%Y-%B-%d" "%Y-%B-%e"
## [5] "%Y%B%d" "%Y%B%e" "%B/%d/%Y" "%B/%e/%Y"
## [9] "%B-%d-%Y" "%B-%e-%Y"
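Adding and removing a format at run-time might look as follows. This is a sketch assuming the anytime package is installed; the format string "%d %b %y" is just an example of a format a user might want to register.

```r
library(anytime)

fmt <- "%d %b %y"                      # e.g. day, abbreviated month, two-digit year

addFormats(fmt)                        # register the user-supplied format
added <- fmt %in% getFormats()

removeFormats(fmt)                     # and take it out again
removed <- !(fmt %in% getFormats())

added
removed
```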
As a fallback for, e.g., different behavior on Windows where Boost does not consult the environment variable, and to be generally as close as possible to parsing by the R language and system, we also support the parser from R itself. As R does not expose this part of its API at the C level, we use a separate package for this. This code path is enabled when useR=TRUE is used.
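Switching between the two code paths only requires the useR argument. A minimal sketch, assuming the anytime package is installed:

```r
library(anytime)

x <- "2016-01-01 10:11:12"

a <- anytime(x)                # default: Boost Date_Time parser
b <- anytime(x, useR = TRUE)   # fallback: the parser from R itself

class(b)
```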
A related topic is faithful and easy to read representation of datetime objects in output, i.e., formatting and printing such objects.
In the spirit of no configuration used on the parsing side, formatting support is provided via several functions. These all follow different known standards and are accessible by the name of the standard, or, in one case, the non-standard convention. All return a character representation.
## [1] "2016-01-31T12:13:14"
rfc2822(pt)
## [1] "Sun, 31 Jan 2016 12:13:14.123456 +0000"
rfc3339(pt)
## [1] "2016-01-31T12:13:14.123456+0000"
yyyymmdd(pt)
## [1] "20160131"
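For readers who want to try the formatters, the sketch below builds a datetime object and passes it to two of them. It assumes the anytime package is installed and that an object like pt was created along these lines; the input string is inferred from the outputs shown and is an assumption.

```r
library(anytime)

# Assumed construction of the object used with the formatters
pt <- anytime("2016-01-31 12:13:14.123456")

iso  <- iso8601(pt)    # ISO 8601 with 'T' separator
ymd8 <- yyyymmdd(pt)   # compact date, the non-standard convention

iso
ymd8
```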
The package is designed to operate heuristically on a number of plausible and sane formats. This cannot possibly cover all conceivable cases.
In general, anytime tries to gently nudge users towards the ISO 8601 order of year followed by month and day. But, for example, in the United States another prevalent form insists on month-day-year ordering. As many users are likely to encounter such input, anytime accommodates this use provided a separator is used: input with either a slash (/) or a hyphen (-) is accepted and parsed.
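A short sketch of month-day-year input, assuming the anytime package is installed. A day above 12 is used so that the ordering is unambiguous.

```r
library(anytime)

d_slash  <- anydate("01/31/2016")   # month/day/year with slashes
d_hyphen <- anydate("01-31-2016")   # month-day-year with hyphens

format(d_slash)
format(d_hyphen)
```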
The package also contains two helper functions that can assist in defensive programming by validating input arguments. The assertTime() and assertDate() functions validate if the given input can be parsed, respectively, as Datetime or Date objects. In case one of the inputs cannot be parsed, an error is triggered. Otherwise the parsed input is returned invisibly.
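In a defensive-programming setting the assertion helpers could be used as sketched here. The example assumes the anytime package is installed; the tryCatch() wrapper is only there to show the error without stopping the script.

```r
library(anytime)

# Valid input: the parsed Date is returned (invisibly)
ok <- assertDate("2016-01-01")

# Invalid input: an error is raised, caught here for illustration
failed <- tryCatch({
    assertDate("not a date")
    FALSE
}, error = function(e) TRUE)

class(ok)
failed
```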
The anytime package aims to satisfy two goals: to be performant, and at the same time flexible in terms of not requiring an explicit input format. We can gauge the relative performance via several pairwise comparisons.
The as.POSIXct() function in R provides a useful baseline as it is also implemented in compiled code. The fastPOSIXct() function from the fasttime package excels at converting one (and only one) input format fast to a (UTC-only) datetime object. A simple benchmark converting 100 input strings 100,000 times finds both as.POSIXct() and anytime() at very similar performance, but well over one order of magnitude slower than the highly-focussed fastPOSIXct(). The corresponding table shows the detailed results; the underlying code can be seen in the appendix. This result is reasonable: a highly specialised function can (and should) outperform two (relatively fast) universal converters. anytime() is still compelling as it is easier to use than as.POSIXct() by not requiring a format string (for formats other than ISO 8601).
The parsedate package brings the very general date parsing utility of a version control system to R. In a similar comparison of 100 input strings parsed 10,000 times, we find its parse_date() function to be more than an order of magnitude slower than anytime() or as.POSIXct(); see the corresponding table for the results based on the code in the appendix.
Again, this result is reasonable as the greater flexibility of parsedate comes at a cost in performance relative to the more restricted alternatives.
The lubridate package is widely used for working with dates and times. It offers a very wide variety of functions for this purpose: we count a full 168 exported functions in the current version. Its parser for dates and times requires at least a hint: the user has to specify whether input is ordered as, say, year-month-day, or day-month-year, or another form. lubridate has changed its internals considerably over the years. Early versions did not contain compiled code; a C-based parser was added first, and current versions embed the CCTZ C++ library, which was first made available to R by a separate package.
While lubridate is less general than anytime (in that it generally requires user input on the ordering of date elements), it is also slower, as can be seen from the results in the corresponding table based on the code in the appendix. The more widely used form (here ymd_hms()) is over an order of magnitude slower; the less well-known function parse_date_time() (which still requires hints) is still several times slower as shown below.
We describe the anytime package, which offers fast, convenient and reliable date and datetime conversion for R users along with helper functions for formatting and assertions. Different types of input are illustrated and described in detail, and performance is analyzed via several benchmark comparisons.
We show that the package is no slower than the base R parser, and much faster than either the most flexible parsing alternative or a commonly-used package in this space, all the while freeing users from having to supply explicit formats in advance. The combination of features, performance and ease-of-use may make anytime a compelling alternative for R users parsing and analysing dates and times.
The benchmark results shown in the three tables are based on the code included below, and were obtained via execution under R version 3.6.1 running under Ubuntu 19.04 with Linux kernel 5.0.0-25 on an Intel i7-8700k processor.
library(anytime)
library(rbenchmark)
library(fasttime)
inp <- rep("2019-01-02 03:04:05", 100)
res1 <- benchmark(fasttime=fastPOSIXct(inp),
                  baseR=as.POSIXct(inp),
                  anytime=anytime(inp),
                  replications=1e5)[, 1:4]
res1
library(parsedate)
inp <- rep("2019-01-02 03:04:05", 100)
res2 <- benchmark(parsedate=parse_date(inp),
                  baseR=as.POSIXct(inp),
                  anytime=anytime(inp),
                  replications=1e4)[, 1:4]
res2
suppressMessages(library(lubridate))
inp <- rep("2019-01-02 03:04:05", 100)
res3 <- benchmark(ymd_hms=ymd_hms(inp),
                  parse_date_time=
                      parse_date_time(inp,
                                      "ymd_HMS"),
                  anytime=anytime(inp),
                  replications=1e4)[, 1:4]
res3