No token available. Code chunks will not be evaluated.
Basic Sheet reading is shown in the Get started article. Here we show how to target a specific (work)sheet or cell range, how to deal with column types, and how to get detailed cell data.
As a regular, interactive user, you can just let googlesheets4 prompt you for anything it needs re: auth.
Since this article is compiled noninteractively on a server, we have arranged for googlesheets4 to use a service account token (not shown).
read_sheet()
and range_read()
are
synonyms
The main “read” function of the googlesheets4 package goes by two names, because we want it to make sense in two contexts:
read_sheet()
evokes other table-reading functions,
like readr::read_csv()
and
readxl::read_excel()
. The sheet
in this case
refers to a Google (spread)Sheet.
range_read()
is technically the right name according
to the naming convention used throughout the googlesheets4 package,
because we can read from an arbitrary cell range.
read_sheet()
and range_read()
are synonyms
and you can use either one. Throughout this article, we’re going to use
range_read()
.
Note: The first release of googlesheets used a sheets_
prefix everywhere, so we had sheets_read()
. It still works,
but it’s deprecated and will go away rather swiftly.
Here we read from the “mini-gap” and “deaths” example Sheets to show some of the different ways to specify (work)sheet and cell ranges.
range_read(gs4_example("mini-gap"), sheet = 2)
range_read(gs4_example("mini-gap"), sheet = "Oceania", n_max = 3)
range_read(gs4_example("deaths"), skip = 4, n_max = 10)
The example below shows the use of range
to specify both
the (work)sheet and an A1-style cell range.
It also demonstrates how col_types
gives control of
column types, similar to how col_types
works in readr and
readxl. Note that currently there is only support for the “shortcode”
style of column specification and we plan to align better with readr’s
capabilities in a future release.
range_read()
.
range_read(
gs4_example("deaths"), range = "other!A5:F15", col_types = "?ci??D"
)
If you looked at the “deaths” spreadsheet in the browser (it’s here), you know that it has some of the typical features of real world spreadsheets: the main data rectangle has prose intended for human-consumption before and after it. That’s why we have to specify the range when we read from it.
We’ve designated the data rectangles as named
ranges, which provides a very slick way to read them – definitely
less brittle and mysterious than approaches like
range = "other!A5:F15"
or
skip = 4, n_max = 10
. A named range can be passed via the
range =
argument:
gs4_example("deaths") %>%
range_read(range = "arts_data")
The named ranges, if any exist, are part of the information returned
by gs4_get()
.
range_read_cells()
returns a data frame with one row per
cell and it gives access to raw cell data sent by the Sheets API.
(df <- range_read_cells(gs4_example("deaths"), range = "E5:E7"))
df$cell[[3]]
Specify cell_data = "full", discard_empty = FALSE
to get
even more data if you, for example, need access to cell formulas or
formatting.
spread_sheet()
converts data in the “one row per cell”
form into the data frame you get from range_read()
, which
involves reshaping and column typing.
df %>% spread_sheet(col_types = "D")
## is same as ...
range_read(gs4_example("deaths"), range = "E5:E7", col_types ="D")
If your Sheet is so large that the speed of range_read()
is causing problems, consider range_speedread()
. It uses a
special URL that allows a Sheet to be read as comma-separated values
(CSV). Access via this URL doesn’t use the Sheets API (although
range_speedread()
still makes an API call to retrieve Sheet
metadata). As an example, on a Sheet with around 57,000 rows and 25
columns (over 1.4 million cells), range_speedread()
takes
~5 seconds, whereas range_read()
takes closer to 3 minutes.
Why wouldn’t we always take the faster option?!? Because the speed
difference is imperceptible for many Sheets and there are some downsides
(described later).
range_speedread()
has much the same interface as
range_read()
.
gs4_example("gapminder") %>%
range_speedread(sheet = "Oceania", n_max = 3)
The output above reveals that, under the hood,
range_speedread()
calls an external function for CSV
parsing (namely, readr::read_csv()
). An important
consequence is that all arguments around column type specification are
passed along to the CSV-parsing function. Here is a demo using
readr-style column specification:
gs4_example("deaths") %>%
range_speedread(
range = "other!A5:F15",
col_types = readr::cols(
Age = readr::col_integer(),
`Date of birth` = readr::col_date("%m/%d/%Y"),
`Date of death` = readr::col_date("%m/%d/%Y")
)
)
Compare that to how we would read the same data with
range_read()
:
gs4_example("deaths") %>%
range_read(range = "other_data", col_types = "??i?DD")
This example highlights two important differences:
range = "other!A5:F15"
versus
range = "other_data"
: range_speedread()
can’t
access a named range, whereas range_read()
can.readr::col_date("%m/%d/%Y")
vs D
:
range_speedread()
must parse a character representation of
all cell data, including datetimes, whereas range_read()
has access to the actual cell data and its type.What’s the speed difference for something like the Africa sheet in the “gapminder” example Sheet? (around 625 rows x 6 columns, or 3700 cells)
system.time(
gs4_example("gapminder") %>% range_speedread(sheet = "Africa")
)
system.time(
gs4_example("gapminder") %>% range_read(sheet = "Africa")
)
The modest difference above shows that the speed difference is unlikely to be a gamechanger in many settings.
Summary of how to think about range_speedread()
vs
range_read()
:
gs4_deauth()
.range_speedread()
is faster, but it’s not noticeable
for typical Sheets.range_speedread()
uses readr-style column type
specification, which is actually more flexible than what
range_read()
currently does. In future googlesheets4
releases, we will adopt readr-style column type specification.range_speedread()
requires more detailed column type
specification, because it cannot access unformatted cell data and the
actual cell type, as range_read()
can.range_speedread()
can’t access full cell data, e.g.,
formatting.range_speedread()
can’t work with named ranges.