As an alternative to calling collect() on a Dataset query, you can use this function to access the stream of RecordBatches in the Dataset. This lets you do more complex operations in R that operate on chunks of data without having to hold the entire Dataset in memory at once. You can include map_batches() in a dplyr pipeline and apply additional dplyr methods to the stream of data in Arrow after it.
map_batches(X, FUN, ..., .schema = NULL, .lazy = TRUE, .data.frame = NULL)
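For example, here is a minimal sketch of the pattern described above; the dataset path and the value column are hypothetical placeholders, not part of any real dataset:

library(arrow)
library(dplyr)

# A hypothetical on-disk dataset; only one batch is held in memory at a time
ds <- open_dataset("path/to/dataset")

ds %>%
  filter(value > 0) %>%             # evaluated by Arrow before batches reach R
  map_batches(~ as_record_batch(    # FUN runs in R on each RecordBatch
    dplyr::mutate(as.data.frame(.x), batch_rows = nrow(.x))
  )) %>%
  collect()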
X: A Dataset or arrow_dplyr_query object, as returned by the dplyr methods on Dataset.
FUN: A function or purrr-style lambda expression to apply to each batch. It must return a RecordBatch or something coercible to one via as_record_batch().
...: Additional arguments passed to FUN.
.schema: An optional schema(). If NULL, the schema will be inferred from the first batch.
.lazy: Use TRUE to evaluate FUN lazily as batches are read from the result; use FALSE to evaluate FUN on all batches before returning the reader (see the last example on this page).
.data.frame: Deprecated argument, ignored.
map_batches() returns an arrow_dplyr_query.
This function is experimental and not recommended for production use. It is also single-threaded and runs in R, not C++, so it won't be as fast as core Arrow methods.
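To illustrate the .lazy argument, here is another sketch, again with a hypothetical dataset path and value column. Setting .lazy = FALSE applies FUN to every batch up front, before the reader is returned, rather than as the result is consumed:

library(arrow)
library(dplyr)

ds <- open_dataset("path/to/dataset")

scaled <- ds %>%
  map_batches(
    function(batch) {
      df <- as.data.frame(batch)
      df$value <- df$value * 100    # plain R code on one chunk at a time
      as_record_batch(df)
    },
    .lazy = FALSE                   # evaluate FUN on all batches now, not lazily
  )

# The result is still an arrow_dplyr_query, so further dplyr verbs work on it
scaled %>% filter(value > 50) %>% collect()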