Read a CSV file into a Spark DataFrame

Read a tabular data file into a Spark DataFrame.

Usage

spark_read_csv(
  sc,
  name = NULL,
  path = name,
  header = TRUE,
  columns = NULL,
  infer_schema = is.null(columns),
  delimiter = ",",
  quote = "\"",
  escape = "\\",
  charset = "UTF-8",
  null_value = NULL,
  options = list(),
  repartition = 0,
  memory = TRUE,
  overwrite = TRUE,
  ...
)

Arguments

sc: A spark_connection.
name: The name to assign to the newly generated table.
path: The path to the file. It must be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
header: Boolean; should the first row of data be used as a header? Defaults to TRUE.
columns: A vector of column names or a named vector of column types. When types are given, each element can be "binary" for BinaryType, "boolean" for BooleanType, "byte" for ByteType, "integer" for IntegerType, "integer64" for LongType, "double" for DoubleType, "character" for StringType, "timestamp" for TimestampType or "date" for DateType.
infer_schema: Boolean; should column types be automatically inferred? Requires one extra pass over the data. Defaults to is.null(columns).
delimiter: The character used to delimit each column. Defaults to ','.
quote: The character used as a quote. Defaults to '"'.
escape: The character used to escape other characters. Defaults to '\'.
charset: The character set. Defaults to "UTF-8".
null_value: The character to use for null, or missing, values. Defaults to NULL.
options: A list of strings with additional options.
repartition: The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning.
memory: Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?)
overwrite: Boolean; overwrite the table with the given name if it already exists?
...: Optional arguments; currently unused.
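
As a minimal sketch of how columns and infer_schema interact, the snippet below builds a named vector of column types and passes it to spark_read_csv(). The column names, the table name "orders" and the helper read_orders() are hypothetical and chosen only for illustration. The helper expects an existing spark_connection and a path to a CSV file that the cluster can reach.

```r
# Named character vector: names are the column names, values are type
# strings from the list documented under `columns`.
# (These column names are made up for illustration.)
order_columns <- c(
  order_id   = "integer",
  customer   = "character",
  amount     = "double",
  ordered_on = "date"
)

# Hypothetical helper: read a CSV with an explicit schema.
# `sc` is an existing spark_connection; `path` points to the CSV file.
read_orders <- function(sc, path) {
  sparklyr::spark_read_csv(
    sc,
    name = "orders",
    path = path,
    header = TRUE,
    columns = order_columns,  # infer_schema defaults to is.null(columns), i.e. FALSE here
    null_value = "NA",        # treat the literal string "NA" as a missing value
    memory = FALSE            # do not cache the table eagerly
  )
}
```

With a live connection, for example sc <- sparklyr::spark_connect(master = "local"), a call such as read_orders(sc, "file:///path/to/orders.csv") would register the table without the extra pass over the data that schema inference requires.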

Details

You can read data from HDFS (hdfs://), S3 (s3a://), and the local file system (file://).

When header is FALSE, the column names are generated with a V prefix; e.g. V1, V2, ....
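
The header = FALSE behavior can be tried with a small headerless file. The sketch below writes toy data (made up here) to a temporary file and defines a hypothetical helper, read_headerless(), that reads it through an existing connection. It assumes a local Spark connection, so that the temporary file is visible to Spark.

```r
# Write a small CSV without a header row to a temporary file (toy data).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("1,apple,0.5", "2,pear,0.75"), csv_path)

# Hypothetical helper: read a headerless CSV through an existing connection.
# `sc` is a spark_connection; `name` is the table name to register.
read_headerless <- function(sc, path, name = "headerless") {
  sparklyr::spark_read_csv(sc, name = name, path = path, header = FALSE)
}

# With a live local connection:
# sc  <- sparklyr::spark_connect(master = "local")
# tbl <- read_headerless(sc, csv_path)
# sparklyr::sdf_schema(tbl)  # lists the generated column names and their types
```

Because no columns argument is given, infer_schema defaults to TRUE and Spark makes one extra pass over the file to determine the column types.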

See also

Other Spark serialization routines: collect_from_rds(), spark_insert_table(), spark_load_table(), spark_read(), spark_read_avro(), spark_read_binary(), spark_read_delta(), spark_read_image(), spark_read_jdbc(), spark_read_json(), spark_read_libsvm(), spark_read_orc(), spark_read_parquet(), spark_read_source(), spark_read_table(), spark_read_text(), spark_save_table(), spark_write_avro(), spark_write_csv(), spark_write_delta(), spark_write_jdbc(), spark_write_json(), spark_write_orc(), spark_write_parquet(), spark_write_source(), spark_write_table(), spark_write_text()
