Package index • sparklyr

Spark Sessions

spark_connect() spark_connection_is_open() spark_disconnect() spark_disconnect_all() spark_submit(): Manage Spark Connections
spark_config(): Read Spark Configuration
spark_install() spark_uninstall() spark_install_dir() spark_install_tar() spark_installed_versions() spark_available_versions(): Download and install various versions of Spark
spark_log(): View Entries in the Spark Log
spark_web(): Open the Spark web interface

Spark Data

spark_read(): Read file(s) into a Spark DataFrame using a custom reader
spark_read_avro(): Read Apache Avro data into a Spark DataFrame.
spark_read_binary(): Read binary data into a Spark DataFrame.
spark_read_csv(): Read a CSV file into a Spark DataFrame
spark_read_delta(): Read from Delta Lake into a Spark DataFrame.
spark_read_image(): Read image data into a Spark DataFrame.
spark_read_jdbc(): Read from JDBC connection into a Spark DataFrame.
spark_read_json(): Read a JSON file into a Spark DataFrame
spark_read_libsvm(): Read libsvm file into a Spark DataFrame.
spark_read_orc(): Read a ORC file into a Spark DataFrame
spark_read_parquet(): Read a Parquet file into a Spark DataFrame
spark_read_source(): Read from a generic source into a Spark DataFrame.
spark_read_table(): Reads from a Spark Table into a Spark DataFrame.
spark_read_text(): Read a Text file into a Spark DataFrame
spark_write(): Write Spark DataFrame to file using a custom writer
spark_write_avro(): Serialize a Spark DataFrame into Apache Avro format
spark_write_csv(): Write a Spark DataFrame to a CSV
spark_write_delta(): Writes a Spark DataFrame into Delta Lake
spark_write_jdbc(): Writes a Spark DataFrame into a JDBC table
spark_write_json(): Write a Spark DataFrame to a JSON file
spark_write_orc(): Write a Spark DataFrame to a ORC file
spark_write_parquet(): Write a Spark DataFrame to a Parquet file
spark_write_rds(): Write Spark DataFrame to RDS files
spark_write_source(): Writes a Spark DataFrame into a generic source
spark_write_table(): Writes a Spark DataFrame into a Spark table
spark_write_text(): Write a Spark DataFrame to a Text file
spark_insert_table(): Inserts a Spark DataFrame into a Spark table
spark_save_table(): Saves a Spark DataFrame as a Spark table
collect_from_rds(): Collect Spark data serialized in RDS format into R

Spark Tables

src_databases(): Show database list
tbl_cache(): Cache a Spark Table
tbl_change_db(): Use specific database
tbl_uncache(): Uncache a Spark Table

Spark DataFrames

dplyr_hof: dplyr wrappers for Apache Spark higher order functions
sdf_save_table() sdf_load_table() sdf_save_parquet() sdf_load_parquet(): Save / Load a Spark DataFrame
sdf_predict() sdf_transform() sdf_fit() sdf_fit_and_transform(): Spark ML – Transform, fit, and predict methods (sdf_ interface)
sdf_along(): Create DataFrame for along Object
sdf_bind_rows() sdf_bind_cols(): Bind multiple Spark DataFrames by row and column
sdf_broadcast(): Broadcast hint
sdf_checkpoint(): Checkpoint a Spark DataFrame
sdf_coalesce(): Coalesces a Spark DataFrame
sdf_collect(): Collect a Spark DataFrame into R.
sdf_copy_to() sdf_import(): Copy an Object into Spark
sdf_crosstab(): Cross Tabulation
sdf_debug_string(): Debug Info for Spark DataFrame
sdf_describe(): Compute summary statistics for columns of a data frame
sdf_dim() sdf_nrow() sdf_ncol(): Support for Dimension Operations
sdf_distinct(): Invoke distinct on a Spark DataFrame
sdf_drop_duplicates(): Remove duplicates from a Spark DataFrame
sdf_expand_grid(): Create a Spark dataframe containing all combinations of inputs
sdf_from_avro(): Convert column(s) from avro format
sdf_is_streaming(): Spark DataFrame is Streaming
sdf_last_index(): Returns the last index of a Spark DataFrame
sdf_len(): Create DataFrame for Length
sdf_num_partitions(): Gets number of partitions of a Spark DataFrame
sdf_partition_sizes(): Compute the number of records within each partition of a Spark DataFrame
sdf_persist(): Persist a Spark DataFrame
sdf_pivot(): Pivot a Spark DataFrame
sdf_project(): Project features onto principal components
sdf_quantile(): Compute (Approximate) Quantiles with a Spark DataFrame
sdf_random_split() sdf_partition(): Partition a Spark Dataframe
sdf_rbeta(): Generate random samples from a Beta distribution
sdf_rbinom(): Generate random samples from a binomial distribution
sdf_rcauchy(): Generate random samples from a Cauchy distribution
sdf_rchisq(): Generate random samples from a chi-squared distribution
sdf_read_column(): Read a Column from a Spark DataFrame
sdf_register(): Register a Spark DataFrame
sdf_repartition(): Repartition a Spark DataFrame
sdf_residuals(): Model Residuals
sdf_rexp(): Generate random samples from an exponential distribution
sdf_rgamma(): Generate random samples from a Gamma distribution
sdf_rgeom(): Generate random samples from a geometric distribution
sdf_rhyper(): Generate random samples from a hypergeometric distribution
sdf_rlnorm(): Generate random samples from a log normal distribution
sdf_rnorm(): Generate random samples from the standard normal distribution
sdf_rpois(): Generate random samples from a Poisson distribution
sdf_rt(): Generate random samples from a t-distribution
sdf_runif(): Generate random samples from the uniform distribution U(0, 1).
sdf_rweibull(): Generate random samples from a Weibull distribution.
sdf_sample(): Randomly Sample Rows from a Spark DataFrame
sdf_schema(): Read the Schema of a Spark DataFrame
sdf_separate_column(): Separate a Vector Column into Scalar Columns
sdf_seq(): Create DataFrame for Range
sdf_sort(): Sort a Spark DataFrame
sdf_sql(): Spark DataFrame from SQL
sdf_to_avro(): Convert column(s) to avro format
sdf_unnest_longer(): Unnest longer
sdf_unnest_wider(): Unnest wider
sdf_weighted_sample(): Perform Weighted Random Sampling on a Spark DataFrame
sdf_with_sequential_id(): Add a Sequential ID Column to a Spark DataFrame
sdf_with_unique_id(): Add a Unique ID Column to a Spark DataFrame
hof_aggregate(): Apply Aggregate Function to Array Column
hof_array_sort(): Sorts array using a custom comparator
hof_exists(): Determine Whether Some Element Exists in an Array Column
hof_filter(): Filter Array Column
hof_forall(): Checks whether all elements in an array satisfy a predicate
hof_map_filter(): Filters a map
hof_map_zip_with(): Merges two maps into one
hof_transform(): Transform Array Column
hof_transform_keys(): Transforms keys of a map
hof_transform_values(): Transforms values of a map
hof_zip_with(): Combines 2 Array Columns
transform_sdf(): transform a subset of column(s) in a Spark Dataframe

Spark ML - Regression

ml_linear_regression(): Spark ML – Linear Regression
ml_aft_survival_regression() ml_survival_regression(): Spark ML – Survival Regression
ml_isotonic_regression(): Spark ML – Isotonic Regression
ml_generalized_linear_regression(): Spark ML – Generalized Linear Regression

Spark ML - Classification

ml_naive_bayes(): Spark ML – Naive-Bayes
ml_one_vs_rest(): Spark ML – OneVsRest
ml_logistic_regression(): Spark ML – Logistic Regression
ml_multilayer_perceptron_classifier() ml_multilayer_perceptron(): Spark ML – Multilayer Perceptron
ml_linear_svc(): Spark ML – LinearSVC

Spark ML - Tree

ml_decision_tree_classifier() ml_decision_tree() ml_decision_tree_regressor(): Spark ML – Decision Trees
ml_gbt_classifier() ml_gradient_boosted_trees() ml_gbt_regressor(): Spark ML – Gradient Boosted Trees
ml_random_forest_classifier() ml_random_forest() ml_random_forest_regressor(): Spark ML – Random Forest
ml_feature_importances() ml_tree_feature_importance(): Spark ML - Feature Importance for Tree Models

Spark ML - Clustering

ml_kmeans() ml_compute_cost() ml_compute_silhouette_measure(): Spark ML – K-Means Clustering
ml_kmeans_cluster_eval: Evaluate a K-mean clustering
ml_bisecting_kmeans(): Spark ML – Bisecting K-Means Clustering
ml_gaussian_mixture(): Spark ML – Gaussian Mixture clustering.
ml_power_iteration(): Spark ML – Power Iteration Clustering

Spark ML - Text

ml_lda() ml_describe_topics() ml_log_likelihood() ml_log_perplexity() ml_topics_matrix(): Spark ML – Latent Dirichlet Allocation
ml_chisquare_test(): Chi-square hypothesis testing for categorical data.
ml_default_stop_words(): Default stop words
ml_fpgrowth() ml_association_rules() ml_freq_itemsets(): Frequent Pattern Mining – FPGrowth
ml_prefixspan() ml_freq_seq_patterns(): Frequent Pattern Mining – PrefixSpan
ft_count_vectorizer() ml_vocabulary(): Feature Transformation – CountVectorizer (Estimator)

Spark ML - Recommendations

ml_als() ml_recommend(): Spark ML – ALS

Spark ML - Hyper-parameter tuning

ml_sub_models() ml_validation_metrics() ml_cross_validator() ml_train_validation_split(): Spark ML – Tuning

Spark ML - Evaluation

ml_metrics_binary(): Extracts metrics from a fitted table
ml_metrics_multiclass(): Extracts metrics from a fitted table
ml_metrics_regression(): Extracts metrics from a fitted table
ml_evaluate(): Evaluate the Model on a Validation Set
ml_binary_classification_evaluator() ml_binary_classification_eval() ml_multiclass_classification_evaluator() ml_classification_eval() ml_regression_evaluator(): Spark ML - Evaluators
ml_clustering_evaluator(): Spark ML - Clustering Evaluator

Spark ML - Operations

ml_model_data(): Extracts data associated with a Spark ML model
ml_call_constructor(): Wrap a Spark ML JVM object
ml_corr(): Compute correlation matrix
is_ml_transformer() is_ml_estimator() ml_fit() ml_transform() ml_fit_and_transform() ml_predict(): Spark ML – Transform, fit, and predict methods (ml_ interface)
ft_string_indexer() ml_labels() ft_string_indexer_model(): Feature Transformation – StringIndexer (Estimator)
ml_save() ml_load(): Spark ML – Model Persistence
ml_is_set() ml_param_map() ml_param() ml_params(): Spark ML – ML Params
ml_standardize_formula(): Standardize Formula Input for `ml_model`
ml_summary(): Spark ML – Extraction of summary metrics
ml_supervised_pipeline() ml_clustering_pipeline() ml_construct_model_supervised() ml_construct_model_clustering() new_ml_model_prediction() new_ml_model() new_ml_model_classification() new_ml_model_regression() new_ml_model_clustering(): Constructors for `ml_model` Objects
ml_uid(): Spark ML – UID

Spark Pipelines

ml_pipeline(): Spark ML – Pipelines
ml_stage() ml_stages(): Spark ML – Pipeline stage extraction
ml_add_stage(): Add a Stage to a Pipeline

Spark Feature Transformers

ft_binarizer(): Feature Transformation – Binarizer (Transformer)
ft_bucketizer(): Feature Transformation – Bucketizer (Transformer)
ft_chisq_selector(): Feature Transformation – ChiSqSelector (Estimator)
ft_count_vectorizer() ml_vocabulary(): Feature Transformation – CountVectorizer (Estimator)
ft_dct() ft_discrete_cosine_transform(): Feature Transformation – Discrete Cosine Transform (DCT) (Transformer)
ft_elementwise_product(): Feature Transformation – ElementwiseProduct (Transformer)
ft_feature_hasher(): Feature Transformation – FeatureHasher (Transformer)
ft_hashing_tf(): Feature Transformation – HashingTF (Transformer)
ft_idf(): Feature Transformation – IDF (Estimator)
ft_imputer(): Feature Transformation – Imputer (Estimator)
ft_index_to_string(): Feature Transformation – IndexToString (Transformer)
ft_interaction(): Feature Transformation – Interaction (Transformer)
ft_bucketed_random_projection_lsh() ft_minhash_lsh(): Feature Transformation – LSH (Estimator)
ml_approx_nearest_neighbors() ml_approx_similarity_join(): Utility functions for LSH models
ft_max_abs_scaler(): Feature Transformation – MaxAbsScaler (Estimator)
ft_min_max_scaler(): Feature Transformation – MinMaxScaler (Estimator)
ft_ngram(): Feature Transformation – NGram (Transformer)
ft_normalizer(): Feature Transformation – Normalizer (Transformer)
ft_one_hot_encoder(): Feature Transformation – OneHotEncoder (Transformer)
ft_one_hot_encoder_estimator(): Feature Transformation – OneHotEncoderEstimator (Estimator)
ft_pca() ml_pca(): Feature Transformation – PCA (Estimator)
ft_polynomial_expansion(): Feature Transformation – PolynomialExpansion (Transformer)
ft_quantile_discretizer(): Feature Transformation – QuantileDiscretizer (Estimator)
ft_r_formula(): Feature Transformation – RFormula (Estimator)
ft_regex_tokenizer(): Feature Transformation – RegexTokenizer (Transformer)
ft_robust_scaler(): Feature Transformation – RobustScaler (Estimator)
ft_standard_scaler(): Feature Transformation – StandardScaler (Estimator)
ft_stop_words_remover(): Feature Transformation – StopWordsRemover (Transformer)
ft_string_indexer() ml_labels() ft_string_indexer_model(): Feature Transformation – StringIndexer (Estimator)
ft_tokenizer(): Feature Transformation – Tokenizer (Transformer)
ft_vector_assembler(): Feature Transformation – VectorAssembler (Transformer)
ft_vector_indexer(): Feature Transformation – VectorIndexer (Estimator)
ft_vector_slicer(): Feature Transformation – VectorSlicer (Transformer)
ft_word2vec() ml_find_synonyms(): Feature Transformation – Word2Vec (Estimator)
ft_sql_transformer() ft_dplyr_transformer(): Feature Transformation – SQLTransformer

Extensions

ml_supervised_pipeline() ml_clustering_pipeline() ml_construct_model_supervised() ml_construct_model_clustering() new_ml_model_prediction() new_ml_model() new_ml_model_classification() new_ml_model_regression() new_ml_model_clustering(): Constructors for `ml_model` Objects
compile_package_jars(): Compile Scala sources into a Java Archive (jar)
connection_config(): Read configuration values for a connection
download_scalac(): Downloads default Scala Compilers
find_scalac(): Discover the Scala Compiler
spark_context() java_context() hive_context() spark_session(): Access the Spark API
hive_context_config(): Runtime configuration interface for Hive
invoke() invoke_static() invoke_new(): Invoke a Method on a JVM Object
j_invoke() j_invoke_static() j_invoke_new(): Invoke a Java function.
jarray(): Instantiate a Java array with a specific element type.
jfloat(): Instantiate a Java float type.
jfloat_array(): Instantiate an Array[Float].
register_extension() registered_extensions(): Register a Package that Implements a Spark Extension
spark_compilation_spec(): Define a Spark Compilation Specification
spark_default_compilation_spec(): Default Compilation Specification for Spark Extensions
spark_context_config(): Runtime configuration interface for the Spark Context.
spark_dataframe(): Retrieve a Spark DataFrame
spark_dependency(): Define a Spark dependency
spark_home_set(): Set the SPARK_HOME environment variable
spark_jobj(): Retrieve a Spark JVM Object Reference
spark_version(): Get the Spark Version Associated with a Spark Connection

Distributed Computing

spark_apply(): Apply an R Function in Spark
spark_apply_bundle(): Create Bundle for Spark Apply
spark_apply_log(): Log Writer for Spark Apply
registerDoSpark(): Register a Parallel Backend

Livy

livy_config(): Create a Spark Configuration for Livy
livy_service_start() livy_service_stop(): Start Livy

Streaming

stream_find(): Find Stream
stream_generate_test(): Generate Test Stream
stream_id(): Spark Stream's Identifier
stream_lag(): Apply lag function to columns of a Spark Streaming DataFrame
stream_name(): Spark Stream's Name
stream_read_csv() stream_read_text() stream_read_json() stream_read_parquet() stream_read_orc() stream_read_kafka() stream_read_socket() stream_read_delta() stream_read_cloudfiles() stream_read_table(): Read files created by the stream
stream_render(): Render Stream
stream_stats(): Stream Statistics
stream_stop(): Stops a Spark Stream
stream_trigger_continuous(): Spark Stream Continuous Trigger
stream_trigger_interval(): Spark Stream Interval Trigger
stream_view(): View Stream
stream_watermark(): Watermark Stream
stream_write_csv() stream_write_text() stream_write_json() stream_write_parquet() stream_write_orc() stream_write_kafka() stream_write_console() stream_write_delta(): Write files to the stream
stream_write_memory(): Write Memory Stream
stream_write_table(): Write Stream to Table
reactiveSpark(): Reactive spark reader

dplyr integration

copy_to(<spark_connection>): Copy an R Data Frame to Spark
distinct: Distinct
filter: Filter
full_join: Full join
inner_join: Inner join
inner_join(<tbl_spark>) left_join(<tbl_spark>) right_join(<tbl_spark>) full_join(<tbl_spark>): Join Spark tbls.
left_join: Left join
mutate: Mutate
right_join: Right join
select: Select

tidyr integration

pivot_longer: Pivot longer
pivot_wider: Pivot wider
fill: Fill
na.replace(): Replace Missing Values in Objects
nest: Nest
replace_na: Replace NA
separate: Separate
unite: Unite
unnest: Unnest

tidymodels integration

tidy(<ml_model_als>) augment(<ml_model_als>) glance(<ml_model_als>): Tidying methods for Spark ML ALS
tidy(<ml_model_generalized_linear_regression>) tidy(<ml_model_linear_regression>) augment(<ml_model_generalized_linear_regression>) augment(<_ml_model_linear_regression>) augment(<ml_model_linear_regression>) glance(<ml_model_generalized_linear_regression>) glance(<ml_model_linear_regression>): Tidying methods for Spark ML linear models
tidy(<ml_model_isotonic_regression>) augment(<ml_model_isotonic_regression>) glance(<ml_model_isotonic_regression>): Tidying methods for Spark ML Isotonic Regression
tidy(<ml_model_lda>) augment(<ml_model_lda>) glance(<ml_model_lda>): Tidying methods for Spark ML LDA models
tidy(<ml_model_linear_svc>) augment(<ml_model_linear_svc>) glance(<ml_model_linear_svc>): Tidying methods for Spark ML linear svc
tidy(<ml_model_logistic_regression>) augment(<ml_model_logistic_regression>) augment(<_ml_model_logistic_regression>) glance(<ml_model_logistic_regression>): Tidying methods for Spark ML Logistic Regression
tidy(<ml_model_multilayer_perceptron_classification>) augment(<ml_model_multilayer_perceptron_classification>) glance(<ml_model_multilayer_perceptron_classification>): Tidying methods for Spark ML MLP
tidy(<ml_model_naive_bayes>) augment(<ml_model_naive_bayes>) glance(<ml_model_naive_bayes>): Tidying methods for Spark ML Naive Bayes
tidy(<ml_model_pca>) augment(<ml_model_pca>) glance(<ml_model_pca>): Tidying methods for Spark ML Principal Component Analysis
tidy(<ml_model_aft_survival_regression>) augment(<ml_model_aft_survival_regression>) glance(<ml_model_aft_survival_regression>): Tidying methods for Spark ML Survival Regression
tidy(<ml_model_decision_tree_classification>) tidy(<ml_model_decision_tree_regression>) augment(<ml_model_decision_tree_classification>) augment(<_ml_model_decision_tree_classification>) augment(<ml_model_decision_tree_regression>) augment(<_ml_model_decision_tree_regression>) glance(<ml_model_decision_tree_classification>) glance(<ml_model_decision_tree_regression>) tidy(<ml_model_random_forest_classification>) tidy(<ml_model_random_forest_regression>) augment(<ml_model_random_forest_classification>) augment(<_ml_model_random_forest_classification>) augment(<ml_model_random_forest_regression>) augment(<_ml_model_random_forest_regression>) glance(<ml_model_random_forest_classification>) glance(<ml_model_random_forest_regression>) tidy(<ml_model_gbt_classification>) tidy(<ml_model_gbt_regression>) augment(<ml_model_gbt_classification>) augment(<_ml_model_gbt_classification>) augment(<ml_model_gbt_regression>) augment(<_ml_model_gbt_regression>) glance(<ml_model_gbt_classification>) glance(<ml_model_gbt_regression>): Tidying methods for Spark ML tree models
tidy(<ml_model_kmeans>) augment(<ml_model_kmeans>) glance(<ml_model_kmeans>) tidy(<ml_model_bisecting_kmeans>) augment(<ml_model_bisecting_kmeans>) glance(<ml_model_bisecting_kmeans>) tidy(<ml_model_gaussian_mixture>) augment(<ml_model_gaussian_mixture>) glance(<ml_model_gaussian_mixture>): Tidying methods for Spark ML unsupervised models

Spark Operations

get_spark_sql_catalog_implementation(): Retrieve the Spark connection's SQL catalog implementation property
connection_is_open(): Check whether the connection is open
connection_spark_shinyapp(): A Shiny app that can be used to construct a spark_connect statement
spark_session_config(): Runtime configuration interface for the Spark Session
spark_set_checkpoint_dir() spark_get_checkpoint_dir(): Set/Get Spark checkpoint directory
spark_connect() spark_connection_is_open() spark_disconnect() spark_disconnect_all() spark_submit(): Manage Spark Connections
spark_table_name(): Generate a Table Name from Expression
spark_install() spark_uninstall() spark_install_dir() spark_install_tar() spark_installed_versions() spark_available_versions(): Download and install various versions of Spark
spark_version_from_home(): Get the Spark Version Associated with a Spark Installation
spark_versions(): Returns a data frame of available Spark versions that can be installed.
spark_config_kubernetes(): Kubernetes Configuration
spark_config_settings(): Retrieve Available Settings
spark_connection_find(): Find Spark Connection
spark_dependency_fallback(): Fallback to Spark Dependency
spark_extension(): Create Spark Extension
spark_load_table(): Reads from a Spark Table into a Spark DataFrame.
list_sparklyr_jars(): list all sparklyr-*.jar files that have been built
spark_config_packages(): Creates Spark Configuration
spark_connection(): Retrieve the Spark Connection Associated with an R Object
spark_adaptive_query_execution(): Retrieves or sets status of Spark AQE
spark_advisory_shuffle_partition_size(): Retrieves or sets advisory size of the shuffle partition
spark_auto_broadcast_join_threshold(): Retrieves or sets the auto broadcast join threshold
spark_coalesce_initial_num_partitions(): Retrieves or sets initial number of shuffle partitions before coalescing
spark_coalesce_min_num_partitions(): Retrieves or sets the minimum number of shuffle partitions after coalescing
spark_coalesce_shuffle_partitions(): Retrieves or sets whether coalescing contiguous shuffle partitions is enabled
spark_connection-class: spark_connection class
spark_jobj-class: spark_jobj class
sparklyr_get_backend_port(): Return the port number of a `sparklyr` backend.
spark_connect_method(): Function that negotiates the connection with the Spark back-end
spark_ide_connection_open() spark_ide_connection_closed() spark_ide_connection_updated() spark_ide_connection_actions() spark_ide_objects() spark_ide_columns() spark_ide_preview(): Set of functions to provide integration with the RStudio IDE
spark_integ_test_skip(): It lets the package know if it should test a particular functionality or not
spark_last_error(): Surfaces the last error from Spark captured by internal `spark_error` function

Other

spark_statistical_routines: Generate random samples from some distribution
ensure: Enforce Specific Structure for R Objects
random_string(): Random string generation
`%->%`: Infix operator for composing a lambda expression
`[`(<tbl_spark>): Subsetting operator for Spark dataframe
generic_call_interface: Generic Call Interface