Spark ML – Bisecting K-Means Clustering
Source: R/ml_clustering_bisecting_kmeans.R
ml_bisecting_kmeans.Rd
A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modification to fit Spark. The algorithm starts from a single cluster that contains all points. It iteratively finds divisible clusters on the bottom level and bisects each of them using k-means, until there are k leaf clusters in total or no leaf clusters are divisible. The bisecting steps of clusters on the same level are grouped together to increase parallelism. If bisecting all divisible clusters on the bottom level would result in more than k leaf clusters, larger clusters get higher priority.
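The procedure described above can be sketched in base R. This is a minimal single-machine illustration, not Spark's implementation: the function name `bisecting_kmeans_sketch` and its arguments are hypothetical, each split uses `stats::kmeans` with two centers, and a leaf is treated as divisible when it meets the size threshold and holds at least two distinct points.

```r
# Illustrative bisecting k-means in base R (hypothetical helper, not the Spark code).
bisecting_kmeans_sketch <- function(x, k, min_size = 1, max_iter = 20) {
  x <- as.matrix(x)
  # Start from a single cluster that contains all points.
  clusters <- list(seq_len(nrow(x)))
  while (length(clusters) < k) {
    sizes <- lengths(clusters)
    # Assumption: a leaf is divisible if it is large enough and has
    # at least two distinct points to separate.
    divisible <- which(vapply(clusters, function(idx) {
      length(idx) >= max(min_size, 2) &&
        nrow(unique(x[idx, , drop = FALSE])) >= 2
    }, logical(1)))
    if (length(divisible) == 0) break
    # Each split adds one leaf, so at most (k - current leaves) splits fit.
    # Larger clusters get priority when not all of them can be split.
    budget <- k - length(clusters)
    divisible <- divisible[order(sizes[divisible], decreasing = TRUE)]
    divisible <- head(divisible, budget)
    new_clusters <- clusters[-divisible]
    for (i in divisible) {
      idx <- clusters[[i]]
      fit <- stats::kmeans(x[idx, , drop = FALSE], centers = 2, iter.max = max_iter)
      new_clusters <- c(
        new_clusters,
        list(idx[fit$cluster == 1], idx[fit$cluster == 2])
      )
    }
    clusters <- new_clusters
  }
  # Convert the list of leaves into one cluster label per row.
  assignment <- integer(nrow(x))
  for (j in seq_along(clusters)) assignment[clusters[[j]]] <- j
  assignment
}

set.seed(1)
labels <- bisecting_kmeans_sketch(iris[, 1:4], k = 4)
table(labels)
```

The sketch splits leaves one level at a time in a plain loop; Spark groups the splits on a level so they can run in parallel.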
Usage
ml_bisecting_kmeans(
  x,
  formula = NULL,
  k = 4,
  max_iter = 20,
  seed = NULL,
  min_divisible_cluster_size = 1,
  features_col = "features",
  prediction_col = "prediction",
  uid = random_string("bisecting_bisecting_kmeans_"),
  ...
)
Arguments
- x
A spark_connection, ml_pipeline, or a tbl_spark.
- formula
Used when x is a tbl_spark. R formula as a character string or a formula. This is used to transform the input dataframe before fitting, see ft_r_formula for details.
- k
The number of clusters to create.
- max_iter
The maximum number of iterations to use.
- seed
A random seed. Set this value if you need your results to be reproducible across repeated calls.
- min_divisible_cluster_size
The minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster (default: 1.0).
- features_col
Features column name, as a length-one character vector. The column should be a single vector column of numeric values. Usually this column is output by ft_r_formula.
- prediction_col
Prediction column name.
- uid
A character string used to uniquely identify the ML estimator.
- ...
Optional arguments, see Details.
Value
The object returned depends on the class of x. If it is a spark_connection, the function returns an ml_estimator object. If it is an ml_pipeline, it will return a pipeline with the predictor appended to it. If a tbl_spark, it will return a tbl_spark with the predictions added to it.
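The two readings of min_divisible_cluster_size (an absolute count when the value is at least 1, a proportion when it is below 1) can be made concrete with a small base R helper. The name `min_divisible_points` is hypothetical and not part of sparklyr, and rounding up to a whole number of points is an assumption made for illustration.

```r
# Hypothetical helper: turn min_divisible_cluster_size into a minimum point count.
min_divisible_points <- function(min_divisible_cluster_size, n_points) {
  stopifnot(min_divisible_cluster_size > 0, n_points >= 0)
  if (min_divisible_cluster_size >= 1) {
    # Values >= 1 are read as an absolute number of points.
    ceiling(min_divisible_cluster_size)
  } else {
    # Values < 1 are read as a proportion of all points
    # (rounding up is an assumption of this sketch).
    ceiling(min_divisible_cluster_size * n_points)
  }
}

min_divisible_points(5, 200)     # treated as a count of points
min_divisible_points(0.25, 200)  # treated as a share of the 200 points
```

With the default of 1, any leaf with enough distinct points to separate remains a candidate for splitting.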
Examples
if (FALSE) { # \dontrun{
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
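# Species is the formula's left-hand side, so it must stay in the table;
# the right-hand side `.` uses the remaining columns as features.
iris_tbl %>%
  ml_bisecting_kmeans(Species ~ ., k = 4)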
} # }