For a recipe with at least one preprocessing operation that has been trained by prep(), apply the computations to new data.
bake(object, ...)
# S3 method for class 'recipe'
bake(object, new_data, ..., composition = "tibble")
object: A trained object such as a recipe() with at least one preprocessing operation.
...: One or more selector functions to choose which variables will be returned by the function. See selections() for more details. If no selectors are given, the default is to use dplyr::everything().
new_data: A data frame or tibble to which the preprocessing will be applied. If NULL is given to new_data, the pre-processed training data will be returned (assuming that prep(retain = TRUE) was used).
composition: Either "tibble", "matrix", "data.frame", or "dgCMatrix" for the format of the processed data set. Note that all computations during the baking process are done in a non-sparse format. Also, note that this argument should be specified after any selectors, and the selectors should only resolve to numeric columns (otherwise an error is thrown).
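The ordering rule for composition can be illustrated with a minimal sketch. It assumes the built-in mtcars data (all columns numeric) as a stand-in for real data; the object names rec and x_mat are illustrative only.

```r
library(recipes)

# Assumption: mtcars stands in for real data; every column is numeric,
# so the selector below resolves to numeric columns only.
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric_predictors()) %>%
  prep()

# Selectors come first; composition is passed by name after them.
x_mat <- bake(
  rec,
  new_data = head(mtcars),
  all_predictors(),
  composition = "matrix"
)

is.matrix(x_mat)
dim(x_mat)
```

Naming composition explicitly keeps it from being interpreted as one of the selectors captured by the dots.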
A tibble, matrix, or sparse matrix that may have different columns than the original columns in new_data.
bake() takes a trained recipe and applies its operations to a data set to create a design matrix. If you are using a recipe as a preprocessor for modeling, we highly recommend that you use a workflow() instead of manually applying a recipe (see the example in recipe()).
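As a rough sketch of the workflow-based alternative, the following bundles an untrained recipe with a model so that preparation and baking happen internally. It assumes the parsnip and workflows packages are installed; mtcars, the linear regression model, and the names rec, wf_fit, and preds are illustrative choices, not part of the original example.

```r
library(recipes)
library(parsnip)
library(workflows)

# Assumption: mtcars and a linear model are placeholders for a real problem.
# The recipe is left untrained; the workflow preps it during fit().
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric_predictors())

wf_fit <- workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg()) %>%
  fit(data = mtcars)

# predict() applies the trained recipe to new_data before predicting.
preds <- predict(wf_fit, new_data = head(mtcars))
preds
```

With this approach there is no explicit call to prep() or bake(); the workflow handles both.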
If the data set is not too large, time can be saved by using the retain = TRUE option of prep(). This stores the processed version of the training set. With this option set, bake(object, new_data = NULL) returns the stored data without any additional computation.
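A minimal sketch of retrieving the retained training set, assuming mtcars as stand-in data and illustrative object names:

```r
library(recipes)

# Assumption: mtcars is a placeholder training set.
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_center(all_numeric_predictors()) %>%
  prep(retain = TRUE)

# Returns the processed training set stored inside the trained recipe.
train_baked <- bake(rec, new_data = NULL)

# Re-applies every step to the supplied data instead.
rebaked <- bake(rec, new_data = mtcars)

dim(train_baked)
dim(rebaked)
```

Both calls produce data with the same shape here, but only the second one recomputes the steps.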
Also, any steps with skip = TRUE will not be applied to the data when bake() is invoked with a data set in new_data. bake(object, new_data = NULL) will always have all of the steps applied.
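The effect of skip = TRUE can be sketched as follows, assuming mtcars as stand-in data and a log transformation of the outcome as the skipped step (both are illustrative choices):

```r
library(recipes)

# Assumption: mtcars is a placeholder; the outcome mpg is logged
# in a step that is skipped when baking new data.
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_log(mpg, skip = TRUE) %>%
  prep()

# All steps applied: mpg is on the log scale.
trained <- bake(rec, new_data = NULL)

# Skipped step not applied: mpg keeps its original scale.
new <- bake(rec, new_data = head(mtcars))

trained$mpg[1]
new$mpg[1]
```

This is why values returned for new data can differ in scale from the retained training set when a recipe contains skipped steps.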
# recipes attaches dplyr, which provides mutate() and the pipe
library(recipes)

data(ames, package = "modeldata")
ames <- mutate(ames, Sale_Price = log10(Sale_Price))

# train the recipe on all rows except the first six
ames_rec <-
  recipe(Sale_Price ~ ., data = ames[-(1:6), ]) %>%
  step_other(Neighborhood, threshold = 0.05) %>%
  step_dummy(all_nominal()) %>%
  step_interact(~ starts_with("Central_Air"):Year_Built) %>%
  step_ns(Longitude, Latitude, deg_free = 2) %>%
  step_zv(all_predictors()) %>%
  prep()
# return the training set (already embedded in ames_rec)
bake(ames_rec, new_data = NULL)
#> # A tibble: 2,924 × 259
#> Lot_Frontage Lot_Area Year_Built Year_Remod_Add Mas_Vnr_Area BsmtFin_SF_1
#> <dbl> <int> <int> <int> <dbl> <dbl>
#> 1 41 4920 2001 2001 0 3
#> 2 43 5005 1992 1992 0 1
#> 3 39 5389 1995 1996 0 3
#> 4 60 7500 1999 1999 0 7
#> 5 75 10000 1993 1994 0 7
#> 6 0 7980 1992 2007 0 1
#> 7 63 8402 1998 1998 0 7
#> 8 85 10176 1990 1990 0 3
#> 9 0 6820 1985 1985 0 3
#> 10 47 53504 2003 2003 603 1
#> # ℹ 2,914 more rows
#> # ℹ 253 more variables: BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>,
#> # Total_Bsmt_SF <dbl>, First_Flr_SF <int>, Second_Flr_SF <int>,
#> # Gr_Liv_Area <int>, Bsmt_Full_Bath <dbl>, Bsmt_Half_Bath <dbl>,
#> # Full_Bath <int>, Half_Bath <int>, Bedroom_AbvGr <int>, Kitchen_AbvGr <int>,
#> # TotRms_AbvGrd <int>, Fireplaces <int>, Garage_Cars <dbl>,
#> # Garage_Area <dbl>, Wood_Deck_SF <int>, Open_Porch_SF <int>, …
# apply processing to other data:
bake(ames_rec, new_data = head(ames))
#> # A tibble: 6 × 259
#> Lot_Frontage Lot_Area Year_Built Year_Remod_Add Mas_Vnr_Area BsmtFin_SF_1
#> <dbl> <int> <int> <int> <dbl> <dbl>
#> 1 141 31770 1960 1960 112 2
#> 2 80 11622 1961 1961 0 6
#> 3 81 14267 1958 1958 108 1
#> 4 93 11160 1968 1968 0 1
#> 5 74 13830 1997 1998 0 3
#> 6 78 9978 1998 1998 20 3
#> # ℹ 253 more variables: BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>,
#> # Total_Bsmt_SF <dbl>, First_Flr_SF <int>, Second_Flr_SF <int>,
#> # Gr_Liv_Area <int>, Bsmt_Full_Bath <dbl>, Bsmt_Half_Bath <dbl>,
#> # Full_Bath <int>, Half_Bath <int>, Bedroom_AbvGr <int>, Kitchen_AbvGr <int>,
#> # TotRms_AbvGrd <int>, Fireplaces <int>, Garage_Cars <dbl>,
#> # Garage_Area <dbl>, Wood_Deck_SF <int>, Open_Porch_SF <int>,
#> # Enclosed_Porch <int>, Three_season_porch <int>, Screen_Porch <int>, …
# only return selected variables:
bake(ames_rec, new_data = head(ames), all_numeric_predictors())
#> # A tibble: 6 × 258
#> Lot_Frontage Lot_Area Year_Built Year_Remod_Add Mas_Vnr_Area BsmtFin_SF_1
#> <dbl> <int> <int> <int> <dbl> <dbl>
#> 1 141 31770 1960 1960 112 2
#> 2 80 11622 1961 1961 0 6
#> 3 81 14267 1958 1958 108 1
#> 4 93 11160 1968 1968 0 1
#> 5 74 13830 1997 1998 0 3
#> 6 78 9978 1998 1998 20 3
#> # ℹ 252 more variables: BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>,
#> # Total_Bsmt_SF <dbl>, First_Flr_SF <int>, Second_Flr_SF <int>,
#> # Gr_Liv_Area <int>, Bsmt_Full_Bath <dbl>, Bsmt_Half_Bath <dbl>,
#> # Full_Bath <int>, Half_Bath <int>, Bedroom_AbvGr <int>, Kitchen_AbvGr <int>,
#> # TotRms_AbvGrd <int>, Fireplaces <int>, Garage_Cars <dbl>,
#> # Garage_Area <dbl>, Wood_Deck_SF <int>, Open_Porch_SF <int>,
#> # Enclosed_Porch <int>, Three_season_porch <int>, Screen_Porch <int>, …
bake(ames_rec, new_data = head(ames), starts_with(c("Longitude", "Latitude")))
#> # A tibble: 6 × 4
#> Longitude_ns_1 Longitude_ns_2 Latitude_ns_1 Latitude_ns_2
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.570 -0.0141 0.472 0.394
#> 2 0.570 -0.0142 0.481 0.360
#> 3 0.569 -0.00893 0.484 0.348
#> 4 0.563 0.0212 0.496 0.301
#> 5 0.562 -0.212 0.405 0.634
#> 6 0.562 -0.212 0.407 0.630