From ed633288dc806bc308d92fbc7e4cfa142e4c3e81 Mon Sep 17 00:00:00 2001
From: Nathan Moore
Date: Tue, 10 Dec 2019 15:11:45 +1300
Subject: [PATCH] update pipelines

refer to chapter 4 modelling (not chapter 5)
comment out (((purpose of))) in para 2 (maybe meant to be a reference?)
---
 pipelines.Rmd | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/pipelines.Rmd b/pipelines.Rmd
index a255089..8b16f9c 100644
--- a/pipelines.Rmd
+++ b/pipelines.Rmd
@@ -13,7 +13,7 @@ source("r/render.R")
 
 In [Chapter 4](#modeling), you learned how to build predictive models using the high-level functions Spark provides and well-known R packages that work well together with Spark. You learned about supervised methods first and finished the chapter with an unsupervised method over raw text.
 
-In this chapter, we dive into Spark Pipelines, which is the engine that powers the features we demonstrated in [Chapter 5](#modeling). So, for instance, when you invoke an `MLlib` function via the formula interface in R—for example, `ml_logistic_regression(cars, am ~ .)`—a _pipeline_ is constructed for you under the hood. Therefore, Pipelines((("pipelines", "purpose of"))) also allow you to make use of advanced data processing and modeling workflows. In addition, a pipeline also facilitates collaboration across data science and engineering teams by allowing you to _deploy_ pipelines into production systems, web applications, mobile applications, and so on.
+In this chapter, we dive into Spark Pipelines, which is the engine that powers the features we demonstrated in [Chapter 4](#modeling). So, for instance, when you invoke an `MLlib` function via the formula interface in R—for example, `ml_logistic_regression(cars, am ~ .)`—a _pipeline_ is constructed for you under the hood. Therefore, Pipelines also allow you to make use of advanced data processing and modeling workflows. In addition, a pipeline also facilitates collaboration across data science and engineering teams by allowing you to _deploy_ pipelines into production systems, web applications, mobile applications, and so on.
 
 This chapter also happens to be the last chapter that encourages using your local computer as a Spark cluster. You are just one chapter away from getting properly introduced to cluster computing and beginning to perform data science or machine learning that can scale to the most demanding computation problems.
 
@@ -311,7 +311,7 @@ These functions are implemented using [S3](https://adv-r.hadley
 - If a DataFrame is provided to a feature transformer function (those with prefix `ft_`), or an ML algorithm without also providing a formula, the function instantiates the pipeline stage object, fits it to the data if necessary (if the stage is an estimator), and then transforms the DataFrame returning a DataFrame.
 - If a DataFrame and a formula are provided to an ML algorithm that supports the formula interface, `sparklyr` builds a pipeline model under the hood and returns an ML model object that contains additional metadata information.
 
-The formula interface approach is what we studied in [Chapter 5](#modeling), and this is what we recommend users new to Spark start with, since its syntax is similar to existing R modeling packages and abstracts away some Spark ML peculiarities. However, to take advantage of the full power of Spark ML and leverage pipelines for workflow organization and interoperability, it is worthwhile to learn the ML Pipelines API.
+The formula interface approach is what we studied in [Chapter 4](#modeling), and this is what we recommend users new to Spark start with, since its syntax is similar to existing R modeling packages and abstracts away some Spark ML peculiarities. However, to take advantage of the full power of Spark ML and leverage pipelines for workflow organization and interoperability, it is worthwhile to learn the ML Pipelines API.
 
 With the basics of pipelines down, we are now ready to discuss the collaboration and model deployment aspects hinted at in the introduction of this chapter.
 
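
Appended below the patch: a minimal sketch (not part of the diff) of the equivalence the changed paragraph describes, namely that a formula-interface call and an explicitly built pipeline amount to the same thing. It assumes a local Spark connection and the chapter's `cars` dataset copied from `mtcars`.

library(sparklyr)

sc <- spark_connect(master = "local")
cars <- copy_to(sc, mtcars)

# Formula interface: sparklyr constructs a pipeline under the hood
model <- ml_logistic_regression(cars, am ~ .)

# Roughly the same pipeline, built explicitly with the ML Pipelines API
pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(am ~ .) %>%
  ml_logistic_regression()

fitted <- ml_fit(pipeline, cars)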