ML Pipelines

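This section introduces the concept of ML Pipelines. ML Pipelines provide a uniform set of high-level APIs, built on top of DataFrames, that help users create and tune practical machine learning pipelines.

Main concepts in Pipelines
Code examples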

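Main concepts in Pipelines

MLlib standardizes the APIs for machine learning algorithms so that multiple algorithms can easily be combined into a single pipeline, or workflow. This section covers the key concepts introduced by the Pipelines API. The pipeline concept is mostly inspired by the scikit-learn project.

DataFrame: This ML API uses the DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. For example, a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.
Transformer: A Transformer is an algorithm that can transform one DataFrame into another DataFrame. For example, an ML model is a Transformer that transforms a DataFrame with features into a DataFrame with predictions.
Estimator: An Estimator is an algorithm that can be fit on a DataFrame to produce a Transformer. For example, a learning algorithm such as LogisticRegression is an Estimator; calling fit() trains a LogisticRegressionModel, which is a Model and hence a Transformer.
Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
Parameter: All Transformers and Estimators now share a common API for specifying parameters.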

DataFrame

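Machine learning can be applied to a wide variety of data types, such as vectors, text, images, and structured data. This API adopts the DataFrame from Spark SQL in order to support these data types.

DataFrame supports many basic and structured types; see the Spark SQL datatype reference for a list of supported types. In addition to the types listed in the Spark SQL guide, a DataFrame can use ML Vector types.

A DataFrame can be created either implicitly or explicitly from a regular RDD. See the code examples below and the Spark SQL programming guide for examples.

Columns in a DataFrame are named. The code examples below use names such as "text", "features", and "label".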

Pipeline components

Transformer

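A Transformer is an abstraction that includes feature transformers and learned models. Technically, a Transformer implements a transform() method, which converts one DataFrame into another, generally by appending one or more columns. For example:

A feature transformer might take a DataFrame, read a column (e.g., text), map it into a new column (e.g., feature vectors), and output a new DataFrame with the mapped column appended.
A learning model might take a DataFrame, read the column containing feature vectors, predict the label for each feature vector, and output a new DataFrame with the predicted labels appended as a column.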

Estimator

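An Estimator abstracts the concept of a learning algorithm, or of any algorithm that fits or trains on data. Technically, an Estimator implements a fit() method, which accepts a DataFrame and produces a Model, which is a Transformer. For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Model and hence a Transformer.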
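The relationship between the two abstractions can be sketched in plain Python. This is a toy illustration, not Spark code: a list of dicts stands in for a DataFrame, and the names MeanCenter and MeanCenterModel are hypothetical. It only shows the shape of the contract, where fit() returns an object that has transform(), and transform() appends a column without modifying its input.

```python
from dataclasses import dataclass
from typing import Dict, List

# Toy stand-in for a DataFrame: a list of dict rows (an assumption for illustration).
Rows = List[Dict[str, float]]


@dataclass
class MeanCenterModel:
    """Toy Transformer: a fitted model that holds the learned mean."""
    input_col: str
    output_col: str
    mean: float

    def transform(self, rows: Rows) -> Rows:
        # Build new rows with one column appended; the input rows are not mutated.
        return [{**r, self.output_col: r[self.input_col] - self.mean} for r in rows]


@dataclass
class MeanCenter:
    """Toy Estimator: fit() learns a mean from the data and returns a Transformer."""
    input_col: str
    output_col: str

    def fit(self, rows: Rows) -> MeanCenterModel:
        mean = sum(r[self.input_col] for r in rows) / len(rows)
        return MeanCenterModel(self.input_col, self.output_col, mean)


data = [{"x": 1.0}, {"x": 2.0}, {"x": 6.0}]
model = MeanCenter("x", "x_centered").fit(data)   # Estimator -> Transformer
centered = model.transform(data)                  # appends the "x_centered" column
print(model.mean)
print(centered)
```

In this sketch the estimator itself holds only configuration; everything learned from the data lives in the returned model object.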

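Properties of pipeline components

Transformer.transform() and Estimator.fit() are both stateless. In the future, stateful algorithms may be supported via alternative concepts.

Each instance of a Transformer or Estimator has a unique ID, which is useful when specifying parameters (discussed below).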

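Pipeline

In machine learning, it is common to run a sequence of algorithms to process and learn from data. For example, a simple text document processing workflow might include several stages:

Split each document's text into words.
Convert each document's words into a numerical feature vector.
Learn a prediction model using the feature vectors and labels.

MLlib represents such a workflow as a Pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators) to be run in a specific order. This section uses this simple workflow as a running example.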

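How it works

A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. For Transformer stages, the transform() method is called on the DataFrame. For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline), and that Transformer's transform() method is called on the DataFrame.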
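The procedure described above can be sketched as a short loop. The following is a simplified, plain-Python illustration and not Spark's implementation: rows are dicts, the stage classes (Lowercase, VocabularyCounter, VocabularyModel) and the helpers fit_pipeline and transform_pipeline are hypothetical, and an Estimator is recognized simply by having a fit() method.

```python
class Lowercase:
    """Toy Transformer stage: has transform() but no fit()."""
    def transform(self, rows):
        return [{**r, "lower": r["text"].lower()} for r in rows]


class VocabularyModel:
    """Toy Transformer produced by fitting VocabularyCounter."""
    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, rows):
        # Append a per-row count vector over the learned vocabulary.
        return [{**r, "counts": [r["lower"].split().count(w) for w in self.vocab]}
                for r in rows]


class VocabularyCounter:
    """Toy Estimator stage: fit() learns a vocabulary and returns a Transformer."""
    def fit(self, rows):
        vocab = sorted({w for r in rows for w in r["lower"].split()})
        return VocabularyModel(vocab)


def fit_pipeline(stages, rows):
    """Run the stages in order and return the list of fitted Transformers."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):
            stage = stage.fit(rows)    # Estimator stage: fit to get a Transformer
        rows = stage.transform(rows)   # every stage then transforms the data
        fitted.append(stage)
    return fitted


def transform_pipeline(fitted, rows):
    """Apply each fitted stage in order, as a fitted pipeline does at test time."""
    for stage in fitted:
        rows = stage.transform(rows)
    return rows


train = [{"text": "Spark ML"}, {"text": "spark pipelines"}]
fitted = fit_pipeline([Lowercase(), VocabularyCounter()], train)
result = transform_pipeline(fitted, [{"text": "Spark spark Hadoop"}])
print(fitted[1].vocab)
print(result[0]["counts"])
```

The list returned by fit_pipeline plays the role of the fitted pipeline: it has the same number of stages as the input, but every Estimator has been replaced by the Transformer it produced.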

We illustrate this for the simple text document workflow. The figure below shows the *training time* usage of a Pipeline.

[Figure: ML Pipeline Example]

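In the figure above, the top row represents a Pipeline with three stages. The first two (Tokenizer and HashingTF) are Transformers (blue), and the third (LogisticRegression) is an Estimator (red). The bottom row represents data flowing through the pipeline, where cylinders indicate DataFrames. The Pipeline.fit() method is called on the original DataFrame, which has raw text documents and labels. The Tokenizer.transform() method splits the raw text documents into words, adding a new column with the words to the DataFrame. The HashingTF.transform() method converts the words column into feature vectors, adding a new column with those vectors to the DataFrame. Now, since LogisticRegression is an Estimator, the Pipeline first calls LogisticRegression.fit() to produce a LogisticRegressionModel. If the Pipeline had more Estimators, it would call the LogisticRegressionModel's transform() method on the DataFrame before passing the DataFrame to the next stage.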

A Pipeline is an Estimator. Thus, after a Pipeline's fit() method runs, it produces a PipelineModel, which is a Transformer. This PipelineModel is used at *test time*; the figure below illustrates this usage.

[Figure: ML PipelineModel Example]

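In the figure above, the PipelineModel has the same number of stages as the original Pipeline, but all Estimators in the original Pipeline have become Transformers. When the PipelineModel's transform() method is called on a test dataset, the data are passed through the fitted pipeline in order. Each stage's transform() method updates the dataset and passes it to the next stage.

Pipelines and PipelineModels help to ensure that training and test data go through identical feature processing steps.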

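Details

DAG Pipelines: A Pipeline's stages are specified as an ordered array. The examples given here are all linear Pipelines, i.e., Pipelines in which each stage uses data produced by the previous stage. It is possible to create non-linear Pipelines as long as the data flow graph forms a Directed Acyclic Graph (DAG). This graph is currently specified implicitly, based on the input and output column names of each stage (generally specified as parameters). If the Pipeline forms a DAG, then the stages must be specified in topological order.

Runtime checking: Since Pipelines can operate on DataFrames with varied types, they cannot use compile-time type checking. Pipelines and PipelineModels instead do runtime checking before actually running the Pipeline. This type checking is done using the DataFrame *schema*, a description of the data types of the columns in the DataFrame.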
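To make the idea of a schema-based check concrete, here is a minimal plain-Python sketch. It is an illustration under stated assumptions, not Spark's actual validation logic: the schema is modeled as a dict from column name to a type name string, each stage is a hypothetical tuple describing its input and output columns, and check_pipeline_schema is an invented helper. It walks the stages in order without touching any data, which also shows why stages must be listed in topological order.

```python
from typing import Dict, List, Tuple

# Hypothetical stage description:
# (stage name, input column, expected input type, output column, output type)
StageSpec = Tuple[str, str, str, str, str]


def check_pipeline_schema(schema: Dict[str, str], stages: List[StageSpec]) -> Dict[str, str]:
    """Validate the stages against a schema and return the resulting schema."""
    current = dict(schema)  # work on a copy; no data is read or changed
    for name, in_col, in_type, out_col, out_type in stages:
        if in_col not in current:
            raise ValueError(f"{name}: input column '{in_col}' does not exist")
        if current[in_col] != in_type:
            raise ValueError(
                f"{name}: column '{in_col}' has type {current[in_col]}, expected {in_type}")
        current[out_col] = out_type  # the stage appends its output column
    return current


stages = [
    ("tokenizer", "text", "string", "words", "array<string>"),
    ("hashingTF", "words", "array<string>", "features", "vector"),
]
final_schema = check_pipeline_schema({"text": "string", "label": "double"}, stages)
print(final_schema)

# Listing the stages out of topological order fails before any data is processed.
try:
    check_pipeline_schema({"text": "string", "label": "double"}, list(reversed(stages)))
except ValueError as err:
    print("rejected:", err)
```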

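Unique Pipeline stages: A Pipeline's stages should be unique instances. For example, the same instance myHashingTF should not be inserted into the Pipeline twice, since Pipeline stages must have unique IDs. However, different instances myHashingTF1 and myHashingTF2 (both of type HashingTF) can be put into the same Pipeline, since different instances are created with different IDs.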

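Parameters

MLlib Estimators and Transformers use a uniform API for specifying parameters.

A Param is a named parameter with self-contained documentation. A ParamMap is a set of (parameter, value) pairs.

There are two main ways to pass parameters to an algorithm:

Set parameters for an instance. For example, if lr is an instance of LogisticRegression, one could call lr.setMaxIter(10) to make lr.fit() use at most 10 iterations. This API resembles the API used in the spark.mllib package.
Pass a ParamMap to fit() or transform(). Any parameters in the ParamMap override parameters previously specified via setter methods.

Parameters belong to specific instances of Estimators and Transformers. For example, given two LogisticRegression instances lr1 and lr2, we can build a ParamMap with both maxIter parameters specified: ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20). This is useful if there are two algorithms with the maxIter parameter in a Pipeline.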
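The override order and the per-instance ownership of parameters can be modeled in a few lines of plain Python. This is a toy sketch, not Spark's Params implementation: ToyParams, its set() and resolve() methods, and the use of (uid, name) tuples as map keys are assumptions made for illustration, and the default value shown is only an example.

```python
class ToyParams:
    """Toy model of an instance that owns its parameters, identified by a unique ID."""

    def __init__(self, uid, **defaults):
        self.uid = uid
        self._defaults = dict(defaults)
        self._set = {}

    def set(self, name, value):
        # Plays the role of a setter method such as setMaxIter().
        self._set[name] = value
        return self

    def resolve(self, param_map=None):
        """Effective values: defaults, then setter values, then param_map entries."""
        resolved = {**self._defaults, **self._set}
        for (uid, name), value in (param_map or {}).items():
            if uid == self.uid:  # entries owned by other instances are ignored
                resolved[name] = value
        return resolved


lr1 = ToyParams("lr_1", maxIter=100).set("maxIter", 10)
lr2 = ToyParams("lr_2", maxIter=100)

# One map can carry values for both instances because keys include the owner's ID.
param_map = {("lr_1", "maxIter"): 30, ("lr_2", "maxIter"): 20}

print(lr1.resolve())           # setter value wins over the default
print(lr1.resolve(param_map))  # map entry wins over the setter value
print(lr2.resolve(param_map))  # lr2 picks up only its own entry
```

Keying the map by owner as well as by name is what lets two stages with an identically named parameter be configured independently within one pipeline.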

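ML persistence: Saving and Loading Pipelines

It is often worth saving a model or a pipeline to disk for later use. In Spark 1.6, model import/export functionality was added to the Pipeline API. As of Spark 2.3, the DataFrame-based API in spark.ml and pyspark.ml has complete coverage.

ML persistence works across Scala, Java, and Python. However, R currently uses a modified format, so models saved in R can only be loaded back in R. This should be fixed in the future and is tracked in SPARK-15572.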

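Backwards compatibility for ML persistence

In general, MLlib maintains backwards compatibility for ML persistence. That is, if you save an ML model or Pipeline in one version of Spark, you should be able to load it back and use it in a future version of Spark. However, there are rare exceptions, described below.

Model persistence: Is a model or Pipeline saved using Apache Spark ML persistence in Spark version X loadable by Spark version Y?

Major versions: No guarantees, but best-effort.
Minor and patch versions: Yes; these are backwards compatible.
Note about the format: There are no guarantees of a stable persistence format, but model loading itself is designed to be backwards compatible.

Model behavior: Does a model or Pipeline in Spark version X behave identically in Spark version Y?

Major versions: No guarantees, but best-effort.
Minor and patch versions: Identical behavior, except for bug fixes.

For both model persistence and model behavior, any breaking change across a minor version or patch version is reported in the Spark version release notes. If a breakage is not reported in the release notes, it should be treated as a bug to be fixed.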

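Code examples

This section gives code examples illustrating the functionality discussed above. For more information, refer to the API documentation (Scala, Java, and Python).

Example: Estimator, Transformer, and Param

This example covers the concepts of Estimator, Transformer, and Param.

Refer to the Estimator Python docs, the Transformer Python docs, and the Params Python docs for more details on the API.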

from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression

# Prepare training data from a list of (label, features) tuples.
training = spark.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])

# Create a LogisticRegression instance. This instance is an Estimator.
lr = LogisticRegression(maxIter=10, regParam=0.01)
# Print out the parameters, documentation, and any default values.
print("LogisticRegression parameters:\n" + lr.explainParams() + "\n")

# Learn a LogisticRegression model. This uses the parameters stored in lr.
model1 = lr.fit(training)

# Since model1 is a Model (i.e., a transformer produced by an Estimator),
# we can view the parameters it used during fit().
# This prints the parameter (name: value) pairs, where names are unique IDs for this
# LogisticRegression instance.
print("Model 1 was fit using parameters: ")
print(model1.extractParamMap())

# We may alternatively specify parameters using a Python dictionary as a paramMap
paramMap = {lr.maxIter: 20}
paramMap[lr.maxIter] = 30  # Specify 1 Param, overwriting the original maxIter.
# Specify multiple Params.
paramMap.update({lr.regParam: 0.1, lr.threshold: 0.55})  # type: ignore

# You can combine paramMaps, which are python dictionaries.
# Change output column name
paramMap2 = {lr.probabilityCol: "myProbability"}
paramMapCombined = paramMap.copy()
paramMapCombined.update(paramMap2)  # type: ignore

# Now learn a new model using the paramMapCombined parameters.
# paramMapCombined overrides all parameters set earlier via lr.set* methods.
model2 = lr.fit(training, paramMapCombined)
print("Model 2 was fit using parameters: ")
print(model2.extractParamMap())

# Prepare test data
test = spark.createDataFrame([
    (1.0, Vectors.dense([-1.0, 1.5, 1.3])),
    (0.0, Vectors.dense([3.0, 2.0, -0.1])),
    (1.0, Vectors.dense([0.0, 2.2, -1.5]))], ["label", "features"])

# Make predictions on test data using the Transformer.transform() method.
# LogisticRegression.transform will only use the 'features' column.
# Note that model2.transform() outputs a "myProbability" column instead of the usual
# 'probability' column since we renamed the lr.probabilityCol parameter previously.
prediction = model2.transform(test)
result = prediction.select("features", "label", "myProbability", "prediction") \
    .collect()

for row in result:
    print("features=%s, label=%s -> prob=%s, prediction=%s"
          % (row.features, row.label, row.myProbability, row.prediction))

Find the full example code at "examples/src/main/python/ml/estimator_transformer_param_example.py" in the Spark repo.

Refer to the Estimator Scala docs, the Transformer Scala docs, and the Params Scala docs for more details on the API.

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Row

// Prepare training data from a list of (label, features) tuples.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

// Create a LogisticRegression instance. This instance is an Estimator.
val lr = new LogisticRegression()
// Print out the parameters, documentation, and any default values.
println(s"LogisticRegression parameters:\n ${lr.explainParams()}\n")

// We may set parameters using setter methods.
lr.setMaxIter(10)
  .setRegParam(0.01)

// Learn a LogisticRegression model. This uses the parameters stored in lr.
val model1 = lr.fit(training)
// Since model1 is a Model (i.e., a Transformer produced by an Estimator),
// we can view the parameters it used during fit().
// This prints the parameter (name: value) pairs, where names are unique IDs for this
// LogisticRegression instance.
println(s"Model 1 was fit using parameters: ${model1.parent.extractParamMap}")

// We may alternatively specify parameters using a ParamMap,
// which supports several methods for specifying parameters.
val paramMap = ParamMap(lr.maxIter -> 20)
  .put(lr.maxIter, 30)  // Specify 1 Param. This overwrites the original maxIter.
  .put(lr.regParam -> 0.1, lr.threshold -> 0.55)  // Specify multiple Params.

// One can also combine ParamMaps.
val paramMap2 = ParamMap(lr.probabilityCol -> "myProbability")  // Change output column name.
val paramMapCombined = paramMap ++ paramMap2

// Now learn a new model using the paramMapCombined parameters.
// paramMapCombined overrides all parameters set earlier via lr.set* methods.
val model2 = lr.fit(training, paramMapCombined)
println(s"Model 2 was fit using parameters: ${model2.parent.extractParamMap}")

// Prepare test data.
val test = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(-1.0, 1.5, 1.3)),
  (0.0, Vectors.dense(3.0, 2.0, -0.1)),
  (1.0, Vectors.dense(0.0, 2.2, -1.5))
)).toDF("label", "features")

// Make predictions on test data using the Transformer.transform() method.
// LogisticRegression.transform will only use the 'features' column.
// Note that model2.transform() outputs a 'myProbability' column instead of the usual
// 'probability' column since we renamed the lr.probabilityCol parameter previously.
model2.transform(test)
  .select("features", "label", "myProbability", "prediction")
  .collect()
  .foreach { case Row(features: Vector, label: Double, prob: Vector, prediction: Double) =>
    println(s"($features, $label) -> prob=$prob, prediction=$prediction")
  }

Find the full example code at "examples/src/main/scala/org/apache/spark/examples/ml/EstimatorTransformerParamExample.scala" in the Spark repo.

Refer to the Estimator Java docs, the Transformer Java docs, and the Params Java docs for more details on the API.

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.ml.param.ParamMap;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Prepare training data.
List<Row> dataTraining = Arrays.asList(
    RowFactory.create(1.0, Vectors.dense(0.0, 1.1, 0.1)),
    RowFactory.create(0.0, Vectors.dense(2.0, 1.0, -1.0)),
    RowFactory.create(0.0, Vectors.dense(2.0, 1.3, 1.0)),
    RowFactory.create(1.0, Vectors.dense(0.0, 1.2, -0.5))
);
StructType schema = new StructType(new StructField[]{
    new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
    new StructField("features", new VectorUDT(), false, Metadata.empty())
});
Dataset<Row> training = spark.createDataFrame(dataTraining, schema);

// Create a LogisticRegression instance. This instance is an Estimator.
LogisticRegression lr = new LogisticRegression();
// Print out the parameters, documentation, and any default values.
System.out.println("LogisticRegression parameters:\n" + lr.explainParams() + "\n");

// We may set parameters using setter methods.
lr.setMaxIter(10).setRegParam(0.01);

// Learn a LogisticRegression model. This uses the parameters stored in lr.
LogisticRegressionModel model1 = lr.fit(training);
// Since model1 is a Model (i.e., a Transformer produced by an Estimator),
// we can view the parameters it used during fit().
// This prints the parameter (name: value) pairs, where names are unique IDs for this
// LogisticRegression instance.
System.out.println("Model 1 was fit using parameters: " + model1.parent().extractParamMap());

// We may alternatively specify parameters using a ParamMap.
ParamMap paramMap = new ParamMap()
  .put(lr.maxIter().w(20))  // Specify 1 Param.
  .put(lr.maxIter(), 30)  // This overwrites the original maxIter.
  .put(lr.regParam().w(0.1), lr.threshold().w(0.55));  // Specify multiple Params.

// One can also combine ParamMaps.
ParamMap paramMap2 = new ParamMap()
  .put(lr.probabilityCol().w("myProbability"));  // Change output column name
ParamMap paramMapCombined = paramMap.$plus$plus(paramMap2);

// Now learn a new model using the paramMapCombined parameters.
// paramMapCombined overrides all parameters set earlier via lr.set* methods.
LogisticRegressionModel model2 = lr.fit(training, paramMapCombined);
System.out.println("Model 2 was fit using parameters: " + model2.parent().extractParamMap());

// Prepare test documents.
List<Row> dataTest = Arrays.asList(
    RowFactory.create(1.0, Vectors.dense(-1.0, 1.5, 1.3)),
    RowFactory.create(0.0, Vectors.dense(3.0, 2.0, -0.1)),
    RowFactory.create(1.0, Vectors.dense(0.0, 2.2, -1.5))
);
Dataset<Row> test = spark.createDataFrame(dataTest, schema);

// Make predictions on test documents using the Transformer.transform() method.
// LogisticRegression.transform will only use the 'features' column.
// Note that model2.transform() outputs a 'myProbability' column instead of the usual
// 'probability' column since we renamed the lr.probabilityCol parameter previously.
Dataset<Row> results = model2.transform(test);
Dataset<Row> rows = results.select("features", "label", "myProbability", "prediction");
for (Row r: rows.collectAsList()) {
  System.out.println("(" + r.get(0) + ", " + r.get(1) + ") -> prob=" + r.get(2)
    + ", prediction=" + r.get(3));
}

Find the full example code at "examples/src/main/java/org/apache/spark/examples/ml/JavaEstimatorTransformerParamExample.java" in the Spark repo.

Example: Pipeline

This example follows the simple text document Pipeline illustrated in the figures above.

Refer to the Pipeline Python docs for more details on the API.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fit the pipeline to training documents.
model = pipeline.fit(training)

# Prepare test documents, which are unlabeled (id, text) tuples.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop")
], ["id", "text"])

# Make predictions on test documents and print columns of interest.
prediction = model.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    rid, text, prob, prediction = row
    print(
        "(%d, %s) --> prob=%s, prediction=%f" % (
            rid, text, str(prob), prediction   # type: ignore
        )
    )

Find the full example code at "examples/src/main/python/ml/pipeline_example.py" in the Spark repo.

Refer to the Pipeline Scala docs for more details on the API.

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Prepare training documents from a list of (id, text, label) tuples.
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.001)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)

// Now we can optionally save the fitted pipeline to disk
model.write.overwrite().save("/tmp/spark-logistic-regression-model")

// We can also save this unfit pipeline to disk
pipeline.write.overwrite().save("/tmp/unfit-lr-model")

// And load it back in during production
val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")

// Prepare test documents, which are unlabeled (id, text) tuples.
val test = spark.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "spark hadoop spark"),
  (7L, "apache hadoop")
)).toDF("id", "text")

// Make predictions on test documents.
model.transform(test)
  .select("id", "text", "probability", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
    println(s"($id, $text) --> prob=$prob, prediction=$prediction")
  }

Find the full example code at "examples/src/main/scala/org/apache/spark/examples/ml/PipelineExample.scala" in the Spark repo.

Refer to the Pipeline Java docs for more details on the API.

import java.util.Arrays;

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

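// Prepare training documents, which are labeled.
// JavaLabeledDocument and JavaDocument are simple bean classes defined alongside
// this example in the Spark repository (same package, so no import is needed).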
Dataset<Row> training = spark.createDataFrame(Arrays.asList(
  new JavaLabeledDocument(0L, "a b c d e spark", 1.0),
  new JavaLabeledDocument(1L, "b d", 0.0),
  new JavaLabeledDocument(2L, "spark f g h", 1.0),
  new JavaLabeledDocument(3L, "hadoop mapreduce", 0.0)
), JavaLabeledDocument.class);

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
Tokenizer tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words");
HashingTF hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol())
  .setOutputCol("features");
LogisticRegression lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.001);
Pipeline pipeline = new Pipeline()
  .setStages(new PipelineStage[] {tokenizer, hashingTF, lr});

// Fit the pipeline to training documents.
PipelineModel model = pipeline.fit(training);

// Prepare test documents, which are unlabeled.
Dataset<Row> test = spark.createDataFrame(Arrays.asList(
  new JavaDocument(4L, "spark i j k"),
  new JavaDocument(5L, "l m n"),
  new JavaDocument(6L, "spark hadoop spark"),
  new JavaDocument(7L, "apache hadoop")
), JavaDocument.class);

// Make predictions on test documents.
Dataset<Row> predictions = model.transform(test);
for (Row r : predictions.select("id", "text", "probability", "prediction").collectAsList()) {
  System.out.println("(" + r.get(0) + ", " + r.get(1) + ") --> prob=" + r.get(2)
    + ", prediction=" + r.get(3));
}

Find the full example code at "examples/src/main/java/org/apache/spark/examples/ml/JavaPipelineExample.java" in the Spark repo.

Model selection (hyperparameter tuning)

A big benefit of using ML Pipelines is hyperparameter optimization. See the ML Tuning Guide for more information on automatic model selection.
