[gokart] Developing Machine Learning Workflows with a Pipeline Library

Hello.

I'm Hayabusa (@Cpp_Learning), a working engineer. I use machine learning for all sorts of things, both at work and in my spare time.

In this post, I introduce gokart, a pipeline library.

Contents

1 What is gokart?
2 Tips on gokart
3 Hands-on: developing a data processing workflow with gokart
- 3.1 Installation
- 3.2 Source code
4 Summary

What is gokart?

Here is an overview of gokart.

gokart

A pipeline library for machine learning projects
A wrapper around Luigi
An OSS developed and maintained by M3 (エムスリー)

For more on Luigi, see the following article as well.

[Luigi] Developing Data Processing Workflows with a Pipeline Library

Tips on gokart

Searching for gokart turns up a number of resources.

Of these, the official information is the easiest to follow, and I recommend it.

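Hands-on: developing a data processing workflow with gokart

Let's build the same workflow as in the Luigi article mentioned above.

Note: get data ⇒ run data processing A and data processing B in parallel ⇒ output the result (save a file locally)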
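The shape of this workflow can be sketched as a small dependency graph. The snippet below uses only Python's standard `graphlib` module (Python 3.9+) and groups the tasks into stages that could run side by side. It is an illustration of the dependency structure only, not how Luigi or gokart schedule tasks internally; the task names are the ones used in main.py later in this post.

```python
from graphlib import TopologicalSorter

# Each key maps a task to the set of tasks it depends on.
dependencies = {
    "GetData": set(),
    "PreprocessA": {"GetData"},
    "PreprocessB": {"GetData"},
    "Sample": {"PreprocessA", "PreprocessB"},
}


def execution_stages(graph):
    """Group tasks into stages; tasks within one stage do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        # get_ready() returns every task whose dependencies are all done.
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


if __name__ == "__main__":
    for number, stage in enumerate(execution_stages(dependencies), start=1):
        print(number, stage)
```

PreprocessA and PreprocessB land in the same stage because both depend only on GetData, which is what makes them candidates for parallel execution.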

Installation

First, install gokart with the following command.

pip install gokart

Note: Windows is not supported.

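Source code

If you have used Luigi, you should have no trouble reading the code below, and you will probably be impressed by how little effort it takes. (From here on, this code is referred to as main.py.)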

import pandas as pd
import luigi
import gokart


class GetData(gokart.TaskOnKart):
    def requires(self):
        pass  # no requires

    def output(self):
        return self.make_target('./tmp/GetData.csv')

    def run(self):
        # get data
        input_df = pd.read_csv('./dataset/iris.csv')
        # save input data with the file path './tmp/GetData_{unique_id}.csv'
        self.dump(input_df)


class PreprocessA(gokart.TaskOnKart):
    def requires(self):
        return GetData()

    def output(self):
        return self.make_target('./tmp/PreprocessA.csv')

    def run(self):
        # load data which GetData output
        input_df = self.load()
        # preprocess with input_df, and make result_df.
        result_df = input_df.drop(columns=['SepalLength'])
        # save results with the file path './tmp/PreprocessA_{unique_id}.csv'
        self.dump(result_df)


class PreprocessB(gokart.TaskOnKart):
    def requires(self):
        return GetData()

    def output(self):
        return self.make_target('./tmp/PreprocessB.csv')

    def run(self):
        # load data which GetData output
        input_df = self.load()
        # preprocess with input_df, and make result_df.
        result_df = input_df.drop(columns=['SepalWidth'])
        # save results with the file path './tmp/PreprocessB_{unique_id}.csv'
        self.dump(result_df)


class Sample(gokart.TaskOnKart):
    def requires(self):
        return {'a': PreprocessA(), 'b': PreprocessB()}

    def output(self):
        # save results with the file path './output/result_{unique_id}.csv'
        return self.make_target('./output/result.csv')

    def run(self):
        # input a
        input_a = self.load('a')
        # input b
        input_b = self.load('b')
        # process (merge): concatenate the two frames column-wise
        df = pd.concat([input_a, input_b], axis=1)
        # output
        self.dump(df)


if __name__ == '__main__':
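    # NOTE: the example directory listing shows conf/config.cfg;
    # make sure this path matches the actual file name.
    luigi.configuration.LuigiConfigParser.add_config_path('./conf/config.ini')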
    gokart.run(['Sample', '--local-scheduler'])

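Here is a quick comparison with Luigi.

self.dump(input_df) saves a file in a single line
input_df = self.load() loads a file in a single line
Logs, intermediate files, and output results are all saved together under resources

Note: a directory other than resources (the default) can also be chosen.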
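As a rough sketch of how the output directory might be changed: to my understanding, gokart reads a `workspace_directory` setting from a `[TaskOnKart]` section of the config file, but please treat both names as assumptions and check them against the official documentation. The snippet below only builds the text of such a config file with Python's standard `configparser`; the directory name `./my_workspace` is made up for illustration.

```python
import configparser
import io


def build_config(workspace_directory):
    """Return the text of a Luigi-style config file that sets the output directory."""
    config = configparser.ConfigParser()
    # Assumption: the section and key names below follow gokart's convention.
    config["TaskOnKart"] = {"workspace_directory": workspace_directory}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical directory name, used only for this example.
    print(build_config("./my_workspace"))
```

The generated text would then be saved as the config file that main.py registers with add_config_path.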

Below are the contents of the example directory, which contains main.py.

example
│  main.py  # source code
│
├─conf
│  config.cfg  # configuration file
│
├─dataset
│  iris.csv  # input file
│
└─resources
    ├─log
    │  ├─module_versions  # versions of the modules used
    │  │      GetData_e1f86452d6660737bd120d9ba7d49914.txt
    │  │      PreprocessA_dac47bc6daf2d59dafbfe1cbbc8d3ba4.txt
    │  │      PreprocessB_8a953195cef609d5e53364df7a7bd93c.txt
    │  │      Sample_d2ad11a7461f09c2539e73761bd61ada.txt
    │  │
    │  ├─processing_time  # processing time of each task
    │  │      GetData_e1f86452d6660737bd120d9ba7d49914.pkl
    │  │      PreprocessA_dac47bc6daf2d59dafbfe1cbbc8d3ba4.pkl
    │  │      PreprocessB_8a953195cef609d5e53364df7a7bd93c.pkl
    │  │      Sample_d2ad11a7461f09c2539e73761bd61ada.pkl
    │  │
    │  ├─random_seed
    │  │      GetData_e1f86452d6660737bd120d9ba7d49914.pkl
    │  │      PreprocessA_dac47bc6daf2d59dafbfe1cbbc8d3ba4.pkl
    │  │      PreprocessB_8a953195cef609d5e53364df7a7bd93c.pkl
    │  │      Sample_d2ad11a7461f09c2539e73761bd61ada.pkl
    │  │
    │  ├─task_log  # output destination (path) information
    │  │      GetData_e1f86452d6660737bd120d9ba7d49914.pkl
    │  │      PreprocessA_dac47bc6daf2d59dafbfe1cbbc8d3ba4.pkl
    │  │      PreprocessB_8a953195cef609d5e53364df7a7bd93c.pkl
    │  │      Sample_d2ad11a7461f09c2539e73761bd61ada.pkl
    │  │
    │  └─task_params  # parameters used to run each task
    │          GetData_e1f86452d6660737bd120d9ba7d49914.pkl
    │          PreprocessA_dac47bc6daf2d59dafbfe1cbbc8d3ba4.pkl
    │          PreprocessB_8a953195cef609d5e53364df7a7bd93c.pkl
    │          Sample_d2ad11a7461f09c2539e73761bd61ada.pkl
    │
    ├─output  # where output results are saved
    │   result_d2ad11a7461f09c2539e73761bd61ada.csv
    │
    └─tmp  # where intermediate files are saved
        GetData_e1f86452d6660737bd120d9ba7d49914.csv
        PreprocessA_dac47bc6daf2d59dafbfe1cbbc8d3ba4.csv
        PreprocessB_8a953195cef609d5e53364df7a7bd93c.csv

You can inspect the contents of a pickle file with the following command.

python -m pickle Sample_*.pkl

This time, the output was as follows.

{'file_path': ['./resources/./output/result_*.csv']}
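The same check can be done from a Python script, which is handy when you want to look at many log files at once. This is a minimal sketch using only the standard library; the temporary file and the dictionary written into it are stand-ins so the example runs anywhere, not real gokart output.

```python
import pickle
import pprint
import tempfile
from pathlib import Path


def load_pickle(path):
    """Load one pickled object from path. Only use this on files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    # Stand-in file so the sketch is self-contained; the content is illustrative.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "example.pkl"
        sample.write_bytes(pickle.dumps({"file_path": ["./output/example.csv"]}))
        pprint.pprint(load_pickle(sample))
```

To inspect a real log, pass the path of one of the .pkl files under resources/log to load_pickle instead.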

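The information needed for experiment management is generated automatically, and a unique ID is automatically appended to each file name, so there is no need to manage files by date.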
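To get a feel for why such IDs remove the need for date-based file names, here is a small sketch of the general idea: hash the task name together with its parameters and insert the digest before the file extension. This is not gokart's actual implementation, just an illustration; the function names and the `seed` parameter are hypothetical.

```python
import hashlib
import json
import os


def make_unique_id(task_name, params):
    """Hash the task name and its parameters into a 32-character hex digest."""
    # sort_keys makes the digest independent of dictionary ordering.
    payload = json.dumps({"task": task_name, "params": params}, sort_keys=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def unique_path(path, unique_id):
    """Insert the id before the extension: 'a/b.csv' becomes 'a/b_<id>.csv'."""
    root, ext = os.path.splitext(path)
    return f"{root}_{unique_id}{ext}"


if __name__ == "__main__":
    # 'seed' is a made-up parameter used only for this example.
    uid = make_unique_id("Sample", {"seed": 1})
    print(unique_path("./output/result.csv", uid))
```

With a scheme like this, the same task with the same parameters always maps to the same file, while changing a parameter yields a new file instead of overwriting the old one.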

gokart is great!

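Summary

I've recently been working with Luigi and gokart side by side. gokart was easier to write and automated experiment management as well, which I really liked.

Just what you'd expect from a pipeline library built for machine learning projects.

If it also supported Windows, I might have switched over from Luigi completely...

Kururu

I often go back and forth between Linux and Windows, so I'm a little wary of making a library my main tool unless it works on both...

There may be owls and people who feel the same way.

Hayabusa

I'll be using it for Linux-based projects and personal work. Thank you for a wonderful library.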
