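I Built a Camera- and Video-Ready Object Detection Tool with MXNet: YOLO, SSD, and Faster-RCNN Models Supported

Hello.

I'm "Hayabusa" (@Cpp_Learning), and I specialize in computer vision (developing "robot eyes").

I enjoy deep-learning-based object detection and have written a number of articles about it.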

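What is MXNet?

MXNet is a deep learning framework officially supported by AWS.

As a deep learning framework, it sits alongside TensorFlow, PyTorch, and Chainer.

The main features of MXNet are listed below.

MXNet features

A deep learning framework officially supported by AWS
Provides the Gluon API, which makes tasks such as natural language processing and object detection easy to handle
Easy to install
Extensive tutorials
Supports both import and export of ONNX models
Supports a wide range of programming languages: C++, JavaScript, Python, R, Matlab, Julia, Scala, Clojure, and Perl

If you want to know more about what MXNet offers, take a look at the official AWS site.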

What is Gluon?

Gluon is a flexible, easy-to-use API for MXNet.

Working through the extensive tutorials below should show how flexible and intuitive the Gluon API is.

To get started with Gluon, checkout the following resources and tutorials:

60-minute Gluon Crash Course – six 10-minute lessons on using Gluon

GluonCV Toolkit – implementations of state of the art deep learning algorithms in Computer Vision (CV)

GluonNLP Toolkit – implementations of state of the art deep learning algorithms in Natural Language Processing (NLP)

Dive into Deep Learning – notebooks designed to teach deep learning from the ground up, all using the Gluon API

Source: MXNet/Gluon (official site)

This article puts object detection into practice, so it uses GluonCV.

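Environment setup

MXNet and Gluon can be installed easily with pip.

The "Installation" page of the official Gluon site shows the following screen.

Choose the options that match your environment, then copy and run the displayed install command (COMMAND) to start the installation.

The figure above shows an example for installing on Windows.

On a machine with an Intel CPU, selecting MKL-DNN improves performance.

I also want to use OpenCV for handling video, so install it with the following command.

pip install opencv-python

The source code explained below was tested with the following versions.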

Python==3.7.3
mxnet==1.4.1
gluoncv==0.4.0
opencv-python==4.1.0
matplotlib==3.1.0
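
To compare a local environment against the versions listed above, a small check like the following may help. This is only a sketch: the helper names installed_versions and report are my own, and it assumes either Python 3.8+ (where importlib.metadata is in the standard library) or the importlib_metadata backport on Python 3.7.

```python
try:
    from importlib import metadata  # standard library on Python 3.8+
except ImportError:
    import importlib_metadata as metadata  # backport package for Python 3.7

# Versions the article was tested with (copied from the list above).
TESTED = {
    "mxnet": "1.4.1",
    "gluoncv": "0.4.0",
    "opencv-python": "4.1.0",
    "matplotlib": "3.1.0",
}


def installed_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def report(tested):
    """Print the tested and installed version of each package side by side."""
    for name, have in installed_versions(tested).items():
        want = tested[name]
        # A prefix match is used because wheels may carry an extra build
        # number after the tested version.
        same = have is not None and have.startswith(want)
        status = "ok" if same else "differs"
        print(f"{name}: tested {want}, installed {have} ({status})")


if __name__ == "__main__":
    report(TESTED)
```

A differing version does not necessarily mean the script will fail; it only flags where the environment departs from the tested one.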

Model Zoo for GluonCV

GluonCV has a Model Zoo. Taking a look at its Object Detection section...

The following object detection models are provided and can be tried right away with GluonCV.

SSD
Faster-RCNN
YOLO-v3

The figure below lists the SSD models trained on the Pascal VOC dataset.

For example, the blue box in the figure above marks the MobileNetV1-based SSD model (input size 512×512).
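
The model name used later in this article, ssd_512_mobilenet1.0_voc, encodes the same information as the blue-box entry. The sketch below splits such a name into its parts. It assumes the pattern ssd_<input size>_<backbone>_<dataset>, which I inferred from the SSD names themselves; parse_ssd_name is a hypothetical helper, not part of GluonCV.

```python
from typing import NamedTuple


class SsdModelName(NamedTuple):
    input_size: int   # e.g. 512 for a 512x512 input
    backbone: str     # e.g. "mobilenet1.0"
    dataset: str      # e.g. "voc"


def parse_ssd_name(name):
    """Split an SSD model name, assuming ssd_<size>_<backbone>_<dataset>."""
    parts = name.split("_")
    if len(parts) < 4 or parts[0] != "ssd" or not parts[1].isdigit():
        raise ValueError("not an SSD-style model name: " + name)
    # The backbone itself may contain underscores (e.g. "resnet50_v1"),
    # so everything between the size and the dataset is joined back.
    return SsdModelName(int(parts[1]), "_".join(parts[2:-1]), parts[-1])


print(parse_ssd_name("ssd_512_mobilenet1.0_voc"))
# -> SsdModelName(input_size=512, backbone='mobilenet1.0', dataset='voc')
```

Names of other model families (YOLO-v3, Faster-RCNN) follow different layouts, so this helper deliberately rejects them.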

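MobileNet SSD is also covered in the following article, which may be useful as a reference.

Real-Time Object Detection with MobileNet SSD in PyTorch: I wrote an article on deep-learning object detection using the PyTorch framework. There are several object detection methods; this time I performed real-time object detection with MobileNet-based SSD. ...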

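Developing the object detection tool "MXNetCV_Cam.py"

As noted in the MXNet features list, the tutorials are extensive, and one is provided for SSD models as well.

With the code from that tutorial, object detection on still images is easy to get working.

However, what I wanted this time was object detection for camera and video input, so I wrote a new source file, "MXNetCV_Cam.py".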

import argparse
from timeit import default_timer as timer

import cv2
import gluoncv as gcv
import mxnet as mx

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-model', default='ssd_512_mobilenet1.0_voc')
    parser.add_argument('-threshold', type=float, default=0.5)
    parser.add_argument('video')
    args = parser.parse_args()

    # Set threshold
    th = args.threshold

    # Open the webcam ("0") or a video file (any other value is used as a path)
    if args.video == "0":
        cap = cv2.VideoCapture(0)
    else:
        cap = cv2.VideoCapture(args.video)
    if not cap.isOpened():
        raise IOError("Couldn't open video file or webcam.")

    # Get the frame size of the video (used to position the FPS overlay)
    vidw = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    vidh = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    print(vidw)
    print(vidh)

    # Load the model
    net = gcv.model_zoo.get_model(args.model, pretrained=True)
    # net = gcv.model_zoo.get_model('yolo3_mobilenet1.0_coco', pretrained=True)
    # net = gcv.model_zoo.get_model('ssd_512_mobilenet1.0_voc', pretrained=True)

    # Time parameter
    accum_time = 0
    curr_fps = 0
    fps = "FPS: ??"
    prev_time = timer()

    frame_count = 1
    while True:
        # Load frame from the camera or video file
        ret, frame = cap.read()
        if not ret:
            print("Done!")
            break

        # Result image
        result_img = frame.copy()

        # Image pre-processing
        frame = mx.nd.array(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).astype('uint8')
        rgb_nd, scaled_frame = gcv.data.transforms.presets.ssd.transform_test(frame, short=512, max_size=700)

        # Run frame through network
        class_IDs, scores, bounding_boxes = net(rgb_nd)

        # Convert NDArray to numpy.ndarray
        bounding_boxes = bounding_boxes.asnumpy()
        scores = scores.asnumpy()
        class_IDs = class_IDs.asnumpy()

        # Class name list
        class_names = net.classes

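        # Boxes are in the coordinates of the resized image returned by
        # transform_test, so compute factors to map them back to the frame
        scale_x = result_img.shape[1] / scaled_frame.shape[1]
        scale_y = result_img.shape[0] / scaled_frame.shape[0]

        # Display the result
        for i, bbox in enumerate(bounding_boxes[0]):
            if th < scores[0][i]:
                score = scores[0][i]
                class_id = class_IDs[0][i]
                class_name = class_names[int(class_id)]

                xmin = int(bbox[0] * scale_x)
                ymin = int(bbox[1] * scale_y)
                xmax = int(bbox[2] * scale_x)
                ymax = int(bbox[3] * scale_y)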

                # Draw box
                cv2.rectangle(result_img, (xmin, ymin), (xmax, ymax), (0,255,0), 2)

                text = class_name + " " + ('%.2f' % score)
                print(text)

                text_top = (xmin, ymin - 10)
                text_bot = (xmin + 80, ymin + 5)
                text_pos = (xmin + 5, ymin)

                # Draw class and score
                cv2.rectangle(result_img, text_top, text_bot, (255,255,255), -1)
                cv2.putText(result_img, text, text_pos,
                            cv2.FONT_HERSHEY_SIMPLEX, 0.35, (0, 0, 0), 1)
            else:
                # Stop at the first low score (assumes detections are sorted by score)
                break

        # Calculate FPS
        curr_time = timer()
        exec_time = curr_time - prev_time
        prev_time = curr_time
        accum_time = accum_time + exec_time
        curr_fps = curr_fps + 1
        if accum_time > 1:
            accum_time = accum_time - 1
            fps = "FPS: " + str(curr_fps)
            curr_fps = 0

        # Draw FPS in top right corner
        cv2.rectangle(result_img, (vidw-50, 0), (vidw, 17), (0, 0, 0), -1)
        cv2.putText(result_img, fps, (vidw-45, 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.35, (255, 255, 255), 1)

        # Draw Frame Number in top left corner
        cv2.rectangle(result_img, (0, 0), (50, 17), (0, 0, 0), -1)
        cv2.putText(result_img, str(frame_count), (0, 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.35, (255, 255, 255), 1)

        # Output Result
        title = args.model + " Result"
        cv2.imshow(title, result_img)

        # Stop Processing
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

        frame_count += 1

    # Release the capture device and close the display window
    cap.release()
    cv2.destroyAllWindows()

if __name__ == '__main__':
    main()
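
One detail in the script deserves a closer look: transform_test resizes each frame (short=512, max_size=700) before inference, so the box coordinates returned by the network refer to the resized image, not to the original frame. Below is a minimal sketch of mapping a box back to the original frame size. The helper name rescale_box and the example sizes are assumptions for illustration only.

```python
def rescale_box(bbox, scaled_size, original_size):
    """Map a box from the resized image back to the original frame.

    bbox is (xmin, ymin, xmax, ymax) in resized-image pixels;
    scaled_size and original_size are (width, height) tuples.
    """
    scale_x = original_size[0] / scaled_size[0]
    scale_y = original_size[1] / scaled_size[1]
    xmin, ymin, xmax, ymax = bbox
    return (int(round(xmin * scale_x)), int(round(ymin * scale_y)),
            int(round(xmax * scale_x)), int(round(ymax * scale_y)))


# Illustrative sizes: a 640x480 frame resized to 682x512 before inference.
box_on_resized = (100.0, 50.0, 341.0, 256.0)
print(rescale_box(box_on_resized, (682, 512), (640, 480)))
# -> (94, 47, 320, 240)
```

An alternative is to draw directly on the resized image that transform_test returns. Rescaling keeps the display at the original resolution instead.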

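I wrote it with the following points in mind, so that it is easy to use in experiments and research.

Support for multiple models
Easy adjustment of the score threshold
Support for both camera and video input
Display of the frame count and FPS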
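
The FPS display works by counting frames until roughly one second of processing time has accumulated. Pulled out of the main loop, the same logic can be sketched as a small class. FpsCounter is a hypothetical name of mine, and the fixed 0.25-second frame time in the example is made up for illustration.

```python
class FpsCounter:
    """Counts frames and refreshes the label about once per second."""

    def __init__(self):
        self.accum_time = 0.0    # seconds accumulated since the last refresh
        self.frames = 0          # frames counted since the last refresh
        self.label = "FPS: ??"   # shown until the first full second passes

    def tick(self, exec_time):
        """Register one frame that took exec_time seconds; return the label."""
        self.accum_time += exec_time
        self.frames += 1
        if self.accum_time > 1:
            # Keep the remainder so the next window starts where this one ended.
            self.accum_time -= 1
            self.label = "FPS: " + str(self.frames)
            self.frames = 0
        return self.label


counter = FpsCounter()
labels = [counter.tick(0.25) for _ in range(5)]
print(labels[-1])
# -> FPS: 5
```

The label only changes once the accumulated time exceeds one second, which is why the example needs a fifth frame before the placeholder is replaced.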

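How to use "MXNetCV_Cam.py"

Set the following items as command-line arguments, then run "MXNetCV_Cam.py".

Setting	Option	Accepted values
Model	-model	Any object detection model in the Model Zoo
Score threshold	-threshold	Lower bound on the score for drawing a box
Camera/video mode	Last argument	"0" for camera, or a file path for video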
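
To make the table concrete, the sketch below rebuilds the same parser as the script and shows what a sample command line turns into. build_parser and resolve_source are hypothetical helper names of mine, and movie.mp4 is a placeholder file name.

```python
import argparse


def build_parser():
    """Same three arguments as MXNetCV_Cam.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument('-model', default='ssd_512_mobilenet1.0_voc')
    parser.add_argument('-threshold', type=float, default=0.5)
    parser.add_argument('video')
    return parser


def resolve_source(video):
    """Return camera index 0 for "0"; otherwise treat the value as a path."""
    return 0 if video == "0" else video


# Equivalent to:
#   python MXNetCV_Cam.py -model yolo3_mobilenet1.0_coco -threshold 0.6 0
args = build_parser().parse_args(
    ['-model', 'yolo3_mobilenet1.0_coco', '-threshold', '0.6', '0'])
print(args.model, args.threshold, resolve_source(args.video))
# -> yolo3_mobilenet1.0_coco 0.6 0

# With only a video path, the defaults from the script apply.
args = build_parser().parse_args(['movie.mp4'])
print(args.model, args.threshold, resolve_source(args.video))
# -> ssd_512_mobilenet1.0_voc 0.5 movie.mp4
```

Note that the options take a single hyphen (-model, -threshold), matching how they are declared in the script.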