入門_Snorkel

12.入門 Snorkel

本Colab源自https://github.com/snorkel-team/snorkel-tutorials ，並調整至可在Colab執行，供學習參考。

安裝模組

下載官方教學
安裝相關模組(可能要"Restart Runtime")

!git clone https://github.com/snorkel-team/snorkel-tutorials.git
!pip3 install -r /content/snorkel-tutorials/requirements.txt
!pip3 install -r /content/snorkel-tutorials/spam/requirements.txt
!pip install TensorBoard==1.15
!pip install snorkel

%cd snorkel-tutorials

import os
# Make sure we're running from the spam/ directory
if os.path.basename(os.getcwd()) == "snorkel-tutorials":
    os.chdir("getting_started")

#確認Colab路徑為/content/snorkel-tutorials/getting_started
!pwd

使用 Snorkel 以程式構建和管理訓練數據

Snorkel 是一種「無需手動標記」即可以程式構建和管理訓練數據集的系統。在 Snorkel 中，可以在數小時或數天內開發大型訓練數據集，而不是在數週或數月內手動標記。

Snorkel 目前公開了三個關鍵的程序化操作：

標記數據 Labeling data : 例如使用啟發式規則或遠程監督技術。
轉換數據 Transforming data : 例如旋轉或拉伸圖像以執行數據增強。
資料切片 Slicing data : 將數據分成不同的關鍵子集以進行監控或有針對性的改進。

然後，Snorkel 使用新穎的、有理論依據的技術自動建模、清理和整合生成的訓練數據。

我們將完成五個基本步驟，透過官方範例 YouTube comments 資料集示範，，簡單定義3個標籤並接續後續流程:

# Define the label mappings for convenience
ABSTAIN = -1 #放棄標註
NOT_SPAM = 0
SPAM = 1

from utils import load_unlabeled_spam_dataset

df_train = load_unlabeled_spam_dataset()

1. 編寫標籤函數 (LFs)：

將用 LFs 以編程方式標記我們未標記的數據集，而不是手動標記任何訓練數據。以下為官方導覽介紹的3種LF函數寫法，
關鍵字判別、正規表達式判別與用外部模組判別。

關鍵字：

from snorkel.labeling import labeling_function

@labeling_function()
def lf_keyword_my(x):
    """Many spam comments talk about 'my channel', 'my video', etc."""
    return SPAM if "my" in x.text.lower() else ABSTAIN

正規表達式：

import re

@labeling_function()
def lf_regex_check_out(x):
    """Spam comments say 'check out my video', 'check it out', etc."""
    return SPAM if re.search(r"check.*out", x.text, flags=re.I) else ABSTAIN

任意啟發式：

@labeling_function()
def lf_short_comment(x):
    """Non-spam comments are often short, such as 'cool video!'."""
    return NOT_SPAM if len(x.text.split()) < 5 else ABSTAIN

第三方模組：

from textblob import TextBlob


@labeling_function()
def lf_textblob_polarity(x):
    """
    We use a third-party sentiment classification model, TextBlob.

    We combine this with the heuristic that non-spam comments are often positive.
    """
    return NOT_SPAM if TextBlob(x.text).sentiment.polarity > 0.3 else ABSTAIN

更多類型的標記函數（包括文本以外的數據模式），請參閱其他官方範例和實際示例。

2. 建模和組合 LF：

將前述設定好的LabelModel LF 組合為 list，將 LFs 應用於偽標註的訓練數據。
由於標註函數 LFs 的準確度和相關性未知，輸出標籤可能會重疊和衝突。 snorkel.labeling.model.LabelModel 可以自動估計它們的準確性和相關性，重新加權和組合它們的標籤，並生成我們最終的乾淨、集成的訓練標籤集：

from snorkel.labeling.model import LabelModel
from snorkel.labeling import PandasLFApplier

# Define the set of labeling functions (LFs)
lfs = [lf_keyword_my, lf_regex_check_out, lf_short_comment, lf_textblob_polarity]

# Apply the LFs to the unlabeled training data
applier = PandasLFApplier(lfs)
L_train = applier.apply(df_train)

# Train the label model and compute the training labels
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=500, log_freq=50, seed=123)
df_train["label"] = label_model.predict(L=L_train, tie_break_policy="abstain")

from snorkel.labeling import LFAnalysis

Y_valid = df_train.label.values
LFAnalysis(L_train, lfs).lf_summary(Y_valid)

由於前述LabelModel可能很多數據為標註結果為放棄標示狀態的ABSTAIN = -1，為清理訓練資料集，將明顯標註SPAM、NOT_SPAM的訓練資料集保留進行後去處理。

df_train = df_train[df_train.label != ABSTAIN]

3. 編寫數據增強的TF函數

接著透過建立一個TF函數來增強這個標記的訓練集。
以下get_synonyms()用nltk.wordnet獲取單詞的同義詞。

使用 TF snorkel.augmentation.transformation_function 做為裝飾子，自訂 tf_replace_word_with_synonym() 函數將生成的同義詞加入訓練資料集。

import random

import nltk
from nltk.corpus import wordnet as wn

from snorkel.augmentation import transformation_function

nltk.download("wordnet", quiet=True)


def get_synonyms(word):
    """Get the synonyms of word from Wordnet."""
    lemmas = set().union(*[s.lemmas() for s in wn.synsets(word)])
    return list(set(l.name().lower().replace("_", " ") for l in lemmas) - {word})


@transformation_function()
def tf_replace_word_with_synonym(x):
    """Try to replace a random word with a synonym."""
    words = x.text.lower().split()
    idx = random.choice(range(len(words)))
    synonyms = get_synonyms(words[idx])
    if len(synonyms) > 0:
        x.text = " ".join(words[:idx] + [synonyms[0]] + words[idx + 1 :])
        return x

將自訂 TF 函數加入訓練數據集。

from snorkel.augmentation import ApplyOnePolicy, PandasTFApplier

tf_policy = ApplyOnePolicy(n_per_original=2, keep_original=True)
tf_applier = PandasTFApplier([tf_replace_word_with_synonym], tf_policy)
df_train_augmented = tf_applier.apply(df_train)

更多數據增強的調整可參閱 Spam TFs tutorial。

4. 建立切片函數 Slicing Function , SF

Snorkel 的 Slicing Function 可用以監控特定切片，以及透過針對不同切片增加特徵以提高模型性能。
延續 Youtube 評論之中可能有惡意連結的想法，為此撰寫一個查找可疑縮網址的程式，這對找出惡意垃圾評論可能很關鍵。設定好 SF 可監控此切片的性能：

from snorkel.slicing import slicing_function


@slicing_function()
def short_link(x):
    """
    Return whether text matches common pattern 
    for shortened ".ly" links.
    """
    return int(bool(re.search(r"\w+\.ly", x.text)))

5. 訓練分類器

Snorkel 的最終目標是創建一個標註完成的訓練資料集，然後將其插入任意機器學習框架（例如 TensorFlow、Keras、PyTorch、Scikit-Learn、Ludwig、XGBoost），以訓練強大的機器學習模型。
接續範例，將前述第3步完成的訓練資料集df_train_augmented，以 Scikit-Learn 的 n-gram 邏輯回歸模型進行推論，完成整個運用Snorkel 弱監督分類模型。

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train_text = df_train_augmented.text.tolist()
X_train = CountVectorizer(ngram_range=(1, 2)).fit_transform(train_text)

clf = LogisticRegression(solver="lbfgs")
clf.fit(X=X_train, y=df_train_augmented.label.values)

小結

Snorkel 透過程式邏輯標註程式(Labeling data)，透過數據增強方式自動化轉換數據(Transforming data)，並且可以切片監控特定子資料集(Slicing data)，好處是可以輕易地融入機械學習系統工作流程，並且有自動標註的好處，標註水準還不錯。
雖然好用，但官方範例較複雜，希望能整理一份方便使用的指引，供您後續標註資料的參考。

12.入門 Snorkel

安裝模組​

使用 Snorkel 以程式構建和管理訓練數據​

1. 編寫標籤函數 (LFs)：​

2. 建模和組合 LF：​

3. 編寫數據增強的TF函數​

4. 建立切片函數 Slicing Function , SF​

5. 訓練分類器​

小結​