Text Classification

#deepLearning/llm/4

Text classification models

![[“Although both representation and generative models can be used for classification, their approaches differ.”.png]]

from datasets import load_dataset

# Load our data
data = load_dataset("rotten_tomatoes")
data

# outputs:
# The data is split into train, validation, and test sets
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

# First and last examples of the training set
data["train"][0, -1]

# outputs:
{
    'text': [
        'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
        'things really get weird , though not particularly scary : the movie is all portent and no content .'
    ],
    'label': [1, 0]
}

Text Classification with Representation Models

Representation-based text classification is a discriminative approach: the model learns a vector representation of the text and uses it to model the conditional probability \(P(\text{label} \mid \text{text})\) directly, mapping each text onto a label to separate the classes.
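Concretely, the frozen representation model produces an embedding for the text, and a small trained head maps that embedding to label probabilities. A minimal sketch of that idea in NumPy (the embedding values and the head weights below are made-up placeholders, not taken from any real model):

import numpy as np

# A hypothetical 4-dimensional document embedding (a real representation model outputs ~768 dims)
embedding = np.array([0.2, -1.3, 0.7, 0.5])

# Hypothetical weights of a trained classification head: one row per label (negative, positive)
W = np.array([[ 0.1, -0.4,  0.3, 0.2],
              [-0.2,  0.5, -0.1, 0.4]])
b = np.array([0.05, -0.05])

# A linear layer followed by softmax turns the frozen embedding into P(label | text)
logits = W @ embedding + b
probs = np.exp(logits) / np.exp(logits).sum()
prediction = int(probs.argmax())  # 0 = negative, 1 = positive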

Using task-specific models

from transformers import pipeline

# https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest
# Labels: 0 -> Negative; 1 -> Neutral; 2 -> Positive
# Path to our HF model
model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# Load model into pipeline
# (return_all_scores has been deprecated in newer transformers versions in favor of top_k=None)
pipe = pipeline(
    model=model_path,
    tokenizer=model_path,
    return_all_scores=True,
    device="cuda:0"
)

Classify the test split data["test"] and store the predictions in y_pred.

import numpy as np
from tqdm import tqdm  # progress bar for the inference loop
from transformers.pipelines.pt_utils import KeyDataset

# Run inference
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "text")), total=len(data["test"])):
    # The pipeline returns scores for the model's three built-in labels (negative/neutral/positive);
    # drop the neutral score and keep whichever of negative/positive scores higher
    negative_score = output[0]["score"]
    positive_score = output[2]["score"]
    assignment = np.argmax([negative_score, positive_score])
    y_pred.append(assignment)

from sklearn.metrics import classification_report

def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    performance = classification_report(
        y_true, y_pred,
        target_names=["Negative Review", "Positive Review"]
    )
    print(performance)

evaluate_performance(data["test"]["label"], y_pred)

# outputs:
                 precision    recall  f1-score   support

Negative Review       0.76      0.88      0.81       533
Positive Review       0.86      0.72      0.78       533

       accuracy                           0.80      1066
      macro avg       0.81      0.80      0.80      1066
   weighted avg       0.81      0.80      0.80      1066

The difference between macro avg and micro avg:

  • macro avg: compute precision/recall/F1 for each class separately, then take the unweighted mean of those per-class scores.
  • micro avg: aggregate the TP, FP, and FN counts across all classes first, then compute the metrics from those totals (see the sketch below).
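
A quick illustration of the difference, using scikit-learn on a made-up, imbalanced three-class example (the labels and predictions below are purely illustrative):

from sklearn.metrics import f1_score

# Toy imbalanced example: class 2 is rare and never predicted correctly
y_true = [0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# macro: per-class F1 first, then a simple mean -> the rare class (F1 = 0) drags the score down
print(f1_score(y_true, y_pred, average="macro"))   # ~0.44

# micro: pool all TP/FP/FN first -> equals plain accuracy for single-label classification
print(f1_score(y_true, y_pred, average="micro"))   # 0.625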

![[The confusion matrix describes four types of predictions we can make..png]]

Supervised Classification

from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Convert text to embeddings
train_embeddings = model.encode(data["train"]["text"], show_progress_bar=True)
test_embeddings = model.encode(data["test"]["text"], show_progress_bar=True)
train_embeddings.shape

# outputs:
(8530, 768)

Classification with logistic regression

from sklearn.linear_model import LogisticRegression

# Train a Logistic Regression on our train embeddings
clf = LogisticRegression(random_state=42)
clf.fit(train_embeddings, data["train"]["label"])

# Predict previously unseen instances
y_pred = clf.predict(test_embeddings)
evaluate_performance(data["test"]["label"], y_pred)

# outputs:
                 precision    recall  f1-score   support

Negative Review       0.85      0.86      0.85       533
Positive Review       0.86      0.85      0.85       533

       accuracy                           0.85      1066
      macro avg       0.85      0.85      0.85      1066
   weighted avg       0.85      0.85      0.85      1066

Without training a classifier, we can instead average the embeddings of all documents within each label and predict by cosine similarity to those averaged class embeddings.

import numpy as np
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.metrics.pairwise import cosine_similarity

# Average the embeddings of all documents in each target label
# (column 768 of the stacked array holds the label, so grouping by it averages the 768-dim embeddings per class)
df = pd.DataFrame(np.hstack([train_embeddings, np.array(data["train"]["label"]).reshape(-1, 1)]))
averaged_target_embeddings = df.groupby(768).mean().values

# Find the best matching embeddings between evaluation documents and target embeddings
sim_matrix = cosine_similarity(test_embeddings, averaged_target_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

# Evaluate the model
evaluate_performance(data["test"]["label"], y_pred)

# outputs:
                 precision    recall  f1-score   support

Negative Review       0.85      0.84      0.84       533
Positive Review       0.84      0.85      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066

Zero-shot Classification

  • Step 1: sum the embeddings of all positive (negative) samples and average them; this yields one embedding that represents the positive (negative) class (similar in spirit to matrix factorization (MF), where the features of many samples are combined into a single vector that represents the class as a whole).
  • Step 2: for each test sample, compute the similarity between its embedding and the two class embeddings; the closer class is the prediction.

In the zero-shot setting there are no labeled examples to average, so instead we embed a short textual description of each label ("A negative review" / "A positive review") and compare the test embeddings against those.

# Create embeddings for our labels
label_embeddings = model.encode(["A negative review", "A positive review"])

from sklearn.metrics.pairwise import cosine_similarity

# Find the best matching label for each document
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)
evaluate_performance(data["test"]["label"], y_pred)

# outputs:
                 precision    recall  f1-score   support

Negative Review       0.78      0.77      0.78       533
Positive Review       0.77      0.79      0.78       533

       accuracy                           0.78      1066
      macro avg       0.78      0.78      0.78      1066
   weighted avg       0.78      0.78      0.78      1066

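Zero-shot quality depends heavily on how the labels are phrased. A quick experiment is to swap in more explicit descriptions and re-evaluate, reusing the model, test_embeddings, and evaluate_performance defined above (the phrasings below are only an illustration; no results are implied):

# Try more explicit label descriptions and rerun the zero-shot evaluation
label_embeddings = model.encode(["A very negative movie review", "A very positive movie review"])

sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)
evaluate_performance(data["test"]["label"], y_pred)
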
pipe("is the following sentence positice or negative? this movie is like eating a fly's shit!")

# outputs:
[[{'label': 'negative', 'score': 0.931709885597229},
{'label': 'neutral', 'score': 0.060942187905311584},
{'label': 'positive', 'score': 0.007347979582846165}]]

Text Classification with Generative Models

Encoder-decoder Models

# Load our model
pipe = pipeline(
    "text2text-generation",
    model="google/flan-t5-small",
    device="cuda:0"
)

# Prepare our data
prompt = "Is the following sentence positive or negative? "
data = data.map(lambda example: {"t5": prompt + example['text']})
data

# outputs:
DatasetDict({
    train: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
})

# Run inference
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "t5")), total=len(data["test"])):
    text = output[0]["generated_text"]
    # Map the generated word to a label; any generation other than exactly "negative" is treated as positive
    y_pred.append(0 if text == "negative" else 1)

evaluate_performance(data["test"]["label"], y_pred)

# outputs:
                 precision    recall  f1-score   support

Negative Review       0.83      0.85      0.84       533
Positive Review       0.85      0.83      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066

pipe("is the following sentence positice or negative? this movie is like eating a fly's shit!")

# outputs:
[{'generated_text': 'negative'}]