Text Clustering and Topic Modeling

#deepLearning/llm/5

Text clustering is an unsupervised learning method that automatically groups large collections of text by content or semantic similarity.

ArXiv Articles: Computation and Language

# Load data from huggingface
from datasets import load_dataset
dataset = load_dataset("maartengr/arxiv_nlp")["train"]

# Extract metadata
abstracts = dataset["Abstracts"]
titles = dataset["Titles"]

dataset

# outputs:
Dataset({
    features: ['Titles', 'Abstracts', 'Years', 'Categories'],
    num_rows: 44949
})

titles[:1]

# outputs:
['Introduction to Arabic Speech Recognition Using CMUSphinx System']
abstracts[:1]

# outputs:
[' In this paper Arabic was investigated from the speech recognition problem\npoint of view. We propose a novel approach to build an Arabic Automated Speech\nRecognition System (ASR). This system is based on the open source CMU Sphinx-4,\nfrom the Carnegie Mellon University. CMU Sphinx is a large-vocabulary;\nspeaker-independent, continuous speech recognition system based on discrete\nHidden Markov Models (HMMs). We build a model using utilities from the\nOpenSource CMU Sphinx. We will demonstrate the possible adaptability of this\nsystem to Arabic voice recognition.\n']

A Common Pipeline for Text Clustering

Embedding Documents

from sentence_transformers import SentenceTransformer

# Create an embedding for each abstract
embedding_model = SentenceTransformer('thenlper/gte-small')
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)
# Check the dimensions of the resulting embeddings
embeddings.shape

# outputs:
(44949, 384)

Reducing the Dimensionality of Embeddings

from umap import UMAP

# We reduce the input embeddings from 384 dimensions to 5 dimensions
umap_model = UMAP(
    n_components=5, min_dist=0.0, metric='cosine', random_state=42
)
reduced_embeddings = umap_model.fit_transform(embeddings)
reduced_embeddings.shape

# outputs:
(44949, 5)

Cluster the Reduced Embeddings

We could also use clustering methods from sklearn, such as sklearn.cluster.KMeans or DBSCAN, as sketched below.
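For example, a minimal sketch using KMeans from scikit-learn on the same reduced embeddings. The cluster count of 150 is an arbitrary choice for illustration; unlike HDBSCAN, k-means must be told the number of clusters up front and assigns every document to a cluster, so there is no outlier label.

from sklearn.cluster import KMeans

# Sketch: k-means on the reduced embeddings (n_clusters=150 is arbitrary here).
# Every document gets a cluster; there are no -1 outliers as with HDBSCAN.
kmeans_model = KMeans(n_clusters=150, random_state=42)
kmeans_clusters = kmeans_model.fit_predict(reduced_embeddings)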

from hdbscan import HDBSCAN

# metric='euclidean' uses Euclidean distance; cluster_selection_method='eom'
# selects clusters by "excess of mass", HDBSCAN's default selection strategy
hdbscan_model = HDBSCAN(
    min_cluster_size=50, metric='euclidean', cluster_selection_method='eom'
).fit(reduced_embeddings)
clusters = hdbscan_model.labels_

# How many clusters did we generate?
len(set(clusters))

# outputs:
162
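Note that HDBSCAN labels outliers with -1, so the count above includes the outlier "cluster". A quick sketch for inspecting cluster sizes and the number of outliers (the exact output depends on the run and is not shown here):

import pandas as pd

# Cluster sizes, largest first; the -1 label groups all outlier documents
cluster_sizes = pd.Series(clusters).value_counts()
print(cluster_sizes.head())
print("outliers:", (clusters == -1).sum())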

Inspecting the Clusters

Manually inspect the first three documents in cluster 0.

import numpy as np

# Print the first three documents in cluster 0
cluster = 0
for index in np.where(clusters == cluster)[0][:3]:
    print(abstracts[index][:300] + "... \n")

# outputs:
This works aims to design a statistical machine translation from English text
to American Sign Language (ASL). The system is based on Moses tool with some
modifications and the results are synthesized through a 3D avatar for
interpretation. First, we translate the input text to gloss, a written fo...

Researches on signed languages still strongly dissociate lin- guistic issues
related on phonological and phonetic aspects, and gesture studies for
recognition and synthesis purposes. This paper focuses on the imbrication of
motion and meaning for the analysis, synthesis and evaluation of sign lang...

Modern computational linguistic software cannot produce important aspects of
sign language translation. Using some researches we deduce that the majority of
automatic sign language translation systems ignore many aspects when they
generate animation; therefore the interpretation lost the truth inf...

Static Plotting

Reduce the embeddings to 2 dimensions.

import pandas as pd

# Reduce 384-dimensional embeddings to 2 dimensions for easier visualization
reduced_embeddings = UMAP(
    n_components=2, min_dist=0.0, metric='cosine', random_state=42
).fit_transform(embeddings)

# Create dataframe
df = pd.DataFrame(reduced_embeddings, columns=["x", "y"])
df["title"] = titles
df["cluster"] = [str(c) for c in clusters]

# Select outliers and non-outliers (clusters)
clusters_df = df.loc[df.cluster != "-1", :]
outliers_df = df.loc[df.cluster == "-1", :]
import matplotlib.pyplot as plt

# Plot outliers and non-outliers separately
# alpha is the transparency, s is the point size, cmap is the colormap
plt.scatter(outliers_df.x, outliers_df.y, alpha=0.05, s=2, c="grey")
plt.scatter(
    clusters_df.x, clusters_df.y, c=clusters_df.cluster.astype(int),
    alpha=0.6, s=2, cmap='tab20b'
)
# plt.savefig("matplotlib.png", dpi=300) # Uncomment to save the graph as a .png

![[matplotlib.png.png]]

From Text Clustering to Topic Modeling

BERTopic: A Modular Topic Modeling Framework

from bertopic import BERTopic

# Train our model with our previously defined models
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    verbose=True
).fit(abstracts, embeddings)


# outputs:
2024-04-24 10:39:22,540 - BERTopic - Dimensionality - Completed ✓
2024-04-24 10:39:22,543 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-24 10:39:24,548 - BERTopic - Cluster - Completed ✓
2024-04-24 10:39:24,563 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-24 10:39:34,185 - BERTopic - Representation - Completed ✓

topic_model.get_topic_info()

# outputs:

![[topics.png]]

With the default models, several hundred topics are generated! To get the top 10 keywords for each topic along with their c-TF-IDF weights, we can use the get_topic() function:

TF-IDF stands for Term Frequency-Inverse Document Frequency, a statistical measure of how important a word is within a collection of documents. c-TF-IDF (class-based TF-IDF) is a weighting scheme commonly used in topic modeling that takes the word's class (cluster) into account when scoring its importance; a rough sketch of the computation follows the example output below.

topic_model.get_topic(0)

# outputs:
[('speech', 0.028216480930622023),
('asr', 0.018903579737368923),
('recognition', 0.013553139794284205),
('end', 0.010026507690881847),
('acoustic', 0.009696868164422345),
('speaker', 0.00688304460778908),
('audio', 0.0068022131315230725),
('wer', 0.006414446042943717),
('error', 0.0063871666249343045),
('automatic', 0.006153347638246464)]
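A rough sketch of the c-TF-IDF idea, under the simplifying assumption that a cluster's documents are concatenated into one "class document", term counts are normalized per class, and each term is weighted by log(1 + A / f(t)), where A is the average number of words per class and f(t) is the term's total frequency across all classes. BERTopic's actual implementation differs in detail; this is only an illustration.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Sketch of class-based TF-IDF, roughly W(t, c) = tf(t, c) * log(1 + A / f(t))
docs_per_class = {}
for doc, label in zip(abstracts, clusters):
    docs_per_class.setdefault(label, []).append(doc)
class_docs = [" ".join(docs) for _, docs in sorted(docs_per_class.items())]

vectorizer = CountVectorizer(stop_words="english")
tf = vectorizer.fit_transform(class_docs).toarray()      # term counts per class
words_per_class = tf.sum(axis=1, keepdims=True)          # words in each class
avg_words = words_per_class.mean()                       # A
freq_across_classes = tf.sum(axis=0)                     # f(t)
ctfidf = (tf / words_per_class) * np.log(1 + avg_words / freq_across_classes)

# Top 10 words for the first class (the lowest cluster label, i.e. the outliers if -1 is present)
terms = vectorizer.get_feature_names_out()
print(terms[ctfidf[0].argsort()[::-1][:10]])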

We can use the find_topics() function to search for specific topics based on a search term. Let's search for a topic about topic modeling:

"""
Returns:
similar_topics: the most similar topics from high to low
similarity: the similarity scores from high to low
"""
topic_model.find_topics("topic modeling")

# outputs:
([23, -1, 43, 82, 40],
[0.95474184, 0.91240776, 0.90763277, 0.9037941, 0.90360355])

The output shows that topic 23 has the highest similarity (0.95) to our search term. We can inspect any topic's keywords with get_topic():

topic_model.get_topic(30)

# outputs:
[('sense', 0.06648167634000589),
('wsd', 0.03922167607376744),
('senses', 0.030380260753374123),
('word', 0.029309445839038818),
('disambiguation', 0.028877095221845412),
('embeddings', 0.012890323147034454),
('wordnet', 0.012073044968388328),
('words', 0.0114478022625593),
('polysemous', 0.007590680908917357),
('ambiguous', 0.007242555580821115)]

Topic 30, however, turns out to be characterized by word sense disambiguation rather than topic modeling. Let's check whether the BERTopic paper itself was assigned to topic 23, the closest match to our search:

titles.index('BERTopic: Neural topic modeling with a class-based TF-IDF procedure')

# outputs:
25033

topic_model.topics_[titles.index('BERTopic: Neural topic modeling with a class-based TF-IDF procedure')]

# outputs:
23

Visualizations

Visualize Documents

# Visualize topics and documents
fig = topic_model.visualize_documents(
    titles,
    reduced_embeddings=reduced_embeddings,
    width=1200,
    hide_annotations=True
)

# Update fonts of legend for easier visualization
fig.update_layout(font=dict(size=16))

# Visualize barchart with ranked keywords
topic_model.visualize_barchart()

# Visualize relationships between topics
topic_model.visualize_heatmap(n_clusters=30)

# Visualize the potential hierarchical structure of topics
topic_model.visualize_hierarchy()

![[newplot.png]]

Representation Models

from bertopic.representation import KeyBERTInspired
from bertopic import BERTopic

# Create your representation model
representation_model = KeyBERTInspired()

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)

Before applying a representation model, we first save the original topic representations so that we can easily compare the topics with and without a representation model.

# Save the original topic representations
from copy import deepcopy
original_topics = deepcopy(topic_model.topic_representations_)

def topic_differences(model, original_topics, nr_topics=5):
    """Show the differences in topic representations between two models."""
    df = pd.DataFrame(columns=["Topic", "Original", "Updated"])
    for topic in range(nr_topics):
        # Extract the top 5 keywords per topic for each model
        og_words = " | ".join(list(zip(*original_topics[topic]))[0][:5])
        new_words = " | ".join(list(zip(*model.get_topic(topic)))[0][:5])
        df.loc[len(df)] = [topic, og_words, new_words]

    return df

KeyBERTInspired

from bertopic.representation import KeyBERTInspired

# Update our topic representations to KeyBERTInspired
representation_model = KeyBERTInspired()
topic_model.update_topics(abstracts, representation_model=representation_model)

# Show the differences in topic representations
topic_differences(topic_model, original_topics)

![[topic_differences.png]]

Maximal Marginal Relevance

from bertopic.representation import MaximalMarginalRelevance

# Update our topic representations to MaximalMarginalRelevance
representation_model = MaximalMarginalRelevance(diversity=0.5)
topic_model.update_topics(abstracts, representation_model=representation_model)

# Show the differences in topic representations
topic_differences(topic_model, original_topics)

![[topic_differences_2.png]]

Differences Between KeyBERTInspired and Maximal Marginal Relevance

KeyBERTInspired:

  • Based on the KeyBERT algorithm; uses embeddings to extract keywords
  • Selects the most relevant words by computing cosine similarity between word embeddings and document embeddings
  • Tends to pick the semantically most relevant words, which can lead to some redundancy

Maximal Marginal Relevance (MMR):

  • Strikes a balance between relevance and diversity
  • Iteratively selects words that are both relevant and dissimilar from the words already selected (see the sketch after this comparison)
  • The degree of diversity can be controlled via the diversity parameter
  • Helps produce more diverse topic representations and avoids repetition

Key differences:

  • KeyBERTInspired focuses on relevance; MMR balances relevance and diversity
  • MMR can produce more diverse results, while KeyBERTInspired may be more focused but repetitive
  • MMR exposes a tunable diversity parameter; KeyBERTInspired offers no such direct control
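To make the MMR intuition concrete, here is a minimal sketch of the selection loop. It assumes you already have embeddings for the candidate words (e.g., from embedding_model.encode(words)) and an embedding representing the topic; the function name and scoring below are just the standard MMR trade-off between relevance and redundancy, not BERTopic's internal code.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def mmr(topic_embedding, word_embeddings, words, top_n=10, diversity=0.5):
    """Minimal MMR sketch: pick words relevant to the topic but dissimilar
    from the words already selected."""
    # Relevance of every candidate word to the topic
    word_topic_sim = cosine_similarity(word_embeddings, topic_embedding.reshape(1, -1))
    # Similarity between candidate words (used as the redundancy term)
    word_word_sim = cosine_similarity(word_embeddings)

    selected = [int(np.argmax(word_topic_sim))]
    candidates = [i for i in range(len(words)) if i not in selected]
    for _ in range(min(top_n, len(words)) - 1):
        relevance = word_topic_sim[candidates, 0]
        redundancy = word_word_sim[np.ix_(candidates, selected)].max(axis=1)
        scores = (1 - diversity) * relevance - diversity * redundancy
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return [words[i] for i in selected]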

Topic Representations with Generative Models

Flan-T5

from transformers import pipeline
from bertopic.representation import TextGeneration

prompt = """I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following keywords: '[KEYWORDS]'.

Based on the documents and keywords, what is this topic about?"""
# BERTopic will substitute each topic's documents and keywords into
# [DOCUMENTS] and [KEYWORDS] by default.
# Initial document representation: BERTopic first converts documents into vector
# representations using a pretrained language model (e.g., BERT, RoBERTa). This
# step does not require any predefined topics or keywords.

# Update our topic representations using Flan-T5
generator = pipeline(
    'text2text-generation',
    model='google/flan-t5-small',
    device="cuda:0"
)
representation_model = TextGeneration(
    generator, prompt=prompt, doc_length=50, tokenizer="whitespace"
)
topic_model.update_topics(abstracts, representation_model=representation_model)

# Show the differences in topic representations
topic_differences(topic_model, original_topics)

![[topic_differences_3.png]]

print(abstracts[0])

# outputs:
In this paper Arabic was investigated from the speech recognition problem
point of view. We propose a novel approach to build an Arabic Automated Speech
Recognition System (ASR). This system is based on the open source CMU Sphinx-4,
from the Carnegie Mellon University. CMU Sphinx is a large-vocabulary;
speaker-independent, continuous speech recognition system based on discrete
Hidden Markov Models (HMMs). We build a model using utilities from the
OpenSource CMU Sphinx. We will demonstrate the possible adaptability of this
system to Arabic voice recognition.

fig = topic_model.visualize_document_datamap(
    titles,
    topics=list(range(20)),
    reduced_embeddings=reduced_embeddings,
    width=1200,
    label_font_size=11,
    label_wrap_width=20,
    use_medoids=True,
)
plt.show()
# plt.savefig("datamapplot.png", dpi=300)

![[Documents and Topics.png]]