# outputs: ['Introduction to Arabic Speech Recognition Using CMUSphinx System']
abstracts[:1]
# outputs: [' In this paper Arabic was investigated from the speech recognition problem\npoint of view. We propose a novel approach to build an Arabic Automated Speech\nRecognition System (ASR). This system is based on the open source CMU Sphinx-4,\nfrom the Carnegie Mellon University. CMU Sphinx is a large-vocabulary;\nspeaker-independent, continuous speech recognition system based on discrete\nHidden Markov Models (HMMs). We build a model using utilities from the\nOpenSource CMU Sphinx. We will demonstrate the possible adaptability of this\nsystem to Arabic voice recognition.\n']
A Common Pipeline for Text Clustering
Embedding Documents
from sentence_transformers import SentenceTransformer
# Create an embedding for each abstract
embedding_model = SentenceTransformer('thenlper/gte-small')
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)

# Check the dimensions of the resulting embeddings
embeddings.shape
# outputs: (44949, 384)
Reducing the Dimensionality of Embeddings
from umap import UMAP
# We reduce the input embeddings from 384 dimensions to 5 dimensions
umap_model = UMAP(
    n_components=5, min_dist=0.0, metric='cosine', random_state=42
)
reduced_embeddings = umap_model.fit_transform(embeddings)
reduced_embeddings.shape
# outputs: (44949, 5)
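The `metric='cosine'` setting compares embeddings by the angle between them rather than by their magnitude. As a minimal illustration in plain Python (a sketch of the distance itself, not UMAP's internal implementation; the function name and the toy vectors are hypothetical), cosine distance can be computed like this:

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy vectors (assumed for illustration only)
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors
print(same_direction, orthogonal)
```

Vectors that point in the same direction get a distance close to 0 regardless of their length, which is why this metric is a common choice for text embeddings.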
Cluster the Reduced Embeddings
import numpy as np

# Print the first three documents in cluster 0
# (clusters is assumed to hold one cluster label per abstract)
cluster = 0
for index in np.where(clusters == cluster)[0][:3]:
    print(abstracts[index][:300] + "... \n")
# outputs:
This works aims to design a statistical machine translation from English text to American Sign Language (ASL). The system is based on Moses tool with some modifications and the results are synthesized through a 3D avatar for interpretation. First, we translate the input text to gloss, a written fo...
Researches on signed languages still strongly dissociate lin- guistic issues related on phonological and phonetic aspects, and gesture studies for recognition and synthesis purposes. This paper focuses on the imbrication of motion and meaning for the analysis, synthesis and evaluation of sign lang...
Modern computational linguistic software cannot produce important aspects of sign language translation. Using some researches we deduce that the majority of automatic sign language translation systems ignore many aspects when they generate animation; therefore the interpretation lost the truth inf...
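The indexing idiom used to pick documents from a cluster can be hard to read at first. As a rough plain-Python equivalent (a sketch with hypothetical toy labels and documents, not the real dataset), `np.where(clusters == cluster)[0][:3]` selects the positions of the first three documents whose label matches the chosen cluster:

```python
# Hypothetical toy data: one cluster label per document
clusters = [0, 1, 0, 2, 0, 0]
abstracts = ["doc A", "doc B", "doc C", "doc D", "doc E", "doc F"]

cluster = 0
# Positions of the documents assigned to the chosen cluster, in original order
member_indices = [i for i, label in enumerate(clusters) if label == cluster]
# Keep only the first three members, mirroring the [:3] slice
first_three = [abstracts[i] for i in member_indices[:3]]
print(first_three)
```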
Static Plotting
Reduce the Embeddings to 2 Dimensions
import pandas as pd
# Reduce 384-dimensional embeddings to 2 dimensions for easier visualization
reduced_embeddings = UMAP(
    n_components=2, min_dist=0.0, metric='cosine', random_state=42
).fit_transform(embeddings)

# Create dataframe
df = pd.DataFrame(reduced_embeddings, columns=["x", "y"])
df["title"] = titles
df["cluster"] = [str(c) for c in clusters]

"""
Returns:
    similar_topics: the most similar topics from high to low
    similarity: the similarity scores from high to low
"""
topic_model.find_topics("topic modeling")
from transformers import pipeline
from bertopic.representation import TextGeneration
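prompt = """I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following keywords: '[KEYWORDS]'.

Based on the documents and keywords, what is this topic about?"""
# BERTopic replaces the [DOCUMENTS] and [KEYWORDS] placeholders by default.
# Initial document representation: BERTopic first uses a pretrained language
# model (such as BERT or RoBERTa) to convert documents into vector
# representations. This step does not require any predefined topics or keywords.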
# outputs: In this paper Arabic was investigated from the speech recognition problem point of view. We propose a novel approach to build an Arabic Automated Speech Recognition System (ASR). This system is based on the open source CMU Sphinx-4, from the Carnegie Mellon University. CMU Sphinx is a large-vocabulary; speaker-independent, continuous speech recognition system based on discrete Hidden Markov Models (HMMs). We build a model using utilities from the OpenSource CMU Sphinx. We will demonstrate the possible adaptability of this system to Arabic voice recognition.
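To make the placeholder mechanism in the prompt template more concrete, here is a minimal sketch of how `[DOCUMENTS]` and `[KEYWORDS]` could be filled in with plain string replacement. This is an illustration only: `fill_prompt`, the bullet formatting, and the sample inputs are hypothetical and are not BERTopic's actual implementation.

```python
prompt = """I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following keywords: '[KEYWORDS]'.

Based on the documents and keywords, what is this topic about?"""

def fill_prompt(template, documents, keywords):
    """Hypothetical helper: substitute the two placeholders in the template."""
    doc_block = "\n".join(f"- {doc}" for doc in documents)
    keyword_text = ", ".join(keywords)
    return (
        template
        .replace("[DOCUMENTS]", doc_block)
        .replace("[KEYWORDS]", keyword_text)
    )

# Sample inputs assumed for illustration
filled = fill_prompt(
    prompt,
    ["Abstract one ...", "Abstract two ..."],
    ["speech", "recognition"],
)
print(filled)
```

The filled prompt is what a text-generation model would receive for one topic: a few representative documents plus the topic's keywords.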