Posted on 2022-11-29 11:40

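    1. Import the data and inspect it

    2. Build a word dictionary with CountVectorizer, then model and predict

    2.1 CountVectorizer usage example

    2.2 Convert the text to feature vectors with CountVectorizer

    2.3 Model and predict with a naive Bayes model

    3. Convert the text to feature vectors with TfidfVectorizer, then model and predict

    3.1 TfidfVectorizer usage example

    3.2 Apply TfidfVectorizer to the news data

    3.3 Model and predict

    3.4 Remove stop words, then model and predict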
1. Import the data and inspect it

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split

# Load the news data
news = fetch_20newsgroups(subset='all')

# news.data is a list of 18846 items; each item is the text of one post as a string
print(len(news.data))

Output:
18846

news.data[0]

Output:
"From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>\nSubject: Pens fans reactions\nOrganization: Post Office, Carnegie Mellon, Pittsburgh, PA\nLines: 12\nNNTP-Posting-Host: po4.andrew.cmu.edu\n\n\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pens massacre of the Devils. Actually,\nI am  bit puzzled too and a bit relieved. However, I am going to put an end\nto non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\nare killing those Devils worse than I thought. Jagr just showed you why\nhe is much better than his regular season stats. He is also a lot\nfo fun to watch in the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\nregular season game.          PENS RULE!!!\n\n"

# news.target holds the numeric id of each post's category
news.target

Output:
array([10,  3, 17, ...,  3,  1,  7])

# There are 20 target label names, so the posts fall into 20 categories
len(news.target_names)

Output:
20

# Check which category the first post belongs to
print(news.target[0])
print(news.target_names[news.target[0]])

Output:
10
rec.sport.hockey

2. Build a word dictionary with CountVectorizer, then model and predict

2.1 CountVectorizer usage example

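from sklearn.feature_extraction.text import CountVectorizer

texts = ["pig bird cat", "dog dog cat cat", "bird fish bird", "pig bird"]
cv = CountVectorizer()

# Vectorize the texts
cv_fit = cv.fit_transform(texts)

# Inspect the result: each word is counted and the count is stored at that word's index
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
print(cv.get_feature_names_out().tolist())  # the vocabulary, in column order
print(cv_fit.toarray())                     # the texts as count vectors

Output:
['bird', 'cat', 'dog', 'fish', 'pig']
[[1 1 0 0 1]
 [0 2 2 0 0]
 [2 0 0 1 0]
 [1 0 0 0 1]]

2.2 Convert the text to feature vectors with CountVectorizer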

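# Build the vocabulary from all posts and convert them to a sparse count matrix
cv = CountVectorizer()
cv_data = cv.fit_transform(news.data)

2.3 Model and predict with a naive Bayes model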

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

x_train, x_test, y_train, y_test = train_test_split(cv_data, news.target)
mul_nb = MultinomialNB()

# 3-fold cross-validation, run separately on the training split and on the test split
train_scores = cross_val_score(mul_nb, x_train, y_train, cv=3, scoring='accuracy')
test_scores = cross_val_score(mul_nb, x_test, y_test, cv=3, scoring='accuracy')
print("train scores:", train_scores)
print("test scores:", test_scores)

Output:
train scores: [0.81457936 0.81260611 0.82925792]
test scores: [0.64258555 0.56687898 0.61700767]

3. Convert the text to feature vectors with TfidfVectorizer, then model and predict

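TfidfVectorizer uses a more refined weighting scheme called Term Frequency-Inverse Document Frequency (TF-IDF). A term's weight grows with how often it appears in a document (TF) and shrinks with the number of documents that contain it (IDF, the inverse document frequency), so words that occur in almost every document contribute less.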
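The weighting can be checked by hand on a small corpus. The sketch below assumes scikit-learn's default settings (smooth_idf=True, norm='l2', sublinear_tf=False), under which the IDF is ln((1 + n) / (1 + df)) + 1 and each row is then scaled to unit length. The variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "The quick brown fox jumped over the lazy dog.",
    "The lazy dog.",
    "The brown fox",
]

# Reference result from scikit-learn (default settings assumed)
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs).toarray()

# Raw term counts, using the same default tokenization and column order
counts = CountVectorizer().fit_transform(docs).toarray()

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed inverse document frequency

weighted = counts * idf                     # TF * IDF
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize rows

print(np.allclose(idf, vectorizer.idf_))    # True
print(np.allclose(manual, tfidf))           # True
```

A term found in every document gets an IDF of exactly 1, while a term found in only one of the three documents gets ln(4/2) + 1, about 1.69.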


3.1 TfidfVectorizer usage example

from sklearn.feature_extraction.text import TfidfVectorizer

# List of text documents
text = ["The quick brown fox jumped over the lazy dog.",
        "The lazy dog.",
        "The brown fox"]

# Create the vectorizer
vectorizer = TfidfVectorizer()

# Tokenize the documents and build the vocabulary
vectorizer.fit(text)

# Summarize the vocabulary and the IDF weights
print(vectorizer.vocabulary_)
print(vectorizer.idf_)

# Encode the first document
vector = vectorizer.transform([text[0]])

# Summarize the encoded document
print(vector.shape)
print(vector.toarray())

Output:
{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
[1.28768207 1.28768207 1.28768207 1.69314718 1.28768207 1.69314718 1.69314718 1.        ]
(1, 8)
[[0.29362163 0.29362163 0.29362163 0.38607715 0.29362163 0.38607715  0.38607715 0.45604677]]

3.2 Apply TfidfVectorizer to the news data

# Create the vectorizer
vectorizer = TfidfVectorizer()

# Tokenize the posts, build the vocabulary, and compute the TF-IDF matrix
tfidf_data = vectorizer.fit_transform(news.data)

3.3 Model and predict

x_train, x_test, y_train, y_test = train_test_split(tfidf_data, news.target)
mul_nb = MultinomialNB()

# 3-fold cross-validation, run separately on the training split and on the test split
train_scores = cross_val_score(mul_nb, x_train, y_train, cv=3, scoring='accuracy')
test_scores = cross_val_score(mul_nb, x_test, y_test, cv=3, scoring='accuracy')
print("train scores:", train_scores)
print("test scores:", test_scores)

Output:
train scores: [0.8238287  0.83379325 0.81937952]
test scores: [0.68103995 0.68809675 0.68030691]

3.4 Remove stop words, then model and predict

def get_stop_words():
    # Read one stop word per line from a local file
    result = set()
    with open('stopwords_en.txt', 'r') as f:
        for line in f:
            result.add(line.strip())
    return result

# Load the stop words (recent scikit-learn releases expect a list here, not a set)
stop_words = list(get_stop_words())

# Create the vectorizer
vectorizer = TfidfVectorizer(stop_words=stop_words)

# Tokenize the posts, build the vocabulary, and compute the TF-IDF matrix
tfidf_data = vectorizer.fit_transform(news.data)

x_train, x_test, y_train, y_test = train_test_split(tfidf_data, news.target)
mul_nb = MultinomialNB(alpha=0.01)

# 3-fold cross-validation, run separately on the training split and on the test split
train_scores = cross_val_score(mul_nb, x_train, y_train, cv=3, scoring='accuracy')
test_scores = cross_val_score(mul_nb, x_test, y_test, cv=3, scoring='accuracy')
print("train scores:", train_scores)
print("test scores:", test_scores)

Output:
train scores: [0.90419669 0.89577584 0.90095643]
test scores: [0.85107731 0.8433121  0.84526854]
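The stopwords_en.txt file is not included in the post. As a stand-in, a minimal sketch using scikit-learn's built-in English stop-word list (stop_words="english") on a toy corpus shows the effect on the vocabulary. The built-in list is an assumption here and will not match the author's file word for word.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, used only to show the effect on the vocabulary
docs = [
    "The quick brown fox jumped over the lazy dog.",
    "The lazy dog.",
    "The brown fox",
]

# Vocabulary without and with stop-word filtering
plain = TfidfVectorizer().fit(docs)
filtered = TfidfVectorizer(stop_words="english").fit(docs)

# Words dropped by the built-in English stop-word list
removed = sorted(set(plain.vocabulary_) - set(filtered.vocabulary_))
print("vocabulary size:", len(plain.vocabulary_), "->", len(filtered.vocabulary_))
print("removed:", removed)
```

Very common words such as "the" carry little information about the topic, so dropping them leaves the classifier with a smaller and more discriminative feature set.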
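The comparison shows that feature vectors built with TfidfVectorizer give better results than those built with CountVectorizer. Accuracy also rose considerably in the stop-word run. That run also set alpha=0.01 for MultinomialNB, so the gain cannot be attributed to stop-word removal alone.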
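The scores above come from cross-validating separately inside each split, with the vectorizer fitted on all posts before splitting. One alternative, sketched here as an assumption and not as the author's method, is to wrap the vectorizer and classifier in a Pipeline, fit on the training split only, and score on the held-out split. The helper name held_out_accuracy and the toy corpus are hypothetical; in practice news.data and news.target would be passed in.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def held_out_accuracy(vectorizer, texts, labels, seed=0):
    """Fit vectorizer + MultinomialNB on a training split, score on the held-out split."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, random_state=seed, stratify=labels
    )
    # The pipeline fits the vectorizer on the training texts only
    model = make_pipeline(vectorizer, MultinomialNB())
    model.fit(x_train, y_train)
    return model.score(x_test, y_test)


# Toy two-class corpus (hypothetical data, for illustration only)
texts = [
    "the goalie stopped the puck in overtime",
    "the team won the hockey game on home ice",
    "a late goal sent the playoff game to overtime",
    "the coach praised the defense after the hockey match",
    "fans cheered as the puck hit the net",
    "the playoff series moves to home ice",
    "the rocket reached orbit after launch",
    "the probe sent images from mars orbit",
    "engineers tested the rocket engine before launch",
    "the telescope captured a distant galaxy",
    "astronauts repaired the station in orbit",
    "the mars probe landed after a long flight",
]
labels = [0] * 6 + [1] * 6

results = {
    "CountVectorizer": held_out_accuracy(CountVectorizer(), texts, labels),
    "TfidfVectorizer": held_out_accuracy(TfidfVectorizer(), texts, labels),
}
for name, acc in results.items():
    print(name, "held-out accuracy:", acc)
```

With only a dozen sentences the numbers mean little. The sketch shows the evaluation pattern, where both vectorizers see the same split and neither sees the test texts during fitting.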
