Traditional Chinese Text Preprocessing and Word Cloud Generation (wordcloud) in One Article
1. Use the ckiptagger segmentation package (rather than jieba, because its segmentation accuracy is higher):
For download and usage details, see the ckiptagger GitHub repository.
from ckiptagger import data_utils, construct_dictionary, WS, POS, NER
data_utils.download_data_gdown("./")  # download the pretrained model data (gdrive-ckip)
# To use a GPU:
# 1. Install tensorflow-gpu (see the installation notes)
# 2. Set the CUDA_VISIBLE_DEVICES environment variable, e.g. os.environ["CUDA_VISIBLE_DEVICES"] = "0"
# 3. Set disable_cuda=False, e.g. ws = WS("./data", disable_cuda=False)
# To use the CPU:
ws = WS("./data")
pos = POS("./data")
ner = NER("./data")
2. Load the TXT file to be processed:
import re

# test.txt is the Traditional Chinese text to read in; errors="ignore" skips bytes that cannot be decoded
f = open("test.txt", encoding='utf-8', errors="ignore")
sentences = ''
for line in f.readlines():
    line = re.sub('\n', '', line)           # strip the newline from each line
    line = re.sub('[a-zA-Z0-9]', '', line)  # remove digits and Latin letters
    sentences += line
sentence_list = re.split(r'[,,。.]', sentences)  # split into a list of sentences
print(sentence_list)
Note: the encoding argument here tells Python to decode the file as UTF-8.
Finally, the sentence list is printed out.
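As a rough illustration (the real output depends on test.txt; the sentence below is invented), the splitting step behaves like this:

import re

demo = "今天天氣很好,我們去公園。他說:好啊"
print(re.split(r'[,,。.]', demo))
# ['今天天氣很好', '我們去公園', '他說:好啊']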
3. Word segmentation (and part-of-speech tagging):
word_sentence_list = ws(
    sentence_list,
    # sentence_segmentation = True,  # To consider delimiters
    # segment_delimiter_set = {",", "。", ":", "?", "!", ";"},  # This is the default set of delimiters
    # recommend_dictionary = dictionary1,  # words in this dictionary are encouraged
    # coerce_dictionary = dictionary2,  # words in this dictionary are forced
)
pos_sentence_list = pos(word_sentence_list)
entity_sentence_list = ner(word_sentence_list, pos_sentence_list)

def print_word_pos_sentence(word_sentence, pos_sentence):
    assert len(word_sentence) == len(pos_sentence)
    for word, pos in zip(word_sentence, pos_sentence):
        print(f"{word}({pos})", end="\u3000")
    print()
    return

for i, sentence in enumerate(sentence_list):
    print()
    print_word_pos_sentence(word_sentence_list[i], pos_sentence_list[i])
The output (only part of it is shown): each segmented word followed by its part-of-speech tag.
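entity_sentence_list is computed above but never inspected. Following the usage shown in the ckiptagger README, each sentence maps to a set of (start, end, entity_type, entity_text) tuples, which can be printed like this (a sketch, not part of the original post):

# Print the named entities recognized in each sentence
for i, sentence in enumerate(sentence_list):
    print(sentence)
    for entity in sorted(entity_sentence_list[i]):
        print(entity)  # e.g. (start_index, end_index, entity_type, entity_text)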
Next, flatten the segmented words into a one-dimensional list, which is easier to work with:
from itertools import chain

word_list = list(chain.from_iterable(word_sentence_list))
print(word_list)
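Before removing stop words, it can help to peek at the most frequent tokens, since punctuation and function words usually dominate; a minimal sketch using collections.Counter (not in the original post):

from collections import Counter

# Show the 20 most frequent tokens in the flattened word list
print(Counter(word_list).most_common(20))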
4. Remove stop words from word_list:
# Remove stop words
def remove_stop_words(file_name, seg_list):
    with open(file_name, encoding='utf-8') as f:
        stop_words = f.readlines()
    stop_words = [stop_word.rstrip() for stop_word in stop_words]
    new_list = []
    for seg in seg_list:
        if seg not in stop_words:
            new_list.append(seg)  # calling remove() inside the loop would change the list length
    return new_list

file_name = './stop_words.txt'
seg_list = remove_stop_words(file_name, word_list)
print('remove_stop_words: ', seg_list)
Here file_name must point to a Traditional Chinese stop word list that you supply yourself; a suitable TXT file can be downloaded online.
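The stop word file is expected to contain one word per line. A hypothetical demo (demo_stop_words.txt and its entries are made up) showing the format and how remove_stop_words consumes it:

# Hypothetical: write a tiny stop word file, one word per line
demo_stop_words = ["的", "了", "是", "在", "和"]
with open('./demo_stop_words.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(demo_stop_words))

print(remove_stop_words('./demo_stop_words.txt', ["今天", "的", "天氣", "是", "晴朗"]))
# -> ['今天', '天氣', '晴朗']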
After removing the stop words, the word list is much cleaner.
5. Generate the word cloud:
# Join the segmented words with spaces
sentence = ''
for j in seg_list:
    sentence += j + ' '
print(sentence)

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

wc = WordCloud(background_color='white',  # background color
               max_words=30,              # maximum number of words shown
               mask=None,                 # background mask image
               max_font_size=None,        # maximum font size
               font_path='./KAIU.TTF',    # a Chinese font (.TTF) is required for Chinese text
               random_state=None,         # random seed for the word colors
               prefer_horizontal=0.9)     # ratio of horizontal to vertical placement
wc.generate(sentence)
wc.to_file('./test.png')
Note: for font_path you need to download a Traditional Chinese font (.TTF) to your machine; a quick Google search will turn one up.
The final generated image is saved as test.png.
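Optionally, the cloud can also be built directly from word frequencies and previewed inline; this is a sketch using standard wordcloud and matplotlib calls (generate_from_frequencies, plt.imshow), not part of the original post. Building from frequencies sidesteps WordCloud's default tokenizer, which tends to drop single-character words.

import matplotlib.pyplot as plt
from collections import Counter

# Build the cloud from token frequencies instead of the space-joined string
wc.generate_from_frequencies(Counter(seg_list))
# Preview inline instead of (or in addition to) saving to a file
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()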