Traditional Chinese Text Preprocessing and Word Cloud Generation (wordcloud) in One Article
1. Use the ckiptagger segmentation package (rather than jieba, because its segmentation accuracy is higher):
For download and usage details, see the ckiptagger GitHub repository.
from ckiptagger import data_utils, construct_dictionary, WS, POS, NER
data_utils.download_data_gdown("./")  # download the pretrained model data (gdrive-ckip)
# To use a GPU:
# 1. Install tensorflow-gpu (see the installation notes)
# 2. Set the CUDA_VISIBLE_DEVICES environment variable, e.g. os.environ["CUDA_VISIBLE_DEVICES"] = "0"
# 3. Set disable_cuda=False, e.g. ws = WS("./data", disable_cuda=False)
# To use the CPU:
ws = WS("./data")
pos = POS("./data")
ner = NER("./data")
2. Load the TXT file to be processed:
import re

# test.txt is the Traditional Chinese text to read in; errors="ignore" skips bytes that cannot be decoded
f = open("test.txt", encoding='utf-8', errors="ignore")
sentences = ''
for line in f.readlines():
    line = re.sub('\n', '', line)           # strip the newline from each line
    line = re.sub('[a-zA-Z0-9]', '', line)  # remove digits and Latin letters
    sentences += line
sentence_list = re.split(r'[,,。.]', sentences)  # split into a list of sentences
print(sentence_list)
Note: the encoding argument here tells Python to decode the file as UTF-8.
Finally, the sentence list is printed out.
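As a rough illustration (the real output depends on test.txt; the sentence below is invented), the splitting step behaves like this:

import re

demo = "今天天氣很好,我們去公園。他說:好啊"
print(re.split(r'[,,。.]', demo))
# ['今天天氣很好', '我們去公園', '他說:好啊']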
3. Word segmentation (and part-of-speech tagging):
word_sentence_list = ws(
    sentence_list,
    # sentence_segmentation = True,  # To consider delimiters
    # segment_delimiter_set = {",", "。", ":", "?", "!", ";"},  # This is the default set of delimiters
    # recommend_dictionary = dictionary1,  # words in this dictionary are encouraged
    # coerce_dictionary = dictionary2,  # words in this dictionary are forced
)
pos_sentence_list = pos(word_sentence_list)
entity_sentence_list = ner(word_sentence_list, pos_sentence_list)

def print_word_pos_sentence(word_sentence, pos_sentence):
    assert len(word_sentence) == len(pos_sentence)
    for word, pos in zip(word_sentence, pos_sentence):
        print(f"{word}({pos})", end="\u3000")
    print()
    return

for i, sentence in enumerate(sentence_list):
    print()
    print_word_pos_sentence(word_sentence_list[i], pos_sentence_list[i])
The output (only part of it is shown): each segmented word followed by its part-of-speech tag.
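entity_sentence_list is computed above but never inspected. Following the usage shown in the ckiptagger README, each sentence maps to a set of (start, end, entity_type, entity_text) tuples, which can be printed like this (a sketch, not part of the original post):

# Print the named entities recognized in each sentence
for i, sentence in enumerate(sentence_list):
    print(sentence)
    for entity in sorted(entity_sentence_list[i]):
        print(entity)  # e.g. (start_index, end_index, entity_type, entity_text)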
Next, flatten the segmented words into a one-dimensional list, which is easier to work with:
from itertools import chain

word_list = list(chain.from_iterable(word_sentence_list))
print(word_list)
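Before removing stop words, it can help to peek at the most frequent tokens, since punctuation and function words usually dominate; a minimal sketch using collections.Counter (not in the original post):

from collections import Counter

# Show the 20 most frequent tokens in the flattened word list
print(Counter(word_list).most_common(20))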
4. Remove stop words from word_list:
# Remove stop words
def remove_stop_words(file_name, seg_list):
    with open(file_name, encoding='utf-8') as f:
        stop_words = f.readlines()
    stop_words = [stop_word.rstrip() for stop_word in stop_words]
    new_list = []
    for seg in seg_list:
        if seg not in stop_words:
            new_list.append(seg)  # calling remove() inside the loop would change the list length
    return new_list

file_name = './stop_words.txt'
seg_list = remove_stop_words(file_name, word_list)
print('remove_stop_words: ', seg_list)
Here file_name must point to a Traditional Chinese stop word list that you supply yourself; a suitable TXT file can be downloaded online.
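The stop word file is expected to contain one word per line. A hypothetical demo (demo_stop_words.txt and its entries are made up) showing the format and how remove_stop_words consumes it:

# Hypothetical: write a tiny stop word file, one word per line
demo_stop_words = ["的", "了", "是", "在", "和"]
with open('./demo_stop_words.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(demo_stop_words))

print(remove_stop_words('./demo_stop_words.txt', ["今天", "的", "天氣", "是", "晴朗"]))
# -> ['今天', '天氣', '晴朗']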
After removing the stop words, the word list is much cleaner.
5. Generate the word cloud:
# Join the segmented words with spaces
sentence = ''
for j in seg_list:
    sentence += j + ' '
print(sentence)

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

wc = WordCloud(background_color='white',  # background color
               max_words=30,              # maximum number of words shown
               mask=None,                 # background mask image
               max_font_size=None,        # maximum font size
               font_path='./KAIU.TTF',    # a Chinese font (.TTF) is required for Chinese text
               random_state=None,         # random seed for the word colors
               prefer_horizontal=0.9)     # ratio of horizontal to vertical placement
wc.generate(sentence)
wc.to_file('./test.png')
Note: for font_path you need to download a Traditional Chinese font (.TTF) to your machine; a quick Google search will turn one up.
The final generated image is saved as test.png.
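Optionally, the cloud can also be built directly from word frequencies and previewed inline; this is a sketch using standard wordcloud and matplotlib calls (generate_from_frequencies, plt.imshow), not part of the original post. Building from frequencies sidesteps WordCloud's default tokenizer, which tends to drop single-character words.

import matplotlib.pyplot as plt
from collections import Counter

# Build the cloud from token frequencies instead of the space-joined string
wc.generate_from_frequencies(Counter(seg_list))
# Preview inline instead of (or in addition to) saving to a file
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()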