Traditional Chinese Text Preprocessing and Word Cloud Generation (wordcloud) in One Article

1. Use the ckiptagger word segmentation package (instead of jieba, because ckiptagger is more accurate on Traditional Chinese):

For download and usage details, see the ckiptagger GitHub repository.

from ckiptagger import data_utils, construct_dictionary, WS, POS, NER
data_utils.download_data_gdown("./") # gdrive-ckip
# To use the GPU:
#    1. Install tensorflow-gpu (see the installation guide)
#    2. Set the CUDA_VISIBLE_DEVICES environment variable, e.g. os.environ["CUDA_VISIBLE_DEVICES"] = "0"
#    3. Set disable_cuda=False, e.g. ws = WS("./data", disable_cuda=False)
# To use the CPU:
ws = WS("./data")
pos = POS("./data")
ner = NER("./data")
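
The commented-out recommend_dictionary / coerce_dictionary options in step 3 take a dictionary built with the construct_dictionary imported above. A minimal sketch, using placeholder words and weights of my own rather than anything from the original post:

# Placeholder domain words and weights; a higher weight means a stronger preference
word_to_weight = {"斷詞": 1, "繁體字": 1, "詞雲": 2}
dictionary = construct_dictionary(word_to_weight)
# Pass it to ws() in step 3, e.g. recommend_dictionary=dictionary (encouraged)
# or coerce_dictionary=dictionary (forced)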

2. Load the TXT file to be processed:

import re

# test.txt is the Traditional Chinese text we read in; skip undecodable bytes with errors="ignore"
f = open("test.txt", encoding='utf-8', errors="ignore")
sentences = ''
for line in f.readlines():
    line = re.sub('\n', '', line)           # strip the newline from each line
    line = re.sub('[a-zA-Z0-9]', '', line)  # drop digits and Latin letters
    sentences += line
sentence_list = re.split(r'[,,。.]', sentences)  # split into a list of sentences
print(sentence_list)

Note: the encoding argument here must match how the file is actually encoded (utf-8 in this example); for a file in another encoding such as Big5, see the sketch below.

The final print shows the resulting list of sentences.
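
If the source file is Big5-encoded rather than utf-8 (common for Traditional Chinese text), only the open call changes; a minimal sketch, with test_big5.txt as a hypothetical filename:

# Hypothetical Big5-encoded input; decode with the matching codec instead of utf-8
with open("test_big5.txt", encoding='big5', errors="ignore") as f:
    sentences_big5 = f.read()
# The result is an ordinary Python str, so the rest of the pipeline is unchanged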

3. Word segmentation (and POS tagging):

from itertools import chain

word_sentence_list = ws(
    sentence_list,
    # sentence_segmentation = True, # To consider delimiters
    # segment_delimiter_set = {",", "。", ":", "?", "!", ";"}, # This is the default set of delimiters
    # recommend_dictionary = dictionary1, # words in this dictionary are encouraged
    # coerce_dictionary = dictionary2, # words in this dictionary are forced
)
pos_sentence_list = pos(word_sentence_list)
entity_sentence_list = ner(word_sentence_list, pos_sentence_list)

def print_word_pos_sentence(word_sentence, pos_sentence):
    assert len(word_sentence) == len(pos_sentence)
    for word, pos in zip(word_sentence, pos_sentence):
        print(f"{word}({pos})", end="\u3000")  # full-width space between word(POS) pairs
    print()
    return

for i, sentence in enumerate(sentence_list):
    print()
    print_word_pos_sentence(word_sentence_list[i], pos_sentence_list[i])

The output (only part of it shown) lists each segmented word followed by its POS tag.
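The snippet above also computes entity_sentence_list but never displays it. A minimal sketch for inspecting the named entities, assuming ckiptagger's NER output is one set of (start, end, entity_type, word) tuples per sentence:

# Each element of entity_sentence_list is a set of (start, end, entity_type, word) tuples
for i, entity_sentence in enumerate(entity_sentence_list):
    print(f"sentence {i}:")
    for entity in sorted(entity_sentence):
        print(" ", entity)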

Next, flatten the segmented words into a one-dimensional list so they are easier to work with:

from itertools import chain

word_list = list(chain.from_iterable(word_sentence_list))
print(word_list)

4. Remove stop words from word_list:

# Remove stop words
def remove_stop_words(file_name, seg_list):
    with open(file_name, encoding='utf-8') as f:
        stop_words = f.readlines()
    stop_words = [stop_word.rstrip() for stop_word in stop_words]
    new_list = []
    for seg in seg_list:
        if seg not in stop_words:
            new_list.append(seg)  # calling remove() inside the loop would change the list length
    return new_list

file_name = './stop_words.txt'
seg_list = remove_stop_words(file_name, word_list)
print('remove_stop_words: ', seg_list)

Here file_name should point to a Traditional Chinese stop-word list; you can download such a TXT file online.
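
Since remove_stop_words strips each line and matches whole words, the file is assumed to contain one word per line. A minimal sketch that writes a tiny illustrative list (the entries are just common examples, not a real stop-word set):

# Write a tiny illustrative stop-word file: one word per line
with open('./stop_words.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(['的', '了', '是', '在', '和', '我']))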

After this step the word list is much cleaner.
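
Before moving on, it can help to check which words dominate; this quick frequency count (an optional addition, not one of the original steps) uses only the standard library:

from collections import Counter

# Preview the 30 most frequent remaining words, roughly what max_words=30 will draw in step 5
print(Counter(seg_list).most_common(30))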

5. Generate the word cloud:

# Join the segmented words with spaces
sentence = ''
for j in seg_list:
    sentence += j + ' '
print(sentence)

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

wc = WordCloud(background_color='white',    # background color
               max_words=30,                # maximum number of words shown
               mask=None,                   # background mask image
               max_font_size=None,          # maximum font size
               font_path='./KAIU.TTF',      # a Chinese font (.TTF) is required for Chinese text
               random_state=None,           # random seed for the word colors
               prefer_horizontal=0.9)       # ratio of horizontal to vertical layout
wc.generate(sentence)
wc.to_file('./test.png')

Note: for font_path you need to download a Traditional Chinese font file (such as KAIU.TTF) to your machine; a quick Google search will find one.
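
To preview the result without opening the saved PNG, you can render the cloud with matplotlib (an optional addition, assuming matplotlib is installed):

import matplotlib.pyplot as plt

plt.imshow(wc, interpolation='bilinear')  # render the WordCloud object as an image
plt.axis('off')
plt.show()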

The final generated image is saved as test.png.

 
