Published June 29, 2023
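Scraping the Harry Potter Novels with a Python Crawler and Analyzing Character Appearances with Data Visualization

Preface

The text and images in this article come from the internet and are for learning and discussion only, with no commercial purpose. Copyright belongs to the original authors; if there is any problem, please contact us promptly so that we can deal with it.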
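First, a brief introduction to jieba, a Chinese word segmentation package. jieba has three main segmentation modes:

Accurate mode: the default. It segments the text as precisely as possible and is suited to text analysis.
Full mode: emits every string that can form a word, so the output can be ambiguous.
Search engine mode: builds on accurate mode and splits long words again, which suits search engine indexing.

Commonly used jieba calls:

Segment in accurate mode: jieba.cut(text, cut_all=False); with cut_all=True it runs in full mode
Load a custom dictionary: jieba.load_userdict(file_name)
Add a word: jieba.add_word(seg, freq, flag)
Delete a word: jieba.del_word(seg)

"Harry Potter" is a series of fantasy novels by the British author J. K. Rowling. It follows the adventures of the protagonist, Harry Potter, during his seven years of study at Hogwarts School of Witchcraft and Wizardry. Below, we use the tangled character relationships of "Harry Potter" as an exercise in using jieba.

    # Load the required packages
    import codecs
    import numpy as np
    import pandas as pd
    import jieba
    import jieba.posseg as pseg  # part-of-speech tagging module
    from pyecharts import Bar, WordCloud  # pyecharts 0.5.x import style

    # Load the character names, the stop words, and the custom dictionary
    renmings = pd.read_csv('人名.txt', engine='python', encoding='utf-8', names=['renming'])['renming']
    # The stop-word file name is empty in the source
    stopwords = pd.read_csv('', engine='python', encoding='utf-8', names=['stopwords'])['stopwords'].tolist()
    book = open('哈利波特.txt', encoding='utf-8').read()
    jieba.load_userdict('哈利波特词库.txt')

    # Define a segmentation function
    def words_cut(book):
        words = list(jieba.cut(book))
        stopwords1 = [w for w in words if len(w) == 1]  # treat single-character tokens as stop words
        seg = set(words) - set(stopwords) - set(stopwords1)  # drop stop words for a cleaner vocabulary
        result = [i for i in words if i in seg]
        return result

    # First segmentation pass
    bookwords = words_cut(book)
    renming = [i.split(' ')[0] for i in set(renmings)]  # keep only the name; drop the frequency and POS tag
    name_set = set(renming)  # build the set once instead of once per token
    nameswords = [i for i in bookwords if i in name_set]  # keep only character names

    # Count word frequencies
    bookwords_count = pd.Series(bookwords).value_counts().sort_values(ascending=False)
    nameswords_count = pd.Series(nameswords).value_counts().sort_values(ascending=False)
    bookwords_count[:100].index

After the first pass, most words are segmented correctly, but a small number of name-like words are still wrong. Examples are '布利', '罗恩说' ("Ron said"), '伏地', '斯内', and '地说', and names such as '乌姆里奇' (Umbridge) and '霍格沃兹' (Hogwarts) are split into two words.

    # Adjust the dictionary for specific words
    jieba.add_word('邓布利多', 100, 'nr')
    jieba.add_word('霍格沃茨', 100, 'n')
    jieba.add_word('乌姆里奇', 100, 'nr')
    jieba.add_word('拉唐克斯', 100, 'nr')
    jieba.add_word('伏地魔', 100, 'nr')
    jieba.del_word('罗恩说')
    jieba.del_word('地说')
    jieba.del_word('斯内')

    # Second segmentation pass
    bookwords = words_cut(book)
    nameswords = [i for i in bookwords if i in name_set]
    bookwords_count = pd.Series(bookwords).value_counts().sort_values(ascending=False)
    nameswords_count = pd.Series(nameswords).value_counts().sort_values(ascending=False)
    bookwords_count[:100].index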
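The filtering done by words_cut and the counting done by value_counts can be illustrated on a tiny hand-written token list. The following is a minimal sketch that uses only the standard library. The token list, the stop-word set, and the helper name filter_and_count are made up for illustration, and collections.Counter stands in for the pandas value_counts call used in the article.

```python
from collections import Counter


def filter_and_count(tokens, stopwords):
    """Drop stop words and single-character tokens, then count what is left."""
    kept = [t for t in tokens if len(t) > 1 and t not in stopwords]
    return Counter(kept)


# Hypothetical segmenter output (not taken from the novel)
tokens = ['哈利', '说', '赫敏', '和', '罗恩', '一起', '哈利', '的', '魔杖', '哈利']
stopwords = {'一起'}  # hypothetical stop-word list

counts = filter_and_count(tokens, stopwords)
top = counts.most_common(2)  # the two most frequent remaining tokens
print(top)
```

Single-character tokens such as '说' and '的' are mostly particles and verbs of speech in this kind of text, which is presumably why the article treats all of them as stop words.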
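After the second pass, the errors from the first pass have been corrected. Next comes the statistical analysis.

    # Top 15 most frequent words (pyecharts 0.5.x API)
    bar = Bar('Top 15 most frequent words', background_color='white', title_pos='center', title_text_size=20)
    x = bookwords_count[:15].index.tolist()
    y = bookwords_count[:15].values.tolist()
    bar.add('', x, y, xaxis_interval=0, xaxis_rotate=30, is_label_show=True)
    bar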
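The 15 most frequent words in the whole series include Harry, Hermione, Ron, Dumbledore, wand, magic, Malfoy, Snape, and Sirius. Stringing them together gives a rough idea of what "Harry Potter" is about: accompanied by his friends Hermione and Ron, and with the help and training of the great wizard Dumbledore, Harry uses his wand and magic to knock out the big boss, Voldemort. Of course, "Harry Potter" itself is a far more exciting read than that.

    # Top 20 character names (pyecharts 0.5.x API)
    bar = Bar('Top 20 main characters', background_color='white', title_pos='center', title_text_size=20)
    x = nameswords_count[:20].index.tolist()
    y = nameswords_count[:20].values.tolist()
    bar.add('', x, y, xaxis_interval=0, xaxis_rotate=30, is_label_show=True)
    bar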
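Ranked by number of appearances, Harry's position as the protagonist is unshakable: he appears over 13,000 more times than Hermione, who is in second place. That is to be expected, since the series is called "Harry Potter" and not "Hermione Granger".

    # Word cloud for the whole novel
    name = bookwords_count.index.tolist()
    value = bookwords_count.values.tolist()
    wc = WordCloud(background_color='white')
    wc.add("", name, value, word_size_range=[10, 200], shape='diamond')
    wc

    # Character relationship analysis
    names = {}          # name -> number of appearances
    relationships = {}  # name -> {other name -> co-occurrence count}
    lineNames = []      # names found in each line of the text
    name_vocab = set(nameswords)  # build the set once instead of once per token
    with codecs.open('哈利波特.txt', 'r', 'utf8') as f:
        n = 0
        for line in f.readlines():
            n += 1
            print('Processing line {}'.format(n))
            poss = pseg.cut(line)
            lineNames.append([])
            for w in poss:
                if w.word in name_vocab:
                    lineNames[-1].append(w.word)
                    if names.get(w.word) is None:
                        names[w.word] = 0
                        relationships[w.word] = {}
                    names[w.word] += 1

    # Two names that appear on the same line count as one co-occurrence
    for line in lineNames:
        for name1 in line:
            for name2 in line:
                if name1 == name2:
                    continue
                if relationships[name1].get(name2) is None:
                    relationships[name1][name2] = 1
                else:
                    relationships[name1][name2] = relationships[name1][name2] + 1

    node = pd.DataFrame(columns=['Id', 'Label', 'Weight'])
    edge = pd.DataFrame(columns=['Source', 'Target', 'Weight'])
    for name, times in names.items():
        node.loc[len(node)] = [name, name, times]
    for name, edges in relationships.items():
        for v, w in edges.items():
            if w > 3:  # keep only pairs that co-occur more than 3 times
                edge.loc[len(edge)] = [name, v, w]

After this step, we find that the same character appears under several different names, so we merge those names and recompute the totals, which gives 88 nodes.

    node.loc[node['Id'] == '哈利', 'Id'] = '哈利波特'
    node.loc[node['Id'] == '波特', 'Id'] = '哈利波特'
    node.loc[node['Id'] == '阿不思', 'Id'] = '邓布利多'
    node.loc[node['Label'] == '哈利', 'Label'] = '哈利波特'
    node.loc[node['Label'] == '波特', 'Label'] = '哈利波特'
    node.loc[node['Label'] == '阿不思', 'Label'] = '邓布利多'
    edge.loc[edge['Source'] == '哈利', 'Source'] = '哈利波特'
    edge.loc[edge['Source'] == '波特', 'Source'] = '哈利波特'
    edge.loc[edge['Source'] == '阿不思', 'Source'] = '邓布利多'
    edge.loc[edge['Target'] == '哈利', 'Target'] = '哈利波特'
    edge.loc[edge['Target'] == '波特', 'Target'] = '哈利波特'
    edge.loc[edge['Target'] == '阿不思', 'Target'] = '邓布利多'

    # Sum the weights of merged names; as_index=False keeps Id and Label as columns
    # so that they survive to_csv(index=False)
    nresult = node.groupby(['Id', 'Label'], as_index=False)['Weight'].sum().sort_values('Weight', ascending=False)
    eresult = edge.sort_values('Weight', ascending=False)
    # The output file names are empty in the source
    nresult.to_csv('', index=False)
    eresult.to_csv('', index=False)

With the node table and the edge table in hand, the character relationships in "Harry Potter" are then analyzed with Gephi.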
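The line-level co-occurrence counting and the alias merging described above can also be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the article's code: the sample lines, the ALIASES table, and the function name cooccurrence are assumptions. Unlike the article, it maps aliases to a canonical name before counting, and it stores each pair once per line as an undirected edge instead of counting both directions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical alias table: every alias maps to one canonical name
ALIASES = {'哈利': '哈利波特', '波特': '哈利波特', '阿不思': '邓布利多'}


def cooccurrence(lines_of_names):
    """Count appearances per character and co-occurrences per line."""
    node_weight = Counter()
    edge_weight = Counter()
    for names in lines_of_names:
        canonical = [ALIASES.get(n, n) for n in names]
        node_weight.update(canonical)
        # Each unordered pair of distinct characters on a line counts once
        for a, b in combinations(sorted(set(canonical)), 2):
            edge_weight[(a, b)] += 1
    return node_weight, edge_weight


# Hypothetical per-line name lists, in the shape the tagging loop produces
lines = [
    ['哈利', '赫敏'],
    ['波特', '赫敏', '罗恩'],
    ['阿不思', '哈利'],
]
nodes, edges = cooccurrence(lines)
print(nodes.most_common())
print(edges.most_common())
```

Merging aliases before counting means that edges such as (哈利, 赫敏) and (波特, 赫敏) land in the same bucket, so no duplicate rows need to be renamed or re-aggregated afterwards. A threshold like the article's w > 3 could then be applied to edge_weight before exporting to Gephi.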
Posted by: admin. Please credit the source when reposting: http://www.yc00.com/xiaochengxu/1687981627a63408.html