Experiment 3: Search Engines and SEO
I. Objectives
Study several common search-engine algorithms, including web-spider crawling strategies, Chinese word-segmentation algorithms, web-page body-text extraction algorithms, web-page deduplication algorithms, PageRank, and MapReduce, and understand their basic implementation principles; apply the SEO techniques learned in class to optimize a web page.
II. Content
1. Study common web-spider crawling strategies, such as the depth-first strategy, the breadth-first strategy, the page-selection strategy, the revisit strategy, and the parallel strategy, and understand how they are implemented;
2. Study at least two Chinese word-segmentation algorithms and understand how they are implemented;
3. Study at least two web-page body-text extraction algorithms and understand how they are implemented;
4. Study at least two web-page deduplication algorithms and understand how they are implemented;
5. Study Google's PageRank and MapReduce algorithms and understand how they are implemented;
6. Apply the SEO techniques you have learned to the static homepage designed in Experiment 2. The following techniques must be used:
(1) Optimize the page title (the title element);
(2) Choose suitable keywords and optimize them;
(3) Optimize the meta tags;
(4) Optimize the site structure and the URLs;
(5) Create a robots.txt file to keep spiders from crawling the site's back-end pages (a sample file is sketched after this list);
(6) Optimize the page's internal links;
(7) Optimize the heading tags;
(8) Optimize images;
(9) Slim down the page (reduce its size).
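For item (5), a minimal robots.txt placed at the site root keeps compliant spiders away from the back-end pages. The /admin/ path below is only an illustrative assumption; substitute the site's actual back-end directory:

    # robots.txt at the site root; blocks all crawlers from the back end
    User-agent: *
    Disallow: /admin/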
7. Using any one programming language such as C++, C#, or Java, design and implement a simple web-spider crawler: given a keyword, a crawl depth, and an initial URL, it should search pages and output the URL and title of every page that contains the keyword. [Note: item 7 is a supplementary experiment and is not required of every student; those who are interested may implement the program on their own; it does not count toward the lab-report grade.]
III. Requirements
1. Study several common web-spider crawling strategies and fill in the corresponding table; the table must be completed in full;
2. Study two Chinese word-segmentation algorithms and fill in the corresponding table in full;
3. Study two web-page body-text extraction algorithms and fill in the corresponding table in full;
4. Study two web-page deduplication algorithms and fill in the corresponding table in full;
5. Study the PageRank and MapReduce algorithms and fill in the corresponding table in full;
6. Provide the static homepage and its HTML code after SEO has been applied, using as many of the SEO techniques you have learned as possible;
7. Copying existing text wholesale from the Internet is strictly forbidden; explain the algorithm principles in your own words as far as possible, using diagrams where necessary;
8. Implement a simple web-spider program in any programming language; the complete source code and actual run results must be provided.
IV. Procedure
1. Using search engines and related references, research and organize material on several common web-spider crawling strategies, and fill in the corresponding table;
2. Using search engines and related references, research and organize the basic principles of two Chinese word-segmentation algorithms, and fill in the corresponding table;
3. Using search engines and related references, research and organize the basic principles of two web-page body-text extraction algorithms, and fill in the corresponding table;
4. Using search engines and related references, research and organize the basic principles of two web-page deduplication algorithms, and fill in the corresponding table;
5. Using search engines and related references, research and organize the basic principles of the PageRank and MapReduce algorithms, and fill in the corresponding table;
6. Apply SEO to the static homepage designed in Experiment 2;
7. Design and implement a simple web-spider crawler in a programming language of your choice.
V. Report Requirements
1. Study several common web-spider crawling strategies and fill in the following table (a breadth-first crawler sketch follows the table):

Strategy: Depth-first strategy
Principle: Depth-first search was widely used in early crawlers. Its aim is to reach the leaf nodes of the structure being searched, i.e. HTML files that contain no hyperlinks. When a hyperlink in an HTML file is selected, the linked HTML file is itself searched depth-first: a single chain of links must be followed completely before the remaining hyperlinks are searched. The crawler follows hyperlinks until it can go no deeper, then backtracks to an earlier HTML file and continues with that file's other hyperlinks. When no unvisited hyperlinks remain, the search is finished.
Reference: Baidu Baike, 深度优先搜索 (depth-first search): /view/

Strategy: Breadth-first strategy
Principle: Breadth-first search (BFS) is one of the simplest graph-search algorithms and the prototype of many important graph algorithms; Dijkstra's single-source shortest-path algorithm and Prim's minimum-spanning-tree algorithm adopt similar ideas. It is a blind search method: it systematically expands and examines all nodes of the graph in search of the result, without considering where the result is likely to be, and searches the whole graph thoroughly until the result is found.
Reference: Baidu Baike, 广度优先搜索 (breadth-first search): /view/

Strategy: Page-selection strategy
Principle: For a search engine it is practically impossible to crawl every page on the Internet; even Google, the world's best-known search engine, covers only about 30% of the Web. There are two main reasons: first, crawling-technology bottlenecks mean a spider cannot traverse all pages; second, storage and processing capacity are limited. A crawler should therefore collect important pages first, i.e. adopt a priority-fetching strategy: the more important a page is, the higher its crawl priority. In essence this is a way for the crawler, under given constraints, to quickly lock onto the important resources that users care about most. Its prerequisite is a sound measure of page importance; the main metrics currently used are the PageRank value and the average link depth.
Reference: Li Zhiyi (李志义), 网络爬虫的优化策略探略 (On Optimization Strategies for Web Crawlers), Guangzhou, Guangdong 510631

Strategy: Revisit strategy
Principle: (1) Set the revisit frequency from each site's update frequency. This fits actual conditions and manages the crawler more effectively; for example, portal sites add and update information continuously every day, so their pages are revisited on a daily or hourly cycle. (2) Ignore site update frequencies and simply revisit already-fetched pages at fixed intervals; the drawback is a high probability of redundant fetches, which wastes resources. (3) Revisit according to the search-engine operator's subjective rating of the update frequency of the main sites, providing a personalized service as required.
Reference: Li Zhiyi, 网络爬虫的优化策略探略, Guangzhou, Guangdong 510631

Strategy: Parallel strategy
Principle: The core of the parallel strategy is to increase the number of cooperating crawlers while dividing the work among them sensibly, so that different crawlers avoid fetching the same Web content. Tasks are usually assigned in one of two ways: partition by the IP addresses of the sites, so that each crawler traverses only the pages of one group of addresses; or assign crawl tasks dynamically by domain name, so that each crawler collects the Web content of one or more domain segments.
Reference: Li Zhiyi, 网络爬虫的优化策略探略, Guangzhou, Guangdong 510631
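To make the breadth-first strategy concrete, here is a minimal BFS crawler sketch in Java (the language of the spider program later in this report). The seed URL and the regex link extraction are simplifying assumptions; a real crawler would add politeness rules, robots.txt handling, and an HTML parser:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BfsCrawlSketch {
    // Crude href extraction; a real crawler would use an HTML parser.
    private static final Pattern HREF =
        Pattern.compile("href=[\"']?([^\"' >]+)", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        crawl("http://example.com/", 2); // hypothetical seed URL and depth
    }

    // Breadth-first crawl: a FIFO queue visits every page at depth k
    // before any page at depth k+1; the visited set deduplicates URLs.
    static void crawl(String seed, int maxDepth) {
        Set<String> visited = new HashSet<>();
        Deque<Object[]> queue = new ArrayDeque<>(); // entries are {url, depth}
        queue.add(new Object[]{seed, 0});
        while (!queue.isEmpty()) {
            Object[] entry = queue.poll();
            String url = (String) entry[0];
            int depth = (Integer) entry[1];
            if (depth > maxDepth || !visited.add(url))
                continue;
            try {
                String html = fetch(new URL(url));
                System.out.println("fetched " + url + " at depth " + depth);
                Matcher m = HREF.matcher(html);
                while (m.find()) { // resolve each link against the current page
                    String link = new URL(new URL(url), m.group(1)).toString();
                    queue.add(new Object[]{link, depth + 1});
                }
            } catch (IOException e) {
                System.out.println("skipping " + url + ": " + e.getMessage());
            }
        }
    }

    static String fetch(URL url) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            for (String line; (line = in.readLine()) != null; )
                sb.append(line).append('\n');
        }
        return sb.toString();
    }
}

Replacing the FIFO queue with a stack (LIFO) turns the same skeleton into the depth-first strategy.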
2. Study two Chinese word-segmentation algorithms and fill in the following table (a forward maximum-matching sketch follows the table):

Algorithm 1: Maximum-matching algorithm
Principle: Maximum matching is a widely used mechanical (dictionary-based) segmentation method. It segments text according to a segmentation word list and one basic evaluation rule for cuts, the "longest word first" principle.
Reference: Zhang Yuru (张玉茹), Zhaoqing 526070, 中文分词算法之最大匹配算法的研究 (A Study of the Maximum-Matching Algorithm for Chinese Word Segmentation)

Algorithm 2: Dictionary-free segmentation
Principle: A segmentation algorithm based on the mutual information between characters and on t-test statistics. A Chinese word can be understood as a stable combination of characters, so the more often several adjacent characters appear together in context, the more likely they are to form a word. On this basis, mutual information and the t-score are introduced to express how tightly two characters are bound to each other. To segment a character string, the algorithm computes the mutual information and the t-test difference between characters and forms words from the groups for which these values are large. Its limitations are that it can only handle words of length two, that frequently co-occurring character pairs which are not actually words are often extracted, and that the computation for common words is expensive; on the other hand, it can recognize new words and resolve some ambiguity. A mature segmentation system cannot rely on any single algorithm: different algorithms must be combined, and the segmentation scheme chosen according to the actual application.
Reference: Liu Hongzhi (刘红芝), Xuzhou Medical College Library, Xuzhou, Jiangsu 221004, 中文分词技术的研究 (Research on Chinese Word-Segmentation Techniques)
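To illustrate the "longest word first" principle above, here is a minimal forward maximum-matching sketch; the toy dictionary and the maximum word length of 4 are assumptions for illustration only:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MaxMatchSketch {
    public static void main(String[] args) {
        // Toy dictionary; a real segmenter would load a full word list.
        Set<String> dict = new HashSet<>(Arrays.asList("搜索", "搜索引擎", "引擎", "优化"));
        System.out.println(segment("搜索引擎优化", dict, 4)); // prints [搜索引擎, 优化]
    }

    // Forward maximum matching: at each position, take the longest dictionary
    // word that matches; fall back to a single character if nothing matches.
    static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(i + maxLen, text.length());
            String word = text.substring(i, i + 1); // default: single character
            for (int j = end; j > i; j--) {         // try the longest candidate first
                String cand = text.substring(i, j);
                if (dict.contains(cand)) { word = cand; break; }
            }
            words.add(word);
            i += word.length();
        }
        return words;
    }
}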
3. Study two web-page body-text extraction algorithms and fill in the following table:

Algorithm 1: Similarity-based body-text extraction for Chinese pages
Principle: Body text appears in an HTML source file in two forms: with tag cues and without. For tagged text, the tags generally carry block information, table information, or font and color information, and block-based extraction methods handle it well. Untagged body text, however, ends up in no block and no table after processing, so a "segment into blocks, then extract" approach cannot reach the desired precision. This algorithm therefore extracts the body text by similarity, in two steps: first take the line of the page that contains the most Chinese characters, then extract the body text using cosine-similarity matching together with tag similarity. Its greatest advantage is that it avoids the block-segmentation step entirely.
Reference: Xiong Ziqi (熊子奇), Zhang Hui (张晖), Lin Maosong (林茂松) (School of Computer Science and Technology, Southwest University of Science and Technology, Mianyang, Sichuan 621010), 基于相似度的中文网页正文提取算法 (A Similarity-Based Body-Text Extraction Algorithm for Chinese Web Pages)

Algorithm 2: FFT-based body-text extraction
Principle: Given the HTML source file of a page, find the optimal body-text interval. Every character interval (b, e) of the source file, with 0 <= b <= e <= s (where s is the length of the source file S), is assigned an evaluation value, and the problem becomes finding the maximum of this evaluation function.
Reference: Li Lei (李蕾), Wang Jinlin (王劲林), Bai He (白鹤), Hu Jingjing (胡晶晶), 基于FFT的网页正文提取算法研究与实现 (Research and Implementation of an FFT-Based Body-Text Extraction Algorithm)

4. Study two web-page deduplication algorithms and fill in the following table (a sketch of the URL-hash scheme follows item 6 below):

Algorithm 1: URL-hash deduplication of same-source pages
Principle: Construct a suitable hash function H mapping a page's URL character sequence to a hash value; identical URL strings yield identical hash values, which indicates that the URL has already been downloaded. After a parsed URL is preprocessed, its hash value is computed from the code values of its characters as

    A_i = (C_1 + C_2 + ... + C_ni) mod S    (1)

where U_i is a URL from the set of parsed page URLs, A_i is the hash address of U_i, n_i is the length of the preprocessed string of U_i, C_k is the code value of the k-th character (counting from the left) of the preprocessed U_i, and S is the hash-slot capacity. The formula expresses the mapping from a URL string U to its hash value H.
Input: URL; S. Output: the URL's hash value.
Algorithm: (1) initialize for the URL; (2) compute the URL's hash value by formula (1); (3) release storage and return the hash value.
Reference: Gao Kai (高凯), Wang Yongcheng (王永成), Xiao Jun (肖君), Shanghai 200030, 网页去重策略 (Web-Page Deduplication Strategies)

Algorithm 2: Content-based near-duplicate detection
Principle: Uses the similarity between the main contents of two pages to judge whether they are near-duplicates, with the main content represented by topic concepts. When the proportion of similar content between two page bodies reaches a preset empirical threshold, the pages are considered near-duplicates and the second need not be downloaded again. Each page U_i (i in [1, n]) is represented by a feature vector, and its topic-concept weights w_ij are determined mainly by tf-idf with other strategies as a supplement: the tf-idf factor t is multiplied by a coefficient C that expresses how different types of page tags influence the weight, so that terms in different positions are weighted differently; the empirical value of C for each tag type can be determined experimentally. Taking into account a concept-length factor z, a part-of-speech factor p, and other factors as well, the weighting scheme can be expressed as a function of all of these. Finally, the topic concepts with the largest weights, those that best represent the document, are output. Whether two pages A and B are similar is then decided by counting the co-occurrences of their topic-concept strings: if the co-occurrence count exceeds a preset empirical threshold, A and B are judged near-duplicates.
Reference: Gao Kai, Wang Yongcheng, Xiao Jun, Shanghai 200030, 网页去重策略

5. Study the PageRank and MapReduce algorithms and fill in the following table (sketches of the PageRank iteration and a MapReduce-style word count follow item 6 below):

Algorithm: PageRank
Principle: PageRank is the hyperlink-analysis algorithm Google's search engine uses to order pages. Google kept the traditional search-engine architecture; its greatest difference from earlier engines is that it ranks pages so that the most important ones appear at the front of the result list, and the core of this ranking is the PageRank algorithm. A page's computed PageRank value determines its position in the returned result set: the higher the value, the nearer the front. The algorithm rests on two premises:
Premise 1: a page that is linked to many times is probably important; a page that is linked to only a few times but from important pages may also be important; a page passes its importance on evenly to the pages it links to.
Premise 2: a user starts at a random page in the collection and then browses forward by following links, never going back, choosing each link on the current page with equal probability. On any page, the user may lose interest in that page's links and start again at a new random page; call this leaving probability d. A page's PageRank value is then the probability that the browsing user visits that page.
Let A be a page and T1, T2, ..., Tn the pages that link to it. Let C(A) be the number of links going out of A, PR(A) the PageRank of A, and d the damping factor (usually set to 0.85). Then:

    PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Reference: Chen Jie (陈杰), Zhejiang University, 主题搜索引擎中网络蜘蛛搜索策略研究 (Research on Web-Spider Search Strategies in Topic Search Engines)

Algorithm: MapReduce
Principle: MapReduce is a programming model for parallel computation over large data sets (over 1 TB). The concepts "Map" and "Reduce", and their main ideas, are borrowed from functional programming languages, together with features borrowed from vector programming languages. The model makes it much easier for programmers to run their programs on a distributed system without knowing anything about distributed parallel programming. Current implementations have the user specify a Map function, which transforms a set of key/value pairs into a new set of key/value pairs, and a concurrent Reduce function, which ensures that all mapped key/value pairs sharing the same key are combined together.
Reference: Baidu Baike, MapReduce: /view/

6. Provide a screenshot of the statically rendered homepage after SEO optimization, together with the complete HTML source code.
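To make the URL-hash deduplication of table 4 concrete, here is a minimal sketch. Summing character code values modulo the table size follows formula (1) above; the slot capacity of 1024, the per-slot collision sets, and the lower-casing preprocessing are assumptions for illustration:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UrlDedupSketch {
    private static final int S = 1024; // hash-slot capacity S (arbitrary for the sketch)
    private final List<Set<String>> table = new ArrayList<>();

    public UrlDedupSketch() {
        for (int i = 0; i < S; i++)
            table.add(new HashSet<>()); // one collision set per slot
    }

    // A = (sum of character code values C_k) mod S, per formula (1) in table 4.
    private static int hash(String url) {
        int sum = 0;
        for (int k = 0; k < url.length(); k++)
            sum += url.charAt(k);
        return sum % S;
    }

    // Returns true the first time a URL is seen, false for duplicates.
    public boolean add(String url) {
        String u = url.trim().toLowerCase(); // stand-in for the paper's URL preprocessing
        return table.get(hash(u)).add(u);    // the per-slot set resolves collisions exactly
    }

    public static void main(String[] args) {
        UrlDedupSketch dedup = new UrlDedupSketch();
        System.out.println(dedup.add("http://example.com/a")); // true: new URL
        System.out.println(dedup.add("http://example.com/a")); // false: already fetched
    }
}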
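To illustrate the PageRank formula in table 5, here is a minimal power-iteration sketch over a tiny hand-made link graph; the graph and the fixed iteration count are assumptions for illustration only:

import java.util.Arrays;

public class PageRankSketch {
    public static void main(String[] args) {
        // Tiny link graph: page index -> indices of the pages it links to.
        int[][] links = { {1, 2}, {2}, {0} };
        double d = 0.85; // damping factor, as in the table
        int n = links.length;
        double[] pr = new double[n];
        Arrays.fill(pr, 1.0 / n);               // start from a uniform value
        for (int iter = 0; iter < 20; iter++) { // fixed iteration count for the sketch
            double[] next = new double[n];
            Arrays.fill(next, 1 - d);           // PR(A) = (1 - d) + d * sum(PR(Ti)/C(Ti))
            for (int i = 0; i < n; i++)
                for (int j : links[i])          // page i passes its rank on evenly
                    next[j] += d * pr[i] / links[i].length;
            pr = next;
        }
        System.out.println(Arrays.toString(pr)); // pages with more/better in-links rank higher
    }
}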
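The MapReduce entry can be illustrated with the canonical word-count example; this sketch only mimics the map, shuffle, and reduce phases in memory on one machine, with two hard-coded "documents" as input:

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceSketch {
    public static void main(String[] args) {
        List<String> docs = Arrays.asList("spider crawls web", "web pages link web");
        // Map phase: emit a (word, 1) pair for every word in every document.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String doc : docs)
            for (String word : doc.split("\\s+"))
                mapped.add(new AbstractMap.SimpleEntry<>(word, 1));
        // Shuffle phase: group the mapped values by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapped)
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        // Reduce phase: combine the grouped counts for each key.
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            System.out.println(e.getKey() + " -> " + sum);
        }
    }
}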
7. Optional: provide the complete source code of the web-spider program and screenshots of its actual run results (the report must include both the source code and the screenshots).
import javax.swing.*;
import java.awt.*; // need this to access the Color object

/**
 * Input verifier to verify integer text fields.
 *
 * Checks for valid integer input, and to see if the number is between a
 * specified max and min value.
 *
 * @author Mark Pendergast
 */
public class IntegerVerifier extends InputVerifier {
    /** listener to get valid/invalid data reports */
    private VerifierListener listener = null;
    /** blank fields allowed, true for ok, false for error */
    private boolean blankOk = false;
    /** minimum valid value */
    int minValue = Integer.MIN_VALUE;
    /** maximum valid value */
    int maxValue = Integer.MAX_VALUE;

    /**
     * Creates a new instance of IntegerVerifier
     *
     * @param alistener VerifierListener to receive invalid/valid data calls (null means no listener)
     * @param blankok if true, then the field can be left blank
     * @param min minimum valid value
     * @param max maximum valid value
     */
    public IntegerVerifier(VerifierListener alistener, boolean blankok, int min, int max) {
        listener = alistener;
        blankOk = blankok;
        minValue = min;
        maxValue = max;
    }

    /**
     * Verifies contents of the specified component
     *
     * @param jComponent the component to check
     * @return true if the component is ok, else false
     */
    public boolean verify(JComponent jComponent) {
        JTextField thefield = (JTextField) jComponent;
        String input = thefield.getText();
        int number;
        input = input.trim(); // strip off leading and trailing spaces as these give parseInt problems
        if (input.length() == 0 && blankOk) {
            thefield.setForeground(Color.black);
            if (listener != null)
                listener.validData(jComponent);
            return true; // if empty, just return true
        } else if (input.length() == 0 && !blankOk) {
            reportError(thefield, "Field cannot be blank!");
            return false;
        }
        /*
         * try to convert to an integer
         */
        try {
            number = Integer.parseInt(input);
        } catch (NumberFormatException e) {
            reportError(thefield, "You must enter a valid number");
            return false;
        }
        /*
         * test if it's in the range
         */
        if (number < minValue || number > maxValue) {
            reportError(thefield, "You must enter a number between " + minValue + " and " + maxValue);
            return false;
        }
        /*
         * report good data
         */
        thefield.setForeground(Color.black);
        thefield.setText("" + number); // reset what we converted into the component
        if (listener != null)
            listener.validData(jComponent);
        return true; // valid input found
    }

    /**
     * report error to the listener (if any)
     * @param thefield text field being checked
     * @param message error message to report
     */
    private void reportError(JTextField thefield, String message) {
        thefield.setForeground(Color.red); // paint the text red to flag invalid input
        if (listener != null)
            listener.invalidData(message, thefield);
    }
}

/**
 * Callback interface used by IntegerVerifier to report results. Its definition
 * is missing from the handout and is reconstructed here from its call sites.
 */
interface VerifierListener {
    void invalidData(String message, JComponent jComponent);
    void validData(JComponent jComponent);
}
import java.io.*;
import java.net.*;
import java.util.*;
import javax.swing.*;
import javax.swing.tree.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

/**
 * Object used to search the web (or a subset of given domains) for a list of keywords
 * @author Mark Pendergast
 */
public class Spider extends Thread {
    /** site visit limit (stops search at some point) */
    private int siteLimit = 100;
    /** search depth limit */
    private int depthLimit = 100;
    /** keyword list for search */
    private String[] keywordList;
    /** ip domain list */
    private String[] ipDomainList;
    /** visited tree */
    private JTree searchTree = null;
    /** message JTextArea, place to post errors */
    private JTextArea messageArea;
    /** place to put search statistics */
    private JLabel statsLabel;
    /** keep track of web sites searched */
    private int sitesSearched = 0;
    /** keep track of web sites found with matching criteria */
    private int sitesFound = 0;
    /** starting site for the search */
    private String startSite;
    /** flag used to stop search */
    private boolean stopSearch = false;

    /**
     * Creates a new instance of Spider
     * @param atree JTree used to display the search space
     * @param amessagearea JTextArea used to display error/warning messages
     * @param astatlabel JLabel to display number of searched sites and hits
     * @param astartsite web site to use to start the search
     * @param akeywordlist list of keywords to search for
     * @param aipdomainlist list of top level domains
     * @param asitelimit maximum number of web pages to look at
     * @param adepthlimit maximum number of levels down to search (controls recursion)
     */
    public Spider(JTree atree, JTextArea amessagearea, JLabel astatlabel, String astartsite,
                  String[] akeywordlist, String[] aipdomainlist, int asitelimit, int adepthlimit) {
        searchTree = atree;         // place to display search tree
        messageArea = amessagearea; // place to display error messages
        statsLabel = astatlabel;    // place to put run statistics
        startSite = fixHref(astartsite);
        keywordList = new String[akeywordlist.length];
        for (int i = 0; i < akeywordlist.length; i++)
            keywordList[i] = akeywordlist[i].toUpperCase();   // use all upper case for matching
        ipDomainList = new String[aipdomainlist.length];
        for (int i = 0; i < aipdomainlist.length; i++)
            ipDomainList[i] = aipdomainlist[i].toUpperCase(); // use all upper case for matching
        siteLimit = asitelimit;   // max number of sites to look at
        depthLimit = adepthlimit; // max depth of recursion to use
        DefaultMutableTreeNode root = new DefaultMutableTreeNode(new UrlTreeNode("Root"));
        DefaultTreeModel treeModel = new DefaultTreeModel(root); // create a tree model with a root
        searchTree.setModel(treeModel);
        searchTree.setCellRenderer(new UrlNodeRenderer()); // use a custom cell renderer
    }

    /**
     * run the search in its own thread
     */
    public void run() {
        DefaultTreeModel treeModel = (DefaultTreeModel) searchTree.getModel(); // get our model
        DefaultMutableTreeNode root = (DefaultMutableTreeNode) treeModel.getRoot();
        String urllc = startSite.toLowerCase();
        if (!urllc.startsWith("http://") && !urllc.startsWith("ftp://") && !urllc.startsWith("www.")) {
            startSite = "file:///" + startSite; // note you must have 3 slashes !
        } else if (urllc.startsWith("www.")) {  // http missing ?
            startSite = "http://" + startSite;  // tack on http://
        }
        startSite = startSite.replace('\\', '/'); // fix bad slashes
        sitesFound = 0;
        sitesSearched = 0;
        updateStats();
        searchWeb(root, startSite); // search the web
        messageArea.append("Done!\n\n");
    }

    /**
     * search the url search tree to see if we've already visited the specified url
     * @param urlstring url to search for
     * @return true if the url is already in the tree
     */
    public boolean urlHasBeenVisited(String urlstring) {
        String teststring = fixHref(urlstring);
        DefaultTreeModel treeModel = (DefaultTreeModel) searchTree.getModel(); // get our model
        DefaultMutableTreeNode root = (DefaultMutableTreeNode) treeModel.getRoot();
        Enumeration etree = root.breadthFirstEnumeration();
        while (etree.hasMoreElements()) {
            Object obj = ((DefaultMutableTreeNode) etree.nextElement()).getUserObject();
            if (obj instanceof UrlTreeNode && ((UrlTreeNode) obj).equals(teststring))
                return true;
        }
        return false;
    }

    /**
     * Check depth of search
     * @param node search tree node to test the depth limit of
     * @return true if depth limit exceeded
     */
    public boolean depthLimitExceeded(DefaultMutableTreeNode node) {
        return node.getLevel() >= depthLimit;
    }

    /**
     * add a node to the search tree
     * @param parentnode parent to add the new node under
     * @param newnode node to be added to the tree
     */
    private DefaultMutableTreeNode addNode(DefaultMutableTreeNode parentnode, UrlTreeNode newnode) {
        DefaultMutableTreeNode node = new DefaultMutableTreeNode(newnode);
        DefaultTreeModel treeModel = (DefaultTreeModel) searchTree.getModel(); // get our model
        int index = treeModel.getChildCount(parentnode);   // how many children are there already?
        treeModel.insertNodeInto(node, parentnode, index); // add as last child
        TreePath tp = new TreePath(node.getPath());
        searchTree.scrollPathToVisible(tp); // make sure the user can see the node just added
        return node;
    }

    /**
     * determines if the given url is in one of the top level domains in the domain search list
     * @param url url to be checked
     * @return true if it's ok, else false if url should be skipped
     */
    private boolean isDomainOk(URL url) {
        if (url.getProtocol().equals("file"))
            return true; // file protocol always ok
        String host = url.getHost();
        int lastdot = host.lastIndexOf(".");
        if (lastdot <= 0)
            return true;
        String domain = host.substring(lastdot); // just the .com or .edu part
        if (ipDomainList.length == 0)
            return true;
        for (int i = 0; i < ipDomainList.length; i++) {
            // "<any>" stands in for the wildcard entry; the original literal was lost from the handout
            if (ipDomainList[i].equalsIgnoreCase("<any>"))
                return true;
            if (ipDomainList[i].equalsIgnoreCase(domain))
                return true;
        }
        return false;
    }

    /**
     * update statistics label
     */
    private void updateStats() {
        statsLabel.setText("Sites searched : " + sitesSearched + "  Sites found : " + sitesFound);
    }

    /**
     * repairs a sloppy href, flips backwards /, adds missing /
     * @param href web site reference
     * @return repaired web page reference
     */
    public static String fixHref(String href) {
        String newhref = href.replace('\\', '/'); // fix sloppy web references
        int lastdot = newhref.lastIndexOf('.');
        int lastslash = newhref.lastIndexOf('/');
        if (lastslash > lastdot) {
            if (newhref.charAt(newhref.length() - 1) != '/')
                newhref = newhref + "/"; // add on missing /
        }
        return newhref;
    }

    /**
     * recursive routine to search the web
     * @param parentnode parent node in the search tree
     * @param urlstr web page address to search
     */
    public void searchWeb(DefaultMutableTreeNode parentnode, String urlstr) {
        if (urlHasBeenVisited(urlstr)) // have we been here?
            return;                    // yes, just return
        if (depthLimitExceeded(parentnode))
            return;
        if (sitesSearched > siteLimit)
            return;
        yield(); // allow the main program to run
        if (stopSearch)
            return;
        messageArea.append("Searching :" + urlstr + " \n");
        sitesSearched++;
        updateStats();
        //
        // now look in the file
        //
        try {
            URL url = new URL(urlstr);           // create the url object from a string
            String protocol = url.getProtocol(); // ask the url for its protocol
            if (!protocol.equalsIgnoreCase("http") && !protocol.equalsIgnoreCase("file")) {
                messageArea.append("    Skipping : " + urlstr + " not a http site\n\n");
                return;
            }
            String path = url.getPath();         // ask the url for its path
            int lastdot = path.lastIndexOf("."); // check for file extension
            if (lastdot > 0) {
                String extension = path.substring(lastdot); // just the file extension
                if (!extension.equalsIgnoreCase(".html") && !extension.equalsIgnoreCase(".htm"))
                    return; // skip everything but html files
            }
            if (!isDomainOk(url)) {
                messageArea.append("    Skipping : " + urlstr + " not in domain list\n\n");
                return;
            }
            UrlTreeNode newnode = new UrlTreeNode(url); // create the node
            InputStream in = url.openStream();          // ask the url object to create an input stream
            InputStreamReader isr = new InputStreamReader(in); // convert the stream to a reader
            DefaultMutableTreeNode treenode = addNode(parentnode, newnode);
            SpiderParserCallback cb = new SpiderParserCallback(treenode); // create a callback object
            ParserDelegator pd = new ParserDelegator(); // create the delegator
            pd.parse(isr, cb, true); // parse the stream
            isr.close();             // close the stream
        } catch (MalformedURLException ex) {
            messageArea.append("    Bad URL encountered : " + urlstr + "\n\n");
        } catch (IOException e) {
            messageArea.append("    IOException, could not access site : " + e.getMessage() + "\n\n");
        }
        yield();
    }

    /**
     * Stops the search.
     */
    public void stopSearch() {
        stopSearch = true;
    }

    /**
     * Inner class used to handle html parser callbacks
     */
    public class SpiderParserCallback extends HTMLEditorKit.ParserCallback {
        /** url node being parsed */
        private UrlTreeNode node;
        /** tree node */
        private DefaultMutableTreeNode treenode;
        /** contents of last text element */
        private String lastText = "";

        /**
         * Creates a new instance of SpiderParserCallback
         * @param atreenode search tree node that is being parsed
         */
        public SpiderParserCallback(DefaultMutableTreeNode atreenode) {
            treenode = atreenode;
            node = (UrlTreeNode) treenode.getUserObject();
        }

        /**
         * handle HTML tags that don't have a start and end tag
         * @param t HTML tag
         * @param a HTML attributes
         * @param pos position within file
         */
        public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            if (t.equals(HTML.Tag.IMG)) {
                node.addImages(1); // count images on the page
                return;
            }
            if (t.equals(HTML.Tag.BASE)) {
                Object value = a.getAttribute(HTML.Attribute.HREF);
                if (value != null)
                    node.setBase(fixHref(value.toString()));
            }
        }

        /**
         * take care of start tags: capture titles and follow anchors
         * @param t HTML tag
         * @param a HTML attributes
         * @param pos position within file
         */
        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            if (t.equals(HTML.Tag.TITLE)) {
                lastText = "";
                return;
            }
            if (t.equals(HTML.Tag.A)) {
                Object value = a.getAttribute(HTML.Attribute.HREF);
                if (value != null) {
                    node.addLinks(1); // count links on the page
                    String href = value.toString();
                    href = fixHref(href);
                    try {
                        URL referencedURL = new URL(node.getBase(), href);
                        searchWeb(treenode,
                                  referencedURL.getProtocol() + "://" + referencedURL.getHost() + referencedURL.getPath());
                    } catch (MalformedURLException e) {
                        messageArea.append("    Bad URL encountered : " + href + "\n\n");
                    }
                }
            }
        }

        /**
         * take care of end tags: store the page title when the title element closes
         * @param t HTML tag
         * @param pos position within file
         */
        public void handleEndTag(HTML.Tag t, int pos) {
            if (t.equals(HTML.Tag.TITLE) && lastText != null) {
                node.setTitle(lastText.trim());
                DefaultTreeModel tm = (DefaultTreeModel) searchTree.getModel();
                tm.nodeChanged(treenode);
            }
        }

        /**
         * take care of text between tags, check against keyword list for matches;
         * if a match is found, set the node match status to true
         * @param data text between tags
         * @param pos position of text within web page
         */
        public void handleText(char[] data, int pos) {
            lastText = new String(data);
            node.addChars(lastText.length());
            String text = lastText.toUpperCase();
            for (int i = 0; i < keywordList.length; i++) {
                if (text.indexOf(keywordList[i]) >= 0) {
                    if (!node.isMatch()) {
                        sitesFound++;
                        updateStats();
                    }
                    node.setMatch(keywordList[i]);
                    return;
                }
            }
        }
    }
}
import java.awt.*;
import java.awt.event.*;
import java.net.URL;
import java.util.StringTokenizer;
import javax.swing.*;
import javax.swing.border.*;
import javax.swing.event.*;
import javax.swing.tree.*;

/**
 * User interface to conduct web searches with the Spider object
 * @author Mark Pendergast
 */
public class SpiderControl extends JFrame implements VerifierListener {

    /** Creates new form SpiderControl */
    public SpiderControl() {
        initComponents();
        setSize(650, 600);
        setTitle("Web Spider Demo");
        //
        // center the frame on the screen
        //
        Dimension oursize = getSize();
        Dimension screensize = Toolkit.getDefaultToolkit().getScreenSize();
        int x = (screensize.width - oursize.width) / 2;
        int y = (screensize.height - oursize.height) / 2;
        x = Math.max(0, x); // keep the corner on the screen
        y = Math.max(0, y);
        setLocation(x, y);
        DefaultMutableTreeNode root = new DefaultMutableTreeNode("Empty");
        DefaultTreeModel treeModel = new DefaultTreeModel(root); // create a tree model with a root
        searchTree.setModel(treeModel);
        // the icon filename was lost from the handout; any small image resource works.
        // capitalization counts on the filename
        URL iconurl = getClass().getResource("spider.gif");
        if (iconurl != null) {
            ImageIcon ic = new ImageIcon(iconurl);
            setIconImage(ic.getImage()); // tell the frame to use it as its icon
        }
    }

    /**
     * Called from within the constructor to initialize the form
     * (originally generated by the NetBeans Form Editor).
     */
    private void initComponents() {
        toolBar = new JToolBar();
        startButton = new JButton();
        stopButton = new JButton();
        clearMessageButton = new JButton();
        viewButton = new JButton();
        exitButton = new JButton();
        centerPane = new JTabbedPane();
        formTab = new JPanel();
        siteLabel = new JLabel();
        siteField = new JTextField();
        depthLabel = new JLabel();
        depthField = new JTextField();
        keywordLabel = new JLabel();
        keywordPane = new JScrollPane();
        keywordArea = new JTextArea();
        domainLabel = new JLabel();
        domainPane = new JScrollPane();
        domainList = new JList();
        startingLabel = new JLabel();
        errorLabel = new JLabel();
        jTextArea1 = new JTextArea();
        startSiteField = new JTextField();
        treeTab = new JPanel();
        searchTreePane = new JScrollPane();
        searchTree = new JTree();
        pageStatistics = new JTextArea();
        messageTab = new JScrollPane();
        messageArea = new JTextArea();
        statusLabel = new JLabel();

        setBackground(new Color(153, 153, 255));
        addWindowListener(new WindowAdapter() {
            public void windowClosing(WindowEvent evt) { exitForm(evt); }
        });
        getContentPane().setLayout(new BorderLayout());

        toolBar.setBackground(new Color(204, 204, 204));

        startButton.setFont(new Font("Arial", Font.BOLD, 11));
        startButton.setText("Start Search");
        startButton.setToolTipText("Start the search");
        startButton.addActionListener(new ActionListener() {
            public void actionPerformed(ActionEvent evt) { startButtonActionPerformed(evt); }
        });
        toolBar.add(startButton);

        stopButton.setFont(new Font("Arial", Font.BOLD, 11));
        stopButton.setText("Stop Search");
        stopButton.setToolTipText("Stop the search that is in progress");
        stopButton.addActionListener(new ActionListener() {
            public void actionPerformed(ActionEvent evt) { stopButtonActionPerformed(evt); }
        });
        toolBar.add(stopButton);

        clearMessageButton.setFont(new Font("Arial", Font.BOLD, 11));
        clearMessageButton.setText("Clear message area");
        clearMessageButton.addActionListener(new ActionListener() {
            public void actionPerformed(ActionEvent evt) { clearMessageButtonActionPerformed(evt); }
        });
        toolBar.add(clearMessageButton);

        viewButton.setFont(new Font("Arial", Font.BOLD, 11));
        viewButton.setText("View Selected Web Page");
        viewButton.addActionListener(new ActionListener() {
            public void actionPerformed(ActionEvent evt) { viewButtonActionPerformed(evt); }
        });
        toolBar.add(viewButton);

        exitButton.setFont(new Font("Arial", Font.BOLD, 11));
        exitButton.setText("Exit");
        exitButton.addActionListener(new ActionListener() {
            public void actionPerformed(ActionEvent evt) { exitButtonActionPerformed(evt); }
        });
        toolBar.add(exitButton);

        getContentPane().add(toolBar, BorderLayout.NORTH);

        formTab.setBorder(new EtchedBorder());
        formTab.setLayout(null); // absolute positioning, as in the original form
        formTab.setBackground(new Color(204, 204, 204));

        siteLabel.setText("Maximum number of sites to visit : ");
        formTab.add(siteLabel);
        siteLabel.setBounds(20, 30, 250, 15);

        siteField.setColumns(8);
        siteField.setText("100");
        siteField.setInputVerifier(new IntegerVerifier(this, false, 1, 10000));
        formTab.add(siteField);
        siteField.setBounds(260, 30, 70, 21);

        depthLabel.setText("Maximum search depth : ");
        formTab.add(depthLabel);
        depthLabel.setBounds(20, 80, 230, 15);

        depthField.setColumns(8);
        depthField.setText("10");
        depthField.setInputVerifier(new IntegerVerifier(this, false, 1, 10000));
        formTab.add(depthField);
        depthField.setBounds(260, 80, 70, 21);

        keywordLabel.setText("Keywords or phrases (one to a line) :");
        formTab.add(keywordLabel);
        keywordLabel.setBounds(20, 110, 220, 15);

        keywordArea.setColumns(20);
        keywordPane.setViewportView(keywordArea);
        formTab.add(keywordPane);
        keywordPane.setBounds(260, 110, 170, 140);

        domainLabel.setText("Domains to search : ");
        formTab.add(domainLabel);
        domainLabel.setBounds(30, 270, 120, 15);

        domainList.setModel(new AbstractListModel() {
            // the leading entries of this list were lost from the handout; only
            // ".org", ".us", ".ca" survive. "<any>" is an assumed wildcard entry
            // matching the wildcard check in Spider.isDomainOk().
            String[] strings = { "<any>", ".org", ".us", ".ca" };
            public int getSize() { return strings.length; }
            public Object getElementAt(int i) { return strings[i]; }
        });
        domainList.setSelectedIndex(0);
        domainPane.setViewportView(domainList);
        formTab.add(domainPane);
        domainPane.setBounds(260, 270, 170, 60);

        startingLabel.setText("Portal (starting site): ");
        formTab.add(startingLabel);
        startingLabel.setBounds(30, 340, 120, 15);

        errorLabel.setForeground(new Color(255, 51, 51));
        errorLabel.setHorizontalAlignment(SwingConstants.CENTER);
        errorLabel.setText(" ");
        formTab.add(errorLabel);
        errorLabel.setBounds(30, 395, 400, 20);

        formTab.add(jTextArea1);
        jTextArea1.setBounds(240, 360, 0, 17);

        startSiteField.setColumns(80);
        formTab.add(startSiteField);
        startSiteField.setBounds(150, 340, 320, 21);

        centerPane.addTab("Search Parameters", formTab);

        treeTab.setLayout(new BorderLayout());
        treeTab.setBackground(new Color(204, 204, 204));
        treeTab.setBorder(new EtchedBorder());
        searchTree.addTreeSelectionListener(new TreeSelectionListener() {
            public void valueChanged(TreeSelectionEvent evt) { searchTreeSelectionChange(evt); }
        });
        searchTreePane.setViewportView(searchTree);
        treeTab.add(searchTreePane, BorderLayout.CENTER);

        pageStatistics.setBackground(new Color(204, 204, 204));
        pageStatistics.setColumns(80);
        pageStatistics.setRows(2);
        pageStatistics.setText("Select item to display its statistics");
        pageStatistics.setBorder(new TitledBorder("Page statistics:"));
        treeTab.add(pageStatistics, BorderLayout.SOUTH);

        centerPane.addTab("Search Tree", treeTab);

        messageArea.setBackground(new Color(204, 204, 204));
        messageArea.setColumns(100);
        messageArea.setRows(5);
        messageTab.setBorder(new EtchedBorder());
        messageTab.setVerticalScrollBarPolicy(ScrollPaneConstants.VERTICAL_SCROLLBAR_ALWAYS);
        messageTab.setViewportView(messageArea);
        centerPane.addTab("Messages", messageTab);

        getContentPane().add(centerPane, BorderLayout.CENTER);

        statusLabel.setText("Inactive");
        getContentPane().add(statusLabel, BorderLayout.SOUTH);
        pack();
    }

    /** show statistics for the tree node the user selected */
    private void searchTreeSelectionChange(TreeSelectionEvent evt) {
        TreePath path = searchTree.getSelectionPath();
        if (path == null)
            return;
        DefaultMutableTreeNode node = (DefaultMutableTreeNode) path.getLastPathComponent();
        Object data = node.getUserObject();
        if (data instanceof UrlTreeNode) {
            String kstr = ((UrlTreeNode) data).getKeywords();
            pageStatistics.setText("Keywords found : " + kstr + "\n");
            pageStatistics.append(((UrlTreeNode) data).getStats());
        } else {
            pageStatistics.setText("");
        }
    }

    /** open the selected page in a browser */
    private void viewButtonActionPerformed(ActionEvent evt) {
        try {
            TreePath path = searchTree.getSelectionPath();
            if (path == null)
                return;
            DefaultMutableTreeNode node = (DefaultMutableTreeNode) path.getLastPathComponent();
            Object data = node.getUserObject();
            if (data instanceof UrlTreeNode) {
                String urlstr = ((UrlTreeNode) data).getUrl().toString();
                // hard-coded browser path; parts of the original path text were lost from the handout
                Runtime.getRuntime().exec("C:\\program files\\Internet Explorer\\iexplore.exe " + urlstr);
            }
        } catch (Exception e) {
            JOptionPane.showMessageDialog(this, "Could not launch Internet Explorer",
                                          "Error", JOptionPane.ERROR_MESSAGE);
        }
    }

    private void clearMessageButtonActionPerformed(ActionEvent evt) {
        messageArea.setText("");
    }

    private void stopButtonActionPerformed(ActionEvent evt) {
        if (spidey != null)
            spidey.stopSearch();
    }

    /** collect the form inputs, then create and start the spider thread */
    private void startButtonActionPerformed(ActionEvent evt) {
        int sitelimit = 100, depthlimit = 100;
        try {
            sitelimit = Integer.parseInt(siteField.getText().trim());
            depthlimit = Integer.parseInt(depthField.getText().trim());
        } catch (NumberFormatException e) {
            errorLabel.setText("Invalid input for site limit or depth limit");
            return;
        }
        //
        // retrieve domains from the list
        //
        Object[] selected = domainList.getSelectedValues();
        String[] domains = new String[selected.length];
        for (int i = 0; i < selected.length; i++)
            domains[i] = selected[i].toString();
        //
        // retrieve search strings
        //
        StringTokenizer keywordtokens = new StringTokenizer(keywordArea.getText(), "\n");
        String[] keywords = new String[keywordtokens.countTokens()];
        int i = 0;
        while (keywordtokens.hasMoreTokens())
            keywords[i++] = keywordtokens.nextToken();
        //
        // retrieve start site
        //
        String startsite = startSiteField.getText().trim();
        if (startsite.length() <= 0) {
            errorLabel.setText("Starting web site cannot be blank!");
            return;
        }
        //
        // create and start the spider
        //
        messageArea.setText("");
        pageStatistics.setText("Click on site to view its statistics");
        centerPane.setSelectedIndex(1); // show the search tree tab
        spidey = new Spider(searchTree, messageArea, statusLabel, startsite, keywords,
                            domains, sitelimit, depthlimit);
        spidey.start();
    }

    private void exitButtonActionPerformed(ActionEvent evt) {
        System.exit(0);
    }

    /** Exit the Application */
    private void exitForm(WindowEvent evt) {
        System.exit(0);
    }

    /**
     * Main method.
     * @param args the command line arguments (not used)
     */
    public static void main(String[] args) {
        new SpiderControl().show();
    }

    /**
     * Verifier listener routine used to report bad data
     * @param message error message
     * @param jComponent component that caused the error
     */
    public void invalidData(String message, JComponent jComponent) {
        errorLabel.setText(message);
        jComponent.setForeground(Color.red);
        startButton.setEnabled(false); // turn off the start button
        getToolkit().beep();
    }

    /**
     * Verifier listener routine used to report good data
     * @param jComponent component that has tested ok
     */
    public void validData(JComponent jComponent) {
        errorLabel.setText(" ");
        startButton.setEnabled(true); // turn on the start button
    }

    // GUI components
    private JTabbedPane centerPane;
    private JButton clearMessageButton;
    private JTextField depthField;
    private JLabel depthLabel;
    private JLabel domainLabel;
    private JList domainList;
    private JScrollPane domainPane;
    private JLabel errorLabel;
    private JButton exitButton;
    private JPanel formTab;
    private JTextArea jTextArea1;
    private JTextArea keywordArea;
    private JLabel keywordLabel;
    private JScrollPane keywordPane;
    private JTextArea messageArea;
    private JScrollPane messageTab;
    private JTextArea pageStatistics;
    private JTree searchTree;
    private JScrollPane searchTreePane;
    private JTextField siteField;
    private JLabel siteLabel;
    private JButton startButton;
    private JTextField startSiteField;
    private JLabel startingLabel;
    private JLabel statusLabel;
    private JButton stopButton;
    private JToolBar toolBar;
    private JPanel treeTab;
    private JButton viewButton;

    /** spider object */
    Spider spidey = null;
}
import java.awt.*;
import java.net.URL;
import javax.swing.*;
import javax.swing.tree.*;

/**
 * Custom tree node renderer. If the url has a match, it will be drawn in blue;
 * simple icons are used for all nodes.
 *
 * @author Mark Pendergast
 */
public class UrlNodeRenderer extends DefaultTreeCellRenderer {
    /** icon used to display on the search tree */
    public static Icon icon = null;

    /** Creates a new instance of UrlNodeRenderer */
    public UrlNodeRenderer() {
        // the icon filename was lost from the handout; any small image resource works
        URL iconurl = getClass().getResource("node.gif");
        if (iconurl != null)
            icon = new ImageIcon(iconurl);
    }

    /**
     * Sets the value of the current tree cell to value. If selected is true,
     * the cell is drawn as if selected; expanded, leaf, row, and hasFocus
     * describe the node's display state; tree is the JTree being configured.
     *
     * Modified to match a specific node's attributes: if the node is a match,
     * blue is used for the text color, and a custom icon is installed.
     *
     * @param tree the JTree being redrawn
     * @param value node in the tree
     * @param sel true if selected by the user
     * @param expanded true if path is expanded
     * @param leaf true if this node is a leaf
     * @param row row number (vertical position)
     * @param hasFocus true if it has the focus
     * @return the Component that the renderer uses to draw the value
     */
    public Component getTreeCellRendererComponent(JTree tree, Object value, boolean sel,
            boolean expanded, boolean leaf, int row, boolean hasFocus) {
        super.getTreeCellRendererComponent(tree, value, sel, expanded, leaf, row, hasFocus);
        Object obj = ((DefaultMutableTreeNode) value).getUserObject();
        if (obj instanceof UrlTreeNode && ((UrlTreeNode) obj).isMatch()) // set color
            setForeground(Color.blue);
        else
            setForeground(Color.black);
        if (icon != null) { // set a custom icon
            setOpenIcon(icon);
            setClosedIcon(icon);
            setLeafIcon(icon);
        }
        return this;
    }
}
import java.net.*;
import java.util.*;

/**
 * Class used to hold information about a web site that has
 * been searched by the spider class
 *
 * @author Mark Pendergast
 */
public class UrlTreeNode {
    /** URL */
    private URL url;
    /** base web address for this page */
    private URL base;
    /** is match, set to true if node matched search criteria */
    private boolean isMatch;
    /** set if the node is plain text and not a url */
    private boolean isText = false;
    /** list of keywords matched by this node */
    private Vector keywords = new Vector(3, 2);
    /** title of web page */
    private String title = "";
    /** number of text characters on page */
    private int nChars = 0;
    /** number of images on page */
    private int nImages = 0;
    /** number of links on page */
    private int nLinks = 0;

    /** Creates a new instance of UrlTreeNode
     * @param aurl url of the web page
     */
    public UrlTreeNode(URL aurl) {
        url = aurl;
        isMatch = false;
        base = url; // initialize default value of base
        isText = false;
    }

    /** Creates a new instance of UrlTreeNode
     * @param atext text for the node
     */
    public UrlTreeNode(String atext) {
        isMatch = false;
        base = null;
        isText = true;
        title = atext;
    }

    /**
     * return url string for display on screen
     * @return String representation of the object
     */
    public String toString() {
        if (isText)
            return title;
        else
            return title + " - " + url.getProtocol() + "://" + url.getHost() + url.getPath();
    }

    /**
     * get the keywords found in this node
     * @return all keywords in this node as a single comma separated string
     */
    public String getKeywords() {
        String s = "";
        if (keywords.size() > 0) {
            s += (String) keywords.elementAt(0);
            for (int i = 1; i < keywords.size(); i++)
                s += ", " + (String) keywords.elementAt(i);
        }
        return s;
    }

    /**
     * return state of node
     * @return true if node matched search criteria
     */
    public boolean isMatch() {
        return isMatch;
    }

    /**
     * returns the url object for this node or null if it is a text node
     * @return url of the node or null
     */
    public URL getUrl() {
        if (isText)
            return null;
        else
            return url;
    }

    /**
     * records that this node contains a match for the spider's search criteria
     * @param keyword keyword found in web site
     */
    public void setMatch(String keyword) {
        isMatch = true;
        if (!keywords.contains(keyword))
            keywords.addElement(keyword);
    }

    /**
     * sets the base location for the node, called in response to finding a base tag in the web page
     * @param abase base url to use for relative addressing
     */
    public void setBase(String abase) {
        try {
            base = new URL(abase);
        } catch (MalformedURLException e) {
            // keep the previous base if the new one is malformed
        }
    }

    /**
     * returns base url
     * @return base url
     */
    public URL getBase() {
        return base;
    }

    /**
     * sets the title attribute of the node
     * @param atitle title of the web page, taken from its title element
     */
    public void setTitle(String atitle) {
        title = atitle;
    }

    /**
     * test for equality
     * @param urlstr string containing url to compare
     * @return true if it is the same page
     */
    public boolean equals(String urlstr) {
        if (isText)
            return title.equals(urlstr);
        else
            return urlstr.equals(url.getProtocol() + "://" + url.getHost() + url.getPath());
    }

    /**
     * Increments character count
     * @param n number of characters to add
     */
    public void addChars(int n) {
        nChars += n;
    }

    // ------------------------------------------------------------------
    // The handout text is cut off after addChars(). The three members below
    // are reconstructed from their call sites in Spider and SpiderControl;
    // the exact display string in getStats() is an assumption.
    // ------------------------------------------------------------------

    /** Increments image count */
    public void addImages(int n) {
        nImages += n;
    }

    /** Increments link count */
    public void addLinks(int n) {
        nLinks += n;
    }

    /** Returns a short statistics string for the page-statistics display */
    public String getStats() {
        return "Characters : " + nChars + "  Images : " + nImages + "  Links : " + nLinks;
    }
}