Basic Approach to Writing a Web Crawler in Java with Eclipse | 江阴雨辰互联

Published on June 29, 2023

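The basic approach has two main steps: first, use HttpClient to open a network connection and send the request; second, use HtmlParser to parse the returned page.

1. Using HttpClient to open the network connection (and fetch the full page content)

First, HttpClient must be installed. Download HttpClient and add it to the project.

HttpClient depends on logging, a subproject of Apache Jakarta Commons. Download Commons Logging, extract the jar from the downloaded archive, and add it to the CLASSPATH.

HttpClient also depends on codec, another subproject of Apache Jakarta Commons. Download the latest Commons Codec, extract the jar from the downloaded archive, and add it to the CLASSPATH.

In total, three jars need to be imported into the Eclipse project: HttpClient, Commons Logging, and Commons Codec.

Example code follows. Change the requested HTTP address to suit your situation:

    package SeanCrawler;

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.HttpException;
    import org.apache.commons.httpclient.HttpStatus;
    import org.apache.commons.httpclient.methods.GetMethod;

    public class HttpClientTest {
        public static void main(String[] argv) {
            // Connect to the HTTP server
            HttpClient httpClient = new HttpClient();
            byte[] responseBody = null;
            // The URL needs a scheme; replace it with the address you actually want to crawl
            GetMethod getMethod = new GetMethod("http://localhost/raw_dataset/");
            try {
                int statusCode = httpClient.executeMethod(getMethod);
                if (statusCode != HttpStatus.SC_OK) {
                    System.err.println("Method failed: " + getMethod.getStatusLine());
                }
                // Simple alternative that loads the whole body at once:
                // byte[] responseBody = getMethod.getResponseBody();
                // System.out.println(new String(responseBody));

                // Recommended when the page content is large: read the body as a stream
                InputStream in = getMethod.getResponseBodyAsStream();
                if (in != null) {
                    byte[] tmp = new byte[4096];
                    int bytesRead = 0;
                    ByteArrayOutputStream buffer = new ByteArrayOutputStream(1024);
                    while ((bytesRead = in.read(tmp)) != -1) {
                        buffer.write(tmp, 0, bytesRead);
                    }
                    responseBody = buffer.toByteArray();
                    System.out.println(new String(responseBody));
                }
            } catch (HttpException e) {
                System.err.println("Please check your provided http address!");
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            } catch (Exception ex) {
                System.err.println("Error:" + ex.toString());
            } finally {
                // Always release the connection, whether or not the request succeeded
                getMethod.releaseConnection();
            }
        }
    }

2. Using HtmlParser to parse the fetched page and extract the elements of interest

Two HtmlParser jars need to be downloaded and imported into the project.

Once HttpClient has fetched the full page content, add HtmlParser to parse the page and pull out the data you want. The fragment below belongs inside the try block of the example above, after responseBody has been filled:

    // Classes used: org.htmlparser.Parser, org.htmlparser.NodeFilter,
    // org.htmlparser.filters.HasAttributeFilter, org.htmlparser.util.NodeList,
    // org.htmlparser.tags.TableTag, TableRow, TableColumn

    // Parse the page with HtmlParser; the original site is encoded in gb2312
    Parser parser = Parser.createParser(new String(responseBody, "gb2312"), "gb2312");
    // Set up a filter: match nodes that have a class attribute with the value new_table
    NodeFilter filter1 = new HasAttributeFilter("class", "new_table");
    // Collect every DOM node of the page that passes the filter
    NodeList list = parser.extractAllNodesThatMatch(filter1);

    for (int i = 0; i < list.size(); i++) {
        // With this filter I know each match is a table element. In practice you must
        // define your own filter for the elements you want (this is the key step).
        TableTag table = (TableTag) list.elementAt(i);
        TableRow[] rows = table.getRows();
        // Walk the tr elements inside the table, and their td cells;
        // r starts at 1, which skips the first row
        for (int r = 1; r < rows.length; r++) {
            TableRow tr = rows[r];
            TableColumn[] td = tr.getColumns();
            // Cell values can be read with calls such as td[1].toPlainTextString()
        }
    }

Summary: that is the overall approach. Most of the variation lies in the filtering methods HtmlParser offers for different page elements, and becoming fluent with them takes time and practice. I am noting this down for reference; I am still learning as well.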
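The streamed read from step 1 and the explicit charset decoding from step 2 can be tried without any of the three jars. The following is a minimal sketch using only the JDK; the class name StreamReadSketch and the helpers readAll and decode are hypothetical names introduced for illustration, and the ByteArrayInputStream stands in for the stream that getResponseBodyAsStream() would return.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of HttpClient or HtmlParser.
public class StreamReadSketch {

    // Drain a stream in 4096-byte chunks into a growing buffer,
    // mirroring the read loop used in step 1.
    public static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream(1024);
        byte[] tmp = new byte[4096];
        int bytesRead;
        while ((bytesRead = in.read(tmp)) != -1) {
            buffer.write(tmp, 0, bytesRead);
        }
        return buffer.toByteArray();
    }

    // Decode the bytes with an explicitly named charset instead of
    // relying on the platform default.
    public static String decode(byte[] body, String charsetName) {
        return new String(body, Charset.forName(charsetName));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a fetched page body (assumption: UTF-8 for this demo).
        byte[] page = "<table class=\"new_table\"></table>"
                .getBytes(StandardCharsets.UTF_8);
        InputStream in = new ByteArrayInputStream(page);
        System.out.println(decode(readAll(in), "UTF-8"));
    }
}
```

For a site encoded in gb2312, as in the example above, the call would be decode(body, "gb2312"), provided the running JDK ships that charset. Passing the charset explicitly avoids the garbled text that new String(responseBody) can produce when the platform default differs from the page encoding.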

Published by admin. If you repost this article, please credit the source: http://www.yc00.com/news/1687984563a63786.html