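Python Web Crawler

Crawler: a program that automatically fetches information from the internet.
Value: it puts internet data to work for you.
How a crawler works
URL manager: tracks the set of URLs waiting to be crawled and the set already crawled, which prevents both duplicate fetches and crawl loops.
URL manager implementations:

Keep the pending and crawled URLs in in-memory Python set() collections.
Keep the pending and crawled URLs in a MySQL table, for example url(url, is_crawled).
Keep the pending and crawled URLs in a cache database.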
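
The set()-based option can be sketched as follows. This is a minimal illustration, not a reference implementation: the class name UrlManager and its method names are assumptions made for this example, and the code runs on both Python 2 and Python 3.

```python
class UrlManager(object):
    """Tracks pending and crawled URLs in two in-memory sets (hypothetical helper)."""

    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already crawled

    def add_new_url(self, url):
        """Queue a URL unless it is empty or already known."""
        if not url:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        """Queue several URLs at once; None or an empty list is ignored."""
        for url in urls or []:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        """Hand out one pending URL and mark it as crawled."""
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url


manager = UrlManager()
manager.add_new_urls(["http://example.com/a", "http://example.com/a", "http://example.com/b"])
first = manager.get_new_url()
manager.add_new_url(first)  # ignored: this URL has already been crawled
```

Because a URL moves from new_urls to old_urls the moment it is handed out, a page that links back to an earlier page cannot put that page into the queue again.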
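Web page downloader

A web page downloader is a tool that fetches the page behind a URL and saves it locally.

Python's web page downloader: urllib2 (part of the Python 2 standard library)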

urllib2 download method 1: the simplest approach

    import urllib2
    response = urllib2.urlopen('http://www.baidu.com')  # request the URL directly
    print response.getcode()  # status code; 200 means the request succeeded
    cont = response.read()  # read the page content

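urllib2 download method 2: adding data and an HTTP header

    import urllib
    import urllib2
    url = 'http://www.baidu.com'
    request = urllib2.Request(url)  # create a Request object
    request.add_data(urllib.urlencode({'a': '1'}))  # attach form data; this turns the request into a POST
    request.add_header('User-Agent', 'Mozilla/5.0')  # add an HTTP header
    response = urllib2.urlopen(request)  # send the request and get the response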

urllib2 download method 3: adding handlers for special scenarios

    import urllib2, cookielib
    cj = cookielib.CookieJar()  # create a cookie container
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))  # build an opener that handles cookies
    urllib2.install_opener(opener)  # install the opener globally for urllib2
    res = urllib2.urlopen("http://www.baidu.com")  # fetch the page with cookie support enabled
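
The three methods above only run on Python 2. In Python 3, urllib2 was merged into urllib.request and cookielib was renamed http.cookiejar. The sketch below shows roughly how methods 2 and 3 translate; the helper names build_request and build_cookie_opener are made up for this example, and nothing is sent over the network until opener.open is called.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_request(url, form=None, user_agent="Mozilla/5.0"):
    """Build (but do not send) a Request with optional form data and a User-Agent header."""
    data = urllib.parse.urlencode(form).encode("utf-8") if form else None
    request = urllib.request.Request(url, data=data)  # data present => POST
    request.add_header("User-Agent", user_agent)
    return request


def build_cookie_opener():
    """Return an opener that stores cookies in a CookieJar, together with the jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


request = build_request("http://www.baidu.com", form={"a": "1"})
opener, jar = build_cookie_opener()
# To actually fetch the page: response = opener.open(request, timeout=10)
```

Returning the opener instead of installing it globally keeps the cookie handling local to the code that asked for it, which is one possible design choice rather than a requirement.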

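Installing PyDev in Eclipse

In Eclipse, open Help > Install New Software.
In the Install window that opens, click Add to add a repository.
Eclipse then searches the repository and finds PyDev shortly afterwards.
Uncheck the option "Contact all update sites during install to find required software".
Then click Next through the remaining steps.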

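Web page parser

Types of web page parser: regular expressions, html.parser, the third-party Beautiful Soup library (the most powerful), and the third-party lxml library. Apart from regular expressions, all of these parse the page structurally.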
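
To make "structural parsing" concrete, here is a small sketch using only the standard-library parser. It assumes Python 3, where the module is html.parser (under Python 2 the module is named HTMLParser); the LinkCollector class is a name invented for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


collector = LinkCollector()
collector.feed('<p><a href="/view/1.htm">one</a> <a name="x">no href</a> <a href="/view/2.htm">two</a></p>')
print(collector.links)  # prints ['/view/1.htm', '/view/2.htm']
```

The parser reacts to tags rather than to raw character patterns, which is the difference between structural parsing and a regular expression run over the page text.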

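Installing Beautiful Soup
Beautiful Soup is a third-party Python library for extracting data from HTML and XML.
Documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

From cmd, change to the Scripts folder under the Python installation directory and run:
```
pip install beautifulsoup4
```
Once it is installed, run the following in Eclipse:
```
#coding:utf-8
import bs4
print bs4
```
If this runs without an error, the installation succeeded.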

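Beautiful Soup syntax
Create a BeautifulSoup object from the HTML content, then search it with one of two methods: find_all (returns every node that matches) and find (returns the first node that matches). Both methods take the same parameters.
Once you have a node, you can read its name, attributes, and text.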

Creating a BeautifulSoup object

    from bs4 import BeautifulSoup

    # Create a BeautifulSoup object from the page content
    soup = BeautifulSoup(
        html_doc,              # the HTML document string
        'html.parser',         # the HTML parser to use
        from_encoding='utf8'   # the encoding of the HTML document
    )

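Searching for nodes

find_all(name, attrs, string)

The three arguments filter by tag name, by attributes, and by text content respectively.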

Accessing node information

    node.name        # the node's tag name
    node['href']     # the node's href attribute
    node.get_text()  # the node's text

Beautiful Soup example

    #coding:utf-8
    from bs4 import BeautifulSoup
    import re

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    soup = BeautifulSoup(
        html_doc,              # the HTML document string
        'html.parser',         # the HTML parser to use
        from_encoding='utf8'   # the encoding of the HTML document
    )

    print 'All links'
    links = soup.find_all('a')
    for link in links:
        print link.name, link['href'], link.get_text()

    print 'The link to lacie'
    link_node = soup.find('a', href='http://example.com/lacie')
    print link_node.name, link_node['href'], link_node.get_text()

    print 'Regular expression match'
    link1 = soup.find('a', href=re.compile(r"ill"))
    print link1.name, link1['href'], link1.get_text()

    print 'Text of the title paragraph'
    p_node = soup.find('p', class_="title")
    print p_node.name, p_node.get_text()

Example crawler

    #coding:utf-8
    import urllib2
    import cookielib

    url = "http://www.baidu.com"

    print "Method 1"
    response1 = urllib2.urlopen(url)
    print "Status code (200 means the request succeeded)"
    print response1.getcode()
    print "Length of the page content"
    print len(response1.read())

    print "Method 2"
    request = urllib2.Request(url)
    request.add_header("user-agent", "Mozilla/5.0")
    response2 = urllib2.urlopen(request)
    print response2.getcode()
    print len(response2.read())

    print "Method 3"
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    urllib2.install_opener(opener)
    response3 = urllib2.urlopen(url)
    print response3.getcode()
    print cj
    print "Page content"
    print response3.read()

Target: the Baidu Baike Python entry and the entry pages it links to

Start page: http://baike.baidu.com/view/21087.html

URL format:

Entry page URL: /view/125370.htm
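
One way to apply this URL format is to filter the links found on a page with a regular expression and turn the matches into absolute URLs. The sketch below assumes Python 3 (in Python 2, urljoin lives in the urlparse module). The function name to_entry_urls and the strict pattern are assumptions for this example, and the pattern only accepts the /view/<digits>.htm form given above.

```python
import re
from urllib.parse import urljoin

# Matches relative entry-page links such as /view/125370.htm
ENTRY_URL_PATTERN = re.compile(r"^/view/\d+\.htm$")


def to_entry_urls(page_url, hrefs):
    """Keep the hrefs that look like entry pages and make them absolute."""
    return [urljoin(page_url, href) for href in hrefs if ENTRY_URL_PATTERN.match(href)]


urls = to_entry_urls(
    "http://baike.baidu.com/view/21087.html",
    ["/view/125370.htm", "/help/about", "/view/abc.htm"],
)
print(urls)  # prints ['http://baike.baidu.com/view/125370.htm']
```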
Data format:
Title:

<dd class="lemmaWgt-lemmaTitle-title"><h1>***</h1></dd>

Summary:

<div class="lemma-summary" label-module="lemmaSummary">***</div>

Page encoding: UTF-8
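
The title and summary markup above can be read out with any structural parser. Beautiful Soup's find would be the natural fit in these notes; the sketch below swaps in the Python 3 standard-library html.parser instead so that it runs without extra packages. The EntryParser class and the sample HTML, including the nested div with class "para", are invented for illustration.

```python
from html.parser import HTMLParser


class EntryParser(HTMLParser):
    """Collects the <h1> text inside the title <dd> and the text of the summary <div>."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.summary_parts = []
        self._in_title_dd = False
        self._in_h1 = False
        self._summary_depth = 0  # greater than 0 while inside the summary <div>

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "dd" and "lemmaWgt-lemmaTitle-title" in classes:
            self._in_title_dd = True
        elif tag == "h1" and self._in_title_dd:
            self._in_h1 = True
        elif tag == "div":
            if self._summary_depth:
                self._summary_depth += 1  # a nested div inside the summary
            elif "lemma-summary" in classes:
                self._summary_depth = 1

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False
        elif tag == "dd":
            self._in_title_dd = False
        elif tag == "div" and self._summary_depth:
            self._summary_depth -= 1

    def handle_data(self, data):
        if self._in_h1:
            self.title_parts.append(data)
        elif self._summary_depth:
            self.summary_parts.append(data)


# Made-up sample page that follows the data format described above.
sample = (
    '<dd class="lemmaWgt-lemmaTitle-title"><h1>Python</h1></dd>'
    '<div class="lemma-summary" label-module="lemmaSummary">'
    '<div class="para">Python is a programming language.</div></div>'
    '<div>unrelated</div>'
)
parser = EntryParser()
parser.feed(sample)
title = "".join(parser.title_parts).strip()
summary = "".join(parser.summary_parts).strip()
```

Counting div depth is what lets the parser stop collecting summary text at the matching closing tag rather than at the first closing div it meets.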

Example code: crawl data from 1000 pages related to the Baidu Baike Python entry

Code download link: GitHub
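
The overall crawl can be pictured as a loop that ties the URL manager, the downloader, and the parser together. The following is only a sketch under stated assumptions, not the code from the download link: the crawl function, its download and parse parameters, and the fake site are all invented here so that the loop can run offline on Python 2 or 3.

```python
def crawl(root_url, download, parse, max_pages=1000):
    """Fetch pages until max_pages have been collected or no URLs remain.

    download(url) returns the page HTML, or None if the page is unavailable.
    parse(url, html) returns (list_of_new_urls, data_dict).
    """
    pending = {root_url}  # URLs waiting to be crawled
    crawled = set()       # URLs already crawled
    results = []
    while pending and len(results) < max_pages:
        url = pending.pop()
        crawled.add(url)
        try:
            html = download(url)
        except Exception as error:  # one failed page should not stop the crawl
            print("crawl failed: %s (%s)" % (url, error))
            continue
        if html is None:
            continue
        new_urls, data = parse(url, html)
        pending.update(u for u in new_urls if u not in crawled)
        results.append(data)
    return results


# Stand-in downloader and parser over a tiny fake site (hypothetical data).
FAKE_SITE = {
    "/view/1.htm": ["/view/2.htm", "/view/3.htm"],
    "/view/2.htm": ["/view/1.htm"],
    "/view/3.htm": ["/view/missing.htm"],
}


def fake_download(url):
    return url if url in FAKE_SITE else None


def fake_parse(url, html):
    return FAKE_SITE[url], {"url": url}


pages = crawl("/view/1.htm", fake_download, fake_parse, max_pages=1000)
print(sorted(page["url"] for page in pages))
```

Passing the downloader and parser in as functions keeps the loop testable without network access; in a real run they would be replaced by a urllib2-based downloader and a Beautiful Soup-based parser.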
