Python Web Crawler
- A web page downloader is a tool that downloads the page behind a URL on the internet to the local machine.
- Python's web page downloader: urllib2
Downloading a page with urllib2, method 1: the simplest approach
import urllib2
# request the URL directly
response = urllib2.urlopen('http://www.baidu.com')
# get the status code (200 means the request succeeded)
print response.getcode()
# read the page content
cont = response.read()
Downloading a page with urllib2, method 2: add data and an HTTP header
import urllib2
# create the Request object for the target URL
request = urllib2.Request(url)
# add form data
request.add_data('a', '1')
# add an HTTP header
request.add_header('User-Agent', 'Mozilla/5.0')
# send the request and get the result
response = urllib2.urlopen(request)
Downloading a page with urllib2, method 3: add handlers for special scenarios (e.g. cookies)
import urllib2, cookielib
# create a cookie container
cj = cookielib.CookieJar()
# build an opener that handles cookies
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# install the opener into urllib2
urllib2.install_opener(opener)
# access the page with cookie-aware urllib2
res = urllib2.urlopen("http://www.baidu.com")
Installing PyDev in Eclipse
- In Eclipse: Help - Install New Software
- In the Install dialog that pops up, click Add to add the PyDev repository.
- Eclipse then searches the repository and soon lists PyDev.
- Uncheck the option: Contact all update sites during install to find required software
- Then just keep clicking Next to finish the installation.
Web Page Parsers
Types of web page parsers: regular expressions, html.parser, the Beautiful Soup library (the most powerful), and the lxml library. Apart from regular expressions, all of these do structured parsing (they build a node tree rather than matching the page as a flat string); a small comparison follows below.
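A minimal sketch of the difference, assuming a one-line HTML snippet (the snippet and variable names are made up for illustration): the regular expression works on the raw text, while Beautiful Soup builds a node tree and queries it.
# coding: utf-8
import re
from bs4 import BeautifulSoup

html = '<a href="/view/123.htm">Python</a>'  # hypothetical snippet

# regular expression: string matching on the raw text
print re.search(r'href="(.*?)"', html).group(1)

# Beautiful Soup: structured parsing, then querying the node tree
soup = BeautifulSoup(html, 'html.parser')
print soup.find('a')['href']
Both lines print /view/123.htm; the structured version keeps working even when the attribute order or surrounding markup changes.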
Installing Beautiful Soup
Beautiful Soup is a third-party Python library for extracting data from HTML and XML.
Official documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
From cmd, change into the Scripts directory under the Python installation and run:
pip install beautifulsoup4
After it is installed, run the following in Eclipse:
# coding: utf-8
import bs4
print bs4
If this runs without an error, the installation succeeded.
Beautiful Soup Syntax
Create a BeautifulSoup object from the HTML page content. It provides two search methods: find_all (finds all nodes that match the criteria) and find (finds the first node that matches). The two methods take exactly the same parameters.
From a node you can then access its tag name, attributes, and text.
Creating a BeautifulSoup object
from bs4 import BeautifulSoup
# create a BeautifulSoup object from the page content
soup = BeautifulSoup(
    html_doc,              # HTML document string
    'html.parser',         # HTML parser
    from_encoding='utf8'   # encoding of the HTML document
)
Searching for nodes
- find_all(name, attrs, string) — a short sketch of the parameter forms follows below
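A minimal sketch of the argument forms find_all accepts, assuming a soup object built as above (the tag names, attribute values, and strings are made up for illustration); find takes the same parameters but returns only the first match. Recent Beautiful Soup 4 versions accept the string keyword; older ones call it text.
import re

soup.find_all('a')                                       # search by tag name
soup.find_all('a', href='/view/123.htm')                 # search by attribute value
soup.find_all('div', class_='abc', string='Python')      # class_ avoids clashing with the class keyword
soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))   # name, attrs and string also accept regexes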
Accessing node information
node.name        # get the node's tag name
node['href']     # get the node's href attribute
node.get_text()  # get the node's text
Beautiful Soup example
# coding: utf-8
from bs4 import BeautifulSoup
import re
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(
    html_doc,              # HTML document string
    'html.parser',         # HTML parser
    from_encoding='utf8'   # encoding of the HTML document
)
print 'Get all the links'
links = soup.find_all('a')
for link in links:
    print link.name, link['href'], link.get_text()
print "Get the link for Lacie"
link_node = soup.find('a', href='http://example.com/lacie')
print link_node.name, link_node['href'], link_node.get_text()
print "Regular-expression match"
link1 = soup.find('a', href=re.compile(r"ill"))
print link1.name, link1['href'], link1.get_text()
print "Get the text of the p paragraph"
p_node = soup.find('p', class_="title")
print p_node.name, p_node.get_text()
Example crawler
# coding: utf-8
import urllib2
import cookielib
url = "http://www.baidu.com"
print "Method 1"
response1 = urllib2.urlopen(url)
print "Print the status code; 200 means the request succeeded"
print response1.getcode()
print "Print the length of the page content"
print len(response1.read())
print 'Method 2'
request = urllib2.Request(url)
request.add_header("user-agent", "Mozilla/5.0")
response2 = urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())
print 'Method 3'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
response3 = urllib2.urlopen(url)
print response3.getcode()
print cj
print "Print the page content"
print response3.read()
Target: pages of entries related to the 'Python' entry on Baidu Baike
Entry page: http://baike.baidu.com/view/21087.html
Title format:
<dd class="lemmaWgt-lemmaTitle-title"><h1>****</h1></dd>
Summary format:
<div class="lemma-summary" label-module="lemmaSummary">***</div>
Page encoding: UTF-8
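As an illustration of how the two patterns above can be extracted, here is a minimal parsing sketch, assuming the page can be downloaded with urllib2 as shown earlier; the function name parse_lemma is made up, and this is not the full 1000-page crawler referenced below.
# coding: utf-8
import urllib2
from bs4 import BeautifulSoup

def parse_lemma(page_url):
    # hypothetical helper: download one entry page and pull out title and summary
    html_cont = urllib2.urlopen(page_url).read()
    soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
    # title pattern: <dd class="lemmaWgt-lemmaTitle-title"><h1>...</h1></dd>
    title_node = soup.find('dd', class_='lemmaWgt-lemmaTitle-title').find('h1')
    # summary pattern: <div class="lemma-summary" label-module="lemmaSummary">...</div>
    summary_node = soup.find('div', class_='lemma-summary')
    return title_node.get_text(), summary_node.get_text()

title, summary = parse_lemma('http://baike.baidu.com/view/21087.html')
print title
print summary
The full crawler would additionally collect the /view/... links found on each page and feed them back into this function until 1000 pages have been visited.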
Example code: crawl data from 1000 pages related to the Baidu Baike 'Python' entry
- Code download link: download from GitHub