
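1. Background

1.1 A first look at crawlers

A web crawler is a program or script that automatically fetches information from the internet according to a set of rules. In essence it imitates a browser opening a web page and extracts the data we want from that page. The search engines we use every day, such as Baidu and Google, are crawlers too: they collect and organize data from across the internet so that users can search it.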

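1.2 Legality

Web crawling is still largely uncharted territory, and basic rules about which behaviors are allowed are still being worked out. Scraping data for personal use or research is generally not a problem. If the data is used for commercial profit, it has to be judged case by case: it may or may not be illegal.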

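1.3 The robots protocol

The robots protocol (also called the crawler protocol or robot protocol) is formally the Robots Exclusion Protocol. A site uses it to tell search engines and other crawlers which of its content may be fetched and which may not. Most sites publish these rules in a file named robots.txt in the site root. Zhihu's, for example, is at https://www.zhihu.com/robots.txt

In the end, though, the robots protocol is only an industry convention, and whether it is followed is up to the person running the crawler. When you crawl, show some restraint instead of hammering the site as hard as you can. Look at the state 12306 ends up in, battered by all kinds of ticket-grabbing tools and crawlers...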

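2. Requirements

2.1 Development environment

Operating system: Windows 10
Python version: 3.8
Editor: PyCharm
Package management: Anaconda

This is the setup on my machine. Python needs to be at least 3.x; any editor you like will do; Anaconda is really handy, and the sooner you start using it the better.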

2.2 Programming background

Some front-end knowledge: basic HTML, CSS and JS
Basic Python syntax

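3. Getting started with urllib

urllib is an HTTP request library built into Python, so nothing extra needs to be installed. You only have to supply the URL and the parameters of a request, and it also comes with powerful URL-parsing tools.

urllib has four modules: request, error, parse and robotparser

request: sends requests (the important one)
error: handles errors
parse: parses URLs, paths and so on
robotparser (rarely used): parses a site's robots.txt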
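
Although robotparser is rarely needed, it ties back to the robots protocol from section 1.3. The sketch below feeds it rules that are already in memory, so it runs without network access. The rules text and the example.com URLs are made up for illustration; for a real site you would call set_url() and read() on the parser instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, standing in for a downloaded robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
allowed_public = parser.can_fetch('*', 'https://example.com/index.html')
allowed_private = parser.can_fetch('*', 'https://example.com/private/data.html')
print(allowed_public)   # True
print(allowed_private)  # False
```

Checking can_fetch() before each request is one simple way to keep a crawler within the limits a site has published.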

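3.1 The request module

Method overview:

1. Sending a request
urllib.request.urlopen(url, data=None, [timeout, ]*)
url: the address, either a string or a Request object
data: the request body, as bytes; when it is supplied the request is sent as POST
timeout: timeout in seconds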

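A simple GET request:

"""
# A crawler imitates a user: it sends a request to the server, and the server returns the matching data
# To inspect the traffic, use Chrome rather than a domestic browser if you can
# Press F12 to open the developer tools, click Network and refresh; the page's requests are listed. Common request methods are GET, POST, PUT, DELETE, HEAD, OPTIONS and TRACE, with GET and POST the most used
# GET exposes the request parameters in the URL
# POST puts the parameters in the request body; passwords are usually encrypted before sending
# Request headers: used to make the request look like it comes from a real user
# Response status code: 200 means success
"""

# Import the request module
import urllib.request
# Send the request with a 1-second timeout
response = urllib.request.urlopen('http://www.baidu.com', timeout=1)
# read() returns the whole page as bytes; decode('utf-8') decodes it into a string
print(response.read().decode('utf-8'))
print(response.status)  # status code; 200 means success
print(response.getheaders())  # response headers, as a list of (name, value) tuples
print(response.getheader('Server'))  # a single response header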

A handy test site for sending all kinds of requests is http://httpbin.org/; look up its other features on your own.

A simple POST request

import urllib.parse
import urllib.request
# data must be bytes; when it is supplied, the request is sent as POST
data = bytes(urllib.parse.urlencode({"name": "WenAn"}), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read().decode('utf-8'))

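The Request object

Every request a browser sends carries request headers. When a crawler fetches a page, adding headers lets the server treat it as a browser instead of a crawler. urlopen() itself has no headers parameter, so we create a Request object to carry them.

How to find the headers:

Open any web page (Chrome in this example), press F12 or right-click to open the developer tools, click Network, refresh the page, then click any request in the list to see its request headers.

About Request

urllib.request.Request(url, data=None, headers={}, method=None)
headers: the request headers
method: the HTTP method; when omitted it is GET, or POST if data is supplied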

Example:

import urllib.request
import urllib.parse

url = 'http://httpbin.org/post'
# Add request headers
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36', 'Host': 'httpbin.org'}
# Form fields to send (named form_data so the built-in dict is not shadowed)
form_data = {
    'name': 'WenAn'
}
data = bytes(urllib.parse.urlencode(form_data), encoding='utf-8')
request = urllib.request.Request(url, data=data, headers=headers, method='POST')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
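
To see how a Request picks its HTTP method without sending anything, the following sketch only builds Request objects and inspects them. The /anything path and the 'demo-crawler/0.1' user agent are placeholders chosen for illustration.

```python
import urllib.parse
import urllib.request

url = 'http://httpbin.org/anything'  # only used to build Request objects; nothing is sent

body = urllib.parse.urlencode({'name': 'WenAn'}).encode('utf-8')

get_request = urllib.request.Request(url)
post_request = urllib.request.Request(url, data=body)
put_request = urllib.request.Request(url, data=body, method='PUT')

print(get_request.get_method())   # GET
print(post_request.get_method())  # POST
print(put_request.get_method())   # PUT

# Header names are stored capitalized, whatever case was passed in
ua_request = urllib.request.Request(url, headers={'user-agent': 'demo-crawler/0.1'})
print(ua_request.get_header('User-agent'))  # demo-crawler/0.1
```

Passing method explicitly, as the example above does with 'POST', makes the intent clear even though supplying data already implies POST.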

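3.2 The error module

The error module defines three exception classes:

URLError
- Raised by handlers when they run into a problem (or a subclass of it is raised)
- Has a single attribute, reason
HTTPError
- A subclass of URLError with more attributes, such as code, reason and headers
- Useful for handling unusual HTTP errors, such as a request for authentication
ContentTooShortError
- Raised when urlretrieve() detects that the amount of downloaded data is less than expected (as given by the Content-Length header). The content attribute holds the downloaded (and possibly truncated) data.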

Example 1.

from urllib import request, error
try:
    # Request a.html on Baidu; the page does not exist, so an exception is raised
    response = request.urlopen('http://www.baidu.com/a.html')
except error.URLError as e:
    print(e.reason)  # Not Found

Example 2.

# The same example, with URLError replaced by HTTPError
from urllib import request, error
try:
    response = request.urlopen('http://www.baidu.com/a.html')
except error.HTTPError as e:
    print(e.reason)
    print(e.code)
    print(e.headers)

# Output
"""
Not Found
404
Content-Length: 204
Connection: keep-alive
Content-Type: text/html; charset=iso-8859-1
Date: Sat, 18 Sep 2021 14:18:51 GMT
Keep-Alive: timeout=4
Proxy-Connection: keep-alive
Server: Apache
"""

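Because HTTPError is a subclass of URLError, a handler that wants to tell the two apart has to catch HTTPError first. The sketch below is one way to arrange that. The fetch() helper, its open_func parameter and the always_404 stand-in are assumptions introduced here so the logic can be tried without network access.

```python
from urllib import error, request


def fetch(url, open_func=request.urlopen):
    """Open url and return (ok, info). open_func is injectable so this can run offline."""
    try:
        with open_func(url, timeout=5) as response:
            return True, response.status
    except error.HTTPError as e:
        # The server answered, but with an error status such as 404
        return False, f'HTTP {e.code}: {e.reason}'
    except error.URLError as e:
        # No usable answer at all: DNS failure, refused connection, timeout and so on
        return False, f'failed to reach server: {e.reason}'


def always_404(url, timeout=None):
    """Stand-in for urlopen that always raises a 404, for demonstration only."""
    raise error.HTTPError(url, 404, 'Not Found', None, None)


print(fetch('http://www.baidu.com/a.html', open_func=always_404))
# (False, 'HTTP 404: Not Found')
```

If the two except clauses were swapped, the URLError branch would catch every HTTPError as well and the status code would never be reported.
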
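3.3 The parse module

The parse module defines a standard interface for URLs: splitting, parsing, joining, encoding and decoding them.

urlencode(): encoding parameters

It serializes parameters given as a dict into a URL-encoded string; it was already used in the request module examples above.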

import urllib.parse
# Named query so the built-in dict is not shadowed
query = {
    'name': 'WenAn',
    'age': 20
}
params = urllib.parse.urlencode(query)
print(params)
# name=WenAn&age=20
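
Two details of urlencode() are easy to miss: spaces become '+', and list values need doseq=True to be expanded into repeated keys. The small sketch below shows both, with parse_qs() reversing the encoding. The keys and values are arbitrary examples.

```python
from urllib.parse import parse_qs, urlencode

# Spaces in values are encoded as '+'
spaced = urlencode({'q': 'a b'})
print(spaced)  # q=a+b

# With doseq=True each item of a list becomes its own key=value pair
encoded = urlencode({'name': 'WenAn', 'tag': ['python', 'urllib']}, doseq=True)
print(encoded)  # name=WenAn&tag=python&tag=urllib

# parse_qs() turns a query string back into a dict of lists
decoded = parse_qs(encoded)
print(decoded)  # {'name': ['WenAn'], 'tag': ['python', 'urllib']}
```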

quote(): encoding and decoding non-ASCII text in URLs

import urllib.parse
params = '憨憨没了心'
base_url = 'https://www.baidu.com/s?wd='
url = base_url + urllib.parse.quote(params)
print(url)
# https://www.baidu.com/s?wd=%E6%86%A8%E6%86%A8%E6%B2%A1%E4%BA%86%E5%BF%83

# Use unquote() to decode the text again
url1 = 'https://www.baidu.com/s?wd=%E6%86%A8%E6%86%A8%E6%B2%A1%E4%BA%86%E5%BF%83'
print(urllib.parse.unquote(url1))
# https://www.baidu.com/s?wd=憨憨没了心
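
By default quote() leaves '/' untouched, which suits paths but not query values. As a small illustration, the sketch below compares quote(), quote() with safe='' and quote_plus() on the same made-up string.

```python
from urllib.parse import quote, quote_plus, unquote_plus

text = 'a b/c'

default_quoted = quote(text)          # '/' is kept, space becomes %20
strict_quoted = quote(text, safe='')  # '/' is escaped as well
plus_quoted = quote_plus(text)        # space becomes '+', '/' is escaped

print(default_quoted)  # a%20b/c
print(strict_quoted)   # a%20b%2Fc
print(plus_quoted)     # a+b%2Fc
print(unquote_plus(plus_quoted))  # a b/c
```

quote_plus() is the variant urlencode() uses by default, which is why form data shows '+' where a path would show '%20'.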

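urlparse(): splitting a URL

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
urlstring: the URL to parse
scheme='': the default scheme to use when the URL has none; it is ignored if the URL already specifies a scheme
allow_fragments=True: whether fragment identifiers (the part after '#') are recognized. With True, the default, the fragment is split out into its own component; with False it is left as part of the preceding component and fragment is an empty string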

Example 1.

from urllib.parse import urlparse
a = urlparse("https://docs.python.org/zh-cn/3/library/urllib.parse.html")
print(a)
# Returns a ParseResult, a named tuple holding the parts of the URL; each part can be accessed by name
# ParseResult(scheme='https', netloc='docs.python.org', path='/zh-cn/3/library/urllib.parse.html', params='', query='', fragment='')

print(a.scheme)
print(a.netloc)
print(a.path)
print(a.params)
print(a.query)
print(a.fragment)
"""
scheme: the protocol
netloc: the network location (the domain name, optionally with a port)
path: the path
params: parameters for the last path segment
query: the query string, usually seen in GET request URLs
fragment: the anchor, used to jump straight to a given position on the page
"""

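The scheme and allow_fragments parameters described above are easier to follow with concrete values. This is a small sketch using made-up example.com URLs.

```python
from urllib.parse import urlparse

# The URL has no scheme of its own, so the default is applied
with_default = urlparse('//www.example.com/index.html#top', scheme='https')
print(with_default.scheme, with_default.netloc, with_default.fragment)
# https www.example.com top

# The URL already has a scheme, so the default is ignored
kept = urlparse('http://www.example.com/index.html#top', scheme='https')
print(kept.scheme)  # http

# With allow_fragments=False the '#top' part stays inside the path
no_fragment = urlparse('https://www.example.com/index.html#top', allow_fragments=False)
print(no_fragment.path)            # /index.html#top
print(repr(no_fragment.fragment))  # ''
```

Note that the default scheme only helps when the string starts with '//'; without it, urlparse() treats the host name as part of the path.
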
urlunparse(): building a URL

import urllib.parse
# The six parts, in order: scheme, netloc, path, params, query, fragment
url_params = ('http', 'baidu.com', '/a', '', '', '')
url = urllib.parse.urlunparse(url_params)
print(url)
# http://baidu.com/a

urljoin(): joining URLs

# Join a base URL with a relative URL
from urllib import parse
base_url = 'http://www.cwi.nl/%7Eguido/Python.html'
sub_url = 'FAQ.html'
url = parse.urljoin(base_url, sub_url)
print(url)
# http://www.cwi.nl/%7Eguido/FAQ.html
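
urljoin() resolves the second argument the way a browser resolves a link on a page, so the result depends on what that argument looks like. The sketch below reuses the same base URL with three other kinds of link; the link targets are chosen only for illustration.

```python
from urllib.parse import urljoin

base_url = 'http://www.cwi.nl/%7Eguido/Python.html'

# A path starting with '/' replaces the whole path of the base
from_root = urljoin(base_url, '/FAQ.html')
print(from_root)  # http://www.cwi.nl/FAQ.html

# '..' climbs one directory up from the base document's directory
parent = urljoin(base_url, '../index.html')
print(parent)  # http://www.cwi.nl/index.html

# A link starting with '//' keeps only the scheme of the base
other_host = urljoin(base_url, '//www.python.org/%7Eguido')
print(other_host)  # http://www.python.org/%7Eguido
```

This is the behavior a crawler needs when it turns the href values found in a page into absolute URLs.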

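4. Advanced usage

4.1 Opener

An opener is an instance of urllib.request.OpenerDirector. The urlopen() function used above goes through a ready-made default opener, but it only covers the basics and gives you no control over things such as a custom proxy or cookies.

Steps for building a custom opener

Create handler objects for the features you need, using the relevant Handler classes
Pass the handler objects to urllib.request.build_opener() to create a custom opener
Call open() on the custom opener to send requests

About the global opener

If every request in the program should go through the custom opener, install it with urllib.request.install_opener()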

import urllib.request
# Create the handler
http_handler = urllib.request.HTTPHandler()
# Create the opener
opener = urllib.request.build_opener(http_handler)
# Create the Request object
request = urllib.request.Request('https://www.python.org/')

# Local opener: only requests sent through opener.open() use it
# response = opener.open(request)


# Global opener: every later call to urlopen() will use this custom opener
urllib.request.install_opener(opener)
response = urllib.request.urlopen(request)

print(response.read().decode('utf8'))
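
A custom opener can also carry default headers, so they do not have to be repeated on every Request. The following sketch builds an opener and inspects it without sending a request. The 'my-crawler/0.1' user agent string is a made-up value.

```python
import urllib.request

# Step 1 and 2: create a handler and build an opener from it
opener = urllib.request.build_opener(urllib.request.HTTPHandler())

# addheaders is a list of (name, value) pairs added to every request sent through this opener
opener.addheaders = [('User-Agent', 'my-crawler/0.1')]  # hypothetical user agent

# build_opener() keeps the default handlers alongside the ones passed in
handler_names = [type(h).__name__ for h in opener.handlers]
print('HTTPHandler' in handler_names)          # True
print('HTTPRedirectHandler' in handler_names)  # True

# Step 3 would be opener.open(url), which sends the request with the headers above
```

Because the default handlers stay in place, a custom opener still follows redirects and reports HTTP errors just as urlopen() does.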

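4.2 Proxy settings

How a proxy works

Normal flow: the local machine sends its request straight to the web server, and the web server sends the response data back.

With a proxy: a third party sits between the local machine and the server. The local machine sends the request to the proxy server, the proxy server sends it on to the web server, and the response data is relayed back to the local machine through the proxy.

The difference: the web server only sees the proxy server's IP, not the local machine's, so the real IP is hidden.

Proxy IPs make crawling easier. Many sites count how many times an IP visits within a time window, and if the count is abnormally high they ban that IP from the site. With proxies you can switch to a different one at intervals, and if one IP gets banned you can carry on with another.

A few sites that offer free proxies:

http://www.xiladaili.com/
https://www.kuaidaili.com/free/
https://ip.jiangxianli.com/?page=1

Search online for more; nobody can guarantee how long these free IPs will keep working.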

import urllib.request

# Create the handler. The mapping holds one proxy per URL scheme, so repeating
# the 'http' key would keep only the last entry. The request below goes to an
# https URL, so an 'https' entry is needed for the proxy to be used at all
# (this assumes the proxy also accepts https traffic).
# Other candidates: 223.100.166.3, 113.254.178.224, 115.29.170.58, 117.94.222.233
proxy_handler = urllib.request.ProxyHandler({
    'http': '218.78.22.146:443',
    'https': '218.78.22.146:443'
})
# Create the opener
opener = urllib.request.build_opener(proxy_handler)
header = {
    "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36"
}
request = urllib.request.Request('https://www.httpbin.org/get', headers=header)

# Install it as the global opener
urllib.request.install_opener(opener)
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
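
Switching proxies, as described above, can be sketched by keeping a pool of addresses and building a fresh opener for whichever one is picked. The pool entries below are local placeholders, not working proxies, and the helper names build_proxy_opener and random_proxy_opener are made up for this example.

```python
import random
import urllib.request

# Hypothetical pool; replace with proxies you have checked yourself
PROXY_POOL = ['127.0.0.1:8001', '127.0.0.1:8002', '127.0.0.1:8003']


def build_proxy_opener(proxy):
    """Return an opener that routes both http and https requests through proxy."""
    handler = urllib.request.ProxyHandler({'http': proxy, 'https': proxy})
    return urllib.request.build_opener(handler)


def random_proxy_opener(pool, rng=random):
    """Pick one proxy from pool at random and build an opener for it."""
    proxy = rng.choice(pool)
    return proxy, build_proxy_opener(proxy)


proxy, opener = random_proxy_opener(PROXY_POOL)
print('using proxy', proxy)
# opener.open('https://www.httpbin.org/get', timeout=5) would now go through that proxy
```

Using a local opener per proxy, instead of install_opener(), keeps the choice of proxy tied to one request and makes it easy to retry with a different address when one fails.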

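4.3 Cookie

A cookie is data that a website stores on the user's local machine in order to identify the user; the data is often encoded or encrypted.

Cookies are mainly used for three things:

Session state management (login status, shopping carts, game scores or anything else that needs to be remembered)
Personalization (user preferences, themes and so on)
Tracking browsing behavior (recording and analyzing what the user does)

The http.cookiejar library (named cookielib in Python 2)

This module mainly provides objects that can store cookies. Use it to capture cookies and send them again on later requests; it can also work with files that contain cookie data.

The module mainly provides these classes: CookieJar, FileCookieJar, MozillaCookieJar and LWPCookieJar.

CookieJar: cookies are kept in memory
FileCookieJar, MozillaCookieJar, LWPCookieJar: cookies are stored in a file, written in the corresponding format; the last two are the ones normally used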
Getting cookies

import http.cookiejar
import urllib.request
# Create the cookie jar
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
# Once the response arrives, the jar is filled in automatically
for i in cookie:
    print(i.name + "=" + i.value)

Saving cookies to a local file

import http.cookiejar
import urllib.request

filename = 'cookie.txt'
# cookie = http.cookiejar.MozillaCookieJar(filename)
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
# Once the response arrives, the jar is filled in automatically
response = opener.open('http://www.baidu.com')
# Write the cookies to cookie.txt
cookie.save(ignore_discard=True, ignore_expires=True)

Reading a cookie file

import urllib.request
import http.cookiejar

# The jar class must match the one that produced the cookie file: LWP or Mozilla
# cookie = http.cookiejar.MozillaCookieJar()
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_expires=True, ignore_discard=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
print(response.read().decode('utf-8'))
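
The save and load steps can be tried without contacting any site by putting a hand-built cookie into the jar. This is only a sketch: the make_cookie helper, the cookie name and value, and the example.com domain are invented for illustration, and real cookies normally come from server responses.

```python
import os
import tempfile
from http.cookiejar import Cookie, LWPCookieJar


def make_cookie(name, value, domain):
    """Build a session cookie by hand, for demonstration only."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


# Write to a temporary directory so no existing cookie.txt is overwritten
path = os.path.join(tempfile.mkdtemp(), 'cookie.txt')

saved = LWPCookieJar(path)
saved.set_cookie(make_cookie('session', 'abc123', 'example.com'))
# ignore_discard=True is needed here: session cookies are skipped otherwise
saved.save(ignore_discard=True, ignore_expires=True)

loaded = LWPCookieJar()
loaded.load(path, ignore_discard=True, ignore_expires=True)
restored = {c.name: c.value for c in loaded}
print(restored)  # {'session': 'abc123'}
```

The same round trip works with MozillaCookieJar, as long as the same class is used for both saving and loading.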

That covers the commonly used parts of urllib.

urllib reference documentation: https://docs.python.org/zh-cn/3/library/urllib.html

Personal blog: wenancoding.com

Source: https://www.cnblogs.com/

Python Series: Urllib, 2022-04-19 11:07:29