Scrapy Basics: CrawlSpider Explained

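Preface

In Scrapy Basics: Spider, I briefly covered the Spider class. Spider can already do a great deal, but if you want to crawl all of Zhihu or Jianshu, you may need a more powerful weapon.
CrawlSpider is built on Spider, but it is fair to say it was made for whole-site crawling.

Overview

CrawlSpider is the spider commonly used for sites whose links follow regular patterns. It builds on Spider and adds a few attributes of its own:

· rules: a collection of Rule objects, used to match the pages you want on the target site and filter out the noise.

· parse_start_url: handles the responses for the start URLs. It must return an Item, a Request, or an iterable containing them.

Because rules is a collection of Rule objects, Rule needs an introduction too. It takes these parameters: link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None
The link_extractor can be one you define yourself or an instance of the existing LinkExtractor class, whose main parameters are:

· allow: URLs matching this regular expression (or list of regular expressions) are extracted. If it is empty, every URL matches.

· deny: URLs matching this regular expression (or list of regular expressions) are never extracted. It takes precedence over allow.

· allow_domains: the domains whose links will be extracted.

· deny_domains: the domains whose links will never be extracted.

· restrict_xpaths: XPath expressions that select the regions of the page to extract links from; it works together with allow to filter links. There is also a similar restrict_css.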
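To make the interplay between allow and deny concrete, here is a minimal pure-Python sketch. It is a simplified model, not Scrapy's implementation: link_passes is a hypothetical helper, and it assumes the patterns are applied with re.search against the whole URL, with deny taking precedence.

```python
import re


def link_passes(url, allow=(), deny=()):
    """Hypothetical model of how allow and deny combine for one URL."""
    # An empty allow tuple means "allow everything".
    if allow and not any(re.search(p, url) for p in allow):
        return False
    # deny wins: a single matching deny pattern rejects the URL.
    if any(re.search(p, url) for p in deny):
        return False
    return True


urls = [
    "http://www.example.com/category.php?id=1",
    "http://www.example.com/category.php?page=subsection.php",
    "http://www.example.com/about.html",
]
kept = [
    u for u in urls
    if link_passes(u, allow=(r"category\.php",), deny=(r"subsection\.php",))
]
print(kept)
```

Only the first URL survives: the second matches allow but is vetoed by deny, and the third never matches allow.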

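Below is the example from the official documentation. After it, I will work through some common questions by reading the source code: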

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        # A plain dict is used here: a bare scrapy.Item() declares no fields,
        # so assigning item['id'] on it would raise KeyError.
        item = {}
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item

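Question: How does CrawlSpider work?

Because CrawlSpider inherits from Spider, it has all of Spider's methods.
First, start_requests issues a request for every URL in start_urls (through make_requests_from_url), and the response is received by parse. In a plain Spider we have to define parse ourselves, but CrawlSpider defines parse itself to process the response (self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)). This is also why you should not override parse in a CrawlSpider.
_parse_response then does different things depending on callback, follow, and self._follow_links.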

def _parse_response(self, response, callback, cb_kwargs, follow=True):
    # If a callback was passed in, use it to parse the page and collect
    # the requests or items it produces.
    if callback:
        cb_res = callback(response, **cb_kwargs) or ()
        cb_res = self.process_results(response, cb_res)
        for requests_or_item in iterate_spider_output(cb_res):
            yield requests_or_item

    # Then check follow, and use _requests_to_follow to look for links
    # in the response that match the rules.
    if follow and self._follow_links:
        for request_or_item in self._requests_to_follow(response):
            yield request_or_item

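_requests_to_follow takes each rule's link_extractor (the LinkExtractor we passed in), extracts the links from the page (link_extractor.extract_links(response)), post-processes them (process_links, which you define yourself), and creates a Request for every link that qualifies. Each request is passed through process_request (also user-defined) before it is yielded.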
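One detail of _requests_to_follow that is easy to miss in the source listed at the end of this post: a seen set is shared across all rules, so a link matched by an earlier rule is not handed to a later one. The sketch below is a toy model of that loop in plain Python. The name requests_to_follow, the predicate-style rules, and the sample links are all assumptions made for illustration; no Scrapy objects are involved.

```python
def requests_to_follow(page_links, rules):
    """Toy model: `rules` is a list of predicates, `page_links` a list of URLs."""
    seen = set()
    planned = []
    for n, matches in enumerate(rules):
        # Only links that no earlier rule has claimed are considered.
        links = [u for u in page_links if matches(u) and u not in seen]
        for u in links:
            seen.add(u)
            # The real code stores the rule index in request.meta['rule'].
            planned.append((u, n))
    return planned


rules = [lambda u: "category" in u, lambda u: u.endswith(".php")]
links = ["/category.php", "/item.php", "/about.html"]
planned = requests_to_follow(links, rules)
print(planned)
```

Here /category.php also satisfies the second predicate, but it is claimed by rule 0 first, which suggests that the order of the rules matters when their patterns overlap.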

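Question: How does CrawlSpider get its rules?

CrawlSpider calls _compile_rules from its __init__ method. That method makes a shallow copy of every Rule in rules and resolves three things on each copy: the callback, the link processor (process_links), and the request processor (process_request).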

def _compile_rules(self):
    def get_method(method):
        if callable(method):
            return method
        elif isinstance(method, six.string_types):
            return getattr(self, method, None)

    self._rules = [copy.copy(r) for r in self.rules]
    for rule in self._rules:
        rule.callback = get_method(rule.callback)
        rule.process_links = get_method(rule.process_links)
        rule.process_request = get_method(rule.process_request)
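The get_method helper is why a Rule accepts either a function or the name of a spider method as a string. The following is a minimal stand-alone sketch of that lookup, assuming Python 3 (so str replaces six.string_types); DemoSpider and its parse_item are hypothetical names used only for this illustration.

```python
class DemoSpider:
    """Hypothetical stand-in for a spider; not a Scrapy class."""

    def parse_item(self, response):
        return "parsed " + response


def get_method(spider, method):
    # Callables are used as they are.
    if callable(method):
        return method
    # Strings are looked up as attribute names on the spider.
    elif isinstance(method, str):
        return getattr(spider, method, None)
    # Anything else (for example None) falls through and returns None.


spider = DemoSpider()
by_name = get_method(spider, "parse_item")
by_ref = get_method(spider, len)
missing = get_method(spider, "no_such_method")
print(by_name("page"), by_ref is len, missing)
```

Note that a misspelled method name resolves to None without raising an error, so in the version of the source shown in this post a typo in callback='...' would be silent.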

So how is Rule defined?

class Rule(object):

    def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=identity):
        self.link_extractor = link_extractor
        self.callback = callback
        self.cb_kwargs = cb_kwargs or {}
        self.process_links = process_links
        self.process_request = process_request
        if follow is None:
            self.follow = False if callback else True
        else:
            self.follow = follow

So the LinkExtractor we pass in is stored as link_extractor.
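The follow default in Rule.__init__ is worth isolating, because it decides whether a page that has a callback also has its links followed. Here is a small sketch of just that branch; effective_follow is a hypothetical helper name, not part of Scrapy.

```python
def effective_follow(callback=None, follow=None):
    """Mirror of the follow logic in Rule.__init__ shown above."""
    if follow is None:
        # No explicit choice: follow only when there is no callback.
        return False if callback else True
    return follow


print(effective_follow())                                   # no callback
print(effective_follow(callback="parse_item"))              # callback only
print(effective_follow(callback="parse_item", follow=True)) # explicit follow
```

In other words, if you want a rule both to parse a page and to keep crawling from it, you have to pass follow=True yourself.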

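Responses with a callback are handled by the given function; which function handles those without one?

From the walkthrough above, _parse_response handles responses that have a callback:
cb_res = callback(response, **cb_kwargs) or ()
_requests_to_follow, in turn, sets self._response_downloaded as the callback of every Request it creates for the URLs matched on the page:
r = Request(url=link.url, callback=self._response_downloaded)
So a response whose rule has no callback still goes through _response_downloaded and on to _parse_response. With callback set to None the parsing step is skipped, and only link following runs.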
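The two branches of _parse_response can be tried out in isolation with a toy generator. This is only a sketch under simplifying assumptions: responses are plain strings, requests are represented as strings as well, and parse_response, parse_item, and extract_links are hypothetical stand-ins for the Scrapy methods.

```python
def parse_response(response, callback, follow, extract_links):
    """Toy model of the callback branch and the follow branch."""
    if callback:
        # Branch 1: whatever the callback returns is passed along.
        for result in callback(response) or ():
            yield result
    if follow:
        # Branch 2: every extracted link becomes a new "request".
        for link in extract_links(response):
            yield "Request(" + link + ")"


def parse_item(response):
    return [{"url": response}]


def extract_links(response):
    return [response + "/next"]


no_callback = list(parse_response("/category", None, True, extract_links))
with_callback = list(parse_response("/item", parse_item, False, extract_links))
print(no_callback)
print(with_callback)
```

A rule without a callback produces only new requests, while a rule with a callback and follow=False produces only items.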

How to simulate a login in CrawlSpider

Like Spider, CrawlSpider starts its requests through start_requests. The code below, borrowed from Andrew_liu, shows how to simulate a login:

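from scrapy import FormRequest, Request, Selector

# The methods below go inside your CrawlSpider subclass.
# self.headers is assumed to be a dict of request headers defined on the spider.

# Replace the default start_requests; the callback is post_login.
def start_requests(self):
    return [Request("http://www.zhihu.com/#signin", meta={'cookiejar': 1}, callback=self.post_login)]

def post_login(self, response):
    print('Preparing login')
    # Grab the value of the _xsrf field from the returned page;
    # it is needed to submit the form successfully.
    xsrf = Selector(response).xpath('//input[@name="_xsrf"]/@value').extract()[0]
    print(xsrf)
    # FormRequest.from_response is provided by Scrapy for posting a form.
    # After a successful login, the after_login callback is called.
    return [FormRequest.from_response(response,  # "http://www.zhihu.com/login",
                                      meta={'cookiejar': response.meta['cookiejar']},
                                      headers=self.headers,
                                      formdata={
                                          '_xsrf': xsrf,
                                          'email': 'your_email@example.com',  # placeholder
                                          'password': 'your_password'         # placeholder
                                      },
                                      callback=self.after_login,
                                      dont_filter=True
                                      )]

# make_requests_from_url leads back to parse, which connects the login
# flow to CrawlSpider's own parse. (The method is deprecated in later Scrapy
# releases; yielding Request(url) there also uses parse as the default callback.)
def after_login(self, response):
    for url in self.start_urls:
        yield self.make_requests_from_url(url)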

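That covers the theory. If anything is missing or unclear, please let me know in the comments.
Next, I will write a spider that crawls every user on Jianshu to show how CrawlSpider is used in practice.

Finally, here is the source code of scrapy.spiders.CrawlSpider for reference: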

"""
This modules implements the CrawlSpider which is the recommended spider to use
for scraping typical web sites that requires crawling pages.

See documentation in docs/topics/spiders.rst
"""

import copy
import six

from scrapy.http import Request, HtmlResponse
from scrapy.utils.spider import iterate_spider_output
from scrapy.spiders import Spider


def identity(x):
    return x


class Rule(object):

    def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=identity):
        self.link_extractor = link_extractor
        self.callback = callback
        self.cb_kwargs = cb_kwargs or {}
        self.process_links = process_links
        self.process_request = process_request
        if follow is None:
            self.follow = False if callback else True
        else:
            self.follow = follow


class CrawlSpider(Spider):

    rules = ()

    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)
        self._compile_rules()

    def parse(self, response):
        return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)

    def parse_start_url(self, response):
        return []

    def process_results(self, response, results):
        return results

    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = Request(url=link.url, callback=self._response_downloaded)
                r.meta.update(rule=n, link_text=link.text)
                yield rule.process_request(r)

    def _response_downloaded(self, response):
        rule = self._rules[response.meta['rule']]
        return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

    def _parse_response(self, response, callback, cb_kwargs, follow=True):
        if callback:
            cb_res = callback(response, **cb_kwargs) or ()
            cb_res = self.process_results(response, cb_res)
            for requests_or_item in iterate_spider_output(cb_res):
                yield requests_or_item

        if follow and self._follow_links:
            for request_or_item in self._requests_to_follow(response):
                yield request_or_item

    def _compile_rules(self):
        def get_method(method):
            if callable(method):
                return method
            elif isinstance(method, six.string_types):
                return getattr(self, method, None)

        self._rules = [copy.copy(r) for r in self.rules]
        for rule in self._rules:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)
        spider._follow_links = crawler.settings.getbool(
            'CRAWLSPIDER_FOLLOW_LINKS', True)
        return spider

    def set_crawler(self, crawler):
        super(CrawlSpider, self).set_crawler(crawler)
        self._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', True)

Originally from: Jianshu / hoptop
