This section documents common practices (common practices) when using Scrapy. It covers many topics that do not fall into any other specific section.
Besides the usual scrapy crawl command, you can also use the API to run Scrapy from a script.
Keep in mind that Scrapy is built on top of the Twisted asynchronous networking library, so it must be run inside the Twisted reactor.
Also, you must shut down the Twisted reactor yourself after the spider is finished. This can be achieved by adding a callback to the deferred returned by CrawlerRunner.crawl.
Here is an example of how to do it, using the testspiders project as a reference:
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
runner = CrawlerRunner(get_project_settings())
# 'followall' is the name of one of the spiders of the project.
d = runner.crawl('followall', domain='scrapinghub.com')
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished
Running spiders outside projects isn't much different. You have to create a generic Settings object and populate it as needed (see the Built-in settings reference for the available settings), instead of using the configuration returned by get_project_settings.
Spiders can still be referenced by their name if SPIDER_MODULES is set with the modules where Scrapy should look for spiders. Otherwise, passing the spider class as the first argument to the CrawlerRunner.crawl method is enough.
from twisted.internet import reactor
from scrapy.spiders import Spider
from scrapy.crawler import CrawlerRunner
from scrapy.settings import Settings
class MySpider(Spider):
    # Your spider definition
    ...
settings = Settings({'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'})
runner = CrawlerRunner(settings)
d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished
See also
Twisted Reactor Overview.
By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process using the internal API.
Here is an example, using the testspiders project, that runs multiple spiders simultaneously:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
runner = CrawlerRunner(get_project_settings())
dfs = set()
for domain in ['scrapinghub.com', 'insophia.com']:
    d = runner.crawl('followall', domain=domain)
    dfs.add(d)
defer.DeferredList(dfs).addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished
The same example, but running the spiders sequentially by chaining the deferreds:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
runner = CrawlerRunner(get_project_settings())
@defer.inlineCallbacks
def crawl():
    for domain in ['scrapinghub.com', 'insophia.com']:
        yield runner.crawl('followall', domain=domain)
    reactor.stop()
crawl()
reactor.run() # the script will block here until the last crawl call is finished
See also
Run Scrapy from a script.
Scrapy doesn't provide any built-in facility for distributed (multi-server) crawls. However, there are some ways to distribute crawls, which vary depending on how you plan to distribute them.
If you have many spiders, the obvious way to distribute the load is to set up many Scrapyd instances and distribute spider runs among those machines.
If you instead want to run a single (big) spider across many machines, you can partition the URLs to crawl and send the chunks to each separate spider. For example:
First, prepare the list of URLs to crawl and put them into separate files/URLs:
http://somedomain.com/urls-to-crawl/spider1/part1.list
http://somedomain.com/urls-to-crawl/spider1/part2.list
http://somedomain.com/urls-to-crawl/spider1/part3.list
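Splitting the master URL list into those part files can be done with a small helper. This is a hypothetical sketch; round-robin assignment is just one simple partitioning choice:

```python
def partition_urls(urls, num_parts):
    # Distribute URLs round-robin into num_parts chunks,
    # one chunk per partN.list file / spider run.
    parts = [[] for _ in range(num_parts)]
    for i, url in enumerate(urls):
        parts[i % num_parts].append(url)
    return parts

# Each chunk would then be written to its partN.list file, one URL per line.
chunks = partition_urls(['http://a.example', 'http://b.example', 'http://c.example'], 3)
```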
Then you fire a spider run on 3 different Scrapyd servers. The spider would receive a (spider) argument part with the number of the partition to crawl:
curl http://scrapy1.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=1
curl http://scrapy2.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=2
curl http://scrapy3.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=3
Some websites implement certain measures to prevent bots from crawling them. Getting around those measures can be difficult and tricky, and may sometimes require special infrastructure. Please consider contacting commercial support if in doubt.
Here are some tips to keep in mind when dealing with these kinds of sites:
Disable cookies (see COOKIES_ENABLED), as some sites may use cookies to spot crawler behaviour. Use download delays (see the DOWNLOAD_DELAY setting). If you are still getting banned, consider contacting commercial support.
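The cookie and download-delay tips map directly onto Scrapy settings. An illustrative settings.py fragment; the values are examples, not recommendations:

```python
# settings.py (illustrative values)
COOKIES_ENABLED = False  # some sites use cookies to spot crawler behaviour
DOWNLOAD_DELAY = 2       # seconds to wait between requests to the same site
```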
For some applications, the structure of items is controlled by user input or other changing conditions. You can create item classes dynamically:
from scrapy.item import DictItem, Field
def create_item_class(class_name, field_list):
    fields = {field_name: Field() for field_name in field_list}
    return type(class_name, (DictItem,), {'fields': fields})
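The same type() technique works with any base class. A standalone sketch using only the standard library, where dict stands in for DictItem and a plain dict for Field(); ProductItem and its fields are hypothetical:

```python
def create_item_class(class_name, field_list):
    # Build a {field_name: field_metadata} mapping from the user-supplied list.
    fields = {field_name: {} for field_name in field_list}
    # type(name, bases, namespace) creates the class object dynamically.
    return type(class_name, (dict,), {'fields': fields})

# Hypothetical usage: build an item class from a user-supplied field list.
ProductItem = create_item_class('ProductItem', ['name', 'price'])
item = ProductItem(name='widget', price='9.99')
```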