This section documents common practices (common practices) when using Scrapy. It covers many topics that do not fall into any other specific section.
Besides the usual scrapy crawl command, you can also use the API to run Scrapy from a script.
Keep in mind that Scrapy is built on top of the Twisted asynchronous networking library, so it must be run inside the Twisted reactor.
Also, you must shut down the Twisted reactor yourself after the spider is finished. This can be achieved by adding a callback to the deferred returned by CrawlerRunner.crawl.
Here is an example of how to do it, using the testspiders project as a reference:
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
runner = CrawlerRunner(get_project_settings())
# 'followall' is the name of one of the spiders of the project.
d = runner.crawl('followall', domain='scrapinghub.com')
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished
Running spiders outside projects isn't much different. You have to create a generic Settings object and populate it as needed (see the Built-in settings reference for the available settings), instead of using the configuration returned by get_project_settings.
Spiders can still be referenced by their name if SPIDER_MODULES is set with the modules where Scrapy should look for spiders. Otherwise, passing the spider class as the first argument to the CrawlerRunner.crawl method is enough.
from twisted.internet import reactor
from scrapy.spiders import Spider
from scrapy.crawler import CrawlerRunner
from scrapy.settings import Settings
class MySpider(Spider):
    # Your spider definition
    ...
settings = Settings({'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'})
runner = CrawlerRunner(settings)
d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished
See also
Twisted Reactor Overview.
By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process using the internal API.
Here is an example, using the testspiders project, that runs multiple spiders simultaneously:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
runner = CrawlerRunner(get_project_settings())
dfs = set()
for domain in ['scrapinghub.com', 'insophia.com']:
    d = runner.crawl('followall', domain=domain)
    dfs.add(d)
defer.DeferredList(dfs).addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished
The same example, but running the spiders sequentially by chaining the deferreds:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
runner = CrawlerRunner(get_project_settings())
@defer.inlineCallbacks
def crawl():
    for domain in ['scrapinghub.com', 'insophia.com']:
        yield runner.crawl('followall', domain=domain)
    reactor.stop()
crawl()
reactor.run() # the script will block here until the last crawl call is finished
See also
Run Scrapy from a script.
Scrapy doesn't provide any built-in facility for distributed (multi-server) crawls. However, there are some ways to distribute crawls, which vary depending on how you plan to distribute them.
If you have many spiders, the obvious way to distribute the load is to set up many Scrapyd instances and distribute spider runs among those machines.
If you instead want to run a single (big) spider across many machines, you can partition the URLs to crawl and send the chunks to each separate spider. For example:
First, prepare the list of URLs to crawl and put them into separate files/URLs:
http://somedomain.com/urls-to-crawl/spider1/part1.list
http://somedomain.com/urls-to-crawl/spider1/part2.list
http://somedomain.com/urls-to-crawl/spider1/part3.list
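Splitting the master URL list into those part files can be done with a small helper. This is a hypothetical sketch; round-robin assignment is just one simple partitioning choice:

```python
def partition_urls(urls, num_parts):
    # Distribute URLs round-robin into num_parts chunks,
    # one chunk per partN.list file / spider run.
    parts = [[] for _ in range(num_parts)]
    for i, url in enumerate(urls):
        parts[i % num_parts].append(url)
    return parts

# Each chunk would then be written to its partN.list file, one URL per line.
chunks = partition_urls(['http://a.example', 'http://b.example', 'http://c.example'], 3)
```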
Then you fire a spider run on 3 different Scrapyd servers. The spider would receive a (spider) argument part with the number of the partition to crawl:
curl http://scrapy1.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=1
curl http://scrapy2.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=2
curl http://scrapy3.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=3
Some websites implement certain measures to prevent bots from crawling them. Getting around those measures can be difficult and tricky, and may sometimes require special infrastructure. Please consider contacting commercial support if in doubt.
Here are some tips to keep in mind when dealing with these kinds of sites:
Disable cookies (see COOKIES_ENABLED), as some sites may use cookies to spot crawler behaviour. Use download delays (see the DOWNLOAD_DELAY setting). If you are still getting banned, consider contacting commercial support.
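The cookie and download-delay tips map directly onto Scrapy settings. An illustrative settings.py fragment; the values are examples, not recommendations:

```python
# settings.py (illustrative values)
COOKIES_ENABLED = False  # some sites use cookies to spot crawler behaviour
DOWNLOAD_DELAY = 2       # seconds to wait between requests to the same site
```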
For some applications, the structure of items is controlled by user input or other changing conditions. You can create item classes dynamically:
from scrapy.item import DictItem, Field
def create_item_class(class_name, field_list):
    fields = {field_name: Field() for field_name in field_list}
    return type(class_name, (DictItem,), {'fields': fields})
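The same type() technique works with any base class. A standalone sketch using only the standard library, where dict stands in for DictItem and a plain dict for Field(); ProductItem and its fields are hypothetical:

```python
def create_item_class(class_name, field_list):
    # Build a {field_name: field_metadata} mapping from the user-supplied list.
    fields = {field_name: {} for field_name in field_list}
    # type(name, bases, namespace) creates the class object dynamically.
    return type(class_name, (dict,), {'fields': fields})

# Hypothetical usage: build an item class from a user-supplied field list.
ProductItem = create_item_class('ProductItem', ['name', 'price'])
item = ProductItem(name='widget', price='9.99')
```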