This section documents common practices when using Scrapy. These cover many topics that do not fall neatly into any other specific section.
Besides the typical way of launching Scrapy via scrapy crawl, you can also use the API to run Scrapy from a script.
Remember that Scrapy is built on top of the Twisted asynchronous networking library, so it must run inside the Twisted reactor.
Also, you have to shut down the Twisted reactor yourself after the spider has finished. This can be done by adding callbacks to the deferred returned by CrawlerRunner.crawl.
Here is an example showing how to do it, using the testspiders project:
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
runner = CrawlerRunner(get_project_settings())
# 'followall' is the name of one of the spiders of the project.
d = runner.crawl('followall', domain='scrapinghub.com')
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished
Running spiders outside projects is not much different. You have to create a generic Settings object and populate it as needed (see the built-in settings reference for the available settings), instead of using the configuration returned by get_project_settings.
Spiders can still be referenced by their name if SPIDER_MODULES is set with the modules where Scrapy should look for spiders. Otherwise, passing the spider class as the first argument to the CrawlerRunner.crawl method is enough.
from twisted.internet import reactor
from scrapy.spider import Spider
from scrapy.crawler import CrawlerRunner
from scrapy.settings import Settings
class MySpider(Spider):
    # Your spider definition
    ...
settings = Settings({'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'})
runner = CrawlerRunner(settings)
d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished
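If you prefer to reference the spider by its name instead of passing the class, a minimal sketch could set SPIDER_MODULES in the Settings object. The myproject.spiders module path and the 'myspider' name below are hypothetical placeholders, not part of the original example:
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.settings import Settings

# SPIDER_MODULES tells Scrapy where to look for spiders, so they can be
# referenced by name; 'myproject.spiders' and 'myspider' are assumed names.
settings = Settings({'SPIDER_MODULES': ['myproject.spiders']})
runner = CrawlerRunner(settings)

d = runner.crawl('myspider')
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished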
See also:
Twisted Reactor Overview.
By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process using the internal API.
Here is an example, again using the testspiders project, that runs multiple spiders simultaneously:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
runner = CrawlerRunner(get_project_settings())
dfs = set()
for domain in ['scrapinghub.com', 'insophia.com']:
    d = runner.crawl('followall', domain=domain)
    dfs.add(d)
defer.DeferredList(dfs).addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished
Same example, but running the spiders sequentially by chaining the deferreds:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
runner = CrawlerRunner(get_project_settings())
@defer.inlineCallbacks
def crawl():
    for domain in ['scrapinghub.com', 'insophia.com']:
        yield runner.crawl('followall', domain=domain)
    reactor.stop()
crawl()
reactor.run() # the script will block here until the last crawl call is finished
See also:
Run Scrapy from a script.
Scrapy does not provide any built-in facility for distributed (multi-server) crawls. However, there are some ways to distribute a crawl, depending on how you plan to distribute it.
If you have many spiders, the obvious way to distribute the load is to set up many Scrapyd instances and distribute the spider runs among them.
If you instead want to run a single (big) spider across many machines, what you usually do is partition the URLs to crawl and send each partition to a separate spider run. For example:
First, prepare the list of URLs to crawl and put them into separate files/URLs:
http://somedomain.com/urls-to-crawl/spider1/part1.list
http://somedomain.com/urls-to-crawl/spider1/part2.list
http://somedomain.com/urls-to-crawl/spider1/part3.list
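A minimal sketch of how such a partitioning step might look, assuming a hypothetical urls.txt with one URL per line (uploading the resulting files to the server above is left out):
# Split a hypothetical urls.txt into 3 part files, one URL per line.
num_parts = 3
out_files = [open('part%d.list' % (i + 1), 'w') for i in range(num_parts)]
with open('urls.txt') as source:
    for i, url in enumerate(source):
        # Round-robin assignment of URLs to partitions.
        out_files[i % num_parts].write(url)
for f in out_files:
    f.close()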
Then you fire a spider run on 3 different Scrapyd servers. The spider would receive a (spider) argument part with the number of the partition to crawl:
curl http://scrapy1.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=1
curl http://scrapy2.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=2
curl http://scrapy3.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=3
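A rough sketch of what such a spider might look like: it fetches the part file for its partition and then crawls each URL listed in it. The class below and its parsing details are assumptions for illustration, not part of the official example:
from scrapy.spider import Spider
from scrapy.http import Request

class Spider1(Spider):
    name = 'spider1'

    def __init__(self, part=None, *args, **kwargs):
        super(Spider1, self).__init__(*args, **kwargs)
        # 'part' arrives as a string from the schedule.json call above.
        self.start_urls = [
            'http://somedomain.com/urls-to-crawl/spider1/part%s.list' % part,
        ]

    def parse(self, response):
        # The .list file is assumed to contain one URL per line.
        for url in response.text.splitlines():
            if url.strip():
                yield Request(url.strip(), callback=self.parse_page)

    def parse_page(self, response):
        # Actual scraping logic for each crawled page goes here.
        pass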
Some websites implement measures to prevent bots from crawling them, with varying degrees of sophistication. Getting around those measures can be difficult and tricky, and may sometimes require special infrastructure. Please consider contacting commercial support if in doubt.
Here are some tips to keep in mind when dealing with these kinds of sites:
- disable cookies (see COOKIES_ENABLED), as some sites may use cookies to spot bot behaviour
- use download delays; see the DOWNLOAD_DELAY setting
If you are still unable to prevent your bot getting banned, consider contacting commercial support.
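As a rough illustration, these tips could translate into project settings along these lines (the values below are arbitrary examples, not recommendations):
# Example snippet for a project's settings.py; values are illustrative only.
COOKIES_ENABLED = False   # some sites use cookies to spot bot behaviour
DOWNLOAD_DELAY = 2        # wait 2 seconds between requests to the same site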
For some applications, the structure of items is controlled by user input or other changing conditions and cannot be fixed in advance. In these cases you can create item classes dynamically:
from scrapy.item import DictItem, Field
def create_item_class(class_name, field_list):
    fields = {field_name: Field() for field_name in field_list}
    return type(class_name, (DictItem,), {'fields': fields})
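A dynamically created item class can then be used like any statically defined one. The ProductItem name and its fields below are purely illustrative:
# Build an item class at runtime from a list of field names (hypothetical example).
ProductItem = create_item_class('ProductItem', ['name', 'price'])

item = ProductItem()
item['name'] = 'Example product'
item['price'] = '9.99'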