
Scrapy crawl stalls intermittently

While crawling Baidu News with Scrapy, the crawl keeps pausing intermittently for a few seconds, like this:

(screenshot: crawler log showing the intermittent pauses)

This severely hurts crawl speed, yet similarly structured code crawling Baidu Zhaopin (job listings) runs smoothly. What could be causing this? Is there a relevant option in settings?
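For reference, these are the standard Scrapy settings that can introduce deliberate pauses between requests; the values shown are only the framework defaults to check your own settings.py against:

# settings.py -- all names below are standard Scrapy settings;
# the values are the defaults, shown purely for comparison.
DOWNLOAD_DELAY = 0                  # > 0 inserts a fixed pause between requests
RANDOMIZE_DOWNLOAD_DELAY = True     # multiplies DOWNLOAD_DELAY by 0.5-1.5
AUTOTHROTTLE_ENABLED = False        # when True, Scrapy slows down on slow responses
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # low values serialize requests to one host
DOWNLOAD_TIMEOUT = 180              # a hanging response occupies its slot this long
RETRY_TIMES = 2                     # each retry of a timed-out page adds more waiting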

Here is the code:

import re
from datetime import datetime
from urllib.parse import unquote

import pymysql
import scrapy
from scrapy import Request

from ..items import BaiduxinwenItem  # adjust to your project's items module


class BaiduxinwenpaquSpider(scrapy.Spider):
    name = 'baiduxinwenpaqu'
    allowed_domains = ['news.baidu.com']  # no trailing slash, or the offsite filter misfires

    # Load the company names once, when the spider class is defined
    conn = pymysql.connect(
        host='127.0.0.1',
        user='root',
        password='127127',
        db='company_news',
        port=3306,
        charset='utf8'
    )
    cursor = conn.cursor()
    cursor.execute("select `company_name` from `wuxi_a_business_info`")
    rests_tuple = cursor.fetchall()

    # sample for testing: tuple_ = ('無錫市司法局', '無錫市人口和計劃生育委員會')
    # One news-search URL per company; the quotes force an exact-phrase match
    start_urls = [
        'http://news.baidu.com/ns?word="{}"&pn=0&cl=2&ct=0&tn=news&rn=20&ie=utf-8&bt=0&et=0'.format(row[0])
        for row in rests_tuple[20000:200000]
    ]

    def parse(self, response):
        conn = pymysql.connect(
            host='127.0.0.1',
            user='root',
            password='127127',
            db='company_news',
            port=3306,
            charset='utf8'
        )
        cursor = conn.cursor()
        try:
            # Recover the company name from the search URL
            company_name = re.search(r'word=(.*)&pn', response.url).group(1)
            company_name = unquote(company_name).replace('"', '')
            # Parameterised query avoids quoting and injection problems
            cursor.execute(
                "select `company_id` from `wuxi_a_business_info` where `company_name`=%s",
                (company_name,)
            )
            company_id = cursor.fetchall()[0][0]
        except (AttributeError, IndexError, pymysql.MySQLError):
            company_name = ''
            company_id = ''
        finally:
            conn.close()

        # Fetch the first five result pages (20 results per page)
        for page in range(0, 81, 20):
            next_url = re.sub(r'pn=\d+', 'pn=%d' % page, response.url)
            yield Request(url=next_url, callback=self.parse_detail, dont_filter=True,
                          meta={'company_name': company_name, 'company_id': company_id})

    def parse_detail(self, response):
        company_name = response.meta['company_name']
        company_id = response.meta['company_id']
        for info in response.xpath('//div[@class="result"]'):
            title = ''.join(info.xpath('h3/a//text()').extract())
            # The c-author line holds "source\xa0\xa0time"
            author = info.xpath('div[@class="c-summary c-row "]/p[@class="c-author"]/text()').extract()
            parts = author[0].split('\xa0\xa0') if author else []
            source = parts[0] if len(parts) > 0 else ''
            time = parts[1] if len(parts) > 1 else ''
            # Replace relative timestamps such as "5小時前" ("5 hours ago") with today's date
            if time.endswith('前'):
                now = datetime.now().timetuple()
                time = '%d年%d月%d日' % (now.tm_year, now.tm_mon, now.tm_mday)
            link = info.xpath('h3/a/@href').extract_first('')
            abstract = ''.join(info.xpath('div[@class="c-summary c-row "]//text()').extract()[1:-2])

            ninfo = BaiduxinwenItem()
            ninfo['company_name'] = company_name
            ninfo['company_id'] = company_id
            ninfo['title'] = title
            ninfo['source'] = source
            ninfo['time'] = time
            ninfo['link'] = link
            ninfo['abstract'] = abstract
            yield ninfo
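One detail worth flagging in the code above: pymysql is a blocking library, and parse() opens a fresh MySQL connection and runs a query for every response, inside Scrapy's single-threaded event loop, so any slow connect or query pauses the whole crawl. A minimal sketch of one alternative (not the author's code): load the entire mapping once at class-definition time, so parse() never touches MySQL during the crawl.

# Sketch: build the company_name -> company_id map up front
cursor.execute("select `company_name`, `company_id` from `wuxi_a_business_info`")
id_by_name = dict(cursor.fetchall())

# ...then inside parse():
#     company_id = self.id_by_name.get(company_name, '')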
Answer
不舍棄

If the same configuration crawls different sites and only one of them keeps stalling for a few seconds, the stalls are probably down to that site's response time.
I'd suggest adding some logging to pin down exactly where the problem is.
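A minimal sketch of such timing logs, using a downloader middleware: the process_request/process_response hooks are standard Scrapy, but the class name, module path, and 2-second threshold below are illustrative assumptions.

# middlewares.py -- log any response that took longer than a threshold
import time
import logging

logger = logging.getLogger(__name__)

class LatencyLoggerMiddleware:
    def process_request(self, request, spider):
        # Stamp every outgoing request with its start time
        request.meta['_start'] = time.time()

    def process_response(self, request, response, spider):
        elapsed = time.time() - request.meta.get('_start', time.time())
        if elapsed > 2.0:  # illustrative threshold: flag pauses longer than 2s
            logger.warning('Slow response %.2fs: %s', elapsed, request.url)
        return response

Enable it in settings.py, e.g. DOWNLOADER_MIDDLEWARES = {'yourproject.middlewares.LatencyLoggerMiddleware': 543} (module path hypothetical). Scrapy also stores each response's raw download time in response.meta['download_latency'], which can be logged straight from a callback.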

June 23, 2018, 14:11