

How do I download images with scrapy and sort them into folders by gallery?

Problem description

When downloading from http://www.umei.cc/p/gaoqing/..., I cannot get all the images of one gallery saved into the same directory.

Environment and what I have tried

I tried many approaches found online, but none of them solved it.

Relevant code


# coding: utf-8
import re
import logging

from bs4 import BeautifulSoup, Comment
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request

from z2.items import Z2Item

logging.basicConfig(
    level=logging.INFO,
    format=
    '%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
    datefmt='%a, %d %b %Y %H:%M:%S',
    filename='cataline.log',
    filemode='w')

class Spider(CrawlSpider):
    name = 'z2'
    img_urls = []  # class-level list: shared across every gallery the spider crawls
    allowed_domains = ["www.umei.cc"]
    start_urls = ['http://www.umei.cc/p/gaoqing/rihan/']
    # rules = (
    #     Rule(LinkExtractor(allow=('http://www.umei.cc/p/gaoqing/rihan/\d{1,6}.htm',), deny=('http://www.umei.cc/p/gaoqing/rihan/\d{1,6}_\d{1,6}.htm')),
    #          callback='parse_z2_info', follow=True),
    # )

    def start_requests(self):
        yield Request(url='http://www.umei.cc/p/gaoqing/rihan/',
                      callback=self.parse_z2_key)

    def parse_z2_key(self, response):
        soup = BeautifulSoup(response.body, "lxml")
        content = soup.find("div", attrs={'class': 'TypeList'})
        # logging.debug(content)
        for link in content.findAll("a", attrs={'href': re.compile( r'(.*)(/rihan/)(\d{1,6})(.htm)'), 'class': 'TypeBigPics'}):
            logging.debug(link['href'])
            yield Request(url=link['href'],
                          callback=self.parse_z2_info)
            break

    def parse_z2_info(self, response):
        soup = BeautifulSoup(response.body, "lxml")
        item = Z2Item()
        # Strip HTML comments
        for element in soup(text=lambda text: isinstance(text, Comment)):
            element.extract()

        # Strip <script> tags
        [s.extract() for s in soup('script')]

        # Strip <b> tags
        [s.extract() for s in soup('b')]


        ArticleDesc = soup.find("p", attrs={'class': 'ArticleDesc'})
        logging.debug(ArticleDesc.get_text())

        Pages = soup.find("div", attrs={'class': 'NewPages'}).find('li')
        pageCounts = filter(str.isdigit, Pages.get_text().encode('gbk'))
        # Method 1: extract the digits from a string that contains Chinese characters
        # logging.debug(re.findall(r"\d+\.?\d*", Pages.get_text())[0])

        # Method 2
        # logging.debug(Pages.get_text()[1:-3])

        # Method 3
        logging.debug(filter(str.isdigit, Pages.get_text().encode('gbk')))

        # img = soup.find("div", attrs={'class': 'ImageBody'}).find('img')
        # url = img.attrs['src']
        # self.img_urls.append(url)
        # logging.debug(self.img_urls)

        item['name'] = re.match(".*/(\d+)", response.url).group(1)
        logging.debug(item['name'])

        # image_urls = []
        # item['image_urls'] = image_urls
        sourceUrl = response.url[0:-4]
        # logging.debug(sourceUrl)
        for i in xrange(1, int(pageCounts) + 1):
            nextUrl = sourceUrl + '_' + str(i) + '.htm'
            # logging.debug(nextUrl)
            yield Request(url=nextUrl, callback=self.parse_z2_single_img)
        # Note: the requests scheduled above have not run yet, so self.img_urls
        # is still incomplete (and mixed across galleries) when the item is yielded
        item['image_urls'] = self.img_urls
        yield item


    def parse_z2_single_img(self, response):
        soup = BeautifulSoup(response.body, "lxml")
        img = soup.find("div", attrs={'class': 'ImageBody'}).find('img')
        url = img.attrs['src']
        self.img_urls.append(url)

What result did you expect? What error message did you actually see?

Answer
筱饞貓

Solved it in the end: use the URL to identify the folder. Images of the same gallery share the same URL prefix, so use that part of the URL as the folder name.

Jan 29, 2017, 20:56
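
For reference, Scrapy's stock ImagesPipeline saves every file flat under full/<sha1-of-url>.jpg, which is why galleries end up mixed together. Grouping per gallery is usually done by subclassing ImagesPipeline and overriding get_media_requests() and file_path() (both standard hooks). Below is a minimal sketch along the lines of the accepted answer, assuming item['name'] holds the gallery id extracted from the page URL as in the spider above; the class name GalleryImagesPipeline and the settings values are illustrative, not from the original project.

# pipelines.py
import os
from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline

class GalleryImagesPipeline(ImagesPipeline):
    """Save every image of a gallery under a folder named after the gallery."""

    def get_media_requests(self, item, info):
        # Attach the gallery id to each image request so file_path() can read it back
        for image_url in item['image_urls']:
            yield Request(image_url, meta={'gallery': item['name']})

    def file_path(self, request, response=None, info=None):
        # <gallery id>/<original file name taken from the image URL>
        gallery = request.meta['gallery']
        image_name = request.url.split('/')[-1]
        return os.path.join(gallery, image_name)

Enable it in settings.py (module path assumed):

ITEM_PIPELINES = {'z2.pipelines.GalleryImagesPipeline': 1}
IMAGES_STORE = 'images'  # root folder; each gallery becomes a subfolder

With this in place, all images of gallery 135 land in images/135/, which matches the answer's idea of using the shared URL prefix as the folder identifier.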