Python crawler output is repeated five times

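When I scrape the China Weather site (weather.com.cn), there are 34 provinces in total, but the results come out repeated 5 times. You can run it and see for yourself.
I don't know why the duplication happens.
The code is as follows: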

from bs4 import BeautifulSoup
import requests
import time
# Step 1: download the full page (requests)
# Step 2: filter the downloaded data, extracting what is needed and discarding the rest (bs4)

# GET/POST
def get_temperature(url): 
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0',
        'Referer':'http://www.weather.com.cn/textFC/hb.shtml',
        'Host':'www.weather.com.cn'
    }
    data=requests.get(url,headers=headers)
    # Printing data.content directly garbles the Chinese text: it shows up as
    # escapes such as \xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80, because content is bytes.
    # Python 3 prints a bytes literal rather than a readable string, so it must be decoded first.
    # Hence the conversion: html = str(data.content, 'utf-8')
    h = str(data.content, 'utf-8')
    # print(h)
    user = h

    # The data that actually matters sits under: div class > table > tr > td

    soup=BeautifulSoup(user,'lxml')
    conMidtab=soup.find_all('div',class_="conMidtab")
    conMidtab2_list=soup.find_all('div',class_="conMidtab2")
    for x in conMidtab2_list:
        tr_list=x.find_all('tr')[2:]  # lists are 0-indexed; the province rows start at index 2
        province='1'  # placeholder, overwritten by the first row
        for index,tr in enumerate(tr_list):
            if index==0:
                td_list=tr.find_all('td')
                province=td_list[0].text.replace('\n','')
                city=td_list[1].text.replace('\n','')
                weather=td_list[5].text.replace('\n','')
                wind=td_list[6].text.replace('\n','')
                tmin=td_list[7].text.replace('\n','')
            else:
                td_list=tr.find_all('td')
                city=td_list[0].text.replace('\n','')
                weather=td_list[4].text.replace('\n','')
                wind=td_list[5].text.replace('\n','')
                tmin=td_list[6].text.replace('\n','')
            print('%s %s %s %s %s' % (province,city,weather,wind,tmin))  # replace('\n','') above strips the newlines from each cell
def main():
    urls=['http://www.weather.com.cn/textFC/hb.shtml',
          'http://www.weather.com.cn/textFC/db.shtml',
          'http://www.weather.com.cn/textFC/hd.shtml',
          'http://www.weather.com.cn/textFC/hz.shtml',
          'http://www.weather.com.cn/textFC/hn.shtml',
          'http://www.weather.com.cn/textFC/xb.shtml',
          'http://www.weather.com.cn/textFC/xn.shtml']
    for url in urls:
        get_temperature(url)
        time.sleep(2)
if __name__=='__main__':
    main()

Answers

舊言

5 times? Isn't it 7?

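You loop directly over every city under conMidtab2, but you missed that the page holds the weather for more than one day, so each province table is matched once per day.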
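
To illustrate the point, here is a minimal sketch on a made-up miniature of the page. The sample HTML, the `province_tables` helper and its `first_day_only` parameter are assumptions for illustration, not the real markup of weather.com.cn; `html.parser` is used instead of `lxml` only so the snippet has no extra dependency. It shows that searching the whole document collects the tables of every day, while searching inside the first `conMidtab` block keeps one day only.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page: two day blocks, each with two province tables.
SAMPLE_HTML = """
<div class="conMidtab">
  <div class="conMidtab2"><table><tr><td>Beijing</td></tr></table></div>
  <div class="conMidtab2"><table><tr><td>Tianjin</td></tr></table></div>
</div>
<div class="conMidtab">
  <div class="conMidtab2"><table><tr><td>Beijing</td></tr></table></div>
  <div class="conMidtab2"><table><tr><td>Tianjin</td></tr></table></div>
</div>
"""


def province_tables(html, first_day_only=True):
    """Return the conMidtab2 blocks, optionally limited to the first day block."""
    soup = BeautifulSoup(html, "html.parser")
    if first_day_only:
        # find() returns only the first matching day block (or None).
        first_day = soup.find("div", class_="conMidtab")
        if first_day is None:
            return []
        return first_day.find_all("div", class_="conMidtab2")
    # Searching from the document root picks up the tables of every day.
    return soup.find_all("div", class_="conMidtab2")


all_days = province_tables(SAMPLE_HTML, first_day_only=False)
today = province_tables(SAMPLE_HTML)
print(len(all_days), len(today))
```

In the question's code, the equivalent change would be to call `find_all('div', class_="conMidtab2")` on the first `conMidtab` element rather than on `soup`.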

August 3, 2017 00:05

淺時光

conMidtab=soup.select('body > div.lqcontentBoxH > div.contentboxTab > div > div > div.hanml > div')[0]
conMidtab2_list=conMidtab.find_all('div',class_="conMidtab2")
print(len(conMidtab2_list))    # 5; yours gives 35, because the blocks that are not displayed get printed too
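
If the long selector turns out to be brittle, another option might be to skip the blocks that are not displayed. The sketch below assumes the hidden day blocks carry an inline `style="display:none;"`; that is an assumption about the markup, so check the page source first. The sample HTML and the helpers `is_hidden` and `visible_day_blocks` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one visible day block and one hidden via inline style.
SAMPLE_HTML = """
<div class="conMidtab">
  <div class="conMidtab2"><table><tr><td>Hebei</td></tr></table></div>
</div>
<div class="conMidtab" style="display: none;">
  <div class="conMidtab2"><table><tr><td>Hebei</td></tr></table></div>
</div>
"""


def is_hidden(tag):
    """True if the tag has an inline style containing display:none (assumed convention)."""
    style = (tag.get("style") or "").replace(" ", "").lower()
    return "display:none" in style


def visible_day_blocks(html):
    """Return the conMidtab blocks that are not hidden by an inline style."""
    soup = BeautifulSoup(html, "html.parser")
    return [d for d in soup.find_all("div", class_="conMidtab") if not is_hidden(d)]


visible = visible_day_blocks(SAMPLE_HTML)
tables = [t for day in visible for t in day.find_all("div", class_="conMidtab2")]
print(len(visible), len(tables))
```

This only catches inline styles; blocks hidden through a stylesheet or script would still need the selector approach above.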

October 6, 2017 22:11