為什么用gb2312解碼內涵段子吧得到的中文是亂碼？

用以下代碼獲取了網頁內容之后，解碼再編碼print出來的中文部分是亂碼?？赡苁鞘裁丛?/p>

import chardet # 一個檢查編碼的庫
url = "http://www.neihan8.com/article/list_5_1.html"
user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0'

headers = {'User-Agent': user_agent}
req = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(req)
html = response.read()
checkCode = chardet.detect(html) # 檢測網頁的編碼格式
print('checkCode', checkCode)  
#上面那句輸出的結果checkCode {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
gbk_html = html.decode(checkCode['encoding']).encode('utf-8')
print(gbk_html)

這上面的代碼運行結果里第一個title標簽如下所示

<title>\xd7\xee\xd3\xd0\xc4\xda\xba\xad\xb5\xc4\xd0\xa6\xbb\xb0_\xc3\xbf
\xc8\xd5\xd2\xbb\xd0\xa6\xca\xd5\xbc\xaf\xd7\xee\xd0\xc2\xb5\xc4\xc4\xda
\xba\xad\xb8\xe3\xd0\xa6\xb6\xce\xd7\xd3\xd0\xa6\xbb\xb0_\xbb\xe7\xb6\xce\
xd7\xd3\xd0\xa6\xbb\xb0_\xc4\xda\xba\xad\xb0\xc9</title>\r\n

回答里的代碼運行的結果第一個title標簽是這樣的


import requests
url = "http://www.neihan8.com/article/list_5_1.html"
print requests.get(url).content.decode('gb2312').encode('utf-8')

<title>\xe6\x9c\x80\xe6\x9c\x89\xe5\x86\x85\xe6\xb6\xb5\xe7\x9a\x84\xe7\
xac\x91\xe8\xaf\x9d_\xe6\xaf\x8f\xe6\x97\xa5\xe4\xb8\x80\xe7\xac\x91\xe6
\x94\xb6\xe9\x9b\x86\xe6\x9c\x80\xe6\x96\xb0\xe7\x9a\x84\xe5\x86\x85\xe6
\xb6\xb5\xe6\x90\x9e\xe7\xac\x91\xe6\xae\xb5\xe5\xad\x90\xe7\xac\x91\xe8
\xaf\x9d_\xe8\x8d\xa4\xe6\xae\xb5\xe5\xad\x90\xe7\xac\x91\xe8\xaf\x9d_\x
e5\x86\x85\xe6\xb6\xb5\xe5\x90\xa7</title>\r\n

是不是我這邊的ide的輸入輸出配置有問題？

回答

編輯回答

祈歡

你在爬去一個網頁的內容首先要看網頁的編碼方式的，一般在網頁的head中。然后在爬取的時候在選擇相應的編碼方式
圖片描述

2017年7月21日 06:44

編輯回答

離人歸

多謝大家的回答。
不過問題確實有好幾個。我把后來改正過來代碼貼一下。

# python3
url = "http://www.neihan8.com/article/list_5_" + str(page) + ".html"
user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0'

headers = {'User-Agent': user_agent}
req = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(req)
html = response.read()
print('python3 response.read()', type(html))
checkCode = chardet.detect(html)
print('checkCode', checkCode)
_html = html.decode(checkCode['encoding'])
print('python3 response.read().decode(gb2312)',type(_html))

# 輸出：
python3 response.read() <class 'bytes'>
checkCode {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
python3 response.read().decode(gb2312) <class 'str'>
python3 type(requests.get(url).content) <class 'bytes'>
python3 type(requests.get(url).content.decode('gb2312')) <class 'str'>

下面是python2的代碼：

url = "http://www.neihan8.com/article/list_5_" + str(page) + ".html"
user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0'

headers = {'User-Agent': user_agent}
req = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(req)
html = response.read()
print('python2 response.read()', type(html))
checkCode = chardet.detect(html)
print('checkCode', checkCode)
#gbk_html = html.encode(checkCode['encoding']).decode('utf-8')
_html = html.decode('gb2312')#.decode('gb2312')
print('python2 response.read().decode(\'gb2312\')', type(_html))

# 輸出：
('python2 response.read()', <type 'str'>)
('checkCode', {'confidence': 0.99, 'language': 'Chinese', 'encoding': 'GB2312'})
("python2 response.read().decode('gb2312')", <type 'unicode'>)
('python2 type(requests.get(url).content)', <type 'str'>)
("python2 type(requests.get(url).content.decode('gb2312'))", <type 'unicode'>)

雖然2和3返回的源編碼不是同一類型，但只要decode成unicode格式就能print出來了，總結起來還是對輸入輸出的編碼理解有問題。

2018年5月15日 14:53

編輯回答

小眼睛

首先推薦你使用requests，簡單好用。其次，你要的這個功能，這樣子就能解決：

import requests
url = "http://www.neihan8.com/article/list_5_1.html"
print requests.get(url).content.decode('gb2312').encode('utf-8')

2017年11月30日 03:46

編輯回答

初心

import requests

url = "http://www.neihan8.com/article/list_5_1.html"
user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0'
headers = {'User-Agent': user_agent}
req = requests.get(url, headers=headers)

print(req.encoding)
req.encoding = 'gbk'
print(req.encoding)
print(req.text)

圖片描述

2018年5月10日 16:47