Chapter 6: Deep learning

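In the last chapter we learned that deep neural networks are often much harder to train than shallow neural networks. That's unfortunate, since we have good reason to believe that if we could train deep nets they'd be much more powerful than shallow nets. But while the news from the last chapter is discouraging, we won't let it stop us. In this chapter, we'll develop techniques which can be used to train deep networks, and apply them in practice. We'll also look at the broader picture, briefly reviewing recent progress on using deep nets for image recognition, speech recognition, and other applications. And we'll take a brief, speculative look at what the future may hold for neural nets, and for artificial intelligence.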

This is a long chapter. To help you navigate, let's take a quick tour. The sections are only loosely coupled, so provided you have some basic familiarity with neural nets, you can jump to whatever most interests you.

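The main part of the chapter is an introduction to one of the most widely used types of deep network: deep convolutional networks. We'll work through a detailed example, with code and explanation, of using convolutional nets to classify handwritten digits from the MNIST data set: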

[Figure: sample digits from the MNIST data set (http://wiki.jikexueyuan.com/project/neural-networks-and-deep-learning-zh-cn/images/159.png)]

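We'll start our account of convolutional networks with shallow networks applied to this problem. Through many iterations we'll build up more and more powerful networks. As we go we'll explore many powerful techniques: convolutions, pooling, the use of GPUs to do far more training, the algorithmic expansion of our training data (to reduce overfitting), the use of the dropout technique (also to reduce overfitting), the use of ensembles of networks, and others. The result will be a system that offers near-human performance. Of the 10,000 MNIST test images, none of which are seen during training, our system will classify 9,967 correctly. Here's a peek at the 33 images which are misclassified. Note that the correct classification is in the top right; our program's classification is in the bottom right: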

[Figure: the 33 MNIST test images misclassified by the deep network (http://wiki.jikexueyuan.com/project/neural-networks-and-deep-learning-zh-cn/images/160.png)]

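Many of these are tough even for a human to classify. Consider, for example, the third image in the top row. To me it looks more like a "9" than an "8", which is the official classification. Our network also thinks it's a "9". This kind of "error" is at the very least understandable, and perhaps even commendable. We conclude our discussion of image recognition with a survey of some recent progress using deep (convolutional) networks to do image recognition.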

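The remainder of the chapter discusses deep learning from a broader and less detailed perspective. We'll briefly survey other models of neural networks, such as recurrent neural nets (RNNs) and long short-term memory (LSTM) networks, and how such models can be applied to problems in speech recognition, natural language processing, and other areas. And we'll speculate about the future of neural networks and deep learning, ranging from ideas like intention-driven user interfaces to the role of deep learning in artificial intelligence. The chapter builds on the earlier chapters in the book, making use of ideas such as backpropagation, regularization, the softmax function, and so on. However, to read the chapter you don't need to have worked in detail through all the earlier chapters. It will help to have read Chapter 1, on the basics of neural networks. When I use concepts from Chapters 2 to 5, I provide links so you can familiarize yourself with them, if necessary.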

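It's worth noting what the chapter is not. It's not a tutorial on the latest and greatest neural networks libraries. Nor are we going to be training deep networks with dozens of layers to solve problems at the very leading edge. Rather, the focus is on understanding some of the core principles behind deep neural networks, and applying them in the simple, easy-to-understand context of the MNIST problem. Put another way: the chapter is not going to bring you right up to the frontier. Rather, the intent of this and earlier chapters is to focus on fundamentals, and so to prepare you to understand a wide range of current work. The chapter is currently in beta. I welcome notification of typos, bugs, minor errors, and major misconceptions. Please drop me a line at mn@michaelnielsen.org if you spot such an error.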

Introducing convolutional networks

In earlier chapters, we taught our neural networks to do a pretty good job recognizing images of handwritten digits:

[Figure: MNIST handwritten digits (http://wiki.jikexueyuan.com/project/neural-networks-and-deep-learning-zh-cn/images/161.png)]

We did this using networks in which adjacent network layers are fully connected to one another. That is, every neuron in the network is connected to every neuron in adjacent layers:

[Figure: a fully connected deep neural network (http://wiki.jikexueyuan.com/project/neural-networks-and-deep-learning-zh-cn/images/162.png)]

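In particular, for each pixel in the input image, we encoded the pixel's intensity as the value for a corresponding neuron in the input layer. For $$28 \times 28$$ pixel images, this means our network needs $$784$$ $$(= 28 \times 28)$$ input neurons.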
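
The pixel-to-neuron encoding described above can be sketched in a few lines of plain Python. This is only an illustration: the `image` list and the `flatten` helper are hypothetical names, not part of the book's code, and every pixel is set to 0.0 just to have something to flatten.

```python
# Hypothetical 28x28 grayscale image stored as a list of rows.
# All intensities are 0.0 here; real MNIST pixels would lie in [0, 1].
ROWS, COLS = 28, 28
image = [[0.0 for _ in range(COLS)] for _ in range(ROWS)]


def flatten(img):
    """Return the pixels of a 2D image as one flat list, row by row."""
    return [pixel for row in img for pixel in row]


inputs = flatten(image)
print(len(inputs))  # 784
```

Each entry of `inputs` would feed exactly one input neuron, which is why the fully connected networks of earlier chapters ignore the spatial layout of the image.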

Convolutional neural networks in practice

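We've now seen the core ideas behind convolutional neural networks. Let's look at how they work in practice, by implementing some convolutional networks and applying them to the MNIST digit classification problem. The program we'll use to do this is called network3.py, and it's an improved version of the programs network.py and network2.py. The code is available on GitHub. Note that we'll work through the code for network3.py itself in the next section. In this section, we'll use network3.py directly to build convolutional networks.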

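The programs network.py and network2.py were implemented using Python and the matrix library NumPy. Those programs worked from first principles, and got right down into the details of backpropagation, stochastic gradient descent, and so on. Now that we understand those details, for network3.py we're going to use Theano to build our networks. Using Theano makes it easy to implement backpropagation for convolutional neural networks, since it automatically computes all the mappings involved. Theano is also quite a bit faster than our earlier code (which was written to be easy to understand, not fast), and this makes it practical to train more complex networks. In particular, Theano supports both CPUs and GPUs, so the Theano code we write can run on a GPU. This provides a substantial speedup in learning, and helps make even quite complex networks usable in practice.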

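If you wish to follow along, then you'll need to get Theano running on your system. To install Theano, follow the project's installation instructions. The examples which follow were run using Theano 0.6. Some were run under Mac OS X Yosemite, with no GPU. Some were run on Ubuntu 14.04, with an NVIDIA GPU. And some of the experiments were run under both. To get network3.py running you'll need to set the GPU flag to either True or False (as appropriate) in the network3.py source. Beyond that, to get Theano up and running on a GPU you may find the instructions here helpful. There are also tutorials on the web, easily found using Google. If you don't have a GPU available, then you may wish to look into Amazon Web Services EC2 G2 spot instances. Note that even with a GPU the code will take some time to execute. Many of the experiments take from minutes to hours to run. On a CPU it may take days to run the most complex of the experiments. As in earlier chapters, I suggest setting things running, and continuing to read, occasionally coming back to check the output from the code. If you're using a CPU, you may wish to reduce the number of training epochs, or perhaps omit those experiments entirely.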

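To get a baseline, we'll start with a shallow architecture using just a single hidden layer, containing $$100$$ hidden neurons. We'll train for $$60$$ epochs, using a learning rate of $$\eta = 0.1$$, a mini-batch size of $$10$$, and no regularization. Here we go: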

>>> import network3
>>> from network3 import Network
>>> from network3 import ConvPoolLayer, FullyConnectedLayer, SoftmaxLayer
>>> training_data, validation_data, test_data = network3.load_data_shared()
>>> mini_batch_size = 10
>>> net = Network([ FullyConnectedLayer(n_in=784, n_out=100), SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> net.SGD(training_data, 60, mini_batch_size, 0.1, validation_data, test_data)

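To get a feel for the size of this baseline network, here is a small worked count of its learnable parameters. It is a sketch in plain Python rather than Theano, and `dense_param_count` is a hypothetical helper introduced only for this illustration; the layer sizes are the ones passed to FullyConnectedLayer and SoftmaxLayer above.

```python
def dense_param_count(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output neuron."""
    return n_in * n_out + n_out


hidden = dense_param_count(784, 100)  # FullyConnectedLayer(n_in=784, n_out=100)
output = dense_param_count(100, 10)   # SoftmaxLayer(n_in=100, n_out=10)
total = hidden + output
print(hidden, output, total)  # 78500 1010 79510
```

Almost all of the parameters sit in the first layer, because every one of the 784 input pixels is connected to every hidden neuron.
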
The code for our convolutional networks

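Alright, let's take a look at the code for our program, network3.py. Structurally, it's similar to network2.py, although the details differ, due to the use of Theano. We'll start by looking at the FullyConnectedLayer class, which is similar to the layers studied earlier in the book. Here's the code: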

class FullyConnectedLayer(object):

    def __init__(self, n_in, n_out, activation_fn=sigmoid, p_dropout=0.0):
        self.n_in = n_in
        self.n_out = n_out
        self.activation_fn = activation_fn
        self.p_dropout = p_dropout
        # Initialize weights and biases
        self.w = theano.shared(
            np.asarray(
                np.random.normal(
                    loc=0.0, scale=np.sqrt(1.0/n_out), size=(n_in, n_out)),
                dtype=theano.config.floatX),
            name='w', borrow=True)
        self.b = theano.shared(
            np.asarray(np.random.normal(loc=0.0, scale=1.0, size=(n_out,)),
                       dtype=theano.config.floatX),
            name='b', borrow=True)
        self.params = [self.w, self.b]

    def set_inpt(self, inpt, inpt_dropout, mini_batch_size):
        self.inpt = inpt.reshape((mini_batch_size, self.n_in))
        self.output = self.activation_fn(
            (1-self.p_dropout)*T.dot(self.inpt, self.w) + self.b)
        self.y_out = T.argmax(self.output, axis=1)
        self.inpt_dropout = dropout_layer(
            inpt_dropout.reshape((mini_batch_size, self.n_in)), self.p_dropout)
        self.output_dropout = self.activation_fn(
            T.dot(self.inpt_dropout, self.w) + self.b)

    def accuracy(self, y):
        "Return the accuracy for the mini-batch."
        return T.mean(T.eq(y, self.y_out))

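Much of the __init__ method is self-explanatory, but a few remarks may help clarify the code. We randomly initialize the weights and biases as normal random variables. The lines doing this look a little forbidding, but most of the complication is just loading the weights and biases into what Theano calls shared variables. This ensures that these variables can be processed on the GPU. We won't get too much into the details of this. If you're interested, you can dig into the Theano documentation. Note also that this weight and bias initialization is designed for the sigmoid activation function (as discussed earlier in the book). Ideally, we'd initialize the weights and biases somewhat differently for other activation functions, such as the tanh and rectified linear function. This is discussed further in the problems below. The __init__ method finishes with self.params = [self.w, self.b]. This is a handy way to bundle up all the learnable parameters associated with the layer. Later on, the Network.SGD method will use the params attributes to figure out which variables in a Network instance can learn.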
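
Stripped of the Theano shared-variable wrapping, the initialization above amounts to drawing Gaussian samples. The following is a minimal sketch using only Python's standard library; `init_fully_connected` and the `seed` argument are hypothetical, and the standard deviations are copied from the listing (sqrt(1/n_out) for weights, 1 for biases).

```python
import math
import random


def init_fully_connected(n_in, n_out, seed=0):
    """Draw weights ~ N(0, 1/n_out) and biases ~ N(0, 1), as in the listing."""
    rng = random.Random(seed)
    w_scale = math.sqrt(1.0 / n_out)
    w = [[rng.gauss(0.0, w_scale) for _ in range(n_out)] for _ in range(n_in)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n_out)]
    return w, b


w, b = init_fully_connected(n_in=784, n_out=100)

# Check the empirical spread of the weights against the intended scale.
flat = [x for row in w for x in row]
mean = sum(flat) / len(flat)
std = math.sqrt(sum((x - mean) ** 2 for x in flat) / len(flat))
print(len(w), len(w[0]), len(b))  # 784 100 100
print(std)                        # close to sqrt(1/100) = 0.1
```

In network3.py the same arrays are additionally cast to theano.config.floatX and placed in shared variables so that they can live on the GPU.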

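The set_inpt method is used to set the input to the layer, and to compute the corresponding output. I use the name inpt rather than input because input is a built-in function in Python, and shadowing built-ins tends to cause unpredictable behavior and difficult-to-diagnose bugs. Note that we actually set the input in two separate ways: as self.inpt and self.inpt_dropout. This is done because during training we may want to use dropout. If that's the case then we need to drop inputs with the probability self.p_dropout. That's what the dropout_layer call in the second-last statement of the set_inpt method is doing. So self.inpt_dropout and self.output_dropout are used during training, while self.inpt and self.output are used for all other purposes, e.g., evaluating accuracy on the validation and test data.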
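
The relationship between the two paths can be illustrated for a single neuron's weighted input. This is a sketch under stated assumptions, not the book's code: the function names are hypothetical, plain Python lists stand in for Theano tensors, and the activation function is left out. The training path zeroes each input with probability p_dropout; the evaluation path keeps every input but scales the weighted sum by 1 - p_dropout, mirroring the (1-self.p_dropout)*T.dot(...) term in the listing.

```python
import random


def dropout_mask(n, p_dropout, rng):
    """Return n values, each 1.0 with probability 1 - p_dropout, else 0.0."""
    return [1.0 if rng.random() >= p_dropout else 0.0 for _ in range(n)]


def weighted_input_train(x, w, b, p_dropout, rng):
    """Training path: randomly zero inputs, then take the weighted sum."""
    mask = dropout_mask(len(x), p_dropout, rng)
    return sum(m * xi * wi for m, xi, wi in zip(mask, x, w)) + b


def weighted_input_eval(x, w, b, p_dropout):
    """Evaluation path: keep all inputs, scale the weighted sum by 1 - p."""
    return (1.0 - p_dropout) * sum(xi * wi for xi, wi in zip(x, w)) + b


rng = random.Random(0)
x, w, b, p = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0], 0.1, 0.5

# Averaged over many random masks, the training path approaches the
# deterministic evaluation path.
trials = 20000
train_mean = sum(weighted_input_train(x, w, b, p, rng)
                 for _ in range(trials)) / trials
eval_value = weighted_input_eval(x, w, b, p)
print(train_mean, eval_value)
```

The scaling in the evaluation path is what keeps the expected weighted input the same whether or not dropout is active.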

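The ConvPoolLayer and SoftmaxLayer class definitions are similar to FullyConnectedLayer, so I won't excerpt the code here. If you're interested you can look at the full listing for network3.py, later in this section.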

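However, a couple of minor differences of detail are worth mentioning. Most obviously, in both ConvPoolLayer and SoftmaxLayer we compute the output activations in the way appropriate to that layer type. Fortunately, Theano makes that easy, providing built-in operations to compute convolutions, max-pooling, and the softmax function.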

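Less obviously, when we introduced the softmax layer, we never discussed how to initialize the weights and biases. Elsewhere we've argued that for sigmoid layers we should initialize the weights using suitably parameterized normal random variables. But that heuristic argument was specific to sigmoid neurons (and, with some amendment, to tanh neurons). There's no particular reason the argument should apply to softmax layers, so there's no a priori reason to apply that initialization again. Rather than do that, I shall initialize all the weights and biases to be $$0$$. This is a rather ad hoc procedure, but works well enough in practice.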
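
One consequence of the all-zero initialization is easy to check by hand: before any training, the softmax layer assigns the same probability to every class, whatever its input. Here is a small sketch in plain Python; the `softmax` helper and the input values are illustrative assumptions, not taken from network3.py.

```python
import math


def softmax(z):
    """Numerically stable softmax of a list of weighted inputs."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


n_in, n_out = 100, 10
w = [[0.0] * n_out for _ in range(n_in)]  # all-zero weights
b = [0.0] * n_out                         # all-zero biases
x = [0.3] * n_in                          # hypothetical hidden activations

z = [sum(x[i] * w[i][j] for i in range(n_in)) + b[j] for j in range(n_out)]
probs = softmax(z)
print(probs)  # ten equal probabilities of 0.1
```

Learning can still proceed from this symmetric starting point, because the gradient of the log-likelihood cost differs between the correct class and the other classes.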

Okay, we've looked at all the layer classes. What about the Network class? Let's start by looking at the __init__ method:

class Network(object):

    def __init__(self, layers, mini_batch_size):
        """Takes a list of `layers`, describing the network architecture, and
        a value for the `mini_batch_size` to be used during training
        by stochastic gradient descent.

        """
        self.layers = layers
        self.mini_batch_size = mini_batch_size
        self.params = [param for layer in self.layers for param in layer.params]
        self.x = T.matrix("x")  
        self.y = T.ivector("y")
        init_layer = self.layers[0]
        init_layer.set_inpt(self.x, self.x, self.mini_batch_size)
        for j in xrange(1, len(self.layers)):
            prev_layer, layer  = self.layers[j-1], self.layers[j]
            layer.set_inpt(
                prev_layer.output, prev_layer.output_dropout, self.mini_batch_size)
        self.output = self.layers[-1].output
        self.output_dropout = self.layers[-1].output_dropout

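Most of this is self-explanatory. The line self.params = [param for layer in ...] bundles up the parameters for each layer into a single list. The Network.SGD method will use self.params to figure out which variables in the Network can learn. The lines self.x = T.matrix("x") and self.y = T.ivector("y") define Theano symbolic variables named x and y. These will be used to represent the input and desired output from the network.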
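
The nested list comprehension that builds self.params can be tried in isolation. In this sketch, `ToyLayer` is a hypothetical stand-in whose parameters are just strings, so the flattening order is easy to see.

```python
class ToyLayer(object):
    """Hypothetical stand-in for a layer: its params are just labels."""

    def __init__(self, name):
        self.params = [name + ".w", name + ".b"]


layers = [ToyLayer("hidden"), ToyLayer("softmax")]

# Same comprehension as in Network.__init__: outer loop over layers,
# inner loop over each layer's own parameter list.
params = [param for layer in layers for param in layer.params]
print(params)  # ['hidden.w', 'hidden.b', 'softmax.w', 'softmax.b']
```

Because the result is one flat list, the SGD code can compute gradients and updates for every parameter in a single pass, without knowing which layer each one came from.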

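Now, this isn't a Theano tutorial, and so we won't get too deeply into what it means that these are symbolic variables. But the rough idea is that these represent mathematical variables, not explicit values. We can do all the usual things one would do with such variables: add, subtract, multiply, and divide them, apply functions, and so on. Indeed, Theano provides many ways of manipulating such symbolic variables, doing things like convolutions, max-pooling, and so on. But the big win is the ability to do fast symbolic differentiation, using a very general form of the backpropagation algorithm. This is extremely useful for applying stochastic gradient descent to a wide variety of network architectures. In particular, the next few lines of code define symbolic outputs from the network. With the line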

init_layer.set_inpt(self.x, self.x, self.mini_batch_size)

we set the input to the initial layer.

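Note that the inputs are set one mini-batch at a time, which is why the mini-batch size has to be specified. Note also that we pass the input self.x in twice: this is because we may use the network in two different ways (with or without dropout). The for loop then propagates the symbolic variable self.x forward through the layers of the Network. This allows us to define the final output and output_dropout attributes, which symbolically represent the output from the Network.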

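Now that we've understood how a Network is initialized, let's look at how it is trained, using the SGD method. The code looks lengthy, but its structure is actually rather simple. Explanatory comments follow the code.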

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            validation_data, test_data, lmbda=0.0):
        """Train the network using mini-batch stochastic gradient descent."""
        training_x, training_y = training_data
        validation_x, validation_y = validation_data
        test_x, test_y = test_data

        # compute number of minibatches for training, validation and testing
        num_training_batches = size(training_data)/mini_batch_size
        num_validation_batches = size(validation_data)/mini_batch_size
        num_test_batches = size(test_data)/mini_batch_size

        # define the (regularized) cost function, symbolic gradients, and updates
        l2_norm_squared = sum([(layer.w**2).sum() for layer in self.layers])
        cost = self.layers[-1].cost(self)+\
               0.5*lmbda*l2_norm_squared/num_training_batches
        grads = T.grad(cost, self.params)
        updates = [(param, param-eta*grad)
                   for param, grad in zip(self.params, grads)]

        # define functions to train a mini-batch, and to compute the
        # accuracy in validation and test mini-batches.
        i = T.lscalar() # mini-batch index
        train_mb = theano.function(
            [i], cost, updates=updates,
            givens={
                self.x:
                training_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
                self.y:
                training_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
            })
        validate_mb_accuracy = theano.function(
            [i], self.layers[-1].accuracy(self.y),
            givens={
                self.x:
                validation_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
                self.y:
                validation_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
            })
        test_mb_accuracy = theano.function(
            [i], self.layers[-1].accuracy(self.y),
            givens={
                self.x:
                test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
                self.y:
                test_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
            })
        self.test_mb_predictions = theano.function(
            [i], self.layers[-1].y_out,
            givens={
                self.x:
                test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
            })
        # Do the actual training
        best_validation_accuracy = 0.0
        for epoch in xrange(epochs):
            for minibatch_index in xrange(num_training_batches):
                iteration = num_training_batches*epoch+minibatch_index
                if iteration % 1000 == 0:
                    print("Training mini-batch number {0}".format(iteration))
                cost_ij = train_mb(minibatch_index)
                if (iteration+1) % num_training_batches == 0:
                    validation_accuracy = np.mean(
                        [validate_mb_accuracy(j) for j in xrange(num_validation_batches)])
                    print("Epoch {0}: validation accuracy {1:.2%}".format(
                        epoch, validation_accuracy))
                    if validation_accuracy >= best_validation_accuracy:
                        print("This is the best validation accuracy to date.")
                        best_validation_accuracy = validation_accuracy
                        best_iteration = iteration
                        if test_data:
                            test_accuracy = np.mean(
                                [test_mb_accuracy(j) for j in xrange(num_test_batches)])
                            print('The corresponding test accuracy is {0:.2%}'.format(
                                test_accuracy))
        print("Finished training network.")
        print("Best validation accuracy of {0:.2%} obtained at iteration {1}".format(
            best_validation_accuracy, best_iteration))
        print("Corresponding test accuracy of {0:.2%}".format(test_accuracy))

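The first few lines are straightforward, separating the datasets into x and y components, and computing the number of mini-batches used in each dataset. The next few lines are more interesting, and show some of what makes Theano fun to work with. Let's explicitly excerpt the lines here: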

# define the (regularized) cost function, symbolic gradients, and updates
l2_norm_squared = sum([(layer.w**2).sum() for layer in self.layers])
cost = self.layers[-1].cost(self)+\
       0.5*lmbda*l2_norm_squared/num_training_batches
grads = T.grad(cost, self.params)
updates = [(param, param-eta*grad)
           for param, grad in zip(self.params, grads)]

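In these lines we symbolically set up the regularized log-likelihood cost function, compute the corresponding derivatives in the gradient function, and define the corresponding parameter updates. Theano lets us achieve all of this in just these few lines. The only thing hidden is that computing cost involves a call to the cost method for the output layer; that code is elsewhere in network3.py. But that code is short and simple, anyway. With all these things defined, the stage is set to define the train_mb function, a Theano symbolic function which uses the updates to update the Network parameters, given a mini-batch index. Similarly, validate_mb_accuracy and test_mb_accuracy compute the accuracy of the Network on any given mini-batch of validation or test data. By averaging over these functions, we will be able to compute accuracies on the entire validation and test data sets.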
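
To make the three ingredients concrete (log-likelihood cost, L2 penalty, gradient step), here is a numeric sketch in plain Python. All function names and the toy numbers are hypothetical; the formulas follow the excerpt and SoftmaxLayer.cost in the listing, but the gradients are simply given, whereas Theano derives them symbolically with T.grad.

```python
import math


def log_likelihood_cost(outputs, labels):
    """Mean of -log(probability the network assigns to the correct class)."""
    return -sum(math.log(out[y]) for out, y in zip(outputs, labels)) / len(labels)


def l2_penalty(weights, lmbda, num_training_batches):
    """0.5 * lmbda * (sum of squared weights) / num_training_batches."""
    return 0.5 * lmbda * sum(w * w for w in weights) / num_training_batches


def sgd_step(params, grads, eta):
    """One update param -> param - eta * grad for each parameter."""
    return [p - eta * g for p, g in zip(params, grads)]


# Toy mini-batch: two examples, three classes (softmax outputs).
outputs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [0, 1]
cost = (log_likelihood_cost(outputs, labels)
        + l2_penalty([0.5, -0.5], lmbda=0.1, num_training_batches=5000))
print(cost)  # roughly 0.29

# Toy update with made-up gradients.
new_params = sgd_step([1.0, -2.0], [0.5, -0.5], eta=0.1)
print(new_params)  # roughly [0.95, -1.95]
```

In network3.py the pairs (param, param - eta*grad) are passed to theano.function as updates, so every call to train_mb applies this step to all shared parameters at once.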

The remainder of the SGD method is self-explanatory: we simply iterate over the epochs, repeatedly training the network on mini-batches of training data, and computing the validation and test accuracies.
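
The mini-batch bookkeeping used throughout SGD (integer number of batches, slices of the form i*size to (i+1)*size, and an average of per-batch accuracies) can be sketched without Theano. The helper names and the toy predictions below are assumptions made for this illustration only.

```python
def num_batches(n_examples, mini_batch_size):
    """Number of complete mini-batches; leftover examples are not used."""
    return n_examples // mini_batch_size


def mini_batch(data, i, mini_batch_size):
    """Slice out mini-batch number i, as the `givens` slices do."""
    return data[i * mini_batch_size:(i + 1) * mini_batch_size]


def mean_accuracy(predictions, labels, mini_batch_size):
    """Average the per-mini-batch accuracies over all complete batches."""
    n = num_batches(len(labels), mini_batch_size)
    per_batch = []
    for i in range(n):
        p = mini_batch(predictions, i, mini_batch_size)
        y = mini_batch(labels, i, mini_batch_size)
        correct = sum(1.0 for a, b in zip(p, y) if a == b)
        per_batch.append(correct / mini_batch_size)
    return sum(per_batch) / n


labels = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
predictions = [0, 1, 2, 3, 4, 5, 6, 7, 0, 0]  # last two are wrong
print(num_batches(50000, 10))                   # 5000
print(mean_accuracy(predictions, labels, 5))    # about 0.8
```

Since every batch has the same size, the mean of the per-batch accuracies equals the overall accuracy on the examples covered by complete batches.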

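Okay, we've now understood the most important pieces of code in network3.py. Let's take a brief look at the entire program. You don't need to read through this in detail, but you may enjoy glancing over it, and perhaps diving down into any pieces that strike your fancy. The best way to really understand it is, of course, by modifying it, adding extra features, or refactoring anything you think could be done more elegantly. After the code, there are some problems which contain a few starter suggestions for things to do. Here's the code: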

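Using Theano on a GPU can be a little tricky. In particular, it's easy to make the mistake of pulling data off the GPU, which can slow things down a lot. I've tried to avoid this, but I can't rule out such problems appearing once the code is extended. I'd be glad to hear about any problems you run into or suggestions you have (mn@michaelnielsen.org).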

"""network3.py
~~~~~~~~~~~~~~
A Theano-based program for training and running simple neural
networks.
Supports several layer types (fully connected, convolutional, max
pooling, softmax), and activation functions (sigmoid, tanh, and
rectified linear units, with more easily added).
When run on a CPU, this program is much faster than network.py and
network2.py.  However, unlike network.py and network2.py it can also
be run on a GPU, which makes it faster still.
Because the code is based on Theano, the code is different in many
ways from network.py and network2.py.  However, where possible I have
tried to maintain consistency with the earlier programs.  In
particular, the API is similar to network2.py.  Note that I have
focused on making the code simple, easily readable, and easily
modifiable.  It is not optimized, and omits many desirable features.
This program incorporates ideas from the Theano documentation on
convolutional neural nets (notably,
http://deeplearning.net/tutorial/lenet.html ), from Misha Denil's
implementation of dropout (https://github.com/mdenil/dropout ), and
from Chris Olah (http://colah.github.io ).
"""

#### Libraries
# Standard library
import cPickle
import gzip

# Third-party libraries
import numpy as np
import theano
import theano.tensor as T
from theano.tensor.nnet import conv
from theano.tensor.nnet import softmax
from theano.tensor import shared_randomstreams
from theano.tensor.signal import downsample

# Activation functions for neurons
def linear(z): return z
def ReLU(z): return T.maximum(0.0, z)
from theano.tensor.nnet import sigmoid
from theano.tensor import tanh

#### Constants
GPU = True
if GPU:
    print "Trying to run under a GPU.  If this is not desired, then modify "+\
        "network3.py\nto set the GPU flag to False."
    try: theano.config.device = 'gpu'
    except: pass # it's already set
    theano.config.floatX = 'float32'
else:
    print "Running with a CPU.  If this is not desired, then the modify "+\
        "network3.py to set\nthe GPU flag to True."

#### Load the MNIST data
def load_data_shared(filename="../data/mnist.pkl.gz"):
    f = gzip.open(filename, 'rb')
    training_data, validation_data, test_data = cPickle.load(f)
    f.close()
    def shared(data):
        """Place the data into shared variables.  This allows Theano to copy
        the data to the GPU, if one is available.
        """
        shared_x = theano.shared(
            np.asarray(data[0], dtype=theano.config.floatX), borrow=True)
        shared_y = theano.shared(
            np.asarray(data[1], dtype=theano.config.floatX), borrow=True)
        return shared_x, T.cast(shared_y, "int32")
    return [shared(training_data), shared(validation_data), shared(test_data)]

#### Main class used to construct and train networks
class Network(object):

    def __init__(self, layers, mini_batch_size):
        """Takes a list of `layers`, describing the network architecture, and
        a value for the `mini_batch_size` to be used during training
        by stochastic gradient descent.
        """
        self.layers = layers
        self.mini_batch_size = mini_batch_size
        self.params = [param for layer in self.layers for param in layer.params]
        self.x = T.matrix("x")
        self.y = T.ivector("y")
        init_layer = self.layers[0]
        init_layer.set_inpt(self.x, self.x, self.mini_batch_size)
        for j in xrange(1, len(self.layers)):
            prev_layer, layer  = self.layers[j-1], self.layers[j]
            layer.set_inpt(
                prev_layer.output, prev_layer.output_dropout, self.mini_batch_size)
        self.output = self.layers[-1].output
        self.output_dropout = self.layers[-1].output_dropout

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            validation_data, test_data, lmbda=0.0):
        """Train the network using mini-batch stochastic gradient descent."""
        training_x, training_y = training_data
        validation_x, validation_y = validation_data
        test_x, test_y = test_data

        # compute number of minibatches for training, validation and testing
        num_training_batches = size(training_data)/mini_batch_size
        num_validation_batches = size(validation_data)/mini_batch_size
        num_test_batches = size(test_data)/mini_batch_size

        # define the (regularized) cost function, symbolic gradients, and updates
        l2_norm_squared = sum([(layer.w**2).sum() for layer in self.layers])
        cost = self.layers[-1].cost(self)+\
               0.5*lmbda*l2_norm_squared/num_training_batches
        grads = T.grad(cost, self.params)
        updates = [(param, param-eta*grad)
                   for param, grad in zip(self.params, grads)]

        # define functions to train a mini-batch, and to compute the
        # accuracy in validation and test mini-batches.
        i = T.lscalar() # mini-batch index
        train_mb = theano.function(
            [i], cost, updates=updates,
            givens={
                self.x:
                training_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
                self.y:
                training_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
            })
        validate_mb_accuracy = theano.function(
            [i], self.layers[-1].accuracy(self.y),
            givens={
                self.x:
                validation_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
                self.y:
                validation_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
            })
        test_mb_accuracy = theano.function(
            [i], self.layers[-1].accuracy(self.y),
            givens={
                self.x:
                test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
                self.y:
                test_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
            })
        self.test_mb_predictions = theano.function(
            [i], self.layers[-1].y_out,
            givens={
                self.x:
                test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
            })
        # Do the actual training
        best_validation_accuracy = 0.0
        for epoch in xrange(epochs):
            for minibatch_index in xrange(num_training_batches):
                iteration = num_training_batches*epoch+minibatch_index
                if iteration % 1000 == 0:
                    print("Training mini-batch number {0}".format(iteration))
                cost_ij = train_mb(minibatch_index)
                if (iteration+1) % num_training_batches == 0:
                    validation_accuracy = np.mean(
                        [validate_mb_accuracy(j) for j in xrange(num_validation_batches)])
                    print("Epoch {0}: validation accuracy {1:.2%}".format(
                        epoch, validation_accuracy))
                    if validation_accuracy >= best_validation_accuracy:
                        print("This is the best validation accuracy to date.")
                        best_validation_accuracy = validation_accuracy
                        best_iteration = iteration
                        if test_data:
                            test_accuracy = np.mean(
                                [test_mb_accuracy(j) for j in xrange(num_test_batches)])
                            print('The corresponding test accuracy is {0:.2%}'.format(
                                test_accuracy))
        print("Finished training network.")
        print("Best validation accuracy of {0:.2%} obtained at iteration {1}".format(
            best_validation_accuracy, best_iteration))
        print("Corresponding test accuracy of {0:.2%}".format(test_accuracy))

#### Define layer types

class ConvPoolLayer(object):
    """Used to create a combination of a convolutional and a max-pooling
    layer.  A more sophisticated implementation would separate the
    two, but for our purposes we'll always use them together, and it
    simplifies the code, so it makes sense to combine them.
    """

    def __init__(self, filter_shape, image_shape, poolsize=(2, 2),
                 activation_fn=sigmoid):
        """`filter_shape` is a tuple of length 4, whose entries are the number
        of filters, the number of input feature maps, the filter height, and the
        filter width.
        `image_shape` is a tuple of length 4, whose entries are the
        mini-batch size, the number of input feature maps, the image
        height, and the image width.
        `poolsize` is a tuple of length 2, whose entries are the y and
        x pooling sizes.
        """
        self.filter_shape = filter_shape
        self.image_shape = image_shape
        self.poolsize = poolsize
        self.activation_fn=activation_fn
        # initialize weights and biases
        n_out = (filter_shape[0]*np.prod(filter_shape[2:])/np.prod(poolsize))
        self.w = theano.shared(
            np.asarray(
                np.random.normal(loc=0, scale=np.sqrt(1.0/n_out), size=filter_shape),
                dtype=theano.config.floatX),
            borrow=True)
        self.b = theano.shared(
            np.asarray(
                np.random.normal(loc=0, scale=1.0, size=(filter_shape[0],)),
                dtype=theano.config.floatX),
            borrow=True)
        self.params = [self.w, self.b]

    def set_inpt(self, inpt, inpt_dropout, mini_batch_size):
        self.inpt = inpt.reshape(self.image_shape)
        conv_out = conv.conv2d(
            input=self.inpt, filters=self.w, filter_shape=self.filter_shape,
            image_shape=self.image_shape)
        pooled_out = downsample.max_pool_2d(
            input=conv_out, ds=self.poolsize, ignore_border=True)
        self.output = self.activation_fn(
            pooled_out + self.b.dimshuffle('x', 0, 'x', 'x'))
        self.output_dropout = self.output # no dropout in the convolutional layers

class FullyConnectedLayer(object):

    def __init__(self, n_in, n_out, activation_fn=sigmoid, p_dropout=0.0):
        self.n_in = n_in
        self.n_out = n_out
        self.activation_fn = activation_fn
        self.p_dropout = p_dropout
        # Initialize weights and biases
        self.w = theano.shared(
            np.asarray(
                np.random.normal(
                    loc=0.0, scale=np.sqrt(1.0/n_out), size=(n_in, n_out)),
                dtype=theano.config.floatX),
            name='w', borrow=True)
        self.b = theano.shared(
            np.asarray(np.random.normal(loc=0.0, scale=1.0, size=(n_out,)),
                       dtype=theano.config.floatX),
            name='b', borrow=True)
        self.params = [self.w, self.b]

    def set_inpt(self, inpt, inpt_dropout, mini_batch_size):
        self.inpt = inpt.reshape((mini_batch_size, self.n_in))
        self.output = self.activation_fn(
            (1-self.p_dropout)*T.dot(self.inpt, self.w) + self.b)
        self.y_out = T.argmax(self.output, axis=1)
        self.inpt_dropout = dropout_layer(
            inpt_dropout.reshape((mini_batch_size, self.n_in)), self.p_dropout)
        self.output_dropout = self.activation_fn(
            T.dot(self.inpt_dropout, self.w) + self.b)

    def accuracy(self, y):
        "Return the accuracy for the mini-batch."
        return T.mean(T.eq(y, self.y_out))

class SoftmaxLayer(object):

    def __init__(self, n_in, n_out, p_dropout=0.0):
        self.n_in = n_in
        self.n_out = n_out
        self.p_dropout = p_dropout
        # Initialize weights and biases
        self.w = theano.shared(
            np.zeros((n_in, n_out), dtype=theano.config.floatX),
            name='w', borrow=True)
        self.b = theano.shared(
            np.zeros((n_out,), dtype=theano.config.floatX),
            name='b', borrow=True)
        self.params = [self.w, self.b]

    def set_inpt(self, inpt, inpt_dropout, mini_batch_size):
        self.inpt = inpt.reshape((mini_batch_size, self.n_in))
        self.output = softmax((1-self.p_dropout)*T.dot(self.inpt, self.w) + self.b)
        self.y_out = T.argmax(self.output, axis=1)
        self.inpt_dropout = dropout_layer(
            inpt_dropout.reshape((mini_batch_size, self.n_in)), self.p_dropout)
        self.output_dropout = softmax(T.dot(self.inpt_dropout, self.w) + self.b)

    def cost(self, net):
        "Return the log-likelihood cost."
        return -T.mean(T.log(self.output_dropout)[T.arange(net.y.shape[0]), net.y])

    def accuracy(self, y):
        "Return the accuracy for the mini-batch."
        return T.mean(T.eq(y, self.y_out))

#### Miscellanea
def size(data):
    "Return the size of the dataset `data`."
    return data[0].get_value(borrow=True).shape[0]

def dropout_layer(layer, p_dropout):
    srng = shared_randomstreams.RandomStreams(
        np.random.RandomState(0).randint(999999))
    mask = srng.binomial(n=1, p=1-p_dropout, size=layer.shape)
    return layer*T.cast(mask, theano.config.floatX)
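
As a reading aid for ConvPoolLayer, the following sketch computes the shape of the layer's output from its `image_shape`, `filter_shape`, and `poolsize` arguments. It is written under two assumptions that should be checked against the Theano version in use: that conv.conv2d performs a "valid" convolution with stride 1, and that max-pooling with ignore_border=True discards any partial pooling window. The helper name and the example sizes (20 filters of size 5 by 5) are hypothetical.

```python
def conv_pool_output_shape(image_shape, filter_shape, poolsize=(2, 2)):
    """Shape after a valid, stride-1 convolution followed by max-pooling.

    image_shape:  (mini-batch size, input feature maps, height, width)
    filter_shape: (number of filters, input feature maps, height, width)
    """
    mini_batch_size, n_in_maps, img_h, img_w = image_shape
    n_filters, n_filter_in_maps, f_h, f_w = filter_shape
    if n_in_maps != n_filter_in_maps:
        raise ValueError("filters must match the number of input feature maps")
    conv_h, conv_w = img_h - f_h + 1, img_w - f_w + 1
    # Floor division models ignore_border=True (partial windows dropped).
    return (mini_batch_size, n_filters,
            conv_h // poolsize[0], conv_w // poolsize[1])


# A 28x28 MNIST image, 20 hypothetical 5x5 filters, 2x2 pooling.
print(conv_pool_output_shape((10, 1, 28, 28), (20, 1, 5, 5)))  # (10, 20, 12, 12)
```

The product of the last three entries (here 20 * 12 * 12) is the value a following FullyConnectedLayer would need as its n_in.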

Problems

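At present, the SGD method requires the user to manually choose the number of epochs to train for. Earlier in the book we discussed an automated way of selecting the number of epochs to train for, known as early stopping. Modify network3.py to implement early stopping.
Add a Network method to return the accuracy on an arbitrary data set.
Modify the SGD method to allow the learning rate $$\eta$$ to be a function of the epoch number. Hint: After working on this problem for a while, you may find it useful to see the discussion at this link.
Earlier in the chapter I described a technique for expanding the training data by applying small rotations, skews, and translations. Modify network3.py to incorporate these techniques. Note: Unless you have a tremendous amount of memory, it is not practical to explicitly generate the entire expanded data set. So you should consider alternate approaches.
Add the ability to load and save networks to network3.py.
A shortcoming of the current code is that it provides few diagnostic tools. Can you think of any diagnostics to add that would make it easier to understand to what extent a network is overfitting? Add them.
We've used the same initialization procedure for rectified linear units as for sigmoid (and tanh) neurons. As discussed earlier, our argument for that initialization was specific to the sigmoid function. Consider a network made entirely of rectified linear units. Show that rescaling all the weights in the network by a constant factor $$c$$ only rescales the outputs by a constant factor determined by $$c$$. How does this change if the final layer is a softmax? What do you think of using the sigmoid initialization procedure for the rectified linear units? Can you think of a better initialization procedure? Note: This is a very open-ended problem, not something with a simple self-contained answer. Still, considering the problem will help you better understand networks containing rectified linear units.
Our analysis of the unstable gradient problem was for sigmoid neurons. How does the analysis change for networks made up of rectified linear units? Can you think of a good way of modifying such a network so it doesn't suffer from the unstable gradient problem? Note: The word "good" in the second part of this makes the problem a research problem. It's actually easy to think of ways of making such modifications. But I haven't investigated in enough depth to know of a really good technique.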
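
As one possible starting point for the early stopping problem above, here is a sketch of a "no improvement in the last n epochs" rule in plain Python. It is not the book's solution: `should_stop` and the `patience` parameter are hypothetical, and wiring the check into Network.SGD (for example, right after validation_accuracy is computed each epoch) is left as part of the exercise.

```python
def should_stop(validation_accuracies, patience=10):
    """Return True when the best validation accuracy is at least
    `patience` epochs old, i.e. no later epoch has improved on it."""
    if not validation_accuracies:
        return False
    # Index of the first epoch that reached the best accuracy so far.
    best_epoch = max(range(len(validation_accuracies)),
                     key=lambda e: validation_accuracies[e])
    epochs_since_best = len(validation_accuracies) - 1 - best_epoch
    return epochs_since_best >= patience


history = [0.90, 0.95, 0.94, 0.93]
print(should_stop(history, patience=2))             # True
print(should_stop([0.90, 0.91, 0.92], patience=2))  # False
```

A tie with the earlier best does not count as an improvement in this sketch; whether that is the right choice is a design decision worth experimenting with.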

Recent progress in image recognition

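In 1998, the year MNIST was introduced, it took weeks to train a state-of-the-art model, and the accuracies achieved were substantially worse than those we can achieve using a GPU and less than an hour of training. Thus, MNIST is no longer a problem that pushes the limits of available technique; rather, the speed of training means that it is a problem good for teaching and learning purposes. Meanwhile, the focus of research has moved on, and modern work involves much more challenging image recognition problems. In this section, I briefly describe some recent work on image recognition using neural networks.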

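This section is different from most of the book. Through the book I've focused on ideas likely to be of lasting interest, such as backpropagation, regularization, and convolutional networks. I've tried to avoid results which are fashionable as I write, but whose long-term value is unknown. In science, such results are more often than not ephemera which fade and have little lasting impact. Given this, a skeptic might say: "Well, surely the recent progress in image recognition is an example of such ephemera? In another two or three years, things will have moved on. So surely these results are only of interest to a few specialists who want to compete at the absolute frontier? Why bother discussing it?"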

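Such a skeptic is right that some of the finer details of recent papers will gradually diminish in perceived importance. With that said, the past few years have seen extraordinary improvements using deep nets to attack extremely difficult image recognition tasks. Imagine a historian of science writing about computer vision in the year 2100. They will identify the years 2011 to 2015 (and probably a few years beyond) as a time of huge breakthroughs, driven by deep convolutional nets. That doesn't mean deep convolutional nets will still be used in 2100, much less detailed ideas such as dropout, rectified linear units, and so on. But it does mean that an important transition is taking place, right now, in the history of ideas. It's a bit like watching the discovery of the atom, or the invention of antibiotics: invention and discovery on a historic scale. And so while we won't dig down deep into details, it's worth getting some idea of the exciting discoveries currently being made.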

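The 2012 LRMD paper: Let me start with a 2012 paper from a group of researchers from Stanford and Google. I'll refer to this paper as LRMD, after the last names of the first four authors. LRMD used a neural network to classify images from ImageNet, a very challenging image recognition problem. The 2011 ImageNet data that they used included $$16,000,000$$ full color images, in $$20,000$$ categories. The images were crawled from the open net, and classified by workers from Amazon's Mechanical Turk service. Here are a few ImageNet images: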

[Figure: example ImageNet images (http://wiki.jikexueyuan.com/project/neural-networks-and-deep-learning-zh-cn/images/163.png)]

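These are, respectively, in the categories for beading plane, brown root rot fungus, scalded milk, and the common roundworm. If you're looking for a challenge, I encourage you to visit ImageNet's list of hand tools, which distinguishes between beading planes, block planes, chamfer planes, and about a dozen other types of plane, amongst other categories. I don't know about you, but I cannot confidently distinguish between all these tool types. This is obviously a much more challenging image recognition task than MNIST. LRMD's network obtained a respectable 15.8% accuracy. That may not sound impressive, but it was a huge improvement over the previous best result of 9.3% accuracy. This jump suggested that neural networks might offer a powerful approach to very challenging image recognition tasks.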

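The 2012 KSH paper: The work of LRMD was followed by a 2012 paper of Krizhevsky, Sutskever and Hinton (KSH). KSH trained and tested a deep convolutional neural network using a restricted subset of the ImageNet data. The subset they used came from a popular machine learning competition, the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC). Using a competition data set gave them a good way of comparing their approach to other leading techniques. The ILSVRC-2012 training set contained about $$1.2$$ million ImageNet images, drawn from $$1,000$$ categories. The validation and test sets contained $$50,000$$ and $$150,000$$ images, respectively, drawn from the same $$1,000$$ categories.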

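One difficulty in running the ILSVRC competition is that many ImageNet images contain multiple objects. Suppose an image shows a labrador retriever chasing a soccer ball. The so-called "correct" ImageNet classification of the image might be as a labrador retriever. Should an algorithm be penalized if it labels the image as a soccer ball? Because of this ambiguity, an algorithm was considered correct if the actual ImageNet classification was among the $$5$$ classifications the algorithm considered most likely. By this top-$$5$$ criterion, KSH's deep convolutional network achieved an accuracy of 84.7%, vastly better than the next-best contest entry, which achieved an accuracy of 73.8%. Using the more restrictive metric of getting the label exactly right, KSH's network achieved an accuracy of 63.3%.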
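
The top-5 criterion is simple to state in code. The sketch below uses plain Python and hypothetical names (`top_k_correct`, `top_k_accuracy`) with a made-up score vector over 7 classes; it only illustrates the scoring rule, not the official ILSVRC evaluation code.

```python
def top_k_correct(scores, true_label, k=5):
    """True if `true_label` is among the k highest-scoring classes."""
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return true_label in ranked[:k]


def top_k_accuracy(all_scores, true_labels, k=5):
    """Fraction of examples whose true label is in the top k predictions."""
    hits = sum(1 for s, y in zip(all_scores, true_labels)
               if top_k_correct(s, y, k))
    return hits / float(len(true_labels))


# Made-up scores for one image over 7 classes (class index = position).
scores = [0.05, 0.30, 0.10, 0.20, 0.02, 0.15, 0.18]

print(top_k_correct(scores, true_label=2, k=5))  # True: class 2 ranks fifth
print(top_k_correct(scores, true_label=2, k=1))  # False: class 1 ranks first
print(top_k_accuracy([scores, scores], [2, 0], k=5))  # 0.5
```

Setting k=1 recovers the stricter "exactly right" metric mentioned at the end of the paragraph above.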

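It's worth briefly describing KSH's network, since it has inspired much subsequent work. It's also closely related to the convolutional networks we looked at earlier in this chapter, albeit more elaborate. KSH used a deep convolutional neural network, trained on two GPUs. They used two GPUs because the particular type of GPU they were using (an NVIDIA GeForce GTX 580) didn't have enough memory to store their entire network. So they split the network into two parts, partitioned across the two GPUs.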

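The KSH network has $$7$$ layers of hidden neurons. The first $$5$$ hidden layers are convolutional layers (some with max-pooling), while the next $$2$$ layers are fully connected layers. The output layer is a $$1,000$$-unit softmax layer, corresponding to the $$1,000$$ image classes. Here's a sketch of the network, taken from the KSH paper. The details are explained below. Note that many layers are split into $$2$$ parts, corresponding to the $$2$$ GPUs.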
