
Commit 8352a31

Author: Yoshua Bengio
Commit message: English & math corrections
1 parent: d76e9b4

File tree: 1 file changed (+12, -8 lines)

doc/gettingstarted.txt

Lines changed: 12 additions & 8 deletions
@@ -27,7 +27,10 @@ MNIST Dataset
 
 The `MNIST <http://yann.lecun.com/exdb/mnist>`_ dataset consists of handwritten
 digit images and it is divided in 60 000 examples for the training set and
-10 000 examples for testing. All examples have been size-normalized and
+10 000 examples for testing. In many papers as well as in this tutorial, the
+official training set of 60 000 is divided into an actual training set of 50 000
+examples and 10 000 validation examples (for selecting hyper-parameters like
+learning rate and size of the model). All digit images have been size-normalized and
 centered in a fixed size image of 28 x 28 pixels. In the original dataset
 each pixel of the image is represented by a value between 0 and 255, where
 0 is black, 255 is white and anything in between is a different shade of grey.
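Not part of the commit, but the 50 000 / 10 000 train/validation split that the added lines describe can be sketched in plain NumPy. The arrays below are random stand-ins for the real MNIST data (the tutorial loads a pickled dataset instead), so only the slicing and the [0, 255] → [0, 1] rescaling are the point here.

```python
import numpy as np

# Illustrative stand-ins for the 60 000 MNIST training images
# (28 x 28 pixels flattened to 784 values) and their digit labels.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(60_000, 784))
labels = rng.integers(0, 10, size=60_000)

# Split the official training set into an actual training set of
# 50 000 examples and 10 000 validation examples, as the diff describes.
train_x, valid_x = images[:50_000], images[50_000:]
train_y, valid_y = labels[:50_000], labels[50_000:]

# Rescale pixel values from [0, 255] to [0, 1] grey levels.
train_x = train_x / 255.0
valid_x = valid_x / 255.0
```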
@@ -150,7 +153,7 @@ List of Symbols and acronyms
 
 * :math:`D`: number of input dimensions.
 * :math:`D_h^{(i)}`: number of hidden units in the :math:`i`-th layer.
-* :math:`f_{\theta}(x)`, :math:`f(x)`: prediction function of a model :math:`P(Y|x,\theta)`, defined as :math:`argmax_k P(Y=k|x,\theta)`.
+* :math:`f_{\theta}(x)`, :math:`f(x)`: classification function associated with a model :math:`P(Y|x,\theta)`, defined as :math:`argmax_k P(Y=k|x,\theta)`.
   Note that we will often drop the :math:`\theta` subscript.
 * L: number of labels.
 * :math:`\mathcal{L}(\theta, \cal{D})`: log-likelihood :math:`\cal{D}`
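As an aside (not in the commit): the classification function :math:`f_{\theta}(x) = argmax_k P(Y=k|x,\theta)` from the reworded symbol entry can be illustrated with a linear softmax model. `W` and `b` here are hypothetical parameters standing in for :math:`\theta`; the tutorial's models are richer, and only the argmax idea is shown.

```python
import numpy as np

def softmax(z):
    """Turn unnormalized scores into class probabilities P(Y=k|x, theta)."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # stabilized exponentials
    return e / e.sum(axis=-1, keepdims=True)

def f(x, W, b):
    """Classification function: argmax_k P(Y=k|x, theta) for a toy
    linear softmax model with hypothetical parameters W and b."""
    p_y_given_x = softmax(x @ W + b)
    return np.argmax(p_y_given_x, axis=-1)

# Tiny usage example: two inputs of dimension D = 3, L = 4 labels.
W = np.zeros((3, 4))
W[0, 2] = 5.0                 # feature 0 votes strongly for label 2
b = np.zeros(4)
x = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
preds = f(x, W, b)
```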
@@ -270,7 +273,7 @@ as:
 
 The NLL of our classifier is a differentiable surrogate for the zero-one loss,
 and we use the gradient of this function over our training data as a
-supervised learning signal for deep learning.
+supervised learning signal for deep learning of a classifier.
 
 This can be computed using the following line of code :
 
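The hunk truncates before the tutorial's actual Theano line, so as an illustration only, the mean negative log-likelihood it refers to can be written in NumPy like this (the function name and array layout are assumptions, not the tutorial's code):

```python
import numpy as np

def negative_log_likelihood(p_y_given_x, y):
    """Mean NLL: -(1/N) * sum_i log P(Y = y_i | x_i, theta).

    p_y_given_x : (N, L) array of class probabilities, one row per example.
    y           : (N,) array of correct labels.
    """
    n = y.shape[0]
    # For each row i, pick the probability assigned to the true label y_i.
    return -np.mean(np.log(p_y_given_x[np.arange(n), y]))

# Usage: two examples, three labels; the model is confident and correct.
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
y = np.array([0, 1])
loss = negative_log_likelihood(p, y)   # -(log 0.7 + log 0.8) / 2
```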
@@ -297,7 +300,7 @@ What is ordinary gradient descent? it is a simple
 algorithm in which we repeatedly make small steps downward on an error
 surface defined by a loss function of some parameters.
 For the purpose of ordinary gradient descent we consider that the training
-data is rolled into the loss function. Then the pseducode of this
+data is rolled into the loss function. Then the pseudocode of this
 algorithm can be described as :
 
 .. code-block:: python
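The pseudocode itself is cut off by the hunk. For reference, a minimal runnable version of ordinary gradient descent on a toy quadratic loss looks like this; `gradient` stands in for the tutorial's loss gradient over the whole training set (the toy loss is an assumption for illustration).

```python
# Ordinary gradient descent on the toy loss (theta - 3)^2,
# whose gradient is 2 * (theta - 3) and whose minimizer is 3.0.
def gradient(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0                          # initial parameters
learning_rate = 0.1
for _ in range(200):                 # repeatedly step downhill
    theta = theta - learning_rate * gradient(theta)
# theta is now very close to the minimizer 3.0
```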
@@ -355,7 +358,7 @@ estimator, that time would be better spent on additional gradient steps.
 An optimal :math:`B` is model-, dataset-, and hardware-dependent, and can be
 anywhere from 1 to maybe several hundreds. In the tutorial we set it to 20,
 but this choice is almost arbitrary (though harmless). All code-blocks
-above show psuedocode of how the algorithm looks like. Implementing such
+above show pseudocode of how the algorithm looks like. Implementing such
 algorithm in Theano can be done as follows :
 
 .. code-block:: python
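The Theano implementation the hunk points at is not included here. As a hedged sketch of the same idea in plain NumPy, the loop below runs minibatch SGD with :math:`B = 20` on a toy least-squares problem; the data, model, and learning rate are all assumptions chosen only to make the batch loop concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, B = 200, 5, 20                     # examples, input dims, batch size B
X = rng.normal(size=(N, D))
true_w = rng.normal(size=D)
y = X @ true_w                           # noiseless linear targets

w = np.zeros(D)                          # parameters theta, initialized at 0
learning_rate = 0.05
for epoch in range(100):
    for start in range(0, N, B):         # sweep over minibatches of size B
        xb, yb = X[start:start + B], y[start:start + B]
        err = xb @ w - yb
        grad = 2.0 * xb.T @ err / B      # gradient of the batch mean squared error
        w -= learning_rate * grad
```

The outer loop is an epoch (one pass over the data); each inner step uses only :math:`B` examples to estimate the gradient, which is the trade-off the surrounding text discusses.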
@@ -417,9 +420,9 @@ or, in our case
 
 .. math::
 
-    E(\theta, \mathcal{D}) = NLL(\theta, \mathcal{D}) + \lambda||\theta||_p
+    E(\theta, \mathcal{D}) = NLL(\theta, \mathcal{D}) + \lambda||\theta||_p^p
 
-with
+where
 
 .. math::
 
@@ -444,7 +447,8 @@ data.
 
 Note that the fact that a solution is "simple" does not mean that it will
 generalize well. Empirically, it was found that performing such regularization
-in the context of neural networks helps with generalization.
+in the context of neural networks helps with generalization, especially
+on small datasets.
 The code block below shows how to compute the loss in python when it
 contains both a L1 regularization term weighted by :math:`\lambda_1` and
 L2 regularization term weighted by :math:`\lambda_2`
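The tutorial's own code block for this loss is outside the hunk. As a sketch under assumed names (`regularized_loss`, a precomputed scalar `nll`), the combined objective with an L1 term weighted by :math:`\lambda_1` and a squared-L2 term weighted by :math:`\lambda_2` is:

```python
import numpy as np

def regularized_loss(nll, theta, lambda_1, lambda_2):
    """E(theta, D) = NLL + lambda_1 * ||theta||_1 + lambda_2 * ||theta||_2^2."""
    l1 = np.sum(np.abs(theta))        # sum_j |theta_j|
    l2_sq = np.sum(theta ** 2)        # sum_j theta_j^2
    return nll + lambda_1 * l1 + lambda_2 * l2_sq

# Usage with a hypothetical NLL value and small parameter vector.
loss = regularized_loss(nll=0.29,
                        theta=np.array([0.5, -2.0, 1.0]),
                        lambda_1=0.001, lambda_2=0.0001)
```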
