A Review of Artificial Neural Networks: How Well Do They Perform in Forecasting Time Series?
              
            
            
            
            
formances of ANN in terms of their results in applications, taking the MLP as a benchmark. In section 6, we present an application. Finally, in section 7, we conclude.
            
            
              
                2 The Multilayer Perceptron
              
            
            
The neuron (or node) is the basic unit of a neural network. In the case of the MLP, the network includes an input layer (which does no processing), one output layer and at least one hidden layer. Each layer consists of a set of nodes; a hidden layer receives its inputs from the units in the previous layer and sends its outputs to the next layer. The input and output layers thus define the flow of information during the training phase, in which the learning algorithm is applied. The MLP generally learns by means of a backpropagation algorithm, which is basically a gradient technique. Variants of the algorithm have also been implemented to address the problem of slow convergence (for example, the addition of a momentum term; see Haykin, 1994). Once the training process has been carried out, the network weights are frozen and can be used to compute output values for new input samples. In what follows, we provide a brief explanation of the backpropagation algorithm.
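To make this architecture concrete, the sketch below is a minimal illustration, not taken from the paper: the layer sizes, the random weights and the logistic activation are all assumptions. It shows a one-hidden-layer MLP whose frozen weights are reused to compute the output for a new input sample.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation used in the hidden and output layers."""
    return 1.0 / (1.0 + np.exp(-z))

# Assumed layer sizes: 3 inputs, 5 hidden nodes, 1 output.
# In practice these weights would come from training; random values
# stand in for the "frozen" weights here.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), np.zeros(5)   # input -> hidden
W2, b2 = rng.normal(size=(1, 5)), np.zeros(1)   # hidden -> output

def forward(x):
    """Forward pass: the input layer does no processing; each later
    layer applies a weighted sum followed by the activation."""
    h = sigmoid(W1 @ x + b1)     # hidden layer
    return sigmoid(W2 @ h + b2)  # output layer

# With the weights frozen, the network maps any new input sample to an output.
y_new = forward(np.array([0.2, -1.0, 0.5]))
```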
            
            
Network learning is a process in which the weights, $w$, are adapted through continuous interaction with the environment, in such a way that at iteration $k$
$$w_{nj}(k+1) = w_{nj}(k) + \Delta w_{nj}(k),$$
where $w(k)$ is the previous value of the weight vector and $w(k+1)$ is the updated value. The learning algorithm is a set of rules that solves the learning problem by determining the adjustments $\Delta w_{nj}(k)$.
            
            
One of the most important algorithms is that of error correction. Consider the $n$-th neuron at iteration $k$. Let $y_n$ be the response of this neuron, let $x(k)$ be the vector of environment stimuli, and let $\{x(k), d_n(k)\}$ be the training pair. Define the following error signal:
$$e_n(k) = d_n(k) - y_n(k).$$
            
            
The objective is to minimize a cost function (criterion) that takes this error into account. Once the criterion has been selected, the problem of error-correction learning becomes one of optimization. Consider a function $\epsilon(w)$ that is continuously differentiable in the weight vector; the function $\epsilon(w)$ maps the elements of $w$ to real numbers. We need to find an optimal solution $w^{*}$ that satisfies the condition
$$\epsilon(w^{*}) \leq \epsilon(w).$$
            
            
It is therefore necessary to solve an unconstrained optimization problem: minimize the cost function $\epsilon(w)$ with respect to the weight vector $w$. The necessary condition for optimality is given by
$$\nabla \epsilon(w^{*}) = 0,$$
            
            
where $\nabla$ is the gradient operator. An important class of unconstrained optimization algorithms is based on the idea of iterative descent (the gradient descent method and Newton's method). Starting from an initial condition $w(0)$, such an algorithm generates a sequence $w(1), w(2), \ldots$ such that the cost function $\epsilon(w)$ decreases at every iteration. It is desirable that the algorithm eventually converge to the optimal solution, in such a way that
$$\epsilon(w(k+1)) < \epsilon(w(k)).$$
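This descent condition follows from a first-order Taylor expansion, a standard argument worth making explicit here: for a small adjustment $\Delta w(k)$,
$$\epsilon(w(k+1)) \approx \epsilon(w(k)) + \nabla\epsilon(w(k))^{T}\, \Delta w(k),$$
so choosing $\Delta w(k)$ proportional to $-\nabla\epsilon(w(k))$, with a sufficiently small step, guarantees a decrease whenever the gradient is nonzero.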
            
            
In the gradient descent method, the successive adjustments are applied to the weight vector in the direction of steepest descent. For convenience, we will use the following notation: $g = \nabla \epsilon(w)$.
            
            
The gradient descent algorithm can be written formally as
$$w(k+1) = w(k) - \eta\, g(k),$$
            
            
where $\eta$ is a positive constant called the learning rate, and $g(k)$ is the gradient vector evaluated at $w(k)$. Therefore, the correction applied to the weight vector can be written as
$$\Delta w(k) = w(k+1) - w(k) = -\eta\, g(k).$$
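As a minimal sketch of this update rule, and not an implementation from the paper, the loop below applies $w(k+1) = w(k) - \eta\, g(k)$ to an illustrative least-squares cost; the data, the learning rate and the number of iterations are arbitrary assumptions.

```python
import numpy as np

# Illustrative cost eps(w) = 0.5 * ||A w - b||^2 built from made-up data.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
b = rng.normal(size=20)

def gradient(w):
    """g = grad eps(w) = A^T (A w - b)."""
    return A.T @ (A @ w - b)

eta = 0.05            # learning rate (assumed value)
w = np.zeros(3)       # initial condition w(0)
for k in range(200):
    g = gradient(w)   # gradient evaluated at w(k)
    w = w - eta * g   # w(k+1) = w(k) - eta * g(k)
```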
            
            
This method converges slowly to an optimal solution $w^{*}$, and the learning rate has a strong influence on this convergence behavior. When $\eta$ is small, the trajectory of $w(k)$ over the weight space $W$ is smooth. When $\eta$ is large, the trajectory of $w(k)$ over $W$ is oscillatory, and when $\eta$ exceeds a certain critical value, the trajectory of $w(k)$ over $W$ becomes unstable.
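A simple illustration, under the assumption of a one-dimensional quadratic cost not discussed in the paper, makes these three regimes explicit: for $\epsilon(w) = \frac{1}{2}\lambda w^{2}$ with $\lambda > 0$, the gradient is $g(k) = \lambda w(k)$ and the update becomes $w(k+1) = (1 - \eta\lambda)\, w(k)$. The iterates decay smoothly when $0 < \eta\lambda < 1$, oscillate in sign while still converging when $1 < \eta\lambda < 2$, and diverge when $\eta\lambda > 2$; the critical value of the learning rate is therefore $\eta = 2/\lambda$.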
            
            
Thus, the backpropagation algorithm is a technique for implementing the gradient descent method in the weight space of a multilayer network. The basic idea is to efficiently compute the partial derivatives of the approximating function realized by the neural network with respect to all the elements of the adjustable parameter vector $w$, for a given value of the input vector $x$.
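A minimal sketch of such a gradient computation is shown below; it is not the paper's implementation, and the one-hidden-layer architecture, the logistic activations and the squared-error cost $\epsilon = \frac{1}{2}\|d - y\|^{2}$ are all assumptions. It returns the partial derivatives of the cost with respect to every adjustable parameter for a single training pair, ready to be plugged into the update $w(k+1) = w(k) - \eta\, g(k)$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_gradients(x, d, W1, b1, W2, b2):
    """Partial derivatives of eps = 0.5 * ||d - y||^2 with respect to
    all weights and biases, for one training pair {x, d}."""
    # Forward pass through the hidden and output layers.
    h = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ h + b2)
    e = d - y                                   # error signal e = d - y
    # Backward pass: propagate local sensitivities layer by layer.
    delta2 = -e * y * (1.0 - y)                 # d eps / d(output pre-activation)
    dW2, db2 = np.outer(delta2, h), delta2
    delta1 = (W2.T @ delta2) * h * (1.0 - h)    # d eps / d(hidden pre-activation)
    dW1, db1 = np.outer(delta1, x), delta1
    return dW1, db1, dW2, db2
```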
            
            
              
                3 Types of ANNs
              
            
            
The specialized literature identifies several groups of networks used as approximators and/or classifiers. This section provides a classification in terms of the general characteristics of the ANNs.

1. In the first group we find the Feedforward Networks (FFNs), such as the MLP. Their main feature is that their connections run only forward, so no connections are established between nodes in the same layer or with nodes in previous layers. The networks that share this feature are: the Radial Basis Function network (RBF) (Bildirici et al. 2010; Dhamija and Bhalla, 2011; Cheng, 1996); the Generalized Regression Neural Network (GRNN) (Enke and Thawornwong, 2005; Mostafa, 2010); the Group Method of Data Handling Network (GMDHN) (Pham and Lui, 1995); the Probabilistic Neural Network (PNN) (Enke and Thawornwong, 2005; Thawornwong and Enke, 2004); the Dynamic Neural Network (DNN) (Guresen, Kayakutlu and Daim, 2010); and the Cerebellar Model Articulation Controller (CMAC) (Chen, 1996).
            
            