Dynamics of Deep Neural Networks
Sara-Viola Kuntz
Deep learning has significantly transformed various fields by providing remarkable flexibility and power in tackling complex analytical and predictive tasks. However, a complete theoretical, particularly mathematical, understanding and explanation of deep neural networks and their associated algorithmic procedures is still lacking. In this blog post, we explore several approaches to analyzing deep neural networks through the lens of dynamical systems.
First, we discuss the two-time-scale structure of neural network dynamics, which typically consists of a fast information propagation process coupled with a slower learning process. Next, we introduce neural ordinary differential equations (neural ODEs), which can be understood as an infinite-depth limit of residual neural networks, a special class of feed-forward neural networks. Neural ODEs have gained significant attention in recent years as they bridge the gap between traditional discrete feed-forward neural network architectures and continuous-time dynamical systems. We then categorize feed-forward neural networks and neural ODEs into three different types: non-augmented, augmented, and bottleneck/degenerate architectures. Depending on the chosen architecture, the input-output map of the neural network has a different geometric structure, which can be analyzed using Morse functions. Finally, we introduce the concepts of universal embedding and universal approximation, which are crucial for analyzing the information propagation process within the neural network dynamics. In the literature, many universal approximation theorems have been established for feed-forward neural networks. For neural ODEs, recent progress has been made in studying the universal embedding property.
Multiple Time-Scale Dynamics of Neural Networks
Most neural networks can be represented as a directed weighted graph G = (V,E), where the set of nodes V corresponds to the neurons, and the set of edges E represents the connections between them. Within the node set V, we define a subset of input nodes Vin, which have only outgoing edges, and a subset of output nodes Vout, which have only incoming edges. The remaining neurons constitute the set of hidden nodes Vhid = V \ (Vin ∪ Vout), as illustrated in Figure 1. Without any additional assumptions about the network’s structure, edges within the input set Vin or the output set Vout, as well as loops, are possible.
Figure 1: Neural network modeled as a directed weighted graph G = (V,E) with input nodes Vin, hidden nodes Vhid, and output nodes Vout.
A typical neural network model consists of two main processes: information propagation and learning.
- Information propagation process: A neural network maps the state of the input nodes Vin at time t0 to the state of the output nodes Vout at time t1. This relationship defines the input-output map Φ: Vin(t0) ↦ Vout(t1). The information that flows through the network during the time interval (t0, t1) is a function of the state of the neurons, the connections between them, and time. The dynamics of this process can be defined on either a continuous or a discrete time-scale. Since information propagation describes a process on a fixed graph, we refer to it as the dynamics on the network.
- Learning process: To approximate the given data using the input-output map Φ of a neural network, the weights of the edges can be adjusted, and new connections can be established. Typically, the training data consists of input-output pairs (xi, yi), which are used to minimize a predefined loss function through (stochastic) gradient-based methods. For each input data point xi, the loss function depends on the difference between the actual output yi and the predicted value Φ(xi). As the learning process iteratively modifies the underlying graph, we refer to it as the dynamics of the network.
The input-output map Φ is evaluated at every step of the learning process; hence, the information propagation process influences the dynamics of the network. Conversely, the network’s topology influences the input-output map Φ, which means that the learning process influences the dynamics on the network. In general, this interdependent relationship is referred to as an adaptive dynamical network[1]. The step size of gradient-based methods is controlled by the learning rate, which is typically set to a value much smaller than one. The small learning rate causes the learning process to evolve more slowly than the information propagation, which induces a time-scale separation. As a result, the network can be classified as an adaptive (stochastic) two-time-scale system.
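To make the two-time-scale structure concrete, the following minimal sketch couples the two processes for a toy model (the dataset, the single-hidden-layer map, and the plain gradient-descent loop are our illustrative choices, not part of the general model above): the fast process is the evaluation of the input-output map Φ on the data, and the slow process is a gradient step scaled by a small learning rate η.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data (x_i, y_i): label whether x_i lies inside the unit circle.
X = rng.normal(size=(200, 2))
Y = (X[:, 0]**2 + X[:, 1]**2 < 1.0).astype(float)

# Weights of a single-hidden-layer network; together they play the role of theta.
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
w2, b2 = rng.normal(size=8), 0.0

eta = 1e-2  # small learning rate => the learning process evolves on the slow time scale
for step in range(500):
    # Fast process: propagate all inputs through the fixed network (dynamics ON the network).
    H = np.tanh(X @ W1 + b1)
    pred = H @ w2 + b2
    err = pred - Y                                # residual of the loss 0.5 * mean(err**2)
    # Slow process: one small gradient step on the weights (dynamics OF the network).
    grad_H = np.outer(err, w2) * (1.0 - H**2)
    W1 -= eta * X.T @ grad_H / len(X)
    b1 -= eta * grad_H.mean(axis=0)
    w2 -= eta * H.T @ err / len(X)
    b2 -= eta * err.mean()
```

Because η is small, many fast evaluations of Φ change the weights only slightly per step, which is exactly the time-scale separation described above.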
From Deep Neural Networks to Neural ODEs
In the first proposed and most studied neural network architectures, the neurons are assumed to be structured in layers hl for l ∈ {0,…,L}, forming a feed-forward neural network. The number of layers L is called the depth of the network, and the maximal number of neurons per layer is called the width of the network. The network maps the input layer h0 to the output layer hL; the layers in between are called hidden layers. In the most straightforward case, all layers are fully connected and have the same number n of neurons per layer, as visualized in Figure 2.
Figure 2: Example of a fully connected feed-forward neural network with constant width n = 5, depth L, input layer h0, hidden layers hl, l ∈ {1,…,L-1} and output layer hL.
The most famous feed-forward neural network model is the Multi-Layer Perceptron (MLP)[2], which was already studied by Rosenblatt in 1958. In MLPs, the layers are iteratively updated by
hl+1 = fl(hl,θl)
for l ∈ {0,…,L-1}. In this update rule, every fl is a non-linear activation function applied component-wise, typically chosen as a sigmoid, ReLU, or tanh function. The parameter θl is the weighted adjacency matrix of the connections between the nodes of layer hl and the nodes of layer hl+1.
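As a minimal illustration, the MLP update rule can be implemented in a few lines; the sketch below assumes tanh activations and omits bias terms purely for brevity.

```python
import numpy as np

def mlp_forward(h0, thetas, activation=np.tanh):
    """Iterate the MLP update rule h_{l+1} = f_l(h_l, theta_l) = activation(theta_l @ h_l)."""
    h = h0
    for theta in thetas:          # one weight matrix per layer transition
        h = activation(theta @ h)
    return h

# Example: depth L = 3 and constant width n = 5 with random weights.
rng = np.random.default_rng(1)
thetas = [rng.normal(size=(5, 5)) for _ in range(3)]
hL = mlp_forward(rng.normal(size=5), thetas)
```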
A more advanced class of feed-forward neural networks is the Residual Neural Network (ResNet)[3]. In ResNet architectures, all layers are of the same width n and all activation functions fl are chosen to be the same function fResNet. Additionally, ResNets allow for shortcut connections between layers, resulting in the update rule
hl+1 = hl + fResNet(hl,θl)
for l ∈ {0,…,L-1}. The additional summand hl, which appears in the update rule of ResNets but not in that of MLPs, is referred to as a short-skip or highway connection.
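Compared to the MLP sketch above, the only change is the additional summand hl; again a minimal sketch with tanh standing in for fResNet and without bias terms:

```python
import numpy as np

def resnet_forward(h0, thetas, f=np.tanh):
    """Iterate the ResNet update rule h_{l+1} = h_l + f_ResNet(h_l, theta_l)."""
    h = h0
    for theta in thetas:
        h = h + f(theta @ h)      # short-skip connection: identity plus residual update
    return h
```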
The iterative update rule of ResNets can be obtained as an Euler discretization of the initial value problem (IVP)
h′(t) = f(h(t), θ(t)), h(0) = h0,
over the time interval [0,T] with step size δ = T/L and a vector field f, which is obtained by rescaling the function fResNet. The solution h(t) of the IVP represents the hidden states, while the function θ(t) interpolates the weight matrices. The initial condition of the ODE is the initial layer h0, and the output layer hL corresponds to the time-T map h(T) of the IVP. Typically, learning algorithms are applied to stationary parameters of neural networks rather than to parameter functions. In order to be able to optimize with respect to stationary parameters, the considered IVP can be rewritten in the form
h′(t) = f(h(t), θ, t), h(0) = h0,
which is called a neural ODE[4]. A neural ODE is a neural network that maps its initial data h(0) = h0 to its time-T map h(T), as illustrated in Figure 3.
Figure 3: Example of a neural ODE that maps its initial data h(0) = h0 to its time-T map h(T).
A feed-forward neural network with a large number of layers L is referred to as a deep neural network. As the error of the Euler discretization scales with the step size δ = T/L, it decreases as the depth L of the ResNet increases. This means that the deeper the ResNet, the better its dynamics can be approximated by the dynamics of a neural ODE. By using the theory of continuous dynamical systems, we can gain a better understanding of the theoretical properties of neural ODEs and transfer this insight to deep feed-forward neural networks. Furthermore, generalized neural ODE models provide several benefits: they can model dynamical systems exhibiting time-continuous behavior, adapt to unevenly sampled time-series data, and their efficient implementation can reduce computational costs.
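The correspondence between ResNets and neural ODEs can be made explicit in code: an explicit Euler discretization with L steps of size δ = T/L has exactly the form of a ResNet with L layers whose residual map is δ·f. The vector field f below is purely illustrative.

```python
import numpy as np

def f(h, theta, t):
    """Illustrative vector field of the neural ODE (autonomous here; t is unused)."""
    return np.tanh(theta @ h)

def neural_ode_euler(h0, theta, T=1.0, L=1000):
    """Approximate the time-T map h(T) of h'(t) = f(h(t), theta, t) with L Euler steps."""
    h, delta = h0, T / L
    for l in range(L):
        # Each Euler step has the form of a ResNet layer with residual delta * f.
        h = h + delta * f(h, theta, l * delta)
    return h

rng = np.random.default_rng(2)
theta, h0 = rng.normal(size=(5, 5)), rng.normal(size=5)
print(neural_ode_euler(h0, theta, L=10))    # coarse discretization ~ shallow ResNet
print(neural_ode_euler(h0, theta, L=1000))  # fine discretization ~ deep ResNet, close to h(T)
```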
Classification of Architectures
ResNets and neural ODEs are limited to modeling data with the same input and output dimensions. To overcome this restriction, two affine linear layers denoted by λ1 and λ2 can be added: the first layer h0 is the transformation λ1 applied to the initial data, and the output of the neural ODE is the transformation λ2 applied to the time-T map h(T). Hence, the input-output map Φ of the neural ODE architecture maps input data x to Φ(x) = λ2(h(T)). This modeling approach works analogously for ResNets by considering the last layer hL instead of the time-T map h(T). In the following, we restrict ourselves to neural ODEs, as ResNets are a special case of feed-forward neural networks, for which different types of architectures are introduced afterward. Without loss of generality, we study neural ODEs and feed-forward neural networks with scalar output; in the case of a multi-dimensional output, the results apply to each component of the output.
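A minimal sketch of such an architecture, assuming input dimension 2, phase-space dimension 4, scalar output, and an illustrative vector field (explicit Euler integration stands in for a proper ODE solver):

```python
import numpy as np

rng = np.random.default_rng(3)
A1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)   # lambda_1: R^2 -> R^4 (full rank)
A2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)   # lambda_2: R^4 -> R^1 (full rank)
theta = rng.normal(size=(4, 4))

def Phi(x, T=1.0, L=200):
    """Input-output map Phi(x) = lambda_2(h(T)) with initial condition h(0) = lambda_1(x)."""
    h = A1 @ x + b1                        # affine layer lambda_1 lifts the input
    delta = T / L
    for l in range(L):                     # Euler discretization of h'(t) = f(h(t), theta, t)
        h = h + delta * np.tanh(theta @ h)
    return A2 @ h + b2                     # affine layer lambda_2 projects to the output

print(Phi(np.array([0.3, -0.7])))
```

Since the phase space (dimension 4) is wider than the input (dimension 2), this particular sketch is an augmented neural ODE in the terminology introduced next.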
If the weight matrices of the affine linear transformations λ1 and λ2 have full rank, we can subdivide neural ODEs into non-augmented and augmented architectures:
- Non-augmented neural ODE: The initial value problem of the neural ODE is solved in a phase space, which has at most the dimension of the input.
- Augmented neural ODE: The initial value problem of the neural ODE is solved in a phase space with a larger dimension than the input dimension.
The two different types of neural ODE architectures are visualized in Figure 4. If at least one of the weight matrices of the affine linear transformations λ1 or λ2 does not have full rank, the neural ODE architecture is called degenerate.
Figure 4: Example of a non-augmented and an augmented neural ODE architecture.
Classical feed-forward neural networks, such as multi-layer perceptrons, can also be categorized into different types of architectures based on the dimensionality of their layers. Under certain assumptions on the activation functions, we may restrict our attention to networks whose weight matrices all have full rank: if a feed-forward neural network has a singular weight matrix, we can construct an equivalent feed-forward neural network that has the same input-output map Φ, but in which all weight matrices have full rank. The relevant algorithm for the construction of the equivalent neural network architecture can be found in our recent work[5].
In the following, we introduce the concepts of non-augmented, augmented, and bottleneck architectures for feed-forward neural networks with full-rank matrices.
- Non-augmented feed-forward neural network: The dimensions of the layers are monotonically decreasing from h0 to hL.
- Augmented feed-forward neural network: The layer of maximal dimension hmax is wider than the input layer h0. Additionally, the layer dimensions are monotonically increasing before the widest layer hmax and monotonically decreasing after the widest layer hmax.
- Feed-forward neural network with a bottleneck: There is a layer hB called a bottleneck, such that before and after the layer hB there exist layers hi and hj with larger dimensions than hB. Additionally, the dimensions are monotonically decreasing from hi to hB, and monotonically increasing from hB to hj.
The three different types of feed-forward neural network architectures with full-rank matrices are visualized in Figure 5. Note that for neural ODEs, no bottleneck can occur as the phase space is constant during the initial value problem.
Figure 5: Example of a non-augmented, an augmented, and a bottleneck feed-forward neural network.
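For networks with full-rank weight matrices, this classification depends only on the sequence of layer widths. The following small helper (our own illustrative implementation of the three definitions above) distinguishes the cases:

```python
def classify_architecture(widths):
    """Classify a width sequence [dim(h_0), ..., dim(h_L)], assuming full-rank weights."""
    i_max = widths.index(max(widths))
    decreasing = all(a >= b for a, b in zip(widths, widths[1:]))
    unimodal = (all(a <= b for a, b in zip(widths[:i_max], widths[1:i_max + 1])) and
                all(a >= b for a, b in zip(widths[i_max:], widths[i_max + 1:])))
    if decreasing:
        return "non-augmented"   # widths monotonically decreasing from h_0 to h_L
    if max(widths) > widths[0] and unimodal:
        return "augmented"       # widest layer exceeds the input, widths increase then decrease
    return "bottleneck"          # some layer is narrower than layers both before and after it

print(classify_architecture([5, 4, 3, 2]))   # non-augmented
print(classify_architecture([2, 5, 5, 1]))   # augmented
print(classify_architecture([4, 2, 4, 1]))   # bottleneck
```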
Feed-Forward Neural Networks, Neural ODEs, and Morse Functions
The geometric structure of the input-output map Φ of feed-forward neural networks and neural ODEs can be analyzed via Morse functions. Morse theory[6] is a mathematical framework for studying the existence and regularity of critical points of smooth functions. A scalar smooth function has a critical point where its gradient is zero. The regularity of a critical point is determined by the rank of the Hessian matrix at that point: if the Hessian matrix has full rank, the critical point is non-degenerate; if the Hessian matrix is singular, the critical point is degenerate. A Morse function is defined as a smooth function where every critical point is non-degenerate. In the space of smooth scalar functions, Morse functions are dense, i.e., it is a generic property of a smooth function to be Morse.
One-dimensional examples of Morse functions include linear functions and the quadratic function x2. In contrast, none of the higher-order monomials xn with n ≥ 3 is a Morse function. In Figure 6, two examples of Morse functions with two-dimensional input are shown. By distinguishing whether the considered functions take positive or negative values, typical datasets such as the circle or the XOR dataset can be represented as Morse functions.
Figure 6: Two examples of Morse functions with two-dimensional input are visualized. The areas with positive and negative function values are highlighted in blue and orange. The figure on the left-hand side represents the circle dataset, and the figure on the right-hand side shows the XOR dataset.
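Numerically, the Morse property can be checked at a critical point by testing whether the Hessian matrix has full rank there. The sketch below does this for g(x1, x2) = x1² + x2² − 1, one concrete (illustrative) representation of the circle dataset with a single critical point at the origin, and for the non-Morse cubic x ↦ x3 in one dimension.

```python
import numpy as np

def is_nondegenerate(hessian, tol=1e-8):
    """A critical point is non-degenerate iff the Hessian has full rank at that point."""
    return np.linalg.matrix_rank(hessian, tol=tol) == hessian.shape[0]

# Circle dataset function g(x1, x2) = x1^2 + x2^2 - 1: unique critical point at the origin.
H_circle = np.array([[2.0, 0.0],
                     [0.0, 2.0]])       # Hessian of g, full rank everywhere
print(is_nondegenerate(H_circle))       # True  -> g is a Morse function

# Cubic x -> x^3: critical point at x = 0 with vanishing second derivative.
H_cubic = np.array([[6.0 * 0.0]])       # Hessian 6x evaluated at x = 0
print(is_nondegenerate(H_cubic))        # False -> x^3 is not a Morse function
```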
In the context of feed-forward neural networks and neural ODEs, Morse functions play an essential role, as depending on the type of architecture (non-augmented, augmented, bottleneck/degenerate), different results regarding the existence and regularity of critical points are obtained. For both feed-forward neural networks and neural ODEs, the non-augmented and augmented architectures defined in the previous section show the same properties:
- Non-augmented architecture: the input-output map Φ cannot have any critical points. Hence, Φ is a Morse function.
- Augmented architecture: for almost all sets of weights, every critical point of the input-output map Φ is non-degenerate. Hence, in the space of weight matrices, Φ is almost surely a Morse function.
For feed-forward neural networks with a bottleneck or neural ODEs with singular weight matrices, the characterization of the input-output map Φ is more involved. A detailed classification of the geometric structure of Φ depending on the chosen architecture can be found in our recent research[5].
Universal Approximation and Universal Embedding
The findings discussed in the previous section regarding feed-forward neural networks, neural ODEs, and Morse functions have implications for their universal embedding[7] and universal approximation[8] properties. A general neural network has the universal embedding property with respect to a given function space if every function in this space can be represented exactly. In contrast, a general neural network has the universal approximation property with respect to a normed function space if every function in this space can be approximated arbitrarily well with respect to the corresponding norm.
For many feed-forward neural networks, universal approximation theorems are well established. They can be subdivided into different groups depending on the depth (the number of layers) and the width (the number of neurons per layer). The two main groups are the following:
- Fixed depth, arbitrary width:[9] If the depth of a neural network and a precision ε are fixed, the width of the neural network can be increased such that every continuous function can be approximated up to the error ε.
- Fixed width, arbitrary depth:[10] If the width of an augmented neural network and a precision ε are fixed, the depth of the neural network can be increased such that every continuous function can be approximated up to the error ε.
From the results of the last section, we can directly deduce that non-augmented neural networks cannot have the universal embedding property, as these networks cannot represent functions with critical points. A perturbation result shows that non-augmented neural networks also cannot have the universal approximation property.
For neural ODEs, the universal approximation property is less studied. In our recent work[7], we study the embedding capability of neural ODEs via iterative functional equations, differential geometry, suspension flows, and dynamical systems theory. One significant result is that if the dimension of the initial value problem is at least as large as the sum of input and output dimensions, every continuous function can be represented exactly. From the property that the input-output map Φ of non-augmented neural ODEs has no critical point, we can directly infer that non-augmented neural ODEs cannot have the universal embedding property with respect to the space of continuous functions.
A simple, one-dimensional example visualizing the topological constraints of non-augmented neural ODEs is the following: if the vector field is locally Lipschitz continuous, the solution curves are unique and cannot cross or meet. Consequently, for example, the map x ↦ x2 cannot be represented as a time-T map of an initial value problem, as solution curves would need to cross or meet; see Figure 7. As one-dimensional affine linear transformations with full rank are bijective, neural ODEs with one-dimensional affine linear layers before and after the initial value problem also cannot represent the map x ↦ x2.
Figure 7: To represent the map x ↦ x2 as a time-T map of a solution of an ODE, both initial conditions x and -x at time t=0 have to be mapped to the point x2 at time t=T. As the two solution curves need to cross or meet, the map x ↦ x2 cannot be represented by a neural ODE.
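This obstruction can also be observed numerically: for a scalar ODE with a locally Lipschitz vector field, the time-T map preserves the order of initial conditions, so it is strictly monotone and in particular injective, whereas x ↦ x2 is not injective. A minimal check with an arbitrary illustrative vector field:

```python
import numpy as np

def time_T_map(x0, T=1.0, L=1000):
    """Euler approximation of the time-T map of a scalar ODE h'(t) = f(h(t), t)."""
    h, delta = x0, T / L
    for l in range(L):
        h = h + delta * np.tanh(h) * np.cos(l * delta)   # illustrative Lipschitz vector field
    return h

xs = np.linspace(-2.0, 2.0, 9)
ys = np.array([time_T_map(x) for x in xs])
# The order of initial conditions is preserved: the time-T map is strictly increasing,
# hence injective, and therefore can never realize the non-injective map x -> x^2.
print(np.all(np.diff(ys) > 0))   # True
```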
In summary, in this blog post we introduced several architectures for feed-forward neural networks and their continuous-time analog, neural ODEs. Furthermore, we highlighted the interplay between the structural design and the dynamical properties of neural networks and its implications for the universal embedding and universal approximation properties. The insights presented contribute to a more profound theoretical understanding of the dynamics on the network. For a full understanding of the adaptive neural network, the coupling with the learning process, i.e., the dynamics of the network, needs to be studied in future work.
Hyperlinks
[1] R. Berner et al., Adaptive Dynamical Systems, 2023
[2] F. Rosenblatt, The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain, 1958
[3] K. He et al., Deep Residual Learning for Image Recognition, 2015
[4] R.T.Q. Chen et al., Neural Ordinary Differential Equations, 2018
[5] C. Kuehn and S.-V. Kuntz, Analysis of the Geometric Structure of Neural Networks and Neural ODEs via Morse Functions, 2024
[6] L. Nicolaescu, An Invitation to Morse Theory, 2011
[7] C. Kuehn and S.-V. Kuntz, Embedding Capabilities of Neural ODEs, 2023
[8] A. Kratsios, The Universal Approximation Property, 2020
[9] K. Hornik et al., Multilayer Feedforward Networks are Universal Approximators, 1989
[10] S. Park et al., Minimum Width for Universal Approximation, 2020