Russian scientists against the war

Hi, I'm an applied mathematician at Skoltech, interested in various kinds of applied mathematics. See my Google Scholar profile. Below I describe some of the topics I have worked on.

The well-known
universal approximation theorem states that neural networks
can approximate any function with arbitrarily small error if we make the network large enough.
But can we achieve this *without* increasing the network size?
Maiorov and Pinkus
showed in 1999 that this is possible if we use some very special activation functions:
a fixed-size network can then fit any continuous function on any compact domain with any accuracy
merely by adjusting the weights. The respective activation functions are analytic and monotone,
but otherwise quite complicated, in particular not elementary.

In this paper
I proposed to call such activations "*superexpressive*" and proved that there
exist *elementary* superexpressive activations.
The proof relies significantly on the
density of irrational flow on the torus.
For example, the network in the picture with activations \(\sin\) and \(\arcsin\)
is such a superexpressive network (likely not optimal in terms of size) for two-variable functions.
At the same time, common non-periodic activations such as sigmoid, ReLU, ELU, softplus, etc. are *not* superexpressive.
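The density fact used in the proof can be illustrated numerically. The sketch below is only an illustration of that fact, not the construction from the paper: it follows the line \(t\mapsto(t,\ \text{slope}\cdot t)\) on the unit torus and records how close it comes to a chosen target point. The function names, the slope \(\sqrt 2\) and the target point are assumptions made for this example.

```python
import math


def circle_dist(a, b):
    """Distance between two points on the circle R/Z."""
    d = abs(a - b) % 1.0
    return min(d, 1.0 - d)


def closest_approach(target, slope=math.sqrt(2.0), n_turns=2000):
    """Closest approach of the flow t -> (t, slope * t) mod 1 to `target`.

    We only look at the times t = x0 + n at which the first coordinate
    equals x0 exactly, so the remaining distance is measured along the
    second coordinate (an irrational rotation when `slope` is irrational).
    """
    x0, y0 = target
    best = float("inf")
    for n in range(n_turns):
        t = x0 + n
        y = (slope * t) % 1.0
        best = min(best, circle_dist(y, y0))
    return best


if __name__ == "__main__":
    target = (0.3, 0.7)
    for n_turns in (10, 100, 2000):
        print(n_turns, closest_approach(target, n_turns=n_turns))
    # A rational slope gives a closed orbit that stays away from the target.
    print("rational slope:", closest_approach(target, slope=0.5))
```

With an irrational slope the reported distance keeps shrinking as more turns are allowed, whereas with a rational slope the orbit is closed and the distance stalls at a fixed positive value.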

Network training by gradient-descent-based algorithms is a complex process that is generally hard to analyze theoretically. In this work with Maksim Velikanov we have shown, however, that under some reasonably general assumptions on the target function one can rather accurately describe the asymptotic evolution of the loss under gradient descent in the NTK (neural tangent kernel) regime.

The asymptotic behavior is given by a power law \(L(t)\sim Ct^{-\xi}\), where not only the exponent \(\xi\) but also the
coefficient \(C\) can be written analytically. Our assumptions roughly say that the data distribution \(\mu\) is smooth and generic while
the target is characterized by singularities of "particular order and extent". The exponent \(\xi\) is then universal in the sense that it is
determined only by the dimension of the data, the type of singularity in the target, and the network activation. The coefficient \(C\) is more complicated, but can still be written
in terms of some integral expressions. For example, if the target belongs to the class of indicator functions of domains \(\Omega\subset\mathbb R^d\)
with smooth boundary, the loss evolves by
\[L(t)\sim\int_{\partial\Omega} (\mu(\mathbf x)\widetilde{\theta}_{\mathbf x}(\mathbf n))^{-\frac{1}{d+\alpha}}dS
\cdot \tfrac{1}{2\pi}\Gamma(\tfrac{1}{d+\alpha}+1)\cdot(2t)^{-\frac{1}{d+\alpha}},\]
with some homogeneous function \(\widetilde{\theta}_{\mathbf x}\) and value \(\alpha\) determined by the activation function and network architecture.
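To see how a power law \(L(t)\sim Ct^{-\xi}\) can emerge at all, here is a toy sketch. It is not the derivation from the paper: it simply *assumes* that, in continuous time, the loss splits into independent modes, \(L(t)=\tfrac12\sum_k c_k^2e^{-2\lambda_kt}\), with hypothetical power-law eigenvalues \(\lambda_k=k^{-a}\) and target coefficients \(c_k^2=k^{-b}\). Replacing the sum by an integral then gives \(\xi=\tfrac{b-1}{a}\) and a Gamma-function prefactor. The names `loss`, `fitted_exponent`, `a`, `b` are made up for this example.

```python
import math


def loss(t, a=2.0, b=2.0, n_modes=100_000):
    """Toy loss 0.5 * sum_k k^(-b) * exp(-2 * t * k^(-a)) (assumed spectrum)."""
    return 0.5 * sum(
        k ** (-b) * math.exp(-2.0 * t * k ** (-a))
        for k in range(1, n_modes + 1)
    )


def fitted_exponent(t1=100.0, t2=1000.0, **spectrum):
    """Estimate xi from the slope of log L(t) against log t between t1 and t2."""
    log_ratio = math.log(loss(t2, **spectrum)) - math.log(loss(t1, **spectrum))
    return -log_ratio / (math.log(t2) - math.log(t1))


if __name__ == "__main__":
    for a, b in ((2.0, 2.0), (2.0, 3.0)):
        print(f"a={a}, b={b}: fitted xi = {fitted_exponent(a=a, b=b):.4f}, "
              f"predicted (b-1)/a = {(b - 1) / a:.4f}")
```

In the actual result the exponent and the coefficient come from the singularities of the target and from the activation, not from an assumed spectrum; the sketch only shows the mechanism by which a sum of decaying exponentials produces a power law.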

In this work we describe a complete phase diagram of approximation rates for deep ReLU networks. We assume that the target function \(f\) belongs to a Hölder ball \(F_{r,d}\) that, roughly speaking, consists of \(d\)-variate functions having bounded derivatives up to order \(r\). We then ask for which exponents \(p\) we can find approximations \(\widetilde f_W\) of such \(f\) by ReLU networks with \(W\) weights so that \(\|f-\widetilde f_W\|_\infty=O(W^{-p})\) as \(W\to\infty\).

It turns out that there are two very different regimes. The "slower" regime provides approximation rates up to \(p=\tfrac{r}{d}\). This regime can be implemented by relatively shallow networks whose weights depend continuously on the target. It is rather similar to classical linear approximation methods such as spline or Fourier expansions.

The "faster" regime is based on an entirely different idea of "weight encoding" (rather than linear constructions). This regime can provide approximation rates up to \(p=\tfrac{2r}{d}\). It requires network depths to grow as a power law in \(W\); in particular, at \(p=\tfrac{2r}{d}\) the network size fully "goes into the depth" rather than the width. The weight precision also must grow as a power law in \(W\). The weight assignment in this regime is fundamentally discontinuous.
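The two thresholds above can be summarized in a few lines of code. The helper below is a hypothetical illustration (the function name and the returned labels are not from the paper): given a desired rate \(p\), smoothness \(r\) and dimension \(d\), it reports which of the regimes described above would be needed.

```python
def approximation_regime(p, r, d):
    """Classify a rate p for the Hölder ball F_{r,d} by the thresholds r/d and 2r/d."""
    if p <= 0 or r <= 0 or d <= 0:
        raise ValueError("p, r and d must all be positive")
    slow_limit = r / d        # reachable with continuous weight assignment
    fast_limit = 2 * r / d    # reachable only with "weight encoding"
    if p <= slow_limit:
        return "slow"
    if p <= fast_limit:
        return "fast"
    # Rates above 2r/d lie outside the range described in the text.
    return "beyond 2r/d"


if __name__ == "__main__":
    r, d = 2, 4  # example values: thresholds are 0.5 and 1.0
    for p in (0.25, 0.5, 0.75, 1.0, 1.5):
        print(p, approximation_regime(p, r, d))
```

For example, with \(r=2\) and \(d=4\) any rate up to \(p=0.5\) falls in the slower regime, while rates between \(0.5\) and \(1\) require the faster, discontinuous one.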

An important idea in machine learning is to exploit the natural *invariance* (or more generally, *equivariance*)
of target maps with respect to various groups of transformations (e.g., shifts, rotations, reflections, etc.).
Building the relevant symmetry into the models generally makes them more efficient in terms of accuracy, complexity or training time.
A standard example is convolutional networks,
which are designed to be invariant with respect to grid translations.
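A minimal sketch of how translation symmetry is built in, on a one-dimensional periodic grid: a circular convolution commutes with cyclic shifts (equivariance), and a global sum over positions then removes the dependence on the shift (invariance). The function names and the toy signal and kernel are assumptions made for this example.

```python
def cyclic_shift(signal, s):
    """Shift a periodic signal by s grid steps."""
    n = len(signal)
    return [signal[(i - s) % n] for i in range(n)]


def circular_conv(signal, kernel):
    """Circular cross-correlation of a periodic signal with a short kernel."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


def invariant_feature(signal, kernel):
    """Convolution, ReLU, then global sum pooling over all grid positions."""
    return sum(max(v, 0) for v in circular_conv(signal, kernel))


if __name__ == "__main__":
    x = [3, -1, 4, 1, -5, 9, 2, -6]
    w = [1, -2, 1]
    for s in range(len(x)):
        # Equivariance: shifting the input shifts the feature map.
        assert circular_conv(cyclic_shift(x, s), w) == cyclic_shift(circular_conv(x, w), s)
        # Invariance: the pooled feature does not change at all.
        assert invariant_feature(cyclic_shift(x, s), w) == invariant_feature(x, w)
    print("pooled feature:", invariant_feature(x, w))
```

Integer inputs are used so that the equalities hold exactly rather than up to rounding.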

In this paper I analyzed neural networks that are *invariant and universal*
with respect to various groups of transformations. This means that the network must be not only invariant, but also capable of approximating
any invariant map. This question is especially subtle in the case of groups such as the Euclidean rotation group, because it cannot be exactly
implemented on finite grids on which the data is usually defined. It turns out that one can still rigorously describe neural network-type
models that are provably *invariant and universal* even for this group, by considering a suitable limiting process
(specifically, a map on the space of signals on \(\mathbb R^2\) is continuous and SE(2)-equivariant if and only if it can be approximated
by models of suitable architecture in the limit of infinitely detailed discretization).

Space tether systems are an interesting class of systems, potentially useful for various purposes such as space debris removal, satellite collocation, etc. In this joint work with our Astrium colleagues we studied a "hub-and-spoke" pyramidal formation rotating about a central satellite and holding another satellite beneath it. Unfortunately, this configuration requires relatively high fuel consumption.

So, in this paper we
proposed another, *freely moving* (no fuel!) formation serving
the same purpose. Instead of a circle, deputy satellites now move
along Lissajous curves. We find relations between the system's
parameters ensuring that the satellites and tethers never collide and
the main satellite remains immobile, and show how all these relations
can be satisfied.
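For readers unfamiliar with Lissajous curves, here is a minimal sketch of the planar curve family \(x=A\sin(at+\delta),\ y=B\sin(bt)\). The amplitudes, frequencies and phase below are arbitrary illustrative values and are not the formation parameters derived in the paper.

```python
import math


def lissajous_point(t, amp_x=1.0, amp_y=1.0, freq_x=1, freq_y=2, phase=math.pi / 2):
    """Point of the Lissajous curve at parameter t (illustrative parameters)."""
    x = amp_x * math.sin(freq_x * t + phase)
    y = amp_y * math.sin(freq_y * t)
    return x, y


def sample_curve(n_points=200, **params):
    """Sample one full period [0, 2*pi] of a curve with integer frequencies."""
    return [
        lissajous_point(2.0 * math.pi * i / n_points, **params)
        for i in range(n_points + 1)
    ]


if __name__ == "__main__":
    points = sample_curve()
    print("start:", points[0])
    print("end:  ", points[-1])  # coincides with the start up to rounding
```

With integer frequencies the curve closes after one period, which is why the first and last sampled points coincide up to rounding.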

Interestingly, the model seems to be especially stable if there are at least 5 deputy satellites. Also interestingly, the tethers can get entangled during operation; we have been able to only partially demarcate the cases of absent or present entanglement (based on the winding number invariant).

In this post I tried to explain in simple terms the idea of surrogate-based optimization (SBO) and its most natural version based on Expected Improvement (EI).
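As a concrete reference point, the Expected Improvement criterion has a standard closed form when the surrogate's prediction at a candidate point is Gaussian with mean \(\mu\) and standard deviation \(\sigma\). The sketch below assumes minimization and a Gaussian prediction; the function names are chosen for this example, and the code is not taken from the post itself.

```python
import math


def normal_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, f_best):
    """E[max(f_best - Y, 0)] for Y ~ N(mu, sigma^2), i.e. EI for minimization."""
    if sigma < 0:
        raise ValueError("sigma must be non-negative")
    gap = f_best - mu
    if sigma == 0:
        return max(gap, 0.0)  # no uncertainty: improvement is deterministic
    z = gap / sigma
    return gap * normal_cdf(z) + sigma * normal_pdf(z)


if __name__ == "__main__":
    f_best = 0.0  # best objective value observed so far (example value)
    for mu, sigma in ((0.0, 1.0), (0.0, 2.0), (1.0, 1.0), (-1.0, 0.0)):
        print(mu, sigma, expected_improvement(mu, sigma, f_best))
```

EI rewards both a low predicted mean and a high predictive uncertainty, and the next evaluation point is the candidate that maximizes it.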

My research in this area concerned the following question: can EI-based SBO fail, in the sense of never getting near the true global optimum? The expected answer is "yes", but the proof is not obvious because the behavior of SBO trajectories is not well understood on a rigorous level. Nevertheless, in this paper I give a rigorous example of failure in a sort of "analytic black hole" scenario.

In this paper I developed a quadratic form-based perturbation theory and used it to prove that small perturbations of the AKLT model remain gapped (which was widely believed, but hard to prove).

In this paper (preprint) I prove uniqueness of the ground state of a weakly interacting system in a strong sense involving "most general quantum boundary conditions", and discuss how one can interpret these conditions.

In this paper (preprint) I show that the so-called "commensurate-incommensurate transition" in the AKLT model can be explained by a peculiar Poisson-type random walk with a single reversal.