Russian scientists against the war
Hi, I'm an applied mathematician at Skoltech. I'm interested in various kinds of applied mathematics. See my Google Scholar profile. Below I describe some of the topics I have worked on.
The well-known universal approximation theorem states that neural networks can approximate any function with arbitrarily small error if we make the network large enough. But can we achieve this without increasing the network size? Maiorov and Pinkus showed in 1999 that this is possible if we use some very special activation functions: a fixed-size network can then fit any continuous function on any compact domain with any accuracy merely by adjusting the weights. The respective activation functions are analytic and monotone, but otherwise quite complicated, in particular not elementary.
In this paper I proposed to call such activations "superexpressive" and proved that there exist elementary superexpressive activations. The proof relies significantly on the density of irrational flow on the torus. For example, the network in the picture with activations sin and arcsin is such a superexpressive network (likely not optimal in terms of size) for two-variable functions. At the same time, common non-periodic activations such as sigmoid, ReLU, ELU, softplus, etc. are not superexpressive.
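The density fact mentioned above can be checked numerically. The following is a minimal sketch, not the construction from the paper: it follows the straight-line flow t ↦ (t·ω₁, t·ω₂) mod 1 on the unit torus and records how close the sampled trajectory comes to a chosen target point. The function names, the target point, the two directions and the sampling step are all assumptions made for illustration.

```python
import math


def torus_distance(p, q):
    """Distance between two points of the unit torus [0, 1)^2 (wrap-around metric)."""
    return math.hypot(*(min(abs(a - b), 1.0 - abs(a - b)) for a, b in zip(p, q)))


def closest_approach(target, direction, t_max, dt=0.01):
    """Smallest distance from `target` to the sampled flow
    t -> t * direction (mod 1), for t = 0, dt, 2*dt, ... up to t_max."""
    best = float("inf")
    steps = int(t_max / dt)
    for k in range(steps + 1):
        t = k * dt
        point = ((t * direction[0]) % 1.0, (t * direction[1]) % 1.0)
        best = min(best, torus_distance(point, target))
    return best


# Hypothetical illustration values (not taken from the paper).
target = (0.3, 0.1)
irrational = (1.0, math.sqrt(2.0))  # irrational slope: the trajectory is dense
rational = (1.0, 2.0)               # rational slope: the trajectory is a closed curve

if __name__ == "__main__":
    for t_max in (10, 100, 1000):
        print(t_max,
              closest_approach(target, irrational, t_max),
              closest_approach(target, rational, t_max))
```

For the irrational direction the closest approach keeps shrinking as the time horizon grows, whereas for the rational direction the trajectory closes up and stays a fixed distance away from this particular target.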
Network training by gradient-descent-based algorithms is a complex process that is generally hard to analyze theoretically. In this work with Maksim Velikanov we showed, however, that under some reasonably general assumptions on the target function one can rather accurately describe the asymptotic evolution of the loss under gradient descent in the NTK regime.
In this work we describe a complete phase diagram of approximation rates for deep ReLU networks. We assume that the target function $f$ belongs to a Hölder ball $F_{r,d}$ that, roughly speaking, consists of $d$-variate functions having bounded derivatives up to order $r$. We then ask for which exponents $p$ we can find approximations $\tilde f_W$ of such $f$ by ReLU networks with $W$ weights so that $\|f-\tilde f_W\|_\infty = O(W^{-p})$ as $W\to\infty$.
It turns out that there are two very different regimes. The "slower" regime provides approximation rates up to $p = r/d$. This regime can be implemented by relatively shallow networks whose weights depend continuously on the target. It is rather similar to classical linear approximation methods such as spline or Fourier expansion.
The "faster" regime is based on an entirely different idea of "weight encoding" (rather than linear constructions). This regime can provide approximation rates up to $p = 2r/d$. It requires network depths to grow as a power law in $W$; in particular, at $p = 2r/d$ the network size fully "goes into the depth" rather than the width. The weight precision also must grow as a power law in $W$. The weight assignment in this regime is fundamentally discontinuous.
An important idea in machine learning is to exploit the natural invariance (or more generally, equivariance) of target maps with respect to various groups of transformations (e.g., shifts, rotations, reflections, etc.). Building the relevant symmetry into the models generally makes them more efficient in terms of accuracy, complexity or training time. A standard example is convolutional networks, which are designed to be invariant with respect to grid translations.
In this paper I analyzed neural networks that are invariant and universal with respect to various groups of transformations. This means that the network must be not only invariant, but also capable of approximating any invariant map. This question is especially subtle in the case of groups such as the Euclidean rotation group, because it cannot be exactly implemented on finite grids on which the data is usually defined. It turns out that one can still rigorously describe neural network-type models that are provably invariant and universal even for this group, by considering a suitable limiting process (specifically, a map on the space of signals on $\mathbb{R}^2$ is continuous and SE(2)-equivariant if and only if it can be approximated by models of suitable architecture in the limit of infinitely detailed discretization).
Space tether systems are an interesting class of systems, potentially useful for various purposes such as space debris removal, satellite collocation, etc. In this joint work with our Astrium colleagues we studied a "hub-and-spoke" pyramidal formation rotating about a central satellite and holding another satellite beneath it. Unfortunately, this configuration requires relatively high fuel consumption.
So, in this paper we proposed another, freely moving (no fuel!) formation serving the same purpose. Instead of a circle, deputy satellites now move along Lissajous curves. We find relations between the system's parameters ensuring that the satellites and tethers never collide and the main satellite remains immobile, and show how all these relations can be satisfied.
Interestingly, the model seems to be especially stable if there are at least 5 deputy satellites. Also interestingly, the tethers can get entangled during operation; we have been able to only partially demarcate the cases of absent or present entanglement (based on the winding number invariant).
In this post I tried to explain in simple terms the idea of surrogate-based optimization (SBO) and its most natural version based on Expected Improvement (EI).
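For reference, the standard closed form of Expected Improvement can be sketched as follows. This is a minimal illustration that assumes minimization and a Gaussian surrogate prediction with mean `mu` and standard deviation `sigma` at the candidate point; the function and argument names are chosen here for illustration and are not taken from the post.

```python
import math


def expected_improvement(mu, sigma, f_best):
    """Expected Improvement for minimization.

    Assumes the surrogate predicts f(x) ~ N(mu, sigma^2) at the candidate
    point and that f_best is the smallest objective value observed so far.
    Returns E[max(f_best - f(x), 0)].
    """
    if sigma <= 0.0:
        # Degenerate prediction with no uncertainty: improvement is deterministic.
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    return (f_best - mu) * cdf + sigma * pdf


if __name__ == "__main__":
    # Same predicted mean, different uncertainty: the more uncertain point scores higher.
    print(expected_improvement(mu=0.0, sigma=1.0, f_best=0.0))
    print(expected_improvement(mu=0.0, sigma=2.0, f_best=0.0))
```

In an EI-based loop, the next evaluation point is chosen by maximizing this quantity over the search domain, which trades off a low predicted mean against high predictive uncertainty.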
My research in this area concerned the following question: can EI-based SBO fail, in the sense of never getting near the true global optimum? The expected answer is "yes", but the proof is not obvious because the behavior of SBO trajectories is not well understood on a rigorous level. Nevertheless, in this paper I give a rigorous example of failure in a sort of "analytic black hole" scenario.
In this paper I developed a quadratic form-based perturbation theory and used it to prove that small perturbations of the AKLT model remain gapped (which was widely believed, but hard to prove).
In this paper (preprint) I prove uniqueness of the ground state of a weakly interacting system in a strong sense involving "most general quantum boundary conditions", and discuss how one can interpret these conditions.
In this paper (preprint) I show that the so-called "commensurate-incommensurate transition" in the AKLT model can be explained by a peculiar Poisson-type random walk with a single reversal.