<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://xinwei-niu.vercel.app/feed.xml" rel="self" type="application/atom+xml" /><link href="https://xinwei-niu.vercel.app/" rel="alternate" type="text/html" /><updated>2026-02-04T05:20:23+00:00</updated><id>https://xinwei-niu.vercel.app/feed.xml</id><title type="html">Xinwei Niu</title><subtitle>Xinwei&apos;s academic portfolio</subtitle><author><name>Xinwei Niu</name><email>nxw2002@outlook.com</email></author><entry><title type="html">Triton Notes</title><link href="https://xinwei-niu.vercel.app/posts/2026/02/blog-Triton/" rel="alternate" type="text/html" title="Triton Notes" /><published>2026-02-04T00:00:00+00:00</published><updated>2026-02-04T00:00:00+00:00</updated><id>https://xinwei-niu.vercel.app/posts/2026/02/blog-post-Blog-Triton</id><content type="html" xml:base="https://xinwei-niu.vercel.app/posts/2026/02/blog-Triton/"><![CDATA[<p>TBC</p>]]></content><author><name>Xinwei Niu</name><email>nxw2002@outlook.com</email></author><category term="GPU programming" /><category term="Fused Kernel" /><summary type="html"><![CDATA[TBC]]></summary></entry><entry><title type="html">Online Convex Optimization and Accelerated Gradient Descent Methods for Efficient Training</title><link href="https://xinwei-niu.vercel.app/posts/2026/02/blog-Triton/" rel="alternate" type="text/html" title="Online Convex Optimization and Accelerated Gradient Descent Methods for Efficient Training" /><published>2026-02-04T00:00:00+00:00</published><updated>2026-02-04T00:00:00+00:00</updated><id>https://xinwei-niu.vercel.app/posts/2026/02/blog-post-Blog-on-Optimization</id><content type="html" xml:base="https://xinwei-niu.vercel.app/posts/2026/02/blog-Triton/"><![CDATA[<p>TBC</p>]]></content><author><name>Xinwei Niu</name><email>nxw2002@outlook.com</email></author><category term="Convex Optimization" /><category term="Online Optimization" /><category term="Nesterov Accelerated Gradient" /><summary type="html"><![CDATA[TBC]]></summary></entry><entry><title type="html">Non-convex optimization for Over-parameterized Neural Nets: Reproducing Kernel Hilbert Space and Neural Tangent Kernel</title><link href="https://xinwei-niu.vercel.app/posts/2026/02/blog-RKHS/" rel="alternate" type="text/html" title="Non-convex optimization for Over-parameterized Neural Nets: Reproducing Kernel Hilbert Space and Neural Tangent Kernel" /><published>2026-02-01T00:00:00+00:00</published><updated>2026-02-01T00:00:00+00:00</updated><id>https://xinwei-niu.vercel.app/posts/2026/02/blog-post-Blog-on-Faster-Convergence-Rate</id><content type="html" xml:base="https://xinwei-niu.vercel.app/posts/2026/02/blog-RKHS/"><![CDATA[<p>This blog is based on <em>Real Analysis</em> by Elias M. Stein and Rami Shakarchi, and <em>Learning Theory on First Principles</em> by Francis Bach.</p>

<h2 id="1-reproducing-kernel-hilbert-spacerkhs">1. Reproducing Kernel Hilbert Space(RKHS)</h2>

<p>Here are the prerequisites for understanding RKHS:</p>
<h3 id="11-kernel-functions">1.1 Kernel Functions</h3>

<p>(Definition 1.1) <strong>Hilbert Space</strong>: A Hilbert Space \(\mathcal{H}\) is a complete vector space with an inner product \(\langle f,g\rangle_\mathcal{H}\), where the norm is defined by \(\|f\|_\mathcal{H} = \sqrt{\langle f,f\rangle_\mathcal{H}}\). (\(L^2\) space is a type of Hilbert Space.)</p>

<p>(Theorem 1.1) <strong>Riesz Representation Theorem</strong>: Every continuous linear functional \(\Phi\) on a Hilbert space \(\mathcal{H}\) can be written in the form:</p>

<p>\[\Phi(f) = \langle f, g \rangle_\mathcal{H}\] for a unique element \(g \in \mathcal{H}\).</p>

<p>(Definition 1.2) <strong>Reproducing Property of RKHS</strong>: A Hilbert space \(\mathcal{H}\) of functions on \(\mathcal{X}\) with kernel \(K\) has the reproducing property if, for every \(f \in \mathcal{H}\) and every \(x \in \mathcal{X}\),</p>

<p>\[f(x) = \langle f, K(\cdot, x)\rangle_\mathcal{H}\]</p>

<p>In particular, taking \(f = K(\cdot, x')\) shows that the reproducing kernel feature map \(\Phi: \mathcal{X} \rightarrow \mathcal{H}\), \(\Phi(x) = K(\cdot, x)\), satisfies \(K(x, x') = \langle \Phi(x), \Phi(x')\rangle_\mathcal{H}\).</p>

<p>(Theorem 1.2) <strong>Moore-Aronszajn Theorem</strong>: For every symmetric positive definite kernel function \(K: \mathbb{R}^N \times \mathbb{R}^N \rightarrow \mathbb{R}\), there exists a unique Hilbert Space \(\mathcal{H}\) with \(K\) as its reproducing kernel.</p>
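<p>As a minimal numerical illustration (my own, not taken from the referenced texts): the Gaussian RBF kernel \(K(x, x') = \exp(-\|x - x'\|^2 / 2\sigma^2)\) is positive definite, so by the Moore-Aronszajn Theorem it induces an RKHS. The sketch below builds a Gram matrix on random points and checks that its eigenvalues are non-negative; the helper names are hypothetical.</p>

<pre><code class="language-python">import numpy as np

# Toy illustration only: verify positive semi-definiteness of an RBF Gram matrix.
def rbf_kernel(X, sigma=1.0):
    """Gram matrix of the Gaussian RBF kernel on the rows of X."""
    sq = np.sum(X**2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-sq_dists / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))            # 20 points in R^3
K = rbf_kernel(X)

# A positive definite kernel yields a positive semi-definite Gram matrix.
eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigvals.min())   # ~0 up to floating-point error
</code></pre>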

<h2 id="2-neural-tangent-kernel">2. Neural Tangent Kernel</h2>

<h3 id="21-intuition-of-ntk">2.1 Intuition of NTK</h3>

<p>The Neural Tangent Kernel (NTK) is a tool that allows us to treat a deep neural network as a linear model in a high-dimensional feature space. The central intuition is that as the width \(m\) of a neural network layer approaches infinity, the training dynamics simplify. Instead of the weights moving in a complex, non-linear way, the network’s output evolves as if it were performing kernel gradient descent with a fixed kernel.</p>

<h3 id="22-lazy-training">2.2 Lazy Training</h3>

<p>Under over-parameterized settings, the network enters a regime known as Lazy Training. In this state, because the model has an abundance of parameters, the “effort” required to minimize the loss is spread across so many neurons that each individual weight only needs to move an infinitesimal distance from its random initialization \(\theta_0\). Mathematically, the parameters \(\theta\) stay so close to \(\theta_0\) that the model can be linearized via a Taylor expansion: \[f(x, \theta) \approx f(x, \theta_0) + \nabla_\theta f(x, \theta_0)^\top (\theta - \theta_0)\]
Because the weights barely move, the features learned by the network are essentially the random features present at initialization.</p>
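<p>A toy numerical check of this first-order expansion (my own illustration with a deliberately tiny stand-in model; it does not by itself explain why training keeps \(\theta\) near \(\theta_0\)):</p>

<pre><code class="language-python">import numpy as np

# Toy illustration: near theta_0 the output is well approximated by its Taylor expansion.
rng = np.random.default_rng(1)
m = 512                                    # "width" of the toy model
theta0 = rng.normal(size=m) / np.sqrt(m)   # random initialization theta_0
x = rng.normal(size=m)                     # fixed input features

def f(theta):
    """Stand-in for f(x, theta): a single nonlinear readout of the parameters."""
    return np.tanh(theta @ x)

grad0 = (1 - np.tanh(theta0 @ x)**2) * x   # gradient of f at theta_0

theta = theta0 + 1e-3 * rng.normal(size=m)            # a small step away from theta_0
linearized = f(theta0) + grad0 @ (theta - theta0)      # first-order Taylor expansion
print(abs(f(theta) - linearized))                       # tiny: the model is effectively linear in theta
</code></pre>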

<h3 id="23-definition-of-neural-tangent-kernels">2.3 Definition of Neural Tangent Kernels</h3>

<p>For a neural network \(f(x, \theta)\), the NTK \(\Theta\) is defined by the inner product of the gradients of the model output with respect to its parameters: \[\Theta(x, x') = \sum_{p=1}^P \frac{\partial f(x, \theta)}{\partial \theta_p} \frac{\partial f(x', \theta)}{\partial \theta_p} = \langle \nabla_\theta f(x, \theta), \nabla_\theta f(x', \theta) \rangle\]
In the infinite-width limit, this kernel becomes deterministic and remains constant throughout the entire training process.</p>
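<p>A minimal sketch (my own illustration, using a hypothetical toy two-layer network and finite-difference gradients) of the <em>empirical</em> NTK: stack the parameter gradient of each input into a Jacobian \(J\), and the NTK Gram matrix is \(J J^\top\).</p>

<pre><code class="language-python">import numpy as np

# Toy illustration of the empirical NTK for a small two-layer network.
rng = np.random.default_rng(2)
m, d, n = 128, 4, 6                       # width, input dim, number of inputs
X = rng.normal(size=(n, d))
theta0 = rng.normal(size=m * d + m)       # flattened parameters theta_0 = (W, a)

def f(x, theta):
    W = theta[:m * d].reshape(m, d)
    a = theta[m * d:]
    return a @ np.tanh(W @ x) / np.sqrt(m)

def grad_theta(x, theta, eps=1e-5):
    """Finite-difference gradient of f(x, theta) with respect to theta."""
    g = np.zeros_like(theta)
    for p in range(theta.size):
        e = np.zeros_like(theta); e[p] = eps
        g[p] = (f(x, theta + e) - f(x, theta - e)) / (2 * eps)
    return g

# Empirical NTK: Theta(x_i, x_j) = dot(grad f(x_i), grad f(x_j))
J = np.stack([grad_theta(x, theta0) for x in X])    # n x P Jacobian
ntk = J @ J.T                                       # n x n kernel matrix
print(ntk.shape, np.linalg.eigvalsh(ntk).min())     # symmetric, PSD up to numerical error
</code></pre>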

<h3 id="24-connection-between-ntk-and-rkhs">2.4 Connection between NTK and RKHS</h3>

<p>The connection is established by viewing the gradient \(\nabla_\theta f(x, \theta_0)\) as a feature map \(\Phi(x)\). According to the Moore-Aronszajn Theorem, the NTK defines a unique RKHS \(\mathcal{H}_{ntk}\). Training an infinite-width neural network with gradient descent is mathematically equivalent to finding the minimum-norm solution in this RKHS, essentially performing Kernel Ridge Regression with the NTK.</p>
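<p>In that view, the outcome of infinite-width training can be mimicked by kernel (ridge) regression with the NTK as the kernel. A minimal sketch (my own illustration; the kernel below is a generic stand-in, not an actual NTK):</p>

<pre><code class="language-python">import numpy as np

# Toy illustration: kernel ridge regression with a precomputed kernel matrix.
def kernel_ridge_predict(K_train, y_train, K_test_train, lam=1e-6):
    """alpha = (K + lam I)^{-1} y, prediction = K_* alpha.
    As lam approaches 0 this recovers the minimum-RKHS-norm interpolant,
    which is the solution infinite-width NTK training converges to."""
    n = K_train.shape[0]
    alpha = np.linalg.solve(K_train + lam * np.eye(n), y_train)
    return K_test_train @ alpha

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 2))
K = np.exp(-np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1))  # stand-in kernel matrix
y = rng.normal(size=8)
print(kernel_ridge_predict(K, y, K[:3]))   # predictions on the first 3 points
</code></pre>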

<h2 id="3-application-of-neural-tangent-kernel-in-deep-learning">3. Application of Neural Tangent Kernel in Deep Learning</h2>

<p>In modern deep learning research, NTK concepts have been adopted for NLP tasks, mainly in context window extension methods (known as NTK-aware interpolation) that preserve the high-frequency components the network needs to learn fine-grained distinctions between nearby positions. This is particularly relevant when scaling Rotary Positional Embeddings (RoPE), where simple linear scaling often causes the model to “forget” high-resolution positional information.</p>
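<p>One commonly described variant of this idea rescales the RoPE frequency base instead of linearly compressing positions; treat the exact exponent below as my assumption about the usual formulation rather than a definitive recipe.</p>

<pre><code class="language-python">import math

# Sketch of NTK-aware RoPE scaling (assumed formulation, not an official implementation).
def ntk_aware_base(base, scale, dim):
    """Enlarge the frequency base so high-frequency components barely change
    while low-frequency components absorb most of the context stretch."""
    return base * scale ** (dim / (dim - 2))

def rope_frequencies(base, dim):
    """Standard RoPE inverse frequencies for a head dimension `dim`."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

orig = rope_frequencies(10000.0, 64)
stretched = rope_frequencies(ntk_aware_base(10000.0, scale=4.0, dim=64), 64)
print(orig[0] / stretched[0], orig[-1] / stretched[-1])  # ~1 for high freq, ~scale for low freq
</code></pre>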

<p>Also, NTK is a tool which shows the convergence of neural networks. By proving that the kernel remains positive definite and constant during training, researchers can demonstrate that over-parameterized networks will always converge to a global minimum when trained with gradient descent, providing a theoretical bedrock for why “bigger is often better” in model architecture.</p>]]></content><author><name>Xinwei Niu</name><email>nxw2002@outlook.com</email></author><category term="Reproducing Hilbert Space" /><category term="Neural Tangent Kernel" /><category term="Kernel Methods" /><category term="Empirical Risk Minimization" /><summary type="html"><![CDATA[This blog is based on Real Analysis by Elias M. Stein and Rami Shakarchi, and Learning Theory on First Principles by Francis Bach.]]></summary></entry><entry><title type="html">Efficient Methods for Generative Models 1: Linear Attention, State-Space Models, and Linear RNNs</title><link href="https://xinwei-niu.vercel.app/posts/2025/11/blog-1-efficent/" rel="alternate" type="text/html" title="Efficient Methods for Generative Models 1: Linear Attention, State-Space Models, and Linear RNNs" /><published>2025-11-20T00:00:00+00:00</published><updated>2025-11-20T00:00:00+00:00</updated><id>https://xinwei-niu.vercel.app/posts/2025/11/blog-post-First-Blog-On-LinearComplexityModel</id><content type="html" xml:base="https://xinwei-niu.vercel.app/posts/2025/11/blog-1-efficent/"><![CDATA[<p>Modern sequence modeling has evolved from recurrent architectures to attention-based models and, more recently, state-space approaches. Traditional RNNs introduced an efficient way to process sequential data but struggled with long-term dependencies. Transformers later revolutionized the field with attention mechanisms, though their quadratic cost limits scalability to long contexts. This has driven research into more efficient alternatives—such as linear attention, state-space models like S4 and Mamba, and newer architectures like DeltaNet, that aim to combine scalability, stability, and strong modeling capacity for long-range sequence tasks.</p>

<h2 id="introduction-to-recurrent-neural-networks-rnns">Introduction to Recurrent Neural Networks (RNNs)</h2>

<p>Recurrent Neural Networks (RNNs) are a class of neural networks specifically designed to handle <strong>sequential and temporal data</strong> by maintaining a hidden state that evolves over time. Unlike feedforward networks, RNNs have <strong>recurrent connections</strong> that allow information from previous time steps to influence the current computation, effectively providing the network with memory. At each time step \(t\), the hidden state \(h_t\) is computed as:</p>

<p>\[
h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b_h)
\]</p>

<p>where \(x_t\) is the input at time step \(t\), \(h_{t-1}\) is the hidden state from the previous step, \(W_{xh}\) and \(W_{hh}\) are learnable weight matrices, \(b_h\) is a bias term, and \(f(\cdot)\) is a nonlinear activation function such as \(\tanh\) or \(\text{ReLU}\).</p>
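<p>A minimal numpy sketch of this recurrence (my own illustration; the dimensions are arbitrary):</p>

<pre><code class="language-python">import numpy as np

# Toy illustration of one RNN forward pass: h_t = f(W_xh x_t + W_hh h_{t-1} + b_h).
d_in, d_h, T = 8, 16, 20                       # input size, hidden size, sequence length
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(d_h, d_in)) * 0.1      # input-to-hidden weights
W_hh = rng.normal(size=(d_h, d_h)) * 0.1       # hidden-to-hidden (recurrent) weights
b_h = np.zeros(d_h)

xs = rng.normal(size=(T, d_in))                # input sequence x_1 ... x_T
h = np.zeros(d_h)                              # h_0
for x_t in xs:
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # one fixed-cost step per token

print(h.shape)                                 # total cost scales as O(T)
</code></pre>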

<p>A key advantage of RNNs is their <strong><em>linear computational complexity with respect to the input sequence length</em></strong>. At each step, the network performs a fixed set of operations, making the total cost scale linearly with the number of time steps. This efficiency allows RNNs to process sequences of arbitrary length without a combinatorial explosion in computation, which was one reason for their widespread adoption in language modeling, speech recognition, and time-series prediction during the 2010s.</p>

<p>Despite this efficiency, standard RNNs face significant training challenges, primarily the <strong>vanishing and exploding gradient problem</strong>. During backpropagation through time (BPTT), gradients are propagated through the recurrent weights over many time steps. For long sequences, this can lead to gradients that <strong>decay exponentially</strong> (vanishing) or <strong>grow uncontrollably</strong> (exploding), making it difficult for the network to capture long-term dependencies. These limitations motivated the development of gated variants such as <strong>Long Short-Term Memory (LSTM)</strong>[1] and <strong>Gated Recurrent Unit (GRU)</strong>[2] networks, which incorporate mechanisms to preserve information over extended sequences and stabilize training.</p>

<hr />

<h2 id="introduction-to-linear-attention">Introduction to Linear Attention</h2>

<p>The attention mechanism in the <strong>Transformer</strong>[3] involves computing pairwise interactions between all tokens in the sequence. Given the query, key, and value matrices \(Q, K, V\), if \(Q \in \mathbb{R}^{N \times d}\) and \(K \in \mathbb{R}^{N \times d}\), then the attention score matrix is
\(
QK^{\top} \in \mathbb{R}^{N \times N},
\)
and its computational complexity is
\(
\mathcal{O}(N^{2}).
\)</p>

<p>Under long-context reasoning tasks, the quadratic growth of the attention operation becomes a major bottleneck, since increasing the sequence length \(N\) leads to a rapid increase in both computation and memory cost. As \(N\) grows into the hundreds of thousands or millions, storing and manipulating the full \(N \times N\) attention matrix becomes infeasible, motivating the development of more efficient attention mechanisms that reduce or avoid the \(\mathcal{O}(N^{2})\) complexity.</p>

<p>Hence, later works started looking for clever ways to get around this massive computation bottleneck. One of the first big ideas was Locality-Sensitive Hashing (LSH) Attention [4], which cuts the complexity down to \(\mathcal{O}(N \log N)\). The trick here is simple but powerful: instead of comparing every token with every other token, LSH groups similar tokens into buckets so that attention only happens within those smaller groups. Fewer comparisons, faster models.</p>

<p>Then came Linear Attention [5], which pushed things even further by reducing the cost to linear time, \(\mathcal{O}(N)\). By rethinking the softmax attention itself: Instead of explicitly building the full attention matrix, Linear Attention rewrites the operation using kernel functions so that attention can be computed through a series of efficient matrix multiplications. The result is that you get attention-like behavior without paying the quadratic price. It assumes that the exponential kernel can be approximated or represented by a feature map \(\phi\) such that \( \exp(q^\top k)=\phi(q)^\top \phi(k) \). Substituting this into the numerator yields \[ \sum_{j=1}^N \exp(q_i^\top k_j)v_j = \phi(q_i)^\top \left( \sum_{j=1}^N \phi(k_j)v_j^\top \right), \] allowing us to define \( S=\phi(K)^\top V \) so that all numerators can be written compactly as \( \phi(Q)S = \phi(Q)(\phi(K)^\top V) \). Similarly, the denominator becomes \[ \sum_{j=1}^N \exp(q_i^\top k_j)=\phi(q_i)^\top \left( \sum_{j=1}^N \phi(k_j) \right), \] and defining \( z=\phi(K)^\top \mathbf{1}_N \) gives the denominator vector as \( \phi(Q)z \). Combining numerator and denominator gives the linear-attention approximation: \[ \mathrm{Att}(Q,K,V)=\frac{\phi(Q)(\phi(K)^\top V)}{\phi(Q)(\phi(K)^\top \mathbf{1}_N)}, \] and in row-wise form \( y_i=\frac{\phi(q_i)^\top S}{\phi(q_i)^\top z} \), where \( S=\phi(K)^\top V \) and \( z=\phi(K)^\top \mathbf{1}_N \). Since \(S\) and \(z\) are computed once in \(\mathcal{O}(N)\) time and reused for all queries, the overall complexity becomes linear in sequence length, yielding the key efficiency benefit of linear attention.</p>
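<p>The sketch below (my own illustration, non-causal for simplicity) contrasts the quadratic form with the linearized form using the common feature map \(\phi(u)=\mathrm{elu}(u)+1\); note this is an approximation of softmax attention, not an exact rewrite.</p>

<pre><code class="language-python">import numpy as np

# Toy comparison of quadratic (softmax) attention and kernelized linear attention.
def softmax_attention(Q, K, V):
    """Quadratic attention: materializes the N x N score matrix."""
    S = Q @ K.T                                    # N x N
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V

def linear_attention(Q, K, V):
    """Linear attention with feature map phi(u) = elu(u) + 1 (always positive)."""
    phi = lambda u: np.where(u >= 0, u + 1.0, np.exp(u))
    Qf, Kf = phi(Q), phi(K)
    S = Kf.T @ V                       # d x d state, computed once in O(N d^2)
    z = Kf.sum(axis=0)                 # d-dimensional normalizer
    return (Qf @ S) / (Qf @ z)[:, None]

rng = np.random.default_rng(0)
N, d = 128, 16
Q, K, V = rng.normal(size=(3, N, d))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
</code></pre>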

<hr />
<h2 id="state-space-model-s4-mamba-and-mamba-2">State-Space Model: S4, Mamba and Mamba 2</h2>

<p>State-space models (SSMs) provide a mathematically grounded framework for sequence modeling by representing long-range temporal dependencies through latent linear dynamical systems. A generic continuous-time SSM is expressed as
\[
\dot{x}(t) = A x(t) + B u(t), \qquad y(t) = C x(t) + D u(t),
\]
which, after discretization, yields a structured recurrence that can be evaluated either sequentially or in parallel. The challenge in modern deep learning applications is to parameterize these systems such that they remain stable, expressive, and computationally efficient on long sequences. Recent advances have focused on imposing algebraic structure on the transition matrix \(A\) to enable fast convolution, efficient parallelization, and numerically stable training.</p>
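<p>A small sketch (my own illustration, using a diagonal \(A\) and a zero-order-hold style discretization) of how the continuous system becomes a linear recurrence that can be unrolled over a sequence:</p>

<pre><code class="language-python">import numpy as np

# Toy illustration: discretize a diagonal SSM and evaluate it recurrently.
n, T = 16, 100                                  # state dimension, sequence length
rng = np.random.default_rng(0)
A = -np.abs(rng.normal(size=n))                 # diagonal entries with negative real part: stable
B = rng.normal(size=n)
C = rng.normal(size=n)
dt = 0.1                                        # discretization step size

# Zero-order-hold discretization of the diagonal system
A_bar = np.exp(dt * A)                          # elementwise exp since A is diagonal
B_bar = (A_bar - 1.0) / A * B                   # A^{-1}(A_bar - I) B for diagonal A

u = rng.normal(size=T)                          # scalar input sequence
x = np.zeros(n)
ys = []
for t in range(T):
    x = A_bar * x + B_bar * u[t]                # x_{t+1} = A_bar x_t + B_bar u_t
    ys.append(C @ x)                            # y_t = C x_t
print(len(ys))
</code></pre>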

<p>The Structured State Space sequence model (S4) is a foundational example of this line of work. S4 draws heavily on the HiPPO (High-Order Polynomial Projection Operators) method to initialize state dynamics in a basis capable of retaining long-range history. S4 represents the continuous-time transition matrix \(A\) in a <strong>Normal Plus Low-Rank (NPLR)</strong> form, ensuring controlled spectral properties while maintaining flexibility. Through a learned similarity transformation, this representation is converted into a <strong>Diagonal Plus Low-Rank (DPLR)</strong> form. The diagonal component allows exact and efficient discretization and supports parallel prefix multiplication, while the low-rank correction is applied using Woodbury-style updates. This architecture enables fast kernel generation and efficient convolution-based evaluation, allowing S4 to model long sequences with high stability and strong gradient flow.</p>

<p>Mamba extends this framework by introducing <strong>selective</strong> state-space updates, where the discretized recurrence is modulated by input-dependent gates. Instead of a fixed state update</p>

<p>\[
x_{t+1} = \overline{A} x_t + \overline{B} u_t,
\]
Mamba applies elementwise modulation,
\[
x_{t+1} = f_t \odot (\overline{A} x_t) + g_t \odot (\overline{B} u_t),
\]</p>

<p>where the gates \(f_t\) and \(g_t\) are dynamically computed from the input sequence. This allows the model to selectively retain, suppress, or transform information at each timestep while preserving the computational structure of the underlying SSM. The key advantage of Mamba lies in the fact that the gating functions augment expressivity without sacrificing linear-time recurrence or the ability to execute fast parallel prefix computations.</p>
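<p>A toy sketch of this gated recurrence (my own illustration; real Mamba derives \(\overline{A}\), \(\overline{B}\), and an input-dependent step size from learned projections, which is omitted here):</p>

<pre><code class="language-python">import numpy as np

# Toy illustration of a selective (input-gated) diagonal SSM in recurrent mode.
def selective_scan(u, A_bar, W_f, W_g, W_B, C):
    """x_{t+1} = f_t * (A_bar x_t) + g_t * (B u_t), with input-dependent gates f_t, g_t.
    u is (T, d_in); A_bar, the gates, and the state are all (n,)."""
    T = u.shape[0]
    n = A_bar.shape[0]
    x = np.zeros(n)
    ys = np.zeros(T)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    for t in range(T):
        f_t = sigmoid(W_f @ u[t])               # input-dependent retention gate
        g_t = sigmoid(W_g @ u[t])               # input-dependent write gate
        x = f_t * (A_bar * x) + g_t * (W_B @ u[t])
        ys[t] = C @ x
    return ys

rng = np.random.default_rng(0)
T, d_in, n = 64, 8, 32
u = rng.normal(size=(T, d_in))
A_bar = np.exp(-np.abs(rng.normal(size=n)))     # contractive diagonal dynamics
W_f, W_g, W_B = rng.normal(size=(3, n, d_in)) * 0.5
C = rng.normal(size=n)
print(selective_scan(u, A_bar, W_f, W_g, W_B, C).shape)
</code></pre>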

<p><strong>Recurrent mode</strong> evaluation in Mamba processes tokens sequentially and operates in \(O(n)\) per timestep, where \(n\) is the state dimension. GPU implementations typically fuse the diagonal evolution, low-rank correction, and gating operations into a single kernel to minimize memory movement. This yields high streaming throughput suitable for autoregressive inference and online applications. In contrast, <strong>parallel mode</strong> evaluation constructs a time-varying convolution kernel induced by the gated recurrence. Since the coefficients depend on the input, Mamba performs a parallel prefix-product over the diagonal dynamics and integrates low-rank corrections in a block-wise manner. This allows efficient training on long sequences and batch inference, although the algorithmic structure is more complex than in fixed-parameter SSMs such as S4.</p>

<p>Mamba 2 further refines this design by improving numerical conditioning, parameter efficiency, and hardware alignment. From a modeling perspective, Mamba 2 adopts more stable parameterizations for the diagonal and low-rank components, ensuring contractive behavior under long prefix-product scans and reducing sensitivity to mixed-precision training. Algorithmically, Mamba 2 introduces a more efficient parallel scan procedure that avoids explicit storage of per-timestep kernel factors, significantly reducing memory overhead. Systems-level optimizations include a redesigned layout for state vectors and parameters to maximize GPU coalescing, as well as enhanced fused kernels that reduce launch overhead and exploit register-level tiling. These improvements yield faster training, lower memory consumption, and more robust behavior on very long sequences. Overall, Mamba 2 represents an evolution of the selective SSM framework that is both theoretically more stable and practically more hardware-efficient.</p>

<hr />

<h2 id="retnet-fwp-and-deltanet">RetNet, FWP, and DeltaNet</h2>

<figure style="text-align: center;">
  <div style="display: flex; justify-content: center; gap: 20px;">
    <img src="/files/parallel.png" alt="Parallel" style="max-width: 30%; height: auto;" />
    <img src="/files/recurrent.png" alt="Recurrent" style="max-width: 60%; height: auto;" />
  </div>

  <figcaption style="margin-top: 10px; font-style: italic;">
    Figure: Parallel vs. recurrent mode of RetNet [9].
  </figcaption>
</figure>

<p>DeltaNet explores the intermediate space between purely recurrent SSMs and fully parallelizable convolutional or attention-based architectures. Its design decomposes the sequence update into components that can be executed either in parallel or recurrently, allowing the model to capture long-range dependencies while maintaining predictable latency. DeltaNet variants often incorporate structured transitions combined with delta-rule or short-range convolutional updates, enabling them to achieve strong performance on long-context tasks while remaining simpler to implement and optimize compared to fully diagonalized SSMs.</p>
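<p>At the core of DeltaNet [10] is a delta-rule state update: instead of simply accumulating \(v_t k_t^\top\) as in vanilla linear attention, the fast-weight state is corrected toward the new value associated with the current key. A recurrent-mode sketch follows (my own illustration; the paper's actual contribution is a hardware-efficient parallel form over the sequence length, which is not shown here):</p>

<pre><code class="language-python">import numpy as np

# Toy recurrent-mode sketch of the delta rule (sequential, not the parallel form).
def deltanet_recurrent(Q, K, V, beta):
    """S_t = S_{t-1} - beta_t (S_{t-1} k_t - v_t) k_t^T, output o_t = S_t q_t.
    beta_t acts as a per-token learning rate in (0, 1)."""
    T, d = Q.shape
    S = np.zeros((V.shape[1], d))                 # fast-weight state
    out = np.zeros((T, V.shape[1]))
    for t in range(T):
        k, v, q = K[t], V[t], Q[t]
        S = S - beta[t] * np.outer(S @ k - v, k)  # overwrite the value stored under key k
        out[t] = S @ q
    return out

rng = np.random.default_rng(0)
T, d = 32, 16
Q, K, V = rng.normal(size=(3, T, d))
K = K / np.linalg.norm(K, axis=-1, keepdims=True)    # unit-norm keys keep the update stable
beta = 1.0 / (1.0 + np.exp(-rng.normal(size=T)))     # sigmoid-activated betas
print(deltanet_recurrent(Q, K, V, beta).shape)
</code></pre>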

<p>Future research directions include advancing the theoretical understanding of structured transition operators, particularly the stability properties of diagonal-plus-low-rank parameterizations under input-dependent modulation. Hardware-aware model design remains another important goal, as efficient execution of SSMs increasingly depends on custom fused kernels, memory layouts, and prefix-scan algorithms. Hybrid architectures that combine SSMs with sparse or content-based attention are also promising, offering ways to capture both long-range and local interactions efficiently. Finally, emerging directions such as adaptive state dimension, learnable discretization schemes, quantized SSM inference, and multimodal selective SSMs present exciting opportunities for further improving scalability and applicability.</p>

<h1 id="hybrid-models">Hybrid Models</h1>

<hr />

<h3 id="references">References</h3>

<p>[1] Hochreiter, S., &amp; Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.</p>

<p>[2] Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., &amp; Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.</p>

<p>[3] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … &amp; Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.</p>

<p>[4] Kitaev, N., Kaiser, Ł., &amp; Levskaya, A. (2020). Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451.</p>

<p>[5] Katharopoulos, A., Vyas, A., Pappas, N., &amp; Fleuret, F. (2020, November). Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning (pp. 5156-5165). PMLR.</p>

<p>[6] Schlag, I., Irie, K., &amp; Schmidhuber, J. (2021, July). Linear transformers are secretly fast weight programmers. In International conference on machine learning (pp. 9355-9366). PMLR.</p>

<p>[7] Gu, A., Goel, K., &amp; Ré, C. (2021). Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396.</p>

<p>[8] Gu, A., &amp; Dao, T. (2024, May). Mamba: Linear-time sequence modeling with selective state spaces. In First conference on language modeling.</p>

<p>[9] Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., … &amp; Wei, F. (2023). Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621.</p>

<p>[10] Yang, S., Wang, B., Zhang, Y., Shen, Y., &amp; Kim, Y. (2024). Parallelizing linear transformers with the delta rule over sequence length. Advances in neural information processing systems, 37, 115491-115522.</p>

<p>[11] Yang, S., Kautz, J., &amp; Hatamizadeh, A. (2024). Gated delta networks: Improving Mamba2 with delta rule. arXiv preprint arXiv:2412.06464.</p>

<p>[] Team Kimi(2025). Kimi Linear: An Expressive, Efficient Attention Architecture. arXiv preprint arXiv:2510.26692.</p>]]></content><author><name>Xinwei Niu</name><email>nxw2002@outlook.com</email></author><category term="Efficient Architecture" /><category term="Machine Learning" /><category term="Long Context" /><category term="Linear Attention" /><summary type="html"><![CDATA[Modern sequence modeling has evolved from recurrent architectures to attention-based models and, more recently, state-space approaches. Traditional RNNs introduced an efficient way to process sequential data but struggled with long-term dependencies. Transformers later revolutionized the field with attention mechanisms, though their quadratic cost limits scalability to long contexts. This has driven research into more efficient alternatives—such as linear attention, state-space models like S4 and Mamba, and newer architectures like DeltaNet, that aim to combine scalability, stability, and strong modeling capacity for long-range sequence tasks.]]></summary></entry><entry><title type="html">Efficient Methods for Generative Models 2: KV Cache, FlashAttention, vLLM</title><link href="https://xinwei-niu.vercel.app/posts/2025/11/blog-2-efficent/" rel="alternate" type="text/html" title="Efficient Methods for Generative Models 2: KV Cache, FlashAttention, vLLM" /><published>2025-11-20T00:00:00+00:00</published><updated>2025-11-20T00:00:00+00:00</updated><id>https://xinwei-niu.vercel.app/posts/2025/11/blog-post-Second-Blog-System</id><content type="html" xml:base="https://xinwei-niu.vercel.app/posts/2025/11/blog-2-efficent/"><![CDATA[<h2 id="introduction-to-recurrent-neural-networks-rnns">Introduction to Recurrent Neural Networks (RNNs)</h2>

<p>Recurrent Neural Networks (RNNs) are a class of neural networks specifically designed to handle <strong>sequential and temporal data</strong> by maintaining a hidden state that evolves over time. Unlike feedforward networks, RNNs have <strong>recurrent connections</strong> that allow information from previous time steps to influence the current computation, effectively providing the network with memory. At each time step \(t\), the hidden state \(h_t\) is computed as:</p>

<p>\[
h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b_h)
\]</p>

<p>where \(x_t\) is the input at time step \(t\), \(h_{t-1}\) is the hidden state from the previous step, \(W_{xh}\) and \(W_{hh}\) are learnable weight matrices, \(b_h\) is a bias term, and \(f(\cdot)\) is a nonlinear activation function such as \(\tanh\) or \(\text{ReLU}\).</p>

<p>A key advantage of RNNs is their <strong><em>linear computational complexity with respect to the input sequence length</em></strong>. At each step, the network performs a fixed set of operations, making the total cost scale linearly with the number of time steps. This efficiency allows RNNs to process sequences of arbitrary length without a combinatorial explosion in computation, which was one reason for their widespread adoption in language modeling, speech recognition, and time-series prediction during the 2010s.</p>

<p>Despite this efficiency, standard RNNs face significant training challenges, primarily the <strong>vanishing and exploding gradient problem</strong>. During backpropagation through time (BPTT), gradients are propagated through the recurrent weights over many time steps. For long sequences, this can lead to gradients that <strong>decay exponentially</strong> (vanishing) or <strong>grow uncontrollably</strong> (exploding), making it difficult for the network to capture long-term dependencies. These limitations motivated the development of gated variants such as <strong>Long Short-Term Memory (LSTM)</strong>[2] and <strong>Gated Recurrent Unit (GRU)</strong> networks, which incorporate mechanisms to preserve information over extended sequences and stabilize training.</p>

<hr />

<h2 id="introduction-to-linear-attention">Introduction to Linear Attention</h2>
<p>The attention mechanism in the <strong>Transformer</strong>[1] involves computing pairwise interactions between all tokens in the sequence. Given the query, key, and value matrices \(Q, K, V\), if \(Q \in \mathbb{R}^{N \times d}\) and \(K \in \mathbb{R}^{N \times d}\), then the attention score matrix is
\(
QK^{\top} \in \mathbb{R}^{N \times N},
\)
and its computational complexity is
\(
\mathcal{O}(N^{2}).
\)</p>

<p>Under long-context reasoning tasks, the quadratic growth of the attention operation becomes a major bottleneck, since increasing the sequence length \(N\) leads to a rapid increase in both computation and memory cost. As \(N\) grows into the hundreds of thousands or millions, storing and manipulating the full \(N \times N\) attention matrix becomes infeasible, motivating the development of more efficient attention mechanisms that reduce or avoid the \(\mathcal{O}(N^{2})\) complexity.</p>

<p>Hence, later works started trying to mitigate this computation bottleneck, such as Locality-Sensitive Hashing Attention [4] and Linear Attention [5].</p>

<hr />

<h3 id="references">References</h3>]]></content><author><name>Xinwei Niu</name><email>nxw2002@outlook.com</email></author><category term="Efficient Architecture" /><category term="Machine Learning" /><category term="Long Context" /><category term="Linear Attention" /><summary type="html"><![CDATA[Introduction to Recurrent Neural Networks (RNNs)]]></summary></entry><entry><title type="html">Efficient Methods for Generative Models 3: Sparse and Adaptive Attention, Dynamic Token Pooling</title><link href="https://xinwei-niu.vercel.app/posts/2025/11/blog-3-efficent/" rel="alternate" type="text/html" title="Efficient Methods for Generative Models 3: Sparse and Adaptive Attention, Dynamic Token Pooling" /><published>2025-11-20T00:00:00+00:00</published><updated>2025-11-20T00:00:00+00:00</updated><id>https://xinwei-niu.vercel.app/posts/2025/11/blog-post-Third-Blog-Sparse-Attention</id><content type="html" xml:base="https://xinwei-niu.vercel.app/posts/2025/11/blog-3-efficent/"><![CDATA[<h2 id="introduction-to-recurrent-neural-networks-rnns">Introduction to Recurrent Neural Networks (RNNs)</h2>

<p>Recurrent Neural Networks (RNNs) are a class of neural networks specifically designed to handle <strong>sequential and temporal data</strong> by maintaining a hidden state that evolves over time. Unlike feedforward networks, RNNs have <strong>recurrent connections</strong> that allow information from previous time steps to influence the current computation, effectively providing the network with memory. At each time step \(t\), the hidden state \(h_t\) is computed as:</p>

<p>\[
h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b_h)
\]</p>

<p>where \(x_t\) is the input at time step \(t\), \(h_{t-1}\) is the hidden state from the previous step, \(W_{xh}\) and \(W_{hh}\) are learnable weight matrices, \(b_h\) is a bias term, and \(f(\cdot)\) is a nonlinear activation function such as \(\tanh\) or \(\text{ReLU}\).</p>

<p>A key advantage of RNNs is their <strong><em>linear computational complexity with respect to the input sequence length</em></strong>. At each step, the network performs a fixed set of operations, making the total cost scale linearly with the number of time steps. This efficiency allows RNNs to process sequences of arbitrary length without a combinatorial explosion in computation, which was one reason for their widespread adoption in language modeling, speech recognition, and time-series prediction during the 2010s.</p>

<p>Despite this efficiency, standard RNNs face significant training challenges, primarily the <strong>vanishing and exploding gradient problem</strong>. During backpropagation through time (BPTT), gradients are propagated through the recurrent weights over many time steps. For long sequences, this can lead to gradients that <strong>decay exponentially</strong> (vanishing) or <strong>grow uncontrollably</strong> (exploding), making it difficult for the network to capture long-term dependencies. These limitations motivated the development of gated variants such as <strong>Long Short-Term Memory (LSTM)</strong>[2] and <strong>Gated Recurrent Unit (GRU)</strong> networks, which incorporate mechanisms to preserve information over extended sequences and stabilize training.</p>

<hr />

<h2 id="introduction-to-linear-attention">Introduction to Linear Attention</h2>
<p>The attention mechanism in the <strong>Transformer</strong>[1] involves computing pairwise interactions between all tokens in the sequence. Given the query, key, and value matrices \(Q, K, V\), if \(Q \in \mathbb{R}^{N \times d}\) and \(K \in \mathbb{R}^{N \times d}\), then the attention score matrix is
\(
QK^{\top} \in \mathbb{R}^{N \times N},
\)
and its computational complexity is
\(
\mathcal{O}(N^{2}).
\)</p>

<p>Under long-context reasoning tasks, the quadratic growth of the attention operation becomes a major bottleneck, since increasing the sequence length \(N\) leads to a rapid increase in both computation and memory cost. As \(N\) grows into the hundreds of thousands or millions, storing and manipulating the full \(N \times N\) attention matrix becomes infeasible, motivating the development of more efficient attention mechanisms that reduce or avoid the \(\mathcal{O}(N^{2})\) complexity.</p>

<p>Hence, later works started trying to mitigate this computation bottleneck, such as Locality-Sensitive Hashing Attention [4] and Linear Attention [5].</p>

<hr />

<h3 id="references">References</h3>

<p>Tillet, P., Kung, H. T., &amp; Cox, D. (2019, June). Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (pp. 10-19).</p>]]></content><author><name>Xinwei Niu</name><email>nxw2002@outlook.com</email></author><category term="Efficient Architecture" /><category term="Machine Learning" /><category term="Long Context" /><category term="Linear Attention" /><summary type="html"><![CDATA[Introduction to Recurrent Neural Networks (RNNs)]]></summary></entry><entry><title type="html">Note on Submodular Function Optimization, Minimization and Maximization, Lazy Greedy</title><link href="https://xinwei-niu.vercel.app/posts/2025/11/blog-1-efficent/" rel="alternate" type="text/html" title="Note on Submodular Function Optimization, Minimization and Maximization, Lazy Greedy" /><published>2025-11-20T00:00:00+00:00</published><updated>2025-11-20T00:00:00+00:00</updated><id>https://xinwei-niu.vercel.app/posts/2025/11/blog-post-Blog-submodular</id><content type="html" xml:base="https://xinwei-niu.vercel.app/posts/2025/11/blog-1-efficent/"><![CDATA[<p>This blog is based on <a href="http://faculty.bicmr.pku.edu.cn/~wenzw/bigdata/lect-submodular.pdf">week 10 of PKU Algorithms for Big Data Analysis</a>.</p>

<h2 id="introduction-of-submodular-functions">Introduction of Submodular Functions</h2>

<p><strong>Definition 1.1 (Submodular Function).</strong><br />
Let \(V\) be a finite ground set. A set function
\[
f : 2^{V} \rightarrow \mathbb{R}
\]
is called <strong>submodular</strong> if for all subsets \(A, B \subseteq V\),
\[
f(A) + f(B) \geq f(A \cup B) + f(A \cap B).
\]</p>
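<p>A standard example is the coverage function \(f(A) = \lvert \bigcup_{a \in A} S_a \rvert\), which is submodular. A small sketch (my own illustration, not from the lecture notes) that checks the defining inequality over all pairs of subsets of a tiny ground set:</p>

<pre><code class="language-python">import itertools

# Toy illustration: the coverage function is submodular.
# Ground set V indexes a family of sets; f(A) = size of the union of the chosen sets.
sets = {0: {1, 2, 3}, 1: {3, 4}, 2: {4, 5, 6}, 3: {1, 6}}
V = list(sets)

def f(A):
    """Coverage function: number of elements covered by the sets indexed by A."""
    covered = set()
    for a in A:
        covered = covered.union(sets[a])
    return len(covered)

def subsets(ground):
    for r in range(len(ground) + 1):
        for combo in itertools.combinations(ground, r):
            yield set(combo)

# Check f(A) + f(B) >= f(A union B) + f(A intersect B) for all pairs of subsets.
ok = all(
    f(A) + f(B) >= f(A.union(B)) + f(A.intersection(B))
    for A in subsets(V) for B in subsets(V)
)
print("submodular on all pairs:", ok)       # True for coverage functions
</code></pre>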

<p><strong>Definition 1.2 (Diminishing Returns Property).</strong><br />
A set function \(f : 2^{V} \rightarrow \mathbb{R}\) is said to satisfy the <strong>diminishing returns property</strong> if for all \(A \subseteq B \subseteq V\) and all \(x \in V \setminus B\),
\[
f(A \cup \{x\}) - f(A) \geq f(B \cup \{x\}) - f(B).
\]</p>]]></content><author><name>Xinwei Niu</name><email>nxw2002@outlook.com</email></author><category term="Submodular" /><category term="Algorithms" /><category term="Greedy Algorithm" /><category term="Linear Attention" /><summary type="html"><![CDATA[This blog is based on week 10 of PKU Algorithms for Big Data Analysis.]]></summary></entry></feed>