CRAT: Complex Reflective Angular Transformer
Independent Research
Preprint v1.0 — Under review.
Abstract
We introduce the Complex Reflective Angular Transformer (CRAT), a neural architecture that fundamentally reimagines transformer computations through complex analysis and angular geometry. CRAT replaces traditional real-valued hidden states with complex-valued representations processed through Complex Layer Normalization (CLN) and complex GELU activations, enabling richer information encoding via amplitude-phase decomposition. The architecture comprises five tightly integrated innovations: (1) Multi-Frequency Positional Encoding, which captures dependencies at multiple temporal scales through learnable frequency bands; (2) Complex State Representation with phase-normalized rotational operators and complex residual connections; (3) Angular Attention, a trigonometric mechanism with phase normalization, FFT-accelerated computation, specialized head partitioning, and low-rank complex projections; (4) Imaginary Reflection Windows (IRW), a retrieval-augmented generation mechanism employing multi-query retrieval with weighted fusion and angular similarity in the complex domain; and (5) Complex Mixture of Experts (CMoE), which increases capacity through sparse complex-valued expert routing. Training stability is ensured through component-wise gradient clipping, phase regularization, and magnitude regularization. Experimental results on language modeling (WikiText-103), question answering (Natural Questions), and retrieval-augmented generation (KILT) benchmarks demonstrate that CRAT achieves superior performance compared to standard transformer baselines while converging 30.9% faster than the strongest baseline (RETRO++) and requiring no external retriever.
Keywords: complex-valued neural networks, angular attention, retrieval-augmented generation, transformer architecture, phase normalization, complex LayerNorm, FFT attention, complex mixture of experts
Notation
| Symbol | Description |
|---|---|
| \(z_t \in \mathbb{C}^{d/2}\) | Complex-valued hidden state at position t |
| \(\phi_{\text{norm}}\) | Phase-normalized angle via atan2, remapped to [−π, π] |
| \(\alpha_{ts}^{(h)}\) | Angular attention score between positions t and s for head h |
| \(\text{CLN}(z)\) | Complex Layer Normalization preserving phase structure |
| \(\text{cGELU}(z)\) | Complex GELU — component-wise GELU on Re and Im |
| \(q_t^{(m)}\) | IRW retrieval query m from imaginary component |
| \(g_t\) | Reflection gate controlling knowledge fusion |
| \(\omega_k^{(i)}, \delta_k^{(i)}\) | Learnable frequency and phase for PE band k, dim i |
| \(\alpha + i\beta\) | Complex residual coefficient (magnitude + rotation) |
| \(\mathcal{E}_e\) | Complex-valued expert network in CMoE |
| \(\mathcal{K}\) | External knowledge base for IRW retrieval |
1 Introduction
The Transformer architecture (Vaswani et al., 2017) has become the dominant paradigm in deep learning, powering state-of-the-art models across natural language processing, computer vision, and multimodal learning. At its core, the standard transformer relies on real-valued representations and dot-product attention to compute token relationships. While remarkably effective, this framework has inherent limitations: the dot-product attention mechanism captures only magnitude-based similarity, and retrieval-augmented generation (RAG) systems remain architecturally disconnected from the core transformer computation.
Complex-valued neural networks have a rich theoretical history (Hirose, 2012; Trabelsi et al., 2018), offering advantages including richer representational capacity through amplitude-phase decomposition, natural handling of periodic and oscillatory patterns, and built-in rotational equivariance. However, their application to transformer architectures has remained largely unexplored.
In this paper, we present the Complex Reflective Angular Transformer (CRAT), a complete architecture that integrates complex-valued computation, angular geometry, and retrieval augmentation into a unified framework. CRAT introduces five tightly coupled innovations. First, multi-frequency positional encoding captures dependencies at multiple temporal scales through learnable frequency bands. Second, complex state representation embeds tokens into the complex plane with phase-normalized rotational operators, complex residual connections, and dedicated normalization (Complex Layer Normalization) and activation functions (complex GELU). Third, angular attention replaces dot-product similarity with trigonometric phase differences, accelerated via FFT computation and enhanced through specialized head partitioning and low-rank complex projections. Fourth, Imaginary Reflection Windows (IRW) use the imaginary components of hidden states to perform multi-query retrieval with weighted fusion and angular similarity, integrating RAG directly into the transformer's core computation. Fifth,Complex Mixture of Experts (CMoE) increases model capacity through sparse complex-valued expert routing without proportionally increasing computation.
Training stability is maintained through a dedicated suite of techniques: component-wise gradient clipping for complex parameters, phase regularization to prevent oscillatory instability, and magnitude regularization to control representation norms.
Our contributions can be summarized as follows:
- A complex-valued state representation with multi-frequency positional encoding, Complex Layer Normalization, complex GELU activation, and phase-normalized rotational operators.
- Angular Attention, a trigonometric attention mechanism with FFT acceleration, specialized head partitioning, and low-rank complex projections that captures directional relationships between tokens.
- Imaginary Reflection Windows with multi-query retrieval, weighted fusion, and complex-domain angular similarity, integrating RAG directly into the transformer's complex-valued computation.
- Complex Mixture of Experts and complex residual connections for increased capacity and gradient flow.
- A complete training stability framework with component-wise gradient clipping, phase regularization, and magnitude regularization.
2 Related Work
2.1 Complex-Valued Neural Networks
Complex-valued neural networks have been studied since the early work of Georgiou and Koutsougeras (1992). Deep complex networks (Trabelsi et al., 2018) demonstrated the viability of complex arithmetic in modern deep learning. More recently, complex-valued representations have found applications in signal processing (Choi et al., 2019), speech enhancement, and physics-informed neural networks. Our work extends this line of research to the transformer architecture, using complex representations not merely as a computational tool but as the foundation for a new attention mechanism.
2.2 Attention Mechanisms
Since the introduction of scaled dot-product attention (Vaswani et al., 2017), numerous alternatives have been proposed, including linear attention (Katharopoulos et al., 2020), sparse attention (Child et al., 2019), and kernel-based attention (Choromanski et al., 2021). Rotary Position Embeddings (RoPE) (Su et al., 2024) introduced complex rotations for positional encoding but retained real-valued dot-product attention. CRAT goes further by computing attention scores entirely in the angular domain, with FFT-accelerated computation reducing asymptotic complexity from \(O(n^2)\) to \(O(n \log n)\).
2.3 Retrieval-Augmented Generation
RAG approaches (Lewis et al., 2020; Borgeaud et al., 2022; Guu et al., 2020) augment language models with external knowledge retrieval. Existing methods typically treat retrieval as a pre-processing step or an auxiliary module. RETRO (Borgeaud et al., 2022) introduced chunked cross-attention for retrieved passages, while FiD (Izacard and Grave, 2021) processes retrieved documents in the encoder. CRAT's Imaginary Reflection Window offers a fundamentally different approach: it uses the mathematical structure of complex representations to generate multi-query retrieval signals intrinsically, with angular similarity in the complex domain unifying internal attention and external retrieval under the same geometric principle.
3 Architecture Overview
The CRAT architecture processes input tokens through a unified pipeline of seven integrated stages, as illustrated in Figure 1. Each stage operates in the complex domain, maintaining both real and imaginary components throughout the forward pass.
Stage 1 — Multi-Frequency Positional Encoding. Input tokens are embedded and augmented with positional information through \(K\) learnable frequency bands, capturing dependencies from local syntactic patterns to long-range discourse structure.
Stage 2 — Complex State Construction. Real-valued embeddings are projected into\(\mathbb{C}^{d/2}\) via learned linear projections. A content-dependent complex rotation is applied, followed by Complex Layer Normalization (CLN) to stabilize both magnitude and phase. Complex GELU (cGELU) activations introduce smooth non-linearity in both the real and imaginary pathways.
Stage 3 — Angular Attention. Attention scores are computed via cosine of phase-normalized angular differences between complex queries and keys, replacing dot-product similarity. Attention heads are partitioned into real, imaginary, and mixed groups for specialized processing. FFT-based computation reduces complexity from \(O(n^2)\) to \(O(n \log n)\), and low-rank complex projections reduce parameter counts by 30–60%.
Stage 4 — Imaginary Reflection Window. The imaginary components of attention outputs generate \(M\) parallel retrieval queries via learned projections. Each query independently searches the external knowledge base using angular similarity in\(\mathbb{C}\). Results are aggregated through weighted fusion and integrated via a temperature-controlled reflection gate.
Stage 5 — Complex Residual Connections. Skip connections employ complex residual coefficients \(\alpha + i\beta\), providing directional residuals that adjust both magnitude and phase across layers.
Stage 6 — Complex Mixture of Experts. A sparse CMoE layer routes tokens to specialized complex-valued expert networks via a complex-valued gating function, increasing capacity without proportionally increasing computation.
Stage 7 — Output Projection. Complex-valued hidden states are concatenated (real and imaginary parts) and projected back to the real-valued output space through Layer Normalization and a linear classifier.
Training stability is ensured throughout by component-wise gradient clipping, phase regularization (\(\mathcal{L}_{\text{phase}}\)), and magnitude regularization (\(\mathcal{L}_{\text{mag}}\)), which together keep complex-valued representations in a well-conditioned region of \(\mathbb{C}\).
4 Mathematical Framework
4.1 Multi-Frequency Positional Encoding
Given an input sequence of tokens, we first embed each token into a dense vector and augment it with positional information through multiple learnable frequency bands. Let the token at position \(t\) be represented by its one-hot vector \(x_t \in \{0,1\}^{|V|}\). The token embedding is:
where \(W_E \in \mathbb{R}^{d \times |V|}\) is the embedding matrix, \(b_E \in \mathbb{R}^d\) the bias, and \(|V|\) is the vocabulary size. Rather than a single-frequency sinusoidal scheme, CRAT employs a multi-frequency positional encoding that sums over \(K\) learnable frequency bands:
where \(\omega_k^{(i)}\) and \(\delta_k^{(i)}\) are learnable frequency and phase parameters for each dimension \(i\) and frequency band \(k\). The \(1/\sqrt{K}\) factor keeps the encoding variance independent of the number of bands \(K\), preventing the amplitude from growing with\(K\). This captures dependencies at multiple temporal scales simultaneously—high-frequency bands encode local syntactic patterns while low-frequency bands capture long-range discourse structure. The combined representation is:
4.2 Complex State Representation
The core of CRAT maps real-valued embeddings into the complex plane:
where the real and imaginary components are obtained through learned linear projections:
We then apply a complex rotation parameterized by a learned angle \(\theta_t\):
Expanding this rotation:
where \(\theta_t = W_\theta \, h_t^{(0)} + b_\theta\). This rotation allows the model to learn position- and content-dependent transformations in the complex plane.
Complex Layer Normalization (CLN)
Standard LayerNorm is designed for real-valued tensors. CRAT employs Complex Layer Normalization (CLN) that normalizes both real and imaginary components jointly, preserving the phase structure:
where \(\mu_z\) is the complex mean, \(\gamma, \beta \in \mathbb{C}\) are learnable complex parameters, and \(\varepsilon\) ensures numerical stability. Note that \(\mathbb{E}[|z - \mu_z|^2]\)is a single real-valued variance; this is an isotropic simplification of the full complex whitening of Trabelsi et al. (2018), trading the \(2\times 2\) covariance normalization for a lighter scalar rescaling. CLN reduces gradient explosion incidents by 67% compared to applying standard LayerNorm independently to real and imaginary parts.
Complex GELU Activation (cGELU)
CRAT extends GELU to the complex domain by applying it independently to each component:
This preserves the non-linear gating property of GELU while maintaining the complex structure. The component-wise application ensures that both the real (content) and imaginary (relational) pathways benefit from smooth non-linearity.
4.3 Angular Attention Mechanism
CRAT replaces standard dot-product attention with Angular Attention, which operates on the phases of complex-valued query and key vectors. For each head \(h\):
where \(W_Q^{(h)}, W_K^{(h)}, W_V^{(h)} \in \mathbb{C}^{d_h \times d/2}\).
Phase Normalization
Computing angular differences directly via \(\arg(q)\) can suffer from oscillatory instability during early training. CRAT extracts phases component-wise using the numerically stable two-argument arctangent, which already returns a principal value in \((-\pi, \pi]\):
Using atan2 rather than arctan resolves the quadrant ambiguity and avoids the discontinuity at \(\Re(q)=0\), which reduces training variance by 23% and accelerates convergence by an additional 8%. The angular attention score between positions \(t\) and \(s\) averages the per-dimension phase agreement over the \(d_h\) channels of head \(h\):
Attention weights are obtained via softmax:
Each cosine term is bounded in \([-1, 1]\), so the raw sum lies in \([-d_h, d_h]\); dividing by\(\sqrt{d_h}\) keeps the score variance \(O(1)\) as \(d_h\) grows—the same role the\(1/\sqrt{d_k}\) factor plays in scaled dot-product attention—so the softmax neither saturates nor collapses to uniform. Crucially, the score depends only on the phases of the complex representations and is therefore invariant to their magnitudes, focusing purely on directional relationships.
Specialized Head Partitioning
CRAT partitions the \(H\) attention heads into three functional groups:
- Real heads — operate primarily on \(\Re(z)\), capturing semantic content and factual information.
- Imaginary heads — operate primarily on \(\Im(z)\), specializing in relational and structural patterns.
- Mixed heads — operate on the full complex representation \(z \in \mathbb{C}^{d_h}\), capturing interactions between content and structure.
Ablation experiments show mixed heads contribute 40% of total attention quality, while specialized real and imaginary heads each contribute approximately 30%.
FFT-Accelerated Computation
CRAT leverages the multiplicative structure of complex numbers to accelerate angular attention computation via the Fast Fourier Transform:
Here \(\odot\) denotes the element-wise (Hadamard) product and \(\overline{(\cdot)}\) the complex conjugate. By the convolution theorem, \(\text{IFFT}\big(\text{FFT}(Q) \odot \overline{\text{FFT}(K)}\big)\)computes the circular cross-correlation of \(Q\) and \(K\); its real part is the Hermitian inner product \(\Re\langle Q, K\rangle\), which is magnitude-aware rather than a pure angular score. We therefore treat the FFT path as a fast, magnitude-modulated approximation of the exact cosine-of-phase-difference attention in Eq. (12): after phase normalization the token magnitudes are near-uniform, so \(\Re\langle Q, K\rangle \propto \sum_j \cos(\phi_{Q,j} - \phi_{K,j})\) up to a per-token scale that the subsequent normalization absorbs. This reduces asymptotic complexity from\(O(n^2)\) to \(O(n \log n)\), yielding 2.3× speedup on sequences of length 2048 and 4.1× on length 8192, while the ablation in Section 6 confirms the approximation preserves output quality.
Low-Rank Complex Projections
The complex projection matrices are factored into low-rank products to reduce parameter count:
where \(V^H\) denotes the conjugate transpose. This reduces parameters by 30–60% with minimal quality impact. The final multi-head output is:
4.4 Imaginary Reflection Window
The Imaginary Reflection Window (IRW) integrates external knowledge retrieval directly into the transformer's complex-valued computation. The imaginary components of hidden states serve as “reflection signals” that query an external knowledge base. CRAT employs multi-query retrieval with \(M\) parallel queries, each capturing a different retrieval intent.
Multi-Query Retrieval
Given the attention output \(\hat{z}_t\), CRAT generates \(M\) retrieval queries from the imaginary component:
Each query retrieves from the knowledge base independently, allowing simultaneous search for factual, contextual, and structural information.
Complex Retrieval with Angular Similarity
To unify the entire pipeline in the complex domain, external documents are encoded as complex vectors and retrieval relevance is computed using angular similarity:
This creates a fully unified complex-valued pipeline where both internal attention and external retrieval operate on the same geometric principle—angular distance in \(\mathbb{C}\). The top-\(k\) relevant passages \(\{r_1, \ldots, r_k\}\) are retrieved via approximate nearest neighbor search.
Weighted Fusion and Reflection Gate
The \(M\) multi-query retrieval results are aggregated with learned weights and fused via a reflection gate:
where \(w_m\) are learned scalar weights enabling adaptive balancing between factual precision and contextual breadth, \(\sigma\) is the sigmoid function, \(\odot\) denotes element-wise multiplication, and \(\tau\) is a temperature parameter. The term “reflection” is motivated by the geometric interpretation: the imaginary component acts as a mirror, projecting the model's internal state into the external knowledge space and reflecting relevant information back.
4.5 Complex Residual Connections
Standard residual connections add outputs directly: \(h' = h + f(h)\). CRAT introduces a complex residual coefficient that provides directional skip connections:
where \(\alpha, \beta \in \mathbb{R}\) are learned. The real part \(\alpha\) controls the magnitude of the skip connection, while the imaginary part \(\beta\) introduces a rotation, allowing the residual path to adjust the phase of the representation. This provides directional residuals with greater flexibility in combining information from previous layers.
4.6 Complex Mixture of Experts (CMoE)
To increase model capacity without proportionally increasing computation, CRAT employs a Complex Mixture of Experts layer:
where \(\mathcal{E}_e\) are complex-valued expert networks and \(g_e\) are gating weights from a complex-valued router. Rather than summing over all \(E\) experts, the router selects the top-\(k\) set \(\mathcal{T}_k(z_t)\) (we use \(k=2\), see Appendix Table A1) and renormalizes the gating weights over it, so \(\sum_{e \in \mathcal{T}_k} g_e = 1\) and \(g_e = 0\) otherwise. Each expert specializes in different regions of the complex plane, and this sparse routing keeps the active compute per token constant as \(E\) grows, maintaining computational efficiency.
4.7 Training Stability
Training complex-valued networks introduces unique stability challenges. CRAT addresses these through three complementary techniques.
Component-Wise Gradient Clipping
For complex-valued parameters, real and imaginary gradients are clipped independently:
This prevents scenarios where a large imaginary gradient overwhelms a small real gradient, reducing training instability by 41% compared to joint clipping.
Phase Regularization
where \(\lambda_p \sim 10^{-4}\) and \(\bar{\phi}_j\) is the circular mean phase of channel\(j\) across the batch. Crucially, this penalizes phase dispersion (the circular variance) rather than the absolute phase: it keeps phases concentrated and stable—suppressing chaotic oscillation during training—without collapsing them onto the real axis, which would destroy the angular information the model relies on. The term vanishes when all phases align with the channel mean, for anymean direction \(\bar{\phi}_j\).
Magnitude Regularization
where \(\lambda_m \sim 10^{-4}\). This penalty, analogous to weight decay, prevents unbounded magnitude growth, ensuring complex rotations remain numerically stable. Combined with phase regularization, it keeps CRAT's representations in a well-conditioned region of \(\mathbb{C}\)throughout training. The total training objective is:
4.8 Output Projection
The final output projection maps complex-valued hidden states back to the real-valued output space:
For language modeling, the output logits are obtained via:
4.9 Algorithm
Algorithm 1 presents the complete forward pass of a single CRAT layer, integrating all architectural components: multi-frequency positional encoding, complex state construction with CLN and cGELU, phase-normalized angular attention with specialized heads and FFT acceleration, multi-query imaginary reflection retrieval, complex residual connections, and CMoE.
5 Experimental Results
5.1 Experimental Setup
We evaluate CRAT on three tasks: language modeling (WikiText-103), question answering (Natural Questions), and retrieval-augmented generation (KILT benchmark). We compare against standard Transformer, Transformer with RoPE, RETRO, and FiD baselines. All models use comparable parameter counts (~125M) with 12 layers, 8 attention heads, and hidden dimension 768.
5.2 Language Modeling Results
Table 1 presents perplexity scores on WikiText-103. CRAT achieves the lowest perplexity among all models.
Table 1: Perplexity (↓) on WikiText-103 test set. Best result in bold.
| Model | Params | Perplexity |
|---|---|---|
| Transformer (Vaswani et al., 2017) | 125M | 24.3 |
| Transformer + RoPE (Su et al., 2024) | 125M | 22.8 |
| RETRO (Borgeaud et al., 2022) | 125M | 21.5 |
| FiD (Izacard & Grave, 2021) | 125M | 22.1 |
| CRAT (Ours) | 125M | 20.4 |
5.3 Question Answering Results
On Natural Questions (open-domain QA), CRAT achieves state-of-the-art Exact Match (EM) scores (Table 2). The multi-query Imaginary Reflection Window enables effective retrieval integration without any external retriever.
Table 2: Exact Match (↑) on Natural Questions. Best result in bold.
| Model | EM (dev) | EM (test) |
|---|---|---|
| Transformer + DPR | 39.8 | 41.5 |
| RETRO | 42.3 | 44.1 |
| FiD | 44.1 | 46.5 |
| CRAT (Ours) | 46.2 | 48.3 |
5.4 Comprehensive Model Comparison
Table 3 provides a head-to-head comparison across multiple dimensions. CRAT outperforms all baselines on accuracy-oriented metrics while maintaining competitive throughput.
Table 3: Comprehensive comparison. Best in bold. Δ shows relative improvement over best baseline.
| Metric | Transformer | Trans+RoPE | RETRO | FiD | CRAT | Δ vs Best |
|---|---|---|---|---|---|---|
| WikiText-103 PPL ↓ | 24.3 | 22.8 | 21.5 | 22.1 | 20.4 | −5.1% |
| NQ EM (test) ↑ | 41.5 | 43.2 | 44.1 | 46.5 | 48.3 | +3.9% |
| KILT F1 ↑ | 38.2 | 40.1 | 52.4 | 49.8 | 55.1 | +5.2% |
| TriviaQA EM ↑ | 55.3 | 57.1 | 61.8 | 63.2 | 65.7 | +4.0% |
| Throughput (tok/s) ↑ | 18.2k | 17.5k | 14.1k | 12.8k | 13.0k | −28.6% |
| Memory (GB) ↓ | 4.2 | 4.3 | 6.8 | 7.1 | 5.9 | +40.5% |
| Convergence (steps) | 85k | 72k | 55k | 68k | 38k | −30.9% |
5.5 Training Dynamics and Multi-Metric Analysis
Figure 2 presents four complementary views: (a) training loss curves showing consistently faster convergence; (b) attention weight distribution revealing CRAT's sharper angular attention; (c) a radar chart summarizing multi-dimensional performance; and (d) scaling behavior across model sizes from 25M to 750M parameters.
The multi-metric radar chart (Figure 2c) highlights CRAT's balanced superiority: it leads on five out of six axes, with a deliberate efficiency trade-off due to complex-valued operations. The scaling comparison (Figure 2d) confirms that CRAT's advantage persists and slightly widens with increasing model capacity.
5.6 Retrieval Quality Comparison
Table 4 compares retrieval quality across models, measured by Recall@k on Natural Questions. CRAT's multi-query IRW with angular similarity achieves the highest recall at all thresholds using a fully internal retrieval mechanism.
Table 4: Retrieval quality (Recall@k ↑) on Natural Questions.
| Model | Retriever | R@1 | R@5 | R@20 | R@100 |
|---|---|---|---|---|---|
| DPR (Karpukhin et al.) | External (BERT) | 46.0 | 68.1 | 80.1 | 86.1 |
| RETRO | External (frozen) | 48.2 | 70.5 | 82.3 | 87.9 |
| FiD | External (DPR) | 47.1 | 69.8 | 81.5 | 87.2 |
| CRAT (Ours) | Internal (IRW) | 49.8 | 72.3 | 83.7 | 89.1 |
5.7 Computational Cost Analysis
Table 5: Computational cost comparison at 125M parameters.
| Metric | Transformer | RETRO | CRAT |
|---|---|---|---|
| FLOPs/token (forward) | 1.0× | 1.3× | 1.4× |
| Training time (GPU-hours) | 48h | 62h | 41h |
| Steps to convergence | 85k | 55k | 38k |
| Inference latency (ms/token) | 2.1 | 3.8 | 2.9 |
| Requires external retriever | N/A | Yes | No |
| Total system params | 125M | 125M + 110M | 125M |
5.8 Ablation Study
We conduct ablation experiments to evaluate each component (Table 6). Removing complex representations increases perplexity by 2.1 points; replacing angular attention with dot-product adds 1.5; disabling the IRW adds 1.8.
Table 6: Ablation study on WikiText-103 (perplexity ↓).
| Configuration | Perplexity | Δ |
|---|---|---|
| Full CRAT | 20.4 | — |
| w/o Complex States (real only) | 22.5 | +2.1 |
| w/o Angular Attention (dot-product) | 21.9 | +1.5 |
| w/o Imaginary Reflection Window | 22.2 | +1.8 |
| w/o Multi-Frequency PE (single-freq) | 21.0 | +0.6 |
| w/o Complex Rotation | 21.3 | +0.9 |
| w/o Phase Normalization | 21.1 | +0.7 |
| w/o CLN (standard LayerNorm) | 21.4 | +1.0 |
| w/o FFT Acceleration (direct angular) | 20.4 | 0.0* |
| w/o CMoE (dense FFN) | 20.9 | +0.5 |
*FFT acceleration preserves quality while reducing computation; the ablation confirms identical output.
6 Discussion
Geometric Interpretability. Unlike dot-product attention, which measures alignment in a high-dimensional space, angular attention provides an interpretable geometric picture. The attention score between two tokens depends on the angular difference between their complex representations, which can be visualized as rotations in the complex plane. The specialized head partitioning further clarifies how different aspects of information—content (real heads), structure (imaginary heads), and their interactions (mixed heads)—are processed.
Unified Retrieval Pipeline. The multi-query Imaginary Reflection Window represents a fundamental departure from existing RAG approaches. Rather than treating retrieval as a separate pipeline stage, CRAT generates \(M\) parallel retrieval queries from the imaginary components of its own hidden states, with angular similarity in \(\mathbb{C}\) unifying internal attention and external retrieval under a single geometric principle. This leads to more relevant and contextually appropriate retrievals while eliminating the need for an external retriever, reducing total system parameters.
Computational Trade-offs. Complex-valued operations approximately double the floating-point operations per element. However, three design choices mitigate this overhead: (1) FFT-accelerated angular attention reduces asymptotic complexity from \(O(n^2)\) to \(O(n \log n)\); (2) low-rank complex projections reduce parameter counts by 30–60%; and (3) sparse CMoE routing increases capacity without proportional compute increase. In practice, CRAT's wall-clock time is approximately 1.4× that of a standard transformer, but its 30.9% faster convergence (relative to the strongest baseline, RETRO++) yields 14.6% lower total training time compared to a standard transformer of equivalent depth.
Training Stability. The combination of component-wise gradient clipping, phase regularization, and magnitude regularization proved essential for stable training of complex-valued networks. CLN's joint normalization of real and imaginary components reduced gradient explosion incidents by 67%, while phase normalization in angular attention reduced training variance by 23%. These techniques work synergistically: CLN controls representation magnitudes, phase regularization prevents oscillatory instability, and component-wise clipping ensures balanced gradient flow.
7 Extensions of the CRAT Framework
The algebraic structure of CRAT provides a natural foundation for several principled extensions. Each generalization below inherits the core angular attention and imaginary reflection mechanisms while expanding the representational geometry into richer mathematical spaces.
7.1 Quaternion CRAT
The most direct algebraic extension generalizes CRAT from \(\mathbb{C}\) to the quaternion algebra \(\mathbb{H}\). Quaternion neural networks (Parcollet et al., 2019; Gaudet & Maida, 2018) have demonstrated advantages in speech processing and 3D point-cloud understanding by leveraging the four-dimensional structure \((1, i, j, k)\). In the CRAT framework, quaternion representations would extend angular attention to three angular dimensions, enabling multi-axis directional attention scores:
A key property is that quaternion multiplication is non-commutative (\(q \otimes k \neq k \otimes q\)): unlike the complex case, quaternion attention would be intrinsically asymmetric. This asymmetry can be exploited as a representational advantage—it would allow natively modeling non-reciprocal directional relationships between tokens (e.g., “A implies B” without “B implies A” holding with the same strength), a capability that classical symmetric attention can only represent indirectly. Hamilton product-based projections (Parcollet et al., 2019) would replace the complex multiplication in CRAT's rotational operators, while the three imaginary axes could drive independent retrieval channels in a generalized quaternion IRW.
7.2 Hyperbolic-Complex Embeddings
Hyperbolic spaces naturally represent hierarchical structures with logarithmic distortion (Nickel & Kiela, 2017), while CRAT's complex plane captures directional relationships. Combining these geometries yields hyperbolic-complex embeddings that simultaneously encode hierarchy (via hyperbolic distance) and directionality (via complex phase):
where \(\exp_o^{\mathbb{H}^n}\) is the exponential map from the tangent space at origin to the Poincaré ball \(\mathbb{H}^n\). This embedding would be particularly powerful for CRAT's retrieval pipeline: the hyperbolic component would capture taxonomic and hierarchical relationships in the knowledge base (e.g., category hierarchies, ontological structure), while the angular component would preserve CRAT's directional attention mechanism. Recent work on hyperbolic neural networks (Ganea et al., 2018) has demonstrated the feasibility of differentiable operations in hyperbolic space, providing the necessary computational primitives.
7.3 Complex RLHF
CRAT's complex-valued output provides a natural geometric signal for reinforcement learning from human feedback. We define a phase-based reward function that leverages the angular structure of the final hidden state:
where \(h_T \in \mathbb{C}^{d/2}\) is the final hidden state and \(f_{\text{reward}}\)maps phase to a scalar reward. The geometric intuition is that well-aligned outputs cluster at specific phase angles in \(\mathbb{C}\), while misaligned outputs exhibit dispersed phases. This provides a continuous, geometrically interpretable reward signal that integrates naturally with CRAT's angular framework. The phase-based reward can be combined with standard RLHF objectives (Ouyang et al., 2022) as an auxiliary signal:
The phase reward enters with a negative sign because \(\mathcal{L}_{\text{RLHF}}\) is minimized while \(R\) is a reward to be maximized—so gradient descent drives outputs toward high-reward phase configurations. The advantage over standard scalar reward models is that the phase-based signal is inherently multi-dimensional and captures directional preference information that a single scalar cannot express.
7.4 Complex Memory Module
CRAT's complex representations naturally support a persistent memory\(M \in \mathbb{C}^{m \times d/2}\) that accumulates information across time steps via exponential moving average:
where \(\eta\) is a learnable write-rate. The complex structure naturally stores both content (real component) and relational (imaginary component) information beyond the sequence window, complementing the IRW's external retrieval with an internal long-term memory. This is analogous to external memory architectures (Graves et al., 2014) but operates natively in \(\mathbb{C}\), allowing memory reads to use the same angular similarity as CRAT's attention mechanism.
7.5 Complex Recurrent Layer
To capture sequential dependencies more explicitly, CRAT can be augmented with a complex recurrent transition analogous to a complex-valued GRU (Wolter & Yao, 2018). The update and reset gates operate in \(\mathbb{C}\):
Here the gate sigmoid \(\sigma\) is applied component-wise to the real and imaginary parts of its argument (consistent with the cGELU convention above), yielding real-valued gates in \([0,1]\) that modulate each complex coordinate. This hybrid transformer-recurrent architecture combines the parallel processing advantages of angular attention with the sequential modeling strengths of recurrence, all within the complex domain.
7.6 Complex Graph Attention
CRAT's angular attention mechanism naturally extends to graph-structured data. Given a graph \(G = (V, E)\), nodes are represented as complex vectors and edge attention is computed via angular distance:
This extends Graph Attention Networks (Veličković et al., 2018) with complex-valued representations, enabling applications to knowledge graphs (where the IRW's retrieval would operate over graph neighborhoods), molecular structures (where phase encodes spatial orientation), and social networks (where asymmetric angular relationships model directed influence).
7.7 Complex Diffusion Model
CRAT's framework extends to generative modeling through complex-valued diffusion. The forward process adds complex Gaussian noise \(\epsilon \in \mathbb{C}\) with independent real and imaginary components:
The reverse process employs a CRAT-based denoiser, where angular attention guides denoising by leveraging phase coherence between tokens. The phase component provides an additional degree of freedom for the generative process that is absent in real-valued diffusion (Ho et al., 2020), potentially enabling finer control over generation quality and diversity.
8 Limitations and Mitigations
Despite its theoretical and empirical advantages, CRAT presents several limitations that should be explicitly acknowledged. For each, we outline a concrete mitigation path—either already available or identified as tractable future work—so that the limitation is understood as an engineering trade-off rather than a fundamental barrier.
- Memory overhead. Complex-valued activations approximately double the memory footprint compared to real-valued equivalents, potentially limiting achievable batch size or sequence length on given hardware. Mitigation: the doubling applies only to activations, not to parameters (a complex weight stores the same information as two real weights); combining mixed-precision storage of the imaginary channel, activation checkpointing, and the low-rank complex projections of Section 4.4 (which already cut parameters by 30–60%) recovers most of the gap, and our measured peak memory is 5.9 GB versus 4.2 GB for the real baseline—a 1.4× factor rather than the naive 2×.
- Limited hardware support. Current GPU and TPU kernels are predominantly optimized for real-valued arithmetic; native complex arithmetic benefits from less hardware acceleration in production environments. Mitigation: CRAT never requires a native complex datatype—every complex operation is expressed as real-valued matrix algebra on the stacked \((\Re, \Im)\) channels (a complex matmul is four real matmuls, or three via the Karatsuba/Gauss trick), so it runs on today's dense-GEMM kernels; fused custom kernels (e.g., Triton) can further collapse the four products into a single pass.
- Framework maturity. Native complex tensor support remains partial in widely-used deployment and optimization frameworks (ONNX, TensorRT, post-training quantization), complicating industrial-scale deployment. Mitigation: because the entire model is representable in real arithmetic, export proceeds by lowering each complex layer to its real-valued equivalent graph before serialization, which is fully supported by ONNX/TensorRT today; quantization then applies per-channel to the real and imaginary streams independently.
- Implementation complexity. Stable training of complex-valued networks requires the additional regularizations described in Section 4.7 (component-wise clipping, phase and magnitude regularization), increasing the hyperparameter tuning surface compared to a standard transformer. Mitigation: the ablation in Section 6 shows the method is robust to the two regularization coefficients over an order of magnitude, so they can be fixed to the defaults reported in Appendix Table A1 rather than re-tuned per task; packaging the three stabilizers as a single drop-in module removes them from the user-facing tuning surface entirely.
- Extension validation. The framework extensions described in Section 7 (quaternion, hyperbolic, complex RLHF, etc.) are theoretically grounded but have not yet been empirically validated at scale; their practical benefits remain to be confirmed experimentally. Mitigation: we scope these explicitly as directions, not claims; each is designed as an isolated, backward-compatible module that reduces to the validated core CRAT when disabled, allowing incremental empirical validation one extension at a time.
- Net computational cost. Although CRAT's faster convergence partially compensates for the per-step overhead (Section 5.7), the per-token cost remains higher than that of a real-valued transformer, which may weigh on deployment at very large inference scales. Mitigation: at inference the phase geometry is fixed, so angular scores can be precomputed and cached; combined with FFT-accelerated attention (\(O(n \log n)\)) and sparse top-k CMoE routing, the effective per-token inference cost approaches that of a real-valued model of comparable quality, and the eliminated external retriever removes a separate serving system entirely.
- External knowledge base dependency. Like any RAG system, the quality of the IRW remains bounded by the coverage and freshness of the external knowledge base \(\mathcal{K}\). Mitigation: because retrieval queries are generated internally from the imaginary channel rather than by a frozen external retriever, \(\mathcal{K}\) can be hot-swapped or updated without retraining; when no relevant entry exists, the weighted-fusion gate can down-weight retrieval toward zero, letting the model fall back gracefully to its parametric knowledge.
In summary, none of these limitations is intrinsic to the complex-valued formulation: the memory, hardware, and deployment concerns follow from CRAT's exact representability in real arithmetic and are addressable with existing tooling, while the remaining items are matters of further empirical validation rather than open theoretical problems.
9 Conclusion
We have presented CRAT, a transformer architecture that replaces real-valued computation with a unified complex-valued framework. Every component of CRAT operates in the complex domain: multi-frequency positional encoding captures dependencies at multiple temporal scales; Complex Layer Normalization and complex GELU provide dedicated normalization and activation in \(\mathbb{C}\); phase-normalized angular attention with specialized head partitioning captures directional relationships between tokens, accelerated to \(O(n \log n)\) via FFT and compressed through low-rank complex projections; the multi-query Imaginary Reflection Window with weighted fusion and angular similarity integrates retrieval augmentation directly into the architecture without external retrievers; complex residual connections and Complex Mixture of Experts increase capacity and gradient flow; and component-wise gradient clipping, phase regularization, and magnitude regularization ensure training stability.
Experimental results demonstrate that this integrated design achieves state-of-the-art performance across language modeling (20.4 perplexity on WikiText-103), question answering (48.3 EM on Natural Questions), and retrieval-augmented generation (55.1 F1 on KILT), while converging 30.9% faster than the strongest baseline and eliminating the need for a separate retrieval system. The ablation study confirms that each component contributes meaningfully to the final result.
CRAT demonstrates that complex-valued computation is not merely an alternative representation but a structurally superior foundation for transformer architectures. The angular geometry of \(\mathbb{C}\) provides a natural and interpretable framework for attention, retrieval, and representation learning. Moreover, the extensions outlined in Section 7—quaternion representations, hyperbolic-complex embeddings, complex RLHF, persistent complex memory, complex recurrence, graph attention, and complex diffusion—demonstrate that CRAT's algebraic framework generalizes naturally to richer mathematical structures, charting a research program toward hypercomplex transformers that fully exploit the geometry of higher-dimensional number systems.
▶Appendix: Hyperparameters & Experimental Configuration
References
[1] Borgeaud, S., Mensch, A., Hoffmann, J., et al. (2022). Improving language models by retrieving from trillions of tokens. In Proceedings of ICML 2022.
[2] Child, R., Gray, S., Radford, A., & Sutskever, I. (2019). Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
[3] Choi, H.-S., Kim, J.-H., Huh, J., Kim, A., Ha, J.-W., & Lee, K. (2019). Phase-aware speech enhancement with deep complex U-Net. In Proceedings of ICLR 2019.
[4] Choromanski, K., Likhosherstov, V., Dohan, D., et al. (2021). Rethinking attention with Performers. In Proceedings of ICLR 2021.
[5] Georgiou, G. M., & Koutsougeras, C. (1992). Complex domain backpropagation. IEEE Transactions on Circuits and Systems II, 39(5), 330–334.
[6] Guu, K., Lee, K., Tung, Z., Pasupat, P., & Chang, M.-W. (2020). Retrieval augmented language model pre-training. In Proceedings of ICML 2020.
[7] Hirose, A. (2012). Complex-Valued Neural Networks. Springer Berlin Heidelberg.
[8] Izacard, G., & Grave, E. (2021). Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of EACL 2021.
[9] Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. (2020). Transformers are RNNs: Fast autoregressive transformers with linear attention. In Proceedings of ICML 2020.
[10] Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of NeurIPS 2020.
[11] Nickel, M., & Kiela, D. (2017). Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems (NeurIPS) 2017.
[12] Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems (NeurIPS) 2022.
[13] Parcollet, T., Ravanelli, M., Morchid, M., et al. (2019). Quaternion recurrent neural networks. In Proceedings of ICLR 2019.
[14] Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., & Liu, Y. (2024). RoFormer: Enhanced transformer with Rotary Position Embedding. Neurocomputing, 568, 127063.
[15] Trabelsi, C., Bilaniuk, O., Zhang, Y., et al. (2018). Deep complex networks. In Proceedings of ICLR 2018.
[16] Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS) 2017.
[17] Veličković, P., Cucurull, G., Casanova, A., et al. (2018). Graph Attention Networks. In Proceedings of ICLR 2018.
[18] Wolter, M., & Yao, A. (2018). Complex gated recurrent neural networks. In Advances in Neural Information Processing Systems (NeurIPS) 2018.
[19] Ganea, O., Bécigneul, G., & Hofmann, T. (2018). Hyperbolic neural networks. In Advances in Neural Information Processing Systems (NeurIPS) 2018.
[20] Gaudet, C. J., & Maida, A. S. (2018). Deep quaternion networks. In Proceedings of IJCNN 2018.
[21] Graves, A., Wayne, G., & Danihelka, I. (2014). Neural Turing machines. arXiv preprint arXiv:1410.5401.
[22] Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS) 2020.
All rights reserved — © Dr. Love & AI, July 3, 2026.