Deep Tensor Network
Abstract
The quadratic complexity of dot-product attention introduced in Transformer remains a fundamental bottleneck impeding the progress of foundation models toward unbounded context lengths. Addressing this challenge, we introduce the Deep Tensor Network, a new architectural framework that fundamentally reformulates attention by unifying the expressive power of tensor algebra with neural network design. Our approach moves beyond both conventional dot-product attention and subsequent linear-time approximations to capture higher-order statistical dependencies. We introduce two core operators derived from this framework: \emph{Tensor Attention}, which models complex token-mixing via data-dependent polynomial kernels, and Tensor Interaction, a novel mechanism for adaptive channel-mixing. We demonstrate that these operators are powered by second-order summaries that entirely bypass the formation of matrices, enabling a causality-preserving streaming implementation with per-token updates and state. This efficiency rivals that of modern State Space Models while retaining an attention-like formulation. The Deep Tensor Network thus provides a principled and powerful new class of building blocks for next-generation sequence models, bridging the gap between scalable computation and rich, expressive interaction modeling.
Cite
@article{arxiv.2311.11091,
title = {Deep Tensor Network},
author = {Yifan Zhang},
journal= {arXiv preprint arXiv:2311.11091},
year = {2025}
}