Related papers: Towards Linear Algebra over Normalized Data
Machine learning algorithms use error function minimization to fit a large set of parameters in a preexisting model. However, error minimization eventually leads to a memorization of the training dataset, losing the ability to generalize to…
Statistics and Optimization are foundational to modern Machine Learning. Here, we propose an alternative foundation based on Abstract Algebra, with mathematics that facilitates the analysis of learning. In this approach, the goal of the…
Machine Learning (ML) applications are proliferating in the enterprise. Relational data which are prevalent in enterprise applications are typically normalized; as a result, data has to be denormalized via primary/foreign-key joins to be…
Large language models (LLMs) have shown promise in table Question Answering (Table QA). However, extending these capabilities to multi-table QA remains challenging due to unreliable schema linking across complex tables. Existing methods…
Data processing systems roughly group into families such as relational, array, graph, and key-value. Many data processing tasks exceed the capabilities of any one family, require data stored across families, or run faster when partitioned…
Without any doubt, the relational paradigm has been a huge success. At the same time, we believe that the time is ripe to rethink how database systems could look like if we designed them from scratch. Would we really end up with the same…
We present module theory and linear maps as a powerful generalised and computationally efficient framework for the relational data model, which underpins today's relational database systems. Based on universal constructions of modules we…
Machine learning has a long collaborative tradition with several fields of mathematics, such as statistics, probability and linear algebra. We propose a new direction for machine learning research: $C^*$-algebraic ML $-$ a…
Randomized linear algebra (RLA) algorithms are a modern class of numerical linear algebra techniques that play an essential role in scientific computing and machine learning, with broad and growing adoption. However, their discovery remains…
Linear algebraic expressions are the essence of many computationally intensive problems, including scientific simulations and machine learning applications. However, translating high-level formulations of these expressions to efficient…
Machine learning can provide deep insights into data, allowing machines to make high-quality predictions and having been widely used in real-world applications, such as text mining, visual classification, and recommender systems. However,…
Large language models (LLMs) are increasingly used to automate feature engineering in tabular learning. Given task-specific information, LLMs can propose diverse feature transformation operations to enhance downstream model performance.…
The machine learning (ML) training over disparate data sources traditionally involves materialization, which can impose substantial time and space overhead due to data movement and replication. Factorized learning, which leverages direct…
This paper uses typed linear algebra (LA) to represent data and perform analytical querying in a single, unified framework. The typed approach offers strong type checking (as in modern programming languages) and a diagrammatic way of…
We propose a new formalism for specifying and reasoning about problems that involve heterogeneous "pieces of information" -- large collections of data, decision procedures of any kind and complexity and connections between them. The essence…
In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities in parsing textual data and generating code. However, their performance in tasks involving tabular data, especially those requiring symbolic reasoning,…
This tutorial overviews the state of the art in learning models over relational databases and makes the case for a first-principles approach that exploits recent developments in database research. The input to learning classification and…
We consider a novel backward-compatible paradigm of general data analytics over a recently-reported semisimple algebra (called t-algebra). We study the abstract algebraic framework over the t-algebra by representing the elements of…
What is a systematic way to efficiently apply a wide spectrum of advanced ML programs to industrial scale problems, using Big Models (up to 100s of billions of parameters) on Big Data (up to terabytes or petabytes)? Modern parallelization…
In modern data analytics, analysts frequently face the challenge of searching for desirable entities by evaluating, for each entity, a collection of its feature relations to derive key analytical properties. This search is challenging…