Tensor programming for databases, with first class dimensions
Tensor libraries like PyTorch and JAX have developed compact and accelerated APIs for manipulating n-dimensional arrays. N-dimensional arrays are not so different from the tables in a database, which raises a natural question: could you set up a Tensor-like API to run the queries on a database that would normally be written in SQL? There are two challenges:
- Tensor computation is typically uniform and data-independent. But SQL relational queries are almost entirely about filtering and joining data in a data-dependent way.
- JOINs in SQL can be thought of as performing outer products, which is not a very common operation in tensor computation.
However, we have a secret weapon: first class dimensions were primarily designed as a new frontend syntax that makes it easy to express einsum, batching and tensor indexing expressions. They might be good for SQL too.
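(If you haven't seen first class dimensions before, here is a minimal sketch of what they look like today, using the functorch.dim module that ships with PyTorch; the exact import path may vary across releases. It writes matrix multiplication einsum-style by binding positional dimensions to dim objects and reducing over one of them.)

import torch
from functorch.dim import dims

i, j, k = dims(3)             # three first class dimensions
A = torch.randn(3, 4)
B = torch.randn(4, 5)
# Bind A's dimensions to (i, k) and B's to (k, j); the pointwise product
# broadcasts over {i, j, k}, and summing out k gives matrix multiplication.
C = (A[i, k] * B[k, j]).sum(k)
print(C.order(i, j).shape)    # back to a positional tensor: torch.Size([3, 5])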
Representing the database. First, how do we represent a database? A simple model, following columnar databases, is to have every column be a distinct 1D tensor, where all columns belonging to the same table share a consistent indexing scheme. For simplicity, we'll assume that we support rich dtypes for the tensors (e.g., so I can have a tensor of strings). So if we consider our classic customer database of (id, name, email), we would represent this as:
customers_id: int64[C]
customers_name: str[C]
customers_email: str[C]
Where C is the number of entries in the customer database. Our tensor type is written as dtype[DIM0, DIM1, ...], where each dimension is written with the same name I will later use for the first class dimension that represents it. Let's suppose that the index into C does not coincide with id (which is good, because if they did coincide, you would have a very bad time if you ever wanted to delete an entry from the database!)
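Concretely, such a table might be built like this (a toy sketch with made-up data; since stock PyTorch has no string dtype, the string columns are shown as Python lists that share the same indexing scheme):

import torch

# Three parallel "columns"; position i in each refers to the same customer.
customers_id = torch.tensor([997, 1000, 1024])   # int64[C], with C = 3
customers_name = ["alice", "bob", "carol"]        # str[C] (stand-in)
customers_email = ["a@example.com", "b@example.com", "c@example.com"]  # str[C] (stand-in)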
This gives us enough to implement baby's first query:
SELECT c.name, c.email FROM customers c WHERE c.id = 1000
Notice that the size of the result is data-dependent: it may contain zero or one rows depending on whether the id is in the database. Here is a naive implementation in standard PyTorch:
mask = customers_id == 1000
return (customers_name[mask], customers_email[mask])
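To make this concrete, here is a runnable version on the toy columns from above (redeclared so the snippet stands alone). Because the string columns are plain Python lists in this sketch, we convert the mask to indices before selecting from them:

import torch

customers_id = torch.tensor([997, 1000, 1024])
customers_name = ["alice", "bob", "carol"]
customers_email = ["a@example.com", "b@example.com", "c@example.com"]

mask = customers_id == 1000                    # bool[C]
idx = mask.nonzero(as_tuple=True)[0].tolist()  # data-dependent: zero or one hits
print([customers_name[i] for i in idx],        # ['bob']
      [customers_email[i] for i in idx])       # ['b@example.com']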
Here, boolean masking performs the data-dependent filtering operation. This implementation in eager mode is a bit inefficient: we materialize a full boolean mask that is then fed into the subsequent operations, whereas you would prefer a compiler to fuse the masking and indexing together. First class dimensions as they exist today don't really help with this example; we will need to introduce some new extensions to them. First, here is what we can already do:
C = dims(1)
c_id = customers_id[C]        # {C} => int64[]
c_name = customers_name[C]    # {C} => str[]
c_email = customers_email[C]  # {C} => str[]
c_mask = c_id == 1000         # {C} => bool[]
Here, a tensor with first class dimensions has a more complicated type {DIM0, DIM1, ...} => dtype[DIM2, DIM3, ...]. The first class dimensions are all reported in the curly braces to the left of the double arrow; the curly braces emphasize the fact that first class dimensions are unordered.
What next? The problem is that now we want to do something like torch.where(c_mask, c_name, ???) but we are now in a bit of trouble, because we don't want anything in the false branch of where: we want to provide something like "null" and collapse the tensor to a smaller number of elements, much like how boolean masking did it without first class dimensions. To express this, we'll introduce a binary version of torch.where that does exactly this, as well as returning the newly allocated FCD for the new, data-dependent dimension:
C2, c2_name = torch.where(c_mask, c_name)     # {C2} => str[]
_C2, c2_email = torch.where(c_mask, c_email)  # {C2} => str[], n.b. C2 == _C2
return c2_name, c2_email
Notice that torch.where introduces a new first-class dimension. I've chosen to memoize this FCD on c_mask, so further torch.where invocations with the same mask consistently return the same new FCD.
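The memoization policy can be approximated in today's PyTorch without first class dimensions (where2 and its cache are hypothetical names, not a real API): the indices of the surviving rows stand in for the new dimension C2, and they are cached on the mask so that every column filtered by the same mask lines up.

import torch

_mask_to_dim = {}   # id(mask) -> surviving indices, standing in for the new FCD

def where2(mask, values):
    key = id(mask)
    if key not in _mask_to_dim:
        _mask_to_dim[key] = mask.nonzero(as_tuple=True)[0]
    c2 = _mask_to_dim[key]           # the same object for every column filtered by this mask
    return c2, values[c2]

ids = torch.tensor([997, 1000, 1024])
ages = torch.tensor([31, 42, 27])
mask = ids == 1000
C2, kept_ids = where2(mask, ids)     # tensor([1]), tensor([1000])
_C2, kept_ages = where2(mask, ages)  # same C2, tensor([42])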
Having to type out all the columns can be a bit tiresome. If we assume all elements in a table have the same dtype (let's call it dyn, short for dynamic type), we can more compactly represent the table as a 2D tensor, where the first dimension is the index as before, and the second dimension is the columns of the database. For clarity, we'll support using the string name of a column as a shorthand for its numeric index. If the tensor is contiguous, this gives a more traditional row-wise database. The new database can be conveniently manipulated with FCDs, as we can handle all of the columns at once instead of typing them out individually:
customers: dyn[C, C_ATTR]

C = dims(1)
c = customers[C]                          # {C} => dyn[C_ATTR]
C2, c2 = torch.where(c["id"] == 1000, c)  # {C2} => dyn[C_ATTR]
return c2[["name", "email"]].order(C2)    # dyn[C2, ["name", "email"]]
We'll use this for the rest of the post, but the examples should be interconvertible.
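Absent a real dyn dtype in PyTorch, the 2D table and the column-name shorthand can be mocked up, for example with a NumPy object array and a name-to-index map (all of the helper names here are made up for illustration):

import numpy as np

C_ATTR = {"id": 0, "name": 1, "email": 2}    # column name -> numeric index

customers = np.array([                        # plays the role of dyn[C, C_ATTR]
    [997, "alice", "a@example.com"],
    [1000, "bob", "b@example.com"],
], dtype=object)

def cols(table, names):
    """table[["name", "email"]] in the post's notation."""
    return table[:, [C_ATTR[n] for n in names]]

print(cols(customers, ["name", "email"]))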
Aggregation. What's the average age of all customers, grouped by the country they live in?
SELECT AVG(c.age) FROM customers c GROUP BY c.country;
PyTorch doesn't natively support this grouping operation, but essentially what is desired here is a conversion into a nested tensor, where the jagged dimension is the country (each of which will have a varying number of customers). Let's hallucinate a torch.groupby analogous to its Pandas equivalent:
customers: dyn[C, C_ATTR]

customers_by_country = torch.groupby(customers, "country")  # dyn[COUNTRY, JC, C_ATTR]
COUNTRY, JC = dims(2)
c = customers_by_country[COUNTRY, JC]                       # {COUNTRY, JC} => dyn[C_ATTR]
return c["age"].mean(JC).order(COUNTRY)                     # f32[COUNTRY]
Here, I gave the generic indexing dimension the name JC, to emphasize that it is a jagged dimension. But everything proceeds like we expect: after we've grouped the tensor and rebound its first class dimensions, we can take the field of interest and explicitly specify a reduction on the dimension we care about.
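For reference, the quantity this computes can already be expressed in stock PyTorch with an index_add_-based group average (assuming the country column has been integer-coded; the data below is made up):

import torch

country = torch.tensor([0, 1, 0, 2, 1])       # integer-coded country per customer
age = torch.tensor([30., 40., 50., 20., 60.])
num_countries = int(country.max()) + 1

sums = torch.zeros(num_countries).index_add_(0, country, age)
counts = torch.zeros(num_countries).index_add_(0, country, torch.ones_like(age))
print(sums / counts)                          # f32[COUNTRY]: tensor([40., 50., 20.])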
In SQL, aggregations have to operate over the entirety of groups specified by GROUP BY. However, because FCDs explicitly specify what dimensions we are reducing over, we can potentially decompose a reduction into a series of successive reductions on different columns, without having to specify subqueries to progressively perform the reductions we are interested in.
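As a tiny illustration of that flexibility with today's functorch.dim (on a dense, non-jagged tensor, and assuming mean accepts a first class dim the same way sum does in the torchdim examples), the same aggregate can be computed as two successive named reductions:

import torch
from functorch.dim import dims

COUNTRY, JC = dims(2)
ages = torch.arange(20, dtype=torch.float32).reshape(4, 5)  # pretend: 4 countries x 5 customers each
a = ages[COUNTRY, JC]                 # {COUNTRY, JC} => f32[]

per_country = a.mean(JC)              # {COUNTRY} => f32[]
overall = per_country.mean(COUNTRY)   # f32[]; equals ages.mean() here since every group has the same size
print(per_country.order(COUNTRY), overall)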
Joins. Given an order table, join it with the customer referenced by the customer id:
SELECT o.id, c.name, c.email FROM orders o JOIN customers c ON o.customer_id = c.id
First class dimensions are great at doing outer products (although, like with filtering, a naive implementation will expensively materialize the entire outer product!)
customers: dyn[C, C_ATTR]
orders: dyn[O, O_ATTR]

C, O = dims(2)
c = customers[C]                                              # {C} => dyn[C_ATTR]
o = orders[O]                                                 # {O} => dyn[O_ATTR]
mask = o["customer_id"] == c["id"]                            # {C, O} => bool[]
outer_product = torch.cat([o[["id"]], c[["name", "email"]]])  # {C, O} => dyn[["id", "name", "email"]]
CO, co = torch.where(mask, outer_product)                     # {CO} => dyn[["id", "name", "email"]]
return co.order(CO)                                           # dyn[CO, ["id", "name", "email"]]
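For comparison, here is what the naive materialized strategy computes in stock PyTorch, on made-up numeric columns (string columns omitted, since eager PyTorch has no string dtype): a full |O| x |C| comparison, followed by a data-dependent selection.

import torch

customers_id = torch.tensor([7, 8, 9])
orders_id = torch.tensor([100, 101, 102, 103])
orders_customer_id = torch.tensor([8, 7, 8, 42])

mask = orders_customer_id[:, None] == customers_id[None, :]  # bool[O, C], fully materialized
o_idx, c_idx = mask.nonzero(as_tuple=True)                   # the data-dependent dimension CO
print(orders_id[o_idx], customers_id[c_idx])                 # order 103 has no match and is dropped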
What's the point. There are a few reasons why we might be interested in the correspondence here. First, we might be interested in applying SQL ideas to the Tensor world: a lot of things people want to do in preprocessing are similar to what you do in traditional relational databases, and SQL can teach us what optimizations and use cases we should think about. Second, we might be interested in applying Tensor ideas to the SQL world: in particular, I think first class dimensions are a really intuitive frontend for SQL, one that can be embedded entirely in Python without necessitating the creation of a dedicated DSL. Also, this might be the push needed to get TensorDict into core.