On Foundations

Your data
is not
a set.

It is the totality of operations you can apply to it. The Yoneda lemma, translated for people who work with tables.

Sebastian Zając · SGH Warsaw School of Economics

There is a habit, almost a reflex, of thinking of a dataset as a collection of rows. Each row is a vector; each vector is a point in some ℝ^d; the dataset is the set of these points. Statistics, machine learning, and even quantum encoding all begin from this picture. The picture is not wrong, but it is misleading in a specific way: it suggests that the data point is the primary object, and that the operations one performs on it — distances, kernels, transformations, encodings — are secondary, applied to the data from outside.

The Yoneda lemma reverses the priority. It says, in plain words, that an object is determined entirely by the totality of arrows into it (or out of it). What you can do with X is what X is. Two objects with identical patterns of incoming morphisms are indistinguishable; two patterns of incoming morphisms determine the object up to isomorphism. The object dissolves into its operational role.

Yoneda, in one line

For an object X in a category 𝒞, the functor hom(−, X) : 𝒞^op → Set determines X up to isomorphism.

Translation: tell me what arrows land in X, from every other object, and I will tell you exactly what X is.

§1The translation to data

A row of a table, looked at in isolation, has no content. The number 3.7 in the column "petal length" is meaningless without the operations that come with it: comparison with other rows, scaling, distance to another row, projection onto a learned feature, encoding into a quantum state. None of these operations is "applied" to the row. Each of them is, in fact, a morphism from the row into some other space — into the real line, into a Hilbert space, into a representation, into a class label.

The dataset is the totality of these morphisms. What you call "the data" is a shorthand for the family of admissible maps that you have decided to consider.

Two datasets with identical morphism profiles into every relevant target are operationally indistinguishable. A dataset is not what it contains. A dataset is what one can do with it.

§2Where this is already happening

Practitioners do this without naming it. The Yoneda perspective is implicit in nearly every modern technique that treats raw values with suspicion.

Kernel methods

A kernel k(x, y) never asks what x is. It asks what x is similar to. The kernel is the morphism profile.

Equivariant ML

A network is built to commute with a group of transformations. The data is treated as a representation, not as a set of points.

Augmentation

Augmenting a single image with rotations and crops is the explicit construction of its hom-set into the orbit space.

Embeddings

A learned embedding is a functor: it preserves a chosen subset of the morphisms (similarity, analogy, hierarchy) and discards the rest.

Quantum encoding

A QML encoding is a map x ↦ |ψ(x)⟩. The class of all such maps is the data, in its quantum-relevant form.

In each case the row of the table is no longer the primary object. The relations are.

x is the family of arrows leaving it

§3What changes when you take this seriously

If a data point is its hom-functor, then two questions that look identical separate.

The first is "what is in the dataset." The second is "what class of maps am I willing to call admissible." The first is fixed by collection; the second is a free parameter, and it is where all the modeling happens.

Choose only linear maps and you get classical statistics. Choose smooth maps respecting a group action and you get equivariant ML. Choose unitary maps into a Hilbert space and you get a quantum encoding. Choose unitary maps into a Fock space whose creation operators carry physical meaning, and you get QIFT. The data has not changed. The category in which it lives has.

Classical statistics is the Yoneda functor restricted to one specific, narrow class of admissible maps. Quantum data analysis is the same functor, restricted to a different, larger class.

§4The practical consequence

Stop asking what the data is. Ask what the admissible morphisms are. The first question has no operational answer; the second is a modeling decision, and every consequential choice in machine learning, statistics, and quantum encoding turns out to be an answer to it in disguise.

The Equivalence Theorem of QIFT, in this language, says: the class of probability-loading-followed-by-fixed-unitary encodings is operationally a single class. Different maps within it are isomorphic in the relevant sense. To get genuinely new behavior, the class itself must change.

That is Yoneda for practitioners. Not a theorem you apply, but a question you ask before everything else.

· · ·

Your datais nota set.

§1The translation to data

§2Where this is already happening

§3What changes when you take this seriously

§4The practical consequence

Your data
is not
a set.