Entities, links, data [1ab930b7] ()

Context

During a live discussion with Lion Kimbro on his Discord server, we've touched the topic of data serialization formats, especially JSON and his own in-house format called nLSD (number, lists, strings, dictionaries) (see here for reference).

Starting from these topics, I've proposed adding a few higher-level concepts (entity and link) that helps in the ways I describe below.

For referring to this document, please use the following URL: https://scratchpad.volution.ro/ciprian/992c7f2944456f18cdde77f683f49aa7/1ab930b7.html

Data

JSON, Lion's own nLSD, and pretty much every other serialization format out there, allows one to serialize (if not graphs then at least) trees of structured data.

Let's take as example an RSS-like feed (say an extension to the RSS / Atom, serialized as JSON):

All this can be expressed by just using lists and dictionaries (and atomic data-types like numbers and strings).

If one would open such a JSON file in a JSON viewer or editor (say for example directly in Firefox), one would see a tree of dictionaries and lists, and one is able to collapse or expand any node. No other visual cue or aid can be given by the viewer; instead of looking at raw JSON, one is seeing at best a pretty-printed, colored and properly indented JSON structure.

Entities

Say that the programmer that implemented the JSON feed generator uses his native language instead of English. (I.e. one can't understand the meaning of attributes, and the values don't give much insight.)

(Also remember that the feed contains articles, authors, and dates amongst other things.)

If one looks at such a JSON tree in a JSON viewer, one can't discern anything about the "structure" of the tree; one sees just dictionaries and lists; one can discern some patterns (i.e. some sub-trees look similarly), but can't easily draw any "boundaries" saying "this is an entity that seems to have a list of other entities".

Let's imagine that we extend the JSON syntax to use <{ "key" : "value", ... }> instead of { ... }, giving it the following semantic:

Getting back to our generic JSON-extended viewer, the UI can now mark somehow an entity as a self-contained "object", and demarcates it from other entities. (Note that an entity can have attributes that are dictionaries, but these dictionaries are not always entities themselves, like is the case of timestamps and dates.)

Entities as cells parallel

If one still can't visualize the difference between data-objects and entity-objects, or just data and entities, or why some dictionaries can be entities, but not all dictionaries are entities, think about the following parallel:

Entities as unique or identifiable objects

A data-structure taken in isolation (without it's context) doesn't mean much. For example, the name of a person; at most we can say it's a list of words, but if we don't identify the words as a person name, it can mean anything.

An entity taken in isolation can still have a meaning by itself, for example a person, building, car, etc. We might not "know" what a person is (i.e. we might not know its schema), but we can still generically infer that it's a self-contained object.

Getting back to our JSON feed, we might recognize that some attributes that contain other entities aren't actually attributes, but pointers or links to those entities.

For example, our article might have an "authors" list where each element is a pointer to a "person".

Let's imagine that we extend the JSON syntax to introduce the following construct:

This !<data> syntax would be interpreted as such: take the data sub-tree and consider it as an entity that is represented by this sub-tree; wherever in the same JSON one finds the same !<data>, it points to the same entity.

In other words, the ! extension adds the concept of links towards other entities that might appear multiple times within the same JSON.

(The topic of linking will be extended upon in another document. At the moment I just want to introduce the concept of links between entities.)

Parallel to the relational model

Setting aside entities and links, let's look at the relational model:

However, by introducing just the following three concepts: tables, records and foreign-keys (i.e. links), one can build useful generic SQL viewers and browsers, regardless of the actual schema (i.e. the semantic of the records as interpreted by humans;)

Final thoughts

Although entities and links are represented (in this example) as just ordinary JSON sub-trees, I think it is useful to somehow "mark" some of these sub-trees as being "special" or "serving a higher function".

Granted, this doesn't technically help much a computer program, but it does allow the human user that interacts with the raw data to more easily identify high-order constructs, and discern between plain-data and physical or virtual entities.

Also, in a more elaborate system, entities and links might have identifiers (locally unique or perhaps globally unique), but this is besides the point of this article. I just wanted to introduce the idea that perhaps a serialization format for "linked data" needs a few more higher-level constructs than just plain data and URL's.

Perhaps entities and links are not enough, perhaps they are the wrong model, perhaps a relational model would work better.

Perhaps JSON is not the best serialization format to demonstrate these concepts, but it's the one most people know, so it serves as a good example foundation.

Finally, one could argue that by using a schema attached to the JSON tree might help "mark" the sub-trees, but I think that is not enough. (For example a JSON schema wouldn't make a difference between an author object and a date object, where one is an entity the later being a plain data-structure.)

Example of extended JSON syntax

Again, this is not a syntax proposed for implementation. It just tries to help the user visualize how various sub-trees from a JSON file might represent different entities and links.

{ // the feed object;  not an entity, just a data-structure
    "updated": {
        // the date object;  not an entity, just a data-structure
        "year":2020, "month":9, "day":10
    },
    ... // other feed-related data
    "articles": [
        // here we don't use !<...>, because two articles that have the same contents might actually be different entities;
        // (they might just share the same title, date and authors.
        <{
            "title":"...",
            "authors":[!<"Ciprian Craciun">],
        }>,
        <{
            "title":"...",
            "authors":[!<"Lion Kimbro">, !<"Ciprian Craciun">],
        }>,
    ]