Machine ID

From Freebase

Revision as of 02:18, 21 July 2010 by Thadguidry (Talk | contribs)

A mid or machine id is a short form of id for any Freebase topic. Mids allow topics to be more easily managed and changes tracked over time, particularly when they are stored in external data sources and used as foreign keys into Freebase.

They differ from guids in being shorter, and in continuing to work across topic merges and other such transformations.



Freebase machine-generated ids ("mids") are ids that are assigned to topics at creation time, and are managed throughout the topic's lifetime. They play a critical role when topics are merged or split, allowing external applications to track the logical topic even though the physical Freebase identity (the topic's guid) may change. Machine-generated ids differ from current human-readable Freebase ids (returned by the "id" property) in that they are:

  • guaranteed to exist
  • machine-generated
  • designed to support offline comparison
  • not designed to convey meaning to humans
  • short (possibly fixed length)
  • ideal for quick exchange of keys between external systems and components
  • not guids

Machine-generated ids can also be used freely as ids (e.g. with the "id" property), and can also be used in roles where a compact representation is required (e.g. in urls).

The identity problem

Today, developers have two alternatives when they wish to store references (foreign keys) to Freebase topics in their internal database. They can refer to topics by guid, or by human-readable ids. If they store guids, then the guid may end up pointing to a "tombstone" topic in Freebase if that topic gets merged with another. (A tombstone topic is one that has been updated to contain a single property, "/dataworld/gardening_hint/replaced_by", which points to the "merge winner" topic.) It is up to the application to notice that the guid has been replaced by another, and to update its database to record this fact. There is currently no way to be notified of these updates, or to notice them automatically, short of detecting a query failure and explicitly looking for the "replaced_by" property.
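
As a rough illustration of the bookkeeping this pushes onto applications, the sketch below follows "replaced_by" links through a chain of tombstones. The REPLACED_BY table, the placeholder guids in it, and the function name are made up for illustration; a real application would have to obtain each link by querying Freebase.

```python
# Hypothetical stand-in for the result of looking up
# "/dataworld/gardening_hint/replaced_by" on each tombstone topic.
REPLACED_BY = {
    "#guid_a": "#guid_b",  # topic a was merged into topic b
    "#guid_b": "#guid_c",  # topic b was later merged into topic c
}


def resolve_guid(guid, replaced_by):
    """Follow replaced_by links until a live (non-tombstone) topic is reached."""
    seen = set()
    while guid in replaced_by:
        if guid in seen:
            raise ValueError("cycle in replaced_by links at " + guid)
        seen.add(guid)
        guid = replaced_by[guid]
    return guid
```

Every stored guid would need this treatment before it could be trusted, which is the cost the rest of this proposal aims to remove.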

Alternatively, developers can store ids in their internal database (returned by the "id" property), but this has several problems. First, many topics are not assigned ids at creation time, either because a good human-readable name is not yet known, or the rules used to create them automatically stumble into unknown territory. This leaves many topics to return a guid-based id when the "id" property is queried (e.g. "/guid/9202a8c04000641f8000000012eb7a9a", the id for the topic describing the Three-masted Barque ship class). So if the goal is to have human-readable ids, this often fails.

Second, since ids are assigned after the fact, an application must be prepared to notice that "/guid/9202a8c04000641f8000000012eb7a9a" was later assigned a better (primary) id of "/en/three_masted_barque". If the application stored "/guid/9202a8c04000641f8000000012eb7a9a" on behalf of one user, and a second user turns up the id "/en/three_masted_barque", for instance by a search or Freebase Suggest, it is difficult for the application to know that it has already stored information about this topic. The application cannot simply compare the new id with one it may have stored.

Proposed solution

To address these problems, we propose the addition of machine-generated ids to Freebase topics. All topics in Freebase will be assigned machine-generated ids, and any newly created topics will be automatically assigned new machine-generated ids. In the event of a topic merge, machine-generated ids will be migrated forward to refer to the newly merged topic. In the case of a topic split, one portion of the split will retain the current machine-generated id, and the split-off topic(s) will be assigned new machine-generated ids. Machine-generated ids will subsume today's use of guids as foreign keys to Freebase topics. They may also be used directly as ids with the "id" property, although they are not human-readable.

The "mid" property

Machine-generated ids will be accessible directly through a new property, "/type/object/mid". The "mid" property will return the machine-generated id(s) currently associated with a topic.

Since the "mid" property is a member of "/type/object", it is freely accessible to all objects without requiring a fully-qualified property name. Its schema is as follows:

  • name: "mid"
  • type: /type/property
  • schema: /type/object
  • expected_type: /type/id
  • unique: false
  • enumeration: /m
  • master_property: -none-
  • reverse_property: -none-
  • delegated: -none-
  • requires_permission: -none-

When queried in the form:

{ "mid":null, ... }

the "mid" property will return the primary machine-generated id associated with the topic in question. It can appear anywhere within the query. When queried in the form:

{ "mid":[], ... }

it will return all machine-generated ids currently associated with a topic, including those that have been reassigned after merges. This form can be used to easily find and update any previously stored machine-generated ids whenever a new one is encountered.
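
The two query forms can be built programmatically. The following is a minimal sketch that only constructs the query bodies as JSON; the helper name mid_query is an assumption for illustration, and how the query is sent to Freebase is deliberately left out.

```python
import json


def mid_query(topic_id, all_mids=False):
    """Build an MQL query dict asking for a topic's primary mid or all its mids."""
    # None serializes to JSON null ("mid":null); an empty list asks for every mid.
    return {"id": topic_id, "mid": [] if all_mids else None}


primary_q = mid_query("/en/three_masted_barque")
all_q = mid_query("/en/three_masted_barque", all_mids=True)

print(json.dumps(primary_q))  # {"id": "/en/three_masted_barque", "mid": null}
print(json.dumps(all_q))      # {"id": "/en/three_masted_barque", "mid": []}
```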


The "mid" property will enumerate machine-generated ids in a new namespace, "/m". The keys of machine-generated ids are short variable-length sequences of characters consisting of digits, lower-case letters excluding vowels, and underscore. Upper-case characters may be used, but MQL will interpret them using their lower-case equivalents. (By avoiding vowels, we hope to avoid accidentally generating offensive identifiers.) Mids are also URL-safe, i.e. they don't require any escaping or unescaping to be used in URLs. Examples used elsewhere on this page include "/m/06bnz" and "/m/09gqznv".
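
The character rules described here can be checked offline. The sketch below is a strict reading of that description and not an official validator: the exact alphabet is an assumption (in particular, "y" is treated as a consonant), and the function names are made up for illustration.

```python
import re

# Assumed alphabet: digits, lower-case letters other than a/e/i/o/u, and underscore.
_MID_RE = re.compile(r"/m/[0-9bcdfghjklmnpqrstvwxyz_]+")


def normalize_mid(mid):
    """Lower-case a mid, mirroring how upper-case characters are interpreted."""
    return mid.lower()


def looks_like_mid(mid):
    """Return True if the string has the shape of a mid in the "/m" namespace."""
    return _MID_RE.fullmatch(normalize_mid(mid)) is not None
```

A check like this only confirms the shape of a string; whether the mid actually refers to a topic can only be answered by Freebase.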

Although mids appear under the "/m" namespace, they are not stored in Freebase in the traditional way using /type/key with its "key" and "namespace" properties. Instead, mids are synthetic, and completely derived from the topic's guid and the presence of any "/dataworld/gardening_hint/replaced_by" links. (I.e. adding a new "/dataworld/gardening_hint/replaced_by" link will cause a topic's mid to change.)

Primary mids

Since any topic may be assigned multiple machine-generated ids, the need arises to designate one as primary. Maintaining a single primary machine-generated id may be advantageous to applications wishing to maintain a single row for a topic in a database table, or compare topics for identity without issuing a query to Freebase, i.e. by comparing the mids directly. To allow for this, we specify that when "mid":null appears in a query, it is always the primary machine-generated id that is returned. When "mid":[] appears in a query, it is always the first mid in the result that is primary. Using this mechanism, machine-generated ids may be determined to be non-primary, and stored copies of them may be updated.

From an implementation perspective, the primary mid is the one that corresponds to the topic's guid. This will always exist, and is cheap to return.

"id" property changes

MQL's "id" property will also be changed to support machine-generated ids. Currently, when no human-readable id has been assigned to a topic, a /guid id is formed from the topic's guid and returned. This proposal would change that to return a machine-generated id instead.

Machine-generated ids would be the lowest priority id returned by the "id" property. All other human-readable, non-blacklisted ids would be returned before a machine-generated id is returned.
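
The proposed fallback order can be sketched as follows. This assumes the caller already has the topic's human-readable ids in their existing priority order; the function name and the blacklist argument are illustrative and not part of MQL.

```python
def choose_id(human_ids, mid, blacklist=frozenset()):
    """Pick the id that "id":null would return under the proposed priority."""
    for candidate in human_ids:
        if candidate not in blacklist:
            return candidate  # any usable human-readable id wins
    return mid  # lowest priority: fall back to the machine-generated id
```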

Mids and merges

When topics are merged, a "/dataworld/gardening_hint/replaced_by" link is written connecting the "losing" topic with the "winner". Additionally, any properties of the losing topic are updated (moved) to become properties of the winning topic, thereby leaving the losing topic as a tombstone -- completely uninteresting except for the fact that someone may have stored a reference to it. The advantage of using mids to refer to topics is that MQL will automatically follow the "replaced_by" links to retrieve properties of the winning topic as if they were properties of the losing topic.

When asking for the "mid" property of a merge winner, the primary mid is returned as described above. However, when asking for the "mid" property of a merge loser, the topic's original primary mid is returned rather than the mid of the merge winner. (Note: This is a change from the original design.) This is done to help users more clearly differentiate the topics, and aids in some of Freebase's housekeeping operations. However, it is important to note that both mids "resolve" to the winning topic when used in queries, and allow properties to be directly retrieved. It is also unlikely that one would ever encounter merge losers in query results unless one was directly inspecting "replaced_by" links.

The only reason to care whether a mid is associated with a merge winner or loser (i.e. whether it is primary or not) is when an application wants to compare newly retrieved mids with ones it has previously stored. By always querying with "mid":[], the complete set of mids resulting from merges can be retrieved, and any stored mids from this set can be updated to the primary (first) mid in the list. This effectively amortizes the cost of these updates across each query, so that a bulk "merge feed" isn't required. Applications that do not care about performing offline comparisons (comparisons that don't involve a round-trip to Freebase) are free to use whatever mid is returned by "mid":null.
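
The amortized update described above might look like the sketch below. It assumes the application keeps a dict keyed by mid; the function name and the policy of keeping an existing primary row when both exist are illustrative choices, not part of the proposal.

```python
def reconcile(stored, mid_list):
    """Re-key stored rows so that any mid in a "mid":[] result uses the primary mid.

    stored:   dict mapping a stored mid to application data
    mid_list: the "mid":[] result for one topic, primary mid first
    """
    primary = mid_list[0]
    updated = dict(stored)  # leave the caller's dict untouched
    for old in mid_list[1:]:
        if old in updated:
            data = updated.pop(old)
            # If a row already exists under the primary mid, keep it (app-specific policy).
            updated.setdefault(primary, data)
    return updated
```

For example, after the simple merge shown later on this page, a row stored under "/m/02kw6rz" would be re-keyed to "/m/06bnz" the next time the application sees the list ["/m/06bnz", "/m/02kw6rz"].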

Topic lifecycle

Simple case

In the simplest case, a topic is created and receives a machine-generated id. At some later time, it is assigned a human-readable id. Here is the sequence of events, and the results that various queries return:

  1. topic 1 is created
    • "mid":null returns "/m/09gqznv"
    • "mid":[] returns ["/m/09gqznv"]
    • "id":null returns "/m/09gqznv"
  2. topic 1 is assigned id "/en/three_masted_barque"
    • "mid":null returns "/m/09gqznv"
    • "mid":[] returns ["/m/09gqznv"]
    • "id":null returns "/en/three_masted_barque"

Simple merge case

In the simple merge case, 2 topics, each with a single machine-generated id, are merged together:

  1. topic 1 is created (e.g. /m/06bnz aka /en/russia)
    • "mid":null returns "/m/06bnz"
    • "mid":[] returns ["/m/06bnz"]
    • "id":null returns "/m/06bnz"
  2. topic 2 is created (e.g. /guid/9202a8c04000641f80000000051d9afe)
    • "mid":null returns "/m/02kw6rz"
    • "mid":[] returns ["/m/02kw6rz"]
    • "id":null returns "/m/02kw6rz"
  3. topic 2 is merged into topic 1
    • for topic 1 (now also referred to via "/m/02kw6rz"):
      • "mid":null returns "/m/06bnz"
      • "mid":[] returns ["/m/06bnz", "/m/02kw6rz"]
      • "id":null returns "/m/06bnz"
    • for topic 2:
      • "mid":null returns "/m/02kw6rz"
        • topic 2 retains its original mid in order to distinguish it from topic 1, although this mid is no longer primary
      • "mid":[] returns ["/m/02kw6rz"]
        • the list result only contains one mid since it is not a merge winner
      • "id":null returns "/m/02kw6rz"
        • no other id is available for this topic

Complex merge case

A more complex merge arises when a topic has already been merged with another. In this case multiple machine-generated ids may refer to either topic:

  1. topic 1 exists with multiple machine-generated ids
    • "mid":null returns "/m/0lt264"
    • "mid":[] returns ["/m/0lt264", "/m/010cqtp"]
    • "id":null returns "/m/0lt264"
  2. topic 2 exists with multiple machine-generated ids
    • "mid":null returns "/m/0zj97l"
    • "mid":[] returns ["/m/0zj97l", "/m/0pzwd_"]
    • "id":null returns "/m/0zj97l"
  3. topic 2 is merged into topic 1
    • for topic 1 (now also referred to via /m/010cqtp, /m/0zj97l, and /m/0pzwd_):
      • "mid":null returns "/m/0lt264"
      • "mid":[] returns ["/m/0lt264", "/m/010cqtp", "/m/0zj97l", "/m/0pzwd_"]
      • "id":null returns "/m/0lt264"
    • for topic 2:
      • "mid":null returns "/m/0zj97l"
      • "mid":[] returns ["/m/0zj97l"]
        • "/m/0zj97l" and "/m/0pzwd_" now both resolve to topic 1; the list contains only topic 2's own mid since it is not a merge winner
      • "id":null returns "/m/0zj97l"

Again, from a mid perspective, topic 1 and topic 2 are now indistinguishable. This is what we want -- automatic tracking of topics when they merge, even when old mids are used to refer to them.
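
The merge walkthroughs above can be reproduced with a toy in-memory model. This is only a sketch under simplifying assumptions: each topic is keyed by the mid derived from its own guid, merges are recorded as loser-to-winner links, and the class and method names are invented for illustration. It does not describe how Freebase implements mids.

```python
class ToyGraph:
    """Toy model: every topic is keyed by the mid derived from its own guid."""

    def __init__(self):
        self.replaced_by = {}  # merge loser's mid -> merge winner's mid

    def merge(self, loser, winner):
        """Record a replaced_by link from the loser to the winner."""
        self.replaced_by[loser] = winner

    def resolve(self, mid):
        """Follow replaced_by links to the topic a mid currently refers to."""
        while mid in self.replaced_by:
            mid = self.replaced_by[mid]
        return mid

    def mid_list(self, mid):
        """Mimic "mid":[] asked directly of the topic that owns this mid."""
        if mid in self.replaced_by:
            return [mid]  # a merge loser reports only its own mid
        others = [m for m in self.replaced_by if self.resolve(m) == mid]
        return [mid] + others  # primary first; order of the rest is arbitrary here


g = ToyGraph()
g.merge("/m/010cqtp", "/m/0lt264")  # earlier merge into topic 1
g.merge("/m/0pzwd_", "/m/0zj97l")   # earlier merge into topic 2
before = g.mid_list("/m/0zj97l")    # topic 2 before step 3
g.merge("/m/0zj97l", "/m/0lt264")   # step 3: topic 2 is merged into topic 1
after = g.mid_list("/m/0lt264")     # topic 1 after step 3
```

The model guarantees only that the primary mid comes first; the order of the remaining mids may differ from the listing above.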

Split case

In the case where a topic is split into 2 topics, a new topic with a new mid will be created for the newly split off subset:

  1. topic 1 exists with possibly multiple machine-generated ids (e.g. /en/mark_levinson -- person, or company?)
    • "mid":null returns "/m/03n8sx"
    • "mid":[] returns ["/m/03n8sx", "/m/0498_7x"]
    • "id":null returns "/m/03n8sx"
  2. topic 2 is split off from topic 1 (taking a subset of its properties)
    • topic 1 retains all the original machine-generated ids
    • topic 2 receives a new mid
      • "mid":null returns "/m/08fb_93c"
      • "mid":[] returns ["/m/08fb_93c"]
      • "id":null returns "/m/08fb_93c"
    • in addition, topic 1 and topic 2 should be linked with the "/dataworld/gardening_hint/split_to" property.

Note that when a topic is split, there's always the risk that an application that has stored its original mid really meant it as a reference to the part of the topic that is split off. In this case the application will end up referring to the wrong topic. The only way to resolve this situation is to notice that a split has occurred by following the split_to link, and to then inspect the stored mid to determine whether the split off topic was the more appropriate one.

A feed of "/dataworld/gardening_hint/split_to" links has been discussed but has not yet been implemented.
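
Until such a feed exists, an application could flag stored mids that need review after a split. The sketch below assumes the application has already collected the split links into a table mapping an original topic's mid to the mids split off from it; that direction, the table, and the function name are assumptions for illustration.

```python
# Hypothetical table built from "/dataworld/gardening_hint/split_to" links.
# Assumed direction: original topic's mid -> mids of the topics split off from it.
SPLIT_TO = {"/m/03n8sx": ["/m/08fb_93c"]}


def split_candidates(stored_mid, split_to):
    """Return every topic a stored mid might now mean: the original, then any split-off topics."""
    return [stored_mid] + list(split_to.get(stored_mid, []))
```

A result with more than one entry means the stored reference is ambiguous and must be re-examined by the application; the mid alone cannot say which topic was intended.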

Use cases

The following use cases illustrate how machine-generated ids can be used to track and manage topics over time.

Equating topics returned by relevance search

The first use case addresses the problem today where an application stores an id that it has obtained from a relevance search, but later can't easily equate it to another that relevance returns because the topic has been assigned an /en id. Using machine-generated ids, this use case works as follows:

  1. user 1 uses relevance search to find a topic without an /en id, e.g. /m/04191v, which is returned to the application
    • relevance is changed to return all machine-generated ids and any human-readable ids that have been assigned to the topic
  2. application stores /m/04191v in its database
  3. at a later time, Freebase assigns /en/speculaas to /m/04191v
    • now the topic has 2 ids; /en/speculaas is primary
  4. user 2 uses relevance to find /en/speculaas
  5. relevance search returns both /en/speculaas and /m/04191v to the application
  6. application can determine that user 1 and user 2 have identified the same topic by looking at the machine-generated ids

Equating topics after merge

A similar problem arises today when 2 topics are merged. Since applications today store guids or /guid ids (because that is currently the most available and reliable form of id), when a topic is merged, the guid of the merge loser is effectively defunct, and it is hard or impossible to determine the new equivalent topic. Using machine-generated ids, this use case works as follows:

  1. user 1 uses relevance search to find a topic without an /en id, e.g. /m/04191v
  2. application stores /m/04191v in its database
  3. at a later time, Freebase merges /m/04191v into /m/0hj8iav
    • now the merge winner has 2 mids: /m/0hj8iav (its original, primary mid) and /m/04191v (the merge loser's mid, acquired through the merge)
  4. user 2 uses relevance to find /m/0hj8iav
  5. relevance search returns /m/0hj8iav and /m/04191v to the application (2 machine-generated ids)
  6. application can determine that user 1 and user 2 have identified the same topic by looking at the machine-generated ids
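
Both this use case and the previous one come down to intersecting the mids returned for a topic with the mids the application has stored. A minimal sketch, assuming relevance search returns a flat list of ids for each topic as proposed above (the function names are illustrative):

```python
def mids_in(ids):
    """Keep only the machine-generated ids from a mixed list of ids."""
    return {i for i in ids if i.startswith("/m/")}


def find_stored(stored_mids, returned_ids):
    """Return the stored mids that identify the same topic as the returned ids."""
    return sorted(mids_in(returned_ids) & set(stored_mids))
```

With "/m/04191v" stored, both the list ["/en/speculaas", "/m/04191v"] and the list ["/m/0hj8iav", "/m/04191v"] match it, so the application can tell without a further query that it already holds this topic.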

Tracking human-readable ids

Applications wishing to include Freebase ids in urls will want to use human-readable /en ids rather than machine-generated ids (for instance, in HTML 5 "itemprop" markup). This helps users understand what topics or concepts are being referred to, and promotes a sense of universal identity that goes beyond a machine-generated string.

In the case where an application uses or stores machine-generated ids, the human-readable id can be obtained as follows:

{ "mid": "/m/04191v", "id": null }

Tracking mid updates

Applications that store machine-generated ids as foreign keys into Freebase may want to be notified when a mid changes due to a merge. There are several ways in which this could happen:

  1. notice all the machine-generated ids returned by a relevance search
    • iterate through them and update any stored copies to the primary mid (see below)
  2. query for "mid":null whenever a mid is used in a query to obtain the update
    • e.g. {"mid":"/m/04191v", "updated:mid":null, ...}
  3. obtain a feed listing all the merged topics and their machine-generated ids

In the event of a topic split, the "/dataworld/gardening_hint/split_to" property specifies the original topic from which the topic in question was split.

Short urls

Applications desiring short urls (e.g. twitter) can use the machine-generated id to formulate a url rather than the human-readable id returned by the "id" property.

Note that Freebase already provides a url-shortening service, Tinyify, that is used in conjunction with the Tippify application. We may eventually update Tinyify's shortening scheme to use machine-generated ids once this mechanism is in place.

Related changes

  • topic merge/split code must maintain machine-generated ids
  • relevance search and Freebase Suggest must return machine-generated ids as well as human-readable ids
  • create a feed service listing all merged topics and their machine-generated ids
  • deprecate /guid ids and "guid" property
  • update all dumps (BDB, WEX) to include machine-generated ids (ultimately replacing guids)