Data linkage guidelines

From Freebase

Overview

This page gives suggestions on how to integrate Freebase entities into your application. Often, when building an application using Freebase data, it's necessary to keep a local copy of the entity IDs that you're interested in. This allows you to:

  • Easily query the Freebase API for up-to-date information about these entities.
  • Avoid having to reconcile the entities over and over.
  • Augment Freebase entity data with your own application-specific data.

Keeping a local copy also comes with the following challenges:

  • Some types of Freebase IDs change over time.
  • Some Freebase topics get split into multiple topics.
  • Some Freebase topics get merged into a single topic.

How to add Freebase entities to your database

  1. Select the tables in your database which contain entities that you wish to link to Freebase data.
  2. Export the tables as TSV files.
  3. Load each table individually into Google Refine and use the Freebase Reconciliation Service to link each row to a topic in Freebase.
  4. Add a new column based on the Freebase key that you wish to use to link your entities.
  5. Export each table from Refine and load them back into your database.
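The last step above can be sketched in Python. This is only an illustration under stated assumptions: the table name entities, the column names local_id and freebase_mid, and the use of SQLite are all hypothetical, and the export is assumed to be a TSV file with a header row.

```python
import csv
import io
import sqlite3


def load_reconciled_rows(tsv_text, conn):
    """Store (local_id, freebase_mid) pairs from a TSV export.

    Assumes hypothetical column names "local_id" and "freebase_mid".
    Rows that were not reconciled (empty key column) are stored as NULL.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entities "
        "(local_id TEXT PRIMARY KEY, freebase_mid TEXT)"
    )
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = [(row["local_id"], row["freebase_mid"] or None) for row in reader]
    conn.executemany(
        "INSERT OR REPLACE INTO entities (local_id, freebase_mid) VALUES (?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)
```

Calling load_reconciled_rows(open("export.tsv").read(), sqlite3.connect("app.db")) would load one exported table; the file and database names here are placeholders.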

Which namespace to use

To get the most out of your Freebase linkage, it's important to choose the right type of key to store in your database. In Freebase, keys are organized into namespaces. Different namespaces have different pros and cons for developers.

IDs

By default, most queries of the form [{"id":null,...}] return an ID in one of the available namespaces for that topic. It's not advisable to use the output of the "id" property as a key in your database, because these IDs come from many different namespaces and the values will change over time.

MIDs

Freebase MIDs are keys in the namespace /m. MIDs are unique identifiers which are automatically assigned to every topic at creation time.

Pros:

  • MIDs are never deleted. When topics get merged, the new topic inherits all the MIDs of the merged topics. This means you don't have to update your database because your MIDs will still point to the merged topic.

Cons:

  • MIDs are not human-readable and therefore have no implied semantics. This means that, in the event of a topic split, it may not always be possible to know which MIDs should go where, and you may have to re-align your database depending on whether the MID that you're using got split off to another topic.
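To store MIDs rather than whatever the "id" property happens to return, a query can ask for the MID explicitly. The sketch below only builds the query; it assumes that MQL accepts "mid" as a property alongside "id", and the helper name mid_query and the type /film/film are illustrative choices, not something this page prescribes.

```python
import json


def mid_query(type_id, limit=5):
    """Build an MQL query that asks for each topic's MID and name.

    Assumes "mid": None (null in JSON) is a valid MQL property request.
    """
    return [{"type": type_id, "mid": None, "name": None, "limit": limit}]


def to_json(query):
    """Serialize the query; Python None becomes JSON null."""
    return json.dumps(query, indent=2)
```

For example, to_json(mid_query("/film/film")) produces a query that can be pasted into the query editor.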

/en keys

Keys in the /en space are human-readable English keys which have been periodically added to Freebase topics. These keys are being DEPRECATED. They will still work if you're using them, but no new /en keys will be generated and they won't show up by default in [{"id":null,...}] queries.

/authority namespaces

Authority namespaces like /authority/twitter or /authority/imdb contain keys into those respective datasets. These keys are aligned with whatever ID system the target dataset uses, and we do our best to keep them in sync. If you're building an application that specifically works with data from one of these

Pros:

  • /authority keys may or may not be human-readable, but they do have some implied semantics: things with /authority/twitter keys will almost always be people or organizations, and things with /authority/imdb keys will always be either people, movies or TV shows. This means that, in the event of a topic split, your /authority key will most likely get moved to the right topic and you won't have to update your database.

Cons:

  • /authority namespaces don't have the same coverage as general-purpose namespaces like MIDs, so if you rely on an /authority namespace in your application, you may not be able to link certain entities unless that dataset supports them.

Application-specific namespaces

As a registered Freebase user, you are permitted to create your own namespaces under your personal namespace (/user/yourusername/...). This allows you to specify exactly which topics in Freebase align with the topics in your database.

Pros:

  • This is the most stable way to identify topics in Freebase. Each key will always point to exactly the topic that you added it to.

Cons:

  • You're responsible for moving the key in the event of a topic split. This might actually be what you want but it's still more work and the current Freebase client doesn't have strong support for adding/removing keys.
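Once keys exist in a namespace, a topic can be looked up by namespace and key value. The following is a minimal sketch that only constructs the query; it assumes the MQL "key" clause takes "namespace" and "value" fields, and the function name key_lookup_query, the namespace /user/yourusername/myapp and the key value shown in the usage note are hypothetical.

```python
def key_lookup_query(namespace, value):
    """Build an MQL query for the topic holding `value` in `namespace`.

    Assumes a "key" clause with "namespace" and "value" fields; the MID
    and name of the matching topic are requested with None (JSON null).
    """
    return {
        "mid": None,
        "name": None,
        "key": {"namespace": namespace, "value": value},
    }
```

For example, key_lookup_query("/user/yourusername/myapp", "local-42") asks for the topic that carries your application's key local-42.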

How do I protect my site/application from malicious edits in Freebase?

Freebase is an open database where anyone with an account is free to contribute. This has a lot of benefits in terms of discovering broken or missing data, but it also brings some challenges with regard to malicious edits.

Freebase maintains a complete history of all edits to the database so, in the long term, malicious edits are very easy to roll back and misbehaving user accounts are easy to block. However, in the short term there is still the concern that malicious edits could propagate to your external site/application via the Freebase APIs and cause some damage/confusion before they are detected and rolled back.

To prevent this sort of problem from affecting people who build their applications using Freebase data, we've come up with a very simple mechanism that you can use as part of your MQL queries. It's called the as_of_time query parameter and it lets you query the Freebase graph for all existing facts up until the timestamp that you provide.

{
  "q1" : {
    "as_of_time" : "2007-01-09T22:00:56.0000Z",
    "query" : [
      {
        "domain" : "/architecture",
        "id" : null,
        "return" : "count",
        "timestamp" : null,
        "type" : "/type/type"
      }
    ]
  }
}

Try this out in the query editor. Note that the as_of_time parameter is available from the Envelope tab at the bottom.

This lets you record a timestamp at which you've verified that the Freebase data you're interested in is accurate, and then query the Freebase API only up until that timestamp. You can store the last known "good" timestamp for each entity in your database and then update those timestamps as new data is contributed and reviewed.
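The bookkeeping described here might look like the sketch below. It is an illustration only: the in-memory dictionary stands in for a column in your own database, the function names are hypothetical, the envelope mirrors the "q1" shape of the example query, and it assumes a MID can be passed as the value of "id" to select one topic.

```python
from datetime import datetime, timezone

# Last known "good" timestamp per entity, keyed by MID.
# A stand-in for a column in your own database.
last_verified = {}


def format_timestamp(moment):
    """Format an aware datetime like the as_of_time value in the example query."""
    utc = moment.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S") + ".0000Z"


def mark_verified(mid, moment):
    """Record that the data for `mid` was reviewed and found good at `moment`."""
    last_verified[mid] = format_timestamp(moment)


def build_envelope(mid, query):
    """Wrap `query` so it only sees facts up to the entity's verified timestamp.

    Raises KeyError if the entity has never been verified, so unreviewed
    data is not queried by accident.
    """
    stamp = last_verified[mid]
    pinned = dict(query, id=mid)  # assumption: a MID is accepted as an "id" value
    return {"q1": {"as_of_time": stamp, "query": pinned}}
```

After reviewing an entity, call mark_verified with the current time, and build later queries with build_envelope. Refusing to build a query for an unverified entity is a design choice; falling back to an unpinned query would also work but gives up the protection.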
