Wikipedia
Wikipedia is one of the major data sources for Freebase. Freebase also uses the article content for a topic's description, and many of the pictures on Freebase come from Wikimedia Commons.
Import
At its birth, Freebase imported every Wikipedia article, because an article very often corresponds to a topic. However, some forms of articles are not needed on Freebase (see Wikipedia templates for deletion). These were primarily imported from Wikipedia and consist of lists or complex subjects (topics about topics, like 'Economy of Poland'). Freebase uses structured data, queries and views to recreate these lists, and relies on storing simple topics only.
Staying up-to-date
One of the key pieces of data infrastructure behind the scenes at Freebase is the Wikipedia Recon Pipeline (aka "Topic Updater"). Every two weeks, it synchronizes Freebase topics to Wikipedia articles. It shifts our /wikipedia/en and /wikipedia/en_id keys to match the titles, redirects, and WPIDs at Wikipedia, and creates new topics for newly created articles. The service also updates the blurbs we display for all of our articles.
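The key synchronization described above can be pictured as a set difference between the keys a topic currently has and the titles and redirects Wikipedia currently reports. The following is only a minimal sketch under that assumption, not the pipeline's actual code; the function name `plan_key_changes` and the sample key values are made up for illustration.

```python
def plan_key_changes(current_keys, wikipedia_names):
    """Return (keys_to_add, keys_to_remove) so a topic's keys match Wikipedia.

    current_keys: key values the topic has now in /wikipedia/en (hypothetical data).
    wikipedia_names: the article title plus its redirects (hypothetical data).
    """
    current = set(current_keys)
    wanted = set(wikipedia_names)
    return sorted(wanted - current), sorted(current - wanted)


# Made-up example: one redirect was added at Wikipedia and one was removed.
to_add, to_remove = plan_key_changes(
    current_keys=["Barack_Obama", "Barack_Hussein_Obama"],
    wikipedia_names=["Barack_Obama", "Obama"],
)
print("add:", to_add)
print("remove:", to_remove)
```

A real run would also have to handle the /wikipedia/en_id keys and create topics for articles that have no topic yet; both are left out of this sketch.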
It takes up to a week to do the full update cycle. There is a "cooling off" period of two days: only articles that are more than two days old at the time of the update are included in it.
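As a rough illustration of the cooling-off rule, the sketch below filters a list of articles by age. It assumes that age is measured from the article's creation time, which the text does not state; `eligible_articles` and the sample data are hypothetical.

```python
from datetime import datetime, timedelta

# Two-day cooling-off period, as described in the text.
COOLING_OFF = timedelta(days=2)


def eligible_articles(articles, run_time):
    """Keep only articles created more than two days before run_time.

    articles: iterable of (title, created_at) pairs (hypothetical shape).
    """
    cutoff = run_time - COOLING_OFF
    return [title for title, created_at in articles if created_at < cutoff]


run_time = datetime(2009, 9, 15)
articles = [
    ("Old_article", datetime(2009, 9, 1)),     # well past the cooling-off period
    ("Fresh_article", datetime(2009, 9, 14)),  # only one day old, so skipped
]
print(eligible_articles(articles, run_time))
```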
Version 2.0 of this trusty piece of infrastructure has been running for over a year now, and it's starting to show its age. There are a few niggling edge cases it doesn't deal with well, leading to out-of-sync blurbs and topics that fail to get created. What's worse, the old infrastructure wasn't able to reconcile new Wikipedia articles with already existing Freebase topics. As more data gets loaded from sources other than Wikipedia, this will become more of a problem.
To solve these problems, and to make everything generally more maintainable, Freebase deployed version 3.0 of the Wikipedia Recon Pipeline in September 2009. This version uses many of the lessons we've learned about how to properly reconcile items in bulk, and should improve the few corner cases where we were failing before. It will also allow us to run the recon pipeline weekly instead of biweekly (although we will be keeping a biweekly run for the first few weeks after deployment). Reconciliation has been fixed or improved for about 3,000 topics in Freebase that had gotten messed up by corner cases in the past.
Staying afloat
As Wikipedia grows at 1,000 articles a day, it is a very big job just to keep up. A chart of this progress is available here.
Connection
Connections between Freebase and Wikipedia are stored in keys. To see the Wikipedia articles for Barack Obama:
[{
  "id": "/en/barack_obama",
  "key": [{
    "namespace": "/wikipedia/en",
    "value": null
  }]
}]
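For readers working from a script, the same query can be built as a plain data structure, and the key values read back out of a result. This is a sketch only: it makes no network call, the helper `wikipedia_keys` is hypothetical, and `sample_result` is made-up data that simply mirrors the shape of the query, not real output.

```python
import json

# The query from the text, as Python data (null becomes None).
query = [{
    "id": "/en/barack_obama",
    "key": [{"namespace": "/wikipedia/en", "value": None}],
}]
print(json.dumps(query, indent=2))


def wikipedia_keys(result):
    """Collect the /wikipedia/en key values from a result shaped like the query."""
    return [
        key["value"]
        for topic in result
        for key in topic.get("key", [])
        if key.get("namespace") == "/wikipedia/en"
    ]


# Hypothetical result, for illustration only.
sample_result = [{
    "id": "/en/barack_obama",
    "key": [
        {"namespace": "/wikipedia/en", "value": "Barack_Obama"},
        {"namespace": "/wikipedia/en", "value": "Obama"},
    ],
}]
print(wikipedia_keys(sample_result))
```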
You can navigate between Freebase and Wikipedia using this Greasemonkey script, this Chrome extension, or this bookmarklet:
javascript:window.location="http://ubiquity.freebaseapps.com/gotofreebase?uri="+window.location;
Data-mining
Type-category relationships
A Freebase type is very similar to a Wikipedia category, as it collects shared things together.
Categories and types often correlate very highly. Categories like 'Category:1923 deaths' are extremely strong evidence that someone is a '/people/deceased_person'. However, if you look at Category:American Idol, you'll find that many of the topics linked, like "Canadian Idol" or "Malaysian Idol", are television programs, but by no means all of them. One topic is actually a book written by an "Idol" judge, so it is not a TV program.
The difference is that Freebase types are 'semantic', or more structured than Wikipedia categories.
Freebase has a learning application, detailed here, that makes inferences based on these relationships. If a Wikipedia category reaches a high confidence of accuracy based on human votes in Typewriter, the application starts asserting the corresponding type automatically rather than seeking human confirmation.
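The vote-based rule can be sketched as a simple threshold on the share of "yes" votes. The text does not give the actual confidence measure or cut-offs, so the formula, the `min_votes` and `threshold` values, and the function name `decide` below are all assumptions chosen for illustration.

```python
def decide(yes_votes, no_votes, min_votes=5, threshold=0.9):
    """Decide whether a category-to-type mapping can be asserted automatically.

    min_votes and threshold are illustrative values, not Freebase's real settings.
    """
    total = yes_votes + no_votes
    if total < min_votes:
        # Too little evidence so far: keep asking people.
        return "ask_human"
    confidence = yes_votes / total
    return "auto_assert" if confidence >= threshold else "ask_human"


# A category like 'Category:1923 deaths' would be expected to collect mostly yes votes.
print(decide(yes_votes=19, no_votes=1))
# A mixed category like Category:American Idol would not pass the threshold.
print(decide(yes_votes=6, no_votes=4))
```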
Fat cat is an Acre app used to manually add Freebase data to members of Wikipedia categories.
Tables
Two projects to enable easy importing from HTML tables to Freebase are under development: one from Washington and one from Princeton.
TableTools sorts, filters or copies any HTML table.
Outwit also parses HTML tables.
Lists
This Acre app parses the first link on each line in a Wikipedia bulleted list, and converts it to a list of Freebase IDs.
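The first step, pulling the first link out of each bullet, might look like the sketch below. It is not the Acre app's code: it assumes the input is raw wikitext with `*` bullets and `[[...]]` links, and it stops at article titles. Turning those titles into Freebase IDs would need a separate lookup against the /wikipedia/en keys, which is not shown.

```python
import re

# Matches the target of a wiki link, stopping at "]", "|" (label) or "#" (section).
FIRST_LINK = re.compile(r"\[\[([^\]|#]+)")


def first_links(wikitext):
    """Return the title of the first link on each bulleted line of wikitext."""
    titles = []
    for line in wikitext.splitlines():
        if not line.lstrip().startswith("*"):
            continue  # not a bullet
        match = FIRST_LINK.search(line)
        if match:
            titles.append(match.group(1).strip().replace(" ", "_"))
    return titles


# Made-up sample list.
sample = """\
* [[Canadian Idol]], a [[television program]]
* [[Malaysian Idol|Malaysia's version]]
Not a bullet: [[Ignored]]
* no link here
"""
print(first_links(sample))
```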
Infoboxes
The Big Friendly Graph has parsed Wikipedia templates and infoboxes, but the (often unstandardized) data needs to be imported by hand.
Link structure
aliases.freebaseapps.com is an Acre app for finding aliases in Wikipedia redirects.
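The idea behind mining redirects for aliases can be sketched as grouping redirect titles by the article they point to. This is an illustration of the general approach, not the app's implementation; the function name, the input shape (a mapping from redirect title to target title) and the sample data are hypothetical.

```python
from collections import defaultdict


def aliases_from_redirects(redirects):
    """Group redirect titles by target article and present them as alias candidates.

    redirects: dict mapping redirect title -> target article title (hypothetical shape).
    """
    aliases = defaultdict(list)
    for source, target in redirects.items():
        if source != target:
            # Underscores in titles stand for spaces in the displayed name.
            aliases[target].append(source.replace("_", " "))
    return {target: sorted(names) for target, names in aliases.items()}


# Made-up sample data.
redirects = {
    "Obama": "Barack_Obama",
    "Barack_Hussein_Obama": "Barack_Obama",
}
print(aliases_from_redirects(redirects))
```

Not every redirect is a good alias (misspellings and related-topic redirects also exist), so candidates like these would still need review.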
Natural language
Natural language processing of Wikipedia text is especially direct, because Wikipedia links are unambiguous. See Natural language processing.