WEX

Revision as of 00:41, 7 July 2010 by Viral

The Freebase Wikipedia Extraction (WEX) is a processed dump of the English-language Wikipedia. The wiki markup for each article is transformed into machine-readable XML, and common relational features such as templates, infoboxes, categories, article sections, and redirects are extracted in tabular form. Freebase WEX is provided as a set of database tables in TSV format for PostgreSQL, along with tables that map Wikipedia articles to Freebase topics and to their corresponding Freebase types.

Freebase WEX is provided free of charge for any purpose with regular updates by Metaweb Technologies. It is distributed, like Wikipedia itself, under the terms of version 1.2 of the GNU Free Documentation License or any later version published by the Free Software Foundation.

Citing

If you'd like to cite WEX in a publication, you may use:

  • Metaweb Technologies, Freebase Wikipedia Extraction (WEX), http://download.freebase.com/wex/, <month> <day>, <year>

Or as BibTeX:

 
@misc{metaweb:wex,
  title = "Freebase Wikipedia Extraction (WEX)",
  author = "Metaweb Technologies",
  howpublished = "\url{http://download.freebase.com/wex/}",
  edition = "<month> <day>, <year>",
  year = "<year>"
}

Related Work

  • DBpedia, "a community effort to extract structured information from Wikipedia and to make this information available on the Web", http://dbpedia.org/
  • Hugo Zaragoza, Jordi Atserias, Massimiliano Ciaramita and Giuseppe Attardi (Yahoo! Research Barcelona), Semantically Annotated Snapshot of the English Wikipedia, http://www.yr-bcn.es/semanticWikipedia, 2007.

Setup

Freebase WEX data files are provided in TSV format suitable for bulk loading into PostgreSQL (version 8.3). Other database servers such as MySQL should also work, but they require a different setup procedure from the one presented here and are not currently tested (let us know if you use Freebase WEX on such a platform). The following procedure assumes you have already installed PostgreSQL and are running a UNIX-like operating system such as Linux.

Download the latest Freebase WEX archive:

wget http://download.freebase.com/wex/2010-06-20/freebase-wex-2010-06-20.tar

Extract the archive:

tar xvf freebase-wex-2010-06-20.tar
cd freebase-wex-2010-06-20

The archive contains 9 bzip2-compressed TSV data files, one for each of the Freebase WEX tables:

  • freebase-wex-2010-06-20-articles.tsv.bz2
  • freebase-wex-2010-06-20-redirects.tsv.bz2
  • freebase-wex-2010-06-20-sections.tsv.bz2
  • freebase-wex-2010-06-20-template_calls.tsv.bz2
  • freebase-wex-2010-06-20-template_values.tsv.bz2
  • freebase-wex-2010-06-20-category_members.tsv.bz2
  • freebase-wex-2010-06-20-freebase_wpid.tsv.bz2
  • freebase-wex-2010-06-20-freebase_names.tsv.bz2
  • freebase-wex-2010-06-20-freebase_types.tsv.bz2

And 6 files to facilitate the setup procedure:

  • readme.html (this document)
  • wexsetup.py
  • tables.sql
  • constraints.sql
  • indexes.sql
  • checksum.md5

Decompress the TSV files:

bunzip2 --show-progress *.tsv.bz2

Verify the MD5 checksum of the TSV files:

md5sum --check checksum.md5
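
On systems without md5sum, the same check can be approximated in Python. This is a minimal sketch, assuming checksum.md5 follows the usual md5sum layout of one "<hex digest>  <filename>" entry per line; the function names are illustrative and not part of the WEX distribution.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file):
    """Yield (filename, ok) for each entry of an md5sum-style checksum file.

    Assumption: each line is '<hex digest>  <filename>', with file names
    relative to the directory that holds the checksum file.
    """
    base = Path(checksum_file).parent
    for line in Path(checksum_file).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(None, 1)
        name = name.strip().lstrip("*")  # md5sum marks binary-mode entries with '*'
        yield name, md5_of_file(base / name) == expected.lower()
```

Calling list(verify_checksums("checksum.md5")) from the extracted directory should then report one (filename, True) pair per TSV file if the download is intact.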

The included wexsetup.py Python script generates a SQL script that loads Freebase WEX into a PostgreSQL server and optimizes it. By default, the script looks for TSV files in the current working directory and loads them into the public schema of the selected database. Use the --xml option to take advantage of the XML parsing features introduced in PostgreSQL 8.3:

usage: wexsetup.py [options]

options:
  -h, --help       show this help message and exit
  --schema=SCHEMA  PostgreSQL schema [public]
  --path=PATH      path to WEX TSV files [.]
  --xml            use the 'xml' datatype introduced in PostgreSQL 8.3 instead
                   of 'text' for XML content

Note that the Freebase WEX TSV files must be present on the same filesystem as the database server, since they will be loaded with the PostgreSQL COPY command, designed for bulk importing from local files. Save the output of this script to a file:

./wexsetup.py > wexsetup.sql

The output of this script will initialize the database schema, COPY all data, build constraints and indexes, and ANALYZE all tables. If you need to modify this procedure in any way (for instance, to skip the loading of unwanted tables), edit this output file accordingly.
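
To inspect a few rows before loading, the TSV files can also be read outside the database. The sketch below assumes the files use PostgreSQL's default COPY text format (tab-separated fields, \N for NULL, backslash escapes), which seems likely given that they are loaded with COPY but is not stated explicitly here. The helper names are hypothetical, and only the common escapes are handled.

```python
import bz2

# Common backslash escapes of PostgreSQL's COPY text format (octal and hex
# escapes are not handled in this sketch).
_ESCAPES = {"b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t", "v": "\v", "\\": "\\"}


def unescape_field(field):
    """Decode one COPY text-format field; the marker \\N stands for NULL."""
    if field == "\\N":
        return None
    out = []
    chars = iter(field)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, "")
            out.append(_ESCAPES.get(nxt, nxt))  # other escaped characters stand for themselves
        else:
            out.append(ch)
    return "".join(out)


def parse_copy_line(line):
    """Split one line of COPY text data into a list of Python values."""
    return [unescape_field(f) for f in line.rstrip("\n").split("\t")]


def iter_rows(path):
    """Stream rows from a .tsv or .tsv.bz2 file without loading it into memory."""
    opener = bz2.open if str(path).endswith(".bz2") else open
    with opener(path, "rt", encoding="utf-8", newline="\n") as handle:
        for line in handle:
            yield parse_copy_line(line)
```

Because iter_rows accepts .bz2 files directly, a quick look at the first rows does not require decompressing the whole archive first.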

If you haven't already, create a database for Freebase WEX with UTF-8 encoding using createdb, adding any connection or authentication parameters as needed:

createdb --encoding UTF-8 wex

Note that the generated SQL script will not DROP tables if they already exist, so be sure to do so first if you are updating from an earlier version of Freebase WEX.

Finally, execute the SQL script using psql, adding any connection or authentication parameters as needed:

psql --echo-queries --dbname wex --file wexsetup.sql

Example Usage

Get the Freebase WEX XML for a particular article:

SELECT xml FROM articles WHERE name='Abraham Lincoln'

Get the Freebase GUID for a particular article (a redirect in this case):

SELECT guid FROM freebase_names WHERE name = 'Honest Abe'

Get the categories of a particular article:

SELECT category_name FROM category_members
INNER JOIN articles ON wpid = article_wpid
WHERE name = 'Abraham Lincoln'

Get a particular template parameter for a particular article:

SELECT template_values.xml FROM template_values
INNER JOIN template_calls ON call_id = template_calls.id
INNER JOIN articles ON articles.wpid = article_wpid
WHERE template_article_name = 'Template:Infobox Officeholder'
  AND template_values.name = 'spouse'
  AND articles.name = 'Abraham Lincoln'
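
The join above can be tried without a full WEX load. The following sketch runs the same query against an in-memory SQLite database from Python; SQLite stands in for PostgreSQL here, the column types are guesses based on the table descriptions in this document, and the inserted rows are made-up placeholders rather than real WEX data.

```python
import sqlite3

# In-memory stand-in for the PostgreSQL database; column types are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE articles (wpid INTEGER, name TEXT, updated TEXT, xml TEXT);
    CREATE TABLE template_calls (
        article_wpid INTEGER, id INTEGER, section_id INTEGER,
        template_article_name TEXT
    );
    CREATE TABLE template_values (call_id INTEGER, name TEXT, xml TEXT);
    """
)

# Made-up rows: the ids and value strings are placeholders, not real WEX data.
conn.execute("INSERT INTO articles VALUES (1, 'Abraham Lincoln', NULL, '<article/>')")
conn.execute(
    "INSERT INTO template_calls VALUES (1, 10, NULL, 'Template:Infobox Officeholder')"
)
conn.executemany(
    "INSERT INTO template_values VALUES (?, ?, ?)",
    [
        (10, "spouse", "placeholder spouse value"),
        (10, "nationality", "placeholder nationality value"),
    ],
)

# The query from the example above, unchanged.
QUERY = """
SELECT template_values.xml FROM template_values
INNER JOIN template_calls ON call_id = template_calls.id
INNER JOIN articles ON articles.wpid = article_wpid
WHERE template_article_name = 'Template:Infobox Officeholder'
  AND template_values.name = 'spouse'
  AND articles.name = 'Abraham Lincoln'
"""

rows = conn.execute(QUERY).fetchall()
print(rows)
```

Only the row whose parameter name is spouse comes back, which shows how template_values is reached from an article through template_calls.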

Get the outgoing links from a particular article as an array (requires PostgreSQL 8.3 and an xml column of the xml datatype, as created with the --xml option):

SELECT xpath('//link//target/text()', xml)
FROM articles WHERE name = 'Abraham Lincoln'

Tables

Foreign keys are indicated with (→table.column).

articles

The articles table contains all non-redirect articles in the Main, Talk, Image, Category, and Template namespaces. The columns are:

  • wpid - The page_id from the MediaWiki database. This id tends to stay the same for each article over time, but sometimes it changes if a page is moved.
  • name - The article name. The name includes the article namespace as a prefix with a colon if the article isn't a namespace-0 article. The name of an article often changes over time, and sometimes gets swapped with redirects for the article.
  • updated - The last update to the article. The Freebase WEX output only has the most recent version of each article.
  • xml - The Wiki XML for the article.

redirects

The redirects table contains all redirect articles, and maps them to the articles table. There are many redirects that point to redirects in Wikipedia, and this table collapses these redirect chains. The columns are:

  • wpid - The page_id of the redirect article.
  • name - The name of the redirect article.
  • redirects_to - The name of the collapsed redirect target article (→articles.name).
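
Collapsing a redirect chain means following each redirect until a non-redirect article is reached. The sketch below illustrates the idea on a plain dictionary; it is not the code used to build this table, and dropping redirect cycles is an assumption made here for safety.

```python
def collapse_redirects(redirects):
    """Map every redirect name to its final non-redirect target.

    `redirects` maps a redirect name to the name it points at, which may
    itself be a redirect. Redirects caught in a cycle are left out.
    """
    collapsed = {}
    for name in redirects:
        seen = {name}
        target = redirects[name]
        while target in redirects:
            if target in seen:  # cycle detected, no final target exists
                target = None
                break
            seen.add(target)
            target = redirects[target]
        if target is not None:
            collapsed[name] = target
    return collapsed


# Illustrative chain: 'Honest Abe' -> 'Abe Lincoln' -> 'Abraham Lincoln'.
example = {
    "Honest Abe": "Abe Lincoln",
    "Abe Lincoln": "Abraham Lincoln",
}
print(collapse_redirects(example))
```

After collapsing, both redirect names point straight at Abraham Lincoln, which is the form stored in redirects_to.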

category_members

The category_members table contains the category membership of each article. The columns are:

  • article_wpid - The wpid of the article (→articles.wpid).
  • category_name - The name of the category (→articles.name).

sections

The sections table contains the sections of each article, including the section name, the parent section, and the markup contained within the section. The first section of each article doesn't have an explicit name, so it is called the "summary" section in this table. The XML for the section is enclosed in top-level <section> tags. The columns are:

  • id - A unique id for the row.
  • parent_id - The id of the parent section that contains this section. NULL if it is a top-level section (→sections.id).
  • ordinal - The depth of this section.
  • article_wpid - The wpid of the article (→articles.wpid).
  • name - The name of the section.
  • xml - The XML for the section.
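
Because each row points to its parent through parent_id, the sections of one article can be reassembled into a nested outline. This is a minimal sketch that assumes the rows for a single article have already been fetched as (id, parent_id, name) tuples in document order; the function name and the sample rows are illustrative.

```python
from collections import defaultdict


def build_outline(rows):
    """Turn (id, parent_id, name) rows into a nested list of (name, children)."""
    children = defaultdict(list)
    for section_id, parent_id, name in rows:
        children[parent_id].append((section_id, name))

    def subtree(parent_id):
        # Recurse into every section whose parent is `parent_id`.
        return [
            (name, subtree(section_id))
            for section_id, name in children.get(parent_id, [])
        ]

    return subtree(None)  # top-level sections have a NULL (None) parent


# Made-up rows for illustration; real ids come from the sections table.
sample_rows = [
    (1, None, "summary"),
    (2, None, "Early life"),
    (3, 2, "Marriage"),
    (4, None, "Legacy"),
]
print(build_outline(sample_rows))
```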

template_calls

The template_calls table captures template and infobox data from Wikipedia. Each template or infobox is defined in its own article, and then referenced in calls in other articles. Each template call includes a set of named parameters to the template, which constitutes useful semi-structured data. The columns of the table are:

  • article_wpid - The article that the template appeared in (→articles.wpid).
  • id - A unique id for the row (the same template can appear multiple times in an article).
  • section_id - The section id in which the template appeared (→sections.id).
  • template_article_name - The name of the template article referenced (→articles.name).

template_values

The template_values table captures the parameters referenced in all template calls. The columns are:

  • call_id - The id of the template call in which this value appears (→template_calls.id).
  • name - The name of the parameter.
  • xml - The original XML value.

freebase_wpid

The freebase_wpid table provides a mapping between Wikipedia numeric article/redirect IDs and Freebase GUIDs (Global Unique IDs).

  • guid - The GUID of this article in Freebase.
  • wpid - The page_id of this article from the MediaWiki database.

freebase_names

The freebase_names table provides a mapping between Wikipedia article/redirect names and Freebase GUIDs.

  • guid - The GUID of this article in Freebase.
  • name - The name of a Wikipedia article, or any of its redirects.

freebase_types

The freebase_types table contains the Freebase types of every Wikipedia article in Freebase.

  • guid - The GUID of this article in Freebase.
  • type - The Freebase type.

Freebase WEX XML

The Freebase WEX XML is generated by a MediaWiki markup parser written by Magnus Manske, one of the original authors of MediaWiki. It transforms MediaWiki markup into a machine-parsable structure. Sample Freebase WEX XML output for the article on Abraham Lincoln, pretty-printed, is available in wex-sample-abraham-lincoln.xml. The tags are described below:

Links

Wikipedia links are denoted with the link tag. If the link points to an external URL, the attribute type='external' is used, along with an href attribute. If the link points to a Wiki page, the target is contained within a target tag. If there is alternative display text, this is contained in a part tag.

An external link:

<link type='external' 
 href='http://www.loc.gov/rr/program/bib/prespoetry/al.html'>
 Poetry written by Abraham Lincoln</link>

A simple internal link:

<link><target>log cabin</target></link>

An internal link with alternative display text:

<link><target>Perry County, Indiana</target><part>Perry County</part></link>
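
The three link forms above can be told apart programmatically. The following sketch uses Python's standard xml.etree.ElementTree; the wrapping fragment element is an assumption added only to give the sample a single root, and image links (described further below) are not treated specially.

```python
import xml.etree.ElementTree as ET


def _clean(text):
    """Collapse runs of whitespace left over from pretty-printing."""
    return " ".join(text.split())


def describe_links(fragment_xml):
    """Return one dict per <link> element found in a WEX XML fragment."""
    root = ET.fromstring(fragment_xml)
    links = []
    for link in root.iter("link"):
        if link.get("type") == "external":
            # External links carry the URL in href and the label as text.
            label = _clean("".join(link.itertext()))
            links.append({"kind": "external", "target": link.get("href"), "text": label})
            continue
        target = _clean(link.findtext("target") or "")
        part = link.find("part")
        # Without a <part>, the displayed text is the target itself.
        text = _clean("".join(part.itertext())) if part is not None else target
        links.append({"kind": "internal", "target": target, "text": text})
    return links


SAMPLE = """<fragment>
<link type='external'
 href='http://www.loc.gov/rr/program/bib/prespoetry/al.html'>
 Poetry written by Abraham Lincoln</link>
<link><target>log cabin</target></link>
<link><target>Perry County, Indiana</target><part>Perry County</part></link>
</fragment>"""

for entry in describe_links(SAMPLE):
    print(entry)
```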

Synthetic Links

Wikipedia has a general policy of hyperlinking a subject only once per article. To allow for more complex link analysis, text that appears to correspond to an existing link has been re-created as a link. The link tags for these synthetic links have the attribute synthetic="true".
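
One way to use this attribute is to count how often each subject is mentioned, keeping editor-made links apart from synthetic ones. The sketch below assumes a synthetic link has the same target child as an ordinary internal link, which the text above implies but does not show; the sample markup is invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_link_mentions(article_xml):
    """Count internal link targets, split into original and synthetic links."""
    original, synthetic = Counter(), Counter()
    for link in ET.fromstring(article_xml).iter("link"):
        target = link.findtext("target")
        if not target:
            continue  # external links have no <target> child
        bucket = synthetic if link.get("synthetic") == "true" else original
        bucket[target.strip()] += 1
    return original, synthetic


# Invented sample: one original link and two synthetic repeats.
SYNTHETIC_SAMPLE = (
    "<article>"
    "<link><target>log cabin</target></link> text "
    '<link synthetic="true"><target>log cabin</target></link> more text '
    '<link synthetic="true"><target>log cabin</target></link>'
    "</article>"
)

original_counts, synthetic_counts = count_link_mentions(SYNTHETIC_SAMPLE)
print(original_counts, synthetic_counts)
```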

Sentences

A simple sentence-detection algorithm has been applied that marks complete sentences using the sentence tag.

Images

Wikipedia images are an extension of the link tag, where the target is in the Image: namespace, and image display parameters are passed in as a series of part tags:

<link>
  <target>Image:Piatt and DeWitt County Lincoln marker wide.jpg</target>
  <part>thumb</part>
  <part>In the 1920s historical markers were placed at the county lines along
   the route Lincoln traveled in the eight judicial district. This example is on the
   border of <space/>

   <link><target>Piatt County, Illinois</target><part>Piatt</part></link>
   <space/>and<space/>
   <link><target>DeWitt County, Illinois</target><part>DeWitt counties</part></link>

  </part>
</link>

Headings

Wikipedia headings are denoted with the heading tag. The depth is denoted with the level attribute. This is a level 2 heading:

<heading level='2'>Lincoln 1809 to 1854</heading>

Character Formatting

Bold formatting is denoted with the bold tag. Italics are denoted with the italics tag. Non-breaking spaces are denoted by the space tag.
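
For plain-text processing, these formatting tags usually need to be flattened. The sketch below walks a fragment with xml.etree.ElementTree, emits an ordinary space for each space tag (a simplification, since the tag denotes a non-breaking space), and shows the display text of internal links. The function name is hypothetical, and image links are not handled specially.

```python
import xml.etree.ElementTree as ET


def to_plain_text(fragment_xml):
    """Flatten a WEX XML fragment to text, turning <space/> into ' '."""
    pieces = []

    def walk(elem):
        if elem.tag == "space":
            pieces.append(" ")
        if elem.tag == "link" and elem.get("type") != "external":
            # Show the alternative text if present, else the target name.
            part = elem.find("part")
            shown = part if part is not None else elem.find("target")
            if shown is not None:
                walk(shown)
            return
        if elem.text:
            pieces.append(elem.text)
        for child in elem:
            walk(child)
            if child.tail:  # text that follows the child element
                pieces.append(child.tail)

    walk(ET.fromstring(fragment_xml))
    return "".join(pieces)


FORMATTED = (
    "<sentence><bold>Abraham Lincoln</bold><space/>was born in a<space/>"
    "<link><target>log cabin</target></link><space/>in<space/>"
    "<link><target>Perry County, Indiana</target><part>Perry County</part></link>."
    "</sentence>"
)
print(to_plain_text(FORMATTED))
```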

Lists

Lists are denoted with the list tag, and individual items with the listitem tag. The type attribute on the list tag indicates the list type: bullet for bulleted lists, numbered for numbered lists, and ident for indented blocks. This is a bulleted list:

<list type='bullet'>
  <listitem>
   Robert Emmet Sherwood;<space/>
   <italics>Abe Lincoln in Illinois: A Play in Twelve Scenes</italics>
   <space/>(1939)<space/><link type='external' 
    href='http://www.questia.com/PM.qst?a=o&amp;d=179725'>online version</link><space/>

  </listitem>
  <listitem>
   <link><target>Gore Vidal</target></link>.<space/>
   <italics>Lincoln</italics><space/>ISBN 0-375-70876-6, a novel.
  </listitem>

</list>

Templates

Wikipedia templates and infoboxes are described using the template tag, with the name of the template in the name attribute. Parameters are contained within param tags, with their names in the name attribute. Note that this data is also available in relational form via the template_calls table. This is the abbreviated infobox for Abraham Lincoln:

<template name="Infobox_President">
  <param name="name">Abraham Lincoln</param>
  <param name="nationality">American</param>

  <param name="image">Abraham Lincoln head on shoulders photo portrait.jpg</param>
  <param name="order">
   16th<space/><link><target>President of the United States</target></link>

  </param>
  <param name="term_start">
   <link><target>March 4</target></link>,<space/><link><target>1861</target></link>

  </param>
  <param name="term_end">
   <link><target>April 15</target></link>,<space/><link><target>1865</target></link>

  </param>
  <param name="predecessor">
   <link><target>James Buchanan</target></link>
  </param>

  <param name="successor">
   <link><target>Andrew Johnson</target></link>
  </param>
 ....
</template>
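
A template element like the one above maps naturally onto a dictionary of parameter names and values. The following is a rough sketch using xml.etree.ElementTree; it flattens each parameter to text, so a link that has both a target and a part would contribute both strings. The function name is illustrative.

```python
import xml.etree.ElementTree as ET


def template_params(template_xml):
    """Return (template name, {param name: flattened text}) for a <template>."""
    root = ET.fromstring(template_xml)
    template = root if root.tag == "template" else root.find(".//template")
    if template is None:
        return None, {}
    params = {}
    for param in template.findall("param"):
        for space in param.iter("space"):
            space.text = " "  # make <space/> show up in itertext()
        text = "".join(param.itertext())
        params[param.get("name")] = " ".join(text.split())
    return template.get("name"), params


INFOBOX = """<template name="Infobox_President">
  <param name="name">Abraham Lincoln</param>
  <param name="order">
   16th<space/><link><target>President of the United States</target></link>
  </param>
  <param name="term_start">
   <link><target>March 4</target></link>,<space/><link><target>1861</target></link>
  </param>
</template>"""

name, params = template_params(INFOBOX)
print(name, params)
```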

Extensions

Wikipedia extensions, such as ref tags, are denoted with the extension tag. Here's an example of a ref extension:

Lincoln countered that he was "not in favor of bringing about in any way the social and
political equality of the white and black races."
<extension extension_name='ref'>
  <link type='external' href='http://www.nps.gov/liho/debate4.htm'>
   Fourth Debate with Stephen A. Douglas at Charleston, Illinois
  </link>, September 18, 1858
</extension>
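
When extracting running text, footnotes of this kind are often better kept apart from the sentence they annotate. The sketch below separates ref extensions from the surrounding text using xml.etree.ElementTree; the paragraph wrapper element and the function name are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def split_references(fragment_xml):
    """Return (body text, list of reference texts) for a WEX XML fragment."""
    body, refs = [], []

    def walk(elem):
        if elem.tag == "extension" and elem.get("extension_name") == "ref":
            # Collect the footnote text instead of adding it to the body.
            refs.append(" ".join("".join(elem.itertext()).split()))
            return
        if elem.text:
            body.append(elem.text)
        for child in elem:
            walk(child)
            if child.tail:
                body.append(child.tail)

    walk(ET.fromstring(fragment_xml))
    return " ".join("".join(body).split()), refs


REF_SAMPLE = """<paragraph>Lincoln countered that he was "not in favor of bringing
about in any way the social and political equality of the white and black races."
<extension extension_name='ref'>
  <link type='external' href='http://www.nps.gov/liho/debate4.htm'>
   Fourth Debate with Stephen A. Douglas at Charleston, Illinois
  </link>, September 18, 1858
</extension></paragraph>"""

text, references = split_references(REF_SAMPLE)
print(text)
print(references)
```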