WEX
The Freebase Wikipedia Extraction (WEX) is a processed dump of the English language Wikipedia. The wiki markup for each article is transformed into machine-readable XML, and common relational features such as templates, infoboxes, categories, article sections, and redirects are extracted in tabular form. Freebase WEX is provided as a set of database tables in TSV format for PostgreSQL, along with tables providing mappings between Wikipedia articles and Freebase topics, and corresponding Freebase Types.
Freebase WEX is provided free of charge for any purpose with regular updates by Metaweb Technologies. It is distributed, like Wikipedia itself, under the terms of version 1.2 of the GNU Free Documentation License or any later version published by the Free Software Foundation.
Citing
If you'd like to cite WEX in a publication, you may use:
Metaweb Technologies, Freebase Wikipedia Extraction (WEX), http://download.freebase.com/wex/, <month> <day>, <year>
Or as BibTeX:
@misc{metaweb:wex,
title = "Freebase Wikipedia Extraction (WEX)",
author = "Metaweb Technologies",
howpublished = "\url{http://download.freebase.com/wex/}",
edition = "<month> <day>, <year>",
year = "<year>"
}
Related Work
- DBpedia, "a community effort to extract structured information from Wikipedia and to make this information available on the Web", http://dbpedia.org/
- Hugo Zaragoza, Jordi Atserias, Massimiliano Ciaramita and Giuseppe Attardi, (Yahoo! Research Barcelona), Semantically Annotated Snapshot of the English Wikipedia, http://www.yr-bcn.es/semanticWikipedia, 2007.
Setup
Freebase WEX data files are provided in TSV format suitable for bulk loading into PostgreSQL (version 8.3). Other database servers such as MySQL should also work, but they require a different setup procedure from the one presented here and are not currently tested (let us know if you use Freebase WEX on such a platform). The following procedure assumes that you have already installed PostgreSQL and are running a UNIX-like operating system such as Linux.
Download the latest Freebase WEX archive:
wget http://download.freebase.com/wex/YYYY-MM-DD/freebase-wex-YYYY-MM-DD.tar
Extract the archive:
tar xvf freebase-wex-YYYY-MM-DD.tar
cd freebase-wex-YYYY-MM-DD
The archive contains 9 bzip2-compressed TSV data files, one for each of the Freebase WEX tables:
- freebase-wex-YYYY-MM-DD-articles.tsv.bz2
- freebase-wex-YYYY-MM-DD-redirects.tsv.bz2
- freebase-wex-YYYY-MM-DD-sections.tsv.bz2
- freebase-wex-YYYY-MM-DD-template_calls.tsv.bz2
- freebase-wex-YYYY-MM-DD-template_values.tsv.bz2
- freebase-wex-YYYY-MM-DD-category_members.tsv.bz2
- freebase-wex-YYYY-MM-DD-freebase_wpid.tsv.bz2
- freebase-wex-YYYY-MM-DD-freebase_names.tsv.bz2
- freebase-wex-YYYY-MM-DD-freebase_types.tsv.bz2
And 6 files to facilitate the setup procedure:
- readme.html (this document)
- wexsetup.py
- tables.sql
- constraints.sql
- indexes.sql
- checksum.md5
Decompress the TSV files:
bunzip2 --verbose *.tsv.bz2
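If you only want to inspect a file before committing disk space to the decompressed data, the compressed TSV can also be streamed directly. The following is a minimal Python sketch and is not part of the WEX distribution. The function name peek_tsv is hypothetical, and the sketch assumes that each row is one newline-terminated line of tab-separated fields.

```python
import bz2
from itertools import islice


def peek_tsv(path, limit=5):
    """Return the first `limit` rows of a bzip2-compressed TSV file.

    Assumption: one row per line, fields separated by tabs. Any escape
    sequences written by the exporter are left untouched.
    """
    rows = []
    # bz2.open in text mode decompresses lazily, so only the start of
    # the file is actually read.
    with bz2.open(path, mode="rt", encoding="utf-8", newline="\n") as handle:
        for line in islice(handle, limit):
            rows.append(line.rstrip("\n").split("\t"))
    return rows
```

Calling peek_tsv on one of the .tsv.bz2 files before running bunzip2 gives a quick look at the column layout without writing anything to disk.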
Verify the MD5 checksum of the TSV files:
md5sum --check checksum.md5
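On systems without md5sum, the same check can be approximated in Python. This is a sketch under the assumption that checksum.md5 uses the usual md5sum line format (a hex digest, whitespace, then a filename); the function names md5_of and verify_checksums are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file):
    """Return {filename: True/False} for each entry in an md5sum-style file.

    Assumption: each line is "<hex digest>  <filename>". Filenames are
    resolved relative to the directory holding the checksum file.
    """
    base = Path(checksum_file).parent
    results = {}
    for line in Path(checksum_file).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, name = line.split(None, 1)
        name = name.lstrip("*")  # md5sum marks binary mode with '*'
        results[name] = md5_of(base / name) == expected.lower()
    return results
```

Any entry mapped to False indicates a file that should be downloaded again before loading.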
The included wexsetup.py Python script generates a SQL script that loads Freebase WEX into a PostgreSQL server and optimizes it. By default, the script looks for TSV files in the current working directory and loads them into the public schema of the selected database. Use the --xml option to take advantage of the XML parsing features introduced in PostgreSQL 8.3:
usage: wexsetup.py [options]
options:
-h, --help show this help message and exit
--schema=SCHEMA PostgreSQL schema [public]
--path=PATH path to WEX TSV files [.]
--xml use the 'xml' datatype introduced in PostgreSQL 8.3 instead
of 'text' for XML content
Note that the Freebase WEX TSV files must be present on the same filesystem as the database server, since they will be loaded with the PostgreSQL COPY command, designed for bulk importing from local files. Save the output of this script to a file:
./wexsetup.py > wexsetup.sql
The output of this script will initialize the database schema, COPY all data, build constraints and indexes, and ANALYZE all tables. If you need to modify this procedure in any way (for instance, to skip the loading of unwanted tables), edit this output file accordingly.
If you haven't already, create a database for Freebase WEX with UTF-8 encoding using createdb, adding any connection or authentication parameters as needed:
createdb --encoding UTF-8 wex
Note that the generated SQL script will not DROP tables if they already exist, so be sure to drop them first if you are updating from an earlier version of Freebase WEX.
Finally, execute the SQL script using psql, adding any connection or authentication parameters as needed:
psql --echo-queries --dbname wex --file wexsetup.sql
Example Usage
Get the Freebase WEX XML for a particular article:
SELECT xml FROM articles WHERE name='Abraham Lincoln'
Get the Freebase GUID for a particular article (a redirect in this case):
SELECT guid FROM freebase_names WHERE name = 'Honest Abe'
Get the categories of a particular article:
SELECT category_name FROM category_members INNER JOIN articles ON wpid = article_wpid WHERE name = 'Abraham Lincoln'
Get a particular template parameter for a particular article:
SELECT template_values.xml
FROM template_values
INNER JOIN template_calls ON call_id = template_calls.id
INNER JOIN articles ON articles.wpid = article_wpid
WHERE template_article_name = 'Template:Infobox Officeholder'
AND template_values.name = 'spouse'
AND articles.name = 'Abraham Lincoln'
Get the outgoing links from a particular article as an array (requires PostgreSQL 8.3):
SELECT xpath('//link//target/text()', xml)
FROM articles WHERE name = 'Abraham Lincoln'
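Queries like these can also be issued from a program with bound parameters instead of string literals. The sketch below runs the category query against an in-memory SQLite database, which stands in for PostgreSQL only so that the example runs without a server; the two toy tables, the ids, and the function name categories_of are assumptions for illustration. A PostgreSQL driver would typically use a different placeholder syntax than the "?" shown here.

```python
import sqlite3


def categories_of(conn, article_name):
    """Return the category names of one article, sorted alphabetically."""
    cursor = conn.execute(
        "SELECT category_name FROM category_members "
        "INNER JOIN articles ON wpid = article_wpid "
        "WHERE name = ? ORDER BY category_name",
        (article_name,),
    )
    return [row[0] for row in cursor]


# Toy data for illustration only; the ids and category rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (wpid INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE category_members (article_wpid INTEGER, category_name TEXT)"
)
conn.execute("INSERT INTO articles VALUES (1, 'Abraham Lincoln')")
conn.executemany(
    "INSERT INTO category_members VALUES (?, ?)",
    [
        (1, "Category:Presidents of the United States"),
        (1, "Category:1809 births"),
    ],
)
print(categories_of(conn, "Abraham Lincoln"))
```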
Tables
Foreign keys are indicated with (→table.column).
articles
The articles table contains all non-redirect articles in the Main, Talk, Image, Category, and Template namespaces. The columns are:
- wpid: The page_id from the MediaWiki database. This id tends to stay the same for each article over time, but it sometimes changes if a page is moved.
- name: The article name. The name includes the article namespace as a prefix with a colon if the article isn't a namespace-0 article. The name of an article often changes over time, and sometimes gets swapped with redirects for the article.
- updated: The last update to the article. The Freebase WEX output only has the most recent version of each article.
- xml: The Wiki XML for the article.
redirects
The redirects table contains all redirect articles, and maps them to the articles table. There are many redirects that point to redirects in Wikipedia, and this table collapses these redirect chains. The columns are:
- wpid: The page_id of the redirect article.
- name: The name of the redirect article.
- redirects_to: The name of the collapsed redirect target article (→articles.name).
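To make "collapsing redirect chains" concrete, here is one way it could be done in Python. This is an illustrative sketch, not the code used to build WEX: the function name collapse_redirects is hypothetical, and dropping redirects that are caught in a cycle is an assumption, since the handling of cycles is not documented here.

```python
def collapse_redirects(redirects):
    """Map every redirect name to its final, non-redirect target.

    `redirects` maps a redirect name to the name it points at, which may
    itself be a redirect. Redirects caught in a cycle are omitted from
    the result (an assumption made for this sketch).
    """
    collapsed = {}
    for start in redirects:
        seen = {start}
        target = redirects[start]
        # Follow the chain until we leave the redirect table or loop.
        while target in redirects and target not in seen:
            seen.add(target)
            target = redirects[target]
        if target not in redirects:
            collapsed[start] = target
    return collapsed
```

With made-up input such as {"Honest Abe": "Abe Lincoln", "Abe Lincoln": "Abraham Lincoln"}, both names end up pointing at "Abraham Lincoln", which is the shape of data stored in redirects_to.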
category_members
The category_members table contains the category membership of each article. The columns are:
- article_wpid: The wpid of the article (→articles.wpid).
- category_name: The name of the category (→articles.name).
sections
The sections table contains the sections of each article, including the section name, the parent section, and the markup contained within the section. The first section of each article doesn't have an explicit name, so it is called the "summary" section in this table. The XML for the section is enclosed in top-level <section> tags. The columns are:
- id: A unique id for the row.
- parent_id: The id of the parent section that contains this section, or NULL if it is a top-level section (→sections.id).
- ordinal: The depth of this section.
- article_wpid: The wpid of the article (→articles.wpid).
- name: The name of the section.
- xml: The XML for the section.
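Because each row points at its parent through parent_id, the section hierarchy of an article can be rebuilt from a flat result set. The following Python sketch assumes the rows for a single article have already been fetched as (id, parent_id, name) tuples; the function name section_outline is hypothetical, and ordering siblings by id is an assumption about how ids were assigned.

```python
from collections import defaultdict


def section_outline(rows):
    """Return indented section names for one article.

    `rows` is an iterable of (id, parent_id, name) tuples taken from the
    sections table; parent_id is None for top-level sections. Siblings
    are ordered by id (an assumption).
    """
    children = defaultdict(list)
    for section_id, parent_id, name in rows:
        children[parent_id].append((section_id, name))

    lines = []

    def walk(parent_id, depth):
        for section_id, name in sorted(children.get(parent_id, [])):
            lines.append("  " * depth + name)
            walk(section_id, depth + 1)

    walk(None, 0)  # top-level sections have a NULL (None) parent
    return lines
```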
template_calls
The template_calls table captures template and infobox data from Wikipedia. Each template or infobox is defined in its own article, and then referenced in calls in other articles. Each template call includes a set of named parameters to the template, which constitutes useful semi-structured data. The columns of the table are:
- article_wpid: The article that the template appeared in (→articles.wpid).
- id: A unique id for the row (the same template can appear multiple times in an article).
- section_id: The section id in which the template appeared (→sections.id).
- template_article_name: The name of the template article referenced (→articles.name).
template_values
The template_values table captures the parameters referenced in all template calls. The columns are:
- call_id: The id of the template call in which this value appears (→template_calls.id).
- name: The name of the parameter.
- xml: The original XML value.
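Since every parameter is stored as its own row, it is often convenient to regroup the rows into one dictionary per template call. A minimal Python sketch follows; the function name params_by_call is hypothetical, and letting the last value win when a parameter name repeats within a call is a simplifying assumption.

```python
from collections import defaultdict


def params_by_call(value_rows):
    """Group template_values rows into one {name: xml} dict per call.

    `value_rows` is an iterable of (call_id, name, xml) tuples. If a
    parameter name repeats within a call, the last value wins (an
    assumption made for simplicity).
    """
    calls = defaultdict(dict)
    for call_id, name, xml in value_rows:
        calls[call_id][name] = xml
    return dict(calls)
```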
freebase_wpid
The freebase_wpid table provides a mapping between Wikipedia numeric article/redirect IDs and Freebase GUIDs (Global Unique IDs).
- guid: The GUID of this article in Freebase.
- wpid: The page_id of this article from the MediaWiki database.
freebase_names
The freebase_names table provides a mapping between Wikipedia article/redirect names and Freebase GUIDs.
- guid: The GUID of this article in Freebase.
- name: The name of a Wikipedia article, or any of its redirects.
freebase_types
The freebase_types table contains the Freebase types of every Wikipedia article in Freebase.
- guid: The GUID of this article in Freebase.
- type: The Freebase type.
Freebase WEX XML
The Freebase WEX XML is generated by a MediaWiki markup parser written by Magnus Manske, one of the original authors of MediaWiki. It transforms MediaWiki markup into a machine-parsable structure. You can find pretty-printed sample Freebase WEX XML output for the article on Abraham Lincoln in wex-sample-abraham-lincoln.xml. The tags are described below:
Links
Wikipedia links are denoted with the link tag. If the link points to an external URL, the attribute type='external' is used, along with an href attribute. If the link points to a Wiki page, the target is contained within a target tag. If there is alternative display text, this is contained in a part tag.
An external link:
<link type='external' href='http://www.loc.gov/rr/program/bib/prespoetry/al.html'> Poetry written by Abraham Lincoln</link>
A simple internal link:
<link><target>log cabin</target></link>
An internal link with alternative display text:
<link><target>Perry County, Indiana</target><part>Perry County</part></link>
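The three link shapes above can be told apart with a few attribute and child lookups. This Python sketch uses the standard xml.etree.ElementTree module; the function name describe_link and the shape of the returned dictionary are hypothetical choices. It treats the first part tag as the display text, which is a simplification that does not hold for image links.

```python
import xml.etree.ElementTree as ET


def describe_link(link):
    """Summarize a WEX <link> element as a small dict."""
    if link.get("type") == "external":
        # External links carry the URL in href and the label as text.
        return {
            "kind": "external",
            "href": link.get("href"),
            "text": "".join(link.itertext()).strip(),
        }
    target = link.findtext("target", default="")
    # The first <part>, if present, holds the alternative display text.
    part = link.find("part")
    text = "".join(part.itertext()) if part is not None else target
    return {"kind": "internal", "target": target, "text": text}


sample = ET.fromstring(
    "<link><target>Perry County, Indiana</target><part>Perry County</part></link>"
)
print(describe_link(sample))
```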
Synthetic Links
Wikipedia has a general policy of only hyperlinking a subject once per article. In order to allow for more complex link analysis, text that appears to correspond to an existing link has been re-created as a link. The link tags for these kinds of links have an attribute synthetic="true".
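For analyses that should count only the links an editor actually wrote, synthetic links can be filtered out by checking that attribute. The Python sketch below is illustrative: the function name link_targets is hypothetical, and the article root element used in the usage example is an assumed wrapper, not a documented WEX tag.

```python
import xml.etree.ElementTree as ET


def link_targets(article_xml, include_synthetic=False):
    """List internal link targets in document order.

    Links marked synthetic="true" are skipped unless include_synthetic
    is set. External links have no <target> child and are ignored.
    """
    root = ET.fromstring(article_xml)
    targets = []
    for link in root.iter("link"):
        if link.get("synthetic") == "true" and not include_synthetic:
            continue
        target = link.find("target")
        if target is not None and target.text:
            targets.append(target.text)
    return targets


# Hypothetical fragment; <article> is an assumed wrapper element.
doc = (
    "<article><link><target>log cabin</target></link> text "
    '<link synthetic="true"><target>log cabin</target></link></article>'
)
print(link_targets(doc))
print(link_targets(doc, include_synthetic=True))
```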
Sentences
A simple sentence algorithm has been applied that marks complete sentences using the sentence tag.
Images
Wikipedia images are an extension of the link tag, where the target is in the Image: namespace, and image display parameters are passed in as a series of part tags:
<link>
  <target>Image:Piatt and DeWitt County Lincoln marker wide.jpg</target>
  <part>thumb</part>
  <part>In the 1920s historical markers were placed at the county lines along the route Lincoln traveled in the eight judicial district. This example is on the border of
    <space/>
    <link><target>Piatt County, Illinois</target><part>Piatt</part></link>
    <space/>and<space/>
    <link><target>DeWitt County, Illinois</target><part>DeWitt counties</part></link>
  </part>
</link>
Headings
Wikipedia headings are denoted with the heading tag. The depth is denoted with the level attribute. This is a level 2 heading:
<heading level='2'>Lincoln 1809 to 1854</heading>
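Collecting the heading tags in document order gives a quick outline of an article. A minimal Python sketch follows; the function name heading_outline is hypothetical, and the article root element in the usage example is an assumed wrapper rather than a documented WEX tag.

```python
import xml.etree.ElementTree as ET


def heading_outline(article_xml):
    """Return (level, title) pairs for every <heading> in document order."""
    root = ET.fromstring(article_xml)
    outline = []
    for heading in root.iter("heading"):
        level = int(heading.get("level", "0"))  # 0 if the attribute is absent
        title = "".join(heading.itertext()).strip()
        outline.append((level, title))
    return outline


# Hypothetical fragment; <article> is an assumed wrapper element.
doc = "<article><heading level='2'>Lincoln 1809 to 1854</heading></article>"
print(heading_outline(doc))
```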
Character Formatting
Bold formatting is denoted with the bold tag. Italics are denoted with the italics tag. Non-breaking spaces are denoted by the space tag.
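When only the readable text is needed, these formatting tags can be unwrapped recursively. The Python sketch below is one possible approach, with a hypothetical function name plain_text. It maps the space tag to an ordinary space rather than a non-breaking one, and it shows an internal link as its display text (the first part tag) or, failing that, its target; both are choices made for this example.

```python
import xml.etree.ElementTree as ET


def plain_text(element):
    """Flatten a WEX XML element to plain text.

    <space/> becomes a single space, and an internal <link> contributes
    its display text (first <part>) or, failing that, its <target>.
    Other tags, such as <bold> and <italics>, are simply unwrapped.
    """
    pieces = [element.text or ""]
    for child in element:
        if child.tag == "space":
            pieces.append(" ")
        elif child.tag == "link" and child.find("target") is not None:
            shown = child.find("part")
            if shown is None:
                shown = child.find("target")
            pieces.append(plain_text(shown))
        else:
            pieces.append(plain_text(child))
        # Text that follows the child inside the same parent.
        pieces.append(child.tail or "")
    return "".join(pieces)
```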
Lists
Lists are denoted with the list tag, and individual items with the listitem tag. The type attribute on the list tag indicates the list type: bullet for bulleted lists, numbered for numbered lists, and ident for indented blocks. This is a bulleted list:
<list type='bullet'>
<listitem>
Robert Emmet Sherwood;<space/>
<italics>Abe Lincoln in Illinois: A Play in Twelve Scenes</italics>
<space/>(1939)<space/><link type='external'
href='http://www.questia.com/PM.qst?a=o&amp;d=179725'>online version</link><space/>
</listitem>
<listitem>
<link><target>Gore Vidal</target></link>.<space/>
<italics>Lincoln</italics><space/>ISBN 0-375-70876-6, a novel.
</listitem>
</list>
Templates
Wikipedia templates and infoboxes are described using the template tag, with the name of the template in the name attribute. Parameters are contained within param tags, with their names in the name attribute. Note that this data is also available in relational form via the template_calls table. This is the abbreviated infobox for Abraham Lincoln:
<template name="Infobox_President">
  <param name="name">Abraham Lincoln</param>
  <param name="nationality">American</param>
  <param name="image">Abraham Lincoln head on shoulders photo portrait.jpg</param>
  <param name="order">
    16th<space/><link><target>President of the United States</target></link>
  </param>
  <param name="term_start">
    <link><target>March 4</target></link>,<space/><link><target>1861</target></link>
  </param>
  <param name="term_end">
    <link><target>April 15</target></link>,<space/><link><target>1865</target></link>
  </param>
  <param name="predecessor">
    <link><target>James Buchanan</target></link>
  </param>
  <param name="successor">
    <link><target>Andrew Johnson</target></link>
  </param>
  ....
</template>
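A template element can be turned into a dictionary of its parameters with a short loop. This Python sketch keeps each value as an XML string so that nested markup such as links is preserved; the function name template_params is hypothetical, and the serialization produced by ElementTree may differ cosmetically from the original (for example in how empty tags are written).

```python
import xml.etree.ElementTree as ET


def template_params(template):
    """Return {param name: inner XML string} for a <template> element."""
    params = {}
    for param in template.findall("param"):
        # Serialize the children so nested markup such as links survives;
        # ET.tostring includes the text that trails each child.
        inner = (param.text or "") + "".join(
            ET.tostring(child, encoding="unicode") for child in param
        )
        params[param.get("name")] = inner.strip()
    return params


infobox = ET.fromstring(
    '<template name="Infobox_President">'
    '<param name="name">Abraham Lincoln</param>'
    '<param name="predecessor">'
    "<link><target>James Buchanan</target></link>"
    "</param>"
    "</template>"
)
print(template_params(infobox))
```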
Extensions
Wikipedia extensions, such as ref tags, are denoted with the extension tag. Here's an example of a ref extension:
Lincoln countered that he was "not in favor of bringing about in any way the social and political equality of the white and black races."
<extension extension_name='ref'>
  <link type='external' href='http://www.nps.gov/liho/debate4.htm'>
    Fourth Debate with Stephen A. Douglas at Charleston, Illinois
  </link>, September 18, 1858
</extension>
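One use of the extension tag is harvesting the sources cited in an article. The Python sketch below collects the external URLs found inside ref extensions; the function name reference_urls is hypothetical, and the paragraph wrapper element in the usage example is an assumption rather than a documented WEX tag.

```python
import xml.etree.ElementTree as ET


def reference_urls(fragment_xml):
    """Collect external link URLs cited inside <extension> ref elements."""
    root = ET.fromstring(fragment_xml)
    urls = []
    for ext in root.iter("extension"):
        if ext.get("extension_name") != "ref":
            continue
        for link in ext.iter("link"):
            if link.get("type") == "external" and link.get("href"):
                urls.append(link.get("href"))
    return urls


# Hypothetical fragment; <paragraph> is an assumed wrapper element.
fragment = (
    "<paragraph>Lincoln countered. "
    "<extension extension_name='ref'>"
    "<link type='external' href='http://www.nps.gov/liho/debate4.htm'>"
    "Fourth Debate with Stephen A. Douglas at Charleston, Illinois"
    "</link>, September 18, 1858"
    "</extension></paragraph>"
)
print(reference_urls(fragment))
```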