This page describes our best understanding of what licenses are compatible with Freebase for loading data, and how to comply with other licenses when loading data into Freebase.
This is a community-editable wiki. The contents of this wiki do not constitute legal advice. If you really need to know the definitive answer on this subject, consult an intellectual property lawyer familiar with Creative Commons and other open licenses.
This represents the view of Metaweb staff and there are significant unanswered questions from non-staff members as to its legal validity.
For loading structured data:
- If a data source is under CC-BY, you can load it into Freebase as long as you provide attribution.
- If a data source is in the public domain, you can load it without any attribution.
- If a data source is under any other license, you must seek the permission of the owner before loading it into Freebase.
- If you are doing highly creative work to infer assertions from unstructured data, copyright probably doesn't apply.
For topic descriptions and other text:
- The Freebase client offers a range of licenses for use when uploading an image. You may choose any of these.
- Copyrighted images may be uploaded under the provisions of fair use. They will be thumbnailed to 150x150px.
What is structured/unstructured data?
A structured data source is one that starts out as tables, records, fields, etc. It might be a relational database, a spreadsheet, a set of search results presented in tabular form, well-structured XML, the output of a web-based API, etc.
The steps for loading structured data are:
- identify desired fields
- match them to Freebase Schema
- format data, if necessary (e.g. convert names to "Firstname Lastname" or measurements to metric)
- reconcile against Freebase data
The process of loading highly structured data into Freebase is not particularly interpretive or creative: you are taking a collection of facts and loading them into Freebase with as little change as possible. Data loaded this way therefore retains any copyright it may have, and you will need to comply with the license terms set by the provider.
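The four loading steps listed above can be sketched roughly as follows. This is only an illustration: the source rows, the property paths, the topic ID and the exact-match reconciliation are all assumptions made for the example, not a description of Freebase's actual loading tools.

```python
# Hypothetical source rows; a real load would read these from the provider's data.
SOURCE_ROWS = [
    {"name": "Lovelace, Ada", "height_in": "65", "internal_id": "17"},
    {"name": "Turing, Alan", "height_in": "70", "internal_id": "42"},
]

# Steps 1 and 2: pick the desired fields and map them to schema properties.
# The property paths are assumed for illustration.
FIELD_TO_PROPERTY = {
    "name": "/type/object/name",
    "height_in": "/people/person/height_meters",
}


def format_name(value):
    """Step 3: convert "Lastname, Firstname" to "Firstname Lastname"."""
    if "," not in value:
        return value.strip()
    last, first = (part.strip() for part in value.split(",", 1))
    return f"{first} {last}"


def inches_to_meters(value):
    """Step 3: convert a measurement in inches to metres."""
    return round(float(value) * 0.0254, 2)


FORMATTERS = {"name": format_name, "height_in": inches_to_meters}


def prepare(row):
    """Keep only the mapped fields, formatted and keyed by property."""
    return {
        FIELD_TO_PROPERTY[field]: FORMATTERS[field](row[field])
        for field in FIELD_TO_PROPERTY
    }


def reconcile(record, existing_names):
    """Step 4 (toy version): exact name match against already-known topics."""
    return existing_names.get(record["/type/object/name"])


# Hypothetical lookup of existing topics by name.
existing = {"Alan Turing": "/en/alan_turing"}
prepared = [prepare(row) for row in SOURCE_ROWS]
matches = [reconcile(record, existing) for record in prepared]
```

Note that nothing here adds interpretation: each output value is a direct, mechanical transformation of a source field, which is why the source's license terms still apply to the result.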
On the other hand, if you are inferring facts from unstructured data, or going through more complex processes, the resulting assertions probably aren't copyrightable. Examples of things that might fit this include:
- inferring a person's gender from the language of an article about them
- guessing a geolocation by searching multiple geo-related services then finding the geometric centre of the results
- extracting facts from prose using natural language processing
- using messy, semi-structured data to feed a RABJ queue then get human judgement to help make a final assertion
It's usually best to start those sorts of processes with data sources that are as open as possible, but that's more a matter of etiquette than legality: if the source has an intent toward openness, its owners are likely to think what you're doing is pretty cool. (Of course, you need to make sure you comply with any terms and conditions around Web scraping and the like.)
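As an illustration of the geolocation example in the list above, here is a minimal sketch of finding the centre of several candidate coordinates. It assumes the results have already been collected as (latitude, longitude) pairs in degrees; the sample coordinates are made up. Averaging unit vectors rather than raw latitudes and longitudes is one reasonable choice because it behaves sensibly near the 180° meridian, but it is not the only way to define a "centre".

```python
import math


def geographic_centre(points):
    """Return (lat, lon) in degrees for the mean direction of the given points.

    Each point is converted to a unit vector on a sphere, the vectors are
    averaged, and the average is converted back to latitude and longitude.
    """
    if not points:
        raise ValueError("need at least one point")
    x = y = z = 0.0
    for lat_deg, lon_deg in points:
        lat, lon = math.radians(lat_deg), math.radians(lon_deg)
        x += math.cos(lat) * math.cos(lon)
        y += math.cos(lat) * math.sin(lon)
        z += math.sin(lat)
    n = len(points)
    x, y, z = x / n, y / n, z / n
    lon = math.atan2(y, x)
    lat = math.atan2(z, math.hypot(x, y))
    return math.degrees(lat), math.degrees(lon)


# Made-up results, as if returned by three different geo-related services.
results = [(51.5007, -0.1246), (51.5014, -0.1419), (51.5081, -0.0759)]
centre = geographic_centre(results)
```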
Compatible licenses for structured data
If you are loading structured data, the source must be under one of the following licenses/terms:
- Public domain
- CC-BY
No other licenses are compatible with Freebase's CC-BY license. In particular, the following are not compatible, for the reasons given:
- CC-BY-SA - means that Freebase, and anyone that used Freebase, would also have to use CC-BY-SA
- CC-BY-NC - non-commercial use would prevent Metaweb or any other commercial enterprise from using the data
- CC-BY-ND - no derivatives means you can't mash it up with anything
- GFDL - like CC-BY-SA, means that Freebase and anyone using Freebase would need to use the GFDL
(Note: the use of GFDL for Wikipedia-sourced topic Blurbs is a special case, because the text is not mixed in with other data, and does not pollute it. Also, people aren't allowed to edit the Wikipedia blurb, only replace it.)
If you have a data source that is not public domain or CC-BY and you want to load it, you can seek permission of the owner. A sample email to a data owner is available at License compatibility/Seeking permission email.
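The rules in this section can be summarised as a small decision function. This is only a sketch of the guidance on this page, not legal advice: the license labels are informal strings chosen for the example, and real license identification (versions, jurisdictions, custom terms) is more involved.

```python
from enum import Enum


class Action(Enum):
    LOAD = "load without attribution"
    LOAD_WITH_ATTRIBUTION = "load with attribution"
    SEEK_PERMISSION = "seek the owner's permission first"


# Reasons paraphrased from the list of incompatible licenses on this page.
INCOMPATIBLE_REASONS = {
    "CC-BY-SA": "Freebase and anyone using it would also have to use CC-BY-SA",
    "CC-BY-NC": "non-commercial use would exclude commercial users of the data",
    "CC-BY-ND": "no derivatives means the data can't be combined with anything",
    "GFDL": "Freebase and anyone using it would need to use the GFDL",
}


def normalise(license_name):
    """Normalise an informal license label, e.g. "Public Domain" -> "PUBLIC-DOMAIN"."""
    return license_name.strip().upper().replace(" ", "-")


def loading_action(license_name):
    """Return what to do before loading structured data under this license."""
    key = normalise(license_name)
    if key == "PUBLIC-DOMAIN":
        return Action.LOAD
    if key == "CC-BY":
        return Action.LOAD_WITH_ATTRIBUTION
    # Everything else, including unrecognised licenses, needs permission.
    return Action.SEEK_PERMISSION


def incompatibility_reason(license_name):
    """Return the stated reason for incompatibility, or None if none is listed."""
    return INCOMPATIBLE_REASONS.get(normalise(license_name))
```

Treating every unrecognised license as "seek permission" keeps the function conservative, matching the rule that any license other than public domain or CC-BY requires the owner's consent.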
Attributing CC-BY data when you load it
Freebase offers facilities to appropriately attribute data as you load it.
First, when doing the data load, you should use attribution nodes to associate each primitive you write with a Mass Data Operation that describes your data load and links back to the original source.
Secondly, you should create an ID for each topic for which you import data, using enumerations, to link back to the original data source. You could also create a URI template to auto-populate weblinks based on that ID.
Thirdly, you can use an attribution template to make the attribution show on Freebase's topic page.
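To make the relationship between these pieces concrete, here is a rough sketch in plain Python. It does not use Freebase's write API: the class names, field names, the `{key}` placeholder and the example.org URLs are all hypothetical, chosen only to show how each written value can point at one Mass Data Operation, and how a per-topic source ID can be expanded into a link back to the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MassDataOperation:
    """Hypothetical stand-in for the record that describes a data load."""
    name: str
    source_url: str
    license_name: str


@dataclass(frozen=True)
class Assertion:
    """One written value, tagged with the load it came from."""
    topic: str
    prop: str
    value: str
    attribution: MassDataOperation


def expand_uri_template(template, key):
    """Fill the assumed "{key}" placeholder with a topic's source ID."""
    return template.replace("{key}", key)


mdo = MassDataOperation(
    name="Example bird data load",
    source_url="http://example.org/birds",
    license_name="CC-BY",
)

source_key = "12345"  # the topic's ID in the original data source
assertions = [
    Assertion("Barn Owl", "source_key", source_key, mdo),
    Assertion("Barn Owl", "wingspan_cm", "89", mdo),
]

# A link back to the source record, derived from the stored ID.
weblink = expand_uri_template("http://example.org/birds/{key}", source_key)
```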
Third-party attribution (i.e. subsequent use of loaded data)
People using Freebase data are required to link back to the relevant Freebase topic page. However, the Freebase policies don't/can't specify how third parties should attribute data sources at one or more removes (NB: Wikipedia and image attribution aren't about structured data, and are not relevant to this discussion right now).
This is a general problem with CC-BY, not just with Freebase. Common practice among Creative Commons users is to attribute the nearest source, and to let consumers of the work trace back further through the chain of attributions themselves if they wish. For example:
- Alice publishes a data set under CC-BY
- Bob creates a visualisation of that data set, attributes Alice, and makes it CC-BY
- Carol writes a blog post about the visualisation, including an image of it, and attributes Bob, and makes it CC-BY
- Diane writes another blog post, quoting and attributing Carol...
There is nothing in the language of CC-BY to require that Diane attribute Bob or Alice, though it is good manners to do so, if you can figure out that you are in fact using some of Bob or Alice's work.
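The Alice/Bob/Carol/Diane example can be modelled as a simple chain in which each work records only its nearest source. The sketch below is illustrative only (the `Work` class and its fields are invented for the example); it shows how a consumer could hop back through the attributions when each link is available.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Work:
    author: str
    title: str
    # The nearest source this work attributes, if any.
    attributes: Optional["Work"] = None


def attribution_chain(work):
    """Return the authors reachable by following attributions, nearest first."""
    chain = []
    current = work.attributes
    while current is not None:
        chain.append(current.author)
        current = current.attributes
    return chain


alice = Work("Alice", "data set")
bob = Work("Bob", "visualisation", attributes=alice)
carol = Work("Carol", "blog post", attributes=bob)
diane = Work("Diane", "second blog post", attributes=carol)

# Diane only credits Carol directly; the rest is found by walking the chain.
sources_behind_diane = attribution_chain(diane)
```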
Freebase offers some facilities to make it easier to trace attribution back to other/more distant sources:
- Inspecting the IDs and/or attribution template will tell you, at a topic level, what sources have contributed to data on that topic. (A MQL extension to simplify this will be forthcoming.)
- Inspecting the attribution on links will tell you, at the level of specific assertions, where they came from.
If you are doing a mass load of assertions based on some kind of unstructured data (eg. prose) or from merging and massaging multiple sources to the point where no particular one is *the* source, you should still:
- create a Mass Data Operation to explain your load
- use attribution to link your assertions with that MDO
Text (eg. topic descriptions)
Short version: don't use anything copyrighted.
If adding images via the API, you will want to inspect the schema connecting images to licenses, and make sure you write the appropriate properties to attribute the image.