Refinery


The Freebase Refinery is where people loading data into Freebase with Google Refine can monitor their loads. "Refinery" also refers to the whole conduit through which data conceptually flows from Google Refine into Freebase. As a project, Refinery's main function is to drive and coordinate the quality assurance (QA) processes applied to data before it enters Freebase. Starting from version 2.0, users who load data into Freebase with Google Refine go through the Refinery.

Help Channels

For help, please direct your questions to the Freebase Discuss mailing list, or join us on IRC channels #freebase or #grefine on irc.freenode.net:6667.

Work Flow

We recommend the following work flow:

  1. Reconcile your data one column at a time, starting from the easiest (e.g., countries' names) to the hardest (e.g., people's names).
  2. Align your data to Freebase's schemas. You might be tempted to give us everything you have in a single load. Avoid that temptation and instead partition your load so that users in the QA process can make easy judgments about your data. For instance, load against a particular Freebase type and only one or two properties; if you have data for other properties, load it in a separate load. This makes judgments easier and gets your data into Freebase that much quicker.
  3. Load data into Sandbox without the QA checkbox checked.
  4. Track the load on Refinery until it's done.
    • Watch for problematic triples that show up in the CANT tab.
  5. Inspect your newly loaded data in Sandbox. Make sure topics or property values that you expect show up in all contexts (e.g., in other topics that link to them).
    • If a topic or property value doesn't show up on Sandbox, chances are it doesn't have the right expected type(s). In that case, update your schema alignment skeleton in Google Refine to assert its expected type(s) explicitly.
    • Your data might disappear when Sandbox gets refreshed periodically. Just do another load.
  6. Repeat steps 1 through 5 until everything looks great.
  7. Now do another load with the QA checkbox checked.
  8. Track the load on Refinery until it's done. You should now see QA queues, one for each column in your data.
  9. Tell other people to help you out by going through those QA queues.
  10. Once all queues are done, you should get a total score.
    • If the score passes a certain threshold (currently set to 99% agreement), there will be a button that lets you "re-play" the load on Freebase proper.
    • If the score does not pass that threshold, inspect the queues and their problematic questions to locate and understand the errors. Fix the errors in your data as systematically as possible, and repeat from step 7.
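
Two parts of this work flow can be planned outside Google Refine: splitting a large load into small ones (step 2) and checking whether a finished load clears the QA threshold (step 10). The following is a minimal sketch in Python, not Refinery's actual code. The function names, the property IDs in the usage note, and the scoring rule (the fraction of reviewer judgments that agree, compared with the threshold using "at least") are all assumptions made for illustration; only the 99% figure comes from the text above.

```python
from typing import Dict, Iterator, List

# The threshold quoted in step 10 (99% agreement).
QA_THRESHOLD = 0.99


def partition_properties(properties: List[str],
                         max_per_load: int = 2) -> Iterator[List[str]]:
    """Yield groups of at most `max_per_load` properties, one group per load.

    Hypothetical helper: it mirrors the advice in step 2 to load against
    one type and only one or two properties at a time.
    """
    if max_per_load < 1:
        raise ValueError("max_per_load must be at least 1")
    for start in range(0, len(properties), max_per_load):
        yield properties[start:start + max_per_load]


def agreement_score(queues: Dict[str, List[bool]]) -> float:
    """Return the fraction of judgments, across all queues, that agree.

    Assumption: each queue is a list of booleans (True = the reviewer
    agreed with the data). How Refinery really computes its total score
    is not described on this page.
    """
    judgments = [j for queue in queues.values() for j in queue]
    if not judgments:
        raise ValueError("no judgments recorded yet")
    return sum(judgments) / len(judgments)


def can_replay(queues: Dict[str, List[bool]],
               threshold: float = QA_THRESHOLD) -> bool:
    """True if the load's score reaches the threshold (assumed inclusive)."""
    return agreement_score(queues) >= threshold
```

As a usage example, `list(partition_properties(["/film/film/directed_by", "/film/film/initial_release_date", "/film/film/country"]))` splits three illustrative properties into two loads, the first with two properties and the second with one. Calling `can_replay` on a dictionary of queues then indicates whether, under the assumed scoring rule, the load would be eligible for re-play or needs another pass from step 7.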
