Working with Freebase data on Amazon AWS
The Freebase data dumps are fairly large files, which makes them time-consuming to download, unzip and process. One way to make these dumps easier to work with is to use a cloud-based service like Amazon AWS, where you can connect to the data immediately.
AWS Data Sets
Here's how to get started:
1. Launch an EC2 instance
- Sign up for Amazon AWS and log in to the EC2 Management Console.
- Click the Launch Instance button and select one of the free Linux snapshots (e.g. Basic 64-bit Amazon Linux AMI 2011.02.1 Beta).
- Make sure that you download the .pem file for your keypair so that you can SSH into your instance later on.
- When you're done, you should see your instance with the status of "running" (it might take a minute or two to start up).
2. Create an EBS volume
- Now go to the Volumes section in the navigation pane. You should already have one volume "in-use". This is the one that was created to host the OS for your instance.
- Click Create Volume to add a new volume to hold the Freebase data dump that you're interested in.
- In the dialog that opens, choose a size (it must be big enough to fit the data dump you're using).
- Make sure that you set the availability zone to be the same as your existing volume. A volume can only be attached to an instance in the same availability zone.
- Select one of the Freebase data dump snapshots from the drop-down menu (they're near the bottom). Then, click the Create button.
- If you refresh the Management Console you should now see your new volume listed as "available".
- Click the Attach Volume button, select your EC2 instance and choose a device (the default is fine). Note the device name, since you will need it when mounting the volume in step 4.
3. SSH into your instance
- Go back to the Instances section and click on your instance. Then, in the bottom pane, scroll down to get the Public DNS for connecting to your instance.
- Open up a terminal or your favorite SSH client and connect to your instance like so:
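ssh -i ~/your_keypair.pem ec2-user@ec2-123-45-678-910.compute-1.amazonaws.com
Here your_keypair.pem is replaced with the name of the .pem file that you downloaded in step 1, and ec2-123-45-678-910.compute-1.amazonaws.com is replaced with the Public DNS from the instances page. ec2-user is the default login on the Amazon Linux AMI; other images may use a different user name.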
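If ssh refuses the key with an "UNPROTECTED PRIVATE KEY FILE" warning, the .pem file's permissions are probably too open. A minimal sketch of the fix follows; restrict_key is a hypothetical helper name used only for illustration, and the path in the example is the same placeholder as in the ssh command.

```shell
# restrict_key is a hypothetical helper, not part of ssh or AWS.
# It makes a private key file readable only by its owner,
# which ssh requires before it will use the key.
restrict_key() {
    chmod 400 "$1"
}

# Example (placeholder path, substitute your own .pem file):
# restrict_key ~/your_keypair.pem
```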
4. Mount the EBS volume
- From SSH on your EC2 instance, create a freebase directory and mount the EBS volume there.
sudo mkdir /freebase
sudo mount /dev/sdb1 /freebase
sudo chmod 777 /freebase
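The device name inside the instance does not always match the one chosen in the console; on some kernels a volume attached as /dev/sdb shows up as /dev/xvdb. The sketch below is one rough way to confirm that the mount took effect, assuming a Linux instance where /proc/mounts is readable. is_mounted is a hypothetical helper name, not a standard command.

```shell
# is_mounted: succeed if the given directory is listed as a mount point
# in /proc/mounts (second field of each line), fail otherwise.
is_mounted() {
    awk -v dir="$1" '$2 == dir { found = 1 } END { exit !found }' /proc/mounts
}

# Example:
# cat /proc/partitions                      # devices the kernel currently sees
# is_mounted /freebase && df -h /freebase   # show size and usage once mounted
```

If the mount command reports that /dev/sdb1 does not exist, the listing in /proc/partitions should show the name the kernel actually assigned.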
Congratulations, you're ready to start writing code and crunching data. Both of the data dumps contain uncompressed TSV files. You can see the format by just looking at the first 10 lines of a file like this:
head -10 /freebase/freebase-simple-topic-dump.tsv
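Since the files are plain TSV, standard Unix tools are enough for a first look. The sketch below counts how many tab-separated columns each line has, which can help confirm the layout before writing a parser. count_columns is a hypothetical helper name; the file path in the example is the one used above.

```shell
# count_columns: for a TSV file, print each distinct column count
# followed by the number of lines that have that many columns.
count_columns() {
    awk -F '\t' '{ counts[NF]++ } END { for (n in counts) print n "\t" counts[n] }' "$1" | sort -n
}

# Example:
# count_columns /freebase/freebase-simple-topic-dump.tsv
# cut -f 1 /freebase/freebase-simple-topic-dump.tsv | head -10   # first column only
```

A full pass over a large dump can take a while, so piping a sample through head into a temporary file first may be a quicker sanity check.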