
Billion Triples dataset

The complete dataset is composed of a set of smaller datasets. Each download is in one of two formats: (1) WARC or (2) tar.gz. You can read about the WARC format by following this link to the mailing list. Each tar.gz download is a tarred and gzipped file containing triples in the N-Triples syntax.
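As a rough illustration of working with the tar.gz downloads, the following Python sketch streams the lines of every file in such an archive and counts the triples. It assumes that each archive member is a plain-text N-Triples file with one triple per line; the function names and the archive file name in the usage note are hypothetical and do not come from this page.

```python
import tarfile


def iter_ntriples_lines(archive_path):
    """Yield each non-empty, non-comment line from every regular file in a .tar.gz archive.

    Assumption: every member is a plain-text N-Triples file, one triple per line.
    """
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links and other special entries
            stream = archive.extractfile(member)
            if stream is None:
                continue
            for raw in stream:
                line = raw.decode("utf-8", errors="replace").strip()
                # N-Triples allows comment lines that start with '#'
                if line and not line.startswith("#"):
                    yield line


def count_triples(archive_path):
    """Count the triple lines in the archive without unpacking it to disk."""
    return sum(1 for _ in iter_ntriples_lines(archive_path))
```

For example, count_triples("wordnet.tar.gz") (a hypothetical local file name) could be compared with the Triples column in the table. The sketch does not validate N-Triples syntax; it only counts lines.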

Note: the Webscope data set comes with a license you need to sign and return before you can download it (see README).

If you are interested in participating in the Billion Triples Track, please join the mailing list.

Dataset    Format  Triples      URLs       Size    Download  License  Readme  Source  MD5
Webscope   WARC    82,768,342   1,979,022  2.7 GB  Download  License  README  Source  MD5
Falcon     WARC    32,512,340   541,518    834 MB  Download  -        -       Source  MD5
Swoogle    WARC    174,981,639  1,468,766  3.2 GB  Download  -        -       Source  MD5
Watson     WARC    59,750,019   130,701    267 MB  Download  -        -       Source  MD5
SWSE-1     WARC    30,346,451   194,259    4 GB    Download  -        -       Source  MD5
SWSE-2     WARC    60,504,716   389,107    2.4 GB  Download  -        -       Source  MD5
DBpedia    tar.gz  110,241,463  29         1.9 GB  Download  -        -       Source  MD5
Geonames   WARC    69,778,255   6,668,395  3.4 GB  Download  -        -       Source  MD5
SwetoDBLP  tar.gz  14,936,600   1          167 MB  Download  -        -       Source  MD5
WordNet    tar.gz  1,942,887    1          17 MB   Download  -        -       Source  MD5
Freebase   tar.gz  63,069,952   1          569 MB  Download  License  README  Source  MD5
US Census  tar.gz  445,752,172  1          3.3 GB  Download  -        -       Source  MD5
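
Each download in the table has an MD5 link. The following Python sketch shows one way to check a downloaded file against such a checksum. It assumes that the MD5 link provides the digest as a hexadecimal string; the function names and the chunk size are hypothetical choices for illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, reading it in chunks.

    Reading in chunks keeps memory use low for multi-gigabyte downloads.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_hex):
    """Compare a file's MD5 digest with an expected hex string, ignoring case and whitespace."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

A mismatch usually indicates an incomplete or corrupted download, in which case the file should be fetched again.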