GitHub - curran/data: A collection of public data sets (as of Jan 15, 2016) at




A collection of public data sets for testing out visualization methods. These data sets are at various stages of preparation: some are just raw data, some are CSV files, and some are exposed as AMD modules. This collection is messy, but with some digging you may find hidden gems.

Targets for import:


Here's a listing of data sets with more detail. Columns will be marked in terms of their type for visualization, including:

  • Q = Quantitative, continuously varying numeric columns
  • T = Temporal, a timestamp
  • O = Ordered, distinct categories with a natural order (e.g. Low, Medium, High)
  • N = Nominal, distinct categories with no natural order (e.g. Ethnicity)
  • G = Geospatial identifiers (e.g. Country, City)
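
As a rough illustration of how these type markers could be used in code, the sketch below encodes the five letters as an enumeration and tags a few columns with them. The names `ColumnType`, `schema`, and `columns_of_type` are hypothetical and not part of this repository; the column names are borrowed from the listings in this README.

```python
from enum import Enum


class ColumnType(Enum):
    """The five column type markers used in this README."""
    QUANTITATIVE = "Q"
    TEMPORAL = "T"
    ORDERED = "O"
    NOMINAL = "N"
    GEOSPATIAL = "G"


# Hypothetical schema: column name -> type marker.
schema = {
    "timestamp": ColumnType("T"),
    "temperature": ColumnType("Q"),
    "city": ColumnType("G"),
}


def columns_of_type(schema, column_type):
    """Return the names of all columns tagged with the given type."""
    return [name for name, kind in schema.items() if kind is column_type]


quantitative_columns = columns_of_type(schema, ColumnType.QUANTITATIVE)
```

A visualization tool could use a lookup like this to decide, for example, which columns are candidates for an axis (Q, T) and which are candidates for grouping or color (N, O).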



UCI Machine Learning Repository - Adult (3.8 MB)


This data set demonstrates a mix of quantitative, ordinal, and nominal columns. To explore this data set visually, it would be useful to aggregate the data on the fly before rendering.

  • age: Q
  • workclass: N
  • education: O
  • education-num: Q
  • marital-status: N
  • occupation: N
  • relationship: N
  • race: N
  • sex: N
  • capital-gain: Q
  • capital-loss: Q
  • hours-per-week: Q
  • native-country: N
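
The kind of on-the-fly aggregation mentioned above might look like the following minimal sketch: group rows by a nominal or ordered column and summarize a quantitative column. The function name `aggregate` and the sample rows are assumptions made for illustration; the values are invented and do not come from the data set.

```python
from collections import defaultdict

# Hypothetical sample rows using a few of the column names listed above.
# The values are made up for illustration only.
rows = [
    {"education": "Bachelors", "sex": "Male", "hours-per-week": 40},
    {"education": "Bachelors", "sex": "Female", "hours-per-week": 50},
    {"education": "HS-grad", "sex": "Male", "hours-per-week": 30},
]


def aggregate(rows, group_by, measure):
    """Group rows by one column and return the count and mean of another."""
    totals = defaultdict(lambda: [0, 0.0])  # key -> [count, running sum]
    for row in rows:
        entry = totals[row[group_by]]
        entry[0] += 1
        entry[1] += row[measure]
    return {
        key: {"count": count, "mean": total / count}
        for key, (count, total) in totals.items()
    }


summary = aggregate(rows, "education", "hours-per-week")
```

The aggregated result has one entry per category rather than one per person, which is the size a bar chart or similar view would actually need.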

Data Canvas Sense Your City (237MB or Real-time API)


This data set contains measures collected by DIY sensor kits across several major cities: San Francisco, Bangalore, Boston, Geneva, Rio de Janeiro, Shanghai, and Singapore. There is a visualization competition for this data set, with submissions due March 20.

  • city: G
  • timestamp: T
  • temperature: Q
  • light: Q
  • airquality: Q
  • sound: Q
  • humidity: Q
  • dust: Q

Medical Store Geospatial Challenge (< 100KB)


This data set is small, but comes with a set of real-world questions about the data. This is also a competition, with submissions due April 25.

Referrers - Each row corresponds to information on a particular client referral source.

  • referrer_code: N
  • visit_count: Q
  • city — referrer city
  • postal_code_referrer: G
  • (latitude, longitude): G

Clients - Each row corresponds to a client visit to the store

  • client_id: N
  • referrer_code: N
  • city — referrer city
  • postal_code_referrer: G
  • (latitude, longitude): G
  • initial_visit_date: T
  • product_count: Q

UCI Machine Learning Repository - Individual household electric power consumption (20 MB)


This data set would be a great candidate to show multi-scale temporal aggregation.

  • timestamp: T
  • global_active_power: Q
  • global_reactive_power: Q
  • voltage: Q
  • global_intensity: Q
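
Multi-scale temporal aggregation could be sketched roughly as below: the same readings are rolled up to hourly, daily, and monthly averages by truncating each timestamp to the chosen scale. The names `SCALES` and `rollup` are hypothetical, and the sample readings are invented rather than taken from the data set.

```python
from collections import defaultdict
from datetime import datetime

# strftime patterns that truncate a timestamp to a coarser time scale.
SCALES = {
    "hour": "%Y-%m-%d %H:00",
    "day": "%Y-%m-%d",
    "month": "%Y-%m",
}


def rollup(readings, scale):
    """Average (timestamp, value) pairs within each bucket of the given scale."""
    pattern = SCALES[scale]
    buckets = defaultdict(list)
    for timestamp, value in readings:
        buckets[timestamp.strftime(pattern)].append(value)
    return {key: sum(values) / len(values) for key, values in buckets.items()}


# Made-up (timestamp, global_active_power) readings for illustration.
readings = [
    (datetime(2007, 1, 1, 10, 0), 2.0),
    (datetime(2007, 1, 1, 10, 30), 4.0),
    (datetime(2007, 1, 1, 11, 0), 6.0),
    (datetime(2007, 1, 2, 9, 0), 8.0),
]

hourly = rollup(readings, "hour")
daily = rollup(readings, "day")
monthly = rollup(readings, "month")
```

A visualization could switch between these roll-ups as the user zooms, showing monthly averages for an overview and hourly averages for a single day.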

BrightKite User Check-ins (57.2 MB)


This data set would be a useful example for multi-scale aggregation in both space and time. It has been used as the motivating example for several Big Data visualization systems based on data cubes, including "imMens: Real-time Visual Querying of Big Data" and "Nanocubes for real-time exploration of spatiotemporal datasets".

  • user-id: N
  • timestamp: T
  • (latitude, longitude): G
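
To make the idea of aggregating in both space and time concrete, here is a simplified sketch that counts check-ins per grid cell and time bucket at two resolutions. It is plain binning, not the imMens or Nanocubes data structures; the names `bin_key` and `count_checkins` and the sample check-ins are assumptions made for illustration.

```python
import math
from collections import Counter
from datetime import datetime


def bin_key(lat, lon, timestamp, cell_degrees, time_format):
    """Map a check-in to a (grid row, grid column, time bucket) key."""
    row = math.floor(lat / cell_degrees)
    col = math.floor(lon / cell_degrees)
    return (row, col, timestamp.strftime(time_format))


def count_checkins(checkins, cell_degrees, time_format):
    """Count check-ins per spatial cell and time bucket."""
    return Counter(
        bin_key(lat, lon, timestamp, cell_degrees, time_format)
        for _user_id, timestamp, lat, lon in checkins
    )


# Made-up (user-id, timestamp, latitude, longitude) tuples for illustration.
checkins = [
    ("u1", datetime(2009, 5, 1, 8, 0), 37.77, -122.42),
    ("u2", datetime(2009, 5, 1, 9, 0), 37.80, -122.27),
    ("u1", datetime(2009, 5, 2, 8, 0), 40.71, -74.00),
]

# Coarse view: 10-degree cells, monthly buckets.
coarse = count_checkins(checkins, 10.0, "%Y-%m")
# Fine view: 0.1-degree cells, daily buckets.
fine = count_checkins(checkins, 0.1, "%Y-%m-%d")
```

At the coarse scale the two nearby check-ins fall into the same cell, while at the fine scale each check-in lands in its own cell. Data cube systems precompute counts like these so that zooming and panning do not require rescanning the raw rows.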

ACLED (Armed Conflict Location and Event Data Project) (35MB)


This data set contains entries for each violent event in Africa from 1997 to 2014. It would be a good candidate for visualization with a linked timeline and choropleth map, where selections in the timeline drive the filtering of data shown on the map.

  • timestamp: T
  • (latitude, longitude): G
  • country: G
  • number of fatalities: Q
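
The linked-view interaction described above boils down to a filter followed by an aggregation: the timeline selection yields a date range, and the map is recolored from per-country totals within that range. The sketch below assumes a hypothetical function `fatalities_by_country` and uses placeholder countries and invented numbers, not actual ACLED records.

```python
from collections import defaultdict
from datetime import date

# Placeholder events for illustration; not real ACLED data.
events = [
    {"date": date(1999, 3, 1), "country": "Country A", "fatalities": 5},
    {"date": date(2005, 7, 15), "country": "Country A", "fatalities": 2},
    {"date": date(2005, 9, 30), "country": "Country B", "fatalities": 7},
    {"date": date(2012, 1, 10), "country": "Country B", "fatalities": 1},
]


def fatalities_by_country(events, start, end):
    """Sum fatalities per country for events within [start, end], inclusive."""
    totals = defaultdict(int)
    for event in events:
        if start <= event["date"] <= end:
            totals[event["country"]] += event["fatalities"]
    return dict(totals)


# Simulate brushing the timeline over the year 2005.
selected = fatalities_by_country(events, date(2005, 1, 1), date(2005, 12, 31))
```

Each time the brushed range changes, the function would be called again and the resulting totals mapped to the choropleth color scale.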

Safecast (3.2GB)


Grassroots sensor data about nuclear radiation in Japan.


Statistical Computing Statistical Graphics Data expo Airline on-time performance (12GB)


A great data set for scalability testing. This is the data set used in the Crossfilter Demo.


The GDELT Data Set (~100GB)


This would be a great data set for more extreme scalability testing. There is an Open Source project for loading this data set into Spark on AWS.


The Indian Census has lots of public data.


Best Buy has a developer portal for querying their data via a Web API.