As part of my day-to-day work, I often need to run the tooling I’m working on against databases to test it. The databases need to have a variety of schemas and data in them for the test to be representative. A big problem is getting hold of an array of different databases in the first place. This is compounded by the fact that the tooling I work on supports five different database platforms, so I need different databases for each platform.
A useful source of ready-made databases is the reference databases provided by the vendor or community around each platform. The ones I’ve found so far are listed below, with a bit of detail about what each one represents.
SQL Server
| Name | Description | Install |
|---|---|---|
| AdventureWorks | A fictional bicycle manufacturer “Adventure Works Cycles”, containing data including manufacturing, sales, purchasing, product management, contact management, and human resources. Available in Online transactional (OLTP), Data warehousing (DW), and Lightweight (LT) formats. | backups, scripts |
| Chinook | A fictional digital media store, containing data including artists, albums, tracks, invoices, and customers. | script |
| Contoso | “Contoso Corporation”, a fictional multinational business (more info here), containing data including manufacturing, sales, and products. | scripts |
| Northwind | “Northwind Traders”, a fictional importer/exporter of speciality foods from around the world, containing sales data including customers, orders, inventory, purchasing, suppliers, shipping, employees, and accounting. | scripts |
| WideWorldImporters | “Wide World Importers”, a fictional wholesale novelty goods importer and distributor operating from the San Francisco Bay Area (more info here), containing data including purchasing, sales, and stock. Available in Online transactional (OLTP) and Data warehousing (DW) formats. | backups |
Stack Exchange / StackOverflow
The databases behind the many sites in the Stack Exchange network, including StackOverflow, can be downloaded here. However, the StackOverflow database is quite big (~65GB), so it has been split up into several files, which makes it non-trivial to get it into a SQL Server database.
Brent Ozar has simplified this process by providing downloads of the database files in SQL Server 2016 format.
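For anyone loading the raw dump by hand instead, here is a minimal sketch of the first step: reading one of the dump files without pulling it all into memory. It assumes each file is XML with one `row` element per record and the column values stored as attributes. That layout, the `iter_rows` helper, and the sample data are my assumptions for illustration, not something taken from the download page. The resulting dictionaries would still need to be inserted into SQL Server with whichever driver you prefer.

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(xml_file):
    """Yield the attributes of each <row> element as a dict, one at a time."""
    # iterparse streams the file, so a multi-gigabyte dump is not loaded at once.
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)
            elem.clear()  # drop the element's contents once they have been copied


if __name__ == "__main__":
    # Hypothetical two-record file standing in for a real dump file.
    sample = io.BytesIO(
        b'<posts><row Id="1" Title="First"/><row Id="2" Title="Second"/></posts>'
    )
    for row in iter_rows(sample):
        print(row)
```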
PostgreSQL
| Name | Description | Install |
|---|---|---|
| AdventureWorks | A port of the SQL Server version of AdventureWorks, a fictional bicycle manufacturer “Adventure Works Cycles”, containing data including manufacturing, sales, purchasing, product management, contact management, and human resources. | script |
| Chinook | A port of the SQL Server version of Chinook, a fictional digital media store, containing data including artists, albums, tracks, invoices, and customers. | script |
| Pagila | A database of a fictional DVD rental store, containing data including films, actors, customers, staff, and payments. | scripts |
Bluebox
Ryan Booz has created Bluebox, a database based on Pagila that aims to make it more full-featured.
NYC Census
If using PostGIS (an extension for managing spatial data), the introduction to PostGIS tutorial contains a sample database of census data for New York City.
Other Databases
The PostgreSQL website has a long list of other sample databases, including IMDB, the UK land registry of sales, and OpenStreetMap.
MySQL / MariaDB
| Name | Description | Install |
|---|---|---|
| Bureau of Transportation Statistics | A database of US commercial airline flight data including airlines, airports, and flights. | scripts |
| Employees | A database containing generated data including employees, departments, and salaries. | scripts |
| Sakila | A fictional DVD rental store, containing data including films, actors, customers, staff, and payments. | scripts |
| World | A database of information about the countries and cities of the world, containing data including countries, cities, and languages spoken. | script |
Other Databases
The MySQL website has a list of other sample databases, including some with large data sets.
Oracle
Oracle themselves provide some interlinked sample schemas:
- Customer Orders (CO)
- Human Resources (HR)
- Online Catalog (OC)
- Order Entry (OE)
- Product Media (PM)
- Sales History (SH)
These are documented on the Oracle site, and can be obtained from the Oracle samples GitHub repository.
| Name | Description | Install |
|---|---|---|
| OT | A global fictitious company that sells computer hardware including storage, motherboards, RAM, video cards, and CPUs. | scripts |
| O7 | A schema for a fictitious bank, containing customers, accounts, products, branches, departments, and employees. | scripts (see section 2) |
General data sources
There are many sites that provide data suitable for use as a reference database. While many of these sites provide the data, few of them provide it in a format that can be loaded straight into the relevant database platform. The data can, however, be imported fairly easily.
| Name | Description |
|---|---|
| Google Dataset Search | A search engine for datasets from across the internet. |
| Kaggle | Datasets for data science and machine learning. Interesting datasets include NBA Basketball and Spotify. |
| UC Irvine Machine Learning Repository | Well-documented and clean data sources, mainly geared towards machine learning. |
| Awesome Public Data Sets | A list of high quality, topic-centric public data sources. |
| Data Is Plural | A weekly newsletter and archive of real-world datasets. |
| Hugging Face | A huge set of data, mainly targeted at machine learning. |
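
As a rough illustration of the import step mentioned above, here is a minimal sketch that loads a CSV file into a table. It uses SQLite from the Python standard library as a stand-in for the target platform. The `load_csv` helper, the table name, and the sample data are all made up for the example. Every column is created without a type, so a real import would still need proper column types, keys, and the driver for the platform under test.

```python
import csv
import io
import sqlite3


def quote_identifier(name):
    """Quote a table or column name so spaces and quotes in it are safe."""
    return '"' + name.replace('"', '""') + '"'


def load_csv(conn, table, csv_file):
    """Create `table` from the CSV header row and insert every data row."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(quote_identifier(name) for name in header)
    placeholders = ", ".join("?" for _ in header)
    target = quote_identifier(table)
    conn.execute(f"CREATE TABLE {target} ({columns})")
    # Skip blank lines, which csv.reader returns as empty lists.
    rows = (row for row in reader if row)
    conn.executemany(f"INSERT INTO {target} VALUES ({placeholders})", rows)
    conn.commit()
    return conn.execute(f"SELECT COUNT(*) FROM {target}").fetchone()[0]


if __name__ == "__main__":
    # Hypothetical data standing in for a downloaded dataset.
    sample = io.StringIO("name,team,points\nAlice,Hawks,31\nBob,Bulls,27\n")
    conn = sqlite3.connect(":memory:")
    print(load_csv(conn, "players", sample))
```

Values are passed as query parameters and not formatted into the SQL string, so awkward characters in the data do not break the insert.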