Data Formats

Data comes in a thousand and one formats, some friendlier than others. Let's review a few!

API

An API (application programming interface) is a way for computers to communicate with one another. For us, this generally means sharing data. We'll be coding up Python scripts to talk to and request data from machines around the world, from Twitter to the United States government.
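As a rough sketch of what "requesting data" looks like in a Python script, the example below builds a request URL and decodes a JSON response using only the standard library. The endpoint `https://api.example.com/cities`, its parameters, and the helper names `build_url` and `fetch_json` are all hypothetical, chosen for illustration; a real API will document its own URLs and parameters.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url(base_url, params):
    """Attach query parameters to a base URL."""
    return f"{base_url}?{urlencode(params)}"


def fetch_json(url, timeout=10):
    """Request a URL and decode the response body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical endpoint and parameters, for illustration only.
url = build_url("https://api.example.com/cities", {"state": "NY", "limit": 5})
print(url)

# Nothing is requested until fetch_json is called with a real API's URL:
# data = fetch_json(url)
```

Many APIs also require an API key or other authentication, which is typically passed as another parameter or as a request header.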

CSV

```csv
city,population
New York,8406000
Los Angeles,3884000
Richmond,214114
```

Comma-separated values (CSV) files are among the most common formats for data. A CSV is a quick export away from Excel or Google Sheets, and you'll find yourself working from CSVs more often than not.

Although "comma-separated" is in the name, a CSV can arguably also use tabs, pipes, or any other character as its field delimiter (a tab-separated file is usually called a TSV).
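Here is a minimal sketch of reading the city data with Python's built-in `csv` module. The data is held in a string so the example runs on its own; with a real file you would pass an open file object instead. The pipe-delimited copy is made up here purely to show the `delimiter` argument.

```python
import csv
import io

csv_text = """city,population
New York,8406000
Los Angeles,3884000
Richmond,214114
"""

# DictReader uses the first row as the column names.
rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    # Every value arrives as a string, so convert numbers yourself.
    print(row["city"], int(row["population"]))

# The same data with a pipe delimiter only needs the delimiter argument.
pipe_text = csv_text.replace(",", "|")
pipe_rows = list(csv.DictReader(io.StringIO(pipe_text), delimiter="|"))
```

Both readers produce the same rows, which is why the choice of delimiter matters less than telling the reader which one is in use.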

JSON

```json
{
  "state": "Tennessee",
  "presidents": [
    { "name": "Andrew Jackson", "term": [1829, 1837] },
    { "name": "James K. Polk", "term": [1845, 1849] },
    { "name": "Andrew Johnson", "term": [1865, 1869] }
  ]
}
```

JSON stands for JavaScript Object Notation, and it's a slightly more complicated format than CSV. It can contain lists, numbers, nested objects, and other structures that are great for expressing the nuance of real-world data. Data from APIs is often formatted as JSON.
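To show how that nesting maps onto Python, here is a short sketch that parses the Tennessee example with the built-in `json` module. Note that strict JSON requires keys to be wrapped in double quotes, so the keys are quoted here. Objects become dictionaries and lists become Python lists.

```python
import json

json_text = """
{
  "state": "Tennessee",
  "presidents": [
    {"name": "Andrew Jackson", "term": [1829, 1837]},
    {"name": "James K. Polk", "term": [1845, 1849]},
    {"name": "Andrew Johnson", "term": [1865, 1869]}
  ]
}
"""

data = json.loads(json_text)

# Walk the nested list of presidents inside the top-level object.
for president in data["presidents"]:
    start, end = president["term"]
    print(f'{president["name"]}: {start} to {end}')

names = [p["name"] for p in data["presidents"]]
```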

Shapefiles

Shapefiles are by far the most common format for geographic data. City council districts, state boundaries, and the nearest wifi spots can often be found as shapefiles. You can import them into software like QGIS or convert them to geography-friendly JSON.

GeoJSON and TopoJSON

```json
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {},
      "geometry": {
        "type": "Point",
        "coordinates": [-73.970947265625, 40.81380923056958]
      }
    }
  ]
}
```

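GeoJSON and TopoJSON are both specially formatted JSON files that contain geographic information. The example above is GeoJSON: a collection holding a single point feature, with its coordinates listed as longitude, then latitude. TopoJSON is an extension of GeoJSON that stores shared boundaries only once, which usually makes for smaller files.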
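Because GeoJSON is ordinary JSON, the built-in `json` module is enough to read it. The sketch below pulls the point out of the example collection; the variable names are just illustrative. One thing worth remembering is that GeoJSON lists coordinates as longitude first, then latitude, which is the reverse of how many mapping tools display them.

```python
import json

geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {},
      "geometry": {
        "type": "Point",
        "coordinates": [-73.970947265625, 40.81380923056958]
      }
    }
  ]
}
"""

collection = json.loads(geojson_text)

points = []
for feature in collection["features"]:
    geometry = feature["geometry"]
    if geometry["type"] == "Point":
        # GeoJSON order is [longitude, latitude].
        longitude, latitude = geometry["coordinates"]
        points.append((latitude, longitude))

print(points)
```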

SQL

```sql
INSERT INTO cities (name, population) VALUES ('New York', 8406000);
INSERT INTO cities (name, population) VALUES ('Los Angeles', 3884000);
INSERT INTO cities (name, population) VALUES ('Richmond', 214114);
```

SQL is the language used to talk to databases, and you'll sometimes find datasets in SQL format, ready to be imported into your database of choice.
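As a small illustration of importing such a dataset, the sketch below loads the city rows into an in-memory SQLite database using Python's built-in `sqlite3` module. The `CREATE TABLE` statement is an assumption, since a real SQL dump would normally include its own table definitions, and SQLite is just one possible "database of choice".

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
connection = sqlite3.connect(":memory:")

# Assumed schema; a real SQL dump usually ships its own CREATE TABLE.
connection.execute("CREATE TABLE cities (name TEXT, population INTEGER)")

# executescript runs several semicolon-separated statements at once.
connection.executescript("""
INSERT INTO cities (name, population) VALUES ('New York', 8406000);
INSERT INTO cities (name, population) VALUES ('Los Angeles', 3884000);
INSERT INTO cities (name, population) VALUES ('Richmond', 214114);
""")

rows = connection.execute(
    "SELECT name, population FROM cities ORDER BY population DESC"
).fetchall()
print(rows)

connection.close()
```

String values are written with single quotes here because that is the standard SQL form; some databases also accept double quotes, but not all do.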