Motivation
Before we begin downloading datasets and exploring them, we want to share the motivation behind this project. Throughout our master’s program, we worked on clean, well-formatted datasets that came with codebooks and documentation. We created a few variables and were then able to run some regressions. However, when we embarked on our thesis projects, we had to find publicly available datasets that suited our questions and put the data together ourselves. This project attempts to show how one could go about finding, cleaning, and setting up an analytical dataset.
Public Use Files
There are many sources of publicly available data. Here are a few:
- https://data.detroitmi.gov/
- https://opendata.cityofnewyork.us/
- https://www.data.gov/
- https://www.google.com/publicdata/directory
- https://data.gov.sg/
- https://www.ebrd.com/cs/Satellite?c=Content&cid=1395236498263&d=Mobile&pagename=EBRD%2FContent%2FContentLayout
- http://erf.org.eg/
- https://www.cdc.gov/nchs/index.htm
- https://capstat.nyc/
- https://atlasdata.dartmouth.edu/
For even more public datasets, see https://github.com/cambridgegis/awesome-public-datasets
For this exercise, we will use the NYPD’s Motor Vehicle Collisions data, which can be found at https://opendata.cityofnewyork.us/.
Stata’s import command reads several file formats. Other commands, many of them user-written, read additional formats. For example:
- insheetjson (user-written; imports JSON data)
- libjson (user-written; the JSON parsing library that insheetjson relies on)
- spshape2dta (converts shapefiles; an official command in Stata 15 and later)
[ ]:
import delimited "https://data.cityofnewyork.us/api/views/h9gi-nx95/rows.csv?accessType=DOWNLOAD", clear
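A quick check after the import can confirm that the file was read as expected. The following is a minimal sketch, not part of the original workflow; it assumes the import delimited command above has already run and that the data are in memory.

```stata
* Summarize the dataset in memory: number of observations and variables
describe, short

* Report the number of rows that were read from the CSV file
count
```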
Stata keeps the current date and time in the system global macros $S_DATE and $S_TIME, which we can display.
[ ]:
display "$S_DATE $S_TIME"
We can include a note in the dataset that we save. A note can hold information about the dataset that we do not want to, or are unable to, store in the dataset filename. Here we create a time stamp note.
[ ]:
notes: "Downloaded $S_DATE $S_TIME"
notes list
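As an alternative sketch, the notes command can insert a time stamp itself: the text TS, set off by blanks, is replaced with the current date and time when the note is stored. The wording of the note below is only an illustration.

```stata
* TS (surrounded by blanks) is replaced with the current date and time
notes: TS downloaded from NYC Open Data

* Display all notes attached to the dataset
notes list
```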
Relative file paths let us move easily from the current directory to another folder. Right now, the current directory is the dofiles folder. We use ..\ to move up one level to the folder “Stata Class” and then down one level into “input_data” to save the dataset.
[1]:
cd
C:\Users\jerem\Documents\Stata Class\dofiles
[ ]:
save "..\input_data\NYPD_Motor_Vehicle_Collisions.dta"
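If the do-file may be rerun or shared across operating systems, a slightly more defensive variant of the save step could look like the sketch below. It assumes the same folder layout as above; the local macro name outfile is hypothetical. Stata accepts forward slashes in paths on Windows as well as on macOS and Linux, and the replace option overwrites an existing copy instead of stopping with an error.

```stata
* Hypothetical local macro holding the relative path, written with forward slashes
local outfile "../input_data/NYPD_Motor_Vehicle_Collisions.dta"

* replace overwrites the file if it already exists
save "`outfile'", replace

* Stops with an error if the file is not found at that path
confirm file "`outfile'"
```

Run the three commands together in one cell or do-file, since a local macro only exists while the code that defines it is running.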