Do I need to clean my data?¶
The short answer is always assume your data needs to be cleaned. Data collection is rarely perfect, and even if the data is collected perfectly there are multiple points in the data pipeline where errors can be introduced. Data errors can be catastrophic for a myriad of reasons; modern analytic methods require data to confirm to certain standards for your analysis to be valid, and in some cases regression models will not run if your data is not properly cleaned. For the purposes of this demonstration I will be using the 2003 Stop, Question and Frisk dataset from the NYPD. I also recommend downloading the data dictionary from here.
Because the data is not in stata format (*.dta), we must import it into stata using the import command. In this case we will be using import delimited because the data is in text format with a delimiter. A delimiter separates values in your dataset. The most common type of delimiter is a comma, and files stored in this format usually end in .csv, which stands for comma separated values
[1]:
import delimited /home/omari/Downloads/2003.csv, clear
(111 vars, 160,851 obs)
Before proceeding to clean your data, it is often useful to look at metadata. Metadata is simply data that describes your data, and is usually available for public use datasets. The most common forms of metadata you’ll encounter are data dictionaries and codebooks. See a detailed description of codebooks here. Data dictonaries are much leaner than codebooks, and focus primarily on providing information about data formats.
[2]:
/* This is a Stata comment. It allows us to write notes in our code that Stata will not execute.
If we do not specify it as a comment, Stata will attempt run our notes as code and will return a syntax error.
Multi line comments must be enclosed between /* ... */ */
// Alternatively, single line comments only require // and don't need to be enclosed.
// describe lists all variables in our datasets and gives information about their storage type
describe
Contains data
obs: 160,851
vars: 111
size: 58,388,913
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
year int %8.0g
pct int %8.0g
ser_num long %12.0g
datestop long %12.0g
timestop str5 %9s
recstat str1 %9s
inout str1 %9s
trhsloc str1 %9s
perobs int %8.0g
crimsusp str30 %30s
perstop int %8.0g
typeofid str1 %9s
explnstp str1 %9s
othpers str1 %9s
arstmade str1 %9s
arstoffn str30 %30s
sumissue str1 %9s
sumoffen str30 %30s
compyear byte %8.0g
comppct byte %8.0g
offunif str1 %9s
officrid str1 %9s
frisked str1 %9s
searched str1 %9s
contrabn str1 %9s
adtlrept str1 %9s
pistol str1 %9s
riflshot str1 %9s
asltweap str1 %9s
knifcuti str1 %9s
machgun str1 %9s
othrweap str1 %9s
pf_hands str1 %9s
pf_wall str1 %9s
pf_grnd str1 %9s
pf_drwep str1 %9s
pf_ptwep str1 %9s
pf_baton str1 %9s
pf_hcuff str1 %9s
pf_pepsp str1 %9s
pf_other str1 %9s
radio str1 %9s
ac_rept str1 %9s
ac_inves str1 %9s
rf_vcrim str1 %9s
rf_othsw str1 %9s
ac_proxm str1 %9s
rf_attir str1 %9s
cs_objcs str1 %9s
cs_descr str1 %9s
cs_casng str1 %9s
cs_lkout str1 %9s
rf_vcact str1 %9s
cs_cloth str1 %9s
cs_drgtr str1 %9s
ac_evasv str1 %9s
ac_assoc str1 %9s
cs_furtv str1 %9s
rf_rfcmp str1 %9s
ac_cgdir str1 %9s
rf_verbl str1 %9s
cs_vcrim str1 %9s
cs_bulge str1 %9s
cs_other str1 %9s
ac_incid str1 %9s
ac_time str1 %9s
rf_knowl str1 %9s
ac_stsnd str1 %9s
ac_other str1 %9s
sb_hdobj str1 %9s
sb_outln str1 %9s
sb_admis str1 %9s
sb_other str1 %9s
repcmd str4 %9s
revcmd str4 %9s
rf_furt str1 %9s
rf_bulg str1 %9s
offverb str1 %9s
offshld str1 %9s
sex str1 %9s
race str1 %9s
dob str8 %9s
age str4 %9s
ht_feet str1 %9s
ht_inch str2 %9s
weight str5 %9s
haircolr str2 %9s
eyecolor str2 %9s
build str1 %9s
othfeatr str1 %9s
addrtyp str1 %9s
--more--
The two most frequent types of variables you’ll encounter in Stata are character and numeric. Numeric variables can be stored as int, byte, float, long and double; character variables are stored as str (strings). The number at the end of str gives the width of the string, which is the maximum number of characters the string can hold. An important part of data cleaning is ensuring your variables are in the right format.
[3]:
/* The describe command from earlier can take variables as arguments. */
describe age
storage display value
variable name type format label variable label
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
age str4 %9s
[4]:
sum age
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
age | 0
[5]:
tab age
age | Freq. Percent Cum.
------------+-----------------------------------
| 162 0.10 0.10
0 | 6,349 3.95 4.05
1 | 168 0.10 4.15
10 | 42 0.03 4.18
103 | 1 0.00 4.18
11 | 61 0.04 4.22
110 | 1 0.00 4.22
113 | 1 0.00 4.22
114 | 1 0.00 4.22
115 | 1 0.00 4.22
12 | 268 0.17 4.39
120 | 5 0.00 4.39
121 | 1 0.00 4.39
1271 | 1 0.00 4.39
13 | 811 0.50 4.89
130 | 7 0.00 4.90
1316 | 1 0.00 4.90
132 | 1 0.00 4.90
135 | 3 0.00 4.90
14 | 2,383 1.48 6.38
140 | 13 0.01 6.39
145 | 6 0.00 6.40
147 | 1 0.00 6.40
148 | 1 0.00 6.40
149 | 2 0.00 6.40
15 | 4,893 3.04 9.44
150 | 12 0.01 9.45
154 | 1 0.00 9.45
155 | 5 0.00 9.45
156 | 1 0.00 9.45
157 | 1 0.00 9.45
158 | 1 0.00 9.45
16 | 7,340 4.56 14.02
160 | 10 0.01 14.02
165 | 10 0.01 14.03
1655 | 1 0.00 14.03
169 | 1 0.00 14.03
17 | 9,063 5.63 19.66
170 | 8 0.00 19.67
1742 | 1 0.00 19.67
175 | 3 0.00 19.67
18 | 10,175 6.33 26.00
180 | 11 0.01 26.00
181 | 1 0.00 26.00
185 | 7 0.00 26.01
187 | 2 0.00 26.01
189 | 5 0.00 26.01
19 | 9,677 6.02 32.03
190 | 9 0.01 32.04
195 | 4 0.00 32.04
196 | 1 0.00 32.04
198 | 2 0.00 32.04
2 | 37 0.02 32.06
20 | 11,168 6.94 39.01
200 | 4 0.00 39.01
205 | 6 0.00 39.01
206 | 3 0.00 39.01
21 | 7,727 4.80 43.82
210 | 4 0.00 43.82
213 | 3 0.00 43.82
214 | 2 0.00 43.82
215 | 1 0.00 43.82
219 | 1 0.00 43.82
22 | 8,182 5.09 48.91
220 | 5 0.00 48.91
221 | 2 0.00 48.92
222 | 2 0.00 48.92
223 | 3 0.00 48.92
224 | 2 0.00 48.92
225 | 1 0.00 48.92
23 | 7,487 4.65 53.58
230 | 3 0.00 53.58
232 | 1 0.00 53.58
235 | 2 0.00 53.58
236 | 1 0.00 53.58
239 | 1 0.00 53.58
24 | 5,404 3.36 56.94
240 | 3 0.00 56.94
241 | 1 0.00 56.94
244 | 1 0.00 56.94
245 | 1 0.00 56.94
247 | 3 0.00 56.95
25 | 5,988 3.72 60.67
250 | 1 0.00 60.67
254 | 1 0.00 60.67
255 | 1 0.00 60.67
258 | 1 0.00 60.67
26 | 4,223 2.63 63.30
265 | 1 0.00 63.30
269 | 2 0.00 63.30
27 | 3,834 2.38 65.68
270 | 2 0.00 65.68
28 | 3,499 2.18 67.86
280 | 3 0.00 67.86
285 | 1 0.00 67.86
287 | 2 0.00 67.86
29 | 2,965 1.84 69.70
--more--
Stata returns 0 observations when we try to calculate the mean and standard deviation of age using the sum (summarize) command, However, when we tab (tabulate) it shows us a range of values and more than 0 observations. This is because age is stored as a string variable, and we are unable to perform arithmetic operations on string variables. In order for us to include age in our analysis, we must convert it to a numeric variable. As we saw earlier, there are multiple formats for
numerical variables, so which format should we convert age to? byte, int, and long can only store integers (whole numbers), whereas float and double can store both integers and decimals. float and double consume more memory to store information, so it is not computationally efficient to store a whole numbers as a float. See this Stata help file for more information. To determine what numeric format we should
convert age to we must first examine the values of age.
We see that all the values of age are whole numbers, so we should not convert it to a float or double. There are also values of age that exceeds 100, so it cannot be stored as a byte. Logically, it makes sense to store it as an integer. We’ll do so using the destring command.
[8]:
destring age, gen(age_cleaned)
age: all characters numeric; age_cleaned generated as int
(162 missing values generated)
Byte can only store a maximum value of 100, and we can see from the frequency table that there are values of age that exceed 100. Because of this Stata choses to convert age to int, and our optional argument gen() creates a new variable (or column). Stata stores all datasets as 2-dimensional arrays; variables are stored as columns and observations are stored as rows. Alternatively, we could have specified replace in lieu of gen() and Stata would have overwritten our age column
instead of creating a new one. This option is risky and should be used with caution. Another thing you may have noticed is that Stata created 162 missing values. This is the result of Stata converting an empty string to a numeric value. The first row of our frequency table shows 162 people with no values for age.
[15]:
sum age_cleaned
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
age_cleaned | 160,689 31.76657 225.3125 0 9999
[7]:
sum age_cleaned, d
age_cleaned
-------------------------------------------------------------
Percentiles Smallest
1% 0 0
5% 14 0
10% 16 0 Obs 160,689
25% 18 0 Sum of Wgt. 160,689
50% 23 Mean 31.76657
Largest Std. Dev. 225.3125
75% 32 9999
90% 43 9999 Variance 50765.73
95% 49 9999 Skewness 43.67801
99% 72 9999 Kurtosis 1930.646
We’re now able to calculate the mean of age using our “cleaned” age variable. However our detailed summary (that’s what the , d stands for) shows that we have rather unrealistic values of age, such as 9999. The most straightforward way to deal with unrealistic values of a variable is to replaced them as missing, thus removing the values from our analysis and preventing them from potentially skewing our results. While this may be the most straightforward way, it is often the least desirable
way due to the loss of observations.
[9]:
replace age_cleaned = . if age_cleaned>100
(405 real changes made, 405 to missing)
[10]:
sum age_cleaned, d
age_cleaned
-------------------------------------------------------------
Percentiles Smallest
1% 0 0
5% 14 0
10% 16 0 Obs 160,284
25% 18 0 Sum of Wgt. 160,284
50% 23 Mean 26.01252
Largest Std. Dev. 12.9699
75% 32 99
90% 42 99 Variance 168.2183
95% 49 99 Skewness 1.52006
99% 64 99 Kurtosis 8.901667
We now see that the mean value of age is slightly lower, and the standard deviation is much smaller. Given the extremely small number of observations that are lost due to converting the extreme values to missing (405 out of 160,851), it may be sufficient to simply leave the values as missing. However if we were to lose more substantial amounts of observations by converting the extreme values to missing, it would be necessary to exlpore ways to impute those missing values with a more realistic number. Imputing the mean value is a popular choice. There are also statistical methods that can be used to impute missing values by way of predictive modeling. The latter is much more advanced than this tutorial aims to be, so we’ll resort to the former.
[11]:
return list
scalars:
r(p99) = 64
r(p95) = 49
r(p90) = 42
r(p75) = 32
r(p50) = 23
r(p25) = 18
r(p10) = 16
r(p5) = 14
r(p1) = 0
r(max) = 99
r(min) = 0
r(sum) = 4169391
r(kurtosis) = 8.901666570994999
r(skewness) = 1.520059538798137
r(sd) = 12.96990027108677
r(Var) = 168.2183130419367
r(mean) = 26.01252152429438
r(sum_w) = 160284
r(N) = 160284
Stata stores the results of most of its computational commands in temporary scalars/vectors/matrices that can be viewed with return list. It is temporary because the next computational command will overwrite the current content of return list, and all content is lost when Stata is closed.
[12]:
replace age_cleaned = r(mean) if age_cleaned==.
variable age_cleaned was int now float
(567 real changes made)
We’ ve now replaced the 405 missing values we created in our previous step, as well as the 162 missing values that were already by imputing the mean value.
[15]:
sum age_cleaned, d
age_cleaned
-------------------------------------------------------------
Percentiles Smallest
1% 0 0
5% 14 0
10% 16 0 Obs 160,851
25% 18 0 Sum of Wgt. 160,851
50% 23 Mean 26.01252
Largest Std. Dev. 12.94702
75% 32 99
90% 42 99 Variance 167.6253
95% 48 99 Skewness 1.522746
99% 64 99 Kurtosis 8.933156
Because r(mean) was a float (had decimal places), it changed our age_cleaned from int to float. In order to return to integer, we can use the recast command which will truncate all floats and keep only the numbers to the left of the decimal. Note, truncation does not round up or round down. It only returns numbers to the left of the decimal.
[17]:
tab age_cleaned
age_cleaned | Freq. Percent Cum.
------------+-----------------------------------
0 | 6,349 3.95 3.95
1 | 168 0.10 4.05
2 | 37 0.02 4.07
3 | 23 0.01 4.09
4 | 12 0.01 4.10
5 | 83 0.05 4.15
6 | 21 0.01 4.16
7 | 5 0.00 4.16
8 | 3 0.00 4.17
9 | 5 0.00 4.17
10 | 42 0.03 4.20
11 | 61 0.04 4.23
12 | 268 0.17 4.40
13 | 811 0.50 4.90
14 | 2,383 1.48 6.39
15 | 4,893 3.04 9.43
16 | 7,340 4.56 13.99
17 | 9,063 5.63 19.62
18 | 10,175 6.33 25.95
19 | 9,677 6.02 31.97
20 | 11,168 6.94 38.91
21 | 7,727 4.80 43.71
22 | 8,182 5.09 48.80
23 | 7,487 4.65 53.46
24 | 5,404 3.36 56.81
25 | 5,988 3.72 60.54
26 | 4,223 2.63 63.16
26.01252 | 567 0.35 63.52
27 | 3,834 2.38 65.90
28 | 3,499 2.18 68.07
29 | 2,965 1.84 69.92
30 | 3,781 2.35 72.27
31 | 2,577 1.60 73.87
32 | 2,891 1.80 75.67
33 | 2,870 1.78 77.45
34 | 2,405 1.50 78.95
35 | 2,774 1.72 80.67
36 | 2,149 1.34 82.01
37 | 2,187 1.36 83.37
38 | 2,157 1.34 84.71
39 | 2,151 1.34 86.05
40 | 2,620 1.63 87.67
41 | 1,776 1.10 88.78
42 | 2,081 1.29 90.07
43 | 1,868 1.16 91.23
44 | 1,374 0.85 92.09
45 | 1,530 0.95 93.04
46 | 1,143 0.71 93.75
47 | 1,086 0.68 94.42
48 | 945 0.59 95.01
49 | 814 0.51 95.52
50 | 873 0.54 96.06
51 | 585 0.36 96.42
52 | 705 0.44 96.86
53 | 638 0.40 97.26
54 | 429 0.27 97.53
55 | 536 0.33 97.86
56 | 328 0.20 98.06
57 | 308 0.19 98.25
58 | 265 0.16 98.42
59 | 225 0.14 98.56
60 | 233 0.14 98.70
61 | 134 0.08 98.79
62 | 170 0.11 98.89
63 | 131 0.08 98.97
64 | 76 0.05 99.02
65 | 90 0.06 99.08
66 | 57 0.04 99.11
67 | 53 0.03 99.15
68 | 39 0.02 99.17
69 | 39 0.02 99.19
70 | 48 0.03 99.22
71 | 27 0.02 99.24
72 | 31 0.02 99.26
73 | 32 0.02 99.28
74 | 21 0.01 99.29
75 | 32 0.02 99.31
76 | 15 0.01 99.32
77 | 19 0.01 99.33
78 | 12 0.01 99.34
79 | 9 0.01 99.35
80 | 5 0.00 99.35
81 | 7 0.00 99.36
82 | 6 0.00 99.36
83 | 7 0.00 99.36
84 | 1 0.00 99.36
85 | 3 0.00 99.37
86 | 1 0.00 99.37
88 | 6 0.00 99.37
90 | 5 0.00 99.37
91 | 2 0.00 99.37
92 | 2 0.00 99.38
93 | 1 0.00 99.38
94 | 3 0.00 99.38
95 | 4 0.00 99.38
97 | 2 0.00 99.38
99 | 994 0.62 100.00
--more--
[18]:
recast int age_cleaned
age_cleaned: 567 values would be changed; not changed
recast produces an error because Stata does not want to truncate your floats. You can override this with the optional argument force.
[19]:
recast int age_cleaned, force
age_cleaned: 567 values changed
[20]:
tab age_cleaned
age_cleaned | Freq. Percent Cum.
------------+-----------------------------------
0 | 6,349 3.95 3.95
1 | 168 0.10 4.05
2 | 37 0.02 4.07
3 | 23 0.01 4.09
4 | 12 0.01 4.10
5 | 83 0.05 4.15
6 | 21 0.01 4.16
7 | 5 0.00 4.16
8 | 3 0.00 4.17
9 | 5 0.00 4.17
10 | 42 0.03 4.20
11 | 61 0.04 4.23
12 | 268 0.17 4.40
13 | 811 0.50 4.90
14 | 2,383 1.48 6.39
15 | 4,893 3.04 9.43
16 | 7,340 4.56 13.99
17 | 9,063 5.63 19.62
18 | 10,175 6.33 25.95
19 | 9,677 6.02 31.97
20 | 11,168 6.94 38.91
21 | 7,727 4.80 43.71
22 | 8,182 5.09 48.80
23 | 7,487 4.65 53.46
24 | 5,404 3.36 56.81
25 | 5,988 3.72 60.54
26 | 4,790 2.98 63.52
27 | 3,834 2.38 65.90
28 | 3,499 2.18 68.07
29 | 2,965 1.84 69.92
30 | 3,781 2.35 72.27
31 | 2,577 1.60 73.87
32 | 2,891 1.80 75.67
33 | 2,870 1.78 77.45
34 | 2,405 1.50 78.95
35 | 2,774 1.72 80.67
36 | 2,149 1.34 82.01
37 | 2,187 1.36 83.37
38 | 2,157 1.34 84.71
39 | 2,151 1.34 86.05
40 | 2,620 1.63 87.67
41 | 1,776 1.10 88.78
42 | 2,081 1.29 90.07
43 | 1,868 1.16 91.23
44 | 1,374 0.85 92.09
45 | 1,530 0.95 93.04
46 | 1,143 0.71 93.75
47 | 1,086 0.68 94.42
48 | 945 0.59 95.01
49 | 814 0.51 95.52
50 | 873 0.54 96.06
51 | 585 0.36 96.42
52 | 705 0.44 96.86
53 | 638 0.40 97.26
54 | 429 0.27 97.53
55 | 536 0.33 97.86
56 | 328 0.20 98.06
57 | 308 0.19 98.25
58 | 265 0.16 98.42
59 | 225 0.14 98.56
60 | 233 0.14 98.70
61 | 134 0.08 98.79
62 | 170 0.11 98.89
63 | 131 0.08 98.97
64 | 76 0.05 99.02
65 | 90 0.06 99.08
66 | 57 0.04 99.11
67 | 53 0.03 99.15
68 | 39 0.02 99.17
69 | 39 0.02 99.19
70 | 48 0.03 99.22
71 | 27 0.02 99.24
72 | 31 0.02 99.26
73 | 32 0.02 99.28
74 | 21 0.01 99.29
75 | 32 0.02 99.31
76 | 15 0.01 99.32
77 | 19 0.01 99.33
78 | 12 0.01 99.34
79 | 9 0.01 99.35
80 | 5 0.00 99.35
81 | 7 0.00 99.36
82 | 6 0.00 99.36
83 | 7 0.00 99.36
84 | 1 0.00 99.36
85 | 3 0.00 99.37
86 | 1 0.00 99.37
88 | 6 0.00 99.37
90 | 5 0.00 99.37
91 | 2 0.00 99.37
92 | 2 0.00 99.38
93 | 1 0.00 99.38
94 | 3 0.00 99.38
95 | 4 0.00 99.38
97 | 2 0.00 99.38
99 | 994 0.62 100.00
------------+-----------------------------------
--more--