Do I need to clean my data?

The short answer is always assume your data needs to be cleaned. Data collection is rarely perfect, and even if the data is collected perfectly there are multiple points in the data pipeline where errors can be introduced. Data errors can be catastrophic for a myriad of reasons; modern analytic methods require data to confirm to certain standards for your analysis to be valid, and in some cases regression models will not run if your data is not properly cleaned. For the purposes of this demonstration I will be using the 2003 Stop, Question and Frisk dataset from the NYPD. I also recommend downloading the data dictionary from here.

Because the data is not in stata format (*.dta), we must import it into stata using the import command. In this case we will be using import delimited because the data is in text format with a delimiter. A delimiter separates values in your dataset. The most common type of delimiter is a comma, and files stored in this format usually end in .csv, which stands for comma separated values

[1]:
import delimited /home/omari/Downloads/2003.csv, clear
(111 vars, 160,851 obs)

Before proceeding to clean your data, it is often useful to look at metadata. Metadata is simply data that describes your data, and is usually available for public use datasets. The most common forms of metadata you’ll encounter are data dictionaries and codebooks. See a detailed description of codebooks here. Data dictonaries are much leaner than codebooks, and focus primarily on providing information about data formats.

[2]:
/* This is a Stata comment. It allows us to write notes in our code that Stata will not execute.
If we do not specify it as a comment, Stata will attempt run our notes as code and will return a syntax error.
Multi line comments must be enclosed between /* ... */ */

// Alternatively, single line comments only require // and don't need to be enclosed.

// describe lists all variables in our datasets and gives information about their storage type
describe

Contains data
  obs:       160,851
 vars:           111
 size:    58,388,913
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
year            int     %8.0g
pct             int     %8.0g
ser_num         long    %12.0g
datestop        long    %12.0g
timestop        str5    %9s
recstat         str1    %9s
inout           str1    %9s
trhsloc         str1    %9s
perobs          int     %8.0g
crimsusp        str30   %30s
perstop         int     %8.0g
typeofid        str1    %9s
explnstp        str1    %9s
othpers         str1    %9s
arstmade        str1    %9s
arstoffn        str30   %30s
sumissue        str1    %9s
sumoffen        str30   %30s
compyear        byte    %8.0g
comppct         byte    %8.0g
offunif         str1    %9s
officrid        str1    %9s
frisked         str1    %9s
searched        str1    %9s
contrabn        str1    %9s
adtlrept        str1    %9s
pistol          str1    %9s
riflshot        str1    %9s
asltweap        str1    %9s
knifcuti        str1    %9s
machgun         str1    %9s
othrweap        str1    %9s
pf_hands        str1    %9s
pf_wall         str1    %9s
pf_grnd         str1    %9s
pf_drwep        str1    %9s
pf_ptwep        str1    %9s
pf_baton        str1    %9s
pf_hcuff        str1    %9s
pf_pepsp        str1    %9s
pf_other        str1    %9s
radio           str1    %9s
ac_rept         str1    %9s
ac_inves        str1    %9s
rf_vcrim        str1    %9s
rf_othsw        str1    %9s
ac_proxm        str1    %9s
rf_attir        str1    %9s
cs_objcs        str1    %9s
cs_descr        str1    %9s
cs_casng        str1    %9s
cs_lkout        str1    %9s
rf_vcact        str1    %9s
cs_cloth        str1    %9s
cs_drgtr        str1    %9s
ac_evasv        str1    %9s
ac_assoc        str1    %9s
cs_furtv        str1    %9s
rf_rfcmp        str1    %9s
ac_cgdir        str1    %9s
rf_verbl        str1    %9s
cs_vcrim        str1    %9s
cs_bulge        str1    %9s
cs_other        str1    %9s
ac_incid        str1    %9s
ac_time         str1    %9s
rf_knowl        str1    %9s
ac_stsnd        str1    %9s
ac_other        str1    %9s
sb_hdobj        str1    %9s
sb_outln        str1    %9s
sb_admis        str1    %9s
sb_other        str1    %9s
repcmd          str4    %9s
revcmd          str4    %9s
rf_furt         str1    %9s
rf_bulg         str1    %9s
offverb         str1    %9s
offshld         str1    %9s
sex             str1    %9s
race            str1    %9s
dob             str8    %9s
age             str4    %9s
ht_feet         str1    %9s
ht_inch         str2    %9s
weight          str5    %9s
haircolr        str2    %9s
eyecolor        str2    %9s
build           str1    %9s
othfeatr        str1    %9s
addrtyp         str1    %9s
--more--

The two most frequent types of variables you’ll encounter in Stata are character and numeric. Numeric variables can be stored as int, byte, float, long and double; character variables are stored as str (strings). The number at the end of str gives the width of the string, which is the maximum number of characters the string can hold. An important part of data cleaning is ensuring your variables are in the right format.

[3]:
/* The describe command from earlier can take variables as arguments. */
describe age

              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
age             str4    %9s
[4]:
sum age

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |          0
[5]:
tab age

        age |      Freq.     Percent        Cum.
------------+-----------------------------------
            |        162        0.10        0.10
          0 |      6,349        3.95        4.05
          1 |        168        0.10        4.15
         10 |         42        0.03        4.18
        103 |          1        0.00        4.18
         11 |         61        0.04        4.22
        110 |          1        0.00        4.22
        113 |          1        0.00        4.22
        114 |          1        0.00        4.22
        115 |          1        0.00        4.22
         12 |        268        0.17        4.39
        120 |          5        0.00        4.39
        121 |          1        0.00        4.39
       1271 |          1        0.00        4.39
         13 |        811        0.50        4.89
        130 |          7        0.00        4.90
       1316 |          1        0.00        4.90
        132 |          1        0.00        4.90
        135 |          3        0.00        4.90
         14 |      2,383        1.48        6.38
        140 |         13        0.01        6.39
        145 |          6        0.00        6.40
        147 |          1        0.00        6.40
        148 |          1        0.00        6.40
        149 |          2        0.00        6.40
         15 |      4,893        3.04        9.44
        150 |         12        0.01        9.45
        154 |          1        0.00        9.45
        155 |          5        0.00        9.45
        156 |          1        0.00        9.45
        157 |          1        0.00        9.45
        158 |          1        0.00        9.45
         16 |      7,340        4.56       14.02
        160 |         10        0.01       14.02
        165 |         10        0.01       14.03
       1655 |          1        0.00       14.03
        169 |          1        0.00       14.03
         17 |      9,063        5.63       19.66
        170 |          8        0.00       19.67
       1742 |          1        0.00       19.67
        175 |          3        0.00       19.67
         18 |     10,175        6.33       26.00
        180 |         11        0.01       26.00
        181 |          1        0.00       26.00
        185 |          7        0.00       26.01
        187 |          2        0.00       26.01
        189 |          5        0.00       26.01
         19 |      9,677        6.02       32.03
        190 |          9        0.01       32.04
        195 |          4        0.00       32.04
        196 |          1        0.00       32.04
        198 |          2        0.00       32.04
          2 |         37        0.02       32.06
         20 |     11,168        6.94       39.01
        200 |          4        0.00       39.01
        205 |          6        0.00       39.01
        206 |          3        0.00       39.01
         21 |      7,727        4.80       43.82
        210 |          4        0.00       43.82
        213 |          3        0.00       43.82
        214 |          2        0.00       43.82
        215 |          1        0.00       43.82
        219 |          1        0.00       43.82
         22 |      8,182        5.09       48.91
        220 |          5        0.00       48.91
        221 |          2        0.00       48.92
        222 |          2        0.00       48.92
        223 |          3        0.00       48.92
        224 |          2        0.00       48.92
        225 |          1        0.00       48.92
         23 |      7,487        4.65       53.58
        230 |          3        0.00       53.58
        232 |          1        0.00       53.58
        235 |          2        0.00       53.58
        236 |          1        0.00       53.58
        239 |          1        0.00       53.58
         24 |      5,404        3.36       56.94
        240 |          3        0.00       56.94
        241 |          1        0.00       56.94
        244 |          1        0.00       56.94
        245 |          1        0.00       56.94
        247 |          3        0.00       56.95
         25 |      5,988        3.72       60.67
        250 |          1        0.00       60.67
        254 |          1        0.00       60.67
        255 |          1        0.00       60.67
        258 |          1        0.00       60.67
         26 |      4,223        2.63       63.30
        265 |          1        0.00       63.30
        269 |          2        0.00       63.30
         27 |      3,834        2.38       65.68
        270 |          2        0.00       65.68
         28 |      3,499        2.18       67.86
        280 |          3        0.00       67.86
        285 |          1        0.00       67.86
        287 |          2        0.00       67.86
         29 |      2,965        1.84       69.70
--more--

Stata returns 0 observations when we try to calculate the mean and standard deviation of age using the sum (summarize) command, However, when we tab (tabulate) it shows us a range of values and more than 0 observations. This is because age is stored as a string variable, and we are unable to perform arithmetic operations on string variables. In order for us to include age in our analysis, we must convert it to a numeric variable. As we saw earlier, there are multiple formats for numerical variables, so which format should we convert age to? byte, int, and long can only store integers (whole numbers), whereas float and double can store both integers and decimals. float and double consume more memory to store information, so it is not computationally efficient to store a whole numbers as a float. See this Stata help file for more information. To determine what numeric format we should convert age to we must first examine the values of age.

We see that all the values of age are whole numbers, so we should not convert it to a float or double. There are also values of age that exceeds 100, so it cannot be stored as a byte. Logically, it makes sense to store it as an integer. We’ll do so using the destring command.

[8]:
destring age, gen(age_cleaned)
age: all characters numeric; age_cleaned generated as int
(162 missing values generated)

Byte can only store a maximum value of 100, and we can see from the frequency table that there are values of age that exceed 100. Because of this Stata choses to convert age to int, and our optional argument gen() creates a new variable (or column). Stata stores all datasets as 2-dimensional arrays; variables are stored as columns and observations are stored as rows. Alternatively, we could have specified replace in lieu of gen() and Stata would have overwritten our age column instead of creating a new one. This option is risky and should be used with caution. Another thing you may have noticed is that Stata created 162 missing values. This is the result of Stata converting an empty string to a numeric value. The first row of our frequency table shows 162 people with no values for age.

[15]:
sum age_cleaned

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
 age_cleaned |    160,689    31.76657    225.3125          0       9999
[7]:
sum age_cleaned, d

                         age_cleaned
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%           14              0
10%           16              0       Obs             160,689
25%           18              0       Sum of Wgt.     160,689

50%           23                      Mean           31.76657
                        Largest       Std. Dev.      225.3125
75%           32           9999
90%           43           9999       Variance       50765.73
95%           49           9999       Skewness       43.67801
99%           72           9999       Kurtosis       1930.646

We’re now able to calculate the mean of age using our “cleaned” age variable. However our detailed summary (that’s what the , d stands for) shows that we have rather unrealistic values of age, such as 9999. The most straightforward way to deal with unrealistic values of a variable is to replaced them as missing, thus removing the values from our analysis and preventing them from potentially skewing our results. While this may be the most straightforward way, it is often the least desirable way due to the loss of observations.

[9]:
replace age_cleaned = . if age_cleaned>100
(405 real changes made, 405 to missing)
[10]:
sum age_cleaned, d

                         age_cleaned
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%           14              0
10%           16              0       Obs             160,284
25%           18              0       Sum of Wgt.     160,284

50%           23                      Mean           26.01252
                        Largest       Std. Dev.       12.9699
75%           32             99
90%           42             99       Variance       168.2183
95%           49             99       Skewness        1.52006
99%           64             99       Kurtosis       8.901667

We now see that the mean value of age is slightly lower, and the standard deviation is much smaller. Given the extremely small number of observations that are lost due to converting the extreme values to missing (405 out of 160,851), it may be sufficient to simply leave the values as missing. However if we were to lose more substantial amounts of observations by converting the extreme values to missing, it would be necessary to exlpore ways to impute those missing values with a more realistic number. Imputing the mean value is a popular choice. There are also statistical methods that can be used to impute missing values by way of predictive modeling. The latter is much more advanced than this tutorial aims to be, so we’ll resort to the former.

[11]:
return list

scalars:
                r(p99) =  64
                r(p95) =  49
                r(p90) =  42
                r(p75) =  32
                r(p50) =  23
                r(p25) =  18
                r(p10) =  16
                 r(p5) =  14
                 r(p1) =  0
                r(max) =  99
                r(min) =  0
                r(sum) =  4169391
           r(kurtosis) =  8.901666570994999
           r(skewness) =  1.520059538798137
                 r(sd) =  12.96990027108677
                r(Var) =  168.2183130419367
               r(mean) =  26.01252152429438
              r(sum_w) =  160284
                  r(N) =  160284

Stata stores the results of most of its computational commands in temporary scalars/vectors/matrices that can be viewed with return list. It is temporary because the next computational command will overwrite the current content of return list, and all content is lost when Stata is closed.

[12]:
replace age_cleaned = r(mean) if age_cleaned==.
variable age_cleaned was int now float
(567 real changes made)

We’ ve now replaced the 405 missing values we created in our previous step, as well as the 162 missing values that were already by imputing the mean value.

[15]:
sum age_cleaned, d

                         age_cleaned
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%           14              0
10%           16              0       Obs             160,851
25%           18              0       Sum of Wgt.     160,851

50%           23                      Mean           26.01252
                        Largest       Std. Dev.      12.94702
75%           32             99
90%           42             99       Variance       167.6253
95%           48             99       Skewness       1.522746
99%           64             99       Kurtosis       8.933156

Because r(mean) was a float (had decimal places), it changed our age_cleaned from int to float. In order to return to integer, we can use the recast command which will truncate all floats and keep only the numbers to the left of the decimal. Note, truncation does not round up or round down. It only returns numbers to the left of the decimal.

[17]:
tab age_cleaned

age_cleaned |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      6,349        3.95        3.95
          1 |        168        0.10        4.05
          2 |         37        0.02        4.07
          3 |         23        0.01        4.09
          4 |         12        0.01        4.10
          5 |         83        0.05        4.15
          6 |         21        0.01        4.16
          7 |          5        0.00        4.16
          8 |          3        0.00        4.17
          9 |          5        0.00        4.17
         10 |         42        0.03        4.20
         11 |         61        0.04        4.23
         12 |        268        0.17        4.40
         13 |        811        0.50        4.90
         14 |      2,383        1.48        6.39
         15 |      4,893        3.04        9.43
         16 |      7,340        4.56       13.99
         17 |      9,063        5.63       19.62
         18 |     10,175        6.33       25.95
         19 |      9,677        6.02       31.97
         20 |     11,168        6.94       38.91
         21 |      7,727        4.80       43.71
         22 |      8,182        5.09       48.80
         23 |      7,487        4.65       53.46
         24 |      5,404        3.36       56.81
         25 |      5,988        3.72       60.54
         26 |      4,223        2.63       63.16
   26.01252 |        567        0.35       63.52
         27 |      3,834        2.38       65.90
         28 |      3,499        2.18       68.07
         29 |      2,965        1.84       69.92
         30 |      3,781        2.35       72.27
         31 |      2,577        1.60       73.87
         32 |      2,891        1.80       75.67
         33 |      2,870        1.78       77.45
         34 |      2,405        1.50       78.95
         35 |      2,774        1.72       80.67
         36 |      2,149        1.34       82.01
         37 |      2,187        1.36       83.37
         38 |      2,157        1.34       84.71
         39 |      2,151        1.34       86.05
         40 |      2,620        1.63       87.67
         41 |      1,776        1.10       88.78
         42 |      2,081        1.29       90.07
         43 |      1,868        1.16       91.23
         44 |      1,374        0.85       92.09
         45 |      1,530        0.95       93.04
         46 |      1,143        0.71       93.75
         47 |      1,086        0.68       94.42
         48 |        945        0.59       95.01
         49 |        814        0.51       95.52
         50 |        873        0.54       96.06
         51 |        585        0.36       96.42
         52 |        705        0.44       96.86
         53 |        638        0.40       97.26
         54 |        429        0.27       97.53
         55 |        536        0.33       97.86
         56 |        328        0.20       98.06
         57 |        308        0.19       98.25
         58 |        265        0.16       98.42
         59 |        225        0.14       98.56
         60 |        233        0.14       98.70
         61 |        134        0.08       98.79
         62 |        170        0.11       98.89
         63 |        131        0.08       98.97
         64 |         76        0.05       99.02
         65 |         90        0.06       99.08
         66 |         57        0.04       99.11
         67 |         53        0.03       99.15
         68 |         39        0.02       99.17
         69 |         39        0.02       99.19
         70 |         48        0.03       99.22
         71 |         27        0.02       99.24
         72 |         31        0.02       99.26
         73 |         32        0.02       99.28
         74 |         21        0.01       99.29
         75 |         32        0.02       99.31
         76 |         15        0.01       99.32
         77 |         19        0.01       99.33
         78 |         12        0.01       99.34
         79 |          9        0.01       99.35
         80 |          5        0.00       99.35
         81 |          7        0.00       99.36
         82 |          6        0.00       99.36
         83 |          7        0.00       99.36
         84 |          1        0.00       99.36
         85 |          3        0.00       99.37
         86 |          1        0.00       99.37
         88 |          6        0.00       99.37
         90 |          5        0.00       99.37
         91 |          2        0.00       99.37
         92 |          2        0.00       99.38
         93 |          1        0.00       99.38
         94 |          3        0.00       99.38
         95 |          4        0.00       99.38
         97 |          2        0.00       99.38
         99 |        994        0.62      100.00
--more--
[18]:
recast int age_cleaned
age_cleaned:  567 values would be changed; not changed

recast produces an error because Stata does not want to truncate your floats. You can override this with the optional argument force.

[19]:
recast int age_cleaned, force
age_cleaned:  567 values changed
[20]:
tab age_cleaned

age_cleaned |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      6,349        3.95        3.95
          1 |        168        0.10        4.05
          2 |         37        0.02        4.07
          3 |         23        0.01        4.09
          4 |         12        0.01        4.10
          5 |         83        0.05        4.15
          6 |         21        0.01        4.16
          7 |          5        0.00        4.16
          8 |          3        0.00        4.17
          9 |          5        0.00        4.17
         10 |         42        0.03        4.20
         11 |         61        0.04        4.23
         12 |        268        0.17        4.40
         13 |        811        0.50        4.90
         14 |      2,383        1.48        6.39
         15 |      4,893        3.04        9.43
         16 |      7,340        4.56       13.99
         17 |      9,063        5.63       19.62
         18 |     10,175        6.33       25.95
         19 |      9,677        6.02       31.97
         20 |     11,168        6.94       38.91
         21 |      7,727        4.80       43.71
         22 |      8,182        5.09       48.80
         23 |      7,487        4.65       53.46
         24 |      5,404        3.36       56.81
         25 |      5,988        3.72       60.54
         26 |      4,790        2.98       63.52
         27 |      3,834        2.38       65.90
         28 |      3,499        2.18       68.07
         29 |      2,965        1.84       69.92
         30 |      3,781        2.35       72.27
         31 |      2,577        1.60       73.87
         32 |      2,891        1.80       75.67
         33 |      2,870        1.78       77.45
         34 |      2,405        1.50       78.95
         35 |      2,774        1.72       80.67
         36 |      2,149        1.34       82.01
         37 |      2,187        1.36       83.37
         38 |      2,157        1.34       84.71
         39 |      2,151        1.34       86.05
         40 |      2,620        1.63       87.67
         41 |      1,776        1.10       88.78
         42 |      2,081        1.29       90.07
         43 |      1,868        1.16       91.23
         44 |      1,374        0.85       92.09
         45 |      1,530        0.95       93.04
         46 |      1,143        0.71       93.75
         47 |      1,086        0.68       94.42
         48 |        945        0.59       95.01
         49 |        814        0.51       95.52
         50 |        873        0.54       96.06
         51 |        585        0.36       96.42
         52 |        705        0.44       96.86
         53 |        638        0.40       97.26
         54 |        429        0.27       97.53
         55 |        536        0.33       97.86
         56 |        328        0.20       98.06
         57 |        308        0.19       98.25
         58 |        265        0.16       98.42
         59 |        225        0.14       98.56
         60 |        233        0.14       98.70
         61 |        134        0.08       98.79
         62 |        170        0.11       98.89
         63 |        131        0.08       98.97
         64 |         76        0.05       99.02
         65 |         90        0.06       99.08
         66 |         57        0.04       99.11
         67 |         53        0.03       99.15
         68 |         39        0.02       99.17
         69 |         39        0.02       99.19
         70 |         48        0.03       99.22
         71 |         27        0.02       99.24
         72 |         31        0.02       99.26
         73 |         32        0.02       99.28
         74 |         21        0.01       99.29
         75 |         32        0.02       99.31
         76 |         15        0.01       99.32
         77 |         19        0.01       99.33
         78 |         12        0.01       99.34
         79 |          9        0.01       99.35
         80 |          5        0.00       99.35
         81 |          7        0.00       99.36
         82 |          6        0.00       99.36
         83 |          7        0.00       99.36
         84 |          1        0.00       99.36
         85 |          3        0.00       99.37
         86 |          1        0.00       99.37
         88 |          6        0.00       99.37
         90 |          5        0.00       99.37
         91 |          2        0.00       99.37
         92 |          2        0.00       99.38
         93 |          1        0.00       99.38
         94 |          3        0.00       99.38
         95 |          4        0.00       99.38
         97 |          2        0.00       99.38
         99 |        994        0.62      100.00
------------+-----------------------------------
--more--