NLSY79 Appendix 22: Migration Distance Variables for Respondent Locations

To support research on respondent mobility, we created a series of variables for the distance between respondent addresses at each interview round. This supplements the data on state and county of residence in the Geocode release. These variables are available only on the Geocode data release. Researchers interested in obtaining the geocode CD must apply to the Bureau of Labor Statistics. Information on the application process is available on the BLS website (see https://www.bls.gov/nls/geocodeapp.htm). The program that computes the migration distance variables can be found at the end of this appendix.

The distance between the respondent's addresses at each date of interview was created for all unique pairs of survey years, with names in the form DISTANCE_DTyrA_yrB. (To avoid unnecessary duplication, these distance-based variables are defined between years A and B only for A>B.) The data described here do not provide a location for the respondent's residence; these variables provide only the distances between the various places the respondent lived. This pairwise matrix of variables supports several types of migration research: users can examine the distance between residences and identify return migration to an area where the respondent lived in the past.
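
One way to use the pairwise matrix to flag return migration can be sketched as follows. This is a minimal Python illustration, not part of the released program: the dictionary layout, the function name find_return_moves, and the 5-mile and 50-mile thresholds are all assumptions chosen for the example. Negative values are skipped because the program listing at the end of this appendix assigns -4 and -5 as missing-data codes.

```python
from itertools import combinations

def find_return_moves(dist, near=5.0, far=50.0):
    """Return (origin, away, back) year triples that look like return migration.

    dist maps (later_year, earlier_year) -> miles between the two interview
    addresses, mirroring the A>B convention. The thresholds are illustrative
    assumptions, not values used in the data release.
    """
    years = sorted({year for pair in dist for year in pair})
    found = []
    for origin, away, back in combinations(years, 3):
        d_away = dist.get((away, origin))
        d_back = dist.get((back, origin))
        # Skip pairs that are absent or carry a negative missing-data code.
        if d_away is None or d_back is None or d_away < 0 or d_back < 0:
            continue
        # Far from the origin address at one interview, near it again later.
        if d_away >= far and d_back <= near:
            found.append((origin, away, back))
    return found

# Hypothetical respondent: away in 1982, back near the 1980 address by 1984.
example = {
    (1982, 1980): 310.25,
    (1984, 1980): 1.40,
    (1984, 1982): 309.80,
}
print(find_return_moves(example))
```

The appropriate thresholds depend on the research question; a cutoff of a few miles treats a move within the same neighborhood as a return to the same place.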

In addition to the set of distance variables, we created indicators of the quality of the geographic data. For a variety of reasons, we may not have an address for the respondent. In such cases we geocode the respondent to the centroid of the zipcode when we can determine the zipcode. To identify these cases, an indicator for the quality of the distance measure was created based on the quality of the matches in both years. This variable, DISTANCE_DFyrA_yrB, takes a value of 1 if the addresses in both years were exact address matches, a value of 2 if one match was based on the zip centroid and the other was an exact address match, and a value of 3 if both were zip centroid matches.
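
The three-way quality coding can be illustrated with a short Python sketch. It assumes each year carries a flag equal to 0 for an exact address match and 1 for a zip-centroid match; the function name match_quality is hypothetical and the sketch is not the released program.

```python
def match_quality(centroid_a, centroid_b):
    """Combine two per-year zip-centroid flags into the 1/2/3 quality code.

    Each flag is assumed to be 0 (exact address match) or 1 (zip-centroid
    match). The result is 1 when both are exact, 2 when exactly one is a
    centroid match, and 3 when both are centroid matches.
    """
    centroid_count = int(bool(centroid_a)) + int(bool(centroid_b))
    return centroid_count + 1

for flags in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(flags, match_quality(*flags))
```

A user who wants only the most precise distances could restrict an analysis to pairs with a quality code of 1.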

Finally, DISTANCE_DZCyrA_yrB, an indicator for whether the respondent was located in the same zip code, was created for all pairs of years.
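
As a rough illustration of the zip code comparison, the Python sketch below follows the coding that appears in the program listing at the end of this appendix, where the flag is 0 when the two zip codes match and 1 when they differ. The function name and the use of None for an undetermined value are assumptions; the released program additionally requires latitude and longitude in both years and then recodes missing values to -4 or -5.

```python
def zip_changed(zip_a, zip_b):
    """Return 0 if the zip codes match, 1 if they differ, None if either is blank."""
    a = (zip_a or "").strip()
    b = (zip_b or "").strip()
    if not a or not b:
        return None  # cannot compare when a zip code is missing
    return 0 if a == b else 1

print(zip_changed("43210", "43210"))
print(zip_changed("43210", "10027"))
print(zip_changed("43210", " "))
```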

Data Notes and Variable Construction:

The large number of "Don't Know" values for 1979 and 1980 reflects the weakness of the address data for the 1980 wave, in which addresses are missing at a high rate. The 1980 addresses were recovered from the archives only a few years ago, and what we have represents considerable effort to reconstitute the historical record.

Respondent address data can be rather messy, with some fields of an address either missing or incorrectly carried over from a previous round. The most reliable address components we have are the state and county of residence at each interview round. Archivists put considerable effort into examining conflicting pieces of evidence to arrive at the most reasonable judgment as to where the respondent lived at the time of the interview. Users should bear in mind that some respondents live in temporary quarters with friends or relatives or in a shelter, work at remote locations where the employer supplies a place to stay, are homeless or living out of a car, or choose not to say where they live. Some addresses are RFD or descriptive (at that trailer park just outside of town), and some respondents use a PO Box, making it impossible to assign longitude and latitude to the address. Archivists consulted a variety of data resources in both the interview form and records from the field to make as accurate a determination as possible. For example, in years when landlines dominated telephony, area codes and exchanges provided reliable data on location because interviewers used the telephone heavily in setting appointments and, in some cases, in conducting the interview over the phone.

These distance data were constructed in 2009. We proceeded under the assumption that any attempt to correct the existing state and county of residence many years after the original geocoding effort was more likely to degrade data quality than improve it. Hence the created distance variables are consistent with the state and county of residence data released in the various waves of the NLSY79 geocode data.

The major task was placing the respondent at a particular location in the county so that we could compute distances. A corollary is that distance measures involving moves from one county to another, and especially from one state to another, will likely contain errors that are a relatively small fraction of the total distance moved. However, the major motivation of this effort is to provide the user with an indication of when a respondent moved back to a location that is relatively near one of their former addresses, signifying a return to a place where the respondent may have an existing network of contacts.

Researchers should be aware of several caveats in using these data.

  1. In years 1984-1989 address information was in principle collected only for individuals who had changed addresses since the date of the previous interview. The raw data from those years show fewer moves from year to year than at any other time during the project. We believe moves are understated in these years. As a consequence, the created variables between pairs of years that include any year 1984-1989 likely understate the degree of mobility that occurred or incorrectly assign mobility to a later date, when address data were collected with greater rigor.
  2. The distance between a respondent's addresses in pairs of years is an imperfect proxy for whether the respondent ever moved between those years. There is a fair amount of return migration, and respondents may have experienced multiple, high-frequency address changes and still be located at the same address at the time of two consecutive interviews despite having moved between them. It is possible to detect some of these interim moves from migration histories collected in selected survey years, but these moves are not represented in these data, which deal only with location at the date of interview.
  3. The frequency of moves between all possible pairs of years, A>B, is higher than it is between adjacent interview years. Consider two individuals with complete data over a 10-year window. There are up to 9 possible moves between consecutive interviews and 45 pairs of years. If individual 1 moves once between consecutive years or 10% of the time between consecutive years, his mean of moving over all year pairs is 20% (assuming the move is in the interior of the interval). If individual 2 moves 5 times or 50% of the time between consecutive years, his mean of moves between all pairs of years is 78%. By similar logic, one long-distance move will generate indications of many long-distance moves among all possible pairs of years.
  4. There have been changes over time in the FIPS (Federal Information Processing Standards) codes for some counties and county equivalents. In each round of data the state and county codes reflect the contemporaneous FIPS codes. In creating the distance-based variables the changes in coding were reconciled. There are cases in which the FIPS codes are different across pairs of years but the change of address variable indicates that the respondent was located at the same address. That is, the distance moved variables we created will show no move whereas an examination of FIPS codes would suggest a move.
  5. Archivists were able to identify respondents' state and county of residence even in cases where the address data were insufficient to locate the residence more precisely within the county. In some cases we could locate the respondent in a zipcode and we used the zipcode centroid as the respondent location. In other cases the zipcode could not be determined, and we could not generate the respondent's latitude and longitude. As a result, there are numerous cases in which the pairwise distance variables are missing but state and county variables are available in both years.
  6. For pairs of addresses within the same county, we reviewed addresses to determine whether what appeared to be different addresses were actually the same place but with differences in spelling or other minor errors that might lead to an address being put in a different location or treated as not translatable into latitude and longitude. For example, 126 USA Street and 126 USO Street may be the same place with a typographical error. We have made an effort to reconcile these differences, eliminating such false moves.
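
The arithmetic behind caveat 3 can be checked with a short Python sketch. It assumes ten interviews and that every move is to a residence the respondent has not occupied before; the residence labels and the function name are hypothetical. The placements of the moves below are assumptions chosen because they reproduce the 20% and 78% figures quoted in caveat 3. Other placements give different shares; for instance, a single move in the middle of the window separates 25 of the 45 pairs, or about 56%.

```python
from itertools import combinations

def share_of_pairs_with_move(residences):
    """Share of all interview-year pairs whose residences differ.

    residences lists one residence label per interview, in interview order.
    """
    pairs = list(combinations(residences, 2))
    moved = sum(1 for a, b in pairs if a != b)
    return moved / len(pairs)

# Ten interviews, one move between the first and second interview.
one_move = ["A"] + ["B"] * 9
# Ten interviews, five moves following a five-interview spell at one address.
five_moves = ["A"] * 5 + ["B", "C", "D", "E", "F"]

print(round(share_of_pairs_with_move(one_move), 2))    # 0.2
print(round(share_of_pairs_with_move(five_moves), 2))  # 0.78
```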

Because software packages that assign latitude and longitude to addresses have evolved over time and differ slightly among vendors, the same address can generate slightly different latitude and longitude based on variation in the software system. We have enforced matching coordinates for matching addresses based upon the presumption that geocoding software has seen a secular improvement so that more recently generated coordinates are most likely more reliable. When the same address generates different coordinates, we accept the more recent coordinates.
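
The rule of accepting the more recent coordinates can be sketched in Python as follows. The record layout (address, year geocoded, latitude, longitude), the sample values, and the function name are hypothetical, and the sketch matches addresses by exact string equality, which is a simplification of the review described above.

```python
def reconcile_coordinates(records):
    """Map each address to the coordinates from its most recent geocoding.

    records is an iterable of (address, geocode_year, lat, lon) tuples; this
    layout is a hypothetical stand-in for the archived address records.
    """
    latest = {}
    for address, geocode_year, lat, lon in records:
        kept = latest.get(address)
        # Keep the newer result when the same address was geocoded twice.
        if kept is None or geocode_year > kept[0]:
            latest[address] = (geocode_year, lat, lon)
    return {addr: (lat, lon) for addr, (_, lat, lon) in latest.items()}

records = [
    ("126 USA Street", 1994, 39.9612, -82.9988),
    ("126 USA Street", 2008, 39.9615, -82.9990),
    ("40 Elm Road", 2000, 41.4993, -81.6944),
]
print(reconcile_coordinates(records))
```

With matching coordinates enforced in this way, an unchanged address yields a distance of exactly zero rather than a small spurious move.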

Programming for Migration Distance Variables

Variable Names in Program                 Variable Names in NLSY79 Data Release
dist_2020_&yra (1979-2018)                distance_d20_[survey year] (1979-2018)
dist_collapsed_2020_&yra (1979-2018)      distance_dt20_[survey year] (1979-2018)
dist_type_2020_&yra (1979-2018)           distance_df20_[survey year] (1979-2018)
zip_changed_2020_&yra (1979-2018)         distance_dzc20_[survey year] (1979-2018)

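* Arrays of rni codes and non-interview flags, one element per round from 1980 to 2018;
array rni (i) rni80-rni94 rni96 rni98 rni00 rni02 rni04 rni06 rni08 rni10 rni12 rni14 rni16 rni18;
array nonint (i) nonint1980-nonint1994 nonint1996 nonint1998 nonint2000 nonint2002 nonint2004 nonint2006 nonint2008 nonint2010 nonint2012 nonint2014 nonint2016 nonint2018;

* Flag a round as a non-interview when its rni code is 60 or above;
do over rni;
  if rni>=60 then nonint=1;
  else nonint=0;
end;

* 1979 is never flagged as a non-interview;
nonint1979=0;

* For 2020 the flag comes from curdate_y, where -5 marks a non-interview;
if curdate_y=-5 then nonint2020=1;
else if curdate_y>0 then nonint2020=0;
run;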

data movematrix2020;
merge r29.longlat79_20_edited rni8020;
by norcid;

*calculate distances;
%macro survey_y(yra);
%if &yra < 2020 %then %do;
  latA = lat&yra;
  latB = lat2020;
  longA = long&yra;
  longB = long2020;
  centrA = zip_centroid&yra;
  centrB = zip_centroid2020;

  if (latA ~= . and latB ~= .) then do;
    if latA = latB and longA = longB then do;
      dist = 0;
    end;
    else do;
      * Convert to radians (multiply by pi/180);
      latA = latA * 0.017453;
      longA = longA * 0.017453;
      latB = latB * 0.017453;
      longB = longB * 0.017453;

      * Haversine great-circle distance in miles (radius 3963.0);
      dlat = latB - latA;
      dlong = longB - longA;

      a1 = (sin(dlat/2))**2;
      a2 = cos(latA) * cos(latB) * ((sin(dlong/2))**2);
      a = a1 + a2;
      dist = 2.0 * 3963.0 * atan2( sqrt(a), sqrt(1-a) );

      * Round to two decimal places;
      dist = round(dist * 100)/100;
    end;

    * collapsed variable;
    if dist = 0 then dist_collapsed = 0;
    else if dist < 0.189393939 then dist_collapsed = 1;* 1000 ft;
    else if dist < 1 then dist_collapsed = 2;
    else if dist < 5 then dist_collapsed = 3;
    else if dist < 20 then dist_collapsed = 4;
    else if dist < 50 then dist_collapsed = 5;
    else if dist < 100 then dist_collapsed = 6;
    else if dist < 500 then dist_collapsed = 7;
    else if dist >=500 then dist_collapsed = 8;

    /* Distance precision type depending on the zip-centroid flag */
    if latA ~= . and latB ~= . then do;
      if centrA = 0 and centrB = 0 then dist_type = 1;
      else if centrA = 1 and centrB = 1 then dist_type = 3;
      else dist_type = 2;
    end;

    dist_2020_&yra = dist;
    dist_collapsed_2020_&yra = dist_collapsed;
    dist_type_2020_&yra = dist_type;
  end;

  * Did the zipcode change ?;
  if zip&yra = ' ' or zip2020 = ' ' then zip_changed = .;
  else if zip&yra = zip2020 & latA~=. & latB~=. then zip_changed=0;
  else if zip&yra ~= zip2020 & latA~=. & latB~=. then zip_changed=1;

  if latA ~= . and latB ~= . then zip_changed_2020_&yra = zip_changed;
  else zip_changed_2020_&yra = .;

  * non-interview;
  if dist_2020_&yra=. and (nonint&yra=1 or nonint2020=1) then dist_2020_&yra=-5;
  if dist_collapsed_2020_&yra=. and (nonint&yra=1 or nonint2020=1) then dist_collapsed_2020_&yra=-5;
  if dist_type_2020_&yra=. and (nonint&yra=1 or nonint2020=1) then dist_type_2020_&yra=-5;
  if zip_changed_2020_&yra=. and (nonint&yra=1 or nonint2020=1) then zip_changed_2020_&yra=-5;

  * invalid missing;
  if dist_2020_&yra=. then dist_2020_&yra=-4;
  if dist_collapsed_2020_&yra=. then dist_collapsed_2020_&yra=-4;
  if dist_type_2020_&yra=. then dist_type_2020_&yra=-4;
  if zip_changed_2020_&yra=. then zip_changed_2020_&yra=-4;

%end;
%mend;

%macro rd;
  %do yra = 1979 %to 1994;
    %survey_y(&yra);
  %end;
  %do yra = 1996 %to 2018 %by 2;
    %survey_y(&yra);
  %end;
%mend;

%rd;
run;
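
For readers who want to check the distance arithmetic outside SAS, the core calculation can be mirrored in Python. This is a sketch for cross-checking only, not the program that produced the released variables. It reuses the constants from the listing above (radius 3963.0 miles and the truncated conversion factor 0.017453); the function names are hypothetical, and rounding is done with floor(x + 0.5) on the assumption that it should match SAS round() for positive values, since Python's built-in round() treats ties differently.

```python
import math

EARTH_RADIUS_MILES = 3963.0  # radius used in the SAS listing
DEG_TO_RAD = 0.017453        # truncated pi/180, as in the SAS listing
# Upper bounds (in miles) for collapsed categories 1 through 7.
CUTOFFS = [0.189393939, 1, 5, 20, 50, 100, 500]

def distance_miles(lat_a, lon_a, lat_b, lon_b):
    """Haversine distance in miles, rounded to two decimal places."""
    if lat_a == lat_b and lon_a == lon_b:
        return 0.0
    lat_a, lon_a, lat_b, lon_b = (
        value * DEG_TO_RAD for value in (lat_a, lon_a, lat_b, lon_b)
    )
    a = (math.sin((lat_b - lat_a) / 2) ** 2
         + math.cos(lat_a) * math.cos(lat_b)
         * math.sin((lon_b - lon_a) / 2) ** 2)
    dist = 2.0 * EARTH_RADIUS_MILES * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return math.floor(dist * 100 + 0.5) / 100

def collapse(dist):
    """Map a distance in miles to the collapsed categories 0 through 8."""
    if dist == 0:
        return 0
    for code, cutoff in enumerate(CUTOFFS, start=1):
        if dist < cutoff:
            return code
    return 8

# Hypothetical coordinates one degree of latitude apart.
d = distance_miles(40.0, -75.0, 41.0, -75.0)
print(d, collapse(d))
```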