Geocode data format
This document discusses the creation of the variables available on the NLSY79 geocode data files. The geocode data are now released as comma-delimited ASCII data with support files. Because of the volume of variables, the data have been split into five sets of files based on content:
The geocode CD also includes text/ASCII data files as well as SPSS, SAS and STATA programs to read them. In addition, the geocode CD contains documentation files (user’s guide and codebook supplement) for the NLSY79 in HTML format. The data files contain variables that are only available on the restricted geocode release, plus the identification number to allow merging with the main 1979-2008 public data.
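As an illustration of merging on the identification number, the following is a minimal sketch using only the Python standard library. The column names (CASEID, STATE_FIPS_1979, COUNTY_FIPS_1979, SEX) and the sample rows are hypothetical; the actual names are given in the codebook and in the supplied SPSS, SAS and STATA programs.

```python
import csv
import io

# Hypothetical extracts standing in for the comma-delimited files.
# Real column names come from the codebook, not from this sketch.
GEOCODE_CSV = """CASEID,STATE_FIPS_1979,COUNTY_FIPS_1979
1,39,049
2,17,031
"""
MAIN_CSV = """CASEID,SEX
1,1
2,2
3,1
"""


def read_by_id(text, id_column="CASEID"):
    """Index the rows of a comma-delimited file by respondent ID."""
    return {row[id_column]: row for row in csv.DictReader(io.StringIO(text))}


def merge_on_id(main_rows, geo_rows):
    """Left-join geocode columns onto main-file rows.

    Respondents with no geocode row keep only their main-file columns.
    """
    return {
        case_id: {**row, **geo_rows.get(case_id, {})}
        for case_id, row in main_rows.items()
    }


merged = merge_on_id(read_by_id(MAIN_CSV), read_by_id(GEOCODE_CSV))
```

Values are kept as strings here so that leading zeros in FIPS codes are preserved.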
Geocode data source files and content
County and City Data Books
For survey years 1979-2002, selected variables from the County and City Data Books from various years are provided along with geographic variables from the NLSY79 main data file. No variables from the County and City Data Books are included for survey years 2004-2008.
The county and state of residence for each NLSY79 respondent for each survey year between 1979 and 2002 were matched with the county and state variables in the specific County and City Data Book data files used for each year. Selected county-level or SMSA-level environmental variables were extracted from those files and included in the geocode data. The County and City Data Book data files were prepared by the U.S. Census Bureau. Related printed matter for each of these data files can be found in the County and City Data Book for the specified year. These books are also published by the Census Bureau.
The following is a brief description of the various NLSY79 geocode data for specific survey years and the County and City Data Book data files that were merged with the different years of NLSY79 data:
Variables from the 1988 County and City Data Book data file were selected with an eye toward comparability with the 1983 County and City Data Book variables. Similar considerations were made between the 1994 and 1988 variables. In the absence of updated information from the 1988 County and City Data Book data file, the 1983 County and City Data Book variables were retained. However, some differences do exist between similar variables selected from the various County and City Data Book data files.
The 1983 County and City Data Book data file variables for MSA/NECMA and CMSA have been combined into one 4-digit variable in the 1988 County and City Data Book data file. Therefore, the 1988 County and City Data Book geographic variables correspond to the 1983 County and City Data Book geographic variables in the following manner:
The population by age variables from the 1988 County and City Data Book data file are estimates made for the National Cancer Institute by the Census Bureau. These figures suppress data for counties in which the population is under 20,000. Users should keep this in mind during analysis.
City Reference Files
Another type of data file, the City Reference File (CRF) for various years, was also merged with the NLSY79 data in order to identify the SMSA/MSA for each respondent according to zip code. The City Reference File data files, prepared by the U.S. Census Bureau, contain the Federal Information Processing Standards (FIPS) county and state codes, zip codes, and SMSA/MSA codes.
The following is a list of the various City Reference Files that were merged with the different years of NLSY79 data to identify the SMSA/MSA for each respondent:
Local Exchange Routing Guides
Between 1989 and 1994, a third type of data file was used to verify the geocode information provided by NORC for each respondent. The Local Exchange Routing Guide (LERG) data file is constructed by Bell Communications Research (BELLCORE) and contains address information for the "switches" which regulate each telephone area code and exchange.
Geostatistical Mapping Software
Beginning in 1996, geostatistical mapping software was employed in the geocoding process to assign latitude and longitude coordinates and other geographical codes. Basic standard geographic information such as latitude and longitude was linked to each respondent's address. This was accomplished by matching address data to information in the software database. Matching records were appended with the matching address, coordinates, Census information, and FIPS (Federal Information Processing Standards) codes for state, county, MCD (Minor Civil Division), and MSA (Metropolitan Statistical Area). The software packages used in specific survey years are listed below.
The following briefly outlines the procedures used to create the 1979-88 NLSY79 geocode data.
Hand-Edits and Changes in Matching Procedures
1979-1982
More than 1,000 hand-edits for each survey year from 1979 through 1982 were performed to constrain respondents' reported state, county, and zip codes so that they conformed to legitimate state-county-zip combinations. In some cases, this involved making estimates for one or more of the above items. The state and county codes from the main NLSY79 data for each year that are included in the NLSY79 geocode data are the original, unedited values for the respondents. The hand-edited versions of the state and county codes were used to match with the CRF and with the County and City Data Book data files for these years.
In compiling the 1981 geocode information, a systematic review of hand-edited state, county, and zip codes was also undertaken. All cases that required a hand-edit in any of the three survey years were included in this inspection. The point of this review was: (1) to check for consistency in hand-editing decision rules over the three years, and (2) where possible, to use the respondent's reported geocodes in subsequent years to check on the accuracy of hand-edits performed in preceding years (this was possible for those cases that required hand-edits in early years and which showed no change of residence over the period).
The results of these consistency checks were very encouraging. Only 13 cases turned up that seemed to be in error. These cases had their geocodes revised accordingly. While doing this review, several dozen other cases with keypunch or coding errors in the hand-edit code variables were also uncovered. These errors were also corrected. In any case, this procedure provides substantial validation to the overall hand-editing process.
1983-1988
In 1983-1987, the majority of hand-edits involved the derivation and addition of a zip code. Changes in the matching strategy were made because the zip code was more accurate than either the county or state geocodes taken individually. Some mismatching did occur because the zip code, rather than the county or state code, was in error, but this error rate was smaller than that of an alternative matching algorithm not requiring case-by-case hand-edits. It is probable that some mismatching also occurred because the county itself was in error. Nevertheless, we are confident that matching by zip code improved the quality of the match.
Users are cautioned that matching by state and zip code or by zip code only may result in a higher moving rate between 1987 and the previous interview year than might actually have occurred. We suspect that some NLSY79 county and state geocodes were not updated if the respondent reported an address change prior to the 1987 interview or the previous interview. If the geocodes were not updated in a previous interview, then there would have been an under-reporting of moving to a new county and/or state in that interview year that would now show up with the 1987 NLSY79 data because of the improved matching algorithm. In the 1987 NLSY79 geocode data, if the zip code and state did not match but the zip code alone matched, the state and county were added to the record.
Because the 1988 procedure required both the zip code and state to match, some cases in which the zip code alone matched, and which were possibly in error in 1987, may have been hand-edited in 1988. This may affect mobility rates between 1987 and 1988 to the extent that those inaccurate zip codes in 1987 may have been corrected in the 1988 NLSY79 data.
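The difference between the 1987 and 1988 matching rules can be sketched as follows. This is only an illustration of the logic described above, not the programs actually used: the reference table, its layout, and the function names are assumptions, and the sample codes are made up.

```python
# Hypothetical reference table: zip code -> (state FIPS, county FIPS).
# The real City Reference Files also carry SMSA/MSA codes.
CRF = {
    "43210": ("39", "049"),
    "60601": ("17", "031"),
}


def match_1987(zip_code, state):
    """1987-style rule: a zip-only match is accepted and the state and
    county from the reference table are added to the record."""
    ref = CRF.get(zip_code)
    if ref is None:
        return None  # no match at all; the case needs individual review
    ref_state, ref_county = ref
    return {
        "state": ref_state,
        "county": ref_county,
        "zip_only": state != ref_state,  # True when the reported state disagreed
    }


def match_1988(zip_code, state):
    """1988-style rule: the zip code and the state must both agree."""
    ref = CRF.get(zip_code)
    if ref is None or ref[0] != state:
        return None  # candidate for hand-editing
    return {"state": ref[0], "county": ref[1], "zip_only": False}
```

A record with zip code 43210 and a reported state of 17 is matched (and overwritten) under the 1987 rule but rejected under the 1988 rule, which is how a correction made in 1988 can look like a move between the two years.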
In 1988, more than 1,000 hand-edits were performed. Approximately 56.6% of these involved the derivation and addition of a zip code, while approximately 48.4% involved correction of the state of residence.
We believe that requiring a zip code and state correspondence further improved the accuracy of the resulting matches. In support of this assumption, only approximately 6% of the cases that were actually hand-edited had an invalid county. Because zip codes can extend across adjacent counties, this may even be an overestimate of the actual error.
1981 Changes in SMSA Designations
For those using these data to track the mobility of respondents over the 1979-81 survey years, an additional caution applies. In June of 1981, the OMB announced the designation of 36 new SMSAs, the disqualification of one pre-existing SMSA, and the merger of two pre-existing SMSAs into one new area. The 1973 CRF file was updated by CHRR to reflect these changes, and the updates were applied beginning with the 1980 interview place of residence in the 1980 geocode data file. One consequence of these changes is that when attempting to match places of residence for respondents using data from the 1979 geocode data and separate 1980 updated geocode data, some respondents give the appearance of moving into (or out of) an SMSA between 1979 and 1980 when in fact they may not have moved at all. This faulty inference of mobility would be reached if one compared changes in SMSA designation between the separate 1979 and 1980 updated geocode data.
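One way to screen for this kind of spurious mobility is to flag respondents whose SMSA code changed while their state and county did not. The sketch below assumes hypothetical field names (state, county, smsa) and uses None for "not in an SMSA"; it is a screening heuristic, since an unchanged county does not prove that the respondent did not move.

```python
def apparent_smsa_move(rec_1979, rec_1980):
    """Return True when the SMSA code differs between the two years
    although the state and county of residence are identical."""
    same_place = (rec_1979["state"], rec_1979["county"]) == (
        rec_1980["state"],
        rec_1980["county"],
    )
    smsa_changed = rec_1979["smsa"] != rec_1980["smsa"]
    return same_place and smsa_changed


# Made-up records: the county is unchanged, but it became part of a
# newly designated SMSA in the updated 1980 data.
before = {"state": "39", "county": "049", "smsa": None}
after = {"state": "39", "county": "049", "smsa": "1840"}
moved_county = {"state": "17", "county": "031", "smsa": "1600"}

flag_redesignated = apparent_smsa_move(before, after)
flag_real_move = apparent_smsa_move(before, moved_county)
```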
Users ordering a full complement of geocode data at any given point should not find this discrepancy in mobility. This applies only to those who ordered the 1979 geocode data and then updated that data with single year files in subsequent years. As single year files were no longer available after the 1979-89 release, recent purchasers of the geocode data would have received all available years of the geocode data, and should not detect the discrepancy resulting from the 1981 SMSA changes between the 1979 and 1980 separate data files. A variable representing the 1981 SMSA designation (if applicable) of place of residence at interview is currently present in the geocode data for all survey years, including 1979.
It is possible, however, that the created variable based upon SMSA of residence and found in the main NLSY79 data file named KEYVARS ("Is R's Current Residence in SMSA?"), would give a false impression of mobility in and out of an SMSA for respondents living in the same location for which the SMSA designation was changed between survey years. (See Appendix 6: Urban-Rural and SMSA-Central City Variables in the public-use file Codebook Supplement for further details on the creation of this variable.)
Note that all other SMSA environmental variables for those living in these new areas remain NA, since the County and City Data Book, 1972 and 1977 data files did not contain information for these SMSAs. There are 171 NLSY79 respondents who lived in those new SMSAs in 1980 and 2 who lived in the disqualified SMSA in 1979.
Rewrite of 1979-82 Geocode Data
In 1989, work was undertaken to reduce the number of variables provided in the 1979-82 NLSY79 geocode data so that the number and type of variables included in these data more closely resembled the geographic data available for the 1983 and subsequent survey years. The previous 1979-82 NLSY79 geocode data file contained 2,245 variables. This number was reduced to 545 variables with county-level and SMSA-level data retained. In addition, four new variables were included in the 1979-82 NLSY79 geocode data. These variables provide data on the "Continuous Unemployment Rate for the Labor Market of Current Residence" for each survey year. This reduction in the number of variables made it possible to better document the geocode variables and to produce codebooks like the ones produced for the main NLSY79 data.
A new procedure was implemented in 1989 as an initial step in verifying the county and state of residence by using address information from the "switch" associated with each area code and exchange. In the hand-editing process for the 1988 geocode data, reported telephone information was found to be very accurate, even in cases for which some or all of the address information was in error. Thus the telephone information presented itself as a reliable, independent source of verification for the address information. The state and county generated from the phone number were compared to the state and county in the NORC address file for each respondent. Cases in which the telephone information indicated a different state and/or county from that in the address file were identified through this process. This procedure helped identify respondents with incorrect or inconsistent records. Cases that produced such a non-match were checked for accuracy and hand-edited if necessary.
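The comparison described above can be sketched as a lookup from area code and exchange to the state and county of the switch, followed by a check against the address file. The switch table, its layout, and the status labels are assumptions made for this illustration; the actual LERG file layout is not described here.

```python
# Hypothetical switch table: (area code, exchange) -> (state FIPS, county FIPS).
SWITCHES = {
    ("614", "292"): ("39", "049"),
}


def verify_by_phone(phone, address_state, address_county):
    """Compare the switch location for a phone number with the state and
    county in the address file.

    Returns "match", "non-match" (a case to check and possibly hand-edit),
    or "unverified" when the number is not in the switch table.
    """
    digits = "".join(ch for ch in phone if ch.isdigit())
    switch = SWITCHES.get((digits[:3], digits[3:6]))
    if switch is None:
        return "unverified"
    if switch == (address_state, address_county):
        return "match"
    return "non-match"
```

For example, `verify_by_phone("(614) 292-1234", "39", "041")` returns "non-match" because the switch is located in county 049.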
The following briefly outlines the procedures used to create the 1989-1994 geocode data.
From this point, the procedures closely follow those applied in constructing the geocode data files in prior survey years, with minor modifications. The CRF matching was based upon state and county only for the purposes of the final matching of information from the County and City Data Book data files. As metropolitan statistical area information is based upon county delineations (except in New England), matching on cleaned state and county data should not affect the assignment of respondent MSAs.
Hand-Edits and Changes in Matching Procedures
In creating the 1989-94 geocode data, the same logical procedures were applied in identifying cases requiring individual examination. However, the automation of the decision rules and procedures to check for and identify such cases resulted in a substantial reduction in the number of cases requiring hand-editing.
1989-1994
The effect of the 1989 phone verification procedures on the ability to detect errors in the NORC geocode data may also affect mobility rates between 1988 and 1989. Due to time and personnel constraints, it was not possible to examine every case that did not initially match on the state, county, and zip codes.
In the 1989 procedure the geocodes established by the phone number were compared to the geocodes received directly from NORC. By using the 1989 CHRR-edited versions of the geocodes for comparison, updates and corrections that were made to the geocodes during the 1989 hand-editing processes were incorporated. This reduced the number of mismatches between the geocode information based upon the current phone number and the respondent-reported geocode information and increased the amount of consistency observed between survey years. The number of cases requiring individual examination was thereby reduced.
From this point, the procedures closely follow those applied in constructing the 1988 geocode data, with minor modifications. For 1989, CRF matching was based upon state and county only for the purposes of the final matching of information from the County and City Data Book data. A match on state, county, and zip was also required to construct a variable reflecting a respondent's SMSA/non-SMSA residence status for inclusion in the NLSY79 main data file. This match, which was included in the geocode procedures prior to 1989, was done separately for the 1989 release when the new set of initial procedures was instituted. To streamline programming tasks, however, the zip information was reinserted in the CRF matching program for 1990. Therefore, the CRF matching for the 1990 geocode data was again based upon state, county, and zip code, as it had been prior to 1989.
In earlier survey years, residence information was usually collected by NORC interviewers only when there was a change in that information from the previous interview. In 1990, however, an effort was made to get current information for all respondents. Many of the cases in this current update information also included counties that had been inconclusive (even in case-by-case hand-editing) in previous years. These are generally cases in which a zip code spans more than one county, and for which valid county data is missing from the respondent's reported residence information. For such cases, the possibility existed in the 1989 (and prior) data that counties assigned based upon such multiple-county zip codes might be in error in a small number of cases. This would result in the assignment of a county adjacent to the county in which the respondent actually lived. To the extent that current update information for the county of residence in 1990 showed the assigned county in 1989 to be in error, mobility determinations may have been affected. In contrast, using the 1989 CHRR-edited versions of the geocodes for comparison with the current geocode information should have improved the accuracy of mobility ratings. This is a more dependable confirmation of past geocode information, eliminating the need to make individual determinations in many cases with multiple-county zip codes as discussed above.
The procedures for the creation of the 1996 and subsequent geocode data changed from those used in previous years. Software packages were used to create the data for 1996-2008. The following briefly outlines the process used in these survey years.
Although different software packages were used, the procedures for data creation were essentially the same across these years. Three graduated matching methods were applied, depending on the quality of the address data available.
The procedures outlined in steps #2 and #3 approximate the hand-editing process described in previous survey years for records with different degrees of matched address data.
Urban-Rural and SMSA/CBSA-Central City residence variable
The procedures for creating the Urban-Rural and SMSA/CBSA-Central City residence variables (released in the KEYVARS area of interest) were modified for the 2000 public release. In 2000 and later survey years, these variables were created with the same software used to create the other geocode data. For further discussion of these variables, see Appendix 6: Urban-Rural and SMSA/CBSA-Central City Variables in the main file NLSY79 Codebook Supplement.
Migration History variables
In NLSY79 survey years 2000-2008, respondents who had moved to a different county or state since the date of last interview were asked to report each address and the dates of each move. The FIPS codes for the state and county of each address are included in the 2008 geocode data. The address items collected are found in the geocode files titled “survey_and_created_variables_081610.*”, and in the questionnaire and codebook, with question names beginning with "MIGR_". Similar migration histories were collected in several early survey years of the NLSY79.
Distance Measures variables
A series of variables was added with the 2006 geocode release containing the collapsed distance between each pair of residential addresses reported by the respondent for all survey years and indicating whether the respondent changed zip code between each pair of addresses. These data have been updated for the 2008 release, with additional information for many address pairs in various survey years.
Editing/Quality of Match Variables
A variable named "GEO10" provides information about the quality of the respondent's address match and the method used to locate the address. In 1994 and prior years, GEO10 contains information on the degree of match between different address elements. Between 1996 and 2004, GEO10 identifies whether the county was assigned based on the respondent-provided address or the zip centroid method. In 2006-2008, this variable differentiates between addresses located using the actual address, the center of a short street, the center of a long street, or a zip centroid method. This variable can be used to determine the level of certainty for the respondent’s geographic data.
The missing data values for all items on the geocode data files are -3, -4 and -5. The -5 values indicate a noninterview for a given year. -3 codes in the data after 1996 indicate respondents whose latitude and longitude of current residence could not be determined. Respondents who have a -4 value in the data for any variables from the County and City Data Book or other residence indicators fall into the following categories:
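When reading the files for analysis, the missing data values -3, -4 and -5 described in this section generally need to be separated from valid values. The sketch below is one possible approach; the function name and the wording of the reason labels are assumptions, and the label for -4 deliberately defers to the categories given in this documentation.

```python
# Reason labels are paraphrased for illustration only.
MISSING_REASONS = {
    -3: "latitude/longitude undetermined (data after 1996)",
    -4: "not available (see the -4 categories in the documentation)",
    -5: "noninterview",
}


def split_missing(raw):
    """Return (value, reason): value is None when raw is a missing code,
    and reason is None when raw is a valid value."""
    value = int(raw)
    if value in MISSING_REASONS:
        return None, MISSING_REASONS[value]
    return value, None


# Example: one made-up variable for four respondents.
cleaned = [split_missing(v) for v in ["39", "-5", "-4", "17"]]
valid_values = [value for value, reason in cleaned if reason is None]
```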
Finally, we have a few notes and suggestions concerning the use of these NLSY79 geographic data.
The NLSY79 geocode data should not be used in any fashion that would endanger the confidentiality of any sample member. Only those users who have signed a written licensing agreement consenting to protect respondent confidentiality and to other conditions, who agree not to make, or allow to be made, unauthorized copies of the geocode file, and who also agree to indemnify the Center for Human Resource Research for all claims arising from misuse of the file may use these data.
The data and the accompanying documentation should be used in conjunction with the printed versions of the 1972, 1977, 1983, 1988, and 1994 County and City Data Books that correspond to each variable desired in order to have complete information regarding variable descriptions and coding idiosyncrasies. No variables from the County and City Data Books have been included in the geocode data after 2002. Users wishing to attach specific individual items from that or other sources may do so by using the state, county and/or various MSA variables to merge data.
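Attaching items from another source by state and county can be sketched as a lookup on the pair of FIPS codes. The field names and the external values below are made up for illustration; any real external file would need its state and county codes expressed in the same FIPS form as the geocode variables.

```python
# Hypothetical external county-level table keyed by (state FIPS, county FIPS).
EXTERNAL = {
    ("39", "049"): {"ext_rate": 4.2},
    ("17", "031"): {"ext_rate": 5.1},
}

# Hypothetical respondent records for one survey year.
RESPONDENTS = [
    {"id": 1, "state": "39", "county": "049"},
    {"id": 2, "state": "17", "county": "031"},
    {"id": 3, "state": "-5", "county": "-5"},  # noninterview in this year
]


def attach_county_items(respondents, county_table, fields):
    """Return copies of the respondent records with the requested county-level
    fields added; fields are None when the state/county pair has no match."""
    result = []
    for rec in respondents:
        county_row = county_table.get((rec["state"], rec["county"]), {})
        extra = {name: county_row.get(name) for name in fields}
        result.append({**rec, **extra})
    return result


with_items = attach_county_items(RESPONDENTS, EXTERNAL, ["ext_rate"])
```

Many respondents can share one county, so the lookup is many-to-one; records with missing-value codes simply receive None for the attached fields.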
Edited variables describing the location of each respondent's residence are created as a result of this matching process. The first two variables, question names "GEO1" and "GEO2", provide the FIPS code for the respondent's county and state of residence. Two versions of the county and state of residence variables are included in the geocode data for most survey years from 1979-92. The state and county variables appearing at the beginning of each year’s variable listing are the edited versions that incorporate all revisions deemed necessary in the hand-editing process for each year. These edited variables are used in the construction of the final geocode data. The state and county variables appearing near the end of the variable listing for most of those years are the unedited version, as received directly from NORC. It is generally recommended that users employ the edited version as these contain corrected geocodes based upon the most current available information.
Researchers are encouraged to use caution during analyses because several modifications have been made since 1987 in the programming procedures that create the geocode data files. Please refer to this document for discussion of specific modifications of note.
In years for which zip code centroids were assigned, users should note that there is some small possibility that a respondent's county may be misassigned using the centroid method in cases where more than one county is represented in a given ZIP code. In these cases, it is possible that a respondent might live in one county but that the center of the ZIP code area is in another county. However, since ZIP codes infrequently cross county lines and less than a quarter of respondents' counties were assigned using the ZIP centroid method, the number of counties incorrectly assigned should be quite small.