Item Nonresponse

Item Nonresponse

This section examines and quantifies the extent of missing data, formally called item nonresponse, in the NLSY79. To provide readers with a detailed view of this problem, six surveys are analyzed. Nonresponse rates are examined first in the 1979 survey and then in the surveys that occur at roughly five-year intervals (1984, 1989, 1994, 1998, and 2004). These years were chosen to capture the major changes in the NLSY79. Examining the 1979 survey shows the initial levels of nonresponse. Examining the 1984 survey shows the amount of nonresponse in the survey just before one part of the respondent pool was dropped. The 1989 data show nonresponse after the first set of NLSY79 respondents was dropped. The 1994 data show what occurred after users and interviewers were switched from paper-and-pencil interviewing (PAPI) to computer-assisted personal interviewing (CAPI). While no major survey changes occurred during the 1998 and 2004 surveys, these surveys show nonresponse rates after many respondents had participated around 20 times.

This section focuses on the three types of missing data: refusals, invalid skips, and don't knows. Overall, the section shows that in these six rounds of the NLSY79, 20 million questions were asked. Out of all the questions asked to respondents, about 1.5 percent do not have valid answers and are missing data. Of the three missing data categories, about half the missing data are don't knows and about half are invalid skips. Given the vast majority of invalid skips occur in paper-and-pencil years, the percentage of problems attributed to this category has been steadily falling as more computer survey rounds are fielded.


Missing data, or nonresponse, happens in a number of ways in the NLSY79. First, a number of respondents do not participate at all, causing all information in that particular survey to be missing. Participation rates and reasons for noninterview in each survey round are discussed in the section on Retention & Reasons for Noninterview.

A second reason missing data occurs is that respondents do not provide a valid answer to a question. When this happens, interviewers make a determination about whether to mark the answer as a refusal or don't know value. Users should be cautioned that the assignment of refusals and don't knows is likely to vary across interviewers. Moreover, some respondents may believe it is impolite to refuse a question and decline to answer by saying they do not know. Hence, whether a question is marked either a refusal or a don't know is somewhat arbitrary. Note: Financial questions may often elicit the "refusal" or "don't know" responses. For more information about nonresponse to financial questions, see Appendix 26.

The last major way missing data can occur is when the interviewer incorrectly follows the survey instrument's flow. Incorrect flows result in some respondents being skipped over a set of questions that should be answered while others answer questions that they should not have been asked. Data archivists have removed from the data most of the extraneous question responses. While extra information can be removed, missing data is not imputed in the NLSY79. Missing data caused by this reason is flagged with a special "invalid skip" code. The number of invalid skipped drops precipitously beginning in 1993 with the introduction of CAPI. Nevertheless, invalid skips are still possible in CAPI data. If the CAPI survey contains a programming mistake, the instrument could incorrectly sequence a respondent. When these errors are found, the CAPI survey is patched in the field to prevent further invalid skips but the incorrect cases are not asked the questions again.

All missing data are clearly flagged in the NLSY79 data set. Five negative numbers are used to indicate to users that the variable does not contain useful information. The five values are (-1) refusal, (-2) don't know, (-3) invalid skip, (-4) valid skip, and (-5) noninterview. These five numbers are reserved as missing value flags and, with a few exceptions (see Appendix 5), are rarely used in the NLSY79 for valid data values.

In the tables that follow, every attempt has been made to look at only variables in a given survey year that were filled in by either a respondent or an interviewer. The goal was to eliminate all created, machine check, date and time stamp, and variables generated in data post-processing from the analysis. Given there is no automatic way to check every question to see if it meets these criteria, the number of questions analyzed by the below tables overstates the number of questions actually filled in by the respondent or interviewer. The overstatement occurs because some questions with meaningful titles are actually hidden machine checks. While every effort was made to eliminate these questions it is impossible to eliminate all of them.

This section is not the only research on the extent of missing data in the NLS. Olsen (1992) investigated the effect of switching from PAPI to CAPI interviewing. His research shows fewer interviewer errors occur from navigating the instrument as well as fewer don't knows in the CAPI survey. More importantly, CAPI respondents appeared more willing to reveal sensitive material in the alcohol use section. Mott (1985, 1984, and 1983) examines the NLSY79's fertility data. In these reports, he examines the 1982 and 1983 surveys and finds very low refusal rates for the data in general. However, by shifting to a confidential abortion reporting method, the willingness to respond greatly increases. Mott (1998) examines the amount of missing data about the children of NLSY79 females. He finds that Hispanics or Latinos and, to a smaller extent blacks, have a much higher probability of not finishing the child assessments after starting the interview.

Information about nonresponse continues in the following three sections: 

  1. The first, Item Nonresponse by Section, examines which sections of the NLSY79 have high nonresponse rates.
  2. In the second, Item Nonresponse by Respondents, responses are examined to see how many times individuals do not respond to questions.
  3. The third, Item Nonresponse within Problem Sections, examines which particular questions in sections with high nonresponse rates are causing problems.