Tutorial: Constructing Comparable Samples across the NLSY79 and NLSY97

Example: Constructing work status at age 20 for both samples

Objective: This tutorial walks you through the basic steps of constructing parallel samples for research projects that use both the NLSY79 and the NLSY97 cohorts. It then helps you create a similar variable--work status at age 20--for both cohorts.

The two NLSY cohorts are carefully designed to support a variety of cross-cohort research. Both survey samples are based on birth year, are drawn from nationally representative area probability samples, and have similar questionnaire designs for many topics, particularly employment. However, there are small differences in design that one needs to take into account when preparing data files for research projects.

Knowledge Assumed: This tutorial assumes that you already know how to use the NLS Investigator to create a tagset that saves your variables and to extract data. If you need assistance with the NLS Investigator before starting this tutorial, see "How to Use the NLS Investigator".

Background Reading:

Preview of Steps

  1. Select the samples for analysis.
  2. Determine the age and the interview years you will need for your analysis.
  3. Create a tagset of variables to define work status at age 20 for the NLSY79.
  4. Create a tagset of variables to define work status at age 20 for the NLSY97.
  5. Construct work status variable for both samples.

Step 1: Select the samples for analysis

First we'll need to decide whether to exclude certain supplemental samples in the two cohorts from our analysis. The lists below show the types of samples that make up each survey.

NLSY79 Sample Types (R01736.)
  - Cross-sectional sample: Nationally representative sample of individuals born in 1957-1964 and living in the U.S. as of the first survey round
  - Oversamples of black and Hispanic individuals, same birth years as above
  - Economically disadvantaged non-black, non-Hispanic oversample (discontinued after 1990), same birth years as above
  - Military sample (mostly discontinued after 1985): Sample of individuals born in 1957-1961 and serving in the military as of September 30, 1978

NLSY97 Sample Types (R12358.)
  - Cross-sectional sample: Nationally representative sample of individuals born in 1980-1984 and living in the U.S. as of the first survey round
  - Oversamples of black and Hispanic individuals, same birth years as above


  1. Let's start with the NLSY79. The variable R0173600, Sample Identification Code, lists the gender, race/ethnicity, and sample type of each NLSY79 respondent. The NLSY79 includes two extra sub-samples not available in the NLSY97: a military sample and an economically disadvantaged non-black, non-Hispanic oversample. To omit the military sample from your analysis, you will want to exclude cases in which R0173600 ranges from 15 to 20. To omit the oversample of economically disadvantaged non-black, non-Hispanic respondents, you'll want to exclude cases in which R0173600 equals 9 or 12.
  2. Now let's turn to the NLSY97. The variable R1235800, CV_Sample_Type, lists the sample type of each NLSY97 respondent (if you are using the new version of NLS Investigator, this variable is preselected).
  3. The NLSY97 includes a cross-sectional sample and oversamples of black and Hispanic individuals, as in the NLSY79. To compare the two samples, one can use the full NLSY97 sample together with the remaining NLSY79 cases (after excluding the military sample and the economically disadvantaged non-black, non-Hispanic oversample).
  4. Alternatively, you could restrict your analysis to the nationally representative cross-sectional samples in both surveys. In the NLSY79, this means keeping only cases in which R0173600 ranges from 1 to 8; in the NLSY97, it means keeping only cases in which R1235800 equals 1. Doing so, however, will substantially reduce your sample sizes and increase your standard errors.
  5. Note that we have not introduced the possibility of using sampling weights to compute population estimates. This is beyond the scope of this tutorial. However, see the section on Sample Weights, Design Effects & Clustering Adjustments in the NLSY79 User's Guide, and the section on Sample Weights and Design Effects in the NLSY97 User's Guide. In addition, if multiple waves of data are used, one can create appropriate weights using the custom weighting program offered for each survey. 
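
The sample restrictions described in this step can be sketched in a few lines of Python. This is only an illustration: the code values come from the text above, but the function names, the `cross_sectional_only` switch, and the example rows are hypothetical, and a real extract would be read from the file you download from the NLS Investigator.

```python
# Sample-type codes taken from the tutorial text above.
NLSY79_MILITARY = set(range(15, 21))        # R0173600 codes 15-20
NLSY79_DISADVANTAGED = {9, 12}              # R0173600 codes 9 and 12
NLSY79_CROSS_SECTIONAL = set(range(1, 9))   # R0173600 codes 1-8
NLSY97_CROSS_SECTIONAL = 1                  # R1235800 equal to 1

NLSY79_EXCLUDED = NLSY79_MILITARY | NLSY79_DISADVANTAGED


def keep_nlsy79(sample_id, cross_sectional_only=False):
    """Return True if an NLSY79 case stays in the analysis sample."""
    if cross_sectional_only:
        return sample_id in NLSY79_CROSS_SECTIONAL
    return sample_id not in NLSY79_EXCLUDED


def keep_nlsy97(sample_type, cross_sectional_only=False):
    """Return True if an NLSY97 case stays in the analysis sample."""
    if cross_sectional_only:
        return sample_type == NLSY97_CROSS_SECTIONAL
    return True  # the full NLSY97 sample is retained by default


# Hypothetical rows standing in for an NLSY79 extract.
nlsy79_rows = [{"R0173600": 1}, {"R0173600": 9},
               {"R0173600": 11}, {"R0173600": 17}]
nlsy79_kept = [row for row in nlsy79_rows if keep_nlsy79(row["R0173600"])]
```

Passing `cross_sectional_only=True` to both functions gives the narrower comparison described in item 4, at the cost of smaller samples.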

Step 2: Determine the age and the interview years you will need for your analysis

Depending on your research topic, you may want to look at respondents in each cohort at a given age or age range. Below are some issues to consider.

  1. Which topics are available at different ages or years across the surveys. Some topics, like employment and marital status, are collected in an event-history format. By gathering dates of jobs or marital changes, for example, the surveys collect a complete history of the particular topic. Other information is collected for a point in time, only in certain years, at certain ages, or for select birth cohorts. The starred summary charts (Asterisk Tables) for the NLSY79 and NLSY97 interviews offer a convenient way to see which topics are collected and in which years. To see how similar the question wordings are across the two cohorts, you would need to look at the actual questionnaires.
  2. In which interviews respondents are the age you need for your research question. To determine this, you could use the month and year of birth, interview date, and age at interview variables available in each data set. For example, to find the first interview after the respondent turned 20, you could either compare the date of birth to the interview date or look at the age at interview variable in each round.
  3. How key event-history data are dated. Some event histories use a special dating method that allows one to compare the timing of various life events. Information about program participation, marriage, and schooling is provided on a monthly basis using a continuous month timeline in the NLSY97, starting with January 1980. Although not in a continuous month scheme, created event-history data in the NLSY79 often show the measures by month and year. Employment histories are presented on a weekly basis using a continuous week timeline in both the NLSY79 (starting with the week of January 1, 1978) and the NLSY97 (starting with the week of January 1, 1980).
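
As a rough illustration of item 2, the comparison of birth date to interview date can be sketched as below. The function name and the `(label, year, month)` layout of the interview list are hypothetical, not NLS variable names. Because the comparison uses only month and year, the sketch assumes that an interview held in the birthday month counts as taking place after the birthday; you may prefer a stricter rule.

```python
def first_interview_at_or_after_age(birth_year, birth_month, interviews, age=20):
    """Return the label of the earliest interview held in or after the month
    in which the respondent reaches `age`, or None if there is none.

    `interviews` is a list of (label, year, month) tuples; rounds in which
    the respondent was not interviewed are simply left out of the list.
    """
    # Express dates as a count of months so they can be compared directly.
    target = (birth_year + age) * 12 + birth_month
    eligible = [(year * 12 + month, label)
                for label, year, month in interviews
                if year * 12 + month >= target]
    if not eligible:
        return None
    return min(eligible, key=lambda pair: pair[0])[1]


# Hypothetical interview dates for one respondent born in March 1962.
interviews = [("1981", 1981, 2), ("1982", 1982, 3), ("1983", 1983, 1)]
round_at_20 = first_interview_at_or_after_age(1962, 3, interviews)
```

The age at interview variable in each round offers a simpler alternative when it is available and nonmissing.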

In the next step, we'll show an example of using this continuous week timeline to determine work status during the week that includes October 1st for the year the respondent turns 20.
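
Before turning to the survey variables themselves, the arithmetic behind a continuous week timeline can be sketched as follows. This is an approximation under stated assumptions, not the official numbering: it assumes that weeks run Sunday through Saturday and that week 1 is the week containing the start date given above for each cohort. The function names are hypothetical, and for real analyses you should rely on the week-number documentation supplied with each survey.

```python
from datetime import date, timedelta

# Timeline start dates as stated in the text above.
TIMELINE_START = {
    "NLSY79": date(1978, 1, 1),
    "NLSY97": date(1980, 1, 1),
}


def sunday_on_or_before(d):
    """Return the Sunday that begins the week containing d
    (ASSUMPTION: weeks run Sunday through Saturday)."""
    # date.weekday() returns Monday=0 ... Sunday=6.
    return d - timedelta(days=(d.weekday() + 1) % 7)


def continuous_week(d, timeline_start):
    """Number of the week containing d, counting the week that contains
    timeline_start as week 1."""
    week1_start = sunday_on_or_before(timeline_start)
    return (d - week1_start).days // 7 + 1


def week_of_october_first(birth_year, cohort, age=20):
    """Continuous week containing October 1 of the calendar year in which
    the respondent turns `age`."""
    target_day = date(birth_year + age, 10, 1)
    return continuous_week(target_day, TIMELINE_START[cohort])


# Example: an NLSY79 respondent born in 1962 turns 20 during 1982.
week_number = week_of_october_first(1962, "NLSY79")
```

The resulting week number would then identify which weekly employment status variable to examine for that respondent.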