
Constructing Comparable Samples across the NLSY79 and NLSY97

Tutorial objective and prerequisites

Objective

This tutorial walks you through the basic steps of constructing parallel samples for research projects that use both the NLSY79 and the NLSY97 cohorts. It then creates a similar variable--work status at age 20--for both cohorts.

The two NLSY cohorts are carefully designed to support a variety of cross-cohort research. Both survey samples are based on birth year, are drawn from nationally representative area probability samples, and have similar questionnaire designs for many topics, particularly employment. However, there are small differences in design that one needs to take into account when preparing data files for research projects.

Knowledge assumed

This tutorial assumes that you already know how to use the NLS Investigator to create a tagset that saves your variables and to extract data. If you need assistance with the NLS Investigator before starting this tutorial, please review the Investigator User Guide or contact NLS User Services.

Background reading

Example: Constructing work status at age 20 for both samples

Preview of steps

The "Additional information" section provides the final statistics for the example and suggestions for extending the tutorial.

Step 1: Select the samples for analysis

First decide whether to exclude certain supplemental samples in the two cohorts from the analysis. The table below shows the types of samples that make up the surveys.

| NLSY79 Sample Types (R01736.) | NLSY97 Sample Types (R12358.) |
| --- | --- |
| Cross-sectional sample: Nationally representative sample of individuals born in 1957-1964 and living in the U.S. as of the first survey round | Cross-sectional sample: Nationally representative sample of individuals born in 1980-1984 and living in the U.S. as of the first survey round |
| Oversamples of black and Hispanic individuals, same birth years as above | Oversamples of black and Hispanic individuals, same birth years as above |
| Economically disadvantaged non-black, non-Hispanic oversample (discontinued after 1990), same birth years as above | (none) |
| Military sample (mostly discontinued after 1985): Sample of individuals born in 1957-1961 and serving in the military as of September 30, 1978 | (none) |

  1. Begin with the NLSY79:
    • The variable R0173600, Sample Identification Code, lists the gender, race/ethnicity, and sample type of each NLSY79 respondent.
    • The NLSY79 includes two extra sub-samples not available in the NLSY97: a military sample and an economically disadvantaged non-black, non-Hispanic oversample.
    • To omit the military sample from the analysis, exclude cases in which R0173600 ranges from 15 to 20.
    • To omit the oversample of economically disadvantaged non-black, non-Hispanic respondents, exclude cases in which R0173600 equals 9 or 12.
  2. Next review the NLSY97:
    • The variable R1235800, CV_Sample_Type, lists the sample type of each NLSY97 respondent.
    • The NLSY97 includes a cross-sectional sample and oversamples of black and Hispanic individuals, as in the NLSY79.
  3. To compare the two samples:
    • Use the full NLSY97 sample as well as the remaining NLSY79 cases (after excluding the military sample and the economically disadvantaged non-black, non-Hispanic oversample).
    • Alternatively, you could restrict your analysis to the nationally representative cross-sectional samples in both surveys. In the NLSY79, this amounts to keeping cases with R0173600 equal to 1 through 8, and in the NLSY97, cases with R1235800 equal to 1. Doing so, however, will substantially reduce your sample sizes and increase your standard errors.
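
The selection rules in this step can be sketched in a few lines of Python. This is only an illustration and not part of the NLS materials: the function names and the toy records are hypothetical, while the code values (15 to 20, 9 and 12, 1 through 8, and 1) are the ones given in the text above.

```python
# Sample-type codes as described in Step 1 of this tutorial.
MILITARY_79 = set(range(15, 21))      # R0173600 codes 15-20
DISADVANTAGED_79 = {9, 12}            # R0173600 codes 9 and 12
CROSS_SECTION_79 = set(range(1, 9))   # R0173600 codes 1-8
CROSS_SECTION_97 = {1}                # R1235800 code 1


def keep_nlsy79(sample_id, cross_section_only=False):
    """Return True if an NLSY79 case stays in the analysis sample."""
    if cross_section_only:
        return sample_id in CROSS_SECTION_79
    return sample_id not in MILITARY_79 and sample_id not in DISADVANTAGED_79


def keep_nlsy97(sample_type, cross_section_only=False):
    """Return True if an NLSY97 case stays in the analysis sample."""
    if cross_section_only:
        return sample_type in CROSS_SECTION_97
    return True  # the full NLSY97 sample is used by default


# Hypothetical toy records, not real NLSY data.
toy_79 = [
    {"R0000100": 1, "R0173600": 5},   # cross-sectional case
    {"R0000100": 2, "R0173600": 9},   # economically disadvantaged oversample
    {"R0000100": 3, "R0173600": 13},  # black or Hispanic oversample
    {"R0000100": 4, "R0173600": 17},  # military sample
]

kept_ids = [r["R0000100"] for r in toy_79 if keep_nlsy79(r["R0173600"])]
cross_ids = [r["R0000100"] for r in toy_79
             if keep_nlsy79(r["R0173600"], cross_section_only=True)]
print(kept_ids, cross_ids)
```

With the toy records, the default rule keeps the cross-sectional and oversample cases, while the cross-section-only rule keeps just the first case.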

Note that we have not introduced the possibility of using sampling weights to compute population estimates. This is beyond the scope of this tutorial. However, see the section on Sample Weights, Design Effects & Clustering Adjustments in the NLSY79 User's Guide, and the section on Sample Weights and Design Effects in the NLSY97 User's Guide. In addition, if multiple waves of data are used, one can create appropriate weights using the custom weighting program offered for each survey.

Step 2: Determine the age and the interview years you will need for your analysis

Depending on your research topic, you may want to look at respondents in each cohort at a given age or age range. Below are some issues to consider.

  • Which topics are available at different ages or years across the surveys. Some topics, like employment and marital status, are collected in an event-history format. By gathering dates of jobs or marital changes, for example, the surveys collect a complete history of the particular topic. Other information is collected for a point in time, only in certain years, at certain ages, or for select birth cohorts. The starred summary charts (Asterisk Tables) for the NLSY79 and NLSY97 interviews offer a convenient way to see what topics are collected, and in which years they were obtained. One would need to look at the actual questionnaires to see how similar the question wordings are across the topics for the two cohorts.
  • Which interviews take place when respondents are the age you need for your research question. To determine this, you could use the month and year of birth, interview date, and age at interview variables available in each data set. For example, if one wanted to find the first interview after the respondent turned 20, one could either compare the date of birth to the interview date or look at the age at interview variable in each round.
  • Awareness of a special dating method used in some key event-history data, which allows one to compare the timing of various life events. Information about program participation, marriage, and schooling is provided on a monthly basis using a continuous month timeline in the NLSY97, starting with January 1980. Although not in a continuous month scheme, created event-history data in the NLSY79 often show the measures by month and year. Employment histories are presented on a weekly basis using a continuous week timeline in both the NLSY79 (starting with the week of January 1, 1978) and the NLSY97 (starting with the week of January 1, 1980).

The next few steps will show an example of using this continuous week timeline to determine work status during the week that includes October 1st for the year the respondent turns 20.

Step 3: Create a tagset of variables to define work status at age 20 for the NLSY79

Now we will look at respondents in each cohort at a given age; for this example, that age is 20. By the 2006 interview, all NLSY97 respondents are over 20 years old, and by 1985 all NLSY79 respondents are over 20 years old.

Define the following variable for both cohorts: work status during the week that includes October 1st for the year the respondent turns 20. We will need Year of Birth and Weekly Labor Force Status variables from the Work History arrays for that particular week for both cohorts.

  1. In the NLSY79, respondents were born in the years 1957 through 1964. That means the year the respondents turn 20 ranges from 1977 to 1984.
    • Note that the Work History arrays begin on January 1, 1978, so you will need to exclude the 1957 birth year from the analysis.
  2. In Appendix 18 of the NLSY79 Codebook Supplement, there is a table that tells us the week numbers in the Work History arrays that correspond to each date. Two numbers are given, the week of the year and the week of the array.
    • For example, in 1979, the week of October 1st is week number 39 in 1979, but week number 92 in the Work-History array (number of weeks since 1/1/78). Given the layout of the data, the latter number is what we need. (In the NLSY97 the opposite is true.) We want to find the week numbers that correspond to the week of October 1st for the years 1978-1984--the years our respondents turn age 20. You will need: week 40 in 1978, 92 in 1979, 144 in 1980, 196 in 1981, 248 in 1982, 300 in 1983, and 353 in 1984.
  3. Searching on the "Work History-Weekly Labor Status" Area of Interest will give you the Weekly Labor Force Status arrays in the NLSY79. Tag the variables that correspond with the list above (W0065200, W0070400, W0110300, W0150200, W0190100, W0230000, W0270500).
  4. Also tag variables for Year of Birth (R0000500), Sample Composition (R0173600), and the Respondent ID (R0000100).
  5. Next, create an extract of your NLSY79 data set. The variables included are as follows:

| Reference Number | Question Name | Question Title | Year |
| --- | --- | --- | --- |
| R00001.00 | CASEID | Identification Code | 1979 |
| R00005.00 | S01Q01A | Date of Birth - Year | 1979 |
| R01736.00 | S24Q01 | Sample Identification Code | 1979 |
| W00652.00 | STATUS_WK_NUM0040 | Labor Force Status (1978) Week 40 | 1979 |
| W00704.00 | STATUS_WK_NUM0092 | Labor Force Status (1978) Week 92 | 1979 |
| W01103.00 | STATUS_WK_NUM0144 | Labor Force Status (1978) Week 144 | 1980 |
| W01502.00 | STATUS_WK_NUM0196 | Labor Force Status (1978) Week 196 | 1981 |
| W01901.00 | STATUS_WK_NUM0248 | Labor Force Status (1978) Week 248 | 1982 |
| W02300.00 | STATUS_WK_NUM0300 | Labor Force Status (1978) Week 300 | 1983 |
| W02705.00 | STATUS_WK_NUM0353 | Labor Force Status (1978) Week 353 | 1984 |
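
The array week numbers used in this step can be cross-checked with simple date arithmetic. The Python sketch below assumes that the NLSY79 array counts consecutive 7-day blocks beginning on January 1, 1978; that assumption and the function name are illustrative only, and the appendix table in the codebook supplement remains the authoritative source.

```python
from datetime import date

# Assumption: week 1 of the NLSY79 array starts on January 1, 1978.
ARRAY_START_79 = date(1978, 1, 1)


def array_week(day, start=ARRAY_START_79):
    """Return the 1-based index of the 7-day block that contains `day`."""
    return (day - start).days // 7 + 1


# Array week containing October 1 for each year in which respondents turn 20.
october_weeks = {year: array_week(date(year, 10, 1)) for year in range(1978, 1985)}
print(october_weeks)
# {1978: 40, 1979: 92, 1980: 144, 1981: 196, 1982: 248, 1983: 300, 1984: 353}
```

Under that assumption, the computed weeks agree with the list given in item 2 above.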

Using the NLS Investigator

To create a tagset of specific variables and then extract the data set, use the Save / Download Tab in the NLS Investigator.

Step 4: Create a tagset of variables to define work status at age 20 for the NLSY97

Now create a similar tagset for the NLSY97.

  1. In the NLSY97, respondents were born in the years 1980 through 1984. That means the year the respondents turn 20 ranges from 2000 to 2004.
  2. In Appendix 7 of the NLSY97 Codebook Supplement, find Table 1, an Excel spreadsheet, which tells us the week numbers in the Work History arrays that correspond to each date. Two weekly numbers are given, the week of the year and the week of the array.
    • For example, in 2000, the week that includes October 1st is week number 41 in 2000, but week number 1084 in the Work-History array (number of weeks since 1/1/80). Given the layout of the data, the former number is what we need. Here's what we need: week 41 in 2000, 40 in 2001, 40 in 2002, 40 in 2003, and 40 in 2004.
  3. Searching on the Survey Year = "XRND", Word in Title (enter search term) contains "Status", and Word in Title (enter search term) contains "Employment" gives us variables that include the Weekly Labor Force Status arrays in the NLSY97. Tag the variables that correspond with the list above (R8812500, R8908000, R9043500, R9048700, R9179400).
    • Note these variables have convenient question names in the data set: EMP_STATUS_year.week number.
  4. Also tag the variables for Year of Birth (R0536402), Sample Composition (R1235800), and Respondent ID (R0000100).
  5. Next, create an extract of your NLSY97 data set (see "Using the NLS Investigator" in Step 3). The variables included are as follows:

| Reference Number | Question Name | Question Title | Year |
| --- | --- | --- | --- |
| R00001.00 | PUBID | PUBID, Youth Case Identification Code | 1997 |
| R05364.02 | KEY!BDATE_Y | KEY!BDATE, Rs Birthdate Month/Year (Symbol) | 1997 |
| R12358.00 | CV_SAMPLE_TYPE | Sample Type. Cross-Sectional or Oversample | 1997 |
| R88125.00 | EMP_STATUS_2000.41 | 2000 Employment: Employment Status in Week 41 | XRND |
| R89080.00 | EMP_STATUS_2001.40 | 2001 Employment: Employment Status in Week 40 | XRND |
| R90435.00 | EMP_STATUS_2002.40 | 2002 Employment: Employment Status in Week 40 | XRND |
| R90487.00 | EMP_STATUS_2003.40 | 2003 Employment: Employment Status in Week 40 | XRND |
| R91794.00 | EMP_STATUS_2004.40 | 2004 Employment: Employment Status in Week 40 | XRND |
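
To keep track of which NLSY97 variable belongs to which birth year, a small lookup can help. The Python sketch below is illustrative only: the dictionary and function names are hypothetical, the week and reference numbers are those listed in this step, and the question names are built from the "EMP_STATUS_year.week number" pattern noted above.

```python
# Week of the year that contains October 1, as listed in Step 4.
OCT1_WEEK_97 = {2000: 41, 2001: 40, 2002: 40, 2003: 40, 2004: 40}

# Reference numbers tagged in Step 4, keyed by the year the respondent turns 20.
REF_97 = {
    2000: "R8812500",
    2001: "R8908000",
    2002: "R9043500",
    2003: "R9048700",
    2004: "R9179400",
}


def emp_status_name(year):
    """Build the question name from the EMP_STATUS_year.week pattern."""
    return f"EMP_STATUS_{year}.{OCT1_WEEK_97[year]}"


def variable_for_birth_year(birth_year):
    """Return (year turning 20, question name, reference number)."""
    year20 = birth_year + 20
    return year20, emp_status_name(year20), REF_97[year20]


for birth_year in range(1980, 1985):
    print(birth_year, variable_for_birth_year(birth_year))
```

A respondent born in 1980, for example, maps to the year 2000, question name EMP_STATUS_2000.41, and reference number R8812500.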

Step 5: Construct work status variable for both samples

Once you have the two data sets from Steps 3 and 4, you are ready to start programming the variables. The logic is as follows:

  1. Start by restricting the NLSY79 data to the cross-sectional sample and the oversamples of black and Hispanic individuals. These are the same sample types available in the NLSY97.
  2. Next, look at the definitions of the Employment Status variables in both cohorts. Define an indicator variable equal to 1 if the respondent is working in a civilian or military job during the week that includes October 1st in the year he or she turns 20 and 0 if the respondent is not working.
  3. If you look at the codebook for the Employment Status variables in the NLSY79, you will see that a value of 100 or more means the respondent was working in a civilian job in that week. A value of 7 means they were in the military. The values 2, 4, and 5 correspond to not working in the particular week. Treat 0 and 3 as missing information.
  4. Similarly, the codebook for the Employment Status variables in the NLSY97 indicates that a value of 9701 or more means the respondent was working in a civilian job in that week. A value of 6 means they were in the military. The values 1, 2, 4, and 5 correspond to not working in the particular week, and again, treat 0 and 3 as missing information.
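
The recoding rules in items 3 and 4 can also be sketched in Python, alongside the SAS and STATA programs in this step. This is a hedged illustration rather than official NLS code: the function and dictionary names and the toy records are hypothetical, while the cutoffs and variable numbers are the ones given in this tutorial. Any code not named in the rules (including negative values) is treated as missing here, which is an assumption of the sketch.

```python
# Birth year -> weekly status variable for the week of October 1 at age 20.
WEEK_VAR_79 = {58: "W0065200", 59: "W0070400", 60: "W0110300", 61: "W0150200",
               62: "W0190100", 63: "W0230000", 64: "W0270500"}
WEEK_VAR_97 = {1980: "R8812500", 1981: "R8908000", 1982: "R9043500",
               1983: "R9048700", 1984: "R9179400"}


def work_status_79(code):
    """NLSY79: 1 = working (civilian or military), 0 = not working, None = missing."""
    if code is None:
        return None
    if code >= 100 or code == 7:
        return 1
    if code in (2, 4, 5):
        return 0
    return None  # 0, 3, and any other code are treated as missing


def work_status_97(code):
    """NLSY97: 1 = working (civilian or military), 0 = not working, None = missing."""
    if code is None:
        return None
    if code >= 9701 or code == 6:
        return 1
    if code in (1, 2, 4, 5):
        return 0
    return None  # 0, 3, and any other code are treated as missing


def work_at_20(record, yob_field, week_vars, recode):
    """Pick the status variable for the respondent's birth year and recode it."""
    var = week_vars.get(record[yob_field])
    if var is None:  # e.g. NLSY79 birth year 57, which the arrays do not cover
        return None
    return recode(record.get(var))


# Hypothetical toy records, not real NLSY data.
r79 = {"R0000500": 59, "W0070400": 101}
r97 = {"R0536402": 1982, "R9043500": 5}
print(work_at_20(r79, "R0000500", WEEK_VAR_79, work_status_79))
print(work_at_20(r97, "R0536402", WEEK_VAR_97, work_status_97))
```

The first toy record is coded as working and the second as not working. Keeping the birth-year lookup separate from the recode makes it easier to change the reference week or age later.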

Sample programming code in SAS and STATA follows.

NLSY79 SAS Code for Step 5

*drop military supplemental sample;
if (r0173600 ge 15 and r0173600 le 20) then delete;

*drop economically disadvantaged non-black, non-Hispanic supplemental sample;
if (r0173600 = 9 or r0173600 = 12) then delete;

**Sample size is now 9763 observations;

**define year of birth;
yob79 = r0000500;

**create variable for whether working during the week that includes October 1st in year turn 20;
**1 = working at a civilian or military job;
**0 = not working;
if yob79 = 58 then do;
if (w0065200 ge 100 or w0065200 = 7) then work79_20 = 1;
if (w0065200 = 2 or w0065200 = 4 or w0065200 = 5) then work79_20 = 0;
end;
if yob79 = 59 then do;
if (w0070400 ge 100 or w0070400 = 7) then work79_20 = 1;
if (w0070400 = 2 or w0070400 = 4 or w0070400 = 5) then work79_20 = 0;
end;
if yob79 = 60 then do;
if (w0110300 ge 100 or w0110300 = 7) then work79_20 = 1;
if (w0110300 = 2 or w0110300 = 4 or w0110300 = 5) then work79_20 = 0;
end;
if yob79 = 61 then do;
if (w0150200 ge 100 or w0150200 = 7) then work79_20 = 1;
if (w0150200 = 2 or w0150200 = 4 or w0150200 = 5) then work79_20 = 0;
end;
if yob79 = 62 then do;
if (w0190100 ge 100 or w0190100 = 7) then work79_20 = 1;
if (w0190100 = 2 or w0190100 = 4 or w0190100 = 5) then work79_20 = 0;
end;
if yob79 = 63 then do;
if (w0230000 ge 100 or w0230000 = 7) then work79_20 = 1;
if (w0230000 = 2 or w0230000 = 4 or w0230000 = 5) then work79_20 = 0;
end;
if yob79 = 64 then do;
if (w0270500 ge 100 or w0270500 = 7) then work79_20 = 1;
if (w0270500 = 2 or w0270500 = 4 or w0270500 = 5) then work79_20 = 0;
end;

*work79_20 (mean = .604, N =8603);
*missings are mostly due to lack of employment status information at 20 for youths born in 1957;

NLSY97 SAS Code for Step 5

**define year of birth;
yob97 = r0536402;

**create variable for whether working during the week that includes October 1st in year turn 20;
**1 = working at a civilian or military job;
**0 = not working;
if yob97 = 1980 then do;
if (r8812500 ge 9701 or r8812500 = 6) then work97_20 = 1;
if (r8812500 = 1 or r8812500 = 2 or r8812500 = 4 or r8812500 = 5) then work97_20 = 0;
end;
if yob97 = 1981 then do;
if (r8908000 ge 9701 or r8908000 = 6) then work97_20 = 1;
if (r8908000 = 1 or r8908000 = 2 or r8908000 = 4 or r8908000 = 5) then work97_20 = 0;
end;
if yob97 = 1982 then do;
if (r9043500 ge 9701 or r9043500 = 6) then work97_20 = 1;
if (r9043500 = 1 or r9043500 = 2 or r9043500 = 4 or r9043500 = 5) then work97_20 = 0;
end;
if yob97 = 1983 then do;
if (r9048700 ge 9701 or r9048700 = 6) then work97_20 = 1;
if (r9048700 = 1 or r9048700 = 2 or r9048700 = 4 or r9048700 = 5) then work97_20 = 0;
end;
if yob97 = 1984 then do;
if (r9179400 ge 9701 or r9179400 = 6) then work97_20 = 1;
if (r9179400 = 1 or r9179400 = 2 or r9179400 = 4 or r9179400 = 5) then work97_20 = 0;
end;

*work97_20 (mean = .665, N =8435);


NLSY79 STATA Code for Step 5

*drop military supplemental sample;
drop if r0173600 >= 15 & r0173600 <= 20;

*drop economically disadvantaged non-black, non-Hispanic supplemental sample;
drop if r0173600 ==9 | r0173600 ==12;

**Sample size is now 9763 observations;

**define year of birth;
gen yob79 = r0000500;

**create variable for whether working during the week that includes October 1st in year turn 20;
**1 = working at a civilian or military job;
**0 = not working;
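*note: STATA treats missing values as larger than any number, so the >= tests below assume the negative NLSY missing codes have not been recoded to STATA missing values;
gen work79_20 = .;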
replace work79_20 = 1 if (yob79==58) & (w0065200 >= 100 | w0065200==7);
replace work79_20 = 0 if (yob79==58) & (w0065200==2 | w0065200==4 | w0065200==5);

replace work79_20 = 1 if (yob79==59) & (w0070400 >= 100 | w0070400==7);
replace work79_20 = 0 if (yob79==59) & (w0070400==2 | w0070400==4 | w0070400==5);

replace work79_20 = 1 if (yob79==60) & (w0110300 >= 100 | w0110300==7);
replace work79_20 = 0 if (yob79==60) & (w0110300==2 | w0110300==4 | w0110300==5);

replace work79_20 = 1 if (yob79==61) & (w0150200 >= 100 | w0150200==7);
replace work79_20 = 0 if (yob79==61) & (w0150200==2 | w0150200==4 | w0150200==5);

replace work79_20 = 1 if (yob79==62) & (w0190100 >= 100 | w0190100==7);
replace work79_20 = 0 if (yob79==62) & (w0190100==2 | w0190100==4 | w0190100==5);

replace work79_20 = 1 if (yob79==63) & (w0230000 >= 100 | w0230000==7);
replace work79_20 = 0 if (yob79==63) & (w0230000==2 | w0230000==4 | w0230000==5);

replace work79_20 = 1 if (yob79==64) & (w0270500 >= 100 | w0270500==7);
replace work79_20 = 0 if (yob79==64) & (w0270500==2 | w0270500==4 | w0270500==5);

*work79_20 (mean = .604, N =8603);
*missings are mostly due to lack of employment status information at 20 for youths born in 1957;

NLSY97 STATA Code for Step 5

**define year of birth;
gen yob97 = r0536402;

**create variable for whether working during the week that includes October 1st in year turn 20;
**1 = working at a civilian or military job;
**0 = not working;
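*note: STATA treats missing values as larger than any number, so the >= tests below assume the negative NLSY missing codes have not been recoded to STATA missing values;
gen work97_20 = .;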
replace work97_20 = 1 if (yob97==1980) & (r8812500 >= 9701 | r8812500==6);
replace work97_20 = 0 if (yob97==1980) & (r8812500==1 | r8812500==2 | r8812500==4 | r8812500==5);

replace work97_20 = 1 if (yob97==1981) & (r8908000 >= 9701 | r8908000==6);
replace work97_20 = 0 if (yob97==1981) & (r8908000==1 | r8908000==2 | r8908000==4 | r8908000==5);

replace work97_20 = 1 if (yob97==1982) & (r9043500 >= 9701 | r9043500==6);
replace work97_20 = 0 if (yob97==1982) & (r9043500==1 | r9043500==2 | r9043500==4 | r9043500==5);

replace work97_20 = 1 if (yob97==1983) & (r9048700 >= 9701 | r9048700==6);
replace work97_20 = 0 if (yob97==1983) & (r9048700==1 | r9048700==2 | r9048700==4 | r9048700==5);

replace work97_20 = 1 if (yob97==1984) & (r9179400 >= 9701 | r9179400==6);
replace work97_20 = 0 if (yob97==1984) & (r9179400==1 | r9179400==2 | r9179400==4 | r9179400==5);

*work97_20 (mean = .665, N =8435);


Additional information

Final statistics from sample program

work79_20 (mean = .604, N =8603)
work97_20 (mean = .665, N =8435)

Extensions

This tutorial focuses on forming comparable variables that measure work status at a given age in the cross-sectional sample and over-samples of blacks and Hispanics in the NLSY79 and NLSY97. One could calculate additional comparable variables in the two surveys over many different domains across different ages and years, such as labor market experience and number of jobs held, marital status and transitions, and fertility.