This chapter provides some practical information about how NLS variables are collected, created, and arranged in the data set. An explanation of the cohort's hard copy and electronic documentation is also included. The first section describes the different survey instruments used to collect the raw NLSY97 data. This section also explains how question names are assigned. Next, the guide discusses the primary types of NLS variables and the process by which each is assigned a reference number and title that serve to identify it throughout the NLS documentation. The third section reviews the codebook--that is, the information about each variable contained in the data set--and the accompanying documentation. This discussion will help users understand how to interpret the various pieces of information presented in the NLS documentation system. Finally, this chapter gives researchers some basic instruction in using the search functions to find variables of interest.
The primary variables found within the main data set are derived directly from one or more survey instruments. This section explains the conventions used in the NLSY97 documentation to identify questionnaire items from some of the primary survey instruments.
The term "survey instrument" is used to refer to the NLSY97 questionnaires that serve as the primary source of information on a given respondent. In round 1, there were separate and distinctly different questionnaires for the household informant (the Screener, Household Roster, and Nonresident Roster Questionnaire), the NLSY97 respondent (the Youth Questionnaire), and the responding parent (the Parent Questionnaire). In each subsequent round the Youth Questionnaire has been used to collect information from respondents. A Household Income Update, used in rounds 2-5, supplemented the Youth Questionnaire with information on household income collected from a parent. Each questionnaire is organized around a set of topical subjects, the titles of which usually appear on either the first page of each section of the questionnaire or as a header. The various survey instruments are described in detail in section 1.4, "Content of the NLSY97."
| User Notes: The questionnaires are critical elements of the NLSY97 documentation system and should be used by researchers to determine the wording of questions, response categories, and the universe of respondents asked a given question. |
For each round, NLSY97 questionnaires record (1) interview dates; (2) responses to the topical survey questions; (3) locating information which will assist NORC in finding the respondent for the next interview (not available to users); and (4) interviewer remarks on such topics as the race and gender of the respondent, language in which the interview was conducted, interviewer's impressions, etc. The show card, an interviewing aid used in conjunction with the questionnaire, lists the possible response categories for select questions and helps the respondent keep the more complicated response categories in mind.
Questionnaire Item or Question Name: This generic term identifies the source of data for a given variable. A questionnaire item may be a question, a check item, or an interviewer's reference item appearing within one of the survey instruments. These items have question names that begin with an abbreviation of the section where each is located. Following the section abbreviation, the question name includes a combination of numbers and letters that identify it within the section. Many questions simply have numbers in numerical order. Some questions, as in the examples in the tables below, have a decimal extension that indicates the question is repeated or looped during the survey. For example, a question about hours worked would be repeated for each employer, with decimal extensions .01 through .09 indicating employers 1-9. Another common extension in question names is _D, _M, or _Y (or ~D, ~M, ~Y), indicating that the variable reports the day, month, or year of a date. If a question is repeated in more than one round, it will have the same question name in each round so that users can easily locate identical questions in the data set across survey years.
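The naming convention described above can be sketched in code. The following is a minimal illustration, not an official NLS tool: the function name `parse_question_name` and the regular expression are assumptions made for this example, and the pattern covers only the common shapes named in the text (section abbreviation, item identifier, optional two-digit loop extension, optional day/month/year suffix). Question names that follow other patterns return `None`.

```python
import re

# Hypothetical pattern for the common question-name shapes described above:
#   SECTION-ITEM, an optional two-digit loop extension (.01, .02, ...),
#   and an optional date-part suffix (_D/_M/_Y or ~D/~M/~Y).
QNAME = re.compile(
    r"^(?P<section>[A-Z]+\d*)-(?P<item>[0-9A-Z]+)"
    r"(?:\.(?P<loop>\d{2}))?"
    r"(?:[_~](?P<date_part>[DMY]))?$"
)

DATE_PARTS = {"D": "day", "M": "month", "Y": "year"}


def parse_question_name(name):
    """Split a question name into its parts, or return None if it does not fit."""
    match = QNAME.match(name)
    if match is None:
        return None
    loop = match.group("loop")
    part = match.group("date_part")
    return {
        "section": match.group("section"),   # e.g. YHHI, YEMP, PINF, P3
        "item": match.group("item"),         # identifier within the section
        "loop": int(loop) if loop else None,  # roster/loop number, if any
        "date_part": DATE_PARTS[part] if part else None,
    }


if __name__ == "__main__":
    for qname in ("YHHI-4100.07~M", "YEMP-1800.02", "PINF-015_D", "YSCH-26500"):
        print(qname, parse_question_name(qname))
```

Names such as YAST-1240~000001 or YPOL_105 deliberately fall outside this simplified pattern, so a real workflow would need to handle them separately.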
|
Section |
Question Names |
Rounds Included | |||||||||
|
Round 1 |
Round 2 |
Round 3 |
Round 4 |
Round 5 |
Round 6 |
Round 7 |
Round 8 | Round 9 | |||
|
Youth Questionnaire |
|||||||||||
| Information |
YINF-2560 |
* |
|||||||||
| Household Information | YHHI-50510.04, YHHI-4100.07~M |
* |
* |
* |
* |
* |
* |
* | * | ||
| Childhood Retrospective | YCHR-1080, YCHR-1750.01 |
|
* |
* |
* | * | |||||
| CPS |
YCPS-14400 |
* |
* |
||||||||
| Schooling | YSCH-22800.01, YSCH-26500 |
* |
* |
* |
* |
* |
* |
* |
* | * | |
| Peers/Opportunity Sets |
YPRS-800 (round 1 only) |
* |
|
||||||||
| Time Use |
YTIM-2200 (round 1 only) |
* |
|
|
|
||||||
| Employment | YEMP-1800.02, YEMP-103500 | * | * | * | * | * | * | * | * | * | |
| Training |
YTRN-800, YTRN-9200.01 |
* |
* |
* |
* |
* |
* |
* |
* | * | |
| Health |
YHEA-1600, YHEA-2050 |
* |
* |
* |
* |
* |
* |
* |
* | * | |
| Self-Administered [1] | YSAQ-006A, YSAQ-447.03 | * | * | * | * | * | * | * | * | * | |
|
|
Marriage | YMAR-2100, YMAR-12700.01 | * | * | * | * | * | * | * | * | * |
| Fertility | YFER-700.04, YFER-14600 | * | * | * | * | * | * | * | * | * | |
| Child Care | YCCA-2100 | * | * | * | * | * | |||||
| Self-Administered 2 | YSAQ2-284, YSAQ2-298D18 | * | * | * | * | ||||||
| Best Friend | YFRD-120 | * | * | ||||||||
| Program Participation | YPRG-1700, YPRG-13500.01~M | * | * | * | * | * | * | * | * | * | |
| Income [2] | YINC-2300, YINC-21400.01 | * | * | * | * | * | * | * | * | * | |
| Assets [2] | YAST-2696, YAST-1240~000001 | * | * | * | * | * | * | * | * | * | |
| Expectations | YEXP-900 | * | * | * | |||||||
| PIAT Math | YPIA-100 | * | * | * | * | * | * | ||||
| Locator | YLOC-1500, YLOC-350 | * | * | * | * | * | * | * | * | * | |
| Interviewer Remarks | YIR-1500, YIR-1740.01 | * | * | * | * | * | * | * | * | * | |
| Domains of Influence | YDOM-300 | * | * | ||||||||
| Knowledge of Welfare | YWEK-200, YWEK-500M | * | |||||||||
| College Choice | YCOC-050D.01, YCOC-003C.01 | * | * | * | |||||||
| Political Participation | YPOL_105 | * | |||||||||
| Household Income Update | HIU-5 | * | * | * | * | ||||||
| [1] Section called Self-Administered 1 beginning in round 6, but question names remain unchanged. | |||||||||||
| [2] Sections were combined in round 1. All questions in that round start with YINC. | |||||||||||
| Section | Round 1 Question Names |
| Screener, Household Roster, and Nonresident Roster Questionnaire ||
| Screener | SE-9, SE-31B.01 |
| Household Roster | SH-1B, SH-103.05 |
| Nonresident Roster | SN-225.04, SN-337A.02 |
| Parent Questionnaire ||
| Information | PINF-015_D, PINF-297.01 |
| Family Background | P2-029, P2-108B.01 |
| Calendars | P3-051.01_M, P3-137 |
| Parent Health | P4-027 |
| Income and Assets | P5-073.02, P5-136 |
| Self-Administered | P6-021B, P6-036 |
| Child Calendar | PC8-009_Y, PC8-025, PC8-086.01 |
| Child Health | PC9-014, PC9-039.04 |
| Child Income | PC10-025 |
| Expectations | PC11-013 |
| Family | PC12-010, PC12-012A |
| Parent Locator | PLOC-018 |
| Parent Interviewer Remarks | PIR-007, PIR-009K |
| User Notes: Users should be aware that, while the source of the majority of variables in the main data sets is the questionnaire, certain variables are created either from other NLSY97 variables or from information found in an external data source (see "Types of Variables" below). |
There are five types of variables present in the NLSY97 data. The type of variable affects the title or variable description of each variable and the physical placement of the variable within the codebook. Types of variables include:
Direct (or raw) responses from a questionnaire or other survey instrument.
| User Notes: Users should note that survey personnel do not, in general, impute missing values or perform internal consistency checks across waves. Exceptions will be noted. |
Each variable within NLSY97 main file data sets has been assigned an 80-character summary title that serves as the descriptive representation of that variable throughout the hard copy and electronic documentation system. Variable titles are assigned by CHRR archivists who endeavor, within the limitations described below, to capture the core content of the variable and to incorporate universe identifiers that specify the subset of respondents for which each variable is relevant. Some titles indicate the reference periods (e.g., survey year or calendar year) of the variables as well.
Universe Identifiers: If two ostensibly identical variables differ only in their respondent universes, the variable title will include a reference to the applicable universe. The appropriate universe will either be appended in parentheses or identified before the variable title.
Example 1: R00029. "R Do Any Work for Pay Last Week? (R Does Not Own Bus/Farm)"
R00030. "R Do Any Work for Pay or Profit Last Week? (R Owns Bus/Farm)"
Example 2: R01075. "Compensation Received (Start <16) EMP 01"
R01803. "Compensation Received (Start 16+) EMP 01"
| User Notes: Users should not presume that two variables with the same or similar titles necessarily have the same (1) universe of respondents or (2) coding categories or (3) time reference period. While the universe identifier conventions discussed above have been utilized, users are urged to consult the questionnaires for skip patterns and exact time periods for a given variable and to factor in the relevant fielding period(s) for the cohort. In addition, variables with similar content may have completely different titles, depending on the type of variable (raw versus created). |
There are two main types of variables not necessarily represented by a single item in the questionnaire: symbols and roster items. These items are used by the CAPI system during the interview to organize, display, and store information collected during the interview; to determine which question paths the respondent should follow; and to fill in respondent-specific text in various questions. For example, rather than asking about a respondent's "current employer," the CAPI software fills in the actual employer name reported earlier in the interview. Many of these symbols and roster items are provided in the data set for user reference; researchers should be aware of the differences between the two types and the uses of each.
Symbols are variables that are used by the NLSY97 CAPI software to determine the flow of the interview. Symbols may contain real-time information captured during the survey, or they may be created in advance of the interview by survey staff. For example, before the income section for rounds 1-5, the survey program created a symbol that states whether the respondent is independent (Y12!INDEPEN). This symbol is later used to determine whether the youth is asked certain income and asset questions. Similarly, before the survey round starts, survey staff create a symbol indicating the respondent's gender (SYMBOL!KEY!SEX) which is used throughout the interview to make sure that the respondent is asked appropriate questions about gender-specific topics such as pregnancy.
All symbol variables have "Symbols" as their primary area of interest. In general, question names for round 1 symbol variables begin with "KEY!"; symbols in subsequent rounds generally have "SYMBOL!" to start their question names.
The NLSY97 uses rosters in various sections in which information is collected on a number of persons, schools, or employers. Rosters are an important part of the NLSY97 data set. These grids of information help researchers to analyze data in an efficient and accurate way. However, the structure and use of rosters may be somewhat confusing, so it is vital that researchers understand how they are constructed.
| User Notes: In addition to the detailed discussion in the following paragraphs, the introduction to section 4.3, "Employment," contains an example that illustrates how to use the employer roster in research. Although that example pertains specifically to employers, the basic concepts apply to other NLSY97 rosters. Researchers using any roster data may find the example helpful. |
What is a roster? A roster may be thought of as a list--for example, a list of household members, a list of employers, or a list of children. A respondent with two children will have data on the first two lines of the child list, or child roster. A respondent with four employers will have information on the first four lines of the employer roster. In addition to the name of the person or item (which is not released to the public), the roster contains other basic information, such as the age, race, and labor force status of household members or the start date and stop date for each employer.
In the paper-and-pencil interviews (PAPI) of older NLS cohorts, the questionnaires included a chart or grid listing this type of information, like the one shown in Figure 1. For example, in the household roster grid, each household member's name was entered in a separate row. The interviewer asked the respondent for each member's date of birth, enrollment status, employment status, etc., filling in the answers in the appropriate column. This completed household roster contained all the pertinent information about household residents, and researchers could easily use the variables based on this roster to examine characteristics of household members.
[Figure 1. Paper-and-pencil household roster grid (not reproduced here). The grid opens with the question "What are the names of all family members who are living in your home?" and has one row per household member.]
When the NLS surveys changed to computer-assisted personal interviewing (CAPI), rosters became a very important way of organizing information during the interview. Instead of using an actual grid, however, CAPI questionnaires include a series of questions that gather the same types of information that would have been included in the grid in a paper-and-pencil interview. The computer then moves the answers to these questions into a grid, creating a roster from the information.
After the roster is created, it can be used to guide subsequent portions of the interview. For example, during the interview the NLSY97 questionnaire gathers the names, dates of attendance, and level of school (secondary school or college) for each of the respondent's schools and organizes them into a roster. The rest of the school section then asks questions about the first school on the roster, followed by questions about the second school, then the third, and so on. The information about the level of the school determines whether the respondent is asked questions that apply to high school or college.
The information from the roster is also presented in the data set as an organized list of data, so that these variables are easy for researchers to access. To the user, the school roster appears as a consolidated block of variables that contains key information such as dates of enrollment, an identification number for the school, and variables indicating the type (private or public) and level (junior high, high school, college) of the school. For example, the variables in the round 2 school roster are listed in Figure 2, along with their reference numbers. Thus, rosters are a way of organizing information both for researchers and for the actual interview so that questions are asked in a logical manner.
| Question Name | Variable Title | Reference Numbers (one for each school) |
| NEWSCHOOL_PERIODS.xx | Number of Times R Enrolled in School xx | R24605.-R24610.00 |
| NEWSCHOOL_START1.xx | Month/Year R Start 1st Enrollment in School xx | R24611.00-R24616.01 |
| NEWSCHOOL_START2.xx | Month/Year R Start 2nd Enrollment in School xx | R24617.00-R24620.01 |
| NEWSCHOOL_START3.xx | Month/Year R Start 3rd Enrollment in School xx | R24621.00-R24621.01 |
| NEWSCHOOL_STOP1.xx | Month/Year R End 1st Enrollment in School xx | R24622.00-R24627.01 |
| NEWSCHOOL_STOP2.xx | Month/Year R End 2nd Enrollment in School xx | R24628.00-R24631.01 |
| NEWSCHOOL_STOP3.xx | Month/Year R End 3rd Enrollment in School xx | R24632.00-R24632.01 |
| NEWSCHOOL_SCHCODE.xx | School Code Elementary, Middle, High, College | R24633.-R24638.00 |
| NEWSCHOOL_INTERVIEW.xx | Which Survey Round School xx Reported in | R24639.-R24644.00 |
| NEWSCHOOL_TYPE.xx | Type of School xx R has Attended | R24645.-R24650.00 |
| NEWSCHOOL_PUBID.xx | PUBID of School xx R has Attended | R24651.-R24656.00 |
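In the question names above, the ".xx" extension stands for the roster line (school 01, school 02, and so on). As a rough sketch of how a researcher might regroup one respondent's flat roster variables by line, the example below assumes the extract has already been loaded into a plain mapping of question name to value. The helper name `roster_lines` and all data values are invented for illustration; only the NEWSCHOOL question-name stems come from the table.

```python
from collections import defaultdict


def roster_lines(record, roster_prefix):
    """Group one respondent's roster variables by roster line.

    record: mapping of question name -> value, e.g. {"NEWSCHOOL_TYPE.01": 1}
    Returns {line_number: {item_name: value}}.
    """
    lines = defaultdict(dict)
    for qname, value in record.items():
        stem, dot, line = qname.rpartition(".")
        # Keep only names shaped like PREFIX_ITEM.NN for the requested roster.
        if not dot or not line.isdigit():
            continue
        if not stem.startswith(roster_prefix + "_"):
            continue
        item = stem[len(roster_prefix) + 1:]
        lines[int(line)][item] = value
    return dict(lines)


# Invented values for a single respondent (not real NLSY97 data).
record = {
    "NEWSCHOOL_TYPE.01": 1,
    "NEWSCHOOL_PUBID.01": 101,
    "NEWSCHOOL_TYPE.02": 2,
    "NEWSCHOOL_PUBID.02": 102,
    "YSCH-26500": 3,  # not a roster item, so it is skipped
}

schools = roster_lines(record, "NEWSCHOOL")

if __name__ == "__main__":
    for line, items in sorted(schools.items()):
        print(f"School {line:02d}: {items}")
```

Grouping by line in this way keeps every attribute of school 01 together, which mirrors how the roster is meant to be read.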
How are rosters created during the interview? This section outlines the process used during the interview to create a roster. Rosters may include data from both previous interviews and the current interview. After the roster is created and sorted, it can be used to guide the rest of the interview. Figure 3 provides a pictorial overview of the creation of a roster.
3.2 Figure 3. How Rosters Are Created
Data from previous interviews: As shown in the figure, creation of a roster for the current round often begins with information found in the roster from the previous round. The appropriate respondent-specific data are saved on the interviewer's laptop before he or she administers the survey. When the interview gets to a point where roster information is collected, the data from the previous round's roster are often used as the base for the current roster. The respondent verifies and updates the information. If no changes have occurred since the last interview--for example, if exactly the same people live in the respondent's household--then the current round's roster will be the same as the one from the previous round.
For example, the interviewer reads a list of all of the people on the household roster from the last interview. The respondent first states whether any of those people have moved out of the household and then reports new household members. If any members remain from the previous year, their information--date of birth, gender, race/ethnicity, etc.--is carried over from the previous interview, and any missing data are collected. This method is more efficient than asking the respondent to report all household members every year.
Raw data collection: After the respondent and interviewer review and update the roster from the previous round, the survey collects current information. For example, new people might have moved into the household, so the interviewer asks the respondent about their characteristics. At this point, the respondent is done answering questions that will fill up the data grid on a particular topic.
Roster creation and roster sort: Using the updated roster from the previous round and the new raw data just collected, the computer creates a new roster for the current round. For example, the employer roster contains the following information for each job: a unique identification number for the employer, employment dates, whether the job was current at the interview date, whether the job was in the military, and whether the job was an internship. If the respondent had held the job at the time of the previous interview, the start date and employer identification number are carried over from the old roster, and the other information is taken from the questions at the beginning of the employment section for the current year. Similarly, the household roster contains information from the previous interview about household members reported at that time and data from the current interview about new household members.
In some cases, the computer also sorts the roster and puts the items in order based on a specified variable. For example, in the round 1 household roster, all youths in the age range of the NLSY97 cohort were listed first, and then all other household members were listed from oldest to youngest. The employer roster is sorted by job end date so that the most recent jobs are listed first.
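The creation and sort steps for the employer roster can be illustrated with a short sketch. This is not the CAPI program's actual logic: the function name, field names, and identification numbers are invented, and treating a job with no end date as the most recent is an assumption made here so that current jobs sort first.

```python
from datetime import date


def build_employer_roster(previous_roster, current_reports):
    """Combine last round's roster with this round's reports, newest job first.

    previous_roster: list of dicts with "uid" and "start"
    current_reports: list of dicts with "uid" and "stop" (None if the job is
        current), plus "start" for jobs first reported this round
    """
    carried = {job["uid"]: job for job in previous_roster}
    roster = []
    for report in current_reports:
        old = carried.get(report["uid"])
        # Start date is carried over for jobs already on the old roster.
        start = old["start"] if old is not None else report["start"]
        roster.append({
            "uid": report["uid"],
            "start": start,
            "stop": report["stop"],
            "current": report["stop"] is None,
        })
    # Sort by end date, most recent first. Assumption: ongoing jobs
    # (no end date) are treated as ending later than any dated job.
    roster.sort(key=lambda job: job["stop"] or date.max, reverse=True)
    return roster


# Invented example data.
previous = [{"uid": 101, "start": date(1997, 6, 1)}]
current = [
    {"uid": 101, "stop": date(1998, 3, 1)},
    {"uid": 102, "start": date(1998, 4, 1), "stop": None},
    {"uid": 103, "start": date(1998, 1, 1), "stop": date(1998, 2, 1)},
]

if __name__ == "__main__":
    for line, job in enumerate(build_employer_roster(previous, current), start=1):
        print(f"Job {line:02d}: {job}")
```

Note that the roster line assigned to a job depends on the sort, which is why line numbers can differ from the order in which jobs were first mentioned.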
Roster use in the interview: Finally, the roster is used to determine the order in which the other questions about each topic are asked. In most cases, the survey collects far more information than is stored in the actual roster, and the answers to these questions remain outside the roster as raw data. So that the interview makes sense to the respondent, these additional questions are asked about the people or things on the roster in the order that the people or things are listed.
For example, the respondent first answers questions about industry, occupation, rate of pay, etc., for the first employer listed on the roster. The same questions are then asked about the second job, then the third job, and so on. Similarly, the first set of questions about household members refers to the first person listed on the roster. When all of those questions have been answered, the same questions are asked about the second person, the third person, etc.
How should researchers use the roster data in analysis? The data set is organized so that rosters can be easily found and used in research. Because rosters present key pieces of information in a structured format, they are the best place to obtain that information. All variables found on rosters have "Roster Item" as their main area of interest. Each roster has a unique name that serves as the beginning of the question name for all variables on the roster; the same name appears at the beginning of the variable title for each item on the roster. Different rosters have been used in different rounds, depending on the topics included in the interview and the type of information collected. The roster names and question names are shown in Figure 4.
|
Roster |
Question name |
Round 1 |
Round 2 |
Round 3 |
Round 4 |
Round 5 |
Round 6 |
Round 7 |
Round 8 | Round 9 |
|
Household Information |
HHI2 (rd. 1), HHI (rds. 2-8) |
* |
* |
* |
* |
* |
* |
* |
* | * |
|
Nonresident Roster |
NONHHI |
* |
* |
* |
* |
* |
* |
* |
* | * |
|
Youth Information |
YOUTH |
* |
|
|
|
|
|
|
||
|
School Roster |
NEWSCHOOL |
|
* |
* |
* |
* |
* |
* |
* | * |
|
Employer Roster |
YEMP |
* |
* |
* |
* |
* |
* |
* |
* | * |
|
Freelance Jobs Roster |
FREELANCE |
|
* |
* |
* |
* |
||||
|
Training Roster |
TRAINING |
|
|
* |
* |
* |
* |
* |
* | * |
|
Biological Children Roster |
BIOCHILD |
* |
* |
* |
* |
|
|
|
||
|
Biological/Adopted Children |
BIOADOPT | * | * | * | * | * | ||||
|
Parent Household Information |
PARHHI |
* |
|
|
|
|
|
|
||
|
Parent Youth Information |
PARYOUTH |
* |
|
|
|
|
|
|
||
|
Partner/Spouse Roster |
PARTNERS | * | * | * | * | * | * | * | * | * |
|
Other Parents of Respondent's Children |
OTHERPARENTS | * | * | * | * | |||||
|
Partner/Spouse Information |
CUMPARTNERS | * | * | * |
| Data hint: Researchers can locate rosters in the data by looking at the roster item area of interest, by selecting the appropriate question name, or by searching the "Any Word in Context" index for variables with "ros" or "roster" and the name of the roster of interest in the title. |
| User Notes: When the NLSY97 data set was initially created, variables could only be assigned to one area of interest. The newer data extraction software permits variables to be linked to multiple areas of interest. However, additional areas have not been assigned to every variable. Because roster variables were initially located in the roster item area of interest, they may not be grouped with the rest of the data on a particular topic. For example, the school roster variables will not appear if the user searches for the "School Experience" area of interest. For this reason, it is very important that researchers become familiar with the rosters used in the data set. If a roster is available on the topic of a particular research project, users should always locate that roster using one of the search techniques mentioned above and examine it before using the other variables that relate to their research. |
Using rosters in single-round analyses: When looking at the data set, users will notice that many questions are repeated for each person or thing on the roster, and the titles for these repeated questions include a number. This number indicates the line on the roster that corresponds to the person or item being described in that variable. For example, the question "Self-Employed Business/Industry Job 02" indicates the industry of the second job listed on the respondent's self-employment roster. The researcher may then want to examine information such as the respondent's start and stop dates or rate of pay for that job. To find this information, he or she can then look at the data for those items contained in the roster for job #02, or the self-employment job that is on the second line of the roster. For all other questions asked after the roster was created in that same survey year, job #02 will refer to the same self-employment job.
Users should be aware that, in some cases, the information contained in the rosters actually appears in the data set more than once. As Figure 1 suggests, data may first be included at the point in the interview when the information was actually collected. For example, the round 1 screener question SE-28 asked the household informant for the date of birth of each household member. After all the raw data had been gathered, the computer sorted all the answers and created the household roster. At this point the date of birth information is also located in the round 1 roster variables named HHI2_DOB. In the case of the round 1 household roster, both the raw data and roster items are included in the data set.
In other cases, the raw answers may be blanked out of the public use data set. If a reference number is not listed for a given question in the questionnaire, then that raw data item may only be represented in roster form. For example, answers to the raw data questions used to create the employer roster are blanked out and do not appear in the data. In the printed questionnaire, these questions have no reference numbers. However, all of the data collected in these questions (except for confidential information like the name of the employer) appear in the employer roster.
| Data hint: Even though the data may appear more than once, survey staff strongly recommend that researchers use the roster information rather than the raw data whenever possible. Survey staff are working to eliminate these duplicate sources of information. For example, screener question SE-28 is one of the variables that has already been removed. |
For some variables, the roster information may be more accurate because some rosters are updated during the interview if the initial report was inaccurate. When survey staff prepare the data for release, they clean up the rosters if necessary but do not necessarily clean the corresponding raw data. Finally, because many rosters are sorted in a particular order, the number of a person or item on the roster will not match the number in the questions that precede roster creation. For example, in the household screener (the SE questions), person #01 is the first household resident mentioned to the interviewer. In the household roster and all later interview questions, person #01 is the oldest person in the household who was eligible for the NLSY97. Person #01 in the SE questions might be person #05 on the roster. It can be very difficult to determine to which person, school, or job a pre-sort question refers. For all of these reasons, roster data are always preferable to raw data in cases where both are available.
Using rosters from more than one round: Because the NLSY97 is a longitudinal survey, researchers often want to link data across survey rounds. However, household residents, jobs, and so on may move around on the roster in different interviews. That is, a father who was listed third on the roster in round 1 might move to position 2 or 4 in round 2. The unique identification numbers (UIDs) are the key to finding the same person or thing in different rounds. Most of the rosters contain variables assigning a unique number to each person or thing listed. This number never changes and can be used to link roster items across rounds. In some cases, it also makes it possible to link people between two different rosters in the same survey. For example, beginning in round 2 the unique ID listed for a child on the biological children roster is the same one assigned to that child on the household roster. Researchers can therefore examine data on both rosters about the same child.
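A small sketch may help show why the unique ID, rather than the roster line, is the right key for following someone across rounds. The function name, the data layout (one mapping of roster line to UID per round), and the ID values are all invented for this illustration.

```python
def track_across_rounds(rosters_by_round, uid):
    """Return {round: roster line} for every round in which the UID appears.

    rosters_by_round: {round_number: {line_number: uid}}
    """
    positions = {}
    for rnd, roster in sorted(rosters_by_round.items()):
        for line, line_uid in roster.items():
            if line_uid == uid:
                positions[rnd] = line
                break
    return positions


# Invented household rosters: the person with UID 13 is on line 3 in
# round 1 and moves to line 2 in round 2.
rosters = {
    1: {1: 11, 2: 12, 3: 13},
    2: {1: 11, 2: 13, 3: 14},
}

if __name__ == "__main__":
    print(track_across_rounds(rosters, 13))
```

Comparing line 3 in round 1 with line 3 in round 2 would pair two different people in this example, whereas matching on the UID follows the same person.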
An additional feature of most unique ID numbers is that they incorporate an indicator of the round in which the person or item was first reported. For example, IDs of roster items reported in round 1 may begin with "1" or "97," while those first reported in round 2 begin with "2" or "98." (Beginning with round 3, 4-digit years are used so that IDs begin with "1999" rather than just "99.") UIDs for people on the household roster are constructed in a slightly different manner; researchers should refer to section 4.6.5, "Household Composition," for more information.
Using the PARTNERS roster: The PARTNERS roster includes all persons who have lived with and/or have been married to the respondent since the date of last interview. This roster includes variables that indicate the spouse's or partner's unique ID, whether the spouse or partner is currently living with the respondent, and start and stop dates for the periods this spouse or partner has lived with the respondent. The PARTNERS roster also shows whether this spouse or partner has ever had a child with the respondent.
The rows on the PARTNERS roster correspond directly to the loops in the marriage section of the questionnaire, which collect the history of each partner with the respondent. For example, the spouse or partner identified by the variable PARTNERS_UID.02 is the person about whose highest grade completed at the time he or she started living with the respondent is asked in question YMAR-3500.02. Whenever the titles for the questions in the marriage section refer to the number of a partner (e.g., PARTNER 02 HGC WHEN START LIVING TOGETHER), this number corresponds to the line of the PARTNERS roster where this spouse or partner can be found.
Although the marital section collects detailed information about the spouse's or partner's characteristics at the time the spouse or partner first started living with the respondent, it does not collect current information about the spouse's or partner's employment or education. If the spouse or partner is currently living with the respondent, this information can be found on the household (HHI) roster of persons currently living with the respondent. To find the correct person on the HHI roster, simply find the matching unique ID number. For example, if PARTNERS_UID.02 = 200201 and HHI_UID.04 = 200201, the spouse or partner listed on line two of the PARTNERS roster is the same person listed on line four of the HHI roster.
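The matching step in this example can be sketched as a simple lookup. The helper name and the data layout (roster line mapped to unique ID) are assumptions for illustration; apart from the 200201 value taken from the example above, the IDs are invented.

```python
def link_partners_to_household(partners_uid, hhi_uid):
    """Map each PARTNERS roster line to the HHI roster line with the same UID.

    Both arguments are {roster line: unique ID}. Partners who are not on the
    household roster map to None.
    """
    hhi_line_by_uid = {uid: line for line, uid in hhi_uid.items()}
    return {line: hhi_line_by_uid.get(uid) for line, uid in partners_uid.items()}


# PARTNERS_UID.02 and HHI_UID.04 follow the example in the text;
# every other value is invented.
partners_uid = {1: 555001, 2: 200201}
hhi_uid = {1: 555101, 2: 555102, 3: 555103, 4: 200201}

links = link_partners_to_household(partners_uid, hhi_uid)

if __name__ == "__main__":
    for partner_line, hhi_line in links.items():
        print(f"PARTNERS line {partner_line:02d} -> HHI line {hhi_line}")
```

Here the partner on line two of the PARTNERS roster resolves to line four of the HHI roster, while the partner on line one has no household match.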
It is possible to link the two rosters in either direction; however, it may be more efficient to start with the HHI roster, locate the spouse or partner currently living with the respondent, and then search for that person on the PARTNERS roster.
Created variables generally start with "CV_" or "CVC_" in the codebook, as in the 'Hourly Rate of Pay' example later in this chapter, with a few exceptions. One major exception is the sampling weight variables, which have question names SAMPLING_WEIGHT and CS_SAMPLING_WEIGHT. Other exceptions to note include the validation variables for rounds 4 and 5, which have question name VALIDR_, and the timing variables (rounds 5-7) with question names R5_TIM, R6_TIM, and R7_TIM. In addition, the family process variables constructed by Child Trends (see sections 4.5 and 4.6) have question names beginning with "FP_" in the codebook. In the Event History data, all variables are created and can be located in the "Event History" area of interest (see section 4.4 of this guide for more information and question names).
| User Notes: Beginning in round 5 (2001), timing variables were created to measure the length of time a respondent took to complete the entire interview, along with a breakdown of the amount of time taken to complete each main questionnaire section. In round 7, timing variables were expanded to show the length of time it took to complete subsections. Each timing variable is tabulated in seconds, with one implied decimal place. Because of confidentiality concerns with the Welfare Knowledge section, round 7 timings are available only through the geocode release. |
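Because the timing variables carry one implied decimal place, a stored value must be divided by 10 to obtain seconds. The sketch below illustrates that conversion. The function names and sample values are invented, and treating negative values as missing is an assumption about how missing data are coded that should be checked against the codebook.

```python
def timing_to_seconds(raw):
    """Convert a stored timing value (one implied decimal place) to seconds."""
    # Assumption: negative values are missing-data codes, not real timings.
    if raw is None or raw < 0:
        return None
    return raw / 10


def timing_to_minutes(raw):
    """Convert a stored timing value to minutes, or None if missing."""
    seconds = timing_to_seconds(raw)
    return None if seconds is None else seconds / 60


if __name__ == "__main__":
    for stored in (12345, 36000, -4):  # invented values
        print(stored, timing_to_seconds(stored), timing_to_minutes(stored))
```

Under this reading, a stored value of 12345 represents 1,234.5 seconds.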
Variables present in the NLSY97 main file are documented via (1) a codebook; (2) accompanying supplemental documentation; and (3) error updates. This section describes the three primary components of the NLSY97 documentation and discusses the important types of information found within each.
The codebook is the principal element of the NLSY97 documentation system and contains information intended to be complete and self-explanatory for each variable in a data file. The NLS Investigator software allows easy access to each variable's codebook information and permits the user to print a codebook extract for selected variables.
Every variable is presented as a block of information called a "codeblock." Sample codeblocks are shown in Figures 1 and 2. Codeblock entries depict the following important information: coding information, frequency distribution, questionnaire items, universe information, valid values range, and question text. Each of the above terms is described more completely in the following pages. Codeblocks for many variables also include special notes designed to assist in the accurate use of data.
3.3 Figure 1. NLSY97 Questionnaire Item Codeblock
3.3 Figure 2. NLSY97 Created Variable Codeblock
Questionnaire Item or Question Name: The question name provides the location of the question in the survey instrument or identifies it as a created variable. In the first example, the question name YSCH-1400 shows that the variable is based on a question in the schooling section of the youth instrument. In the second example, the question name CV_HRLY_PAY.01 indicates that the variable is created. For more information on how question names are assigned, refer to section 3.1, "Survey Instruments & Other Documentation."
Reference Number: A reference number is a unique identifying number, originally beginning with "R," which is assigned to each variable in the data set. Reference numbers never change after they are assigned to the variables from an interview, even as additional information is added to the data set from later surveys.
| User Notes: Users should note that, because available "R" numbers were exhausted, beginning in round 6 new reference numbers start with "S." |
Coding Information: Each codeblock entry presents the set of legitimate codes that a variable may assume along with a text entry describing the codes.
| User Notes: Users should note that coding information for a given variable in the NLSY97 codeblock is not necessarily consistent with the codes found within the questionnaire. If the two sources are different, the codebook is current and the questionnaire information should not be used in analysis. For example, an additional code may be added during data processing if a significant number of respondents gave the same answer to the "other--specify" option in an answer list. |
The following types of code entries occur in NLSY97 codeblocks:
Dichotomous (yes/no) variables, uniformly coded "Yes" = 1 and "No" = 0. Other dichotomous variables have frequently been reformulated to permit this convention to be followed.
Discrete (Categorical), as in the case of 'Month Enrolled in School.'
| January | May | September |
Continuous (Quantitative), as in the case of 'Hourly Rate of Pay' in the example above. These variables have continuous data but are presented in the codebook using a convenient frequency distribution. Note that rate of pay variables often have two implied decimal places.
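Implied decimal places mean that the stored integer must be divided by a power of ten before use. The sketch below is a minimal illustration, assuming the variable has no legitimate negative values so that the codes -1 through -5 can only be the missing-value flags listed in this section. The function name is hypothetical, and the timing value of 1234 is made up.

```python
MISSING_CODES = {-1, -2, -3, -4, -5}

def apply_implied_decimals(raw_value, places):
    """Scale a stored integer by its implied decimal places; pass missing codes through as None."""
    if raw_value in MISSING_CODES:
        return None  # missing-value flags must never be rescaled
    return raw_value / (10 ** places)

print(apply_implied_decimals(5000, 2))  # 50.0, i.e. $50.00 per hour
print(apply_implied_decimals(1234, 1))  # 123.4 seconds for a timing variable
print(apply_implied_decimals(-4, 2))    # None (valid skip)
```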
Valid data are generally positive numbers. In a small number of variables, negative responses are possible; users should check the minimum values allowed for each question to clarify whether negative numbers are permissible. The following missing value conventions are used throughout the data:
Noninterview -5
Valid Skip -4
Invalid Skip -3
Don't Know -2
Refusal -1
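Before analysis, these codes usually need to be separated from valid data. The sketch below shows one way to do that in Python; the function name and the sample values are hypothetical. It assumes a variable with no legitimate negative values, so users should check the minimum allowed value before applying anything similar.

```python
MISSING_LABELS = {
    -1: "Refusal",
    -2: "Don't Know",
    -3: "Invalid Skip",
    -4: "Valid Skip",
    -5: "Noninterview",
}

def recode(raw_value):
    """Return (value, reason); value is None when the raw code flags missing data."""
    if raw_value in MISSING_LABELS:
        return None, MISSING_LABELS[raw_value]
    return raw_value, None

raw_column = [1250, -4, 5000, -1]  # made-up responses
recoded = [recode(v) for v in raw_column]
print(recoded)
```

Keeping the reason alongside the recoded value preserves the distinction between, for example, a respondent who refused and one who was never asked the question.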
Frequency Distribution: In the case of discrete (categorical) variables, frequency counts are normally shown in the first column to the left of the code categories. In the case of continuous (quantitative) variables, a distribution of the variable is presented using a convenient class interval. The format of these distributions varies.
Universe Information: The universe information found in the codebook includes:
Universe Totals: Two totals are presented: (1) The sum of the frequency counts for each coding category is located below the individual codes. (2) The sum of the valid responses plus missing response counts of "refusals," "don't knows," and "invalid skips" can be found in the TOTAL=======> field. The number of respondents who were not asked a question because it did not apply to them--that is, "valid skips (-4)"--is also depicted.
Universe Skip Patterns: The following detailed universe information will enable users to trace the flow of respondents both backward and forward through the CAPI questionnaire:
"Go to # XXXXX," appended to certain coding categories, indicates that respondents selecting that answer category were routed to the next question specified.
"Lead In(s) # XXXXX" identifies the question or questions immediately preceding the codeblock question through which the universe of respondents was routed. Each lead-in number is followed by the relevant response value indicators (e.g., (Default), (ALL), [1:1], [1:6], etc.).
"Default Next Question" specifies the next question that all respondents to the current question will be asked unless some skip condition indicates otherwise.
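The three routing entries can be thought of as edges in a small graph. The sketch below is only an illustration of that idea: the question names, answer codes, and data layout are invented and do not correspond to actual NLSY97 items or to the codebook's internal format.

```python
# Hypothetical routing table; question names and codes are invented.
ROUTING = {
    "Q-100": {"goto": {0: "Q-300"}, "default_next": "Q-200"},
    "Q-200": {"goto": {}, "default_next": "Q-300"},
    "Q-300": {"goto": {}, "default_next": None},
}

def next_question(question, answer):
    """Follow a 'Go to' for this answer if one exists, else the default next question."""
    entry = ROUTING[question]
    return entry["goto"].get(answer, entry["default_next"])

def lead_ins(question):
    """List the questions that can route a respondent directly into `question`."""
    sources = []
    for name, entry in ROUTING.items():
        targets = set(entry["goto"].values()) | {entry["default_next"]}
        if question in targets:
            sources.append(name)
    return sources

print(next_question("Q-100", 0))  # Q-300: a "No" skips the follow-up
print(next_question("Q-100", 1))  # Q-200: default next question
print(lead_ins("Q-300"))          # ['Q-100', 'Q-200']
```

Reading forward uses the "Go to" and "Default Next Question" entries; reading backward uses the lead-ins, which is how the universe of a question can be reconstructed.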
Valid Values Range: Depicted below the frequency distribution is information relating to the range of valid values for that particular distribution. "MINIMUM" indicates the smallest recorded value exclusive of skips, refusals, and don't knows. "MAXIMUM" indicates the largest recorded value. As described below, the computer-assisted interview contains internal range checks that limit responses to those between predesignated values, warn interviewers to verify non-normative values, and bolster the information provided by the traditional minimum and maximum fields.
Minimum and Maximum Fields: The MIN and MAX fields define the range of responses, i.e., the minimum and the maximum values, for a data item. The MAX of 5000 ($50.00) in the 'Hourly Rate of Pay' question means that it was the highest value recorded.
Hardmax and Hardmin Fields: Hardmax and Hardmin fields denote the highest and lowest values that were accepted. Dates (e.g., month/day/year of the respondent's birth (%birth4%) and current interview (%curdate4%)) are often used as Hardmin and Hardmax values in order to restrict responses to certain questions to values within that range, as in the 'Enter Month/Year R Last Enrolled in School' example. Responses outside this range must be entered by the interviewer in the comment field; valid numbers are included in the data.
Softmax and Softmin Fields: Softmax and Softmin fields cover ranges where an answer may exceed normal limits yet remain within absolute limits; such answers are accepted after verification. A Softmax set to $80,000 on an income question will trigger an alert to interviewers that a higher value is unusual.
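The interaction between hard and soft limits can be sketched as a simple three-way check. This is an assumption-laden illustration, not the CAPI software's actual logic: the function name, the return labels, and the hard limits of 0 and 999,999 are hypothetical; only the $80,000 Softmax comes from the text.

```python
def check_response(value, hard_min, hard_max, soft_min=None, soft_max=None):
    """Classify a response against absolute (hard) and normative (soft) limits."""
    if value < hard_min or value > hard_max:
        return "reject"   # outside absolute limits; recorded via the comment field
    if soft_min is not None and value < soft_min:
        return "verify"   # unusually low; interviewer is asked to confirm
    if soft_max is not None and value > soft_max:
        return "verify"   # unusually high; interviewer is asked to confirm
    return "accept"

# Hypothetical income question with a Softmax of $80,000.
print(check_response(40000, 0, 999999, soft_max=80000))    # accept
print(check_response(95000, 0, 999999, soft_max=80000))    # verify
print(check_response(2000000, 0, 999999, soft_max=80000))  # reject
```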
Income Values: Confidentiality issues restrict release of all income and asset values. To ensure respondent confidentiality, the top 2 percent of reported values for many income or asset variables are all converted to one set value. This "topcoded" value is calculated separately for each variable by averaging all the values that exceed the limit for that variable. Calculating topcode values in this way allows statistics such as means to accurately reflect the status of the population under examination without violating respondent privacy.
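The reason this approach preserves means can be seen in a small sketch. The code below is an illustration under stated assumptions, not the procedure used by survey staff: it takes the highest 2 percent of cases by rank, ignores ties and missing-value codes, and uses made-up income values.

```python
def topcode(values, top_percent=2):
    """Replace the highest `top_percent` of values with the mean of those values."""
    n_top = len(values) * top_percent // 100
    if n_top == 0:
        return list(values)
    # Positions of the values, ordered from smallest to largest.
    order = sorted(range(len(values)), key=lambda i: values[i])
    top_positions = order[-n_top:]
    top_mean = sum(values[i] for i in top_positions) / n_top
    result = list(values)
    for i in top_positions:
        result[i] = top_mean
    return result

# Made-up data: 98 ordinary incomes and two very high ones.
incomes = [30000] * 98 + [150000, 250000]
coded = topcode(incomes)

print(max(coded))                 # 200000.0, the shared topcode value
print(sum(incomes) / len(incomes), sum(coded) / len(coded))  # identical means
```

Because the replaced values sum to the same total as the originals, the overall mean is unchanged even though the individual extreme values are no longer visible.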
Verbatim: When an NLSY97 variable is taken directly from the questionnaire, the verbatim of the question or the instructions to interviewers appear beneath the variable title. If a single question is the source for more than one variable, the first variable contains the verbatim, while subsequent variables prompt the user to refer to the variable containing the verbatim.
Archivist information, notes, etc.: Some variables include additional information for users regarding inconsistencies in the data, methods of variable derivation, references to supplemental documentation, and so on. These notes generally appear beneath the variable title or question verbatim.
Purchasers of the NLSY97 data set must have access to all relevant documentation. Documentation for the NLSY97 includes the following items:
Technical Sampling Report--Youth Survey: This technical manual published by NORC describes the procedures used to select the youth sample. The manual includes weights and standard errors for the initial survey year.
Interviewer Reference Manuals: Accompanying each NLSY97 questionnaire will be an Interviewer Reference Manual. In a CAPI survey, interviewers have ready access to general and specific instructions that guide them in the administration of the electronic questionnaire. These "help screens" are physically linked to the appropriate questions throughout the instrument and can be accessed electronically. The Interviewer Reference Manual reproduces the help screens so that researchers can view the various definitions and other pieces of information used during the interview.
Codebook Supplements: Variable creation procedures and supplemental coding information are provided within the Codebook Supplement, an HTML document included with the data set. This information is not available in the NLSY97 codebook pages. The attachments and appendices in the following list can be found in the NLSY97 Codebook Supplement.
Attachment:
1. Census Industrial and Occupational Classification Codes. This document lists the 3-digit 1990 Census codes and the 4-digit 2002 Census codes used to classify job and training information (Census Bureau, 1990 Census of Population Alphabetical Index of Industries and Occupations, Washington, DC: U.S. Government Printing Office, 1991).
Appendices:
1. Education Variable Creation. This document provides the programs for several created variables related to education. These include, among others, enrollment status, type of school, date received diploma, highest grade completed, and number of schools attended.
2. Employment Variable Creation. This appendix provides programs for created employment variables, including hourly rate of pay, hourly monetary compensation, number of weeks worked, total tenure at job, and number of jobs held.
3. Family Background Variable Creation. This appendix of created variable programs contains those dealing with family background, such as household size, marital status, fertility and child status, marriage and cohabitation history, and citizenship status.
4. Geographic Variable Creation. Several variables in the main data set provide information about the respondent's area of residence, permitting researchers to identify key characteristics of the area without needing access to the Geocode CD. Included in this appendix is a summary of the four Census geographic regions, an explanation of the MSA/central city status variable, and the definition for the rural vs. urban variable.
5. Income and Assets Variable Creation. This document provides the creation procedures for income and assets created variables. These include household net worth and gross household income, as well as receipt of public assistance.
6. Event History Creation and Documentation. This appendix explains the structure of the event history variables and describes the creation process.
7. Continuous Month Scheme and Crosswalk. This document explains the structure of the event history month-by-month and week-by-week status arrays and provides crosswalks from continuous month/week numbers to actual month and year dates.
8. Instrument Rosters. A number of rosters are used to organize information during various parts of the interview. This appendix identifies these rosters and shows how they were used in different parts of the survey. It also lists the variable names, titles, and reference numbers for the various instrument rosters used in each interview.
9. Family Process and Adolescent Outcome Measures. This document summarizes the creation procedures for the various scales and indexes created by Child Trends, Inc. The appendix also presents the results of Child Trends' statistical analyses of the scales, indexes, and a number of related attitude and behavior variables.
10. CAT-ASVAB Scores. This appendix discusses the administration of the Armed Services Vocational Aptitude Battery (ASVAB) to NLSY97 respondents. Topics include an explanation of the computer-adaptive test, the scoring of the ASVAB, and the variables available in the NLSY97 data set.
11. Collection of the Transcript Data. This appendix describes the survey materials used to collect data in the two waves of the NLSY97 Transcript Survey and explains the procedures and criteria for data entry and coding. It also lists specific details about individual Transcript Survey variables.
Geocode Codebook Supplements: Supplemental coding information specific to the Geocode CD is provided within the Geocode Codebook Supplement.
Attachments:
100. 1990 Census Bureau State and County Codes. This attachment provides coding information for the state and county variables on the NLSY97 Geocode CD. These variables use the current Federal Information Processing Standards (FIPS) codes.
101. MSA Codes. This document lists the Metropolitan Statistical Area (MSA) coding scheme used for NLSY97 geocode variables. It also presents Consolidated Metropolitan Statistical Area (CMSA) codes, New England Consolidated Metropolitan Area (NECMA) codes, and Primary Metropolitan Statistical Area (PMSA) codes.
102. IPEDS Data and College Identification Codes. This attachment explains the Integrated Postsecondary Education Data System (IPEDS), and how this and other codes are used to identify the colleges reported by NLSY97 respondents.
NLSY97 variables can be accessed using NLS Web Investigator, which is available as a Web application. The main application of NLS Web Investigator is to access NLS variables for the purposes of identifying, selecting, extracting, and/or running frequencies or cross-tabulations. The new interface allows the researcher to connect to a database and perform variable extractions without installing any software on a local computer. Through a personal online account, a researcher's selected variable tag sets (see below), frequencies, and extracts will be available for a specified period of time from any computer location with Web access. Because there will be one central data source for all users, researchers will have the assurance that they are always working with the most up-to-date data, and that any necessary corrections will be immediate and universal.
3.4 Figure 1. NLS Web Investigator Search Screen

A basic NLS Web Investigator data search might include the following steps:
1. Select the desired NLS cohort.
2. Choose filters to narrow the scope of that cohort's variables.
3. Create a tag set of desired variables.
4. Run extracts and/or frequency tables for tagged variables.
In this case, the NLSY97 cohort was selected in step 1. After selecting a specific NLS cohort, users can peruse the data set and choose the variables they need through several search index options (described in the next section). Utilizing drop-down menus, researchers can search indexes for Variable Title, Area of Interest, Survey Year, Reference Number, and Question Name. If preferred, users can also search for any word in the variable title, question name, or question text. In addition, the search indexes include a "not" option to exclude particular variables from a search.
During this search process, filters can be used to focus the search on specific variables of interest. Researchers can use more than one search index at a time; for example, a user could search for variables in the Child Care area of interest for the 2002 survey year (round 6) only. Filters can also be layered, so that more than one search term can be used in a single index; for example, a researcher could first use the Word in Title index to find variables containing the word "parent" and then fine-tune the variable list by saving the "parent" filter and searching for words like "contact" or "characteristics."
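Conceptually, combining and layering filters amounts to repeatedly narrowing a list of variable records. The sketch below mimics that behavior on a tiny in-memory catalog. It does not use or represent the NLS Web Investigator interface; the class, function, reference numbers, and titles are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Variable:
    reference_number: str
    title: str
    area_of_interest: str
    survey_year: int

def filter_by(variables, area=None, year=None, title_words=()):
    """Keep variables that satisfy every supplied filter."""
    kept = []
    for var in variables:
        if area is not None and var.area_of_interest != area:
            continue
        if year is not None and var.survey_year != year:
            continue
        title = var.title.upper()
        if all(word.upper() in title for word in title_words):
            kept.append(var)
    return kept

# Invented catalog entries; not real NLSY97 variables.
catalog = [
    Variable("X0000001", "PARENT CONTACT WITH SCHOOL", "Parent Current Status", 1997),
    Variable("X0000002", "CHILD CARE HOURS PER WEEK", "Child Care", 2002),
    Variable("X0000003", "PARENT CHARACTERISTICS SUMMARY", "Parent Background", 1997),
    Variable("X0000004", "CHILD CARE PROVIDER TYPE", "Child Care", 2001),
]

# Two indexes at once: area of interest plus survey year.
child_care_2002 = filter_by(catalog, area="Child Care", year=2002)

# Layered filters in one index: save "parent", then narrow with "contact".
parent_vars = filter_by(catalog, title_words=["parent"])
parent_contact = filter_by(parent_vars, title_words=["contact"])

print([v.reference_number for v in child_care_2002])  # ['X0000002']
print([v.reference_number for v in parent_contact])   # ['X0000001']
```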
3.4 Figure 2. Using Filters in NLS Web Investigator

As users identify variables of interest, they can mark or save these variables in a "tag set," essentially a saved list of variables of interest. Users can save tag sets in their accounts on the central database server for up to 90 days and access the saved tag sets from any computer; tag sets can also be saved on the researcher's local machine for as long as desired.
Once a tag set of variables is formed, researchers can run simple statistics such as frequencies and cross-tabulations (with or without round-specific weights) via the Web Investigator. If more detailed data analysis is required, researchers can produce a data extract file with SAS or SPSS statements or a Stata dictionary for use on a local computer with the user's own statistical software. Results are received quickly, usually in less than a minute. Users can save the frequency, table, and extract files to their local computer or access the files from their personal NLS Web Investigator accounts. Extracts and tables are saved for 4 days.
Several different search indexes are available within NLS Investigator for finding and selecting variables of interest. These indexes are described in greater detail below.
Word in Title. All words, numbers, and symbols found in any variable title form an index in the data set. The "Word in Title" search function allows the user to search this index and select NLSY97 variables whose titles contain any single word or combination of words.
| User Notes: Word in Title searches for NLSY97 variables are limited by the choice of variable titles. Flexibility in variable title assignment for raw data items is restricted by the wording of the question as it appears in the survey instrument and the maximum allowable length for variable titles. Users may need to try several different abbreviations. |
Area of interest. NLSY97 data files are organized so that variables sharing a common factor are stored in unique groupings called "areas of interest." Users can browse through a given area and examine the variables associated with that topic.
3.4 Figure 3. NLSY97 Areas of Interest
| Achievement Tests | Income | School Characteristics |
| Assets & Debts | Industry & Occupation | School Experience |
| Attitudes | Interaction btwn Parents | Screener |
| Autonomy & Control | Interviewer Remarks | Screener Extended |
| Child Calendars | Job Search | Screener Household |
| Child Care | Jobs & Employers | Screener Non-Resident |
| Child Family Background | Labor Force Status | Screener Parent |
| Children | Locator | Self-Employment |
| College Experience | Machine Check | Sexual Activity |
| Common Variables | Marriage & Cohabitation | Substance Use |
| Contact w/Non-Res. Parent | Military | Symbols |
| Created Variables | Non-Res. Characteristics | Tenure w/ Employer |
| Demographic Indicators | Parent Background | Time Spent at Work |
| Ed. Status & Attainment | Parent Current Status | Time Use |
| Employment Gaps | Parent Family Background | Training |
| Expectations | Parent Information | Transcript Survey |
| Family Process Measures | Parent Locator | Validation |
| Fertility and Pregnancy | Parent Retrospective | Wages & Compensation |
| Fringe Benefits | PIAT | Work Experience |
| Geographic Indicators | Program Participation | Youth History |
| Health | Roster Item | Youth Locator |
| Household Characteristics | Sample Design & Screening | Youth Schooling |
| Illegal Activity & Arrest | School-Based Learning | Youth Self-Administered |
Survey year. This index lists variables by survey round. Because each round includes thousands of variables, this index is best used in combination with another type of search. Note that, while NLSY97 documentation generally uses round numbers rather than calendar years, NLS Investigator requires calendar years for this search index. The year in the index corresponds to the year in which the survey round began (round 1=1997, round 2=1998, etc.).
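Because the documentation speaks in rounds while this index expects calendar years, a small conversion helper can be handy. The sketch below assumes one round per year starting in 1997, as in the rounds described in this guide; it would need a lookup table instead if rounds were not fielded annually. The function names are hypothetical.

```python
FIRST_ROUND_YEAR = 1997  # round 1 began in 1997

def round_to_survey_year(round_number):
    """Convert a round number to the year the round began (annual rounds assumed)."""
    if round_number < 1:
        raise ValueError("round numbers start at 1")
    return FIRST_ROUND_YEAR + round_number - 1

def survey_year_to_round(year):
    """Convert a survey year back to its round number (annual rounds assumed)."""
    if year < FIRST_ROUND_YEAR:
        raise ValueError("the first round began in 1997")
    return year - FIRST_ROUND_YEAR + 1

print(round_to_survey_year(6))     # 2002
print(survey_year_to_round(2001))  # 5
```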
Reference number. Researchers can use this index to look up a specific variable by reference number. In general, this is most useful when the researcher needs to double check a known variable; simply browsing by reference number is not recommended.
Question name. This index lists all variables using the question name (that is, question number) from the questionnaire. This will facilitate finding the codebook page for an item of interest seen in the questionnaire or finding the same item across survey rounds. Note that, whenever possible, survey staff keep question names constant when the same item appears in multiple rounds, but occasionally changes must be made. If an item is not found by using the Question Name index, users may wish to double check using a different search function rather than assuming the item does not exist.
NLS Web Investigator offers two key additional features to help researchers better understand and use NLSY97 data.
Custom weights. The custom weighting program helps users create a custom set of survey weights, which improves a researcher's ability to accurately calculate summary statistics from multiple years of data.
Supplementary documentation. In addition to a general user's manual, NLS Web Investigator supplies user's guides, questionnaires and other documentation for each cohort, including the NLSY97. Items available online for the NLSY97 include an HTML version of this user's guide, questionnaires from each round of the survey, main file and geocode codebook supplements for each round, and the round 1 technical sampling report, which provides statistical information about the selection of the initial NLSY97 sample. Questionnaires are provided in various formats, depending on the technology used to produce them; for the first several rounds the questionnaires are in Word or PDF format, while more recent rounds have fully linked HTML questionnaires that permit researchers to follow various question paths through the instrument. Similarly, codebook supplements from the first few survey rounds are PDF files, while more recent rounds are in HTML format.