All variables present on a main file data set (accessed through NLS Investigator) are documented via: (1) a cohort-specific codebook and (2) an accompanying codebook supplement. This section describes these components and discusses the important types of information found within each.


The codebook is the principal element of the documentation system and contains information intended to be complete and self-explanatory for each variable in a data file. Codebook information can be viewed with the use of NLS Investigator by clicking on a variable's reference number once a list of variables has been selected.

Every variable is presented within the documentation as a block of information called a "codeblock." Codeblock entries depict the following information: a reference number, variable title, coding information, frequency distribution, reference to the questionnaire item or source of the variable, and information on the derivation for created variables. The codeblocks of many variables include special notes containing additional information designed to assist in the accurate use of data from that variable.

Codebooks are arranged by reference number. Variables are first grouped according to survey year. Within each survey year, those variables related to the interview (e.g., interview method, interview date, reason for noninterview, sampling weight) appear first, followed by variables picked up directly from the questionnaire and Information Sheet. In general, created and edited variables appear last, although the created environmental variables are grouped with variables related to the interview in the early survey years.

User Note

NLS codebooks are not a substitute for the questionnaires. Although these two pieces of documentation contain similar information, the questionnaires should be used to determine precise universe information.

The following common types of information for each variable within a codeblock will be discussed in this section: coding information, multiple responses, missing responses, derivations, frequency distribution, questionnaire items (question numbers), universe information, and valid values range.

Coding Information

Each codeblock entry presents the set of legitimate codes that a variable may assume along with a text entry describing the codes. Users should note that coding information in the codeblock for a given variable is not necessarily consistent with the codes found within the questionnaire or for the same variable across years. Use only the codebook coding information for analysis. The following types of code entries occur in NLS codeblocks:

Dichotomous or yes/no variables are uniformly coded "Yes" = 1, "No" = 0. Some dichotomous variables in the 1990 Older Men survey were reformulated to permit this convention to be followed.

Discrete (Categorical), as in the case of 'Highest Grade Attended, 66'

    0  None            6  Sixth grade      12 Twelfth grade        18 Sixth+ years college
    1  First grade     7  Seventh grade    13 First year college   -1 Elementary, year unspecified
    2  Second grade    8  Eighth grade     14 Second year college  -2 High school, year unspecified
    3  Third grade     9  Ninth grade      15 Third year college   -3 College, year unspecified
    4  Fourth grade    10 Tenth grade      16 Fourth year college
    5  Fifth grade     11 Eleventh grade   17 Fifth year college

Continuous (Quantitative), as in the case of 'Hourly Rate of Pay at Current or Last Job 66 *KEY*'. For continuous variables, the responses are presented in the codebook using a convenient class interval.

    000001 thru 999999  rate with two implied decimal places

    0            1500-1999    3500-3999
    1-499        2000-2499    4000-4499
    500-999      2500-2999    4500-4999
    1000-1499    3000-3499    5000+

Combined Quantitative-Qualitative Variables, i.e., variables which are ostensibly quantitative but which may have several nonquantitative (categorical) responses, utilize positive integers equaling the actual values for the quantitative responses and negative values, beginning with -1, for the qualitative (categorical) responses. For example, the Older Men variable 'Expected Age of Retirement, 66' is coded as follows:

    45 thru 99 = 45-99 years
    -1 = age not given
    -2 = will not retire; don't plan to stop working
    -3 = already retired
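The split between quantitative and categorical codes can be handled at read time. The following is a minimal sketch, assuming the raw value has already been screened for missing-data codes; the function name `decode_retirement_age` and the label dictionary are illustrative and not part of any NLS software.

```python
from typing import Optional, Tuple

# Categorical (negative) codes for 'Expected Age of Retirement, 66',
# copied from the coding scheme shown above.
QUALITATIVE_LABELS = {
    -1: "age not given",
    -2: "will not retire; don't plan to stop working",
    -3: "already retired",
}


def decode_retirement_age(code: int) -> Tuple[Optional[int], Optional[str]]:
    """Split a raw code into (age_in_years, categorical_label).

    Exactly one element of the returned pair is non-None.
    """
    if 45 <= code <= 99:
        # Positive codes equal the actual quantitative response.
        return code, None
    if code in QUALITATIVE_LABELS:
        # Negative codes stand for nonquantitative responses.
        return None, QUALITATIVE_LABELS[code]
    raise ValueError(f"unexpected code for this variable: {code}")
```

Keeping the two kinds of response in separate fields prevents a code such as -3 from being averaged in as if it were an age.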

Multiple Responses

Response categories to multiple entry questions found in certain Original Cohort job search, discrimination, and health questions have been coded in a geometric progression. More than one response was possible to, for example, the question "What were you doing in the past four weeks to find work?" The response categories to that question were coded as follows:

  1 Checked with public employment agency
  2 Checked with private employment agency
  4 Checked with employer directly
  8 Checked with friends or relatives
  16 Placed or answered ads
  32 Other method

Multiple responses are then coded for each respondent by adding the individual codes, which yields a unique value for each combination. Such multiple entry variables are identified by an asterisk (*) next to the answer categories in the questionnaire. If a multiple entry has only a few unique combinations, the codebook will specify the exact combinations; those with many combinations need to be unpacked. Methods of unpacking such multiple entry variables are presented in Appendix C at the end of this guide.
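Because each code is a distinct power of two, a summed value can be unpacked by testing each code against the total. The sketch below illustrates that idea for the job search question above; it is not necessarily the method given in Appendix C, and the names `METHOD_CODES` and `unpack_methods` are hypothetical.

```python
from typing import List

# Response codes for "What were you doing in the past four weeks to
# find work?", as listed above. Each code is a distinct power of two.
METHOD_CODES = {
    1: "Checked with public employment agency",
    2: "Checked with private employment agency",
    4: "Checked with employer directly",
    8: "Checked with friends or relatives",
    16: "Placed or answered ads",
    32: "Other method",
}


def unpack_methods(total: int) -> List[str]:
    """Return the individual responses encoded in a summed value."""
    largest = sum(METHOD_CODES)  # 63: every method selected
    if not 0 <= total <= largest:
        raise ValueError(f"value {total} is outside 0..{largest}")
    # A code is part of the sum exactly when its bit is set.
    return [label for code, label in METHOD_CODES.items() if total & code]
```

For example, a stored value of 12 can only arise as 4 + 8, so `unpack_methods(12)` returns the "employer directly" and "friends or relatives" responses.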

Missing Responses

The following conventions were used to treat nonresponse in interviews with the Older and Young Men.

"NA" is the convention used to describe the absence of a valid response where (1) the respondent is not in the applicable universe or (2) the respondent refused to respond, or an interviewer, coding, transcribing, or data entry error occurred. NA codes are typically treated as missing data.

  • NA is assigned a value of -128 if valid responses to a question or created variable range from -126 to +127 inclusive.
  • NA is assigned a value of -999 if valid responses fall outside the range of -126 to +127.

Note: Refusals were also coded as -1 for some income items and other sensitive questions during PAPI interviews; however, -1 has other meanings on other questions, such as 'Highest Grade Attended, 66' in Figure 3.3.1. Users should consult the codebook before working with variables that include -1 values.

"DK" is the convention used to denote a "don't know" response; these codes are typically treated as missing data.

  • DK is assigned a value of -127 if valid responses range from -126 to +127.
  • DK is assigned a value of -998 if valid responses fall outside the range of -126 to +127.

Note: "NEGATIVE" is a convention used in the codebook that provides the frequency of negative responses that are not defined as NA or DK (i.e., missing).
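The NA and DK conventions above can be sketched as a small classifier. This is an illustration under stated assumptions: the function names are hypothetical, the valid range of a variable must be supplied by the user from the codebook, and the item-specific use of -1 for refusals is not handled.

```python
from typing import Dict


def missing_codes(valid_min: int, valid_max: int) -> Dict[str, int]:
    """Return the NA and DK codes implied by a variable's valid range."""
    if valid_min >= -126 and valid_max <= 127:
        return {"NA": -128, "DK": -127}
    # Valid responses fall outside -126..+127.
    return {"NA": -999, "DK": -998}


def classify(value: int, valid_min: int, valid_max: int) -> str:
    """Label a raw value as NA, DK, NEGATIVE, or VALID."""
    codes = missing_codes(valid_min, valid_max)
    if value == codes["NA"]:
        return "NA"
    if value == codes["DK"]:
        return "DK"
    if value < 0:
        # Negative but not missing, e.g. -1 "Elementary, year
        # unspecified" in 'Highest Grade Attended, 66'.
        return "NEGATIVE"
    return "VALID"
```

A value of -1 in 'Highest Grade Attended, 66' is therefore counted as NEGATIVE rather than missing, which is why the codebook entry should be checked before any negative value is recoded.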


Derivations

The decision rules employed in the creation of constructed variables have been included, whenever possible, in the codebook under the title "DERIVATIONS." This information is designed to enable researchers to determine whether available constructs are appropriate to their needs. In the 'Hourly Rate of Pay at Current or Last Job 66 *KEY*' example, the derivation describes in detail the questionnaire items used to create the variable. If the derivation is too lengthy to be included in the codebook, the codeblock instead refers users to the supplemental documentation item that contains variable creation information. In the case of 'Highest Grade Attended, 66', no derivation is shown because this variable is picked up directly from the questionnaire.

Frequency Distribution

In the case of discrete (categorical) variables, frequency counts are normally shown in the first column to the left of the code categories. In the case of continuous (quantitative) variables, a distribution of the variable is presented using a convenient class interval. The format of these distributions varies. In the case of 'Hourly Rate of Pay at Current or Last Job 66 *KEY*,' the frequency count is straightforward. There are twelve categories; the maximum category shown is 5000 and above (since two decimal places are implied, the figure 5000 represents $50.00 and above), for which there is a frequency count of 0.
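The twelve class intervals for the hourly rate variable can be reproduced with a short function. This is a sketch that assumes the interval width of 500 and the open-ended top category of 5000 shown in the coding example; the name `class_interval` is illustrative.

```python
from collections import Counter
from typing import Iterable


def class_interval(rate: int, width: int = 500, top: int = 5000) -> str:
    """Return the codebook class interval label for a raw hourly rate.

    Rates are stored with two implied decimal places, so 5000 means $50.00.
    """
    if rate < 0:
        raise ValueError("missing-data codes must be removed first")
    if rate == 0:
        return "0"
    if rate >= top:
        return f"{top}+"
    lower = (rate // width) * width
    # The first interval starts at 1 because 0 has its own category.
    return f"{max(lower, 1)}-{lower + width - 1}"


def frequency_distribution(rates: Iterable[int]) -> Counter:
    """Count how many rates fall into each class interval."""
    return Counter(class_interval(rate) for rate in rates)
```

An interval with no observations, such as 5000+ in the example above, simply does not appear in the returned counter and would be reported as 0.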

Questionnaire Item

"Questionnaire item" is a generic term identifying the printed source of data for a given variable. A questionnaire item may be a question, a check item, or an interviewer's reference item appearing within one of the survey instruments. In 'Highest Grade Attended, 66', the questionnaire item is 48A.

Universe Information

Universe information for the Original Cohort data sets is printed as separate line items in the codebook for each survey through 1976. Both sample variables present universe information at the bottom of the codeblock; in Figure 3.3.1, for example, 47 respondents do not have information available. Subsequent to 1976, universes can be tracked by referring to the flowchart associated with a particular year's survey.

Valid Values Range

Depicted below the frequency distribution is information relating to the range of valid values for that particular distribution. "MINIMUM" indicates the smallest recorded value exclusive of "NA" and "DK." "MAXIMUM" indicates the largest recorded value. In the case of the created variable example (Figure 3.3.2), 'Hourly Rate of Pay at Current or Last Job 66 *KEY*,' the maximum is 4615 with two implied decimal places, or $46.15.
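The MINIMUM and MAXIMUM entries, and the conversion of implied decimals to dollars, might be computed along the following lines. This is a sketch only: the function names are hypothetical, and the default NA and DK codes of -999 and -998 assume a variable whose valid responses fall outside -126 to +127, as the hourly rate variable does.

```python
from decimal import Decimal
from typing import Iterable, Tuple


def valid_range(values: Iterable[int],
                na_code: int = -999,
                dk_code: int = -998) -> Tuple[int, int]:
    """Return (minimum, maximum) of the values, excluding NA and DK."""
    valid = [v for v in values if v not in (na_code, dk_code)]
    if not valid:
        raise ValueError("no valid responses")
    return min(valid), max(valid)


def to_dollars(raw: int) -> Decimal:
    """Convert a raw value with two implied decimal places to dollars."""
    return Decimal(raw) / Decimal(100)
```

Using `Decimal` rather than floating point keeps the converted amount exact, so a raw maximum of 4615 becomes exactly 46.15.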

Topcoding and Asset Values

To ensure respondent confidentiality, income values exceeding particular upper limits are truncated in each survey year; that is, they are converted to a set maximum value. These upper limits vary by year and cohort, as do the set maximum values. From 1966 through 1970, dollar amounts exceeding the upper limits were converted to set maximum values of 990, 999, 9990, 9999, 999900, or 999999. From 1971 through 1980, such amounts were set to a maximum value of 50000, and from 1981 to 1983 the set maximum value was 50001. From 1966 through 1980, asset variables exceeding upper limits were truncated to 999999, and beginning in 1981 assets exceeding one million were converted to a set maximum value of 999997. In the 1990 survey of the Older Men, Census also topcoded selected asset items if it considered that release of the absolute value might aid in the identification of a respondent. This topcoding was conducted on a case-by-case basis, with the mean of the top three values substituted for each respondent who reported such amounts.
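The basic truncation rule can be sketched as follows. The function `topcode` is hypothetical and shows only the general rule; the actual upper limits vary by year and cohort and must be taken from the codebook, and the case-by-case 1990 Census procedure is not modeled here.

```python
def topcode(value: int, upper_limit: int, set_maximum: int) -> int:
    """Replace a value above the upper limit with the set maximum value."""
    return set_maximum if value > upper_limit else value


# Asset rule stated above for 1981 onward: amounts exceeding one
# million are converted to a set maximum value of 999997.
reported_assets = 1_500_000
released_assets = topcode(reported_assets,
                          upper_limit=1_000_000,
                          set_maximum=999_997)
```

When analyzing released data, a value equal to the set maximum should therefore be read as "at least the upper limit" rather than as the reported amount.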