Chapter 3 continued:  Guide to the Young Women Data

 



3.3 Young Women Codebook System

All variables present on a main file data set are documented via:  (1) a cohort-specific codebook, (2) an accompanying codebook supplement, and (3) error updates.  This section describes these three primary components of the codebook system and discusses the important types of information found within each. 

Codebooks

The codebook is the principal element of the documentation system and contains information intended to be complete and self-explanatory for each variable in a data file.  Codebook information can be viewed using the search software or printed from the data file.  This feature enables researchers to customize their documentation for their particular research needs and to select and print information for the variables of interest.

Every variable is presented within the documentation as a block of information called a “codeblock.”  Codeblock entries contain the following information:  a reference number, variable title, coding information, frequency distribution, reference to the questionnaire item or source of the variable, and information on the derivation for created variables.  The codeblocks of many variables include special notes containing additional information designed to assist in the accurate use of data from that variable.  Users will find that Young Women CAPI/CATI codeblocks present greater detail on each variable, including universe totals, universe skip patterns, and more information on the range of acceptable values.  Each of the above terms is described more completely below.

Codebooks are arranged by reference number.  Variables are first grouped according to survey year.  Within each survey year, those variables related to the interview (e.g., interview method, interview date, reason for noninterview, sampling weight, etc.) appear first, followed by variables picked up directly from the questionnaire and Information Sheet.  In general, created and edited variables appear last, although the created environmental variables are grouped with variables relating to the interview in the early survey years. 

User Notes: NLS codebooks are not a substitute for the questionnaires. Although these two pieces of documentation contain similar information, the questionnaires should be used to determine precise universe information and question wording.

Codebook Item Descriptions

The following common types of information for each variable within a codeblock are discussed in this section:  coding information, multiple responses, missing responses, derivations, frequency distribution, questionnaire items (question numbers), universe information, valid values range, and verbatims.  The sample codeblocks in Figures 3.3.1 and 3.3.2 provide visual examples of this information.  The first codeblock documents a created variable and the second documents a question actually asked of respondents during the interview; together these examples allow users to identify the differences between codeblocks for these two types of variables.

Figure 3.3.1 Sample Created Variable Codeblock

Figure 3.3.2 Sample Interview Question Codeblock

Coding Information

Each codeblock entry presents the set of legitimate codes that a variable may assume along with a text entry describing the codes.  Users should note that coding information for a given variable in the NLS codeblock is not necessarily consistent with the codes found within the questionnaire or for the same variable across years.  Use only the codebook coding information for analysis.  The following types of code entries occur in NLS codeblocks:

Dichotomous or yes/no variables are uniformly coded “Yes” = 1, “No” = 0. Other dichotomous variables have frequently been reformulated to permit this convention to be followed.

Discrete (Categorical), as in the case of ‘Activity Most of Survey Week 93.’

1  Working
2  With a job, not at work
3  Looking for work
4  Going to school
5  Keeping house
6  Unable to work
7  Retired
8  Other

Continuous (Quantitative), as in the case of ‘Hourly Rate of Pay at Current or Last Job 83 *KEY*.’ These variables have continuous data, but the codebook presents a frequency distribution such as the one below for ease of use.

1 thru 999999 actual dollars and cents per hour (2 implied decimal places)
1-99
100-199
200-299
300-399
400-499
500-599
600-699
700-799
800-899
900-999
1000 +

Combined Quantitative-Qualitative Variables, i.e., variables which are ostensibly quantitative but which may have several nonquantitative (categorical) responses, utilize positive integers equaling the actual values for the quantitative responses and negative values, beginning with -1, for the qualitative (categorical) responses. For example, ‘Expected Age of Retirement’ is coded as follows:

45 thru 99 actual age
-1 already retired
-2 never plan to retire
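
As an illustration of how such a combined variable might be handled in analysis, the following minimal Python sketch separates the quantitative responses from the qualitative ones. The function name and the label table are hypothetical and cover only the ‘Expected Age of Retirement’ codes shown above; the codeblock for each variable remains the authority on what its negative values mean.

```python
# Qualitative codes for 'Expected Age of Retirement', copied from the
# example above. Other variables define their own negative codes.
QUALITATIVE_LABELS = {-1: "already retired", -2: "never plan to retire"}


def split_retirement_age(code):
    """Return (age, label); exactly one of the two is not None."""
    if code in QUALITATIVE_LABELS:
        return None, QUALITATIVE_LABELS[code]
    if 45 <= code <= 99:
        return code, None
    # Anything else is outside the codes listed in the example.
    raise ValueError(f"unexpected code: {code}")


print(split_retirement_age(62))   # (62, None)
print(split_retirement_age(-2))   # (None, 'never plan to retire')
```

Keeping the two kinds of response in separate fields prevents a code such as -1 from being averaged in with actual ages.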

Multiple Responses: Response categories to multiple entry questions found in certain job search, child care, discrimination, or health questions have been coded in a geometric progression.  For example, more than one response to the question “Method of seeking employment to be used in next year” was possible.  The response categories to that question were each assigned a value as follows: 

1   Checked with public employment agency
2   Checked with private employment agency
4   Checked with employer directly
8   Checked with friends or relatives
16  Placed or answered ads
32  Other method

Multiple responses are then coded for each respondent by adding the individual codes, which yields a unique value for each combination.  Such multiple entry variables are identified by an asterisk (*) next to the answer categories in the questionnaire.  If a multiple entry has only a few unique combinations, the codebook will specify the exact combinations; those with many combinations need to be unpacked.  Methods of unpacking such multiple entry variables are presented in Appendix C at the end of this guide.  After the 1991 survey, this practice was discontinued and all responses were coded as yes/no.
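
Because each response category is a distinct power of two, a summed code can be unpacked by testing each category value in turn. The Python sketch below illustrates this for the employment search example above; it is not the method presented in Appendix C, and the function name is hypothetical. It assumes that missing value codes (negative numbers) have already been set aside.

```python
# Category values and labels from the employment search example above.
METHOD_CODES = {
    1: "Checked with public employment agency",
    2: "Checked with private employment agency",
    4: "Checked with employer directly",
    8: "Checked with friends or relatives",
    16: "Placed or answered ads",
    32: "Other method",
}


def unpack_methods(summed_code):
    """Return the labels of all categories contained in a summed code."""
    # Valid sums run from 0 (nothing checked) to 63 (all six checked).
    if not 0 <= summed_code <= sum(METHOD_CODES):
        raise ValueError(f"not a valid summed code: {summed_code}")
    # A category was selected if its bit is set in the sum.
    return [label for value, label in METHOD_CODES.items() if summed_code & value]


# 21 = 1 + 4 + 16
print(unpack_methods(21))
# ['Checked with public employment agency',
#  'Checked with employer directly',
#  'Placed or answered ads']
```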

Missing Responses: Negative numbers are used to indicate that a respondent does not have a valid value for a particular variable.  Different numbers indicate different reasons for nonresponse:

“Refusal” indicates that the respondent refused to answer a given question.  These respondents are assigned a value of ‑1.  This code is only used for CAPI/CATI interviews (1995–2001).

“Don’t know” indicates that the respondent did not know the answer to a given question.  These respondents are assigned a value of ‑2.  This code is used for all interviews of this cohort.

“Invalid skip” indicates that the respondent was not asked a question that she should have answered, usually due to programming or interviewer error.  These respondents are assigned a value of ‑3.  This code is only used for CAPI/CATI interviews (1995–2001).

“Valid skip” indicates that the respondent was skipped past the question intentionally, because she was not in the universe of respondents to whom that question applied.  These respondents are assigned a value of ‑4.  This code is only used for CAPI/CATI interviews (1995–2001).

Finally, a value of ‑5 has been assigned for slightly different reasons in different years.  In PAPI surveys (1968–93), a ‑5 code indicates the absence of a valid response, because (1) the respondent is not in the applicable universe, (2) the respondent refused to respond, (3) interviewer, coding, transcribing, or data entry error occurred, or (4) the respondent was not interviewed in that year’s survey.  Beginning with the first computer-assisted survey in 1995, the ‑5 code is reserved for respondents not interviewed in a given year.  Because computer interviewing permits more exact determination of the reason for nonresponse, the other reasons for the absence of valid data are described by the expanded missing value codes listed above.

User Notes: The missing value codes described above are accurate for the 1999 and 2001 data releases. In previous years, a more complicated system was used to indicate missing data in the PAPI interviews. Beginning in 1999, the missing values were reassigned using a standardized system that matches the Young Women’s CAPI/CATI data as well as the other NLS cohorts. This standardization should make it easier to use the data in analysis. However, researchers using programs written for a previous release of the Young Women data may need to change the parts of their programming code related to missing values. Users who need more information about the codes previously used in order to make these adjustments should contact NLS User Services.

Three additional negative codes are used only with the NLS women’s cohorts for particular types of nonresponse. 

In questions dealing with usual hours per week worked, if the respondent reported that her hours varied, she was assigned a code of ‑6.

Women who have been widowed since the last survey are asked a series of questions regarding their husband’s care and their financial situation since his death.  A code of ‑7 was assigned to women whom the interviewer judged to be emotionally unable to answer these questions.

Some variables in multiple response question series include codes of ‑8, indicating that the respondent was done with the series.  A more detailed description is provided under “Multiple Responses” above.

User Notes: In computer-assisted surveys, respondents are initially assigned a default code of -4 (valid skip) for all questions in the interview. As the interview proceeds, the -4 codes are replaced by valid data. The -3 (invalid skip) codes must be inserted into the data as hand-edits when data archivists uncover skip pattern errors during the data cleaning process. Therefore, some respondents classified as valid skips may actually have skipped a question incorrectly. If researchers need to know the exact reason a question was not answered, they can examine the skip patterns and universes in the questionnaire to determine whether any additional respondents should have been identified as invalid skips.
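
The missing value codes described in this section can be separated from valid data before analysis. The following Python sketch shows one possible approach; the function name and the short labels are illustrative assumptions, not part of the NLS software. Note that some variables, such as ‘Expected Age of Retirement’ above, assign their own meanings to negative values, so the codeblock should be checked before applying a blanket rule like this one.

```python
# Short labels for the standardized missing value codes described above.
# The -5 label reflects both the PAPI and the CAPI/CATI usage.
MISSING_CODES = {
    -1: "refusal",
    -2: "don't know",
    -3: "invalid skip",
    -4: "valid skip",
    -5: "no valid response (PAPI) or not interviewed (CAPI/CATI)",
    -6: "hours varied",
    -7: "emotionally unable to answer",
    -8: "done with multiple response series",
}


def classify(value):
    """Return (valid_value, missing_reason); exactly one is not None."""
    if value < 0:
        return None, MISSING_CODES.get(value, "unrecognized negative code")
    return value, None


print(classify(40))   # (40, None)
print(classify(-4))   # (None, 'valid skip')
```

Retaining the reason alongside the value lets a researcher treat, for example, valid skips differently from refusals rather than discarding all negative values together.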

Derivations: The decision rules employed in the creation of constructed variables have been included, whenever possible, in the codebook under the title “DERIVATIONS.”  This information is designed to enable researchers to determine whether available constructs are appropriate for their needs.  In the ‘Hourly Rate of Pay at Current or Last Job 83 *KEY*’ example (Figure 3.3.1), the derivation describes in detail the items of the interview schedule used to create the variable.  If the derivation is too lengthy to include in the codebook, the codeblock will instead refer users to the supplemental documentation item that contains variable creation information.

Frequency Distribution: In the case of discrete (categorical) variables, frequency counts are normally shown in the first column to the left of the code categories.  In the case of continuous (quantitative) variables, a distribution of the variable is presented using a convenient class interval.  The format of these distributions varies.  In the case of the illustrative variables in Figures 3.3.1 and 3.3.2, the frequency count is straightforward.  For example, in Figure 3.3.1 there are twelve categories; the maximum category shown is 1000 and above (since two decimal places are implied, 1000 represents $10.00), for which there is a frequency count of 388.

Questionnaire Item: “Questionnaire item” is a generic term identifying the source of data for a given variable.  A questionnaire item may be a question, a check item, or an interviewer’s reference item appearing within one of the survey instruments.  Questionnaire item identifications are located in the extreme right hand column of the codebook.  The question number, when available, is copied exactly from the questionnaire.  The question numbering system is described in the questionnaire section earlier in this chapter.  In Figure 3.3.2, the question number is RSP-81I-ARR-01.

During PAPI interview years, the absence of a question entry in the codeblock (as in Figure 3.3.1) indicates that a variable was not taken directly from the questionnaire and is therefore a created variable.  Created variables in CAPI/CATI survey years usually include the letters CV in the question name and usually have the word *KEY* in their title.

Valid Values Range: Depicted below the frequency distribution are the maximum and minimum fields, which define the range of valid values (the upper and lower limits) for a given question.  “MINIMUM” indicates the smallest recorded value exclusive of nonresponse codes; “MAXIMUM” indicates the largest recorded value.  In the case of the ‘Hourly Rate of Pay’ example (Figure 3.3.1), the maximum, or highest value recorded, is 9815 with two implied decimal places, or $98.15.
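
Values stored with implied decimal places must be rescaled before they are interpreted as dollar amounts. A minimal Python sketch of the conversion follows; the function name is hypothetical, and the sketch assumes that negative missing value codes have been removed beforehand.

```python
from decimal import Decimal


def to_dollars(raw, implied_places=2):
    """Convert a stored integer with implied decimal places to dollars."""
    if raw < 0:
        # Negative numbers are missing value codes, not amounts.
        raise ValueError(f"missing value code, not an amount: {raw}")
    # Shift the decimal point left by the number of implied places.
    return Decimal(raw).scaleb(-implied_places)


print(to_dollars(9815))   # 98.15
print(to_dollars(1000))   # 10.00
```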

Topcoding Income and Asset Values:  Confidentiality concerns prevent the release of some income and asset values exactly as reported.  To ensure respondent confidentiality, income variables are truncated each survey year so that values exceeding an upper limit are converted to a set maximum value.  These upper limits vary by year, as do the set maximum values.  From 1968 through 1971, upper limit dollar amounts were set to 999999.  From 1972 through 1980, values exceeding the upper limit were set to a maximum value of 50000, and in 1982 and 1983 the set maximum value was 50001.  Beginning in 1985, income amounts exceeding $100,000 were converted to a set maximum value of 100001.

From the cohort’s inception, asset variables exceeding upper limits were truncated to 999999.  Beginning in 1983, assets exceeding one million were converted to a set maximum value of 999997.  Starting in 1993, the Census Bureau also topcoded selected asset items if it considered that the release of the absolute value might aid in the identification of a respondent.  This topcoding was conducted on a case-by-case basis with the mean of the top three values substituted for each respondent who reported such amounts.  
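
Researchers may wish to flag income values that could have been topcoded. The Python sketch below encodes the income maximums stated above as a lookup by survey year. The names are hypothetical, and treating the 1968 through 1971 figure of 999999 as a set maximum value is an interpretation of the text. The sketch does not cover asset variables or the case-by-case Census Bureau topcoding that began in 1993.

```python
# (first_year, last_year, set_maximum) for income variables, as stated
# in the text. None as last_year means "this year onward".
INCOME_TOPCODES = [
    (1968, 1971, 999999),
    (1972, 1980, 50000),
    (1982, 1983, 50001),
    (1985, None, 100001),
]


def income_topcode(year):
    """Return the set maximum value for a year, or None if not listed."""
    for first, last, maximum in INCOME_TOPCODES:
        if year >= first and (last is None or year <= last):
            return maximum
    return None


def may_be_topcoded(value, year):
    """True if an income value is at or above that year's set maximum."""
    maximum = income_topcode(year)
    return maximum is not None and value >= maximum


print(income_topcode(1978))           # 50000
print(may_be_topcoded(100001, 1993))  # True
print(may_be_topcoded(42000, 1983))   # False
```

For 1972 through 1980, a stored value of 50000 may be either a topcoded amount or a reported income of exactly that amount, which is why the check is phrased as "may be" topcoded.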

Codebook Supplements

Variable creation procedures and supplemental coding information are provided within the Codebook Supplement.  Information provided within these documents is not available in the electronic documentation files on the NLS CD‑ROMs or via download.  The following attachments and appendices are included in the Codebook Supplement, which is available in hard copy form.

Attachment 2:  1960, 1980 & 1990 Census of Population Industry and Occupational Classification Codes provides the occupation-industry coding assignments made by Census Bureau personnel from the verbal descriptions obtained in the interviews.  Users should refer to the “Industries” or “Occupations & Occupational Prestige Indices” sections in chapter 4 of this guide for information about which coding schemes were used in various survey years.  This attachment also contains a copy of the Duncan Socioeconomic Index, an ordinal prestige scale assigning a rank of 0–97 to each of the three-digit 1960 Census occupations.

Attachment 4:  Bose Index provides a mean occupational prestige score for each of the three-digit 1960 occupation codes for respondents of the cohort.

Attachment 5:  Employment Status Recodes describes the methodology used by Census to calculate each respondent’s employment status from the CPS questions that are asked in each NLS round.  This document provides (1) definitions of ‘working,’ ‘with a job but not at work,’ ‘unemployed,’ and ‘not in the labor force’; (2) the decision rules used to assign or recode respondents to a particular labor force status; and (3) Census methodology for dealing with exceptions to the rules.

Appendix 1:     Fields of Study in College—Instructions for the Coding Scheme

Appendix 2:     State Names and State Codes by Census Division Listing

Appendix 4:     Listing of Median Education for Different Occupations

Appendix 5:     Source for Occupational Atypicality Scores

Appendix 6:     Supplemental Edit Specifications for *KEY* Variables:  R03297., R03292., R03294., R03293., R03295.

Appendix 7:     Listing of Correction to Employment Status Recode for 1968 and 1969

Appendix 9:     Determinants of Early Labor Market Success:  Appendix A

Appendix 10:   Determinants of Early Labor Market Success:  Appendix B

Appendix 11:   Determinants of Early Labor Market Success:  Appendix C

Appendix 12:   Determinants of Early Labor Market Success:  Method for Variable Construction

Appendix 18:   Union Categories—Copy of Coding Instructions for Name of Union or Employee Association

Appendix 20:   Derivations for R05007., R05012. (Marital Status Patterns)

Appendix 21:   Rules for Revising Variables Representing Month and Year since Left School

Appendix 22:   GED (General Education Development), SVP (Specific Vocational Preparation), Job-Level, and Job Family Values

Appendix 23:   Derivations for R05031.–R05047. (Occupation and Other Job Information before Birth of Child)

Appendix 24:   Derivations for R05049.–R05060. (Occupation and Other Job Information after Birth of Child)

Appendix 25:   New Geographic and Environmental Variables for 1968–78

Appendix 26:   Derivations for 1978 *KEY* Variables

Appendix 27:   Source for the Job Characteristics Index

Appendix 28:   Source for the Job Satisfaction Measures

Appendix 29:   Reason for Reference in Union Certification Election (Item 10e, 1982, R07627.)

Appendix 30:   Derivations for the 1983 *KEY* Variables

Appendix 31:   Listing of Changes in 1983 Survey Made after Questionnaire Printed

Appendix 32:   Derivations for the 1988 *KEY* Variables

Appendix 33:   Derivations for the 1991 *KEY* Variables

Appendix 34:   Derivations for the 1993 *KEY* Variables (includes Highest Grade Completed 1993, and topcoding information)

Appendix 35:   Geometric Progression Coding

Appendix 36:   Summary of the Major Differences Between the 1995 and Earlier Surveys

Appendix 37:   Summary of 1995 Data Cleaning Issues

Appendix 38:   Derivations for 1995 *KEY* and other Created Variables

Appendix 39:   Summary of 1997 Data Cleaning Issues

Appendix 40:   Derivations for 1997 *KEY* and other Created Variables

Appendix 41:   Derivations for 1999 *KEY* and other Created Variables

Appendix 42:   Derivations for 2001 *KEY* and other Created Variables

Error Updates

Prior to working with an NLS data file, users should make every effort to acquire current information on data and/or documentation errors.  A variety of methods are used to notify users of errors in the data files and/or documentation and to provide those persons who acquired an NLS data set from CHRR with corrected information.  Notices of errors discovered after the release of a data file are distributed in hard copy form to current disc purchasers along with the data set.  Error notices also appear, along with information on how to acquire the corrected data and/or documentation, in NLS News, the quarterly newsletter.  An errata sheet may also be accessed on the BLS Web site at <http://www.bls.gov/nls>.



3.4 Data File Search Functions

Variables can be accessed through NLS Investigator, the search and extraction software available on the NLS CD‑ROMs or with data file downloads.  This software provides users with bridging information to the codebook and/or survey instruments.  The search indexes and lists described below can be used individually or combined to produce a more refined list of variables. This section provides only a cursory overview of the search and extraction software.  Researchers who need more information should refer to the quick reference guide in Appendix A of this document.

Any word search.  The “Any Word in Context” function (also called “contextual search”) of the software allows the user to search for and select those variables whose titles contain any single word or combination of words found in the entire documentation database.  This function allows users to easily access variables on a variety of topics but is still dependent on the wording of each variable title.  For more information on the naming of variables, see the “Variable Titles” discussion in section 3.2 above, especially the User Notes.

Area of interest.  Areas of interest contain lists of variables grouped by topic.  For example, questions on a respondent’s health and medical insurance are grouped in the “Health” area of interest.  Researchers should be aware that an individual question can be linked to only one area of interest, so questions that apply to a common research topic may appear in different areas of interest.

Question number list.  The data file contains a searchable list of the question numbers for every Young Women variable.  By accessing this list, users who know the question number for the item of interest can locate the variable without performing an any word search.  Researchers can also browse through the questions for a given questionnaire section.

Reference number list.  The data file also contains a searchable list of all Young Women reference numbers.  By using this list, researchers can locate variables for which they already know the reference number or browse through questions of interest, which are generally arranged in the order in which they were asked during the interview.

Year index.  Finally, the data file includes an option by which users can limit their searches to a single year of interest.  Researchers can browse through all the variables relating to a given survey year or can combine the year with an any word search to locate specific variables of interest.

