Sample Weights & Clustering Adjustments

Clustering Adjustments

Researchers use NLSY79 data to estimate a variety of statistics. Because NLSY79 data come from a sample rather than from every age-appropriate individual in the U.S., the statistics produced are only estimates of the "true" national values. When researchers use a computer package to compute a statistic such as a mean or a regression coefficient, the program automatically provides a second set of statistics, such as the standard error, standard deviation, or t-statistic, which tell researchers how precisely the mean or coefficient is measured. Because the NLSY79 is a clustered sample, these precision measures can be misleading unless they are adjusted.

Details: Instead of randomly selecting individuals located anywhere in the U.S. during 1978, the survey randomly selected a fixed number of small areas. This reduced the amount of time interviewers spent traveling for each interview, which lowered costs and allowed the survey to be fielded faster, yielding data more quickly. Like all other national data sets that use clustering, the NLSY79 contains many groups of respondents who share similar characteristics because they lived in the same neighborhood during 1978. This makes survey results appear more homogeneous, or similar, than is actually the case in the U.S.

Researchers can use two different approaches to correct this problem. The first approach uses the tables found in the NLSY79 Technical Sampling Report. For each survey round there is a table that lists the "Design Effects" or DEFT factors. These DEFTs give users a simple method for determining approximately how much they should increase their standard errors when measuring the precision of their estimates. However, these tables provide no guidance for users who work with specialized subsamples or who need to adjust regression results, and they are based on calculations from only a small subset of NLSY79 variables.
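As a rough illustration of how a DEFT factor is applied, the sketch below multiplies an uncorrected standard error by a DEFT. The function names and the numbers are hypothetical and are not taken from the Technical Sampling Report. The only assumption is the usual definition of DEFT as the ratio of the design-based standard error to the simple-random-sample standard error, that is, the square root of the variance design effect (DEFF).

```python
import math


def adjust_standard_error(uncorrected_se, deft):
    """Inflate a simple-random-sample standard error by a DEFT factor."""
    if deft <= 0:
        raise ValueError("DEFT must be positive")
    return uncorrected_se * deft


def deft_from_deff(deff):
    """DEFT is the square root of the variance design effect (DEFF)."""
    return math.sqrt(deff)


# Hypothetical values, for illustration only.
corrected = adjust_standard_error(uncorrected_se=500.0, deft=1.5)
print(corrected)  # 750.0
```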

The more general method is to correct for clustering by using a specialized software package. Two of the most widely used packages for adjusting surveys for clustering effects are Stata, sold by the Stata Corporation (www.stata.com), and Sudaan, sold by RTI International (www.rti.org/sudaan). This section describes how to adjust for clustering using Sudaan, which is also the package used to generate the DEFT factors found in the Technical Sampling Report.

Important Information

If you do not have access to the Geocode data set, you cannot use Sudaan or Stata to adjust for clustering. The Geocode data set can only be accessed by individuals approved by BLS. See Geographic Residence and Neighborhood Composition for information on obtaining the Geocode data CD.

Two of the most common uses of NLSY79 data are to create summary statistics and to run regressions. Table 1 shows how adjusting for clustering affects summary statistics. The table uses data from the 1998 survey. The second column, labeled "Mean Value," shows that in 1998 the 7,624 NLSY79 respondents who participated had an average net worth (assets minus liabilities) of $128,068, a total family income of $55,031, and a body mass index of 26.7 (BMI combines height and weight into a single measure that is commonly used to screen individuals for obesity). The value 26.7 is in the middle of the overweight range. The third and fourth columns show first the uncorrected standard errors from the statistical package SAS, and then Sudaan's standard errors corrected for clustering. Correcting for clustering increases net worth's standard error from $3,403 to $5,826, a jump of 1.7 times; income's error from $536 to $1,137, a jump of 2.12 times; and BMI's error from 0.06 to 0.09, a jump of 1.5 times.

Table 1. Effect of Clustering Correction on a Mean Value's Standard Error, 1998 Data, Example One

Variable        Mean Value   Uncorrected Std Error   Corrected Std Error
Net Worth       $128,068     $3,403                  $5,826
Family Income   $55,031      $536                    $1,137
BMI             26.7         0.06                    0.09
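The size of each jump can be checked directly from the two standard-error columns. The short sketch below recomputes the ratios from the values printed in Table 1. The variable and function names are illustrative only, and the BMI ratio is approximate because the published standard errors are rounded to two decimals.

```python
# Standard errors copied from Table 1: (uncorrected, corrected).
table1 = {
    "Net Worth": (3403.0, 5826.0),
    "Family Income": (536.0, 1137.0),
    "BMI": (0.06, 0.09),
}


def inflation_ratio(uncorrected, corrected):
    """Factor by which the clustering correction changes the standard error."""
    return corrected / uncorrected


ratios = {name: inflation_ratio(u, c) for name, (u, c) in table1.items()}

for name, ratio in ratios.items():
    # The squared ratio is the corresponding change in the variance.
    print(f"{name}: SE ratio {ratio:.2f}, variance ratio {ratio ** 2:.2f}")
```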

Table 2 shows how adjusting for clustering affects a simple regression. Using the same 1998 data, a simple unweighted least squares equation was run with both SAS and Sudaan using net worth as the dependent variable and six independent variables. Three of these independent variables (BMI, income and age) take a wide range of values, while the remaining three variables (black, Hispanic or Latino, and female) take the value of 1 if the respondent has the particular characteristic and 0 otherwise.

The table shows that adjusting for clustering changes many of the standard errors and associated t-values. The biggest effect is seen on the income line, where the standard error increases from 0.06 (uncorrected) to 0.19 (corrected), causing the t-value to fall from 44.37 to 13.87. Smaller changes are seen for the other variables. The intercept, age, and female standard errors all increase, while the BMI, black, and Hispanic or Latino variables all end up with slightly smaller standard errors.

Overall, both examples show that adjusting for clustering effects is important. The next subsection shows what variables are needed to adjust for clustering. The section ends with the specific Sudaan commands used to create the tables in this chapter.

Key Variables Needed For Clustering Correction: Two variables are needed to adjust the data set for clustering. Both variables are found only on the Geocode data set and are placed there because researchers can use these variables to determine where each civilian respondent lived in 1978.

Table 2. Effect of Clustering Correction on Regression Standard Errors, 1998 Data, Example Two

Variable    Coefficient Estimate   Uncorrected Standard Error   Uncorrected t Value   Corrected Standard Error   Corrected t Value
Intercept   186,808                43,534                       4.29                  52,166                     3.58
BMI         1,091                  466                          2.34                  457                        2.39
Income      2.63                   0.06                         44.37                 0.19                       13.87
Black       40,394                 5,938                        6.80                  4,259                      9.48
Hispanic    41,382                 6,617                        6.25                  4,554                      9.09
Age         5,285                  1,086                        4.87                  1,252                      4.22
Female      2,814                  4,891                        0.58                  5,064                      0.56
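Each t-value in Table 2 is the coefficient estimate divided by its standard error, so a change in the standard error translates directly into a change in the t-value. The sketch below recomputes the t-values from the printed figures; the names are illustrative. For the Income row the recomputed values differ slightly from the published 44.37 and 13.87, presumably because the printed coefficient and standard errors are rounded to two decimals.

```python
# Rows copied from Table 2:
# (variable, coefficient, uncorrected SE, corrected SE)
table2 = [
    ("Intercept", 186808, 43534, 52166),
    ("BMI", 1091, 466, 457),
    ("Income", 2.63, 0.06, 0.19),
    ("Black", 40394, 5938, 4259),
    ("Hispanic", 41382, 6617, 4554),
    ("Age", 5285, 1086, 1252),
    ("Female", 2814, 4891, 5064),
]


def t_value(coefficient, standard_error):
    """t statistic for the null hypothesis that the coefficient is zero."""
    return coefficient / standard_error


for name, coef, se_uncorrected, se_corrected in table2:
    print(
        f"{name:<10}"
        f" uncorrected t = {t_value(coef, se_uncorrected):6.2f}"
        f"   corrected t = {t_value(coef, se_corrected):6.2f}"
    )
```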

As discussed above, the NLSY79 is a multi-stage clustered sample. The clusters were created by first dividing the entire U.S. into Primary Sampling Units, or PSUs. These PSUs were defined by NORC and were composed of Standard Metropolitan Statistical Areas (SMSAs), entire counties when the counties were small, parts of counties when the counties were large, and independent cities. NORC randomly selected two different sets of PSUs for inclusion in the study, each of which by itself randomly represents the U.S. This selection of two sets of PSUs means the NLSY79 is composed of two replicates, or strata, each containing a random selection of PSUs. The replicate, or stratum, to which a respondent belongs is found only in the Geocode data set, in variable R02191.46, entitled "Within Stratum Replicate Of Primary Sampling Unit." This variable takes the value 1 or 2, for the first or second replicate.

The variable containing the PSU is R02191.45, entitled "Stratum Number For Primary Sampling Units." R02191.45 ranges in value from 1 to 120. Researchers who want to know which geographic areas correspond to particular values should consult the crosswalk table in Attachment 104 of the Geocode Supplement. Respondents with a PSU code of 52 to 70 are part of the military sample and do not have any known geographic location.

Important Information

The label that NLS Investigator automatically produces for variable R02191.46 in SAS and SPSS programs is confusing. The label reads "PRIMARY SAMPLNG UNIT PSU SCRAMBLED 79". This variable contains the scrambled replicate, or stratum number, not the PSU. PSU information is found in R02191.45. The complete title for variable R02191.46 is "Within Stratum Replicate Of Primary Sampling Unit (PSU) - Scrambled." Users should be careful when applying the clustering corrections to geographic variables: because this variable is randomly scrambled, clustering corrections on some geographic variables produce incorrect results. Scrambling has no effect on variables that are not geographic, such as education, income, or training.

Using the Key Variables In Sudaan: The specific steps used to generate the tables above are covered in this section. While the tables were produced using the Windows Version 8.0 Standalone package, the steps and commands are similar for other versions of Sudaan. To adjust summary statistics such as means or regressions with Sudaan, the researcher needs to create three files: one containing the data, one telling Sudaan how to read the data, and one containing the specific commands. Any computer package can be used to create the data file. Data can even be written directly from NLS Investigator to a file. Figure 1 has the relevant portion of the SAS program used to create the data file used in Tables 1 and 2 above.

Figure 1. SAS Commands to Create Sudaan Data File

Data obesity;
     /* SAS commands that generate variables like Age, Income, and BMI are placed here */

PSU       = R0219145;           /* PSU identifier from the Geocode file */
REPLICATE = R0219146;           /* Replicate (stratum) from the Geocode file */
run;

proc sort data=obesity;         /* Sort the data since Sudaan cannot handle unsorted data */
by replicate psu;
run;

Data _null_;                    /* Write the file without creating another data set */
Set obesity;
file 'C:\DesignEffects\ObesitySudaanAdjustment.dbs';

put ID           5.
PSU              3.
REPLICATE        2.
WGHT             7.
BLACK            2.
HISPANIC         2.
AGE              3.
SEX              2.
INCOME           9.
BMI              4.1
NETASSET         9.;

Run;

One of the key things to note is that the data are sorted by the replicate and PSU variables, in that order, before being written to the file. For most operations, Sudaan requires the data to be in this order before processing.
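Because any package can be used to create the data file, the same sort-then-write step can be sketched outside SAS. The Python example below is a minimal illustration that assumes the field order and widths of the PUT statement in Figure 1. The record values are made up, and the function name is hypothetical.

```python
# Made-up records for illustration; real values come from the NLSY79 files.
records = [
    {"ID": 7, "PSU": 40, "REPLICATE": 2, "WGHT": 301200, "BLACK": 1,
     "HISPANIC": 0, "AGE": 35, "SEX": 0, "INCOME": 38000, "BMI": 24.8,
     "NETASSET": 12000},
    {"ID": 2, "PSU": 17, "REPLICATE": 1, "WGHT": 512345, "BLACK": 0,
     "HISPANIC": 1, "AGE": 36, "SEX": 1, "INCOME": 42000, "BMI": 27.3,
     "NETASSET": 85000},
    {"ID": 5, "PSU": 3, "REPLICATE": 1, "WGHT": 250000, "BLACK": 0,
     "HISPANIC": 0, "AGE": 39, "SEX": 1, "INCOME": 61000, "BMI": 31.0,
     "NETASSET": 150000},
]


def format_record(rec):
    """Right-justified fixed-width line using the widths from Figure 1."""
    return (
        f"{rec['ID']:5d}{rec['PSU']:3d}{rec['REPLICATE']:2d}{rec['WGHT']:7d}"
        f"{rec['BLACK']:2d}{rec['HISPANIC']:2d}{rec['AGE']:3d}{rec['SEX']:2d}"
        f"{rec['INCOME']:9d}{rec['BMI']:4.1f}{rec['NETASSET']:9d}"
    )


# Sort by replicate, then PSU, mirroring "by replicate psu" in Figure 1.
records.sort(key=lambda r: (r["REPLICATE"], r["PSU"]))
lines = [format_record(r) for r in records]

for line in lines:
    print(line)
```

Every line is 48 characters wide, the sum of the field widths, which is what the label file described next relies on.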

The second file is the "label" file. This file is used to read the data into Sudaan. The label file, called "ObesitySudaanAdjustment.lab," is shown in Figure 2. The label file has five parts. The first column on the left is the variable's name, followed by a letter which tells Sudaan if the variable contains numeric or character data. The third and fourth columns contain the number of bytes (characters) taken up by the variable and the number of decimal places in the number. The last column contains the label. Sudaan expects the label file to follow a precise format with columns starting and ending in very specific places.

Figure 2. Sudaan Label File

Name      Type  Bytes  Decimals  Label
ID        N     5      0         ID# (1-12686)
PSU       N     3      0         # OF PSU
REPLICAT  N     2      0         REPLICATE SCRAMBLED
WGHT      N     7      0         SAMPLING WEIGHT
BLACK     N     2      0         T/F BLACK
HISPANIC  N     2      0         T/F HISPANIC
AGE       N     3      0         AGE OF RESPONDENT
SEX       N     2      0         MALE 0 - FEMALE 1
TOTINC    N     9      0         TOTAL INCOME
BMI       N     4      1         BODY MASS
NETASS    N     9      0         TOTAL NET WORTH
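To make the role of the label file concrete, the sketch below uses the same name, type, width, and decimal information to split one fixed-width data line into named values. It is only an illustration in Python, not a description of how Sudaan reads the file. It assumes the decimal point is written explicitly in the data, as the SAS 4.1 format does, and the sample values are made up.

```python
# (name, type, bytes, decimals) taken from the label file in Figure 2.
LAYOUT = [
    ("ID", "N", 5, 0), ("PSU", "N", 3, 0), ("REPLICAT", "N", 2, 0),
    ("WGHT", "N", 7, 0), ("BLACK", "N", 2, 0), ("HISPANIC", "N", 2, 0),
    ("AGE", "N", 3, 0), ("SEX", "N", 2, 0), ("TOTINC", "N", 9, 0),
    ("BMI", "N", 4, 1), ("NETASS", "N", 9, 0),
]


def parse_line(line, layout=LAYOUT):
    """Split one fixed-width record into a dict keyed by variable name."""
    values = {}
    position = 0
    for name, kind, width, decimals in layout:
        text = line[position:position + width].strip()
        position += width
        if kind != "N":
            values[name] = text            # character field
        elif text in ("", "."):
            values[name] = None            # blank or SAS-style missing value
        elif decimals > 0:
            values[name] = float(text)     # assumes an explicit decimal point
        else:
            values[name] = int(text)
    return values


# Made-up sample line, built field by field so the widths are explicit.
sample = "".join([
    "    2", " 17", " 1", " 512345", " 0", " 1",
    " 36", " 1", "    42000", "27.3", "    85000",
])
parsed = parse_line(sample)
print(parsed)
```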

The third file is the set of commands used to run Sudaan. Many versions of Sudaan allow commands to be typed directly into the program so researchers are not forced to create command files. Figures 3 and 4 provide the Sudaan commands that were used to create Tables 1 and 2 above. Figure 3 has three sections. The top section below the "Proc Descript" command tells Sudaan where to find the raw data and what variable contains the basic survey weights. The nest command defines which variables contain the replicate and PSU information. The middle section, beginning with "Var," tells Sudaan which variables will have descriptive statistics created. The final section, beginning with "Print," specifies the types of output that are shown.

The first section of Figure 4 is similar to the opening commands of Proc Descript in Figure 3. The main difference is that the "weight" command is followed by the reserved name "_ONE_" instead of the NLSY79 weight, "wght." Putting the "wght" variable after the weight command would cause Sudaan to run weighted least squares. By using "_ONE_" instead, Sudaan gives every observation the same weight of 1.0, which results in unweighted least squares. The second part of the command, which begins with "Model," specifies the exact regression to run.

Figure 3. Sudaan Commands Used to Create Summary Statistics in Table 1

Proc Descript
Data="C:\DesignEffects\ObesitySudaanAdjustment.dbs"
  filetype=ascii design=wr mean DEFT1 est_no=12686;
  weight wght;
  nest REPLICAT PSU / MISSUNIT;
Var NETASS BMI TOTINC BLACK HISPANIC AGE SEX;
Print nsum="Sample Size" WSUM="Population Size" Mean
  semean="Std. Err." DEFFMEAN="Design Effect" / style=nchs
  nsumfmt=f6.0 wsumfmt=f10.0 deffmeanfmt=f6.2 semeanfmt=f11.2;
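
For readers who want a feel for what the weight and nest statements feed into, the sketch below computes a weighted mean and a clustering-corrected standard error using the textbook Taylor-linearization formula for a design with PSUs nested in strata and with-replacement sampling at the first stage. It is an illustration under those assumptions, not Sudaan's actual implementation; it does not reproduce the MISSUNIT handling, and the data and function names are made up.

```python
import math
import statistics
from collections import defaultdict


def clustered_mean_and_se(rows):
    """rows: (stratum, psu, weight, y). Returns (weighted mean, corrected SE)."""
    rows = list(rows)
    total_weight = sum(w for _, _, w, _ in rows)
    mean = sum(w * y for _, _, w, y in rows) / total_weight

    # Sum each observation's linearized contribution within its PSU.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, w, y in rows:
        psu_totals[stratum][psu] += w * (y - mean) / total_weight

    variance = 0.0
    for totals in psu_totals.values():
        n_psu = len(totals)
        if n_psu < 2:
            continue  # this sketch simply skips single-PSU strata
        center = sum(totals.values()) / n_psu
        variance += n_psu / (n_psu - 1) * sum(
            (z - center) ** 2 for z in totals.values()
        )
    return mean, math.sqrt(variance)


# Made-up data: 2 strata x 2 PSUs x 3 respondents, similar values within a PSU.
rows = []
for stratum, psu, base in [(1, 1, 10), (1, 2, 40), (2, 1, 20), (2, 2, 30)]:
    for offset in range(3):
        rows.append((stratum, psu, 1.0, base + offset))

mean, corrected_se = clustered_mean_and_se(rows)

# Uncorrected SE, treating the 12 respondents as a simple random sample.
ys = [y for _, _, _, y in rows]
uncorrected_se = statistics.stdev(ys) / math.sqrt(len(ys))

print(f"mean={mean:.2f} uncorrected={uncorrected_se:.2f} corrected={corrected_se:.2f}")
```

In this toy data the respondents within each PSU are nearly identical, so the corrected standard error is much larger than the uncorrected one, which is the same pattern Table 1 shows for the NLSY79.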

Figure 4. Sudaan Commands Used to Create Regression Values in Table 2

Proc Regress
Data="C:\DesignEffects\ObesitySudaanAdjustment.dbs"
  filetype=ascii design=wr DEFT1 est_no=12686;
  weight _ONE_;
  nest REPLICAT PSU / MISSUNIT;
Model NETASS = BMI TOTINC BLACK HISPANIC AGE SEX;
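
The effect of weighting every observation by 1.0 can be seen in a small example. The sketch below fits a one-variable weighted least squares line and shows that weights of 1.0 give the ordinary (unweighted) least squares result, while unequal weights give a different line. The data, weights, and function name are made up for illustration and have nothing to do with the NLSY79 values.

```python
def weighted_least_squares(x, y, w):
    """Slope and intercept of the weighted least squares line of y on x."""
    total = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar


# Made-up data for illustration.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 9.0]

# Every weight equal to 1.0: identical to unweighted least squares.
unweighted = weighted_least_squares(x, y, [1.0, 1.0, 1.0, 1.0])

# Unequal (hypothetical) sampling weights pull the line toward the last point.
weighted = weighted_least_squares(x, y, [1.0, 1.0, 1.0, 5.0])

print("weights of 1.0:", unweighted)
print("unequal weights:", weighted)
```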

 

Related Variables: The 1979 Geocode data also contain the State, county, and metropolitan statistical area where the respondent lived in 1979.
Documentation: Additional information can be found in the Standard Errors and Design Effects section of this User's Guide, in the NLSY79 Technical Sampling Report, and in Attachment 104 of the Geocode Supplement.
Data Files: Data on clustering can be found only in the NLSY79 Geocode files under the "GEOCODE" 1979 area of interest.