Large Dataset Recommendations

Reasons not to download an entire NLS Cohort/Dataset

Users who come from research traditions that typically use smaller datasets often think in terms of downloading the full set of data. However, for complex longitudinal survey projects whose data span decades, the databases can be so large and complex that methods suited to smaller data sets become onerous and can substantially impede the organization of a research effort.

There are many reasons that downloading the entire data set is not a good idea.

  • Each data set has over 100,000 variables. With several thousand respondents in each cohort, this involves several hundred million data points. To download the entire 2-4 gigabyte file would take several hours and require a significant amount of storage space on your computer.
  • The Investigator is designed to help you find the variables you need for your research. If you download an ASCII data file, you will have more than 100,000 variables and no way to tell which are the variables of interest for your research.
  • You may need to break the data up into multiple records per respondent, because the flat file is too large for nearly all PC database programs.
  • If you try to open a 100,000 variable file in a statistical program such as SAS, SPSS or Stata, you will likely crash your computer as these programs are not designed to handle that many variables.
  • If you load your ASCII data file into a statistical program such as SAS, SPSS or Stata, your file becomes substantially larger. Extracts stored in SAS, SPSS or Stata formats take up a great deal more space than the same data stored in ASCII.
  • We continuously review the data we release. If you work from a saved extract rather than a full download, then when you return to a research project after a period of inactivity, you can easily add the next round of data or update any data that were corrected.
  • The data in the Investigator are linked to a great many documentation resources that help you understand the variables. Once you download a data extract you no longer have those documentation links available to help you choose your variables.
  • If you are working with a collaborator at another institution, it can be difficult to coordinate the data each of you has on your computer. By storing the extract definition files on our servers in a shared account, you can instantly share which variables each of you is using.
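
The size figures in the first reason above can be checked with a rough calculation. The sketch below is only illustrative: the respondent count (5,000) and the connection speed (2 megabits per second) are assumed values chosen for the example, not figures for any particular cohort or network.

```python
def estimate_data_points(n_variables: int, n_respondents: int) -> int:
    """One value per variable per respondent in a flat file."""
    return n_variables * n_respondents


def estimate_download_hours(size_gb: float, mbit_per_s: float) -> float:
    """Rough transfer time, ignoring protocol overhead and stalls."""
    total_bits = size_gb * 8 * 1000**3           # decimal gigabytes to bits
    seconds = total_bits / (mbit_per_s * 1_000_000)
    return seconds / 3600


# Assumed inputs: 100,000 variables, 5,000 respondents, 3 GB file, 2 Mbit/s link.
points = estimate_data_points(100_000, 5_000)
hours = estimate_download_hours(3, 2)
print(f"{points:,} data points, about {hours:.1f} hours to download")
```

Under these assumed inputs the flat file holds 500 million values and takes a little over three hours to transfer; a faster connection shortens the download but does not reduce the storage or software limits described above.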

For users who have the capacity to work with extremely large data files and the programs to handle them, downloads are available for each cohort.

Each download file includes a data file (.dat); a codebook file (.cdb); programs to read the data into SAS (.sas), SPSS (.sps), and R (.R); and a tagset file listing all reference numbers (.COHORT NAME). Stata programs are available through Investigator but are not included here because the full data sets are too large to open in Stata.
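
For users who do work with a full download, one way to keep memory use manageable is to stream the data file one record at a time and keep only the columns needed. The sketch below assumes one respondent per line with whitespace-separated values; that layout is an assumption made for illustration, so check the codebook and the supplied read-in programs for the actual format. The function names, file names, and column positions are hypothetical.

```python
import csv
from typing import Iterable, Iterator, List, Sequence


def select_columns(lines: Iterable[str], keep: Sequence[int]) -> Iterator[List[str]]:
    """Yield only the requested column positions from each non-empty line."""
    for line in lines:
        fields = line.split()        # assumed layout: whitespace-separated values
        if not fields:
            continue                 # skip blank lines
        yield [fields[i] for i in keep]


def extract_subset(src_path: str, dst_path: str, keep: Sequence[int]) -> int:
    """Stream src_path line by line and write the kept columns to a CSV file."""
    rows = 0
    with open(src_path, "r", encoding="ascii") as src, \
         open(dst_path, "w", newline="", encoding="ascii") as dst:
        writer = csv.writer(dst)
        for row in select_columns(src, keep):
            writer.writerow(row)
            rows += 1
    return rows                      # number of respondent records written
```

A call such as extract_subset("cohort.dat", "subset.csv", [0, 17, 42]) would then write a three-column file without ever holding the full data set in memory. The column positions would have to be looked up in the codebook.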

 

