Updated by L. A. Hook, T. W. Beaty, S. Santhana-Vannan, L. Baskaran, and R. B. Cook. June 2007.
Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A.
At the request of several field researchers, investigators, GIS and image specialists, and data managers, the following guidelines have been prepared for the data management practices that data collectors should follow to improve the usability of their data sets. This guidance is provided for those who perform environmental measurements, although many of the practices may be useful for other data collection and archiving activities.
We assembled what we feel are the most important practices that researchers could implement to make their data sets ready to share with other researchers. These practices could be performed at any time during the preparation of the data set, but we suggest that researchers consider them before measurements are taken. The order of the practices is not necessarily sequential, as a researcher could provide draft data set metadata before any measurements are taken.
The seven best practices are:
File names should reflect the contents of the file and include enough information to uniquely identify the data file. File names may contain information such as project acronym, study title, location, investigator, year(s) of study, data type, version number, and file type. The file name should be provided in the documentation (described in Sect. 7) and in the first line of the header rows in the file itself.
Clear, descriptive, and unique file names may be important later when your data file is combined in a directory or FTP site with your own data files or with the data files of other investigators. Avoid using file names such as mydata.dat or 1998.dat.
File names should be constructed for easy management by various data systems. Names should contain only numbers, letters, dashes, and underscores -- no spaces or special characters. Also, in general, lower-case names are less software and platform dependent and are preferred. When choosing a file name, check for any database management limitations on the use of special characters and file name length. For practical reasons of legibility and usability, file names should not be more than 64 characters in length and if well constructed could be considerably less.
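The naming rules above can be checked mechanically before files are submitted. The following is a minimal sketch, not part of the original guidance: the function name check_file_name is hypothetical, and treating one or more dot-separated extensions as separate from the name itself is an assumption made for illustration.

```python
import re

MAX_LENGTH = 64  # upper limit suggested in the text

# Letters, numbers, dashes, and underscores, followed by one or more
# extensions such as .csv or .tar.gz (the extension handling is an assumption).
NAME_PATTERN = re.compile(r"^[A-Za-z0-9_-]+(\.[A-Za-z0-9]+)+$")


def check_file_name(name):
    """Return a list of problems found in a data file name (empty if none)."""
    problems = []
    if len(name) > MAX_LENGTH:
        problems.append("longer than 64 characters")
    if not NAME_PATTERN.match(name):
        problems.append("contains spaces or special characters")
    if name != name.lower():
        problems.append("not all lower case (preferred, not required)")
    return problems


print(check_file_name("c130_a792_20000916.csv"))
print(check_file_name("my data (final).dat"))
```

Lower case is reported as a preference rather than an error because the text recommends it "in general" without requiring it.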
You may want to use similar logic when designing directory structures and names. Also, the data set title (see Sect. 6) should be similar to the data file name(s).
Version Number: Including a data file creation date or version number enables data users to quickly determine which data they are using if an update to the data set is released (e.g., *_v1.csv, *_r1.csv, or *_20070227.csv).
File Type or Extensions: Use *.txt, *.csv generally for tabular data. Section 2 addresses formats and extensions for image data files.
File Compression: Use *.zip, *.gz, or *.tar file extensions, as appropriate for the compression software. The individual files may be compressed for space conservation or several files may be aggregated and then compressed as one file of reduced size. When multiple files are compressed together, the same file naming guidelines apply to the compressed collection of files.
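As an illustration of aggregating several files into one compressed file, the sketch below uses the zipfile module from the Python standard library. The function name bundle_files and the file names in the usage note are hypothetical; other compression tools would serve equally well.

```python
import zipfile
from pathlib import Path


def bundle_files(paths, archive_path):
    """Aggregate several data files into one compressed .zip archive."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            # Store each file under its own name, without directory parts.
            zf.write(path, arcname=Path(path).name)
    return archive_path
```

For example, bundle_files(["site_a_2000.csv", "site_b_2000.csv"], "project_2000_v1.zip") would produce one archive whose name follows the same naming guidelines as the individual files.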
Example Data File Names:
c130_a792_20000916.csv
(From data set SAFARI 2000 C-130 Aerosol and Meteorological Data, Dry Season 2000)
WBW_veg_inventory_all_20050304.csv
(From data set Walker Branch Watershed Vegetation Inventory, 1967-1997)
bigfoot_agro_2000_gpp.zip
(From bigfoot_agro_2000_gpp.zip, data set BigFoot GPP Surfaces for North and South American Sites, 2000-2004)
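File names with a creation date or version number, like those in the examples above, can be generated rather than typed by hand. This is a small sketch under the assumption that the name is built from a stem plus a suffix; the function names are hypothetical.

```python
from datetime import date


def dated_file_name(stem, file_date, extension="csv"):
    """Build a name such as stem_20070227.csv from a creation date."""
    return f"{stem}_{file_date:%Y%m%d}.{extension}"


def versioned_file_name(stem, version, extension="csv"):
    """Build a name such as stem_v1.csv from a version number."""
    return f"{stem}_v{version}.{extension}"


print(dated_file_name("c130_a792", date(2000, 9, 16)))   # c130_a792_20000916.csv
print(versioned_file_name("kt_tree_data", 1))            # kt_tree_data_v1.csv
```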
In choosing a file format, data collectors should select a consistent format that can be read well into the future and is independent of changes in applications.
Tabular Data:
Using ASCII file formats is the best way to ensure that measurement data are readable in the future.
At the top of the file, include several header rows.
Within the ASCII file, follow these guidelines.
In the data set documentation, specifically add the following data file information.
Image (Raster) Data:
Some field researchers may generate Image (Raster) data sets. Below are some guidelines/recommendations for archiving these types of data files.
Suggested Non-Proprietary File Formats: (Listed in order of our preference. See file extension reference, Appendix A.)
If you cannot use any of the above formats, another option is to use any non-proprietary public domain data format. Whatever file format you use, be sure to thoroughly document the format and follow the suggested guidelines.
Guidelines for documenting image data files:
Proprietary Software Data Formats:
Data that are provided in a proprietary software format must include documentation of the software specifications (i.e., Software Package, Version, Vendor, and native platform). The archive data center will use this information to convert to a non-proprietary format for the archive.
All file types that constitute the complete geographic data format documentation must be provided for the specific software package. For example:
IDRISI software images -- provide the *.rdc and the *.rst files [http://www.clarklabs.org/]
Image (Vector) Data:
Below are suggested vector file formats. These are mostly proprietary data formats; please be sure to document the Software Package, Version, Vendor, and native platform.
Also make sure that the vectors are properly georeferenced and that the geometry type (Point, Line, Polygon, Multipoint, etc.) is specified.
File Extension Reference Table
A table of common file extensions and their generally accepted formats is provided in Appendix A.
On-line Resources:
In order for others to use your data, they must fully understand the contents of the data set, including the parameter names, units of measure, formats, and definitions of coded values. Provide the English language translation of any data values and descriptors that are in another language (e.g., coded fields, variable classes, and GIS coverage attributes).
Parameter Name: The parameters reported in the data set need to have names that describe the contents. The documentation should contain a full description of the parameter. Use commonly accepted parameter names, for example, Temp for temperature, Precip for precipitation, and Lat and Long for latitude and longitude. See the online references in the Bibliography for additional examples. Also, be sure to use consistent capitalization (not temp, Temp, and TEMP) and use only letters, numerals, and underscores in the parameter name.
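The two mechanical rules for parameter names (allowed characters and consistent capitalization) lend themselves to an automated check. The sketch below is one possible approach; the function name check_parameter_names and the sample names are hypothetical.

```python
import re
from collections import defaultdict

# Only letters, numerals, and underscores are allowed in a parameter name.
VALID_NAME = re.compile(r"^[A-Za-z0-9_]+$")


def check_parameter_names(names):
    """Return (names with disallowed characters, groups differing only by case)."""
    invalid = [n for n in names if not VALID_NAME.match(n)]
    by_lower = defaultdict(set)
    for n in names:
        by_lower[n.lower()].add(n)
    inconsistent = sorted(sorted(group) for group in by_lower.values() if len(group) > 1)
    return invalid, inconsistent


invalid, inconsistent = check_parameter_names(["Temp", "temp", "TEMP", "Precip", "Soil-pH"])
print(invalid)        # names that break the character rule
print(inconsistent)   # spellings of the same name that differ only in capitalization
```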
Units: The units of reported parameters need to be explicitly stated in the data file and in the documentation. We recommend SI units but recognize that each discipline may have its own commonly used units of measure. The critical aspect here is that the units be defined in the documentation so that others understand what is reported.
Formats: Within each data set, choose a format for each parameter, explain the format in the documentation, and use that format throughout the data set. Consistent formats are particularly important for dates, times, and spatial coordinates. For numeric parameters, if the number of decimal places should be preserved to indicate significant digits, then explicitly define the format such that users may take precautions to ensure that significant figures are not lost or gained during data transformations.
We recommend the following formats for common parameters:
Dates: yyyy-mm-dd or yyyymmdd, e.g., January 2, 1997 is 1997-01-02 or 19970102.
Time: Use 24-hour notation (13:30 hrs instead of 1:30 p.m. and 04:30 instead of 4:30 a.m.). Report in both local time and Coordinated Universal Time (UTC). Include local time zone in a separate field. As appropriate, both the begin time and end time should be reported in both local and UTC time. Because UTC and local time may be on different days, we suggest that dates be given for each time reported. Applicable date and time standards are listed in Appendix B.
Spatial Coordinates: Spatial coordinates should be recorded in decimal degrees format to at least 4 (preferably 5 or 6) decimal places. Provide latitude and longitude with south latitude and west longitude recorded as negative values, e.g., 80° 30' 00" W longitude is -80.5000. Make sure that all location information in a file uses the same coordinate system, including coordinate type, datum, and spheroid. Document all three of these characteristics (e.g., Lat/Long decimal degrees, NAD83 (North American Datum of 1983), WGS84 (World Geodetic System 1984)). Mixing coordinate systems [e.g., NAD83 and NAD27 (North American Datum of 1927)] will cause errors in any geographic analysis of the data. Applicable spatial coordinate standards are listed in Appendix C.
Elevation: Provide elevation in meters. Include detailed information on the vertical datum used (e.g., North American Vertical Datum 1988 (NAVD 1988) or Australian Height Datum (AHD)). Additional information on vertical datums is included in Appendix D.
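The date, time, and coordinate conventions above can be illustrated with a short sketch using only the Python standard library. The function names are hypothetical, and the fixed UTC-5 offset in the usage lines is chosen purely for illustration; a real data set would use the time zone of the measurement site.

```python
from datetime import date, datetime, timedelta, timezone


def format_date(d):
    """Return a date in the compact yyyymmdd form."""
    return d.strftime("%Y%m%d")


def local_and_utc(local_dt):
    """Return (local date, local time, UTC date, UTC time) for a timezone-aware datetime."""
    utc_dt = local_dt.astimezone(timezone.utc)
    fmt_date, fmt_time = "%Y-%m-%d", "%H:%M"  # 24-hour notation
    return (local_dt.strftime(fmt_date), local_dt.strftime(fmt_time),
            utc_dt.strftime(fmt_date), utc_dt.strftime(fmt_time))


def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees (S and W negative)."""
    value = degrees + minutes / 60 + seconds / 3600
    if hemisphere.upper() in ("S", "W"):
        value = -value
    return value


# January 2, 1997
print(format_date(date(1997, 1, 2)))                    # 19970102

# 9:30 p.m. local time at an assumed fixed offset of UTC-5
local_zone = timezone(timedelta(hours=-5))
print(local_and_utc(datetime(2007, 2, 27, 21, 30, tzinfo=local_zone)))

# 80 degrees 30' 00" W longitude, written to 4 decimal places
print(f"{dms_to_decimal(80, 30, 0, 'W'):.4f}")          # -80.5000
```

In the second example the local date is 2007-02-27 while the UTC date is 2007-02-28, which is why a date should accompany each reported time.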
Coded Fields:
Coded fields, as opposed to free text fields, often have standardized lists of predefined values from which the data provider may choose. Two good examples are U.S. state abbreviations and postal zip codes. Data collectors may establish their own coded fields with defined values to be consistently used across several data files. The use of consistent sampling site designations is a good application. Coded fields are more efficient for storage and retrieval of data than free text fields.
Guidance for two specific coded fields commonly used in environmental data files:
Data Flag or Qualifying Values: A separate field with specified values may be used to provide additional information about the measured data value including, for example, quality considerations, reasons for missing values, or indicating replicated samples. Codes should not be parameter specific but should be consistent across parameters and data files. Definitions of flag codes should be included in the accompanying data set documentation.
Example documentation of Data Quality Flag values:
Flag Value | Description
V0 | Valid value
V1 | Valid value but comprised wholly or partially of below detection limit data
V2 | Valid estimated value
V3 | Valid interpolated value
V4 | Valid value despite failing to meet some QC or statistical criteria
V5 | Valid value but qualified because of possible contamination (e.g., pollution source, laboratory contamination source)
V6 | Valid value but qualified due to non-standard sampling conditions (e.g., instrument malfunction, sample handling)
V7 | Valid value but set equal to the detection limit (DL) because the measured value was below the DL
M1 | Missing value because no value is available
M2 | Missing value because invalidated by data originator
H1 | Historical data that have not been assessed or validated
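Documented flag values such as those above can be kept in a lookup table so that every flag appearing in a data file is checked against its definition. This is a minimal sketch; the dictionary and function names are hypothetical, and the descriptions are copied from the example documentation.

```python
FLAG_DEFINITIONS = {
    "V0": "Valid value",
    "V1": "Valid value but comprised wholly or partially of below detection limit data",
    "V2": "Valid estimated value",
    "V3": "Valid interpolated value",
    "V4": "Valid value despite failing to meet some QC or statistical criteria",
    "V5": "Valid value but qualified because of possible contamination",
    "V6": "Valid value but qualified due to non-standard sampling conditions",
    "V7": "Valid value but set equal to the detection limit (DL)",
    "M1": "Missing value because no value is available",
    "M2": "Missing value because invalidated by data originator",
    "H1": "Historical data that have not been assessed or validated",
}


def undefined_flags(flags):
    """Return the sorted flag codes that have no documented definition."""
    return sorted(set(flags) - set(FLAG_DEFINITIONS))


# "X9" is not documented, so it is reported.
print(undefined_flags(["V0", "M1", "X9", "V0"]))
```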
Units: While data collectors can generally agree on the units for reporting measured parameters, the exact syntax of the units designation varies widely among programs, projects, scientific communities, and investigators (if standardized at all). If a shorthand notation is reported in the data file, the complete units should be spelled out in the documentation so that others can understand and interpret your representation of subscripts, superscripts, area, time intervals, etc.
Missing Values: Use a specified extreme value not likely to ever be confused with a measured value (e.g., -9999). Consistently use the same notation for each missing value in the data file.
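A consistent missing-value code makes it straightforward for data users to exclude missing entries from calculations. The sketch below assumes a comma-delimited file with a header row and a missing code of -9999; the function name and the sample data are hypothetical.

```python
import csv
import io

MISSING_CODE = -9999.0  # the extreme value used for every missing entry


def column_mean(csv_text, column, missing=MISSING_CODE):
    """Average a numeric column, skipping entries equal to the missing code."""
    reader = csv.DictReader(io.StringIO(csv_text))
    values = []
    for row in reader:
        value = float(row[column])
        if value != missing:
            values.append(value)
    return sum(values) / len(values) if values else None


sample = "site,temp\na,10.0\nb,-9999\nc,20.0\n"
print(column_mean(sample, "temp"))  # mean of 10.0 and 20.0 only
```

If the missing code were not excluded, the same column would average to -3323.0, which shows why the code must be documented and applied consistently.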
Typical Parameter Documentation:
The following text describes the parameters in a data set; this type of description should be included in the data set documentation.
Data File Contents: (kt_tree_data.csv) The files are in comma-delimited ASCII format, with the first line listing the data set, author, and date. The data records follow and are described in the table below. A value of -9.99 indicates no data.
Column | Description | Units/Format
SITE | k=Kataba forest, p=Pandamatenga, m=Near Maun, e=HOORC/MPG Maun tower, o=Okwa river crossing, t=Tshane, skukuza=Skukuza Flux Tower | text
SPECIES | Scientific name up to 25 characters | text
DATE | Date of measurement | yyyymmdd
BA | Woody plant basal area | m2/ha
SEBA | Standard error of BA | m2/ha
DENSITY | Woody plant density (number of trees per hectare) | number/ha
SEDEN | Standard error of DENSITY (n=42 for KT, n=49 for Skukuza) | number/ha
STEMS | Number of stems per hectare (/ha) | number/ha
HEIGHT | Basal area-weighted average height | m2/ha
WOOD | Aboveground woody plant wood dry biomass | kg/ha
LEAF | Aboveground woody plant leaf dry biomass | kg/ha
LAI | Leaf Area Index calculated by allometry | m2/m2
[ Adapted from Scholes, R. J. 2005. SAFARI 2000 Woody Vegetation Characteristics of Kalahari and Skukuza Sites. Data set. Available on-line [http://daac.ornl.gov/] from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. ]
Data File Contents: (NARSTO_EPA_SS_HOUSTON_FRASER_ORG_SPEC_24HR_V1.txt)
COLUMN NAME | NAME TYPE | CAS IDENTIFIER | UNITS | FORMAT TYPE | FORMAT FOR DISPLAY | MISSING CODE | OBSERVATION TYPE | SAMPLE PREPARATION | BLANK CORRECTION
Site ID: standard | Variable | None | None | Char | 12 | None | Supplementary data | Not applicable | Not applicable
Date start: local time | Variable | None | yyyy/mm/dd | Date | 10 | None | Supplementary data | Not applicable | Not applicable
Time start: local time | Variable | None | hh:mm | Time | 5 | None | Supplementary data | Not applicable | Not applicable
Date end: local time | Variable | None | yyyy/mm/dd | Date | 10 | None | Supplementary data | Not applicable | Not applicable
Time end: local time | Variable | None | hh:mm | Time | 5 | None | Supplementary data | Not applicable | Not applicable
Time zone: local | Variable | None | None | Char | 3 | None | Supplementary data | Not applicable | Not applicable
Date start: UTC | Variable | None | yyyy/mm/dd | Date | 10 | None | Supplementary data | Not applicable | Not applicable
Time start: UTC | Variable | None | hh:mm | Time | 5 | None | Supplementary data | Not applicable | Not applicable
Date end: UTC | Variable | None | yyyy/mm/dd | Date | 10 | None | Supplementary data | Not applicable | Not applicable
Time end: UTC | Variable | None | hh:mm | Time | 5 | None | Supplementary data | Not applicable | Not applicable
Fluoranthene | Variable | C206-44-0 | ng/m3 (nanogram per cubic meter) | Decimal | 8.2 | -999.99 | Particles | Organic extraction | Blank corrected
Fluoranthene | Flag | C206-44-0 | None | Char | 2 | None | Particles | Organic extraction | Blank corrected
Pyrene | Variable | C129-00-0 | ng/m3 (nanogram per cubic meter) | Decimal | 8.2 | -999.99 | Particles | Organic extraction | Blank corrected
Pyrene | Flag | C129-00-0 | None | Char | 2 | None | Particles | Organic extraction | Blank corrected
Benz[a]anthracene | Variable | C56-55-3 | ng/m3 (nanogram per cubic meter) | Decimal | 8.2 | -999.99 | Particles | Organic extraction | Blank corrected
Benz[a]anthracene | Flag | C56-55-3 | None | Char | 2 | None | Particles | Organic extraction | Blank corrected
[ Adapted from Fraser, Matthew. 2003. NARSTO EPA_SS_HOUSTON TEXAQS2000 PM2.5 Organic Speciation Data. Available on-line (http://eosweb.larc.nasa.gov/PRODOCS/narsto/table_narsto.html) at the Langley DAAC, Hampton, Virginia, U.S.A. ]
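Documentation of this kind gives a program enough information to write and read values consistently. The sketch below assumes that the display format "8.2" means a total width of 8 characters with 2 decimal places, which the source does not state explicitly; the function names are hypothetical.

```python
MISSING_CODE = -999.99  # missing code documented for the concentration columns


def format_concentration(value):
    """Format a concentration (ng/m3); None is written as the missing code."""
    if value is None:
        value = MISSING_CODE
    # Assumed reading of "8.2": width 8, 2 decimal places.
    return f"{value:8.2f}"


def parse_concentration(text):
    """Read a formatted concentration back; the missing code becomes None."""
    value = float(text)
    return None if value == MISSING_CODE else value


print(repr(format_concentration(1.5)))
print(repr(format_concentration(None)))
print(parse_concentration(" -999.99"))
```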
We recommend that you organize the data within a file in one of two ways. Whichever style you use, be sure to place each observation in a separate line (row). Most often each row in a file represents a complete record, and the columns represent all the parameters that make up the record. This arrangement is similar to a spreadsheet or matrix. For example:
SAFARI 2000 Plant and Soil C and N Isotopes, Southern Africa, 1995-2000
SITE,COUNTRY,LAT,LONG,DATE,START_DEPTH,END_DEPTH,CHARACTERISTICS,C,N,d13C,d15N
units,none,decimal degrees,decimal degrees,yyyy/mm/dd,cm,cm,none,percent,percent,per mil,per mil
USGS-1,Botswana,-21.62,27.37,1999/07/12,5,20,Hardveld,0.67,0.052,-17,8.9
USGS-2,Botswana,-21.07,27.42,1999/07/12,5,20,Hardveld,0.68,0.063,-18.3,8
USGS-3,Botswana,-20.72,26.83,1999/07/12,5,20,Hardveld,0.94,0.087,-17,6.8
USGS-4,Botswana,-20.52,26.41,1999/07/12,5,20,Hardveld,0.53,0.04,-19.9,5.5
USGS-5,Botswana,-20.55,26.15,1999/07/12,5,20,Lacustrine,2.11,0.162,-15.2,5.9
...
USGS-30,Botswana,-19.81,23.63,1999/07/18,5,20,Alluvium,0.67,0.063,-19.2,11.8
USGS-31,Botswana,-20.62,22.74,1999/07/18,5,20,Hardveld,0.23,0.014,-16.8,16.2
USGS-32,Botswana,-21.06,22.4,1999/07/18,5,20,Hardveld,0.39,0.028,-20.9,9.5
USGS-33,Botswana,-22.01,21.37,1999/07/19,5,20,Sandveld,0.19,0.01,-17.9,9.1
USGS-34,Botswana,-22.99,22.18,1999/07/19,5,20,Sandveld,0.16,0.006,-19.7,8.7
USGS-35,Botswana,-23.7,22.8,1999/07/19,5,20,Sandveld,0.37,0.019,-20.7,15.2
[ From: Aranibar, J. N., and S. A. Macko. 2005. SAFARI 2000 Plant and Soil C and N Isotopes, Southern Africa, 1995-2000. Data set. Available on-line [http://daac.ornl.gov/] from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. ]
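A file laid out this way (a title line, a row of column names, a row of units, then one record per row) can be read with a few lines of code. The sketch below uses the csv module from the Python standard library and the first two records of the example; the function name read_data_file is hypothetical.

```python
import csv
import io

SAMPLE = """\
SAFARI 2000 Plant and Soil C and N Isotopes, Southern Africa, 1995-2000
SITE,COUNTRY,LAT,LONG,DATE,START_DEPTH,END_DEPTH,CHARACTERISTICS,C,N,d13C,d15N
units,none,decimal degrees,decimal degrees,yyyy/mm/dd,cm,cm,none,percent,percent,per mil,per mil
USGS-1,Botswana,-21.62,27.37,1999/07/12,5,20,Hardveld,0.67,0.052,-17,8.9
USGS-2,Botswana,-21.07,27.42,1999/07/12,5,20,Hardveld,0.68,0.063,-18.3,8
"""


def read_data_file(text):
    """Split a data file into title, column names, units, and records."""
    lines = io.StringIO(text)
    # The title contains commas, so read it as a plain line, not as CSV.
    title = lines.readline().strip()
    rows = list(csv.reader(lines))
    names, unit_row, records = rows[0], rows[1], rows[2:]
    # The first cell of the units row is the literal label "units",
    # so the SITE column has no unit entry of its own.
    units = dict(zip(names[1:], unit_row[1:]))
    return title, names, units, [dict(zip(names, r)) for r in records]


title, names, units, records = read_data_file(SAMPLE)
print(title)
print(units["LAT"], "/", units["d13C"])
print(records[0]["SITE"], records[0]["LAT"], records[0]["LONG"])
```

Because each row is a complete record and the units travel with the file, a reader needs no information beyond the file itself to interpret the values.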
If you use a coded value or abbreviation for a site or station, be sure to provide a definition, including spatial coordinates, in the documentation.
A second arrangement may be more efficient when most records do not have measurements for most parameters, that is, a very sparse matrix of data, with many missing values. In this arrangement, one column is used to define the parameter and another column is used for the value of the parameter. Other columns may be used for site, date, treatment, units of measure, etc. For example:
Coast redwood NPP data from Humboldt Redwoods State Park, California, USA; Busing & Fujimori, June 2005
Old stand plot study at Bull Creek with bole diameter measurements at 1.7 m aboveground in 1972 and 2001
Orig_sort_order | Parameter | Measurement_Type | Value | Units | Species | Sequoia_sp_grav | Equation
1 | Latitude | Site Characteristics | 40.35 | decimal degree | Not applicable | -999.9 | Not applicable
2 | Longitude | Site Characteristics | -123 | decimal degree | Not applicable | -999.9 | Not applicable
3 | Terrain | Site Characteristics | Alluvial flat | Not applicable | Not applicable | -999.9 | Not applicable
4 | Slope | Site Characteristics | 0 | degree | Not applicable | -999.9 | Not applicable
5 | Elevation (above mean sea level) | Site Characteristics | 80 | m (meter) | Not applicable | -999.9 | Not applicable
6 | Total site area | Site Characteristics | 1.44 | ha (hectare) | Not applicable | -999.9 | Not applicable
7 | Density | Density | 380