CalCOFI Data Management: Setting Community Standards
(presented as a poster at the 2007 CalCOFI Conference by James Wilkinson, Karen Baker & Richard Charter)

Introduction
CalCOFI represents a partnership of multiple agencies conducting quarterly joint oceanographic cruises. CalCOFI cruise participants work as a cohesive cross-agency unit to accomplish cruise objectives. Ancillary researchers frequently integrate their field measurements and sampling with the long-term core CalCOFI measurements and samples. Once a cruise concludes, however, this cohesive unit disperses; individuals return to their respective agencies and labs to process samples and analyze data. Each group uses legacy lab- or agency-specific methods and software to generate data products in local formats. These diverse data processing methods, products, and storage formats create challenges for merging final datasets. Development and incorporation of shared data management practices and joint community standards enable data integration.

Establishing Shared Practices
Identifying and establishing common, queryable columns, such as order occupied and event number, and including them in final data products allows heterogeneous datasets to be related. In addition, standardizing data elements such as column headers, date-time specifications, and spatial designations (e.g., GPS decimal format) is easy to implement with minimal impact on existing data production. Standard, linkable data elements allow ingestion into relational databases, applications, and other analytical tools such as DataZoo using import templates.
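As a minimal sketch (the column names and values here are hypothetical, not the project's actual schema), datasets that carry the shared indices can be related with a simple keyed join:

```python
# Hedged sketch with hypothetical column names and values: two datasets
# that carry the standard queryable indices (line.station and order
# occupied) can be related with a simple keyed join.
hydro = [
    {"line_sta": "93.3 120.0", "order_occ": 12, "temp_c": 14.2},
    {"line_sta": "90.0 90.0",  "order_occ": 13, "temp_c": 15.1},
]
tows = [
    {"line_sta": "93.3 120.0", "order_occ": 12, "plankton_ml": 31.0},
]

def station_key(row):
    """Shared station-level index present in both datasets."""
    return (row["line_sta"], row["order_occ"])

tows_by_key = {station_key(r): r for r in tows}

# Each hydrographic row picks up the matching tow columns, if any.
merged = [{**h, **tows_by_key.get(station_key(h), {})} for h in hydro]
```

Any dataset that includes the same standard index columns can join this way, which is the point of establishing them community-wide.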

CalCOFI Standardization Strategies:

  1. Persistent vocabulary and formats with defined standard data column labels
  2. Date & position format conventions
    • Date: YYYY/MM/DD HHMMSS.S UTC
    • Position: 32.53455, -117.23433
  3. Standard line.station grid designations, for example line 93.3, station 120.0. Traditionally, integer values are used to describe a CalCOFI line.sta (93.120, for example), but with the integration of SCCOOS stations into the regular 75-station pattern, decimal notation improves line.station distinguishability.
  4. Order-occupied numbering for sequential stations
  5. Event numbers for distinguishing all station activities that generate data
  6. Data distribution in non-proprietary formats such as comma-delimited ASCII files (.csv), in addition to legacy IEH for data warehouses like NODC, which expect & can ingest IEH.
  7. Metadata - definitions of measurements & equipment; translation tables for different unit attributes.
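The date, position, and line.station conventions above can be checked mechanically. The sketch below is a hypothetical validator, not project software; the regular expressions simply encode the formats listed in items 2 and 3:

```python
import re
from datetime import datetime

# Patterns encoding the conventions above (hypothetical validator):
#   date:         YYYY/MM/DD HHMMSS.S UTC
#   position:     decimal degrees, e.g. 32.53455, -117.23433
#   line.station: decimal notation, e.g. 93.3 120.0
DATE_RE     = re.compile(r"^\d{4}/\d{2}/\d{2} \d{6}\.\d UTC$")
POS_RE      = re.compile(r"^-?\d{1,2}\.\d+, -?\d{1,3}\.\d+$")
LINE_STA_RE = re.compile(r"^\d+\.\d+ \d+\.\d+$")

def is_standard_date(s: str) -> bool:
    """True if s matches YYYY/MM/DD HHMMSS.S UTC and is a real calendar date."""
    if not DATE_RE.match(s):
        return False
    try:
        # Strip the trailing " UTC" and confirm the date-time is valid.
        datetime.strptime(s[:-4], "%Y/%m/%d %H%M%S.%f")
        return True
    except ValueError:
        return False
```

Checks like these, run as data arrive on the network, catch format drift before it reaches a merged product.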

Shared Practices Begin in the Field
With quarterly cruises generating a persistent influx of data, the CalCOFI technical team must maintain an established routine to keep pace. Changes in procedure or protocol impact the expediency of the ongoing process. To minimize the impact of new data integration practices, the change process best begins at sea. Careful attention to station activities & event logs both creates a shared index and initiates a dialogue about organizational design.

SIO-CalCOFI Data Processing Flow Diagram

Figure 1: Typical CalCOFI-SIO Data Flow from Field Collection to Publication

CalCOFI-SWFSC Data Process
Figure 2: Typical CalCOFI-SWFSC Data Flow from Field Collection to Publication

Developing Data Integration Standards

At sea:

  1. Water samples are collected, logged & analyzed. Net tows are collected, logged, & preserved.
    • Standard 1: all logs use joint standard formats with common station, event number & order occupied indexes.
  2. Preliminary data processing and quality control of individual sample types: salinity, nutrients, oxygen, chlorophyll.
    • Standard 2: event logs, sample logs, & analytical output files are available on the network, all include common sample indices.
Ashore:
  1. Traditional data processing: individual data types are merged into a combined, local ASCII format using in-house software. Preliminary data are merged and quality-control protocols are applied. Final compilation and data publication: CalCOFI Data Reports as txt, pdf, and html; contour plots; IEH files (proprietary legacy format), all web accessible.
    • Standard 3: Station & cast information, bottle, and CTD data are merged into a non-proprietary csv with common, queriable elements, standard formats, & labels.
  2. Net tow data are processed with the flow-meter calibrations; depth of tow, volume of water strained, and other variables are calculated and added to the net tow dataset. The bongo tows are then volumed.
    • Standard 4: A plankton volume report is generated with common elements, standard formats and labels.
  3. The plankton samples are then sorted, removing all fish eggs, larvae, and squid paralarvae. The major species (sardine, anchovy, and hake) are identified and sized at the time of sorting. The remaining plankton sample goes to the SIO archives. The unidentified eggs and larvae are identified in the ichthyoplankton identification lab. The fish eggs and larvae are then archived in the NMFS ichthyoplankton archives.
    • Standard 5: An eggs and larvae dataset is produced in a standard format. An annual ichthyoplankton data report is produced in a standard format once all the cruises of the year have been identified.
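Standard 3's merged, non-proprietary product needs nothing more than the csv module. This is a hedged sketch: the column names below are illustrative, not the project's published header vocabulary:

```python
import csv
import io

# Illustrative merged station/cast + bottle + CTD rows carrying the
# common queryable elements (cruise, line.station, order occupied, event).
rows = [
    {"cruise": "0711", "line_sta": "93.3 120.0", "order_occ": 12,
     "event": 145, "depth_m": 10, "salinity_psu": 33.42, "temp_c": 14.2},
]
FIELDS = ["cruise", "line_sta", "order_occ", "event",
          "depth_m", "salinity_psu", "temp_c"]

# Write a comma-delimited ASCII product with a standard header row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
```

Because the header row carries the standard labels, any partner tool can ingest the file by name rather than by column position.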

Cross-Project Data Interfacing

CalCOFI cruises generate multiple data formats such as station data; continuous meteorological, ADCP, & SCIMS (universal format SCS or MET continuous data + event numbers) data; avifauna & marine mammal visual observations and acoustic recordings. Each research group has its own data publishing goals. It must be the goal of all data-producing participants to generate a standard product with common indices for use by the data community. CalCOFI-SIO & CalCOFI-SWFSC are establishing a common vocabulary and standardizing final data formats & practices so hydrographic, zooplankton, and ichthyologic data can be integrated.

Data Interfacing Strategies:

  1. Establish a shared data product
  2. Consider your final data and what you are able to share with the data community; some data processes take longer than others.
  3. Develop a standard, persistent format so cross-project partners can plan for a consistent data format and design ingestion mechanisms such as import templates.
  4. Think collaboratively
  5. The Ocean Informatics team is working together to automate the importing of CalCOFI data into DataZoo, a cross-project, web-based information system.
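One way to realize such an import template is a simple header-mapping table. Everything below is hypothetical (the local header names and the shared vocabulary are invented for illustration), but it shows the mechanism:

```python
# Hypothetical import template: maps one producer's local column headers
# onto a shared vocabulary so a cross-project system such as DataZoo
# could ingest the file without per-cruise custom code.
TEMPLATE = {
    "Sta_Code": "line_sta",
    "Ord_Occ":  "order_occ",
    "Event_No": "event",
    "Salt":     "salinity_psu",
}

def apply_template(row, template=TEMPLATE):
    """Rename known columns; pass unrecognized columns through unchanged."""
    return {template.get(name, name): value for name, value in row.items()}
```

Each data producer maintains one such mapping; once the shared vocabulary is stable, the ingestion side never changes.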

Acknowledgements
We would like to recognize the added work done by field participants - ship and scientific staff - in helping to plan ahead for data integration, and by the cross-project community of participants working to create a common information environment. This work is supported by NOAA CalCOFI (SIO and SWFSC) together with the NSF LTER California Current Ecosystem site and the Ocean Informatics team.

Authors: James Wilkinson and Karen Baker, Scripps Institution of Oceanography; Richard Charter, NOAA Southwest Fisheries Science Center