PANGAEA.
Data Publisher for Earth & Environmental Science

Preservation Plan

PANGAEA is committed to the long-term preservation of all data and metadata entrusted to it by the research community. This commitment extends beyond bit-level storage integrity to encompass active management of format usability, semantic consistency over time, and formally documented procedures for all stages of the archival lifecycle. The following article describes the technical and organizational measures that together constitute PANGAEA's preservation strategy. It supplements the information provided in Felden et al. (2023) and is one of the reference documents for PANGAEA's CoreTrustSeal certification.

Principles

PANGAEA's preservation approach is grounded in three guiding principles. First, data are not merely stored as files but are ingested into a structured, normalized relational database that preserves the full semantic context of each measurement — ensuring that data remain interpretable independently of any external documentation. Second, preservation is an active process: PANGAEA monitors the long-term usability of archived formats and takes preventive action against obsolescence, including the creation of format-migrated copies when required. Third, institutional commitment is formally secured: the AWI/MARUM cooperation agreement (AMAR) guarantees that all archived data and metadata will remain accessible for a minimum of ten years following any formal decommissioning of PANGAEA, and that the host institutions will maintain the necessary infrastructure and expertise to honor this commitment.

PANGAEA's ingest and archiving workflow is compliant with the Open Archival Information System (OAIS) standard (ISO 14721).

Metadata Preservation

PANGAEA treats metadata as essential for the long-term reusability of data. Metadata are stored in a highly normalized PostgreSQL relational database, whose schema is modeled to be compatible with international standards including schema.org and ISO 19115. This normalized structure allows dataset representations to be compiled dynamically and serialized into a wide range of output formats on demand, without modifying the underlying archived records.

The following metadata categories are collected and preserved for every published dataset:

Citation metadata: author and contributor names with ORCID iDs; institutional affiliations with ROR identifiers; dataset title; publication year; publisher; DOI name; resource type according to Stall et al. (2023).

Funding information: project names, grant numbers, and funder identifiers (Crossref Funder IDs or ROR identifiers).

Event information: detailed spatial and temporal coverage of sampling or measurement events, including methods, devices, and campaign context.

Related documentation: links (using DOIs or other persistent identifiers) to related scientific articles, reports, and supplementary materials. Where related documentation is not held in an external repository with a persistent identifier, PANGAEA stores a local copy in PDF/A format. PDF/A is preferred for its long-term stability; copies will be migrated to successor standards if PDF/A itself becomes obsolete.

PANGAEA's database schema is continuously adapted to accommodate new and evolving metadata standards. When schema extensions are introduced, the metadata of existing datasets are reviewed and updated accordingly. All such changes are managed carefully to avoid incompatible modifications to existing records.

Data Object Preservation

Tabular Data

Except for binary objects, all submitted data values are imported into the PANGAEA relational database as structured data series. Each data entry carries metadata about its type (numeric, date/time, string), the responsible scientist (PI), the methodology applied, and, for numerical values, format information including significant digits. This structured representation decouples the data from any particular file format, ensuring long-term interpretability regardless of changes in software environments.
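As an illustration only, the per-series metadata described above could be modeled roughly as follows. The class and field names (`DataSeries`, `significant_digits`, and so on) are assumptions made for this sketch and do not reflect PANGAEA's actual database schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataSeries:
    """Hypothetical description of one data series (not PANGAEA's schema)."""

    parameter: str                            # name of the measured parameter
    value_type: str                           # "numeric", "date/time" or "string"
    principal_investigator: str               # responsible scientist (PI)
    method: str                               # methodology applied
    significant_digits: Optional[int] = None  # only meaningful for numeric series

    def render(self, value) -> str:
        """Serialize a value without relying on any submitted file format."""
        if self.value_type == "numeric" and self.significant_digits:
            # The '#' flag keeps trailing zeros, so 1.5 with 3 digits gives '1.50'.
            return format(float(value), f"#.{self.significant_digits}g")
        return str(value)
```

Because the formatting information travels with the series rather than with a file, the same value can be written out consistently in any export format. Note that the `g` presentation type switches to exponent notation for very large or very small magnitudes, which a production implementation would need to handle.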

At the time of archival, a copy of each dataset (with checksum and timestamp) is additionally marshaled to disk as a tab-delimited text file. These copies serve as reference files for integrity verification, and the tab-delimited format ensures readability without specialized software.
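A minimal sketch of how such a reference file could be checked against its recorded checksum is shown below. The text does not state which hash algorithm PANGAEA uses; SHA-256 and the function names are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_reference_copy(path: Path, recorded_checksum: str) -> bool:
    """Return True if the file on disk still matches the recorded checksum."""
    return sha256_of_file(path) == recorded_checksum.strip().lower()
```

A mismatch indicates that the file on disk no longer corresponds to the copy written at archival time and should be restored from a redundant copy.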

Binary Data

Not all data held in PANGAEA are available in tabular form. Some datasets are archived in compact, community-specific binary formats, including NetCDF files, images, video recordings, and geophysical data products. For these, long-term usability is an active responsibility: the PANGAEA team monitors software dependencies, version changes, and backward compatibility issues for all archived binary formats. Where continued readability requires it, new format-migrated copies are created and archived; the original submitted file is always retained alongside any migrated copy.

PANGAEA applies format rules before accepting binary data for archival. Where possible, uncompressed or widely supported open formats are preferred. Currently accepted formats are:

  • Images: JPEG, PNG, TIFF
  • Documents: PDF/A (preferred), ODF, OOXML
  • Media containers: MP4, MPG, OGG, Matroska; audio and video content within containers must comply with the following codecs:
    • Video: uncompressed, MPEG-1, MPEG-4 Part 2, AVC/H.264, H.265
    • Audio: uncompressed, MPEG Layer III (MP3), MPEG-4 Part 3/AAC
  • Scientific data: NetCDF, preferably using the Climate and Forecast (CF) Metadata Conventions; detailed documentation is required in all other cases

This list is not exhaustive and is updated as community standards evolve. If any accepted format becomes deprecated or is superseded, PANGAEA will migrate archived copies to modern equivalents while retaining the originals.
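The format rules above could be expressed as a simple pre-ingest check along the following lines. This is a sketch under stated assumptions: the mapping from formats to file extensions, the codec labels, and the function names are chosen for illustration and are not taken from PANGAEA's editorial system. A file extension alone also cannot confirm PDF/A conformance or the codecs inside a media container, so the codec names are passed in explicitly.

```python
from pathlib import Path
from typing import Optional

# Assumed extension mapping for the formats listed above (illustrative only).
ACCEPTED_EXTENSIONS = {
    "image": {".jpg", ".jpeg", ".png", ".tif", ".tiff"},
    "document": {".pdf", ".odt", ".ods", ".odp", ".docx", ".xlsx", ".pptx"},
    "media": {".mp4", ".mpg", ".mpeg", ".ogg", ".mkv"},
    "scientific": {".nc"},
}

# Codec labels are normalized to lower case before comparison.
ACCEPTED_VIDEO_CODECS = {"uncompressed", "mpeg-1", "mpeg-4 part 2", "h.264", "h.265"}
ACCEPTED_AUDIO_CODECS = {"uncompressed", "mp3", "aac"}


def classify_file(filename: str) -> Optional[str]:
    """Return the format category for a file name, or None if not listed."""
    suffix = Path(filename).suffix.lower()
    for category, extensions in ACCEPTED_EXTENSIONS.items():
        if suffix in extensions:
            return category
    return None


def codecs_accepted(video_codec: str, audio_codec: str) -> bool:
    """Check the codecs reported for the streams inside a media container."""
    return (
        video_codec.lower() in ACCEPTED_VIDEO_CODECS
        and audio_codec.lower() in ACCEPTED_AUDIO_CODECS
    )
```

Since the list of accepted formats is not exhaustive, a `None` result would more plausibly trigger a manual review than an automatic rejection.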

Raw data, defined as level-0 data without any accompanying metadata, are not accepted for archival. Data at processing level 1 (raw data with a minimum set of metadata) may be accepted if adequate contextual information is provided. No guarantees are given for the long-term usability of level-0 or level-1 datasets. The processing levels are documented on the Processing levels page.

Storage and Physical Infrastructure

PANGAEA's storage infrastructure is operated by the computing center of the Alfred Wegener Institute (AWI) in Bremerhaven, in accordance with the AMAR cooperation agreement. A comprehensive set of technical and organizational measures (TOM) is in place:

Redundant storage: all data are stored using erasure coding across disk and tape, with battery-backed write caches to ensure integrity at the point of write. Data on disk are replicated to tape nightly and captured in snapshots that are retained for six months. Tape copies are replicated to a physically separate building within two hours of creation; decommissioned tapes are retained for one year before reuse. Virtual machine working data are captured in nightly machine snapshots.

Tape archive: the central archival storage system consists of two SpectraLogic TFinity ExaScale robotic tape libraries, housed in separate buildings at AWI, with a combined capacity of up to 60 PB and equipped with high-capacity LTO tape drives.

Database integrity: PostgreSQL streaming replication to a dedicated backup system enables point-in-time recovery to any moment prior to a failure event.

Facility measures: fire and smoke detection systems; server room monitoring of temperature and humidity; server room air-conditioning; uninterruptible power supplies (UPS) capable of sustaining all PANGAEA-relevant hardware for up to 60 minutes, backed by a diesel-powered emergency generator providing a further 23 hours of operation; RAID and hard disk mirroring in the virtualization environment; user permission management; network firewall and intrusion detection systems; anti-virus email filtering.

Documentation: all systems are documented in an internal Confluence Wiki kept operationally isolated from the main PANGAEA infrastructure for disaster-safety purposes. A ticket system is used to track and manage incidents.

Hardware is typically renewed every three to four years through the AWI computing center's lifecycle management program, implemented transparently via virtualization.

Off-Site Replica at MARUM

Since 2025, PANGAEA has operated an off-site replica of the relational database and web frontend at MARUM/University of Bremen, hosted in the Green IT Housing Center (Rechenzentrum) of the University of Bremen. This facility provides geographic and institutional separation from the primary AWI infrastructure and is a key component of PANGAEA's resilience strategy against both technical failure and cyberattack scenarios.

The off-site installation currently covers:

  • All dataset metadata, including individual DOI landing pages and all metadata serializations available via harvesting endpoints (OAI-PMH, schema.org, DataCite, Dublin Core, DIF, ISO 19139)
  • Full representations of tabular data publications
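For readers unfamiliar with harvesting, the sketch below shows how a standard OAI-PMH `ListRecords` request is formed. The base URL is a placeholder, not PANGAEA's endpoint, and the function name is an assumption; `oai_dc` is used because the OAI-PMH specification requires every repository to support it.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    from_date: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        # Optional selective harvesting by datestamp (YYYY-MM-DD).
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"
```

A harvester pointed at the replica instead of the primary site would only need a different `base_url`; the request parameters stay the same.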

Extension of the replica to include binary data files is planned as part of the ongoing development of this facility. The replica enables recovery of data delivery services within 24 hours following a catastrophic failure at AWI.

The MARUM Green IT Housing Center maintains the following physical and technical safeguards: two separate fire sections with servers distributed across both for site redundancy; 24/7 on-site monitoring by University of Bremen staff; automated fire alarms with a fire brigade station less than 1 km away; redundant power supply with battery backup; and multilevel physical access control. The off-site installation operates in a fully isolated network environment (Layer 2 separation), accessible only via a firewalled VPN gateway at AWI. Access tokens and keys are held only by the corresponding gateway host and by PANGAEA DevOps staff, and their number is strictly limited. Replication is unidirectional from AWI to MARUM, using snapshot-based transfers executed multiple times per day.

Versioning and Persistent Identifiers

Every published dataset is assigned a universally unique Digital Object Identifier (DOI) minted at DataCite. DOI resolution is actively maintained: PANGAEA keeps its authoritative metadata records at DataCite synchronized with dataset landing pages, and all external links in metadata records are checked automatically on a weekly basis for broken (HTTP 404) or permanently redirected (HTTP 301) responses.
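The weekly link check could be sketched as follows. This is not PANGAEA's implementation: the function names and labels are assumptions, and the HTTP request itself is left to a caller-supplied `fetch_status` function so that the classification logic stays independent of any particular HTTP client.

```python
from typing import Callable, Dict, Iterable


def classify_status(status: int) -> str:
    """Map an HTTP status code to a review label."""
    if status == 404:
        return "broken"
    if status == 301:
        return "moved"
    if 200 <= status < 300:
        return "ok"
    return "other"  # treatment of other codes is not specified in the text


def review_links(
    urls: Iterable[str], fetch_status: Callable[[str], int]
) -> Dict[str, str]:
    """Return only the links that need attention, keyed by URL."""
    report = {}
    for url in urls:
        label = classify_status(fetch_status(url))
        if label != "ok":
            report[url] = label
    return report
```

Whatever client is used for `fetch_status` must not follow redirects automatically; otherwise a permanently moved link would report the status of its target and the 301 would never be seen.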

New DOI names are created under clearly defined conditions:

  • A new identifier is issued upon the initial publication of each dataset.
  • A new identifier is issued when a published dataset undergoes a substantive revision of data or metadata that would affect reproducibility or scientific interpretation. The prior version remains accessible and is cross-referenced in the metadata record of the new version.
  • Minor editorial corrections that do not affect scientific content are applied without creating a new identifier. Instead, the corrections are transparently documented in the “Change history” section of the dataset landing pages, including the date and a short summary of the changes applied.

All versions are linked in the metadata record, ensuring full traceability of the publication history for data users and citing authors.
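The identifier rules listed above amount to a small decision procedure, sketched here for illustration. The enum, class, and function names are assumptions and are not part of PANGAEA's editorial system; deciding whether a revision is substantive remains an editorial judgment that the code takes as input.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Change(Enum):
    INITIAL_PUBLICATION = "initial publication"
    SUBSTANTIVE_REVISION = "substantive revision"
    EDITORIAL_CORRECTION = "editorial correction"


@dataclass(frozen=True)
class ChangeHistoryEntry:
    """One line of the "Change history" shown on a dataset landing page."""

    changed_on: date
    summary: str


def requires_new_doi(change: Change) -> bool:
    """New DOI names are issued for first publication and substantive revisions."""
    return change in (Change.INITIAL_PUBLICATION, Change.SUBSTANTIVE_REVISION)
```

For an editorial correction, `requires_new_doi` returns `False` and a `ChangeHistoryEntry` with the date and a short summary would be recorded instead.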

Deletion and Tombstone Records

PANGAEA does not routinely delete or remove published datasets. In exceptional cases where retraction is required — for example, due to demonstrated scientific error, misconduct, copyright infringement, or data privacy obligations under applicable law (e.g., GDPR Art. 17) — the following procedure applies: the data itself is made inaccessible, but the DOI and the dataset landing page are retained as a tombstone record. The tombstone record clearly indicates the dataset's status and the documented reason for its withdrawal, in accordance with DataCite's tombstone policy. All such actions are logged in the editorial system's change history.

Custody Transfer and Decommissioning

Should PANGAEA cease operations, the host institutions guarantee that all data and metadata will remain accessible for a minimum of ten years following any formal decommissioning. In such a scenario, only the submission and editorial system would be terminated; the database and data delivery services would remain operational. A full transition of custody to another repository could be supported by the off-site replica at MARUM as a concrete technical starting point. As a further fallback, PANGAEA can be reduced to a file-based repository: a complete file-based copy of all datasets, including binary objects, can be assembled and made available independently by AWI and/or the University of Bremen. The legal and institutional basis for these guarantees is the AWI/MARUM cooperation agreement (AMAR); a summary of its key commitments is available in the Continuity Plan.

Community-Specific Preservation Documentation

In coordination with scientific communities, PANGAEA has developed detailed documentation on the harmonization and preservation of specific data types. These include guidance on CTD data, Thermosalinograph (TSG) underway data, and bathymetric data, among others. Where applicable, these documents contain information on format choices and long-term preservation handling specific to the relevant data type. See Best practice manuals and templates for the full collection.

References
