Platform · Infrastructure & Reproducibility
Tools, workflows, and standards
Every engagement uses a defined, versioned technology stack. Pipelines are containerised and version-controlled so analyses can be independently re-run months or years after delivery. This page documents the infrastructure and the data-handling principles that apply across all projects.
Technology Stack
Infrastructure
Workflow Orchestration
Scalable, portable workflow engines with native container and cloud support.
Containerisation & Environments
Immutable, version-pinned software environments provide the same software stack on every compute target, so an analysis runs against identical tool versions wherever it executes.
Cloud & HPC
Elastic compute provisioned on demand; no idle capacity charges between projects.
Provenance & Versioning
Every analysis is version-controlled, logged, and independently re-runnable from a single command.
Languages & Frameworks
Language choices follow scientific community standards — not engineering preference.
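As a rough illustration of the provenance record described above, the Python sketch below hashes input files and stores the digests next to a pipeline revision and a container digest. The function names, the manifest keys, and the example values are assumptions made for illustration; they are not the format used in project delivery.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(inputs, pipeline_revision, container_digest):
    """Collect the facts needed to re-run an analysis (hypothetical schema)."""
    return {
        "pipeline_revision": pipeline_revision,  # e.g. a git tag or commit hash
        "container_digest": container_digest,    # e.g. an image digest string
        "inputs": {str(p): sha256_of(Path(p)) for p in sorted(inputs)},
    }
```

Storing a manifest like this with the delivered results lets a later re-run confirm that it starts from the same inputs, code revision, and environment.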
Reproducibility
FAIR data principles
All analyses are conducted in accordance with the FAIR data principles (Wilkinson et al., 2016, Scientific Data doi:10.1038/sdata.2016.18). The practices below describe how each principle is implemented in project delivery.
Findable
Data and metadata are assigned globally unique, persistent identifiers and are described with sufficient metadata to be discoverable by humans and machines.
- DOIs assigned to published datasets via Zenodo or Figshare
- Structured metadata following Dublin Core and DataCite schemas
- Workflow code registered with WorkflowHub (nf-core registry where applicable)
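To make the structured-metadata point concrete, here is a minimal sketch that checks a dataset record for the six properties the DataCite schema treats as mandatory (identifier, creator, title, publisher, publication year, resource type). The key spellings follow DataCite's JSON style as an assumption, and the record itself, including the DOI, is a placeholder rather than a real deposit.

```python
# Mandatory DataCite properties, written with JSON-style key names (assumed spelling).
REQUIRED_FIELDS = ("doi", "creators", "titles", "publisher", "publicationYear", "types")


def missing_fields(record: dict) -> list:
    """Return the required keys that are absent or empty in a metadata record."""
    return [key for key in REQUIRED_FIELDS if not record.get(key)]


record = {
    "doi": "10.5072/example-dataset",  # placeholder DOI, not a real record
    "creators": [{"name": "Example, Author"}],
    "titles": [{"title": "Example variant calls"}],
    "publisher": "Zenodo",
    "publicationYear": 2025,
    "types": {"resourceTypeGeneral": "Dataset"},
}
print(missing_fields(record))  # []
```

A check of this kind can run before deposit so that incomplete records are caught before an identifier is minted.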
Accessible
Data and metadata are retrievable by their identifier using open, standardised communication protocols; metadata remains accessible even when the underlying data is not.
- HTTPS and S3/GCS signed URLs for time-limited, authenticated access
- Metadata deposited in public repositories independently of raw data
- Access procedures documented in project README at project close
Interoperable
Data uses standard community formats and ontologies so it can be integrated with other datasets and consumed by existing analysis tools without transformation.
- Standard formats: FASTQ, BAM/CRAM, VCF/BCF, BED, GFF3, HDF5/AnnData, MEX
- Ontologies and controlled vocabularies: Sequence Ontology (SO), Gene Ontology (GO), Human Phenotype Ontology (HPO), OMIM
- Reference genomes from Ensembl or UCSC with explicit build and annotation versions
Reusable
Data and metadata are richly described with accurate provenance and clear data usage licences so they can be replicated and combined for future studies.
- CC-BY or CC0 licences on published data; MIT licence on pipeline code
- Complete methods sections written to the level of detail expected in Methods in Molecular Biology
- Containerised environments pinned to exact tool versions for long-term re-runability
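One way to enforce the exact-version pinning mentioned in the last item is a small check over a dependency list. The sketch below assumes a simple `name==version` spec format and uses illustrative tool entries; a real environment file may use a different syntax and would need a matching parser.

```python
import re

# Accepts only "name==version" with a concrete version (no ranges, no wildcards).
EXACT_PIN = re.compile(r"^[A-Za-z0-9_.\-]+==\d[\w.\-]*$")


def unpinned(dependencies):
    """Return dependency specs that are not pinned to one exact version."""
    return [spec for spec in dependencies if not EXACT_PIN.match(spec.strip())]


deps = ["samtools==1.19.2", "bcftools>=1.18", "multiqc", "python==3.11.*"]
print(unpinned(deps))  # ['bcftools>=1.18', 'multiqc', 'python==3.11.*']
```

Failing a build when this list is non-empty keeps version ranges and wildcards out of the environment definition before an image is built.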
Data Handling
Security & compliance
The policies below apply to all client data received under project agreements. Specific requirements — jurisdiction, retention, or additional controls — can be documented in a Data Processing Agreement prior to project start.
Encryption at rest and in transit
All client data stored on cloud infrastructure is encrypted at rest using AES-256. Data transmitted between systems and clients uses TLS 1.3. Server access over SSH requires key-based authentication with Ed25519 or 4096-bit RSA keys.
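For client-side scripts, the TLS 1.3 requirement can be enforced rather than assumed. The sketch below uses Python's standard `ssl` module; the helper name `tls13_context` is illustrative, and it assumes a Python build linked against an OpenSSL version that supports TLS 1.3.

```python
import ssl


def tls13_context() -> ssl.SSLContext:
    """Client context that verifies certificates and refuses anything below TLS 1.3."""
    context = ssl.create_default_context()            # certificate and hostname checks on
    context.minimum_version = ssl.TLSVersion.TLSv1_3  # reject TLS 1.2 and older
    return context
```

The context can be passed to standard-library clients, for example `urllib.request.urlopen(url, context=tls13_context())`, so a connection to a server that only offers older protocol versions fails instead of silently downgrading.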
No long-term raw data retention
Raw client data is stored only for the duration of the project and for a defined post-delivery review period (default: 30 days). Retention beyond this requires explicit written agreement and is scoped to specific file types.
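As a worked example of the default retention window, the sketch below computes the date on which raw data becomes due for deletion. The function name and the delivery date are hypothetical; only the 30-day default comes from the policy above.

```python
from datetime import date, timedelta

DEFAULT_REVIEW_DAYS = 30  # default post-delivery review period stated in the policy


def raw_data_deletion_date(delivery: date, review_days: int = DEFAULT_REVIEW_DAYS) -> date:
    """Return the date after which raw client data is due for deletion."""
    return delivery + timedelta(days=review_days)


print(raw_data_deletion_date(date(2025, 3, 1)))  # 2025-03-31
```

A longer period agreed in writing would be passed as `review_days` for the specific project.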
Compute jurisdiction agreements
Data is processed in the cloud region agreed with the client prior to project start. The default is AWS ap-southeast-1 (Singapore) for South and Southeast Asian clients; AWS eu-west-1 (Ireland) is available for GDPR-scoped projects.
Access controls and audit logging
Cloud console access requires MFA. IAM roles are scoped to the minimum required permissions. All object storage access (S3 or GCS) is logged via AWS CloudTrail or GCP Cloud Audit Logs, and logs are retained for 90 days.
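To illustrate what minimum-permission scoping can look like on AWS, the sketch below builds an IAM policy document that allows listing and reading objects under a single project prefix and nothing else. The bucket name, prefix, statement IDs, and the helper `read_only_policy` are hypothetical; actual role definitions depend on the project.

```python
import json


def read_only_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy that permits reads on one project prefix only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListProjectPrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                # Restrict listing to keys under the project prefix.
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadProjectObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
        ],
    }


print(json.dumps(read_only_policy("example-client-bucket", "project-001"), indent=2))
```

Because IAM denies anything not explicitly allowed, a role carrying only this policy cannot write, delete, or read outside the named prefix.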
Client data is not used for model training
Data provided by clients under project agreements is not used to train or fine-tune AI/ML models. Any research use of anonymised or aggregated data requires a separate written data-sharing agreement.
Incident notification
In the event of a suspected data security incident, the client will be notified within 72 hours of detection. Incidents are assessed and documented in line with ISO/IEC 27001 principles.
These policies are reviewed annually. Last reviewed: 2025. To request a copy of our Data Processing Agreement template or to discuss project-specific requirements, contact us via the contact form.
Questions?
Infrastructure requirements vary by project
If your institution has specific HPC configurations, data sovereignty requirements, or security controls, describe them when you get in touch.
Discuss your requirements