ByteFlow AI Labs


Tools, workflows, and standards

Every engagement uses a defined, versioned technology stack. Pipelines are containerised and version-controlled so analyses can be independently re-run months or years after delivery. This page documents the infrastructure and the data-handling principles that apply across all projects.

Technology Stack

Infrastructure

Workflow Orchestration

Scalable, portable workflow engines with native container and cloud support.

  • Nextflow (≥23.10 LTS): Primary workflow management system (DSL2); nf-core community pipelines for RNA-seq, variant calling, scRNA-seq, and amplicon analysis.
  • Snakemake (≥7.32): Rule-based, Python-native pipelines for exploratory and Python-heavy workflows; conda/container integration.
  • Galaxy: Web-based collaborative environment used for training programmes and integration with NCBI/EBI data sources.

Containerisation & Environments

Immutable software environments pin every tool and dependency, so the same software stack runs on every compute target.

  • Docker: Primary container runtime for development, CI/CD, and cloud deployment; images hosted on Docker Hub and GitHub Container Registry.
  • Singularity / Apptainer: HPC-compatible execution without root privileges; Docker images converted to SIF format for Slurm clusters.
  • conda / mamba: Package environment management for legacy tools not yet containerised; pinned environment.yml files committed alongside pipeline code.
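
A pinned environment.yml can be checked automatically before it is committed. The sketch below is illustrative only: the function name, the regular expression, and the sample file are assumptions, and the parser is a simplified line-based scan rather than a full YAML reader. It treats any `name=version` spec as pinned; because conda interprets a single `=` as a prefix match, a stricter policy might require `==` or an explicit build string.

```python
import re

# A spec counts as pinned if it looks like "name=version" or
# "name=version=build", optionally with a "channel::" prefix.
# Assumption: this pattern is a simplification of conda's match-spec grammar.
PINNED = re.compile(r"^[\w.\-:]+==?[\w.!+]+(=[\w.]+)?$")


def unpinned_dependencies(environment_yml: str) -> list[str]:
    """Return top-level conda dependency specs that lack an exact version."""
    unpinned = []
    in_deps = False
    item_indent = None
    for raw in environment_yml.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop comments
        if not line.strip():
            continue
        stripped = line.lstrip()
        indent = len(line) - len(stripped)
        if indent == 0 and not stripped.startswith("- "):
            # A new top-level key starts or ends the dependencies section.
            in_deps = stripped == "dependencies:"
            item_indent = None
            continue
        if not in_deps or not stripped.startswith("- "):
            continue
        if item_indent is None:
            item_indent = indent
        if indent != item_indent:
            continue  # nested list, e.g. entries under "- pip:"
        spec = stripped[2:].strip()
        if spec.endswith(":"):
            continue  # the "- pip:" mapping itself
        if not PINNED.match(spec):
            unpinned.append(spec)
    return unpinned


# Hypothetical environment file used only for illustration.
sample_env = """name: rnaseq
dependencies:
  - samtools=1.19.2
  - python=3.11.*
  - bwa
  - numpy>=1.24
  - multiqc=1.21=pyhdfd78af_0
  - pip:
    - somepkg
"""

print(unpinned_dependencies(sample_env))
```

A check of this kind could run in CI so that a loosely specified dependency is caught before the environment is rebuilt.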

Cloud & HPC

Elastic compute provisioned on demand; no idle capacity charges between projects.

  • AWS: S3 for data ingestion and archival; EC2/Batch for on-demand compute; FSx for Lustre as high-throughput scratch storage; ECR for container images.
  • GCP: Google Cloud Storage and Vertex AI Workbench for notebook-based ML development; BigQuery for large variant databases.
  • Slurm: On-premise HPC cluster scheduling at partner institutions; Nextflow and Snakemake submit jobs directly via the Slurm executor.

Provenance & Versioning

Every analysis is version-controlled, logged, and independently re-runnable from a single command.

  • Git / GitHub: Source control for all pipeline and analysis code; branches and tags mark each project release.
  • DVC: Data version control for large files and model artefacts; tracked alongside code in Git without storing binaries in the repository.
  • MLflow: Experiment tracking, hyperparameter logging, and model registry for AI/ML development (Tier 4 workflows).
  • Nextflow Tower (now Seqera Platform): Pipeline monitoring, run history, and cost reporting for cloud-submitted Nextflow workflows.
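
The idea behind provenance tracking can be illustrated with a small manifest that ties input checksums to the code version that processed them. This is a minimal sketch under stated assumptions, not a description of how DVC or MLflow store their records: the function names and the manifest layout are hypothetical, and it relies only on the Python standard library plus a `git` executable on the PATH.

```python
import hashlib
import json
import subprocess
from pathlib import Path


def sha256sum(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large inputs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def current_commit(repo_dir="."):
    """Return the checked-out Git commit hash, or None if it cannot be read."""
    try:
        result = subprocess.run(
            ["git", "-C", str(repo_dir), "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def build_manifest(input_paths, repo_dir="."):
    """Record which code version saw which exact input bytes."""
    return {
        "code_commit": current_commit(repo_dir),
        "inputs": {str(p): sha256sum(p) for p in sorted(map(Path, input_paths))},
    }


if __name__ == "__main__":
    # Example: checksum this script itself and print the manifest as JSON.
    print(json.dumps(build_manifest([__file__]), indent=2))
```

Storing such a manifest next to the results lets a later re-run confirm that it started from the same inputs and the same commit.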

Languages & Frameworks

Language choices follow scientific community standards, not engineering preference.

  • Python: Primary language for bioinformatics scripting, ML/AI (PyTorch, scikit-learn), and single-cell analysis (scanpy, AnnData, squidpy).
  • R / Bioconductor: Statistical analysis and genomics (DESeq2, edgeR, limma/voom, Seurat, GenomicRanges, VariantAnnotation).
  • Nextflow DSL2: Pipeline definition with explicit channel logic, scatter/gather patterns, and process-level container declarations.
  • Bash: Shell scripting for data staging, format conversion, and system-level integration within pipelines.

Reproducibility

FAIR data principles

All analyses are conducted in accordance with the FAIR data principles (Wilkinson et al., 2016, Scientific Data, doi:10.1038/sdata.2016.18). The practices below describe how each principle is implemented in project delivery.

F

Findable

Data and metadata are assigned globally unique, persistent identifiers and are described with sufficient metadata to be discoverable by humans and machines.

  • DOIs assigned to published datasets via Zenodo or Figshare
  • Structured metadata following Dublin Core and DataCite schemas
  • Workflow code registered with WorkflowHub (nf-core registry where applicable)
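
As an illustration of structured metadata, the sketch below builds a minimal dataset record and checks it for the properties the DataCite schema treats as mandatory (identifier, creator, title, publisher, publication year, resource type). The key names are loosely modelled on DataCite's JSON representation and should be treated as assumptions; the DOI, names, and title are placeholders, not real records.

```python
import json

# Assumption: key names loosely follow DataCite's JSON representation.
MANDATORY_FIELDS = (
    "doi",
    "creators",
    "titles",
    "publisher",
    "publicationYear",
    "types",
)


def missing_mandatory(record):
    """List mandatory fields that are absent or empty in a metadata record."""
    return [field for field in MANDATORY_FIELDS if not record.get(field)]


# Placeholder values for illustration only.
example_record = {
    "doi": "10.5281/zenodo.0000000",
    "creators": [{"name": "Example, Author"}],
    "titles": [{"title": "Example RNA-seq count matrix"}],
    "publisher": "Zenodo",
    "publicationYear": 2025,
    "types": {"resourceTypeGeneral": "Dataset"},
}

if __name__ == "__main__":
    print(json.dumps(example_record, indent=2))
    print("missing:", missing_mandatory(example_record))
```

Running a check like this before deposit helps ensure a record is complete enough to be indexed and discovered.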
A

Accessible

Data and metadata are retrievable by their identifier using open, standardised communication protocols; metadata remains accessible even when the underlying data is not.

  • HTTPS and S3/GCS signed URLs for time-limited, authenticated access
  • Metadata deposited in public repositories independently of raw data
  • Access procedures documented in project README at project close
I

Interoperable

Data uses standard community formats and ontologies so it can be integrated with other datasets and consumed by existing analysis tools without transformation.

  • Standard formats: FASTQ, BAM/CRAM, VCF/BCF, BED, GFF3, HDF5/AnnData, MEX
  • Ontologies: Sequence Ontology (SO), Gene Ontology (GO), HPO, OMIM
  • Reference genomes from Ensembl or UCSC with explicit build and annotation versions
R

Reusable

Data and metadata are richly described with accurate provenance and clear data usage licences so they can be replicated and combined for future studies.

  • CC-BY or CC0 licences on published data; MIT on pipeline code
  • Complete methods sections written to the level of detail expected by Methods in Molecular Biology
  • Containerised environments pinned to exact tool versions for long-term re-runability

Data Handling

Security & compliance

The policies below apply to all client data received under project agreements. Project-specific requirements (jurisdiction, retention, or additional controls) can be documented in a Data Processing Agreement before the project starts.

01

Encryption at rest and in transit

All client data stored on cloud infrastructure is encrypted at rest using AES-256. Data transmitted between systems and clients uses TLS 1.3. Server access requires SSH keys (Ed25519 or 4096-bit RSA).
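
On the client side, a TLS 1.3 floor can be enforced in code rather than left to defaults. The following is a minimal sketch using Python's standard `ssl` module; the function name is hypothetical, and it assumes a Python build linked against an OpenSSL version with TLS 1.3 support.

```python
import ssl


def tls13_client_context():
    """Build a client-side TLS context that refuses anything below TLS 1.3."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_3
    return context


if __name__ == "__main__":
    ctx = tls13_client_context()
    print(ctx.minimum_version.name)
```

A context built this way can be passed to standard-library clients such as `urllib.request.urlopen(url, context=ctx)`, so a connection to a server that only offers older protocol versions fails instead of silently downgrading.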

02

No long-term raw data retention

Raw client data is stored only for the duration of the project and for a defined post-delivery review period (default: 30 days). Retention beyond this requires explicit written agreement and is scoped to specific file types.
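
The retention rule reduces to simple date arithmetic. The sketch below assumes the review period is counted in calendar days from the delivery date; the function names are hypothetical, and a project-specific agreement would override the 30-day default.

```python
from datetime import date, timedelta

DEFAULT_REVIEW_DAYS = 30  # default post-delivery review period


def raw_data_deletion_date(delivery_date, review_days=DEFAULT_REVIEW_DAYS):
    """Last day on which raw data may still be held after delivery."""
    return delivery_date + timedelta(days=review_days)


def is_past_retention(delivery_date, today, review_days=DEFAULT_REVIEW_DAYS):
    """True once the review period has fully elapsed."""
    return today > raw_data_deletion_date(delivery_date, review_days)


if __name__ == "__main__":
    delivered = date(2025, 3, 1)
    print(raw_data_deletion_date(delivered))              # 2025-03-31
    print(is_past_retention(delivered, date(2025, 4, 1)))  # True
```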

03

Compute jurisdiction agreements

Data is processed in the cloud region agreed with the client prior to project start. Default: AWS ap-southeast-1 (Singapore) for South and Southeast Asian clients; eu-west-1 (Ireland) is available for GDPR-scoped projects.

04

Access controls and audit logging

Cloud console access requires MFA. IAM roles are scoped to minimum required permissions. All S3 and object storage access is logged via AWS CloudTrail or GCP Cloud Audit Logs and retained for 90 days.
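
Minimum-permission scoping can be made concrete with an IAM policy document that grants read-only access to a single project prefix. This is a sketch under assumptions: the bucket name, prefix, and function name are hypothetical, and a real policy would be reviewed against the project's actual access needs.

```python
import json


def read_only_prefix_policy(bucket, prefix):
    """Build an IAM policy allowing list and read access to one S3 prefix only."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListProjectPrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                # Listing is limited to keys under the project prefix.
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadProjectObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket and prefix names.
    print(json.dumps(read_only_prefix_policy("example-bucket", "project-a"), indent=2))
```

Because the policy grants no write or delete actions and no access outside the prefix, a role carrying it cannot touch another project's data.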

05

Client data is not used for model training

Data provided by clients under project agreements is not used to train or fine-tune AI/ML models. Any research use of anonymised or aggregated data requires a separate written data-sharing agreement.

06

Incident notification

In the event of a suspected data security incident, the client will be notified within 72 hours of detection. Incidents are assessed and documented in line with ISO/IEC 27001 principles.

These policies are reviewed annually. Last reviewed: 2025. To request a copy of our Data Processing Agreement template or to discuss project-specific requirements, contact us via the contact form.

Questions?

Infrastructure requirements vary by project

If your institution has specific HPC configurations, data sovereignty requirements, or security controls, describe them when you get in touch.
