Authors: Andrew Perry, Steve Androulakis — Monash Bioinformatics Platform

Summary

We describe the establishment and production deployment of the MyTardis-Seq system for the Monash Health Translational Precinct Genomics sequencing facility (http://www.mhtpmedicalgenomics.org.au/), addressing the need for a community-accepted software solution that captures, stores and serves output from gene sequencing experiments.
MyTardis-Seq benefits facility managers and gene sequencer users by providing an automated and structured method to capture, store and share the results of sequencing runs with associated quality reports and metadata.
We also provide detail on how MyTardis-Seq can be applied to other gene sequencing facilities.

First 6 Months: Sequence Data Ingested by MyTardis-Seq (1.6TB)

Components

MyTardis-Seq is the name for the combination of the MyTardis data management system, the MyTardis next-gen sequencing extensions, and the Sequencing Data Ingestion Client. The sequencing extensions and Data Ingestion Client were both created for the purposes of this deployment.

MyTardis with Next-Gen Sequencing Extension

The production workflow from sequencer to MyTardis-Seq comprises:

1. The sequencer and associated instrument control machine (Windows controlled)
2. A network attached file server used to store and process (demultiplex) the immediate output from the instrument. This is where the Sequencing Data Ingestion Client runs.
3. The MyTardis-Seq server (MyTardis web application and database, connected to storage)

The instrument control machine (1) is the standard Windows machine provided by the instrument vendor. The file server (2) and MyTardis-Seq server (3) run on Linux-based operating systems and can be on separate machines or combined on the same machine (physical or virtual).

Example architecture of a production MyTardis-Seq installation. This setup closely reflects the production system currently in use at MHTP Genomics.

In our production setup the File Server is used as a processing and staging area before the data is registered with the MyTardis-Seq server. The MyTardis web application and the File Server could be run on the same instance. However, segregating them ensures that run demultiplexing and QC report generation do not affect the performance of the MyTardis web application, and allows different security policies to be applied to the public-facing (read-only) and instrument (read-write) components.

Next-Gen Sequencing Extensions for MyTardis

MyTardis is a web-based system that enables private data access in the ‘cloud’. Data can be shared with selected collaborators or published, and MyTardis can also be used as a data processing platform.

A sequencing run is the FASTQ format nucleotide sequence produced by the gene sequencer and associated metadata about the run (logs, configuration files).

MyTardis stores any kind of file by default, but doesn't have discipline-specific file handling or custom presentation.

An extension to MyTardis (github.com/mytardis/mytardis-seqfac) was written that allows it to recognise and organise sequencing data and gene sequencing runs.

Example Sequencing Run File Structure (Illumina HiSeq 3000):

- 20160501_HSQ0123_0010_AZXY987
-- SampleSheet.csv
-- RTAComplete.txt
-- RunInfo.xml
-- Logs/
-- Config/
-- Data/RTALogs/
-- 20160501_HSQ0123_0010_AZXY987.bcl2fastq
---- Project_Steve/
----- Sample_ControlX/
------ ControlX_INST123_L1_R1.tar.gz
------ ControlX_INST123_L1_R2.tar.gz
----- Sample_TreatedY/
------ TreatedY_INST123_L1_R1.tar.gz
------ TreatedY_INST123_L1_R2.tar.gz
---- Project_Roxanne/
----- Sample_Normal1/
------ Normal1_INST123_L2_R1.tar.gz
------ Normal1_INST123_L2_R2.tar.gz
----- Sample_Tumor2/
------ Tumor2_INST123_L2_R1.tar.gz
------ Tumor2_INST123_L2_R2.tar.gz

High-throughput sequencing runs typically contain data destined for many end users which must be ‘demultiplexed’ before delivery. MyTardis organises data into an Experiment for the overall sequencing run (‘Run Experiment’) as well as breaking it down into individual Experiments for each demultiplexed Project (‘Project Experiment’).

Project Experiments are intended to be accessed by end users and contain Datasets with the FASTQ files and their associated FastQC reports. Run Experiments, containing one or many Project Experiments, are only accessible to Facility Managers and contain the combined data for all users sharing a run. Project Experiments are shared with individual clients by the Facility Manager using the MyTardis permissions model.
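
The mapping from a run directory to Project Experiments can be illustrated with a minimal Python sketch. It assumes the Project_* / Sample_* directory naming shown in the example file structure above. The function name group_by_project is hypothetical and is not taken from the MyTardis-Seq code base.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_project(relative_paths):
    """Group file paths by their Project_* and Sample_* directory components.

    Returns {project_name: {sample_name: [file names]}}.
    """
    projects = defaultdict(lambda: defaultdict(list))
    for path in relative_paths:
        parts = PurePosixPath(path).parts
        project = next((p for p in parts if p.startswith("Project_")), None)
        sample = next((p for p in parts if p.startswith("Sample_")), None)
        if project is None or sample is None:
            continue  # run-level files (logs, config) belong to the run only
        project_name = project[len("Project_"):]
        sample_name = sample[len("Sample_"):]
        projects[project_name][sample_name].append(parts[-1])
    return {name: dict(samples) for name, samples in projects.items()}


run = "20160501_HSQ0123_0010_AZXY987"
demux = f"{run}/{run}.bcl2fastq"
example_paths = [
    f"{run}/SampleSheet.csv",
    f"{demux}/Project_Steve/Sample_ControlX/ControlX_INST123_L1_R1.tar.gz",
    f"{demux}/Project_Steve/Sample_ControlX/ControlX_INST123_L1_R2.tar.gz",
    f"{demux}/Project_Roxanne/Sample_Tumor2/Tumor2_INST123_L2_R1.tar.gz",
]
grouped = group_by_project(example_paths)
for project in sorted(grouped):
    print(project, sorted(grouped[project]))
```

Each top-level key corresponds to one Project Experiment, while the full list of paths corresponds to the Run Experiment.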

Illumina sequencers, after demultiplexing the data, ultimately produce FASTQ files.

MyTardis-Seq handles them in the following ways:

Metadata extraction
FastQC report generation
Data transfer to the final storage location

More information on how this is processed and presented is available below.

Sequencing Data Ingestion Client

The Sequencing Data Ingestion Client (github.com/mytardis/mytardis_ngs_ingestor) has been tested with Illumina HiSeq instruments (1500 & 3000) and should be compatible with NextSeq and MiSeq instrument data with little or no modification. It is configurable with site-specific settings via a YAML configuration file and/or command line flags. There is no code that is specific to the MHTP Medical Genomics Facility, so it will work on instruments in any location.

Automated capture of data from an instrument in this way is known in the MyTardis project as 'Instrument Integration'.

The Sequencing Data Ingestion Client extracts basic metadata about the run (instrument model and identifier, flowcell ID, number of index and read cycles), identifies the versions of Illumina RTA and bcl2fastq software used for demultiplexing, runs FastQC for report generation if required, extracts some basic statistics about the FASTQ sequence files (number of reads, read length), and uploads the FASTQ sequence data, FastQC reports and SampleSheet.csv file (see below) to MyTardis.
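
As an illustration of the basic FASTQ statistics mentioned above (number of reads, read length), the following is a minimal Python sketch. It assumes the common four-lines-per-record FASTQ layout with unwrapped sequence lines. The function names are hypothetical, and this is not the ingestion client's own implementation.

```python
import gzip
import io


def fastq_stats(handle):
    """Return (number_of_reads, sorted list of distinct read lengths).

    Assumes four lines per record: header, sequence, '+', qualities.
    """
    reads = 0
    lengths = set()
    for line_number, line in enumerate(handle):
        if line_number % 4 == 1:  # the sequence line of each record
            reads += 1
            lengths.add(len(line.rstrip("\r\n")))
    return reads, sorted(lengths)


def fastq_stats_from_path(path):
    """Compute the same statistics for a plain or gzip-compressed file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return fastq_stats(handle)


example = io.StringIO(
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nTTGGCCAA\n+\nIIIIIIII\n"
)
print(fastq_stats(example))
```

Reading line by line keeps memory use flat regardless of file size, which fits the observation elsewhere in this document that read counting is typically I/O bound.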

Dependence on Sample Information

Sample information is currently captured from SampleSheet.csv, a standard file used by Illumina sequencers as part of their run configuration.

This is the file used by instrument operators to define sample names / IDs, index sequences used for demultiplexing and the project (end-user) the sample is associated with.

The raw SampleSheet.csv, along with versions containing only the samples for each individual project, is preserved for download since it is not uncommon for this file to be used directly in bioinformatics workflows.

The information in the SampleSheet.csv can be incorrect due to human error (eg invalid Project names, incorrect or incompatible index sequences). In the future we would like to implement a web-based solution for writing SampleSheet.csv prior to beginning a run which can pre-check and validate sample sheets to reduce human error, as well as providing the opportunity to capture more/richer sample information than is possible in the SampleSheet.csv format.
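
The kind of pre-check described above could look something like the following Python sketch. It assumes a sample sheet with a [Data] section and the column names Lane, index and Sample_Project. Column names vary between Illumina software versions, and the project naming rule used here is an illustrative assumption only. This is not an existing MyTardis-Seq feature.

```python
import csv
import re

VALID_NAME = re.compile(r"^[A-Za-z0-9_-]+$")  # assumed project naming rule
VALID_INDEX = re.compile(r"^[ACGT]+$")


def read_data_section(text):
    """Return the rows of the [Data] section as a list of dicts."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().strip(",").lower() == "[data]":
            return list(csv.DictReader(lines[i + 1:]))
    return []


def validate_samples(rows):
    """Return a list of human-readable problems found in the sample rows."""
    errors = []
    seen = set()
    for n, row in enumerate(rows, start=1):
        project = (row.get("Sample_Project") or "").strip()
        index = (row.get("index") or "").strip().upper()
        lane = (row.get("Lane") or "").strip()
        if not VALID_NAME.match(project):
            errors.append(f"row {n}: invalid project name {project!r}")
        if not VALID_INDEX.match(index):
            errors.append(f"row {n}: invalid index sequence {index!r}")
        elif (lane, index) in seen:
            errors.append(f"row {n}: index {index} already used in lane {lane or '?'}")
        seen.add((lane, index))
    return errors


EXAMPLE_SHEET = """[Header]
Date,2016-05-01
[Data]
Lane,Sample_ID,index,Sample_Project
1,ControlX,ATCACG,Steve
1,TreatedY,ATCACG,Steve
2,Normal1,TTAGGX,Roxanne
2,Tumor2,ACAGTG,Rox anne
"""

for problem in validate_samples(read_data_section(EXAMPLE_SHEET)):
    print(problem)
```

In this example the second row reuses an index within lane 1, the third row contains a non-nucleotide character, and the fourth row has a space in the project name, so three problems are reported.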

Infrastructure

Choices and compatibility

Cloud and OS

We make extensive use of the OpenStack-based NeCTAR cloud to host MyTardis and its related services. The NeCTAR cloud was chosen because it is available to our group at no cost. Each component of MyTardis-Seq is also compatible with other compute clouds such as Amazon AWS and Google Compute Engine.

The Sequencing Data Ingestion app can comfortably run on a small/medium sized server (2 cores, 16GB RAM). CPU usage is minimal and used for checksum generation during upload, optional FastQC report generation and counting reads (which is typically I/O bound).
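
Checksum generation for large sequence files is usually done in fixed-size chunks so that memory use stays small. The Python sketch below shows one way to do this. The default algorithm (MD5), the chunk size and the function name are assumptions for illustration, not settings taken from the ingestion client.

```python
import hashlib


def file_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A call such as file_checksum("ControlX_INST123_L1_R1.tar.gz", algorithm="sha256") would then return the SHA-256 digest as a hexadecimal string.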

Our components currently run on Ubuntu 14.04 LTS instances.

Storage

The Illumina instrument control software transfers data in real time from the Windows host to a mounted NeCTAR Volume using an SMB network filesystem, where it is subsequently processed on a VM instance and ingested by MyTardis. The Sequencing Data Ingestion Client must work with files (as opposed to objects).

Sequencing data is initially stored on NeCTAR cloud volumes, processed and then transferred for long-term storage on VicNode infrastructure.

MyTardis supports multiple storage backends, including NeCTAR Object Store and Amazon S3-compatible services as well as traditional file-based storage systems.

Sequencing Instrumentation

The Sequencing Data Ingestion Client has been designed to be compatible with Illumina Sequencers, but is configurable to suit output from other sequencers. No additional sample or user information is required to use the MyTardis-Seq system, beyond what is normally entered to initiate an Illumina sequencing run. MyTardis-Seq focuses on managing the data and extractable metadata for sharing and long term archival, but does not attempt to replicate functionality provided by a LIMS.

Illumina MiSeq instruments handle data differently to their HiSeq counterparts. Namely, they can do a large share of the data processing (eg demultiplexing) on the instrument itself. The demultiplexed FASTQ data can still be stored on a remote network mount accessible to the Sequencing Data Ingestion Client. Different versions of the Illumina demultiplexing software (bcl2fastq) generate slightly different report outputs and directory structures. The ingestion client has built-in support for these differences.

Networking

The speed at which data is ingested from the sequencer file server into MyTardis is constrained by network bandwidth. This is not usually an issue within organisations such as universities that have large bandwidth capability between instrument facilities and university servers.

We find that data transfer from the facility to MyTardis in the NeCTAR cloud typically runs at between 40 and 80 mb/sec. High speed connectivity via AARNet means this is suitable even when MyTardis is located at a remote institution in Australia.

MyTardis requires HTTP or HTTPS and SSH ports to be open for web access and SFTP downloads respectively.

The Sequencing Data Ingestion Client doesn’t require any open incoming ports on the firewall. It uses standard web ports by default to communicate with the MyTardis server.

High availability of this service can be achieved by load-balancing web and database services for MyTardis in the cloud.

Identity Management

MyTardis-Seq supports authentication via LDAP and local MyTardis user accounts, as well as any other authentication system supported by MyTardis. Some sites are using AAF authentication with MyTardis instances, although this is not yet in use on our production instance.

The MyTardis permissions model allows Experiments to be shared with users with View-only, Edit and Owner permissions. Users can be added to Groups to allow convenient sharing with a set of Users. In production we use a dedicated Group to provide facility managers access to all data ingested from their facility.

Facility managers manually link data to the user that owns it; it is not automatically assigned. This process could easily be automated with minimal development effort by modifying the workflow to include email addresses in the free-text description of Illumina’s SampleSheet.csv manifest file. Initial feedback in our instance indicated this feature wasn’t desired, but other facilities may prefer automatic release of data to end-users if quality metrics are met.
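
As a rough illustration of the automation idea described above, the following Python sketch collects email addresses from a free-text description column and groups them by project. The column names Sample_Project and Description, the simplified email pattern and the function name are all assumptions. Nothing like this is part of the current production workflow.

```python
import re

# Deliberately simple pattern; real-world address validation is more involved.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def emails_by_project(rows):
    """Map each project name to the set of email addresses in its descriptions."""
    owners = {}
    for row in rows:
        project = (row.get("Sample_Project") or "").strip()
        found = EMAIL.findall(row.get("Description") or "")
        owners.setdefault(project, set()).update(address.lower() for address in found)
    return owners


example_rows = [
    {"Sample_Project": "Steve", "Description": "contact Steve.A@example.org"},
    {"Sample_Project": "Steve", "Description": "replicate 2"},
    {"Sample_Project": "Roxanne", "Description": "roxanne@example.org; lab@example.org"},
]
print(emails_by_project(example_rows))
```

A facility could then share each Project Experiment with the collected addresses, optionally only after a quality check has passed.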

Interactions

Data Generation and Ingest Interactions

See below for diagram.

This describes how the data ingest process takes place, from the generation of data to the arrival of it in MyTardis:

Facility manager initiates the run which creates a run directory on the network attached storage (file server). This run folder initially contains SampleSheet.csv and configuration files and logs generated by the instrument.
During the run, the Illumina RTA software writes the multiplexed basecall data (‘bcl files’) to the file server, via SMB.
At the completion of the run, the instrument writes a flag file RTAComplete.txt. This file is detected by a frequent cron job on the file server and demultiplexing via the Illumina bcl2fastq program is initiated.
The Sequencing Data Ingestion client is automatically run at the end of successful demultiplexing. The ingestion client runs a series of pre-checks to ensure that the MyTardis-Seq server is online and up to date, validates that the run looks complete, extracts basic run metadata, runs FastQC for report generation, then uploads files with associated metadata to MyTardis.
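
The flag-file detection step above can be sketched in Python as follows. The sketch assumes each run has its own directory directly under a storage root. The marker file name demultiplex_started.txt and both function names are hypothetical, used here only to stop a run being picked up twice by successive cron invocations. The actual cron job in production may work differently.

```python
from pathlib import Path

COMPLETE_FLAG = "RTAComplete.txt"         # written by the instrument at run completion
STARTED_FLAG = "demultiplex_started.txt"  # hypothetical marker, not an Illumina file


def runs_ready_for_demultiplexing(storage_root):
    """Return run directories that are complete but not yet claimed."""
    ready = []
    for run_dir in sorted(Path(storage_root).iterdir()):
        if not run_dir.is_dir():
            continue
        if (run_dir / COMPLETE_FLAG).exists() and not (run_dir / STARTED_FLAG).exists():
            ready.append(run_dir)
    return ready


def claim_run(run_dir):
    """Mark a run as claimed so the next invocation skips it."""
    (Path(run_dir) / STARTED_FLAG).touch()
```

A periodic job would call runs_ready_for_demultiplexing on the storage root, call claim_run on each result, and then start demultiplexing followed by ingestion for that run.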

End-User Interactions

Logs into MyTardis-Seq
Can see their experiments listed
Can browse basic metadata (number of reads, index sequences)
Can see summary reports indicating data quality (eg FastQC)
Can download data over SFTP using their username/password, or by generating an expiring obfuscated URL. Data can also be downloaded directly in the browser; however, most users of sequencing data will download it directly to the server where they do downstream analysis rather than to their local desktop machine.

Facility Management Interactions

The facility manager initiates the run, directing the instrument to output data to the Windows share provided by the file server.
At the completion of the run and automatic demultiplexing, report generation and ingestion, the run will appear in the MyTardis web interface (typically this will take ~30 min for a 2 lane ‘rapid’ run and ~2 hours for a larger 8 lane run).
The Facility Overview allows facility managers to see an overview of ingested data, which can be filtered by user, experiment (Run or Project) ID and instrument.
The Facility Manager views quality reports (FastQC) in MyTardis (to complement metrics provided by the on-instrument vendor software) and determines if the data quality is acceptable.
Once the facility manager is satisfied that the data can be released to the end user, they use the MyTardis Sharing tab to enter the user’s email address to give them access to the data.

Systems Administrator Interactions

MyTardis installation and configuration is well documented, and the project routinely provides support for system administrators and developers setting up test and production instances.

System administrators are encouraged to use configuration management tools (eg Ansible, Salt, Puppet) to maintain production services.

The MyTardis database (typically PostgreSQL) should be backed up periodically, and ingested files should be stored in a backend which uses redundancy to preserve data integrity (eg RAID, NeCTAR / OpenStack Swift Object Store, Amazon S3) and is periodically backed up, depending on site policies.

Demo Account

Public data is available on the production deployment:

https://mhtp-seq.erc.monash.edu.au/public_data/

Please contact steve.androulakis@monash.edu for a demo account on our server, or your own test server.

MyTardis Integrates with Gene Sequencers