The Center for Translational Data Science at the University of Chicago is developing the discipline of data science and its applications to problems in biology, medicine, healthcare, and the environment. The center develops and operates large-scale data platforms to support research on topics of societal interest, including cancer, cardiovascular disease, inflammatory bowel disease (IBD), birth defects, veterans’ health, pain management, opioid use disorder, and environmental science. It also develops new machine learning and AI algorithms over the data in its platforms.
The center has achieved a number of important “firsts,” including one of the first large-scale data clouds (the NSF-supported Open Science Data Cloud, 2010-2016); the first data cloud designed to host biomedical data and approved as an NIH Trusted Partner (the Bionimbus Protected Data Cloud, 2013-present); the first large-scale data commons (the NCI Genomic Data Commons, 2016-present); and the first set of services to create data ecosystems for biomedical data (Data Commons Frameworks Services, 2020-present).
The University of Chicago’s testing framework is based on CodeceptJS and helps improve the quality of its Gen3 Data Commons ecosystem. The platform empowers the scientific community, which uses big data and bioinformatics in genome sequencing research, to discover new treatments and cure diseases. The university uses CodeceptJS hooks to capture the successes, failures, and retries of each test in the pipeline and feeds them into a Grafana dashboard. This lets the team oversee the entire pipeline, run benchmarks, spot blockers, and track flaky tests.
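The hook-based collection described above can be sketched roughly as follows. This is a minimal illustration, not the university's actual code: `createRecorder` and `attach` are hypothetical helper names, the sketch assumes each test object exposes Mocha's `currentRetry()` method, and it treats a test that passes only on a retry as flaky.

```javascript
// Collects one record per reported test outcome and summarises them.
function createRecorder() {
  const records = [];
  return {
    record(test, status) {
      // Mocha's Runnable exposes currentRetry(); fall back to 0 if it is absent.
      const retry = typeof test.currentRetry === 'function' ? test.currentRetry() : 0;
      records.push({ title: test.title, status, retry });
    },
    summary() {
      const passed = records.filter((r) => r.status === 'passed').length;
      const failed = records.filter((r) => r.status === 'failed').length;
      // Assumption: a test that passed on a retry (retry > 0) counts as flaky.
      const flaky = [...new Set(
        records.filter((r) => r.status === 'passed' && r.retry > 0).map((r) => r.title),
      )];
      return { passed, failed, flaky };
    },
  };
}

// Subscribes the recorder to an event dispatcher. `names` carries the event
// names for passed and failed tests, so nothing is hard-coded here.
function attach(dispatcher, names, recorder) {
  dispatcher.on(names.passed, (test) => recorder.record(test, 'passed'));
  dispatcher.on(names.failed, (test) => recorder.record(test, 'failed'));
}
```

In a CodeceptJS project, the wiring would presumably be `const { event } = require('codeceptjs');` followed by `attach(event.dispatcher, event.test, createRecorder())`. Passing the dispatcher in as an argument keeps the sketch runnable with any Node.js `EventEmitter`.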
One big motivation for the University of Chicago to adopt InfluxDB was how easy it was to set up. With an image from Docker Hub, a Kubernetes YAML descriptor, and a simple HTTP request to create the database, the team used the Continuous Integration framework's hooks to write time series data points through the NodeJS “influx” client library. They then pointed their Grafana dashboard at the InfluxDB data source and quickly achieved great observability for their Continuous Integration pipeline. Now they can gather useful metrics about their tests, such as intermittent failures (flaky tests), test duration, and the time spent re-provisioning the infrastructure of testing environments or generating fictitious clinical metadata to make sure all the biological research mechanisms work as expected.
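As a rough illustration of that setup, the sketch below creates a database and writes one data point. Instead of the “influx” client library the team used, it talks to the InfluxDB 1.x HTTP API (`/query` and `/write`) directly so that it has no dependencies. The database name, measurement name, tag names, and field names are hypothetical, and numbers are always written as floats to avoid field type conflicts.

```javascript
// Escape characters that are special in line protocol.
const escapeTag = (s) => String(s).replace(/[,= ]/g, '\\$&'); // tag/field keys, tag values
const escapeMeasurement = (s) => String(s).replace(/[, ]/g, '\\$&');

// Numbers are written as floats (no "i" suffix); strings are double-quoted.
function formatField(value) {
  if (typeof value === 'number') return String(value);
  if (typeof value === 'boolean') return value ? 'true' : 'false';
  return `"${String(value).replace(/[\\"]/g, '\\$&')}"`;
}

// Builds one line of InfluxDB line protocol; tags are sorted by key.
function toLineProtocol({ measurement, tags = {}, fields, timestampMs }) {
  const tagPart = Object.keys(tags).sort()
    .map((k) => `,${escapeTag(k)}=${escapeTag(tags[k])}`).join('');
  const fieldPart = Object.entries(fields)
    .map(([k, v]) => `${escapeTag(k)}=${formatField(v)}`).join(',');
  const ts = timestampMs === undefined ? '' : ` ${timestampMs}`;
  return `${escapeMeasurement(measurement)}${tagPart} ${fieldPart}${ts}`;
}

// Creates the database through the 1.x /query endpoint (requires Node 18+ for fetch).
async function createDatabase(baseUrl, db) {
  const body = new URLSearchParams({ q: `CREATE DATABASE "${db}"` });
  const res = await fetch(`${baseUrl}/query`, { method: 'POST', body });
  if (!res.ok) throw new Error(`CREATE DATABASE failed: ${res.status}`);
}

// Writes one line through the 1.x /write endpoint with millisecond timestamps.
async function writeLine(baseUrl, db, line) {
  const url = `${baseUrl}/write?db=${encodeURIComponent(db)}&precision=ms`;
  const res = await fetch(url, { method: 'POST', body: line });
  if (!res.ok) throw new Error(`InfluxDB write failed: ${res.status}`);
}
```

A CI hook could then call something like `writeLine('http://localhost:8086', 'ci_metrics', toLineProtocol({ measurement: 'test_run', tags: { status: 'passed' }, fields: { duration_ms: 1532 } }))`, where the URL and names are placeholders for whatever the deployment actually uses.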
InfluxDB is easy to set up, and the team quickly absorbed the concepts of measurements and their respective tags. This prompted productive discussions about which other metrics to capture and which insights they need in order to continuously improve their CI/CD strategy.
Software Development Engineer Marcelo Costa advises new users to bear in mind the default limit on the number of values per tag (100,000 in InfluxDB 1.x). They might want to remove this limit in the Kubernetes YAML descriptor for their InfluxDB deployment:
env:
  - name: "INFLUXDB_DATA_MAX_VALUES_PER_TAG"
    value: "0"
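Before lifting the limit, it can help to see which tags are actually growing. The following is a small sketch, under the assumption that points are available as plain objects with a `tags` property. `tagCardinality` and `tagsOverLimit` are hypothetical helpers, not part of InfluxDB or its client library.

```javascript
// Counts the distinct values seen for each tag key across a batch of points.
function tagCardinality(points) {
  const seen = new Map();
  for (const { tags = {} } of points) {
    for (const [key, value] of Object.entries(tags)) {
      if (!seen.has(key)) seen.set(key, new Set());
      seen.get(key).add(String(value));
    }
  }
  return Object.fromEntries([...seen].map(([key, values]) => [key, values.size]));
}

// Returns the tag keys whose distinct-value count exceeds `limit`.
function tagsOverLimit(points, limit) {
  return Object.entries(tagCardinality(points))
    .filter(([, count]) => count > limit)
    .map(([key]) => key);
}
```

A tag such as a full test title tends to grow with every new test, whereas a tag such as a pass/fail status stays small, so a check like this hints at which values might be better stored as fields.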
Another tip is to set up a persistent volume to keep the metrics database files safe, and to come up with a good backup policy, so that you can later run retroactive calculations and gain more insights from whatever metrics you have.
Encourage the scientific community to share research data
Used to discover new treatments and help the medical industry
Simplifying genome sequencing
By empowering researchers with big data and bioinformatics
Improved CI/CD pipeline
Continuously tweaking metrics collection to determine best practices
Technologies Used
- Docker
- Grafana
- InfluxDB
- Kubernetes