Supporting a Robust Data Management Platform for Genomic Research
Objective:
To provide project management, data engineering, and scientific analysis support for the development of the Genomic Data Commons (GDC), a comprehensive data management platform for genomic research sponsored by the National Cancer Institute (NCI) and led by the University of Chicago.
Client Overview:
The Genomic Data Commons (GDC) is an initiative sponsored by the NCI’s Center for Cancer Genomics (CCG) aimed at establishing a centralized data repository and management platform for genomic data generated by CCG projects and individual researchers. This “big data” repository began with over 3 TB of genomic data and is designed to support a wide range of cancer research initiatives.
Challenge:
The GDC needed a scalable, secure, and efficient platform to manage vast amounts of genomic data from various cancer research projects. This required integrating distributed object storage with graph and relational database systems, along with a set of microservices to support data production, indexing, and searching.
Our Solution:
Radiant partnered with the University of Chicago to support the development and implementation of the Genomic Data Commons. Our contributions included project management, scientific and business analysis, data engineering, architectural design, and data quality assessment.
Key activities included:
- Project Management and Agile Implementation: Worked closely with GDC technical leads to refine project plans, ensure team alignment, and implement Agile and DevOps processes to streamline development and deployment.
- Business and Technical Analysis: Supported the gathering of business requirements, developed user stories, facilitated technical design, and provided feedback on APIs and portal access as part of User Acceptance Testing.
- Data Engineering and Architectural Support: Assisted with domain modeling and the construction of a data dictionary, and designed a graph database to manage metadata for unstructured data stored in the object store.
- Data Quality Assessment: Performed comprehensive data quality assessments to verify that data was imported, represented, and re-analyzed accurately, and that the data pipelines generating higher-level analysis data functioned correctly.
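To illustrate the idea of a graph of metadata describing files held in an object store, here is a minimal in-memory sketch. It is not the GDC schema or implementation: the node types, property names, the `DATA_DICTIONARY` contents, and the `derived_from` edge label are all hypothetical, chosen only to show how a data dictionary can gate what enters the graph and how a traversal resolves a case to object keys.

```python
from dataclasses import dataclass, field

# Hypothetical data dictionary: node type -> required property names.
DATA_DICTIONARY = {
    "case": {"submitter_id"},
    "sample": {"submitter_id", "sample_type"},
    "file": {"file_name", "object_key", "md5sum"},
}


@dataclass
class Node:
    node_id: str
    node_type: str
    props: dict


@dataclass
class MetadataGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (src_id, label, dst_id)

    def add_node(self, node):
        """Accept a node only if it satisfies the data dictionary."""
        required = DATA_DICTIONARY.get(node.node_type)
        if required is None:
            raise ValueError(f"unknown node type: {node.node_type}")
        missing = required - set(node.props)
        if missing:
            raise ValueError(f"{node.node_id} is missing {sorted(missing)}")
        self.nodes[node.node_id] = node

    def add_edge(self, src_id, label, dst_id):
        """Link two existing nodes, e.g. file -> derived_from -> sample."""
        if src_id not in self.nodes or dst_id not in self.nodes:
            raise KeyError("both endpoints must already be in the graph")
        self.edges.append((src_id, label, dst_id))

    def object_keys_for_case(self, case_id):
        """Walk edges backwards from a case and collect file object keys."""
        seen, frontier, keys = {case_id}, [case_id], []
        while frontier:
            current = frontier.pop()
            for src, _label, dst in self.edges:
                if dst == current and src not in seen:
                    seen.add(src)
                    frontier.append(src)
                    node = self.nodes[src]
                    if node.node_type == "file":
                        keys.append(node.props["object_key"])
        return sorted(keys)
```

In this sketch the graph stores only metadata and a pointer (`object_key`) to each file, while the file contents stay in the object store. That separation is the design choice the bullet above describes.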
Technologies and Infrastructure:
- Distributed Object Storage: Used Ceph and Cleversafe for scalable, secure data storage.
- Database Systems: Integrated PostgreSQL and Neo4j to manage relational and graph data structures.
- Microservices Architecture: Implemented microservices for efficient data production, indexing, and searching.
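As a rough illustration of the indexing and searching responsibilities listed above, the following sketch builds a small in-memory inverted index over metadata records and answers field filters against it. The case study does not describe the actual services or search backend, so the function names, field names, and sample records here are assumptions made purely for the example.

```python
from collections import defaultdict


def build_index(documents):
    """Map each (field, value) pair to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        for field_name, value in doc.items():
            index[(field_name, str(value).lower())].add(doc_id)
    return index


def search(index, **filters):
    """Return sorted ids of documents matching every field=value filter (AND)."""
    result = None
    for field_name, value in filters.items():
        matches = index.get((field_name, str(value).lower()), set())
        result = set(matches) if result is None else result & matches
    return sorted(result or [])


# Hypothetical metadata records keyed by file id.
records = {
    "f1": {"project": "P-01", "data_type": "BAM"},
    "f2": {"project": "P-01", "data_type": "VCF"},
    "f3": {"project": "P-02", "data_type": "BAM"},
}
index = build_index(records)
print(search(index, project="P-01", data_type="bam"))
```

Indexing once and querying the index many times keeps search requests from scanning every record, which is the usual reason to split indexing and searching into separate services.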
Outcomes:
- Robust Data Management Platform: Successfully developed a scalable, secure data repository that supports the storage, management, and analysis of vast amounts of genomic data, enhancing cancer research capabilities.
- Efficient Data Operations: Streamlined data submission, management, and quality assessment processes, ensuring reliable and accurate data for research purposes.
- Enhanced Research Capabilities: Provided researchers with a comprehensive, user-friendly platform for accessing, analyzing, and sharing genomic data, supporting groundbreaking cancer research initiatives.