Data Standards and Ontologies: Making Data FAIR

Odd number posters will be presented on Monday, 8th April and even numbered posters on Tuesday, 9th April.

Posters 123 - 161.

123 Building a Better "Trap" to Capture Information on Mouse Study Populations

Bello, Susan M
Jackson Laboratory

In order to integrate data for the same study population, researchers first need to define the population in question. We describe here emerging standards, developed in collaboration with the Alliance of Genome Resources, for describing study populations that take into account the full range of data types needed. Currently, the Mouse Genome Database (MGD, www.informatics.jax.org) and the Mouse Models of Human Cancer database (MMHCdb, tumor.informatics.jax.org) capture information about mouse study populations as "strains", but the populations included in this set cover much more than classical inbred and outbred strains. For example, the allele Trp53tm1Tyj is available from IMSR repositories on 9 different inbred strain backgrounds. This same allele in MGD is associated with 18 strains, 100 strain backgrounds, 22 different single marker genotypes, and 261 different genotypes in total. In MMHCdb this allele is associated with 126 single marker strains and 448 total strains. Many of the strains or strain backgrounds in MGD and MMHCdb contain only partial information about the actual mouse strains used to create the study populations. Reviewing data objects associated with study populations in MGD and MMHCdb, as well as in the other Alliance members, resulted in the development of an initial minimal set of objects required to define a population. The current minimal set of required objects is an identifier (CURIE), symbol or name, taxon ID, and population type. Consistent application of these minimal data standards will allow for improved integration of data attached to the same study cohort at MGD and MMHCdb. In addition, these standards will allow the integration of larger sets of MGD and MMHCdb data into the Alliance database in the future. MMHCdb is supported by NCI grant CA089713. MGD is supported by NHGRI grant HG000330.
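A minimal sketch (Python) of how the proposed required object set could be represented and checked; the class name, field names and example values are illustrative assumptions, not an MGD, MMHCdb or Alliance implementation.

    from dataclasses import dataclass

    # Minimal study-population record following the proposed required object set:
    # identifier (CURIE), symbol or name, taxon ID, and population type.
    # Field names and the example values below are illustrative assumptions.
    @dataclass
    class StudyPopulation:
        curie: str            # database-prefixed identifier
        name: str             # strain symbol or population name
        taxon_id: str         # NCBI Taxonomy identifier
        population_type: str  # e.g. "inbred strain", "outbred stock", "F1 hybrid"

        def validate(self) -> bool:
            """Check that all four required fields are present and non-empty."""
            return all([self.curie, self.name, self.taxon_id, self.population_type])

    # Example (values are hypothetical, not taken from MGD or MMHCdb):
    pop = StudyPopulation(
        curie="MGI:0000000",
        name="Example strain",
        taxon_id="NCBITaxon:10090",
        population_type="inbred strain",
    )
    assert pop.validate()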

 

124 The EMBL-EBI Genome Editing Catalogue

Corbett, Sybilla
EMBL EBI

Recent advances in genome engineering have significantly changed our approach to studying biological processes, elucidating mechanisms from single cells to complex organisms. The versatility of the CRISPR/Cas9 system has generated a broad range of techniques, from gene editing through non-homologous end joining and homology-directed repair, to transcriptional activation and repression, imaging experiments, and control of protein activity. Past successes, an increasing availability of commercial reagents, and the ability to carry out genome-wide screens have together resulted in a high volume of generated data. Curation is required to capture this wealth of information in a systematic fashion, enabling its visibility, discoverability and re-usability. We are therefore developing the EMBL-EBI Genome Editing Catalogue to collect, curate and convey the large amounts of data arising from these techniques. This resource will list known publications along with experimental parameters such as guide RNA sequences and raw results such as read counts. The Genome Editing Catalogue aims to include work carried out on a wide range of organisms. By including data from genome-wide pooled screens and single gene experiments we expect to provide an integrated resource that will aid research and facilitate new discoveries. To aggregate as many datasets as possible while maintaining a high level of data quality, manual curation and programmatic validation will be combined; we also hope to incorporate a user-submission protocol to increase the reach of the Catalogue. As part of the NIST Genome Editing Consortium we are developing a common set of required parameters, allowing for cross-comparison irrespective of data origin.

 

125 Research on Metadata Standards of Biomedical Data

Cui, Jiawei
Institute of Medical Information & Library, Chinese Academy of Medical Sciences

As data-intensive scientific research has become the norm, scientific data in the biomedical field have exploded, bringing enormous challenges for researchers who need to store, manage and share data. How to construct a biomedical metadata standard that meets the needs of users and the functional targets of a data repository has become an important issue. The first step of this study is to select typical metadata standards for biomedical data, such as the DataCite Metadata Schema, the Data Tag Suite and the Dryad metadata standard. We then establish a comparative analysis framework for them, covering basic information, content design and practical application. Finally, on the basis of the problems identified in existing standards, we put forward suggestions for standard construction. Through this analysis, we find that the formats of the standards are diverse and their content is continually updated. In content design, the affinity and similarity of these standards provide a basis for realizing interoperability of metadata standards and integrated retrieval of biomedical data. In practical application, the standards differ in scope, characteristics and shortcomings; their users include data producers, data managers and data consumers, whose roles cover all stages of the data life cycle. Our suggestions are as follows: when developing biomedical scientific data metadata standards, ensure that the standard is compatible with its target positioning; balance simplicity against complexity; refer to existing, widely used standards; ensure that the description scope covers both the content and the format of the data; form basic metadata elements according to the principle of universality and extended elements according to the principle of individualization; and use controlled vocabularies to enforce standardization.

 

126 Challenges of Capturing Large-Scale Longitudinal Mouse Phenotyping Data

Delbarre, Daniel
MRC Harwell Institute

The International Mouse Phenotyping Consortium (IMPC, www.mousephenotype.org) has already captured and analysed multidimensional phenotyping data for over 6,000 mouse knockout lines from 12 centres around the world. The IMPC's standardised phenotyping pipeline originally covered a broad spectrum of tests, beginning at week 9 and finishing at week 16, classed as the Early Adult Pipeline; however, it has now been expanded to include a Late Adult Pipeline, with tests repeated between 1 year and 72 weeks of age. This new Late Adult Pipeline has also been developed and defined in the IMPReSS (International Mouse Phenotyping Resource of Standardised Screens) database to allow for standardisation across centres (www.mousephenotype.org/impress/). The IMPC expanded to include ageing mice because, although many mutants show robust phenotypes by early adulthood, it is widely acknowledged that there is significant value in investigating later-onset phenotypes. Ageing mice for this project has prompted the need for a standardised method of capturing and curating welfare data from IMPC centres, to allow the identification of welfare issues that occur during the extended Late Adult Pipeline. Welfare data are captured using a new protocol that allows phenotyping centres to submit observations using a standardised language, with the optional submission of images to provide complementary visual documentation. This standardised language comes in the form of Mouse Welfare Terms, an established collection of over 200 terms used for recording mouse welfare issues that has been converted into an ontology during the development of the IMPC welfare protocol. The Mouse Welfare Terms ontology contains explicit links to other ontologies used in the field of mouse genetics, and the terms feed directly into the welfare protocol.

 

127 Making reproducible science

Gabdank, Idan
Stanford

Scientific progress relies on our ability to make observations and share them with others. Observations are traditionally reported in the scientific literature, allowing the scientific community to examine observed phenomena and assess the validity of the reported findings. The presence of a detailed record of the conditions that led to an observation makes this assessment possible, while its absence results in irreproducibility. Unfortunately, the conventional process by which we capture and report these conditions is outdated and has not been sufficiently adjusted to keep up with the fast pace and growing scale of recent technological advances. Here we present the model developed by the ENCODE Data Coordination Center (DCC) to more effectively capture the experimental artifacts of various high-throughput genomic assays. Briefly, we encourage data to be submitted as they are being analyzed, so that standard metadata and quality control checks are performed well before results are written into a manuscript. The ENCODE DCC computational pipeline development framework should be used for the development and integration of the software analysis tools used in data processing. Use of the framework promotes proper documentation and ensures reproducibility of computational results. Both the model and the framework could be adopted by other repositories storing high-throughput genomic assay data. Such adoption would lead to metadata standardization and verified reproducibility, resulting in what we believe to be better science.

 

128 New approaches to data management: supporting FAIR data sharing at Springer Nature

Grant, Rebecca
Springer Nature

Since 2016, academic publishers including Springer Nature, Elsevier and Taylor & Francis have been providing standard research data policies to journal authors, reflecting key practical applications of the FAIR Principles: sharing data in repositories, using persistent identifiers and citing data appropriately. In spite of the rise of FAIR and good data management practice, recent surveys found that nearly 60% of researchers had never heard of the FAIR Principles, and 46% were not sure how to organise their data in a presentable and useful way. In this presentation we will analyse the results of a white paper which assessed the key challenges faced by researchers in sharing their data, and discuss current initiatives and approaches to support researchers in adopting good data sharing practice. These include the roll-out of research data policies since 2016, as well as the launch of a Helpdesk service which has provided support to authors and allowed the research data team to capture more granular information on the challenges they face in sharing their data. We will also discuss the development of a third-party curation service which assists authors in depositing their data into appropriate repositories and in drafting data availability statements. Finally, we will assess the impact of some of these interventions, including an analysis of data availability statements and an overview of the methods authors currently use to share their data, and how these align with FAIR.

 

129 Building a Pharmacogenomics Knowledge Representation Model: the case of melanoma caused by BRAF gene mutation

Kang, Hongyu
Institute of Medical Information & Library, Chinese Academy of Medical Sciences

Pharmacogenomics data are growing rapidly; they reflect a hierarchy of relationships between drugs and genes, as well as more detailed information within drugs and diseases. In existing research, the semantic types and relationships involved in pharmacogenomics knowledge representation models are usually limited to drugs, genes and diseases, with no in-depth exploration of personalized medication. In this research, we designed a manual method for building a pharmacogenomics knowledge representation model that reflects the relationships among drugs, genes and diseases, as well as personalized medication and adverse reactions. Firstly, we defined the scope of our model and extended the semantic types to six: Drug, Gene (Gene Type, Mutation), Disease, Dosage (daily dose, dose form, frequency, take time for, etc.), Population (age, gender, applicable population, etc.) and Adverse Reaction. Secondly, open pharmacogenomics data, such as medicine specifications published by the US Food and Drug Administration (FDA), DrugBank and RxNorm, were integrated in this research. We also developed rules for data collection, entity labeling and extraction. In the end, the model included 6 top classes and 17 secondary classes. 22 semantic relationships have also been given specific descriptions and definitions, including is_biomarker_of, target_to, synergized_by, treat, dose_adjust_by_adverse_reaction, dose_adjust_by_gene, etc. Furthermore, we verified the feasibility of this model with the case of melanoma caused by BRAF gene mutation: 7 related drugs were covered, three of which are targeted drugs for the BRAF mutation. Overall, we highlight our model as a scalable framework for drug dosage adjustment according to genetic variation and for new drug discovery. In future work, a series of other antitumor drugs will be considered to populate the framework, and machine learning methods will be applied to discover more semantic relationships.
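A minimal sketch (Python) of how the model's semantic types and relationships could be held as typed triples; the data structure and the example triples are illustrative assumptions, not the authors' implementation.

    # Illustrative-only sketch: semantic types and relationships named in the
    # abstract, stored as simple (subject, predicate, object) triples.
    SEMANTIC_TYPES = {"Drug", "Gene", "Disease", "Dosage", "Population", "AdverseReaction"}
    RELATIONSHIPS = {
        "is_biomarker_of", "target_to", "synergized_by", "treat",
        "dose_adjust_by_adverse_reaction", "dose_adjust_by_gene",
    }

    triples = set()

    def add_triple(subject: str, predicate: str, obj: str) -> None:
        """Add a triple after checking the predicate against the model's relationships."""
        if predicate not in RELATIONSHIPS:
            raise ValueError(f"Unknown relationship: {predicate}")
        triples.add((subject, predicate, obj))

    # Hypothetical example triples inspired by the BRAF melanoma use case:
    add_triple("BRAF V600E", "is_biomarker_of", "melanoma")
    add_triple("vemurafenib", "target_to", "BRAF V600E")
    add_triple("vemurafenib", "treat", "melanoma")

    # Query all drugs recorded as targeting the BRAF mutation:
    drugs = [s for (s, p, o) in triples if p == "target_to" and o == "BRAF V600E"]
    print(drugs)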

 

130 Improvement of mouse strain ontology in GENEVESTIGATOR

Kinsky, Slavomir
NEBION AG, odstepny zavod, Rimska 526/20, Prague, 12000, Czech Republic

Genevestigator is a high-performance search engine for gene expression. It integrates manually curated public microarray and RNA-seq experiments and visualizes gene expression across different biological contexts. The recent increase in curated articles from mouse transcriptomics directed our attention to a revision of the mouse strain ontology used in Genevestigator. The nomenclature for mouse strains, substrains and genetically mutated strains primarily follows the rules and guidelines established by the International Committee on Standardized Genetic Nomenclature for Mice. Each annotated mouse strain is designated with specific symbols that characterize, for example, its genetic background, embryonic stem cells, type of mutation or laboratory code. Moreover, the new mouse strain ontology organizes annotated strains into several categories: inbred, hybrid, outbred and (genetically) mutated strains. The inbred, hybrid and outbred classes represent a collection of more than 700 strains that were developed by classical breeding followed by selection on inherited differences in basic biological parameters. The mutated strains category represents a collection of laboratory animals carrying genetic modifications in one or more genes, and is further organized into single, double, triple, quadruple or unspecified mutant strains. In addition, each sub-category contains an alphabetical list of mutated genes, which gathers the mouse strains carrying a mutation in the corresponding gene. The annotation of each mutant strain is also linked with the type of genetic modification, e.g. deletion, overexpression or conditional Cre in/activation. The architecture of our new mouse strain ontology, as well as the adoption of mouse strain nomenclature, implements the FAIR data principles. Together, these attributes allow Genevestigator users to quickly and easily compare data between individual strains carrying genetic modifications in a specific gene, while also taking into account genetic background or type of genetic manipulation.

 

131 How TBs of proteomics data can be efficiently handled and curated in the PRIDE database

Kundu, Deepti J
EMBL-EBI

Introduction 
The PRoteomics IDEntifications (PRIDE) database at the European Bioinformatics Institute is currently the world-leading repository of mass spectrometry (MS)-based proteomics data. PRIDE is also one of the founding members of the global ProteomeXchange (PX) consortium. Thanks to the success of PRIDE and PX, the proteomics community is now widely embracing open data policies, in contrast to the situation a few years ago. As a result, PRIDE has grown very significantly in recent years (~275 datasets per month were submitted on average during 2018). One major challenge is to ensure a fast and efficient data submission process while ensuring that the data representation is correct. Here, we describe in detail the PRIDE data handling and curation process.

Data handling and curation
For each submitted dataset, a validation pipeline is first run to ensure that the data comply with the PRIDE metadata requirements and that the files included in the dataset are correctly formatted. Issues of different types can often be detected, so direct interaction between PRIDE curators and the users becomes critical. In a second step, the actual data submission takes place and accession numbers are provided to the users. Finally, datasets are publicly released when the corresponding paper is published. The PX Submission tool and PRIDE Inspector are two stand-alone tools that support users in the data submission process.

Conclusion
Both the quantity and complexity of the proteomics datasets submitted to PRIDE are rapidly increasing. The complexity of proteomics data makes a fully automated data deposition process very challenging, especially since data formats are complex and very heterogeneous. Curators therefore play a very active role in supporting submitters in the preparation and quality control of each PRIDE data submission.
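A minimal sketch (Python) of the kind of check a submission validation pipeline performs; the required metadata fields and accepted file extensions shown are illustrative assumptions, not PRIDE's actual validation rules.

    # Illustrative sketch of a submission validation step: check that required
    # metadata fields are present and that submitted files use expected formats.
    # Field names and extensions below are assumptions, not PRIDE's actual rules.
    REQUIRED_METADATA = {"project_title", "project_description", "species", "instrument"}
    ACCEPTED_EXTENSIONS = {".raw", ".mzML", ".mzIdentML", ".mzTab"}

    def validate_submission(metadata: dict, filenames: list) -> list:
        """Return a list of human-readable issues; an empty list means the check passed."""
        issues = []
        for field in sorted(REQUIRED_METADATA - metadata.keys()):
            issues.append(f"Missing required metadata field: {field}")
        for name in filenames:
            if not any(name.endswith(ext) for ext in ACCEPTED_EXTENSIONS):
                issues.append(f"Unrecognised file format: {name}")
        return issues

    # Example run with a deliberately incomplete submission:
    print(validate_submission({"project_title": "Test"}, ["run01.raw", "notes.docx"]))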

 

132 SPARQL-Powered Search Engine and RESTful APIs for Protein Ontology Database

Li, Xiang
Center for Bioinformatics and Computational Biology, University of Delaware

The Protein Ontology (PRO) represents protein-related entities, ranging from broad protein families down to very specific proteoforms, as well as protein complexes. We currently use a relational database to store not only the ontology itself, but also other information, such as orthology, annotation and cross-reference data. Data integration and information retrieval can be achieved efficiently using Semantic Web technologies: once data are converted into the widely used Resource Description Framework (RDF) format, they can be queried using the SPARQL language. In this work, we investigate the use of Semantic Web technologies to power the text search for the PRO website. We studied the content and organization of the data source, extracted the relevant information, converted it into RDF triples (10,657,089 in total, PRO release 56.0) and loaded them into a Virtuoso RDF store. We then developed a SPARQL-powered search engine that queries the resulting RDF database to handle full-scale text search and hierarchy browsing for the PRO website. Tests show that we achieve performance similar to that of the Apache Lucene-based search engine currently in use. We also developed RESTful APIs for programmatic access to the PRO database based on the new search engine. The API specification was designed using the Swagger Editor, based on the OpenAPI Specification 3; Swagger UI is used to visualize and interact with the API's resources, automatically generated from the specification. The PRO APIs comprise 8 API operation groups and 33 access paths, and are implemented using the Django REST framework. The SPARQL-powered PRO text search website and the PRO APIs are accessible at https://proconsortium.org/rdf_based.
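A minimal sketch (Python, using the SPARQLWrapper package) of the kind of SPARQL query that can drive a label-based text search over an RDF store; the endpoint URL is a placeholder and the graph layout is an assumption, not the actual PRO RDF schema.

    # Sketch of a SPARQL-driven text search. The endpoint URL below is a
    # placeholder, and the assumption that terms carry rdfs:label values is
    # illustrative rather than a description of the PRO schema.
    from SPARQLWrapper import SPARQLWrapper, JSON

    ENDPOINT = "https://example.org/sparql"  # placeholder, not the real PRO endpoint

    query = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?term ?label WHERE {
        ?term rdfs:label ?label .
        FILTER(CONTAINS(LCASE(STR(?label)), "kinase"))
    }
    LIMIT 10
    """

    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    for binding in sparql.query().convert()["results"]["bindings"]:
        print(binding["term"]["value"], binding["label"]["value"])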

 

133 Maximising community participation in the FAIR-sharing of data from small-scale publications

Lock, Antonia
University of Cambridge

Two major outputs of biological research are the data produced, and the knowledge gained from these data. Integration of standardized published data into online databases is a widely acknowledged mechanism to accelerate knowledge distribution, and thereby increase the value of research efforts and funding. To make data integration more comprehensive and efficient, PomBase has developed Canto, an intuitive online curation tool that enables publication authors to participate directly in the semantically standardized "encoding" of detailed functional data. Critically, curation in Canto leads to the FAIR-sharing of data from small-scale publications where newly acquired, detailed biological knowledge from hypothesis-driven experiments would otherwise remain buried. Community curation also helps PomBase increase knowledge integration and dissemination despite current constraints on funding for biocuration activities. As the fission yeast community's curation response rate approaches 50%, we review the project's procedures and metrics, the unanticipated added value from co-curation by publication authors and professional curators, and incentives and "nudges" we deploy to maximize participation.

 

134 Providing semantically-rich subject and knowledge domain annotation of FAIRsharing standards, databases and policies

(Note: 138 in Abstract book)

Lister, Allyson
University of Oxford

FAIRsharing (https://www.fairsharing.org) is a manually-curated, cross-discipline, searchable portal of three linked registries covering standards, databases and data policies. Records provide detailed resource metadata as well as relationships between themselves and other resources. With over 2500 records, FAIRsharing relies on appropriate descriptors for each record to facilitate search and discovery. Our user community extends across a range of academic subjects and has created a similarly wide-ranging set of over 1000 user-created domain tags. However, the lack of structure or metadata for these tags limited their usefulness. To provide a hierarchical structure and richer semantics, two application ontologies, drawn from subsets of multiple community ontologies, were derived from the user-created domain tags, alongside a much-reduced set of specific user-defined tags. The Subject Resource Application Ontology (SRAO, https://github.com/FAIRsharing/subject-ontology) stores FAIRsharing's subject tags. SRAO is a hierarchy of almost 400 academic disciplines which combines a library subject list with subsets of six additional ontologies. FAIRsharing's knowledge domain tags are drawn from the Domain Resource Application Ontology (DRAO, https://github.com/FAIRsharing/domain-ontology), a hierarchy of research domains and descriptors; fifty external ontologies are used to provide over 1000 classes. Both application ontologies are covered by a CC BY-SA 4.0 licence. New tags are first added to FAIRsharing's flat list of user-defined tags and, where appropriate, curators promote them to either SRAO or DRAO. The integration of SRAO and DRAO within FAIRsharing gives users access to hierarchical searching for resource discovery, as well as to a wealth of definitions and synonyms for improved curation.

 

135 ENCODE data standards maximize quality and use of high-throughput genomic data

(Note: 134 in Abstract book)

Luo, Yunhai
Stanford University

The Encyclopedia of DNA Elements (ENCODE) Consortium aims to discover and annotate functional elements in the human genome through functional genomic studies. ENCODE investigators have so far assayed over 75 different cell and tissue types with more than 40 high-throughput genomic technologies in over 15,000 experiments. The ENCODE Data Coordinating Center (DCC) collects, organizes, curates and shares these experimental data, along with the results of uniform integrative analyses, on the open-access ENCODE portal (https://www.encodeproject.org). ENCODE has established a set of data standards that make ENCODE results adherent to the FAIR (findable, accessible, interoperable and reusable) principles. These standards are coherent yet flexible enough to accommodate current and future technologies, so they have the potential to be applied to data from other consortium-based projects and to ensure that all projects are interoperable. Indeed, the ENCODE portal also hosts data from Roadmap, modENCODE, modERN and GGR using the same standards. Moreover, the ENCODE metadata model is MINSEQE compliant and can be exported in ISA-TAB, GEO, SRA-XML or GA4GH-defined formats. ENCODE standards specify a set of metadata to be collected, which not only makes experimental materials reusable and uniquely identifiable but also encourages reproducibility and interpretability of experiments. The ENCODE DCC modeled these metadata as a hybrid relational/object-based data store implemented in JSON Schema, which allows curators to monitor compliance with the ENCODE standards using JSON Schema dependencies, audits, controlled vocabularies and ontologies. In addition to promoting data provenance, following data standards agreed upon by the community is a prerequisite for uniform data processing, which makes ENCODE experiments comparable within ENCODE as well as to experiments from other consortia. Overall, ENCODE data standards ensure data quality and maximize the use of high-throughput genomic data.
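A minimal sketch (Python, using the jsonschema package) of schema-based metadata validation in the spirit described above; the schema fields, controlled values and example object are illustrative assumptions, not the actual ENCODE metadata model.

    # Illustrative sketch: validating a metadata object against a JSON schema.
    # The fields, pattern and controlled vocabulary below are assumptions,
    # not ENCODE's actual schema; all example values are hypothetical.
    from jsonschema import validate, ValidationError

    experiment_schema = {
        "type": "object",
        "required": ["accession", "assay_term_name", "biosample_ontology"],
        "properties": {
            "accession": {"type": "string", "pattern": "^ENC[A-Z0-9]+$"},
            "assay_term_name": {"type": "string", "enum": ["ChIP-seq", "RNA-seq", "ATAC-seq"]},
            "biosample_ontology": {"type": "string"},
        },
    }

    candidate = {
        "accession": "ENCSR000XXX",
        "assay_term_name": "ChIP-seq",
        "biosample_ontology": "EFO:0000000",
    }

    try:
        validate(instance=candidate, schema=experiment_schema)
        print("Metadata object passes the schema check")
    except ValidationError as err:
        print("Audit flag:", err.message)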

 

136 InterMine: towards supporting the FAIR principles and widening integrative data analysis

(Note: 135 in Abstract book)

Lyne, Rachel
University of Cambridge

InterMine is a well-established open source data warehouse built specifically for the integration and analysis of complex biological data. Many groups use the InterMine framework to provide a query and analysis platform for their data, and there is now a broad selection of InterMine databases worldwide covering many organisms, including HumanMine, PhytoMine (over 87 plant genomes), Hymenoptera, Planaria, the Legume Federation InterMines (Chickpea, Soy, Legume, Peanut, Bean), MedicMine (Medicago), ThaleMine (Arabidopsis), as well as the budding yeast, rat, zebrafish, mouse, fly and nematode model organisms. The InterMine framework includes a user-friendly web interface as well as a powerful web service API, with multiple language bindings including Python and R. An advanced query builder supplements keyword search, and results can be interactively explored and refined. The interface is designed to allow flexible and iterative querying in which items collected as results in one step are used as input in the next. A set of graphical analysis tools provides a rich environment for data exploration, including statistical enrichment of sets and visualisations such as expression graphs and interaction networks. Recent work includes a major rewrite of the user interface to exploit the latest technologies, providing an improved user experience and enabling better integration with third-party tools. With BBSRC support we are currently working on providing persistent stable URIs, RDF export and other features that support the creation and dissemination of data according to the FAIR principles. With Wellcome Trust support, and in collaboration with the Sansone Group at Oxford University, we are starting to bring InterMine functionality more readily to bench scientists: new features will include tools to help transform research data files into cloud-hosted InterMine databases, enhancing ISAtools for metadata capture, and providing tools to support data deposition, dissemination and publication.
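A minimal sketch of a programmatic query using the InterMine Python client (the "intermine" package); the service URL, class and field names follow common public examples but should be checked against the target mine's data model.

    # Sketch of a web-service query via the InterMine Python bindings.
    # Service URL, class name and paths below are assumptions drawn from
    # public examples and may differ for a given mine.
    from intermine.webservice import Service

    service = Service("https://www.humanmine.org/humanmine/service")
    query = service.new_query("Gene")
    query.add_view("symbol", "primaryIdentifier", "organism.name")
    query.add_constraint("symbol", "=", "TP53")

    for row in query.rows():
        print(row["symbol"], row["primaryIdentifier"], row["organism.name"])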

 

137 International Protein Nomenclature Guidelines: helping to standardise protein naming

(Note: 136 in Abstract book)

Magrane, Michele
EMBL-EBI

Consistent protein nomenclature is indispensable for communication, literature searching and retrieval of database records. However, for many proteins, a variety of different names are used across the scientific literature and public biological databases, which makes effective organization and exchange of biological information a difficult task. To address this issue and provide help with protein naming, a set of protein nomenclature guidelines has been produced jointly by the European Bioinformatics Institute (EMBL-EBI), the National Center for Biotechnology Information (NCBI), the Protein Information Resource (PIR) and the Swiss Institute of Bioinformatics (SIB). The guidelines cover all aspects of protein naming, from advice on expert sources of protein names and how to name novel proteins of unknown function, to more detailed advice such as terms to avoid in a protein name and acceptable abbreviations. The guidelines are intended for use by anyone who wants to name a protein, including biocurators who assign protein names as part of database records, bioinformaticians who assign protein names in gene annotation pipelines prior to submission to public archives, and researchers who isolate a new protein and want to name it prior to publication. The guidelines are available at http://www.uniprot.org/docs/International_Protein_Nomenclature_Guidelines.pdf and https://www.ncbi.nlm.nih.gov/genome/doc/internatprot_nomenguide/. We welcome feedback from the community, which will be considered for future improvements to the guidelines.

 

138 uPheno 2.0: Framework for standardised representation of phenotypes across species

(Note: 137 in Abstract book)

Matentzoglu, Nicolas
The European Bioinformatics Institute (EMBL-EBI)

The uPheno project aims to unify the annotation of phenotypes across species in a manner analogous to the unification of gene function annotation by the Gene Ontology. uPheno 2.0 builds on earlier efforts with a new strategy that directly leverages the work of the phenotype ontology development community and incorporates phenotypes from a much wider range of species. We have organised a collaborative community effort, including representatives of all major model organism databases, to document and align formal design patterns for representing phenotypes and to further develop the reference ontologies, such as PATO, that are used in these patterns. A common development infrastructure makes it easy to use these design patterns to generate both species-specific ontologies and a species-independent layer that subsumes them. The resulting community-curated ontology for the representation and integration of phenotypes across species serves two general purposes: (1) providing a community-developed framework for ontology editors to bootstrap, maintain and extend their phenotype ontologies in a scalable and standardised manner; and (2) facilitating the retrieval and comparative analysis of species-specific phenotypes through a deep layer of species-independent phenotypes. Here we present the general uPheno 2.0 framework and an early version of the ontology, including a demonstration of a basic case of comparative analysis.

 

139 FAIRsharing: Mapping the landscape of biocuration - what standards should you use, what kind of data is in each database, which resources are FAIR

McQuilton, Peter
University of Oxford

To support the widely accepted principles of Findable, Accessible, Interoperable and Reusable (FAIR) data, we need to paint an accurate landscape of the evolving network of standards (for identifying, reporting and citing data and metadata), databases (repositories and knowledge-bases) and data policies (from journals, publishers, funders and other organizations). FAIRsharing (https://www.fairsharing.org), a cross-discipline, manually curated portal, does just that. Standards are subdivided into four types: reporting guidelines or checklists, which outline the necessary and sufficient (or minimum) information to contextualize and understand a dataset (e.g. the ARRIVE guidelines); terminology artefacts or semantics, which unambiguously identify and define concepts within a scientific domain (e.g. the Disease Ontology); models/formats or syntaxes, which define the structure and interrelation of information in a conceptual model and the transmission format that facilitates data exchange between different systems (e.g. CellML); and identifier schemas, which define how resources and digital objects are uniquely and unambiguously identified (e.g. PURL). Policies are curated from journals and funders, while databases and repositories are captured from the literature. FAIRsharing allows resource maintainers themselves to edit their records, ensuring the records are current and improving accuracy. FAIRsharing is open and collaborative, working via a joint FORCE11/RDA working group, ELIXIR, the EOSC pilot, GO FAIR and other European and international initiatives. FAIRsharing has a range of adopters including journal publishers, infrastructure programmes, organizations and projects, all of whom use FAIRsharing in some way (see https://fairsharing.org/communities). We continue to build content, improve data visualisation and analysis, and develop ways to better serve our users (e.g. via an API), and we welcome feedback as the project grows.

 

140 Creation and development of a marine-community data coordination service: the EMBRIC Configurator

Milano, Annalisa
EMBL-EBI

The EMBRIC Configurator (http://www.embric.eu/configurator) is a data coordination service created to provide an entry point for the marine community into the breadth of existing bioinformatics resources and to help make the most effective use of them. The service has been created specifically to serve the marine community, representing disciplines as diverse as aquaculture, ichthyology and algal biotechnology, and targeting those embarking on the design of new marine projects. It is a freely available service sponsored by, and designed for, the European Marine Biological Research Infrastructure Cluster (EMBRIC), a European Union Horizon 2020 project that aims to promote new applications derived from marine organisms. The service is fundamentally provided by expert biocurators who translate the requirements of requests from the marine community into project-specific "configurations". These configurations can include selected elements of infrastructure such as databases (especially those from ELIXIR), standards, formats, curation groups, analysis methods and cloud compute capacity, together with advice on accessing and setting these elements up for the project, and data management guidelines. This structured consultancy can also support core activities such as help with data submission and metadata standards development through structured checklists for different marine data types. The EMBRIC Configurator has a panel of consultants who can provide specific expertise on configurations if required. Furthermore, the EMBRIC Configurator can also serve curators of ELIXIR resources dealing with marine datasets across multi-omics studies. Are you curating marine datasets and do you want to increase their exposure to the marine community? The EMBRIC Configurator can bring together experts from different resources by coordinating marine biotechnological data in compliance with the principles of Findability, Accessibility, Interoperability and Re-usability (FAIR), such as qualified references to other data and metadata.

 

141 Virtual Fly Brain - semantic solutions for building query-able web resources for neurobiology and 3D image data

Osumi-Sutherland, David
EMBL/EBI

Advances in genetic and imaging technologies, coupled with large-scale efforts to generate reagents and map connectomes, have provided Drosophila neurobiologists with unprecedented opportunities to map, manipulate and functionally dissect neural circuits down to the level of individual neurons and synapses. But these opportunities can only be fully realised if biologists can access resources that provide intuitive ways to interrogate the vast volumes of data produced by these efforts. Virtual Fly Brain (VFB) provides just such a resource. In collaboration with FlyBase, we curate information about Drosophila neurobiology from the literature and bulk sources and integrate it with tens of thousands of cross-registered 3D images. Our website provides sophisticated visualization and query tools for users to explore and interrogate this content. We achieve this by using standardised semantic schemas in the Web Ontology Language (OWL) to store queryable representations of cross-registered image data, transgene expression patterns, and neuroanatomy down to the level of individual neurons and synaptic connections. Our schemas are designed to be re-usable outside the context of Drosophila neurobiology, using generally applicable terms from the OBO Relations Ontology (e.g. to record neuronal innervation patterns), the Sequence Ontology (for genetic features), and the Gene Ontology (for neuronal parts, processes and behaviors). The OWL representation drives powerful queries on VFB, including queries for neurons by innervation pattern, but it is not ideal for storing curation or for generating browsable visualizations. We therefore use the graph database Neo4j to store curation and to produce browsable visualizations of VFB content and query results. We will present details of our schemas, illustrating how they drive queries and visualization of neurobiological knowledge, data and 3D images. We will also present work on a generic, containerised version of VFB that can host any 3D image data annotated with ontologies.

 

142 CausalTab: PSI-MITAB updated for signaling data representation and dissemination

Panneerselvam, Kalpana
EMBL-EBI

In a biological context, causal interactions are abstractions representing the regulatory effect that a regulator entity (e.g. an enzyme) has on a target entity (e.g. a substrate). Several causality-related resources, such as SIGNOR, SignaLink 2.0, GO-CAM and IntAct, capture, organize and display signalling interactions by portraying them as binary, causal relationships between biological entities. To ensure interoperability, these resources have agreed on the development of a standardized exchange format and a controlled vocabulary for the field of causal interactions. Aware of the enormous benefit obtained by stakeholders in the molecular interaction research field, the signalling network community decided to adopt the PSI-MI standard. We have extended the PSI-MI controlled vocabulary to include additional terms representing aspects of causal interactions. We have defined the new branch "causal interaction", and related child terms, directly under the PSI-MI root term "molecular interaction". Additionally, we have released the PSI Causal Interaction tabular format (CausalTAB), an extension of the existing PSI-MI tab-delimited format now designated MITAB 2.8. This new standard benefits from the expressive nature of the existing PSI-MITAB, which can capture great experimental detail, and extends it with 4 additional columns. These columns capture aspects such as the regulatory effect being annotated (e.g. activation, increase in expression), the molecular function of the regulator and regulated components, and the mechanism by which the regulatory event happens if the components are not physically interacting (e.g. transcriptional regulation). The new standard creates the basis for the development of a common query interface for molecular interactions and causal interactions, enabling systems biologists to model large-scale signalling networks more precisely and with higher coverage than before.

 

143 BioCompute: Establishing standards for communication of HTS workflows and knowledgebase curation

Patel, Janisha
George Washington University

Developments in high-throughput sequencing (HTS) have enabled widespread data generation and analysis at unprecedented scales. The BioCompute Framework is a standardized, consensus-based mechanism for communicating analyses, built to support the curation, accessibility, and reproducibility of next-generation sequencing analysis pipelines. With collaboration and data provenance in mind, BioCompute extends these capabilities to both HTS workflows and knowledgebases. BioCompute is a product of the HTS Computational Standards for Regulatory Sciences (HTS-CSRS) Project and is driven by a consortium of diverse members from academia, industry, and government. A BioCompute Object (BCO) is a data object that adheres to the BioCompute specification. BCOs are built using JSON Schema to encode pertinent information on knowledgebase creation and computational workflow execution, in addition to metadata, data provenance, and usage. By recording information on data origin, QC criteria, a "verification kit" (which includes false positives, false negatives, the bounds of detection, and other pipeline metrics), and data content, BCOs allow the reuse of previously recorded workflows for the curation and integration of new data in a standardized manner. A database of BCOs, or "BioComputeDB", enables individuals to systematically access and track all existing data records, as metadata and QC processes are intrinsic to BCO files. BCO input data are structured in accordance with several industry standards, such as MIAPE, MIRAGE, STRENDA, and other Minimum Information Standards, to enhance BCO functionality and compliance. BioCompute adheres to FAIR data standards, thus making BCO datasets and workflows findable, accessible, interoperable, and reusable. This ease of accessibility and transparency in data curation will facilitate future data integration and promote interoperability between different platforms, industries, scientists and regulators within the scientific community.
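A minimal sketch (Python) of what a BCO-style record can look like; the domain names follow the published BioCompute structure, but the content is greatly simplified and all values are hypothetical.

    import json

    # Greatly simplified, illustrative BCO-style record. Domain names echo the
    # structure described above (provenance, description, execution, I/O, error),
    # but all values are hypothetical and many required fields are omitted.
    bco = {
        "object_id": "https://example.org/BCO_000001/1.0",   # placeholder identifier
        "provenance_domain": {
            "name": "Example variant-calling pipeline",
            "version": "1.0",
            "contributors": [{"name": "Jane Doe", "contribution": ["createdBy"]}],
        },
        "description_domain": {
            "pipeline_steps": [
                {"step_number": 1, "name": "read alignment"},
                {"step_number": 2, "name": "variant calling"},
            ],
        },
        "execution_domain": {"script": ["run_pipeline.sh"], "environment_variables": {}},
        "io_domain": {
            "input_subdomain": [{"uri": {"filename": "sample.fastq"}}],
            "output_subdomain": [{"uri": {"filename": "variants.vcf"}}],
        },
        "error_domain": {"empirical_error": {}, "algorithmic_error": {}},
    }

    print(json.dumps(bco, indent=2))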

 

144 The Drosophila Anatomy Ontology

Pilgrim, Clare

The Drosophila Anatomy Ontology (DAO) is a queryable store of knowledge about Drosophila anatomy and cell types. It consists of over 10,000 terms and over 17,000 logical relationships curated from almost 900 publications.

The DAO supports the multiple classification schemes used by biologists, for example classifying neurons by lineage, neurotransmitter, sensory modality and innervation pattern. Building a high-quality ontology of this complexity would not be possible without automation. Consistency and efficiency are achieved by using standard design patterns and scripting, as sketched below. For scalability, we use logical definitions and inferred classification wherever possible. As a result, 35% of over 13,500 classifications are inferred.
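A minimal sketch (Python) of template-driven term generation of the kind that design patterns and scripting enable; the pattern text and filler values are invented for illustration and are not the DAO's actual design patterns.

    # Illustrative sketch of design-pattern-driven term generation: a single text
    # template plus a table of fillers yields consistent labels and definitions.
    # The pattern and filler values are invented, not taken from the DAO.
    PATTERN = {
        "label": "{lineage} {neurotransmitter} neuron",
        "definition": "A {neurotransmitter} neuron that develops from the {lineage} lineage.",
    }

    fillers = [
        {"lineage": "exampleA", "neurotransmitter": "cholinergic"},
        {"lineage": "exampleB", "neurotransmitter": "GABAergic"},
    ]

    for row in fillers:
        print(PATTERN["label"].format(**row))
        print("  def:", PATTERN["definition"].format(**row))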

The majority of new additions to the DAO are for neuroanatomy, following recent discoveries made using techniques such as split-GAL4 and electron microscopy of whole central nervous systems. Terms are created and updated in a collaboration between FlyBase and Virtual Fly Brain (VFB), both of which use the DAO to annotate data. FlyBase takes a gene-centric approach, using DAO terms to record expression patterns and phenotypes. VFB takes an anatomy-centric approach, associating DAO terms with 3D images of anatomical structures and expression patterns. The VFB site takes advantage of the logical assertions recorded in the DAO to drive queries, for example allowing users to query for neurons by innervation pattern.

We will present examples of how we use automation to aid ontology development and of how the knowledge encoded in the DAO is used to drive biologically meaningful queries on VFB.

 

145 GlyGen - Computational and Informatics Resources for Glycoscience

Ranzinger, Rene
Complex Carbohydrate Research Center

Advancing our understanding of the roles that glycosylation plays in development and disease is hindered by the diversity of the data that must be integrated to gain insight into these complex phenomena. GlyGen is a new initiative, supported by the NIH Common Fund, with the goal of democratizing glycoscience by implementing a comprehensive data repository that integrates diverse types of data, including glycan structures, glycan biosynthesis enzymes, glycoproteins, and three-dimensional glycoprotein structures, along with genomic and proteomic knowledge. As part of GlyGen we have established collaborations with database providers from different domains (including but not limited to EBI, NCBI, GlyTouCan and UniCarbKB) in order to populate the repository with data. All information from these resources was standardized and cross-linked with datasets from the other resources to allow queries across multiple domains. To provide the community with an easy way to access the information, an intuitive, web-based interface (http://glygen.org/) has been developed to visually represent the data and the connections between the individual datasets. In addition to the browser-based interface, we have also developed a RESTful web service API and a SPARQL endpoint allowing programmatic access to the integrated datasets. Our goal is to provide both trained and aspiring glycoscientists an easy way to access the complex information involving glycans in biology. One aspect we have implemented in this context is a query interface which suggests specific questions that a user is likely to have, such as "What are the enzymes involved in the biosynthesis of glycan X in humans?". By specifying only one or a few pieces of information (e.g. glycan X in the example), a complex query across multiple datasets and domains is performed and the result returned to the user. These questions are based on a collection of use cases we compiled using mailing lists and community meetings.
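A minimal sketch (Python, using the requests package) of programmatic access to an integrated resource through a RESTful API; the base URL, endpoint path, accession and field names are placeholders and should be taken from the GlyGen API documentation rather than from this example.

    # Sketch of REST-based access to an integrated glycoscience resource.
    # The host, route and field names below are placeholders, not GlyGen's
    # actual API; consult the official documentation for the real routes.
    import requests

    BASE_URL = "https://api.example.org"      # placeholder host
    glycan_accession = "G00000XX"             # hypothetical glycan accession

    response = requests.get(f"{BASE_URL}/glycan/detail/{glycan_accession}", timeout=30)
    response.raise_for_status()
    record = response.json()

    # Hypothetical traversal: list enzymes linked to the glycan record, if present.
    for enzyme in record.get("enzyme", []):
        print(enzyme.get("gene_name"), enzyme.get("uniprot_canonical_ac"))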

 

146 Standardization of glycosaminoglycan sequences binding to proteins and creation of a pipeline for the curation of protein-glycosaminoglycan interactions

Ricard-Blum, Sylvie
University Lyon 1, ICBMS, UMR 5246 CNRS

Mammalian glycosaminoglycans (GAGs) are complex polysaccharides comprising heparan sulfate, heparin, dermatan sulfate, chondroitin sulfate, keratan sulfate and hyaluronan. GAG–protein interactions reported in the literature are curated mostly by the MatrixDB database (http://matrixdb.univ-lyon1.fr/), which belongs to the IMEx consortium (http://www.imexconsortium.org/) and follows its curation rules. However, a standard nomenclature and a machine-readable format for GAGs suitable for curation, together with bioinformatics tools for mining their interaction data, are lacking. We have built an automated pipeline to (i) standardize the format of GAG sequences interacting with proteins manually curated from the literature, (ii) translate them into the machine-readable GlycoCT format and into SNFG (Symbol Nomenclature For Glycans) images, and (iii) convert them into a format processed by a builder generating 3D models based on a repertoire of conformations validated by data from GAG–protein co-crystals. We have curated GAG sequences binding to proteins, translated them into the GlycoCT and SNFG formats using the pipeline, and cross-referenced GAG entries of MatrixDB with the Chemical Entities of Biological Interest (ChEBI, https://www.ebi.ac.uk/chebi/) and the glycan repository GlyTouCan (https://glytoucan.org/). We have also cross-referenced GAG entries of GlyTouCan with MatrixDB, and ChEBI entries with GlyTouCan, to increase the interoperability of the major databases for GAGs. Furthermore, we have developed and integrated into the pipeline a converter (CT23D), which automatically translates the GlycoCT code of a GAG sequence into the input file required to construct a 3D model (https://github.com/OlivierClerc/convert-glycoct-inp). The 3D models, used to display the GAG binding sites, can be built on the MatrixDB web site using a version of the POLYS glycan builder that we have developed for GAGs (http://glycan-builder.cermav.cnrs.fr/gag/), downloaded as PDB files and visualized.

 

147 Gene Ontology Annotation (GOA) Database

Rodriguez Lopez, Milagros
EBI/EMBL

The Gene Ontology Annotation (GOA) program aims to provide high-quality Gene Ontology (GO) annotations to proteins in the UniProt Knowledgebase (UniProtKB), RNA molecules from RNAcentral and protein complexes from the Complex Portal. GOA files contain a mixture of manual annotations supplied by members of the Gene Ontology Consortium and computationally assigned GO terms describing gene products. The annotation type is clearly indicated by the associated evidence codes, and there are links to the source data. GOA currently supplies more than 600 million annotations to more than 100 million gene products from more than 1 million species. To create the annotations and make the data available, GOA provides the Protein2GO curation tool and QuickGO, a tool that allows users to search the Gene Ontology and the associated annotations. A new search feature added to QuickGO this year allows users to include annotation extensions in their queries. Our next improvement to QuickGO will focus on the creation of an advanced search that allows users to make even more specific queries. A major focus over the past year has been to increase the coverage of manually assigned GO terms for the human proteome. More than 85% of the proteins in the human proteome now have manually assigned terms, and this work is ongoing.
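A minimal sketch (Python, using the requests package) of retrieving GO annotations through a REST interface; the endpoint path and parameter names are assumptions and should be verified against the QuickGO API documentation.

    # Sketch of programmatic retrieval of GO annotations over REST.
    # The endpoint path, parameters and response fields below are assumptions
    # and should be checked against the QuickGO API documentation before use.
    import requests

    URL = "https://www.ebi.ac.uk/QuickGO/services/annotation/search"
    params = {"geneProductId": "P04637", "limit": 5}

    response = requests.get(URL, params=params, headers={"Accept": "application/json"}, timeout=30)
    response.raise_for_status()

    for annotation in response.json().get("results", []):
        print(annotation.get("goId"), annotation.get("evidenceCode"))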

 

148 The ELIXIR Recommended Interoperability Resources (RIRs) - What tools can I use to make data FAIR?

Sarntivijai, Sirarat
ELIXIR

The ELIXIR Interoperability Platform (EIP) is one of the five technical platforms (Compute, Data, Interoperability, Tools, and Training) that support life science knowledge discovery by coordinating bioinformatics research infrastructures across the ELIXIR Consortium's 23 member states. The EIP aims to help people and machines to discover, access, integrate and analyse biological data by encouraging the life science community to adopt standardised file formats, metadata, vocabularies and identifiers. To support this goal, the Recommended Interoperability Resource (RIR) selection process has been established to identify tools that are fit for purpose for the myriad tasks involved in making data FAIR. The RIRs are promoted through ELIXIR for adoption, as they have been evaluated for their practicality, and to encourage community reusability practice at large. All ELIXIR resources are publicly available. The applications for RIR consideration were reviewed by ELIXIR technical experts and evaluated by non-ELIXIR experts from international communities, based on each RIR's production maturity and quality, facilitation of scientific discovery, community support and impact, and legal framework and governance. The first round of the selection process resulted in an initial portfolio of 10 RIRs, announced at https://www.elixir-europe.org/platforms/interoperability/rirs. These resources will be regularly evaluated for quality assurance and quality control. Additional resources will be included in future calls for RIR applications as ELIXIR evolves to accommodate emerging technologies and changing scientific needs. Recommendations for the use of RIRs should be read as suggestions and not a mandate, as we value progressive developments from our ELIXIR members based on new use case requirements.

 

149 DO: The FAIR human disease ontology domain standard

Schriml, Lynn
University of Maryland School of Medicine, Institute for Genome Sciences

The Human Disease Ontology (DO, www.disease-ontology.org) is a critical infrastructure resource for the biomedical community. As the NIH/NHGRI-funded community resource for human diseases, and the OBO Foundry disease domain ontology, the DO is an authoritative disease nomenclature and classification resource utilized across the biomedical and clinical research communities for the breadth of human diseases. Established in 2003, the DO project encompasses the breadth of rare and common diseases (cancers, syndromes, and infectious, metabolic, genetic and mental health diseases). Most recently, the DO has been innovating interoperability within the DO and across OBO Foundry ontologies, defining novel parent-to-child disease term relationships inferred from logical axioms coordinated with OBO Foundry ontologies and based on a disease's related information (e.g. anatomical location, mode of inheritance, cell type, sequence structural variant). Logical axioms for phenotypes, symptoms and infectious agents are under development. The DO's complex-etiology disease model has established methods for defining multiple genetic etiologies (e.g. single gene, chromosomal, post-translational modification) in a single resource, thus expanding the DO's clinical utility for exploring differential diagnosis. The DO employs best practices for community data coordination, including the use of globally unique and persistent identifiers, a GitHub repository and issue tracker, production of OBO- and OWL-formatted files, regular ROBOT-automated data releases, extensive cross-references and open CC0 licensing. Broad and expanding implementation and usage (> 340 citations), together with collaborative projects with the Alliance of Genome Resources, JAX CKB, Clinical Interpretations of Variants in Cancer (CIViC) and Wikidata, drive the DO's ongoing development, ensure timely integration of new information and evolve the DO's disease classification (e.g. the ClinGen pediatric cancer re-classification) based on new knowledge.

 

150 Expanding MIxS Genomic Minimal Information Standards

Schriml, Lynn
University of Maryland School of Medicine, Institute for Genome Sciences

The Genomic Standards Consortium's (GSC, www.gensc.org) successful development and implementation of the MIxS genomic metadata standards has established a community-based mechanism for sharing genomic data through a common framework. The GSC, an international open-membership working body of over 500 researchers from 15 countries, promotes community-driven efforts for the reuse and analysis of contextual metadata describing the collected sample, the environment and/or the host, and the sequencing methodologies and technologies. Since 2005, the GSC community has deployed the MIGS genome checklists and a library of 15 MIxS environmental packages to meet evolving genomic research needs, enabling standardized capture of environmental, human and host-associated study data. The GSC's MIxS Compliance and Interoperability working group maintains, promotes and actively co-develops additional MIxS standards. The GSC's suite of minimal information reporting guidelines has been supported (through GenBank and BioSample implementations) for over a decade by the International Nucleotide Sequence Database Collaboration (INSDC) databases, namely NCBI GenBank, EMBL-EBI ENA and DDBJ, thus allowing for an enriched environmental and epidemiological description of sequenced samples. To date, over 450,000 BioSample records have been annotated with the GSC's MIxS standards. In the past two years, the GSC community has developed novel metadata standards for capturing contextual data for single-cell genomes, genomes assembled from metagenomes for bacteria and archaea, and uncultivated virus genomes. These standards contribute knowledge pertinent to these research communities, enable data reuse and integration, and foster cross-study data comparisons, thus addressing the critical need for consistent data representation, data sharing and interoperability in genomic Big Data. Agriculture Microbiome, Host-Parasite Microbiome and Metabolomics MIxS standards are under development for 2019.

 

151 Comparative genetics and genomics of mouse strains and species at Mouse Genome Informatics (MGI)

Smith, Cynthia
The Jackson Laboratory

The availability of hundreds of genetically distinct and phenotypically characterized mouse strains is at the core of the power of mouse models to detect genotype-to-phenotype and disease relationships. The recent availability of annotated genome assemblies for nineteen mouse strains and species, generated by the Mouse Genomes Project, further enhances the power of the mouse as a model system for investigating the genetic basis of human health and disease. The Mouse Genome Informatics database (MGI, www.informatics.jax.org) has implemented new functionality to take advantage of the availability of strain genomes, including the development of an interactive Multi Genome Viewer (MGV) and summarized mouse strain-specific data web pages. MGV allows users to display and interact with the annotated genomes of all available mouse strains and species in a compact, comparative graphical interface. The new strain summary pages feature official strain names, synonyms, SNP data, known phenotypic and sequence variants and disease models, links to vendors, and references. Links to external resources, such as the Mouse Phenome Database for baseline strain characteristics, are also provided when relevant. The new strain-centric interfaces available from MGI will allow researchers to leverage emerging large-scale phenotype and genome data sets to identify the precise genomic differences among mouse strains. MGI is funded, in part, by HG000330.

 

152 iDog: an integrated resource for domestic dogs and wild canids

Tang, Bixia
BIG Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China

The domestic dog (Canis lupus familiaris) is indisputably one of man's best friends. It is also a fundamental model for many heritable human diseases. Here, we present iDog (http://bigd.big.ac.cn/idog), the first integrated resource dedicated to domestic dogs and wild canids. It incorporates a variety of omics data, including genome sequence assemblies for the dhole and wolf, genomic variations extracted from hundreds of dog/wolf whole genomes, phenotype/disease traits curated from dog research communities and public resources, gene expression profiles derived from published RNA-Seq data, gene ontology annotations for functional annotation, homologous gene information for multiple organisms, and disease-related literature. Additionally, iDog integrates sequence alignment tools for data analyses and a genome browser for data visualization. iDog will not only benefit the global dog research community, but also provide access to a user-friendly consolidation of dog information for a large number of dog enthusiasts.

 

153 Evidence and Conclusion Ontology: 2019 Update

Tauber, Rebecca
University of Maryland School of Medicine

The Evidence and Conclusion Ontology (ECO) describes types of evidence relevant to biological investigations. First developed in the early 2000s, ECO now consists of over 1700 defined classes and is used by a large, and growing, list of resources. ECO imports close to 1000 classes from the Ontology for Biomedical Investigations and the Gene Ontology for use in logical definitions. Historically, ECO terms have generally been categorized by either the biological context of the evidence (e.g. gene expression) or the technique used to generate the evidence (e.g. PCR-based evidence). As a result, terms with related biological context are sometimes found under different, unrelated nodes. To address this, we have been performing a rigorous review of the structure and logic of the branches of ECO. Working with additional input from collaborators through the issue tracker on GitHub, term labels, definitions, and relationships are being evaluated and updated. The goal of these changes is to increase the logical consistency of ECO, make it easier for users to find and understand terms, and allow ECO to continue to grow and support its users. In addition to the structural review, we have been working with CollecTF to utilize ECO for automated text mining. To generate a curated corpus for this effort, we have been annotating ECO terms to sentences that contain evidence-based assertions about gene products, taxonomic entities, and sequence features. From this effort we have developed clearly defined annotation guidelines that have been passed on to a team of undergraduates who are continuing the curation effort. Annotations are limited to single sentences, or to two consecutive sentences, containing the evidence instance and assertion clause. The quality of the mapping to ECO and the strength of the author’s assertion are also captured. ECO is freely available at http://evidenceontology.org/ and https://github.com/evidenceontology.
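To make the corpus annotations concrete, here is a purely hypothetical sketch of what a single annotation record of the kind described above might hold. The field names, the example sentence, and the chosen ECO term are invented for illustration; the actual curation format and guidelines may differ.

```python
# Hypothetical structure for a single corpus annotation, for illustration only;
# the actual CollecTF/ECO curation record format may differ.
annotation = {
    # sentence containing the evidence instance and the assertion clause
    "text": ("Deletion of lexA abolished binding at the target promoter, "
             "indicating that LexA represses transcription of this operon."),
    "eco_term": "ECO:0000269",        # experimental evidence used in manual assertion
    "assertion_subject": "LexA",      # gene product the assertion is about
    "mapping_quality": "high",        # curator confidence in the ECO mapping
    "assertion_strength": "strong",   # how strongly the authors state the conclusion
}
print(annotation["eco_term"], "-", annotation["text"][:60], "...")
```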

 

154 HGNC: promoting standardized gene names for 40 years

Tweedie, Susan
HUGO Gene Nomenclature Committee (HGNC)

The HUGO Gene Nomenclature Committee (HGNC, www.genenames.org), working closely with authors, journals, annotating groups and other nomenclature committees, has been responsible for assigning approved human gene names for the last 40 years. While unique IDs are undoubtedly the most unambiguous way to identify genes, they are not practical for many forms of communication: biologists also need stable, memorable, and ideally pronounceable nomenclature for effective written and verbal communication about genes. Unfortunately, compliance in using approved nomenclature is inconsistent. Though usage of approved nomenclature has increased greatly over the last decade, some papers are still published using unapproved alias symbols, creating new names for ‘novel’ genes that already have approved nomenclature, or naming genuinely novel genes with symbols that cannot subsequently be approved. We discuss our ongoing efforts to encourage the use of official nomenclature, including: promotion of our multi-symbol checking tool; renewed approaches to journals that do not provide nomenclature guidelines; launching a new nomenclature blog; and writing a commentary on best practice and the pitfalls of using unapproved nomenclature. Additionally, as part of our efforts to extend approved nomenclature, via the Vertebrate Gene Nomenclature Committee (VGNC), to other well-studied vertebrates that lack their own dedicated nomenclature group, we plan to develop new minimal cross-species nomenclature guidelines. To date, we have named over 13K 1:1 human orthologs in each of four species: chimpanzee, cow, dog and horse (vertebrate.genenames.org).
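As a rough illustration of programmatic symbol checking (separate from the HGNC multi-symbol checking tool itself), the sketch below looks up a symbol’s status via the public REST service at rest.genenames.org. The endpoint path and the response field names used here are assumptions about that service and should be verified against its documentation.

```python
import requests

def check_symbol(symbol: str) -> None:
    """Look up a gene symbol via the HGNC REST service (rest.genenames.org).

    The endpoint path and JSON field names are assumptions based on the public
    documentation and may need adjusting.
    """
    url = f"https://rest.genenames.org/fetch/symbol/{symbol}"
    resp = requests.get(url, headers={"Accept": "application/json"}, timeout=10)
    resp.raise_for_status()
    docs = resp.json().get("response", {}).get("docs", [])
    if not docs:
        print(f"{symbol}: not an approved symbol (check aliases or previous symbols)")
    else:
        doc = docs[0]
        print(f"{symbol}: {doc.get('status')} -> {doc.get('hgnc_id')} ({doc.get('name')})")

if __name__ == "__main__":
    for s in ["BRAF", "P53"]:   # P53 is an alias; the approved symbol is TP53
        check_symbol(s)
```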

 

155 The Ontology for Biomedical Investigations (OBI) as a Curation Tool

Vita, Randi
La Jolla Institute for Allergy & Immunology

The Ontology for Biomedical Investigations (OBI) is a community-based ontology that provides terms for all aspects of biological and medical investigations. OBI is an OBO Foundry member and follows its principles, such as reusing existing ontology terms whenever possible. The scope of OBI includes terms covering investigations, such as investigation types, subject enrollment terms, material and data transformations, and, importantly, experimental assay terms. These terms can be used to curate biomedical datasets and experiments described in the scientific literature. Many projects currently use OBI. Here we describe how several example projects use OBI in data curation to annotate and standardize metadata, including the Immune Epitope Database (IEDB), the Eukaryotic Pathogen Genomics Database Resource (EuPathDB), and NASA GeneLab, as well as any curation effort that utilizes the Evidence and Conclusion Ontology (ECO).

 

156 The application of ontologies in analyzing the similarities of thousands of drug pairs

Wang, Zhigang
Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences

Two important applications of ontologies are knowledge reasoning and biological entity annotation. Ontologies suitable for drug annotation include ChEBI, SIDER, GO, and others. ChEBI stores chemical structures and the functions of structural groups. SIDER contains information on marketed medicines and their recorded adverse drug reactions. GO can be used to describe the molecular functions of drug-targeted genes and proteins, and their associated biological processes and cellular components. The ontology nodes and IS-A links form a directed acyclic graph (DAG), and the semantic similarity between two terms can be computed from the positions of parent and child terms within this graph. We downloaded more than 5000 drugs from the DrugBank database and pre-calculated semantic similarities using ChEBI, SIDER, and GO. Users can analyze drug similarities through a web service, and new uses can be identified for old drugs. We also developed an R package called DrugIDMapper that maps drug IDs between the DrugBank, ChEBI, DPD (Health Canada Drug Product Database), KEGG Compound, KEGG Drug, PharmGKB, PubChem Compound and PubChem Substance databases. ChEBISemSim, another R package we developed, includes the ChEBI ontological information. ChEBISemSim provides functions for viewing drugs in the ChEBI ontology hierarchical tree and calculating the semantic similarity between two drugs. The packages are available upon request (wangzg@pumc.edu.cn).
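To illustrate the general idea of DAG-based semantic similarity (not the specific measures implemented in ChEBISemSim), here is a small sketch computing a Wu-Palmer-style similarity over a toy IS-A graph; the terms and edges are invented for the example.

```python
# Generic sketch of DAG-based semantic similarity (Wu-Palmer style) over a toy
# ontology; not the specific measure or data used by ChEBISemSim.
from collections import deque

# child -> set of parents via IS-A links (toy DAG)
IS_A = {
    "root": set(),
    "drug": {"root"},
    "beta_blocker": {"drug"},
    "nsaid": {"drug"},
    "propranolol": {"beta_blocker"},
    "ibuprofen": {"nsaid"},
}

def ancestors(term):
    """All ancestors of a term (including itself), walking IS-A links upward."""
    seen, queue = {term}, deque([term])
    while queue:
        for parent in IS_A[queue.popleft()]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

def depth(term):
    """Longest IS-A path from the root down to the term."""
    if not IS_A[term]:
        return 0
    return 1 + max(depth(p) for p in IS_A[term])

def wu_palmer(a, b):
    """Similarity = 2 * depth(deepest common ancestor) / (depth(a) + depth(b))."""
    common = ancestors(a) & ancestors(b)
    lca_depth = max(depth(t) for t in common)
    return 2 * lca_depth / (depth(a) + depth(b))

print(wu_palmer("propranolol", "ibuprofen"))  # 2*1 / (2+2) = 0.5
```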

 

157 Metadata in the Human Cell Atlas

Welter, Danielle
EMBL-EBI

The Human Cell Atlas (HCA) project aims to build comprehensive reference maps of human cells that will further our understanding of biological processes and inform the diagnosis and treatment of disease. The HCA Data Coordination Platform (DCP) is responsible for providing the infrastructure and capacity to both host and deliver data openly to the scientific community. The DCP team have been working with the scientific community to support and develop data standards and a metadata specification for data coming into the platform that will ensure the data is interoperable and amenable to integrative analysis. Data interoperability is further achieved through the use of ontologies that provide terminology standards with well-defined semantics for describing experimental data coming into the HCA. The HCA metadata schema provides the descriptive framework needed to represent the data submitted to the HCA. Using JSON schema conventions and validation, the metadata schema defines the key concepts needed to describe relevant aspects of cellular-resolution sequencing and imaging experiments, as well as the links that can exist between concepts. Where possible, the metadata schema allows ontology annotations of concepts including species, disease, anatomy, cell types, and experimental techniques. The DCP aims to make the HCA metadata schema the de facto standard for single-cell experiment annotation, with a number of related projects already looking to adopt the standard for their own data collection. All our metadata schemas are publicly accessible at https://schema.humancellatlas.org/ and via our GitHub repo https://github.com/HumanCellAtlas/metadata-schema.
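As a minimal sketch of JSON-schema-based metadata validation with an ontology annotation, in the spirit described above: the schema fragment, field names and donor record below are invented for illustration and are not taken from the actual HCA metadata schema.

```python
# Minimal sketch of JSON-schema-style validation with an ontology annotation;
# this schema fragment and its field names are invented for illustration and
# do not reproduce the actual HCA metadata schema.
from jsonschema import validate

donor_schema = {
    "type": "object",
    "required": ["biomaterial_id", "species"],
    "properties": {
        "biomaterial_id": {"type": "string"},
        "species": {
            "type": "object",
            "required": ["text", "ontology"],
            "properties": {
                "text": {"type": "string"},
                # ontology term CURIEs such as NCBITaxon:9606
                "ontology": {"type": "string", "pattern": "^[A-Za-z]+:[0-9]+$"},
            },
        },
    },
}

donor = {
    "biomaterial_id": "donor_001",
    "species": {"text": "Homo sapiens", "ontology": "NCBITaxon:9606"},
}

validate(instance=donor, schema=donor_schema)  # raises ValidationError if invalid
print("donor metadata conforms to the example schema")
```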

 

158 Identifiers.org Compact Identifiers for robust data citation

Wimalaratne, Sarala
EMBL-EBI, Identifiers.org

Sarala M Wimalaratne, Manuel Bernal Llinares, Nick Juty and Henning Hermjakob

Compact identifiers have been informally and widely used for referencing life science data for many years, though the practice has largely been through ad hoc implementations serving specific use cases. We describe our implementation [1], which has already begun to be adopted by publishers. Compact Identifiers consist of an Identifiers.org-assigned unique prefix in combination with a locally (database-)assigned accession number (prefix:accession). Compact Identifiers are resolved to database records using information stored in an underlying Registry, which contains high-quality, manually curated information on over 700 data collections. This information includes the assigned unique prefix, a description of the data collection, the identifier pattern, and a list of hosting resources or resolving locations. When a Compact Identifier is presented to the Identifiers.org Resolver, it is redirected to a resource provider, taking into consideration information such as the uptime and reliability of all available hosting resources. Examples of Compact Identifiers include pdb:2gc4, GO:0006915, doi:10.1101/101279 and orcid:0000-0002-5355-2576. In addition, a formal agreement has been struck with the N2T resolver, based at the California Digital Library, to provide backup resolution services. Users can therefore resolve Compact Identifiers using either the Identifiers.org (https://identifiers.org) or N2T (https://n2t.net/) resolver. This implementation of Compact Identifiers has been adopted by Nature Scientific Data for data citations when linking to biomedical datasets with accession numbers [2].

[1] Wimalaratne, S.M. et al. Uniform resolution of compact identifiers for biomedical data. Sci. Data 5:180029, doi:10.1038/sdata.2018.29 (2018).
[2] Open Editorial. On the road to robust data citation. Sci. Data 5:180095, doi:10.1038/sdata.2018.95 (2018).
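As a small sketch of resolution in practice, the code below builds an Identifiers.org URL of the form https://identifiers.org/prefix:accession and follows the redirect to a hosting resource. Whether and where a given prefix redirects is determined by the resolver and Registry, so treat this as illustrative rather than a guarantee of behaviour for every prefix.

```python
# Sketch: resolve a Compact Identifier by requesting the Identifiers.org URL
# and following the redirect to a hosting resource.
import requests

def resolve(compact_id: str) -> str:
    resp = requests.get(f"https://identifiers.org/{compact_id}",
                        allow_redirects=True, timeout=10)
    resp.raise_for_status()
    return resp.url  # final location at a hosting resource

for cid in ["pdb:2gc4", "GO:0006915"]:
    print(cid, "->", resolve(cid))
```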

 

159 FAIRDOM: supporting FAIR data and model management

Wittig, Ulrike
Heidelberg Institute for Theoretical Studies

FAIRDOM (Findable, Accessible, Interoperable, Reusable Data, Operating procedures and Models) is an initiative to establish a sustained data and model management service for the European systems biology community (http://fair-dom.org). With the free SEEK data management software, FAIRDOM offers a data management platform for interdisciplinary projects, supporting the storage and exchange of data and models from research partners based on the FAIR principles. SEEK can either be installed, run and further developed as a project’s own instance, or be used via the web-accessible platform FAIRDOMHub, which offers public information and a password-protected user area. The SEEK data management system contains yellow pages, data files, models and SOPs (standard operating procedures), and is structured using the ISA standard (Investigation, Study, Assay). A variety of data security levels allow controlled access to digital assets (data, models, SOPs), secure sharing between project partners, and keeping data private temporarily. Data uploaded to SEEK can be displayed within the web interface if the file format is supported (e.g. Excel, Word, PDF), downloaded to local machines, or accessed automatically using an API. FAIRDOMHub/SEEK is used in many different national and international research projects and is an integral member of several infrastructure initiatives (e.g. ELIXIR, the German Network for Bioinformatics Infrastructure - de.NBI).
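To show what the ISA (Investigation, Study, Assay) organisation looks like, here is a small, purely illustrative sketch of that hierarchy as data classes; the class names and fields are simplified and do not reflect the actual SEEK data model or its API.

```python
# Illustrative sketch of the ISA hierarchy (Investigation > Study > Assay) used
# to organise project content; simplified, not the actual SEEK data model.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Assay:
    title: str
    data_files: List[str] = field(default_factory=list)  # linked digital assets
    sops: List[str] = field(default_factory=list)

@dataclass
class Study:
    title: str
    assays: List[Assay] = field(default_factory=list)

@dataclass
class Investigation:
    title: str
    studies: List[Study] = field(default_factory=list)

inv = Investigation(
    title="Glycolysis in S. cerevisiae",
    studies=[Study(title="Enzyme kinetics",
                   assays=[Assay(title="PFK activity assay",
                                 data_files=["pfk_rates.xlsx"],
                                 sops=["pfk_assay_protocol.pdf"])])],
)
print(inv.studies[0].assays[0].title)
```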

 

160 SABIO-RK: extraction of enzyme function data from STRENDA DB

Wittig, Ulrike
Heidelberg Institute for Theoretical Studies

SABIO-RK (http://sabiork.h-its.org) is a web-accessible, manually curated database for biochemical reactions and their kinetic properties, with a focus both on supporting computational modelling of biochemical reaction networks and on allowing experimentalists to gain further knowledge about enzymatic activities and reaction properties. SABIO-RK contains annotations to controlled vocabularies and ontologies, is highly interlinked with many other databases, and is integrated into the data workflow of several modelling and simulation tools. A flexible way of exporting database search results using web services or in a table-like format is provided. Both the export and the import of data are possible via standard data exchange formats. Data in SABIO-RK are mainly manually extracted from the literature but can also be uploaded directly from laboratories or other resources via SBML. Besides that, SABIO-RK supports the STRENDA initiative, which defines guidelines for reporting enzyme function data in publications to increase the reusability of data for databases and modelling tools, as well as to improve the reproducibility of results for experimentalists. During the manuscript submission process, STRENDA recommends that authors submit enzyme function data to STRENDA DB, which automatically checks the manuscript data for compliance with the STRENDA guidelines and supports the reviewing process. Immediately after final publication, SABIO-RK extracts the kinetic parameters, including experimental conditions, from STRENDA DB. This whole procedure allows structured and standardized data storage during the publication process, improves the data extraction and curation process for other databases, and gives the authors more responsibility for final data quality.
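As a rough, unofficial sketch of the web-service export mentioned above, the snippet below searches SABIO-RK for kinetic-law entry IDs. The endpoint path and query syntax used here are assumptions based on the public RESTful documentation and should be verified at http://sabiork.h-its.org before use.

```python
# Rough sketch of querying the SABIO-RK web services for kinetic law entries.
# The endpoint path and query field names are assumptions and should be
# checked against the SABIO-RK RESTful web service documentation.
import re
import requests

BASE = "http://sabiork.h-its.org/sabioRestWebServices"

def search_entry_ids(query: str) -> list:
    """Return kinetic law entry IDs matching a SABIO-RK search expression."""
    resp = requests.get(f"{BASE}/searchKineticLaws/entryIDs",
                        params={"q": query}, timeout=30)
    resp.raise_for_status()
    # the response lists entry IDs (XML); pull out the numeric IDs
    return re.findall(r"\d+", resp.text)

ids = search_entry_ids('Organism:"Homo sapiens"')
print(f"{len(ids)} kinetic law entries found for human")
```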

 

161 Translation of OBO biomedical ontologies to Chinese and their visualization

Yang, Xiaolin
Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences

In the past decades, several biomedical terminologies have been translated into Chinese and widely used in China. However, these translations have focused mostly on the clinical domain. The development of the China Precision Medicine Initiative has stimulated the need for more data standards and ontologies in the life sciences. Furthermore, to ensure semantic interoperability it is not sufficient to perform only the natural language translation of terms or ontology labels; the relationships between ontologies in different languages should be mapped and represented correctly, and the term annotations, which help humans understand the terms, should also be translated. We have developed Chinese versions of several OBO Foundry ontologies, including BFO, OBI, IAO, RO, OGMS, PATO, CLO and ICO. We first invited professionals to carry out the natural language translation. The translated content includes, at a minimum, the labels and the definitions of ontology classes. The translation proceeded from root to leaves to keep the definitions consistent in logic and language. Second, the Chinese labels, definitions, and other annotation properties were integrated into the original (English) ontology framework to build a bilingual ontology in the OWL format. For every ontology class, the W3C-specified xml:lang attribute was used to separate the expressions in different languages. In some ontologies, such as CLO, new classes and annotation properties representing information about cell line cells in the China cell repository were also integrated into the ontology. Finally, the Chinese-English bilingual ontologies were uploaded to the ontology repository MedPortal (http://medportal.bmicc.cn), a service provided as part of the OntoChina program. Users can browse and use the bilingual versions of BFO, OBI, IAO, RO, OGMS, CLO and PATO from MedPortal. The OBO ontology translations will promote the OBO principles and the application of ontologies in China, and will also promote semantic interoperability between languages.
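For illustration of the language-tag mechanism, here is a small sketch that attaches an English and a Chinese rdfs:label to one ontology class using rdflib, so that the serialized OWL/RDF carries xml:lang tags. The class chosen and the Chinese label shown are illustrative examples only, not the project’s actual translations.

```python
# Minimal sketch: adding a bilingual label pair to an ontology class using
# language-tagged literals, which serialize with xml:lang in OWL/RDF.
from rdflib import Graph, Literal, Namespace, RDFS

OBO = Namespace("http://purl.obolibrary.org/obo/")
g = Graph()

cls = OBO["OBI_0000070"]                        # OBI 'assay' class, as an example
g.add((cls, RDFS.label, Literal("assay", lang="en")))
g.add((cls, RDFS.label, Literal("检测", lang="zh")))  # illustrative translation only

# Serializing to RDF/XML emits the labels with xml:lang="en" / xml:lang="zh"
print(g.serialize(format="xml"))
```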