EMIF Platform

EMIF-Platform Project Overview

EMIF has ultimately built an integrated, efficient Information Framework for consistent re-use and exploitation of available patient-level data to support novel research. This supports data discovery, data evaluation and then (re)use.

EMIF-Platform System Overview

EMIF-Platform has developed an IT platform allowing access to multiple, diverse data sources. The EMIF-Platform made this data available for browsing and allows exploitation in multiple ways by the end user. EMIF-Platform has leveraged data on more than 62 million European adults and children by means of federation of healthcare databases and cohorts from 7 different countries (DK, IT, NL, UK, ES, EE), designed to be representative of the different types of existing data sources (population-based registries, hospital-based databases, cohorts, national registries, biobanks, etc.).

The EMIF Data Catalogue is now available outside of EMIF to bona fide researchers; more information is available here.

Prof Johan van der Lei, EMC Rotterdam

EMIF Platform EHR Video

This video provides an overview of a service approach via EHR data post-IMI

EMIF Platform Cohort Video

This video provides an overview of a service approach via Cohort data post-IMI

EMIF-Platform Objectives


To achieve EMIF-Platform objectives, the project was divided into eight work packages (WP9-WP16 below). In addition, there are four EMIF-AD work packages and four EMIF-MET work packages to explore.

EMIF-Platform Work Packages
  • WP9: Framework requirements & evaluation
  • WP10: Governance
  • WP11: Harmonization & semantics
  • WP12: Data extraction, benchmarking, aggregation & linkage
  • WP13: Analysis, processing & visualization methods and tools
  • WP14: Architecture, solution development, security & privacy technologies
  • WP15: Use and sustainability models, community building & outreach
  • WP16: Use and sustainability models, community building & outreach

EMIF-Platform Achievements

Tool Development

EMIF-Platform Tool Development
  • Key tool developed – EMIF Catalogue as data “shop window” to support the platform architecture, also being utilised by other initiatives (ADVANCE, MOCHA, IMI-EPAD, DP-UK)
  • Development & integration of TASKA in the EMIF catalogue to manage the workflow.

Common Data Model

EMIF-Platform Common Data Model: OMOP-CDM


  • Mapped 10 European Databases to the OMOP-CDM
  • Contributes to the extension of the CDM and Standardized Vocabularies to accommodate the European data
  • Supports the European OHDSI initiative to stimulate adoption of the CDM and collaboration across Europe
EMIF-Platform Common Data Model: ATLAS


  • ATLAS tool developed by OHDSI is used to conduct scientific analyses on standardized health data
  • EMIF is evaluating the OHDSI tools in the EMIF community and is actively contributing to their further development

Biomarker Discovery

EMIF-Platform Biomarker Discovery
  • Raw cohort data integration and analysis via tranSMART and allied bioinformatics tooling development to support biomarker discovery in AD and metabolic disorders
    • 3423 subjects from 14 AD cohorts harmonized
    • Support of multi-omics data analysis.

EMIF-Platform Tools

EMIF-Platform Tools (overview diagram): a Governance & Security/Integration Layer, underpinned by the Ethical Code of Practice (ECoP), supports two workflows of data discovery, data assessment, data access and data re-use. The EHR architecture supports a ‘generic’ discovery & (re)use workflow through the EMIF Catalogue, OMOP CDM/OHDSI tools, data extraction tooling, Workflow Management and the Private Remote Research Environment (PRRE). The cohort architecture supports the EMIF-AD discovery & (re)use workflow through the Cohort Selection Tool (CST; Catalogue), the Variable Selection Tool (VST), the Participant Selection Tool (PST), the “Switchbox”, the PRRE and Workflow Management.

WP 9 Framework requirements & evaluation

Carlos Díaz (Synapse) – Peter Egger (GSK)



The objectives are:

  1. To continuously elucidate user requirements and evaluate the EMIF-Platform as it is progressively developed and deployed.
  2. For each cycle of the work plan, ensure adequate input from prospective users, especially the present and future research projects, and input from EFPIA participants, as important future users of the platform.
  3. For each cycle of the work plan, evaluate from the users’ perspective the results of that cycle, feeding the results back to the development work packages.


Find a balance between tool development and answering scientific questions

Three user input/development cycles have been designed:

  • Cycle 1: design of use cases 1-6 to initiate interactions between users and data custodians, and to start development of first tool prototypes.
  • Cycle 2: evaluation of use cases 2-6 and production of user-driven high- and low-level requirements of the Platform. Increasing interactions with the verticals on the design and implementation of use cases 7 and 8, which explored the feasibility of data extraction runs for simple endpoints.
  • Cycle 3: development of use cases 9-13 by dedicated mixed teams (end users and data custodians) and full protocols to address complex scientific questions and test the refined prototypes developed so far.

Definition of use cases

  • WP9 established as entry point of all requirements, including requirements articulated by the Research Topics (AD and Metabolic).
  • Specification of high and low-level requirements of the Platform. Definition of procedures for user requirements gathering and evaluation, including set up of user groups.
  • Definition of Use cases 1-6 in cycle 1, 7-8 in cycle 2 and 9-13 in cycle 3 (ongoing). Coordination of protocol development teams.

Evaluation of the EMIF Catalogue

  • Evaluation of the first set of use cases and issue of user requirements for the Platform. These requirements validate the correct orientation of the tools developed for the Platform and provide feedback to developers to allow prioritisation of actions for further tool development.
  • Follow-up with WP14 on the developments and implementation of user requirements as issued from the evaluation of use cases.
  • Evaluation of the first version of the EMIF Catalogue with potential end users, and a second evaluation in 2015 for v3.

WP 10 Governance, federation, DB fingerprinting, legal context & ethics

Nigel Hughes (Janssen) - TBD (EMC)



The objectives are to:

  1. Provide detailed information (the “fingerprint”) of each participating database with respect to population included in that database, the data available in each database, the local data model, the mechanism generating the data, and the ability to address specific re-use issues (such as ability to contact the GP or patients, or inclusion of patients in clinical trials).
  2. Specify the procedures that will be followed in the governance of the federation, including the required safeguards and criteria for admission.
  3. Define the different types of users of the EMIF-Platform, the levels of access for each type of user, and the procedures and safeguards to handle permissions.
  4. Ensure that the project follows ethical principles and conforms to relevant international and national regulations in this regard.


  • The Fingerprint Browser is a core attribute of the EMIF Catalogue and will be the “shop window” of the EMIF Platform
  • The Ethical Code of Practice (ECoP) was presented widely and internationally as it developed → EMIF Ethical Code of Practice
  • Multiple presentations at international conferences on collaboration, governance, ethics and federation


Fingerprint browser database [D10.1]

  • Pivotal metadata formulation and collection for EMIF Catalogue
  • Harmonised metadata across population and cohort sources within EMIF-AD, EMIF-MET and disease-agnostic communities
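As an illustration of the fingerprint idea, the sketch below models a data source's metadata as a structured record. The field names and the example database values are invented for illustration only and are not the actual EMIF metadata schema.

```python
# Illustrative sketch (not the real EMIF schema): a database "fingerprint"
# as structured metadata describing a data source and its re-use capabilities.
from dataclasses import dataclass, field

@dataclass
class Fingerprint:
    name: str
    country: str
    source_type: str                 # e.g. "population-based registry", "cohort"
    population_size: int
    data_domains: list = field(default_factory=list)  # e.g. diagnoses, prescriptions
    can_contact_gp: bool = False     # example re-use capability flag

    def covers(self, domain: str) -> bool:
        """Check whether the source offers a requested data domain."""
        return domain in self.data_domains

# Invented example values, for illustration only
example = Fingerprint("ExampleDB", "NL", "GP database", 2_500_000,
                      ["diagnoses", "prescriptions"], can_contact_gp=True)
```

A catalogue search then reduces to filtering a collection of such records on the requested domains and capabilities.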


Ethical Code of Practice (ECoP) [D10.4]

  • Wide ranging, ethical practice and governance framework
  • Encompassing critical policy, guidance and legal requirements within EU
  • Federation structure and access governance


Collaborations and Outreach

WP 11 Harmonization & semantics

Dipak Kalra (UCL) – Michel Van Speybroeck (Janssen)



The objectives are:

  1. To analyse the harmonisation and semantic needs, and devise a common framework of reference that guides all further interoperability work in the project, including metadata.
  2. To develop a common data model to be used throughout the EMIF-Platform architecture to represent queries and result sets for aggregate and patient-level data.
  3. To design and implement services to support specific cross-mappings between terminologies used for clinical/medical terms, medicinal products and units of measurement.


  • Deep data integration for cross cohort analysis, presentation at Learning Health System in Europe, Brussels, September 2015
  • The IMI EMIF Project: Managing Data For Alzheimer Research Using tranSMART, presentation at 12th Annual Pharmaceutical IT Congress, September 2014
  • Identify diseases from data sources where information on diagnosis is incomplete: experiences from Europe and beyond, Institut universitaire de médecine sociale et préventive (IUMSP), CHUV et Université de Lausanne September 2015


Specification of EMIF Knowledge Objects

  • Knowledge Objects (KO) are semantic models that represent concepts found in data items in a systematic way, along with relevant metadata, which map concepts used in research queries to underlying data sources
  • Successful pilot with a partial ontology and KO representation for dementia concepts, in collaboration with EMIF-AD, using Protégé and the ONTOP platform; through this several guiding principles for KO design have emerged
  • Development of the ‘Participant Selection Tool’ (PST) using the KO. The PST allows researchers to determine the number of participants meeting a set of clinical criteria across federated data sources.
  • KO can also now incorporate security features such that a semantic web reasoner can be used to determine if a user has access to a particular variable
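The mapping-plus-security idea behind Knowledge Objects can be sketched minimally as follows. The `KnowledgeObject` class, the unit-conversion rule and the role names are invented stand-ins for the real semantic-web machinery (Protégé/ONTOP), shown only to illustrate the concept.

```python
# Minimal sketch, not the real KO framework: a KO bundles metadata,
# an executable mapping rule, and access restrictions.
class KnowledgeObject:
    def __init__(self, concept, unit, rule=None, readable_by=()):
        self.concept = concept                 # concept the KO represents
        self.unit = unit                       # unit of the stored data
        self.rule = rule or (lambda v: v)      # mapping to the global concept's unit
        self.readable_by = set(readable_by)    # roles allowed to read this variable

# A local KO storing age in months, mapped to a global "age in years" concept
local_age = KnowledgeObject("age", "months", rule=lambda m: m / 12,
                            readable_by={"researcher"})

def resolve(ko, value, role):
    """Stand-in for the reasoner: map the value only if the role has access."""
    if role not in ko.readable_by:
        raise PermissionError("no access to this variable")
    return ko.rule(value)
```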

Adoption of OMOP as the CDM for EHR data

  • Standardises the representation of EHR data for the generation of dashboards, and for suitability & feasibility queries
  • Collaboration with OHDSI on requirements and tooling
  • Six selected data sources are starting to pilot mapping their data to this model

Development of terminology mappings and inference rules to detect clinical events

  • Semantic mapping tools have been implemented to capture the definition of medical conditions from EHRs, using UMLS
  • Conducted terminology mapping for the association of non-alcoholic fatty liver disease with cardiovascular and liver morbidity in electronic health record databases: 21 events and comorbidities were mapped:
    1. Alcohol abuse
    2. Diabetes type 2 or unspecified
    3. Hypertension
    4. Ischemic heart disease
    5. Ischaemic/unspecified stroke
    6. Hepatocellular carcinoma
    7. Alcoholic liver disease
    8. Non-alcoholic cirrhosis
    9. Other cirrhosis
    10. Unspecified cirrhosis
    11. Non-alcoholic liver disease
    12. Non-alcoholic steatohepatitis
    13. NAFLD or NASH (when it is not possible to distinguish between the two using codes)
    14. Alcoholic liver disease
    15. Hepatitis
    16. Drug induced liver toxicity
    17. Alpha1 antitrypsin deficiency
    18. Wilson’s disease
    19. Pregnancy related liver disorders
    20. Primary sclerosing cholangitis
    21. Hemochromatosis
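A terminology mapping of this kind boils down to a set of codes per clinical event. The sketch below uses invented placeholder codes (not the actual UMLS/ICD mappings produced in EMIF) to show how events can then be detected in a coded record.

```python
# Illustrative only: event definitions as sets of diagnosis codes.
# The codes below are invented placeholders, not the real mappings.
EVENT_CODES = {
    "Hypertension": {"I10", "I11"},
    "Diabetes type 2 or unspecified": {"E11", "E14"},
    "Hepatocellular carcinoma": {"C22.0"},
}

def detect_events(patient_codes):
    """Return the clinical events whose code set intersects the record."""
    found = set(patient_codes)
    return sorted(event for event, codes in EVENT_CODES.items() if codes & found)
```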


  • ETRIKS, EHR4CR, the tranSMART and OHDSI communities

WP 12 Data extraction, benchmarking, aggregation & linkage

Peter Rijnbeek (EMC) – Martijn Schuemie (Janssen)



The objectives are:

  1. To develop a common framework for extracting data across different data sources.
  2. To benchmark the results of the extractions across different data sources with the explicit objective to account for any differences observed.
  3. To provide tools for linking data between different sources that will allow individuals to be tracked through those data sources, taking into account the transition from childhood to adulthood.
  4. To aggregate data in order to comply with the privacy and governance rules that control the re-use of data.


  • D12.1 – Data extraction software v1
  • D12.2 – Data extraction software v2
  • D12.3 – Benchmarking and data quality analysis
  • D12.4 – Report on data linkage
  • D12.5 – Interim report on specialized data extraction, benchmarking, aggregation and processing


Jerboa Reloaded


  • Several modules have been developed for Jerboa Reloaded to support data source discovery and to enable study designs to run in a distributed manner.
  • The tool is being tested in several Use Cases developed in strong collaboration with the Metabolic and Alzheimer teams.

Study Workflow

  • A first version of a workflow has been developed to perform studies in a fully harmonized approach, based on a common generic protocol and common tools for data extraction, data transformation and processing.

Data Derivation Workflow

  • A first version of a data derivation workflow has been developed that supports identification of subjects in a data source who have a specific clinical condition.
  • The workflow is supported by prototypes of tools.

Record Linkage Investigative Work

  • A report has been created to establish the current state of record linkage within twelve EMIF data source partners to develop a better understanding of the challenges that each face when they perform linkage of health records.

WP 13 Analysis, processing & visualization methods and tools

Alvis Brazma (EMBL) – Rudi Verbeeck (Janssen)



The objectives are:

  1. To develop dedicated data analysis algorithms and tools to help exploitation of the data made available through the platform, leveraging and re-using already existing tools when possible;
  2. To develop visualisation tools for EHR based on a generalisation of the concept of genome browsers and distributed annotation systems;
  3. To develop specific analysis tools for broad translational research datasets used in the vertical projects and apply them to data analysis problems arising in the context of such projects.


  • Jaak Vilo, UTARTU, “Estonian EU Projects in Bioinformatics”, Presentation to the Delegation of German E Health experts, 8 April 2014, Tallinn, Estonia;
  • Jaak Vilo, UTARTU, “Big Data in small Estonia”, Presentation to European Science Journalists, 15 May 2014, Tallinn, Estonia;
  • Alvis Brazma, EMBL-EBI, Invited talk at Oslo University, 7 May 2015, Oslo, Norway;
  • Alvis Brazma, EMBL-EBI, “Transcriptome structure in normal and cancer gene expression”, Next Generation Sequencing BioteXel Forum, 23 March 2015, Glasgow, UK;
  • Natalja Kurbatova, EMBL-EBI, “Multi-omics data analysis using docker cluster”, Poster for EBI Day, 27 October 2015, Cambridge, UK.


Register of potential EMIF components

  • Register of analysis tools and platforms;
  • Evaluation of potential components.

tranSMART version 1.2

  • Consolidation of divergent development tracks;
  • New functionality: cross study analysis;
  • New functionality: support of -omics data.

Multi-omics Research Environment

  • Cloud-based research environment that is transferable across cloud platforms and scalable with workload;
  • Integration of tranSMART, Docker Cluster, iRODS and R Cloud components, where R Cloud is used to parallelize heavy R jobs;
  • Includes a variety of -omics data analysis and visualisation pipelines adapted for cloud computing;
  • Used for the AD vertical data analysis available through the platform.


  • From the research perspective, we have multiple collaborations with the AD vertical, aiming to discover AD biomarkers and to perform integrative (multi-modal) data analysis;
  • From the technical perspective, we collaborate with the EMBL-EBI Embassy Cloud and with Amazon Web Services, testing different cloud platforms to develop a transferable cloud environment.

WP 14 Architecture, solution development, security & privacy technologies

José Luis Oliveira (UAVR) – Philippe Baudoux (UCB)



The objectives are:

  1. To engineer an ICT infrastructure for the federation of resources, including dynamic support for common data models and variable operation procedures.
  2. To provide the technical infrastructure for secure, cross-project data exchange.
  3. To create a development environment for the EMIF-Platform ecosystem, enabling the creation of software such as a clinical information browser, evolving biomedical knowledge bases and Private Remote Research Environments.


  • P. Lopes, L. Bastião Silva, J. L. Oliveira, “Challenges and Opportunities for Exploring Patient-Level Data”, BioMed Research International, 2015.
  • L. A. Bastião Silva, C. Díaz, J. van der Lei, and J. L. Oliveira, “Architecture to Summarize Patient-Level Data Across Borders and Countries”, MEDINFO 2015, Brazil, 2015.
  • D. Campos, J. Lourenco, S. Matos, and J. L. Oliveira, “Egas: a collaborative and interactive document curation platform”, Database, 2014.
  • J. L. Oliveira, “Biomedical approaches for efficient reuse of health data”, International Symposium on Clinical and Translational Research Informatics, Melbourne, 2014.
  • J. L. Oliveira, “Bioinformatics: Cutting Edges through in silico Approaches”, 60th IPSF World Congress, 2014.
  • L. Bastião Silva, C. Costa, and J. L. Oliveira, “Semantic search over DICOM repositories”, IEEE International Conference on Healthcare Informatics 2014 (ICHI 2014)


EMIF Catalogue design and architecture

  • Community-based access and management
  • Distributed management of groups and data
  • Database fingerprinting
  • Free text and advanced searches
  • Suitability assessment (through Jerboa and Achilles)
  • Integrated components to facilitate communication
  • Open architecture supported by plugins
  • Role-Based access control
  • Continuous evolution (version 1 up to version 4)
  • Four communities available: EMIF-EHR (14 databases), EMIF-AD (45), EPAD (18), ADVANCE (2)
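Role-based access control, one of the Catalogue features listed above, can be illustrated minimally as below. The role names and permitted actions are invented examples, not the Catalogue's actual roles.

```python
# Minimal sketch of role-based access control; roles and actions
# are invented examples, not the EMIF Catalogue's real role model.
ROLE_PERMISSIONS = {
    "visitor":    {"browse"},
    "researcher": {"browse", "search", "request_access"},
    "custodian":  {"browse", "search", "request_access", "edit_fingerprint"},
}

def allowed(role, action):
    """True if the role's permission set includes the requested action."""
    return action in ROLE_PERMISSIONS.get(role, set())
```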

TASKA workflow manager

  • Online system to help support feasibility studies
  • Easy-to-use visual editor for tasks and workflow definition


  • Workflow for performing multi-database study on EHR Data with Jerboa as data transformation tool
  • New modules, driven by the Use Cases
  • Module to communicate with the Observational Medical Outcomes Partnership (OMOP) model

tranSMART and Knowledge Objects

  • Support to EMIF-AD and EMIF-Metabolic with harmonized clinical and omics data upload
  • Tools for knowledge object infrastructure individually tested


  • OHDSI, (www.ohdsi.org) and OMOP - Observational Medical Outcomes Partnership (omop.fnih.org)
  • tranSMART (www.transmartproject.org/)
  • Catalogue-Jerboa-Achilles

WP 15 Use and sustainability models, community building & outreach

Eva Molero (Synapse) – Bart Vannieuwenhuyse (Janssen)



The objectives are:

  1. To analyze the needs of different key stakeholders (potential users of the platform) with the aim to develop a sustainability model (business model) that serves those needs while being acceptable to data custodians
  2. To develop and test different sustainability models derived from the analysis above, which ensure the long-term maintenance and operations of the information framework developed during the project.
  3. Outreach: WP15 also aims at connecting EMIF with other relevant projects (IMI and other) and has the ambition to identify and connect new data sources to the network


  • Collaboration workflow designed and implemented to help detect, assess, prioritise and execute collaborations with other initiatives and projects. Project fiches developed.
  • Incorporation of IDIAP JORDI GOL, a new regional EHR data source, as full partner from June 2015.
  • Collaboration with OHDSI and use of its OMOP common data model. Studying possible leverage of some OHDSI tools (e.g. ACHILLES).
  • Established contacts with public organizations (EMA, ENCePP) and other projects (ADVANCE, EHR4CR, EPAD, DP-UK, UK-CRIS, BD4BO projects). Some specific collaborations with these are ongoing, others are being planned.


Market analysis completed

  • In-depth study of users, contexts of use, and possibilities for data sources incentives.
  • Interviews conducted with a variety of potential users within the consortium, centred around four axes: (1) Willingness to share; (2) Willingness to pay; (3) Platform cost structure; and (4) IP and governance structure.
  • Further in-depth market analysis within the selected application domains.

Development of First Business Plan draft

  • Definition of the first business plan draft, D15.5, after a series of face to face workshops including EFPIA and data custodian representatives.
  • High-level service delivery model under refinement
  • P&L and value proposition discussions started; P&L model developed, with key assumptions being further validated

Definition of application domains

Two application domains have been defined as examples to be piloted in EMIF:

  • Clinical trial support and optimisation
  • Post-authorisation safety monitoring

WP 16 Use and sustainability models, community building & outreach

Carlos Díaz (Synapse) – Bart Vannieuwenhuyse (Janssen)

All EMIF partners


The objectives are:

The overall objective of this WP is to ensure the successful implementation of the project on a scientific, financial and management level. To achieve this we focus on five main areas:

  1. Ensure efficient management of scientific activities and financial allocation/reporting, and organisation of project meetings
  2. Ensure effective communication between partners to support sharing of best practice, maximise synergies and prevent duplication
  3. Design and implement dissemination plans and the communication strategy
  4. Manage Intellectual Property rights and derive value from the foreground information generated
  5. Ensure appropriate risk management, particularly for joint risks across the whole programme and at the interface of the three topics


Day-to-day project management


Progress trackers

  • EMIF 1000 sample cohort (AD)
  • Cohort data upload in tranSMART (Cross-topic)
  • Recruitment for preclinical cohort (AD)
  • Use Cases progress (Cross-topic)
  • OMOP common data model mapping progress (Platform)


EMIF Dissemination Tracking


Reporting progress 2013-2015

  • Periodic reports 1-3 submitted
  • Interim status reports for M6 / M18 / M30 completed
  • 65 deliverables submitted
  • Numerous face-to-face and online meetings organized
Charts: EMIF total funding claimed 2013-2015; reported vs. total funding

Keeping project documentation up-to-date

  • Amendments 1-3 approved
  • Contact and mailing lists updated on regular basis

EMIF Communication Task Force



EMIF Code of Practice (ECoP)

The EMIF platform may only be used for assessing the feasibility of a study and for conducting research by bona fide research organisations, with the objective of discovering new knowledge intended for the public good and made publicly accessible (i.e., published).

ECoP was developed in order to help ensure that:

Data sources:
  • will always have autonomy over which data are made accessible and for which types of research
  • will always determine ethical acceptability and scientific validity
  • must be transparent about their data

Data users:
  • must adhere to the ethical rules and privacy protection policies of each data source
  • may only use the data for the specific agreed research purposes
  • must acknowledge the sources of the data they have used, and EMIF

EMIF Catalogue

The key idea of the EMIF Catalogue is to allow researchers to find specific databases aligned with their research purposes, providing a summarized overview of a number of geographically scattered healthcare databases. To accomplish this, the EMIF Catalogue is a flexible web system supported by the Community concept, i.e., a group that brings together databases and users with shared clinical interests.



TASKA


TASKA is an innovative platform designed to streamline the creation of modular, easily extendable workflows for managing data extraction and general work processes. It is based on a Software-as-a-Service approach, making it versatile and easy to integrate with third-party applications.



This platform allows several users to collaborate and interact in the creation and execution of distributed workflows, relying on an easy-to-use interface for managing complex procedures.
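A TASKA-style workflow is essentially a dependency graph of tasks. The sketch below is not TASKA's actual data model; it only illustrates, with invented task names, how an execution order respecting the dependencies can be derived.

```python
# Hedged sketch of a workflow as a dependency graph (invented task
# names); a topological sort yields a valid execution order.
from graphlib import TopologicalSorter

workflow = {                  # task -> the tasks it depends on
    "extract":   set(),
    "transform": {"extract"},
    "aggregate": {"transform"},
    "report":    {"aggregate"},
}

def execution_order(wf):
    """Return the tasks in an order where dependencies always run first."""
    return list(TopologicalSorter(wf).static_order())
```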

Jerboa Reloaded

Work package 12 (Data extraction, benchmarking, aggregation & linkage) is mainly involved in the important and challenging scenario in which informed consent cannot be obtained. For example, in the case of electronic healthcare records (EHR) collected from general practitioners, access to patient data will often be restricted to anonymized and aggregated data only.



The Jerboa Reloaded extraction tool is developed in the EMIF project to support data extraction and processing from the EHR databases. It is used in a so-called distributed network design, i.e. it runs de-identification, linkage, analysis and aggregation locally at each data source site. Jerboa runs a script that contains all parameters of a specific study design. This has two advantages: the local analyses are performed in a common, standard way, not subject to small differences in implementation by local statisticians; and for each study, only the data necessary for that particular study is shared in the analytical dataset.
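The distributed network design can be illustrated as a site-local step that pseudonymizes identifiers and shares only aggregated counts. The record layout and hashing scheme below are invented for illustration and are not Jerboa's implementation.

```python
# Sketch of site-local de-identification and aggregation: only the
# aggregated counts leave the site, never patient-level rows.
# (Invented record layout; this is not Jerboa's actual code.)
import hashlib
from collections import Counter

SITE_SALT = "site-secret"  # kept local at the data source, never shared

def pseudonymize(patient_id):
    """One-way pseudonym for a local patient identifier."""
    return hashlib.sha256((SITE_SALT + patient_id).encode()).hexdigest()[:12]

def local_aggregate(records):
    """records: (patient_id, event) pairs -> count of distinct patients per event."""
    seen = {(pseudonymize(pid), event) for pid, event in records}
    return Counter(event for _, event in seen)
```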

OCTOPUS Private Remote Research Environment

The OCTOPUS infrastructure is used as a prototype for the private remote research environment (PRRE) in WP12. It allows for secured file transfer from and to the data custodians and can be used to collaborate on the aggregated data generated by Jerboa Reloaded.

OCTOPUS Private Remote Research Environment

The OCTOPUS remote research environment is a socio-technological framework developed by Erasmus MC in the past, which has already proven its value in various projects. It stimulates geographically dispersed research groups to collaborate and has resulted in consortia engaged in all phases of drug safety research. To achieve a successful and sustainable collaboration, database custodians should be more than just data suppliers: as most of the custodians reside in research institutes, analytical tasks should be distributed as well. The main purpose of OCTOPUS is to stimulate such collaborative drug safety research in a secured environment.

In EMIF a dedicated PRRE is currently being developed that is more scalable and further optimized based on the experience with OCTOPUS.

Participant Selection Tool

The Participant Selection Tool allows researchers to get an overview of patient profiles in a given cohort, filtering on a set of predefined key characteristics. The tool has currently been built to provide this capability for AD cohort data sets.



The user interface is designed to minimise the learning curve. Federated data sources are represented in a single tabular or graphical view. Filtering on categorical, date or continuous key variables is possible and the result is a count of the number of matching patients across the different cohorts.



The basis of the tool is the ‘Knowledge Object’. These knowledge objects are ontologies to which the different data sources (in this case, cohort data) can be mapped. The use of semantic technology offers the required flexibility as the ontology develops or as additional data sources are added.

The tool can accommodate different projects (e.g. a single data source can participate in one or more projects), and additional characteristics can be added for a given project. The same framework will also be used to develop additional tools, ultimately enabling an integrated data pipeline for cohort data.
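The count-only query the PST performs can be sketched as below, with invented cohort data and filter criteria; the real tool evaluates criteria against federated sources through the Knowledge Objects rather than in-memory lists.

```python
# Invented example: the same criteria applied to each federated cohort,
# returning only the count of matching participants per source.
COHORTS = {
    "cohort_A": [{"age": 72, "diagnosis": "AD"}, {"age": 65, "diagnosis": "MCI"}],
    "cohort_B": [{"age": 80, "diagnosis": "AD"}],
}

def count_matches(criteria):
    """criteria: variable name -> predicate; returns per-cohort match counts."""
    return {
        name: sum(all(pred(subj.get(key)) for key, pred in criteria.items())
                  for subj in subjects)
        for name, subjects in COHORTS.items()
    }

counts = count_matches({"age": lambda a: a >= 70, "diagnosis": lambda d: d == "AD"})
```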

Variable Selection Tool (VST)

The VST provides the researcher with an overview of available variables (counts, not values), followed by a data access request to the selected cohort owners.




tranSMART

TranSMART is a knowledge management application built from open source components to investigate correlations between genetic and phenotypic (clinical) data to aid in predictive biomarker discovery.

It consists of a web-based graphical data mining application that connects to a server based data warehouse. TranSMART was made open source in 2012. The tranSMART Foundation was established in 2013 to provide governance and coordination for new developments. Important contributions have come from IMI eTRIKS, CTMM TraIT, Pfizer, Sanofi and Janssen, amongst others.

TranSMART can combine clinical and high dimensional data, such as gene expression, single nucleotide polymorphisms (SNPs), Rules-based medicine (RBM), genome-wide association studies (GWAS), copy number variations (CNV), etc. In EMIF, the main emphasis has been on harmonized clinical data. High dimensional data is expected from e.g. the EMIF-AD 1000 samples cohort.

TranSMART allows easy generation of queries by phenotypes, genotypes, or a combination. Study groups can be formed ad hoc to generate summary statistics or for hypothesis testing. Several advanced analysis pipelines are built in, such as boxplots with ANOVA, scatter-plots with linear regression, Kaplan-Meier plots with survival analysis, etc. GenePattern functionality is available for gene expression or proteomic analysis. Subjects from individual cohorts can be pooled in a virtual cross trial cohort for a unified data analysis.
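The ad-hoc group comparison described above amounts to splitting subjects on a phenotype and summarising an outcome per group. The sketch below uses invented subject data and plain summary statistics rather than tranSMART's built-in analysis pipelines.

```python
# Toy example of an ad-hoc group comparison: split subjects on a
# phenotype and compute a per-group summary statistic.
# (Invented values; real analyses run inside tranSMART's pipelines.)
from statistics import mean

subjects = [
    {"genotype": "APOE4", "mmse": 22}, {"genotype": "APOE4", "mmse": 24},
    {"genotype": "other", "mmse": 28}, {"genotype": "other", "mmse": 29},
]

def group_means(data, group_key, value_key):
    """Group rows by one variable and average another per group."""
    groups = {}
    for row in data:
        groups.setdefault(row[group_key], []).append(row[value_key])
    return {g: mean(vals) for g, vals in groups.items()}
```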

Multi-omics Research Environment (MORE)

Cloud computing provides users with a number of benefits: reduction of computational costs, universal access, up-to-date software, choice of applications, and flexibility. There are many cloud platforms that research projects can choose from: Amazon Web Services, OpenStack, VMWare, Google Cloud, etc. MORE is a transferable solution between different cloud platforms and has specialised components for clinical and -omics data analysis. In addition, MORE has a flexible architecture that allows new tools and pipelines to be added on request.

The following pipelines and tools are currently available in MORE (version 2):

  • tranSMART for clinical data analysis;
  • R Cloud for R parallel computing and R specific analysis;
  • iRAP pipeline adapted for the Docker cluster to analyse transcriptomics sequencing data;
  • NGSeasy pipeline adapted for the Docker cluster to analyse genomics sequencing data;
  • MZmine2 adapted for the Docker cluster to analyse proteomics and metabolomics LC-MS data;
  • Sequence Imp pipeline adapted for the Docker cluster to analyse microRNA sequencing data.

The tranSMART, Docker cluster and R Cloud are connected and use a shared file system. The Docker cluster and R Cloud benefit from the scalability of cluster computing usage – multiple VMs, job queues and task scheduler. New resources are added when needed.
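The queue-plus-workers pattern described here can be sketched as follows. The job names are taken from the pipeline list above, but the scheduling code is an invented stand-in for the real Docker cluster / R Cloud setup.

```python
# Minimal sketch of a job queue with worker threads, as a stand-in
# for MORE's cluster scheduling (not its actual implementation).
import queue
import threading

jobs = queue.Queue()
results = []
lock = threading.Lock()

def worker():
    while True:
        job = jobs.get()
        if job is None:            # sentinel: shut this worker down
            jobs.task_done()
            break
        with lock:
            results.append(f"ran {job}")
        jobs.task_done()

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for job in ["iRAP", "NGSeasy", "MZmine2"]:   # pipeline names from the list above
    jobs.put(job)
for _ in threads:
    jobs.put(None)                 # one sentinel per worker
jobs.join()
for t in threads:
    t.join()
```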

MORE is an open-source project and the code is publicly available at https://github.com/olgamelnichuk/ansible-vcloud. The EMIF instance of MORE is hosted on the EMBL-EBI Embassy Cloud and is accessible upon request by EMIF users. This instance is used for EMIF-AD biomarker discovery.

Knowledge Object Framework

Michel van Speybroeck, James Cunningham, et al. Janssen & UNIMAN

The knowledge object framework consists of a number of components that support the harmonization effort of clinical data. A knowledge object is a semantic representation of a clinical variable and contains descriptive metadata, executable rules that specify the relations (mapping) to other knowledge objects, and the actual data. A local knowledge object is the representation of a source variable and contains the raw data. A global knowledge object defines a harmonized, cross-trial variable and serves as a mapping target for local knowledge objects. Several levels of derived knowledge objects are possible, thus creating a dependency graph.

The main goal of knowledge objects is to make the data harmonization process more efficient by specifying the minimal information to understand a clinical measurement and its mapping to a harmonized variable. That information is owned and maintained by the local data source or the research community. By specifying metadata and mapping rules using semantic web technology a reasoner can be used to perform the actual data extraction and harmonization. Security restrictions are also defined on local variables and automatically propagated to global knowledge objects.
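Security propagation along the KO dependency graph can be sketched as an intersection of the parents' access sets: a derived (global) KO is readable only by roles allowed on every local KO it depends on. The KO names and roles below are invented, and the real framework uses a semantic web reasoner rather than this hand-rolled traversal.

```python
# Sketch of access propagation through the KO dependency graph
# (invented KOs and roles; the real framework uses a reasoner).
DEPENDS_ON = {
    "global_bmi":   ["local_weight", "local_height"],
    "local_weight": [],
    "local_height": [],
}
LOCAL_ACCESS = {                        # restrictions defined at the source
    "local_weight": {"researcher", "custodian"},
    "local_height": {"custodian"},
}

def effective_access(ko):
    """Roles allowed to read a KO: local sets intersected up the graph."""
    deps = DEPENDS_ON[ko]
    if not deps:                        # local KO: use its own restrictions
        return LOCAL_ACCESS[ko]
    return set.intersection(*(effective_access(d) for d in deps))
```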

The knowledge object framework consists of the following technical components:


OHDSI tools (e.g. ACHILLES) provide researchers with an overview of patient profiles in a given cohort, with filtering on a limited set of pre-agreed characteristics.



Workflow Management

Workflow management ensures integrity of process between researcher and data sources; management of tasks and process steps.



Switchbox

Switchbox provides a single interface for cohort owners, but multiple harmonised outputs for tools.

Cohort Selection Tool (CST)

CST provides the researcher with an overview of the potential cohort data, availability, and suitability.