Big Data and e-Infrastructure needs
Science and Technology of Information and Communication (STIC) and RIs must develop the maximum synergy to produce, communicate and diffuse a culture of high-quality data in all areas of science. Frontier research generates high-quality data and extracts knowledge from them. Raw data are generally unstructured flows that are effectively usable only by the scientists who generated them, or by a specialized scientific community. Open Data science requires large cohorts of structured and well-documented data: only a special effort to advance the state of the art of e-Infrastructures can achieve this goal, by realizing advanced standards of accessibility, quality control of data and analysis tools.
The multidisciplinary character of data-intensive themes requires digital infrastructures that scale with increasing e-needs, calling for a coordinated effort of all RIs. ESFRI fosters the adoption of the FAIR – Findable, Accessible, Interoperable and Reusable – data principles, plus data Reproducibility and Openness, by all RIs of the Roadmap. ESFRI RIs generate massive amounts of data and have often developed their own standards and metadata formats, made data analysis and computational resources available to users, and built data repositories for storage that facilitate data sharing and re-use, optimized for their reference scientific communities. High-throughput computing (HTC) and high-performance computing (HPC) are key services for the RIs' activities. RIs in different domains face similar data management challenges – e.g. validation and access – which clearly shows the urgent need for concerted action towards innovative data policies as well as for the training of data scientists.
The European Open Science Cloud (EOSC) can realize a framework of Commons – good practices and instruments of interdisciplinary value – to organize a consistent system of RIs and e-Infrastructures, assuring data preservation and protection, Virtual Research Environments (VREs), data interoperability, and suitable data analytics and computational resources across all disciplines, with proper solutions for ethical and IPR issues. EOSC will federate the most advanced data and service infrastructures, often directly built and supported by the RIs. ESFRI RIs – among other European, national and international RIs – represent a major investment in data infrastructure. In contributing to the EOSC, the RIs should retain control over the quality of their data, its persistence over time, and the quality of the data services that might eventually also be provided by others to general scientific users and for innovation purposes.
Important efforts are under way to coordinate a harmonized contribution to all European e-Infrastructure initiatives – e.g. EOSC, EDI and HPC – in tune with the most recent recommendation of the European Council. Notably, HPC applications for data analysis are expected to have a multiscale integrating character and to contribute to filling gaps with regard to databases and research platforms. Distributed RI platforms and a rising number of national hubs have the potential to advance the digital RIs in the ecosystem.
RIs producing Big Data need, for example, pre-exascale computing technology – e.g. green computing, data and streaming, nano-photonics, etc. – that requires an important R&D effort. Green computing will radically reduce the power needed to do computationally intensive work on large amounts of data. Data and streaming technologies are needed to process data on-the-fly and to archive data efficiently – e.g. time series that need perpetual conservation. Nano-photonics research addresses technologies to drastically reduce the energy cost of data transport over long distances as well as inside computing devices.
In the PSE domain, astronomy has pioneered a global framework for FAIR data sharing, which is operational and intensively used by the international community: ground and space-based observatories provide access to their data, which can be reused for scientific aims different from the initial motivation of the research; a Virtual Observatory (VO) defines the relevant data standards as well as state-of-the-art data analysis tools. The VO shows the power of interoperability within a discipline to enable data and Commons to become an integrated Research Infrastructure. The International VO Alliance (IVOA, http://www.ivoa.net) is expanding with the inclusion of Astroparticle Physics needs through the Astronomy ESFRI & Research Infrastructures cluster (ASTERICS, https://www.asterics.eu) and planetary physics by the Virtual Atomic and Molecular Data Centre (VAMDC, http://www.vamdc.org). The growth in data-handling requirements at analytical facilities outpaces Moore's Law and will further accelerate as new-generation detectors and brighter sources come online – ARIs' instruments produce petabytes of data a year, e.g. in imaging and tomography. Advanced ARIs need to process data in near real time to enable users to steer their experiments and optimize the use of costly and scarce beam time. Free Electron Laser facilities are one example: they require large computing power and advanced analytics at hand during the experiment sessions. The RDA is developing reference criteria and methods and a broader concept of virtual observatory/laboratory enabling remote access to RIs spanning the PSE domain.
ENV RIs are distributed, with nodes/sites acquiring data over a range of spatial and time coordinates, so that data standardization and quality control are built-in activities to provide exploitable information and data services for research and society. The ENV RIs involve Big Data as collections of many different kinds of environmental data and share common challenges such as data capture from distributed sensors, management of high-volume data, data visualization and web-casting of data in near real time.
Europe is the most environmentally monitored continent; there is, however, an urgent need to develop a more advanced approach to environmental observation, capable of integration across diverse science domains, temporal and spatial scales, space-based observations and in situ measurements, and data and analysis performed by researchers, industry and governments. A priority and a key driver for ENV RIs is to implement a federated approach to IT resources for greater integration and interoperability, building on the examples of earth observation – e.g. the role of the ESFRI Landmarks EPOS and Lifewatch ERIC in the Global Earth Observation System of Systems (GEOSS, https://www.earthobservations.org/geoss.php) – and meteorology – e.g. the role of the ESFRI Project ACTRIS and the ESFRI Landmark IAGOS in the World Meteorological Organization (WMO, https://public.wmo.int/en).
The data and knowledge output of H&F RIs, as well as the technologies to manage these data and integrate them with other research data in this field, will extend the frontiers of research and generate opportunities to respond to the health and food challenges. H&F RIs have a long track record in creating the conditions to share data. In the Health domain, there are special requirements in data management, concerning protection, anonymization, identification, storage and data analytics, so that data can be shared publicly while remaining confidential. This expertise can be transferred to other fields in order to align and to extract knowledge from H&F data in combination with data from other domains, including the development and application of AI and machine learning.
The multidisciplinarity intrinsic to the ENE domain requires developing and applying scale-bridging approaches to the design of new materials and to the study of energy-related processes. The study of energy networks and systems from local to macroscopic scales requires handling large volumes of data and intensive model-based processing. New initiatives such as the Energy oriented Centre of Excellence for computing applications (EoCoE, https://www.eocoe.eu) are expected to have a multiscale integrating character and to contribute to filling gaps between databases and research platforms. Distributed RI platforms such as the European Distributed Energy Resources Laboratories (DERlab, http://der-lab.net/about/) and the European Real-time Integrated Co-simulation laboratory (ERIC-Lab, http://www.eric-lab.eu), together with a rising number of national living laboratories collecting and processing data of real energy systems, have the potential to advance the digital real-time integration of distributed and volatile energy resources into advanced and sustainable energy systems.
An important element that needs special mention in e-needs is Human Resources (HR). HR are at the core of all aspects of the overall e-Infrastructure and Big Data ecosystem at institutional, national, European (EOSC, EDI, EU-HPC, etc.) and global level. Sustainability of the RIs and e-Infrastructures must be approached together with an unprecedented effort in training data scientists as a specialised profession and in building broad data literacy, to enable more and more users of the data system. Finally, there is a clear need to bring together data from multiple platforms to solve increasingly complex problems. This requires more open access and transferability of data, but the ethical issues and those related to industrial confidentiality need to be properly addressed by all the concerned RIs, with the aim of developing well-founded rules.