Part 2
Landscape analysis

Gaps, challenges and future needs

The development of high-speed data connections, together with storage capacities and information processing software, has provided access to massive amounts of data, as well as new ways of analysing analogue resources of cultural heritage. The disciplines in Social Sciences and Humanities are thus confronted with a momentum that is transforming the entire profession of the researcher. Research Infrastructures in this area must enable the creation and manipulation of large and very heterogeneous bodies of data, of a qualitative or quantitative nature, opening up new research possibilities and encouraging interdisciplinary work. RIs contribute to the valorisation of scientific and cultural heritage.

Data storage and digital interactivity have opened up new opportunities in terms of appropriation and handling of research resources. Consequently, we have seen a diversification in the locations of digital resource production, which has resulted in the creation of many platforms. They form clusters that bring together disciplinary and technological skills and offer many services to support researchers in the humanities who use ICT, either directly because the research data are digital or as an environment allowing access to new processing tools.

Exponential growth in the amount of data, their increasing use by SSH scholars, and the rapid evolution of technology open up new opportunities for SSH research. The use of Big Data also poses new methodological challenges with implications for empirical research: the implementation of surveys on emerging social trends in longitudinal perspectives can lead to important advances in epistemological and methodological fields. In particular, Big Data raises some important issues for the SCI domain, among which we point out the following.

THE PREDICTIVE CAPACITY OF BIG DATA. The challenge is to understand if and how a large amount of data can improve and enhance predictive capacity regarding social phenomena. One concern is that Big Data might lead to causality being overlooked, since this perspective relies on correlations and trends whose underlying cause may not be clear.

BIG DATA AND THE ROLE OF THEORY. The advent of Big Data has often been accompanied by concerns about the end of theory and about a kind of knowledge that is only data-driven. The question is whether the use of Big Data can contribute to generating interpretative patterns of past, present and future social, political or cultural reality, or to identifying the limits of machine learning for humanist and sociological knowledge.

BIG DATA AND DATA PROTECTION ISSUES. The use of Big Data in research raises not only issues of privacy protection but also concerns about access to and possession of such data and, in many cases, copyright issues. Solutions at national and European level must be found in order to enable researchers to use Big Data, given the ethical and legal challenges.

METHODOLOGICAL CHALLENGES OF BIG DATA. The scientific community has some concerns about the validity and the reliability of Big Data: the discussion is how to use them in a controlled way in order to produce scientifically relevant inferences.

BIG DATA ANALYSIS. Especially in the field of opinion mining and sentiment analysis, treatment and analysis of Big Data rely on automatic procedures more or less monitored by the researcher and on algorithms of machine learning through which the software is able to classify a large amount of textual information. This raises classic methodological problems in a new form: the inspection and thus the control and evaluation of data analysis procedures.

In light of the changes outlined in the preceding section, new forms of Research Infrastructures combining storage and state-of-the-art information extraction methods and services are required if the research community is to utilise all potential research opportunities. This section identifies a number of areas in which the changing research landscape needs to foster new research opportunities in SSH and at the disciplinary boundaries with other scientific communities.

Integration of bio-social data

Interdisciplinary research cutting across SSH has the potential to supply increasingly rapid insights into the influence of socioeconomic and environmental conditions on biological changes. These have long-lasting consequences for our behaviours, health and socioeconomic well-being through the life cycle. There is a need to understand the pathways and mechanisms involved in these reciprocal feedbacks over the range from cells to society. Bringing together interdisciplinary teams to address these research issues, and ensuring that our longitudinal and cohort studies are augmented and enhanced to enable such research, provide new opportunities for scientific discovery.

Bringing together data from diverse sources, spanning genomics, blood analyses and biomarkers, health and other administrative records, and business or transaction data, and linking all of these into the rich longitudinal cohort and panel studies, presents several major challenges for ensuring accessibility and usage. Many researchers would benefit enormously from having complex data pre-processed and summarised in useful ways.

Existing RIs, such as the ESFRI Landmark ELIXIR (A distributed infrastructure for life-science information, H&F) and ESFRI Landmark SHARE ERIC, indicate that there is significant potential at the pan-European level to integrate a range of biomedical and socio-economic data resources during the lifecycle. In the same way, the surveys and data provided by the emerging project GGP will contribute to the analysis of generational differences in values and gender roles that are highly relevant for policy debates.

Further development in this area of the European research landscape will provide fertile ground for trailblazing research with huge potential benefits for the health and well-being of populations.

Promoting an international approach to real-time data analytics

Historically, data that have been used for research in the social, economic and behavioural sciences have been designed and/or collected specifically for that purpose. In recent years, however, new forms of data not originally intended for research use, such as transactional and administrative data, internet data (derived from social media and other online interactions), tracking data (monitoring the movement of people and objects), and image/video data (aerial, satellite and land-based), have emerged as important supplementary resources and alternatives to traditional datasets. In quantitative humanities, however, the analysis of resources that were not created for specific research questions has a long tradition. The problem of non-tailored resources arises anew now that large amounts of digital resources are available to researchers. A troubling gap is emerging in our ability to capture and explore these new forms and amounts of data for the purposes of research. This gap is arising because:

  • the prevalence of new forms of data will increase exponentially as technologies and digital capabilities evolve and it is imperative for the SSH community to take a leading role in establishing a robust, quality assuring, secure and sustainable infrastructure for utilising them;
  • technological and methodological advances must be made in order to realise the potential of real-time analytics for research in SSH;
  • much of the value of new forms of data lies in the potential for linkage and calibration with other data and the derived opportunities for addressing novel research questions as well as re-examining open questions through a new lens;
  • current training and capacity-building provisions are insufficient to meet the growing demand from researchers at all stages of their careers to utilise new forms of data;
  • new forms of data and their subsequent uses pose novel ethical, quality and privacy questions that must be explored to ensure that these technologies are deployed in a responsible way.

Europe has to implement standards with respect to privacy and security issues in Research Infrastructures. The recently established General Data Protection Regulation (GDPR) could bring clarity and benefit to research across the SSH and associated disciplines – e.g. medical sciences, health research etc. The new GDPR is an opportunity to develop and establish a lead in regard to privacy and security issues concerning Research Infrastructures.

RIs for social media – archiving the web

Since the mid-1990s the web has become an integrated part of society, culture, business, and politics, and national web archives have been established to preserve this part of the digital cultural heritage. But for the scholar who wants to study the web across borders, national web archives become an obstacle, since they confine the borderless information flow on the web within national barriers. Thus, a transnational Research Infrastructure should be established with a view to: i) developing a more efficient and attractive European Research Area; ii) ensuring researchers free access to the digital cultural heritage from different nations; and iii) increasing the potential for fostering innovative partnerships with the software development industry for studies of Big Data.

RIs for humanities and cultural innovation

Contemporary technologies offer great opportunities to revitalize and make available on a large scale cultural items which represent a collective treasure for Europe in terms of identity, citizenship, diversity, cultural growth, and economic potential. The effort in that direction should be conceived in two different ways:

  • Cultural items – manuscripts, papyri, books, movies, music, paintings, monuments, etc. – in their material reality, are complex physical objects that are in need of material analysis, dating, preservation and restoration. Viewed in this way, they are relevant for RIs which aim to support the analysis of physical objects in general.
  • Making a material object part of our cultural heritage largely depends on the collective awareness of its existence and on the value vested in it. In this respect, RIs devoted to the dissemination (digitisation, 3D-reconstitution, etc.) of those objects are crucially needed for the maintenance of our cultural heritage.

An enormous amount of diverse materials is widely distributed across Europe: they are often difficult to access from outside local communities, and sometimes at risk of deterioration. The main challenge of RIs is to provide users (educators, museum and exhibition curators, the public) with access to such treasures and heritage, and to the state-of-the-art analysis carried out by experts and researchers, also by exploiting digital media and archives.

National museums and integrating Research Infrastructures such as the European Cultural Heritage Online (ECHO) (http://echo.mpiwg-berlin.mpg.de/home) and Europeana (http://www.europeana.eu/portal/) have made important efforts to digitize libraries and collections. ECHO was established in 2002 to create a research driven IT infrastructure for the humanities. It works on digitisation of cultural heritage and develops research driven tools and workflows for analysis and publication of scholarly data linked to primary sources. ECHO features more than 70 collections from more than 24 countries worldwide. Europeana is a European network representing more than 3,300 institutions and aggregators and provides cultural heritage collections to all in the form of more than 30 million digitised objects and descriptive data.

However, making these treasures accessible in digital form is only the first step in ensuring their uptake by the target audience. The vast bulk of the cultural heritage accumulated through centuries of European history is a formidable resource of rich material for new and far-reaching analyses, typically in languages no longer spoken. Therefore, the mere availability of this heritage no longer guarantees that current scholars, let alone the general public, will be able to internalise it through conventional methods, i.e., reading about this heritage and annotating it. Thus, new methods of intelligent information mining and text analytics are needed that are capable of automatically processing the content of the massive amount of cultural heritage treasures and making them accessible to present day audiences. In response, projects have been set up to record and possibly revitalize endangered languages, in which social media can also play an important role (The Endangered Languages Project, http://www.endangeredlanguages.com/). It may be noted that CLARIN has examples from a very large number of languages, and that more than 1,500 languages are represented with 5 examples or more.

Meeting this challenge requires significant interdisciplinary efforts to integrate competences from different expert fields, bring together the most advanced facilities and make their resources available on a large scale. Infrastructures such as the Cultural Heritage Advanced Research Infrastructure (CHARISMA) contributed to the development of joint activities in the field of conservation of cultural heritage. CHARISMA covers joint research, transnational access and networking of twenty-one organizations that provide access to advanced facilities, develop research and applications on artwork materials for the conservation of cultural heritage and open up larger perspectives to heritage conservation activities in Europe. It has defined and consolidated the background for the ESFRI Project E-RIHS.

Additionally, in the archaeological sciences the ARIADNE network developed out of the vital need for infrastructures to manage and integrate archaeological data at a European level. As a digital infrastructure for archaeological research, ARIADNE brings together and integrates existing archaeological research data infrastructures so that researchers can use the various distributed datasets and technologies. ARIADNE has strong ties to the ESFRI Landmark DARIAH ERIC.

Increasing the global reach

Given that RIs in the Social Sciences and Humanities will be stably anchored in Europe in the future, further actions have to be undertaken to make them attractive and compatible on a global scale.

The accessibility of digital research data – e.g. survey data in the Social Sciences, digitized and annotated cultural heritage in the humanities – is obviously the key driver for increasing global research not only in the Social Sciences and Humanities but in the whole scientific system. Not only is there still a possible gap between European and non-European infrastructures in the standards used for certain kinds of data, depending on the source from which they were derived; there are also difficult challenges inherently connected to the content types of data and to the research traditions in different parts of the world.

Digital tools and functions may have to be programmed anew when applied to sources in languages, scripts or symbols that the technology has only recently encountered. The underlying understanding of text types, art classification systems and semantics would potentially have to be adjusted, and complete methodologies would have to be renegotiated. The same is true for political and sociological research terms and classification systems. In order to integrate and interconnect heritage and knowledge from and about societies and cultures from all over the world, their history and self-conception, a lot of work has to be done to enable global infrastructures to offer a certain degree of consistency between these data and concepts.

Sustainability and geographical coverage

The sustainability of research data has increased because all RIs provide archives for storing data and state-of-the-art methods to analyse and interpret them. This is an important difference from ten years ago, when data could disappear when a researcher retired. Today, data are not only stored in sustainable, long-lasting and secure archives; the current RIs – e.g. the ESFRI Landmarks CLARIN ERIC and CESSDA ERIC – also use innovative methods such as Persistent Identifiers for resources and data collections, so that the same version can always be retrieved and research based on their data can be replicated or extended.

We also need to consider the sustainability of the RIs themselves. Research Infrastructures need to be sustainable: i) financially and organisationally; ii) technically; and iii) in terms of human resources. These three dimensions of sustainability are heavily interlinked and therefore require adequate financial resources. The organisational sustainability is supported through the use of the ERIC and other legal structures. The financial sustainability of the central and national operations may still be an issue worth considering.

For all of the SSH Infrastructures, geographical coverage is crucial for the quality of the research they support and hence for their sustainability. Data from one country are of interest not only to researchers in that country but also to everyone else in Europe for comparison. To compare attitudes to different aspects of society, it is not enough to cover one part of Europe if other parts are missing. For those Infrastructures where language plays an important role, very good geographical coverage is clearly needed, so that all types of languages, and preferably all languages, are described and can form the basis for research developments. European data collecting infrastructures only have European Added Value if they are able to provide data from all over Europe.

Technical sustainability has to do with upgrading to new versions, following and updating standards, including new tools and possibilities, and following international developments. All current SSH Research Infrastructures are heavily involved in and committed to continuous technical development.

Sustainability in terms of human resources is at the heart of our infrastructures. There are three classes of activities where human resources are crucial:

  • building and operating the infrastructure and keeping it up-to-date in the light of technological and methodological developments and evolving user needs (this is treated above under technical sustainability);
  • instrumentation and population of the infrastructure with community specific data and services;
  • education, training and research support for existing and future users.

There are various instruments to make these things happen in a sustainable way, and they are all implemented to some extent by the current ESFRI Landmark SSH RIs. For example, building knowledge about the availability of RIs into standard university curricula is a good, sustainable long-term investment. In the shorter term, the obligation for infrastructures to build and maintain what could be called a Knowledge Sharing Infrastructure is important. A Knowledge Sharing Infrastructure is a formalized way of recognizing and sharing knowledge among members. It is an acknowledgement that not all useful knowledge can be concentrated at the central level, and that the knowledge present at the national level is crucial for sustainability and has to be made visible and shared. This is particularly true for distributed Research Infrastructures like the SCI RIs.