Centre on Textual Data (CTD)

The proposed centre will rely on a decentralized structure shared between the University of Vienna (Department of Communication and Vienna Center for Electoral Research, VieCER) and the University of Amsterdam (Amsterdam School of Communication Research, ASCoR).


The use of textual data for empirical research is growing rapidly. MEDem therefore plans to establish a Centre on Textual Data (CTD) that will:

  • Provide coordination between different comparative projects involved in the collection of textual data, such as the MARPOR and CAP, as well as comparative media and parliamentary research projects that are under development;

  • Promote the usability of textual data across datasets collected in the different projects;

  • Document methodological and technical approaches for collecting, scraping and analysing digital textual data from a variety of sources, with a special focus on the multilingual nature of such raw data;

  • Ensure that all the textual data is well documented and updated following a joint metadata and documentation standard, and archived in established data archives if not archived at the competence centre itself;

  • Contribute to the development of post- and pre-harmonization procedures and standards for previous and future textual data collections;

  • Advise projects in the development of standards for data collection, data integration and data documentation that will facilitate data harmonization in the future.


A number of challenges characterize the field of political communication and media analysis. First, the types of information researchers deal with differ fundamentally in shape, format and style, ranging from parties’ election manifestos, speeches and parliamentary debates, through traditional news media coverage, to political posts and debates on social media, and from purely text-based data to multimodal information. Second, capturing such information environments across Europe inevitably involves a multitude of political contexts and languages, and text analysis tools and data gathering methods are often very sensitive to such variation. Third, and importantly for the MEDem project, the field is only weakly institutionalized, at least across different types of political texts and, in cases such as media or social media, partly also within such types, with very little attention paid to common and comparable data collection tools and to data sharing.


Against this backdrop, the CTD seeks to address the following tasks:


  • The CTD will act as a coordination point between existing projects in the area of political text analysis and thereby facilitate a common understanding of the challenges of comparative political text analysis and of the opportunities for cooperation.


First and foremost, it will focus on bringing together existing comparative projects, such as CAP or MARPOR. Second, it will support the establishment of networks of scholars working in the fields of parliamentary debates, traditional news media or social media, aiming at the institutionalization of such international networks. Networking activities rely on regular physical meetings (biyearly) of the main players in the different fields, on using international conferences as opportunities to bring people together for short events, on workshops focusing on state-of-the-art methodological aspects of text analysis, and on communicative activities such as regular newsletters, a website and social media channels. The support of network structures aiming at institutionalizing comparative research projects, together with the coordination between existing projects, should create critical support for MEDem among data producer and user communities in the area of political text analysis.


  • The CTD will promote the usability of textual data among different data producer and data user communities, focusing on cross-data usability.


While some comparative or national projects and data collections are well established, with visibility in their subfields and beyond and a clear focus on data sharing structures, such openness towards data sharing is less evident for other projects, in particular in less institutionalized areas. The CTD will push the idea of open science and of sharing measurement instruments and data in the area of political text analysis, and will promote the value of linking and integrating political text data among the different communities, highlighting the added value of data linkage through workshops, conference presentations and exemplary studies/publications. Promotion of usability will also hinge greatly on the activities described below.


  • A major task of the CTD in the first years of operation will be to collect and document existing approaches to political text analysis in the various fields, ranging from sampling procedures and manual coding instructions to text-as-data tools applied to political texts.


Building on the INCA structure, the CTD seeks to build a measurement database that integrates a comprehensive collection of existing operationalizations and measurements, including, if possible, assessments of the quality of such instruments. First, it systematizes the sampling strategies applied in different projects and works towards a better understanding of sampling procedures in the different areas of political text analysis. Second, it distinguishes coding instructions applied in projects relying on human coders, systematically documenting the measurement tools, their comparability across projects and languages, and their reliability. Third, it collects tools developed and/or applied in computer-assisted text-as-data approaches in the field of electoral communication, ranging from data collection tools (scraping and pre-processing) to analysis tools, again focusing on the comparability of such tools and their quality/validity. A major challenge lies in the multilingual nature of comparative political text analysis; the documentation will therefore maintain original-language tools and also provide advanced translations of such tools into English, in order to enable comparisons and work towards understanding the comparability of approaches. The documentation will be openly provided to the MEDem community. Such systematic documentation, continuously updated, will serve as a major input towards further data harmonization and integration as described below. It will furthermore serve as a point of departure for identifying possible linkage points between different projects and data collections, and with other MEDem data collections.


  • The CTD should become a major player in pushing the open data approach in the political text analysis community and in standardizing data archiving.


To that end, it will provide templates for data documentation practices for the different approaches to text analysis and push for standardization in the reporting of metadata and for full transparency regarding data collection tools and their reliability/validity. It will also engage with data archives in different countries that are part of CESSDA and promote openness and the development of routines for the archiving of text data files. The CTD will coordinate communication between data producers and archivers, if necessary. All data produced by MEDem member projects coordinated through the CTD should be open.
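
As an illustration of what such a documentation template could look like in practice, the following is a minimal sketch only. The field names, the list of required fields and the helper `missing_fields` are hypothetical and do not represent an agreed MEDem or CESSDA standard.

```python
from dataclasses import dataclass, field, asdict

# Hypothetical set of fields a dataset description must fill in.
REQUIRED_FIELDS = (
    "title",
    "text_type",
    "languages",
    "collection_method",
    "reliability_report",
)


@dataclass
class TextDatasetRecord:
    """Hypothetical documentation template for one text data collection."""
    title: str = ""
    text_type: str = ""           # e.g. "manifesto", "news", "social media"
    languages: list = field(default_factory=list)
    collection_method: str = ""   # e.g. "manual coding", "scraping"
    reliability_report: str = ""  # where reliability/validity is documented
    archive: str = ""             # optional: archive holding the data


def missing_fields(record):
    """Return the names of required fields that are still empty."""
    data = asdict(record)
    return [name for name in REQUIRED_FIELDS if not data[name]]


record = TextDatasetRecord(
    title="Example corpus",
    text_type="news",
    languages=["de", "nl"],
    collection_method="scraping",
)
print(missing_fields(record))
```

A check of this kind would flag a dataset description that does not yet report where its reliability or validity is documented, before the description is passed on to an archive.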

  • Based on the documentation described above, the CTD seeks to develop post-harmonization procedures for existing data and, more importantly, to promote pre-harmonization across different MEDem projects.


For post-harmonization, the CTD will use the inventory and documentation to identify all instances of data for which cross-project post-harmonization appears feasible. The CTD will construct and continuously update such a data linkage database and share the task of building data integration tools with the Centre for data archiving and dissemination (CAD). For pre-harmonization, in close collaboration with MEDem project PIs and the CAD, the CTD identifies, where possible, best practices regarding measurements and approaches, or develops and validates such measures and tools. These are shared and promoted across the community, thereby implementing measurement harmonization of political text data collections in the multilingual European media environment. To facilitate the use of harmonized core measures of political texts in different contexts, while allowing for the flexibility to integrate context-specific measures, the CTD will develop a tool in which core measures and instructions are available in all languages, and which can be used either for manual online coding of textual and audio-visual materials or for automated text and, eventually, visual analysis procedures. The tool will serve as an international exemplar for the standardization and harmonization of text analysis approaches. This applies both to traditional, manual content analysis approaches, for which the CTD also provides training to coder trainers (national PIs) and advice on data management, retrieval, etc., and to computer-assisted approaches, for which the CTD facilitates training in their usage and application through regular workshops or tailored trainings.
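
To make the idea of post-harmonization more concrete, the sketch below shows one simple way project-specific codes could be mapped onto shared core categories. It is an illustration under stated assumptions, not the procedure the CTD will adopt: the project names, codes, category labels and the functions `harmonize` and `harmonize_records` are all hypothetical.

```python
# Hypothetical crosswalks from project-specific codes to shared core categories.
CROSSWALK = {
    "project_a": {"A101": "economy", "A205": "environment"},
    "project_b": {"1": "economy", "7": "environment"},
}


def harmonize(project, code):
    """Return the shared core category, or None if no mapping is documented."""
    return CROSSWALK.get(project, {}).get(code)


def harmonize_records(records):
    """Split records into harmonized ones and ones that cannot yet be mapped."""
    harmonized, unmapped = [], []
    for rec in records:
        core = harmonize(rec["project"], rec["code"])
        if core is None:
            unmapped.append(rec)  # kept for later review, never silently dropped
        else:
            harmonized.append({**rec, "core_category": core})
    return harmonized, unmapped


records = [
    {"project": "project_a", "code": "A101"},
    {"project": "project_b", "code": "7"},
    {"project": "project_b", "code": "99"},
]
harmonized, unmapped = harmonize_records(records)
print(len(harmonized), len(unmapped))
```

Keeping unmapped records in a separate list reflects the point that post-harmonization is only feasible for some instances of data; the remainder stays visible as a candidate for future pre-harmonization.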


  • Some political text data are fluid (e.g. online news, public social media posts) and some political texts are archived in rather dispersed ways (e.g. parliamentary debates, news coverage, press releases); at the same time, the majority of political texts are nowadays available in digital format. In the long run, the CTD therefore seeks to build a comprehensive, large-scale system for recording data and linking to archived data across European member states.


Based on such a system, the CTD, in cooperation with the projects involved, will act at the forefront of adapting innovative, automated, computer-assisted approaches to text analysis to the multilingual reality of European democracies. The increasing application of computer-assisted content/text analysis approaches will make content data collection cheaper, more efficient and quicker, and in particular will ensure ready comparability of such data.

  • Given that multimodal communication has become central to electoral communication, in the mid to long term the CTD will develop or apply approaches for automatic visual recognition and speech recognition, allowing for a more encompassing analysis of multimedia and multimodal data.