From Press Releases to Research Projects: Why We Built an LLM Pipeline 

Every day, research institutions announce new projects: large collaborative ventures, small exploratory studies, interdisciplinary initiatives, or long-term projects with significant relevance for society. For science journalism, these projects can offer valuable insights, such as early indicators of scientific trends, emerging research areas, or work that may become newsworthy months or even years later. The difficulty is not a lack of information, but the exact opposite: the volume is overwhelming.

This becomes especially clear when we look at how these projects are announced. Many of them are distributed through the Informationsdienst Wissenschaft (idw), a central platform that bundles press releases from a wide range of universities, research institutes, and other scientific organisations. Dozens of press releases, sometimes more, are published there each day, ranging from brief notes to long and detailed project descriptions.

For editors, keeping track of this much input is demanding. Ultimately, only a fraction of the press releases are relevant for science journalism. Yet identifying those cases requires careful monitoring, contextual understanding, and the application of journalistic criteria. Doing this manually on a daily basis alongside regular newsroom work is time-consuming and, for many newsrooms, not feasible at all.

With the recent advances in artificial intelligence, and in large language models (LLMs) in particular, we saw an opportunity to use these tools to support this process. LLMs are particularly strong at understanding and categorising text, identifying patterns, and summarising long passages. These capabilities make them suitable not for replacing editorial judgment, but for augmenting it: pre-filtering incoming releases, identifying potentially relevant research projects, and condensing large volumes of text to the information that matters most.

The Idea Behind Our LLM-Based Pipeline

In our project, we set out to explore how LLMs could support this workflow. The core idea of our idw Pipeline is simple: an automated system that monitors and processes incoming press releases and assists editors by reducing the amount of text they need to review.

The pipeline performs two main tasks. First, a classifier module decides whether a press release describes a research project that is relevant for our newsroom. This classification is based on pre-defined journalistic criteria, allowing the system to filter out the many releases that fall outside the editorial scope. Second, for those projects that pass this first step, an extractor module identifies and structures key information from the press release. Instead of reading through long and often complex announcements, editors receive a structured summary containing essential information such as start date, duration, funding amount, project partners, or the project’s aims. This structured data is then stored in a database and can later be forwarded to editors in multiple ways, such as through a dashboard, automated emails, or Slack notifications.
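The two-step flow described above can be sketched roughly as follows. This is a minimal illustration, not our actual implementation: the names ProjectSummary, run_pipeline, keyword_classifier, and title_only_extractor are hypothetical, and the two placeholder stages stand in for the LLM-backed classifier and extractor modules.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ProjectSummary:
    """Structured fields an editor receives (field names are illustrative)."""
    title: str
    start_date: Optional[str] = None
    duration: Optional[str] = None
    funding_amount: Optional[str] = None
    partners: List[str] = field(default_factory=list)
    aims: Optional[str] = None


def run_pipeline(
    releases: List[dict],
    is_relevant: Callable[[dict], bool],
    extract: Callable[[dict], ProjectSummary],
) -> List[ProjectSummary]:
    """Classify every release, then extract only from the relevant ones."""
    summaries = []
    for release in releases:
        if not is_relevant(release):
            continue  # outside the editorial scope, skip extraction
        summaries.append(extract(release))
    return summaries


# Placeholder stages: a real system would call an LLM here (assumption).
def keyword_classifier(release: dict) -> bool:
    return "project" in release["text"].lower()


def title_only_extractor(release: dict) -> ProjectSummary:
    return ProjectSummary(title=release["title"])


releases = [
    {"title": "New climate project launched", "text": "The project starts in 2025."},
    {"title": "Professor receives award", "text": "The award honours her teaching."},
]
summaries = run_pipeline(releases, keyword_classifier, title_only_extractor)
```

Passing the two stages in as functions keeps the control flow independent of any particular model or provider, so either module can be swapped or tested on its own.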

Importantly, the goal is not to replace an editor, but to support them. Our pipeline acts as an assistant that continuously monitors incoming releases, processes large volumes of text in the background, and then presents the information that matters most for editorial decision-making.

Key Challenges and Why They Matter

The design of our pipeline comes with two central challenges. Both relate to what LLMs must understand from press releases in order to support editorial work reliably.

The first challenge concerns the classifier module. It must decide whether a press release is relevant for our newsroom. However, relevance is not a simple, universal rule. Editors apply a mix of experience, domain knowledge, and judgment when assessing whether a project is worth observing or not. Translating this implicit knowledge into explicit, generalisable instructions for an LLM is one of the core difficulties of such a system. The model needs a prompt that captures these criteria clearly enough to be applied consistently across a wide range of press releases.
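To make the idea of explicit criteria more concrete, here is a small sketch of how a classifier prompt could be assembled from a list of written-down rules. The criteria, the prompt wording, and the RELEVANT / NOT_RELEVANT labels are invented for illustration and are not the ones used in our pipeline.

```python
# Hypothetical criteria; the real editorial criteria are not shown here.
CRITERIA = [
    "The release announces a new or newly funded research project.",
    "The project has a stated aim of interest beyond the institution itself.",
]


def build_classifier_prompt(release_text: str, criteria=CRITERIA) -> str:
    """Turn the explicit criteria into a numbered instruction for the model."""
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(criteria, start=1))
    return (
        "Decide whether the press release below meets all of these criteria:\n"
        f"{numbered}\n\n"
        "Answer with exactly one word: RELEVANT or NOT_RELEVANT.\n\n"
        f"Press release:\n{release_text}"
    )


def parse_label(response: str) -> bool:
    """Map the model's one-word answer to a boolean; reject anything else."""
    label = response.strip().upper()
    if label == "RELEVANT":
        return True
    if label == "NOT_RELEVANT":
        return False
    raise ValueError(f"Unexpected classifier output: {response!r}")


prompt = build_classifier_prompt("University X starts a five-year project on ...")
```

Keeping the criteria in a plain list makes them easy for editors to review and revise, and rejecting unexpected answers prevents a malformed model response from silently counting as a decision.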

The second challenge lies in the extractor module. Press releases sometimes present information in vague language, distribute details across paragraphs, or omit explicit statements entirely. Extracting structured information from such text requires the model to infer connections, identify implicit details, and work reliably even when key information is only hinted at or missing altogether.
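One way to handle missing information is to treat absence as an explicit value instead of letting the model guess. The sketch below assumes the extractor returns a JSON object; the field names and the parse_extraction helper are hypothetical.

```python
import json

# Hypothetical field names, mirroring the kind of details editors care about.
FIELDS = ("start_date", "duration", "funding_amount", "partners", "aims")


def parse_extraction(raw: str) -> dict:
    """Normalise the extractor's JSON output to a fixed set of fields.

    Fields the model omitted or left blank become None, and keys outside
    FIELDS are dropped, so downstream code always sees the same shape.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    result = {}
    for name in FIELDS:
        value = data.get(name)  # None if the key is absent
        if isinstance(value, str) and not value.strip():
            value = None  # treat blank strings as missing
        result[name] = value
    return result


raw = '{"start_date": "2025-01", "funding_amount": "", "partners": ["A", "B"], "note": "x"}'
parsed = parse_extraction(raw)
```

With this shape, a dashboard or notification can show "not stated" for a None field, which is more useful to an editor than a plausible-looking but invented value.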

These two tasks – modelling editorial relevance and extracting project information – are the foundation of our pipeline. In the next part of this series, we will look more closely at how we addressed these challenges and built the different modules.

Acknowledgment

This project was made possible with the support of idw, who provided access to their press-release API for research and development.