4 The Data Transparency Crisis
Social media platforms are responsible for a transparency crisis that threatens the integrity of our information ecosystem. Although participation in social media platforms is open to billions of users, companies have effectively closed many avenues for meaningful scrutiny. Online services appropriate public data for private gain while keeping public-interest research in the dark. A regime designed for information asymmetry and operational opacity leaves serious questions about social life unanswered: Who is orchestrating the online disinformation campaigns fostering an unprecedented wave of democratic backsliding? How did infodemics become one of the major threats to public health? What is the relationship between individual acts of online abuse against young women and emergent misogynistic cultures?
Without access to data, researchers cannot address major areas of social change, policymakers lack evidence for decision-making, and society lacks the information it needs to act. Social media companies are resisting independent public-interest scrutiny by researchers, journalists, policymakers, and citizens. Our increasingly datafied societies operate within the walled gardens of an online media ecosystem where the underlying data remain out of reach.
The landscape was not always this bleak. During an initial period of market expansion, social media companies opened their platforms to programmatic interaction and interoperability. They introduced Application Programming Interfaces (APIs) as a mechanism to enable communication between different applications. The relationship was mutually beneficial: the API owners offered new services on their platforms without significant costs, and the independent developers gained visibility and reach. Data access, albeit limited, was part of this tradeoff.
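Programmatic interaction of this kind typically means assembling a parameterised HTTP request and parsing a structured JSON response. The endpoint, parameter names, and response shape below are invented for illustration and do not correspond to any platform's actual API; the pattern, however, is what such interfaces expose.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint: real platform APIs differ in naming and
# authentication, but the query-construction pattern is the same.
BASE_URL = "https://api.example-platform.com/v1/posts/search"

def build_search_url(query: str, max_results: int = 100) -> str:
    """Assemble a keyword-search request URL with encoded parameters."""
    params = {"query": query, "max_results": max_results}
    return f"{BASE_URL}?{urlencode(params)}"

# A platform would answer with structured JSON, e.g. a page of posts
# plus a pagination token for fetching the next page:
sample_response = json.loads(
    '{"data": [{"id": "1", "text": "example post"}], "next_token": "abc"}'
)

url = build_search_url("election misinformation")
posts = sample_response["data"]
print(url)
print(len(posts))  # 1
```

It was precisely this combination of bulk querying and pagination that let researchers assemble large corpora, and that platforms later priced or closed off.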
The so-called big data revolution benefited from access to social media data. Over more than a decade, we witnessed an unprecedented growth in the quantity and quality of scientific and public-interest research with big data. The promise of “Computing Humanity” (Ledford 2020) did not come without significant risks and challenges. The Cambridge Analytica scandal, a case of personal data misuse for commercial and political purposes enabled by Facebook’s API, showed that programmatic interaction was a delicate ecosystem (OECD 2019) in need of robust regulation and democratic oversight (European Parliament 2018). Instead, social media companies used their power to redefine the terms of engagement. Public-interest research ultimately bore the cost as public APIs were deprecated and personal data increasingly locked within digital enclosures.
Twitter (now X), long regarded as one of the most research-friendly social media platforms, aggressively monetised its API in 2023. The costs are prohibitive for most research institutions (Coalition for Independent Technology Research 2023). For example, a landmark 2018 study on the spread of true and false news drew on a dataset of over 4.5 million posts, along with an additional 10 million tweets used for modelling purposes (Vosoughi et al. 2018). Until February 2023, academics could collect such volumes of data for free. Under the pricing structure introduced thereafter, researchers would need the enterprise plan at USD 42,500 per month (X, n.d.-b), while the pay-per-use tier introduced in 2026 would be even more onerous, with estimated costs of USD 72,500, and technically infeasible due to a cap of 2 million post reads per month (X, n.d.-a).
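The scale of the barrier can be made concrete with back-of-the-envelope arithmetic using only the figures quoted above; this is an illustration, not a precise costing.

```python
# Rough arithmetic on replicating the Vosoughi et al. (2018) corpus
# (4.5M analysed posts plus 10M modelling tweets) under X's paid tiers.
# All figures are taken from the text; this is an illustration only.
corpus_size = 4_500_000 + 10_000_000   # total posts needed
pay_per_use_cap = 2_000_000            # monthly read cap, pay-per-use tier
enterprise_monthly_usd = 42_500        # enterprise plan list price

# Ceiling division: minimum whole months just to read the corpus
# if collection were attempted under the pay-per-use cap.
months_under_cap = -(-corpus_size // pay_per_use_cap)

# A single year of enterprise access, for comparison.
one_year_enterprise_usd = 12 * enterprise_monthly_usd

print(months_under_cap)         # 8
print(one_year_enterprise_usd)  # 510000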
In 2024, Meta shut down CrowdTangle, a formerly independent media insights tool the company had acquired and made available to academics and journalists, replacing it with the Meta Content Library (MCL). Our report shows that this is a poorly performing tool only available to researchers pre-approved by Meta. After an arbitrary approval process, researchers must navigate closed virtual machine systems that prevent data exports. Other platforms promoted a narrow set of official data-access solutions, such as the TikTok Research API, which is restricted to researchers based in the United States and Europe (Pearson et al. 2024).
The rise of data scraping for artificial intelligence (AI) training further limits independent research. Large language models and other generative AI systems increasingly rely on vast amounts of data derived from texts, images, and videos published on online platforms. Social media companies—including Reddit and X—resist third-party uses of what they frame as “proprietary data” to train new models without compensation (Joseph 2025; Murphy et al. 2025). As a consequence, access to social media data by researchers, with or without permission, has become increasingly difficult and contentious, as the same data access mechanisms developed for scholarly inquiry can also be leveraged by large AI companies to train commercial tools.
Business models based on the monetisation of personal data thrived while public-interest research with public data from social media platforms suffered. Without adequate data transparency, our ability to know what happens to people online is subject to the choices made by social media companies.
Debates over data access have increasingly extended beyond user-generated content (UGC) to encompass publicly available advertising data. Transparency in advertising practices is essential for scrutinising paid content, a cornerstone of the multi-billion dollar digital advertising market—estimated at USD 650 billion in 2025 (Precedence Research 2025). Access to advertising data remains constrained and has received less attention compared to user-generated content. This asymmetry is not incidental: disclosing advertising data opens up platforms’ commercial interests and their business model to scrutiny. There are strong incentives to limit public availability to this data and curtail external oversight, and these incentives are misaligned with the public’s need to know about advertising online.
For example, revelations about the role of political advertising in shaping public opinion and influencing democratic processes have cast a light on the importance of advertising transparency in regulatory and academic debates. These concerns stem from the same electoral and public opinion scandals that led platforms to restrict access to UGC data, especially those surrounding the 2016 US elections (Leerssen et al. 2019). At the same time, the sophisticated microtargeting capabilities offered to advertisers have been widely used in disinformation campaigns, influence operations, and in online scams that disproportionately target at-risk individuals (Horwitz 2025; Santini et al. 2025; Seers et al. 2026).
Major platforms responded to mounting public pressure and successive crises by increasing advertising transparency for the public, authorities, and researchers, partly as a strategy to pre-empt binding regulatory frameworks and mitigate any backlash. Google, Meta, and X/Twitter were among the first to develop self-regulated ad repositories, or ad libraries, web-based tools often accompanied by APIs for data exploration and extraction, designed to enable greater scrutiny of ads served on major platforms. These systems typically provide information on targeting criteria, audiences reached, ad spending, delivery periods, and, in some cases, content moderation actions (Leerssen et al. 2019; Mozilla Foundation 2019).
Researchers and civil society organisations widely consider ad repositories inadequate and insufficient. These archives impose significant constraints on bulk access, historical depth, and analytical functionality: datasets are often incomplete or inconsistent, interfaces are difficult to navigate, and information is provided in formats that hinder large-scale analysis and systematic reporting (European Commission 2018; Rieke and Bogen 2018; Santini et al. 2024; Who Targets Me 2025).
Even if political advertising has been treated as the primary focus of these tools, given its disproportionate impact on public opinion and social institutions, self-regulated ad repositories have left other forms of advertising largely, and sometimes entirely, opaque. Advertising can play a significant role in funding extremist content and disinformation. For example, the digital advertising watchdog Check My Ads found obscure advertising played a role in the 2024 racially-motivated riots in Southport, UK (Check My Ads Institute 2025). Moreover, platforms diverge significantly in how they define and operationalise “political advertising”: some adopt narrow, election-focused criteria, while others employ broader frameworks that encompass social issues (Sosnovik and Goga 2021).
Political ad categorisation relies on self-classification by advertisers, who may mislabel their content, intentionally or not, thereby directly affecting what is made transparent. More fundamentally, assigning the responsibility of determining what constitutes political content worldwide to a small number of private companies poses significant risks, effectively turning these platforms into arbiters of electoral campaigning and political discourse across jurisdictions. Given the difficulty of drawing clear boundaries between political and non-political advertising, many explicitly political ads remain uncategorised, while non-political ads are sometimes incorrectly labelled as political, weakening the effectiveness of platform transparency mechanisms (European Partnership for Democracy 2020; Leerssen et al. 2019; Le Pochat et al. 2022; Santini et al. 2025).
From pharmaceutical and health disinformation to online financial scams, there are areas where the lack of information about online advertising harms people. Easily searchable databases of ads and advertisers could be a key step in protecting users, preserving the public’s trust in online advertising systems and improving the fairness and transparency in online ad markets (Schiffrin et al. 2026). The UK’s Online Safety Act provides some duties around advertising, but “falls short of delivering a comprehensive regulatory framework, leaving some gaps in coverage, enforcement, and oversight of third-party ad delivery systems” (Woods and Antoniu 2025).
Regulatory experiences have already shown that tools to monitor online advertising content can—and should—be substantially improved. Understanding the commercial operations that underpin the economics of large social media platforms is a matter of public interest which cannot remain obscured by transparency tools that fall short of their stated goals.