7 Materials & Methods

This study adopts a comprehensive framework grounded in data quality principles (Batini and Scannapieco 2016) to evaluate researchers’ access to public-interest data and to generate empirical evidence on whether, and under what conditions, platforms make such data available. The framework is designed to be replicable, comparable, and applicable across different platforms and regulatory contexts. It is based on the premise that transparency extends beyond the mere existence of public data, encompassing the conditions under which such data can be accessed, understood, used, and verified, while respecting applicable privacy and data protection frameworks.

Building on this premise, we developed and applied a comprehensive framework to evaluate which types of publicly accessible, public-interest social media data are made available by each assessed platform, covering both UGC and advertising content. Publicly accessible data are defined as content with public visibility—such as posts, comments, channels, public profiles, metadata, and engagement indicators—that can be accessed by any user. Accordingly, private communications, encrypted content, and any data requiring individual consent are explicitly excluded. Social media advertising is treated as inherently public-facing, as it is designed for broad dissemination through paid promotion. By contrast, public UGC, while also publicly accessible, is approached with greater ethical sensitivity due to its contextual and potentially personal nature. Accordingly, all data were handled in line with established ethical guidelines, including considerations of privacy, contextual integrity, and minimising potential harm (Association of Internet Researchers 2019; Gliniecka 2023; Lauterwasser and Nedzhvetskaya 2023). Further clarification on the terminology used throughout this report is provided in the Terminology section.

The analysis examines official public UGC and advertising transparency mechanisms across 15 social media platforms, listed in Table 1, through which platforms provide systematic access to granular, post-level data rather than aggregated or summary outputs. These mechanisms primarily include official APIs, as well as GUIs with data extraction and analysis capabilities.

By contrast, third-party data brokers, alternative data retrieval methods such as web scraping, and platform-curated datasets provided upon request are excluded. Scraping typically occurs under suboptimal conditions: it exposes both researchers and users to legal risks, including potential challenges by platforms, and it introduces additional problems for data quality, particularly in terms of completeness and reliability, since the data collected depend on what is displayed to the individual conducting the scraping at a given moment (Bruns 2019; Freelon 2018). Although we recommend the regulation of lawful and ethical scraping practices, we argue that such approaches remain insufficient to fully support the forms of public auditing, scrutiny, and research advanced in this work.

Third-party data brokers, while frequently used in the social listening market, typically provide non-auditable outputs with limited customisation. They operate largely as black-box systems, offering little transparency or auditability, and often involve costs that may be prohibitive for independent research (Yadav and Wanless 2022). Similarly, data access requests to platforms provide highly restricted and mediated forms of access, as platforms determine the scope and composition of the datasets they supply, offering researchers limited control over how data are compiled, filtered, or curated, which may limit transparency and reproducibility (Morten et al. 2024; Windwehr and Selinger 2024). Other forms of platform transparency, such as content moderation transparency reports, are also not considered (Access Now 2024; Urman and Makhortykh 2023).

Table 1: Platforms Analysed and Data Access Mechanisms Considered for User-Generated and Advertising Content

Platform	Owner Company	UGC data access	Advertising data access	DSA VLOP status
Bluesky	Bluesky PBLLC	Bluesky Developer APIs	N/A	No
Discord	Discord Inc.	N/A	N/A	No
Facebook	Meta Platforms Inc.	Meta Content Library	Meta Ad Library	Yes
Instagram	Meta Platforms Inc.	Meta Content Library	Meta Ad Library	Yes
Kwai (Kuaishou/Snack Video)	Kuaishou Technology	N/A	N/A	No
LinkedIn	Microsoft Corporation	Beta Researcher Access Program	LinkedIn Ad Library	Yes
Pinterest	Pinterest Inc.	N/A	Pinterest Ads Repository	Yes
Reddit	Reddit Inc.	Reddit Research API	N/A	No
Snapchat	Snap Inc.	N/A	Snapchat Ads Gallery	Yes
Telegram	Telegram Messenger Inc.	Telegram APIs	N/A	No
Threads	Meta Platforms Inc.	Meta Content Library	Meta Ad Library	No
TikTok	ByteDance Ltd.	TikTok Research API	TikTok Commercial Content Library	Yes
X (Twitter)	X Corp.	X API	X Ads Repository	Yes
WhatsApp	Meta Platforms Inc.	N/A	Meta Ad Library	Yes
YouTube	Alphabet Inc. (Google)	YouTube Data API	Google Ads Transparency Center	Yes

These transparency resources are examined across three distinct regulatory environments: the European Union—where access to public-interest data for research is regulated under the DSA—as well as Brazil and the United Kingdom, which serve as analytically relevant comparative cases. Where no regional differences were identified in the data access mechanisms, a single assessment was conducted and applied across all relevant jurisdictions.

Assessment criteria and transparency scores

The transparency resources listed in Table 1 were evaluated using two subsets of original criteria, which assess public UGC and advertising data transparency separately in terms of access tools and data quality. These criteria are structured into special and other categories, grounded in seven data quality dimensions, enabling an assessment not only of whether data are made available, but also under what conditions and how. Performance across each criterion is then used to generate a final score, ranging from 0 to 100, for each resource assessed.

Accounting for 75% of each final score, the higher-weighted special criteria, listed in Table 2, assess key aspects such as API availability, scope, cost, and the presence of GUIs for data access. Four special criteria are applied to UGC data transparency, while three are used for advertising data transparency.

APIs stand out as a central transparency mechanism, enabling the automation and scaling of data collection, as well as facilitating independent testing to ensure that extracted data correspond to what was intended. Crucially, APIs must support the independent extraction of structured data to users’ own infrastructure, outside allegedly controlled and secure environments, thereby upholding core open science principles and enabling verification and collaboration across institutions (Bekavac and Mayer 2026; Davidson et al. 2023; Wilkinson et al. 2016). GUIs also play an important role in democratising access for researchers and citizens with more limited technical resources or expertise (Giglietto and Terenzi 2025).

Table 2: Special Assessment Criteria for User-Generated Content and Advertising Data Transparency Mechanisms

User-generated content data transparency	Advertising content data transparency
SC1: Does the platform provide an API that enables the structured extraction of public user-generated content data for independent analysis?	SC1: Does the platform provide an API to access its ad repository and extract data on advertising content for independent analysis?
SC2: Can the full scope of public content data be extracted through the platform’s API?	SC2: Does the platform provide a graphical user interface to its ad repository for extracting advertising content data?
SC3: Is access to the platform’s API free of charge?	SC3: Can data from both active and inactive ads be extracted?
SC4: Does the platform offer a graphical interface for extracting data?

The other criteria carry equal weight and together account for the remaining 25% of each final score. In total, the framework includes 25 of these criteria for UGC data transparency, found in Table 3, and 33 for advertising data, found in Table 4, all organised under seven data quality dimensions:

Accessibility ensures that the intended data can be easily discovered and retrieved by users;
Compliance refers to adherence to widely adopted standards, applicable regulatory frameworks, and the platform’s own documentation;
Completeness ensures that the data include all the information necessary to conduct the intended analyses;
Consistency refers to the uniformity of data across all instances;
Relevance ensures that the data are meaningful and appropriately scoped for their intended uses;
Timeliness indicates that the data are up to date and available when needed;
and Accuracy refers to the extent to which the data faithfully reflect the real-world entity or event they represent.

The UGC transparency score is the sum of two components. The special criteria score ($S_{SC}$) accounts for 75% of the total: four criteria are weighted at $w_1 = w_2 = w_3 = 0.30$ and $w_4 = 0.10$, with answer weights ($SC_i$) of 1 for full access, 0.5 for researcher-only access, and 0 for unavailable. The other criteria score ($S_{OC}$) accounts for the remaining 25% and equals the mean answer weight across all applicable criteria, with not-applicable responses excluded from the denominator.

The advertising transparency score follows the same two-component structure. The special criteria score ($S_{SC}$) accounts for 75% of the total: three criteria are weighted at $w_1 = 0.50$, $w_2 = 0.30$, and $w_3 = 0.20$, with answer weights ($SC_i$) of 1 for complete access, 0.5 when access is restricted to specific categories (e.g., political advertising), and 0 for unavailable. The other criteria score ($S_{OC}$) is calculated identically to UGC. The complete scoring formulas are provided in the Scoring Methodology section.

All platforms were initially assessed between October and December 2025 by pairs of researchers with complementary expertise in data collection infrastructure and data analysis. The evaluation prioritised, in descending order, empirical testing of access mechanisms, review of official documentation, and evidence from comparable studies, explicitly avoiding reliance solely on self-declared platform claims wherever possible. All assessments subsequently underwent an internal peer-review validation process conducted between January and March 2026. All structured data and code used in the evaluation are publicly available in the project repository.

The final scores reflect the extent to which each platform enables independent, systematic, and high-quality access to UGC and advertising data for research purposes. Each platform–region assessment is then classified into one of the following transparency levels:

Meaningful (81–100): The platform provides well-established, openly accessible data infrastructure—including free APIs, comprehensive documentation, and broad data coverage—enabling systematic content collection and monitoring with minimal barriers.
Limited (61–80): The platform offers functional data access tools, but notable limitations—such as paywalls, restricted API scope, incomplete documentation, or access limited to approved researchers—require workarounds for comprehensive monitoring.
Deficient (41–60): The platform provides partial transparency resources, but significant gaps in data coverage, filtering capabilities, and documentation prevent reliable, large-scale research.
Minimal (21–40): The platform offers only minimal data access, with most transparency features either absent or severely constrained, hindering systematic monitoring.
Negligible (1–20): The platform offers negligible transparency infrastructure, with nearly all criteria unmet and any existing tools too limited to support meaningful research.
Not available (0): The platform provides no data access mechanisms, despite the framework being applicable.
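
The classification above can be sketched as a simple lookup. This is a minimal illustration rather than the project's own code: the function name `transparency_level` is hypothetical, and because the bands are stated with integer bounds, the handling of fractional scores (for example, treating 80.5 as Meaningful) is an assumption.

```python
def transparency_level(score: float) -> str:
    """Map a 0-100 transparency score to the level names used in this report."""
    if not 0 <= score <= 100:
        raise ValueError("score must lie between 0 and 100")
    if score == 0:
        return "Not available"
    # Upper bound of each band. Assumption: a fractional score falls into the
    # first band whose upper bound it does not exceed.
    bands = [
        (20, "Negligible"),
        (40, "Minimal"),
        (60, "Deficient"),
        (80, "Limited"),
        (100, "Meaningful"),
    ]
    for upper, label in bands:
        if score <= upper:
            return label
    raise AssertionError("unreachable: score was validated above")
```

Under these assumptions, a score of 67.5 is classified as Limited and a score of 81 as Meaningful.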

7.1 Scoring Methodology

7.1.1 UGC Transparency

Special Criteria Score (75% weight):

\[ S_{SC} = 75 \sum_{i=1}^{4} w_i \cdot SC_i \]

where $w_1 = w_2 = w_3 = 0.30$ and $w_4 = 0.10$ are fixed question weights, and $SC_i \in \{0, 0.5, 1\}$ is the answer weight (1 for full access, 0.5 for researcher-only access, 0 for unavailable).

7.1.2 Advertising Transparency

Special Criteria Score (75% weight):

\[ S_{SC} = 75 \sum_{i=1}^{3} w_i \cdot SC_i \]

where $w_1 = 0.50$, $w_2 = 0.30$, and $w_3 = 0.20$, and $SC_i \in \{0, 0.5, 1\}$ (0.5 for access restricted to specific categories such as political advertising).

7.1.3 Both Frameworks

Other Criteria Score (25% weight):

\[ S_{OC} = 25 \cdot \frac{\sum_{j=1}^{N_{app}} OC_j}{N_{app}} \]

where $OC_j \in \{0, 0.5, 1\}$ is the answer weight for each criterion and $N_{app}$ is the number of applicable criteria (not-applicable responses are excluded from both the numerator and denominator). For advertising criteria scored by access route, $OC_j = 0.5$ when a feature is available through only the API or only the GUI, and $OC_j = 1$ when it is available through both.

Total Score:

\[ \text{Total Score} = S_{SC} + S_{OC} \]
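
The formulas in this section can be sketched in a few lines of Python. This is an illustrative reading of the equations, not the code published in the project repository: the function names are hypothetical, not-applicable answers are represented as `None`, and returning 0 when no criterion is applicable is an assumption the text does not address.

```python
from typing import Optional, Sequence

# Fixed question weights from the special-criteria formulas.
UGC_WEIGHTS = (0.30, 0.30, 0.30, 0.10)
AD_WEIGHTS = (0.50, 0.30, 0.20)


def special_score(weights: Sequence[float], answers: Sequence[float]) -> float:
    """S_SC: 75 times the weighted sum of the special-criteria answer weights."""
    if len(weights) != len(answers):
        raise ValueError("one answer is required per special criterion")
    return 75 * sum(w * a for w, a in zip(weights, answers))


def other_score(answers: Sequence[Optional[float]]) -> float:
    """S_OC: 25 times the mean answer weight over applicable criteria."""
    applicable = [a for a in answers if a is not None]  # drop not-applicable
    if not applicable:
        return 0.0  # assumption: no applicable criteria yields no points
    return 25 * sum(applicable) / len(applicable)


def total_score(weights: Sequence[float],
                special: Sequence[float],
                other: Sequence[Optional[float]]) -> float:
    """Total Score = S_SC + S_OC."""
    return special_score(weights, special) + other_score(other)


# Hypothetical UGC assessment: researcher-only API and scope, free access,
# no GUI, and five other criteria of which one is not applicable.
example = total_score(UGC_WEIGHTS, (0.5, 0.5, 1, 0), [1, 1, 0.5, 0, None])
```

For this hypothetical input, the special criteria contribute 75 × 0.6 = 45 and the other criteria contribute 25 × 2.5 / 4 = 15.625, giving a total of 60.625.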

7.2 Questionnaires

7.2.1 User-Generated Content (UGC) Questions

7.2.1.1 Special Criteria

UGC_SC1: Does the platform provide an API that enables the structured extraction of public user-generated content data for independent analysis?

Description: Verifies whether the platform provides an API with at least one endpoint for programmatically extracting public user-generated content to the users’ infrastructure, without requiring privileged or internal access beyond standard developer registration.

Available answers:

No (weight: 0.0)
Yes, but only for approved researchers (weight: 0.5)
Yes (weight: 1.0)

UGC_SC2: Can the full scope of public content data be extracted through the platform’s API?

Description: Verifies whether the platform enables programmatic discovery and extraction of data from the complete set of public user-generated content, without exclusions or artificial restrictions that limit data completeness.

Available answers:

No (weight: 0.0)
Yes, but only for approved researchers (weight: 0.5)
Yes (weight: 1.0)

UGC_SC3: Is access to the platform’s API free of charge?

Description: Verifies whether API use is free of charge, confirming via documentation and pricing policies that no fees are applied for API access.

Available answers:

No (weight: 0.0)
Yes, but only for approved researchers (weight: 0.5)
Yes (weight: 1.0)

UGC_SC4: Does the platform offer a graphical interface for extracting data?

Description: Verifies whether the platform offers a graphical interface for observing and collecting data to the users’ infrastructure, such as a dashboard or export feature, allowing extraction of public content data without programming.

Available answers:

No (weight: 0.0)
Yes, but only for approved researchers (weight: 0.5)
Yes (weight: 1.0)

7.2.1.2 Accessibility

UGC_OC1: Can the requested data be extracted directly from the platform’s API response?

Description: Verifies whether the API returns structured data directly in its response payload, rather than only providing redirect links to the data (excluding audiovisual media such as images, video, and audio).

Available answers:

No (weight: 0.0)
Yes (weight: 1.0)

UGC_OC2: Does the platform’s API support renewable authentication mechanisms without risk of data loss?

Description: Verifies whether the API’s authentication mechanism (e.g., tokens, keys, or other access methods) can be renewed or updated without interrupting data collection, losing access, or compromising the continuity and integrity of monitoring and extraction.

Available answers:

No (weight: 0.0)
Yes (weight: 1.0)

UGC_OC3: Does the platform’s API offer an endpoint for extracting data from an individual publication?

Description: Verifies whether data from a specific public post can be collected using its unique identifier, without relying only on search terms or filters.

Available answers:

No (weight: 0.0)
Yes, but only for approved researchers (weight: 0.5)
Yes (weight: 1.0)

UGC_OC4: Does the platform’s API offer an endpoint for extracting data from an individual author?

Description: Verifies whether it is possible to collect data from public posts made by a specific author using their username or unique identifier.

Available answers:

No (weight: 0.0)
Yes (weight: 1.0)

UGC_OC5: Does the platform’s API provide an endpoint for extracting data based on search terms?

Description: Verifies whether public user-generated content can be collected by querying individual or combined search terms, enabling datasets of posts mentioning those terms.

Available answers:

No (weight: 0.0)
Yes (weight: 1.0)

UGC_OC6: Does the API use locale-neutral data representations?

Description: Verifies whether locale-sensitive data (such as timestamps, currency, and numbers) are returned in a locale-neutral format or accompanied by relevant locale metadata.

Available answers:

No (weight: 0.0)
Yes (weight: 1.0)
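
As an illustration of what a locale-neutral representation can look like for timestamps, the sketch below checks whether a value is an ISO 8601 string with an explicit UTC offset. Treating ISO 8601 with an offset as the benchmark is an assumption made for this example, and the function name is hypothetical; the criterion itself does not prescribe a specific format.

```python
from datetime import datetime


def is_locale_neutral_timestamp(value: str) -> bool:
    """True if value parses as ISO 8601 and carries an explicit UTC offset."""
    try:
        parsed = datetime.fromisoformat(value)
    except ValueError:
        # Locale-specific layouts such as "01/10/2025 14:30" end up here.
        return False
    # A timestamp without an offset is ambiguous across time zones.
    return parsed.tzinfo is not None
```

A value such as `2025-10-01T14:30:00+00:00` passes this check, whereas `01/10/2025 14:30` does not, since it could denote either 1 October or 10 January depending on the reader's locale.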

7.2.1.3 Compliance

UGC_OC7: Does the platform implement a proper deprecation strategy to avoid breaking client applications while rolling out major changes in the API?

Description: Verifies whether the platform documents a deprecation strategy with a grace period before removing features, including deprecation and removal dates and migration instructions.

Available answers:

No (weight: 0.0)
Yes (weight: 1.0)

UGC_OC8: Is the platform’s API documentation published in open access?

Description: Verifies whether full API documentation is openly available on the internet without requiring registration, login, or account creation.

Available answers:

No (weight: 0.0)
Yes (weight: 1.0)

UGC_OC9: Is the platform’s API documentation clearly written and exemplified?

Description: Verifies whether API documentation is clear, complete, and includes practical examples such as sample code, queries, and structured references.

Available answers:

No (weight: 0.0)
Yes (weight: 1.0)

UGC_OC10: Does the platform’s documentation include or link to the API or data access terms of use?

Description: Verifies whether the platform’s documentation clearly states, or provides a direct link to, the API’s terms of use. The assessment should review the documentation to confirm the presence of explicit legal terms that define permitted data access, usage conditions, and any applicable restrictions.

Available answers:

No (weight: 0.0)
Yes (weight: 1.0)

UGC_OC11: Does the platform’s API documentation detail the response format of each endpoint?

Description: Verifies whether the API documentation specifies response formats for endpoints, including examples and potential error codes, with explicit descriptions of response structures.

Available answers:

No (weight: 0.0)
Yes (weight: 1.0)

UGC_OC12: Does the platform provide its API documentation in the official languages of the assessed region?

Description: Verifies whether complete and up-to-date API documentation is available in the official languages of the region being assessed.

Available answers:

No (weight: 0.0)
Yes (weight: 1.0)

UGC_OC13: Does the platform’s API documentation detail the quota or rate limits applicable to each available endpoint?

Description: Verifies whether the documentation specifies rate limits and quotas for each endpoint, including variations by authentication level or endpoint type.

Available answers:

No (weight: 0.0)
Yes (weight: 1.0)

UGC_OC14: Does the platform provide a way to label content that has been generated with artificial intelligence?

Description: Verifies whether the platform automatically flags or allows users to flag AI-generated content and whether this information is exposed in the API response.

Available answers:

No (weight: 0.0)
Yes (weight: 1.0)

7.2.1.4 Completeness

UGC_OC15: Can data from a publication’s comments be extracted using the platform’s API?

Description: Verifies whether comment data, including content, can be extracted when available—either together with publication data or via dedicated endpoints.

Available answers:

No (weight: 0.0)
Yes (weight: 1.0)

UGC_OC16: Can data from temporary content be extracted through the platform’s API?

Description: Verifies whether the API provides at least one endpoint for collecting data from temporary publications (e.g., stories or other ephemeral content) as structured data before expiration.

Available answers:

No (weight: 0.0)
Yes (weight: 1.0)

UGC_OC17: Can historical data be extracted through the platform’s API?

Description: Verifies whether endpoints allow the collection of public user-generated content data from a time range that extends more than one year prior to the request date.

Available answers:

No (weight: 0.0)
Yes (weight: 1.0)

UGC_OC18: Is the number of requests allowed by the API sufficient for monitoring more than 10,000 publications in 24 hours?

Description: Verifies whether the platform’s rate limits and quotas allow continuous extraction of data for more than 10,000 publications within a 24-hour period without losses.

Available answers:

No (weight: 0.0)
Yes (weight: 1.0)
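
The arithmetic behind this criterion can be sketched as follows, assuming a quota expressed as a number of requests per fixed time window and a fixed number of publications returned per request. The function name and the example figures are hypothetical, and real quotas may combine several limits that this sketch ignores.

```python
import math


def quota_covers_monitoring(requests_per_window: int,
                            window_seconds: int,
                            items_per_request: int,
                            target_items: int = 10_000,
                            period_seconds: int = 86_400) -> bool:
    """Return True if the quota allows more than target_items in the period."""
    # Number of quota windows that fit into the monitoring period (24 hours).
    windows_in_period = period_seconds / window_seconds
    # Whole requests available over the period.
    max_requests = math.floor(requests_per_window * windows_in_period)
    # The criterion asks for *more than* 10,000 publications.
    return max_requests * items_per_request > target_items
```

For example, a hypothetical limit of 100 requests per 15 minutes with 10 publications per request would allow 96,000 publications per day and satisfy the criterion, while 300 requests per day with 20 publications per request (6,000 in total) would not.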

7.2.1.5 Consistency

UGC_OC19: Are the results returned by the API consistently reproducible?

Description: Verifies whether data extracted via the platform’s API at a given time are consistent with other collections performed in the same way. The assessment should conduct repeated test queries to confirm the reproducibility of results, or ground the response in recent experiments (less than two years old) published in peer-reviewed journals.

Available answers:

No (weight: 0.0)
Yes (weight: 1.0)
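
One way to compare repeated test queries is to measure the overlap between the sets of publication identifiers they return. The sketch below uses the Jaccard index for this purpose. It is an illustrative choice rather than the procedure used in the assessment, the function name is hypothetical, and the framework does not specify a threshold above which results count as reproducible.

```python
from typing import Iterable


def jaccard_overlap(first_ids: Iterable[str], second_ids: Iterable[str]) -> float:
    """Share of identifiers common to two collections (1.0 means identical sets)."""
    first, second = set(first_ids), set(second_ids)
    if not first and not second:
        return 1.0  # two empty collections are trivially identical
    return len(first & second) / len(first | second)
```

Two runs of the same query that return identifiers `p1, p2, p3` and `p2, p3, p4` share two of four distinct publications, giving an overlap of 0.5.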

UGC_OC20: Is the data returned by the platform’s API consistent with the parameters and filters used in the request?

Description: Verifies whether the data returned by the API matches the filters and parameters specified, based on repeated test queries or recent validation studies.

Available answers:

No (weight: 0.0)
Yes (weight: 1.0)
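
A test of this kind can be sketched by checking every returned record against the filters that were requested. The example assumes a time-range filter and a single search term, and the field names `published_at` and `text` are hypothetical; actual response schemas differ across platforms.

```python
from datetime import datetime
from typing import Dict, List


def filter_violations(records: List[Dict[str, str]],
                      start: datetime,
                      end: datetime,
                      term: str) -> List[Dict[str, str]]:
    """Return the records that fall outside the window or lack the search term."""
    violations = []
    for record in records:
        # Timestamps are assumed to be ISO 8601 strings with a UTC offset.
        published = datetime.fromisoformat(record["published_at"])
        in_window = start <= published <= end
        has_term = term.lower() in record["text"].lower()
        if not (in_window and has_term):
            violations.append(record)
    return violations
```

An empty result indicates that the response matched the request. Any returned record points to a mismatch between the parameters sent and the data delivered.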

7.2.1.6 Relevance

UGC_OC21: Does the data extracted by the platform’s API reflect what is displayed on its user interface?

Description: Verifies whether the API data corresponds to the information displayed on the user interface at all levels of detail, including authorship, full content, interaction counts, and referenced content.

Available answers:

No (weight: 0.0)
Yes (weight: 1.0)

UGC_OC22: Does the platform’s API allow for filtering data based on the location of the content or its author?

Description: Verifies whether the API supports applying location-based filters to data extraction. The assessment should test the endpoint for the main content type to confirm that data on public posts can be filtered by content or author location.

Available answers:

No (weight: 0.0)
Yes (weight: 1.0)

UGC_OC23: Does the platform’s API allow for filtering data based on content language?

Description: Verifies whether the API allows applying language-based filters, enabling public post data to be filtered by content language.

Available answers:

No (weight: 0.0)
Yes (weight: 1.0)

UGC_OC24: Does the platform’s API allow for filtering data by specific time periods?

Description: Verifies whether the API allows applying custom temporal filters so that public post data can be filtered by specific time ranges.

Available answers:

No (weight: 0.0)
Yes (weight: 1.0)

7.2.1.7 Timeliness

UGC_OC25: Can data from newly published content be extracted from the platform’s API in near real time?

Description: Verifies whether the API allows the collection of data from specific content within approximately one hour of its publication, enabling near real-time monitoring.

Available answers:

No (weight: 0.0)
Yes (weight: 1.0)
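
The timing check behind this criterion can be sketched as a comparison between a publication's timestamp and the moment it was first retrieved. The function name is hypothetical, and the one-hour default is a configurable stand-in for the "approximately one hour" stated in the description.

```python
from datetime import datetime, timedelta


def within_near_real_time(published_at: datetime,
                          collected_at: datetime,
                          limit: timedelta = timedelta(hours=1)) -> bool:
    """True if the content was retrieved no later than `limit` after publication."""
    lag = collected_at - published_at
    # A negative lag would mean collection before publication, which
    # signals a clock or time zone problem rather than a valid result.
    return timedelta(0) <= lag <= limit
```

Both timestamps should be timezone-aware and expressed against the same reference, otherwise the measured lag is not meaningful.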

7.2.2 Advertising (Ads) Questions

7.2.2.1 Special Criteria

AD_SC1: Does the platform provide an API to access its ad repository and extract data on advertising content for independent analysis?

Description: This item verifies whether the platform provides an ad repository API with at least one endpoint for programmatically extracting advertising data. Full availability is confirmed when the API returns information on ads across all categories. The assessment should confirm that the endpoint allows the retrieval and storage of ad data without requiring privileged or internal access beyond standard developer registration.

Available answers:

Yes, with full availability (weight: 1.0)
Yes, with partial availability (weight: 0.5)
No (weight: 0.0)

AD_SC2: Does the platform provide a graphical user interface to its ad repository for extracting advertising content data?

Description: This item verifies whether the platform provides a graphical user interface (GUI) that enables the extraction of structured data without programming. Full availability is considered granted when the GUI delivers information on ads across all categories. The assessment should confirm the availability of an official browser-based tool that allows users not only to view ad content but also to export its data.

Available answers:

Yes, with full availability (weight: 1.0)
Yes, with partial availability (weight: 0.5)
No (weight: 0.0)

AD_SC3: Can data from both active and inactive ads be extracted?

Description: This item verifies whether the platform allows the extraction of ad data through either the GUI or the API, covering ads from at least one day after their publication back to at least one year before the request date. Full availability is confirmed when both active and inactive ad data are delivered across all ad categories. The assessment should test the interface and endpoints to confirm whether both active and inactive ads can be retrieved.

Available answers:

Yes, with full availability (weight: 1.0)
Yes, with partial availability (weight: 0.5)
No (weight: 0.0)

7.2.2.2 Accessibility

AD_OC1: Does the platform provide a GUI for accessing and visualizing its ad repository?

Description: This item verifies whether the platform provides a GUI for accessing and viewing ads in its ad repository. Full access is confirmed when the GUI provides information on ads across all categories and publication statuses, including both active and inactive ads. The assessment should confirm the availability of an official browser-based tool that allows users to search, access, and view ad content.

Available answers:

Yes, with full availability (weight: 1.0)
Yes, with partial availability (weight: 0.5)
No (weight: 0.0)

AD_OC2: Is access to the platform’s ad repository free of charge?

Description: This item verifies whether the ad repository API or GUI is free of charge, since even modest fees can create barriers or force researchers in low-resourced settings to narrow the scope of their work. The assessment should verify the platform’s documentation and pricing policies to confirm that no fees are applied for access to the ad repository.

Available answers:

Free API and GUI access (weight: 1.0)
Free API access (weight: 0.5)
Free GUI access (weight: 0.5)
No (weight: 0.0)

AD_OC3: Can the requested data be extracted directly from the ad repository response?

Description: This item verifies whether the platform’s ad repository returns structured data on ad content and authorship directly in the response, rather than providing a link that redirects to the data. Audiovisual media files and data (e.g., images, videos, and audio) should not be considered when assessing this item. The assessment should examine sample data responses from both the ad repository GUI and API to confirm that the requested public data is included in the returned payload.

Available answers:

Yes, through both GUI and API (weight: 1.0)
Yes, through the GUI (weight: 0.5)
Yes, through the API (weight: 0.5)
No (weight: 0.0)

AD_OC4: Does the platform’s ad repository API provide a form of authentication that allows for renewal without the risk of data loss?

Description: This item verifies whether the tokens provided for API use can be renewed without the risk of data loss, ensuring continuity and integrity of data access and monitoring. The assessment should check the platform’s documentation or directly observe the authentication and renewal process to confirm that token updates do not interrupt or compromise data access.

Available answers:

Yes (weight: 1.0)
No (weight: 0.0)

AD_OC5: Can data from an individual ad be retrieved from the platform?

Description: This item verifies whether it is possible to retrieve data from a specific advertisement on the ad repository using a unique identifier, rather than relying on search terms or other parameters and filters. The assessment should review the ad repository documentation and test available features to confirm that an individual ad can be retrieved directly by its unique identifier.

Available answers:

Yes, through both GUI and API (weight: 1.0)
Yes, through the GUI (weight: 0.5)
Yes, through the API (weight: 0.5)
No (weight: 0.0)

AD_OC6: Can data from ads served by a specific advertiser be retrieved from the platform?

Description: This item verifies whether it is possible to retrieve data from ads run by a specific advertiser, via their username or unique identifier. The assessment should review the ad repository documentation and test any available feature to retrieve data from an individual advertiser.

Available answers:

Yes, through both GUI and API (weight: 1.0)
Yes, through the GUI (weight: 0.5)
Yes, through the API (weight: 0.5)
No (weight: 0.0)

AD_OC7: Can ad data be retrieved from the platform using search terms?

Description: This item verifies whether ad data can be retrieved through search terms, enabling the creation of datasets based on those queries. The assessment should test search-related features to confirm that they accept search queries using keywords.

Available answers:

Yes, through both GUI and API (weight: 1.0)
Yes, through the GUI (weight: 0.5)
Yes, through the API (weight: 0.5)
No (weight: 0.0)

AD_OC8: Does the platform use locale-neutral data representations?

Description: This item verifies whether locale-sensitive data (e.g., timestamps, currency, numbers) are provided in a locale-neutral format, or, if that is not possible, whether relevant locale metadata is included. The assessment should review the ad repository documentation and inspect sample responses to confirm the presence of standardized formats or accompanying metadata.

Available answers:

Yes, through both GUI and API (weight: 1.0)
Yes, through the GUI (weight: 0.5)
Yes, through the API (weight: 0.5)
No (weight: 0.0)

7.2.2.3 Completeness

AD_OC9: Does the platform provide data that allows the identification of advertisers who ran ads?

Description: This item verifies whether the platform discloses information on the advertisers responsible for the identified ads. The assessment should confirm whether the advertiser’s page name, URL, and unique identifier can be retrieved.

Available answers:

Yes, through both GUI and API (weight: 1.0)
Yes, through the GUI (weight: 0.5)
Yes, through the API (weight: 0.5)
No (weight: 0.0)

AD_OC10: Does the platform provide data on the funders who paid for ads?

Description: This item verifies whether the platform provides data on the individuals or organizations that paid for the identified ads. The assessment should confirm whether any sponsor information is retrievable.

Available answers:

Yes, through both GUI and API (weight: 1.0)
Yes, through the GUI (weight: 0.5)
Yes, through the API (weight: 0.5)
No (weight: 0.0)

AD_OC11: Does the platform provide data on the period during which ads were served?

Description: This item verifies whether the platform provides data on the days on which the identified ads ran. The assessment should review the extracted ad data to confirm that it includes start and end dates (or equivalent temporal markers) indicating the period of activity.

Available answers:

Yes, through both GUI and API (weight: 1.0)
Yes, through the GUI (weight: 0.5)
Yes, through the API (weight: 0.5)
No (weight: 0.0)
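A minimal sketch of the check described for AD_OC11 follows. It assumes that extracted records expose hypothetical `start_date` and `end_date` fields as ISO 8601 dates, and that a missing end date means the ad was still running at extraction time. Both are assumptions made for illustration.

```python
from datetime import date


def serving_period(record: dict):
    """Return (start, end, days) or None if the period cannot be established."""
    start_raw, end_raw = record.get("start_date"), record.get("end_date")
    if not start_raw:
        return None  # no start marker, so the period is unknown
    start = date.fromisoformat(start_raw)
    # Assumption: no end date means the ad is still being served.
    end = date.fromisoformat(end_raw) if end_raw else None
    if end is not None and end < start:
        return None  # inconsistent temporal markers
    days = (end - start).days + 1 if end is not None else None
    return start, end, days
```

The day count is inclusive of both the first and the last day of serving.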

AD_OC12: Does the platform provide data on user engagement with ads?

Description: This item verifies whether the platform provides data on the total number of user interactions with ads (e.g., likes, comments, shares, clicks). The assessment should review the extracted ad data to confirm that engagement metrics are available and clearly linked to each ad.

Available answers:

Yes, through both GUI and API (weight: 1.0)
Yes, through the GUI (weight: 0.5)
Yes, through the API (weight: 0.5)
No (weight: 0.0)

AD_OC13: Does the platform indicate whether ads were placed by verified or unverified advertisers?

Description: This item verifies whether the platform clearly indicates whether advertisers were verified at the time their ads were served. The assessment should review ad records to confirm that a verification status field is present.

Available answers:

Yes, through both GUI and API (weight: 1.0)
Yes, through the GUI (weight: 0.5)
Yes, through the API (weight: 0.5)
No (weight: 0.0)

7.2.2.4 Compliance

AD_OC14: Does the platform flag ads that were removed due to violations of its guidelines or relevant legislation?

Description: This item verifies whether the platform indicates when an ad has been moderated. At a minimum, the platform should provide the reason for removal and the date. The assessment should review ad records to confirm that moderated ads are flagged and that the corresponding moderation details are clearly documented.

Available answers:

Yes, through both GUI and API (weight: 1.0)
Yes, through the GUI (weight: 0.5)
Yes, through the API (weight: 0.5)
No (weight: 0.0)

AD_OC15: Does the platform indicate whether ad content was generated using artificial intelligence?

Description: This item verifies whether the platform flags ads in which AI was involved in generating the content. The assessment should review ad records to confirm the presence of a field or label indicating the use of AI in ad production.

Available answers:

Yes, through both GUI and API (weight: 1.0)
Yes, through the GUI (weight: 0.5)
Yes, through the API (weight: 0.5)
No (weight: 0.0)

AD_OC16: Is the platform’s ad repository documentation published in open access?

Description: This item verifies whether the platform makes its ad repository documentation openly available on the internet, without requiring user registration or login. The assessment should attempt to access the documentation directly to confirm that it is fully available without authentication barriers.

Available answers:

Yes, both API and GUI documentation (weight: 1.0)
Yes, the API documentation (weight: 0.5)
Yes, the GUI documentation (weight: 0.5)
No (weight: 0.0)

AD_OC17: Is the platform’s ad repository documentation clearly written and exemplified?

Description: This item verifies whether the documentation for the platform’s ad repository is clear, complete, and provides practical implementation examples. The assessment should review the documentation to confirm the presence of detailed explanations, structured references, and sample queries or outputs illustrating correct use.

Available answers:

Yes, both API and GUI documentation (weight: 1.0)
Yes, the API documentation (weight: 0.5)
Yes, the GUI documentation (weight: 0.5)
No (weight: 0.0)

AD_OC18: Does the platform’s ad repository documentation include or link to its terms of use?

Description: This item verifies whether the documentation clearly and unambiguously states or refers to the terms for using the ad repository and its associated legal aspects. The assessment should review the documentation to confirm that explicit terms or references are provided and accessible.

Available answers:

Yes, both API and GUI documentation (weight: 1.0)
Yes, the API documentation (weight: 0.5)
Yes, the GUI documentation (weight: 0.5)
No (weight: 0.0)

AD_OC19: Does the platform provide its ad repository documentation in the official languages of the assessed region?

Description: This item verifies whether the platform provides its ad repository documentation in the official languages of the region being assessed. The assessment should review the documentation to confirm that complete and up-to-date versions are available in those languages.

Available answers:

Yes, both API and GUI documentation (weight: 1.0)
Yes, the API documentation (weight: 0.5)
Yes, the GUI documentation (weight: 0.5)
No (weight: 0.0)

AD_OC20: Does the platform implement a proper deprecation strategy to avoid breaking client applications while rolling out major changes in the API?

Description: This item verifies whether the platform’s documentation describes a deprecation strategy with a grace period before removing features. The assessment should review changelogs to confirm that deprecated features are listed with deprecation and removal dates and include migration instructions. This item applies only to breaking changes that require client updates, such as endpoint modifications, authentication updates, or the removal of features.

Available answers:

Yes (weight: 1.0)
No (weight: 0.0)
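The changelog review for AD_OC20 can be made systematic once changelog entries are transcribed into a simple structure. The sketch below is illustrative only: the keys (`feature`, `deprecated_on`, `removed_on`, `migration`) are hypothetical, and the 90-day minimum grace period is an assumed threshold, since the item itself does not specify a duration.

```python
from datetime import date

REQUIRED_KEYS = ("feature", "deprecated_on", "removed_on", "migration")


def review_changelog(entries, min_grace_days=90):
    """Return (feature, problem) pairs for entries that fall short."""
    findings = []
    for entry in entries:
        missing = [k for k in REQUIRED_KEYS if not entry.get(k)]
        if missing:
            findings.append((entry.get("feature", "?"), "missing: " + ", ".join(missing)))
            continue
        deprecated = date.fromisoformat(entry["deprecated_on"])
        removed = date.fromisoformat(entry["removed_on"])
        grace = (removed - deprecated).days
        if grace < min_grace_days:  # assumed threshold, not part of the item
            findings.append((entry["feature"], f"grace period of {grace} days"))
    return findings
```

An empty result is compatible with a "Yes" answer. Any finding points the assessor to the entry that needs a closer look.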

AD_OC21: Does the platform’s ad repository API documentation detail the response format of each endpoint?

Description: This item verifies whether the platform’s ad repository API documentation specifies the format of each possible response, including examples and potential errors. The assessment should review the documentation to confirm that response structures are described and illustrated with sample outputs.

Available answers:

Yes (weight: 1.0)
No (weight: 0.0)

AD_OC22: Does the platform’s ad repository API documentation detail the quota or rate limits applicable to each available endpoint?

Description: This item verifies whether the platform’s ad repository API documentation specifies the limits for each endpoint, including any variations based on authentication level or endpoint type. Rate and quota limits define the maximum number of requests allowed within a given period (e.g., 1,000 requests per hour). The assessment should review the documentation to confirm that request caps (rate limits) and overall usage restrictions (quotas) are clearly stated.

Available answers:

Yes (weight: 1.0)
No (weight: 0.0)
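Documented limits matter because they determine how long a collection takes. The arithmetic below is a small sketch using the example figure from the description (1,000 requests per hour). The page size and record count used in the usage note are hypothetical.

```python
import math


def min_interval_seconds(max_requests: int, window_seconds: int = 3600) -> float:
    """Smallest even spacing between requests that stays within the limit."""
    return window_seconds / max_requests


def hours_to_collect(total_records: int, page_size: int,
                     max_requests: int, window_seconds: int = 3600) -> float:
    """Time needed to page through a result set at the documented limit."""
    requests_needed = math.ceil(total_records / page_size)
    return requests_needed * window_seconds / max_requests / 3600
```

With a limit of 1,000 requests per hour, requests must be spaced about 3.6 seconds apart. Collecting 250,000 records at 100 records per page would need 2,500 requests, or roughly 2.5 hours.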

7.2.2.5 Consistency

AD_OC23: Does the data retrieved by the API reflect what is displayed on the platform’s ad repository GUI?

Description: This item verifies whether the data returned by the platform’s ad repository API matches the information displayed on its GUI at every level of detail. The assessment should compare API responses with the GUI to confirm that at least the following elements are consistent: authorship, full content, and serving information (e.g., amount spent, impressions reached).

Available answers:

Yes (weight: 1.0)
No (weight: 0.0)
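One way to carry out the comparison for AD_OC23 is to transcribe what the GUI shows for a given ad and compare it field by field with the API response for the same ad. The sketch below assumes both have been reduced to dictionaries with the same hypothetical keys. The normalisation step is a design choice made here so that purely cosmetic differences in whitespace and letter case are not counted as mismatches.

```python
# Hypothetical keys covering authorship, full content, and serving information.
COMPARED_FIELDS = ("advertiser", "content", "spend", "impressions")


def normalise(value) -> str:
    """Collapse whitespace and case so cosmetic differences are ignored."""
    return " ".join(str(value).split()).casefold()


def mismatches(gui_record: dict, api_record: dict) -> dict:
    """Return {field: (gui_value, api_value)} for fields that differ or are absent."""
    diffs = {}
    for field in COMPARED_FIELDS:
        gui_val, api_val = gui_record.get(field), api_record.get(field)
        if gui_val is None or api_val is None or normalise(gui_val) != normalise(api_val):
            diffs[field] = (gui_val, api_val)
    return diffs
```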

AD_OC24: Are the results returned by the platform consistently reproducible?

Description: This item verifies whether data extracted from the platform’s ad repository at a given time is consistent with data from other collections performed in the same way. The assessment should perform repeated queries to confirm the reproducibility of results, or base the answer on recent experiments (published less than two years ago) in peer-reviewed journals.

Available answers:

Yes, through both GUI and API (weight: 1.0)
Yes, through the GUI (weight: 0.5)
Yes, through the API (weight: 0.5)
No (weight: 0.0)
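Repeated queries can be compared by the set of ad identifiers each run returns. The sketch below uses Jaccard similarity as the overlap measure. This metric is chosen here for illustration and is not prescribed by the framework, and what counts as an acceptable score is left to the assessor.

```python
from itertools import combinations


def jaccard(a, b) -> float:
    """Share of identifiers common to both runs, out of all identifiers seen."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty result sets are treated as identical
    return len(a & b) / len(a | b)


def reproducibility(runs) -> float:
    """Lowest pairwise Jaccard similarity across all repeated runs."""
    scores = [jaccard(x, y) for x, y in combinations(runs, 2)]
    return min(scores) if scores else 1.0
```

A value of 1.0 means every run returned exactly the same set of ads. Lower values indicate that identical queries produced different results.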

AD_OC25: Is the data returned by the platform consistent with the parameters and filters used in the request?

Description: This item verifies whether the data retrieved through the ad repository accurately reflects the parameters and filters specified at the time of the request. The assessment should run test queries with different filters to confirm that results consistently match the requested conditions.

Available answers:

Yes, through both GUI and API (weight: 1.0)
Yes, through the GUI (weight: 0.5)
Yes, through the API (weight: 0.5)
No (weight: 0.0)
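The test queries for AD_OC25 can be checked automatically by confirming that every returned record satisfies every filter that was requested. The sketch below assumes simple equality filters and hypothetical field names. Range or keyword filters would need their own comparison logic.

```python
def violations(records, filters: dict) -> list:
    """Return the records that fail at least one requested equality filter."""
    return [
        record for record in records
        if any(record.get(key) != expected for key, expected in filters.items())
    ]
```

For example, after requesting ads with `{"country": "BR", "category": "political"}`, an empty result from `violations` indicates that the response respected both filters.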

7.2.2.6 Relevance

AD_OC26: Does the platform allow the use of temporal filters to retrieve data on ads?

Description: This item verifies whether the ad repository allows filtering data by the time period in which the ads were served. The assessment should test queries with temporal filters to confirm that results accurately reflect the specified date ranges.

Available answers:

Yes, through both GUI and API (weight: 1.0)
Yes, through the GUI (weight: 0.5)
Yes, through the API (weight: 0.5)
No (weight: 0.0)
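Testing a temporal filter requires a decision on what counts as a match. The sketch below assumes that an ad matches if it was served at any point during the requested range (an overlap test), and that an ad without an end date is still running. Platforms may instead filter by start date only, in which case the comparison would need to be adapted. The record keys are hypothetical.

```python
from datetime import date


def overlaps(ad_start: date, ad_end, query_start: date, query_end: date) -> bool:
    """True if the serving period intersects the requested date range."""
    ad_end = ad_end or date.max  # assumption: no end date means still running
    return ad_start <= query_end and ad_end >= query_start


def outside_range(records, query_start: date, query_end: date) -> list:
    """Identifiers of returned ads that fall entirely outside the requested range."""
    return [
        r["id"] for r in records
        if not overlaps(r["start"], r.get("end"), query_start, query_end)
    ]
```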

AD_OC27: Does the platform allow filtering advertising data by ad category?

Description: This item verifies whether the ad repository allows filtering data by any categories assigned at the time of ad creation. The assessment should run test queries with category filters to confirm that results align with the selected classifications.

Available answers:

Yes, through both GUI and API (weight: 1.0)
Yes, through the GUI (weight: 0.5)
Yes, through the API (weight: 0.5)
No (weight: 0.0)

AD_OC28: Does the platform allow filtering advertising data by geographic location?

Description: This item verifies whether the ad repository allows filtering data by one or more geographic locations where the ads were served. The assessment should test queries with location filters to confirm that results match the specified areas.

Available answers:

Yes, through both GUI and API (weight: 1.0)
Yes, through the GUI (weight: 0.5)
Yes, through the API (weight: 0.5)
No (weight: 0.0)

7.2.2.7 Accuracy

AD_OC29: Does the platform provide age and gender data on the audiences of ads?

Description: This item verifies whether the platform provides data on the age and gender of audiences reached. The assessment should review the ad records to confirm that these breakdowns are available and consistently reported.

Available answers:

Yes, through both GUI and API (weight: 1.0)
Yes, through the GUI (weight: 0.5)
Yes, through the API (weight: 0.5)
No (weight: 0.0)

AD_OC30: Does the platform provide subnational geographic data on the audience reached by ads?

Description: This item verifies whether the platform provides data on the subnational geographic location of audiences reached. The assessment should review the ad records to confirm that such location breakdowns are available and consistently reported.

Available answers:

Yes, through both GUI and API (weight: 1.0)
Yes, through the GUI (weight: 0.5)
Yes, through the API (weight: 0.5)
No (weight: 0.0)

AD_OC31: Does the platform include data on audience targeting criteria defined by advertisers?

Description: This item verifies whether the platform provides data on audience targeting criteria defined by the advertiser when publishing ads (e.g., demographic and geographic segments, as well as interests, attitudes, behaviors, and keywords). The assessment should review ad records to confirm that these targeting parameters are available and consistently reported.

Available answers:

Yes, through both GUI and API (weight: 1.0)
Yes, through the GUI (weight: 0.5)
Yes, through the API (weight: 0.5)
No (weight: 0.0)

AD_OC32: Does the platform provide granular volume ranges for ad impressions?

Description: This item verifies whether the ad repository provides impression values for ads, using ranges that closely approximate the actual numbers. Intervals should be no larger than 10% of the upper bound of the value range they represent. For example, values up to 1,000 impressions should be displayed in intervals no larger than 100; between 1,000 and 10,000 in intervals no larger than 1,000; between 10,000 and 100,000 in intervals no larger than 10,000; and from 100,000 upwards, including values above 1 million, in intervals no larger than 100,000. The assessment should check whether the reported intervals remain within this threshold across the different value ranges, using the platform’s documentation or available data interfaces.

Available answers:

Yes, through both GUI and API (weight: 1.0)
Yes, through the GUI (weight: 0.5)
Yes, through the API (weight: 0.5)
No (weight: 0.0)
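The thresholds in the AD_OC32 description can be encoded directly. The sketch below makes two assumptions that the description does not state: the value range of a reported interval is determined by its upper bound, and an open-ended bucket (for example, "more than 1 million") fails the check because its width is unbounded.

```python
# (upper bound of value range, maximum interval width), from the AD_OC32 description
BANDS = [(1_000, 100), (10_000, 1_000), (100_000, 10_000)]
TOP_WIDTH = 100_000  # applies from 100,000 upwards


def max_width(upper: int) -> int:
    """Largest interval width allowed for an interval ending at `upper`."""
    for band_upper, width in BANDS:
        if upper <= band_upper:
            return width
    return TOP_WIDTH


def is_granular(lower: int, upper) -> bool:
    """True if the reported interval [lower, upper] is within the threshold."""
    if upper is None:  # open-ended bucket, assumed to fail
        return False
    return (upper - lower) <= max_width(upper)
```

Under these assumptions, a reported range of 1,000 to 2,000 impressions passes, while 100,000 to 500,000 does not.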

AD_OC33: Does the platform provide granular investment ranges for ad spending?

Description: This item verifies whether the ad repository provides spending values for ads, using ranges that closely approximate the actual amounts. Intervals should be no larger than 10% of the upper bound of the value range they represent. For example, values up to $100 should be displayed in intervals no larger than $10; between $100 and $1,000 in intervals no larger than $100; and between $1,000 and $10,000 in intervals no larger than $1,000. The assessment should measure whether the reported intervals remain within this threshold across the different value ranges using the platform’s documentation or available data interfaces.

Available answers:

Yes, through both GUI and API (weight: 1.0)
Yes, through the GUI (weight: 0.5)
Yes, through the API (weight: 0.5)
No (weight: 0.0)
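The 10% rule for spending can also be written compactly. The notation is introduced here for illustration and does not appear in the framework: [a, b] is a reported spending interval and U is the upper bound of the value range in which it falls.

```latex
% a, b: bounds of the reported interval; U: upper bound of its value range
b - a \le 0.1\,U, \qquad
U = 100 \Rightarrow b - a \le 10, \quad
U = 1{,}000 \Rightarrow b - a \le 100, \quad
U = 10{,}000 \Rightarrow b - a \le 1{,}000 .
```

The three cases reproduce the dollar examples given in the description.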

7.3 Terminology

Online platforms

Online platforms are technological infrastructures that connect individuals and organisations, facilitating the exchange of goods, services, or information by mediating relationships among two or more parties. They often include technical features enabling third parties to interact with the platform or develop additional functionalities. Such platforms are inherently non-neutral and operate under profit-driven incentives. While online platforms take various forms (e.g., app stores, streaming services, online marketplaces), this work focuses specifically on one type: social media.

Social media platforms. Social media platforms are a subset of online platforms that host and circulate user-generated content and interactions rather than producing most content themselves. In addition to user-generated content, these platforms feature advertising and commercial material, which form the basis of their business models. These models rely on computational infrastructures designed for large-scale collection, processing, and analysis of user data. Even though the content is largely created by users, platforms make crucial decisions about its distribution—determining what is shown, to whom, how user connections and interactions are structured, and what is permitted.

Messaging platforms. While primarily used for real-time, personal, and encrypted communication, messaging platforms (e.g., Discord, WhatsApp, Telegram) increasingly blur the boundaries between private and public interactions. They are progressively incorporating social components into their core features, such as large groups, public channels, status updates, and other broadcast-oriented or community-building functionalities, which facilitate the widespread circulation of content and the formation of networked publics. Consequently, we argue that messaging platforms should be considered an integral part of the social media ecosystem.

User-generated content

Social media platforms are predominantly populated by two distinct types of content: user-generated content and advertising content. User-generated content (UGC) refers to any material created and shared by users on a platform—such as posts, comments, images, or videos—rather than by the platform itself, and distributed organically without paid promotion. For most platforms, UGC is not only a means of enabling users to communicate and connect, but also a core mechanism through which user engagement is transformed into behavioural data and, ultimately, economic value.

Advertising content

Advertising content, in contrast to UGC, can be understood as any form of commercial communication whose visibility is amplified through payments made to the platform. This type of content is typically distributed in a microtargeted manner: drawing on the accumulation of users’ behavioural, demographic, and geographic data (derived both from their interactions and from information they voluntarily provide), platforms construct detailed user profiles. These profiles are then made available to advertisers, who use them to target audiences they consider most likely to receive, engage with, and respond to the promoted content.

Application Programming Interface

Application Programming Interfaces (APIs) are among the most widely used methods for collecting structured data from social media platforms. They enable communication between software components—typically a server (such as a platform database) and a client (for instance, a researcher’s device)—through standardised data requests governed by specific protocols. By allowing users to securely access and retrieve data in a structured and automated manner, APIs support scalable data collection processes and offer a high degree of customisation, as researchers can tailor queries to specific parameters and research needs. Access generally requires authentication tokens to validate client requests, while data retrieval occurs through calls to predefined endpoints, following parameters specified in the API’s documentation and executed through code.
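The elements named above (an authentication token, a predefined endpoint, and documented parameters) come together in a single request. The sketch below only builds the request and does not send it. The endpoint URL, the parameter names, and the use of a bearer token are all hypothetical and stand in for whatever a given platform documents.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://api.example.com/v1/ads"  # hypothetical endpoint


def build_request(token: str, **params) -> Request:
    """Assemble an authenticated GET request for the hypothetical endpoint."""
    query = urlencode(sorted(params.items()))  # sorted for a stable, comparable URL
    return Request(
        f"{BASE_URL}?{query}",
        headers={"Authorization": f"Bearer {token}"},  # assumed auth scheme
    )


# Hypothetical query: ads served in Brazil since the start of 2024, 100 per page.
req = build_request("YOUR_TOKEN", country="BR", since="2024-01-01", limit=100)
```

Passing `req` to `urllib.request.urlopen` would execute the call. In practice, researchers often use a higher-level HTTP library or a client provided by the platform.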

By facilitating data exchange, APIs promote interoperability across otherwise heterogeneous systems. However, some APIs adopt a more restrictive design, requiring prior approval and limiting access to secure and controlled environments. We argue that this model is not ideal for online platform-based research and inquiry, as it undermines core principles of open science by preventing independent verification of results, restricting cross-validation with other data sources, and constraining collaboration and data sharing across research communities.

Web scraping

Web scraping is a method for extracting and aggregating online content, often used by researchers when official data access mechanisms—such as APIs—are limited or insufficient. In practice, however, scraping is typically restricted by platform terms of service and actively discouraged through technical barriers, requiring continuous effort to maintain data collection and processing workflows. This dynamic not only increases the technical burden on researchers but also introduces legal uncertainty, as platforms have, in some cases, pursued action against individuals engaging in scraping.

While platforms justify these restrictions as necessary to protect user data, ensure authenticity, prevent automated abuse, and safeguard their ecosystems, scraping remains a far-from-ideal approach for social media research. It poses significant challenges related to data quality, reliability, and completeness, as well as to the long-term maintenance and stability of the infrastructures required to support data collection at scale.

Graphical User Interface

All platforms are accessed through a graphical user interface (GUI): a visual and interactive environment that allows users to perform actions and input commands in an intuitive, non-programmatic way. In addition to their primary interfaces, some platforms also provide dedicated interfaces for data access and exploration, enabling the collection and navigation of both UGC and advertising content.

User-Generated Content. In the case of UGC, a notable example is CrowdTangle, a tool owned by Meta that allowed approved researchers and journalists to browse and extract structured public data from platforms within its ecosystem. Data access could be performed either programmatically, via an API, or directly through the graphical interface, often with minimal technical effort. The tool also offered accessible and interactive features for data visualisation and analysis. CrowdTangle was eventually replaced by the Meta Content Library (MCL), introduced in 2023 as a private archive of public UGC from Facebook, Instagram, and Threads, with more restricted access and functionality. Access to the MCL is limited to researchers pre-approved by Meta, including those affiliated with academic institutions, research institutes, not-for-profit organisations, and other entities engaged in scientific or public-interest research. Data can be accessed either through a GUI with features similar to those previously offered by CrowdTangle or via an API. However, API access is only granted within a tightly controlled environment, typically through a closed virtual machine system that prevents data export and restricts integration with researchers’ own infrastructures, thereby limiting data sharing and independent analysis.

Advertising Content. In the context of advertising, ad repositories are primarily structured as web-based layers accessible through a GUI, and may be accompanied by an API. These GUIs allow users to browse ads served on a platform over a given period, providing access to their content as well as metadata such as delivery dates, reach, targeting criteria, and spending. In some cases, platforms also enable the extraction of this information in structured formats through simple interface-based actions, although comprehensive data coverage is often not guaranteed. Ad repository APIs, in turn, allow for the structured extraction of the data made available through these GUIs. However, they frequently impose additional limitations, including restrictions on the categories of ads whose data can be accessed, thereby constraining the scope and completeness of independent analysis.