Political Bias and Factualness in News Sharing Across more than 100,000 Online Communities

Galen Weld, Maria Glenski, Tim Althoff (2021)

Overview

This webpage provides information, datasets, and analysis for the paper Political Bias and Factualness in News Sharing Across more than 100,000 Online Communities, which is available on arXiv.

In this paper, we examine every link in public reddit submissions posted between January 2015 and August 2019, and label links to news sources with their political bias (left-right) and factualness (low-high), using news source annotations from the widely used site Media Bias/Fact Check.

We make this dataset public. For more details, scroll down.

Abstract

As civil discourse increasingly takes place online, misinformation and the polarization of news shared in online communities have become ever more relevant concerns with real world harms across our society. Biased news and misinformation have been shown to have real world impacts across society, impacting elections across the globe, and disrupting the public health response to the COVID-19 pandemic. Studying online news sharing at scale is challenging due to the massive volume of content which is shared by millions of users across thousands of communities. Therefore, existing research has largely focused on specific communities or specific interventions, such as bans. However, understanding the prevalence and spread of misinformation and polarization more broadly, across thousands of online communities, is critical for the development of governance strategies, interventions, and community design. Here, we conduct the largest study of news sharing on reddit to date, analyzing more than 550 million links spanning 4 years. We use non-partisan news source ratings from Media Bias/Fact Check to annotate links to news sources with their political bias and factualness. We find that, compared to left-leaning communities, right-leaning communities have 105% more variance in the political bias of their news sources, and more links to relatively-more biased sources, on average. We observe that reddit users’ voting and re-sharing behaviors generally decrease the visibility of extremely biased and low factual content, which receives 20% fewer upvotes and 30% fewer exposures from crossposts than more neutral or more factual content. This suggests that reddit is more resilient to low factual content than Twitter. Furthermore, users posting extremely biased and low factual content leave reddit 68% faster than users with more neutral bias, on average. We show that extremely biased and low factual content is very concentrated, with 99% of such content being shared in only 0.5% of communities, giving credence to the recent strategy of community-wide bans and quarantines.

Dataset

To create our dataset, we downloaded all public reddit submissions from Pushshift posted between January 2015 and August 2019 (inclusive), the most recent month available at the time of this study, for a total of 56 months of content (580 million submissions, 35 million unique authors, 3.4 million unique subreddits). For each submission, we extract the URLs of each linked-to website, which resulted in 559 million links. While link submissions by definition contain exactly one link, text submissions (selfposts) can include 0 or more links. Then, using annotations from Media Bias/Fact Check (MBFC) we identify links to news sources, and annotate these links to news sources with their political bias and factualness.

The resulting dataset, consisting of 35 million links to news sources, is publicly downloadable at this website. The dataset is in gzip-compressed, line-delimited JSON format, and is divided into 200 files for convenience, totaling approximately 21 gigabytes.

Schema

The dataset consists of a single table of posts, stored in line-delimited JSON format; as such, a single line corresponds to a single post. Here’s an example line:

{"subreddit":"00sdesign","author":"Cringe_Revolution","created_utc":"1533210010","id":"93yixs","is_deleted":false,"is_linkpost":true,"is_removed":false,"is_selfpost":false,"score":7,"urls":[{"domain":"i.imgur.com","nn_kind":"content_external","url":"https://i.imgur.com/fmmswxl.png"}],"subscribers":4}
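Since each line is one JSON object, a file can be streamed without loading it all into memory. The following is a minimal sketch of one way to do that with the Python standard library; the function name `iter_posts` and the file name in the usage note are hypothetical, not part of the dataset.

```python
import gzip
import json
from typing import Iterator


def iter_posts(path: str) -> Iterator[dict]:
    """Yield one post (a dict) per non-empty line of a gzip-compressed JSON-lines file."""
    # "rt" opens the compressed stream in text mode, so each line is a str.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)
```

For example, `for post in iter_posts("part-000.json.gz"): print(post["subreddit"])` would print the subreddit of each post in one file (the file name here is a placeholder; use the names of the files you downloaded).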

Each post has the following columns:

Column Name Description
subreddit The subreddit this post was submitted to.
author The reddit username of the author of this post.
created_utc The timestamp (seconds from Unix epoch) that this post was created.
id The reddit-assigned unique ID for this post.
is_deleted A boolean value, was this post deleted (by its author)?
is_removed A boolean value, was this post removed (by a moderator)?
is_linkpost A boolean value, is this post a link post (i.e. not a selfpost)?
is_selfpost A boolean value, is this post a selfpost? Note that a post may be neither a link post nor a selfpost if it is removed or deleted.
score The score (approximately the number of upvotes minus the number of downvotes) of this post, as of ~2 months after it was submitted.
urls A list of 0 or more URL objects (see below) submitted with this post. If this post is a link post, it will have exactly one URL. If it’s a selfpost, it may have 0 or more URLs extracted from the selftext of the post.
subscribers The approximate number of subscribers to the subreddit the post was submitted to at the time it was submitted. Estimated by linear interpolation of archived subscriber counts from archive.org. See the paper for more details.
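As a small illustration of working with these columns, the sketch below converts `created_utc` to a datetime and derives a post type from the boolean flags. It assumes `created_utc` may be stored as a string, as in the example line above; the helper names `created_at` and `post_kind` are hypothetical.

```python
from datetime import datetime, timezone


def created_at(post: dict) -> datetime:
    """Convert created_utc (seconds since the Unix epoch) to a timezone-aware UTC datetime."""
    # The example line stores created_utc as a string, so cast to int first.
    return datetime.fromtimestamp(int(post["created_utc"]), tz=timezone.utc)


def post_kind(post: dict) -> str:
    """Classify a post as 'link', 'self', or 'unknown' from its boolean flags."""
    if post.get("is_linkpost"):
        return "link"
    if post.get("is_selfpost"):
        return "self"
    # Removed or deleted posts may be neither a link post nor a selfpost.
    return "unknown"
```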

URL objects consist of the following columns:

Column Name Description
domain The domain of the linked website (e.g. i.imgur.com).
url The full URL of the link.
nn_kind Approximate categorization of non-news content. Null if news content or uncategorized. See below for more details. This value was not used in the paper.
pnnl_kind Label provided by the Volkova et al. dataset. Null if not labeled. This value was not used in the paper.
mbfc_kind Category that this news source is given by MBFC. This value was not used in the paper.
mbfc_bias Political bias label given by MBFC, on an integer scale from -3 (extreme left) to 0 (center) to +3 (extreme right). Null if no label given.
mbfc_factualness Factualness label given by MBFC, on an integer scale from 0 (very low) to +5 (very high). Null if no label given.
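To show how the bias labels might be aggregated, here is a rough sketch that groups `mbfc_bias` values by subreddit and reports the link count, mean, and population variance. This is an illustrative example only and is not the analysis pipeline from the paper; the function name `bias_by_subreddit` is hypothetical, and the sketch assumes that unlabeled links either omit `mbfc_bias` or set it to null.

```python
from collections import defaultdict
from statistics import mean, pvariance


def bias_by_subreddit(posts):
    """Map each subreddit to (labeled link count, mean bias, population variance of bias)."""
    labels = defaultdict(list)
    for post in posts:
        for url in post.get("urls", []):
            bias = url.get("mbfc_bias")
            # Skip links with no MBFC bias label (missing key or null).
            if bias is not None:
                labels[post["subreddit"]].append(bias)
    return {
        name: (len(values), mean(values), pvariance(values))
        for name, values in labels.items()
    }
```

Subreddits with no labeled links do not appear in the result, so the variance is always computed over at least one value.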

Non-news content categories consist of the following:

Category Name Description
reddit reddit links.
content_external External (non-reddit) content hosting sites such as imgur.
content_internal Internal (reddit-operated) content hosting domains such as v.redd.it.
social Social media sites like Facebook and Twitter.
music Music sites such as Spotify and Pandora.
porn Porn websites.
productivity Productivity sites such as docs.google.com.
reddit_mirror Reddit mirroring websites such as reddit-stream.com.
reference Reference sites such as Wikipedia and archive.org.
search Search engines like Google and Bing.
shopping Online shopping sites like Amazon and Newegg.
shortener URL shorteners like bit.ly.
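One simple use of these categories is to tally how many links fall into each. The sketch below is a hypothetical helper (`count_nn_kinds` is not part of the dataset) and assumes links without a category either omit `nn_kind` or set it to null; both cases are counted under `None`.

```python
from collections import Counter


def count_nn_kinds(posts) -> Counter:
    """Count links per non-news category; uncategorized links are tallied under None."""
    counts = Counter()
    for post in posts:
        for url in post.get("urls", []):
            counts[url.get("nn_kind")] += 1
    return counts
```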

If you find any errors or have any questions or requests for clarification, please feel free to contact Galen Weld or file an issue on this GitHub repo.