Rule Taxonomy and Timelines
Overview
Rule timelines were reconstructed from Wayback Machine snapshots, scraped at most once per week. The rules were then classified into a hierarchical taxonomy of 3 levels and 17 classes.
Taxonomy

Rules were classified by a GPT-4o-based classification pipeline. For more details, see our paper on arXiv.
Rule Data Download
Our rules data, spanning from 2018-04-23 to 2024-06-20, is available for download as a .csv file here.
Note that because Wayback Machine snapshots are infrequent, there is some uncertainty in both when a rule was created and when a rule was removed. To quantify this, we include the timestamps of the Wayback Machine snapshots on either side of a rule’s start and end date. For example, if a snapshot taken on Monday does not include Rule X, but the snapshot taken on Wednesday does include Rule X, we know that Rule X was created sometime between Monday and Wednesday. The same holds for a rule being removed. We include the timestamp of the Monday snapshot (earliest_start) and the Wednesday snapshot (latest_start), as these represent the lower and upper bounds, respectively, of a rule’s actual (and unknown) creation date.
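As a rough illustration of how these bounds can be used, the sketch below computes the width of the creation-date window and its midpoint for each rule. The two rows are hypothetical stand-ins, not values from the dataset, and the ISO date format is an assumption about how the timestamps are stored.

```python
import pandas as pd

# Hypothetical rows for illustration only; the real file's timestamp
# format may differ from the ISO dates assumed here.
rules = pd.DataFrame({
    "subreddit": ["example_a", "example_b"],
    "earliest_start": ["2020-01-06", "2021-03-01"],  # last snapshot without the rule
    "latest_start": ["2020-01-08", "2021-03-15"],    # first snapshot with the rule
})

# Convert the bound columns from strings to timestamps.
for col in ["earliest_start", "latest_start"]:
    rules[col] = pd.to_datetime(rules[col])

# Width of the window in which the rule must have been created.
rules["start_uncertainty"] = rules["latest_start"] - rules["earliest_start"]

# One possible point estimate: the midpoint of the window.
rules["start_midpoint"] = rules["earliest_start"] + rules["start_uncertainty"] / 2

print(rules[["subreddit", "start_uncertainty", "start_midpoint"]])
```

The same calculation applies to earliest_end and latest_end for rule removal. The midpoint is only one convenient choice; analyses that are sensitive to timing may prefer to carry both bounds forward.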
The file includes each rule in our dataset as a row, with each row having the following columns:
subreddit: the subreddit name
earliest_start: the lower bound of when the rule was added
latest_start: the upper bound of when the rule was added
earliest_end: the lower bound of when the rule was removed
latest_end: the upper bound of when the rule was removed
Prescriptive: a boolean, expressing if the rule tone is Prescriptive
Restrictive: a boolean, expressing if the rule tone is Restrictive
Post Content: a boolean, expressing if the rule target is Post Content
Post Format: a boolean, expressing if the rule target is Post Format
User-Related: a boolean, expressing if the rule target is User-Related
Not a Rule: a boolean, expressing if the rule is not able to be classified as a rule
Spam, Low Quality, Off-Topic, and Reposts: a boolean, expressing if the rule topic is Spam, Low Quality, Off-Topic, and Reposts
Post Tagging & Flairing: a boolean, expressing if the rule topic is Post Tagging & Flairing
Peer Engagement: a boolean, expressing if the rule topic is Peer Engagement
Links & External Content: a boolean, expressing if the rule topic is Links & External Content
Images: a boolean, expressing if the rule topic is Images
Commercialization: a boolean, expressing if the rule topic is Commercialization
Illegal Content: a boolean, expressing if the rule topic is Illegal Content
Divisive Content: a boolean, expressing if the rule topic is Divisive Content
Respect for Others: a boolean, expressing if the rule topic is Respect for Others
Brigading: a boolean, expressing if the rule topic is Brigading
Ban Mentioned: a boolean, expressing if the rule topic is Ban Mentioned
Karma/Score Mentioned: a boolean, expressing if the rule topic is Karma/Score Mentioned
Rule Text: the full text of the rule
For easier analysis, we recommend loading the file with Pandas and creating a MultiIndex over the subreddit and the four date-bound columns. This can be done with pd.read_csv('path_to_downloaded.csv').set_index(['subreddit', 'earliest_start', 'latest_start', 'earliest_end', 'latest_end']).
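The sketch below expands that one-liner slightly to show how the MultiIndex might be used afterwards. To stay self-contained it reads a hypothetical two-row sample from memory instead of the downloaded file. The sample values, the ISO date format, and the True/False encoding of the boolean columns are all assumptions, and only a subset of the columns is included.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded .csv; swap in the real path
# (e.g. 'path_to_downloaded.csv') when working with the actual file.
sample_csv = io.StringIO(
    "subreddit,earliest_start,latest_start,earliest_end,latest_end,Images,Rule Text\n"
    "example_a,2020-01-06,2020-01-08,2021-05-03,2021-05-10,True,No memes\n"
    "example_b,2021-03-01,2021-03-15,2022-07-04,2022-07-11,False,Be civil\n"
)

index_cols = ["subreddit", "earliest_start", "latest_start",
              "earliest_end", "latest_end"]

# Parse the four bound columns as dates, then build the MultiIndex.
rules = pd.read_csv(sample_csv, parse_dates=index_cols[1:]).set_index(index_cols)

# Select every rule for one subreddit via the first index level.
example_a_rules = rules.xs("example_a", level="subreddit")

# Filter on a topic column, assuming it was read as a boolean.
image_rules = rules[rules["Images"]]

print(example_a_rules["Rule Text"].tolist())
print(image_rules["Rule Text"].tolist())
```

Parsing the bound columns as dates is optional, but it lets the index levels be compared and sliced as timestamps instead of strings.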