- A dataset of Wikipedia articles nominated for Speedy Deletion from October to December 2011; some of these were later deleted and some were not. The set also contains, as a comparison, articles that were not deleted from Wikipedia, obtained using Wikipedia's category hierarchy. (For more details, see our description of the data.) Download
The data is organized as follows: Each day's nominees are contained in a separate folder labeled 'For speedy' followed by the date; for example, November 9, 2011's nominees are in the folder For speedy 11-09. (Please note that we do not guarantee that we retrieved all nominees for each day, as our download interval was large enough that a few may have been missed; some of the earliest dates contain significantly fewer files than later ones.)
Each day's folder contains a number of files, which represent articles that were nominated for deletion but kept, and a folder called "Deleted", which contains all pages nominated for Speedy Deletion on that day and actually deleted. Each "Deleted" folder contains a folder entitled "Good", which contains the subset of deleted pages that were nominated for deletion for one of the following reasons: "No indication of significance", "advertising or promotion", or "no context". These files were selected by parsing each file for the deletion reason given in the deletion nomination template on each page. Article talk pages that existed at the time of article download are included as well; these are denoted by the word "Talk", followed by an underscore and the article name (e.g., "Talk_Computer").
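The folder layout described above can be turned into labels with a short script. The following is only a minimal sketch in Python: it assumes the archive has been unpacked into a local directory (the `root` argument is hypothetical), and the names `Entry` and `iter_speedy` are illustrative, not part of the dataset. Labels are assigned purely from where a file sits; whether files in "Good" also appear directly in "Deleted" is not assumed either way.

```python
from pathlib import Path
from typing import Iterator, NamedTuple


class Entry(NamedTuple):
    day: str       # date suffix of the day folder, e.g. "11-09"
    name: str      # file name as stored on disk
    label: str     # "kept", "deleted", or "deleted_good"
    is_talk: bool  # True for files named "Talk_<article name>"


def iter_speedy(root: Path) -> Iterator[Entry]:
    """Yield one Entry per file found under the 'For speedy <date>' folders."""
    prefix = "For speedy "
    for day_dir in sorted(root.iterdir()):
        if not day_dir.is_dir() or not day_dir.name.startswith(prefix):
            continue
        day = day_dir.name[len(prefix):]
        # Location in the tree determines the label.
        locations = [
            (day_dir, "kept"),
            (day_dir / "Deleted", "deleted"),
            (day_dir / "Deleted" / "Good", "deleted_good"),
        ]
        for folder, label in locations:
            if not folder.is_dir():
                continue
            for path in sorted(folder.iterdir()):
                if path.is_file():
                    yield Entry(day, path.name, label,
                                path.name.startswith("Talk_"))
```

Calling `list(iter_speedy(Path("speedy_data")))` on an unpacked copy (the directory name is an assumption) gives a flat list that can be filtered by `label` or `is_talk` before any further processing.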
The comparison set we used for our Speedy Deletion experiments (see data description) is in a separate file; it contains 1381 articles from 21 categories, along with any accompanying talk pages. Download
A set of 847 Wikipedia articles Proposed for Deletion, or PROD'ed, during the same time period as above. This set also includes 141 articles PROD'ed in March 2013. Folder structure is similar to that of the Speedy set; each folder is labeled with the date the articles in it were PROD'ed, as follows: PRODs, followed by the date. Each day's folder has a subfolder entitled "Deleted" containing all pages PROD'ed on that day that were later deleted. As above, relevant talk pages are included. (Note: the archive is in 7z format because it contains file names with Unicode characters; use the free 7-Zip utility to unpack it.) Download
The comparison set used for the PRODs is a superset of that used for the Speedies; it contains all 1381 articles from the Speedy comparison set described above, as well as an additional 655 articles from another 13 categories, for a total of 2036 articles. As always, each article is accompanied by its talk page, if it exists. The articles from the Speedy comparison set are in the main folder; the additional ones used only for PRODs are divided into subfolders by category. Download
A set of articles nominated for deletion discussion on the Articles for Deletion page. As above, articles are separated in folders labeled by date (in this case date only), with sub-folders marked "Deleted" for those articles that were later deleted. Please note that the first two weeks or so of folders do not have Deleted sub-folders because all the articles in them were downloaded after the articles chosen for deletion had already been deleted. Download
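Because the earliest date folders in this set have no "Deleted" sub-folder, it can help to separate them out before comparing kept and deleted articles. The sketch below is one hedged way to do that in Python; it assumes the archive has been unpacked into a local directory whose immediate children are the date folders, and the function name `split_by_deleted_subfolder` is hypothetical.

```python
from pathlib import Path
from typing import List, Tuple


def split_by_deleted_subfolder(root: Path) -> Tuple[List[str], List[str]]:
    """Return (folders with a 'Deleted' sub-folder, folders without one)."""
    with_deleted: List[str] = []
    without_deleted: List[str] = []
    for day_dir in sorted(root.iterdir()):
        if not day_dir.is_dir():
            continue  # skip any stray files next to the date folders
        if (day_dir / "Deleted").is_dir():
            with_deleted.append(day_dir.name)
        else:
            without_deleted.append(day_dir.name)
    return with_deleted, without_deleted
```

Folders in the second list hold only articles that survived to download time, so they say nothing about which nominees were deleted on those days.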
A set of articles downloaded in December 2012 shortly after their creation. They were later sorted by which were subsequently deleted. For the deleted set, see here; for the kept set, see here.