- December 2016: Patient threads with human-assigned votes for sentence relevance
- December 2016: Viva threads with human-assigned votes for post relevance
- November 2015: forum data in one unified XML format
December 2016: Patient threads with human-assigned votes for sentence relevance
We release the following labelled data set:
- 100 long threads (> 10 posts) from the Facebook group GIST support international in the XML format defined below, and in an HTML format that shows the sentence splitting and sentence labels. GIST_fb_data_100_long_threads
- 50 long threads (a subset of the 100 threads) from the Facebook group GIST support international in the XML format defined below, and in an HTML format that shows the sentence splitting and sentence labels. GIST_fb_data_50_long_threads
- A file in which each row contains a thread id, a sentence id, and the number of votes for the sentence (labelled by five crowdsourced raters, so the vote count ranges from 0 to 5). GIST_fb_100longthreads_with_sentvotes_crowd.txt
- A file in which each row contains a thread id, a sentence id, and the number of votes for the sentence (labelled by two expert raters, so the vote count ranges from 0 to 2). GIST_fb_50longthreads_with_sentvotes_experts.txt
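As an illustration of how the vote files might be used, the following minimal Python sketch reads rows of thread id, sentence id and vote count, and then selects the sentences that reach a chosen vote threshold. It assumes that the three columns are separated by whitespace (tabs or spaces), which is not stated above; the function names and the sample rows are hypothetical and not part of the released data.

```python
from collections import defaultdict


def read_sentence_votes(lines):
    """Map each thread id to a dict of {sentence id: vote count}.

    Assumption: every row holds three whitespace-separated fields
    (thread id, sentence id, vote count). Other rows are skipped.
    """
    votes = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if len(fields) != 3 or not fields[2].isdigit():
            continue  # blank line, header, or malformed row
        thread_id, sentence_id, count = fields
        votes[thread_id][sentence_id] = int(count)
    return votes


def select_sentences(thread_votes, min_votes):
    """Return the sentence ids that received at least min_votes votes."""
    return [sid for sid, n in thread_votes.items() if n >= min_votes]


# Hypothetical rows, for illustration only.
sample_rows = ["t1\ts1\t0", "t1\ts2\t4", "t2\ts1\t5", ""]
votes = read_sentence_votes(sample_rows)
selected = select_sentences(votes["t1"], min_votes=3)
```

With a real file, the same function can be called on an open file object, for example `read_sentence_votes(open(path, encoding="utf-8"))`. The threshold is a free choice: a higher value gives a shorter reference summary on which more raters agree.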
The data are described in more detail in this paper:
- Suzan Verberne, Antal van den Bosch, Sander Wubben, Emiel Krahmer (2017). Automatic summarization of domain-specific forum threads: collecting reference data (pdf). To appear in the proceedings of CHIIR 2017.
Please refer to this paper when using the data for your work.
December 2016: Viva threads with human-assigned votes for post relevance
We release the following labelled data set:
- 106 long threads (> 20 posts) from the Viva forum in the XML format defined below. viva_data_106_long_threads
- A file in which each row contains a thread id, a post id, and the number of votes for the post (labelled by ten raters, so the vote count ranges from 0 to 10). 106long20threads_with_postvotes.txt
- A file in which each row contains a thread id, a post id, the extracted post features (values standardized per thread), and the number of votes for the post. 106long20threads_with_postvotes_and_postfeats.txt
- A file in which each row contains one thread annotation: a timestamp, a subject id, a thread id, the list of selected posts, the familiarity score, and the usefulness score. 106long20threads_annotations.txt
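The relation between the annotation file and the vote file can be sketched as follows: the vote count of a post is the number of annotations of that thread in which the post was selected. This is a minimal Python sketch under assumptions. It takes annotations as already-parsed (thread id, selected post ids) pairs, because the column separator and the encoding of the post list in the annotation file are not specified here; the function name and the sample values are hypothetical.

```python
from collections import Counter, defaultdict


def count_post_votes(annotations):
    """Aggregate annotations into per-thread, per-post vote counts.

    annotations: iterable of (thread_id, selected_post_ids) pairs,
    one pair per rater per thread (assumed input shape).
    """
    votes = defaultdict(Counter)
    for thread_id, selected_posts in annotations:
        # set() ensures that one annotation contributes at most
        # one vote to any post.
        votes[thread_id].update(set(selected_posts))
    return votes


# Hypothetical annotations, for illustration only.
sample_annotations = [
    ("t1", ["p1", "p3"]),
    ("t1", ["p1"]),
    ("t2", ["p1", "p2"]),
]
post_votes = count_post_votes(sample_annotations)
```

Because `Counter` returns 0 for missing keys, posts that no rater selected are reported with a vote count of 0 without being stored explicitly.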
The data are described in a paper that will appear in 2017 in Springer’s Language Resources and Evaluation: “Creating a Reference Data Set for the Summarization of Discussion Forum Threads”. Two excerpts from the paper:
Through social media and the Radboud University research participation system, we recruited members of the Viva forum target group (Dutch-language, female, aged 18–45) as raters for our study. […] They were presented with randomly selected threads from our sample. The raters decided themselves how many threads they wanted to summarize. They were paid a gift certificate. […] In the annotation interface, the left column of the screen shows the complete thread; the right column shows an empty table. By clicking on a post in the thread on the left it is added to the column on the right (in the same position); by clicking it in the right column it disappears again. The opening post of the thread was always selected. We intentionally did not pre-require a specific number of posts to be selected for the summary because we wanted to investigate what the desired summary size was for the raters.
We also asked the raters to indicate their familiarity with the topic of the thread (scale 1–5, where 1 means ‘not familiar at all’ and 5 means ‘highly familiar’) and how useful it would be for this thread to have the possibility to see only the most important posts (scale 1–5). In case they chose a usefulness score of 1, they were asked to choose between either of the options ‘none of the posts are relevant’ (1n), ‘all posts are equally relevant’ (1a) or ‘other reason’ (1o). We gave room for additional comments.
November 2015: forum data in one unified XML format
We have prepared data from two web forums in one unified XML format:
- Viva forum (Dutch): 10,000 threads in 26 categories (50MB gzipped). Send an e-mail to firstname.lastname@example.org to obtain these data.
- Reddit (English): 242,666 threads in 12,980 subreddits, from December 2014 (248MB gzipped). Click here to download these data.
The data is organized in subdirectories (one per category/subreddit), with one XML file per thread. The DTD for the XML format is:
<!ELEMENT thread (threadid,title,post+,category*,type*,nrofviews?)>
<!ELEMENT post (postid,author,timestamp,parent*,upvotes?,downvotes?,body)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT timestamp (#PCDATA)>
<!ELEMENT parent (#PCDATA)>
<!ELEMENT upvotes (#PCDATA)>
<!ELEMENT downvotes (#PCDATA)>
<!ELEMENT body (content,url*)>
<!ELEMENT content (#PCDATA)>
<!ELEMENT url (#PCDATA)>
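A thread file in this format can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` and follows the element names in the DTD above. The sample thread, its values, the timestamp format, and the reading of `parent` as the id of the post being replied to are assumptions for illustration; they are not taken from the released data.

```python
import xml.etree.ElementTree as ET

# Hypothetical thread that follows the element order of the DTD.
SAMPLE_THREAD = """<thread>
  <threadid>t1</threadid>
  <title>Example title</title>
  <post>
    <postid>p1</postid>
    <author>alice</author>
    <timestamp>2014-12-01 10:00</timestamp>
    <body><content>Opening post.</content></body>
  </post>
  <post>
    <postid>p2</postid>
    <author>bob</author>
    <timestamp>2014-12-01 10:05</timestamp>
    <parent>p1</parent>
    <upvotes>3</upvotes>
    <body><content>A reply.</content><url>http://example.com</url></body>
  </post>
  <category>example</category>
</thread>"""


def parse_thread(xml_text):
    """Convert one thread document into plain Python dicts and lists."""
    root = ET.fromstring(xml_text)
    posts = []
    for post in root.findall("post"):
        posts.append({
            "postid": post.findtext("postid"),
            "author": post.findtext("author"),
            "timestamp": post.findtext("timestamp"),
            # parent* and url* may occur zero or more times.
            "parents": [p.text for p in post.findall("parent")],
            # Optional elements yield None when absent.
            "upvotes": post.findtext("upvotes"),
            "downvotes": post.findtext("downvotes"),
            "content": post.findtext("body/content"),
            "urls": [u.text for u in post.findall("body/url")],
        })
    return {
        "threadid": root.findtext("threadid"),
        "title": root.findtext("title"),
        "categories": [c.text for c in root.findall("category")],
        "posts": posts,
    }


thread = parse_thread(SAMPLE_THREAD)
```

For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring(xml_text)`. All values are kept as strings, so vote counts need an explicit `int(...)` conversion when they are present.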
If you use these data for research, please cite:
- Sander Wubben, Suzan Verberne, Emiel Krahmer and Antal van den Bosch (2015). Facilitating online discussions by automatic summarization (pdf). In Proceedings of the 27th Benelux Conference on Artificial Intelligence (BNAIC 2015), Hasselt, 5-6 November, 2015.
- And this webpage: http://discosumo.ruhosting.nl/wordpress/project-deliverables/
If you have a question about the data, please send an e-mail to Suzan Verberne, email@example.com