
Author: Kutlu, Mucahid
Author: McDonnell, Tyler
Author: Barkallah, Yassmine
Author: Elsayed, Tamer
Author: Lease, Matthew
Available date: 2019-09-15T08:00:42Z
Publication Date: 2018-06-27
Publication Name: 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2018
Identifier: http://dx.doi.org/10.1145/3209978.3210033
Citation: Mucahid Kutlu, Tyler McDonnell, Yassmine Barkallah, Tamer Elsayed, and Matthew Lease. 2018. Crowd vs. expert: What can relevance judgment rationales teach us about assessor disagreement? In Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'18). 805–814.
ISBN: 9781450356572
URI: https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85051536970&origin=inward
URI: http://hdl.handle.net/10576/11838
Abstract: © 2018 ACM. While crowdsourcing offers a low-cost, scalable way to collect relevance judgments, lack of transparency with remote crowd work has limited understanding about the quality of collected judgments. In prior work, we showed a variety of benefits from asking crowd workers to provide rationales for each relevance judgment (McDonnell et al., 2016). In this work, we scale up our rationale-based judging design to assess its reliability on the 2014 TREC Web Track, collecting roughly 25K crowd judgments for 5K document-topic pairs. We also study having crowd judges perform topic-focused judging, rather than judging across topics, finding this improves quality. Overall, we show that crowd judgments can be used to reliably rank IR systems for evaluation. We further explore the potential of rationales to shed new light on reasons for judging disagreement between experts and crowd workers. Our qualitative and quantitative analysis distinguishes subjective vs. objective forms of disagreement, as well as the relative importance of each disagreement cause, and we present a new taxonomy for organizing the different types of disagreement we observe. We show that many crowd disagreements seem valid and plausible, with disagreement in many cases due to judging errors by the original TREC assessors. We also share our WebCrowd25k dataset, including: (1) crowd judgments with rationales, and (2) taxonomy category labels for each judging disagreement analyzed.
Language: en
Publisher: ACM
Subject: Crowdsourcing
Subject: Disagreement
Subject: Evaluation
Subject: Relevance assessment
Title: Crowd vs. Expert: What can relevance judgment rationales teach us about assessor disagreement?
Type: Conference Paper
Pagination: 805–814


Files in this item


There are no files associated with this item.
