Annotator rationales for labeling tasks in crowdsourcing. (English) Zbl 07269304
Summary: When collecting item ratings from human judges, it can be difficult to measure and enforce data quality due to task subjectivity and lack of transparency into how judges make each rating decision. To address this, we investigate asking judges to provide a specific form of rationale supporting each rating decision. We evaluate this approach on an information retrieval task in which human judges rate the relevance of Web pages for different search topics. Cost-benefit analysis over 10,000 judgments collected on Amazon’s Mechanical Turk suggests a win-win. Firstly, rationales yield a multitude of benefits: more reliable judgments, greater transparency for evaluating both human raters and their judgments, reduced need for expert gold, the opportunity for dual-supervision from ratings and rationales, and added value from the rationales themselves. Secondly, once experienced in the task, crowd workers provide rationales with almost no increase in task completion time. Consequently, we can realize the above benefits with minimal additional cost.
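The reliability and rationale-overlap claims in the summary lend themselves to simple quantitative checks. The sketch below is illustrative Python only, not the authors' code: the toy labels, rationales, and helper names (pairwise_agreement, jaccard, gestalt) are invented for this example. It computes pairwise agreement between workers' relevance labels and two text-similarity scores between their rationales, Jaccard word-set overlap in the sense of [49] and a difflib sequence ratio in the spirit of the Ratcliff/Obershelp approach [76].

```python
# Illustrative sketch only (not the authors' code): compare crowd workers'
# relevance labels and free-text rationales for one (topic, page) pair.
# All data and helper names below are invented for this example.
from difflib import SequenceMatcher
from itertools import combinations

def pairwise_agreement(labels):
    """Fraction of worker pairs that assigned the same relevance label."""
    pairs = list(combinations(labels, 2))
    return sum(a == b for a, b in pairs) / len(pairs) if pairs else 0.0

def jaccard(text_a, text_b):
    """Jaccard similarity of the word sets of two rationales (cf. [49])."""
    sa, sb = set(text_a.lower().split()), set(text_b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 1.0

def gestalt(text_a, text_b):
    """Sequence similarity via difflib, following the Ratcliff/Obershelp idea (cf. [76])."""
    return SequenceMatcher(None, text_a, text_b).ratio()

# Toy example: three workers judge the same web page for one search topic.
labels = ["relevant", "relevant", "not relevant"]
rationales = [
    "page lists official visa requirements for travel to the topic country",
    "official visa requirements for the country are listed on this page",
    "page is about travel insurance rather than visa requirements",
]

print("label agreement:", round(pairwise_agreement(labels), 2))
for (i, a), (j, b) in combinations(enumerate(rationales), 2):
    print(f"workers {i}-{j}: jaccard={jaccard(a, b):.2f} gestalt={gestalt(a, b):.2f}")
```

In this toy run, the two workers who agree on the label also share most rationale vocabulary, which is the kind of signal the paper uses rationales to expose.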
MSC:
68T Artificial intelligence
Software:
BERT; word2vec
References:
[1] Aiman L Al-Harbi and Mark D Smucker. A qualitative exploration of secondary assessor relevance judging behavior. In Proceedings of the 5th Information Interaction in Context Symposium, pages 195-204. ACM, 2014.
[2] Omar Alonso. Guidelines for designing crowdsourcing-based relevance experiments, 2009. CiteSeerX DOI 10.1.1.149.6649.
[3] Omar Alonso. Practical lessons for gathering quality labels at scale. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1089-1092. ACM, 2015.
[4] Omar Alonso and Ricardo Baeza-Yates. An analysis of crowdsourcing relevance assessments in Spanish. In Spanish Conference on Information Retrieval, 2010.
[5] Omar Alonso and Stefano Mizzaro. Can we get rid of TREC assessors? Using Mechanical Turk for relevance assessment. In Proceedings of the SIGIR 2009 Workshop on the Future of IR Evaluation, volume 15, page 16, 2009.
[6] Omar Alonso and Stefano Mizzaro. Using crowdsourcing for TREC relevance assessment. Information Processing & Management, 48(6):1053-1066, 2012.
[7] Omar Alonso, Daniel E Rose, and Benjamin Stewart. Crowdsourcing for relevance evaluation. In ACM SIGIR Forum, volume 42, pages 9-15, 2008.
[8] Antonio A Arechar, Gordon T Kraft-Todd, and David G Rand. Turking overtime: how participant characteristics and behavior vary over time and day on Amazon Mechanical Turk. Journal of the Economic Science Association, 3(1):1-11, 2017.
[9] Ron Artstein and Massimo Poesio. Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555-596, 2008.
[10] Josh M Attenberg, Panagiotis G Ipeirotis, and Foster Provost. Beat the machine: Challenging workers to find the unknown unknowns. In Workshops at the Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.
[11] Peter Bailey, Nick Craswell, Ian Soboroff, Paul Thomas, Arjen P de Vries, and Emine Yilmaz. Relevance assessment: are judges exchangeable and does it matter. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 667-674. ACM, 2008.
[12] Yujia Bao, Shiyu Chang, Mo Yu, and Regina Barzilay. Deriving machine attention from human rationales. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1903-1913, 2018.
[13] Jeff Barr and Luis Felipe Cabrera. AI gets a brain. Queue, 4(4):24, 2006.
[14] Joost Bastings, Wilker Aziz, and Ivan Titov. Interpretable neural predictions with differentiable binary variables. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2963-2977, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1284. URL https://www.aclweb.org/anthology/P19-1284.
[15] Michael S Bernstein, Greg Little, Robert C Miller, Björn Hartmann, Mark S Ackerman, David R Karger, David Crowell, and Katrina Panovich. Soylent: a word processor with a crowd inside. In UIST, pages 313-322. ACM, 2010.
[16] Roi Blanco, Harry Halpin, Daniel M Herzig, Peter Mika, Jeffrey Pound, Henry S Thompson, and Thanh Tran Duc. Repeatable and reliable search system evaluation using crowdsourcing. In SIGIR, pages 923-932. ACM, 2011.
[17] Martin Braschler and Carol Peters. The CLEF campaigns: Evaluation of cross-language information retrieval systems. UPGRADE (The European Online Magazine for the IT Professional), 3:78-81, 2002.
[18] Jamie Callan, Mark Hoy, Changkuk Yoo, and Le Zhao. The ClueWeb09 Data Set, 2009. Presentation Nov. 19, 2009 at NIST TREC. Slides online at boston.lti.cs.cmu.edu/classes/11-742/S10-TREC/TREC-Nov19-09.pdf.
[19] Jean Carletta. Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics, 22(2):249-254, 1996.
[20] Ben Carterette, Virgiliu Pavlu, Hui Fang, and Evangelos Kanoulas. Million query track 2009 overview. In Proceedings of NIST TREC, 2010.
[21] Logan S Casey, Jesse Chandler, Adam Seth Levine, Andrew Proctor, and Dara Z Strolovitch. Intertemporal differences among MTurk workers: Time-based sample variations and implications for online data collection. SAGE Open, 7(2):2158244017712774, 2017.
[22] Praveen Chandar, William Webber, and Ben Carterette. Document features predicting assessor disagreement. In Proceedings of the 36th ACM SIGIR conference on Research and development in information retrieval, pages 745-748. ACM, 2013.
[23] Joseph Chee Chang, Saleema Amershi, and Ece Kamar. Revolt: Collaborative crowdsourcing for labeling machine learning datasets. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pages 2334-2346. ACM, 2017.
[24] Nancy Chang, Praveen Paritosh, David Huynh, and Collin F Baker. Scaling semantic frame annotation. In The 9th Linguistic Annotation Workshop held in conjunction with NAACL 2015, page 1, 2015.
[25] David L Chen and William B Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 190-200. Association for Computational Linguistics, 2011.
[26] Quanze Chen, Jonathan Bragg, Lydia B Chilton, and Dan S Weld. Cicero: Multi-turn, contextual argumentation for accurate crowdsourcing. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pages 1-14, 2019.
[27] Charles L Clarke, Nick Craswell, and Ian Soboroff. Overview of the TREC 2009 Web Track. In Proceedings of NIST TREC, 2010.
[28] Cyril W Cleverdon and Michael Keen. Aslib Cranfield research project: factors determining the performance of indexing systems; volume 2, test results. 1966.
[29] Paul Clough, Mark Sanderson, Jiayu Tang, Tim Gollins, and Amy Warner. Examining the limits of crowdsourcing for relevance assessment. IEEE Internet Computing, 17(4):32-38, 2013.
[30] Alexander Philip Dawid and Allan M Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 28(1):20-28, 1979.
[31] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, 2019.
[32] Djellel Difallah, Elena Filatova, and Panos Ipeirotis. Demographics and dynamics of Mechanical Turk workers. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 135-143. ACM, 2018.
[33] Jeff Donahue and Kristen Grauman. Annotator rationales for visual recognition. In ICCV, pages 1395-1402. IEEE, 2011.
[34] Ryan Drapeau, Lydia B Chilton, Jonathan Bragg, and Daniel S Weld. MicroTalk: Using argumentation to improve crowdsourcing accuracy. In Fourth AAAI Conference on Human Computation and Crowdsourcing, 2016.
[35] Ujwal Gadiraju, Jie Yang, and Alessandro Bozzon. Clarity is a worthwhile quality: On the role of task clarity in microtask crowdsourcing. In Proceedings of the 28th ACM Conference on Hypertext and Social Media, pages 5-14. ACM, 2017.
[36] Snehalkumar Neil S Gaikwad, Durim Morina, Adam Ginzberg, Catherine Mullings, Shirish Goyal, Dilrukshi Gamage, Christopher Diemert, Mathias Burton, Sharon Zhou, Mark Whiting, et al. Boomerang: Rebounding the consequences of reputation feedback on crowdsourcing platforms. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology, pages 625-637. ACM, 2016.
[37] Google. Search quality rating guidelines. Inside Search: How Search Works, March 28, 2016. www.google.com/insidesearch/.
[38] Mary L Gray and Siddharth Suri. Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass. Eamon Dolan Books, 2019.
[39] Maura R Grossman and Gordon V Cormack. Inconsistent responsiveness determination in document review: Difference of opinion or human error. Pace L. Rev., 32:267, 2012.
[40] Lei Han, Kevin Roitero, Eddy Maddalena, Stefano Mizzaro, and Gianluca Demartini. On transforming relevance scales. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 39-48, 2019.
[41] Lei Han, Eddy Maddalena, Alessandro Checco, Cristina Sarasua, Ujwal Gadiraju, Kevin Roitero, and Gianluca Demartini. Crowd worker strategies in relevance judgment tasks. In Proceedings of the 13th International Conference on Web Search and Data Mining, pages 241-249, 2020.
[42] Maram Hasanain, Yassmine Barkallah, Reem Suwaileh, Mucahid Kutlu, and Tamer Elsayed. ArTest: The first test collection for Arabic web search with relevance rationales. In The 43rd International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 2020.
[43] Chien-Ju Ho, Aleksandrs Slivkins, Siddharth Suri, and Jennifer Wortman Vaughan. Incentivizing high quality crowdwork. In Proceedings of the 24th International Conference on World Wide Web, pages 419-429. International World Wide Web Conferences Steering Committee, 2015.
[44] John Joseph Horton and Lydia B Chilton. The labor economics of paid crowdsourcing. In Proceedings of the 11th ACM conference on Electronic commerce, pages 209-218. ACM, 2010.
[45] Mehdi Hosseini, Ingemar J Cox, Nataša Milić-Frayling, Gabriella Kazai, and Vishwa Vinay. On aggregating labels from multiple crowd workers to infer relevance of documents. In ECIR, pages 182-194. Springer, 2012.
[46] Jeff Howe. The rise of crowdsourcing. Wired Magazine, 14(6):176-183, 2006.
[47] Nguyen Quoc Viet Hung, Nguyen Thanh Tam, Lam Ngoc Tran, and Karl Aberer. An evaluation of aggregation techniques in crowdsourcing. In International Conference on Web Information Systems Engineering, pages 1-15. Springer, 2013.
[48] Panos Ipeirotis. A plea to Amazon: Fix Mechanical Turk! Blog: Behind Enemy Lines, October 21, 2010. www.behind-the-enemy-lines.com.
[49] Paul Jaccard. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bull Soc Vaudoise Sci Nat, 37:547-579, 1901.
[50] Ece Kamar, Severin Hacker, and Eric Horvitz. Combining human and machine intelligence in large-scale crowdsourcing. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems-Volume 1, pages 467-474. International Foundation for Autonomous Agents and Multiagent Systems, 2012.
[51] Mike Kappel. The end of an era? How the ABC test could affect your use of independent contractors. Forbes, August 8, 2018.
[52] Gabriella Kazai, Nick Craswell, Emine Yilmaz, and Seyed MM Tahaghoghi. An analysis of systematic judging errors in information retrieval. In Proceedings of the 21st ACM conference on Information and knowledge management, pages 105-114, 2012.
[53] Steve Kelling, Jeff Gerbracht, Daniel Fink, Carl Lagoze, Weng-Keen Wong, Jun Yu, Theodoros Damoulas, and Carla Gomes. A human/computer learning network to improve biodiversity conservation and research. AI Magazine, 34(1):10-10, 2013.
[54] Kenneth A. Kinney, Scott B. Huffman, and Juting Zhai. How evaluator domain expertise affects search result relevance judgments. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM '08, pages 591-598, New York, NY, USA, 2008. ACM. ISBN 978-1-59593-991-3. doi: 10.1145/1458082.1458160. URL http://doi.acm.org/10.1145/1458082.1458160.
[55] Aniket Kittur, Ed H Chi, and Bongwon Suh. Crowdsourcing user studies with Mechanical Turk. In Proceedings of the SIGCHI conference on human factors in computing systems, pages 453-456. ACM, 2008.
[56] Aniket Kittur, Jeffrey V Nickerson, Michael Bernstein, Elizabeth Gerber, Aaron Shaw, John Zimmerman, Matt Lease, and John Horton. The Future of Crowd Work. In CSCW, pages 1301-1318. ACM, 2013.
[57] Anand Kulkarni, Philipp Gutheim, Prayag Narula, David Rolnitzky, Tapan Parikh, and Björn Hartmann. MobileWorks: Designing for quality in a managed crowdsourcing architecture. IEEE Internet Computing, 16(5):28-35, 2012.
[58] Mucahid Kutlu, Tyler McDonnell, Yassmine Barkallah, Tamer Elsayed, and Matthew Lease. Crowd vs. Expert: What Can Relevance Judgment Rationales Teach Us About Assessor Disagreement? In Proceedings of the 41st ACM SIGIR conference on Research and development in Information Retrieval, 2018.
[59] J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data. Biometrics, 33(1):159-174, 1977. · Zbl 0351.62039
[60] Tao Lei, Regina Barzilay, and Tommi Jaakkola. Rationalizing neural predictions. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 107-117, 2016.
[61] Christopher H Lin, Mausam, and Daniel S Weld. Dynamically switching between synergistic workflows for crowdsourcing. In Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
[62] Chris J Lintott, Kevin Schawinski, Anže Slosar, Kate Land, Steven Bamford, Daniel Thomas, M Jordan Raddick, Robert C Nichol, Alex Szalay, Dan Andreescu, et al. Galaxy Zoo: morphologies derived from visual inspection of galaxies from the Sloan Digital Sky Survey. Monthly Notices of the Royal Astronomical Society, 389(3):1179-1189, 2008.
[63] Greg Little, Lydia B Chilton, Max Goldman, and Robert C Miller. TurKit: Human computation algorithms on Mechanical Turk. In Proceedings of the 23rd annual ACM symposium on User interface software and technology, pages 57-66. ACM, 2010.
[64] C. V. K. Manam and Alexander James Quinn. WingIt: Efficient Refinement of Unclear Task Instructions. In Proceedings of the Sixth AAAI Conference on Human Computation and Crowdsourcing (HCOMP), 2018.
[65] Akash Mankar, Riddhi J. Shah, and Matthew Lease. Design Activism for Minimum Wage Crowd Work. In 5th AAAI Conference on Human Computation and Crowdsourcing (HCOMP): Works-in-Progress Track, 2017.
[66] Catherine C Marshall and Frank M Shipman. Experiences surveying the crowd: Reflections on methods, participation, and reliability. In 5th Annual Web Science Conference, pages 234-243. ACM, 2013.
[67] Winter Mason and Siddharth Suri. Conducting behavioral research on Amazon's Mechanical Turk. Behavior Research Methods, 44(1):1-23, 2012.
[68] Winter Mason and Duncan J Watts. Financial incentives and the performance of crowds. In Proceedings of the ACM SIGKDD workshop on human computation, pages 77-85. ACM, 2009.
[69] Tyler McDonnell, Matthew Lease, Mucahid Kutlu, and Tamer Elsayed. Why is that relevant? Collecting annotator rationales for relevance judgments. In Fourth AAAI Conference on Human Computation and Crowdsourcing, 2016.
[70] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013.
[71] An Thanh Nguyen, Matthew Halpern, Byron C. Wallace, and Matthew Lease. Probabilistic Modeling for Crowdsourcing Partially-Subjective Ratings. In Proceedings of the 4th AAAI Conference on Human Computation and Crowdsourcing (HCOMP), pages 149-158, 2016.
[72] David Oleson, Alexander Sorokin, Greg P Laughlin, Vaughn Hester, John Le, and Lukas Biewald. Programmatic gold: Targeted and scalable quality assurance in crowdsourcing. Human computation, 11(11), 2011.
[73] Jean-Francois Paiement, James G Shanahan, and Remi Zajac. Crowdsourcing local search relevance. Proceedings of the CrowdConf 2010, 2010.
[74] Praveen Paritosh. Human computation must be reproducible. In CrowdSearch, pages 20-25, 2012.
[75] Jorge Ramírez, Marcos Baez, Fabio Casati, and Boualem Benatallah. Understanding the impact of text highlighting in crowdsourcing tasks. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 7, pages 144-152, 2019.
[76] John W Ratcliff and David E Metzener. Pattern matching: the gestalt approach. Dr. Dobb's Journal, 13(7):46, 1988.
[77] Kevin Roitero, Eddy Maddalena, Gianluca Demartini, and Stefano Mizzaro. On fine-grained relevance scales. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 675-684. ACM, 2018.
[78] Joel Ross, Lilly Irani, M Silberman, Andrew Zaldivar, and Bill Tomlinson. Who are the crowdworkers?: shifting demographics in Mechanical Turk. In CHI '10 extended abstracts on Human factors in computing systems, pages 2863-2872. ACM, 2010.
[79] Holly Rosser and Andrea Wiggins. Crowds and camera traps: Genres in online citizen science projects. In Proceedings of the 52nd Hawaii International Conference on System Sciences, 2019.
[80] Niloufar Salehi, Lilly C Irani, Michael S Bernstein, Ali Alkhatib, Eva Ogbe, Kristy Milland, et al. We are Dynamo: Overcoming stalling and friction in collective action for crowd workers. In Proceedings of the 33rd annual ACM conference on human factors in computing systems, pages 1621-1630. ACM, 2015.
[81] Mark Sanderson. Test collection based evaluation of information retrieval systems. Now Publishers Inc, 2010. · Zbl 1198.68140
[82] Tefko Saracevic. Relevance: A review of the literature and a framework for thinking on the notion in information science. Part III: Behavior and effects of relevance. Journal of the American Society for Information Science and Technology, 58(13):2126-2144, 2007.
[83] Mike Schaekermann, Joslin Goh, Kate Larson, and Edith Law. Resolvable vs. irresolvable disagreement: A study on worker deliberation in crowd work. Proceedings of the ACM on Human-Computer Interaction, 2(CSCW):1-19, 2018.
[84] Aashish Sheshadri and Matthew Lease. SQUARE: A Benchmark for Research on Computing Crowd Consensus. In Proceedings of the AAAI Conference on Human Computation (HCOMP), pages 156-164, 2013.
[85] M Six Silberman, Bill Tomlinson, Rochelle LaPlante, Joel Ross, Lilly Irani, and Andrew Zaldivar. Responsible research with crowds: pay crowdworkers at least minimum wage. Communications of the ACM, 61(3):39-41, 2018.
[86] Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y Ng. Cheap and fast, but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the conference on empirical methods in natural language processing, pages 254-263. Association for Computational Linguistics, 2008.
[87] Eero Sormunen. Liberal relevance criteria of TREC: Counting on negligible documents? In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 324-330. ACM, 2002.
[88] Qi Su, Dmitry Pavlov, Jyh-Herng Chow, and Wendell C. Baker. Internet-scale collection of human-reviewed data. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pages 231-240, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-654-7. doi: 10.1145/1242572.1242604. URL http://doi.acm.org/10.1145/1242572.1242604.
[89] John C Tang, Manuel Cebrian, Nicklaus A Giacobe, Hyun-Woo Kim, Taemie Kim, and Douglas Beaker Wickert. Reflecting on the DARPA Red Balloon Challenge. Communications of the ACM, 54(4):78-85, 2011.
[90] Rong Tang, William M Shaw Jr, and Jack L Vevea. Towards the identification of the optimal number of relevance categories. Journal of the Association for Information Science and Technology, 50(3):254, 1999.
[91] Yuandong Tian and Jun Zhu. Learning from crowds in the presence of schools of thought. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 226-234. ACM, 2012.
[92] Andrew Trotman, Nils Pharo, and Dylan Jenkinson. Can we at least agree on something. In Proceedings of the SIGIR 2007 Workshop on Focused Retrieval, pages 49-56, 2007.
[93] Donna Vakharia and Matthew Lease. Beyond Mechanical Turk: An analysis of paid crowd work platforms. Proceedings of the iConference, 2015.
[94] Werner Vogels. Help Find Jim Gray, 2007. https://www.allthingsdistributed.com/2007/02/help_find_jim_gray.html.
[95] Ellen M Voorhees. Variations in relevance judgments and the measurement of retrieval effectiveness. Information Processing & Management, 36(5):697-716, 2000.
[96] Ellen M Voorhees. The philosophy of information retrieval evaluation. In Evaluation of cross-language information retrieval systems, pages 355-370. Springer, 2001. · Zbl 1014.68878
[97] Ellen M Voorhees, Donna K Harman, et al. TREC: Experiment and evaluation in information retrieval. The MIT Press, 2005.
[98] Simon Wakeling, Martin Halvey, Robert Villa, and Laura Hasler. A comparison of primary and secondary relevance judgements for real-life topics. In Proc. of the ACM on Conf. on Human Information Interaction and Retrieval, pages 173-182, 2016.
[99] Bing Wang, Bonan Hou, Yiping Yao, and Laibin Yan. Human flesh search model incorporating network expansion and gossip with feedback. In 2009 13th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications, pages 82-88. IEEE, 2009.
[100] Nai-Ching Wang, David Hicks, Paul Quigley, and Kurt Luther. Read-agree-predict: A crowdsourced approach to discovering relevant primary sources for historians. Human Computation, 6(1):147-175, 2019.
[101] Peng Dai, Mausam, and Daniel S Weld. Decision-theoretic control of crowd-sourced workflows. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, pages 1168-1174, 2010.
[102] Mark E Whiting, Grant Hugh, and Michael S Bernstein. Fair work: Crowd work minimum wage with one line of code. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 7, pages 197-206, 2019.
[103] W John Wilbur and Won Kim. Improving a gold standard: treating human relevance judgments of MEDLINE document pairs. BMC Bioinformatics, 12(3):S5, 2011.
[104] Stephen Wolfson and Matthew Lease. Look before you leap: Legal pitfalls of crowdsourcing. In Proceedings of the 74th Annual Meeting of the American Society for Information Science and Technology (ASIS&T), 2011.
[105] Meng-Han Wu and Alexander James Quinn. Confusing the crowd: Task instruction quality on Amazon Mechanical Turk. In Proceedings of the Fifth AAAI Conference on Human Computation and Crowdsourcing (HCOMP), pages 206-215, 2017.
[106] Omar F Zaidan and Jason Eisner. Modeling annotators: A generative approach to learning from annotator rationales. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 31-40. Association for Computational Linguistics, 2008.
[107] Omar F Zaidan, Jason Eisner, and Christine D Piatko. Using “annotator rationales” to improve machine learning for text categorization. In HLT-NAACL, pages 260-267, 2007.
[108] Ye Zhang, Iain Marshall, and Byron C Wallace. Rationale-augmented convolutional neural networks for text classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 795-804, 2016.
[109] Yudian Zheng, Guoliang Li, Yuanbing Li, Caihua Shan, and Reynold Cheng.