A Large Scale Dataset for Content Reliability on Wikipedia
Computer Science > Information Retrieval
arXiv:2105.04117 (cs)
[Submitted on 10 May 2021 (v1), last revised 1 Jun 2021 (this version, v2)]
Title: Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia
Authors: Kay Yen Wong, Miriam Redi, Diego Saez-Trumper
Abstract: Wikipedia is the largest online encyclopedia, used by algorithms and web users as a central hub of reliable information on the web. The quality and reliability of Wikipedia content is maintained by a community of volunteer editors. Machine learning and information retrieval algorithms could help scale up editors' manual efforts around Wikipedia content reliability. However, there is a lack of large-scale data to support the development of such research. To fill this gap, in this paper, we propose Wiki-Reliability, the first dataset of English Wikipedia articles annotated with a wide set of content reliability issues. To build this dataset, we rely on Wikipedia "templates". Templates are tags used by expert Wikipedia editors to indicate content issues, such as the presence of "non-neutral point of view" or "contradictory articles", and serve as a strong signal for detecting reliability issues in a revision. We select the 10 most popular reliability-related templates on Wikipedia, and propose an effective method to label almost 1M samples of Wikipedia article revisions as positive or negative with respect to each template. Each positive/negative example in the dataset comes with the full article text and 20 features from the revision's metadata. We provide an overview of the possible downstream tasks enabled by such data, and show that Wiki-Reliability can be used to train large-scale models for content reliability prediction. We release all data and code for public use.
Comments: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '21), 2021
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as: arXiv:2105.04117 [cs.IR] (or arXiv:2105.04117v2 [cs.IR] for this version)
https://doi.org/10.48550/arXiv.2105.04117 (arXiv-issued DOI via DataCite)
Related DOI: https://doi.org/10.1145/3404835.3463253
Submission history
From: Kay Yen Wong
[v1] Mon, 10 May 2021 05:07:03 UTC (1,338 KB)
[v2] Tue, 1 Jun 2021 11:57:14 UTC (1,338 KB)
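The abstract says that templates in a revision serve as the signal for labeling, but it does not spell out the labeling rule itself. As a rough illustration only, the sketch below scans a sequence of revision texts and reports where a given template appears or disappears, which is one plausible starting point for pairing positive and negative examples. The function names, the sample revision texts, and the choice of the `POV` template are assumptions made for this example and are not taken from the paper or its released code.

```python
import re


def has_template(wikitext: str, template: str) -> bool:
    """Return True if {{template}} or {{template|...}} occurs in the wikitext."""
    # Match the template name followed by either a parameter bar or the closing braces,
    # so that a longer name such as {{POV-check}} is not mistaken for {{POV}}.
    pattern = r"\{\{\s*" + re.escape(template) + r"\s*(\||\}\})"
    return re.search(pattern, wikitext, flags=re.IGNORECASE) is not None


def template_transitions(revisions, template):
    """Return (revision_index, event) pairs, where event is 'added' or 'removed'."""
    events = []
    previous = False
    for i, text in enumerate(revisions):
        current = has_template(text, template)
        if current and not previous:
            events.append((i, "added"))
        elif previous and not current:
            events.append((i, "removed"))
        previous = current
    return events


# Hypothetical revision history of a single article, oldest first.
history = [
    "Plain intro.",
    "{{POV|date=May 2021}} Plain intro with a slanted claim.",
    "{{POV|date=May 2021}} Intro, partly reworded.",
    "Intro rewritten neutrally.",
]

print(template_transitions(history, "POV"))
```

A regular expression is used here only to keep the sketch dependency-free; nested templates and redirects between template names would need a proper wikitext parser.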
Related articles
- 1. wikipedia · Datasets at Hugging Face
Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the W...
- 2. Wikipedia:Database download
Wikipedia offers free copies of all available content to interested users. These databases can be...
- 3. List of datasets for machine-learning research - Wikipedia
Afifi, M. et al. IMDB-WIKI, IMDB and Wikipedia face images with gender and age labels. None, 523,...
- 4. Wikipedia:Size of Wikipedia
- 5. SNAP: Network datasets: Wikipedia Article Networks
Dataset information. The data was collected from the English Wikipedia (December 2018). These dat...