wikipedia · Datasets at Hugging Face


Dataset Card for "wikipedia"

Dataset Summary

Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).

The articles have been parsed using the mwparserfromhell tool.

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure

We show detailed information for up to 5 configurations of the dataset.

Data Instances

20200501.en
Size of downloaded dataset files: 17396.28 MB
Size of the generated dataset: 17481.07 MB
Total amount of disk used: 34877.35 MB

An example looks as follows.

{
    'title': 'Yangliuqing',
    'text': 'Yangliuqing () is a market town in Xiqing District, in the western suburbs of Tianjin, ... and traditional period furnishings and crafts.\n\nSee also\n\nList of township-level divisions of Tianjin\n\nReferences\n\n http://arts.cultural-china.com/en/65Arts4795.html\n\nCategory:Towns in Tianjin'
}

20200501.de
Size of downloaded dataset files: 5531.82 MB
Size of the generated dataset: 7716.79 MB
Total amount of disk used: 13248.61 MB

20200501.fr
Size of downloaded dataset files: 4653.55 MB
Size of the generated dataset: 6182.24 MB
Total amount of disk used: 10835.79 MB

20200501.frr
Size of downloaded dataset files: 9.05 MB
Size of the generated dataset: 5.88 MB
Total amount of disk used: 14.93 MB

20200501.it
Size of downloaded dataset files: 2970.57 MB
Size of the generated dataset: 3809.89 MB
Total amount of disk used: 6780.46 MB

Data Fields

The data fields are the same among all splits and configurations:

title: a string feature corresponding to the title of the article
text: a string feature corresponding to the text content of the article

Data Splits

Here are the sizes for several configurations:

name            train
20200501.de     3140341
20200501.en     6078422
20200501.fr     2210508
20200501.frr    11803
20200501.it     1931197

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information

More Information Needed

Citation Information

@ONLINE{wikidump,
    author = "Wikimedia Foundation",
    title  = "Wikimedia Downloads",
    url    = "https://dumps.wikimedia.org"
}

Contributions

Thanks to @lewtun, @mariamabarham, @thomwolf, @lhoestq, @patrickvonplaten for adding this dataset.
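To make the record layout concrete, here is a minimal sketch of reading the two fields described in the card. It rests on several assumptions: that the Hugging Face `datasets` package is installed, that the dataset is still served under the name "wikipedia" with the configuration names listed in the card (newer library releases may handle it differently), and that `load_wikipedia_config` and `split_blocks` are hypothetical helper names introduced here only for illustration. The sample record reuses the tail of the card's own example, so the elided middle of the text is left out rather than guessed.

```python
from typing import Dict, List


def load_wikipedia_config(config_name: str = "20200501.en"):
    """Return the "train" split of one language configuration.

    Assumption: `datasets` is installed and serves this dataset as
    "wikipedia". The English configuration needs tens of gigabytes
    of disk, so this function is defined here but not called.
    """
    from datasets import load_dataset  # lazy import: optional, heavy dependency

    return load_dataset("wikipedia", config_name, split="train")


def split_blocks(example: Dict[str, str]) -> List[str]:
    """Split an article's `text` field into non-empty blocks on blank lines."""
    return [b.strip() for b in example["text"].split("\n\n") if b.strip()]


# Tail of the example record shown in the card (the card elides the middle).
sample = {
    "title": "Yangliuqing",
    "text": (
        "and traditional period furnishings and crafts.\n\nSee also\n\n"
        "List of township-level divisions of Tianjin\n\nReferences\n\n"
        " http://arts.cultural-china.com/en/65Arts4795.html\n\n"
        "Category:Towns in Tianjin"
    ),
}

blocks = split_blocks(sample)
for block in blocks:
    print(repr(block))
```

In this sample, headings such as "See also" and "References" come out as ordinary blocks, so downstream code may want to filter them if only body text is needed.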
Homepage: dumps.wikimedia.org
Size of downloaded dataset files: 30739.25 MB
Size of the generated dataset: 35376.35 MB
Total amount of disk used: 66115.60 MB
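The size figures on this page appear to follow one rule: total disk used is the downloaded size plus the generated size. The sketch below checks that rule against the numbers reported for each configuration and for the page-level totals, which can help when estimating free space before a download. The names `SIZES_MB`, `PAGE_TOTAL_MB`, and `disk_needed_mb` are hypothetical, introduced only for this illustration; the figures themselves are copied from the card.

```python
from decimal import Decimal
from typing import Tuple

Sizes = Tuple[str, str, str]  # (downloaded MB, generated MB, reported total MB)

# Per-configuration figures as reported in the card.
SIZES_MB = {
    "20200501.en": ("17396.28", "17481.07", "34877.35"),
    "20200501.de": ("5531.82", "7716.79", "13248.61"),
    "20200501.fr": ("4653.55", "6182.24", "10835.79"),
    "20200501.frr": ("9.05", "5.88", "14.93"),
    "20200501.it": ("2970.57", "3809.89", "6780.46"),
}

# Page-level figures listed next to the homepage link.
PAGE_TOTAL_MB: Sizes = ("30739.25", "35376.35", "66115.60")


def disk_needed_mb(sizes: Sizes) -> Decimal:
    """Downloaded plus generated size, computed exactly with Decimal."""
    downloaded, generated, _ = sizes
    return Decimal(downloaded) + Decimal(generated)


# True where the reported total matches downloaded + generated.
consistent = {
    name: disk_needed_mb(sizes) == Decimal(sizes[2])
    for name, sizes in SIZES_MB.items()
}
page_consistent = disk_needed_mb(PAGE_TOTAL_MB) == Decimal(PAGE_TOTAL_MB[2])

print(consistent)
print("page totals consistent:", page_consistent)
```

Decimal is used instead of float so the comparison against the two-decimal figures is exact.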


