tscheepers/Wikipedia-Summary-Dataset - GitHub
This repository has been archived by the owner and is now read-only. Project page: thijs.ai/wikipedia-summary-dataset/

# Wikipedia Summary Dataset

This is a dataset that can be used for research into machine learning and natural language processing. It contains all titles and summaries (or introductions) of English Wikipedia articles, extracted in September of 2017.

The dataset is different from the regular Wikipedia dump and different from the datasets that can be created by gensim because ours contains the extracted summaries and not the entire unprocessed page body. This could be useful if one wants to use the smaller, more concise, and more definitional summaries in their research. Or if one just wants to use a smaller but still diverse dataset for efficient training with resource constraints.

A summary or introduction of an article is everything starting from the page title up to the content outline.
The raw dataset leaves the original text structure intact. Additionally, we provide pre-processed versions.

| File | Tokenized | Lowercased | No punctuation | No stop words | Stemmed |
|---|---|---|---|---|---|
| raw.tar.gz | | | | | |
| tokenized.tar.gz | ✓ | | | | |
| lowercased.tar.gz | ✓ | ✓ | | | |
| without-punctuation.tar.gz | ✓ | ✓ | ✓ | | |
| without-stop-words.tar.gz | ✓ | ✓ | ✓ | ✓ | |
| stemmed.tar.gz | ✓ | ✓ | ✓ | ✓ | ✓ |

## Download

- 💾 raw.tar.gz (±1 GB; 459,081,607 words; 5,315,384 articles)
- 💾 tokenized.tar.gz (±1 GB; 533,211,092 words; 5,627,475 vocab; 5,315,384 articles)
- 💾 lowercased.tar.gz (±1 GB; 533,211,092 words; 5,172,571 vocab; 5,315,384 articles)
- 💾 without-punctuation.tar.gz (±1 GB; 461,749,888 words; 5,171,326 vocab; 5,315,384 articles)
- 💾 without-stop-words.tar.gz (±0.8 GB; 296,210,530 words; 5,171,164 vocab; 5,315,384 articles)
- 💾 stemmed.tar.gz (±0.7 GB; 296,210,530 words; 4,830,348 vocab; 5,315,384 articles)

## Dataset contents

The tarballs contain two files: a .txt file and a .vocab file. The .txt file contains all the necessary data. Each line represents an article and contains both a title and a summary separated by |||. The lines are ordered by Wikipedia page_id. If you want to create a smaller test dataset, I would suggest sampling lines from the file and not splitting it directly.

Example from tokenized.txt:

    Anarchism|||Anarchism is a political philosophy that advocates self-governed societies based on voluntary …
    Autism|||Autism is a neurodevelopmental disorder characterized by impaired social interaction, impaired verbal …
    Albedo|||Albedo () is a measure for reflectance or optical brightness (Latin albedo, ``whiteness'') of …
    …

There is also a .vocab file which contains the vocabulary and the count of each token. Example from tokenized.vocab:

    , 27222735
    the 25505452
    . 21555700
    of 16267241
    in 13313133
    and 12630336
    a 10202887
    is 7770405
    …

## Dataset construction

The dataset was constructed using a script that calls the Wikipedia API for every page by its page_id. The correct way to construct summaries without any unwanted artifacts is to build them with the TextExtracts extension, so the API call we used also relies on TextExtracts to create the summaries or introductions. As you can imagine, this takes quite a while.
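A TextExtracts request for a batch of page ids could be assembled roughly as in the sketch below. This is not the repository's download.py: the names `build_extract_url` and `batches` are hypothetical, and the batch size of 3 is only an assumption for illustration, so check the limits of the API before choosing a real value. The sketch only builds URLs and sends no requests.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_extract_url(page_ids, maxlag=5):
    """Return a TextExtracts query URL for one batch of page ids.

    Hypothetical helper; the parameters are the ones used in the
    README's request (plain-text intro extracts, maxlag=5).
    """
    params = {
        "format": "json",
        "maxlag": maxlag,
        "action": "query",
        "prop": "extracts",
        "exintro": "",       # only the introduction
        "explaintext": "",   # plain text instead of HTML
        "pageids": "|".join(str(pid) for pid in page_ids),
    }
    # Keep "|" unescaped so the page id list stays readable.
    return API_ENDPOINT + "?" + urlencode(params, safe="|")


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    # Batch size 3 is an assumption chosen for illustration only.
    for batch in batches([123, 456, 789, 1011], size=3):
        print(build_extract_url(batch))
```

Keeping the URL construction separate from the download loop makes it easy to verify the query parameters, including `maxlag`, before any request reaches the public API.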
The request used for the dataset has the following form:

    https://en.wikipedia.org/w/api.php?format=json&maxlag=5&action=query&prop=extracts&exintro=&explaintext=&pageids=123|456|789

The actual downloading is done using download.py, which stores the raw JSON output of the API in a separate folder. Afterwards, the script process.py can combine all these API responses into two big files, i.e. a .txt file and a .vocab file.

Scripts to create the dataset are provided in this repository. They require a local Wikipedia installation and access to its MySQL database filled with data to get the page identifiers (page_id). You can fill a MySQL database with the Wikipedia data from the dump using MWDumper.

Additionally, we would ask you not to build the dataset using the official Wikipedia API if this is not needed, since building the dataset would require calling the API for every page and this puts strain on their public API. Please respect the maxlag=5 parameter if you use the official API en.wikipedia.org/w/api.php.

## Research Publications

- Improving Word Embedding Compositionality using Lexicographic Definitions (will be published and presented at WWW '18)
- Improving the Compositionality of Word Embeddings (Thesis PDF, Presentation PDF)
- Analyzing the compositional properties of word embeddings (Paper PDF)

Please cite the following thesis if you use our data or code for your own research:

    @mastersthesis{scheepers2017compositionality,
      author = {Scheepers, Thijs},
      title = {Improving the Compositionality of Word Embeddings},
      school = {Universiteit van Amsterdam},
      year = {2017},
      month = {11},
      address = {Science Park 904, Amsterdam, Netherlands}
    }

## License (MIT)

Copyright 2017 Thijs Scheepers

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.