tscheepers/Wikipedia-Summary-Dataset - GitHub


This dataset contains all titles and summaries (or introductions) of English Wikipedia articles, extracted in September of 2017. The repository has been archived by the owner and is now read-only. Project page: thijs.ai/wikipedia-summary-dataset/

# Wikipedia Summary Dataset

This is a dataset that can be used for research into machine learning and natural language processing. It contains all titles and summaries (or introductions) of English Wikipedia articles, extracted in September of 2017.

The dataset is different from the regular Wikipedia dump and different from the datasets that can be created by gensim because ours contains the extracted summaries and not the entire unprocessed page body. This could be useful if one wants to use the smaller, more concise, and more definitional summaries in their research. Or if one just wants to use a smaller but still diverse dataset for efficient training with resource constraints.

A summary or introduction of an article is everything starting from the page title up to the content outline.
The raw dataset leaves the original text structure intact. Additionally, we provide pre-processed versions.

| File | Tokenized | Lowercased | No punctuation | No stop words | Stemmed |
|---|---|---|---|---|---|
| raw.tar.gz | | | | | |
| tokenized.tar.gz | ✓ | | | | |
| lowercased.tar.gz | ✓ | ✓ | | | |
| without-punctuation.tar.gz | ✓ | ✓ | ✓ | | |
| without-stop-words.tar.gz | ✓ | ✓ | ✓ | ✓ | |
| stemmed.tar.gz | ✓ | ✓ | ✓ | ✓ | ✓ |

## Download

- 💾 raw.tar.gz (±1 GB; 459,081,607 words; 5,315,384 articles)
- 💾 tokenized.tar.gz (±1 GB; 533,211,092 words; 5,627,475 vocab; 5,315,384 articles)
- 💾 lowercased.tar.gz (±1 GB; 533,211,092 words; 5,172,571 vocab; 5,315,384 articles)
- 💾 without-punctuation.tar.gz (±1 GB; 461,749,888 words; 5,171,326 vocab; 5,315,384 articles)
- 💾 without-stop-words.tar.gz (±0.8 GB; 296,210,530 words; 5,171,164 vocab; 5,315,384 articles)
- 💾 stemmed.tar.gz (±0.7 GB; 296,210,530 words; 4,830,348 vocab; 5,315,384 articles)

## Dataset contents

Each tarball contains two files: a .txt file and a .vocab file. The .txt file contains all the necessary data. Each line represents an article and contains both a title and a summary, separated by |||. The lines are ordered by Wikipedia page_id. If you want to create a smaller test dataset, I would suggest sampling lines from the file rather than splitting it directly, since a contiguous slice of a file ordered by page_id is not a random sample.

Example from tokenized.txt:

    Anarchism ||| Anarchism is a political philosophy that advocates self-governed societies based on voluntary …
    Autism ||| Autism is a neurodevelopmental disorder characterized by impaired social interaction , impaired verbal …
    Albedo ||| Albedo ( ) is a measure for reflectance or optical brightness ( Latin albedo , `` whiteness '' ) of …
    …

There is also a .vocab file, which contains the vocabulary and the count of each token. Example from tokenized.vocab:

    , 27222735
    the 25505452
    . 21555700
    of 16267241
    in 13313133
    and 12630336
    a 10202887
    is 7770405
    …

## Dataset construction

The dataset was constructed using a script that calls the Wikipedia API for every page, identified by its page_id. The correct way to construct summaries without any unwanted artifacts is to build them with the TextExtracts extension, so the API call we used relies on TextExtracts to create the summaries or introductions. As you can imagine, this takes quite a while.
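As an illustration of the TextExtracts request described above, here is a minimal sketch of how such a query URL could be assembled for a batch of page ids. It is not the repository's download.py. The helper names `batched` and `build_extract_url` are hypothetical, and the batch size of 3 is chosen only for the demonstration. The sketch only builds URLs and performs no network requests.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def batched(page_ids, size):
    """Yield consecutive chunks of at most `size` page ids (hypothetical helper)."""
    for start in range(0, len(page_ids), size):
        yield page_ids[start:start + size]


def build_extract_url(page_ids, maxlag=5):
    """Build a TextExtracts query URL for one batch of page ids.

    exintro and explaintext are flag parameters: they are sent with an
    empty value, which restricts the extract to the plain-text introduction.
    """
    params = {
        "format": "json",
        "maxlag": maxlag,
        "action": "query",
        "prop": "extracts",
        "exintro": "",
        "explaintext": "",
        "pageids": "|".join(str(pid) for pid in page_ids),
    }
    # safe="|" keeps the pipe separator literal instead of encoding it as %7C.
    return API_ENDPOINT + "?" + urlencode(params, safe="|")


if __name__ == "__main__":
    # Illustrative page ids only; a real run would read them from the database.
    for batch in batched([123, 456, 789, 1011], size=3):
        print(build_extract_url(batch))
```

Keeping `maxlag` as a default argument makes it harder to drop the parameter by accident when the function is reused.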
The API call has the following form:

    https://en.wikipedia.org/w/api.php?format=json&maxlag=5&action=query&prop=extracts&exintro=&explaintext=&pageids=123|456|789

The actual downloading is done by download.py, which stores the raw JSON output of the API in a separate folder. Afterwards, the script process.py can combine all these API responses into two big files, i.e. a .txt file and a .vocab file.

Scripts to create the dataset are provided in this repository. They require a local Wikipedia installation and access to its MySQL database, filled with data, to get the page identifiers (page_id). You can fill a MySQL database with the Wikipedia data from the dump using MWDumper.

Additionally, we would ask you not to build the dataset using the official Wikipedia API if this is not needed, since building the dataset requires calling the API for every page and this puts strain on their public API. Please respect the maxlag=5 parameter if you use the official API at en.wikipedia.org/w/api.php.

## Research Publications

- Improving Word Embedding Compositionality using Lexicographic Definitions (will be published and presented at WWW '18)
- Improving the Compositionality of Word Embeddings (Thesis PDF, Presentation PDF)
- Analyzing the compositional properties of word embeddings (Paper PDF)

Please cite the following thesis if you use our data or code for your own research:

    @mastersthesis{scheepers2017compositionality,
      author  = {Scheepers, Thijs},
      title   = {Improving the Compositionality of Word Embeddings},
      school  = {Universiteit van Amsterdam},
      year    = {2017},
      month   = {11},
      address = {Science Park 904, Amsterdam, Netherlands}
    }

## License (MIT)

Copyright 2017 Thijs Scheepers

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
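As a usage note for the file format described under Dataset contents, the following sketch shows one way to read the .txt and .vocab lines and to draw a random subset of articles without loading the whole file. It is an illustration under assumptions, not code from this repository: the function names are hypothetical, the .vocab lines are assumed to hold a token and its count separated by whitespace, and reservoir sampling is simply one convenient way to follow the advice to sample lines instead of splitting the file.

```python
import random

SEPARATOR = "|||"


def parse_article_line(line):
    """Split one .txt line into (title, summary) at the first ||| separator."""
    title, _, summary = line.rstrip("\n").partition(SEPARATOR)
    return title.strip(), summary.strip()


def parse_vocab_line(line):
    """Split one .vocab line into (token, count).

    Assumes the count is the last whitespace-separated field on the line.
    """
    token, count = line.rstrip("\n").rsplit(None, 1)
    return token, int(count)


def sample_lines(lines, k, seed=0):
    """Reservoir-sample up to k lines uniformly from an iterable of any length."""
    rng = random.Random(seed)
    reservoir = []
    for index, line in enumerate(lines):
        if index < k:
            # Fill the reservoir with the first k lines.
            reservoir.append(line)
        else:
            # Replace an existing entry with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = line
    return reservoir


if __name__ == "__main__":
    demo = [
        "Anarchism ||| Anarchism is a political philosophy\n",
        "Autism ||| Autism is a neurodevelopmental disorder\n",
        "Albedo ||| Albedo is a measure for reflectance\n",
    ]
    for raw in sample_lines(demo, k=2):
        print(parse_article_line(raw))
    print(parse_vocab_line("the 25505452\n"))
```

For the real data, an open file handle can be passed directly, for example `sample_lines(open("tokenized.txt", encoding="utf-8"), 10000)`. Both the path and the UTF-8 encoding are assumptions here. Because the reservoir holds only `k` lines at a time, memory use stays small even for the full file.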


