tscheepers/Wikipedia-Summary-Dataset - GitHub

2024-09-22


Wikipedia Summary Dataset

This repository has been archived by the owner. It is now read-only.
Project page: thijs.ai/wikipedia-summary-dataset/

This is a dataset that can be used for research into machine learning and natural language processing. It contains all titles and summaries (or introductions) of English Wikipedia articles, extracted in September of 2017.

The dataset is different from the regular Wikipedia dump and different from the datasets that can be created by gensim because ours contains the extracted summaries and not the entire unprocessed page body. This could be useful if one wants to use the smaller, more concise, and more definitional summaries in their research. Or if one just wants to use a smaller but still diverse dataset for efficient training with resource constraints.

A summary or introduction of an article is everything starting from the page title up to the content outline.

The raw dataset leaves the original text structure intact. Additionally, we provide pre-processed versions. Each version applies all the steps of the previous one plus one more:

| File                       | Tokenized | Lowercased | No punctuation | No stop words | Stemmed |
|----------------------------|-----------|------------|----------------|---------------|---------|
| raw.tar.gz                 |           |            |                |               |         |
| tokenized.tar.gz           | ✓         |            |                |               |         |
| lowercased.tar.gz          | ✓         | ✓          |                |               |         |
| without-punctuation.tar.gz | ✓         | ✓          | ✓              |               |         |
| without-stop-words.tar.gz  | ✓         | ✓          | ✓              | ✓             |         |
| stemmed.tar.gz             | ✓         | ✓          | ✓              | ✓             | ✓       |

Download

💾 raw.tar.gz (±1 GB; 459,081,607 words; 5,315,384 articles)
💾 tokenized.tar.gz (±1 GB; 533,211,092 words; 5,627,475 vocab; 5,315,384 articles)
💾 lowercased.tar.gz (±1 GB; 533,211,092 words; 5,172,571 vocab; 5,315,384 articles)
💾 without-punctuation.tar.gz (±1 GB; 461,749,888 words; 5,171,326 vocab; 5,315,384 articles)
💾 without-stop-words.tar.gz (±0.8 GB; 296,210,530 words; 5,171,164 vocab; 5,315,384 articles)
💾 stemmed.tar.gz (±0.7 GB; 296,210,530 words; 4,830,348 vocab; 5,315,384 articles)

Dataset contents

Each tarball contains two files: a .txt file and a .vocab file. The .txt file contains all the necessary data. Each line represents one article and contains both a title and a summary, separated by |||. The lines are ordered by Wikipedia page_id. If you want to create a smaller test dataset, I would suggest sampling lines from the file rather than splitting it directly, because a contiguous split of page_id-ordered lines is not a random sample.

Example from tokenized.txt:

    Anarchism ||| Anarchism is a political philosophy that advocates self-governed societies based on voluntary …
    Autism ||| Autism is a neurodevelopmental disorder characterized by impaired social interaction, impaired verbal …
    Albedo ||| Albedo () is a measure for reflectance or optical brightness (Latin albedo, ``whiteness'') of …
    …

There is also a .vocab file, which contains the vocabulary and the count of each token. Example from tokenized.vocab:

    , 27222735
    the 25505452
    . 21555700
    of 16267241
    in 13313133
    and 12630336
    a 10202887
    is 7770405
    …

Dataset construction

The dataset was constructed using a script that calls the Wikipedia API for every page, identified by its page_id. The correct way to construct summaries without any unwanted artifacts is to use the TextExtracts extension, so the API call we used relies on TextExtracts to create the summaries or introductions. As you can imagine, this takes quite a while.

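A minimal sketch of how such a batched TextExtracts request could be assembled, using the same query parameters as the request URL given in this section. The function names `build_extract_url` and `batched` are hypothetical and are not taken from the repository's download.py, and the batch size is an assumption, since the source does not say how many page ids were sent per request.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_extract_url(page_ids, maxlag=5):
    """Build a query URL asking TextExtracts for plain-text intros.

    Hypothetical helper: it mirrors the parameters of the README's
    example URL, not the actual code in download.py.
    """
    params = {
        "format": "json",
        "maxlag": maxlag,      # lets the server refuse the request when replication lag is high
        "action": "query",
        "prop": "extracts",
        "exintro": "",         # only the section before the first heading
        "explaintext": "",     # plain text instead of HTML
        "pageids": "|".join(str(pid) for pid in page_ids),
    }
    # Keep '|' unescaped so the URL matches the README's form.
    return f"{API_ENDPOINT}?{urlencode(params, safe='|')}"


def batched(ids, size):
    """Yield consecutive slices of `ids` with at most `size` items each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


if __name__ == "__main__":
    # Batch size 3 is an arbitrary illustration, not a documented limit.
    for batch in batched([123, 456, 789, 1011], 3):
        print(build_extract_url(batch))
```

The sketch only builds URLs and sends nothing, so it can be run without putting any load on the public API.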
    https://en.wikipedia.org/w/api.php?format=json&maxlag=5&action=query&prop=extracts&exintro=&explaintext=&pageids=123|456|789

The actual downloading is done by download.py, which stores the raw JSON output of the API in a separate folder. Afterwards, the script process.py can combine all these API responses into two big files, i.e. a .txt file and a .vocab file.

Scripts to create the dataset are provided in this repository. They require a local Wikipedia installation and access to its MySQL database, filled with data, to get the page identifiers (page_id). You can fill a MySQL database with the Wikipedia data from the dump using MWDumper.

Additionally, we would ask you not to build the dataset using the official Wikipedia API if this is not needed, since building the dataset requires calling the API for every page and this puts strain on their public API. Please respect the maxlag=5 parameter if you use the official API at en.wikipedia.org/w/api.php.

Research Publications

Improving Word Embedding Compositionality using Lexicographic Definitions (will be published and presented at WWW '18)
Improving the Compositionality of Word Embeddings (Thesis PDF, Presentation PDF)
Analyzing the compositional properties of word embeddings (Paper PDF)

Please cite the following thesis if you use our data or code for your own research:

    @mastersthesis{scheepers2017compositionality,
      author  = {Scheepers, Thijs},
      title   = {Improving the Compositionality of Word Embeddings},
      school  = {Universiteit van Amsterdam},
      year    = {2017},
      month   = {11},
      address = {Science Park 904, Amsterdam, Netherlands}
    }

License (MIT)

Copyright 2017 Thijs Scheepers

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
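Usage sketch

As a supplement to the Dataset contents section, the following is a minimal sketch of reading the two file formats and drawing a random sample of lines instead of splitting the file. All function names are hypothetical and not part of the repository. Two assumptions are made: whitespace around the ||| separator is insignificant, and each .vocab line holds a token and its count separated by whitespace. The sample is drawn with reservoir sampling so that the ±1 GB file never has to be held in memory.

```python
import random


def parse_line(line):
    """Split one dataset line into (title, summary) at the first '|||'."""
    title, _, summary = line.rstrip("\n").partition("|||")
    # Assumption: whitespace around the separator carries no meaning.
    return title.strip(), summary.strip()


def parse_vocab_line(line):
    """Split one .vocab line into (token, count).

    Assumption: token and count are separated by whitespace.
    """
    token, count = line.rsplit(None, 1)
    return token, int(count)


def sample_lines(lines, k, seed=0):
    """Reservoir-sample up to k items from an iterable in a single pass."""
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)        # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < k:
                sample[j] = line       # replace with probability k / (i + 1)
    return sample


def load_sample(path, k, seed=0):
    """Return k randomly chosen (title, summary) pairs from a .txt file."""
    with open(path, encoding="utf-8") as handle:
        return [parse_line(line) for line in sample_lines(handle, k, seed)]
```

For example, `load_sample("tokenized.txt", 10000)` would return 10,000 (title, summary) pairs drawn across the whole page_id range, assuming the extracted file sits in the working directory.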

