Wikipedia:Database download
Information on downloading dumps of the wiki database
For scheduling, related tools etc., see Data dumps on Meta-Wiki.
"WP:DD" redirects here. For deletion discussions, see Wikipedia:Deletion discussions.
This help page is a how-to guide. It details processes or procedures of some aspect(s) of Wikipedia's norms and practices. It is not one of Wikipedia's policies or guidelines, and may reflect varying levels of consensus and vetting.
Shortcuts: WP:DUMP, WP:DUMPS
Wikipedia offers free copies of all available content to interested users. These databases can be used for mirroring, personal use, informal backups, offline use or database queries (such as for Wikipedia:Maintenance). All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). Images and other files are available under different terms, as detailed on their description pages. For our advice about complying with these licenses, see Wikipedia:Copyrights.
Offline Wikipedia readers
Some of the many ways to read Wikipedia while offline:
XOWA: (§ XOWA)
Kiwix: (§ Kiwix)
WikiTaxi: § WikiTaxi (for Windows)
aarddict: § Aard Dictionary
BzReader: § BzReader and MzReader (for Windows)
Selected Wikipedia articles as a printed document: Help:Printing
Wiki as E-Book: § E-book
WikiFilter: § WikiFilter
Wikipedia on rockbox: § Wikiviewer for Rockbox
Some of them are mobile applications; see "list of Wikipedia mobile applications".
Where do I get it?
English-language Wikipedia
Dumps from any Wikimedia Foundation project: dumps.wikimedia.org and the Internet Archive
English Wikipedia dumps in SQL and XML: dumps.wikimedia.org/enwiki/ and the Internet Archive
Download the data dump using a BitTorrent client (torrenting has many benefits and reduces server load, saving bandwidth costs).
pages-articles-multistream.xml.bz2: Current revisions only, no talk or user pages; this is probably what you want, and is over 19 GB compressed (expands to over 86 GB when decompressed).
pages-meta-current.xml.bz2: Current revisions only, all pages (including talk)
abstract.xml.gz: page abstracts
all-titles-in-ns0.gz: Article titles only (with redirects)
SQL files for the pages and links are also available
All revisions, all pages: These files expand to multiple terabytes of text. Please only download these if you know you can cope with this quantity of data. Go to Latest Dumps and look out for all the files that have 'pages-meta-history' in their name.
To download a subset of the database in XML format, such as a specific category or a list of articles, see Special:Export, usage of which is described at Help:Export.
Wiki front-end software: MediaWiki [1].
Database backend software: MySQL.
Image dumps: See below.
Should I get multistream?
TL;DR:
GET THE MULTISTREAM VERSION! (and the corresponding index file, pages-articles-multistream-index.txt.bz2)
pages-articles.xml.bz2 and pages-articles-multistream.xml.bz2 both contain the same XML contents, so if you unpack either, you get the same data. But with multistream, it is possible to get an article from the archive without unpacking the whole thing. Your reader should handle this for you. If your reader doesn't support it, it will work anyway, since multistream and non-multistream contain the same XML. The only downside to multistream is that it is marginally larger. You might be tempted to get the smaller non-multistream archive, but this will be useless if you don't unpack it, and it will unpack to ~5-10 times its original size. Penny wise, pound foolish. Get multistream.
NOTE THAT the multistream dump file contains multiple bz2 'streams' (bz2 header, body, footer) concatenated together into one file, in contrast to the vanilla file, which contains one stream. Each separate 'stream' (or really, file) in the multistream dump contains 100 pages, except possibly the last one.
How to use multistream?
For multistream, you can get an index file, pages-articles-multistream-index.txt.bz2. The first field of this index is the number of bytes to seek into the compressed archive pages-articles-multistream.xml.bz2, the second is the article ID, the third the article title.
Cut a small part out of the archive with dd using the byte offset as found in the index. You could then either bzip2 decompress it or use bzip2recover, and search the first file for the article ID.
See https://docs.python.org/3/library/bz2.html#bz2.BZ2Decompressor for info about such multistream files and about how to decompress them with Python; see also https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dumps/+/ariel/toys/bz2multistream/README.txt and related files for an old working toy.
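As a rough illustration of the seek-and-decompress approach described above, the following Python sketch reads a single bz2 stream from a multistream file. It assumes the index lines are colon-separated (offset:article ID:title); the function names parse_index_line and read_stream are hypothetical and not part of any Wikimedia tool.

```python
import bz2


def parse_index_line(line):
    """Split one index line into (byte offset, article ID, title).

    Assumption: fields are separated by colons. Titles may themselves
    contain colons, so only the first two colons are treated as separators.
    """
    offset, page_id, title = line.rstrip("\n").split(":", 2)
    return int(offset), int(page_id), title


def read_stream(dump_path, offset, chunk_size=64 * 1024):
    """Decompress the single bz2 stream that starts at `offset`."""
    decompressor = bz2.BZ2Decompressor()
    parts = []
    with open(dump_path, "rb") as dump:
        dump.seek(offset)
        # A BZ2Decompressor stops at the end of its own stream, so bytes
        # belonging to the following stream are ignored.
        while not decompressor.eof:
            chunk = dump.read(chunk_size)
            if not chunk:
                break  # file ended before the stream did
            parts.append(decompressor.decompress(chunk))
    return b"".join(parts).decode("utf-8")
```

The returned text is the XML for the block of pages in that stream, which can then be searched for the article ID taken from the index.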
Other languages
In the dumps.wikimedia.org directory you will find the latest SQL and XML dumps for the projects, not just English. The sub-directories are named for the language code and the appropriate project. Some other directories (e.g. simple, nostalgia) exist, with the same structure. These dumps are also available from the Internet Archive.
Where are the uploaded files (image, audio, video, etc.)?
Images and other uploaded media are available from mirrors in addition to being served directly from Wikimedia servers. Bulk download is (as of September 2013) available from mirrors but not offered directly from Wikimedia servers. See the list of current mirrors. You should rsync from the mirror, then fill in the missing images from upload.wikimedia.org; when downloading from upload.wikimedia.org you should throttle yourself to 1 cache miss per second (you can check headers on a response to see if it was a hit or miss and then back off when you get a miss) and you shouldn't use more than one or two simultaneous HTTP connections. In any case, make sure you have an accurate user agent string with contact info (email address) so ops can contact you if there's an issue. You should be getting checksums from the MediaWiki API and verifying them. The API Etiquette page contains some guidelines, although not all of them apply (for example, because upload.wikimedia.org isn't MediaWiki, there is no maxlag parameter).
Unlike most article text, images are not necessarily licensed under the GFDL & CC-BY-SA-3.0. They may be under one of many free licenses, in the public domain, believed to be fair use, or even copyright infringements (which should be deleted). In particular, use of fair use images outside the context of Wikipedia or similar works may be illegal. Images under most licenses require a credit, and possibly other attached copyright information. This information is included in image description pages, which are part of the text dumps available from dumps.wikimedia.org. In conclusion, download these images at your own risk (Legal).
Dealing with compressed files
Dump files are heavily compressed, so they will take up large amounts of drive space once decompressed. A large list of decompression programs is described in Comparison of file archivers. The following programs in particular can be used to decompress bzip2 (.bz2), .zip, and .7z files.
Windows
Beginning with Windows XP, a basic decompression program enables decompression of zip files.[1][2] Among others, the following can be used to decompress bzip2 files.
bzip2 (command-line) (from here) is available for free under a BSD license.
7-Zip is available for free under an LGPL license.
WinRAR
WinZip
Macintosh (Mac)
OS X ships with the command-line bzip2 tool.
GNU/Linux
Most GNU/Linux distributions ship with the command-line bzip2 tool.
Berkeley Software Distribution (BSD)
Some BSD systems ship with the command-line bzip2 tool as part of the operating system. Others, such as OpenBSD, provide it as a package which must first be installed.
Notes
Some older versions of bzip2 may not be able to handle files larger than 2 GB, so make sure you have the latest version if you experience any problems.
Some older archives are compressed with gzip, which is compatible with PKZIP (the most common Windows format).
Dealing with large files
As files grow in size, so does the likelihood they will exceed some limit of a computing device. Each operating system, file system, hard storage device, and software (application) has a maximum file size limit. Each one of these will likely have a different maximum, and the lowest limit of all of them will become the file size limit for a storage device.
The older the software in a computing device, the more likely it will have a 2 GB file limit somewhere in the system. This is due to older software using 32-bit integers for file indexing, which limits file sizes to 2^31 bytes (2 GB) (for signed integers), or 2^32 (4 GB) (for unsigned integers). Older C programming libraries have this 2 or 4 GB limit, but the newer file libraries have been converted to 64-bit integers, thus supporting file sizes up to 2^63 or 2^64 bytes (8 or 16 EB).
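The figures in the paragraph above follow directly from the width of the integer used for file offsets. The short Python sketch below reproduces that arithmetic; the helper name max_file_size is hypothetical, and the units are taken as binary multiples (1 GB = 2^30 bytes, 1 EB = 2^60 bytes), which is an assumption about how the text rounds.

```python
GB = 2**30  # assumed binary gigabyte
EB = 2**60  # assumed binary exabyte


def max_file_size(bits, signed):
    """Number of bytes addressable with a `bits`-wide file offset.

    A signed integer loses one bit to the sign, halving the range.
    """
    return 2 ** (bits - 1) if signed else 2**bits


for bits in (32, 64):
    for signed in (True, False):
        kind = "signed" if signed else "unsigned"
        print(f"{bits}-bit {kind}: {max_file_size(bits, signed)} bytes")
```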
Before starting a download of a large file, check the storage device to ensure its file system can support files of such a large size, and check the amount of free space to ensure that it can hold the downloaded file.
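The free-space part of this check can be scripted. The following is a minimal Python sketch using the standard library's shutil.disk_usage; the function name has_room is hypothetical, and the check says nothing about the file system's per-file size limit, which still has to be looked up separately.

```python
import shutil


def has_room(directory, needed_bytes):
    """Return True if the volume holding `directory` has enough free space."""
    return shutil.disk_usage(directory).free >= needed_bytes


# Example: is there room here for a dump that expands to about 86 GB?
print(has_room(".", 86 * 10**9))
```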
File system limits
There are two limits for a file system: the file size limit, and the file system size limit. In general, since the file size limit is less than the file system size limit, the larger file system limits are a moot point. A large percentage of users assume they can create files up to the size of their storage device, but are wrong in their assumption. For example, a 16 GB storage device formatted as FAT32 file system has a file limit of 4 GB for any single file. The following is a list of the most common file systems; see Comparison of file systems for additional detailed information.
Windows
FAT16 supports files up to 4 GB. FAT16 is the factory format of smaller USB drives and all SD cards that are 2 GB or smaller.
FAT32 supports files up to 4 GB. FAT32 is the factory format of larger USB drives and all SDHC cards that are 4 GB or larger.
exFAT supports files up to 127 PB. exFAT is the factory format of all SDXC cards, but is incompatible with most flavors of UNIX due to licensing problems.
NTFS supports files up to 16 TB. NTFS is the default file system for modern Windows computers, including Windows 2000, Windows XP, and all their successors to date. Versions after Windows 8 can support larger files if the file system is formatted with a larger cluster size.
ReFS supports files up to 16 EB.
Macintosh (Mac)
HFS Plus (HFS+) supports files up to 8 EB on Mac OS X 10.2+ and iOS. HFS+ was the default file system for OS X computers prior to macOS High Sierra in 2017, when it was replaced as default with Apple File System, APFS.
Linux
ext2 and ext3 support files up to 16 GB, but up to 2 TB with larger block sizes. See https://users.suse.com/~aj/linux_lfs.html for more information.
ext4 supports files up to 16 TB, using 4 KB block size. (limit removed in e2fsprogs-1.42 (2012))
XFS supports files up to 8 EB.
ReiserFS supports files up to 1 EB, 8 TB on 32-bit systems.
JFS supports files up to 4 PB.
Btrfs supports files up to 16 EB.
NILFS supports files up to 8 EB.
YAFFS2 supports files up to 2 GB.
FreeBSD
ZFS supports files up to 16 EB.
FreeBSD and other BSDs
Unix File System (UFS) supports files up to 8 ZiB.
Operating system limits
Each operating system has internal file system limits for file size and drive size, which are independent of the file system or physical media. If the operating system has any limits lower than the file system or physical media, then the OS limits will be the real limit.
Windows
Windows 95, 98, ME have a 4 GB limit for all file sizes.
Windows XP has a 16 TB limit for all file sizes.
Windows 7 has a 16 TB limit for all file sizes.
Windows 8, 10, and Server 2012 have a 256 TB limit for all file sizes.
Linux
32-bit kernel 2.4.x systems have a 2 TB limit for all file systems.
64-bit kernel 2.4.x systems have an 8 EB limit for all file systems.
32-bit kernel 2.6.x systems without option CONFIG_LBD have a 2 TB limit for all file systems.
32-bit kernel 2.6.x systems with option CONFIG_LBD and all 64-bit kernel 2.6.x systems have an 8 ZB limit for all file systems.[3]
Google Android
Google Android is based on Linux, which determines its base limits.
Internal storage:
Android 2.3 and later uses the ext4 file system.[4]
Android 2.2 and earlier uses the YAFFS2 file system.
External storage slots:
All Android devices should support FAT16, FAT32, ext2 file systems.
Android 2.3 and later supports ext4 file system.
Apple iOS (see List of iOS devices)
All devices support HFS Plus (HFS+) for internal storage. No devices have external storage slots. Devices on 10.3 or later run Apple File System supporting a max file size of 8 EB.
Tips
Detect corrupted files
It is useful to check the MD5 sums (provided in a file in the download directory) to make sure the download was complete and accurate. This can be checked by running the "md5sum" command on the files downloaded. Given their sizes, this may take some time to calculate. Due to the technical details of how files are stored, file sizes may be reported differently on different file systems, and so are not necessarily reliable. Also, corruption may have occurred during the download, though this is unlikely.
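Where the md5sum command is not available, the same check can be done with a few lines of Python. This is a minimal sketch using the standard hashlib module; the function names md5_of_file and matches_checksum are hypothetical, and the expected value is assumed to be the hexadecimal digest copied from the checksum file in the download directory.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's MD5 digest with the published hexadecimal value."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use constant, which matters for dump files that are tens of gigabytes in size.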
Reformatting external USB drives
If you plan to download Wikipedia dump files to one computer and use an external USB flash drive or hard drive to copy them to other computers, then you will run into the 4 GB FAT32 file size limit. To work around this limit, reformat the >4 GB USB drive to a file system that supports larger file sizes. If working exclusively with Windows XP/Vista/7 computers, then reformat the USB drive to NTFS file system.
Linux and Unix
If you seem to be hitting the 2 GB limit, try using wget version 1.10 or greater, cURL version 7.11.1-1 or greater, or a recent version of lynx (using -dump). Also, you can resume downloads (for example wget -c).
Why not just retrieve data from wikipedia.org at runtime?
Suppose you are building a piece of software that at certain points displays information that came from Wikipedia. If you want your program to display the information in a different way than can be seen in the live version, you'll probably need the wikicode that is used to enter it, instead of the finished HTML.
Also, if you want to get all the data, you'll probably want to transfer it in the most efficient way that's possible. The wikipedia.org servers need to do quite a bit of work to convert the wikicode into HTML. That's time consuming both for you and for the wikipedia.org servers, so simply spidering all pages is not the way to go.
To access any article in XML, one at a time, access Special:Export/Title of the article.
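For illustration, the Python sketch below builds such an address for a given title. The host name en.wikipedia.org, the /wiki/ path prefix, and the function name export_url are assumptions made for this example; spaces are replaced with underscores and other special characters are percent-encoded.

```python
from urllib.parse import quote


def export_url(title, host="en.wikipedia.org"):
    """Build the Special:Export address for a single article title."""
    # Spaces become underscores in page addresses; quote() percent-encodes
    # characters that are not safe in a URL path.
    page = quote(title.replace(" ", "_"))
    return f"https://{host}/wiki/Special:Export/{page}"


print(export_url("Albert Einstein"))
```

This only constructs the address; any code that actually fetches it should do so one article at a time and follow the crawling guidance on this page.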
Read more about this at Special:Export.
Please be aware that live mirrors of Wikipedia that are dynamically loaded from the Wikimedia servers are prohibited. Please see Wikipedia:Mirrors and forks.
Please do not use a web crawler
Please do not use a web crawler to download large numbers of articles. Aggressive crawling of the server can cause a dramatic slow-down of Wikipedia.
Sample blocked crawler email
IP address nnn.nnn.nnn.nnn was retrieving up to 50 pages per second from wikipedia.org addresses. Something like at least a second delay between requests is reasonable. Please respect that setting. If you must exceed it a little, do so only during the least busy times shown in our site load graphs at stats.wikimedia.org/EN/ChartsWikipediaZZ.htm. It's worth noting that to crawl the whole site at one hit per second will take several weeks. The originating IP is now blocked or will be shortly. Please contact us if you want it unblocked. Please don't try to circumvent it; we'll just block your whole IP range.
If you want information on how to get our content more efficiently, we offer a variety of methods, including weekly database dumps which you can load into MySQL and crawl locally at any rate you find convenient. Tools are also available which will do that for you as often as you like once you have the infrastructure in place.
Instead of an email reply you may prefer to visit #mediawiki at irc.libera.chat to discuss your options with our team.
Doing SQL queries on the current database dump
You can do SQL queries on the current database dump using Quarry (as a replacement for the disabled Special:Asksql page).
Database schema
SQL schema
See also: mw:Manual:Database layout
The SQL file used to initialize a MediaWiki database can be found here.
XML schema
The XML schema for each dump is defined at the top of the file. It is also described in the MediaWiki export help page.
Help to parse dumps for use in scripts
Wikipedia:Computer help desk/ParseMediaWikiDump describes the Perl Parse::MediaWikiDump library, which can parse XML dumps.
Wikipedia preprocessor (wikiprep.pl) is a Perl script that preprocesses raw XML dumps and builds link tables, category hierarchies, collects anchor text for each article etc.
Wikipedia SQL dump parser is a .NET library to read MySQL dumps without the need to use a MySQL database.
WikiDumpParser: a .NET Core library to parse the database dumps.
Dictionary Builder is a Rust program that can parse XML dumps and extract entries in files.
Scripts for parsing Wikipedia dumps: Python-based scripts for parsing sql.gz files from Wikipedia dumps.
parse-mediawiki-sql: a Rust library for quickly parsing the SQL dump files with minimal memory allocation.
gitlab.com/tozd/go/mediawiki: a Go package providing utilities for processing Wikipedia and Wikidata dumps.
Doing Hadoop MapReduce on the Wikipedia current database dump
You can do Hadoop MapReduce queries on the current database dump, but you will need an extension to the InputRecordFormat to have each