Wikipedia Data Science: Working with the World's Largest Encyclopedia


How to programmatically download and parse the Wikipedia (Source)

Wikipedia is one of modern humanity's most impressive creations. Who would have thought that in just a few years, anonymous contributors working for free could create the greatest source of online knowledge the world has ever seen? Not only is Wikipedia the best place to get information for writing your college papers, but it's also an extremely rich source of data that can fuel numerous data science projects, from natural language processing to supervised machine learning.

The size of Wikipedia makes it both the world's largest encyclopedia and slightly intimidating to work with. However, size is not an issue with the right tools, and in this article, we'll walk through how we can programmatically download and parse through all of the English language Wikipedia. Along the way, we'll cover a number of useful topics in data science:

- Finding and programmatically downloading data from the web
- Parsing web data (HTML, XML, MediaWiki) using Python libraries
- Running operations in parallel with multiprocessing/multithreading
- Benchmarking methods to find the optimal solution to a problem

The original impetus for this project was to collect information on every single book on Wikipedia, but I soon realized the solutions involved were more broadly applicable. The techniques covered here and presented in the accompanying Jupyter Notebook will let you efficiently work with any articles on Wikipedia and can be extended to other sources of web data. If you'd like to see more about utilizing the data in this article, I wrote a post using neural network embeddings to build a book recommendation system.

The notebook containing the Python code for this article is available on GitHub. This project was inspired by the excellent Deep Learning Cookbook by Douwe Osinga and much of the code is adapted from the book. The book is well worth it and you can access the Jupyter Notebooks at no cost on GitHub.

Finding and Downloading Data Programmatically

The first step in any data science project is accessing your data! While we could make individual requests to Wikipedia pages and scrape the results, we'd quickly run into rate limits and unnecessarily tax Wikipedia's servers. Instead, we can access a dump of all of Wikipedia through Wikimedia at dumps.wikimedia.org. (A dump refers to a periodic snapshot of a database.) The English version is at dumps.wikimedia.org/enwiki. We view the available versions of the database using the following code.

```python
import requests

# Library for parsing HTML
from bs4 import BeautifulSoup

base_url = 'https://dumps.wikimedia.org/enwiki/'
index = requests.get(base_url).text
soup_index = BeautifulSoup(index, 'html.parser')

# Find the links on the page
dumps = [a['href'] for a in soup_index.find_all('a') if a.has_attr('href')]
dumps

['../', '20180620/', '20180701/', '20180720/', '20180801/',
 '20180820/', '20180901/', '20180920/', 'latest/']
```

This code makes use of the BeautifulSoup library for parsing HTML. Given that HTML is the standard markup language for web pages, this is an invaluable library for working with web data.

For this project, we'll take the dump from September 1, 2018 (some of the dumps are incomplete, so make sure to choose one with the data you need). To find all the available files in the dump, we use the following code:

```python
dump_url = base_url + '20180901/'

# Retrieve the html
dump_html = requests.get(dump_url).text

# Convert to a soup
soup_dump = BeautifulSoup(dump_html, 'html.parser')

# Find list elements with the class file
soup_dump.find_all('li', {'class': 'file'})[:3]

[enwiki-20180901-pages-articles-multistream.xml.bz2 15.2 GB,
 enwiki-20180901-pages-articles-multistream-index.txt.bz2 195.6 MB,
 enwiki-20180901-pages-meta-history1.xml-p10p2101.7z 320.6 MB]
```

Again, we parse the webpage using BeautifulSoup to find the files.
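As a minimal sketch of how that listing can be turned into download targets, the loop below collects the partitioned article files and builds full URLs. This helper is not from the original notebook: the variable names, the assumption that each list item's text begins with the file name, and the filter for the non-multistream 'pages-articles' partitions are all illustrative.

```python
# Illustrative sketch: collect the partitioned pages-articles files and
# build full download URLs from the dump page parsed above.
files_to_download = []

for file_element in soup_dump.find_all('li', {'class': 'file'}):
    # Assumes the visible text begins with the file name, followed by its size
    name = file_element.text.strip().split()[0]
    # Keep the bz2-compressed article partitions (skip the multistream copies)
    if ('pages-articles' in name and 'xml-p' in name
            and name.endswith('.bz2') and 'multistream' not in name):
        files_to_download.append((name, dump_url + name))

print(f'{len(files_to_download)} partition files to download')
```

Each (name, url) pair can then be handed to the download step described next.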
We could go to https://dumps.wikimedia.org/enwiki/20180901/ and look for the files to download manually, but that would be inefficient. Knowing how to parse HTML and interact with websites in a program is an extremely useful skill considering how much data is on the web. Learn a little web scraping and vast new data sources become accessible. (Here's a tutorial to get you started.)

Deciding what to Download

The above code finds all of the files in the dump. This includes several options for download: the current version of only the articles, the articles along with the current discussion, or the articles along with all past edits and discussion. If we go with the last option, we are looking at several terabytes of data! For this project, we'll stick to the most recent version of only the articles. This page is useful for determining which files to get given your needs.

The current version of all the articles is available as a single file. However, if we get the single file, then when we parse it, we'll be stuck going through all the articles sequentially — one at a time — a very inefficient approach. A better option is to download partitioned files, each of which contains a subset of the articles. Then, as we'll see, we can parse multiple files at a time through parallelization, speeding up the process significantly. When I'm dealing with files, I would rather have many small files than one large file, because then I can parallelize operations on the files.

The partitioned files are available as bz2-compressed XML (eXtensible Markup Language). Each partition is around 300–400 MB in size, with a total compressed size of 15.4 GB. We won't need to decompress the files, but if you choose to do so, the entire size is around 58 GB. That actually doesn't seem too large for all of human knowledge! (Okay, not all knowledge, but still.)

Compressed Size of Wikipedia (Source).

Downloading Files

To actually download the files, the Keras utility get_file is extremely useful. This downloads a file at a link and saves it to disk.

```python
from keras.utils import get_file

saved_file_path = get_file(file, url)
```

The files are saved in ~/.keras/datasets/, the default save location for Keras. Downloading all of the files one at a time takes a little over 2 hours. (You can try to download in parallel, but I ran into rate limits when I tried to make multiple requests at the same time.)

Parsing the Data

It might seem like the first thing we want to do is decompress the files. However, it turns out we won't ever actually need to do this to access the data in the articles! Instead, we can iteratively work with the files by decompressing and processing lines one at a time. Iterating through files is often the only option if we work with large datasets that do not fit in memory.

To iterate through a bz2-compressed file we could use the bz2 library. In testing though, I found that a faster option (by a factor of 2) is to call the system utility bzcat with the subprocess Python module. This illustrates a critical point: often there are multiple solutions to a problem, and the only way to find what is most efficient is to benchmark the options. This can be as simple as using the %%timeit Jupyter cell magic to time the methods. For the complete details, see the notebook, but the basic format of iteratively decompressing a file is:

```python
import os
import subprocess

data_path = os.path.expanduser(
    '~/.keras/datasets/enwiki-20180901-pages-articles15.xml-p7744803p9244803.bz2')

# Iterate through the compressed file one line at a time
for line in subprocess.Popen(['bzcat'],
                             stdin=open(data_path),
                             stdout=subprocess.PIPE).stdout:
    # process line
    ...
```
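For comparison, the pure-Python route with the built-in bz2 module looks like the loop below. This is a minimal sketch of the slower alternative mentioned above, not the exact benchmarking cell from the notebook; either version can be timed with the %%timeit cell magic.

```python
import bz2

# Slower alternative: let Python's bz2 module do the decompression.
# In the author's tests this was roughly 2x slower than piping through bzcat.
line_count = 0
with bz2.open(data_path, 'rt', encoding='utf-8') as f:
    for line in f:
        # process line
        line_count += 1

print(f'Read {line_count} lines')
```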
If we simply read in the XML data and append it to a list, we get something that looks like this:

Raw XML from Wikipedia Article.

This shows the XML from a single Wikipedia article. The files we have downloaded contain millions of lines like this, with thousands of articles in each file. If we really wanted to make things difficult, we could go through this using regular expressions and string matching to find each article. Given this is extraordinarily inefficient, we'll take a better approach using tools custom built for parsing both XML and Wikipedia-style articles.

Parsing Approach

We need to parse the files on two levels:

1. Extract the article titles and text from the XML
2. Extract relevant information from the article text

Fortunately, there are good options for both of these operations in Python.

Parsing XML

To solve the first problem of locating articles, we'll use the SAX parser, which is "The Simple API for XML." BeautifulSoup can also be used for parsing XML, but this requires loading the entire file into memory and building a Document Object Model (DOM). SAX, on the other hand, processes XML one line at a time, which fits our approach perfectly.

The basic idea we need to execute is to search through the XML and extract the information between specific tags. (If you need an introduction to XML, I'd recommend starting here.) For example, given the XML below:

```
<title>Carroll F. Knicely</title>
<text>'''Carroll F. Knicely''' (born c. 1929 in [[Staunton, Virginia]] - died November 2, 2006 in [[Glasgow, Kentucky]]) was [[Editing|editor]] and [[Publishing|publisher]] of the ''[[Glasgow Daily Times]]'' for nearly 20 years (and later, its owner) and served under three [[Governor of Kentucky|Kentucky Governors]] as commissioner and later Commerce Secretary.</text>
```

We want to select the content between the <title> and <text> tags. (The title is simply the Wikipedia page title and the text is the content of the article.) SAX will let us do exactly this using a parser and a ContentHandler, which controls how the information passed to the parser is handled. We pass the XML to the parser one line at a time and the ContentHandler lets us extract the relevant information.

This is a little difficult to follow without trying it out yourself, but the idea is that the ContentHandler looks for certain start tags, and when it finds one, it adds characters to a buffer until it encounters the same end tag. Then it saves the buffer contents to a dictionary with the tag as the key. The result is that we get a dictionary where the keys are the tags and the values are the content between the tags. We can then send this dictionary to another function that will parse the values in the dictionary.

The only part of SAX we need to write is the ContentHandler. This is shown in its entirety below:
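(The handler was embedded as a gist in the original post and does not survive in this copy. The following is a minimal sketch reconstructed from the description above: the class name WikiXmlHandler and the attributes _pages, _values, _buffer, and _current_tag come from the surrounding text, but the details may differ from the author's notebook.)

```python
import xml.sax

class WikiXmlHandler(xml.sax.handler.ContentHandler):
    """Content handler that pulls <title> and <text> out of Wikipedia XML."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self._buffer = None          # characters collected for the current tag
        self._values = {}            # tag -> content for the current page
        self._current_tag = None     # tag we are currently inside
        self._pages = []             # list of (title, text) tuples

    def characters(self, content):
        """Buffer characters while inside a tag we care about."""
        if self._current_tag:
            self._buffer.append(content)

    def startElement(self, name, attrs):
        """Start collecting characters when a title or text tag opens."""
        if name in ('title', 'text'):
            self._current_tag = name
            self._buffer = []

    def endElement(self, name):
        """Save the buffered content; a closing page tag ends an article."""
        if name == self._current_tag:
            self._values[name] = ''.join(self._buffer)
            self._current_tag = None
        if name == 'page':
            self._pages.append((self._values.get('title'), self._values.get('text')))
            self._values = {}
```

Each closing </page> marks the end of one article, at which point the collected title and text pair is appended to self._pages.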
In this code, we are looking for the tags title and text. Every time the parser encounters one of these, it will save characters to the buffer until it encounters the same end tag (identified by the closing </tag>). At that point it will save the buffer contents to a dictionary — self._values. Articles are separated by <page> tags, so if the content handler encounters an ending </page> tag, then it should add the self._values to the list of articles, self._pages.

If this is a little confusing, then perhaps seeing it in action will help. The code below shows how we use this to search through the XML file to find articles. For now we're just saving them to the handler._pages attribute, but later we'll send the articles to another function for parsing.

```python
# Object for handling xml
handler = WikiXmlHandler()

# Parsing object
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

# Iteratively process file
for line in subprocess.Popen(['bzcat'],
                             stdin=open(data_path),
                             stdout=subprocess.PIPE).stdout:
    parser.feed(line)

    # Stop when 3 articles have been found
    if len(handler._pages) > 2:
        break
```

If we inspect handler._pages, we'll see a list, each element of which is a tuple with the title and text of one article:

```python
handler._pages[0]

[('Carroll Knicely',
  "'''Carroll F. Knicely''' (born c. 1929 in [[Staunton, Virginia]] - died November 2, 2006 in [[Glasgow, Kentucky]]) was [[Editing|editor]] and [[Publishing|publisher]] ...")]
```

At this point we have written code that can successfully identify articles within the XML. This gets us halfway through the process of parsing the files; the next step is to process the articles themselves to find specific pages and information. Once again, we'll turn to a tool purpose-built for the task.

Parsing Wikipedia Articles

Wikipedia runs on MediaWiki, a software platform for building wikis. This means that articles follow a standard format that makes programmatically accessing the information within them simple. While the text of an article may look like just a string, it encodes far more information due to the formatting. To efficiently get at this information, we bring in the powerful mwparserfromhell, a library built to work with MediaWiki content.

If we pass the text of a Wikipedia article to mwparserfromhell, we get a Wikicode object which comes with many methods for sorting through the data. For example, the following code creates a Wikicode object from an article (about KENZ FM) and retrieves the wikilinks within the article. These are all of the links that point to other Wikipedia articles:

```python
import mwparserfromhell

# Create the wiki article
wiki = mwparserfromhell.parse(handler._pages[6][1])

# Find the wikilinks
wikilinks = [x.title for x in wiki.filter_wikilinks()]
wikilinks[:5]

['Provo, Utah', 'Wasatch Front', 'Megahertz', 'Contemporary hit radio', 'watt']
```

There are a number of useful methods that can be applied to the Wikicode, such as finding comments or searching for a specific keyword. If you want to get a clean version of the article text, then call:

```python
wiki.strip_code().strip()

'KENZ (94.9 FM, "Power 94.9") is a top 40/CHR radio station broadcasting to Salt Lake City, Utah'
```

Since my ultimate goal was to find all the articles about books, the question arises: is there a way to use this parser to identify articles in a certain category? Fortunately, the answer is yes, using MediaWiki templates.

Article Templates

Templates are standard ways of recording information. There are numerous templates for everything on Wikipedia, but the most relevant for our purposes are Infoboxes. These are templates that encode summary information for an article. For instance, the infobox for War and Peace is:

[Infobox for War and Peace — image not reproduced.]

Each category of articles on Wikipedia, such as films, books, or radio stations, has its own type of infobox. In the case of books, the infobox template is helpfully named Infobox book. Just as helpful, the wiki object has a method called filter_templates() that allows us to extract a specific template from an article. Therefore, if we want to know whether an article is about a book, we can filter it for the book infobox. This is shown below:

```python
# Filter article for book template
wiki.filter_templates('Infobox book')
```

If there's a match, then we've found a book! To find the Infobox template for the category of articles you are interested in, refer to the list of infoboxes.

How do we combine mwparserfromhell for parsing articles with the SAX parser we wrote? Well, we modify the endElement method in the ContentHandler to send the dictionary of values containing the title and text of an article to a function that searches the article text for the specified template. If the function finds an article we want, it extracts the information from the article and then returns it to the handler. First, I'll show the updated endElement:

```python
def endElement(self, name):
    """Closing tag of element"""
    if name == self._current_tag:
        self._values[name] = ''.join(self._buffer)

    if name == 'page':
        self._article_count += 1

        # Send the page to the process article function
        book = process_article(**self._values, template='Infobox book')

        # If article is a book append to the list of books
        if book:
            self._books.append(book)
```

Now, once the parser has hit the end of an article, we send the article on to the function process_article, which is shown below:
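(The process_article gist is not preserved in this copy of the article. The sketch below follows the behavior described in the text: it returns the infobox properties, internal wikilinks, external links, and last-edit timestamp for articles that contain the requested template, and None otherwise. The timestamp parameter assumes the handler also captured the <timestamp> tag, since the final output includes the edit date; the exact field handling in the author's notebook may differ.)

```python
import mwparserfromhell

def process_article(title, text, timestamp=None, template='Infobox book'):
    """Process one Wikipedia article, returning its data if it matches the template."""
    # Parse the raw MediaWiki markup
    wikicode = mwparserfromhell.parse(text)

    # Search for the requested infobox template
    matches = wikicode.filter_templates(matches=template)
    if not matches:
        return None

    # Infobox parameters as a plain dictionary
    properties = {param.name.strip_code().strip(): param.value.strip_code().strip()
                  for param in matches[0].params}

    # Internal wikilinks and external links in the article
    wikilinks = [x.title.strip_code().strip() for x in wikicode.filter_wikilinks()]
    exlinks = [x.url.strip_code().strip() for x in wikicode.filter_external_links()]

    return (title, properties, wikilinks, exlinks, timestamp)
```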
Although I'm looking for books, this function can be used to search for any category of article on Wikipedia. Just replace the template with the template for the category (such as Infobox language to find languages) and it will only return the information from articles within that category.

We can test this function and the new ContentHandler on one file:

Searched through 427481 articles.
Found 1426 books in 1055 seconds.

Let's take a look at the output for one book:

```python
books[10]

['War and Peace',
 {'name': 'War and Peace',
  'author': 'Leo Tolstoy',
  'language': 'Russian, with some French',
  'country': 'Russia',
  'genre': 'Novel (Historical novel)',
  'publisher': 'The Russian Messenger (serial)',
  'title_orig': 'Война и миръ',
  'orig_lang_code': 'ru',
  'translator': 'The first translation of War and Peace into English was by American Nathan Haskell Dole, in 1899',
  'image': 'Tolstoy - War and Peace - first edition, 1869.jpg',
  'caption': 'Front page of War and Peace, first edition, 1869 (Russian)',
  'release_date': 'Serialised 1865–1867; book 1869',
  'media_type': 'Print',
  'pages': '1,225 (first published edition)'},
 ['Leo Tolstoy',
  'Novel',
  'Historical novel',
  'The Russian Messenger',
  'Serial (publishing)',
  'Category:1869 Russian novels',
  'Category:Epic novels',
  'Category:Novels set in 19th-century Russia',
  'Category:Russian novels adapted into films',
  'Category:Russian philosophical novels'],
 ['https://books.google.com/?id=c4HEAN-ti1MC',
  'https://www.britannica.com/art/English-literature',
  'https://books.google.com/books?id=xf7umXHGDPcC',
  'https://books.google.com/?id=E5fotqsglPEC',
  'https://books.google.com/?id=9sHebfZIXFAC'],
 '2018-08-29T02:37:35Z']
```

For every single book on Wikipedia, we have the information from the Infobox as a dictionary, the internal wikilinks, the external links, and the timestamp of the most recent edit. (I'm concentrating on these pieces of information to build a book recommendation system for my next project.) You can modify the process_article function and the WikiXmlHandler class to find whatever information and articles you need!

If you look at the time to process just one file, 1055 seconds, and multiply that by 55, you get over 15 hours of processing time for all the files! Granted, we could just run that overnight, but I'd rather not waste the extra time if I don't have to. This brings us to the final technique we'll cover in this project: parallelization using multiprocessing and multithreading.

Running Operations in Parallel

Instead of parsing through the files one at a time, we want to process several of them at once (which is why we downloaded the partitions). We can do this using parallelization, either through multithreading or multiprocessing.

Multithreading and Multiprocessing

Multithreading and multiprocessing are ways to carry out many tasks on a computer — or multiple computers — simultaneously. We have many files on disk, each of which needs to be parsed in the same way. A naive approach would be to parse one file at a time, but that is not taking full advantage of our resources. Instead, we use either multithreading or multiprocessing to parse many files at the same time, significantly speeding up the entire process.

Generally, multithreading works better (is faster) for input/output-bound tasks, such as reading in files or making requests. Multiprocessing works better (is faster) for CPU-bound tasks (source). For the process of parsing articles, I wasn't sure which method would be optimal, so again I benchmarked both of them with different parameters. Learning how to set up tests and seek out different ways to solve a problem will get you far in data science or any technical career. (The code for testing multithreading and multiprocessing appears at the end of the notebook.)

When I ran the tests, I found multiprocessing was almost 10 times faster, indicating this process is probably CPU-bound (limited).

Processing results (left) vs threading results (right).

Learning multithreading/multiprocessing is essential for making your data science workflows more efficient. I'd recommend this article to get started with the concepts. (We'll stick to the built-in multiprocessing library, but you can also use Dask for parallelization, as in this project.)

After running a number of tests, I found the fastest way to process the files was using 16 processes, one for each core of my computer. This means we can process 16 files at a time instead of 1! I'd encourage anyone to test out a few options for multiprocessing/multithreading and let me know the results! I'm still not sure I did things in the best way, and I'm always willing to learn.

Setting Up Parallelized Code

To run an operation in parallel, we need a service and a set of tasks. A service is just a function, and the tasks are in an iterable — such as a list — each of which we send to the function. For the purpose of parsing the XML files, each task is one file, and the function will take in the file, find all the books, and save them to disk. The pseudo-code for the function is below:

```python
def find_books(data_path, save=True):
    """Find and save all the book articles from a compressed Wikipedia XML file."""
    # Parse file for books
    ...
    if save:
        # Save all books to a file based on the data path name
        ...
```
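Filling in that pseudo-code, a minimal sketch of the whole pipeline for one partition might look like the following. It reuses the WikiXmlHandler and process_article sketches from earlier and assumes the handler has been extended with the _books list and _article_count counter used by the updated endElement; the output naming and json handling are illustrative, not taken from the author's notebook.

```python
import json
import os
import subprocess
import xml.sax

def find_books(data_path, save=True):
    """Find and save all the book articles from one compressed Wikipedia XML partition."""
    # Fresh handler and parser for this partition
    handler = WikiXmlHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)

    # Stream the partition through bzcat, one line at a time
    for line in subprocess.Popen(['bzcat'],
                                 stdin=open(data_path),
                                 stdout=subprocess.PIPE).stdout:
        parser.feed(line)

    if save:
        # Illustrative naming scheme: one json file of books per partition
        out_path = os.path.basename(data_path).replace('.bz2', '') + '.json'
        with open(out_path, 'w') as fout:
            json.dump(handler._books, fout)

    return handler._books
```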
The end result of running this function is a saved list of books from the file sent to the function. The files are saved as json, a machine-readable format for writing nested information such as lists of lists and dictionaries. The tasks that we want to send to this function are all the compressed files:

```python
# Directory where Keras saved the downloaded partitions
keras_home = '/home/ubuntu/.keras/datasets/'

# List of compressed files to process
partitions = [keras_home + file for file in os.listdir(keras_home) if 'xml-p' in file]
len(partitions), partitions[-1]

(55, '/home/ubuntu/.keras/datasets/enwiki-20180901-pages-articles17.xml-p11539268p13039268.bz2')
```

For each file, we want to send it to find_books to be parsed.

Searching through all of Wikipedia

The final code to search through every article on Wikipedia is below:

```python
from multiprocessing import Pool

# Create a pool of workers to execute processes
pool = Pool(processes=16)

# Map (service, tasks), applies function to each partition
results = pool.map(find_books, partitions)

pool.close()
pool.join()
```

We map each task to the service, the function that finds the books (map refers to applying a function to each item in an iterable). Running with 16 processes in parallel, we can search all of Wikipedia in under 3 hours! After running the code, the books from each file are saved on disk in separate json files.

Reading and Joining Files with Multithreading

For practice writing parallelized code, we'll read the separate files back in, in parallel, this time using threads. The multiprocessing.dummy library provides a wrapper around the threading module. This time the service is read_data and the tasks are the saved files on disk. The multithreaded code works in the exact same way, mapping tasks in an iterable to a function. Once we have the list of lists, we flatten it to a single list.

```python
print(f'Found {len(book_list)} books.')

Found 37861 books.
```

Wikipedia has nearly 38,000 articles on books according to our count. The size of the final json file with all the book information is only about 55 MB, meaning we searched through over 50 GB (uncompressed) of files to find 55 MB worth of books! Given that we are only keeping a limited subset of the book information, that makes sense.

We now have information on every single book on Wikipedia. You can use the same code to find articles for any category of your choosing, or modify the functions to search for different information. Using some fairly simple Python code, we are able to search through an incredible amount of information.

Size of Wikipedia if printed in volumes (Source).

Conclusions

In this article, we saw how to download and parse the entire English language version of Wikipedia. Having a ton of data is not useful unless we can make sense of it, and so we developed a set of methods for efficiently processing all of the articles for the information we need for our projects. Throughout this project, we covered a number of important topics:

- Finding and downloading data programmatically
- Parsing through data in an efficient manner
- Running operations in parallel to get the most from our hardware
- Setting up and running benchmarking tests to find efficient solutions

The skills developed in this project are well-suited to Wikipedia data but are also broadly applicable to any information from the web. I'd encourage you to apply these methods to your own projects or to try analyzing a different category of articles. There's plenty of information for everyone to do their own project! (I am working on making a book recommendation system with the Wikipedia articles using entity embeddings from neural networks.)

Wikipedia is an incredible source of human-curated information, and we now know how to use this monumental achievement by accessing and processing it programmatically. I look forward to writing about and doing more Wikipedia Data Science. In the meantime, the techniques presented here are broadly applicable, so get out there and find a problem to solve!

As always, I welcome feedback and constructive criticism. I can be reached on Twitter @koehrsen_will or on my personal website at willk.online.


