Using neural nets to recognize handwritten digits
By Michael Nielsen / Dec 2019

The human visual system is one of the wonders of the world. Consider the following sequence of handwritten digits:

Most people effortlessly recognize those digits as 504192. That ease is deceptive. In each hemisphere of our brain, humans have a primary visual cortex, also known as V1, containing 140 million neurons, with tens of billions of connections between them. And yet human vision involves not just V1, but an entire series of visual cortices - V2, V3, V4, and V5 - doing progressively more complex image processing. We carry in our heads a supercomputer, tuned by evolution over hundreds of millions of years, and superbly adapted to understand the visual world. Recognizing handwritten digits isn't easy. Rather, we humans are stupendously, astoundingly good at making sense of what our eyes show us. But nearly all that work is done unconsciously. And so we don't usually appreciate how tough a problem our visual systems solve.

The difficulty of visual pattern recognition becomes apparent if you attempt to write a computer program to recognize digits like those above. What seems easy when we do it ourselves suddenly becomes extremely difficult. Simple intuitions about how we recognize shapes - "a 9 has a loop at the top, and a vertical stroke in the bottom right" - turn out to be not so simple to express algorithmically.
When you try to make such rules precise, you quickly get lost in a morass of exceptions and caveats and special cases. It seems hopeless.

Neural networks approach the problem in a different way. The idea is to take a large number of handwritten digits, known as training examples, and then develop a system which can learn from those training examples. In other words, the neural network uses the examples to automatically infer rules for recognizing handwritten digits. Furthermore, by increasing the number of training examples, the network can learn more about handwriting, and so improve its accuracy. So while I've shown just 100 training digits above, perhaps we could build a better handwriting recognizer by using thousands or even millions or billions of training examples.

In this chapter we'll write a computer program implementing a neural network that learns to recognize handwritten digits. The program is just 74 lines long, and uses no special neural network libraries. But this short program can recognize digits with an accuracy over 96 percent, without human intervention. Furthermore, in later chapters we'll develop ideas which can improve accuracy to over 99 percent. In fact, the best commercial neural networks are now so good that they are used by banks to process cheques, and by post offices to recognize addresses.

We're focusing on handwriting recognition because it's an excellent prototype problem for learning about neural networks in general. As a prototype it hits a sweet spot: it's challenging - it's no small feat to recognize handwritten digits - but it's not so difficult as to require an extremely complicated solution, or tremendous computational power. Furthermore, it's a great way to develop more advanced techniques, such as deep learning. And so throughout the book we'll return repeatedly to the problem of handwriting recognition. Later in the book, we'll discuss how these ideas may be applied to other problems in computer vision, and also in speech, natural language processing, and other domains.

Of course, if the point of the chapter was only to write a computer program to recognize handwritten digits, then the chapter would be much shorter! But along the way we'll develop many key ideas about neural networks, including two important types of artificial neuron (the perceptron and the sigmoid neuron), and the standard learning algorithm for neural networks, known as stochastic gradient descent. Throughout, I focus on explaining why things are done the way they are, and on building your neural networks intuition. That requires a lengthier discussion than if I just presented the basic mechanics of what's going on, but it's worth it for the deeper understanding you'll attain. Amongst the payoffs, by the end of the chapter we'll be in position to understand what deep learning is, and why it matters.

Perceptrons

What is a neural network? To get started, I'll explain a type of artificial neuron called a perceptron. Perceptrons were developed in the 1950s and 1960s by the scientist Frank Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts. Today, it's more common to use other models of artificial neurons - in this book, and in much modern work on neural networks, the main neuron model used is one called the sigmoid neuron. We'll get to sigmoid neurons shortly. But to understand why sigmoid neurons are defined the way they are, it's worth taking the time to first understand perceptrons.

So how do perceptrons work? A perceptron takes several binary inputs, $x_1, x_2, \ldots$, and produces a single binary output:

In the example shown the perceptron has three inputs, $x_1, x_2, x_3$.
In general it could have more or fewer inputs. Rosenblatt proposed a simple rule to compute the output. He introduced weights, $w_1, w_2, \ldots$, real numbers expressing the importance of the respective inputs to the output. The neuron's output, $0$ or $1$, is determined by whether the weighted sum $\sum_j w_j x_j$ is less than or greater than some threshold value. Just like the weights, the threshold is a real number which is a parameter of the neuron. To put it in more precise algebraic terms:
\begin{eqnarray}
  \mbox{output} & = & \left\{ \begin{array}{ll}
      0 & \mbox{if } \sum_j w_j x_j \leq \mbox{threshold} \\
      1 & \mbox{if } \sum_j w_j x_j > \mbox{threshold}
      \end{array} \right.
\tag{1}\end{eqnarray}
That's all there is to how a perceptron works!

That's the basic mathematical model. A way you can think about the perceptron is that it's a device that makes decisions by weighing up evidence. Let me give an example. It's not a very realistic example, but it's easy to understand, and we'll soon get to more realistic examples. Suppose the weekend is coming up, and you've heard that there's going to be a cheese festival in your city. You like cheese, and are trying to decide whether or not to go to the festival. You might make your decision by weighing up three factors:

Is the weather good?
Does your boyfriend or girlfriend want to accompany you?
Is the festival near public transit? (You don't own a car).

We can represent these three factors by corresponding binary variables $x_1, x_2$, and $x_3$. For instance, we'd have $x_1 = 1$ if the weather is good, and $x_1 = 0$ if the weather is bad. Similarly, $x_2 = 1$ if your boyfriend or girlfriend wants to go, and $x_2 = 0$ if not. And similarly again for $x_3$ and public transit.

Now, suppose you absolutely adore cheese, so much so that you're happy to go to the festival even if your boyfriend or girlfriend is uninterested and the festival is hard to get to. But perhaps you really loathe bad weather, and there's no way you'd go to the festival if the weather is bad. You can use perceptrons to model this kind of decision-making. One way to do this is to choose a weight $w_1 = 6$ for the weather, and $w_2 = 2$ and $w_3 = 2$ for the other conditions. The larger value of $w_1$ indicates that the weather matters a lot to you, much more than whether your boyfriend or girlfriend joins you, or the nearness of public transit. Finally, suppose you choose a threshold of $5$ for the perceptron. With these choices, the perceptron implements the desired decision-making model, outputting $1$ whenever the weather is good, and $0$ whenever the weather is bad. It makes no difference to the output whether your boyfriend or girlfriend wants to go, or whether public transit is nearby.

By varying the weights and the threshold, we can get different models of decision-making. For example, suppose we instead chose a threshold of $3$. Then the perceptron would decide that you should go to the festival whenever the weather was good or when both the festival was near public transit and your boyfriend or girlfriend was willing to join you. In other words, it'd be a different model of decision-making. Dropping the threshold means you're more willing to go to the festival.

Obviously, the perceptron isn't a complete model of human decision-making! But what the example illustrates is how a perceptron can weigh up different kinds of evidence in order to make decisions. And it should seem plausible that a complex network of perceptrons could make quite subtle decisions:

In this network, the first column of perceptrons - what we'll call the first layer of perceptrons - is making three very simple decisions, by weighing the input evidence. What about the perceptrons in the second layer? Each of those perceptrons is making a decision by weighing up the results from the first layer of decision-making.
In this way a perceptron in the second layer can make a decision at a more complex and more abstract level than perceptrons in the first layer. And even more complex decisions can be made by the perceptron in the third layer. In this way, a many-layer network of perceptrons can engage in sophisticated decision making.

Incidentally, when I defined perceptrons I said that a perceptron has just a single output. In the network above the perceptrons look like they have multiple outputs. In fact, they're still single output. The multiple output arrows are merely a useful way of indicating that the output from a perceptron is being used as the input to several other perceptrons. It's less unwieldy than drawing a single output line which then splits.

Let's simplify the way we describe perceptrons. The condition $\sum_j w_j x_j > \mbox{threshold}$ is cumbersome, and we can make two notational changes to simplify it. The first change is to write $\sum_j w_j x_j$ as a dot product, $w \cdot x \equiv \sum_j w_j x_j$, where $w$ and $x$ are vectors whose components are the weights and inputs, respectively. The second change is to move the threshold to the other side of the inequality, and to replace it by what's known as the perceptron's bias, $b \equiv -\mbox{threshold}$. Using the bias instead of the threshold, the perceptron rule can be rewritten:
\begin{eqnarray}
  \mbox{output} = \left\{
    \begin{array}{ll}
      0 & \mbox{if } w\cdot x + b \leq 0 \\
      1 & \mbox{if } w\cdot x + b > 0
    \end{array}
  \right.
\tag{2}\end{eqnarray}
You can think of the bias as a measure of how easy it is to get the perceptron to output a $1$. Or to put it in more biological terms, the bias is a measure of how easy it is to get the perceptron to fire. For a perceptron with a really big bias, it's extremely easy for the perceptron to output a $1$. But if the bias is very negative, then it's difficult for the perceptron to output a $1$. Obviously, introducing the bias is only a small change in how we describe perceptrons, but we'll see later that it leads to further notational simplifications. Because of this, in the remainder of the book we won't use the threshold, we'll always use the bias.

I've described perceptrons as a method for weighing evidence to make decisions. Another way perceptrons can be used is to compute the elementary logical functions we usually think of as underlying computation, functions such as AND, OR, and NAND. For example, suppose we have a perceptron with two inputs, each with weight $-2$, and an overall bias of $3$. Here's our perceptron:

Then we see that input $00$ produces output $1$, since $(-2)*0+(-2)*0+3 = 3$ is positive. Here, I've introduced the $*$ symbol to make the multiplications explicit. Similar calculations show that the inputs $01$ and $10$ produce output $1$. But the input $11$ produces output $0$, since $(-2)*1+(-2)*1+3 = -1$ is negative. And so our perceptron implements a NAND gate!
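To make that arithmetic concrete, here's a tiny sketch of my own (not part of the book's code) which evaluates the rule in Equation (2) with the weights $-2, -2$ and bias $3$ from the example above:

    # A minimal perceptron in the bias form of Equation (2).  The weights
    # (-2, -2) and bias 3 are the NAND example from the text.
    def perceptron(x, w, b):
        weighted_sum = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
        return 1 if weighted_sum > 0 else 0

    for x1 in (0, 1):
        for x2 in (0, 1):
            print((x1, x2), perceptron((x1, x2), w=(-2, -2), b=3))
    # Inputs 00, 01 and 10 give output 1, and input 11 gives output 0: a NAND gate.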
The NAND example shows that we can use perceptrons to compute simple logical functions. In fact, we can use networks of perceptrons to compute any logical function at all. The reason is that the NAND gate is universal for computation, that is, we can build any computation up out of NAND gates. For example, we can use NAND gates to build a circuit which adds two bits, $x_1$ and $x_2$. This requires computing the bitwise sum, $x_1 \oplus x_2$, as well as a carry bit which is set to $1$ when both $x_1$ and $x_2$ are $1$, i.e., the carry bit is just the bitwise product $x_1 x_2$:

To get an equivalent network of perceptrons we replace all the NAND gates by perceptrons with two inputs, each with weight $-2$, and an overall bias of $3$. Here's the resulting network. Note that I've moved the perceptron corresponding to the bottom right NAND gate a little, just to make it easier to draw the arrows on the diagram:

One notable aspect of this network of perceptrons is that the output from the leftmost perceptron is used twice as input to the bottommost perceptron. When I defined the perceptron model I didn't say whether this kind of double-output-to-the-same-place was allowed. Actually, it doesn't much matter. If we don't want to allow this kind of thing, then it's possible to simply merge the two lines, into a single connection with a weight of -4 instead of two connections with -2 weights. (If you don't find this obvious, you should stop and prove to yourself that this is equivalent.) With that change, the network looks as follows, with all unmarked weights equal to -2, all biases equal to 3, and a single weight of -4, as marked:

Up to now I've been drawing inputs like $x_1$ and $x_2$ as variables floating to the left of the network of perceptrons. In fact, it's conventional to draw an extra layer of perceptrons - the input layer - to encode the inputs:

This notation for input perceptrons, in which we have an output, but no inputs, is a shorthand. It doesn't actually mean a perceptron with no inputs. To see this, suppose we did have a perceptron with no inputs. Then the weighted sum $\sum_j w_j x_j$ would always be zero, and so the perceptron would output $1$ if $b > 0$, and $0$ if $b \leq 0$. That is, the perceptron would simply output a fixed value, not the desired value ($x_1$, in the example above). It's better to think of the input perceptrons as not really being perceptrons at all, but rather special units which are simply defined to output the desired values, $x_1, x_2, \ldots$.

The adder example demonstrates how a network of perceptrons can be used to simulate a circuit containing many NAND gates. And because NAND gates are universal for computation, it follows that perceptrons are also universal for computation.

The computational universality of perceptrons is simultaneously reassuring and disappointing. It's reassuring because it tells us that networks of perceptrons can be as powerful as any other computing device. But it's also disappointing, because it makes it seem as though perceptrons are merely a new type of NAND gate.
That's hardly big news!

However, the situation is better than this view suggests. It turns out that we can devise learning algorithms which can automatically tune the weights and biases of a network of artificial neurons. This tuning happens in response to external stimuli, without direct intervention by a programmer. These learning algorithms enable us to use artificial neurons in a way which is radically different to conventional logic gates. Instead of explicitly laying out a circuit of NAND and other gates, our neural networks can simply learn to solve problems, sometimes problems where it would be extremely difficult to directly design a conventional circuit.

Sigmoid neurons

Learning algorithms sound terrific. But how can we devise such algorithms for a neural network? Suppose we have a network of perceptrons that we'd like to use to learn to solve some problem. For example, the inputs to the network might be the raw pixel data from a scanned, handwritten image of a digit. And we'd like the network to learn weights and biases so that the output from the network correctly classifies the digit. To see how learning might work, suppose we make a small change in some weight (or bias) in the network. What we'd like is for this small change in weight to cause only a small corresponding change in the output from the network. As we'll see in a moment, this property will make learning possible. Schematically, here's what we want (obviously this network is too simple to do handwriting recognition!):

If it were true that a small change in a weight (or bias) causes only a small change in output, then we could use this fact to modify the weights and biases to get our network to behave more in the manner we want. For example, suppose the network was mistakenly classifying an image as an "8" when it should be a "9". We could figure out how to make a small change in the weights and biases so the network gets a little closer to classifying the image as a "9". And then we'd repeat this, changing the weights and biases over and over to produce better and better output. The network would be learning.

The problem is that this isn't what happens when our network contains perceptrons. In fact, a small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to completely flip, say from $0$ to $1$. That flip may then cause the behaviour of the rest of the network to completely change in some very complicated way. So while your "9" might now be classified correctly, the behaviour of the network on all the other images is likely to have completely changed in some hard-to-control way. That makes it difficult to see how to gradually modify the weights and biases so that the network gets closer to the desired behaviour. Perhaps there's some clever way of getting around this problem. But it's not immediately obvious how we can get a network of perceptrons to learn.

We can overcome this problem by introducing a new type of artificial neuron called a sigmoid neuron. Sigmoid neurons are similar to perceptrons, but modified so that small changes in their weights and bias cause only a small change in their output. That's the crucial fact which will allow a network of sigmoid neurons to learn.

Okay, let me describe the sigmoid neuron. We'll depict sigmoid neurons in the same way we depicted perceptrons:

Just like a perceptron, the sigmoid neuron has inputs, $x_1, x_2, \ldots$. But instead of being just $0$ or $1$, these inputs can also take on any values between $0$ and $1$. So, for instance, $0.638\ldots$ is a valid input for a sigmoid neuron. Also just like a perceptron, the sigmoid neuron has weights for each input, $w_1, w_2, \ldots$, and an overall bias, $b$. But the output is not $0$ or $1$.
Instead, it's $\sigma(w \cdot x + b)$, where $\sigma$ is called the sigmoid function* [*Incidentally, $\sigma$ is sometimes called the logistic function, and this new class of neurons called logistic neurons. It's useful to remember this terminology, since these terms are used by many people working with neural nets. However, we'll stick with the sigmoid terminology.], and is defined by:
\begin{eqnarray}
  \sigma(z) \equiv \frac{1}{1+e^{-z}}.
\tag{3}\end{eqnarray}
To put it all a little more explicitly, the output of a sigmoid neuron with inputs $x_1, x_2, \ldots$, weights $w_1, w_2, \ldots$, and bias $b$ is
\begin{eqnarray}
  \frac{1}{1+\exp(-\sum_j w_j x_j-b)}.
\tag{4}\end{eqnarray}

At first sight, sigmoid neurons appear very different to perceptrons. The algebraic form of the sigmoid function may seem opaque and forbidding if you're not already familiar with it. In fact, there are many similarities between perceptrons and sigmoid neurons, and the algebraic form of the sigmoid function turns out to be more of a technical detail than a true barrier to understanding.

To understand the similarity to the perceptron model, suppose $z \equiv w \cdot x + b$ is a large positive number. Then $e^{-z} \approx 0$ and so $\sigma(z) \approx 1$. In other words, when $z = w \cdot x + b$ is large and positive, the output from the sigmoid neuron is approximately $1$, just as it would have been for a perceptron. Suppose on the other hand that $z = w \cdot x + b$ is very negative. Then $e^{-z} \rightarrow \infty$, and $\sigma(z) \approx 0$. So when $z = w \cdot x + b$ is very negative, the behaviour of a sigmoid neuron also closely approximates a perceptron. It's only when $w \cdot x + b$ is of modest size that there's much deviation from the perceptron model.

What about the algebraic form of $\sigma$? How can we understand that? In fact, the exact form of $\sigma$ isn't so important - what really matters is the shape of the function when plotted. Here's the shape:

This shape is a smoothed out version of a step function:

If $\sigma$ had in fact been a step function, then the sigmoid neuron would be a perceptron, since the output would be $1$ or $0$ depending on whether $w \cdot x + b$ was positive or negative* [*Actually, when $w \cdot x + b = 0$ the perceptron outputs $0$, while the step function outputs $1$. So, strictly speaking, we'd need to modify the step function at that one point.
But you get the idea.]. By using the actual $\sigma$ function we get, as already implied above, a smoothed out perceptron. Indeed, it's the smoothness of the $\sigma$ function that is the crucial fact, not its detailed form. The smoothness of $\sigma$ means that small changes $\Delta w_j$ in the weights and $\Delta b$ in the bias will produce a small change $\Delta \mbox{output}$ in the output from the neuron. In fact, calculus tells us that $\Delta \mbox{output}$ is well approximated by
\begin{eqnarray}
  \Delta \mbox{output} \approx \sum_j \frac{\partial \, \mbox{output}}{\partial w_j} \Delta w_j + \frac{\partial \, \mbox{output}}{\partial b} \Delta b,
\tag{5}\end{eqnarray}
where the sum is over all the weights, $w_j$, and $\partial \, \mbox{output} / \partial w_j$ and $\partial \, \mbox{output} / \partial b$ denote partial derivatives of the $\mbox{output}$ with respect to $w_j$ and $b$, respectively. Don't panic if you're not comfortable with partial derivatives! While the expression above looks complicated, with all the partial derivatives, it's actually saying something very simple (and which is very good news): $\Delta \mbox{output}$ is a linear function of the changes $\Delta w_j$ and $\Delta b$ in the weights and bias. This linearity makes it easy to choose small changes in the weights and biases to achieve any desired small change in the output. So while sigmoid neurons have much of the same qualitative behaviour as perceptrons, they make it much easier to figure out how changing the weights and biases will change the output.

If it's the shape of $\sigma$ which really matters, and not its exact form, then why use the particular form used for $\sigma$ in Equation (3)? In fact, later in the book we will occasionally consider neurons where the output is $f(w \cdot x + b)$ for some other activation function $f(\cdot)$. The main thing that changes when we use a different activation function is that the particular values for the partial derivatives in Equation (5) change. It turns out that when we compute those partial derivatives later, using $\sigma$ will simplify the algebra, simply because exponentials have lovely properties when differentiated. In any case, $\sigma$ is commonly-used in work on neural nets, and is the activation function we'll use most often in this book.

How should we interpret the output from a sigmoid neuron? Obviously, one big difference between perceptrons and sigmoid neurons is that sigmoid neurons don't just output $0$ or $1$. They can have as output any real number between $0$ and $1$, so values such as $0.173\ldots$ and $0.689\ldots$ are legitimate outputs. This can be useful, for example, if we want to use the output value to represent the average intensity of the pixels in an image input to a neural network. But sometimes it can be a nuisance. Suppose we want the output from the network to indicate either "the input image is a 9" or "the input image is not a 9". Obviously, it'd be easiest to do this if the output was a $0$ or a $1$, as in a perceptron. But in practice we can set up a convention to deal with this, for example, by deciding to interpret any output of at least $0.5$ as indicating a "9", and any output less than $0.5$ as indicating "not a 9". I'll always explicitly state when we're using such a convention, so it shouldn't cause any confusion.
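To see concretely how $\sigma$ behaves like a smoothed-out step function, here's a small sketch of my own (not from the book) evaluating Equation (3) at a few values of $z = w \cdot x + b$:

    import numpy as np

    def sigmoid(z):
        # Equation (3): sigma(z) = 1 / (1 + e^(-z))
        return 1.0 / (1.0 + np.exp(-z))

    for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
        print(z, sigmoid(z))
    # Very negative z gives outputs near 0, very positive z gives outputs near 1,
    # and z = 0 gives exactly 0.5 -- the threshold used in the "is it a 9?"
    # convention above.  In between, the output varies smoothly.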
Exercises

Sigmoid neurons simulating perceptrons, part I
Suppose we take all the weights and biases in a network of perceptrons, and multiply them by a positive constant, $c > 0$. Show that the behaviour of the network doesn't change.

Sigmoid neurons simulating perceptrons, part II
Suppose we have the same setup as the last problem - a network of perceptrons. Suppose also that the overall input to the network of perceptrons has been chosen. We won't need the actual input value, we just need the input to have been fixed. Suppose the weights and biases are such that $w \cdot x + b \neq 0$ for the input $x$ to any particular perceptron in the network. Now replace all the perceptrons in the network by sigmoid neurons, and multiply the weights and biases by a positive constant $c > 0$. Show that in the limit as $c \rightarrow \infty$ the behaviour of this network of sigmoid neurons is exactly the same as the network of perceptrons. How can this fail when $w \cdot x + b = 0$ for one of the perceptrons?

The architecture of neural networks

In the next section I'll introduce a neural network that can do a pretty good job classifying handwritten digits. In preparation for that, it helps to explain some terminology that lets us name different parts of a network. Suppose we have the network:

As mentioned earlier, the leftmost layer in this network is called the input layer, and the neurons within the layer are called input neurons. The rightmost or output layer contains the output neurons, or, as in this case, a single output neuron. The middle layer is called a hidden layer, since the neurons in this layer are neither inputs nor outputs. The term "hidden" perhaps sounds a little mysterious - the first time I heard the term I thought it must have some deep philosophical or mathematical significance - but it really means nothing more than "not an input or an output". The network above has just a single hidden layer, but some networks have multiple hidden layers. For example, the following four-layer network has two hidden layers:

Somewhat confusingly, and for historical reasons, such multiple layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons. I'm not going to use the MLP terminology in this book, since I think it's confusing, but wanted to warn you of its existence.

The design of the input and output layers in a network is often straightforward. For example, suppose we're trying to determine whether a handwritten image depicts a "9" or not. A natural way to design the network is to encode the intensities of the image pixels into the input neurons. If the image is a $64$ by $64$ greyscale image, then we'd have $4,096 = 64 \times 64$ input neurons, with the intensities scaled appropriately between $0$ and $1$. The output layer will contain just a single neuron, with output values of less than $0.5$ indicating "input image is not a 9", and values greater than $0.5$ indicating "input image is a 9".

While the design of the input and output layers of a neural network is often straightforward, there can be quite an art to the design of the hidden layers. In particular, it's not possible to sum up the design process for the hidden layers with a few simple rules of thumb. Instead, neural networks researchers have developed many design heuristics for the hidden layers, which help people get the behaviour they want out of their nets. For example, such heuristics can be used to help determine how to trade off the number of hidden layers against the time required to train the network. We'll meet several such design heuristics later in this book.

Up to now, we've been discussing neural networks where the output from one layer is used as input to the next layer. Such networks are called feedforward neural networks. This means there are no loops in the network - information is always fed forward, never fed back. If we did have loops, we'd end up with situations where the input to the $\sigma$ function depended on the output. That'd be hard to make sense of, and so we don't allow such loops.

However, there are other models of artificial neural networks in which feedback loops are possible. These models are called recurrent neural networks. The idea in these models is to have neurons which fire for some limited duration of time, before becoming quiescent.
That firing can stimulate other neurons, which may fire a little while later, also for a limited duration. That causes still more neurons to fire, and so over time we get a cascade of neurons firing. Loops don't cause problems in such a model, since a neuron's output only affects its input at some later time, not instantaneously.

Recurrent neural nets have been less influential than feedforward networks, in part because the learning algorithms for recurrent nets are (at least to date) less powerful. But recurrent networks are still extremely interesting. They're much closer in spirit to how our brains work than feedforward networks. And it's possible that recurrent networks can solve important problems which can only be solved with great difficulty by feedforward networks. However, to limit our scope, in this book we're going to concentrate on the more widely-used feedforward networks.

A simple network to classify handwritten digits

Having defined neural networks, let's return to handwriting recognition. We can split the problem of recognizing handwritten digits into two sub-problems. First, we'd like a way of breaking an image containing many digits into a sequence of separate images, each containing a single digit. For example, we'd like to break the image into six separate images. We humans solve this segmentation problem with ease, but it's challenging for a computer program to correctly break up the image. Once the image has been segmented, the program then needs to classify each individual digit. So, for instance, we'd like our program to recognize that the first digit above is a 5.

We'll focus on writing a program to solve the second problem, that is, classifying individual digits. We do this because it turns out that the segmentation problem is not so difficult to solve, once you have a good way of classifying individual digits. There are many approaches to solving the segmentation problem. One approach is to trial many different ways of segmenting the image, using the individual digit classifier to score each trial segmentation. A trial segmentation gets a high score if the individual digit classifier is confident of its classification in all segments, and a low score if the classifier is having a lot of trouble in one or more segments. The idea is that if the classifier is having trouble somewhere, then it's probably having trouble because the segmentation has been chosen incorrectly.
This idea and other variations can be used to solve the segmentation problem quite well. So instead of worrying about segmentation we'll concentrate on developing a neural network which can solve the more interesting and difficult problem, namely, recognizing individual handwritten digits.

To recognize individual digits we will use a three-layer neural network:

The input layer of the network contains neurons encoding the values of the input pixels. As discussed in the next section, our training data for the network will consist of many $28$ by $28$ pixel images of scanned handwritten digits, and so the input layer contains $784 = 28 \times 28$ neurons. For simplicity I've omitted most of the $784$ input neurons in the diagram above. The input pixels are greyscale, with a value of $0.0$ representing white, a value of $1.0$ representing black, and in between values representing gradually darkening shades of grey.

The second layer of the network is a hidden layer. We denote the number of neurons in this hidden layer by $n$, and we'll experiment with different values for $n$. The example shown illustrates a small hidden layer, containing just $n = 15$ neurons.

The output layer of the network contains 10 neurons. If the first neuron fires, i.e., has an output $\approx 1$, then that will indicate that the network thinks the digit is a $0$. If the second neuron fires then that will indicate that the network thinks the digit is a $1$. And so on. A little more precisely, we number the output neurons from $0$ through $9$, and figure out which neuron has the highest activation value. If that neuron is, say, neuron number $6$, then our network will guess that the input digit was a $6$. And so on for the other output neurons.

You might wonder why we use $10$ output neurons. After all, the goal of the network is to tell us which digit ($0, 1, 2, \ldots, 9$) corresponds to the input image. A seemingly natural way of doing that is to use just $4$ output neurons, treating each neuron as taking on a binary value, depending on whether the neuron's output is closer to $0$ or to $1$. Four neurons are enough to encode the answer, since $2^4 = 16$ is more than the 10 possible values for the input digit. Why should our network use $10$ neurons instead? Isn't that inefficient? The ultimate justification is empirical: we can try out both network designs, and it turns out that, for this particular problem, the network with $10$ output neurons learns to recognize digits better than the network with $4$ output neurons. But that leaves us wondering why using $10$ output neurons works better.
Is there some heuristic that would tell us in advance that we should use the $10$-output encoding instead of the $4$-output encoding?

To understand why we do this, it helps to think about what the neural network is doing from first principles. Consider first the case where we use $10$ output neurons. Let's concentrate on the first output neuron, the one that's trying to decide whether or not the digit is a $0$. It does this by weighing up evidence from the hidden layer of neurons. What are those hidden neurons doing? Well, just suppose for the sake of argument that the first neuron in the hidden layer detects whether or not an image like the following is present:

It can do this by heavily weighting input pixels which overlap with the image, and only lightly weighting the other inputs. In a similar way, let's suppose for the sake of argument that the second, third, and fourth neurons in the hidden layer detect whether or not the following images are present:

As you may have guessed, these four images together make up the $0$ image that we saw in the line of digits shown earlier:

So if all four of these hidden neurons are firing then we can conclude that the digit is a $0$. Of course, that's not the only sort of evidence we can use to conclude that the image was a $0$ - we could legitimately get a $0$ in many other ways (say, through translations of the above images, or slight distortions). But it seems safe to say that at least in this case we'd conclude that the input was a $0$.

Supposing the neural network functions in this way, we can give a plausible explanation for why it's better to have $10$ outputs from the network, rather than $4$. If we had $4$ outputs, then the first output neuron would be trying to decide what the most significant bit of the digit was. And there's no easy way to relate that most significant bit to simple shapes like those shown above. It's hard to imagine that there's any good historical reason the component shapes of the digit will be closely related to (say) the most significant bit in the output.

Now, with all that said, this is all just a heuristic. Nothing says that the three-layer neural network has to operate in the way I described, with the hidden neurons detecting simple component shapes. Maybe a clever learning algorithm will find some assignment of weights that lets us use only $4$ output neurons. But as a heuristic the way of thinking I've described works pretty well, and can save you a lot of time in designing good neural network architectures.

Exercise

There is a way of determining the bitwise representation of a digit by adding an extra layer to the three-layer network above. The extra layer converts the output from the previous layer into a binary representation, as illustrated in the figure below. Find a set of weights and biases for the new output layer. Assume that the first $3$ layers of neurons are such that the correct output in the third layer (i.e., the old output layer) has activation at least $0.99$, and incorrect outputs have activation less than $0.01$.

Learning with gradient descent

Now that we have a design for our neural network, how can it learn to recognize digits? The first thing we'll need is a data set to learn from - a so-called training data set. We'll use the MNIST data set, which contains tens of thousands of scanned images of handwritten digits, together with their correct classifications.
MNIST's name comes from the fact that it is a modified subset of two data sets collected by NIST, the United States' National Institute of Standards and Technology. Here's a few images from MNIST:

As you can see, these digits are, in fact, the same as those shown at the beginning of this chapter as a challenge to recognize. Of course, when testing our network we'll ask it to recognize images which aren't in the training set!

The MNIST data comes in two parts. The first part contains 60,000 images to be used as training data. These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees, and half of whom were high school students. The images are greyscale and 28 by 28 pixels in size. The second part of the MNIST data set is 10,000 images to be used as test data. Again, these are 28 by 28 greyscale images. We'll use the test data to evaluate how well our neural network has learned to recognize digits. To make this a good test of performance, the test data was taken from a different set of 250 people than the original training data (albeit still a group split between Census Bureau employees and high school students). This helps give us confidence that our system can recognize digits from people whose writing it didn't see during training.

We'll use the notation $x$ to denote a training input. It'll be convenient to regard each training input $x$ as a $28 \times 28 = 784$-dimensional vector. Each entry in the vector represents the grey value for a single pixel in the image. We'll denote the corresponding desired output by $y = y(x)$, where $y$ is a $10$-dimensional vector. For example, if a particular training image, $x$, depicts a $6$, then $y(x) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^T$ is the desired output from the network. Note that $T$ here is the transpose operation, turning a row vector into an ordinary (column) vector.

What we'd like is an algorithm which lets us find weights and biases so that the output from the network approximates $y(x)$ for all training inputs $x$. To quantify how well we're achieving this goal we define a cost function* [*Sometimes referred to as a loss or objective function. We use the term cost function throughout this book, but you should note the other terminology, since it's often used in research papers and other discussions of neural networks.]:
\begin{eqnarray}
  C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2.
\tag{6}\end{eqnarray}
Here, $w$ denotes the collection of all weights in the network, $b$ all the biases, $n$ is the total number of training inputs, $a$ is the vector of outputs from the network when $x$ is input, and the sum is over all training inputs, $x$. Of course, the output $a$ depends on $x$, $w$ and $b$, but to keep the notation simple I haven't explicitly indicated this dependence. The notation $\| v \|$ just denotes the usual length function for a vector $v$. We'll call $C$ the quadratic cost function; it's also sometimes known as the mean squared error or just MSE. Inspecting the form of the quadratic cost function, we see that $C(w,b)$ is non-negative, since every term in the sum is non-negative. Furthermore, the cost $C(w,b)$ becomes small, i.e., $C(w,b) \approx 0$, precisely when $y(x)$ is approximately equal to the output, $a$, for all training inputs, $x$. So our training algorithm has done a good job if it can find weights and biases so that $C(w,b) \approx 0$. By contrast, it's not doing so well when $C(w,b)$ is large - that would mean that $y(x)$ is not close to the output $a$ for a large number of inputs. So the aim of our training algorithm will be to minimize the cost $C(w,b)$ as a function of the weights and biases. In other words, we want to find a set of weights and biases which make the cost as small as possible. We'll do that using an algorithm known as gradient descent.
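Before moving on, here's a small sketch of my own (not from the book) that evaluates the quadratic cost of Equation (6) for a couple of made-up training examples, just to make the notation concrete:

    import numpy as np

    def quadratic_cost(outputs, desired):
        # Equation (6): C = (1/2n) * sum over training inputs of ||y(x) - a||^2
        n = len(outputs)
        return sum(np.linalg.norm(y - a)**2 for a, y in zip(outputs, desired)) / (2.0 * n)

    # Two hypothetical training inputs.  y is the desired 10-dimensional output:
    # a "6" is the unit vector with a 1 in the slot for digit 6.
    y1 = np.zeros((10, 1)); y1[6] = 1.0        # desired output for a "6"
    y2 = np.zeros((10, 1)); y2[0] = 1.0        # desired output for a "0"
    a1 = y1.copy()                             # network output: exactly right
    a2 = np.full((10, 1), 0.1)                 # network output: unsure about this one
    print(quadratic_cost([a1, a2], [y1, y2]))  # small, but not zero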
Why introduce the quadratic cost? After all, aren't we primarily interested in the number of images correctly classified by the network? Why not try to maximize that number directly, rather than minimizing a proxy measure like the quadratic cost? The problem with that is that the number of images correctly classified is not a smooth function of the weights and biases in the network. For the most part, making small changes to the weights and biases won't cause any change at all in the number of training images classified correctly. That makes it difficult to figure out how to change the weights and biases to get improved performance. If we instead use a smooth cost function like the quadratic cost it turns out to be easy to figure out how to make small changes in the weights and biases so as to get an improvement in the cost. That's why we focus first on minimizing the quadratic cost, and only after that will we examine the classification accuracy.

Even given that we want to use a smooth cost function, you may still wonder why we choose the quadratic function used in Equation (6). Isn't this a rather ad hoc choice? Perhaps if we chose a different cost function we'd get a totally different set of minimizing weights and biases? This is a valid concern, and later we'll revisit the cost function, and make some modifications. However, the quadratic cost function of Equation (6) works perfectly well for understanding the basics of learning in neural networks, so we'll stick with it for now.

Recapping, our goal in training a neural network is to find weights and biases which minimize the quadratic cost function $C(w, b)$. This is a well-posed problem, but it's got a lot of distracting structure as currently posed - the interpretation of $w$ and $b$ as weights and biases, the $\sigma$ function lurking in the background, the choice of network architecture, MNIST, and so on. It turns out that we can understand a tremendous amount by ignoring most of that structure, and just concentrating on the minimization aspect. So for now we're going to forget all about the specific form of the cost function, the connection to neural networks, and so on. Instead, we're going to imagine that we've simply been given a function of many variables and we want to minimize that function. We're going to develop a technique called gradient descent which can be used to solve such minimization problems. Then we'll come back to the specific function we want to minimize for neural networks.

Okay, let's suppose we're trying to minimize some function, $C(v)$.
This could be any real-valued function of many variables, $v = v_1, v_2, \ldots$. Note that I've replaced the $w$ and $b$ notation by $v$ to emphasize that this could be any function - we're not specifically thinking in the neural networks context any more. To minimize $C(v)$ it helps to imagine $C$ as a function of just two variables, which we'll call $v_1$ and $v_2$:

What we'd like is to find where $C$ achieves its global minimum. Now, of course, for the function plotted above, we can eyeball the graph and find the minimum. In that sense, I've perhaps shown slightly too simple a function! A general function, $C$, may be a complicated function of many variables, and it won't usually be possible to just eyeball the graph to find the minimum.

One way of attacking the problem is to use calculus to try to find the minimum analytically. We could compute derivatives and then try using them to find places where $C$ is an extremum. With some luck that might work when $C$ is a function of just one or a few variables. But it'll turn into a nightmare when we have many more variables. And for neural networks we'll often want far more variables - the biggest neural networks have cost functions which depend on billions of weights and biases in an extremely complicated way. Using calculus to minimize that just won't work!

(After asserting that we'll gain insight by imagining $C$ as a function of just two variables, I've turned around twice in two paragraphs and said, "hey, but what if it's a function of many more than two variables?" Sorry about that. Please believe me when I say that it really does help to imagine $C$ as a function of two variables. It just happens that sometimes that picture breaks down, and the last two paragraphs were dealing with such breakdowns. Good thinking about mathematics often involves juggling multiple intuitive pictures, learning when it's appropriate to use each picture, and when it's not.)

Okay, so calculus doesn't work. Fortunately, there is a beautiful analogy which suggests an algorithm which works pretty well. We start by thinking of our function as a kind of a valley. If you squint just a little at the plot above, that shouldn't be too hard. And we imagine a ball rolling down the slope of the valley. Our everyday experience tells us that the ball will eventually roll to the bottom of the valley. Perhaps we can use this idea as a way to find a minimum for the function? We'd randomly choose a starting point for an (imaginary) ball, and then simulate the motion of the ball as it rolled down to the bottom of the valley. We could do this simulation simply by computing derivatives (and perhaps some second derivatives) of $C$ - those derivatives would tell us everything we need to know about the local "shape" of the valley, and therefore how our ball should roll.

Based on what I've just written, you might suppose that we'll be trying to write down Newton's equations of motion for the ball, considering the effects of friction and gravity, and so on. Actually, we're not going to take the ball-rolling analogy quite that seriously - we're devising an algorithm to minimize $C$, not developing an accurate simulation of the laws of physics! The ball's-eye view is meant to stimulate our imagination, not constrain our thinking. So rather than get into all the messy details of physics, let's simply ask ourselves: if we were declared God for a day, and could make up our own laws of physics, dictating to the ball how it should roll, what law or laws of motion could we pick that would make it so the ball always rolled to the bottom of the valley?

To make this question more precise, let's think about what happens when we move the ball a small amount $\Delta v_1$ in the $v_1$ direction, and a small amount $\Delta v_2$ in the $v_2$ direction. Calculus tells us that $C$ changes as follows:
\begin{eqnarray}
  \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 +
  \frac{\partial C}{\partial v_2} \Delta v_2.
\tag{7}\end{eqnarray}
We're going to find a way of choosing $\Delta v_1$ and $\Delta v_2$ so as to make $\Delta C$ negative; i.e., we'll choose them so the ball is rolling down into the valley. To figure out how to make such a choice it helps to define $\Delta v$ to be the vector of changes in $v$, $\Delta v \equiv (\Delta v_1, \Delta v_2)^T$, where $T$ is again the transpose operation, turning row vectors into column vectors. We'll also define the gradient of $C$ to be the vector of partial derivatives, $\left(\frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2}\right)^T$. We denote the gradient vector by $\nabla C$, i.e.:
\begin{eqnarray}
  \nabla C \equiv \left( \frac{\partial C}{\partial v_1},
  \frac{\partial C}{\partial v_2}\right)^T.
\tag{8}\end{eqnarray}
In a moment we'll rewrite the change $\Delta C$ in terms of $\Delta v$ and the gradient, $\nabla C$. Before getting to that, though, I want to clarify something that sometimes gets people hung up on the gradient. When meeting the $\nabla C$ notation for the first time, people sometimes wonder how they should think about the $\nabla$ symbol. What, exactly, does $\nabla$ mean? In fact, it's perfectly fine to think of $\nabla C$ as a single mathematical object - the vector defined above - which happens to be written using two symbols. In this point of view, $\nabla$ is just a piece of notational flag-waving, telling you "hey, $\nabla C$ is a gradient vector". There are more advanced points of view where $\nabla$ can be viewed as an independent mathematical entity in its own right (for example, as a differential operator), but we won't need such points of view.

With these definitions, the expression (7) for $\Delta C$ can be rewritten as
\begin{eqnarray}
  \Delta C \approx \nabla C \cdot \Delta v.
\tag{9}\end{eqnarray}
This equation helps explain why $\nabla C$ is called the gradient vector: $\nabla C$ relates changes in $v$ to changes in $C$, just as we'd expect something called a gradient to do. But what's really exciting about the equation is that it lets us see how to choose $\Delta v$ so as to make $\Delta C$ negative. In particular, suppose we choose
\begin{eqnarray}
  \Delta v = -\eta \nabla C,
\tag{10}\end{eqnarray}
where $\eta$ is a small, positive parameter (known as the learning rate). Then Equation (9) tells us that $\Delta C \approx -\eta \nabla C \cdot \nabla C = -\eta \|\nabla C\|^2$. Because $\| \nabla C \|^2 \geq 0$, this guarantees that $\Delta C \leq 0$, i.e., $C$ will always decrease, never increase, if we change $v$ according to the prescription in (10). (Within, of course, the limits of the approximation in Equation (9)). This is exactly the property we wanted! And so we'll take Equation (10) to define the "law of motion" for the ball in our gradient descent algorithm. That is, we'll use Equation (10) to compute a value for $\Delta v$, then move the ball's position $v$ by that amount:
\begin{eqnarray}
  v \rightarrow v' = v - \eta \nabla C.
\tag{11}\end{eqnarray}
Then we'll use this update rule again, to make another move. If we keep doing this, over and over, we'll keep decreasing $C$ until - we hope - we reach a global minimum.

Summing up, the way the gradient descent algorithm works is to repeatedly compute the gradient $\nabla C$, and then to move in the opposite direction, "falling down" the slope of the valley.
We can visualize it like this:

Notice that with this rule gradient descent doesn't reproduce real physical motion. In real life a ball has momentum, and that momentum may allow it to roll across the slope, or even (momentarily) roll uphill. It's only after the effects of friction set in that the ball is guaranteed to roll down into the valley. By contrast, our rule for choosing $\Delta v$ just says "go down, right now". That's still a pretty good rule for finding the minimum!

To make gradient descent work correctly, we need to choose the learning rate $\eta$ to be small enough that Equation (9) is a good approximation. If we don't, we might end up with $\Delta C > 0$, which obviously would not be good! At the same time, we don't want $\eta$ to be too small, since that will make the changes $\Delta v$ tiny, and thus the gradient descent algorithm will work very slowly. In practical implementations, $\eta$ is often varied so that Equation (9) remains a good approximation, but the algorithm isn't too slow. We'll see later how this works.

I've explained gradient descent when $C$ is a function of just two variables. But, in fact, everything works just as well even when $C$ is a function of many more variables. Suppose in particular that $C$ is a function of $m$ variables, $v_1, \ldots, v_m$. Then the change $\Delta C$ in $C$ produced by a small change $\Delta v = (\Delta v_1, \ldots, \Delta v_m)^T$ is
\begin{eqnarray}
  \Delta C \approx \nabla C \cdot \Delta v,
\tag{12}\end{eqnarray}
where the gradient $\nabla C$ is the vector
\begin{eqnarray}
  \nabla C \equiv \left(\frac{\partial C}{\partial v_1}, \ldots,
  \frac{\partial C}{\partial v_m}\right)^T.
\tag{13}\end{eqnarray}
Just as for the two variable case, we can choose
\begin{eqnarray}
  \Delta v = -\eta \nabla C,
\tag{14}\end{eqnarray}
and we're guaranteed that our (approximate) expression (12) for $\Delta C$ will be negative. This gives us a way of following the gradient to a minimum, even when $C$ is a function of many variables, by repeatedly applying the update rule
\begin{eqnarray}
  v \rightarrow v' = v - \eta \nabla C.
\tag{15}\end{eqnarray}
You can think of this update rule as defining the gradient descent algorithm. It gives us a way of repeatedly changing the position $v$ in order to find a minimum of the function $C$. The rule doesn't always work - several things can go wrong and prevent gradient descent from finding the global minimum of $C$, a point we'll return to explore in later chapters. But, in practice gradient descent often works extremely well, and in neural networks we'll find that it's a powerful way of minimizing the cost function, and so helping the net learn.

Indeed, there's even a sense in which gradient descent is the optimal strategy for searching for a minimum. Let's suppose that we're trying to make a move $\Delta v$ in position so as to decrease $C$ as much as possible. This is equivalent to minimizing $\Delta C \approx \nabla C \cdot \Delta v$. We'll constrain the size of the move so that $\| \Delta v \| = \epsilon$ for some small fixed $\epsilon > 0$. In other words, we want a move that is a small step of a fixed size, and we're trying to find the movement direction which decreases $C$ as much as possible. It can be proved that the choice of $\Delta v$ which minimizes $\nabla C \cdot \Delta v$ is $\Delta v = -\eta \nabla C$, where $\eta = \epsilon / \|\nabla C\|$ is determined by the size constraint $\|\Delta v\| = \epsilon$. So gradient descent can be viewed as a way of taking small steps in the direction which does the most to immediately decrease $C$.
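None of this depends on neural networks. Here's a toy sketch of my own (not the book's code) applying the update rule $v \rightarrow v' = v - \eta \nabla C$ of Equation (15) to the simple function $C(v) = v_1^2 + v_2^2$, whose minimum is obviously at the origin:

    import numpy as np

    def C(v):
        return np.sum(v**2)           # a simple valley with its minimum at v = 0

    def grad_C(v):
        return 2 * v                  # gradient of v_1^2 + v_2^2

    eta = 0.1                         # learning rate
    v = np.array([2.0, -3.0])         # arbitrary starting point
    for step in range(30):
        v = v - eta * grad_C(v)       # the update rule v -> v' = v - eta * grad C
    print(v, C(v))                    # v is now very close to (0, 0)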
Exercises

Prove the assertion of the last paragraph. Hint: If you're not already familiar with the Cauchy-Schwarz inequality, you may find it helpful to familiarize yourself with it.

I explained gradient descent when $C$ is a function of two variables, and when it's a function of more than two variables. What happens when $C$ is a function of just one variable? Can you provide a geometric interpretation of what gradient descent is doing in the one-dimensional case?

People have investigated many variations of gradient descent, including variations that more closely mimic a real physical ball. These ball-mimicking variations have some advantages, but also have a major disadvantage: it turns out to be necessary to compute second partial derivatives of $C$, and this can be quite costly. To see why it's costly, suppose we want to compute all the second partial derivatives $\partial^2 C/ \partial v_j \partial v_k$. If there are a million such $v_j$ variables then we'd need to compute something like a trillion (i.e., a million squared) second partial derivatives* [*Actually, more like half a trillion, since $\partial^2 C/ \partial v_j \partial v_k = \partial^2 C/ \partial v_k \partial v_j$. Still, you get the point.]! That's going to be computationally costly. With that said, there are tricks for avoiding this kind of problem, and finding alternatives to gradient descent is an active area of investigation. But in this book we'll use gradient descent (and variations) as our main approach to learning in neural networks.

How can we apply gradient descent to learn in a neural network? The idea is to use gradient descent to find the weights $w_k$ and biases $b_l$ which minimize the cost in Equation (6). To see how this works, let's restate the gradient descent update rule, with the weights and biases replacing the variables $v_j$. In other words, our "position" now has components $w_k$ and $b_l$, and the gradient vector $\nabla C$ has corresponding components $\partial C / \partial w_k$ and $\partial C / \partial b_l$. Writing out the gradient descent update rule in terms of components, we have
\begin{eqnarray}
  w_k & \rightarrow & w_k' = w_k - \eta \frac{\partial C}{\partial w_k} \tag{16}\\
  b_l & \rightarrow & b_l' = b_l - \eta \frac{\partial C}{\partial b_l}.
\tag{17}\end{eqnarray}
By repeatedly applying this update rule we can "roll down the hill", and hopefully find a minimum of the cost function. In other words, this is a rule which can be used to learn in a neural network.

There are a number of challenges in applying the gradient descent rule. We'll look into those in depth in later chapters. But for now I just want to mention one problem. To understand what the problem is, let's look back at the quadratic cost in Equation (6). Notice that this cost function has the form $C = \frac{1}{n} \sum_x C_x$, that is, it's an average over costs $C_x \equiv \frac{\|y(x)-a\|^2}{2}$ for individual training examples. In practice, to compute the gradient $\nabla C$ we need to compute the gradients $\nabla C_x$ separately for each training input, $x$, and then average them, $\nabla C = \frac{1}{n} \sum_x \nabla C_x$. Unfortunately, when the number of training inputs is very large this can take a long time, and learning thus occurs slowly.

An idea called stochastic gradient descent can be used to speed up learning. The idea is to estimate the gradient $\nabla C$ by computing $\nabla C_x$ for a small sample of randomly chosen training inputs. By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient $\nabla C$, and this helps speed up gradient descent, and thus learning.

To make these ideas more precise, stochastic gradient descent works by randomly picking out a small number $m$ of randomly chosen training inputs. We'll label those random training inputs $X_1, X_2, \ldots, X_m$, and refer to them as a mini-batch. Provided the sample size $m$ is large enough we expect that the average value of the $\nabla C_{X_j}$ will be roughly equal to the average over all $\nabla C_x$, that is,
\begin{eqnarray}
  \frac{\sum_{j=1}^m \nabla C_{X_{j}}}{m} \approx \frac{\sum_x \nabla C_x}{n} = \nabla C,
\tag{18}\end{eqnarray}
where the second sum is over the entire set of training data.
Swapping sides we get
\begin{eqnarray}
  \nabla C \approx \frac{1}{m} \sum_{j=1}^m \nabla C_{X_{j}},
\tag{19}\end{eqnarray}
confirming that we can estimate the overall gradient by computing gradients just for the randomly chosen mini-batch.

To connect this explicitly to learning in neural networks, suppose $w_k$ and $b_l$ denote the weights and biases in our neural network. Then stochastic gradient descent works by picking out a randomly chosen mini-batch of training inputs, and training with those,
\begin{eqnarray}
  w_k & \rightarrow & w_k' = w_k - \frac{\eta}{m}
  \sum_j \frac{\partial C_{X_j}}{\partial w_k} \tag{20}\\
  b_l & \rightarrow & b_l' = b_l - \frac{\eta}{m}
  \sum_j \frac{\partial C_{X_j}}{\partial b_l},
\tag{21}\end{eqnarray}
where the sums are over all the training examples $X_j$ in the current mini-batch. Then we pick out another randomly chosen mini-batch and train with those. And so on, until we've exhausted the training inputs, which is said to complete an epoch of training. At that point we start over with a new training epoch.

Incidentally, it's worth noting that conventions vary about scaling of the cost function and of mini-batch updates to the weights and biases. In Equation (6) we scaled the overall cost function by a factor $\frac{1}{n}$. People sometimes omit the $\frac{1}{n}$, summing over the costs of individual training examples instead of averaging. This is particularly useful when the total number of training examples isn't known in advance. This can occur if more training data is being generated in real time, for instance. And, in a similar way, the mini-batch update rules (20) and (21) sometimes omit the $\frac{1}{m}$ term out the front of the sums. Conceptually this makes little difference, since it's equivalent to rescaling the learning rate $\eta$. But when doing detailed comparisons of different work it's worth watching out for.

We can think of stochastic gradient descent as being like political polling: it's much easier to sample a small mini-batch than it is to apply gradient descent to the full batch, just as carrying out a poll is easier than running a full election. For example, if we have a training set of size $n = 60,000$, as in MNIST, and choose a mini-batch size of (say) $m = 10$, this means we'll get a factor of $6,000$ speedup in estimating the gradient! Of course, the estimate won't be perfect - there will be statistical fluctuations - but it doesn't need to be perfect: all we really care about is moving in a general direction that will help decrease $C$, and that means we don't need an exact computation of the gradient. In practice, stochastic gradient descent is a commonly used and powerful technique for learning in neural networks, and it's the basis for most of the learning techniques we'll develop in this book.

Exercise

An extreme version of gradient descent is to use a mini-batch size of just 1. That is, given a training input, $x$, we update our weights and biases according to the rules $w_k \rightarrow w_k' = w_k - \eta \partial C_x / \partial w_k$ and $b_l \rightarrow b_l' = b_l - \eta \partial C_x / \partial b_l$. Then we choose another training input, and update the weights and biases again. And so on, repeatedly. This procedure is known as online, on-line, or incremental learning. In online learning, a neural network learns from just one training input at a time (just as human beings do). Name one advantage and one disadvantage of online learning, compared to stochastic gradient descent with a mini-batch size of, say, $20$.
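The polling analogy above is easy to check numerically. Here's a small sketch of my own (not the book's code) comparing a full average with a mini-batch estimate, in the spirit of Equation (19); the per-example gradients are just stand-in numbers here, not gradients of a real network:

    import numpy as np

    np.random.seed(0)
    n, m = 60000, 10                        # full training set size, mini-batch size
    # Stand-ins for the per-example gradients nabla C_x (one number per example,
    # just to keep the sketch small).
    grads = 0.1 * np.random.randn(n) + 0.5

    full_average = grads.mean()             # the "true" gradient, (1/n) sum_x nabla C_x
    sample = np.random.choice(n, size=m, replace=False)
    mini_batch_estimate = grads[sample].mean()   # (1/m) sum_j nabla C_{X_j}
    print(full_average, mini_batch_estimate)
    # The estimate fluctuates around the full average -- not perfect, but it used
    # only 10 of the 60,000 examples, and it's good enough to suggest which general
    # direction to move in.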
Let me conclude this section by discussing a point that sometimes bugs people new to gradient descent. In neural networks the cost $C$ is, of course, a function of many variables - all the weights and biases - and so in some sense defines a surface in a very high-dimensional space. Some people get hung up thinking: "Hey, I have to be able to visualize all these extra dimensions". And they may start to worry: "I can't think in four dimensions, let alone five (or five million)". Is there some special ability they're missing, some ability that "real" supermathematicians have? Of course, the answer is no. Even most professional mathematicians can't visualize four dimensions especially well, if at all. The trick they use, instead, is to develop other ways of representing what's going on. That's exactly what we did above: we used an algebraic (rather than visual) representation of $\Delta C$ to figure out how to move so as to decrease $C$. People who are good at thinking in high dimensions have a mental library containing many different techniques along these lines; our algebraic trick is just one example. Those techniques may not have the simplicity we're accustomed to when visualizing three dimensions, but once you build up a library of such techniques, you can get pretty good at thinking in high dimensions. I won't go into more detail here, but if you're interested then you may enjoy reading this discussion of some of the techniques professional mathematicians use to think in high dimensions. While some of the techniques discussed are quite complex, much of the best content is intuitive and accessible, and could be mastered by anyone.

Implementing our network to classify digits

Alright, let's write a program that learns how to recognize handwritten digits, using stochastic gradient descent and the MNIST training data. We'll do this with a short Python (2.7) program, just 74 lines of code! The first thing we need is to get the MNIST data. If you're a git user then you can obtain the data by cloning the code repository for this book,

    git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git

If you don't use git then you can download the data and code here.

Incidentally, when I described the MNIST data earlier, I said it was split into 60,000 training images, and 10,000 test images. That's the official MNIST description. Actually, we're going to split the data a little differently. We'll leave the test images as is, but split the 60,000-image MNIST training set into two parts: a set of 50,000 images, which we'll use to train our neural network, and a separate 10,000 image validation set. We won't use the validation data in this chapter, but later in the book we'll find it useful in figuring out how to set certain hyper-parameters of the neural network - things like the learning rate, and so on, which aren't directly selected by our learning algorithm. Although the validation data isn't part of the original MNIST specification, many people use MNIST in this fashion, and the use of validation data is common in neural networks. When I refer to the "MNIST training data" from now on, I'll be referring to our 50,000 image data set, not the original 60,000 image data set* [*As noted earlier, the MNIST data set is based on two data sets collected by NIST, the United States' National Institute of Standards and Technology. To construct MNIST the NIST data sets were stripped down and put into a more convenient format by Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. See this link for more details. The data set in my repository is in a form that makes it easy to load and manipulate the MNIST data in Python. I obtained this particular form of the data from the LISA machine learning laboratory at the University of Montreal (link).].

Apart from the MNIST data we also need a Python library called Numpy, for doing fast linear algebra. If you don't already have Numpy installed, you can get it here.

Let me explain the core features of the neural networks code, before giving a full listing, below. The centerpiece is a Network class, which we use to represent a neural network. Here's the code we
use to initialize a Network object:

class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

In this code, the list sizes contains the number of neurons in the respective layers. So, for example, if we want to create a Network object with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the final layer, we'd do this with the code:

net = Network([2, 3, 1])

The biases and weights in the Network object are all initialized randomly, using the Numpy np.random.randn function to generate Gaussian distributions with mean $0$ and standard deviation $1$. This random initialization gives our stochastic gradient descent algorithm a place to start from. In later chapters we'll find better ways of initializing the weights and biases, but this will do for now. Note that the Network initialization code assumes that the first layer of neurons is an input layer, and omits to set any biases for those neurons, since biases are only ever used in computing the outputs from later layers.

Note also that the biases and weights are stored as lists of Numpy matrices. So, for example net.weights[1] is a Numpy matrix storing the weights connecting the second and third layers of neurons. (It's not the first and second layers, since Python's list indexing starts at 0.) Since net.weights[1] is rather verbose, let's just denote that matrix $w$. It's a matrix such that $w_{jk}$ is the weight for the connection between the $k^{\rm th}$ neuron in the second layer, and the $j^{\rm th}$ neuron in the third layer. This ordering of the $j$ and $k$ indices may seem strange - surely it'd make more sense to swap the $j$ and $k$ indices around? The big advantage of using this ordering is that it means that the vector of activations of the third layer of neurons is:
\begin{eqnarray}
a' = \sigma(w a + b).
\tag{22}\end{eqnarray}
There's quite a bit going on in this equation, so let's unpack it piece by piece. $a$ is the vector of activations of the second layer of neurons. To obtain $a'$ we multiply $a$ by the weight matrix $w$, and add the vector $b$ of biases. We then apply the function $\sigma$ elementwise to every entry in the vector $w a + b$. (This is called vectorizing the function $\sigma$.) It's easy to verify that Equation (22) gives the same result as our earlier rule, Equation (4), for computing the output of a sigmoid neuron.

Exercise

Write out Equation (22) in component form, and verify that it gives the same result as the rule (4) for computing the output of a sigmoid neuron.
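If it helps to see the exercise numerically before working it out on paper, here is a small sketch comparing the vectorized form (22) against the component-by-component rule, for an arbitrary toy layer (the sizes 3 and 2 below are just assumptions chosen to keep the example small):

import numpy as np

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

# A toy layer: 3 neurons feeding into 2 neurons.
rng = np.random.RandomState(0)
w = rng.randn(2, 3)   # w[j, k]: weight from neuron k to neuron j
b = rng.randn(2, 1)
a = rng.randn(3, 1)   # activations of the previous layer

# Vectorized form, Equation (22).
a_vec = sigmoid(np.dot(w, a) + b)

# Component form: each output is sigma(sum_k w_jk a_k + b_j).
a_comp = np.array([[sigmoid(sum(w[j, k]*a[k, 0] for k in range(3)) + b[j, 0])]
                   for j in range(2)])

print(np.allclose(a_vec, a_comp))  # prints True

Of course, the exercise is really asking you to see why the two forms agree, which the code can't do for you - but it does confirm that nothing is lost in the vectorization.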
With all this in mind, it's easy to write code computing the output from a Network instance. We begin by defining the sigmoid function:

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

Note that when the input z is a vector or Numpy array, Numpy automatically applies the function sigmoid elementwise, that is, in vectorized form.

We then add a feedforward method to the Network class, which, given an input a for the network, returns the corresponding output* *It is assumed that the input a is an (n, 1) Numpy ndarray, not a (n,) vector. Here, n is the number of inputs to the network. If you try to use an (n,) vector as input you'll get strange results. Although using an (n,) vector appears the more natural choice, using an (n, 1) ndarray makes it particularly easy to modify the code to feedforward multiple inputs at once, and that is sometimes convenient.. All the method does is applies Equation (22) for each layer:

    def feedforward(self, a):
        """Return the output of the network if "a" is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a

Of course, the main thing we want our Network objects to do is to learn. To that end we'll give them an SGD method which implements stochastic gradient descent. Here's the code. It's a little mysterious in a few places, but I'll break it down below, after the listing.

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        """Train the neural network using mini-batch stochastic
        gradient descent.  The "training_data" is a list of tuples
        "(x, y)" representing the training inputs and the desired
        outputs.  The other non-optional parameters are
        self-explanatory.  If "test_data" is provided then the
        network will be evaluated against the test data after each
        epoch, and partial progress printed out.  This is useful for
        tracking progress, but slows things down substantially."""
        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in xrange(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in xrange(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print "Epoch {0}: {1} / {2}".format(
                    j, self.evaluate(test_data), n_test)
            else:
                print "Epoch {0} complete".format(j)

The training_data is a list of tuples (x, y) representing the training inputs and corresponding desired outputs. The variables epochs and mini_batch_size are what you'd expect - the number of epochs to train for, and the size of the mini-batches to use when sampling. eta is the learning rate, $\eta$. If the optional argument test_data is supplied, then the program will evaluate the network after each epoch of training, and print out partial progress. This is useful for tracking progress, but slows things down substantially.

The code works as follows. In each epoch, it starts by randomly shuffling the training data, and then partitions it into mini-batches of the appropriate size. This is an easy way of sampling randomly from the training data. Then for each mini_batch we apply a single step of gradient descent. This is done by the code self.update_mini_batch(mini_batch, eta), which updates the network weights and biases according to a single iteration of gradient descent, using just the training data in mini_batch. Here's the code for the update_mini_batch method:

    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
The"mini_batch"isalistoftuples"(x,y)",and"eta" isthelearningrate.""" nabla_b=[np.zeros(b.shape)forbinself.biases] nabla_w=[np.zeros(w.shape)forwinself.weights] forx,yinmini_batch: delta_nabla_b,delta_nabla_w=self.backprop(x,y) nabla_b=[nb+dnbfornb,dnbinzip(nabla_b,delta_nabla_b)] nabla_w=[nw+dnwfornw,dnwinzip(nabla_w,delta_nabla_w)] self.weights=[w-(eta/len(mini_batch))*nw forw,nwinzip(self.weights,nabla_w)] self.biases=[b-(eta/len(mini_batch))*nb forb,nbinzip(self.biases,nabla_b)] Mostoftheworkisdonebytheline delta_nabla_b,delta_nabla_w=self.backprop(x,y) Thisinvokessomethingcalledthebackpropagationalgorithm, whichisafastwayofcomputingthegradientofthecostfunction. Soupdate_mini_batchworkssimplybycomputingthesegradients foreverytrainingexampleinthemini_batch,andthenupdating self.weightsandself.biasesappropriately.I'mnotgoingtoshowthecodeforself.backproprightnow. We'llstudyhowbackpropagationworksinthenextchapter,including thecodeforself.backprop.Fornow,justassumethatit behavesasclaimed,returningtheappropriategradientforthecost associatedtothetrainingexamplex.Let'slookatthefullprogram,includingthedocumentationstrings, whichIomittedabove.Apartfromself.backproptheprogramis self-explanatory-alltheheavyliftingisdoneinself.SGD andself.update_mini_batch,whichwe'vealreadydiscussed.The self.backpropmethodmakesuseofafewextrafunctionstohelp incomputingthegradient,namelysigmoid_prime,whichcomputes thederivativeofthe$\sigma$function,and self.cost_derivative,whichIwon'tdescribehere.Youcanget thegistofthese(andperhapsthedetails)justbylookingatthe codeanddocumentationstrings.We'lllookatthemindetailinthe nextchapter. Notethatwhiletheprogramappearslengthy,muchofthecodeis documentationstringsintendedtomakethecodeeasytounderstand. Infact,theprogramcontainsjust74linesofnon-whitespace, non-commentcode.AllthecodemaybefoundonGitHub here.""" network.py ~~~~~~~~~~ Amoduletoimplementthestochasticgradientdescentlearning algorithmforafeedforwardneuralnetwork.Gradientsarecalculated usingbackpropagation.NotethatIhavefocusedonmakingthecode simple,easilyreadable,andeasilymodifiable.Itisnotoptimized, andomitsmanydesirablefeatures. 
""" ####Libraries #Standardlibrary importrandom #Third-partylibraries importnumpyasnp classNetwork(object): def__init__(self,sizes): """Thelist``sizes``containsthenumberofneuronsinthe respectivelayersofthenetwork.Forexample,ifthelist was[2,3,1]thenitwouldbeathree-layernetwork,withthe firstlayercontaining2neurons,thesecondlayer3neurons, andthethirdlayer1neuron.Thebiasesandweightsforthe networkareinitializedrandomly,usingaGaussian distributionwithmean0,andvariance1.Notethatthefirst layerisassumedtobeaninputlayer,andbyconventionwe won'tsetanybiasesforthoseneurons,sincebiasesareonly everusedincomputingtheoutputsfromlaterlayers.""" self.num_layers=len(sizes) self.sizes=sizes self.biases=[np.random.randn(y,1)foryinsizes[1:]] self.weights=[np.random.randn(y,x) forx,yinzip(sizes[:-1],sizes[1:])] deffeedforward(self,a): """Returntheoutputofthenetworkif``a``isinput.""" forb,winzip(self.biases,self.weights): a=sigmoid(np.dot(w,a)+b) returna defSGD(self,training_data,epochs,mini_batch_size,eta, test_data=None): """Traintheneuralnetworkusingmini-batchstochastic gradientdescent.The``training_data``isalistoftuples ``(x,y)``representingthetraininginputsandthedesired outputs.Theothernon-optionalparametersare self-explanatory.If``test_data``isprovidedthenthe networkwillbeevaluatedagainstthetestdataaftereach epoch,andpartialprogressprintedout.Thisisusefulfor trackingprogress,butslowsthingsdownsubstantially.""" iftest_data:n_test=len(test_data) n=len(training_data) forjinxrange(epochs): random.shuffle(training_data) mini_batches=[ training_data[k:k+mini_batch_size] forkinxrange(0,n,mini_batch_size)] formini_batchinmini_batches: self.update_mini_batch(mini_batch,eta) iftest_data: print"Epoch{0}:{1}/{2}".format( j,self.evaluate(test_data),n_test) else: print"Epoch{0}complete".format(j) defupdate_mini_batch(self,mini_batch,eta): """Updatethenetwork'sweightsandbiasesbyapplying gradientdescentusingbackpropagationtoasingleminibatch. The``mini_batch``isalistoftuples``(x,y)``,and``eta`` isthelearningrate.""" nabla_b=[np.zeros(b.shape)forbinself.biases] nabla_w=[np.zeros(w.shape)forwinself.weights] forx,yinmini_batch: delta_nabla_b,delta_nabla_w=self.backprop(x,y) nabla_b=[nb+dnbfornb,dnbinzip(nabla_b,delta_nabla_b)] nabla_w=[nw+dnwfornw,dnwinzip(nabla_w,delta_nabla_w)] self.weights=[w-(eta/len(mini_batch))*nw forw,nwinzip(self.weights,nabla_w)] self.biases=[b-(eta/len(mini_batch))*nb forb,nbinzip(self.biases,nabla_b)] defbackprop(self,x,y): """Returnatuple``(nabla_b,nabla_w)``representingthe gradientforthecostfunctionC_x.``nabla_b``and ``nabla_w``arelayer-by-layerlistsofnumpyarrays,similar to``self.biases``and``self.weights``.""" nabla_b=[np.zeros(b.shape)forbinself.biases] nabla_w=[np.zeros(w.shape)forwinself.weights] #feedforward activation=x activations=[x]#listtostorealltheactivations,layerbylayer zs=[]#listtostoreallthezvectors,layerbylayer forb,winzip(self.biases,self.weights): z=np.dot(w,activation)+b zs.append(z) activation=sigmoid(z) activations.append(activation) #backwardpass delta=self.cost_derivative(activations[-1],y)*\ sigmoid_prime(zs[-1]) nabla_b[-1]=delta nabla_w[-1]=np.dot(delta,activations[-2].transpose()) #Notethatthevariablelintheloopbelowisusedalittle #differentlytothenotationinChapter2ofthebook.Here, #l=1meansthelastlayerofneurons,l=2isthe #second-lastlayer,andsoon.It'sarenumberingofthe #schemeinthebook,usedheretotakeadvantageofthefact #thatPythoncanusenegativeindicesinlists. 
        for l in xrange(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    def evaluate(self, test_data):
        """Return the number of test inputs for which the neural
        network outputs the correct result. Note that the neural
        network's output is assumed to be the index of whichever
        neuron in the final layer has the highest activation."""
        test_results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in test_results)

    def cost_derivative(self, output_activations, y):
        """Return the vector of partial derivatives \partial C_x /
        \partial a for the output activations."""
        return (output_activations-y)

#### Miscellaneous functions
def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))

How well does the program recognize handwritten digits? Well, let's start by loading in the MNIST data. I'll do this using a little helper program, mnist_loader.py, to be described below. We execute the following commands in a Python shell,

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()

Of course, this could also be done in a separate Python program, but if you're following along it's probably easiest to do in a Python shell.

After loading the MNIST data, we'll set up a Network with $30$ hidden neurons. We do this after importing the Python program listed above, which is named network,

>>> import network
>>> net = network.Network([784, 30, 10])

Finally, we'll use stochastic gradient descent to learn from the MNIST training_data over 30 epochs, with a mini-batch size of 10, and a learning rate of $\eta = 3.0$,

>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

Note that if you're running the code as you read along, it will take some time to execute - for a typical machine (as of 2015) it will likely take a few minutes to run. I suggest you set things running, continue to read, and periodically check the output from the code. If you're in a rush you can speed things up by decreasing the number of epochs, by decreasing the number of hidden neurons, or by using only part of the training data. Note that production code would be much, much faster: these Python scripts are intended to help you understand how neural nets work, not to be high-performance code! And, of course, once we've trained a network it can be run very quickly indeed, on almost any computing platform. For example, once we've learned a good set of weights and biases for a network, it can easily be ported to run in Javascript in a web browser, or as a native app on a mobile device. In any case, here is a partial transcript of the output of one training run of the neural network. The transcript shows the number of test images correctly recognized by the neural network after each epoch of training. As you can see, after just a single epoch this has reached 9,129 out of 10,000, and the number continues to grow,

Epoch 0: 9129 / 10000
Epoch 1: 9295 / 10000
Epoch 2: 9348 / 10000
...
Epoch 27: 9528 / 10000
Epoch 28: 9542 / 10000
Epoch 29: 9534 / 10000

That is, the trained network gives us a classification rate of about $95$ percent - $95.42$ percent at its peak ("Epoch 28")! That's quite encouraging as a first attempt. I should warn you, however, that if you run the code then your results are not necessarily going to be quite the same as mine, since we'll be initializing our network using (different) random weights and biases. To generate results in this chapter I've taken best-of-three runs.

Let's rerun the above experiment, changing the number of hidden neurons to $100$. As was the case earlier, if you're running the code as you read along, you should be warned that it takes quite a while to execute (on my machine this experiment takes tens of seconds for each training epoch), so it's wise to continue reading in parallel while the code executes.

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

Sure enough, this improves the results to $96.59$ percent. At least in this case, using more hidden neurons helps us get better results* *Reader feedback indicates quite some variation in results for this experiment, and some training runs give results quite a bit worse. Using the techniques introduced in chapter 3 will greatly reduce the variation in performance across different training runs for our networks..

Of course, to obtain these accuracies I had to make specific choices for the number of epochs of training, the mini-batch size, and the learning rate, $\eta$. As I mentioned above, these are known as hyper-parameters for our neural network, in order to distinguish them from the parameters (weights and biases) learnt by our learning algorithm. If we choose our hyper-parameters poorly, we can get bad results. Suppose, for example, that we'd chosen the learning rate to be $\eta = 0.001$,

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 0.001, test_data=test_data)

The results are much less encouraging,

Epoch 0: 1139 / 10000
Epoch 1: 1136 / 10000
Epoch 2: 1135 / 10000
...
Epoch 27: 2101 / 10000
Epoch 28: 2123 / 10000
Epoch 29: 2142 / 10000

However, you can see that the performance of the network is getting slowly better over time. That suggests increasing the learning rate, say to $\eta = 0.01$. If we do that, we get better results, which suggests increasing the learning rate again. (If making a change improves things, try doing more!) If we do that several times over, we'll end up with a learning rate of something like $\eta = 1.0$ (and perhaps fine tune to $3.0$), which is close to our earlier experiments. So even though we initially made a poor choice of hyper-parameters, we at least got enough information to help us improve our choice of hyper-parameters.

In general, debugging a neural network can be challenging. This is especially true when the initial choice of hyper-parameters produces results no better than random noise. Suppose we try the successful 30 hidden neuron network architecture from earlier, but with the learning rate changed to $\eta = 100.0$:

>>> net = network.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 100.0, test_data=test_data)

At this point we've actually gone too far, and the learning rate is too high:

Epoch 0: 1009 / 10000
Epoch 1: 1009 / 10000
Epoch 2: 1009 / 10000
Epoch 3: 1009 / 10000
...
Epoch 27: 982 / 10000
Epoch 28: 982 / 10000
Epoch 29: 982 / 10000

Now imagine that we were coming to this problem for the first time.
Of course, we know from our earlier experiments that the right thing to do is to decrease the learning rate. But if we were coming to this problem for the first time then there wouldn't be much in the output to guide us on what to do. We might worry not only about the learning rate, but about every other aspect of our neural network. We might wonder if we've initialized the weights and biases in a way that makes it hard for the network to learn? Or maybe we don't have enough training data to get meaningful learning? Perhaps we haven't run for enough epochs? Or maybe it's impossible for a neural network with this architecture to learn to recognize handwritten digits? Maybe the learning rate is too low? Or, maybe, the learning rate is too high? When you're coming to a problem for the first time, you're not always sure.

The lesson to take away from this is that debugging a neural network is not trivial, and, just as for ordinary programming, there is an art to it. You need to learn that art of debugging in order to get good results from neural networks. More generally, we need to develop heuristics for choosing good hyper-parameters and a good architecture. We'll discuss all these at length through the book, including how I chose the hyper-parameters above.

Exercise

Try creating a network with just two layers - an input and an output layer, no hidden layer - with 784 and 10 neurons, respectively. Train the network using stochastic gradient descent. What classification accuracy can you achieve?

Earlier, I skipped over the details of how the MNIST data is loaded. It's pretty straightforward. For completeness, here's the code. The data structures used to store the MNIST data are described in the documentation strings - it's straightforward stuff, tuples and lists of Numpy ndarray objects (think of them as vectors if you're not familiar with ndarrays):

"""
mnist_loader
~~~~~~~~~~~~

A library to load the MNIST image data.  For details of the data
structures that are returned, see the doc strings for ``load_data``
and ``load_data_wrapper``.  In practice, ``load_data_wrapper`` is the
function usually called by our neural network code.
"""

#### Libraries
# Standard library
import cPickle
import gzip

# Third-party libraries
import numpy as np

def load_data():
    """Return the MNIST data as a tuple containing the training data,
    the validation data, and the test data.

    The ``training_data`` is returned as a tuple with two entries.
    The first entry contains the actual training images.  This is a
    numpy ndarray with 50,000 entries.  Each entry is, in turn, a
    numpy ndarray with 784 values, representing the 28 * 28 = 784
    pixels in a single MNIST image.

    The second entry in the ``training_data`` tuple is a numpy ndarray
    containing 50,000 entries.  Those entries are just the digit
    values (0...9) for the corresponding images contained in the first
    entry of the tuple.

    The ``validation_data`` and ``test_data`` are similar, except
    each contains only 10,000 images.

    This is a nice data format, but for use in neural networks it's
    helpful to modify the format of the ``training_data`` a little.
    That's done in the wrapper function ``load_data_wrapper()``, see
    below.
    """
    f = gzip.open('../data/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(f)
    f.close()
    return (training_data, validation_data, test_data)

def load_data_wrapper():
    """Return a tuple containing ``(training_data, validation_data,
    test_data)``.  Based on ``load_data``, but the format is more
    convenient for use in our implementation of neural networks.

    In particular, ``training_data`` is a list containing 50,000
    2-tuples ``(x, y)``.  ``x`` is a 784-dimensional numpy.ndarray
    containing the input image.  ``y`` is a 10-dimensional
    numpy.ndarray representing the unit vector corresponding to the
    correct digit for ``x``.

    ``validation_data`` and ``test_data`` are lists containing 10,000
    2-tuples ``(x, y)``.  In each case, ``x`` is a 784-dimensional
    numpy.ndarray containing the input image, and ``y`` is the
    corresponding classification, i.e., the digit values (integers)
    corresponding to ``x``.
    Obviously, this means we're using slightly different formats for
    the training data and the validation / test data.  These formats
    turn out to be the most convenient for use in our neural network
    code."""
    tr_d, va_d, te_d = load_data()
    training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
    training_results = [vectorized_result(y) for y in tr_d[1]]
    training_data = zip(training_inputs, training_results)
    validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]
    validation_data = zip(validation_inputs, va_d[1])
    test_inputs = [np.reshape(x, (784, 1)) for x in te_d[0]]
    test_data = zip(test_inputs, te_d[1])
    return (training_data, validation_data, test_data)

def vectorized_result(j):
    """Return a 10-dimensional unit vector with a 1.0 in the jth
    position and zeroes elsewhere.  This is used to convert a digit
    (0...9) into a corresponding desired output from the neural
    network."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

I said above that our program gets pretty good results. What does that mean? Good compared to what? It's informative to have some simple (non-neural-network) baseline tests to compare against, to understand what it means to perform well. The simplest baseline of all, of course, is to randomly guess the digit. That'll be right about ten percent of the time. We're doing much better than that!

What about a less trivial baseline? Let's try an extremely simple idea: we'll look at how dark an image is. For instance, an image of a $2$ will typically be quite a bit darker than an image of a $1$, just because more pixels are blackened out, as the following examples illustrate:

This suggests using the training data to compute average darknesses for each digit, $0, 1, 2, \ldots, 9$. When presented with a new image, we compute how dark the image is, and then guess that it's whichever digit has the closest average darkness. This is a simple procedure, and is easy to code up, so I won't explicitly write out the code - if you're interested it's in the GitHub repository (a rough sketch of the idea is also given at the end of this section). But it's a big improvement over random guessing, getting $2,225$ of the $10,000$ test images correct, i.e., $22.25$ percent accuracy.

It's not difficult to find other ideas which achieve accuracies in the $20$ to $50$ percent range. If you work a bit harder you can get up over $50$ percent. But to get much higher accuracies it helps to use established machine learning algorithms. Let's try using one of the best known algorithms, the support vector machine or SVM. If you're not familiar with SVMs, not to worry, we're not going to need to understand the details of how SVMs work. Instead, we'll use a Python library called scikit-learn, which provides a simple Python interface to a fast C-based library for SVMs known as LIBSVM.

If we run scikit-learn's SVM classifier using the default settings, then it gets 9,435 of 10,000 test images correct. (The code is available here.) That's a big improvement over our naive approach of classifying an image based on how dark it is. Indeed, it means that the SVM is performing roughly as well as our neural networks, just a little worse. In later chapters we'll introduce new techniques that enable us to improve our neural networks so that they perform much better than the SVM.

That's not the end of the story, however. The 9,435 of 10,000 result is for scikit-learn's default settings for SVMs. SVMs have a number of tunable parameters, and it's possible to search for parameters which improve this out-of-the-box performance. I won't explicitly do this search, but instead refer you to this blog post by Andreas Mueller if you'd like to know more. Mueller shows that with some work optimizing the SVM's parameters it's possible to get the performance up above 98.5 percent accuracy. In other words, a well-tuned SVM only makes an error on about one digit in 70. That's pretty good! Can neural networks do better?

In fact, they can. At present, well-designed neural networks outperform every other technique for solving MNIST, including SVMs. The current (2013) record is classifying 9,979 of 10,000 images correctly. This was done by Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus.
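Before looking at what the record-setting networks do, here is the promised sketch of the two baselines described above. This is only an illustration of the ideas, not the scripts in the book's repository; it assumes the mnist_loader module from earlier is importable and that scikit-learn is installed, and note that fitting the default SVM on all 50,000 training images is quite slow. The accuracy figures quoted in the text come from the repository's own code, so small differences are possible.

from collections import defaultdict

import numpy as np
from sklearn import svm

import mnist_loader

training_data, validation_data, test_data = mnist_loader.load_data()

# Baseline 1: guess the digit whose average training darkness is closest.
darkness_sums = defaultdict(float)
darkness_counts = defaultdict(int)
for image, digit in zip(training_data[0], training_data[1]):
    darkness_sums[digit] += np.sum(image)
    darkness_counts[digit] += 1
avg_darkness = {d: darkness_sums[d] / darkness_counts[d] for d in range(10)}

def guess_by_darkness(image):
    darkness = np.sum(image)
    return min(avg_darkness, key=lambda d: abs(avg_darkness[d] - darkness))

darkness_correct = sum(int(guess_by_darkness(x) == y)
                       for x, y in zip(test_data[0], test_data[1]))
print("Darkness baseline: %s / 10000" % darkness_correct)

# Baseline 2: an SVM with scikit-learn's default settings.
clf = svm.SVC()
clf.fit(training_data[0], training_data[1])
svm_correct = sum(int(p == y)
                  for p, y in zip(clf.predict(test_data[0]), test_data[1]))
print("SVM baseline: %s / 10000" % svm_correct)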
We'll see most of the techniques used by Wan, Zeiler, Zhang, LeCun, and Fergus later in the book. At that level the performance is close to human-equivalent, and is arguably better, since quite a few of the MNIST images are difficult even for humans to recognize with confidence, for example:

I trust you'll agree that those are tough to classify! With images like these in the MNIST data set it's remarkable that neural networks can accurately classify all but 21 of the 10,000 test images. Usually, when programming we believe that solving a complicated problem like recognizing the MNIST digits requires a sophisticated algorithm. But even the neural networks in the Wan et al paper just mentioned involve quite simple algorithms, variations on the algorithm we've seen in this chapter. All the complexity is learned, automatically, from the training data. In some sense, the moral of both our results and those in more sophisticated papers, is that for some problems:

sophisticated algorithm $\leq$ simple learning algorithm + good training data.

Toward deep learning

While our neural network gives impressive performance, that performance is somewhat mysterious. The weights and biases in the network were discovered automatically. And that means we don't immediately have an explanation of how the network does what it does. Can we find some way to understand the principles by which our network is classifying handwritten digits? And, given such principles, can we do better?

To put these questions more starkly, suppose that a few decades hence neural networks lead to artificial intelligence (AI). Will we understand how such intelligent networks work? Perhaps the networks will be opaque to us, with weights and biases we don't understand, because they've been learned automatically. In the early days of AI research people hoped that the effort to build an AI would also help us understand the principles behind intelligence and, maybe, the functioning of the human brain. But perhaps the outcome will be that we end up understanding neither the brain nor how artificial intelligence works!

To address these questions, let's think back to the interpretation of artificial neurons that I gave at the start of the chapter, as a means of weighing evidence. Suppose we want to determine whether an image shows a human face or not:

Credits: 1. Ester Inbar. 2. Unknown. 3. NASA, ESA, G. Illingworth, D. Magee, and P. Oesch (University of California, Santa Cruz), R. Bouwens (Leiden University), and the HUDF09 Team. Click on the images for more details.

We could attack this problem the same way we attacked handwriting recognition - by using the pixels in the image as input to a neural network, with the output from the network a single neuron indicating either "Yes, it's a face" or "No, it's not a face".

Let's suppose we do this, but that we're not using a learning algorithm. Instead, we're going to try to design a network by hand, choosing appropriate weights and biases. How might we go about it?
Forgetting neural networks entirely for the moment, a heuristic we could use is to decompose the problem into sub-problems: does the image have an eye in the top left? Does it have an eye in the top right? Does it have a nose in the middle? Does it have a mouth in the bottom middle? Is there hair on top? And so on.

If the answers to several of these questions are "yes", or even just "probably yes", then we'd conclude that the image is likely to be a face. Conversely, if the answers to most of the questions are "no", then the image probably isn't a face.

Of course, this is just a rough heuristic, and it suffers from many deficiencies. Maybe the person is bald, so they have no hair. Maybe we can only see part of the face, or the face is at an angle, so some of the facial features are obscured. Still, the heuristic suggests that if we can solve the sub-problems using neural networks, then perhaps we can build a neural network for face-detection, by combining the networks for the sub-problems. Here's a possible architecture, with rectangles denoting the sub-networks. Note that this isn't intended as a realistic approach to solving the face-detection problem; rather, it's to help us build intuition about how networks function. Here's the architecture:

It's also plausible that the sub-networks can be decomposed. Suppose we're considering the question: "Is there an eye in the top left?" This can be decomposed into questions such as: "Is there an eyebrow?"; "Are there eyelashes?"; "Is there an iris?"; and so on. Of course, these questions should really include positional information, as well - "Is the eyebrow in the top left, and above the iris?", that kind of thing - but let's keep it simple. The network to answer the question "Is there an eye in the top left?" can now be decomposed:

Those questions too can be broken down, further and further through multiple layers. Ultimately, we'll be working with sub-networks that answer questions so simple they can easily be answered at the level of single pixels. Those questions might, for example, be about the presence or absence of very simple shapes at particular points in the image. Such questions can be answered by single neurons connected to the raw pixels in the image.

The end result is a network which breaks down a very complicated question - does this image show a face or not - into very simple questions answerable at the level of single pixels. It does this through a series of many layers, with early layers answering very simple and specific questions about the input image, and later layers building up a hierarchy of ever more complex and abstract concepts. Networks with this kind of many-layer structure - two or more hidden layers - are called deep neural networks.

Of course, I haven't said how to do this recursive decomposition into sub-networks. It certainly isn't practical to hand-design the weights and biases in the network. Instead, we'd like to use learning algorithms so that the network can automatically learn the weights and biases - and thus, the hierarchy of concepts - from training data. Researchers in the 1980s and 1990s tried using stochastic gradient descent and backpropagation to train deep networks. Unfortunately, except for a few special architectures, they didn't have much luck. The networks would learn, but very slowly, and in practice often too slowly to be useful.

Since 2006, a set of techniques has been developed that enable learning in deep neural nets. These deep learning techniques are based on stochastic gradient descent and backpropagation, but also introduce new ideas. These techniques have enabled much deeper (and larger) networks to be trained - people now routinely train networks with 5 to 10 hidden layers. And, it turns out that these perform far better on many problems than shallow neural networks, i.e., networks with just a single hidden layer. The reason, of course, is the ability of deep nets to build up a complex hierarchy of concepts.
It's a bit like the way conventional programming languages use modular design and ideas about abstraction to enable the creation of complex computer programs. Comparing a deep network to a shallow network is a bit like comparing a programming language with the ability to make function calls to a stripped down language with no ability to make such calls. Abstraction takes a different form in neural networks than it does in conventional programming, but it's just as important.