Deep learning processor - Wikipedia


A deep learning processor (DLP), or a deep learning accelerator, is an electronic circuit designed for deep learning algorithms, usually with separate data memory and a dedicated instruction set architecture. Deep learning processors range from mobile devices, such as the neural processing units (NPUs) in Huawei cellphones,[1] to cloud computing servers such as the tensor processing units (TPUs) in the Google Cloud Platform.[2] The goal of DLPs is to provide higher efficiency and performance for deep learning algorithms than general central processing units (CPUs) and graphics processing units (GPUs) would. Most DLPs employ a large number of computing components to leverage high data-level parallelism, a relatively large on-chip buffer/memory to leverage data reuse patterns, and limited data-width operators to exploit the error resilience of deep learning.

History

The use of CPUs/GPUs

At the beginning, general CPUs were adopted to perform deep learning algorithms. Later, GPUs were introduced to the domain of deep learning. For example, in 2012, Alex Krizhevsky adopted two GPUs to train a deep learning network, AlexNet,[3] which won the ILSVRC-2012 competition. As interest in deep learning algorithms and DLPs kept increasing, GPU manufacturers started to add deep-learning-related features in both hardware (e.g., INT8 operators) and software (e.g., the cuDNN library). For example, Nvidia released Turing Tensor Cores, a DLP, to accelerate deep learning processing.

The first DLP

To provide higher efficiency in performance and energy, domain-specific design began to draw great attention. In 2014, Chen et al. proposed the first DLP in the world, DianNao (Chinese for "electric brain"),[4] to accelerate deep neural networks. DianNao provides a peak performance of 452 Gop/s (of key operations in deep neural networks) in a footprint of only 3.02 mm² and 485 mW. Its successors (DaDianNao,[5] ShiDianNao,[6] PuDianNao[7]) were proposed by the same group, forming the DianNao family.[8]

The blooming DLPs

Inspired by the pioneering work of the DianNao family, many DLPs have been proposed in both academia and industry, with designs optimized to leverage the features of deep neural networks for high efficiency. At ISCA 2016 alone, three sessions (15% of the accepted papers) were architecture designs for deep learning. Such efforts include Eyeriss[9] (MIT), EIE[10] (Stanford), Minerva[11] (Harvard), and Stripes[12] (University of Toronto) in academia, and the TPU[13] (Google) and MLU[14] (Cambricon) in industry. Several representative works are listed in Table 1.
Table 1. Typical DLPs

Year | DLP | Institution | Type | Computation | Memory hierarchy | Control | Peak performance
2014 | DianNao[4] | ICT, CAS | digital | vector MACs | scratchpad | VLIW | 452 Gops (16-bit)
2014 | DaDianNao[5] | ICT, CAS | digital | vector MACs | scratchpad | VLIW | 5.58 Tops (16-bit)
2015 | ShiDianNao[6] | ICT, CAS | digital | scalar MACs | scratchpad | VLIW | 194 Gops (16-bit)
2015 | PuDianNao[7] | ICT, CAS | digital | vector MACs | scratchpad | VLIW | 1,056 Gops (16-bit)
2016 | DnnWeaver | Georgia Tech | digital | vector MACs | scratchpad | - | -
2016 | EIE[10] | Stanford | digital | scalar MACs | scratchpad | - | 102 Gops (16-bit)
2016 | Eyeriss[9] | MIT | digital | scalar MACs | scratchpad | - | 67.2 Gops (16-bit)
2016 | Prime[15] | UCSB | hybrid | process-in-memory | ReRAM | - | -
2017 | TPU[13] | Google | digital | scalar MACs | scratchpad | CISC | 92 Tops (8-bit)
2017 | PipeLayer[16] | U of Pittsburgh | hybrid | process-in-memory | ReRAM | - | -
2017 | FlexFlow | ICT, CAS | digital | scalar MACs | scratchpad | - | 420 Gops
2018 | MAERI | Georgia Tech | digital | scalar MACs | scratchpad | - | -
2018 | PermDNN | City University of New York | digital | vector MACs | scratchpad | - | 614.4 Gops (16-bit)
2019 | FPSA | Tsinghua | hybrid | process-in-memory | ReRAM | - | -
2019 | Cambricon-F | ICT, CAS | digital | vector MACs | scratchpad | FISA | 14.9 Tops (F1, 16-bit); 956 Tops (F100, 16-bit)

DLP architecture

With the rapid evolution of deep learning algorithms and DLPs, many architectures have been explored. Roughly, DLPs can be classified into three categories based on their implementation: digital circuits, analog circuits, and hybrid circuits. As pure analog DLPs are rarely seen, this article introduces the digital and hybrid DLPs.

Digital DLPs

The major components of a DLP architecture usually include a computation component, an on-chip memory hierarchy, and control logic that manages data communication and computing flows.

Regarding the computation component, as most operations in deep learning can be aggregated into vector operations, the most common way of building computation components in digital DLPs is a MAC-based (multiplier-accumulator) organization, either with vector MACs[4][5][7] or scalar MACs.[13][6][9] Rather than the SIMD or SIMT of general processing devices, deep-learning domain-specific parallelism is better exploited on these MAC-based organizations. Regarding the memory hierarchy, as deep learning algorithms require high bandwidth to provide the computation component with sufficient data, DLPs usually employ a relatively large on-chip buffer (tens of kilobytes to several megabytes) together with dedicated on-chip data reuse and data exchange strategies to alleviate the burden on memory bandwidth. For example, DianNao, with 16 16-input vector MACs, requires 16 × 16 × 2 = 512 16-bit values per cycle, i.e., almost 1024 GB/s of bandwidth between the computation components and the buffers; with on-chip reuse, such bandwidth requirements are reduced drastically.[4] Instead of the caches widely used in general processing devices, DLPs use scratchpad memory, as it provides higher data reuse opportunities by leveraging the relatively regular data access patterns of deep learning algorithms. Regarding the control logic, as deep learning algorithms keep evolving at a dramatic speed, DLPs have started to leverage dedicated ISAs (instruction set architectures) to support the deep learning domain flexibly. At first, DianNao used a VLIW-style instruction set in which each instruction could finish a layer in a DNN. Cambricon[17] introduced the first deep-learning domain-specific ISA, which could support more than ten different deep learning algorithms. The TPU also reveals five key instructions from its CISC-style ISA.
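The bandwidth arithmetic above can be made concrete with a short sketch. The following Python snippet is only an illustration, not code from any cited design; the tile dimensions, 16-bit word size, and 1 GHz clock are assumptions chosen to reproduce the 16 × 16 × 2 = 512 words-per-cycle figure quoted for DianNao.

```python
# Illustrative sketch (hypothetical parameters): one cycle of a DianNao-style
# tile of vector MACs, and the raw operand traffic it would imply if every
# operand were refetched from the on-chip buffer each cycle.

import numpy as np

NUM_MACS = 16        # number of vector-MAC units (assumed, DianNao-like)
MAC_WIDTH = 16       # inputs per vector MAC (assumed)
WORD_BYTES = 2       # 16-bit operands
CLOCK_HZ = 1e9       # assumed 1 GHz clock for the estimate

def vector_mac_cycle(neurons, synapses, partial_sums):
    """One cycle of a MAC-based compute tile.

    neurons:      (MAC_WIDTH,)           shared input activations
    synapses:     (NUM_MACS, MAC_WIDTH)  one weight row per output neuron
    partial_sums: (NUM_MACS,)            accumulators updated in place
    """
    partial_sums += synapses @ neurons   # NUM_MACS dot products of width MAC_WIDTH
    return partial_sums

# Raw operand traffic without any on-chip reuse:
words_per_cycle = NUM_MACS * MAC_WIDTH * 2          # 2 operands per multiply -> 512
bytes_per_second = words_per_cycle * WORD_BYTES * CLOCK_HZ
print(f"{words_per_cycle} words/cycle ≈ {bytes_per_second / 1e9:.0f} GB/s without reuse")

# Quick check of the MAC arithmetic itself:
acc = np.zeros(NUM_MACS, dtype=np.int32)
acc = vector_mac_cycle(np.ones(MAC_WIDTH, dtype=np.int16),
                       np.ones((NUM_MACS, MAC_WIDTH), dtype=np.int16), acc)
assert acc.tolist() == [MAC_WIDTH] * NUM_MACS
```

The scratchpad and reuse strategies described above exist precisely so that, in a real DLP, most of these operands are served from on-chip reuse rather than refetched from the buffer every cycle.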
Hybrid DLPs

Hybrid DLPs have emerged for DNN inference and training acceleration because of their high efficiency. Processing-in-memory (PIM) architectures are one of the most important types of hybrid DLP. The key design concept of PIM is to bridge the gap between computing and memory in the following ways: 1) moving computation components into memory cells, controllers, or memory chips to alleviate the memory wall issue;[16][18][19] such architectures significantly shorten data paths and leverage much higher internal bandwidth, resulting in attractive performance improvements; 2) building highly efficient DNN engines by adopting computational memory devices. In 2013, HP Labs demonstrated the capability of ReRAM crossbar structures for computing.[20] Inspired by this work, a great deal of work has been proposed to explore new architectures and system designs based on ReRAM,[15][21][22][16] phase change memory,[18][23][24] and so on.

GPUs and FPGAs

Besides DLPs, GPUs and FPGAs are also used as accelerators to speed up the execution of deep learning algorithms. For example, Summit, a supercomputer from IBM for Oak Ridge National Laboratory,[25] contains 27,648 Nvidia Tesla V100 cards, which can be used to accelerate deep learning algorithms. Microsoft builds its deep learning platform on large numbers of FPGAs in Azure to support real-time deep learning services.[26] Table 2 compares DLPs against GPUs and FPGAs in terms of target, performance, energy efficiency, and flexibility.

Table 2. DLPs vs. GPUs vs. FPGAs

 | Target | Performance | Energy efficiency | Flexibility
DLPs | deep learning | high | high | domain-specific
FPGAs | all | low | moderate | general
GPUs | matrix computation | moderate | low | matrix applications

Atomically thin semiconductors for deep learning

Atomically thin semiconductors are considered promising for energy-efficient deep learning hardware in which the same basic device structure is used for both logic operations and data storage. In 2020, Marega et al. published experiments with a large-area active channel material for developing logic-in-memory devices and circuits based on floating-gate field-effect transistors (FGFETs).[27] They use two-dimensional materials such as semiconducting molybdenum disulphide to precisely tune FGFETs as building blocks in which logic operations can be performed with the memory elements.[27]

Integrated photonic tensor core

In 2021, J. Feldmann et al. proposed an integrated photonic hardware accelerator for parallel convolutional processing.[28] The authors identify two key advantages of integrated photonics over its electronic counterparts: (1) massively parallel data transfer through wavelength-division multiplexing in conjunction with frequency combs, and (2) extremely high data modulation speeds.[28] Their system can execute trillions of multiply-accumulate operations per second, indicating the potential of integrated photonics in data-heavy AI applications.[28]

Benchmarks

Benchmarking has long served as the foundation for designing new hardware architectures, where both architects and practitioners can compare various architectures, identify their bottlenecks, and conduct the corresponding system or architectural optimization. Table 3 lists several typical benchmarks for DLPs, in time order from 2012.

Table 3. Benchmarks

Year | NN benchmark | Affiliations | # of microbenchmarks | # of component benchmarks | # of application benchmarks
2012 | BenchNN | ICT, CAS | N/A | 12 | N/A
2016 | Fathom | Harvard | N/A | 8 | N/A
2017 | BenchIP | ICT, CAS | 12 | 11 | N/A
2017 | DAWNBench | Stanford | 8 | N/A | N/A
2017 | DeepBench | Baidu | 4 | N/A | N/A
2018 | MLPerf | Harvard, Intel, and Google, etc. | N/A | 7 | N/A
2019 | AIBench | ICT, CAS and Alibaba, etc. | 12 | 16 | 2
2019 | NNBench-X | UCSB | N/A | 10 | N/A

See also

AI accelerator
Cerebras Systems

References

^ "HUAWEI Reveals the Future of Mobile AI at IFA".
^ Jouppi, Norman P.; Young, Cliff; Patil, Nishant; Patterson, David; Agrawal, Gaurav; Bajwa, Raminder; Bates, Sarah; Bhatia, Suresh; Boden, Nan; Borchers, Al; Boyle, Rick (2017-06-24). "In-Datacenter Performance Analysis of a Tensor Processing Unit". ACM SIGARCH Computer Architecture News. 45 (2): 1–12. doi:10.1145/3140659.3080246.
^ Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E (2017-05-24). "ImageNet classification with deep convolutional neural networks". Communications of the ACM. 60 (6): 84–90. doi:10.1145/3065386.
^ a b c d Chen, Tianshi; Du, Zidong; Sun, Ninghui; Wang, Jia; Wu, Chengyong; Chen, Yunji; Temam, Olivier (2014-04-05). "DianNao". ACM SIGARCH Computer Architecture News. 42 (1): 269–284. doi:10.1145/2654822.2541967. ISSN 0163-5964.
^ a b c Chen, Yunji; Luo, Tao; Liu, Shaoli; Zhang, Shijin; He, Liqiang; Wang, Jia; Li, Ling; Chen, Tianshi; Xu, Zhiwei; Sun, Ninghui; Temam, Olivier (December 2014). "DaDianNao: A Machine-Learning Supercomputer". 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE: 609–622. doi:10.1109/micro.2014.58. ISBN 978-1-4799-6998-2. S2CID 6838992.
^ a b c Du, Zidong; Fasthuber, Robert; Chen, Tianshi; Ienne, Paolo; Li, Ling; Luo, Tao; Feng, Xiaobing; Chen, Yunji; Temam, Olivier (2016-01-04). "ShiDianNao". ACM SIGARCH Computer Architecture News. 43 (3S): 92–104. doi:10.1145/2872887.2750389. ISSN 0163-5964.
^ a b c Liu, Daofu; Chen, Tianshi; Liu, Shaoli; Zhou, Jinhong; Zhou, Shengyuan; Teman, Olivier; Feng, Xiaobing; Zhou, Xuehai; Chen, Yunji (2015-05-29). "PuDianNao". ACM SIGARCH Computer Architecture News. 43 (1): 369–381. doi:10.1145/2786763.2694358. ISSN 0163-5964.
^ Chen, Yunji; Chen, Tianshi; Xu, Zhiwei; Sun, Ninghui; Temam, Olivier (2016-10-28). "DianNao family". Communications of the ACM. 59 (11): 105–112. doi:10.1145/2996864. ISSN 0001-0782. S2CID 207243998.
^ a b c Chen, Yu-Hsin; Emer, Joel; Sze, Vivienne (2017). "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks". IEEE Micro: 1. doi:10.1109/mm.2017.265085944. hdl:1721.1/102369. ISSN 0272-1732.
^ a b Han, Song; Liu, Xingyu; Mao, Huizi; Pu, Jing; Pedram, Ardavan; Horowitz, Mark A.; Dally, William J. (2016-02-03). EIE: Efficient Inference Engine on Compressed Deep Neural Network. OCLC 1106232247.
^ Reagen, Brandon; Whatmough, Paul; Adolf, Robert; Rama, Saketh; Lee, Hyunkwang; Lee, Sae Kyu; Hernandez-Lobato, Jose Miguel; Wei, Gu-Yeon; Brooks, David (June 2016). "Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators". 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). Seoul: IEEE: 267–278. doi:10.1109/ISCA.2016.32. ISBN 978-1-4673-8947-1.
^ Judd, Patrick; Albericio, Jorge; Moshovos, Andreas (2017-01-01). "Stripes: Bit-Serial Deep Neural Network Computing". IEEE Computer Architecture Letters. 16 (1): 80–83. doi:10.1109/lca.2016.2597140. ISSN 1556-6056. S2CID 3784424.
^ a b c "In-Datacenter Performance Analysis of a Tensor Processing Unit". Proceedings of the 44th Annual International Symposium on Computer Architecture. doi:10.1145/3079856.3080246. S2CID 4202768.
^ "MLU100 intelligence accelerator card".
^ a b Chi, Ping; Li, Shuangchen; Xu, Cong; Zhang, Tao; Zhao, Jishen; Liu, Yongpan; Wang, Yu; Xie, Yuan (June 2016). "PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory". 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). IEEE: 27–39. doi:10.1109/isca.2016.13. ISBN 978-1-4673-8947-1.
^ a b c Song, Linghao; Qian, Xuehai; Li, Hai; Chen, Yiran (February 2017). "PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning". 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE: 541–552. doi:10.1109/hpca.2017.55. ISBN 978-1-5090-4985-1. S2CID 15281419.
^ Liu, Shaoli; Du, Zidong; Tao, Jinhua; Han, Dong; Luo, Tao; Xie, Yuan; Chen, Yunji; Chen, Tianshi (June 2016). "Cambricon: An Instruction Set Architecture for Neural Networks". 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). IEEE: 393–405. doi:10.1109/isca.2016.42. ISBN 978-1-4673-8947-1.
^ a b Ambrogio, Stefano; Narayanan, Pritish; Tsai, Hsinyu; Shelby, Robert M.; Boybat, Irem; di Nolfo, Carmelo; Sidler, Severin; Giordano, Massimo; Bodini, Martina; Farinha, Nathan C. P.; Killeen, Benjamin (June 2018). "Equivalent-accuracy accelerated neural-network training using analogue memory". Nature. 558 (7708): 60–67. doi:10.1038/s41586-018-0180-5. ISSN 0028-0836. PMID 29875487. S2CID 46956938.
^ Chen, Wei-Hao; Lin, Wen-Jang; Lai, Li-Ya; Li, Shuangchen; Hsu, Chien-Hua; Lin, Huan-Ting; Lee, Heng-Yuan; Su, Jian-Wei; Xie, Yuan; Sheu, Shyh-Shyuan; Chang, Meng-Fan (December 2017). "A 16Mb dual-mode ReRAM macro with sub-14ns computing-in-memory and memory functions enabled by self-write termination scheme". 2017 IEEE International Electron Devices Meeting (IEDM). IEEE: 28.2.1–28.2.4. doi:10.1109/iedm.2017.8268468. ISBN 978-1-5386-3559-9. S2CID 19556846.
^ Yang, J. Joshua; Strukov, Dmitri B.; Stewart, Duncan R. (January 2013). "Memristive devices for computing". Nature Nanotechnology. 8 (1): 13–24. doi:10.1038/nnano.2012.240. ISSN 1748-3395. PMID 23269430.
^ Shafiee, Ali; Nag, Anirban; Muralimanohar, Naveen; Balasubramonian, Rajeev; Strachan, John Paul; Hu, Miao; Williams, R. Stanley; Srikumar, Vivek (2016-10-12). "ISAAC". ACM SIGARCH Computer Architecture News. 44 (3): 14–26. doi:10.1145/3007787.3001139. ISSN 0163-5964. S2CID 6329628.
^ Ji, Yu; Zhang, Youyang; Xie, Xinfeng; Li, Shuangchen; Wang, Peiqi; Hu, Xing; Zhang, Youhui; Xie, Yuan (2019-01-27). FPSA: A Full System Stack Solution for Reconfigurable ReRAM-based NN Accelerator Architecture. OCLC 1106329050.
^ Nandakumar, S. R.; Boybat, Irem; Joshi, Vinay; Piveteau, Christophe; Le Gallo, Manuel; Rajendran, Bipin; Sebastian, Abu; Eleftheriou, Evangelos (November 2019). "Phase-Change Memory Models for Deep Learning Training and Inference". 2019 26th IEEE International Conference on Electronics, Circuits and Systems (ICECS). IEEE: 727–730. doi:10.1109/icecs46596.2019.8964852. ISBN 978-1-7281-0996-1. S2CID 210930121.
^ Joshi, Vinay; Le Gallo, Manuel; Haefeli, Simon; Boybat, Irem; Nandakumar, S. R.; Piveteau, Christophe; Dazzi, Martino; Rajendran, Bipin; Sebastian, Abu; Eleftheriou, Evangelos (2020-05-18). "Accurate deep neural network inference using computational phase-change memory". Nature Communications. 11 (1): 2473. arXiv:1906.03138. doi:10.1038/s41467-020-16108-9. ISSN 2041-1723. PMC 7235046. PMID 32424184.
^ "Summit: Oak Ridge National Laboratory's 200 petaflop supercomputer".
^ "Microsoft unveils Project Brainwave for real-time AI".
^ a b Marega, Guilherme Migliato; Zhao, Yanfei; Avsar, Ahmet; Wang, Zhenyu; Tripathi, Mukesh; Radenovic, Aleksandra; Kis, Andras (2020). "Logic-in-memory based on an atomically thin semiconductor". Nature. 587 (2): 72–77. doi:10.1038/s41586-020-2861-0. PMC 7116757. PMID 33149289.
^ a b c Feldmann, J.; Youngblood, N.; Karpov, M.; et al. (2021). "Parallel convolutional processing using an integrated photonic tensor core". Nature. 589 (2): 52–58. arXiv:2002.00281. doi:10.1038/s41586-020-03070-1.


