NeuralScale: Industry Leading General Purpose Programmable NPU Architecture based on RISC-V


By Mark Zhan, July 6, 2021

AI and its diverse applications have seen significantly increasing demand for AI computing in clouds over the last few years. Typical AI-enabled services include image and speech recognition, natural language processing, medical diagnosis, visual search, and personalized recommendations. AI computing in clouds includes two distinct workloads: training and inference. AI inference is reported to constitute more than 95% of AI computing workloads in clouds.

Meanwhile, we have seen significant growth in application-specific integrated circuits (ASICs) for AI inference in clouds. ASICs with customized silicon and memory hierarchy for specific models or algorithms have shown better performance and energy efficiency than traditional GPUs in AI inference. However, most ASICs can only execute the models or algorithms they were designed for, with poor or even no flexibility and programmability. As AI algorithms are still evolving, new operators and activation functions may be proposed, and migrating to new AI models will become a great challenge for ASICs in the future.

To promote the programmability of ASICs while retaining their performance and energy efficiency, we propose NeuralScale, a general-purpose neural processor architecture based on the RISC-V ISA, as RISC-V is meant to provide a basis for more specialized instruction-set extensions and customized accelerators. Our industrial product P920, implemented with 32 NeuralScale cores, achieves 256 TOPS (INT8) and 128 TFLOPS (FP16) peak performance at a clock frequency of 1.0 GHz in a TSMC 12nm FinFET process technology. Evaluation results on typical inference workloads show that our processor delivers state-of-the-art throughput and power efficiency.
NeuralScale Architecture Overview

The figure below illustrates a high-level overview of the NeuralScale architecture. The key components are a RISC-V scalar core and a neural processor core. The scalar core fetches and decodes all instructions, and dispatches each instruction to the correct path based on its type. Scalar instructions are executed in order in the scalar pipeline, while vector instructions flow through the scalar pipeline to the neural processor core.

The neural processor core combines the features of vector processors and AI inference accelerators. As shown in the figure below, the computation components include a MAC vector for executing vector operations, a MAC matrix for executing matrix operations, and a POLY module for complex arithmetic such as exp, div, and sqrt computations. On-chip memory components include a vector register file (the REG Bank module) as well as three local buffers, named the L1 Buffer, the Data Input Buffer, and the Intermediate Buffer respectively.

The neural processor core's pipeline is conceptually divided into four stages: decode, issue, execute, and write-back. Vector instructions are decoded into micro-ops in the decode unit and then dispatched to the issue unit. The issue unit issues instructions to the corresponding execution units based on their operation types. There are three execution units for the different operation types: a vector MAC engine (VME), a matrix MAC engine (MME), and a memory transmission engine (MTE). The issue unit maintains three instruction buffers tracking the state of all in-flight instructions in each execution unit. A dispatched instruction from the decode unit is buffered according to its operation type and removed once it is committed by its execution unit. All instructions are issued in order, and an instruction can be issued only when there is no address overlap with in-flight instructions. The issue unit can issue at most three instructions at a time, one to each execution unit. All three execution units can then work simultaneously, so memory latency can be partially hidden by computation, which lifts overall performance. After execution, the results are written back to the vector registers or the local buffers.
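The issue rule above (in-order issue, stalling whenever a micro-op's buffer addresses overlap any in-flight instruction) can be sketched in a few lines of Python. This is an illustrative model only: the class and engine names (`IssueUnit`, `Inst`, `VME`/`MME`/`MTE`) mirror the text, but the actual hardware tracks address ranges, not Python sets.

```python
from dataclasses import dataclass, field


@dataclass
class Inst:
    engine: str                      # "VME", "MME", or "MTE"
    addrs: frozenset = field(default_factory=frozenset)  # buffer addresses read/written


class IssueUnit:
    """In-order issue with one in-flight buffer per execution unit (sketch)."""

    def __init__(self):
        self.inflight = {"VME": [], "MME": [], "MTE": []}

    def _overlaps(self, inst):
        # An instruction conflicts if it touches any address used by
        # an in-flight instruction in *any* engine.
        return any(inst.addrs & other.addrs
                   for buf in self.inflight.values() for other in buf)

    def try_issue(self, inst):
        # Issue only when there is no address overlap; otherwise stall.
        if self._overlaps(inst):
            return False
        self.inflight[inst.engine].append(inst)
        return True

    def commit(self, inst):
        # Remove the instruction once its execution unit commits it.
        self.inflight[inst.engine].remove(inst)
```

For example, a load (MTE) into an address that a pending matrix operation (MME) will read must wait until the matrix operation commits, while an independent vector operation (VME) can issue immediately and run in parallel.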
The RISC-V Vector extension (RVV) enables processor cores based on the RISC-V instruction set architecture to process data arrays, alongside traditional scalar operations, to accelerate the computation of single instruction streams on large data sets. NeuralScale implements the base RVV extension (v0.8) together with customized vector extensions in the fixed-width 32-bit instruction format. We use the custom-3 opcode (1111011) in the RISC-V base opcode map as the major opcode for the customized vector extensions, marked as OP-VE. All customized vector extensions keep the source (rs1 and rs2) and destination (rd) registers at the same positions as the base RISC-V ISA does to simplify decoding, as shown in the table below.

The opm2 field encodes the source operand types and source locations. For a vector or matrix operand, the general-purpose register provides the memory address of the values, marked as (rsx). The matrix operation directions are encoded using the dmc and dm fields. Taking matrix-vector additions as an example, {dmc, dm} = 10 indicates adding a matrix with a row vector, while {dmc, dm} = 01 indicates adding a matrix with a column vector. The funct6 field encodes the operation type, including addition, subtraction, multiplication, accumulation, etc. Typical operations in AI inference workloads, such as convolutions and activation functions, are all covered.

In total, more than 50 customized instructions are added on top of the base RVV extension. For many matrix-related operations, information such as the height and width of the matrix cannot be encoded within the 32-bit fixed-width instruction. Therefore, 22 unprivileged vector CSRs are added to the base RVV extension. The customized vector CSRs can only be updated with the CSR instructions defined in the base scalar RISC-V ISA, and their values should be set properly to match application needs.
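A round-trip encoder/decoder makes the field layout concrete. In the sketch below, the rd/rs1/rs2 positions (bits 7–11, 15–19, 20–24) and the 7-bit custom-3 major opcode are standard RISC-V; placing funct6 in bits 31–26 follows the RVV convention, but since the post's encoding table is not reproduced here, treat the exact funct6 and remaining-field placement as an assumption for illustration.

```python
CUSTOM3_OPCODE = 0b1111011  # RISC-V custom-3 major opcode, used as OP-VE


def encode_opve(funct6, rs2, rs1, rd, extra=0):
    """Pack a 32-bit OP-VE instruction word (illustrative layout).

    rd/rs1/rs2 sit at the standard base-ISA positions; funct6 occupies the
    top six bits as in RVV, and `extra` stands in for the remaining fields
    (opm2, dmc, dm, ...) whose exact placement is assumed here.
    """
    assert 0 <= funct6 < 64 and all(0 <= r < 32 for r in (rd, rs1, rs2))
    return ((funct6 << 26) | (rs2 << 20) | (rs1 << 15)
            | ((extra & 0x7) << 12) | (rd << 7) | CUSTOM3_OPCODE)


def decode_opve(inst):
    """Unpack the register and funct6 fields from a 32-bit word."""
    assert inst & 0x7F == CUSTOM3_OPCODE, "not an OP-VE instruction"
    return {
        "funct6": (inst >> 26) & 0x3F,
        "rs2": (inst >> 20) & 0x1F,
        "rs1": (inst >> 15) & 0x1F,
        "rd": (inst >> 7) & 0x1F,
    }
```

Because the register fields never move, the scalar core's existing decode path can extract rd/rs1/rs2 before it even knows whether the instruction is scalar, base RVV, or OP-VE.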
STC P920: An Industrial SoC Implementation

Based on the NeuralScale architecture, we implement an industrial SoC platform named P920 for AI inference in clouds. The scalar core adopted in P920 is the AndesCore N25F, a 32-bit RISC-V CPU IP core with vector extension support. The scalar core has separate L1 data and instruction caches of 64 KB each. The neural processor core has a 1 MB L1 Data IO Buffer, a 256 KB L1 Weight Buffer, and a 256 KB Intermediate Buffer. The size of each local buffer in the neural processor core is selected based on experimental statistics of typical AI inference workloads, which helps avoid frequently exchanging data between on-chip memory and external memory. The MAC vector in the neural processor core has 64 MAC units, and the MAC matrix contains 64×32 MAC units. Each MAC unit supports both FP16 and INT8 arithmetic and is dynamically switched according to the operation type of each instruction.

The figure below shows a high-level overview of the P920 architecture. The key components include 32 NeuralScale cores, a 32 MB last level buffer (LLB), a hardware synchronization (HSYNC) subsystem, two PCIe subsystems, four DDR subsystems, a peripheral subsystem, and a CPU subsystem. All components are connected through an NoC with a regular 4×6 mesh-based topology. The links between each component and an NoC router, and the links between NoC routers, are all bidirectional. The NoC separates control flow and data flow to lift data transmission efficiency. The control bus is 32 bits wide in each direction and the data bus is 512 bits wide in each direction. At 1.0 GHz, each direction provides up to 64 GB/s of bandwidth, or 128 GB/s combined. The 32 MB LLB is split into eight separate small LLBs of 4 MB each. The small LLBs are connected to the NoC independently, providing 1 TB/s of memory bandwidth in total. Meanwhile, they are evenly distributed across the NoC so that other nodes can access an LLB with small latency. As there are 32 NeuralScale cores in total, an HSYNC subsystem is used to manage how these cores cooperate and synchronize. The NeuralScale cores can be divided into up to 16 groups by the HSYNC subsystems, and the number of cores in each group is configured by the application.
TensorTurbo: An E2E Inference Stack for STC P920

We implement an end-to-end inference stack named TensorTurbo for P920 that enables fast and efficient deployment of customers' pre-trained AI models, as shown in Figure 4. TensorTurbo mainly comprises a graph compiler and a heterogeneous program engine (HPE). The graph compiler is based on TVM and has been deeply customized for the NeuralScale architecture. It provides C++ and Python inference APIs for popular deep learning frameworks including TensorFlow, PyTorch, MXNet, and Keras. Graph intermediate representations (GIRs) from different frameworks are imported as unified TensorTurbo IRs via the inference API. The graph compiler then applies graph scheduling, operator scheduling, tiling strategies within an operator, and other optimizations to find the fastest implementation that best leverages the hardware features. The HPE provides high-level CUDA-style runtime APIs in the hardware abstraction layer (HAL), enabling functions like device management, kernel launch and management, and memory management. The HPE also provides utilities including a GDB debug tool, a performance profiling tool, and a system monitor interface tool, which access P920's debugging features (event logging, performance counters, breakpoints).

Evaluation Results

P920 was fabricated using TSMC's 12nm FinFET technology with a total area of 400 mm². It delivers 256 TOPS (INT8) and 128 TFLOPS (FP16) peak compute performance with a thermal design power of 130 W at a 1.0 GHz working frequency. We conduct a detailed performance evaluation of P920 with two typical AI inference workloads in clouds: the ResNet-50 CNN model for vision tasks and the BERT model for NLP tasks. As a comparison, experiments are also conducted on two GPU devices (Nvidia T4 and Nvidia V100) and an AI chip (Habana Goya).
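Throughput and latency measurements of this kind ultimately come down to timing repeated calls into a compiled module. The harness below is a generic sketch of such a measurement; since the post does not detail TensorTurbo's Python API, it takes any inference callable as a stand-in for a compiled module's run entry point, and power efficiency would then be throughput divided by measured (or TDP) power.

```python
import statistics
import time


def benchmark(infer, batch, n_iter=100, warmup=10):
    """Measure throughput (samples/s) and mean latency (ms) of `infer`.

    `infer` is any callable taking a batch size -- a stand-in for a
    compiled inference module; warmup iterations are excluded so that
    one-time setup costs do not skew the statistics.
    """
    for _ in range(warmup):
        infer(batch)
    latencies = []
    for _ in range(n_iter):
        t0 = time.perf_counter()
        infer(batch)
        latencies.append(time.perf_counter() - t0)
    mean_s = statistics.mean(latencies)
    return {"throughput": batch / mean_s, "latency_ms": mean_s * 1e3}
```

Note that throughput and latency pull in opposite directions: large batches lift samples/s while small batches minimize per-request latency, which is why the comparison below reports both.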
The performance results, including throughput, power efficiency, and latency of the four platforms, are shown in the figure below. Compared to the GPUs, our P920 chip takes the lead in all three aspects on both ResNet and BERT. Compared to the Habana Goya chip, our P920 takes the lead in throughput and power efficiency. To be specific, P920 delivers comparable throughput on ResNet and 2.37 times the throughput on BERT. Besides, P920's power efficiency is 1.50 times that of the Habana Goya chip on ResNet and 3.56 times on BERT.

What's Next?

NeuralScale takes advantage of customized RISC-V vector extensions to improve programmability and performance. Evaluations on our industrial product P920 demonstrate that our processor can achieve state-of-the-art inference performance on both CNN and NLP tasks. We are about to mass-produce the P920 chip to provide high-performance, high-efficiency, and highly programmable solutions for AI computing in clouds. Meanwhile, we plan to release ~30 optimized NN models for the STC P920 chip to cover as many application needs as possible. Optimizations on NeuralScale will also be done in future work to further lift overall performance for our next-generation products.

"The simple, modular architecture and extensibility of the RISC-V ISA made the design of our NPC possible. This design freedom enabled us to create an extremely powerful computing core for neural networks that is also super power-efficient, scalable and programmable," said Mark Zhan at Stream Computing. "We look forward to collaborating with the RISC-V community to drive more open source AI innovation for commercial applications."


