NeuralScale: Industry Leading General Purpose Programmable NPU Architecture based on RISC-V
By Mark Zhan, July 6, 2021

AI and its diverse applications have seen significantly increasing demand for AI computing in clouds over the last few years. Typical AI-enabled services include image and speech recognition, natural language processing, medical diagnosis, visual search, and personalized recommendations. AI computing in clouds comprises two distinct workloads: training and inference. AI inference is reported to constitute more than 95% of AI computing workloads in clouds.

Meanwhile, we have seen significant growth in application-specific integrated circuits (ASICs) for AI inference in clouds. ASICs with customized silicon and memory hierarchies for specific models or algorithms have shown better performance and energy efficiency than traditional GPUs in AI inference. However, most ASICs can only execute the models or algorithms they were designed for, with poor or even no flexibility and programmability. As AI algorithms are still evolving and new operators and activation functions continue to be proposed, migrating ASICs to new AI models will become a great challenge.

To promote the programmability of ASICs while retaining their performance and energy efficiency, we propose NeuralScale, a general-purpose neural processor architecture based on the RISC-V ISA, as RISC-V is meant to provide a basis for more specialized instruction-set extensions or customized accelerators. Our industrial product P920, implemented with 32 NeuralScale cores, achieves 256 TOPS (INT8) and 128 TFLOPS (FP16) peak performance at a clock frequency of 1.0 GHz in a TSMC 12 nm FinFET process technology. Evaluation results on typical inference workloads show that our processor delivers state-of-the-art throughput and power efficiency.
NeuralScale Architecture Overview

The figure below illustrates a high-level overview of the NeuralScale architecture. The key components include a RISC-V scalar core and a neural processor core. The scalar core fetches and decodes all instructions and dispatches them to the correct path based on their types. Scalar instructions are executed in order in the scalar pipeline, while vector instructions flow through the scalar pipeline to the neural processor core.

The neural processor core combines the features of vector processors and AI inference accelerators. As shown in the figure below, the computation components include a MAC vector for executing vector operations, a MAC matrix for executing matrix operations, and a POLY module for complex arithmetic such as exp, div, and sqrt. On-chip memory components include a vector register file (the REG Bank module) as well as three local buffers, named L1 Buffer, Data Input Buffer, and Intermediate Buffer respectively.

The neural processor core's pipeline is conceptually divided into four stages: decode, issue, execute, and write-back. Vector instructions are decoded into micro-ops in the decode unit and then dispatched to the issue unit. The issue unit issues instructions to the corresponding execution units based on their operation types. There are three execution units for the different operation types: a vector MAC engine (VME), a matrix MAC engine (MME), and a memory transmission engine (MTE). The issue unit maintains three instruction buffers tracking the state of all in-flight instructions in each execution unit. A dispatched instruction from the decode unit is buffered according to its operation type and removed once it is committed by the execution unit. All instructions are issued in order, and an instruction can be issued only when there is no address overlap with in-flight instructions. The issue unit can issue at most three instructions at a time, one to each execution unit. In that case all three execution units work simultaneously, so memory latency can be partially hidden by computation, which lifts overall performance. After execution, the results are written back to vector registers or local buffers.
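The issue logic above can be sketched as a toy model: one instruction buffer per execution unit, with an instruction allowed to issue only when its addresses do not overlap any in-flight instruction. This is a minimal illustration of the hazard check, not the actual hardware logic; all names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class VectorInstr:
    unit: str          # "VME", "MME", or "MTE" (target execution unit)
    addr_range: range  # buffer addresses the instruction reads/writes

class IssueUnit:
    """Toy model of the in-order issue logic: an instruction issues
    only if its address range does not overlap any in-flight one."""
    def __init__(self):
        self.inflight = {"VME": [], "MME": [], "MTE": []}

    def _overlaps(self, instr):
        return any(
            set(instr.addr_range) & set(other.addr_range)
            for buf in self.inflight.values() for other in buf
        )

    def try_issue(self, instr):
        if self._overlaps(instr):
            return False            # stall: address hazard with in-flight op
        self.inflight[instr.unit].append(instr)
        return True

    def commit(self, instr):
        self.inflight[instr.unit].remove(instr)

# An MTE load and an MME matrix op touching the same buffer region:
# the MME op must wait until the load commits.
iu = IssueUnit()
load = VectorInstr("MTE", range(0, 64))
mac = VectorInstr("MME", range(0, 64))
print(iu.try_issue(load))   # True
print(iu.try_issue(mac))    # False (overlaps the in-flight load)
iu.commit(load)
print(iu.try_issue(mac))    # True
```

In the real pipeline this check is what allows the VME, MME, and MTE to run simultaneously on independent data while still preserving in-order semantics.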
The RISC-V Vector extension (RVV) enables processor cores based on the RISC-V instruction set architecture to process data arrays, alongside traditional scalar operations, to accelerate the computation of single instruction streams on large data sets. NeuralScale implements the base RVV extension (v0.8) plus customized vector extensions with a fixed-width 32-bit instruction format. We use the custom-3 opcode (1111011) in the RISC-V base opcode map as the major opcode for the customized vector extensions, marked as OP-VE. All customized vector extensions keep the source (rs1 and rs2) and destination (rd) registers at the same positions as the base RISC-V ISA does to simplify decoding, as shown in the table below.

The opm2 field encodes the source operand types and source locations. For a vector or matrix operand, the general-purpose register provides the memory address of the values, marked as (rsx). The matrix operation directions are encoded using the dmc and dm fields. Taking matrix-vector addition as an example, {dmc, dm} = 10 indicates adding a matrix with a row vector, while {dmc, dm} = 01 indicates adding a matrix with a column vector. The funct6 field encodes operation types, including addition, subtraction, multiplication, accumulation, etc. Typical operations in AI inference workloads, such as convolutions and activation functions, are all covered.

In total, more than 50 customized instructions are added on top of the base RVV extension. For many matrix-related operations, information such as the height and width of the matrix cannot be encoded within the 32-bit fixed-width instruction. Therefore, 22 unprivileged vector CSRs are added to the base RVV extension. The customized vector CSRs can only be updated with the CSR instructions defined in the base scalar RISC-V ISA, and their values should be set to match application needs.
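The decoding simplification mentioned above can be sketched as a small encoder/decoder. Only the opcode slot ([6:0]) and the rd/rs1/rs2 slots ([11:7], [19:15], [24:20]) below follow the standard base-ISA layout; where the remaining custom fields (opm2, dmc/dm, funct6) land is an assumption for illustration, not the published OP-VE encoding, so they are bundled into one opaque payload here.

```python
CUSTOM_3 = 0b1111011  # custom-3 major opcode from the RISC-V base opcode map

def encode_opve(rd, rs1, rs2, custom_bits):
    """Pack a 32-bit OP-VE word. rd/rs1/rs2 sit at the same bit
    positions as in the base RISC-V ISA; the remaining 10 bits carry
    the custom fields (opm2, dmc/dm, funct6), layout assumed."""
    assert rd < 32 and rs1 < 32 and rs2 < 32 and custom_bits < 1024
    word = CUSTOM_3                            # bits [6:0]
    word |= rd << 7                            # bits [11:7], base-ISA slot
    word |= (custom_bits & 0x7) << 12          # bits [14:12], funct3-like slot
    word |= rs1 << 15                          # bits [19:15], base-ISA slot
    word |= rs2 << 20                          # bits [24:20], base-ISA slot
    word |= ((custom_bits >> 3) & 0x7F) << 25  # bits [31:25], funct6/dm slots
    return word

def decode_regs(word):
    """Register extraction shared with the base-ISA decoder: because the
    register slots never move, this one routine serves every format."""
    return (word >> 7) & 0x1F, (word >> 15) & 0x1F, (word >> 20) & 0x1F

w = encode_opve(3, 4, 5, 0b1010101010)
print(hex(w & 0x7F))       # 0x7b -> custom-3 major opcode
print(decode_regs(w))      # (3, 4, 5)
```

Keeping the register fields fixed means the scalar core's existing register-extraction path can be reused unchanged for the customized vector instructions, which is exactly the decoding simplification the encoding aims for.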
STC P920: An Industrial SoC Implementation

Based on the NeuralScale architecture, we implement an industrial SoC platform named P920 for AI inference in clouds. The scalar core adopted in P920 is the AndesCore N25F, a 32-bit RISC-V CPU IP core with vector extension support. The scalar core has separate L1 data and instruction caches of 64 KB each. The neural processor core has a 1 MB L1 Data IO Buffer, a 256 KB L1 Weight Buffer, and a 256 KB Intermediate Buffer. The size of each local buffer in the neural processor core is selected based on experimental statistics of typical AI inference workloads, which helps avoid frequently exchanging data between on-chip and external memory. The MAC vector in the neural processor core has 64 MAC units, and the MAC matrix contains 64×32 MAC units. Each MAC unit supports both FP16 and INT8 arithmetic, switched dynamically according to the operation type of each instruction.

The figure below shows a high-level overview of the P920 architecture. The key components include 32 NeuralScale cores, a 32 MB last-level buffer (LLB), a hardware synchronization (HSYNC) subsystem, two PCIe subsystems, four DDR subsystems, a peripheral subsystem, and a CPU subsystem. All components are connected through an NoC with a regular 4×6 mesh-based topology. The links between each component and an NoC router, and the links between NoC routers, are all bidirectional. The NoC separates control flow from data flow to lift data transmission efficiency. The control bus is 32 bits wide in each direction and the data bus is 512 bits wide in each direction. At 1.0 GHz, each direction provides up to 64 GB/s of bandwidth, or 128 GB/s combined. The 32 MB LLB is split into eight separate small LLBs of 4 MB each. The small LLBs are connected to the NoC independently, providing 1 TB/s of memory bandwidth in total. Meanwhile, they are evenly distributed across the NoC so that other nodes can access an LLB with low latency. As there are 32 NeuralScale cores in total, an HSYNC subsystem manages how these cores cooperate and synchronize. The NeuralScale cores can be divided into up to 16 groups by the HSYNC subsystems, and the number of cores in each group is configured by the application.
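The headline numbers above can be sanity-checked with back-of-the-envelope arithmetic. The assumptions below are mine, not stated in the post: one MAC counts as 2 ops (multiply + add) per cycle, and each MAC unit performs twice as many INT8 ops as FP16 ops.

```python
# Peak compute: 32 cores, each with a 64x32 MAC matrix, at 1.0 GHz.
CORES = 32
FREQ_GHZ = 1.0
MATRIX_MACS = 64 * 32          # per-core MAC matrix (the 64-unit MAC
                               # vector adds a few TFLOPS on top)

fp16_tflops = CORES * MATRIX_MACS * 2 * FREQ_GHZ / 1e3   # 2 ops per MAC
int8_tops = 2 * fp16_tflops                              # INT8 at 2x FP16 rate
print(fp16_tflops, int8_tops)  # 131.072 262.144 -> close to the quoted
                               # 128 TFLOPS (FP16) / 256 TOPS (INT8)

# NoC data bus: 512 bits per direction at 1.0 GHz.
per_dir_gbps = 512 / 8 * FREQ_GHZ    # 64 GB/s per direction
combined_gbps = 2 * per_dir_gbps     # 128 GB/s combined
# Eight independently connected 4 MB LLB slices in aggregate:
llb_total_gbps = 8 * combined_gbps   # 1024 GB/s, i.e. the quoted ~1 TB/s
print(per_dir_gbps, combined_gbps, llb_total_gbps)
```

Splitting the LLB into eight slices is what turns a single 128 GB/s link into ~1 TB/s of aggregate buffer bandwidth, since each slice gets its own NoC attachment point.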
TensorTurbo: An E2E Inference Stack for STC P920

We implement an end-to-end inference stack named TensorTurbo for P920 that enables fast and efficient deployment of customers' pre-trained AI models, as shown in Figure 4. TensorTurbo mainly comprises a graph compiler and a heterogeneous program engine (HPE). The graph compiler is based on TVM and has been deeply customized for the NeuralScale architecture. It provides C++ and Python inference APIs for popular deep learning frameworks including TensorFlow, PyTorch, MXNet, and Keras. Graph intermediate representations (GIRs) from different frameworks are imported as unified TensorTurbo IRs via the inference API. The graph compiler then applies graph scheduling, operator scheduling, tiling strategies within an operator, and other optimizations to find the fastest implementation that best leverages the hardware features. The HPE provides high-level CUDA-style runtime APIs in the hardware abstraction layer (HAL), enabling functions such as device management, kernel launch and management, and memory management. The HPE also provides utilities including a GDB debug tool, a performance profiling tool, and a system monitor interface tool that access P920's debugging features (event logging, performance counters, breakpoints).

Evaluation Results

P920 was fabricated using TSMC's 12 nm FinFET technology with a total area of 400 mm². It delivers 256 TOPS (INT8) and 128 TFLOPS (FP16) peak compute performance with a thermal design power of 130 W at a 1.0 GHz working frequency. We conduct a detailed performance evaluation of P920 with two typical AI inference workloads in clouds: the ResNet-50 CNN model for vision tasks and the BERT model for NLP tasks. For comparison, experiments are also conducted on two GPU devices (Nvidia T4 and Nvidia V100) and an AI chip (Habana Goya).
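The "CUDA-style" runtime pattern the HPE exposes (allocate device memory, copy host data in, launch a kernel, copy results back) can be illustrated with a toy model. Every class and method name below is a hypothetical stand-in for that pattern, not the actual TensorTurbo/HPE API.

```python
# Toy model of a CUDA-style HAL flow: device management, memory
# management, and kernel launch, as described for the HPE above.
class Device:
    def __init__(self, dev_id):
        self.dev_id = dev_id
        self.mem = {}                      # named device allocations

    def malloc(self, name, nbytes):
        self.mem[name] = bytearray(nbytes)  # device-side allocation

    def memcpy_h2d(self, name, data):
        self.mem[name][:len(data)] = data   # host -> device copy

    def launch(self, kernel, *buf_names):
        kernel(*(self.mem[n] for n in buf_names))  # run kernel in place

    def memcpy_d2h(self, name):
        return bytes(self.mem[name])        # device -> host copy

def relu_kernel(buf):
    # Stand-in "kernel": treat each byte as signed int8, clamp negatives.
    for i, b in enumerate(buf):
        if b > 127:                         # 128..255 are negative as int8
            buf[i] = 0

dev = Device(0)
dev.malloc("x", 4)
dev.memcpy_h2d("x", bytes([1, 200, 3, 255]))  # 200, 255 negative as int8
dev.launch(relu_kernel, "x")
print(dev.memcpy_d2h("x"))   # b'\x01\x00\x03\x00'
```

The point of such a HAL is that the graph compiler can emit kernel launches against this narrow interface without knowing NoC or buffer details, mirroring how CUDA separates host code from device execution.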
The performance results, including throughput, power efficiency, and latency on all four platforms, are shown in the figure below. Compared to the GPUs, our P920 chip takes the lead in all three aspects on both ResNet and BERT. Compared to the Habana Goya chip, P920 takes the lead in throughput and power efficiency. To be specific, P920 delivers comparable throughput on ResNet and 2.37 times the throughput on BERT. Besides, P920's power efficiency is 1.50 times that of the Habana Goya chip on ResNet and 3.56 times on BERT.

What's Next?

NeuralScale takes advantage of customized RISC-V vector extensions to improve programmability and performance. Evaluations on our industrial product P920 demonstrate that our processor can achieve state-of-the-art inference performance on both CNN and NLP tasks. We are about to mass-produce the P920 chip to provide high-performance, high-efficiency, and highly programmable solutions for AI computing in clouds. Meanwhile, we plan to release ~30 optimized NN models for the STC P920 chip to cover as many application needs as possible. Optimizations of NeuralScale will also be carried out in future work to further lift overall performance for our next-generation products.

"The simple, modular architecture and extensibility of the RISC-V ISA made the design of our NPC possible. This design freedom enabled us to create an extremely powerful computing core for neural networks that is also super power-efficient, scalable and programmable," said Mark Zhan at Stream Computing. "We look forward to collaborating with the RISC-V community to drive more open source AI innovation for commercial applications."