# AccDNN (Accelerator Core Compiler for Deep Neural Network)

A compiler from AI model to RTL (Verilog) accelerator in FPGA hardware with auto design space exploration. This project is also named DNNBuilder in our academic research.
## Project Description

In this project, we propose a novel solution that automatically converts a Caffe-trained deep neural network into an FPGA RTL-level implementation without involving any programming effort, and also provides uniform APIs to users for their recognition tasks. Developers without any FPGA programming experience can therefore deploy their FPGA-accelerated deep learning service in the data center or on edge devices by providing only their trained Caffe model. This work was published in ICCAD'18 and won the Best Paper Award for Front-end. For more design details, please refer to our paper.

The conversion consists of three stages:

1. The Caffe net file is first parsed to obtain the net structure. We estimate the workload of each layer to determine the parallelism level under the constraints of FPGA resources.
2. Each layer defined in the net generates a customized Verilog module by instantiating the corresponding neural layer in the library. The top-level module is also generated by connecting these customized instances together based on the layer sequence defined in the net file, and the required on-chip memory for weights is generated in this stage as well.
3. The generated source files are synthesized, placed, and routed to produce the executable FPGA bitfile.

## AccDNN Constraints

- Only models trained by the Caffe framework are supported.
- Only the convolutional layer, max pooling layer, fully connected layer, and batch normalization layer are supported.
- The total number of convolutional and fully connected layers in the network defined in the Caffe .prototxt should be less than 15.

## Requirements

To make sure you can use the quantized Caffe model, please install Ristretto Caffe instead of BVLC Caffe following the instructions here (tested on rc3), and also build the Python Caffe interface by running `make pycaffe` and `pip install -r requirements.txt` in `caffe/python`. Make sure that you have compiled the Python Caffe interface and that it is on your `PYTHONPATH`. Please also set `ACCDNN_ROOT`.
```
export PYTHONPATH=path/to/caffe/python
export ACCDNN_ROOT=path/to/AccDNN
```

- Clone the Power-AI-Engine repository, and add the Power-AI-Engine SDK to the environment (optional, for IBM POWER FPGA Acceleration).

  ```
  export FPGA_SDK_PATH=path/to/Power-AI-Engine/FPGA-SDK
  ```

- Install the Xilinx Vivado software and add it to the environment as well; the hardware SDK was tested on Vivado 2017.4.

  ```
  export VIVADO_PATH=path/to/Xilinx/Vivado/201x.x/bin
  ```

- Clone the AccDNN repository.

  ```
  git clone https://github.com/IBM/AccDNN.git
  ```

## Run the cifar10 demo to only generate the IP core of the accelerator

```
python ./codegen.py example/cifar10/cifar10_quick.prototxt \
       example/cifar10/cifar10_quick_iter_5000.caffemodel \
       --optim_file example/cifar10/optim_cifar10.conf \
       --batch_size 1 \
       --profile
```

The parameters are very similar to those of the `pie` command. You can use the `--profile` parameter to get a profile of the accelerator, including a summary of the network structure, the FPGA resource usage, and the projected performance. If you omit this parameter, the IP core of the accelerator will be generated in the `./build` directory, which may take several minutes depending on the model size.

The build directory generated by AccDNN includes:

- `src/` All the generated Verilog source files; the top module is `model.v`.
- `coe/` All the weight-related files, including the `.coe` files for ROM and the `.bin` files for DDR.
- `timing/` All the timing-constraint-related files.
- `ips.tcl` A TCL script used to generate the Xilinx IP cores that will be instantiated in the accelerator.
- `imp_file.f` A list of the Verilog source files used in the accelerator, including the library Verilog files.
- `file_list.txt` Add this file to your Vivado project to import all the files the accelerator requires into your customized Vivado project.

## Data input/output format

### Input format

The input data sequence follows the WHC format, which is different from Caffe's CHW format. For batch mode, the input data should be interleaved. For example, if the batch size is 2, the input data of each pixel should be R1, R2, G1, G2, B1, B2, ... If `INPUT_CHANNEL_PADDING` in `settings.py` is set to 1, the padding should also be interleaved, as R1, R2, G1, G2, B1, B2, 0, 0, ...
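The interleaving rule above can be sketched in a few lines of plain Python. This is a toy illustration only: the channel labels are symbolic strings, and the `pad_channels` parameter mimics the effect of `INPUT_CHANNEL_PADDING`.

```python
# Sketch: interleave per-pixel channel values across a batch, as described
# above (batch size 2 -> R1, R2, G1, G2, B1, B2, ...).

def interleave_pixel(batch_pixels, pad_channels=0):
    """batch_pixels: one list of channel values per image in the batch,
    e.g. [["R1", "G1", "B1"], ["R2", "G2", "B2"]].
    Returns the interleaved stream, optionally followed by interleaved
    zero padding (as when INPUT_CHANNEL_PADDING is enabled)."""
    batch = len(batch_pixels)
    out = []
    for c in range(len(batch_pixels[0])):        # walk the channels
        out.extend(batch_pixels[i][c] for i in range(batch))
    out.extend([0] * (pad_channels * batch))     # trailing zero padding
    return out

print(interleave_pixel([["R1", "G1", "B1"], ["R2", "G2", "B2"]], pad_channels=1))
# → ['R1', 'R2', 'G1', 'G2', 'B1', 'B2', 0, 0]
```

The same pattern extends to any batch size: channel values stay grouped, with the batch index varying fastest within each group.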
### Output format

The output data sequence is a little more complicated.

- If the last layer is a fully connected layer, the output sequence is simple: the same sequence as the output vector.
- If the last layer is a convolutional layer, the result is output column by column. In each column, the output is interleaved according to the KPF of the last convolutional layer: the first KPF elements of the first feature (the red block in h(1)) are output, followed by the first KPF elements of the second feature (the red block in h(2)). After all the red blocks in this column have been output, the second KPF elements in each feature of this column are output in sequence; then the blue block, the green block, and so on. (See the output-format example figure in the repository.)

If the batch size is larger than 1, each block contains KPF * batch_size elements: the first KPF elements of the first image, followed by the first KPF elements of the second image, and so on.

### Quantization/Precision constraints for activations and weights

The bit width of the activations can be 16 or 8 bits, and the bit width of the weights can be 16, 8, or 4 bits. The bit width of the weights cannot be larger than that of the activations; for example, 4-bit activations with 8-bit weights are not allowed.

- When the activations are 16 bits, the bit width of the weights can be 16/8/4 bits, and each DSP block is used for only one multiplier.
- When the activations are 8 bits, the bit width of the weights can be 8/4 bits, and each DSP block is used for two multipliers, doubling the throughput.
- When the activations are 8 bits (and the weights 8/4 bits), the KPF and the kernel number of the layer should both be even. If the kernel number of a particular layer is odd, one extra padding channel (with all weights zero) is required. Otherwise, both activations and weights are padded to 16 bits, and the doubled throughput cannot be achieved.

## Simulation without involving hardware

We will use a tiny neural network trained on CIFAR10 data to demonstrate the simulation procedure in AccDNN. Only Vivado 2013.4 is supported in this simulation environment.

- Set AccDNN to the simulation environment by changing the variable `SIMULATION_ONLY` in `settings.py` to `True`.
- Use AccDNN to convert the target deep neural network to Verilog HDL source code; here we take a tiny network for CIFAR10 as the example.

  ```
  python ./codegen.py example/cifar10/cifar10_quick.prototxt \
         example/cifar10/cifar10_quick_iter_5000.caffemodel \
         --optim_file example/cifar10/optim_cifar10.conf
  ```

- Use the command `./bin/sim_file_gen.sh` to generate the simulation environment.
- Use the following command to generate the simulation test data.

  ```
  python tools/sim_data_gen.py example/cifar10/cifar10_quick.prototxt \
         example/cifar10/cifar10_quick_iter_5000.caffemodel \
         example/cifar10/test.png
  ```

- `cd sim/tb/` and run `vsim` to start ModelSim.
- In the Transcript, modify the Xilinx IP simulation library path in `sim_model.tcl` (lines 23-25), then type the command `source sim_model.tcl`, run `comp_model` to compile the simulation project, and use `sim` to start the simulation process.
- Use the command `python tools/compare.py sim_result_file real_file` to verify the correctness of the simulation result. All the files are stored in the `sim/data` directory. For example, if you want to check whether the simulation output of `pool3` is correct, you can use the following command. Please note that a GUI is required in this comparison step.

  ```
  python tools/compare.py sim/data/pool3_sim.dat sim/data/pool3.dat
  ```

## Beyond the demo: quantization, batch mode, and tips for high FPGA resource utilization

The DW, WQ, and DQ settings are only available when the input model file has no quantization information. You can also use explicit quantization settings in the model file; the format defined in Ristretto Caffe is supported, so a model trained or tuned by Ristretto Caffe can be directly input to AccDNN.

Due to the limited bandwidth of off-chip memory, batch mode has proved effective at increasing data reuse. You can set `BATCH_SIZE` in `settings.py`; the maximum batch size is 32.

It is much better to set an appropriate CPF and KPF for each layer in the .conf file to achieve high FPGA resource utilization. CPF is the number of channels in a 3D convolution computed simultaneously; KPF is the number of kernels in the 3D convolutions computed simultaneously. There is an example in `example/cifar10/optim_cifar10.conf`.
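As a rough intuition for tuning CPF and KPF: each layer's delay scales with its workload divided by CPF × KPF, and a pipeline is limited by its slowest stage. The sketch below illustrates this; the layer names, MAC counts, and the linear delay model are made-up assumptions for illustration, not AccDNN's actual profiler.

```python
# Sketch: estimate each layer's pipeline delay as workload / (CPF * KPF)
# and find the bottleneck stage. The MAC counts below are invented
# illustrative numbers; real figures come from AccDNN's --profile report.

def layer_delays(layers):
    """layers: list of (name, macs, cpf, kpf) tuples -> {name: cycles}."""
    return {name: macs / (cpf * kpf) for name, macs, cpf, kpf in layers}

net = [
    ("conv1", 4_718_592, 3, 16),   # under-parallelized: the bottleneck
    ("conv2", 9_437_184, 8, 16),
    ("conv3", 2_359_296, 4, 8),
]
delays = layer_delays(net)
print(max(delays, key=delays.get))  # → conv1 (raise its CPF/KPF to rebalance)
```

Here conv2 and conv3 both take 73728 cycles while conv1 takes 98304, so increasing conv1's parallelism (or lowering the others') evens out the stage delays, which is exactly the adjustment described next.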
Adjust each layer's CPF and KPF so that the layers have similar delays, which makes the whole pipeline more efficient. You can first use the following command to profile the network (taking the cifar10 model as the example); the final FPGA resource utilization is also reported, and it should ideally be close to 1.0.

```
python ./codegen.py example/cifar10/cifar10_quick.prototxt \
       example/cifar10/cifar10_quick_iter_5000.caffemodel \
       --optim_file example/cifar10/optim_cifar10.conf \
       --profile
```

The report generated by the profiling also gives the required DDR bandwidth. A higher CPF/KPF requires higher DDR bandwidth and achieves much lower latency; it is not a good design if the required DDR bandwidth is much larger than the physical DDR bandwidth. The total DSPs and on-chip memory (block RAM) required by the design are also reported. After determining the CPF/KPF, you can set an appropriate batch size to fully utilize the DSP and block RAM resources of the FPGA.

Each layer (if it has weights) requires a DMA channel. For better timing, it is much better to set the DMA delay in the .conf file, especially for large-scale FPGAs. This value should be between 0 and 2 [default = 0]. At the beginning you can set it to 0; if you find a serious timing issue in a DMA channel after routing, you can set this value manually to improve the timing of the DMA module.

## Auto optimization

AccDNN also provides auto optimization under given FPGA resource constraints to achieve low latency and maximal throughput. If you don't provide the optim_file, auto optimization will be performed.

## Other demos

Besides the cifar10 demo, we also provide the ZF, VGG16, and YOLO models in `example/`.
## Contact

- Junsong Wang, IBM Research China, [email protected]
- Xiaofan Zhang, University of Illinois at Urbana-Champaign, [email protected]

## Citation

If you find AccDNN/DNNBuilder useful in your research, please consider citing our paper:

```
@inproceedings{DNNBuilder,
  title     = {DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs},
  author    = {Xiaofan Zhang and Junsong Wang and Chao Zhu and Yonghua Lin and Jinjun Xiong and Wen-mei Hwu and Deming Chen},
  booktitle = {Proceedings of the IEEE/ACM International Conference on Computer-Aided Design},
  year      = {2018}
}
```


