Unicode: flag "u" and class \p{...} - The Modern JavaScript ...


JavaScript uses Unicode encoding for strings. Most characters are encoded with 2 bytes, but that allows representing at most 65536 characters.

That range is not big enough to encode all possible characters, which is why some rare characters are encoded with 4 bytes, for instance 𝒳 (mathematical X) or 😄 (a smile), some hieroglyphs and so on.

Here are the Unicode values of some characters:

Character   Unicode   Bytes count in Unicode
a           0x0061    2
≈           0x2248    2
𝒳           0x1d4b3   4
𝒴           0x1d4b4   4
😄          0x1f604   4

So characters like a and ≈ occupy 2 bytes, while the codes for 𝒳, 𝒴 and 😄 are longer: they have 4 bytes.

A long time ago, when the JavaScript language was created, Unicode encoding was simpler: there were no 4-byte characters. So, some language features still handle them incorrectly.

For instance, length thinks that there are two characters here:

    alert('😄'.length); // 2
    alert('𝒳'.length); // 2

…But we can see that there's only one, right? The point is that length treats 4 bytes as two 2-byte characters. That's incorrect, because they must be considered only together (a so-called “surrogate pair”, you can read about them in the article Strings).

By default, regular expressions also treat 4-byte “long characters” as a pair of 2-byte ones. And, as happens with strings, that may lead to odd results. We'll see that a bit later, in the article Sets and ranges [...].

Unlike strings, regular expressions have flag u that fixes such problems. With this flag, a regexp handles 4-byte characters correctly. Unicode property search also becomes available; we'll get to it next.

Unicode properties \p{…}

Every character in Unicode has a lot of properties. They describe what “category” the character belongs to and contain miscellaneous information about it.
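Returning to the 4-byte characters discussed before the property section: the difference between counting 2-byte units and counting whole characters can be checked with a short sketch. It uses console.log instead of alert so that it also runs outside a browser, and the variable names are illustrative only.

```javascript
// Sketch: how string length and regexps see a 4-byte character.
const smile = "😄"; // U+1F604, stored as a surrogate pair

// length counts 2-byte units, not whole characters
const codeUnits = smile.length;

// Spreading a string iterates by code points, so the pair stays together
const codePoints = [...smile].length;

// Without flag u, the dot matches a single 2-byte unit (half of the pair)
const withoutU = smile.match(/./g).length;

// With flag u, the dot matches the whole 4-byte character
const withU = smile.match(/./gu).length;

console.log(codeUnits, codePoints, withoutU, withU); // 2 1 2 1
```

Only the last regexp treats the smile as one character, which is the behaviour that flag u is meant to provide.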
For instance, if a character has the Letter property, it means that the character belongs to an alphabet (of any language). And the Number property means that it's a digit: maybe Arabic or Chinese, and so on.

We can search for characters with a property, written as \p{…}. To use \p{…}, a regular expression must have flag u.

For instance, \p{Letter} denotes a letter in any language. We can also use \p{L}, as L is an alias of Letter. There are shorter aliases for almost every property.

In the example below three kinds of letters will be found: English, Georgian and Korean.

    let str = "A ბ ㄱ";

    alert( str.match(/\p{L}/gu) ); // A,ბ,ㄱ
    alert( str.match(/\p{L}/g) ); // null (no matches, \p doesn't work without the flag "u")

Here are the main character categories and their subcategories:

Letter L: lowercase Ll, modifier Lm, titlecase Lt, uppercase Lu, other Lo.
Number N: decimal digit Nd, letter number Nl, other No.
Punctuation P: connector Pc, dash Pd, initial quote Pi, final quote Pf, open Ps, close Pe, other Po.
Mark M (accents etc): spacing combining Mc, enclosing Me, non-spacing Mn.
Symbol S: currency Sc, modifier Sk, math Sm, other So.
Separator Z: line Zl, paragraph Zp, space Zs.
Other C: control Cc, format Cf, not assigned Cn, private use Co, surrogate Cs.

So, e.g. if we need letters in lowercase, we can write \p{Ll}, punctuation signs: \p{P} and so on.

There are also other derived categories, like:

Alphabetic (Alpha), includes Letters L, plus letter numbers Nl (e.g. Ⅻ, a character for the roman number 12), plus some other symbols Other_Alphabetic (OAlpha).
Hex_Digit includes hexadecimal digits: 0-9, a-f, A-F.
…And so on.

Unicode supports many different properties. Their full list would require a lot of space, so here are the references:

List all properties by a character: https://unicode.org/cldr/utility/character.jsp.
List all characters by a property: https://unicode.org/cldr/utility/list-unicodeset.jsp.
Short aliases for properties: https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt.
A full base of Unicode characters in text format, with all properties, is here: https://www.unicode.org/Public/UCD/latest/ucd/.

Example: hexadecimal numbers

For instance, let's look for hexadecimal numbers, written as xFF, where F is a hex digit (0…9 or A…F).
A hex digit can be denoted as \p{Hex_Digit}:

    let regexp = /x\p{Hex_Digit}\p{Hex_Digit}/u;

    alert("number: xAF".match(regexp)); // xAF

Example: Chinese hieroglyphs

Let's look for Chinese hieroglyphs.

There's a Unicode property Script (a writing system), that may have a value: Cyrillic, Greek, Arabic, Han (Chinese) and so on.

To look for characters in a given writing system we should use Script=<value> (sc is the short alias of Script), e.g. for Cyrillic letters: \p{sc=Cyrillic}, for Chinese hieroglyphs: \p{sc=Han}, and so on:

    let regexp = /\p{sc=Han}/gu; // returns Chinese hieroglyphs

    let str = `Hello Привет 你好 123_456`;

    alert( str.match(regexp) ); // 你,好

Example: currency

Characters that denote a currency, such as $, €, ¥, have the Unicode property \p{Currency_Symbol}, with the short alias \p{Sc}.

Let's use it to look for prices in the format “currency, followed by a digit”:

    let regexp = /\p{Sc}\d/gu;

    let str = `Prices: $2, €1, ¥9`;

    alert( str.match(regexp) ); // $2,€1,¥9

Later, in the article Quantifiers +, *, ? and {n} we'll see how to look for numbers that contain many digits.

Summary

Flag u enables the support of Unicode in regular expressions.

That means two things:

1. Characters of 4 bytes are handled correctly: as a single character, not two 2-byte characters.
2. Unicode properties can be used in the search: \p{…}.

With Unicode properties we can look for words in given languages, special characters (quotes, currencies) and so on.
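As a recap of the property classes covered in this article, here is a minimal sketch that applies several of them to one string. The helper names countByCategory and pickScript are hypothetical and exist only for this illustration, and console.log stands in for alert so that the snippet also runs outside a browser.

```javascript
// Hypothetical helper: count characters in a few general categories.
function countByCategory(text) {
  const classes = {
    letters: /\p{L}/gu,      // any letter
    digits: /\p{Nd}/gu,      // decimal digits
    punctuation: /\p{P}/gu,  // punctuation signs
    currency: /\p{Sc}/gu,    // currency symbols
  };
  const counts = {};
  for (const [name, regexp] of Object.entries(classes)) {
    // match returns null when nothing is found, hence the fallback to []
    counts[name] = (text.match(regexp) || []).length;
  }
  return counts;
}

// Hypothetical helper: collect the characters of one writing system.
// scriptName is assumed to be a valid Unicode Script value such as "Han";
// an unknown value makes the RegExp constructor throw a SyntaxError.
function pickScript(text, scriptName) {
  const regexp = new RegExp(`\\p{sc=${scriptName}}`, "gu");
  return text.match(regexp) || [];
}

const counts = countByCategory("Prices: $25, €100!");
console.log(counts); // { letters: 6, digits: 5, punctuation: 3, currency: 2 }

const greeting = "Hello Привет 你好 123_456";
console.log(pickScript(greeting, "Han"));      // [ '你', '好' ]
console.log(pickScript(greeting, "Cyrillic")); // [ 'П', 'р', 'и', 'в', 'е', 'т' ]
```

Note that $ and € are counted as currency symbols (Sc), not as punctuation, which matches the category list given earlier.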

© 2007-2021 Ilya Kantor


