Unicode: flag "u" and class \p{...} - The Modern JavaScript ...
JavaScript uses Unicode encoding for strings. Most characters are encoded with 2 bytes, but that allows representing at most 65536 characters.

That range is not big enough to encode all possible characters, which is why some rare characters are encoded with 4 bytes, for instance 𝒳 (mathematical X) or 😄 (a smile), some hieroglyphs and so on.

Here are the Unicode values of some characters:
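
| Character | Unicode | Bytes count in Unicode |
|-----------|---------|------------------------|
| a         | 0x0061  | 2                      |
| ≈         | 0x2248  | 2                      |
| 𝒳         | 0x1d4b3 | 4                      |
| 𝒴         | 0x1d4b4 | 4                      |
| 😄        | 0x1f604 | 4                      |
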
So characters like a and ≈ occupy 2 bytes, while the codes for 𝒳, 𝒴 and 😄 are longer: they take 4 bytes.
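
The values in the table can be reproduced in code. This is a minimal sketch: the helper name `describe` is hypothetical, and it assumes the "bytes count" refers to the UTF-16 form that JavaScript strings use (2 bytes per code unit).

```javascript
// Hypothetical helper: report a character's code point and its size in UTF-16.
function describe(ch) {
  const codePoint = ch.codePointAt(0); // full code point, even for 4-byte characters
  return {
    hex: '0x' + codePoint.toString(16).padStart(4, '0'),
    bytes: ch.length * 2, // length counts 16-bit code units, each 2 bytes
  };
}

for (const ch of ['a', '≈', '𝒳', '😄']) {
  console.log(ch, describe(ch));
}
```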

A long time ago, when the JavaScript language was created, Unicode encoding was simpler: there were no 4-byte characters. So, some language features still handle them incorrectly.

For instance, length thinks that there are two characters here:
alert('😄'.length); // 2
alert('𝒳'.length); // 2
…But we can see that there's only one, right? The point is that length treats 4 bytes as two 2-byte characters. That's incorrect, because they must be considered only together (a so-called "surrogate pair"; you can read about them in the article Strings).
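
To make the surrogate pair visible, here is a small sketch that compares counting by 2-byte units with counting by whole characters. The variable names are only for illustration; string iteration (used by the spread syntax) walks whole code points rather than 2-byte units.

```javascript
const smile = '😄';

const codeUnits = smile.length;        // counts 2-byte units
const codePoints = [...smile].length;  // spread iterates by whole characters

// The two halves of the surrogate pair, as hex
const high = smile.charCodeAt(0).toString(16);
const low = smile.charCodeAt(1).toString(16);

console.log(codeUnits, codePoints); // 2 1
console.log(high, low);             // d83d de04
```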

By default, regular expressions also treat 4-byte "long characters" as a pair of 2-byte ones. And, as happens with strings, that may lead to odd results. We'll see that a bit later, in the article Sets and ranges [...].

Unlike strings, regular expressions have the flag u that fixes such problems. With this flag, a regexp handles 4-byte characters correctly. Unicode property search also becomes available; we'll get to it next.
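
A quick way to see the difference is to ask whether a 4-byte character counts as exactly one character for the dot pattern. This is only an illustrative sketch; the variable names are made up.

```javascript
const smile = '😄';

// Without "u", the dot matches a single 2-byte unit, so the pair doesn't fit /^.$/
const withoutU = /^.$/.test(smile);

// With "u", the dot matches the whole 4-byte character
const withU = /^.$/u.test(smile);

console.log(withoutU); // false
console.log(withU);    // true
```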

Unicode properties \p{…}

Every character in Unicode has a lot of properties. They describe what "category" the character belongs to and contain miscellaneous information about it.

For instance, if a character has the Letter property, it means that the character belongs to an alphabet (of any language). And the Number property means that it's a digit: maybe Arabic or Chinese, and so on.

We can search for characters with a property, written as \p{…}. To use \p{…}, a regular expression must have the flag u.

For instance, \p{Letter} denotes a letter in any language. We can also use \p{L}, as L is an alias of Letter. There are shorter aliases for almost every property.

In the example below three kinds of letters will be found: English, Georgian and Korean.
let str = "A ბ ㄱ";
alert( str.match(/\p{L}/gu) ); // A,ბ,ㄱ
alert( str.match(/\p{L}/g) ); // null (no matches, \p doesn't work without the flag "u")

Here are the main character categories and their subcategories:

- Letter L:
  - lowercase Ll
  - modifier Lm,
  - titlecase Lt,
  - uppercase Lu,
  - other Lo.
- Number N:
  - decimal digit Nd,
  - letter number Nl,
  - other No.
- Punctuation P:
  - connector Pc,
  - dash Pd,
  - initial quote Pi,
  - final quote Pf,
  - open Ps,
  - close Pe,
  - other Po.
- Mark M (accents etc):
  - spacing combining Mc,
  - enclosing Me,
  - non-spacing Mn.
- Symbol S:
  - currency Sc,
  - modifier Sk,
  - math Sm,
  - other So.
- Separator Z:
  - line Zl,
  - paragraph Zp,
  - space Zs.
- Other C:
  - control Cc,
  - format Cf,
  - not assigned Cn,
  - private use Co,
  - surrogate Cs.

So, e.g. if we need letters in lowercase, we can write \p{Ll}, punctuation signs: \p{P} and so on.
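
As a sketch of how these categories can be used, the snippet below picks lowercase letters, uppercase letters and punctuation out of a mixed English/Russian string. The sample string and variable names are chosen only for illustration.

```javascript
const text = "Hello, World! Привет.";

const lowercase = text.match(/\p{Ll}/gu);   // lowercase letters of any alphabet
const uppercase = text.match(/\p{Lu}/gu);   // uppercase letters of any alphabet
const punctuation = text.match(/\p{P}/gu);  // any punctuation sign

console.log(lowercase.length);      // 13
console.log(uppercase.join(''));    // HWП
console.log(punctuation.join(' ')); // , ! .
```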

There are also other derived categories, like:

- Alphabetic (Alpha), includes Letters L, plus letter numbers Nl (e.g. Ⅻ, a character for the roman number 12), plus some other symbols Other_Alphabetic (OAlpha).
- Hex_Digit includes hexadecimal digits: 0-9, a-f, A-F.
- …And so on.
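
The difference between a general category and a derived one can be checked directly. In this sketch, the roman numeral Ⅻ is not a Letter, but it is a letter number (Nl) and therefore Alphabetic; the helper `isHexDigit` is a hypothetical name introduced for the example.

```javascript
const twelve = 'Ⅻ'; // roman numeral twelve, a single character

const isLetter = /\p{L}/u.test(twelve);              // not in category Letter
const isLetterNumber = /\p{Nl}/u.test(twelve);       // but it is a "letter number"
const isAlphabetic = /\p{Alphabetic}/u.test(twelve); // so it counts as Alphabetic
const isAlphaAlias = /\p{Alpha}/u.test(twelve);      // same check via the short alias

// Hypothetical helper: is the string exactly one hexadecimal digit?
const isHexDigit = (ch) => /^\p{Hex_Digit}$/u.test(ch);

console.log(isLetter, isLetterNumber, isAlphabetic, isAlphaAlias); // false true true true
console.log(isHexDigit('F'), isHexDigit('7'), isHexDigit('g'));    // true true false
```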

Unicode supports many different properties. Their full list would require a lot of space, so here are the references:

- List all properties by a character: https://unicode.org/cldr/utility/character.jsp.
- List all characters by a property: https://unicode.org/cldr/utility/list-unicodeset.jsp.
- Short aliases for properties: https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt.
- A full base of Unicode characters in text format, with all properties, is here: https://www.unicode.org/Public/UCD/latest/ucd/.

Example: hexadecimal numbers

For instance, let's look for hexadecimal numbers, written as xFF, where F is a hex digit (0…9 or A…F).

A hex digit can be denoted as \p{Hex_Digit}:
let regexp = /x\p{Hex_Digit}\p{Hex_Digit}/u;
alert("number: xAF".match(regexp)); // xAF
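
The same pattern can be combined with the g flag to collect every such number in a text. A small sketch, with a made-up sample string and the repetition written as a {2} quantifier:

```javascript
const sample = "x1f, xZZ, xAF";

// "x" followed by exactly two hex digits; "g" finds all matches, "u" enables \p{…}
const hexNumbers = sample.match(/x\p{Hex_Digit}{2}/gu);

console.log(hexNumbers); // [ 'x1f', 'xAF' ] ("xZZ" is skipped: Z is not a hex digit)
```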

Example: Chinese hieroglyphs

Let's look for Chinese hieroglyphs.

There's a Unicode property Script (a writing system), that may have a value: Cyrillic, Greek, Arabic, Han (Chinese) and so on.

To look for characters in a given writing system we should use Script=…
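
As an illustration of searching by writing system, the sketch below uses the standard JavaScript forms \p{Script=Han} and \p{sc=Cyrillic} (sc is the short alias of Script). The sample string and variable names are assumptions made for this example.

```javascript
const str = "Hello Привет 你好 123";

// Individual Chinese hieroglyphs
const han = str.match(/\p{Script=Han}/gu);

// Whole runs of Cyrillic letters, using the short alias "sc"
const cyrillic = str.match(/\p{sc=Cyrillic}+/gu);

console.log(han);      // [ '你', '好' ]
console.log(cyrillic); // [ 'Привет' ]
```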