What does .NET's String.Normalize do?

回答1

One difference between form C and form D is how letters with accents are represented: form C uses a single letter-with-accent codepoint, while form D separates that into a letter and an accent.

For instance, an "à" can be codepoint 224 ("Latin small letter A with grave"), or codepoint 97 ("Latin small letter A") followed by codepoint 786 ("Combining grave accent"). A char-by-char comparison would see these as different. Normalisation lets the comparison succeed.

A side-effect is that this makes it possible to easily create a "remove accents" method.

public static string RemoveAccents(string input)
{
    return new string(input
        .Normalize(System.Text.NormalizationForm.FormD)
        .ToCharArray()
        .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        .ToArray());
    // the normalization to FormD splits accented letters in letters+accents
    // the rest removes those accents (and other non-spacing characters)
    // and creates a new string from the remaining chars
}

Or have the "highly secure" ROT13 encoding work with accents:

string Rot13(string input)
{
    var v = input.Normalize(NormalizationForm.FormD)
        .Select(c => {
            if ((c>='a' && c<='m') || (c>='A' && c<='M'))
                return (char)(c+13);
            if ((c>='n' && c<='z') || (c>='N' && c<='Z'))
                return (char)(c-13);
            return c;
        });
    return new String(v.ToArray()).Normalize(NormalizationForm.FormC);
}

This will turn "Crème brûlée" into "Per̀zr oeĥyŕr" (and vice versa, of course), by first splitting "character with accent" codepoints in separate "character" and "accent" codepoints (FormD), then performing the ROT13 translation on just the letters and afterwards trying to recombine them (FormC).

回答2

It makes sure that unicode strings can be compared for equality (even if they are using different unicode encodings).

From Unicode Standard Annex #15:

Essentially, the Unicode Normalization Algorithm puts all combining marks in a specified order, and uses rules for decomposition and composition to transform each string into one of the Unicode Normalization Forms. A binary comparison of the transformed strings will then determine equivalence.

回答3

This link has a good explanation:

http://unicode.org/reports/tr15/#Norm_Forms

From what I can surmise, its so you can compare two unicode strings for equality.

回答4

In Unicode, a (composed) character can either have a unique code point, or a sequence of code points consisting of the base character and its accents.

Wikipedia lists as example Vietnamese ế (U+1EBF) and its decomposed sequence U+0065 (e) U+0302 (circumflex accent) U+0301 (acute accent).

string.Normalize() converts between the 4 normal forms a string can be coded in Unicode.

From ChatGPT

The main difference between Form C (NFC) and Form D (NFD) is the way they represent characters with combining diacritical marks.

Form C (NFC) represents such characters as a single code point, while Form D (NFD) represents them as multiple code points.

For example, the character "é" (LATIN SMALL LETTER E WITH ACUTE) can be represented in both NFC and NFD forms:

In NFC, "é" is represented as a single code point: U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
In NFD, "é" is represented as two code points: U+0065 (LATIN SMALL LETTER E) and U+0301 (COMBINING ACUTE ACCENT).

In general, Form C (NFC) is preferred for data interchange because it results in fewer code points and therefore smaller data size. However, Form D (NFD) can be useful for certain operations such as searching, sorting, and comparing text, especially in languages that heavily use diacritical marks.

标签：Normalize,do,What,code,string,Form,accents,accent
From： https://www.cnblogs.com/chucklu/p/17387199.html

解决 Docker 的 DeviceMapper 占用空间过大
某虚拟机运行容器半年后，磁盘空间报警，使用率超过百分之九十。经查后发现为Docker的DeviceMapper占用空间过大。概述DeviceMapper为容器的镜像和运行过程的缓存存放目录，这并不是一个文件夹，而是一个虚拟块设备。解决先将当前运行的容器导出为镜像（若已经对原有镜像进行过修改......
解决CentOS 7出现docker-compose: command not found
解决CentOS7出现docker-compose:commandnotfound1.安装docker-compose既然使用了docker-compose那自然得安装了在GitHub上拉取过慢，建议在国内源DaoCloud中拉取：curl-Lhttps://get.daocloud.io/docker/compose/releases/download/1.29.2/docker-compose-`uname-s`-`unam......
抓包工具之Charles(windows)
PC端如何配置才能抓取到https请求：1．安装证书：在顶部工具栏中选择“help--InstallCharlesCASSLCertificate”； 2．然后会弹出证书信息，选择安装证书，接下来将证书存储改为：受信任的根证书颁发机构，接下来都点“下一步”； .最后一步前可能会弹一个安全警告的弹窗，点“......
python控制windows 任务计划程序获取具体单一任务
importwin32com.clientTASK_ENUM_HIDDEN=1TASK_STATE={0:'Unknown',1:'Disabled',2:'Queued',3:'Ready',4:'Running'}scheduler=win32c......
安装docker和docker-compose的shell脚本（Centos7版本）
在执行脚本之前，我们需要先做两件事：避免防火墙与docker产生冲突，应先关闭防火墙。shell#去掉防火墙的开机自启动systemctldisablefirewalld.service#关闭防火墙systemctlstopfirewalld.service国内拉取dockerhub中的镜像速度一般都很慢，现在有一种方法可以提高......
Java获取当前路径（Linux+Windows）
Java获取当前路径（Linux+Windows）获取当前路径（兼容Linux、Windows）：StringcurPath=System.getProperty("user.dir");log.info("===========当前路径===========curPath:{}",curPath);输出结果：===========当前路径===========curPath:/home/lizhm......
ChatPDF/ChatDOC实现原理解析
1）把PDF切分成小的文本片段，通过OpenAI的Ada模型创建Embedding放到本地或远程向量数据库。2）把用户的提问也创建成Embedding，用它和之前创建的PDF向量比对，通过语义相似性搜索（余弦算法），找到最相关的文本片段。比关键词搜索好的一点是不要求关键词包含，也能发现文本相关性，比如汽车和公路......
Windows 服务失败自启动
先上bat文件@echooffrem定义循环间隔时间和监测的服务：setsecs=90setsrvname=%1echo==%1说明调用第一条参数，也可以在这里直接写服务名称==echo.echo========================================echo==查询计算机服务的状态，==echo==......
Docker部署网易云音乐解灰无版权VIP音乐播放下载
若服务器已搭建好Docker，则跳过输入搭建docker命令，回车执行，耐心等待安装完成curl-fsSLhttps://get.docker.com|bash-sdocker--mirrorAliyun执行一键部署命令dockerrun-dit\ -eENABLE_FLAC=true\ -eENABLE_LOCAL_VIP=svip\ -eBLOCK_ADS=true\ -eSEARCH_A......
Hadoop API使用大坑
这几天一直在困扰我pycurl版本和本机的版本不符合他连接又连接的自己自带的版本与系统不相同低级也会报错 https://blog.csdn.net/u010910682/article/details/89496550/?ops_request_misc=&request_id=&biz_id=102&utm_term=pycurl7.45.2%20%E6%90%AD%E9%85%8Dlibcu......

What does .NET's String.Normalize do?

What does .NET's String.Normalize do?

相关文章

赞助商

阅读排行