We should understand the distinction between a charset and an encoding:
A charset, as the name suggests, is a set of characters. For example, the Unicode charset has room for 1,114,112 characters (a range that fits in 21 bits).
An encoding describes how to translate a list of characters into binary. For example, UTF-8 is an encoding standard capable of encoding all the Unicode characters in a variable number of bytes (from 1 to 4).
We mentioned characters to simplify the charset definition. But in Unicode, we use the concept of a code point to refer to an item represented by a single value. For example, the 汉 character is identified by the U+6C49 code point. Using UTF-8, 汉 is encoded using three bytes: 0xE6, 0xB1, and 0x89. Why is this important? Because in Go, a rune is a Unicode code point.
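As a quick check (a minimal sketch, assuming the fmt package is imported), we can print the code point of a rune using fmt's %U verb:

```go
fmt.Printf("%U\n", '汉') // U+6C49
```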
We also mentioned that UTF-8 encodes characters into 1 to 4 bytes, hence up to 32 bits. This is why, in Go, a rune is an alias of int32:
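```go
// Declared in Go's builtin package.
type rune = int32
```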
Another thing to highlight about UTF-8: some people believe that Go strings are always UTF-8, but this isn’t true. Let’s consider the following example:
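```go
s := "hello"
```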
We assign a string literal (a string constant) to s. In Go, source code is encoded in UTF-8. So all string literals are encoded into a sequence of bytes using UTF-8. However, a string is a sequence of arbitrary bytes; it’s not necessarily based on UTF-8. Hence, when we manipulate a variable that wasn’t initialized from a string literal (for example, when reading from the filesystem), we can’t necessarily assume that it uses the UTF-8 encoding.
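For instance, here is a minimal sketch (the byte values are arbitrary, chosen only to form an invalid sequence) showing that a Go string can hold bytes that aren't valid UTF-8:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// 0xC3 opens a two-byte UTF-8 sequence, but 0x28 isn't a valid
	// continuation byte, so these two bytes aren't valid UTF-8.
	s := string([]byte{0xC3, 0x28})
	fmt.Println(utf8.ValidString(s)) // false: a legal Go string, yet not UTF-8
}
```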
Let’s get back to the hello example. We have a string composed of five characters: h, e, l, l, and o. These simple characters are encoded using a single byte each. This is why getting the length of s returns 5:
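```go
s := "hello"
fmt.Println(len(s)) // 5
```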
But a character isn’t always encoded into a single byte. Coming back to the 汉 character, we mentioned that with UTF-8, this character is encoded into three bytes. We can validate this with the following example:
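```go
s := "汉"
fmt.Println(len(s)) // 3
```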
Instead of printing 1, this example prints 3. Indeed, the len built-in function applied on a string doesn’t return the number of characters; it returns the number of bytes. Conversely, we can create a string from a list of bytes. We mentioned that the 汉 character was encoded using three bytes, 0xE6, 0xB1, and 0x89:
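```go
s := string([]byte{0xE6, 0xB1, 0x89})
fmt.Printf("%s\n", s) // 汉
```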
Here, we build a string composed of these three bytes. When we print the string, instead of printing three characters, the code prints a single one: 汉.
In summary:
A charset is a set of characters, whereas an encoding describes how to translate a charset into binary.
In Go, a string references an immutable slice of arbitrary bytes.
Go source code is encoded using UTF-8. Hence, all string literals are UTF-8 strings. But because a string can contain arbitrary bytes, if it’s obtained from somewhere else (not the source code), it isn’t guaranteed to be based on the UTF-8 encoding.
A rune corresponds to the concept of a Unicode code point, meaning an item represented by a single value.
Using UTF-8, a Unicode code point can be encoded into 1 to 4 bytes.
Using len on a string in Go returns the number of bytes, not the number of runes.
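To make the last point concrete, here is a minimal sketch contrasting len with the standard unicode/utf8 package:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "汉"
	fmt.Println(len(s))                    // 3: number of bytes
	fmt.Println(utf8.RuneCountInString(s)) // 1: number of runes
}
```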