首页 > 其他分享 >Go - charset and encoding

Go - charset and encoding

时间:2024-02-21 10:14:57浏览次数:29  
标签:code UTF string encoding charset bytes characters Go

We should understand the distinction between a charset and an encoding:
 A charset, as the name suggests, is a set of characters. For example, the Unicode charset contains 2^21 characters.
 An encoding is the translation of a character’s list in binary. For example, UTF-8 is an encoding standard capable of encoding all the Unicode characters in a variable number of bytes (from 1 to 4 bytes).

We mentioned characters to simplify the charset definition. But in Unicode, we use the concept of a code point to refer to an item represented by a single value. For example, the 汉 character is identified by the U+6C49 code point. Using UTF-8, 汉 is encoded using three bytes: 0xE6, 0xB1, and 0x89. Why is this important? Because in Go, a rune is a Unicode code point.

Meanwhile, we mentioned that UTF-8 encodes characters into 1 to 4 bytes, hence, up to 32 bits. This is why in Go, a rune is an alias of int32:

Another thing to highlight about UTF-8: some people believe that Go strings are always UTF-8, but this isn’t true. Let’s consider the following example:

We assign a string literal (a string constant) to s. In Go, a source code is encoded in UTF-8. So, all string literals are encoded into a sequence of bytes using UTF-8. However, a string is a sequence of arbitrary bytes; it’s not necessarily based on UTF-8. Hence, when we manipulate a variable that wasn’t initialized from a string literal (for example, reading from the filesystem), we can’t necessarily assume that it uses the UTF-8 encoding.

Let’s get back to the hello example. We have a string composed of five characters: h, e, l, l, and o. These simple characters are encoded using a single byte each. This is why getting the length of s returns 5:

But a character isn’t always encoded into a single byte. Coming back to the 汉 character, we mentioned that with UTF-8, this character is encoded into three bytes. We can validate this with the following example:

Instead of printing 1, this example prints 3. Indeed, the len built-in function applied on a string doesn’t return the number of characters; it returns the number of bytes. Conversely, we can create a string from a list of bytes. We mentioned that the 汉 character was encoded using three bytes, 0xE6, 0xB1, and 0x89:

Here, we build a string composed of these three bytes. When we print the string, instead of printing three characters, the code prints a single one: 汉.

 

In summary:
 A charset is a set of characters, whereas an encoding describes how to translate a charset into binary.
 In Go, a string references an immutable slice of arbitrary bytes.
 Go source code is encoded using UTF-8. Hence, all string literals are UTF-8 strings. But because a string can contain arbitrary bytes, if it’s obtained from somewhere else (not the source code), it isn’t guaranteed to be based on the UTF-8 encoding.
 A rune corresponds to the concept of a Unicode code point, meaning an item represented by a single value.
 Using UTF-8, a Unicode code point can be encoded into 1 to 4 bytes.
 Using len on a string in Go returns the number of bytes, not the number of runes.

标签:code,UTF,string,encoding,charset,bytes,characters,Go
From: https://www.cnblogs.com/zhangzhihui/p/18024542

相关文章

  • Studio 3T 2024.1 (macOS, Linux, Windows) - MongoDB 的专业 GUI、IDE 和 客户端,支持
    Studio3T2024.1(macOS,Linux,Windows)-MongoDB的专业GUI、IDE和客户端,支持自然语言查询TheprofessionalGUI,IDEandclientforMongoDB请访问原文链接:Studio3T2024.1(macOS,Linux,Windows)-MongoDB的专业GUI、IDE和客户端,支持自然语言查询,查看最新版......
  • Go 100 mistakes - #35: Using defer inside a loop
    Wewillimplementafunctionthatopensasetoffileswherethefilepathsare receivedviaachannel.Hence,wehavetoiterateoverthischannel,openthefiles,and handletheclosure.Here’sourfirstversion:Thereisasignificantproblemwithth......
  • Go - break
    Abreakstatementiscommonlyusedtoterminatetheexecutionofaloop.When loopsareusedinconjunctionwithswitchorselect,developersfrequentlymakethe mistakeofbreakingthewrongstatement.Let’stakealookatthefollowingexample.Weimplem......
  • Go 中如何高效遍历目录?探索几种方法
    Go中如何高效遍历目录?探索几种方法原创波罗学码途漫漫2024-02-2108:01上海听全文图片嗨,大家好!我是波罗学。本文是系列文章Go技巧第十八篇,系列文章查看:Go语言技巧。如果对我的文章感兴趣,欢迎关注我的公众号: 目录遍历是一个很常见的操作,它的使用场景有如文件目录查......
  • Go 100 mistakes - #32: Ignoring the impact of using pointer elements in range lo
    Thissectionlooksataspecificmistakewhenusingarangeloopwithpointerelements.Ifwe’renotcautiousenough,itcanleadustoanissuewherewereferencethe wrongelements.Let’sexaminethisproblemandhowtofixit.Beforewebegin,let’scla......
  • 【Go-Lua】Golang嵌入Lua代码——gopher-lua
    嵌入式8篇文章0订阅订阅专栏Lua代码嵌入GolangGo版本:1.19首先是Go语言直接调用Lua程序,并打印,把环境跑通packagemainimportlua"github.com/yuin/gopher-lua"funcmain(){ L:=lua.NewState() deferL.Close() //go err:=L.DoString(`print("gogogo!")`) iferr!=n......
  • golang time包和日期函数
    获取当前时间 //获取当前时间对象 timeObj:=time.Now() /*获取当前日期语法一*/ //打印当前日期 fmt.Println(timeObj)//2024-02-2017:50:54.085353+0800CSTm=+0.000323093 //当前年 year:=timeObj.Year() //打印当月 month:=timeObj.Month() //......
  • go自定义了一个Code的错误代码类型
    第一次基于GoFrame框架开发项目,这是一个灵感来自PHPLaravel的Golang开发框架,使用之后其实自己并不是很喜欢,把一个开发语言的习惯直接迁移到另一个开发语言上,个人觉得并不是一个好主意,不过这次并不想讨论这个。同事之前的实践异常处理是每个框架都需要考虑的问题,GoFrame也有自己......
  • golang函数
    函数定义/*函数定义关键字funcfunc函数名(参数参数类型)函数返回值的类型*/funcgetInfo(namestring,ageint)string{ returnname}//函数返回多个返回值:则返回类型括号包裹(返回值类型,类型..),即时返回两个int,也需要(int,int)funcgetNum(xint,statusboo......
  • Go - range
    Arangeloopallowsiteratingoverdifferentdatastructures:StringArrayPointertoanarraySliceMapReceivingchannelIngeneral,rangeproducestwovaluesforeachdatastructure exceptareceivingchannel,forwhichitproducesasingleeleme......