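Learning Go Regular Expressions

Tags: matching, regular expressions, text, fmt, matches, learning, Println, Go

Regular expressions are one of the most powerful tools for processing text.

I have written several earlier articles on the basics and applications of regular expressions; readers can consult the related material at the end of this article for background. This article focuses on regular expressions in Go.

Learning regular expressions can be divided into three parts:

The regex API;
Regex syntax;
Regex matching strategies.

Regex API

The first thing to learn is Go's regex API. An API is the first key to the world of programs.

The simplest way to learn an API is to type programs on your computer and look at their output. Starting from examples given by an AI, adapt and experiment with them yourself to deepen your understanding.

Basic API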

package regexdemo // the package name is arbitrary; any valid name works

import (
	"fmt"
	"regexp"
	"testing"
)

func TestGoRegex(t *testing.T) {

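	// Compile a regular expression; Compile returns an error for an invalid pattern.
	// A raw string literal keeps the escaped dot (\.) before the top-level domain readable.
	pattern := `^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`
	r, err := regexp.Compile(pattern)
	if err != nil {
		t.Fatal(err) // stop here: r is nil when compilation fails
	}
	fmt.Println(r.String())                      // ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
	fmt.Println(r.MatchString("qinshu@163.com")) // true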

	// Build a pattern from a raw string literal and find all matches
	enclosedInt := regexp.MustCompile(`[\[{(]\d+[)}\]]`)
	matches := enclosedInt.FindAllString("(12) [34] {56}", -1)
	fmt.Println(matches) // [(12) [34] {56}]

	// Limit the number of matches
	matches2 := enclosedInt.FindAllString("(12) [34] {56}", 2)
	fmt.Println(matches2) // [(12) [34]]

	// Index positions of the matches
	matchIndexes := enclosedInt.FindAllStringIndex("(12) [34] {56}", -1)
	fmt.Println(matchIndexes) // [[0 4] [5 9] [10 14]] the end index is exclusive

	matchIndexes2 := enclosedInt.FindAllStringIndex("(12) [34] {56}", 2)
	fmt.Println(matchIndexes2) // [[0 4] [5 9]] the end index is exclusive

	// Replacement: collapse every run of whitespace into a single space
	spacePattern := regexp.MustCompile(`\s+`)
	origin := "hello	world!  \n You get    champion！"
	replaced := spacePattern.ReplaceAllString(origin, " ")
	fmt.Println(replaced)
}
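
One detail in an email-style pattern is worth checking by hand: the dot before the top-level domain. The sketch below is only an illustration (the names looseEmail and strictEmail are made up for this example); it compares a pattern with a bare . against one with an escaped \. on an input that contains no dot at all.

```go
package main

import (
	"fmt"
	"regexp"
)

// looseEmail leaves the dot before the top-level domain unescaped,
// so any single character is accepted in that position.
var looseEmail = regexp.MustCompile(`^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+.[a-zA-Z]{2,}$`)

// strictEmail escapes that dot, so only a literal "." is accepted.
var strictEmail = regexp.MustCompile(`^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`)

func main() {
	for _, s := range []string{"qinshu@163.com", "qinshu@163xcom"} {
		fmt.Println(s, looseEmail.MatchString(s), strictEmail.MatchString(s))
	}
	// qinshu@163.com true true
	// qinshu@163xcom true false
}
```

Writing the pattern as a raw string literal makes the escaped dot easy to spot.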

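Regex capturing

Capturing and extracting the text matched by a regular expression is a routine task in everyday development. The content to capture must be enclosed in parentheses (). For example, (\d+) captures the text matched by \d+.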

func TestRegexCatch(t *testing.T) {
	input := "(12) [34] {56}"
	pattern := `\((\d+)\) \[(\d+)\] \{(\d+)\}`

	re := regexp.MustCompile(pattern)
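	// submatches[0] is the whole match; submatches[1:] are the capture groups.
	submatches := re.FindStringSubmatch(input)

	numbers := make([]string, 0)
	for i := 1; i < len(submatches); i++ {
		numbers = append(numbers, submatches[i])
	}

	fmt.Println("Captured numbers:", numbers) // Captured numbers: [12 34 56]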
}
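
When the brackets can appear any number of times, one capture group combined with FindAllStringSubmatch may be more convenient than one group per bracket pair. The following is a minimal sketch; the group name num and the helper extractNumbers are assumptions made for this example, and SubexpIndex requires Go 1.15 or later.

```go
package main

import (
	"fmt"
	"regexp"
)

// enclosedNum captures the digits inside any of the three bracket styles.
// The group name "num" is chosen for this example only.
var enclosedNum = regexp.MustCompile(`[\[{(](?P<num>\d+)[)}\]]`)

// extractNumbers returns the text captured by the "num" group for every match.
func extractNumbers(input string) []string {
	idx := enclosedNum.SubexpIndex("num") // position of the named group
	var numbers []string
	for _, m := range enclosedNum.FindAllStringSubmatch(input, -1) {
		numbers = append(numbers, m[idx])
	}
	return numbers
}

func main() {
	fmt.Println(extractNumbers("(12) [34] {56}")) // [12 34 56]
}
```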

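Regex backreferences

Backreferences in regular expressions are a mechanism that lets you refer, within the same regular expression, to the content of a previously captured group. Capture groups are defined with parentheses (); when the regex engine encounters a capture group and successfully matches its content, that content is remembered and can be referenced later in the match. A reference is usually written as a backslash \ followed by one or more digits, where the number is the position of the capture group (counted from left to right, starting at 1).

Backreferences are typically used to match paired tags. For example, the following Java code extracts the text from <tag>text</tag>: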

@Test
public void testBackReference() {
    // \\1 refers back to the tag name captured by the first group
    Pattern p = Pattern.compile("(?i)<([a-z][a-z0-9]*)[^>]*>(.*?)<\\/\\1>");
    Matcher match = p.matcher("<h1>Main heading</h1>");
    if (match.find()) {
        System.out.println(match.group(2)); // Main heading
    }
}

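Go, however, does not support backreference syntax: regexp.Compile rejects a pattern that contains \1 and returns an error.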

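Regex syntax

For regex syntax, the key point is that Go's regexp package accepts RE2 syntax, which is largely the same general syntax used by Perl and Python. The stricter POSIX ERE syntax is available through regexp.CompilePOSIX.

Go has raw string literals delimited by backquotes ``, so you do not need to double every backslash as in Java, which keeps regular expressions clearer. For example, the Java version of enclosedInt has to be written like this:

"[\\[{(]\\d+[)}\\]]"

Matching a literal backslash in Java takes two more backslashes on top of that, four in total. In Go, the pattern is simply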

`[\[{(]\d+[)}\]]`
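
To make the difference concrete, the short sketch below writes the same pattern once as a raw string literal and once as an interpreted (double-quoted) literal, then does the same for a pattern that matches one literal backslash. The variable names are chosen for this example only.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Raw string literal: backslashes are kept exactly as written.
	rawPattern = `[\[{(]\d+[)}\]]`
	// Interpreted string literal: every backslash must be doubled.
	interpretedPattern = "[\\[{(]\\d+[)}\\]]"

	// A regex matching one literal backslash: two characters in a raw
	// literal, four in an interpreted literal.
	rawBackslash         = `\\`
	interpretedBackslash = "\\\\"
)

func main() {
	fmt.Println(rawPattern == interpretedPattern)     // true
	fmt.Println(rawBackslash == interpretedBackslash) // true

	re := regexp.MustCompile(rawBackslash)
	fmt.Println(len(re.FindAllString(`C:\tmp\a`, -1))) // 2
}
```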

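Regex matching strategies

There are two commonly used matching strategies:

Leftmost-First Match (leftmost, first preferred, not necessarily longest)

Also called "leftmost-first" matching, this is the strategy used by Perl-style engines (Perl, Python, JavaScript, PHP, Java) and it is the default in Go. The engine searches the target text from the left; among the matches that start at the leftmost possible position, it returns the one a backtracking search would find first, as determined by the order of the alternatives and by whether each quantifier is greedy or lazy. That result is returned even when a longer match exists at the same position.

Within this strategy, adding ? after a repetition metacharacter, as in *?, +?, and ??, makes it lazy: the quantifier consumes as little text as possible and stops as soon as the overall pattern is satisfied.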

func TestRegexLeftMostFirstMatch(t *testing.T) {
	text := "abccc"
	re := regexp.MustCompile("ab(c)+?")
	matches := re.FindAllString(text, -1)
	fmt.Println(matches) // [abc]
}

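Leftmost-Longest Match (leftmost, longest preferred)

This is the POSIX strategy. The engine still picks the leftmost starting position, but among all the matches that start there it returns the longest one, regardless of the order of the alternatives. In Go it is enabled with regexp.CompilePOSIX or by calling Longest() on a compiled Regexp. It is often confused with greedy matching, which is the default behavior of the quantifiers *, +, ? and {m,n} under leftmost-first: a greedy quantifier consumes as much text as possible, as the following example shows.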

func TestRegexLeftMostLongestMatch(t *testing.T) {
	text := "abccc"
	re := regexp.MustCompile("ab(c)+")
	matches := re.FindAllString(text, -1)
	fmt.Println(matches) // [abccc]
}
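
The two strategies only give different results when the pattern offers several ways to match at the same position, most visibly with alternation. The sketch below compiles one pattern both ways; the helper firstAndLongest is a name invented for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// firstAndLongest returns what the same pattern matches under Go's default
// leftmost-first semantics and under POSIX leftmost-longest semantics.
func firstAndLongest(pattern, text string) (first, longest string) {
	first = regexp.MustCompile(pattern).FindString(text)
	longest = regexp.MustCompilePOSIX(pattern).FindString(text)
	return first, longest
}

func main() {
	first, longest := firstAndLongest(`a|ab`, "ab")
	fmt.Println(first)   // a
	fmt.Println(longest) // ab
}
```

Under leftmost-first the first alternative that succeeds wins, so the order of a|ab matters; under leftmost-longest it does not.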

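In addition, there are some specific matching modes:

Anchored Matching

When a regular expression begins with ^ (start position) or ends with $ (end position), the match can only occur at the start or end of the string, so the string boundaries are enforced during matching.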

func TestRegexAnchorMatch(t *testing.T) {
	text := "abccc"
	re := regexp.MustCompile("^ab?c+$")
	matches := re.FindAllString(text, -1)
	fmt.Println(matches) // [abccc]

	re2 := regexp.MustCompile("^bc+$")
	nomatch := re2.FindAllString(text, -1)
	fmt.Println(nomatch) // []
}
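
Anchors matter in practice because MatchString succeeds when the pattern matches anywhere in the text. A small sketch of the difference, with variable names chosen for this example:

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	anyDigits  = regexp.MustCompile(`\d+`)   // may match anywhere in the text
	onlyDigits = regexp.MustCompile(`^\d+$`) // must span the whole text
)

func main() {
	fmt.Println(anyDigits.MatchString("abc123"))  // true
	fmt.Println(onlyDigits.MatchString("abc123")) // false
	fmt.Println(onlyDigits.MatchString("123"))    // true
}
```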

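Multiline Matching

In multi-line mode, ^ and $ in a regular expression match not only the start and end of the string but also the start and end of each line. Go supports multi-line mode through the (?m) flag, but it is off by default, so in the examples below ^ and $ match only at the start and end of the whole text.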

func TestMultiLineMatch(t *testing.T) {
	text := `Line 1
start
Middle line 1
Middle line 2
end
Line 3`
	pattern := regexp.MustCompile(`start([\s\S]*?)end`)
	matches := pattern.MatchString(text)
	fmt.Println(matches) // true
}

func TestMultiLineMatch3(t *testing.T) {
	text := `start
Middle line 1
Middle line 2
end`
	pattern := regexp.MustCompile(`^start([\s\S]*?)end$`)
	matches := pattern.MatchString(text)
	fmt.Println(matches) // true
}

func TestMultiLineMatch2(t *testing.T) {
	text := `Line 1
start
Middle line 1
Middle line 2
end
Line 3`
	pattern := regexp.MustCompile(`^start([\s\S]*?)end$`)
	matches := pattern.MatchString(text)
	fmt.Println(matches) // false
}
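
For comparison, here is a sketch of the same pattern with the (?m) flag switched on, using the same six-line text. The names sampleText, defaultMode and multiLine are assumptions made for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

const sampleText = "Line 1\nstart\nMiddle line 1\nMiddle line 2\nend\nLine 3"

var (
	// Without flags, ^ and $ match only at the start and end of the text.
	defaultMode = regexp.MustCompile(`^start([\s\S]*?)end$`)
	// With (?m), ^ and $ also match at the start and end of each line.
	multiLine = regexp.MustCompile(`(?m)^start([\s\S]*?)end$`)
)

func main() {
	fmt.Println(defaultMode.MatchString(sampleText)) // false
	fmt.Println(multiLine.MatchString(sampleText))   // true

	// Per-line matching: . does not cross a newline, so each line matches separately.
	lines := regexp.MustCompile(`(?m)^Middle.*$`).FindAllString(sampleText, -1)
	fmt.Printf("%q\n", lines) // ["Middle line 1" "Middle line 2"]
}
```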

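Singleline Matching

In single-line mode, the . metacharacter matches every character including the newline, whereas in normal mode . does not match a newline. Go supports this mode through the (?s) flag; it is off by default, which is why the example below does not match.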

func TestSingleLineMatch3(t *testing.T) {
	text := `start
Middle line 1
Middle line 2
end`
	pattern := regexp.MustCompile(`^start.*end$`)
	matches := pattern.MatchString(text)
	fmt.Println(matches) // false
}
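
A sketch of the same pattern with the (?s) flag enabled, so that . also matches the newline. The names blockText, dotDefault and dotAll are chosen for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

const blockText = "start\nMiddle line 1\nMiddle line 2\nend"

var (
	// Default: . stops at a newline, so .* cannot reach "end".
	dotDefault = regexp.MustCompile(`^start.*end$`)
	// With (?s): . matches a newline as well.
	dotAll = regexp.MustCompile(`(?s)^start.*end$`)
)

func main() {
	fmt.Println(dotDefault.MatchString(blockText)) // false
	fmt.Println(dotAll.MatchString(blockText))     // true
}
```

The flags can be combined, for example (?ms), when both behaviors are needed.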

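Backtracking

When processing complex regular expressions, an engine may use a backtracking algorithm that tries different paths to find a match. When a regular expression contains alternation (such as (|)) and repetition, the engine tries the possible matching paths until it finds a match or determines that there is none.
Go's regexp package follows the RE2 design: it is automaton-based and guarantees matching time linear in the size of the input, so it does not rely on unbounded backtracking, and constructs that require it, such as backreferences, are not supported.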
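
A rough way to see the practical effect is a nested repetition followed by input that cannot match, a shape known to cause very slow matching in many backtracking engines. The sketch below only checks the result; the pattern and the helper failingMatch are assumptions chosen for this example, and no timing figures are claimed.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nested is a pattern shape that can trigger catastrophic backtracking
// in backtracking engines when the overall match fails.
var nested = regexp.MustCompile(`^(a+)+$`)

// failingMatch runs the pattern against n copies of "a" followed by "b".
// The trailing "b" guarantees that the match fails.
func failingMatch(n int) bool {
	return nested.MatchString(strings.Repeat("a", n) + "b")
}

func main() {
	fmt.Println(failingMatch(10000)) // false
}
```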

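Atomic Grouping and Possessive Quantifiers

Some regex engines support atomic groups (?>...) and possessive quantifiers. These mechanisms help control backtracking behavior to improve matching efficiency and accuracy. Go supports neither atomic groups nor possessive quantifiers.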
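
This can be checked directly, since regexp.Compile returns an error for syntax it does not accept. A minimal sketch, with the helper name supported invented for this example:

```go
package main

import (
	"fmt"
	"regexp"
)

// supported reports whether Go's regexp package accepts the pattern.
func supported(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	fmt.Println(supported(`(?>a+)b`)) // false: atomic group
	fmt.Println(supported(`a++b`))    // false: possessive quantifier
	fmt.Println(supported(`a+b`))     // true: ordinary greedy quantifier
}
```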

From： https://www.cnblogs.com/lovesqcc/p/18117126