注:本文为几篇 regex 相关合辑。机翻,未校,未整理。
Regex History and How-To
Crystal Villanueva
Jan 14, 2021
A regular expression, also known as regex or regexp, is a special string that presents itself repeatedly in a search pattern; today, programmers use regex to create a regular pattern to look for within their code to capture, validate, find and replace and insert. Regex may seem intimidating, however with a basic breakdown of what regex is and how to apply it, it isn’t so terrifying after all. This post is just a beginners dip into the world of regex, not a comprehensive guide of what each symbol means.
正则表达式(也称为 regex 或 regexp)是一种特殊字符串,它在搜索模式中重复出现;如今,程序员使用 Regex 创建常规模式,以便在其代码中查找以捕获、验证、查找、替换和插入。正则表达式可能看起来令人生畏,但是对于正则表达式是什么以及如何应用它的基本分解,它毕竟并不那么可怕。这篇文章只是初学者对正则表达式世界的深入了解,而不是每个符号含义的全面指南。
History 历史
In the 1950’s a mathematician named Stephan Cole Kleene submitted a paper on automata theory and regular expressions. While the automata theory meant that the simplest machine has bounds or limits in its memory; “the memory size is independent of the input length”, his focus on regular expressions stemmed from McCulloch and Pitts’ neural calculus in their investigation of behavioral activation and activity of a neuron: with the computation of neuron activity in his biophysics focus, Kleene denoted that neuronal activation sequences “[bring a given net…] to a particular state after they have been completely processed, and discovered interesting regularities among them”(McCulloch and Pitts’ neural logical calculus. (n.d.)). Kleene distinguished these as regular events’, later called
regular expressions’ in lexical analysis (Hopcroft, Motwani, and Ullman 2014). From his mathematical contributions, regular expressions were later implemented into machines (i.e., the SNOBOL programming language in the 60’s and the Unix editor systems in the late 60’s/early 70’s ) to compile and search for expressions occurring more than once. Regex is used through out search engines to find and replace input, editing, formatting, and the output of text.
在 1950 年代,一位名叫 Stephan Cole Kleene 的数学家提交了一篇关于自动机理论和正则表达式的论文。虽然自动机理论意味着最简单的机器在其内存中有边界或限制;“内存大小与输入长度无关”,他对正则表达式的关注源于 McCulloch 和 Pitts 在研究神经元的行为激活和活动时的神经演算:在他的生物物理学重点中,神经元活动的计算表明神经元激活序列“[带来给定的网络…]到特定状态,并在它们之间发现了有趣的规律“(McCulloch 和 Pitts 的神经逻辑演算。(日期不详))。Kleene 将这些区分为“常规事件”,后来在词汇分析中称为“常规表达式”(Hopcroft、Motwani 和 Ullman 2014)。根据他的数学贡献,正则表达式后来被实现到机器中(即 60 年代的 SNOBOL 编程语言和 60 年代末/70 年代初的 Unix 编辑器系统),以编译和搜索多次出现的表达式。正则表达式在整个搜索引擎中用于查找和替换输入、编辑、格式化和文本输出。
Javascript and Regex
Javascript 和 Regex
In Javascript, ‘.test()’, ‘.match()’, ‘.matchAll()’, ‘.replaceAll()’, ‘.search()’, ‘.split()’or ‘.replace()’ are all methods that a programmer could implement into their function body to search for a regular expression within their code. The method ‘.test()’ returns true if found, and false if it cannot find the pattern within the code. The method ‘.match()’ searches throughout the code to find it’s regex match. Lastly, ‘.replace()’ takes in two arguments: the regex that the method is searching for, and what it is trying to replace:
在 Javascript 中,‘.test()’、‘.match()’、‘.matchAll()’、‘.replaceAll()’、‘.search()’、‘.split()‘或 ‘.replace()’ 都是程序员可以在其函数体中实现的方法,以在其代码中搜索正则表达式。方法 ‘.test()’ 如果找到,则返回 true,如果在代码中找不到模式,则返回 false。方法 ‘.match()’ 在整个代码中搜索以查找其正则表达式匹配项。最后,’.replace()’ 接受两个参数:方法正在搜索的正则表达式,以及它试图替换的内容:
let command = "G()()()()(al)"
let interpret = function(command) {
return command.replace(/\(\)/g, "o").replace(/\(al\)/g, "al")
};// output => "Gooooal"
Above, the method ‘.replace()’ is featured. The first argument of ‘.replace()’ is to find all instances of ‘()’ which is syntactically placed for regex form, and to replace it with ‘o’. Another ‘.replace()’ is chained onto the function body to find all instances of ‘(al)’ and replace it with ‘al’.
上面,方法 ‘.replace()’ 是特色。‘.replace()’ 的第一个参数是找到 ‘()’ 的所有实例,它在语法上是针对正则表达式形式的,并将其替换为 ‘o’。另一个 ‘.replace()’ 被链接到函数体上,以查找 ‘(al)’ 的所有实例并将其替换为 ‘al’。
What does the filler in the first part of the argument in the ‘.replace()’ mean? It can be confusing, but here is a short guide of slowing down and understanding this example of regex. The first argument in the first ‘.replace()’, ‘/()/g’, is deciphered as the following:
‘.replace()’ 中参数第一部分的填充物是什么意思?这可能会令人困惑,但这里有一个简短的指南,可以放慢速度并理解这个正则表达式示例。第一个 ‘.replace()’ 中的第一个参数 ‘/\(\)/g’ 被解读为:
The ‘/ /’ is standard format in regex; a pattern is enclosed in slashes. Most times you will see it in slashes, sometimes in the constructor format.
‘/ /’ 是正则表达式中的标准格式;模式用斜杠括起来。大多数时候,您会以斜杠形式看到它,有时是构造函数格式。
The ‘g’ flag in regex means ‘global’; the ‘g’ flag finds all possible matches throughout the string.
正则表达式中的“g”标志表示“全局”;‘g’ 标志查找整个字符串中所有可能的匹配项。
The backslash in the box escapes the special character; to find one of the parenthesis, you need the backslash.
框中的反斜杠转义特殊字符;要找到其中一个括号,您需要反斜杠。
The slash in the box escapes the special character; to find one of the parenthesis, you need the backslash.
框中的斜杠转义特殊字符;要找到其中一个括号,您需要反斜杠。
Put it all together and this is the result! Were searching for the ‘()’ globally throughout the code in the correct format.
把它们放在一起,这就是结果!在整个代码中以正确的格式全局搜索 ‘()’。
note: to escape or escaping a character means a character has special meaning *inside* a regular expression.
注: To escapeing or estruing a character 表示字符在正则表达式中具有特殊含义。
Congratulations, now you understand a simple regex! Regex is universal throughout all programming languages. No matter what language you are using, regex will be used to find, replace, split, match and test your code. While this example is rudimentary, hopefully it keens your interest into the world of regex.
恭喜,现在您了解了一个简单的正则表达式!Regex 在所有编程语言中都是通用的。无论您使用哪种语言,正则表达式都将用于查找、替换、拆分、匹配和测试您的代码。虽然这个例子很初级,但希望它能激发您对正则表达式世界的兴趣。
Resources:
Hopcroft, J. E., Motwani, R., & Ullman, J. D. (2014). Introduction to automata theory, languages, and computation. Harlow: Pearson Education.
Leung, H. (2010, September 16). Regular Languages and Finite Automata. Retrieved January 12, 2021, from https://www.cs.nmsu.edu/historical-projects/Projects/kleene.9.16.10.pdf
McCulloch and Pitts’ neural logical calculus. (n.d.). Retrieved January 12, 2021, from https://www.dlsi.ua.es/~mlf/nnafmc/pbook/node10.html
Introduction and History of Regular Expression
Gurkirat Singh
Posted on 2023,08,01
Updated on 2023,09,03
Hello World! Welcome to the “Regular Expressions” series, where we tackle the intimidating syntax that has spawned numerous memes among developers. Don’t worry, though! As we move forward, I assure you that you’ll gain the confidence to craft your own elegant regular expressions by the end of this journey.
世界您好!欢迎来到“正则表达式”系列,在这里,我们将讨论在开发人员中催生了大量模因的令人生畏的语法。不过别担心!随着我们向前发展,我向你保证,在这段旅程结束时,你将有信心制作你自己的优雅正则表达式。
In this series, I will utilise https://regex101.com to share patterns with you, avoiding the constraint of a specific programming language that could potentially create barriers for some learners.
在本系列中,我将利用 https://regex101.com 与您分享模式,避免特定编程语言的限制,这可能会给某些学习者带来障碍。
Note →→ If anywhere coding is required, I will be using C++.
注意 → → 如果需要编码,我将使用 C++。
What are Regular Expressions?
什么是正则表达式?
Regular expressions are a specific type of text pattern that are used while programming a logic.
正则表达式是编写逻辑时使用的一种特定类型的文本模式。
I couldn’t think of a single modern application that doesn’t make use of it, either directly or indirectly. If you went to a website and entered any gibberish text in the email field, you would have received an invalid email format message.
我想不出任何一个现代应用程序不直接或间接地使用它。如果您访问某个网站并在电子邮件字段中输入任何乱码文本,您将收到无效的电子邮件格式消息。
Under the hood, your input text is being verified by the following regex pattern.
在后台,您的输入文本由以下正则表达式模式进行验证。
[a-z0-9.-]@[a-z0-9]{2,}\.[a-z]{2,}
You can see how this regex works here. This is a minimal version of matching simple email addresses. Please bear with me if this appears overwhelming. Such expressions can be created in less than 10 seconds.
你可以在这里看到这个正则表达式是如何工作的。这是匹配简单电子邮件地址的最小版本。如果这看起来让人不知所措,请多多包涵。可以在 10 秒内创建此类表达式。
Why this name after all?
为什么叫这个名字呢?
The name “regular expression” originates from the mathematical concept of regular languages, which were first studied by mathematicians in the field of formal language theory. Regular expressions are a way to describe and match patterns in strings of characters, hence the name “regular” expressions.
“正则表达式”这个名字源于规则语言的数学概念,最早是由形式语言理论领域的数学家研究的。正则表达式是一种描述和匹配字符串模式的方法,因此称为 “regular” 表达式。
Problem that Led Development of RegExp
导致 RegExp 开发的问题
It was created in the 1950s (far before most of us were born) to help with text processing tasks, such as searching and editing. Since then, regular expressions have become a standard feature of many programming languages.
它创建于 1950 年代(远在我们大多数人出生之前),用于帮助完成文本处理任务,例如搜索和编辑。从那时起,正则表达式已成为许多编程语言的标准功能。
In the early 1960s, Ken Thompson implemented regular expressions in the QED text editor. This was the first time that regular expressions were used in a practical application.
在 1960 年代初期,Ken Thompson 在 QED 文本编辑器中实现了正则表达式。这是正则表达式首次在实际应用中使用。
Why You Should Learn Regular Expressions?
为什么你应该学习正则表达式?
Learning regular expressions offers numerous benefits in various fields such as
学习正则表达式在各个领域都有许多好处,例如:
- data analysis 数据分析
- software development
软件开发 - content management 内容管理
- scientific research 科研
Their versatility allows you to define complex patterns for finding specific words or phrases, extracting data from structured text, and performing advanced search and replace operations.
它们的多功能性允许您定义复杂的模式,用于查找特定单词或短语、从结构化文本中提取数据以及执行高级搜索和替换操作。
Additionally, mastering regular expressions helps prevent costly mistakes by providing a precise and controlled approach to text processing. With a solid understanding of regular expressions, you can confidently handle challenging text manipulation tasks, ensuring accuracy, reliability, and improved productivity.
此外,掌握正则表达式通过提供精确且受控的文本处理方法来帮助防止代价高昂的错误。凭借对正则表达式的深入理解,您可以自信地处理具有挑战性的文本操作任务,确保准确性、可靠性和提高生产力。
Since you’re here, I’m assuming you’re ready to understand and use those unwieldy strings of brackets and question marks in your code.
既然你在这里,我假设你已经准备好在代码中理解和使用这些笨拙的括号和问号字符串。
Flavours of RegExp
RegExp 的风格
There is no established standard that specifies which text patterns are and are not regular expressions. There are numerous languages on the market whose creators have various ideas about how regular expressions should look. So we’re now stuck with a whole spectrum of regular expression flavours (implementation of RegExp in the programming language).
没有确定的标准来指定哪些文本模式是正则表达式,哪些不是正则表达式。市场上有许多语言,其创建者对正则表达式的外观有各种想法。因此,我们现在被一整套正则表达式风格所困(在编程语言中实现 RegExp)。
But why reinvent the wheel? Instead, every modern regular expression engine may be traced back to the Perl programming language.
但为什么要重新发明轮子呢?相反,每个现代正则表达式引擎都可以追溯到 Perl 编程语言。
Regular expression flavours are commonly integrated into scripting languages, while other programming languages rely on dedicated libraries for regex support. JavaScript offers built-in support for regular expressions using the syntax /expr/
or the RegExp
object. On the other hand, Python implements regular expressions through its standard library re
.
正则表达式风格通常集成到脚本语言中,而其他编程语言则依赖于专用库来支持正则表达式。JavaScript 为使用语法 /expr/
或 RegExp
对象的正则表达式提供内置支持。另一方面,Python 通过其标准库 re 实现正则表达式。
References 引用
Automata theory
Automata theory is the study of abstract machines and automata, as well as the computational problems that can be solved using them. It is a theory in theoretical computer science with close connections to mathematical logic. The word automata comes from the Greek word αὐτόματος, which means “self-acting, self-willed, self-moving”. An automaton is an abstract self-propelled computing device which follows a predetermined sequence of operations automatically. An automaton with a finite number of states is called a finite automaton (FA) or finite-state machine (FSM). The figure on the right illustrates a finite-state machine, which is a well-known type of automaton. This automaton consists of states and transitions. As the automaton sees a symbol of input, it makes a transition to another state, according to its transition function, which takes the previous state and current input symbol as its arguments.
自动机理论是对抽象机器和自动机的研究,以及可以使用它们解决的计算问题。它是理论计算机科学中的一种理论,与数理逻辑密切相关。自动机一词来自希腊语 αὐτόματος,意思是“自我行动、任性、自我移动”。自动机是一种抽象的自走式计算设备,它会自动遵循预定的操作顺序。具有有限状态数的自动机称为有限自动机 (FA) 或有限状态机 (FSM)。下图说明了有限状态机,这是一种众所周知的自动机类型。此自动机由状态和过渡组成。当自动机看到 input 符号时,它会根据其 transition 函数过渡到另一个状态,该函数将前一个状态和当前 input 符号作为其参数。
Regular expression
A regular expression, sometimes referred to as rational expression, is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for “find” or “find and replace” operations on strings, or for input validation. Regular expression techniques are developed in theoretical computer science and formal language theory.
正则表达式(有时称为有理表达式)是指定文本中的匹配模式的字符序列。通常,字符串搜索算法使用此类模式对字符串执行“查找”或“查找和替换”操作,或用于输入验证。正则表达式技术是在理论计算机科学和形式语言理论中发展起来的。
Formal language
In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules called a formal grammar.
在逻辑学、数学、计算机科学和语言学中,正式语言由单词组成,这些单词的字母取自字母表,并根据一组称为正式语法的特定规则格式正确。
Regexes Got Good: The History And Future Of Regular Expressions In JavaScript
正则表达式变得好起来:JavaScript 中正则表达式的历史和未来
Steven Levithan
Aug 20, 2024
Although JavaScript regexes used to be underpowered compared to other modern flavors, numerous improvements in recent years mean that’s no longer true. Steven Levithan evaluates the history and present state of regular expressions in JavaScript with tips to make your regexes more readable, maintainable, and resilient.
尽管与其他现代风格相比,JavaScript 正则表达式曾经功能不足,但近年来的大量改进意味着情况不再如此。Steven Levithan 评估了 JavaScript 中正则表达式的历史和现状,并提供了一些技巧,以使您的正则表达式更具可读性、可维护性和弹性。
Modern JavaScript regular expressions have come a long way compared to what you might be familiar with. Regexes can be an amazing tool for searching and replacing text, but they have a longstanding reputation (perhaps outdated, as I’ll show) for being difficult to write and understand.
与你可能熟悉的相比,现代 JavaScript 正则表达式已经取得了长足的进步。正则表达式可以成为搜索和替换文本的出色工具,但它们因难以编写和理解而长期以来的声誉(可能已经过时了,我将展示)。
This is especially true in JavaScript-land, where regexes languished for many years, comparatively underpowered compared to their more modern counterparts in PCRE, Perl, .NET, Java, Ruby, C++, and Python. Those days are over.
在 JavaScript 领域尤其如此,正则表达式多年来一直萎靡不振,与 PCRE、Perl、.NET、Java、Ruby、C++ 和 Python 中更现代的对应项相比,其功能相对不足。那些日子已经过去了。
In this article, I’ll recount the history of improvements to JavaScript regexes (spoiler: ES2018 and ES2024 changed the game), show examples of modern regex features in action, introduce you to a lightweight JavaScript library that makes JavaScript stand alongside or surpass other modern regex flavors, and end with a preview of active proposals that will continue to improve regexes in future versions of JavaScript (with some of them already working in your browser today).
在本文中,我将回顾 JavaScript 正则表达式改进的历史(剧透:ES2018 和 ES2024 改变了游戏规则),展示现代正则表达式功能的实际示例,向您介绍一个轻量级的 JavaScript 库,它使 JavaScript 与其他现代正则表达式风格并驾齐驱或超越其他现代正则表达式风格,最后预览将继续改进未来 JavaScript 版本中的正则表达式的活动提案(其中一些已经在您的浏览器中运行)。
The History of Regular Expressions in JavaScript
JavaScript 中正则表达式的历史
ECMAScript 3, standardized in 1999, introduced Perl-inspired regular expressions to the JavaScript language. Although it got enough things right to make regexes pretty useful (and mostly compatible with other Perl-inspired flavors), there were some big omissions, even then. And while JavaScript waited 10 years for its next standardized version with ES5, other programming languages and regex implementations added useful new features that made their regexes more powerful and readable.
ECMAScript 3 于 1999 年标准化,将受 Perl 启发的正则表达式引入 JavaScript 语言。尽管它做对了足够多的事情使正则表达式非常有用(并且主要与其他受 Perl 启发的风格兼容),但即使在那时,也有一些很大的遗漏。虽然 JavaScript 等待了 10 年才推出 ES5 的下一个标准化版本,但其他编程语言和正则表达式实现添加了有用的新功能,使它们的正则表达式更加强大和可读。
But that was then. 但那是当时的事了。
Did you know that nearly every new version of JavaScript has made at least minor improvements to regular expressions?
您知道吗,几乎每个新版本的 JavaScript 都至少对正则表达式进行了微小的改进?
Let’s take a look at them.
让我们来看看它们。
Don’t worry if it’s hard to understand what some of the following features mean — we’ll look more closely at several of the key features afterward.
如果难以理解以下某些功能的含义,请不要担心 — 我们稍后将更仔细地研究几个关键功能。
- ES5 (2009) fixed unintuitive behavior by creating a new object every time regex literals are evaluated and allowed regex literals to use unescaped forward slashes within character classes (
/[/]/
).
ES5 (2009) 修复了不直观的行为,方法是在每次计算正则表达式文字时创建一个新对象,并允许正则表达式文字在字符类 (/[/]/
) 中使用未转义的正斜杠。 - ES6/ES2015 added two new regex flags:
y
(sticky
), which made it easier to use regexes in parsers, andu
(unicode
), which added several significant Unicode-related improvements along with strict errors. It also added theRegExp.prototype.flags
getter, support for subclassingRegExp
, and the ability to copy a regex while changing its flags.
ES6/ES2015 添加了两个新的正则表达式标志:y
(sticky
),这使得在解析器中使用正则表达式变得更加容易,以及u
(unicode
),它添加了一些与 Unicode 相关的重要改进以及严格的错误。它还添加了RegExp.prototype.flags
getter,对子类化RegExp
的支持,以及在更改其标志时复制正则表达式的能力。 - ES2018 was the edition that finally made JavaScript regexes pretty good. It added the
s
(dotAll
) flag, lookbehind, named capture, and Unicode properties (via\p{...}
and\P{...}
, which require ES6’s flagu
). All of these are extremely useful features, as we’ll see.
ES2018 是最终使 JavaScript 正则表达式变得相当好的版本。它添加了s
(dotAll
) 标志、后视、命名 capture 和 Unicode 属性(通过\p{...}
和\P{...}
,它们需要 ES6 的标志u
)。正如我们将看到的,所有这些都是非常有用的功能。 - ES2020 added the string method
matchAll
, which we’ll also see more of shortly.
ES2020 添加了字符串方法matchAll
,我们很快也会看到更多。 - ES2022 added flag
d
(hasIndices
), which provides start and end indices for matched substrings.
ES2022 添加了标志d
(hasIndices
),它为匹配的子字符串提供开始和结束索引。 - And finally, ES2024 added flag
v
(unicodeSets
) as an upgrade to ES6’s flagu
. Thev
flag adds a set of multicharacter “properties of strings” to\p{...}
, multicharacter elements within character classes via\p{...}
and\q{...}
, nested character classes, set subtraction[A--B]
and intersection[A&&B]
, and different escaping rules within character classes. It also fixed case-insensitive matching for Unicode properties within negated sets[^...]
.
最后,ES2024 添加了标志v
(unicodeSets
) 作为 ES6 标志u
的升级。v
标志向\p{...}
添加了一组多字符“字符串属性”,通过\p{...}
和\q{...}
在字符类中添加多字符元素,嵌套字符类,集合减法[A--B]
和交集[A&&B]
,以及字符类中的不同转义规则。它还修复了否定集[^...]
中 Unicode 属性的不区分大小写的匹配。
As for whether you can safely use these features in your code today, the answer is yes! The latest of these features, flag v
, is supported in Node.js 20 and 2023-era browsers. The rest are supported in 2021-era browsers or earlier.
至于您现在是否可以在代码中安全地使用这些功能,答案是肯定的!这些功能中的最新功能 flag v
在 Node.js 20 和 2023 时代的浏览器中受支持。其余的在 2021 年版或更早版本中受支持。
Each edition from ES2019 to ES2023 also added additional Unicode properties that can be used via \p{...}
and \P{...}
. And to be a completionist, ES2021 added string method replaceAll
— although, when given a regex, the only difference from ES3’s replace
is that it throws if not using flag g
.
从 ES2019 到 ES2023 的每个版本还添加了额外的 Unicode 属性,可以通过 \p{...}
和 \P{...}
使用。为了成为完成主义者,ES2021 添加了字符串方法 replaceAll
——尽管,当给定正则表达式时,与 ES3 的 replace
的唯一区别是,如果不使用标志 g
,它会抛出。
Aside: What Makes a Regex Flavor Good?
题外话:是什么让 REGEX 风格好?
With all of these changes, how do JavaScript regular expressions now stack up against other flavors? There are multiple ways to think about this, but here are a few key aspects:
有了所有这些变化,JavaScript 正则表达式现在如何与其他风格相提并论呢?有多种方法可以考虑这个问题,但以下是几个关键方面:
- Performance.
This is an important aspect but probably not the main one since mature regex implementations are generally pretty fast. JavaScript is strong on regex performance (at least considering V8’s Irregexp engine, used by Node.js, Chromium-based browsers, and even Firefox; and JavaScriptCore, used by Safari), but it uses a backtracking engine that is missing any syntax for backtracking control — a major limitation that makes ReDoS vulnerability more common.
性能。 这是一个重要的方面,但可能不是主要方面,因为成熟的 regex 实现通常非常快。JavaScript 在正则表达式性能方面很强(至少考虑到 V8 的 Irregexp 引擎,Node.js、基于 Chromium 的浏览器甚至 Firefox;以及 Safari 使用的 JavaScriptCore),但它使用的回溯引擎缺少任何回溯控制语法——这是使 ReDoS 漏洞更常见的主要限制。 - Support for advanced features that handle common or important use cases.
Here, JavaScript stepped up its game with ES2018 and ES2024. JavaScript is now best in class for some features like lookbehind (with its infinite-length support) and Unicode properties (with multicharacter “properties of strings,” set subtraction and intersection, and script extensions). These features are either not supported or not as robust in many other flavors.
支持处理常见或重要使用案例的高级功能。 在这里,JavaScript 通过 ES2018 和 ES2024 加强了它的游戏。JavaScript 现在是某些功能的最佳版本,例如后瞻(具有无限长度支持)和 Unicode 属性(具有多字符“字符串属性”、集合减法和交集以及脚本扩展)。这些功能要么不受支持,要么在许多其他风格中不那么健壮。 - Ability to write readable and maintainable patterns.
Here, native JavaScript has long been the worst of the major flavors since it lacks thex
(“extended”) flag that allows insignificant whitespace and comments. Additionally, it lacks regex subroutines and subroutine definition groups (from PCRE and Perl), a powerful set of features that enable writing grammatical regexes that build up complex patterns via composition.
能够编写可读和可维护的模式。在这里,原生 JavaScript 长期以来一直是最糟糕的主要风格,因为它缺少x
(“extended”) 标志,该标志允许无关紧要的空格和注释。此外,它还缺少正则表达式子例程和子例程定义组(来自 PCRE 和 Perl),这是一组强大的功能,可以编写语法正则表达式,通过组合构建复杂的模式。
So, it’s a bit of a mixed bag.
所以,这有点好坏参半。
The good news is that all of these holes can be filled by a JavaScript library, which we’ll see later in this article.
好消息是,所有这些漏洞都可以由 JavaScript 库填补,我们将在本文后面看到。
Using JavaScript’s Modern Regex Features
使用 JavaScript 的现代正则表达式功能
Let’s look at a few of the more useful modern regex features that you might be less familiar with. You should know in advance that this is a moderately advanced guide. If you’re relatively new to regex, here are some excellent tutorials you might want to start with:
让我们看看一些您可能不太熟悉的更有用的现代正则表达式功能。您应该提前知道这是一个中等高级的指南。如果您对正则表达式相对较新,这里有一些优秀的教程,您可能希望从以下开始:
- RegexLearn and RegexOne are interactive tutorials that include practice problems.
RegexLearn 和 RegexOne 是包含练习题的交互式教程。 - JavaScript.info’s regular expressions chapter is a detailed and JavaScript-specific guide.
JavaScript.info 的正则表达式章节是一份详细的 JavaScript 特定指南。 - Demystifying Regular Expressions (video) is an excellent presentation for beginners by Lea Verou at HolyJS 2017.
揭开正则表达式的神秘面纱(视频)是 Lea Verou 在 HolyJS 2017 上为初学者提供的精彩演示。 - Learn Regular Expressions In 20 Minutes (video) is a live syntax walkthrough in a regex tester.
在 20 分钟内学习正则表达式(视频)是正则表达式测试器中的实时语法演练。
Named Capture
命名捕获
Often, you want to do more than just check whether a regex matches — you want to extract substrings from the match and do something with them in your code. Named capturing groups allow you to do this in a way that makes your regexes and code more readable and self-documenting.
通常,你想要做的不仅仅是检查正则表达式是否匹配 - 你想要从匹配中提取子字符串并在代码中对它们执行一些操作。命名捕获组允许您以一种使正则表达式和代码更具可读性和自文档性的方式执行此操作。
The following example matches a record with two date fields and captures the values:
以下示例匹配具有两个日期字段的记录并捕获值:
const record = 'Admitted: 2024-01-01\nReleased: 2024-01-03';
const re = /^Admitted: (?<admitted>\d{4}-\d{2}-\d{2})\nReleased: (?<released>\d{4}-\d{2}-\d{2})$/;
const match = record.match(re);
console.log(match.groups);
/* → {
admitted: '2024-01-01',
released: '2024-01-03'
} */
Don’t worry — although this regex might be challenging to understand, later, we’ll look at a way to make it much more readable. The key things here are that named capturing groups use the syntax (?<name>...)
, and their results are stored on the groups
object of matches.
别担心 — 尽管这个正则表达式可能很难理解,但稍后,我们将寻找一种方法来使其更具可读性。这里的关键是命名捕获组使用语法 (?<name>...)
,并且它们的结果存储在匹配的 groups
对象上。
You can also use named backreferences to rematch whatever a named capturing group matched via \k<name>
, and you can use the values within search and replace as follows:
您还可以使用命名反向引用来重新匹配通过 \k<name>
匹配的任何命名捕获组,并且可以在 search 和 replace 中使用值,如下所示:
// Change 'FirstName LastName' to 'LastName, FirstName'
const name = 'Shaquille Oatmeal';
name.replace(/(?<first>\w+) (?<last>\w+)/, '$<last>, $<first>');
// → 'Oatmeal, Shaquille'
For advanced regexers who want to use named backreferences within a replacement callback function, the groups
object is provided as the last argument. Here’s a fancy example:
对于想要在替换回调函数中使用命名反向引用的高级 regexers,groups
对象作为最后一个参数提供。这里有一个花哨的例子:
function fahrenheitToCelsius(str) {
const re = /(?<degrees>-?\d+(\.\d+)?)F\b/g;
return str.replace(re, (...args) => {
const groups = args.at(-1);
return Math.round((groups.degrees - 32) * 5/9) + 'C';
});
}
fahrenheitToCelsius('98.6F');
// → '37C'
fahrenheitToCelsius('May 9 high is 40F and low is 21F');
// → 'May 9 high is 4C and low is -6C'
Lookbehind
回顾
Lookbehind (introduced in ES2018) is the complement to lookahead, which has always been supported by JavaScript regexes. Lookahead and lookbehind are assertions (similar to ^
for the start of a string or \b
for word boundaries) that don’t consume any characters as part of the match. Lookbehinds succeed or fail based on whether their subpattern can be found immediately before the current match position.
Lookbehind(在 ES2018 中引入)是 lookahead 的补充,JavaScript 正则表达式一直支持它。Lookahead 和 lookbehind 是断言(类似于字符串开头的 ^
或单词边界的 \b
),它们在匹配过程中不使用任何字符。后视是否成功或失败取决于是否可以在当前匹配位置之前立即找到其子模式。
For example, the following regex uses a lookbehind (?<=...)
to match the word “cat” (only the word “cat”) if it’s preceded by “fat ”:
例如,如果单词前面有 “fat ”,则以下正则表达式使用后视 (?<=...)
来匹配单词 “cat” (仅单词 “cat”) :
const re = /(?<=fat )cat/g;
'cat, fat cat, brat cat'.replace(re, 'pigeon');
// → 'cat, fat pigeon, brat cat'
You can also use negative lookbehind — written as (?<!...)
— to invert the assertion. That would make the regex match any instance of “cat” that’s not preceded by “fat ”.
你也可以使用否定 lookbehind — 写成 (?<!...)
— 来反转断言。这将使正则表达式匹配任何前面没有 “fat ” 的 “cat” 实例。
const re = /(?<!fat )cat/g;
'cat, fat cat, brat cat'.replace(re, 'pigeon');
// → 'pigeon, fat cat, brat pigeon'
JavaScript’s implementation of lookbehind is one of the very best (matched only by .NET). Whereas other regex flavors have inconsistent and complex rules for when and whether they allow variable-length patterns inside lookbehind, JavaScript allows you to look behind for any subpattern.
JavaScript 的后视实现是最好的实现之一(仅与 .NET 相媲美)。其他正则表达式风格对于何时以及是否允许在 lookbehind 中使用可变长度模式具有不一致且复杂的规则,而 JavaScript 允许你向后查看任何子模式。
The matchAll Method
matchAll 方法
JavaScript’s String.prototype.matchAll
was added in ES2020 and makes it easier to operate on regex matches in a loop when you need extended match details. Although other solutions were possible before, matchAll
is often easier, and it avoids gotchas, such as the need to guard against infinite loops when looping over the results of regexes that might return zero-length matches.
JavaScript 的 String.prototype.matchAll
是在 ES2020 中添加的,当你需要扩展匹配细节时,可以更轻松地在循环中对正则表达式匹配进行操作。尽管以前还有其他解决方案,但 matchAll
通常更容易,并且它避免了陷阱,例如在循环可能返回零长度匹配的正则表达式结果时需要防止无限循环。
Since matchAll
returns an iterator (rather than an array), it’s easy to use it in a for...of
loop.
由于 matchAll
返回一个迭代器(而不是数组),因此很容易在 for...of
循环中使用它。
const re = /(?<char1>\w)(?<char2>\w)/g;
for (const match of str.matchAll(re)) {
const {char1, char2} = match.groups;
// Print each complete match and matched subpatterns
console.log(`Matched "${match[0]}" with "${char1}" and "${char2}"`);
}
Note: matchAll
requires its regexes to use flag g
(global
). Also, as with other iterators, you can get all of its results as an array using Array.from
or array spreading.
注意:matchAll 要求其正则表达式使用标志 g(全局)。此外,与其他迭代器一样,您可以使用 Array.from 或 array spread 将其所有结果作为数组获取。
const matches = [...str.matchAll(/./g)];
Unicode Properties
UNICODE 属性
Unicode properties (added in ES2018) give you powerful control over multilingual text, using the syntax \p{...}
and its negated version \P{...}
. There are hundreds of different properties you can match, which cover a wide variety of Unicode categories, scripts, script extensions, and binary properties.
Unicode 属性(在 ES2018 中添加)使用语法 \p{...}
及其否定版本 \P{...}
为您提供对多语言文本的强大控制。您可以匹配数百种不同的属性,这些属性涵盖各种 Unicode 类别、脚本、脚本扩展名和二进制属性。
Note: For more details, check out the documentation on MDN.
注意:有关更多详细信息,请查看 MDN 上的文档。
Unicode properties require using the flag u
(unicode
) or v
(unicodeSets
).
Unicode 属性需要使用标志 u
(unicode
) 或 v
(unicodeSets
)。
Flag v
标志 v
Flag v
(unicodeSets
) was added in ES2024 and is an upgrade to flag u
— you can’t use both at the same time. It’s a best practice to always use one of these flags to avoid silently introducing bugs via the default Unicode-unaware mode. The decision on which to use is fairly straightforward. If you’re okay with only supporting environments with flag v
(Node.js 20 and 2023-era browsers), then use flag v
; otherwise, use flag u
.
标志 v
(unicodeSets
) 是在 ES2024 中添加的,是标志 u
的升级 — 您不能同时使用两者。最佳做法是始终使用这些标志之一,以避免通过默认的 Unicode uncode-unaware 模式静默引入错误。决定使用哪个相当简单。如果您只支持带有标志 v
的环境(Node.js 20 和 2023 时代的浏览器),请使用标志 v
;否则,使用标志 u
。
Flag v
adds support for several new regex features, with the coolest probably being set subtraction and intersection. This allows using A--B
(within character classes) to match strings in A but not in B or using A&&B
to match strings in both A and B. For example:
标志 v
添加了对几个新正则表达式功能的支持,其中最酷的可能是设置减法和交集。这允许使用 A--B
(在字符类中) 匹配 A 中的字符串但不匹配 B 中的字符串,或使用 A&&B
匹配 A 和 B 中的字符串。例如:
// Matches all Greek symbols except the letter 'π'
/[\p{Script_Extensions=Greek}--π]/v
// Matches only Greek letters
/[\p{Script_Extensions=Greek}&&\p{Letter}]/v
For more details about flag v
, including its other new features, check out this explainer from the Google Chrome team.
有关标记 v
的更多详细信息,包括其其他新功能,请查看 Google Chrome 团队的此解释器。
A Word on Matching Emoji
关于匹配 emoji 的一句话
Emoji are
标签:Regex,regex,字符,正则表达式,JavaScript,regular,POSIX,emoji From: https://blog.csdn.net/u013669912/article/details/143468338