Regex History / Specifications / Flavors | Matching Emoji in JavaScript

Date: 2024-11-08 19:16:02

Note: this post is a compilation of several regex-related articles. Machine-translated, not proofread, not organized.


Regex History and How-To

Crystal Villanueva

Jan 14, 2021

A regular expression, also known as regex or regexp, is a special string that describes a search pattern; today, programmers use regex to define patterns to look for in text in order to capture, validate, find and replace, and insert. Regex may seem intimidating; however, with a basic breakdown of what regex is and how to apply it, it isn’t so terrifying after all. This post is just a beginner’s dip into the world of regex, not a comprehensive guide to what each symbol means.

History

In the 1950s, a mathematician named Stephen Cole Kleene published a paper on automata theory and regular expressions. Automata theory holds that the simplest machines have bounded memory: “the memory size is independent of the input length”. Kleene’s focus on regular expressions stemmed from McCulloch and Pitts’ neural calculus, their investigation of the activation and activity of neurons. Studying the computation of neuron activity, Kleene characterized the neuronal activation sequences that “[bring a given net…] to a particular state after they have been completely processed, and discovered interesting regularities among them” (McCulloch and Pitts’ neural logical calculus. (n.d.)). Kleene called these ‘regular events’, later known as ‘regular expressions’ in lexical analysis (Hopcroft, Motwani, and Ullman 2014). Building on his mathematical contributions, regular expressions were later implemented in software (e.g., the SNOBOL programming language in the 1960s and the Unix editors of the late 1960s and early 1970s) to compile and search for patterns. Regex is used throughout search engines to find and replace input and to edit, format, and output text.

JavaScript and Regex

In JavaScript, ‘.test()’, ‘.match()’, ‘.matchAll()’, ‘.replaceAll()’, ‘.search()’, ‘.split()’ and ‘.replace()’ are all methods that a programmer can call in a function body to search a string with a regular expression. The method ‘.test()’ is called on the regex itself; it returns true if the pattern is found in the string and false if it is not. The method ‘.match()’ searches the string and returns its regex matches. Lastly, ‘.replace()’ takes two arguments: the regex that the method searches for, and the replacement for each match:

let command = "G()()()()(al)";
let interpret = function (command) {
  // Replace every "()" with "o", then every "(al)" with "al"
  return command.replace(/\(\)/g, "o").replace(/\(al\)/g, "al");
};
interpret(command); // output => "Gooooal"

The example above features the method ‘.replace()’. The first argument of the first ‘.replace()’ is a regex that finds all instances of ‘()’, with the parentheses escaped so they are treated literally; each instance is replaced with ‘o’. A second ‘.replace()’ is chained onto the result to find all instances of ‘(al)’ and replace them with ‘al’.
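The other methods listed above work along the same lines. Here is a minimal sketch, assuming a made-up input string that is not part of the original article:

```javascript
// Illustrative input; an assumption for this sketch.
const text = "G()()(al)";

// RegExp.prototype.test: does the pattern occur anywhere in the string?
const hasAl = /\(al\)/.test(text);

// String.prototype.match with the g flag: an array of every matched substring.
const allPairs = text.match(/\(\)/g);

// String.prototype.search: index of the first match, or -1 if there is none.
const alIndex = text.search(/\(al\)/);

// String.prototype.split: a regex can serve as the separator.
const parts = "a, b,c".split(/\s*,\s*/);

console.log(hasAl);    // true
console.log(allPairs); // [ '()', '()' ]
console.log(alIndex);  // 5
console.log(parts);    // [ 'a', 'b', 'c' ]
```

Note that ‘.test()’ is called on the regex and takes the string as its argument, while the others are called on the string.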

What does the first argument of ‘.replace()’ mean? It can be confusing, so here is a short guide to slowing down and understanding this regex example. The first argument of the first ‘.replace()’, ‘/\(\)/g’, is deciphered as follows:

The ‘/ /’ is standard format in regex; a pattern is enclosed in slashes. Most times you will see it in slashes, sometimes in the constructor format.
‘/ /’ 是正则表达式中的标准格式;模式用斜杠括起来。大多数时候,您会以斜杠形式看到它,有时是构造函数格式。

The ‘g’ flag in regex means ‘global’; the ‘g’ flag finds all possible matches throughout the string.
正则表达式中的“g”标志表示“全局”;‘g’ 标志查找整个字符串中所有可能的匹配项。
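To see what the ‘g’ flag changes, here is a small sketch using a shortened version of the earlier input (the variable names are illustrative):

```javascript
const command = "G()()(al)";

// Without g, replace() stops after the first match.
const firstOnly = command.replace(/\(\)/, "o");

// With g, every match in the string is replaced.
const everyMatch = command.replace(/\(\)/g, "o");

console.log(firstOnly);  // "Go()(al)"
console.log(everyMatch); // "Goo(al)"
```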

Each backslash in the boxes escapes the special character that follows it; to match a literal parenthesis, you need to put a backslash in front of it.

Put it all together and this is the result! We’re searching for ‘()’ globally throughout the string, in the correct format.

Note: some characters have a special meaning *inside* a regular expression. To escape such a character means to put a backslash in front of it so that it is matched literally.
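A quick way to see escaping in action is to compare a pattern with and without the backslash. This sketch uses the dot, another special character, as an illustration:

```javascript
// Unescaped, "." matches any single character (except line breaks).
const unescaped = /a.c/;
// Escaped, "\." matches only a literal dot.
const escaped = /a\.c/;

console.log(unescaped.test("abc")); // true
console.log(escaped.test("abc"));   // false
console.log(escaped.test("a.c"));   // true
```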

Congratulations, now you understand a simple regex! Regex is available in nearly every programming language. Whatever language you use, regex can help you find, replace, split, match, and test strings. While this example is rudimentary, hopefully it piques your interest in the world of regex.

Resources:

Hopcroft, J. E., Motwani, R., & Ullman, J. D. (2014). Introduction to automata theory, languages, and computation. Harlow: Pearson Education.

Leung, H. (2010, September 16). Regular Languages and Finite Automata. Retrieved January 12, 2021, from https://www.cs.nmsu.edu/historical-projects/Projects/kleene.9.16.10.pdf

McCulloch and Pitts’ neural logical calculus. (n.d.). Retrieved January 12, 2021, from https://www.dlsi.ua.es/~mlf/nnafmc/pbook/node10.html


Introduction and History of Regular Expression

Gurkirat Singh

Posted on 2023,08,01

Updated on 2023,09,03

Hello World! Welcome to the “Regular Expressions” series, where we tackle the intimidating syntax that has spawned numerous memes among developers. Don’t worry, though! As we move forward, I assure you that you’ll gain the confidence to craft your own elegant regular expressions by the end of this journey.
世界您好!欢迎来到“正则表达式”系列,在这里,我们将讨论在开发人员中催生了大量模因的令人生畏的语法。不过别担心!随着我们向前发展,我向你保证,在这段旅程结束时,你将有信心制作你自己的优雅正则表达式。

In this series, I will utilise https://regex101.com to share patterns with you, avoiding the constraint of a specific programming language that could potentially create barriers for some learners.
在本系列中,我将利用 https://regex101.com 与您分享模式,避免特定编程语言的限制,这可能会给某些学习者带来障碍。

Note: If coding is required anywhere, I will be using C++.

What are Regular Expressions?

什么是正则表达式?

Regular expressions are a specific type of text pattern used when programming logic that searches, validates, or transforms text.

I couldn’t think of a single modern application that doesn’t make use of it, either directly or indirectly. If you went to a website and entered any gibberish text in the email field, you would have received an invalid email format message.
我想不出任何一个现代应用程序不直接或间接地使用它。如果您访问某个网站并在电子邮件字段中输入任何乱码文本,您将收到无效的电子邮件格式消息。

Under the hood, your input text is being verified by the following regex pattern.

在后台,您的输入文本由以下正则表达式模式进行验证。

[a-z0-9.-]@[a-z0-9]{2,}\.[a-z]{2,}

You can see how this regex works here. This is a minimal version of matching simple email addresses. Please bear with me if this appears overwhelming. Such expressions can be created in less than 10 seconds.
你可以在这里看到这个正则表达式是如何工作的。这是匹配简单电子邮件地址的最小版本。如果这看起来让人不知所措,请多多包涵。可以在 10 秒内创建此类表达式。
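As a rough sketch of how such a check might run, the snippet below applies the pattern above in JavaScript. The sample inputs and the anchored variant (with `^`, `$`, and a `+` after the first character class) are assumptions added for illustration, not part of the original article:

```javascript
// The pattern exactly as printed above. It is unanchored, so it succeeds
// if any substring of the input looks like "x@domain.tld".
const printed = /[a-z0-9.-]@[a-z0-9]{2,}\.[a-z]{2,}/;

// Assumed stricter variant: anchors force the whole input to match, and
// "+" lets the part before "@" be longer than one character.
const anchored = /^[a-z0-9.-]+@[a-z0-9]{2,}\.[a-z]{2,}$/;

console.log(printed.test("user@example.com"));     // true
console.log(printed.exec("user@example.com")[0]);  // "r@example.com"
console.log(printed.test("gibberish"));            // false
console.log(anchored.test("user@example.com"));    // true
console.log(anchored.test("see user@example.com")); // false
```

The second line of output shows that the printed pattern only consumes one character before the "@", which is why a form validator would typically anchor the pattern.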

Why this name after all?

为什么叫这个名字呢?

The name “regular expression” originates from the mathematical concept of regular languages, which were first studied by mathematicians in the field of formal language theory. Regular expressions are a way to describe and match patterns in strings of characters, hence the name “regular” expressions.
“正则表达式”这个名字源于规则语言的数学概念,最早是由形式语言理论领域的数学家研究的。正则表达式是一种描述和匹配字符串模式的方法,因此称为 “regular” 表达式。

Problem that Led Development of RegExp

导致 RegExp 开发的问题

The concept was formalized in the 1950s (far before most of us were born) and was later applied to text processing tasks, such as searching and editing. Since then, regular expressions have become a standard feature of many programming languages.

In the late 1960s, Ken Thompson implemented regular expressions in the QED text editor. This was one of the first times that regular expressions were used in a practical application.

Why Should You Learn Regular Expressions?

Learning regular expressions offers numerous benefits in various fields, such as:

  • data analysis
  • software development
  • content management
  • scientific research

Their versatility allows you to define complex patterns for finding specific words or phrases, extracting data from structured text, and performing advanced search and replace operations.
它们的多功能性允许您定义复杂的模式,用于查找特定单词或短语、从结构化文本中提取数据以及执行高级搜索和替换操作。

Additionally, mastering regular expressions helps prevent costly mistakes by providing a precise and controlled approach to text processing. With a solid understanding of regular expressions, you can confidently handle challenging text manipulation tasks, ensuring accuracy, reliability, and improved productivity.
此外,掌握正则表达式通过提供精确且受控的文本处理方法来帮助防止代价高昂的错误。凭借对正则表达式的深入理解,您可以自信地处理具有挑战性的文本操作任务,确保准确性、可靠性和提高生产力。

Since you’re here, I’m assuming you’re ready to understand and use those unwieldy strings of brackets and question marks in your code.
既然你在这里,我假设你已经准备好在代码中理解和使用这些笨拙的括号和问号字符串。

Flavours of RegExp

RegExp 的风格

There is no established standard that specifies which text patterns are and are not regular expressions. There are numerous languages on the market whose creators have various ideas about how regular expressions should look. So we’re now stuck with a whole spectrum of regular expression flavours (implementation of RegExp in the programming language).
没有确定的标准来指定哪些文本模式是正则表达式,哪些不是正则表达式。市场上有许多语言,其创建者对正则表达式的外观有各种想法。因此,我们现在被一整套正则表达式风格所困(在编程语言中实现 RegExp)。

But why reinvent the wheel? Instead, most modern regular expression engines can be traced back to the Perl programming language.

Regular expression flavours are commonly integrated into scripting languages, while other programming languages rely on dedicated libraries for regex support. JavaScript offers built-in support for regular expressions using the syntax /expr/ or the RegExp object. On the other hand, Python implements regular expressions through its standard library re.
正则表达式风格通常集成到脚本语言中,而其他编程语言则依赖于专用库来支持正则表达式。JavaScript 为使用语法 /expr/RegExp 对象的正则表达式提供内置支持。另一方面,Python 通过其标准库 re 实现正则表达式。
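The two JavaScript forms mentioned above can be compared side by side. This is a minimal sketch; the variable names and the sample word are illustrative assumptions:

```javascript
// Literal syntax: the pattern is fixed when the script is written.
const literal = /\d+/g;

// RegExp constructor: the pattern is a string, so backslashes are doubled.
const constructed = new RegExp("\\d+", "g");

console.log(literal.source === constructed.source); // true
console.log("a1b22".match(constructed));            // [ '1', '22' ]

// The constructor is useful when part of the pattern is only known at runtime.
const word = "cat"; // hypothetical runtime value
const dynamic = new RegExp(`\\b${word}\\b`, "i");
console.log(dynamic.test("A Cat sat")); // true
```

When the runtime value comes from user input, any special characters in it would need to be escaped before building the pattern.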

References

Automata theory

Automata theory is the study of abstract machines and automata, as well as the computational problems that can be solved using them. It is a theory in theoretical computer science with close connections to mathematical logic. The word automata comes from the Greek word αὐτόματος, which means “self-acting, self-willed, self-moving”. An automaton is an abstract self-propelled computing device which follows a predetermined sequence of operations automatically. An automaton with a finite number of states is called a finite automaton (FA) or finite-state machine (FSM). The figure on the right illustrates a finite-state machine, which is a well-known type of automaton. This automaton consists of states and transitions. As the automaton sees a symbol of input, it makes a transition to another state, according to its transition function, which takes the previous state and current input symbol as its arguments.
自动机理论是对抽象机器和自动机的研究,以及可以使用它们解决的计算问题。它是理论计算机科学中的一种理论,与数理逻辑密切相关。自动机一词来自希腊语 αὐτόματος,意思是“自我行动、任性、自我移动”。自动机是一种抽象的自走式计算设备,它会自动遵循预定的操作顺序。具有有限状态数的自动机称为有限自动机 (FA) 或有限状态机 (FSM)。下图说明了有限状态机,这是一种众所周知的自动机类型。此自动机由状态和过渡组成。当自动机看到 input 符号时,它会根据其 transition 函数过渡到另一个状态,该函数将前一个状态和当前 input 符号作为其参数。

Regular expression

A regular expression, sometimes referred to as rational expression, is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for “find” or “find and replace” operations on strings, or for input validation. Regular expression techniques are developed in theoretical computer science and formal language theory.
正则表达式(有时称为有理表达式)是指定文本中的匹配模式的字符序列。通常,字符串搜索算法使用此类模式对字符串执行“查找”或“查找和替换”操作,或用于输入验证。正则表达式技术是在理论计算机科学和形式语言理论中发展起来的。

Formal language

In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules called a formal grammar.
在逻辑学、数学、计算机科学和语言学中,正式语言由单词组成,这些单词的字母取自字母表,并根据一组称为正式语法的特定规则格式正确。


Regexes Got Good: The History And Future Of Regular Expressions In JavaScript

正则表达式变得好起来:JavaScript 中正则表达式的历史和未来

Steven Levithan

Aug 20, 2024

Although JavaScript regexes used to be underpowered compared to other modern flavors, numerous improvements in recent years mean that’s no longer true. Steven Levithan evaluates the history and present state of regular expressions in JavaScript with tips to make your regexes more readable, maintainable, and resilient.

尽管与其他现代风格相比,JavaScript 正则表达式曾经功能不足,但近年来的大量改进意味着情况不再如此。Steven Levithan 评估了 JavaScript 中正则表达式的历史和现状,并提供了一些技巧,以使您的正则表达式更具可读性、可维护性和弹性。

Modern JavaScript regular expressions have come a long way compared to what you might be familiar with. Regexes can be an amazing tool for searching and replacing text, but they have a longstanding reputation (perhaps outdated, as I’ll show) for being difficult to write and understand.
与你可能熟悉的相比,现代 JavaScript 正则表达式已经取得了长足的进步。正则表达式可以成为搜索和替换文本的出色工具,但它们因难以编写和理解而长期以来的声誉(可能已经过时了,我将展示)。

This is especially true in JavaScript-land, where regexes languished for many years, underpowered compared to their more modern counterparts in PCRE, Perl, .NET, Java, Ruby, C++, and Python. Those days are over.

In this article, I’ll recount the history of improvements to JavaScript regexes (spoiler: ES2018 and ES2024 changed the game), show examples of modern regex features in action, introduce you to a lightweight JavaScript library that makes JavaScript stand alongside or surpass other modern regex flavors, and end with a preview of active proposals that will continue to improve regexes in future versions of JavaScript (with some of them already working in your browser today).
在本文中,我将回顾 JavaScript 正则表达式改进的历史(剧透:ES2018 和 ES2024 改变了游戏规则),展示现代正则表达式功能的实际示例,向您介绍一个轻量级的 JavaScript 库,它使 JavaScript 与其他现代正则表达式风格并驾齐驱或超越其他现代正则表达式风格,最后预览将继续改进未来 JavaScript 版本中的正则表达式的活动提案(其中一些已经在您的浏览器中运行)。

The History of Regular Expressions in JavaScript

JavaScript 中正则表达式的历史

ECMAScript 3, standardized in 1999, introduced Perl-inspired regular expressions to the JavaScript language. Although it got enough things right to make regexes pretty useful (and mostly compatible with other Perl-inspired flavors), there were some big omissions, even then. And while JavaScript waited 10 years for its next standardized version with ES5, other programming languages and regex implementations added useful new features that made their regexes more powerful and readable.
ECMAScript 3 于 1999 年标准化,将受 Perl 启发的正则表达式引入 JavaScript 语言。尽管它做对了足够多的事情使正则表达式非常有用(并且主要与其他受 Perl 启发的风格兼容),但即使在那时,也有一些很大的遗漏。虽然 JavaScript 等待了 10 年才推出 ES5 的下一个标准化版本,但其他编程语言和正则表达式实现添加了有用的新功能,使它们的正则表达式更加强大和可读。

But that was then.

Did you know that nearly every new version of JavaScript has made at least minor improvements to regular expressions?
您知道吗,几乎每个新版本的 JavaScript 都至少对正则表达式进行了微小的改进?

Let’s take a look at them.
让我们来看看它们。

Don’t worry if it’s hard to understand what some of the following features mean — we’ll look more closely at several of the key features afterward.
如果难以理解以下某些功能的含义,请不要担心 — 我们稍后将更仔细地研究几个关键功能。

  • ES5 (2009) fixed unintuitive behavior by creating a new object every time regex literals are evaluated and allowed regex literals to use unescaped forward slashes within character classes (/[/]/).
  • ES6/ES2015 added two new regex flags: y (sticky), which made it easier to use regexes in parsers, and u (unicode), which added several significant Unicode-related improvements along with strict errors. It also added the RegExp.prototype.flags getter, support for subclassing RegExp, and the ability to copy a regex while changing its flags.
  • ES2018 was the edition that finally made JavaScript regexes pretty good. It added the s (dotAll) flag, lookbehind, named capture, and Unicode properties (via \p{...} and \P{...}, which require ES6’s flag u). All of these are extremely useful features, as we’ll see.
  • ES2020 added the string method matchAll, which we’ll also see more of shortly.
  • ES2022 added flag d (hasIndices), which provides start and end indices for matched substrings.
  • And finally, ES2024 added flag v (unicodeSets) as an upgrade to ES6’s flag u. The v flag adds a set of multicharacter “properties of strings” to \p{...}, multicharacter elements within character classes via \p{...} and \q{...}, nested character classes, set subtraction [A--B] and intersection [A&&B], and different escaping rules within character classes. It also fixed case-insensitive matching for Unicode properties within negated sets [^...].

As for whether you can safely use these features in your code today, the answer is yes! The latest of these features, flag v, is supported in Node.js 20 and 2023-era browsers. The rest are supported in 2021-era browsers or earlier.
至于您现在是否可以在代码中安全地使用这些功能,答案是肯定的!这些功能中的最新功能 flag v 在 Node.js 20 和 2023 时代的浏览器中受支持。其余的在 2021 年版或更早版本中受支持。

Each edition from ES2019 to ES2023 also added additional Unicode properties that can be used via \p{...} and \P{...}. And to be a completionist, ES2021 added string method replaceAll — although, when given a regex, the only difference from ES3’s replace is that it throws if not using flag g.
从 ES2019 到 ES2023 的每个版本还添加了额外的 Unicode 属性,可以通过 \p{...}\P{...} 使用。为了成为完成主义者,ES2021 添加了字符串方法 replaceAll ——尽管,当给定正则表达式时,与 ES3 的 replace 的唯一区别是,如果不使用标志 g ,它会抛出。
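A few of the flags and methods from the timeline above are not revisited later, so here is a brief sketch of each. The sample strings and variable names are assumptions chosen for illustration:

```javascript
// s (dotAll, ES2018): lets "." match line breaks too.
const dotDefault = /a.b/.test("a\nb");
const dotAll = /a.b/s.test("a\nb");
console.log(dotDefault, dotAll); // false true

// d (hasIndices, ES2022): exposes [start, end] offsets for each match and group.
const m = /(?<word>b+)/d.exec("aabbb");
console.log(m.indices[0]);          // [ 2, 5 ]
console.log(m.indices.groups.word); // [ 2, 5 ]

// y (sticky, ES2015): the match must start exactly at lastIndex.
const digits = /\d+/y;
digits.lastIndex = 3;
const atThree = digits.test("abc123"); // digits begin at index 3
digits.lastIndex = 0;
const atZero = digits.test("abc123");  // index 0 is "a", so no match
console.log(atThree, atZero); // true false

// replaceAll (ES2021): throws a TypeError if the regex lacks flag g.
let threw = false;
try {
  "a-b-c".replaceAll(/-/, "+");
} catch (e) {
  threw = e instanceof TypeError;
}
console.log(threw); // true
console.log("a-b-c".replaceAll(/-/g, "+")); // "a+b+c"
```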

Aside: What Makes a Regex Flavor Good?

题外话:是什么让 REGEX 风格好?

With all of these changes, how do JavaScript regular expressions now stack up against other flavors? There are multiple ways to think about this, but here are a few key aspects:
有了所有这些变化,JavaScript 正则表达式现在如何与其他风格相提并论呢?有多种方法可以考虑这个问题,但以下是几个关键方面:

  • Performance.
    This is an important aspect but probably not the main one since mature regex implementations are generally pretty fast. JavaScript is strong on regex performance (at least considering V8’s Irregexp engine, used by Node.js, Chromium-based browsers, and even Firefox; and JavaScriptCore, used by Safari), but it uses a backtracking engine that is missing any syntax for backtracking control — a major limitation that makes ReDoS vulnerability more common.
    性能。 这是一个重要的方面,但可能不是主要方面,因为成熟的 regex 实现通常非常快。JavaScript 在正则表达式性能方面很强(至少考虑到 V8 的 Irregexp 引擎,Node.js、基于 Chromium 的浏览器甚至 Firefox;以及 Safari 使用的 JavaScriptCore),但它使用的回溯引擎缺少任何回溯控制语法——这是使 ReDoS 漏洞更常见的主要限制。
  • Support for advanced features that handle common or important use cases.
    Here, JavaScript stepped up its game with ES2018 and ES2024. JavaScript is now best in class for some features like lookbehind (with its infinite-length support) and Unicode properties (with multicharacter “properties of strings,” set subtraction and intersection, and script extensions). These features are either not supported or not as robust in many other flavors.
    支持处理常见或重要使用案例的高级功能。 在这里,JavaScript 通过 ES2018 和 ES2024 加强了它的游戏。JavaScript 现在是某些功能的最佳版本,例如后瞻(具有无限长度支持)和 Unicode 属性(具有多字符“字符串属性”、集合减法和交集以及脚本扩展)。这些功能要么不受支持,要么在许多其他风格中不那么健壮。
  • Ability to write readable and maintainable patterns.
    Here, native JavaScript has long been the worst of the major flavors since it lacks the x (“extended”) flag that allows insignificant whitespace and comments. Additionally, it lacks regex subroutines and subroutine definition groups (from PCRE and Perl), a powerful set of features that enable writing grammatical regexes that build up complex patterns via composition.
    能够编写可读和可维护的模式。在这里,原生 JavaScript 长期以来一直是最糟糕的主要风格,因为它缺少 x (“extended”) 标志,该标志允许无关紧要的空格和注释。此外,它还缺少正则表达式子例程和子例程定义组(来自 PCRE 和 Perl),这是一组强大的功能,可以编写语法正则表达式,通过组合构建复杂的模式。

So, it’s a bit of a mixed bag.
所以,这有点好坏参半。

The good news is that all of these holes can be filled by a JavaScript library, which we’ll see later in this article.
好消息是,所有这些漏洞都可以由 JavaScript 库填补,我们将在本文后面看到。

Using JavaScript’s Modern Regex Features

使用 JavaScript 的现代正则表达式功能

Let’s look at a few of the more useful modern regex features that you might be less familiar with. You should know in advance that this is a moderately advanced guide. If you’re relatively new to regex, here are some excellent tutorials you might want to start with:
让我们看看一些您可能不太熟悉的更有用的现代正则表达式功能。您应该提前知道这是一个中等高级的指南。如果您对正则表达式相对较新,这里有一些优秀的教程,您可能希望从以下开始:

  • RegexLearn and RegexOne are interactive tutorials that include practice problems.
    RegexLearn 和 RegexOne 是包含练习题的交互式教程。
  • JavaScript.info’s regular expressions chapter is a detailed and JavaScript-specific guide.
    JavaScript.info 的正则表达式章节是一份详细的 JavaScript 特定指南。
  • Demystifying Regular Expressions (video) is an excellent presentation for beginners by Lea Verou at HolyJS 2017.
    揭开正则表达式的神秘面纱(视频)是 Lea Verou 在 HolyJS 2017 上为初学者提供的精彩演示。
  • Learn Regular Expressions In 20 Minutes (video) is a live syntax walkthrough in a regex tester.
    在 20 分钟内学习正则表达式(视频)是正则表达式测试器中的实时语法演练。

Named Capture

命名捕获

Often, you want to do more than just check whether a regex matches — you want to extract substrings from the match and do something with them in your code. Named capturing groups allow you to do this in a way that makes your regexes and code more readable and self-documenting.
通常,你想要做的不仅仅是检查正则表达式是否匹配 - 你想要从匹配中提取子字符串并在代码中对它们执行一些操作。命名捕获组允许您以一种使正则表达式和代码更具可读性和自文档性的方式执行此操作。

The following example matches a record with two date fields and captures the values:
以下示例匹配具有两个日期字段的记录并捕获值:

const record = 'Admitted: 2024-01-01\nReleased: 2024-01-03';
const re = /^Admitted: (?<admitted>\d{4}-\d{2}-\d{2})\nReleased: (?<released>\d{4}-\d{2}-\d{2})$/;
const match = record.match(re);
console.log(match.groups);
/* → {
 admitted: '2024-01-01',
 released: '2024-01-03'
} */

Don’t worry — although this regex might be challenging to understand, later, we’ll look at a way to make it much more readable. The key things here are that named capturing groups use the syntax (?<name>...), and their results are stored on the groups object of matches.
别担心 — 尽管这个正则表达式可能很难理解,但稍后,我们将寻找一种方法来使其更具可读性。这里的关键是命名捕获组使用语法 (?<name>...),并且它们的结果存储在匹配的 groups 对象上。

You can also use named backreferences to rematch whatever a named capturing group matched via \k<name>, and you can use the values within search and replace as follows:
您还可以使用命名反向引用来重新匹配通过 \k<name> 匹配的任何命名捕获组,并且可以在 search 和 replace 中使用值,如下所示:

// Change 'FirstName LastName' to 'LastName, FirstName'
const name = 'Shaquille Oatmeal';
name.replace(/(?<first>\w+) (?<last>\w+)/, '$<last>, $<first>');
// → 'Oatmeal, Shaquille'
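The replacement example above uses the captured values, but not \k<name> itself. As a small sketch of a named backreference inside the pattern, the following regex (an illustrative assumption, not from the original article) finds an accidentally doubled word:

```javascript
// \k<word> must match the same text that the group named "word" captured.
const doubled = /\b(?<word>\w+) \k<word>\b/;

const hit = "Paris in the the spring".match(doubled);
console.log(hit[0]);          // "the the"
console.log(hit.groups.word); // "the"

// "the then" is not a doubled word: the trailing \b fails inside "then".
console.log(doubled.test("the then")); // false
```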

For advanced regexers who want to use named backreferences within a replacement callback function, the groups object is provided as the last argument. Here’s a fancy example:
对于想要在替换回调函数中使用命名反向引用的高级 regexers,groups 对象作为最后一个参数提供。这里有一个花哨的例子:

function fahrenheitToCelsius(str) {
 const re = /(?<degrees>-?\d+(\.\d+)?)F\b/g;
 return str.replace(re, (...args) => {
  const groups = args.at(-1);
  return Math.round((groups.degrees - 32) * 5/9) + 'C';
 });
}
fahrenheitToCelsius('98.6F');
// → '37C'
fahrenheitToCelsius('May 9 high is 40F and low is 21F');
// → 'May 9 high is 4C and low is -6C'

Lookbehind

回顾

Lookbehind (introduced in ES2018) is the complement to lookahead, which has always been supported by JavaScript regexes. Lookahead and lookbehind are assertions (similar to ^ for the start of a string or \b for word boundaries) that don’t consume any characters as part of the match. Lookbehinds succeed or fail based on whether their subpattern can be found immediately before the current match position.
Lookbehind(在 ES2018 中引入)是 lookahead 的补充,JavaScript 正则表达式一直支持它。Lookahead 和 lookbehind 是断言(类似于字符串开头的 ^ 或单词边界的 \b),它们在匹配过程中不使用任何字符。后视是否成功或失败取决于是否可以在当前匹配位置之前立即找到其子模式。

For example, the following regex uses a lookbehind (?<=...) to match the word “cat” (only the word “cat”) if it’s preceded by “fat ”:
例如,如果单词前面有 “fat ”,则以下正则表达式使用后视 (?<=...) 来匹配单词 “cat” (仅单词 “cat”) :

const re = /(?<=fat )cat/g;
'cat, fat cat, brat cat'.replace(re, 'pigeon');
// → 'cat, fat pigeon, brat cat'

You can also use negative lookbehind — written as (?<!...) — to invert the assertion. That would make the regex match any instance of “cat” that’s not preceded by “fat ”.
你也可以使用否定 lookbehind — 写成 (?<!...) — 来反转断言。这将使正则表达式匹配任何前面没有 “fat ” 的 “cat” 实例。

const re = /(?<!fat )cat/g;
'cat, fat cat, brat cat'.replace(re, 'pigeon');
// → 'pigeon, fat cat, brat pigeon'

JavaScript’s implementation of lookbehind is one of the very best (matched only by .NET). Whereas other regex flavors have inconsistent and complex rules for when and whether they allow variable-length patterns inside lookbehind, JavaScript allows you to look behind for any subpattern.
JavaScript 的后视实现是最好的实现之一(仅与 .NET 相媲美)。其他正则表达式风格对于何时以及是否允许在 lookbehind 中使用可变长度模式具有不一致且复杂的规则,而 JavaScript 允许你向后查看任何子模式。
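To illustrate a variable-length lookbehind, here is a minimal sketch. The pattern and the price strings are assumptions made up for this example; the lookbehind contains \d+, which can match any number of digits:

```javascript
// Match the digits after the decimal point, but only when they are
// preceded by "$", one or more digits, and a dot.
const cents = /(?<=\$\d+\.)\d+/;

console.log("$1234.56".match(cents)[0]); // "56"
console.log("$7.05".match(cents)[0]);    // "05"
console.log("1234.56".match(cents));     // null (no "$" before the digits)
```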

The matchAll Method

matchAll 方法

JavaScript’s String.prototype.matchAll was added in ES2020 and makes it easier to operate on regex matches in a loop when you need extended match details. Although other solutions were possible before, matchAll is often easier, and it avoids gotchas, such as the need to guard against infinite loops when looping over the results of regexes that might return zero-length matches.
JavaScript 的 String.prototype.matchAll 是在 ES2020 中添加的,当你需要扩展匹配细节时,可以更轻松地在循环中对正则表达式匹配进行操作。尽管以前还有其他解决方案,但 matchAll 通常更容易,并且它避免了陷阱,例如在循环可能返回零长度匹配的正则表达式结果时需要防止无限循环。

Since matchAll returns an iterator (rather than an array), it’s easy to use it in a for...of loop.
由于 matchAll 返回一个迭代器(而不是数组),因此很容易在 for...of 循环中使用它。

const str = 'abcd'; // sample input
const re = /(?<char1>\w)(?<char2>\w)/g;
for (const match of str.matchAll(re)) {
  const {char1, char2} = match.groups;
  // Print each complete match and matched subpatterns
  console.log(`Matched "${match[0]}" with "${char1}" and "${char2}"`);
}
// → Matched "ab" with "a" and "b"
// → Matched "cd" with "c" and "d"

Note: matchAll requires its regexes to use flag g (global). Also, as with other iterators, you can get all of its results as an array using Array.from or array spreading.
注意:matchAll 要求其正则表达式使用标志 g(全局)。此外,与其他迭代器一样,您可以使用 Array.from 或 array spread 将其所有结果作为数组获取。

const matches = [...str.matchAll(/./g)];
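Two of the points above, the flag g requirement and the zero-length match gotcha, can be checked with a short sketch (the inputs are illustrative assumptions):

```javascript
// matchAll throws a TypeError when the regex does not have flag g.
let threwWithoutG = false;
try {
  "abc".matchAll(/b/);
} catch (e) {
  threwWithoutG = e instanceof TypeError;
}
console.log(threwWithoutG); // true

// /x*/g matches the empty string at every position. matchAll advances
// past each zero-length match on its own, so the loop terminates.
const emptyMatches = [..."abc".matchAll(/x*/g)];
console.log(emptyMatches.length);              // 4
console.log(emptyMatches.map((m) => m.index)); // [ 0, 1, 2, 3 ]
```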

Unicode Properties

UNICODE 属性

Unicode properties (added in ES2018) give you powerful control over multilingual text, using the syntax \p{...} and its negated version \P{...}. There are hundreds of different properties you can match, which cover a wide variety of Unicode categories, scripts, script extensions, and binary properties.
Unicode 属性(在 ES2018 中添加)使用语法 \p{...} 及其否定版本 \P{...} 为您提供对多语言文本的强大控制。您可以匹配数百种不同的属性,这些属性涵盖各种 Unicode 类别、脚本、脚本扩展名和二进制属性。

Note: For more details, check out the documentation on MDN.
注意:有关更多详细信息,请查看 MDN 上的文档。

Unicode properties require using the flag u (unicode) or v (unicodeSets).
Unicode 属性需要使用标志 uunicode) 或 vunicodeSets)。
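Here is a brief sketch of \p{...} and \P{...} in use, with sample strings chosen for illustration. The last two lines show what happens when flag u is left off, which is an easy mistake to miss because no error is thrown:

```javascript
// Script property: grab a run of Greek characters.
const greek = "abc αβγ 123".match(/\p{Script=Greek}+/u)[0];
console.log(greek); // "αβγ"

// General category: does the string start with an uppercase letter?
const startsUpper = /^\p{Lu}/u.test("\u00C9mile"); // "Émile"
console.log(startsUpper); // true

// Negated version: first character that is not a letter.
const nonLetter = "abc1".match(/\P{L}/u)[0];
console.log(nonLetter); // "1"

// Without flag u, \p is just an escaped "p" and {L} is literal text.
console.log(/\p{L}/.test("a"));    // false
console.log(/\p{L}/.test("p{L}")); // true
```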

Flag v

标志 v

Flag v (unicodeSets) was added in ES2024 and is an upgrade to flag u — you can’t use both at the same time. It’s a best practice to always use one of these flags to avoid silently introducing bugs via the default Unicode-unaware mode. The decision on which to use is fairly straightforward. If you’re okay with only supporting environments with flag v (Node.js 20 and 2023-era browsers), then use flag v; otherwise, use flag u.
标志 vunicodeSets) 是在 ES2024 中添加的,是标志 u 的升级 — 您不能同时使用两者。最佳做法是始终使用这些标志之一,以避免通过默认的 Unicode uncode-unaware 模式静默引入错误。决定使用哪个相当简单。如果您只支持带有标志 v 的环境(Node.js 20 和 2023 时代的浏览器),请使用标志 v;否则,使用标志 u

Flag v adds support for several new regex features, with the coolest probably being set subtraction and intersection. This allows using A--B (within character classes) to match strings in A but not in B or using A&&B to match strings in both A and B. For example:
标志 v 添加了对几个新正则表达式功能的支持,其中最酷的可能是设置减法和交集。这允许使用 A--B (在字符类中) 匹配 A 中的字符串但不匹配 B 中的字符串,或使用 A&&B 匹配 A 和 B 中的字符串。例如:

// Matches all Greek symbols except the letter 'π'
/[\p{Script_Extensions=Greek}--π]/v

// Matches only Greek letters
/[\p{Script_Extensions=Greek}&&\p{Letter}]/v

For more details about flag v, including its other new features, check out this explainer from the Google Chrome team.
有关标记 v 的更多详细信息,包括其其他新功能,请查看 Google Chrome 团队的此解释器。

A Word on Matching Emoji

关于匹配 emoji 的一句话

Emoji are

From: https://blog.csdn.net/u013669912/article/details/143468338
