.NET regular expressions

Regular expressions provide a powerful, flexible, and efficient method for processing text. The extensive pattern-matching notation of regular expressions enables you to quickly parse large amounts of text to:

  • Find specific character patterns.
  • Validate text to ensure that it matches a predefined pattern (such as an email address).
  • Extract, edit, replace, or delete text substrings.
  • Add extracted strings to a collection in order to generate a report.

For many applications that deal with strings or that parse large blocks of text, regular expressions are an indispensable tool.

How regular expressions work

The centerpiece of text processing with regular expressions is the regular expression engine, which is represented by the System.Text.RegularExpressions.Regex object in .NET. At a minimum, processing text using regular expressions requires that the regular expression engine be provided with the following two items of information:

  • The regular expression pattern to identify in the text.

    In .NET, regular expression patterns are defined by a special syntax or language, which is compatible with Perl 5 regular expressions and adds some additional features such as right-to-left matching. For more information, see Regular Expression Language - Quick Reference.

  • The text to parse for the regular expression pattern.

The methods of the Regex class let you perform the following operations:

  • Determine whether the regular expression pattern occurs in the input text by calling the Regex.IsMatch method.
  • Retrieve one or all occurrences of text that matches the regular expression pattern by calling the Regex.Match or Regex.Matches method.
  • Replace text that matches the regular expression pattern by calling the Regex.Replace method.
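
The core Regex operations can be sketched as follows. This is a minimal illustration rather than part of the original examples: the class name RegexOperationsSketch, the pattern \d+, and the sample input are assumptions chosen only to show the static IsMatch, Match, Matches, and Replace methods side by side.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexOperationsSketch
{
    // Hypothetical sample values, chosen only for illustration.
    public const string Pattern = @"\d+";
    public const string Input = "Order 66 shipped 3 items on day 12.";

    // Does the pattern occur anywhere in the text?
    public static bool ContainsNumber(string text) => Regex.IsMatch(text, Pattern);

    // Retrieve the first occurrence (an empty string if there is none).
    public static string FirstNumber(string text) => Regex.Match(text, Pattern).Value;

    // Retrieve all occurrences and count them.
    public static int CountNumbers(string text) => Regex.Matches(text, Pattern).Count;

    // Replace every occurrence with a placeholder.
    public static string MaskNumbers(string text) => Regex.Replace(text, Pattern, "#");

    public static void Main()
    {
        Console.WriteLine(ContainsNumber(Input));   // True
        Console.WriteLine(FirstNumber(Input));      // 66
        Console.WriteLine(CountNumbers(Input));     // 3
        Console.WriteLine(MaskNumbers(Input));      // Order # shipped # items on day #.
    }
}
```

The static methods are convenient for one-off calls. When the same pattern is applied many times, constructing a Regex instance once and reusing it is a common alternative.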

For an overview of the regular expression object model, see The Regular Expression Object Model.

For more information about the regular expression language, see Regular Expression Language - Quick Reference.

Regular expression examples

The String class includes a number of string search and replacement methods that you can use when you want to locate literal strings in a larger string. Regular expressions are most useful either when you want to locate one of several substrings in a larger string, or when you want to identify patterns in a string, as the following examples illustrate.

Warning

When using System.Text.RegularExpressions to process untrusted input, pass a timeout. A malicious user can provide input to RegularExpressions that causes a Denial-of-Service attack. ASP.NET Core framework APIs that use RegularExpressions pass a timeout.
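
Passing a timeout might look like the following sketch. The class name TimeoutSketch, the helper TryIsMatch, the 250-millisecond limit, and the sample pattern are all assumptions made for illustration. The sketch relies on the Regex.IsMatch overload that accepts a TimeSpan and on RegexMatchTimeoutException, which is thrown when a match operation exceeds that interval.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TimeoutSketch
{
    // Returns true or false for a completed match attempt,
    // or null if the attempt exceeded the supplied timeout.
    public static bool? TryIsMatch(string input, string pattern, TimeSpan timeout)
    {
        try
        {
            return Regex.IsMatch(input, pattern, RegexOptions.None, timeout);
        }
        catch (RegexMatchTimeoutException)
        {
            // Treat a timeout as "no answer" instead of letting the request hang.
            return null;
        }
    }

    public static void Main()
    {
        // Nested quantifiers like these are prone to excessive backtracking
        // on input that almost matches. Whether this particular call times out
        // depends on the runtime version and the machine.
        string pattern = @"^(a+)+$";
        string input = new string('a', 40) + "!";

        bool? result = TryIsMatch(input, pattern, TimeSpan.FromMilliseconds(250));
        Console.WriteLine(result.HasValue ? result.Value.ToString() : "Timed out");
    }
}
```

Returning a nullable value keeps the three outcomes (match, no match, timed out) distinct, so the caller can decide how to handle untrusted input that could not be evaluated in time.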

Tip

The System.Web.RegularExpressions namespace contains a number of regular expression objects that implement predefined regular expression patterns for parsing strings from HTML, XML, and ASP.NET documents. For example, the TagRegex class identifies start tags in a string and the CommentRegex class identifies ASP.NET comments in a string.

Example 1: Replace substrings

Assume that a mailing list contains names that sometimes include a title (Mr., Mrs., Miss, or Ms.) along with a first and last name. If you do not want to include the titles when you generate envelope labels from the list, you can use a regular expression to remove the titles, as the following example illustrates.

using System;
using System.Text.RegularExpressions;

public class Example
{
   public static void Main()
   {
      string pattern = "(Mr\\.? |Mrs\\.? |Miss |Ms\\.? )";
      string[] names = { "Mr. Henry Hunt", "Ms. Sara Samuels",
                         "Abraham Adams", "Ms. Nicole Norris" };
      foreach (string name in names)
         Console.WriteLine(Regex.Replace(name, pattern, String.Empty));
   }
}
// The example displays the following output:
//    Henry Hunt
//    Sara Samuels
//    Abraham Adams
//    Nicole Norris
Imports System.Text.RegularExpressions

Module Example
    Public Sub Main()
        Dim pattern As String = "(Mr\.? |Mrs\.? |Miss |Ms\.? )"
        Dim names() As String = {"Mr. Henry Hunt", "Ms. Sara Samuels", _
                                  "Abraham Adams", "Ms. Nicole Norris"}
        For Each name As String In names
            Console.WriteLine(Regex.Replace(name, pattern, String.Empty))
        Next
    End Sub
End Module
' The example displays the following output:
'    Henry Hunt
'    Sara Samuels
'    Abraham Adams
'    Nicole Norris

The regular expression pattern (Mr\.? |Mrs\.? |Miss |Ms\.? ) matches any occurrence of "Mr ", "Mr. ", "Mrs ", "Mrs. ", "Miss ", "Ms ", or "Ms. ". The call to the Regex.Replace method replaces the matched string with String.Empty; in other words, it removes it from the original string.

Example 2: Identify duplicated words

Accidentally duplicating words is a common error that writers make. A regular expression can be used to identify duplicated words, as the following example shows.

using System;
using System.Text.RegularExpressions;

public class Class1
{
   public static void Main()
   {
      string pattern = @"\b(\w+?)\s\1\b";
      string input = "This this is a nice day. What about this? This tastes good. I saw a a dog.";
      foreach (Match match in Regex.Matches(input, pattern, RegexOptions.IgnoreCase))
         Console.WriteLine("{0} (duplicates '{1}') at position {2}",
                           match.Value, match.Groups[1].Value, match.Index);
   }
}
// The example displays the following output:
//       This this (duplicates 'This') at position 0
//       a a (duplicates 'a') at position 66
Imports System.Text.RegularExpressions

Module modMain
    Public Sub Main()
        Dim pattern As String = "\b(\w+?)\s\1\b"
        Dim input As String = "This this is a nice day. What about this? This tastes good. I saw a a dog."
        For Each match As Match In Regex.Matches(input, pattern, RegexOptions.IgnoreCase)
            Console.WriteLine("{0} (duplicates '{1}') at position {2}", _
                              match.Value, match.Groups(1).Value, match.Index)
        Next
    End Sub
End Module
' The example displays the following output:
'       This this (duplicates 'This') at position 0
'       a a (duplicates 'a') at position 66

The regular expression pattern \b(\w+?)\s\1\b can be interpreted as follows:

Pattern Interpretation
\b Start at a word boundary.
(\w+?) Match one or more word characters, but as few characters as possible. Together, they form a group that can be referred to as \1.
\s Match a white-space character.
\1 Match the substring that is equal to the text captured by the group referred to as \1.
\b Match a word boundary.

The Regex.Matches method is called with regular expression options set to RegexOptions.IgnoreCase. Therefore, the match operation is case-insensitive, and the example identifies the substring "This this" as a duplication.

The input string includes the substring "this? This". However, because of the intervening punctuation mark, it is not identified as a duplication.
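
If repeats separated by punctuation should also be reported, one possible variation is to replace \s with \W+, which matches one or more non-word characters. The following is a sketch of that variation, not part of the original example; the class name DuplicateWordsSketch and the method FindDuplicates are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DuplicateWordsSketch
{
    // Returns each repeated-word substring, allowing any run of
    // non-word characters (spaces, punctuation) between the two words.
    public static List<string> FindDuplicates(string input)
    {
        string pattern = @"\b(\w+?)\W+\1\b";
        var results = new List<string>();
        foreach (Match match in Regex.Matches(input, pattern, RegexOptions.IgnoreCase))
            results.Add(match.Value);
        return results;
    }

    public static void Main()
    {
        string input = "This this is a nice day. What about this? This tastes good. I saw a a dog.";
        foreach (string duplicate in FindDuplicates(input))
            Console.WriteLine(duplicate);
    }
}
// The sketch displays the following output:
//       This this
//       this? This
//       a a
```

Whether a repeat across a sentence boundary counts as an error depends on the text being checked, so this looser pattern can report repeats that are intentional.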

Example 3: Dynamically build a culture-sensitive regular expression

The following example illustrates the power of regular expressions combined with the flexibility offered by .NET's globalization features. It uses the NumberFormatInfo object to determine the format of currency values in the system's current culture. It then uses that information to dynamically construct a regular expression that extracts currency values from the text. For each match, it extracts the subgroup that contains the numeric string only, converts it to a Decimal value, and calculates a running total.

using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public class Example
{
   public static void Main()
   {
      // Define text to be parsed.
      string input = "Office expenses on 2/13/2008:\n" +
                     "Paper (500 sheets)                      $3.95\n" +
                     "Pencils (box of 10)                     $1.00\n" +
                     "Pens (box of 10)                        $4.49\n" +
                     "Erasers                                 $2.19\n" +
                     "Ink jet printer                        $69.95\n\n" +
                     "Total Expenses                        $ 81.58\n";

      // Get current culture's NumberFormatInfo object.
      NumberFormatInfo nfi = CultureInfo.CurrentCulture.NumberFormat;
      // Assign needed property values to variables.
      string currencySymbol = nfi.CurrencySymbol;
      bool symbolPrecedesIfPositive = nfi.CurrencyPositivePattern % 2 == 0;
      string groupSeparator = nfi.CurrencyGroupSeparator;
      string decimalSeparator = nfi.CurrencyDecimalSeparator;

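      // Form regular expression pattern. Regex.Escape keeps culture-specific
      // symbols and separators (for example, "." or "$") from being read as
      // regular expression metacharacters.
      string pattern = Regex.Escape( symbolPrecedesIfPositive ? currencySymbol : "") +
                       @"\s*[-+]?" + "([0-9]{0,3}(" + Regex.Escape(groupSeparator) + "[0-9]{3})*(" +
                       Regex.Escape(decimalSeparator) + "[0-9]+)?)" +
                       Regex.Escape(! symbolPrecedesIfPositive ? currencySymbol : "");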
      Console.WriteLine( "The regular expression pattern is:");
      Console.WriteLine("   " + pattern);

      // Get text that matches regular expression pattern.
      MatchCollection matches = Regex.Matches(input, pattern,
                                              RegexOptions.IgnorePatternWhitespace);
      Console.WriteLine("Found {0} matches.", matches.Count);

      // Get numeric string, convert it to a value, and add it to List object.
      List<decimal> expenses = new List<Decimal>();

      foreach (Match match in matches)
         expenses.Add(Decimal.Parse(match.Groups[1].Value));

      // Determine whether total is present and if present, whether it is correct.
      decimal total = 0;
      foreach (decimal value in expenses)
         total += value;

      if (total / 2 == expenses[expenses.Count - 1])
         Console.WriteLine("The expenses total {0:C2}.", expenses[expenses.Count - 1]);
      else
         Console.WriteLine("The expenses total {0:C2}.", total);
   }
}
// The example displays the following output:
//       The regular expression pattern is:
//          \$\s*[-+]?([0-9]{0,3}(,[0-9]{3})*(\.[0-9]+)?)
//       Found 6 matches.
//       The expenses total $81.58.
Imports System.Collections.Generic
Imports System.Globalization
Imports System.Text.RegularExpressions

Public Module Example
    Public Sub Main()
        ' Define text to be parsed.
        Dim input As String = "Office expenses on 2/13/2008:" + vbCrLf + _
                              "Paper (500 sheets)                      $3.95" + vbCrLf + _
                              "Pencils (box of 10)                     $1.00" + vbCrLf + _
                              "Pens (box of 10)                        $4.49" + vbCrLf + _
                              "Erasers                                 $2.19" + vbCrLf + _
                              "Ink jet printer                        $69.95" + vbCrLf + vbCrLf + _
                              "Total Expenses                        $ 81.58" + vbCrLf
        ' Get current culture's NumberFormatInfo object.
        Dim nfi As NumberFormatInfo = CultureInfo.CurrentCulture.NumberFormat
        ' Assign needed property values to variables.
        Dim currencySymbol As String = nfi.CurrencySymbol
        Dim symbolPrecedesIfPositive As Boolean = CBool(nfi.CurrencyPositivePattern Mod 2 = 0)
        Dim groupSeparator As String = nfi.CurrencyGroupSeparator
        Dim decimalSeparator As String = nfi.CurrencyDecimalSeparator

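        ' Form regular expression pattern. Regex.Escape keeps culture-specific
        ' symbols and separators (for example, "." or "$") from being read as
        ' regular expression metacharacters.
        Dim pattern As String = Regex.Escape(CStr(IIf(symbolPrecedesIfPositive, currencySymbol, ""))) + _
                                "\s*[-+]?" + "([0-9]{0,3}(" + Regex.Escape(groupSeparator) + "[0-9]{3})*(" + _
                                Regex.Escape(decimalSeparator) + "[0-9]+)?)" + _
                                Regex.Escape(CStr(IIf(Not symbolPrecedesIfPositive, currencySymbol, "")))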
        Console.WriteLine("The regular expression pattern is: ")
        Console.WriteLine("   " + pattern)

        ' Get text that matches regular expression pattern.
        Dim matches As MatchCollection = Regex.Matches(input, pattern, RegexOptions.IgnorePatternWhitespace)
        Console.WriteLine("Found {0} matches. ", matches.Count)

        ' Get numeric string, convert it to a value, and add it to List object.
        Dim expenses As New List(Of Decimal)

        For Each match As Match In matches
            expenses.Add(Decimal.Parse(match.Groups.Item(1).Value))
        Next

        ' Determine whether total is present and if present, whether it is correct.
        Dim total As Decimal
        For Each value As Decimal In expenses
            total += value
        Next

        If total / 2 = expenses(expenses.Count - 1) Then
            Console.WriteLine("The expenses total {0:C2}.", expenses(expenses.Count - 1))
        Else
            Console.WriteLine("The expenses total {0:C2}.", total)
        End If
    End Sub
End Module
' The example displays the following output:
'       The regular expression pattern is:
'          \$\s*[-+]?([0-9]{0,3}(,[0-9]{3})*(\.[0-9]+)?)
'       Found 6 matches.
'       The expenses total $81.58.

On a computer whose current culture is English - United States (en-US), the example dynamically builds the regular expression \$\s*[-+]?([0-9]{0,3}(,[0-9]{3})*(\.[0-9]+)?). This regular expression pattern can be interpreted as follows:

Pattern Interpretation
\$ Look for a single occurrence of the dollar symbol ($) in the input string. The regular expression pattern string includes a backslash to indicate that the dollar symbol is to be interpreted literally rather than as a regular expression anchor. (The $ symbol alone would indicate that the regular expression engine should try to begin its match at the end of a string.) To ensure that the current culture's currency symbol is not misinterpreted as a regular expression symbol, the example calls the Regex.Escape method to escape the character.
\s* Look for zero or more occurrences of a white-space character.
[-+]? Look for zero or one occurrence of either a positive sign or a negative sign.
([0-9]{0,3}(,[0-9]{3})*(\.[0-9]+)?) The outer parentheses around this expression define it as a capturing group or a subexpression. If a match is found, information about this part of the matching string can be retrieved from the second Group object in the GroupCollection object returned by the Match.Groups property. (The first element in the collection represents the entire match.)
[0-9]{0,3} Look for zero to three occurrences of the decimal digits 0 through 9.
(,[0-9]{3})* Look for zero or more occurrences of a group separator followed by three decimal digits.
\. Look for a single occurrence of the decimal separator.
[0-9]+ Look for one or more decimal digits.
(\.[0-9]+)? Look for zero or one occurrence of the decimal separator followed by at least one decimal digit.

If each of these subpatterns is found in the input string, the match succeeds, and a Match object that contains information about the match is added to the MatchCollection object.
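
To see why escaping matters outside en-US, the pattern-building step can be sketched against a NumberFormatInfo whose group separator is a period and whose currency symbol follows the number. Everything here is an assumption for illustration: the class CurrencyPatternSketch, its methods, the hand-built NumberFormatInfo (used instead of a real culture so the result does not depend on the machine), and the sample text. Unlike the example above, the sketch also allows optional white space before a trailing symbol.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class CurrencyPatternSketch
{
    // Builds a currency pattern from the supplied format information,
    // escaping every culture-specific piece before it enters the pattern.
    public static string BuildPattern(NumberFormatInfo nfi)
    {
        bool symbolPrecedes = nfi.CurrencyPositivePattern % 2 == 0;
        string symbol = Regex.Escape(nfi.CurrencySymbol);
        string number = "([0-9]{0,3}(" + Regex.Escape(nfi.CurrencyGroupSeparator) +
                        "[0-9]{3})*(" + Regex.Escape(nfi.CurrencyDecimalSeparator) + "[0-9]+)?)";
        return (symbolPrecedes ? symbol : "") + @"\s*[-+]?" + number +
               (symbolPrecedes ? "" : @"\s*" + symbol);
    }

    // Extracts the numeric subgroup of each match and converts it to a decimal.
    public static List<decimal> ExtractAmounts(string input, NumberFormatInfo nfi)
    {
        var amounts = new List<decimal>();
        foreach (Match match in Regex.Matches(input, BuildPattern(nfi)))
        {
            string digits = match.Groups[1].Value;
            if (digits.Length == 0) continue;   // a lone currency symbol has no amount
            amounts.Add(decimal.Parse(digits, NumberStyles.Number, nfi));
        }
        return amounts;
    }

    // Hypothetical format: "1.249,50 €" (period groups, comma decimals, symbol last).
    public static NumberFormatInfo SampleFormat() => new NumberFormatInfo
    {
        CurrencySymbol = "€",
        CurrencyPositivePattern = 3,
        CurrencyGroupSeparator = ".",
        CurrencyDecimalSeparator = ",",
        NumberGroupSeparator = ".",
        NumberDecimalSeparator = ","
    };

    public static void Main()
    {
        NumberFormatInfo nfi = SampleFormat();
        Console.WriteLine(BuildPattern(nfi));
        // \s*[-+]?([0-9]{0,3}(\.[0-9]{3})*(,[0-9]+)?)\s*€

        decimal total = 0;
        foreach (decimal amount in ExtractAmounts("Monitor 1.249,50 € and cable 9,99 €", nfi))
            total += amount;
        Console.WriteLine(total.ToString(CultureInfo.InvariantCulture));
    }
}
```

Without Regex.Escape, the period used as a group separator would match any character, so a string such as "1x249" could be read as a grouped number.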

Title Description
Regular Expression Language - Quick Reference Provides information on the set of characters, operators, and constructs that you can use to define regular expressions.
The Regular Expression Object Model Provides information and code examples that illustrate how to use the regular expression classes.
Details of Regular Expression Behavior Provides information about the capabilities and behavior of .NET regular expressions.
Use regular expressions in Visual Studio

Reference