Regular Expression Example: Scanning for HREFs

The following example searches an input string and displays every href="…" value along with its position in the string.

Warning

When using System.Text.RegularExpressions to process untrusted input, pass a timeout. A malicious user can provide input to RegularExpressions that causes a denial-of-service attack. ASP.NET Core framework APIs that use RegularExpressions pass a timeout.

The Regex Object

Because the DumpHRefs method can be called multiple times from user code, it uses the static (Shared in Visual Basic) Regex.Match(String, String, RegexOptions, TimeSpan) method. This lets the regular expression engine cache the regular expression and avoids the overhead of instantiating a new Regex object each time the method is called. A Match object is then used to iterate through all matches in the string.

private static void DumpHRefs(string inputString)
{
    Match m;
    string HRefPattern = @"href\s*=\s*(?:[""'](?<1>[^""']*)[""']|(?<1>\S+))";

    try
    {
        m = Regex.Match(inputString, HRefPattern,
                        RegexOptions.IgnoreCase | RegexOptions.Compiled,
                        TimeSpan.FromSeconds(1));
        while (m.Success)
        {
            Console.WriteLine("Found href " + m.Groups[1] + " at "
               + m.Groups[1].Index);
            m = m.NextMatch();
        }
    }
    catch (RegexMatchTimeoutException)
    {
        Console.WriteLine("The matching operation timed out.");
    }
}

Private Sub DumpHRefs(inputString As String)
    Dim m As Match
    Dim HRefPattern As String = "href\s*=\s*(?:[""'](?<1>[^""']*)[""']|(?<1>\S+))"

    Try
        m = Regex.Match(inputString, HRefPattern, _
                        RegexOptions.IgnoreCase Or RegexOptions.Compiled,
                        TimeSpan.FromSeconds(1))
        Do While m.Success
            Console.WriteLine("Found href {0} at {1}.", _
                              m.Groups(1), m.Groups(1).Index)
            m = m.NextMatch()
        Loop
    Catch e As RegexMatchTimeoutException
        Console.WriteLine("The matching operation timed out.")
    End Try
End Sub

The following example illustrates a call to the DumpHRefs method.

public static void Main()
{
    string inputString = "My favorite web sites include:</P>" +
                         "<A HREF=\"http://msdn2.microsoft.com\">" +
                         "MSDN Home Page</A></P>" +
                         "<A HREF=\"http://www.microsoft.com\">" +
                         "Microsoft Corporation Home Page</A></P>" +
                         "<A HREF=\"http://blogs.msdn.com/bclteam\">" +
                         ".NET Base Class Library blog</A></P>";
    DumpHRefs(inputString);
}
// The example displays the following output:
//       Found href http://msdn2.microsoft.com at 43
//       Found href http://www.microsoft.com at 102
//       Found href http://blogs.msdn.com/bclteam at 176

Public Sub Main()
    Dim inputString As String = "My favorite web sites include:</P>" & _
                                "<A HREF=""http://msdn2.microsoft.com"">" & _
                                "MSDN Home Page</A></P>" & _
                                "<A HREF=""http://www.microsoft.com"">" & _
                                "Microsoft Corporation Home Page</A></P>" & _
                                "<A HREF=""http://blogs.msdn.com/bclteam"">" & _
                                ".NET Base Class Library blog</A></P>"
    DumpHRefs(inputString)
End Sub
' The example displays the following output:
'       Found href http://msdn2.microsoft.com at 43.
'       Found href http://www.microsoft.com at 102.
'       Found href http://blogs.msdn.com/bclteam at 176.

The regular expression pattern href\s*=\s*(?:["'](?<1>[^"']*)["']|(?<1>\S+)) is interpreted as shown in the following table.

Pattern Description
href Match the literal string "href". The match is case-insensitive.
\s* Match zero or more white-space characters.
= Match the equals sign.
\s* Match zero or more white-space characters.
(?:["'](?<1>[^"']*)["']|(?<1>\S+)) Match one of the following without assigning the result to a captured group:
  • A quotation mark or apostrophe, followed by zero or more occurrences of any character other than a quotation mark or apostrophe, followed by a quotation mark or apostrophe. The group named 1 is included in this pattern.

  • One or more non-white-space characters. The group named 1 is included in this pattern.

(?<1>[^"']*) Assign zero or more occurrences of any character other than a quotation mark or apostrophe to the capturing group named 1.
(?<1>\S+) Assign one or more non-white-space characters to the capturing group named 1.
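
As a quick check of the table above, the following minimal sketch runs the same pattern against the three input forms it is meant to accept: a double-quoted value, a single-quoted value with spaces around the equals sign, and an unquoted value. The class name HRefPatternSketch and the helper ExtractHRef are hypothetical names introduced only for this illustration, and for brevity the sketch does not catch RegexMatchTimeoutException.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HRefPatternSketch
{
    // Same pattern as in the DumpHRefs example.
    private const string HRefPattern =
        @"href\s*=\s*(?:[""'](?<1>[^""']*)[""']|(?<1>\S+))";

    // Hypothetical helper: returns the text captured in group 1 for the
    // first href found, or an empty string when there is no match.
    public static string ExtractHRef(string input)
    {
        Match m = Regex.Match(input, HRefPattern,
                              RegexOptions.IgnoreCase,
                              TimeSpan.FromSeconds(1));
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        // Double-quoted value: handled by the first alternative.
        Console.WriteLine(ExtractHRef("<a href=\"a.html\">"));
        // Single-quoted value, upper case, spaces around '=': first alternative.
        Console.WriteLine(ExtractHRef("<a HREF = 'b.html'>"));
        // Unquoted value: handled by the second alternative (\S+).
        Console.WriteLine(ExtractHRef("href=c.html and more"));
    }
}
```

Because the second alternative is \S+, an unquoted value runs up to the next white-space character, so it can include trailing markup such as ">" when no space separates them.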

Match Result Class

The results of a search are stored in the Match class, which provides access to all the substrings extracted by the search. It also remembers the string being searched and the regular expression being used, so it can call the Match.NextMatch method to perform another search starting where the last one ended.
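
The following sketch isolates that behavior: each call to Match.NextMatch resumes the search from where the previous match ended, reusing the same input string and pattern. The class name NextMatchSketch and the helper FindAll are hypothetical, and the "value@index" result format is just a convenience for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NextMatchSketch
{
    // Hypothetical helper: returns "value@index" for every href in the input.
    public static List<string> FindAll(string input)
    {
        var results = new List<string>();
        Match m = Regex.Match(input,
                              @"href\s*=\s*(?:[""'](?<1>[^""']*)[""']|(?<1>\S+))",
                              RegexOptions.IgnoreCase,
                              TimeSpan.FromSeconds(1));
        while (m.Success)
        {
            // Group 1 holds the href value; Index is its offset in the input.
            results.Add(m.Groups[1].Value + "@" + m.Groups[1].Index);
            // The Match object remembers the input and the pattern, so no
            // arguments are needed to continue the search.
            m = m.NextMatch();
        }
        return results;
    }

    public static void Main()
    {
        foreach (string hit in FindAll("<a href=\"x\"> <a href='yz'>"))
        {
            Console.WriteLine(hit);
        }
    }
}
```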

Explicitly Named Captures

In traditional regular expressions, capturing parentheses are automatically numbered sequentially. This leads to two problems. First, if a regular expression is modified by inserting or removing a set of parentheses, all code that refers to the numbered captures must be rewritten to reflect the new numbering. Second, because different sets of parentheses are often used to provide two alternative expressions for an acceptable match, it can be difficult to determine which of the two expressions actually returned a result.

To address these problems, the Regex class supports the syntax (?<name>…) for capturing a match into a specified slot (the slot can be named using a string or an integer; integers can be recalled more quickly). Thus, alternative matches for the same string can all be directed to the same place. In case of a conflict, the last match dropped into a slot is the successful match. (However, a complete list of multiple matches for a single slot is available. See the Group.Captures collection for details.)
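
To make the "last match wins" rule concrete, here is a small sketch that uses a string-named slot. The pattern (?<digit>\d)+ writes to the same slot once per digit, so the group's value is the last digit captured, while Group.Captures keeps every capture in order. The class name NamedCaptureSketch, the slot name digit, and both helper methods are assumptions made for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedCaptureSketch
{
    // Each repetition of the group writes to the same slot, "digit".
    private const string Pattern = @"(?<digit>\d)+";

    // Hypothetical helper: the value left in the slot after the match,
    // that is, the last capture written to it.
    public static string LastDigit(string input)
    {
        Match m = Regex.Match(input, Pattern, RegexOptions.None,
                              TimeSpan.FromSeconds(1));
        return m.Groups["digit"].Value;
    }

    // Hypothetical helper: every capture recorded for the slot, in order.
    public static string[] AllDigits(string input)
    {
        Match m = Regex.Match(input, Pattern, RegexOptions.None,
                              TimeSpan.FromSeconds(1));
        CaptureCollection captures = m.Groups["digit"].Captures;
        var all = new string[captures.Count];
        for (int i = 0; i < all.Length; i++)
        {
            all[i] = captures[i].Value;
        }
        return all;
    }

    public static void Main()
    {
        Console.WriteLine(LastDigit("a123b"));                   // 3
        Console.WriteLine(string.Join(",", AllDigits("a123b"))); // 1,2,3
    }
}
```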

See also