Indexing HTML - token lengths and highlights

Perak Piotr 1 Reputation point
2021-09-17T10:58:45.89+00:00

I'm trying to index HTML content. For this I have built a custom analyzer:

{
  "name": "html",
  "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
  "tokenizer": "standard_v2",
  "tokenFilters": [
    "lowercase"
  ],
  "charFilters": [
    "html_strip"
  ]
}

and assigned it to my HTML field

{
  "name": "htmlTest",
  "type": "Edm.String",
  "facetable": false,
  "filterable": false,
  "key": false,
  "retrievable": true,
  "searchable": true,
  "sortable": false,
  "analyzer": "html",
  "indexAnalyzer": null,
  "searchAnalyzer": null,
  "synonymMaps": [],
  "fields": []
}

When I POST this to the analyze endpoint

{
    "text": "<p><strong>bold</strong> <i>italic</i> normal</p>",
    "analyzer": "html"
}

I see that it extracts the content correctly, but I'm not sure about the endOffset values. I get this back:

"tokens": [
        {
            "token": "bold",
            "startOffset": 11,
            "endOffset": 24,
            "position": 0
        },
        {
            "token": "italic",
            "startOffset": 28,
            "endOffset": 38,
            "position": 1
        },
        {
            "token": "normal",
            "startOffset": 39,
            "endOffset": 45,
            "position": 2
        }
    ]

If you look at the first token, its endOffset is 24, which is not what I expected: it includes the closing </strong> tag. This causes problems when I request highlights, because the highlight includes that tag too:

<p><strong><em>bold</strong></em> <i>italic</i> normal</p>
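
To make the offset arithmetic concrete, here is a minimal Python sketch (an illustration only, not something the service provides) that slices the original string with the offsets reported above. It assumes the offsets are character indexes into the submitted text, which holds for this ASCII-only sample; `spans_for` and `wrap_span` are hypothetical helper names.

```python
text = "<p><strong>bold</strong> <i>italic</i> normal</p>"

# Offsets copied from the analyze response above.
tokens = [
    {"token": "bold", "startOffset": 11, "endOffset": 24},
    {"token": "italic", "startOffset": 28, "endOffset": 38},
    {"token": "normal", "startOffset": 39, "endOffset": 45},
]


def spans_for(source, token_list):
    """Return (token, source slice covered by its offsets) pairs."""
    return [
        (t["token"], source[t["startOffset"]:t["endOffset"]])
        for t in token_list
    ]


def wrap_span(source, start, end, pre="<em>", post="</em>"):
    """Wrap source[start:end] in highlight tags, as an offset-based highlighter would."""
    return source[:start] + pre + source[start:end] + post + source[end:]


for token, covered in spans_for(text, tokens):
    print(f"{token!r} covers {covered!r}")

print(wrap_span(text, 11, 24))
```

The first two slices come out as `bold</strong>` and `italic</i>`, so the span of each token that sits inside an inline tag runs through its closing tag, while `normal` is covered exactly. Wrapping offsets 11 to 24 reproduces the malformed highlight shown above.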

Is there any way I can improve my analyzer?

Azure AI Search

1 answer

  1. SnehaAgrawal-MSFT 18,366 Reputation points
    2021-09-27T04:41:50.753+00:00

    @Perak Piotr Apologies for the inconvenience. I had an internal discussion with the PG, and they identified a current incompatibility between the "html_strip" charFilter and highlighting. The product team is aware of the issue and is working on it. There is no ETA for a resolution at this moment.

    Suggested workaround: in the meantime, you may try client-side highlighting when the analyzer uses the html_strip char filter.
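
One possible shape for such client-side highlighting is sketched below in Python, using only the standard library's `html.parser`. It is an assumption-laden illustration rather than an official implementation: `TermHighlighter` and `highlight_html` are hypothetical names, the search terms are assumed to be known on the client, and matching is case-insensitive on word boundaries, which only approximates what the standard tokenizer plus the lowercase filter would match. It re-emits the markup and wraps matches in text nodes only, so tags and attributes are never highlighted.

```python
import re
from html.parser import HTMLParser


class TermHighlighter(HTMLParser):
    """Re-emits HTML, wrapping matched terms that occur in text nodes only."""

    def __init__(self, terms, pre_tag="<em>", post_tag="</em>"):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        # Longest terms first so overlapping alternatives prefer the longer match.
        ordered = sorted(terms, key=len, reverse=True)
        alternation = "|".join(re.escape(t) for t in ordered)
        self._pattern = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
        self._pre, self._post = pre_tag, post_tag
        self._out = []

    def handle_starttag(self, tag, attrs):
        self._out.append(self.get_starttag_text())  # original tag text

    def handle_startendtag(self, tag, attrs):
        self._out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        self._out.append(f"</{tag}>")  # note: tag name arrives lowercased

    def handle_data(self, data):
        wrap = lambda m: f"{self._pre}{m.group(0)}{self._post}"
        self._out.append(self._pattern.sub(wrap, data))

    def handle_entityref(self, name):
        self._out.append(f"&{name};")

    def handle_charref(self, name):
        self._out.append(f"&#{name};")

    def handle_comment(self, data):
        self._out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self._out.append(f"<!{decl}>")

    def result(self):
        return "".join(self._out)


def highlight_html(html_text, terms):
    """Return html_text with each term wrapped in <em>...</em> inside text nodes."""
    if not terms:
        return html_text
    parser = TermHighlighter(terms)
    parser.feed(html_text)
    parser.close()
    return parser.result()


print(highlight_html("<p><strong>bold</strong> <i>italic</i> normal</p>", ["bold"]))
# <p><strong><em>bold</em></strong> <i>italic</i> normal</p>
```

This sketch does not treat `<script>` or `<style>` content specially and does not reproduce language-specific analysis such as stemming, so the highlighted terms may differ from what the index actually matched.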

    Please let us know if you have any questions or concerns and we’ll be happy to help.
