Indexing HTML - token lengths and highlights

Question

I'm trying to index HTML content. For this I have built a custom analyzer:

{
  "name": "html",
  "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
  "tokenizer": "standard_v2",
  "tokenFilters": [
    "lowercase"
  ],
  "charFilters": [
    "html_strip"
  ]
}

and assigned it to my HTML field

{
  "name": "htmlTest",
  "type": "Edm.String",
  "facetable": false,
  "filterable": false,
  "key": false,
  "retrievable": true,
  "searchable": true,
  "sortable": false,
  "analyzer": "html",
  "indexAnalyzer": null,
  "searchAnalyzer": null,
  "synonymMaps": [],
  "fields": []
}

When I POST this to the analyze endpoint:

{
    "text": "bold italic normal",
    "analyzer": "html"
}
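For readers who want to reproduce the call, here is a minimal sketch of building that request in Python. It assumes the Analyze Text REST endpoint of Azure Cognitive Search with API version 2020-06-30; the service name, index name, and key are placeholders, and `build_analyze_request` is a hypothetical helper, not part of any SDK. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_analyze_request(service, index, api_key, text, analyzer,
                          api_version="2020-06-30"):
    """Build (but do not send) a POST request for the analyze endpoint."""
    url = (f"https://{service}.search.windows.net/indexes/{index}/analyze"
           f"?api-version={api_version}")
    body = json.dumps({"text": text, "analyzer": analyzer}).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Placeholder values (assumptions): use your own service, index, and admin key.
request = build_analyze_request("my-service", "my-index", "<admin-key>",
                                "bold italic normal", "html")
# To send it: json.load(urllib.request.urlopen(request))["tokens"]
```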

I see that it extracts the content correctly, but I'm not sure about the endOffset values. I get this back:

"tokens": [
        {
            "token": "bold",
            "startOffset": 11,
            "endOffset": 24,
            "position": 0
        },
        {
            "token": "italic",
            "startOffset": 28,
            "endOffset": 38,
            "position": 1
        },
        {
            "token": "normal",
            "startOffset": 39,
            "endOffset": 45,
            "position": 2
        }
    ]

If you look at the first token, its endOffset is 24, which is not what I expected: it includes the closing tag. This causes issues when I request highlights, because the closing tag gets highlighted too.
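To see what those offsets cover, the sketch below slices a sample input by each token's startOffset and endOffset. The markup of the analyzed text is not visible in the request as shown, so the `html` string here is an assumption, chosen only because it is consistent with the reported offsets; the real input may have used different tags of the same lengths.

```python
# Hypothetical input: an assumption that happens to match the reported offsets.
html = "<p><strong>bold</strong> <i>italic</i> normal</p>"

# Offsets copied from the analyze response above.
tokens = [
    {"token": "bold", "startOffset": 11, "endOffset": 24},
    {"token": "italic", "startOffset": 28, "endOffset": 38},
    {"token": "normal", "startOffset": 39, "endOffset": 45},
]


def offset_slices(source, tokens):
    """Return the substring of `source` that each token's offsets span."""
    return [source[t["startOffset"]:t["endOffset"]] for t in tokens]


for token, covered in zip(tokens, offset_slices(html, tokens)):
    print(f'{token["token"]!r} -> {covered!r}')
# 'bold' -> 'bold</strong>'
# 'italic' -> 'italic</i>'
# 'normal' -> 'normal'
```

With this input, a token that is directly followed by a closing tag has an endOffset that extends past that tag, which would explain why the tag ends up inside the highlight.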

bold italic normal

Is there any way I can improve my analyzer?

Answer

@Perak Piotr Apologies for the inconvenience caused by this issue. I had an internal discussion with the PG, which identified a current incompatibility between the "html_strip" charFilter and highlighting. The product team is aware of the issue and is working on it. There is no ETA for a resolution at this moment.

Suggested workaround: in the meantime, you may try client-side highlighting when the analyzer uses the html_strip char filter.
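One possible shape for client-side highlighting is sketched below, using only Python's standard library. This is not an official recommendation from the product team; `TextNodeHighlighter` and `highlight_html` are hypothetical names, and the sketch assumes you retrieve the raw HTML field and already know the search terms. It wraps matches only inside text nodes, so tags and attribute values are never highlighted.

```python
import re
from html.parser import HTMLParser


class TextNodeHighlighter(HTMLParser):
    """Re-emit HTML, wrapping matched terms that occur in text nodes only."""

    def __init__(self, terms, pre_tag="<em>", post_tag="</em>"):
        # convert_charrefs=False keeps entities such as &amp; intact.
        super().__init__(convert_charrefs=False)
        # Longer terms first so they win over shorter overlapping ones.
        pattern = "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
        self._regex = re.compile(rf"\b(?:{pattern})\b", re.IGNORECASE)
        self._pre, self._post = pre_tag, post_tag
        self._out = []

    def handle_starttag(self, tag, attrs):
        self._out.append(self.get_starttag_text())  # original tag text

    def handle_startendtag(self, tag, attrs):
        self._out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        self._out.append(f"</{tag}>")  # note: tag name is lowercased

    def handle_data(self, data):
        wrap = lambda m: f"{self._pre}{m.group(0)}{self._post}"
        self._out.append(self._regex.sub(wrap, data))

    def handle_entityref(self, name):
        self._out.append(f"&{name};")

    def handle_charref(self, name):
        self._out.append(f"&#{name};")

    def handle_comment(self, data):
        self._out.append(f"<!--{data}-->")

    def result(self):
        return "".join(self._out)


def highlight_html(html, terms):
    """Return `html` with each term highlighted; unchanged if no terms."""
    if not terms:
        return html
    parser = TextNodeHighlighter(terms)
    parser.feed(html)
    parser.close()
    return parser.result()


# The sample markup is an assumption, not the asker's actual document.
print(highlight_html("<p><strong>bold</strong> <i>italic</i> normal</p>", ["bold"]))
# <p><strong><em>bold</em></strong> <i>italic</i> normal</p>
```

The matching here is a plain case-insensitive whole-word regex, which only approximates the analyzer's tokenization; results may differ from server-side matches for text the standard tokenizer would split differently.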

Please let us know if you have any questions or concerns and we’ll be happy to help.
