question

PerakPiotr-7487 asked SnehaAgrawal-MSFT commented

Indexing HTML - token lengths and highlights

I'm trying to index HTML content. For this I have built a custom analyzer:

 {
   "name": "html",
   "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
   "tokenizer": "standard_v2",
   "tokenFilters": [
     "lowercase"
   ],
   "charFilters": [
     "html_strip"
   ]
 }

and assigned it to my HTML field:


 {
       "name": "htmlTest",
       "type": "Edm.String",
       "facetable": false,
       "filterable": false,
       "key": false,
       "retrievable": true,
       "searchable": true,
       "sortable": false,
       "analyzer": "html",
       "indexAnalyzer": null,
       "searchAnalyzer": null,
       "synonymMaps": [],
       "fields": []
     }

When I POST this to the Analyze API endpoint:

 {
     "text": "<p><strong>bold</strong> <i>italic</i> normal</p>",
     "analyzer": "html"
 }
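For reference, a request like the one above can be built and sent with a small script. This is only a sketch: the service name, index name, API key, and `api-version` below are placeholder assumptions, not values from the original post.

```python
import json

def build_analyze_request(service, index, api_key, text, analyzer):
    """Build the URL, headers, and body for the Azure Cognitive Search
    Analyze Text API. Service/index names and the key are placeholders."""
    url = (f"https://{service}.search.windows.net"
           f"/indexes/{index}/analyze?api-version=2020-06-30")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    body = json.dumps({"text": text, "analyzer": analyzer})
    return url, headers, body

url, headers, body = build_analyze_request(
    "myservice", "myindex", "<api-key>",
    "<p><strong>bold</strong> <i>italic</i> normal</p>", "html")
# The request could then be POSTed with e.g.
# urllib.request.Request(url, body.encode(), headers, method="POST")
```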

I see that it extracts the contents correctly, but I'm not sure about the endOffset values. I get this back:

 "tokens": [
         {
             "token": "bold",
             "startOffset": 11,
             "endOffset": 24,
             "position": 0
         },
         {
             "token": "italic",
             "startOffset": 28,
             "endOffset": 38,
             "position": 1
         },
         {
             "token": "normal",
             "startOffset": 39,
             "endOffset": 45,
             "position": 2
         }
     ]

If you look at the first token, its endOffset is 24, which is not what I expected: it includes the closing </strong> tag. This causes issues when I request highlights, because the tag gets highlighted too:

<p><strong><em>bold</strong></em> <i>italic</i> normal</p>

Is there any way I can improve my analyzer?






azure-cognitive-search

Thanks! You may want to refer to the official documentation on how queries work; see the article on full-text search.

Also, please note that partial term queries are an important exception to this rule. These queries (prefix, wildcard, and regex queries) bypass the lexical analysis process, unlike regular term queries. Partial terms are only lowercased before being matched against terms in the index.

If an analyzer isn't configured to support these types of queries, you'll often receive unexpected results because matching terms don't exist in the index.

Check this document on understanding how analyzers work.



I know how analyzers work and have seen this document, but I still don't know if or how I can change my configuration to return correct highlights for HTML.


1 Answer

SnehaAgrawal-MSFT answered SnehaAgrawal-MSFT commented

@PerakPiotr-7487 Apologies for the inconvenience. I had an internal discussion with the product group, and it identified a current incompatibility between the "html_strip" charFilter and highlighting. The product team is aware of the issue and is working on it. There is no ETA for a resolution at this moment.

Suggested workaround: in the meantime, you may try client-side highlighting when the analyzer uses the html_strip char filter.
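A minimal sketch of that workaround, assuming the matched query terms are known on the client: strip the markup yourself (mimicking what html_strip does at index time) and wrap the terms in <em> tags. The HTMLParser-based stripper, the `highlight` helper name, and the <em> wrapper are illustrative choices, not part of the thread.

```python
import html.parser
import re

class _Stripper(html.parser.HTMLParser):
    """Collects only text content, discarding tags (similar to html_strip)."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def highlight(raw_html, terms):
    # 1. Strip tags client-side instead of relying on service highlights.
    stripper = _Stripper()
    stripper.feed(raw_html)
    text = "".join(stripper.parts)
    # 2. Wrap each whole-word query term in <em>...</em>, case-insensitively.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b",
                         re.IGNORECASE)
    return pattern.sub(r"<em>\1</em>", text)

print(highlight("<p><strong>bold</strong> <i>italic</i> normal</p>",
                ["bold", "italic"]))
# -> <em>bold</em> <em>italic</em> normal
```

Because the highlighting happens on the stripped text, no closing tags can leak into the highlighted span, sidestepping the offset issue entirely.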

Please let us know if you have any questions or concerns and we’ll be happy to help.


Thank you for clearing this up.

I know there's no ETA, but when it's fixed, can you let me know?
Or tell me where I can track its status?


Thanks for the reply! Sure, I will keep you posted here.
