question

SuzanaEree-2102 asked azure-cxp-api edited

Python: How to ignore most HTML tags and select only the text for translation (maybe with regex?)

Hi, I have some HTML files with tags such as <div id=""></div>, <span class..>, <dt>, <br>, etc.

But I also have these 4 special tags:

<title>I Love Movies</title>

<h1 class="den_articol" itemprop="name">The Heights Of The Eternal Spaces</h1>

<p class="text_obisnuit">In the end of the movie <em>I see him much different</em> that he was before.</p>

<p class="text_obisnuit2">Go and bring me some coffe.</p>

THE PROBLEM:

With my Python code, I want to select the text ONLY from those 4 tags and ignore the others, and I have to keep the tags themselves intact. So I wrote the delimiters as below:


You have the complete script HERE

    extensie_fisier = ".html"  # file extension to process

    lista_cale_fisiere = []  # list of file paths
    # text delimiters: [opening-tag prefix, closing tag] pairs
    delimitatori_text = [['<title','</title>'], ['<h1 class="den_articol" itemprop="name', '</h1>'], ['<p class="text_obisnuit', '</p>'], ['<span class="text', '</span>']]
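Since the full script is not reproduced here, the following is only a minimal sketch of how delimiter pairs like these might be used to locate the translatable text. The function name `extract_segments` and the sample string are hypothetical. It assumes each opening delimiter is a prefix of the opening tag, so the text starts after the next `>`.

```python
def extract_segments(html, delimiters):
    """Yield (start, end) index pairs of the text inside each delimited tag."""
    for opening, closing in delimiters:
        pos = 0
        while True:
            tag_start = html.find(opening, pos)
            if tag_start == -1:
                break
            # The delimiter is only a prefix, so skip to the end of the opening tag.
            text_start = html.find(">", tag_start)
            if text_start == -1:
                break
            text_start += 1
            text_end = html.find(closing, text_start)
            if text_end == -1:
                break
            yield text_start, text_end
            pos = text_end + len(closing)


# Two of the delimiter pairs from the question, applied to a hypothetical sample.
delimitatori_text = [['<title', '</title>'], ['<p class="text_obisnuit', '</p>']]
sample = ('<title>I Love Movies</title>\n'
          '<div id="x"><p class="text_obisnuit2">Go and bring me some coffe.</p></div>')
segments = [sample[a:b] for a, b in extract_segments(sample, delimitatori_text)]
print(segments)
```

Working with index pairs rather than copied strings means the translated text can later be written back into the same positions, leaving everything outside the pairs untouched.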

My method WORKS: the text in those HTML tags is translated correctly, so the selection is good. But it has some small errors. Many tags get changed, and some stray spaces appear after running the code: </span> becomes </ SPAN> or <em> becomes </ EM>. The same happens with </ li> and </ ol>.
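One possible workaround for the symptom described above is to repair the closing tags after translation. This is a sketch only: the helper name `repair_closing_tags` is hypothetical, and it assumes the damage is limited to stray spaces and upper-casing inside closing tags. It does not address why the tags are altered in the first place.

```python
import re

# Matches closing tags that gained stray spaces or changed case, e.g. "</ SPAN>".
BROKEN_CLOSING_TAG = re.compile(r"<\s*/\s*([A-Za-z][A-Za-z0-9]*)\s*>")


def repair_closing_tags(html):
    """Rewrite closing tags such as '</ SPAN>' back to '</span>'."""
    return BROKEN_CLOSING_TAG.sub(lambda m: "</" + m.group(1).lower() + ">", html)


repaired = repair_closing_tags("<span>text</ SPAN> <li>item</ li>")
print(repaired)
```

A more robust approach is to avoid sending any markup to the translator at all, so that there is nothing to repair afterwards.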

Could there be an easier solution? I wonder whether a REGEX would make the operation easier. For example, this REGEX (<([^>]+)>.*?) will select all the possible HTML tags, so I believe my Python code could ignore the tags and more easily select the text to translate.

The problem in this case is that I don't know how to KEEP the HTML tags after running the Python code with that regex, and I don't know where to insert the regex in my code.
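One way to keep the tags is to split on them instead of matching and discarding them. When the pattern passed to `re.split` contains a capturing group, the matched tags are returned in the result list alongside the text between them. The sketch below uses the simpler pattern `(<[^>]+>)`, because the trailing lazy `.*?` in the regex above always matches the empty string. The function `translate_html` and its `translate` parameter are hypothetical placeholders for whatever translation call the real script makes; `str.upper` stands in for it here. It assumes target tags are not nested inside each other and that no attribute value contains `>`.

```python
import re

TAG = re.compile(r"(<[^>]+>)")  # the capturing group makes re.split keep the tags

# Delimiter pairs from the question: [opening-tag prefix, closing tag].
delimitatori_text = [['<title', '</title>'],
                     ['<h1 class="den_articol" itemprop="name', '</h1>'],
                     ['<p class="text_obisnuit', '</p>'],
                     ['<span class="text', '</span>']]


def translate_html(html, translate):
    """Translate only the text inside the target tags; copy all tags unchanged."""
    out = []
    expected_close = None  # closing tag of the target element we are inside, if any
    for i, part in enumerate(TAG.split(html)):
        if i % 2 == 1:  # odd positions hold the tags captured by the group
            if expected_close is None:
                for opening, closing in delimitatori_text:
                    if part.startswith(opening):
                        expected_close = closing
                        break
            elif part == expected_close:
                expected_close = None
            out.append(part)
        elif expected_close is not None and part.strip():
            out.append(translate(part))  # text inside a target tag
        else:
            out.append(part)  # text outside the target tags, or whitespace only
    return "".join(out)


sample = ('<div id="x">menu</div>'
          '<p class="text_obisnuit">In the end <em>I see him</em> now.</p>')
result = translate_html(sample, str.upper)
print(result)
```

Because the tags never pass through `translate`, they cannot be altered by it. The trade-off is that text around an inline tag such as `<em>` is translated in separate fragments, which may affect translation quality.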

Again, you have my complete script HERE



not-supported

1 Answer

prmanhas-MSFT answered

@SuzanaEree-2102 Apologies for the delay in response and all the inconvenience caused because of the issue.

This issue is not related to Azure Batch; it is about Python code, which is currently not supported in the Q&A forums. The supported products are listed here: https://docs.microsoft.com/en-us/answers/products (more to be added later on).

You can raise your issue here on the official Python forums, where you will get help from experts in this area :)

Hope it helps!!!

Please "Accept as Answer" if it helped, so it can help others in the community looking for help on similar topics.


