Hi, I have some HTML files with tags such as <div id=""></div>, <span class..>, <dt>, <br>, etc.
But I also have these 4 special tags:
<title>I Love Movies</title>
<h1 class="den_articol" itemprop="name">The Heights Of The Eternal Spaces</h1>
<p class="text_obisnuit">In the end of the movie <em>I see him much different</em> that he was before.</p>
<p class="text_obisnuit2">Go and bring me some coffe.</p>
THE PROBLEM:
With my Python code, I want to select the text ONLY from those 4 tags and ignore the others, and I have to keep the tags themselves intact. So I wrote the delimiters as below:
You have the complete script HERE
extensie_fisier = ".html"
lista_cale_fisiere = []
delimitatori_text = [['<title','</title>'], ['<h1 class="den_articol" itemprop="name', '</h1>'], ['<p class="text_obisnuit', '</p>'], ['<span class="text', '</span>']]
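To make the idea concrete, here is a minimal sketch of how delimiter pairs like these can be used to locate the text to translate. It is not the code from the complete script: the helper name find_fragments is hypothetical, and it assumes each start delimiter is the beginning of an opening tag that ends at the next ">".

```python
delimitatori_text = [
    ['<title', '</title>'],
    ['<h1 class="den_articol" itemprop="name', '</h1>'],
    ['<p class="text_obisnuit', '</p>'],
    ['<span class="text', '</span>'],
]


def find_fragments(html, delimiters):
    """Return sorted (start, end) index pairs for the content between each delimiter pair.

    Hypothetical helper for illustration only.
    """
    fragments = []
    for start_delim, end_delim in delimiters:
        position = 0
        while True:
            i = html.find(start_delim, position)
            if i == -1:
                break
            # The start delimiter is only a prefix of the opening tag,
            # so skip ahead to the ">" that closes it.
            j = html.find('>', i)
            if j == -1:
                break
            k = html.find(end_delim, j + 1)
            if k == -1:
                break
            fragments.append((j + 1, k))
            position = k + len(end_delim)
    return sorted(fragments)
```

Slicing the original string with each pair, html[start:end], gives only the inner content, so the delimiting tags themselves are never handed to the translator. Note that the prefix '<p class="text_obisnuit' also matches class="text_obisnuit2".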
My method WORKS: the translation is OK on those HTML tags, so the selection is good. But it has some small errors. Many tags get changed, and extra spaces appear after running the code: </span> becomes </ SPAN>, or <em> becomes </ EM>. The same happens with </ li> or </ ol>.
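As a stopgap, closing tags damaged in this way could be repaired after translation. The sketch below is only an assumption about what such a repair might look like (the function name repair_closing_tags is hypothetical); it handles only closing tags with stray spaces or upper-case names, like the examples above, and does not touch opening tags.

```python
import re

# Matches a closing tag that may contain stray spaces, e.g. "</ SPAN>" or "< / li >".
BROKEN_CLOSING_TAG = re.compile(r'<\s*/\s*([A-Za-z][A-Za-z0-9]*)\s*>')


def repair_closing_tags(html):
    """Rewrite closing tags such as '</ SPAN>' as '</span>' (hypothetical helper)."""
    return BROKEN_CLOSING_TAG.sub(lambda m: '</' + m.group(1).lower() + '>', html)
```

This treats the symptom rather than the cause: it is better if the tags never reach the translator in the first place.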
Could there be an easier solution? I wonder whether a REGEX would make the operation easier. For example, this REGEX (<([^>]+)>.*?) will select all the possible HTML tags, so my Python code could more easily pick out the text and translate it. So I believe it may ignore the HTML tags.
The problem in this case is that I don't know how to KEEP the HTML tags after running the Python code with that regex, and I don't know where to insert this regex in my code.
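One possible way to keep the tags, sketched here under assumptions: when the pattern passed to re.split contains a capturing group, the matched tags are returned in the result list alongside the text between them, so the text pieces can be translated and everything joined back together. The translate argument is a placeholder for whatever translation call the script uses, and the simplified pattern (<[^>]+>) assumes no ">" appears inside attribute values, comments or scripts.

```python
import re

# One capturing group: re.split keeps the matched tags in the result list.
TAG_RE = re.compile(r'(<[^>]+>)')


def translate_keeping_tags(html, translate):
    """Apply translate() to the text between tags and leave every tag untouched.

    `translate` is a placeholder callable (str -> str), not a real API.
    """
    parts = TAG_RE.split(html)
    result = []
    for index, part in enumerate(parts):
        # With a single capturing group, odd indexes are the tags
        # and even indexes are the text between them.
        if index % 2 == 1 or not part.strip():
            result.append(part)
        else:
            result.append(translate(part))
    return ''.join(result)


# Usage with a stand-in "translator" that only upper-cases the text:
sample = '<p class="text_obisnuit">In the end <em>I see him</em> before.</p>'
print(translate_keeping_tags(sample, str.upper))
```

A trade-off of this approach is that a sentence containing inline tags such as <em> is translated in separate pieces, which may hurt translation quality. For HTML that is not well-behaved, a real parser such as the standard library's html.parser is likely a safer choice than a regex.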
Again, you have my complete script HERE