question

DhananjaySiwach asked YijingSun-MSFT commented

Issue with DocumentFormat.OpenXml reading docX file

Hi,

I am using DocumentFormat.OpenXml to read content from a .docx file in ASP.NET (C#).
I have an issue with paragraph.InnerText: it returns " TOC \\o \"1-2\" \\h \\z \\u 1.Introduction PAGEREF _Toc294041589 \\h 4", but I need only the content, without the heading. How can I achieve this?


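My code:

// Requires: System, System.Collections.Generic, System.IO, System.IO.Packaging, System.Text,
// DocumentFormat.OpenXml.Packaging, DocumentFormat.OpenXml.Wordprocessing
using (Package wordPackage = Package.Open(filePath, FileMode.Open, FileAccess.Read))
using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(wordPackage))
{
    StringBuilder stringBuilder = new StringBuilder();

    // Only the paragraphs that are direct children of the document body
    IEnumerable<Paragraph> paragraphs = wordDocument.MainDocumentPart.Document.Body.Elements<Paragraph>();

    foreach (var paragraph in paragraphs)
    {
        // InnerText is where the unwanted TOC / PAGEREF text shows up
        Console.WriteLine(paragraph.InnerText);
        stringBuilder.Append(paragraph.InnerText + "\r\n");
    }
    string content = stringBuilder.ToString();
}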


Hi @DhananjaySiwach,
What is your heading? Is it "Introduction PAGEREF _Toc294041589"? Does each paragraph have a heading? Could you give us more details?
Best regards,
Yijing Sun


Hi @YijingSun-MSFT,

I have many .docx files whose content includes headings. I have attached a screenshot of a heading.
I have tried a lot of code, but I cannot extract just the heading text with its spaces or tabs.
Example: I am using the following code to get the heading content, but I do not get the heading content with its spaces or tabs.

IEnumerable<Paragraph> paragraphs = wordDocument.MainDocumentPart.Document.Body.Elements<Paragraph>();
foreach (var paragraph in paragraphs)
{
    var paragraphText = paragraph.Descendants<DocumentFormat.OpenXml.Wordprocessing.Text>();
    foreach (var txt in paragraphText)
    {
        text += txt.Text;
        // Append a space when the element is marked xml:space="preserve"
        if (txt.Space != null && txt.Space.Value == SpaceProcessingModeValues.Preserve)
            text += " ";
    }
}


93332-document-heading.png












Hi @DhananjaySiwach,
I think you could use the Run class:

 new Paragraph(new Run(new Text(para.InnerText)))

https://docs.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.paragraphproperties?view=openxml-2.8.1

Best regards,
Yijing Sun


1 Answer

APoblacion answered YijingSun-MSFT commented

InnerText concatenates the text of every XML element inside the paragraph, including field instructions such as TOC and PAGEREF. That's not what you want.
Instead, enumerate the Run elements of the Paragraph and, for each Run, its Text elements, and take the text from those.

To enumerate the runs, you would use a loop similar to this:
foreach (var run in paragraph.Elements<Run>())

A similar loop over run.Elements<Text>() would give you all the text.

For more information, explore the documentation, starting with the Run class.
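
As a rough sketch of the loops described above, something like the following might work. The class name ParagraphTextSketch and the helper GetVisibleText are hypothetical, not part of the Open XML SDK. It walks all descendants of the paragraph instead of only the direct child runs, on the assumption that some runs may be nested (for example inside a Hyperlink element in a table-of-contents entry). It appends the content of Text elements, turns TabChar into a tab character, and skips FieldCode elements, which is where instructions such as TOC and PAGEREF are stored in complex fields.

```csharp
using System;
using System.Text;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

static class ParagraphTextSketch
{
    // Hypothetical helper: collects the displayed text of a paragraph
    // while leaving out field instructions (FieldCode elements).
    static string GetVisibleText(Paragraph paragraph)
    {
        var sb = new StringBuilder();
        foreach (OpenXmlElement element in paragraph.Descendants())
        {
            switch (element)
            {
                case Text text:      // w:t, the displayed text
                    sb.Append(text.Text);
                    break;
                case TabChar _:      // w:tab inside a run
                    sb.Append('\t');
                    break;
                case Break _:        // w:br inside a run
                    sb.Append('\n');
                    break;
                // FieldCode, DeletedText and all other elements match no case and are skipped.
            }
        }
        return sb.ToString();
    }

    static void Main(string[] args)
    {
        // Assumption: the path to the .docx file is passed as the first argument.
        string filePath = args[0];
        using (WordprocessingDocument doc = WordprocessingDocument.Open(filePath, false))
        {
            foreach (Paragraph p in doc.MainDocumentPart.Document.Body.Elements<Paragraph>())
                Console.WriteLine(GetVisibleText(p));
        }
    }
}
```

Note that field results, such as the page number at the end of a table-of-contents entry, are ordinary Text elements, so they would still appear in the output of this sketch.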



Hi @APoblacion ,

This code does not work for heading content. I am not getting the .docx heading content.


Use the debugger. Place a breakpoint just after the code obtains a paragraph, then expand the paragraph's properties in the debugger and dig into them until you find the property that contains the information you are looking for. Then use the value of that property in your code.
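
If stepping through the debugger is awkward, a small dump along these lines could serve the same purpose. This is only a sketch: the class name ParagraphDumpSketch is made up, and the style ids it prints (values such as "Heading1" or "TOC1" are typical of Word's default English styles) depend entirely on the document. For each body paragraph it prints the paragraph style id, if any, followed by every descendant element with its type, XML name, and leaf text.

```csharp
using System;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

static class ParagraphDumpSketch
{
    static void Main(string[] args)
    {
        // Assumption: the path to the .docx file is passed as the first argument.
        using (WordprocessingDocument doc = WordprocessingDocument.Open(args[0], false))
        {
            foreach (Paragraph p in doc.MainDocumentPart.Document.Body.Elements<Paragraph>())
            {
                // The style id is optional, so every step of the chain may be null.
                string styleId = p.ParagraphProperties?.ParagraphStyleId?.Val?.Value ?? "(none)";
                Console.WriteLine($"Paragraph, style id: {styleId}");

                foreach (OpenXmlElement e in p.Descendants())
                {
                    // Leaf text elements (Text, FieldCode, ...) carry their content in Text.
                    string leafText = e is OpenXmlLeafTextElement leaf ? leaf.Text : "";
                    Console.WriteLine($"  {e.GetType().Name} <{e.Prefix}:{e.LocalName}> {leafText}");
                }
            }
        }
    }
}
```

Comparing the dump of a heading paragraph with that of a normal paragraph should show which elements hold the heading text and which style id marks it.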


Hi @APoblacion ,

There is no property that contains the heading content.

93449-image.png

