I receive PDF files (electronic invoices) that contain embedded XML files, which I would like to extract and process in a Logic App. I could not find a way to extract files embedded in PDFs. Any ideas? Thanks.
Hi @kazinad, Thank you for reaching out. Just a few questions: can you please tell us how the XML files are embedded in the PDF? Is it a URL that fetches the file when clicked, or does the PDF contain the file directly?
Meanwhile, we do provide Cloudmersive and Aquaforest PDF connectors for Logic Apps, which can be useful for parsing a PDF and getting the text in it. If the file needs to be fetched using a URL, you can extract the URL as mentioned above and maybe leverage Docparser to fetch the file (this will work if the files are stored under a publicly accessible URL).
Please let me know whether this helps in resolving the issue. I will be glad to continue our discussion.
@ChaitanyaNaykodiMSFT-9638: The PDF contains the file directly, not a URL. See this screenshot from the Firefox PDF viewer:

@kazinad, Thank you for the reply. I am trying the scenario out myself and will respond soon.
Hello @kazinad, just following up to see whether my response below was helpful.
Hello @kazinad, Sorry for the delay in my response. Currently none of the PDF connectors for Logic Apps support extracting attached files from a PDF document. An alternative way to extract the attached XML file is to integrate a Function app within your Logic App. You can find more information here about how to call a function app from your logic app. We found this thread, which we think might be helpful in implementing the C# code required to extract the XML attachment from the PDF.
Please let me know if you need any additional information; I will be glad to continue our discussion.
You could try the Spire.PDF library to extract attachments from a PDF using C#.
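using System.IO;
using Spire.Pdf;
using Spire.Pdf.Attachments;

// Load the PDF
PdfDocument pdf = new PdfDocument("Attachment1.pdf");
// Get the first attachment (check pdf.Attachments.Count first if the PDF may have none)
PdfAttachment attachment = pdf.Attachments[0];
// Write the attachment to a file named after the embedded file
File.WriteAllBytes(attachment.FileName, attachment.Data);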
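To tie this back to the Function app suggestion earlier in the thread, here is a rough sketch of how the Spire.PDF snippet might be wrapped in an HTTP-triggered Azure Function that a Logic App can call. It is only an illustration under several assumptions: it uses the in-process Azure Functions model, it assumes the Logic App posts the raw PDF bytes as the request body, and it assumes Spire.PDF's LoadFromStream and Close methods behave as documented by the vendor. The function name and the JSON field names (fileName, contentBase64) are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Spire.Pdf;
using Spire.Pdf.Attachments;

// Hypothetical function name; rename to suit your project.
public static class ExtractPdfAttachments
{
    [FunctionName("ExtractPdfAttachments")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post", Route = null)] HttpRequest req)
    {
        // Buffer the request body, assumed to be the PDF bytes sent by the Logic App.
        using var buffer = new MemoryStream();
        await req.Body.CopyToAsync(buffer);
        if (buffer.Length == 0)
        {
            return new BadRequestObjectResult("Request body must contain the PDF bytes.");
        }
        buffer.Position = 0;

        var pdf = new PdfDocument();
        try
        {
            pdf.LoadFromStream(buffer);

            // Collect every embedded file whose name ends in .xml.
            var files = new List<object>();
            foreach (PdfAttachment attachment in pdf.Attachments)
            {
                if (!attachment.FileName.EndsWith(".xml", StringComparison.OrdinalIgnoreCase))
                {
                    continue;
                }

                files.Add(new
                {
                    // Hypothetical field names; strip any path from the embedded name.
                    fileName = Path.GetFileName(attachment.FileName),
                    contentBase64 = Convert.ToBase64String(attachment.Data)
                });
            }

            // Returns a JSON array (empty if the PDF has no XML attachments).
            return new OkObjectResult(files);
        }
        finally
        {
            pdf.Close();
        }
    }
}
```

In the Logic App, the file content would go in the body of the Azure Function action, and each returned contentBase64 value could be decoded with the base64ToString() expression before further XML processing. Returning base64 rather than text avoids guessing the XML encoding inside the function.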
If you decide to call a function app to perform this, you can use the LEADTOOLS PDF SDK libraries to implement attachment extraction from a PDF.
The following code shows how to do it in C#:
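// pdfFileName is the path of the input PDF
int attachmentCount = Leadtools.Pdf.PDFFile.GetEmbeddedFileCount(pdfFileName, null);
// The loop starts at 1 and runs through attachmentCount inclusive
for (int i = 1; i <= attachmentCount; i++)
{
    string tempFile = Path.GetTempFileName();
    Leadtools.Pdf.PDFFile.ExtractEmbeddedFile(pdfFileName, null, i, tempFile);
    // Process or convert the extracted file here, before the temporary copy is deleted.
    // Console.WriteLine replaces MessageBox.Show, which is not available in a function app.
    Console.WriteLine($"Attachment saved to file {tempFile}; it can be processed or converted to another format if needed");
    File.Delete(tempFile);
}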
If you would like to try it, there’s a free evaluation here. Note: I’m an employee of this SDK’s vendor.
1. Install the IronPdf NuGet package into your project
2. Follow the link: reading-pdf-text
// D:\demoSp.pdf is the full path of your input PDF file
PdfDocument PDF = PdfDocument.FromFile(@"D:\demoSp.pdf");
// Extract all text from the document
FileContent.Text = PDF.ExtractAllText();
ComPDFKit PDF SDK is a good choice.
Here is the website: https://www.compdf.com/
If you want to learn how to integrate PDF functionality into an app, you can look at the guides here. ComPDFKit is a full-featured PDF SDK and conversion SDK that includes extraction of PDF attachments.