kazinad asked 77571591 edited

How to extract PDF attachments

I receive PDF files (electronic invoices) that have XML files embedded in them, which I would like to extract and process in a Logic App. I could not find a way to extract the files embedded in PDFs. Any ideas? Thanks.

azure-logic-apps

Hi @kazinad, thank you for reaching out. Just a few questions: can you please tell us how the XML files are embedded in the PDF? Is it a URL that fetches the file when clicked, or does the PDF contain the file directly?
Meanwhile, we do provide Cloudmersive and Aquaforest PDF connectors for Logic Apps, which can be useful for parsing a PDF and getting its text. If the file needs to be fetched using a URL, you can extract the URL as mentioned above and perhaps use Docparser to fetch the file (this will work if the files are stored under a publicly accessible URL).
Please let me know whether this helps resolve the issue. I will be glad to continue our discussion.

kazinad:

@ChaitanyaNaykodiMSFT-9638: The PDF contains the file directly, not a URL. See this screenshot from the Firefox PDF viewer:

[Screenshot: pdf-attachment.png (109.0 KiB)]

@kazinad, thank you for the reply. I am trying the scenario out myself and will respond soon.


Hello @kazinad, just following up to see whether my response below was helpful.

ChaitanyaNaykodiMSFT-9638 answered

Hello @kazinad, sorry for the delay in my response. Currently, none of the PDF connectors for Logic Apps support extracting attached files from a PDF document. An alternative is to integrate a Function App within your Logic App to extract the attached XML file. You can find more information here about how to call a Function App from your Logic App. We also found this thread, which we think might be helpful in implementing the C# code required to extract the XML attachment from the PDF.
Please let me know if you need any additional information; I will be glad to continue our discussion.
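
As a rough illustration of the Function App approach, the following is a minimal sketch of an HTTP-triggered function that a Logic App could call with the PDF bytes as the request body. It assumes the in-process C# programming model (the Microsoft.NET.Sdk.Functions package). The function name ExtractPdfAttachment and the helper ExtractFirstXmlAttachment are hypothetical; the helper is a placeholder where the extraction code from a PDF library of your choice would go.

```csharp
using System.IO;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;

public static class ExtractPdfAttachment
{
    // Hypothetical function name; the Logic App would POST the PDF bytes to it.
    [FunctionName("ExtractPdfAttachment")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post", Route = null)] HttpRequest req)
    {
        // Read the whole request body (the PDF file) into memory.
        using var buffer = new MemoryStream();
        await req.Body.CopyToAsync(buffer);
        byte[] pdfBytes = buffer.ToArray();

        if (pdfBytes.Length == 0)
        {
            return new BadRequestObjectResult("The request body must contain the PDF file.");
        }

        byte[] xml = ExtractFirstXmlAttachment(pdfBytes);
        if (xml == null)
        {
            return new NotFoundObjectResult("No XML attachment was found in the PDF.");
        }

        // Return the XML so later Logic App actions can process it.
        return new FileContentResult(xml, "application/xml");
    }

    // Hypothetical placeholder: replace the body with calls to a PDF library.
    private static byte[] ExtractFirstXmlAttachment(byte[] pdfBytes)
    {
        throw new System.NotImplementedException("Plug in a PDF library here.");
    }
}
```

Returning the XML in the response body keeps the function stateless, so the Logic App can pass the result straight to its next action.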



Ezreal95-7594 answered

You could try the Spire.PDF library to extract attachments from a PDF using C#.

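using System.IO;
using Spire.Pdf;
using Spire.Pdf.Attachments;

// Load the PDF
PdfDocument pdf = new PdfDocument("Attachment1.pdf");
// Get the first attachment (index 0)
PdfAttachment attachment = pdf.Attachments[0];
// Write the attachment's bytes to a file named after the attachment
File.WriteAllBytes(attachment.FileName, attachment.Data);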
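
An invoice PDF may carry more than one embedded file, so a loop that collects every XML attachment in memory may fit a Function App better than writing only the first one to disk. The sketch below reuses the Spire.PDF members shown above (the PdfDocument constructor, Attachments, FileName, Data). The Count property on the attachment collection and the Spire.Pdf.Attachments namespace are assumptions to check against the version you install, and the class and method names (PdfXmlAttachments, Extract) are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Spire.Pdf;
using Spire.Pdf.Attachments;

public static class PdfXmlAttachments
{
    // Returns the XML attachments of a PDF, keyed by file name.
    public static Dictionary<string, byte[]> Extract(string pdfPath)
    {
        var result = new Dictionary<string, byte[]>();
        PdfDocument pdf = new PdfDocument(pdfPath);

        // Assumption: the collection may be null when the PDF has no attachments.
        var attachments = pdf.Attachments;
        if (attachments == null)
        {
            return result;
        }

        // Assumption: the attachment collection exposes Count.
        for (int i = 0; i < attachments.Count; i++)
        {
            PdfAttachment attachment = attachments[i];

            // Keep only the file name part, in case the stored name includes a path.
            string name = Path.GetFileName(attachment.FileName);
            if (name.EndsWith(".xml", StringComparison.OrdinalIgnoreCase))
            {
                result[name] = attachment.Data;
            }
        }

        return result;
    }
}
```

A call such as PdfXmlAttachments.Extract("Attachment1.pdf") would then give you the XML payloads as byte arrays, ready to return to the Logic App.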



AminDodin-1022 answered

If you decide to call a function app to perform this, you can use LEADTOOLS PDF SDK Libraries to implement attachment extraction from PDF.

The following code shows how to do it in C#:

// Count the files embedded in the PDF
int attachmentCount = Leadtools.Pdf.PDFFile.GetEmbeddedFileCount(pdfFileName, null);
// Embedded files are addressed here by number, starting at 1
for (int i = 1; i <= attachmentCount; i++)
{
    string tempFile = Path.GetTempFileName();
    Leadtools.Pdf.PDFFile.ExtractEmbeddedFile(pdfFileName, null, i, tempFile);
    MessageBox.Show($"Attachment saved to file {tempFile}; it can be processed or converted to another format if needed");
    File.Delete(tempFile);
}

If you would like to try it, there’s a free evaluation here. Note: I’m an employee of this SDK’s vendor.


VishantPandey-3061 answered VishantPandey-3061 edited

1. Install the IronPdf NuGet package in your project.
2. Follow the link: reading-pdf-text


PdfDocument PDF = PdfDocument.FromFile(@"D:\demoSp.pdf"); // D:\demoSp.pdf full path of your input pdf file
FileContent.Text = PDF.ExtractAllText();


77571591 answered 77571591 edited

ComPDFKit PDF SDK is a good choice.

Here is the website: https://www.compdf.com/

If you want to learn how to integrate PDF functions into an app, you can look at the guides here. ComPDFKit is a full-featured PDF SDK and conversion SDK whose features include extracting PDF attachments.
