IbrahimSIqbal-6507 asked ·

U-SQL + Python PDF file parsing in Azure Data Lake Analytics

I need to extract data from PDF files and store the values in a table using Azure Data Lake Analytics. Can anyone help with some examples or a procedure for achieving this?

azure-data-lake-analytics

1 Answer

ChiragMishra-MSFT answered ·

Hi there,

Here are some resources for getting started with U-SQL in Azure Data Lake Analytics:

https://docs.microsoft.com/en-us/u-sql/

https://www.purplefrogsystems.com/paul/category/u-sql/

https://www.mssqltips.com/sqlservertip/5890/azure-data-lake-analytics-using-usql-queries/

For the scenario you describe, you would have to write a custom extractor to read the PDF. Here is a C# example:

 using System.Collections.Generic;
 using iTextSharp.text.pdf;
 using iTextSharp.text.pdf.parser;
 using Microsoft.Analytics.Interfaces;
    
 namespace PDFExtractor
 {
     [SqlUserDefinedExtractor(AtomicFileProcessing = true)]
     public class PDFExtractor : IExtractor
     {
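         // AtomicFileProcessing = true keeps the file whole, so one call sees the entire PDF.
         public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
         {
             var reader = new PdfReader(input.BaseStream);
             try
             {
                 // iTextSharp page numbers are 1-based.
                 for (var page = 1; page <= reader.NumberOfPages; page++)
                 {
                     output.Set(0, page);                       // column 0: page number
                     output.Set(1, ExtractText(reader, page));  // column 1: page text
                     yield return output.AsReadOnly();
                 }
             }
             finally
             {
                 reader.Close();  // release the PDF once all pages have been emitted
             }
         }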
    
         public string ExtractText(PdfReader pdfReader, int pageNum)
         {
             var text = PdfTextExtractor.GetTextFromPage(pdfReader, pageNum, new LocationTextExtractionStrategy());
             // Encode line breaks as literal \r and \n so that each page's text
             // stays on a single line in the output file.
             return text.Replace("\r", "\\r").Replace("\n", "\\n");
         }
     }
 }

You can write something similar in Python.
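
As a rough illustration, here is a minimal Python sketch of the same page-by-page logic: one row per page, with line breaks encoded as literal `\r` and `\n`. It assumes the third-party `pypdf` package, which is not part of the original example, and the function names (`escape_newlines`, `rows_from_pages`, `extract_pdf_rows`) are hypothetical. It only covers the text extraction itself and does not show how to wire it into a U-SQL job.

```python
from typing import Iterable, Iterator, Tuple


def escape_newlines(text: str) -> str:
    """Encode CR/LF as literal \\r and \\n so each page stays on one output line."""
    return text.replace("\r", "\\r").replace("\n", "\\n")


def rows_from_pages(page_texts: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (page_number, escaped_text) rows, numbering pages from 1."""
    for page_number, text in enumerate(page_texts, start=1):
        yield page_number, escape_newlines(text)


def extract_pdf_rows(path: str) -> Iterator[Tuple[int, str]]:
    """Read a PDF with pypdf (assumed to be installed) and yield one row per page."""
    from pypdf import PdfReader  # third-party dependency, imported lazily

    reader = PdfReader(path)
    # Fall back to an empty string if a page yields no text (e.g. image-only pages).
    return rows_from_pages(page.extract_text() or "" for page in reader.pages)
```

Calling `list(extract_pdf_rows("sample.pdf"))` (with a hypothetical file name) would give a list of `(page, text)` tuples, mirroring the two columns the C# extractor sets.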

Reference for the C# example: https://devblog.xyz/simple-pdf-text-extractor-adla/

Hope this helps.


2 comments

@IbrahimSIqbal-6507 Just checking in to see if the above answer helped. If it answers your query, please click "Accept Answer" and upvote it.

Hi @IbrahimSIqbal-6507,

Was the above answer helpful to you? If so, please consider accepting it as the answer, since that helps other community members who come across this thread with a similar issue.