Search for Text in PDF File stored in Blob

The Timp 1 Reputation point
2020-06-14T23:32:41.487+00:00

Is there a way to search PDF files stored in blobs without using Azure Cognitive Search?

I configured a test blob container, and from what I can see Microsoft.Search/searchServices is going to cost AU$12.19 per day, or about AU$360 per month.
(By comparison, storing about 100MB of test PDFs costs AUD$0.02 under Microsoft.Storage/storageAccounts.)
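
As a quick sanity check on that monthly figure, here is a minimal sketch that scales the daily rate shown in the cost analysis. It assumes a flat daily charge with no partial-day billing; the function name `monthly_cost` is only illustrative.

```python
# Rough monthly cost from the daily rate quoted above.
# Assumption: the service is billed at a flat daily rate for the whole month.
DAILY_RATE_AUD = 12.19  # daily charge seen in the cost analysis


def monthly_cost(daily_rate: float, days: int = 30) -> float:
    """Return the cost of running for `days` days, rounded to cents."""
    return round(daily_rate * days, 2)


if __name__ == "__main__":
    for days in (30, 31):
        print(f"{days} days: AU${monthly_cost(DAILY_RATE_AUD, days):.2f}")
```

A 30-day month comes out slightly above the AU$360 estimate, so the rough figure is in the right range.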

Perhaps I have something configured that I can turn off?

It would be cheaper to spin up a server VM with a SQL database and use the Adobe PDF iFilter on a varbinary(max) BLOB column.

9928-rgteststorageconfig.png
9996-rgteststoragecosts.png

Azure AI Search
An Azure search service with built-in artificial intelligence capabilities that enrich information to help identify and explore relevant content at scale.
Azure Blob Storage
An Azure service that stores unstructured data in the cloud as blobs.

3 answers

  1. The Timp 1 Reputation point
    2020-06-16T07:56:03.67+00:00

    I dropped and recreated the search service on the Basic tier, down from the Standard (1) tier, and that looks a bit cheaper.

    10232-searchpricingtier.png

    But given that the search limits top out at a 256MB maximum file size, it looks like the service won't suit me, as my PDF reports are often larger than that.

    And if the storage per partition has to hold the actual files, and not just the indexes, my 40GB of files will require an S2 at almost AUD$2000 per month!

    Am I missing something?

    https://azure.github.io/LearnAI-KnowledgeMiningBootcamp/labs/lab-02-azure-cognitive-search.html


  2. deherman-MSFT 33,701 Reputation points Microsoft Employee
    2020-06-17T18:00:19.137+00:00

    @The Timp Firstly, apologies for the delay in responding here!

    Currently Azure Cognitive Search is the recommended method inside Azure to accomplish this. I see you have already found the different pricing tiers that are available.

    For the most accurate answer on pricing, please contact Azure Billing support. It's free, and it's what I recommend in this circumstance.

    You can also use the Pricing calculator for your own detailed analysis.

    Hope this helps!
    Kindly let us know if the above helps or you need further assistance on this issue.



  3. The Timp 1 Reputation point
    2020-06-18T06:12:17.26+00:00

    Thanks, I have contacted Azure Billing support and checked out the Pricing Calculator, but it's not very clear.

    For example:

    When indexing the content and metadata of a PDF file that is, say, 2MB in size, do I need to ensure the 'Storage per partition' has:

    1. At least 2MB (the size of the original file), or
    2. An amount less than 2MB, as the 'Storage per partition' will only hold the indexes of the document?

    I would have assumed that the 'Storage per partition' needed would be less than the original file size.

    RE: "at the moment there is no direct way to index files greater than 256 MB"
    I have 1400 PDF files to put in the blob container, ranging in size from 5MB to 500MB.
    If I split the PDF files into individual pages and loaded each page into its own blob, the indexer could index each page, as each would be less than 16MB, but that would be a lot of blobs.
    At what point would I realistically hit the upper limit of blobs?
    The documentation says: "approximately 24 billion documents per index on Basic". At what point would I see performance degradation?
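
Before splitting every file into pages, it may be worth checking how many of the files actually exceed the 256 MB limit quoted above, since only those would need splitting. The sketch below is a rough illustration using the Python standard library only. The `PdfBlob` type, the `split_by_limit` helper, and the sample file names and sizes are hypothetical, and it assumes "256 MB" means 256 × 1024 × 1024 bytes.

```python
from typing import Iterable, List, NamedTuple, Tuple

# Assumption: the 256 MB limit quoted in this thread is measured in MiB.
MAX_INDEXABLE_BYTES = 256 * 1024 * 1024


class PdfBlob(NamedTuple):
    """Hypothetical record for one PDF: blob name and size in bytes."""
    name: str
    size_bytes: int


def split_by_limit(
    blobs: Iterable[PdfBlob], limit: int = MAX_INDEXABLE_BYTES
) -> Tuple[List[PdfBlob], List[PdfBlob]]:
    """Partition blobs into (within limit, over limit) by size."""
    within: List[PdfBlob] = []
    over: List[PdfBlob] = []
    for blob in blobs:
        (over if blob.size_bytes > limit else within).append(blob)
    return within, over


if __name__ == "__main__":
    mb = 1024 * 1024
    # Made-up sample sizes spanning the 5MB to 500MB range mentioned above.
    sample = [
        PdfBlob("report-a.pdf", 5 * mb),
        PdfBlob("report-b.pdf", 256 * mb),
        PdfBlob("report-c.pdf", 500 * mb),
    ]
    within, over = split_by_limit(sample)
    print("index as-is:", [b.name for b in within])
    print("need splitting:", [b.name for b in over])
```

Splitting only the oversized files, rather than every file into single pages, would keep the total blob count much closer to the original 1400.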