ZhengLin-2636 asked ·

QnA Maker removes white space when importing a PDF file

I tried to add content to a QnA Maker knowledge base by ingesting a PDF file, but the ingested content lost many sentences and much of the white space was removed. Only one Q&A pair was generated, and its format is chaotic. Is there anything I can do about it? What did I miss?

As you can see, a phrase like "The Python Software Foundation" became "ThePythonSoftware Foundation". The content is entirely scrambled even though it was in separate paragraphs in the source, and the first part of the document is missing completely.

This is what the PDF file looks like:

![74239-untitled.png][1]

[1]: /answers/storage/attachments/74239-untitled.png

And this is the QnA pair that was generated:

Q: Anyspecial install / configuration?
A: ClickOnceApplication Status AlreadyRequested AlreadyRequested CompanyName Telerik MozillaFoundation Individual Software Developer: Simon Tatham ThePythonSoftware Foundation ThePythonSoftware Foundation Microsoft Corporation TheWireSharkFoundation GoogleLLC Individual Software Developer: Eli Fulkerson Telerik/Progress.com CompanyAddress 14OakParkDrive Bedford, MA01730 331EastEvelyn Avenue Mountain View, CA94041 NA 9450SWGemini Dr. ECM#90772 Beaverton, OR97008 USA 9450SWGemini Dr. 
ECM#90772 Beaverton, OR97008 USA wireshark.org 1600Amphitheatre Parkway Mountain View, CA 94043 USA NA 14OakParkDrive Bedford, MA01730 LinktoManufacturerSpecSheet(ifapplicable) https://the.earth.li/~sgtatham/putty/0.73/htmldoc/ https://docs.python.org/2/ https://docs.python.org/3.7/ https://docs.microsoft.com/en-us/message-analyzer/getting-started-withmessage-analyzer https://www.wireshark.org/download/docs/user-guide.pdf https://chromereleases.googleblog.com/ https://docs.microsoft.com/en-us/azure/data-explorer/kusto/tools/kustoexplorer https://www.elifulkerson.com/projects/tcping.php https://docs.microsoft.com/en-us/azure/azure-resourcemanager/management/overview https://docs.microsoft.com/enus/powershell/azure/servicemanagement/overview?view=azuresmps-4.0.0 https://www.powershellgallery.com/packages/SqlServer/21.1.18221 https://docs.microsoft.com/en-us/powershell/azure/?view=azps-4.2.0 https://docs.telerik.com/fiddler/Configure-Fiddler/Tasks/ConfigureFiddler Ticketizer 1.3.0.27 Production http://toolbox/ticketizer SubstrateIdentity (Core Auth) WinSCP 5.17 Putty 0.73 Python2.7 2.7.18 Python3.7 3.7.7 MicrosoftMessage Analyzer 1.4 (Build4.0.8112.0) WireShark 3.2.4 Chrome Latest/83.0.4103.61 https://winscp.net/eng/download.php https://the.earth.li/~sgtatham/putty/latest/w64/putty-64bit-0.73-installer.msi https://www.python.org/downloads/release/python-2718/ https://www.python.org/downloads/release/python-377/ Internal Retired forpublicdownloadonNovember252019. https://2.na.dl.wireshark.org/win64/Wireshark-win64-3.2.4.exe GEARFabric http://dl.google.com/edgedl/chrome/install/GoogleChromeStandaloneEnterprise64.msi IC3ALL

azure-qna-maker
untitled.png (48.1 KiB)

@ZhengLin-2636 I think the document did not get attached correctly. If you could share the document or the URL you are using, it would be easier to check.
From my experience, ingesting a PDF document works, but the result depends on how the data is structured in the document. I prefer a FAQ-style format in the PDF so that the questions and answers can be extracted easily.




Here is the url: https://curatorstore.blob.core.windows.net/curationfile-test/Debugging Tools.pdf

At this point we don't have a say in the format of the documents we process, so the internal format is not ideal at all, but I was hoping QnA Maker could perform some intelligent, sensible parsing that does better than a straight copy and paste.

Thanks


@ZhengLin-2636 This format might just need some tweaking. If you cannot update the document directly, the other option is to copy and paste the content as required and format the questions and answers yourself.
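
A minimal sketch of that manual route, assuming the pairs are curated by hand and then written to a tab-separated file for upload. The `pairs_to_tsv` helper, the example pairs, and the `Question`/`Answer`/`Source` column layout are illustrative assumptions, not something confirmed in this thread, so please check the layout against the QnA Maker documentation on structured file imports before relying on it.

```python
import csv
import io


def pairs_to_tsv(pairs, source="Debugging Tools.pdf"):
    """Serialize hand-curated (question, answer) pairs as TSV text.

    The Question/Answer/Source header is an assumed layout; verify it
    against the QnA Maker documentation before uploading.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["Question", "Answer", "Source"])
    for question, answer in pairs:
        # Collapse runs of whitespace (including line breaks left over
        # from copy/paste) into single spaces so each pair stays on one row.
        writer.writerow([
            " ".join(question.split()),
            " ".join(answer.split()),
            source,
        ])
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical pairs, typed by hand from values visible in the PDF.
    curated = [
        ("Which version of Putty is listed?", "0.73"),
        ("Which version of WireShark is listed?", "3.2.4"),
    ]
    print(pairs_to_tsv(curated), end="")
```

Writing the pairs out by hand this way sidesteps the PDF extraction entirely, so the missing spaces and the skipped first section cannot occur, at the cost of manual effort per document.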


So is it native QnA Maker behavior to skip content when parsing a document like this? And what about the white space being removed from the text? That seems like a bug.


Thanks.


0 Answers