TunAnhNguyn-8918 asked JackJJun-MSFT commented

How do I extract text from a scanned document saved as a PDF file, using C#?

I have a scanned document that is saved as a PDF file. I want to build a tool that reads the data from this document and converts it to a Word or text file. I tried converting the PDF to an image and running OCR on the bitmap to get the text, but the resulting text has a font error. Can anyone help me?

Here is my code:

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using System.Windows.Forms;
using Tesseract;
using IronOcr;

namespace IMG_To_Text
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        // Lets the user pick a PNG or JPG image from disk.
        private void button1_Click(object sender, EventArgs e)
        {
            OpenFileDialog openFileDialog1 = new OpenFileDialog
            {
                InitialDirectory = @"D:\",
                Title = "Browse Text Files",

                CheckFileExists = true,
                CheckPathExists = true,

                DefaultExt = "png",
                Filter = "png files (*.png)|*.png|jpg files (*.jpg)|*.jpg",
                FilterIndex = 2,
                RestoreDirectory = true,

                ReadOnlyChecked = true,
                ShowReadOnly = true
            };

            // The variable is openFileDialog1, not openFileDialog.
            if (openFileDialog1.ShowDialog() == DialogResult.OK)
            {
                // (body not included in the post)
            }
        }

        // Runs Tesseract on the bitmap using the Vietnamese ("vie") language data
        // in the tessdata folder and returns the recognized text.
        private string OCR(Bitmap b)
        {
            string res = "";
            using (var engine = new TesseractEngine(@"tessdata", "vie", EngineMode.Default))
            {
                using (var page = engine.Process(b, PageSegMode.AutoOnly))
                {
                    res = page.GetText();
                }
            }
            return res;
        }

        // Runs OCR on a background task; UI controls are only touched via BeginInvoke.
        private void button2_Click(object sender, EventArgs e)
        {
            string result = "";
            Task.Factory.StartNew(() =>
            {
                // Show the loading indicator.
                picloading.BeginInvoke(new Action(() =>
                {
                    picloading.Visible = true;
                }));

                result = OCR((Bitmap)pictureBox1.Image);
                richTextBox1.BeginInvoke(new Action(() =>
                {
                    richTextBox1.Text = result;
                }));

                // Hide the loading indicator.
                picloading.BeginInvoke(new Action(() =>
                {
                    picloading.Visible = false;
                }));
            });
        }
    }
}
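The code above only shows the OCR result in a RichTextBox, while the stated goal is a text file. A minimal sketch of that last step follows, assuming the recognized text is already in a string. The class name `OcrOutput`, the method `SaveAsText`, and the output path are hypothetical and not part of the original post. It writes the text as UTF-8 so that Vietnamese characters are preserved in the file; whether encoding has anything to do with the font error described above is not known.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper, not part of the original post.
static class OcrOutput
{
    // Writes the OCR text to outputPath as UTF-8 with a byte order mark,
    // so editors can detect the encoding of the file.
    public static void SaveAsText(string ocrText, string outputPath)
    {
        File.WriteAllText(outputPath, ocrText, new UTF8Encoding(true));
    }
}
```

It could be called after the OCR step, for example `OcrOutput.SaveAsText(result, @"D:\output.txt");`, where the path is only an illustration.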


@TunAnhNguyn-8918, welcome to Microsoft Q&A. Your problem is related to Tesseract and IronOcr, which are both third-party products and are not supported on Microsoft Q&A. Please ask your question on the GitHub page of the relevant product.

Thanks for your understanding.


0 Answers