AutoText
An intelligent automatic text processing tool.
Project repository: https://github.com/jiangnanboy/AutoText
AutoText's main features are text correction, image OCR, and table structure recognition.
Guide
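Text correction
- For details on text correction, see jcorrector
- This project mainly provides n-gram-based correction, deep-learning-based correction, template-based Chinese grammar correction, and idiom and proper-noun correction
- For usage, see examples/correct in this project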
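As a rough illustration of the n-gram idea mentioned above, the sketch below scores a sentence with character-bigram counts and tries substitutions from a small confusion set, keeping a substitution only when it raises the score. This is not the jcorrector API: the class name NgramCorrectSketch, the toy corpus, and the confusion set are all hypothetical, and a real corrector would use a trained language model and a much larger confusion set (for Chinese, typically similar-sounding or similar-looking characters).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of bigram-based correction; NOT the jcorrector API.
class NgramCorrectSketch {
    private final Map<String, Integer> bigramCounts = new HashMap<>();
    private final Map<Character, List<Character>> confusionSet;

    // Counts character bigrams in a (toy) corpus.
    NgramCorrectSketch(List<String> corpus, Map<Character, List<Character>> confusionSet) {
        this.confusionSet = confusionSet;
        for (String sentence : corpus) {
            for (int i = 0; i + 1 < sentence.length(); i++) {
                bigramCounts.merge(sentence.substring(i, i + 2), 1, Integer::sum);
            }
        }
    }

    // Sum of log(1 + count) over all character bigrams of the text.
    double score(String text) {
        double total = 0.0;
        for (int i = 0; i + 1 < text.length(); i++) {
            total += Math.log(1.0 + bigramCounts.getOrDefault(text.substring(i, i + 2), 0));
        }
        return total;
    }

    // Left to right, replaces a character by a confusion-set candidate
    // only if the replacement strictly increases the bigram score.
    String correct(String text) {
        StringBuilder best = new StringBuilder(text);
        for (int i = 0; i < best.length(); i++) {
            char original = best.charAt(i);
            char bestChar = original;
            double bestScore = score(best.toString());
            for (char candidate : confusionSet.getOrDefault(original, List.of())) {
                best.setCharAt(i, candidate);
                double candidateScore = score(best.toString());
                if (candidateScore > bestScore) {
                    bestScore = candidateScore;
                    bestChar = candidate;
                }
            }
            best.setCharAt(i, bestChar);
        }
        return best.toString();
    }

    public static void main(String[] args) {
        NgramCorrectSketch corrector = new NgramCorrectSketch(
                List.of("hello world", "hello there"),
                Map.of('a', List.of('e'), 'e', List.of('a')));
        System.out.println(corrector.correct("hallo world")); // prints "hello world"
    }
}
```

The greedy left-to-right pass keeps the sketch short; it only illustrates why n-gram statistics can flag and repair an unlikely character.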
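Image OCR
- This part mainly uses the detection and recognition components of PaddleOCR, with the models converted to ONNX format for inference. The project preprocesses each image before recognition, so that in a CPU environment one image takes about 10 seconds on average.
- For usage, see examples/ocr/text/OcrDemo in this project
- PS
- Model download via network drive
- Extraction code: b5vq
- After downloading, the models can be placed under text_recgo in resources, or in another location
- Usage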
    // read image file
    String imagePath = "examples\\ocr\\img_test\\text_example.png";
    var imageFile = Paths.get(imagePath);
    var image = ImageFactory.getInstance().fromFile(imageFile);

    // init model
    // replaceFirst("/", "") strips the leading slash that getPath() returns on Windows (e.g. "/C:/...")
    String detectionModelFile = OcrDemo.class.getClassLoader()
            .getResource(PropertiesReader.get("text_recog_det_model_path"))
            .getPath().replaceFirst("/", "");
    String recognitionModelFile = OcrDemo.class.getClassLoader()
            .getResource(PropertiesReader.get("text_recog_rec_model_path"))
            .getPath().replaceFirst("/", "");
    Path detectionModelPath = Paths.get(detectionModelFile);
    Path recognitionModelPath = Paths.get(recognitionModelFile);
    OcrApp ocrApp = new OcrApp(detectionModelPath, recognitionModelPath);
    ocrApp.init();

    // predict result and consume time
    var timeInferStart = System.currentTimeMillis();
    Pair<List<TextListBox>, Image> imagePair = ocrApp.ocrImage(image, 960);
    System.out.println("consume time: " + (System.currentTimeMillis() - timeInferStart) / 1000.0 + "s");
    for (var result : imagePair.getLeft()) {
        System.out.println(result);
    }

    // save ocr result image
    ocrApp.saveImageOcrResult(imagePair, "ocr_result.png", "examples\\ocr\\output");
    ocrApp.closeAllModel();
- Result: the recognized text and its coordinates
position: [800.0, 609.0, 877.0, 609.0, 877.0, 645.0, 800.0, 645.0], text: 8.23%
position: [433.0, 607.0, 494.0, 607.0, 494.0, 649.0, 433.0, 649.0], text: 68.4
position: [96.0, 610.0, 316.0, 611.0, 316.0, 641.0, 96.0, 640.0], text: 股东权益比率(%)
position: [624.0, 605.0, 688.0, 605.0, 688.0, 650.0, 624.0, 650.0], text: 63.2
position: [791.0, 570.0, 887.0, 570.0, 887.0, 600.0, 791.0, 600.0], text: -39.64%
position: [625.0, 564.0, 687.0, 564.0, 687.0, 606.0, 625.0, 606.0], text: 49.7
position: [134.0, 568.0, 279.0, 568.0, 279.0, 598.0, 134.0, 598.0], text: 毛利率(%)
......
- Result display
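The sample output above is not listed in reading order. The following sketch shows one possible way to post-process such results: it groups boxes into rows by their vertical center and sorts each row from left to right. It assumes each position holds the four corner points of a quadrilateral as x1, y1, ..., x4, y4, which is what the sample values suggest. OcrBox, sortReadingOrder, and the rowTolerance parameter are hypothetical names for illustration, not part of the project's TextListBox or OcrApp classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical post-processing helper; OcrBox is NOT the project's TextListBox.
class OcrReadingOrderSketch {

    static final class OcrBox {
        final double[] quad; // assumed layout: x1, y1, x2, y2, x3, y3, x4, y4
        final String text;

        OcrBox(double[] quad, String text) {
            this.quad = quad;
            this.text = text;
        }

        // Smallest x among the four corners.
        double left() {
            return Math.min(Math.min(quad[0], quad[2]), Math.min(quad[4], quad[6]));
        }

        // Mean y of the four corners.
        double centerY() {
            return (quad[1] + quad[3] + quad[5] + quad[7]) / 4.0;
        }
    }

    // Sorts boxes top to bottom, starts a new row when a box's center lies more
    // than rowTolerance pixels below the first box of the current row, and
    // orders each row left to right.
    static List<OcrBox> sortReadingOrder(List<OcrBox> boxes, double rowTolerance) {
        List<OcrBox> byY = new ArrayList<>(boxes);
        byY.sort(Comparator.comparingDouble(OcrBox::centerY));

        List<OcrBox> result = new ArrayList<>();
        List<OcrBox> row = new ArrayList<>();
        for (OcrBox box : byY) {
            if (!row.isEmpty() && box.centerY() - row.get(0).centerY() > rowTolerance) {
                row.sort(Comparator.comparingDouble(OcrBox::left));
                result.addAll(row);
                row.clear();
            }
            row.add(box);
        }
        row.sort(Comparator.comparingDouble(OcrBox::left));
        result.addAll(row);
        return result;
    }

    public static void main(String[] args) {
        // Numeric entries taken from the sample output above.
        List<OcrBox> boxes = List.of(
                new OcrBox(new double[]{800, 609, 877, 609, 877, 645, 800, 645}, "8.23%"),
                new OcrBox(new double[]{433, 607, 494, 607, 494, 649, 433, 649}, "68.4"),
                new OcrBox(new double[]{624, 605, 688, 605, 688, 650, 624, 650}, "63.2"),
                new OcrBox(new double[]{791, 570, 887, 570, 887, 600, 791, 600}, "-39.64%"),
                new OcrBox(new double[]{625, 564, 687, 564, 687, 606, 625, 606}, "49.7"));
        for (OcrBox box : sortReadingOrder(boxes, 15.0)) {
            System.out.println(box.text);
        }
    }
}
```

The tolerance of 15 pixels is an arbitrary choice for this sample; in practice it would likely be derived from the typical box height.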
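Table structure recognition
- Rule-based and implemented with OpenCV. The table types recognized are bordered tables, borderless tables, and partially bordered tables.
- For usage, see examples/ocr/table/TableDemo in this project
- Usage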
    public static void borderedRecog() {
        String imagePath = "examples\\ocr\\img_test\\bordered_example.png";
        Mat imageMat = imread(imagePath);
        System.out.println("imageMat : " + imageMat.size().height() + " " + imageMat.size().width() + " ");
        Pair<List<List<List<Integer>>>, Mat> pair = BorderedRecog.recognizeStructure(imageMat);
        System.out.println(pair.getLeft());
        ImageUtils.imshow("Image", pair.getRight());
    }

    public static void unBorderedRecog() {
        String imagePath = "examples\\ocr\\img_test\\unbordered_example.jpg";
        Mat imageMat = imread(imagePath);
        System.out.println("imageMat : " + imageMat.size().height() + " " + imageMat.size().width() + " ");
        Pair<List<List<List<Integer>>>, Mat> pair = UnBorderedRecog.recognizeStructure(imageMat);
        System.out.println(pair.getLeft());
        ImageUtils.imshow("Image", pair.getRight());
    }

    public static void partiallyBorderedRecog() {
        String imagePath = "examples\\ocr\\img_test\\partially_example.jpg";
        Mat imageMat = imread(imagePath);
        System.out.println("imageMat : " + imageMat.size().height() + " " + imageMat.size().width() + " ");
        Pair<List<List<List<Integer>>>, Mat> pair = PartiallyBorderedRecog.recognizeStructure(imageMat);
        System.out.println(pair.getLeft());
        ImageUtils.imshow("Image", pair.getRight());
    }
- Result: the coordinates of the table cells
[[[58, 48, 247, 182], [309, 48, 247, 182], [560, 48, 247, 182], [], [], [1061, 48, 247, 182], [1312, 48, 247, 182],
[811, 48, 246, 182], [], [], [], []], [[58, 234, 247, 118], [309, 234, 247, 118], [560, 234, 247, 118], [], [811, 234, 246, 118],
[], [1061, 234, 247, 118], [], [], [1312, 234, 247, 118], [], []], [[58, 356, 247, 118], [], [309, 356, 247, 118],
[560, 356, 247, 118], [], [811, 356, 246, 118], [], [], [1061, 356, 247, 118], [], [1312, 356, 247, 118], []], [[58, 478, 247, 118],
[309, 478, 247, 118], [], [560, 478, 247, 118], [811, 478, 246, 118], [], [], [1312, 478, 247, 118], [], [1061, 478, 247, 118], [], []],
[[58, 600, 247, 119], [309, 600, 247, 119], [], [811, 600, 246, 119], [560, 600, 247, 119], [1061, 600, 247, 119], [],
[1312, 600, 247, 119], [], [], [], []], [[58, 723, 247, 118], [], [309, 723, 247, 118], [811, 723, 246, 118], [560, 723, 247, 118],
[], [], [1061, 723, 247, 118], [], [1312, 723, 247, 118], [], []], [[58, 845, 247, 118], [309, 845, 247, 118], [], [],
[811, 845, 246, 118], [560, 845, 247, 118], [], [1312, 845, 247, 118], [], [1061, 845, 247, 118], [], []]]
- Result display
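The printed structure above contains empty placeholder lists, and the cells within a row are not always ordered by position. The sketch below shows one way such output could be tidied before further use: it drops the empty entries and sorts each row's cells by their first value. It assumes each cell is [x, y, width, height], which is inferred from the sample numbers (for example 58 + 247 is just left of the next cell at 309) and is not confirmed by the source. TableCellsSketch and tidy are hypothetical names, not part of the project.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for the List<List<List<Integer>>> structure printed above.
class TableCellsSketch {

    // Keeps only 4-element cells (assumed [x, y, width, height]),
    // sorts each row by x, and skips rows that end up empty.
    static List<List<List<Integer>>> tidy(List<List<List<Integer>>> rows) {
        List<List<List<Integer>>> tidied = new ArrayList<>();
        for (List<List<Integer>> row : rows) {
            List<List<Integer>> cells = new ArrayList<>();
            for (List<Integer> cell : row) {
                if (cell.size() == 4) {
                    cells.add(cell);
                }
            }
            cells.sort(Comparator.comparingInt(c -> c.get(0)));
            if (!cells.isEmpty()) {
                tidied.add(cells);
            }
        }
        return tidied;
    }

    public static void main(String[] args) {
        // First row of the sample result above, including its empty placeholders.
        List<List<Integer>> firstRow = List.of(
                List.of(58, 48, 247, 182), List.of(309, 48, 247, 182),
                List.of(560, 48, 247, 182), List.<Integer>of(), List.<Integer>of(),
                List.of(1061, 48, 247, 182), List.of(1312, 48, 247, 182),
                List.of(811, 48, 246, 182), List.<Integer>of(), List.<Integer>of(),
                List.<Integer>of(), List.<Integer>of());
        System.out.println(tidy(List.of(firstRow)));
    }
}
```

With the cells in left-to-right order, each rectangle could then be used to crop the corresponding region of the image, for example to run OCR per cell.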
From: https://www.cnblogs.com/little-horse/p/17061993.html