Why am I getting duplicate pages extracted from iText7 C#?

Actually, sequential pages are not returning the same text. Instead, each extraction returns the text of every page up to and including the requested one. You get

the text from page 1 when you extract page 1;
the text from pages 1 and 2 when you extract page 2;
the text from pages 1, 2, and 3 when you extract page 3;
...

Often this happens in code that re-uses one text extraction strategy across multiple pages. That's not the case in your code, though: you correctly create a new strategy object for each page. Thus, the cause must be in the PDF itself.

And indeed, each page of your document also contains the contents of all previous pages, merely outside its crop box. To extract only the text inside the respective page's crop box, you have to filter, e.g. like this:

using System;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Filter;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

string SRC = @"285187.pdf";

PdfDocument pdfDoc = new PdfDocument(new PdfReader(SRC));

Console.WriteLine("\n285187 Filtered\n============\n");

for (int i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
{
    // A fresh strategy per page, so no text carries over between pages.
    var strategy = new SimpleTextExtractionStrategy();
    var pdfPage = pdfDoc.GetPage(i);

    // Only pass on text events that fall inside this page's crop box.
    var filter = new IEventFilter[1];
    filter[0] = new TextRegionEventFilter(pdfPage.GetCropBox());
    var filteredTextEventListener = new FilteredTextEventListener(strategy, filter);

    var currentText = PdfTextExtractor.GetTextFromPage(pdfPage, filteredTextEventListener);

    Console.WriteLine("PAGE {0}", i);
    Console.WriteLine(currentText);
}

pdfDoc.Close();

Note that if the strategy is changed to LocationTextExtractionStrategy, the extracted content comes out the same as it originally was.
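
For comparison, the more common cause mentioned in the answer, re-using one strategy object across pages, can be sketched roughly as follows. This is only an illustration, assuming the itext7 package is referenced; the class `StrategyReuseDemo`, the method `ExtractPerPage`, and the `reuseStrategy` flag are hypothetical names introduced here and are not part of the original question or of iText.

```csharp
using System;
using System.Collections.Generic;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// Hypothetical helper, for illustration only.
public static class StrategyReuseDemo
{
    // Returns one extracted string per page.
    // reuseStrategy == true  : the same strategy object is used for every page,
    //                          so it keeps the text it collected on earlier pages.
    // reuseStrategy == false : a new strategy object is created for every page.
    public static List<string> ExtractPerPage(PdfDocument pdfDoc, bool reuseStrategy)
    {
        var texts = new List<string>();
        ITextExtractionStrategy shared = new SimpleTextExtractionStrategy();

        for (int i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
        {
            ITextExtractionStrategy strategy =
                reuseStrategy ? shared : new SimpleTextExtractionStrategy();

            texts.Add(PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(i), strategy));
        }

        return texts;
    }
}
```

With `reuseStrategy` set to `true`, the result for page 2 should also contain the text of page 1, and so on, which looks much like the symptom described above. If the output is still cumulative with `reuseStrategy` set to `false`, the strategy is not the cause and the PDF itself is the likelier culprit, as in this question.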

From: https://www.cnblogs.com/chucklu/p/17956407
