
HtmlAgilityPack: XPath usage

Date: 2022-08-31 15:22:08
Tags: xpath, htmlDoc, string, int, HtmlAgilityPack, usage, var, nodes, div

<div class="m-repbox"><!--/html/body/div-->
        <div class="m-repbody firstPage"><!--/html/body/div/div-->
<div class="t1">基本信息</div>
<div class="g-tt-h3 f-tleft f-mgtop">基本概况信息</div><!--/html/body/div/div[1]/div[2]-->
<table class="g-tab-bor f-tab-nomargin">
                <tr>
                    <th class="g-w-4">经济类型</th>
                    <td class="g-w-4 ">股份有限(公司)</td>
                    <th class="g-w-4">组织机构类型</th>
                    <td class="g-w-4 ">企业</td>
                </tr>
                <tr>
                    <th>企业规模</th>
                    <td class="">微型企业</td>
                    <th>所属行业</th>
                    <td class="">建材批发</td>
                </tr>
</table>
            <div class="g-tt-h3 f-tleft f-mgtop">实际控制人</div><!--/html/body/div/div[1]/div[3]-->
            <table class="g-tab-bor f-tab-nomargin">
                <tr>
                    <th class="g-w-4">名称</th>
                    <th class="g-w-4">身份标识类型</th>
                    <th class="g-w-4">身份标识号码</th>
                    <th class="g-w-4">更新日期</th>
                </tr>
                <tbody class="">
                    <tr>
                        <td>控制人</td>
                        <td class="g-w-4">身份证</td>
                        <td class="g-w-4">*******************</td>
                        <td class="g-w-4">2017-03-01</td>
                    </tr>
                </tbody>
                <tbody class="">
                    <tr>
                        <td>控制人二二二二二</td>
                        <td class="g-w-4">组织机构代码</td>
                        <td class="g-w-4">***********</td>
                        <td class="g-w-4">2017-03-01</td>
                    </tr>
                </tbody>
            </table>
</div>
</div>
Install the HtmlAgilityPack package from NuGet.


using System.Text.RegularExpressions;
using HtmlAgilityPack;

        HtmlDocument htmlDoc;

        /// <summary>
        /// Parses the HTML page source into htmlDoc.
        /// </summary>
        /// <param name="htmlSource">Raw HTML of the page.</param>
        public void LoadHtml(string htmlSource)
        {
            htmlDoc = new HtmlDocument();
            htmlDoc.LoadHtml(htmlSource);
        }

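        /// <summary>
        /// Returns the 1-based position of the first node selected by xPath whose
        /// InnerText matches keyword, or int.MinValue if none matches.
        /// Note: keyword is used as a regular-expression pattern.
        /// </summary>
        public int GetNodeIndexByKeyword(string xPath, string keyword)
        {
            var index = int.MinValue;
            var nodes = htmlDoc.DocumentNode.SelectNodes(xPath); // null when nothing matches
            if (nodes != null)
            {
                for (var i = 0; i < nodes.Count; i++)
                {
                    var data = nodes[i].InnerText;
                    if (Regex.IsMatch(data, keyword))
                    {
                        index = i + 1; // XPath positions are 1-based
                        break;
                    }
                }
            }
            return index;
        }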

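        /// <summary>
        /// Starting from divPath[divIndex], finds the first table that follows it among its
        /// siblings, then returns the positional index of the div closest before that table.
        /// The result equals divIndex when no other div sits between the div and the table.
        /// Returns int.MinValue if there is no such table or the index cannot be parsed.
        /// </summary>
        public int GetNodeIndex(string divPath, int divIndex)
        {
            var index = int.MinValue;

            // e.g. "/html/body/div/div[4]/div[2]/following-sibling::table[1]/preceding-sibling::div[1]"
            var tableXPath = string.Format("{0}[{1}]/following-sibling::table[1]/preceding-sibling::div[1]", divPath, divIndex);
            var nodes = htmlDoc.DocumentNode.SelectNodes(tableXPath);
            if (nodes != null)
            {
                // Captures the number inside the brackets of the last path step, e.g. "div[2]" -> "2".
                var rgx = new Regex(@"(?<=\[)\d+(?=\])");
                foreach (var node in nodes)
                {
                    var lastStep = node.XPath.Substring(node.XPath.LastIndexOf('/') + 1);
                    // Only overwrite the default when a number was actually found.
                    if (int.TryParse(rgx.Match(lastStep).Value, out int i))
                    {
                        index = i;
                    }
                }
            }
            return index;
        }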
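
One thing to keep in mind with GetNodeIndexByKeyword: it passes keyword straight to Regex.IsMatch, so characters such as parentheses are read as regex syntax rather than literal text. The sketch below uses only System.Text.RegularExpressions and a cell value from the sample table to show the difference. Wrapping the keyword in Regex.Escape is a suggestion here, not part of the original code, and the class name is made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

class KeywordMatchDemo
{
    static void Main()
    {
        // Cell text taken from the sample table; the parentheses are literal characters.
        var text = "股份有限(公司)";
        var keyword = "股份有限(公司)";

        // As a pattern, "(公司)" is a capture group, so the regex looks for "股份有限公司".
        Console.WriteLine(Regex.IsMatch(text, keyword));               // False

        // Regex.Escape turns the keyword into a literal pattern.
        Console.WriteLine(Regex.IsMatch(text, Regex.Escape(keyword))); // True
    }
}
```

If keywords are always meant literally, escaping them (or using string.Contains) avoids this kind of surprise; if regex patterns are intended, the original behaviour is fine.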

        var xPath = "/html/body/div/div";
        var keyword = "基本信息";
        var divIndex = GetNodeIndexByKeyword(xPath, keyword);

        xPath = string.Format("/html/body/div/div[{0}]/div", divIndex); // e.g. "/html/body/div/div[4]/div"
        keyword = "基本概况信息";
        var divIndex2 = GetNodeIndexByKeyword(xPath, keyword); // 2

        var precedingSiblingIndex = GetNodeIndex(xPath, divIndex2);

        // True when the heading div is the div directly before the first table that follows it.
        var eq = divIndex2 == precedingSiblingIndex;
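
The same check can also be written with relative XPath queries on the nodes themselves, which avoids building path strings and parsing indexes back out of them. The following is a minimal sketch, not the original author's approach: it assumes the HtmlAgilityPack NuGet package is referenced, uses a trimmed-down copy of the sample markup wrapped in html and body tags so the absolute path resolves, and the class name is hypothetical.

```csharp
using System;
using HtmlAgilityPack;

class HeadingTableDemo
{
    static void Main()
    {
        // Trimmed-down sample markup; the html/body wrapper is an assumption about the full page.
        const string html = @"<html><body><div class=""m-repbox""><div class=""m-repbody firstPage"">
<div class=""t1"">基本信息</div>
<div class=""g-tt-h3"">基本概况信息</div>
<table><tr><th>经济类型</th><td>股份有限(公司)</td></tr></table>
<div class=""g-tt-h3"">实际控制人</div>
<table><tr><th>名称</th></tr></table>
</div></div></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select the heading div by its exact text instead of by position.
        var heading = doc.DocumentNode.SelectSingleNode("/html/body/div/div/div[text()='基本概况信息']");
        if (heading == null)
        {
            Console.WriteLine("Heading not found.");
            return;
        }

        // First table after the heading among its siblings (null if there is none).
        var table = heading.SelectSingleNode("following-sibling::table[1]");
        // Closest div before that table; it is the heading itself when nothing sits in between.
        var owner = table?.SelectSingleNode("preceding-sibling::div[1]");

        Console.WriteLine(heading.XPath);
        Console.WriteLine(owner != null && owner.XPath == heading.XPath);
    }
}
```

Comparing the XPath strings of the two nodes plays the same role as the eq comparison above: it holds only when the heading is the div immediately in front of the table.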

 

From: https://www.cnblogs.com/hofmann/p/16643211.html
