Treating HTML like XML using HtmlAgilityPack, and doing it inside of an XSLT too [Repost]

Posted: 2022-10-24 16:03:26

I was not able to post this on Simon Mourier's blog due to the HTML and XSLT tags, so here it is on mine:

Maybe someone has done this already, but I don't see it in the comments.

I created an XSLT extension object based on HtmlAgilityPack. The class is tiny:

using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;
using System.Xml;
using System.Xml.XPath;
using System.IO;

namespace HtmlAgilityPack
{
    public class XslExtension
    {
        public XmlDocument loadhtmlasxml(string url)
        {
            // Create an instance of the HtmlWeb object
            HtmlWeb web = new HtmlWeb();
            // Declare necessary stream and writer objects
            MemoryStream m = new MemoryStream();
            XmlTextWriter xtw = new XmlTextWriter(m, null);
            // Load the content into the writer
            web.LoadHtmlAsXml(url, xtw);
            // Rewind the memory stream
            m.Position = 0;
            // Create, fill, and return the xml document
            XmlDocument xdoc = new XmlDocument();
            xdoc.LoadXml((new StreamReader(m)).ReadToEnd());
            return xdoc;
        }
    }
}

Then I used NXSLT from http://www.xmllab.net to load the custom extension function from the command line, so that the following XSL style sheet can be used directly:

<xsl:stylesheet
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:hap="http://smourier.blogspot.com"
 xmlns:msxsl="urn:schemas-microsoft-com:xslt"
      version="1.0">

 <xsl:output method="html" omit-xml-declaration="yes" indent="no"/>

 <xsl:template match="/">

  <h1>BEGIN TEST OF HtmlAgilityPack.XslExtension</h1>

  <h2>First, connect to http://www.cnn.com and load its node set into a local variable</h2>

  <xsl:variable name="cnn"><xsl:copy-of select="hap:loadhtmlasxml('http://www.cnn.com')" /></xsl:variable>

  <h3>CNN.com has this many nodes:</h3>

  <xsl:value-of select="count(msxsl:node-set($cnn)//*)" />
  <h2>Now, process all the A tags within the "Special Coverage" stories inside the div with class="cnnLSSpecialCovBoxContent" that have an HREF that starts with /2005/.</h2>
   <h3>Special Coverage</h3>
    <xsl:for-each select="msxsl:node-set($cnn)//div[@class='cnnLSSpecialCovBoxContent']//a[starts-with(@href, '/2005/')]">
   <div>
    <h3><xsl:copy-of select="." /></h3>
    <!-- Now get the images from each story if they exist -->
    <h5>Connecting to: <xsl:value-of select="concat('http://www.cnn.com', @href)" /> to retrieve image if it exists</h5>
    <xsl:copy-of select="hap:loadhtmlasxml(concat('http://www.cnn.com', @href))//img[@height = '168']" />
   <br /><br />
   </div>
   </xsl:for-each>
  <h1>END TEST OF HtmlAgilityPack.XslExtension</h1>
 </xsl:template>

</xsl:stylesheet>

The command for NXSLT to perform this is:

nxslt2.exe source.xml source.xsl -ext hap:HtmlAgilityPack.XslExtension xmlns:hap="http://smourier.blogspot.com" -af .\HtmlAgilityPackXslExtension.dll
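
For anyone who would rather not depend on a command-line tool, the standard .NET way to expose a class to a style sheet is XsltArgumentList.AddExtensionObject. The following is only a minimal sketch of that wiring, not a description of what NXSLT does internally. DemoExtension, loadfixed, and the inline markup are hypothetical stand-ins for XslExtension and loadhtmlasxml, chosen so the example runs without a network connection or the HtmlAgilityPack assembly.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;
using System.Xml.Xsl;

// Hypothetical stand-in for XslExtension: same calling pattern from XSLT,
// but it parses a fixed string instead of fetching a URL.
public class DemoExtension
{
    public XPathNodeIterator loadfixed()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<html><body><a href='/2005/a'>A</a><a href='/x'>B</a></body></html>");
        // An XPathNodeIterator is handed to the style sheet as a node set;
        // here it contains just the document root.
        return doc.CreateNavigator().Select("/");
    }
}

public static class ExtensionDemo
{
    // The URI only has to match between the style sheet and AddExtensionObject.
    const string HapNs = "http://smourier.blogspot.com";

    const string Stylesheet =
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' " +
        "xmlns:hap='" + HapNs + "' version='1.0'>" +
        "<xsl:output method='text'/>" +
        "<xsl:template match='/'>" +
        "<xsl:value-of select='count(hap:loadfixed()//a)'/>" +
        "</xsl:template></xsl:stylesheet>";

    public static string Run()
    {
        XslCompiledTransform xslt = new XslCompiledTransform();
        using (XmlReader sheet = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(sheet);
        }

        // Bind an instance of the extension class to the hap namespace.
        XsltArgumentList args = new XsltArgumentList();
        args.AddExtensionObject(HapNs, new DemoExtension());

        // The source document is irrelevant here; the template only calls the extension.
        using (XmlReader input = XmlReader.Create(new StringReader("<dummy/>")))
        using (StringWriter output = new StringWriter())
        {
            xslt.Transform(input, args, output);
            return output.ToString();
        }
    }

    public static void Main()
    {
        Console.WriteLine(Run());
    }
}
```

Swapping DemoExtension for the XslExtension class above should, under the same assumptions, make hap:loadhtmlasxml callable in the same way.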

The style sheet connects to CNN.com using the syntax:

select="hap:loadhtmlasxml('http://www.cnn.com')"

Then, further down, after selecting the matching A elements, it connects to each of the linked stories and retrieves any images with a height of 168, copying them into the HTML result tree.
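
To see what the link-selection XPath picks up without hitting the live site, the same expression can be run against a small XML document in plain C#. This is just an illustrative sketch: the Sample markup, the LinkSelectionDemo class, and the StoryUrls method are made up for the example and only stand in for the XML that loadhtmlasxml would return.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class LinkSelectionDemo
{
    // Made-up sample markup; the real page structure is an assumption.
    const string Sample =
        "<html><body>" +
        "<div class='cnnLSSpecialCovBoxContent'>" +
        "<a href='/2005/WORLD/story1/'>Story 1</a>" +
        "<a href='/SPECIALS/other/'>Other</a>" +
        "<p><a href='/2005/US/story2/'>Story 2</a></p>" +
        "</div>" +
        "<div class='footer'><a href='/2005/ignored/'>Ignored</a></div>" +
        "</body></html>";

    public static List<string> StoryUrls()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(Sample);

        // Same XPath as the xsl:for-each in the style sheet.
        XmlNodeList links = doc.SelectNodes(
            "//div[@class='cnnLSSpecialCovBoxContent']//a[starts-with(@href, '/2005/')]");

        List<string> urls = new List<string>();
        foreach (XmlNode a in links)
        {
            // Mirrors concat('http://www.cnn.com', @href) in the style sheet.
            urls.Add("http://www.cnn.com" + a.Attributes["href"].Value);
        }
        return urls;
    }

    public static void Main()
    {
        foreach (string url in StoryUrls())
        {
            Console.WriteLine(url);
        }
    }
}
```

Only links inside the matching div whose href begins with /2005/ are kept, at any nesting depth, which is what the // step before a provides.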

The same approach could follow links to any depth. I haven't worked out the automatic form processor yet, but I think that could be an XSLT extension too.

Let me know what you think...
http://blogs.wdevs.com/ultravioletconsulting/archive/2005/09/10/10506.aspx



From: https://blog.51cto.com/shanyou/5790067
