
A Simple Web Crawler in C#

Posted: 2024-02-28 11:26:28

I. Environment

.NET 6.0

Visual Studio 2022, console application

NuGet packages:

AngleSharp 1.1.0, for HTML parsing

Downloader 3.0.6, for downloading files

ShellProgressBar 5.2.0, for displaying progress bars

II. Result

 

III. Code

1. Program.cs

using ShellProgressBar;
using Spider;
using System.Collections;

var url = "https://blog.csdn.net/u011127019/article/details/124248757";
var data = await HttpHelper.GetHtmlDocument(url);
DownloadHandler downloadHandler = new DownloadHandler();
List<ImageList> imageList = new List<ImageList>();
ImageList imageList1 = new ImageList
{
    Name = "Image directory",
    Images = new List<string>()
};
// Collect the src of every <img> inside the article body.
foreach (var item in data.QuerySelectorAll("#article_content img"))
{
    // item is already the <img> element, so its src can be read directly.
    var href = item.GetAttribute("src");
    if (!string.IsNullOrWhiteSpace(href))
    {
        imageList1.ImageCount++;
        imageList1.Images.Add(href);
    }
}
imageList.Add(imageList1);
var list = imageList; // the list of albums to download
ProgressBarOptions BarOptions = new()
{
    ProgressCharacter = '─',
    ProgressBarOnBottom = true,
    ForegroundColor = ConsoleColor.Yellow,
    ForegroundColorDone = ConsoleColor.DarkGreen,
    BackgroundColor = ConsoleColor.DarkGray,
    BackgroundCharacter = '\u2593'
};

ProgressBarOptions ChildBarOptions = new()
{
    ForegroundColor = ConsoleColor.Green,
    BackgroundColor = ConsoleColor.DarkGreen,
    ProgressCharacter = '─'
};
using var bar = new ProgressBar(list.Count, "Downloading all images", BarOptions);

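// Make sure the output folder exists before the first download.
var imageDir = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "Images");
Directory.CreateDirectory(imageDir);

foreach (var item in list)
{
    bar.Message = $"Album: {item.Name}";
    bar.Tick();
    int i = 1;
    foreach (var imgUrl in item.Images)
    {
        using (var childBar = bar.Spawn(item.ImageCount, $"Image: {imgUrl}", ChildBarOptions))
        {
            childBar.Tick();
            // Pick the file extension from the URL; it stays empty if neither test matches.
            string extension = string.Empty;
            if (imgUrl.Contains(".png"))
            {
                extension = ".png";
            }
            if (imgUrl.Contains(".jpg"))
            {
                extension = ".jpg";
            }

            // Files are saved as 1.png, 2.jpg, ... inside the Images folder.
            await downloadHandler.Download(childBar, imgUrl, Path.Combine(imageDir, i + extension));
            i++;
        }
    }
}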
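
The `Contains(".png")` / `Contains(".jpg")` checks above also match text in the query string and ignore other formats. As a rough alternative, the extension can be read from the path part of the URL. The sketch below is not part of the original project: `GetImageExtension`, its `fallback` parameter, the list of accepted extensions, and the example.com URLs are all assumptions made for illustration.

```csharp
using System;
using System.IO;

// Example calls (hypothetical URLs).
Console.WriteLine(GetImageExtension("https://example.com/a/pic.PNG?w=100"));  // .png
Console.WriteLine(GetImageExtension("https://example.com/download?id=.jpg") == ""); // True
Console.WriteLine(GetImageExtension("not a url", ".bin"));                     // .bin

// Hypothetical helper: returns a lower-case image extension taken from the
// path of an absolute http(s) URL, or the fallback when none is recognised.
static string GetImageExtension(string imageUrl, string fallback = "")
{
    // Only absolute http(s) URLs are handled; anything else gets the fallback.
    if (!Uri.TryCreate(imageUrl, UriKind.Absolute, out var uri) ||
        (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps))
    {
        return fallback;
    }

    // AbsolutePath excludes the query string and fragment,
    // so "pic.PNG?w=100" still yields ".PNG" here.
    var ext = Path.GetExtension(uri.AbsolutePath).ToLowerInvariant();

    // The accepted set is an assumption; adjust it to the formats you expect.
    return ext is ".png" or ".jpg" or ".jpeg" or ".gif" or ".webp" or ".bmp"
        ? ext
        : fallback;
}
```

In Program.cs such a helper could replace the two `if` blocks, with the caller deciding what to pass as the fallback when the URL carries no usable extension.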

2. HttpHelper.cs

using AngleSharp.Html.Dom;
using AngleSharp.Html.Parser;
using Downloader;
using System.Net;
using System.Text;

namespace Spider
{

    public static class HttpHelper
    {
        public const string UserAgent =
            "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36";
        public static IDownloadService Downloader { get; }

        public static DownloadConfiguration DownloadConf => new()
        {
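            BufferBlockSize = 10240, // hosts usually support at most 8000 bytes; the default is 8000
            ChunkCount = 8, // number of chunks the file is split into; the default is 1
            // MaximumBytesPerSecond = 1024 * 50, // download speed limit; the default is 0 (unlimited)
            MaxTryAgainOnFailover = 5, // maximum number of retries on failure
            ParallelDownload = true, // whether chunks are downloaded in parallel; the default is false
            Timeout = 1000, // timeout of each stream reader in milliseconds; the default is 1000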
            RequestConfiguration = {
                Accept = "*/*",
                AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate,
                CookieContainer = new CookieContainer(), // Add your cookies
                Headers = new WebHeaderCollection(), // Add your custom headers
                KeepAlive = true,
                ProtocolVersion = HttpVersion.Version11, // Default value is HTTP 1.1
                UseDefaultCredentials = false,
                UserAgent = UserAgent
            }
        };

        public static HttpClientHandler Handler { get; }

        public static HttpClient Client { get; }

        static HttpHelper()
        {
            Handler = new HttpClientHandler();
            Client = new HttpClient(Handler);
            Client.DefaultRequestHeaders.Add("User-Agent", UserAgent);
            Downloader = new DownloadService(DownloadConf);
        }

        public static async Task<IHtmlDocument> GetHtmlDocument(string url)
        {
            var html = await Client.GetStringAsync(url);
            return new HtmlParser().ParseDocument(html);
        }

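        public static async Task<IHtmlDocument> GetHtmlDocument(string url, string charset)
        {
            // Legacy code pages such as GB2312/GBK are only available on .NET Core
            // and later once this provider has been registered.
            Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

            using var res = await Client.GetAsync(url);
            var resBytes = await res.Content.ReadAsByteArrayAsync();
            // Decode the raw bytes with the charset the page actually uses.
            var resStr = Encoding.GetEncoding(charset).GetString(resBytes);
            return new HtmlParser().ParseDocument(resStr);
        }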

    }
}
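
The second `GetHtmlDocument` overload exists for pages that are not served as UTF-8. The following sketch shows why the charset matters and why the code-pages provider has to be registered first. It is illustrative only: the sample HTML string and the choice of "gb2312" are assumptions, and no network request is made.

```csharp
using System;
using System.Text;

// On .NET Core and later, Encoding.GetEncoding("gb2312") throws unless
// the code-pages provider has been registered.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

var gb2312 = Encoding.GetEncoding("gb2312");

// Stand-in for the bytes of a response body encoded as GB2312.
string html = "<title>爬虫</title>";
byte[] raw = gb2312.GetBytes(html);

// Decoding with the matching charset restores the text;
// decoding the same bytes as UTF-8 garbles the Chinese characters.
string decoded = gb2312.GetString(raw);
string misdecoded = Encoding.UTF8.GetString(raw);

Console.WriteLine(decoded == html);     // True
Console.WriteLine(misdecoded == html);  // False
```

With the helper above, a page known to use that charset would be loaded with `await HttpHelper.GetHtmlDocument(url, "gb2312")`.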

3. DownloadHandler.cs

using Downloader;
using ShellProgressBar;
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Diagnostics;
using System.Linq;
using System.Runtime.InteropServices;
using System.Text;
using System.Threading.Tasks;

namespace Spider
{
    public class DownloadHandler
    {
       
        public async Task Download(IProgressBar bar, string url, string filepath)
        {
            var barOptions = new ProgressBarOptions
            {
                ForegroundColor = ConsoleColor.Yellow,
                BackgroundColor = ConsoleColor.DarkYellow,
                ForegroundColorError = ConsoleColor.Red,
                ForegroundColorDone = ConsoleColor.Green,
                BackgroundCharacter = '\u2593',
                ProgressBarOnBottom = true,
                EnableTaskBarProgress = RuntimeInformation.IsOSPlatform(OSPlatform.Windows),
                DisplayTimeInRealTime = false,
                ShowEstimatedDuration = false
            };
            // Child bar with 100 ticks, one per percent of the current file.
            using var percentageBar = bar.Spawn(100, $"Downloading: {Path.GetFileName(url)}", barOptions);

            HttpHelper.Downloader.DownloadStarted += DownloadStarted;
            HttpHelper.Downloader.DownloadFileCompleted += DownloadFileCompleted;
            HttpHelper.Downloader.DownloadProgressChanged += DownloadProgressChanged;

            try
            {
                await HttpHelper.Downloader.DownloadFileTaskAsync(url, filepath);
            }
            finally
            {
                // HttpHelper.Downloader is shared between calls, so detach the handlers;
                // otherwise they pile up and keep reporting to bars of earlier downloads.
                HttpHelper.Downloader.DownloadStarted -= DownloadStarted;
                HttpHelper.Downloader.DownloadFileCompleted -= DownloadFileCompleted;
                HttpHelper.Downloader.DownloadProgressChanged -= DownloadProgressChanged;
            }

            void DownloadStarted(object? sender, DownloadStartedEventArgs e)
            {
                Trace.WriteLine(
                    $"Image, FileName:{Path.GetFileName(e.FileName)}, TotalBytesToReceive:{e.TotalBytesToReceive}");
            }

            void DownloadFileCompleted(object? sender, AsyncCompletedEventArgs e)
            {
                Trace.WriteLine($"Download finished, filepath:{filepath}");
            }

            void DownloadProgressChanged(object? sender, DownloadProgressChangedEventArgs e)
            {
                // ProgressPercentage is on a 0-100 scale, matching the bar's 100 ticks.
                percentageBar.Tick((int)e.ProgressPercentage);
            }
        }
    }
}

4. Images.cs

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace Spider
{
    public class ImageList
    {
        public string Name { get; set; } = string.Empty;
        public int ImageCount { get; set; }
        // Initialized to an empty list so callers can enumerate it without null checks.
        public List<string> Images { get; set; } = new();
    }
}

IV. Source code download

Link: https://pan.baidu.com/s/1VnnH05Har9hUhxAsIfKSMw?pwd=paws
Extraction code: paws

From: https://www.cnblogs.com/wenthing/p/18039373
