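simple-web-crawler

A simple and efficient website crawler

A simple web crawler built on C#/.NET. It supports asynchronous concurrent crawling, proxy configuration, cookie handling, and Gzip-compressed pages for faster loading.

Full tutorial: 今日头条 (Toutiao) @全栈解密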

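Key features

  • Automatically decompresses Gzip responses based on the page content, which speeds up page loading;
  • Asynchronous, concurrent crawling;
  • Automatic event notifications;
  • Proxy switching;
  • Cookie handling.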
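
How the crawler implements these features internally is not shown in this README. As a rough illustration only, the sketch below shows how Gzip decompression, a proxy, and cookies can be switched on with the standard .NET `HttpClientHandler`. The names `CrawlerHttpSketch` and `CreateHandler` are hypothetical and are not part of simple-web-crawler.

```csharp
using System;
using System.Net;
using System.Net.Http;

public static class CrawlerHttpSketch
{
    // Hypothetical helper: builds a handler with the transport features listed above.
    // proxyAddress is an assumption of this sketch, e.g. "http://127.0.0.1:8080";
    // pass an empty string to connect directly.
    public static HttpClientHandler CreateHandler(string proxyAddress = "")
    {
        var handler = new HttpClientHandler
        {
            // Decompress Gzip/Deflate responses transparently.
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate,
            // Store cookies from responses and send them on later requests.
            CookieContainer = new CookieContainer(),
            UseCookies = true
        };

        if (!string.IsNullOrEmpty(proxyAddress))
        {
            // Route requests through the given proxy.
            handler.Proxy = new WebProxy(proxyAddress);
            handler.UseProxy = true;
        }

        return handler;
    }
}
```

A handler built this way can be passed to `new HttpClient(handler)`. Cookies can be added up front with `handler.CookieContainer.Add(uri, new Cookie("name", "value"))` and read back with `handler.CookieContainer.GetCookies(uri)`.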

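Screenshots

  • Crawling the city list

Cleaning the data with regular expressions

  • Crawling the hotel list

Crawling the hotel list for each city

Sample code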

    // Requires: using System; using System.Collections.Generic; using System.Text.RegularExpressions;

    /// <summary>
    /// Crawls the city list.
    /// </summary>
    public static void CityCrawler()
    {
        var cityUrl = "http://hotels.ctrip.com/citylist"; // entry URL for the crawler
        var cityList = new List<City>(); // city names and their hotel-list URLs
        var cityCrawler = new SimpleCrawler(); // the crawler class written earlier
        cityCrawler.OnStart += (s, e) =>
        {
            Console.WriteLine("Crawler started fetching: " + e.Uri.ToString());
        };
        cityCrawler.OnError += (s, e) =>
        {
            Console.WriteLine("Crawler error at: " + e.Uri.ToString() + ", exception message: " + e.Exception.Message);
        };
        cityCrawler.OnCompleted += (s, e) =>
        {
            // Clean the data in the page source with a regular expression
            var links = Regex.Matches(e.PageSource, @"<a[^>]+href=""*(?<href>/hotel/[^>\s]+)""\s*[^>]*>(?<text>(?!.*img).*?)</a>", RegexOptions.IgnoreCase);
            foreach (Match match in links)
            {
                var city = new City
                {
                    CityName = match.Groups["text"].Value,
                    Uri = new Uri("http://hotels.ctrip.com" + match.Groups["href"].Value)
                };
                // Add to the list; Contains only deduplicates if City implements value equality
                if (!cityList.Contains(city)) cityList.Add(city);
                Console.WriteLine(city.CityName + "|" + city.Uri); // print the city name and URL
            }
            Console.WriteLine("===============================================");
            Console.WriteLine("Crawl finished! Total: " + links.Count + " cities.");
            Console.WriteLine("Elapsed: " + e.Milliseconds + " ms");
            Console.WriteLine("Thread: " + e.ThreadId);
            Console.WriteLine("URL: " + e.Uri.ToString());
        };
        // If you have not been blocked, do not use a proxy: 60.221.50.118:8090
        cityCrawler.Start(new Uri(cityUrl)).Wait();
    }
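
To see what the regular expression in the sample extracts, it can be run against a small HTML fragment without any network access. The sketch below reuses the pattern unchanged. The class name `CityLinkRegexDemo`, the `Extract` helper, and the HTML fragment are made up for illustration, and the real page markup may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CityLinkRegexDemo
{
    // Same pattern as in the sample code.
    private const string Pattern =
        @"<a[^>]+href=""*(?<href>/hotel/[^>\s]+)""\s*[^>]*>(?<text>(?!.*img).*?)</a>";

    // Hypothetical helper: returns (link text, relative href) pairs found in html.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Regex.Matches(html, Pattern, RegexOptions.IgnoreCase))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups["text"].Value, m.Groups["href"].Value));
        }
        return result;
    }

    public static void Main()
    {
        // Made-up fragment: two text links and one image link, one per line.
        var html =
            "<a href=\"/hotel/beijing1\">Beijing</a>\n" +
            "<a href=\"/hotel/shanghai2\"><img src=\"logo.png\"></a>\n" +
            "<a title=\"Guangzhou\" href=\"/hotel/guangzhou32\">Guangzhou</a>";

        foreach (var pair in Extract(html))
        {
            Console.WriteLine(pair.Key + "|http://hotels.ctrip.com" + pair.Value);
        }
        // Prints:
        // Beijing|http://hotels.ctrip.com/hotel/beijing1
        // Guangzhou|http://hotels.ctrip.com/hotel/guangzhou32
    }
}
```

The image link is skipped because of the negative lookahead `(?!.*img)`. Since `.` does not match a newline by default, that lookahead scans to the end of the current line, so a text link is also skipped when "img" appears anywhere later on the same line.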

Technical discussion / contact

  • QQ: 276679490

  • Crawler architecture discussion group (QQ): 180085853
