


Toimik.CommonCrawl

.NET 8 C# Common Crawl processing tools.

Features

  • Parses WARC / WAT / WET datasets via streaming (read: no local download required)
  • Extracts URLs from WAT datasets via streaming
  • More to come ...

Quick Start

Installation

Package Manager

PM> Install-Package Toimik.CommonCrawl

.NET CLI

> dotnet add package Toimik.CommonCrawl

Usage

Streaming WARC / WAT / WET datasets

The code below streams remote datasets without saving them to disk. To process local datasets, use Toimik.WarcProtocol directly.

using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;
using Toimik.CommonCrawl; // Assumed namespace of this package's types (e.g. WarcParserStreamer)
using Toimik.WarcProtocol;

public class StreamerProgram
{
    public static async Task Main()
    {
        var streamer = new WarcParserStreamer(
            new HttpClient(), // Ideally a singleton
            new WarcParser(),
            new DebugParseLog());

        // The example below uses October 2021's dataset. Other datasets are found at
        // https://commoncrawl.org/the-data/get-started.
        var urlSegmentList = "/crawl-data/CC-MAIN-2021-43/warc.paths.gz";

        // To stream WAT or WET datasets instead, use one of these path lists:
        // var urlSegmentList = "/crawl-data/CC-MAIN-2021-43/wat.paths.gz";
        // var urlSegmentList = "/crawl-data/CC-MAIN-2021-43/wet.paths.gz";
        var results = streamer.Stream(hostname: "commoncrawl.s3.amazonaws.com", urlSegmentList);
        await foreach (Streamer<Record>.Result result in results)
        {
            var record = result.RecordSegment.Value;

            // The applicable types depend on the selected dataset list path
            switch (record.Type)
            {
                case ConversionRecord.TypeName:

                    // ...
                    break;

                case MetadataRecord.TypeName:

                    // ...
                    break;

                case RequestRecord.TypeName:

                    // ...
                    break;

                case ResponseRecord.TypeName:

                    // ...
                    break;

                case WarcinfoRecord.TypeName:

                    // ...
                    break;
            }
        }
    }

    private class DebugParseLog : IParseLog
    {
        public void ChunkSkipped(string chunk)
        {
            Debug.WriteLine(chunk);
        }

        public void ErrorEncountered(string error)
        {
            Debug.WriteLine(error);
        }
    }
}
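
The path list passed to Stream is itself a gzipped text file with one dataset path per line. As a rough illustration of what "streaming" means here, the sketch below reads such a list over HTTP and decompresses it on the fly using only the .NET base class library. It is not this package's implementation: PathListSketch, BuildUrl, ReadPathsAsync and PrintPathsAsync are hypothetical names, and the use of https plus simple concatenation of hostname and path is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Net.Http;
using System.Threading.Tasks;

public static class PathListSketch
{
    // One HttpClient shared for the lifetime of the process (the "singleton" advice above).
    private static readonly HttpClient Client = new HttpClient();

    // Assumption: the list is served over HTTPS at hostname + urlSegmentList.
    public static string BuildUrl(string hostname, string urlSegmentList)
    {
        return $"https://{hostname}{urlSegmentList}";
    }

    // Decompresses a gzipped path list as it is read and yields each non-empty line.
    public static async IAsyncEnumerable<string> ReadPathsAsync(Stream gzipped)
    {
        using var gzip = new GZipStream(gzipped, CompressionMode.Decompress);
        using var reader = new StreamReader(gzip);
        while (await reader.ReadLineAsync() is string line)
        {
            if (line.Length > 0)
            {
                yield return line;
            }
        }
    }

    // Prints every path in the list; nothing is written to disk.
    public static async Task PrintPathsAsync(string hostname, string urlSegmentList)
    {
        var url = BuildUrl(hostname, urlSegmentList);
        using var stream = await Client.GetStreamAsync(url);
        await foreach (var path in ReadPathsAsync(stream))
        {
            Console.WriteLine(path);
        }
    }
}
```

Because ReadPathsAsync accepts any Stream, the same method can be exercised against an in-memory gzip buffer without network access.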


Extracting URLs from streamed WAT datasets

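using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;
using Toimik.CommonCrawl; // Assumed namespace of this package's types (e.g. WarcParserWatUrlExtractor)
using Toimik.WarcProtocol;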

public class WatUrlExtractorProgram
{
    public static async Task Main()
    {
        const string Hostname = "commoncrawl.s3.amazonaws.com";

        // As the WarcProtocol.WarcParser expects an absolute URL, this factory takes care of
        // all relative URLs by prefixing them with the hostname
        var recordFactory = new WatRecordFactory(Hostname);

        var streamer = new WarcParserStreamer(
            new HttpClient(), // Ideally, a singleton
            new WarcParser(recordFactory),
            new DebugParseLog());
        var extractor = new WarcParserWatUrlExtractor(streamer);

        // The example below uses October 2021's dataset. Other datasets are found at
        // https://commoncrawl.org/the-data/get-started.
        var urlSegmentList = "/crawl-data/CC-MAIN-2021-43/wat.paths.gz";

        var results = extractor.Extract(Hostname, urlSegmentList);
        await foreach (WatUrlExtractor<Record>.Result result in results)
        {
            Console.WriteLine($"{result.Index}: {result.Url}");
        }
    }

    private class DebugParseLog : IParseLog
    {
        public void ChunkSkipped(string chunk)
        {
            Debug.WriteLine(chunk);
        }

        public void ErrorEncountered(string error)
        {
            Debug.WriteLine(error);
        }
    }
}
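
The comment above WatRecordFactory says that relative URLs are prefixed with the hostname so that the parser only sees absolute ones. The sketch below shows one way such prefixing could work. It only illustrates the idea and is not WatRecordFactory's actual code: UrlPrefixSketch and ToAbsolute are hypothetical names, and the https scheme is an assumption.

```csharp
using System;

public static class UrlPrefixSketch
{
    // Returns the URL unchanged if it already has an http(s) scheme;
    // otherwise treats it as a path on the given host.
    public static string ToAbsolute(string hostname, string url)
    {
        if (url.StartsWith("http://", StringComparison.OrdinalIgnoreCase)
            || url.StartsWith("https://", StringComparison.OrdinalIgnoreCase))
        {
            return url;
        }

        // Trim leading slashes so that exactly one separates host and path.
        return $"https://{hostname}/{url.TrimStart('/')}";
    }
}
```

The scheme is checked with a plain string comparison rather than Uri.TryCreate with UriKind.Absolute because, on Unix-like systems, a rooted path such as "/crawl-data/..." can parse as an absolute file URI.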
