
gowarcserver's People

Contributors

andrbo, avokadoen, dependabot[bot], maeb

Forkers

avokadoen

gowarcserver's Issues

Use github's docker registry

Describe the solution you'd like
Instead of using Docker Hub, we can use GitHub's own Docker registry. Update the .github workflow to reflect this.

Feature: Update gowarcserver to use the new gowarc API

Is your feature request related to a problem? Please describe.
There is a relatively big WIP change to gowarc which will update the gowarc API

Describe the solution you'd like
Update the dependency to use the new changes in gowarc

Describe alternatives you've considered

Additional context

Bug: sending an ID as a url in the /search endpoint can give results

Describe the bug
When a user sends a WARC ID as the url in a search request, the endpoint responds with the records matching that ID and also reports errors from parsing the url.

To Reproduce
Steps to reproduce the behavior:

  1. Start gowarc with default configuration
  2. Start browser and paste http://localhost:9999/search?url=%3Curn:uuid:e6e41fea-0221-11e7-8fe3-0242ac120007%3E
  3. Observe response status 500 & error message & record entry

Expected behavior
Warcserver should respond as soon as the url fails to parse and should never begin searching the DB.
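The expected behavior could be sketched as follows. This is a minimal illustration, not the project's code: `validateSearchURL` and `searchHandler` are hypothetical names, and the rule that scheme-less input is accepted is an assumption based on the scheme-less double search mentioned in another issue.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// validateSearchURL is a hypothetical helper: it rejects values that cannot be
// parsed as an http(s) URL before any index lookup is attempted.
func validateSearchURL(raw string) (*url.URL, error) {
	candidate := strings.TrimSpace(raw)
	if candidate == "" {
		return nil, errors.New("missing url parameter")
	}
	// Scheme-less input such as "example.com/page" is allowed (assumption);
	// "http://" is prepended for parsing only.
	if !strings.Contains(candidate, "://") {
		candidate = "http://" + candidate
	}
	u, err := url.Parse(candidate)
	if err != nil {
		return nil, fmt.Errorf("invalid url %q: %w", raw, err)
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return nil, fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	if u.Hostname() == "" {
		return nil, fmt.Errorf("url %q has no host", raw)
	}
	return u, nil
}

// searchHandler answers 400 for unparsable input; only valid input would
// reach the (omitted) index lookup.
func searchHandler(w http.ResponseWriter, r *http.Request) {
	u, err := validateSearchURL(r.URL.Query().Get("url"))
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	// Placeholder for the real index lookup.
	fmt.Fprintf(w, "would search for %s\n", u.Host)
}

func main() {
	inputs := []string{
		"<urn:uuid:e6e41fea-0221-11e7-8fe3-0242ac120007>",
		"example.com/page",
	}
	for _, raw := range inputs {
		_, err := validateSearchURL(raw)
		fmt.Printf("%q -> err=%v\n", raw, err)
	}
	// searchHandler would be registered with http.HandleFunc("/search", searchHandler).
}
```

Validating first means a WARC ID such as the one in the reproduction steps is answered with a client error instead of a 500 plus partial results.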

Screenshots
None

Additional context
None

Feature: more linting in CI

Is your feature request related to a problem? Please describe.
Other than go vet, we don't really lint Go code in CI, which might let common problems slip in.

Describe the solution you'd like
Introduce https://github.com/golangci/golangci-lint with some common linters to the CI

Describe alternatives you've considered

Additional context

Allow end user to control badger's memory usage

Is your feature request related to a problem? Please describe.
Currently, gowarcserver instances can consume more memory than they should in their host container.

Describe the solution you'd like
Badger has an extensive API that should enable a solution where the end user can configure the memory usage on startup using arguments and/or the config file. @maeb has a previous attempt at this, which can be found in the gowarc repo (the link is not permanent, so it might die at some point).

Additional context
A good place to start might be Badger's documentation entry on memory usage: https://dgraph.io/docs/badger/get-started/#memory-usage

All option fields in badger v2.2007.2:
https://github.com/dgraph-io/badger/blob/d5a25b83fbf4f3f61ff03a9202e36f5b75544426/options.go#L35

// Required options.
Dir      string
ValueDir string

// Usually modified options.
SyncWrites          bool
TableLoadingMode    options.FileLoadingMode
ValueLogLoadingMode options.FileLoadingMode
NumVersionsToKeep   int
ReadOnly            bool
Truncate            bool
Logger              Logger
Compression         options.CompressionType
InMemory            bool

// Fine tuning options.

MaxTableSize        int64
LevelSizeMultiplier int
MaxLevels           int
ValueThreshold      int
NumMemtables        int
// Changing BlockSize across DB runs will not break badger. The block size is
// read from the block index stored at the end of the table.
BlockSize          int
BloomFalsePositive float64
KeepL0InMemory     bool
BlockCacheSize     int64
IndexCacheSize     int64
LoadBloomsOnOpen   bool

NumLevelZeroTables      int
NumLevelZeroTablesStall int

LevelOneSize       int64
ValueLogFileSize   int64
ValueLogMaxEntries uint32

NumCompactors        int
CompactL0OnClose     bool
LogRotatesToFlush    int32
ZSTDCompressionLevel int

// When set, checksum will be validated for each entry read from the value log file.
VerifyValueChecksum bool

// Encryption related options.
EncryptionKey                 []byte        // encryption key
EncryptionKeyRotationDuration time.Duration // key rotation duration

// BypassLockGaurd will bypass the lock guard on badger. Bypassing lock
// guard can cause data corruption if multiple badger instances are using
// the same directory. Use this options with caution.
BypassLockGuard bool

// ChecksumVerificationMode decides when db should verify checksums for SSTable blocks.
ChecksumVerificationMode options.ChecksumVerificationMode

// DetectConflicts determines whether the transactions would be checked for
// conflicts. The transactions can be processed at a higher rate when
// conflict detection is disabled.
DetectConflicts bool

// Transaction start and commit timestamps are managed by end-user.
// This is only useful for databases built on top of Badger (like Dgraph).
// Not recommended for most users.
managedTxns bool

// 4. Flags for testing purposes
// ------------------------------
maxBatchCount int64 // max entries in batch
maxBatchSize  int64 // max batch size in bytes
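One way to expose such settings on startup is sketched here with the standard `flag` package only, so that it runs without Badger installed. The `MemoryOptions` struct, the flag names, the default values and the `roughBudget` estimate are all assumptions made for illustration; only the field names are taken from the option list above.

```go
package main

import (
	"flag"
	"fmt"
)

// MemoryOptions mirrors a hand-picked subset of the option fields listed
// above. The struct itself and the flag names are hypothetical.
type MemoryOptions struct {
	MaxTableSize   int64
	NumMemtables   int
	BlockCacheSize int64
	IndexCacheSize int64
	KeepL0InMemory bool
}

// parseMemoryOptions reads memory-related settings from command-line
// arguments. The defaults are illustrative values, not Badger's own defaults.
func parseMemoryOptions(args []string) (MemoryOptions, error) {
	var o MemoryOptions
	fs := flag.NewFlagSet("gowarcserver", flag.ContinueOnError)
	fs.Int64Var(&o.MaxTableSize, "max-table-size", 16<<20, "size of each memtable in bytes")
	fs.IntVar(&o.NumMemtables, "num-memtables", 2, "number of memtables kept in memory")
	fs.Int64Var(&o.BlockCacheSize, "block-cache-size", 0, "block cache size in bytes")
	fs.Int64Var(&o.IndexCacheSize, "index-cache-size", 0, "index cache size in bytes")
	fs.BoolVar(&o.KeepL0InMemory, "keep-l0-in-memory", false, "keep level 0 tables in memory")
	if err := fs.Parse(args); err != nil {
		return MemoryOptions{}, err
	}
	if o.NumMemtables < 1 || o.MaxTableSize <= 0 {
		return MemoryOptions{}, fmt.Errorf("num-memtables and max-table-size must be positive")
	}
	return o, nil
}

// roughBudget adds up memtables and caches. It is a coarse estimate made up
// for this sketch and ignores everything else Badger keeps in memory.
func (o MemoryOptions) roughBudget() int64 {
	return int64(o.NumMemtables)*o.MaxTableSize + o.BlockCacheSize + o.IndexCacheSize
}

func main() {
	o, err := parseMemoryOptions([]string{"-num-memtables=3", "-max-table-size=8388608"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v rough budget: %d bytes\n", o, o.roughBudget())
}
```

The parsed values would then be copied onto the corresponding fields of the Badger options before the database is opened.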

Bug: unsortedParallelSearch response results in client getting wrong order

Describe the bug
Because unsortedParallelSearch is used for http + https with two iterators, the response has the wrong order between http and https sites.

To Reproduce
Ask for a url without scheme (which causes a double search on http + https)

Expected behavior
Response is sorted by date

Additional context
There should be an alternative search, similar to closestUniSearch, that searches N keys with N iterators: repeatedly select the oldest/newest item among the iterators' current positions and advance only that iterator, until all iterators are exhausted.

https://github.com/nlnwa/gowarcserver/blob/master/internal/server/warcserver/db.go
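The merge described above is a k-way merge of already sorted streams. The sketch below uses plain slices in place of database iterators; `record`, `cursor` and `mergeSorted` are hypothetical names, and timestamps are assumed to be fixed-width 14-digit strings so that string comparison matches chronological order.

```go
package main

import (
	"container/heap"
	"fmt"
)

// record is a simplified stand-in for an index entry; ts is assumed to be a
// fixed-width 14-digit timestamp.
type record struct {
	ts  string
	url string
}

// cursor tracks the current position in one already sorted result stream.
type cursor struct {
	items []record
	pos   int
}

// cursorHeap orders cursors by the timestamp of their current item.
type cursorHeap []*cursor

func (h cursorHeap) Len() int { return len(h) }
func (h cursorHeap) Less(i, j int) bool {
	return h[i].items[h[i].pos].ts < h[j].items[h[j].pos].ts
}
func (h cursorHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *cursorHeap) Push(x interface{}) { *h = append(*h, x.(*cursor)) }
func (h *cursorHeap) Pop() interface{} {
	old := *h
	n := len(old)
	c := old[n-1]
	*h = old[:n-1]
	return c
}

// mergeSorted merges N streams, each sorted by ts, into one sorted slice by
// always advancing the cursor that currently points at the oldest item.
func mergeSorted(streams ...[]record) []record {
	h := &cursorHeap{}
	for _, s := range streams {
		if len(s) > 0 {
			*h = append(*h, &cursor{items: s})
		}
	}
	heap.Init(h)
	var out []record
	for h.Len() > 0 {
		c := (*h)[0]
		out = append(out, c.items[c.pos])
		c.pos++
		if c.pos < len(c.items) {
			heap.Fix(h, 0) // the cursor has a new head, restore heap order
		} else {
			heap.Pop(h) // the cursor is exhausted, drop it
		}
	}
	return out
}

func main() {
	httpHits := []record{{"20200101000000", "http://example.com/"}, {"20200301000000", "http://example.com/"}}
	httpsHits := []record{{"20200201000000", "https://example.com/"}}
	for _, r := range mergeSorted(httpHits, httpsHits) {
		fmt.Println(r.ts, r.url)
	}
}
```

For newest-first output the comparison in `Less` would be reversed, provided each stream is itself sorted newest-first.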

Improve pkg/searchhandler.go ServeHTTP error handling

The error handling in ServeHTTP should be refactored.

Currently there is a function that handles errors

func (h *searchHandler) handleError(err error, w http.ResponseWriter) {
	if err != nil {
		w.Header().Set("Content-Type", "text/plain")
		w.WriteHeader(404)
		fmt.Fprintf(w, "Error: %v\n", err)
	}
}

This gets called once, and it's in ServeHTTP

key, err := surt.SsurtString(uri, true)
if err != nil {
  h.handleError(err, w)
  return
}

The 404 return status seems a bit eager, since the error can come from a failure to parse the uri, which is not a "not found" condition. There are also other cases where errors are not dealt with properly.
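One possible direction for the refactor is to classify errors before choosing a status code. The sketch below is an assumption about how that could look, not the project's code: the sentinel errors `errInvalidInput` and `errNotFound` and the functions `statusFor` and `writeError` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Hypothetical sentinel errors used to classify failures.
var (
	errInvalidInput = errors.New("invalid input") // the request itself is malformed
	errNotFound     = errors.New("not found")     // the lookup succeeded but found nothing
)

// statusFor maps an error to an HTTP status code.
func statusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, errInvalidInput):
		return http.StatusBadRequest
	case errors.Is(err, errNotFound):
		return http.StatusNotFound
	default:
		return http.StatusInternalServerError
	}
}

// writeError reports err as plain text with a status code that matches its class.
func writeError(w http.ResponseWriter, err error) {
	http.Error(w, fmt.Sprintf("Error: %v", err), statusFor(err))
}

func main() {
	parseErr := fmt.Errorf("%w: could not parse uri", errInvalidInput)
	fmt.Println(statusFor(parseErr))                     // 400
	fmt.Println(statusFor(errNotFound))                  // 404
	fmt.Println(statusFor(errors.New("db unavailable"))) // 500
}
```

With this split, a uri that fails to parse would be wrapped with `errInvalidInput` at the call site, and 404 stays reserved for lookups that find nothing.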

Feature: testing of the cmd pkg to avoid regressions

Is your feature request related to a problem? Please describe.
The draft of PR #3 introduced a severe regression. To avoid merging regressions into master, there should be cmd tests that verify that all commands and arguments work as expected.

Describe the solution you'd like
Unit tests for each command and argument combination
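A common way to make commands testable is to pass the arguments and the output writer in explicitly. The sketch below uses the standard `flag` package for self-containment, while the stack trace in another issue shows that the project itself uses cobra; `run` and `runToString` are hypothetical names, and the printed format only imitates the output seen in that trace.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"strings"
)

// run is a hypothetical command entry point. It receives its arguments and
// output writer explicitly so that tests can call it without a process.
func run(args []string, out io.Writer) error {
	fs := flag.NewFlagSet("index", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of test output
	format := fs.String("f", "cdxj", "index format")
	if err := fs.Parse(args); err != nil {
		return err
	}
	if fs.NArg() == 0 {
		return fmt.Errorf("no files to index")
	}
	fmt.Fprintf(out, "Format: %s\nCount:  %d\n", *format, fs.NArg())
	return nil
}

// runToString captures the command output for assertions.
func runToString(args []string) (string, error) {
	var sb strings.Builder
	err := run(args, &sb)
	return sb.String(), err
}

func main() {
	// Each case is one command and argument combination to verify.
	cases := [][]string{
		{"-f", "cdxj", "a.warc"},
		{"a.warc", "b.warc"},
		{},
		{"-unknown"},
	}
	for _, args := range cases {
		out, err := runToString(args)
		fmt.Printf("args=%q out=%q err=%v\n", args, out, err)
	}
}
```

In a real test file the loop in `main` would become a table-driven test that compares `out` and `err` against expected values for every combination.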

Describe alternatives you've considered
None

Additional context
https://medium.com/swlh/unit-testing-cli-programs-in-go-6275c85af2e7

Feature: Profiling performance for regression

Is your feature request related to a problem? Please describe.
Regressions are hard to avoid when expanding the features of an application.

Describe the solution you'd like
We should watch for regressions in performance through the use of CI (it does not need to be GitHub hosted).
A CI step that profiles the master branch should look for improvements and regressions in performance. This would also serve as integration testing, the lack of which has been a minor issue in the repository (e.g. duplicate flags are not noticed at compile time and only surface as an error at runtime).
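A regression check of this kind could compare a fresh benchmark result against a stored baseline. The sketch below runs a stand-in workload with `testing.Benchmark`; the `regressed` helper, the 10 % tolerance and the baseline value are assumptions for illustration, and a real setup would benchmark indexing or search and load the baseline from a previous run.

```go
package main

import (
	"fmt"
	"sort"
	"testing"
)

// regressed reports whether current is slower than baseline by more than the
// given tolerance (0.10 means 10 %).
func regressed(baselineNs, currentNs int64, tolerance float64) bool {
	return float64(currentNs) > float64(baselineNs)*(1+tolerance)
}

func main() {
	// Stand-in workload; a real benchmark would exercise indexing or search.
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			data := []int{5, 3, 8, 1, 9, 2}
			sort.Ints(data)
		}
	})

	// Hypothetical stored result from an earlier run on the master branch.
	const baselineNs = 1000000

	fmt.Printf("%d ns/op, regressed=%v\n",
		res.NsPerOp(), regressed(baselineNs, res.NsPerOp(), 0.10))
}
```

Benchmark timings vary between machines, so the comparison is only meaningful when baseline and current run on comparable hardware.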

Describe alternatives you've considered
None

Additional context
https://go.dev/blog/pprof
https://hackernoon.com/go-the-complete-guide-to-profiling-your-code-h51r3waz
https://dave.cheney.net/2013/06/30/how-to-write-benchmarks-in-go

Bug: indexing with cdxj causes a panic

Describe the bug
gowarcserver panics when I try to index testdata/IAH-20080430204825-00000-blackbook.warc

To Reproduce
Steps to reproduce the behavior:

  1. Check out the branch in PR #3 (badger-control in fork)
  2. Run gowarcserver index with -f cdxj, e.g. ./warcserver index -f cdxj ./testdata/IAH-20080430204825-00000-blackbook.warc

Expected behavior
gowarcserver should index the files or report a user error to the terminal.

Additional context

[akselhjerpbakk@localhost gowarcserver]$ ./warcserver index -f cdxj ./testdata/IAH-20080430204825-00000-blackbook.warc 
Using config file: /home/akselhjerpbakk/Projects/warcproject/gowarcserver/config.yaml
Format: cdxj
Count:  2
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x91749c]

goroutine 1 [running]:
github.com/golang/protobuf/jsonpb.(*jsonWriter).marshalMessage(0xc0001c9aa0, 0xcd9ba0, 0xc0001d4a00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /home/akselhjerpbakk/go/pkg/mod/github.com/golang/[email protected]/jsonpb/encode.go:219 +0x73c
github.com/golang/protobuf/jsonpb.(*Marshaler).marshal(0x0, 0xccfca0, 0xc0001d4a00, 0x32, 0x2e2, 0xc0001d4a00, 0xc000330040, 0x0)
        /home/akselhjerpbakk/go/pkg/mod/github.com/golang/[email protected]/jsonpb/encode.go:116 +0x207
github.com/golang/protobuf/jsonpb.(*Marshaler).MarshalToString(...)
        /home/akselhjerpbakk/go/pkg/mod/github.com/golang/[email protected]/jsonpb/encode.go:78
github.com/nlnwa/gowarcserver/pkg/index.(*CdxJ).Write(0xc000010268, 0xcd4920, 0xc00007cd40, 0x7ffee18cb20a, 0x32, 0x2e2, 0x0, 0x0)
        /home/akselhjerpbakk/Projects/warcproject/gowarcserver/pkg/index/indexwriter.go:74 +0xe6
github.com/nlnwa/gowarcserver/cmd/warcserver/cmd/index.readFile(0xc0001dc5a0, 0x0, 0x0)
        /home/akselhjerpbakk/Projects/warcproject/gowarcserver/cmd/warcserver/cmd/index/index.go:120 +0x15f
github.com/nlnwa/gowarcserver/cmd/warcserver/cmd/index.runE(0xc0001dc5a0, 0x0, 0x0)
        /home/akselhjerpbakk/Projects/warcproject/gowarcserver/cmd/warcserver/cmd/index/index.go:92 +0xf0
github.com/nlnwa/gowarcserver/cmd/warcserver/cmd/index.NewCommand.func2(0xc0001d3400, 0xc0001dc780, 0x1, 0x3, 0x0, 0x0)
        /home/akselhjerpbakk/Projects/warcproject/gowarcserver/cmd/warcserver/cmd/index/index.go:79 +0xba
github.com/spf13/cobra.(*Command).execute(0xc0001d3400, 0xc0001dc6f0, 0x3, 0x3, 0xc0001d3400, 0xc0001dc6f0)
        /home/akselhjerpbakk/go/pkg/mod/github.com/spf13/[email protected]/command.go:826 +0x47c
github.com/spf13/cobra.(*Command).ExecuteC(0xc0001d2f00, 0xc000000180, 0xc0001c9f78, 0x4118e5)
        /home/akselhjerpbakk/go/pkg/mod/github.com/spf13/[email protected]/command.go:914 +0x30b
github.com/spf13/cobra.(*Command).Execute(...)
        /home/akselhjerpbakk/go/pkg/mod/github.com/spf13/[email protected]/command.go:864
main.main()
        /home/akselhjerpbakk/Projects/warcproject/gowarcserver/cmd/warcserver/main.go:27 +0x2b

Bug: docker artifact on master even if linting and tests fail

Describe the bug
When a new commit arrives in master, the CI will make an artifact even if the testing and linting CI jobs fail

To Reproduce
Push to master

Expected behavior
Docker artifact is omitted in the event of failures in tests and linter

Screenshots
None

Additional context
Introduced in PR 34

Bug: rollout with kubectl breaks

Describe the bug
It seems like badger refuses connection to the new pod with the rollout strategy. The scope of fixing this bug might be too big to be worth it.

@maeb can expand this report if he wants (e.g. steps to reproduce)

To Reproduce
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior

Screenshots

Additional context

Support memento API

Is your feature request related to a problem? Please describe.
The gowarcserver should support the memento API. This enables users to perform queries related to time frames and easily paginate using time.

Describe the solution you'd like
Implement the memento specification as an optional API. It would also be nice to let compiler flags dictate which APIs are included in the executable.

Additional context
http://timetravel.mementoweb.org/guide/api/

Feature: Support revisit blocks without WARC-Refers-To header

Is your feature request related to a problem? Please describe.
It is valid to have records that only have WARC-Refers-To-Target-URI and WARC-Refers-To-Date (see 6.7.2). This is currently not supported in gowarcserver

Describe the solution you'd like
Add handling of records without WARC-Refers-To in the loader/loader.go Load function
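The decision the loader would have to make can be sketched as follows. The header names are the ones mentioned above; `revisitRef` and `resolveRevisit` are hypothetical names, and the plain map of headers is a stand-in for whatever record type the loader actually receives.

```go
package main

import (
	"errors"
	"fmt"
)

// revisitRef describes how the record a revisit refers to can be located.
type revisitRef struct {
	byID      bool   // true when WARC-Refers-To is present
	id        string // value of WARC-Refers-To
	targetURI string // value of WARC-Refers-To-Target-URI
	date      string // value of WARC-Refers-To-Date
}

// resolveRevisit prefers a lookup by record ID and falls back to a lookup by
// target URI and date when WARC-Refers-To is missing.
func resolveRevisit(headers map[string]string) (revisitRef, error) {
	if id := headers["WARC-Refers-To"]; id != "" {
		return revisitRef{byID: true, id: id}, nil
	}
	uri, date := headers["WARC-Refers-To-Target-URI"], headers["WARC-Refers-To-Date"]
	if uri != "" && date != "" {
		return revisitRef{targetURI: uri, date: date}, nil
	}
	return revisitRef{}, errors.New("revisit record has no usable WARC-Refers-To headers")
}

func main() {
	ref, err := resolveRevisit(map[string]string{
		"WARC-Refers-To-Target-URI": "http://example.com/",
		"WARC-Refers-To-Date":       "2020-01-01T00:00:00Z",
	})
	fmt.Printf("%+v err=%v\n", ref, err)
}
```

The fallback branch would translate into a url and date search in the index instead of an ID lookup.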

Feature: Delegate revisit errors to query siblings

Is your feature request related to a problem? Please describe.
As gowarcserver becomes more used with bigger datasets the need for splitting bigger collections might occur.

Describe the solution you'd like
Files can inherently be split since warc records are a thing. Add support for config files to express indexing a subsection of a file.

Example:

config for instance 1, index and "own" records in warc including record 0 to 10

indexFiles:
    - example.warc:0-10

config for instance 2, index and "own" records in warc including record 11 and the remaining in the file

indexFiles:
    - example.warc:11- 

The configuration should also be set up so that instances 1 and 2 are siblings (common parent) in order for queries about example.warc to work.
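Parsing the proposed `file:first-last` entries could look roughly like this. The syntax is taken from the example above; `indexSpec`, `parseIndexSpec` and the use of -1 for an open end are assumptions, and file names that themselves contain a colon are not handled.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// indexSpec is a parsed indexFiles entry.
type indexSpec struct {
	File  string
	First int // first record to index, inclusive
	Last  int // last record to index, inclusive; -1 means until the end of the file
}

// parseIndexSpec parses "file", "file:first-last" or "file:first-".
func parseIndexSpec(s string) (indexSpec, error) {
	s = strings.TrimSpace(s)
	i := strings.LastIndex(s, ":")
	if i < 0 {
		// No range given: index the whole file.
		return indexSpec{File: s, First: 0, Last: -1}, nil
	}
	file, rng := s[:i], s[i+1:]
	parts := strings.SplitN(rng, "-", 2)
	if file == "" || len(parts) != 2 {
		return indexSpec{}, fmt.Errorf("invalid index spec %q", s)
	}
	first, err := strconv.Atoi(parts[0])
	if err != nil || first < 0 {
		return indexSpec{}, fmt.Errorf("invalid first record in %q", s)
	}
	last := -1
	if parts[1] != "" {
		last, err = strconv.Atoi(parts[1])
		if err != nil || last < first {
			return indexSpec{}, fmt.Errorf("invalid last record in %q", s)
		}
	}
	return indexSpec{File: file, First: first, Last: last}, nil
}

func main() {
	for _, s := range []string{"example.warc:0-10", "example.warc:11-", "example.warc"} {
		spec, err := parseIndexSpec(s)
		fmt.Printf("%q -> %+v err=%v\n", s, spec, err)
	}
}
```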

Describe alternatives you've considered
None

Additional context
None

Feature: Support running gowarcserver without badger

Is your feature request related to a problem? Please describe.
A collision of badger db folders can cause a panic. There is also overhead in running an empty DB.

Describe the solution you'd like
Allow the user to pass an argument, e.g. --no-badger, to disable the badger db. This will make it easier to have parent gowarc servers (see issue #4) that only ask children with replicas

Describe alternatives you've considered
None

Additional context
None

Feature: make pkg/server/warcserver/daterange.go more fool proof

Is your feature request related to a problem? Please describe.
It's easy to abuse the functionality in daterange. It should be refactored to use timestamp or other mechanism that makes values harder to abuse

Describe the solution you'd like
Convert daterange string to an int or date oriented type
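A date-oriented type could look like the sketch below. It assumes that the range bounds are 14-digit timestamps of the form YYYYMMDDhhmmss, which is not confirmed by this issue; `dateRange`, `newDateRange` and `contains` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// timestampLayout is the Go layout for 14-digit timestamps (assumed format).
const timestampLayout = "20060102150405"

// dateRange holds parsed bounds, so invalid strings are rejected on construction.
type dateRange struct {
	from, to time.Time
}

// newDateRange parses both bounds and checks that they are ordered.
func newDateRange(from, to string) (dateRange, error) {
	f, err := time.Parse(timestampLayout, from)
	if err != nil {
		return dateRange{}, fmt.Errorf("invalid from date: %w", err)
	}
	t, err := time.Parse(timestampLayout, to)
	if err != nil {
		return dateRange{}, fmt.Errorf("invalid to date: %w", err)
	}
	if t.Before(f) {
		return dateRange{}, fmt.Errorf("to date %s is before from date %s", to, from)
	}
	return dateRange{from: f, to: t}, nil
}

// contains reports whether ts lies within the range, bounds included.
func (r dateRange) contains(ts string) (bool, error) {
	t, err := time.Parse(timestampLayout, ts)
	if err != nil {
		return false, err
	}
	return !t.Before(r.from) && !t.After(r.to), nil
}

func main() {
	r, err := newDateRange("20200101000000", "20201231235959")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	in, _ := r.contains("20200615120000")
	fmt.Println("in range:", in)
}
```

Because parsing happens once in the constructor, malformed or reversed bounds surface as errors immediately instead of producing silent mismatches during string comparison.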

Describe alternatives you've considered
None

Additional context
None

Feature: Specify pattern for filenames to include/exclude when indexing

Is your feature request related to a problem? Please describe.
Indexer tries to index any and all files in the traversal path.

Describe the solution you'd like
A flag or environment variable to specify a pattern for filenames to include/exclude. E.g.
INCLUDE="*.warc.gz" or EXCLUDE="*.md5".
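The filter itself could be built on `filepath.Match` from the standard library. In the sketch below, `shouldIndex` is a hypothetical name, and two choices are assumptions rather than requirements from this issue: exclude patterns win over include patterns, and an empty include list means that every file is included.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// matchAny reports whether name matches at least one glob pattern.
func matchAny(patterns []string, name string) (bool, error) {
	for _, p := range patterns {
		ok, err := filepath.Match(p, name)
		if err != nil {
			return false, fmt.Errorf("bad pattern %q: %w", p, err)
		}
		if ok {
			return true, nil
		}
	}
	return false, nil
}

// shouldIndex decides whether a file on the traversal path is indexed.
// Patterns are matched against the base name only.
func shouldIndex(path string, include, exclude []string) (bool, error) {
	name := filepath.Base(path)
	excluded, err := matchAny(exclude, name)
	if err != nil || excluded {
		return false, err
	}
	if len(include) == 0 {
		return true, nil // no include patterns: everything not excluded is indexed
	}
	return matchAny(include, name)
}

func main() {
	include := []string{"*.warc.gz"}
	exclude := []string{"*.md5"}
	for _, p := range []string{"data/a.warc.gz", "data/a.warc.gz.md5", "data/README.txt"} {
		ok, err := shouldIndex(p, include, exclude)
		fmt.Printf("%s -> index=%v err=%v\n", p, ok, err)
	}
}
```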

Feature: support tikv as a db

Is your feature request related to a problem? Please describe.
Badger has problems when the index becomes too big. Currently, we work around this by running multiple gowarcserver instances.

Describe the solution you'd like
We want a distributed DB for gowarcserver to simplify the gowarcserver's responsibility

Describe alternatives you've considered
None

Additional context
https://tikv.org/

Feature: Allow users to toggle index databases

Is your feature request related to a problem? Please describe.
Currently gowarcserver has 3 index databases in different formats. If one or two of these formats are not needed, they bloat the memory footprint of the application without any benefit.

Describe the solution you'd like
Allow user to use config and arguments to toggle off individual index databases

Describe alternatives you've considered
Removing id index might also be a good change if we are not using the id index in production.

Feature: Update badger to 3.x

Is your feature request related to a problem? Please describe.
Badger 2.x is becoming old

Describe the solution you'd like
Update badger to the latest 3.xx release

Additional context
Badger repository

Allow distribution of gowarcservers with a "parent->child" relationship

Based on meeting with @maeb. He had an idea of a potential direction to improve gowarcserver.

Is your feature request related to a problem? Please describe.
This will solve two problems.

  1. In the loke GUI you can see all the different collections on the main page. If you have a series of collections, then it can be cumbersome to find a given warc record as you have to be aware which collection has the record or manually search each collection
  2. Optimize gowarcserver by distributing indexing and searching

Describe the solution you'd like

[Figure: gowarcserver network diagram]

We can structure gowarcservers like a tree. Each node in the tree can hold records and N child nodes. Arguments or the config file should let you point the gowarcserver being started at its child nodes. When the server receives a query, it should process the query itself while also asking all children to do the same. How it should combine results is left undefined for now, e.g. discarding the requests to children and sending the first found item, or waiting for all children to answer before aggregating the result. It's important to note that, based on the diagram, the only difference between a parent node and a leaf node is that the leaf node has no registered children. Programmatically they should be identical.

Problem 1 will be solved by introduction of the concept of a parent-child relation. It will allow us to set up a network of servers where a root instance can aggregate queries throughout the gowarcserver network. Loke will only have to know about the root. This will result in the end user not having to care about which collection that contains the target record.

Problem 2 will be solved by the fact that queries can be fanned out with goroutines to the children and to the node itself, which should make queries scale with increased data. Indexing of records will also be distributed without locking it to a topic or area (e.g. all indexing of newspapers having to be central).
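The fan-out can be sketched with in-process nodes. This is an illustration under simplifying assumptions: children are plain pointers rather than remote servers reached over HTTP, the `node` type and its `query` method are hypothetical, and the sketch picks the "wait for all children, then aggregate" strategy, which is only one of the options left open above.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// node is a stand-in for one gowarcserver instance. A leaf is simply a node
// without children.
type node struct {
	records  map[string][]string // url -> record IDs held locally
	children []*node
}

// query searches the local records and all children concurrently, waits for
// every child to answer and returns the aggregated, sorted result.
func (n *node) query(url string) []string {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		out []string
	)
	out = append(out, n.records[url]...)
	for _, c := range n.children {
		wg.Add(1)
		go func(c *node) {
			defer wg.Done()
			res := c.query(url) // recursion: a child may itself have children
			mu.Lock()
			out = append(out, res...)
			mu.Unlock()
		}(c)
	}
	wg.Wait()
	sort.Strings(out)
	return out
}

func main() {
	leafA := &node{records: map[string][]string{"example.com": {"rec-2"}}}
	leafB := &node{records: map[string][]string{"example.com": {"rec-3"}}}
	root := &node{
		records:  map[string][]string{"example.com": {"rec-1"}},
		children: []*node{leafA, leafB},
	}
	fmt.Println(root.query("example.com"))
}
```

Because parent and leaf share the same type, the client only needs to know the root, and each extra tree level adds one more round of waiting.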

It's worth noting that this will introduce greater complexity to the codebase, and abusing the tree structure might lead to slower results, as requests will be chained based on tree depth.

This will also open up future optimizations, for example caching common queries when no changes have been made in the db, or skipping nodes when we already know the target node for a query.

Additional context
Google's talk about Go servers (mainly from slide 33 onward)
Potential API http://timetravel.mementoweb.org/guide/api/
