
zoekt's Introduction

"Zoekt, en gij zult spinazie eten" - Jan Eertink

("seek, and ye shall eat spinach" - My primary school teacher)

This is a fast text search engine, intended for use with source code. (Pronunciation: roughly as you would pronounce "zooked" in English)

Note: This is a Sourcegraph fork of github.com/google/zoekt. It is now the main maintained source of Zoekt.

INSTRUCTIONS

Downloading

go get github.com/sourcegraph/zoekt/

Indexing

Directory

go install github.com/sourcegraph/zoekt/cmd/zoekt-index
$GOPATH/bin/zoekt-index .

Git repository

go install github.com/sourcegraph/zoekt/cmd/zoekt-git-index
$GOPATH/bin/zoekt-git-index -branches master,stable-1.4 -prefix origin/ .

Repo repositories

go install github.com/sourcegraph/zoekt/cmd/zoekt-{repo-index,mirror-gitiles}
zoekt-mirror-gitiles -dest ~/repos/ https://gfiber.googlesource.com
zoekt-repo-index \
    -name gfiber \
    -base_url https://gfiber.googlesource.com/ \
    -manifest_repo ~/repos/gfiber.googlesource.com/manifests.git \
    -repo_cache ~/repos \
    -manifest_rev_prefix=refs/heads/ --rev_prefix= \
    master:default_unrestricted.xml

Searching

Web interface

go install github.com/sourcegraph/zoekt/cmd/zoekt-webserver
$GOPATH/bin/zoekt-webserver -listen :6070

JSON API

You can retrieve search results as JSON by sending a GET request to zoekt-webserver.

curl --get \
    --url "http://localhost:6070/search" \
    --data-urlencode "q=ngram f:READ" \
    --data-urlencode "num=50" \
    --data-urlencode "format=json"

The response data is a JSON object. You can refer to web.ApiSearchResult to learn about the structure of the object.
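The same request can be issued from Go. The sketch below is illustrative only: `searchURL` and `search` are hypothetical helper names, the base URL assumes the webserver from the example above, and the response is decoded into a generic map because the exact shape of web.ApiSearchResult is not reproduced here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// searchURL builds the same GET request that the curl example sends.
func searchURL(base, q string, num int) string {
	v := url.Values{}
	v.Set("q", q)
	v.Set("num", fmt.Sprint(num))
	v.Set("format", "json")
	return base + "/search?" + v.Encode()
}

// search fetches the results and decodes them into a generic map.
func search(base, q string, num int) (map[string]any, error) {
	resp, err := http.Get(searchURL(base, q, num))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("search failed: %s", resp.Status)
	}
	var out map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	const base = "http://localhost:6070" // assumes a local zoekt-webserver
	fmt.Println(searchURL(base, "ngram f:READ", 50))
	res, err := search(base, "ngram f:READ", 50)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(res), "top-level keys in response")
}
```

Using url.Values keeps the query string correctly escaped, which matters for zoekt queries containing spaces or colons.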

CLI

go install github.com/sourcegraph/zoekt/cmd/zoekt
$GOPATH/bin/zoekt 'ngram f:READ'

Installation

A more organized installation on a Linux server should use a systemd unit file, e.g.:

[Unit]
Description=zoekt webserver

[Service]
ExecStart=/zoekt/bin/zoekt-webserver -index /zoekt/index -listen :443  --ssl_cert /zoekt/etc/cert.pem   --ssl_key /zoekt/etc/key.pem
Restart=always

[Install]
WantedBy=default.target

SEARCH SERVICE

Zoekt comes with a small service management program:

go install github.com/sourcegraph/zoekt/cmd/zoekt-indexserver

cat << EOF > config.json
[{"GithubUser": "username"},
 {"GithubOrg": "org"},
 {"GitilesURL": "https://gerrit.googlesource.com", "Name": "zoekt" }
]
EOF

$GOPATH/bin/zoekt-indexserver -mirror_config config.json

This will mirror all repos under 'github.com/username', 'github.com/org', as well as the 'zoekt' repository. It will index the repositories.

It takes care of fetching and indexing new data and cleaning up logfiles.
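To sanity-check a mirror config before handing it to zoekt-indexserver, it can be decoded in Go. This is a minimal sketch: `mirrorEntry` is a hypothetical type that models only the keys shown in the config.json example above, not the indexserver's own definition.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mirrorEntry covers only the keys used in the example config;
// it is an illustrative subset, not zoekt-indexserver's real type.
type mirrorEntry struct {
	GithubUser string
	GithubOrg  string
	GitilesURL string
	Name       string
}

// parseMirrorConfig decodes a JSON array of mirror entries.
func parseMirrorConfig(data []byte) ([]mirrorEntry, error) {
	var entries []mirrorEntry
	if err := json.Unmarshal(data, &entries); err != nil {
		return nil, err
	}
	return entries, nil
}

const sampleConfig = `[{"GithubUser": "username"},
 {"GithubOrg": "org"},
 {"GitilesURL": "https://gerrit.googlesource.com", "Name": "zoekt" }
]`

func main() {
	entries, err := parseMirrorConfig([]byte(sampleConfig))
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	for _, e := range entries {
		fmt.Printf("%+v\n", e)
	}
}
```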

The webserver can be started from a standard service management framework, such as systemd.

SYMBOL SEARCH

It is recommended to install Universal ctags to improve ranking. See here for more information.

ACKNOWLEDGEMENTS

Thanks to Han-Wen Nienhuys for creating Zoekt. Thanks to Alexander Neubeck for coming up with this idea, and helping Han-Wen Nienhuys flesh it out.

FORK DETAILS

Originally this fork contained some changes that do not make sense to upstream and/or have not yet been upstreamed. However, this is now the de facto source for Zoekt. This section will remain for historical reasons and contains outdated information. It can be removed once the dust settles on moving from google/zoekt to sourcegraph/zoekt. Differences:

  • zoekt-sourcegraph-indexserver is a Sourcegraph specific command which indexes all enabled repositories on Sourcegraph, as well as keeping the indexes up to date.
  • We have exposed the API via keegancsmith/rpc (a fork of net/rpc which supports cancellation).
  • Query primitive BranchesRepos to efficiently specify a set of repositories to search.
  • Allow empty shard directories on startup. Needed when starting a fresh instance which hasn't indexed anything yet.
  • We can return symbol/ctag data in results. Additionally we can run symbol regex queries.
  • We search shards in order of repo name and ignore shard ranking.
  • Other minor changes.

Assuming you have the gerrit upstream configured, a useful way to see what we changed is:

$ git diff gerrit/master -- ':(exclude)vendor/' ':(exclude)Gopkg*'

DISCLAIMER

This is not an official Google product

zoekt's People

Contributors

asdine, bobheadxi, camdencheek, chrismwendt, davejrt, daxmc99, dylangriffith, eseliger, ggilmore, gl-srgr, glundh, hanwen, ijsnow, ijt, isker, jac, jhchabran, jtibshirani, keegancsmith, kzh, mpimenov, mrnugget, nicksnyder, nikos912000, r10r, sluongng, sqs, stefanhengl, uwedeportivo, xavier-calland


zoekt's Issues

gob: type regexp.Regexp has no exported fields - using List with `repo:` in RPC

Hello!

I set up a Go service that imports/uses the RPC client, something like this

package whatever

import (
	...
	"github.com/sourcegraph/zoekt/query"
	"github.com/sourcegraph/zoekt/rpc"
)

func (s *Server) handleSearch(w.....) {
	q, err := query.Parse(searchForm.Q)

	repoOnly := true
	query.VisitAtoms(q, ...) // decide if repoOnly

	if repoOnly {
		s.Searcher.List(ctx, q, opts)
	}
	...
}

func main() {
	client := rpc.Client(*rpcConnect)

	mux.Handle("/api/search", handleSearch)
	...
}

However, whenever the query request is repoOnly, like for example, repo:test, the request to the rpc server fails with:

Search failed, HTTP 500: Internal Server Error - {"Error":"gob: type regexp.Regexp has no exported fields"}.

I get what the error is saying, but not why it's happening/how Sourcegraph gets around it.

Things I've tried/checked:

  1. Made sure that the call to rpc.RegisterGob was getting called on webserver initialization
  2. run query.Simplify() on the query
  3. encoded RepoRegexp query directly, like so: s.Searcher.List(ctx, &query.RepoRegexp{Regexp: re}, opts), using the grafana/regexp package. This works, somehow?
  4. Imported and used grafana/regexp package for something else, but still used q in the call to List. This fails. (I was thinking somehow a conflict between the grafana & stdlib regexp packages was happening..)
  5. encoded Repo query directly, like so: s.Searcher.List(ctx, &query.Repo{Regexp: re}, opts), using the grafana/regexp package. This failed!
  6. Looked into some of the .List() calls in sourcegraph/sourcegraph to see if I was missing anything obvious, didn't see anything.

Am I doing anything wrong here? As far as I can tell, both RepoRegexp and Repo are gob encoded, so not sure why one works and the other doesn't. It's late, I'll take a fresh look tomorrow but would appreciate a look at this. Thanks!

Add Repository.URL to SearchResult?

Right now, SearchResult provides URL templates for linking to individual files and line numbers within a repository, but not a URL for linking to the repository itself. If you want to get the repository URL, you have to join a SearchResult to a list request, or try to heuristically extract it from the file URL template.

I'm sure adding fields to SearchResult is not to be done lightly, but what do you think about adding this one?

Query parse support for `public` and `fork` in addition to the existing `archived`

Right now we can query for public/private, forked/not forked, archived/not archived:

zoekt/query/query.go

Lines 52 to 62 in 2560773

// RawConfig filters repositories based on their encoded RawConfig map.
type RawConfig uint64

const (
	RcOnlyPublic   RawConfig = 1
	RcOnlyPrivate  RawConfig = 2
	RcOnlyForks    RawConfig = 1 << 2
	RcNoForks      RawConfig = 2 << 2
	RcOnlyArchived RawConfig = 1 << 4
	RcNoArchived   RawConfig = 2 << 4
)

But we can only parse queries for archived:

zoekt/query/parse.go

Lines 128 to 136 in 2560773

case tokArchived:
	switch text {
	case "yes":
		expr = RawConfig(RcOnlyArchived)
	case "no":
		expr = RawConfig(RcNoArchived)
	default:
		return nil, 0, fmt.Errorf("query: unknown archived argument %q, want {yes,no}", text)
	}

The other two are only supported via rpc. There should be parse support for them as well.

how to handle git credentials for git ops

Hello - I ran into this while deploying the indexserver on a machine yesterday.

While zoekt-mirror-* have credential handlers that look for tokens or usernames for calling the APIs that list the repos in an org/user/etc., the rest of the git calls (fetch, clone, etc.) have no authentication wrappers.

I'm wondering how anyone who runs zoekt-indexserver on a server to mirror private repos generally handles this. I know it'll depend on the codehost, but generally: do you all write the git credentials to a file? Use the git-credentials store? In my case, I'm specifically using GitHub as the host, and am using a PAT with read scopes.

Right now I'm leaning towards writing a GitHub token as the password to the git-credentials store.

livegrep gets around this by using an environment variable for the github token, then passing that env variable through a pipe to git (through a custom askpass script) so it never touches the file system, but introducing something like that would probably mean abstracting all git calls into something like callGit(args []string, username, password string) and then that function handles credentials if username and password aren't empty. I think that'd be a pretty involved change, and I'm not sure how many here would use it, so really not leaning that direction atm.

Thoughts?
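For discussion, here is a rough sketch of what such a wrapper could look like. It is a simplification of the livegrep approach described above: credentials are placed in the child process environment rather than sent through a pipe. `gitCommand`, the askpass script path, and the `ZOEKT_GIT_*` variable names are all hypothetical; `GIT_ASKPASS` and `GIT_TERMINAL_PROMPT` are standard git environment variables.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// gitCommand is a hypothetical wrapper. It prepares a git invocation and,
// when both credentials are non-empty, exposes them only through the child's
// environment so an askpass helper can read them.
func gitCommand(askpass, username, password string, args ...string) *exec.Cmd {
	cmd := exec.Command("git", args...)
	cmd.Env = os.Environ()
	if username != "" && password != "" {
		cmd.Env = append(cmd.Env,
			"GIT_ASKPASS="+askpass,  // git runs this program to obtain credentials
			"GIT_TERMINAL_PROMPT=0", // fail instead of prompting interactively
			"ZOEKT_GIT_USERNAME="+username, // hypothetical names read by the helper
			"ZOEKT_GIT_PASSWORD="+password,
		)
	}
	return cmd
}

func main() {
	// The helper script path is an assumption; it would print the username or
	// password depending on the prompt git passes as its first argument.
	cmd := gitCommand("/zoekt/bin/askpass.sh", "x-access-token", os.Getenv("GITHUB_TOKEN"), "fetch", "origin")
	fmt.Println(cmd.Args)
}
```

The command is only constructed here, not run. Every git call site would need to go through the wrapper, which is the involved part of the change.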

Move zoekt into monorepo

There aren't all that many people using Zoekt outside of Sourcegraph so we could perhaps consider moving this into the monorepo to simplify the scip-ctags build process.

Currently we need to:

  • merge a scip-ctags PR
  • merge a PR here
  • merge a Zoekt version change PR

and it's quite the hassle; it'd be nice if we could just do it all in one through the monorepo. :)

fix interaction between "tombname" and "untomb"

The shard logs show that some shards are restored just after they have been soft deleted, i.e. in the logs we see "tombname" followed directly by "untomb". The timestamps show that this happens within the same cleanup run.

Fri Feb 4 16:40:13 UTC 2022 0 tombname compound-9b30dd04d89435648576d61010a41d14809b0595_v17.00000.zoekt **repo name**
Fri Feb 4 16:41:00 UTC 2022 0 untomb compound-9b30dd04d89435648576d61010a41d14809b0595_v17.00000.zoekt **repo name**
Sat Feb 5 18:45:59 UTC 2022 0 tombname compound-b69c34db9f4c6dd7a31d6a8105cbfd16bcacc588_v17.00000.zoekt **repo name**
Sat Feb 5 18:47:26 UTC 2022 0 untomb compound-b69c34db9f4c6dd7a31d6a8105cbfd16bcacc588_v17.00000.zoekt **repo name**
Sun Feb 6 18:06:43 UTC 2022 0 tombname compound-b69c34db9f4c6dd7a31d6a8105cbfd16bcacc588_v17.00000.zoekt **repo name**
Sun Feb 6 18:07:27 UTC 2022 0 untomb compound-b69c34db9f4c6dd7a31d6a8105cbfd16bcacc588_v17.00000.zoekt **repo name**
Mon Feb 7 23:23:48 UTC 2022 0 tombname compound-b69c34db9f4c6dd7a31d6a8105cbfd16bcacc588_v17.00000.zoekt **repo name**
Mon Feb 7 23:24:39 UTC 2022 0 untomb compound-b69c34db9f4c6dd7a31d6a8105cbfd16bcacc588_v17.00000.zoekt **repo name**
Tue Feb 8 20:41:18 UTC 2022 0 tombname compound-b69c34db9f4c6dd7a31d6a8105cbfd16bcacc588_v17.00000.zoekt **repo name**
Tue Feb 8 20:42:00 UTC 2022 0 untomb compound-b69c34db9f4c6dd7a31d6a8105cbfd16bcacc588_v17.00000.zoekt **repo name**
Wed Feb 9 10:21:40 UTC 2022 0 tombname compound-b69c34db9f4c6dd7a31d6a8105cbfd16bcacc588_v17.00000.zoekt **repo name**
Wed Feb 9 10:22:19 UTC 2022 0 untomb compound-b69c34db9f4c6dd7a31d6a8105cbfd16bcacc588_v17.00000.zoekt **repo name**
Wed Feb 9 23:42:00 UTC 2022 0 tombname compound-b69c34db9f4c6dd7a31d6a8105cbfd16bcacc588_v17.00000.zoekt **repo name**
T

maybe remove bazel BUILD files

We added the BUILD files since we use bazel in the sourcegraph repo. However, we don't actually use bazel directly in this repo and it's a bit of a chore to ensure it works. It also has caused some heartache in the sourcegraph repo since the tooling in bazel around cross-repo stuff seems less than ideal (or maybe we just use it wrong?).

Should we just rely on normal go tooling and then gazelle in the sg repo? Or is there some advantage I am missing with this approach? cc @jhchabran @davejrt

webserver: fix stats calculation

Memory is our dominant cost factor for operating Zoekt. When evaluating alternatives to Zoekt in the past, we struggled to get good numbers for MEM/Megabyte of indexed data. Compound shards complicate this issue, because the MEM footprint of a repo within a compound shard is different from its footprint as a simple shard. Additionally, a part of the index is memory mapped, which is something we should probably keep track of separately.

calculateStats is supposed to serve this purpose, but it needs an overhaul. See comments in the code.

zoekt-indexserver: Support Per-Repository Branch Selection

I'm using zoekt-indexserver to clone and index a bunch of GitHub repositories. For the most part, it is the default branch that I am interested in indexing, but there is one repository where this is not the case.

It would be great if I had a way to specify the branch to index.

It seems like zoekt-git-index supports being given the branch information, but there's only a single string that can be provided for the zoekt-indexserver invocation, rather than a per-repository setting.

Something like:

{ "GithubOrg": "foo", "Name": "^bar$", "GitHubURL": "https://github.com", "branch": "release" }

Thanks!

indexserver: add debug command to tombstone repos

We should add a debug command to zoekt-sourcegraph-indexserver that allows us to tombstone repos safely. This command would be useful to remove duplicate shards in production.

We don't have a good way to manually tombstone repos in compound shards. The best we can do right now is editing the meta files directly or "explode" the compound shard with zoekt-sourcegraph-indexserver debug explode.

Branch display incorrect if branch filter matches multiple branches

When adding a filter that matches multiple branches, each result has an empty list of matched branches.

I suspect the culprit is this line:

zoekt/eval.go

Line 576 in 52664b7

d.branchNames[repoIdx][uint(bq.masks[repoIdx])])

Shouldn't the branchNames map be indexed by the "branch ID", not by the "mask of matching branches", similar to the code some lines below that?

zoekt/eval.go

Lines 582 to 589 in 52664b7

id := uint32(1)
for mask != 0 {
	if mask&0x1 != 0 {
		branches = append(branches, d.branchNames[repoIdx][uint(id)])
	}
	id <<= 1
	mask >>= 1
}

(I'm guessing from my cursory understanding of the code 😸 aka. I don't know what I'm talking about)
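To make the distinction concrete, here is a small standalone sketch. It assumes, as the second excerpt suggests, that branch names are keyed by single-bit IDs (1, 2, 4, ...); `branchesForMask` is a hypothetical name, not a function in zoekt.

```go
package main

import "fmt"

// branchesForMask returns the name of every branch whose bit is set in mask.
// names is keyed by single-bit branch IDs, as in the loop quoted above.
func branchesForMask(names map[uint]string, mask uint64) []string {
	var branches []string
	for id := uint(1); mask != 0; id, mask = id<<1, mask>>1 {
		if mask&1 != 0 {
			branches = append(branches, names[id])
		}
	}
	return branches
}

func main() {
	names := map[uint]string{1: "main", 2: "dev", 4: "release"}
	mask := uint64(0b101) // bits for "main" and "release"

	fmt.Println(branchesForMask(names, mask)) // decodes each set bit
	fmt.Printf("%q\n", names[uint(mask)])     // raw mask as key: no such entry
}
```

Indexing the map with the raw mask only finds a name when exactly one bit is set, which would explain empty branch names whenever a filter matches several branches.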

Is zoekt indexing crash consistent

I wanted to check whether zoekt indexing is crash consistent. If not, will re-indexing all the repositories on startup (after crash) ensure that indexing is in consistent state?

zoekt-webserver has a memory leak

Hi there friends 👋

We noticed over at GitLab that zoekt-webserver seems to have a pretty obvious memory leak that is correlated with an increase in searches.


We resolved the incident for now by simply restarting pods and allocating more memory. You can read more about the incident here: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16192

Browsing through the recent commits, I don't see anything related to memory specifically, but we'll go ahead and update to a newer version of zoekt to see if that helps.

In the meantime, I thought I'd report ASAP in case this wasn't known. Have y'all seen this before?

Happy to contribute 🤝


Zoekt version: 5f25b3073480520aae1cd145d9f3f57226ff7fbc

zoekt-index gets stuck when indexing some files from a local AOSP repo

Hi!

I'm trying to index a local AOSP repo with zoekt-index, but it gets stuck when indexing some files. As far as I can see, these are test XML files with a bad encoding, or other XML files with unusual characters.

In the process list I see that ctags has the state <defunct> when these files are indexed. Indexing these files individually has the same result.

The full command that I'm executing with one of these files is:
zoekt-index -index ./indexes/ -shard_limit 50000000 -max_trigram_count 40000 ./aosp/external/libxml2/result/ebcdic_566012.xml

And the ctags process is now a zombie.

I'm using the latest versions of zoekt and ctags.

Thank you!

PS: Some sample files that produce the error
userprefs-badencoding xml
ebcdic_566012 xml

RPC: support non-gob encodings?

Hi. I'd like to slap an alternative web UI on zoekt (example motivations: instant search, like livegrep; syntax highlighting in the builtin file viewer; alternative homepage suitable to my particular deployment). The APIs exposed in web (when passing format=json) don't seem sufficient to do this. For example, there doesn't seem to be a way to list repositories, or to retrieve full file contents.

Both of those things are provided by rpc/. But rpc only speaks gob. It seems the only way to use rpc from not-go is to proxy it with go, translating into something that's not gob. I could do that, but it feels like a waste. Ideally there would be a way to serve another more portable encoding (JSON seems like the least controversial option).

I'm no go expert, but it doesn't seem trivial to serve JSON instead of gob, in the way that it is to serve JSON instead of HTML in web, because rpc directly uses net/rpc (or, your fork of it), which precludes alternative encodings. Does that sound correct?

Thanks!

Errors for queries with imbalanced parentheses are reported only for too many left parentheses

Here are some input query strings and the results of parsing them, including the error, if any. Some of them should result in parse errors but do not.

( -> <nil> (query: missing close paren, got token <nil>)
(( -> <nil> (error parsing regexp: missing closing ): `((`)
((( -> <nil> (error parsing regexp: missing closing ): `(((`)
() -> TRUE (<nil>)
) -> TRUE (<nil>) # should error
)) -> TRUE (<nil>) # should error
))) -> TRUE (<nil>) # should error
foo -> substr:"foo" (<nil>)
foo) -> substr:"foo" (<nil>) # should error
foo)) -> substr:"foo" (<nil>) # should error
foo))) -> substr:"foo" (<nil>) # should error
(foo -> <nil> (error parsing regexp: missing closing ): `(foo`)
((foo -> <nil> (error parsing regexp: missing closing ): `((foo`)
(((foo -> <nil> (error parsing regexp: missing closing ): `(((foo`)
(foo) -> substr:"foo" (<nil>)
(foo)) -> substr:"foo" (<nil>) # should error
((foo)) -> substr:"foo" (<nil>)
(((foo)) -> <nil> (error parsing regexp: missing closing ): `(((foo))`)
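The expected behavior for the "should error" rows can be expressed as a simple balance check. This is only a sketch of the rule, not the parser's actual logic: `checkParens` is a hypothetical function that deliberately ignores quoting, escapes, and regexp syntax, all of which the real query parser has to handle.

```go
package main

import (
	"errors"
	"fmt"
)

// checkParens reports unbalanced parentheses in s. It ignores quoting,
// escapes and regexp syntax, so it only illustrates the balancing rule.
func checkParens(s string) error {
	depth := 0
	for _, r := range s {
		switch r {
		case '(':
			depth++
		case ')':
			depth--
			if depth < 0 {
				return errors.New("query: unexpected close paren")
			}
		}
	}
	if depth > 0 {
		return errors.New("query: missing close paren")
	}
	return nil
}

func main() {
	for _, q := range []string{"(", ")", "foo)", "(foo)", "(foo))", "((foo))"} {
		fmt.Printf("%-8s -> %v\n", q, checkParens(q))
	}
}
```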

Update of the zoekt go module(s)

It seems like the github.com/google/zoekt go module is not updated, and I can't get github.com/sourcegraph/zoekt to work.

go: github.com/sourcegraph/zoekt@latest: github.com/sourcegraph/[email protected]: parsing go.mod:
	module declares its path as: github.com/google/zoekt
	        but was required as: github.com/sourcegraph/zoekt

The sourcegraph zoekt docker image does have the latest changes, though.

Are there plans on updating the go modules too?

zoekt-mirror-gerrit doesn't pass credentials to git clone

When zoekt-mirror-gerrit is used with HTTP credentials, the credentials are used when calling the Gerrit API but are lost when executing git clone.

That's because cloneURL uses projectURL which doesn't contain the http password but only the username (unlike rootURL).
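One way to carry the credentials over is to copy the user info from the root URL onto the project URL before cloning. The sketch below uses only net/url; `cloneURLWithCredentials` and the example hosts are hypothetical and do not reflect the actual zoekt-mirror-gerrit code.

```go
package main

import (
	"fmt"
	"net/url"
)

// cloneURLWithCredentials copies the user info (username and any password)
// from root onto project and returns the resulting URL.
func cloneURLWithCredentials(root, project string) (string, error) {
	r, err := url.Parse(root)
	if err != nil {
		return "", err
	}
	p, err := url.Parse(project)
	if err != nil {
		return "", err
	}
	if r.User != nil {
		p.User = r.User
	}
	return p.String(), nil
}

func main() {
	u, err := cloneURLWithCredentials(
		"https://bot:s3cret@gerrit.example.com/",
		"https://bot@gerrit.example.com/a/project",
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(u)
}
```

Note that a password embedded in the clone URL ends up in the clone's git config, so a credential helper may be preferable where that is a concern.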

Language buttons don’t work with multi-word language names

It’s possible for language names displayed in zoekt results to contain whitespace, e.g. “DIGITAL Command Language”, or “DNS Zone File”.

Taking “DNS Zone File” as an example, when you click the “language DNS Zone File” button, you end up with lang:DNS Zone File, which searches for “Zone File” in files of language “DNS”. Instead, the search term should be lang:"DNS Zone File".

I think ideally the language term should only be quoted when necessary, so for single-word languages quotes still shouldn’t be used. That way queries are still as concise as possible.

I’ll probably have a go fixing this myself when I get time.
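The quoting rule described above could look roughly like this. `langFilter` is a hypothetical helper, and the sketch assumes the query parser accepts double-quoted values in the form lang:"DNS Zone File" as proposed in the issue.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// langFilter builds a lang: term, quoting the name only when it contains
// whitespace so that single-word languages stay concise.
func langFilter(name string) string {
	if strings.ContainsAny(name, " \t") {
		return "lang:" + strconv.Quote(name)
	}
	return "lang:" + name
}

func main() {
	fmt.Println(langFilter("Go"))
	fmt.Println(langFilter("DNS Zone File"))
}
```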

How to use this project

The readme of this project has not been updated for a long time. I am new to this project. How should I run it? I can only know by looking at the code. Please update the usage documentation.

`Repository.MergeMutable` doesn't handle changes to *Template fields

Hi. I changed the CommitURLTemplate on some (non-git!) repos we are indexing and noted that they did not get reindexed with the new template. This is because changes to that field (and related ones) are not being handled in Repository.MergeMutable:

zoekt/api.go

Line 654 in bec12a7

// Note: URL, *Template fields are ignored. They are not used by Sourcegraph.

@keegancsmith (since you're on blame 🌞) would Sourcegraph be negatively impacted if they were? Is there any reason to not add these?

Embedding Bootstrap/jQuery static files

Hi!

Have you considered embedding jQuery/Bootstrap static files?

I say this because for certain environments it's necessary to work without internet access and if these files are invoked from the CDN, the web will be displayed without styles.

Thank you!

tests fail when using ./all.bash

When attempting to install zoekt, many tests fail. Is this expected?

One example below. Checked out at 5ddb944

2023/03/24 04:23:43 loading 1 shard(s): repo_v16.00000.zoekt
--- FAIL: TestSkipSubmodules (0.57s)
    tree_test.go:466: createMultibranchRepo: execution error: exit status 128, output + mkdir adir bdir
        + cd adir
        + git init -b master
        Initialized empty Git repository in /usr/local/google/tmp/TestSkipSubmodules132999780/001/adir/.git/
        + mkdir subdir
        + echo acont
        + echo sub-cont
        + git add afile subdir/sub-file
        + git config user.email [email protected]
        + git config user.name 'Your Name'
        + git commit -am amsg
        [master (root-commit) f9c87f6] amsg
         2 files changed, 2 insertions(+)
         create mode 100644 afile
         create mode 100644 subdir/sub-file
        + cd ..
        + cd bdir
        + git init -b master
        Initialized empty Git repository in /usr/local/google/tmp/TestSkipSubmodules132999780/001/bdir/.git/
        + echo bcont
        + ln -s bfile bsymlink
        + git add bfile bsymlink
        + git config user.email [email protected]
        + git config user.name 'Your Name'
        + git commit -am bmsg
        [master (root-commit) 63a7b2b] bmsg
         2 files changed, 2 insertions(+)
         create mode 100644 bfile
         create mode 120000 bsymlink
        + cd ../adir
        + git submodule add --name bname -- ../bdir bname
        Cloning into '/usr/local/google/tmp/TestSkipSubmodules132999780/001/adir/bname'...
        fatal: transport 'file' not allowed
        fatal: clone of '/usr/local/google/tmp/TestSkipSubmodules132999780/001/bdir' into submodule path '/usr/local/google/tmp/TestSkipSubmodules132999780/001/adir/bname' failed
2023/03/24 04:23:44 finished /usr/local/google/tmp/TestFullAndShortRefNames3149365922/002/repo_v16.00000.zoekt: 1966 index bytes (overhead 37.1)
2023/03/24 04:23:44 loading 1 shard(s): repo_v16.00000.zoekt
2023/03/24 04:23:44 finished /usr/local/google/tmp/TestLatestCommit3747486278/002/repo_v16.00000.zoekt: 1871 index bytes (overhead 35.3)
2023/03/24 04:23:44 loading 1 shard(s): repo_v16.00000.zoekt
FAIL

`Content` missing trailing empty lines (with NumContextLines > 0)

I'm seeing a bit of a weird bug in context in ChunkMatches' Content in the JSON API that, as far as I can tell, is not specific to the JSON API:

  1. Checkout latest main
  2. go run cmd/zoekt-git-index/main.go -index zoekt-index -require_ctags .
  3. go run cmd/zoekt-webserver/main.go -html=false -rpc=true -index zoekt-index
  4. curl -sX POST -d'{"q": "testing f:^eval_test.go", "opts": {"ChunkMatches": true, "NumContextLines": 1}}' localhost:6070/api/search | jq -r '.Result.Files[0].ChunkMatches[0].Content' | base64 --decode
    The output you get has only two lines; there is no trailing line of context:
	"strings"
	"testing"

Taking ChunkMatches[1].Content correctly gives output with three lines:


func printRegexp(t *testing.T, r *syntax.Regexp, lvl int) {
	t.Logf("%s%s ch: %d", strings.Repeat(" ", lvl), opnames[r.Op], len(r.Sub))

It seems to be that whenever the final context line is empty (i.e. \n\n), the final newline character is missing from Content.

Viewing the same query on Sourcegraph shows the proper context for the first match:
[Screenshot 2023-01-29 at 01 25 09]
I say I don't think this is specific to the JSON API because print-debugging the Content values before they are serialized shows that the example above has only one trailing newline when it should have two. I'd expect the net/rpc API to have the same problem if I had an easy way to query it interactively. But Sourcegraph has the right data, so maybe that expectation is incorrect?

gitindex: inform the caller whether there were changes since the last index

We run this fork in our indexer, which tells us whether incremental git indexing resulted in any changes. Knowing whether there were any changes can drive metrics and even help prioritize subsequent indexes (e.g. back off indexing repos that change infrequently).

That commit as it is would certainly be an unacceptable breaking change, but I'm curious what maintainers' thoughts on this approach are. Am I missing a better way to do this? Would a not-breaking overload of IndexGitRepo that does something like this be acceptable?

Feature request: filtering on whether code is in an archived repository

Gitlab and GitHub have the concept of archived repositories (and possibly other sources zoekt can use, but I haven't used the others). In many organisations, archived repositories contain code which is not currently in use, but is kept for future reference. When searching code, archived repositories are often no longer relevant. It would be nice to be able to ignore results in these repositories.

In an ideal world, I'd be able to include a search term such as archived:no to filter out results in archived repositories (and, similarly, archived:yes would filter out results in non-archived repositories).
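
As a rough illustration of the requested semantics, here is a sketch of an archived:yes / archived:no filter applied to a list of repositories. Everything in it is an assumption made for illustration: the `repo` type, its `Archived` field, and the `filterArchived` function are hypothetical and do not correspond to existing zoekt types or query atoms.

```go
package main

import "fmt"

// repo is a hypothetical stand-in for per-repository metadata.
type repo struct {
	Name     string
	Archived bool
}

// filterArchived keeps the repos whose Archived flag matches value, which
// must be "yes" or "no" (mirroring the proposed archived: search term).
func filterArchived(repos []repo, value string) ([]repo, error) {
	var want bool
	switch value {
	case "yes":
		want = true
	case "no":
		want = false
	default:
		return nil, fmt.Errorf("archived: expected yes or no, got %q", value)
	}

	var out []repo
	for _, r := range repos {
		if r.Archived == want {
			out = append(out, r)
		}
	}
	return out, nil
}

func main() {
	repos := []repo{
		{Name: "active-service", Archived: false},
		{Name: "old-prototype", Archived: true},
	}
	kept, err := filterArchived(repos, "no")
	if err != nil {
		panic(err)
	}
	for _, r := range kept {
		fmt.Println(r.Name)
	}
}
```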

If this feature is something that you'd be interested in adopting, I may try and implement it myself.

ambiguous result: [] when click on links of zoekt web search result

The command used to create the index is:

$ zoekt-repo-index -parallelism 16 -index /media/d/zoekt -name gerrit -base_url http://gerrit:8080/plugins/gitiles/ -manifest_repo_url http://gerrit:8080/plugins/gitiles/platform/manifest.git  -repo_cache /media/d/mirror -manifest_rev_prefix= --rev_prefix=refs/heads/ master:default.xml

The command used to start the zoekt web server is:

$ zoekt-webserver -index /media/d/zoekt -listen :8080

Clicking a link in the zoekt web search results shows the following error. How can it be fixed?
[Screenshot 2023-01-08_22-11-24]

`case:` does not have any effect within nested expressions

Here are some input query strings and the results of parsing them. case: seems to not have any effect on the parsed query unless it's in the top-level expression.

foo bar -> (and substr:"foo" substr:"bar")
foo bar case:yes -> (and case_substr:"foo" case_substr:"bar")
foo bar case:no -> (and substr:"foo" substr:"bar")

foo or bar -> (or substr:"foo" substr:"bar")
foo or bar case:yes -> (or case_substr:"foo" case_substr:"bar") 
foo or bar case:no -> (or substr:"foo" substr:"bar") 

# `case:` does nothing if not at the top level.
(foo case:yes) bar -> (and substr:"foo" substr:"bar")
(case:yes foo) bar -> (and substr:"foo" substr:"bar")
(case:yes foo (bar)) -> (and substr:"foo" substr:"bar")

case: has special "application" logic (as it only modifies other expressions instead of doing anything on its own) and I suspect there is something wrong with it in that regard:

zoekt/query/parse.go

Lines 322 to 339 in 2560773

for _, q := range qs {
	switch s := q.(type) {
	case *caseQ:
		setCase = s.Flavor
	case *Type:
		if s.Type < typeT {
			typeT = s.Type
		}
	default:
		newQS = append(newQS, q)
	}
}
qs = mapQueryList(newQS, func(q Q) Q {
	if sc, ok := q.(setCaser); ok {
		sc.setCase(setCase)
	}
	return q
})

I believe the problem is that the logic to identify caseQ expressions in that for-loop only iterates over top-level expressions. It needs to iterate the entire tree of expressions, with lower caseQs overriding higher ones.
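
The suggested traversal can be sketched on a toy tree. The types below (`node`, `substr`, `caseFlag`, `group`) are hypothetical stand-ins and not zoekt's query types, and the flag is reduced to a boolean for brevity. The point is only the shape of the recursion: each group looks for a case flag among its own children, and that flag overrides whatever was inherited from the enclosing group.

```go
package main

import "fmt"

// node is any element of the toy query tree.
type node interface{}

// substr is a leaf that matches a pattern, case-sensitively or not.
type substr struct {
	pattern       string
	caseSensitive bool
}

// caseFlag is a modifier: it changes its siblings and matches nothing itself.
type caseFlag struct {
	sensitive bool
}

// group is a parenthesized list of sub-expressions.
type group struct {
	children []node
}

// applyCase walks the whole tree. inherited is the case setting of the
// enclosing group; a caseFlag found directly inside a group overrides it
// for that group and everything nested below it.
func applyCase(n node, inherited bool) node {
	switch v := n.(type) {
	case *substr:
		v.caseSensitive = inherited
	case *group:
		setting := inherited
		var kept []node
		for _, c := range v.children {
			if f, ok := c.(*caseFlag); ok {
				setting = f.sensitive
				continue // the flag itself is dropped from the tree
			}
			kept = append(kept, c)
		}
		for i, c := range kept {
			kept[i] = applyCase(c, setting)
		}
		v.children = kept
	}
	return n
}

func main() {
	// Models "(case:yes foo) bar".
	foo := &substr{pattern: "foo"}
	bar := &substr{pattern: "bar"}
	tree := &group{children: []node{
		&group{children: []node{&caseFlag{sensitive: true}, foo}},
		bar,
	}}

	applyCase(tree, false)
	fmt.Println("foo case-sensitive:", foo.caseSensitive)
	fmt.Println("bar case-sensitive:", bar.caseSensitive)
}
```

In this sketch the nested flag affects only `foo`, while `bar` keeps the top-level default.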

(By that reasoning it may be that Type expressions have the same problem, but I don't really understand what those are supposed to do even normally 🌞.)

zoekt-indexserver - Prune branches when fetching git repository

zoekt-indexserver fetches the git repository to have an up-to-date repository with all branches.

func fetchGitRepo(dir string) bool {
	cmd := exec.Command("git", "--git-dir", dir, "fetch", "origin")

When a branch is deleted in the repository, it still exists in the index because fetch does not prune deleted branches.

Adding the --prune parameter to the fetch command removes remote-tracking branches that no longer exist on origin, which keeps the list of branches up to date:

cmd := exec.Command("git", "--git-dir", dir, "fetch", "origin", "--prune") 
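
If pruning should be optional, the argument list can be built separately from the command. This is a sketch only: `fetchArgs` and its `prune` parameter are hypothetical and not part of zoekt-indexserver, and the repository path is a placeholder.

```go
package main

import (
	"fmt"
	"os/exec"
)

// fetchArgs builds the git arguments for fetching origin into the given
// git directory, optionally pruning branches deleted on the remote.
func fetchArgs(dir string, prune bool) []string {
	args := []string{"--git-dir", dir, "fetch", "origin"}
	if prune {
		args = append(args, "--prune")
	}
	return args
}

func main() {
	cmd := exec.Command("git", fetchArgs("/path/to/repo.git", true)...)
	// Print the command instead of running it, since the path is a placeholder.
	fmt.Println(cmd.Args)
}
```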

spike: cache matchTree

Webserver creates a new matchTree for every shard it searches. The structure of the matchTree, however, only depends on the query. Some trees, like substrMatchTree, call `iterateNgrams` and thus depend on `indexData`. We should timebox this spike and see whether

  1. matchTree construction shows up in webserver's CPU profile
  2. there is a good way to hydrate a cached matchTree with indexData, instead of creating it from scratch every time.

Problems building the project as a dependency: `golang.org/x/exp/slices` is not pinned?

https://pkg.go.dev/golang.org/x/exp/slices is not in go.mod or go.sum. I'm pretty ignorant with respect to go.mod, but how can the project even build like this?

I'm having particular problems because the project's usage relies on an API in slices that no longer exists in the latest version:

zoekt/indexdata.go

Lines 411 to 415 in 659eac9

// PERF: Sort to increase the chances adjacent checks are in the same btree
// bucket (which can cause disk IO).
slices.SortFunc(ngramOffs, func(a, b runeNgramOff) bool {
	return a.ngram < b.ngram
})

Compare https://pkg.go.dev/golang.org/x/[email protected]/slices#SortFunc to https://pkg.go.dev/golang.org/x/exp/slices#SortFunc.
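
For reference, the difference is in the callback: the excerpt above passes a "less" function returning bool, while newer versions of golang.org/x/exp/slices and the standard library slices package (Go 1.21 and later) expect a comparison function returning int. The sketch below shows the excerpt adapted to the int-returning form using only the standard library. The `runeNgramOff` type here is a simplified stand-in (the `uint64` field type and the `index` field are assumptions), and `sortByNgram` is a hypothetical wrapper.

```go
package main

import (
	"cmp"
	"fmt"
	"slices"
)

// runeNgramOff is a simplified stand-in for the type in the excerpt.
type runeNgramOff struct {
	ngram uint64
	index int
}

// sortByNgram orders offsets by ngram using the three-way comparison
// callback expected by the standard library's slices.SortFunc.
func sortByNgram(offs []runeNgramOff) {
	slices.SortFunc(offs, func(a, b runeNgramOff) int {
		return cmp.Compare(a.ngram, b.ngram)
	})
}

func main() {
	offs := []runeNgramOff{{ngram: 30, index: 0}, {ngram: 10, index: 1}, {ngram: 20, index: 2}}
	sortByNgram(offs)
	fmt.Println(offs)
}
```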

Document API

It would be great to have a small excerpt in the README of the project (or similar) with a few examples on how to interact with the exposed API (via rpc and/or stream endpoints).

Something as simple as curl commands would make it easier to onboard other folks to the project and make it easier to get started on integrations to zoekt.

TOC has unknown section "metadata", unexpected end of JSON input error when start zoekt web server

When starting the zoekt web server with an index created by the zoekt-repo-index command, we get the following error. How can it be fixed?

$ sudo ./zoekt-webserver -index /media/d/zoekt -listen :8080
2023/01/07 21:40:30 loading 329 shard(s): gerrit_v15.00000.zoekt, gerrit_v15.00001.zoekt, gerrit_v15.00002.zoekt, gerrit_v15.00003.zoekt, gerrit_v15.00004.zoekt... 324 more
2023/01/07 21:40:30 file /media/d/zoekt/gerrit_v15.00005.zoekt TOC has unknown section "metadata"
2023/01/07 21:40:30 file /media/d/zoekt/gerrit_v15.00000.zoekt TOC has unknown section "metadata"
2023/01/07 21:40:30 file /media/d/zoekt/gerrit_v15.00001.zoekt TOC has unknown section "metadata"
2023/01/07 21:40:30 file /media/d/zoekt/gerrit_v15.00003.zoekt TOC has unknown section "metadata"
2023/01/07 21:40:30 file /media/d/zoekt/gerrit_v15.00004.zoekt TOC has unknown section "metadata"
2023/01/07 21:40:30 file /media/d/zoekt/gerrit_v15.00002.zoekt TOC has unknown section "metadata"
2023/01/07 21:40:30 reloading: /media/d/zoekt/gerrit_v15.00005.zoekt, err NewSearcher(/media/d/zoekt/gerrit_v15.00005.zoekt): unexpected end of JSON input
2023/01/07 21:40:30 reloading: /media/d/zoekt/gerrit_v15.00000.zoekt, err NewSearcher(/media/d/zoekt/gerrit_v15.00000.zoekt): unexpected end of JSON input

replace directive in go.mod breaks `go install github.com/sourcegraph/zoekt/cmd/zoekt@latest`

Commit 3ce1f2b breaks go install github.com/sourcegraph/zoekt/cmd/zoekt@latest by introducing a replace directive in go.mod.

The error I get is:

go: github.com/sourcegraph/zoekt/cmd/zoekt@latest (in github.com/sourcegraph/[email protected]):
	The go.mod file for the module providing named packages contains one or
	more replace directives. It must not contain directives that would cause
	it to be interpreted differently than if it were the main module.

Replace directives can be problematic for users. Is there a suggested workaround for this?

Gap in monitoring on Zoekt not reporting an outage

Switching Zoekt from google/zoekt to sourcegraph/zoekt broke gob decoding, which broke Zoekt on .com. This did not trigger an Opsgenie alert. We should look at Search Blitz as a place to capture this and send a notification when there is an outage of this level. Error information is in the Slack thread.

merging: set lock on index while merging

Shard merging doesn't set a lock on the index dir which potentially leads to duplicate repositories in compound shards.
The reason why we didn't set a lock so far is that it takes around 3 min to merge a single 2GB compound shard and we
don't want to block other processes like indexing for that long.

However, we have seen duplicate shards in production and the merge process is the most likely culprit.

Proposal

  • Update shard merging to create only one compound shard per call
  • Set a lock on index dir, similar to how cleanup does it
  • Remove duplicate shards

Ideally we would not lock the entire dir while merging, but only set a lock on the involved shards. However we don't
have a central oracle (yet) that could manage this, but maybe there are other ways to achieve this?

provide option to make zoekt-git-index run deterministically

zoekt-git-index sets a timestamp which eventually becomes part of the shard. Some of our tests would profit from zoekt-git-index being deterministic. For example, the recently added "zoekt-merge-index explode" could be tested by comparing the input of "zoekt-merge-index merge" with the output of "zoekt-merge-index explode" byte by byte.
