uptake / uptasticsearch
An Elasticsearch client tailored to data science workflows.
License: BSD 3-Clause "New" or "Revised" License
https://uptakeopensource.github.io/uptasticsearch/reference/get_fields.html
These docs mention retrieve_mapping instead of the actual function name get_fields(). I believe we changed the name mid-development but missed this in the docs.
Currently, this type of aggregation is not supported by the Python package. "Not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame. It's possible that this is handled easily and correctly by pandas.DataFrame.from_json(). To close this issue, a PR would need to update README.md and ensure that es_search() stays magical. Reference the R implementation to see how this could be done.
I'm having issues getting aggregate searches working. I use the exact aggregate query within the Elastic site search and get results fine, but when I run it via the es_search() function, I get an error about missing data:
Error in log_fatal(msg) :
The column given to unpack_nested_data had no data in it.
I understand that there are empty buckets in the aggregation that are affecting the unpack_nested_data function. However, this isn't a problem when the aggregation results are written to a file and then read back in and parsed with chomp_aggs.
to improve readability, like we did here: uptake/pkgnet#10
Right now, in both the Python and R packages, the host names to hit are hard-coded into the tests. We should consider making ES_HOST configurable as an environment variable and have the tests pick that up.
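A minimal sketch of what this could look like on the Python side (the helper name get_es_host and the default URL are assumptions, not the package's current API):

```python
import os

def get_es_host(default="http://127.0.0.1:9200"):
    """Resolve the Elasticsearch host for tests.

    Reads the ES_HOST environment variable and falls back to a local
    default, so CI can point the tests at any cluster.
    """
    return os.environ.get("ES_HOST", default)
```

The R tests could do the same thing with Sys.getenv("ES_HOST", unset = "http://127.0.0.1:9200").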
Since there are also minor changes in scrolling between the versions, I'd love if we could pull all the version-specific stuff up higher and configure it by passing functions instead of putting internal switches like this.
IMHO the best way to handle this would be to use an internal R6 object that just holds all of the version-specific stuff and have that object get passed around to the methods that need it (all internal). But I don't know... open to suggestions.
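To make the idea concrete, here's a rough sketch of the shape of that object (shown in Python for brevity; the R package would presumably use an R6 class, and every name here is hypothetical):

```python
class EsVersionConfig:
    """Holds all version-specific behavior in one place, so the functions
    that build requests never branch on the Elasticsearch version themselves.
    """

    def __init__(self, major_version):
        self.major_version = major_version

    def scroll_request_body(self, scroll_id, keep_alive="5m"):
        # Illustrative only: newer ES versions accept the scroll_id in a
        # JSON body, while very old versions expected it in the URL.
        if self.major_version >= 2:
            return {"scroll": keep_alive, "scroll_id": scroll_id}
        return None  # caller puts scroll_id in the URL instead
```

Internal functions would then take one of these objects as an argument instead of switching on version strings inline.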
uptasticsearch integration tests need a lot of love. We need to test all of the functions in the package on live data in Elasticsearch.
In progress!
The Python package uses Sphinx docstrings, and we declare various Sphinx packages in setup.py, but none of the actual infrastructure (like a conf.py) to run the docs is set up.
Closing this issue involves walking through sphinx-quickstart.
Error message for reference: Error in if (!is.integerish(x)) return(FALSE) : missing value where TRUE/FALSE needed
Today, uptasticsearch only works when you have the ability to directly query the cluster. If your cluster has some authentication / authorization set up on it (e.g. Shield), this library will not work.
I think we need to add support for auth-enabled 5.x and 6.x clusters. As of 5.x, you can use X-Pack to enable security features for Elasticsearch.
I do NOT think we should put any time or effort into supporting Shield on earlier versions of ES, but I'm open to discussion if some users say they have a need for it.
I accidentally did this tonight and got unexpected behavior. With a NULL for index, your URL will look like http://mycluster.whatever:9200//_search, which is totally valid for Elasticsearch and means "search all indexes".
I propose that explicitly passing a NULL is almost CERTAINLY a mistake, and we should err on the side of caution and break when that happens. Searching over all indexes is a valid use case (e.g. when you have data with the same mapping stored in different indexes for different time periods), but we should only support it via explicitly passing _all.
Thoughts, @ngparas @austin3dickey @mfrasco?
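A sketch of the proposed guard (hypothetical code, not the current implementation):

```python
def build_search_url(es_host, es_index):
    """Build a search URL, refusing an unspecified index.

    To search every index, the caller must explicitly pass "_all";
    a missing / NULL index is treated as a mistake.
    """
    if not es_index:
        raise ValueError(
            "es_index is required. To search all indexes, explicitly pass '_all'."
        )
    return "{}/{}/_search".format(es_host.rstrip("/"), es_index)
```

This keeps the "search everything" case available while making it impossible to reach it by accident.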
As of #87, our pkgdown docs are no longer in a docs/ folder at the repo root. That means GitHub Pages can't pick them up!
Per discussion in #58, es_search does not currently support reverse_nested aggregations. Well, technically it might support them, but we have no tests around that.
To be honest, I have no idea how those aggregations work, but I would love for someone knowledgeable to take a run at this issue.
Requirements for closing this issue:
- handling for the reverse_nested stuff
- tests with reverse_nested stuff in them
Now that the Python project is released under the py-pkg directory, we should move the R code into r-pkg for consistency.
After much discussion internally at Uptake, we have decided to add the Python version of uptasticsearch to this project!!!
Initial plan is to host the Python package here and provide documentation on how to install it from source. Whether or not we eventually push to PyPi has not been decided.
In anticipation of this, I've created R and Python labels for issues in this project.
This issue can be closed once py-pkg is created.
Should the next release we do (whenever that happens) be v1.0???
Would love to hear your thoughts @wdearden @jayqi @austin3dickey @ngparas @skirmer and any others finding their way to this issue
Currently, unit tests cover only 59% of the lines of code in this project. Ideally, we would like every line in the project to be covered by tests.
Error in data.table::setnames(batchDT, old = names(batchDT), new = gsub("_source\\.", :
  Some duplicates exist in 'old': _source.details.issueName
If an Elasticsearch result populates duplicate fields, es_search will break here.
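One possible mitigation (a hypothetical sketch in Python for illustration, not what the package does today) is to make the incoming field names unique before doing any rename keyed on old names:

```python
def dedupe_names(names):
    """Append numeric suffixes to duplicated field names so that a
    rename keyed on old names cannot hit the same name twice."""
    seen = {}
    out = []
    for name in names:
        if name in seen:
            seen[name] += 1
            out.append("{}.{}".format(name, seen[name]))
        else:
            seen[name] = 0
            out.append(name)
    return out
```

The R code could apply the same idea (e.g. with make.unique) before calling data.table::setnames.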
From what I can see, unpack_nested_data is an order of magnitude slower than tidyr::unnest.
library(tidyverse)
library(microbenchmark)
library(uptasticsearch)
library(data.table)

n <- 1000
nested_df <- tibble(
    x = 1:n,
    y = rep(list(c("a", "b", "c")), n)
)
nested_dt <- as.data.table(nested_df)

microbenchmark(
    unnest(nested_dt),
    unnest(nested_df),
    unpack_nested_data(nested_dt, col_to_unpack = "y"),
    times = 100
)
Since unpack_nested_data has the same functionality as tidyr::unnest but is limited to data.table and unnesting one column, it should be faster than tidyr::unnest in those cases.
Most of the computational cost is in the line lapply(listDT, data.table::as.data.table). I wrote a version of the function which uses the basic idea of tidyr::unnest without the nonstandard evaluation and without the dplyr verbs (but still with purrr, since that's already an imported package).
unpack_nested_data <- function(chomped_df, col_to_unpack) {
    if (!("data.table" %in% class(chomped_df))) {
        msg <- "For unpack_nested_data, chomped_df must be a data.table"
        futile.logger::flog.fatal(msg)
        stop(msg)
    }
    if (".id" %in% names(chomped_df)) {
        msg <- "For unpack_nested_data, chomped_df cannot have a column named '.id'"
        futile.logger::flog.fatal(msg)
        stop(msg)
    }
    if (!("character" %in% class(col_to_unpack)) || length(col_to_unpack) != 1) {
        msg <- "For unpack_nested_data, col_to_unpack must be a character of length 1"
        futile.logger::flog.fatal(msg)
        stop(msg)
    }
    if (!(col_to_unpack %in% names(chomped_df))) {
        msg <- "For unpack_nested_data, col_to_unpack must be one of the column names"
        futile.logger::flog.fatal(msg)
        stop(msg)
    }
    outDT <- data.table::copy(chomped_df)
    listDT <- outDT[[col_to_unpack]]
    is_df <- purrr::map_lgl(listDT, is.data.frame)
    is_atomic <- purrr::map_lgl(listDT, purrr::is_atomic)
    if (all(is_df)) {
        newDT <- data.table::rbindlist(listDT, fill = TRUE)
    } else if (all(is_atomic)) {
        newDT <- data.table::as.data.table(unlist(listDT))
    } else {
        msg <- "For unpack_nested_data, col_to_unpack must be all atomic vectors or all data frames"
        futile.logger::flog.fatal(msg)
        stop(msg)
    }
    if (nrow(newDT) == 0) {
        msg <- "The column given to unpack_nested_data had no data in it."
        futile.logger::flog.fatal(msg)
        stop(msg)
    }
    group_vars <- setdiff(names(chomped_df), c(names(newDT), col_to_unpack))
    n <- purrr::map_int(listDT, NROW)
    rest <- chomped_df[rep(1:nrow(chomped_df), n), ..group_vars, drop = FALSE]
    outDT <- data.table::data.table(newDT, rest)
    if ("V1" %in% names(outDT)) {
        data.table::setnames(outDT, "V1", col_to_unpack)
    }
    return(outDT)
}
This is about 2.5x faster than tidyr::unnest in the example above. I can submit a pull request if this makes sense. It works, but it doesn't match all of the edge cases in the tests yet.
Travis recently added Windows support. Since many R users are also Windows users and Windows compatibility is necessary for packages to make it to CRAN, it would be valuable to test on Windows at CI time.
For anyone addressing this issue...you do not need to replicate ALL of our current Linux builds with Windows. One R build and one Python build (both with ES 6.2.x) would be sufficient.
Roxygen's @inheritParams tag allows you to import parameter documentation from one function into another. Some arguments (like es_host and es_index) are re-used throughout uptasticsearch. Right now, documentation for those parameters is duplicated across functions, but we could guarantee consistency across functions by centralizing documentation for those parameters on a NULL object and pulling it into each function with @inheritParams.
Right now, if a single request fails, so does es_search.
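One hedged sketch of what per-request resilience could look like (the helper name and behavior are assumptions, not the current implementation):

```python
import time

def with_retries(make_request, max_tries=3, backoff_seconds=0.1):
    """Call make_request(), retrying on failure with linear backoff,
    so one transient failure doesn't abort the whole search."""
    for attempt in range(1, max_tries + 1):
        try:
            return make_request()
        except Exception:
            if attempt == max_tries:
                raise
            time.sleep(backoff_seconds * attempt)
```

Each scroll page (or the initial search) could be wrapped in something like this before giving up.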
Scrolling the data takes minutes (OK), but parsing takes tens of minutes (very strange) for 700-800K documents. I tried keeping only plain (not nested) fields, 4-5 fields per document.
There was at least one case where the same request took this long, but after updating the library the same request was processed within minutes (OK).
I can't work out which data to include here to reproduce the issue. Any thoughts?
I am using the latest version of uptasticsearch now, but the strange behavior is recurring.
Is this known behavior? How should I avoid it?
Looking for someone to figure out how to configure AppVeyor testing for this repo. I've added the project to the UptakeOpenSource account with AppVeyor; someone just needs to figure out how to create the .appveyor.yml that will run our tests.
This is a client for a database, but uptasticsearch currently only has unit tests. This project should have integration tests which use uptasticsearch functions on an actual Elasticsearch index.
ES 6 GA is out in the world now.
I haven't done due diligence on ES 6 yet, but I would love if someone would take some time to get our tests running against it.
Now that Python's released, we need to update the aggs support table in README.md.
Also, es_search will throw NotImplementedError for aggs right now, so this feature needs to be integrated.
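For a sense of what parsing an aggs result into a pandas DataFrame involves, here is a minimal sketch that flattens a simple terms aggregation (the helper name and response shape are illustrative; real support would need to handle nested and multi-level aggs the way the R package does):

```python
import pandas as pd

def flatten_terms_agg(response, agg_name):
    """Flatten one level of a terms aggregation into a DataFrame."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return pd.DataFrame(
        [{agg_name: b["key"], "doc_count": b["doc_count"]} for b in buckets]
    )

# Example shape of an Elasticsearch terms-aggregation response
response = {
    "aggregations": {
        "by_status": {
            "buckets": [
                {"key": "open", "doc_count": 12},
                {"key": "closed", "doc_count": 7},
            ]
        }
    }
}
df = flatten_terms_agg(response, "by_status")
```

The hard part, as in R, is doing this recursively for arbitrarily nested aggregations rather than one hard-coded level.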
The Python code now duplicates some of the integration testing and isn't set up with Travis. Per @jameslamb, here's an example of testing R and Python packages in different sub-builds within one Travis build using their nifty matrix feature: https://github.com/dmlc/xgboost/blob/master/.travis.yml#L15
The redundant files are now under py-pkg/dummy_data, and the logic that needs to be moved to Travis is in py-pkg/Makefile and py-pkg/docker-compose.yml.
A few times in our codebase, we have
data.table::setnames(someDT, old = names(someDT), new = new_names)
This is actually less safe than the following call with unnamed arguments:
data.table::setnames(someDT, new_names)
because the former will break when someDT has duplicate column names. It's not this package's job to break if there are duplicate names in a user's data.
The changes introduced in #51 (thanks again @wdearden!) did not impact any user code or change the algorithmic correctness of uptasticsearch. They did, however, substantially improve the speed and efficiency of the package.
Verifying that this change actually did what it said it would was tricky. We had to manually try the PR branch on our own machines and against other instances of ES, and use system.time() manually to check the speed benchmarks.
Would love if someone would take a shot at adding performance benchmarks to our tests for CI! I would like to test the following:
This will be pretty difficult, I think, because of this library's reliance on connecting to a separate service over a network connection. I envision these tests being limited to the processing of data once it is returned from the server.