
axs's Introduction

Welcome to AXS

Astronomy Extensions for Spark, or AXS, is a distributed framework for astronomical data processing based on Apache Spark. AXS provides a simple Python API to enable fast cross-matching, querying, and analysis of data from astronomical catalogs.
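A minimal sketch of what that API looks like in practice (the catalog names here are hypothetical, and spark is the SparkSession provided by pyspark):

from axs import AxsCatalog, Constants

db = AxsCatalog(spark)
gaia = db.load('gaia')                     # a previously saved AXS table
sdss = db.load('sdss')
# cross-match the two catalogs within 2 arcseconds
matches = gaia.crossmatch(sdss, r=2*Constants.ONE_ASEC)
matches.select('ra', 'dec', 'axsdist').show(5)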

Prerequisites

Before running AXS, make sure you have Java 8 installed and the JAVA_HOME variable set.

You will also need Python 3 with the numpy, pandas, and arrow packages.

Installing

To install AXS, follow these steps (note that an Anaconda installer is planned for the future):

  1. Download the latest AXS tarball from the releases page.
  2. Unpack the tarball to a directory of your choosing.
  3. Set SPARK_HOME environment variable to point to the extraction directory.
  4. Add SPARK_HOME/bin to your PATH variable.
  5. Run the axs-init-config.sh script to update the spark-defaults.conf and hive-site.xml files with the exact SPARK_HOME path.

And you're good to go!
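As a quick sanity check (assuming the steps above succeeded), start pyspark and create an AXS catalog object in the session:

from axs import AxsCatalog

# pyspark provides the SparkSession as the variable 'spark'
db = AxsCatalog(spark)
print(spark.catalog.currentDatabase())   # e.g. 'default' on a fresh install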

Further reading

Read more about starting and using AXS, and its architecture, in the documentation.

If you are using AXS in your scientific work, please cite this paper.

axs's People

Contributors

mjuric, zecevicp


axs's Issues

ADQL support

Hi,
are there any plans to add ADQL support to AXS? Since a lot of archives already use ADQL as their query language, it could be useful to have this interface as well, not only for cross-matches but also for other database queries (cone searches, etc.). I see that there are some libraries that could help translate an ADQL query into an SQL one, like the ADQL Library (Java) and queryparser (Python). Could this be feasible? The biggest problem I can think of is how to ensure that AXS's optimizations are used even when plain ADQL queries are launched, so that the user doesn't have to manually specify the join on cat1.zone = cat2.zone during a cross-match, or exclude duplicates, for example. A solution could be to add some new ad-hoc ADQL functions, or (maybe better but probably more difficult) to add a sort of optimization layer during the ADQL->SQL translation that automatically takes these aspects into account.

Thanks,
Davide Viero
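For reference, the zone-based join that AXS's cross-match optimization constructs (and which a naive ADQL-to-SQL translation would have to reproduce by hand) looks roughly like the following sketch; the table names and radius are illustrative:

# Hypothetical plain-SQL equivalent of the optimized cross-match join;
# cat1 and cat2 are registered tables, r is the match radius in degrees.
r = 2.0 / 3600.0
matches = spark.sql(f"""
    SELECT *
    FROM cat1 c1 JOIN cat2 c2
      ON c1.zone = c2.zone
     AND c1.ra BETWEEN c2.ra - {r} AND c2.ra + {r}
     AND c1.dec BETWEEN c2.dec - {r} AND c2.dec + {r}
""")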

Documentation for build with newer versions of Spark

Hi all,
maybe this would be more appropriate on the axs-spark repo, but it's not possible to open issues there, so I'm posting here.
I would like to install AXS on a standalone cluster with a more recent version of Spark (2.4.5 or even 3.0-preview); is there documentation explaining how to prepare the distribution? I noticed that axs-spark has branches for Spark versions such as 2.3.0, 2.4.3, and 3.0-preview, but there is an AXS release only with Spark 2.4.0. I see that @stevenstetzler is testing an automatic pipeline to create an AXS distribution with Spark 3.0-preview; would it be possible to do it manually while that is not ready?
Thanks a lot,
Davide Viero

Issue with AXS optimization for catalogue cross-matches

This part of the code (below), used to optimise AXS cross-matches, will not work for regions close to the celestial poles: in the window around ra, the radius r (the cross-match radius) should be divided by cos(dec), otherwise the preselected region will be narrower than r along the right ascension axis, since the on-sky distance along right ascension is delta(ra)*cos(dec).

val join = if (useSMJOptim)
  df1.join(df2, df1("zone") === df2("zone") and (df1("ra") between(df2("ra") - r, df2("ra") + r)))
else
  df1.join(df2, df1("zone") === df2("zone") and (df1("ra") between(df2("ra") - r, df2("ra") + r)) and
    (df1("dec") between(df2("dec") - r, df2("dec") + r)))

The current implementation leads to a loss of cross-correlated sources at high latitudes if this optimisation is enabled.
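A sketch (in PySpark, not the actual Scala implementation) of the corrected condition described above, widening the RA half-width by 1/cos(dec) so the pre-selected box is not too narrow near the poles; df1, df2, and r are placeholders:

from pyspark.sql import functions as F

r = 2.0 / 3600.0  # cross-match radius in degrees
ra_halfwidth = r / F.cos(F.radians(df2["dec"]))  # widen the RA window by 1/cos(dec)

join = df1.join(
    df2,
    (df1["zone"] == df2["zone"])
    & df1["ra"].between(df2["ra"] - ra_halfwidth, df2["ra"] + ra_halfwidth)
    & df1["dec"].between(df2["dec"] - r, df2["dec"] + r),
)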

axsdist units are unclear

Ernesto Castillo asks a very good question: what are the units on axsdist? This seems to be computed by calcGnom in FrameFunctions.scala, but I can't follow what the final formula is producing.

Tracking issue for a new AXS release

Spark 3.0 has been out for a while, and that seems like a good chance to collect all the improvements we've had lying around, merge them, and make a new AXS release. I'm opening this as a sort of meta-issue for keeping track of what we think is most important to include. @mjuric @zecevicp @stevenstetzler, let me know in the comments what your thoughts are and I'll try to keep this post updated with a combined list and status. If there's no significant disagreement on these, I'll start fixing and merging them.

Small fixes/improvements:

  • Rebase axs-spark onto the Spark 3.0 release. I've done that for a pre-release version, so it should be quick to update for the final release. (https://github.com/astronomy-commons/axs-spark/tree/axs-3.0.0-preview)
  • #6 "save_axs_table() deletes files from path". We got bit by this again, need to remove the options that allow that to happen.
  • #19 "Cannot use local Derby database for Spark > 2.4.0". I think we just need to find the right place to set this default, iirc.
  • #12 "histogram2d doesn't exclude out-of-bounds entries". Already have code on a branch, just need to test and merge.
  • #24 "dec_to_zone should use AxsFrame zone height rather than the default value "
  • #25 "Warn if AxsFrame.crossmatch uses a radius larger than the zone height"
  • Add healpix_hist. Mario wrote this function and we've been copy-pasting it a lot; we should include it.

Larger improvements:

  • Automated builds with github actions. @stevenstetzler has PR #18 open to work on that; is it ready to go?
  • I believe there was still some variation in the crossmatch radius towards the poles (might have been an email chain instead of a github issue). There's also issue #22 on distance calculation. I remember doing some investigation on this, but I need to pull up the results.

I'm sure there are other things I've forgotten. We'll also have to pick a new version number 🙂

catalog.drop_table() does not appear to remove table files?

I'm trying to replace an existing AXS table with a new version.

I ran

catalog.drop_table('green19_stellar_params')
catalog.save_axs_table(sdf, 'green19_stellar_params', repartition=True, calculate_zone=True)

and got the following error:

AnalysisException: "Can not create the managed table('`green19_stellar_params`'). The associated location('file:/epyc/projects/lsd2/pzwarehouse/green19_stellar_params') already exists.;"

And indeed the files are there.

running drop_table again reports: 'Table or view not found: green19_stellar_params;'

(I manually removed the directory and continued.)
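One way to script the manual cleanup described above until drop_table() also removes the files (the path is specific to this example):

import shutil

catalog.drop_table('green19_stellar_params')
# drop_table() leaves the Parquet files behind, so clear the old location by hand
shutil.rmtree('/epyc/projects/lsd2/pzwarehouse/green19_stellar_params')
catalog.save_axs_table(sdf, 'green19_stellar_params', repartition=True, calculate_zone=True)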

Cannot use local Derby database for Spark > 2.4.0

@ctslater and I have been investigating an issue where it seems that more than one connection is being made by Spark to the Hive metastore when backed by a local Derby database. The Derby database only allows a single connection at a time, so a crash occurs when a second connection is attempted. On epyc, we use a shared MySQL database, so it seems we are blind to this issue between version changes. This bug originally appeared on our AWS JupyterHub where each user is using a local Derby database instead of a shared one.

The following is enough to reproduce the bug with Spark 3.0.0 and to work without the bug for Spark 2.4.0:

from axs import AxsCatalog, Constants

db = AxsCatalog(spark)
db.import_existing_table(
    "ztf",
    "/epyc/users/stevengs/ztf_oct19_small",
    num_buckets=500,
    zone_height=Constants.ONE_AMIN,
    import_into_spark=True
)

@ctslater to reproduce this on epyc, navigate to: /epyc/users/stevengs/spark-testing and do

source 2.4.0/env.sh
pyspark

and copy in the code above, which should work. Then do

source 3.0.0/env.sh
pyspark

and copy in the code above again, which should fail with a traceback like:

Caused by: ERROR XJ040: Failed to start database '/epyc/users/stevengs/axs/metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@158bc877, see the next exception for details.
	at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
	at org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown Source)
	... 115 more
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /data/epyc/users/stevengs/axs/metastore_db.
	at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
	at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
	at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown Source)
	at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown Source)
	at java.security.AccessController.doPrivileged(Native Method)
	at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockOnDB(Unknown Source)
	at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
	at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
	at org.apache.derby.impl.services.monitor.FileMonitor.startModule(Unknown Source)
	at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
	at org.apache.derby.impl.store.raw.RawStore$6.run(Unknown Source)
	at java.security.AccessController.doPrivileged(Native Method)
	at org.apache.derby.impl.store.raw.RawStore.bootServiceModule(Unknown Source)
	at org.apache.derby.impl.store.raw.RawStore.boot(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
	at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
	at org.apache.derby.impl.services.monitor.FileMonitor.startModule(Unknown Source)
	at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
	at org.apache.derby.impl.store.access.RAMAccessManager$5.run(Unknown Source)
	at java.security.AccessController.doPrivileged(Native Method)
	at org.apache.derby.impl.store.access.RAMAccessManager.bootServiceModule(Unknown Source)
	at org.apache.derby.impl.store.access.RAMAccessManager.boot(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
	at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
	at org.apache.derby.impl.services.monitor.FileMonitor.startModule(Unknown Source)
	at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
	at org.apache.derby.impl.db.BasicDatabase$5.run(Unknown Source)
	at java.security.AccessController.doPrivileged(Native Method)
	at org.apache.derby.impl.db.BasicDatabase.bootServiceModule(Unknown Source)
	at org.apache.derby.impl.db.BasicDatabase.bootStore(Unknown Source)
	at org.apache.derby.impl.db.BasicDatabase.boot(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
	at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.bootService(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.startProviderService(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.findProviderAndStartService(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.startPersistentService(Unknown Source)
	at org.apache.derby.iapi.services.monitor.Monitor.startPersistentService(Unknown Source)
	at org.apache.derby.impl.jdbc.EmbedConnection$4.run(Unknown Source)
	at org.apache.derby.impl.jdbc.EmbedConnection$4.run(Unknown Source)
	at java.security.AccessController.doPrivileged(Native Method)
	at org.apache.derby.impl.jdbc.EmbedConnection.startPersistentService(Unknown Source)
	... 112 more


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 6, in <module>
  File "/epyc/opt/spark-axs-3.0.0-beta/python/axs/catalog.py", line 83, in import_existing_table
    self.spark.catalog.createTable(table_name, path, "parquet")
  File "/epyc/opt/spark-axs-3.0.0-beta/python/pyspark/sql/catalog.py", line 162, in createTable
    df = self._jcatalog.createTable(tableName, source, options)
  File "/epyc/opt/spark-axs-3.0.0-beta/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in __call__
  File "/epyc/opt/spark-axs-3.0.0-beta/python/pyspark/sql/utils.py", line 102, in deco
    raise converted
pyspark.sql.utils.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;

The source files simply change SPARK_HOME, PATH, and SPARK_CONF_DIR to point to the right version of Spark on epyc, and also load special configurations in hive-site.xml that specify using a local Derby database instead of the database running on epyc.

The following is the hive-site.xml for 2.4.0

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:/epyc/users/stevengs/spark-testing/2.4.0/metastore_db;create=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  </property>
</configuration>

and the following is the hive-site.xml for 3.0.0

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:/epyc/users/stevengs/spark-testing/3.0.0/metastore_db;create=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  </property>
</configuration>

Metadata table misnamed

I think there is a bug in the method getSparkTableId() of CatalogUtils.java (commit f97d002). This method is trying to get the TBL_ID from the table HIVE_SCHEMA+"TBLS", but this table does not exist. The table created in the method createDbTableIfNotExists() and used in all other methods of that class is "AXSTABLES".

Thus, when trying to save a table with AxsCatalog.save_axs_table(), an exception is thrown:
Py4JJavaError: An error occurred while calling o46.saveNewTable.
: java.sql.SQLSyntaxErrorException: Table/View 'APP.TBLS' does not exist.
at org.apache.derby.client.am.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.client.am.SqlException.getSQLException(Unknown Source)
at org.apache.derby.client.am.ClientStatement.executeQuery(Unknown Source)
at org.dirac.axs.util.CatalogUtils.getSparkTableId(CatalogUtils.java:134)
at org.dirac.axs.util.CatalogUtils.saveNewTable(CatalogUtils.java:203)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: ERROR 42X05: Table/View 'APP.TBLS' does not exist.
at org.apache.derby.client.am.ClientStatement.completeSqlca(Unknown Source)
at org.apache.derby.client.net.NetStatementReply.parsePrepareError(Unknown Source)
at org.apache.derby.client.net.NetStatementReply.parsePRPSQLSTTreply(Unknown Source)
at org.apache.derby.client.net.NetStatementReply.readPrepareDescribeOutput(Unknown Source)
at org.apache.derby.client.net.StatementReply.readPrepareDescribeOutput(Unknown Source)
at org.apache.derby.client.net.NetStatement.readPrepareDescribeOutput_(Unknown Source)
at org.apache.derby.client.am.ClientStatement.readPrepareDescribeOutput(Unknown Source)
at org.apache.derby.client.am.ClientStatement.flowExecute(Unknown Source)
at org.apache.derby.client.am.ClientStatement.executeQueryX(Unknown Source)
... 14 more

(A Derby database had been set up as the Hive metastore.)

Crossmatch returns weird results when RA and Dec are strings

Ernesto Castillo found a really confusing behavior where some data he loaded from CSV was returning strange results when self-crossmatched. It turned out that the RA and Dec columns were being loaded as strings, rather than doubles, and for some reason the math in crossmatch managed to return non-zero results.

We should test for RA/Dec data types and raise an error, rather than letting these through.
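A minimal sketch of the kind of guard suggested above (the function name and column defaults are made up for illustration):

from pyspark.sql.types import DoubleType, FloatType

def check_radec_types(df, ra_col='ra', dec_col='dec'):
    """Raise if the RA/Dec columns are not floating point, instead of silently cross-matching strings."""
    for col in (ra_col, dec_col):
        dtype = df.schema[col].dataType
        if not isinstance(dtype, (DoubleType, FloatType)):
            raise TypeError("Column '%s' has type %s; expected double or float"
                            % (col, dtype.simpleString()))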

distance calculation in AXS

We have found an issue with the distance computed by calcGnom for catalogue cross-correlation in AXS :

def calcGnom = (r1: Double, r2: Double, d1: Double, d2: Double) => {

For very small angles this is approximately sqrt( delta(ra)^2 + delta(dec)^2 ), while in the equirectangular approximation it should be sqrt( (delta(ra) * cos((dec1+dec2)/2))^2 + delta(dec)^2 ).

The missing cos(dec) factor yields an overestimated distance away from the equator.

My impression is that the Gnomonic projection is not suitable in astronomy and is more related to geophysics.

This leads to an actual loss of cross-correlated sources at high latitudes.
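A small numerical illustration of the difference (pure NumPy, not the AXS code): two sources separated by 2 arcsec on the sky at dec = 60 deg.

import numpy as np

def dist_no_cos(ra1, dec1, ra2, dec2):
    # small-angle distance without the cos(dec) factor (what the issue describes)
    return np.hypot(ra1 - ra2, dec1 - dec2)

def dist_equirect(ra1, dec1, ra2, dec2):
    # equirectangular approximation with the cos(dec) factor
    cosd = np.cos(np.radians(0.5 * (dec1 + dec2)))
    return np.hypot((ra1 - ra2) * cosd, dec1 - dec2)

dec = 60.0
ra1, ra2 = 10.0, 10.0 + (2.0 / 3600.0) / np.cos(np.radians(dec))  # 2" apart on the sky
print(dist_no_cos(ra1, dec, ra2, dec) * 3600)    # ~4"  -- overestimated
print(dist_equirect(ra1, dec, ra2, dec) * 3600)  # ~2"  -- the true on-sky separation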

Cone search slower than similar region query.

@ebellm reports that the cone search is much slower (~100x) than a region search for 10"-sized areas. I'm guessing this is due to the pandas_udf involved in the cone search; maybe this distance function can be moved into Scala to speed up the search?

save_axs_table() deletes files from path

I was using save_axs_table() to save the new ZTF data and used the path= argument to put it on the new disk drives. Spark then deleted everything from the path I gave it, including a bunch of unrelated directories and all the ZTF data I was trying to import.

Explore S3 caching options

We've talked about a use case where archives decide to keep datasets internally but put up an S3 API facade for remote access with AXS. E.g., imagine the data is physically at IPAC and MAST, but being analyzed at TACC. The question then is whether accesses to the datasets can be transparently cached where AXS is running, for faster repeated access.

Option 1: Spark seems to have recently added support for caching of remote datasets through the Delta cache. It's not clear to me whether this is broadly available or a Databricks-only thing; this should be the thing to investigate first.

Option 2: Another way to do this may be to have AXS access the files through a caching layer. I looked at S3 caching options, and found there are many. Example:

(and see the list of more projects at the bottom of s3fs-fuse README).

Opening this issue so we don't forget about this use case.

(@dennyglee, @zecevicp, any thoughts/ideas/comments?)

Add conda build recipe

Make it possible to build AXS with Conda.

This came up in the discussion with the archives as a feature needed to ease evaluation and spur end-user adoption.

py4j.Py4JException: Method saveNewTable(...) does not exist

Hi,
trying to call save_axs_table, I get the error in the subject line. Below are the details that might help resolve this. Is there something obvious I'm overlooking, or is this a bug?

[datalab@gp05 ~/axs]$ java -version
openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
[datalab@gp05 ~/axs]$ echo $JAVA_HOME
/usr/lib/jvm/jre-openjdk
[datalab@gp05 ~/axs]$ echo $SPARK_HOME
/home/datalab/axs
[datalab@gp05 ~/axs]$ echo $PATH
/usr/lib/jvm/jre-openjdk:/home/datalab/axs/bin:/usr/lib64/qt-3.3/bin:/data0/sw/anaconda2/bin:/data0/sw/anaconda3/bin:/usr/local/bin:/bin:/usr/bin
[datalab@gp05 ~/axs]$ pyspark
Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
19/06/26 14:16:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/06/26 14:16:06 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-SNAPSHOT
      /_/

Using Python version 3.6.8 (default, Dec 30 2018 01:22:34)
SparkSession available as 'spark'.
>>> from axs import AxsCatalog, Constants
>>> db = AxsCatalog(spark)
>>> spark.catalog.currentDatabase()
'default'
>>> dat = spark.read.csv(header=True,path='/gaia_source/csv/GaiaSource_1703858022185355904_1704227084430340864.csv')
>>> db.save_axs_table(dat, 'test4', repartition=True, calculate_zone=True, num_buckets=Constants.NUM_BUCKETS, zone_height=500)
19/06/26 14:17:11 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
19/06/26 14:17:18 WARN HiveExternalCatalog: Persisting bucketed data source table `default`.`test4` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/datalab/axs/python/axs/catalog.py", line 181, in save_axs_table
    False, None)
  File "/home/datalab/axs/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/home/datalab/axs/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/home/datalab/axs/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 332, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o38.saveNewTable. Trace:
py4j.Py4JException: Method saveNewTable([class java.lang.String, class java.lang.Integer, class java.lang.Integer, class java.lang.String, class java.lang.String, class java.lang.String, class java.lang.Boolean, null]) does not exist
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
        at py4j.Gateway.invoke(Gateway.java:274)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)

Exception when running crossmatch with Spark 3.0.0 version

I've installed at IPAC the automated build version that @stevenstetzler mentioned in #26:

Yeah, the automated builds seem to work still. I recently used it to make a new Spark 3 release with an updated version of Scala (I was running into similar issues as @stargaser using Spark 3 with AXS until I updated Scala).

When I run crossmatch, an exception is raised that I think is the same as the one I encountered when I built axs and axs-spark myself from the instructions in #20. Does this mean I need to upgrade Scala, and if so, what version should I have?

The informational part of the exception:

Py4JJavaError: An error occurred while calling o234.crossmatch.
: java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
	at org.dirac.axs.FrameFunctions$.crossmatch(FrameFunctions.scala:42)
	at org.dirac.axs.FrameFunctions.crossmatch(FrameFunctions.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:745)

region query fails with `spans_prime_mer=True`

The command

ztf_mar19.region(ra1=340.0, ra2=1.0, dec1=10.0, dec2=20.0, spans_prime_mer=True).count()

Results in

ERROR: Py4JError: An error occurred while calling o1569.or. Trace:
py4j.Py4JException: Method or([class java.lang.Double]) does not exist
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
	at py4j.Gateway.invoke(Gateway.java:274)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

 [py4j.protocol]

---------------------------------------------------------------------------
Py4JError                                 Traceback (most recent call last)
<ipython-input-88-79eee7240433> in <module>()
----> 1 ztf_mar19.region(ra1=340.0, ra2=1.0, dec1=10.0, dec2=20.0, spans_prime_mer=True).count()

/epyc/opt/spark-axs/python/axs/axsframe.py in region(self, ra1, dec1, ra2, dec2, spans_prime_mer)
    235             return wrap(self._df.where(self._df.zone.between(zone1, zone2) &
    236                                        (self._df.ra >= ra1 |
--> 237                                         self._df.ra <= ra2) &
    238                                        self._df.dec.between(dec1, dec2)), self._table_info)
    239         else:

/epyc/opt/spark-axs/python/pyspark/sql/column.py in _(self, other)
    113     def _(self, other):
    114         jc = other._jc if isinstance(other, Column) else other
--> 115         njc = getattr(self._jc, name)(jc)
    116         return Column(njc)
    117     _.__doc__ = doc

/epyc/opt/spark-axs/python/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/epyc/opt/spark-axs/python/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/epyc/opt/spark-axs/python/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    330                 raise Py4JError(
    331                     "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
--> 332                     format(target_id, ".", name, value))
    333         else:
    334             raise Py4JError(

Py4JError: An error occurred while calling o1569.or. Trace:
py4j.Py4JException: Method or([class java.lang.Double]) does not exist
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
	at py4j.Gateway.invoke(Gateway.java:274)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

It runs fine if spans_prime_mer=False.

After discussing with @ctslater, it seems likely that this block

235             return wrap(self._df.where(self._df.zone.between(zone1, zone2) &
    236                                        (self._df.ra >= ra1 |
--> 237                                         self._df.ra <= ra2) &
    238                                        self._df.dec.between(dec1, dec2)), self._table_info)
    239         else:

needs parentheses around self._df.ra >= ra1 and self._df.ra <= ra2 so that PySpark resolves the precedence of the comparisons correctly (in Python, & and | bind more tightly than the comparison operators).
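With those parentheses added, the condition would read (a sketch of the fix, not yet merged):

return wrap(self._df.where(self._df.zone.between(zone1, zone2) &
                           ((self._df.ra >= ra1) |
                            (self._df.ra <= ra2)) &
                           self._df.dec.between(dec1, dec2)), self._table_info)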

add header / metadata for user to describe how a table was made

For an AXS catalog loaded as
axs_catalog = AxsCatalog(spark)
table = AxsCatalog(spark).load('table_name')

all the existing info methods, e.g. .get_table_info(), .describe(), .explain(), provide automatically generated information. Now that more and more tables are being made, crossmatched, etc., it would be really nice to be able to attach user-supplied info that explains how a table was created from a user's perspective. E.g., when making the table from a crossmatch,

crossmatch = ztf_lc.crossmatch(sdss_lc, r=2*Constants.ONE_ASEC).save_axs_table(fname, info)

with info being a string that a user can define (e.g. table.info = 'ztf DR1 LCs within S82, nobs > 1, all filters, crossmatch to S82 Quasars within 2 asec'), accessible via .info(). It is up to users to keep it going, but I would definitely do that for all tables I make, to avoid asking on Slack all the time, e.g., who made gaia_500b_28e_10800z and what it contains.

Add version info

It seems that versions on the different machines are diverging (release vs epyc, etc). It would be very useful to have an axs.__version__ or similar to easily see the status.
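A minimal sketch of what this could look like (the version string and mechanism are placeholders; the actual choice is up to the maintainers):

# in axs/__init__.py
__version__ = '1.0.0'   # placeholder; ideally derived from the release/build metadata

# then, from a user session:
import axs
print(axs.__version__)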
