seismicdata / pyasdf
Python Interface to ASDF based on ObsPy
Home Page: http://seismicdata.github.io/pyasdf/
License: BSD 3-Clause "New" or "Revised" License
Hi,
We recently ran into this issue with OpenMPI (gcc) on an Ubuntu system.
In asdf_data_set.py, lines 1409-1411:
if os.path.exists(output_filename):
    msg = "Output file '%s' already exists." % output_filename
    raise ValueError(msg)
The check runs on every processor, but the processors do not all reach this point at the same time. If one processor runs ahead and has already created the file, the other processors will raise an error.
Somehow it works with OpenMPI (cray-gcc) but not with OpenMPI (gcc) on Ubuntu (Norbert found this problem while trying our workflow). I think there might be some implementation difference between the two.
My suggestion would be either of the following:
Maybe in the near future we need to do some performance profiling.
I did some simple tests on Rhea; here is a table of running times when processing a single file (~200 MB) using ASDFDataSet.process():
NPROC | Running Time (s) |
---|---|
16 | 94.88 |
32 | 70.15 |
48 | 70.62 |
64 | 73.13 |
The running times are taken from the asdf terminal output, like:
Master done, shutting down workers...
Worker 4 shutting down...
Worker 3 shutting down...
Worker 7 shutting down...
Worker 11 shutting down...
Worker 9 shutting down...
Worker 5 shutting down...
Worker 1 shutting down...
Worker 2 shutting down...
Worker 10 shutting down...
Worker 8 shutting down...
Worker 6 shutting down...
Worker 13 shutting down...
Worker 12 shutting down...
Worker 14 shutting down...
Worker 15 shutting down...
Jobs (running 94.88 seconds): queued: 0 | finished: 207 | total: 207
Worker 1: 0 active, 7 completed jobs
Worker 2: 0 active, 17 completed jobs
Worker 3: 0 active, 14 completed jobs
Worker 4: 0 active, 6 completed jobs
Worker 5: 0 active, 6 completed jobs
Worker 6: 0 active, 19 completed jobs
Worker 7: 0 active, 14 completed jobs
Worker 8: 0 active, 19 completed jobs
Worker 9: 0 active, 20 completed jobs
Worker 10: 0 active, 25 completed jobs
Worker 11: 0 active, 18 completed jobs
Worker 12: 0 active, 16 completed jobs
Worker 13: 0 active, 17 completed jobs
Worker 14: 0 active, 5 completed jobs
Worker 15: 0 active, 4 completed jobs
I think it spends quite a lot of time on writing.
Hi Lion,
When I am using pyasdf to process observed data (with MPI), I found that the asdf_data_set.process() function does not return correctly. To be more specific, every processor runs through the process() function but cannot return to the upper level.
I did some research and found the bug is introduced in the commit: c75d65b
The commit before this is fine.
We have something set up on Bitbucket but it is private for now. @mpbl Is it OK to give Lion permission to the repository (source_inversion_wf) so he can run our example test case?
We have everything set up there (the Bitbucket repo). So after downloading it, you will be able to run some tests and you will see the bugs.
Hi Lion,
I have a friend asking me about ASDF. But unfortunately, he wants NetCDF...
I think HDF5 is generally considered better than NetCDF, but do we have plans to support NetCDF in the future?
Hi Lion,
I think we have reached the point where we need parallel output of adjoint sources for the preprocessing workflow.
Is it very difficult to write such a function for the adjoint source writer? I am thinking about looking into this issue this weekend. Are you comfortable with me making some attempts?
Some of our tools work on sac files, others work on ASDF files, so I'd like to be able to convert forward and backward without any loss of metadata.
pyasdf/scripts/sac2asdf.py does not preserve stats.sac dictionaries, so I wrote a new version that does: https://github.com/rmodrak/asdf_converters/blob/master/asdf_converters/sac2asdf.py
This new script constructs a dictionary {original_sac_filename: trace.stats.sac for trace in stream}, dumps it into a dill virtual file, and saves that as ASDF auxiliary data.
What remains is to devise a mapping between original_sac_filenames <--> ASDF waveforms. I see no easy way of doing this with the existing ASDF API. ASDFDataSet.get_waveforms comes close, but I think there ought to be a simpler way.
Am I missing something? Any advice would be greatly appreciated.
Can't install pyasdf into home directory.
% pip install -v .
ERROR: (Gentoo) Please run pip with the --user option to avoid breaking python-exec
Exception information:
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/pip/_internal/cli/base_command.py", line 153, in _main
status = self.run(options, args)
File "/usr/lib/python3.7/site-packages/pip/_internal/commands/install.py", line 289, in run
raise CommandError("(Gentoo) Please run pip with the --user option to avoid breaking python-exec")
pip._internal.exceptions.CommandError: (Gentoo) Please run pip with the --user option to avoid breaking python-exec
% pip install -v . --user
Created temporary directory: /tmp/pip-ephem-wheel-cache-4gbfsomu
Created temporary directory: /tmp/pip-req-tracker-4joucdkx
Created requirements tracker '/tmp/pip-req-tracker-4joucdkx'
Created temporary directory: /tmp/pip-install-qxlbbpdr
Processing pyasdf
Created temporary directory: /tmp/pip-req-build-esqgclk9
Added file:///.../pyasdf to build tracker '/tmp/pip-req-tracker-4joucdkx'
Created temporary directory: /tmp/pip-build-env-_bf99xeq
Running command /usr/bin/python3.7 /usr/lib/python3.7/site-packages/pip install --ignore-installed --no-user --prefix /tmp/pip-build-env-_bf99xeq/overlay --no-warn-script-location -v --no-binary :none: --only-binary :none: -i https://pypi.org/simple -- 'setuptools>=40.8.0' wheel
ERROR: (Gentoo) Please run pip with the --user option to avoid breaking python-exec
Exception information:
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/pip/_internal/cli/base_command.py", line 153, in _main
status = self.run(options, args)
File "/usr/lib/python3.7/site-packages/pip/_internal/commands/install.py", line 289, in run
raise CommandError("(Gentoo) Please run pip with the --user option to avoid breaking python-exec")
pip._internal.exceptions.CommandError: (Gentoo) Please run pip with the --user option to avoid breaking python-exec
Installing build dependencies ... error
Cleaning up...
Removing source in /tmp/pip-req-build-esqgclk9
Removed file:///.../pyasdf from build tracker '/tmp/pip-req-tracker-4joucdkx'
Removed build tracker '/tmp/pip-req-tracker-4joucdkx'
ERROR: Command errored out with exit status 1: /usr/bin/python3.7 /usr/lib/python3.7/site-packages/pip install --ignore-installed --no-user --prefix /tmp/pip-build-env-_bf99xeq/overlay --no-warn-script-location -v --no-binary :none: --only-binary :none: -i https://pypi.org/simple -- 'setuptools>=40.8.0' wheel Check the logs for full command output.
Exception information:
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/pip/_internal/cli/base_command.py", line 153, in _main
status = self.run(options, args)
File "/usr/lib/python3.7/site-packages/pip/_internal/commands/install.py", line 385, in run
resolver.resolve(requirement_set)
File "/usr/lib/python3.7/site-packages/pip/_internal/legacy_resolve.py", line 201, in resolve
self._resolve_one(requirement_set, req)
File "/usr/lib/python3.7/site-packages/pip/_internal/legacy_resolve.py", line 365, in _resolve_one
abstract_dist = self._get_abstract_dist_for(req_to_install)
File "/usr/lib/python3.7/site-packages/pip/_internal/legacy_resolve.py", line 313, in _get_abstract_dist_for
req, self.session, self.finder, self.require_hashes
File "/usr/lib/python3.7/site-packages/pip/_internal/operations/prepare.py", line 224, in prepare_linked_requirement
req, self.req_tracker, finder, self.build_isolation,
"Installing build dependencies"
File "/usr/lib/python3.7/site-packages/pip/_internal/build_env.py", line 201, in install_requirements
call_subprocess(args, spinner=spinner)
File "/usr/lib/python3.7/site-packages/pip/_internal/utils/subprocess.py", line 242, in call_subprocess
raise InstallationError(exc_msg)
pip._internal.exceptions.InstallationError: Command errored out with exit status 1: /usr/bin/python3.7 /usr/lib/python3.7/site-packages/pip install --ignore-installed --no-user --prefix /tmp/pip-build-env-_bf99xeq/overlay --no-warn-script-location -v --no-binary :none: --only-binary :none: -i https://pypi.org/simple -- 'setuptools>=40.8.0' wheel Check the logs for full command output.
The latest version listed as working with Python 3.6 (0.5.1) works, but the latest version, which requires Python 3.7, doesn't.
machine 1 (ok)
setuptools 44.0
python 3.6
pyasdf 0.5.1
machine 2 (not ok)
setuptools 44.1
python 3.7
pyasdf 0.6.1 or git
We're starting to use ASDF and seis-prov for some strong-motion data-processing code we're writing. The seis-prov documentation (see here: http://seismicdata.github.io/SEIS-PROV/_generated_details.html#seis-prov-ids) states that the third element of a seis-prov id is:
"A 7 to 12 letter lowercase alphanumeric hash to ensure uniqueness of ids."
What is the intended use of this hash, and what is the domain where the uniqueness is being considered? Is it just within the document, or across a span of them?
mpi4py.mpi_c has changed to mpi4py.libmpi, which causes the incompatibility. The current solution is to switch to an old version of mpi4py (1.3.x) or make changes in h5py. See the discussion at rdhyee/h5py@f2e3f13#diff-c9e1291e5e01f0f5e0a484dccaf67cf6L28
Hi Lion,
I am trying to run data processing using the process() function in an MPI environment.
In the following case, I have 10 ASDF files to process and I am using 4 processors. I go through them one by one and delete each dataset when I am done with it. I get the following errors after the program has run for a certain time (for example, after processing 4 files, while it is processing the 5th):
====================
Dir info:
input asdf: /lustre/atlas/proj-shared/geo111/Wenjie/DATA_SI/ASDF_new/raw/synt/C200502151442A.Mpp.synthetic.h5
output asdf: /lustre/atlas/proj-shared/geo111/Wenjie/DATA_SI/ASDF_new/proc/synt/C200502151442A.Mpp.proc_synt_50_100.h5
Processing info:
filter band: [40.0, 50.0, 100.0, 150.0]
tag map: {"synthetic" ==> "proc_synt_50_100"}
interp relative start and end(to eventcentroid time): (0.00 , 6000.00)
interp deltat: 0.5
output_asdf exists and removed:/lustre/atlas/proj-shared/geo111/Wenjie/DATA_SI/ASDF_new/proc/synt/C200502151442A.Mpp.proc_synt_50_100.h5
Launching processing using MPI on 4 processors.
Jobs (running 2.17 seconds): queued: 142 | finished: 0 | total: 161
Worker 1: 9 active, 0 completed jobs
Worker 2: 9 active, 0 completed jobs
Worker 3: 1 active, 0 completed jobs
Jobs (running 4.36 seconds): queued: 123 | finished: 0 | total: 161
Worker 1: 19 active, 0 completed jobs
Worker 2: 18 active, 0 completed jobs
Worker 3: 1 active, 0 completed jobs
Jobs (running 6.51 seconds): queued: 105 | finished: 0 | total: 161
Worker 1: 27 active, 0 completed jobs
Worker 2: 28 active, 0 completed jobs
Worker 3: 1 active, 0 completed jobs
Jobs (running 8.67 seconds): queued: 90 | finished: 0 | total: 161
Worker 1: 35 active, 0 completed jobs
Worker 2: 35 active, 0 completed jobs
Worker 3: 1 active, 0 completed jobs
Jobs (running 10.81 seconds): queued: 76 | finished: 0 | total: 161
Worker 1: 42 active, 0 completed jobs
Worker 2: 42 active, 0 completed jobs
Worker 3: 1 active, 0 completed jobs
Jobs (running 12.87 seconds): queued: 62 | finished: 0 | total: 161
Worker 1: 49 active, 0 completed jobs
Worker 2: 49 active, 0 completed jobs
Worker 3: 1 active, 0 completed jobs
Jobs (running 15.02 seconds): queued: 45 | finished: 0 | total: 161
Worker 1: 57 active, 0 completed jobs
Worker 2: 58 active, 0 completed jobs
Worker 3: 1 active, 0 completed jobs
Jobs (running 17.09 seconds): queued: 29 | finished: 0 | total: 161
Worker 1: 66 active, 0 completed jobs
Worker 2: 65 active, 0 completed jobs
Worker 3: 1 active, 0 completed jobs
Jobs (running 19.13 seconds): queued: 13 | finished: 0 | total: 161
Worker 1: 74 active, 0 completed jobs
Worker 2: 73 active, 0 completed jobs
Worker 3: 1 active, 0 completed jobs
Jobs (running 21.90 seconds): queued: 0 | finished: 0 | total: 161
Worker 1: 80 active, 0 completed jobs
Worker 2: 80 active, 0 completed jobs
Worker 3: 1 active, 0 completed jobs
It seems the "Worker 3" is not reponding. So the program just hangs up. I am not quite sure what trigers that.
It might be some memory issues here since if I am using 16 cores, the problem never shows up. I am not quite sure. I will dive into the code and see if I can find the bug.
Or maybe it is not good to run several dataset processing in one mpirun(maybe the python is not good at this kind of memory management). So in the future, it will be safer to process one file at one mpiexec call.
Hi Lion,
I am collaborating with Yanhua and another seismologist working on China tomography.
One problem that prevents us from using pyasdf is that they don't have StationXML files. (Sorry to mention it, but the China data is not available online...) They only have pole-and-zero files.
Actually, I think a lot of seismologists have this issue right now; not all of us are using StationXML. So my question is:
To attract those "old" users, should we consider adding the capability to add PZ or RESP files?
Why are tag path names validated against r"^[a-zA-Z0-9][a-zA-Z0-9_]*[a-zA-Z0-9]$"? This is more restrictive than the naming scheme used in /Waveforms. For example, /AuxiliaryData/NET.STA/MYDATASET_NAME would not be allowed due to the . in NET.STA. It would be nice to allow consistency in the layout between /Waveforms and /AuxiliaryData.
Our use case is that we are processing event-based ground motion records and want to add datasets like /AuxiliaryData/WaveformMetrics/NET.STA/NET.STA.CHAN_EVENTID_LABEL, where EVENTID_LABEL is the "tag". We would like to keep as much consistency in the naming scheme as possible between the waveforms and the ground-motion metrics.
ASDFDataSet.WaveformAccessor uses a station/network/channel naming convention, but not all datasets include this type of information. For example, obspy.read leaves these attributes empty when reading certain formats, such as SeismicUnix. Any suggested workarounds? Would you suggest manually setting the 'station' attribute of each trace to, say, the trace number index?
Any suggestions would be greatly appreciated. Thanks, Ryan
Hello,
I get an error when trying to add a dictionary containing None values in parameters with add_auxiliary_data. Everything works as expected if the dictionary does not contain None values.
Am I missing something? Is it a bug?
Here's a transcript of a sample session:
$ python -c "import pyasdf; pyasdf.print_sys_info()"
pyasdf version 0.4.x
===============================================================================
CPython 2.7.9, compiler: GCC 4.9.2
Linux 4.9.0-0.bpo.3-amd64 64bit
Machine: x86_64, Processor: with 24 cores
===============================================================================
HDF5 version 1.10.1, h5py version: 2.7.1
MPI: MPICH, version: 3.1.0, mpi4py version: 3.0.0
Parallel I/O support: True
Problematic multiprocessing: False
===============================================================================
Other_modules:
dill: 0.2.6
lxml: 3.4.0
numpy: 1.8.2
obspy: 1.1.0
prov: 1.5.0
scipy: 0.14.0
$ asdf-validate UZ.A3150..HN.UZ-1984-0011.h5
WARNING: No QuakeML found in the file.
Valid ASDF File!
$ ipython
In [1]: import pyasdf
In [2]: import numpy as np
In [3]: data = np.random.random(100)
In [4]: data_type = "RandomArrays"
In [5]: path = "example_array"
In [6]: ds = pyasdf.ASDFDataSet('UZ.A3150..HN.UZ-1984-0011.h5')
In [7]: a= {u'user6': 'foo', u'user7': '', u'user4': None}
In [8]: ds.add_auxiliary_data(data=data, data_type=data_type, path=path, parameters=a)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
[...]
TypeError: Object dtype dtype('O') has no native HDF5 equivalent
In [9]:
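A possible client-side workaround (my sketch, not pyasdf behavior): drop or substitute the None-valued entries before the call, since h5py cannot map Python None to a native HDF5 type:

```python
# The parameters dict from the session above; "user4" is None and
# triggers the "no native HDF5 equivalent" TypeError.
params = {u"user6": "foo", u"user7": "", u"user4": None}

# Either drop None-valued keys entirely...
clean = {k: v for k, v in params.items() if v is not None}

# ...or substitute a sentinel value if the key must be preserved.
filled = {k: (v if v is not None else "") for k, v in params.items()}

# ds.add_auxiliary_data(data=data, data_type=data_type, path=path,
#                       parameters=clean)
```

Whether dropping or substituting is appropriate depends on whether downstream readers need to distinguish "absent" from "empty".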
Hi Lion,
I am currently using this to process ASDF files, and I found that the output information is limited. For example, in the following output the interpolation method raises a ValueError, but I don't know which stream (or trace) it came from. Knowing the stream.id would be very helpful, because then I could go into the HDF5 file and see what is wrong with that trace.
Do you have specific reasons for keeping the terminal output this simple?
Terminal Output:
--------------------------------------------------------------------------
Processing!!! 1445269490.98
Processing!!! 1445269490.98
Processing!!! 1445269490.98
Processing!!! 1445269490.99
Process Process-2:
Traceback (most recent call last):
File "/ccs/home/lei/anaconda/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/autofs/nccs-svm1_home1/lei/software/pyasdf/pyasdf/asdf_data_set.py", line 1680, in run
output_stream = self.processing_function(stream, inv)
File "/autofs/nccs-svm1_home1/lei/test/git_merge/source_inversion_wf/src/preproc_asdf/proc_util.py", line 112, in process_function
cut_func(st, starttime, endtime)
File "/autofs/nccs-svm1_home1/lei/test/git_merge/source_inversion_wf/src/preproc_asdf/proc_util.py", line 51, in cut_func
tr_list.append(flex_cut_trace(tr, t1, t2))
File "/autofs/nccs-svm1_home1/lei/test/git_merge/source_inversion_wf/src/preproc_asdf/proc_util.py", line 39, in flex_cut_trace
return tr.slice(cut_starttime, cut_endtime)
File "/ccs/home/lei/anaconda/lib/python2.7/site-packages/obspy/core/trace.py", line 1114, in slice
tr.trim(starttime=starttime, endtime=endtime)
File "/ccs/home/lei/anaconda/lib/python2.7/site-packages/obspy/core/trace.py", line 231, in new_func
result = func(*args, **kwargs)
File "/ccs/home/lei/anaconda/lib/python2.7/site-packages/obspy/core/trace.py", line 1075, in trim
raise ValueError("startime is larger than endtime")
Hi Lion,
I don't know if you have noticed it or not.
When using mpi4py.MPI, if an exception (or error) is raised, the program will just hang instead of quitting.
I found we probably need to replace the raise SomeError() calls with something like:
try:
    ...  # do something here
except Exception as exp:
    print(exp)
    MPI.COMM_WORLD.Abort()
Probably this is not necessary inside the pyasdf code itself. But when we use the pyasdf module and write our own code, it is better to have this in place to prevent burning unnecessary CPU hours.
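The pattern above can be packaged as a small wrapper. A sketch under the assumption of an mpi4py-style communicator; the wrapper name is made up, and `comm` would be `MPI.COMM_WORLD` in practice:

```python
def run_abort_on_error(comm, func, *args, **kwargs):
    # Run the worker body; on any exception, report it and abort the
    # whole communicator so the other ranks do not hang forever waiting
    # at a collective operation or message exchange.
    try:
        return func(*args, **kwargs)
    except Exception as exp:
        print("Rank %d failed: %s" % (comm.rank, exp))
        comm.Abort(1)
```

Calling `comm.Abort` is deliberately blunt: it kills every rank, which is exactly what you want on a batch system charging CPU hours.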
Reading gappy data incurs a significant performance penalty compared to non-gappy data.
Attached are a read timing test and test data. The test script times the read of the specified miniSEED using, for reference, obspy.read() followed by pyasdf's add_waveforms(). The test data are a day-long record in both gappy (2200+ gaps) and non-gapped versions.
On my machine:
$ ./read-timing-test.py -o output.h5 clean-day.mseed gappy-day.mseed
Opening output ASDF volume: output.h5
Processing clean-day.mseed
ObsPy read(): 0.05147713300000012 seconds
ASDF add_waveforms(): 0.1556968969999999 seconds
Processing gappy-day.mseed
ObsPy read(): 0.49582375 seconds
ASDF add_waveforms(): 7.62076154 seconds
The add_waveforms() method, at 7.6 seconds, is more than an order of magnitude slower than an obspy.read() of the same data at 0.49 seconds.
Obviously it would be nice if this were faster. As ASDF gains popularity it will be used with a likewise-broadening set of input data.
Hello,
Has there been any improvement in the compatibility of pyasdf's parallel I/O mode with Python 3?
Hi Lion,
We are now integrating the whole workflow. For example, we want to combine signal processing (for both observed and synthetic data) and window selection. During the processing, we don't want any I/O. To be more specific, we want the workflow to read in the raw data (one observed and one synthetic ASDF file) at the very beginning, process it, and select windows. After window selection, only the windows are written out.
Currently, I am using the "process" function in asdf_data_set.py. However, there is an argument called output_filename:
def process(self, process_function, output_filename, tag_map,
            traceback_limit=3, **kwargs):
meaning the current implementation requires writing the processed files out. However, if possible, my preferred way would be to keep everything in memory. I guess that may even go against the basic design of ASDF, right? Because when you initialize the "read", it does not actually read the whole file into memory, so there is no such thing as "keeping everything in memory".
Another option would be to modify the process function so that it takes one observed and one synthetic dataset and walks all the way down to window selection. I think that might be the right and feasible way.
If my words are confusing here, let me illustrate a little more:
For example, you have two files, one raw observed ASDF and one raw synthetic ASDF. You want to process them and select windows. There are two ways of doing so:
Sorry to bring this up so late. I looked through the code and found that it might involve a lot of changes.
Hi Lion,
Did you implement the MPI tests running on Travis?
We are setting up Jenkins on the Princeton cluster so we will be able to do some MPI testing in the future. Let me know how you are solving this problem right now.
Also, another question regarding obspy... I found that when using conda install -c obspy obspy, the default version is 1.0.0. It has a lot of API updates (which make part of my code fail with the new version). So should I pin my obspy version to 0.10.2, or should I use the new version?
When I add a station XML with ds.add_stationxml("multi-date-ranges.xml") (see the attached XML file, OA.CE22_station_inv_modified_xml.txt) and then extract the station XML as shown below, there are two issues:
Can you help by commenting on these issues? Thank you!
I'm able to use the parallel process() function on medium-sized files without problems (on Summit), but when I tried to process a 112 GB file with 32,925 traces, the program always gets stuck. I've tried different node configurations (1 node 6 CPUs, 1 node 42 CPUs, 2 nodes 84 CPUs, etc.) but none of them worked. I was wondering if this is related to #32, and is there any workaround for this problem?
Below is the full debug info.
Launching processing using MPI on 6 processors.
MASTER received from WORKER 2 [WORKER_REQUESTS_ITEM] -- None
WORKER 2 sent to MASTER [WORKER_REQUESTS_ITEM] -- None
WORKER 1 sent to MASTER [WORKER_REQUESTS_ITEM] -- None
WORKER 5 sent to MASTER [WORKER_REQUESTS_ITEM] -- None
WORKER 4 sent to MASTER [WORKER_REQUESTS_ITEM] -- None
WORKER 3 sent to MASTER [WORKER_REQUESTS_ITEM] -- None
MASTER sent to WORKER 2 [MASTER_SENDS_ITEM] -- ('1A.CORRE', 'synthetic')
WORKER 2 received from MASTER [MASTER_SENDS_ITEM] -- ('1A.CORRE', 'synthetic')
MASTER received from WORKER 3 [WORKER_REQUESTS_ITEM] -- None
MASTER sent to WORKER 3 [MASTER_SENDS_ITEM] -- ('1A.NE00', 'synthetic')
MASTER received from WORKER 4 [WORKER_REQUESTS_ITEM] -- None
MASTER sent to WORKER 4 [MASTER_SENDS_ITEM] -- ('1A.NE01', 'synthetic')
WORKER 3 received from MASTER [MASTER_SENDS_ITEM] -- ('1A.NE00', 'synthetic')
RT Inventory created at 1995-06-29T12:22:32.070000Z
Created by: SPECFEM3D_GLOBE/asdf-library
http://seismic-data.org
Sending institution: SPECFEM3D_GLOBE
Contains:
Networks (1):
1A
Stations (1):
1A.CORRE (N/A)
Channels (3):
1A.CORRE.S3.MXZ, 1A.CORRE.S3.MXN, 1A.CORRE.S3.MXE
WORKER 4 received from MASTER [MASTER_SENDS_ITEM] -- ('1A.NE01', 'synthetic')
MASTER received from WORKER 5 [WORKER_REQUESTS_ITEM] -- None
MASTER sent to WORKER 5 [MASTER_SENDS_ITEM] -- ('1A.NE02', 'synthetic')
WORKER 5 received from MASTER [MASTER_SENDS_ITEM] -- ('1A.NE02', 'synthetic')
MASTER received from WORKER 1 [WORKER_REQUESTS_ITEM] -- None
MASTER sent to WORKER 1 [MASTER_SENDS_ITEM] -- ('1A.NE03', 'synthetic')
WORKER 1 received from MASTER [MASTER_SENDS_ITEM] -- ('1A.NE03', 'synthetic')
WORKER 2 sent to MASTER [WORKER_REQUESTS_ITEM] -- None
MASTER received from WORKER 2 [WORKER_REQUESTS_ITEM] -- None
MASTER sent to WORKER 2 [MASTER_SENDS_ITEM] -- ('1A.NE04', 'synthetic')
WORKER 2 received from MASTER [MASTER_SENDS_ITEM] -- ('1A.NE04', 'synthetic')
The parallel processing capabilities in ASDFDataSet are extremely well written. Thanks for crafting such a useful package and example of good coding!
ASDF adoption is my long-term goal; however, I'm temporarily tied to other data formats.
Could anyone comment on whether the parallel processing methods in ASDFDataSet could be adapted to work on general obspy Stream objects? I'm not looking to speed up reading and writing traces to disk, only operations on obspy traces in memory, such as instrument response removal, bandpassing, muting, and so on.
Many thanks for any comments!
Hi Lion,
In ASDFDataSet.process(), you use the master-slave working mode. I think you introduced it mainly to balance the workload.
I had a discussion with @mpbl about it, and we think we should also have another job-partitioning scheme (the traditional and simple way), which is to partition the jobs prior to the run. For example, if we have 1000 stations and 10 cores, each processor would take 100 stations and would know which 100 to take. The advantage of doing that is:
The drawback:
What do you think? Is it worth adding such a method?
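The static partitioning idea can be sketched in a few lines; each rank deterministically takes a strided slice, so no master process or communication is needed (the function and variable names are illustrative):

```python
def static_partition(items, rank, size):
    # Rank r takes items r, r+size, r+2*size, ... -- every rank can
    # compute its own share up front, with no messages exchanged.
    return items[rank::size]

# The 1000-stations / 10-cores example from the text.
stations = ["ST%03d" % i for i in range(1000)]
mine = static_partition(stations, rank=3, size=10)
# Every rank gets exactly 100 stations, and the union covers them all.
```

The trade-off, as noted above, is that this fixes the assignment in advance and cannot rebalance if some stations take much longer than others.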
I should preface this by saying I'm not sure if you support Windows, or want to support it, but I have run into a small issue cloning the repository (the suggested install method): some test-data files contain * in the file name. Do you know of any way around that, or can the files be renamed? I am running on Windows 7, Python 2.7, git 2.10.1.windows.1.
The main reason for asking is that I would like to use pyasdf for EQcorrscan (for storing templates and detections with waveforms alongside catalogs and metadata from processing) and have run into issues running tests on AppVeyor.
The full error message is:
$ git clone https://github.com/SeismicData/pyasdf.git
Cloning into 'pyasdf'...
remote: Counting objects: 2391, done.
remote: Compressing objects: 100% (42/42), done.
remote: Total 2391 (delta 23), reused 0 (delta 0), pack-reused 2349
Receiving objects: 100% (2391/2391), 4.46 MiB | 1017.00 KiB/s, done.
Resolving deltas: 100% (1618/1618), done.
error: unable to create file pyasdf/tests/data/small_sample_data_set/AE.113A..BH*.xml: Invalid argument
error: unable to create file pyasdf/tests/data/small_sample_data_set/TA.POKR..BH*.xml: Invalid argument
fatal: unable to checkout working tree
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry the checkout with 'git checkout -f HEAD'
This is only a very minor issue. I realize that numpy is directly imported several times:
pyasdf/asdf_data_set.py:import numpy as np
pyasdf/tests/test_asdf_data_set.py:import numpy as np
pyasdf/utils.py:import numpy as np
So it probably should become a formal dependency.
Hi @krischer! I have ASDF files with ~100 events and ~100 stations. I try to make a list of the source, receiver location information of all waveforms using the following loop:
import pyasdf
from obspy.clients.iris import Client

client = Client()
ds = pyasdf.ASDFDataSet(file)
iwav = 0  # waveform counter (initialization was missing)
for event in ds.events:
    origin = event.preferred_origin() or event.origins[0]
    for station in ds.ifilter(ds.q.event == event, ds.q.channel == 'BHZ'):
        result = client.distaz(station.coordinates["latitude"],
                               station.coordinates["longitude"],
                               origin.latitude, origin.longitude)
        deg = result['distance']
        baz = result['backazimuth']
        iwav += 1
        print('%10d %15.4f %15.4f %15.4f %15.4f %6.1f %6.1f'
              % (iwav, origin.latitude, origin.longitude,
                 station.coordinates["latitude"],
                 station.coordinates["longitude"], deg, baz))
The code works, but the event query is very slow. Do you know of a better way? Perhaps a query and loop over all BHZ traces would be faster, but in that case I don't know how to get the origin information for each trace.
I was thinking that maybe it is not a bad idea to be able to read not only one h5 file but also a directory containing many h5 files, keeping a list of all the files in that directory. Since this is just a pointer (before any real reading), there should not be any problem in terms of memory. This would make it possible to filter on events, stations, etc. Let me make an example to clarify what I am suggesting:
data_set = ASDFDataSet("dir_h5")
which reads all the files.
data_set.filter_events("...similar filtering as read_event in obspy...")
or:
data_set.filter_stations("...channel='*Z', coordinates and etc...")
The advantage is that we would have the functionality to read all files at once and filter them in one or two lines...and afterwards you have your whole desired dataset out of a larger archive.
I just installed pyasdf on my Mac, and I tried importing it in ipython, which resulted in the following error:
AttributeError Traceback (most recent call last)
in
----> 1 import pyasdf
~/miniconda/envs/gmprocess/lib/python3.7/site-packages/pyasdf/__init__.py in
11
12 from .exceptions import ASDFException, ASDFWarning, WaveformNotInFileException
---> 13 from .asdf_data_set import ASDFDataSet
14
15
~/miniconda/envs/gmprocess/lib/python3.7/site-packages/pyasdf/asdf_data_set.py in
63 SUPPORTED_FORMAT_VERSIONS, MSG_TAGS, MAX_MEMORY_PER_WORKER_IN_MB,
64 POISON_PILL, PROV_FILENAME_REGEX, TAG_REGEX, VALID_SEISMOGRAM_DTYPES
---> 65 from .query import Query, merge_query_functions
66 from .utils import is_mpi_env, StationAccessor, sizeof_fmt, ReceivedMessage,
67 pretty_receiver_log, pretty_sender_log, JobQueueHelper, StreamBuffer, \
~/miniconda/envs/gmprocess/lib/python3.7/site-packages/pyasdf/query.py in
102 "elevation_in_m": _type_or_none(float),
103 # Temporal constraints.
--> 104 "starttime": obspy.UTCDateTime,
105 "endtime": obspy.UTCDateTime,
106 "sampling_rate": float,
AttributeError: module 'obspy' has no attribute 'UTCDateTime'
Adding a waveform with complex valued data (e.g. spectra) raises a TypeError:
TypeError: The trace's dtype ('complex128') is not allowed inside ASDF. Allowed are little and big endian 4 and 8 byte signed integers and floating point numbers.
A workaround is to split the real and imaginary parts into separate tags, but this seems kludgy. Is this a design decision, or am I missing something?
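The split-into-parts workaround can be sketched without any ASDF calls; the tag names mentioned in the comments are purely illustrative:

```python
# Complex samples standing in for a complex128 spectrum trace.
data = [complex(1.0, 2.0), complex(3.0, -4.0)]

# Store two float arrays instead, e.g. under tags "spec_real" and
# "spec_imag" (illustrative names), since floats are accepted by ASDF.
real_part = [z.real for z in data]
imag_part = [z.imag for z in data]

# Round trip: the complex values are fully recoverable on read, since
# extracting .real/.imag from a complex double loses no precision.
restored = [complex(r, i) for r, i in zip(real_part, imag_part)]
```

The two tags must be kept in sync by convention, which is exactly the kludginess the question complains about.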
I think we now need to make the units in ASDF files uniform.
Two things I have noticed in the QuakeML:
By "standard quakeml" I mean the QuakeML you gave me (converted from an NDK file), Lion.
So do we need to keep the units the same, or do we add another attribute to the QuakeML to specify the unit?
Hi Lion,
In line 1267 in asdf_data_set.py:
gathered_results = self.mpi.comm.allgather(results)
I think allgather may not be necessary here; gather would be enough. Two reasons:
My preference would be:
gathered_results = self.mpi.comm.gather(results, root=0)
results = {}
if self.mpi.rank == 0:
    for result in gathered_results:
        results.update(result)
return results
What do you think?
Just curious. I noticed that in asdf_data_set.py, in __init__(), you wrote:
:param mode: The mode the file is opened in. Passed to the
underlying :class:`h5py.File` constructor. pyasdf expects to be
able to write to files for many operations so it might result in
strange errors if a read-only mode is used. Nonetheless this is
quite useful for some use cases as long as one is aware of the
potential repercussions.
:type mode: str
Not sure what you mean exactly here...
For example, for the data processing procedure, I read the raw data and write the processed data into a new file. For the window selection, I read the observed and synthetic data and write out windows. Neither modifies the input ASDF file. So is it safe to use mode='r', or is it still better to use mode='a'?
Hi Lion,
A few months ago, I mentioned a pyasdf bug: it failed to split the job across 16 processors. I think I have located it; it is related to how you split the job.
Please check lines 1242-1252 in asdf_data_set.py:
+++++++++++++++++++++++++++++++++
# Divide into chunks, each rank takes their corresponding chunks.
def chunks(l, n):
    """
    Yield successive n-sized chunks from l.
    From http://stackoverflow.com/a/312464/1657047
    """
    for i in range(0, len(l), n):
        yield l[i:i + n]

chunksize = int(math.ceil(len(usable_stations) / self.mpi.size))
all_chunks = list(chunks(usable_stations, chunksize))
++++++++++++++++++++++++++++++++++
Let me give an example. If we have 164 streams and 16 processors, the chunksize will be 11 since you used math.ceil (ceil(164 / 16) = 11). The first 15 processors then already consume all the jobs (15 * 11 = 165 > 164), so len(all_chunks) is 15 and the 16th processor has nothing to take from all_chunks (invalid index). This kind of problem occurs whenever the number of chunks, ceil(len / chunksize), is smaller than the number of processes.
However, I think the current version of mpi4py doesn't handle this error very well, so the program hangs instead of quitting.
I have written another piece of job-splitting code to replace the chunk above:
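The arithmetic of the failure can be checked directly, without any MPI:

```python
import math

ntotal, size = 164, 16
# Current pyasdf logic: chunk size rounded up...
chunksize = int(math.ceil(ntotal / float(size)))     # ceil(10.25) = 11
# ...which yields fewer chunks than there are ranks:
nchunks = int(math.ceil(ntotal / float(chunksize)))  # ceil(164 / 11) = 15
# nchunks < size, so the last rank indexes past the end of all_chunks.
```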
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
def chunks(l, nchunks):
    all_chunks = []
    # general calculation
    ntotal = len(l)
    chunksize = int(math.floor(ntotal / nchunks))
    nrest = ntotal % nchunks
    # calculate the number of jobs for every chunk
    njob_chunk = [chunksize for i in range(nchunks)]
    for i in range(nrest):
        njob_chunk[i] += 1
    # assign jobs to each chunk
    for i in range(nchunks):
        start_idx = sum(njob_chunk[0:i])
        end_idx = sum(njob_chunk[0:(i + 1)])
        all_chunks.append(l[start_idx:end_idx])
    return all_chunks

# calling in a different way
all_chunks = chunks(usable_stations, self.mpi.size)
++++++++++++++++++++++++++++++++++
The basic idea is more easily illustrated with an example. If I have 164 streams and 16 processors, the chunksize is 10, with 4 streams left over. Those 4 remaining streams are assigned to [rank0, rank1, ..., rank3], each of which gets one more. This way the jobs are split much more evenly.
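As an aside, numpy.array_split implements the same even-split policy, if a NumPy call is acceptable here:

```python
import numpy as np

streams = list(range(164))           # stand-in for usable_stations
parts = np.array_split(streams, 16)  # 16 chunks whose sizes differ by at most 1
sizes = [len(p) for p in parts]      # four chunks of 11, then twelve of 10
```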
As for error catching in mpi4py, I have another issue in window selection where the error couldn't be caught, so I had to use a brute-force way to catch it... I may discuss that with you later on.
I'm implementing ASDF for my lab seismology/AE experiments and have run into a timescales issue. My waveforms come as ~1 Msample blocks at 20 MHz, and often several of these blocks occur within one second. When add_waveforms checks for duplicate traces, the lines quoted from __get_waveform_ds_name truncate the start times to whole seconds and thus reject any traces that started in the same second as duplicates.
pyasdf/pyasdf/asdf_data_set.py
Lines 1299 to 1300 in c976927
This is a three-character fix:
"%Y-%m-%dT%H:%M:%S" -> "%Y-%m-%dT%H:%M:%S.%f"
I'm new to open source contributing and this community. Should I just clone and make this fix for myself?
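The collision is easy to reproduce with plain datetime objects; the %f directive is what preserves the sub-second part:

```python
from datetime import datetime

# Two traces starting 0.5 s apart, within the same second:
t1 = datetime(2020, 1, 1, 12, 0, 0, 250000)
t2 = datetime(2020, 1, 1, 12, 0, 0, 750000)

coarse = "%Y-%m-%dT%H:%M:%S"    # current format: both map to the same name
fine = "%Y-%m-%dT%H:%M:%S.%f"   # proposed fix: the names stay distinct
```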
Hi,
I'm using pyasdf as database for my data processing and encountered two problems.
1. Manipulating a catalog with a large number of events is extremely slow in pyasdf.
For example, here is a QuakeML file with more than 10,000 events:
https://github.com/NoisyLeon/DataRequest/blob/master/alaska_2017_aug.ml
We can add the catalog to ASDF:
import pyasdf
dset = pyasdf.ASDFDataSet('test001.h5')
dset.add_quakeml('/path_to_alaska_2017_aug.ml')
The catalog can then be accessed through dset.events.
It is extremely slow to manipulate this catalog; even a simple print dset.events[0] can take a long time.
In obspy, however, this is not the case.
import obspy
cat = obspy.read_events('/path_to_alaska_2017_aug.ml')
Reading the file is still slow, but once cat is in memory, manipulating it is much faster.
You can compare the speed of print cat[0] and print dset.events[0].
2. Some information is lost when the catalog is stored in ASDF.
For example, preferred_origin() and preferred_magnitude() are lost once the catalog is stored in ASDF.
Because of the two problems above, I always have to keep my event catalog in a separate QuakeML file rather than use the catalog from the ASDF file, which is very inconvenient.
Is there any way to improve these?
Thanks,
Lili
I think the current output method is too limited. For example, in your asdf_data_set.py:
def process_two_files_without_parallel_output(
self, other_ds, process_function, traceback_limit=3):
"""
Process data in two data sets.
This method simply gathers everything to the master node.
However, in my case, when calculating adjoint sources, I need to use this function and write out the adjoint sources in parallel. So I modified your function like this:
def process_two_files(self, other_ds, process_function,
output_filename=None, traceback_limit=3):
"""
Process data in two data sets.
So if output_filename is specified, it writes out the adjoint sources in parallel (I embedded the adjoint writer in this function, so it is really limited). If output_filename is None, it just gathers the information to the master node as you did.
However, this is also very limited. If a user wants to use this function to achieve other goals, it is very hard unless they change the source code. So in my opinion we should give the user more freedom. How? I recommend we change things starting from this line:
    print(msg)
    print(tb)
else:
    results[station] = result

# just return the results here
return results
After we get the results, they are sitting on different processors, and we just return them. People can then define their process functions any way they like and are responsible for writing things out themselves.
So, for example, if people take the raw observed and synthetic data and want to march all the way down to adjoint sources, while still keeping the processed observed and synthetic data as well as the window information, they can simply keep everything and write it out afterwards.
If you are worried about the parallel write-out of the ASDF file, I think we can provide some independent writers, like a waveform writer and an auxiliary-data writer. We don't need to hard-code the writers into those process functions. For example, if people have waveforms sitting on different processors, they may call:
dump_waveform_to_asdf(list_of_waveforms, output_filename, mpi_comm).
If they want to write out auxiliary data:
dump_auxiliary_to_asdf(list_of_auxiliary, output_filename, mpi_comm).
This gives the user more freedom to design their workflow. One drawback I can think of is that users must be careful about memory. But modern cluster nodes usually have 64 to 128 GB of memory, so it shouldn't be a big issue.
I would propose including a relatively small but valid dataset, so that getting started, doing some first tests, and exploring and understanding the structures becomes easier and more immediate.
Hi Lion,
I was trying to launch a job on more than one node (16 processors per node) on the ORNL machine.
So if I do:
[lei@rhea12 test_specfem]$ mpiexec -n 17 python process_synthetic.py -p proc_synt_27_60.params.json -f parfile/C200502151442A.proc_synt_50_100.asdf.dirs.json -v
Traceback (most recent call last):
File "process_synthetic.py", line 6, in <module>
from proc_util import process_synt
File "/autofs/nccs-svm1_home1/lei/source_inversion_wf/src/preproc_asdf/proc_util.py", line 1, in <module>
import obspy
File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/obspy/__init__.py", line 36, in <module>
from obspy.core.utcdatetime import UTCDateTime # NOQA
File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/obspy/core/__init__.py", line 107, in <module>
from obspy.core.util.attribdict import AttribDict
File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/obspy/core/util/__init__.py", line 27, in <module>
from obspy.core.util.base import (ALL_MODULES, DEFAULT_MODULES,
File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/obspy/core/util/base.py", line 28, in <module>
import numpy as np
File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/numpy/__init__.py", line 180, in <module>
from . import add_newdocs
File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/numpy/add_newdocs.py", line 13, in <module>
from numpy.lib import add_newdoc
File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/numpy/lib/__init__.py", line 8, in <module>
from .type_check import *
File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/numpy/lib/type_check.py", line 11, in <module>
import numpy.core.numeric as _nx
File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/numpy/core/__init__.py", line 15, in <module>
os.environ.clear()
File "/ccs/home/lei/anaconda2/lib/python2.7/os.py", line 501, in clear
unsetenv(key)
OSError: [Errno 22] Invalid argument
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[47909,1],16]
Exit code: 1
--------------------------------------------------------------------------
Do you know what causes this? It seems numpy is not happy with it :)
Hi Lion,
I found an issue related to the traceback printing.
First, our ASDF data contains a file like the one I mentioned here:
obspy/obspy#1371
pyasdf first gives an error like this:
Error during the processing of station 'XA.SA72' and tag 'raw_observed' on rank 4:
Traceback (At max 3 levels - most recent call last):
File "/autofs/nccs-svm1_home1/lei/software/pyasdf/pyasdf/asdf_data_set.py", line 1708, in process
traceback_limit=traceback_limit)
File "/autofs/nccs-svm1_home1/lei/software/pyasdf/pyasdf/asdf_data_set.py", line 1725, in _dispatch_processing_mpi
traceback_limit=traceback_limit)
File "/autofs/nccs-svm1_home1/lei/software/pyasdf/pyasdf/asdf_data_set.py", line 1907, in _dispatch_processing_mpi_worker_node
stream = process_function(stream, inv)
File "/autofs/nccs-svm1_home1/lei/software/pypaw/src/pypaw/process.py", line 28, in process_wrapper
return process(stream, inventory=inv, **param)
File "/autofs/nccs-svm1_home1/lei/software/pytomo3d/pytomo3d/signal/process.py", line 228, in process
st.detrend("linear")
File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/obspy-0.10.2-py2.7-linux-x86_64.egg/obspy/core/util/decorator.py", line 241, in new_func
return func(*args, **kwargs)
File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/obspy-0.10.2-py2.7-linux-x86_64.egg/obspy/core/stream.py", line 2304, in detrend
tr.detrend(type=type)
File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/obspy-0.10.2-py2.7-linux-x86_64.egg/obspy/core/util/decorator.py", line 258, in new_func
return func(*args, **kwargs)
File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/obspy-0.10.2-py2.7-linux-x86_64.egg/obspy/core/util/decorator.py", line 241, in new_func
return func(*args, **kwargs)
File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/obspy-0.10.2-py2.7-linux-x86_64.egg/obspy/core/trace.py", line 231, in new_func
result = func(*args, **kwargs)
File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/obspy-0.10.2-py2.7-linux-x86_64.egg/obspy/core/trace.py", line 1817, in detrend
self.data = func(self.data, **options)
File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/scipy/signal/signaltools.py", line 1900, in detrend
newdata = newdata.astype(dtype)
ValueError: could not convert string to float: �
It is because the data array contains strings, not floats. This error should then be caught here:
https://github.com/SeismicData/pyasdf/blob/master/pyasdf/asdf_data_set.py#L1751
However, when executing this line, another error comes out:
tb += "".join(exc_line)
The error log is:
Traceback (most recent call last):
File "process_asdf.py", line 19, in <module>
proc.smart_run()
File "/autofs/nccs-svm1_home1/lei/software/pypaw/src/pypaw/procbase.py", line 201, in smart_run
self._core(path, param)
File "/autofs/nccs-svm1_home1/lei/software/pypaw/src/pypaw/process.py", line 85, in _core
ds.process(process_function, output_asdf, tag_map=tag_map)
File "/autofs/nccs-svm1_home1/lei/software/pyasdf/pyasdf/asdf_data_set.py", line 1708, in process
traceback_limit=traceback_limit)
File "/autofs/nccs-svm1_home1/lei/software/pyasdf/pyasdf/asdf_data_set.py", line 1725, in _dispatch_processing_mpi
traceback_limit=traceback_limit)
File "/autofs/nccs-svm1_home1/lei/software/pyasdf/pyasdf/asdf_data_set.py", line 1928, in _dispatch_processing_mpi_worker_node
tb += "".join(exc_line)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 47: ordinal not in range(128)
Following up from previous discussion here, I'm reaching out to see if you're all still happy to do a domain swap on readthedocs with our ASDF project?
Let me know, and we can coordinate the switch if you're still game!
Hi @krischer, after source inversion we need to update the event information in the ASDF file of the raw data. Is there a safe way you can suggest to do such an update?
Hi Lion,
Some people want to use our tools, including pyflex, pyadjoint, pytomo3d, pyasdf and so on. I am thinking about moving all these packages to one place so people can easily find all the resources at once. What do you think? If so, what kind of "one place" would be suitable?
Right now the packages are hosted in quite a few different places, and people get confused when I introduce the tools to them.
The filter method of ASDFDataSet results in a RuntimeError in Python 3.7.
This is because the generator terminates with raise StopIteration instead of return.
Some details about this change to Python syntax are here: https://www.python.org/dev/peps/pep-0479/
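A minimal illustration of the PEP 479 behaviour (the generator names here are made up):

```python
def gen_old_style(n):
    # Pre-3.7 idiom: ending a generator by raising StopIteration explicitly.
    # Under PEP 479 (default from Python 3.7) this surfaces as RuntimeError.
    i = 0
    while True:
        if i >= n:
            raise StopIteration
        yield i
        i += 1

def gen_fixed(n):
    # The fix: simply return from the generator instead.
    i = 0
    while True:
        if i >= n:
            return
        yield i
        i += 1
```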
I think validate() is not working as we expect it to.
In asdf_data_set.py, lines 1224-1239:
for station_id in dir(self.waveforms):
    station = getattr(self.waveforms, station_id)
    contents = dir(station)
    if not contents:
        continue
    if "StationXML" not in contents and contents:
        print("No station information available for station '%s'" %
              station_id)
        summary["no_station_information"] += 1
        continue
    contents.remove("StationXML")
    if not contents:
        print("Station with no waveforms: '%s'" % station_id)
        summary["no_waveforms"] += 1
        continue
    summary["good_stations"] += 1
When you check for waveforms, you remove "StationXML" and check whether any attributes are left. But as pyasdf has evolved, this is no longer sufficient: once a station has a StationXML, it will also have "channel_coordinates" and "coordinates".
A stricter way to check for waveforms is to loop over the attributes and see whether any of them is of type obspy.core.stream.Stream.
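A sketch of such a check (the attribute names and the helper are illustrative; the real fix would test isinstance(getattr(station, name), obspy.core.stream.Stream) rather than compare names):

```python
# Non-waveform attributes a station accessor may carry alongside StationXML;
# this set is illustrative and would need to track pyasdf's accessor API.
NON_WAVEFORM_ATTRS = {"StationXML", "coordinates", "channel_coordinates"}

def has_waveforms(contents):
    """contents: list of attribute names of one station accessor."""
    return any(name not in NON_WAVEFORM_ATTRS for name in contents)
```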
Hi Lion,
I got pyflex to hang when using ASDF data on Rhea (an Oak Ridge machine). The program runs to a point and stops at a certain trace (as far as I can tell, it stops at the same trace every time).
The command I used to run the code is:
mpiexec -n 16 python parallel_pyflex.py
The problem only happens when I use 16 processors (with MPI). With 2, 4 or 8 processors, the program works.
parallel_pyflex.py uses the function process_two_files_without_parallel_output(ds1, ds2). The script is almost the same as this one: https://github.com/krischer/InversionWorkflowExample/blob/master/parallel_pyflex.py
There might be two reasons:
I will post more updates after doing more tests.
Hi Lion,
perhaps we are missing something in the docs, or maybe something is not working properly:
pyasdf allows tagging waveforms (or any other data) with strings starting with numbers, but trying to access the previously tagged data results in an "invalid syntax" exception.
Moreover, network names starting with numbers result in the same behaviour (the FDSN specification allows numbers in network codes, and some actual networks already use this syntax).
PS: we are also allowed to insert special characters in tag strings... should that be forbidden?
Sample sessions below (using the latest pyasdf from the git repository):
a) tag string starting with numbers
In [1]: import pyasdf
In [2]: ds = pyasdf.ASDFDataSet("test.h5", compression="gzip-3")
In [3]: ds.add_waveforms('input.mseed', tag='00_XX', event_id='test')
In [4]: ds.waveforms.NZ_RTZ.00_XX
File "<ipython-input-4-124329b5b6a4>", line 1
ds.waveforms.NZ_RTZ.00_XX
^
SyntaxError: invalid syntax
b) tag string starting with letters
In [5]: ds.add_waveforms('input.mseed', tag='YY_00', event_id='test')
In [6]: ds.waveforms.NZ_RTZ.YY_00
Out[6]:
1 Trace(s) in Stream:
NZ.RTZ.20.HNE | 2016-02-01T19:02:03.000000Z - 2016-02-01T19:05:23.995000Z | 200.0 Hz, 40200 samples
c) tag string containing special chars (....is this allowed?)
In [9]: ds.add_waveforms('input.mseed', tag='YY_0%%%%0', event_id='test')
In [10]: ds.waveforms.NZ_RTZ.YY_0%%%%0
File "<ipython-input-10-1e37cee81f48>", line 1
ds.waveforms.NZ_RTZ.YY_0%%%%0
^
SyntaxError: invalid syntax
d) network names starting with numbers result in the same error!
In [4]: ds.waveforms.11_RTZ
File "<ipython-input-4-20ad8374b5b6>", line 1
ds.waveforms.11_RTZ
^
SyntaxError: invalid syntax
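These failures are plain Python syntax errors: dotted attribute access only works for valid Python identifiers. getattr(ds.waveforms.NZ_RTZ, "00_XX") should sidestep the problem, and a validity check that pyasdf could apply when tagging might look like this (the helper name is made up):

```python
import keyword

def is_safe_tag(name):
    # A tag is safe for attribute access only if it is a valid,
    # non-keyword Python identifier.
    return name.isidentifier() and not keyword.iskeyword(name)

is_safe_tag("YY_00")      # fine
is_safe_tag("00_XX")      # rejected: starts with a digit
is_safe_tag("YY_0%%%%0")  # rejected: contains special characters
```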
pyasdf/pyasdf/inventory_utils.py
Line 144 in 8431d2d
Trying to run set() on a list of Comments throws a TypeError: unhashable type: 'Comment'
I encountered this when trying to add a StationXML with a list of comments that was already contained in the dataset.
It would be nice if it threw an ASDFWarning, similar to when adding waveform data already contained in the dataset.
Hi Lion,
I recently found that running multiple processing jobs at the same time greatly increases the running time. For example, a single job finishes in 25 seconds; with 5 jobs running at the same time, each takes 90 seconds; and with 20 jobs at the same time it becomes really, really slow.
After some simple investigation, I found that the problem comes from the _sync_metadata() method in asdf_data_set.py. I will investigate further and let you know the result.
Hi, Lion,
I think adding a tags attribute to the ASDFDataSet class would be useful. It would give the user quick access to a list of all available tags in the dataset, and could be used to drive processing in some cases.
I would be happy to contribute this, but want to conform to any standards and recommendations you might have. Do you have any standards for contributing code? or recommendations for how to add this particular feature?
Grace and peace,
Malcolm