
pyasdf's People

Contributors

cdeil, dirkphilip, icui, jes-p, krischer, qulogic, zlatanvasovic


pyasdf's Issues

Two problems for event catalog in pyasdf

Hi,
I'm using pyasdf as a database for my data processing and have encountered two problems.

1. Manipulating a catalog with a large number of events is extremely slow in pyasdf.
For example, here is a QuakeML file with more than 10,000 events:
https://github.com/NoisyLeon/DataRequest/blob/master/alaska_2017_aug.ml

We can add the catalog to ASDF:

import pyasdf
dset = pyasdf.ASDFDataSet('test001.h5')
dset.add_quakeml('/path_to_alaska_2017_aug.ml')

The catalog can then be accessed through dset.events.
It is extremely slow to manipulate this catalog; even a simple 'print dset.events[0]' can take a long time.

However, this is not the case in ObsPy:

import obspy
cat = obspy.read_events('/path_to_alaska_2017_aug.ml')

Reading the file is still slow, but once cat is in memory, manipulating it is much faster.

You can compare the speed of 'print cat[0]' and 'print dset.events[0]'.
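
A quick way to quantify the difference could look like this (a sketch using the same file names as above; the comparison assumes that dset.events goes back to the QuakeML stored in the HDF5 file on every access):

    import timeit

    import obspy
    import pyasdf

    cat = obspy.read_events('/path_to_alaska_2017_aug.ml')
    dset = pyasdf.ASDFDataSet('test001.h5')

    # In-memory catalog: the event is already a parsed Python object.
    print(timeit.timeit(lambda: str(cat[0]), number=10))
    # ASDF-backed catalog: every access appears to re-read and re-parse the
    # QuakeML stored in the file, which dominates the runtime.
    print(timeit.timeit(lambda: str(dset.events[0]), number=10))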

2. Some information is lost when the catalog is stored in ASDF.
For example, preferred_origin() and preferred_magnitude() are lost once the catalog is stored in ASDF.

Because of these two problems, I always have to store my event catalog in a separate QuakeML file instead of using the catalog from the ASDF file, which is very inconvenient.

Is there any way to improve these?

Thanks,
Lili

adding an existing StationXML throws TypeError

set(station.comments).union(set(other_station.comments))

Trying to run set() on a list of Comments throws a TypeError: unhashable type: 'Comment'

I encountered this when trying to add a StationXML, containing a list of comments, that was already present in the dataset.

It would be nice if it threw an ASDFWarning, similar to when adding waveform data already contained in the dataset.
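
A possible workaround until this is handled upstream, assuming station and other_station are the obspy Station objects from the failing line:

    def merge_comments(station, other_station):
        """Merge comments from two Station objects without hashing them."""
        merged = list(station.comments)
        for comment in other_station.comments:
            if comment not in merged:  # relies on Comment.__eq__, no hashing needed
                merged.append(comment)
        station.comments = merged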

A non master-slave partition mode in process function

Hi Lion,

In ASDFDataSet.process(), you use a master-slave working model. I think you introduced it mainly to balance the workload.

I had a discussion with @mpbl about it and we think we should also have another job partitioning scheme (the traditional and simple way): partition the jobs before the run starts. For example, if we have 1000 stations and 10 cores, every processor would take 100 stations and know in advance which 100 it gets. The advantages of doing that are:

  • the code is much simpler, so it is more maintainable and has fewer bugs.
  • it is scalable. If in the future we have 100,000 stations and want to use 1000 cores, the master-slave model may suffer from the communication between cores.

The drawback:

  • we can't ensure workload balance. Under some extreme conditions it will perform very poorly.
    However, if we shuffle the job queue, it won't be too bad.

What do you think? Worth adding such a method?
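
A minimal sketch of such a static partition (all_station_ids is a placeholder for the real job list); shuffling with a fixed seed spreads slow and fast stations roughly evenly without any communication:

    import random

    from mpi4py import MPI

    def my_share(all_station_ids, seed=42):
        """Statically assign stations to this MPI rank (sketch)."""
        comm = MPI.COMM_WORLD
        stations = sorted(all_station_ids)
        # The fixed seed keeps the shuffled order identical on every rank.
        random.Random(seed).shuffle(stations)
        # Rank r takes items r, r + size, r + 2*size, ... -- no master needed.
        return stations[comm.rank::comm.size]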

question regarding station names

ASDFDataSet.WaveformAccessor uses a station/network/channel naming convention, but not all data sets include this type of information. For example, obspy.read leaves these attributes empty when reading from certain formats, such as SeismicUnix. Any suggested workarounds? Would you suggest manually setting the 'station' attribute for each trace with, say, the trace number index?

Any suggestions would be greatly appreciated. Thanks, Ryan
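
If the identifiers are simply empty, filling them in manually before adding the traces is probably the easiest route; a sketch along the lines you suggest (the file name and codes are hypothetical):

    import obspy

    st = obspy.read("shots.su", format="SU")  # hypothetical Seismic Unix file

    for i, tr in enumerate(st):
        # Fill the otherwise empty SEED-style identifiers so pyasdf can
        # build a waveform name; the trace index stands in for a station code.
        tr.stats.network = "XX"
        tr.stats.station = "S%04d" % i
        tr.stats.channel = "SH1"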

process_two_files_with_parallel_output() function needed

Hi Lion,

I think we have reached the point where we need parallel output of adjoint sources for the preprocessing workflow.

Would it be very difficult to write such a function for the adjoint source writer? I am thinking about looking into this issue this weekend. Are you comfortable with me making some attempts?

tune the output information in ASDFDataSet.process() method

Hi Lion,

I am currently using this to process an ASDF file and I found the output information is limited. For example, in the following output, the interpolation method raises a ValueError, but I don't know which stream (or trace) it concerns. Knowing the stream id would be very helpful so I can go into the HDF5 file and see what is wrong with that trace.

Do you have specific reasons for keeping the terminal output this simple?

Terminal Output:

--------------------------------------------------------------------------
Processing!!! 1445269490.98
Processing!!! 1445269490.98
Processing!!! 1445269490.98
Processing!!! 1445269490.99
Process Process-2:
Traceback (most recent call last):
  File "/ccs/home/lei/anaconda/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/autofs/nccs-svm1_home1/lei/software/pyasdf/pyasdf/asdf_data_set.py", line 1680, in run
    output_stream = self.processing_function(stream, inv)
  File "/autofs/nccs-svm1_home1/lei/test/git_merge/source_inversion_wf/src/preproc_asdf/proc_util.py", line 112, in process_function
    cut_func(st, starttime, endtime)
  File "/autofs/nccs-svm1_home1/lei/test/git_merge/source_inversion_wf/src/preproc_asdf/proc_util.py", line 51, in cut_func
    tr_list.append(flex_cut_trace(tr, t1, t2))
  File "/autofs/nccs-svm1_home1/lei/test/git_merge/source_inversion_wf/src/preproc_asdf/proc_util.py", line 39, in flex_cut_trace
    return tr.slice(cut_starttime, cut_endtime)
  File "/ccs/home/lei/anaconda/lib/python2.7/site-packages/obspy/core/trace.py", line 1114, in slice
    tr.trim(starttime=starttime, endtime=endtime)
  File "/ccs/home/lei/anaconda/lib/python2.7/site-packages/obspy/core/trace.py", line 231, in new_func
    result = func(*args, **kwargs)
  File "/ccs/home/lei/anaconda/lib/python2.7/site-packages/obspy/core/trace.py", line 1075, in trim
    raise ValueError("startime is larger than endtime")
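
Until pyasdf reports the identifier itself, one workaround is to wrap the user-side processing function so that failures print the trace ids before re-raising (a sketch; actual_processing stands in for the real processing code):

    def process_function(stream, inventory):
        try:
            return actual_processing(stream, inventory)
        except Exception:
            # Print the ids of the offending traces before re-raising, so the
            # terminal output shows which part of the HDF5 file to inspect.
            print("Processing failed for: %s" % ", ".join(tr.id for tr in stream))
            raise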

two-way conversion : sac <--> asdf

Some of our tools work on sac files, others work on ASDF files, so I'd like to be able to convert forward and backward without any loss of metadata.

pyasdf/scripts/sac2asdf.py does not preserve stats.sac dictionaries, so I wrote a new version that does: https://github.com/rmodrak/asdf_converters/blob/master/asdf_converters/sac2asdf.py

This new script constructs a dictionary {original_sac_filename: trace.stats.sac for trace in stream} and dumps it into a dill virtual file that is then saved as ASDF auxiliary data.

What remains is to devise a mapping between original_sac_filenames <--> ASDF waveforms. I see no easy way of doing this using the existing ASDF API. ASDFDataSet.get_waveforms comes close, but I think there ought to be a simpler way.

Am I missing something? Any advice would be greatly appreciated.

ASDFDataSet.tags attribute

Hi, Lion,

I think adding a tags attribute to the ASDFDataSet class would be useful. It would provide the user with quick access to a list of all available tags in the dataset, and could be used to drive processing in some cases.

I would be happy to contribute this, but want to conform to any standards and recommendations you might have. Do you have any standards for contributing code, or recommendations for how to add this particular feature?

Grace and peace,
Malcolm
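
A minimal sketch of what such a tags attribute could return, assuming the per-station accessor exposes get_waveform_tags() as recent pyasdf versions do:

    def waveform_tags(ds):
        """Collect every waveform tag present in an ASDFDataSet (sketch)."""
        tags = set()
        for station in ds.waveforms:
            tags.update(station.get_waveform_tags())
        return sorted(tags)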

Seis-prov hash IDs

We're starting to use ASDF and SEIS-PROV for some strong-motion data processing code we're writing. The SEIS-PROV documentation (see here: http://seismicdata.github.io/SEIS-PROV/_generated_details.html#seis-prov-ids) states that the third element of a seis-prov id is:
"A 7 to 12 letter lowercase alphanumeric hash to ensure uniqueness of ids."

What is the intended use of this hash, and what is the domain where the uniqueness is being considered? Is it just within the document, or across a span of them?

Confusion with asdf / pyasdf package name

Following up from previous discussion here, I'm reaching out to see if you're all still happy to do a domain swap on readthedocs with our ASDF project?

Let me know, and we can coordinate the switch if you're still game!

Possible to keep things in the memory?

Hi Lion,

We are now integrating the whole workflow. For example, we want to combine signal processing (for both observed and synthetic data) and window selection. During the processing we don't want any I/O. To be more specific, we want the workflow to read in the raw data (one observed and one synthetic ASDF file) at the very beginning, process them, and select windows. After window selection, only the windows are written out.

Currently I am using the process() function in asdf_data_set.py. However, there is an argument called output_filename:

def process(self, process_function, output_filename, tag_map,
                traceback_limit=3, **kwargs):

meaning the current implementation requires writing the processed files out. However, if possible, my preferred way would be to keep things in memory. I am guessing that this goes against the basic design of ASDF, right? When you open a dataset it does not even read the whole thing into memory, so there is no such thing as "keeping everything in memory".

Another option I am thinking about would be to modify the processing function, so that it takes one observed and one synthetic stream and walks all the way down to window selection. I think that might be the right and feasible way.


If my words are confusing, I will illustrate a little more:
For example, you have two files, one raw observed ASDF and one raw synthetic ASDF, and you want to process them and select windows. There are two ways of doing so:

  1. process the whole observed ASDF file (but keep everything in memory), process the whole synthetic ASDF file (and keep everything in memory), and then select windows for the traces in memory. The advantage of this is that my code stays modular, so I can simply assemble the different parts together. The disadvantage is that this method might not be applicable to the current ASDF implementation.
  2. modify the process_function to incorporate all the procedures. I think this is possible to implement, but the disadvantage is that it makes the process_function very big and not very user-friendly.

Sorry to bring this up so late. I looked through the code and found that it might involve a lot of changes.
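
One way to stay entirely in memory with the existing API might be process_two_files_without_parallel_output(), which hands matching station groups from both files to a user function and only gathers the returned results. A rough sketch, assuming the tags raw_observed/synthetic and user-supplied preprocess and select_windows helpers:

    def compare(obs_group, syn_group):
        # Everything below happens in memory; nothing is written back to disk.
        inv = obs_group.StationXML
        obs = preprocess(obs_group.raw_observed, inv)
        syn = preprocess_synt(syn_group.synthetic, inv)
        return select_windows(obs, syn, inv)

    # Only the selected windows are returned (gathered on the master rank).
    windows = ds_obs.process_two_files_without_parallel_output(ds_syn, compare)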

cannot add dictionaries with 'None' values in add_auxiliary_data

Hello,

I get an error when trying to add a dictionary containing None values as parameters with add_auxiliary_data.

Everything works as expected if the dictionary does not contain None values.

Am I missing something? Is it a bug?

Here's a transcript of a sample session:

$ python -c "import pyasdf; pyasdf.print_sys_info()"
pyasdf version 0.4.x
===============================================================================
CPython 2.7.9, compiler: GCC 4.9.2
Linux 4.9.0-0.bpo.3-amd64 64bit
Machine: x86_64, Processor:  with 24 cores
===============================================================================
HDF5 version 1.10.1, h5py version: 2.7.1
MPI: MPICH, version: 3.1.0, mpi4py version: 3.0.0
Parallel I/O support: True
Problematic multiprocessing: False
===============================================================================
Other_modules:
	dill: 0.2.6
	lxml: 3.4.0
	numpy: 1.8.2
	obspy: 1.1.0
	prov: 1.5.0
	scipy: 0.14.0

$ asdf-validate UZ.A3150..HN.UZ-1984-0011.h5
WARNING: No QuakeML found in the file.
Valid ASDF File!

$ ipython
In [1]: import pyasdf

In [2]: import numpy as np

In [3]: data = np.random.random(100)

In [4]: data_type = "RandomArrays"

In [5]: path = "example_array"

In [6]: ds = pyasdf.ASDFDataSet('UZ.A3150..HN.UZ-1984-0011.h5')

In [7]: a= {u'user6': 'foo', u'user7': '', u'user4': None}

In [8]: ds.add_auxiliary_data(data=data, data_type=data_type, path=path, parameters=a)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)

[...]

TypeError: Object dtype dtype('O') has no native HDF5 equivalent

In [9]: 
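
A possible workaround, continuing the session above: drop (or stringify) the None entries before passing the dictionary, since h5py cannot store Python None as an HDF5 attribute:

    # Strip entries whose value is None before handing the dict to pyasdf.
    clean = {k: v for k, v in a.items() if v is not None}
    ds.add_auxiliary_data(data=data, data_type=data_type, path=path,
                          parameters=clean)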

Parallel processing request for comment

The parallel processing capabilities in ASDFDataSet are extremely well written. Thanks for crafting such a useful package and example of good coding!

ASDF adoption is my long term goal, however, I'm temporarily tied to other data formats.

Could anyone comment on whether the parallel processing methods in ASDFDataSet could be adapted to work on general obspy stream objects? I'm not looking to speed up reading and writing traces to disk, only operations on obspy traces in memory such as instrument response removal, bandpassing, muting and so on.

Many thanks for any comments!
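
The master/worker machinery in pyasdf is tied to ASDF files, but if the goal is only in-memory operations on Stream objects, the standard library may already be enough; a rough sketch (preprocess and the list of streams are placeholders):

    import multiprocessing

    import obspy

    def preprocess(stream):
        """Purely in-memory processing of one obspy Stream."""
        stream.detrend("linear")
        stream.taper(max_percentage=0.05)
        stream.filter("bandpass", freqmin=0.01, freqmax=1.0)
        return stream

    if __name__ == "__main__":
        # `streams` stands in for whatever Stream objects were read from
        # the other data format; obspy's bundled example is used here.
        streams = [obspy.read()]
        with multiprocessing.Pool() as pool:
            processed = pool.map(preprocess, streams)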

check file existance

Hi,

We recently ran into this issue with OpenMPI (gcc) on an Ubuntu system.

In asdf_data_set.py, lines 1409-1411:

        if os.path.exists(output_filename):
            msg = "Output file '%s' already exists." % output_filename
            raise ValueError(msg)

The check runs on every processor, but the processors do not always reach this point at the same time. If one processor runs very fast and has already created the file, the other processors will raise an error.

Somehow it works with OpenMPI (cray-gcc) but not with OpenMPI (gcc) on Ubuntu (Norbert found this problem when trying our workflow). I think there might be some implementation difference between the two.

My suggestion would be either of the following:

  • only check on the master processor (see the sketch below)
  • put an MPI barrier after this block to make sure every processor reaches this point at the same time.
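
A sketch of the first suggestion, reusing the self.mpi.comm handle that asdf_data_set.py already has: rank 0 performs the check and broadcasts the answer, so every rank fails (or proceeds) consistently instead of racing:

    exists = None
    if self.mpi.comm.rank == 0:
        exists = os.path.exists(output_filename)
    # Share the master's answer so all ranks take the same branch.
    exists = self.mpi.comm.bcast(exists, root=0)
    if exists:
        raise ValueError("Output file '%s' already exists." % output_filename)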

@mpbl

Slow performance when adding gappy data to ASDF

Reading gappy data incurs a significant performance penalty compared to non-gappy data.

Attached is a read timing test and test data. The test script times the read of the specified miniSEED using, for reference, obspy.read() followed by pyasdf's add_waveforms(). The test data are a day of both gappy (2200+ gaps) and non-gapped time series.

On my machine:

$ ./read-timing-test.py -o output.h5 clean-day.mseed gappy-day.mseed 
Opening output ASDF volume: output.h5
Processing clean-day.mseed
ObsPy read(): 0.05147713300000012 seconds
ASDF add_waveforms(): 0.1556968969999999 seconds
Processing gappy-day.mseed
ObsPy read(): 0.49582375 seconds
ASDF add_waveforms(): 7.62076154 seconds

The add_waveforms() method, at 7.6 seconds, is more than an order of magnitude slower than an obspy.read() of the same data at 0.49 seconds.

Obviously it would be nice if this were faster. As ASDF gains popularity it will be used with a likewise-broadening set of input data.

read-timing-test.zip
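
For reference, the described test boils down to something like this (file names as in the attached zip):

    import time

    import obspy
    import pyasdf

    ds = pyasdf.ASDFDataSet("output.h5")
    for fname in ["clean-day.mseed", "gappy-day.mseed"]:
        t0 = time.time()
        st = obspy.read(fname)
        t1 = time.time()
        ds.add_waveforms(st, tag="raw_recording")
        t2 = time.time()
        print("%s  read(): %.2f s  add_waveforms(): %.2f s"
              % (fname, t1 - t0, t2 - t1))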

Put all the packages together

Hi Lion,

There are some people who want to use our tools, including pyflex, pyadjoint, pytomo3d, pyasdf and so on. I am thinking about moving all these packages to one place so people could easily find those resources at once. What do you think about it? If so, what kind of "one place" would be suitable?

Because the packages are currently hosted in quite a few different places, people get confused when I introduce these tools to them.

@mpbl

why mode='a' is preferred?

Just curious. I noticed in asdf_data_set.py, in __init__(), you wrote:

        :param mode: The mode the file is opened in. Passed to the
            underlying :class:`h5py.File` constructor. pyasdf expects to be
            able to write to files for many operations so it might result in
            strange errors if a read-only mode is used. Nonetheless this is
            quite useful for some use cases as long as one is aware of the
            potential repercussions.
        :type mode: str

Not sure what you mean exactly here...

For example, for the data processing procedure, I read the raw data and write the processed data into a new file. For the window selection, I read the observed and synthetic data and write out windows. None of these modify the input ASDF file. So is it safe to use mode='r', or is it still better to use mode='a'?
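
For a workflow like that, where the inputs are only read and the results go to a different file, opening the inputs read-only should be fine and also guards against accidental modification. A sketch with hypothetical file names and an assumed waveform tag:

    import pyasdf

    ds_in = pyasdf.ASDFDataSet("raw_observed.h5", mode="r")  # read-only input
    ds_out = pyasdf.ASDFDataSet("proc_observed.h5")          # default mode="a"

    for station in ds_in.waveforms:
        st = station.raw_observed                            # assumed tag
        st.detrend("linear")
        ds_out.add_waveforms(st, tag="proc_observed")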

storing complex valued data

Adding a waveform with complex valued data (e.g. spectra) raises a TypeError:

TypeError: The trace's dtype ('complex128') is not allowed inside ASDF. Allowed are little and big endian 4 and 8 byte signed integers and floating point numbers.

A workaround is to split the real and imaginary parts into separate tags, but this seems kludgy. Is this a design decision, or am I missing something?
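
Until complex dtypes are supported (if ever), a version of the workaround you describe might look like this, assuming tr is an obspy Trace holding complex spectra and ds an open ASDFDataSet:

    import numpy as np

    tr_real = tr.copy()
    tr_real.data = np.ascontiguousarray(tr.data.real)  # float64, allowed by ASDF
    tr_imag = tr.copy()
    tr_imag.data = np.ascontiguousarray(tr.data.imag)

    ds.add_waveforms(tr_real, tag="spectrum_real")
    ds.add_waveforms(tr_imag, tag="spectrum_imag")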

Performance profile in the future

Maybe in the near future we need some performance profiling.

I did some simple tests on Rhea and here is a table of running times when processing a single file (~200 MB) using ASDFDataSet.process():

NPROC   Running time (s)
16      94.88
32      70.15
48      70.62
64      73.13

The running time is taken from the ASDF terminal output, like:

Master done, shutting down workers...
Worker 4 shutting down...
Worker 3 shutting down...
Worker 7 shutting down...
Worker 11 shutting down...
Worker 9 shutting down...
Worker 5 shutting down...
Worker 1 shutting down...
Worker 2 shutting down...
Worker 10 shutting down...
Worker 8 shutting down...
Worker 6 shutting down...
Worker 13 shutting down...
Worker 12 shutting down...
Worker 14 shutting down...
Worker 15 shutting down...
Jobs (running 94.88 seconds): queued: 0 | finished: 207 | total: 207
    Worker 1: 0 active, 7 completed jobs
    Worker 2: 0 active, 17 completed jobs
    Worker 3: 0 active, 14 completed jobs
    Worker 4: 0 active, 6 completed jobs
    Worker 5: 0 active, 6 completed jobs
    Worker 6: 0 active, 19 completed jobs
    Worker 7: 0 active, 14 completed jobs
    Worker 8: 0 active, 19 completed jobs
    Worker 9: 0 active, 20 completed jobs
    Worker 10: 0 active, 25 completed jobs
    Worker 11: 0 active, 18 completed jobs
    Worker 12: 0 active, 16 completed jobs
    Worker 13: 0 active, 17 completed jobs
    Worker 14: 0 active, 5 completed jobs
    Worker 15: 0 active, 4 completed jobs

I think it spends quite a lot of time in the writing.

validate() not working as we expect

I think validate() is not working as we expect it to.

In asdf_data_set.py, lines 1224-1239:

for station_id in dir(self.waveforms):
    station = getattr(self.waveforms, station_id)
    contents = dir(station)
    if not contents:
        continue
    if "StationXML" not in contents and contents:
        print("No station information available for station '%s'" %
              station_id)
        summary["no_station_information"] += 1
        continue
    contents.remove("StationXML")
    if not contents:
        print("Station with no waveforms: '%s'" % station_id)
        summary["no_waveforms"] += 1
        continue
    summary["good_stations"] += 1

When you check for waveforms, you remove "StationXML" and check whether there are attributes left. But as pyasdf has evolved, this is no longer sufficient: once you have StationXML, you will also have "channel_coordinates" and "coordinates".

A stricter way to check for waveforms is to loop over the attributes and see whether any of them has the type obspy.core.stream.Stream (see the sketch below).
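
A sketch of that stricter check (note that accessing a waveform attribute actually loads the data, so this is heavier than a name-based check):

    import obspy

    def station_has_waveforms(station):
        """Return True if any attribute of the station accessor is a Stream."""
        return any(
            isinstance(getattr(station, name, None), obspy.core.stream.Stream)
            for name in dir(station)
        )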

Reading hangs when using mpi for 16 processors

Hi Lion,

I got a pyflex hang-up when using ASDF data on Rhea (an Oak Ridge machine). The program runs to a point and stops at a certain trace (as far as I noticed, it stops at the same trace every time).

The command I used to run the code is:
mpiexec -n 16 python parallel_pyflex.py

The problem only happens when I use 16 processors (with MPI). With 2, 4, or 8 processors, the program works.

The parallel_pyflex.py uses the function: process_two_files_without_parallel_output(ds1, ds2). The script is almost the same as this one: https://github.com/krischer/InversionWorkflowExample/blob/master/parallel_pyflex.py

There might be two reasons:

  1. the file is corrupted, so I need to do more tests to see if this happens with more files. But the weird thing is that the same file works on 2 to 8 processors.
  2. pyasdf is buggy.

I will post more updates later after doing more tests.

Possible to be more flexible on stationxml side?

Hi Lion,

I am collaborating with Yanhua and another seismologist working on China tomography.

One problem that prevents us from using pyasdf is that they don't have StationXML files (sorry to mention it, but the China data is not available online). They only have pole-and-zero files.

Actually, I think a lot of seismologists have this issue right now. Not all of us are using StationXML. So my question is:

To attract those "old" users, should we consider adding the capability to add PZ files or RESP files?
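
If the responses can be expressed as RESP files, one possible workaround that needs no pyasdf changes is to convert them to an ObsPy Inventory first and then pass that to add_stationxml(). This assumes an ObsPy version whose read_inventory() supports the RESP format (check the supported formats of your installation); the file names below are hypothetical:

    import obspy
    import pyasdf

    ds = pyasdf.ASDFDataSet("china_data.h5")

    # Assumption: read_inventory() can parse RESP in the installed ObsPy
    # version; otherwise the response must be converted to StationXML first.
    inv = obspy.read_inventory("RESP.XX.STA01.00.BHZ", format="RESP")
    ds.add_stationxml(inv)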

MPI can not exit

Hi Lion,

I don't know if you notice it or not.

When using mpi4py.MPI and an exception (or error) is raised, the program will just hang instead of quitting.

I found we probably need to replace the raise SomeError() calls with something like:

from mpi4py import MPI

try:
    do_something()                 # the work that might raise
except Exception as exp:
    print(exp)
    # Abort all ranks instead of letting them hang.
    MPI.COMM_WORLD.Abort()

This is probably not necessary inside the pyasdf code itself, but when we use the pyasdf module and write our own code, it is better to have this in place to avoid burning unnecessary CPU hours.

SUGGESTION: Reading several datasets in one command

I was thinking that maybe it is not a bad idea to be able to read not only one HDF5 file but also a directory containing several HDF5 files, keeping a list of all the files in that directory. Since each is just a pointer (before any real reading), there should not be any problem in terms of memory. This would make it possible to filter on events, stations, etc. Let me give an example to clarify what I am suggesting:

  • Assume that we have the following directory that has several hd5 files:
    dir_h5
    file1.h5 file2.h5 file3.h5
  • read the whole directory by, let's say:
data_set = ASDFDataSet("dir_h5")

which reads all the files.

  • we are only interested in the events and/or stations in one specific location or whatever:
data_set.filter_events("...similar filtering as read_event in obspy...")

or:

data_set.filter_stations("...channel='*Z', coordinates and etc...")
  • come up with a dataset with only desired stations/events.

The advantage is that we would have the functionality to read all files at once and filter them in one or two lines...and afterwards you have your whole desired dataset out of a larger archive.
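
Until something like this exists in pyasdf itself, a thin wrapper can approximate it (a sketch; dir_h5 and the latitude filter are just examples):

    import glob

    import pyasdf

    # Only handles are created here; no waveform data is read yet.
    datasets = [pyasdf.ASDFDataSet(f, mode="r")
                for f in sorted(glob.glob("dir_h5/*.h5"))]

    # Example "filter_events": keep datasets with an event in a latitude band.
    selected = [
        ds for ds in datasets
        if any(-30.0 <= (ev.preferred_origin() or ev.origins[0]).latitude <= 10.0
               for ev in ds.events)
    ]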

mpi run process() function hangs at some cases

Hi Lion,

I am trying to run data processing using the process() function in an MPI environment.

In the following case, I have 10 ASDF files to process and I am using 4 processors. I go through them one by one and delete the dataset when I am done with it. I get the following behaviour after the program has run for a while (for example, after processing 4 files, while it is processing the 5th):

====================
Dir info:
input asdf: /lustre/atlas/proj-shared/geo111/Wenjie/DATA_SI/ASDF_new/raw/synt/C200502151442A.Mpp.synthetic.h5
output asdf: /lustre/atlas/proj-shared/geo111/Wenjie/DATA_SI/ASDF_new/proc/synt/C200502151442A.Mpp.proc_synt_50_100.h5
Processing info:
filter band: [40.0, 50.0, 100.0, 150.0]
tag map: {"synthetic" ==> "proc_synt_50_100"}
interp relative start and end(to eventcentroid time): (0.00    ,  6000.00)
interp deltat: 0.5
output_asdf exists and removed:/lustre/atlas/proj-shared/geo111/Wenjie/DATA_SI/ASDF_new/proc/synt/C200502151442A.Mpp.proc_synt_50_100.h5
Launching processing using MPI on 4 processors.
Jobs (running 2.17 seconds): queued: 142 | finished: 0 | total: 161
    Worker 1: 9 active, 0 completed jobs
    Worker 2: 9 active, 0 completed jobs
    Worker 3: 1 active, 0 completed jobs

Jobs (running 4.36 seconds): queued: 123 | finished: 0 | total: 161
    Worker 1: 19 active, 0 completed jobs
    Worker 2: 18 active, 0 completed jobs
    Worker 3: 1 active, 0 completed jobs

Jobs (running 6.51 seconds): queued: 105 | finished: 0 | total: 161
    Worker 1: 27 active, 0 completed jobs
    Worker 2: 28 active, 0 completed jobs
    Worker 3: 1 active, 0 completed jobs

Jobs (running 8.67 seconds): queued: 90 | finished: 0 | total: 161
    Worker 1: 35 active, 0 completed jobs
    Worker 2: 35 active, 0 completed jobs
    Worker 3: 1 active, 0 completed jobs

Jobs (running 10.81 seconds): queued: 76 | finished: 0 | total: 161
    Worker 1: 42 active, 0 completed jobs
    Worker 2: 42 active, 0 completed jobs
    Worker 3: 1 active, 0 completed jobs

Jobs (running 12.87 seconds): queued: 62 | finished: 0 | total: 161
    Worker 1: 49 active, 0 completed jobs
    Worker 2: 49 active, 0 completed jobs
    Worker 3: 1 active, 0 completed jobs

Jobs (running 15.02 seconds): queued: 45 | finished: 0 | total: 161
    Worker 1: 57 active, 0 completed jobs
    Worker 2: 58 active, 0 completed jobs
    Worker 3: 1 active, 0 completed jobs

Jobs (running 17.09 seconds): queued: 29 | finished: 0 | total: 161
    Worker 1: 66 active, 0 completed jobs
    Worker 2: 65 active, 0 completed jobs
    Worker 3: 1 active, 0 completed jobs

Jobs (running 19.13 seconds): queued: 13 | finished: 0 | total: 161
    Worker 1: 74 active, 0 completed jobs
    Worker 2: 73 active, 0 completed jobs
    Worker 3: 1 active, 0 completed jobs

Jobs (running 21.90 seconds): queued: 0 | finished: 0 | total: 161
    Worker 1: 80 active, 0 completed jobs
    Worker 2: 80 active, 0 completed jobs
    Worker 3: 1 active, 0 completed jobs

It seems the "Worker 3" is not reponding. So the program just hangs up. I am not quite sure what trigers that.

It might be a memory issue, since the problem never shows up if I am using 16 cores. I am not quite sure. I will dive into the code and see if I can find the bug.

Or maybe it is not good to run processing of several datasets in one mpirun (maybe Python is not good at this kind of memory management). In that case it would be safer in the future to process one file per mpiexec call.

How to run on more than 1 node?

Hi Lion,

I was trying to launch the job on more than one node (16 processors per node) on the ORNL machine.

So if I do:

[lei@rhea12 test_specfem]$ mpiexec -n 17 python process_synthetic.py -p proc_synt_27_60.params.json -f parfile/C200502151442A.proc_synt_50_100.asdf.dirs.json -v
Traceback (most recent call last):
  File "process_synthetic.py", line 6, in <module>
    from proc_util import process_synt
  File "/autofs/nccs-svm1_home1/lei/source_inversion_wf/src/preproc_asdf/proc_util.py", line 1, in <module>
    import obspy
  File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/obspy/__init__.py", line 36, in <module>
    from obspy.core.utcdatetime import UTCDateTime  # NOQA
  File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/obspy/core/__init__.py", line 107, in <module>
    from obspy.core.util.attribdict import AttribDict
  File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/obspy/core/util/__init__.py", line 27, in <module>
    from obspy.core.util.base import (ALL_MODULES, DEFAULT_MODULES,
  File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/obspy/core/util/base.py", line 28, in <module>
    import numpy as np
  File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/numpy/__init__.py", line 180, in <module>
    from . import add_newdocs
  File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/numpy/add_newdocs.py", line 13, in <module>
    from numpy.lib import add_newdoc
  File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/numpy/lib/__init__.py", line 8, in <module>
    from .type_check import *
  File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/numpy/lib/type_check.py", line 11, in <module>
    import numpy.core.numeric as _nx
  File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/numpy/core/__init__.py", line 15, in <module>
    os.environ.clear()
  File "/ccs/home/lei/anaconda2/lib/python2.7/os.py", line 501, in clear
    unsetenv(key)
OSError: [Errno 22] Invalid argument
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[47909,1],16]
  Exit code:    1
--------------------------------------------------------------------------

Do you know what causes that? It seems numpy is not happy with this :)

How to make pyasdf to support multi-date-range station inventory + custom tags?

When I add a station XML with ds.add_stationxml("multi-date-ranges.xml")
(see the attached XML file):

OA.CE22_station_inv_modified_xml.txt

and then extract the StationXML again as shown below, there are two issues:

  1. the multiple date ranges are merged into one XML node for station code "CE22".
  2. our custom tagged metadata (allowed by ObsPy and FDSN) is lost.

Can you help by commenting on these issues? Thank you!

[Extracted StationXML, flattened: created by Geoscience Australia with ObsPy 1.0.2 (https://www.obspy.org) on 2019-02-02T18:42:45; a single merged station CE22 "Transportable Array", 2017-11-04T03:16:35 to 2018-06-06T01:02:24, lat -18.49507, lon 139.002731, elevation 62.7, with 3 channels at 200.0 Hz.]

Calling parallel process() for a very large file gets stuck

I'm able to use the parallel process() function for medium-sized files without problems (on Summit), but when I tried to process a 112 GB file with 32,925 traces, the program always gets stuck. I've tried different node configurations (1 node 6 CPUs, 1 node 42 CPUs, 2 nodes 84 CPUs, etc.) but none of them worked. I was wondering if this is related to #32, and whether there is any workaround for this problem?

Below is the full debug info.
Launching processing using MPI on 6 processors.
MASTER received from WORKER 2 [WORKER_REQUESTS_ITEM] -- None
WORKER 2 sent to MASTER [WORKER_REQUESTS_ITEM] -- None
WORKER 1 sent to MASTER [WORKER_REQUESTS_ITEM] -- None
WORKER 5 sent to MASTER [WORKER_REQUESTS_ITEM] -- None
WORKER 4 sent to MASTER [WORKER_REQUESTS_ITEM] -- None
WORKER 3 sent to MASTER [WORKER_REQUESTS_ITEM] -- None
MASTER sent to WORKER 2 [MASTER_SENDS_ITEM] -- ('1A.CORRE', 'synthetic')
WORKER 2 received from MASTER [MASTER_SENDS_ITEM] -- ('1A.CORRE', 'synthetic')
MASTER received from WORKER 3 [WORKER_REQUESTS_ITEM] -- None
MASTER sent to WORKER 3 [MASTER_SENDS_ITEM] -- ('1A.NE00', 'synthetic')
MASTER received from WORKER 4 [WORKER_REQUESTS_ITEM] -- None
MASTER sent to WORKER 4 [MASTER_SENDS_ITEM] -- ('1A.NE01', 'synthetic')
WORKER 3 received from MASTER [MASTER_SENDS_ITEM] -- ('1A.NE00', 'synthetic')
RT Inventory created at 1995-06-29T12:22:32.070000Z
Created by: SPECFEM3D_GLOBE/asdf-library
http://seismic-data.org
Sending institution: SPECFEM3D_GLOBE
Contains:
Networks (1):
1A
Stations (1):
1A.CORRE (N/A)
Channels (3):
1A.CORRE.S3.MXZ, 1A.CORRE.S3.MXN, 1A.CORRE.S3.MXE
WORKER 4 received from MASTER [MASTER_SENDS_ITEM] -- ('1A.NE01', 'synthetic')
MASTER received from WORKER 5 [WORKER_REQUESTS_ITEM] -- None
MASTER sent to WORKER 5 [MASTER_SENDS_ITEM] -- ('1A.NE02', 'synthetic')
WORKER 5 received from MASTER [MASTER_SENDS_ITEM] -- ('1A.NE02', 'synthetic')
MASTER received from WORKER 1 [WORKER_REQUESTS_ITEM] -- None
MASTER sent to WORKER 1 [MASTER_SENDS_ITEM] -- ('1A.NE03', 'synthetic')
WORKER 1 received from MASTER [MASTER_SENDS_ITEM] -- ('1A.NE03', 'synthetic')
WORKER 2 sent to MASTER [WORKER_REQUESTS_ITEM] -- None
MASTER received from WORKER 2 [WORKER_REQUESTS_ITEM] -- None
MASTER sent to WORKER 2 [MASTER_SENDS_ITEM] -- ('1A.NE04', 'synthetic')
WORKER 2 received from MASTER [MASTER_SENDS_ITEM] -- ('1A.NE04', 'synthetic')

A more flexible I/O

I think the current output method is too limited. For example, in your asdf_data_set.py:

def process_two_files_without_parallel_output(
            self, other_ds, process_function, traceback_limit=3):
        """
        Process data in two data sets.

This method simply gathers everything to the master node.

However, in my case, when calculating adjoint sources, I need to use this function and write out the adjoint sources in parallel. So I modified your function to look like this:

    def process_two_files(self, other_ds, process_function,
                          output_filename=None, traceback_limit=3):
        """
        Process data in two data sets.

So if output_filename is specified, it writes out the adjoint sources in parallel (I embedded the adjoint writer in this function, so it is really limited). If output_filename is None, it just gathers the information to the
master node, as you did.

However, this is also very limited. For example, if a user wants to use this function to achieve more goals, it is very hard unless they change the source code. So my opinion is that we should give more freedom to the user. How should we do that? I recommend we change it starting from this line:

                print(msg)
                print(tb)
            else:
                results[station] = result
           # just return results here
            return results

After we have the results, they are sitting on different processors, and we just return them. People can then define their process functions any way they like and are responsible for writing things out themselves.

For example, if people take the raw observed and synthetic data and want to march all the way down to adjoint sources, but still want to keep the processed observed and synthetic data as well as the window information, they can just keep everything and write it out afterwards.

If you are worried about parallel writing of the ASDF file, I think we can provide some independent
writers, e.g. writers for waveforms and auxiliary data. We don't need to hard-code the writer into those process functions. For example, if people have waveforms sitting on different processors, they may call:

    dump_waveform_to_asdf(list_of_waveforms, output_filename, mpi_comm).

If they want to write out auxiliary data:

    dump_auxiliary_to_asdf(list_of_auxiliary, output_filename, mpi_comm).

This gives the user more freedom to design their workflow. One drawback I can think of is that users must be careful about memory. But on modern computer clusters a single node usually has 64 GB to 128 GB of memory, so it shouldn't be a big issue.

File naming of test dataset results in error during clone on Windows

I should preface this by saying, I'm not sure if you support Windows, or want to support it, but I have run into a small issue cloning the repository (as the suggested install method). Files:

  • pyasdf/tests/data/small_sample_data_set/AE.113A..BH*.xml
  • pyasdf/tests/data/small_sample_data_set/TA.POKR..BH*.xml
    cannot be created on Windows because of the * in the file name. Do you know of any way around that, or can the files be renamed?

I am running on Windows 7, Python 2.7, git 2.10.1.windows.1.

The main reason for asking is that I would like to use pyasdf in EQcorrscan (for storing templates and detections with waveforms alongside catalogs and metadata from processing) and have run into issues running tests on AppVeyor.

The full error message is:

$ git clone https://github.com/SeismicData/pyasdf.git
Cloning into 'pyasdf'...
remote: Counting objects: 2391, done.
remote: Compressing objects: 100% (42/42), done.
remote: Total 2391 (delta 23), reused 0 (delta 0), pack-reused 2349
Receiving objects: 100% (2391/2391), 4.46 MiB | 1017.00 KiB/s, done.
Resolving deltas: 100% (1618/1618), done.
error: unable to create file pyasdf/tests/data/small_sample_data_set/AE.113A..BH*.xml: Invalid argument
error: unable to create file pyasdf/tests/data/small_sample_data_set/TA.POKR..BH*.xml: Invalid argument
fatal: unable to checkout working tree
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry the checkout with 'git checkout -f HEAD'

Did you implement parallel test in pyasdf?

Hi Lion,

Did you implement the mpi test running on Travis?

We are setting up Jenkins on the Princeton cluster so we will be able to do some MPI tests in the future. Let me know how you currently solve this problem.

Also, another question regarding ObsPy: I found that when using conda install -c obspy obspy, the default version is 1.0.0. It has a lot of API changes (which make part of my code fail with the new version). So should I pin my ObsPy version to 0.10.2 or should I use the new version?

Bug found in job splitting

Hi Lion,

A few months ago I mentioned a bug in pyasdf where it failed splitting jobs on 16 processors. I think I have more or less located it. It is related to how you split the jobs.

Please check lines 1242-1252 in asdf_data_set.py:
+++++++++++++++++++++++++++++++++

        # Divide into chunks, each rank takes their corresponding chunks.
        def chunks(l, n):
            """
            Yield successive n-sized chunks from l.
            From http://stackoverflow.com/a/312464/1657047
            """
            for i in range(0, len(l), n):
                yield l[i:i+n]

        chunksize = int(math.ceil(len(usable_stations) / self.mpi.size))
        all_chunks = list(chunks(usable_stations, chunksize))

++++++++++++++++++++++++++++++++++
Let me give an example. If we have 164 streams and 16 processors, the chunksize will be 11 since you used math.ceil. The first 15 processors will then already consume all the jobs (15 * 11 = 165 > 164). Thus len(all_chunks) == 15 and the 16th processor has nothing to take from all_chunks (invalid index). This kind of problem will always happen if the number of processes is larger than the chunksize.

However, I think the current version of mpi4py doesn't handle the error very well, so the program hangs instead of quitting.

I have written another piece of job-splitting code to substitute for the above chunk:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

def chunks(l, nchunks):
    all_chunks = []
    # general calculation
    ntotal = len(l)
    chunksize = int(math.floor(ntotal / nchunks))
    nrest = ntotal % nchunks
    # calculate number of jobs for every chunk
    njob_chunk = [chunksize for i in range(nchunks)]
    for i in range(nrest):
        njob_chunk[i] += 1

    # assign jobs for each chunk
    for i in range(nchunks):
        start_idx = sum(njob_chunk[0:i])
        end_idx = sum(njob_chunk[0:(i + 1)])
        all_chunks.append(l[start_idx:end_idx])

    return all_chunks

# calling in a different way
all_chunks = chunks(usable_stations, self.mpi.size)

++++++++++++++++++++++++++++++++++
The basic idea is more easily illustrated with an example. If I have 164 streams and 16 processors, the chunksize would be 10, with 4 streams left over. The 4 remaining streams are assigned to [rank 0, rank 1, ..., rank 3], each of which gets one more. In this way the jobs are split more evenly.

As for the error catching in mpi4py, I have another issue in window selection where the error couldn't be caught, so I had to use a brute-force way to catch it. I may discuss that with you later on.
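
For what it's worth, numpy already implements exactly this "spread the remainder over the first chunks" behaviour, so inside the same method the split could also be done like this:

    import numpy as np

    # np.array_split spreads the remainder over the first chunks, e.g.
    # 164 stations on 16 ranks -> 4 chunks of 11 and 12 chunks of 10.
    indices = np.array_split(np.arange(len(usable_stations)), self.mpi.size)
    all_chunks = [[usable_stations[i] for i in idx] for idx in indices]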

Why are tag paths validated against r"^[a-zA-Z0-9][a-zA-Z0-9_]*[a-zA-Z0-9]$"?

Why are tag path names validated against r"^[a-zA-Z0-9][a-zA-Z0-9_]*[a-zA-Z0-9]$"? This is more restrictive than the naming scheme used in /Waveforms. For example /AuxiliaryData/NET.STA/MYDATASET_NAME would not be allowed due to the . in NET.STA. It would be nice to allow consistency in the layout between /Waveforms and /AuxiliaryData.

Our use case is that we are processing event-based ground motion records and want to add datasets like /AuxiliaryData/WaveformMetrics/NET.STA/NET.STA.CHAN_EVENTID_LABEL where EVENTID_LABEL is the "tag". We would like to keep as much consistency in the naming scheme as possible between the waveforms and the ground-motion metrics.

cannot import pyasdf

I just installed pyasdf on my Mac, and I tried importing it in ipython, which resulted in the following error:

In [1]: import pyasdf

AttributeError Traceback (most recent call last)
in
----> 1 import pyasdf

~/miniconda/envs/gmprocess/lib/python3.7/site-packages/pyasdf/__init__.py in
11
12 from .exceptions import ASDFException, ASDFWarning, WaveformNotInFileException
---> 13 from .asdf_data_set import ASDFDataSet
14
15

~/miniconda/envs/gmprocess/lib/python3.7/site-packages/pyasdf/asdf_data_set.py in
63 SUPPORTED_FORMAT_VERSIONS, MSG_TAGS, MAX_MEMORY_PER_WORKER_IN_MB,
64 POISON_PILL, PROV_FILENAME_REGEX, TAG_REGEX, VALID_SEISMOGRAM_DTYPES
---> 65 from .query import Query, merge_query_functions
66 from .utils import is_mpi_env, StationAccessor, sizeof_fmt, ReceivedMessage,
67 pretty_receiver_log, pretty_sender_log, JobQueueHelper, StreamBuffer, \

~/miniconda/envs/gmprocess/lib/python3.7/site-packages/pyasdf/query.py in
102 "elevation_in_m": _type_or_none(float),
103 # Temporal constraints.
--> 104 "starttime": obspy.UTCDateTime,
105 "endtime": obspy.UTCDateTime,
106 "sampling_rate": float,

AttributeError: module 'obspy' has no attribute 'UTCDateTime'

bug introduced in recent commit

Hi Lion,

When using pyasdf to process observed data (with MPI), I found that the asdf_data_set.process() function does not return correctly. To be more specific, every processor runs through the process() function but cannot return to the upper level.

I did some research and found that the bug was introduced in commit c75d65b.

The commit before this is fine.

We have something set up on Bitbucket but it is private for now. @mpbl, is it OK to give Lion permission to the repository (source_inversion_wf) so he can run our example test case?

We have everything set up there (in the Bitbucket repo), so after downloading it you will be able to run some tests and you will see the bug.

Fast extraction of event location data for all waveforms

Hi @krischer! I have ASDF files with ~100 events and ~100 stations. I try to make a list of the source and receiver locations of all waveforms using the following loop:

import pyasdf
from obspy.clients.iris import Client

client = Client()

ds = pyasdf.ASDFDataSet(file)

iwav = 0  # waveform counter (needs initialising before the loop)
for event in ds.events:
    origin = event.preferred_origin() or event.origins[0]

    for station in ds.ifilter(ds.q.event == event, ds.q.channel == 'BHZ'):
        result = client.distaz(station.coordinates["latitude"], station.coordinates["longitude"], origin.latitude,
                               origin.longitude)
        deg = result['distance']
        baz = result['backazimuth']

        iwav += 1
        print('%10d %15.4f %15.4f %15.4f %15.4f %6.1f %6.1f'
          % (iwav, origin.latitude, origin.longitude,
             station.coordinates["latitude"], station.coordinates["longitude"],
             deg, baz))

The code works, but the event query is very slow. Do you know of a better way? Perhaps a query and loop over all BHZ traces would be faster, but in that case I don't know how to get the origin information for each trace.

setup broken between 0.5.1 and 0.6.1

Can't install pyasdf into home directory.

% pip install -v .
ERROR: (Gentoo) Please run pip with the --user option to avoid breaking python-exec                                                                                                                                              
Exception information:                                                                                          
Traceback (most recent call last):                                                                                                                                                                                               
  File "/usr/lib/python3.7/site-packages/pip/_internal/cli/base_command.py", line 153, in _main
    status = self.run(options, args)                                                                                                                                                                                             
  File "/usr/lib/python3.7/site-packages/pip/_internal/commands/install.py", line 289, in run
    raise CommandError("(Gentoo) Please run pip with the --user option to avoid breaking python-exec") 
pip._internal.exceptions.CommandError: (Gentoo) Please run pip with the --user option to avoid breaking python-exec

% pip install -v . --user
Created temporary directory: /tmp/pip-ephem-wheel-cache-4gbfsomu
Created temporary directory: /tmp/pip-req-tracker-4joucdkx
Created requirements tracker '/tmp/pip-req-tracker-4joucdkx'
Created temporary directory: /tmp/pip-install-qxlbbpdr                                                                                                                                                                           
Processing pyasdf
  Created temporary directory: /tmp/pip-req-build-esqgclk9
  Added file:///.../pyasdf to build tracker '/tmp/pip-req-tracker-4joucdkx'
  Created temporary directory: /tmp/pip-build-env-_bf99xeq
  Running command /usr/bin/python3.7 /usr/lib/python3.7/site-packages/pip install --ignore-installed --no-user --prefix /tmp/pip-build-env-_bf99xeq/overlay --no-warn-script-location -v --no-binary :none: --only-binary :none: 
-i https://pypi.org/simple -- 'setuptools>=40.8.0' wheel 
  ERROR: (Gentoo) Please run pip with the --user option to avoid breaking python-exec
  Exception information:
  Traceback (most recent call last):
    File "/usr/lib/python3.7/site-packages/pip/_internal/cli/base_command.py", line 153, in _main
      status = self.run(options, args)
    File "/usr/lib/python3.7/site-packages/pip/_internal/commands/install.py", line 289, in run
      raise CommandError("(Gentoo) Please run pip with the --user option to avoid breaking python-exec")
  pip._internal.exceptions.CommandError: (Gentoo) Please run pip with the --user option to avoid breaking python-exec
  Installing build dependencies ... error
Cleaning up...
  Removing source in /tmp/pip-req-build-esqgclk9
Removed file:///.../pyasdf from build tracker '/tmp/pip-req-tracker-4joucdkx'
Removed build tracker '/tmp/pip-req-tracker-4joucdkx'
ERROR: Command errored out with exit status 1: /usr/bin/python3.7 /usr/lib/python3.7/site-packages/pip install --ignore-installed --no-user --prefix /tmp/pip-build-env-_bf99xeq/overlay --no-warn-script-location -v --no-binary
 :none: --only-binary :none: -i https://pypi.org/simple -- 'setuptools>=40.8.0' wheel Check the logs for full command output.
Exception information:
Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/pip/_internal/cli/base_command.py", line 153, in _main
    status = self.run(options, args)
  File "/usr/lib/python3.7/site-packages/pip/_internal/commands/install.py", line 385, in run
    resolver.resolve(requirement_set)
  File "/usr/lib/python3.7/site-packages/pip/_internal/legacy_resolve.py", line 201, in resolve
    self._resolve_one(requirement_set, req)
  File "/usr/lib/python3.7/site-packages/pip/_internal/legacy_resolve.py", line 365, in _resolve_one
    abstract_dist = self._get_abstract_dist_for(req_to_install)
  File "/usr/lib/python3.7/site-packages/pip/_internal/legacy_resolve.py", line 313, in _get_abstract_dist_for
    req, self.session, self.finder, self.require_hashes
  File "/usr/lib/python3.7/site-packages/pip/_internal/operations/prepare.py", line 224, in prepare_linked_requirement
    req, self.req_tracker, finder, self.build_isolation,
    "Installing build dependencies"
  File "/usr/lib/python3.7/site-packages/pip/_internal/build_env.py", line 201, in install_requirements
    call_subprocess(args, spinner=spinner)
  File "/usr/lib/python3.7/site-packages/pip/_internal/utils/subprocess.py", line 242, in call_subprocess
    raise InstallationError(exc_msg)
pip._internal.exceptions.InstallationError: Command errored out with exit status 1: /usr/bin/python3.7 /usr/lib/python3.7/site-packages/pip install --ignore-installed --no-user --prefix /tmp/pip-build-env-_bf99xeq/overlay --n
o-warn-script-location -v --no-binary :none: --only-binary :none: -i https://pypi.org/simple -- 'setuptools>=40.8.0' wheel Check the logs for full command output.

The latest version listed as working with Python 3.6 (0.5.1) works, but the latest version, which requires Python 3.7, doesn't.

machine 1 (ok)
setuptools 44.0
python 3.6
pyasdf 0.5.1

machine 2 (not ok)
setuptools 44.1
python 3.7
pyasdf 0.6.1 or git

mpi.comm.allgather may not be necessary (speed concern, not sure)

Hi Lion,

In line 1267 in asdf_data_set.py:

 gathered_results = self.mpi.comm.allgather(results)

I think allgather may not be necessary here; gather would be enough, for two reasons:

  1. at this stage people may only want the final results, so there is no need to keep duplicates on each processor.
  2. for speed. If people have a lot of data sitting on each processor, allgather will take more time than gather since it needs to broadcast all results.

My preference would be:

        gathered_results = self.mpi.comm.gather(results, root=0)
        results = {}
        if self.mpi.rank == 0:
            for result in gathered_results:
                results.update(result)
        return results

What do you think?

Very short separate traces are labeled duplicate if starting in the same second

I'm implementing ASDF for my lab seismology/AE experiments and have run into a timescales issue. My waveforms come as ~1 Msample blocks at 20 MHz, and often multiple of these blocks occur within one second. When add_waveforms checks for duplicate traces, the lines quoted below from __get_waveform_ds_name truncate the times to seconds and thus reject any traces that start in the same second as duplicates.

pyasdf/pyasdf/asdf_data_set.py

Lines 1299 to 1300 in c976927

start=start.strftime("%Y-%m-%dT%H:%M:%S"),
end=end.strftime("%Y-%m-%dT%H:%M:%S"),

This is a three character fix:
"%Y-%m-%dT%H:%M:%S" -> "%Y-%m-%dT%H:%M:%S.%f"

I'm new to open source contributing and this community. Should I just clone and make this fix for myself?

unit issue

Hi @krischer @Jas11

I think we now need to make the units in the ASDF file uniform.

Two things I have noticed in quakeml:

  1. the depth unit is meters in a standard QuakeML file, but in a CMTSOLUTION file (and in specfem3d_globe) it is km (kilometers).
  2. for the moment tensor, CMTSOLUTION (specfem3d_globe) uses dyne·cm, but a standard QuakeML file uses N·m.

By "standard quakeml" I mean the quakeml you gave me(convert from NDK file), Lion.

So do we need to keep the units the same, or should we add another attribute to the QuakeML to specify the unit?

Issues related to traceback print

Hi Lion,

I found there is an issue related to the traceback printing.

First, our ASDF data contains a trace like the one I mentioned here:
obspy/obspy#1371

pyasdf first gives an error like this:

Error during the processing of station 'XA.SA72' and tag 'raw_observed' on rank 4:
Traceback (At max 3 levels - most recent call last):
  File "/autofs/nccs-svm1_home1/lei/software/pyasdf/pyasdf/asdf_data_set.py", line 1708, in process
    traceback_limit=traceback_limit)
  File "/autofs/nccs-svm1_home1/lei/software/pyasdf/pyasdf/asdf_data_set.py", line 1725, in _dispatch_processing_mpi
    traceback_limit=traceback_limit)
  File "/autofs/nccs-svm1_home1/lei/software/pyasdf/pyasdf/asdf_data_set.py", line 1907, in _dispatch_processing_mpi_worker_node
    stream = process_function(stream, inv)
  File "/autofs/nccs-svm1_home1/lei/software/pypaw/src/pypaw/process.py", line 28, in process_wrapper
    return process(stream, inventory=inv, **param)
  File "/autofs/nccs-svm1_home1/lei/software/pytomo3d/pytomo3d/signal/process.py", line 228, in process
    st.detrend("linear")
  File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/obspy-0.10.2-py2.7-linux-x86_64.egg/obspy/core/util/decorator.py", line 241, in new_func
    return func(*args, **kwargs)
  File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/obspy-0.10.2-py2.7-linux-x86_64.egg/obspy/core/stream.py", line 2304, in detrend
    tr.detrend(type=type)
  File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/obspy-0.10.2-py2.7-linux-x86_64.egg/obspy/core/util/decorator.py", line 258, in new_func
    return func(*args, **kwargs)
  File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/obspy-0.10.2-py2.7-linux-x86_64.egg/obspy/core/util/decorator.py", line 241, in new_func
    return func(*args, **kwargs)
  File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/obspy-0.10.2-py2.7-linux-x86_64.egg/obspy/core/trace.py", line 231, in new_func
    result = func(*args, **kwargs)
  File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/obspy-0.10.2-py2.7-linux-x86_64.egg/obspy/core/trace.py", line 1817, in detrend
    self.data = func(self.data, **options)
  File "/ccs/home/lei/anaconda2/lib/python2.7/site-packages/scipy/signal/signaltools.py", line 1900, in detrend
    newdata = newdata.astype(dtype)

ValueError: could not convert string to float: �

This is because the data array contains strings, not floats. This error should then be caught here:
https://github.com/SeismicData/pyasdf/blob/master/pyasdf/asdf_data_set.py#L1751

However, when executing this line, another error comes out:

tb += "".join(exc_line)

The error log is:

Traceback (most recent call last):
  File "process_asdf.py", line 19, in <module>
    proc.smart_run()
  File "/autofs/nccs-svm1_home1/lei/software/pypaw/src/pypaw/procbase.py", line 201, in smart_run
    self._core(path, param)
  File "/autofs/nccs-svm1_home1/lei/software/pypaw/src/pypaw/process.py", line 85, in _core
    ds.process(process_function, output_asdf, tag_map=tag_map)
  File "/autofs/nccs-svm1_home1/lei/software/pyasdf/pyasdf/asdf_data_set.py", line 1708, in process
    traceback_limit=traceback_limit)
  File "/autofs/nccs-svm1_home1/lei/software/pyasdf/pyasdf/asdf_data_set.py", line 1725, in _dispatch_processing_mpi
    traceback_limit=traceback_limit)
  File "/autofs/nccs-svm1_home1/lei/software/pyasdf/pyasdf/asdf_data_set.py", line 1928, in _dispatch_processing_mpi_worker_node
    tb += "".join(exc_line)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 47: ordinal not in range(128)

netcdf

Hi Lion,

A friend asked me about ASDF, but unfortunately he wants NetCDF...

I think HDF5 is generally considered better than NetCDF, but do we have plans to provide NetCDF support in the future?

tag strings starting with numbers

Hi Lion,

perhaps we are missing something in the docs, or maybe something is not working properly:

pyasdf allows tagging waveforms (or any other data) with strings starting with numbers, but trying to access previously tagged data results in an "invalid syntax" exception.

Moreover, network names starting with numbers result in the same behaviour (the FDSN specification allows numbers in network codes, and some actual networks already use this syntax).

PS: we are also allowed to insert special characters in tag strings... should that be forbidden?

Sample sessions below (using the latest pyasdf from the git repository).

a) tag string starting with numbers

In [1]: import pyasdf

In [2]: ds = pyasdf.ASDFDataSet("test.h5", compression="gzip-3")

In [3]: ds.add_waveforms('input.mseed', tag='00_XX', event_id='test')

In [4]: ds.waveforms.NZ_RTZ.00_XX
  File "<ipython-input-4-124329b5b6a4>", line 1
    ds.waveforms.NZ_RTZ.00_XX
                         ^
SyntaxError: invalid syntax

b) tag string starting with letters

In [5]: ds.add_waveforms('input.mseed', tag='YY_00', event_id='test')

In [6]: ds.waveforms.NZ_RTZ.YY_00
Out[6]: 
1 Trace(s) in Stream:
NZ.RTZ.20.HNE | 2016-02-01T19:02:03.000000Z - 2016-02-01T19:05:23.995000Z | 200.0 Hz, 40200 samples

c) tag string containing special chars (....is this allowed?)

In [9]: ds.add_waveforms('input.mseed', tag='YY_0%%%%0', event_id='test')

In [10]: ds.waveforms.NZ_RTZ.YY_0%%%%0
  File "<ipython-input-10-1e37cee81f48>", line 1
    ds.waveforms.NZ_RTZ.YY_0%%%%0
                             ^
SyntaxError: invalid syntax

d) network names starting with numbers result in the same error!

In [4]: ds.waveforms.11_RTZ
  File "<ipython-input-4-20ad8374b5b6>", line 1
    ds.waveforms.11_RTZ
                  ^
SyntaxError: invalid syntax
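
For what it's worth, the SyntaxError in cases a) and d) is raised by Python's own grammar (an identifier cannot start with a digit), not by pyasdf, so such tags and station names remain reachable programmatically:

    # Attribute syntax is limited by Python's grammar; getattr() still works.
    st = getattr(ds.waveforms.NZ_RTZ, "00_XX")  # tag starting with a digit
    sta = getattr(ds.waveforms, "11_RTZ")       # network code starting with a digit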

numpy should probably be a dependency

This is only a very minor issue. I realize that numpy is directly imported several times:

pyasdf/asdf_data_set.py:import numpy as np
pyasdf/tests/test_asdf_data_set.py:import numpy as np
pyasdf/utils.py:import numpy as np

So it probably should become a formal dependency.

_sync_metadata() impacts the performance

Hi Lion,

I recently found that running multiple processing jobs at the same time greatly increases the running time. For example, when running only one job, it finishes in about 25 s. When running 5 jobs at the same time, each job takes about 90 s. When running 20 jobs at the same time, it becomes really, really slow.

After some simple investigation, I found the problem comes from the _sync_metadata() method in asdf_data_set.py. I will investigate further and let you know the result.

parallel I/O and python 3

Hello,

Is there any improvement in the compatibility of pyasdf's parallel I/O mode with Python 3?

proposal: adding a small dataset

I would propose including a relatively small but valid dataset, so that getting started, doing some first tests, and exploring and understanding the structure becomes easier and more immediate.
