fabtests's People

Contributors

a-ilango, aingerson, dmitrygx, dsolovyev, goodell, hppritcha, jbbintel, jithinjose, jlbyrne-hpe, jsquyres, msalnik, nrspruit, p91paul, prankurgupta, raffenet, rfaucett, shefty, soniczhao, stanfordlightfoot, sungeunchoi, vkrishna, xuywang

fabtests's Issues

Interface structure versioning

From OFIWG F2F, interface structure versioning will be done using a size field within the struct. A query method (or static inline or define) will indicate if a specific interface is available.
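
A minimal sketch of how a size field can drive an availability check. The struct, macro, and function names here are hypothetical and are not the libfabric definitions; the sketch only assumes that a provider fills in `size` with the struct size it was built against.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interface table; names are illustrative, not libfabric's. */
struct demo_ops {
	size_t size;                          /* struct size known to the provider */
	int (*send)(int value);               /* present since the first version */
	int (*send_v2)(int value, int flags); /* added in a later version */
};

/* True if 'field' lies within the size the provider reported and is set.
 * The size test comes first so the field is never read out of bounds. */
#define DEMO_OP_AVAILABLE(ops, field) \
	((ops)->size >= offsetof(struct demo_ops, field) + sizeof((ops)->field) && \
	 (ops)->field != NULL)

static int demo_send(int value)
{
	return value;
}

/* An "old" provider that only knows the struct up to and including 'send'. */
static struct demo_ops demo_old_provider(void)
{
	struct demo_ops ops = { 0 };

	ops.size = offsetof(struct demo_ops, send_v2);
	ops.send = demo_send;
	return ops;
}

/* Call the newer entry point only when the provider actually exposes it. */
static int demo_call_send_v2(const struct demo_ops *ops, int value, int flags)
{
	assert(ops != NULL);
	if (!DEMO_OP_AVAILABLE(ops, send_v2))
		return -1; /* caller falls back to the older entry point */
	return ops->send_v2(value, flags);
}
```

With the old provider above, `DEMO_OP_AVAILABLE(&ops, send)` holds while `DEMO_OP_AVAILABLE(&ops, send_v2)` does not, so `demo_call_send_v2` returns -1 instead of calling through a missing pointer.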

CLOCK_MONOTONIC_RAW not always supported

The unit test fi_eq_test makes use of CLOCK_MONOTONIC_RAW. Though it's been available in Linux for a while, slightly older versions of glibc did not make it available (like on some of the enterprise systems I'm using). Seems like something that could be checked at config time. FWIW, I think you could #define it by hand with the value from linux/time.h and things would work, though it would be a bit unsatisfying.
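
One possible compile-time alternative to a configure check, sketched below: use CLOCK_MONOTONIC_RAW when the headers define it and fall back to CLOCK_MONOTONIC otherwise. This is a different technique from defining the value by hand, and it assumes CLOCK_MONOTONIC is precise enough for the test's timing. `DEMO_CLOCK` and `demo_now_ns` are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Assumption: CLOCK_MONOTONIC is an acceptable substitute for this test. */
#ifdef CLOCK_MONOTONIC_RAW
#define DEMO_CLOCK CLOCK_MONOTONIC_RAW
#else
#define DEMO_CLOCK CLOCK_MONOTONIC
#endif

/* Current time of the selected clock in nanoseconds. */
static int64_t demo_now_ns(void)
{
	struct timespec ts;
	int ret = clock_gettime(DEMO_CLOCK, &ts);

	assert(ret == 0);
	(void)ret; /* keeps builds with NDEBUG free of unused warnings */
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
```

On older glibc versions the program may also need to link with -lrt for clock_gettime.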

Error when attempting to enhance simple/msg_pingpong.c with a -f option.

Some of the tests support a -f option to allow giving a fabric hint (such as rdm_pingpong.c). I am trying to enhance msg_pingpong.c to have a -f option and am encountering an issue. The changes I've made can be viewed in commit: 36ff792.

If I run fi_msg_pingpong as a server I get the result:

[wf143][99.35] :: export SFI_PSM_NAME_SERVER=1
[wf143][99.51] :: ./fi_msg_pingpong -f IP
EP opened on fabric IP

Which seems okay, but the client errors in this fashion:

[wf144][99.35] :: export SFI_PSM_NAME_SERVER=1
[wf144][99.35] :: ./fi_msg_pingpong -d wf143 -f IP
psmx_resolve_name: couldn't connect to wf143:4096
[SOCK_ERROR - sock_eq_openwait:430]: no available AF_INET address
no available AF_INET address: Success
[SOCK_ERROR - sock_util_sendto:71]: sendto failed with error 9 - Bad file descriptor
fi_connect Bad file descriptor

If I enable socket log output:

[wf144][99.35] :: export OFI_SOCK_LOG_LEVEL=4
[wf144][99.35] :: ./fi_msg_pingpong -d wf143 -f IP
psmx_resolve_name: couldn't connect to wf143:4096
[SOCK_INFO - sock_msg_getinfo:357]: dest_addr: family: 2, IP is 192.168.0.143
[SOCK_INFO - sock_msg_getinfo:363]: src_addr: family: 2, IP is 192.168.0.144
[SOCK_INFO - sock_pe_init_table:2097]: PE table init: OK
[SOCK_INFO - sock_pe_init:2121]: PE init: OK
[SOCK_INFO - sock_pe_progress_thread:2044]: Progress thread started
[SOCK_INFO - _sock_conn_listen:274]: Binding listener thread to port: 60308
[SOCK_INFO - sock_eq_openwait:409]: enter
[SOCK_ERROR - sock_eq_openwait:430]: no available AF_INET address
no available AF_INET address: Success
[SOCK_INFO - sock_pe_add_tx_ctx:1867]: TX ctx added to PE
[SOCK_INFO - sock_pe_add_rx_ctx:1875]: RX ctx added to PE
[SOCK_INFO - sock_rx_new_entry:58]: New rx_entry: 0x985a30, ctx: 0x9232e0
[SOCK_INFO - sock_ep_recvmsg:108]: New rx_entry: 0x985a30 (ctx: 0x9232e0)
[SOCK_ERROR - sock_util_sendto:71]: sendto failed with error 9 - Bad file descriptor
fi_connect Bad file descriptor
[SOCK_INFO - sock_pe_progress_thread:2078]: Progress thread terminated
[SOCK_INFO - sock_pe_finalize:2145]: Progress engine finalize: OK

If I break the server at sock_eq_openwait I get the following:

Breakpoint 1, sock_eq_openwait (eq=0x605220, service=0x6051cc "9228") at prov/sockets/src/sock_eq.c:408

Notice that the value for service is "9228" which is correct.

But if I break the client at sock_eq_openwait I get the following:

Breakpoint 1, sock_eq_openwait (eq=0x67c8f0, service=0x6055d4 "-\243") at prov/sockets/src/sock_eq.c:408

The value of service changes every time. Not exactly sure what is going on there.

Another issue is that the test continues after the first failure in fi_ep_bind in bind_ep_res (msg_pingpong.c:255) since ret doesn't contain an error value.

Am I missing something, or is it just not behaving as expected?

EDIT:
A bit more information:

I just re-compiled libfabric with:

./configure --prefix=/users/bturrubiates --disable-verbs
make install

I then switched to the master branch of fabtests that contains the unmodified test and attempted to execute it and I get the exact same behavior.

Is this an issue with the test, or in libfabric?

@hppritcha

Cmatose port

cmatose uses port# 7471, which is used by Apple QuickTime Streaming Server.
This causes the test to fail on Mac.

Is it OK to change the port to something else?
Just double checking, since it is a ported example.

fi_mr_reg "access" parameter incorrect

I think that the access parameter to fi_mr_reg in fi_msg_pingpong is incorrect (should be something like FI_SEND|FI_RECV, not 0), and I suspect it's incorrect elsewhere in the fabtests codebase. I don't have spare cycles right now to audit things, so I'm filing this as a reminder that we need to check this at some point.

Add src_addr to simple tests

The basic tests don't have a src_addr option. Adding src_addr to the following tests would help unify the code structure:

  • msg.c
  • rdm.c
  • rdm_rma_simple.c

Warning in common/shared.c

@pmmccorm You have some "cleanups for fabtests" PRs recently, so I thought I'd throw this one your way... :-)

With gcc 4.9.1:

common/shared.c: In function ‘init_test’:
common/shared.c:150:29: warning: argument to ‘sizeof’ in ‘snprintf’ call is the same expression as the destination; did you mean to provide an explicit length? [-Wsizeof-pointer-memaccess]
  snprintf(test_name, sizeof test_name, "%s_lat", sstr);
                             ^
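
This warning usually means the destination is a pointer in that scope, so sizeof yields the pointer size rather than the buffer capacity. A minimal sketch of the usual fix, assuming that is the case here: pass the real capacity explicitly. `demo_format_test_name` is a hypothetical helper, not a function in common/shared.c.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper; 'len' is the real capacity of 'dst', given by the caller. */
static int demo_format_test_name(char *dst, size_t len, const char *sstr)
{
	int n;

	assert(dst != NULL && len > 0);
	/* sizeof dst here would be the size of a pointer, not of the buffer. */
	n = snprintf(dst, len, "%s_lat", sstr);
	return (n >= 0 && (size_t)n < len) ? 0 : -1; /* -1 on error or truncation */
}
```

A caller that owns an array can still write `demo_format_test_name(name, sizeof name, sstr)`, because sizeof is applied where the array type is visible.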

PR #194 broke fi_ud_pingpong and other tests

CC: @shantonu

The master branch is currently broken for usNIC:

$ FI_LOG_LEVEL=3 ./simple/fi_ud_pingpong
libfabric:fi_register_provider():108<2> registering provider: usnic (1.0)
libfabric:fi_register_provider():108<2> registering provider: verbs (1.0)
libfabric:fi_register_provider():108<2> registering provider: sockets (1.0)
libfabric:verbs:fi_ibv_check_info():391<2> Required mode bits not set
libfabric:fi_getinfo():432<1> fi_getinfo: provider verbs returned -61 (No data available)
libfabric:sockets:sock_ep_getinfo():324<3> src_addr: 127.0.0.1
libfabric:usnic:usdf_fabric_open():979<3> successfully opened usnic_0/eth6
Unable to resolve remote address 0x0 0x0
fi_close(): 191, -16 (Device or resource busy)
fi_close(): 197, -16 (Device or resource busy)

I haven't tracked down the root cause, but PR #194 seems to be the culprit, specifically commit f255f0d. I haven't been able to spot the bug yet by inspection, so I'll start proper debugging soon.

fi_rdm_pingpong is also failing, though with different symptoms:

$ FI_LOG_LEVEL=3 FI_PROVIDER=usnic ./simple/fi_rdm_pingpong  
[... start other side in other window ...]
libfabric:fi_register_provider():108<2> registering provider: usnic (1.0)
libfabric:fi_register_provider():108<2> registering provider: verbs (1.0)
libfabric:fi_register_provider():125<2> "verbs" filtered by provider include/exclude list, skipping
libfabric:fi_register_provider():108<2> registering provider: sockets (1.0)
libfabric:fi_register_provider():125<2> "sockets" filtered by provider include/exclude list, skipping
libfabric:usnic:usdf_fabric_open():979<3> successfully opened usnic_0/eth6
fi_recv(): 117, -11 (Resource temporarily unavailable)

ud_pingpong AV issues

The UD pingpong test has at least 2 issues. The first is relatively easy to fix. The av_attr structure is not fully initialized before being used. A memset 0 would work.

The second is that the server side never enters the client address into its AV. It simply expects the CQ readfrom call to have inserted the value. As a generic example for UD, I would recommend having the client call getname(), insert the results in a message, and send that to the server. The server would extract the name from the message, insert it into its AV, and send an ack response. This would be done as part of the communication setup before real data transfers actually start.

Copying @rfaucett (author) and @luomiao (who found the issues). Reese, can I assign you to this?

Determining associated completion queue for waitset and pollset

A waitset (fid_wait) can be attached to multiple event queues, but fi_wait doesn't return any information about the signaled waitset. How do we know which completion queue the signaled waitset refers to?

Also note that fi_poll returns the contexts associated with the event queue for the corresponding pollset, but it might be more generic to return fids.

fi_rdm and psmx_resolve_hostname

We are trying to run some of the FI_EP_RDM tests in fabtests/simple on our qlogic cluster.
The tests invariably fail in a getaddrinfo call within psmx_resolve_hostname with messages
like:

bash-4.1$ ./fi_rdm -d 10.16.9.48
psmx_resolve_name: couldn't connect to 10.16.9.48:4096

When we build libfabric on the intel/qlogic systems, we get socket/verbs/psm providers,
and fi_getinfo test returns sensible information. However, whenever we try to run any of the
client/server tests that use FI_EP_RDM, we hit this hostname resolution error in psmx_resolve_name -
aka getaddrinfo.

Note these are production psm-based systems, so there is not likely to be anything wrong with the base OFED setup.

@bturrubiates
@shantonu

fi_rdm_inject_pingpong can hang

Because injected sends do not generate a completion, the last inject call may be aborted when the app exits. The result is that the peer side hangs. The test needs to be modified to synchronize with the peer before exiting.

Remove redundant wait_for_completion code from every test

We should make use of wait_for_completion and wait_for_data_completion in common/shared.c instead of repeating the code block in all the tests. I suggest:

  • Remove code blocks that are already there in the common code. For example, msg_rma.c uses wait_for_data_completion's code inside wait_remote_writedata_completion() without calling the function.
  • Add the following function in common/shared.c and remove corresponding code block from tagged examples.

    int wait_for_tagged_completion(struct fid_cq *cq, int num_completions)

Is the psmx_resolve_name error normal for successful runs?

Whether or not a test run is successful, I still get a psmx_resolve_name error.

[wf541][85.25] :: ./fi_dgram -d wf540
psmx_resolve_name: couldn't connect to wf540:4096
Posting a send...
Send completion received

segfault when using -f argument on rdm_rma_simple test

Backtrace looks like this:
#0 0x00007ffff6b948ec in free () from /lib64/libc.so.6
#1 0x00007ffff7b7bf62 in fi_freeinfo (info=0x605040) at src/fabric.c:276
#2 0x0000000000400fb7 in main (argc=, argv=) at simple/rdm_rma_simple.c:355

I had some printf's added so line numbers may be off by a line or two. The issue appears to be with the hints info object. It is first allocated by fi_allocinfo. If -f is specified, hints->fabric_attr->prov_name = optarg (argv). Then at the end of the test, fi_freeinfo(hints) is called, which in turn calls free() on hints->fabric_attr->prov_name.

Reproduce by running ./fi_rdm_rma_simple -f sockets (or any specific provider). Seems like a simple fix is to use strdup().

Several other tests may also have the same issue:

$ grep prov_name *
dgram.c:332: hints->fabric_attr->prov_name = optarg;
msg.c:465: hints->fabric_attr->prov_name = optarg;
rdm.c:329: hints->fabric_attr->prov_name = optarg;
rdm_rma_simple.c:336: hints->fabric_attr->prov_name = strdup(optarg);
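
A self-contained sketch of the ownership problem and the strdup() fix. The `demo_*` structs and functions are stand-ins for the libfabric ones and only mirror the behavior described above, namely that the free routine calls free() on prov_name.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-ins for the libfabric structures; only the ownership pattern matters. */
struct demo_fabric_attr { char *prov_name; };
struct demo_info { struct demo_fabric_attr *fabric_attr; };

static struct demo_info *demo_allocinfo(void)
{
	struct demo_info *info = calloc(1, sizeof(*info));

	if (!info)
		return NULL;
	info->fabric_attr = calloc(1, sizeof(*info->fabric_attr));
	if (!info->fabric_attr) {
		free(info);
		return NULL;
	}
	return info;
}

/* Frees prov_name as described above, so prov_name must be heap memory. */
static void demo_freeinfo(struct demo_info *info)
{
	if (!info)
		return;
	free(info->fabric_attr->prov_name);
	free(info->fabric_attr);
	free(info);
}

/* Copy the option argument instead of storing a pointer into argv. */
static int demo_set_provider(struct demo_info *hints, const char *optarg_value)
{
	char *copy;

	assert(hints != NULL && optarg_value != NULL);
	copy = strdup(optarg_value);
	if (!copy)
		return -1;
	free(hints->fabric_attr->prov_name); /* drop any earlier value */
	hints->fabric_attr->prov_name = copy;
	return 0;
}
```

Storing optarg directly would hand free() a pointer into argv, which was never returned by malloc; that matches the crash in the backtrace.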

Possible race condition in rdm-cntr example

For recv_counter, we read the current counter and wait until it reaches current value plus one.
But, the counter might have been updated already.
ret = fi_cntr_wait(rcntr, fi_cntr_read(rcntr) + 1, CNTR_TIMEOUT);

Just like the send counter, we need to use a separate variable to keep track of counter values.

Issue in specifying src_addr

The -s argument in many examples seems to ignore the port information.
E.g., in fi_rdm_pingpong/tagged, getaddrinfo is called for "-s", but without the port (even when a port is specified).
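
For reference, a small sketch of passing the port as the service argument of getaddrinfo. `demo_resolve_src` is a hypothetical helper; it is restricted to numeric IPv4 addresses and numeric ports so that it does not depend on DNS.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Resolve a numeric IPv4 address and port into 'out'. Returns 0 on success,
 * otherwise the getaddrinfo error code. */
static int demo_resolve_src(const char *node, const char *port,
			    struct sockaddr_in *out)
{
	struct addrinfo hints, *res = NULL;
	int ret;

	assert(node != NULL && out != NULL);
	memset(&hints, 0, sizeof hints);
	hints.ai_family = AF_INET;
	hints.ai_socktype = SOCK_STREAM;
	hints.ai_flags = AI_NUMERICHOST | AI_NUMERICSERV; /* no lookups in this sketch */

	/* Passing NULL as the service leaves the port in the result as 0. */
	ret = getaddrinfo(node, port, &hints, &res);
	if (ret != 0)
		return ret;
	memcpy(out, res->ai_addr, sizeof *out);
	freeaddrinfo(res);
	return 0;
}
```

Calling it with a NULL port reproduces the behavior described above: the address resolves, but the port in the resulting sockaddr is 0.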

fi_msg_rma performance is too good

fi_msg_rma is reporting latencies below half a microsecond. This seems low, which makes me suspect that the way the performance is being captured is incorrect.

Tagging @a-ilango

Add support for new connections

From the OFIWG F2F, there was a request to allow an application to indicate that a new connection request would be handed off to another process (forked or otherwise). The idea is that the provider could arrange its data structures accordingly, so that the new connection could successfully be migrated to another process.

Fix return value of the calling function when fi_av_insert returns 0

fi_av_insert returns the number of successful inserts, but it doesn't return a negative value when it fails to insert an address, since that is treated as "not being able to resolve a user-supplied address". In our simple examples we return whatever fi_av_insert returns, so when it returns 0 the calling function also returns 0 (success); it should return failure instead.
Original code:

ret = fi_av_insert(av, remote_addr, 1, &remote_fi_addr, 0, &fi_ctx_av);
if (ret != 1) {
    FT_PRINTERR("fi_av_insert", ret);
    return ret;
}

Suggested code:

ret = fi_av_insert(av, remote_addr, 1, &remote_fi_addr, 0, &fi_ctx_av);
if (ret == 0) {
    FT_DEBUG("fi_av_insert(): cannot insert address 0x%" PRIx64 "\n", (uint64_t *)remote_addr);
    return -1;
} else if (ret != 1) {
    FT_PRINTERR("fi_av_insert", ret);
    return ret;
}

Simple atomic operation example

Need a simple, but complete example that demonstrates using all of the atomic operations, including write coherent support.

fabtests fail to build on mac

msg_epoll.c requires epoll.h:
simple/msg_epoll.c:37:10: fatal error: 'sys/epoll.h' file not found

Should there be a configure check for this?

Simple RMA example

Need a ping-pong-type example that uses RMA. This is also a candidate example for using counters to check for completion of RMA.

autogen.sh fails on first run

Does there need to be a config directory?

cst-fs:~/sfi/fabtests$ ./autogen.sh
+ autoreconf -ivf
autoreconf: Entering directory `.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal --force
aclocal: error: couldn't open directory 'config': No such file or directory
autoreconf: aclocal failed with exit status: 1

Add queue sizes to endpoint attribute

The endpoint attribute structure should be expanded to expose the size of the underlying queue. Now that the EP attribute exists, we can simplify things for the user and avoid needing to use control interfaces to override the default values. But default values should still be available to the user, with the actual values returned when an endpoint is created.

Improve /scripts/runFabtests.sh

  • Add a "known" exit code that a test returns when it is not supposed to run for a particular provider
  • Have an option where user can run the tests locally without ssh'ing to the nodes
  • Record the test result for later parsing and analysis

AV unit-test

The negative tests in the AV unit test read a bad IP address from the user and call getaddrinfo prior to av_insert(). The getaddrinfo call fails (for an invalid IP address) and the test is reported as failed.

So, was the intended meaning of bad IP address "a valid IP address but not reachable"?
AFAIK, the av_insert() need not check for reachability.
Or is it just an issue with the test itself?

Any thoughts?
@shefty @rfaucett @goodell

Fix ft_finalize call in rdm_shared_ctx and scalable_ep

Modify ft_finalize call for rdm_shared_ctx and scalable_ep. Currently our ft_finalize is called with ep[0]. We need to add a local ft_finalize for these because each ep is individually addressable. We need to get the completion for the right ep.

ft_finalize(ep[0], scq, rcq, remote_fi_addr[0]);

Data structure versioning

From the OFIWG F2F, data structure versions will be indicated using a version parameter to fi_getinfo. The version parameter will indicate the version of the set of data structures known to the application. libfabric will adjust its behavior accordingly, based on the data structures and fields known to the app. This mechanism will replace the field/mask concept in the current data structure scheme.

fi_allocinfo causing problems in fabtests

I'm using HEAD of master for both libfabric and fabtests. I'm getting this compile error when trying to build fabtests:

CC simple/rdm_tagged_search.o
simple/rdm_tagged_search.c: In function ‘main’:
simple/rdm_tagged_search.c:519:2: warning: implicit declaration of function ‘fi_allocinfo’ [-Wimplicit-function-declaration]
hints = fi_allocinfo();
^
simple/rdm_tagged_search.c:519:8: warning: assignment makes pointer from integer without a cast [enabled by default]
hints = fi_allocinfo();
^
CCLD simple/fi_rdm_tagged_search
simple/rdm_tagged_search.o: In function `main':
/home/hpp/fabtests/simple/rdm_tagged_search.c:519: undefined reference to `fi_allocinfo'
collect2: error: ld returned 1 exit status

@shantonu
@shefty
