Giter VIP home page Giter VIP logo

Comments (17)

Thomseeen avatar Thomseeen commented on May 25, 2024

I have traced down the problem to lwmqtt_write_to_network() in client.c. You have a while-loop in there to send a message in parts if necessary. You interrupt that loop based on the timeout given to your esp_mqtt_init() function. The time to a timeout you pass to client->network_write() is based on the remaining time which doesn't make much sense. This means if client->network_write() times out there is no remaining time so the while-loop will never loop (it will interrupt immidiatly with LWMQTT_NETWORK_TIMEOUT)and try to send the remaining data in another try.
To fix that and actually enable multiple partial writes the timeout time given to client->network_write has to be shorter than the time used to calculate the remaining_time in lwmqtt_write_to_network().
I just changed the line

lwmqtt_err_t err = client->network_write(client->network, client->write_buf + offset + written, len - written, &partial_write, (uint32_t)remaining_time / 15);

as my timeout time is 15s.
I will test this and wether it fixes the problem, so far it seems promising.

from esp-mqtt.

256dpi avatar 256dpi commented on May 25, 2024

Every call into lwmqtt resets the command timeout and passes it down to nested function calls to ensure an action is cancelled in case the timeout has been reached. In esp-mqtt the command timeout is defined globally and passed into the lwmqtt function calls.

If you initialized esp-mqtt with a command timeout of 15s it will reset the timer when you call esp_mqtt_publish. After the packet has been encoded it will call client->network_write the first time with the 15s as the initial timeout. As the client->network_write my return early without writing all the data, it will call it repeatedly until the timeout has been reached. In case of a timeout the operation is aborted.

Since we assume that the command timeout is reasonably high and client->network_write quite quick, there should not be an issue with timeouts if the connection is healthy. However, in this scenario the data might overwhelm the modem and cause a delay that might trigger the timeout.

Did you try a higher command timeout and keep alive?

from esp-mqtt.

Thomseeen avatar Thomseeen commented on May 25, 2024

As the client->network_write my return early without writing all the data, it will call it repeatedly until the timeout has been reached. In case of a timeout the operation is aborted.

No. If client->network_write returns it has a) sent all data or b) timed out and only sent a part of the data or nothing at all. In case b) 15s have allready passed so there will be no repeated call to client->network_write.

See http://pubs.opengroup.org/onlinepubs/007908799/xns/getsockopt.html

SO_SNDTIMEO
Reports the timeout value specifying the amount of time that an output function blocks because flow control prevents data from being sent. If a send operation has blocked for this time, it returns with a partial count or with errno set to [EAGAIN] or [EWOULDBLOCK] if no data were sent. The default for this option is zero, which indicates that a send operation will not time out. The option stores a timeval structure. Note that not all implementations allow this option to be retrieved.

I verified that behaviour by changing lwmqtt_write_to_network to:

static lwmqtt_err_t lwmqtt_write_to_network(lwmqtt_client_t* client, size_t offset, size_t len) {
    // prepare counter
    size_t written = 0;
    ESP_LOGI("client.c_write", "lwmqtt sending data...");  // DEBUG
    // write while data is left
    while (written < len) {
        // check remaining time
        int32_t remaining_time = client->timer_get(client->command_timer);
        ESP_LOGI("client.c_write", "...remaining time %d", remaining_time);  // DEBUG
        if (remaining_time <= 0) {
            return LWMQTT_NETWORK_TIMEOUT;
        }
        // write
        size_t partial_write = 0;
        lwmqtt_err_t err = client->network_write(client->network, client->write_buf + offset + written, len - written, &partial_write, (uint32_t)remaining_time / 15);
        if (err != LWMQTT_SUCCESS) {
            return err;
        }
        // increment counter
        written += partial_write;
        ESP_LOGI("client.c_write", "...bytes written %d/%d", written, len);  // DEBUG
    }

    return LWMQTT_SUCCESS;
}

And esp_lwmqtt_network_write to:

lwmqtt_err_t esp_lwmqtt_network_write(void* ref, uint8_t* buffer, size_t len, size_t* sent, uint32_t timeout) {
    // cast network reference
    esp_lwmqtt_network_t* n = (esp_lwmqtt_network_t*)ref;

    // set timeout
    struct timeval t = {.tv_sec = timeout / 1000, .tv_usec = (timeout % 1000) * 1000};
    int rc = lwip_setsockopt_r(n->socket, SOL_SOCKET, SO_SNDTIMEO, (char*)&t, sizeof(t));
    if (rc < 0) {
        return LWMQTT_NETWORK_FAILED_WRITE;
    }

    // write to socket
    int bytes = lwip_write_r(n->socket, buffer, len);
    ESP_LOGI("esp_lwmqtt", "network write errno: %d", errno);
    if (bytes < 0 && errno != EAGAIN) {
        return LWMQTT_NETWORK_FAILED_WRITE;
    } else if (errno != EAGAIN) {
        // increment counter
        *sent += bytes;
    } else {
        vTaskDelay(200 / portTICK_PERIOD_MS);
    }

    return LWMQTT_SUCCESS;
}

That revealed another little mistake as lwip_write_r returns -1 when it fails to send any data and errno == EAGAIN. That causes written to count down if that case isn't catched like in the shown modified code. The vTaskDelay is ugly and something similar should probably be achieved with select.

And yes, extending the timeout and the keep alive to 60s each doesn't help.

from esp-mqtt.

256dpi avatar 256dpi commented on May 25, 2024

The write system call may return early with not all data written:

On success, the number of bytes written is returned (zero indicates nothing was written). It is not an error if this number is smaller than the number of bytes requested; this may happen for example because the disk device was filled. (http://man7.org/linux/man-pages/man2/write.2.html)

While this might happen on Linux it does not necessarily happen on the ESP.

However the other thing really seems to be a bug. We should definitely check if the returned value is below zero and not decrease the total amount. I'll fix it in lwmqtt and esp-mqtt. Thanks for noticing.

from esp-mqtt.

Thomseeen avatar Thomseeen commented on May 25, 2024

What do you mean by "early"? I think you don't get my point. The write function can return without having sent anything or only a part of the message, yes. But if it does so it does so after a timeout specified by SO_SNDTIMEO. You set this timeout to be the same time lwmqtt_write_to_network uses to determine wether it should retry a send or not. This means the check

if (remaining_time <= 0) {
    return LWMQTT_NETWORK_TIMEOUT;
}

will always trigger if there has only been a partial or failed write, which in return means there will never be a retry in that while-loop.

from esp-mqtt.

256dpi avatar 256dpi commented on May 25, 2024

The timeout is not there to perform any retries. The check is there to stop writing if the operation takes more time that has been granted. In other words: If something takes too long, we want to stop and eventually reestablish the connection and try again. The higher you set the command timeout the more time you grant to finish reads and writes in time.

from esp-mqtt.

Thomseeen avatar Thomseeen commented on May 25, 2024

Than why even do that loop?

// write while data is left
while (written < len) {...}

from esp-mqtt.

256dpi avatar 256dpi commented on May 25, 2024

Because on Linux write might not send all data at once and need multiple calls. Depending on the OS the kernel might have a smaller output buffer than the data you're about to send. Thus, you need to call write multiple times to get all the data through. The loop tracks the amount of bytes written and stops if all has been written.

from esp-mqtt.

Thomseeen avatar Thomseeen commented on May 25, 2024

Yes. But between each retry to send the remaining data the timespan is SO_SNDTIMEO. The library sets SO_SNDTIMEO to the same 15s (in my case) as the command_timer that times out the retries.

size_t written = 0;
// write while data is left
while (written < len) {
    */ 2. now we loop to try to send the rest of the data but before that we check wether we timed out
    */    as the call to write allready took 15s we whill always timeout here to and not retry
    // check remaining time
    int32_t remaining_time = client->timer_get(client->command_timer);
    if (remaining_time <= 0) {
        return LWMQTT_NETWORK_TIMEOUT;
    }
    // write
    size_t partial_write = 0;
    */ 1. this call takes 15s before it returns if it only sends part of the data...
    */    (as remaining_time will be 15s at the start)
    lwmqtt_err_t err = client->network_write(client->network, client->write_buf + offset + written, len - written, &partial_write, (uint32_t)remaining_time);
    if (err != LWMQTT_SUCCESS) {
        return err;
    }
    // increment counter
    written += partial_write;
}

The command_timer should take longer to timeout the while loop than what is defined as timeout in SO_SNDTIMEO. The command_timer is based on the time defined in timeout and so is SO_SNDTIMEO. That means timeout/timeout = 1 which is the maximum partial writes that could happen. If SO_SNDTIMEO would for example be only 1s while timeout for the command_timer would stay at 15s that would be 15 maximum partial writes before the publish-try completly times out.

I hope you see what I mean now as I fear we could otherwise keep moving in circles for hours... :D

from esp-mqtt.

256dpi avatar 256dpi commented on May 25, 2024

There is no difference in giving the client->network_write a smaller timeout: It will just timeout earlier. Yes the loop would then retry the operation, but this has the same effect as giving it the whole timeout from the beginning. Either the underlying write is able to write the bytes out in time or not. I can't see why this would make a difference.

An example: You need to bake 10 cakes. I give you the task and say: "Come back to me after and hour if you didn't already bake all the cakes". If you manage to bake all cakes in time, all is good. If you're to slow then you reach the timeout and come back to me. Now, there is no difference in giving you 6 times the task to bake 10 cakes in 10 minutes. Either you're fast enough or you'r to slow. ;)

from esp-mqtt.

Thomseeen avatar Thomseeen commented on May 25, 2024

I see. Now we are on the same page and what you say makes sense and fits into my understanding. And I now understand that there could be a thrid case in which the write only writes part of the data but returns before it times out, thats for the loop is for. Sadly that still doesn't fix my issue. I will update to your latest changes and do some more digging.

from esp-mqtt.

256dpi avatar 256dpi commented on May 25, 2024

Ok cool. But, Is your issue not more related to memory management? I mean 37K is a lot of data for the esp32...

from esp-mqtt.

Thomseeen avatar Thomseeen commented on May 25, 2024

No. The memory management problem I had described in #26 is working now. And with your fix 8fd0f16 I am able to check wether the task has started and if not (as not enough memory was available) I just TaskDelay a bit to wait for the IDLE task to free memory, as there should be enough memory, just not freed yet, and then try to start the task again. To double check I call heap_caps_get_largest_free_block(MALLOC_CAP_8BIT) and print its result alot. I'm pretty much always above 15kB.

The problem has to be somewhat connection related as I also have problems connecting to the broker at times. Lwmqtt errors -6 and -9. In Wireshark there is alot going on and I'm yet unable to get any solid structure into it to better describe the problem. Trying it anyway I see a few errors:

  1. Sometimes I can see lwip trying to send something and it never reaches my computer (can't see the package in Wireshark). That means the socket is now "full" (wont accept any more data, program keeps running) and can't sent anything else... the timeout happens -4. The client then disconnects and can't connect again because the socket is still blocked and the client can't send a MQTT connection request.
  2. My mosquitto broker sometime (randomly?!, but mostly inbetween multiple TCP packages of one message) sends a TCP-RST to the ESP and the ESP doesn't react with a new SYN as expected (in my TCP/IP understanding). That causes the whole connection to not work anymore... hard reset needed.
  3. MQTT client starts to send data. Wireshark shows the first few ~1490byte TCP packages of the image but then a flood of retransmissions messes everything up.
  4. Can't connect to mosquitto... The sent MQTT connection request isn't shown in Wireshark. Wireshark only prompts "tcp previous segment not captured". Lwmqtt errors -6 and -9 mostly.

Errors 1. and 3. don't happen if I only send a timestamp which is so small, it doesn't have to be split into multiple TCP packages.
Most of it sounds like problems with the physical connection to me, but if I run a simple program on the ESP that just sends random strings (no camera, no MQTT) via TCP (using the write function from sockets.h) I don't have any connectivity problems whatsoever.
Best thing... sometimes there are like 5minutes where everything is working like a charm no problems at all...

I know this is probably very confusing but I'm pretty much unable to structure everything that's going wrong at the moment...

from esp-mqtt.

256dpi avatar 256dpi commented on May 25, 2024

That is quite a weird problem indeed. One thing you could try that I can think of is changing lwmqtt_write_to_network to call client->network_write multiple times with buffers up to 1024 bytes or just below the MTU (usually 1400 or 1500 bytes):

size_t write_len = len - written;
if (length > 1024) {
  length = 1024;
}

lwmqtt_err_t err = client->network_write(client->network,
  client->write_buf + offset + written, write_len, &partial_write,
  (uint32_t)remaining_time);

Maybe lwip is doing something weird under the hood if it is called with big buffers.

from esp-mqtt.

Thomseeen avatar Thomseeen commented on May 25, 2024

Doing that would mean I will have to reassemble the data on the other side by myself? If I really need to go that route I will just send multiple MQTT-publishes.
At the moment I'm trying to trace down the problems by enabling the debug-logs from lwip. In combination with wireshark it seems like there are indeed a lot of packages dropping. Lwip correctly tries to resent them and sometimes they reach there goal... sometimes not. The retries happen in a somewhat wierd and slow pattern so eventually it just times out after a while.
At this point my problem is still unsolved but I'm pretty sure it has nothing to do with your library so I will close this issue.

from esp-mqtt.

Thomseeen avatar Thomseeen commented on May 25, 2024

If anyone may end up here or you want to note it in your README. I found a solution for my problem which completly solves my connection issues.
As stated here espressif/esp-mqtt#48 for larger buffers sent via lwip #define CONFIG_USE_ONLY_LWIP_SELECT 1 needs to be set. In menuconfig it can be found at "Component Setting" -> "LWIP" -> "Support LWIP socket select() only".

from esp-mqtt.

256dpi avatar 256dpi commented on May 25, 2024

Thanks for figuring it out. I added a note to the readme. I hope this gets resolved soon.

from esp-mqtt.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.