Comments (57)

TD-er avatar TD-er commented on July 22, 2024 1

@aapris
Pick your issue:

from espeasy.

psy0rz avatar psy0rz commented on July 22, 2024

this is a possible cause for this forum bug:
Bug #15 OPEN: ESP server processes (Web, ICMP) become unresponsive in
certain environments under certain conditions?

I will review all network code and add checks to ensure invalid data won't do any harm.

krikk avatar krikk commented on July 22, 2024

i also see these stability problems with the current git version:
my uptime & free ram on my "productive wemos d1 mini" with the git version:

image

the first time i reached more than 3-4 hours was today when i flashed ESPEasy_R147_RC8 (with a "backported" http advanced plugin), you see this in the last part of the graph

the link to the bug above seems to be wrong?

psy0rz avatar psy0rz commented on July 22, 2024

that's very unstable indeed. i think my esp has been running for days now. i'll check later.

psy0rz avatar psy0rz commented on July 22, 2024

krikk i found a critical bug in MQTT handling. are you using that btw? my esp seems stable.

if there is an mqtt message that's bigger than around 255 bytes, it will overflow some buffers during logging.

currently fixing that by replacing oldskool static char[] buffers with proper String handlers.

krikk avatar krikk commented on July 22, 2024

no i do not use mqtt (is on my TODO list), until now, i only use "my" http advanced plugin (c011) to write directly to influxdb...
the strange thing is that my test wemos d1 r2 runs stable on the git version as long as i do not use the controller to send data. as soon as i enable the http advanced controller, it starts to reboot at random intervals, as seen in the above graph... ok, the logical conclusion would be: bad c011 controller code causes the reboots, BUT:

i also backported the git version of the c011 controller to version ESPEasy_R147_RC8, and there everything runs stable without reboots, and this with more sensors (also with backported code) than on my test wemos... ....and this gives the logical conclusion that the controller/sensor code can't be the issue!!???

so perhaps something is wrong in the new controller code that changed since the R147 release, OR it's some kind of memory issue... i also tried to find a way to debug this, but debugging the esp8266 on windows is a no-go on the current version as far as i have seen...

psy0rz avatar psy0rz commented on July 22, 2024

can you post the output that happens during the crash? perhaps on gist.github.com if it's too long

krikk avatar krikk commented on July 22, 2024

setup: wemos d1 R2, current git version of espeasy, controller: http advanced only, task plugins: 4 x systeminfo, reporting every 10 seconds...

first crashdump see here:
https://gist.github.com/krikk/f5ac56f123e7276c6840e66e7657442e#file-dump-2

sometimes i get a different one...will post if i get it...

psy0rz avatar psy0rz commented on July 22, 2024

does it also happen if you use domoticz http instead? just post to some existing http server, it doesn't matter for the test.

krikk avatar krikk commented on July 22, 2024

just tested this, but with domoticz http posting to the same influxdb, it works, no resets... :(

psy0rz avatar psy0rz commented on July 22, 2024

i will review the advanced plugin to see what i can discover.

krikk avatar krikk commented on July 22, 2024

would be really great, perhaps you could try to debug it:

https://blog.squix.org/2016/04/esp8266-offline-debugging-with.html <-- perhaps this works on linux?

psy0rz avatar psy0rz commented on July 22, 2024

ah thanks. i still have to figure out how and if online debugging is possible with the esp. offline debugging seems to be possible at least.

still would have to reproduce the problem first when i have time.:)

psy0rz avatar psy0rz commented on July 22, 2024

i see a bunch of off-by-one errors and possible overflows in c011. i'll fix them and then you can try again. :)

psy0rz avatar psy0rz commented on July 22, 2024

Ok i found lots of stuff in c011. These kinds of mistakes are pretty common in C and are to be found all over espeasy it seems. I will explain them so perhaps you (and others) can more easily spot those errors in other places and help me fix them :)

  • use strlcpy instead of strncpy. strncpy is nasty, see https://linux.die.net/man/3/strncpy. When using strncpy, the size you pass always needs to be one less than the buffer size (sizeof()-1), and you always need to set the last byte to \0 yourself. We should probably replace all str*cpy stuff by strlcpy, or regular Strings.

  • All the temporary static log buffers are dangerous: if we printf something into one that ends up too big after formatting, we have a buffer overflow and crashes/instability.

  • Use "IPAddress ip;" for IP addresses and convert to string with ip.toString() when needed.

  • This gets copypasted all over the place and is horrible and unnecessary and dangerous:

strcpy_P(tmp, PSTR("HTTP : connecting to "));
sprintf_P(log, PSTR("%s%s using port %u"), tmp, host, ControllerSettings.Port);

Why not just:

sprintf_P(log, PSTR("HTTP : connecting to %s using port %u"),  host, ControllerSettings.Port);

Or better yet prevent the whole char log buffer:

addLog(LOG_LEVEL_DEBUG, String(F("HTTP : connecting to "))+host.toString()+":"+ControllerSettings.Port);

  • addLog now also supports F() strings, so temporary buffers are no longer necessary.

  • This is something i also see a lot when reading network/http responses (also bad code that is copy/pasted a lot):

while (client.available()) {
   String line = client.readStringUntil('\n');
   line.toCharArray(log, 80);
   addLog(LOG_LEVEL_DEBUG_MORE, log);
   if (line.substring(0, 10) == F("HTTP/1.1 2")

The substring(0,10) again will cause problems: if we receive a line that's <10 bytes, we're running into out-of-bounds issues. (crashes/instability)

Update: this is not true. I've looked in the arduino/ESP8266 source code and substring does proper bounds checking. (although the wiki states something else)

Also, instead of copying the string with toCharArray, you can just use line.c_str(). (although you may not modify the buffer it returns) And i'm not sure if toCharArray always null-terminates correctly, or whether it has the same problem as strncpy.

Replaced with :

  while (client.available()) {
    String line = client.readStringUntil('\n');
    addLog(LOG_LEVEL_DEBUG_MORE, line);
    if (line.startsWith(F("HTTP/1.1 2")))
    {

  • I cleaned up DeleteNotNeededValues a bit with comments and extra variables to make it more readable. Keep in mind that these variables don't take extra space: a String() created inline on one line also takes space. Usually the compiler can even optimize it better if you store stuff you repeat in extra variables. The main reason i cleaned it up is so we can better see whether the function is safe or not.

  • Also the DNS option is used incorrectly in C011. It seems it always uses the IP and just fills in the DNS name in the Host: header? It never does an actual DNS lookup. We should fix this. (Or is the DNS lookup done on saving or something?)

Correction: this is fine. DNS lookups are done in webserver.ino when settings are saved.

  • All this stuff is fixed in d0eedce. The code compiles but i didn't test it, so your issues could be fixed or they could be worse ;)

Thanks for your help! :)
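
For readers who want to see the first two points in isolation, here is a minimal sketch in plain C++ (no Arduino headers, so it can be compiled on a desktop). The names boundedCopy and formatConnectLog are hypothetical and not part of ESPEasy. boundedCopy only imitates strlcpy-style semantics, and formatConnectLog uses snprintf so the formatted message can never run past the end of the log buffer.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical helper with strlcpy-like semantics: copies at most
// dst_size - 1 bytes and always NUL-terminates (when dst_size > 0).
// Returns strlen(src), so a result >= dst_size means truncation.
std::size_t boundedCopy(char *dst, const char *src, std::size_t dst_size) {
  const std::size_t src_len = std::strlen(src);
  if (dst_size == 0) {
    return src_len;  // nothing can be written, not even the terminator
  }
  const std::size_t n = (src_len < dst_size - 1) ? src_len : dst_size - 1;
  std::memcpy(dst, src, n);
  dst[n] = '\0';
  return src_len;
}

// Hypothetical log formatter: snprintf writes at most log_size bytes
// including the terminator, so an over-long host name is truncated
// instead of overflowing the buffer. Returns true if the message fit.
bool formatConnectLog(char *log, std::size_t log_size,
                      const char *host, unsigned port) {
  const int needed = std::snprintf(
      log, log_size, "HTTP : connecting to %s using port %u", host, port);
  return needed >= 0 && static_cast<std::size_t>(needed) < log_size;
}
```

With an 8-byte buffer, boundedCopy(buf, "espeasy-mega", sizeof(buf)) leaves "espeasy" in buf and returns 12, which tells the caller the input was truncated.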

krikk avatar krikk commented on July 22, 2024

👍 ...compiled and uploaded, first look seems good, i have a uptime of 15 minutes already :) ...i will let it run over night, then we know more...

can only say: big thanks

krikk avatar krikk commented on July 22, 2024

so after running a test over night with the following 4 test-devices posting values every 10 seconds:
image

i get this very good uptime:
image

only one panic (with the old code it would have been way more!) and the panic is a different one than the panics before your fixes:

HTTP : connecting to 10.0.0.6:8086
HTTP : Success!
HTTP : closing connection
SYS  : -85.00

Panic C:\users\t159\.platformio\packages\framework-arduinoespressif8266\cores\esp8266\core_esp8266_main.cpp:131 loop_task

ctx: sys 
sp: 3ffffdd0 end: 3fffffb0 offset: 01b0

>>>stack>>>
3fffff80:  4024b56e 3fffdab0 00000000 3fff2c00  
3fffff90:  00000000 3fffdad0 3fff2bec 40203329  
3fffffa0:  40000f49 40000f49 3fffdab0 40000f49  
<<<stack<<<

 ets Jan  8 2013,rst cause:2, boot mode:(3,7)

load 0x4010f000, len 1384, room 16 
tail 8
chksum 0x2d
csum 0x2d
v09826c6d
~ld
ªU


INIT : Booting version: 
FS   : Mount successful
INIT : Free RAM:24816

and a quick google after this exception points to this: esp8266/Arduino#1542 ...where the developer states:

The panic line comes from the stack overflow check.

but once again, thanks for your improved code, i will have a look on the remaining issues...

psy0rz avatar psy0rz commented on July 22, 2024

Is it possible your http server returns a lot of data in some situations? multiple kilobytes.

I think we don't limit the amount of data we read. So now the String will perhaps grow too big. No buffer overflows, but an out-of-memory condition. Will check the limits later.

krikk avatar krikk commented on July 22, 2024

no, the http server response should not be too big, the influxdb http api only gives back longer responses in case of wrong queries, which do not happen in my test cases...

i did a bit more tests with the improved http advanced plugin, and it is more stable now, but not really "rock" stable, i think we have a problem with memory because of much string-handling in c011...

i did also implement the change you suggested with the parseTemplate function (see here: https://github.com/krikk/ESPEasy/blob/parseTemplate_Test/src/_C011.ino#L301) from misc.ino, but as soon as i use this when posting my http to my influxdb, i get many more reboots, which i think is caused by the greater memory use of the parseTemplate function... but i am not really sure about that...

btw. just found another tool for debugging stack traces:

https://github.com/me-no-dev/EspExceptionDecoder#command-line-version <- Arduino ESP8266/ESP32 Exception Stack Trace Decoder

psy0rz avatar psy0rz commented on July 22, 2024

I checked out the sourcecode of Arduino/ESP8266: The readStringUntil function has no read-limits. So if a response is too big, we will run out of memory and crash.

Since we never trust data from the network, we should not use this function in its current form. I will replace them all with a SafeReadStringUntil() function that allows us to specify a maximum size.

This will most certainly improve stability.

psy0rz avatar psy0rz commented on July 22, 2024

krikk: dns lookups are working correctly. DNS lookups are done when saving settings in webserver.ino.

psy0rz avatar psy0rz commented on July 22, 2024

ok i created a safeReadStringUntil and replaced all the readStringUntil calls.

there is still lots of other string handling and char buffers that i need to check/fix in ESPEasy.

psy0rz avatar psy0rz commented on July 22, 2024

discovered a memory leak in probably most http controllers: if you let them send data very often, memory keeps decreasing. probably a feature of wificlient.

psy0rz avatar psy0rz commented on July 22, 2024

i just fixed #253. can you make that free-memory graph again? it should be flat now.

maybe you can do exactly the same test, with the same controller/devices and server?

with some luck it stays up forever now ;)

Edwin

krikk avatar krikk commented on July 22, 2024

image

the graph shows my testrun, until the red marker it was running with my rgb color sensor plugin as data provider, after the red marker only the systeminfo provider was posting data to influx db...

sorry, but we seem to have another problem... it's still crashing... i will try to get a crash dump in the evening, i had been running the board without an attached serial console for testing...

krikk avatar krikk commented on July 22, 2024

why does the current github version crash while the R148 with the "backported" http advanced plugin does run more than 1 week without a reboot? ...i think we should look into the code which was changed in the timeframe from R148 release till now...

psy0rz avatar psy0rz commented on July 22, 2024

well there are lots of changes, but maybe we will find it.

can you give me the config so i can reproduce the problem?

i might even create a script that automatically compiles and uploads every commit until it hits the bug.

krikk avatar krikk commented on July 22, 2024

update on the stability test:
image
...i wanted to get a dump of the crash, ...and it does not happen... i will let the test run until morning...

psy0rz avatar psy0rz commented on July 22, 2024

yeah there is some hard to trigger bug, Shardan has issues as well. although one bug was indeed the memory leak.

sometimes it gets triggered often, other times not at all.

can you disable globalsync and UDP ? (enter port 0 for udp) that code still has some issues.

krikk avatar krikk commented on July 22, 2024

image

and now i have a dump: https://gist.github.com/krikk/f3d8203dc575c574e38d6738f1486a27#file-dump-3

espexceptiondecoder output:
see above gist...

krikk avatar krikk commented on July 22, 2024

my config while running above test:
one controller:
image
Tasks:
image
image
(i had 3 disabled tasks also, but they should not be relevant)

freeram and load was posting every 5 sec, wifirssi every 10 sec, uptime every 15 sec
...no globalsync, no udp (never worked with them)
...and ArduinoOTA enabled.

psy0rz avatar psy0rz commented on July 22, 2024

does a powercycle change anything?

also how do you get such detailed crash dumps. did you enable some kind of debugging? ( i do know the exception decoder)

krikk avatar krikk commented on July 22, 2024

no i did not enable debugging, the tested build was compiled with the following settings:

[env:dev_4096]
platform = espressif8266
framework = arduino
board = esp12e
upload_speed=460800
upload_port=192.168.11.63
;upload_port=COM4
build_flags = -Wl,-Tesp8266.flash.4m1m.ld -D PLUGIN_BUILD_NORMAL -D PLUGIN_BUILD_TESTING -D PLUGIN_BUILD_DEV -D FEATURE_ARDUINO_OTA -D BUILD_GIT='"${env.TRAVIS_TAG}"'

...i simply get this dump on serial console. hardware: wemos d1 R1 board

krikk avatar krikk commented on July 22, 2024

after dumping this to serial, the board reboots and runs fine again

psy0rz avatar psy0rz commented on July 22, 2024

are you using platformio? how did you manage to get the exception decoder? is there a plugin for it already?

krikk avatar krikk commented on July 22, 2024

i use platformio integrated into eclipse IDE... but i also use arduino IDE to test things and for example to use this espexception decoder...

just install the decoder in arduino ide, start it with an empty file... it asks for the .elf firmware file, and then you need to paste the exception... ...this was simple, i didn't manage to get another debugging solution to work on windows...

psy0rz avatar psy0rz commented on July 22, 2024

can you test with the domoticz http controller, to see if that one stays stable?

shardan just discovered some of his problems are due to faulty ESP boards. (the wemos boards were ok though) He will post his results to the forum later.

krikk avatar krikk commented on July 22, 2024

migrated my "productive" wemos d1 mini to the current mega git build (it was running a customized ESPEasy_R147_RC8 build before) ...and compared to my first try when i first posted here, it's better, but sadly it's still not stable:
image
image
...and if i look on the "flat" ram-usage line and take the previous tests into account, i don't think its a ram issue...

just realized that i had ArduinoOTA enabled in my test-build... will update to a build without arduinoOTA to eliminate that.

krikk avatar krikk commented on July 22, 2024

so another status update, and sadly still not as stable as the old RC147... i still get "random" reboots...

its a selfcompiled dev_4096 build without ARDUINO_OTA (commit: 28208ac)

image
image

i have now permanently attached the serial console and got these dumps (espexception decoded at the bottom of each file..)

serial_mini_dump2.txt
serial_mini_dump3.txt
serial_mini_dump4.txt
serial_mini_dump1.txt

...it seems to happen always while doing backgroundtasks...

psy0rz avatar psy0rz commented on July 22, 2024

ah thanks!

still can you try with the domoticz http module (to the same http server). i know it wont do you any good, but at least we know if the problem is in http generic advanced, or somewhere else.

uzi18 avatar uzi18 commented on July 22, 2024

psy0rz avatar psy0rz commented on July 22, 2024

i assumed you can stay in loop() indefinitely, as long as you call yield regularly?

uzi18 avatar uzi18 commented on July 22, 2024

psy0rz avatar psy0rz commented on July 22, 2024

we could try that.

uzi18 avatar uzi18 commented on July 22, 2024

krikk avatar krikk commented on July 22, 2024

any input is welcome, i tried a few added yield() statements, but they did not have any effect. i only had a quick look at this "core" code though...

psy0rz avatar psy0rz commented on July 22, 2024

uptime of my Wemos D1 minis with dev8 is currently 13 days and counting. one of them has 2x DS18b20 and 2 PIRs. Controller is domoticz MQTT.

psy0rz avatar psy0rz commented on July 22, 2024

how do you power the Wemos?

sassod avatar sassod commented on July 22, 2024

krikk avatar krikk commented on July 22, 2024

the stability problem of my setup must be a combination of the code changes between ESPEasy_R147_RC8 and 2.x and the code of the http advanced controller... why do i know this:

because i had uptimes greater than 1 week when i ran my custom build of ESPEasy_R147_RC8 with the "backported" http advanced controller, but as soon as i updated exactly the same hardware with exactly the same sensors to the latest git version, i always get this panic after a few hours:

Panic C:\users\krikkit\.platformio\packages\framework-arduinoespressif8266\cores\esp8266\core_esp8266_main.cpp:98 __yield

you can view my "backported" http advanced controller over here, it's nearly the same code as in the current git version...

so as it runs on the same hardware, it can NOT be a power problem (and power problems would not produce such full dumps i think)

... i think it is really time for me to migrate to openhab, so that i can use mqtt :)

psy0rz avatar psy0rz commented on July 22, 2024

can you upload your config.dat somewhere? then i can try to reproduce and fix it.

krikk avatar krikk commented on July 22, 2024

my config:
config_Vorzimmer_U90_Build20000_20170524172955.zip

will be really great if you could have a look at it... 👍

psy0rz avatar psy0rz commented on July 22, 2024

krikk i found a problem in background processing: if something is doing a delayBackground(), it will keep calling backgroundtasks() so that serial and webinterface still respond.

however, that way it's possible to do something (for example calling deepsleep via a serial command) that will do ANOTHER delayBackground(), hence recursion, stack overflows and crashes.

is it possible you made http calls to the espeasy, or serial calls to the espeasy, while you were testing?

anyway i fixed it so it won't recurse anymore. (the second delayBackground() will effectively just be a normal delay)
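
The kind of guard described here can be sketched like this. It is a desktop-compilable illustration, not the ESPEasy code: delayWithBackground and the g_* variables are hypothetical names, the loop count replaces a real time-based delay, and the counters exist only so the behaviour can be observed.

```cpp
// Hypothetical re-entrancy guard for a "delay while running background
// tasks" helper. All names are illustrative, not taken from ESPEasy.
bool g_in_background_delay = false;  // true while the outer call is active
int g_background_runs = 0;           // how often background work ran
int g_plain_delays = 0;              // how often a nested call fell back

void delayWithBackground(int iterations, void (*background_tasks)()) {
  if (g_in_background_delay) {
    // Nested call (triggered from inside a background task): do not
    // recurse into background processing again. Real firmware would
    // perform a plain blocking delay here; the sketch just counts it.
    ++g_plain_delays;
    return;
  }
  g_in_background_delay = true;
  for (int i = 0; i < iterations; ++i) {
    background_tasks();  // may itself call delayWithBackground()
    ++g_background_runs;
  }
  g_in_background_delay = false;
}
```

If the background task passed in calls delayWithBackground() again, that inner call returns through the fallback branch, so the stack depth stays constant no matter how often it happens.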

krikk avatar krikk commented on July 22, 2024

is it possible you made http calls to the espeasy, or serial calls to the espeasy, while you were testing?

...no, i don't think so, because the crashes also occur when everyone in the house is sleeping, and there should be no network device which actively scans for devices... see the red marker:
image

...but anyways i updated my setup with the fresh code and will have a look if it helps...

btw. do you have serial port logging enabled on your stable running setups?

psy0rz avatar psy0rz commented on July 22, 2024

yep i normally have the default log level of 2.

psy0rz avatar psy0rz commented on July 22, 2024

i think most of these random issues should be fixed by now. there are always more checks we need to fix and strcpy's we need to remove, but for 2.0.0 this is ok now.

aapris avatar aapris commented on July 22, 2024

I know this is an old and closed issue, but I have exactly the same panic problem if I use Generic HTTP Advanced and the POST method to send sensor data. Should I create a new issue or continue this one? I use PlatformIO and the latest mega branch.
