Genesis Z axis stall detection fails --> and a workaround

RickCarlino · July 20, 2020, 7:22pm

@jebba We have plans to update the SSH parts of FBOS in the next three months or so. As a side note for anyone wishing to discuss security matters on the forum (since the RSA issue has been brought up once previously this month), please see our responsible disclosure statement prior to posting or privately send the matter to security@farmbot.io.

@jrwaters I will take a look later today.

jebba · July 20, 2020, 9:39pm

@RickCarlino got it, thx.

jsimmonds · July 21, 2020, 7:26am

It’s not clear that @jrwaters is seeing Factory Resets . . but is seeing FBOS reboots

jsimmonds · July 21, 2020, 7:43am

Thanks @jrwaters

@RickCarlino SSH Console is way too noisy in the logs.

All that’s there is just your SSH session startup. Something is killing IPv4 connectivity between the Bot and everyone else. Is it normal that FBOS reboots after loss of Internet ? ( I’ve forgotten )

@jrwaters to go further you’ll need to connect up to the physical RPi3B Serial Console ( see other posts in here ) with your trusty serial 3V3TTL-USB cable

| edit |
Another recent post in here discovered inadequate 5V supply to the RPi3B … are you able to check the 5V0 main supply at the RPi3B ?

RickCarlino · July 21, 2020, 1:45pm

@jrwaters This is indeed a strange one, and the first time I’ve seen anything like this.

Until we figure out what’s going on, I’ve:

Disabled auto-update
Disabled pin bindings

At this point we know that it is:

NOT A pin binding that is accidentally getting triggered.
NOT A farm event that accidentally had a REBOOT block.

The power supply theory that @jsimmonds is suggesting sounds plausible. There may be some other thing going on that might show up via TTL cable.

Are you available for a phone call today? I have a feeling there is something unique to your setup that is causing the issue.

jrwaters · July 21, 2020, 1:53pm

@Jsimmonds, you are amazing man. I don’t think the loss of connectivity is the cause; rather, I suspect a spurious reboot, which immediately causes the loss of connectivity. I say that because the time period is so short - it is happening every 5 or so minutes, and there is only about a 40 second period where connectivity is lost (during the reboot). But, of course, I could be wrong. I will look at the other thread no matter what. I love learning about this stuff. Thank you!

From day 1 my voltage icon shows Yellow and sometimes Red FWIW. I might be able to check that but will have to read up as I’m a novice with my multi-meter.

@RickCarlino,

I am available at 9 am PST, 1 pm PST, or 3 pm PST. Do any of those work?

Best
Jack

jrwaters · July 22, 2020, 12:36am

Just wanted to update for those who are interested - especially @jsimmonds.

We swapped out the power supply and the Raspberry Pi and the spurious reboots still happened.
The reboots happen even when the Farmduino is not attached.

This is good news in the sense that we have fewer variables. Some of the main variables left:

Things happening in the network
Software.

I’ve just downgraded to 10.1.2 and the issue still happens - so it isn’t that simple. I’ve also opened my network completely for the FarmBot. I can use tping and see that the FarmBot can ping things on the Internet.

Next up - that debug cable. My debug cable should arrive tomorrow. I have an idea to eliminate the network equation. Tomorrow I’ll re-flash and put my iPhone out by the FarmBot and use it as a hotspot. I timed the reboots by sshing into the FarmBot and using the Toolshed uptime command. They always seem to happen after 7.5 minutes and usually before 9 minutes. So, it shouldn’t take long to see.
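
For anyone who wants to repeat the timing, here is a minimal Python sketch of the idea: take uptime readings at intervals and treat any drop in uptime as a reboot. The function name and the sample numbers are hypothetical, and how the readings are collected (for example by hand over SSH) is left out on purpose.

```python
def uptimes_before_reboot(samples):
    """Given uptime readings in seconds, in the order they were taken,
    return the last reading seen before each reboot.

    A reboot is inferred whenever a reading is lower than the previous one.
    Each returned value is a lower bound on how long the system stayed up.
    """
    return [prev for prev, cur in zip(samples, samples[1:]) if cur < prev]


if __name__ == "__main__":
    # Made-up readings, purely for illustration.
    readings = [60, 240, 450, 20, 200, 470, 15]
    for uptime in uptimes_before_reboot(readings):
        print(f"reboot after at least {uptime / 60:.1f} minutes of uptime")
```

The closer together the readings are, the tighter the bound on the real time-to-reboot.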

jsimmonds · July 22, 2020, 5:55am

Final suspect in the RPi3B power theory is the USB cable A to MicroB . . tried a new one ?

That’s a big neon light right there

jrwaters · July 22, 2020, 1:08pm

Sorry I wasn’t clear. When I said “we swapped out the power supply”, I meant that I had an official Raspberry Pi power supply because I had a spare Pi of the same version (silkscreen same and everything). So, the Pi was powered by that with no Arduino connected.

But, having yellow voltage has bothered me - I assumed others might have the same. If not the case then sounds like I should swap out that cable. Thank you.

jrwaters · July 24, 2020, 3:04am

UPDATED AT VERY BOTTOM

Here is the latest. My cable is not here yet but it will be here tomorrow. In the meantime, I’ve got Screen working on my Windows box under the Linux sub-system.

I have the Pi inside with no Arduino attached. A couple of other interesting things:

With a completely different Pi and a completely different power supply, I can confirm the same symptoms - reboot every 7 to 8 minutes.
With my power supplies, the voltage reading shows GREEN, so I suspect the power cable between my Arduino and Pi is flawed in some way, or perhaps even my power supply (unlikely). Either way, this is definitely not what is triggering this issue.
Cleaned up firewall rules and I think I have it pretty pristine - made no difference.
Tried different firewall - made no difference.
WiFi hotspot didn’t work as my AT&T coverage at home is poor.
I’ve tried all the released versions of FarmBot OS that I have and - interestingly enough, 9.2.2 does not exhibit the issue. I let it run for over 15 minutes.
Also interesting is that 10.2.0-rc0 seems immune. I think this is for self hosted users but it is interesting. One thing I noticed is that SSH has periods (couple of seconds) of non-responsiveness with this build.
I tried 10.2.0-rc1 and it rebooted promptly at 7 minutes. This got me thinking about 10.2.0-rc0 - why did it stay up? So, I reloaded 10.2.0-rc0 and looked more closely. It does go through a restart process but it just doesn’t disconnect so you get to see the full reset (I assume).

screenlog.0 (354.3 KB)

I’m going outside to stain some wood. Next update when I have a cable and 10.1.3 log captured!

Thanks
Jack

RickCarlino · July 24, 2020, 2:17pm

@jrwaters Interesting note about 9.2.2. I will talk to the developer who managed FBOS at that time to see if they have any input on the matter.

It’s unfortunate that the hotspot did not work. If you have a friend with known-good WiFi, it might be worth carrying the RPi to their LAN and trying that (if the TTL cable is going to be delayed significantly).

jrwaters · July 24, 2020, 3:25pm

Only if you are interested in 9.2.2 Rick . . . I’m a bit of a completionist. But I certainly appreciate that you all don’t have cycles to dig through old code.

If no TTY cable today, I’ll throw my family off the network and put a simple switch on my cable modem and hook laptop and Pi directly!

jrwaters · July 24, 2020, 10:15pm

Cable came - captured the failure with 10.1.3. Logs attached (keys redacted) but the interesting bit is below. I logged back in and catted out the crash.dump but it’s too large to attach. If you (Rick or the intellectually curious John Simmonds) want a copy then please give me an e-mail address and I’ll share it on Google drive or something (open to ideas).

Thanks
Jack

eheap_alloc: Cannot allocate 457731892 bytes of memory (of type “ol[ 365.750349] heart: Erlang is crashing … (waiting for crash dump file)
d_heap”).

Crash dump is being written to: /root/crash.dump…done
e[999H
[nbtty: terminating]
e[?25h[ 365.750375] heart: waiting for dump - timeout set to -1 seconds.
[ 367.995257] erlinit: Erlang VM exited

[ 368.007896] erlinit: Sending SIGTERM to all processes

[ 368.014564] watchdog: watchdog0: watchdog did not stop!
[ 369.695761] erlinit: Sending SIGKILL to all processes
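
In the console capture above, the kernel message lands in the middle of the Erlang slogan, which makes the line hard to read. A minimal Python sketch for pulling the requested allocation size out of such a capture is below. The function names are hypothetical; the regular expression only relies on the words before the number, so it does not care where the kernel text was interleaved.

```python
import re

# Matches the byte count in an Erlang "eheap_alloc" crash slogan.
SLOGAN_RE = re.compile(r"eheap_alloc: Cannot allocate (\d+) bytes")


def failed_allocations(console_text):
    """Return the requested size, in bytes, of every failed eheap_alloc found."""
    return [int(m.group(1)) for m in SLOGAN_RE.finditer(console_text)]


def to_mib(n_bytes):
    """Convert a byte count to mebibytes."""
    return n_bytes / (1024 * 1024)


if __name__ == "__main__":
    capture = (
        'eheap_alloc: Cannot allocate 457731892 bytes of memory '
        '(of type "ol[ 365.750349] heart: Erlang is crashing'
    )
    for size in failed_allocations(capture):
        print(f"{size} bytes is about {to_mib(size):.1f} MiB")
```

For the capture above this works out to a bit over 436 MiB in a single request, which is a lot to ask for on an RPi3B with 1 GB of RAM in total.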

jsimmonds · July 25, 2020, 11:09am

@jrwaters , YET more puzzles

Ok, we have a genuine Crash dump and Reboot issue now vs. a suspected “low power” issue (?) ( What colour is your RPi Power now ? ) Very good.

Re: the crash.dump file . . the Slogan text that you posted is #1 key
The #2 key is the Current Process . .

I don’t want to see the dump, but, we can get a good picture of the problem with those 2 items.

FBOS has some limitations on quantities of things in Groups and Farm Events at the moment.

What workload is your bot executing when this dump+reboot occurs ?
( Size of Group that the active Sequence is working with … … Number of Farm Events lined up … … etc. ?)

Be kind to your family . . With that test, what are you proving/disproving ?

jrwaters · July 25, 2020, 4:31pm

Hi @jsimmonds ,

Answers below. Thank you again for your help!

Jack

Current Process
In terms of the current process, there are a number of instances in the dump file relating to this - spread out over 127 lines. I hope it won’t be offensive to put that much here? Posted down at the bottom.

Power
I have the Raspberry Pi inside with a Canakit power supply so it is green. When in the normal FarmBot electrical box it is nearly always yellow . . sometimes red and rarely green. While not the trigger here, this seems like something I need to resolve.

Workload
Literally asking it to do nothing. In an idle state whether connected to arduino or not, it reboots every 7-8 minutes.

The Test to Which I Referred and the Potential Family Suffering
This problem happens round the clock. I can’t keep my system up for 10 minutes and it happens on any modern firmware version (10.X). All of this stuff was working for weeks and weeks so I’m trying to figure out “what changed”. It looks like there is a bug because there is a crash but I’m also interested in what triggered it. I’ve eliminated many variables (the Pi itself, the power supply). So, my network may have changed or my SDCARD may have degraded, etc. While I think my network is clean, I haven’t proven that by using a different network. For now, I haven’t set this up because, as you recommended, I’m trying to be kind to my family

Current Process Details
Slogan: eheap_alloc: Cannot allocate 457731892 bytes of memory (of type “heap”).
System version: Erlang/OTP 22 [erts-10.7.2.1] [source] [smp:4:4] [ds:4:4:10] [async-threads:1]
Compiled: Wed Jul 1 21:37:29 2020
Taints: Elixir.Circuits.I2C.Nif,Elixir.FarmbotCore.Asset.FarmEvent,esqlite3_nif,Elixir.Circuits.GPIO.Nif,asn1rt_nif,crypto
Atoms: 33119
Calling Thread: scheduler:0
=scheduler:1
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work: THR_PRGR_LATER_OP
Current Port:
Run Queue Max Length: 0
Run Queue High Length: 0
Run Queue Normal Length: 0
Run Queue Low Length: 0
Run Queue Port Length: 0
Run Queue Flags: OUT_OF_WORK | HALFTIME_OUT_OF_WORK
Current Process:
=scheduler:2
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work: THR_PRGR_LATER_OP
Current Port:
Run Queue Max Length: 0
Run Queue High Length: 0
Run Queue Normal Length: 0
Run Queue Low Length: 0
Run Queue Port Length: 0
Run Queue Flags: OUT_OF_WORK | HALFTIME_OUT_OF_WORK
Current Process:
=scheduler:3
Scheduler Sleep Info Flags: SLEEPING | POLL_SLEEPING | WAITING
Scheduler Sleep Info Aux Work: THR_PRGR_LATER_OP
Current Port:
Run Queue Max Length: 0
Run Queue High Length: 0
Run Queue Normal Length: 0
Run Queue Low Length: 0
Run Queue Port Length: 0
Run Queue Flags: OUT_OF_WORK | HALFTIME_OUT_OF_WORK
Current Process:
=scheduler:4
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work:
Current Port:
Run Queue Max Length: 0
Run Queue High Length: 0
Run Queue Normal Length: 0
Run Queue Low Length: 0
Run Queue Port Length: 0
Run Queue Flags: OUT_OF_WORK | HALFTIME_OUT_OF_WORK | INACTIVE
Current Process:
=dirty_cpu_scheduler:5
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work:
Current Process:
=dirty_cpu_scheduler:6
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work:
Current Process:
=dirty_cpu_scheduler:7
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work:
Current Process:
=dirty_cpu_scheduler:8
Scheduler Sleep Info Flags:
Scheduler Sleep Info Aux Work:
Current Process: <0.531.0>
Current Process State: Garbing
Current Process Internal State: ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL | ACTIVE | GC | DIRTY_ACTIVE_SYS | DIRTY_RUNNING_SYS
Current Process Program counter: 0x764d5a04 (gen_server:loop/7 + 336)
Current Process CP: 0x00000000 (invalid)
Current Process Limited Stack Trace:
0x4b83329c:SReturn addr 0x73065494 (proc_lib:init_p_do_apply/3 + 36)
0x4b8332b8:SReturn addr 0x2EE61C ()
=dirty_cpu_run_queue
Run Queue Max Length: 0
Run Queue High Length: 0
Run Queue Normal Length: 0
Run Queue Low Length: 0
Run Queue Port Length: 0
Run Queue Flags: OUT_OF_WORK | HALFTIME_OUT_OF_WORK | NONEMPTY | EXEC
=dirty_io_scheduler:9
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work:
Current Process:
=dirty_io_scheduler:10
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work:
Current Process:
=dirty_io_scheduler:11
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work:
Current Process:
=dirty_io_scheduler:12
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work:
Current Process:
=dirty_io_scheduler:13
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work:
Current Process:
=dirty_io_scheduler:14
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work:
Current Process:
=dirty_io_scheduler:15
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work:
Current Process:
=dirty_io_scheduler:16
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work:
Current Process:
=dirty_io_scheduler:17
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work:
Current Process:
=dirty_io_scheduler:18
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work:
Current Process:
=dirty_io_run_queue
Run Queue Max Length: 0
Run Queue High Length: 0
Run Queue Normal Length: 0
Run Queue Low Length: 0
Run Queue Port Length: 0
Run Queue Flags: OUT_OF_WORK | HALFTIME_OUT_OF_WORK

jsimmonds · July 26, 2020, 12:02am

True, however, all but the one on dirty_cpu_scheduler:8 are nil.
Further down in the =proc: details section of the dump you should find process <0.531.0>. If there is more detail in there, can you post it ?
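
For a long dump, the same filtering can be done with a few lines of Python. This is only a sketch under the assumption that the dump has the layout shown in the post above (a line starting with "=" opens a section, and each scheduler section has a "Current Process:" line); the function name is hypothetical.

```python
def busy_schedulers(dump_text):
    """Return {section_name: pid} for every section whose
    'Current Process:' line actually names a process."""
    busy = {}
    section = None
    for raw_line in dump_text.splitlines():
        line = raw_line.strip()
        if line.startswith("="):
            # A new section begins, e.g. "=scheduler:1".
            section = line[1:]
        elif line.startswith("Current Process:") and section is not None:
            # The trailing colon keeps lines such as
            # "Current Process State: ..." from matching here.
            pid = line[len("Current Process:"):].strip()
            if pid:
                busy[section] = pid
    return busy


if __name__ == "__main__":
    excerpt = """\
=scheduler:1
Current Port:
Current Process:
=dirty_cpu_scheduler:8
Current Process: <0.531.0>
Current Process State: Garbing
"""
    print(busy_schedulers(excerpt))
```

The pid it reports is the one to look up in the per-process part of the dump.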

Ok, but in the background FBOS does a “ton” of housekeeping and checking to keep things in sync with Web App’s view of the world. I’ve observed crashes similar to this ( where a Process is in the middle of doing a Garbage Collection for itself and suddenly asks for more heap than the system can supply ) . . caused in my case by too many Farm Events ( > 2000 ). The current process was Elixir.FarmbotCeleryScript.Scheduler trying to get the Farm Event time ordering up-to-date ( it carries around a huge amount of state, so it needs to GC quite often ).

That’s why I was enquiring about the numbers of “things” the bot is dealing with in your current garden setup.

You’re being so thorough up to now. That test needs to happen

jrwaters · July 26, 2020, 12:46am

Hi @Jsimmonds,

Regarding the housekeeping - sorry - I misunderstood. I have only 50-ish plants and just a couple of things scheduled each morning. I don’t have any weeds in my bed yet and my camera calibration isn’t quite right (see other post) so I haven’t scheduled any of that. Once a day I have a moisture sequence check in the early morning and then I have a watering sequence. Still, I’ll go back over all of that and see if I changed anything recently. Worst case, I’ll delete my scheduled events (after that other test).

Here is the process info:
=proc:<0.531.0>
State: Garbing
Name: ‘Elixir.FarmbotCeleryScript.Scheduler’
Spawned as: proc_lib:init_p/5
Spawned by: <0.318.0>
Message queue length: 1
Number of heap fragments: 1651
Heap fragment data: 847508
Link list: [<0.532.0>, <0.318.0>, {to,<0.3982.0>,#Ref<0.1440891332.806092801.248177>}, {to,<0.4753.0>,#Ref<0.1440891332.806354945.4523>}, {to,<0.3976.0>,#Ref<0.1440891332.806092804.227742>}, {to,<0.3975.0>,#Ref<0.1440891332.806092804.227787>}, {to,<0.3977.0>,#Ref<0.1440891332.806092804.227822>}, {to,<0.3979.0>,#Ref<0.1440891332.806092804.227838>}, {to,<0.3981.0>,#Ref<0.1440891332.806092804.227835>}, {to,<0.3980.0>,#Ref<0.1440891332.806092804.227809>}, {to,<0.3978.0>,#Ref<0.1440891332.806092804.227786>}, {from,<0.4753.0>,#Ref<0.1440891332.806354947.167595>}]
Reductions: 93526437
Stack+heap: 38323372
OldHeap: 79467343
Heap unused: 614327
OldHeap unused: 6263772
BinVHeap: 6
OldBinVHeap: 596
BinVHeap unused: 46416
OldBinVHeap unused: 45826
Memory: 474553836
New heap start: 42602018
New heap top: 4B5DB3EC
Stack top: 4B83329C
Stack end: 4B8332C8
Old heap start: 53208018
Old heap top: 64947DE4
Old heap end: 6612CD54
Program counter: 0x764d5a04 (gen_server:loop/7 + 336)
CP: 0x00000000 (invalid)
Internal State: ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL | ACTIVE | GC | DIRTY_ACTIVE_SYS | DIRTY_RUNNING_SYS
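
As a sanity check on those numbers, here is a small Python sketch that adds up the heap fields and compares them with the Memory field. Two assumptions are baked in and may be wrong: that the heap fields are counted in words while Memory is in bytes, and that a word is 4 bytes on this build (the addresses in the dump are 8 hex digits, which suggests a 32-bit VM). The function names are hypothetical.

```python
# Assumption: 32-bit Erlang VM, so one heap word is 4 bytes.
WORD_BYTES = 4


def parse_proc_fields(section_text):
    """Split the 'Key: value' lines of one =proc: section into a dict of strings."""
    fields = {}
    for line in section_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def heap_bytes(fields, word_bytes=WORD_BYTES):
    """Sum the heap-related fields (assumed to be in words) and convert to bytes."""
    names = ("Stack+heap", "OldHeap", "Heap fragment data")
    return sum(int(fields[name]) for name in names) * word_bytes


if __name__ == "__main__":
    section = """\
=proc:<0.531.0>
State: Garbing
Heap fragment data: 847508
Stack+heap: 38323372
OldHeap: 79467343
Memory: 474553836
"""
    fields = parse_proc_fields(section)
    print("heap fields in bytes:", heap_bytes(fields))
    print("Memory field:        ", int(fields["Memory"]))
```

With the values posted above the two figures agree to within about a kilobyte, which suggests the unit assumptions are reasonable and that nearly all of this process's memory is heap.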

Regarding that test - I have conceived of the way I want to do it without disrupting the family. Hopefully tonight!

PS - thanks for the commentary on the dump file and happy to extract anything else.

Jack

jrwaters · July 26, 2020, 3:51am

The network experiment

Dear diary. I’m shocked. I took a completely different brand of WiFi router (Asus) and connected it to my cable modem. Then I connected the Farmbot to the Asus. None of my network policies, etc. are in the picture. After about 9 minutes, I got the same crash - “eheap_alloc: Cannot allocate 549278268 bytes of memory (of type “he[ 567.256445] heart: Erlang is crashing … (waiting for crash dump file)
ap”).”

Running SpeedTest from this machine (on the same WiFi as Pi) shows normal rates for my home.

I’m running out of ideas. I did try a new SanDisk Class 10 SD Card and the result was the same.

@RickCarlino told me that you can run a FarmBot without the RTC daughter card and without the Arduino. I just want to double triple check that. If, for example, there is a bug where you get a crash after 10 minutes if you don’t have proper communication with your arduino or RTC . . . then that would send me in another direction.

jsimmonds · July 26, 2020, 6:32am

@jrwaters if Elixir.FarmbotCeleryScript.Scheduler is the crashing process again, the problem is likely to relate to the total number of Events ( which can be Sequences or Regimes ).
@RickCarlino is in the best position to sort this out with you, rather than this post-reply ping-pong that we’re conducting

For reference I’d like to see your Events list ( scroll to the bottom of the list )
Here’s mine ( which doesn’t seem to bother my RPi3B+ on FBOS 10.1.3 )

jrwaters · July 26, 2020, 6:38am

Happy to oblige. As I said, I’m a man of few events

I didn’t suspect events because I created a new FarmBot account and associated my FarmBot with that. The problem still happened [might have lost mind and be making this up but nearly sure].

Best
Jack