
Debugging CANBus and Communication timeout while homing/bytes_invalid


CAN bus toolhead boards have the great benefit of reducing wiring, but like many, I ran into periodic "Communication timeout while homing" errors, as well as bytes_invalid increasing regularly in the secondary MCU stats. I've seen some strange advice related to these issues. As I was able to find a complete resolution to my issues, I thought I'd write up my understanding of the problem and some strategies I found useful for troubleshooting.

The Problem

The first thing to understand is that "Communication timeout while homing", "bytes_invalid" and "bytes_retransmit" are all high-level indications of a problem somewhere in Klipper's multi-MCU stack, which includes the CAN bus but can also include USB, SPI, the Linux networking stack, Klipper's serial communication libraries, etc.

When we see "bytes_invalid", that means Klipper ran into a problem in serial communication with the secondary MCU, but it doesn't tell us where in the stack the problem is. It might be the physical CAN bus, or it might be somewhere else.

During homing it's critical that there isn't a large delay between the primary and secondary MCU. Generally the Z motors will be controlled by the primary MCU while the Z endstop/probe is connected to the secondary MCU on the CAN toolhead board. The latency between primary and secondary MCU needs to be low to get an accurate position when the endstop is triggered. Klipper tracks the round trip latency between the primary and secondary MCU and will fault if it gets to the point where it would impact homing accuracy (greater than 25 ms). Again, the cause of the latency can be anywhere in the multi-MCU stack. See https://www.klipper3d.org/Multi_MCU_Homing.html
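To put the 25 ms figure in perspective, here is a rough sketch of how far an axis can keep moving while a trigger report is still in flight. It is just speed multiplied by delay; the function name and the homing speeds are made up for illustration.

```python
def max_overshoot_mm(homing_speed_mm_s: float, latency_s: float) -> float:
    """Distance the axis can travel during the given communication delay."""
    return homing_speed_mm_s * latency_s


# Hypothetical homing speeds in mm/s, evaluated at the 25 ms limit.
for speed in (5.0, 10.0, 50.0):
    print(f"{speed:5.1f} mm/s -> up to {max_overshoot_mm(speed, 0.025):.3f} mm")
```

The faster you home, the more a given delay costs you, which is why the limit is enforced at all.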

These symptoms would tend to go together, as issues in the multi-MCU stack can result in bytes_invalid stats going up, and communication issues are going to increase latency as the stack retransmits data.

Advice To Be Avoided (at least in my opinion)

1. Increasing TRSYNC_TIMEOUT from 0.025 to 0.05

25 ms is a long time for a CAN bus with relatively low traffic. If your latency is overrunning 25ms, bumping the cutoff to 50ms is just ignoring the underlying problem. If you are getting "Communication timeout while homing" errors something is wrong. Look to fix the underlying issue, do not nerf the safety code.

2. Increasing your bitrate

A single 500 Kbps CAN bus is capable of handling all the ADAS modules on a modern car. With the exception of resonance testing, very little data actually flows over the multi-MCU bus on a 3D printer. I think the slowest I've ever seen recommended in a guide is 250 Kbps, which should easily come under the 25 ms check. My printer's bus is currently set to 1 Mbps and my srtt usually sits at 1 ms. 50 ms is probably an order of magnitude more latency than you should be seeing on a healthy 250 Kbps bus. There are other reasons to bump the bitrate; we'll get to that later.
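As a back-of-the-envelope check on the bitrate argument, the sketch below estimates how long a single frame occupies the wire. It assumes classic CAN frames with an 11-bit identifier and worst-case bit stuffing; the function names are my own, and the real traffic on your printer will be a mix of frame sizes.

```python
def worst_case_frame_bits(data_bytes: int = 8) -> int:
    """Worst-case size in bits of a classic CAN frame with an 11-bit ID.

    47 framing bits (including 3 bits of interframe space), 8 bits per
    data byte, plus the maximum number of stuff bits the frame can need.
    """
    if not 0 <= data_bytes <= 8:
        raise ValueError("classic CAN carries 0 to 8 data bytes")
    stuff_bits = (34 + 8 * data_bytes - 1) // 4
    return 47 + 8 * data_bytes + stuff_bits


def frame_time_ms(bitrate_bps: int, data_bytes: int = 8) -> float:
    """Time one worst-case frame occupies the bus, in milliseconds."""
    return worst_case_frame_bits(data_bytes) / bitrate_bps * 1000.0


for bitrate in (250_000, 500_000, 1_000_000):
    t = frame_time_ms(bitrate)
    print(f"{bitrate:>9} bit/s: {t:.3f} ms per frame, "
          f"{25.0 / t:.0f} frames fit in 25 ms")
```

Even at 250 Kbps a full frame takes about half a millisecond, so a 25 ms timeout leaves room for dozens of frames.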

Debugging

If you've ended up here, it's probably because you ran into the homing timeout issue. The first thing to check is your "bytes_invalid" and "bytes_retransmit" multi-MCU stats. Go to the MACHINE view in Mainsail and click on your CAN MCU, which will bring up the stats for that secondary MCU. These stats will also be in the Klipper logs.

[Screenshots: MCU stats for the CAN MCU in Mainsail]

Watch these stats while the printer is in operation. Ideally "bytes_invalid" and "bytes_retransmit" are always at 0. However, I think depending on how things initialize with your specific hardware, you might get a few at initial startup and that could theoretically be fine. What is cause for concern is for these numbers to be increasing as the printer operates.
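If you would rather pull these counters out of the Klipper log than watch the UI, something like the sketch below could work. It assumes the "Stats" lines have the layout shown in SAMPLE (a section name followed by key=value pairs); the MCU name "EBBCan" and the numbers are invented, so check a line from your own klippy.log before relying on it.

```python
# Invented example of a "Stats" log line; the layout is an assumption.
SAMPLE = (
    "Stats 3601.2: gcodein=0 mcu: mcu_awake=0.003 bytes_retransmit=0 "
    "bytes_invalid=0 srtt=0.000 EBBCan: mcu_awake=0.010 "
    "bytes_retransmit=27 bytes_invalid=9 srtt=0.001"
)


def parse_stats_line(line: str) -> dict:
    """Split a stats line into {section: {key: value}} using plain strings."""
    _, _, rest = line.partition(": ")  # drop the "Stats <time>" prefix
    sections = {}
    current = None
    for token in rest.split():
        if token.endswith(":"):
            current = sections.setdefault(token[:-1], {})
        elif "=" in token and current is not None:
            key, _, value = token.partition("=")
            current[key] = value
    return sections


def comm_errors(line: str) -> dict:
    """Return the retransmit/invalid counters for every MCU section."""
    wanted = ("bytes_retransmit", "bytes_invalid")
    return {
        name: {key: int(fields.get(key, 0)) for key in wanted}
        for name, fields in parse_stats_line(line).items()
        if "bytes_invalid" in fields
    }


print(comm_errors(SAMPLE))
```

Run it over two lines a few minutes apart and compare; it is the growth that matters, not the absolute value.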

My bytes_invalid/bytes_retransmit numbers are going up, is it the CAN Bus?

It may or may not be the CAN bus. Here are some troubleshooting steps to try to find out one way or the other.

First, check the error stats on the CAN interface of your host. SSH into your host and run:

ip -details -statistics link show can0

[Screenshot: output of the command above]

CAN bus requires 2 healthy devices for communication to be successful, so if your host was trying to send messages before the secondary MCU was up and acknowledging, you might see a few errors here, and that wouldn't necessarily be an issue. The cause for concern is if you run this command repeatedly while the printer is operating and see the errors increasing. If the stats here are stable, your physical CAN wiring is probably fine and the low level operation of the bus is healthy.
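Re-running the command by hand gets tedious, so here is a small sketch that reads the generic Linux network counters from sysfs and reports which ones grew between two samples. The statistics directory exists for every network interface, but which counters a particular CAN driver actually updates is driver dependent, so treat this as a complement to the command above and not a replacement. The function names are my own.

```python
from pathlib import Path

COUNTERS = ("rx_errors", "tx_errors", "rx_dropped", "tx_dropped")


def read_counters(iface: str = "can0", sysfs_root: str = "/sys/class/net") -> dict:
    """Read the interface error/drop counters exposed by the kernel."""
    stats_dir = Path(sysfs_root) / iface / "statistics"
    return {name: int((stats_dir / name).read_text()) for name in COUNTERS}


def increased(before: dict, after: dict) -> dict:
    """Return only the counters that grew, with the size of the increase."""
    return {k: after[k] - before[k] for k in before if after[k] > before[k]}


# Example use on the printer host (not executed here):
#   import time
#   first = read_counters("can0")
#   time.sleep(60)            # run a print or a homing cycle meanwhile
#   print(increased(first, read_counters("can0")) or "counters stable")
```

An empty result after the first minute of uptime is what you are hoping for.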

These errors are going up too, what next?

1. Wiggle your wiring and see if you can induce spikes of errors. If you can, recheck your crimps and the continuity of the wires.

2. Verify your termination resistors on both ends of the bus by checking the resistance between the high and low wires with the printer powered off. Anywhere convenient to access with a multimeter is fine. You should see 60 Ohms across H and L (two 120 Ohm terminators in parallel). The bus is so short there probably isn't going to be a huge amount of ringing, but it doesn't hurt to make sure this is to spec.

3. Twist your CAN H and CAN L wires. Much like the termination resistors this probably isn't going to make or break the health of your bus for such a short run, but printers can be noisy environments, so it can't hurt to do this to spec as well.

4. If you have an oscilloscope, you could take a peek at CAN H / CAN L to look for obvious problems. Don't read into noise too much though.

[Image: oscilloscope capture of CAN H and CAN L]

Scoping my bus, you can see periodic spikes of noise from the 24V PSU. Looking at just CAN L or CAN H in isolation, it seems like the noise might be bad enough to flip a bit now and then, but it's important to remember that CAN H and CAN L carry the same data, with H and L inverted from each other. The CAN transceiver inverts one signal and sums them together, canceling out much of the noise that appears equally on both lines. This is why we twist CAN H and CAN L together. In my case, this level of noise causes no issues despite not looking so great.
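A toy model makes the cancellation easier to see. The sketch below uses the usual nominal levels (about 3.5 V / 1.5 V for a dominant bit, about 2.5 V on both lines for a recessive bit) and adds the same made-up noise voltage to both wires. It is an idealized illustration, not a model of any particular transceiver, and it only holds while the noise stays inside the transceiver's common-mode range.

```python
DOMINANT = (3.5, 1.5)    # nominal (CAN H, CAN L) volts for a 0 bit
RECESSIVE = (2.5, 2.5)   # nominal (CAN H, CAN L) volts for a 1 bit
DIFF_THRESHOLD_V = 0.9   # H - L at or above this reads as dominant
SINGLE_THRESHOLD_V = 3.0 # naive midpoint if you looked at CAN H alone


def line_voltages(bits, noise):
    """Return (H, L) samples with the same noise added to both wires."""
    samples = []
    for bit, n in zip(bits, noise):
        h, l = DOMINANT if bit == 0 else RECESSIVE
        samples.append((h + n, l + n))
    return samples


def decode_differential(samples):
    """What the transceiver does: look only at the difference H - L."""
    return [0 if (h - l) >= DIFF_THRESHOLD_V else 1 for h, l in samples]


def decode_can_h_only(samples):
    """What your eye does on a scope when staring at one trace."""
    return [0 if h >= SINGLE_THRESHOLD_V else 1 for h, _ in samples]


bits = [0, 1, 1, 0, 1, 0, 0, 1]
noise = [0.0, 1.2, -0.8, -1.1, 0.9, 0.3, -1.4, 1.0]  # invented spikes, volts
samples = line_voltages(bits, noise)

print("sent:        ", bits)
print("differential:", decode_differential(samples))
print("CAN H only:  ", decode_can_h_only(samples))
```

The differential decode recovers the bits while the single-trace reading does not, which matches what I saw: an ugly looking trace and a perfectly healthy bus.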

5. If you are only seeing CAN errors during extremely high load operations, like resonance testing, verify that you aren't overloading the bus.

canbusload can give insight into how much headroom, or lack thereof, you have during different operations. You must specify your bitrate to get a correctly scaled graph. The example below gets the load for interface can0 running at 250 Kbps.

sudo apt install can-utils

canbusload can0@250000 -c -b

Under normal operations 250 Kbps was never loaded heavily, but during resonance tests the bus would become 100% loaded, and CAN errors, bytes_invalid and bytes_retransmit would all start popping up, as you'd expect when trying to send more messages than the bus can handle. If you run into these stats increasing only when doing things like resonance testing, and at no other times, check with canbusload and try bumping the bitrate if you confirm you are out of headroom with your current settings.
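To get a feel for what a bitrate bump buys you, here is a rough load estimate. It assumes every frame is a worst-case 135-bit frame (8 data bytes, 11-bit identifier, maximum bit stuffing), and the 1500 frames per second figure is invented for illustration; canbusload remains the way to measure what your bus is really doing.

```python
ASSUMED_BITS_PER_FRAME = 135  # worst-case classic CAN frame, an assumption


def bus_load_percent(frames_per_s: float, bitrate_bps: int,
                     bits_per_frame: int = ASSUMED_BITS_PER_FRAME) -> float:
    """Share of the bus capacity used by the given frame rate."""
    return 100.0 * frames_per_s * bits_per_frame / bitrate_bps


def max_frames_per_s(bitrate_bps: int,
                     bits_per_frame: int = ASSUMED_BITS_PER_FRAME) -> float:
    """Frame rate at which the bus is completely saturated."""
    return bitrate_bps / bits_per_frame


hypothetical_rate = 1500  # frames per second, made up for this example
for bitrate in (250_000, 500_000, 1_000_000):
    print(f"{bitrate:>9} bit/s: {bus_load_percent(hypothetical_rate, bitrate):5.1f}% "
          f"load, saturates near {max_frames_per_s(bitrate):.0f} frames/s")
```

The same traffic that nearly fills a 250 Kbps bus leaves plenty of headroom at 1 Mbps.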

 

I don't see CAN bus issues, but I'm still getting bytes_invalid or homing timeouts

If your CAN stats are looking stable once everything has had a minute to boot up, and errors are 0 or at least not increasing beyond initial bootup, it's time to look up the stack. Once I validated that my bus was fine, I found my issue straight away, so this section isn't going to be exhaustive.

I'm using a BIGTREETECH U2C adapter to provide a CAN interface on my Raspberry Pi via USB. I had validated that the CAN bus was fine, yet Klipper was still reporting problems. So the issue had to be somewhere between the U2C and the RPi. Looking at the Klipper code, one cause of the bytes_invalid counter increasing would be messages arriving out of order. My best guess was that the candlelight firmware running on the U2C was not properly buffering messages and sometimes swapped the order when forwarding them over USB to the RPi. Before digging into the candlelight firmware code, I tried flashing the U2C with the BIGTREETECH fork of candlelight (https://github.com/bigtreetech/U2C/tree/master/firmware) and that completely resolved my issue. No more bytes_invalid, no more timeouts during homing.

If that hadn't worked, I would have tried swapping to a different CAN adapter next, ideally one using a totally different interface type. For example, going from a USB CAN interface to an RPi CAN hat that connects to the Pi via the SPI bus.

Hopefully somewhat documenting my path to a resolution can be helpful to others running into this.


On 2/28/2023 at 1:13 AM, locki said:

Very nice detailed how to, can you just add how to flash firmware via DFU?

If you are running into these issues, you've already flashed your hardware, so you should be past that. My intention was to focus on debugging these issues with CAN bus, not to write another how-to for initial setup. There are quite a few setup guides at this point.


I have the same problem: Klipper on an Orange Pi 5 with a Mellow SHT36 CAN bus board. I removed the CAN bus and connected the SHT36 to the Orange Pi 5 over USB. Same problem: communication timeout while homing Z. If I use the Raspberry Pi with an SD card, which is much slower than the Orange Pi 5 with an SSD, I have no timeout problems.

 

 


I am having the communication timeout during homing issue as well when attempting QGL. I followed the steps above and am getting increasing rx errors, but only when tap is engaging during the probing. I have checked the wiring, and that all seems solid, along with my termination resistance. I'm not sure what to do now.

39 minutes ago, PFarm said:

Any suggestion what else I can go/check?

What are your microsteps set at? I have read a few comments on Discord that anything above 32 can be a reason for the timeout.
