sigma-star
|
09.04.2023

Too many CPUs to build

Lately, we’ve been facing strange build errors on one of our build servers: builds of a Yocto-based firmware sometimes failed. The error message itself was clear (bitbake’s git fetcher was unable to pull sources from a remote git server), but what caused the problem was not. The message from bitbake indicated a connection reset. Yet cloning the affected repository manually worked perfectly fine, and so did building the failed target by running bitbake manually.

When the issue first occurred, we thought it was a network glitch and didn’t pay much attention. But the issue persisted. With more failures, we were able to characterize the problem:

  • Incremental rebuilds were the most likely to fail
  • Full builds with an empty sstate-cache never failed
  • Full builds with a populated sstate-cache failed often
  • Manual rebuild of a single failed target always worked
  • Only recipes with AUTOREV were affected
  • Only recipes that fetch from a specific git server using ssh were affected

In the meantime, we suspected the connection to the git server, but none of our tests indicated a problem. Reaching the git server in question involves an ssh jump host and a VPN connection to a customer, so quite a few components were not under our control.

Finally, setting the log level of the ssh jump host to verbose gave us the crucial clue, as it logged:

drop connection #10 from hidden:48324 on hidden:22 past MaxStartups

So we had been overwhelming the jump host by opening more than 10 concurrent unauthenticated connections. This was the reason behind the strange build errors! Thanks to Yocto’s parallelism, bitbake ran git ls-remote in parallel and established at least one connection per CPU, and the affected build server has way more than 10 CPUs. This explains why only our most powerful server was affected, and why only rebuilds or builds with a populated sstate-cache sometimes failed. All other builds ran enough other tasks in between to never reach the connection limit.
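To make the mechanism concrete, here is a minimal Python sketch of how sshd decides whether to drop a new unauthenticated connection when MaxStartups is given in its start:rate:full form, following the description in the sshd_config manual. The function name drop_probability is made up for illustration, and the values 10:30:100 are the OpenSSH defaults, which we assume here; the actual configuration of the jump host is not shown in this post.

```python
def drop_probability(unauthenticated, start=10, rate=30, full=100):
    """Approximate chance (0.0-1.0) that sshd refuses a new connection.

    Models the documented MaxStartups "start:rate:full" behaviour:
    below `start` pending connections nothing is dropped, at `start`
    the drop chance is `rate` percent, and it grows linearly until
    every attempt is refused at `full`.
    """
    if unauthenticated < start:
        return 0.0
    if unauthenticated >= full:
        return 1.0
    span = full - start
    percent = rate + (100 - rate) * (unauthenticated - start) / span
    return percent / 100.0


if __name__ == "__main__":
    # Hypothetical numbers of simultaneously pending `git ls-remote` logins.
    for pending in (4, 9, 10, 32, 64, 100):
        chance = drop_probability(pending)
        print(f"{pending:3d} pending -> {chance:.0%} drop chance")
```

Under these assumed defaults, a machine that opens only a handful of connections at once never enters the drop range, while a machine with dozens of CPUs does so easily. That matches the observation that only our most powerful server was affected.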

The moral of the story?

Always make sure that you allow more parallel connections than you have CPUs. In our case, adjusting sshd’s MaxSessions and MaxStartups settings fixed the problem.
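As a rough sanity check, the rule of thumb can be sketched in Python. The helper names parse_max_startups and is_safe_for are hypothetical, and using os.cpu_count() as the number of parallel fetches is an assumption based on the "one connection per CPU" observation above. The sketch only looks at MaxStartups and ignores any other clients connecting to the same sshd.

```python
import os


def parse_max_startups(value):
    """Parse a MaxStartups value: either "N" or "start:rate:full"."""
    parts = [int(p) for p in value.split(":")]
    if len(parts) == 1:
        # A single number is a hard limit with no early random drop.
        return parts[0], 100, parts[0]
    if len(parts) == 3:
        return tuple(parts)
    raise ValueError(f"unexpected MaxStartups value: {value!r}")


def is_safe_for(parallel_fetches, max_startups):
    """True if that many simultaneous logins never enter the drop range.

    A new attempt is only at risk once `start` unauthenticated
    connections are already pending, so up to `start` parallel
    logins from a single client are fine.
    """
    start, _rate, _full = parse_max_startups(max_startups)
    return parallel_fetches <= start


if __name__ == "__main__":
    cpus = os.cpu_count() or 1  # assumed upper bound on parallel fetches
    for setting in ("10:30:100", "64:30:128"):  # example values only
        verdict = "ok" if is_safe_for(cpus, setting) else "too low"
        print(f"MaxStartups {setting} with {cpus} CPUs: {verdict}")
```

The example settings are placeholders; pick values that fit the number of CPUs on your largest build server.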

Authors

Richard Weinberger
