Case Study: Improving Kubo’s Maintainer Quality of Life

Case Study: How we improved the Kubo team’s quality of life across three axes

- Kubo project & Kubo team's description + how IPDX joined the team
  - kubo team structure
  - ipdx team structure
  - workflow
  - relationship
  - how we start working together
- improving life with uci
  - problem / workflow / solution / benefits / ipdx special sauce
- improving life with github management
  - problem / workflow / solution / benefits / ipdx special sauce
- improving life with kuboreleaser
  - problem / workflow / solution / benefits / ipdx special sauce
- results of collaborations
- takeaways
- open to CTA.

start / end

piotr joins the IPDX team,

our workflow was mostly

We meet with people and give them space to complain, acting as a listening “psychologist”. Just letting people share complaints is very useful.
A few Kubo maintainers doing MANY different tasks / wearing MANY hats.
as a very small team, most of our work was 1:1
watching from the sidelines and helping. making medicine available after we listen to them.
mental model of how much time / effort we would save. How widespread an issue is, how many people complained.
joining pretty much every meeting, Slack channel, forum, etc.: wherever people spend time discussing issues. It’s important to follow all the places where people hang out.

meetings (zoom),
sync / async slack conversation,
a lot of stuff happening on GitHub issues, the discuss forum, Discord, Element,
historic discussions: learning about the history, reading old issues and discussions.
open culture meant we had visibility into what was happening, the real discussions and troubles (recommendation: foster open communication instead of DMs)

summary

so you crawl through discussions, 1:1s, etc., gather requirements, come up with new changes, select one yourself, and work on it.

were there any dead ends?

pre-emptively investing in uci customization, templating, distributing workflows, etc. That was, in retrospect, not necessary. The ideas were born out of real-life problems but they didn’t get traction in real life.
templating was good, but aimed at the wrong target: we assumed maintainers had a use for it and would keep using it to customize their workflows. That wasn’t the case: people were happy adjusting uci a bit rather than doing the customization. For uci maintainers, templating is useful because we can work faster. But in the end that was mostly a selfish / internal improvement.

fix: would have been gathering more data points before getting into it.

alumni

started working on unified CI

There was Kubo and 100s of libraries and repositories across dozens of orgs.
They are all dependencies, and they don’t really have active maintainers.
general design of Kubo is “a bunch of libraries / micro-services / micro single-purpose libraries stitched together”. It was annoying for everyone, a lot of maintenance overhead for the teams. Even a change that takes very little time gets multiplied across all the repos.

Very similar to microservices.

unified CI applies to this space, unifies the maintenance of CI

unified how repos are built, tested, and released (version.json)

We reused what Marten was doing in libp2p and what other maintainers were familiar with.

go-libp2p built by Marten and stitched together. (beginning of 2021: Marten, end of 2021: Piotr)
before: no tests, or a manual setup where someone cared, which meant points of failure. Non-uniform quality. Updates don’t happen; at best you copy and paste some config most of the time.
gets you out of the zone: you are coding, you try to release a repository, and now you realise something is missing. Death by a thousand papercuts. Context switching.

Before uci

Everyone would “reinvent” the same CI, because everyone googles and everyone finds the same tutorial, but everyone does it slightly differently. High cognitive load.
Marten was already working on setting up uci in the kubo ecosystem

at the moment:

all the repos, including boxo, use uci, except kubo

Issues:

When proposed, PRs had no “owners”: no one was around to adjust, update, and merge them (formatting, dependencies, etc.).
We automated that workflow as well: a sort of installation wizard.

Another big thing:

automate the detection of new members of the ecosystem / new repositories. It’s a living system, so new projects join and you need to make sure their setup is automated as well. Make sure they can’t stray in the wrong direction.
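As an illustration of the detection step, here is a minimal Go sketch. It assumes the list of repositories in an org and the list of repositories already enrolled in uci are both available as plain slices; the function name `MissingRepos` and the `example-org/...` names are hypothetical, and a real run would fetch the lists through the GitHub API.

```go
package main

import (
	"fmt"
	"sort"
)

// MissingRepos returns the repositories that exist in the organisation but
// are not yet enrolled in the shared CI setup, sorted for stable output.
func MissingRepos(existing, enrolled []string) []string {
	known := make(map[string]bool, len(enrolled))
	for _, r := range enrolled {
		known[r] = true
	}
	var missing []string
	for _, r := range existing {
		if !known[r] {
			missing = append(missing, r)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Hypothetical data standing in for an API listing.
	existing := []string{"example-org/lib-a", "example-org/lib-b", "example-org/new-lib"}
	enrolled := []string{"example-org/lib-a", "example-org/lib-b"}
	fmt.Println(MissingRepos(existing, enrolled))
}
```

Running a check like this on a schedule is one way to notice a new repository before it drifts away from the shared setup.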

Kubo maintainers

Introduction of unified CI for JS was pretty important as well.
In Kubo specifically, we didn’t introduce unified CI. It had custom values, it was a really complex and custom thingy, so it was fine to keep that single project as a snowflake. That helped people concentrate on the complex snowflake rather than doing busy work for the other repositories.

reduced the maintenance,
it also made the micro-library / service model possible at scale

(enabled a design pattern for the team)
there are pros and cons to these design patterns, and it’s a shame that the restraining factor is the developer tooling, not the actual efficiency of the software model.
micro-services tend to “stumble on themselves”

simplified onboarding, you learn how to use ONE workflow and now you know how to build and use everything else
improved quality: released, versioned, tested

a baseline for everyone, and the baseline quality sort of came for free
build on Windows, latest Go versions, macOS, etc. Race detection. These things are easy to miss / forget.
new features would be ported automatically (-shuffle update)

not using it

kubo needs a lot of customization and UCI was not “up to date”
one limitation was the release workflow: it’s more edge-case-y, and maintainers follow more specific release processes. And some maintainers wouldn’t want to use it.
Hector had a sort of muscle memory about his release workflow, and he did not want to change it. We accommodated his workflow so it could be customized and he could import “only what he wanted”.

Explicitly tell unified CI “exclude feature X”.
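To make the “exclude feature X” idea concrete, here is a minimal Go sketch of an opt-out model: every repository gets the full default set of checks unless it explicitly excludes some. The feature names (`test`, `lint`, `release`) and the `EnabledFeatures` helper are hypothetical and do not reflect the actual uci configuration format.

```go
package main

import "fmt"

// EnabledFeatures returns the default features minus the ones a repository
// explicitly excluded. Order of the defaults is preserved.
func EnabledFeatures(defaults, excluded []string) []string {
	skip := make(map[string]bool, len(excluded))
	for _, name := range excluded {
		skip[name] = true
	}
	var enabled []string
	for _, name := range defaults {
		if !skip[name] {
			enabled = append(enabled, name)
		}
	}
	return enabled
}

func main() {
	// Hypothetical feature names; a maintainer opts out of the release workflow only.
	defaults := []string{"test", "lint", "release"}
	fmt.Println(EnabledFeatures(defaults, []string{"release"}))
}
```

The design choice is that opting out is explicit and visible in the repository, while everything else keeps following the shared baseline.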

kubo releaser

noticed:

derails the entire team. huge impact across many maintainers. ⇒ huge choice.
your release takes a month, every month. It delays everyone.
it had impact on community: low predictability, they wanted to know “I raised an issue, it was fixed, when can I use this?”

it was not just a developer diva tooling improvement, but very impactful for the product. (highlight for article)

Discussing with the team: correct cadence should be 1 month. There was a balance to strike between too many binaries and a lack of feedback vs. being too slow to ship improvements

team decided on 1 month, without changing anything; they tried over and over again, and failed

prepare RC, let it bake for a week (deploy on bootstrappers); if there are any issues they would need another RC. It’s “incompressible” without changing the whole testing process.
actual tasks for the release were time-consuming. The process took someone a week if everything went according to plan, but it regularly took two or three.
diagram:

monthly release → tooling + testing + development is 5 weeks → you’re losing.

Could we have changed this? I mean we knew it took us 2 months to release before, so maybe not aim for a monthly release?

we made an ambitious goal possible; it was a hard requirement from a product point of view. Achievable but ambitious. And we made it possible, because we could improve part of the process.

Identify the problem?

Embed with the team

“call me anytime you are doing something release related”

Piotr doesn’t do anything else for 2 weeks. It was so involved.
Follow along for the first rc, silently, just see what happens.

second RC, ask questions

how do you know that you need to do this thing? how did you set up that local environment? who did you ask to get access to this weird system you post to?
gather what’s “in-between the lines”. Super important for devexp: we embed because a lot of what engineers do is unwritten, muscle memory.
noticed the long waiting involved, which is very time-consuming. We care about what he’s doing, when he is waiting, etc. And we realise he’s mostly waiting because he doesn’t know when / where he’ll be needed next. (WHY?)

The next release, we rewrite the instructions. The release after that, we pair again: he’s driving, but he’s following the instructions one by one.

We write down the missing steps without changing a thing, just capturing everything.
Knowledge is important, no implicit actions. Explicit.
Almost “no code” mindset.
mindlessly following the list, super freeing, no mental overhead. no creative input. No second guessing.
There were still a few gaps, but far fewer.
CRITICAL: by then we KNEW what steps and code we need to implement.

The third release

Piotr does the release, and follows the instructions. Antonio reviews.
Spent a lot of time discussing different options, what is important, what is not important, etc.
What to do if someone else needs to learn about the process? How do you interact with other teams / what are the stuff that you can’t control. What do you like in your tooling?
Long, detailed, gathering process. Embedding.

By then, we decided to build kuboreleaser, a “glorified” Go-based makefile

groups together the things you do during a Kubo release (push binaries, update package managers, promote, etc.): abstractions

These are activities and then you implement the actions
Actions are like: send a slack message, create a PR, upload an artifact, etc.
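The activity / action split could be sketched in Go as follows. This is a minimal illustration, not the kuboreleaser code: the `Action` interface, the `Activity` struct, and the `logAction` stand-in are all hypothetical names, and the actions only record that they ran instead of touching Slack or GitHub.

```go
package main

import "fmt"

// Action is one small reusable step, e.g. "send a Slack message".
type Action interface {
	Name() string
	Run() error
}

// Activity groups the actions needed to reach one release goal.
type Activity struct {
	Goal    string
	Actions []Action
}

// Run executes the actions in order and stops at the first failure.
func (a Activity) Run() error {
	for _, act := range a.Actions {
		if err := act.Run(); err != nil {
			return fmt.Errorf("%s: %s: %w", a.Goal, act.Name(), err)
		}
	}
	return nil
}

// logAction is a stand-in that records its name instead of doing real work.
type logAction struct {
	name string
	log  *[]string
}

func (l logAction) Name() string { return l.name }
func (l logAction) Run() error {
	*l.log = append(*l.log, l.name)
	return nil
}

func main() {
	var log []string
	promote := Activity{
		Goal: "promote release",
		Actions: []Action{
			logAction{name: "create PR", log: &log},
			logAction{name: "send Slack message", log: &log},
		},
	}
	if err := promote.Run(); err != nil {
		fmt.Println("failed:", err)
		return
	}
	fmt.Println(log)
}
```

With this shape, swapping the chat tool only means replacing one action; the activity that describes the goal stays the same.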

Piotr and Antonio identified the goals (functional) and what activities you do to achieve these goals (which might change often, Slack project changes).
using abstractions that are intuitive and easy to work on and change if you have a Kubo maintainer mindset.
Important goal: familiarity for developers, executing in their env. A real release process is stressful, so it’s important to make sure it feels natural for these maintainers.

Gathered requirements: needed go, everyone has different laptops, so it has to be env agnostic (Dockerfile). core features also: it’s a time consuming process, there are different time zones, it should be easy to hand off the process to someone else and collaborate on it.

feature: there is no local source of truth. We check the actual outcomes of the actions (before you push the binary, is the binary already released? or maybe the Slack message was already published?). Slack message, forum announcements, etc.
And it’s also the easiest way to move to automation (CI). But we’re not changing everything at once, it’s a critical process. Iterative approach, improvements.
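A minimal Go sketch of the “check the actual outcome first” idea, under the assumption that each step can ask the outside world whether its result already exists. `Step`, `RunSteps`, and `newFakeStep` are hypothetical names, and an in-memory map stands in for GitHub, Slack, and the forum.

```go
package main

import "fmt"

// Step pairs an action with a check against the real outcome.
type Step struct {
	Name string
	Done func() (bool, error) // asks the outside world whether the outcome exists
	Do   func() error         // performs the action
}

// RunSteps skips every step whose outcome already exists and returns the
// names of the steps it actually executed.
func RunSteps(steps []Step) ([]string, error) {
	var ran []string
	for _, s := range steps {
		done, err := s.Done()
		if err != nil {
			return ran, fmt.Errorf("checking %q: %w", s.Name, err)
		}
		if done {
			continue
		}
		if err := s.Do(); err != nil {
			return ran, fmt.Errorf("running %q: %w", s.Name, err)
		}
		ran = append(ran, s.Name)
	}
	return ran, nil
}

// newFakeStep builds a step backed by an in-memory map instead of real services.
func newFakeStep(world map[string]bool, name string) Step {
	return Step{
		Name: name,
		Done: func() (bool, error) { return world[name], nil },
		Do:   func() error { world[name] = true; return nil },
	}
}

func main() {
	// The binary is already published; only the announcement is missing.
	world := map[string]bool{"binary": true}
	steps := []Step{newFakeStep(world, "binary"), newFakeStep(world, "announcement")}

	ran, err := RunSteps(steps)
	fmt.Println(ran, err)

	// A second run, possibly by another maintainer, finds nothing left to do.
	ran, err = RunSteps(steps)
	fmt.Println(ran, err)
}
```

Because no state lives on the laptop, anyone can pick up a half-finished release, which is also what makes a later move to CI straightforward.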

Implementations

core orchestration:

Evaluated dagger to write this. benefits dagger: more granular caching of steps.
Mage: package activities. go-makefile like. isolated steps. kind of worked.
Final idea: one container with the env, and in it you run the CLI tool. So flip the idea around, keep the main benefits.

urfave/cli
it’s a regular Go binary, perfect for Go developers. They know what it is.

I/Os: go-git, GitHub client in Go, Matrix client,
hard parts: signed tags in releases. How do you sign a tag in an automated fashion?

gather info from your system: signing key, tokens, etc. With your permission. Making it available for the tool running in Docker.
we missed the YubiKey / git signing key case. It’s sort of an oversight, that happens. It came out during user testing. We worked around it.

Introduced a way to have the tool give you something, you sign it, then send back the info.
Future: port this to CI, trigger a workflow, signing is done by a bot user shared by the maintenance team. No local credentials and stuff. As long as you have access to this system you can trigger the release, with no direct access to the credentials.

support developer workflows?

release
impact

improving maintainers life - github management

2022 - april

talking with people during labweek. Many people complaining about access: there was no story around access to repositories. “who gets access where?”, “how do you get access to a repo?”, “how do you find out who can access what?”, “how to ensure someone does not have access to something?”. That opened the discussion.
Look into how big players do it, and adapt it to the open source world.
big enterprises:

team management, repositories, etc → it’s a one-way thingy: some IT admin manages the source of truth for contributors, teams, orgs, etc. Then some dispatch from that source of truth percolates through the ecosystem.

That solution doesn’t work for us: blurred line of working / not working with someone

open source, community driven, etc. It needed a more flexible and more efficient system: collaboration, decentralizing decisions away from a single person. Need for a collaborative process ⇒ having an open source repository, where you can propose changes, discuss them, and involve the relevant parties, was critical. Adapt to the team structure and mindset.
It was also possible to automate changes, enforce quality, enforce force-push rules, etc.
We could apply the same principles as uci:

enforce code scanning for everyone,
default branch protected,
etc

takes away the how to do it, and concentrate on the what you want to achieve

story

start implementing, distribute to larger and larger organization (small orgs like multiformats), up to ipfs, and even larger ones.
it’s living in private orgs now. We built it for open source with the assumption that it would be super useful for private orgs (they have many repos, and helping a community collaborate, whether fully public or closed, provides pretty much the same value).

all features were “great” for private orgs.
the management repo is not public, because it contains critical info. But we did give the group of admins a kind of superpower.

they could audit member activities, verify the number of seats, etc.
The project is almost more relevant in a private orgs, because you can automate checking the seats you should be paying for (save money), etc.

Very conscious decision: make it work alongside existing process.

Critical for such large orgs
Example: two way synchronization means you can always do it “the old way” (through the UI).

Explain the feature,
You don’t mess up everyone’s workflow: the orgs and team members can transition at their own pace, and they would move to config as code because it was just better.
light “advertising”: because we were involved in / looking at so many discussions, we would advertise the feature, “you could do this in code”, instead of waiting for an answer on Slack from someone. “just do it yourself, it’s better”. Then word of mouth.

Changes we implemented along the way

Lack of verbosity → would be nicer to have a conversation, readable.

follow-up features: “I want to add user X to this team, what access to which repos would they get?”
we added feedback
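The “what access would user X get?” question can be answered from the configuration alone. Below is a minimal Go sketch assuming a simplified, hypothetical config shape (team → repository → permission); `Config` and `PreviewAccess` are illustrative names, and the permission names follow GitHub's repository roles.

```go
package main

import (
	"fmt"
	"sort"
)

// Config is a simplified, hypothetical config-as-code model:
// team name -> repository -> permission.
type Config struct {
	Teams map[string]map[string]string
}

// rank orders GitHub repository roles from weakest to strongest.
var rank = map[string]int{"pull": 1, "triage": 2, "push": 3, "maintain": 4, "admin": 5}

// PreviewAccess reports the repository permissions a user would hold after
// joining the given teams, keeping the strongest permission per repository.
func PreviewAccess(cfg Config, teams ...string) map[string]string {
	access := map[string]string{}
	for _, team := range teams {
		for repo, perm := range cfg.Teams[team] {
			if rank[perm] > rank[access[repo]] {
				access[repo] = perm
			}
		}
	}
	return access
}

func main() {
	cfg := Config{Teams: map[string]map[string]string{
		"maintainers": {"example-repo": "push", "example-docs": "maintain"},
		"triagers":    {"example-repo": "triage"},
	}}
	access := PreviewAccess(cfg, "maintainers", "triagers")

	// Sort repository names so the output is stable.
	repos := make([]string, 0, len(access))
	for repo := range access {
		repos = append(repos, repo)
	}
	sort.Strings(repos)
	for _, repo := range repos {
		fmt.Printf("%s: %s\n", repo, access[repo])
	}
}
```

A preview like this could be posted as a comment on the pull request that proposes the membership change, so reviewers see the effect before merging.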

It was a very natural process: everybody was familiar with the tooling, we just advertised a new and better way to do things. And people switched. The right balance, natural. Meet them where they are.

the CI side, code side, which is important,
and a HUGE untapped opportunity on the “human / organization” side - managing the github organization.

how do you get visibility into this? It’s hidden, you have to click through all the orgs, the UI, etc. You can’t track it over time. It’s not “hidden iceberg” data. No data.
Misconfiguration of a repository is usually silent. You can’t really improve the organisation. For example:

you should forbid force pushes, it’s a reasonable agreement, but you make a slight mistake, so you allow force pushing for a second locally. And maybe you forget to turn it back off: we’re human and we’re working.
That mistake will be silent for a long time, until it blows up in your face. And no one is to blame here: you want to trust your maintainers to do the right thing (allow force push sometimes, no zeal), but you can’t assume they’ll go through all the configuration and double-check it every day. It’s OK to make mistakes; the problem is at the organization level.
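Silent drift like the force-push example becomes visible once the desired settings live in code and are compared with the live ones. Here is a minimal Go sketch of that comparison; `BranchProtection` and `Drift` are hypothetical names, and the “actual” settings are hard-coded here where a real tool would read them from the GitHub API.

```go
package main

import (
	"fmt"
	"sort"
)

// BranchProtection holds the few default-branch settings this sketch cares about.
type BranchProtection struct {
	AllowForcePushes bool
	RequireReviews   bool
}

// Drift compares the settings declared in code with the settings observed on
// each repository and returns one message per mismatch, sorted by repository.
func Drift(desired, actual map[string]BranchProtection) []string {
	repos := make([]string, 0, len(desired))
	for repo := range desired {
		repos = append(repos, repo)
	}
	sort.Strings(repos)

	var findings []string
	for _, repo := range repos {
		want := desired[repo]
		got, ok := actual[repo]
		switch {
		case !ok:
			findings = append(findings, repo+": no branch protection found")
		case got != want:
			findings = append(findings, fmt.Sprintf("%s: want %+v, got %+v", repo, want, got))
		}
	}
	return findings
}

func main() {
	desired := map[string]BranchProtection{
		"example-repo": {AllowForcePushes: false, RequireReviews: true},
	}
	// Someone allowed force pushes "for a second" and forgot to revert it.
	actual := map[string]BranchProtection{
		"example-repo": {AllowForcePushes: true, RequireReviews: true},
	}
	for _, finding := range Drift(desired, actual) {
		fmt.Println(finding)
	}
}
```

The report blames no one: it only surfaces the difference so the temporary exception can be reverted or written into the config on purpose.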

Note: that’s a core value of ours. “Do the right thing” might also mean bypassing a rule temporarily (under certain conditions). We’re not pedantic about rules; we’re implementing the ones that are useful / efficient / help avoid mistakes.

Being able to collaborate on the organisation, treat it as code, etc. It becomes more efficient. It’s another part of your regular work with the same standards (quality, openness, etc).

user-centric
lack of project oriented approach in the article? roi / data driven?