It has been suggested by numerous people at numerous times that the k3d-dev script should be rewritten in golang.
This ticket is for those respective opinions to be collected, argued, and ultimately decided in one place, so we can take action on the final decision.
The current k3d-dev.sh script is 1,250 lines of clean and well-tended bash. It has 34 named functions but no test suite.
Migrating some/all of this functionality to one of our go-to in-house programming languages should give us more room for static analysis, modularization, and unit testing.
Language recs in preference order:
Go: +static typing, +lots of BigBang usage, +common in k8s community
Python: +type hints, +moderate BigBang usage, +wide userbase, +common in sysadmin community
TypeScript: +static types, +wide userbase if JS counts, -more dependencies required to build and maintain.
Many of our persistent issues with this tooling can be boiled down to interactions and compatibilities between our bigbang packages (helm charts) and their HelmReleases. A Go/Python/TypeScript replacement of k3d-dev.sh will never be able to fully solve those problems, though it may aid us in better identifying and addressing such below-the-surface issues in the future.
Officially I'm neutral on this. Golang or bash can both do the job. Personally I think the language is the wrong thing to focus on with this tool, and there are bigger problems to be solved regardless of language.
I think the tool will have the same problems regardless of the language it is in. Those problems are:
There is no user documentation. Where documentation for the tool exists at all, it is scattered across various other projects; in the majority of cases it simply does not exist.
There is no developer documentation. What do the internal functions do? Why is the code organized the way it is? What are the current variables and data structures for?
The use cases for the tool are not well documented. The tool can produce no fewer than a dozen different configurations of k3d, and that's just what I can think of off the top of my head. None of those use cases is clearly documented in terms of the architecture being built and the need it serves.
There is no testing. Whenever the tool is modified, it is a significant challenge to ensure that all use cases still work as expected, and we basically always miss something, causing downstream impact that can be significant.
I believe that the desire to rewrite k3d-dev in a different language is actually a hopeful assumption that if we rewrite the tool from scratch in a new language, these issues will be resolved as a matter of course in the development of the new implementation.
The testing needed will have the same challenges regardless of the language k3d-dev is written in: there is very little code in k3d-dev that is unit testable. The stuff that writes templated configurations and the command line parsing are the only bits that don't create or interact with infrastructure or external APIs. We can mock up cloud provider APIs to test some of the functions that interact with those systems, and that is somewhat less work in a language like python or golang than it is in bash. However the real heavy lifting on the testing is going to be writing test harnesses that tell k3d-dev to "build me a system like XYZ" and then interrogating the actual system that gets built to ensure it functions according to the spec; and that will have more to do with what virtualization backend is used to build the systems, and how well the testing language integrates w/ SSH and other tools, than the language k3d-dev itself is written in.
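For illustration, one way to mock the cloud provider CLI in pure bash is to shadow `aws` on PATH; the canned response and the suggested test target below are placeholders, not anything the script actually does today.

```bash
#!/usr/bin/env bash
# Minimal sketch of mocking a cloud provider CLI for bash unit tests: put a
# fake `aws` ahead of the real one on PATH so functions that shell out to it
# can be exercised without touching AWS.
set -euo pipefail

STUB_DIR="$(mktemp -d)"
trap 'rm -rf "${STUB_DIR}"' EXIT

# Unquoted heredoc: ${STUB_DIR} expands now, \$* stays literal for the stub.
cat > "${STUB_DIR}/aws" <<EOF
#!/usr/bin/env bash
# Record the call and return a canned, parseable response.
echo "aws \$*" >> "${STUB_DIR}/aws-calls.log"
echo '{"Reservations": []}'
EOF
chmod +x "${STUB_DIR}/aws"
export PATH="${STUB_DIR}:${PATH}"

# A test would now source k3d-dev.sh behind a "don't run main when sourced"
# guard (which would need to be added) and call an individual function, say
# the one that looks up an existing EC2 instance, asserting both on its
# output and on aws-calls.log.
```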
If we switch to golang, the portions that interact with AWS and Kubernetes APIs would probably be cleaner. The portions that communicate with other shell programs would probably be significantly worse - bash is way better (in my opinion) at managing shell processes and handling their errors than golang. But once the AWS/K8s api stuff is cleaned up and a good SSH automation lib is introduced, there wouldn't be much shelling out left to do, so this may be a moot point.
We wouldn't be utilizing most of the features that make golang useful versus bash. The tool is entirely sequential, so no use for goroutines. There is very little opportunity to gracefully recover from errors, we basically panic on everything. The execution speed of the tool is limited far more by the nature of the operations it's performing than whether the language is compiled or not. Providing a compiled binary versus a simple shell script may make the tool more difficult to debug (dlv is okay but bash -x is more convenient for most of what this tool does). The majority of the issues experienced by the script are not related to the types of data being moved around internally (everything internal to the script is either a string or a boolean), so golang's static typing system would not be of a significant benefit here.
I agree with your central thesis here — moving the k3d dev environment wrapper code from one language to another is not going to fix the fundamental issues that cause ongoing dev environment management and configuration headaches.
So long as our central model is built on dozens of highly configurable YAML-based helm charts we won't be able to fully test and validate every piece.
That said, I still think we'll be well served to reduce the proportion of our shared codebases that is written in either YAML or bash over time. Once we shrink that mountain of difficult-to-test-or-even-validate assets down a bit, who knows what we'll find left at the bottom?
Since @akesterson has worked on this script a lot more than the rest of us recently I can imagine those of us weighing in from a safe distance might be overlooking some critical info.
Here's a list of the named functions in the current k3d-dev.sh on master. I'll spend some time reading over them to get a feel for which (if any) might actually work better in Go/Python/etc.
The following individuals have made commits against this script in the past and still have repo1 accounts. I'm pinging them here for any thoughts they may have regarding the desirability/feasibility of moving the k3d-dev script to another language:
A large part of my initial support for a Go rewrite was based on the assumption that bbctl would become central to our workflow for dogfooding and, eventually, customer deployments of BigBang.
I'm no longer sure if that work is prioritized or will be completed in a timeframe that justifies a rewrite. If bbctl adoption is low, a Go rewrite would likely fail, as it wouldn’t get enough usage to drive bugs to zero.
Unless bbctl’s roadmap significantly changes, we should not pursue a rewrite. Instead, we should focus on reducing long-term maintenance burden by making the script more modular, testable, and maintainable.
Actionable Improvements:
Modularize the script – Split it into separate files (aws_setup.sh, k3d_deploy.sh, instance_provision.sh), or at least well-defined sections. This improves maintainability and readability; a rough sketch combining this with the structured-config idea below follows this list.
Add basic testing – Use BATS or another testing framework to validate key functionality. Even limited unit tests will make the script easier to compartmentalize and understand.
Reduce reliance on environment variables – Move configuration into JSON/YAML files. This was the main reason I initially favored Go, as I expected bbctl to provide templating for common configurations. Even without bbctl, structured config files would add significant value.
Enforce a Bash style guide – Introduce a Bash linting pipeline (shellcheck, shfmt) to improve script quality and consistency.
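A minimal sketch of how the modularization and structured-config suggestions could fit together; every file, function, and config key name here is hypothetical and does not reflect the current script's internals.

```bash
#!/usr/bin/env bash
# Hypothetical entrypoint after splitting k3d-dev.sh into modules and moving
# configuration out of environment variables. Illustrative shape only.
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "${SCRIPT_DIR}/lib/aws_setup.sh"
source "${SCRIPT_DIR}/lib/instance_provision.sh"
source "${SCRIPT_DIR}/lib/k3d_deploy.sh"

# Structured config (JSON here; YAML would work too) instead of a pile of env vars.
CONFIG_FILE="${1:-${SCRIPT_DIR}/config/defaults.json}"
INSTANCE_TYPE="$(jq -r '.aws.instance_type' "${CONFIG_FILE}")"
CNI="$(jq -r '.k3d.cni' "${CONFIG_FILE}")"
USE_METALLB="$(jq -r '.k3d.metallb' "${CONFIG_FILE}")"

aws_setup "${INSTANCE_TYPE}"          # defined in lib/aws_setup.sh (hypothetical)
instance_provision                    # defined in lib/instance_provision.sh
k3d_deploy "${CNI}" "${USE_METALLB}"  # defined in lib/k3d_deploy.sh
```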
TL;DR
Unless bbctl is confirmed as our CLI with a clear roadmap, we should NOT rewrite in Go. Instead, we should focus on refactoring the script, improving testing, and standardizing Bash practices.
I'm happy to support this as a good next step. Breaking the large thing up into smaller modules and applying static analysis and a test harness to it is 90% of why I'd have wanted to use anything-but-bash anyway.
I've never actually used BATS but it appears reasonable. That'd be a great addition to our hundreds of lines of bash in pipeline-templates if we're not using it there already.
As an immediate step we could add a simple BATS-based test harness and a shellcheck job to the CI pipelines for this repository (over at pipeline-templates), add in a ton of # shellcheck disable directives to allow the existing stuff to stay viable until we make time to clean it up, and then work our way through cleaning / replacing things one at a time from there.
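For illustration, the shellcheck disable pattern and a first BATS check might look like the snippet below; the rule code, the `-h` flag, and the `$FILTER_ARGS` variable are examples rather than anything confirmed in the current script.

```bash
# In k3d-dev.sh: keep a known-noisy rule quiet until that section is cleaned up.
# shellcheck disable=SC2086  # TODO: quote these expansions during the refactor
aws ec2 describe-instances --filters $FILTER_ARGS   # $FILTER_ARGS is illustrative

# tests/k3d-dev.bats: a first, trivial BATS check to seed the harness.
@test "k3d-dev.sh prints usage and exits cleanly with -h" {
  run ./k3d-dev.sh -h
  [ "$status" -eq 0 ]
  [[ "$output" == *"Usage"* ]]
}
```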
How feasible is it for us to instead maintain an AMI (and make it publicly available even) that is a development environment in a box? Docker, metallb, and k3d already installed and just ready to go. There'll be some configuration necessary to launch the instance and map the instance's public IPs to metallb IPs, but that can be done in 100 or so lines of bash instead of the >1000 lines we have here.
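For context, that launch-time glue might look something like the sketch below; the address range, interface name, ports, and the assumption of a recent CRD-configured MetalLB are all illustrative, not the current script's actual values.

```bash
#!/usr/bin/env bash
# Sketch: give MetalLB a slice of the k3d docker network, then forward traffic
# arriving on the instance's public interface to the first MetalLB address.
set -euo pipefail

METALLB_RANGE="172.20.1.240-172.20.1.250"   # assumed slice of the k3d network
FIRST_LB_IP="172.20.1.240"

# Recent MetalLB releases are configured via CRDs rather than a ConfigMap.
kubectl apply -f - <<EOF
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default
  namespace: metallb-system
spec:
  addresses: ["${METALLB_RANGE}"]
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default
  namespace: metallb-system
EOF

# DNAT traffic hitting the instance's primary interface to the LB address.
iptables -t nat -A PREROUTING -i eth0 -p tcp -m multiport --dports 80,443 \
  -j DNAT --to-destination "${FIRST_LB_IP}"
```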
We could maintain this AMI in its own repo as a packer project and have a pipeline that validates the cluster spins up correctly on every merge.
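That per-merge validation could be a small smoke test along these lines; the instance type, login user, the omitted networking flags, and the assumption that k3d starts on boot are all placeholders.

```bash
#!/usr/bin/env bash
# Sketch of a merge-pipeline smoke test for the proposed AMI: boot an instance
# from the freshly built image, then verify the preinstalled cluster is healthy.
set -euo pipefail

# Networking flags (subnet, security group) omitted for brevity.
INSTANCE_ID="$(aws ec2 run-instances --image-id "${AMI_ID}" \
  --instance-type t3a.2xlarge --key-name "${KEY_NAME}" \
  --query 'Instances[0].InstanceId' --output text)"
trap 'aws ec2 terminate-instances --instance-ids "${INSTANCE_ID}"' EXIT

aws ec2 wait instance-running --instance-ids "${INSTANCE_ID}"
PUBLIC_IP="$(aws ec2 describe-instances --instance-ids "${INSTANCE_ID}" \
  --query 'Reservations[0].Instances[0].PublicIpAddress' --output text)"

# Assumes the image starts k3d on boot; a real job would retry until sshd is up.
ssh -o StrictHostKeyChecking=no "ec2-user@${PUBLIC_IP}" \
  'kubectl wait --for=condition=Ready nodes --all --timeout=600s'
```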
Once the AMI is in place, we standardize everything on it. Quickstart, dev environment, BB merge pipelines, etc. There's no more "well how did you create your cluster?" as there's only one way to do it.
I like the idea of shipping images instead of a script. But ultimately we have to solve the same problems: we still need a script that provisions and configures the various image types and configurations, we still have to document the use cases for each image, and we still have to write test cases for the images produced to ensure they're healthy. The difference is that we're shipping images, not a script.
But I do love this idea and would honestly prefer shipping images to users (AMI OVF etc), and retaining the script for internal-only usage inside of our pipelines that builds the images.
An AMI can also be made "closer" to a production environment, integrating STIG hardening, SELinux rules, and an OS (RHEL) that might be closer to a real cluster used in GovCloud.
@akesterson agreed about it still being a script that needs to be maintained, but since we're not making any options available, there's no logic around what happens first and no context switching between commands running locally and commands running over SSH. That sounds far more maintainable as a bash script in my mind.
Also, I want to be clear that I'm advocating for one image. No options on this menu. Since the AMI approach would reduce cluster creation time to basically however long it takes for an EC2 instance to spin up, I also don't think we'd have any need for the "recreate the k3d cluster but not the ec2 instance" option.
Any attempt to build an AMI via packer or EC2 image builder ought to provide us with improvements in modularity, testability, and automated verification of some of the sticky bits of this workflow, so I like the sound of that.
@zcallahan good points. Overall I agree it's a superior idea.
The only sticking point is the idea of a single image. Is it possible to only offer one? I think that would require significant input from anchors and product owners. Current configurations include:
K3D with default CNI with the default ingress on the public IP address
K3D with Weave CNI with the default ingress on the public IP address
K3D with default CNI with the default ingress on the private IP address
K3D with Weave CNI with the default ingress on the private IP address
K3D with default CNI and MetalLB
K3D with Weave CNI and MetalLB
K3D with default CNI, MetalLB and 2 public IP addresses
K3D with Weave CNI, MetalLB and 2 public IP addresses
... and that's per cloud provider. We'd need to duplicate that on AWS and at least one private virtualization provider (vmware or virtualbox) to meet current goals.
I'd love to know if we can trim the fat off that list.
I don't think we'd need to go as far as maintaining and auto-building images for every cloud provider and every major ingress configuration.
Couldn't we get by with one AMI for primarily-internal use that has A) all of our OS-level prereqs installed as well as B) a set of smaller single-purpose bootstrapping scripts (or Ansible-ish modules) capable of getting a freshly booted EC2 node from the baseline up to "now your BigBang cluster is ready with your chosen features"?
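To make that concrete, the "baseline image plus small single-purpose bootstrap scripts" shape could look roughly like this; the script names, ordering, and feature flags are invented for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical top-level bootstrap baked into the AMI: each step is a small,
# separately testable script, and features are chosen at boot time.
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

"${SCRIPT_DIR}/10-create-k3d-cluster.sh" --cni "${CNI:-default}"

if [ "${USE_METALLB:-false}" = "true" ]; then
  "${SCRIPT_DIR}/20-configure-metallb.sh"
fi

"${SCRIPT_DIR}/30-install-flux.sh"
"${SCRIPT_DIR}/40-deploy-bigbang.sh" --values "${BB_VALUES:-/opt/bigbang/values.yaml}"
```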
Unfortunately we can't keep an internal audience scope. The new quickstart.sh (published in the public BigBang docs) relies upon k3d-dev on the backend to build a functional k3d cluster on any suitable VM in the control of the user, and/or to create an EC2 instance in AWS. The functionality is documented for the public. We may start with a limited scope initially, but we can't stay there long.
In that case I imagine we'll want another CI job for "quickstart.sh isn't broken on master" to help keep us all honest as we set off on this next voyage.
K3D on a VM the user can access via SSH with MetalLB
The quickstart doesn't expose all of the functionality in the k3d-dev script, so that saves us some complexity. Creating the AMI would solve the first two cases. But the second two cases will require the k3d-dev script to stay around until we publish some system images that can be used on the common virtualization platforms our customers use (vagrant+virtualbox, vmware, others?).
I think there are a few things in flight that might help guide this. I think we are still looking at supporting "flavors" and having bbctl be the primary tool to support that. I think we still want a "one liner" for a customer or a dev to get a development cluster, could be in or outside of bbctl, but I do think it'd be nice to decouple it from the umbrella. I also strongly agree on the testing and stability requests. Finally, I think at some point we need to update our workflow to use kustomize like how our customers use it so we feel the same pain points they do.