Infrastructure for KGR runners
Infrastructure-runner overview
Runner name/tag prefix | Platform |
---|---|
kgr1 | Kubernetes (Tanzu Kubernetes Grid) |
kgr2 | Docker in VM |
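Jobs select a runner through its tags. Below is a minimal sketch of a `.gitlab-ci.yml` job targeting a KGR1 runner; the concrete tag name `kgr1-standard` is an assumption derived from the tag prefix above and may differ from the tags actually configured.

```yaml
# Minimal sketch: select a KGR1 (Kubernetes) runner via its tag.
# "kgr1-standard" is an assumed example; only the "kgr1" prefix is
# given in the table above, so check the actually available tags.
build-on-kgr1:
  stage: build
  tags:
    - kgr1-standard
  script:
    - echo "Running on a KGR1 Kubernetes runner"
```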
KGR1
KGR1 runs on Tanzu Kubernetes Grid ("Kubernetes on VMs").
Our Kubernetes cluster is set up as follows:
Nodes | Flavor | # | Cores/Memory | / | $BUILD_DIR |
---|---|---|---|---|---|
large | best-effort-2xlarge | 1 | 8/64 | 160 | 80 |
medium | best-effort-large | 8 | 4/16 | 80 | 40 |
control plane | | 3 | | | |
Explanation:
- Nodes: group of nodes
- Flavor: VM class (determines the size of the node)
- #: how many nodes of this type are available
- /: mounted disk size on root for jobs. The storage is shared across jobs and processes on the whole node, so the value is purely informational.
- $BUILD_DIR: mounted disk size for repository directories (build directory). The storage is shared across jobs on the whole node, so the value is purely informational.
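Because the disk values are shared and purely informational, a job can inspect the space actually available at runtime. A minimal sketch, assuming the $BUILD_DIR mount backs the standard checkout location (`$CI_PROJECT_DIR`) and using the assumed `kgr1-standard` tag:

```yaml
# Minimal sketch: show the disk space actually available to a job.
# Assumes $BUILD_DIR backs the checkout location ($CI_PROJECT_DIR);
# the tag "kgr1-standard" is an assumed example.
check-disk:
  tags:
    - kgr1-standard
  script:
    - df -h /                   # root mount, shared by the whole node
    - df -h "$CI_PROJECT_DIR"   # build directory mount
```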
NOTE 🗒️: The large node is twice the size of a medium node (only the RAM is four times as much, for technical reasons).
WARNING ⚠️: The flavours are best-effort: the underlying implementation tries its best to keep up with the specs, but the actually available resources may well be lower.
WARNING ⚠️: No workload runs on the control plane nodes.
NOTE 🗒️:
- The experimental runner runs on a standard or a test cluster.
- Memory is given in GiB.
Illustration of KGR1 clusters
The diagram illustrates what the cluster setup looks like (the release cluster is bigger!).
The relation between node sizes and runner sizes
Node and runner sizes are tightly connected. Here is the connection between the medium node and the standard runner, expressed as rules for the request (guaranteed values for jobs, the lower value) and for the limit (upper bound for jobs, the upper value*); a pipeline-level sketch of the overwrite mechanism follows the list.
* Request and limit:
  * The request is set so that 3 runners (1 main and 1 helper container each) can run on a large node.
  * The limit is set so that 2 runners (1 main, 1 helper and 1 service container each) can run on a large node.
* Overwrite values (advanced users can set these in the pipeline specification):
  * The limit overwrite is set so that 2 runners (1 main and 1 helper container each) can run on a large node.
  * The request overwrite is set just below the limit overwrite.
* Service container overwrite:
  * Service and helper container overwrite values are simply set higher than the standard values.
* Storage:
  * Ephemeral storage follows these rules as well, though less strictly.
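The overwrite values mentioned above are applied through the resource overwrite variables of the GitLab Kubernetes executor in the pipeline specification. The sketch below only illustrates the mechanism: the numbers are placeholders, not the real KGR1 caps, and overwrites are only honoured up to the maximums configured on the runner.

```yaml
# Sketch of per-job resource overwrites (Kubernetes executor).
# The values are illustrative placeholders, NOT the actual KGR1 caps;
# overwrites only take effect up to the runner's configured maximums.
heavy-job:
  tags:
    - kgr1-standard                              # assumed tag
  variables:
    KUBERNETES_CPU_REQUEST: "2"
    KUBERNETES_CPU_LIMIT: "3"
    KUBERNETES_MEMORY_REQUEST: "4Gi"
    KUBERNETES_MEMORY_LIMIT: "6Gi"
    KUBERNETES_EPHEMERAL_STORAGE_REQUEST: "10Gi"
    KUBERNETES_EPHEMERAL_STORAGE_LIMIT: "20Gi"
  script:
    - ./run_heavy_build.sh                       # hypothetical build step
```

Analogous overwrite variables exist for the helper and service containers (e.g. KUBERNETES_HELPER_MEMORY_LIMIT, KUBERNETES_SERVICE_CPU_LIMIT).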
The diagram illustrates the size relations between the runner request/limit values and the node size.
NOTE 🗒️:
- There is a bit of reserve on each node even with three pods at request values.
- Other workloads, such as monitoring, will also be running, so the real deployment situation will differ.
- Because the nodes are best-effort, the actual node size might differ.
KGR2
- Runs in Docker, and Docker runs in a bwCloud VM (if bwCloud has a complete outage, this runner is down as well)
- Flavor: m2.large.hugedisk
- All VM resources are allocated to one job, because concurrency is set to one
- Some resources are allocated to Docker and the host OS of the VM, but these won't take as much (not more than)
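Since the whole VM serves one job at a time, KGR2 fits disk-heavy jobs well. A minimal sketch of targeting it; the tag name `kgr2` is an assumption based on the tag prefix above:

```yaml
# Minimal sketch: run a disk-heavy job on the KGR2 Docker-in-VM runner.
# The tag "kgr2" is an assumed example derived from the tag prefix above.
disk-heavy-job:
  tags:
    - kgr2
  script:
    - df -h .                     # the hugedisk flavor provides large local storage
    - ./build_large_artifacts.sh  # hypothetical build step
```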