Add ability to configure affinity and tolerations for scans and hooks #806

Merged
malexmave merged 13 commits into main from feature/affinity-and-tolerations on Dec 6, 2021
Conversation

@malexmave
Member

@malexmave malexmave commented Nov 11, 2021

Description

This PR resolves #803 by adding affinity and tolerations to the ScanSpec and ScanCompletionHookSpec. Supersedes #804 to avoid having an ugly history.

Currently in draft state. Tasks:

  • Add to ScanSpec and ScanCompletionHookSpec CRDs and re-generate them
  • Add to controller (pull information from CRD and put it into the pod)
  • Check if / how the parsers can also be informed about affinity / tolerations
  • Have the hooks inherit their affinity and tolerations from the scans triggering them
  • Check if the JS CRDs need to be updated (hooks, SDKs)
  • Add new inheritX labels? Or should it simply be inherited by cascading scans without asking?
  • Add test cases for cascading hooks
  • Test everything
  • Support configuration via values.yaml file
  • Double-check if the docs need to be updated anywhere or if everything is done automatically anyway

malexmave and others added 3 commits November 11, 2021 08:52
Signed-off-by: Max Maass <max.maass@iteratec.com>
Signed-off-by: Max Maass <max.maass@iteratec.com>
Signed-off-by: GitHub Actions <securecodebox@iteratec.com>
@malexmave malexmave added enhancement New feature or request CRD Improvements or additions to CRDs labels Nov 11, 2021
@malexmave malexmave added this to the v3.4.0 milestone Nov 11, 2021
@malexmave malexmave self-assigned this Nov 11, 2021
Signed-off-by: Max Maass <max.maass@iteratec.com>
Signed-off-by: Max Maass <max.maass@iteratec.com>
@malexmave
Member Author

Tolerations work, and are now also pulled into the parser job from the scanner job. I think this is useful for the following reasons:

  • The parser is part of the scanning process and thus should respect the same scheduling constraints as the scanner
  • If we do not do this, the parser may be scheduled in a completely different place from the scanner, which is surprising to the user and may, in the worst case, lead to compliance problems ("We need to run all processing of this data on nodes within the EU / on-site within our company" and then suddenly the parser jobs get scheduled on US GCP nodes within the same cluster)
  • If you are working with taints, in the worst case, not copying over tolerations can lead to the parser not being scheduled at all (if all nodes have some form of taint)
  • If we do not copy the tolerations and taints from the scan, we need to offer a different way to specify taints and tolerations for the parser, which is a lot of work, and which would in 99% of cases be set to the same values as the taints / tolerations of the scan job (I can see no use case where they should use different values)

Anyway, that's my pitch for why it should be done this way.

Tolerations work (I have tested them). Affinity is somehow borked right now; it seems like it is not copied over to the job template for some reason, and I don't know why. I would appreciate it if someone with more experience with Go and the k8s APIs (@J12934?) could take a look at the diff and check whether I am doing something obviously wrong.

How to test

Here's a quick info on how I tested this feature.

Affinity (does not work)

My testing setup: first, give a node a label that we can select on:

kubectl label nodes kind-control-plane disktype=ssd

Then, using the operator and CRDs from this branch, schedule the following job:

apiVersion: "execution.securecodebox.io/v1"
kind: Scan
metadata:
  name: "nmap-example"
spec:
  scanType: "nmap"
  parameters:
    - "scanme.nmap.org"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd

Then, check the YAML of the created scan job using k9s or kubectl and observe that the affinity is not reflected in the template.

I am looking for support in figuring out why this is the case. Definition in the CRD:

// Affinity allows specifying a node affinity, to control on which nodes you want a scan to run. See: https://kubernetes.io/docs/tasks/configure-pod-container/assign-pods-nodes-using-node-affinity/
Affinity *corev1.Affinity `json:"affinity,omitempty"`
// Tolerations are a different way to control on which nodes your scan is executed. See https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
Tolerations []corev1.Toleration `json:"tolerations,omitempty"`

Copying it in the reconciler:

// Set affinity from ScanTemplate
job.Spec.Template.Spec.Affinity = scan.Spec.Affinity
// Merge Tolerations from ScanTemplate with Tolerations defined in scan
job.Spec.Template.Spec.Tolerations = append(
	job.Spec.Template.Spec.Tolerations,
	scan.Spec.Tolerations...,
)
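Taken together, the two snippets replace the affinity wholesale while merging the tolerations by appending. A minimal sketch of those semantics, using hypothetical stand-in types rather than the real corev1 structs:

```go
package main

import "fmt"

// Hypothetical minimal stand-ins; the real reconciler operates on
// job.Spec.Template.Spec and scan.Spec from k8s.io/api types.
type Toleration struct{ Key, Value, Effect string }

type PodSpec struct {
	Affinity    *string // stand-in for *corev1.Affinity
	Tolerations []Toleration
}

// applyScanScheduling mirrors the snippets above: affinity is a plain
// assignment (the scan's value replaces the template's), tolerations are
// appended (the scan's entries are merged after the template's).
func applyScanScheduling(job *PodSpec, scan PodSpec) {
	job.Affinity = scan.Affinity
	job.Tolerations = append(job.Tolerations, scan.Tolerations...)
}

func main() {
	tmplAff, scanAff := "template-affinity", "scan-affinity"
	job := PodSpec{
		Affinity:    &tmplAff,
		Tolerations: []Toleration{{Key: "from-template", Effect: "NoSchedule"}},
	}
	scan := PodSpec{
		Affinity:    &scanAff,
		Tolerations: []Toleration{{Key: "from-scan", Effect: "NoSchedule"}},
	}

	applyScanScheduling(&job, scan)
	fmt.Println(*job.Affinity, len(job.Tolerations)) // scan-affinity 2
}
```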

Tolerations (do work)

Tolerations are reflected in the YAML and respected by the cluster when using them. To test, first give your node a taint:

kubectl taint node kind-control-plane taint=test:NoSchedule

Now, no jobs will be scheduled on it by default (make sure this does not interfere with your operator by running it outside the cluster). Then, schedule the following job:

apiVersion: "execution.securecodebox.io/v1"
kind: Scan
metadata:
  name: "nmap-test"
spec:
  scanType: "nmap"
  parameters:
    - "scanme.nmap.org"

Observe that it will never be scheduled (the pod will hang in "pending"). Delete the scan and then change the scan definition as follows:

apiVersion: "execution.securecodebox.io/v1"
kind: Scan
metadata:
  name: "nmap-test"
spec:
  scanType: "nmap"
  parameters:
    - "scanme.nmap.org"
  tolerations:
    - key: "taint"
      value: "test"
      effect: "NoSchedule"

Schedule it and observe that it will get scheduled, including the parser (thanks to the above-mentioned change of copying the tolerations from the scan). They will also be reflected in the YAML version of the job if you inspect it using k9s or kubectl.

To get rid of the taint after you are done testing, run:

kubectl taint node kind-control-plane taint=test:NoSchedule-

@J12934
Member

J12934 commented Nov 12, 2021

I'm still not sure if the handling currently is intuitive.

At the moment, my understanding is that it is configurable for each job like the following:

| | configurable by CR (ScanType, ParseType, ScanCompletionHook) | configurable via Scan |
| --- | --- | --- |
| Scan Job | ✅ (via `jobTemplate`) | ✅ |
| Parse Job | ❌ | ✅ (inherited from the scan) |
| Hook Job | ✅ | ❌ |

Is that accurate? Shouldn't the hook job also inherit the Scan's Affinity & Tolerations? Why can the Affinity & Tolerations be configured via the ScanCompletionHook Custom Resource but not via the ParseDefinition CR?

@malexmave
Member Author

At the moment, your table is accurate.

The rationale for not including affinity and tolerations in the ParseDefinition follows from the behavior of inheriting them from the scan job: if we inherit them from the scan job (which is the more specific source of information about what is desired for a specific scan), they would pretty much always replace whatever is defined in the ParseDefinition, and merging instead of replacing may lead to unintuitive behavior (affinity cannot easily be merged; tolerations can, but merging could produce strange results as well). Is there a use case where you would want your scans to run with a specific affinity / toleration, and your parsers to run with a different one? I'm usually in favor of not implementing a feature purely for parity with another function if I don't see a real use case for it, but I do see how it would make the implementation more complex and, in the worst case, error-prone. (Hence, parsers currently do not support initContainers, and neither do ScanCompletionHooks.)

Inheriting affinity and tolerations from the scan job to the hook seems like an interesting idea. I would have to dive into the code and see how we can do this, but I definitely see how this would be useful.

Do you have any idea why the affinity is not copied over properly from the scan to the job in the reconcilers, i.e., am I doing something obviously wrong in the snippets I posted above?

@J12934
Member

J12934 commented Nov 12, 2021

What would feel intuitive to me would be the following:

| | configurable by CR (ScanType, ParseType, ScanCompletionHook) | configurable via Scan |
| --- | --- | --- |
| Scan Job | (implicitly, via `jobTemplate`) | ✅ |
| Parse Job | ❌ | ✅ |
| Hook Job | ❌ | ✅ |

Basically only configurable via the Scan CR. This makes this consistent across all of the job types.
ScanType is also supported implicitly, as this is already configurable via the jobTemplate anyway.

Do you have any idea why the affinity is not copied over properly from the scan to the job in the reconcilers, i.e., am I doing something obviously wrong in the snippets I posted above?

🤔 Hm, no, these look good to me. Can you make sure that you are only running the changed operator against the cluster, and don't have a second operator connected to it that doesn't have the new feature?

@malexmave
Member Author

I think your proposed table looks very sensible. I'll have to check how to have the hooks inherit the settings from the scans, but assuming this is possible, it seems like a very reasonable way to configure this. In that case, I would remove affinity and tolerations from the ScanCompletionHook CRD and have it only grab its config from the scan jobs.

Can you make sure that you are only running the changed operator for the cluster and don't have a second operator connected to it which doesn't have the new feature?

I am certain that this was the case, and everything worked fine for the tolerations, consistently over something like 5-10 test runs, while it never worked for affinity in 5-10 test runs, so the error should be in the operator, not the deployment. Maybe we can take a look in a pair whenever it is convenient for you.

@malexmave
Member Author

So, it turns out that the issue was that tolerations are shown in kubectl describe, but the affinity isn't. If you look at the YAML output, you can see that it is applied correctly.

Signed-off-by: Max Maass <max.maass@iteratec.com>
Signed-off-by: Max Maass <max.maass@iteratec.com>
Signed-off-by: Max Maass <max.maass@iteratec.com>
Default to true because this is the most sensible behavior and
consistent with how we do it for parsers and hooks.

Signed-off-by: Max Maass <max.maass@iteratec.com>
@malexmave
Member Author

I have added inheritAffinity and inheritTolerations to control if they should be inherited, but I am defaulting the value to true because I believe that this is the most consistent behavior (since hooks and parsers inherit the values as well, it makes sense that cascaded scans also inherit it).

Concerning changes to the values: Affinity cannot be merged, only replaced, in cascading hooks, as it is a deeply nested dictionary, which the current merging logic cannot handle. Tolerations, being a list, are merged using the standard list rules.
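A sketch of how the inheritAffinity / inheritTolerations flags could gate this for cascaded scans (hypothetical helper, not the actual cascading-scans hook code; it assumes a child-defined affinity wins over an inherited one):

```go
package main

import "fmt"

// Hypothetical stand-ins; the real hook works on full ScanSpecs.
type Toleration struct{ Key, Value, Effect string }

type Scheduling struct {
	Affinity    *string // stand-in for the deeply nested affinity dictionary
	Tolerations []Toleration
}

// inherit sketches the cascading behavior described above: affinity can only
// be replaced (a child-defined affinity wins, otherwise the parent's is
// inherited), while tolerations are merged with ordinary list rules.
func inherit(parent, child Scheduling, inheritAffinity, inheritTolerations bool) Scheduling {
	if inheritAffinity && child.Affinity == nil {
		child.Affinity = parent.Affinity
	}
	if inheritTolerations {
		child.Tolerations = append(parent.Tolerations, child.Tolerations...)
	}
	return child
}

func main() {
	aff := "parent-affinity"
	parent := Scheduling{Affinity: &aff, Tolerations: []Toleration{{Key: "parent"}}}
	child := Scheduling{Tolerations: []Toleration{{Key: "child"}}}

	out := inherit(parent, child, true, true)
	fmt.Println(*out.Affinity, len(out.Tolerations)) // parent-affinity 2
}
```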

@malexmave
Member Author

malexmave commented Nov 16, 2021

This is ready for review on the functional side, I still have to update a few pieces of documentation.

How to test

Preconditions:

  • Operator, CRDs, and Cascading Scan hook from this branch are installed in your cluster
  • Nmap scan type (for testing) also installed

First, set up a cascading scan rule (it's not a very useful rule, it's just there to show that inheritance on cascading scans works):

apiVersion: "cascading.securecodebox.io/v1"
kind: CascadingRule
metadata:
  name: "nmap-hostscan"
  labels:
    securecodebox.io/invasive: non-invasive
    securecodebox.io/intensive: light
spec:
  matches:
    anyOf:
      - category: "Host"
        osi_layer: "NETWORK"
  scanSpec:
    scanType: "nmap"
    parameters:
      - "-Pn"
      - "{{location}}"

Then, trigger an nmap scan:

apiVersion: "execution.securecodebox.io/v1"
kind: Scan
metadata:
  name: "nmap-localhost"
spec:
  scanType: "nmap"
  parameters:
    - "127.0.0.1"
  # Demo affinity
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
  # Demo tolerations
  tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoSchedule"
  cascades:
    matchLabels:
      securecodebox.io/intensive: light

This will not run on your machine unless you have already set up the "disktype=ssd" label for your Kubernetes node (kubectl label nodes kind-control-plane disktype=ssd, see above). If you also want to explicitly test that taints are respected, see the same linked comment for details on how to (un)taint a node.

Schedule the scan above, wait a bit, then run kubectl get jobs. You should see something like this:

NAME                                                       COMPLETIONS   DURATION   AGE
cascading-scans-nmap-localhost-nmap-hostscan-vbsr2-mgrgk   1/1           7s         9s
cascading-scans-nmap-localhost-pbtfv                       1/1           7s         23s
parse-nmap-localhost-bs6qb                                 1/1           4s         27s
parse-nmap-localhost-nmap-hostscan-vbsr2-k9kt6             1/1           4s         13s
scan-nmap-localhost-95vmf                                  1/1           3s         30s
scan-nmap-localhost-nmap-hostscan-vbsr2-dqz8f              1/1           4s         17s

You can now go through all of these jobs and check that the tolerations and affinity are present:

$ kubectl get job scan-nmap-localhost-95vmf -o yaml | yq e .spec.template.spec.affinity -
nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
      - matchExpressions:
          - key: disktype
            operator: In
            values:
              - ssd
$ kubectl get job scan-nmap-localhost-95vmf -o yaml | yq e .spec.template.spec.tolerations -
- effect: NoSchedule
  key: key1
  operator: Equal
  value: value1

Repeat for all the jobs to check them.

Further things to check for:

  • To disable inheritance of affinity or tolerations, set the annotation inheritAffinity: false or inheritTolerations: false, respectively.
  • To check how additional tolerations or affinities get incorporated, try setting affinity or tolerations in the cascading scan spec (or check the unit tests starting at this line)

Open Question
Currently, it is not possible to configure the affinity and tolerations via the values.yaml. While it may seem simple to implement, it actually isn't (I just spent an hour trying). The problem is that pulling information from the values.yaml and incorporating it into the ScanType works fine for a single scan, but that information cannot then be easily propagated to other components (e.g., taking the information from the ScanType and inheriting it for the parser, hooks, or cascaded scans). The reason is that the information is only available in the ScanType and the Job, but (as far as I can tell) the only thing we have access to when setting up the parser / hook / cascaded scan is the Scan definition. That does not include the information pulled from the ScanType / values.yaml, only what was explicitly configured in the scan definition YAML.

So, there are two options: either we do not allow configuring the affinity and tolerations via the values.yaml, or we have to include it everywhere (ScanType, ParseDefinition, HookSpec). Since data pulled from these sources cannot be inherited, this would require the user to add it consistently on every installed ScanType and Hook. If done consistently, this will probably work fine; if done inconsistently, the result will be a nightmare 😁. So, my concern is chiefly for situations where the two options start to get mixed, or where the values.yaml is not used consistently.

Any preferences for a particular solution here?

Edit: Reading through this again, I guess the answer is quite obviously "yes, we want to support it using the YAML files", and I'm just trying to avoid it because it's a lot of annoying work to implement and test 😬. So, I guess I'll take care of it this afternoon.

@malexmave malexmave removed this from the v3.4.0 milestone Nov 17, 2021
This commit makes the operator respect affinity and tolerations that are
specified in ScanSpec and ParseDefinitions. The default values are used
if the scan does not override them, and are otherwise discarded (not merged).
Default values are not inherited the way explicitly specified values are,
and you need to make sure to specify affinity and tolerations for both
scanner and parser (they are not copied over, either).
Hooks will get the same treatment in the next commit.

Signed-off-by: Max Maass <max.maass@iteratec.com>
Signed-off-by: Max Maass <max.maass@iteratec.com>
@malexmave
Member Author

I just pushed support for loading the affinity and tolerations from helm values. At the moment, I have not yet touched all the values.yaml files, as well as the ScanType, HookSpec, etc. files, to keep the review easier. If you want to test it, change your nmap ScanType to this:

# SPDX-FileCopyrightText: 2021 iteratec GmbH
#
# SPDX-License-Identifier: Apache-2.0

apiVersion: "execution.securecodebox.io/v1"
kind: ScanType
metadata:
  name: "nmap{{ .Values.scanner.nameAppend | default ""}}"
spec:
  extractResults:
    type: nmap-xml
    location: "/home/securecodebox/nmap-results.xml"
  jobTemplate:
    spec:
      {{- if .Values.scanner.ttlSecondsAfterFinished }}
      ttlSecondsAfterFinished: {{ .Values.scanner.ttlSecondsAfterFinished }}
      {{- end }}
      backoffLimit: {{ .Values.scanner.backoffLimit }}
      {{- if .Values.scanner.activeDeadlineSeconds }}
      activeDeadlineSeconds: {{ .Values.scanner.activeDeadlineSeconds }}
      {{- end }}
      template:
        spec:
          restartPolicy: OnFailure
          affinity:
            {{- toYaml .Values.scanner.affinity | nindent 12 }}
          tolerations:
            {{- toYaml .Values.scanner.tolerations | nindent 12 }}
          containers:
            - name: nmap
              image: "{{ .Values.scanner.image.repository }}:{{ .Values.scanner.image.tag | default .Chart.AppVersion }}"
              imagePullPolicy: {{ .Values.scanner.image.pullPolicy }}
              command:
                - "nmap"
                - "-oX"
                - "/home/securecodebox/nmap-results.xml"
              resources:
                {{- toYaml .Values.scanner.resources | nindent 16 }}
              securityContext:
                {{- toYaml .Values.scanner.securityContext | nindent 16 }}
              env:
                {{- toYaml .Values.scanner.env | nindent 16 }}
              volumeMounts:
                {{- toYaml .Values.scanner.extraVolumeMounts | nindent 16 }}
            {{- if .Values.scanner.extraContainers }}
            {{- toYaml .Values.scanner.extraContainers | nindent 12 }}
            {{- end }}
          volumes:
            {{- toYaml .Values.scanner.extraVolumes | nindent 12 }}

And the nmap parse definition to this:

# SPDX-FileCopyrightText: 2021 iteratec GmbH
#
# SPDX-License-Identifier: Apache-2.0

apiVersion: "execution.securecodebox.io/v1"
kind: ParseDefinition
metadata:
  name: "nmap-xml"
spec:
  image: "{{ .Values.parser.image.repository }}:{{ .Values.parser.image.tag | default .Chart.Version }}"
  imagePullPolicy: {{ .Values.parser.image.pullPolicy }}
  ttlSecondsAfterFinished: {{ .Values.parser.ttlSecondsAfterFinished }}
  env: 
    {{- toYaml .Values.parser.env | nindent 4 }}
  affinity:
    {{- toYaml .Values.parser.affinity | nindent 4 }}
  tolerations:
    {{- toYaml .Values.parser.tolerations | nindent 4 }}

You can then set affinity and tolerations for scanner and parser by writing a file like this and passing it to the helm install command using the --values option:

scanner:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: disktype
                operator: In
                values:
                  - ssd
  tolerations:
    - key: "key1"
      operator: "Equal"
      value: "value1"
      effect: "NoSchedule"

parser:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: disktype
                operator: In
                values:
                  - ssd
  tolerations:
    - key: "key1"
      operator: "Equal"
      value: "value1"
      effect: "NoSchedule"

Of course, this does not work if you install nmap from the official sources; you are going to have to do a local install from source for that. If memory serves, these are the commands required:

cd parser-sdk/nodejs/
make docker-build
cd ../../scanners/nmap
make docker-build
make docker-export
make kind-import
rm *.tar

Then run the helm upgrade command like this, assuming your values are set in a file called values2.yaml:

helm upgrade --install nmap ./ --wait --values values2.yaml \
  --set="parser.image.repository=docker.io/securecodebox/parser-nmap" \
  --set="parser.image.tag=sha-$(git rev-parse --short HEAD)" \
  --set="scanner.image.repository=docker.io/securecodebox/scanner-nmap" \
  --set="scanner.image.tag=sha-$(git rev-parse --short HEAD)"

To test the same for the hooks, change your cascading-scan-hook.yaml to this:

# SPDX-FileCopyrightText: 2021 iteratec GmbH
#
# SPDX-License-Identifier: Apache-2.0

apiVersion: "execution.securecodebox.io/v1"
kind: ScanCompletionHook
metadata:
  name: {{ include "cascading-scans.fullname" . }}
  labels:
    {{- include "cascading-scans.labels" . | nindent 4 }}
    securecodebox.io/internal: "true"
    {{- with .Values.hook.labels }}
    {{ toYaml . }}
    {{- end }}
spec:
  priority: {{ .Values.hook.priority }}
  type: ReadOnly
  image: "{{ .Values.hook.image.repository }}:{{ .Values.hook.image.tag | default .Chart.Version }}"
  imagePullSecrets:
    - name: "securecodebox"
  ttlSecondsAfterFinished: {{ .Values.hook.ttlSecondsAfterFinished }}
  serviceAccountName: cascading-scans
  affinity:
    {{- toYaml .Values.hook.affinity | nindent 4 }}
  tolerations:
    {{- toYaml .Values.hook.tolerations | nindent 4 }}

And use the following format for the values.yaml:

hook:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: disktype
                operator: In
                values:
                  - ssd
  tolerations:
    - key: "key1"
      operator: "Equal"
      value: "value1"
      effect: "NoSchedule"

You'll also have to install the hook manually, but the instructions are almost the same as above - you just need to build the hook-sdk instead of the parser-sdk.

Gotchas:

  • The default values from the helm values will be used if no values are set in the ScanSpec.
  • They are replaced (not merged with) values from the ScanSpec.
  • They are not inherited by children; all ScanTypes need to explicitly set these during install.
  • They are set separately for scanner and parser.

@J12934 requested that I keep the changes to the YAML files separate to make reviewing easier. So, once this PR is merged, I will create a new PR that adds the changes to all the YAML files, and also takes care of #820, which requires changes to the same files.

Signed-off-by: Max Maass <max.maass@iteratec.com>
Signed-off-by: Max Maass <max.maass@iteratec.com>
@malexmave
Member Author

malexmave commented Nov 22, 2021

As requested by @Weltraumschaf, I have added an ADR that describes the architecture considerations of this change, and how the behavior in this PR differs from the standard behavior in other places.

I have also verified that the inheritAffinity / inheritTolerations flags work as intended. So, I now consider this ready to review.

@malexmave
Member Author

I created a companion PR with documentation updates: secureCodeBox/documentation#199

I am currently unsure if more documentation is needed, since this is a fairly niche feature, so it probably does not warrant its own howto entry. Since the updates to the values.yaml etc. are going to be done in a separate PR, I will also keep the required updates of the "how to build a scanner / hook" sections in a separate documentation PR.

@malexmave malexmave merged commit 05d273f into main Dec 6, 2021
@malexmave malexmave deleted the feature/affinity-and-tolerations branch December 6, 2021 14:29
@porcinnealabatoir

Hi, I have a problem with nmap:
nmap -oX
This command does not work; it will not create an XML file (it's the same with -oS etc.)

@J12934
Member

J12934 commented Dec 17, 2021

@porcinnealabatoir
Sorry, I don't speak much French, but from what I can gather, this seems unrelated to this issue. If it is still relevant, please open a new issue or ask the question in the Slack channel (preferably in English).


Labels

CRD Improvements or additions to CRDs enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Use preemptible node for scans

3 participants