Chapter 1. Working with pods
A pod is one or more containers deployed together on one host, and the smallest compute unit that can be defined, deployed, and managed.
1.1.1. Understanding pods
Pods are the rough equivalent of a machine instance (physical or virtual) to a Container. Each pod is allocated its own internal IP address, therefore owning its entire port space, and Containers within pods can share their local storage and networking.
Pods have a lifecycle; they are defined, then they are assigned to run on a node, then they run until their Container(s) exit or they are removed for some other reason. Pods, depending on policy and exit code, might be removed after exiting, or can be retained in order to enable access to the logs of their Containers.
OpenShift Container Platform treats pods as largely immutable; changes cannot be made to a pod definition while it is running. OpenShift Container Platform implements changes by terminating an existing pod and recreating it with modified configuration, base image(s), or both. Pods are also treated as expendable, and do not maintain state when recreated. Therefore pods should usually be managed by higher-level controllers, rather than directly by users.
For the maximum number of pods per OpenShift Container Platform node host, see the Cluster Limits.
Bare pods that are not managed by a replication controller will not be rescheduled upon node disruption.
1.1.2. Example pod configurations
OpenShift Container Platform leverages the Kubernetes concept of a pod , which is one or more Containers deployed together on one host, and the smallest compute unit that can be defined, deployed, and managed.
The following is an example definition of a pod that provides a long-running service. It demonstrates many features of pods, most of which are discussed in other topics and thus only briefly mentioned here:
Pod Object Definition (YAML)
kind: Pod
apiVersion: v1
metadata:
  name: example
  namespace: default
  selfLink: /api/v1/namespaces/default/pods/example
  uid: 5cc30063-0265780783bc
  resourceVersion: '165032'
  creationTimestamp: '2019-02-13T20:31:37Z'
  labels: 1
    app: hello-openshift
  annotations:
    openshift.io/scc: anyuid
spec:
  restartPolicy: Always 2
  serviceAccountName: default
  imagePullSecrets:
    - name: default-dockercfg-5zrhb
  priority: 0
  schedulerName: default-scheduler
  terminationGracePeriodSeconds: 30
  nodeName: ip-10-0-140-16.us-east-2.compute.internal
  securityContext: 3
    seLinuxOptions:
      level: 's0:c11,c10'
  containers: 4
    - resources: {}
      terminationMessagePath: /dev/termination-log
      name: hello-openshift
      securityContext:
        capabilities:
          drop:
            - MKNOD
        procMount: Default
      ports:
        - containerPort: 8080
          protocol: TCP
      imagePullPolicy: Always
      volumeMounts: 5
        - name: default-token-wbqsl
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      terminationMessagePolicy: File
      image: registry.redhat.io/openshift4/ose-ogging-eventrouter:v4.1 6
  serviceAccount: default 7
  volumes: 8
    - name: default-token-wbqsl
      secret:
        secretName: default-token-wbqsl
        defaultMode: 420
  dnsPolicy: ClusterFirst
status:
  phase: Pending
  conditions:
    - type: Initialized
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2019-02-13T20:31:37Z'
    - type: Ready
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2019-02-13T20:31:37Z'
      reason: ContainersNotReady
      message: 'containers with unready status: [hello-openshift]'
    - type: ContainersReady
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2019-02-13T20:31:37Z'
      reason: ContainersNotReady
      message: 'containers with unready status: [hello-openshift]'
    - type: PodScheduled
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2019-02-13T20:31:37Z'
  hostIP: 10.0.140.16
  startTime: '2019-02-13T20:31:37Z'
  containerStatuses:
    - name: hello-openshift
      state:
        waiting:
          reason: ContainerCreating
      lastState: {}
      ready: false
      restartCount: 0
      image: openshift/hello-openshift
      imageID: ''
  qosClass: BestEffort
Pods can be "tagged" with one or more labels, which can then be used to select and manage groups of pods in a single operation. The labels are stored in key/value format in the metadata hash. One label in this example is app: hello-openshift.
The pod restart policy with possible values Always , OnFailure , and Never . The default value is Always .
OpenShift Container Platform defines a security context for Containers which specifies whether they are allowed to run as privileged Containers, run as a user of their choice, and more. The default context is very restrictive but administrators can modify this as needed.
containers specifies an array of Container definitions; in this case (as with most), just one.
The Container specifies where external storage volumes should be mounted within the Container. In this case, there is a volume mount for the secret that gives the Container access to credentials it needs for making requests against the OpenShift Container Platform API.
Each Container in the pod is instantiated from its own Container image.
Pods making requests against the OpenShift Container Platform API is a common enough pattern that there is a serviceAccount field for specifying which service account user the pod should authenticate as when making the requests. This enables fine-grained access control for custom infrastructure components.
The pod defines storage volumes that are available to its Container(s) to use. In this case, it provides a secret volume containing the service account credentials.
This pod definition does not include attributes that are filled by OpenShift Container Platform automatically after the pod is created and its lifecycle begins. The Kubernetes pod documentation has details about the functionality and purpose of pods.
1.2. Viewing pods
As an administrator, you can view the pods in your cluster and determine the health of those pods and the cluster as a whole.
1.2.1. About pods
OpenShift Container Platform leverages the Kubernetes concept of a pod , which is one or more containers deployed together on one host, and the smallest compute unit that can be defined, deployed, and managed. Pods are the rough equivalent of a machine instance (physical or virtual) to a container.
You can view a list of pods associated with a specific project or view usage statistics about pods.
1.2.2. Viewing pods in a project
You can view a list of pods associated with the current project, including the number of replicas, the current status, the number of restarts, and the age of the pod.
Procedure
To view the pods in a project:
- Change to the project:
$ oc project <project_name>
- View the pods:
$ oc get pods
For example:
$ oc get pods -n openshift-console
NAME                       READY   STATUS    RESTARTS   AGE
console-698d866b78-bnshf   1/1     Running   2          165m
console-698d866b78-m87pm   1/1     Running   2          165m
Add the -o wide flag to view the pod IP address and the node where the pod is located.
$ oc get pods -o wide
NAME                       READY   STATUS    RESTARTS   AGE    IP            NODE                           NOMINATED NODE
console-698d866b78-bnshf   1/1     Running   2          166m   10.128.0.24   ip-10-0-152-71.ec2.internal    <none>
console-698d866b78-m87pm   1/1     Running   2          166m   10.129.0.23   ip-10-0-173-237.ec2.internal   <none>
1.2.3. Viewing pod usage statistics
You can display usage statistics about pods, which provide the runtime environments for Containers. These usage statistics include CPU, memory, and storage consumption.
Prerequisites
- You must have cluster-reader permission to view the usage statistics.
- Metrics must be installed to view the usage statistics.
Procedure
To view the usage statistics:
- Run the following command:
$ oc adm top pods
For example:
$ oc adm top pods -n openshift-console
NAME                         CPU(cores)   MEMORY(bytes)
console-7f58c69899-q8c8k     0m           22Mi
console-7f58c69899-xhbgg     0m           25Mi
downloads-594fcccf94-bcxk8   3m           18Mi
downloads-594fcccf94-kv4p6   2m           15Mi
To view the usage statistics for pods that match a label selector, run the following command; you must choose the selector (label query) to filter on:
$ oc adm top pod --selector=''
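For example, to restrict the output to pods that carry a hypothetical name=my-app label (the label key and value here are placeholders, not values defined elsewhere in this chapter):
$ oc adm top pod --selector='name=my-app'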
1.3. Configuring an OpenShift Container Platform cluster for pods
As an administrator, you can create and maintain an efficient cluster for pods.
By keeping your cluster efficient, you can provide a better environment for your developers by configuring what a pod does when it exits, ensuring that the required number of pods is always running, deciding when to restart pods designed to run only once, limiting the bandwidth available to pods, and keeping pods running during disruptions.
1.3.1. Configuring how pods behave after restart
A pod restart policy determines how OpenShift Container Platform responds when Containers in that pod exit. The policy applies to all Containers in that pod.
The possible values are:
- Always — Tries restarting a successfully exited Container on the pod continuously, with an exponential back-off delay (10s, 20s, 40s) until the pod is restarted. The default is Always .
- OnFailure — Tries restarting a failed Container on the pod with an exponential back-off delay (10s, 20s, 40s) capped at 5 minutes.
- Never — Does not try to restart exited or failed Containers on the pod. Pods immediately fail and exit.
After the pod is bound to a node, the pod will never be bound to another node. This means that a controller is necessary in order for a pod to survive node failure:
- Pods that are expected to terminate (such as batch computations): use a job with a restart policy of OnFailure or Never.
- Pods that are expected to not terminate (such as web servers): use a replication controller with a restart policy of Always.
- Pods that must run one-per-machine: use a daemon set; any restart policy applies.
If a Container on a pod fails and the restart policy is set to OnFailure , the pod stays on the node and the Container is restarted. If you do not want the Container to restart, use a restart policy of Never .
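For illustration, a minimal pod definition that opts out of restarts might look like the following sketch; the pod name and image are placeholders:
apiVersion: v1
kind: Pod
metadata:
  name: example-batch-pod   # hypothetical name
spec:
  restartPolicy: Never      # do not restart Containers after they exit
  containers:
  - name: worker
    image: openshift/hello-openshift   # placeholder image; any run-once workload applies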
If an entire pod fails, OpenShift Container Platform starts a new pod. Developers must address the possibility that applications might be restarted in a new pod. In particular, applications must handle temporary files, locks, incomplete output, and so forth caused by previous runs.
Kubernetes architecture expects reliable endpoints from cloud providers. When a cloud provider is down, the kubelet prevents OpenShift Container Platform from restarting.
If the underlying cloud provider endpoints are not reliable, do not install a cluster using cloud provider integration. Install the cluster as if it was in a no-cloud environment. It is not recommended to toggle cloud provider integration on or off in an installed cluster.
For details on how OpenShift Container Platform uses restart policy with failed Containers, see the Example States in the Kubernetes documentation.
1.3.2. Limiting the bandwidth available to pods
You can apply quality-of-service traffic shaping to a pod and effectively limit its available bandwidth. Egress traffic (from the pod) is handled by policing, which simply drops packets in excess of the configured rate. Ingress traffic (to the pod) is handled by shaping queued packets to effectively handle data. The limits you place on a pod do not affect the bandwidth of other pods.
Procedure
To limit the bandwidth on a pod:
- Write an object definition JSON file, and specify the data traffic speed using kubernetes.io/ingress-bandwidth and kubernetes.io/egress-bandwidth annotations. For example, to limit both pod egress and ingress bandwidth to 10M/s:
Limited Pod Object Definition
< "kind": "Pod", "spec": < "containers": [ < "image": "openshift/hello-openshift", "name": "hello-openshift" >] >, "apiVersion": "v1", "metadata": < "name": "iperf-slow", "annotations": < "kubernetes.io/ingress-bandwidth": "10M", "kubernetes.io/egress-bandwidth": "10M" >> >
- Create the pod using the object definition:
$ oc create -f <file_or_dir_path>
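The same annotations can also be expressed in YAML. The following is a sketch equivalent to the JSON definition above (same pod name and image, assuming the cluster's network plug-in honors the bandwidth annotations):
apiVersion: v1
kind: Pod
metadata:
  name: iperf-slow
  annotations:
    kubernetes.io/ingress-bandwidth: 10M   # limit traffic to the pod
    kubernetes.io/egress-bandwidth: 10M    # limit traffic from the pod
spec:
  containers:
  - name: hello-openshift
    image: openshift/hello-openshift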
1.3.3. Understanding how to use pod disruption budgets to specify the number of pods that must be up
A pod disruption budget is part of the Kubernetes API, which can be managed with oc commands like other object types. They allow the specification of safety constraints on pods during operations, such as draining a node for maintenance.
PodDisruptionBudget is an API object that specifies the minimum number or percentage of replicas that must be up at a time. Setting these in projects can be helpful during node maintenance (such as scaling a cluster down or a cluster upgrade) and is only honored on voluntary evictions (not on node failures).
A PodDisruptionBudget object’s configuration consists of the following key parts:
- A label selector, which is a label query over a set of pods.
- An availability level, which specifies the minimum number of pods that must be available simultaneously, either:
- minAvailable is the number of pods that must always be available, even during a disruption.
- maxUnavailable is the number of pods that can be unavailable during a disruption.
A maxUnavailable of 0% or 0, or a minAvailable of 100% or equal to the number of replicas, is permitted but can block nodes from being drained.
You can check for pod disruption budgets across all projects with the following:
$ oc get poddisruptionbudget --all-namespaces
NAMESPACE         NAME          MIN-AVAILABLE   SELECTOR
another-project   another-pdb   4               bar=foo
test-project      my-pdb        2               foo=bar
The PodDisruptionBudget is considered healthy when there are at least minAvailable pods running in the system. Every pod above that limit can be evicted.
Depending on your pod priority and preemption settings, lower-priority pods might be removed despite their pod disruption budget requirements.
1.3.3.1. Specifying the number of pods that must be up with pod disruption budgets
You can use a PodDisruptionBudget object to specify the minimum number or percentage of replicas that must be up at a time.
Procedure
To configure a pod disruption budget:
- Create a YAML file with an object definition similar to the following:
apiVersion: policy/v1beta1 1
kind: PodDisruptionBudget
metadata:
  name: my-pdb
spec:
  minAvailable: 2 2
  selector: 3
    matchLabels:
      foo: bar
PodDisruptionBudget is part of the policy/v1beta1 API group.
The minimum number of pods that must be available simultaneously. This can be either an integer or a string specifying a percentage, for example, 20% .
A label query over a set of resources. The result of matchLabels and matchExpressions are logically conjoined.
- Alternatively, to specify the maximum number of unavailable pods, create a YAML file with an object definition similar to the following:
apiVersion: policy/v1beta1 1
kind: PodDisruptionBudget
metadata:
  name: my-pdb
spec:
  maxUnavailable: 25% 2
  selector: 3
    matchLabels:
      foo: bar
PodDisruptionBudget is part of the policy/v1beta1 API group.
The maximum number of pods that can be unavailable simultaneously. This can be either an integer or a string specifying a percentage, for example, 20% .
A label query over a set of resources. The result of matchLabels and matchExpressions are logically conjoined.
- Run the following command to add the object to the project:
$ oc create -f </path/to/file> -n <project_name>
1.3.4. Preventing pod removal using critical pods
There are a number of core components that are critical to a fully functional cluster but run on a regular cluster node rather than the master. A cluster might stop working properly if a critical add-on is evicted.
Pods marked as critical are not allowed to be evicted.
To make a pod critical:
- Create a pod specification or edit existing pods to include the system-cluster-critical priority class:
spec:
  template:
    metadata:
      name: critical-pod
    priorityClassName: system-cluster-critical 1
Priority class for pods that are important to the cluster but can be evicted from a node in certain circumstances.
Alternatively, you can specify system-node-critical for pods that should never be evicted from a node.
- Create the pod:
$ oc create -f <file-name>.yaml
1.4. Automatically scaling pods
As a developer, you can use a horizontal pod autoscaler (HPA) to specify how OpenShift Container Platform should automatically increase or decrease the scale of a replication controller or deployment configuration, based on metrics collected from the pods that belong to that replication controller or deployment configuration.
1.4.1. Understanding horizontal pod autoscalers
You can create a horizontal pod autoscaler to specify the minimum and maximum number of pods you want to run, as well as the CPU utilization or memory utilization your pods should target.
Autoscaling for Memory Utilization is a Technology Preview feature only.
After you create a horizontal pod autoscaler, OpenShift Container Platform begins to query the CPU and/or memory resource metrics on the pods. When these metrics are available, the horizontal pod autoscaler computes the ratio of the current metric utilization with the desired metric utilization, and scales up or down accordingly. The query and scaling occurs at a regular interval, but can take one to two minutes before metrics become available.
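This document does not spell out the calculation, but the upstream Kubernetes autoscaler derives the desired replica count from the ratio of the current metric value to the target, roughly:
desiredReplicas = ceil( currentReplicas * currentMetricValue / desiredMetricValue )
For example, if 4 replicas currently average 80% CPU utilization against a 50% target, the autoscaler scales to ceil(4 * 80 / 50) = 7 replicas.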
For replication controllers, this scaling corresponds directly to the replicas of the replication controller. For deployment configurations, scaling corresponds directly to the replica count of the deployment configuration. Note that autoscaling applies only to the latest deployment in the Complete phase.
OpenShift Container Platform automatically accounts for resources and prevents unnecessary autoscaling during resource spikes, such as during start up. Pods in the unready state have 0 CPU usage when scaling up and the autoscaler ignores the pods when scaling down. Pods without known metrics have 0% CPU usage when scaling up and 100% CPU when scaling down. This allows for more stability during the HPA decision. To use this feature, you must configure readiness checks to determine if a new pod is ready for use.
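Because the autoscaler treats unready pods specially, readiness checks matter. The following is a minimal readiness probe sketch; the container name, image, path, and port are placeholders for your application:
spec:
  containers:
  - name: app
    image: openshift/hello-openshift   # placeholder image
    readinessProbe:
      httpGet:
        path: /healthz   # hypothetical health endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10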
In order to use horizontal pod autoscalers, your cluster administrator must have properly configured cluster metrics.
1.4.1.1. Supported metrics
The following metrics are supported by horizontal pod autoscalers:
Table 1.1. Metrics
- CPU utilization: Number of CPU cores used. Can be used to calculate a percentage of the pod’s requested CPU.
- Memory utilization: Amount of memory used. Can be used to calculate a percentage of the pod’s requested memory.
For memory-based autoscaling, memory usage must increase and decrease proportionally to the replica count. On average:
- An increase in replica count must lead to an overall decrease in memory (working set) usage per-pod.
- A decrease in replica count must lead to an overall increase in per-pod memory usage.
Use the OpenShift Container Platform web console to check the memory behavior of your application and ensure that your application meets these requirements before using memory-based autoscaling.
1.4.2. Creating a horizontal pod autoscaler for CPU utilization
You can create a horizontal pod autoscaler (HPA) for an existing DeploymentConfig or ReplicationController object that automatically scales the Pods associated with that object in order to maintain the CPU usage you specify.
The HPA increases and decreases the number of replicas between the minimum and maximum numbers to maintain the specified CPU utilization across all Pods.
When autoscaling for CPU utilization, you can use the oc autoscale command and specify the minimum and maximum number of Pods you want to run at any given time and the average CPU utilization your Pods should target. If you do not specify a minimum, the Pods are given default values from the OpenShift Container Platform server. To autoscale for a specific CPU value, create a HorizontalPodAutoscaler object with the target CPU and Pod limits.
Prerequisites
In order to use horizontal pod autoscalers, your cluster administrator must have properly configured cluster metrics. You can use the oc describe PodMetrics command to determine if metrics are configured. If metrics are configured, the output appears similar to the following, with Cpu and Memory displayed under Usage .
$ oc describe PodMetrics openshift-kube-scheduler-ip-10-0-135-131.ec2.internal
Name:         openshift-kube-scheduler-ip-10-0-135-131.ec2.internal
Namespace:    openshift-kube-scheduler
Labels:       <none>
Annotations:  <none>
API Version:  metrics.k8s.io/v1beta1
Containers:
  Name:  wait-for-host-port
  Usage:
    Memory:  0
  Name:  scheduler
  Usage:
    Cpu:     8m
    Memory:  45440Ki
Kind:         PodMetrics
Metadata:
  Creation Timestamp:  2019-05-23T18:47:56Z
  Self Link:           /apis/metrics.k8s.io/v1beta1/namespaces/openshift-kube-scheduler/pods/openshift-kube-scheduler-ip-10-0-135-131.ec2.internal
Timestamp:             2019-05-23T18:47:56Z
Window:                1m0s
Events:                <none>
Procedure
To create a horizontal pod autoscaler for CPU utilization:
- Perform one of the following:
- To scale based on the percent of CPU utilization, create a HorizontalPodAutoscaler object for an existing DeploymentConfig:
$ oc autoscale dc/<dc-name> \ 1
  --min <number> \ 2
  --max <number> \ 3
  --cpu-percent=<percent> 4
Specify the name of the DeploymentConfig. The object must exist.
Optionally, specify the minimum number of replicas when scaling down.
Specify the maximum number of replicas when scaling up.
Specify the target average CPU utilization over all the Pods, represented as a percent of requested CPU. If not specified or negative, a default autoscaling policy is used.
- To scale based on the percent of CPU utilization, create a HorizontalPodAutoscaler object for an existing ReplicationController:
$ oc autoscale rc/<rc-name> \ 1
  --min <number> \ 2
  --max <number> \ 3
  --cpu-percent=<percent> 4
Specify the name of the ReplicationController. The object must exist.
Specify the minimum number of replicas when scaling down.
Specify the maximum number of replicas when scaling up.
Specify the target average CPU utilization over all the Pods, represented as a percent of requested CPU. If not specified or negative, a default autoscaling policy is used.
- To scale for a specific CPU value, create a HorizontalPodAutoscaler object in a YAML file similar to the following:
apiVersion: autoscaling/v2beta2 1
kind: HorizontalPodAutoscaler
metadata:
  name: cpu-autoscale 2
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: v1 3
    kind: ReplicationController 4
    name: example 5
  minReplicas: 1 6
  maxReplicas: 10 7
  metrics: 8
  - type: Resource
    resource:
      name: cpu 9
      target:
        type: AverageValue 10
        averageValue: 500m 11
Use the autoscaling/v2beta2 API.
Specify a name for this horizontal pod autoscaler object.
Specify the API version of the object to scale:
- For a ReplicationController, use v1 ,
- For a DeploymentConfig, use apps.openshift.io/v1 .
Specify the kind of object to scale, either ReplicationController or DeploymentConfig .
Specify the name of the object to scale. The object must exist.
Specify the minimum number of replicas when scaling down.
Specify the maximum number of replicas when scaling up.
Use the metrics parameter for CPU utilization.
Specify cpu for CPU utilization.
Set the type to AverageValue.
Specify averageValue and a specific CPU value.
- Create the horizontal pod autoscaler:
$ oc create -f <file-name>.yaml
- Verify that the horizontal pod autoscaler was created:
$ oc get hpa hpa-resource-metrics-memory
NAME                          REFERENCE                       TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
hpa-resource-metrics-memory   ReplicationController/example   2441216/500Mi   1         10        1          20m
For example, the following command creates a horizontal pod autoscaler that maintains between 3 and 7 replicas of the Pods that are controlled by the image-registry DeploymentConfig in order to maintain an average CPU utilization of 75% across all Pods.
$ oc autoscale dc/image-registry --min 3 --max 7 --cpu-percent=75
deploymentconfig "image-registry" autoscaled
The command creates a horizontal pod autoscaler with the following definition:
$ oc edit hpa frontend -n openshift-image-registry
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  creationTimestamp: "2020-02-21T20:19:28Z"
  name: image-registry
  namespace: default
  resourceVersion: "32452"
  selfLink: /apis/autoscaling/v1/namespaces/default/horizontalpodautoscalers/frontend
  uid: 1a934a22-925d-431e-813a-d00461ad7521
spec:
  maxReplicas: 7
  minReplicas: 3
  scaleTargetRef:
    apiVersion: apps.openshift.io/v1
    kind: DeploymentConfig
    name: image-registry
  targetCPUUtilizationPercentage: 75
status:
  currentReplicas: 5
  desiredReplicas: 0
The following example shows autoscaling for the image-registry DeploymentConfig. The initial deployment requires 3 Pods. The HPA object increased that minimum to 5 and will increase the Pods up to 7 if CPU usage on the Pods reaches 75%:
$ oc get dc image-registry
NAME             REVISION   DESIRED   CURRENT   TRIGGERED BY
image-registry   1          3         3         config

$ oc autoscale dc/image-registry --min=5 --max=7 --cpu-percent=75
horizontalpodautoscaler.autoscaling/image-registry autoscaled

$ oc get dc image-registry
NAME             REVISION   DESIRED   CURRENT   TRIGGERED BY
image-registry   1          5         5         config
1.4.3. Creating a horizontal pod autoscaler object for memory utilization
You can create a horizontal pod autoscaler (HPA) for an existing DeploymentConfig or ReplicationController object that automatically scales the Pods associated with that object in order to maintain the average memory utilization you specify, either a direct value or a percentage of requested memory.
The HPA increases and decreases the number of replicas between the minimum and maximum numbers to maintain the specified memory utilization across all Pods.
For memory utilization, you can specify the minimum and maximum number of Pods and the average memory utilization your Pods should target. If you do not specify a minimum, the Pods are given default values from the OpenShift Container Platform server.
Autoscaling for memory utilization is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs), might not be functionally complete, and Red Hat does not recommend using them for production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information on Red Hat Technology Preview features support scope, see https://access.redhat.com/support/offerings/techpreview/.
Prerequisites
In order to use horizontal pod autoscalers, your cluster administrator must have properly configured cluster metrics. You can use the oc describe PodMetrics command to determine if metrics are configured. If metrics are configured, the output appears similar to the following, with Cpu and Memory displayed under Usage .
$ oc describe PodMetrics openshift-kube-scheduler-ip-10-0-129-223.compute.internal -n openshift-kube-scheduler
Name:         openshift-kube-scheduler-ip-10-0-129-223.compute.internal
Namespace:    openshift-kube-scheduler
Labels:       <none>
Annotations:  <none>
API Version:  metrics.k8s.io/v1beta1
Containers:
  Name:  scheduler
  Usage:
    Cpu:     2m
    Memory:  41056Ki
  Name:  wait-for-host-port
  Usage:
    Memory:  0
Kind:         PodMetrics
Metadata:
  Creation Timestamp:  2020-02-14T22:21:14Z
  Self Link:           /apis/metrics.k8s.io/v1beta1/namespaces/openshift-kube-scheduler/pods/openshift-kube-scheduler-ip-10-0-129-223.compute.internal
Timestamp:             2020-02-14T22:21:14Z
Window:                5m0s
Events:                <none>
Procedure
To create a horizontal pod autoscaler for memory utilization:
- Create a YAML file for one of the following:
- To scale for a specific memory value, create a HorizontalPodAutoscaler object similar to the following for an existing DeploymentConfig or ReplicationController:
apiVersion: autoscaling/v2beta2 1
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-resource-metrics-memory 2
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: v1 3
    kind: ReplicationController 4
    name: example 5
  minReplicas: 1 6
  maxReplicas: 10 7
  metrics: 8
  - type: Resource
    resource:
      name: memory 9
      target:
        type: AverageValue 10
        averageValue: 500Mi 11
Use the autoscaling/v2beta2 API.
Specify a name for this horizontal pod autoscaler object.
Specify the API version of the object to scale:
- For a ReplicationController, use v1 ,
- For a DeploymentConfig, use apps.openshift.io/v1 .
Specify the kind of object to scale, either ReplicationController or DeploymentConfig .
Specify the name of the object to scale. The object must exist.
Specify the minimum number of replicas when scaling down.
Specify the maximum number of replicas when scaling up.
Use the metrics parameter for memory utilization.
Specify memory for memory utilization.
Set the type to AverageValue.
Specify averageValue and a specific memory value.
- To scale for a percentage of requested memory, create a HorizontalPodAutoscaler object similar to the following for an existing DeploymentConfig:
apiVersion: autoscaling/v2beta2 1
kind: HorizontalPodAutoscaler
metadata:
  name: memory-autoscale 2
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps.openshift.io/v1 3
    kind: DeploymentConfig 4
    name: example 5
  minReplicas: 1 6
  maxReplicas: 10 7
  metrics: 8
  - type: Resource
    resource:
      name: memory 9
      target:
        type: Utilization 10
        averageUtilization: 50 11
Use the autoscaling/v2beta2 API.
Specify a name for this horizontal pod autoscaler object.
Specify the API version of the object to scale:
- For a ReplicationController, use v1 ,
- For a DeploymentConfig, use apps.openshift.io/v1 .
Specify the kind of object to scale, either ReplicationController or DeploymentConfig .
Specify the name of the object to scale. The object must exist.
Specify the minimum number of replicas when scaling down.
Specify the maximum number of replicas when scaling up.
Use the metrics parameter for memory utilization.
Specify memory for memory utilization.
Set to Utilization .
Specify averageUtilization and a target average memory utilization over all the pods, represented as a percentage of requested memory.
- Create the horizontal pod autoscaler:
$ oc create -f <file-name>.yaml
For example:
$ oc create -f hpa.yaml
horizontalpodautoscaler.autoscaling/hpa-resource-metrics-memory created
- Verify that the horizontal pod autoscaler was created:
$ oc get hpa hpa-resource-metrics-memory
NAME                          REFERENCE                       TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
hpa-resource-metrics-memory   ReplicationController/example   2441216/500Mi   1         10        1          20m
$ oc describe hpa hpa-resource-metrics-memory
Name:                        hpa-resource-metrics-memory
Namespace:                   default
Labels:                      <none>
Annotations:                 <none>
CreationTimestamp:           Wed, 04 Mar 2020 16:31:37 +0530
Reference:                   ReplicationController/example
Metrics:                     ( current / target )
  resource memory on pods:   2441216 / 500Mi
Min replicas:                1
Max replicas:                10
ReplicationController pods:  1 current / 1 desired
Conditions:
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    ReadyForNewScale    recommended size matches current size
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from memory resource
  ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
Events:
  Type    Reason             Age    From                       Message
  ----    ------             ----   ----                       -------
  Normal  SuccessfulRescale  6m34s  horizontal-pod-autoscaler  New size: 1; reason: All metrics below target
1.4.4. Understanding horizontal pod autoscaler status conditions
You can use the status conditions set to determine whether or not the horizontal pod autoscaler (HPA) is able to scale and whether or not it is currently restricted in any way.
The HPA status conditions are available with the v2beta1 version of the autoscaling API.
The HPA responds with the following status conditions:
- The AbleToScale condition indicates whether HPA is able to fetch and update metrics, as well as whether any backoff-related conditions could prevent scaling.
- A True condition indicates scaling is allowed.
- A False condition indicates scaling is not allowed for the reason specified.
- The ScalingActive condition indicates whether the HPA is enabled and is able to calculate the desired metrics.
- A True condition indicates metrics is working properly.
- A False condition generally indicates a problem with fetching metrics.
- The ScalingLimited condition indicates that the desired scale was capped by the maximum or minimum of the horizontal pod autoscaler.
- A True condition indicates that you need to raise or lower the minimum or maximum replica count in order to scale.
- A False condition indicates that the requested scaling is allowed.
$ oc describe hpa cm-test
Name:                           cm-test
Namespace:                      prom
Labels:                         <none>
Annotations:                    <none>
CreationTimestamp:              Fri, 16 Jun 2017 18:09:22 +0000
Reference:                      ReplicationController/cm-test
Metrics:                        ( current / target )
  "http_requests" on pods:      66m / 500m
Min replicas:                   1
Max replicas:                   4
ReplicationController pods:     1 current / 1 desired
Conditions: 1
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    ReadyForNewScale    the last scale time was sufficiently old as to warrant a new scale
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from pods metric http_request
  ScalingLimited  False   DesiredWithinRange  the desired replica count is within the acceptable range
Events:
The horizontal pod autoscaler status messages.
The following is an example of a pod that is unable to scale:
Conditions:
  Type         Status  Reason          Message
  ----         ------  ------          -------
  AbleToScale  False   FailedGetScale  the HPA controller was unable to get the target's current scale: no matches for kind "ReplicationController" in group "apps"
Events:
  Type     Reason          Age               From                       Message
  ----     ------          ----              ----                       -------
  Warning  FailedGetScale  6s (x3 over 36s)  horizontal-pod-autoscaler  no matches for kind "ReplicationController" in group "apps"
The following is an example of a pod that could not obtain the needed metrics for scaling:
Conditions:
  Type           Status  Reason                   Message
  ----           ------  ------                   -------
  AbleToScale    True    SucceededGetScale        the HPA controller was able to get the target's current scale
  ScalingActive  False   FailedGetResourceMetric  the HPA was unable to compute the replica count: unable to get metrics for resource cpu: no metrics returned from heapster
The following is an example of a pod where the requested autoscaling was less than the required minimums:
Conditions:
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    ReadyForNewScale    the last scale time was sufficiently old as to warrant a new scale
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from pods metric http_request
  ScalingLimited  False   DesiredWithinRange  the desired replica count is within the acceptable range
1.4.4.1. Viewing horizontal pod autoscaler status conditions
You can view the status conditions set on a pod by the horizontal pod autoscaler (HPA).
The horizontal pod autoscaler status conditions are available with the v2beta1 version of the autoscaling API.
Prerequisites
In order to use horizontal pod autoscalers, your cluster administrator must have properly configured cluster metrics. You can use the oc describe PodMetrics command to determine if metrics are configured. If metrics are configured, the output appears similar to the following, with Cpu and Memory displayed under Usage .
$ oc describe PodMetrics openshift-kube-scheduler-ip-10-0-135-131.ec2.internal
Name:         openshift-kube-scheduler-ip-10-0-135-131.ec2.internal
Namespace:    openshift-kube-scheduler
Labels:       <none>
Annotations:  <none>
API Version:  metrics.k8s.io/v1beta1
Containers:
  Name:  wait-for-host-port
  Usage:
    Memory:  0
  Name:  scheduler
  Usage:
    Cpu:     8m
    Memory:  45440Ki
Kind:         PodMetrics
Metadata:
  Creation Timestamp:  2019-05-23T18:47:56Z
  Self Link:           /apis/metrics.k8s.io/v1beta1/namespaces/openshift-kube-scheduler/pods/openshift-kube-scheduler-ip-10-0-135-131.ec2.internal
Timestamp:             2019-05-23T18:47:56Z
Window:                1m0s
Events:                <none>
Procedure
To view the status conditions on a pod, use the following command with the name of the pod:
$ oc describe hpa <pod-name>
$ oc describe hpa cm-test
The conditions appear in the Conditions field in the output.
Name:                           cm-test
Namespace:                      prom
Labels:                         <none>
Annotations:                    <none>
CreationTimestamp:              Fri, 16 Jun 2017 18:09:22 +0000
Reference:                      ReplicationController/cm-test
Metrics:                        ( current / target )
  "http_requests" on pods:      66m / 500m
Min replicas:                   1
Max replicas:                   4
ReplicationController pods:     1 current / 1 desired
Conditions: 1
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    ReadyForNewScale    the last scale time was sufficiently old as to warrant a new scale
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from pods metric http_request
  ScalingLimited  False   DesiredWithinRange  the desired replica count is within the acceptable range
1.4.5. Additional resources
For more information on replication controllers and deployment controllers, see Understanding Deployments and DeploymentConfigs.
1.5. Providing sensitive data to pods
Some applications need sensitive information, such as passwords and user names, that you do not want developers to have.
As an administrator, you can use Secret objects to provide this information without exposing that information in clear text.
1.5.1. Understanding secrets
The Secret object type provides a mechanism to hold sensitive information such as passwords, OpenShift Container Platform client configuration files, private source repository credentials, and so on. Secrets decouple sensitive content from the pods. You can mount secrets into Containers using a volume plug-in or the system can use secrets to perform actions on behalf of a pod.
Key properties include:
- Secret data can be referenced independently from its definition.
- Secret data volumes are backed by temporary file-storage facilities (tmpfs) and never come to rest on a node.
- Secret data can be shared within a namespace.
YAML Secret Object Definition
apiVersion: v1
kind: Secret
metadata:
  name: test-secret
  namespace: my-namespace
type: Opaque 1
data: 2
  username: dmFsdWUtMQ0K 3
  password: dmFsdWUtMg0KDQo=
stringData: 4
  hostname: myapp.mydomain.com 5
Indicates the structure of the secret’s key names and values.
The allowable format for the keys in the data field must meet the guidelines in the DNS_SUBDOMAIN value in the Kubernetes identifiers glossary.
The value associated with keys in the data map must be base64 encoded.
Entries in the stringData map are converted to base64 and then moved to the data map automatically. This field is write-only; the value is only returned through the data field.
The value associated with keys in the stringData map is made up of plain text strings.
You must create a secret before creating the pods that depend on that secret.
When creating secrets:
- Create a secret object with secret data.
- Update the pod’s service account to allow the reference to the secret.
- Create a pod, which consumes the secret as an environment variable or as a file (using a secret volume).
1.5.1.1. Types of secrets
The value in the type field indicates the structure of the secret’s key names and values. The type can be used to enforce the presence of user names and keys in the secret object. If you do not want validation, use the opaque type, which is the default.
Specify one of the following types to trigger minimal server-side validation to ensure the presence of specific key names in the secret data:
- kubernetes.io/service-account-token . Uses a service account token.
- kubernetes.io/basic-auth . Use with Basic Authentication.
- kubernetes.io/ssh-auth . Use with SSH Key Authentication.
- kubernetes.io/tls . Use with TLS certificate authorities.
Specify type: Opaque if you do not want validation, which means the secret does not claim to conform to any convention for key names or values. An opaque secret allows for unstructured key:value pairs that can contain arbitrary values.
You can specify other arbitrary types, such as example.com/my-secret-type . These types are not enforced server-side, but indicate that the creator of the secret intended to conform to the key/value requirements of that type.
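For instance, a kubernetes.io/basic-auth secret is expected to carry username and password keys. The following is a sketch that uses the stringData convenience field; the secret name and values are placeholders:
apiVersion: v1
kind: Secret
metadata:
  name: basic-auth-example   # hypothetical name
type: kubernetes.io/basic-auth
stringData:
  username: my-user       # placeholder value
  password: my-password   # placeholder value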
For examples of different secret types, see the code samples in Using Secrets .
1.5.1.2. Example secret configurations
The following are sample secret configuration files.
YAML Secret That Will Create Four Files
apiVersion: v1
kind: Secret
metadata:
  name: test-secret
data:
  username: dmFsdWUtMQ0K 1
  password: dmFsdWUtMQ0KDQo= 2
stringData:
  hostname: myapp.mydomain.com 3
  secret.properties: |- 4
    property1=valueA
    property2=valueB
File contains decoded values.
File contains decoded values.
File contains the provided string.
File contains the provided data.
YAML of a Pod Populating Files in a Volume with Secret Data
apiVersion: v1
kind: Pod
metadata:
  name: secret-example-pod
spec:
  containers:
    - name: secret-test-container
      image: busybox
      command: [ "/bin/sh", "-c", "cat /etc/secret-volume/*" ]
      volumeMounts:
          # name must match the volume name below
          - name: secret-volume
            mountPath: /etc/secret-volume
            readOnly: true
  volumes:
    - name: secret-volume
      secret:
        secretName: test-secret
  restartPolicy: Never
YAML of a Pod Populating Environment Variables with Secret Data
apiVersion: v1
kind: Pod
metadata:
  name: secret-example-pod
spec:
  containers:
    - name: secret-test-container
      image: busybox
      command: [ "/bin/sh", "-c", "export" ]
      env:
        - name: TEST_SECRET_USERNAME_ENV_VAR
          valueFrom:
            secretKeyRef:
              name: test-secret
              key: username
  restartPolicy: Never
YAML of a Build Config Populating Environment Variables with Secret Data
apiVersion: v1
kind: BuildConfig
metadata:
  name: secret-example-bc
spec:
  strategy:
    sourceStrategy:
      env:
        - name: TEST_SECRET_USERNAME_ENV_VAR
          valueFrom:
            secretKeyRef:
              name: test-secret
              key: username
1.5.1.3. Secret data keys
Secret keys must be in a DNS subdomain.
1.5.2. Understanding how to create secrets
As an administrator you must create a secret before developers can create the pods that depend on that secret.
When creating secrets:
- Create a secret object with secret data.
- Update the pod’s service account to allow the reference to the secret.
- Create a pod, which consumes the secret as an environment variable or as a file (using a secret volume).
1.5.2.1. Secret creation restrictions
To use a secret, a pod needs to reference the secret. A secret can be used with a pod in three ways:
- To populate environment variables for Containers.
- As files in a volume mounted on one or more of its Containers.
- By kubelet when pulling images for the pod.
Volume type secrets write data into the Container as a file using the volume mechanism. Image pull secrets use service accounts for the automatic injection of the secret into all pods in a namespace.
When a template contains a secret definition, the only way for the template to use the provided secret is to ensure that the secret volume sources are validated and that the specified object reference actually points to an object of type Secret . Therefore, a secret needs to be created before any pods that depend on it. The most effective way to ensure this is to have it get injected automatically through the use of a service account.
Secret API objects reside in a namespace. They can only be referenced by pods in that same namespace.
Individual secrets are limited to 1MB in size. This is to discourage the creation of large secrets that could exhaust apiserver and kubelet memory. However, creation of a number of smaller secrets could also exhaust memory.
1.5.2.2. Creating an opaque secret
As an administrator, you can create an opaque secret, which allows for unstructured key:value pairs that can contain arbitrary values.
Procedure
-
Create a secret object in a YAML file on master. For example:
apiVersion: v1
kind: Secret
metadata:
  name: mysecret
type: Opaque 1
data:
  username: dXNlci1uYW1l
  password: cGFzc3dvcmQ=
Specifies an opaque secret.
- Create the secret from the YAML file:
$ oc create -f <filename>.yaml
- Update the service account for the pod where you want to use the secret to allow the reference to the secret.
- Create the pod, which consumes the secret as an environment variable or as a file (using a secret volume).
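As an alternative to writing the Secret YAML by hand, an equivalent opaque secret can typically be created directly from literal values with the oc CLI; a sketch with placeholder key values:
$ oc create secret generic mysecret \
    --from-literal=username=user-name \
    --from-literal=password=password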
1.5.3. Understanding how to update secrets
When you modify the value of a secret, the value (used by an already running pod) will not dynamically change. To change a secret, you must delete the original pod and create a new pod (perhaps with an identical PodSpec).
Updating a secret follows the same workflow as deploying a new Container image. You can use the kubectl rolling-update command.
The resourceVersion value in a secret is not specified when it is referenced. Therefore, if a secret is updated at the same time as pods are starting, the version of the secret that will be used for the pod is not defined.
Currently, it is not possible to check the resource version of a secret object that was used when a pod was created. It is planned that pods will report this information, so that a controller could restart ones using an old resourceVersion. In the interim, do not update the data of existing secrets, but create new ones with distinct names.
1.5.4. About using signed certificates with secrets
To secure communication to your service, you can configure OpenShift Container Platform to generate a signed serving certificate/key pair that you can add into a secret in a project.
A service serving certificate secret is intended to support complex middleware applications that need out-of-the-box certificates. It has the same settings as the server certificates generated by the administrator tooling for nodes and masters.
Sample service specification configured for a service serving certificate secret:
apiVersion: v1
kind: Service
metadata:
  name: registry
  annotations:
    service.alpha.openshift.io/serving-cert-secret-name: registry-cert 1
...
Specify the name for the certificate
Other pods can trust cluster-created certificates (which are only signed for internal DNS names), by using the CA bundle in the /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt file that is automatically mounted in their pod.
The signature algorithm for this feature is x509.SHA256WithRSA . To manually rotate, delete the generated secret. A new certificate is created.
1.5.4.1. Generating signed certificates for use with secrets
To use a signed serving certificate/key pair with a pod, create or edit the service to add the service.alpha.openshift.io/serving-cert-secret-name annotation, then add the secret to the pod.
Procedure
To create a service serving certificate secret :
- Edit the pod specification for your service.
- Add the service.alpha.openshift.io/serving-cert-secret-name annotation with the name you want to use for your secret.
kind: Service
apiVersion: v1
metadata:
  name: my-service
  annotations:
    service.alpha.openshift.io/serving-cert-secret-name: my-cert 1
spec:
  selector:
    app: MyApp
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376
$ oc create -f <file-name>.yaml
$ oc get secrets
NAME      TYPE                DATA   AGE
my-cert   kubernetes.io/tls   2      9m

$ oc describe secret my-service-pod
Name:         my-service-pod
Namespace:    openshift-console
Labels:       <none>
Annotations:  kubernetes.io/service-account.name: builder
              kubernetes.io/service-account.uid: ab-11e9-988a-0eb4e1b4a396
Type:         kubernetes.io/service-account-token

Data

ca.crt:     5802 bytes
namespace:  17 bytes
token:      eyJhbGciOiJSUzI1NiIsImtpZCI6IiJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Ii
wia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtY29uc29sZSIsImt1YmVyb
cnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC51aWQiOiJhYmE4Y2UyZC00MzVlLTExZTktOTg4YS0wZWI0ZTFiNGEz
OTYiLCJzdWIiOiJzeXN0ZW06c2VydmljZWFjY291bnQ6b3BlbnNoaWZ
- Edit your pod specification to mount the secret as a volume:
apiVersion: v1
kind: Pod
metadata:
  name: my-service-pod
spec:
  containers:
  - name: mypod
    image: redis
    volumeMounts:
    - name: foo
      mountPath: "/etc/foo"
  volumes:
  - name: foo
    secret:
      secretName: my-cert
      items:
      - key: username
        path: my-group/my-username
        mode: 511
When it is available, your pod will run. The certificate will be good for the internal service DNS name, <service.name>.<service.namespace>.svc.
In most cases, the service DNS name <service.name>.<service.namespace>.svc is not externally routable and is primarily used for intracluster communication.
1.5.5. Troubleshooting secrets
If service certificate generation fails (the service's service.alpha.openshift.io/serving-cert-generation-error annotation contains an error similar to the following):
secret/ssl-key references serviceUID 62ad25ca-d703-11e6-9d6f-0e9c0057b608, which does not match 77b6dd80-d716-11e6-9d6f-0e9c0057b60
The service that generated the certificate no longer exists, or has a different serviceUID. You must force certificate regeneration by removing the old secret and clearing the following annotations on the service: service.alpha.openshift.io/serving-cert-generation-error and service.alpha.openshift.io/serving-cert-generation-error-num:
$ oc delete secret <secret_name>
$ oc annotate service <service_name> service.alpha.openshift.io/serving-cert-generation-error- 1
$ oc annotate service <service_name> service.alpha.openshift.io/serving-cert-generation-error-num- 1
The command that removes an annotation has a - (hyphen) after the annotation name to be removed.
1.6. Using device plug-ins to access external resources with pods
Device plug-ins allow you to use a particular device type (GPU, InfiniBand, or other similar computing resources that require vendor-specific initialization and setup) in your OpenShift Container Platform pod without needing to write custom code.
1.6.1. Understanding device plug-ins
The device plug-in provides a consistent and portable solution to consume hardware devices across clusters. The device plug-in provides support for these devices through an extension mechanism, which makes these devices available to Containers, provides health checks of these devices, and securely shares them.
OpenShift Container Platform supports the device plug-in API, but the device plug-in Containers are supported by individual vendors.
A device plug-in is a gRPC service running on the nodes (external to the kubelet ) that is responsible for managing specific hardware resources. Any device plug-in must support following remote procedure calls (RPCs):
service DevicePlugin {
      // GetDevicePluginOptions returns options to be communicated with Device
      // Manager
      rpc GetDevicePluginOptions(Empty) returns (DevicePluginOptions) {}

      // ListAndWatch returns a stream of List of Devices
      // Whenever a Device state change or a Device disappears, ListAndWatch
      // returns the new list
      rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {}

      // Allocate is called during container creation so that the Device
      // Plug-in can run device specific operations and instruct Kubelet
      // of the steps to make the Device available in the container
      rpc Allocate(AllocateRequest) returns (AllocateResponse) {}

      // PreStartcontainer is called, if indicated by Device Plug-in during
      // registration phase, before each container start. Device plug-in
      // can run device specific operations such as reseting the device
      // before making devices available to the container
      rpc PreStartcontainer(PreStartcontainerRequest) returns (PreStartcontainerResponse) {}
}
Example device plug-ins
- Nvidia GPU device plug-in for COS-based operating system
- Nvidia official GPU device plug-in
- Solarflare device plug-in
- KubeVirt device plug-ins: vfio and kvm
For easy device plug-in reference implementation, there is a stub device plug-in in the Device Manager code: vendor/k8s.io/kubernetes/pkg/kubelet/cm/deviceplugin/device_plugin_stub.go .
1.6.1.1. Methods for deploying a device plug-in
- Daemonsets are the recommended approach for device plug-in deployments.
- Upon start, the device plug-in will try to create a UNIX domain socket at /var/lib/kubelet/device-plugin/ on the node to serve RPCs from Device Manager.
- Since device plug-ins must manage hardware resources, access to the host file system, as well as socket creation, they must be run in a privileged security context.
- More specific details regarding deployment steps can be found with each device plug-in implementation.
1.6.2. Understanding the Device Manager
Device Manager provides a mechanism for advertising specialized node hardware resources with the help of plug-ins known as device plug-ins.
You can advertise specialized hardware without requiring any upstream code changes.
OpenShift Container Platform supports the device plug-in API, but the device plug-in Containers are supported by individual vendors.
Device Manager advertises devices as Extended Resources . User pods can consume devices, advertised by Device Manager, using the same Limit/Request mechanism, which is used for requesting any other Extended Resource .
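For example, a pod could request a device through the limits of its Container, just as it would any other Extended Resource. The sketch below assumes a device plug-in that advertises the nvidia.com/gpu resource; the resource name depends on the plug-in actually deployed in your cluster, and the pod name and image are placeholders:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example   # hypothetical name
spec:
  containers:
  - name: cuda-workload
    image: nvidia/cuda   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1   # Extended Resource advertised by the device plug-in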
Upon start, the device plug-in registers itself with Device Manager by invoking Register on /var/lib/kubelet/device-plugins/kubelet.sock and starts a gRPC service at /var/lib/kubelet/device-plugins/<plugin>.sock for serving Device Manager requests.
Device Manager, while processing a new registration request, invokes the ListAndWatch remote procedure call (RPC) at the device plug-in service. In response, Device Manager gets a list of Device objects from the plug-in over a gRPC stream. Device Manager will keep watching on the stream for new updates from the plug-in. On the plug-in side, the plug-in will also keep the stream open and whenever there is a change in the state of any of the devices, a new device list is sent to the Device Manager over the same streaming connection.
While handling a new pod admission request, Kubelet passes requested Extended Resources to the Device Manager for device allocation. Device Manager checks its database to verify whether a corresponding plug-in exists. If the plug-in exists and has free allocatable devices recorded in its local cache, the Allocate RPC is invoked at that particular device plug-in.
Additionally, device plug-ins can also perform several other device-specific operations, such as driver installation, device initialization, and device resets. These functionalities vary from implementation to implementation.
1.6.3. Enabling Device Manager
Enable Device Manager to implement a device plug-in to advertise specialized hardware without any upstream code changes.
Device Manager provides a mechanism for advertising specialized node hardware resources with the help of plug-ins known as device plug-ins.
- Obtain the label associated with the static Machine Config Pool CRD for the type of node you want to configure. Perform one of the following steps:
- View the Machine Config:
# oc describe machineconfig <name>
For example:
# oc describe machineconfig 00-worker
Name:         00-worker
Namespace:
Labels:       machineconfiguration.openshift.io/role=worker 1
Label required for the device manager.
Procedure
-
Create a Custom Resource (CR) for your configuration change.
Sample configuration for a Device Manager CR
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: devicemgr 1
spec:
  machineConfigPoolSelector:
    matchLabels:
      machineconfiguration.openshift.io: devicemgr 2
  kubeletConfig:
    feature-gates:
      - DevicePlugins=true 3
Assign a name to CR.
Enter the label from the Machine Config Pool.
Set DevicePlugins to true.
- Create the device manager:
$ oc create -f devicemgr.yaml
kubeletconfig.machineconfiguration.openshift.io/devicemgr created
1.7. Including pod priority in pod scheduling decisions
You can enable pod priority and preemption in your cluster. Pod priority indicates the importance of a pod relative to other pods and queues the pods based on that priority. Pod preemption allows the cluster to evict, or preempt, lower-priority pods so that higher-priority pods can be scheduled if there is no available space on a suitable node. Pod priority also affects the scheduling order of pods and out-of-resource eviction ordering on the node.
To use priority and preemption, you create priority classes that define the relative weight of your pods. Then, reference a priority class in the pod specification to apply that weight for scheduling.
Preemption is controlled by the disablePreemption parameter in the scheduler configuration file, which is set to false by default.
1.7.1. Understanding pod priority
When you use the Pod Priority and Preemption feature, the scheduler orders pending pods by their priority, and a pending pod is placed ahead of other pending pods with lower priority in the scheduling queue. As a result, the higher priority pod might be scheduled sooner than pods with lower priority if its scheduling requirements are met. If a pod cannot be scheduled, scheduler continues to schedule other lower priority pods.
1.7.1.1. Pod priority classes
You can assign pods a priority class, which is a non-namespaced object that defines a mapping from a name to the integer value of the priority. The higher the value, the higher the priority.
A priority class object can take any 32-bit integer value smaller than or equal to 1000000000 (one billion). Reserve numbers larger than one billion for critical pods that should not be preempted or evicted. By default, OpenShift Container Platform has two reserved priority classes for critical system pods to have guaranteed scheduling.
$ oc get priorityclasses
NAME                      CREATED AT
cluster-logging           2019-03-13T14:45:12Z
system-cluster-critical   2019-03-13T14:01:10Z
system-node-critical      2019-03-13T14:01:10Z
- system-node-critical — This priority class has a value of 2000001000 and is used for all pods that should never be evicted from a node. Examples of pods that have this priority class are sdn-ovs , sdn , and so forth. A number of critical components include the system-node-critical priority class by default, for example:
- master-api
- master-controller
- master-etcd
- sdn
- sdn-ovs
- sync
- fluentd
- metrics-server
- descheduler
- system-cluster-critical — This priority class has a value of 2000000000 (two billion) and is used with pods that are important for the cluster. Pods with this priority class can be evicted from a node in certain circumstances, for example when pods with the system-node-critical priority class take precedence, but this priority class does ensure guaranteed scheduling.
If you upgrade your existing cluster, the priority of your existing pods is effectively zero. However, existing pods with the scheduler.alpha.kubernetes.io/critical-pod annotation are automatically converted to system-cluster-critical class. Fluentd cluster logging pods with the annotation are converted to the cluster-logging priority class.
1.7.1.2. Pod priority names
After you have one or more priority classes, you can create pods that specify a priority class name in a pod specification. The priority admission controller uses the priority class name field to populate the integer value of the priority. If the named priority class is not found, the pod is rejected.
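For example, a pod that references the high-priority class shown later in this section would set priorityClassName in its specification; the pod name and image below are placeholders:
apiVersion: v1
kind: Pod
metadata:
  name: priority-example   # hypothetical name
spec:
  priorityClassName: high-priority   # must match an existing priority class name
  containers:
  - name: app
    image: openshift/hello-openshift   # placeholder image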
1.7.2. Understanding pod preemption
When a developer creates a pod, the pod goes into a queue. If the developer configured the pod for pod priority or preemption, the scheduler picks a pod from the queue and tries to schedule the pod on a node. If the scheduler cannot find space on an appropriate node that satisfies all the specified requirements of the pod, preemption logic is triggered for the pending pod.
When the scheduler preempts one or more pods on a node, the nominatedNodeName field of the higher-priority pod specification is set to the name of the node, along with the nodeName field. The scheduler uses the nominatedNodeName field to keep track of the resources reserved for pods and also provides information to the user about preemptions in the clusters.
After the scheduler preempts a lower-priority pod, the scheduler honors the graceful termination period of the pod. If another node becomes available while scheduler is waiting for the lower-priority pod to terminate, the scheduler can schedule the higher-priority pod on that node. As a result, the nominatedNodeName field and nodeName field of the pod specification might be different.
Also, if the scheduler preempts pods on a node and is waiting for termination, and a pod with a higher priority than the pending pod needs to be scheduled, the scheduler can schedule that higher-priority pod instead. In such a case, the scheduler clears the nominatedNodeName of the pending pod, making the pod eligible for another node.
Preemption does not necessarily remove all lower-priority pods from a node. The scheduler can schedule a pending pod by removing a portion of the lower-priority pods.
The scheduler considers a node for pod preemption only if the pending pod can be scheduled on the node.
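For reference, a minimal way to observe this behavior (the pod name is a placeholder) is to read the nominated node from the pending pod's status:

$ oc get pod <pod-name> -o jsonpath='{.status.nominatedNodeName}'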
1.7.2.1. Pod preemption and other scheduler settings
If you enable pod priority and preemption, consider your other scheduler settings:
Pod priority and pod disruption budget
A pod disruption budget specifies the minimum number or percentage of replicas that must be up at a time. If you specify pod disruption budgets, OpenShift Container Platform respects them when preempting pods at a best effort level. The scheduler attempts to preempt pods without violating the pod disruption budget. If no such pods are found, lower-priority pods might be preempted despite their pod disruption budget requirements.
Pod priority and pod affinity
Pod affinity requires a new pod to be scheduled on the same node as other pods with the same label.
If a pending pod has inter-pod affinity with one or more of the lower-priority pods on a node, the scheduler cannot preempt the lower-priority pods without violating the affinity requirements. In this case, the scheduler looks for another node to schedule the pending pod. However, there is no guarantee that the scheduler can find an appropriate node and pending pod might not be scheduled.
To prevent this situation, carefully configure pod affinity with equal-priority pods.
1.7.2.2. Graceful termination of preempted pods
When preempting a pod, the scheduler waits for the pod graceful termination period to expire, allowing the pod to finish working and exit. If the pod does not exit after the period, the scheduler kills the pod. This graceful termination period creates a time gap between the point that the scheduler preempts the pod and the time when the pending pod can be scheduled on the node.
To minimize this gap, configure a small graceful termination period for lower-priority pods.
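A minimal sketch, assuming a low-priority class named low-priority already exists; the names and values here are illustrative only:

apiVersion: v1
kind: Pod
metadata:
  name: low-priority-worker            # illustrative name
spec:
  priorityClassName: low-priority      # assumed, pre-existing low-priority class
  terminationGracePeriodSeconds: 10    # short grace period to shrink the preemption gap
  containers:
  - name: worker
    image: busybox
    command: ["sleep", "3600"]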
1.7.3. Configuring priority and preemption
You apply pod priority and preemption by creating a priority class object and associating pods to the priority using the priorityClassName in your pod specifications.
Sample priority class object
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: high-priority 1
value: 1000000 2
globalDefault: false 3
description: "This priority class should be used for XYZ service pods only." 4
The name of the priority class object.
The priority value of the object.
Optional field that indicates whether this priority class should be used for pods without a priority class name specified. This field is false by default. Only one priority class with globalDefault set to true can exist in the cluster. If there is no priority class with globalDefault:true , the priority of pods with no priority class name is zero. Adding a priority class with globalDefault:true affects only pods created after the priority class is added and does not change the priorities of existing pods.
Optional arbitrary text string that describes which pods developers should use with this priority class.
Procedure
To configure your cluster to use priority and preemption:
- Create one or more priority classes:
- Specify a name and value for the priority.
- Optionally specify the globalDefault field in the priority class and a description.
Sample pod specification with priority class name
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  priorityClassName: high-priority 1
Specify the priority class to use with this pod.
$ oc create -f .yaml
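For example, assuming the definitions above are saved as high-priority.yaml and nginx-pod.yaml (file names are illustrative), you could create both objects and confirm that the admission controller populated the integer priority:

$ oc create -f high-priority.yaml
$ oc create -f nginx-pod.yaml
$ oc get pod nginx -o jsonpath='{.spec.priority}'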
1.7.4. Disabling priority and preemption
You can disable the pod priority and preemption feature.
After the feature is disabled, the existing pods keep their priority fields, but preemption is disabled, and priority fields are ignored. If the feature is disabled, you cannot set a priority class name in new pods.
Critical pods rely on scheduler preemption to be scheduled when a cluster is under resource pressure. For this reason, Red Hat recommends not disabling preemption. DaemonSet pods are scheduled by the DaemonSet controller and not affected by disabling preemption.
Procedure
To disable the preemption for the cluster:
- Edit the Scheduler Operator Custom Resource to add the disablePreemption: true parameter:
$ oc edit scheduler cluster
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  creationTimestamp: '2019-03-12T01:45:02Z'
  generation: 1
  name: example
  resourceVersion: '1882034'
  selfLink: /apis/config.openshift.io/v1/schedulers/example
  uid: 743701e9-4468-11e9-bd34-02a7fe1bf828
spec:
  disablePreemption: true
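To confirm the setting took effect, one option is to read the parameter back from the resource you edited:

$ oc get scheduler cluster -o jsonpath='{.spec.disablePreemption}'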
1.8. Placing pods on specific nodes using node selectors
A node selector specifies a map of key-value pairs. The rules are defined using custom labels on nodes and selectors specified in pods.
For the pod to be eligible to run on a node, the pod must have the indicated key-value pairs as the label on the node.
If you are using node affinity and node selectors in the same pod configuration, see the important considerations below.
1.8.1. Using node selectors to control pod placement
You can use node selector labels on pods to control where the pod is scheduled.
With node selectors, OpenShift Container Platform schedules the pods on nodes that contain matching labels.
You can add labels to a node or MachineConfig, but the labels will not persist if the node or machine goes down. Adding the label to the MachineSet ensures that new nodes or machines will have the label.
To add node selectors to an existing pod, add a node selector to the controlling object for that pod, such as a ReplicaSet, DaemonSet, or StatefulSet. Any existing pods under that controlling object are recreated on a node with a matching label. If you are creating a new pod, you can add the node selector directly to the pod spec.
You cannot add a node selector to an existing scheduled pod.
Prerequisites
If you want to add a node selector to existing pods, determine the controlling object for the pod. For example, the router-default-66d5cf9464-7pwkc pod is controlled by the router-default-66d5cf9464 ReplicaSet:
$ oc describe pod router-default-66d5cf9464-7pwkc

Name:           router-default-66d5cf9464-7pwkc
Namespace:      openshift-ingress
...
Controlled By:  ReplicaSet/router-default-66d5cf9464
The web console lists the controlling object under ownerReferences in the pod YAML:
ownerReferences:
- apiVersion: apps/v1
  kind: ReplicaSet
  name: router-default-66d5cf9464
  uid: d81dd094-da26-11e9-a48a-128e7edf0312
  controller: true
  blockOwnerDeletion: true
Procedure
- Add the desired label to your nodes:
$ oc label <resource> <name> <key>=<value>
For example, to label a node:
$ oc label nodes ip-10-0-142-25.ec2.internal type=user-node region=east
The label is applied to the node:
kind: Node
apiVersion: v1
metadata:
  name: ip-10-0-131-14.ec2.internal
  selfLink: /api/v1/nodes/ip-10-0-131-14.ec2.internal
  uid: 7bc2580a-8b8e-11e9-8e01-021ab4174c74
  resourceVersion: '478704'
  creationTimestamp: '2019-06-10T14:46:08Z'
  labels:
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/zone: us-east-1a
    node.openshift.io/os_version: '4.1'
    node-role.kubernetes.io/worker: ''
    failure-domain.beta.kubernetes.io/region: us-east-1
    node.openshift.io/os_id: rhcos
    beta.kubernetes.io/instance-type: m4.large
    kubernetes.io/hostname: ip-10-0-131-14
    region: east 1
    beta.kubernetes.io/arch: amd64
    type: user-node 2
...
Specify the label(s) you will add to the node.
Alternatively, you can add the label to a MachineSet:
$ oc edit MachineSet abc612-msrtw-worker-us-east-1c
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
...
spec:
  replicas: 2
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: ci-ln-89dz2y2-d5d6b-4995x
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: ci-ln-89dz2y2-d5d6b-4995x-worker-us-east-1a
  template:
    metadata:
      creationTimestamp: null
      labels:
        machine.openshift.io/cluster-api-cluster: ci-ln-89dz2y2-d5d6b-4995x
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: ci-ln-89dz2y2-d5d6b-4995x-worker-us-east-1a
    spec:
      metadata:
        creationTimestamp: null
        labels:
          region: east 1
          type: user-node 2
...
Specify the label(s) you will add to the node.
- To add a node selector to existing and future pods, add a node selector to the controlling object for the pods. For example:
kind: ReplicaSet
...
spec:
  ...
  template:
    metadata:
      creationTimestamp: null
      labels:
        ingresscontroller.operator.openshift.io/deployment-ingresscontroller: default
        pod-template-hash: 66d5cf9464
    spec:
      nodeSelector:
        beta.kubernetes.io/os: linux
        node-role.kubernetes.io/worker: ''
        type: user-node 1
Add the desired node selector.
To add a node selector to a new pod, add the selector directly to the pod specification:

apiVersion: v1
kind: Pod
...
spec:
  nodeSelector:
    <key>: <value>
...
For example:
apiVersion: v1
kind: Pod
...
spec:
  nodeSelector:
    region: east
    type: user-node
If you are using node selectors and node affinity in the same pod configuration, note the following rules (a combined example follows this list):
- If you configure both nodeSelector and nodeAffinity , both conditions must be satisfied for the pod to be scheduled onto a candidate node.
- If you specify multiple nodeSelectorTerms associated with nodeAffinity types, then the pod can be scheduled onto a node if one of the nodeSelectorTerms is satisfied.
- If you specify multiple matchExpressions associated with nodeSelectorTerms , then the pod can be scheduled onto a node only if all matchExpressions are satisfied.
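The following sketch illustrates these rules; the label keys and values are illustrative and must match labels that actually exist on your nodes:

apiVersion: v1
kind: Pod
metadata:
  name: selector-and-affinity          # illustrative name
spec:
  nodeSelector:
    type: user-node                    # must be satisfied
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:             # at least one term must be satisfied
        - matchExpressions:            # all expressions in a term must be satisfied
          - key: region
            operator: In
            values:
            - east
  containers:
  - name: hello-openshift
    image: openshift/hello-openshift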
OpenShift — Basic Concept
Before beginning with the actual setup and deployment of applications, we need to understand some basic terms and concepts used in OpenShift V3.
Containers and Images
Images
Images are the basic building blocks of OpenShift and are formed out of Docker images. Each pod on the OpenShift cluster has its own images running inside it. When we configure a pod, we have an image field whose value is pulled from the registry. This configuration file pulls the image and deploys it on the cluster node.
apiVersion: v1
kind: Pod
metadata:
  name: testing-for-image-pull       # name of the pod (must be a valid DNS label)
spec:
  containers:
  - name: neo4j-server               # name of the container
    image: <image-name>              # image to be pulled
    imagePullPolicy: Always          # image pull policy
    command: ["echo", "SUCCESS"]     # message printed after the image is pulled
In order to pull the image and create a pod from it, run the following command. oc is the client used to communicate with the OpenShift environment after login.
$ oc create -f testing-for-image-pull.yml
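To verify the result (a simple check, not part of the original steps), list the pods and view the log of the new pod, which should contain the echoed message:

$ oc get pods
$ oc logs testing-for-image-pull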
Container
A container gets created when the Docker image gets deployed on the OpenShift cluster. While defining any configuration, we define the container section in the configuration file. One pod can have multiple containers running inside it, and all the containers running on a cluster node are managed by OpenShift's Kubernetes layer.
spec:
  containers:
  - name: py                         # name of the container
    image: python                    # image deployed in the container
    command: ["python", "SUCCESS"]
  restartPolicy: Never               # restart policy of the pod
Following is the specification for defining a pod with multiple containers running inside it.
apiVersion: v1
kind: Pod
metadata:
  name: tomcat
spec:
  containers:
  - name: tomcat
    image: tomcat:8.0
    ports:
    - containerPort: 7500
    imagePullPolicy: Always
  - name: database
    image: mongo
    ports:
    - containerPort: 7501
    imagePullPolicy: Always
In the above configuration, we have defined a multi-container pod with two containers, one running Tomcat and one running MongoDB.
Pods and Services
Pods
A pod can be defined as a collection of containers and their storage inside a node of an OpenShift (Kubernetes) cluster. In general, there are two types of pods, ranging from a single-container pod to a multi-container pod.
Single Container Pod − These can be easily created with the oc command or with a basic configuration YAML file.
$ oc run <pod-name> --image=<image-name>
Create it with a simple yaml file as follows.
apiVersion: v1
kind: Pod
metadata:
  name: apache
spec:
  containers:
  - name: apache
    image: apache:8.0
    ports:
    - containerPort: 7500
    imagePullPolicy: Always
Once the above file is created, it will generate a pod with the following command.
$ oc create -f apache.yml
Multi-Container Pod − Multi-container pods are those in which more than one container runs inside the same pod. They are created using YAML files as follows.
apiVersion: v1
kind: Pod
metadata:
  name: tomcat
spec:
  containers:
  - name: tomcat
    image: tomcat:8.0
    ports:
    - containerPort: 7500
    imagePullPolicy: Always
  - name: database
    image: mongo
    ports:
    - containerPort: 7501
    imagePullPolicy: Always
After creating these files, we can simply use the same method as above to create a container.
Service − As we have a set of containers running inside a pod, in the same way we have a service that can be defined as a logical set of pods. It is an abstracted layer on top of the pods, which provides a single IP and DNS name through which the pods can be accessed. A service helps in managing the load balancing configuration and in scaling pods very easily. In OpenShift, a service is a REST object whose definition can be posted to the apiService on the OpenShift master to create a new instance.
apiVersion: v1
kind: Service
metadata:
  name: tutorial-point-service
spec:
  ports:
  - port: 8080
    targetPort: 31999
Builds and Streams
Builds
In OpenShift, a build is the process of transforming source code into an image, which then runs as a container. The build process works on a pre-defined strategy for building source code into an image.
The build process supports multiple strategies and sources.
Build Strategies
- Source to Image − This is basically a tool, which helps in building reproducible images. These images are always in a ready stage to run using the Docker run command.
- Docker Build − This is the process in which the images are built using Docker file by running simple Docker build command.
- Custom Build − These are the builds which are used for creating base Docker images.
Build Sources
Git − This source is used when the Git repository is used for building images. The Dockerfile is optional. The configuration from the source code looks like the following.
source: type: "Git" git: uri: "https://github.com/vipin/testing.git" ref: "master" contextDir: "app/dir" dockerfile: "FROM openshift/ruby-22-centos7\nUSER example"
Dockerfile − The Dockerfile is used as an input in the configuration file.
source: type: "Dockerfile" dockerfile: "FROM ubuntu: latest RUN yum install -y httpd"
Image Streams − Image streams are created after pulling the images. The advantage of an image stream is that it looks for updates to new versions of an image. It can be used to compare any number of Docker-formatted container images identified by tags.
Image streams can automatically perform an action when a new image is created. All builds and deployments can watch for an image action and act accordingly. Following is how we define an image stream.
apiVersion: v1
kind: ImageStream
metadata:
  annotations:
    openshift.io/generated-by: OpenShiftNewApp
  generation: 1
  labels:
    app: ruby-sample-build
  selfLink: /oapi/v1/namespaces/test/imagestreams/origin-ruby-sample
  uid: ee2b9405-c68c-11e5-8a99-525400f25e34
spec: {}
status:
  dockerImageRepository: 172.30.56.218:5000/test/origin-ruby-sample
  tags:
  - items:
    - created: 2016-01-29T13:40:11Z
      dockerImageReference: 172.30.56.218:5000/test/origin-apache-sample
      generation: 1
      image: vklnld908.int.clsa.com/vipin/test
    tag: latest
Routes and Templates
Routes
In OpenShift, routing is a method of exposing a service to the external world by creating and configuring an externally reachable hostname. Routes and endpoints are used to expose the service to the external world, from where users can use DNS name resolution to access the defined application.
In OpenShift, routes are created by using routers which are deployed by the OpenShift administrator on the cluster. Routers are used to bind HTTP (80) and HTTPS (443) ports to external applications.
Following are the different kinds of protocol supported by routes −
- HTTP
- HTTPS
- TLS and WebSocket
When configuring the service, selectors are used to configure the service and find the endpoint using that service. Following is an example of how we create a service and the routing for that service by using an appropriate protocol.
< "kind": "Service", "apiVersion": "v1", "metadata": , "spec": < "selector": , "ports": [ < "protocol": "TCP", "port": 8888, "targetPort": 8080 >] > >
Next, run the following command and the service is created.
$ oc create -f ~/training/content/Openshift-Rservice.json
This is how the service looks like after creation.
$ oc describe service Openshift-Rservice

Name:              Openshift-Rservice
Labels:            <none>
Selector:          name=RService-openshift
Type:              ClusterIP
IP:                172.30.42.80
Port:              8080/TCP
Endpoints:         <none>
Session Affinity:  None
No events.
Create a routing for service using the following code.
< "kind": "Route", "apiVersion": "v1", "metadata": , "spec": < "host": "hello-openshift.cloudapps.example.com", "to": < "kind": "Service", "name": "OpenShift-route-service" >, "tls": > >
When OC command is used to create a route, a new instance of route resource is created.
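Alternatively (a sketch outside the original walkthrough; the values in angle brackets are placeholders), a route can be generated directly from an existing service with oc expose:

$ oc expose service <service-name> --hostname=<externally-reachable-hostname>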
Templates
Templates are defined as a standard object in OpenShift which can be used multiple times. They are parameterized with a list of placeholders which are used to create multiple objects. A template can be used to create anything, from a pod to networking, that the user has authorization to create. A list of objects can be created if the template is uploaded to the project from the CLI or the web console.
apiVersion: v1
kind: Template
metadata:
  name: <template-name>
  annotations:
    description: <description>
    iconClass: "icon-redis"
    tags: <tags>
objects:
- apiVersion: v1
  kind: Pod
  metadata:
    name: <pod-name>
  spec:
    containers:
    - image: <image-name>
      name: master
      ports:
      - containerPort: <port-number>
        protocol: <protocol>
labels:
  redis: <value>
Authentication and Authorization
Authentication
In OpenShift, while configuring the master and client structure, the master comes with a built-in OAuth server. The OAuth server is used for generating tokens, which are used for authentication to the API. Since OAuth comes as a default setup for the master, the Allow All identity provider is used by default. Different identity providers are available and can be configured at /etc/openshift/master/master-config.yaml.
There are different types of identity providers present in OAuth.
- Allow All
- Deny All
- HTPasswd
- LDAP
- Basic Authentication
Allow All
oauthConfig:
  identityProviders:
  - name: my_allow_provider
    challenge: true
    login: true
    provider:
      apiVersion: v1
      kind: AllowAllPasswordIdentityProvider
Deny All
oauthConfig:
  identityProviders:
  - name: my_deny_provider
    challenge: true
    login: true
    provider:
      apiVersion: v1
      kind: DenyAllPasswordIdentityProvider
HTPasswd
In order to use HTPasswd, we first need to set up httpd-tools on the master machine and then configure the provider in the same way as we did for the others.
identityProviders:
- name: my_htpasswd_provider
  challenge: true
  login: true
  provider:
    apiVersion: v1
    kind: HTPasswdPasswordIdentityProvider
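For reference, a sketch of preparing the password file with the htpasswd utility from httpd-tools; the file path and user names are illustrative, and the provider configuration typically references this file through its file attribute (not shown above):

# create the file and add the first user (prompts for a password)
$ htpasswd -c /etc/origin/master/users.htpasswd user1
# add another user to the existing file
$ htpasswd /etc/origin/master/users.htpasswd user2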
Authorization
Authorization is a feature of the OpenShift master that is used to validate a user. It checks the user who is trying to perform an action to see whether the user is authorized to perform that action on a given project. This helps the administrator control access to projects.
Authorization policies are controlled using rules, roles, and bindings, and evaluation of authorization is done using the identity of the requester, the action being requested, and the bindings. Policies exist at two levels −
- Cluster policy
- Local policy
How does OpenShift load balancing across multiple pods work?
In OpenShift I have a project with one Spring Boot application. When I send an HTTP request, the Spring Boot application logs a message. I can see this message in the log of the pod in OpenShift. When I scale the pod to two and send the same HTTP request, only one of the pods receives this request and logs it. This also happens when I send 1000 requests using JMeter: only one pod receives all requests. Doesn't OpenShift balance the load? I tried sending multiple requests and expected them to be balanced over the 2 pods.
How are you actually connecting to the container? Is JMeter running inside the cluster, via a LoadBalancer-type Service, another way?
Answer
Pod traffic load balancing is not an OpenShift-specific mechanism. It is core Kubernetes functionality, handled by Service objects. Consider Services as your network path into pods. The Service handles load balancing traffic to multiple pods. Services balance network traffic within the cluster, and when you need to expose this traffic externally you rely on Routes (OpenShift-specific) or Ingresses (Kubernetes-generic).
Without knowing more specifics about how JMeter is accessing the application hosted in the pods, it's difficult to give a solid answer. However, there are several features of Services and Routes that address pod network load balancing.
There are various Route-specific annotations that can control network "stickiness". Probably the two most relevant ones are haproxy.router.openshift.io/balance and haproxy.router.openshift.io/disable_cookies. There are also Service session affinity options.
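As a sketch (object names are illustrative, and the values shown are only one possible combination), the Route annotations and the Service session affinity setting look like this:

apiVersion: v1
kind: Route
metadata:
  name: my-route                                        # illustrative name
  annotations:
    haproxy.router.openshift.io/balance: roundrobin     # round-robin instead of source-based stickiness
    haproxy.router.openshift.io/disable_cookies: 'true' # do not set a sticky session cookie
spec:
  to:
    kind: Service
    name: my-service
---
apiVersion: v1
kind: Service
metadata:
  name: my-service                                      # illustrative name
spec:
  sessionAffinity: None        # or ClientIP for client-IP stickiness
  selector:
    app: my-app                # illustrative selector
  ports:
  - port: 8080
    targetPort: 8080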
All of these components work together to get network traffic into your application. Only you know the exact path your traffic follows, so only you can adjust the configuration accordingly.
Pods and Services
OpenShift Online leverages the Kubernetes concept of a pod, which is one or more containers deployed together on one host, and the smallest compute unit that can be defined, deployed, and managed.
Pods are the rough equivalent of a machine instance (physical or virtual) to a container. Each pod is allocated its own internal IP address, therefore owning its entire port space, and containers within pods can share their local storage and networking.
Pods have a lifecycle; they are defined, then they are assigned to run on a node, then they run until their container(s) exit or they are removed for some other reason. Pods, depending on policy and exit code, may be removed after exiting, or may be retained in order to enable access to the logs of their containers.
OpenShift Online treats pods as largely immutable; changes cannot be made to a pod definition while it is running. OpenShift Online implements changes by terminating an existing pod and recreating it with modified configuration, base image(s), or both. Pods are also treated as expendable, and do not maintain state when recreated. Therefore pods should usually be managed by higher-level controllers, rather than directly by users.
Bare pods that are not managed by a replication controller will not be rescheduled upon node disruption.
Below is an example definition of a pod that provides a long-running service, which is actually a part of the OpenShift Online infrastructure: the integrated container image registry. It demonstrates many features of pods, most of which are discussed in other topics and thus only briefly mentioned here:
Pod Object Definition (YAML)
apiVersion: v1
kind: Pod
metadata:
  annotations: { ... }
  labels:                            (1)
    deployment: docker-registry-1
    deploymentconfig: docker-registry
    docker-registry: default
  generateName: docker-registry-1-   (2)
spec:
  containers:                        (3)
  - env:                             (4)
    - name: OPENSHIFT_CA_DATA
      value: ...
    - name: OPENSHIFT_CERT_DATA
      value: ...
    - name: OPENSHIFT_INSECURE
      value: "false"
    - name: OPENSHIFT_KEY_DATA
      value: ...
    - name: OPENSHIFT_MASTER
      value: https://master.example.com:8443
    image: openshift/origin-docker-registry:v0.6.2   (5)
    imagePullPolicy: IfNotPresent
    name: registry
    ports:                           (6)
    - containerPort: 5000
      protocol: TCP
    resources: {}
    securityContext: { ... }         (7)
    volumeMounts:                    (8)
    - mountPath: /registry
      name: registry-storage
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-br6yz
      readOnly: true
  dnsPolicy: ClusterFirst
  imagePullSecrets:
  - name: default-dockercfg-at06w
  restartPolicy: Always              (9)
  serviceAccount: default            (10)
  volumes:                           (11)
  - emptyDir: {}
    name: registry-storage
  - name: default-token-br6yz
    secret:
      secretName: default-token-br6yz
1 | Pods can be "tagged" with one or more labels, which can then be used to select and manage groups of pods in a single operation. The labels are stored in key/value format in the metadata hash. One label in this example is docker-registry=default. |
2 | Pods must have a unique name within their namespace. A pod definition may specify the basis of a name with the generateName attribute, and random characters will be added automatically to generate a unique name. |
3 | containers specifies an array of container definitions; in this case (as with most), just one. |
4 | Environment variables can be specified to pass necessary values to each container. |
5 | Each container in the pod is instantiated from its own Docker-formatted container image. |
6 | The container can bind to ports which will be made available on the pod’s IP. |
7 | OpenShift Online defines a security context for containers which specifies whether they are allowed to run as privileged containers, run as a user of their choice, and more. The default context is very restrictive but administrators can modify this as needed. |
8 | The container specifies where external storage volumes should be mounted within the container. In this case, there is a volume for storing the registry’s data, and one for access to credentials the registry needs for making requests against the OpenShift Online API. |
9 | The pod restart policy with possible values Always , OnFailure , and Never . The default value is Always . |
10 | Pods making requests against the OpenShift Online API is a common enough pattern that there is a serviceAccount field for specifying which service account user the pod should authenticate as when making the requests. This enables fine-grained access control for custom infrastructure components. |
11 | The pod defines storage volumes that are available to its container(s) to use. In this case, it provides an ephemeral volume for the registry storage and a secret volume containing the service account credentials. |
This pod definition does not include attributes that are filled by OpenShift Online automatically after the pod is created and its lifecycle begins. The Kubernetes pod documentation has details about the functionality and purpose of pods.
Pod Restart Policy
A pod restart policy determines how OpenShift Online responds when containers in that pod exit. The policy applies to all containers in that pod.
The possible values are:
- Always — Tries restarting a successfully exited container on the pod continuously, with an exponential back-off delay (10s, 20s, 40s) until the pod is restarted. The default is Always .
- OnFailure — Tries restarting a failed container on the pod with an exponential back-off delay (10s, 20s, 40s) capped at 5 minutes.
- Never — Does not try to restart exited or failed containers on the pod. Pods immediately fail and exit.
Once bound to a node, a pod will never be bound to another node. This means that a controller is necessary in order for a pod to survive node failure:
Condition | Controller Type | Restart Policy |
Pods that are expected to terminate (such as batch computations) | Job | OnFailure or Never |
Pods that are expected to not terminate (such as web servers) | Replication controller | Always |
Pods that need to run one-per-machine | Daemon set | Any |
If a container on a pod fails and the restart policy is set to OnFailure , the pod stays on the node and the container is restarted. If you do not want the container to restart, use a restart policy of Never .
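A minimal sketch (the name and command are illustrative) of a pod that is restarted only when its container fails:

apiVersion: v1
kind: Pod
metadata:
  name: onfailure-example             # illustrative name
spec:
  restartPolicy: OnFailure            # restart only on a non-zero exit
  containers:
  - name: task
    image: busybox
    command: ["sh", "-c", "exit 1"]   # simulates a failing container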
If an entire pod fails, OpenShift Online starts a new pod. Developers need to address the possibility that applications might be restarted in a new pod. In particular, applications need to handle temporary files, locks, incomplete output, and so forth caused by previous runs.
Kubernetes architecture expects reliable endpoints from cloud providers. When a cloud provider is down, the kubelet prevents OpenShift Online from restarting.
If the underlying cloud provider endpoints are not reliable, do not install a cluster using cloud provider integration. Install the cluster as if it was in a no-cloud environment. It is not recommended to toggle cloud provider integration on or off in an installed cluster.
For details on how OpenShift Online uses restart policy with failed containers, see the Example States in the Kubernetes documentation.
Services
A Kubernetes service serves as an internal load balancer. It identifies a set of replicated pods in order to proxy the connections it receives to them. Backing pods can be added to or removed from a service arbitrarily while the service remains consistently available, enabling anything that depends on the service to refer to it at a consistent address. The default service clusterIP addresses are from the OpenShift Online internal network and they are used to permit pods to access each other.
Services are assigned an IP address and port pair that, when accessed, proxy to an appropriate backing pod. A service uses a label selector to find all the containers running that provide a certain network service on a certain port.
Like pods, services are REST objects. The following example shows the definition of a service for the pod defined above:
Service Object Definition (YAML)
apiVersion: v1
kind: Service
metadata:
  name: docker-registry      (1)
spec:
  selector:                  (2)
    docker-registry: default
  clusterIP: 172.30.136.123  (3)
  ports:
  - nodePort: 0
    port: 5000               (4)
    protocol: TCP
    targetPort: 5000         (5)
1 | The service name docker-registry is also used to construct an environment variable with the service IP that is inserted into other pods in the same namespace. The maximum name length is 63 characters. |
2 | The label selector identifies all pods with the docker-registry=default label attached as its backing pods. |
3 | Virtual IP of the service, allocated automatically at creation from a pool of internal IPs. |
4 | Port the service listens on. |
5 | Port on the backing pods to which the service forwards connections. |
The Kubernetes documentation has more information on services.
Service Proxy
OpenShift Online has an iptables-based implementation of the service-routing infrastructure. It uses probabilistic iptables rewriting rules to distribute incoming service connections between the endpoint pods. It also requires that all endpoints are always able to accept connections.
Headless services
If your application does not need load balancing or single-service IP addresses, you can create a headless service. When you create a headless service, no load-balancing or proxying is done and no cluster IP is allocated for this service. For such services, DNS is automatically configured depending on whether the service has selectors defined or not.
Services with selectors: For headless services that define selectors, the endpoints controller creates Endpoints records in the API and modifies the DNS configuration to return A records (addresses) that point directly to the pods backing the service.
Services without selectors: For headless services that do not define selectors, the endpoints controller does not create Endpoints records. However, the DNS system looks for and configures the following records:
- For ExternalName type services, CNAME records.
- For all other service types, A records for any endpoints that share a name with the service.
Creating a headless service
Creating a headless service is similar to creating a standard service, but you do not declare the ClusterIP address. To create a headless service, add the clusterIP: None parameter value to the service YAML definition.
You can use a headless service, for example, for a group of pods that you want to be part of the same cluster or service.
List of Pods
$ oc get pods -o wide
Example Output
NAME               READY   STATUS    RESTARTS   AGE   IP           NODE
frontend-1-287hw   1/1     Running   0          7m    172.17.0.3   node_1
frontend-1-68km5   1/1     Running   0          7m    172.17.0.6   node_1
You can define the headless service as:
Headless Service Definition
apiVersion: v1
kind: Service
metadata:
  labels:
    app: ruby-helloworld-sample
    template: application-template-stibuild
  name: frontend-headless (1)
spec:
  clusterIP: None (2)
  ports:
  - name: web
    port: 5432
    protocol: TCP
    targetPort: 8080
  selector:
    name: frontend (3)
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
1 | Name of the headless service. |
2 | Setting clusterIP variable to None declares a headless service. |
3 | Selects all pods that have the frontend label. |
Also, a headless service does not have any IP address of its own.
$ oc get svc
Example Output
NAME                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
frontend            ClusterIP   172.30.232.77   <none>        5432/TCP   12m
frontend-headless   ClusterIP   None            <none>        5432/TCP   10m
Endpoint discovery by using a headless service
The benefit of using a headless service is that you can discover a pod’s IP address directly. Standard services act as load balancer or proxy and give access to the workload object by using the service name. With headless services, the service name resolves to the set of IP addresses of the pods that are grouped by the service.
When you look up the DNS A record for a standard service, you get the load-balanced IP of the service.
$ dig frontend.test A +search +short