
Post Installation OCP on AWS (part 2)
Configuring OpenShift Container Platform after deployment on AWS. This guide covers post-install checks, access setup, and best practices to get your OCP cluster ready for real workloads.
Introduction
In part one, we successfully created our OpenShift Container Platform (OCP) cluster using the Installer-Provisioned Infrastructure (IPI) method. This approach provides a streamlined and automated way to deploy OpenShift, allowing the installer to handle most of the infrastructure setup, configuration, and integration with the chosen cloud provider. By using IPI, we can focus more on exploring and building upon the cluster rather than spending time on manual provisioning. With the cluster now up and running, we’re ready to move forward and dive deeper into configuring and optimizing our environment for real-world use cases.
Topology
This topology is the basis for the detailed post-installation steps that will be applied to the cluster. For this setup, I’ll be referencing OpenShift Container Platform version 4.18. There are notable differences between OCP 4.16 and earlier compared to 4.17 and later, especially in terms of supported technologies and architectural improvements. In versions 4.16 and below, clusters could still run the legacy OpenShift SDN network plugin, which, while functional, had limited scalability and flexibility compared to the newer OVN-Kubernetes model. From 4.17 onward, OpenShift SDN is no longer supported and OVN-Kubernetes is the sole network plugin, offering better performance, enhanced security, and improved network policy management aligned with upstream Kubernetes standards.
Another major change involves the logging subsystem. In older versions (≤4.16), OpenShift commonly used the Elasticsearch, Fluentd, and Kibana (EFK) stack for centralized logging. Beginning with version 4.17 (including 4.18), OpenShift transitioned to LokiStack, based on Grafana Loki, as the default logging solution. Loki provides a more lightweight, scalable, and cost-efficient approach to storing and querying logs. Elasticsearch can still be integrated, but only as an external destination that logs are forwarded to, rather than as the main storage and visualization engine. In summary, OpenShift 4.18 embodies a more modern, cloud-native direction, offering improved network performance, simplified management, and an updated logging architecture that better aligns with the broader Kubernetes ecosystem.

Prerequisites
- CLI access to the OCP cluster
- OCP web console access
- AWS console access
Post Installation
A. Add Admin User
Creating a dedicated admin user is a standard post-installation step for every OCP cluster. This user serves as the primary cluster administrator, because the default 'kubeadmin' user is temporary and should be removed according to best practices.
- Create the local htpasswd identity provider

sudo apt install apache2-utils -y
htpasswd -c -B -b user.htpasswd adminocp P@ssw0rd
htpasswd -B -b user.htpasswd <another_user> <password>
- Create the secret

# Delete the old identity provider secret if one already exists
oc delete secret htpass-secret -n openshift-config

# Skip the delete above if this is your first-time configuration
oc create secret generic htpass-secret --from-file htpasswd=user.htpasswd -n openshift-config
- Edit the OAuth resource

# Reference the cluster OAuth config and add the identity provider
oc get oauth cluster -o yaml > oauth.yaml
vim oauth.yaml

spec:
  identityProviders:
  - htpasswd:
      fileData:
        name: htpass-secret
    mappingMethod: claim
    type: HTPasswd
    name: local-users

# Apply the change
oc replace -f oauth.yaml
- Add the cluster-admin role to user adminocp

# If you created multiple users, apply this for every user
oc adm policy add-cluster-role-to-user cluster-admin adminocp
oc adm policy add-cluster-role-to-user cluster-admin <another_user>
- Verify login as user adminocp, or log in via the internal API from the CLI
oc login -u adminocp -p <your_created_password>
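
Before removing kubeadmin, it is worth double-checking that the new user really has cluster-admin rights. A minimal sketch of that check; only run the kubeadmin cleanup once the htpasswd login above works:

oc whoami
# Should print "yes" for a cluster-admin
oc auth can-i '*' '*'
# Optional cleanup once the new admin is confirmed working:
oc delete secret kubeadmin -n kube-system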
B. Machine Config Pool
A Machine Config Pool (MCP) groups the infrastructure node types that we will create later. As explained in the part 1 topology, we will use three types of infrastructure nodes: Router, Logging, and Monreg (monitoring and registry). Each infrastructure pool uses the worker pool as its configuration template.
Repeat this step for each of the following roles:
- router
- logging
- monreg
- Dashboard Menu → Compute → MachineConfigPools → Create MachineConfigPool

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: router #CHANGE THIS#
spec:
  machineConfigSelector:
    matchExpressions:
    - key: machineconfiguration.openshift.io/role
      operator: In
      values:
      - worker
      - router #CHANGE THIS#
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/router: "" #CHANGE THIS#
- Add the node-type=worker label to worker nodes (used by the project template later)

for node in $(oc get nodes -l node-role.kubernetes.io/worker --no-headers -o custom-columns=":metadata.name"); do
  oc label node $node node-type=worker --overwrite
done
- Verify the Machine Config Pools

💡 The other machine config pools will appear once the configuration has been applied correctly.
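
If you prefer the CLI over the console, the same verification can be done with a quick sketch like this (pool names assume the router, logging, and monreg pools created above):

oc get mcp
# Each custom pool should appear alongside master and worker,
# and report UPDATED=True once its configuration is rendered
oc get mcp router logging monreg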
- Verify the node-type label on the worker nodes
oc get node -L node-type
C. Machine Sets
MachineSets in AWS OpenShift are used to automatically manage the configuration and deployment of EC2 instances. These instances are typically distributed across multiple AWS Availability Zones (AZs) based on the defined subnets to ensure high availability and fault tolerance. By default, MachineSets manage worker nodes. In a multi-zone setup, for example the ap-southeast-1 (Singapore) region with three AZs (a, b, and c), a separate MachineSet should be defined for each zone. In this case, however, we will define MachineSets not only per zone but also per node type, separating those used for infra nodes from those used for application workloads (worker nodes).
🚨 Before configuring the MachineSets, you need to collect the following values from your cluster. Run these commands from the OCP bastion.
Example:
- How to get the infrastructure ID

<infrastructure_id> = ocp-dev-rdscr

oc get -o jsonpath='{.status.infrastructureName}{"\n"}' infrastructure cluster
- How to get the AMI ID

<ami_id> = ami-0e22aa720418e43f5

oc -n openshift-machine-api \
  -o jsonpath='{.spec.template.spec.providerSpec.value.ami.id}{"\n"}' \
  get machineset/<infrastructure_id>-worker-ap-southeast-1a
- Dashboard Menu → Compute → MachineSets → Create MachineSet

💡 Repeat this step for the worker nodes and each infra node type listed in the table below.
| Node Name | Zone A | Zone B | Zone C | Machine Spec |
|-----------|--------|--------|--------|--------------|
| worker | ap-southeast-1a | ap-southeast-1b | ap-southeast-1c | m5.4xlarge |
| logging | ap-southeast-1a | ap-southeast-1b | ap-southeast-1c | m5.2xlarge |
| monreg | ap-southeast-1a | ap-southeast-1b | - | m5.4xlarge |
| router | ap-southeast-1a | - | ap-southeast-1c | m5.xlarge |

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: <infrastructure_id>-monreg-ap-southeast-1a #CHANGE THIS#
  namespace: openshift-machine-api
  labels:
    machine.openshift.io/cluster-api-cluster: <infrastructure_id> #CHANGE THIS#
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: <infrastructure_id> #CHANGE THIS#
      machine.openshift.io/cluster-api-machineset: <infrastructure_id>-monreg-ap-southeast-1a #CHANGE THIS#
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: <infrastructure_id> #CHANGE THIS#
        machine.openshift.io/cluster-api-machine-role: monreg #CHANGE THIS#
        machine.openshift.io/cluster-api-machine-type: monreg #CHANGE THIS#
        machine.openshift.io/cluster-api-machineset: <infrastructure_id>-monreg-ap-southeast-1a #CHANGE THIS#
    spec:
      metadata:
        labels:
          node-role.kubernetes.io/monreg: "" #CHANGE THIS#
      providerSpec:
        value:
          apiVersion: machine.openshift.io/v1beta1
          kind: AWSMachineProviderConfig
          ami:
            id: <ami_id> #CHANGE THIS#
          instanceType: m5.4xlarge #CHANGE THIS#
          iamInstanceProfile:
            id: <infrastructure_id>-worker-profile #CHANGE THIS#
          placement:
            availabilityZone: ap-southeast-1a #CHANGE THIS#
            region: ap-southeast-1 #CHANGE THIS#
          securityGroups:
          - filters:
            - name: tag:Name
              values:
              - <infrastructure_id>-node
          - filters:
            - name: tag:Name
              values:
              - <infrastructure_id>-lb
          subnet:
            filters:
            - name: tag:Name
              values:
              - <cluster_name>-subnet-private1-ap-southeast-1a #CHANGE THIS#
          blockDevices:
          - ebs:
              volumeSize: 100
              volumeType: gp2
              iops: 4000
          credentialsSecret:
            name: aws-cloud-credentials
          userDataSecret:
            name: worker-user-data
          tags:
          - name: kubernetes.io/cluster/<infrastructure_id> #CHANGE THIS#
            value: owned
- Verify the MachineSets
You can verify the MachineSet instances by looking in the Machines section.

In the picture above, when the logging node is being created the phase is set to Provisioned. This means your MachineSet was created successfully and is waiting for the AWS instance to come fully up.
Below are the successfully created infra nodes. The private IP addresses are assigned and the nodes automatically join our cluster. Pretty easy, right?
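
The same information is available from the CLI if you prefer it over the console; a quick sketch (resource names follow the MachineSet naming used above):

oc get machinesets -n openshift-machine-api
oc get machines -n openshift-machine-api -o wide
# Nodes appear here once the corresponding machines reach the Running phase
oc get nodes -l node-role.kubernetes.io/monreg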


If you encounter an error with the subnet ID, check the MachineSet YAML and match the subnet name against the one shown in the AWS console.
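
One way to compare the two values without clicking through the console is sketched below; it assumes the aws CLI is configured for the cluster's account, and the MachineSet name is only an example:

# Subnet name filter used by the MachineSet
oc -n openshift-machine-api get machineset <infrastructure_id>-monreg-ap-southeast-1a \
  -o jsonpath='{.spec.template.spec.providerSpec.value.subnet.filters[0].values[0]}{"\n"}'

# Subnet Name tags that actually exist in the region
aws ec2 describe-subnets --region ap-southeast-1 \
  --query 'Subnets[].Tags[?Key==`Name`].Value' --output text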

D. Project Template
💡Project templates are used to provision projects with predefined settings. In this case, we add node selectors so that when an application project is created, workloads are automatically scheduled on designated worker nodes. To ensure this works correctly, make sure that appropriate node labels have been applied to both worker and infra nodes. Additionally, you can disable self-provisioning of projects for regular users if needed.
- Create the project template

# Create a folder for the YAML files
mkdir yaml_collections
oc adm create-bootstrap-project-template -o yaml > yaml_collections/template.yaml
vim yaml_collections/template.yaml

apiVersion: template.openshift.io/v1
kind: Template
metadata:
  creationTimestamp: null
  name: project-request
objects:
- apiVersion: project.openshift.io/v1
  kind: Project
  metadata:
    annotations:
      openshift.io/description: ${PROJECT_DESCRIPTION}
      openshift.io/display-name: ${PROJECT_DISPLAYNAME}
      openshift.io/requester: ${PROJECT_REQUESTING_USER}
      openshift.io/node-selector: "node-type=worker" #ADD THIS#
    creationTimestamp: null
    name: ${PROJECT_NAME}
  spec: {}
  status: {}
- apiVersion: rbac.authorization.k8s.io/v1
  kind: RoleBinding
  metadata:
    creationTimestamp: null
    name: admin
    namespace: ${PROJECT_NAME}
  roleRef:
    apiGroup: rbac.authorization.k8s.io
    kind: ClusterRole
    name: admin
  subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: ${PROJECT_ADMIN_USER}
parameters:
- name: PROJECT_NAME
- name: PROJECT_DISPLAYNAME
- name: PROJECT_DESCRIPTION
- name: PROJECT_ADMIN_USER
- name: PROJECT_REQUESTING_USER

oc apply -f yaml_collections/template.yaml -n openshift-config
- Edit the project config

oc edit project.config.openshift.io/cluster

spec:
  projectRequestMessage: Contact Cluster System Administrator to Create Project
  projectRequestTemplate:
    name: project-request
- Disable self-provisioning of projects for basic users

oc adm policy remove-cluster-role-from-group self-provisioner system:authenticated:oauth

# Disable automatic updates to the cluster role
oc patch clusterrolebinding.rbac self-provisioners -p '{ "metadata": { "annotations": { "rbac.authorization.kubernetes.io/autoupdate": "false" } } }'
- Wait for the pods in the openshift-apiserver namespace to roll out (new pods with a fresh AGE)

watch oc get pod -n openshift-apiserver
- Verify the project template: self-provisioning should be blocked for basic users

oc new-project test-project --as=<user_with_basic_role> --as-group=system:authenticated --as-group=system:authenticated:oauth

Expected output:

How to re-enable self-provisioning of projects:

oc adm policy add-cluster-role-to-group self-provisioner system:authenticated:oauth
oc patch clusterrolebinding.rbac self-provisioners -p '{ "metadata": { "annotations": { "rbac.authorization.kubernetes.io/autoupdate": "true" } } }'
- Verify the project template applies the worker node selector
Verify that a project created by a basic user from the template gets the node selector:

# a. Check node roles
oc get node -o wide

# b. Switch projects to test an app workload
oc login -u <basic-user> -p <user-password>
oc new-project test-project

# c. Create a test app
oc new-app https://github.com/rgerardi/hellogo.git

# d. Scale the pod replicas to make sure the app only runs on worker nodes
oc scale deployment/hellogo --replicas=10

Expected output:
- Check node roles

- Switch projects to test the app workload

- Create the test app

- Scale the app and verify it runs only on worker nodes
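
If you also want to confirm this from the CLI, a quick sketch: every hellogo pod should be scheduled on a node labeled node-type=worker.

# Pods with the node each was scheduled on
oc get pods -n test-project -o wide
# Cross-check that those nodes carry the node-type=worker label
oc get nodes -L node-type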


Delete the test project:

oc login -u adminocp -p <password-admin>
oc delete project test-project
oc delete user <basic_user>
E. Garbage Collection
Garbage collection is used to automatically control node resource usage by removing unused images and by limiting the resource usage of pods scheduled on worker nodes. Image garbage collection cleans up images that are not being used by any running pods.
When limiting resources with Garbage Collection, there are two main parameters: soft eviction and hard eviction. Soft eviction works with a specified grace period, while hard eviction immediately triggers deletion once the threshold is reached, with no waiting period. This mechanism helps prevent resource oscillation. Additionally, Garbage Collection can be configured to prune images based on specific disk usage percentages.
- Create a file named garbage-collection.yaml

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-kubeconfig
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    # Config for node resources
    evictionSoft:
      memory.available: "500Mi"
      nodefs.available: "10%"
      nodefs.inodesFree: "5%"
      imagefs.available: "15%"
      imagefs.inodesFree: "10%"
    evictionSoftGracePeriod:
      memory.available: "1m30s"
      nodefs.available: "1m30s"
      nodefs.inodesFree: "1m30s"
      imagefs.available: "1m30s"
      imagefs.inodesFree: "1m30s"
    evictionHard:
      memory.available: "200Mi"
      nodefs.available: "5%"
      nodefs.inodesFree: "4%"
      imagefs.available: "10%"
      imagefs.inodesFree: "5%"
    evictionPressureTransitionPeriod: 3m
    # Config for unused images
    imageMinimumGCAge: 5m
    imageGCHighThresholdPercent: 75
    imageGCLowThresholdPercent: 60
- Apply the configuration

oc apply -f garbage-collection.yaml
- Verify the config in the machine config pool
oc get mcp
💡 If the UPDATING status changes to True, the garbage collection config is being applied to the worker role. Watch the status until UPDATING changes back to False; this takes roughly 5 minutes, depending on how many worker nodes you have.
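
To confirm the new thresholds actually reached the kubelet, you can inspect the rendered kubelet configuration on one of the workers; a sketch, assuming <worker_node_name> is any node from oc get nodes:

# Open a debug shell on the node and print the eviction settings
oc debug node/<worker_node_name> -- chroot /host cat /etc/kubernetes/kubelet.conf | grep -A6 evictionHard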

F. Automated ETCD backup
The etcd backup is performed to comply with organizational policies, ensuring that all cluster configurations are safely stored. In case of corruption or loss of cluster settings, the etcd backup provides the necessary data to restore the cluster to its previous state.
- Create a project for the etcd backup

oc new-project ocp-etcd-backup --description "Openshift Backup Automation Tool" --display-name "Backup ETCD Automation"
- Set up the Service Account, Cluster Role, and Cluster Role Binding

vim etcd-backup.yaml

# Service Account
kind: ServiceAccount
apiVersion: v1
metadata:
  name: openshift-backup
  namespace: ocp-etcd-backup
  labels:
    app: openshift-backup
---
# Cluster Role
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-etcd-backup
rules:
- apiGroups: [""]
  resources:
  - "nodes"
  verbs: ["get", "list"]
- apiGroups: [""]
  resources:
  - "pods"
  - "pods/log"
  verbs: ["get", "list", "create", "delete", "watch"]
---
# Cluster Role Binding
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: openshift-backup
  labels:
    app: openshift-backup
subjects:
- kind: ServiceAccount
  name: openshift-backup
  namespace: ocp-etcd-backup
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-etcd-backup

oc apply -f etcd-backup.yaml
- Grant the service account the privileged SCC

oc adm policy add-scc-to-user privileged -z openshift-backup
- Create a CronJob file named cronjob-etcd-bkp.yaml

💡 In this configuration, set the schedule according to your organization's backup policy. You can also change the backup directory to another node, or send the backups to the bastion.

kind: CronJob
apiVersion: batch/v1
metadata:
  name: openshift-backup
  namespace: ocp-etcd-backup
  labels:
    app: openshift-backup
spec:
  schedule: "56 23 * * *" #CHANGE THIS#
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 5
  jobTemplate:
    metadata:
      labels:
        app: openshift-backup
    spec:
      backoffLimit: 0
      template:
        metadata:
          labels:
            app: openshift-backup
        spec:
          containers:
          - name: backup
            image: "registry.redhat.io/openshift4/ose-cli"
            command:
            - "/bin/bash"
            - "-c"
            - oc get no -l node-role.kubernetes.io/master --no-headers -o name | xargs -I {} -- oc debug {} --to-namespace=ocp-etcd-backup -- bash -c 'chroot /host sudo -E /usr/local/bin/cluster-backup.sh /home/core/backup/ && chroot /host sudo -E find /home/core/backup/ -type f -mmin +"1" -delete'
          restartPolicy: "Never"
          terminationGracePeriodSeconds: 30
          activeDeadlineSeconds: 500
          dnsPolicy: "ClusterFirst"
          serviceAccountName: "openshift-backup"
          serviceAccount: "openshift-backup"
- Verify

💡 To verify that the scheduled etcd backup runs, we can trigger the job immediately so it creates an etcd backup file.

oc create job backup --from=cronjob/openshift-backup

Check that the pod is running in the ocp-etcd-backup namespace, as shown below.


Check inside the master node:
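
A quick way to perform this check from the bastion without opening an SSH session is sketched below (replace <master_node_name> with one of your control plane nodes):

# List the backup files produced by cluster-backup.sh
oc debug node/<master_node_name> -- chroot /host ls -lh /home/core/backup/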

After backing up the etcd cluster database, we can retain the etcd data by storing it on a persistent volume within the cluster. This approach ensures that, in the event of an etcd-related issue in the future, the etcd database can be restored reliably.
Node Infra Configuration
In this section, we will configure the infrastructure components for routing, monitoring, and the container registry so that they run on the infrastructure nodes created previously. For the image registry, we will use NFS storage provisioned from NetApp to store the image registry volume, so we only need to set up a PVC as the persistent volume for the cluster image registry. This configuration is intended to prevent these workloads from being scheduled on worker or master nodes.
A. Ingress / Router

- Edit the existing IngressController

oc edit ingresscontrollers default -n openshift-ingress-operator
- Add the node selector configuration

kind: IngressController
apiVersion: operator.openshift.io/v1
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  ...output omitted...
  replicas: 2
  nodePlacement:
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/router: ""
  ...output omitted...
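
Once the Ingress Operator reconciles the change, the default router pods should be rescheduled onto the router nodes. A quick check (a sketch):

# Router pods should now be running on nodes with the router role
oc get pods -n openshift-ingress -o wide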
B. Monitoring
- Check whether the ConfigMap cluster-monitoring-config already exists, then create the file

oc get configmap cluster-monitoring-config -n openshift-monitoring
vim cm-openshift-monitoring.yaml
- Create the ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    alertmanagerMain:
      nodeSelector:
        node-role.kubernetes.io/monreg: ""
      volumeClaimTemplate:
        spec:
          storageClassName: gp3-csi
          resources:
            requests:
              storage: 10Gi
    prometheusK8s:
      nodeSelector:
        node-role.kubernetes.io/monreg: ""
      volumeClaimTemplate:
        spec:
          storageClassName: gp3-csi
          resources:
            requests:
              storage: 200Gi
      retention: 7d
      retentionSize: 10GB
    prometheusOperator:
      nodeSelector:
        node-role.kubernetes.io/monreg: ""
    grafana:
      nodeSelector:
        node-role.kubernetes.io/monreg: ""
    k8sPrometheusAdapter:
      nodeSelector:
        node-role.kubernetes.io/monreg: ""
    kubeStateMetrics:
      nodeSelector:
        node-role.kubernetes.io/monreg: ""
    telemeterClient:
      nodeSelector:
        node-role.kubernetes.io/monreg: ""
    openshiftStateMetrics:
      nodeSelector:
        node-role.kubernetes.io/monreg: ""
    thanosQuerier:
      nodeSelector:
        node-role.kubernetes.io/monreg: ""
- Apply the configuration

oc apply -f cm-openshift-monitoring.yaml -n openshift-monitoring
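
After the ConfigMap is applied, the monitoring components are gradually rescheduled onto the monreg nodes. A quick check (a sketch):

# Prometheus and Alertmanager pods should land on monreg nodes
oc get pods -n openshift-monitoring -o wide | grep -E 'prometheus-k8s|alertmanager'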
C. Registry
- Create the PVC: Storage → PersistentVolumeClaims → Create PersistentVolumeClaim
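
If you prefer YAML over the console form, a minimal sketch of an equivalent PVC is shown below. The storage class name (trident-nfs) and the 100Gi size are assumptions for the NetApp NFS backend mentioned earlier; only the claim name image-registry-storage has to match the registry config in the next step.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: image-registry-storage          # must match the claim referenced by the registry config
  namespace: openshift-image-registry
spec:
  accessModes:
  - ReadWriteMany                        # NFS supports RWX, which the registry needs
  resources:
    requests:
      storage: 100Gi                     # assumed size, adjust as needed
  storageClassName: trident-nfs          # hypothetical NetApp NFS storage class name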

- Edit the image registry config

oc edit configs.imageregistry.operator.openshift.io cluster

kind: Config
apiVersion: imageregistry.operator.openshift.io/v1
metadata:
  name: cluster
  namespace: openshift-image-registry
spec:
  ...output omitted...
  defaultRoute: true
  rolloutStrategy: Recreate
  storage:
    managementState: Managed
    pvc:
      claim: image-registry-storage
  nodeSelector:
    node-role.kubernetes.io/monreg: ""
  ...output omitted...
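
To confirm the change took effect, check where the registry pod is running and that the default route was created (a sketch):

# The registry pod should be scheduled on a monreg node
oc get pods -n openshift-image-registry -o wide
# Default route created by defaultRoute: true
oc get route default-route -n openshift-image-registry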
This concludes all configuration steps covered in Part 2 of the post-installation process for the OpenShift Container Platform (OCP) 4.18 cluster deployed on AWS using the IPI method. In the next section, I will discuss the detailed configuration of the logging stack using Loki.