TKG Integration with NSX

NSX can add great value in a Kubernetes environment by providing consistent Networking and Security across VMs and Containers. In this post I will show how to integrate TKG with NSX in a vSphere 8 environment to highlight the value NSX brings to TKG.

TKG uses Antrea as its CNI, which can be integrated with NSX. This integration empowers the Network & Security Admin with full visibility into their TKG workloads and the ability to apply security policies to those workloads. It also provides a healthy operating model between the Platform and Infra teams, with a clear distribution of responsibilities.

As of vSphere 8 U1, the Antrea & NSX integration still needs to be performed after the TKG cluster creation with some simple manual steps. In future vSphere releases this integration will be done automatically during the cluster creation process. I will document the automated process once it is released. For now, let us do it the manual way.

Let's start by deploying a TKG cluster in our vSphere 8 U1 environment.

Deploy a TKG Cluster with Antrea

Before deploying a TKG cluster, we need Workload Management enabled in vCenter and at least one Namespace created, with the right permissions, storage, a VM Class, and a Content Library added to the Namespace.

Log in to the Namespace,

k vsphere login --server 192.168.26.2 -u administrator@vsphere.local --insecure-skip-tls-verify --tanzu-kubernetes-cluster-namespace namespace-01
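After logging in, you can optionally sanity-check the Namespace before creating the cluster. This is just a quick check I like to do; the resource kinds below assume a vSphere 8 Supervisor (the login creates a context named after the Namespace) and may vary slightly by release,

k config use-context namespace-01
k get virtualmachineclasses
k get storageclasses
k get tanzukubernetesreleases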

Now we can create a TKG cluster in that Namespace. I am using the YAML below,

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
#define the cluster
metadata:
  #user-defined name of the cluster; string
  name: ali-tkg-cluster-01
  #kubernetes namespace for the cluster; string
  namespace: namespace-01
#define the desired state of cluster
spec:
  #specify the cluster network; required, there is no default
  clusterNetwork:
    #network ranges from which service VIPs are allocated
    services:
      #ranges of network addresses; string array
      #CAUTION: must not overlap with Supervisor
      cidrBlocks: ["198.51.100.0/12"]
    #network ranges from which Pod networks are allocated
    pods:
      #ranges of network addresses; string array
      #CAUTION: must not overlap with Supervisor
      cidrBlocks: ["192.0.2.0/16"]
    #domain name for services; string
    serviceDomain: "cluster.local"
  #specify the topology for the cluster
  topology:
    #name of the ClusterClass object to derive the topology
    class: tanzukubernetescluster
    #kubernetes version of the cluster; format is TKR NAME
    version: v1.23.8---vmware.2-tkg.2-zshippable
    #describe the cluster control plane
    controlPlane:
      #number of control plane nodes; integer 1 or 3
      replicas: 1
    #describe the cluster worker nodes
    workers:
      #specifies parameters for a set of worker nodes in the topology
      machineDeployments:
        #node pool class used to create the set of worker nodes
        - class: node-pool
          #user-defined name of the node pool; string
          name: node-pool-1
          #number of worker nodes in this pool; integer 0 or more
          replicas: 2
    #customize the cluster
    variables:
      #virtual machine class type and size for cluster nodes
      - name: vmClass
        value: best-effort-medium
      #persistent storage class for cluster nodes
      - name: storageClass
        value: vsan-default-storage-policy
      # default storageclass for control plane and worker node pools
      - name: defaultStorageClass
        value: vsan-default-storage-policy

Create the TKG Cluster,

 k apply -f ali-tkg-cluster-01.yml

Check the status of the cluster (install the Tanzu CLI first),

tanzu cluster list
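If you prefer to stay with kubectl, the same status can be checked against the Cluster API objects in the vSphere Namespace (a quick alternative; output columns may differ by release),

k get cluster ali-tkg-cluster-01 -n namespace-01
k get machines -n namespace-01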

Once the cluster is “running”, log in to the TKG cluster and check the Antrea pod status,

k vsphere login --server 192.168.26.2 -u administrator@vsphere.local --insecure-skip-tls-verify --tanzu-kubernetes-cluster-namespace namespace-01 --tanzu-kubernetes-cluster-name ali-tkg-cluster-01

k get pods -n kube-system | grep antrea
antrea-agent-7m9xj   2/2     Running   0             12m
antrea-agent-9jh7p   2/2     Running   0             12m
antrea-agent-ffmdg   2/2     Running   0             18m
antrea-controller-58bbcdff66-nvgcz 1/1     Running   0        18m

Integrate Antrea with NSX

Let's start by creating a self-signed cert to authenticate Antrea with NSX. You can do the steps below on any jump host.
“ali-tkg-cluster-01” should be replaced with your cluster name.

openssl genrsa -out ali-tkg-cluster-01-private.key 2048

openssl req -new -key ali-tkg-cluster-01-private.key -out ali-tkg-cluster-01.csr -subj "/C=US/ST=CA/L=Palo Alto/O=VMware/OU=Antrea Cluster/CN=ali-tkg-cluster-01"

openssl x509 -req -days 3650 -sha256 -in ali-tkg-cluster-01.csr -signkey ali-tkg-cluster-01-private.key -out ali-tkg-cluster-01.crt
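You can quickly verify the generated certificate before uploading it to NSX (plain openssl, nothing TKG-specific),

openssl x509 -in ali-tkg-cluster-01.crt -text -noout | grep -A1 Subject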

Create a user in NSX using the crt file generated above.
In NSX Manager: System >> User Management >> ADD PRINCIPAL IDENTITY
Use the cluster name as the Principal Identity Name and Node Id. The Role should be Enterprise Admin. Paste in the content of the crt file.
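If you prefer the API over the UI, the same Principal Identity can be created with a call similar to the sketch below. Treat it as an assumption: the endpoint and field names are based on the NSX-T trust-management API and the "admin" user is the default NSX account, so details may differ in your NSX version and the UI flow above remains the reference,

curl -k -u admin -X POST https://192.168.10.181/api/v1/trust-management/principal-identities/with-certificate \
  -H 'Content-Type: application/json' \
  -d '{
        "name": "ali-tkg-cluster-01",
        "node_id": "ali-tkg-cluster-01",
        "role": "enterprise_admin",
        "certificate_pem": "-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----\n"
      }'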

We need to deploy the Antrea Interworking pod, which is responsible for the integration between Antrea and NSX.
The needed YAMLs are available on VMware.com as a compressed file on the Antrea download page. This archive includes the YAMLs used below (bootstrap-config.yaml and interworking.yaml).

Edit the following lines in bootstrap-config.yaml,
NSXManagers: [192.168.10.181] #just example
tls.crt: xxxxxxxx # One line base64 encoded data. Can be generated by command: cat tls.crt | base64 -w 0
tls.key: xxxxxxx # One line base64 encoded data. Can be generated by command: cat tls.key | base64 -w 0
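With the files generated in the previous step, the two encoded values can be produced like this (assuming the .crt and private key file names from the openssl commands above),

cat ali-tkg-cluster-01.crt | base64 -w 0
cat ali-tkg-cluster-01-private.key | base64 -w 0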

Here is my bootstrap-config.yaml with the tls.crt and tls.key values redacted,

apiVersion: v1
kind: Namespace
metadata:
  name: vmware-system-antrea
  labels:
    app: antrea-interworking
    openshift.io/run-level: '0'
---
# NOTE: In production the bootstrap config and secret should be filled by admin
# manually or external automation mechanism.
apiVersion: v1
kind: ConfigMap
metadata:
  name: bootstrap-config
  namespace: vmware-system-antrea
data:
  bootstrap.conf: |
    # Fill in the cluster name. It should be unique among the clusters managed by the NSX-T.
    clusterName: ali-tkg-cluster-01
    # Fill in the NSX manager IPs. If there is only one IP, the value should be like [dummyNSXIP1]
    NSXManagers: [192.168.10.181]
    # vhcPath is optional. By default it's empty. If need to inventory data isolation between clusters, create VHC in NSX-T and fill the vhc path here.
    vhcPath: ""
---
apiVersion: v1
kind: Secret
metadata:
  name: nsx-cert
  namespace: vmware-system-antrea
type: kubernetes.io/tls
data:
  # One line base64 encoded data. Can be generated by command: cat tls.crt | base64 -w 0
  tls.crt: xxxxx
  # One line base64 encoded data. Can be generated by command: cat tls.key | base64 -w 0
  tls.key: xxxxx

We can edit interworking.yaml to point to the right images. I have edited all image references to point to the image below from the VMware public repo,
image: projects.registry.vmware.com/antreainterworking/interworking-photon:0.5.0
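A one-liner like the following can do that edit, assuming the original image references in interworking.yaml contain "interworking" in their name (adjust the pattern to match your copy of the file),

sed -i 's|image: .*interworking.*|image: projects.registry.vmware.com/antreainterworking/interworking-photon:0.5.0|g' interworking.yaml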

Now let's submit those YAMLs to our TKG cluster. You can/should use a more restrictive RoleBinding; I am doing it the lazy way since this deployment is in a local lab.

kubectl create clusterrolebinding privileged-role-binding --clusterrole=psp:vmware-system-privileged --group=system:authenticated

kubectl apply -f bootstrap-config.yaml -f interworking.yaml

Check the status,

k -n vmware-system-antrea get jobs
NAME       COMPLETIONS   DURATION   AGE
register   1/1           90s        3m35s

k -n vmware-system-antrea get pods
NAME                            READY   STATUS      RESTARTS   AGE
interworking-5ddd49766b-ltzzp   4/4     Running     0          3m30s
register-tvm4t                  0/1     Completed   0          3m30s
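If the register job or the interworking pod does not reach the state above, the logs are the first place to look (standard kubectl; the deployment and job names come from the output above),

k -n vmware-system-antrea logs deployment/interworking --all-containers --tail=50
k -n vmware-system-antrea logs job/register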

Check the integration on the NSX Manager side (in my environment, one Tanzu cluster and one OpenShift cluster are integrated with NSX using the same process),

Now we can see the Inventory of our TKG workloads from NSX

We can apply security policies from the NSX DFW to achieve Pod micro-segmentation and Pod-to-VM policies.
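On the cluster side, policies pushed from NSX are realized by Antrea; a quick way to confirm they landed is to list the Antrea ClusterNetworkPolicies from inside the TKG cluster (this assumes the DFW rules show up as ACNPs, which is how it looks in my lab),

k get acnp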

And we can analyze internal TKG workload traffic using Traceflow.
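Traceflow requests started from the NSX UI are executed by Antrea on the cluster, so if you want to dig deeper you can also look at the Traceflow CRs directly (an optional check; the CRD ships with Antrea),

k get traceflows.crd.antrea.io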

Enabling Antrea Advanced Features in TKG

Starting with vSphere 8, we can even enable Antrea advanced features in the AntreaConfig for our TKG cluster. We can easily edit the config below to enable features like Egress, Flow Exporter, NodePortLocal and others.
From the vSphere Namespace,

k get antreaconfigs
NAME                                TRAFFICENCAPMODE   DEFAULTMTU   ANTREAPROXY   ANTREAPOLICY   SECRETREF
ali-tkg-cluster-01-antrea-package   encap                           true          true           ali-tkg-cluster-01-antrea-data-values


k get antreaconfigs -o yaml
apiVersion: v1
items:
- apiVersion: cni.tanzu.vmware.com/v1alpha1
  kind: AntreaConfig
  metadata:
    creationTimestamp: "2023-05-18T07:58:13Z"
    generation: 1
    labels:
      tkg.tanzu.vmware.com/cluster-name: ali-tkg-cluster-01
      tkg.tanzu.vmware.com/package-name: antrea.tanzu.vmware.com.1.5.3---tkg.2-zshippable
    name: ali-tkg-cluster-01-antrea-package
    namespace: namespace-01
    ownerReferences:
    - apiVersion: cluster.x-k8s.io/v1beta1
      kind: Cluster
      name: ali-tkg-cluster-01
      uid: 99e77ae9-ed72-4254-9ddd-10fa570cb2fc
    - apiVersion: run.tanzu.vmware.com/v1alpha3
      blockOwnerDeletion: true
      controller: true
      kind: ClusterBootstrap
      name: ali-tkg-cluster-01
      uid: 8dbb6045-64ed-4db9-9992-89c6e90d1db9
    resourceVersion: "76557512"
    uid: 76650a72-cd35-4ded-aefa-9f5c00ebc37d
  spec:
    antrea:
      config:
        defaultMTU: ""
        disableUdpTunnelOffload: false
        featureGates:
          AntreaPolicy: true
          AntreaProxy: true
          AntreaTraceflow: true
          Egress: false
          EndpointSlice: true
          FlowExporter: false
          NetworkPolicyStats: false
          NodePortLocal: false
        noSNAT: false
        tlsCipherSuites: TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384
        trafficEncapMode: encap
  status:
    secretRef: ali-tkg-cluster-01-antrea-data-values
kind: List
metadata:
  resourceVersion: ""

Once the above config is edited, the Antrea agents need to be restarted.
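As a minimal sketch, enabling a feature gate and restarting the agents could look like this; the AntreaConfig name and namespace come from the output above, and the patch path mirrors the spec shown there,

k -n namespace-01 patch antreaconfig ali-tkg-cluster-01-antrea-package --type merge -p '{"spec":{"antrea":{"config":{"featureGates":{"Egress":true}}}}}'

Then, from inside the TKG cluster,

k -n kube-system rollout restart daemonset/antrea-agent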

As you can see, with these simple steps we managed to empower the Networking and Security Admins with full visibility and policy enforcement for their container workloads using Antrea and NSX.

Please note that the integration steps are applicable not only to Tanzu, but also to OpenShift, EKS, AKS, GKE, and anywhere else Antrea is deployed.

Thank you for reading!
