# Troubleshooting Connectware on Kubernetes

This guide helps you diagnose and resolve common issues with Connectware running on Kubernetes. Follow the sections in order for systematic troubleshooting.

## How to Troubleshoot

When troubleshooting Connectware issues, proceed in the following order:

1. Check pod status to identify obvious failures.
2. Inspect pod events for Kubernetes-level errors.
3. Review logs to identify application-level problems.
4. Collect debug information before making changes.
5. Restart or remove unhealthy pods if appropriate.
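
Assuming `NAMESPACE` holds the namespace of your Connectware installation and `<podname>` is a pod you are investigating, the steps above map to the following commands, each of which is covered in detail in this guide:

{% code lineNumbers="true" %}

```bash
# 1. Check pod status to identify obvious failures
kubectl -n ${NAMESPACE} get pods

# 2. Inspect events of a suspicious pod
kubectl -n ${NAMESPACE} describe pod <podname>

# 3. Review the pod's logs for application-level problems
kubectl -n ${NAMESPACE} logs <podname>

# 4. Collect debug information before making changes
./collect_debug.sh -n ${NAMESPACE}

# 5. Restart an unhealthy pod; its controller recreates it
kubectl -n ${NAMESPACE} delete pod <podname>
```

{% endcode %}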

If you cannot identify or resolve the issue, contact the Cybus support team at **<support@cybus.io>**.

## Prerequisites

Before troubleshooting, ensure that:

* [Helm version 3](https://helm.sh/docs/intro/quickstart/#install-helm) is installed on your system.
* [kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) is installed on your system.
* You know the name and namespace of your Connectware installation. See [Obtaining the Name, Namespace, and Version of Your Connectware Installation](#obtaining-the-name-namespace-and-version-of-your-connectware-installation).
* You have permissions to view pods, logs, and events.
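
If you do not know the installation name or namespace yet, listing the Helm releases across all namespaces is a quick way to find them. The `grep` filter below assumes your release name contains "connectware":

{% code lineNumbers="true" %}

```bash
# List all Helm releases; the NAME and NAMESPACE columns identify
# your Connectware installation
helm list --all-namespaces

# Narrow the output (assumes the release name contains "connectware")
helm list --all-namespaces | grep -i connectware
```

{% endcode %}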

## Checking Pod Status

Connectware requires all pods to be in `Running` status with all containers ready. Check this by running:

{% code lineNumbers="true" %}

```bash
kubectl get pods
```

{% endcode %}

**Expected output**: All pods show matching values in the **Ready** column, for example `1/1` or `2/2`, and a **Status** of `Running`.

| Name                                     | Ready | Status  | Restarts | Age   |
| ---------------------------------------- | ----- | ------- | -------- | ----- |
| admin-web-app-8649f98fc6-sktb7           | 1/1   | Running | 0        | 3m1s  |
| auth-server-5f46964984-5rwvc             | 1/1   | Running | 0        | 2m39s |
| broker-0                                 | 1/1   | Running | 0        | 2m11s |
| broker-1                                 | 1/1   | Running | 0        | 2m50s |
| connectware-5b948ffdff-tj2x9             | 1/1   | Running | 0        | 2m41s |
| container-manager-5f5678657c-94486       | 1/1   | Running | 0        | 2m46s |
| ingress-controller-85fffdcb4b-m8kpm      | 1/1   | Running | 0        | 2m37s |
| nats-0                                   | 1/1   | Running | 0        | 2m31s |
| nats-1                                   | 1/1   | Running | 0        | 2m30s |
| nats-2                                   | 1/1   | Running | 0        | 2m30s |
| postgresql-0                             | 1/1   | Running | 0        | 2m58s |
| protocol-mapper-69f59f7dd4-6xhkf         | 1/1   | Running | 0        | 2m42s |
| resource-status-tracking-fcd58dc79-cl5nw | 1/1   | Running | 0        | 2m12s |
| resource-status-tracking-fcd58dc79-vlzqs | 1/1   | Running | 0        | 2m22s |
| service-manager-6b5fffd66d-gt584         | 1/1   | Running | 0        | 2m52s |
| system-control-server-bd486f5bd-2mkxz    | 1/1   | Running | 0        | 2m45s |
| welder-robots-0                          | 1/1   | Running | 0        | 2m59s |
| workbench-57d4b59fbb-gqwnb               | 1/1   | Running | 0        | 2m38s |

### Identifying Unhealthy Pods

A pod should be considered unhealthy if it:

* Shows an error state such as `CrashLoopBackOff`, or is stuck in an `Init` state.
* Remains in a transitional state, such as `Pending` or `Terminating`, for an extended time.
* Shows mismatched **Ready** values (e.g., `0/1` instead of `1/1`).
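
To surface such pods quickly, you can filter on the pod phase. Note that successfully completed Job pods (`Succeeded`) also match this filter:

{% code lineNumbers="true" %}

```bash
# List all pods in the namespace that are not in the Running phase
kubectl -n ${NAMESPACE} get pods --field-selector=status.phase!=Running
```

{% endcode %}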

**Example of a pod that is unable to start**

| Name                        | Ready | Status   | Restarts | Age |
| --------------------------- | ----- | -------- | -------- | --- |
| auth-server-b4b69ccfd-fvsmz | 0/1   | Init:0/1 | 0        | 8m  |

### Inspecting Pod Events

To identify the cause of a pod issue:

1. Describe the pod:

{% code lineNumbers="true" %}

```bash
kubectl describe pod <podname>
```

{% endcode %}

2. Review the **Events** section at the bottom of the output.

{% code lineNumbers="true" %}

```txt
Warning  FailedMount  7m4s kubelet Unable to attach or mount volumes: unmounted volumes=[testfail], unattached volumes=[certs testfail kube-api-access-52xmc]: timed out waiting for the condition
```

{% endcode %}

This indicates a cluster-level issue where required volumes are unavailable. Such issues must be resolved at the Kubernetes or infrastructure level and are outside the scope of the Connectware documentation.

If no clear cause is visible, continue with [log inspection](#checking-logs-using-kubetail).

As general guidance:

* Issues immediately after upgrades or configuration changes are often caused by incorrect Helm values.
* Issues appearing later are often related to cluster infrastructure.
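
Recent warning events for the whole namespace can also point you to the failing component without describing each pod individually:

{% code lineNumbers="true" %}

```bash
# Show warning events in the namespace, most recent last
kubectl -n ${NAMESPACE} get events --field-selector type=Warning --sort-by=.lastTimestamp
```

{% endcode %}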

## Checking Logs Using Kubetail

For viewing logs from multiple pods simultaneously, we recommend kubetail, a wrapper around kubectl that aggregates logs from multiple pods. By default, kubetail follows the logs, as `kubectl logs -f` does.

Installation instructions are available at <https://github.com/johanhaleby/kubetail>.

Here are a few examples of how you can use kubetail. For the full list of options, run `kubetail --help`.

### Displaying Logs from All Pods in a Namespace

{% code lineNumbers="true" %}

```bash
kubetail -n ${NAMESPACE}
```

{% endcode %}

### Displaying Logs of Pods That Match a Search Term

{% code lineNumbers="true" %}

```bash
kubetail broker
```

{% endcode %}

### Displaying Logs for Pods That Match a Regular Expression

{% code lineNumbers="true" %}

```bash
kubetail '(service-manager|protocol-mapper)' -e regex
```

{% endcode %}

### Displaying Logs from the Past

You can combine the parameter `-s <timeframe>` with any other command to display logs from the past up to now:

{% code lineNumbers="true" %}

```bash
kubetail broker -s 10m
```

{% endcode %}

### Displaying Logs of a Terminated Container of a Pod

{% code lineNumbers="true" %}

```bash
kubetail broker --previous
```

{% endcode %}

### Displaying Timestamps

If the logs you are viewing are missing timestamps, you can use the parameter `--timestamps` for kubetail to add timestamps to each log line:

{% code lineNumbers="true" %}

```bash
kubetail broker --timestamps
```

{% endcode %}

## Checking Logs Using Kubectl

If you do not want to use kubetail as suggested in the previous section, you can use kubectl to read logs.

Here are a few examples of how you can use it:

### Displaying and Tailing Logs of a Pod

{% code lineNumbers="true" %}

```bash
kubectl logs -f <podname>
```

{% endcode %}

### Displaying and Tailing Logs for All Pods with a Label

{% code lineNumbers="true" %}

```bash
kubectl logs -f -l app=broker
```

{% endcode %}

### Displaying Logs of a Terminated Container of a Pod

{% code lineNumbers="true" %}

```bash
kubectl logs --previous <podname>
```

{% endcode %}

### Displaying Logs from the Past

You can combine the parameter `--since <timeframe>` with any other command to display logs from the past up to now:

{% code lineNumbers="true" %}

```bash
kubectl logs -f --since 10m <podname>
```

{% endcode %}

### Displaying Timestamps

If the logs that you are viewing are missing timestamps, you can use the parameter `--timestamps` for kubectl to add timestamps to each log line:

{% code lineNumbers="true" %}

```bash
kubectl logs -f --timestamps <podname>
```

{% endcode %}

## Removing Unhealthy Pods

When a pod is identified as unhealthy, either through pod status checks or log inspection, first collect the current system state using the debugging script (`collect_debug.sh`) from the Connectware Kubernetes Toolkit. This ensures that diagnostic information is preserved before any changes are made. For more information, see [Collecting Debug Information](#collecting-debug-information).

After collecting debug data, delete the affected pod:

{% code lineNumbers="true" %}

```bash
kubectl delete pod <podname>
```

{% endcode %}

The controller managing the pod will automatically create a new instance. Restarting pods in this way often resolves transient issues and does not delete persisted data.

### Special Considerations for StatefulSet Pods

Pods whose names end with a fixed number, such as `broker-0`, belong to a StatefulSet. Kubernetes handles StatefulSets differently from other workloads. An unhealthy StatefulSet pod is not automatically replaced after configuration changes.

If a StatefulSet pod is unhealthy due to a configuration issue, you must:

1. Correct the configuration.
2. Manually delete the affected pod so it can be recreated with the updated settings.

This behavior is intentional, as StatefulSets often manage persistent or stateful data.

In Connectware, StatefulSets include the broker, nats, postgresql, and any protocol mapper agents that you have defined.
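
To see which StatefulSets exist in your installation, and therefore which pods need this manual deletion step, list them in the namespace:

{% code lineNumbers="true" %}

```bash
# List StatefulSets; their pods carry fixed ordinal suffixes such as -0, -1
kubectl -n ${NAMESPACE} get statefulsets
```

{% endcode %}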

## Collecting Debug Information

The Connectware Kubernetes Toolkit provides a debugging script (`collect_debug.sh`) to gather logs and state information. Run this script to collect diagnostic information about the system status before attempting fixes. If you plan to open a support ticket, the output of this script is required.

### Prerequisites

* The following tools are installed: kubectl, tar, sed, rm, sort, timeout.
* You have access to the target installation via kubectl.

### Downloading the Debugging Script

* You can download the debugging script from <https://download.cybus.io/connectware-k8s-toolkit/latest/collect_debug.sh>.

**Example**

{% code lineNumbers="true" %}

```bash
wget https://download.cybus.io/connectware-k8s-toolkit/latest/collect_debug.sh
chmod u+x ./collect_debug.sh
```

{% endcode %}

### Running the Debugging Script

Use the following parameters to configure the debugging script. For example, you can specify the namespace of your Connectware installation and a custom kubeconfig file if needed.

| Parameter                           | Value                              | Description                                                                 |
| ----------------------------------- | ---------------------------------- | --------------------------------------------------------------------------- |
| `-n`                                | namespace                          | Kubernetes namespace containing the Connectware installation                |
| `-k`                                | path to kubeconfig file            | Kubeconfig file to use instead of the default (`~/.kube/config`)            |
| `-c`                                | kubeconfig context name            | Kubeconfig context to use instead of the currently active context           |
| `--skip-debug-containers`           | none (flag)                        | Prevents the script from running debug containers on the Kubernetes cluster |
| `--debug-containers-timeout`        | seconds                            | Timeout in seconds for debug container operations (default: `120`)          |
| `--nats-servicesCRUD-filter`        | servicesCRUD stream subject filter | NATS consumer filter subject for servicesCRUD (default: `>`)                |
| `--nats-resourceDefinitions-filter` | resourceDefinitions bucket filter  | NATS consumer filter subject for resourceDefinitions (default: `>`)         |
| `--nats-resourceStates-filter`      | resourceStates stream filter       | NATS consumer filter subject for resourceStates (default: `>`)              |

Run the script and specify the namespace. If kubectl is already configured for your target cluster, no other parameters are required:

**Example**

{% code lineNumbers="true" %}

```bash
./collect_debug.sh -n ${NAMESPACE}
```

{% endcode %}

If kubectl is not configured for the target cluster, use the `-k` or `-c` parameters to specify the kubeconfig file or context.
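
For example, to run the script against a cluster that is not your current kubectl default (the kubeconfig path and context name below are placeholders):

{% code lineNumbers="true" %}

```bash
./collect_debug.sh -n ${NAMESPACE} -k /path/to/kubeconfig -c <context-name>
```

{% endcode %}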

#### How the Debugging Script Works

The debugging script collects diagnostic information through read-only operations:

* Queries the Kubernetes API
* Executes commands in Connectware pods
* Runs `connectware-toolkit` containers using `kubectl debug`

{% hint style="warning" %}
The debugging script uses Kubernetes debug containers, which temporarily run containers on your cluster. These containers only perform read-only operations. To prevent debug containers from running, use the `--skip-debug-containers` parameter. Note that this may prevent the collection of crucial diagnostic data.
{% endhint %}

The debugging script continues execution even if individual operations fail, as it is designed to gather as much information as possible from potentially unhealthy systems. Error messages in the script's output are therefore expected and are not by themselves a cause for concern.

#### Debugging Script Output

When the debugging script completes, it creates a compressed archive in the current directory containing the collected information. Provide this archive to Cybus support when reporting issues.
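
If you want to review what is being shared before opening a ticket, you can list the archive contents without extracting them. The file name pattern below is an assumption; use the name printed by the script:

{% code lineNumbers="true" %}

```bash
# List the contents of the generated archive without extracting it
tar -tzf <archive-name>.tar.gz
```

{% endcode %}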

{% hint style="info" %}
Kubernetes only retains logs for currently running containers and their immediate predecessors. If you have logs stored in a central log aggregator or other external system, include relevant logs for the timeframe when the issue occurred.
{% endhint %}

## Troubleshooting Protocol-Mapper Agents

This section covers issues with protocol-mapper agents caused by minor configuration mistakes.

### Agent does not connect when the Connectware broker uses mTLS

**Symptoms**

* Agent log shows:

{% code lineNumbers="true" %}

```txt
VRPC agent connection to broker lost
Reconnecting to mqtts://localhost:8883
```

{% endcode %}

**Likely cause**

* mTLS is not enabled in the agent configuration.

**Resolution**

* Enable mTLS for the agent as described in [Using Mutual Transport Layer Security (mTLS) for agents with the connectware-agent Helm chart](https://docs.cybus.io/documentation/agents/agents-in-kubernetes/configuring-agents-with-the-connectware-agent-helm-chart/using-mutual-transport-layer-security-mtls-for-agents-with-the-connectware-agent-helm-chart).

### TLS connection fails before handshake

**Symptoms**

* Agent log shows:

{% code lineNumbers="true" %}

```txt
Client network socket disconnected before secure TLS connection was established
```

{% endcode %}

**Likely cause**

* The agent is connecting to the wrong MQTTS port on the broker.

**Resolution**

* Verify `mqttPort` and `mqttDataPort` in the `protocolMapperAgents` section of your Helm `values.yaml`.
* If you are not using a custom setup, these values are correct by default and can be removed.
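
To check which values your deployed release is actually using, including any `mqttPort` or `mqttDataPort` overrides, you can query Helm (assuming `INSTALLATION_NAME` holds your release name):

{% code lineNumbers="true" %}

```bash
# Show only the user-supplied values of the release
helm get values -n ${NAMESPACE} ${INSTALLATION_NAME}

# Show all computed values, including chart defaults
helm get values -n ${NAMESPACE} ${INSTALLATION_NAME} --all
```

{% endcode %}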

### Agent with mTLS enabled does not connect to broker

**Symptoms**

* Agent log shows:

{% code lineNumbers="true" %}

```txt
Failed to read certificates during mTLS setup please check the configuration
```

{% endcode %}

**Likely cause**

* Certificates are missing or invalid.

**Resolution**

* Verify certificate generation and configuration as described in [Using Mutual Transport Layer Security (mTLS) for agents with the connectware-agent Helm chart](https://docs.cybus.io/documentation/agents/agents-in-kubernetes/configuring-agents-with-the-connectware-agent-helm-chart/using-mutual-transport-layer-security-mtls-for-agents-with-the-connectware-agent-helm-chart).
* Ensure Kubernetes objects were created from files named `ca-chain.pem`, `tls.crt`, and `tls.key`. Incorrect filenames will cause the agent to fail to locate certificates.

### Agent registration fails due to certificate Common Name mismatch

**Symptoms**

* Allowing an mTLS-enabled agent in the Connectware Client Registry fails with the message `An Error has occurred - Registration failed`.
* auth-server logs show:

{% code lineNumbers="true" %}

```txt
Unable to process request: 'POST /api/client-registry/confirm', because: Certificate Common Name does not match the username. CN: someCN, username: agentName
```

{% endcode %}

**Likely cause**

* The certificate Common Name does not match the agent name.

**Resolution**

* Ensure the certificate Common Name exactly matches the agent name configured in the Helm value `name`.
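
You can print the subject of the agent certificate with openssl to compare its Common Name against the configured agent name:

{% code lineNumbers="true" %}

```bash
# Print the certificate subject; the CN field must exactly match the
# agent name configured in the Helm value "name"
openssl x509 -in tls.crt -noout -subject
```

{% endcode %}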

### Agent registration fails with connection error

**Symptoms**

* Agent log shows:

{% code lineNumbers="true" %}

```txt
Cannot register protocol-mapper agent, because: socket hang up
```

{% endcode %}

**Likely cause**

* The agent certificate was signed by the wrong Certificate Authority.

**Resolution**

* Verify the agent certificate was signed by the [Certificate Authority](https://docs.cybus.io/documentation/security/tls-certificates/ca-certificates) that is used by Connectware.
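
You can verify the signature chain locally with openssl before re-registering the agent:

{% code lineNumbers="true" %}

```bash
# Prints "tls.crt: OK" only if tls.crt was signed by a CA in ca-chain.pem
openssl verify -CAfile ca-chain.pem tls.crt
```

{% endcode %}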

### Agent registration fails with conflict error

**Symptoms**

* Agent log shows:

{% code lineNumbers="true" %}

```txt
Failed to register agent. Response: 409 Conflict. A conflicting registration might be pending, or a user with the same username <agent-name> is already existing (which you must delete first).
```

{% endcode %}

**Likely cause**

* An agent or user with the same name already exists.

**Resolution**

Every agent needs a user whose username matches the value configured in the `name` Helm value for the agent.

1. Ensure that the agent name is unique.
2. If another agent with the same name exists, do the following:
   * Delete the agent.
   * Delete the corresponding user. For more information, see [Deleting Users](https://docs.cybus.io/user-management/users#deleting-users).
3. If you created a user with the agent’s name for another purpose, choose a different name for the agent.

### Agent enters CrashLoopBackOff due to license errors

**Symptoms**

* Agent pod enters `CrashLoopBackOff`.
* Logs show authentication or license errors followed by agent shutdown.

**Example**

{% code lineNumbers="true" %}

```txt
{"level":30,"time":1670940068658,"pid":8,"hostname":"welder-robots-0","service":"protocol-mapper","msg":"Re-starting using cached credentials"}
{"level":50,"time":1670940068759,"pid":8,"hostname":"welder-robots-0","service":"protocol-mapper","msg":"Failed to query license at https://connectware/api/system/info probably due to authentication: 401 Unauthorized"}
{"level":50,"time":1670940068759,"pid":8,"hostname":"welder-robots-0","service":"protocol-mapper","msg":"No valid license file available. Protocol-mapper will stop."}
```

{% endcode %}

**Likely cause**

* Cached agent credentials are no longer valid.

**Resolution**

The agent needs to be re-registered.

1. Delete the agent.
2. Delete the corresponding user. For more information, see [Deleting Users](https://docs.cybus.io/user-management/users#deleting-users).
3. Delete the agent StatefulSet:

{% code lineNumbers="true" %}

```bash
kubectl -n ${NAMESPACE} delete sts <agent-name>
```

{% endcode %}

4. Delete the agent PersistentVolumeClaim:

{% code lineNumbers="true" %}

```bash
kubectl -n ${NAMESPACE} delete pvc protocol-mapper-<agent-name>-0
```

{% endcode %}

5. Apply the configuration changes via the `helm upgrade` command:

{% code lineNumbers="true" %}

```bash
helm upgrade -n ${NAMESPACE} ${INSTALLATION_NAME} <repository>/<chart> -f values.yaml
```

{% endcode %}

For more information, see [Applying Helm Configuration Changes](https://docs.cybus.io/documentation/connectware-helm-chart#applying-helm-configuration-changes).
