End-to-end tests¶
Every release is gated on a real Kubernetes cluster running the real Flink Operator with the real Kafka and the real remote-function HTTP server. Not unit tests pretending to be integration tests — the actual production topology.
Stack¶
| Component | What it provides |
|---|---|
kind |
A throwaway single-node Kubernetes cluster, provisioned per test run |
| Flink Kubernetes Operator 1.11 | Same Operator a production user would deploy |
| Apache Kafka 3.9 | Single-broker KRaft mode, dual listener (cluster + port-forward) |
| LocalStack 4.1 | Emulates Kinesis (Kinesis test path) and S3 (checkpoint storage) |
| Remote function pod | Multi-stage jlink-stripped Alpine image, ~80 MB |
FlinkDeployment CR |
RocksDB state backend, S3 checkpoints, leader-election tuned for fast startup |
Coverage¶
Two test classes run in the same mvn verify invocation:
| Test | Validates |
|---|---|
StateFunK8sE2E |
Kafka ingress → stateful counter function → Kafka egress; greeter function over JSON; checkpoint persistence |
StateFunKinesisE2E |
Kinesis ingress → stateful counter → Kinesis egress; ARN-keyed routing; LocalStack-backed S3 checkpoints |
JUnit 5 @Tag("kafka") / @Tag("kinesis") allow running either suite in isolation:
./mvnw verify -pl :statefun-e2e-k8s-native -am -Dgroups=kinesis # Kinesis only
./mvnw verify -pl :statefun-e2e-k8s-native -am -DexcludedGroups=kinesis # Kafka only
CI integration¶
The reusable workflow .github/workflows/e2e-test.yml is invoked:
- On every PR — mandatory gate. No PR merges without a green E2E.
- On push to
release— post-merge validation. - By the release pipeline — gates Maven Central publish and GHCR image push.
Concurrency is configured so that repeated triggers on the same ref auto-cancel earlier in-progress runs — important guard given the run cost (~25 min on a GH-hosted runner).
Local execution¶
Provisions a kind cluster (creating it if absent), runs the suite, tears down.
Keep the cluster after the test for debugging
./mvnw verify -pl :statefun-e2e-k8s-native -am -Dskip.teardown=true
kubectl get pods -n statefun-e2e
kubectl logs -n statefun-e2e -l component=jobmanager --tail=200
Tear down manually when done: kind delete cluster --name statefun-e2e.
Restricted-network override¶
Set IMAGE_REGISTRY_PREFIX to wire base images through an internal mirror:
export IMAGE_REGISTRY_PREFIX=harbor.example.com/dockerhub-proxy/
./mvnw verify -pl :statefun-e2e-k8s-native -am
Apache Kafka, LocalStack, and the JDK base images all honour the prefix — no manifest fork required.
What "real" means¶
The E2E gate uses the same Operator, the same image, the same Kafka, and the same wire protocol that a production deployment would. The only difference is scale — single broker, single replica, kind instead of EKS/GKE/AKS.
Specifically:
- The Flink Operator is unmodified upstream 1.11
- Kafka runs the apache/kafka image (not a mock)
- The remote function is built from the same Dockerfile as production fixtures, just under a different artifact name
- Checkpoints are written to LocalStack S3 with the same Flink S3 plugin used in production
- The JM/TM topology matches what a
FlinkDeploymentproduces in cloud
A test that passes in this environment has been exercised against the actual production paths — there's no mock layer to silently diverge.
Next steps¶
- Architecture overview — system context for the E2E topology.
- Production K8s deployment — same shape, real cloud.
- Release process — how the E2E gate fits into release CI.