Managing your own Temporal cluster can be a challenging task. With multiple core services, various metrics to monitor, and a separate persistence service, it can be quite an undertaking for any team. This post kicks off a new series that reviews the work involved in hosting Temporal and aims to make it more approachable. Running your own cluster should not be taken lightly: unless your organization has a well-established operations team with the resources to monitor and administer the cluster, you may end up regretting not using Temporal Cloud instead. If you must run your own cluster due to data ownership regulations, uptime guarantees, or other organizational requirements, we hope this series helps you uncover and address the common challenges of self-hosting Temporal.

In this article, we begin by reviewing Temporal's overall service architecture and covering some important cluster concepts and metrics. Effective administration requires familiarity with Temporal's different subsystems and ongoing monitoring of their logs and metrics.
Architecture of Temporal:
A Temporal Cluster consists of four core services – Frontend, Matching, History, and Worker – along with a database. Each of these services can be scaled independently to meet the specific performance needs of your cluster. The responsibilities of these services determine their hardware requirements and scaling conditions. In practice, the ultimate bottleneck when scaling Temporal is usually the underlying persistence technology used in the deployment.
Frontend Service:
The frontend service is a stateless service that serves as the client-facing gRPC API of the Temporal cluster. It acts as a pass-through to the other services and handles rate-limiting, authorization, validation, and routing of inbound calls. Inbound calls include communication from the Temporal Web UI, the tctl CLI tool, worker processes, and Temporal SDK connections. The nodes hosting this service usually benefit from additional compute resources.
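To make this concrete, here is a minimal Go SDK sketch of an application connecting to the frontend service. It assumes a reasonably recent Go SDK (v1.15+, where client.Dial is available); the host name and namespace are placeholders for your own deployment.

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
)

func main() {
	// Dial the cluster's frontend service; every SDK call goes through this
	// gRPC endpoint. The host/port and namespace below are placeholders.
	c, err := client.Dial(client.Options{
		HostPort:  "temporal-frontend.example.internal:7233",
		Namespace: "default",
	})
	if err != nil {
		log.Fatalf("unable to connect to Temporal frontend: %v", err)
	}
	defer c.Close()

	log.Println("connected to the Temporal frontend")
}
```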
Matching Service:
The matching service is responsible for efficiently dispatching tasks to workers by managing and caching operations on task queues. Task queues are partitioned, and the partitions are distributed across the hosts of the matching service. One important optimization provided by the matching service is called “synchronous matching”: if a long-polling worker is already waiting for a new task when the matching host receives one, the task is dispatched to the worker immediately. This avoids persisting the task, which reduces load on the persistence service and lowers task latency. Administrators should monitor the “sync match rate” of their cluster and scale workers as needed to keep the rate as high as possible. The hosts of this service typically benefit from additional memory.
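As an illustration of the worker side of keeping the sync match rate high, the following Go SDK sketch raises the number of concurrent task queue pollers so that a long poll is more likely to be waiting when a new task arrives. The task queue name and poller counts are illustrative, not tuned recommendations.

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	c, err := client.Dial(client.Options{}) // defaults to localhost:7233
	if err != nil {
		log.Fatalf("unable to create Temporal client: %v", err)
	}
	defer c.Close()

	// More concurrent pollers means a long-poll request is more likely to be
	// waiting on the matching service when a new task arrives, which keeps
	// the sync match rate high. These values are illustrative, not tuned.
	w := worker.New(c, "orders-task-queue", worker.Options{
		MaxConcurrentWorkflowTaskPollers: 8,
		MaxConcurrentActivityTaskPollers: 8,
	})

	// Register your workflows and activities here before starting the worker.

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalf("worker exited: %v", err)
	}
}
```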
History Service:
The history service is responsible for efficiently writing workflow execution data to the persistence service and acting as a cache for workflow history. Workflow executions are distributed among the shards of the history service. When the history service progresses a workflow by saving its updated history to persistence, it also enqueues a task for that workflow on its task queue; a worker can then poll for the task and continue progressing the workflow using the new history. This service is particularly memory intensive, so its hosts typically benefit from additional memory.
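To see the event histories the history service manages, you can page through a workflow's history via the frontend with the Go SDK, as in the sketch below. The workflow ID is a placeholder, and an empty run ID selects the latest run.

```go
package main

import (
	"context"
	"log"

	enumspb "go.temporal.io/api/enums/v1"
	"go.temporal.io/sdk/client"
)

func main() {
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalf("unable to create Temporal client: %v", err)
	}
	defer c.Close()

	// Iterate over the event history persisted for a workflow execution.
	// "order-12345" is a placeholder workflow ID.
	iter := c.GetWorkflowHistory(context.Background(), "order-12345", "",
		false, enumspb.HISTORY_EVENT_FILTER_TYPE_ALL_EVENT)
	for iter.HasNext() {
		event, err := iter.Next()
		if err != nil {
			log.Fatalf("failed to read history event: %v", err)
		}
		log.Printf("event %d: %s", event.GetEventId(), event.GetEventType())
	}
}
```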
Worker Service:
The cluster worker hosts are responsible for running Temporal's internal background workflows. These workflows support operations such as tctl batch operations, archiving, and inter-cluster replication. The scaling requirements of this service are not well documented, but a balance of compute and memory appears to work best. Starting with two hosts is a reasonable default for most clusters, though performance characteristics will vary depending on how heavily your cluster relies on these internal background workflows.
Importance of History Shards:
History shards are a critical cluster configuration parameter because they determine the maximum number of concurrent persistence operations the cluster can perform, and the shard count cannot be changed after the cluster is created. It is therefore important to select a number of history shards that can handle the theoretical maximum level of concurrent load your cluster may experience. While it may be tempting to choose a very large number of history shards, doing so comes with additional costs: more shards increase CPU and memory consumption on History hosts and put more pressure on the persistence service. Temporal recommends 512 shards for small production clusters, and even large production clusters rarely exceed 4096 shards. However, whether a specific number of history shards is suitable for a high-load scenario can only be determined through testing.
Importance of Retention:
Every Temporal workflow keeps a history of the events it processes in the persistence database. As the data in the database grows, it can degrade performance. Scaling the database helps, but at some point data must be removed or archived rather than scaling storage indefinitely. To maintain high performance for active workflows, Temporal deletes the history of closed workflows after a configurable Retention Period. Temporal enforces a minimum retention of one day, but if your use case allows for it, you can raise the retention period to keep closed workflow data for much longer, potentially months. Although event data eventually has to be removed from the primary persistence database, you can configure Temporal to preserve the history in an Archival storage layer, such as S3. This allows Temporal to store workflow data indefinitely without impacting cluster performance. Archived workflow history can still be queried through the Temporal CLI and Web UI, albeit with lower performance than the primary persistence service.
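Retention is configured per namespace. The sketch below registers a hypothetical namespace with a 30-day retention period using the Go SDK's namespace client; it assumes a recent SDK/API release where the request takes a durationpb value (older releases used a *time.Duration), and the namespace name is a placeholder.

```go
package main

import (
	"context"
	"log"
	"time"

	"go.temporal.io/api/workflowservice/v1"
	"go.temporal.io/sdk/client"
	"google.golang.org/protobuf/types/known/durationpb"
)

func main() {
	nc, err := client.NewNamespaceClient(client.Options{})
	if err != nil {
		log.Fatalf("unable to create namespace client: %v", err)
	}
	defer nc.Close()

	// Register a namespace whose closed workflows are kept for 30 days before
	// the history service deletes them. "orders" is a placeholder name.
	err = nc.Register(context.Background(), &workflowservice.RegisterNamespaceRequest{
		Namespace:                        "orders",
		WorkflowExecutionRetentionPeriod: durationpb.New(30 * 24 * time.Hour),
	})
	if err != nil {
		log.Fatalf("unable to register namespace: %v", err)
	}
}
```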
Essential Metrics and Monitoring:
Temporal exports various Prometheus-based metrics for monitoring the four services and the database. These metrics fall into two categories – metrics emitted by the cluster services themselves and metrics emitted by the clients and worker processes running the Temporal SDK for your chosen language. Cluster metrics are tagged with labels such as service type, operation, and namespace so that activity from different services, namespaces, and operations can be distinguished. Monitoring these metrics is crucial for keeping the cluster running smoothly. Important signals include service availability, service latency, persistence availability, and persistence latency. These metrics can be monitored with PromQL queries in tools like Grafana.
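As a sketch of the SDK side of that picture, the following Go example wires the client's metrics into a Prometheus endpoint using the tally adapter, roughly following the pattern used in Temporal's Go samples. The listen address is a placeholder, and the exact package versions in your project may differ.

```go
package main

import (
	"log"
	"time"

	prom "github.com/prometheus/client_golang/prometheus"
	"github.com/uber-go/tally/v4"
	"github.com/uber-go/tally/v4/prometheus"
	"go.temporal.io/sdk/client"
	sdktally "go.temporal.io/sdk/contrib/tally"
)

// newPrometheusScope builds a tally scope that serves SDK metrics in
// Prometheus format on the configured listen address.
func newPrometheusScope(c prometheus.Configuration) tally.Scope {
	reporter, err := c.NewReporter(prometheus.ConfigurationOptions{
		Registry: prom.NewRegistry(),
		OnError: func(err error) {
			log.Println("prometheus reporter error:", err)
		},
	})
	if err != nil {
		log.Fatalf("error creating prometheus reporter: %v", err)
	}
	scope, _ := tally.NewRootScope(tally.ScopeOptions{
		CachedReporter:  reporter,
		Separator:       prometheus.DefaultSeparator,
		SanitizeOptions: &sdktally.PrometheusSanitizeOptions,
	}, time.Second)
	return sdktally.NewPrometheusNamingScope(scope)
}

func main() {
	// Expose SDK-side metrics on :9090/metrics so Prometheus can scrape them
	// alongside the cluster's own metrics endpoints. The address is a placeholder.
	c, err := client.Dial(client.Options{
		MetricsHandler: sdktally.NewMetricsHandler(newPrometheusScope(prometheus.Configuration{
			ListenAddress: "0.0.0.0:9090",
			TimerType:     "histogram",
		})),
	})
	if err != nil {
		log.Fatalf("unable to create Temporal client: %v", err)
	}
	defer c.Close()

	// In a real deployment you would now create and run workers with this client.
}
```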
By understanding the architecture, concepts, and metrics of Temporal, you can effectively manage your own cluster and address common challenges that arise in self-hosting Temporal.