

Intro to Temporal Architecture and Essential Metrics

Learn from a Temporal consulting expert. Discover how to scale and monitor Temporal's services and gain insights into running your own Temporal cluster.

Nils Lundquist

JavaScript Software Consultant

Managing your own Temporal cluster is a daunting task. Between the four core services, the myriad of metrics to monitor, and a separate persistence service, it's a sizeable undertaking for any team. This post begins a new series that will review the work involved in hosting Temporal yourself and try to demystify it.

Running your own cluster is not an effort that should be undertaken lightly. Unless your organization has a mature operations department with resources available to monitor & administer the cluster, you may find yourself wishing you'd avoided this work and just opted to use Temporal Cloud instead.

However, if you need to run your cluster in-house due to data ownership regulations, uptime guarantees, or other organizational mandates, we hope this series will help uncover and resolve the common caveats of self-hosting Temporal.

This article starts the series off by reviewing Temporal's overall service architecture and some important cluster concepts & metrics. Being familiar with the different Temporal subsystems and monitoring their related logs and metrics is essential to administering them.


Architecture of Temporal

Each Temporal Cluster is made of a group of 4 core services - Frontend, Matching, History, and Worker - plus a database:

[Diagram: the four core Temporal services - Frontend, Matching, History, and Worker - connected to the persistence database]

Each of these services can be scaled independently of the others to meet the unique performance requirements of your cluster's specific workload. The differing responsibilities of the services inform what sort of hardware they're typically bottlenecked by and under what conditions they should be scaled. It's almost always the case that Temporal can be scaled to the point where the ultimate bottleneck becomes the limitations of whatever underlying persistence technology is used in a deployment.

Frontend Service

The frontend service is a stateless service responsible for providing the client-facing gRPC API of the Temporal cluster. Its primary function is to serve as a pass-through to the other services. As such, it implements the rate-limiting, authorization, validation, and routing of all inbound calls.

Inbound calls include communication from the Temporal Web UI, the tctl CLI tool, worker processes, and Temporal SDK connections.

The nodes hosting this service typically benefit from additional compute.
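
Every client application reaches the cluster through this gRPC API. As a rough illustration, here is a minimal sketch using the Temporal TypeScript SDK; the address, namespace, task queue, workflow name, and IDs below are placeholder assumptions, not values from this article:

import { Connection, Client } from '@temporalio/client';

async function run() {
  // Connect to the cluster's Frontend gRPC endpoint (7233 is the default port).
  const connection = await Connection.connect({ address: 'temporal-frontend.internal:7233' });
  const client = new Client({ connection, namespace: 'default' });

  // Starting a workflow is an inbound call that the Frontend validates,
  // rate-limits, and routes on to the Matching & History services.
  const handle = await client.workflow.start('processOrder', {
    taskQueue: 'orders',
    workflowId: 'order-1234',
    args: [{ orderId: 1234 }],
  });
  console.log(`Started workflow ${handle.workflowId}`);
}

run().catch(console.error);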

Matching Service

The matching service is responsible for dispatching tasks to workers efficiently by managing and caching operations on task queues. The different task queues of the cluster are distributed among the shards of the matching service.

A notable optimization the matching service provides is called "synchronous matching". If a matching host receives a new task while it already has a long-polling worker waiting for one, that task is dispatched to the worker immediately. This immediate "synchronous matching" avoids having to persist the task, lowering both the load on the persistence service and the latency of task delivery. Administrators should monitor the "sync match rate" of their cluster, since it indicates how often synchronous matching is taking place, and endeavor to keep it as high as possible by scaling workers as needed.

The hosts of this service typically benefit from additional memory.
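
One practical lever for keeping the sync match rate high is making sure your worker fleet has enough polling and execution capacity. Below is a rough sketch with the Temporal TypeScript SDK; the task queue, module paths, and concurrency numbers are assumptions chosen to illustrate the relevant worker options:

import { Worker, NativeConnection } from '@temporalio/worker';
import * as activities from './activities'; // hypothetical activities module

async function run() {
  const connection = await NativeConnection.connect({ address: 'temporal-frontend.internal:7233' });

  const worker = await Worker.create({
    connection,
    namespace: 'default',
    taskQueue: 'orders', // hypothetical task queue
    workflowsPath: require.resolve('./workflows'), // hypothetical workflows module
    activities,
    // More executor slots keep long-poll requests outstanding against the
    // matching service, so new tasks are more likely to be sync-matched
    // instead of persisted and picked up later.
    maxConcurrentWorkflowTaskExecutions: 100,
    maxConcurrentActivityTaskExecutions: 100,
  });

  await worker.run();
}

run().catch((err) => { console.error(err); process.exit(1); });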

History Service

The history service is responsible for writing workflow execution data to the persistence service efficiently and acting as a cache for workflow history. The workflow executions of the cluster are distributed among the shards of the history service. When the history service progresses a workflow by saving its updated history to persistence, it also enqueues a task with the updated history onto the task queue. From there, a worker can poll for more work, receive that task with the new history, and continue progressing the workflow.

The hosts of this service typically benefit from additional memory; the history service is particularly memory intensive.

Worker Service

The cluster worker hosts are responsible for running cluster background workflow tasks. These background workflows support tctl batch operations, archiving, and inter-cluster replication, among other functionality.

There isn’t a great deal of information regarding the scaling requirements of this service, but I’ve seen it suggested that it needs a balance of compute and memory and that two hosts are a good starting place for most clusters. However, the exact performance characteristics will depend on which internal background workflows are most utilized by your cluster.

Importance of History Shards

History shards are an especially important cluster configuration parameter as they represent an upper bound on the number of concurrent persistence operations that can be performed, and critically, this parameter currently cannot be modified after the initial cluster setup. As such, it's important to select a number of history shards sufficient for a theoretical maximum level of concurrent load that your cluster may reach.

It may be tempting to immediately choose an especially large number of history shards, but doing so comes with additional costs. Keeping track of the additional shards increases CPU & memory consumption on History hosts and increases pressure on the persistence service since each additional shard represents an additional potential concurrent operation on the persistence cluster.

Temporal recommends 512 shards for small production clusters and mentions that it's rare for even large production clusters to exceed 4096 shards. However, the only way to concretely determine if a particular number of history shards is suitable for any given high-load scenario is by testing to confirm it.

If you'd like to see an article in this series cover the topic of cluster load testing in detail, come vote for that topic to be up next in our Discord #temporal channel.

Importance of Retention

Every Temporal workflow keeps a history of the events it processes in the Persistence database. As the amount of data in any database grows larger it eventually exhausts available resources, and the performance of that database begins to degrade. Scaling the database, either horizontally or vertically, improves the situation; however, eventually, data must be removed or shifted to archives to avoid scaling the DB indefinitely. To maintain high performance for actively executing workflows, Temporal deletes the history of stopped workflows after a configurable Retention Period. Temporal has a minimum retention of one day.

If your use case doesn't have particularly high load requirements, you may be able to afford to raise the retention and leave this stopped workflow data around for a long time, potentially months. This can be convenient as a source of truth about the outcomes of the completed business processes represented by your stopped workflows.
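
Retention is configured per namespace. As a rough sketch, here is how a namespace could be registered with a 30-day retention period through the raw WorkflowService client exposed by the TypeScript SDK; the namespace name is a placeholder, and the exact request shape may vary by SDK and server version:

import { Connection } from '@temporalio/client';

async function run() {
  const connection = await Connection.connect({ address: 'temporal-frontend.internal:7233' });

  // RegisterNamespaceRequest takes the retention as a protobuf Duration.
  await connection.workflowService.registerNamespace({
    namespace: 'orders', // hypothetical namespace
    workflowExecutionRetentionPeriod: { seconds: 30 * 24 * 60 * 60 }, // 30 days
  });
}

run().catch(console.error);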

Even though the event data must eventually be removed from the primary persistence database, it's possible to configure Temporal to preserve the history in an Archival storage layer, typically S3. Preservation allows Temporal to persist workflow data indefinitely without impacting cluster performance. The workflow history kept in the Archival storage layer can still be queried via the Temporal CLI & Web client, but with lower performance than the primary Persistence service.

Essential Metrics & Monitoring

To keep these four services and the database all operating smoothly, Temporal exports a large variety of Prometheus-based metrics. There are two categories to monitor - metrics exported by the Cluster services and metrics exported from your clients & worker nodes running the Temporal SDK for your language.

Cluster metrics are tagged with type, operation & namespace, which help distinguish activity from different services, namespaces & operations.
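
On the SDK side, the worker runtime has to be told to expose its metrics before Prometheus can scrape them. A minimal sketch with the TypeScript SDK follows; the bind address is an assumption, and this should run once before any workers are created:

import { Runtime } from '@temporalio/worker';

// Expose SDK metrics such as temporal_worker_task_slots_available and the
// schedule-to-start latency histograms on a local Prometheus endpoint.
Runtime.install({
  telemetryOptions: {
    metrics: {
      prometheus: { bindAddress: '0.0.0.0:9464' },
    },
  },
});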

The table below shows a variety of important Cluster & SDK metrics and how they can be applied in different PromQL queries to monitor different aspects of cluster performance in tools like Grafana.

PromQL Query

Description

Metrics Used (Origin)

Service Availability Monitoring

100 - (sum(rate(service_errors[2m]) OR on() vector(0)) / sum(rate(service_requests[2m])) * 100)

Overall percentage of successful gRPC requests in the last 2 minutes. Useful to monitor overall cluster health.

service_requests
service_errors (Cluster)

sum(rate(service_error_with_type{service_name="frontend"}[5m])) by (error_type)

The number of errors for a given service, grouped by type.

Useful to monitor what sorts of errors are occurring on a given service.

Substitute another service_name to monitor services other than frontend.

service_error_with_type (Cluster)

Service Performance Monitoring

histogram_quantile(0.95, sum(rate(service_latency_bucket{service_name="frontend"}[5m])) by (operation, le))

The latency of gRPC calls to a service, grouped by operation.

Useful to monitor what services & operations are contributing most to overall cluster latency.

Substitute another service_name to monitor services other than frontend.

service_latency_bucket (Cluster)

Persistence Service Availability Monitoring

100 - (sum(rate(persistence_errors[2m]) OR on() vector(0)) / sum(rate(persistence_requests[2m])) * 100)

The percentage of successful requests to the persistence service in the last 2 minutes. Useful to monitor persistence service health.

persistence_requests
persistence_errors (Cluster)

Persistence Service Performance Monitoring

sum by (operation) (rate(persistence_requests[5m]))

The number of persistence requests done in the last 5 minutes, grouped by cluster operation. Useful to determine what cluster operations are responsible for persistence service load.

persistence_requests (Cluster)

External Event Monitoring

sum by (operation) (rate(service_requests{operation=~"StartWorkflowExecution|SignalWorkflowExecution|SignalWithStartWorkflowExecution"}[2m]))

The number of gRPC requests coming into the cluster from "external" clients, either to start workflows or send signals.

service_requests
(Cluster)

Workflow Monitoring

sum(rate(service_requests{ operation="AddWorkflowTask"}[2m]))

The number of gRPC requests to add new workflow tasks in the last two minutes.

Should be monitored along with the 'RecordWorkflowTaskStarted', 'RespondWorkflowTaskCompleted', & 'RespondWorkflowTaskFailed' operations. Useful to observe if workflows are being moved between these states as expected.

service_requests
(Cluster)

sum(rate(schedule_to_start_timeout{ operation="TimerActiveTaskWorkflowTimeout"}[2m]))

The number of workflows in the last 2 minutes that exceeded the maximum time to go from their initial scheduled state to their execution actually being started on a worker. Useful to monitor as a sign that workers are busy enough (or otherwise unavailable) to cause major delays to new workflows being started.

schedule_to_start_timeout (Cluster)

sum(rate(schedule_to_close_timeout{ operation="TimerActiveTaskWorkflowTimeout"}[2m]))

The number of workflows in the last 2 minutes that exceeded the maximum time to go from their initial scheduled state to being completed. Useful to monitor as a sign that workflows are not completing due to some sort of slowdown.

schedule_to_close_timeout (Cluster)

histogram_quantile(0.99, sum(rate(cache_latency_bucket{operation="HistoryCacheGetOrCreate"}[1m])) by (le))

Latency while acquiring the lock needed to access a workflow’s history.

Useful to determine how much this lock is contributing to overall workflow progress latency.

cache_latency_bucket (Cluster)

histogram_quantile(0.99, sum(rate(lock_latency_bucket{operation="ShardInfo"}[1m])) by (le))

Latency while acquiring the lock needed to access a workflow history shard.

Useful to determine how much this lock is contributing to overall workflow progress latency.

lock_latency_bucket (Cluster)

Activity Monitoring

sum(rate(service_requests{operation="AddActivityTask"}[2m]))

The number of gRPC requests to add new activities in the last 2 minutes. Useful to monitor against the following query as a representation of the latency between activities being scheduled and their execution beginning.

service_requests
(Cluster)

sum(rate(service_requests{operation="RecordActivityTaskStarted"}[2m]))

The number of gRPC requests recording the start of an activity in the last 2 minutes. Useful to monitor against the previous query as a representation of the latency between activities being scheduled and their execution beginning.

service_requests
(Cluster)

sum(rate(service_requests{operation=~"RespondActivityTaskCompleted|RespondActivityTaskFailed|RespondActivityTaskCanceled"}[2m]))

The number of gRPC requests recording the end of an activity in the last two minutes. Useful to monitor against the previous query as a representation of the latency between activities starting execution and completing.

service_requests
(Cluster)

Service Health Monitoring

sum by (temporal_service_type) (restarts)

The number of cluster service restarts, grouped by service.

Useful to monitor since restarts impact performance, and regularly occurring restarts indicate an issue with that service.

restarts
(Cluster)

sum(rate(service_errors_resource_exhausted{}[1m])) by (resource_exhausted_cause)

The number of “resource exhausted” errors reported, grouped by the type of limit being reached.

Useful to determine when tasks aren’t running due to rate limits or system overload.

service_errors_resource_exhausted
(Cluster)

State Transition Monitoring

sum(rate(state_transition_count[1m]))

The number of state transitions persisted to the database in the last minute via the History service. Useful as a measure of the current throughput of the cluster since all workflow & activity progress in Temporal results in state transitions.

state_transition_count (Cluster)

Shard Rebalancing

sum(rate(membership_changed_count[2m]))

The number of times a History shard was rebalanced between hosts in the History cluster in the last two minutes. Useful to monitor since rebalancing is costly; if it's happening outside of scaling events, it indicates those History hosts are having an issue.

Interesting to plot with state_transition_count as a measure of History throughput changing as shards are rebalanced to new hosts.

membership_changed_count
(Cluster)

Worker Performance

avg_over_time(temporal_worker_task_slots_available{namespace="default",worker_type="WorkflowWorker"}[10m])

How many task slots are available for workers to process tasks.

Useful to monitor as hitting 0 indicates that workers are not keeping up with processing tasks.

worker_type can be WorkflowWorker, ActivityWorker, or LocalActivityWorker.

temporal_worker_task_slots_available
(SDK)

sum by (namespace, task_queue) (rate(temporal_workflow_task_schedule_to_start_latency_seconds_bucket[5m]))

The latency from when a workflow task is placed on the task queue to the time a worker picks it up to execute it.

Useful to monitor as a measure of cluster latency and to establish alerts when latency exceeds your SLO.

workflow_task_schedule_to_start_latency
(SDK)

sum by (namespace, task_queue) (rate(temporal_activity_schedule_to_start_latency_seconds_bucket[5m]))

The latency from when an activity task is placed on the task queue to the time a worker picks it up to execute it.

Useful to monitor as a measure of cluster latency and to establish alerts when latency exceeds your SLO.

activity_schedule_to_start_latency
(SDK)

sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))

The “sync match rate“ between the workers and the matching service.

Useful since a high sync match rate reduces task latency and reduces load on the persistence service.

poll_success_sync
poll_success
(Cluster)

This is only a portion of the metrics exported by the Temporal Cluster & SDKs. The complete set of Cluster metrics can be found in the metric_defs.go file in the Temporal source.

Datadog offers a very useful integration (https://docs.datadoghq.com/integrations/temporal/) that provides a dashboard with turnkey visibility into the metrics above.

Temporal also hosts a repo of community supported Grafana dashboards that are an easy way to get started monitoring your cluster: https://github.com/temporalio/dashboards

Apart from these Temporal-specific metrics, you should also monitor typical infrastructure metrics like CPU, memory & network performance for all hosts that are running cluster services.

Until Next Time!

We hope the information on the important architecture concepts, configuration parameters & metrics collected above helps you get your organization's Temporal cluster provisioning off on the right foot! Join us for the next entry in this blog series, where we'll investigate the benefits of choosing different database technologies for your persistence service.

Need more Temporal help?

Our Temporal Consulting experts have experience implementing and optimizing Temporal on an enterprise scale. Schedule a free consultation with our software architects to talk through your Temporal implementation and see if we’d be a good fit for your project.

