Replay Testing To Avoid Non-Determinism in Temporal Workflows

Deploying an updated version of Temporal workflow code can result in errors if there are non-deterministic changes to the code. Determinism is verified during the “replay” process that rebuilds the last known state of an ongoing workflow in order to continue its execution. Rebuilding execution state enables Temporal to support long-sleeping workflows and reliably relocate workflow executions to another worker when one crashes.

Replay doesn't re-execute the commands recorded in the history. Instead, it uses the recorded results to reconstruct the state. While that history is playing back, if updated workflow code attempts a command that wasn't expected at that point in the history, a non-determinism exception will be thrown.
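
To make the mechanism concrete, here is a deliberately simplified toy model of replay. It is not how the Temporal SDK is implemented: the names replay, RecordedEvent, Schedule, and NonDeterminismError are hypothetical, and real histories contain many more event types. It only illustrates the idea that recorded results are handed back to the workflow code and that an unexpected command aborts the replay.

```typescript
// Toy model only: every name here is hypothetical, not part of the Temporal SDK.
type RecordedEvent = { activity: string; result: unknown };
type Schedule = (activity: string) => unknown;

class NonDeterminismError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'NonDeterminismError';
  }
}

// Runs workflow code against a recorded history. Commands that match the history
// receive the recorded result instead of being executed again. Commands issued
// after the history is exhausted are returned as new work to be scheduled.
function replay(history: RecordedEvent[], workflow: (schedule: Schedule) => void): string[] {
  let position = 0;
  const newCommands: string[] = [];

  const schedule: Schedule = (activity) => {
    const recorded = history[position];
    if (recorded === undefined) {
      // Past the end of the history: this is where live execution would resume.
      newCommands.push(activity);
      return undefined;
    }
    if (recorded.activity !== activity) {
      throw new NonDeterminismError(
        `event ${position + 1}: history recorded "${recorded.activity}" but the code requested "${activity}"`
      );
    }
    position += 1;
    return recorded.result;
  };

  workflow(schedule);
  return newCommands;
}

// The code that produced the history.
function originalWorkflow(schedule: Schedule): void {
  const payment = schedule('processPayment') as { successful: boolean } | undefined;
  if (payment?.successful) schedule('sendUserNotification');
}

// The same workflow with a new first command.
function updatedWorkflow(schedule: Schedule): void {
  schedule('isFraudulent');
  const payment = schedule('processPayment') as { successful: boolean } | undefined;
  if (payment?.successful) schedule('sendUserNotification');
}

const history: RecordedEvent[] = [{ activity: 'processPayment', result: { successful: true } }];
```

In this model, replay(history, originalWorkflow) returns ['sendUserNotification'], the next command to run, while replay(history, updatedWorkflow) throws a NonDeterminismError on the very first command.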

We'll review an example of a non-deterministic change and see how we can proactively check for such changes with replay testing. This article will use the TypeScript SDK, but the Temporal SDKs for other languages behave similarly and offer comparable functionality.

It’s worth noting that since replay is a relatively expensive operation, Temporal doesn’t perform a replay every time a workflow execution progresses. Replays are only performed when the worker doesn’t have the execution state in its local cache of workflow states. Common scenarios for state cache misses are workflows picked up by a newly deployed worker, workflows that have been idle for long periods of time, or workflows that moved to a new worker after their original became unavailable.

Example of a Non-Deterministic Change

Consider a hypothetical e-commerce order processing workflow:

async function orderProcessingWorkflow(
  account: Account,
  products: Product[], 
  payment: PaymentInfo
) {
  const processingResult = await processPaymentActivity(payment)

  if (processingResult.successful) {
    await sendUserNotificationActivity(account, products, payment)
    await startShippingWorkflowActivity(products)
  } else { ... }
}

Now imagine this workflow required a change to make an additional call to a 3rd party anti-fraud platform before processing the payment of any new orders:

async function orderProcessingWorkflow(
  account: Account,
  products: Product[], 
  payment: PaymentInfo
) {
  if (await isFraudulentActivity(account, products, payment)) {
    throw ApplicationError.create({ message: 'Fraud detected.', nonRetryable: true })
  }
   
  const processingResult = await processPaymentActivity(payment)

  if (processingResult.successful) { ... } else { ... }
}

When this change is deployed, the worker's replay process would expect the history to begin with isFraudulentActivity, but the recorded history will have started with processPaymentActivity. This mismatch results in a non-determinism exception being thrown on all workflow instances trying to resume execution, for example:

Unhandled rejection { runId: ... } 
  DeterminismViolationError: Replay failed with a nondeterminism error. This means that the workflow code as written is not compatible with the history that was fed in. 
    Details: Workflow activation completion failed: Failure { 
      failure: Some(Failure { 
        message: "Nondeterminism(\"Activity machine does not handle this event: HistoryEvent(id: 5, ScheduleActivityTask)\")", ... ) 
      })
    }

However, non-deterministic changes can often be avoided by writing the new code so that replaying any existing history never issues an unexpected command, as in the sample below.

async function orderProcessingWorkflow(
  account: Account,
  products: Product[], 
  payment: PaymentInfo
) {
  const checkForFraud = payment.checkForFraud ?? false

  if (checkForFraud && await isFraudulentActivity(account, products, payment)) {
    throw ApplicationError.create({ message: 'Fraud detected.', nonRetryable: true })
  }
   
  const processingResult = await processPaymentActivity(payment)

  if (processingResult.successful) { ... } else { ... }
}

In the above example, a new checkForFraud property can optionally be provided on the payment argument. Any new workflow executions can pass in payment.checkForFraud = true and will, in turn, run the new isFraudulentActivity activity. Any ongoing workflow executions will have no payment.checkForFraud property and will skip isFraudulentActivity during their replays, avoiding the non-determinism error.
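
The reason this works is that a workflow's arguments are part of its recorded history, so an execution started before the change replays with the arguments it was originally given. The sketch below isolates that defaulting logic. The PaymentInfo shape, the amount field, and the shouldCheckForFraud helper are assumptions made for illustration, and it assumes the arguments were serialized as JSON.

```typescript
// Hypothetical shape of the payment argument; only checkForFraud comes from the example above.
interface PaymentInfo {
  amount: number;
  checkForFraud?: boolean;
}

// Mirrors the `payment.checkForFraud ?? false` expression in the workflow.
function shouldCheckForFraud(payment: PaymentInfo): boolean {
  return payment.checkForFraud ?? false;
}

// An execution started before the change: its recorded input has no checkForFraud field.
const inFlightPayment = JSON.parse('{"amount": 4200}') as PaymentInfo;

// An execution started after the change opts in explicitly.
const newPayment: PaymentInfo = { amount: 4200, checkForFraud: true };

console.log(shouldCheckForFraud(inFlightPayment)); // false: the fraud check is skipped on replay
console.log(shouldCheckForFraud(newPayment));      // true: the new activity is scheduled
```

Because the branch depends only on the workflow's own input, every replay of a given execution takes the same path, which is what keeps the change deterministic.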

Replay Testing

To proactively guard against non-determinism errors, you can add replay testing to your CI. Replay testing involves simulating the history replay process that the worker will do when resuming execution of a workflow and throwing an error at any non-deterministic changes encountered.

To run the simulation, you'll need some workflow histories to replay. The easiest way to get them is to download a few manually in JSON form via the Temporal CLI:

tctl workflow show -w myWorkflowId -of ./myWorkflowId_history.json

or from the Web UI.

The downloaded histories can then be replayed during testing runs with a simple script:

#!/usr/bin/env -S npx ts-node-esm
// replay_test.ts {history_file}
import fs from 'fs'
import { fileURLToPath } from 'url'
import { Worker } from '@temporalio/worker'

const filePath = process.argv[2]
const history = JSON.parse(await fs.promises.readFile(filePath, 'utf8'))
// fileURLToPath also handles percent-encoded characters and Windows paths
const workflowsPath = fileURLToPath(new URL('../src/workflows.ts', import.meta.url))

await Worker.runReplayHistory({ workflowsPath }, history)

However, in this case, the downloaded histories are yet another static test fixture developers need to keep up to date. As you change your workflow, you'll need to update your saved test histories to ensure your replay testing is representative of the executions that will be resuming on the workers.

Rather than depend on static test fixtures alone, you can dynamically download a sample of histories from the cluster and include those histories in your replay testing:

# samples 5 random active workflow ids and downloads their history
tctl workflow listall --open --workflow_type your_workflow_type_name \
  |  awk '{ print $3; }' \
  |  tail -n+2 \
  |  shuf -n 5 \
  |  xargs -I {} tctl workflow show -w {} -of ./{}_history.json

# replay & test each downloaded history
ls -a | grep _history.json | xargs -L1 ./replay_test.ts

If this script exits successfully (code 0), the sampled running workflows are compatible with the updated code. If any history replay produces an error, xargs reports the failed invocation and the script exits with code 123.
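
If you would rather keep the loop in TypeScript, for example to report every failing history instead of only an exit code, something like the following sketch could work. The findHistoryFiles and replayAll helpers and the ReplayOne type are hypothetical names, and the replay itself is passed in as a function so the sketch stays independent of the SDK. In a real script, that function would read the file and call Worker.runReplayHistory as replay_test.ts does.

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';

// Hypothetical callback: replays one history file and rejects on a non-determinism error.
type ReplayOne = (historyFile: string) => Promise<void>;

// Collects the downloaded histories, matching the *_history.json naming used above.
function findHistoryFiles(dir: string): string[] {
  return fs
    .readdirSync(dir)
    .filter((name) => name.endsWith('_history.json'))
    .sort()
    .map((name) => path.join(dir, name));
}

// Replays every file, continues past failures, and returns the files that failed.
async function replayAll(files: string[], replayOne: ReplayOne): Promise<string[]> {
  const failures: string[] = [];
  for (const file of files) {
    try {
      await replayOne(file);
    } catch (err) {
      failures.push(file);
      console.error(`replay failed for ${file}:`, err);
    }
  }
  return failures;
}

// Possible wiring, assuming the same workflowsPath as replay_test.ts:
//
//   const failures = await replayAll(findHistoryFiles('.'), async (file) => {
//     const history = JSON.parse(await fs.promises.readFile(file, 'utf8'))
//     await Worker.runReplayHistory({ workflowsPath }, history)
//   })
//   if (failures.length > 0) process.exitCode = 1
```

Collecting failures rather than stopping at the first one makes the CI output show every sampled history that the new code cannot replay.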

You can see an example of this testing in action in the replay-testing branch of the Bitovi Temporal Examples repository. After starting a local Temporal test cluster, you can run a worker in one terminal via npm run worker, while in a second shell, you run npm run client.

That client will slowly spawn some simple workflows that will sleep for 5 seconds before they complete their activity. This gives about 40 seconds to run the next command:

./scripts/sample_replay_test.sh

That script will then query the cluster for the list of currently executing workflows, take a sample of them, download their history, and exit with code 0 if all those downloaded histories can be replayed without error.

Parting Thoughts

Non-determinism exceptions are an unexpected complexity for many new Temporal developers, but I hope this article has helped you to understand why this limitation exists. The ability you've gained to test for and avoid these exceptions ahead of time will help give you confidence that your workflow changes are compatible before they're deployed.
