Azure Pipeline Environments and Deployment Strategies
A comprehensive guide to Azure DevOps pipeline environments and deployment strategies including rolling, canary, approvals, lifecycle hooks, and multi-environment configurations.
Overview
Azure DevOps environments are first-class resources in YAML pipelines that represent the actual targets where your code gets deployed -- a Kubernetes cluster, a set of virtual machines, or a logical grouping that gates your releases. They give you approval workflows, deployment history, traceability from commit to production, and most importantly, deployment strategies like rolling and canary that would otherwise require significant custom scripting. If you are deploying anything beyond a toy project and you are not using environments, you are leaving safety and visibility on the table.
Prerequisites
Before working through this article, you should have:
- An Azure DevOps organization and project with Pipelines enabled
- Basic familiarity with YAML pipeline syntax (stages, jobs, steps)
- A Node.js application with a build pipeline already producing artifacts
- Access to create environments in your Azure DevOps project (Project Administrator or Environment Creator role)
- Optionally, an Azure subscription with an AKS cluster or VMs for resource targets
What Are Environments in Azure DevOps?
An environment in Azure DevOps is not just a label you slap on a stage. It is a first-class resource that carries its own configuration, permissions, approval gates, and deployment history. When you reference an environment in a deployment job, Azure DevOps tracks every deployment to that environment, records which pipeline run deployed which commit, and enforces whatever checks you have configured before allowing the deployment to proceed.
This is fundamentally different from using stage-level variables or naming conventions to represent your environments. With a proper environment resource, you get:
- Deployment history -- a full audit trail of what was deployed, when, by whom, and from which commit
- Approval and check gates -- human approvals, business hours restrictions, branch control, and exclusive locks
- Resource targeting -- direct association with Kubernetes namespaces or VM pools
- Deployment strategies -- built-in rolling, canary, and runOnce strategies with lifecycle hooks
Think of environments as the deployment equivalent of service connections. They are a managed resource that you configure once and reference across pipelines.
Creating and Configuring Environments
You can create environments through the Azure DevOps UI or let your pipeline create them automatically on first reference. I recommend creating them explicitly through the UI so you can configure approvals and checks before the first deployment ever runs.
To create an environment through the UI, navigate to Pipelines > Environments > New Environment. Give it a name, an optional description, and choose whether to add resource targets immediately or leave it as a logical grouping.
In your YAML pipeline, you reference an environment in a deployment job:
```yaml
stages:
- stage: DeployStaging
  displayName: 'Deploy to Staging'
  jobs:
  - deployment: DeployStagingJob
    displayName: 'Deploy to Staging Environment'
    environment: 'staging'
    strategy:
      runOnce:
        deploy:
          steps:
          - script: echo "Deploying to staging"
```
Notice that the job type is deployment, not job. This is critical. A regular job does not support environment references, deployment strategies, or lifecycle hooks. The deployment job type is purpose-built for this.
If the environment named staging does not exist when this pipeline first runs, Azure DevOps creates it automatically. But it will be created with no approvals or checks, which is rarely what you want for anything beyond development.
Environment Approvals and Checks
Approvals and checks are the enforcement mechanism that makes environments useful for governance. You configure them on the environment itself, not in the pipeline YAML. This means that no matter which pipeline deploys to production, the same approval gates apply.
Manual Approvals
The most common check. Navigate to your environment, click the three-dot menu, select Approvals and checks, and add an Approvals check. You specify one or more approvers, set a timeout (how long the approval request stays active before auto-rejecting), and optionally allow the approver to defer the deployment.
A practical configuration for a production environment:
- Approvers: Your team lead and a senior engineer (require any one to approve)
- Timeout: 72 hours (gives people time across weekends)
- Instructions: "Verify staging deployment passed smoke tests before approving production"
Branch Control
Branch control restricts which branches can deploy to an environment. For production, you almost always want to restrict deployments to main or master:
In the environment checks, add a Branch control check:
- Allowed branches: refs/heads/main, refs/heads/release/*

This prevents feature branches from accidentally deploying to production.
Business Hours
The business hours check prevents deployments outside of specified time windows. This is useful for production environments where you want deployments to happen only when the full team is available to respond to incidents:
- Time zone: Your team's primary timezone
- Business days: Monday through Friday
- Start time: 09:00
- End time: 16:00
Exclusive Lock
The exclusive lock check ensures only one pipeline run deploys to an environment at a time. When multiple runs target the same environment, they are serialized -- the second run waits until the first completes. You can configure this with two behaviors:
- Sequential -- runs queue up and execute in order
- Latest only -- only the most recent queued run proceeds, older queued runs are canceled
For production deployments, I almost always use exclusive lock with "latest only." If three commits are waiting to deploy, I only care about the most recent one.
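The lock check itself lives on the environment, but how queued runs behave can also be declared in YAML with the lockBehavior property at the pipeline or stage level. A minimal sketch (stage and environment names are illustrative):

```yaml
stages:
- stage: DeployProduction
  lockBehavior: runLatest   # cancel older queued runs; use 'sequential' to run them in order
  jobs:
  - deployment: Deploy
    environment: 'production'   # Exclusive Lock check must be enabled on this environment
    strategy:
      runOnce:
        deploy:
          steps:
          - script: echo "Deploying latest queued run"
```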
Deployment Strategies
This is where environments become genuinely powerful. Azure DevOps supports three built-in deployment strategies, each with lifecycle hooks that let you run custom logic at specific points during the deployment.
runOnce
The simplest strategy. It deploys to all targets at once with no incremental rollout. Use this for development environments or any deployment target where you do not need gradual rollout.
```yaml
strategy:
  runOnce:
    deploy:
      steps:
      - task: AzureWebApp@1
        inputs:
          azureSubscription: 'my-azure-connection'
          appName: 'my-node-app-dev'
          package: '$(Pipeline.Workspace)/drop/*.zip'
```
Rolling Deployment
A rolling deployment updates targets in batches. If you have 10 VMs, you can deploy to 2 at a time, verify each batch is healthy, and then move to the next batch. If a batch fails, the remaining batches are not updated.
```yaml
strategy:
  rolling:
    maxParallel: 2
    deploy:
      steps:
      - script: |
          echo "Deploying to $(Environment.ResourceName)"
          npm install --production
          pm2 restart my-app
        displayName: 'Deploy application'
    routeTraffic:
      steps:
      - script: |
          echo "Routing traffic to updated instance"
        displayName: 'Route traffic'
    postRouteTraffic:
      steps:
      - script: |
          echo "Running health check on $(Environment.ResourceName)"
          curl --fail http://$(Environment.ResourceName):8080/health
        displayName: 'Validate health'
    on:
      failure:
        steps:
        - script: |
            echo "Deployment failed on $(Environment.ResourceName)"
            echo "Rolling back..."
            pm2 restart my-app --update-env
          displayName: 'Rollback on failure'
      success:
        steps:
        - script: echo "Batch deployment succeeded"
          displayName: 'Confirm success'
```
The maxParallel property controls how many targets receive the update simultaneously. You can specify an absolute number (maxParallel: 2) or a percentage (maxParallel: 25%). For a 10-node cluster, maxParallel: 2 means deploy to 2 nodes at a time, verify, then move on.
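As a sketch, the same strategy block using a percentage rather than an absolute count:

```yaml
strategy:
  rolling:
    maxParallel: 25%   # one quarter of registered targets per batch
    deploy:
      steps:
      - script: echo "Deploying to $(Environment.ResourceName)"
```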
Canary Deployment
Canary deployments route a small percentage of traffic to the new version first, validate it, and then incrementally roll out to more targets. This is the safest strategy for production deployments of critical services.
```yaml
strategy:
  canary:
    increments: [10, 25, 50, 100]
    preDeploy:
      steps:
      - script: echo "Preparing canary deployment - current increment $(Strategy.CycleSize)"
        displayName: 'Pre-deploy canary'
    deploy:
      steps:
      - script: |
          echo "Deploying canary to $(Strategy.CycleSize)% of targets"
        displayName: 'Deploy canary increment'
    routeTraffic:
      steps:
      - script: |
          echo "Routing $(Strategy.CycleSize)% of traffic to canary"
        displayName: 'Route traffic to canary'
    postRouteTraffic:
      steps:
      - script: |
          echo "Monitoring canary health for 5 minutes..."
          sleep 300
          curl --fail http://my-app-canary:8080/health
        displayName: 'Validate canary health'
    on:
      failure:
        steps:
        - script: |
            echo "Canary failed at $(Strategy.CycleSize)%. Rolling back."
          displayName: 'Canary rollback'
      success:
        steps:
        - script: echo "Canary deployment completed successfully"
          displayName: 'Canary success'
```
The increments array defines the rollout percentages. With [10, 25, 50, 100], the pipeline first deploys to 10% of targets, validates, then 25%, validates, then 50%, and finally 100%. If validation fails at any increment, the on.failure hook runs and the rollout stops.
Lifecycle Hooks in Detail
Every deployment strategy supports lifecycle hooks that execute at specific points during the deployment. Understanding these hooks is essential for building robust deployment pipelines.
| Hook | When It Runs | Typical Use |
|---|---|---|
| preDeploy | Before the deployment starts | Database backups, feature flag checks, snapshot creation |
| deploy | The main deployment step | Actual application deployment |
| routeTraffic | After deploy, before validation | Load balancer updates, DNS changes, traffic shifting |
| postRouteTraffic | After traffic is routed | Health checks, smoke tests, integration tests |
| on.success | After all steps succeed | Notifications, cleanup, metric annotations |
| on.failure | If any step fails | Rollback, alerting, incident creation |
Here is a practical example showing all hooks for a Node.js application:
```yaml
strategy:
  runOnce:
    preDeploy:
      steps:
      - script: |
          echo "Creating database backup before deployment..."
          mongodump --uri="$(MONGO_URI)" --out=/tmp/backup-$(Build.BuildId)
        displayName: 'Backup database'
    deploy:
      steps:
      - task: DownloadPipelineArtifact@2
        inputs:
          buildType: 'current'
          artifactName: 'drop'
          targetPath: '$(Pipeline.Workspace)/drop'
      - script: |
          cd $(Pipeline.Workspace)/drop
          npm install --production
          pm2 stop my-app || true
          pm2 start app.js --name my-app
        displayName: 'Deploy Node.js app'
    routeTraffic:
      steps:
      - script: |
          echo "Updating load balancer to include new deployment"
          az network lb rule update \
            --resource-group my-rg \
            --lb-name my-lb \
            --name my-rule \
            --backend-pool-name new-pool
        displayName: 'Update load balancer'
    postRouteTraffic:
      steps:
      - script: |
          echo "Running smoke tests..."
          node tests/smoke-test.js
        displayName: 'Run smoke tests'
      - script: |
          echo "Checking health endpoint..."
          for i in 1 2 3 4 5; do
            STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://my-app:8080/health)
            if [ "$STATUS" = "200" ]; then
              echo "Health check $i passed"
            else
              echo "Health check $i failed with status $STATUS"
              exit 1
            fi
            sleep 10
          done
        displayName: 'Health check validation'
    on:
      failure:
        steps:
        - script: |
            echo "Deployment failed. Restoring database backup..."
            mongorestore --uri="$(MONGO_URI)" /tmp/backup-$(Build.BuildId)
            pm2 restart my-app-previous || true
          displayName: 'Rollback deployment'
        - task: SendSlackNotification@1
          inputs:
            channel: '#deployments'
            message: 'FAILED: Deployment $(Build.BuildNumber) to $(Environment.Name)'
      success:
        steps:
        - script: |
            echo "Deployment successful. Cleaning up old backups..."
            rm -rf /tmp/backup-$(Build.BuildId)
          displayName: 'Cleanup'
```
Resource Targets
Environments can be associated with specific infrastructure targets. The two supported resource types are Kubernetes and Virtual Machines.
Kubernetes Resources
When you add a Kubernetes resource to an environment, Azure DevOps connects directly to your cluster and can deploy to a specific namespace:
```yaml
environment: 'production.my-app-namespace'
```
The dot notation (environment.namespace) targets a specific Kubernetes namespace within the environment. This is powerful because it means your environment checks (approvals, branch control) apply at the namespace level.
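The dot notation is shorthand for the long form, which reads more clearly when you need additional properties. A sketch with illustrative resource names:

```yaml
environment:
  name: 'production'
  resourceName: 'my-app-namespace'   # the Kubernetes resource registered on the environment
  resourceType: Kubernetes
```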
To add a Kubernetes resource, go to your environment in the UI, click Add resource, select Kubernetes, and provide your cluster connection details. You can connect via Azure Kubernetes Service (direct integration) or a generic Kubernetes service connection.
Virtual Machine Resources
For VM-based deployments, you install the Azure Pipelines agent on each target VM and register it with an environment. This is how rolling deployments work -- the pipeline distributes the deployment across registered VMs.
To register a VM, go to your environment, click Add resource, select Virtual machines, and follow the registration script. The script installs the pipeline agent and registers the VM with your environment. You can tag VMs to target specific subsets:
```yaml
environment:
  name: 'production'
  resourceType: VirtualMachine
  tags: 'web-server'
```
This targets only VMs tagged as web-server in the production environment.
Environment History and Traceability
One of the most underrated features of environments is the deployment history view. Navigate to Pipelines > Environments > [your environment], and you get a chronological list of every deployment, including:
- Which pipeline and run number
- Which commit triggered the deployment
- Who approved it (if approvals are configured)
- Whether it succeeded or failed
- The duration of the deployment
This is invaluable during incident response. When something breaks in production, you can immediately see what was deployed recently and trace it back to the exact commit. You do not need to cross-reference build logs, deployment scripts, and git history manually -- it is all in one place.
You can also use the Azure DevOps REST API to query environment deployment history programmatically:
```javascript
var https = require("https");

var org = "my-org";
var project = "my-project";
var envId = 5;
var pat = process.env.AZURE_DEVOPS_PAT;

var options = {
  hostname: "dev.azure.com",
  path: "/" + org + "/" + project + "/_apis/distributedtask/environments/" + envId + "/environmentdeploymentrecords?api-version=7.1",
  headers: {
    "Authorization": "Basic " + Buffer.from(":" + pat).toString("base64")
  }
};

https.get(options, function(res) {
  var data = "";
  res.on("data", function(chunk) {
    data += chunk;
  });
  res.on("end", function() {
    var records = JSON.parse(data);
    records.value.forEach(function(record) {
      console.log(record.definition.name + " - " + record.result + " - " + record.startTime);
    });
  });
});
```
Environment Permissions and Security
Environments have their own permission model, separate from pipeline permissions. You can control:
- Creator -- who can create new environments
- Reader -- who can view the environment and its deployment history
- User -- who can reference the environment in their pipelines
- Administrator -- who can manage approvals, checks, and permissions
For a mature setup, I recommend:
- Restrict environment creation to Project Administrators
- Grant User role on development environments broadly (all developers)
- Grant User role on staging environments to your team leads
- Grant User role on production environments only to the release pipeline service account
- Grant Reader role on all environments to the broader team for visibility
You can also set pipeline-level permissions on environments. This restricts which specific pipelines can target an environment. Navigate to the environment, click Security, and under Pipeline permissions, add specific pipelines or choose "All pipelines."
Combining Environments with Variable Groups
A common pattern is pairing environments with variable groups to inject environment-specific configuration. Each environment references a different variable group containing connection strings, API keys, and feature flags appropriate for that stage:
```yaml
stages:
- stage: DeployDev
  variables:
  - group: 'app-config-dev'
  jobs:
  - deployment: Deploy
    environment: 'dev'
    strategy:
      runOnce:
        deploy:
          steps:
          - script: |
              echo "Deploying with DB_HOST=$(DB_HOST)"
              echo "Feature flags: $(FEATURE_FLAGS)"
            displayName: 'Deploy with env-specific config'

- stage: DeployProd
  variables:
  - group: 'app-config-prod'
  jobs:
  - deployment: Deploy
    environment: 'production'
    strategy:
      runOnce:
        deploy:
          steps:
          - script: |
              echo "Deploying with DB_HOST=$(DB_HOST)"
            displayName: 'Deploy with prod config'
```
Link your variable groups to Azure Key Vault for secrets. Never store connection strings or API keys directly in variable groups as plaintext.
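You can also fetch secrets at runtime with the AzureKeyVault task instead of (or in addition to) a linked variable group. A sketch, where the vault name and service connection are assumptions:

```yaml
steps:
- task: AzureKeyVault@2
  inputs:
    azureSubscription: 'my-azure-connection'   # assumed service connection name
    KeyVaultName: 'my-keyvault'                # assumed Key Vault name
    SecretsFilter: 'DB-PASSWORD,API-KEY'       # comma-separated secret names, or '*' for all
  displayName: 'Fetch secrets from Key Vault'
- script: echo "Fetched secrets are available as masked pipeline variables"
  displayName: 'Use secrets'
```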
Complete Working Example
Here is a complete multi-environment YAML pipeline for a Node.js application. It deploys to three environments: dev (runOnce), staging (rolling), and production (canary with health checks and automatic rollback).
```yaml
# azure-pipelines.yml
trigger:
  branches:
    include:
    - main
  paths:
    exclude:
    - '*.md'
    - 'docs/**'

pool:
  vmImage: 'ubuntu-latest'

variables:
  nodeVersion: '20.x'
  artifactName: 'node-app'

stages:
# ==========================================
# BUILD STAGE
# ==========================================
- stage: Build
  displayName: 'Build and Test'
  jobs:
  - job: BuildJob
    displayName: 'Build Node.js Application'
    steps:
    - task: NodeTool@0
      inputs:
        versionSpec: '$(nodeVersion)'
      displayName: 'Install Node.js'
    - script: npm ci
      displayName: 'Install dependencies'
    - script: npm run lint
      displayName: 'Run linter'
    - script: npm test
      displayName: 'Run unit tests'
    - script: npm run build --if-present
      displayName: 'Build application'
    - task: ArchiveFiles@2
      inputs:
        rootFolderOrFile: '$(System.DefaultWorkingDirectory)'
        includeRootFolder: false
        archiveType: 'zip'
        archiveFile: '$(Build.ArtifactStagingDirectory)/$(artifactName)-$(Build.BuildId).zip'
        replaceExistingArchive: true
      displayName: 'Archive application'
    - publish: '$(Build.ArtifactStagingDirectory)/$(artifactName)-$(Build.BuildId).zip'
      artifact: '$(artifactName)'
      displayName: 'Publish artifact'

# ==========================================
# DEV DEPLOYMENT - runOnce
# ==========================================
- stage: DeployDev
  displayName: 'Deploy to Dev'
  dependsOn: Build
  condition: succeeded()
  variables:
  - group: 'app-config-dev'
  jobs:
  - deployment: DeployDevJob
    displayName: 'Deploy to Dev Environment'
    environment: 'dev'
    strategy:
      runOnce:
        preDeploy:
          steps:
          - script: echo "Starting dev deployment for build $(Build.BuildId)"
            displayName: 'Pre-deploy notification'
        deploy:
          steps:
          - download: current
            artifact: '$(artifactName)'
          - task: ExtractFiles@1
            inputs:
              archiveFilePatterns: '$(Pipeline.Workspace)/$(artifactName)/*.zip'
              destinationFolder: '$(Pipeline.Workspace)/extracted'
            displayName: 'Extract artifact'
          - task: AzureWebApp@1
            inputs:
              azureSubscription: 'azure-dev-connection'
              appType: 'webAppLinux'
              appName: 'my-node-app-dev'
              package: '$(Pipeline.Workspace)/$(artifactName)/*.zip'
              runtimeStack: 'NODE|20-lts'
              startUpCommand: 'npm start'
            displayName: 'Deploy to Azure App Service (Dev)'
        postRouteTraffic:
          steps:
          - script: |
              echo "Running dev smoke tests..."
              RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" https://my-node-app-dev.azurewebsites.net/health)
              if [ "$RESPONSE" != "200" ]; then
                echo "Health check failed with status $RESPONSE"
                exit 1
              fi
              echo "Dev deployment healthy"
            displayName: 'Dev health check'
        on:
          failure:
            steps:
            - script: echo "##vso[task.logissue type=warning]Dev deployment failed for build $(Build.BuildId)"
              displayName: 'Log deployment failure'

# ==========================================
# STAGING DEPLOYMENT - Rolling
# ==========================================
- stage: DeployStaging
  displayName: 'Deploy to Staging (Rolling)'
  dependsOn: DeployDev
  condition: succeeded()
  variables:
  - group: 'app-config-staging'
  jobs:
  - deployment: DeployStagingJob
    displayName: 'Rolling Deploy to Staging'
    environment:
      name: 'staging'
      resourceType: VirtualMachine
      tags: 'web-tier'
    strategy:
      rolling:
        maxParallel: 2
        preDeploy:
          steps:
          - script: |
              echo "Pre-deploy on $(Environment.ResourceName)"
              echo "Current app version:"
              pm2 describe my-node-app 2>/dev/null | head -5 || echo "App not yet deployed"
            displayName: 'Capture current state'
        deploy:
          steps:
          - download: current
            artifact: '$(artifactName)'
          - script: |
              echo "Deploying to $(Environment.ResourceName)..."
              APP_DIR=/opt/my-node-app
              # Backup current version
              if [ -d "$APP_DIR" ]; then
                cp -r $APP_DIR ${APP_DIR}-backup-$(Build.BuildId)
              fi
              # Extract new version
              mkdir -p $APP_DIR
              unzip -o $(Pipeline.Workspace)/$(artifactName)/*.zip -d $APP_DIR
              # Install production dependencies
              cd $APP_DIR
              npm ci --production
              # Restart application
              pm2 stop my-node-app 2>/dev/null || true
              pm2 start app.js --name my-node-app --env production
              pm2 save
            displayName: 'Deploy and restart application'
        routeTraffic:
          steps:
          - script: |
              echo "Enabling traffic to $(Environment.ResourceName)"
              # Re-register with load balancer
              az network lb address-pool address add \
                --resource-group my-rg \
                --lb-name staging-lb \
                --pool-name staging-pool \
                --name $(Environment.ResourceName) \
                --ip-address $(Environment.ResourceName)
            displayName: 'Add to load balancer'
        postRouteTraffic:
          steps:
          - script: |
              echo "Health check on $(Environment.ResourceName)..."
              RETRIES=5
              DELAY=10
              for i in $(seq 1 $RETRIES); do
                STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://$(Environment.ResourceName):8080/health)
                if [ "$STATUS" = "200" ]; then
                  echo "Health check $i/$RETRIES passed"
                else
                  echo "Health check $i/$RETRIES failed (HTTP $STATUS)"
                  if [ "$i" = "$RETRIES" ]; then
                    echo "All health checks failed. Marking deployment as failed."
                    exit 1
                  fi
                fi
                sleep $DELAY
              done
            displayName: 'Validate health'
        on:
          failure:
            steps:
            - script: |
                echo "Rolling back $(Environment.ResourceName)..."
                APP_DIR=/opt/my-node-app
                BACKUP_DIR=${APP_DIR}-backup-$(Build.BuildId)
                if [ -d "$BACKUP_DIR" ]; then
                  pm2 stop my-node-app || true
                  rm -rf $APP_DIR
                  mv $BACKUP_DIR $APP_DIR
                  cd $APP_DIR
                  pm2 start app.js --name my-node-app --env production
                  echo "Rollback complete on $(Environment.ResourceName)"
                else
                  echo "##vso[task.logissue type=error]No backup found for rollback on $(Environment.ResourceName)"
                fi
              displayName: 'Rollback on failure'
          success:
            steps:
            - script: |
                echo "Cleaning up backup for $(Environment.ResourceName)"
                rm -rf /opt/my-node-app-backup-$(Build.BuildId)
              displayName: 'Cleanup backup'

# ==========================================
# PRODUCTION DEPLOYMENT - Canary
# ==========================================
- stage: DeployProduction
  displayName: 'Deploy to Production (Canary)'
  dependsOn: DeployStaging
  condition: succeeded()
  variables:
  - group: 'app-config-production'
  jobs:
  - deployment: DeployProductionJob
    displayName: 'Canary Deploy to Production'
    environment:
      name: 'production'
      resourceType: VirtualMachine
      tags: 'web-tier'
    strategy:
      canary:
        increments: [10, 25, 50, 100]
        preDeploy:
          steps:
          - script: |
              echo "========================================"
              echo "PRODUCTION CANARY DEPLOYMENT"
              echo "Build: $(Build.BuildId)"
              echo "Increment: $(Strategy.CycleSize)%"
              echo "========================================"
            displayName: 'Canary pre-deploy'
        deploy:
          steps:
          - download: current
            artifact: '$(artifactName)'
          - script: |
              echo "Deploying to $(Strategy.CycleSize)% of production targets"
              echo "Target: $(Environment.ResourceName)"
              APP_DIR=/opt/my-node-app
              # Create timestamped backup
              TIMESTAMP=$(date +%Y%m%d%H%M%S)
              cp -r $APP_DIR ${APP_DIR}-backup-${TIMESTAMP}
              # Deploy new version
              unzip -o $(Pipeline.Workspace)/$(artifactName)/*.zip -d $APP_DIR
              cd $APP_DIR
              npm ci --production
              pm2 stop my-node-app || true
              pm2 start app.js --name my-node-app --env production
              pm2 save
              echo "BACKUP_DIR=${APP_DIR}-backup-${TIMESTAMP}" > /tmp/deploy-meta.env
            displayName: 'Deploy canary increment'
        routeTraffic:
          steps:
          - script: |
              echo "Routing $(Strategy.CycleSize)% traffic to canary on $(Environment.ResourceName)"
            displayName: 'Route traffic to canary'
        postRouteTraffic:
          steps:
          - script: |
              echo "Monitoring canary at $(Strategy.CycleSize)% for 3 minutes..."
              CHECKS=6
              INTERVAL=30
              FAILURES=0
              MAX_FAILURES=2
              for i in $(seq 1 $CHECKS); do
                # Check HTTP status
                STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://$(Environment.ResourceName):8080/health)
                # Check response time
                RESPONSE_TIME=$(curl -s -o /dev/null -w "%{time_total}" http://$(Environment.ResourceName):8080/health)
                echo "Check $i/$CHECKS: HTTP $STATUS, Response time: ${RESPONSE_TIME}s"
                if [ "$STATUS" != "200" ]; then
                  FAILURES=$((FAILURES + 1))
                  echo "##vso[task.logissue type=warning]Health check failed ($FAILURES/$MAX_FAILURES allowed)"
                fi
                # Check if response time exceeds threshold (2 seconds)
                SLOW=$(echo "$RESPONSE_TIME > 2.0" | bc -l 2>/dev/null || echo "0")
                if [ "$SLOW" = "1" ]; then
                  FAILURES=$((FAILURES + 1))
                  echo "##vso[task.logissue type=warning]Response time too slow: ${RESPONSE_TIME}s"
                fi
                if [ "$FAILURES" -ge "$MAX_FAILURES" ]; then
                  echo "##vso[task.logissue type=error]Too many failures. Canary is unhealthy."
                  exit 1
                fi
                sleep $INTERVAL
              done
              echo "Canary healthy at $(Strategy.CycleSize)%"
            displayName: 'Validate canary health'
          - script: |
              echo "Checking error rate in application logs..."
              ERROR_COUNT=$(pm2 logs my-node-app --nostream --lines 100 2>&1 | grep -c "ERROR" || echo "0")
              echo "Errors in last 100 log lines: $ERROR_COUNT"
              if [ "$ERROR_COUNT" -gt "10" ]; then
                echo "##vso[task.logissue type=error]Error rate too high: $ERROR_COUNT errors"
                exit 1
              fi
            displayName: 'Check error rates'
        on:
          failure:
            steps:
            - script: |
                echo "CANARY FAILED at $(Strategy.CycleSize)%"
                echo "Initiating rollback on $(Environment.ResourceName)..."
                if [ -f /tmp/deploy-meta.env ]; then
                  source /tmp/deploy-meta.env
                  if [ -d "$BACKUP_DIR" ]; then
                    pm2 stop my-node-app || true
                    rm -rf /opt/my-node-app
                    mv $BACKUP_DIR /opt/my-node-app
                    cd /opt/my-node-app
                    pm2 start app.js --name my-node-app --env production
                    echo "Rollback complete."
                  fi
                fi
              displayName: 'Rollback canary'
            - script: |
                echo "Sending failure notification..."
                curl -X POST "$(SLACK_WEBHOOK_URL)" \
                  -H "Content-Type: application/json" \
                  -d "{\"text\":\"PRODUCTION CANARY FAILED at $(Strategy.CycleSize)% - Build $(Build.BuildId) rolled back automatically\"}"
              displayName: 'Notify team of failure'
          success:
            steps:
            - script: |
                echo "Production canary deployment complete!"
                echo "All increments deployed successfully."
                curl -X POST "$(SLACK_WEBHOOK_URL)" \
                  -H "Content-Type: application/json" \
                  -d "{\"text\":\"Production deployment $(Build.BuildId) completed successfully via canary rollout\"}"
              displayName: 'Notify success'
```
The health check script for the Node.js application referenced in the pipeline:
```javascript
// health.js - Health check endpoint
var express = require("express");
var os = require("os");

var router = express.Router();
var startTime = Date.now();

router.get("/health", function(req, res) {
  var uptime = Date.now() - startTime;
  var memUsage = process.memoryUsage();
  var health = {
    status: "healthy",
    uptime: Math.floor(uptime / 1000),
    timestamp: new Date().toISOString(),
    hostname: os.hostname(),
    memory: {
      rss: Math.floor(memUsage.rss / 1024 / 1024) + "MB",
      heapUsed: Math.floor(memUsage.heapUsed / 1024 / 1024) + "MB",
      heapTotal: Math.floor(memUsage.heapTotal / 1024 / 1024) + "MB"
    },
    version: process.env.npm_package_version || "unknown",
    nodeVersion: process.version
  };

  // Check critical dependencies
  var checks = [];

  // Database connectivity check
  checks.push(checkDatabase());

  Promise.all(checks).then(function(results) {
    var allHealthy = results.every(function(r) { return r.healthy; });
    health.checks = results;
    if (allHealthy) {
      res.status(200).json(health);
    } else {
      health.status = "degraded";
      res.status(503).json(health);
    }
  }).catch(function(err) {
    health.status = "unhealthy";
    health.error = err.message;
    res.status(503).json(health);
  });
});

function checkDatabase() {
  var mongoose = require("mongoose");
  return new Promise(function(resolve) {
    var state = mongoose.connection.readyState;
    resolve({
      name: "database",
      healthy: state === 1,
      state: state === 1 ? "connected" : "disconnected"
    });
  });
}

module.exports = router;
```
And a simple smoke test script referenced in the postRouteTraffic hook:
```javascript
// tests/smoke-test.js
var http = require("http");

var BASE_URL = process.env.APP_URL || "http://localhost:8080";

var tests = [
  { path: "/health", expectedStatus: 200 },
  { path: "/", expectedStatus: 200 },
  { path: "/api/status", expectedStatus: 200 },
  { path: "/nonexistent", expectedStatus: 404 }
];

var passed = 0;
var failed = 0;

function runTest(test, callback) {
  var url = BASE_URL + test.path;
  http.get(url, function(res) {
    if (res.statusCode === test.expectedStatus) {
      console.log("PASS: " + test.path + " returned " + res.statusCode);
      passed++;
    } else {
      console.log("FAIL: " + test.path + " expected " + test.expectedStatus + " got " + res.statusCode);
      failed++;
    }
    callback();
  }).on("error", function(err) {
    console.log("FAIL: " + test.path + " error: " + err.message);
    failed++;
    callback();
  });
}

function runAll(index) {
  if (index >= tests.length) {
    console.log("\nResults: " + passed + " passed, " + failed + " failed");
    process.exit(failed > 0 ? 1 : 0);
    return;
  }
  runTest(tests[index], function() {
    runAll(index + 1);
  });
}

runAll(0);
```
Common Issues and Troubleshooting
1. Environment Not Found After Rename
Error: Environment [old-name] does not exist or has not been authorized for use.
This happens when you rename an environment in the UI but forget to update the YAML pipeline. Environment references in YAML use the name, not an internal ID. After renaming, update every pipeline that references the old name. There is no automatic redirect.
Fix: Search your repository for the old environment name and update all references:
```yaml
# Before
environment: 'old-name'

# After
environment: 'new-name'
```
2. Deployment Job Stuck Waiting for Approval
Error: The pipeline shows "Waiting for approval" indefinitely with no notification sent.
This usually means the approval check is configured but the specified approvers do not have notification subscriptions enabled, or the approvers were set to a team that has since been deleted.
Fix: Navigate to the environment, check the approvals configuration, verify the approvers still exist, and check their notification settings under User Settings > Notifications. Also verify you have not set the approval timeout to an unreasonably long value.
3. Rolling Deployment Targets Zero VMs
Error: No resources found in environment 'staging' matching the specified tags.
This happens when you specify resourceType: VirtualMachine with tags, but no VMs in the environment have matching tags, or the VM agents are offline.
Fix: Go to your environment in the UI and check the Resources tab. Verify that VMs are registered, online, and have the correct tags assigned. A common mistake is registering VMs with the agent but forgetting to add tags:
```yaml
# This requires VMs tagged 'web-tier' to exist in the environment
environment:
  name: 'staging'
  resourceType: VirtualMachine
  tags: 'web-tier'
```
4. Canary Increments Not Working as Expected
Error: The strategy 'canary' is not supported for the pool type 'vmImage'.
Canary and rolling strategies require VM or Kubernetes resource targets. They do not work with Microsoft-hosted agents (vmImage: 'ubuntu-latest'). The strategy needs actual infrastructure targets to distribute the deployment across.
Fix: Add VM resources to your environment and specify resourceType: VirtualMachine in your deployment job, or use a Kubernetes resource. If you want to simulate canary behavior with Azure App Service, use the App Service deployment slots feature instead of the pipeline-level canary strategy.
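If you go the deployment slots route, a minimal sketch of the slot-based pattern (app, resource group, and slot names are assumptions): deploy to a secondary slot, validate it, then swap it into production.

```yaml
steps:
- task: AzureWebApp@1
  inputs:
    azureSubscription: 'my-azure-connection'   # assumed service connection
    appName: 'my-node-app'                     # assumed App Service name
    deployToSlotOrASE: true
    resourceGroupName: 'my-rg'
    slotName: 'canary'                         # assumed slot name
    package: '$(Pipeline.Workspace)/drop/*.zip'
  displayName: 'Deploy to canary slot'
- task: AzureAppServiceManage@0
  inputs:
    azureSubscription: 'my-azure-connection'
    action: 'Swap Slots'
    webAppName: 'my-node-app'
    resourceGroupName: 'my-rg'
    sourceSlot: 'canary'                       # swaps canary into the production slot
  displayName: 'Swap canary into production'
```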
5. Exclusive Lock Causing Pipeline Queue Backup
Error: Multiple pipeline runs pile up in a "Waiting" state with the message Waiting for exclusive lock on environment 'production'.
When you have exclusive lock configured and frequent commits, runs stack up. If you are using the "Sequential" lock behavior, every single run will execute in order, which can create long queues.
Fix: Switch the exclusive lock to "Latest only" behavior. This cancels intermediate queued runs and only executes the most recent one, which is almost always what you want for continuous delivery pipelines.
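In YAML, this is controlled by the `lockBehavior` keyword, which can be set at the pipeline or stage level. A minimal sketch:

```yaml
# Controls how runs contend for exclusive lock checks:
# 'runLatest' (the "Latest only" behavior) cancels intermediate queued runs,
# 'sequential' executes every queued run in order.
lockBehavior: runLatest
```

With `runLatest`, if three runs queue up behind a long production deployment, only the newest of the three actually deploys.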
6. Lifecycle Hook Steps Failing Silently
Symptom: The deployment reports success even though postRouteTraffic checks should have caught issues.
This can happen if your health check script has a bug that causes it to exit with code 0 even on failure. A common mistake is using curl without --fail -- by default, curl returns exit code 0 even when the server responds with HTTP 500.
Fix: Always use curl --fail or explicitly check the HTTP status code:
# Wrong - exits 0 even on HTTP 500
curl http://my-app/health
# Correct - exits non-zero on HTTP 4xx/5xx
curl --fail http://my-app/health
# Most reliable - explicit status check
STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://my-app/health)
if [ "$STATUS" != "200" ]; then
  echo "Health check failed with HTTP $STATUS"
  exit 1
fi
Best Practices
Create environments explicitly through the UI before first use. Do not rely on auto-creation from YAML. Auto-created environments have no approvals, no checks, and no security controls. Set up approvals and branch control before the first deployment ever targets the environment.
Use exclusive locks with "latest only" on production. There is almost never a reason to deploy an older commit when a newer one is already queued. The "latest only" behavior ensures you always deploy the most current version and avoids unnecessary queue buildup.
Put health checks in postRouteTraffic, not in deploy. The deploy hook should handle the actual deployment. Validation belongs in postRouteTraffic so that the pipeline's failure handling can distinguish between deployment failures and health validation failures. This separation makes rollback logic cleaner.
Back up before deploying, not after. Always create a backup or snapshot in the preDeploy hook, before any changes are made. If the deployment itself fails partway through, you need a clean state to roll back to. Backups created after deployment has started may capture a corrupted state.
Use variable groups linked to Key Vault for secrets. Never hardcode connection strings, API keys, or credentials in your YAML pipeline. Create a variable group that pulls from Azure Key Vault, and reference it at the stage level. This keeps secrets out of source control and gives you centralized secret rotation.
Start canary increments small. Begin with 5-10% of traffic, not 25%. The whole point of canary is to minimize blast radius. If your first increment is 25%, you have already exposed a quarter of your users to a potentially bad deployment. Use increments like [5, 15, 50, 100] for critical production services.
Set meaningful approval timeouts. A 30-day approval timeout is not a safety measure -- it is a forgotten deployment waiting to surprise someone. Use 24-72 hours for production approvals. If the deployment is not approved within that window, it should require a new pipeline run with fresh artifacts.
Tag your VM resources meaningfully. Use tags like web-tier, api-tier, and worker-tier to target specific subsets of your infrastructure. This lets you deploy web servers and API servers on different schedules using the same environment with different tag filters.
Monitor canary increments for at least 2-3 minutes. A quick health check that runs once is not sufficient. Production issues often manifest under load over time. Run multiple health checks with intervals between them to catch issues that only appear after the application has been handling traffic for a while.
Keep on.failure hooks simple and reliable. Your rollback logic should not depend on external services that might also be experiencing issues. A rollback that calls a third-party API to fetch the previous deployment version is fragile. Use local backups or well-known artifact locations that are guaranteed to be available.
