Zero-Downtime AWS Cross-Account Migration: A Real Playbook

Moving a live production workload from one AWS account to another is one of those projects that looks deceptively simple in a slide deck and turns into a six-week obstacle course in practice. The vendor blog posts treat it as “use AWS Application Migration Service and you’re done.” The reality involves RDS replication strategy, transparent user authentication migration, overlapping VPC CIDR ranges, CloudFormation drift, and pipelines that all need to point at new ARNs.

This post is the playbook we actually used for a 2026 migration — moving a production SaaS workload (Cognito-backed React SPA, multiple RDS instances, Lambda functions, S3 buckets, and three CI/CD pipelines) from one AWS account to a different one in the same region, with strict no-impact constraints on the source environment and zero authentication-flow disruption for live users. Below is what worked, what we had to redesign mid-flight, and the five gotchas that aren’t in the AWS docs.

The constraints that shaped the playbook

Before the technical pattern, the constraints matter — different constraints produce different playbooks. Ours were:

  1. Strict read-only on the source. The source account contained the live production workload serving real users. We could read anything; we could not change anything. No new IAM roles in the source, no resource tags, no security-group edits — nothing that would risk a regression on the still-live system.
  2. No production downtime. End users had to remain authenticated and able to use the system throughout. Any cutover had to happen with a connection-draining window measured in seconds, not minutes.
  3. No password resets. Existing user passwords had to survive the migration. Forcing a global password reset was unacceptable.
  4. Overlapping VPC CIDR ranges. The source and target VPCs were both allocated 10.0.0.0/16 because somebody, at some point, copy-pasted a CloudFormation template. We could not re-IP either side.
  5. CloudFormation truth divergence. Several DynamoDB tables had been created manually in the source console and never reconciled into the CFN stack. The target account had to start from a clean truth.

If any of those constraints sound familiar, the playbook below probably applies to you. If you can afford a maintenance window or a forced password reset, you have easier options.

The architectural pattern: stage, then cut over

The naive approach — using AWS Application Migration Service or the older Server Migration Service — assumes you want to clone-and-shift specific EC2 instances. We deliberately rejected that. Instead, we did a stage-and-cut-over pattern:

  1. Stand up a parallel copy of the workload in the target account using CloudFormation (clean templates, no drift inherited).
  2. Continuously replicate state — RDS via DMS, S3 via cross-account replication, Cognito via lazy-on-login migration — into the target account.
  3. Validate the parallel copy by running synthetic transactions against it for at least a week.
  4. Flip DNS at the cutover point, with a 30-second TTL set 24 hours in advance.

This pattern has three big advantages over migration-service approaches:

  • The source is genuinely never touched. Every read happens through DMS-replicated state, S3 replication, or Cognito’s transparent-migration trigger.
  • You can roll back in 30 seconds by flipping DNS back, because the source is still live.
  • You can run integration tests against the target before cutover, which catches the issues that would otherwise blow up in production.
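The DNS flip and its rollback can be sketched as a pair of record changes. This is a minimal illustration, assuming the zone lives in Route 53 and the application is reached through a plain CNAME; the hostnames and the helper names are hypothetical, and an alias record or another DNS provider would need a different shape.

```python
def build_cutover_change(record_name, new_target, ttl=30, comment="cutover"):
    """Build a Route 53 ChangeBatch that points record_name at new_target."""
    return {
        "Comment": comment,
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",  # assumption: not a zone-apex or alias record
                    "TTL": ttl,       # the 30-second TTL lowered a day in advance
                    "ResourceRecords": [{"Value": new_target}],
                },
            }
        ],
    }


def apply_change(hosted_zone_id, change_batch):
    """Submit the change; needs boto3 and credentials for the zone's account."""
    import boto3  # imported lazily so the builder above has no dependencies

    route53 = boto3.client("route53")
    return route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id, ChangeBatch=change_batch
    )


# Hypothetical hostnames, for illustration only.
cutover = build_cutover_change("app.example.com.", "target-lb.example.net.")
rollback = build_cutover_change(
    "app.example.com.", "source-lb.example.net.", comment="rollback"
)
```

Building both change batches ahead of time means the rollback is a single prepared call rather than something typed under pressure.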

RDS migration with AWS DMS

For an RDS migration between AWS accounts, the Database Migration Service is the right tool — but with three operational details that are easy to miss:

Use full load + CDC, not just full load. Configure the DMS task as “Migrate existing data and replicate ongoing changes.” If you do full-load only, the cutover window has to be the entire replication time. With CDC running, the cutover is just the time it takes for CDC to drain (typically under 30 seconds for a healthy task).

Point the source endpoint at a read replica, not the writer. Reading from a replica of the source RDS instance isolates the migration load from production traffic. Check first that your engine and version support CDC from a replica, since not all do. If your engine doesn’t support replicas at all (Aurora Serverless v1, for example), use a snapshot-based seed and then enable binlog replication afterwards.

Pre-create the schema; don’t let DMS infer it. DMS will happily infer column types, but it makes ugly choices: VARCHAR(4000) instead of VARCHAR(255), DATETIME instead of TIMESTAMP WITH TIME ZONE. Run your CloudFormation / Liquibase / Flyway scripts against the target first to get the schema right, then start DMS with the target table preparation mode set to “Do nothing” (TargetTablePrepMode = DO_NOTHING), so it neither drops nor truncates the tables you just created.
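As a rough sketch, the first and third details translate into the arguments of a DMS CreateReplicationTask call. The helper name and the ARNs are placeholders, and the catch-all selection rule is an assumption; a real task would usually narrow it to specific schemas.

```python
import json


def build_dms_task_kwargs(task_id, source_arn, target_arn, instance_arn):
    """Arguments for DMS create_replication_task: full load + CDC, schema left alone."""
    table_mappings = {
        "rules": [
            {
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "include-all",
                # Assumption: include everything; narrow this for a real task.
                "object-locator": {"schema-name": "%", "table-name": "%"},
                "rule-action": "include",
            }
        ]
    }
    task_settings = {
        # Keep the pre-created tables: neither drop nor truncate them.
        "FullLoadSettings": {"TargetTablePrepMode": "DO_NOTHING"}
    }
    return {
        "ReplicationTaskIdentifier": task_id,
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,
        "ReplicationInstanceArn": instance_arn,
        # "Migrate existing data and replicate ongoing changes" in the console.
        "MigrationType": "full-load-and-cdc",
        "TableMappings": json.dumps(table_mappings),
        "ReplicationTaskSettings": json.dumps(task_settings),
    }


# Placeholder ARNs, for illustration only.
kwargs = build_dms_task_kwargs(
    "prod-to-target",
    "arn:aws:dms:REGION:ACCOUNT:endpoint:SOURCE",
    "arn:aws:dms:REGION:ACCOUNT:endpoint:TARGET",
    "arn:aws:dms:REGION:ACCOUNT:rep:INSTANCE",
)
```

With boto3 available, the result would be passed as boto3.client("dms").create_replication_task(**kwargs).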

A typical migration of a few hundred GB of RDS data takes 6–10 hours for the initial full load, then CDC keeps replication caught up indefinitely. Watch the CDCLatencyTarget CloudWatch metric — anything above 60 seconds means CDC is falling behind and you need to address it before cutover.
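A small gate along these lines can turn the 60-second rule into a pre-cutover check. It is a sketch only: the function names are ours, and the dimension names in the query are an assumption about how the task-level metric is keyed, so verify them against your own CloudWatch console before relying on it.

```python
from datetime import datetime, timedelta, timezone


def latency_query(task_id, instance_id, minutes=15, now=None):
    """Arguments for CloudWatch get_metric_statistics on CDCLatencyTarget."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/DMS",
        "MetricName": "CDCLatencyTarget",
        # Assumption: these are the dimensions the task metric is published under.
        "Dimensions": [
            {"Name": "ReplicationInstanceIdentifier", "Value": instance_id},
            {"Name": "ReplicationTaskIdentifier", "Value": task_id},
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,
        "Statistics": ["Maximum"],
    }


def cdc_ready_for_cutover(datapoints, threshold_seconds=60.0):
    """True only if every datapoint in the window is at or below the threshold."""
    if not datapoints:
        # No data usually means the task is not running; treat it as not ready.
        return False
    return max(dp["Maximum"] for dp in datapoints) <= threshold_seconds
```

The datapoints argument is the "Datapoints" list from the get_metric_statistics response. Using the maximum over the window, rather than the latest value, avoids approving a cutover on one lucky sample.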

Cognito UserMigration trigger for transparent user moves

Cognito user pools have one quirk that breaks naive migration: you cannot export hashed passwords. The hash format is internal to Cognito and there is no API to read it.

The clean solution is the UserMigration Lambda trigger. The pattern:

  1. Stand up the target Cognito user pool empty.
  2. Attach a Lambda function to the user pool’s UserMigration trigger.
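  3. The Lambda is invoked the first time a user attempts to sign in to the target pool, when Cognito finds no matching user there. It receives the username and the plaintext password, which requires the app to sign in with the USER_PASSWORD_AUTH flow (with SRP, the password never reaches Cognito).
  4. The Lambda calls AdminInitiateAuth against the source pool (using cross-account credentials). If auth succeeds, the Lambda fetches the user’s attributes from the source pool (AdminGetUser, for example) and returns them; the target pool then creates the user, hashing the password locally.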
  5. Subsequent logins by that user go directly against the target pool — no further Lambda invocation.

This gives you transparent migration: users sign in once with their existing password and they’re silently moved. They never see a “please reset your password” screen.
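The trigger flow above can be sketched as a Lambda handler. This is an illustration under assumptions rather than our production code: verify_password and fetch_attributes are hypothetical callables standing in for the calls against the source pool, and only the sign-in trigger is handled (the forgot-password variant is left out).

```python
def make_handler(verify_password, fetch_attributes):
    """Build a UserMigration handler from two injected source-pool lookups.

    verify_password(username, password) -> bool
    fetch_attributes(username) -> dict of attribute name to string value
    Both are hypothetical helpers supplied by the caller.
    """

    def handler(event, context):
        if event["triggerSource"] != "UserMigration_Authentication":
            # This sketch only covers sign-in, not UserMigration_ForgotPassword.
            raise Exception("Unsupported trigger source")

        username = event["userName"]
        password = event["request"]["password"]

        if not verify_password(username, password):
            # Raising makes Cognito reject the sign-in without creating a user.
            raise Exception("Bad credentials")

        # "sub" is assigned by the target pool and cannot be carried over.
        attributes = {
            name: value
            for name, value in fetch_attributes(username).items()
            if name != "sub"
        }
        event["response"]["userAttributes"] = attributes
        event["response"]["finalUserStatus"] = "CONFIRMED"  # keep the password as-is
        event["response"]["messageAction"] = "SUPPRESS"     # no welcome message
        return event

    return handler
```

Injecting the two lookups keeps the handler testable without AWS access; in a deployed function they would wrap the cross-account calls to the source pool.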

Two gotchas:

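  • The Lambda needs cross-account IAM permission. Create an IAM role in the source account that allows cognito-idp:AdminInitiateAuth (plus cognito-idp:AdminGetUser if the Lambda reads attributes that way) on the source pool, and let the target-account Lambda role assume it. This role is a change to the source account, so it needs to be agreed as a narrow exception to the read-only constraint.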
  • MFA users need a separate code path. If a user has SMS or TOTP MFA enabled, AdminInitiateAuth returns a challenge. Handle the challenge in the Lambda by calling AdminRespondToAuthChallenge — but you’ll need to either prompt the user for a fresh MFA code or skip MFA verification during migration (acceptable if you trust the password check).
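Both gotchas can be sketched together. The code below takes the "skip MFA verification" option and treats an MFA challenge as proof that the password was accepted. The function names are ours, and it assumes a source app client with no client secret and the ADMIN_USER_PASSWORD_AUTH flow enabled. Be aware that triggering an SMS challenge may still send the user a text.

```python
MFA_CHALLENGES = {"SMS_MFA", "SOFTWARE_TOKEN_MFA"}


def password_was_accepted(auth_response):
    """Interpret an AdminInitiateAuth response as a password check."""
    if "AuthenticationResult" in auth_response:
        return True  # tokens issued: password correct, no MFA
    # Cognito only issues an MFA challenge after the password has checked out.
    return auth_response.get("ChallengeName") in MFA_CHALLENGES


def source_cognito_client(role_arn, region):
    """Assume the source-account role and return a cognito-idp client (needs boto3)."""
    import boto3  # lazy import: the pure helper above works without it

    creds = boto3.client("sts").assume_role(
        RoleArn=role_arn, RoleSessionName="user-migration"
    )["Credentials"]
    return boto3.client(
        "cognito-idp",
        region_name=region,
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )


def verify_against_source(client, pool_id, client_id, username, password):
    """Return True if the source pool accepts this username and password."""
    try:
        response = client.admin_initiate_auth(
            UserPoolId=pool_id,
            ClientId=client_id,
            AuthFlow="ADMIN_USER_PASSWORD_AUTH",
            AuthParameters={"USERNAME": username, "PASSWORD": password},
        )
    except (
        client.exceptions.NotAuthorizedException,
        client.exceptions.UserNotFoundException,
    ):
        return False
    return password_was_accepted(response)
```

Other challenges, such as NEW_PASSWORD_REQUIRED, are deliberately treated as "not migratable" here; those users are better handled by the force-migration path.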

In a recent migration of around 4,000 users, the bulk silently migrated over six weeks. By the end of the migration window, 92% had logged in at least once and been moved. The remaining 8% were force-migrated by exporting their non-password attributes from the source pool and creating them in the target pool with a flag that required a password reset on next login.

VPC peering with overlapping CIDR ranges

This is the gotcha most blog posts skip. Standard VPC peering requires non-overlapping CIDR blocks, so if both your source and target VPCs are 10.0.0.0/16, plain peering will refuse to connect.

The workable patterns are:

  1. Re-IP one side. Cleanest but usually impossible on the source.
  2. NAT through a transit VPC. Stand up a third “transit” VPC with a non-overlapping CIDR (say 172.16.0.0/16). Connect both source and target to the transit VPC via peering. NAT instances or a private NAT gateway in the transit VPC translate addresses on the way through, so each side only ever talks to transit-VPC addresses.
  3. PrivateLink endpoints. If you only need a few specific service interactions (e.g., your target Lambda needs to call your source RDS), expose those services as VPC endpoint services and consume them via Interface endpoints in the target VPC. This is the cleanest option for narrow integration.

We used pattern 3 (PrivateLink) for the DMS endpoint and pattern 2 (transit VPC) for the broader peer-to-peer traffic during the parallel-run period. PrivateLink is preferable when applicable because it sidesteps IP routing between the two VPCs entirely: overlapping CIDRs don’t matter, there are no route tables to maintain, and no half-open connections to debug.
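It is worth checking for overlaps before anyone requests a peering connection. A minimal sketch using only the standard library, with the CIDRs from this post (the VPC labels and the function name are ours):

```python
import ipaddress


def overlapping_pairs(vpc_cidrs):
    """Return every pair of VPC names whose CIDR blocks overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in vpc_cidrs.items()}
    names = sorted(networks)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if networks[a].overlaps(networks[b])
    ]


vpcs = {
    "source": "10.0.0.0/16",
    "target": "10.0.0.0/16",
    "transit": "172.16.0.0/16",
}
print(overlapping_pairs(vpcs))  # [('source', 'target')]
```

The transit VPC pairs cleanly with both sides, which is what makes pattern 2 possible at all.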

Reconciling CloudFormation drift before cutover

Most production AWS accounts that are 2+ years old have some amount of drift: resources created in the console, modified outside of CFN, or manually retagged. Migration is the perfect time to clean it up — but doing it on the source is forbidden by constraint #1.

The pattern that worked: the describe-and-redeclare workflow.

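  1. Run aws cloudformation detect-stack-drift against each source-account stack, then dump the report with describe-stack-resource-drifts. Drift detection only covers resources a stack already manages, so hand-created resources that belong to no stack (like our DynamoDB tables) have to be inventoried separately.
  2. For each drifted or unmanaged resource, query its current state via the AWS API (DescribeTable for DynamoDB, the GetBucket* calls for S3, etc.) and emit a clean CFN snippet that matches the actual state.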
  3. Hand-merge those snippets into the target-account template before deploying.

This is mechanical and slow but produces a clean target stack with no inherited drift. For our migration this took about three days for ~20 stacks. There’s no good off-the-shelf tooling (cdk-from-aws and former2 help but neither is reliable end to end); plan for hand-stitching.
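For the DynamoDB case, step 2 can be sketched as a function that turns a DescribeTable response into a CloudFormation resource. This is a partial illustration, not a complete converter: it covers keys, billing mode, and global secondary indexes, and leaves out local secondary indexes, streams, TTL, encryption settings, and tags, which still need hand-stitching. The table name in the sample is hypothetical.

```python
import json


def _throughput(described):
    """Keep only the two fields CloudFormation accepts."""
    return {
        "ReadCapacityUnits": described["ReadCapacityUnits"],
        "WriteCapacityUnits": described["WriteCapacityUnits"],
    }


def table_to_cfn(describe_table_response):
    """Convert a DynamoDB DescribeTable response into an AWS::DynamoDB::Table resource."""
    table = describe_table_response["Table"]
    # BillingModeSummary can be absent on tables that were always provisioned.
    billing = table.get("BillingModeSummary", {}).get("BillingMode", "PROVISIONED")

    properties = {
        "TableName": table["TableName"],
        "AttributeDefinitions": table["AttributeDefinitions"],
        "KeySchema": table["KeySchema"],
        "BillingMode": billing,
    }
    if billing == "PROVISIONED":
        properties["ProvisionedThroughput"] = _throughput(table["ProvisionedThroughput"])

    indexes = []
    for index in table.get("GlobalSecondaryIndexes", []):
        declared = {
            "IndexName": index["IndexName"],
            "KeySchema": index["KeySchema"],
            "Projection": index["Projection"],
        }
        if billing == "PROVISIONED":
            declared["ProvisionedThroughput"] = _throughput(index["ProvisionedThroughput"])
        indexes.append(declared)
    if indexes:
        properties["GlobalSecondaryIndexes"] = indexes

    return {"Type": "AWS::DynamoDB::Table", "Properties": properties}


# Hypothetical response for an on-demand table.
sample = {
    "Table": {
        "TableName": "sessions",
        "AttributeDefinitions": [{"AttributeName": "pk", "AttributeType": "S"}],
        "KeySchema": [{"AttributeName": "pk", "KeyType": "HASH"}],
        "BillingModeSummary": {"BillingMode": "PAY_PER_REQUEST"},
        "ProvisionedThroughput": {"ReadCapacityUnits": 0, "WriteCapacityUnits": 0},
    }
}
print(json.dumps({"SessionsTable": table_to_cfn(sample)}, indent=2))
```

The printed JSON can be pasted under Resources in the target-account template and reviewed by hand before deploying.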

Replicating CI/CD pipelines

The least exciting and most error-prone part of the migration: every CodePipeline, CodeBuild project, and ECR repository in the source had to be recreated in the target with new ARNs. Three gotchas worth budgeting for:

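  • CodeStar connections (now AWS CodeConnections) don’t transfer. You have to re-authorise the GitHub / Bitbucket OAuth connection in the target account. The connection resource can be created via the API or CloudFormation, but it stays in a pending state until someone completes the handshake with a manual click in the console.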
  • ECR image URIs are hardcoded in dozens of places. Search every CFN template, every Helm chart, and every Lambda environment variable for the old account ID. Replace systematically.
  • CloudWatch alarm SNS topic ARNs. Same problem — every alarm references topic ARNs that include the account ID. Easy to miss until the first production alert silently fails.
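The second and third gotchas come down to the same search: find every string that still embeds the old account ID. A minimal sketch, with made-up account IDs, region, and resource names:

```python
import re


def find_account_refs(text, old_account_id):
    """Return (line number, token) for every token containing the old account ID."""
    # A token is a run of characters without whitespace or quotes.
    pattern = re.compile(r"[^\s\"']*" + re.escape(old_account_id) + r"[^\s\"']*")
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in pattern.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


# Made-up template fragment; 111111111111 plays the old account.
template = """Image: 111111111111.dkr.ecr.eu-west-1.amazonaws.com/api:1.4
AlarmActions:
  - arn:aws:sns:eu-west-1:111111111111:ops-alerts
Sidecar: 222222222222.dkr.ecr.eu-west-1.amazonaws.com/proxy:2.0
"""

for lineno, token in find_account_refs(template, "111111111111"):
    print(lineno, token)
```

Run over every template, chart, and exported Lambda configuration, this gives a checklist to work through; reporting the whole token rather than just the ID makes it obvious whether a hit is an ECR URI, an SNS topic, or something else.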

Frequently asked questions

How long does a typical AWS cross-account migration take?

For a workload with a few RDS instances, a Cognito pool, S3 buckets, and a handful of CI/CD pipelines, plan on 4–6 weeks of elapsed calendar time — about 2 weeks of preparation (drift reconciliation, target stack standup, DMS configuration), 2–3 weeks of parallel-run validation and lazy user migration, and a one-day cutover window. The cutover itself takes minutes; everything before it is preparation.

Can I use AWS Application Migration Service instead of this stage-and-cut-over pattern?

For pure EC2 lift-and-shift, yes. For workloads with Cognito user pools, RDS, and SaaS-style data plane requirements (no downtime, no password resets), the stage-and-cut-over pattern handles cases that Application Migration Service does not. Application Migration Service migrates servers; this playbook migrates state.

What happens if a user logs in during the cutover window?

If you’ve kept the source live until DNS flips, they hit the source as normal. If they log in just after DNS flips, the Cognito UserMigration trigger fires for them on the target pool. Either way, the experience is invisible to the user.

How do you handle writes to the source after cutover begins?

DMS CDC keeps replicating until you explicitly stop it. The safe pattern is: at cutover, switch the source database to read-only (revoke INSERT/UPDATE/DELETE on the application user), wait for CDC to drain (about 30 seconds; confirm that CDCLatencyTarget has dropped to zero), then flip DNS. Writes committed just before the freeze replicate cleanly during that window.
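The ordering matters more than the individual steps, so it helps to script it. The sketch below wires the sequence together from four caller-supplied callables; all of them are hypothetical hooks, and the timeout and abort behaviour are our assumptions rather than anything DMS prescribes.

```python
import time


def run_cutover(
    freeze_writes,
    current_cdc_latency,
    flip_dns,
    unfreeze_writes,
    timeout_seconds=120,
    poll_seconds=5,
    sleep=time.sleep,
):
    """Freeze writes, wait for CDC to drain, then flip DNS.

    Returns True on success. If CDC has not drained within the timeout,
    writes are re-enabled on the source and False is returned, so the
    source keeps serving traffic and the cutover can be retried later.
    """
    freeze_writes()
    waited = 0
    while current_cdc_latency() > 0:
        if waited >= timeout_seconds:
            unfreeze_writes()  # abort: leave the source fully live
            return False
        sleep(poll_seconds)
        waited += poll_seconds
    flip_dns()
    return True
```

Passing sleep as a parameter lets a dry run exercise the whole sequence instantly with fake hooks before the real cutover.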

What if the DMS task fails mid-migration?

DMS tasks are resumable from the last LSN / binlog position. Failures during full load are recoverable by restarting the task. Failures during CDC require you to investigate the offending row (usually a constraint mismatch) and either fix it in the target schema or skip it, then resume the task (start type resume-processing). Don’t reload the task from scratch (reload-target): you’ll lose the LSN position and have to re-do the full load.

Need a sanity check on your AWS migration plan?

Diginuance has run several cross-account AWS migrations (account splits, account consolidations, cross-region replications) for live SaaS workloads. We have felt every gotcha in this playbook in production. If you are planning an AWS cross-account migration and want an architecture review, get in touch for a 30-minute call.