cv usk(@cv_usk):
🏗 Migrating tens of thousands of jobs that ingest petabytes a day, without ever stopping data delivery. Meta's playbook of shadow, then reverse shadow, then cleanup fully deprecated the legacy system.

Title: Migrating Data Ingestion Systems at Meta Scale
URL: https://t.co/8wesxudKSi

📝 Overview
Meta incrementally scrapes several petabytes of social-graph data daily from one of the world's largest MySQL deployments into its data warehouse. This post explains how they migrated tens of thousands of those ingestion jobs to a new self-managed service without disrupting analytics, reporting, and ML pipelines.

❓ Challenges Solved
The legacy system relied on customer-owned pipelines, which worked at small scale but became unstable at hyperscale. The team had to meet increasingly strict data landing-time requirements while migrating, without interrupting data delivery across the organization.

💡 Methodology & Proposed Approach
Each job migrates through a three-phase lifecycle.
・Shadow: the new job runs in pre-production, consuming production data while writing to isolated tables; row-count and checksum mismatches against the production job are monitored continuously
・Reverse shadow: the shadow job is promoted to write the production tables and the original production job is moved to shadow; outputs are still compared for quality signals, and the swap can be rolled back instantly if needed
・Cleanup: the old job is deprecated once consistency is confirmed
・Each job is verified on four axes: zero differences, landing latency, resource usage, and custom criteria
・The CDC pipeline maintains full-dump, delta, and target tables

🎯 Use Cases
Useful for migrating large data-ingestion platforms, phasing CDC pipeline cutovers, and designing zero-downtime system replacements.
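The shadow-phase check (row count plus checksum against the production job) can be sketched roughly as follows. This is a minimal illustration, not Meta's implementation: `PartitionDigest`, `digest_partition`, and `compare_partitions` are hypothetical names, rows are assumed to be plain tuples, and the order-independent checksum (sum of per-row SHA-256 hashes) is an assumed technique chosen so that two jobs writing the same rows in a different order still match.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, List, Sequence

_MOD = 2 ** 256  # keep the running checksum at SHA-256 width


@dataclass(frozen=True)
class PartitionDigest:
    """Hypothetical summary of one output partition."""
    row_count: int
    checksum: str


def digest_partition(rows: Iterable[Sequence[object]]) -> PartitionDigest:
    """Count rows and add up per-row hashes so row order does not matter."""
    count = 0
    acc = 0
    for row in rows:
        # Join column reprs with a unit separator to keep columns distinct.
        encoded = "\x1f".join(repr(value) for value in row).encode("utf-8")
        acc = (acc + int.from_bytes(hashlib.sha256(encoded).digest(), "big")) % _MOD
        count += 1
    return PartitionDigest(row_count=count, checksum=f"{acc:064x}")


def compare_partitions(
    prod_rows: Iterable[Sequence[object]],
    shadow_rows: Iterable[Sequence[object]],
) -> List[str]:
    """Return a list of mismatch descriptions; an empty list means zero differences."""
    prod = digest_partition(prod_rows)
    shadow = digest_partition(shadow_rows)
    mismatches: List[str] = []
    if prod.row_count != shadow.row_count:
        mismatches.append(
            f"row_count: prod={prod.row_count} shadow={shadow.row_count}"
        )
    if prod.checksum != shadow.checksum:
        mismatches.append("checksum")
    return mismatches
```

An empty result corresponds to the "zero differences" axis; any non-empty result would be the kind of signal that blocks promotion out of the shadow phase.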
📊 Outcomes
・100% of the workload was migrated and the legacy system was fully deprecated
・Job status signals were streamed continuously to Scuba; a migration tool monitored each job and automatically promoted or demoted it between stages, which made thousands of concurrent migrations manageable
・Bad partitions were flagged in metadata, which prevented them from propagating to downstream jobs and triggered alerts
・To work within capacity limits, snapshot partitions from the old system were reused as initial snapshots, reducing full-dump load
・The data-quality analysis tool built for the migration is still used for release validation

#DataEngineering #DataInfrastructure
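The automatic promotion and demotion between stages can be pictured as a small state machine driven by the four verification axes. The sketch below is an assumption-laden illustration, not the actual migration tool: `Stage`, `JobSignals`, `next_stage`, and the `required_clean_runs` threshold are all hypothetical, and the rule "any failing signal sends the job back to shadow" is one plausible reading of the rollback behaviour described in the post.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    SHADOW = 1
    REVERSE_SHADOW = 2
    CLEANUP = 3


@dataclass(frozen=True)
class JobSignals:
    """Hypothetical per-job health signals, one flag per verification axis."""
    zero_diffs: bool
    latency_ok: bool
    resources_ok: bool
    custom_ok: bool

    def healthy(self) -> bool:
        return all(
            (self.zero_diffs, self.latency_ok, self.resources_ok, self.custom_ok)
        )


def next_stage(
    stage: Stage,
    signals: JobSignals,
    clean_runs: int,
    required_clean_runs: int = 3,  # assumed threshold, not from the post
) -> Stage:
    """Decide where a job goes after its latest run."""
    if stage is Stage.CLEANUP:
        # The old job is already deprecated; nothing left to roll back to.
        return Stage.CLEANUP
    if not signals.healthy():
        # Demote (or stay) in shadow until the outputs agree again.
        return Stage.SHADOW
    if clean_runs >= required_clean_runs:
        # Enough consecutive clean runs: promote to the next stage.
        return Stage(stage.value + 1)
    return stage
```

Usage: calling `next_stage(Stage.REVERSE_SHADOW, signals, clean_runs)` with any failing signal returns `Stage.SHADOW`, which models the instant rollback; a job only reaches `Stage.CLEANUP` after staying healthy in reverse shadow.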

2026.06.17 18:58
