2026.06.17 18:58

🏗 Migrating tens of thousands of jobs that ingest petabytes a day, without ever stopping data delivery. Meta's playbook of shadow, then reverse shadow, then cleanup fully retired the legacy system.

Title: Migrating Data Ingestion Systems at Meta Scale
URL:

📝 Overview
Meta incrementally scrapes several petabytes of social-graph data daily from one of the world's largest MySQL deployments into its data warehouse. This post explains how they migrated tens of thousands of those ingestion jobs to a new self-managed service without disrupting analytics, reporting, and ML pipelines.

❓ Challenges Solved
The legacy system consisted of customer-owned pipelines, which worked at small scale but became unstable at hyperscale. The team had to meet increasingly strict data landing-time requirements while migrating without interrupting data delivery across the organization.

💡 Methodology & Proposed Approach
Each job moves through a three-phase lifecycle.
・Shadow: in pre-production, consume production data while writing to isolated tables, continuously monitoring row-count and checksum mismatches against production jobs
・Reverse shadow: promote shadow jobs to production tables and send the original production jobs to shadow, keep comparing outputs for quality signals, and roll back instantly if needed
・Cleanup: deprecate old jobs after confirming consistency
・Each job is verified on four axes (zero differences, landing latency, resource usage, custom criteria), and CDC maintains full-dump, delta, and target tables

🎯 Use Cases
Useful when migrating large data-ingestion platforms, phasing CDC pipeline cutovers, and designing zero-downtime system replacements.

📊 Outcomes
・100% of the workload was migrated and the legacy system was fully deprecated
・Job status signals streamed continuously to Scuba, and a migration tool monitored each job and auto-promoted or demoted it between stages, which made thousands of concurrent migrations manageable
・Bad partitions were flagged in metadata to prevent propagation to downstream jobs and to trigger alerts
・To handle capacity limits, they reused old-system snapshot partitions as initial snapshots to cut full-dump load
・The data-quality analysis tool built for the migration is still used in release validation

#DataEngineering# #DataInfrastructure#
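
The shadow comparison and the auto-promote/demote loop described above can be sketched roughly as follows. This is a minimal illustration, not Meta's tooling: `PartitionStats`, `summarize`, `next_stage`, and the `required_streak` threshold of three consecutive clean comparisons are all hypothetical names and values chosen for the example, and only the "zero differences" axis is modeled.

```python
import hashlib
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    SHADOW = 1
    REVERSE_SHADOW = 2
    CLEANUP = 3


@dataclass(frozen=True)
class PartitionStats:
    row_count: int
    checksum: str


def summarize(rows):
    """Row count plus an order-insensitive checksum for one partition."""
    total = 0
    count = 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode()).digest()
        # Summing per-row hashes (mod 2**64) ignores row order,
        # so two jobs may write the same rows in different orders.
        total = (total + int.from_bytes(digest[:8], "big")) % 2**64
        count += 1
    return PartitionStats(count, f"{total:016x}")


def next_stage(stage, clean_streak, matched, required_streak=3):
    """Return (new_stage, new_streak) after one output comparison.

    required_streak is an assumed threshold, not a value from the post.
    """
    if stage is Stage.CLEANUP:
        return stage, clean_streak  # terminal: old job is being retired
    if not matched:
        # Roll back: a reverse-shadow job is demoted to shadow,
        # a shadow job stays put and restarts its streak.
        return Stage.SHADOW, 0
    streak = clean_streak + 1
    if streak >= required_streak:
        return Stage(stage.value + 1), 0  # promote to the next stage
    return stage, streak


legacy = summarize([(1, "a"), (2, "b")])
candidate = summarize([(2, "b"), (1, "a")])  # same rows, different order

stage, streak = Stage.SHADOW, 0
for _ in range(3):
    stage, streak = next_stage(stage, streak, legacy == candidate)
print(stage)
```

A real checker would also gate promotion on landing latency, resource usage, and any custom criteria, and would flag the mismatching partition in metadata instead of only resetting the streak.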

cv usk@cv_usk

2026.06.17 17:09

# Snowflake Features and Practical Usage

🚀 "Bumping the size up makes it faster, but what about cost?" Snowflake cost optimization hinges on answering that question correctly. Let's master virtual warehouse sizing and auto-suspend.

📌 Title and Feature URL
Title: Working with Virtual Warehouses
URL:

📝 Overview
A virtual warehouse is a cluster of compute resources that supplies the CPU, memory, and temporary storage needed to run SQL queries and data operations such as INSERT, UPDATE, DELETE, and COPY. It consumes credits only while running and can be resized or auto-suspended flexibly. Designing size and auto-suspend per workload is the first step in Snowflake cost optimization.

🔧 How It Works
Key facts about warehouse sizing and billing:
・Sizes range from X-Small to 6X-Large, and each step up doubles compute and credit consumption: X-Small=1, Small=2, Medium=4, Large=8, X-Large=16, 2X-Large=32, and so on up to 6X-Large=512 credits/hour.
・Billing is per-second with a 60-second minimum each time a warehouse starts or resumes. For example, an X-Large running 61 seconds costs about 0.271 credits (16 × 61 / 3600), while a full hour costs 16 credits.
・Larger warehouses speed up large, complex queries, but larger is not necessarily faster for small, basic queries.
・Besides standard warehouses, Snowpark-optimized warehouses target memory-heavy workloads like ML training.

🛠 Practical Usage
・Use AUTO_SUSPEND (on by default) to suspend after idle time and AUTO_RESUME (on by default) to resume when a statement arrives, preventing wasted credits while idle.
・For batch, create with CREATE WAREHOUSE etl_wh WAREHOUSE_SIZE = XLARGE; and for ad-hoc analytics use WAREHOUSE_SIZE = SMALL AUTO_SUSPEND = 60 to "pay only for what you use."
・Add INITIALLY_SUSPENDED = TRUE to create the warehouse in a suspended state.
・Warehouses can be resized even while running, so you can temporarily scale up just before a heavy job.

🎯 Use Cases
・Run a daily batch on X-Large to finish fast. If one size step roughly halves runtime, which holds only for queries that parallelize well, you cut wall-clock time at a comparable credit cost.
・Set an ad-hoc analytics warehouse to Small with AUTO_SUSPEND=60 so it consumes no credits when nobody is querying.
・For data loading, small-to-medium sizes are often sufficient; tune based on file count and size rather than warehouse size.

⚠️ Caveats
・Every resume bills a 60-second minimum, so an extremely short AUTO_SUSPEND (a few seconds) can backfire by triggering frequent start/stop cycles.
・A large size is wasted on small queries. "Scale up for slow queries" is the rule; bigger is not universally better.
・Loading performance depends more on file count and size than on warehouse size. Consider parallelizing files before scaling up.

#Snowflake# #DataEngineering#
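
The billing arithmetic above can be checked with a small calculator. This is a rough sketch that models only the rules stated in the post (per-second billing, a 60-second minimum per start or resume, credits doubling with each size). The function name `credits_for_run` is made up for the example, and the 3X to 5X rates are filled in from the doubling rule.

```python
# Credits per hour by warehouse size; each step doubles the rate.
CREDITS_PER_HOUR = {
    "XSMALL": 1, "SMALL": 2, "MEDIUM": 4, "LARGE": 8, "XLARGE": 16,
    "2XLARGE": 32, "3XLARGE": 64, "4XLARGE": 128, "5XLARGE": 256,
    "6XLARGE": 512,
}
MIN_BILLED_SECONDS = 60  # charged on every start or resume


def credits_for_run(size: str, seconds: float) -> float:
    """Credits for one continuous run after a start or resume."""
    billed_seconds = max(seconds, MIN_BILLED_SECONDS)
    return CREDITS_PER_HOUR[size] * billed_seconds / 3600


# The example from the post: X-Large for 61 seconds and for one hour.
print(round(credits_for_run("XLARGE", 61), 3))
print(credits_for_run("XLARGE", 3600))

# A 10-second query after a resume is still billed as 60 seconds.
print(credits_for_run("XSMALL", 10) == credits_for_run("XSMALL", 60))

# Idealized scale-up: if X-Large halves a 2-hour Large job,
# the credit cost is the same and only wall-clock time changes.
print(credits_for_run("LARGE", 7200), credits_for_run("XLARGE", 3600))
```

The last comparison assumes perfectly linear speedup; a job that does not halve its runtime on the larger size costs more after scaling up.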

TRM Labs@trmlabs

2026.05.01 13:34

Some data engineering roles end at the dashboard. This one helps stop real-world crime. TRM Labs is hiring a 𝗙𝗼𝗿𝘄𝗮𝗿𝗱 𝗗𝗲𝗽𝗹𝗼𝘆𝗲𝗱 𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿 to build the systems that power investigations at federal agencies. 🚀 Apply here:

Microsoft Learn@MicrosoftLearn

2026.06.09 13:22

The conversation continues at AI Skills Fest. From interactive learning to real-world AI use cases, today's sessions are packed with ideas and inspiration for learners, professionals, and organizations worldwide. Which one are you starting first? ⬇️

Turn your everyday work into credentials that count
6/9 5:00 PM PDT (Americas)
6/9 7:30 IST (EMEA/Asia)

Prepare for Microsoft Certification Exam AB-730: AI Business Professional
6/9 6:30 PM PDT (Americas)
6/9 9:00 IST (EMEA/Asia)

Microsoft Cert Exam Prep for DP-700: Data Engineering with Microsoft Fabric
6/9 8:30 AM IST (EMEA/Asia)
6/9 10:00 AM PDT (Americas)

And when you finish, you can jump into a playlist specifically designed for your role:
