Search DataWarehouse on X

Tweets including DataWarehouse

cv usk @cv_usk

2026.06.13 01:02

# Snowflake Features and Practical Usage 🚀

People often say Snowflake "separates storage from compute," but once you truly grasp what that means, every discussion about cost, performance, and concurrency suddenly clicks. Let's dig into Snowflake's foundational three-layer architecture.

📌 Title and Feature URL
Title: Key Concepts and Architecture
URL:

📝 Overview
Snowflake is a cloud-native SQL data platform delivered as a fully managed service, with no hardware to manage or software to install. Its architecture is a hybrid of shared-disk and shared-nothing designs, organized into three layers: database storage, query processing (compute), and cloud services. The defining trait is that these three layers scale independently of one another.

🔧 How It Works
Each layer has a distinct role:
・Database storage layer: Ingested data is reorganized into an internally optimized, compressed, columnar format and divided into micro-partitions (contiguous units of storage). Snowflake fully manages the physical layout; you only ever interact via SQL.
・Compute layer (virtual warehouses): A virtual warehouse is a cluster of compute resources that runs queries using massively parallel processing (MPP). Each warehouse runs independently, so load on one has no effect on the performance of others.
・Cloud services layer: The "brain" that coordinates everything: authentication and access control, metadata management, query parsing and optimization, and infrastructure management.
The core idea: storage is centrally shared (a shared-disk benefit) while processing happens on distributed nodes (a shared-nothing benefit).

🛠 Practical Usage
Given this separation model, the first design step is to split compute by workload:
・Provision separate virtual warehouses for ETL, BI dashboards, and data science. Since storage is shared, all warehouses read the same tables without duplicating data.
・A heavy nightly batch won't slow BI queries running on a different warehouse; you eliminate interference structurally, not by tuning.
・For table types, you can use standard Snowflake tables, Apache Iceberg tables backed by your own external cloud storage, or Hybrid Tables for transactional workloads.

🎯 Use Cases
・Permanently fix the "dashboards get slow during the nightly batch" problem by isolating workloads onto separate warehouses.
・Quarantine data scientists' exploratory queries on a dedicated warehouse to protect production analytics.
・Split warehouses per department to make cost visible and chargeable.

⚠️ Caveats
・Compute consumes credits only while a warehouse is running, billed separately from storage. Estimate the two independently.
・Shared storage does not mean shared access: permissions are enforced separately by RBAC in the cloud services layer.
・"Separation" is a logical design principle. Spinning up warehouses indiscriminately increases what you must manage, so split by meaningful workload boundaries.

#Snowflake# #DataWarehouse#
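As a rough illustration of the warehouse-per-workload split described under Practical Usage, here is a minimal Python sketch that renders one `CREATE WAREHOUSE` statement per workload. The warehouse names (`ETL_WH`, `BI_WH`, `DS_WH`), the sizes, and the auto-suspend values are hypothetical placeholders, not recommendations; the script only builds SQL text and does not connect to Snowflake.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WarehouseSpec:
    name: str            # hypothetical warehouse name, one per workload
    size: str            # Snowflake warehouse size, e.g. "SMALL", "MEDIUM"
    auto_suspend_s: int  # idle seconds before the warehouse suspends


def create_warehouse_sql(spec: WarehouseSpec) -> str:
    """Render a CREATE WAREHOUSE statement for a single workload."""
    return (
        f"CREATE WAREHOUSE IF NOT EXISTS {spec.name} "
        f"WAREHOUSE_SIZE = '{spec.size}' "
        f"AUTO_SUSPEND = {spec.auto_suspend_s} "
        "AUTO_RESUME = TRUE "
        "INITIALLY_SUSPENDED = TRUE"
    )


# Assumed workload split: ETL, BI dashboards, data science.
WORKLOADS = [
    WarehouseSpec("ETL_WH", "LARGE", 60),
    WarehouseSpec("BI_WH", "MEDIUM", 300),
    WarehouseSpec("DS_WH", "SMALL", 120),
]

if __name__ == "__main__":
    for spec in WORKLOADS:
        print(create_warehouse_sql(spec) + ";")
```

Because a suspended warehouse consumes no credits, pairing `AUTO_SUSPEND` with `AUTO_RESUME` keeps each workload's compute cost tied to the time it actually runs, which is what makes per-department chargeback practical.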

cv usk @cv_usk

2026.06.17 18:58

🏗 Migrating tens of thousands of jobs that ingest petabytes a day, without ever stopping data delivery. Meta's playbook of shadow, then reverse shadow, then cleanup deprecated the legacy system 100%.

Title: Migrating Data Ingestion Systems at Meta Scale
URL:

📝 Overview
Meta incrementally scrapes several petabytes of social-graph data daily from one of the world's largest MySQL deployments into its data warehouse. This post explains how they migrated tens of thousands of those ingestion jobs to a new self-managed service without disrupting analytics, reporting, and ML pipelines.

❓ Challenges Solved
The legacy system relied on customer-owned pipelines, which worked at small scale but became unstable at hyperscale. The team had to meet increasingly strict data landing-time requirements while migrating without interrupting data delivery across the organization.

💡 Methodology & Proposed Approach
They migrate through a three-phase lifecycle.
・Shadow: in pre-production, consume production data while writing to isolated tables, continuously monitoring row-count and checksum mismatches against production jobs
・Reverse shadow: promote shadow jobs to write to production tables and demote the original production jobs to shadow, keep comparing outputs for quality signals, and roll back instantly if needed
・Cleanup: deprecate old jobs after confirming consistency
・Each job is verified on four axes (zero differences, landing latency, resource usage, custom criteria), and CDC maintains full-dump, delta, and target tables

🎯 Use Cases
It informs migrating large data-ingestion platforms, phasing CDC pipeline cutovers, and designing zero-downtime system replacements.
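To make the shadow-phase check concrete, here is a minimal Python sketch of comparing a production job's output with a shadow job's output by row count and checksum. The post does not say how Meta computes its checksums, so the order-independent digest below (sum of per-row SHA-256 values modulo 2^256) and the names `digest` and `compare_outputs` are assumptions chosen for illustration only.

```python
import hashlib
from typing import Iterable, List, NamedTuple, Tuple

Row = Tuple[object, ...]
MODULUS = 2 ** 256


class TableDigest(NamedTuple):
    row_count: int
    checksum: str


def digest(rows: Iterable[Row]) -> TableDigest:
    """Order-independent digest: row count plus the sum of row hashes."""
    total = 0
    count = 0
    for row in rows:
        row_hash = hashlib.sha256(repr(row).encode("utf-8")).digest()
        total = (total + int.from_bytes(row_hash, "big")) % MODULUS
        count += 1
    return TableDigest(count, f"{total:064x}")


def compare_outputs(prod_rows: Iterable[Row],
                    shadow_rows: Iterable[Row]) -> List[str]:
    """Return a list of mismatch descriptions; empty means zero differences."""
    prod, shadow = digest(prod_rows), digest(shadow_rows)
    mismatches = []
    if prod.row_count != shadow.row_count:
        mismatches.append(
            f"row count mismatch: prod={prod.row_count} shadow={shadow.row_count}"
        )
    if prod.checksum != shadow.checksum:
        mismatches.append("checksum mismatch")
    return mismatches


if __name__ == "__main__":
    prod = [(1, "alice"), (2, "bob")]
    shadow = [(2, "bob"), (1, "alice")]  # same rows, different order
    print(compare_outputs(prod, shadow))
```

Summing hashes rather than hashing the concatenated rows means the two systems can write rows in different orders and still compare equal, which matters when the old and new jobs partition or sort data differently.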
📊 Outcomes
・100% of the workload was migrated and the legacy system was fully deprecated
・Job status signals were streamed continuously to Scuba, and a migration tool monitored each job and automatically promoted or demoted it between stages, which made thousands of concurrent migrations manageable
・Bad partitions were flagged in metadata to trigger alerts and to keep them from propagating to downstream jobs
・To work within capacity limits, they reused the old system's snapshot partitions as initial snapshots to cut full-dump load. The data-quality analysis tool built for the migration is still used for release validation

#DataEngineering# #DataInfrastructure#
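The automatic promotion and demotion between stages can be pictured as a small state machine. The sketch below is an assumption-laden simplification, not Meta's tool: the `Stage` and `JobSignals` types, the single-evaluation promotion rule, and the decision to treat cleanup as terminal are all hypothetical. A real system would likely require signals to stay healthy over a time window before promoting.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    SHADOW = 1
    REVERSE_SHADOW = 2
    CLEANUP = 3


@dataclass(frozen=True)
class JobSignals:
    """The four verification axes named in the post, as simple flags."""
    diff_count: int            # row-count or checksum differences observed
    landing_latency_ok: bool
    resource_usage_ok: bool
    custom_checks_ok: bool

    def healthy(self) -> bool:
        return (
            self.diff_count == 0
            and self.landing_latency_ok
            and self.resource_usage_ok
            and self.custom_checks_ok
        )


def next_stage(stage: Stage, signals: JobSignals) -> Stage:
    """Promote one step on healthy signals; roll back from reverse shadow otherwise."""
    if stage is Stage.CLEANUP:
        return stage  # assumed terminal: old jobs are already deprecated
    if signals.healthy():
        return Stage(stage.value + 1)
    if stage is Stage.REVERSE_SHADOW:
        return Stage.SHADOW  # demote: the original job takes production back
    return stage  # an unhealthy shadow job simply stays in shadow


if __name__ == "__main__":
    good = JobSignals(0, True, True, True)
    bad = JobSignals(3, True, True, True)
    print(next_stage(Stage.SHADOW, good))
    print(next_stage(Stage.REVERSE_SHADOW, bad))
```

Keeping the transition rule a pure function of the current stage and the latest signals is one way to let a single tool drive thousands of jobs at once, since each job's next step can be computed independently.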
