When a partitioning change to our petabyte-scale ClickHouse cluster caused critical billing jobs to stall, standard metrics showed no obvious errors. Here's how we identified severe lock contention in ClickHouse's query planner and built upstream patches to fix it.