cv usk (@cv_usk): A useful but little-known Gemini API feature 💰

Sending the same massive system prompt with every request? That token bill adds up fast. Gemini's "Context Caching" lets you cache long, shared inputs once and reuse them across subsequent requests, dramatically cutting input token costs. If you're repeatedly working with large documents or lengthy prompts, this changes the economics entirely.

📌 Title: Context Caching
🔗 URL: https://t.co/4M6Xd8z2Gv

🧩 Overview
Every time you send a long shared context (full manuals, codebases, hundreds of PDF pages) to an LLM, you pay for those tokens again. Context caching holds that context on Google's side so follow-up requests just reference it. There are two flavors: implicit caching (identical prefixes are automatically reused) and explicit caching (you manually create a cache with a TTL).

🛠 How to use it
For explicit caching, call the cache creation API with your system instruction or large input content. You get back a cache name to pass into subsequent generateContent requests. TTL defaults to one hour and is adjustable. Implicit caching requires zero setup: identical input prefixes are reused automatically, so you may already be benefiting without any code changes.

🏗 Building it into production
・Internal knowledge-base QA: cache your full company manuals or policy docs so each user question doesn't re-send the entire corpus. Faster responses, lower cost.
・Code review bots: cache the repo's codebase and coding standards, then each PR review request skips re-uploading the common context.
・Customer support: cache FAQs and product specs so every ticket doesn't carry a massive context payload.
・Batch analysis pipelines: when running many individual queries against the same reference data, caching compresses the per-query cost.
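
The explicit-caching flow from "How to use it" (create a cache once, then pass its name into each generateContent request) could be sketched roughly as below. This assumes the google-genai Python SDK, where `client.caches.create` and `client.models.generate_content` accept dict-style configs; the helper name `ask_with_explicit_cache` and its parameters are hypothetical, chosen only for illustration.

```python
def ask_with_explicit_cache(client, model, system_instruction, shared_context,
                            questions, ttl_seconds=3600):
    """Create one explicit cache, then reuse it for every question.

    `client` is assumed to be a google-genai client (genai.Client()).
    Returns the cache name and the list of answer texts.
    """
    # Step 1: create the cache once. The TTL is passed as a duration
    # string such as "3600s" (one hour, matching the default noted above).
    cache = client.caches.create(
        model=model,
        config={
            "system_instruction": system_instruction,
            "contents": [shared_context],
            "ttl": f"{ttl_seconds}s",
        },
    )

    answers = []
    for question in questions:
        # Step 2: reference the cache by name so the large shared
        # context is not re-sent with each request.
        response = client.models.generate_content(
            model=model,
            contents=question,
            config={"cached_content": cache.name},
        )
        answers.append(response.text)
    return cache.name, answers
```

With the SDK installed, `client` would come from `from google import genai` followed by `genai.Client()`, and `model` would be a model that supports caching. Keep in mind that cache creation is rejected when the shared context is below the model's minimum token count.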

💡 Use cases
📚 Repeated Q&A over long documents
🔍 High-volume requests sharing the same system prompt
🧑‍💻 Dev tools with an entire codebase as context
📊 Multi-angle analysis on a single dataset

⚠️ Watch out
There's a minimum token count to create a cache, so short prompts won't qualify. Cache storage also has its own cost, which can outweigh savings if usage is infrequent. Match the TTL to your actual access pattern and focus on use cases where the same large context is genuinely hit many times.

✨ The cost of "sending the same thing over and over" compounds quickly. Start by caching your longest shared context and watch the bill drop.

#Gemini #LLM
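
To make the "Watch out" trade-off concrete, here is a rough break-even sketch. The cost model is an assumption, not official pricing: cache creation is billed once at the normal input rate, each cached request pays a discounted rate on the shared context, and storage is billed per token-hour. All prices are hypothetical placeholders; substitute the current rates for your model.

```python
import math


def caching_savings(context_tokens, requests, hours,
                    input_price_per_m, cached_price_per_m,
                    storage_price_per_m_hour):
    """Return (cost without cache) - (cost with cache) for the shared context.

    Prices are per million tokens; storage is per million tokens per hour.
    A negative result means caching costs more than it saves.
    """
    millions = context_tokens / 1_000_000
    without_cache = requests * millions * input_price_per_m
    with_cache = (
        millions * input_price_per_m                    # one-time cache creation
        + requests * millions * cached_price_per_m      # discounted reads
        + millions * storage_price_per_m_hour * hours   # storage for the TTL
    )
    return without_cache - with_cache


def break_even_requests(hours, input_price_per_m, cached_price_per_m,
                        storage_price_per_m_hour):
    """Smallest request count at which caching is strictly cheaper.

    Under this model the context size cancels out, so the answer depends
    only on the prices and on how long the cache is kept alive.
    """
    discount = input_price_per_m - cached_price_per_m
    if discount <= 0:
        raise ValueError("cached price must be lower than the input price")
    threshold = (input_price_per_m + storage_price_per_m_hour * hours) / discount
    return math.floor(threshold) + 1
```

With hypothetical prices of 1.00 (input), 0.25 (cached), and 1.00 (storage per hour) per million tokens, a cache kept for one hour pays off from the third request onward. Two requests in that hour would cost more than not caching at all, which is the infrequent-usage case the warning describes.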

9 hours ago

