Reduced RAG: Stop Stuffing Context Windows and Start Extracting Signals
If you're brand new to RAG, start with RAG Explained and RAG Architecture. This post is for the point where you've built a RAG pipeline that mostly works…
Small language models: Rethinking enterprise AI architecture
As LLMs hit the limits of scale and cost, specialized SLMs are emerging as the faster, cheaper, and more private workhorse for the autonomous enterprise.
When it comes to software developers, there are a few distinct types. For example, the extroverted, chatty type who is always out there sharing the latest libraries and projects …
The agent-led growth playbook: how to make AI agents discover, use, and pay for your developer tool, and defend against the ones you didn't invite. Covers LLM discoverability, agent-first onboarding, agent payments, and AX security.
Most AI SEO advice is unproven. We tested what ChatGPT, Claude, and Perplexity actually read on our own site. Six LLM visibility techniques that worked, eight that didn't, and the metrics to tell the difference.
Using a local LLM in OpenCode with llama.cpp – Aayush Garg
Step-by-step setup for running a quantized Qwen3.5-27B model on a remote GPU via llama.cpp, exposing it over Tailscale and using it as a provider in OpenCode (optionally with Codex).
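A rough sketch of the client side of that setup, assuming the remote box is already running llama-server (which exposes an OpenAI-compatible /v1 API) and is reachable over Tailscale at a hypothetical MagicDNS name like gpu-box:

```python
# Sketch: talk to a remote llama.cpp server over Tailscale from Python.
# Assumes `llama-server -m <model>.gguf --host 0.0.0.0 --port 8080` is running
# on the GPU machine; "gpu-box" is a hypothetical Tailscale MagicDNS hostname,
# not something from the original post.
from openai import OpenAI

client = OpenAI(
    base_url="http://gpu-box:8080/v1",  # llama-server's OpenAI-compatible endpoint
    api_key="not-needed-locally",       # llama.cpp ignores the key unless --api-key is set
)

resp = client.chat.completions.create(
    model="local-model",  # llama-server serves whatever model it was started with
    messages=[{"role": "user", "content": "Summarize this repo's README in two sentences."}],
)
print(resp.choices[0].message.content)
```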
The Complete Developer's Guide to Running LLMs Locally
A comprehensive guide covering the local LLM stack from hardware requirements to production deployment. Compare Ollama, LM Studio, llama.cpp and build your first local AI application.
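For a feel of the "first local AI application" part, here is a minimal sketch against Ollama's native HTTP API on its default port; the model name is illustrative:

```python
# Sketch: a first local AI app against Ollama's HTTP API.
# Assumes Ollama is running locally (default port 11434) and a model has been
# pulled, e.g. `ollama pull llama3.1` -- the model name here is illustrative.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",
        "prompt": "Explain what a GGUF quantized model is in one paragraph.",
        "stream": False,  # return a single JSON object instead of streaming chunks
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```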
Cursor, Claude Code, and Codex are merging into one AI coding stack nobody planned
Cursor, Claude Code, and OpenAI Codex are forming a composable AI coding stack with orchestration, execution, and review layers instead of consolidating into one tool.
There’s a lot of conversation right now about “context engineering” for dev work: structuring what you feed an LLM so it can do useful things. …
A deep dive into effective caching strategies for building scalable and cost-efficient LLM applications, covering exact key vs. semantic caching, architectural patterns, and practical implementation tips.
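For the exact-key half of that comparison, a minimal sketch of the idea: hash the normalized prompt plus generation parameters and check the cache before calling the model (names and helpers below are illustrative, not from the article):

```python
# Sketch: exact-match inference cache keyed on a hash of the prompt + params.
# All names are illustrative; swap the dict for Redis and `call_llm` for your
# real client in production.
import hashlib
import json

_cache: dict[str, str] = {}

def call_llm(prompt: str, *, model: str, temperature: float) -> str:
    # Placeholder for the real model call (OpenAI, llama.cpp, Ollama, ...).
    return f"[{model} answer to: {prompt[:40]}]"

def cache_key(prompt: str, model: str, temperature: float) -> str:
    # Normalize the prompt and serialize params deterministically so that
    # identical requests always hash to the same key.
    payload = json.dumps(
        {"prompt": prompt.strip(), "model": model, "temperature": temperature},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_completion(prompt: str, model: str = "example-model", temperature: float = 0.0) -> str:
    key = cache_key(prompt, model, temperature)
    if key in _cache:
        return _cache[key]  # hit: no API call, no token cost, near-zero latency
    answer = call_llm(prompt, model=model, temperature=temperature)
    _cache[key] = answer
    return answer
```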
Build an Inference Cache to Save Costs in High-Traffic LLM Apps - MachineLearningMastery.com
In this article, you will learn how to add both exact-match and semantic inference caching to large language model applications to reduce latency and API costs at scale.
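To complement the exact-match sketch above, a rough illustration of the semantic side: embed incoming prompts, compare them against previously answered ones, and reuse an answer above a similarity threshold (the embedding stub and the 0.92 threshold are illustrative assumptions):

```python
# Sketch: semantic inference cache -- reuse an answer when a new prompt is
# "close enough" to one already answered. `embed` is a stand-in for a real
# embedding model; 0.92 is an arbitrary illustrative threshold.
import math

def embed(text: str) -> list[float]:
    # Placeholder: replace with a real embedding model (sentence-transformers,
    # an embeddings API, etc.). This toy version just counts letters.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(y * y for y in b)) or 1.0
    return dot / (na * nb)

semantic_cache: list[tuple[list[float], str]] = []  # (prompt embedding, cached answer)

def lookup(prompt: str, threshold: float = 0.92) -> str | None:
    # Return a cached answer if any stored prompt is similar enough.
    query = embed(prompt)
    best = max(semantic_cache, key=lambda item: cosine(query, item[0]), default=None)
    if best and cosine(query, best[0]) >= threshold:
        return best[1]
    return None

def store(prompt: str, answer: str) -> None:
    semantic_cache.append((embed(prompt), answer))
```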