Jeff Huber
Good to see you.
Drew Breunig
Nice to see you, Jeff.
Jeff Huber
Thanks for coming today. We've had a lot of fun conversations over the past year about all things context engineering, and touched on philosophy and other topics as well. So I have no doubt this is going to be an interesting conversation. If we wind back roughly six months, you started publishing a series of blog posts on context engineering. Where were you in life? How did you come to those posts? Why did you write them? What was the response like? Let's start there.
Drew Breunig
Oh, great. So that's a fun one. So I write about AI. I write, period, not just about AI; I write about technology because it helps me organize my thoughts. It provides a searchable index of polished thinking and artifacts, which I have found incredibly useful over 20 years. But also, you connect with people who are thinking about the same problems. If you put something on the internet, people will argue with you. Not a bad signal. Yeah, and what's great about that is your position gets more robust, you understand it, you correct yourself, and you meet lots of interesting people. That's how we met. That's how we met. And so, context engineering. I started writing about AI as it started to become a thing. I mean, I had been writing about machine learning and deep learning since at least 2016. I was actually revisiting my post about how machine learning would change everything, from 2016. And there's a couple things wrong in it, but I went back to it because I cited a Pebble device as the future: Internet of Things was going to be more of a thing because you could do more stuff at the edge, or so I thought. It turned out we became more reliant on cloud computing. But I revisited it because of that Pebble news. The Pebble, which is another good example. I love it. It's local compute. Anyway, so I started writing about AI. One of the reasons I leaned into AI is because I was frustrated with the discourse. There are two big buckets of AI writing. One is the boosters and hype beasts who are just like, it's going to change everything. I don't know if you remember, but I would get into arguments with very high-level product and business leaders like three years ago where they're like, yeah, when GPT-5 comes, we're at AGI. And I'm just like, I don't think so. I think you've got to pay attention to the progression. But it was just in the air, and everybody was buying that hype. Totally. PhD in your pocket, all of these things. They're inherently reductive. They overpromise, and they focus on prototypes and demos. And AI, as you know, is really good at demos. It's really hard to get it to work. The other bucket of writing, which I find much more interesting, is the research and papers. But those are rather hard to read. They require some skills. They require sorting through an avalanche; I mean, how many papers are published every single day? So it's really unattainable for normal people who don't live and breathe this stuff. So my goal was: how can I write about AI as normal technology? Talk about how it applies, how it works. And that's kind of the standard I hold myself to. Yes, yes.
Jeff Huber
So if AI is not, as I like to cheekily refer to it sometimes, a deus ex machina, if it is not a techno machine god, you know, the plot device, the literary device that's going to solve all of humanity's problems. Yeah. It's not a panacea. What is it?
Drew Breunig
It's technology. I mean, we can come to that in a second. I want to work up to that, I think. Let's revisit that. Maybe we'll get back to our philosophy and history of science and technology. I hope a little bit, yes. All of that. Anyway, I like to read model papers when they get released. The technical papers, not the model cards. The model cards are pretty boring, but the technical papers are really good. One of my favorite technical papers was Gemini 2.5's. And the reason I loved it is because in the appendix, they spent two or three pages explaining how they got Gemini to play Pokemon well. And the reason I liked it was they walked through the harness. They actually walked through the prompts they used, which, again, people don't do. They're like, here's my benchmark score. You don't get the prompts. You don't get the tools. You don't get the harness. You don't get any of that. You just get the end result, what they RL'd against, what they post-trained against. You get nothing. And so here was a big lab going into the warts-and-all discussion. And better than that, they talked about how it failed. If everybody talked about how their stuff failed, it would be such a more interesting conversation. And in there was one throwaway line. In the rest of the paper, they're talking about how they have, you know, a million-token context coming out soon, and it's going to be huge. And then in this appendix, they say: anytime we went over 200,000 tokens, it all kind of went south.
Jeff Huber
Oh, no way.
Drew Breunig
They say it in there. Oh, wow, I didn't know that. And that was what started me thinking about context: there's a soft limit there. It's not a panacea. And compare that to the conversation we were having in 2024, which was that long context would come and save us all. We didn't need harnesses. We didn't need RAG. RAG was going to be dead because you could just throw the entire internet into the context and the problem would be solved. And here we had a very different message. And so I became obsessed with this. I started reading about all the ways context failed. And when you start to read, you synthesize and you find patterns and you write that up. So I published the first article, which was How Long Contexts Fail. Most of the time when I publish, I'm like, okay, maybe someone will find this interesting. That one I'm like, this is really interesting. I feel like it's kind of bulletproof. And I published it and kind of nothing happened. It was very slow. I was lucky enough to be on a bicycling vacation in Napa with my wife that week, so it was great. I didn't have to check the computer. Your wife was happy. But I was a little disappointed. I very rarely put long research efforts into writing, and suddenly it's there. Right. And then, brief aside, I was down at Stanford talking with Chris Potts, who's a professor down there. And we were talking about a bunch of things, about sometimes having hard arguments, or truisms, or talking about realism in how you build with AI, and how it's sometimes really hard to convince people because they're just so excited about it. And he kind of got that thousand-yard stare and said, or you can just wait for Karpathy to say it in a couple months. So true, though. And I laughed because I thought he was making a joke, and then he didn't laugh. And I'm like, oh yeah, you're right. And that was the same case here: Karpathy said context engineering, and suddenly everybody scrambled to find context engineering pieces. And then I think Simon Willison and someone else pointed to my pieces and said, this is the only framework we have, so here it is. And then it took off, and it was great. The feedback has been great. Again, the reason you write is for feedback: to understand what you missed and what the fixes are. And now I've had this excuse to go have tons of conversations, get feedback from some of the brightest minds working and building with AI today, or researching. People actually getting their hands dirty with this stuff. And the other thing that was really fun, which I hadn't appreciated, is I got tons of feedback not just from builders. Like 50% of the feedback I got back was: thanks for this, it helped me code better with coding agents. It helped them think about what was wrong with their context when they were building, and it changed the way they used Claude Code or whatever it may be. And I was shocked by that. I hadn't thought about it that way, but I was really happy to see that benefit.
Jeff Huber
Yeah. You mentioned in particular the model cards and the benchmarks, and how it's not just the model. It's the agent harness, it's the prompts, the resources and skills and tools it has access to, all composited into the harness. And this is commonly not well understood, right? From a model provider's perspective, the best way to juice the most performance out of your model for benchmark-maxing purposes is to pull all of those levers. And in some ways, it would be good for the developer community to understand what levers are being pulled to get the most performance in certain directions. But a lot of that information tends to be below the NDA line and does not get published. And you pointed out the Google Gemini 2.5 report. It seems, increasingly, that Google structurally has so many other advantages in their market position that they can afford this openness. I even heard from a friend that at NeurIPS they had a much larger contingent present than the other labs did. And I was positing: is this because Google can be open where the other labs can't?
Drew Breunig
Yeah. I mean, there's a lot of different ways to come at that discussion, and I like it. This is another good site of negotiation that reveals a lot of preferences, not just about technology, but culture and everything else. I think some of it is the legal side. And I don't want to just say the NDAs, because I think we joked at some point: when all this is done, I want someone to go back and calculate what the value was of being one point higher on SWE-bench. Because there is unquestionably a valuation value, a stock market value, attached to it. So there's a huge incentive there. There's also the NDA question, which is: are you in an open or closed shop? I also think there's a political dimension inside the companies. I've talked to at least three different people who are working on post-training frontier models at a host of labs, and their bonus is tied to their SWE-bench score or whatever the benchmark is. And I've seen this movie before. I remember, a lifetime ago, I worked in the video game industry; Electronic Arts was a big client. And bonuses were tied to Metacritic ratings. Whatever becomes measured becomes a target. And so you have these things. And that can be a pessimistic way to look at it. It's frustrating that we don't know how these models are achieving the benchmarks. We just kind of have to take their word for it. Sometimes they give us clues. Credit to the Moonshot team: when their Kimi K2 Thinking model came out, they had a little paper, and they said, here's what we did. And then there was an asterisk at the bottom that said, you can't expect these results in our chat interface, because we overloaded the model with exactly the right tools, and in our chat interface we only have these tools. So at least they're being honest about that. I like that. But I think the positive way to look at this, and this is something that came out of a conversation I had a couple of weeks ago with Omar Khattab, is we were talking about this problem of benchmarks, how representative they are, and ultimately how useful they are to people building. Yep. The metaphor he used to explain it to me, and I really like this, and it's the most San Francisco tech bro AI metaphor, which is funny coming from Omar, who's not that type, is that the benchmark is, to use a weightlifting metaphor: you've been training for a year, and then you take a break to rest. You wait for your perfect day. You take a week off. You go in, and you see how much you can hit clean. Now, is that representative of how much you can lift in practice? Is that representative of your day-to-day health? Of your performance as an athlete? Or is it just representative of what you are capable of on your best day? How many steroids are you on? It's the theoretical maximum. And that is useful in some cases. It helps you easily eliminate a model, perhaps. But it really is not the full story. And I think that's important to keep in mind when people start comparing benchmarks anytime a model comes out.
Jeff Huber
Yeah. So how much are you doing day-to-day, week-to-week, month-to-month with the models now? Are you building personal tools? How do you get a handle on what they're capable of? Obviously this informs your writing as well, presumably, right?
Drew Breunig
I used to do a lot more. I have a handful of projects these days that take a lot of my time away. It used to be that anytime a new model came out, I'd see, hey, is there an MLX build that I can download and run on my... This is actually one of the reasons I like DSPy so much: I have just a couple of programs that I run prompt optimization against. And once you've done that for one program, you can throw any model in there and say, hey, how does this perform against the test? You've kind of created your own personal benchmark because of the modularity of it. So no, I now take the path of old, busy family people, which is: new model comes out, all right, I'll read what Simon has to write about it, and then I'll wait a couple weeks and see how it all sorts out. That's certainly relatable.
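(A rough sketch of the personal-benchmark pattern Drew describes, in DSPy. The program, metric, dev set, and model IDs here are illustrative stand-ins, not his actual setup; DSPy model strings follow LiteLLM-style naming and may differ in your environment.)

```python
import dspy

# One small program, held constant across models.
qa = dspy.ChainOfThought("question -> answer")

# A tiny, fixed dev set acts as the personal benchmark.
devset = [
    dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
    dspy.Example(question="What is 12 times 12?", answer="144").with_inputs("question"),
]

def metric(example, pred, trace=None):
    # Loose containment check; swap in whatever scoring your task needs.
    return example.answer.lower() in pred.answer.lower()

# Because the program is modular, any model can be swapped in and re-scored.
for model_id in ["openai/gpt-4o-mini", "anthropic/claude-sonnet-4-5"]:  # illustrative IDs
    dspy.configure(lm=dspy.LM(model_id))
    evaluate = dspy.Evaluate(devset=devset, metric=metric, display_progress=False)
    print(model_id, evaluate(qa))
```

The point is the modularity: once the program and metric exist, comparing a new model is one line of config, not a rebuild.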
Jeff Huber
I have been spending quite a bit of time over the past two weeks with Opus 4.5. And I've found that it is like maybe bordering on like level three, level four autonomy in the sense that I can just like give it high level English. Yeah. And it just goes off and does stuff.
Drew Breunig
Does what? Like explain.
Jeff Huber
Walk through a use case. Anything. I mean, for example, I did this: I had dumped the users database from Chroma's cloud service. And then I wanted to do some smart, intelligent re-engagement campaigns. So I wanted to send them a personalized email from me, genuinely trying to be helpful and supportive. I want to learn more about their use case, how we can support them. But of course, it's a big list. How should you stack-rank that? How should you prioritize it? Ideally, it would hook into our product analytics to personalize the email. Not in a panopticon sense. You don't want to send someone an email saying, I see that you submitted 4,000 API requests yesterday and 2,000 today, can we be helpful? That's probably too much, crossing the line, if you will. Yeah. So I just told Claude Code, in this case that's the harness, with Opus 4.5 under the hood, and no additional config or tools or system prompts. And I said, here's the CSV of all of our users. I want you to use PostHog to go and reference the events for these users. And then I want to draft emails in my Gmail to send. And it just did it. Watch it go. I didn't have to fix anything, didn't have to correct anything. It had to ask me for my PostHog API key. It had to ask me for my Gmail OAuth bridge, which I did: it opened up the website, I clicked accept, approve, went back, it got the token. The ability to build an internal tool like that in one shot was incredible.
Drew Breunig
We work with a fair amount of researchers, and recently one of the big conclusions bubbling beneath the surface is that the models are essentially untapped. Opus 4.5, GPT-5.1: we have barely scratched the surface of what they're capable of. They're actually capable of incredible things. These days, the gains are in the harness. The gains are in everything you just ran through. I think that is really interesting because I don't want AGI. I want to give it a task that I define, and then I get that task performed.
Jeff Huber
Good.
Drew Breunig
And I think expressing what you want ends up being the bulk of the work. And that is an incredible thing. And that goes back to model switching, which is say there's a better model tomorrow.
Jeff Huber
How do you take that spec, how do you take that harness, and how do you shift that over to where you want to be? Yeah. And I think that's a really interesting way to start thinking about this. That makes me think of a question I would love your take on, which is: it seems there is not, well, I'm sure something exists on GitHub somewhere, but to my knowledge there's no popular or even semi-popular context engineering harness, broadly. There are maybe closed-source ones folded into Claude Code. Maybe there's something inside the Codex codebase from OpenAI, the open source Claude Code competitor in Rust. Maybe OpenCode has some version of this inside the OpenCode world. But they always seem, at a minimum, very much tied to these coding CLIs. And for all these topics that you've written about and we've talked about, like context poisoning, context rot, context focusing, et cetera: how is there not a tool set that builders can pick up and use more generically?
Drew Breunig
Well, I hadn't thought about it this way, but when you asked, hey, do you play with the new models when they come out, I realized I don't anymore. But I can tell you what I do look at now more often, which I should take as a signal for where the work is going. Devstral came out yesterday, a new Devstral model from the Mistral team. And they came out with their own kind of version of Claude Code, and it's open source. So the first thing I did is I checked the code. I wanted to see what the tools were, what the agents were, what the prompts were. And what the harness was. I did the same thing when Kimi did theirs. And you have OpenHands, and you have all of these different things. And I look at them, and it's very interesting to see how everybody implements it. There's a trend. I think all of the people working on this are trying to find that knife's-edge balance between generic simplicity. Yes. And how much specificity do I actually need to provide to get the job done? That is the task of the agent framework designer these days. To dive deeper into this, I actually stayed up a little too late on Monday. I gathered a bunch of these agent frameworks and said, all right, let's do a hello world in each of them, but the same hello world. And the hello world I did was just deep research. I'm going to give it a tool to search the web. I'm going to give it a tool to get web pages. I'm going to give it a tool to write out files and read files that exist. And then I'm going to give it a ReAct loop, and away you go. And so I did that with OpenAI's Agents SDK. I did that with Claude's Agent SDK, which basically pipes into Claude Code in headless mode. I did that with naked API calls to OpenAI with tool specs. I did that with Crew. I did that with LangChain's Deep Agents, which I think is actually a good starting point for anyone doing this experiment. And then I also did it within DSPy, using their ReAct module. So, exact same thing. And you go through them and you see this commonality that everyone is settling on, which is: there's tool provisioning and definition and handling. There's the initial prompt, the system prompt, and context preparation. There's the user input. And then there's the agentic loop. And the thing I think is interesting about the agentic loop is that it's not complex at all. Basically, all of these tools are essentially just prompt assemblers. You send the assembled prompt in. You extract a tool call if one comes back. If it's a finished tool call, you exit. And then you write. They're all fundamentally the same, but we're in a Cambrian era of different ways to achieve the same thing. And so to your point about what the context engineering tool is, I think the best way to state it is that it's really the tools you provide that are managing that repeat flow. And people have different ways of doing that. So Deep Agents has to-do lists and essentially a virtual file system. Claude and Kimi and Devstral are all doing compact and write-out: mostly compact and grep, that's almost all they do. I think the compact is the only place where you can argue that context engineering is actually taking place.
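(To make that shared skeleton concrete, here's a minimal sketch of the loop Drew describes, written against the OpenAI chat completions tool-calling API. The single stub tool, the prompts, and the model choice are placeholders; real harnesses differ mainly in what they add around this core.)

```python
import json
from openai import OpenAI

client = OpenAI()

# Tool provisioning and definition: one illustrative tool.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "fetch_page",
        "description": "Fetch the text of a web page.",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
}]

def fetch_page(url: str) -> str:
    return f"(stub) contents of {url}"  # real fetching elided

def run_agent(task: str, max_turns: int = 20) -> str:
    # Initial prompt: system prompt plus user input.
    messages = [
        {"role": "system", "content": "You are a research agent. Use tools, then answer."},
        {"role": "user", "content": task},
    ]
    # The agentic loop: assemble the prompt, send it, route tool calls back in.
    for _ in range(max_turns):
        resp = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=TOOLS
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content  # no tool call means the model is finished
        messages.append(msg)
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = fetch_page(**args)
            messages.append(
                {"role": "tool", "tool_call_id": call.id, "content": result}
            )
    return "(gave up after max_turns)"
```

Everything else a framework gives you, to-do lists, virtual file systems, compaction, is layered on top of this same send-extract-execute cycle.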
Jeff Huber
There's some sort of trajectory control right there, is how I think about it: you determine what context is important to pass to the next iteration of that loop.
Drew Breunig
Well, that is just like you're hitting the limit and I need to shuffle this off while not disrupting the user. That's the goal.
Jeff Huber
Yeah.
Drew Breunig
I think the problem is that people are trying to decide how much to automate that trajectory discovery, based on what Opus or whatever model knows your trajectory to be, doing essentially reflection to figure out how to compact. Correct. Because there is no one-size-fits-all compact. Yeah. And then you have, I think Amp has their version of that, handoff, where you actually input how the compact should take place.
Jeff Huber
As a user.
Drew Breunig
As a user. Oh, interesting. You're basically saying, hey, compact this, this is what I'm doing next. You let it in on that secret. And I think right now it's mostly being automated. And the question is, the only people who are really doing context engineering are, on one side, there's some passive stuff where people are making their own little plugins, yak-shaving and bike-shedding their own Claude Code instance or whatever tool they're using. Or you're someone who's building your own agent or your own pipeline, and you're actively thinking about how you assemble this and how you maintain it to actually get the accuracy and reliability you need.
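(For readers who want the shape of that compact step, here's a bare-bones sketch: when the transcript nears a token budget, summarize the older turns and carry the summary forward. The budget, the keep-last-N split, the character-based token estimate, and the summarizer prompt are all simplifying assumptions; real tools count tokens properly and summarize far more carefully.)

```python
import json
from openai import OpenAI

client = OpenAI()

def estimate_tokens(messages) -> int:
    # Crude heuristic: roughly 4 characters per token.
    return sum(len(str(m.get("content") or "")) for m in messages) // 4

def compact(messages, budget_tokens=150_000, keep_last=10):
    if estimate_tokens(messages) < budget_tokens:
        return messages  # under budget, nothing to do
    head, tail = messages[:-keep_last], messages[-keep_last:]
    summary = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Summarize this session so the work can continue seamlessly:\n"
                       + json.dumps(head, default=str),
        }],
    ).choices[0].message.content
    # Replace the old turns with one compacted summary message, keep recent turns verbatim.
    return [{"role": "system", "content": f"Earlier context (compacted): {summary}"}] + tail
```

The handoff variant Drew mentions would add one thing: the user's statement of what comes next gets appended to the summarizer prompt, so the compact is steered instead of generic.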
Jeff Huber
Yeah, you mentioned giving this deep research task to many of these agent harnesses. Did they all perform the same? Were some better than others? How did you evaluate that? Roughly the same. Okay. Because naked tool calls compared to that... it would be surprising to me if naked tool calls were just as good, or similarly as good. But you have to remember that these models are being...
Drew Breunig
Trained on all of this. And this is why we're past AGI, the idea of general intelligence. The labs have kind of given up on it. They're training these tool calls in during post-training. I see. So they're baking how to use tool calls into the models themselves. And that's another question which I think is really interesting: as you start to RL, or sorry, post-train... a lot of post-training can be thought of as building UX into the model. Baking UX into the model. Initially, you were doing that with reasoning, to basically expand the prompt and increase the surface area and allow for exploration and search. But then you start to bake in things like how to handle tool calls, when to use them for different things. That stuff just gets baked in. You're adding UX over and over. So the naked tool calls are pretty good. The new frontier Opus and GPT-5.1, they're incredible models. And they can do so much with very, very little.
Jeff Huber
Does this mean, to use the Silicon Valley soundbite, that the bitter lesson is coming for context engineering?
Drew Breunig
Or do you think that there will always be some level of alpha? I don't think so. Another fun experiment we were talking about, especially with Omar, is the thought experiment: fast forward five years. If you have GPT-8, which can effectively handle any problem you throw at it, the hard part becomes describing what you want consistently, and then also expanding beyond that. And you start asking questions: well, what are the other things I want out of this squishy programmable intelligence? I don't want to treat it like an autonomous thing. I actually want to define the behaviors. And how I define and express those behaviors becomes the real challenge. And unfortunately, right now with prompt engineering, it's different per model. If you work really hard and hand-build a prompt for any given model, you think you're expressing it in natural language, but the statistics around those words and phrases, in that sequence, inside of GPT are a different representation than inside Opus. So you can't even take that prompt with you once it's optimized. You've essentially overfit it to the model. And so I think figuring out how to define that intent in a way that is expressive, yet communicable and consistent, is really difficult. But then you have other things: the tools you choose to give it, the things you want it to focus on, the transparency of the decision making, the auditability of the path, the preferences. Even the best agents, which are really smart, will go down the wrong path and answer things you didn't ask, all the time. They will get distracted. So even if you fast forward the tape, we're still rediscovering the software engineering best practices we need to bake reliability in, even with these all-powerful models. Reliability is not guaranteed from the smartest person you know. In fact, maybe the opposite. Exactly. And I think that's why there's a huge role for software engineering in applied AI. We kind of have AI; we don't yet have applied AI or AI engineering.
Jeff Huber
Yeah. You use this term compound AI systems. Yes. Maybe, for the audience, define what compound AI systems are.
Drew Breunig
So I talk to lots of people building, and people building with AI generally fall into two categories, and it's a function of the tasks they're working on. The two categories are: you're a data pipeline or data processing person, so you're a data engineer, a data scientist, a data ops person; and then there's everything else, which is what I would call the agents and the applications and all those other things. Now, in software engineering, one of the great sins is premature optimization. You get things working, you make sure it works well, then you figure out how to make it work cheaply and fast. So in the agents and application layer, there's not a lot of incentive to optimize yet, because getting it to work, and getting it to work well, is hard enough. Reliability is a bigger issue than optimization. But in the data world, you're at scale from day zero. If I'm writing a pipeline to process a terabyte of PDFs, I can't throw that at Opus 4.5. I would be spending a ton of money, which I may not have, or I'd be waiting a month for it to complete. And I also have to worry about reliability much more than a human-in-the-loop chatbot does, because it's autonomously processing that text. So I'd pay a ton and wait a ton, and then out of the pipeline come results I didn't want. So I have to care about reliability a lot. And this has been the case for data engineers and data builders forever. You're always thinking about how long that run is going to be, how you can make it run more efficiently, and what those gains are. And so this came out of the BAIR lab at UC Berkeley. They published a paper called Compound AI; if you Google "compound AI," it'll be the top result. It's kind of a manifesto or position paper, published early last year. And it's about using data pipeline techniques: multi-step techniques, multiple different models. You may have a great model for text-to-speech here. You may have a really hard coordination path that you're going to toss to Opus 4.5, and you're going to be cool with that for that one step. But then you have an incredibly tiny model that has been tuned within an inch of its life to run some very specific categorization task. Maybe it's been fine-tuned, maybe it's been prompt-optimized; whatever it is, you know it works. And so you have this multi-stage path because the stakes are so much higher. You have to be thinking about reliability and efficiency on day one. That's compound AI.
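(A toy sketch of that multi-stage shape: route each pipeline stage to the cheapest model that is reliable for it. The `call` helper and the model names are hypothetical placeholders for whatever provider and models you actually use, not anything from the paper.)

```python
def call(model: str, prompt: str) -> str:
    # Placeholder for your provider client of choice.
    return f"(stub response from {model})"

def process_document(doc: str) -> dict:
    # Stage 1: a tiny, heavily tuned model for a narrow, testable task.
    category = call("tiny-tuned-classifier", f"Classify this document:\n{doc}")
    # Stage 2: a mid-sized model for cheap, high-volume extraction.
    facts = call("mid-size-extractor", f"Extract the key facts as JSON:\n{doc}")
    # Stage 3: a frontier model, reserved for the one genuinely hard step.
    summary = call("frontier-coordinator",
                   f"Reconcile these {category} facts into a summary:\n{facts}")
    return {"category": category, "facts": facts, "summary": summary}
```

Each stage is independently measurable, which is exactly what makes the reliability and cost accounting possible at terabyte scale.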
The agents and the application builders are only just now starting to think about this. And the only ones starting to think about it are the people who have really found product-market fit, because until then, there's no need. If you come to me and say, hey, I want you to build an AI agent that does X, the first thing I'm going to do is type an example job into ChatGPT and see if it can do it. And if it can, then I know it can be done, and I can create some harness and application layer around it. And it will do the job eight out of ten times completely right, whatever it is. Then I figure out what my threshold for reliability is. I keep a human in the loop, so the errors aren't a big deal. And then I go launch it and start getting feedback and usage, collecting data. All the while, it's still just held together with vibes, duct tape, and system prompts, and I haven't been systematic about any of it.
And I don't blame those people. That's the right move. That is the optimal move. But now, if you start getting traction, now you've got the inference bills. And speed starts to become an advantage. Melissa Pan, who's a researcher at UC Berkeley, just published a great report called MAP, M-A-P, Measuring Agents in Production. She and her team surveyed over 300 agents in production, and the teams that built them. One of the things they found is that of the agents that actually make it into production, which is quite rare, since most don't make it past the pilot stage in enterprises. Right. The hardest bottleneck and barrier is reliability, because employees just won't use it if it's wrong most of the time, or even some of the time, because then it's more work for them. Exactly. So teams will dial back the complexity of their agents, simplify their tasks, and then throw it at the biggest model they can, and that will work, and they'll be fine with it. They won't look at the token costs, because it's still cheaper than a human. And they're also fine with high latency. They're waiting one to ten minutes for these things, and they're fine. They don't care. But when you start to have coding agents and you want to use Opus 4.5 for them, now you have to start thinking about it. So you see teams like Cognition, who launched their SWE-grep, which is: hey, we noticed that the large agent is spending 50% of its time doing code search. Let's train a model just to do code search. Yes. And then put that in the pipeline. So now I'm calling two models, and I maybe have a coordinator model. And so now that's a compound pipeline. Got it. And they were able to reduce their time, I think, by 40%, and it was exactly the same in terms of acceptability. Yep. So you're starting to see this. The other one I always think about is Figma, who just released their earnings. And they talked about how people love their agent, but it's eating into their margins now. And they're like, well, we're not going to raise prices. We have to keep capturing the user. We have to keep doing this.
Jeff Huber
So the question is, now what do they do?
Drew Breunig
Yeah.
Jeff Huber
And some of a macro question for SaaS, I think, broadly.
Drew Breunig
Yeah. And so I do think there's an opportunity to take all these compound techniques that have been developed, refined, and kind of created by the data world. Yes. But they've been doing it quietly, at their own conferences, at their own universities, at their own companies. Yes. And start to pick them up and bring them over to the agents. So I guess: what should AI practitioners learn from ML and data practitioners? Pretty much. And what does that mean? What is the mindset? I think it starts with decomposing your tasks. Start to think about them not as one blob you throw in, but: how can I break these down into measurable and testable tasks? When you think about them in pieces, that also makes eval creation, which is critical, much easier. Because a lot of people come to me and say, well, I can't create evals because it's just too hard; evals are impossible. It's like, yeah, but if you break your task down... I was talking to someone yesterday who was developing a chat app, a really interesting way of doing it. I won't talk too much about it because it's not public, but they're like, I don't know how to eval a chatbot inside of a chat app. We hear this all the time. Yeah. But if you break it down into parts, it's like: is the message too long? That's an easy thing, and you can train against that, and you can prompt-optimize against it. Is it relevant to the topic at hand? Did it make the right tool calls? A good example they were asking about: if you ask the chatbot a question, should it look at your chats with your friends, or should it look at the web? And the difference between those tool calls is actually really easy to turn into a training data set. If I say, hey, what was the thing Jeff and I talked about last week, it should never look at the web. And so I can create evals for that really easily, and I'm not boiling the ocean with these multiple tests. So once you start down the path of decomposing your tasks this way, and I think this is the other insight, it becomes easier for you, the creator, to achieve reliability: the ability to reason about it, the ability to structure recovery, evals, and data collection, and then ultimately optimization, into your harnesses and your whole system. Right, that's a better way to say it. And I think this also ties back to the thought experiment we had: in five years, if you've got GPT-8, why would you decompose your tasks if you had an all-intelligent model? You decompose your tasks to help you, the developer, maintain control over your system and ensure reliability, not to help the model. Remember, the model is all-powerful in this scenario. And I think that's the way people need to be thinking about it as well.
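(A minimal sketch of that decomposed eval, using the tool-routing example from the chat app story. `route_query` is a hypothetical stand-in for whatever piece of your system picks the tool; the labeled cases are illustrative.)

```python
# Each case pins down one narrow, testable decision: which tool to call.
CASES = [
    ("What did Jeff and I talk about last week?", "search_chats"),
    ("Summarize my thread with Drew about evals.", "search_chats"),
    ("What's the best e-bike under $2,000?",       "search_web"),
    ("Who won the game last night?",               "search_web"),
]

def eval_routing(route_query) -> float:
    # route_query: your function mapping a user query to a tool name.
    hits = sum(route_query(query) == expected for query, expected in CASES)
    return hits / len(CASES)

# Usage: score = eval_routing(my_router); flag a regression if score drops.
```

No ocean-boiling required: a dozen labeled queries like this already give you a regression test, a training signal, and something to prompt-optimize against.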
Yeah. I've always said you don't need a supercomputer to write your email, which maybe implies you don't want to spend the amount of money, or wait the amount of time. Yeah. And as an email writer, you want to see more things coming back to you. I don't know, have you ever coded with a fast coding model? Not like really fast. Yeah. It's amazing. I recommend this to everyone. Okay. How fast is fast, by the way? Hundreds or thousands of tokens per second, I guess. So Claude Sonnet 4.5, I believe, the last time I did this test: you're looking at about 50 tokens a second. Yeah.
Now go download Cline and connect to Groq, G-R-O-Q, and use Kimi K2. That's 200 tokens a second, and I think a sixth of the price. And then go to Cerebras and connect to Qwen Coder, which was 2,000 tokens a second. So you're going orders of magnitude up. Yes. How you process that is really interesting. I think product managers, and anyone who wants to be building in this space, should be looking at this, because the first thing you find out when you get to 2,000 is that the way you were coding with Claude Code and Sonnet 4.5 does not work at 2,000 tokens a second. Yes. Because you just see all the code going by. You can never check it. Right. And some people will say, well, I'll just put unit tests in front of it. No. What you end up doing is treating it like, instead of an email, which is what Claude feels like, hey, go work on this thing, all right, I'm waiting, all right, it emailed me back, now it's an IM. An instant messenger, for those people who didn't have AOL Instant Messenger. Between you and Qwen, it's just: do this one thing. All right, did it. All right, reload my page. All right, do this one thing. Did it. And so it completely breaks your brain when thinking about...
Jeff Huber
The speed, in terms of the UX behavior, yeah. I've had some early access, I can't share too many details, to models that are running at tens of thousands of tokens per second. Are they the diffusion models? Everybody's waiting for those diffusion models. No, these are actually still just LLMs. But it just appears. It doesn't make any sense. Yeah. And things like streaming? Does not matter. No. Not when it's coming back at tens of thousands of tokens per second, because you can't read that fast.
Drew Breunig
Yeah. And that's kind of the thing: it does feel kind of silly where I am, using Opus 4.5. I ask it to plan out a function or interrogate a thing, and then it's done. It's been thinking for a long time, and then I have to scroll back up, scroll back up, all right, let me read the plan slowly. So this UX is going to change as we start to push these things forward.
Jeff Huber
Totally. I mean, some of the internal tools that I've just been hacking on for my own, you know. Yeah. Edification, pleasure, utility. Obviously this is very much an internal-tools use case, so I'm not thinking about, am I going to put this in production, am I going to give it to an end user? It's kind of nice, right? You can just build cool stuff. Just let it rip. It's just let it rip. But I don't pay attention to any of it. I'm not really looking at it. I think I literally built a Mac app that probably has 50,000-plus lines of Swift.
Drew Breunig
Yeah.
Jeff Huber
I have not read a single line of Swift. And I go and I use the app and I'm like, oh, that's weird. That's broken now. And I kind of give that feedback and then it makes it better and better and better. It's a really weird way to build software.
Drew Breunig
Yeah, and I do think we're in this awkward period that's just going to seem really quaint and weird later, which is like, oh, you had to reread the code? Why would you have to reread the code? And it's like, well, I don't technically have to. I could just trust it and ship it and go. And I think that's what someone would say: oh, if you have instantaneous code, that's great, you're just going to spin up a billion threads, give it a much bigger prompt, and it's just going to go, and you have the test cases. And I still think there are times where you're going to blow it, but it does push a lot of UX questions. Yeah. I recently met someone whose work desk was five laptops running Claude Code, and I think they had two instances on each laptop. Five laptops. I did not ask questions. I just assumed it was like an old-timey cartoon where you go across them like a typewriter. And this is an interesting question.
Jeff Huber
I don't think this future is so far away. I think the future is now, which is: for certain people who are investing at the edge of building their own tooling, realistically, you could be managing or administering tens, dozens, probably not hundreds yet, of simultaneous agents doing valuable work for you. Organizing your desktop, front-running your to-do list, thinking through your key priorities, what you're currently working on, what things you're missing. These are background, ambient agents trying to help you reason, think about things, clean things up, as well as the active things: drafting all your emails, writing demos of all the code that you want to write, and doing all of this at once. What is the user interface going to be? How do we as humans get a handle on what dozens of agents are doing simultaneously? We're still very much in a single-player mode, almost. It's almost like a first-person shooter, if you will: you're viewing one character, that one character is doing one action, you're doing things. And I feel like it's going to be like StarCraft, right?
Drew Breunig
Where you have tons of agents on a map. And let's stretch the gaming comparison further. Let's do it. We're not even at first-person shooter. You're playing text adventure games with Claude. That's good, yeah. That's what you're doing. That's totally fair. And so I think that is a big thing. But I do think the barriers are human rather than technical, which is why I think it's a UX challenge. Yeah. This is a funny, weird thing: I came of age when instant messaging, AOL Instant Messenger, became a thing. AIM, for those that need to know. Yeah. AOL Instant Messenger was the way people chatted. Yes, like 1996. Post-IRC; MSN Messenger was popular around the same time. Prodigy screen names. Or pre-Skype, kind of. Exactly. Totally. Yeah. And it came of age when I was in junior high. And I remember setting up a plan with a friend of mine over IM. We talked about the plan, and we were like, all right, I'll come over to your house at this time. And because it was a new thing, after that was all done, 30 minutes later, I still picked up the phone and called him. And I'm like, hey, just to reiterate, we're all cool with this plan, right? I'm going to come over. Because it just didn't feel real, because it was this new mode of technology. And I think there's a level of trust that we are going to have to build over time, at human speeds, not at technology speeds. Right. Which is: how far do people feel comfortable throwing tasks into an LLM? Like, you're writing emails. Most people would not. I'm not sending them yet. So you're reading them, which is great. So there are things where you're going to want that missile control, where you flick the switch and then you hit the button. There are things that you're going to be fine completely offloading. One of the best harnesses out there right now that I really like, again, this is a good example of it: have you tried the ChatGPT deep research for shopping? I have not. You should try it. It's my first stop now for anything that I want to buy that I have opinions about, but not such strong opinions that I want to waste hours looking into it.
Jeff Huber
And so, I guess, the alternative path here, just to pull out the differences: my default path is item name, Reddit, BuyItForLife. Totally.
Drew Breunig
Or Wirecutter or whatever. Well, in the golden era of Wirecutter, the real function Wirecutter served was for this whole category of purchases. Like, I'm a cycling person, I really like cycling and e-bikes. If I'm going to buy one of those, I'm going to research it for so long. I'm going to wait. It's going to be fun. It's super fun. Yeah. Totally. I don't want to offload that to the Wirecutter. Yes. But there's a whole bunch of other purchases, like a microphone. Do I care that much about a microphone, or do I want someone to just say: I have done all the work, here's why this is the best microphone, and here are the other alternatives for this job. That was this great piece of mental work taken off you. You could just easily go buy it. I feel like that was 2008-era Wirecutter. Everyone had the same products in our weird social circle because of that. Yes. That has kind of ceased to be, and now, ChatGPT. I was like, I'm looking to put records on my wall, but I want it so I can flip through them, so it can hold more than one record, so it's storage as well as display. Finding ones that weren't crap was incredibly hard. I spoke it into my mic, it went to ChatGPT, and it nailed it, gave me a little carousel. I'm like, that's exactly what I want, $50, bye. And that is a good example of the sweet spot of trust. It could have told me one thing and said, this is what you're looking for. But it presented me a range and asked me to make the choice. It's almost like client services. Totally. It's like the old design trick: if you're going to show a design review to someone, you show them three ideas, and you know one of them they're going to say no to. So you're forcing the decision. I think that's great. It can backfire on you, of course; sometimes they'll choose that one. Right. And then you'll have to do it. But I think that's an interesting thing to think about: agent design as client services, as UX management. That is good. And I think there'll be lots of that. Even if you have an all-powerful model, it benefits from knowing that. Yeah.
Jeff Huber
In some ways, it feels like the harness of the context engineering... you know, may shift from being agent-centric to being human-centric. Because the bottleneck will also shift as the agents get both better and smarter and also way faster.
Drew Breunig
But the other thing is what you show it. And I think that's the other thing: building the mental model of how the agent reached that decision is what establishes the trust. If it came back and just showed me, here's the listing I found, that's effectively what Google did. I would type it in, and it's the first result, and I don't know why. They could just have good SEO. They could have all these things. But here it came back and said: here are the reasons. I heard you. Here's why. And now I've built a mental model of where it checked for these things. And building that mental model is critical for UX. And I think... Does the ChatGPT shopping experience deliver on that well?
Jeff Huber
Do you feel like it showed you? I think it's pretty good. Very good, even. Because broadly, the deep research tools just sort of give you, here's 400 links we looked at.
Drew Breunig
It's kind of hard for me to look at the model. This is why everybody's talking about general stuff, but I think what's most interesting with the UX is focusing on a specific slice, because then you can over-optimize for that one use case when it comes to the UX. The models are so capable now that really the barrier is: does the human trust you? And that UX and harness is how you earn the trust, with a little bit of the model design and accuracy. So that performance, that theater, is absolutely critical. And you need reliability, which you get through context management, context engineering, compound AI, all that other fun stuff. But it's also the presentation. And that's why I think narrowing that slice matters, and people talk it down. I think Aaron Levie had a tweet: if you're building a model that isn't a general model, and it's not huge, why are you even bothering? You're just wasting your intelligence. And I'm like, come on. I don't buy into that religion. Yeah, because again, it's human governance.
Jeff Huber
Yeah, I think sub-agents are broadly underrated. And sub-agents, I think, is sort of shared lingo for compound AI systems, meaning that there are different models or prompts or harnesses or tools to do different things.
Drew Breunig
Though you're not necessarily prescribing the workflow of how things pass between those. I think the only thing I'd push back on is that the word sub-agents is a little broad. In its most generic sense it's kind of like Deep Agents, where a sub-agent is just a new LLM call with a clean context and a system prompt. For sure. That's not what I'm talking about. A sub-agent to me is: when you run any command in Claude Code, 90% of the time it's going to do code search. Maybe we should make a code search model that's really good. Yeah. I don't disagree with that either. I don't think either of them is wrong. I just think there are two things worth thinking about.
Jeff Huber
Yeah. We have a few minutes left, and it feels like one of the popular topics in Silicon Valley, apparently one of the popular topics at NeurIPS last week as well, is continual learning. Sure. How do we build AI systems that get better on their own, or get better with minimal human intervention?
Drew Breunig
Yeah.
Jeff Huber
What are your thoughts?
Drew Breunig
Well, I think you first have to scope the conversation. When you talk about continual learning, there's the continual learning where I have a big frontier model and I want it to continue to learn as people use it, as they feed it. And then I think there's a second, local continual learning, which is what I would call memory. A lot of people have different ways of implementing it, but the idea is that as it learns, you don't have to reiterate yourself constantly, and it pulls the right thing from the right place.
Jeff Huber
Yeah, I think there's continual learning in the weights. Yes. There's continual learning in the prompts. Yes. There's continual learning in the tools. Yeah. Memory, maybe that's just a tool. Well, I would say... Yeah, and I think maybe the harness itself could also sometimes be continually learned. Yeah.
Drew Breunig
Help me decide if I'm the weird one, or if this is a common perspective. This is a fun game. Let's do it. It's the Simpsons meme: are the children wrong, or am I? No, it's the children who are wrong. There's this academic continual learning, and then there's how it actually reaches our products today, which is memory systems in Claude Code and ChatGPT, searching back through things. I don't like how they're implemented today, because they are a black box that is out of my control. And I am not one self. I am many selves. Totally. And maybe it's just because I have several projects going at the same time. Within the last year, I left a job, and it still has context relevant to that. Sometimes I'll ask a question and it's a pain to manage. And sometimes it'll come in when you just don't want it. Yes. Like, recently: for Christmas this year, for small token gifts, I have a chocolatier. This is the most bougie thing ever, but it's actually the most cost-effective as well. And it's also delicious. I have this amazing chocolatier in Belgium, and I import $400 worth of chocolate and just hand it out as gifts. And they're so good. It's unbelievable. I love it. That's a good hack. It's a great hack. I love it to death. So I was like, all right, who are all the people I should be giving this to? Because I don't want to miss an order, because I've got to pay the shipping and everything. And what I wanted was an outsider perspective, all the different Mister Rogers' Neighborhood people in your life that you should be giving little holiday gifts to. And it didn't do that at all. It went through all of its memory that it had learned, and it's like: this person, this person, this person. And that was not the perspective I was looking for at all. And now it's more work for me, where I have to get it off that train and over somewhere else. And then sometimes the way the applications and systems are designed, it doesn't do it the way you would expect. So I use ChatGPT audio, where you can record a conversation. It's a great interface. And I will then group those by projects. So if I'm writing a book, which I am, I will have conversations with people, put them in a project, and then I want to be able to ask questions of that entire project, of those conversations. But it doesn't do that. It just does not pull that information. And you can instruct it, there are all these things. So my issue is, we talk about continual learning as if it's this one-size-fits-all thing, a perfect interface. And I worry that it's not that at all. And I also worry that, we've seen this movie before, having worked in the ad tech industry for a long time: the way data is leveraged to improve products is often not to improve the experience and satisfaction of the user. It's designed to increase engagement. And so if you have OpenAI aiming for engagement, then memory is a very different strategy than if you have OpenAI aiming for quick task solving. And to put this back, one thing to finish: this reminds me of a behavior I used to have. Google has their news app on the phone, which is usually really good. But the problem is, it bases it off your search results.
So I used to have Firefox as the browser on my phone, connected to DuckDuckGo, for any time I wanted to search something that I didn't want in my Google News results. And it's the same with YouTube. I want to show my son a video of how to do something, and I'm like, fuck, I've got to go back to my history and delete this later, otherwise my whole feed is going to be poisoned with 3D printing recipes. It's invisible. And so we can tie continual learning back: it's a UX challenge as much as anything else, because at the end of the day, it's humans doing this. Now, if you start building a model for doing one thing, or what have you, then it becomes, I think, a tractable problem.
Jeff Huber
Yeah. I think, obviously, on the continual learning front, the majority of things I've seen are these remember-bits-of-facts-about-the-user systems, the ChatGPT type of interaction. And as you said, even that has quite a bit of complexity to it. Yeah. I saw somebody on Twitter this morning, the everything app, complaining about it literally messing up and using the wrong context in the wrong place. And, you know, we as humans have all kinds of social rules, natural firewalls: you don't overly bring your personal problems to work, and you don't overly bring your work problems home. We have these social firewalls, and the models obviously don't have a natural intuition for that. And therefore, maybe you need to build some harness around it. I mean, it's really hard.
Drew Breunig
I mean, this is the dream. This is why everybody wants it to be like the assistant outside their office: the person who sets their calendar, who can also see all their calendars, who can make judgment calls about what you want to do, and who knows that you're trying to work out that day even though it's not on your calendar. This is the dream. But I think what's going to happen is, as we start to build these things, the complexity is going to rear its head about who we are. And I think, at least in this era, the people who will win are those who focus on a specific niche and mode, to borrow a term from Emacs: you have your mode. This is your Jeff-at-work mode. There's probably another mode for Jeff-with-clients. There's another mode for Jeff-at-home-with-his-kids. And all of those modes are very different. They have different preferences. They have different contexts. I don't want a computer to treat me as one thing. I almost want to... Totally. Hit my keyboard shortcut and switch into that mode.
Jeff Huber
Yeah. And I'm surprised that, again, our computers and our phones have these ideas of modes that you can set up. Oftentimes they're kind of burdensome and difficult to set up, but they exist, I think, for a good reason. Yeah. And I'm kind of surprised that the chat providers don't lean more into that UX and just let the users tell you what they want. Because if the user tells you, I'm in work mode...
Drew Breunig
Well, I think they tried to with projects, and then they just kind of left them there. I don't know what the conversations are like inside those offices, but I can make a guess: well, no one's using these things according to our engagement numbers, don't spend a lot of time on this, we've got so many other things to do. And that seems to be the case. Yeah. So, a good opportunity for someone.
Jeff Huber
We're at time, so next conversation we'll tackle big eval. Big eval.
Drew Breunig
You don't want to get into the history of science philosophy or the philosophy of science and get in all that?
Jeff Huber
You have to leave people wanting more. This is the first rule of conversation.
Drew Breunig
That's what they come here for.
Jeff Huber
That's right.
Drew Breunig
Thanks for chatting. It's been awesome. Thanks, Jeff. Always good to be here. Amazing.