Dex Horthy on Agents and Context Engineering

Jeff Huber sits down with Dex Horthy to talk about agents, model orchestration, and the practical side of context engineering. They cover how new model releases are changing agent design, why smaller and denser context often wins, and how evals, memory, and collaborative AI workflows fit into modern agent systems.

Released December 3, 2025


Transcript

Jeff Huber

Hey, man. How's it going?

Dex Horthy

What's up, dude? It's good to see you, man.

Jeff Huber

Good to see you. All right. So this is an experimental format, which is to say we just talk. Whenever we hang out, we seem to have great conversations about all things context engineering. And so here we are. We're in December of 25. How are you thinking about context engineering? What are your latest thoughts?

Dex Horthy

So my favorite thing about context engineering, actually, I think I figured this out like a month or two ago, but like I had posted that thing on like April 3rd or something, right?

Jeff Huber

This is the 12 factor agents? 12 factor agent thing, right?

Dex Horthy

Which, like, one of the 12 factors was an article about context engineering, and apparently that was the one that blew up. But like two weeks afterwards, you just posted the words "context engineering" with no context. And I'm like, oh, Jeff was thinking about the exact same thing at the exact same time as me. And so that was exciting. It's fun to have a secret co-creator of context engineering out there.

Jeff Huber

I don't remember if it was, you know, spontaneous evolution or if I just totally unintentionally lifted and copied you, but I could do worse than copy you. So yeah.

Dex Horthy

Yeah. One of the new words I'm seeing a lot is harness engineering. Yeah.

Jeff Huber

Have you seen this? I tweeted about that, unfortunately, as well.

Dex Horthy

I just tweeted about a guy, I was like, oh yeah, this guy said it first. And then I was like, oh no, that guy was actually reposting a post from another guy, who I have literally talked to in the last two weeks. And I've, like, seen the article. I didn't actually read it, but...

Jeff Huber

Yeah, attribution is hard, and I don't think anyone can really own a word for that long anyways. And who cares, right? It's kind of cringe to be like, "the person who coined X," that's your claim to fame. So I think we're aligned on that. Um, I mean, new models are coming out every day. In the past, like, three days, three weeks, so many new powerful models have come out. How have you updated your priors on what it means to do context engineering with these latest and greatest models?

Dex Horthy

Yeah, I mean, context engineering for me is like, how do you get the most out of today's models? If I was going to put it at the top: a lot of people say, oh, it's all about putting the right information into the model. And it is getting the right information in, but also keeping it as small as possible. And the keeping it as small as possible and as dense as possible is the thing that actually matters, I think. Not token dense, but information-per-token density. Yeah. Uh, so every time a new model comes out, we're obviously playing with it, we're testing it, we're working it into different workflows. I think a lot of what we're building is built on kind of a paradigm that we're exploring a way past, because I think there's a new vision for it that is a little bit more flexible and easier to take advantage of new models as they come out. Because today, a lot of what we are doing and building, and something that works really, really well, is: use a really smart model as your top-level steering orchestrator, something like Opus 4.5. Or Opus 4.1, it was Opus 4.1 forever. Opus 4.1 was the first thing where it was really like, oh, we can actually do really incredible things with this model. And then you would delegate out, like, with your harness's sub-agents, which is one of the reasons we really like Claude Code, delegate out to a bunch of sub-agents that use faster, dumber models, right? We've seen this with Cognition and SWE-grep. We've seen this with Warp just, like, releasing their grep agent. Is it Warp? No, Morph. Sorry.

Jeff Huber

Morph, but it's called WarpGrep.

Dex Horthy

Okay, that's why. Okay. It's a lot to keep track of. But one thing we've been exploring, and it's actually how the Amp team designed things, like, Amp's had this architecture for a while, where your main orchestrator is actually a smart but not the smartest, fast model. So they've been using Sonnet as the default for a really long time, as kind of the core one. And then they had this thing where, oh, you can delegate out to a smart model, like this Oracle concept, right? And so you put the reasoning and the really beefy, slow-thinking intelligence there: hey, I have 50 files, I need you to read them and try to help me figure out where this race condition is, or something like that. Yes. Uh, and that becomes something you delegate. And doing that really well is tricky, because those big slow models are slow. And if you're just like, hey, read these 50 files... I don't like sitting there. We've experimented with this using Sonnet as the driver and Opus as the thinker. The problem is, Opus is going to sit there and call tools and read every single file, and it's just slow, because it's a big slow model. They had the same thing when they had O3 as the Oracle. I'm sure it's GPT-5 high now, or whatever it is; I haven't checked in a while. But it's like, yeah, O3 is not great at tool calling, and it's slow. So the context engineering nugget there is: okay, well, if you can have the fast orchestrator model figure out which files are relevant, without having to really understand every line of code, yeah, you can put some deterministic layer in between that is just going to stuff all those files into a big prompt. And you kind of step away from, or you're de-emphasizing, the agentic loop, and you're just like, here's a crap ton of context, tell me how it works, or answer whatever the question is, whether it's explain the architecture or things like that.
So the answer is, like, I think the more people can move to that architecture, the better. And there's a lot there, and there's a lot to eval, and a lot to figure out about just the right way of doing it, beyond just the vibes of us and our customers using it internally and trying it ourselves.
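
The deterministic layer Dex describes, where the fast orchestrator only names the relevant files and a plain function stuffs their contents into one big prompt for the slow reasoner, could be sketched roughly like this. The model wiring at the bottom is hypothetical (`fast_model` and `slow_model` are placeholders, not any real API):

```python
from pathlib import Path

MAX_CHARS = 400_000  # rough budget so we stay inside the context window


def build_oracle_prompt(question: str, file_paths: list[str]) -> str:
    """Deterministically stuff whole files into a single prompt.

    The point: the slow reasoning model answers in one shot instead of
    spending an agentic loop calling a read tool once per file.
    """
    parts = [f"Question: {question}\n\nRelevant files:"]
    used = len(parts[0])
    for path in file_paths:
        text = Path(path).read_text(errors="replace")
        chunk = f"\n\n=== {path} ===\n{text}"
        if used + len(chunk) > MAX_CHARS:
            break  # drop the tail rather than blow the budget
        parts.append(chunk)
        used += len(chunk)
    parts.append("\n\nAnswer the question using only the files above.")
    return "".join(parts)


# Hypothetical wiring, for shape only:
# files = fast_model.pick_relevant_files(question)   # cheap, fast model
# answer = slow_model.complete(build_oracle_prompt(question, files))
```

The trade-off is exactly the one in the conversation: you give up the loop's flexibility in exchange for one slow call instead of fifty.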

Jeff Huber

Yeah. This makes me think of another question, which is: there's one school of thought which is, like, one model is all you need. And maybe you're not even using sub-agents, right? You're kind of just using one model, one agent loop. Multi-agent, maybe too brittle; this was the opinion, at least a few months back, for certain people. Or multi-model was also, like, too hard to reason about, or too brittle. And obviously now, I think you're increasingly seeing this view or idea that, no, actually, sub-agents are really important for a lot of reasons, and breaking out agents into different roles and responsibilities. Which also begs the question: do you use purpose-built models? So take search, for example. You mentioned these fast search agents, these models that are very good at using search tools, and using them hopefully well. Yeah, what are your thoughts? Is the bitter lesson coming for all of us at the end of time? How durable is this? Is it just the best practice for now, and we should all embrace it because it is the best practice for now, to use multi-agent, multi-model? How do you think about this in your head?

Dex Horthy

Well, yeah, and I want to answer your question, but not to get too far afield: orthogonal to new models every week, you also have, like, Ilya saying scaling's not going to keep happening. Like, we're actually getting near the plateau for the current set of technology, and we need to go back to research for a while. I don't know if I buy that either.

Jeff Huber

And the real bitter lesson is that S-curves come for us all in the end.

Dex Horthy

It's almost like the inverse, right? It's like, okay, if things are topping out, then now is the time to be investing in how do we get the most out of today's models? Because two years ago, you could be like, look at all this code we wrote so that GPT-3.5 can actually solve these problems. And it's like GPT-4 comes out and you're like, oh yeah, we can throw all that out now.

Jeff Huber

It does feel now, if you think about the landscape of boxes an agent harness needs to do a good job, it's like: okay, maybe there's a file system; there's some level of tools and tool use, tool search; it has the ability, inclusive of that, to write and run code. It feels like the map of the boxes now is going to be the same in 10 years. I can't imagine a world in which those boxes change dramatically or go away.

Dex Horthy

As long as we're dealing with quadratic transformer attention, you're always going to benefit from doing the deterministic engineering that allows you to keep the context window as small as possible for a given task. Yeah. Um, there was more to your question.

Jeff Huber

Let me ask it a different way. Um, I've had this thought or thesis that there's more than one class of inference workload. And what I mean by that is, people clearly understand the beefy model, the beefy reasoner model: slow, hopefully methodical, very high reasoning power. I think oftentimes, if I'm just using, you know, ChatGPT or Claude or Gemini or any of the models, I will often reach first for the high-thinking model, just because... You want to see if AI can do it. Yeah, it's kind of like, see if AI can do it. And it's almost like a script. You know, I'm not trying to optimize it; I'm running it one time. And so, well, if I have to wait an extra 20% longer, or even, you know, 400% longer, I don't really care, because it's kind of just a one-off task. Versus if I'm using it in a loop, and as a core part of my job, like programming, for example, then I obviously care much more about its speed, right? If you run it a thousand times in a day. Exactly. Exactly. And so, I guess, you know, do you think of different agents and sub-agents as demanding, or deserving in some sense, their own inference workload? Or, to ask the question another way: do you think there will be dedicated models for search that will have staying power? Do you think that context compaction is an interesting candidate for dedicated model inference as well? How do you think about the map of agents and sub-agents, and where it makes sense for there to be other models that are different than the high-reasoning models?

Dex Horthy

I see. And this is in the context of, what about the people who just say use one model for everything? Exactly. Because it's good enough. And, yeah, you waste more time min-maxing across all of them, and developing intuition for all of them. I tell this to a lot of people: it's like, well, I use Claude Code for this, and I use Codex for this, and sometimes I use Cursor, and then sometimes I'll shell out to deep research. And it's like, yeah, you're only going to get to, like, 80% of the level of intuition that you could have if you're constantly switching. Whereas if you sit and talk to one model all day for, like, two months, you will develop a level of intuition. That's actually where I think the people who are the best, especially in programming, using, what do we call it, agentic engineering? Because I'm not saying vibe coding anymore. Uh, but, yeah, in the agentic engineering world, the people who are really good at this have a really good intuition for the models and their context windows, and when to yell at them versus when to be supportive, and all these things that kind of feel a little bit superstitious.

Jeff Huber

Yeah.

Dex Horthy

Uh, but these people get results that I haven't seen anyone else be able to get. And so my advice is always: you will get better results if you pick one model, one tool, and work with it a lot for a month or two, versus the incremental gains you might get by, I mean, unless you have a huge eval set that's really, really baked, which almost none of us do. The incremental gains of, oh, let me try the new DeepSeek; okay, let me try the new Opus; using a different model for three weeks every time. Yeah. I think there's more upside in just getting to know one model, or one family of models, really, really well.

Jeff Huber

That's interesting. I guess that implies that the models are not, at least at this point, highly swappable. You as a carpenter, if you will, a master carpenter, right? Anybody can pick up a saw and go saw some wood. But you as a master carpenter really learn the characteristics of this saw for this grain of wood, through, like, your 10,000 hours. And, you know, by swapping saws out, or swapping grains of wood or types of wood, you do lose something.

Dex Horthy

There's a thing, when Codex CLI came out, and especially the Codex model, I think I saw Swyx posting about this one too: like, oh, if you yell at Codex the way you're used to yelling at Claude, you've completely detuned the model, and you actually screw up the performance a lot. All the all-caps and "IMPORTANT" and "you must always" is helpful and gets good results from Opus and things like this. And Opus is actually getting much better at instruction following as well. But, yeah, you go use the same prompts with a different model... We've taken our prompts that are optimized for Claude and Opus. And I've always said, oh, our prompts are optimized for Opus, and we basically only use Sonnet for searching and finding things, versus actually understanding how stuff works and generating summaries and documents. And when I say that, it's really like: okay, cool, I know that if I have a six-step workflow, I can rely on Opus to actually go through all those steps. And if I give that to Sonnet, it's going to get halfway through step three, and it's going to forget that there was a step four, five, and six, and I have to remind it. And you can change your workflow around that. But again, it's like: cool, I know this works for this world. And if I'm constantly trying to switch, like, I think Opus can do this, I think Sonnet can do this, now I need two sets of prompts. And every time the model's changing, I need to update both of them. And it's a whole thing.

Jeff Huber

We were texting last week about the idea of using AI more deeply in kind of our day-to-day work and productivity. Specifically, you were talking about managing to-do lists. What is your current personal to-do list setup, and do you use Claude Code? If it's just a piece of paper and not AI-related, that's also okay. But I was curious to hear how you're thinking about using AI in your to-do list management.

Dex Horthy

I mean, everything's super chaotic right now, of course. I think since bringing on a co-founder a couple months ago, it has definitely changed, because when it's just you kind of steering the ship, and, like, you know, people on the team, you need to keep them on the same page. But it was enough for me to just have a pile of Markdown files that I occasionally synced to GitHub, and think that was my system. And I used, I think I had the, do you know Getting Things Done? The David Allen thing? Yeah, GTD. So I was like, okay, cool: deep research, go make me a long summary of the GTD method, and then drop that in a Markdown file for Claude, and just be like, go implement this system. And then it kind of built the whole set of stuff. And I know lots of people have done this. No, it was interesting though, yeah. It actually ended up being really heavyweight. And when we needed to collaborate more, we consolidated on more of the YC-inspired thing: okay, what's the goal for the month? What's the goal for the week? What's the goal for today? Check in at the beginning of every day and at the end of every day. Who did their thing? What's behind? What do we need to do? And constantly reorienting around what's the most important thing. And that just happens to work better in a Google Doc than in a Markdown file.

Jeff Huber

But there's no AI in that Google Doc? It's just you guys as humans typing?

Dex Horthy

It's just us as humans typing.

Jeff Huber

Do you think there should be, like, AIs in there with you, typing?

Dex Horthy

Well, so this is also a thing that we're hoping to... this is not out yet, and probably not before the end of the year, but: collaborative Markdown editing. Yeah, there just isn't... I haven't seen a good tool. VS Code Live Share is pretty good, but it's missing all the AI stuff. A collaborative workspace where two people can have multiple cursors on a document, and also have multiple cursors in a prompt box, and go back and forth with AI in a way that both people can see it and collaborate on this stuff, I think that's really interesting. People talk about, oh, chat is a bad UX for AI. We've barely scratched the surface. And also, the web is, like, the pre-AI UX. What's the new UX? And I think the way that humans interact with AI and each other, and can maintain visibility about what's going on, is going to be really important, and very technical. Linear's entire company is based on, like, hey, we built Jira, but with a sync engine. And obviously they're way more than that, right? They've done a lot in terms of design and things like that. But at the core of it, it's a snappier, better UI. It's better for collaboration. It feels real-time. Yeah. And that's the biggest unlock for me.

Jeff Huber

Yeah. Yeah. I think earlier this year, and we'll be okay referencing our own tweets in this room, it's a safe place to do that, um, I tweeted something which is like, you know, most operating systems are built for single-player, not for multiplayer, and for AI they need to be multiplayer. And people quickly on Twitter were like, no, you're wrong, there are daemons, what are you talking about? And, okay, fine. But you're not using my computer at the same time I'm using my computer. And this question of file diffs and merging comes into play. Again, if we both have the same Markdown file and we both change the same line, do all Markdown files need to be CRDT-native, in a way that the agents and the humans can all be editing at the same time? You don't have to give away anything if you have secret thoughts about this. Yes, they will. Okay, okay. Yes. I mean, they won't have to be, but I think you unlock a lot if you can solve that. Okay, that makes me think a lot about AI UX. It feels like in some ways AI UX is incredibly primitive still. Well, there's two sides of this. One is, AI UX is actually really great, and chat is way better than people thought it was, and it's sort of wrong to dunk on chat and make fun of chat, because chat, or the kind of Claude Code-style chat, is actually very powerful, and you can do so much with it. So maybe it's wrong to dunk on that. And in the same way, it feels like we are in the caveman days of, again, collaborating with agents and other humans all in a shared workspace. And so, do you have thoughts around, I mean, one potential pattern which has emerged is this inbox pattern, where the agent is kind of teeing up decisions for you, a human, to make.
And presumably it needs to give you all the context at the same time, kind of context engineering for humans, if you will, so that you, a human, can make a good decision. But I guess, tell me more about AI UX. What are you using? What do you want to exist? What do you think is going to exist around AI UX that doesn't exist yet?

Dex Horthy

Well, so the chat thing and the inbox thing is really interesting, because I spent nine-plus months obsessing over this problem. When we were working on the core HumanLayer product, which we're kind of sunsetting now, I thought that was the most important thing that was going to happen in AI. People dunk on chat, and I actually didn't like chat either, because, like, who wants to have seven browser tabs open for every single different agent that they have purchased and have access to, right? The whole point of chat is, oh, you interact with this the same way you interact with other people: you text your friends, you type to the agent, same experience, it works. The issue is that they weren't in Slack, they weren't in your email inbox. The workflow that I thought was my favorite was: I had a bunch of emails, and I'd built something where you can just forward an email to an agent, and it has a bunch of tools, but the tools are wired with an email-based human-in-the-loop. So you delegate a bunch of stuff out, and next time you come to your inbox, you have some inbound that's like, cool, here's the tool call I'm going to make, with your permission. We were using Linear as our CRM at the time. Okay. So every inbound email, I would just forward it and, like, add this to the CRM. And it would either make a new contact or add a comment to existing contacts. Yeah, yeah, yeah. And all this. And so it was, like, come back, and, here's the comment I'm going to add. And then I could reply and be like, no, do it this way, or whatever it is. And eventually, when it was good, it would go actually update the CRM. And that's what I'd do.
I don't use that agent much anymore, because we've moved our CRM into Markdown and Airtable, and it's managed by Claude, and it's a whole... it's not ready yet. It's not done yet. I have so many questions about that, but maybe we'll keep going. Yeah, we can get to it. But that was my favorite AI UX ever: talk to AI the way you talk to your human coworkers. And I think the same thing is, some of the best collaboration experiences I've had are collaborative whiteboarding, everybody writing in a Google Doc. Um, physical whiteboarding is going to be hard for AI, but maybe eventually. I worked for seven years at a company called replicated.com. They did Kubernetes infrastructure for deploying your SaaS into enterprise environments. And they hired a bunch of ex-GitLab people, and I don't know if you know, but GitLab has one of the most intense remote cultures. Yeah, there's the whole GitLab Unfiltered YouTube channel where you can just watch all their meetings.

Jeff Huber

And it's like, who are all these people? This is absolutely insane. Yeah.

Dex Horthy

Yeah, like, being fully open and transparent was one dimension of it. But there were things like, okay, every meeting needs to have an agenda. If it doesn't have an agenda, it gets automatically deleted. And every meeting needs to be recorded. And if you are in a meeting and you're not participating, you will be actively encouraged to leave, because if you just want to be in a meeting to feel included, just watch the recording. Meetings are only for decision making; they're not for getting in sync. And there's all this stuff like, never answer someone's question. Someone asks a question, never answer it directly. You have to respond with a link to the handbook page. And if the handbook page doesn't exist, you have to write it and then send the link. And sometimes the handbook page exists, but it's wrong. This forces people to constantly go and review the documentation. Yeah. It's the only source of truth. Yeah. Slack is not a knowledge base. Anyway, there's all these very strong rules, and everything's in a Google Doc, and everyone's collaboratively taking notes in every single meeting. I'm like, this works really well. How is this going to work with AI?

Jeff Huber

I think it's going to work a lot better with AI, number one, because you don't have to have humans be quite so much like robots. That's kind of my critique. The proof is in the pudding, right? Obviously, yes, GitLab continues to be successful as a company, as far as companies go, but there's a lot of stuff they've missed, and missed pretty significantly. You have to wonder: is making product engineering a lot like a factory, with this specification process, actually removing some of the upside? Like, where are the water cooler conversations happening, about, what if we did this? Where is a safe place to bring up a crazy idea, and let it be a crazy idea for a while? If every meeting has to have an agenda and a decision, and, you know, only the people who are supposed to be there are supposed to be there, are you actually clamping down on a lot of the potential creativity and upside of what could be?

Dex Horthy

Yeah, and this is also part of why we went super all-in on in-person at HumanLayer. It's just like, I don't want it to be a chore to hang out with my teammates. And getting on a Zoom meeting is a chore. I don't care if you're sitting around having a beer. It's a thing I don't look forward to doing. I mean, occasionally, like during COVID, yes, because you're locked inside and you don't see anybody ever. But there's so much more that happens in between the lines that you have to counteract with all this overhead and rules. Versus, okay, if we can all just be in the same room... I don't know. Viobob put this really, really well, I think in their job posting: you're going to be in person, because work should be fun, and in person is just more fun. It's not fun to get on a Zoom with somebody most of the time. Yep. I mean, none of these things are black-and-white true, but the spectrum is shifted significantly. Yeah.

Jeff Huber

Yeah. Are there other, I mean, going back to the AI UX kind of patterns, right? There's sort of chat, there's like this inbox pattern potentially. Are there other patterns that you think are like useful or important or will be important?

Dex Horthy

I mean, I think tab complete is really dope. Tab complete is the thing that Cursor figured out how to do really well, and it's super impressive. Where do you want tab complete to work that it doesn't work right now? Uh, for, like, non-text things. Say more. I mean, I think I was talking to Benny at Anthropic, and I was explaining what HumanLayer was, back, like, a year ago. And it was like: okay, cool, you have this agent out in the world, and when things are happening that need AI action, you get a Slack message or an email that's like, hey, we're going to do this thing, and you're like, yep, looks good; yep, looks good. It was like tab autocomplete for generic tool calls: you have something running in the background that's regularly querying the state of the world, deciding if there's some action that needs to be taken, and then autocompleting that action. Where it's like, hey, I'm going to send this email to this person, or, hey, I'm going to update this record with this data, or whatever it is. Yeah. You just be like, yep, yep, yep, no, do it this way. And kind of going through there.
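
The "tab complete for tool calls" loop, where the agent proposes actions and the human just confirms or redirects, can be sketched in a few lines. Everything here is illustrative: the tool names and the shape of a proposal are made up, not any real product's API.

```python
from dataclasses import dataclass


@dataclass
class ProposedAction:
    tool: str      # e.g. "send_email", "update_crm_record" (hypothetical)
    summary: str   # the "hey, I'm going to do this thing" message
    payload: dict  # arguments the tool would be called with


def review(actions, decide):
    """Run each proposed action past a human (or a policy function).

    `decide(action)` returns "yes", "no", or a free-text correction,
    mirroring the yep / yep / "no, do it this way" flow.
    """
    approved, corrections = [], []
    for action in actions:
        verdict = decide(action)
        if verdict == "yes":
            approved.append(action)
        elif verdict != "no":
            corrections.append((action, verdict))  # redirect, not reject
    return approved, corrections
```

In a real system `decide` would be an inbox, a Slack thread, or an email reply rather than a synchronous function, but the approve / reject / redirect structure is the same.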

Jeff Huber

It makes me think of this term, which I think is kind of cursed, because there are already too many terms; we should not invent more terms. But I will briefly say it. Let's go. Let's do it. You've already heard the term, which is organizations kind of having this shared context layer, which is exactly what it sounds like. The way a human has a mental map of, okay, we do this in Notion, we do this in GitHub, we do this over here, this is how all these tools intersect and interact. Presumably, if you want an agent to be another colleague sitting alongside you and doing useful labor and work, it has to have a similar mental map of how to get work done, where to go look for certain information, and how to connect it all together.

Dex Horthy

Is that a prompt and tools? Is that a specialized automatic query system?

Jeff Huber

Yeah, it's probably prompt and tools, with some agentic search thrown in for fun. So I'm not saying it's a new set of components per se. But I guess my question is... I've been talking to a few friends about this idea of almost, like, an AI-native organization, and how it should run differently than organizations have for the last hundred years, because the level of access to information is so much faster. And, I guess, yeah, what do you think about that?

Dex Horthy

Yeah, I mean, it feels a little bit like the 15 competing standards thing, right? Or 14, what was it? It's like, there's 14, and then someone's like, there's too many, we need to consolidate into one thing, and now you've got another one. If you win it, it's great. Yeah, I mean, as far as an AI-native org, we're trying to build the systems kind of on the side, passively hacking on stuff, that allow us to basically just use a pile of... we've been doing Markdown with front matter for a while, because front matter is nice: you can do a lot of slicing and filtering deterministically, without the model having to actually go read the whole file or use, like, the search tool. So it's a little tiny layer on top of the file system. Yeah, that's cool. And then we have some stuff in Airtable as well, which we're experimenting with, and, like, syncing between the two.

Jeff Huber

So what's the Airtable use case? You mentioned this CRM thing, right? Like, Markdown is the source of truth. What is Airtable, an ephemeral view for humans?

Dex Horthy

So I have a Markdown system, and then we needed to scale it beyond me. So basically, adding a Claude command to sync it was like: okay, commit, pull the latest, resolve the conflicts, push it back up. And kind of have a post-session hook to just always prompt it to do that, basically.
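
That sync command is essentially a fixed sequence of git operations. A rough sketch, with an arbitrary repo path and commit message, and with real conflict resolution deliberately left to the agent (or human):

```python
import subprocess


def sync_markdown_repo(repo: str, message: str = "sync notes") -> None:
    """Commit local Markdown edits, pull with rebase, then push.

    Mirrors the "commit, pull the latest, resolve the conflicts,
    push it back up" sequence; a conflict just raises so whoever is
    driving (agent or human) can resolve it.
    """
    def git(*args):
        return subprocess.run(
            ["git", "-C", repo, *args], capture_output=True, text=True
        )

    git("add", "-A")
    git("commit", "-m", message)  # no-op if nothing changed
    pulled = git("pull", "--rebase")
    if pulled.returncode != 0:
        raise RuntimeError(f"rebase conflict:\n{pulled.stdout}{pulled.stderr}")
    git("push")
```

Wired up as a post-session hook, this is the whole "poor man's sync engine": git handles history and merging, and only true line-level conflicts need intelligence.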

Jeff Huber

That kind of worked, but also... Is the Airtable a view to the data, or is it also a...

Dex Horthy

It's a view to the data, but it's bidirectional. So, yeah, we have the Git sync, and then we also have the Airtable sync, and it's a bidirectional sync that is basically taking mostly stuff from the front matter, like the actual structured data about a record, and then taking the body of it and just putting it in a notes field, basically. And it has task management and stuff. It's very early. I don't know exactly how we're going to use it. We're not using it all day. But it allows me to work by mostly just talking to Claude. And I can just be like, hey, I just had a call with this person, here's their email, go use the CRM writer agent, which is a sub-agent that is prompted to, hey, if you're going to create a person or a company and link them together, whether you're creating a person or you're creating a company, always go search the web first. It was basically a poor man's data enrichment system: you can find most of this on the web, and Claude can just do it. And it's like, yeah, I could go do Clay and ZoomInfo and learn all this stuff. But again, if we can just do it with Claude and Markdown, then that works great. And that makes everything a lot simpler. I'd rather have the 80% or 90% in all one tool, versus specialized tools for every single part of the GTM stack.
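
The Markdown-to-Airtable half of that sync reduces to a simple mapping: front matter becomes structured fields, the body lands in one notes field. A minimal sketch of that mapping (the field names are invented; the actual push would send this `fields` payload to Airtable's records API, which is omitted here):

```python
def record_to_airtable_fields(front_matter: dict, body: str,
                              notes_field: str = "Notes") -> dict:
    """Map one Markdown CRM record to an Airtable-style `fields` payload.

    Structured data comes straight from the front matter; the free-text
    body of the Markdown file goes into a single notes field.
    """
    # Drop empty keys so the sync doesn't clobber columns with blanks.
    fields = {k: v for k, v in front_matter.items() if v is not None}
    fields[notes_field] = body.strip()
    return {"fields": fields}
```

The reverse direction (Airtable back to Markdown) is the same mapping inverted, which is what keeps the sync bidirectional without a real schema migration ever happening.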

Jeff Huber

So I've been hacking on a couple of things on my own. One of them is kind of to-do list management. And something about text being the actual storage layer, Markdown, or maybe not even Markdown, it doesn't matter, as you said, is incredibly satisfying. I think you probably do need the CRDT story for multiple people to write the file at the same time. Yes. But it's so flexible. There are no database migrations. You're not migrating your Postgres database to add a new field. You just add a thing to your front matter and it works.

Dex Horthy

It's a little NoSQL-y. And you can also do linting and stuff. You can enforce schema in the front matter with a pre-commit hook.
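That deterministic enforcement could look something like this as a pre-commit-style check; the required fields and allowed values here are made up for illustration:

```python
# Sketch of a pre-commit-style front-matter lint: reject records missing
# required fields or using unknown values. Schema is illustrative only.
REQUIRED = {"type", "name"}
ALLOWED_TYPES = {"person", "company"}

def lint_front_matter(fields: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    errors = [f"missing field: {k}" for k in sorted(REQUIRED - fields.keys())]
    if "type" in fields and fields["type"] not in ALLOWED_TYPES:
        errors.append(f"unknown type: {fields['type']}")
    return errors
```

A real hook would run this over every staged `.md` file's parsed front matter and fail the commit on any errors.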

Jeff Huber

To reference another tweet: I said last week that NoSQL is more AI-friendly than SQL. Actually, I strongly believe that.

Dex Horthy

Yeah, because AI doesn't care about the schema. It's going to read the thing and see all the stuff. Schemas are for humans, not for AIs. Or rigid schemas are, at least. So this is interesting. I saw a tweet recently, do you remember the Toon thing? Oh, yeah. I was like, actually, fundamentally I agree with the direction of this in practice, but the tweet about it was so clearly part of the AI hype slop machine that I was just like, okay, cool. Saying a correct or somewhat correct thing in an overhyped way turns it into slop. But one of the claims was that JSON was for humans and Toon is for language models. I'm like, okay, first of all, JSON is not for humans. JSON is for programs. And I think schemas are for programs. They make your code safer, or whatever it is. I don't think... I mean, I don't know. It all comes from spreadsheets, right? Yeah. Okay.

Jeff Huber

Going back to, so you have like the front matter, which could be like your, sorry, the markdown, which is like your source of truth for your CRM. And this is like a schema that the agent itself can evolve, right? Like the agent itself can decide, oh, I need a new field. Great, I'll add it. Maybe I'll backfill. Maybe I won't backfill. It can just sort of decide how to evolve that schema. It's very flexible. Yeah. And then you need the tools to be able to interact with that schema and the rules, right? And so there's sort of conceptually some level of like rules engine that like the AI kind of can develop its own rules about how to process, you know, a certain set of information. Then you also as a human might give the agent sort of teach it over time, right? Like, you know, if X, then Y, if, you know, A, then B, you kind of give it more and more and more rules.

Dex Horthy

And it can refine. It's very easy to just be like, hey, we got too many tags. Like go refine the tags to less than 20 tag categories. Yes. And then it's like, cool, here's what we're going to do. We're going to work back and forth. And then it's like, all right, cool. Let me go launch 10 sub-agents to go update all of those.

Jeff Huber

But then you need the views as well. So right now it sounds like you're using Airtable as this complicated syncing view thing. But why doesn't the agent itself just write some basic React app to load that markdown into that local React app and then flush it whenever you save it or even in real time be flushing it basically. Why do you need a separate tool, I guess?

Dex Horthy

I just use an editor. I have a hook set up: basically every week, as part of my Friday review process, it goes and looks at everything on my calendar. Look at all my meetings, check who's not already in the CRM. If it's external and they're not in the CRM, think about whether it looks like a sales meeting or whatever it is, and pull them in. And we have the same thing for investors, for customers, for random users. We have a folder for people who don't get an account but just have a CRM contact: cool, go create a record for them. And then when I'm on a call with somebody, I open up my editor, and I can pretty consistently just be like, okay, cool, here's who this person is. It's so much faster than clicking around a web UI. The data's already there. Right. So you're using an IDE for this as your main interface.

Jeff Huber

Yes. Why does Airtable exist as a separate interface with the same data?

Dex Horthy

For other people on the team. We have a head of operations who does a lot of automations. So for non-technical people. Less technical people. Semi-technical people. Yeah, exactly. He's dangerous enough to run Claude Code and ask questions. He wrote a skill the other day. So it's not non-technical. It's just people who prefer to work in something like Airtable.

Jeff Huber

But presumably you could like build, the agent can build an Airtable clone. Yeah. And then you avoid the syncing problem because like it's just some sort of like local sync story happening versus like another, there's not another like durable store of the same data. Right. It's just like this like HTML page almost that like knows how to load in a certain dir on your desktop.

Dex Horthy

But you still have to Git pull. Like we store it all. It's like everyone's just pushing and pulling straight to master for that repo because it's like plain text document. So it's like even without an Airtable like separate sync system, there's still like the repo.

Jeff Huber

Do you like Git for this or do you feel like it's overly burdensome for this use case?

Dex Horthy

I basically just use it as like S3 or like a generic document store, right? Like, I mean, being able to merge conflicts is great. But again, the CRDT thing is the actual answer. You don't want to.

Jeff Huber

You'd have no merge conflicts if you had CRDTs, in theory. Or rather, the CRDTs would handle the merge conflicts, where Git would not. It would happen at an atomic level, only as you were editing, and you would see it as it was happening. I guess you have the logs, some level of replayability, for this repo. Have you ever used that? Is it actually an important feature of that tool?

Dex Horthy

Um, no. It's literally just, cool, before you start work, pull. We just built it into the prompts. It's annoying because it's a little slow, right? You're doing inference on every call for the same pull. But the reason we use Claude to do it instead of just writing a script is that when there are merge conflicts, we have instructions in the prompts for how to resolve them. If it's really simple, just do it. For these sorts of files, just keep both, always, because it's like a journal, a log of all the activity. It's going to get lots of conflicts, but just make sure you keep the format correct. Versus some things where it's like, okay, this person put this update and that person put that one. It's almost always just an additive merge. For Markdown files there's very rarely actual merge logic like you get with code.
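The "additive merge" Dex describes can be approximated deterministically. This sketch ignores deletions and ordering subtleties; note also that Git's built-in `merge=union` driver (set via `*.md merge=union` in `.gitattributes`) does something similar for journal-style files:

```python
# Sketch of a keep-both "additive merge" for journal-style Markdown:
# start from the common base, then append any lines either side added.
# Deletions are ignored; a real three-way merge is more subtle.
def additive_merge(base: str, ours: str, theirs: str) -> str:
    base_lines = base.splitlines()
    merged = list(base_lines)
    for side in (ours, theirs):
        for line in side.splitlines():
            # append lines not in the base and not already appended
            if line not in base_lines and line not in merged[len(base_lines):]:
                merged.append(line)
    return "\n".join(merged) + "\n"
```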

Jeff Huber

Yes, exactly, exactly. Yeah, I think a lot of people are reaching for Git, and I kind of have a suspicion that that's not going to be a durable component. Clearly, a shared data store is incredibly important. Maybe some level of versioning is also very important, but all the heaviness of Git for something that's not code.

Dex Horthy

Okay, so what do you need then? You need local-first speed, right? Yep. You need a UI that's accessible to less technical people, at least for reading, probably for writing. Yeah, yeah. And then you need a very high-efficiency, context-respectful interface for an agent to use. And ideally, if you use the file system, Claude Code can just go. Yeah, I mean, if you take...

Jeff Huber

If you take your classic MVC architecture, Ruby on Rails, Django, right? Models, views, controllers. What is that? The model is the schema of the data; it's the data itself. But I don't want to have to think about the schema. I'm not saying you should think about the schema; I'm just teeing up another analogy I think relates here. So one, there's the data. Two, there are the views: how do you view that data? And the third is the logic, right? The controllers are ultimately your business logic. Or take any application. What is it? Data, compute, and then some level of pixels to render back to the user, the different things they should be able to do. So I'm imagining this crazy flexible... what is the MVC analog in the AI world? Could it be Markdown plus some sort of AI rules engine? What is the AI-native version of this, I guess?

Dex Horthy

I think the business logic goes in your prompts, right? We have a lot of slash commands and subagents. Like this thing that does the calendar, that's a slash command. It just runs once a week in GitHub Actions, goes and pulls the things, pushes updates to the Markdown, commits, and pushes it. Yep. Right? Yep. So Markdown is part of it. I think you need some kind of orchestration and scheduling, because you don't want to open this up and tell it every time you want something done. That's kind of outside the AI. I think you need tooling. The tricky part is, okay, how do I figure out the Google OAuth stuff, get the scopes right? I'm not going to build a web app with a full OAuth cycle, because it's just me using it, and I'm just using CLI scripts. So you need a bare-bones auth management layer, and you need scripts, a way to write scripts, and a way to control them. I was talking with Ian from Keycard about MCP and auth, and basically everyone just has API keys littered everywhere. That's terrifying, because even the most fine-grained auth you get from Google or wherever is really coarse. You have calendar.read and you have calendar.manage, and those are your options. And I want to be able to say: you can create events as long as they only have these people on the invite, and you're only allowed to delete events on my calendar if I'm the only attendee. Those kinds of rules just can't exist. Auth is very unsolved for the AI world. And then I think you need a way to write and execute code, because then you can do new things and generate new integrations. And so, like...
So: secrets management, code execution, and storing the executed code so you can use it over and over again. Then you need a data layer, like Markdown, and you obviously need some kind of agent harness. Yep. What else? Views for humans, okay, actual pixel views, which again... that's been lowest priority for me, because I just look at everything in the editor.
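As an illustration of the fine-grained, rule-based auth Dex says is missing today (all rules, names, and addresses here are hypothetical), a per-action policy check might look like:

```python
# Sketch of per-action auth rules instead of coarse scopes like
# calendar.manage. The allow-list and rules are made up for illustration.
def can_perform(action: str, event: dict, me: str) -> bool:
    allowed_invitees = {"alice@example.com", "bob@example.com", me}
    if action == "create":
        # create events only if every attendee is on the allow-list
        return set(event["attendees"]) <= allowed_invitees
    if action == "delete":
        # delete only events where I am the sole attendee
        return event["attendees"] == [me]
    return False  # deny anything not explicitly allowed
```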

Jeff Huber

While true, for your semi-technical users, you're talking about doing things that feel complicated to me, which is building an Airtable syncing engine, for example, when you could just have a view for those users to the same data.

Dex Horthy

Yeah, that's probably a better approach.

Jeff Huber

By the way, the OAuth thing may not be that hard anymore. Over the weekend, I was building a desktop app, and I told Opus 4.5 to add OAuth for Google.

Dex Horthy

And did it do the local credential-server flow?

Jeff Huber

Mm-hmm.

Dex Horthy

Yeah.

Jeff Huber

Just one-shotted it. Yeah. A year ago, that did not work. This time it was so easy I thought I was insane. In my to-do list, I was like, this seems kind of complicated and really hard, I don't know if this is going to work. And I was like, yeah, let's just try. And I was actually incredibly shocked.

Dex Horthy

And it walked you through going and creating the service account and downloading it. Had you ever done that before with Google OAuth?

Jeff Huber

I had done it for web apps before, never for a desktop app, but it just did it. It just did it, one-shotted it, which was crazy. And it walked me through, like, okay, go to the Google dashboard console and get these things.

Dex Horthy

So that's another thing. I don't know if that's a good benchmark. Maybe that's one of my new benchmarks. I was doing this a year ago with an agent. Do you know CodeBuff? It was one of these coding CLIs that came out way before Claude Code. It was super fast, and it only supported YOLO mode. There was no way to add permissions, no complex interface. It was just: go, make it happen. It was very cool. But I was back and forth with, I think, a combination of Gemini Pro 2.0 and Claude 3.6, whatever. And...

Dex Horthy

It was like 10 or 15 rounds to get it working, and when it finally got the OAuth and it could list out my emails from Gmail, I literally typed, in all caps, "HOLY... YOU DID IT." I don't know if you've ever done that to a model.

Jeff Huber

Never have. Maybe I should. You mentioned using Claude Code for a bunch of stuff. Are there other agent harnesses that you're using? And if not, why not? Is Claude Code just the best, or how do you think about this?

Dex Horthy

Every time a new one comes out, I'll try it for like an hour. Like the new Cursor agent view. I'm interested more in the UX: how do you help humans keep all their work straight when their agents can go off and do things headless, and you've got to maintain context and context-switch a lot? So I played with the Cursor agent view. Didn't get what I... But again, now when I open any model, I talk to it like I would talk to Claude Code. And I was like, cool, I bet I could spend two weeks talking to this, not 10,000 hours, but give it my 100 hours, and then actually know what it's good at and what it's not. But I'd rather just stay focused on getting better and better at Claude Code and refining that. So we play with Codex, I play with Antigravity, we mess with all these things. Yeah.

Jeff Huber

You clearly have a lot of thoughts on, and feel for, the limitations of the models and the agent harness they're given. Claude Code is not open source, so you can't change it. Even if you find a place where it's weak, you have no recourse. Presumably you can yell at Anthropic DevRel on Twitter and make some progress there. A lot of people do that. I try not to do that. I...

Dex Horthy

I have a lot of sympathy for Tariq and the team after what I went through. I opened an issue on the Anthropic repo because there was a Claude Code breaking change that came out with 2.0. I filed an issue and then pinged the one guy I knew at Anthropic at the time: hey, can you help me get this escalated? And while I was waiting, I was reading through the others. There are like 6,000 open issues on the Claude Code repo. 6,000 open issues. It might be 4,000, I don't remember; it was in the thousands. It's intimidating. I read five or six of them, and I think five out of the six were people just saying: this thing did a bad job on a coding test I gave it, this thing sucks. And I'm like, man. I just think about how messy the signal is for a company that big with that much adoption. I don't know how they make sense of it. Hopefully AI.

Jeff Huber

Yeah, I would hope so. But going back to this agent harness piece: do you think there will be an open-source equivalent to Claude Code that is just as good? Have you tried OpenCode? I asked this on Twitter maybe a week or two ago, and resoundingly the comments were OpenCode, OpenCode, OpenCode, OpenCode, OpenCode.

Dex Horthy

Well, yeah, because OpenCode is basically built by reverse-engineering the Claude Code harness. Have you ever hooked up a proxy to Claude Code and read all the traffic? I know that you have. Yeah, you can do this. I actually did it yesterday, because I'm rebuilding a lot of our plan-generation workflows. Don't let me go on this tangent for too long, but we realized there's a lot of... In using the research-plan-implement workflow, which we shared and thousands of people have adopted on GitHub, grabbed the prompts and put them in their own projects, people are constantly saying these are the best, this is the state of the art for using Claude Code to solve hard problems in complex codebases. But as we work in the trenches with customers, where we have a couple of champions trying to roll it out to a team of 10, 50, 100 engineers, as an initial test of, hey, can we consolidate around one workflow for using AI in this company, what we found is that how I use the prompts is very different from how most people use them. Most people haven't used them enough to know what a good session versus a not-so-good session looks like. Part of it is there are six instructions, and if you don't reinforce them, like, okay, now we're on phase three, please also do four, five, and six, sometimes it'll just skip to the end, things like that. And I realized there's a lot of what I call oral tradition. It's the same thing with people who used to be really good prompters: okay, you use this command, then you tell it what you want, and at the end you have to say, remember, stay objective, we don't want you to tell me how to solve it.
Just tell me how the code base works today. Step by step. Yeah, exactly. Think step by step, all those kinds of things. So what we're trying to do now is: how do we make the product and the tooling require less of that oral tradition? How do we bake it into the opinions? And it's funny. I was the 12-factor agents guy, right? I was like, full-fat agents don't work, just do context engineering, treat LLM calls as an atomic step in your software, just like any other function. Then two months later, Claude Code starts blowing up, and I'm like, actually, full-fat agents are good to go, I'm the Claude Code guy now, let's go. And now we're realizing, oh, the thing we want to do is actually break this workflow up into a bunch of smaller pieces. There's a chat loop, and then you progress the conversation to another part of the chat loop. So we're back to context engineering and micro-agents: if you know what the steps are, don't rely on the prompt for control flow. If you know what the workflow is, split the prompt up into smaller workflow steps. You can still iterate with the human within those steps and then explicitly proceed to the next one, either by the model emitting a specific structured output or by the user opting in: yes, I'm done with the questions phase, now I want to go to the plan-outline phase. We're working on this, and it's really fun to build an AI product from scratch with really good evals from day one, because we know exactly what we're doing. One of the things I built, to be able to really diagnose and understand things, is that from day one the whole system has a logging proxy. Everything gets proxied through it, and we log every single request-response pair. So whenever anything happens, we can say, hey, go look into the logs, here's the exact response from Anthropic. You can reverse-engineer Claude Code from the outside, because it is closed source. Yep.
So why not switch to OpenCode? Uh, no comment. Okay, maybe coming. Okay, cool. But yeah, OpenCode is great, because it was basically built from proxying it. It's like a token-for-token replica of Claude Code. You can pass the same tools, the same tool definitions, you can use the same models, and you can make the tools behave in the exact same way. Right. And that's why OpenCode is tied with Claude Code on most of the benchmarks: it's the same thing, it's just open source. Right.
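The logging setup Dex describes (persist every request/response pair) reduces, at its core, to wrapping the model-call boundary. A minimal sketch, with `call_model` standing in for whatever client the harness actually uses:

```python
# Sketch of request/response logging at the model-call boundary:
# every pair is appended to a JSONL file for later diagnosis.
# The real system proxies HTTP traffic; this is the same idea in miniature.
import json
import time
from pathlib import Path

LOG = Path("llm_log.jsonl")

def logged_call(call_model, request: dict) -> dict:
    """Call the model, then persist the request/response pair."""
    response = call_model(request)
    with LOG.open("a") as f:
        f.write(json.dumps({"ts": time.time(),
                            "request": request,
                            "response": response}) + "\n")
    return response
```

Later, "go look into the logs" is just reading the JSONL file back, one pair per line.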

Jeff Huber

Right. Speaking of evals. Yeah. Popular topic in the press. I got cooked by Big Eval, man. Ooh, ooh. Let's get into that. Yeah, Big Eval. It exists. I wrote my notes here: there are a million AI observability companies, many of them very well funded. And there's also everybody saying, actually, LLM-as-judge doesn't really work very well. Like, actually, actually, actually.

Dex Horthy

How do you do evals? Oh, man, LLM-as-judge. I was working with a customer a long time ago, and they were like, hey, we're going to do it. I said, I don't think LLM-as-judge works very well. I don't think models are good at evaluating things. When we work, we try to keep the model objective as long as possible. Kyle actually just put up a post on what makes a good CLAUDE.md, and part of it is: never send an AI to do a linter's job. Anything that can be done deterministically should be. I don't trust a model to read code and tell me if it's good or not, because these models are optimized, optimized, optimized to tell us what we want to hear. You could say, hey, review this code and tell me if it's good or not, and it's like, yep, it's great. Hey, review this code and tell me if it's bad or not: yeah, it's trash. It all depends on how you phrase the question.

Jeff Huber

Yeah. I asked the question online: how do you get valuable critique from one of these models? And the answer was, you have to tell it, "my friend sent me this, and I want to give them some valuable advice. What should I tell them?" Basically, because otherwise the model doesn't want to hurt your feelings.

Dex Horthy

Exactly. Exactly. Yeah. So the LLM-as-judge thing is interesting. There's a lot to be said for evaluating the objective characteristics. At AI Tinkerers, we have an algorithm on the back end that's like, cool, we want to make sure this event is mostly by builders, for builders. So we don't ask, hey AI, rank this person on a scale of zero to a thousand on how technical they are. No: extract 50 data points, and then we have an algorithm that turns those into a one-to-a-thousand score. Have they had a software engineering job in the last two years? Have they pushed anything to GitHub? Does their GitHub stuff have AI stuff in it? There are 50 questions. I actually don't know exactly how it works, but I have reverse-engineered it by the number of bugs I've reported in it, and it's getting really good now. So as far as observability goes, I don't know. You just said you have very good evals. What are your evals? They're snapshot-based. What does that mean? So, we did an episode about evals with Vaibhav, and there are a bunch of different categories, and we don't have most of these. We'll link to it in the show notes. It's really fun. It's from like four months ago, and Vaibhav led the episode; I'm happy to talk about how great it is. I haven't seen anything of higher signal and value density on how to do good evals since then.
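The "extract data points, score deterministically" split Dex describes can be sketched like this; the field names, weights, and three-question schema are made up (he mentions ~50 real ones):

```python
# Sketch of "model extracts facts, code assigns the score": the LLM only
# answers factual yes/no questions; a plain function produces the 0-1000
# score. Fields and weights here are illustrative, not the real algorithm.
def technical_score(facts: dict) -> int:
    score = 0
    if facts.get("swe_job_last_2_years"):
        score += 400
    if facts.get("pushed_to_github"):
        score += 300
    if facts.get("github_has_ai_projects"):
        score += 300
    return score  # the model never assigns the number itself
```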

Jeff Huber

Okay.

Dex Horthy

And to be fair, I don't get that excited about evals, so I'm sure if I looked harder I would find some really good stuff. But essentially, we have this prompt workflow, and it's split into stages. So we have a test that runs it end to end for a question. It takes a long time, because it's Claude Code running sub-agents, searching files, doing all this stuff. And then we output the snapshot: basically, here's what the final output was. We can also break it down and do evals on each stage, kind of like unit tests versus integration tests, though even the "unit test" wraps a fairly large part of the workflow. And then we just store the output. When you run it, you create a set of new snapshots, and you can diff the snapshots in the CLI. We'll probably build a web UI; it's very easy to vibe-code a web UI for these things. And then you can accept the new snapshots: okay, that change is better, and I like that, oh, I made a change and the output changed significantly. Because evals, for me, I think of them the way software engineers think about unit tests or integration tests or end-to-end tests: as a way to prevent regressions.
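The snapshot mechanism Dex describes can be sketched in a few lines; the function name and file layout are hypothetical:

```python
# Sketch of snapshot-based evals: compare a stage's output against a
# stored snapshot, and overwrite it only when explicitly accepted
# (the "okay, that change is better" step). Names are illustrative.
from pathlib import Path

SNAP_DIR = Path("snapshots")

def check_snapshot(name: str, output: str, accept: bool = False) -> bool:
    """Return True if `output` matches the stored snapshot for `name`."""
    SNAP_DIR.mkdir(exist_ok=True)
    snap = SNAP_DIR / f"{name}.txt"
    if accept or not snap.exists():
        snap.write_text(output)  # first run or explicit acceptance
        return True
    return snap.read_text() == output
```

Diffing old versus new snapshots in the CLI is then just diffing the files under `snapshots/`.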

Jeff Huber

Yeah.

Dex Horthy

Right. Yeah. So you can have the very low-level unit-test evals, where the output comes out and I make a bunch of deterministic assertions about it. Really nice for structured-output problems and for parsing unstructured data into structured objects; you can make a lot of assertions there. But we're not there yet. The advice I like the most is: the first layer is vibes. Vibes are very high leverage, especially when you don't know what you're building yet and you don't know what you want it to look like. Ben Stein talked at AI Engineer World's Fair about how product management changes in a world where the capabilities of what you're building are emergent: you don't actually know what it's capable of until you build it and try it on a bunch of stuff. His take is that the BDD thing, define the behaviors you want, work backwards from that, and that's what you evaluate, never really worked anyway, and that building the evals first for an AI tool constrains what you're actually building. Instead: build the thing, have a product manager play with it for a couple of days, have them point out, okay, these are the behaviors we really like, here are the bugs, and then back into, okay, here's what we're going to eval against going forward. Yep, yep. Continuing that thought, but also shifting it a little bit...

Jeff Huber

The topic of continual learning has been in the news extensively recently, through like 18 of Dwarkesh's podcasts now, basically, but most notably the recent one with Ilya.

Dex Horthy

Summarize it for me because I've been like kind of following, but like what's your understanding? You've clearly consumed more of this than I have.

Jeff Huber

It just seems like there's an increasing awareness that like what we really want is the ability for these AI systems to be able to get better through experience. Okay. So the AI system goes out. We tell it to do a job. It does the job. It observes what it does well or what it does not do well. It also gets feedback from humans about what it's doing well or what it's not doing well. And then it's able to sort of update its intuition about how to do that thing and then get better over time. This is how humans operate, right? You could not write a manual that was detailed enough to onboard an engineer onto your team and one-shot it. In practice, you're going to be sitting next to that person and giving them micro-feedback for months to get them to be 100% autonomous, basically.

Dex Horthy

But you're also not writing down all that feedback. You expect them to absorb the feedback, either internally or on paper. No, exactly, exactly.

Jeff Huber

And so, yeah, tacit knowledge transfer is a big problem broadly, right? Even in human systems. But it feels like that's what we all want these AI systems to be able to do. Again: you use the thing, you notice where it makes mistakes, and then you need to either work around those mistakes or try to teach it to avoid them itself. And then ideally there's some aspect where it can just do that in the weights, because that's presumably more expressive than adjusting the prompt, or adjusting the RAG system it has access to, or adding directives to the end of your CLAUDE.md. Right, Claude has this memory thing: you do the hash shortcut and it's like, cool, I'm going to memorize that instruction.

Dex Horthy

Right. Um, but you want it to update its own stuff, right?

Jeff Huber

So there's some level of maybe offline compaction by the model. The current version, which I think I've seen people increasingly trying to do, and I'm not sure how successful it is, is: every night the model goes back through everything you did that day, reflects on, oh, across these hundred traces, what could I do better, and then tries to bake that into its knowledge for the next day, with some level of backtesting against the previous week to see if it works. You can imagine this sort of offline compaction system, which obviously could also be a training loop, so there could be continuous RL on top of the model. But, I guess, thoughts on continual learning broadly? It sounds like the naive solution is to build a good memory, and not naive in the sense that it's easy: building a good memory system is really freaking hard. Yeah. Why is it so hard?

Dex Horthy

I think it's hard to do. I think it's almost impossible to do generically right now. You can build really low-level building blocks, but building a thick horizontal memory layer... models are changing too much, the use cases are changing too much, the engineering techniques are changing too much. I know people who are building for a very specific use case. My buddy Brian is at an applied AI lab; they build an AI tutor. They implemented memory from scratch, because that was the only thing that gave them enough control. They call it decaying-resolution memory. Every time the agent turns on, it's like: cool, here's what's happened today, here are daily summaries for the last 14 days, here are weekly summaries for the three weeks before that, and here are monthly summaries for the couple of months before that. It's not conceptually hard. It's an educational use case: tutoring for grade school and high school students, basically. It receives emails from parents, it receives emails from students, it receives a daily wake-up just to check, hey, is there anything to do today? Here are your rules, and here's your memory, and so on. But that's a very specific implementation for a very specific use case. And I think if they had tried to generalize it, they would not have solved their own problem, and they would also not have solved anybody else's problem.
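The decaying-resolution idea reduces to bucketing past events by age before summarizing each bucket at a different granularity. A sketch, with bucket boundaries taken loosely from Dex's description (the exact cutoffs are assumptions):

```python
# Sketch of "decaying resolution memory": assign past event dates to
# resolution buckets (today / daily / weekly / monthly). Each bucket
# would then be summarized at its own granularity. Cutoffs are assumed.
from datetime import date, timedelta

def memory_buckets(today: date, event_dates: list[date]) -> dict[str, list[date]]:
    buckets = {"today": [], "daily": [], "weekly": [], "monthly": []}
    for d in event_dates:
        age = (today - d).days
        if age < 0:
            continue  # ignore future-dated events
        elif age == 0:
            buckets["today"].append(d)
        elif age <= 14:           # daily summaries for the last 14 days
            buckets["daily"].append(d)
        elif age <= 14 + 21:      # weekly summaries for ~3 weeks before that
            buckets["weekly"].append(d)
        else:                     # monthly summaries beyond that
            buckets["monthly"].append(d)
    return buckets
```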

Jeff Huber

I mean, the general case is obviously very interesting, because you can then theoretically point this agent at anything and it will just get better naturally. But I think for now, for people who are builders, the ability to build agent memory successfully for my use case is sort of a proxy for continual learning. It sounds like you're seeing some people be successful at that, but it also sounds like it's still not easy.

Dex Horthy

And it's less about behavior, right? It's more about factual recall: what are the things I need to know to do my job? And on the actual continual learning, because they need really good performance, they're not willing to let the agent update its own memories, and they're not willing to let the agent update its own instructions. The memory layer is the single system of continuity through the whole thing. Though I've only talked to him about it for about an hour, you know?

Jeff Huber

But the factual stuff feels doable today, right? Like, remember this user likes potatoes, or what they don't like. The factual stuff feels fairly doable. The thematic stuff feels much harder. But still doable.

Dex Horthy

The instructions and the rules: the how to be, not the what is true. Right. Yes, exactly.

Jeff Huber

Have you seen anybody attempt that?

Dex Horthy

I mean, I see a lot of people try to attempt it in their CLAUDE.md files, and it doesn't go very well. But why is that? Is it the model, the harness, or all of the above? There's some finding, and I think the study is like six months old at this point, so there's no Gemini 3 or anything in there, but it's that frontier models can reliably follow roughly 100 to 200 instructions. If you give them more than that, you really start to lose out: you spread the attention across all the instructions, the model has to try to decide which ones are relevant, and sometimes it won't.

Jeff Huber

This is like context for tools, basically.

Dex Horthy

For instructions, yeah, exactly. If you tell it too many ways to do things, it just won't work. And so people spam it with: always do this, never do this, always do this. When you put the all-caps thing in, that's going to put more attention there, and that's going to take attention away from everything else. So you get this almost instruction severity inflation, where everybody who wants to add a new instruction wants it to be followed, so they put theirs in all caps, and then the other ones get followed less, and everyone piles in, and suddenly your entire memory and instruction system, your whole system prompt, is just everything in all caps, and you're actually detuning it from everything else in the conversation.

Jeff Huber

Why is, I guess, agentic search not a solution to that problem? Anthropic just launched their tool search thing, for example. It seems like rules search in this context would also be potentially very effective.

Dex Horthy

Oh, yeah, I haven't seen anybody implement that. Oh, okay. It'd be like, hey, I'm doing this, how should I perform it? And then you RAG against your rules or something. All right, well, I know what we're hacking on now. Yeah, RulesBench. All right, great. I'm sure there are a lot of instruction-following benchmarks; I don't know if anyone's evaluated anything like that. If you're on the YouTube, ping us on Twitter and tell us, because I would be interested in hearing more about that.

Jeff Huber

Yep. All right. We just hit an hour. Damn, really? That was an hour. Okay. We could talk all day. It was fun. But why don't we call it there and save the good stuff for next time? Sounds great. Can't wait. Good stuff, dude. Cheers. See ya.