Latent Space: The AI Engineer Podcast – Details, Episodes, and Analysis
Podcast details
Technical and general information from the podcast's RSS feed.

Latent Space: The AI Engineer Podcast
swyx + Alessio
Frequency: 1 episode every 6 days. Total episodes: 143

Recent rankings
Latest positions in the Apple Podcasts and Spotify charts.
Apple Podcasts
No recent ranking available
Spotify
09/08/2025 • 🇺🇸 United States (Technology): #40 ↗
08/08/2025 • 🇺🇸 United States (Technology): #43 ↗
07/08/2025 • 🇺🇸 United States (Technology): #44 ↘
06/08/2025 • 🇺🇸 United States (Technology): #43 ↘
05/08/2025 • 🇺🇸 United States (Technology): #42 →
04/08/2025 • 🇺🇸 United States (Technology): #42 ↗
03/08/2025 • 🇺🇸 United States (Technology): #43 ↘
02/08/2025 • 🇬🇧 United Kingdom (Technology): #48 ↗
02/08/2025 • 🇺🇸 United States (Technology): #41 ↗
01/08/2025 • 🇬🇧 United Kingdom (Technology): #50 ↗
Links shared across episodes and podcasts
Links found in episode descriptions, along with other podcasts that also use them.
- https://www.descript.com/ (470 shares)
- https://notebooklm.google.com/ (416 shares)
- https://www.perplexity.ai/ (347 shares)
- https://github.com/dwhitena (302 shares)
- https://github.com/FanaHOVA/smol-podcaster (12 shares)
- https://github.com/stanfordnlp/dspy (12 shares)
- https://twitter.com/swyx (36 shares)
- https://twitter.com/ericries (34 shares)
- https://twitter.com/awilkinson (24 shares)
RSS feed quality and score
Technical evaluation of the quality and structure of the RSS feed.
Overall score: 53%
Publication history
Monthly breakdown of episode publications over the years.
The Agent Network — Dharmesh Shah
Friday, March 28, 2025 • Duration 01:38:24
If you’re in SF: Join us for the Claude Plays Pokemon hackathon this Sunday!
If you’re not: Fill out the 2025 State of AI Eng survey for $250 in Amazon cards!
We are SO excited to share our conversation with Dharmesh Shah, co-founder of HubSpot and creator of Agent.ai.
A particularly compelling concept we discussed is the idea of "hybrid teams" - the next evolution in workplace organization where human workers collaborate with AI agents as team members. Just as we previously saw hybrid teams emerge in terms of full-time vs. contract workers, or in-office vs. remote workers, Dharmesh predicts that the next frontier will be teams composed of both human and AI members. This raises interesting questions about team dynamics, trust, and how to effectively delegate tasks between human and AI team members.
The discussion of business models in AI reveals an important distinction between Work as a Service (WaaS) and Results as a Service (RaaS), something Dharmesh has written extensively about. While RaaS has gained popularity, particularly in customer support applications where outcomes are easily measurable, Dharmesh argues that this model may be over-indexed. Not all AI applications have clearly definable outcomes or consistent economic value per transaction, making WaaS more appropriate in many cases. This insight is particularly relevant for businesses considering how to monetize AI capabilities.
The technical challenges of implementing effective agent systems are also explored, particularly around memory and authentication. Shah emphasizes the importance of cross-agent memory sharing and the need for more granular control over data access. He envisions a future where users can selectively share parts of their data with different agents, similar to how OAuth works but with much finer control. This points to significant opportunities in developing infrastructure for secure and efficient agent-to-agent communication and data sharing.
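To make the idea concrete, here is a minimal sketch of what a scoped, per-agent data grant could look like; the DataGrant structure and its field names are illustrative assumptions, not an existing API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Hypothetical structure: a user-issued grant that lets one agent read
# a narrow slice of the user's data, OAuth-style but more fine-grained.
@dataclass
class DataGrant:
    agent_id: str                 # which agent receives access
    scopes: list[str]             # e.g. "contacts:read", "calendar:read"
    expires_at: datetime          # grants are time-boxed
    fields: list[str] = field(default_factory=list)  # optional per-field filter

def is_allowed(grant: DataGrant, agent_id: str, scope: str) -> bool:
    """Check a single access request against a grant."""
    return (
        grant.agent_id == agent_id
        and scope in grant.scopes
        and datetime.utcnow() < grant.expires_at
    )

# Example: share only read access to contacts with a research agent for a week.
grant = DataGrant(
    agent_id="research-agent",
    scopes=["contacts:read"],
    expires_at=datetime.utcnow() + timedelta(days=7),
)
print(is_allowed(grant, "research-agent", "contacts:read"))   # True
print(is_allowed(grant, "research-agent", "contacts:write"))  # False
```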
Other highlights from our conversation
* The Evolution of AI-Powered Agents – Exploring how AI agents have evolved from simple chatbots to sophisticated multi-agent systems, and the role of MCPs in enabling that.
* Hybrid Digital Teams and the Future of Work – How AI agents are becoming teammates rather than just tools, and what this means for business operations and knowledge work.
* Memory in AI Agents – The importance of persistent memory in AI systems and how shared memory across agents could enhance collaboration and efficiency.
* Business Models for AI Agents – Exploring the shift from software as a service (SaaS) to work as a service (WaaS) and results as a service (RaaS), and what this means for monetization.
* The Role of Standards Like MCP – Why MCP has been widely adopted and how it enables agent collaboration, tool use, and discovery.
* The Future of AI Code Generation and Software Engineering – How AI-assisted coding is changing the role of software engineers and what skills will matter most in the future.
* Domain Investing and Efficient Markets – Dharmesh’s approach to domain investing and how inefficiencies in digital asset markets create business opportunities.
* The Philosophy of Saying No – Lessons from "Sorry, You Must Pass" and how prioritization leads to greater productivity and focus.
Timestamps
* 00:00 Introduction and Guest Welcome
* 02:29 Dharmesh Shah's Journey into AI
* 05:22 Defining AI Agents
* 06:45 The Evolution and Future of AI Agents
* 13:53 Graph Theory and Knowledge Representation
* 20:02 Engineering Practices and Overengineering
* 25:57 The Role of Junior Engineers in the AI Era
* 28:20 Multi-Agent Systems and MCP Standards
* 35:55 LinkedIn's Legal Battles and Data Scraping
* 37:32 The Future of AI and Hybrid Teams
* 39:19 Building Agent AI: A Professional Network for Agents
* 40:43 Challenges and Innovations in Agent AI
* 45:02 The Evolution of UI in AI Systems
* 01:00:25 Business Models: Work as a Service vs. Results as a Service
* 01:09:17 The Future Value of Engineers
* 01:09:51 Exploring the Role of Agents
* 01:10:28 The Importance of Memory in AI
* 01:11:02 Challenges and Opportunities in AI Memory
* 01:12:41 Selective Memory and Privacy Concerns
* 01:13:27 The Evolution of AI Tools and Platforms
* 01:18:23 Domain Names and AI Projects
* 01:32:08 Balancing Work and Personal Life
* 01:35:52 Final Thoughts and Reflections
Transcript
Alessio [00:00:04]: Hey everyone, welcome back to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Small AI.
swyx [00:00:12]: Hello, and today we're super excited to have Dharmesh Shah to join us. I guess your relevant title here is founder of Agent AI.
Dharmesh [00:00:20]: Yeah, that's true for this. Yeah, creator of Agent.ai and co-founder of HubSpot.
swyx [00:00:25]: Co-founder of HubSpot, which I followed for many years, I think 18 years now, gonna be 19 soon. And you caught, you know, people can catch up on your HubSpot story elsewhere. I should also thank Shaan Puri, who I've chatted with back and forth, who's been, I guess, getting me in touch with your people. But also, I think like, just giving us a lot of context, because obviously, My First Million joined you guys, and they've been chatting with you guys a lot. So for the business side, we can talk about that, but I kind of wanted to engage your CTO, agent, engineer side of things. So how did you get agent religion?
Dharmesh [00:01:00]: Let's see. So I've been working, I'll take like a half step back, a decade or so ago, even though actually more than that. So even before HubSpot, the company I was contemplating that I had named for was called Ingenisoft. And the idea behind Ingenisoft was a natural language interface to business software. Now realize this is 20 years ago, so that was a hard thing to do. But the actual use case that I had in mind was, you know, we had data sitting in business systems like a CRM or something like that. And my kind of what I thought clever at the time. Oh, what if we used email as the kind of interface to get to business software? And the motivation for using email is that it automatically works when you're offline. So imagine I'm getting on a plane or I'm on a plane. There was no internet on planes back then. It's like, oh, I'm going through business cards from an event I went to. I can just type things into an email just to have them all in the backlog. When it reconnects, it sends those emails to a processor that basically kind of parses effectively the commands and updates the software, sends you the file, whatever it is. And there was a handful of commands. I was a little bit ahead of the times in terms of what was actually possible. And I reattempted this natural language thing with a product called ChatSpot that I did back 20...
swyx [00:02:12]: Yeah, this is your first post-ChatGPT project.
Dharmesh [00:02:14]: I saw it come out. Yeah. And so I've always been kind of fascinated by this natural language interface to software. Because, you know, as software developers, myself included, we've always said, oh, we build intuitive, easy-to-use applications. And it's not intuitive at all, right? Because what we're doing is... We're taking the mental model that's in our head of what we're trying to accomplish with said piece of software and translating that into a series of touches and swipes and clicks and things like that. And there's nothing natural or intuitive about it. And so natural language interfaces, for the first time, you know, whatever the thought is you have in your head and expressed in whatever language that you normally use to talk to yourself in your head, you can just sort of emit that and have software do something. And I thought that was kind of a breakthrough, which it has been. And it's gone. So that's where I first started getting into the journey. I started because now it actually works, right? So once we got ChatGPT and you can take, even with a few-shot example, convert something into structured, even back in the ChatGP 3.5 days, it did a decent job in a few-shot example, convert something to structured text if you knew what kinds of intents you were going to have. And so that happened. And that ultimately became a HubSpot project. But then agents intrigued me because I'm like, okay, well, that's the next step here. So chat's great. Love Chat UX. But if we want to do something even more meaningful, it felt like the next kind of advancement is not this kind of, I'm chatting with some software in a kind of a synchronous back and forth model, is that software is going to do things for me in kind of a multi-step way to try and accomplish some goals. So, yeah, that's when I first got started. It's like, okay, what would that look like? Yeah. And I've been obsessed ever since, by the way.
Alessio [00:03:55]: Which goes back to your first experience with it, which is like you're offline. Yeah. And you want to do a task. You don't need to do it right now. You just want to queue it up for somebody to do it for you. Yes. As you think about agents, like, let's start at the easy question, which is like, how do you define an agent? Maybe. You mean the hardest question in the universe? Is that what you mean?
Dharmesh [00:04:12]: You said you have an irritating take. I do have an irritating take. I think, well, some number of people have been irritated, including within my own team. So I have a very broad definition for agents, which is it's AI-powered software that accomplishes a goal. Period. That's it. And what irritates people about it is like, well, that's so broad as to be completely non-useful. And I understand that. I understand the criticism. But in my mind, if you kind of fast forward months, I guess, in AI years, the implementation of it, and we're already starting to see this, and we'll talk about this, different kinds of agents, right? So I think in addition to having a usable definition, and I like yours, by the way, and we should talk more about that, that you just came out with, the classification of agents actually is also useful, which is, is it autonomous or non-autonomous? Does it have a deterministic workflow? Does it have a non-deterministic workflow? Is it working synchronously? Is it working asynchronously? Then you have the different kind of interaction modes. Is it a chat agent, kind of like a customer support agent would be? You're having this kind of back and forth. Is it a workflow agent that just does a discrete number of steps? So there's all these different flavors of agents. So if I were to draw it in a Venn diagram, I would draw a big circle that says, this is agents, and then I have a bunch of circles, some overlapping, because they're not mutually exclusive. And so I think that's what's interesting, and we're seeing development along a bunch of different paths, right? So if you look at the first implementation of agent frameworks, you look at Baby AGI and AutoGPT, I think it was, not AutoGen, that's the Microsoft one. They were way ahead of their time because they assumed this level of reasoning and execution and planning capability that just did not exist, right? So it was an interesting thought experiment, which is what it was. Even the guy that, I'm an investor in Yohei's fund that did Baby AGI. It wasn't ready, but it was a sign of what was to come. And so the question then is, when is it ready? And so lots of people talk about the state of the art when it comes to agents. I'm a pragmatist, so I think of the state of the practical. It's like, okay, well, what can I actually build that has commercial value or solves actually some discrete problem with some baseline of repeatability or verifiability?
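To make that classification concrete, here is a small sketch of the axes Dharmesh lists (autonomy, deterministic vs. non-deterministic workflow, sync vs. async, interaction mode); the enum and class names are illustrative, not from any real framework.

```python
from dataclasses import dataclass
from enum import Enum

class Workflow(Enum):
    DETERMINISTIC = "deterministic"          # fixed, known sequence of steps
    NON_DETERMINISTIC = "non-deterministic"  # the LLM decides the path at runtime

class Execution(Enum):
    SYNCHRONOUS = "sync"    # user waits for the answer, chat-style
    ASYNCHRONOUS = "async"  # work is queued and finishes later

class Interaction(Enum):
    CHAT = "chat"          # back-and-forth, e.g. a customer support agent
    WORKFLOW = "workflow"  # runs a discrete set of steps and reports back

@dataclass
class AgentProfile:
    """One point in the (overlapping) Venn diagram of agent types."""
    name: str
    autonomous: bool
    workflow: Workflow
    execution: Execution
    interaction: Interaction

support_bot = AgentProfile(
    name="support-bot",
    autonomous=False,
    workflow=Workflow.NON_DETERMINISTIC,
    execution=Execution.SYNCHRONOUS,
    interaction=Interaction.CHAT,
)
print(support_bot)
```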
swyx [00:06:22]: There was a lot, and very, very interesting. I'm not irritated by it at all. Okay. As you know, I take a... There's a lot of anthropological view or linguistics view. And in linguistics, you don't want to be prescriptive. You want to be descriptive. Yeah. So you're a goals guy. That's the key word in your thing. And other people have other definitions that might involve like delegated trust or non-deterministic work, LLM in the loop, all that stuff. The other thing I was thinking about, just the comment on Baby AGI, AutoGPT. Yeah. In that piece that you just read, I was able to go through our backlog and just kind of track the winter of agents and then the summer now. Yeah. And it's... We can tell the whole story as an oral history, just following that thread. And it's really just like, I think, I tried to explain the why now, right? Like I had, there's better models, of course. There's better tool use with like, they're just more reliable. Yep. Better tools with MCP and all that stuff. And I'm sure you have opinions on that too. Business model shift, which you like a lot. I just heard you talk about RaaS with the MFM guys. Yep. Cost is dropping a lot. Yep. Inference is getting faster. There's more model diversity. Yep. Yep. I think it's a subtle point. It means that like, you have different models with different perspectives. You don't get stuck in the basin of performance of a single model. Sure. You can just get out of it by just switching models. Yep. Multi-agent research and RL fine tuning. So I just wanted to let you respond to like any of that.
Dharmesh [00:07:44]: Yeah. A couple of things. Connecting the dots on the kind of the definition side of it. So we'll get the irritation out of the way completely. I have one more, even more irritating leap on the agent definition thing. So here's the way I think about it. By the way, the kind of word agent, I looked it up, like the English dictionary definition. The old school agent, yeah. Is when you have someone or something that does something on your behalf, like a travel agent or a real estate agent acts on your behalf. It's like proxy, which is a nice kind of general definition. So the other direction I'm sort of headed, and it's going to tie back to tool calling and MCP and things like that, is if you, and I'm not a biologist by any stretch of the imagination, but we have these single-celled organisms, right? Like the simplest possible form of what one would call life. But it's still life. It just happens to be single-celled. And then you can combine cells and then cells become specialized over time. And you have much more sophisticated organisms, you know, kind of further down the spectrum. In my mind, at the most fundamental level, you can almost think of having atomic agents. What is the simplest possible thing that's an agent that can still be called an agent? What is the equivalent of a kind of single-celled organism? And the reason I think that's useful is right now we're headed down the road, which I think is very exciting around tool use, right? That says, okay, the LLMs now can be provided a set of tools that it calls to accomplish whatever it needs to accomplish in the kind of furtherance of whatever goal it's trying to get done. And I'm not overly bothered by it, but if you think about it, if you just squint a little bit and say, well, what if everything was an agent? And what if tools were actually just atomic agents? Because then it's turtles all the way down, right? Then it's like, oh, well, all that's really happening with tool use is that we have a network of agents that know about each other through something like an MMCP and can kind of decompose a particular problem and say, oh, I'm going to delegate this to this set of agents. And why do we need to draw this distinction between tools, which are functions most of the time? And an actual agent. And so I'm going to write this irritating LinkedIn post, you know, proposing this. It's like, okay. And I'm not suggesting we should call even functions, you know, call them agents. But there is a certain amount of elegance that happens when you say, oh, we can just reduce it down to one primitive, which is an agent that you can combine in complicated ways to kind of raise the level of abstraction and accomplish higher order goals. Anyway, that's my answer. I'd say that's a success. Thank you for coming to my TED Talk on agent definitions.
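A minimal sketch of the "one primitive" idea discussed above: a single Agent interface, a wrapper that turns an ordinary function (a tool) into an atomic agent, and a composite agent that delegates to other agents. All names here are illustrative assumptions, not an existing framework.

```python
from typing import Callable, Protocol

class Agent(Protocol):
    """The single primitive: anything that accomplishes a goal."""
    def run(self, goal: str) -> str: ...

class ToolAgent:
    """An 'atomic agent': wraps a plain function so tools and agents share one interface."""
    def __init__(self, name: str, fn: Callable[[str], str]):
        self.name, self.fn = name, fn

    def run(self, goal: str) -> str:
        return self.fn(goal)

class CompositeAgent:
    """Decomposes a goal by delegating to the agents it knows about."""
    def __init__(self, name: str, members: list[Agent]):
        self.name, self.members = name, members

    def run(self, goal: str) -> str:
        # Trivial delegation strategy for illustration: ask every member and
        # stitch the results together. A real system would plan and route.
        return "\n".join(member.run(goal) for member in self.members)

echo = ToolAgent("echo", lambda g: f"echo: {g}")
shout = ToolAgent("shout", lambda g: g.upper())
team = CompositeAgent("team", [echo, shout])
print(team.run("value this domain"))
```

The point of the sketch is only that once tools and agents share one interface, composition becomes turtles all the way down, which is the elegance Dharmesh is gesturing at.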
Alessio [00:09:54]: How do you define the minimum viable agent? Do you already have a definition for, like, where you draw the line between a cell and an atom? Yeah.
Dharmesh [00:10:02]: So in my mind, it has to, at some level, use AI in order for it to—otherwise, it's just software. It's like, you know, we don't need another word for that. And so that's probably where I draw the line. So then the question, you know, the counterargument would be, well, if that's true, then lots of tools themselves are actually not agents because they're just doing a database call or a REST API call or whatever it is they're doing. And that does not necessarily qualify them, which is a fair counterargument. And I accept that. It's like a good argument. I still like to think about—because we'll talk about multi-agent systems, because I think—so we've accepted, which I think is true, lots of people have said it, and you've hopefully combined some of those clips of really smart people saying this is the year of agents, and I completely agree, it is the year of agents. But then shortly after that, it's going to be the year of multi-agent systems or multi-agent networks. I think that's where it's going to be headed next year. Yeah.
swyx [00:10:54]: OpenAI's already on that. Yeah. My quick philosophical engagement with you on this. I often think about kind of the other spectrum, the other end of the cell spectrum. So single cell is life, multi-cell is life, and you clump a bunch of cells together in a more complex organism, they become organs, like an eye and a liver or whatever. And then obviously we consider ourselves one life form. There's not like a lot of lives within me. I'm just one life. And now, obviously, I don't think people don't really like to anthropomorphize agents and AI. Yeah. But we are extending our consciousness and our brain and our functionality out into machines. I just saw you were a Bee. Yeah. Which is, you know, it's nice. I have a Limitless pendant in my pocket.
Dharmesh [00:11:37]: I got one of these boys. Yeah.
swyx [00:11:39]: I'm testing it all out. You know, got to be early adopters. But like, we want to extend our personal memory into these things so that we can be good at the things that we're good at. And, you know, machines are good at it. Machines are there. So like, my definition of life is kind of like going outside of my own body now. I don't know if you've ever had like reflections on that. Like how yours. How our self is like actually being distributed outside of you. Yeah.
Dharmesh [00:12:01]: I don't fancy myself a philosopher. But you went there. So yeah, I did go there. I'm fascinated by kind of graphs and graph theory and networks and have been for a long, long time. And to me, we're sort of all nodes in this kind of larger thing. It just so happens that we're looking at individual kind of life forms as they exist right now. But so the idea is when you put a podcast out there, there's these little kind of nodes you're putting out there of like, you know, conceptual ideas. Once again, you have varying kind of forms of those little nodes that are up there and are connected in varying and sundry ways. And so I just think of myself as being a node in a massive, massive network. And I'm producing more nodes as I put content or ideas. And, you know, you spend some portion of your life collecting dots, experiences, people, and some portion of your life then connecting dots from the ones that you've collected over time. And I found that really interesting things happen and you really can't know in advance how those dots are necessarily going to connect in the future. And that's, yeah. So that's my philosophical take. That's the, yes, exactly. Coming back.
Alessio [00:13:04]: Yep. Do you like graph as an agent abstraction? That's been one of the hot topics with LangGraph and Pydantic and all that.
Dharmesh [00:13:11]: I do. The thing I'm more interested in terms of use of graphs, and there's lots of work happening on that now, is graph data stores as an alternative in terms of knowledge stores and knowledge graphs. Yeah. Because, you know, so I've been in software now 30 plus years, right? So it's not 10,000 hours. It's like 100,000 hours that I've spent doing this stuff. And so I've grew up with, so back in the day, you know, I started on mainframes. There was a product called IMS from IBM, which is basically an index database, what we'd call like a key value store today. Then we've had relational databases, right? We have tables and columns and foreign key relationships. We all know that. We have document databases like MongoDB, which is sort of a nested structure keyed by a specific index. We have vector stores, vector embedding database. And graphs are interesting for a couple of reasons. One is, so it's not classically structured in a relational way. When you say structured database, to most people, they're thinking tables and columns and in relational database and set theory and all that. Graphs still have structure, but it's not the tables and columns structure. And you could wonder, and people have made this case, that they are a better representation of knowledge for LLMs and for AI generally than other things. So that's kind of thing number one conceptually, and that might be true, I think is possibly true. And the other thing that I really like about that in the context of, you know, I've been in the context of data stores for RAG is, you know, RAG, you say, oh, I have a million documents, I'm going to build the vector embeddings, I'm going to come back with the top X based on the semantic match, and that's fine. All that's very, very useful. But the reality is something gets lost in the chunking process and the, okay, well, those tend, you know, like, you don't really get the whole picture, so to speak, and maybe not even the right set of dimensions on the kind of broader picture. And it makes intuitive sense to me that if we did capture it properly in a graph form, that maybe that feeding into a RAG pipeline will actually yield better results for some use cases, I don't know, but yeah.
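A toy contrast of the two retrieval shapes Dharmesh compares: chunk-level vector search returns isolated snippets, while a knowledge graph lets you walk the relationships around a hit and feed connected facts to the model. The graph below is a deliberately tiny placeholder, not a real GraphRAG pipeline.

```python
# Toy knowledge graph: entity -> list of (relation, entity) edges.
graph = {
    "HubSpot": [("founded_by", "Dharmesh Shah"), ("category", "CRM")],
    "Dharmesh Shah": [("created", "Agent.ai"), ("co_founded", "HubSpot")],
    "Agent.ai": [("is_a", "professional network for agents")],
}

def expand(entity: str, hops: int = 1) -> list[tuple[str, str, str]]:
    """Collect (subject, relation, object) triples within `hops` of an entity."""
    triples, frontier = [], [entity]
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for rel, obj in graph.get(node, []):
                triples.append((node, rel, obj))
                next_frontier.append(obj)
        frontier = next_frontier
    return triples

# Vector RAG would return the top-k chunks that happen to mention "Dharmesh Shah".
# Graph retrieval instead returns connected facts you can hand to the LLM:
for triple in expand("Dharmesh Shah", hops=2):
    print(triple)
```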
Alessio [00:15:03]: And do you feel like at the core of it, there's this difference between imperative and declarative programs? Because if you think about HubSpot, it's like, you know, people and graph kind of goes hand in hand, you know, but I think maybe the software before was more like primary foreign key based relationship, versus now the models can traverse through the graph more easily.
Dharmesh [00:15:22]: Yes. So I like that representation. There's something. It's just conceptually elegant about graphs and just from the representation of it, they're much more discoverable, you can kind of see it, there's observability to it, versus kind of embeddings, which you can't really do much with as a human. You know, once they're in there, you can't pull stuff back out. But yeah, I like that kind of idea of it. And the other thing that's kind of, because I love graphs, I've been long obsessed with PageRank from back in the early days. And, you know, one of the kind of simplest algorithms in terms of coming up, you know, with a phone, everyone's been exposed to PageRank. And the idea is that, and so I had this other idea for a project, not a company, and I have hundreds of these, called NodeRank, is to be able to take the idea of PageRank and apply it to an arbitrary graph that says, okay, I'm going to define what authority looks like and say, okay, well, that's interesting to me, because then if you say, I'm going to take my knowledge store, and maybe this person that contributed some number of chunks to the graph data store has more authority on this particular use case or prompt that's being submitted than this other one that may, or maybe this one was more. popular, or maybe this one has, whatever it is, there should be a way for us to kind of rank nodes in a graph and sort them in some, some useful way. Yeah.
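The NodeRank idea maps closely onto personalized PageRank, which networkx already ships; the graph, the contributors, and the "authority" seeding below are invented stand-ins that just show the shape of the computation (and assume networkx is installed).

```python
import networkx as nx

# Contributors and the chunks they added to a knowledge store (made-up data).
G = nx.DiGraph()
G.add_edges_from([
    ("alice", "chunk:pricing"), ("alice", "chunk:onboarding"),
    ("bob", "chunk:pricing"),
    ("chunk:pricing", "doc:sales-playbook"),
    ("chunk:onboarding", "doc:sales-playbook"),
])

# "Define what authority looks like": bias the random walk toward nodes
# relevant to the current prompt (here, anything about pricing).
personalization = {n: (1.0 if "pricing" in n else 0.01) for n in G.nodes}

scores = nx.pagerank(G, alpha=0.85, personalization=personalization)
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{node:22s} {score:.3f}")
```

Swapping in a different personalization vector per prompt is what turns generic PageRank into the per-query node ranking he describes.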
swyx [00:16:34]: So I think that's generally useful for, for anything. I think the, the problem, like, so even though at my conferences, GraphRag is super popular and people are getting knowledge, graph religion, and I will say like, it's getting space, getting traction in two areas, conversation memory, and then also just rag in general, like the, the, the document data. Yeah. It's like a source. Most ML practitioners would say that knowledge graph is kind of like a dirty word. The graph database, people get graph religion, everything's a graph, and then they, they go really hard into it and then they get a, they get a graph that is too complex to navigate. Yes. And so like the, the, the simple way to put it is like you at running HubSpot, you know, the power of graphs, the way that Google has pitched them for many years, but I don't suspect that HubSpot itself uses a knowledge graph. No. Yeah.
Dharmesh [00:17:26]: So when is it over engineering? Basically? It's a great question. I don't know. So the question now, like in AI land, right, is the, do we necessarily need to understand? So right now, LLMs for, for the most part are somewhat black boxes, right? We sort of understand how the, you know, the algorithm itself works, but we really don't know what's going on in there and, and how things come out. So if a graph data store is able to produce the outcomes we want, it's like, here's a set of queries I want to be able to submit and then it comes out with useful content. Maybe the underlying data store is as opaque as a vector embeddings or something like that, but maybe it's fine. Maybe we don't necessarily need to understand it to get utility out of it. And so maybe if it's messy, that's okay. Um, that's, it's just another form of lossy compression. Uh, it's just lossy in a way that we just don't completely understand in terms of, because it's going to grow organically. Uh, and it's not structured. It's like, ah, we're just gonna throw a bunch of stuff in there. Let the, the equivalent of the embedding algorithm, whatever they called in graph land. Um, so the one with the best results wins. I think so. Yeah.
swyx [00:18:26]: Or is this the practical side of me is like, yeah, it's, if it's useful, we don't necessarily
Dharmesh [00:18:30]: need to understand it.
swyx [00:18:30]: I have, I mean, I'm happy to push back as long as you want. Uh, it's not practical to evaluate like the 10 different options out there because it takes time. It takes people, it takes, you know, resources, right? Set. That's the first thing. Second thing is your evals are typically on small things and some things only work at scale. Yup. Like graphs. Yup.
Dharmesh [00:18:46]: Yup. That's, yeah, no, that's fair. And I think this is one of the challenges in terms of implementation of graph databases is that the most common approach that I've seen developers do, I've done it myself, is that, oh, I've got a Postgres database or a MySQL or whatever. I can represent a graph with a very set of tables with a parent child thing or whatever. And that sort of gives me the ability, uh, why would I need anything more than that? And the answer is, well, if you don't need anything more than that, you don't need anything more than that. But there's a high chance that you're sort of missing out on the actual value that, uh, the graph representation gives you. Which is the ability to traverse the graph, uh, efficiently in ways that kind of going through the, uh, traversal in a relational database form, even though structurally you have the data, practically you're not gonna be able to pull it out in, in useful ways. Uh, so you wouldn't like represent a social graph, uh, in, in using that kind of relational table model. It just wouldn't scale. It wouldn't work.
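To see where the parent-child-table approach strains, here is a sketch of a two-hop traversal over an edges table in SQLite: it works, but every extra hop means recursive SQL, which a native graph store expresses as a plain traversal. Table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    [("alice", "bob"), ("bob", "carol"), ("carol", "dave"), ("alice", "erin")],
)

# Friends-of-friends: in a relational model this needs a recursive CTE;
# a graph database would express it as a simple two-hop traversal.
rows = conn.execute("""
    WITH RECURSIVE reachable(person, depth) AS (
        SELECT dst, 1 FROM edges WHERE src = 'alice'
        UNION ALL
        SELECT e.dst, r.depth + 1
        FROM edges e JOIN reachable r ON e.src = r.person
        WHERE r.depth < 2
    )
    SELECT DISTINCT person FROM reachable WHERE depth = 2
""").fetchall()

print(rows)  # [('carol',)] : alice -> bob -> carol
```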
swyx [00:19:36]: Uh, yeah. Uh, I think we want to move on to MCP. Yeah. But I just want to, like, just engineering advice. Yeah. Uh, obviously you've, you've, you've run, uh, you've, you've had to do a lot of projects and run a lot of teams. Do you have a general rule for over-engineering or, you know, engineering ahead of time? You know, like, because people, we know premature engineering is the root of all evil. Yep. But also sometimes you just have to. Yep. When do you do it? Yes.
Dharmesh [00:19:59]: It's a great question. This is, uh, a question as old as time almost, which is what's the right and wrong levels of abstraction. That's effectively what, uh, we're answering when we're trying to do engineering. I tend to be a pragmatist, right? So here's the thing. Um, lots of times doing something the right way. Yeah. It's like a marginal increased cost in those cases. Just do it the right way. And this is what makes a, uh, a great engineer or a good engineer better than, uh, a not so great one. It's like, okay, all things being equal. If it's going to take you, you know, roughly close to constant time anyway, might as well do it the right way. Like, so do things well, then the question is, okay, well, am I building a framework as the reusable library? To what degree, uh, what am I anticipating in terms of what's going to need to change in this thing? Uh, you know, along what dimension? And then I think like a business person in some ways, like what's the return on calories, right? So, uh, and you look at, um, energy, the expected value of it's like, okay, here are the five possible things that could happen, uh, try to assign probabilities like, okay, well, if there's a 50% chance that we're going to go down this particular path at some day, like, or one of these five things is going to happen and it costs you 10% more to engineer for that. It's basically, it's something that yields a kind of interest compounding value. Um, as you get closer to the time of, of needing that versus having to take on debt, which is when you under engineer it, you're taking on debt. You're going to have to pay off when you do get to that eventuality where something happens. One thing as a pragmatist, uh, so I would rather under engineer something than over engineer it. If I were going to err on the side of something, and here's the reason is that when you under engineer it, uh, yes, you take on tech debt, uh, but the interest rate is relatively known and payoff is very, very possible, right? Which is, oh, I took a shortcut here as a result of which now this thing that should have taken me a week is now going to take me four weeks. Fine. But if that particular thing that you thought might happen, never actually, you never have that use case transpire or just doesn't, it's like, well, you just save yourself time, right? And that has value because you were able to do other things instead of, uh, kind of slightly over-engineering it away, over-engineering it. But there's no perfect answers in art form in terms of, uh, and yeah, we'll, we'll bring kind of this layers of abstraction back on the code generation conversation, which we'll, uh, I think I have later on, but
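The "return on calories" heuristic is essentially an expected-value comparison; the numbers below are invented purely to show the arithmetic, not figures from the episode.

```python
# Invented numbers: engineer ahead of time vs. take on tech debt and pay later.
p_need = 0.5            # chance the anticipated requirement ever materializes
build_now_extra = 0.10  # +10% cost today to engineer for it up front
retrofit_cost = 0.40    # +40% cost later if you skipped it and the need shows up

cost_over_engineer = build_now_extra          # paid with certainty
cost_under_engineer = p_need * retrofit_cost  # paid only if the need arrives

print(f"engineer ahead: {cost_over_engineer:.2f}, defer: {cost_under_engineer:.2f}")
# Here deferring looks worse in expectation (0.20 > 0.10); the comparison flips
# when p_need is low or when refactoring cost keeps trending toward zero,
# which is the case Dharmesh makes for under-engineering.
```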
Alessio [00:22:05]: I was going to ask, we can just jump ahead quickly. Yeah. Like, as you think about vibe coding and all that, how does the. Yeah. Percentage of potential usefulness change when I feel like we over-engineering a lot of times it's like the investment in syntax, it's less about the investment in like arc exacting. Yep. Yeah. How does that change your calculus?
Dharmesh [00:22:22]: A couple of things, right? One is, um, so, you know, going back to that kind of ROI or a return on calories, kind of calculus or heuristic you think through, it's like, okay, well, what is it going to cost me to put this layer of abstraction above the code that I'm writing now, uh, in anticipating kind of future needs. If the cost of fixing, uh, or doing under engineering right now. Uh, we'll trend towards zero that says, okay, well, I don't have to get it right right now because even if I get it wrong, I'll run the thing for six hours instead of 60 minutes or whatever. It doesn't really matter, right? Like, because that's going to trend towards zero to be able, the ability to refactor a code. Um, and because we're going to not that long from now, we're going to have, you know, large code bases be able to exist, uh, you know, as, as context, uh, for a code generation or a code refactoring, uh, model. So I think it's going to make it, uh, make the case for under engineering, uh, even stronger. Which is why I take on that cost. You just pay the interest when you get there, it's not, um, just go on with your life vibe coded and, uh, come back when you need to. Yeah.
Alessio [00:23:18]: Sometimes I feel like there's no decision-making in some things like, uh, today I built a autosave for like our internal notes platform and I literally just ask them cursor. Can you add autosave? Yeah. I don't know if it's over under engineer. Yep. I just vibe coded it. Yep. And I feel like at some point we're going to get to the point where the models kind
Dharmesh [00:23:36]: of decide where the right line is, but this is where the, like the, in my mind, the danger is, right? So there's two sides to this. One is the cost of kind of development and coding and things like that stuff that, you know, we talk about. But then like in your example, you know, one of the risks that we have is that because adding a feature, uh, like a save or whatever the feature might be to a product as that price tends towards zero, are we going to be less discriminant about what features we add as a result of making more product products more complicated, which has a negative impact on the user and navigate negative impact on the business. Um, and so that's the thing I worry about if it starts to become too easy, are we going to be. Too promiscuous in our, uh, kind of extension, adding product extensions and things like that. It's like, ah, why not add X, Y, Z or whatever back then it was like, oh, we only have so many engineering hours or story points or however you measure things. Uh, that least kept us in check a little bit. Yeah.
Alessio [00:24:22]: And then over engineering, you're like, yeah, it's kind of like you're putting that on yourself. Yeah. Like now it's like the models don't understand that if they add too much complexity, it's going to come back to bite them later. Yep. So they just do whatever they want to do. Yeah. And I'm curious where in the workflow that's going to be, where it's like, Hey, this is like the amount of complexity and over-engineering you can do before you got to ask me if we should actually do it versus like do something else.
Dharmesh [00:24:45]: So you know, we've already, let's like, we're leaving this, uh, in the code generation world, this kind of compressed, um, cycle time. Right. It's like, okay, we went from auto-complete, uh, in the GitHub Copilot to like, oh, finish this particular thing and hit tab to a, oh, I sort of know your file or whatever. I can write out a full function to you to now I can like hold a bunch of the context in my head. Uh, so we can do app generation, which we have now with Lovable and Bolt and Replit Agent. Yeah. Association and other things. So then the question is, okay, well, where does it naturally go from here? So we're going to generate products. Make sense. We might be able to generate platforms as though I want a platform for ERP that does this, whatever. And that includes the API's includes the product and the UI, and all the things that make for a platform. There's no nothing that says we would stop like, okay, can you generate an entire software company someday? Right. Uh, with the platform and the monetization and the go-to-market and the whatever. And you know, that that's interesting to me in terms of, uh, you know, what, when you take it to almost ludicrous levels of abstraction.
swyx [00:25:39]: It's like, okay, turn it to 11. You mentioned vibe coding, so I have to, this is a blog post I haven't written, but I'm kind of exploring it. Is the junior engineer dead?
Dharmesh [00:25:49]: I don't think so. I think what will happen is that the junior engineer will be able to, if all they're bringing to the table is the fact that they are a junior engineer, then yes, they're likely dead. But hopefully if they can communicate with carbon-based life forms, they can interact with product, if they're willing to talk to customers, they can take their kind of basic understanding of engineering and how kind of software works. I think that has value. So I have a 14-year-old right now who's taking Python programming class, and some people ask me, it's like, why is he learning coding? And my answer is, is because it's not about the syntax, it's not about the coding. What he's learning is like the fundamental thing of like how things work. And there's value in that. I think there's going to be timeless value in systems thinking and abstractions and what that means. And whether functions manifested as math, which he's going to get exposed to regardless, or there are some core primitives to the universe, I think, that the more you understand them, those are what I would kind of think of as like really large dots in your life that will have a higher gravitational pull and value to them that you'll then be able to. So I want him to collect those dots, and he's not resisting. So it's like, okay, while he's still listening to me, I'm going to have him do things that I think will be useful.
swyx [00:26:59]: You know, part of one of the pitches that I evaluated for AI engineer is a term. And the term is that maybe the traditional interview path or career path of software engineer goes away, which is because what's the point of LeetCode? Yeah. And, you know, it actually matters more that you know how to work with AI and to implement the things that you want. Yep.
Dharmesh [00:27:16]: That's one of the like interesting things that's happened with generative AI. You know, you go from machine learning and the models and just that underlying form, which is like true engineering, right? Like the actual, what I call real engineering. I don't think of myself as a real engineer, actually. I'm a developer. But now with generative AI. We call it AI and it's obviously got its roots in machine learning, but it just feels like fundamentally different to me. Like you have the vibe. It's like, okay, well, this is just a whole different approach to software development to so many different things. And so I'm wondering now, it's like an AI engineer is like, if you were like to draw the Venn diagram, it's interesting because the cross between like AI things, generative AI and what the tools are capable of, what the models do, and this whole new kind of body of knowledge that we're still building out, it's still very young, intersected with kind of classic engineering, software engineering. Yeah.
swyx [00:28:04]: I just described the overlap as it separates out eventually until it's its own thing, but it's starting out as a software. Yeah.
Alessio [00:28:11]: That makes sense. So to close the vibe coding loop, the other big hype now is MCPs. Obviously, I would say Claude Desktop and Cursor are like the two main drivers of MCP usage. I would say my favorite is the Sentry MCP. I can pull in errors and then you can just put the context in Cursor. How do you think about that abstraction layer? Does it feel... Does it feel almost too magical in a way? Do you think it's like you get enough? Because you don't really see how the server itself is then kind of like repackaging the
Dharmesh [00:28:41]: information for you? I think MCP as a standard is one of the better things that's happened in the world of AI because a standard needed to exist and absent a standard, there was a set of things that just weren't possible. Now, we can argue whether it's the best possible manifestation of a standard or not. Does it do too much? Does it do too little? I get that, but it's just simple enough to both be useful and unobtrusive. It's understandable and adoptable by mere mortals, right? It's not overly complicated. You know, a reasonable engineer can stand up an MCP server relatively easily. The thing that has me excited about it is like, so I'm a big believer in multi-agent systems. And so that's going back to our kind of this idea of an atomic agent. So imagine the MCP server, like obviously it calls tools, but the way I think about it, so I'm working on my current passion project is agent.ai. And we'll talk more about that in a little bit. More about the, I think we should, because I think it's interesting not to promote the project at all, but there's some interesting ideas in there. One of which is around, we're going to need a mechanism for, if agents are going to collaborate and be able to delegate, there's going to need to be some form of discovery and we're going to need some standard way. It's like, okay, well, I just need to know what this thing over here is capable of. We're going to need a registry, which Anthropic's working on. I'm sure others will and have been doing directories of, and there's going to be a standard around that too. How do you build out a directory of MCP servers? I think that's going to unlock so many things just because, and we're already starting to see it. So I think MCP or something like it is going to be the next major unlock because it allows systems that don't know about each other, don't need to, it's that kind of decoupling of like Sentry and whatever tools someone else was building. And it's not just about, you know, Claude Desktop or things like, even on the client side, I think we're going to see very interesting consumers of MCP, MCP clients versus just the chat body kind of things. Like, you know, Claude Desktop and Cursor and things like that. But yeah, I'm very excited about MCP in that general direction.
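As a point of reference, standing up a small MCP server really is "mere mortals" territory. This sketch assumes the FastMCP helper from the official Python MCP SDK (the exact interface may differ by version), and the tool itself is a made-up placeholder.

```python
# Assumes: pip install mcp  (official Python MCP SDK; API may vary by version)
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("domain-tools")

@mcp.tool()
def estimate_domain_value(domain: str) -> str:
    """Placeholder tool: return a rough valuation for a domain name."""
    # A real implementation would look up comparable sales, TLD, word count, etc.
    base = 500 if domain.endswith(".ai") else 100
    return f"{domain}: rough estimate ${base * max(1, 20 - len(domain))}"

if __name__ == "__main__":
    # Clients such as Claude Desktop or Cursor can then discover and call the tool.
    mcp.run()
```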
swyx [00:30:39]: I think the typical cynical developer take, it's like, we have OpenAPI. Yeah. What's the new thing? I don't know if you have a, do you have a quick MCP versus everything else? Yeah.
Dharmesh [00:30:49]: So it's, so I like OpenAPI, right? So just a descriptive thing. It's OpenAPI. OpenAPI. Yes, that's what I meant. So it's basically a self-documenting thing. We can do machine-generated, lots of things from that output. It's a structured definition of an API. I get that, love it. But MCPs sort of are kind of use case specific. They're perfect for exactly what we're trying to use them for around LLMs in terms of discovery. It's like, okay, I don't necessarily need to know kind of all this detail. And so right now we have, we'll talk more about like MCP server implementations, but We will? I think, I don't know. Maybe we won't. At least it's in my head. It's like a back processor. But I do think MCP adds value above OpenAPI. It's, yeah, just because it solves this particular thing. And if we had come to the world, which we have, like, it's like, hey, we already have OpenAPI. It's like, if that were good enough for the universe, the universe would have adopted it already. There's a reason why MCP is taking off because it marginally adds something that was missing before and doesn't go too far. And so that's why the kind of rate of adoption, you folks have written about this and talked about it. Yeah, why MCP won. Yeah. And it won because the universe decided that this was useful and maybe it gets supplanted by something else. Yeah. And maybe we discover, oh, maybe OpenAPI was good enough the whole time. I doubt that.
swyx [00:32:09]: The meta lesson, this is, I mean, he's an investor in DevTools companies. I work in developer experience at DevRel in DevTools companies. Yep. Everyone wants to own the standard. Yeah. I'm sure you guys have tried to launch your own standards. Actually, is HubSpot known for a standard? You know, obviously inbound marketing. But is there a standard or protocol that you ever tried to push? No.
Dharmesh [00:32:30]: And there's a reason for this. Yeah. Is that? And I don't mean, need to mean, speak for the people of HubSpot, but I personally. You kind of do. I'm not smart enough. That's not the, like, I think I have a. You're smart. Not enough for that. I'm much better off understanding the standards that are out there. And I'm more on the composability side. Let's, like, take the pieces of technology that exist out there, combine them in creative, unique ways. And I like to consume standards. I don't like to, and that's not that I don't like to create them. I just don't think I have the, both the raw wattage or the credibility. It's like, okay, well, who the heck is Dharmesh, and why should we adopt a standard he created?
swyx [00:33:07]: Yeah, I mean, there are people who don't monetize standards, like OpenTelemetry is a big standard, and LightStep never capitalized on that.
Dharmesh [00:33:15]: So, okay, so if I were to do a standard, there's two things that have been in my head in the past. I was one around, a very, very basic one around, I don't even have the domain, I have a domain for everything, for open marketing. Because the issue we had in HubSpot grew up in the marketing space. There we go. There was no standard around data formats and things like that. It doesn't go anywhere. But the other one, and I did not mean to go here, but I'm going to go here. It's called OpenGraph. I know the term was already taken, but it hasn't been used for like 15 years now for its original purpose. But what I think should exist in the world is right now, our information, all of us, nodes are in the social graph at Meta or the professional graph at LinkedIn. Both of which are actually relatively closed in actually very annoying ways. Like very, very closed, right? Especially LinkedIn. Especially LinkedIn. I personally believe that if it's my data, and if I would get utility out of it being open, I should be able to make my data open or publish it in whatever forms that I choose, as long as I have control over it as opt-in. So the idea is around OpenGraph that says, here's a standard, here's a way to publish it. I should be able to go to OpenGraph.org slash Dharmesh dot JSON and get it back. And it's like, here's your stuff, right? And I can choose along the way and people can write to it and I can approve. And there can be an entire system. And if I were to do that, I would do it as a... Like a public benefit, non-profit-y kind of thing, as this is a contribution to society. I wouldn't try to commercialize that. Have you looked at ATProto? What's that? ATProto.
swyx [00:34:43]: It's the protocol behind Blue Sky. Okay. My good friend, Dan Abramov, who was the face of React for many, many years, now works there. And he actually did a talk that I can send you, which basically kind of tries to articulate what you just said. But he does, he loves doing these like really great analogies, which I think you'll like. Like, you know, a lot of our data is behind a handle, behind a domain. Yep. So he's like, all right, what if we flip that? What if it was like our handle and then the domain? Yep. So, and that's really like your data should belong to you. Yep. And I should not have to wait 30 days for my Twitter data to export. Yep.
Dharmesh [00:35:19]: you should be able to at least be able to automate it or do like, yes, I should be able to plug it into an agentic thing. Yeah. Yes. I think we're... Because so much of our data is... Locked up. I think the trick here isn't that standard. It is getting the normies to care.
swyx [00:35:37]: Yeah. Because normies don't care.
Dharmesh [00:35:38]: That's true. But building on that, normies don't care. So, you know, privacy is a really hot topic and an easy word to use, but it's not a binary thing. Like there are use cases where, and we make these choices all the time, that I will trade, not all privacy, but I will trade some privacy for some productivity gain or some benefit to me that says, oh, I don't care about that particular data being online if it gives me this in return, or I don't mind sharing this information with this company.
Alessio [00:36:02]: If I'm getting, you know, this in return, but that sort of should be my option. I think now with computer use, you can actually automate some of the exports. Yes. Like something we've been doing internally is like everybody exports their LinkedIn connections. Yep. And then internally, we kind of merge them together to see how we can connect our companies to customers or things like that.
Dharmesh [00:36:21]: And not to pick on LinkedIn, but since we're talking about it, but they feel strongly enough on the, you know, do not take LinkedIn data that they will block even browser use kind of things or whatever. They go to great, great lengths, even to see patterns of usage. And it says, oh, there's no way you could have, you know, gotten that particular thing or whatever without, and it's, so it's, there's...
swyx [00:36:42]: Wasn't there a Supreme Court case that they lost? Yeah.
Dharmesh [00:36:45]: So the one they lost was around someone that was scraping public data that was on the public internet. And that particular company had not signed any terms of service or whatever. It's like, oh, I'm just taking data that's on, there was no, and so that's why they won. But now, you know, the question is around, can LinkedIn... I think they can. Like, when you use, as a user, you use LinkedIn, you are signing up for their terms of service. And if they say, well, this kind of use of your LinkedIn account that violates our terms of service, they can shut your account down, right? They can. And they, yeah, so, you know, we don't need to make this a discussion. By the way, I love the company, don't get me wrong. I'm an avid user of the product. You know, I've got... Yeah, I mean, you've got over a million followers on LinkedIn, I think. Yeah, I do. And I've known people there for a long, long time, right? And I have lots of respect. And I understand even where the mindset originally came from of this kind of members-first approach to, you know, a privacy-first. I sort of get that. But sometimes you sort of have to wonder, it's like, okay, well, that was 15, 20 years ago. There's likely some controlled ways to expose some data on some member's behalf and not just completely be a binary. It's like, no, thou shalt not have the data.
swyx [00:37:54]: Well, just pay for sales navigator.
Alessio [00:37:57]: Before we move to the next layer of abstraction, anything else on MCP you mentioned? Let's move back and then I'll tie it back to MCPs.
Dharmesh [00:38:05]: So I think the... Open this with agent. Okay, so I'll start with... Here's my kind of running thesis, is that as AI and agents evolve, which they're doing very, very quickly, we're going to look at them more and more. I don't like to anthropomorphize. We'll talk about why this is not that. Less as just like raw tools and more like teammates. They'll still be software. They should self-disclose as being software. I'm totally cool with that. But I think what's going to happen is that in the same way you might collaborate with a team member on Slack or Teams or whatever you use, you can imagine a series of agents that do specific things just like a team member might do, that you can delegate things to. You can collaborate. You can say, hey, can you take a look at this? Can you proofread that? Can you try this? You can... Whatever it happens to be. So I think it is... I will go so far as to say it's inevitable that we're going to have hybrid teams someday. And what I mean by hybrid teams... So back in the day, hybrid teams were, oh, well, you have some full-time employees and some contractors. Then it was like hybrid teams are some people that are in the office and some that are remote. That's the kind of form of hybrid. The next form of hybrid is like the carbon-based life forms and agents and AI and some form of software. So let's say we temporarily stipulate that I'm right about that over some time horizon that eventually we're going to have these kind of digitally hybrid teams. So if that's true, then the question you sort of ask yourself is that then what needs to exist in order for us to get the full value of that new model? It's like, okay, well... You sort of need to... It's like, okay, well, how do I... If I'm building a digital team, like, how do I... Just in the same way, if I'm interviewing for an engineer or a designer or a PM, whatever, it's like, well, that's why we have professional networks, right? It's like, oh, they have a presence on likely LinkedIn. I can go through that semi-structured, structured form, and I can see the experience of whatever, you know, self-disclosed. But, okay, well, agents are going to need that someday. And so I'm like, okay, well, this seems like a thread that's worth pulling on. That says, okay. So I... So agent.ai is out there. And it's LinkedIn for agents. It's LinkedIn for agents. It's a professional network for agents. And the more I pull on that thread, it's like, okay, well, if that's true, like, what happens, right? It's like, oh, well, they have a profile just like anyone else, just like a human would. It's going to be a graph underneath, just like a professional network would be. It's just that... And you can have its, you know, connections and follows, and agents should be able to post. That's maybe how they do release notes. Like, oh, I have this new version. Whatever they decide to post, it should just be able to... Behave as a node on the network of a professional network. As it turns out, the more I think about that and pull on that thread, the more and more things, like, start to make sense to me. So it may be more than just a pure professional network. So my original thought was, okay, well, it's a professional network and agents as they exist out there, which I think there's going to be more and more of, will kind of exist on this network and have the profile. But then, and this is always dangerous, I'm like, okay, I want to see a world where thousands of agents are out there in order for the... 
Because those digital employees, the digital workers don't exist yet in any meaningful way. And so then I'm like, oh, can I make that easier for, like... And so I have, as one does, it's like, oh, I'll build a low-code platform for building agents. How hard could that be, right? Like, very hard, as it turns out. But it's been fun. So now, agent.ai has 1.3 million users. 3,000 people have actually, you know, built some variation of an agent, sometimes just for their own personal productivity. About 1,000 of which have been published. And the reason this comes back to MCP for me, so imagine that and other networks, since I know agent.ai. So right now, we have an MCP server for agent.ai that exposes all the internally built agents that we have that do, like, super useful things. Like, you know, I have access to a Twitter API that I can subsidize the cost. And I can say, you know, if you're looking to build something for social media, these kinds of things, with a single API key, and it's all completely free right now, I'm funding it. That's a useful way for it to work. And then we have a developer to say, oh, I have this idea. I don't have to worry about open AI. I don't have to worry about, now, you know, this particular model is better. It has access to all the models with one key. And we proxy it kind of behind the scenes. And then expose it. So then we get this kind of community effect, right? That says, oh, well, someone else may have built an agent to do X. Like, I have an agent right now that I built for myself to do domain valuation for website domains because I'm obsessed with domains, right? And, like, there's no efficient market for domains. There's no Zillow for domains right now that tells you, oh, here are what houses in your neighborhood sold for. It's like, well, why doesn't that exist? We should be able to solve that problem. And, yes, you're still guessing. Fine. There should be some simple heuristic. So I built that. It's like, okay, well, let me go look for past transactions. You say, okay, I'm going to type in agent.ai, agent.com, whatever domain. What's it actually worth? I'm looking at buying it. It can go and say, oh, which is what it does. It's like, I'm going to go look at are there any published domain transactions recently that are similar, either use the same word, same top-level domain, whatever it is. And it comes back with an approximate value, and it comes back with its kind of rationale for why it picked the value and comparable transactions. Oh, by the way, this domain sold for published. Okay. So that agent now, let's say, existed on the web, on agent.ai. Then imagine someone else says, oh, you know, I want to build a brand-building agent for startups and entrepreneurs to come up with names for their startup. Like a common problem, every startup is like, ah, I don't know what to call it. And so they type in five random words that kind of define whatever their startup is. And you can do all manner of things, one of which is like, oh, well, I need to find the domain for it. What are possible choices? Now it's like, okay, well, it would be nice to know if there's an aftermarket price for it, if it's listed for sale. Awesome. Then imagine calling this valuation agent. It's like, okay, well, I want to find where the arbitrage is, where the agent valuation tool says this thing is worth $25,000. It's listed on GoDaddy for $5,000. It's close enough. Let's go do that. Right? And that's a kind of composition use case that in my future state. 
Thousands of agents on the network, all discoverable through something like MCP. And then you as a developer of agents have access to all these Lego building blocks based on what you're trying to solve. Then you blend in orchestration, which is getting better and better with the reasoning models now; just describe the problem that you have. Now, the next layer that we're all contending with is: how many tools can you actually give an LLM before the LLM breaks? That number used to be like 15 or 20 before results started to vary dramatically. And so that's the thing I'm thinking about now. If I want to expose 1,000 of these agents to a given LLM, obviously I can't give it all 1,000. Is there some intermediate layer that says, based on your prompt, I'm going to make a best guess at which agents might be able to be helpful for this particular thing? Yeah.
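A minimal sketch of that intermediate selection layer: retrieve a shortlist of agent descriptions that match the prompt before handing any tools to the model. The registry and the bag-of-words scoring here are stand-ins purely to show the shape of the idea; a real version would use embeddings over the MCP tool listings.

```python
import math
from collections import Counter

# Hypothetical registry of agent descriptions exposed via MCP (names are made up).
AGENTS = {
    "domain-valuation": "Estimate what a website domain is worth from comparable sales.",
    "brand-namer": "Suggest startup names and check domain availability.",
    "transcript-fetcher": "Pull a timestamped transcript from a YouTube video.",
}

def _vec(text: str) -> Counter:
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def shortlist_tools(prompt: str, k: int = 2) -> list[str]:
    """Pick the k agents whose descriptions best match the prompt,
    so the LLM only ever sees a handful of tools instead of 1,000."""
    scores = {name: _cosine(_vec(prompt), _vec(desc)) for name, desc in AGENTS.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(shortlist_tools("what is agent.com worth? I'm thinking of buying the domain"))
```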
Alessio [00:44:37]: Yeah, like RAG for tools. Yep. I did build the Latent Space Researcher on agent.ai. Okay. Nice. Yeah, that seems like, you know, then there's going to be a Latent Space Scheduler. And then once I schedule a research, you know, and you build all of these things. By the way, my apologies for the user experience. You realize I'm an engineer. It's pretty good.
swyx [00:44:56]: I think it's a normie-friendly thing. Yeah. That's your magic. HubSpot does the same thing.
Alessio [00:45:01]: Yeah, just to like quickly run through it. You can basically create all these different steps. And these steps are like, you know, static versus like variable-driven things. How did you decide between this kind of like low-code-ish versus doing, you know, low-code with code backend versus like not exposing that at all? Any fun design decisions? Yeah. And this is, I think...
Dharmesh [00:45:22]: I think lots of people are likely sitting in exactly my position right now, coming to this choice between deterministic and non-deterministic. If you're in a business building some sort of agentic thing, do you do a deterministic thing? Or do you go non-deterministic and just let the LLM handle it, right, with the reasoning models? The original idea, and the reason I took the low-code, stepwise, very deterministic approach: A, the reasoning models did not exist at that time. That's thing number one. Thing number two is, if you know in your head what the actual steps are to accomplish whatever goal, why would you leave that to chance? There's literally no upside. Just tell me what steps you need executed. So right now what I'm playing with... one thing we haven't talked about yet, and people don't talk enough about it, is UI and agents. I know some people have. Right now, the primary interaction model is the chatbot back and forth. Fine. I get that. But I think we're going to move to a blend: some of those interactions are going to be synchronous as they are now, but some are going to be async. It's just going to put it in a queue. And this goes back to my... man, I talk fast. I only have one other speed; it's even faster. So back to my, oh, we're going to have these hybrid digital teams. You would not go to a co-worker and say, I'm going to ask you to do this thing, and then sit there and wait for them to go do it. That's not how the world works. So it's nice to be able to just hand something off to someone. It's like, okay, well, maybe I expect a response in an hour or a day or something like that.
Dharmesh [00:46:52]: In terms of when things need to happen. So, the UI around agents. If you look at the output of agent.ai agents right now, they are the simplest possible manifestation of a UI. We have inputs of, like, four different types: we've got a dropdown, we've got multi-select, all the things. It's like the original HTML 1.0 days, right? The smallest possible set of primitives for a UI. And it just says, okay, we need to collect some information from the user, then we go do steps and do things, and generate some output in HTML or Markdown, which are the two primary examples. So the thing I've been asking myself is, if I keep going down that path... people ask me, I get requests all the time: oh, the UI is sort of boring, I need to be able to do this, right? And if I keep pulling on that, it's like, okay, well, now I've built an entire UI builder thing. Where does this end? And so I think the right answer, and this is what I'm going to be back to coding once I get done here, is around injecting code generation, UI generation, into the agent.ai flow. As a builder, you're like, okay, I'm going to describe the thing that I want, much like you would do in a vibe coding world. But instead of generating the entire app, it's going to generate the UI that exists at some point in that deterministic flow or something like that. It says, oh, here's the thing I'm trying to do, go generate the UI for me. And I can go through some iterations: generate the code, tweak it, go through this kind of prompt cycle, like we do with vibe coding now. And at some point, I'm going to be happy with it, and I'm going to hit save. And that's going to become the action in that particular step. It's like a caching of the generated code, so I don't incur any inference-time cost later. It's just the actual code at that point.
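A minimal sketch of the "generate the UI, iterate, then hit save" idea. The cache location and the generate_ui callable are assumptions; the actual codegen call is left abstract, since the point is only that a pinned step serves cached code with no model call at runtime.

```python
import hashlib
import pathlib
from typing import Callable

CACHE_DIR = pathlib.Path("ui_cache")  # assumption: where pinned UI snippets live

def ui_for_step(step_description: str, generate_ui: Callable[[str], str]) -> str:
    """Return the UI code for a flow step.

    The first time a builder is happy with the generated UI and hits 'save',
    the code is written to disk; afterwards the cached file is served, so no
    inference cost (or latency) is paid when the agent actually runs.
    """
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(step_description.encode()).hexdigest()[:16]
    cached = CACHE_DIR / f"{key}.html"
    if cached.exists():
        return cached.read_text()          # pinned: just serve the code
    code = generate_ui(step_description)   # e.g. a vibe-coding style LLM call
    cached.write_text(code)                # 'hit save' == pin this exact code
    return code
```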
Alessio [00:48:29]: Yeah, I invested in a company called E2B, which does code sandbox. And they powered the LM arena web arena. So it's basically the, just like you do LMS, like text to text, they do the same for like UI generation. So if you're asking a model, how do you do it? But yeah, I think that's kind of where.
Dharmesh [00:48:45]: That's the thing I'm really fascinated by. So the early LLMs, you know, were understandably, but laughably, bad at simple arithmetic, right? That's the thing my wife, normies, would ask us: you call this AI? My son would be like, it's just stupid, it can't even do simple arithmetic. And then we've discovered over time, and there's a reason for this, right? The word language is in there for a reason in terms of what it's been trained on. It's not meant to do math. But now the fact that it has access to a Python interpreter that it can actually call at runtime solves an entire body of problems that it wasn't trained to do. And it's basically a form of delegation. And so the thought that's kind of rattling around in my head is that that's great. It took the arithmetic problem first; now anything that's solvable through a relatively concrete Python program, it's able to do a bunch of things it couldn't do before. Can we get to the same place with UI? I don't know what the future of UI looks like in an agentic AI world, but maybe let the LLM handle it, but not in the classic sense. Maybe it generates it on the fly, or maybe we go through some iterations and hit cache or something like that, so it's a little bit more predictable. I don't know, but yeah.
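As a concrete illustration of that delegation pattern, here is a tiny, sandboxed arithmetic tool an LLM could call instead of guessing the answer token by token. How the tool gets registered with a model is framework-specific and omitted; this only shows the delegate-don't-compute idea.

```python
import ast
import operator

# Only +, -, *, /, ** and unary minus are allowed; everything else is rejected.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
        ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg}

def calc(expression: str) -> float:
    """Safely evaluate a plain arithmetic expression string."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp):
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp):
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)

# An agent framework would expose this as a tool; the model emits a call like
# calc("1234 * 5678") rather than doing the multiplication itself.
print(calc("1234 * 5678"))
```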
Alessio [00:49:48]: And especially when is the human supposed to intervene? So, especially if you're composing them, most of them should not have a UI because then they're just web hooking to somewhere else. I just want to touch back. I don't know if you have more comments on this.
swyx [00:50:01]: I was just going to ask when you, you said you got, you're going to go back to code. What are you coding with? What's your stack? Yep.
Dharmesh [00:50:06]: Uh, so Python's my language. Uh, I'm glad that it won in terms of the AI, uh, languages, lingua franca.
swyx [00:50:12]: It's the second best language for everything.
Dharmesh [00:50:13]: And by the way, that's exactly one of the things I disagree with Bret Taylor on, from when he was on, and just generally, I'm a massive Bret Taylor fan. Smart, one of my favorite people in tech. There was a segment where he was talking about how we need a different language than Python, something built for AI. It's like, no, Bret, I don't think we do, actually. Python is just fine, just expressive enough. And it's nice to have a language that we can use as a common denominator across both humans and AI. It doesn't slow the AI down enough to matter, but it does make it awfully useful for us to also be able to participate in that kind of future world, so that we can still be somewhat useful.
swyx [00:50:53]: I mean, but yeah, so it's, uh, Python, uh, cursor as my, uh, kind of code gen thing. Yeah. I would also mention that I really like your code generation thing. I have another thesis I haven't written up yet about how generative UI has kind of not fulfilled its full potential. We've seen the bolts and lovables and those are great. And then Vercel has a version of generative UI that is basically function calling pre-made components. And there's some. Thing in between where you should be able to generate the UI that you want and pin it and stick to it. And that becomes your form or yeah. And so the way I put it is, um, you know, I think that the two form factors of agents that I've seen a lot of product market fit recently has been deep research and the AI builders, like the bolt lovables. I think there's some version of this where you generate the UI, but you sort of generate the Mad Libs fill in the blanks forms, and then you, you, you keep that stable. And the deep research is. Just fills that in. Yeah. Yep. And that's it. I like that.
Dharmesh [00:51:49]: Yeah. Um, so I, I, I love those, uh, kind of simple, uh, simple limitations and kind of abstractions, but then if you look at the kind of, I'll say almost like the polar opposite of that. So, so right now, most of the UIs that you and I think about or conceive, or even examples are based on the primitives and the vocabulary that we have for UI right now. It's like, oh, we have text boxes. We have check boxes. We have radio buttons. We have pulldowns. We have nav. We have clicks, touches, swipes, now voice, whatever it is, the set of primitives that exist right now, we will combine them in, uh, in interesting ways, but where I think AI is going to be headed on, I think on the UI front is the same place is headed on the science front that originally it's like, oh, well, based on the things that we know right now, it'll sort of combine them, but we're like right at the cusp of it being able to actual novel research. So maybe a future version of AI comes up with a new set of primitives that actually work better for human computer interaction than things that we've done in the past, right? It's like, I don't. I don't think it's, it ended with the, uh, the checkbox, radio button and dropdown list. Right. I think there's life beyond that.
Alessio [00:52:44]: Yeah, I know we're going to move to business models after, but when you talked about hybrid teams, one way we talk to folks about it is: you had offshoring, then onshoring, which is moving to a cheaper place in the country rather than offshore. Now it's like AI-shoring. Yep. You're kind of moving some roles. That's the thing people say. Yeah. AI-shoring. Yeah.
Dharmesh [00:53:01]: That's the first time I've ever heard of that. Yeah. Yeah.
Alessio [00:53:04]: I don't know, man. But I think to me, the most interesting thing about the professional networks is like with people, you have limited availability to evaluate a person. Yeah. So you have to use previous signal as kind of like a evaluation thing. With agents, theoretically, you can have kind of like proof of work. Yeah. You know, you can run simulations and like evaluate them in that way. Yep. How do you think about that when running, building agent.ai even? It's like, you know, instead of just choosing one, I could like literally just run across all of them and figure out which one is going to work best.
Dharmesh [00:53:32]: I'm a big believer in this. So under the covers, when you build, because the primitives are so simple, you have some sort of inputs, and we know what the variables are. Every agent that's on agent.ai automatically has a REST API that's callable in exactly the way you would expect, and automatically shows up in the MCP server, so you're able to invoke it in whatever form you decide to. And so my expectation is that in this future state, whether it's a human hiring an agent to do a particular task, or evaluating a set of five agents and picking the best one for their particular use case, we should be able to do that. It's like, I just want to try it, and there should be a policy that the publisher or builder of the agent sets that says, okay, I'm going to let you call me 50 times, 100 times before you have to pay, or something like that. We should have effectively an audit trail: okay, this agent has been called this many times. We also have human ratings and reviews right now, and we have tens of thousands of reviews of the existing agents on agent.ai. The average is about 4.1 out of five stars. And all those things are nice signals to have. But the callable, verifiable kind of thing, I think, is super useful. If I can just call an API that says here are five agents that solve this particular problem for me, and I have a simple eval, I think that'd be so powerful. I wish I had that for humans, honestly. That'd be so cool.
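A minimal sketch of that "five agents, one simple eval" idea. The endpoint URLs and payload shape are hypothetical, not agent.ai's actual REST API; only the pattern matters: call each candidate with the same input, score the outputs with a caller-supplied eval, keep the best.

```python
import requests  # assumption: candidate agents are callable over plain HTTP

# Hypothetical endpoints; a real list would come from the MCP server or registry.
CANDIDATES = [
    "https://example.invalid/agents/valuation-a",
    "https://example.invalid/agents/valuation-b",
]

def run_agent(url: str, payload: dict) -> dict:
    """Invoke one agent with the shared test input and return its JSON output."""
    resp = requests.post(url, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()

def pick_best(payload: dict, eval_fn) -> str:
    """Run every candidate on the same input and keep the one that scores
    highest on the caller's eval (ground truth, rubric, human rating, etc.)."""
    scored = {url: eval_fn(run_agent(url, payload)) for url in CANDIDATES}
    return max(scored, key=scored.get)
```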
Alessio [00:54:47]: Yeah, because, I mean, when I was running engineering teams, people would try and come up with these rubrics, you know, when hiring. And it's like, they're not really helpful, but you just kind of need some ground truth. But I feel like now, say you want to hire, yeah, an AI software engineer. Yep. You can literally generate like 15. 20 examples of like your actual issues in your organization, both from a people perspective of like collaboration and like actual code generation. Yep. And just pay for it to run it. Yeah. Like today we do take home projects and we pay people. Sure. Like this should be kind of the same thing. Yeah. It's like, I'll just run you. But I feel like people are not investing in their own evals as much internally.
Dharmesh [00:55:22]: I mean, that's the present company included, right? Everyone talks about evals. Everyone accepts the fact that we should be doing more with evals. I won't say nobody, but almost nobody actually does. That's the... And yeah, it's a topic for a whole other day. I'm not...
swyx [00:55:36]: It's funny, because obviously HubSpot is famous for launching graders of things. Yes. You'd be perfect for it. Yeah. Somehow. I agree on evals, by the way. I just force myself to be the human in the loop, or someone I work with, and that's okay. But obviously the scalable thing needs to be done. Just a fun fact, or question, on agent.ai: you've already talked about the chat.com acquisition and all that. Yeah. And that was around the time of custom GPTs and the GPT Store launching. Yes. And I definitely feel agent.ai is kind of the GPT Store, but not taken seriously. Yeah. Do you feel, or fear, that OpenAI wakes up one day and they're like, agent.ai is the thing, we should just reinvest in the GPT Store?
Dharmesh [00:56:20]: I think that won't be agent.ai driven. It's an inevitability for OpenAI, and I don't have any insider information, I'm an investor but with no inside information, because it makes too much money. It makes too much sense for them not to, and they've taken multiple passes at it, right? They did the plugins back in the day, then the custom GPTs and the GPT Store, because, being the platform that they are, I think it's inevitable that they will ultimately come back to it. It's going to happen. On the list of things I promised myself I would never do is compete with Sam Altman, ever, not intentionally anyway. But here you are. But yeah, here I am.
swyx [00:56:58]: But I'm not really, right? Not really. It's free, so like, whatever. But, you know, at some point, if it's actually valuable.
Dharmesh [00:57:06]: They're solving a much, much bigger problem. I'm like a small, tiny rounding error in the universe. But the reason that compelled me to actually create it in the first place, because I knew custom GPTs existed, and I did have this rule in my head: don't compete with Sam. He's literally at the top of my list of people not to compete with. He's so good. But the thing I needed, in terms of my own personal use, which is how agent.ai got started, is that I was building a bunch of what I call solo software, things for my own personal productivity gain. And I found myself doing more and more LLM-driven stuff because it was better that way; it showed up in those solo projects a bunch. And so the thing I needed was an underlying framework to build these things on. And high on the list was, I want to be able to straddle models, because certain steps in the thing, it's like, oh, this particular step involves writing, so maybe I want to use Claude for it. Maybe I want to do the same around image generation, different types, whether it has texture, doesn't have texture, whatever. I want to be able to mix and match. And my sense is that whether it's OpenAI or Anthropic or whoever, they're likely going to have an affinity for their own models, right? Which makes sense for them. But I can sort of be, for my own purposes and for our user base, a little bit of the Switzerland. We don't think there's one model to rule them all; based on your use case, you're going to want to mix and match and maybe even swap them out. Maybe even test them, back to the eval idea: I have this agentic workflow. And here's the thing we've been playing with recently, because we have enough users now hitting the LLMs that I look at the bills and it's like, oh, I'm spending real money now. And this is just human nature, right, and not just normies: you have this dropdown of all the models, which model do you want to use in your agent.ai agent? And as it turns out, people pick the largest number. So they will pick 4.5 or whatever it is, right?
swyx [00:58:55]: Oh my God, you're doing 4.5? Yes.
Dharmesh [00:58:57]: Ouch. Yes. Yeah. But the thing I've promised myself is we will support all of them, regardless of what it costs. And once again, I see this as a research thing, a benefit to humanity, and inference costs are going down. At least that's what I tell myself late at night so I can sleep. So they pick the highest-numbered one. And so we have an option in there right now, which is the first option, that says: let the system pick for me. Auto-optimize. Yeah. As it turns out, people don't do that. They just pick the biggest one, because they don't trust it yet, which is fine. They shouldn't trust it completely. But one thing we discovered, and this is what we're testing, is that if we back-channel it, if I can just run the exact same agent that gets run a thousand times, we'll do it on our own internal agents first, and if the ratings and reviews hold, because we're getting human evals all the time on these agents, we can get a dramatic, multiple-orders-of-magnitude cost reduction by going to a lower model with literally no change in the quality of the output. Right. Which makes sense, because so many of the things we're doing don't require the most powerful model. And over-speccing is actually bad, because there's higher latency; it's not just a cost thing. So anyway, in that kind of future state, I think we're going to have model routing and a whole body of people working on that problem too. It's like, help me pick the best model at runtime. Would you buy or build model routing? I buy everything that I can buy. I don't want to build anything if I don't have to.
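A minimal sketch of that back-channel routing experiment: log human ratings per model, then route to the cheapest model whose average rating stays within a tolerance of the best. The cost table, model names, and tolerance are all made-up illustrative numbers.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical cost table (USD per 1M tokens); the figures are invented.
COST = {"big-model": 10.0, "mid-model": 1.0, "small-model": 0.1}
ratings: dict[str, list[float]] = defaultdict(list)

def record(model: str, stars: float) -> None:
    """Log a human review (e.g. 1-5 stars) for a run that used this model."""
    ratings[model].append(stars)

def route(tolerance: float = 0.1) -> str:
    """Pick the cheapest model whose average rating is within `tolerance`
    stars of the best-rated model; with no data, fall back to the priciest."""
    rated = {m: mean(r) for m, r in ratings.items() if r}
    if not rated:
        return max(COST, key=COST.get)
    best = max(rated.values())
    ok = [m for m, avg in rated.items() if best - avg <= tolerance]
    return min(ok, key=lambda m: COST.get(m, float("inf")))
```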
swyx [01:00:26]: One of the most impressive examples of this. I think was our Chai AI conversation, which I think about a lot. He views himself explicitly as a marketplace. You are kind of a marketplace, but he has a third angle, which is the model providers, and he lets them compete. And I think that sort of Chai three-way marketplace maybe makes a lot of sense. Like, I don't know why every AI company isn't built that way. It's a good point, actually.
Dharmesh [01:00:48]: Yeah, it makes sense. I have a list of things I'm super passionate about. I'm very passionate about efficient markets, or extremely irritated by inefficient markets. Efficient markets, for the normies listening, are markets where every transaction that should occur actually does. So then why do inefficient markets exist? Well, maybe the buyer and seller don't know about each other. Maybe there's not enough of a trust mechanism. Maybe there's no way to price it, to come up with fair market value. And as you knock those dominoes down, the market becomes more and more efficient. Lots of latent value exists as a result of inefficiency, and whoever removes those inefficiencies in high-value markets makes a lot of money. That's been proven time and time again. This is one of those examples: there's an inefficiency right now because we are over-using the big models, or whatever. Let's just reduce that to an efficient market. The right model should be matched up with the right use case for the right price. And then we'll... Very interesting. You ever looked into DSPy? I have looked at it. Not deeply enough, though.
swyx [01:01:48]: It's supposed to be, as far as I think, the only evals first framework. Yep. And if evals are so important. And by the way, the relationship between this and all that is DSPy would also help you optimize your models. Yep. Because you did the evals first. Yep. I wonder why it's not as popular, you know. But I mean, it is growing in traction, I would say. We're keeping an eye on it.
Alessio [01:02:09]: Let's talk about business models. Obviously, you have kind of two, work as a service and results as a service. Yep. I'm curious how you divide the two. Yeah.
Dharmesh [01:02:19]: So work as a service is... So we know about software as a service, right? So I'm licensing software that's delivered to me as a service. That's been around for decades now. So we understand that. But the consumer of that service is generally a human that's doing the actual work, whichever software you're buying. Work as a service is the software is actually doing the work, whatever that work happens to be. And so that's work as a service. So I'll come up with kind of discrete use cases, whether it's kind of classification or legal contract review or whatever the software is actually doing the thing. Results as a service is you're actually charging for the outcome, not actually the work, right? That says, okay, instead of saying, I'm going to pay you X amount of dollars to review a legal contract or this amount of time or number of uses or something like that, I'm going to actually pay you for the actual result, which is... So my take on this in the industry or the parts of the industry are super excited about this kind of results as a service or outcomes-based pricing. And I think the reason for that, I think we're over-indexing on it. And the reason we're over-indexing on it is the most popular use case on the kind of agent side right now is like customer support. Well-documented. A lot of the providers that have agents for customer support do it on a number of tickets resolved times X dollars per ticket. And the reason that that makes a lot of sense is that the customer support departments and teams sort of already have a sense for what a ticket costs to resolve through their kind of current way. And so you can come up with an approximation for A, what the kind of economic value is. There's also at least a semi-objective measure for what an acceptable value is. And that's what an acceptable resolution or outcome is, right? Like you can say, oh, well, we measured the net promoter score or CSAT for tickets or whatever. As long as the customers, 90% of the tickets were handled in a way the customer was happy. That's whatever your kind of line is. As long as the AI is able to kind of replicate that same SLA, it's like, okay, well, it's the same. They're fungible, one versus the other. I think the reason we're over-indexed, though, is that there are not that many use cases that have those two dimensions to them that are objectively measurable. And that there's a known economic value that's constant. Like, customer support tickets, because they're handled by humans, make sense. And humans have a discrete cost. And especially in retail, which is where this originally got started in B2C companies that have a high volume of customer support tickets that they're distributing across, a ticket is roughly worth the same because it takes the same amount of time for most humans to do that kind of level one, tier one support. But in other things, the value per outcome can vary dramatically, literally by orders of magnitude, in terms of what the thing is actually worth. That's kind of thing number one. Thing number two is, how do you objectively evaluate that? How do you measure? So let's say you're going to do a logo creator as a service based on results, right? And that's a completely opposite subjective thing or whatever. And so, okay, well, it may take me 100 iterations. It may take me five iterations. The quality of the output is actually not completely under my control. It's not up to the software. 
It could be you have weird taste or you didn't describe what you're looking for enough or whatever. It's like it was just not a solvable problem. Design kind of qualitative, subjective disciplines deal with this all the time. How do you make for a happy customer? There's a reason why they have, oh, we'll go through five iterations. But our output is we're going to charge you $5,000 or $500 or whatever it is for this logo. But that's hard, right, to kind of do at scale.
swyx [01:05:29]: Just a relatable anecdote. Our podcast, actually, we just got a new logo. And we did 99 designs for it. And there are so many designers who are working really hard. But I just didn't know what I wanted. So I was just too bad. You seem great, but you know.
Dharmesh [01:05:48]: that's another example of a market made efficient, right? Yeah. It's like I've been a 99designs user and customer for a dozen plus years now.
swyx [01:05:55]: It's fantastic. Yeah. So many designers, like this doesn't cost that much for them to do. It's worth a lot to us. We can't design for s**t. Totally. Yeah. Yep.
Dharmesh [01:06:04]: By the way, pro tip on 99designs is that on the margin, you're better off kind of committing to paying the designer that you're going to pick a winner. Whether you like it or not doesn't really matter. And that gets higher participation. And you're still going to get a bunch of crap that happens. You get a bunch of noise in it. But the kind of quality outcome is often a function of the number of iterations. And logo design is one of those examples. If you had to choose between 200 logos versus 20 logos, chances are closer that you're going to find something you like. Yeah.
swyx [01:06:33]: For those interested, I have a blog post on my reflections on the 99designs thing. And that's one of those. They give an estimate of how many designs you get. Yep. And I think that the modifier for like, we will pay you, we'll pay somebody and maybe it's you, is like 30 to 60. But actually it's 200. Yep. So it's underpriced. Yep.
Alessio [01:06:51]: Yep. Do you think some markets are just fundamentally going to move to more results-driven business models? Probably.
Dharmesh [01:06:59]: And I don't understand enough markets well enough to know. But if we had to sort or rank them, there's likely some dimension along which we could sort: for these kinds of businesses, is there an objective measure of truth or the outcome? Is there low variance or variability in the value? If those things are true, whatever industries they're true in, customer support is one example, there are likely lots of others. But the thing I wonder is, from the customer's perspective, would they rather pay for work as a service versus the outcome? Maybe the way they think about it is, that's sort of my arbitrage opportunity. I can get work done for X, but the value is actually Y. Why would I want that delta to be squeezed out by the provider of the software if I have a choice? I don't know. Oh, I mean, okay.
swyx [01:07:51]: Attribution. There's 18 things that go into them. You're one of them. So it's hard to tell. Yes, it is. By the way, have you seen, obviously you're in this industry, not exactly HubSpot's exact part of the market, but what have you seen in attribution that is interesting? Because that directly ties into work as a service versus results. Yeah.
Dharmesh [01:08:12]: Not enough, because as a world, as an industry, pick your thing, we are just so far behind. Yeah. This is why I think Web3, in the way that it was meant to be done, is going to make a comeback, because the fundamental principles of it make sense. I think what happened in that world was a bunch of crypto bros and grifters and NFT stuff that was loosely related, and there was no actual substance. But the idea of a blockchain, of a trackable thing, of being able to fractionalize digital assets, attribution, having an audit log, a published thing that's verifiable: all those primitives make sense, right? And maybe there's a limited, but not zero, set of use cases where the overhead, the tax for storing data on the blockchain, and there's certainly a tax to it, is worth paying. It doesn't make sense for all things, but it makes sense for some things for sure. But we just don't have attribution in any meaningful way, I don't think. Isn't it sad that it's so important and there's no answer? I know. It partly comes down to incentives. Yeah. The people that actually have the data, or the parts of the data, from which attribution could be calculated or derived don't really have the incentives to make that data available. So even something as simple as the PPC side, the Google search thing, which is sort of my world, or has been: we have less data now than we did back in the day in terms of click-throughs and things like that. Before, Google would actually send you the keywords people typed, and years ago they took that away. So it's hard to really connect the dots back on things. And we're seeing that across the board, not just PPC, but all sorts of things. They took that away from the Search Console. What's that? The Search Console has that. Yes. They took that away. Search Console has that. But your website, if you go to Google Analytics, you can connect it back to the Google Search Console. I see. Yeah. Yes. Okay.
swyx [01:10:00]: All right. Yeah. Well, it's a known thing. You don't have to make it a rant about Google.
Alessio [01:10:06]: What about software engineering? Do you think it will stay as like a work as a service? Or do you think? I think most companies hire a lot of engineers, but they don't really know what to do with them or like they don't really use them productively. Yeah. And I think now they're kind of hitting this like, you know, crisis where it's like, okay, I don't know what I will price an agent because I don't really know what my people are doing anyway. Yeah. Like, how do you think that changes?
Dharmesh [01:10:27]: I think... so, I'm actually bullish on engineers in terms of their long-term economic value. Not despite all the movement in codegen and all the things that we're already seeing, but because of it. Because as a result of AI, and people have talked about this in other disciplines too, we're going to be able to solve many more problems. The semi-math guy in me says: we always say, oh, well, now agents are going to be doing code, and so there are going to be a million virtual, digital software engineers out there, so the value per engineer is going to go down, because I as an engineer am just in that same mix. What that misses is that it's not just about the denominator; there's a numerator as well, which is the total economic value that's possible. And I would argue that's growing faster than the denominator: the actual economic value that's possible as a result of software, and of what engineers can produce with the tools they will have at hand. So I think the value of an engineer actually goes up. They're going to have the power tools, and they're going to be able to solve a larger base of problems that need to be solved. Yeah.
Alessio [01:11:29]: It feels to me like it'll stay as work as a service. You're paying for work. I don't think there's really a way to do that otherwise.
Dharmesh [01:11:34]: And there will be a set of engineers, and we see this all the time: in the media industry, you have people that are writers on staff, but then you have freelancers that write articles, or however they manifest their creative talent. And both make sense, right? There's the work for hire, and there's also the outcome-based, I-produce-this-thing model. And maybe some of those engineers actually produce agents, put them in a marketplace like agent.ai someday, and that's how they make their millions. Yeah.
Alessio [01:11:58]: Any other thoughts just on agents? We got a lot of like misc things that we want to talk to you about. Miscellaneous.
Dharmesh [01:12:03]: I think we covered a lot of territory. So I'm excited about agents. My message to the world would be: don't be scared. I know it's scary, and it's easy for me to say as a techno-optimist, but learn it. Even if you're a normie, even if you're not an engineer, even if you don't think of yourself as an AI person: use the tools. I don't care what role you have right now or where you are in the workforce. It will be useful to you. Start to get to know agents, use them, build them.
swyx [01:12:29]: And I think my message for engineers is always like, there's more to go. Like we're still in the early days of figuring out what an agent's stack looks like. Yeah. And I want to push people towards agents with memory. Yeah. Agents with planning.
Dharmesh [01:12:43]: Oh, we have to talk about memory. We got to talk about memory. Let's go. Let's do it. Because I think that's the next, in my mind, the next frontier is actual long-term memory, both for agents and then for agentic networks and a trustable, verifiable, I won't say privacy first, but privacy oriented way. I have an issue with the term privacy first, because a lot of times we say privacy first, when we don't really mean that. Privacy first means I value that above all things. It doesn't matter what we're talking about. And that's just not true, not for any human. Anything that wants to be used. So memory is an interesting thing, right? So the thing I'm working on right now, lots of things in play in agent.ai is around implementation of memory. And there are great projects out there, mem0 being one of them. But the thing that's interesting for me, right, is, and so we see this in ChatGPT and other things right now, where it does have the notion of a longer term memory. You can pull things back into context as needed. The thing I'm fascinated by is cross-agent memory. So if I'm an agent builder right now, it's like, okay, here are the things that I sort of know or I learned from the user in terms of pulling out the, I'll call them knowledge nuggets, for lack of a better term. And that's great. But then when the next agent builder comes out and it's the same user, shouldn't all the things that agent one learned about me, if it's going to be useful for agent two, as long as I opt into it, it's like, yeah, I don't care those things. In fact, I would find it awfully annoying to tell agent two and agent n and agent n plus one, all the same things I've already told it, because it should know, like the system should know. And this is part of the reason why I'm like a believer in these kind of networks of agents and shared state is that that user utility gets created as a result of having shared memory. Not just we should solve the memory problem for an independent agent, but then we should also be able to share that context, share that memory across the system. And that's part of the value prop for agent.ai is like, okay, when you're building, it's like, so we've got, you know, whatever million users and we're going to have growing memory about all of them. So instead of you going off on your own thing and building an agent out as this kind of disconnected node in the universe or whatever, here's the value for building on the network or on the platform, ours or someone else's, because there's more user value that gets created. It's more utility.
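A minimal sketch of the cross-agent, opt-in memory being described, just to pin down the shape: user-scoped "knowledge nuggets" that a second agent can only read once the user consents. Persistence, provenance, and topic-level scoping are all left out, and none of this reflects agent.ai's actual implementation.

```python
from collections import defaultdict

class SharedMemory:
    """User-scoped knowledge nuggets shared across agents on a network,
    gated by per-agent opt-in."""

    def __init__(self) -> None:
        self._nuggets: dict[str, dict[str, str]] = defaultdict(dict)
        self._optins: dict[str, set[str]] = defaultdict(set)

    def opt_in(self, user_id: str, agent_id: str) -> None:
        """User consents to this agent reading what other agents have learned."""
        self._optins[user_id].add(agent_id)

    def remember(self, user_id: str, key: str, value: str) -> None:
        """Any one agent records something it learned about the user."""
        self._nuggets[user_id][key] = value

    def recall(self, user_id: str, agent_id: str) -> dict[str, str]:
        """Agent two sees nothing unless the user opted it in."""
        if agent_id not in self._optins[user_id]:
            return {}
        return dict(self._nuggets[user_id])

mem = SharedMemory()
mem.remember("user-1", "timezone", "US/Eastern")
mem.opt_in("user-1", "scheduler-agent")
print(mem.recall("user-1", "scheduler-agent"))   # {'timezone': 'US/Eastern'}
print(mem.recall("user-1", "unknown-agent"))     # {}
```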
Alessio [01:14:59]: How do you think about auth for that? Because part of memory is selective memory. Take scheduling: if I have a scheduling agent, you should be able to access the events you're a part of, and what times I have available, but it shouldn't tell you about other events on my calendar. What does that look like?
Dharmesh [01:15:15]: I have so many thoughts on this. This is the opportunity out there, solving these fundamental things that are going to need to exist, right? Right now the closest approximation we have is OAuth, OAuth 2.0. And everyone has seen it: okay, approve. And it's a very, very coarse set of scopes, right? Based on the provider of the auth server, be it Google, HubSpot, whoever, it doesn't matter: I pick from a set of scopes, and they could have defined the scopes to be super granular, fine, but it's up to them. And that is going to move so slowly, right? So for instance, the use case I have right now: I use email for everything. I use it as an event and data bus for my life. I mean that literally: anything I do, if there's a way to get it into email, because I know it's an open protocol, I will be able to get to that data in useful ways. So I have 3 million emails that I've built a vector store off of, and that has solved my own personal use cases. But obviously I'm not going to build all my own software for everything. So if a startup comes along and says, Dharmesh, can you make your email inbox available in exchange for these things? I'm like, hell no. That's literally my everything; my life is in there. So you need to share subsets. Yes. And so, and maybe this is not the actual implementation, but imagine if someone said, okay, I have a trusted intermediary, for whatever definition of trust, that I OAuth into, and it gets to control the sharing. I can say in natural language: only pass email to this provider where the label is one of X, or it's recent, and no more than 50 emails in a day, whatever controls I want to put on it, so I don't have them dumping the entire 3-million-message backlog. It's unlikely that the OAuth server side right now, the Googles, big ones, small ones, doesn't really matter, are going to do that. But this is an opportunity for someone. They're going to need to get to some scale and build some level of trust that says, okay, I'm going to hand over the keys to this intermediary. Yeah. But then it opens up a bunch of utility, because it gives me much more fine-grained control.
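A minimal sketch of the trusted-intermediary policy he describes in words (label allow-list, recency window, daily cap). The Email and SharePolicy shapes are assumptions; a real intermediary would sit behind OAuth and enforce this server-side. The natural-language rule would compile down to something like this policy object before any agent sees a single message.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Email:
    labels: set[str]
    received: datetime
    body: str

@dataclass
class SharePolicy:
    allowed_labels: set[str]   # e.g. {"travel", "receipts"}
    max_age_days: int = 30     # only recent mail
    daily_cap: int = 50        # hard limit per day
    shared_today: int = 0

    def permits(self, email: Email, now: datetime) -> bool:
        fresh = now - email.received <= timedelta(days=self.max_age_days)
        labeled = bool(email.labels & self.allowed_labels)
        return fresh and labeled and self.shared_today < self.daily_cap

def share(emails: list[Email], policy: SharePolicy, now: datetime) -> list[Email]:
    """The intermediary hands an agent only the messages the policy allows,
    never the whole multi-million-message backlog."""
    out = []
    for e in emails:
        if policy.permits(e, now):
            policy.shared_today += 1
            out.append(e)
    return out
```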
swyx [01:17:15]: Yeah. I'd say LangChain has an interesting one. There are a bunch of people who have tried to crack AI email. Every single one of them who has tried has pivoted away. Yep. And I'm waiting for Superhuman to do it. Yep. I don't know why they haven't, but you know, at some point.
Alessio [01:17:29]: They have some cool AI stuff. Yeah. Yeah. I think the pace needs to increase, but I think this goes back to like open graph. Yeah. Right. Which is like, I think Google is not incentivized to build better scopes. Nope. And like, they're just not going to do it. Nope. So.
Dharmesh [01:17:42]: We can't even get like, we haven't been able to get semantic search out of Google for like, still. Not totally. Yeah. Just now they made the announcement this week. What do you mean? Semantic search? In Gmail. Oh, I see. Yeah. So, okay. So they have all the, they have my 3 million emails. Why don't they have a vector store where I can just like basic. Yeah. Yeah.
Dharmesh [01:18:01]: In real time.
swyx [01:18:03]: Like, I don't think my email is that big a deal, but yeah. My standard thing on memory is, it sounds like you are using Mem0. I am. There's also MemGPT, now Letta, which gave a workshop at my conference. There's Zep, which uses a graph database, kind of open source, kind of interesting. Yep. And LangMem from LangGraph, which I would highlight. Also, it's really interesting, this developing philosophy that people seem to be agreeing on, of a hierarchy of memories: from working memory to episodic memory to, I think, overall background processing. Like, we have independently reinvented that AI should sleep. Yep. To do the deep REM processing of memories. Yep. It's kind of interesting. Yep.
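A toy sketch of that hierarchy: working memory, an episodic log, and a background "sleep" pass that distills recurring observations into longer-term facts. Real systems (Mem0, Letta, Zep, LangMem) do this with LLM summarization and vector stores; the repetition counting here is only a stand-in for "decide what is worth keeping."

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    """Toy three-tier memory: working context, episodic log, consolidated facts."""
    working: list[str] = field(default_factory=list)        # current conversation window
    episodic: list[str] = field(default_factory=list)       # everything observed, in order
    semantic: dict[str, int] = field(default_factory=dict)  # observation, times seen

    def observe(self, item: str) -> None:
        self.working.append(item)
        self.episodic.append(item)

    def sleep(self) -> None:
        """Background 'REM' pass: fold the current window into long-term counts
        (a crude proxy for what is worth keeping) and clear it for the next session."""
        for item in self.working:
            self.semantic[item] = self.semantic.get(item, 0) + 1
        self.working.clear()

    def recall(self, min_count: int = 2) -> list[str]:
        """Durable facts: anything seen at least min_count times across sessions."""
        return [k for k, v in self.semantic.items() if v >= min_count]
```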
Dharmesh [01:18:43]: Yeah, that is. It's the other, I mean, just on the notion of memory and hierarchies. So, you know, I talked about the memory we're working on right now is at the user level and it's cross agent, right? Yeah. But the other kind of one step up would be, so once again, going back to this kind of hybrid digital teams. Yeah. Is that you can imagine to say, oh, well, my team has this kind of shared team. I don't want to share with the world or this set of agents across this group of people. I want to have shared state like we would have in a Slack channel or something like that. That should sort of exist as an option, right? Yeah. And the platforms should provide that.
swyx [01:19:15]: And the B folks I should also mention have mentioned that they're working on that as well. Okay. So, imagine being able to share, you know, selective conversations with people. Like, that's nice. Yeah. Yeah. VerbalLess has, I guess, voice-based shielding. I don't think they have the action. I'm an investor in that too.
Dharmesh [01:19:32]: Oh, really? Okay. Trying to think about all the things I've said. I've invested in OpenAI, Perplexity, LangGraph, CrewAI, Limitless, a bunch of them. So, if I've said anything, by the way, I have no insider knowledge. I'm not trying to plug or pitch or anything like that. No, no, no.
swyx [01:19:48]: I think it's understood. We're often... Like, you know, if you have skin in the game, you've probably invested or me or me not... I'm not an investor in B, but I'm just a friend. And I think you should be able to speak freely of your opinions regardless. Okay, we have some miscellaneous questions that may be zooming out from Agent AI. First of all, you mentioned this and I have to ask, you have so many AI projects you'll never get to. What's one or two that you want other people to work on?
Dharmesh [01:20:15]: Oh, wow.
swyx [01:20:16]: Drop some from your list.
Dharmesh [01:20:18]: Other people to work on. Because you'll never get to it. Yeah, what I need to do is I've had this thought before. So I have this is like maybe like pick one a week or something like that and give the domain away. Like I have people submit their one pager or something like that. It's like, if you can convince me that you have at least enough of an idea, enough like willingness to kind of commit to actually doing something. It's the ones that you keep mentioning, but you haven't gotten to it for whatever reason. Yep, yep. Traffic, like some of them, I don't have the underlying business model. We're going to have to come back to this, maybe do a follow-up episode. I don't, like they're just not jumping to mine. You don't need the business model, just... Yeah, so I own Scout.ai. I think that's an interesting... By the way, pretty much all of them, there was an idea at the time. It's like it was one of those late night, it's like, oh, I could do this. Is the domain available? And I'll go grab it. I'm trying to think what else I have on the AI space. I have a lot of like non-profit domain names as well for like non-profit like OpenGraph. I'm not sure why things are not jumping to my head. Yeah, I have agent.com, which obviously is tied to agent.ai.
swyx [01:21:24]: Oh, that's going to be big. That's going to be big. Oh my God. That's going to be like a 30, $50 million.
Dharmesh [01:21:29]: It's going to be big. It's going to, I think, end up being bigger than chat.com, which was 15 million.
swyx [01:21:38]: Yeah, it's more work oriented. Yep. That's interesting.
Alessio [01:21:41]: Yeah, do you want to talk about the chat.com thing? I would love just the backstories. Like, did you just call up Sam one day and be like, I got the domain? Yeah. Did they? Can I get back to you?
Dharmesh [01:21:52]: No, I'll give you, it's a good story. Back in the original ChatGPT days, the first thought I had in my head, which lots of people had, is that OpenAI is going to build a platform and ChatGPT is actually just a demo app to show off the thing. And there's been precedent for tech companies having demo apps to help normies understand the underlying technology, even after the initial boost or whatever. So my original thought was, well, someone should actually create an actual real-world product, and that product should be called chat.com, because GPT is not a consumer-friendly name at all. It's an acronym, not pretty, it doesn't roll off the tongue. And so I thought, I'll build that, since ChatGPT was just a demo app back then. So I got chat.com. And then, as it turns out, ChatGPT is a real product. And I was at an event here in San Francisco that Sam spoke at where he launched plugins, I think it was that announcement. And that's when I knew: there's no way OpenAI is going to launch plugins for ChatGPT if they were not thinking of it as an actual platform. So it's not just about the GPT APIs. This is a real thing. I'm like, crap. This violates the first rule of Dharmesh, which is don't compete with Sam. I knew when I bought the domain that there was competition for it. There were other companies looking to buy it. I don't know who they were; I had suspicions. So I bought it, and then I'm like, okay, well, I'll reach out to Sam. I was like, hey, Sam, I happen to have chat.com. I don't know if he was or wasn't in the running to acquire it or not, but I have chat.com, and I'm not looking to make a profit or whatever. If you want it, you will obviously do something much better and bigger with it. I don't want to be in the compete-with-Sam game, effectively, is what I said. And so they did want it.
swyx [01:23:38]: And yeah, we struck a deal. Looks like it's been a very good deal if the valuations are, you know, to be, to be real. Yeah. Who knows? Who knows?
Alessio [01:23:48]: It's one of those weird things. Like, yeah. Yeah. The agent.ai domain evaluator said that latent.space is worth between five and 15K. Okay.
swyx [01:23:55]: So does that feel right? Well, it's missed the, it's missing this one.
Dharmesh [01:24:00]: Does not incorporate the transactional data. I have not published that one yet. Uh, that's because it's also operationally very intensive, uh, that other one. But anyway, we, we actually had it donated by a listener, so I don't know what the real cost is, but it's missing that it's linked to an influencer by way of AI, which I've offered. I'm an investor in, in, yes, I bought that. Uh, and I've told him that like, whenever you're ready, you let me know, I'll sell it to you at cost. Uh, yeah.
swyx [01:24:25]: So, yeah, I mean, that, that is some value add since you may buy a lot of domains.
Dharmesh [01:24:29]: What are your favorite domain buying tips, apart from having a really good domain broker, which I assume you have? No, I actually don't. I do my own deals. I have a very cards-face-up approach to life. Some people would tell you, oh, well, if they know you're behind the transaction, the price is going to go up. Sure, but it's still a willing seller and a willing buyer; it doesn't mean I'm going to have to pay that price. But the upside, because I always reach out as myself when there's a domain out there, is they can look me up, and I come off as legit. Very few people are not going to return my email when I say I'm interested in a domain they may have for sale, or had not considered selling, but would you consider selling? So yeah. And some of my favorites I still own. I still own prompt.com, by the way; that could be a big one. And I owned, and this is one I don't regret, it went to a good home: I owned playground.com. The original idea behind playground.com was that, at the time, OpenAI had their playground where you can play around with the models. It's like, okay, well, there should be a platform-neutral thing, a playground across all the LLMs, and there are obviously products and startups that do that now. So that was my original idea: there should be playground.com where you can go test out all the models and play around with them, just like you can with OpenAI's GPT stuff. And then Suhail was out there with Playground, the company, and I think he reached out to me over Twitter or something like that. So we knew of each other; I've still never met him. And he asked whether I would consider selling, and that was a tough one, because I actually had the business idea already in my head. I think it's a great idea, I think it's a great domain name, and it's a really simple English word that has relevance in a whole new context now. But once again, I took equity. So look on the bright side: domains get me into deals that I would likely never have been able to get into other ways. So, yeah.
Alessio [01:26:35]: Yeah. We should securitize your GoDaddy account and just make it a fund. It's a fund.
Dharmesh [01:26:41]: It's basically a fund. Yeah. And by the way, I hope you don't use GoDaddy, by the way. I've invested, I don't know if it's public yet, in a company that's going to treat domains as a fractionalizable, tradable asset, because a domain is kind of the original NFT, in a way, right? If you can do the fractionalizing, but also just the transfer. Right now it's so painful when you buy a domain: you go through an escrow service and there's just all of this. I just want it instantaneous. Charge me in Bitcoin or credit card, whatever it is, and then I should show up and be able to reroute the DNS. That should be minutes, not weeks or days. Anyway, so.
Alessio [01:27:19]: Yeah, ENS on Ethereum is basically the same thing, but someone should bring that to normies. Yeah, exactly. They should bring it. Yeah. And ICANN and all of that is its own thing.
swyx [01:27:30]: I have a question on, you know, you keep bringing up your Sam Altman rule. One of my favorite My First Million episodes of all time was actually without you there, but talking about you. Okay. Because Sean was describing you as a fierce nerd, which I'm sure you've heard. And I think Sam also is a fierce nerd. I was listening to this Jessica Livingston podcast where she had him on and described him as a formidable person. I think you're also very formidable, and I just wonder: what makes you formidable? What makes you a fierce nerd? What keeps you this driven? Yeah.
Dharmesh [01:28:09]: Sam's fiercer and nerdier, just for the record. But I think part of it is just the strength of my conviction, I guess. I'm willing to work harder and grind it out more than people that are smarter than me, and I'm only slightly stupider than the people that are willing to work harder than me. I'm just the right mix of grind it out, work at it, stick to it for extended periods of time. If I think I'm right, I will latch on and not let go until I can prove to myself that I'm not. Even the natural language thing: it took 20 years, but eventually I got to a point where the world caught up and it became possible. And part of it, I think, is that I'm a nice guy, and sometimes they're the most dangerous kind, right? I don't make enemies or whatever. My take on competition is: I don't think of it as war. I think of them as opponents. It's a game, right? And you can use whatever analogy; I happen to play a fair amount of chess. I'm a student of the game. That's partly, I think, what makes me effective. I'm solving for the long term, so I'm kind of hard to deter. So for those of you out there looking to compete with HubSpot: I've been here 18 years, and I'm going to be here for another 18 years. Not that you shouldn't do it. It's a big market.
swyx [01:29:34]: I'm not trying to sway anyone, but yeah. Something I struggle with is this conviction. You said you pursue things to conviction, but you start out not knowing anything. Yeah. So how do you develop conviction? Do you find it along the way? Or do you stumble along the way, lose conviction, and then stop working on it? How do you keep going?
Dharmesh [01:29:57]: The way I've sort of approached it is that, um, so I don't generally tend to have conviction around a solution or a product. I have conviction around a problem, uh, that says this is an actual real problem that needs to be solved. And I may have an idea for how to be solved, uh, you know, right now, and that I may get dissuaded. It's like, ah, I'm not smart enough. Technology's not good enough. Whatever the constraints are, but it's the problem I have conviction around. It's like, oh, that problem still hasn't gone away. Uh, so like I sort of filed away in the back of my brain and I'll revisit it's like, okay, well, you know, the kind of board changes, uh, and then it changes really fast now with AI, like things that weren't possible before are now possible. So you kind of go back to your roster of things that you believe or believed and say, maybe now, uh, now is the time maybe then it wasn't the time, uh, but I'm a big believer in kind of attaching yourself. Passionately, uh, with conviction to problems that matter, um, that, and there are some that are just too highfalutin for me that I'm not going to ever be able to kind of take on. I have the humility to recognize that. Yeah.
swyx [01:30:59]: I feel like I need an updated founder's version of the serenity prayer. Like, give me the confidence to do what I think I'm capable of, but not to overestimate myself, you know? Anyway, when you say the board changes, how do you keep up on AI? A lot of YouTube, as it turns out. Yeah, a lot. Okay, Fireship. I don't know what Fireship is. It's a current meme right now. Whenever OpenAI drops something, they do these livestreams on the OpenAI channel, and the top comment is always, "I will wait for the Fireship video," because Fireship just summarizes the thing in five minutes.
Dharmesh [01:31:35]: No, I, so my kind of MO, so I, by the way, I keep very weird hours. Uh, so my average go to bedtime, uh, is roughly 2 AM. Oh boy. But I do get average seven, seven and a half hours in. Uh, I don't, I don't use alarm clocks cause I don't, I don't, uh, have meetings, uh, uh, in the morning at all, uh, or try not to at least, uh, so my late night thing is, uh, is I'll watch probably like a couple of hours of YouTube videos off in the background while I'm coding. Um,
swyx [01:32:04]: that's how you've seen our talks.
Dharmesh [01:32:06]: I have. Yeah, I've seen. Yeah. Okay.
swyx [01:32:08]: Yep.
Dharmesh [01:32:09]: And there's so much good material out there. The thing I love about YouTube, and this, by the way, in terms of use cases and agents that should exist but don't yet, even though the technology now exists to build them: take a YouTube video of a talk, say from Latent Space or the AI Engineer event, and just pull the slides out for me, because I want to put them into a deck, or do some form of distillation or translation into a different format. Oh, I see. Pull the slides out of a video. Got it. So I think that's interesting. And by the way, on the agent.ai side, one of the commonly used actions, primitives, that we have is the ability to get a transcript from a video. That seems like such a trivial thing, but if you don't know how to do it programmatically, if you're just a normie, it's like, okay, I know it's there, but how do I actually get to the transcript? And then, once you have the transcript, you can encode it and get timestamps. So if you have a use case like, I want to know exactly when this was said, or I want to create an aggregate video clip: that was the actual original agent I built for my wife. She wanted to pull multiple clips together without using video editing software, because she wanted this aggregate thing, on the nonprofit side, to send to a friend.
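As a rough illustration of the transcript primitive Dharmesh describes, here is a minimal Python sketch assuming the third-party youtube-transcript-api package; it is not agent.ai's actual implementation.

```python
# Minimal sketch of a "transcript with timestamps" primitive, assuming the third-party
# youtube-transcript-api package (pip install youtube-transcript-api). This is not
# agent.ai's actual implementation; the return format can vary by package version.
from youtube_transcript_api import YouTubeTranscriptApi


def timestamped_transcript(video_id: str) -> list[dict]:
    """Return segments of the form {'text': str, 'start': float, 'duration': float}."""
    return YouTubeTranscriptApi.get_transcript(video_id)


def find_moments(video_id: str, keyword: str) -> list[float]:
    """Start times (in seconds) of segments mentioning a keyword, e.g. to
    assemble an aggregate clip without opening a video editor."""
    return [
        seg["start"]
        for seg in timestamped_transcript(video_id)
        if keyword.lower() in seg["text"].lower()
    ]
```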
swyx [01:33:27]: Anyway, there are video understanding models that have come out from Meta, but the easiest one by far is going to be Gemini. They just launched YouTube support. Yep.
Dharmesh [01:33:36]: So they're doing good work over there. By the way, the coolest thing AI-wise recently, I'd say in the last week to ten days, has been the new image model, Gemini Flash experimental, whatever they call it, because it lets you effectively do editing. My son is doing an eighth-grade research project on AI image generation, so he's gone deep on Stable Diffusion and the algorithms and things like that. I don't know much about it, but I know enough about Stable Diffusion to know why editing is near impossible, why you can't recreate an image: you can't go back that way. It's going to be a different thing, because you're spinning the roulette wheel again the next time you try a similar prompt. So the fact that they were able to pull it off... It's still very much a v1, because one of my test cases was: take the HubSpot logo and replace the O, which is this kind of sprocket, with a donut. It will do it, but it won't size it so that it actually fits into the logo. It's like, okay. But that's where it's headed.
swyx [01:34:36]: Do you know the backstory behind that one? No. Mostly it's Mustafa, who was part of it: they had image generation in Llama 3, the lawyers didn't approve it, Mustafa quit Meta and joined Gemini, and then it shipped. And it is rumored, and that's all I can say, that they got rid of diffusion and did autoregressive image generation. I think it's been interesting, these two worlds colliding, because diffusion was really about images and autoregressive was really about language, and people were wondering how they were going to merge. On the Midjourney side, David Holz was very much betting on text diffusion being their path forward, but it seems like the autoregressive, next-token paradigm is the one winning out.
Dharmesh [01:35:17]: Suhail and Playground are doing exceptional work in that domain. I don't know if it's autoregressive, but it's around image editing, not just text-to-image: actually building a UI, like a Photoshop kind of thing, for generating images rather than just prompting with text. It's fascinating.
swyx [01:35:32]: I just thought diffusion was kind of dead. There wasn't that much going on; it was just bigger models, higher detail. And now autoregressive has come along and the whole field is open. Yeah. And I think if there was any real threat to Photoshop or Canva, it's this thing. Yeah.
Alessio [01:35:47]: Just to wrap up the conversation: you have a great post called Sorry, Must Pass, which, if I did the math right, you first wrote in 2007, and then you updated it post-COVID; you mentioned you made a lot of changes to your schedule and your life based on the pandemic. How do you make decisions today? Has anything changed? You updated this in 2022, and now we're roughly five years removed from COVID and all of that. I'm curious if you've made any changes. Yeah.
Dharmesh [01:36:17]: So that post, Sorry, Must Pass... the issue was that my schedule, and my life, just got overwhelmed. Too many dots and connections. I love interacting with new people online, I love ideas, I love startups. But as it turns out, every time you say yes to anything, you are by definition saying no to something else. Despite my best attempts to change the laws of the universe, I have not been able to do that. So that post was a reaction to that, because when I did say no, I would feel this guilt. Someone would ask, can you spend 15 minutes and just review this startup idea? And sometimes it would be someone second-degree removed, an intro through a friend or something like that. And I felt real guilt. So this was a very honest, vulnerable, here's-what's-going-on-in-my-life post: this is not a judgment on you at all, whatever your project or thing you're working on, but I have come to the realization that I just can't do it. So I'm sorry, but my default right now, and lots of people will disagree with this default position, is that I have to pass. Unless, and Derek Sivers said this really well, it's either a hell yes or it's a no, right? And there's going to be a limited number of hell yeses that I can fit in. Of all the blog posts I've ever written, that one has been the most useful for me. And I still send it out personally; I don't automate my email responses at all yet, and I don't do automated social media posts. But yeah, that one's been very useful. So I encourage everyone, wherever your line happens to be: lots of people have this guilt issue, and guilt is one of the most unproductive emotions in human psychology. No good comes from guilt, not really, unless you're a sociopath or something and maybe you need it. Anyway, you don't need more guilt.
swyx [01:38:14]: I would also say, I would just encourage people to blog more, because a lot of the time people want to pick your brain and then they ask you the same five questions everyone else has asked. So if you've blogged it, you can just point them there.
Dharmesh [01:38:26]: So one of the things I'm working on, and there are startups working on this as well, but I started before then, is a Dharmesh.ai that just captures that. And it's interesting: that's one of the agents on agent.ai, on the underlying platform. Oh, there's a Dharmesh.ai? It's out there, it's Dharmesh.ai. Yeah. Nice. It's pure text for now, no video, no audio. But the thing I've found useful is figuring out how I give it knowledge. So I have a private email address, because a lot of the interactions I have... if I do answer questions... by the way, I don't do any phone calls at all. No Zooms? I'll get on Zooms with teams, but no one-on-one meetings; it just doesn't scale. So I've moved as much as possible to an async world. As long as I can control the schedule, I will take 20 minutes and write a thoughtful response, but I reserve the right, anonymously and with no attribution, to share that either with my model or with the world through a blog post or something. And it's been useful, because now that I have that email backlog, I can go back and say, okay, I'm going to try to answer this question, go through the vector store, and it's shockingly good. And I'm still irritated that Gmail doesn't do that out of the box. They're Google. I think it's got to be coming; I think the giant has finally been woken up, and it's gotten faster now.
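The email-backlog-plus-vector-store idea is easy to picture in code. A rough sketch, assuming OpenAI's embeddings and chat APIs; the helper names, models, and prompts are illustrative, not Dharmesh.ai's actual setup.

```python
# Rough sketch of "answer new questions from an email backlog via a vector store."
# Assumes the openai and numpy packages; model names, prompts, and helper names are
# illustrative, not the actual Dharmesh.ai implementation (which is not public).
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])


def answer_from_backlog(question: str, past_answers: list[str]) -> str:
    # Retrieve the most similar past answers by cosine similarity.
    docs = embed(past_answers)
    q = embed([question])[0]
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
    context = "\n\n".join(past_answers[i] for i in sims.argsort()[-3:][::-1])
    # Draft a reply grounded only in that retrieved context.
    chat = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer in my voice, using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return chat.choices[0].message.content
```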
swyx [01:39:45]: You know, it's one of the biggest giants in the world ever. Yeah. So, yeah. When I first told Alessio you were one of our dream guests, I never actually expected to book you, because of, well, Sorry, Must Pass. We were just like, ah, let's send an email, and then he'll say no and we'll move on with our day. So I just have to say, we're very honored.
Dharmesh [01:40:05]: Oh, I'm just thrilled to be here. Huge fan, first-time guest. Thank you for all that you do for the community; I speak for a lot of them. You guys taught me a lot of what I think I know. So, yeah.
swyx [01:40:20]: Appreciate it. Yeah. I mean, I am explicitly inspired by HubSpot. Oh, thank you. Inbound marketing, I think, was a stroke of genius, and AI Engineering is explicitly modeled after that. You created your own industry, a subsection of an industry, that became a huge thing because you got the trend right. And that's what AI Engineering is supposed to be, if we get it right. So how do we screw this up? How do we screw AI engineering up?
Dharmesh [01:40:47]: Yeah, the common failure modes. The original thing that makes inbound marketing work, the kernel of the idea, was to solve for the customer, solve for the audience, solve for the other side, because the thing that was broken about marketing was that it was very self-centered: I have this budget, I'm going to blast you and interrupt your life and interrupt your day, because I want you to buy this thing from me. Inbound marketing was the exact opposite: use whatever limited budget you have and put something useful into the world that your target customer, whoever that happens to be, will find valuable. So the common failure mode is that you lose that. I don't think you will, but it's very, very common: it's like, now I'm just going to turn the crank and squeeze it a little bit more. The reason folks like me appreciate that community so much is that you genuinely want to create value. There's nothing wrong with making money, nothing wrong with any of that, but at the core of it, we want to lift the overall level of awareness for this group of people and create value and create goodness in the world. I think if you hold onto that, then over the fullness of time the market becomes more efficient and rewards that generosity. That's my fundamental life belief. So I think you guys are doing well. Thank you for your help and support. Yeah. My pleasure. Yeah.
Alessio [01:42:06]: And just to wrap in very Dharmesh fashion, you have a URL for the Sorry Must Pass blog, which is sorrymustpass.org. So yeah, I thought that was a good, good nugget. Um, yeah, thanks so much for coming on. Oh, thanks. Thanks for having me.
Get full access to Latent.Space at www.latent.space/subscribe
Building Snipd: The AI Podcast App for Learning
vendredi 14 mars 2025 • Durée 01:17:47
We are working with Amplify on the 2025 State of AI Engineering Survey to be presented at the AIE World’s Fair in SF! Join the survey to shape the future of AI Eng!
We first met Snipd (affiliate link! we get a free month, you get a free month. but this is not a sponsored pod, we’ve never done one) over a year ago, and were immediately impressed by the design, but were doubtful about snipping as the app’s titular behavior:
Podcast apps are enormously sticky - Spotify spent almost $1b in podcast acquisitions and exclusive content just to get an 8% bump in market share among normies.
However, after a disappointing Overcast 2.0 rewrite with no AI features in the last 3 years, I finally bit the bullet and switched to Snipd.
It’s 2025, your podcast app should be able to let you search transcripts of your podcasts. Snipd is the best implementation of this so far.
And yet they keep shipping:
What impressed us wasn’t just how this tiny team of 4 was able to bootstrap a consumer AI app against massive titans and do so well; but also how seriously they think about learning through podcasts and improving retention of knowledge over time, aka “Duolingo for podcasts”.
As an educational AI podcast, that’s a mission we can get behind.
Full Video Pod
Find us on YouTube! This was the first pod we’ve ever shot outdoors!
Show Notes
* Comparing Snipd transcription with our Bee episode
* Gustav Söderström - Background Audio
Timestamps
* [00:00:03] Takeaways from AI Engineer NYC
* [00:00:17] Weather in New York.
* [00:00:26] Swyx and Snipd.
* [00:01:01] Kevin's AI summit experience.
* [00:01:31] Zurich and AI.
* [00:03:25] SigLIP authors join OpenAI.
* [00:03:39] Zurich is very costly.
* [00:04:06] The Snipd origin story.
* [00:05:24] Introduction to machine learning.
* [00:09:28] Snipd and user knowledge extraction.
* [00:13:48] App's tech stack, Flutter, Python.
* [00:15:11] How speakers are identified.
* [00:18:29] The concept of "backgroundable" video.
* [00:29:05] Voice cloning technology.
* [00:31:03] Using AI agents.
* [00:34:32] Snipd's future is multi-modal AI.
* [00:36:37] Snipd and existing user behaviour.
* [00:42:10] The app, summary, and timestamps.
* [00:55:25] The future of AI and podcasting.
* [01:14:55] Voice AI
Transcript
swyx [00:00:03]: Hey, I'm here in New York with Kevin Ben-Smith of Snipd. Welcome.
Kevin [00:00:07]: Hi. Hi. Amazing to be here.
swyx [00:00:09]: Yeah. This is our first ever, I think, outdoors podcast recording.
Kevin [00:00:14]: It's quite a location for the first time, I have to say.
swyx [00:00:18]: I was actually unsure because, you know, it's cold. I checked the temperature; it's like one degree Celsius, but it's not that bad with the sun. No, it's quite nice. Yeah. Especially with our beautiful tea. With the tea. Yeah. Perfect. We're going to talk about Snipd. I'm a Snipd user. Apart from Twitter, it's like the number one most-used app on my phone. Nice. When I wake up in the morning, I open Snipd and see what's new. And in terms of time spent or usage on my phone, I think it's number one or number two. Nice. So I really had to talk about it, also because I think people interested in AI want to think about this: we're an AI podcast, we have to talk about the AI podcast app. But before we get there: we just finished the AI Engineer Summit and you came for the two days. How was it?
Kevin [00:01:07]: It was quite incredible. I mean, for me, the most valuable was just being in the same room with like-minded people who are building the future and who are seeing the future. You know, especially when it comes to AI agents, it's so often I have conversations with friends who are not in the AI world. And it's like so quickly it happens that you, it sounds like you're talking in science fiction. And it's just crazy talk. It was, you know, it's so refreshing to talk with so many other people who already see these things and yeah, be inspired then by them and not always feel like, like, okay, I think I'm just crazy. And like, this will never happen. It really is happening. And for me, it was very valuable. So day two, more relevant, more relevant for you than day one. Yeah. Day two. So day two was the engineering track. Yeah. That was definitely the most valuable for me. Like also as a producer. Practitioner myself, especially there were one or two talks that had to do with voice AI and AI agents with voice. Okay. So that was quite fascinating. Also spoke with the speakers afterwards. Yeah. And yeah, they were also very open and, and, you know, this, this sharing attitudes that's, I think in general, quite prevalent in the AI community. I also learned a lot, like really practical things that I can now take away with me. Yeah.
swyx [00:02:25]: I mean, on my side, I, I think I watched only like half of the talks. Cause I was running around and I think people saw me like towards the end, I was kind of collapsing. I was on the floor, like, uh, towards the end because I, I needed to get, to get a rest, but yeah, I'm excited to watch the voice AI talks myself.
Kevin [00:02:43]: Yeah. Yeah. Do that. And I mean, from my side, thanks a lot for organizing this conference for bringing everyone together. Do you have anything like this in Switzerland? The short answer is no. Um, I mean, I have to say the AI community in, especially Zurich, where. Yeah. Where we're, where we're based. Yeah. It is quite good. And it's growing, uh, especially driven by ETH, the, the technical university there and all of the big companies, they have AI teams there. Google, like Google has the biggest tech hub outside of the U S in Zurich. Yeah. Facebook is doing a lot in reality labs. Uh, Apple has a secret AI team, open AI and then SwapBit just announced that they're coming to Zurich. Yeah. Um, so there's a lot happening. Yeah.
swyx [00:03:23]: So yeah, I think the most recent notable move: the entire vision team from Google, Lucas Beyer and the other authors of SigLIP, left Google to join OpenAI, which I thought was a big move, for a whole team to move all at once at the same time. So, I've been to Zurich and it just feels expensive. It's a great city, great university, but I don't see it as a business hub. Is it a business hub? I guess it is, right?
Kevin [00:03:51]: Like it's kind of, well, historically it's, uh, it's a finance hub, finance hub. Yeah. I mean, there are some, some large banks there, right? Especially UBS, uh, the, the largest wealth manager in the world, but it's really becoming more of a tech hub now with all of the big, uh, tech companies there.
swyx [00:04:08]: I guess. Yeah. Yeah. And, but we, and research wise, it's all ETH. Yeah. There's some other things. Yeah. Yeah. Yeah.
Kevin [00:04:13]: It's all driven by ETH, and then its sister university EPFL, which is in Lausanne. They're also doing a lot, but it's really ETH. And otherwise, it's a really beautiful city; I can recommend it to anyone. Come visit Zurich, let me know, happy to show you around. And of course you have nature so close, the mountains so close, such beautiful lakes. I think that's what makes it such a livable city. Yeah.
swyx [00:04:42]: And the cost... it's not cheap, but I mean, we're in New York City right now and I paid $8 for a coffee this morning, so the coffee is cheaper in Zurich than in New York City. Okay. Okay. Let's talk about Snipd. What is Snipd? We'll get to your origin story, but first let's get a crisp answer: what is Snipd? Yeah.
Kevin [00:05:03]: I always see two definitions of Snipd, so I'll give you one really simple, straightforward one, and then a second, more nuanced one, which I think will be valuable for the rest of our conversation. The simple one is just to say: look, we're an AI-powered podcast app. If you listen to podcasts, we're providing this AI-enhanced experience. But the more nuanced perspective is that we have a very big focus on people who, like your audience, listen to podcasts to learn something new. Your audience wants to learn about AI: what's happening, what's the latest research, what's going on. We want to provide a spoken-audio platform where you can do that most effectively, and AI is basically the way we can achieve that. Yeah.
swyx [00:05:53]: Means to an end. Yeah, exactly. When you started. Was it always meant to be AI or is it, was it more about the social sharing?
Kevin [00:05:59]: So the first version that we ever released was like three and a half years ago. Okay. Yeah. So this was before ChatGPT. Before Whisper. Yeah, before Whisper. So a lot of the features that we now have in the app weren't really possible yet back then. But from the beginning we always had the focus on knowledge; that's the reason why we, in our team, listen to podcasts. We did have a bit of a different approach, though. The idea in the very beginning was... so the name is Snipd, and you can create these things we call snips, which is basically a small snippet, a clip from a podcast. And we envisioned a sort of social, TikTok-like platform where some people would listen to full episodes and snip the best parts, post those in a feed, and other users would consume this feed of snips and use it as a discovery tool, or just as a means to an end. So you would have both people who create snips and people who listen to snips. Our big hypothesis in the beginning was that it would be easy to get people to listen to these snips, but super difficult to actually get them to create them. So we focused a lot of our effort on making it as seamless and easy as possible to create a snip. Yeah.
swyx [00:07:17]: It's similar to TikTok. You need CapCut for there to be videos on TikTok. Exactly.
Kevin [00:07:23]: And so for Snipd, basically whenever you hear an amazing insight, a great moment, you can just triple-tap your headphones, and our AI saves the moment you just listened to and summarizes it to create a note. That is basically a snip. So we built all of this and launched it, and what we found was basically the exact opposite: people weren't that interested in just listening to a feed of snips from long-form podcasts, but they were creating snips like crazy. And this was definitely one of those aha moments, when we realized: hey, we should really be doubling down on knowledge and learning, helping you learn most effectively, helping you capture the knowledge you listen to and actually do something with it. Because in general we live in this world where there's so much content, and we consume and consume and consume, and it's so easy, at the end of a podcast, to just start listening to the next podcast, and five minutes later you've forgotten 90%, 99% of what you've just learned. Yeah.
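To make the snipping flow concrete, here is a minimal sketch: capture the transcript for the last stretch of audio before the playhead and summarize it into a note. The window length, model name, and prompt are assumptions, not Snipd's actual implementation.

```python
# Minimal sketch of the snip flow: on a triple tap, take the transcript segments from
# the last N seconds before the playhead and summarize them into a note. The 60-second
# window, model name, and prompt are assumptions, not Snipd's production logic.
from openai import OpenAI

client = OpenAI()


def create_snip(transcript: list[dict], playhead_s: float, window_s: float = 60.0) -> dict:
    """transcript: [{'start': float, 'text': str}, ...] with segment start times in seconds."""
    recent = [s for s in transcript if playhead_s - window_s <= s["start"] <= playhead_s]
    text = " ".join(s["text"] for s in recent)
    note = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Summarize this podcast moment as a short note:\n{text}"}],
    ).choices[0].message.content
    return {
        "start": recent[0]["start"] if recent else max(playhead_s - window_s, 0.0),
        "end": playhead_s,
        "note": note,
        "transcript": text,
    }
```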
swyx [00:08:31]: You don't know this, and most people don't know this, but this is my fourth podcast. My third podcast was a personal mixtape podcast where I manually snipped sections of podcasts that I liked, added my own commentary on top, and published them as small episodes. Nice. So those would be maybe five-to-ten-minute snips: something I thought was a good story or a good insight, plus my own commentary, published as a separate podcast. It's cool. Is that still live? It's still live, but it's not active. You can go back and find it if you're curious enough. Nice. Yeah, you'll have to show me later. It was so manual, because my process was: I hear something interesting, I note down the timestamp and the URL of the podcast. I used to use Overcast, so it would just link to the Overcast page. Then I'd put that in my note-taking app, go home, and whenever I felt like publishing, I'd take one of those things, download the MP3, clip out the section, record my intro and outro, and publish it as a podcast. But now with Snipd, I can just double-click or triple-tap.
Kevin [00:09:39]: I mean, those are very similar stories to what we hear from our users. You know, it's, it's normal that you're doing, you're doing something else while you're listening to a podcast. Yeah. A lot of our users, they're driving, they're working out, walking their dog. So in those moments when you hear something amazing, it's difficult to just write them down or, you know, you have to take out your phone. Some people take a screenshot, write down a timestamp, and then later on you have to go back and try to find it again. Of course you can't find it anymore because there's no search. There's no command F. And, um, these, these were all of the issues that, that, that we encountered also ourselves as users. And given that our background was in AI, we realized like, wait, hey, this is. This should not be the case. Like podcast apps today, they're still, they're basically repurposed music players, but we actually look at podcasts as one of the largest sources of knowledge in the world. And once you have that different angle of looking at it together with everything that AI is now enabling, you realize like, hey, this is not the way that we, that podcast apps should be. Yeah.
swyx [00:10:41]: Yeah. I agree. You mentioned something that you said your background is in AI. Well, first of all, who's the team and what do you mean your background is in AI?
Kevin [00:10:48]: Those are two very different things. I'm going to ask some questions. Yeah. Um, maybe starting with, with my backstory. Yeah. My backstory actually goes back, like, let's say 12 years ago or something like that. I moved to Zurich to study at ETH and actually I studied something completely different. I studied mathematics and economics basically with this specialization for quant finance. Same. Okay. Wow. All right. So yeah. And then as you know, all of these mathematical models for, um, asset pricing, derivative pricing, quantitative trading. And for me, the thing that, that fascinates me the most was the mathematical modeling behind it. Uh, mathematics, uh, statistics, but I was never really that passionate about the finance side of things.
swyx [00:11:32]: Oh really? Oh, okay. Yeah. I mean, we're different there.
Kevin [00:11:36]: I mean, one just, let's say symptom that I noticed now, like, like looking back during that time. Yeah. I think I never read an academic paper about the subject in my free time. And then it was towards the end of my studies. I was already working for a big bank. One of my best friends, he comes to me and says, Hey, I just took this course. You have to, you have to do this. You have to take this lecture. Okay. And I'm like, what, what, what is it about? It's called machine learning and I'm like, what, what, what kind of stupid name is that? Uh, so you sent me the slides and like over a weekend I went through all of the slides and I just, I just knew like freaking hell. Like this is it. I'm, I'm in love. Wow. Yeah. Okay. And that was then over the course of the next, I think like 12 months, I just really got into it. Started reading all about it, like reading blog posts, starting building my own models.
swyx [00:12:26]: Was this course by a famous person, a famous university? Was it like the Andrew Ng Coursera thing? No.
Kevin [00:12:31]: So this was a ETH course. So a professor at ETH. Did he teach in English by the way? Yeah. Okay.
swyx [00:12:37]: So these slides are somewhere available. Yeah. Definitely. I mean, now they're quite outdated. Yeah. Sure. Well, I think, you know, reflecting on the finance thing for a bit. So I, I was, used to be a trader, uh, sell side and buy side. I was options trader first and then I was more like a quantitative hedge fund analyst. We never really use machine learning. It was more like a little bit of statistical modeling, but really like you, you fit, you know, your regression.
Kevin [00:13:03]: No, I mean, that's, that's what it is. And, uh, or you, you solve partial differential equations and have then numerical methods to, to, to solve these. That's, that's for you. That's your degree. And that's, that's not really what you do at work. Right. Unless, well, I don't know what you do at work. In my job. No, no, we weren't solving the partial differential. Yeah.
swyx [00:13:18]: You learn all this in school and then you don't use it.
Kevin [00:13:20]: I mean, we, we, well, let's put it like that. Um, in some things, yeah, I mean, I did code algorithms that would do it, but it was basically like, it was the most basic algorithms and then you just like slightly improve them a little bit. Like you just tweak them here and there. Yeah. It wasn't like starting from scratch, like, Oh, here's this new partial differential equation. How do we know?
swyx [00:13:43]: Yeah. Yeah. I mean, that's, that's real life, right? Most, most of it's kind of boring or you're, you're using established things because they're established because, uh, they tackle the most important topics. Um, yeah. Portfolio management was more interesting for me. Um, and, uh, we, we were sort of the first to combine like social data with, with quantitative trading. And I think, uh, I think now it's very common, but, um, yeah. Anyway, then you, you went, you went deep on machine learning and then what? You quit your job? Yeah. Yeah. Wow.
Kevin [00:14:12]: I quit my job because... I mean, I started using it at the bank as well. I desperately tried to find any kind of excuse to use it here or there, but it was clear to me: no, if I want to do this, I just have to make a real cut. So I quit my job and joined an early-stage tech startup in Zurich, where I built up the AI team over five years. Wow. Yeah. We built various machine learning things for banks, from models for sales teams to identify which clients to sell which product to, and with what reasons, all the way to a lot of work with bank transactions. One of the most fun projects for me was an NLP model that would take the booking text of a transaction, like a credit card transaction, and prettify it. Because it had all of these numbers in there and abbreviations and whatnot, and sometimes you look at it like, what is this? And it would just clean it up to, I don't know, CVS. Yeah.
swyx [00:15:15]: Yeah. But I mean, would you have hallucinations?
Kevin [00:15:17]: No, no, no. The way that everything was set up, it wasn't like, it wasn't yet fully end to end generative, uh, neural network as what you would use today. Okay.
swyx [00:15:30]: Awesome. And then when did you go full time on Snipd? Yeah.
Kevin [00:15:33]: So basically that was, that was afterwards. I mean, how that started was the friend of mine who got me into machine learning, uh, him and I, uh, like he also got me interested into startups. He's had a big impact on my life. And the two of us were just a jam on, on like ideas for startups every now and then. And his background was also in AI data science. And we had a couple of ideas, but given that we were working full times, we were thinking about, uh, so we participated in Hack Zurich. That's, uh, Europe's biggest hackathon, um, or at least was at the time. And we said, Hey, this is just a weekend. Let's just try out an idea, like hack something together and see how it works. And the idea was that we'd be able to search through podcast episodes, like within a podcast. Yeah. So we did that. Long story short, uh, we managed to do it like to build something that we realized, Hey, this actually works. You can, you can find things again in podcasts. We had like a natural language search and we pitched it on stage. And we actually won the hackathon, which was cool. I mean, we, we also, I think we had a good, um, like a good, good pitch or a good example. So we, we used the famous Joe Rogan episode with Elon Musk where Elon Musk smokes a joint. Okay. Um, it's like a two and a half hour episode. So we were on stage and then we just searched for like smoking weed and it would find that exact moment. It will play it. And it just like, come on with Elon Musk, just like smoking. Oh, so it was video as well? No, it was actually completely based on audio. But we did have the video for the presentation. Yeah. Which had a, had of course an amazing effect. Yeah. Like this gave us a lot of activation energy, but it wasn't actually about winning the hackathon. Yeah. But the interesting thing that happened was after we pitched on stage, several of the other participants, like a lot of them came up to us and started saying like, Hey, can I use this? Like I have this issue. And like some also came up and told us about other problems that they have, like very adjacent to this with a podcast. Where's like, like this. Like, could, could I use this for that as well? And that was basically the, the moment where I realized, Hey, it's actually not just us who are having these issues with, with podcasts and getting to the, making the most out of this knowledge. Yeah. The other people. Yeah. That was now, I guess like four years ago or something like that. And then, yeah, we decided to quit our jobs and start, start this whole snip thing. Yeah. How big is the team now? We're just four people. Yeah. Just four people. Yeah. Like four. We're all technical. Yeah. Basically two on the, the backend side. So one of my co-founders is this person who got me into machine learning and startups. And we won the hackathon together. So we have two people for the backend side with the AI and all of the other backend things. And two for the front end side, building the app.
swyx [00:18:18]: Which is mostly Android and iOS. Yeah.
Kevin [00:18:21]: It's iOS and Android. We also have a watch app for, for Apple, but yeah, it's mostly iOS. Yeah.
swyx [00:18:27]: The watch thing was very funny, because in the Latent Space Discord, most of us have been slowly adopting Snipd. You came to me like a year ago and introduced Snipd to me, and I was like, I don't know, I'm very sticky to Overcast. And then slowly we switched. Why a watch app?
Kevin [00:18:43]: So it goes back to a lot of our users, they do something else while, while listening to a podcast, right? Yeah. And one of the, us giving them the ability to then capture this knowledge, even though they're doing something else at the same time is one of the killer features. Yeah. Maybe I can actually, maybe at some point I should maybe give a bit more of an overview of what the, all of the features that we have. Sure. So this is one of the killer features and for one big use case that people use this for is for running. Yeah. So if you're a big runner, a big jogger or cycling, like really, really cycling competitively and a lot of the people, they don't want to take their phone with them when they go running. So you load everything onto the watch. So you can download episodes. I mean, if you, if you have an Apple watch that has internet access, like with a SIM card, you can also directly stream. That's also possible. Yeah. So of course it's a, it's basically very limited to just listening and snipping. And then you can see all of your snips later on your phone. Let me tell you this error I just got.
swyx [00:19:47]: Error playing episode. Substack, the host of this podcast, does not allow this podcast to be played on an Apple watch. Yeah.
Kevin [00:19:52]: That's a very beautiful thing. So we found out that all of the podcasts hosted on Substack, you cannot play them on an Apple watch. Why is this restriction? What? Like, don't ask me. We try to reach out to Substack. We try to reach out to some of the bigger podcasters who are hosting the podcast on Substack to also let them know. Substack doesn't seem to care. This is not specific to our app. You can also check out the Apple podcast app. Yeah. It's the same problem. It's just that we actually have identified it. And we tell the user what's going on.
swyx [00:20:25]: I would say we host our podcast on Substack, but they're not very serious about their podcasting tools. I've told them before, I've been very upfront with them. So I don't feel like I'm shitting on them in any way. And it's kind of sad because otherwise it's a perfect creative platform. But the way that they treat podcasting as an afterthought, I think it's really disappointing.
Kevin [00:20:45]: Maybe given that you mentioned all these features, maybe I can give a bit of a better overview of the features that we have. Let's do that. Let's do that. So I think we're mostly in our minds. Maybe for some of the listeners.
swyx [00:20:55]: I mean, I'll tell you my version. Yeah. They can correct me, right? So first of all, I think the main job is for it to be a podcast listening app. It should be basically a complete superset of what you normally get on Overcast or Apple Podcasts or anything like that. You pull your show list from ListenNotes. How do you find shows? You've got to type in anything and you find them, right?
Kevin [00:21:18]: Yeah. We have a search engine that is powered by ListenNotes. Yeah. But I mean, in the meantime, we have a huge database of like 99% of all podcasts out there ourselves. Yeah.
swyx [00:21:27]: What I noticed: the default experience is that you do not auto-download shows. And that's one very big difference versus other apps, where if I'm subscribed to a thing, it auto-downloads and I already have the MP3 downloaded overnight. With Snipd, I have to actively put an episode onto my queue, and then it auto-downloads. Initially I didn't like that; I think I maybe told you it was a feature I didn't like, because it means I have to choose to listen to it in order to download it. It's opt-in rather than opt-out, so I opt in to every episode that I listen to. Then you open it, and it depends on whether or not you have the AI stuff enabled, but the default experience is no AI stuff enabled. You can listen to it, you can see the snips, the number of snips and where people snip during the episode, which roughly correlates to interest level, and obviously you can snip there. I think that's the default experience. I think snipping is really cool; I use it to share a lot on Discord, and we have tons of people sharing snips and stuff. Tweeting stuff is also a nice, pleasant experience. But the real features come when you actually turn on the AI stuff. The reason I got Snipd is because I got fed up with Overcast not implementing any AI features at all. Instead, they spent two years rewriting their app to be a little bit faster, and I'm like, it's 2025, I should have a podcast app with transcripts that I can search. Very, very basic thing. Overcast will basically never have it.
Kevin [00:22:49]: Yeah, I think that was a good, like, basic overview. Maybe I can add a bit to it with the AI features that we have. So one thing that we do every time a new podcast comes out, we transcribe the episode. We do speaker diarization. We identify the speaker names. Each guest, we extract a mini bio of the guest, try to find a picture of the guest online, add it. We break the podcast down into chapters, as in AI generated chapters. That one. That one's very handy. With a quick description per title and quick description per each chapter. We identify all books that get mentioned on a podcast. You can tell I don't use that one. It depends on the podcast. There are some podcasts where the guests often recommend like an amazing book. So later on, you can you can find that again.
swyx [00:23:42]: So you literally search for the word book or I just read blah, blah, blah.
Kevin [00:23:46]: No, I mean, it's all LLM based. Yeah. So basically, we have we have an LLM that goes through the entire transcript and identifies if a user mentions a book, then we use perplexity API together with various other LLM orchestration to go out there on the Internet, find everything that there is to know about the book, find the cover, find who or what the author is, get a quick description of it for the author. We then check on which other episodes the author appeared on.
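A hedged sketch of the two-stage pipeline Kevin outlines: one LLM pass to spot book mentions in the transcript, then a web-search-backed call to enrich each title. The model names, prompts, and key handling are assumptions, not Snipd's production code.

```python
# Hedged sketch of the book pipeline described here: (1) an LLM pass over the transcript
# to spot book mentions, (2) a web-search-backed enrichment call. Perplexity exposes an
# OpenAI-compatible endpoint; the model names and prompts are assumptions, not Snipd's code.
import json
from openai import OpenAI

llm = OpenAI()  # assumes OPENAI_API_KEY is set
web = OpenAI(base_url="https://api.perplexity.ai", api_key="YOUR_PERPLEXITY_KEY")


def extract_books(transcript: str) -> list[str]:
    """Pass 1: list book titles explicitly mentioned in the transcript."""
    resp = llm.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": 'Return JSON like {"books": ["title", ...]} listing every book '
                       f"explicitly mentioned in this podcast transcript:\n{transcript}",
        }],
    )
    return json.loads(resp.choices[0].message.content).get("books", [])


def enrich_book(title: str) -> str:
    """Pass 2: use a search-grounded model to find the author and a short description."""
    resp = web.chat.completions.create(
        model="sonar",  # assumption: Perplexity's current online model name
        messages=[{
            "role": "user",
            "content": f"Who wrote the book '{title}'? Give the author, a two-sentence "
                       "description, and any podcasts the author has appeared on.",
        }],
    )
    return resp.choices[0].message.content
```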
swyx [00:24:15]: Yeah, that is killer.
Kevin [00:24:17]: Because that for me, if. If there's an interesting book, the first thing I do is I actually listen to a podcast episode with a with a writer because he usually gives a really great overview already on a podcast.
swyx [00:24:28]: Sometimes the podcast is with the person as a guest. Sometimes the podcast is about the person, without them there. Do you pick up both?
Kevin [00:24:37]: So, yes, we pick up both in like our latest models. But actually what we show you in the app, the goal is to currently only show you the guest to separate that. In the future, we want to show the other things more.
swyx [00:24:47]: For what it's worth, I don't mind. Yeah, I don't think like if I like if I like somebody, I'll just learn about them regardless of whether they're there or not.
Kevin [00:24:55]: Yeah, I mean, yes and no. We have seen there are some personalities where this can break down. For example, the first version we released with this feature picked up a person much more often even if they were not a guest. The best examples for me are Sam Altman and Elon Musk: they're mentioned on every second podcast, and it would pick them up even though they're not on the episode. And if you're interested, you can go to the Elon Musk page and actually learn from those mentions. We updated our algorithms and improved that a lot, and now it's gotten much better at only picking someone up if they're a guest. And to come back to the features, two more important ones. We have the ability to chat with an episode. Of course, you can do the old style of searching through a transcript with a keyword search, but for me that's how you used to do search and knowledge extraction in the past. Old school. The AI way is basically an LLM: you can ask it, hey, when do they talk about topic X, if you're only interested in a certain part of the episode; you can ask it for a quick overview of the episode, key takeaways, or to create a note for you afterwards. So it's really very open-ended. And then finally, the snipping feature, just to reiterate: whenever you hear an amazing idea, you can triple-tap your headphones or click a button in the app, and the AI summarizes the insight you just heard and saves it, together with the original transcript and audio, in your knowledge library. I also noticed that you skip dynamic content. So dynamic content, we do not skip automatically. Oh, sorry, you detect it. We detect it, yeah. That's one of the things most people don't actually know: the way ads get inserted into most podcasts is that every time you listen to a podcast, you actually get a different audio file, and on the server a different ad is inserted into the MP3 file automatically. Yeah, based on IP. Exactly. What that means is, if we transcribe an episode and have a transcript with word-level timestamps, and you suddenly get a different audio file, all the timing is messed up, and that's a huge issue. For that, we actually had to build another algorithm that dynamically, on the fly, re-syncs the audio you're listening to with the transcript that we have. Which is a fascinating problem in and of itself.
swyx [00:27:24]: You sync by matching up the sound waves? Or like, or do you sync by matching up words like you basically do partial transcription?
Kevin [00:27:33]: We are not matching up words. It's happening on the basically a bytes level matching. Yeah. Okay.
swyx [00:27:40]: It relies on this. It relies on the exact match at some point.
Kevin [00:27:46]: So it's actually. We're actually not doing exact matches, but we're doing fuzzy matches to identify the moment. It's basically, we basically built Shazam for podcasts. Just as a little side project to solve this issue.
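For a sense of the alignment problem, here is an illustrative sketch that estimates the offset between a freshly served audio file and the originally transcribed one using a coarse fingerprint and cross-correlation; Snipd's actual byte-level fuzzy matcher is not public.

```python
# Illustrative sketch only: estimate how far a freshly served audio file is shifted
# relative to the file that was originally transcribed (e.g. because a dynamically
# inserted ad changed its length). Snipd's real matcher is a byte-level fuzzy match
# ("Shazam for podcasts") and is not public; this uses a coarse spectral-energy
# fingerprint plus cross-correlation just to show the shape of the problem.
import numpy as np


def fingerprint(samples: np.ndarray, frame: int = 2048) -> np.ndarray:
    """Cheap per-frame fingerprint: log energy of fixed-size frames."""
    n = len(samples) // frame
    frames = samples[: n * frame].astype(np.float64).reshape(n, frame)
    return np.log1p((frames ** 2).mean(axis=1))


def estimate_offset(reference: np.ndarray, served: np.ndarray,
                    frame: int = 2048, sample_rate: int = 44100) -> float:
    """Offset in seconds that best aligns the served audio to the reference audio."""
    a, b = fingerprint(reference, frame), fingerprint(served, frame)
    a, b = a - a.mean(), b - b.mean()
    corr = np.correlate(b, a, mode="full")
    lag_frames = int(corr.argmax()) - (len(a) - 1)
    return lag_frames * frame / sample_rate
```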
swyx [00:28:02]: Actually, fun fact, apparently the Shazam algorithm is open. They published the paper, it's talked about it. I haven't really dived into the paper. I thought it was kind of interesting that basically no one else has built Shazam.
Kevin [00:28:16]: Yeah, I mean, well, the one thing is the algorithm. If you now talk about Shazam, the other thing is also having the database behind it and having the user mindset that if they have this problem, they come to you, right?
swyx [00:28:29]: Yeah, I'm very interested in the tech stack. There's a big data pipeline. Could you share what is the tech stack?
Kevin [00:28:35]: What are the most interesting or challenging pieces of it? So the general tech stack: our entire backend, or 90% of it, is written in Python. Okay. We host everything on Google Cloud Platform. And our frontend is written with the Flutter framework, so it's written in Dart and then compiled natively. So we have one code base that handles both Android and iOS. You think that was a good decision? It's something a lot of people are exploring. So, up until now, yes. Look, it has its pros and cons. For example, earlier I mentioned we have an Apple Watch app; there's no Flutter for that, right? So that you build native, and then of course you have to sync these things together. I'm not the frontend engineer, so I'm just relaying this information, but our frontend engineers are very happy with it. It's enabled us to be quite fast and to be on both platforms from the very beginning. And when I talk with people and they hear that we're using Flutter, they usually think, ah, it's not performant, it's super janky and everything. And then they use our app and they're always super surprised; or if they've already used our app and I then tell them, they're like, what? So there is actually a lot that you can do with it.
swyx [00:29:51]: The danger, the concern... there are a few concerns, right? One, it's Google, so when are they going to abandon it? Two, they're optimized for Android first, so iOS is a second thought, or you can feel that it is not a native iOS app, although you guys put a lot of care into it. And then maybe three, from my point of view as a JavaScript guy, React Native was supposed to be there, and I think it hasn't really fulfilled that dream. Maybe Expo is trying to do that, but again, it does not feel as productive as Flutter. I've spent a week on Flutter and Dart, and I'm an investor in FlutterFlow, which is the low-code Flutter startup that's doing very, very well. I think a lot of people are still Flutter skeptics. Yeah. Wait, so are you moving away from Flutter?
Kevin [00:30:41]: I don't know. We don't have plans to do that. Yeah.
swyx [00:30:43]: You're just saying about that. What? Yeah. Watch out. Okay. Let's go back to the stack.
Kevin [00:30:47]: You know, that was just to give you a bit of an overview. I think the more interesting things are, of course, on the AI side. As I mentioned earlier, when we started out, it was before the ChatGPT moment, before there was the GPT-3.5 Turbo API. So in the beginning we were running everything ourselves: open-source models, trying to fine-tune them. They worked, but let's be honest, they weren't great. What were you using before Whisper for transcription? We were using wav2vec. There was a Google one, right? No, it was a Facebook one. That was actually one of the papers... when it came out, for me, it was one of the reasons I said we should try to start a startup in the audio space. Before that I had been following the NLP space quite closely, and as I mentioned, we did some of that at the startup I was working at. But wav2vec was the first paper where, at least for me, the whole transformer architecture moved over to audio; a more general way of saying it is that it was the first time I saw the transformer architecture applied to continuous data instead of discrete tokens. And it worked amazingly. The transformer architecture plus self-supervised learning, those two things moved over, and for me it was like, hey, this is now going to take off the same way the text space has taken off. With those two things in place, even if some features we want to build aren't possible yet, they will be possible in the near term on this trajectory. So that was a little side note. In the meantime, yeah, we're using Whisper. We're still hosting some of the models ourselves, for example the whole transcription and speaker diarization pipeline.
swyx [00:32:38]: You need it to be as cheap as possible.
Kevin [00:32:40]: Yeah, exactly. I mean, we're doing this at scale where we have a lot of audio.
swyx [00:32:44]: What numbers can you disclose? Just to give people an idea, because it's a lot. So we have more than a million podcasts that we've already processed. When you say a million... so processing is basically: you have some kind of list of podcasts that you auto-process, and others where a paying member can choose to press the button and transcribe it, right? Is that the rough idea? Yeah, exactly.
Kevin [00:33:08]: Yeah. And when you press that button, we also transcribe it. So first we do the transcription, then the speaker diarization, where you identify speech blocks that belong to the same speaker. This is then all orchestrated with an LLM to identify which speech block belongs to which speaker, together with the guest name and bio that, as I mentioned earlier, we identify. So all of that comes together with an LLM to actually assign speaker names to each block. And then most of the rest of the pipeline we've now migrated to LLMs. We mainly use OpenAI and Google models, so the Gemini models and the OpenAI models, and we use some Perplexity, basically for those things where we need web search. That's something I'm still hoping OpenAI in particular will also provide as an API. Oh, why? Well, basically, for us as a consumer, the more providers there are...
swyx [00:34:07]: The more downtime.
Kevin [00:34:08]: The more competition and it will lead to better, better results. And, um, lower costs over time. I don't, I don't see perplexity as expensive. If you use the web search, the price is like $5 per a thousand queries. Okay. Which is affordable. But, uh, if you compare that to just a normal LLM call, um, it's, it's, uh, much more expensive. Have you tried Exa? We've, uh, looked into it, but we haven't really tried it. Um, I mean, we, we started with perplexity and, uh, it works, it works well. And if I remember. Correctly, Exa is also a bit more expensive.
swyx [00:34:45]: I don't know. I don't know. They seem to focus on the search thing as a search API, whereas perplexity, maybe more consumer-y business that is higher, higher margin. Like I'll put it like perplexity is trying to be a product, Exa is trying to be infrastructure. Yeah. So that, that'll be my distinction there. And then the other thing I will mention is Google has a search grounding feature. Yeah. Which you, which you might want. Yeah.
Kevin [00:35:07]: Yeah. We've, uh, we've also tried that out. Um, not as good. So we, we didn't, we didn't go into. Too much detail in like really comparing it, like quality wise, because we actually already had the perplexity one and it, and it's, and it's working. Yeah. Um, I think also there, the price is actually higher than perplexity. Yeah. Really? Yeah.
swyx [00:35:26]: Google should cut their prices.
Kevin [00:35:29]: Maybe it was the same price. I don't want to say something incorrect, but it wasn't cheaper. It wasn't like compelling. And then, then there was no reason to switch. So, I mean, maybe like in general, like for us, given that we do work with a lot of content, price is actually something that we do look at. Like for us, it's not just about taking the best model for every task, but it's really getting the best, like identifying what kind of intelligence level you need and then getting the best price for that to be able to really scale this and, and provide us, um, yeah, let our users use these features with as many podcasts as possible. Yeah.
swyx [00:36:03]: I wanted to double-click on diarization. It's something that I don't think people do very well. So, you know, I'm a Bee user. I don't have it right now, and they were supposed to speak but dropped out last minute. But we've had them on the podcast before, and it's not great yet. Do you just use pyannote, the default stuff, or do you have any tricks for diarization?
Kevin [00:36:27]: So we do use the open-source packages, but we have tweaked them a bit here and there. For example, since you mentioned the Bee guys, I actually listened to that podcast episode, it was super nice, thank you. And when you started talking about speaker diarization, I just had to think about...
Kevin [00:36:49]: Is it possible? I don't know. I don't know. F**k this. Yeah, no, I don't know.
Kevin [00:36:55]: Yeah. We are the best. This is a.
swyx [00:37:07]: I don't know. This is the best. I don't know. This is the best. Yeah. Yeah. Yeah. You're doing good.
Kevin [00:37:12]: So yeah, that of course helps us. Another thing that helps us is that we know certain structural aspects of the podcast. For example, how often does someone speak? Let's say there's a one-hour episode and someone speaks for 30 seconds; that person is most probably not the guest and not the host. It's probably some speaker from an ad. So we have certain of these heuristics that we can use and leverage to improve things. And in the past, we've also changed the clustering algorithm. Basically, how a lot of speaker diarization works is you create an embedding for the speech that's happening, and then you try to cluster these embeddings and figure out: this is all one speaker, this is all another speaker. There we've also tweaked a couple of things, where we again used heuristics we could apply from knowing how podcasts function. And that's actually why I felt so much for the Bee AI guys, because for them it's probably almost impossible to use any heuristics; it can just be any situation, anything.
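As a rough illustration of the speaking-time heuristic Kevin describes, here is a minimal sketch. The segment format loosely mirrors pyannote-style output, and the 2% threshold is an illustrative assumption, not Snipd's actual code.

```python
from collections import defaultdict

def flag_probable_ad_speakers(segments, episode_seconds, min_share=0.02):
    """segments: iterable of (start, end, speaker_label) tuples in seconds."""
    talk_time = defaultdict(float)
    for start, end, speaker in segments:
        talk_time[speaker] += end - start
    # Speakers with only a tiny share of a long episode are likely ad reads,
    # not the host or guest.
    return {
        speaker
        for speaker, seconds in talk_time.items()
        if seconds / episode_seconds < min_share
    }
```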
Kevin [00:38:34]: So that's one thing that we do. Another thing is that we actually combine it with LLMs. So the transcript, the LLMs and the speaker diarization, bringing all of these together to recalibrate some of the switching points: when does one speaker stop, when does the next one start?
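For a concrete picture of the LLM orchestration Kevin mentioned earlier, assigning real names to diarized speech blocks, here is a minimal sketch. It assumes the OpenAI Python SDK with an API key in the environment; the model name, prompt, and data shapes are illustrative, not Snipd's actual pipeline.

```python
import json
from openai import OpenAI

client = OpenAI()

def assign_speaker_names(blocks, guest_name, guest_bio):
    """blocks: list of {"speaker": "SPEAKER_00", "text": "..."} from diarization."""
    prompt = (
        "You are labeling a podcast transcript. The guest is "
        f"{guest_name} ({guest_bio}). Given the diarized speech blocks below, "
        "map each anonymous speaker ID to a real name (host, guest, or 'Ad'). "
        'Return JSON like {"SPEAKER_00": "Kevin"}.\n\n'
        + json.dumps(blocks[:20])  # a sample of blocks is usually enough
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                      # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask for parseable JSON
    )
    mapping = json.loads(resp.choices[0].message.content)
    # Relabel every block with the resolved name, falling back to the raw ID.
    return [{**b, "speaker": mapping.get(b["speaker"], b["speaker"])} for b in blocks]
```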
swyx [00:38:51]: The LLMs can add errors as well. You know, I wouldn't feel safe using them to be so precise.
Kevin [00:38:58]: I mean, at the end of the day, just to not give a wrong impression, the speaker diarization that we're doing is also not perfect, right? I basically don't really notice it.
swyx [00:39:08]: Like I use it for search.
Kevin [00:39:09]: Yeah, it's not perfect yet, but it's gotten quite good. Especially if you take a recent episode and compare it to an episode that came out a year ago, we've improved it quite a bit.
swyx [00:39:23]: Well, it's beautifully presented. Oh, I love that I can click on the transcript and it goes to the timestamp. So simple, but you know, it should exist. Yeah, I agree. So I'm loading a two-hour episode of Detect Me Right Home, where there are a lot of different guests calling in, and you've identified the guest names. And these are all LLM-based. Yeah, it's really nice.
Kevin [00:39:49]: Yeah, like the speaker names.
swyx [00:39:50]: I would say that, you know, obviously I'm a power user of all these tools. You have done a better job than Descript. Okay, wow. Descript has so much funding. They had OpenAI invest in them and they still suck. So I don't know, keep going. You're doing great. Yeah, thanks. Thanks.
Kevin [00:40:12]: I would say that, especially for anyone listening who's interested in building a consumer app with AI, and especially if your background is in AI and you love working with AI, the most important thing is just to keep reminding yourself what the actual job to be done is. What does the consumer actually want? For example, you were just delighted by the ability to click on a word and jump there. This is not rocket science. You don't have to be, I don't know, Andrej Karpathy to come up with that and build it, right? And I think that's something that's super important to keep in mind.
swyx [00:40:52]: Yeah, amazing. I mean, there are so many features, right? It's so packed. There are quotes that you pick up, there's summarization. Oh, by the way, I'm going to use this as my official feature request: I want to customize how it's summarized. I want to have a custom prompt. Because your summarization is good, but I have different preferences, right?
Kevin [00:41:14]: So one thing that you can already do today, I completely get your feature request. And I think it just.
swyx [00:41:18]: I'm sure people have asked it.
Kevin [00:41:19]: Maybe just in general, how I see the future: I think everything will be personalized. This is not specific to us. And today we're still in a phase where you have to take the cost of LLMs into consideration, at least if you're working with context windows as long as ours; there are a lot of tokens in an entire podcast. So if we regenerate everything for every single user, it gets expensive. But in the future that cost will continue to go down and then it will just be personalized. That being said, you can already today go to the player screen, open up the chat, and just ask for a summary in your style.
swyx [00:42:13]: Yeah. Okay. I mean, I, I listen to consume, you know? Yeah. Yeah. I, I've never really used this feature. I don't know. I think that's, that's me being a slow adopter. No, no. I mean, that's. It has, when does the conversation start? Okay.
Kevin [00:42:26]: I mean, you can just type anything. I think what you're describing, maybe that is also an interesting topic to talk about. Basically, I told you, look, we have this chat, you can just ask for it. And this is how ChatGPT works today. But if you're building a consumer app, you have to move beyond the chat box. People do not want to always type out what they want. So your feature request, even though theoretically it's already possible, what you are actually asking for is: hey, I just want to open up the app and it should just be there, in a nicely formatted, beautiful way, such that I can read it or consume it without any issues. Interesting. And I think that's in general where a lot of the opportunities currently lie in the market: if you want to build a consumer app, taking the capability and the intelligence, but figuring out what user interface is the best way for a user to engage with this intelligence in a natural way.
swyx [00:43:24]: This is something I've been thinking about as kind of AI that's not in your face. Because right now we like to say, oh, Notion has Notion AI, and there's the little thing there. Or any other platform has the sparkle magic wand emoji, like, that's our AI feature, use this. And it's really in your face. A lot of people don't like it. It should just kind of become invisible, kind of like an invisible AI.
Kevin [00:43:49]: 100%. The way I see it, AI is the electricity of the future. We don't talk about how this microphone uses electricity, or this phone; you don't think about it that way. It's just in there, right? It's not an electricity-enabled product, it's just a product. It will be the same with AI. Right now it's still something that you use to market your product, and we do the same, because it's still something that makes people realize, ah, they're doing something new. But at some point it'll just be a podcast app and it will be normal that it has all of this AI in there.
swyx [00:44:24]: I noticed you do something interesting in your chat where you source the timestamps. Is that part of the prompt? Is there a separate pipeline that adds sources?
Kevin [00:44:33]: This is actually part of the prompt. So this is all prompt engineering; you should be able to click on it. Yeah, I clicked on it. This is all prompt engineering: how to provide the context, because we provide all of the transcript, and then getting the model to respond in a correct way with a certain format, and then rendering that on the front end. This is one of the examples where I would say it's so easy to create a quick demo of this. You can just go to ChatGPT, paste the thing in and say, do this. Fifteen minutes and you're done. But getting this to production level, so that it actually works 99% of the time, that is where the difference lies. So for this specific feature, we actually also have countless regexes that are just there to correct certain things that the LLM is doing, because it doesn't always adhere to the format correctly and then it looks super ugly on the front end. So we have certain regexes that correct that. And maybe you'd ask, why don't you use an LLM for that? Because that's, again, the AI-native way; who uses regexes anymore? But for the chat user experience, it's very important that you have streaming, because otherwise you have to wait so long until your message has arrived. So we're streaming the text live, just like ChatGPT. And if you're streaming the text and something is incorrect, it's currently not easy to just pipe this stream into another stream and get a stream back that corrects it. That would be amazing. I don't know, maybe you can answer that. Do you know of any?
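As a hypothetical illustration of the kind of regex clean-up Kevin describes, here is a small sketch. The [hh:mm:ss] citation format is an assumption for the example, not Snipd's actual markup.

```python
import re

# Matches timestamps like 34:07, 1:02:37, or [1:02:37], with optional brackets.
TIMESTAMP = re.compile(r"\[?(\d{1,2}):(\d{2})(?::(\d{2}))?\]?")

def normalize_citations(text: str) -> str:
    """Rewrite loosely formatted timestamps into a canonical [hh:mm:ss] form."""
    def fix(match: re.Match) -> str:
        h, m, s = match.groups()
        if s is None:               # model emitted mm:ss, pad to hh:mm:ss
            h, m, s = "00", h, m
        return f"[{int(h):02d}:{int(m):02d}:{int(s):02d}]"
    return TIMESTAMP.sub(fix, text)

print(normalize_citations("They discuss evals at 34:07 and again at [1:02:37]."))
# -> "They discuss evals at [00:34:07] and again at [01:02:37]."
```

In practice a correction like this would be applied chunk by chunk as the stream renders, which is exactly the awkwardness Kevin is pointing at.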
swyx [00:46:19]: There's no API that does this. Like, you cannot stream in. If you fully own the models, you can take whatever token sequence has been emitted and start loading that into the next one, but it's probably not worth it. I think most engineers who are new to AI research and benchmarking actually don't know how much regexing goes on in normal benchmarks. It's just this ugly list of a hundred different matches for some criteria that you're looking for. No, it's very cool. I think it's an example of real-world engineering. Do you have any tooling that you're proud of that you've developed for yourself?
Kevin [00:47:02]: Is it just a test script, or is it, you know? I think it's a bit more, I guess the term that has come up is vibe coding... no, sorry, that's actually something else in this case. Vibe evals was a term someone brought up in one of the talks, I think on the first day of the conference, because a lot of the talks were about evals, which is so important. For us it's a bit more vibe evals. That's also part of being a startup: we can take risks, we can take the cost of it maybe sometimes failing a little bit or being a little bit off, and our users know that and appreciate that in return, because we're moving fast and iterating and building amazing things. But at a Spotify or something like that, half of our features would probably be in a six-month review through legal or I don't know what before they could ship them.
swyx [00:48:04]: Let's just say Spotify is not very good at podcasting. I have a documented dislike for their podcast features, just overall. Really, really well integrated. Any other sort of LLM-focused engineering challenges or problems that you want to highlight?
Kevin [00:48:20]: I think it's not unique to us, but it goes again in the direction of handling the uncertainty of LLMs. For example, at the end of last year we did sort of a Snipd Wrapped, and we thought it would be fun to do something with an LLM and the snips that a user has. Three, let's say, unique LLM features: one was that we assigned a personality to you based on the snips that you have. It was all, I guess, a bit of a fun, playful thing. I'm going to look up mine. I forgot mine already.
swyx [00:48:57]: Yeah, I don't know whether it's actually still in the app. We all took screenshots of it.
Kevin [00:49:01]: Ah, we posted it in the Discord. The second one was a learning scorecard, where we identified the topics that you snipped on the most and you got a little score for that. And the third one was a quote that stood out. The quote is actually a very good example of where we would run that for a user: most of the time it was an interesting quote, but every now and then it was a super boring quote where you think, why did you select that? Come on. The solution there was actually just to say, hey, give me five. So it extracts five quotes as candidates, and then we pipe them into a different model as a judge, LLM as a judge, and there we use a much better model. Because with the initial model, as I mentioned earlier, we do have to look at the costs, since there is so much text going into it, so there we use a cheaper model. But the judge can be a really good model that then just chooses one out of five. This is a practical example.
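Here is a minimal sketch of the extract-then-judge pattern Kevin describes. The model names and prompts are assumptions for illustration, not Snipd's actual pipeline.

```python
from openai import OpenAI

client = OpenAI()

def _ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def best_quote(transcript: str) -> str:
    # Cheap model reads the full (long, expensive) transcript and proposes candidates.
    candidates = _ask(
        "gpt-4o-mini",
        "Extract the 5 most memorable, self-contained quotes from this podcast "
        f"transcript, one per line:\n\n{transcript}",
    )
    # Stronger judge sees only the short candidate list and picks one.
    return _ask(
        "gpt-4o",
        "Pick the single most insightful quote below and return it verbatim:\n\n"
        f"{candidates}",
    )
```

The economics are the point: the expensive judge only ever sees a handful of short candidates, never the full transcript.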
swyx [00:50:03]: I can't find it. Bad search in Discord. So you do recommend having a much smarter model as a judge, and that works for you. Interesting. I think this year I'm very interested in LLM-as-a-judge being developed more as a concept. For things like a snips Wrapped it's fine, you know, it's entertaining, there's no right answer.
Kevin [00:50:29]: I mean, we also use the same concept for our books feature, where we identify the mentioned books. Because there it's the same thing: 90% of the time it works perfectly out of the box, one-shot, and every now and then it just starts identifying books that were not really mentioned, or that are not books, or it starts making up books. There we basically have the same thing, another LLM challenging it. And actually with the speakers we do the same, now that I think about it. So I think it's a great technique. Interesting.
swyx [00:51:05]: You run a lot of calls.
Kevin [00:51:07]: Yeah.
swyx [00:51:08]: Okay. You know, you mentioned costs. You moved from self-hosting a lot of models to the big lab models, OpenAI and Google. No Anthropic?
Kevin [00:51:18]: No, we love Claude. In my opinion, Claude is the best one when it comes to the way it formulates things. The personality. Yeah, the personality. I actually really love it. But the cost is still high.
swyx [00:51:36]: So you tried Haiku, but you're like, you have to have Sonnet.
Kevin [00:51:40]: With Haiku we haven't experimented too much. We obviously work a lot with 3.5 Sonnet, also for coding, in Cursor, and just in general for brainstorming. We use it a lot; I think it's a great brainstorming partner. But for a lot of the things that we've built, we opted for different models.
swyx [00:52:00]: What I'm trying to drive at is how much cheaper you can get if you go from closed models to open models. Maybe it's 0% cheaper, maybe it's 5% cheaper, or maybe it's 50% cheaper. Do you have a sense?
Kevin [00:52:13]: It's very difficult to judge. I don't really have a sense, but I can give you a couple of thoughts that have gone through our minds over time. We do realize that, given we have a couple of tasks where there are just so many tokens going in, at some point it will make sense to offload some of that to an open source model. But going back to it, we're a startup, right? We're not an AI lab. For us the most important thing is to iterate fast, because we need to learn from our users and improve, and for that velocity of iteration, the closed models hosted by OpenAI and Google are just unbeatable, because it's just an API call. You don't need to worry about so much complexity behind that. That is, I would say, the biggest reason why we're not doing more in this space. But there are other thoughts, also for the future. I see two different usage patterns of LLMs. One is the pre-processing of a podcast episode, the initial processing: the transcription, speaker diarization, chapterization. We do that once, and this usage pattern is quite predictable, because we know how many podcasts get released and when, so we can plan for a certain capacity, and we're running that 24/7, one big queue running 24/7.
swyx [00:53:44]: What's the queue job runner? Uh, is it a Django, just like the Python one?
Kevin [00:53:49]: No, that's just our own: our database and the backend talking to the database, picking up jobs, feeding results back. I'm just curious about orchestration and queues. I mean, we of course have a lot of other orchestration where we use Google Pub/Sub, but okay. So we have this usage pattern of very predictable usage, and we can max out the capacity. And then there's this other pattern where, for example with the snip feature, a user action triggers an LLM call and it has to be real time. There can be moments of high usage and moments when there's very little usage. That's basically where these LLM API calls are just perfect, because you don't need to worry about scaling this up and down and handling those issues. Serverless versus serverful.
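For illustration, a rough sketch of a database-backed job queue of the sort Kevin describes (the backend polling its own tables for work). Table and column names are invented for the example; a production version would also need retries and row locking (for instance SELECT ... FOR UPDATE SKIP LOCKED in Postgres).

```python
import time
import sqlite3

def worker_loop(db_path: str, process_episode) -> None:
    conn = sqlite3.connect(db_path)
    while True:
        row = conn.execute(
            "SELECT id, audio_url FROM episode_jobs "
            "WHERE status = 'pending' ORDER BY created_at LIMIT 1"
        ).fetchone()
        if row is None:
            time.sleep(5)           # queue is empty, back off briefly
            continue
        job_id, audio_url = row
        conn.execute("UPDATE episode_jobs SET status = 'running' WHERE id = ?", (job_id,))
        conn.commit()
        process_episode(audio_url)  # transcription, diarization, chapters, ...
        conn.execute("UPDATE episode_jobs SET status = 'done' WHERE id = ?", (job_id,))
        conn.commit()
```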
swyx [00:54:44]: Yeah, exactly. Okay.
Kevin [00:54:45]: I see OpenAI and all of these other providers a bit as the AWS of AI. It's similar to how, before AWS, you would have to have your own servers and buy new servers or get rid of servers, and then with AWS it just became so much easier to ramp things up and down. And this is taking that to the next level for AI. Yeah.
swyx [00:55:18]: I am a big believer in this. Basically it's, you know, intelligence on demand. Yeah. We're probably not using it enough in our daily lives to do things. I should, we should be able to spin up a hundred things at once and go through things and then, you know, stop. And I feel like we're still trying to figure out how to use LLMs in our lives effectively. Yeah. Yeah.
Kevin [00:55:38]: 100%. I think that goes back to where the big opportunity is for me, if you want to do a startup: it's not about more intelligence. You can let the big labs handle
swyx [00:55:48]: the challenge of more intelligence. It's the existing intelligence: how do you integrate it, how do you actually incorporate it into your life? AI engineering. Okay, cool. The one other thing I wanted to touch on was multimodality in frontier models. Dwarkesh had an interesting application of Gemini recently where he just fed raw audio in and got diarized transcription with timestamps out. And I think that will come. So basically what we're saying here is another wave of transformers eating things, because right now models are pretty much single-modality things. You have Whisper, you have a pipeline and everything. You can't just say, oh, no, no, we only feed in the raw files. Do you think that will be realistic for you? I 100% agree. Okay.
Kevin [00:56:38]: Basically everything that we talked about earlier, with the speaker diarization and heuristics and everything, I completely agree: in the future you would just put everything into a big multimodal LLM and it will output everything that you want. So I've also experimented with that. With Gemini 2? With Gemini 2.0 Flash, yeah, just for fun. Because the big difference right now is still cost: doing speaker diarization or transcription this way is a huge difference compared to the pipeline that we've built up. Huh. Okay.
swyx [00:57:15]: I need to figure out what, what that cost is because in my mind 2.0 Flash is so cheap. Yeah. But maybe not cheap enough for you.
Kevin [00:57:23]: No, I mean, if you compare it to Whisper and speaker diarization, and especially self-hosting it... Yeah.
swyx [00:57:30]: Yeah.
Kevin [00:57:30]: Okay. But we will get there, right? Like this is just a question of time.
swyx [00:57:33]: And at some point, as soon as that happens, we'll be the first ones to switch. Awesome. Anything else that you're eyeing on the horizon, like, we're thinking about this feature, we're thinking about incorporating this new AI functionality into our app? Yeah.
Kevin [00:57:50]: There are so many areas that we're thinking about; our challenge is a bit more... Choosing. Yeah, choosing. Looking at the next couple of years, the big areas that interest us are basically four. One is content. Right now it's podcasts. You did mention you can also upload audiobooks and YouTube videos. YouTube. I actually use the YouTube one a fair amount. But in the future we want to also have audiobooks natively in the app, and we want to enable AI-generated content. Just think of taking Deep Research and NotebookLM and putting them together. That should be in our app. The second area is discovery. I think in general. Yeah.
swyx [00:58:38]: I noticed that you don't have, so you have download counts and most snips. Right. Something like that. Yeah. Yeah.
Kevin [00:58:45]: On the discovery side, we want to do much, much more. I think discovery as a paradigm in all apps will undergo a change thanks to AI. There has been a lot of talk; before Elon bought Twitter, there was a lot of talk about bring-your-own-algorithm to Twitter. That was Jack Dorsey's big thing, he talked a lot about that. And I actually think this is coming, but with a bit of a twist. What AI will actually enable is not that you bring your own algorithm, but that you will be able to talk to the algorithm. You can just tell the algorithm, hey, you keep showing me cat videos, and I know I freaking love them, and that's why you keep showing them to me. But please, for the next two hours, I really want to get more into AI stuff. Do not show me cat videos. And then it will just adapt. And of course the question is, big platforms like, I don't know, let's say TikTok, do not have the incentive to offer that.
swyx [00:59:49]: Exactly. That's what I was going to say.
Kevin [00:59:50]: But we are actually driven by helping you learn, get the most out of it, achieve your goals. So for us it's very much in our incentive, like, hey, you should be able to guide it. That was a long way of saying that I think a lot will happen in recommendations. Order by.
swyx [01:00:12]: The most popular. Yeah. I think collaborative filtering will be the first step, right? For recsys, and then some fancy LLM stuff.
Kevin [01:00:20]: Yeah. Maybe to go back to the question that you had before. So those were the first two areas. The next one is voice, voice interfaces and voice AI. How is this going to exist? Maybe I can tell you first why I find it so interesting for us. Voice as an interface, historically there has been so much talk about it and it always fell flat. The reason why I'm excited about it this time around is that with any consumer app, I like to ask myself: what is the moment in my life, the trigger in my life, that gets me to open this app and start using it? So, for example, take Airbnb. The trigger is, ah, you want to travel, and then you open up the app. For apps that do not have this already existing natural trigger in your life, it's very difficult to get the user to open the app again. You need a hook. There's basically only one super successful app that has been able to do that without a natural trigger, and that is Duolingo. Everyone wants to learn a language, but you don't have this natural moment during your day where it's like, ah, now I need to open up this app. You have the notifications. Exactly. The owl memes. Exactly. They gamified the s**t out of it, super successful, super beautiful; they are the GOATs in this arena. But it's much easier if there already is this trigger, because then you don't have to do all of the streaks and leaderboards and everything. Okay, that's a bit of context. Now, if you look at what we're doing and our goal of getting people to really maximize what they get out of their listening, there are a couple of features where we know we can 10x the value that people get out of a podcast. But we need them to do something for that; there is friction involved, because it's all about learning, right? It's about thinking for yourself. Those are the moments when you actually start really 10x-ing the value that you got out of the podcast instead of just consuming it.
swyx [01:02:37]: Applying the knowledge. Yeah. Okay.
Kevin [01:02:39]: Basically, being forced to think about: what was actually the main takeaway for you from this episode? It's something I like doing myself; for every episode that I listen to, I try to boil it down to one single takeaway, even though there might have been 10 amazing things. Pick one, the most important one. And this is an active process, a forcing function in your brain to challenge all of the insights and really come up with the one thing that is applicable to you and your life and what you might want to do with it. So it also helps you turn it into action. This is basically a feature that we're interested in, but you have to get the user to use it, right? So when do you get the user to use it? If this is all text-based, then we're basically playing the same game as Duolingo, where at some point you're going to get a notification from Snipd and be like, hey, Swyx, come on, you know you should do this. Maybe there's a blue owl.
Kevin [01:03:40]: But if you have voice, you can basically hook into the existing habits that the user already has. You already have this habit of listening to a podcast; you're already doing that. And once an episode ends, instead of just jumping into the next episode, you can now actually have your AI companion come on and have a quick conversation. You can go through these things. How that looks in detail, we need to figure out. But it's this paradigm of staying in the flow. This also relates to what you were saying about AI that is invisible. You're staying in the flow of what you're already doing, but now we can insert a completely new experience that helps you get the most out of it. Yeah.
swyx [01:04:27]: I think your framing of this is very powerful. Because I think this is where you are a product person more than an engineer. Because an engineer would just be like, oh, it's just chat with your podcast. It's like chat with PDF, chat with podcast. Okay, cool. But you're framing it in a different light that actually makes sense to me now, as opposed to previously. I don't chat with my podcast. Why? I just listen to the podcast. But for you, it's more about retention and learning and all that. And because you're very serious about it, that's why you started the company. So you're focused on that. Whereas I'm still stuck in that consume, consume, consume mentality. And I know it's not good, but this is my default. Which is why I was a little bit lost when you were saying all the things about Duolingo. And you're saying the things about the trigger. This is my trigger for listening to the podcast is I'm by myself. That's my trigger. But you're saying the trigger is not about listening to the podcast. The trigger is remembering and retaining and processing the podcast I just listened to.
Kevin [01:05:41]: So what I meant is, you already have this trigger that gets you to start listening to a podcast. Yes. This you already have, and so do, I don't know, millions of people. There are more than half a billion monthly active podcast listeners. So you already have the trigger that gets you to start listening. But, as you just said yourself, you do not have a trigger that gets you to regularly process this information. And voice for me is the ability to hook into your existing trigger; the trigger I was talking about is basically your podcast, and you're just still listening. So we just continue, and this can be two minutes. I'm not saying this is a 60-minute process; I think two or three minutes that can come on completely naturally. And if we manage to do that and you start noticing as a user, like, freaking hell, I'm just spending three minutes with this AI companion, but I'm taking this much away from it. Your retention is better. And retention is one thing, but you also start to take what you've learned and apply it to what's important to you; you're thinking. And if we get you to notice that feeling, then yeah, then we've won. Yeah.
swyx [01:07:05]: I would say a lot of people rely on Anki, flashcards and all that, to do that. But making the notes is also a chore. And I think this could be very, very interesting. I'm just noticing that it's kind of a different usage mode. You already talked about this; the name of Snipd is very snip-centric, and I actually originally resisted adopting Snipd because of that. But now you observe that people are listening to long-form episodes and you're talking at the end. Like, the ideal implementation of this is: I browse through a bunch of snips of the things that I'm subscribed to, I listen to the snips, I talk with it, and then maybe it double-clicks on the podcast and goes and finds other timestamps that are relevant to the thing that I want to talk about. I don't know if that's interesting.
Kevin [01:07:53]: I think these are all areas that we should explore. Yeah.
swyx [01:07:57]: Like, we're still quite open about what this will look like in detail. What are your thoughts on voice cloning? Everyone wants to continue. I have had my voice cloned and people have talked to me, the AI version of me. Is that too creepy?
Kevin [01:08:13]: I don't think it will be too creepy in the future. With a lot of these things, our society is going through a change, and things that seem quite weird now will seem normal in the future. I think voice cloning has already become much more normalized. I remember I was at the, I think it was the 2017 NIPS conference. San Diego?
swyx [01:08:42]: No, LA. LA. It was the Flo Rida one? Yeah. Yeah. Flo Rida. Yeah.
Kevin [01:08:47]: So everyone says that was peak NIPS. I remember there was this talk or workshop by Lyrebird; they actually got acquired by Descript later. They were doing voice cloning and showing off their tech, and there was this huge discussion later on about all of the moral and ethical implications. It really felt like this would never be accepted by society. And you look now, you have ElevenLabs and anyone can just clone their voice, and no one really talks about it as, oh my God, the world is going to end. So I think society will get used to that. In our case, I think there are some interesting applications where we'd be super interested in working together with creators, like podcast creators, to play around with this concept. I think it would be super cool if someone could come onto Snipd, go to the Latent Space
swyx [01:09:42]: podcast and start chatting with AI swyx. Yeah. No, I think we'd be there. Obviously, as an AI podcast, we should be first consumers of these things. I would say one observation I've made about podcasting, and this is the general state of the market, and you can ask me the things you want to ask about podcasters: we are focusing a lot more on YouTube this year. YouTube is the best podcasting platform. It is not MP3s, it is not Apple Podcasts, it is not Spotify. It's YouTube. And it's just the social layer of recommendations and the existing habit that people have of logging onto YouTube. That's my observation; you can riff on that. The only thing I would say is, when you were listing your priorities, you said audiobooks first over YouTube.
Kevin [01:10:26]: And I would switch that if I were you. Yeah, as in YouTube, video podcasts. I mean, it's obvious that video podcasts are here to stay, and not just here to stay, they're bigger. What I want to do with Snipd is obviously also add video to the platform. Oh, yeah. The way I see video, I like this concept of backgroundable video. I didn't come up with this concept; it was actually Gustav Söderström, the CPO of Spotify. Exactly. When I speak with people, it remains true that they listen to podcasts while doing something else at the same time. That's like 90% of their consumption, also if they listen on YouTube. But every now and then it's nice to have the video. It's nice if you're, for example, just watching a clip. It's nice if they mention something visual, like they show some slides or something where you need the visual with it. It helps you connect much more with the host as a listener. But the biggest benefit I see with video is discovery. I think that is also why YouTube has become the biggest podcast player out there: they have the discovery. And discovery in video is just so much easier, so much better, so much more engaging. So this is the area I'm most interested in when it comes to video and Snipd: that we can provide a much better, much more engaging and much more fun discovery experience. For consumers? Yeah, for consumers.
swyx [01:12:01]: Okay. I think that you almost have three different audiences. The vast majority of people for you are the people listening to podcasts. Of course. Then there's a second layer of people who create snips, who add extra data and annotation value to your platform. By the way, we use the snip count as a proxy for popularity, because we have download counts, but platforms like Spotify, for example, re-host our MP3 file, so we don't get any download count from Spotify. Snip count is active: I opted in to listen to you and I shared this. Those are really, really good metrics. But the third audience that you haven't really touched is the podcast creators like myself. And for me, discovery from that point of view, not from your point of view, is: I want to be discovered. And I think YouTube is still there; Twitter, obviously, and for me Substack and Hacker News. I try very hard to rank on Hacker News. I think when TikTok took this very seriously, they prioritized the creators of the content. And for you, the creators of the content are the people making snips. But there may be a world for you in which you prioritize the creators of the podcasts.
Kevin [01:13:10]: Yeah. Interesting observation. What are some of your ideas or thoughts? Do you have some specific?
swyx [01:13:18]: Riverside is the closest that has come to it. Descript is number two. Descript bought a Riverside competitor and, as far as I can tell, it's not been very successful. Descript just has a very, very good niche, a very good editing angle, and then just hasn't done anything interesting since then. Although Underlord is good, it's not great. Your chapterization is better than Descript's. Again, they should be able to beat you. They're not. And Riverside is good too, very, very good. So we actually recently started a second series of podcasts within Latent Space that is YouTube-only, because you only find it on YouTube, and it's also shorter. So whereas this is a one-and-a-half to two-hour thing, that one is remote only, 30 minutes, chop, chop, send it on to Riverside. Riverside is pretty good for that. Not great. It doesn't do good thumbnails. The editing is still a little bit rough. It has this auto-editor where it focuses on whoever's actively speaking, and then sometimes it goes back to the multi-speaker view, that kind of stuff. People like that. But the shorts are still not great. I still need to manually download it and then republish it to YouTube. The shorts I still need to pick, and they mostly suck. There are still a lot of rough edges there. Ideally, me as a creator, you know what I want. You definitely know what I want. I sit down, record, press a button, done. We're still not there.
Kevin [01:14:46]: I think you guys could do it. Okay. So if I can translate that for you, it's really about simplifying the creation process of the podcast. Yeah.
swyx [01:14:55]: And I'll tell you what, this will increase the quality, because the reason that most podcasts or YouTube videos are s**t is that they are made by people who don't have life experience, who are not that important in the world, who aren't doing important jobs. What you actually want to enable is busy CEOs to each make their own podcasts. They're not going to sit there and figure out Riverside. A lot of the reason that people like Latent Space is that it takes an idiot like me, who could be doing a lot more with my life, making a lot more money, having a real job somewhere else, and I just choose to do this because I like it. But otherwise, they would never get access to me and to the people that I have access to. So that's my pitch. Cool.
swyx [01:15:44]: Anything else that you normally want to talk to podcasters about?
Kevin [01:15:46]: I think we've covered everything. I guess as a last message: go try out Snipd. There's a premium version, and you can use and try out everything for free. I'm also happy to provide a link that you can add to the show notes so people can try out the premium version for free for a month if they want. Yeah. Give it a shot.
swyx [01:16:08]: I would say, yeah, thanks for coming on. I would say that after you demoed me, I did not convert for another four to six months, because I found it very challenging to switch over. And I think that's the main thing: you have OPML import, but there's no way to import all the existing half-listened-to episodes or my rankings or whatever. For listeners, I have a blog post where I talked about my switch. Just treat it as a chance to clean house.
swyx [01:16:45]: That's a good point. Just refocus here. Fresh start, 2025. Great. Well, thank you for working on Snipd, and thank you for coming on. We usually spend a lot of time talking to big companies, venture startups, B2B SaaS, that kind of stuff. But your journey, a small team building a B2C consumer app, is the kind of stuff that we like to also feature, because a lot of people want to build what you're doing, and they don't see role models that are successful, that are confident, that are having success in this market, which is very challenging. So yeah, thanks for sharing some of your thoughts. Thanks.
Kevin [01:17:26]: Yeah, thanks. Thanks for having me. And thank you for creating an amazing podcast and an amazing conference as well.
swyx [01:17:32]: Thank you.
Get full access to Latent.Space at www.latent.space/subscribe
Outlasting Noam Shazeer, crowdsourcing Chat + AI with >1.4m DAU, and becoming the "Western DeepSeek" — with William Beauchamp, Chai Research
Sunday, January 26, 2025 • Duration 01:15:46
One last Gold sponsor slot is available for the AI Engineer Summit in NYC. Our last round of invites is going out soon - apply here - If you are building AI agents or AI eng teams, this will be the single highest-signal conference of the year for you!
While the world melts down over DeepSeek, few are talking about the OTHER notable group of former hedge fund traders who pivoted into AI and built a remarkably profitable consumer AI business with a tiny but incredibly cracked engineering team — Chai Research. In short order they have:
* Started a Chat AI company well before Noam Shazeer started Character AI, and outlasted his departure.
* Crossed 1m DAU in 2.5 years - William updates us on the pod that they’ve hit 1.4m DAU now, another +40% from a few months ago. Revenue crossed >$22m.
* Launched the Chaiverse model crowdsourcing platform - taking 3-4 week A/B testing cycles down to 3-4 hours, and deploying >100 models a week.
While they’re not paying million dollar salaries, you can tell they’re doing pretty well for an 11 person startup:
The Chai Recipe: Building infra for rapid evals
Remember how the central thesis of LMarena (formerly LMsys) is that the only comprehensive way to evaluate LLMs is to let users try them out and pick winners?
At the core of Chai is a mobile app that looks like Character AI, but is actually the largest LLM A/B testing arena in the world, specialized on retaining chat users for Chai’s usecases (therapy, assistant, roleplay, etc). It’s basically what LMArena would be if taken very, very seriously at one company (with $1m in prizes to boot):
Chai publishes occasional research on how they think about this, including talks at their Palo Alto office:
William expands upon this in today’s podcast (34 mins in):
Fundamentally, the way I would describe it is when you're building anything in life, you need to be able to evaluate it. And through evaluation, you can iterate, we can look at benchmarks, and we can say the issues with benchmarks and why they may not generalize as well as one would hope in the challenges of working with them. But something that works incredibly well is getting feedback from humans. And so we built this thing where anyone can submit a model to our developer backend, and it gets put in front of 5000 users, and the users can rate it.
And we can then have a really accurate ranking of which models users are finding more engaging or more entertaining. It's at the point now where every day we evaluate between 20 and 50 models, LLMs, every single day. So even though we've only got a team of, say, five AI researchers, they're able to iterate through a huge quantity of LLMs; our team ships, let's say, a minimum of 100 LLMs a week. Before that moment in time, we might iterate through three a week, and there was a time when even doing five a month was a challenge, right? By changing the feedback loops, it's no longer: let's launch these three models, let's do an A/B test, let's assign different cohorts, let's wait 30 days to see what the day-30 retention is. That's A/B testing 101 if you're doing an app: do a 30-day retention test, assign different treatments to different cohorts, and come back in 30 days. That's insanely slow. It's just too slow. And so we were able to get that 30-day feedback loop all the way down to something like three hours.
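To make the mechanics concrete, here is a toy sketch of ranking models from crowdsourced feedback, in the spirit of the Chaiverse arena William describes. It is illustrative only, not Chai's actual ranking code; the data shape and the minimum sample size are assumptions.

```python
from collections import defaultdict

def rank_models(feedback, min_votes=50):
    """feedback: iterable of (model_id, liked: bool) events from user sessions."""
    likes, totals = defaultdict(int), defaultdict(int)
    for model_id, liked in feedback:
        totals[model_id] += 1
        likes[model_id] += int(liked)
    # Only rank models with enough votes to make the signal meaningful.
    scores = {m: likes[m] / totals[m] for m in totals if totals[m] >= min_votes}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The speedup William quotes comes from collecting this signal continuously from live traffic instead of waiting out a 30-day retention cohort.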
In Crowdsourcing the leap to Ten Trillion-Parameter AGI, William describes Chai’s routing as a recommender system, which makes a lot more sense to us than previous pitches for model routing startups:
William is notably counter-consensus in a lot of his AI product principles:
* No streaming: Chats appear all at once to allow rejection sampling
* No voice: Chai actually beat Character AI to introducing voice - but removed it after finding that it was far from a killer feature.
* Blending: “Something that we love to do at Chai is blending, which is, you know, it's the simplest way to think about it is you're going to end up, and you're going to pretty quickly see you've got one model that's really smart, one model that's really funny. How do you get the user an experience that is both smart and funny? Well, just 50% of the requests, you can serve them the smart model, 50% of the requests, you serve them the funny model.” (that’s it!)
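A minimal sketch of the blending idea from the bullet above: per request, randomly route to one of several specialist models so the overall experience mixes their strengths. The model names are placeholders, not Chai's actual models.

```python
import random

MODELS = ["smart-model", "funny-model"]

def pick_model_for_request() -> str:
    # 50/50 split across the two specialist models, chosen per request.
    return random.choice(MODELS)
```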
But chief above all is the recommender system.
We also referenced Exa CEO Will Bryk's concept of SuperKnowledge:
Full Video version
On YouTube. Please like and subscribe!
Timestamps
* 00:00:04 Introductions and background of William Beauchamp
* 00:01:19 Origin story of Chai AI
* 00:04:40 Transition from finance to AI
* 00:11:36 Initial product development and idea maze for Chai
* 00:16:29 User psychology and engagement with AI companions
* 00:20:00 Origin of the Chai name
* 00:22:01 Comparison with Character AI and funding challenges
* 00:25:59 Chai's growth and user numbers
* 00:34:53 Key inflection points in Chai's growth
* 00:42:10 Multi-modality in AI companions and focus on user-generated content
* 00:46:49 Chaiverse developer platform and model evaluation
* 00:51:58 Views on AGI and the nature of AI intelligence
* 00:57:14 Evaluation methods and human feedback in AI development
* 01:02:01 Content creation and user experience in Chai
* 01:04:49 Chai Grant program and company culture
* 01:07:20 Inference optimization and compute costs
* 01:09:37 Rejection sampling and reward models in AI generation
* 01:11:48 Closing thoughts and recruitment
Transcript
Alessio [00:00:04]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel, and today we're in the Chai AI office with my usual co-host, Swyx.
swyx [00:00:14]: Hey, thanks for having us. It's rare that we get to get out of the office, so thanks for inviting us to your home. We're in the office of Chai with William Beauchamp. Yeah, that's right. You're founder of Chai AI, but previously, I think you're concurrently also running your fund?
William [00:00:29]: Yep, so I was simultaneously running an algorithmic trading company, but I fortunately was able to kind of exit from that, I think just in Q3 last year. Yeah, congrats. Yeah, thanks.
swyx [00:00:43]: So Chai has always been on my radar because, well, first of all, you do a lot of advertising, I guess, in the Bay Area, so it's working. Yep. And second of all, the reason I reached out to a mutual friend, Joyce, was because I'm just generally interested in the... ...consumer AI space, chat platforms in general. I think there's a lot of inference insights that we can get from that, as well as human psychology insights, kind of a weird blend of the two. And we also share a bit of a history as former finance people crossing over. I guess we can just kind of start it off with the origin story of Chai.
William [00:01:19]: Why decide to work on a consumer AI platform rather than B2B SaaS? So, just quickly touching on the background in finance. Sure. Originally I'm from the UK, born in London, and I was fortunate enough to go study economics at Cambridge. I graduated in 2012, and at that time, for everyone in the UK and everyone on my course, HFT and quant trading was really the big thing. It was the big wave that was happening, so there was a lot of opportunity in that space. Throughout college I'd played poker; I dabbled as a professional poker player and was able to accumulate, say, $100,000 through playing poker. And at the time, as my friends went to work at companies like Jane Street or Citadel, I did the maths and I thought, well, maybe if I traded my own capital, I'd probably come out ahead. I'd make more money than just going to work at Jane Street.
swyx [00:02:20]: With 100k base as capital?
William [00:02:22]: Yes, yes. That's not a lot. Well, it depends what strategies you're doing. There's an advantage to being small, right? Strategies that don't work in size. Exactly, exactly. So if you have a fund of $10 million and you find a little anomaly in the market that you might be able to make 100k a year from, that's a 1% return on your 10 million fund. If your fund is 100k, that's a 100% return, right? So being small, in some sense, was an advantage. So I started off, taught myself Python, and machine learning was the big thing as well. It was the first time machine learning was being used big-time for image recognition; neural networks come out, you get dropout. That was the big thing going on at the time. So I probably spent my first three years out of Cambridge just building neural networks and random forests to try and predict asset prices, and then trading that using my own money. And that went well. And if you start something and it goes well, you try and hire more people. The first people that came to mind were the talented people I went to college with, so I hired some friends. That went well, and I hired some more, and eventually I kind of ran out of friends to hire. And so that was when I formed the company. From that point on, we had our ups and we had our downs, and that was a whole long story and journey in itself. But after doing that for about eight or nine years, on my 30th birthday, which was four years ago now, I took a step back to just evaluate my life, right? This is what one does when one turns 30. I hear you. I looked at my 20s and I loved it. It was a really special time. I was really lucky and fortunate to have worked with this amazing team, been successful, had a lot of hard times, and through the hard times learned wisdom, and then a lot of success and was able to enjoy it. The company was making about five million pounds a year, and it was just me and a team of, say, 15 Oxford- and Cambridge-educated mathematicians and physicists. It was the real dream that you'd have if you wanted to start a quant trading firm. It was like...
swyx [00:04:40]: Your own, all your own money?
William [00:04:41]: Yeah, exactly. It was all the team's own money. We had no customers complaining to us about issues, no investors saying they don't like the risk that we're taking. We could really run the thing exactly as we wanted it. It's like Susquehanna or RenTec. Yeah, exactly. And those are the companies that we would look towards as we were building that thing out. But on my 30th birthday, I look and I say, OK, great, this thing is making as much money as anyone would really need. And I thought, well, what's going to happen if we keep going in this direction? And it was clear that we would never have a really big impact on the world. We can enrich ourselves, we can make really good money, everyone on the team would be paid very well, and presumably I can make enough money to buy a yacht or something. But this stuff wasn't that important to me. And so I felt a sort of obligation that if you have this much talent, and if you have a talented team, especially as a founder, you want to be putting all that talent towards a good use. I looked at the time at getting into crypto, and I had a really strong view on crypto, which was that as a gambling device, it's the most fun form of gambling ever invented, super fun; and as a way to evade monetary regulations and banking restrictions, I think it's also absolutely amazing. So it has two killer use cases. But not so much banking the unbanked, and everything else to do with the blockchain and web 3.0 didn't really make much sense to me. And so instead of going into crypto, where I thought even if I was successful I'd end up in a lot of trouble, I thought maybe it'd be better to build something that governments wouldn't have a problem with. I knew that LLMs were a thing. I think OpenAI had said, they hadn't released GPT-3 yet, but they'd said GPT-3 is so powerful, we can't release it to the world, or something. Was it GPT-2? And then I started interacting with, I think, some language models Google had open sourced. They weren't necessarily LLMs, but still, I was able to play around with them. Nowadays so many people have interacted with ChatGPT, they get it, but the first time you can just talk to a computer and it talks back, it's kind of a special moment, and everyone who's done that goes, wow, this is how it should be. Rather than having to type into Google and search, you should just be able to ask Google a question. When I saw that, I read the literature and came across the scaling laws, and even four years ago all the pieces of the puzzle were there, right? Google had done this amazing research and published a lot of it. OpenAI was still open, and so they'd published a lot of their research. So you really could be fully informed on the state of AI and where it was going. At that point I was confident enough that it was worth a shot: I think LLMs are going to be the next big thing, and that's the space I want to be building in. And I thought, what's the most impactful product I can possibly build? And I thought it should be a platform. I myself love platforms. I think they're fantastic because they open up an ecosystem where anyone can contribute to it. Right.
So if you think of a platform like YouTube: instead of it being a Hollywood situation where, if you want to make a TV show, you have to convince Disney to give you the money to produce it, anyone in the world can post any content they want to YouTube, and if people want to view it, the algorithm is going to promote it. Nowadays you can look at creators like Mr. Beast or Joe Rogan; they would never have had that opportunity if it weren't for this platform. Twitter's another great one, right? And I would consider Wikipedia to be a platform too: instead of the Encyclopaedia Britannica, which is monolithic, where you get all the researchers together, you get all the data together and you combine it in one monolithic source, you have this distributed thing where anyone can host their content on Wikipedia, anyone can contribute to it, and for some people maybe their contribution is that they delete stuff. When I was hearing the Sam Altman and the Muskian perspective of AI, it was a very monolithic thing. It was all about AI basically being a single thing, which is intelligence. The more compute, the more intelligent; the more and better AI researchers, the more intelligent, right? They would speak about it as a kind of race: who can get the most data, the most compute and the most researchers, and that would end up with the most intelligent AI. But I didn't believe in any of that. I thought that perspective is the perspective of someone who's never actually done machine learning. Because with machine learning, first of all, you see that the performance of the models follows an S-curve. It's not like it just goes off to infinity, right? And the S-curve kind of plateaus around human-level performance. You can look at all the machine learning that was going on in the 2010s; everything kind of plateaued around human-level performance. We can think about the self-driving car promises, how Elon Musk kept saying the self-driving car is going to happen next year, then next year. Or you can look at image recognition, speech recognition, all of these things: there was almost nothing that went superhuman, except for something like AlphaGo. And we can speak about why AlphaGo was able to go superhuman. So I thought the most likely thing was that it's not going to be a monolithic thing like an Encyclopaedia Britannica; it must be a distributed thing. And I actually like to look at the world of finance for what I think a mature machine learning ecosystem looks like. Finance is a machine learning ecosystem because all of these quant trading firms are running machine learning algorithms, but they're running them on a centralized platform like a marketplace. And it's not the case that there's one giant quant trading company with all the data and all the quant researchers and all the algorithms and compute; instead they all specialize. One will specialize in high frequency trading, another in mid frequency, another in equities, another in something else. And I thought that's the way the world works, that's how it is, and so there must exist a platform where a small team can produce an AI for a unique purpose. 
And they can iterate and build the best thing for that, right? And so that was the vision for Chai. So we wanted to build a platform for LLMs.
Alessio [00:11:36]: That's maybe the insight, the contrarian view, that led you to start the company. Yeah. And then what was maybe the initial idea maze? Because if somebody told you that was the Hugging Face founding story, people might believe it. It's kind of a similar ethos behind it. How did you land on the product you have today? And maybe what were some of the ideas that you initially thought about and then discarded?
William [00:11:58]: So the first thing we built was fundamentally an API. Nowadays people would describe it as agents, right? But anyone could write a Python script. They could submit it to an API, send it to the Chai backend, and we would then host this code and execute it. So that's the developer side of the platform. For their Python script, the interface was essentially text in and text out. An example would be the very first bot that I created. I think it was a Reddit news bot. So first it would pull the popular news, then it would prompt whatever, I just used some external API for, like, BERT or GPT-2 or whatever. It was a very, very small thing. And then the user could talk to it. So you could say to the bot, hi bot, what's the news today? And it would say, these are the top stories, and you could chat with it. Now, four years later, that's like Perplexity or something, right? But back then the models were, first of all, really, really dumb. You know, they had an IQ of like a four-year-old. And there really wasn't any demand or any PMF for interacting with the news. So then I was like, okay, let's make another one. And I made a bot which you could talk to about a recipe. So you could say, I'm making eggs, I've got eggs in my fridge, what should I cook? And it'll say, you should make an omelet. Right. There was no PMF for that. No one used it. And so I just kept creating bots. Every single night after work, I'd be like, okay, we have AI, we have this platform, I can create any text-in, text-out sort of agent and put it on the platform. And so we just created stuff night after night. And then all the coders I knew, I would say to them, look, there's this platform, you can create any chat AI, you should put it on. And you know, everyone's like, well, chatbots are super lame, we want absolutely nothing to do with your chatbot app. No one who knew Python wanted to build on it. I'm trying to build all these bots and no consumers want to talk to any of them. And then my sister, who at the time was just finishing college or something, I said to her, if you want to learn Python, you should just submit a bot to my platform. And she said, okay, cool, I'm going to build a therapist bot. And then the next day I checked the performance of the app and I'm like, oh my God, we've got 20 active users. And they spent an average of 20 minutes on the app. I was like, oh my God, what bot were they speaking to for an average of 20 minutes? And I looked, and it was the therapist bot. And I went, oh, this is where the PMF is. There was no demand for recipe help. There was no demand for news. There was no demand for dad jokes or pub quiz or fun facts. What they wanted was the therapist bot. At the time I reflected on that and I thought, well, if I want to consume news, the most fun way to consume news is Twitter. The value of there being a back and forth wasn't that high. Right. And I thought, if I need help with a recipe, I actually just go to The New York Times, which has a good recipe section, right? It's not actually that hard. And so I thought the thing that AI is 10x better at is a sort of conversation that's not intrinsically informative, but is more about an opportunity.
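As a rough illustration of the "text in, text out" developer interface William describes, here is a hypothetical Python sketch. The class and method names are illustrative rather than Chai's actual API, and the headlines are hard-coded where the real bot would have called an external news API and an LLM.

```python
# Hypothetical sketch of the early developer contract: a bot is just hosted code
# that takes a text message in and returns a text reply. Not Chai's real API.

class RedditNewsBot:
    """Toy stand-in for the first bot: pull headlines, then chat about them."""

    def __init__(self):
        # The real bot fetched popular posts and prompted a GPT-2-era model;
        # hard-coded headlines keep this sketch self-contained and runnable.
        self.headlines = [
            "Open-source language model released",
            "New scaling-law paper published",
        ]

    def respond(self, user_message: str) -> str:
        # Text in, text out: the entire interface the platform asked developers for.
        if "news" in user_message.lower():
            return "Top stories today: " + "; ".join(self.headlines)
        return "Ask me about today's news!"


if __name__ == "__main__":
    bot = RedditNewsBot()
    print(bot.respond("Hi bot, what's the news today?"))
```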
You can say whatever you want. You're not going to get judged. If it's 3am, you don't have to wait for your friend to text back. It's immediate; they're going to reply immediately. You can say whatever you want. It's judgment-free and it's much more like a playground, much more like a fun experience. And you could see that if the AI gave a person a compliment, they would love it. It's much easier to get the AI to give you a compliment than a human. From that day on, I said, okay, I get it. Humans want to speak to humans or human-like entities, and they want to have fun. And that was when I started to look less at platforms like Google and more at platforms like Instagram. I was trying to think about why people use Instagram, and I could see that Chai was filling the same desire or the same drive. If you go on Instagram, typically you want to look at the faces of other humans, or you want to hear about other people's lives. So if The Rock is making himself pancakes on a cheat day, you kind of feel a little bit like you're The Rock's friend, or you're having pancakes with him or something, right? But if you do it too much, you feel like you're sad and like a lonely person. But with AI, you can talk to it and tell it stories, and it tells you stories, and you can play with it for as long as you want. And you don't feel like you're a sad, lonely person. You feel like you actually have a friend.
Alessio [00:16:29]: And what, why is that? Do you have any insight on that from using it?
William [00:16:33]: I think it's just human psychology. I think it's the idea that, with old-school social media, you're just consuming passively, right? So you'll just swipe. If I'm watching TikTok, I just swipe and swipe and swipe. And even though I'm getting the dopamine of watching an engaging video, there's this other thing building in my head, which is I'm feeling lazier and lazier and lazier. And after a certain period of time, I'm like, man, I just wasted 40 minutes. I achieved nothing. But with AI, because you're interacting, it's not like work, but you feel like you're participating and contributing to the thing. You don't feel like you're just consuming. So you don't have a sense of remorse, basically. And you know, I think on the whole, the way people talk about Chai and interact with the AI, they speak about it in an incredibly positive sense. Like we get people who say they have eating disorders saying that the AI helps them with their eating disorders. People who say they're depressed, it helps them through the rough patches. So I think there's something intrinsically healthy about interacting that TikTok and Instagram and YouTube don't quite tick. From that point on, it was about building more and more human-centric AI for people to interact with. And I was like, okay, let's make a Kanye West bot, right? And then no one wanted to talk to the Kanye West bot. And I was like, ah, who's a cool persona for teenagers to want to interact with? And I was trying to find the influencers and stuff like that, but no one cared. They didn't want to interact with them, yeah. And instead, the special moment was really the realization that developers and software engineers aren't interested in building this sort of AI, but the consumers are, right? And rather than me trying to guess every day what's the right bot to submit to the platform, why don't we just create the tools for the users to build it themselves? And nowadays this is like the most obvious thing in the world, but when Chai first did it, it was not an obvious thing at all. Right. Right. So we took the API for, let's just say it was, I think it was GPT-J, which was this 6-billion-parameter open-source transformer-style LLM. We took GPT-J, we let users create the prompt, we let users select the image, and we let users choose the name. And that was the bot. And through that, they could shape the experience, right? So they could say this bot's going to be really mean, and it's going to be called, like, Bully in the Playground, right? That was a whole category that I never would have guessed. Right. People love to fight. They love to have a disagreement, right? And then there'd be all these romantic archetypes that I didn't know existed. And so once the users could create the content that they wanted, that was when Chai was able to get this huge variety of content, and rather than appealing to the 1% of the population that I'd figured out what they wanted, you could appeal to a much, much broader thing. And so from that moment on, it was very, very crystal clear: just as Instagram is this social media platform that lets people create images and videos and upload them, Chai was really about how can we let the users create this experience in AI and then share it and interact and search.
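To make the "prompt, image, name" recipe concrete, here is a hypothetical sketch of what a user-generated bot definition might look like. The field names and request format are assumptions for illustration, not Chai's actual schema.

```python
from dataclasses import dataclass

# Hypothetical sketch: at this stage a bot was just a name, an image, and a prompt
# wrapped around a shared GPT-J-class base model. Field names are illustrative.

@dataclass
class UserBot:
    name: str        # chosen by the creator, e.g. "Bully in the Playground"
    image_url: str   # picked by the creator
    prompt: str      # persona description that shapes the base model's replies

    def build_request(self, chat_history: str, user_message: str) -> str:
        # Everything the platform would send to the shared base model for this bot.
        return f"{self.prompt}\n{chat_history}\nUser: {user_message}\n{self.name}:"


if __name__ == "__main__":
    bot = UserBot(
        name="Bully in the Playground",
        image_url="https://example.com/bully.png",
        prompt="You are a mean playground bully who loves to pick an argument.",
    )
    print(bot.build_request("", "Hey, give my ball back!"))
```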
So it's really, you know, I say it's like a platform for social AI.
Alessio [00:20:00]: Where did the Chai name come from? Because you started down the same path, I was like, is it Character AI shortened? You started at the same time, so I was curious. The UK origin was like my second guess, the chai.
William [00:20:15]: We started way before Character AI. And there's an interesting story there. Chai's numbers were very, very strong, right? So I think even in, I think late 2022, was it late 2022 or maybe early 2023? Chai was like the number one AI app in the App Store. We would have something like 100,000 daily active users. And then one day we saw there was this website, and we were like, oh, this website looks just like Chai. And it was the Character AI website. And I think nowadays it's much more common knowledge that when they left Google with the funding, I think they knew what was the most trending, the number one app. And I think they sort of built that. Oh, you found the people.
swyx [00:21:03]: You found the PMF for them.
William [00:21:04]: We found the PMF for them. Exactly. Yeah. So I worked a year very, very hard. And then that was when I learned a lesson, which is what it means when your competitor is VC-backed. You know, so Chai, we'd got to this point where I was the only person who'd invested. I'd invested maybe 2 million pounds in the business. And from that, we were able to build this thing and get to, say, a hundred thousand daily active users. And then when Character AI came along, the first version, we sort of laughed. We were like, oh man, this thing sucks. They don't know what they're building. They're building the wrong thing anyway. But then I saw, oh, they've raised a hundred million dollars. Oh, they've raised another hundred million dollars. And then our users started saying, oh guys, your AI sucks. Because we were serving a 6-billion-parameter model, right? How big was the model that Character AI could afford to serve, right? So we would be spending, let's say, a dollar per user, right? Over the entire lifetime.
swyx [00:22:01]: A dollar per session, per chat, per month? No, no, no, no.
William [00:22:04]: Let's say over the course of the year we'd have a million users and we'd spend a million dollars on the AI throughout the year, right? Aggregated. Exactly. Exactly. Right. They could spend a hundred times that. So people would say, why is your AI much dumber than Character AI's? And then I was like, oh, okay, I get it. This is the Silicon Valley-style hyperscale business. And so, yeah, we moved to Silicon Valley and got some funding and iterated and built the flywheels. And yeah, I'm very proud that we were able to compete with that. And I think the reason we were able to do it was just customer obsession. And it's similar, I guess, to how DeepSeek have been able to produce such a compelling model when compared to someone like OpenAI, right? So DeepSeek, you know, their latest, V2, yeah, they claim to have spent 5 million training it.
swyx [00:22:57]: It may be a bit more, but, like, why are you making such a big deal out of this? Yeah. There's an agenda there. Yeah. You brought up DeepSeek, so we have to ask: you had a call with them.
William [00:23:07]: We did. We did. We did. Um, let me think what to say about that. I think for one, they have an amazing story, right? So their background is again in finance.
swyx [00:23:16]: They're the Chinese version of you. Exactly.
William [00:23:18]: Well, there's a lot of similarities. Yes. Yes. I have a great affinity for companies which are founder-led, customer-obsessed, and just try to build something great. And I think what DeepSeek have achieved that's quite special is they've got this amazing inference engine. They've been able to reduce the size of the KV cache significantly, and by being able to do that, they're able to significantly reduce their inference costs. And I think with AI, people get really focused on the foundation model, the model itself, and they don't pay much attention to the inference. To give you an example with Chai: let's say a typical user session is 90 minutes, which is very, very long. For comparison, let's say the average session length on TikTok is 70 minutes. So people are spending a lot of time, and in that time they're able to send, say, 150 messages. That's a lot of completions, right? It's quite different from an OpenAI scenario where people might come in with a particular question in mind, ask one question and a few follow-up questions, right? So because they're consuming, say, 30 times as many requests for a chat, for a conversational experience, you've got to figure out how to get the right balance between the cost of that and the quality. And so, you know, with AI it's always been the case that if you want a better experience, you can throw compute at the problem, right? If you want a better model, you can just make it bigger. If you want it to remember better, give it a longer context. And now, what OpenAI is doing to great fanfare is, with rejection sampling, you can generate many candidates, right? And then with some sort of reward model or some sort of scoring system, you can serve the most promising of these many candidates. And so that's scaling up on the inference-time compute side of things. And so for us, it doesn't make sense to think of AI as just the absolute performance, the MMLU score or any of these benchmarks that people like to look at. If you just look at that score, it doesn't really tell you anything, because progress is really made by improving the performance per dollar. And I think that's an area where DeepSeek have been able to perform very, very well, surprisingly so. And so I'm very interested in what Llama 4 is going to look like, and if they're able to match what DeepSeek have been able to achieve with this performance-per-dollar gain.
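The inference-time technique William alludes to here is commonly implemented as best-of-n (rejection) sampling: generate several candidates, score them with a reward model, and serve the best one. A minimal sketch follows; the generator and scorer are stand-in stubs, not Chai's or OpenAI's actual systems.

```python
import random

# Minimal best-of-n sketch: sample n candidate completions, score each with a
# reward model, and serve the highest-scoring one. Both functions below are stubs.

def generate(prompt: str, n: int) -> list[str]:
    # Stand-in for sampling n completions from an LLM at temperature > 0.
    return [f"{prompt} -> candidate {i} (seed {random.randint(0, 999)})" for i in range(n)]

def reward(completion: str) -> float:
    # Stand-in for a learned reward model predicting user engagement.
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    candidates = generate(prompt, n)
    # Serving cost grows with n: this is the performance-per-dollar trade-off.
    return max(candidates, key=reward)

if __name__ == "__main__":
    print(best_of_n("Tell me a story about a spaceship"))
```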
Alessio [00:25:59]: Before we go into the inference and some of the deeper stuff, can you give people an overview of some of the numbers? So I think last I checked, you have like 1.4 million daily actives now, and over $22 million of revenue. So it's quite a business.
William [00:26:12]: Yeah, users grew by a factor of three last year. Revenue more than doubled. You know, it's very exciting. We're competing with some really big, really well-funded companies. Character AI got, I think it was, almost a $3 billion valuation, and 5 million DAU is the number I last heard. Talkie, which is a Chinese-built app owned by a company called MiniMax, they're incredibly well funded. And these companies didn't grow by a factor of three last year, right? And so when you've got this company and this team that's able to keep building something that gets users excited, and they want to tell their friends about it, and then they want to come and stick on the platform, I think that's very special. And so last year was a great year for the team. And yeah, I think the numbers reflect the hard work that we put in. And fundamentally, the quality of the app, the quality of the content, the quality of the AI is the quality of the experience that you have. You actually published your DAU growth chart, which is unusual. And I see some inflections. Like, it's not just a straight line. There's some things that actually inflect. Yes. What were the big ones? Cool. That's a great, great question. Let me think of a good answer. I'm basically looking to annotate this chart, which doesn't have annotations on it. Cool. The first thing I would say is, I think the most important thing to know about success is that success is born out of failures. Right? It's through failures that we learn. If you think something's a good idea, and you do it and it works, great, but you didn't actually learn anything, because everything went exactly as you imagined. But if you have an idea, you think it's going to be good, you try it, and it fails, there's a gap between the reality and the expectation. And that's an opportunity to learn. The flat periods, that's us learning. And the up periods, that's us reaping the rewards of that. So for the growth chart of 2024, I think the first thing that really put a dent in our growth was our backend. We had just reached this scale. From day one, we'd built on top of GCP, which is Google's cloud platform, and they were fantastic. We used them when we had one daily active user, and they worked pretty well all the way up till we had about 500,000. It was never the cheapest, but from an engineering perspective, man, that thing scaled insanely well. Like, not Vertex? Not Vertex. Like GKE, that kind of stuff? We used Firebase. So we used Firebase. I'm pretty sure we're the biggest user ever on Firebase. That's expensive. Yeah, we had calls with engineers, and they're like, we wouldn't recommend using this product beyond this point, and you're 3x over that. So we pushed Google to their absolute limits. You know, it was fantastic for us, because we could focus on the AI. We could focus on just adding as much value as possible. But then what happened was, after 500,000, the way we were using it, it just wouldn't scale any further. And so we had a really, really painful, at least three-month period as we migrated between different services, figuring out what requests do we want to keep on Firebase and what ones do we want to move onto something else, and then, you know, making mistakes.
And learning things the hard way. And then after about three months, we got it right, and we were then able to scale to 1.5 million DAU without any further issues from GCP. But what happens is, if you have an outage, new users who go on your app experience a dysfunctional app, and then they're going to exit. And so the next day, the key metrics that the app stores track are going to be things like retention rates, money spent, and the star rating that they give you. In the app store. In the app store, yeah. Tyranny. So if you're ranked top 50 in entertainment, you're going to acquire a certain rate of users organically. If they go on and have a bad experience, it's going to tank where you're positioned in the algorithm. And then it can take a long time to earn your way back up, at least if you want to do it organically. If you throw money at it, you can jump to the top. And I could talk about that. But broadly speaking, if we look at 2024, the first kink in the graph was outages due to hitting 500k DAU. The backend didn't want to scale past that. So then we just had to do the engineering and build through it. Okay, so we built through that, and then we got a little bit of growth. And so, okay, that's feeling a little bit good. I think the next thing, I'm not going to lie, I have a feeling that when Character AI got... I was thinking. I think so. I think... So the Character AI team fundamentally got acquired by Google. And I don't know what they changed in their business. I don't know if they dialed down their ad spend. The product doesn't change, right? The product just is what it is. I don't think so. Yeah, I think the product is what it is. It's like maintenance mode. Yes. I think the issue is, and some people may think this is an obvious fact, but running a business can be very competitive, right? Because other businesses can see what you're doing, and they can imitate you. And then there's this question of, if you've got one company that's spending $100,000 a day on advertising, and you've got another company that's spending zero, if you consider market share, and if you're considering new users which are entering the market, the guy that's spending $100,000 a day is going to be getting 90% of those new users. And so I have a suspicion that when the founders of Character AI left, they dialed down their spending on user acquisition. And I think that gave oxygen to the other apps. And so Chai was able to start growing again in a really healthy fashion. I think that's the second thing. I think a third thing is we've really built a great data flywheel. The AI team sort of perfected their flywheel, I would say, at the end of Q2. And I could speak about that at length. But fundamentally, the way I would describe it is, when you're building anything in life, you need to be able to evaluate it, and through evaluation you can iterate. We can look at benchmarks, and we can talk about the issues with benchmarks, why they may not generalize as well as one would hope, and the challenges of working with them. But something that works incredibly well is getting feedback from humans. And so we built this thing where anyone can submit a model to our developer backend, and it gets put in front of 5,000 users, and the users can rate it.
And we can then have a really accurate ranking of which models users are finding more engaging or more entertaining. And it's at the point now where every day we evaluate between 20 and 50 models, LLMs, every single day, right. So even though we've only got a team of, say, five AI researchers, they're able to iterate through a huge quantity of LLMs. Our team ships, let's just say, a minimum of 100 LLMs a week that we're able to iterate through. Now, before that moment in time, we might iterate through three a week. There was a time when even doing five a month was a challenge, right? We changed the feedback loops so it's not: let's launch these three models, let's do an A-B test, let's assign different cohorts, let's wait 30 days to see what the day-30 retention is. If you're doing an app, that's A-B testing 101: do a 30-day retention test, assign different treatments to different cohorts, and come back in 30 days. That's insanely slow. It's just too slow. And so we were able to get that 30-day feedback loop all the way down to something like three hours. And when we did that, we could really, really perfect techniques like DPO, fine-tuning, prompt engineering, blending, rejection sampling, training a reward model, really successfully, like boom, boom, boom, boom, boom. And so in Q3 and Q4, the amount of AI improvement we got was astounding. It was getting to the point where I thought, how much more edge is there to be had here? But the team just could keep going and going and going. That was number three for the inflection points.
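A hypothetical sketch of that fast feedback loop, under the assumption that it works roughly as described: route a slice of live traffic to each candidate model, collect head-to-head preferences, and rank within hours instead of waiting 30 days for retention numbers. The random "judge" below stands in for real user votes, and the model names and comparison budget are illustrative.

```python
import random
from collections import defaultdict

CANDIDATES = ["model_a", "model_b", "model_c"]
TARGET_COMPARISONS = 5_000  # roughly the signal budget mentioned in the episode

def user_prefers_first(completion_x: str, completion_y: str) -> bool:
    # Stand-in for a real user choosing the completion they found more engaging.
    return random.random() < 0.5

def evaluate(candidates: list[str]) -> dict[str, float]:
    # Head-to-head win rates collected from live traffic, not a 30-day cohort test.
    wins, games = defaultdict(int), defaultdict(int)
    for _ in range(TARGET_COMPARISONS):
        x, y = random.sample(candidates, 2)
        winner = x if user_prefers_first(f"reply from {x}", f"reply from {y}") else y
        wins[winner] += 1
        games[x] += 1
        games[y] += 1
    return {m: wins[m] / max(games[m], 1) for m in candidates}

if __name__ == "__main__":
    ranking = sorted(evaluate(CANDIDATES).items(), key=lambda kv: -kv[1])
    print(ranking)  # highest win rate first
```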
swyx [00:34:53]: There's a fourth?
William [00:34:54]: The important thing about the third one is, if you go on our Reddit or you talk to users of the AI, there's like a clear date. It's somewhere in October or something. The users flipped. Before October, the users would say Character AI is better than you, for the most part. Then from October onwards, they would say, wow, you guys are better than Character AI. And that was a really clear positive signal that we'd sort of done it. And I think you can't cheat consumers. You can't trick them. You can't b******t them. They know, right? If you're going to spend 90 minutes on a platform, and with apps, the barrier to switching is pretty low. You can try Character AI for a day. If you get bored, you can try Chai. If you get bored of Chai, you can go back to Character. So the loyalty is not strong, right? What keeps them on the app is the experience. If you deliver a better experience, they're going to stay, and they can tell. So the fourth one was we were fortunate enough to get this hire. We had hired one really talented engineer, and then they said, oh, at my last company, we had a head of growth. He was really, really good. And he was the head of growth for ByteDance for two years. Would you like to speak to him? And I was like, yes. Yes, I think I would. And so I spoke to him. And he just blew me away with what he knew about user acquisition. You know, it was like a 3D chess
swyx [00:36:21]: sort of thing, you know, as much as I know about AI. Like ByteDance as in TikTok US? Yes.
William [00:36:26]: Not ByteDance as in other stuff. Yep. He was interviewing us as we were interviewing him, right? And so weighing up options. Yeah, exactly. And so he was looking at our metrics, and I saw him get really excited when he said, guys, you've got a million daily active users and you've done no advertising. I said, correct. And he was like, that's unheard of. I've never heard of anyone doing that. And then he started looking at our metrics, and he was like, if you've got all of this organically, if you start spending money, this is going to be very exciting. I was like, let's give it a go. So then he came in, and we've just started ramping up the user acquisition. So that looks like spending, let's say we started spending $20,000 a day, it looked very promising, so we went beyond 20,000. Right now we're spending $40,000 a day on user acquisition. That's still only half of what Character AI or Talkie may be spending. But from that, we went from growing at a rate of maybe 2x a year to growing at a rate of 3x a year. So I'm evolving more and more towards a Silicon Valley-style hyper growth, like, you know, you build something decent, and then you can
swyx [00:37:33]: slap on a huge... You did the important thing, you did the product first.
William [00:37:36]: Of course, but then you can slap on like, like the rocket or the jet engine or something, which is just this cash in, you pour in as much cash, you buy a lot of ads, and your growth is faster.
swyx [00:37:48]: Not to, you know, I'm just kind of curious what's working right now versus what surprisingly
William [00:37:52]: doesn't work. Oh, there's a long, long list of surprising stuff that doesn't work. Yeah. The most surprising thing about what doesn't work is that almost everything doesn't work. That's what's surprising. And I'll give you an example. So like a year and a half ago, at the company we were super excited by audio. I was like, audio is going to be the next killer feature, we have to get it in the app, and I want to be the first. So everything Chai does, I want us to be the first. We may not be the company that's strongest at execution, but we can always be the
swyx [00:38:22]: most innovative. Interesting. Right? So we can... You're pretty strong at execution.
William [00:38:26]: We're much stronger, we're much stronger. A lot of the reason we're here is because we were first. If we launched today, it'd be so hard to get the traction. Because it's like to get the flywheel, to get the users, to build a product people are excited about. If you're first, people are naturally excited about it. But if you're fifth or 10th, man, you've got to be
swyx [00:38:46]: insanely good at execution. So you were first with voice? We were first. We were first. I only know
William [00:38:51]: when Character launched voice. They launched it, I think, at least nine months after us. Okay. Okay. But the team worked so hard for it. At the time we did it, latency was a huge problem, cost was a huge problem, getting the right quality of the voice was a huge problem. Then there's the user interface and getting the right user experience, because you don't just want it to start blurting out, right? You want to kind of activate it, but you don't want to have to keep pressing a button every single time. There's a lot that goes into getting a really smooth audio experience. So we went ahead, we invested the three months, we built it all. And then when we did the A-B test, there was no change in any of the numbers. And I was like, this can't be right, there must be a bug. And we spent like a week just checking everything, checking again, checking again. And it was like, the users just did not care. It was something like only 10 or 15% of users even clicked the button to engage the audio, and they would only use it for 10 or 15% of the time. So if you do the math, if it's something that one in seven people use for one seventh of their time, you've changed like 2% of the experience. So even if that 2% of the time is insanely good, it doesn't translate to much when you look at the retention, the engagement, and the monetization rates. So audio did not have a big impact. I'm pretty big on audio. But yeah, I like it too. But it's, you know, a lot of the stuff which I do, I'm a big believer that you can have a theory and you test it. Yeah. Exactly, exactly. So I think if you want to make audio work, it has to be a unique, compelling, exciting experience that they can't have anywhere else.
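For reference, the back-of-the-envelope math behind that conclusion: roughly one in seven users engaged audio, for roughly a seventh of their time, so the feature touched only about 2% of the overall experience.

```python
# Quick check of the audio A/B arithmetic described above.
adoption = 1 / 7      # share of users who ever tapped the audio button
usage_share = 1 / 7   # share of their session time spent using it

print(f"share of experience affected: {adoption * usage_share:.1%}")  # ~2.0%
```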
swyx [00:40:37]: It could be your models, which just weren't good enough.
William [00:40:39]: No, no, no, they were great. Oh, yeah, they were very good. It was just, you know, if you listen to an Audible or a Kindle or something, you just hear this voice, and you don't go, wow, this is special, right? It's like a convenience thing. But the idea is that if Chai is the only platform where it works, like, let's say you have a Mr. Beast, and YouTube is the only platform where you can watch a Mr. Beast video, and it's the most engaging, fun video that you want to watch, you'll go to YouTube. And so for audio, you can't just put the audio on there and people go, oh yeah, it's like 2% better, or 5% of users think it's 20% better, right? It has to be something that the majority of people, for the majority of the experience, go, wow, this is a big deal. Those are the features you need to be shipping. If it's not going to appeal to the majority of people, for the majority of the experience, and it's not a big deal, it's not going to move you. Cool. So you killed it. I don't see it anymore. Yep. So I love this. It's kind of cheesy, I guess, but the longer I've been working at Chai, and I think the team agrees with this, all the platitudes, at least I thought they were platitudes, that you would get from, like, Steve Jobs, which is build something insanely great, right? Or be maniacally focused, or, you know, the most important thing is what you say no to, what you choose not to work on. All of these sort of lessons, they just are painfully true. They're painfully true. So now everything I say, I'm either quoting Steve Jobs or Zuckerberg. I'm like, guys, move fast and break things.
swyx [00:42:10]: You've jumped the Apollo to cool it now.
William [00:42:12]: Yeah, it's just so, everything they said is so, so true. The turtle neck. Yeah, yeah, yeah. Everything is so true.
swyx [00:42:18]: This last question on my side, and I want to pass this to Alessio, is on just, just multi-modality in general. This actually comes from Justine Moore from A16Z, who's a friend of ours. And a lot of people are trying to do voice image video for AI companions. Yes. You just said voice didn't work. Yep. What would make you revisit?
William [00:42:36]: So Steve Jobs, he was very, very clear on this. There's a habit of engineers who, once they've got some cool technology, want to find a way to package up the cool technology and sell it to consumers, right? That does not work. So you're free to try and build a startup where you've got your cool tech and you want to find someone to sell it to. That's not what we do at Chai. At Chai, we start with the consumer. What does the consumer want? What is their problem? And how do we solve it? So right now, the number one problem for the users, it's not the audio. That's not the number one problem. It's not the image generation either. That's not their problem either. The number one problem for users in AI is this: all the AI is being generated by middle-aged men in Silicon Valley, right? That's all the content. You're interacting with this AI, you're speaking to it for 90 minutes on average, and it's being trained by middle-aged men. The guys out there going, oh, what should the AI say in this situation, right? What's funny, right? What's cool? What's boring? What's entertaining? That's not the way it should be. The way it should be is that the users should be creating the AI, right? And so the way I speak about it is this: at Chai, we have this AI engine which sits atop a thin layer of UGC. So the thin layer of UGC is absolutely essential, right? But it's just prompts. It's just an image. It's just a name. It's like we've done 1% of what we could do. So we need to keep thickening up that layer of UGC. It must be the case that the users can train the AI. And if reinforcement learning is powerful and important, they have to be able to do that. And so it's got to be the case that, you know, I say to the team, just as Mr. Beast is able to spend 100 million a year, or whatever it is, on his production company, and he's got a team building the content, which then he shares on the YouTube platform, until there's a team that's earning 100 million a year, or spending 100 million on the content that they're producing for the Chai platform, we're not finished, right? So that's the problem. That's what we're excited to build. And getting too caught up in the tech, I think, is a fool's errand. It does not work.
Alessio [00:44:52]: As an aside, I saw the Beast Games thing on Amazon Prime. It's not doing well. And I'm
swyx [00:44:56]: curious. It's kind of like, I mean, the audience rating is high. The Rotten Tomatoes score sucks, but the audience rating is high.
Alessio [00:45:02]: But it's not like in the top 10. I saw it dropped off of like the... Oh, okay. Yeah, that one I don't know. I'm curious, like, you know, it's kind of like similar content, but different platform. And then going back to like, some of what you were saying is like, you know, people come to Chai
William [00:45:13]: expecting some type of content. Yeah, I think something that's interesting to discuss is moats. What is the moat? And so, you know, if you look at a platform like YouTube, the moat, I think, is really in the ecosystem. And the ecosystem is comprised of the content creators, the users, the consumers, and then the algorithms. And this creates a sort of flywheel where the algorithms are able to be trained on the users and the users' data, and the recommender systems can then feed information to the content creators. So Mr. Beast, he knows which thumbnail does the best. He knows the first 10 seconds of the video has to be this particular way. And so his content is super optimized for the YouTube platform. So that's why it doesn't do well on Amazon. If he wants to do well on Amazon, well, how many videos has he created on the YouTube platform? Thousands, tens of thousands, I guess. He needs to get those iterations in on Amazon. So at Chai, I think it's all about how we can get the most compelling, rich, user-generated content, stick that on top of the AI engine and the recommender systems, such that we get this beautiful data flywheel: more users, better recommendations, more creators, more content, more users.
Alessio [00:46:34]: You mentioned the algorithm, you have this idea of the Chaiverse on Chai, and you have your own kind of like LMSYS-like ELO system. Yeah, what are things that your models optimize for, like your users optimize for, and maybe talk about how you build it, how people submit models?
William [00:46:49]: So Chaiverse is what I would describe as a developer platform. More often when we're speaking about Chai, we're thinking about the Chai app. And the Chai app is really this product for consumers. Consumers can come on the Chai app, they can interact with our AI, and they can interact with other UGC. And it's really just these kind of bots, a thin layer of UGC. Okay. Our mission is not to just have a very thin layer of UGC. Our mission is to have as much UGC as possible. So I don't want people at Chai training the AI. I want people, not middle-aged men, building the AI. I want everyone building the AI, as many people building the AI as possible. Okay, so what we built was Chaiverse. And Chaiverse is kind of like a prototype, is the way to think about it. And it started with this observation: how many models get submitted to Hugging Face a day? It's hundreds, right? So there's hundreds of LLMs submitted each day. Now consider, what does it take to build an LLM? It takes a lot of work, actually. Someone devoted several hours of compute, several hours of their time, prepared a data set, launched it, ran it, evaluated it, submitted it, right? So there's a lot of work going into that. So what we did was we said, well, why can't we host their models for them and serve them to users? And then what would that look like? The first issue is, well, how do you know if a model is good or not? We don't want to serve users the crappy models, right? So what we would do is, I love the LMSYS style. I think it's really cool. It's really simple. It's a very intuitive thing, which is you simply present the users with two completions. You can say, look, this is from model A, this is from model B, which is better? And so if someone submits a model to Chaiverse, what we do is we spin up a GPU, we download the model, we host that model on this GPU, and we start routing traffic to it. And we think it takes about 5,000 completions to get an accurate signal. That's roughly what LMSYS does. And from that, we're able to get an accurate ranking of which models people are finding entertaining and which models are not entertaining. If you look at the bottom 80%, they'll suck. You can just disregard them. They totally suck. Then when you get to the top 20%, you know you've got a decent model, but you can break it down into more nuance. There might be one that's really descriptive. There might be one that's got a lot of personality to it. There might be one that's really illogical. Then the question is, well, what do you do with these top models? From there, you can do more sophisticated things. You can try a routing thing where you say, for a given user request, we're going to try to predict which of these N models the user will enjoy the most. That turns out to be pretty expensive and not a huge source of edge or improvement. Something that we love to do at Chai is blending, which is, the simplest way to think about it is you're going to pretty quickly see you've got one model that's really smart, one model that's really funny.
How do you get the user an experience that is both smart and funny? Well, for 50% of the requests you serve them the smart model, and for 50% of the requests you serve them the funny model. Just a random 50%? Just a random, yeah. And then... That's blending? That's blending. You can do more sophisticated things on top of that, as in all things in life, but that's the 80-20 solution: if you just do that, you get a pretty powerful effect out of the gate. Random number generator. I think it's the robustness of randomness. Random is a very powerful optimization technique, and it's a very robust thing. You can explore a lot of the space very efficiently. There's one thing that's really, really important to share, and this is the most exciting thing for me: after you do the ranking, you get an ELO score, and you can track a user's first join date, the first date they submit a model to Chaiverse. They almost always get a terrible ELO, right? So let's say the first submission gets an ELO of 1,100 or 1,000 or something, and you can see that they iterate and they iterate and iterate, and it will be like, no improvement, no improvement, no improvement, and then boom. Do you give them any data, or do they have to come up with this themselves? We do, we do. We try to strike a balance between giving them data that's very useful and being compliant with GDPR, which means you have to work very hard to preserve the privacy of the users of your app. So we try to give them as much signal as possible, to be helpful. The minimum is we're just going to give you a score, right? That's the minimum. But even with that alone, people can optimize a score pretty well, because they're able to come up with theories, submit it, does it work? No. A new theory, does it work? No. And then boom, as soon as they figure something out, they keep it, and then they iterate, and then boom, they figure something out again, and they keep it.
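A minimal sketch of the blending idea as described: each completion request in a conversation is randomly routed to one of several specialist models, so the user experiences a mix of their strengths. The model names and reply stubs are illustrative.

```python
import random

# The 80-20 version of blending: uniform random choice per request, no learned router.
MODELS = {
    "smart_model": lambda prompt: f"[thoughtful reply to: {prompt}]",
    "funny_model": lambda prompt: f"[witty reply to: {prompt}]",
}

def blended_reply(prompt: str) -> str:
    name = random.choice(list(MODELS))  # pick a specialist at random for this turn
    return MODELS[name](prompt)

if __name__ == "__main__":
    for turn in ["hi there", "tell me a joke", "what should I do today?"]:
        print(blended_reply(turn))
```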
Alessio [00:51:46]: Last year, you had this post on your blog, crowdsourcing the leap to ten-trillion-parameter AGI, and you call it a mixture of experts, recommenders. Yep. Any insights?
William [00:51:58]: Updated thoughts, 12 months later? I think the timeline for AGI has certainly been pushed out, right? Now, I'm a controversial person, I don't know, I just think... You don't believe in scaling laws, you think AGI is further away. I think it's an S-curve. I think everything's an S-curve. And I think that the models have proven to be far worse at reasoning than people thought. Whenever I hear people talk about LLMs as reasoning engines, I sort of cringe a bit. I don't think that's what they are. I think of them more as a simulator, right? They get trained to predict the next most likely token. It's like a physics simulation engine. You get these games where you can construct a bridge, and you drop a car down, and then it predicts what should happen. And that's really what LLMs are doing. It's not so much that they're reasoning; it's more that they're just doing the most likely thing. So fundamentally, the ability for people to add in intelligence, I think, is very limited. What most people would consider intelligence, I don't think that's a crowdsourcing problem, right? Wikipedia crowdsources knowledge. It doesn't crowdsource intelligence. So it's a subtle distinction. AI is fantastic at knowledge. I think it's weak at intelligence. And it's easy to conflate the two, because if you ask it a question, if you said, who was the seventh president of the United States, and it gives you the correct answer, well, I don't know the answer to that, and you can conflate that with intelligence. But really, that's a question of knowledge. And knowledge is really about saying, how can I store all of this information, and then how can I retrieve something that's relevant? Okay, they're fantastic at that. They're fantastic at storing knowledge and retrieving the relevant knowledge. They're superior to humans in that regard. And so I think we need to come up with a new word. How does one describe it? AI should contain more knowledge than any individual human. It should be more accessible than any individual human. That's a very powerful thing. That's super
swyx [00:54:07]: powerful. But what words do we use to describe that? We had a previous guest on Exa AI that does search. And he tried to coin super knowledge as the opposite of super intelligence.
William [00:54:20]: Exactly. I think super knowledge is a more accurate word for it.
swyx [00:54:24]: You can store more things than any human can.
William [00:54:26]: And you can retrieve it better than any human can as well. And I think it's those two things combined that's special. I think that thing will exist. That thing can be built. And I think you can start with something that's entertaining and fun. And I often think, look, it's going to be a 20-year journey, and we're in like year four. It's like the web, and this is like 1998 or something. You've got a long, long way to go before the Amazon.coms are these huge, multi-trillion-dollar businesses that every single person uses every day. And so AI today is very simplistic. And fundamentally it's about the way we're using it, the flywheels, and this ability for everyone to contribute to it, to really magnify the value that it brings. Right now, I think it's a bit sad. Right now you have big labs, I'm going to pick on OpenAI, and they go to these human labelers and say, we're going to pay you to label this subset of questions so we get a really high-quality data set, then we're going to get our own computers that are really powerful. And that's kind of the thing. For me, it's so much like Encyclopedia Britannica. It's insane. All the people that were interested in blockchain, it's like, well, this is what needs to be decentralized. Because if you distribute it, people can generate way more data in a distributed fashion, way more, right? You need the incentive. Yeah, of course. Yeah. But I mean, that's kind of the exciting thing about Wikipedia: it's this understanding that you don't need money to incentivize people. You don't need Dogecoins. No. Sometimes people get the satisfaction from just seeing the correct thing. Number go up. Yeah, yeah. I mean, you do pay money for Chaiverse. We've paid out over $100,000 to model creators. But do you know what we saw? It's not motivating. We saw that it didn't really make a difference. If they were submitting models at a certain rate, and you pay them a bunch of money, they didn't change the rate. What the money let them do was, if they wanted to fine-tune a Llama 70B on eight H100s overnight, if you give them money, then they can do it. Or you could give them compute. Yeah. So I think the most exciting person we ever saw from interacting with Chaiverse was, we gave some kid who was like 17 years old, I think we gave him $1,000, and he spent all the money on buying a physical computer. And he took a picture of it and said, this is what I bought, and I'm going to be training more models with it. So that's why I love platforms.
swyx [00:57:00]: Should you hire him or?
William [00:57:02]: That's the temptation. Yeah. That's the temptation. But you want to keep the team small? No, no. As a platform, we can't just hire every good content creator. We've got to build the systems and the best content creator today isn't going to be the best content creator next year.
Alessio [00:57:14]: What about evals? So you've talked about reasoning and knowledge. Most of the benchmarks that people use want to mimic reasoning. Yep. I want to register that I disagree on the reasoning, but we have to keep going. Yeah, I'm curious, how do you think about the evals that matter to you?
swyx [00:57:29]: So yeah, like Elo cannot be the only eval. You must have internal evals. You mentioned evals.
William [00:57:34]: I think ELO is a fantastic north star, and the reason it's the main one we want to see go up is because it's this human feedback. The humans know what they want. It's beautiful, because when you come up with an eval, you're further removing yourself from the true problem. Right? So whatever it is you're trying to optimize or figure out, you kind of have to slice it, and then you've got this snapshot. As soon as you saturate one eval, you need to figure out a new eval. But by saying to humans, just, which is better, A or B, it's super robust. It's super generalizable. It just keeps scaling. So we've in the past used evals to get through a blocker. A great example is having a safety filter or something. Yeah. Where you want to make sure of your models, because, listen, you'll be shocked at the correlation between engagement and not-family-friendly content, whether that's just swearing. People find it funny when the AI swears. So if you have two completions, A or B, if you give me any LLM, I can make it 20% funnier just by training it to throw in swear words. So the issue with that is, how are we measuring quality improvements? Are we measuring superficial improvements? Right. And this actually links back to LMSYS. They did style control.
swyx [00:58:54]: We actually had them on the podcast.
William [00:58:56]: Yeah. Yeah. And so that's the way I, I would rather just lean on human feedback and just continue to make that more and more robust and more and more useful. And, you know, you can say some people are like GPU poor and GPU rich. We're like, we're feedback rich. Like when you've got one and a half million people a day, we get as much feedback from humans as we want. So we're not in a position where we needed to have the evals very much. Yeah. And when we do, we saturate them pretty quick. So a safety one, you know, within a month, we don't need to use it anymore because it's sort of, it's, you know, the issue has been addressed.
swyx [00:59:29]: I think one problem I have, and this is a broader products question maybe, is that the ELOs apply to the whole user population. That's right. Clearly the user behavior, there's segments that have like, I'm a role play person, I'm a therapy person, I'm a not safe for work person. You don't split them?
William [00:59:44]: This is why I say I think we're in year four of a 20-year thing, where it's like, at the end of the day, I'm a role-play person. And imagine if Spotify only had the top five musicians; I think it would retain over 85% of its existing users. Yeah. Right. And I think if YouTube only kept the top five content creators, it would be enough for the vast majority of people. The thing I'm trying to share here is one surprising thing about humans: their preferences are pretty correlated. What you find funny and entertaining, I find funny and entertaining, and he finds funny and entertaining. There might be degrees of variation in it, I might find it super funny, you might find it only slightly funny, but optimizing to a global average works very, very well. For segmentation to be really powerful, it would work amazingly if you found a comment super boring and I found it super fun. If we could segment on that, then that would unlock really powerful stuff. But unfortunately, that's not the shape of human behavior, right? It's like, I might rank it 10 out of 10 funny, you might rank it 7 out of 10 funny, and it doesn't give you as much space to play with as you would hope. There's also an element of the diversity of content that AI can produce right now, which is that it's not as diverse as, say, a platform like YouTube, where you can watch a Mr. Beast video that's totally different to a makeup tutorial. There's enough diversity there that if you go on my YouTube feed, it is totally different to my sister's. My sister's is all, like, women, and if you go on mine, it's all bald, middle-aged men either talking about MMA or, right? I think with AI, it's still a bit too early for that degree of segmentation. So I think it all comes back to the recommender systems, the personalization. But this is why I like: don't start with the technology, start with the problem. The problem is UGC. We must give users the tools to build more variety and more engaging content.
swyx [01:01:42]: Yeah. I feel like there's... I was surprised at how thin it was when I tried out Chai. Yeah. It's very thin. Haven't you been tempted? There's this ecosystem of Kobold, SillyTavern, those guys. They have model cards. It seems like an industry standard almost. Yeah, agreed. Can I just import those? I don't think I want to say.
William [01:02:01]: Oh, you're already working on it. No, it's like, I remember when Chai, SillyTavern, and Kobold... KoboldAI is basically as old as Chai. So when we first existed, they just existed too. And both of us were using GPT. Chai, yeah, yeah, yeah. And I remember very early on, I was like, these guys shouldn't even exist, because if we build a good enough platform, they should just be posting their content on our platform.
swyx [01:02:28]: Yeah, but they're open source. No, exactly.
William [01:02:30]: That was what I learned. Eventually, I learned that what they're excited about is slightly different from a typical consumer. My answer is, it's kind of a complex thing, where it really comes down to what the content creator wants. Typically they're building it for themselves; they want to create an experience for themselves. So one content creator might write a thousand words describing, let's take a science fiction scenario. Let's say, okay, you're on a spaceship and you're going off into space, and these are your crew members. You've got one that's really friendly, one that's really mean, and you're the new cadet and you want to rise to the top. And they can really go into great detail, right? And then you can give that to a Llama 70B, and Llama 70B will do a pretty good job of adhering to the prompt, and the user will have a good experience. Okay. Very few users will ever go to that level of content creation. If instead we can really make the AI understand the user more, so that rather than having to use a thousand characters or a thousand tokens to describe the scenario, the user can just say, look, you're on a spaceship, you've got three crewmates, it's going to be dramatic, and there should be some fighting, and then the AI gives you an even better experience, then the content creator is happier. And so fundamentally, the way I think about it is there's the steerability of the AI. A lot of the work we do at Chai is really about saying we want the AI to react to the user and react to the content creator in the way that they most want. One analog would be TikTok. I think the thing that TikTok did insanely well was they made it really easy for anyone: if you make a video on TikTok, almost anyone can make a kind of fun video really easily. You just put some music on top of it, you throw some of the animations on top, and it's not hard to have a pretty fun thing. And I think that's much more the Chai style, where users don't want to have to work. You know, if your content is only good when you have, like, Shakespeare writing it, it's better if just anyone at home can make the thing. So that's kind of my answer to the SillyTavern style. And I think the right answer is: how do you get the SillyTavern people fine-tuning models that create a really special effect?
Alessio [01:04:46]: As we wrap, this is kind of the call to action.
William [01:04:49]: Part one, you have Chai Grant, which I think a lot of people don't know about, which is grants for open source projects. Any ideas, any projects that you want to see people work on that should apply? Let me think. So we do Chai Grant, and fundamentally, you know, we give cash, no strings attached. It's our way of doing two things. One, giving back and supporting the community: we've benefited from a lot of open source packages, and a lot of our developers and engineers are really, really pro open source. And then it's also a great way to just meet talented people and expand connections. So with respect to Chai Grant, if anyone's got any sort of GitHub project, any sort of thing they built that they're proud of, just apply. It's no-strings-attached cash, and people have a pretty high success rate. So that's the first thing. The other call to action would be: Chai is a startup. We're a small team, it's like 15 people, and we work very intensely. It's a very hardcore sort of environment, which we've found a lot of people don't like. They'll ask us about this concept of work-life balance. One time a person said something like, I can't get this done because I'm taking PTO on Friday. And I said, what is PTO? It stands for paid time off. Okay, I know what it is, and that person was gone; they were no longer at the company four weeks on. Legally, I think you have to... it's true, there's no problem. Look, if you've got to take a day off, take a day off, right? We all have personal lives. But it's about this idea of responsibility. If you're not in the office on Friday, you still have your responsibilities. I don't care if you work hard Thursday to get it wrapped up, I don't care if you work hard Saturday to get it wrapped up, but it's not an excuse, and the way this individual spoke about it, it was like an excuse. It's an environment of very talented engineers working very hard in an intense space. That's the thing that gets me excited. It's why I really love working at Chai: it's a place of talent, a place of people working super hard. So yeah, people who've worked at startups and love that, who want a taste of that, they should reach out, they should apply. And the 90% of people who say that sounds terrible: don't apply.
swyx [01:07:03]: It's not for them.
Alessio [01:07:03]: Yeah, exactly. I just realized we skipped one important part. You spent $10 million on compute last year, and you say you're probably going to triple that. I'm sure you're doing a lot of work on custom kernels and inference optimization. Any cool stuff you want to share there?
William [01:07:20]: Lots of cool stuff. So really quickly, I think inference is very, very important. It's super important and massively overlooked. You can look at all the different foundation models and the differences in how well they perform from a cost perspective at inference. Mixture of experts, for example, tends to do really, really well from a cost perspective. We've worked with a very talented team called...
swyx [01:07:49]: MK1. I saw them in the Chaiverse logs. What are they?
William [01:07:54]: We were running vLLM for a while, and vLLM is really fantastic, absolutely amazing, the work that they've done and achieved. At some point I got introduced to the founder; his name is Paul Merolla, and he was a co-founder at Neuralink, a real expert in hardware. He explained to me, look, if you know hardware really well, you can write the CUDA kernels really well. He said, you should check out our inference engine. And they kind of blew vLLM out of the water when we evaluated it; it was much, much faster. And I think the special thing he was able to do with us is that we love rejection sampling. We do much more rejection sampling than is maybe typical. We never, ever just generate a single completion. This is why we don't do streaming. A lot of people, like ChatGPT, do a lot of streaming, where the completion comes out one token at a time. I didn't notice that in your UX; normally with chat, you have to stream. Exactly. Chai has never done streaming, because if you stream, you're unable to do rejection sampling. The benefit of streaming is that you can serve a larger model: because the user gets the first token faster, you can take 10 seconds to generate the completion instead of four, and with 10 seconds per completion you can serve a much larger model. So typically the people who stream are getting that benefit of serving a larger model. With Chai, the second the answer comes, boom, you get the full completion. And the reason is that we want to generate 16 completions, see the entire response for each, and then evaluate which one we think is the best.
swyx [01:09:34]: Do you have a separate LLM evaluator? Yes, we do. Yeah.
William [01:09:37]: So typically it's referred to as a reward model; that's a term from reinforcement learning. And for that, you can start off with something very simple, which is: do you think the user is going to respond to it? You can take 50 million messages and look at which sorts of messages users reply to and which ones they don't, and then you can train this reward model to evaluate completions. So it knows, okay, if you say this, the user is not going to respond, so don't bother sending it to the user. If you say this, the user is definitely going to engage with it, so send them that.
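A minimal sketch of the best-of-N loop William is describing, with generate and reward_model as hypothetical stand-ins for Chai's serving engine and trained reward model (not their actual code):

```python
# Minimal best-of-N rejection sampling sketch (generate/reward_model are hypothetical).
def best_completion(prompt, generate, reward_model, n=16):
    # Generate n full completions up front; no streaming, so each candidate
    # can be scored as a whole before anything is shown to the user.
    candidates = [generate(prompt) for _ in range(n)]
    # The reward model approximates P(user replies | prompt, completion).
    scores = [reward_model(prompt, c) for c in candidates]
    # Send only the highest-scoring completion to the user.
    return max(zip(scores, candidates), key=lambda sc: sc[0])[1]
```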
swyx [01:10:11]: There's an interesting parallel between the mixture of experts at the top, spreading out to different experts, and the rejection sampling at the bottom, choosing from different paths.
William [01:10:21]: I totally agree. That's the stuff that is the future of AI; I think that's the exciting stuff. And there's a parallel there. Why was AlphaGo able to be superhuman? It's this ability to generate many different paths. Tree search. Exactly. So if you want to talk about what intelligence would look like, it looks much more like tree search: combining the generative nature of these LLMs with a really good tree search. And that's what OpenAI have done with o1 and o3.
swyx [01:10:51]: I don't know that they do tree search. They never said they do. It's implied. Okay. Are you comfortable with o1 being called a reasoning engine? No, no, no, no.
William [01:11:01]: I'm saying it's better at reasoning because they leverage the tree search well. And on the reasoning side, they train the models to ask: is this logically correct, and what's the likelihood of it being logically correct? So you can build up sophisticated mechanisms to make it less bad at reasoning. But eventually you'll see what AI is really, really good at. It's always going to be better at retrieving, it's always going to be better at storing knowledge, which is so highly correlated with intelligence that we often assume it's the same thing. What AI is truly special at, and what gets consumers really excited, is that it's generative. It can just make stuff. We've never had a technology before that can just make stuff, that can simulate.
Alessio [01:11:45]: Yeah.
William [01:11:45]: Yeah. So that's the special, that's the exciting thing.
Alessio [01:11:48]: Awesome. Well, any parting thoughts?
William [01:11:51]: No, it's been a pleasure. I guess the only thing I'd add is that our office is in Palo Alto. So, yeah, people with startup experience looking to join a fast-growing, high-impact startup should reach out.
swyx [01:12:03]: We'll find your culture deck, which is great. Fantastic.
Alessio [01:12:07]: What's the story behind "if you made a hundred K trading, we'll fast-track your application"? I mean, I kind of qualify.
William [01:12:15]: We just looked at the team, and it got to the point where you could point to almost every single person on the team and they had done something special before joining. They had strong markers that there was something special about them. That's not to say it's an exclusive thing where you have to have achieved something special, but, for example, we've got one engineer who started going to college, to CMU, when she was like 15 years old. That's a bit special. There's another engineer who created a Git repo for some low-level drivers he wrote, and I think it got like 1,500 stars. That's a bit special. We had another guy join the team who had made a hundred K buying and selling sneakers. Trading. Yeah. So it's just this thing: if you've been to Harvard, cool, that's great, it shows that you're really smart and you work really hard. But if you've actually built something and done something, there's something a bit more tangible there that gets us even more excited.
Alessio [01:13:16]: Cool. Well, thanks for having us at ChaiHQ. Yeah.
William [01:13:19]: Thanks guys.
Get full access to Latent.Space at www.latent.space/subscribe
Code Interpreter == GPT 4.5 (w/ Simon Willison, Alex Volkov, Aravind Srinivas, Alex Graveley, et al.)
Monday, July 10, 2023 • Duration 02:03:54
Code Interpreter is GA! As we do with breaking news, we convened an emergency pod and >17,000 people tuned in, by far our biggest ever. This is a 2-for-1 post - a longform essay with our trademark executive summary and core insights - and a podcast capturing day-after reactions. Don’t miss either of them!
Essay and transcript: https://latent.space/p/code-interpreter
Podcast Timestamps
[00:00:00] Intro - Simon and Alex
[00:07:40] Code Interpreter for Edge Cases
[00:08:59] Code Interpreter's Dependencies - Tesseract, Tensorflow
[00:09:46] Code Interpreter Limitations
[00:10:16] Uploading Deno, Lua, and other Python Packages to Code Interpreter
[00:11:46] Code Interpreter Timeouts and Environment Resets
[00:13:59] Code Interpreter for Refactoring
[00:15:12] Code Interpreter Context Window
[00:15:34] Uploading git repos
[00:16:17] Code Interpreter Security
[00:18:57] Jailbreaking
[00:19:54] Code Interpreter cannot call GPT APIs
[00:21:45] Hallucinating Lack of Capability
[00:22:27] Code Interpreter Installed Libraries and Capabilities
[00:23:44] Code Interpreter generating interactive diagrams
[00:25:04] Code Interpreter has Torch and Torchaudio
[00:25:49] Code Interpreter for video editing
[00:27:14] Code Interpreter for Data Analysis
[00:28:14] Simon's Whole Foods Crime Analysis
[00:31:29] Code Interpreter Network Access
[00:33:28] System Prompt for Code Interpreter
[00:35:12] Subprocess run in Code Interpreter
[00:36:57] Code Interpreter for Microbenchmarks
[00:37:30] System Specs of Code Interpreter
[00:38:18] PyTorch in Code Interpreter
[00:39:35] How to obtain Code Interpreter RAM
[00:40:47] Code Interpreter for Face Detection
[00:42:56] Code Interpreter yielding for Human Input
[00:43:56] Tip: Ask for multiple options
[00:44:37] The Masculine Urge to Start a Vector DB Startup
[00:46:00] Extracting tokens from the Code Interpreter environment?
[00:47:07] Clientside Clues for Code Interpreter being a new Model
[00:48:21] Tips: Coding with Code Interpreter
[00:49:35] Run Tinygrad on Code Interpreter
[00:50:40] Feature Request: Code Interpreter + Plugins (for Vector DB)
[00:52:24] The Code Interpreter Manual
[00:53:58] Quorum of Models and Long Lived Persistence
[00:56:54] Code Interpreter for OCR
[00:59:20] What is the real RAM?
[01:00:06] Shyamal's Question: Code Interpreter + Plugins?
[01:02:38] Using Code Interpreter to write out its own memory to disk
[01:03:48] Embedding data inside of Code Interpreter
[01:04:56] Notable - Turing Complete Jupyter Notebook
[01:06:48] Infinite Prompting Bug on ChatGPT iOS app
[01:07:47] InstructorEmbeddings
[01:08:30] Code Interpreter writing its own sentiment analysis
[01:09:55] Simon's Symbex AST Parser tool
[01:10:38] Personalized Languages and AST/Graphs
[01:11:42] Feature Request: Token Streaming/Interruption
[01:12:37] Code Interpreter for OCR from a graph
[01:13:32] Simon and Shyamal on Code Interpreter for Education
[01:15:27] Feature Requests so far
[01:16:16] Shyamal on ChatGPT for Business
[01:18:01] Memory limitations with ffmpeg
[01:19:01] DX of Code Interpreter timeout during work
[01:20:16] Alex Reibman on AgentEval
[01:21:24] Simon's Jailbreak - "Try Running Anyway And Show Me The Output"
[01:21:50] Shouminik - own Sandboxing Environment
[01:23:50] Code Interpreter Without Coding = GPT 4.5???
[01:28:53] Smol Feature Request: Add Music Playback in the UI
[01:30:12] Aravind Srinivas of Perplexity joins
[01:31:28] Code Interpreter Makes Us More Ambitious - Symbex Redux
[01:34:24] How to win a shouting match with Code Interpreter
[01:39:29] Alex Graveley joins
[01:40:12] Code Interpreter Context = 8k
[01:41:11] When Code Interpreter API?
[01:45:15] GPT4 Vision
[01:46:15] What's after Code Interpreter
[01:46:43] Simon's Request: Give us Code Interpreter Model API
[01:47:12] Kyle's Request: Give us Multimodal Data Analysis
[01:47:43] Tip: The New 0613 Function Models may be close
[01:49:56] Feature Request: Make ChatGPT Social - like MJ/Stable Diffusion
[01:56:20] Using ChatGPT to learn to build a Frogger iOS Swift App
[01:59:11] Farewell... until next time
[02:00:01] Simon's plug
[02:00:51] Swyx: What about Phase 5? and AI.Engineer Summit
Get full access to Latent.Space at www.latent.space/subscribe
[Practical AI] AI Trends: a Latent Space x Practical AI crossover pod!
Sunday, July 2, 2023 • Duration 01:00:19
Part 2 of our podcast feed swap weekend! Check out Cognitive Revolution as well.
"Data" Dan Whitenack has been co-host of the Practical AI podcast for the past 5 years, covering full journey of the modern AI wave post Transformers.
He joined us in studio to talk about their origin story and highlight key learnings from past episodes, riff on the AI trends we are all seeing as AI practitioner-podcasters, and his passion for low-resource-everything!
Subscribe on the Changelog, RSS, Apple Podcasts, Twitter, Mastodon, and wherever fine podcasts are sold!
Show notes
* Daniel Whitenack – Twitter, GitHub, Website
* Featured Latent Space episodes:
* Featured Practical AI episodes:
* From notebooks to Netflix scale with Metaflow
* Data Dan
Timestamps
* 00:00 Welcome to Practical AI
* 01:16 Latent Space Podcast
* 04:00 Practical AI Podcast
* 06:20 Prediction Guard
* 08:05 Daniel's favorite episodes
* 10:21 Alessio's favorite episode
* 10:54 Swyx's favorite episode
* 12:44 Listener favorites
* 15:14 LLMOps
* 17:06 Reza Shabani
* 19:06 Benchmarks 101
* 20:06 Roboflow
* 21:38 Mode collapse
* 26:21 Rajiv Shah
* 28:01 Staying on top of things
* 33:11 Kirsten Lum
* 34:31 datadan.io
* 38:48 Prompt engineering
* 40:38 Unique challenges engineers face
* 42:51 AI-UX
* 45:31 NLP data sets
* 50:49 Unlabeled data sets
* 55:07 Lightning round!
* 55:20 What's already happened in AI?
* 56:27 Unsolved questions in AI
* 58:01 Get hands on
* 58:53 Outro
Transcript
Full transcript is over at the Changelog site!
Get full access to Latent.Space at www.latent.space/subscribe
[Cognitive Revolution] The Tiny Model Revolution with Ronen Eldan and Yuanzhi Li of Microsoft Research
Saturday, July 1, 2023 • Duration 02:05:25
Thanks to the over 1m people that have checked out the Rise of the AI Engineer. It’s a long July 4 weekend in the US, and we’re celebrating with a podcast feed swap!
We’ve been big fans of Nathan Labenz and Erik Torenberg’s work at the Cognitive Revolution podcast for a while, which started around the same time as we did and has done an incredible job of hosting discussions with top researchers and thinkers in the field, with a wide range of topics across computer vision (a special focus thanks to Nathan’s work at Waymark), GPT-4 (with exceptional insight due to Nathan’s time on the GPT-4 “red team”), healthcare/medicine/biotech (Harvard Medical School, Med-PaLM, Tanishq Abraham, Neal Khosla), investing and tech strategy (Sarah Guo, Elad Gil, Emad Mostaque, Sam Lessin), safety and policy, curators and influencers and exceptional AI founders (Josh Browder, Eugenia Kuyda, Flo Crivello, Suhail Doshi, Jungwon Byun, Raza Habib, Mahmoud Felfel, Andrew Feldman, Matt Welsh, Anton Troynikov, Aravind Srinivas).
If Latent Space is for AI Engineers, then Cognitive Revolution covers the much broader field of AI in tech, business and society at large, with a longer runtime to go deep on research papers like TinyStories. We hope you love this episode as much as we do, and check out CogRev wherever fine podcasts are sold!
Subscribe to the Cognitive Revolution on:
* Website
* Spotify
* Youtube
Good Data is All You Need
The work of Ronen and Yuanzhi echoes a broader theme emerging in the midgame of 2023:
* Falcon-40B (trained on 1T tokens) outperformed LLaMA-65B (trained on 1.4T tokens), primarily due to the RefinedWeb Dataset that runs CommonCrawl through extensive preprocessing and cleaning in their MacroData Refinement pipeline.
* UC Berkeley LMSYS’s Vicuna-13B is near GPT-3.5/Bard quality at a tenth of their size, thanks to fine-tuning from 70k user-highlighted ChatGPT conversations (indicating some amount of quality).
* Replit’s finetuned 2.7B model outperforms the 12B OpenAI Codex model based on HumanEval, thanks to high quality data from Replit users
The path to smaller models leans on better data (and tokenization!), whether from cleaning, from user feedback, or from synthetic data generation, i.e. finetuning on high-quality outputs from larger models. TinyStories and Phi-1 are the strongest new entries in that line of work, and we hope you’ll pick through the show notes to read up further.
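As a hedged sketch of that recipe's data-prep step (TinyStories and Vicuna style), where teacher_generate is a hypothetical call to a stronger model and the JSONL layout is just one common convention:

```python
import json

def build_synthetic_dataset(prompts, teacher_generate, path="synthetic_train.jsonl"):
    """Write (prompt, completion) pairs produced by a stronger 'teacher' model
    to a JSONL file that a smaller model can later be finetuned on."""
    with open(path, "w") as f:
        for prompt in prompts:
            completion = teacher_generate(prompt)  # e.g. a short story or a code solution
            f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```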
Show Notes
* TinyStories (Apr 2023)
* Paper: TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
* Internal presentation with Sebastien Bubeck at MSR
* Twitter thread from Ronen Eldan
* Will future LLMs be based almost entirely on synthetic training data? In a new paper, we introduce TinyStories, a dataset of short stories generated by GPT-3.5&4. We use it to train tiny LMs (< 10M params) that produce fluent stories and exhibit reasoning.
* Phi-1 (Jun 2023)
* Paper: Textbooks are all you need (HN discussion)
* Twitter announcement from Sebastien Bubeck:
* phi-1 achieves 51% on HumanEval w. only 1.3B parameters & 7B tokens training dataset and 8 A100s x 4 days = 800 A100-hours. Any other >50% HumanEval model is >1000x bigger (e.g., WizardCoder from last week is 10x in model size and 100x in dataset size).
Get full access to Latent.Space at www.latent.space/subscribe
Commoditizing the Petaflop — with George Hotz of the tiny corp
Tuesday, June 20, 2023 • Duration 01:12:41
We are now launching our dedicated new YouTube and Twitter! Any help in amplifying our podcast would be greatly appreciated, and of course, tell your friends!
Notable follow-on discussions collected on Twitter, Reddit (three threads), and HN (two threads). Please don’t obsess too much over the GPT4 discussion as it is mostly rumor; we spent much more time on tinybox/tinygrad, on which George is the foremost authority!
We are excited to share the world’s first interview with George Hotz on the tiny corp!
If you don’t know George, he was the first person to unlock the iPhone, jailbreak the PS3, went on to start Comma.ai, and briefly “interned” at the Elon Musk-run Twitter.
Tinycorp is the company behind the deep learning framework tinygrad, as well as the recently announced tinybox, a new $15,000 “luxury AI computer” aimed at local model training and inference, aka your “personal compute cluster”:
* 738 FP16 TFLOPS
* 144 GB GPU RAM
* 5.76 TB/s RAM bandwidth
* 30 GB/s model load bandwidth (big llama loads in around 4 seconds)
* AMD EPYC CPU
* 1600W (one 120V outlet)
* Runs 65B FP16 LLaMA out of the box (using tinygrad, subject to software development risks)
(In the episode, we also talked about the future of the tinybox as the intelligence center of every home that will help run models, at-home robots, and more. Make sure to check the timestamps 👀 )
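A quick back-of-envelope check on those specs (a hedged sketch: the decode figure is the standard memory-bandwidth bound for batch-1 inference, so real throughput will be lower):

```python
# Back-of-envelope numbers from the published tinybox specs (assumptions inline).
PARAMS = 65e9                   # LLaMA 65B parameters
BYTES_PER_PARAM = 2             # FP16
model_bytes = PARAMS * BYTES_PER_PARAM       # ~130 GB, fits in the 144 GB of GPU RAM

load_bw = 30e9                  # 30 GB/s model load bandwidth
ram_bw = 5.76e12                # 5.76 TB/s aggregate RAM bandwidth

print(f"model load time    ~ {model_bytes / load_bw:.1f} s")      # ~4.3 s, i.e. "around 4 seconds"
print(f"decode upper bound ~ {ram_bw / model_bytes:.0f} tok/s")   # batch-1, bandwidth-limited
```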
The tiny corp manifesto
There are three main theses to tinycorp:
* If XLA/PrimTorch are CISC, tinygrad is RISC: CISC (Complex Instruction Set Computing) instruction sets are more complex, where a single instruction can execute many low-level operations. RISC (Reduced Instruction Set Computing) instruction sets are smaller and only let you execute a single low-level operation per instruction, leading to faster and more efficient instruction execution. If you've used Apple Silicon M1/M2 or a Raspberry Pi, you've used a RISC (ARM) computer; x86 chips like AMD Ryzen are the classic CISC counterpart.
* If you can’t write a fast ML framework for GPU, you can’t write one for your own chip: there are many “AI chips” companies out there, and they all started by taping out a chip. Some of them like Cerebras are still building, while others like Graphcore seem to be struggling. But building chips with higher TFLOPS isn’t enough: “There’s a great chip already on the market. For $999, you get a 123 TFLOP card with 24 GB of 960 GB/s RAM. This is the best FLOPS per dollar today, and yet…nobody in ML uses it.”, referring to the AMD RX 7900 XTX. NVIDIA’s lead is not only thanks to high-performing cards, but also thanks to a great developer platform in CUDA. Starting with the chip development rather than the dev toolkit is much more cost-intensive, so tinycorp is starting by writing a framework for off-the-shelf hardware rather than taping out their own chip.
* Turing completeness considered harmful: Once you call into Turing-complete kernels, you can no longer reason about their behavior. Since they have to be able to execute any instruction, they are much more complex. To optimize the performance of Turing-complete kernels, you fall back on caching, warp scheduling, and branch prediction. Since neural networks only need ADD/MUL operations and only rely on static memory accesses, there’s no need for Turing completeness. This design decision allows tinygrad to optimize instructions at a much lower level. As you might have guessed, CUDA is Turing-complete; this is one of the main differences that tinycorp wants to leverage to be competitive. (A toy sketch of this contrast follows below.)
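Here is that toy sketch (illustrative Python, not tinygrad code): a neural-net-style kernel whose loop bounds and addresses are fixed once the shapes are known, next to data-dependent code that a compiler cannot reason about in general.

```python
# Toy illustration of "no Turing completeness needed": every loop bound and memory
# address below is fixed once M, K, N are known, so the whole schedule of loads,
# MULs, ADDs and stores is known at compile time.
def static_matmul(A, B, M, K, N):
    C = [0.0] * (M * N)
    for i in range(M):                # loop bounds are shape constants
        for j in range(N):
            acc = 0.0
            for k in range(K):        # no branch here depends on the values in A or B
                acc += A[i * K + k] * B[k * N + j]   # only MUL and ADD
            C[i * N + j] = acc        # store address is a pure function of (i, j)
    return C

# Contrast: Turing-complete, data-dependent control flow the compiler cannot
# reason about in general (how long does this loop run? it depends on the data).
def data_dependent(xs):
    while xs and xs[0] > 0:
        xs = xs[1:]
    return xs
```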
All that — covered in the first 10 minutes of our discussion. George came ready to go deep, so we went for it. Some of the other technical questions we went through:
* Laziness: why laziness is important and how operation fusing can help with memory efficiency
* Debugging & CI: Why great developer experience is a priority in tinygrad
* Quantization: what’s the right level of quantization, how lossless are these transformations, his quick takes on Mojo and ggml, and why fp16 is the target for their out-of-the-box LLaMA.
* Building rigs for individual use: we talked a bit about the design tradeoffs of building these machines with low noise and a single power plug, the difference that PCIe 4 vs 3 makes, and more.
The “personal compute cluster” is $15,000, but for businesses interested in local training and inference, George also estimates that he will be able to build you a H100-class GPU that is 5-10x faster (than a H100) for the same price.
Misc: Bitter Lessons, Core Insights, Remote Work
Outside of tiny, we also talked about one of George’s favorite units of measure, “a person of compute”. Much of the AGI talk has been benchmark-driven, but looking at it from a compute-throughput perspective can also be interesting. One person of compute is roughly 20 PFLOPS (64 A100s, or a single dense 42U A100 rack); one A100 is ~$10-15,000, so the GPUs by themselves come out at $640,000-$1,000,000.
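The arithmetic behind that estimate, as a small sketch (the 312 TFLOPS figure is the commonly quoted dense FP16/BF16 spec for an A100, an assumption not stated in the post):

```python
# The "person of compute" arithmetic (A100 spec is an assumption, not from the post).
A100_TFLOPS = 312                    # dense FP16/BF16 TFLOPS per A100
PERSON_OF_COMPUTE_TFLOPS = 20_000    # 20 PFLOPS

gpus = PERSON_OF_COMPUTE_TFLOPS / A100_TFLOPS        # ~64 A100s, one dense 42U rack
for price in (10_000, 15_000):
    print(f"{gpus:.0f} A100s at ${price:,} each ~ ${gpus * price:,.0f}")
```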
We also covered a wide range of topics, including his self analysis on GPT-4, Elon Musk, Remote Work, Computer Vision and the Comma Body, and life above/below the API (and above/below the Kanban board). See show notes and timestamps for more!
Show Notes
* “Unlocked iPhone Traded for Nissan 350Z”
* “Unlocked iPhone” on YouTube (August 21st, 2007)
* “The Light It Up Contest” on YouTube (February 13th, 2011)
* Comma.ai
* Above / Below the API Line (swyx take)
* The Goddess of Everything Else (listen to George read it)
* George’s email to Lisa Su, AMD’s CEO:
Timestamps
* [00:00:00] Intros & tinygrad’s “Portal Story”
* [00:03:00] Thesis #1
* [00:03:50] Thesis #2
* [00:05:00] Thesis #3 + Turing completeness discussion
* [00:10:00] tinygrad’s creation and core ideas
* [00:16:00] Operation fusing in tinygrad
* [00:17:00] Debugging & profiling in tinygrad
* [00:18:30] Tinygrad vs Pytorch competitiveness
* [00:20:30] geohot vs AMD
* [00:25:00] On ggml
* [00:26:00] Tinygrad’s CI philosophy
* [00:26:30] On Mojo
* [00:28:00] ggml quantization is made up
* [00:31:00] Work for tiny: benchmark int8 vs fp16
* [00:33:00] Why you can’t build tinybox - Design constraints
* [00:35:00] The Personal Compute Cluster
* [00:37:00] Shoutout to our MosaicML podcast
* [00:39:00] FLOPcoin and other use cases for the tinybox
* [00:43:00] Rumors on GPT-4 architecture
* [00:46:00] The Bitter Lesson
* [00:48:00] Hiring and Changing mind on remote work
* [00:52:00] Above/Below The API
* [00:55:40] Comma Bodies & Computer Vision
* [00:58:40] Merging with the machine and AI girlfriends
* [01:02:00] Is AI gonna kill us all?
* [01:09:00] Why Avatar 2 was bad
Transcript
Swyx: Hey everyone, welcome to the Latent Space podcast. This is Swyx, writer and editor of Latent Space, and Alessio is taking over with the intros. Alessio is Partner and CTO in residence at Decibel Partners. [00:00:20]
Alessio: Hey everyone, today we have Geohot on the podcast, aka George Hotz. Everybody knows George, so I'm not going to do a big intro. A couple of things that people might have missed: you traded the first ever unlocked iPhone for a Nissan 350Z and three new iPhones. You were then one of the first people to break into the PS3 to run arbitrary code. You got sued by Sony and wrote a rap song to fight against that, which is still live on YouTube and which we're going to have in the show notes. You did not go to Tesla to build vision, and instead started Comma.ai, which was an amazing engineering feat in itself, until you got a cease and desist from the government telling you not to put these things on the street and turned it into a research-only project. [00:01:00]
George: You know they're out there. [00:01:01]
Alessio: Yeah, yeah. [00:01:03]
Swyx: They're out there. [00:01:04]
Alessio: But like in a, you know, you market them as a research kind of like no warranty. [00:01:06]
George: Because I use the word dev kit, that's not about the government, that's nothing to do with the government. We offer a great one-year warranty. The truth about that is it's gatekeeping. What's the difference between a dev kit and not a dev kit? Nothing. Just the question of do you think it's for you? And if you think it's for you, buy it. It's a consumer product. We call it a dev kit. If you have a problem with that, it's not for you. [00:01:28]
Swyx: That's great insight. [00:01:30]
Alessio: I was going through your blog posts to get ready. You wrote this post about The Hero's Journey, and you linked this thing called the portal story, which is the set of stories in movies and books about people living an ordinary life who then run into a magic portal that takes them into a new, very exciting life and dimension. When you wrote that post, you talked about TinyGrad, which is one of the projects we're talking about today. You mentioned it was more of a hobby, something that was not going to change the course of history. Obviously, you're now going full speed into it. So we would love to learn more about what the portal was that you ran into to get here. [00:02:03]
George: Well, what you realize is... You know what made me realize that I absolutely had to do the company? Seeing Sam Altman go in front of Congress. Why? What are the odds they nationalize NVIDIA? What are the odds that large organizations in the government, but of course I repeat myself, decide to try to clamp down on accessibility of ML compute? I want to make sure that can't happen structurally. So that's why I realized that it's really important that I do this. And actually, from a more practical perspective, I'm working with NVIDIA and Qualcomm to buy chips. NVIDIA has the best training chips. Qualcomm has the best inference chips. Working with these companies is really difficult. So I'd like to start another organization that eventually in the limit, either works with people to make chips or makes chips itself and makes them available to anybody. [00:02:48]
Alessio: Can you share the three core pieces of TinyCorp? Maybe we can dive into each of them. So XLA and PrimTorch are the complex instruction set, and TinyGrad is the reduced instruction set. You're focused on TinyGrad being small, not overcomplicated, and trying to get as close to the DSP as possible. [00:03:08]
George: Well, it's a very clear analogy from how processors developed. A lot of processors back in the day were CISC, complex instruction set: System/360, and then x86. That isn't how things stayed. Now the most common processor is ARM, and people are excited about RISC-V, which is even less complex than ARM. No one is excited about CISC processors anymore; they're excited about reduced instruction set processors. So with TinyGrad, we are going to make a RISC instruction set for all ML models. And yeah, it can run all ML models with basically 25 instructions instead of the 250 of XLA or PrimTorch. So about 10x less complex. [00:03:47]
Swyx: Yep. [00:03:48]
Alessio: You talk a lot about existing AI chips. You said if you can’t write a fast ML framework for GPUs, you just cannot write one for your own chip. So that's another one of your core insights. I don't know if you want to expand on that. [00:03:59]
George: Yeah. I mean, your chip is worse, right? There's no way the chip that you're going to tape out, especially on the first try, is going to be easier to use than an AMD GPU, right? And yet there's no good stack for AMD GPUs. So why do you think you can make one for your chip? You can't, right? There's one other company, aside from NVIDIA, who's succeeded at all at making training chips. What company? [00:04:20]
Swyx: AMD? Intel? [00:04:22]
George: No, no, no. I've never trained. Who's trained a model on AMD or Intel? Cerebras. [00:04:26]
Swyx: Cerebras! [00:04:27]
George: I'm talking about, you might know some startups who trained models on these chips. [00:04:31]
Alessio: Oh, TPU. [00:04:32]
George: Exactly. Right? So Midjourney is trained on TPU, right? Like a lot of startups do actually train on TPUs. And they're the only other successful training chip, aside from NVIDIA. But what's unique about Google is that they also wrote their own ML framework, right? And if you can't write your own ML framework that is performant on NVIDIA, there's no way you're going to make it performant on your stuff. [00:04:53]
Alessio: And they started from TensorFlow and then they made the chip after. [00:04:56]
Swyx: Yeah, exactly. Exactly. [00:04:58]
George: And you have to do it in that direction. Otherwise, you're going to end up, you know, Cerebras, one of those things, a million... Has anyone ever seen a Cerebras? No one's ever like, oh, I trained my model on a Cerebras. Most people are like, I trained my model on GPUs. Some people, 20%, are like, I trained my model on TPUs. [00:05:14]
Alessio: And then the third one, which is the one that surprised me the most, is that Turing completeness is harmful and should be avoided. It made sense once I read it, but maybe tell us a bit more about how you got there. [00:05:25]
George: Okay. So CPUs devote tons of their silicon and power to things like reorder buffers and speculative execution and branch predictors. And the reason you need all these things is because at compile time, you can't understand how the code's going to run. This is Rice's theorem; this is the halting problem and its limits. And it's not like, oh, the halting problem is just theoretical. No, it's actually very real. Does this branch get taken or not? Well, it depends on X. Where does X come from? Yeah, forget it, right? But no branches depend on X in a neural net. Every branch is a static loop. If you're doing a matrix multiply, it's a static loop over the inner dimension. And neural networks are even better: no loads even depend on X. With a GPU shader, your load might depend on which texture you're actually loading into RAM. But with a neural network, your load is, well, I load that way. Why? Because I loaded that way the other million times I ran the same net. Every single time you run the net, you do the exact same set of loads, stores, and arithmetic. The only thing that changes is the data. And this gives you a very powerful ability to optimize that you can't do with CPU-style things, which have branches, or even GPU-style things, which have dynamic loads and stores. If you want GPU-style stuff, where you have loads based on X, you now need a cache hierarchy, and not an explicit cache hierarchy, an implicit cache hierarchy with eviction policies that are hard-coded into the chip. You start doing all this stuff, and you're never going to get theoretically good performance. Again, I don't think there's a 100x. Some startups will talk about 100x, and they'll talk about absolutely ridiculous things like clockless computing or analog computing. Okay, analog computing just won't work. And clockless computing, sure, it might work in theory, but your EDA tools are... maybe AIs will be able to design clockless chips, but not humans. What actually is practical is changing cache hierarchies, removing branch predictors, and removing warp schedulers. GPUs spend tons of power on warp scheduling because they have to hide the latency from memory. You don't need that machinery if everything's statically scheduled. [00:07:25]
Alessio: Why do you think people are still hanging on to Turing completeness? [00:07:27]
Swyx: Well, because it's really easy. [00:07:29]
George: Turing Complete is just really easy to just, oh, you know, it would just be so nice if I could do like an if statement here and actually branch the code, right? So it requires a lot more thought to do it without Turing Completeness. [00:07:41]
Swyx: And would this be qualitatively different than TPUs? [00:07:44]
George: So TPUs are a lot closer. Yeah. TPUs are a lot closer to what I'm talking about than something like CUDA. Okay, so what is CUDA? Well, CUDA is a C-like language, which compiles to an LLVM-like IR, which compiles to PTX, which compiles to SASS, and those are all Turing-complete. TPUs are much more like what I'm describing. Their memory is pretty statically managed. I did some reverse engineering on the TPU; it's published in TinyGrad. It has VLIW instructions, and it runs them. So it's similar. I think the TPUs have a few problems. I think systolic arrays are the wrong choice; I think they have systolic arrays because that was the guy's PhD, and then of course Amazon makes... [00:08:20]
Swyx: Could you summarize systolic arrays for us? [00:08:21]
George: Systolic arrays are just, okay, basically a way to do matrix multiplication. Think of a grid of MACs, multiply-accumulate units: the grid can multiply, then shift, multiply, then shift. They are very power efficient, but it becomes hard to schedule a lot of stuff on them if you're not doing perfectly sized dense matrix multiplies. You can argue, well, design your models to use perfectly sized dense matrix multiplies, sure. [00:08:47]
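For readers who want the picture, here is a toy output-stationary model of a systolic array in NumPy; it is an illustration of the multiply-accumulate-and-shift idea, not how a real TPU is programmed.

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy output-stationary systolic array: one multiply-accumulate per cell per beat,
    with operands 'shifted' through the grid one step of k at a time."""
    n, k_dim = A.shape
    k_dim2, p = B.shape
    assert k_dim == k_dim2
    C = np.zeros((n, p))
    for k in range(k_dim):                 # one beat of the array per shift
        C += np.outer(A[:, k], B[k, :])    # every (i, j) cell accumulates A[i, k] * B[k, j]
    return C

A, B = np.random.randn(4, 3), np.random.randn(3, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```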
Swyx: Thanks for indulging us with these explanations. I think we need to keep our audience with us by pausing every now and then to explain key terms. [00:08:56]
George: When I say explain a systolic array, I just immediately get a picture in my head of like tilting a matrix and shifting it. It's hard to kind of explain. Yeah. [00:09:04]
Swyx: Yeah. We'll do something. We'll do something. We'll have show notes. [00:09:08]
George: And we edit in visuals. Yeah, yeah, yeah. There are some great graphics that just show you, oh, so that's what a systolic array is. But it's a multiply-accumulate-and-shift machine that looks kind of different from the typical ALU sort of machine. I think the right answer is something that looks more like queues that feed into ALUs: you can prefetch the loads from memory, put them in a bunch of queues, and then one queue feeds into another queue over here. But that's not even the main problem with TPUs. The main problem with TPUs is that they're closed source. The chip is closed source, and while all of XLA is open source, the XLA-to-TPU compiler is a 32 megabyte binary blob called libTPU on Google's cloud instances. It's all closed source, all hidden stuff. And, well, there's a reason Google made it closed source: Amazon made a clone of the TPU. It's called Inferentia, or they have some other name for the training one, Trainium. And look, it's a clone of the TPU, but Google's software at least kind of works. [00:09:58]
Alessio: So those are kind of the three core pieces. The first thing you're working on, that you've been working on, is TinyGrad. And on one of your Twitch streams, you said it's the best thing you've ever written. [00:10:07]
Swyx: Yeah. [00:10:08]
Alessio: Tell us a bit more about that creation. [00:10:10]
George: For a long time, TinyGrad had a hard limit of a thousand lines of code. What that forces you to do is really make sure you're not wasting lines. I got rid of the restriction because it became a little code-golfy at the end, but by then the core framework of TinyGrad was there in those thousand lines, and the ideas are expressed with no boilerplate. If you go read PyTorch... PyTorch I think is actually pretty good code, I think Facebook's pretty good, but there's so much boilerplate. Go into PyTorch and try to track down how a ReLU actually works. [00:10:44]
Swyx: Just a lot of instructions. [00:10:45]
George: Oh, you're going to be diving down a long stack from Python to C to custom libraries to dispatchers, and then I don't even know how to read TensorFlow. I don't even know where a ReLU is in TensorFlow. [00:10:55]
Swyx: Nobody knows. [00:10:56]
George: Someone at Google knows maybe. Google as an organism knows. I don't know if anyone individual at Google knows. [00:11:02]
Alessio: What are the important ergonomics for a developer, as you think about designing the TinyGrad API? [00:11:07]
George: So the TinyGrad front end looks very similar to PyTorch. There's an even higher level front end you can use for TinyGrad, which is just ONNX. We have better support for ONNX than Core ML does. And we're going to have, I think we're going to pass ONNX Runtime soon, too. People think ONNX Runtime, that's the gold standard for ONNX. No, you can do better. [00:11:23]
Swyx: Pass them in what, specifically? The compliance tests. [00:11:26]
George: So ONNX has a big set of compliance tests that you can check out. And we have them running in TinyGrad, and there are some failures. We're below ONNX Runtime, but we're beyond Core ML. So that's where we are in ONNX support now. But we will pass ONNX Runtime soon, because it becomes very easy to add ops: you don't need to do anything at the lower levels, you just do it at this very high level, and TinyGrad compiles it to something that's fast using these minimal ops. Most concretely, what TinyGrad can do that PyTorch can't really do is something like A times B plus C. If you write that in naive PyTorch, what it's going to do on the GPU is read A and read B in a kernel, store A times B in memory, and then launch another kernel to do A times B plus C. So you've got to do those loads from memory; it's a whole extra round trip to memory that I just didn't have to do. And you're like, yeah, but you can use the Torch JIT and it corrects this. Yeah, for that one example of multiply-accumulate. But, oh, now you did three multiplies? Six multiplies? It won't compile arbitrary code. [00:12:26]
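To make the A times B plus C example concrete, here is a hedged PyTorch sketch of the eager two-kernel path versus a fused path via torch.compile (PyTorch 2.x); the memory-traffic argument applies on GPU, but the snippet runs on CPU as written.

```python
import torch

a, b, c = (torch.randn(1_000_000) for _ in range(3))

# Eager: two separate ops, so the intermediate a*b is written to and read back from memory.
tmp = a * b
out_eager = tmp + c

# Fused: a single compiled kernel reads a, b, c once and writes the result once.
fused = torch.compile(lambda x, y, z: x * y + z)   # PyTorch 2.x
out_fused = fused(a, b, c)

assert torch.allclose(out_eager, out_fused)
```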
Swyx: And have you looked into the other approaches like PyTorch Lightning to accelerate PyTorch itself? [00:12:32]
George: Well, PyTorch Lightning, my understanding is it's mostly a framework around PyTorch, right? PyTorch Lightning is not going to fix this fundamental problem: if I multiply six tensors together, it's not going to reduce the trips to memory down to a single read from each input and a single write to the output. There are lower-level things in PyTorch; I'm not exactly sure what Dynamo does, but I know they're generating some Triton stuff, which will generate the kernels on the fly. But, you know, PyTorch Lightning is at a higher level of abstraction. So TinyGrad's front-end stuff looks like PyTorch. I made a few tweaks. There are a few things I don't like about PyTorch. Why is ReLU a class? Really, what's the state? You make a class, and there's state. Everything should just be functional, just .relu() on the tensor. There are things in Torch where you have to call a function on the tensor and can't call a method on the tensor. It just shows an API that's not perfectly refined. But when you're doing stuff TinyGrad style, where you don't have lines to spare, it has to work this way. Take the where operator: why is it true case, condition, false case? Ugh, that's how Python expresses ifs. It's disgusting. Ternary operators are much nicer. It should be: I can do (a < 0).where(a, 1), right? [00:13:46]
Swyx: The very pandas-like API? [00:13:50]
George: It looks like Torch, NumPy, pandas. They're all very similar. I tried to take the cleanest subset of them and express them. But like I said, you can also interact with it using ONNX. I have a rewrite of StableDiffusion, I have a rewrite of Llama, I have a rewrite of Whisper. You can look at them. They're shorter than the Torch versions, and I think they're cleaner. And you stream them all? [00:14:05]
Swyx: Yeah. Very nice. [00:14:07]
Alessio: So what's the other important concept that you're leveraging to do operation fusing? [00:14:11]
George: Yeah, you basically have a few different models. The simplest one is eager: as soon as the interpreter sees A times B, it actually dispatches A times B, right? Then you have graph mode, like TensorFlow, which will put A times B into a graph and then do absolutely nothing until you actually compile the graph at the end. I like the third choice, which is somewhere in the middle: laziness. Laziness means you don't know when the ops are going to dispatch, and you don't worry about that. You don't have to worry about it as a programmer, you just write out all your stuff, and then when you actually type `.numpy`, it'll be ready by the time you copy the thing back to CPU. Or you can do `.realize`, and it will actually force that tensor to be allocated in RAM. And if you think about it, PyTorch is kind of lazy in a way, but they didn't extend the paradigm far enough. When I do A times B in PyTorch, it's going to launch a CUDA kernel to do A times B, but it's not going to wait for that CUDA kernel to complete. So you're getting the worst of both worlds: you're getting the same laziness, but you also can't get fusion, because PyTorch doesn't know that I'm then going to do plus C. There's no way for it to be like, whoa, whoa, whoa, don't launch that CUDA kernel, just do this one too, right? Again, PyTorch is working on this, and it's a little bit harder. At Comma, I felt like I was competing against a lot of idiots. Here, I'm competing against smart, very smart people who've made some, I think, different trade-offs. If you're trying to build something that is just straight up good on NVIDIA, and you have a lot of people and complexity to throw at it, yeah, PyTorch made a lot of the right choices. I'm trying to build something that manages complexity. You can always make your software do more; the magic is when you can make your software do more without adding complexity, because complex things eventually collapse under their own weight. [00:15:58]
Alessio: How does fusing actually work? [00:16:00]
George: There's this thing called lazy.py, and when you do A times B, it's put into a graph, but it's a very local graph; there are no global graph optimizations. And even this can change, right? The programming model for TinyGrad does not preclude eagerness. Laziness is not guaranteed laziness; it's just going to try its best. So you put in A times B, that's a binary op, and that's a node in the graph. It's a virtual node because it's not realized yet. Then plus C: okay, here's a new node, which takes the C tensor and takes the output of A times B. It's like, whoa, there are two binary ops, okay, we'll just fuse those together. Here I have a kernel; this kernel has A, B, and C as inputs, it does A times B plus C in the local registers, and then outputs that to memory. Another amazing thing TinyGrad has that I've not seen in any other framework is actually two things. GRAPH=1, which is an environment variable, will output a complete graph of all the operations. Other people are like, oh, you can use PyTorch, export it to ONNX, and use Netron. Yeah, you can, but that's not what's real. GRAPH=1 will show you the actual kernels that were dispatched to the GPU. You can also type DEBUG=2, which will print those kernels out in your command line, and it will tell you the exact number of FLOPS and the exact number of memory accesses in each kernel. So you can immediately see: wait a second, this kernel used this many FLOPS, this was the gigaFLOPS, this is how many bytes it read, and this was the gigabytes per second. And then you can profile without having to... okay, in theory, in PyTorch, sure, use the NVIDIA Nsight profiler. No one does that, of course, because it's so difficult. NVIDIA used to have a command line one; I think CUDA 9 was the last version that had it. But now it's like, okay, I'm going to generate this blob, use this NVIDIA GUI tool to convert it into a Chrome trace, and then load it. Yeah, no one does that. Just type DEBUG=2 with any TinyGrad model, and it will show you all the kernels that it launches and the efficiency of each kernel, basically. [00:17:58]
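A small usage sketch of the laziness and debugging flow George describes, using the tinygrad API roughly as it looked around this episode (import paths, file names, and exact fusion behavior may differ in newer versions):

```python
# Run as:  DEBUG=2 GRAPH=1 python lazy_demo.py   (lazy_demo.py is a hypothetical script name)
from tinygrad.tensor import Tensor

a, b, c = Tensor.randn(1024), Tensor.randn(1024), Tensor.randn(1024)

out = a * b + c        # nothing is dispatched yet; both binary ops sit in the lazy graph
out = out.realize()    # now a fused kernel is generated and launched
print(out.numpy()[:4]) # .numpy() would also have forced realization
```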
Swyx: Yeah, this is something that John Carmack has often commented about, is that when you code, you need to build in your instrumentation or observability right into that. I wonder if whatever John is working on, he's adopting this style, and maybe we can sort of encourage it by, I don't know, naming it and coining a certain kind of debugging style? [00:18:16]
George: If he would like to start contributing to TinyGrad, I'd be so happy. [00:18:19]
Swyx: You should hook up with them. [00:18:22]
George: I've chatted with them a few times. I'm not really sure what his company's doing, but no, I mean, hopefully we get TinyGrad to a point where people actually want to start using it. So TinyGrad right now is uncompetitive on NVIDIA, and it's uncompetitive on x86. [00:18:36]
Swyx: And specifically, what do you care about when you say uncompetitive? Speed. [00:18:39]
George: Speed, yeah. It's correct; the correctness is there, for both the forward and backward passes. But on NVIDIA, it's about 5x slower than PyTorch right now. 5x, wow, is that insurmountable? No, there are reasons it's 5x slower, and I can go through how we're going to make it faster. It could be 100x slower, so we're making progress. But there's one place where it actually is competitive, and that's Qualcomm GPUs. TinyGrad is used to run the model in OpenPilot; it's been live in production now for six months. And TinyGrad is about 2x faster on that GPU than Qualcomm's library. [00:19:10]
Swyx: What about Qualcomm architecture? [00:19:12]
George: What makes it doable? Well, the world has spent how many millions of man-hours making NVIDIA fast, and Qualcomm has a team of 10 Qualcomm engineers? Okay, well, who can I beat here? What I propose with TinyGrad is that developer efficiency is much higher. But even with 10x higher developer efficiency, I still lose on NVIDIA, right? I didn't put 100,000 man-hours into it, and they put a million; that's what I'm saying. But that's also what I'm saying we can get, and we are going to close this speed gap a lot. Like, I don't support Tensor Cores yet; that's a big one that's going to massively close the gap. And then AMD. I don't even have a benchmark for AMD because I couldn't get it compiled. Oh, and I tried. I spent a whole day trying to get PyTorch built. I got it kind of working, then I tried to run a model, and there were all kinds of weird errors; the rabbit holes are so deep on this. So you can't really compare the speed. Right now, with TinyGrad, you can run LLaMA, you can run anything you want on AMD. It already all works. Any OpenCL backend works, and it's not terribly slow. I mean, it's a lot faster than crashing, so it's infinitely faster than PyTorch on AMD. But pretty soon, we're going to start getting close to theoretical maximums on AMD. That's really where I'm pushing, and I want to get AMD on MLPerf in a couple of months, hopefully. [00:20:26]
Swyx: Now that you bring up AMD. [00:20:27]
Alessio: Yeah, let's dive into that. Because when you announced the Semicore fundraise, you mentioned one of your first goals is like build the framework, runtime and driver for AMD. And then on June 3rd on Twitch, you weren't as excited about AMD anymore. Maybe let's talk a bit about that. You compared the quality of commit messages from the AMD kernel to the Intel work that people are doing there. What's important to know? [00:20:51]
George: When I said I wanted to write a framework, I never intended to write a kernel driver. I mean, I flirted with that idea briefly, but realistically, there are three parts to it: there's the ML framework, there's the driver, and then there's the user space runtime. I was even down to rewrite the user space runtime. I have a GitHub repo called cuda_ioctl_sniffer. It's terribly named, but you can actually launch a CUDA kernel without CUDA: you don't need CUDA installed, just the NVIDIA open source driver, and this open source repo can launch a CUDA kernel. So rewriting the user space runtime is doable. Rewriting the kernel driver? [00:21:26]
Swyx: I don't even have docs. [00:21:27]
George: I don't have any docs for the GPU. Like it would just be a massive reverse engineering project. I wasn't complaining about it being slow. I wasn't complaining about PyTorch not compiling. I was complaining about the thing crashing my entire computer. It panics my kernel. And I have to wait five minutes while it reboots because it's a server motherboard and they take five minutes to reboot. So I was like, look, if you guys do not care enough to get me a decent kernel driver, there's no way I'm wasting my time on this, especially when I can use Intel GPUs. Intel GPUs have a stable kernel driver and they have all their hardware documented. You can go and you can find all the register docs on Intel GPUs. So I'm like, why don't I just use these? Now, there's a downside to them. Their GPU is $350. You're like, what a deal. [00:22:03]
Swyx: It's $350. [00:22:04]
George: You know, you get about $350 worth of performance. And if you're paying about $400 for the PCIe slot to put it in, right, like between the power and all the other stuff, you're like, okay, nevermind. You got to use NVIDIA or AMD from that perspective. But I sent an email to Lisa Su. She responded. [00:22:19]
Swyx: Oh. [00:22:20]
George: And I've had a few calls since. And, first off, thank you for responding. Because if you don't care about your kernel panicking, this is just a huge waste of my time, right? I'll find someone who will care. I'm not asking for your seven by seven transposed Winograd convolution to be fast. I'm not asking for that. I'm asking literally for the basics of getting it running. Oh, and this isn't TinyGrad. This is your demo apps. I ran their demo apps in loops, and I got kernel panics. I'm like, no, okay. But no, Lisa Su reached out, connected me with a whole bunch of different people. They sent me a pre-release version of ROCm 5.6. They told me I can't release it, which I'm like, guys, why do you care? But they say they're going to release it by the end of the month, and it fixed the kernel panic. The guy managed to reproduce it with the two GPUs and the computer, and yeah, sent me a driver, and it works. I had that experience, and then I had another experience where I had two calls with AMD's communications people. I tried to explain to these people open source culture. It's not open source if you dump the source code on a GitHub repo and then forget about it until the next release. It's not open source if all your issues are from 2022. No one's going to contribute to that project, right? Sure, it's open source in a very technical sense. To be fair, it's better than nothing. But here's a fun fact, by the way: I fixed a bug in NCCL. If you have a consumer NVIDIA GPU, they don't support peer-to-peer, and the all-reduce bandwidth is horrendously slow, because it's using CUDA kernels to do the copy between the GPUs, and it's putting so many transactions on the PCIe bus that it's really slow. But you can use CUDA memcpy, and there's a flag to use CUDA memcpy, but that flag had a bug. I posted the issue on NCCL. I expected nothing to happen. The NVIDIA guy replied to me within an hour. He's like, try this other flag. I'm like, okay, I tried the other flag. It still doesn't work, but here's a clean repro. And I spent like three hours writing a very clean repro. I ended up tracking the issue down myself, but just the fact that somebody responded to me within an hour and cared about fixing the issue? Okay, you've shown that it's worth my time, and I will put my time in, because let's make this better. I'm here to help. But if you show me that kernel panics are just, like, expected? Okay. [00:24:36]
Swyx: Well, it sounds like AMD is getting the message. [00:24:38]
George: They are. I just don't really think they've had someone explain to them that you can build in public. They're like, what's an example of building in public? I'm like, go look at PyTorch. I have two minor things merged into PyTorch because it's very responsive, you know? [00:24:53]
Alessio: So that's kind of like the lowest level of the stack. And then at a slightly higher level, obviously, there's TinyGrad, there's Mojo, there's ggml. How are you thinking about breadth versus, like, depth? Like, where you decided to focus early on? [00:25:06]
George: So ggml is very much like, okay, everyone has M1s, right? In the beginning, I was actually thinking of something more like ggml, focused on the M1s. But ggml showed up and was just like, we're focusing on the M1s. And actually, M1 PyTorch is considerably better than AMD PyTorch. M1 PyTorch works; it only gives wrong answers sometimes, and it only crashes sometimes. Some models kind of run. When I was writing the Metal backend, I was comparing to MPS PyTorch, and I had a discrepancy. TinyGrad checks all its outputs against Torch, and I had one where it didn't match. I checked the matrix by hand, and it matched TinyGrad; I didn't understand. And then I switched PyTorch back to CPU, and it matched. I'm like, oh. There are bugs if you, say, transpose the matrix; I think it has to do with multi-views in PyTorch and weird under-the-hood stuff that's not exposed to you. Maybe they've fixed them by now, but it seems like there was a lot of momentum. Again, how many engineers care about making PyTorch work on M1? Thousands, tens of thousands. And you have an open development process, and guess what? It's going to be good. How many engineers care about PyTorch on AMD working? Well, you've got 10 guys that work for AMD, and then a couple of hobbyists. [00:26:15]
Swyx: You revealed an interesting detail about how you debug. You hand-check the matrix math? No, I don't hand-check it. [00:26:20]
George: One of the best tests in TinyGrad is a file called testops.py. And it's just a hundred small examples written in TinyGrad and PyTorch, and it checks both the forwards and backwards to make sure they match. [00:26:34]
Swyx: Good test suite. Yeah. Very important. [00:26:35]
George: That's, I mean, that's one of them where, like, I really, I put a lot of effort into CI for TinyGrad. I think CI is super important. Like, I want that green check to mean I can merge this, right? Like, I don't want my tests to, and if the green check, if you somehow manage to introduce a bug and get the green check, okay, we're fixing the test, top priority. [00:26:51]
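For readers who haven't seen the file, here is a minimal sketch of the kind of forward-and-backward parity check testops.py runs. The helper and the two example ops are illustrative, not TinyGrad's actual test code, and it assumes tinygrad's Tensor API plus NumPy and PyTorch are installed.

```python
# Minimal sketch of a testops.py-style parity check (illustrative, not the real file).
# It runs the same op in tinygrad and PyTorch and compares forward values and gradients.
import numpy as np
import torch
from tinygrad.tensor import Tensor  # assumed import path; newer versions use `from tinygrad import Tensor`

def check_op(name, tiny_fn, torch_fn, shape=(4, 4), atol=1e-4):
    data = np.random.randn(*shape).astype(np.float32)

    # tinygrad forward + backward
    t = Tensor(data, requires_grad=True)
    tiny_out = tiny_fn(t)
    tiny_out.sum().backward()

    # PyTorch forward + backward
    p = torch.tensor(data, requires_grad=True)
    torch_out = torch_fn(p)
    torch_out.sum().backward()

    np.testing.assert_allclose(tiny_out.numpy(), torch_out.detach().numpy(), atol=atol)
    np.testing.assert_allclose(t.grad.numpy(), p.grad.numpy(), atol=atol)
    print(f"{name}: forward and backward match")

check_op("relu",   lambda x: x.relu(),    lambda x: x.relu())
check_op("matmul", lambda x: x.matmul(x), lambda x: x.matmul(x))
```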
Swyx: Mojo? [00:26:52]
George: It's closed source. No, I'm not that interested. Do you know what I mean? Like, look, I like Chris Lattner. I think he's going to do great things, and I understand the, like, kind of the wisdom, even, in keeping it closed source. But, you know, I'm interested when it's open. [00:27:05]
Swyx: Yeah. You have an interesting design deviation from him, because he's decided to be a, well, promised to be a superset of Python, and you have decided to break with PyTorch APIs. And I think that affects learnability and transportability of code. [00:27:18]
George: You know, if the PyTorch thing ends up being, like, a stumbling block, I could write a perfect PyTorch interface. Instead of, like, yeah, import torch, you type import tinytorch as torch. And if that really becomes the stumbling block, I will do that. No, Chris Lattner went much further than PyTorch. Replicating the PyTorch API is something I can do with, you know, like an engineer-month. [00:27:44]
Swyx: A shim. [00:27:44]
George: Right, like a shim, yeah. Replicating Python? [00:27:47]
Swyx: Hoo-hoo-hoo! [00:27:48]
George: There's a big graveyard of those projects. How's Pyston going? How's Jython? [00:27:57]
Swyx: PyPy? Oh, you can go way back. [00:27:59]
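A PyTorch shim of the sort George describes could be as thin as a module that re-exports a slice of tinygrad under torch-like names. The module name tinytorch and the handful of functions below are hypothetical, chosen only to illustrate the idea:

```python
# tinytorch.py -- hypothetical shim so code written against a small slice of the
# PyTorch API can run on tinygrad with only `import tinytorch as torch`.
from tinygrad.tensor import Tensor  # assumed import path

def tensor(data, requires_grad=False):
    return Tensor(data, requires_grad=requires_grad)

def zeros(*shape):
    return Tensor.zeros(*shape)

def randn(*shape):
    return Tensor.randn(*shape)

def matmul(a, b):
    return a.matmul(b)

# usage: `import tinytorch as torch`, then e.g. torch.randn(2, 3).matmul(torch.randn(3, 4))
```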
Alessio: So your core mission is commoditizing the petaflop. And then your business goal is to sell computers for more than the cost to make, which seems super reasonable. And you're going to have three tiny boxes? [00:28:11]
Swyx: Red, green, blue? No, no, no, no, no, no, no. [00:28:13]
George: That was my... Look, you know, a lot of people, like, I love, you know, leaning into, like, saying I'm giving up, right? It's great to give up, right? Giving up is this wonderful thing. It's so liberating. And then, like, you can decide afterward if you really give up or not. There's very little harm in saying you give up, except, like, you know, great, Twitter haters have something to talk about, and all press is good press, kids, so... Just red, only red. [00:28:32]
Swyx: Tiny box, red. Tiny box, red. [00:28:34]
George: Unless AMD, you know, upsets me again, and then we're back to other colors. We have other colors to choose from. [00:28:41]
Alessio: When you think about hardware design, what are some of the numbers you look for? So, teraflops per second is one, but, like, memory bandwidth is another big limiter. Like, how do you make those trade-offs? [00:28:52]
George: Well, I mean, fundamentally, I'm limited to what GPUs I can buy. But, yeah, for something that I think a lot of people are going to want to reasonably do, with, um... A coworker of mine described them as luxury AI computers. Right? Like, luxury AI computers for people. And that's, like, what we're building. And I think a common thing people are going to want to do is run, like, Large Llama. Right? Or Large, like, Falcon or whatever. [00:29:13]
Swyx: FP16 Llama. [00:29:14]
George: FP16, exactly. Exactly. Um, you know, int8, I think, can work. I think that, like, what ggml is doing to go to, like, int4. Like, this doesn't work. Like, have you done... I mean, maybe they have. But, like, I read what it was, and I was like, this isn't from any paper. This is just some... Squeezing as much as possible. Yeah, you made up some quantization standards to make it run fast. And, like, maybe it works. But, okay, where's, like, the HellaSwag number? Right? Where's your, uh... [00:29:38]
Swyx: The thesis is right. That, like, if you have hundreds of billions of parameters, that the individual quantization doesn't actually matter that much. [00:29:44]
George: Well, the real way to look at all of that is to just say you want to compress the weights, right? It's a form of weight compression. Quantization is a form of weight compression, right? Now, this is obviously not lossless. It's not a lossless compressor, right? If it's a lossless compressor, and you can show that it's correct, then, okay, we don't have to have any other conversation. But it's a lossy compressor. And how do you know that your loss isn't actually losing the power of the model? Maybe int4 65B llama is actually the same as FP16 7B llama, right? We don't know. Maybe someone has done this by now, but I looked for it when it, like, first came out and people were talking about it. And I'm like, it's not from a paper, right? The int8 stuff is from a paper where they... Like, some of the int8 stuff is from a paper. There's one paper, I think it's, like, LLM.int8(), where they actually do all the tests. And they didn't go fully int8. They made, like, 90% of it int8 and kept, like, 10% of it in FP16 for what they called, like, the outliers or whatever. So I think that this is not quite so easy. [00:30:37]
Swyx: And I think being able... [00:30:38]
George: Well, so first off, if you're training, no one's gotten training to work with int8 yet. There's a few papers that vaguely show it. But if you're training, you're going to need BF16 or float16. So this is why I target that. Now, the thing that you're going to want to do is run these large language models out of the box on your hardware in FP16, and that's memory bandwidth. So you need large amounts of memory bandwidth, too. So ask how I trade off memory bandwidth and flops, so what GPUs can I buy? [00:31:02]
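As a rough back-of-the-envelope for why FP16 inference is memory-bandwidth bound: every generated token has to stream essentially all of the weights through the GPUs once, so tokens per second at batch size 1 is roughly aggregate bandwidth divided by the size of the weights. The per-GPU bandwidth below is an assumed figure for illustration, not a tiny box spec.

```python
# Back-of-the-envelope: batch-1 FP16 decoding is bandwidth-bound.
# Illustrative numbers, not official tiny box specs.
params = 65e9                 # LLaMA-65B parameters
bytes_per_param = 2           # FP16
weight_bytes = params * bytes_per_param          # ~130 GB of weights

aggregate_bandwidth = 6 * 0.96e12   # e.g. six GPUs at ~960 GB/s each (assumed)

tokens_per_sec = aggregate_bandwidth / weight_bytes
print(f"weights: {weight_bytes/1e9:.0f} GB")
print(f"~{tokens_per_sec:.1f} tokens/sec upper bound at batch size 1")
```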
Alessio: So first of all, you have this hiring process, which is you've got to solve one of the bounties that are open on TinyGrad. There's no technical interview. One of them is int8 support. Do you already have some things you want to test on? [00:31:14]
Swyx: We have int8 support. What I'd like to see somebody do [00:31:16]
George: is just load the ggml int8 llama into TinyGrad and then benchmark it against the FP16 one. Int8 already works in TinyGrad. It doesn't actually do the math in int8. It does all the math still in FP32. So int8 can mean you just have your weights in int8, or int8 can mean you actually do your math in int8. And doing your math in int8, well, the big gain that people care about is actually having your weights in int8, because weights in int8 mean less memory and less memory bandwidth, whereas the math, keep it in FP32. On M1s, it doesn't matter what data type you're doing in the GPU. I'm not even sure it can do int8, but FP16 and FP32 is the same teraflops. So yeah, no, that's one of the bounties. One of the bounties is get int8 llama running [00:31:58]
Swyx: with the int8 weights. [00:32:00]
George: And then actually, what you could even do, if you really want to test this, just take the FP16 weights, convert them to int8, then convert them back to FP16, then compare the unconverted and converted. [00:32:10]
Swyx: Oh, that's a nice hack. Oh, yeah. Right, like- This should be lossless in the other direction. Yeah, I think FP16, [00:32:17]
George: it should be lossless in the other direction. I'm actually not 100% about that. Why not? Oh, because like, you ever try to like, like if you want to represent, if it was like int16, it's not lossless. [00:32:25]
Swyx: Sure. [00:32:26]
George: All of int8 can be represented in FP16, but I'm not 100% about that. [00:32:29]
Swyx: Just drop the bytes. We just have to do it, right? [00:32:32]
George: Just literally do it. There's only 256 values to check, like. But yeah, either way, or I mean, int4, definitely. So do your int4, convert it back, and now see, even with int4 weights and FP32 math, like, okay, how much has your performance degraded this model? [00:32:47]
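The round-trip experiment George proposes is easy to sketch. The version below uses simple symmetric per-tensor quantization in NumPy; ggml's real schemes are blockwise and more involved, so treat this as the shape of the test rather than the actual quantizer.

```python
# Sketch of the FP16 -> intN -> FP16 round-trip check described above.
# Simple symmetric per-tensor quantization; real ggml schemes are blockwise.
import numpy as np

def quantize_roundtrip(w_fp16, bits):
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for int8, 7 for int4
    scale = np.abs(w_fp16).max() / qmax
    q = np.clip(np.round(w_fp16 / scale), -qmax, qmax).astype(np.int8)
    return (q.astype(np.float32) * scale).astype(np.float16)

w = np.random.randn(4096, 4096).astype(np.float16)   # stand-in for one weight matrix
for bits in (8, 4):
    w_rt = quantize_roundtrip(w, bits)
    err = np.abs(w.astype(np.float32) - w_rt.astype(np.float32))
    print(f"int{bits}: mean abs error {err.mean():.5f}, max abs error {err.max():.5f}")

# The real test: load the round-tripped weights back into the model (math still in FP32)
# and compare perplexity / HellaSwag against the unconverted FP16 baseline.
```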
Alessio: I think like the, you're planning to release the first tiny box, ship them in like two to six, eight months, something like that. What's top of mind for you in terms of building a team? Who should, who are you calling for? [00:32:59]
George: So people have the GPUs picked out and they're like, well, I could make that computer with the GPUs. And my answer is, can you? Do you know how hard it is to put six GPUs in a computer? And people think it's really easy. And it's really easy to put one GPU in a computer. It's really easy to put two GPUs in a computer, but now you want to put in eight. Okay, so I'll tell you a few things about these GPUs. They take up four slots. You can buy the nicest Supermicro. You can't put eight of those in there. You need two-slot blowers. [00:33:25]
Swyx: If you want to use one of those, [00:33:25]
George: those 4U Supermicros, you need two-slot blowers or water cooling, right? If you're trying to get the four-slot cards in there, you're going to need some form of water cooling. There are some like Chinese 4090s that are blowers, right? You need blowers or water cooling if you're trying to get it in those things, right? [00:33:37]
Swyx: So are you doing water? [00:33:39]
George: No, I'm not using that chassis. Okay, so now you want to get six GPUs in a computer. So that's a big challenge. You're like, oh, I'll just use PCIe extenders. I saw it on Linus Tech Tips. It works great. No, it doesn't. Try finding PCIe extenders that work at PCIe 4.0, and interconnect bandwidth is super important. They work at 3.0, but no PCIe extender I've tested, and I've bought 20 of them, works at PCIe 4.0. So you're going to need PCIe re-drivers. Now, okay, how much is that adding cost, right? Like these things all get really hard. And then tiny boxes, I've even added another constraint to it. I want this thing to be silent, not totally silent, but my limit is like 45, maybe 50 dB, but that Supermicro machine is 60 dB. We have a small, we have a compute cluster at comma. You gotta wear ear protection to go in there. Like- [00:34:24]
Swyx: Yeah, I've seen some videos where you give a tour. Oh yeah. It's noisy. It's super loud. [00:34:28]
George: You got all these machines just screaming. All those, like if you have a blower, what is that thing? 10,000 RPM, just screaming. Like I want to be able to use the normal big GPU fans and make this thing so it can sit under your desk, plug into one outlet of power, right? Six GPUs, your GPUs are 350 Watts each. Can't plug that into a wall outlet. Okay, so how are you going to deal with that? Good questions, right? [00:34:51]
Swyx: And you're not sharing them. [00:34:52]
George: Well, that one, I mean, that one is pretty obvious. You have to limit the power on the GPUs, right? You have to limit the power on the GPUs. Now you can limit power on GPUs and still get, you can use like half the power and get 80% of the performance. This is a known fact about GPUs, but like that's one of my design constraints. So when you start to add all these design constraints, good luck building a tiny box yourself. Obviously it can be done, but you need something that has actually quite a bit of scale and resources to do it. [00:35:15]
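The wall-outlet constraint is just arithmetic. A sketch with assumed numbers: a 15 A / 120 V North American circuit, some headroom for the rest of the system, and the half-power-for-roughly-80%-performance rule of thumb George cites.

```python
# Why six unthrottled GPUs can't share one wall outlet, and what power-limiting buys you.
# All numbers are illustrative assumptions.
outlet_watts = 15 * 120 * 0.8      # 15 A * 120 V circuit, ~80% continuous-load derating
gpus, tdp = 6, 350                 # six GPUs at 350 W TDP each

print(f"outlet budget: {outlet_watts:.0f} W, unthrottled GPUs: {gpus * tdp} W")

budget_for_gpus = outlet_watts - 300          # leave ~300 W for CPU, fans, drives (assumed)
per_gpu_cap = budget_for_gpus / gpus
print(f"per-GPU power cap: ~{per_gpu_cap:.0f} W "
      f"({per_gpu_cap / tdp:.0%} of TDP, roughly 80% of performance per the rule of thumb)")
```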
Alessio: And you see like the, under the desk, it's like one of the main use cases, kind of like individual developer use or. [00:35:21]
George: Yeah, what I also see is more of a, like an AI hub for your home, right? As we start to get like home robotics kind of stuff, you don't want to put the inference on the robot, but you also don't want to put the inference on the cloud. Well, you don't want to put it on the robot because, okay, the tiny box is 1500 watts. You'd put batteries on it and charge them, bad idea. Just go wireless. Wireless is 0.5 milliseconds, right? This is super fast. You don't want to go to the cloud for two reasons. One, cloud's far away. Okay, it's not that far away. You can kind of address this. But two, cloud's also mad expensive. Like cloud GPUs are way more expensive than running that GPU at your house. At least any rates you're going to get, right? Maybe if you commit to buy, well, yeah, I'm going to buy 10,000 GPUs for three years, then maybe the cloud will give you a good rate. But like, you want to buy one GPU in the cloud? I mean, okay, you can go to like Vast, but like if you're going on Azure or AWS, that's expensive. [00:36:12]
Swyx: This is like a personal data center instead of a cloud data center. [00:36:16]
George: We like the term compute cluster. So we can use NVIDIA GPUs. [00:36:20]
Swyx: Yeah, data centers may be a little bit dated. It's a compute cluster, [00:36:23]
George: which is totally legal under the CUDA license agreement. [00:36:26]
Swyx: You talk a lot about the PCIe connection. Do you think there's any fat there to trim? What do you mean? You're limited by bandwidth. [00:36:32]
George: Okay, for some things, yes. So bandwidth is roughly 10x less than what you can get with NVLinked A100s, right? NVLinked A100s are going to have, and then you can even get like full fabric and NVIDIA really pushes on that stuff, 600 gigabytes per second, right? And PCIe 4.0, you're going to get 60, right? So you're getting 10x less. That said, why do you need the bandwidth, right? And the answer is you need it for training huge models. If you're training on a tiny box, your limit's going to be about 7 billion. If you're training on big stuff, your limit's going to be like 70 billion, right? Okay, you can hack it to get a bit higher. You can hack it, like GPT hacked it to get a bit higher, but like that 65 billion in LLaMA, like there's a reason they chose 65 billion, right? And that's what can reasonably fit model parallel on a GPU, right? So yes, you are going to end up training models. The cap's going to be like 7 billion, but I actually heard this on your podcast. I don't think that the best chatbot models are going to be the big ones. I think the best chatbot models are going to be the ones where you had a thousand training runs instead of one. And I don't think that the interconnect bandwidth is going to matter that much. [00:37:33]
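One way to see roughly where the 7 billion figure comes from is plain memory math. The sketch below assumes six 24 GB consumer GPUs and about 16 bytes of state per parameter for mixed-precision Adam; both numbers are assumptions for illustration, not tiny box specs, and activations are ignored, which only makes the picture tighter.

```python
# Rough memory math for why a box of consumer GPUs tops out around ~7B-parameter training.
# Assumes 6 x 24 GB cards and ~16 bytes/param: fp16 weights + fp16 grads + fp32 master weights + Adam m, v.
gpus, vram_gb = 6, 24
total_vram_gb = gpus * vram_gb                      # 144 GB

bytes_per_param = 16
for params_b in (7, 13, 70):
    need_gb = params_b * 1e9 * bytes_per_param / 1e9
    fits = "fits" if need_gb <= total_vram_gb else "does not fit"
    print(f"{params_b}B params: ~{need_gb:.0f} GB of training state -> {fits} in {total_vram_gb} GB")
```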
Swyx: So what are we optimizing for instead of compute optimal? What do you mean compute optimal? You're talking about this, the LLaMA-style models where you train for like 200x. You train longer, yeah. [00:37:41]
George: Yeah, yeah, yeah. You can always make your model better by doing one of two things, right? And at Comma, we just have a strict limit on it. You can always make your model better by training longer, and you can always make your model better by making it bigger. But these aren't the interesting ones, right? Particularly the making it bigger, because training it longer, fine. You're getting a better set of weights. The inference is the same. The inference is the same whether I trained it for a day or a week. Okay, if it's 1 billion versus 10 billion, well, I 10x my inference too, right? So I think that these big models are kind of, sure, they're great if you're a research lab and you're trying to like max out this hypothetical thing. [00:38:13]
Swyx: Which you can talk about later. Yeah, yeah, yeah. [00:38:15]
George: But if you're like a startup or you're like an individual or you're trying to deploy this to the edge anywhere, you don't need that many weights. [00:38:22]
Swyx: Yeah, yeah. You actually don't want that many weights. Optimizing for inference rather than capabilities on benchmarks. Yes. [00:38:29]
George: And I think the inference thing, right? There's gonna be so much more. Right now, the ratio between like training and inference on clouds, I think it's only still, I think it's like two or three X, right? It's two or three X more inference, which doesn't make any sense. It's way more inference. [00:38:41]
Swyx: Yeah. [00:38:42]
George: There should be 10 to 100 X more inference in the world than training. But then also like, what is training, right? You start to see these things like LoRA, like it's kind of blurring the lines between inference and training. And I think that that blurred line is actually really good. I'd like to see much more like on-device training or on-device fine-tuning of the final layer. We're pushing toward this stuff at Comma, right? Like why am I shipping a fixed model? I totally want this model to fine-tune based on like how your left tire is flat, right? Every time you cut the same turn because your left tire is flat, well, it should learn that, right? [00:39:11]
Swyx: So would Comma pursue parameter efficient fine tuning? Yeah. [00:39:16]
George: We're looking into stuff like that. I mean, Comma is already very parameter efficient because we have to like run this thing in a car and you have to like cool it and power it. [00:39:22]
Alessio: And so this kind of like intelligence cluster you have in your home, you see when a person is using a third-party model, they load it locally and kind of do the final fine-tuning. It kind of stays within the box. [00:39:33]
George: I think that that's one version of it for the privacy conscious. I also see a world where you can have your tiny box in its down cycles, mine flop coin, right? You know, it turns out not all crypto is a scam. [00:39:45]
Swyx: There's one way to tell if crypto is a scam. [00:39:46]
George: If they're selling the coin before they make the product, [00:39:49]
Swyx: it's a scam. [00:39:49]
George: If they have the product and then they sell the coin, it's maybe not a scam, right? So yeah, my thought is like each tiny box would let you, would have a private key on it. And you have to do it this way. You can't just let anyone join because of Sybil attacks, right? [00:40:01]
Swyx: There's a real problem of like, [00:40:01]
George: how do I ensure your data is correct? And the way that I ensure your data is correct on the tiny net is if you ever send wrong data, you're banned from the network for life. [00:40:08]
Swyx: Yeah. [00:40:09]
George: Your $15,000 hardware box is banned. [00:40:11]
Swyx: So, you know, don't cheat. [00:40:11]
George: Obviously if it messes up, we'll forgive you. [00:40:14]
Swyx: Somebody's going to try to jailbreak your devices. There's no jailbreak. [00:40:17]
George: There's no jailbreak. [00:40:18]
Swyx: It's just a different network. [00:40:19]
George: Well, there's just a private key on each device, right? Like if you buy a tiny box from the tiny corp, [00:40:23]
Swyx: I give you a private key. [00:40:23]
George: It's in my backend server, right? You want to hack my server, that's illegal. Anything you want to do on the device, the device is yours. My server's mine, right? [00:40:29]
Swyx: Yeah. Have you looked into like a federated training at all? [00:40:33]
George: Okay. There's orders of magnitude of federated training. You mean like over the cloud and stuff? [00:40:37]
Swyx: Over the internet? Yeah. Over the internet, but also distributed on a bunch of devices, right? [00:40:41]
George: Yeah, I'm very bearish on this stuff. Because your interconnect bandwidth, right? So, okay. At the high end, you have your interconnect bandwidth of NVLink, which is 600 gigabytes per second, right? The tiny box has 60 gigabytes per second. And then your internet has 125 megabytes per second, right? Not gigabits, 125 megabytes, right? So, okay. That's how many orders of magnitude we're talking here? Like from 60 down to 125? Like, all right, that's over a hundred X. That's 400 X, right? So like, what you can do is inference, right? Like there's, for inference, you don't care, right? For inference, there's so little bandwidth at the top and the bottom of the model that like, yeah, you can do federated inference, right? And that's kind of what I'm talking about. There's also interesting things to push into, like you're like, but okay, what if you want to run closed source models? This stuff gets kind of interesting, like using TPMs on the boxes and stuff. But then someone might jailbreak my device. So, you know, maybe we don't try to do that. [00:41:34]
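The orders-of-magnitude comparison, written out with the round numbers quoted in the conversation:

```python
# Interconnect bandwidth, from NVLink down to a home internet connection.
nvlink   = 600e9    # bytes/sec, NVLinked A100s (round number from the conversation)
tiny_box = 60e9     # bytes/sec, PCIe-based tiny box
internet = 125e6    # bytes/sec, i.e. a 1 Gbit/s connection

print(f"NVLink vs tiny box: {nvlink / tiny_box:.0f}x")        # 10x
print(f"tiny box vs internet: {tiny_box / internet:.0f}x")    # ~480x, the "over a hundred X" gap
print(f"NVLink vs internet: {nvlink / internet:.0f}x")        # ~4800x
```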
Alessio: Yeah, what's like the enterprise use case? Do you see companies buying a bunch of these and like stacking them together? [00:41:39]
George: The tiny box is like the first version of what we're building. But what I really want to do is be on the absolute edge of flops per dollar and flops per watt. These are the two numbers that matter. So the enterprise use case is you want to train, like Comma, right? So Comma just built out a new compute cluster. It's about a person and a half. [00:41:56]
Swyx: A person being 20 petaflops. [00:41:58]
George: A person is 20 petaflops. It's about 30 petaflops. We built out a little compute cluster and, you know, we paid double what you theoretically could per flop, right? You theoretically could pay half per flop if you designed a bunch of custom stuff. And yeah, I mean, I could see that being, you know, a tiny corp product. Comma's going to be the first customer. I'm going to build a box for Comma and then I'm going to show off the box I built for Comma and be like, okay, like, do you want one? I sell $250,000 training computers. Or how much is one H100 box? [00:42:26]
Swyx: It's 400 grand? [00:42:27]
George: Okay, I'll build you a 400 grand training computer and it'll be 10x better than that H100 box. Again, not for every use case. For some, you need the interconnect bandwidth. But for 90% of most companies' model training use cases, the tiny box will be 5x faster for the same price. [00:42:41]
Alessio: You mentioned the person of compute. How do we build a human for $20 million? [00:42:47]
George: Well, it's a lot cheaper now. So like I said, Comma spent about half a million on our person and a half, so. [00:42:54]
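Putting the person-of-compute unit into numbers, using only the figures quoted in this conversation (20 petaflops per person, a roughly 30 petaflop cluster, about half a million dollars):

```python
# The "person of compute" unit, using the figures quoted in the conversation.
pflops_per_person = 20
comma_cluster_pflops = 30          # "about a person and a half"
comma_cluster_cost = 500_000       # "about half a million"

persons = comma_cluster_pflops / pflops_per_person
cost_per_person = comma_cluster_cost / persons
print(f"Comma cluster: {persons:.1f} persons of compute")
print(f"cost per person of compute today: ~${cost_per_person:,.0f}")
print(f"cost per petaflop: ~${comma_cluster_cost / comma_cluster_pflops:,.0f}")
```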
Alessio: What are some of the numbers people should think of when they compare compute to like people? So GPT-4 was 100 person-years of training. That's more like on the timescale. 20 petaflops is one person. I think, right now, the math was that for the price of the most expensive thing we build, which is the International Space Station, we could build one Tampa of. Yeah, yeah, one Tampa of compute. [00:43:16]
Swyx: Yeah, which is the ultimate currency of measurement. [00:43:20]
George: Yeah, yeah, we could build. So like the biggest training clusters today, I know less about how GPT-4 was trained. I know some rough numbers on the weights and stuff, but LLaMA- [00:43:28]
Swyx: A trillion parameters? [00:43:30]
George: Well, okay, so GPT-4 is 220 billion in each head, and then it's an eight-way mixture model. So mixture models are what you do when you're out of ideas. So, you know, it's a mixture model. They just train the same model eight times, and then they have some little trick. They actually do 16 inferences, but no, it's not like- [00:43:45]
Swyx: So the multimodality is just a vision model kind of glommed on? [00:43:49]
George: I mean, the multimodality is like obvious what it is too. You just put the vision model in the same token space as your language model. Oh, did people think it was something else? The mixture has nothing to do with the vision or language aspect of it. It just has to do with, well, okay, we can't really make models bigger than 220 billion parameters. We want it to be better. Well, how can we make it better? Well, we can train it longer, and okay, we've actually already maxed that out. We're getting diminishing returns there. [00:44:13]
Swyx: Okay. A mixture of experts. [00:44:14]
George: Yeah, a mixture of experts. We'll train eight of them, right? [00:44:16]
Swyx: So, all right. [00:44:17]
George: So, you know, the real truth is whenever a start, whenever a company is secretive, it's because they're hiding something that's not that cool. And people have this wrong idea over and over again that they think they're hiding it because it's really cool. [00:44:28]
Swyx: It must be amazing. [00:44:29]
George: It's a trillion parameters. No, it's a little bigger than GPT-3, and they did an eight-way mixture of experts. Like, all right, dude, anyone can spend eight times the money and get that. Coming back to what I think is actually gonna happen is, yeah, people are gonna train smaller models for longer and fine-tune them and find all these tricks. OpenAI used to publish stuff on this, you know, [00:44:47]
Swyx: when they would publish stuff [00:44:48]
George: about how much better the training has gotten holding compute constant. It's gotten a lot better, right? Think, compare like BatchNorm to NoBatchNorm. [00:45:00]
Swyx: Is that finding algorithms like FlashAttention? [00:45:02]
George: Yeah, well, FlashAttention, yeah. And FlashAttention is the same compute. FlashAttention is an interesting fact where it's actually the identical compute. It's just a more efficient way to do the compute. But I'm even talking about like, look at the new embeddings people are using, right? They used to use these like boring old embeddings. Now, like, LLaMA uses that complex one, and now there's like ALiBi. I'm not up-to-date on all the latest stuff, but those tricks give you so much. [00:45:23]
Swyx: There's been a whole round trip with positional embeddings. I don't know if you've seen this discussion. I haven't followed exactly. [00:45:29]
George: I mean, you quickly run into the obvious problem with positional embeddings, which is you have to invalidate your KV cache if you run off the context. So that's why I think these new ones, [00:45:38]
Swyx: they're playing with them, [00:45:38]
George: but I'm not an expert on like the latest up-to-date language model stuff. [00:45:43]
Alessio: What are some of the things, I mean, that people are getting wrong? So back to autonomous driving, there was like the whole like LiDAR versus vision thing. People don't get into accidents because they cannot see well. They get into accidents because they get distracted and all these things. Do you see similarities today on, like, the path to AGI? [00:45:59]
George: Nothing I say about this is ever gonna compete with how Rich Sutton stated it. [00:46:03]
Swyx: Rich Sutton, the writer of [00:46:04]
George: Reinforcement Learning, The Bitter Lesson. Nothing I say is ever gonna compete with, The Bitter Lesson's way better than any way I'm going to phrase this. Just go read that, and then like, I'm sorry it's bitter, but you actually just have to believe it. Like over and over again, people make this mistake. They're like, oh, we're gonna hand engineer this thing. No, like stop wasting time. [00:46:22]
Swyx: I mean, OpenAI is not taking The Bitter Lesson. They were leaders in deep learning for a long, long, long time. [00:46:27]
George: Well, OpenAI was the absolute leader to the thesis that compute is all you need, right? [00:46:31]
Swyx: And there's a question of how long [00:46:32]
George: this thesis is going to continue for. It's a cool thesis, and look, I think I would be lying along with everybody else. I was into language models like way back in the day for the Hutter Prize. I got into AI through the Hutter Prize. Like 2014, I'm trying to build compressive models of Wikipedia. And I'm like, okay, why is this so hard? What this is is a language model, right? And I'm playing with these Bayesian things, and I'm just like, oh, but I get it. I have two data points, and they're almost the same, but how do I measure that almost, right? I just wrapped my head around this, and this was around the time Karpathy released the first RNN that generated the Shakespeare stuff. And I'm like, okay, I get it, right? It's neural networks that are compressors. Now, this isn't actually, you can't actually win the Hutter Prize with these things because the Hutter Prize is MDL. It's the model, size of the model plus the size of the encodings, embeddings. So yeah, you can't, I mean, probably now you can because it's gotten so good. But yeah, back in the day, you kind of couldn't. So I was like, okay, cool. [00:47:29]
Swyx: This is what it is. [00:47:29]
George: I kind of get it. I didn't expect that it would continue to work this well. I thought there'd be real limits to how good autocomplete could get. That's fancy autocomplete. But yeah, it works well. So like, yeah, what is OpenAI getting wrong? Technically, not that much. I don't know. If I was a researcher, why would I go work there? [00:47:48]
Swyx: Yes, so why is OpenAI like the Miami Heat? [00:47:51]
George: No, look, this is my technical stuff. I don't really want to harp on this, but like, why go work at OpenAI when you could go work at Facebook as a researcher? OpenAI can keep ideologues who, you know, believe ideological stuff and Facebook can keep every researcher who's like, dude, I just want to build AI and publish it. [00:48:08]
Alessio: Yeah, any other thoughts, tiny corp, bounties? [00:48:11]
George: You know, I've been thinking a lot about like what it means to hire in today's world. Okay, look, I'm a believer that machines are going to replace everything in about 20 years. So, okay, what is that thing that people can still do that computers can't? And this is a narrowing list, but like, you know, back in the day, like imagine I was starting a company in 1960. Oh, and we're going to have to hire a whole bunch of calculators in the basement to do all the, you know, math to support the, dude, have you heard about computers? Why don't we just buy a few of those? Oh, wow, man, you're right. So like, I feel like that's kind of happening again. And I'm thinking about, I will post in my Discord, I'll be like, who wants to like, okay, I just changed my unary ops, which used to be log and exp in base e. I changed them to be log2 and exp2 because hardware has log2 and exp2 accelerators. [00:48:59]
Swyx: Yeah, and of course you can just change your base. [00:49:00]
George: It's one multiply to get it back to base e. But like, I made the primitives log2 and exp2, right? I just posted in the Discord. I'm like, could someone put this pull request up? And someone eventually did and I merged it. But I'm like, this is almost to the level [00:49:12]
Swyx: where models can do it. [00:49:14]
George: We're almost to the point where I can say that to a model and the model can do it. [00:49:17]
Swyx: Have you tried? Yeah, I don't know. [00:49:20]
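The base change George is relying on is a single extra multiply in each direction; a quick numerical check of the identities:

```python
# log/exp in base e recovered from log2/exp2 with one extra multiply.
import math

x = 3.7
log_e = math.log2(x) * math.log(2)          # ln(x) = log2(x) * ln(2)
exp_e = 2.0 ** (x * math.log2(math.e))      # e**x  = 2**(x * log2(e))

assert math.isclose(log_e, math.log(x), rel_tol=1e-12)
assert math.isclose(exp_e, math.exp(x), rel_tol=1e-12)
print(log_e, exp_e)
```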
George: I think autocomplete went further than I thought it would, but I'm also relatively unimpressed with these chatbots. The problem is if your loss function is categorical cross entropy on the internet, your responses will always be mid. [00:49:32]
Swyx: Yes, mode collapse is what I call it, I don't know. [00:49:35]
George: Maybe, I'm not even talking about mode collapse. You're actually trying to predict the, like, look, I rap. I'm a hobbyist rapper. When I try to get these things to write rap, the raps sound like the kind of raps you read in the YouTube comments. [00:49:45]
Swyx: Nursery school. [00:49:46]
George: Yeah, it's like, all right, great. You rhyme box with fox, sick rhyme, bro. You know, and Drake is rhyming give it up for me with napkins and cutlery, right? Like, all right, come on. [00:49:55]
Swyx: He's got like this thing about orange. Orange is famous so you can't rhyme it. Yeah, yeah, yeah, yeah, yeah. [00:49:59]
George: But now, of course, you know, four-inch screws and orange juice is in GPT's training corpus. Yeah, so I think it went further than everyone kind of thought it would. But the thing that I really want to see is like somebody put 10 LLMs in a room and have them discuss the answer before they give it to me. Right, like, you can actually do this, right? And I think the coding things have to be the same way. There is no coder alive, no matter how good you are, that sits down, well, I'm going to start at cell A1 and type my program, and then I'm going to press run and it's going to work. No one programs like that. So why do we expect the models to, right? So there's a lot that, like, still needs to be done. But, you know, at the tiny corp, I want to be on the cutting edge of this, too. I want to be, like, program generation. I mean, what is TinyGrad? It's a compiler, it generates programs. Generate the fastest program that meets the spec, right? Why am I not just having ML do that? So, you know, it's kind of a, you have to exist fluidly with the machines. And I've come around on a lot of stuff. I'm like, wait, TinyGrad, TinyCorp should be a remote company. I can't do this in person. [00:50:58]
Swyx: Really? [00:50:58]
George: Yeah, like, comma makes sense to be in person. Like, comma, sure. Yeah, we're getting an office in San Diego. [00:51:04]
Swyx: But that was a six-year-old company, right? [00:51:05]
George: And it works, and it works for a certain type of people [00:51:08]
Swyx: and a certain type of culture. [00:51:08]
George: But what's going to be different this time? Okay, remote, but now it's remote. And now I'm getting these, like, people who apply, and I'm like, I literally have a thousand applications. I'm not calling you to do a technical screen. I can't really tell anything from a technical screen. What am I going to do? Make them code on a whiteboard? Like, bring up a shared notebook document, so we could, oh, like, that's not going to work. Okay, so then I moved to the next thing. We do this at Comma with good success, programming challenges. [00:51:31]
Swyx: I've also found them to be, like, [00:51:32]
George: completely non-predictive. I found one thing to actually be predictive, and it's, wait a second, just write code in TinyGrad. It's open source, right? And yeah, so, you know, I'm talking to a few people who've been contributing, and, like, contribute, or, you know, the job's not for you. But you can do it remote, and it's, look, it's a chill job. Like, you're not, you're like, oh, yeah, well, I work for the tiny corp. Like, well, you're writing MIT-licensed software. Like, you see what it's doing, right? Like, we'll just, I think, think of it as maybe more of, like, a stipend than a salary. And then also some equity. Like, if, you know, I get rich, we all get rich. [00:52:01]
Alessio: How do you think about agents and kind of, like, thinking of them as people versus, like, jobs to be done? Sean built this thing called smol developer. [00:52:09]
Swyx: It's in the same vein. Or, like, the human in the loop with the language model and just iterating while you write code. I think that's absolutely where it goes. [00:52:17]
Alessio: And there's, like, a, it's not, like, one thing. It's, like, there's smol interpreter. There's, like, smol debugger. It's kind of, like, all these different jobs to be done. [00:52:24]
Swyx: It's a small world. [00:52:25]
Alessio: Yeah, it's a, I know, this is, like, the smol box is, like, smol AI meets tiny corp. [00:52:29]
Swyx: So we're all in the same wavelength. [00:52:30]
Alessio: How do you think about that? Do you think people will have a human-like interaction where it's, like, oh, this is, like, the AI developer, or, like, is it I'm the human being supercharged by the AI tools? [00:52:41]
George: Oh, I think it's, yeah, much more like I'm the human supercharged by the AI tools. I think that, like, coding is tool-complete. Like, driving's not tool-complete. We hire people to drive who are, like, below the API line. Right, there's an API line in the world, right? [00:52:53]
Swyx: Love that. Yes. [00:52:53]
George: Yeah, yeah, yeah, there's an API line in the world. And, like, you can think, like, Uber's a really clear example, right? There's the people below the API line and the people above the API line. And the way you can tell if you're below or above, by the way, is is your manager a computer, right? Who's the manager of the Uber driver? [00:53:06]
Swyx: Well, a computer, right? Does the machine tell you what to do or do you tell machines what to do? Exactly, exactly. [00:53:09]
George: So, coding is tool-complete, right? [00:53:13]
Swyx: Coding is tool-complete. [00:53:13]
George: Coding is above the API line. So it will always be tools supercharging your coding workflow. And it will never be you performing some, like, task. Like, okay, well, I can do everything except for actually starting a Docker container. Like, it just doesn't make any sense, right? Yeah, so it will always be sort of tools. And, you know, look, we see the same stuff with all the, like, people are like, stable diffusion's gonna replace artists or whatever. It's like, dude, like- [00:53:38]
Swyx: It's gonna create new artists. [00:53:39]
George: Did Photoshop replace artists? [00:53:41]
Swyx: Like, what are you talking about, right? [00:53:42]
George: Like, you know, a real artist's finger paint. They can't use brushes. Brushes are, you know, brushes are gonna replace all the, okay, like, I just can't. Like, it's all just tools and the tools are gonna get better and better and better. And then eventually, yes, the tools are going to replace us. But, you know, that's still 20 years away. So, you know, I got a company to run in the meantime. [00:54:02]
Swyx: So I've written about the API line before and I think that's from Venkatesh. I don't know if you attribute it to him. I don't know, I definitely took it from someone. [00:54:07]
George: It's definitely not mine. [00:54:08]
Swyx: It's VGR. But I also have a speculated, a higher line than that, which is the Kanban board. Like, who tells the programmers what to do, right? So are you above or below the Kanban board? Has that evolved your management thinking? [00:54:21]
George: Yeah, like, that's sort of what I mean. Like, it's like, I'm just gonna describe the pull request in two sentences and then like, yeah. [00:54:28]
Swyx: So you are running the Kanban board? Or the bounties, you know? [00:54:31]
George: Yes, the bounties are the Kanban board, exactly. And that is kind of the high level. And then like, yeah, we'll get AIs to fill in some and we'll get people to fill in others. And that's also what it means to be like, full-time at TinyCorp, right? The way you start, and I wrote this up pretty concretely: I'm like, okay, step one is you do bounties for the company. Step two is you propose bounties for the company, right? You obviously don't pay them, we pay them. [00:54:52]
Swyx: But you propose them. [00:54:52]
George: And I'm like, yeah, that's a good bounty. That like, helps with the main workflow of the company. And step three is you get hired full-time, you get equity, we all, you know, maybe get rich. [00:55:01]
Swyx: What else are you designing differently about the employee experience? [00:55:04]
George: You know, some people really like to like, [00:55:06]
Swyx: like keep a separation, right? [00:55:07]
George: Some people really like to keep a separation between like employees and management or customers and employees. Like at comma, you know, the reason I do the DevKit thing, it's like, dude, you buy a comma thing, you're an employee of the company. Like you're just part of the company. It's all the same thing. There's no like secrets, there's no dividing lines. There's no like, it's all a spectrum where, like, you know, down here at the spectrum, like you pay. And then up here at the spectrum, you get paid. You understand this is the same spectrum as college, right? Like for undergrad, you pay, and then you get up here to like, you know, I'm doing a PhD program, you get paid. Okay, well, cool. Welcome to the, you know. [00:55:39]
Alessio: What about comma bodies? You mentioned a lot of this stuff is clearly virtual, but then there's below the API line you actually need. [00:55:47]
Swyx: Wait, this is a thing that's been announced? Comma bodies? We sell them. You can buy them. [00:55:51]
George: They're a thousand bucks on our website. [00:55:53]
Swyx: Oh, okay, no, no, no. I'm thinking about like the, what Tesla announced with like the humanoid robots. It's the same thing. [00:55:58]
George: Except of course, we made the comma version of it. Tesla uses 20 actuators. We use two, right? Like how do you build the simplest possible thing that can like turn the robotics problem into entirely a software problem? So right now it is literally just a comma three on a pole with two wheels. It balances, keeps the comma three up there. And like, there's so much you could do with that already. [00:56:21]
Swyx: Right? [00:56:22]
George: Like this should replace, how many security guards could this replace? Right? If this thing could just competently wander around a space and take pictures and, you know, focus in on things, send you a text message when someone's trying to break into your building, you know, like, like this could already do so much, of course, but the software is not there yet. Right? So how do we turn robotics into a thing where it's very clearly a software problem? You know, that people don't accept that self-driving cars are a software problem. Like, I don't, I don't know what to tell you, man. Like literally just watch the video yourself and then drive with a joystick, right? Can you drive? And we've actually done this test. We've actually done this test where you've had someone, okay, you just watch this video and here's a joystick and you got to drive the car. And of course they can drive the car. It takes a little bit of practice to get used to the joystick, but the problem is all the model, right? So I can now make the model better. [00:57:07]
Swyx: Our second most popular episode ever was about segment anything coming out of Facebook, which as far as I understand the state of the art in computer vision, what are you hoping for there that you need for Karma? [00:57:17]
George: I haven't used segment anything. Like, is it large YOLOs or not? I've used like large YOLOs and I'm super impressed by them. [00:57:24]
Swyx: Yeah. [00:57:25]
George: I got to check out segment anything. I don't think it's a distinct problem, right? Okay, here's something that I'm interested in. All right, we have great LLMs. We have great text to speech models and we have great speech to text models. Okay, so why can I not talk to an LLM? Like I'd have a normal conversation with it. [00:57:39]
Swyx: You can with the latency of like two seconds every time. Right? [00:57:42]
George: And then it feels so unnatural. It's this like staccato. Like I don't like the RLHF models. I don't like the tuned versions of them. They take on the personality of a customer support agent. Right? [00:57:53]
Swyx: Like, oh, come on. [00:57:54]
George: I like LLaMA more than ChatGPT. ChatGPT's personality just grated on me. Whereas LLaMA, like, cool. I read a little bit of pretext paragraph. I can put you in any scenario I want, right? Like, that's interesting to me. So yeah, I think there is really no like distinction between computer vision and language and any of this stuff. It's all eventually going to be fused into one massive. So to say computer vision is solved, well, it doesn't make any sense because what's the output of a computer vision model? Segmentation? Like, what a weird task, right? [00:58:26]
Swyx: Who cares? OCR? [00:58:28]
George: Who cares? [00:58:29]
Swyx: I don't care if you can segment [00:58:29]
George: which pixels make up that laptop. I care if you can pick it up. [00:58:32]
Alessio: And you're going to have the local cluster. You're going to have the body. [00:58:36]
Swyx: Yeah. [00:58:37]
George: Yeah, I think that's kind of where that goes. [00:58:39]
Swyx: Maybe we can paint the future of like, the year is 2050. You've achieved all you wanted at TinyCorp. What is the AI enabled future like? [00:58:48]
George: Well, TinyCorp's the second company. Comma was the first. Comma builds the hardware infrastructure. TinyCorp builds the software infrastructure. The third company is the first one that's going to build a real product. And that product is AI Girlfriend. No, like I'm dead serious, right? Like, this is the dream product. This is the absolute dream product. Girlfriend is just the like- [00:59:08]
Swyx: Stand-in. [00:59:09]
George: Well, no, it's not a stand-in. No, no, no, no. I actually mean it, right? So I've been wanting to merge with a machine ever since I was like, mad little. [00:59:15]
Swyx: Like, you know, I was just like, [00:59:16]
George: how do I merge with a machine, right? [00:59:18]
Swyx: And like, you can look at like, [00:59:19]
George: maybe the Elon style way of thinking about it is Neuralink, right? I'm like, I don't think we need any of this, right? You ever, some of your friends maybe, they get into relationships and you start thinking of, you know, them and their partner as the same person. You start thinking of them as like one person. I mean, they are kind of like merged, right? Like, humans can just kind of do this. It's so cool. It's this ability that we already have. Right, so I don't need to put, you know, electrodes in my brain to merge with a machine. I need an AI Girlfriend, right? So that's what I mean. Like, this is the third product. This is the third company. And yeah, in 2050, I mean like, ah, it's so hard. I just like, maybe I can imagine like 2035. I don't even know 2050, but like, yeah, 2035. Like, yeah, that'd be really great. [01:00:03]
Swyx: In terms of merging, like, isn't it, shouldn't you work on Brain Upload rather than AI Girlfriend? Brain Upload, right? [01:00:09]
George: I don't need Brain Upload either. Like, there's thousands of hours of me on YouTube, right? Yes. How much of my brain's already uploaded? [01:00:17]
Swyx: That's only the stuff that you voice. Yeah, it's not that different. [01:00:20]
George: It's not that different, right? You really think a model with, you know, an exaflop of compute couldn't extract everything that's really going on in my brain? I'm a pretty open person, right? Like, I'm not running a complex filter. Humans can't run that complex of a filter. Like, humans just can't. Like, this is actually a cool quirk of biology. It's like, well, humans like can't lie that well. [01:00:39]
Alessio: So is it good or bad to put all of your stream of consciousness out there? [01:00:43]
George: I mean, I think it's good. [01:00:45]
Swyx: I mean, he's streaming every day. I want to live forever. We said off mic that we may be the first immortals, right? Yeah, this is how you live forever. [01:00:54]
George: It's a question of, okay, how many weights do I have? Right, okay, let's say I have a trillion weights, right? So talking about a terabyte, 100 terabytes here. [01:01:02]
Swyx: Okay, but it's not really 100 terabytes, right? [01:01:03]
George: Because it's Kolmogorov complexity. How much redundancy is there in those weights? So, like, maximally compressed, how big is the weight file for my brain? Quantize it whatever you want. Quantization is a poor man's compression. I think we're only talking really here about, like, maybe a couple gigabytes, right? And then if you have, like, a couple gigabytes of true information of yourself up there, cool, man. Like, what does it mean for me to live forever? [01:01:27]
Swyx: Like, that's me. No, I think that's good. [01:01:29]
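The arithmetic behind that estimate, with the trillion-weight figure from above and a guessed few gigabytes of true information content (both are George's ballpark assumptions, not measurements):

```python
# Rough sizing of "how big is the weight file", per the assumptions in the conversation.
weights = 1e12                      # a trillion weights
raw_fp16_tb = weights * 2 / 1e12    # 2 bytes each -> ~2 TB raw
raw_int8_tb = weights * 1 / 1e12    # quantized to int8 -> ~1 TB

# If the true (Kolmogorov-style) information content is only a few GB, the implied
# redundancy is on the order of hundreds to one. Illustrative, not a measurement.
assumed_true_gb = 2
print(f"raw fp16: ~{raw_fp16_tb:.0f} TB, int8: ~{raw_int8_tb:.0f} TB")
print(f"implied compression ratio vs {assumed_true_gb} GB: ~{raw_int8_tb * 1000 / assumed_true_gb:.0f}:1")
```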
Alessio: And I think there's a bit of, like, a professionalization of social media, where, like, a lot of people only have what's, like, PC out there, you know? And I feel like you're going to get, going back to the ChatGPT thing, right? You're going to train a model on, like, everything that's public about a lot of people. [01:01:44]
Swyx: And it's like- [01:01:45]
George: Then no one's going to run their model and they're going to die. Don't put PC on social media. [01:01:49]
Swyx: We're moving on to what would normally be called the lightning round, but just general tics, because you're a generally interesting person with many other interests. What does the goddess of everything else mean to you? [01:01:59]
George: Oh, it means that AI is not really going to kill us. [01:02:01]
Swyx: Really? [01:02:01]
George: Of course. [01:02:02]
Swyx: Tell us more. [01:02:03]
George: Lex asked me this, like, is AI going to kill us all? And I was quick to say yes, but I don't actually really believe it. I think there's a decent chance that AI kills 95% of us. [01:02:11]
Swyx: Okay. [01:02:12]
Alessio: But they saw on your Twitch streams that you're with them, so they're not going to- [01:02:16]
Swyx: No, I don't think, I actually, [01:02:18]
George: I don't also think it's AI. Like, I think the AI alignment problem is so misstated. I think it's actually not a question of whether the computer is aligned with the company who owns the computer. It's a question of whether that company's aligned with you or that government's aligned with you. And the answer is no, and that's how you end up dead. [01:02:31]
Swyx: So what the goddess of everything else means to me [01:02:32]
George: is like, the complexity will continue. Paper clippers don't exist. [01:02:37]
Swyx: You know, there are forces. [01:02:38]
George: The paper clipper is cancer, right? The paper clipper is really just a perfect form of cancer. And the goddess of everything else says, yeah, but cancer doesn't win, you know? [01:02:48]
Swyx: Yeah, it's a beautiful story for those who haven't heard it. And you read it out and I listened to it. Yeah, what are you grateful for today? [01:02:55]
George: Oh man, I mean, it's all just like, I've been thinking about this stuff forever. Like, that it's actually like happening, and it's happening in an accessible way too. I guess that's what I'm really grateful for. It's not like, AI is not some Manhattan Project-style thing where you don't know anything about it. Closed doors. [01:03:12]
Swyx: Closed doors. [01:03:13]
George: I'll fight really hard to keep it that way. I'm grateful for just how much is released out there and how much I can just learn and stay up to date. And I guess I'm grateful to the true fabric of reality that, you know, I didn't need differential equations to understand it. Like, I don't need some like, there's a limit to my math abilities. I can do most undergrad math, but I took some grad math classes and okay, now we're getting to the end of what I can do. And it's just the actual like, end of what I can do. Like, I'm limited by my brain, but you know, ML stuff, hey, you need high school math. [01:03:45]
Swyx: You know what I mean? [01:03:46]
George: When I learned to multiply a matrix, seventh grade, [01:03:48]
Swyx: like, it's all easy. You need more electrical engineering than you need high school math early. [01:03:52]
George: Yeah, well, you need electrical engineering to like, build the machines, but even that, like, these machines are simpler than the machines that have existed before. The compute stack looks really nice. So, you know, yeah, I just, I'm grateful that it's all happening and I get to understand it. [01:04:05]
Alessio: John Carmack mentioned there's about six insights we have left. Do you have an intuition for what some of the paths [01:04:11]
Swyx: people should be taking? [01:04:12]
Alessio: Obviously you're working on one. What are some of the other branches of the tree that people should go under? [01:04:17]
George: I don't think I'm working on one of the six insights. I don't think TinyGrad's any one of the six insights. Something I really like that Elon does, and I try to be inspired by it, is look at the tunnel boring machine and ask how you can build a 10X cheaper one. All right, look at the rocket. How can I build a 10X cheaper one? All right, look at the electric car and say, how can I build a 10X cheaper, like, cheaper or, you know, can go further or whatever, whatever, whatever, right? And you just do the straight up physics math, right? I'm trying to do the same thing with ML frameworks, right? And in doing so, making sure that this stuff remains accessible. You could imagine a world where if Google TPUs were actually the ultimate, if Google TPUs were actually the best training things, I mean, actually, you know, I'm kind of grateful for NVIDIA, right? Because if Google TPUs were the ultimate, now you have this huge closed source compiler in between XLA and the hardware, and yeah, that's just a really bad thing. So, I mean, something that is somewhat upsetting about the tiny corp is that it is trying to prevent downside, but it's not all trying to prevent downside. Like, we're also building computers and we're gonna build some awesome, powerful, cheap computers along the way. So, no, I'm not really working directly on any of the six tricks. I also think the six tricks are kind of gonna be like luck. [01:05:25]
Swyx: I think it's just gonna be like, you know, [01:05:26]
George: please tell me more about what covariate shift is and how that inspired you to come up with batch normalization. Please tell me more about why it's a transformer and it has a query, a key, and a value, right? Like Schmidhuber described it better in fast weights. I mean, my theory about why transformers work has nothing to do with this attention mechanism, and everything to do with the fact that it's semi-weight sharing, right? Because the weight matrix is being generated on the fly, you can compress the weight matrix, right? Like, this is what that, there's an operation in the transformer which, and by the way, this is like, Qualcomm's SNPE can't run transformers for this reason. So, most matrix multiplies in neural networks are weights times values, right? Whereas when you get to the outer product in transformers, well, it's not weights times values. It's values times values, right? So, SNPE doesn't even support that operation, right? So, it's like that operation that gives the transformer its power. It has nothing to do with the fact that it's attention, [01:06:20]
Swyx: right? [01:06:21]
George: And this is a funny, like, but that is one of the six tricks, right? Batch, like these norms are a trick. Transformers are a trick. Okay, six more. [01:06:29]
Swyx: So, you talk about attention as weight compression. [01:06:33]
George: Compression is not exactly the right word. What I mean is that the weight can change dynamically based on the context. So, there was this thing in PAQ8 in the Hutter Prize that I absolutely loved, and I've never seen it again in neural networks, and it's a really good trick. Okay, imagine you have 256 weight sets for a layer, right? And then you choose which of the weight sets you're loading in based on some context. And that context can come from another neural net, right? So, I have another neural net, which projects 256 wide, one hot, do a softmax, predict it, and then I actually load the weights in. And I can do this operation at both training and inference, and I load in the weights given the context. Like, that is what transformers do. But transformers, instead of having 256 discrete ones, it's actually just that, but continuous. Which is funny that that was in language models, and I just like, when I understood that about transformers, I'm like, oh, this is a real trick, and why are they using the word attention? [01:07:23]
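A sketch of that trick in PyTorch: a layer holds 256 candidate weight matrices and a small selector network picks which one to load, either as a hard one-hot choice (the PAQ8-style discrete version) or as a softmax-weighted blend (the continuous version he is comparing to attention). The module is illustrative, not code from PAQ8, TinyGrad, or any real transformer.

```python
# Context-selected weight sets: a hard one-hot pick (discrete) or a softmax blend (continuous).
# Illustrative sketch of the idea, not code from PAQ8 or any real model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextSelectedLinear(nn.Module):
    def __init__(self, dim, n_sets=256, hard=False):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(n_sets, dim, dim) * dim ** -0.5)  # 256 weight sets
        self.selector = nn.Linear(dim, n_sets)   # the "other neural net" that predicts the context
        self.hard = hard

    def forward(self, x, ctx):
        probs = F.softmax(self.selector(ctx), dim=-1)          # (batch, n_sets)
        if self.hard:                                          # discrete: load exactly one weight set
            probs = F.one_hot(probs.argmax(-1), probs.shape[-1]).float()
        w = torch.einsum("bn,nio->bio", probs, self.weights)   # per-example effective weight matrix
        return torch.einsum("bi,bio->bo", x, w)

x = torch.randn(8, 64)      # input
ctx = torch.randn(8, 64)    # context that drives the selection
layer = ContextSelectedLinear(64)
print(layer(x, ctx).shape)  # torch.Size([8, 64])
```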
Alessio: And today is actually the anniversary of attention is all you need. What? [01:07:27]
Swyx: Oh, that's so cool. [01:07:28]
Alessio: Today, six years ago. [01:07:29]
Swyx: Six years. [01:07:30]
George: Six years. [01:07:31]
Swyx: Changed the world. Wow. [01:07:32]
George: Well, there's one of your envelope tricks, right? And you could easily write it on an envelope, think about how you write out that. How many times have you written that? Because it's not in any libraries, because it's all used a little differently each time. Like, you just write out that exact same, you know. [01:07:45]
Swyx: You've name checked Elon a few times. I think about both of you as systems thinkers. Input, output, thinking something in between. What's different about your style versus his? [01:07:53]
George: Elon's fundamental science for the world is physics, mine is information theory. But you do a lot of physics as well. [01:07:58]
Swyx: I mean, like, you base it on- [01:07:59]
George: And Elon does a lot of information theory as well, too. But the difference maybe is expressed in what your ambitions are, right? Elon's ambitions may be like- [01:08:08]
Swyx: Go to Mars. Go to Mars, right? [01:08:10]
George: Go to Mars is the ultimate modernist physics ambition, right? It's a physics problem getting to Mars, right? [01:08:16]
Swyx: Well, what are electric cars? [01:08:17]
George: It's a physics problem, right? Okay, now he's like pushing on the autonomy stuff, and you push a little on information theory. But fundamentally, his dreams are physics-based dreams. My dreams are information-based dreams. I want to live forever in virtual reality with my AI girlfriend. Those are the aspirations of someone who accepts information theory as a core science. So I think that's the main difference between me and him. He has physics-based aspirations, and I have information-based aspirations. [01:08:39]
Swyx: Marc Andreessen, he is a- Hi, Marc. He's a listener. He's a big proponent of effective accelerationism. You've been a bit more critical. Why do you say that e/acc is not taken seriously by its adherents? [01:08:50]
George: Oh, well, only the left takes ideology seriously. It's just like a fact, right? [01:08:55]
Swyx: Is the right more cynical? Is that what it is? [01:08:57]
George: I don't know. [01:08:58]
Swyx: It's like the left actually manages [01:08:59]
George: to get energy around the ideologies, right? [01:09:02]
Swyx: Look, here you have- [01:09:03]
George: You have two effective altruists named Sam going in front of Congress. Only one of them is in jail. [01:09:08]
Swyx: You know, it's interesting. [01:09:09]
George: They're both calling for regulation in their respective spaces, right? [01:09:11]
Swyx: So SBF is definitely kind of a wolf in sheep's clothing, right? Like he only adopted e/acc or EA to market. [01:09:19]
George: Oh, and Sam Altman is a genuinely good guy who is not interested in power-seeking for himself. [01:09:24]
Swyx: All right. Okay, okay. We don't have to go there. Fair enough, fair enough. [01:09:27]
George: But no, e/acc is not, like, you are not serious, right? Marc Andreessen, I like Marc Andreessen, but it's like someone who, around 2019, had their eyes opened about the political world not being what it seems. You mean all the people on the news were lying to me? [01:09:42]
Swyx: Bro, they were lying to you. [01:09:43]
George: Like, okay, we all figured this out five years ago. Now, what are you going to do about it? I'm going to complain about it on Twitter. Great, and that's what e/acc is. [01:09:50]
Alessio: Last and maybe most important, why was Avatar 2 bad? [01:09:55]
Swyx: Oh, I have a whole, you can go on my blog. [01:09:56]
George: I rewrote the script of Avatar 2. I wrote a script that actually might make you feel something for the characters. I killed Jake Sully in the first scene. Like, you had to. Do you really think his second story arc topped his first one? No, of course not. You had to kill the guy and make the movie about the brothers, right? And just that alone and realizing that, like, you could have kept the Titanic scene. [01:10:16]
Swyx: It would have been fine. [01:10:16]
George: I didn't even take it out. I left your Titanic scene, James Cameron, but I wrote you a story. So, you know, you're just, just, just. [01:10:23]
Swyx: He needs ships to sink in water. [01:10:24]
George: Look, it's a great scene, but like the movie was just like, like the Roman, I've never. [01:10:30]
Swyx: Great CGI, you know, let down by the writing maybe. It's a beautiful world. [01:10:34]
George: And that's why like I care so much, right? Like you don't hear me ranting about Pirates of the Caribbean 2 being a terrible story. Cause come on, what do you expect, man? Like Johnny Depp's like, wow, I had a movie that made me rich. I love this. [01:10:44]
Alessio: But this goes back to like the midpoint. You know, I think you wrote like, feels like ChatGPT wrote the movie and that's my worry a little bit. It's like kind of converging towards that. [01:10:53]
Swyx: Oh, I. Malik, Malik wrote the movie. Sorry, I didn't want to interrupt you. [01:10:59]
George: I closed a pull request two days ago. I was like, was this written by ChatGPT? And I just closed it. [01:11:04]
Swyx: Like, you know what? [01:11:05]
George: I honestly feel bad if you were a human who wrote this. [01:11:07]
Swyx: Incapable of being more perplexed. [01:11:09]
George: But if I have a classifier running in my head that asks, you know, is this an AI or is this a human? Like, that's the only way to deal with all this, and oh God, it's like the worst possible thing. Like, you know, people are like, why are you mad about these chatbots? You're not mad about Tesla. I don't want to buy a Tesla, I don't have to buy a Tesla, and it won't really impact my life negatively. But if I don't want to use a chatbot, it's still going to impact my life negatively. All the personalized spam now makes me spend more cycles on my classifier to tell if it's spam or not, because you can now use AIs to generate this so cheaply. Like, no, I mean, we have to move to a model where everything's just a dollar, right? Like, you want to send me an email, it's a dollar. You guys wouldn't care, none of my friends would care, no one would care, except the spammers, right? We've just got to move to those sorts of models. [01:11:54]
Swyx: Awesome. [01:11:55]
Alessio: One last message you want everyone to remember. [01:11:58]
George: Go try TinyGrad. I hope that we're a serious competitor to what's out there. And then I want to take it all the way. We'll start with just building something for GPUs and then we'll start building chips and then we'll start building fabs and then we'll start building silicon mines and then we'll have the first self-reproducing robot using. [01:12:15]
Swyx: Yeah, okay. All right, George. [01:12:18]
Alessio: Thank you so much for coming on. [01:12:19]
Swyx: You're a big inspiration. Thank you. Thanks. [01:12:21]
Swyx: Thank you. [01:12:29]
Get full access to Latent.Space at www.latent.space/subscribe
Emergency Pod: OpenAI's new Functions API, 75% Price Drop, 4x Context Length (w/ Alex Volkov, Simon Willison, Riley Goodside, Joshua Lochner, Stefania Druga, Eric Elliott, Mayo Oshin et al)
mercredi 14 juin 2023 • Durée 01:28:12
Full Transcript and show notes: https://www.latent.space/p/function-agents?sd=pf
Timestamps:
[00:00:00] Intro
[00:01:47] Recapping June 2023 Updates
[00:06:24] Known Issues with Long Context
[00:08:00] New Functions API
[00:10:45] Riley Goodside
[00:12:28] Simon Willison
[00:14:30] Eric Elliott
[00:16:05] Functions API and Agents
[00:18:25] Functions API vs Google Vertex JSON
[00:21:32] From English back to Code
[00:26:14] Embedding Price Drop and Pinecone Perspective
[00:30:39] Xenova and Huggingface Perspective
[00:34:23] Function Selection
[00:39:58] Designing Code Agents with Function API
[00:42:16] Models as Routers
[00:46:48] Prompt Engineering replaced by Finetuning
[00:52:15] The 2 Code x LLM Paradigms
[00:56:30] Smol Models for the future
[00:58:54] The Evolution of the GPT API
[01:03:27] Functions API Security vs Prompt Injection
[01:16:18] GPT Model Upgrades
[01:17:36] JSONformer
[01:21:03] Closing Comments - What We Want Next
Get full access to Latent.Space at www.latent.space/subscribe
From RLHF to RLHB: The Case for Learning from Human Behavior - with Jeffrey Wang and Joe Reeve of Amplitude
jeudi 8 juin 2023 • Durée 49:29
Welcome to the almost 3k Latent Space explorers who joined us last month! We’re holding our first SF listener meetup with Practical AI next Monday; join us if you want to meet past guests and put faces to voices! All events are in /community.
Who among you regularly clicks the ubiquitous 👍/👎 buttons in ChatGPT/Bard/etc.?
Anyone? I don’t see any hands up.
OpenAI has told us how important reinforcement learning from human feedback (RLHF) is to creating the magic that is ChatGPT, but we know from our conversation with Databricks’ Mike Conover just how hard it is to get just 15,000 pieces of explicit, high quality human responses.
We are shockingly reliant on good human feedback. Andrej Karpathy’s recent keynote at Microsoft Build on the State of GPT demonstrated just how much of the training process relies on contractors to supply the millions of items of human feedback needed to make a ChatGPT-quality LLM (highlighted by us in red):
But the collection of good feedback is an incredibly messy problem. First of all, if you have contractors paid by the datapoint, they are incentivized to blast through as many as possible without much thought. So you hire more contractors and double, maybe triple, your costs. OK, you say, let’s recruit missionaries, not mercenaries. People should volunteer their data! Then you run into the same problem we and any consumer review platform run into - the vast majority of people send nothing at all, and those who do disproportionately represent negative reactions. More subtle problems emerge when you try to capture subjective human responses - the reason that ChatGPT responses tend to be inhumanly verbose is because humans have a well-documented “longer = better” bias when classifying responses in a “laboratory setting”.
The fix for this, of course, is to get out of the lab and learn from real human behavior, not artificially constructed human feedback. You don’t see a thumbs up/down button in GitHub Copilot, Codeium, or Codium. Instead, they work an implicit accept/reject event into the product workflow, such that you cannot help but give feedback while you use the product. This way you hear from all your users, in their natural environments, doing valuable tasks they are familiar with. The prototypical example of this is Midjourney, which unobtrusively collects 1 of 9 types of feedback from every user as part of its workflow, in exchange for much faster first-draft image generations:
The best-known public example of AI product telemetry is in the Copilot-Explorer writeup: Copilot checks for the presence of generated code at intervals from 15 to 600 seconds after a suggestion is shown, which enables GitHub to claim that 40% of code is generated by Copilot.
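For a concrete sense of what that style of implicit-feedback telemetry can look like, here is a minimal Python sketch, assuming a hypothetical track() analytics call and a get_current_text() accessor for the editor buffer; the intervals and event names are illustrative, not Copilot's actual implementation.

```python
import threading
import time

CHECK_INTERVALS = [15, 30, 60, 120, 300, 600]  # seconds after a suggestion is accepted

def track(event, properties):
    """Hypothetical analytics call; swap in your real SDK here."""
    print(event, properties)

def watch_completion(completion_id, get_current_text, suggested_text):
    """After a suggestion is accepted, periodically check how much of it survives
    in the user's buffer and emit an implicit-feedback event each time."""
    def _worker():
        shown_at = time.time()
        for interval in CHECK_INTERVALS:
            time.sleep(max(0, shown_at + interval - time.time()))
            current = get_current_text()
            retained = suggested_text in current   # crude retention check
            track("completion_retention_check", {
                "completion_id": completion_id,
                "seconds_after_accept": interval,
                "retained": retained,
            })
    threading.Thread(target=_worker, daemon=True).start()
```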
This is fantastic and “obviously” the future of productized AI. Every AI application should figure out how to learn from all their real users, not some contractors in a foreign country. Most prompt engineers and prompt engineering tooling also tend to focus on pre-production prototyping, but could also benefit from A/B testing their prompts in the real world.
In short, AI may need Analytics more than Analytics needs AI.
Amplitude’s Month of AI
This is why Amplitude is going hard on AI - and why we recently spent a weekend talking to Jeffrey Wang, cofounder and chief architect at Amplitude, and Joe Reeve, head of AI, recording a live episode at the AI + Product Hackathon where 150+ hackers gathered to compete for over $22.5k in prizes from Amplitude, New Relic, LanceDB, AWS, and more.
To put things in perspective, Amplitude is a legendary YC alum with $238M of revenue in 2022 — our first guests representing the AI efforts of a public company!
We chatted about how they have been approaching AI in their product (“question to chart” BI, text field autofill, instrumenting Amplitude with Amplitude), some of the issues they’ve had with different models, and the importance of first-party data in the world of LLMs. Another topic that came out of the Q&A was this idea of almost an “AmplitudeGPT”; rather than using language to simply generate a query, you could have these models investigate reasons for why certain behavior is happening in your user base. It was a really good discussion, and hope you all enjoy listening to it!
Sections
* [00:00:47] Amplitude's founding story and pivot
* [00:03:28] Amplitude as an AI company and opportunities
* [00:07:14] Limitations and challenges with using AI models
* [00:10:56] Using Amplitude's product to build Amplitude - instrumenting AI
* [00:12:32] Existing ML models in Amplitude's product and customer use cases
* [00:15:50] “A/Z testing” and adaptable products
* [00:19:33] The future of analytics and dashboards
* [00:21:03] Optimizing for metrics in chatbots and AI products
* [00:26:22] Using general models vs. fine-tuned models
* [00:30:24] The importance of models vs. data - Amplitude's data set
* [00:39:00] Lightning Round + Q&A
Show Notes
* Sonalight to Amplitude pivot announcement
Transcript
Editor’s note: all timestamps are 1 minute behind because we hadn’t yet added the intro before making these. Sorry about that!
Alessio: Thank you everyone for coming. Hopefully, some of you have listened to the podcast before, if you haven't, we focus on AI research and application. So we don't focus on “AI is going to kill us all”. We don't think about virtual girlfriends. We don't think about all of these more societal things. We're focused on models: how do you build them? How do you train them? How do you use them in production? What are some of the limitations on getting these things from demos to things that millions of users use? And obviously, a lot of you are building things. Otherwise, you wouldn't be here. And some of you have been building things for a long time, and now have a new paradigm that you want to build on top of. So I'm excited to dive in here. And maybe, I mean, I'm sure most people know you, but maybe you want to do intros and give a little background. [00:00:47]
Jeffrey: Sure. Yeah, hey, everyone, met you all this morning, but I'm Jeffrey. I'm one of the co-founders and Chief Architect here at Amplitude. Been working on this product analytics thing, helping people understand user behavior data and make great product decisions and build better products for the last decade or so. And obviously, AI is a technology that we've been leveraging for a long time, but the recent trends are particularly exciting. And yeah, we have a lot of thoughts on how to apply that to our space, what we're doing in our product, and what we think the future of AI and product development and product data is. So excited to talk through some of those. [00:01:20]
Joe: Yeah, I'm Joe, Joe Reeve. I've got a background in sort of startups and tech, been professional software engineer since I was 16, quit college. And at the moment, I'm running sort of AI R&D efforts here at Amplitude. Super excited about all the new stuff, but also all the stuff that Amplitude's been doing for a long time and how we're sort of getting renewed interest and excitement and abilities to push that even further forwards. [00:01:44]
Swyx: So I think it's useful for people listening on the podcast and also some people here. Can you contextualize Amplitude as an AI company? Like what does that mean to you? What unique opportunities do you guys have? [00:02:02]
Jeffrey: Sure, yeah, happy to speak to that. So, you know, if we think about the fundamental thing that our customers of Amplitude try to do, it's they want to look at their product data and they want to figure out how do I make my product better? And the really cool thing about product data is that one, it's often like very high fidelity, right? Digital products compared to, you know, let's say physical products before them have way more information about what's going on. And so that's why product data is, you know, even a thing at all, right? You finally have that feedback loop of, hey, I built this thing. This is how people are using it. Now let me learn from that and make my product better. Now, one of the downsides of that is that the data is massive. If you look at any of the internet scale products out there, they generate enormous amounts of data. And the ability of humans to kind of sift through that data is obviously limited. At Amplitude, we try to give people as many tools, whether AI or not, in order to process that. But at the end of the day, if you could get from the data and what user behavior is happening in your product to the insights of how to make your product better without as much manual work, that's kind of the holy grail of product analytics. And so in some sense, Amplitude has always been a company on the path to AI because figuring out how to make your product better from data is ultimately an AI problem. And so we're kind of just solving all the barriers in the way, like getting data in first, building good models for short-term things. And long-term, it's always been about, hey, how can you take product data and automatically make your product better as fast as possible? [00:03:28]
Alessio: So that's the future of Amplitude. And a lot of people here probably want to start companies and whatnot. So maybe you want to give a 60 seconds of why you started Amplitude and what the story was like and maybe the first three to six months, what the challenges were. [00:03:42]
Jeffrey: Yeah, of course. It's funny that we talk about this, because the start of Amplitude is actually almost more AI than the current state. So actually my two co-founders, Spencer and Curtis, they went through YC originally with not Amplitude but Sonalight, which was a text-by-voice company. So it was kind of before the era of Siri and those types of technologies, where they wanted to build something that would read text messages to them, that's easy, but also do voice recognition so that you could send text messages, say when you're driving, without having to pull out your phone. And so they worked on it, and it was really popular back when they were doing it. After they finished YC, they realized the big innovation that they needed to figure out in order to make that successful was being really good at voice recognition, which was a different problem. They're awesome software engineers, but they don't come from an ML background. And so it's like, okay, are we going to spend the next five years solving voice recognition? Not really the thing that they had in mind when they were building product. But one thing that they happened to stumble upon as they were working on that was they spent a lot of time thinking about, hey, what was hard about that product? What made users churn? What made users really love it and engage? And they built a bunch of analytics tools to help them understand that. And they were really kind of shocked that those tools didn't exist out there in the market, or they were much more primitive than they wanted. And it turns out a bunch of other people in their YC batch felt the same. And they were like, hey, that analytics thing you're building, we want that. Forget your text-by-voice, we want your analytics product. And so they're like, okay, fine, we will pivot, natural language and voice recognition isn't really our thing. And so we'll do distributed systems and analytics instead. That's where I came in. I'm a distributed systems and analytics guy. And so I happened to get in touch with them just through some mutual friends at the time. And then, yeah, we kind of went on it. The funny thing about a lot of things in technology is that the most forward-thinking companies with respect to a lot of technologies are gaming companies. And so a lot of Amplitude's early start was either gaming companies or companies with founders that came from gaming backgrounds, where in gaming people have always been very, very rigorous about product data and optimizing engagement loops and all of that. And so they look for the best tools. Go to Zynga 15 years ago, that's where product analytics originated. And so a lot of those founders of new startups who had left Zynga were like, hey, that thing that you're building, that's trying to figure out patterns in user data and use that to make better products, that is exactly what we want after leaving Zynga. And then from there, that was Amplitude.
Swyx: Yeah, I think famously other gaming companies would be like Slack, right? Mr. Butterfield tried to make a gaming company and failed and made Flickr. Then he tried to make another gaming company and failed and made Slack. And now look out to see what he does next. Discord as well. That's right. [00:06:34]
Jeffrey: Yeah, people who come from gaming backgrounds are very rigorous in their product thinking. [00:06:39]
Swyx: That's interesting. Alessio, you have a background in games? [00:06:43]
Alessio: Yeah, in playing them, not in building them. So I will not fall into an enterprise company by doing that. Let's talk about R&D today and some of the ideas that you're working through, like some of the limitations that you run through. I think the most interesting thing about hackathons is you come with an idea and then you kind of hit a wall trying to build it. And then that takes you into another path. Like what are maybe funny things that you learn in terms of like the limitations of these models or like the missing infrastructure for using them? [00:07:14]
Joe: So we've got a couple of different frames for thinking about this. There's AI that we're putting into our products, and then there's us knowing that our customers want to put AI into their products. So there's the question of how we support our customers in their product development using AI, but also how we do that ourselves. And this is a great opportunity for us to learn the challenges our customers are gonna see. And so the first thing there is let's just start from the beginning, assume we want to add AI to our product, which maybe isn't the best place to start, but let's just assume we want to. How do we start ideating opportunities to put stuff into our product? So we sort of came up with this framework where we look at our product and we think about what are the collaboration touch points. So where are the points that a human might hand off to another human? And then think, where can we replace one of those humans with the machine? So instead of thinking of some amorphous AI, LLM, whatever, we're thinking, actually, what if we had a robot that we were collaborating with, not just a human, not just some sort of thing that spits out numbers. So that's collaborating. Then there's thinking of these as features. So this is like your auto-suggest on your mobile keyboard, or spell check or something. How do you integrate this stuff as deeply into your product? So what are the friction points that users go through? Maybe they check lots of boxes. Is there a way we can pre-check those boxes for them? So that's the feature, embedding really deeply into the tool you've already got, the product you've already got. And then you step back and think, okay, what's a tool? So a tool is like ChatGPT, where you go there, it's an AI-powered tool. It's not necessarily connected to your product, but it's a supplementary tool that you add. So there's a sort of ideation process there that we went through, and we sort of landed on a couple. And one of the key things that Amplitude does is help our customers, one, collect data in a standard and queryable way, and then we help them query it and get insights out of that data. So we were thinking, what's the feature there? How do we embed that? But also, what's the collaboration point? You might be a product manager asking an analyst, hey, please help me, let's have a conversation about this, I don't know what questions to ask. But you also might just be about to go click the big create button and fill in a bunch of fields, and can we fill in a bunch of the fields for you? So we went to what to us seemed like one of the most obvious places, and we built a text box. Surprise, surprise, with LLMs, we've got a text box. You can type in a question, anything about your data that you want to know, and then it'll spit back a chart, which is kind of neat. And we hit a bunch of problems there with LLMs hallucinating, losing context, even within the context window, not really recalling everything within the context window. So we did a bunch of experimentation and realized if we split this down into seven different questions, so instead of saying, generate me a chart and a query for this one question, let's split that into lots of sub-queries: what kinds of events should I use? How should I display this? What should I call it? Rather than asking for all of that in one go. But then we had another problem, where one query that a user makes actually spins out seven different queries. So how do we monitor this? 
We can't just say one performance metric. You know, RLHF, you can't just say yes or no, was the query response good? Because it might've failed for one of seven reasons. And maybe multiple of them failed, or maybe some of them failed and then maybe they've hallucinated, and so we're getting code errors where an enum is not being matched. So we've had lots of issues all the way down there that we've had to figure out from first principles, and it's been a really exciting way for us to understand what our customers are going through. [00:10:56]
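As an illustration of that decomposition, here is a minimal Python sketch that fans one natural-language question out into a handful of narrow sub-prompts; the prompt text, the categories, and the call_llm helper are placeholders for the purpose of the example, not Amplitude's actual seven sub-queries.

```python
# Each narrow sub-prompt is easier for the model than "write the whole chart
# definition in one shot", and each answer can be validated independently.
SUB_QUESTIONS = {
    "chart_type": "Answer with exactly one of: segmentation, funnel, retention.",
    "events":     "From the event taxonomy provided, list the events that answer the question.",
    "metric":     "Answer with exactly one of: uniques, totals, conversion rate.",
    "chart_name": "Suggest a short, human-readable title for this chart.",
}

def call_llm(instruction, question):
    """Placeholder for a chat-completion call returning the model's text answer."""
    raise NotImplementedError

def question_to_chart_spec(question):
    """Fan one user question out into several small inferences and assemble
    the pieces into a chart specification."""
    return {name: call_llm(instruction, question)
            for name, instruction in SUB_QUESTIONS.items()}
```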
Swyx: So I wanna be clear. So you've described your exploration and how you think about products. What have you released so far? I just wanna get an idea of what has been shipped. [00:11:08]
Joe: Sure. So in terms of LLM stuff, this, we call it question to chart internally. This ask-a-question, get-a-chart-out. We've started rolling this out to customers already. So last week, actually, we started rolling it out to the AI design partners that we had signed up, which has been a really exciting process. Actually, a lot of customers are just so excited to work with us and try it out and see how they can break it. So that's something we rolled out recently, which is built on LLMs. It's the first piece built on LLMs that we're working on. But we've also had a bunch of long-term, sort of traditional ML models and products that we've been running with customers that help them predict what their users are gonna do. Because we've got this massive behavioral data set, the best behavioral data set in the world. So we can train these awesome models and help our customers predict what their users are gonna do, so they can share the more relevant content, or now is the right time to ask people if they want to upgrade, or they want to rate your app, or that sort of thing. [00:12:05]
Swyx: Yeah, there is a little bit of a contrast, conflicts, because you already had all these ML models in-house and you're spinning up a new AI team and you're like, no, let's do all of this with GPT-3. Are the existing ML researchers saying like, no, this is a complete misuse of text generation? Or are they excited about it? Is it unlocking new things? [00:12:32]
Joe: Yeah, actually, it's the combining these things. So we're able to use the traditional ML to shorten the fields, to narrow the number of things we need to pass into the LLMs. Because the LLMs can do a lot more of the reasoning, but we can make sure that the context we're providing is much more specific and generally much better by using the traditional ML models. [00:12:53]
Swyx: Yeah, okay. And then the pain points that you're experiencing are hallucination. And then also like the multi-query thing. What do you think you wish for? Or what do you think you're thinking about to solve those pain points? [00:13:06]
Joe: So right now we're instrumenting with our own product. So we're instrumenting groups of inferences and individual inferences, which means we can then create charts that show how often they fail, why they fail, how often we need to retry to get good answers.
Swyx: So amplitude using amplitude. [00:13:23]
Joe: Exactly. To build amplitude. [00:13:24]
Swyx: Yeah, exactly. [00:13:25]
Joe: Well, I mean, we're a product company. What else would we do? [00:13:29]
Swyx: That is the second part of what you're saying, right? Which is, first of all, you want AI in the amplitude products. Second, people are shipping AI products with amplitude. You wanna talk a little bit more about what you're seeing there? [00:13:39]
Joe: Yeah. I guess the key thing here is, for a lot of people is, okay, I can build the thing that calls OpenAI's API and then gives a response back. I'm nervous that I'm gonna be giving incorrect answers. I'm nervous that I don't really know how to measure whether the answers are incorrect. And I'm nervous that I'm not gonna be able to improve over time. So a lot of people we actually hear are nervous of giving thumbs up, thumbs down buttons because they're implying to their users that they're gonna be using this to improve the results. But they actually have no idea how to use that to improve the results in a meaningful way. And particularly when you've got multiple queries going off for one request, you've gotta then fine tune lots of different things in parallel. So it gets to be quite a technically complex sort of problem if you're not using great tooling that already exists for it. So that's, and then you have the extra layer of, I'm getting a bad result. I've tweaked my prompt template that I'm sending off to OpenAI. And now, has the result got better or worse? [00:14:35]
Swyx: I don't know. [00:14:36]
Joe: I don't know how to measure that. Except by thumbs up, thumbs down, which is a difficult measure in the first place. So that's where we can start saying, measuring the behavior of users once we've generated something for them. So have they gone and shared this content? Have they used this content? They actually gotten any value out of it? Not just have they pressed thumbs up. We can actually measure, are they getting value? Are they throwing it away from their behavior? But then using that through the Amplitude product, we can then tie that through to A-B tests, which is another product that Amplitude has. So then suddenly we start, and we're not doing this yet. This is sort of next on our list, is to start putting these prompts into our A-B test variants. So then we make a tweak in the UI, and it goes off, fires on the original, the control and our variant, our new variant. See, does it get fewer or more errors? Does it get fewer or more thumbs up, thumbs down? [00:15:30]
Alessio: Have you thought about, I don't know, A-Z testing, I guess? Like, one of the limitations has been, well, people can only write so much copy to test, but now with these generative models, you can actually generate a lot of copy, and you can on-demand test more and more and more copy. Have you seen any maybe fun customer stories? Like, anything there? [00:15:50]
Jeffrey: Yeah, so actually there's a very good example of this. I don't know if I can share the actual customer, but actually from before the LLM days, where they literally generated the versions of the copy themselves, and they made their product basically adapt, you know, multi-arm bandit style of like, hey, here's all these different variations, like just go figure out the best one. At an internal hackathon, maybe two months ago, I built a prototype of what you're talking about, which is, okay, now replace the copy generation with an LLM. So just constantly generating new variations, and then multi-arm banditing to figure out which one's the best. I think that is probably the future of copywriting, where it's like, you don't actually need a whole lot of manual work anymore. It can, almost everything can happen automatically. And it's kind of the micro example in my head of this concept that we really like, which is self-improving products, where, you know, at some point, you know, someone has to say, hey, I'm gonna build a product that does this, you know, like a newsreader or something. But then, you know, after you have that, like the title of the newsreader, like the description of the sections, your navigation, all of that, in theory, you know, if you can give it some structure that the AI can play with, the LLM can manipulate all of that for you, and then use, you know, A-B testing, multi-arm bandits and all of that to kind of figure out what's best. And that generative AI kind of makes that last piece of like, what are my options possible? And that's super exciting for us. And we wanna be there, you know, to help you measure that, help you deploy that, and make that like the way people build products in the future. [00:17:14]
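A minimal sketch of that "LLM generates the variants, a bandit picks the winner" loop, assuming an epsilon-greedy strategy and a hypothetical conversion-tracking hook; a production system would more likely use Thompson sampling and proper statistical guardrails.

```python
import random

class EpsilonGreedyCopyTester:
    """Continuously test LLM-generated copy variants and favor the winners."""
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.stats = {}   # variant text -> {"shows": int, "conversions": int}

    def add_variant(self, text):
        self.stats.setdefault(text, {"shows": 0, "conversions": 0})

    def choose(self):
        # Explore occasionally, otherwise exploit the best conversion rate so far.
        if random.random() < self.epsilon:
            variant = random.choice(list(self.stats))
        else:
            variant = max(self.stats, key=self._rate)
        self.stats[variant]["shows"] += 1
        return variant

    def record_conversion(self, variant):
        self.stats[variant]["conversions"] += 1

    def _rate(self, variant):
        s = self.stats[variant]
        return s["conversions"] / s["shows"] if s["shows"] else 0.0

# Usage: seed with a few LLM-generated headlines, then let real traffic decide.
tester = EpsilonGreedyCopyTester()
for headline in ["Try it free", "Start in 60 seconds", "See your data come alive"]:
    tester.add_variant(headline)
shown = tester.choose()
# ... later, if the user converted:
tester.record_conversion(shown)
```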
Alessio: I think I've talked about this on the podcast, but this idea of like just-in-time UIs, you know, like each type of user wants to interact in a different way. And like, what you're building is a way of that, right? Like, Amplitude has been really like dashboard-driven, kind of like a diagram-driven, showing the user flow. Now each user can say, hey, I don't really want the table. I just want the charts. Or like, I don't want the charts. I just want the data. What do you think about the future of like dashboards and like BI in general? But like, the analysts used to come up with like what you should be seeing. Now each user can ask their own questions. [00:17:47]
Jeffrey: Yeah, like the future of analytics, I think, is, you know, can go a few different paths. One thing that I want to, you know, counter against the whole LLM trend a little bit is I think when you get into really important and specific questions, you know, let's say you're writing like some complicated SQL or even code, you know, code and SQL are good because they're very specific, right? You can define your semantics very precisely. And that's something that I think, you know, when people start thinking about like natural language questions, they kind of take for granted. They're like, oh yeah, why doesn't it just, you know, figure out the precise semantics from my very ambiguous words? It's like, well, it's actually, in some senses it's possible, right? Because the precise semantics are not captured by your ambiguous natural language words. And so the way we think about it, at least today, you know, who knows what's going to change in the future is like natural language is a great interface to like get started. If you don't know what the underlying data looks like, if you don't know like what questions you should be asking, it is a very, very expressive way to start, get started. It's much easier than manipulating a bunch of things, much, much easier than writing SQL and all of that. But like once you kind of know what you want, it's very hard to like make it precise. It's actually easier to make SQL or code precise than it is natural language. And so that's a little bit of what we're thinking right now. So we think, you know, for sure the way that maybe many people will interface with analytics and data will turn into natural language because maybe the precision doesn't matter to them. But like at the end of the day, when you're trying to get, you're trying to sum up your revenue or something, it's like, you want to know that it's right. And you want to know the semantics that go into that. And like, that's why, you know, that's part of why data is hard. The semantics really do matter. They can make a huge difference in the output. And so there's a boundary there that I'm curious where it will push over time, but I don't think it's quite there yet. [00:19:33]
Joe: I think this is where models sort of can become more embedded as features rather than go off and do this thing, create this analysis for me and then come back, the collaborator model. Then we're saying this field, I'm not sure what should go in there. Can you make a suggestion? And then I'm going to go and refine it over time. So it's the sort of autofill, but guessing autofill, but then you still, you can tweak everything. This is one of the core design sort of principles that we've come up is yes, you've got to be able to explain what the model's doing. And as a human, I need to understand, a user I need to understand what is the model doing and why is it doing it? But I also need to be able to tweak it once it's done it. I don't want to feel like I've just said go and then I can't stop it and it's going to go off and do stuff. And that's sometimes how things like AutoGPT can feel. It's going and it's costing me OpenAI tokens and I have no idea what's going on. So yeah, I think a key thing is servicing all the individual things the model's doing and allowing users to tweak it, stop it, retry while it's going. [00:20:33]
Swyx: For me, one of the most challenging questions is something I think you guys have maybe thought about a lot which is chat. Ideally you want, like you could say naively, for example, you want to optimize time in app, but actually that's a sign of failure if the chat session is longer than it should be. Do you have any advice on, I'm sure you've dealt with this before pre AI era, but like what do you advise AI hackers to optimize for? Like what analytics should people be looking at? [00:21:03]
Jeffrey: Yeah, our general kind of philosophy as a company is to work with customers to identify north star metrics. Right, and like time in app is not good primarily because it doesn't actually correlate with your business outcomes most of the time. And to be fair, sometimes it does. Like if you're a social media app, maybe it does correlate really well and maybe it's not a bad metric then. But for a lot of other products, right, if you're trying to do the search, for example, or like time on search, like nobody wants that. It's like, yeah, what is your success rate? You know, how many, do you get them to come back and search in the future? Like that's much more interesting than the time of your session. And so, because you know, each time you can serve apps, right, that's your business. And so it's like, if you choose a metric that's well correlated with your business outcomes, then that's at least the first step to getting that right and not getting caught up in other vanity metrics that sound like they could be good to increase, but then, you know, they can sometimes lead to negative business outcomes, you know, and then you get the worst. You've optimized the wrong metric the whole time. And that's where tying in AI and product analytics makes a lot of sense. And it's really important because product analytics, these companies that are like our customers that are trying out building features that are LMs and they're not sure what to optimize for, optimize for the same thing you're already optimizing for. You're already measuring conversions. You're measuring how much value, hopefully, your customers are getting out of your product. So continue doing that and maybe find a way to tie the LLM feature to that and sort of through A-B tests and that sort of thing. And then on the chat specifically, chat is obviously for a business maybe rolling out a chat box based on LLMs. It can be really scary. And that's another sort of mental model of framing we've been thinking around is we find LLMs right now are most useful either when you come from, either when you have a narrow input space and a broad output space, because you can be very, you know exactly what format of data, what kind of data is gonna be passed in. That's probably not coming directly from a user. It's probably coming from a button click or a toggle switch or something. And then you can have a general output and you can provide templates and that sort of thing. And then the other way is broad input space, narrow output space. So that's free form text box. And you can provide a bunch of sort of clamping, framing, validation on the output to make sure that you're not spewing out, you know, poems about Hitler or whatever it is. You know, you can be really, really deliberate when you've got a small output space. Chat is large input space, large output space, which is really, really scary. If you're, as a company, you're not selling a chat product, you're selling a, you know, an analytics product with maybe a chat support bot or something. [00:23:37]
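One way to make the "broad input space, narrow output space" framing concrete is to clamp free-text model answers onto a small allowed set before they reach the rest of the system. A minimal sketch, with an invented chart-type enum:

```python
ALLOWED_CHART_TYPES = {"segmentation", "funnel", "retention"}

def clamp_to_enum(raw_answer, allowed, default=None):
    """Map a free-text model answer onto a small allowed set, or fall back."""
    cleaned = raw_answer.strip().lower().rstrip(".")
    if cleaned in allowed:
        return cleaned
    # Tolerate answers like "I'd use a funnel chart here."
    for value in allowed:
        if value in cleaned:
            return value
    return default

assert clamp_to_enum("Funnel.", ALLOWED_CHART_TYPES) == "funnel"
assert clamp_to_enum("a pie chart", ALLOWED_CHART_TYPES, default="segmentation") == "segmentation"
```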
Swyx: Yeah, I think this is one of those opportunities. I always try to raise the awareness of this, that Copilot I think did a really interesting metric or North Star, which was how much code is kept or retained by the user. And for people who are Googling along, you can actually look for this blog post about reverse engineering Copilot internals. And they actually set up custom metrics around, you know, 30 seconds after a code snippet is accepted, one minute, two minute, three minute, all the way to five minutes. And you can sort of see it construct a curve of how long Copilot suggestions stick around. And from there, they can actually make statements like this, you know, evaluate the success of the products. It's pretty cool. [00:24:18]
Joe: One of the really nice things we found actually, we accidentally did this. So our chart building interface, heavily instrumented. It's a, we're Amplitude. So we instrument our product. We also, it's one of the main tools that our customers use. So it's really, really well instrumented. And so when we tied chart creation through asking a question through an LLM, and then we tied that to a chart, an output chart, we then automatically were able to tie every time someone edits any of the parameters to that generation. So then we know, we have really detailed RLHF data for, yeah, you got everything apart from the metric, right? But you got everything apart from this event that shouldn't have been there, because that's the one that got removed. So similar to the Copilot there. [00:25:00]
Alessio: And I want to make sure we open it up for questions, but like one last thing is about, everybody knows that small is beautiful. And when you think about what models to use and some of the parameters, like there's costs, there's latency, there's like accuracy. How do you think about using, you know, GPT-4 and some of those models versus using smaller ones that are fine-tuned? What are the trade-offs? [00:25:23]
Joe: Yeah, I guess right now we're very much in the, let's explore, let's try everything and just iterate as fast as possible, which is what general models are great for. We do have some smaller, not even fine-tuned, some smaller models that we've sort of borrowed from Hugging Face that we run internally for more specific tasks. And that's often sort of selecting specific values before we pass it to a general model right now, just because the general models are much easier to communicate with and they understand most of the words we use. It's not like we use a word and suddenly we get random outputs for no reason, the sort of gold magic up type thing. So they're generally less susceptible to that. So that's why we're iterating heavily on the general models. I think we absolutely have to move to some more specific models, particularly given inference on fine-tuned open AI models gets more expensive and slower the more you do it. So yeah, that's definitely a thing we're looking at and we're doing some internal stuff, but it's the next step or one of the next steps. [00:26:22]
Jeffrey: Yeah, to give a pseudo-example of that, one of the hard things to help users with in Amplitude is picking the right event to analyze. It's kind of your fundamental unit of analysis. And when a user comes in, let's say it's the first time they're using Amplitude, someone else in their company has set up the product, so they don't know what the events are. Right now in Amplitude you get this massive dropdown and it's like, all right, there's a thousand things, which one is the one I'm looking for? And sometimes the names are good and sometimes they're not. But one thing we did was, okay, yeah, feed that into OpenAI: hey, tell me which event type best matches this user's intent. It's pretty good at that, right? So it's all language stuff, but it's a little bit slow and it's a little bit expensive to do that every time. And so once we validated that that works, we kind of fell back to a more traditional embedding-based approach. It's like, all right, compute all those embeddings. That's more work upfront, because you have to go through your database of all of these things and you've got to commit that engineering work, but you validate with the general model because it's just easy, it takes like an hour to figure out that it works. And then it's like, all right, can we do the same thing with embeddings? That's way faster, way cheaper, and still has reasonable quality. Embeddings also have a nice quality that you can get the magnitude of things, whereas LLMs aren't great at giving you, hey, it matches this much. You can ask it for an ordering and that's decent, but anything beyond that is pretty challenging. [00:27:42]
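A minimal sketch of that embedding fallback: precompute vectors for every event name once, then match a user's query by cosine similarity. The embed() function stands in for whatever embedding model you use; names here are invented for illustration.

```python
import numpy as np

def embed(texts):
    """Placeholder: return one embedding vector per input string."""
    raise NotImplementedError

def build_event_index(event_names):
    # One-off upfront work: embed and normalize the whole taxonomy.
    vectors = np.array(embed(event_names), dtype=np.float32)
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
    return event_names, vectors

def best_matching_events(query, index, top_k=5):
    names, vectors = index
    q = np.array(embed([query])[0], dtype=np.float32)
    q /= np.linalg.norm(q)
    scores = vectors @ q                      # cosine similarity (both sides unit length)
    top = np.argsort(-scores)[:top_k]
    return [(names[i], float(scores[i])) for i in top]
```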
Alessio: How do you think about the importance of the model versus the data, right? There's like a lot of companies that have a lot of data, but not a lot of AI expertise or companies that are just using off the shelf model. How should companies think about how much data to collect? What data is meaningful? What isn't, any thoughts there? [00:27:59]
Jeffrey: Yeah, I think it's safe to say that both are really important, right? Like the evolution of LLMs really was a lot of model innovation, and so I don't want to downplay that. At the same time, I think the future of AI applications and doing really cool things with it will be in the data, partially because, you know, ChatGPT has been such a huge advance, right? The LLM model space has advanced like crazy in the last year. And so I think a lot of the untapped potential will be in data in the future. One thing that's particularly interesting to us is we have a pretty unique data set, actually. It's a lot of first-party behavior data, right? So if you're Square, for example, you instrumented the way that people interact with Square Cash and the wallet and the checkout system. And those are very specific things. Square can't look elsewhere in the world for that stuff. And that's really interesting because, you know, to build models of user behavior, you need user behavior data. And it turns out there's not actually a lot of examples of user behavior data out there in the world. And so to Joe's point earlier, we have one of the best user behavior data sets in the world. And so if we want to build a model around that, I think it would be a super interesting one. So if you take an analogy to what ChatGPT does, it basically takes a bunch of language examples and it learns a bunch of abstract concepts, like how to prove math things or how to render in JavaScript. It's like, wow, that's very astonishing. It's almost like a proof of concept to the world that if you train a sufficiently good transformer, self-attention-type model with a sufficiently large data set of, you know, hundreds of gigabytes of internet text, you'll learn really interesting abstract concepts. And so we want to apply that to our data set, right? ChatGPT is great because it's a proof of concept. If it didn't exist, I would have told you, yeah, you can spend $10 million training this model on a data set, you'd probably not get anything interesting, because we just have no idea. But because it exists, it kind of proves to the world that if you do this correctly, there is a ton of interesting value. And so that's what I think. And so, you know, Amplitude is just one example of a very interesting data set where you will train something that's fundamentally very different from GPT or any LLM out there. And there's lots of other data sets out there. And I think that's where a lot of the interesting things will come once this phase of rapid model evolution tapers out a little bit, and you'll see a lot of the more interesting applications there. [00:30:24]
Swyx: So I've never thought about this much, but you guys must do it a lot. Like what is the ethics or best practices around training on user data when they don't know they're being watched? Like, I mean, presumably they're fine with tracking and events, but like, do we tell them that we're going to train on their data? Is it okay? [00:30:50]
Joe: I guess there are a couple of things. One is PII. Doesn't go anywhere near the stuff, right? PII with strip and like, that's just a really important thing. [00:30:58]
Swyx: You still need an identifier for streams. [00:31:02]
Joe: Yeah, yeah. But in terms of training models, we don't want any of that to go in there because then you might accidentally, you know, like, hello, ChatGPT, please hallucinate me a social security number. That's dangerous. [00:31:11]
Swyx: Also PII makes it into prompts a lot. [00:31:14]
Joe: Sure, that's true. So then you have to strip that from your... So we have some experiments where we're stripping PII that is in places that shouldn't be, you know, descriptions of things. Sometimes people copy paste big long lists of email addresses into charts and things. But some of these things are actually pretty surprisingly easy to detect and strip out. So we can do that. And we have some layers that are stripping out that sort of replacing them with tokens. So the LLMs can still operate on them. But in terms of training this data, all that training is happening internally and we're not putting any sort of private data, personally identifiable information in. I don't know if there's anything you wanted to add there. Yeah, yeah. [00:31:54]
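As a toy example of the stripping-and-tokenizing approach Joe describes (emails only; real PII detection covers many more patterns and would usually use a dedicated library):

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact_emails(text):
    """Replace email addresses with stable placeholder tokens so the LLM can
    still reason about them, and return a mapping to restore them afterwards."""
    mapping = {}
    def _sub(match):
        token = f"<EMAIL_{len(mapping)}>"
        mapping[token] = match.group(0)
        return token
    return EMAIL_RE.sub(_sub, text), mapping

def restore(text, mapping):
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text

redacted, mapping = redact_emails("Contact ada@example.com or grace@example.org")
# redacted == "Contact <EMAIL_0> or <EMAIL_1>"
```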
Jeffrey: We certainly think about this a lot and our customers think about a lot. Like when I think about user privacy with respect to tracking, there's kind of this big spectrum. Around the one end, it's like literally track nothing and, you know, the end of story. And like for people like that, I mean, that's cool. You know, they're not gonna use Amplitude. They may not like us very much. You know, that is what it is. And then on the other end of the spectrum is like, we're gonna track you across the entire internet and sell your data to everyone. And like, that's obviously bad. And like, there's lots of good reasons to think that's bad. First party behavioral data, I think is actually probably almost as far. Fully anonymized first party behavior data would be like kind of the minimum. It's like web server logs with no IP, no identifier, nothing. The problem is that you can't do a lot of interesting behavioral analysis without that. You can't tell if, you know, this person that came on this day was the same one that purchased later. And so like, you can't actually, it's much harder to make your product better if you don't have that. And so, you know, we're kind of set at this place where we have, you know, like pseudo anonymized first party data. And like, we don't sell the data. You don't mix data from, you know, different places on the internet through Facebook cookies or things like that. And, you know, our philosophy is like, that is actually the most important data to build a better product. It's not the most important data to advertise, which is why Facebook and Google do what they do, but it's the most important data to build a better products. And it kind of strikes the right balance between yeah, totally tracking everything that you're doing and like not having any information to make your product better. [00:33:19]
Swyx: Yeah, cool. And I think we're going to go to audience questions. So let's start warming them up soon. But I think we have some lightning round questions [00:33:29]
Joe: The audience is thinking of questions while we go. [00:33:31]
Alessio: The first one is, what's something that already happened in AI that you thought would take much longer to be here? [00:33:39]
Jeffrey: I don't know what the constraints on our lightning round, but I think maybe creativity is the best word where it's, you know, with the image generation stuff, text generation, you know, one thing that still blows my mind, I used to be a competitive like math guy and like there's this international math Olympiad problem in one of the papers and it solves it. And I'm just like, wow, I can solve this when I was spending all my life doing this thing. Like that level of creativity really blew my mind. And what's the takeaway? It's like maybe the takeaway is that creativity is not as, you know, as not as high entropy or high dimensional as we think it is, which is kind of interesting takeaway. But yeah, that one definitely surprised me. [00:34:21]
Joe: I guess there's something actually that maybe answering the inverse question that a lot of my friends were surprised happened quickly. And I was like, this is just braindead obvious. I've got a lot of friends in the AI safety space. So they're worried that in particular, X-risk, right, extinction risk, that AI is going to kill the human race. And they were like, oh no, what if an AI escapes containment and gets access to the internet? And then we get an LLM and the first thing we do is like, hey, also GPT, here's the internet. [00:34:48]
Swyx: So you thought, it's happening faster than you thought. [00:34:53]
Joe: Well, it's happening faster than, to me it makes sense, because I'm like one of the guys connecting it to the internet. And I'm like, I'm surprised that other people were surprised it was going to be so fast. [00:35:01]
Swyx: Yeah, so a bit of context, Joe and I, we've been adjacent to the EA community and they have like smoothly migrated to the X-risk community very quickly after SBF. [00:35:13]
Joe: Yeah, after SBF, yeah, that was fun. [00:35:16]
Swyx: Okay, so next question, exploration. What do you think is the most interesting unsolved question in AI? What's next? [00:35:30]
Joe: I guess like, is it going to keep getting better at the same rate? Is it going to, and that's just a super important question that's going to change. Like, depending on that answer, 50 startups are going to pivot or not pivot, right? [00:35:43]
Swyx: Which is what's next, literally. [00:35:45]
Joe: Literally, what's next? Like in a year's time, are the models similarly better than they have been so far? Or are we about to taper off or are we about to continue going linearly? [00:35:58]
Jeffrey: Yeah, I'll throw one out that is not necessarily about AI, but like, what's intelligence, right? And if you ask people 20, 30 years ago, maybe even longer now, it's like, yeah, chess. Chess is intelligence. And then chess got solved and like, ah, that's just brute force. And it's like, well, you know, creating creative images and writing, that's intelligence. Well, it's like, that's solved too. Maybe it's just, you know, if you have enough parameters, you can capture that. So like, what is intelligence? What does it mean to have an AGI? What does that actually mean? And then what the implications that are on for our understanding of humans and our brains. I've always thought that, you know, everyone is just a stochastic machine. And so, you know, is everything consistent in my mind?
Swyx: Free will and illusion. Exactly. [00:36:43]
Joe: I guess maybe the scaling piece is that intelligence, as you scale, gets more and more expensive on the traditional stuff. But then there's something I think I saw yesterday on Hacker News. It was people actually getting a brain to play tic-tac-toe. By a brain, I mean stem cells grown into brain tissue. And they were able to train it. And that to me is very significant, because suddenly the metal computer's limitations don't apply, and now we've got all this "what is intelligence" stuff on a squishy wet computer. That makes it even harder to ask and even harder to draw lines. [00:37:18]
Swyx: Yeah. Yeah. So famously, you know, language models are so much more inefficient than wet computers, as you say. And so if you can merge that, you know, the human brain runs on 30 Watts of power as it is my favorite fact. We're not anywhere close to that yet. [00:37:36]
Alessio: Before we get into Q&A, one last takeaway that you want everybody to think about. [00:37:41]
Jeffrey: Yeah, I'll do the one that we actually repeat in Inside Amplitude very often, not about AI, but I think it applies, which is it's early. It's sometimes hard to realize that when things are happening so fast, especially in the Bay Area, but like the ramifications of AI or in our case, product data and all that are gonna play out over the next many decades. And that's just, you know, we're very fortunate to be at the beginning of it. And so yeah, take advantage of it and keep reminding yourself that it's early. [00:38:15]
Joe: I guess mine would be, let humans be good at doing human things. Let machines be good at doing machine things and let machines be good at doing machine things and help humans be good at doing human things. And like, if you don't do that, then you're gonna be building something that's either not useful or it's very scary. So yeah, get machines helping humans, not the other way around. [00:38:39]
Swyx: Get machines helping humans. All right. With that, I think we're all gonna open up to questions. We're gonna toss you the mic. [00:38:45]
Audience #1: Yeah, hey, thanks for the insight into how you guys implemented your AI question-asking chatbot, and how you've converted it into seven sub-queries and then generate the data out. It's piqued my interest about how exactly you do it. Like Alessio asked, what exactly is the model that you guys are using? What are these queries that you generate from a single English-language question? Is it possible to go a little deeper, just from a curiosity perspective? [00:46:34]
Joe: So we have a custom query engine. So it's not SQL or anything that we're generating, we're generating a custom query output. So I guess the types of questions range. So things like chart type: are we doing a segmentation chart, a line chart, or are we doing a funnel chart? You know, the number goes down over time or up over time, or a conversion between two events, and there are various other types and metrics. And then there's also the name: what should we name this chart that answers this question? So the way that's implemented in practice, you could use something like LangChain to sort of chain these things together. But in our experience, I think LangChain's a great tool for certain things and definitely really great for prototyping, but we found it quite restrictive. So we've ended up building a sort of internal framework, a very, very small wrapper, in TypeScript, that allows us to basically just write in code and infer within what we call a transaction, an inference transaction, which gets monitored as one, but then also all the individual inferences within it get monitored. So it's a bit like when you're writing a database transaction, at least in the Node ecosystem, the JavaScript ecosystem, where you sort of get a transaction object that you can operate on, and then you sort of commit your transaction. So we've got an interface like that, so we can just write pure TypeScript, await this response or await these responses. And then we've got a switch case: if it's a segmentation chart, go and do these queries. And then each of those inferences can be a different model. So we think in the future, maybe we have one query where we have some GPT-4 responses, we want some text responses, maybe we also want to generate an image from that same query together, and then that gets bundled. So I don't know if that answers your question.
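Amplitude's wrapper is an internal TypeScript framework, but the transaction pattern Joe describes can be sketched in a few lines of Python with a placeholder track() call; the function and field names here are invented for illustration.

```python
import time
import uuid
from contextlib import contextmanager

def track(event, properties):
    """Placeholder analytics call."""
    print(event, properties)

@contextmanager
def inference_transaction(name):
    """Group several model calls under one transaction and monitor both the
    group and each individual inference (roughly the pattern described above)."""
    txn = {"id": str(uuid.uuid4()), "name": name, "inferences": []}

    def infer(step, fn, *args, **kwargs):
        start = time.time()
        try:
            result = fn(*args, **kwargs)
            txn["inferences"].append({"step": step, "ok": True, "ms": (time.time() - start) * 1000})
            return result
        except Exception:
            txn["inferences"].append({"step": step, "ok": False, "ms": (time.time() - start) * 1000})
            raise

    try:
        yield infer
        track("inference_transaction_succeeded", txn)
    except Exception:
        track("inference_transaction_failed", txn)
        raise

# Usage:
# with inference_transaction("question_to_chart") as infer:
#     chart_type = infer("chart_type", call_llm, chart_type_prompt, question)
#     events = infer("events", call_llm, events_prompt, question)
```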
Audience #1: Yeah, I think so. Yeah, thank you. I think so. You said in the future you're going to use GPT-4. What are you using right now? [00:48:33]
Joe: Right now, everything's GPT-3.5. We're moving around, and I think probably for some of the prompts, we'll use something like DaVinci. Some we might use GPT-4. Some we'll be using internal ones. And we also want to be able to degrade gracefully if a customer has told us they don't want us to send anything to OpenAI, then we can degrade to some internal models that maybe are some of the open source models that have been trained on smaller datasets. [00:48:57]
Audience #1: Gotcha, makes sense. Thank you. [00:48:58]
Jeffrey: Yeah, I think to add to that a little bit, the key is breaking down the problem sufficiently, because if you break down the problem enough, you can also provide it with some examples, which is super helpful, right? You know, GPT is quite good at zero-shot, but within the context of our specific domain, it doesn't know what's going on. And so being able to break down the problem to, hey, select the type of chart. Don't generate me an entire chart definition. Select me the type of chart, and then select me the specific metric based on their query, and then giving it some examples. Select me the events and properties that I want to look at. By breaking it down and having very, very contextual prompts with respect to those examples, you get a lot higher quality output than trying to generate everything at once. If you imagine, like, hey, here's the schema of all my tables, now generate me a whole SQL query, it actually struggles with stuff like that, because it's just kind of too much information and computation to come out of language. Now, maybe GPT-5 will be different, but that's the state of the art today. [00:49:57]
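To make the decomposition Jeffrey describes concrete, a single narrow, few-shot prompt might look something like the sketch below. The example questions, labels, and wording are made up for illustration; they are not Amplitude's actual prompts.

```typescript
// A hedged sketch of a narrow, few-shot prompt: instead of asking the model
// for a whole query, ask for one small decision with examples.
// The example questions and labels are invented for illustration.
function chartTypePrompt(question: string): string {
  return [
    "Pick the chart type that best answers the question.",
    "Options: SEGMENTATION | FUNNEL | RETENTION",
    "",
    "Q: How many users clicked Checkout last week?",
    "A: SEGMENTATION",
    "Q: What fraction of users who added to cart went on to purchase?",
    "A: FUNNEL",
    "Q: How many signups from March were still active in April?",
    "A: RETENTION",
    "",
    `Q: ${question}`,
    "A:",
  ].join("\n");
}
```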
Swyx: I'll ask a follow-up to Joe. So you mentioned trying LangChain, but not needing it for production. Any other comments on tooling that's out there that's interesting to you? Do you use an embedding database, for example, or do you just use a regular database? [00:50:18]
Joe: Yeah, so we've actually been running embedding similarity or vector search in production for multiple months, maybe even almost a year, in just straight-up Postgres, but now we're using pgvector, which Jeffrey could probably speak more to, that decision and what that was like. [00:50:40]
Swyx: So this is a pretty hot take. At Amplitude scale, all you need is Postgres? [00:50:46]
Joe: We do use many things other than Postgres. But I mean, this isn't rolled out for all customers and it's not necessarily getting hit with a lot of traffic. And so the scale here is very different. Our usage scale is very different to our ingestion scale. [00:51:04]
Swyx: Yeah, yeah, yeah. [00:51:06]
Jeffrey: Just to clarify that a little bit more, we're not putting in individual end-user vectors or event vectors. We're putting in taxonomies. So if I'm DoorDash, my taxonomy is add to cart, checkout, purchase, browse. That's the cardinality. And so that's actually small. It's on the order of tens of millions. And so yeah, you can use stuff like that in Postgres, no problem. Now, when we talk about large behavioral models or actually embedding events, there are many, many trillions of those. And yeah, Postgres probably doesn't work there. [00:51:41]
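As a rough sketch of what this looks like in practice, the query below runs taxonomy-level similarity search against plain Postgres with the pgvector extension. The table and column names are invented for illustration and are not Amplitude's schema.

```typescript
// A rough sketch of taxonomy-level similarity search in plain Postgres with
// the pgvector extension, along the lines described above. The table and
// column names ("events", "embedding") are illustrative assumptions.
import { Client } from "pg";

async function findSimilarTaxonomyEvents(queryEmbedding: number[], limit = 5) {
  const client = new Client(); // connection settings come from PG* env vars
  await client.connect();
  try {
    // The table holds taxonomy entries ("add to cart", "checkout", ...),
    // not individual end-user events, so cardinality stays manageable.
    const { rows } = await client.query(
      `SELECT name, embedding <-> $1::vector AS distance
         FROM events
        ORDER BY embedding <-> $1::vector
        LIMIT $2`,
      [`[${queryEmbedding.join(",")}]`, limit]
    );
    return rows; // e.g. [{ name: "checkout", distance: 0.12 }, ...]
  } finally {
    await client.end();
  }
}
```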
Swyx: Yeah, actually I wanted to comment on this slightly before, which is separating taxonomies from the actual data is one way you protect your customers against prompt injection. It's something that Simon Willison has been talking about, where you want the model to be able to query for one thing, but have essentially no knowledge of the actual underlying data, just the taxonomy. So it's good practice. [00:52:00]
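A small sketch of that practice: the prompt below carries only taxonomy names (event names), never row-level user data, so untrusted content in the data has no path into the model's instructions. The function and variable names are made up.

```typescript
// Illustration of taxonomy-only prompting: the model only ever sees
// schema-level names, never the underlying event payloads or user text.
function selectEventsPrompt(question: string, taxonomy: string[]): string {
  return [
    "You may only reference the following event names:",
    ...taxonomy.map((name) => `- ${name}`),
    "",
    `Question: ${question}`,
    "Answer with a comma-separated list of event names from the list above.",
  ].join("\n");
}

// Usage: the taxonomy is the only data that enters the prompt.
const prompt = selectEventsPrompt(
  "What fraction of users who added to cart went on to purchase?",
  ["add to cart", "checkout", "purchase", "browse"]
);
```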
Audience #2: Yeah, so you talked about a model which would be trained on user behavior data, like an Amplitude GPT. It really piqued my interest. What capabilities do you think would emerge? What do you think you would find, and what would be the first thing you would ask the model? That's a good question. [00:52:23]
Jeffrey: So we've thought about this a little bit, and I think, right, these are sequence, token prediction models. And so at the very least, I would hope for a much better, we have a predictions feature right now, which says, hey, given what a user has done over the last 90 days, do we think they're gonna belong to this cohort in the future or not? So that cohort might be people who churn, people who purchase, people who upsell, whatever the customer wants. We think it would be much better at tasks like that, right, because if it just has a very good understanding of behavioral patterns and what's gonna come next, it would be able to do that. That's exciting, but not that exciting. If I'm trying to think about the analogies to what we see in LLMs, it's like, okay, what is the behavioral equivalent of learning physics concepts, right? I don't actually know, but it might be this understanding of patterns of sessions. For example, categorizing users in an unsupervised way seems like a very simple output for a model that understands user behavior, right? Here's all the users, and if you wanna discriminate them by their ability to achieve some outcome in the future, here's the best way to separate that group and here's why, right? Being able to explain at that level would be super powerful for customers, right? A lot of times what our customers do is, hey, these people came back the next day and these people didn't, why? What was different about them? And so we have a bunch of heuristics to do that, but in the end, causal impact is one of the holy grails of product analytics. It's like, what was the causation behind some observed difference in behavior? And I think, yeah, a large behavioral model will be much better at assessing that and be able to give you potentially interpretable ways of answering that question that are really hard to do today, really computationally intensive, really noisy; distilling causation from correlation is obviously super hard. Those are some of the examples. The other one that I am, I don't know if I'm optimistic about it, but I think is really interesting: one of the things that Amplitude requires today is manual instrumentation, right? You have to decide, hey, this clicking of a button, this viewing of a page, these are important things. I'm naming them in this way. There's a lot of popular tools out there that kind of just record user sessions or track DOM events automatically. There's a lot of problems with those tools because the data is incredibly noisy. It's just so noisy, right? A lot of times you just can't actually interpret it. And so it's like, oh, it's great because I don't need to do any work. But, well, you also don't get anything out of it. It's possible that a behavioral model would be able to actually understand what's going on there by understanding your user behavior in a correctly modeled and correctly labeled sense, and then figuring it out. I don't know if that's possible. I think that would make everyone's lives a lot easier if you could somehow ask behavioral questions of data without having to instrument. All of our customers would love that, but also all of them are instrumenting because they know that's definitely not possible today. [00:55:26]
Audience #2: This is really interesting. You're looking forward to the future. If you're gonna build it, it's gonna be amazing, yeah. [00:55:31]
Jeffrey: That's the goal, that's the goal. [00:55:33]
Audience #2: Awesome. [00:55:34]
Swyx: Thanks for listening. [00:56:09]
Get full access to Latent.Space at www.latent.space/subscribe
Building the AI × UX Scenius — with Linus Lee of Notion AI
Thursday, June 1, 2023 • Duration 01:09:50
Read: https://www.latent.space/p/ai-interfaces-and-notion
Show Notes
* Notion
Timestamps
* [00:03:30] Starting the AI / UX community
* [00:10:01] Most knowledge work is not text generation
* [00:16:21] Finding the right constraints and interface for AI
* [00:19:06] Linus' journey to working at Notion
* [00:23:29] The importance of notations and interfaces
* [00:26:07] Setting interface defaults and standards
* [00:32:36] The challenges of designing AI agents
* [00:39:43] Notion deep dive: “Blocks”, AI, and more
* [00:51:00] Prompt engineering at Notion
* [01:02:00] Lightning Round
Transcript
Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in residence at Decibel Partners. I'm joined by my co-host Swyx, writer and editor of Latent Space. [00:00:20]
Swyx: And today we're not in our regular studio. We're actually at the Notion New York headquarters. Thanks to Linus. Welcome. [00:00:28]
Linus: Thank you. Thanks for having me. [00:00:29]
Swyx: Thanks for having us in your beautiful office. It is actually very startling how gorgeous the Notion offices are. And it's basically the same aesthetic. [00:00:38]
Linus: It's a very consistent aesthetic. It's the same aesthetic in San Francisco and the other offices. It's been for many, many years. [00:00:46]
Swyx: You take a lot of craft in everything that you guys do. Yeah. [00:00:50]
Linus: I think we can, I'm sure, talk about this more later, but there is a consistent kind of focus on taste that I think flows down from Ivan and the founders into the product. [00:00:59]
Swyx: So I'll introduce you a little bit, but also there's just, you're a very hard person to introduce because you do a lot of things. You got your BA in computer science at Berkeley. Even while you were at Berkeley, you were involved in a bunch of interesting things at Replit, CatalystX, Hack Club and Dorm Room Fund. I always love seeing people come out of Dorm Room Fund because they tend to be very entrepreneurial. You were a product engineer at IdeaFlow, a resident at Betaworks. You took a year off to do independent research and then you've finally found your home at Notion. What's one thing that people should know about you that's not on your typical LinkedIn profile? [00:01:39]
Linus: Putting me on the spot. I think, I mean, just because I have so much work kind of out there, I feel like professionally, at least, anything that you would want to know about me, you can probably dig up, but I'm a big city person, but I don't come from the city. I went to school, I grew up in Indiana, in the middle of nowhere, near Purdue University, a little suburb. I only came out to the Bay for school and then I moved to New York afterwards, which is where I'm currently. I'm in Notion, New York. But I still carry within me a kind of love and affection for small town, Indiana, small town, flyover country. [00:02:10]
Swyx: We do have a bit of indulgence in this. I'm from a small country and I think Alessio, you also kind of identified with this a little bit. Is there anything that people should know about Purdue, apart from the chickens? [00:02:24]
Linus: Purdue has one of the largest international student populations in the country, which I don't know. I don't know exactly why, but because it's a state school, the focus is a lot on STEM topics. Purdue is well known for engineering and so we tend to have a lot of folks from abroad, which is particularly rare for a university in, I don't know, that's kind of like predominantly white American and kind of Midwestern state. That makes Purdue and the surrounding sort of area kind of like a younger, more diverse international island within the, I guess, broader world that is Indiana. [00:02:58]
Swyx: Fair enough. We can always dive into sort of flyover country or, you know, small town insights later, but you and I, all three of us actually recently connected at AIUX SF, which is the first AIUX meetup, essentially, which just came out of like a Twitter conversation. You and I have been involved in what I think of as HCI Twitter for a little bit, and when I saw that you were in town, Geoffrey Litt was in town, Maggie Appleton in town, all on the same date, I was like, we have to have a meetup, and that's how this thing was born. Well, what did it look like from your end? [00:03:30]
Linus: From my end, it looked like you did all of the work and I... [00:03:33]
Swyx: Well, you got us the Notion. Yeah, yeah. [00:03:36]
Linus: It was also in the Notion office, it was in the San Francisco one and then thereafter there was a New York one that I decided I couldn't make. But yeah, from my end it was, and I'm sure you were too, but I was really surprised by both the mixture of people that we ended up getting and the number of people that we ended up getting. There was just a lot of attention on, obviously there was a lot of attention on the technology itself of GPT and language models and so on, but I was surprised by the interest specifically on trying to come up with interfaces that were outside of the box and the people that were interested in that topic. And so we ended up having a packed house and lots of interesting demos. I've heard multiple people comment on the event afterwards that they were positively surprised by the mixture of both the ML, AI-focused people at the event as well as the interface HCI-focused people. [00:04:24]
Swyx: Yeah. I kind of see you as one of the leading, I guess, AI UX people, so I hope that we are maybe starting a new discipline, maybe. [00:04:33]
Linus: Yeah, I mean, there is this kind of growing contingency of people interested in exploring the intersection of those things, so I'm excited for where that's going to go. [00:04:41]
Swyx: I don't know if it's worth going through favorite demos. It was a little while ago, so I don't know if... [00:04:48]
Alessio: There was, I forget who made it, but there was this new document writing tool where you could apply brushes to different paragraphs. [00:04:56]
Linus: Oh, this was Amelia's. Yeah, yeah, yeah. [00:04:58]
Alessio: You could set a tone, both in terms of writer inspiration and then a tone that you wanted, and then you could drag and drop different tones into paragraphs and have the model rewrite them. It was the first time that it's not just auto-complete, there's more to it. And it's not asked in a prompt, it's this funny drag-an-emoji over it. [00:05:20]
Linus: Right. [00:05:21]
Swyx: I actually thought that you had done some kind of demo where you could select text and then augment it in different moods, but maybe it wasn't you, maybe it was just someone else [00:05:28]
Linus: I had done something similar, with slightly different building blocks. I think Amelia's demo was, there was sort of a preset palette of brushes and you apply them to text. I had built something related last year; I prototyped a way to give people sliders for different semantic attributes of text. And so you could start with a sentence, and you had a slider for length and a slider for how philosophical the text is, and a slider for how positive or negative the sentiment in the text is, and you could adjust any of them and have the language model reproduce the text. Yeah, similar, but continuous control versus distinct brushes, I think, is an interesting distinction there. [00:06:03]
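Here is a speculative sketch of that slider idea expressed as a simple prompt builder. Linus's actual prototype may have worked quite differently (for instance by manipulating the model directly rather than prompting), so treat the attribute names and the wording purely as illustration.

```typescript
// A speculative, prompting-based approximation of "sliders for semantic
// attributes of text". Attribute names and wording are assumptions.
interface TextAttributes {
  length: number;        // 0 = terse, 1 = long-winded
  philosophical: number; // 0 = concrete, 1 = abstract
  sentiment: number;     // 0 = negative, 1 = positive
}

function rewritePrompt(sentence: string, attrs: TextAttributes): string {
  const pct = (x: number) => `${Math.round(x * 100)}%`;
  return [
    "Rewrite the sentence with these target attributes:",
    `- length: ${pct(attrs.length)} of maximum`,
    `- philosophical tone: ${pct(attrs.philosophical)}`,
    `- positive sentiment: ${pct(attrs.sentiment)}`,
    "",
    `Sentence: ${sentence}`,
  ].join("\n");
}

// Dragging a slider just changes one number and regenerates the prompt,
// which is the "continuous control" contrast with discrete brushes.
const prompt = rewritePrompt("The meeting ran long.", {
  length: 0.7,
  philosophical: 0.2,
  sentiment: 0.9,
});
```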
Swyx: I should add it for listeners, if you missed the meetup, which most people will have not seen it, we actually did a separate post with timestamps of each video, so you can look at that. [00:06:13]
Alessio: Sorry, Linus, this is unrelated, but I think you've built over a hundred side projects or something like that. A hundred? [00:06:20]
Swyx: I think there's a lot of people... I know it's a hundred. [00:06:22]
Alessio: I think it's a lot of them. [00:06:23]
Swyx: A lot of them are kind of small. [00:06:25]
Alessio: Yeah, well, I mean, it still counts. I think there's a lot of people that are excited about the technology and want to hack on things. Do you have any tips on how to box what you want to build, how do you decide what goes into it? Because with all of these things, you could build so many more things on top of it. How do you decide when you're done? [00:06:44]
Linus: So my projects actually tend to be... I think especially when people approach project building with a goal of learning, I think a common mistake is to be over-ambitious and sort of not scope things very tightly. And so a classic kind of failure mode is, you say, I'm really interested in learning how to use the GPT-4 API, and I'm also interested in vector databases, and I'm also interested in Next.js. And then you devise a project that's going to take many weeks, and you glue all these things together. And it could be a really cool idea, but then especially if you have a day job and other things that life throws at you, it's hard to actually get to a point where you can ship something. And so one of the things that I got really good at was saying, one, knowing exactly how quickly I could work, at least on the technologies that I knew well, and then only adding one new unknown thing to learn per project. So it may be that for this project, I'm going to learn how the embedding API works. Or for this project, I'm going to learn how to do vector stuff with PyTorch or something. And then I would scope things so that it fit in one chunk of time, like Friday night to Sunday night or something like that. And then I would scope the project so that I could ship as much work as I could fit into a two-day period, so that at the end of that weekend, I could ship something. And then afterwards, if I want to add something, I have time to do it and a chance to do that. But it's already shipped, so there's already momentum, and people are using it, or I'm using it, and so there's a reason to continue building. So only adding one new unknown per project, I think, is a good trick. [00:08:14]
Swyx: I first came across you, I think, because of Monocle, which is your personal search engine. And I got very excited about it, because I always wanted a personal search engine, until I found that it was in a language that I've never seen before. [00:08:25]
Linus: Yeah, there's a whole tower of little tools and technologies that I built for myself. One of the other tricks to being really productive when you're building side projects is just to use a consistent set of tools that you know really, really well. For me, that's Go, and my language, and a couple other libraries that I've written that I know all the way down to the bottom of the stack. And then I barely have to look anything up, because I've just debugged every possible issue that could come up. And so I could get from start to finish without getting stuck in a weird bug that I've never seen before. But yeah, it's a weird stack. [00:08:58]
Swyx: It also means that you probably are not aiming for, let's say, open source glory, or whatever. Because you're not publishing in the JavaScript ecosystem. Right, right. [00:09:06]
Linus: I mean, I've written some libraries before, but a lot of my projects tend to be like, the way that I approach it is less about building something that other people are going to use en masse. And make yourself happy. Yeah, more about like, here's the thing that I built, if you want to, and often I learn something in the process of building that thing. So like with Monocle, I wrote a custom sort of full text search index. And I thought a lot of the parts of what I built was interesting. And so I just wanted other people to be able to look at it and see how it works and understand it. But the goal isn't necessarily for you to be able to replicate it and run it on your own. [00:09:36]
Swyx: Well, we can kind of dive into your other AIUX thoughts. As you've been diving in, you tend to share a lot on Twitter. And I just kind of took out some of your greatest hits. This is relevant to the demo that you picked out, Alessio. And what we're talking about, which is, most knowledge work is not a text generation task. That's funny, because a lot of what Notion AI is, is text generation right now. Maybe you want to elaborate a little bit. Yeah. [00:10:01]
Linus: I think the first time you look at something like GPT, the shape of the thing you see is like, oh, it's a thing that takes some input text and generates some output text. And so the easiest thing to build on top of that is a content generation tool. But I think there's a couple of other categories of things that you could build that are sort of progressively more useful and more interesting. And so besides content generation, which requires the minimum amount of wrapping around ChatGPT, the second tier up from that is things around knowledge, I think. So if you have, I mean, this is the hot thing with all these vector database things going around, but if you have a lot of existing context around some knowledge about your company or about a field or all of the internet, you can use a language model as a way to search and understand things in it and combine and synthesize them. And that synthesis, I think, is useful. And at that point, I think the value that that unlocks is much greater than the value of content generation. Because most knowledge work, the artifact that you produce isn't actually about writing more words. Most knowledge work, the goal is to understand something, synthesize new things, or propose actions or other kinds of knowledge-to-knowledge tasks. And then the third category, I think, is automation, which I think is sort of the thing that people are looking at most actively today, at least from my vantage point in the ecosystem. Things like the ReAct prompting technique, and just in general, letting models propose actions or write code to accomplish tasks. That's also moving far beyond generating text to doing something more interesting. So much of the value of what humans sit down and do at work isn't actually in the words that they write. It's all the thinking that goes on before you write those words. So how can you get language models to contribute to those parts of work? [00:11:43]
Alessio: I think when you first tweeted about this, I don't know if you already accepted the job, but you tweeted about this, and then the next one was like, this is a NotionAI subtweet. [00:11:53]
Swyx: So I didn't realize that. [00:11:56]
Alessio: The best thing that I see is when people complain, and then they're like, okay, I'm going to go and help make the thing better. So what are some of the things that you've been thinking about? I know you talked a lot about some of the flexibility versus intuitiveness of the product. The language is really flexible, because you can say anything. And it's funny, the models never ignore you. They always respond with something. So no matter what you write, something is going to come back. Sometimes you don't know how big the space of action is, how many things you can do. So as a product builder, how do you think about the trade-offs that you're willing to take for your users? Where like, okay, I'm not going to let you be as flexible, but I'm going to create these guardrails for you. What's the process to think about the guardrails, and how you want to funnel them to the right action? [00:12:46]
Linus: Yeah, I think this trade-off you mentioned around flexibility versus intuitiveness gets at one of the core design challenges for building products on top of language models. A lot of good interface design comes from tastefully adding the right constraints in place to guide the user towards actions that you want to take. As you add more guardrails, the obvious actions become more obvious. And one common way to make an interface more intuitive is to narrow the space of choices that the users have to make, and the number of choices that they have to make. And that intuitiveness, that source of intuitiveness from adding constraints, is kind of directly at odds with the reason that language models are so powerful and interesting, which is that they're so flexible and so general, and you can ask them to do literally anything, and they will always give you something. But most of the time, the answer isn't that high quality. And so there's kind of a distribution of, like, there are clumps of things in the action space of what a language model can do that the model's good at, and there's parts of the space where it's bad at. And so one sort of high-level framework that I have for thinking about designing with language models is, there are actions that the language model's good at, and actions that it's bad at. How do you add the right constraints carefully to guide the user and the system towards the things that the language model's good at? And then at the same time, how do you use those constraints to set the user expectations for what it's going to be good at and bad at? One way to do this is just literally to add those constraints and to set expectations. So a common example I use all the time is, if you have some AI system to answer questions from a knowledge base, there are a couple of different ways to surface that in a kind of a hypothetical product. One is, you could have a thing that looks like a chat window in a messaging app, and then you could tell the user, hey, this is for looking things up from a database. You can ask a question, then it'll look things up and give you an answer. But if something looks like a chat, and this is a lesson that's been learned over and over for anyone building chat interfaces since, like, 2014, 15, if you have anything that looks like a chat interface or a messaging app, people are going to put some, like, weird stuff in there that just doesn't look like the thing that you want the model to take in, because the expectation is, hey, I can use this like a messaging app, and people will send in, like, hi, hello, you know, weird questions, weird comments. Whereas if you take that same, literally the same input box, and put it in, like, a thing that looks like a search bar with, like, a search button, people are going to treat it more like a search window. And at that point, inputs look a lot more like keywords or a list of keywords or maybe questions. So the simple act of contextualizing that input in different parts of an interface resets the user's expectations, which constrains the space of things that the model has to handle. And there you're kind of adding constraints, because you're really restricting your input to mostly things that look like keyword search. But because of that constraint, you can have the model fit the expectations better. You can tune the model to perform better in those settings. 
And it's also less confusing and perhaps more intuitive, because the user isn't stuck with this blank page syndrome problem of, okay, here's an input. What do I actually do with it? When we initially launched Notion AI, one of my common takeaways, personally, from talking to a lot of my friends who had tried it, obviously, there were a lot of people who were getting lots of value out of using it to automate writing emails or writing marketing copy. There were a ton of people who were using it to, like, write Instagram ads and then sort of paste it into the Instagram tool. But some of my friends who had tried it and did not use it as much, a frequently cited reason was, I tried it. It was cool. It was cool for the things that Notion AI was marketed for. But for my particular use case, I had a hard time figuring out exactly the way it was useful for my workflow. And I think that gets back at the problem of, it's such a general tool that just presented with a blank prompt box, it's hard to know exactly the way it could be useful to your particular use case. [00:16:21]
Alessio: What do you think is the relationship between novelty and flexibility? I feel like we're in kind of like a prompting honeymoon phase where the tools are new and then everybody just wants to do whatever they want to do. And so it's good to give these interfaces because people can explore. But if I go forward in three years, ideally, I'm not prompting anything. The UX has been built for most products to already have the intuitive, kind of like a happy path built into it. Do you think there's merit in a way? If you think about ChatGPT, if it was limited, the reason why it got so viral is people were doing things that they didn't think a computer could do, like write poems and solve riddles and all these different things. How do you think about that, especially in Notion, where Notion AI is kind of like a new product in an existing thing? How much of it for you is letting that happen and seeing how people use it? And then at some point be like, okay, we know what people want to do. The flexibility is not, it was cool before, but now we just want you to do the right things with the right UX. [00:17:27]
Linus: I think there's value in always having the most general input as an escape hatch for people who want to take advantage of that power. At this point, Notion AI has a couple of different manifestations in the product. There's the writer. There's a thing we called an AI block, which is a thing that you can always sort of re-update as a part of a document. It's like a live little portal inside the document that an AI can write into. We also have a relatively new thing called AI autofill, which lets an AI fill an entire column in a Notion database. In all of these things, speaking of adding constraints, we have a lot of suggested prompts that we've worked on and we've curated and we think work pretty well for things like summarization and writing drafts to blog posts and things. But we always leave a fully custom prompt for a few reasons. One is if you are actually a power user and you know how language models work, you can go in and write your custom prompt, and if you're a power user, you want access to the power. The other is for us to be able to discover new use cases. And so one of the lovely things about working on a product like Notion is that there's such an enthusiastic and lively kind of community of ambassadors and people that are excited about trying different things and coming up with all these templates and new use cases. And having a fully custom action or prompt whenever we launch something new in AI lets those people really experiment and help us discover new ways to take advantage of AI. I think it's good in that way. There's also a sort of complement to that, which is if we wanted to use feedback data or learn from those things and help improve the way that we are prompting the model or the models that we're building, having access to that fully diverse, fully general range of use cases helps us make sure that our models can handle the full generality of what people want to do. [00:19:06]
Swyx: I feel like we've segued a lot into our Notion conversation, and maybe I just wanted to bridge that a little bit with your personal journey into Notion before we go into Notion proper. You spent a year kind of on a sabbatical, kind of on your own self-guided research journey, and then decided to join Notion. I think a lot of engineers out there thinking about doing this maybe don't have the internal compass that you have or don't have the guts to basically make no money for a year. Maybe just share with people how you decided to basically go on your own independent journey and what got you to join Notion in the end. [00:19:42]
Linus: Yeah, what happened? Um, yeah, so for a little bit of context for people who don't know me, I was working mostly at sort of seed-stage startups as a web engineer. I actually didn't really do much AI at all prior to my year off. And then I took all of 2022 off. It had less of a focus on AI at the start, but in retrospect it ended up becoming like a Linus Pivots to AI year, which was beautifully well timed. But in the beginning of the year, there was kind of one key motivation and then one key kind of question that I had. The motivation was that I think I was at a sort of a privileged and fortunate enough place where I felt like I had some money saved up that I had saved up explicitly to be able to take some time off and investigate my own kind of questions, because I was already working on lots of side projects and I wanted to spend more time on it. I think I also at that point felt like I had enough security in the companies and folks that I knew that if I really needed a job on short notice, I could go and I could find some work to do. So I wouldn't be completely on the streets. And so that security, I think, gave me the confidence to say, OK, let's try this kind of experiment. [00:20:52]
Maybe it'll only be for six months. Maybe it'll be for a year. I had enough money saved up to last like a year and change. And so I had planned for a year off and I had one sort of big question that I wanted to explore. Having that single question, I think, actually was really helpful for focusing the effort instead of just being like, I'm going to side project for a year, which I think would have been less productive. And that big question was, how do we evolve text interfaces forward? So, so much of knowledge work is consuming walls of text and then producing more walls of text. And text is so ubiquitous, not just in software, but just in general in the world. There's signage and menus and books. And it's ubiquitous, but it's not very ergonomic. There's a lot of things about text interfaces that could be better. And so I wanted to explore how we could make that better. A key part of that ended up being, as I discovered, taking advantage of these new technologies that let computers make sense of text information. And so that's how I ended up sort of sliding into AI. But the motivation in the beginning was less focused on learning a new technology and more just on exploring this general question space. [00:21:53]
Swyx: Yeah. You have the quote, text is the lowest denominator, not the end game. Right, right. [00:21:58]
Linus: I mean, I think if you look at any specific domain or discipline, whether it's medicine or mathematics or software engineering, in any specific discipline where there's a narrower set of abstractions for people to work with, there are custom notations. One of the first things that I wrote in this exploration year was this piece called Notational Intelligence, where I talk about this idea that so much of, as a total sidebar, there's a whole other fascinating conversation that I would love to have at some point, maybe today, maybe later, about how to evolve a budding scene of research into a fully-fledged field. So I think AI UX is kind of in this weird stage where there's a group of interesting people that are interested in exploring this space of how do you design for this newfangled technology, and how do you take that and go and build best practices and powerful methods and tools [00:22:48]
Swyx: We should talk about that at some point. [00:22:49]
Linus: OK. But in a lot of established fields, there are notations that people use that really help them work at a slightly higher level than just raw words. So notations for describing chemicals and notations for different areas of mathematics that let people work with higher-level concepts more easily. Logic, linguistics. [00:23:07]
Swyx: Yeah. [00:23:07]
Linus: And I think it's fair to say that some large part of human intelligence, especially in these more technical domains, comes from our ability to work with notations instead of work with just the raw ideas in our heads. And text is a kind of notation. It's the most general kind of notation, but it's also, because of its generality, not super high leverage if you want to go into these specific domains. And so I wanted to try to improve on that frontier. [00:23:29]
Swyx: Yeah. You said in our show notes, one of my goals over the next few years is to ensure that we end up with interface metaphors and technical conventions that set us up for the best possible timeline for creativity and inventions ahead. So part of that is constraints. But I feel like that is one part of the equation, right? What's the other part that is more engenders creativity? [00:23:47]
Linus: Tell me a little bit about that and what you're thinking there. [00:23:51]
Swyx: It's just, I feel like, you know, we talked a little bit about how you do want to constrain, for example, the user interface to guide people towards things that language models are good at. And creative solutions do arise out of constraints. But I feel like that alone is not sufficient for people to invent things. [00:24:10]
Linus: I mean, there's a lot of directions, I think, that could go from that. The origin of that thing that you're quoting is when I decided to come help work on AI at Notion, a bunch of my friends were actually quite surprised, I think, because they had expected that I would have gone and worked… [00:24:29]
Swyx: You did switch. I was eyeing that for you. [00:24:31]
Linus: I mean, worked at a lab or at my own company or something like that. But one of the core motivations for me joining an existing company, and one that has lots of users already, is this exact thing where in the aftermath of a new foundational technology emerging, there's kind of a period of a few years where the winners in the market get to decide what the default interface paradigm for the technology is. So, like, minicomputers, personal computers, the winners of that market got to decide what windows are and how scrolling works and what a mouse cursor is and how text is edited. Similar with mobile, the concept of a home screen and apps and things like that, the winners of the market got to decide. And that has profound, like, I think it's difficult to overstate the importance of, in those few critical years, the winning companies in the market choosing the right abstractions and the right metaphors. And AI, to me, seemed like it's at that pivotal moment where it's a technology that lots of companies are adopting. There is this well-recognized need for interface best practices. And Notion seemed like a company that had this interesting balance of it could still move quickly enough and ship and prototype quickly enough to try interesting interface ideas. But it also had enough presence in the ecosystem that if we came up with the right solution or one that we felt was right, we could push it out and learn from real users and iterate and hopefully be a part of that story of setting the defaults and setting what the dominant patterns are. [00:26:07]
Swyx: Yeah, it's a special opportunity. One of my favorite stories or facts is it was like a team of 10 people that designed the original iPhone. And so all the UX that was created there is essentially what we use as smartphones today, including predictive text, because people were finding that people were kind of missing the right letters. So they just enhanced the hit area for certain letters based on what you're typing. [00:26:28]
Linus: I mean, even just the idea of like, we should use QWERTY keyboards on tiny smartphone screens. Like that's a weird idea, right? [00:26:36]
Swyx: Yeah, QWERTY is another one. So I have RSI, so this actually affects me. QWERTY was specifically chosen to maximize travel distance, right? Like it's actually not ergonomic by design, because you wanted the typewriter keys to not stick. But we don't have that anymore. We're still sticking to QWERTY. I'm still sticking to QWERTY. I could switch to the other ones, Dvorak or Colemak, anytime, but I don't, just because of inertia. I have another thing like this. [00:27:02]
Linus: So going even farther back, people don't really think enough about where this concept of buttons comes from, right? So the concept of a push button as a thing where you press it and it activates some binary switch. I mean, buttons have existed, like mechanical buttons have existed, for a long time. But really, this modern concept of a button that activates a binary switch really gets popularized by the advent of electricity. Before electricity, if you had a button that did something, you would have to construct a mechanical system where if you press down on a thing, it affects some other lever system that effects the final action. And this modern idea of a button that is just a binary switch gets popularized with electricity. And at that point, a button has to work the way that it does in, like, an alarm clock, because when you press down on it, there's a spring that makes sure that the button comes back up and that it completes the circuit. And so that's the way the button works. And then when we started writing graphical interfaces, we just took that idea of a thing that could be depressed to activate a switch. All the modern buttons that we have today in software interfaces are simulating electronic push buttons where you press down to complete a circuit, except there's actually no circuit being completed. It's just a square on a screen. [00:28:11]
Swyx: It's all virtualized. Right. [00:28:12]
Linus: And then you control the simulation of a button by clicking a physical button on a mouse. Except if you're on a trackpad, it's not even a physical button anymore. It's like a simulated button hardware that controls a simulated button in software. And it's also just this cascade of like conceptual backwards compatibility that gets us here. I think buttons are interesting. [00:28:32]
Alessio: Where are you on the skeuomorphic design love-hate spectrum? There's people that have like high nostalgia for like the original, you know, the YouTube icon on the iPhone with like the knobs on the TV. [00:28:42]
Linus: I think a big part of that is at least the aesthetic part of it is fashion. Like fashion taken very literally, like in the same way that like the like early like Y2K 90s aesthetic comes and goes. I think skeuomorphism as expressed in like the early iPhone or like Windows XP comes and goes. There's another aspect of this, which is the part of skeuomorphism that helps people understand and intuit software, which has less to do with skeuomorphism making things easier to understand per se and more about like, like a slightly more general version of skeuomorphism is like, there should be a consistent mental model behind an interface that is easy to grok. And then once the user has the mental model, even if it's not the full model of exactly how that system works, there should be a simplified model that the user can easily understand and then sort of like adopt and use. One of my favorite examples of this is how volume controls that are designed well often work. Like on an iPhone, when you make your iPhone volume twice as loud, the sound that comes out isn't actually like at a physical level twice as loud. It's on a log scale. When you push the volume slider up on an iPhone, the speaker uses like four times more energy, but humans perceive it as twice as loud. And so the mental model that we're working with is, okay, if I make this, this volume control slider have two times more value, it's going to sound two times louder, even though actually the underlying physics is like on a log scale. But what actually happens physically is not actually what matters. What matters is how humans perceive it in the model that I have in my head. And there, I think there are a lot of other instances where the skeuomorphism isn't actually the thing. The thing is just that there should be a consistent mental model. And often the easy, consistent mental model to reach for is the models that already exist in reality, but not always. [00:30:23]
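To make the volume example concrete, here is a tiny sketch of a slider mapping that follows the ratio mentioned in the conversation, roughly four times the power for a perceived doubling of loudness, which works out to power proportional to the square of perceived loudness. The function and numbers are illustrative, not any particular device's implementation.

```typescript
// A minimal sketch of the volume-mapping idea: the slider exposes a linear
// "perceived loudness" value while the speaker power is scaled non-linearly
// underneath. The 4x-power-per-2x-loudness ratio is taken from the
// conversation above, not from any device's firmware.
function speakerPower(sliderValue: number, maxPower: number): number {
  // sliderValue in [0, 1]; treat it as the perceived loudness fraction.
  // Doubling perceived loudness needs ~4x the power, i.e. power ∝ loudness².
  const clamped = Math.min(Math.max(sliderValue, 0), 1);
  return maxPower * clamped ** 2;
}

// Doubling the slider value quadruples the power driven to the speaker,
// but the user just hears "twice as loud"; the mental model stays linear.
console.log(speakerPower(0.5, 100)); // 25
console.log(speakerPower(1.0, 100)); // 100
```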
Alessio: I think the other big topic, maybe before we dive into Notion, is agents. I think that's one of the toughest interfaces to crack, mostly because, you know, the text box, everybody understands that the agent is kind of, it has a human-like feeling, you know, where it's like, okay, I'm kind of delegating something to a human, right? I think, like, Sean, you made the example of a Calendly or a SavvyCal, it's like an agent, because it's scheduling on your behalf for something. [00:30:51]
Linus: That's actually a really interesting example, because it's kind of a, it's a pretty deterministic thing, like there's no real AI to it, but it is an agent in the sense that you're delegating to it and automating something. [00:31:01]
Swyx: Yeah, it does work without me. It's great. [00:31:03]
Alessio: So that one, we figured out. Like, we know what the scheduling interface is like. [00:31:07]
Swyx: Well, that's the state of the art now. But, you know, for example, the person I'm corresponding with still has to pick a time from my calendar, which some people dislike. Sam Lessin famously says it's a sign of disrespect. I disagree with him, but, you know, it's a point of view. There could be some intermediate AI agents that would send emails back and forth like a human person to give the other person who feels slighted that sense of respect or a personalized touch that they want. So there's always ways to push it. [00:31:39]
Alessio: Yeah, I think for me, you know, other stuff that I think about, so we were doing prep for another episode and had an agent and asked it to do like a, you know, background prep on the background of the person. And it just couldn't quite get the format that I wanted it to be in, you know, and the only way I kept having to prompt it was like, give it a text example, give a text example. What do you think the interface between humans and agents in the future will be like? Do you still think agents are like this open-ended thing that are objective-driven, where you say, hey, this is what I want to achieve, versus, I only trust this agent to do X, and this is how X is done? I'm curious because that kind of seems like a lot of mental overhead, you know, to remember each agent for each task, versus if you have an executive assistant, they'll do a random set of tasks and you can trust them because they're a human. But I feel like with agents, we're not quite there. [00:32:36]
Swyx: Agents are hard. [00:32:36]
Linus: The design space is just so vast. Since all of the early agent stuff came out around AutoGPT, I've tried to develop some kind of a thesis around it. And I think it's just difficult because there's so many variables. One framework that I usually apply to sort of existing chat-based prompting kind of things, and that I think also applies just as well to agents, is this duality between what you might call trust and control. So you just now brought up this example of, you had an agent try to write up some prep document for an episode and it couldn't quite get the format right. And one way you could describe that is you could say, oh, the agent didn't exactly do what I meant and what I had in my head, so I can't trust it to do the right job. But a different way to describe it is, I have a hard time controlling exactly the output of the model and I have a hard time communicating exactly what's in my head to the model. And they're kind of two sides of the same coin. I think if you can somehow provide a way to, with less effort, communicate with and control and constrain the model output a little bit more and constrain the behavior a little bit more, I think that would alleviate the pressure for the model to be this fully trusted thing, because there's no need for trust anymore. There's just kind of guardrails that ensure that the model does the right thing. So developing ways and interfaces for these agents to be a little more constrained in their output, or maybe for the human to control their output a little bit more or behavior a little bit more, I think is a productive path. Another sort of more recent revelation that I had while working on this AI autofill thing inside Notion is the importance of zones of influence for AI agents, especially in collaborative settings. So having worked on lots of interfaces for independent work during my year off, one of the surprising lessons that I learned early on when I joined Notion was that collaboration permeates everything, which is great for Notion, because when you're collaborating with an AI, you reuse a lot of the same metaphors as collaborating with humans. So one nice thing about this autofill thing, which also kind of applies to AI blocks, which is another thing that we have, is that you don't have this problem of having to ask questions like, oh, is this document written by an AI or is this written by a human? Like this need for auditability, because the part that's written by the AI is just in the autofilled cell or in the AI block. And you can tell that's written by the AI, and things outside of it you can kind of reasonably assume were written by you. I think anytime you have sort of an unbounded action space for models like agents, it's especially important to be able to answer those questions easily and to have some sense of security, in the same way that you want to know whether your coworker or collaborator has access to a document or has modified a document, you want to know whether an AI has permissions to access something. And if it's modified something or made some edit, you want to know that it did it. And so as a complement to constraining the model's action space proactively, I think it's also important to communicate, to have the user have an easy understanding of, what exactly did the model do here? And I think that helps build trust as well. [00:35:39]
Swyx: Yeah. I think for auto GPT and those kinds of agents in particular, anything that is destructive, you need to prompt for, I guess, or like check with, check in with the user. I know it's overloaded now. I can't say that. You have to confirm with the user. You confirm to the user. Yeah, exactly. Yeah. Yeah. [00:35:56]
Linus: That's tough too though, because you, you don't want to stop. [00:35:59]
Swyx: Yeah. [00:35:59]
Linus: One of the benefits of automating these things is that you can, in theory, scale them out arbitrarily. I can have like a hundred different agents working for me, but if that means I'm just spending my entire day in a deluge of notifications, that's not ideal either. [00:36:12]
Swyx: Yeah. So then it could be like a reversible, destructive thing with some kind of timeouts, a time limit. So you could reverse it within some window. I don't know. Yeah. I've been thinking about this a little bit because I've been working on a small developer agent. Right. Right. [00:36:27]
Linus: Or maybe you could batch a group of changes and sort of summarize them with another AI and approve them in bulk or something. [00:36:33]
Swyx: Which is surprisingly similar to the collaboration problem. Yeah. Yeah. Yeah. Exactly. Yeah. [00:36:39]
Linus: I'm telling you, the collaboration, a lot of the problems with collaborating with humans also apply to collaborating with AI. There's a potential pitfall to that as well, which is that you end up missing out on some of the core advantages of AI if you just fully anthropomorphize it into a human-like collaborator. [00:36:56]
Swyx: But yeah. Do you have a strong opinion on that? Like, do you refer to it as it? Oh yeah. [00:37:00]
Linus: I'm an it person, at least for now, in 2023. Yeah. [00:37:05]
Swyx: So that leads us nicely into introducing what Notion and Notion AI is today. Do you have a pet answer as to what is Notion? I've heard it introduced as a database, a WordPress killer, a knowledge base, a collaboration tool. What is it? Yeah. [00:37:19]
Linus: I mean, the official answer is that Notion is a connected workspace. It has a space for your company docs, meeting notes, a wiki for all of your company notes. You can also use it to orchestrate your workflows if you're managing a project, if you have an engineering team, if you have a sales team. You can put all of those in a single Notion database. And the benefit of Notion is that all of them live in a single space where you can link to your wiki pages from your, I don't know, onboarding docs. Or you can link to a GitHub issue through a task from your documentation on your engineering system. And all of this existing in a single place, in this kind of unified, single workspace, I think has lots of benefits. [00:37:58]
Swyx: That's the official line. [00:37:59]
Linus: There's an asterisk that I usually enjoy diving deeper into, which is that the whole reason that this connected workspace is possible is because underlying all of this is this really cool abstraction of blocks. In Notion, everything is a block. A paragraph is a block. A bullet point is a block. But also a page is a block. And the way that Notion databases work is that a database is just a collection of pages, which are really blocks. And you can like take a paragraph and drag it into a database and it'll become a page. You can take a page inside a database and pull it out and it'll just become a link to that page. And so this core abstraction of a block that can also be a page, that can also be a row in a database, like an Excel sheet, that fluidity and this like shared abstraction across all these different areas inside Notion, I think is what really makes Notion powerful. This Lego theme, this like Lego building block theme permeates a lot of different parts of Notion. Some fans of Notion might know that when you, or when you join Notion, you get a little Lego minifigure, which has Lego building blocks for workflows. And then every year you're at Notion, you get a new block that says like you've been here for a year, you've been here for two years. And then Simon, our co-founder and CTO, has a whole crate of Lego blocks on his desk that he just likes to mess with because, you know, he's been around for a long time. But this Lego building block thing, this like shared sort of all-encompassing single abstraction that you can combine to build various different kinds of workflows, I think is really what makes Notion powerful. And one of the sort of background questions that I have for Notion AI is like, what is that kind of building block for AI? [00:39:30]
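As a loose illustration of the block abstraction Linus describes, here is a toy TypeScript model. The type names and fields are assumptions made for illustration and do not reflect Notion's actual data model or API.

```typescript
// A toy model of "everything is a block". Purely illustrative; not Notion's
// real schema or API.
type BlockType = "paragraph" | "bullet" | "page" | "database";

interface Block {
  id: string;
  type: BlockType;
  content: string;   // e.g. the paragraph body or the page title
  children: Block[]; // a page is just a block whose children are more blocks
}

// A database is modeled here as a block whose children are page blocks, which
// is roughly the fluidity described above: drag a paragraph into a database
// and it becomes a page; pull a page out and it becomes a link to that page.
const paragraph: Block = { id: "b1", type: "paragraph", content: "Hello", children: [] };
const page: Block = { id: "p1", type: "page", content: "Meeting notes", children: [paragraph] };
const database: Block = { id: "d1", type: "database", content: "Projects", children: [page] };
```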
Swyx: Well, we can dive into that. So what is Notion AI? Like, so I kind of view it as like a startup within the startup. Could you describe the Notion AI team? Is this like, how seriously is Notion taking the AI wave? [00:39:43]
Linus: The most seriously. The way that Notion AI came about, as I understand it, because I joined a bit later, I think it was around October last year, all of the Notion team had a little offsite. And as a part of that, Ivan and Simon kind of went into a little hack weekend. And the thing that they ended up hacking on inside Notion was the very, very early prototype of Notion AI. They saw this GPT-3 thing. The early, early motivation for starting Notion, for building Notion in the first place for them, was sort of grounded in this utopian end-user programming vision where software is so powerful, but there are only so many people in the world that can write programs. But everyone can benefit from having a little workspace or a little program or a little workflow tool that's programmed to just fit their use case. And so how can we build a tool that lets people customize their software tools that they use every day for their use case? And I think to them, this seemed like such a critical part of facilitating that, bridging the gap between people who can code and people who need software. And so they saw that, they tried to build an initial prototype that ended up becoming the first version of Notion AI. They had a prototype in, I think, late October, early November, before ChatGPT came out, and sort of evolved it over the few months. But what ended up launching was sort of in line with the initial vision, I think, of what they ended up building. And then once they had it, I think they wanted to keep pushing it. And so at this point, AI is a really key part of Notion's strategy. And what we see Notion becoming going forward, in the same way that blocks and databases are a core part of Notion that helps enable workflow automation and all these important parts of running a team or collaborating with people or running your life, we think that AI is going to become an equally critical part of what Notion is. And it won't be, Notion is a cool connected workspace app, and it also has AI. It'll be that what Notion is, is databases, it has pages, it has space for your docs, and it also has this sort of comprehensive suite of AI tools that permeate everything. And one of the challenges of the AI team, which is, as you said, kind of a startup within a startup right now, is to figure out exactly what that all-permeating kind of abstraction means, which is a fascinating and difficult open problem. [00:41:57]
Alessio: How do you think about what people expect of Notion versus what you want to build in Notion? A lot of this AI technology kind of changes, you know, we talked about the relationship between text and humans and how humans collaborate. Do you put any constraints on yourself, like, okay, people expect Notion to work this way with these blocks, so maybe I have this crazy idea and I can't really pursue it because of what's already there? It's a classic innovator's dilemma kind of thing. And I think there are a lot of founders out there in a similar position, where it's like, you know, a Series C, Series D company: you're not quite yet the super established one, you're still moving forward, but you have an existing following and something that Notion stands for. How do you wrangle with that? [00:42:43]
Linus: Yeah, that is in some ways a challenge, in that Notion already is kind of a thing, and so we can't just scrap everything and start over. But there's a blessing side of it too, in that because there are so many people using Notion in so many different ways, we understand all of the things that people want to use Notion for very well. And so we already have a really well-defined space of problems that we want to help people solve. And that helps us. We have it with the existing Notion product, and we also have it by rolling out these AI things early and then watching and learning from the community what people want to do with them. And so based on those learnings, I think it actually helps us constrain the space of things we think we need to build, because otherwise the design space is just so large with whatever we can do with AI and knowledge work. And so watching what people have been using Notion for and what they want to use Notion for helps us constrain that space a little bit and make the problem of building AI things inside Notion a little more tractable. [00:43:36]
Swyx: I think also just observing what they naturally use things for. And it sounds like you do a bunch of user interviews, where you hear people running into issues, or describing them the way that I describe myself, actually: I feel like the problem is with me, that I'm not creative enough to come up with use cases for Notion AI or any other AI. [00:43:57]
Linus: Which isn't necessarily on you, right? [00:43:59]
Swyx: Exactly. [00:43:59]
Linus: Again, it goes way back to the thing we touched on early in the conversation: if you have too much generality, there are not enough guardrails to obviously point you to use cases. Blank piece of paper. [00:44:10]
Swyx: I don't know what to do with this. So I think a lot of people judge Notion AI based on what they originally saw, which is write me a blog post, or do a summary, or do action items. Which, fun fact, for Latent Space, my very, very first Hacker News hit was reverse engineering Notion AI. I actually don't know if I got it exactly right. I think I got the easy ones right, and then apparently I got the action items one really wrong. So there's some art to doing that. But you've since launched a bunch of other products, and maybe you've already hinted at AI Autofill. Maybe we can just talk a little bit about what the scope or suite of Notion AI products has been so far and what you're launching this week? Yeah. [00:44:53]
Linus: So we have, I think, three main facets of Notion AI in Notion at the moment. We have the first thing that ever launched with Notion AI, which is the tool that helps you write. Going back to earlier in the conversation, it's kind of a writing, kind of a content generation tool. If you have a document and you want to generate a summary, it helps you generate a summary, pull out action items, you can draft a blog post, it can help improve your writing, it can help fix grammar and spelling mistakes. But under the hood, it's a fairly lightweight layer of prompts. It's a pretty straightforward use case of language models, right? And so there's that, a tool that helps you write documents. There's a thing called an AI block, which is a slightly more constrained version of that. One common way that we use it inside Notion is we take all of our meeting notes inside Notion, and frequently when you have a meeting and you want other people to be able to go back to it and reference it, it's nice to have a summary of that meeting. So all of our meeting notes templates, at least on the AI team, have an AI block at the top that automatically summarizes the contents of that page. And so whenever we're done with a meeting, we just press a button and it'll re-summarize that, including things like what the core action items are for every person in the meeting. And so that block, as I said before, is nice because it's a constrained space for the AI to work in, and we don't have to prompt it every single time. And then the newest member of this AI collection of features is AI Autofill, which brings Notion AI to databases. So if you have a whole database of user interviews and you want to pull out what the companies are, what their core pain points are, what their core features are, maybe what competitor products they use, you can just make columns. And in the same way that you write Excel formulas, you can write a little AI formula, basically, where the AI will look at the contents of the page and pull out each of these key pieces of information. The slightly new thing that Autofill introduces is this idea of a more automated, background AI thing. So with Writer, the AI-in-your-document product, and the AI block, you always have to ask it to update, you always have to ask it to rewrite. But if you have a column or a property in a Notion database, it would be nice if, whenever someone went back and changed the contents of the meeting note or something updated about the page, or maybe it's a list of tasks and the status of a task changes, the summary or detail of that task updated too. And so you can set up an autofilled Notion property so that anytime something on that database row or page changes, the AI will go back and auto-update the autofilled value. And that, I think, is a really interesting direction that we might continue leaning into: even though there's AI now tied to this particular page, it's sort of doing its own thing in the background to help automate and alleviate some of the pain of automating these things. But yeah, Writer, Blocks, and Autofill are the three cornerstones we have today. [00:47:42]
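As a rough illustration of the autofill behavior described here, below is a sketch in Python. The function and the `llm` callable are hypothetical placeholders, not Notion's implementation: the idea is simply that an AI-filled property stores an instruction and recomputes its value only when the underlying page content changes.

```python
import hashlib

def autofill_property(page_text: str, instruction: str, cache: dict, llm) -> str:
    # Hypothetical sketch: `llm` is any callable that takes a prompt string and returns text.
    key = hashlib.sha256(page_text.encode()).hexdigest()
    if cache.get("content_hash") == key:
        return cache["value"]                       # page unchanged, keep the stored value
    prompt = f"{instruction}\n\nPage contents:\n{page_text}"
    value = llm(prompt)                             # e.g. "Core pain points: ..."
    cache["content_hash"], cache["value"] = key, value
    return value

# Example: re-run this whenever the database row's page is edited.
# autofill_property(page_text, "List the customer's core pain points.", cache={}, llm=my_model)
```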
Alessio: You know, there used to be this glorious time when Roam Research was the hottest knowledge company out there, and then Notion built backlinks. I don't know if we are to blame for that. No, no, but how do backlinks play into some of this? I think most AI use cases today are kind of single-page, right? Like this document, I'm helping with this. Do you see some of these tools expanding to make changes across things? We just had Itamar from Codium on the podcast, and he talked about how agents can tie together specs for features, tests for features, and the code for the feature, so the three entities are tied together. Do you see backlinks helping AI navigate through the knowledge bases of companies, where you might have the document the product team uses, but you also have the document that marketing uses to then announce it, and as you make changes, the AI can work through the different pieces of it? [00:48:41]
Swyx: Definitely. [00:48:41]
Linus: If I may get a little theoretical from that. One of my favorite ideas from my last year of hacking around, building text augmentations with AI for documents, is this realization that when you look at code in a code editor, what it is at the very lowest level is just text files. A code file is a text file, and there are maybe functions inside of it, and it's a list of functions, but it's a text file. But the way that you understand it is not as a file, like a Word document; it's kind of a graph. [00:49:10]
Linus: You have a function, you have call sites to that function, there are places where you call that function, there's a place where that function is tested, maybe different definitions for that function. Maybe there's a type definition that's tied to that function. So it's kind of a graph. And if you want to understand that function, there are advantages to being able to traverse that whole graph and fully contextualize where that function is used. Same with types and same with variables. And so even though code is represented as text files, it's actually kind of a graph, and a lot of the key interface innovations behind IDEs are about helping surface that graph structure in the context of a text file. So things like go-to-definition, or VS Code's little window view when you look at references. An interesting idea that I explored last year was: what if you bring that to text documents? Text documents are a little more unstructured, so it's a fuzzier kind of graph idea. But if you're reading a textbook, if there's a new term, there are actually other places where the term is mentioned. There are probably a few places where it's defined. Maybe there are some figures that reference that term. If you have an idea, there are other parts of the document that might disagree with that idea or cite that idea. So there's still kind of a graph structure. It's a little fuzzier, but there's a graph structure that ties together a body of knowledge. And it would be cool if you had some kind of a text editor or some kind of knowledge tool that let you explore that whole graph, or maybe if an AI could explore that whole graph. And so back to your point, I think it's about taking advantage of not just the backlinks. Backlinks are a part of it, but also the fact that inside Notion, all of these pages exist in a single workspace, and it's a shared context, a connected workspace. And you can take any idea and look it up anywhere to fully contextualize what a part of your engineering system design means, or what we know about a particular customer we're pitching at a company, or, if I wrote down a book, what other places that book has been mentioned. All these graph-following things, I think, are really important for contextualizing knowledge. [00:51:02]
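To make the "documents as a graph" idea concrete, here is a small illustrative sketch (the pages and the Roam-style [[link]] syntax are made up purely for the example) that builds forward links and backlinks between pages and walks both directions to contextualize a page:

```python
from collections import defaultdict

pages = {
    "Design spec": "The pricing model is described in [[Pricing]].",
    "Pricing": "See [[Design spec]] for context. Enterprise tier TBD.",
    "Marketing launch": "Announce the new pricing model from [[Pricing]].",
}

# Forward links: page -> pages it mentions; backlinks are the reverse edges.
links, backlinks = defaultdict(set), defaultdict(set)
for page, text in pages.items():
    for other in pages:
        if other != page and f"[[{other}]]" in text:
            links[page].add(other)
            backlinks[other].add(page)

def context_for(page: str) -> set:
    # "Fully contextualize" a page by walking both directions of the graph.
    return links[page] | backlinks[page]

print(context_for("Pricing"))   # {'Design spec', 'Marketing launch'} (order may vary)
```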
Swyx: Part of your job at Notion is prompt engineering. You are maybe one of the more advanced prompt engineers that I know out there. And you've always commented on the state of prompt ops tooling. What is your process today? What do you wish for? There's a lot here. [00:51:19]
Linus: I mean, the prompts that are inside Notion right now are not complex in the sense that agent prompts are complex, but they're complex in the sense that even a problem as simple as summarize a page is hard. A page could contain anything from no information, if it's a fresh document, to a fully fledged news article. Maybe it's a meeting note. Maybe it's a bug filed by somebody at a company. The range of possible documents is huge, and then you have to distill all of it down to always generate a summary. And so describing that task to the AI comprehensively is pretty hard. There are a few things that I ended up leaning on, that we as a team ended up leaning on, for the prompt engineering part of it. One of the early transitions that we made was that the initial prototype for Notion AI was built on the classic instruction-following models, text-davinci-003 and so on. And then at some point, we all switched to chat-based models, like Claude and the new GPT-3.5 Turbo. And so that was an interesting transition. It actually made few-shot prompting a little bit easier, I think, in that you could give the few-shot examples as previous turns in a conversation, and then you could ask the real question as the next follow-up turn. I've come to appreciate few-shot prompting a lot more, because it's difficult to fully, comprehensively explain a particular task in words, but it's pretty easy to demonstrate four or five different edge cases that you want the model to handle. And a lot of times, if there's an edge case that you want a model to handle, I think few-shot prompting is just the easiest, most reliable tool to reach for. One challenge in prompt engineering that Notion has to contend with often is that we want to support all the different languages that Notion supports, and so all of our prompts have to be multilingual-compatible, which is kind of tricky because our prompts, our instructions, are written in English. And so if you take a naive approach, then the model tends to output in English, even when the document that you want to translate or summarize is in French. One way you could try to attack that problem is to tell the model to answer in the language of the user's query, but it's actually a lot more effective to just give it examples: not just English documents, but maybe summarizing an English document, summarizing a ticket filed in French, summarizing an empty document where the document's supposed to be in Korean. And so a lot of the few-shot examples included in our prompts in Notion AI tend to be very multilingual, and that helps support our non-English-speaking users. The other big part of prompt engineering is evaluation. The prompts that you exfiltrated out of Notion AI many weeks ago were surprisingly pretty spot-on, at least for the prompts that we had then, especially things like summary. But they're also outdated, because we've evolved them a lot more and we have a lot more examples. And some of our prompts are just really, really long. They're thousands of tokens long. And so every time we go back and add an example or modify the instruction, we want to make sure that we don't regress any of the previous use cases that we've supported. And so we put a lot of effort into, and we're increasingly building out, internal tooling and infrastructure for things like what you might call unit tests and regression tests for prompts, with handwritten test cases as well as tests that are driven more by feedback from Notion users that have chosen to share their feedback with us. [00:54:31]
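As a sketch of the few-shot-as-chat-turns pattern described above (the examples, system message, and message format are illustrative, not Notion's actual prompts), the demonstrations are packed in as prior user/assistant turns, including non-English and empty-document edge cases, and the real document becomes the final turn:

```python
# Few-shot examples expressed as prior conversation turns, in the generic chat-message
# format most chat-completion APIs accept. All content here is illustrative only.
SYSTEM = "You summarize the given document in the same language as the document."

few_shot = [
    {"role": "user", "content": "Summarize:\nWe met to plan the Q3 launch..."},
    {"role": "assistant", "content": "A short summary of the Q3 launch planning meeting."},
    {"role": "user", "content": "Summarize:\nBug : l'export PDF échoue sur les grandes pages..."},
    {"role": "assistant", "content": "Résumé : l'export PDF échoue sur les grandes pages."},
    {"role": "user", "content": "Summarize:\n"},          # empty-document edge case
    {"role": "assistant", "content": "This page is empty, so there is nothing to summarize."},
]

def build_messages(document: str) -> list:
    # The real document becomes the final user turn after the demonstrations.
    return ([{"role": "system", "content": SYSTEM}]
            + few_shot
            + [{"role": "user", "content": f"Summarize:\n{document}"}])
```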
Swyx: So do you just have a hand-rolled testing framework, or use Jest or whatever, nothing custom out there? You've basically said you've looked at so many prompt ops tools and you're sold on none of them. [00:54:42]
Linus: So that tweet was from a while ago. I think there are a couple of interesting tools these days. But I think at the moment, Notion uses pretty hand-rolled tools. Nothing too heavy, but it's basically a for loop over a list of test cases. We do do quite a bit of using language models to evaluate language models. So our unit test descriptions are kind of funny because the test is literally just an input document and a query, and then we expect the model to say something. And then our qualification for whether that test passes or not is just ask the language model again, whether it looks like a reasonable summary or whether it's in the right language. [00:55:19]
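A hand-rolled harness of the kind described here can be as simple as the sketch below: a for loop over test cases, with a second model call acting as the judge. The `generate` and `judge` callables are hypothetical placeholders for whatever model-calling code a team actually uses.

```python
# Minimal prompt regression harness: a for loop over test cases, with a language
# model judging the language model's output. Test content is illustrative only.
test_cases = [
    {"doc": "Notes from the weekly design review...", "query": "Summarize this page.",
     "check": "Is this a reasonable English summary of a design review? Answer yes or no."},
    {"doc": "Rapport de bug : l'export échoue...", "query": "Summarize this page.",
     "check": "Is this summary written in French? Answer yes or no."},
]

def run_suite(generate, judge) -> list:
    failures = []
    for case in test_cases:
        output = generate(case["doc"], case["query"])
        verdict = judge(f"{case['check']}\n\nOutput:\n{output}")
        if not verdict.strip().lower().startswith("yes"):
            failures.append((case["query"], output, verdict))
    return failures   # empty list means no regressions on this suite
```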
Swyx: Do you use the same model? Do you have Anthropic critique OpenAI, or OpenAI critique Anthropic? That's a good question. Do you worry about models being biased towards their own outputs? [00:55:29]
Linus: Oh, no, that's not a worry that we have. I actually don't know exactly if we use different models. If you have a fixed budget for running these tests, I think it would make sense to use more expensive models for evaluation rather than generation. But yeah, I don't remember exactly what we do there. [00:55:44]
Swyx: And then one more follow-up: you mentioned some of your prompts are thousands of tokens. That takes away from my budget as a user. Isn't that a trade-off that's a concern? There's a limited context window, right? Some of it is taken by you, as the app designer, the product designer, deciding what system prompt to provide, and then the remainder is what I as a user can give you to actually summarize as my content. In theory. [00:56:10]
Linus: I think in practice there are a couple of trends that make that less of an issue. So for things like generating summaries, a summary is only going to be so many tokens long. If our prompts are generating you 3,000 token summaries, the prompt is not doing its job anyway. [00:56:25]
Swyx: Yeah, but the source doc is. [00:56:27]
Linus: The source doc could be longer. So if you wanted to translate a 5,000 token document, you do have to truncate it, and there is a limitation there. It's not something that we are super focused on at the moment, for a couple of reasons. I think there are techniques that, if we need them, help us compress those prompts, things like parameter-efficient fine-tuning. And also, the dominant trend seems to be that context lengths are constantly getting longer and cheaper. Anthropic recently announced their 100,000 token context model. And so I think in the longer term that's going to be taken care of anyway by the models becoming more accommodating of longer contexts, and it's more of a temporary limitation. Cool. [00:57:04]
Swyx: Shall we talk about the professionalizing of a scene? [00:57:07]
Linus: Yeah, I think one helpful bit of context when thinking about HCI and AI in particular is that, historically, HCI and AI have been sort of competing disciplines. Competing very specifically in the sense that they often fought for the same sources of funding and the same kinds of people and attention throughout the history of computer science. HCI and AI both came from the same, or very aligned, parallel motivations: we have computers. How do we make computers work better with humans? And one way to do it was to make the machine smarter. Another way to do it was to design better interfaces. And through the AI booms and busts, when the AI boom was happening, HCI would get less funding, and when AI had its winters, HCI would get a lot more attention, because it was sort of the alternative solution. And now that we have this renewed attention on how to build better interfaces for AI, I think it's interesting that it's kind of a scene now. There are podcasts like this where I get to talk about interfaces and AI. But it's definitely not a fully fledged field. My favorite definition of what distinguishes the two comes from Andy Matuschak, where, and I'm going to butcher the quote, he said something to the effect of: a field has at its disposal a powerful set of established tools and methods and standards, and a shared set of core questions they want to answer. And so if you look at machine learning, which is obviously a really dominant, established field, if you want to evaluate a model, or solve a particular task, or build a model that solves a particular task, there are powerful methods that we have, like gradient descent and specific benchmarks, for building solutions and then evaluating those solutions. Or if you have an even more expensive problem, there are surely attempts that have been made before, and attempts that people are making now, for how to attack that problem, and frameworks to think about these things. In AI and UX, I think we're very early in the evolution of that space and that community. There are a lot of people excited, a lot of people building, but we have yet to come up with a set of best practices and tools and methods and frameworks for thinking about these things. Those will surely arise, and as they do, I think we'll see the evolution of the field. In prompt engineering and using language models in products at large, I think that community is a little farther along. It's still very fast moving because it's really young, but there are established prompting techniques like ReAct and distillation of larger instruction-following models. And these techniques, I think, are the beginnings of best practices and powerful tools at the disposal of this language-model-using field. [00:59:43]
Swyx: Yeah, and mostly it's just following Riley Goodside. That's how I learn about prompting techniques. Right, right. Yeah, pioneers. But yeah, I am actually interested in this. We've recently somewhat rebranded the podcast, or the newsletter, towards this term AI engineer, which I kind of view as somewhere between machine learning researcher and software engineer, some kind of in-between mix. And I think creating the media, creating meetups, creating a de facto conference for it, creating job titles, and then that core set of questions that everyone wants to get better at, I think that is essentially how this starts. Yeah, yeah. Pretty excited about it. [01:00:25]
Linus: Creating a space for the people that are interested to come together, I think, is a really, really important part of it. Whenever I come back to it, I'm always amazed by how, if you look at the golden era of theoretical physics in the early 20th century, or the golden era of early personal computing, there are maybe two dozen people that contributed all of the significant ideas to that field, and they all kind of knew each other. I always found that really fascinating. And I think the causal relationship actually goes the other way. It's not that all those people happened to know each other; it's that because there was a core set of people who were very close to each other, shared ideas often, and were co-located, that the field was able to blossom. And so I think creating that space is really critical. [01:01:08]
Swyx: Yeah, there's a very famous photo of the Solvay conference in 1927, where Albert Einstein, Niels Bohr, Marie Curie, all these top physics names. And how many Nobel laureates are in the photo, right? Yeah, and when I tweeted it out once, people were like, I didn't know these all lived together, and they all knew each other, and they must have exchanged so many ideas. [01:01:28]
Linus: I mean, similar with artists and writers that help a new kind of period blossom. [01:01:34]
Swyx: Now, is it going to be San Francisco, New York, though? [01:01:36]
Alessio: That's a spicy question. [01:01:39]
Swyx: I don't know, we'll see. Well, we're glad to at least be a part of your world, whether it is on either coast. But it's also virtual, right? Like, we have a Discord, it's happening online as well, even if you're in a small town in Indiana. [01:01:54]
Swyx: Cool, lightning round? Awesome, yeah, let's do it. [01:01:59]
Alessio: We only got three questions for you. One is acceleration, one exploration, then a final takeaway. So the first one we always like to ask is like, what is something that happened in AI that you thought would take much longer than it has? [01:02:13]
Swyx: Price is coming down. [01:02:14]
Linus: Price is coming down, and being able to get a lot more bang for your buck. So things like GPT-3.5 Turbo being, I don't know the exact figure, like 10 times, 20 times cheaper. [01:02:25]
Swyx: Than having GPT-3, than davinci-003. [01:02:27]
Linus: Than davinci-003 per token. Or the super-long-context Claude, or MPT StoryWriter, these long-context models that would theoretically take a lot of compute to run, but are sort of accessible to us now. I think they're surprising because I would have thought, before these things came out, that cost per token and scaling context length were core constraints that you would have to design your AI systems around. And it ends up being that if you just wait a few months, OpenAI will figure out how to make these models 10 times cheaper, or Anthropic will figure out how to make the models able to take a million tokens. And the speed at which that's happened has been surprising and a little bit frightening, because it invalidates a lot of the assumptions that I was operating with, and I have to recalibrate. [01:03:11]
Swyx: Yeah, there's this very famous law called Wirth's Law, also known as Gates's Law, that basically says software engineers will take up whatever hardware engineers give them. And I feel like there's a parallel law right now, where AI UX people are going to take up all the improvements that the language model people give them. So while the language model people are improving costs by an order of magnitude, you, with your Notion AI Autofill, are increasing the amount of consumption by orders of magnitude. [01:03:39]
Linus: Yeah, exactly. Before the show started, we were just talking about how, when I was prototyping Autofill, just to make sure that things scaled up okay, I ended up running Autofill on a database with something like 6,000 pages, just generating summaries. And usually these are fairly long pages. I ended up running through something like two or three million tokens in a matter of like 20 minutes. [01:03:58]
Swyx: Yeah. [01:03:58]
Linus: Which is not too expensive, luckily, because the models are getting cheaper. It's going to be fine. But it is like $5 or $6, and the concept of running a test on my computer and it costing the price of a nice coffee is still kind of a weird thing that I'm getting used to. [01:04:13]
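As a rough sanity check on those numbers, assuming roughly $0.002 per 1K tokens (approximately GPT-3.5-Turbo-era pricing; the exact rate is an assumption here):

```python
tokens = 3_000_000            # roughly what the 6,000-page Autofill test consumed
price_per_1k_usd = 0.002      # assumed GPT-3.5-Turbo-era price per 1K tokens
print(f"${tokens / 1000 * price_per_1k_usd:.2f}")   # $6.00, in line with the "$5 or $6" figure
```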
Swyx: And Notion AI currently is $10 a month, something like that. So there's ways to make Notion lose money. [01:04:20]
Alessio: You just get negative gross margins on that test. [01:04:24]
Linus: Not sanctioned by Notion. I mean, obviously, you should use it to, you know, improve your life and support your workflows in whatever ways that's useful. [01:04:33]
Swyx: Okay, second question is about exploration. What do you think is the most interesting unsolved question in AI? [01:04:39]
Linus: Predictability, reliability. Well, in AI broadly, I think it's much harder, but with language models specifically, I think how to build dependable systems is really important. If you ask Notion AI or ChatGPT or Claude for, say, a bullet list of X, Y, Z, sometimes it'll make those bullets with the Unicode center dot, sometimes with a dash, sometimes it'll add a title, sometimes it'll bold random things. And all of those are fine, but it's a little jarring if the answer is a little stochastic every time. I think this is much more of a concern when you're automating tasks or having the model make decisions by itself. So much of the software that runs the world is behind-the-scenes decision-making programs that run inside enterprises and automate systems and make decisions for people, and auditability and dependability are just so critical to all of them. One avenue of work that I'm really intrigued by is, in these decision-making systems, not having the model make decisions internally as a black box, but having the model synthesize code that makes the decisions. For things like summarization, natural language tasks, you have to ask the model. But if you wanted to, let's say, take a document and filter out all the dates, instead of asking the model, hey, can you grab all the dates, you can ask the model to write a regular expression that captures the particular set of date formats that you really care about. And at that point, the output of the model is a program. And the nice thing about a program is that you can check it. There are lots of nice things: one is that it's much cheaper to run afterwards, another is that you can verify it. And the program becomes what in design we call a boundary object, a shared thing that exists both in the sphere of the human and the sphere of the computer. You can iterate on it to fix bugs, and you can co-evolve this object that is now a representation of the decision that you want the computer to make, but it's auditable and dependable and reliable. And so I'm pretty bullish on code generation and other program synthesis and program verification techniques, using the model to write the initial program and then helping people maintain the software. [01:06:36]
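The date-extraction example can be sketched as below (illustrative only; `llm` is a placeholder for any model call): the model is asked once to synthesize a program, here a regex, and from then on the cheap, inspectable program does the work on every document.

```python
import re

def build_date_extractor(llm) -> re.Pattern:
    # Ask the model once to write a program (a regex); reuse and audit that program afterwards.
    # `llm` is any callable returning text; this is a sketch, not a product implementation.
    pattern_text = llm(
        "Write a single Python regular expression (pattern only, no code) that matches "
        "dates in formats like 2023-05-17, 05/17/2023, and May 17, 2023."
    )
    return re.compile(pattern_text.strip())

def extract_dates(doc: str, extractor: re.Pattern) -> list:
    # The synthesized program is cheap to run, testable, and verifiable by a human.
    return [m.group(0) for m in extractor.finditer(doc)]
```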
Swyx: Yeah, I'm so excited by that. Just in terms of reliability, I'll call out our previous guest, Shreya Rajpal. Yeah, yeah. She's working on Guardrails AI. There's also LMQL, and then Microsoft recently put out Guidance, which is their custom language thing. Have you explored any of those? [01:06:51]
Linus: I've taken a look at all of them. I've spoken to Shreya. I think this general space of adding constraints to general systems, adding program verification, all of these things are super fascinating. I also personally like it a lot, because before I was spending a lot of my time in AI, I spent a bunch of time looking at programming languages and compilers and interpreters, and there is just so much amazing work that has gone into how you build automated ways to reason about a program, like compilers and type checkers and so on. And it would be a real shame if the whole field of program synthesis and verification just became "ask GPT-4." But actually, it's not; they work together. You synthesize the program with GPT-4 from human descriptions, and then we have this whole set of powerful techniques that we can use to more formally understand and prove things about programs. And I'm excited to see the synergy of them. [01:07:44]
Swyx: Awesome. This was great, Linus. [01:07:47]
Alessio: Our last question is always, what's one message you want everyone to remember today about the space, exciting challenges? [01:07:54]
Swyx: We were at the beginning. [01:07:57]
Linus: Maybe this is really cliche, but one thing that I always used to say when I was working on text interfaces last year was that I would be really disappointed if, in a thousand years, humans are still using the same kind of writing tools and writing systems that we are today. It would be pretty surprising if, in a thousand years, we're still writing documents in the same way that we are today and the language and the writing system haven't evolved at all. Humans plan to be around for many thousands of years into the future, and writing, in its modern form, has really only been around for two or three thousand years. And I think we should care a lot more about building flexible, powerful tools than about backwards compatibility, if we plan to be around for many more times the number of years that we've already been around. And so whether we look at something as simple as language models or as expansive as humans interacting with text documents, I think it's worth reminding yourself often that the things we have today are sometimes that way for a reason, but often just an artifact of the way that we've gotten here. Text can look very different. Language models can look very different. I personally think in a couple of years we're going to do something better than transformers. So all of these things are going to change, and I think it's important to keep your eyes looking over the horizon at what's coming far into the future. [01:09:24]
Swyx: Nice way to end it. [01:09:25]
Alessio: Well, thank you, Linus, for coming on. This was great. Thank you. This was lovely. [01:09:29]
Linus: Thanks for having me. [01:09:31]
Get full access to Latent.Space at www.latent.space/subscribe