ThursdAI - The top AI news from the past week – Details, episodes & analysis

Podcast details

Technical and general information from the podcast's RSS feed.

ThursdAI - The top AI news from the past week

ThursdAI - The top AI news from the past week

From Weights & Biases: join AI Evangelist Alex Volkov and a panel of experts to cover everything important that happened in the world of AI over the past week

News
Technology

Frequency: 1 episode every 7 days. Total episodes: 114

Substack
Every ThursdAI, Alex Volkov hosts a panel of experts, AI engineers, data scientists and prompt spellcasters on Twitter Spaces, as we discuss everything major and important that happened in the world of AI over the past week. Topics include LLMs, open source, new capabilities, OpenAI, competitors in the AI space, new LLM models, AI art and diffusion aspects and much more.

sub.thursdai.news
Site
RSS
Apple

Recent rankings

Latest chart positions across Apple Podcasts and Spotify rankings.

Apple Podcasts
  • 🇬🇧 Great Britain - techNews

    27/07/2025
    #90
  • 🇩🇪 Germany - techNews

    27/07/2025
    #25
  • 🇫🇷 France - techNews

    27/07/2025
    #94
  • 🇬🇧 Great Britain - techNews

    26/07/2025
    #79
  • 🇩🇪 Germany - techNews

    26/07/2025
    #28
  • 🇫🇷 France - techNews

    26/07/2025
    #87
  • 🇬🇧 Great Britain - techNews

    25/07/2025
    #64
  • 🇩🇪 Germany - techNews

    25/07/2025
    #73
  • 🇫🇷 France - techNews

    25/07/2025
    #83
  • 🇬🇧 Great Britain - techNews

    24/07/2025
    #38
Spotify

    No recent rankings available



RSS feed quality and score

Technical evaluation of the podcast's RSS feed quality and structure.

RSS feed quality
To improve

Global score: 53%


Publication history

Monthly episode publishing history over the past years.

Episodes published by month.

Latest published episodes

Recent episodes with titles, durations, and descriptions.


📅 ThursdAI - Aug 29 - AI Plays DOOM, Cerebras breaks inference records, Google gives new Geminis, OSS vision SOTA & 100M context windows!?

Friday, August 30, 2024 - Duration: 01:35:04

Hey, for the last time during the summer of 2024, welcome to yet another edition of ThursdAI, and happy Skynet self-awareness day for those who keep track :)

This week, Cerebras broke the world record for fastest Llama 3.1 70B/8B inference (and came on the show to talk about it), Google updated 3 new Geminis, Anthropic opened artifacts to all, 100M context windows are possible, and Qwen beats SOTA on vision models + much more!

As always, this week's newsletter is brought to you by Weights & Biases. Did I mention we're doing a hackathon in SF on September 21-22 and that we have an upcoming free RAG course w/ Cohere & Weaviate?

TL;DR

* Open Source LLMs

* Nous DisTrO - Distributed Training (X, Report)

* NousResearch/hermes-function-calling-v1 open sourced - (X, HF)

* LinkedIn Liger Kernel - one line to make training 20% faster & 60% more memory efficient (Github)

* Cartesia - Rene 1.3B LLM SSM + Edge Apache 2 acceleration (X, Blog)

* Big CO LLMs + APIs

* Cerebras launches the fastest AI inference - 447t/s LLama 3.1 70B (X, Blog, Try It)

* Google - Gemini 1.5 Flash 8B & new Gemini 1.5 Pro/Flash (X, Try it)

* Google adds Gems & Imagen to Gemini paid tier

* Anthropic artifacts available to all users + on mobile (Blog, Try it)

* Anthropic publishes their system prompts with model releases (release notes)

* OpenAI has project Strawberry coming this fall (via The information)

* This weeks Buzz

* WandB Hackathon hackathon hackathon (Register, Join)

* Also, we have a new RAG course w/ Cohere and Weaviate (RAG Course)

* Vision & Video

* Zhipu AI CogVideoX - 5B video model w/ less than 10GB of VRAM (X, HF, Try it)

* Qwen-2 VL 72B,7B,2B - new SOTA vision models from QWEN (X, Blog, HF)

* AI Art & Diffusion & 3D

* GameNgen - completely generated (not rendered) DOOM with SD1.4 (project)

* FAL new LORA trainer for FLUX - trains under 5 minutes (Trainer, Coupon for ThursdAI)

* Tools & Others

* SimpleBench from AI Explained - closely matches human experience (simple-bench.com)

ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

Open Source

Let's be honest - ThursdAI is a love letter to the open-source AI community, and this week was packed with reasons to celebrate.

Nous Research DisTrO + Function Calling V1

Nous Research was on fire this week (aren't they always?) and they kicked off the week with the release of DisTrO, a breakthrough in distributed training. You see, while LLM training requires a lot of hardware, it also requires a lot of network bandwidth between the different GPUs, even within the same data center.

Proprietary networking solutions like Nvidia NVLink, and more open standards like Ethernet work well within the same datacenter, but training across different GPU clouds has been unimaginable until now.

Enter DisTrO, a new decentralized training method from the mad geniuses at Nous Research, which reduces the required bandwidth to train a 1.2B param model from 74.4GB to just 86MB (an 857x reduction)!

This can have massive implications for training across compute clusters, doing shared training runs, optimizing costs and efficiency and democratizing LLM training access! So don't sell your old GPUs just yet, someone may just come up with a folding@home but for training the largest open source LLM, and it may just be Nous!

Nous Research also released their function-calling-v1 dataset (HF) that was used to train Hermes-2, and we had InterstellarNinja, who authored that dataset, join the show and chat about it. This is an incredible unlock for the open source community, as function calling has become a de-facto standard now. Shout out to the Glaive team as well for their pioneering work that paved the way!

LinkedIn's Liger Kernel: Unleashing the Need for Speed (with One Line of Code)

What if I told you that you could add one line of code to your LLM training setup, and it would run 20% faster and require 60% less memory?

This is basically what LinkedIn researchers released this week with Liger Kernel. Yes, you read that right, LinkedIn, as in the website you post your career updates on!

"If you're doing any form of finetuning, using this is an instant win" (Wing Lian, Axolotl)

This absolutely bonkers improvement in training LLMs now works smoothly with Flash Attention, PyTorch FSDP and DeepSpeed. If you want to read more about the implementation of the Triton kernels, you can see a deep dive here. I just wanted to bring this to your attention, even if you're not technical, because efficiency jumps like these are happening all the time. We are used to seeing them in capabilities/intelligence, but they are also happening on the algorithmic/training/hardware side, and it's incredible to see!

Huge shoutout to Byron and the team at LinkedIn for this unlock, check out their Github if you want to get involved!
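For the curious, here's roughly what that "one line" looks like in practice. This is a minimal sketch based on my reading of their repo; the package and function names are assumptions, so check the Liger Kernel GitHub for the current API:

```python
# pip install liger-kernel transformers  (package/function names assumed, see their repo)
from liger_kernel.transformers import apply_liger_kernel_to_llama
from transformers import AutoModelForCausalLM

# The "one line": patches Llama's RMSNorm, RoPE, SwiGLU and cross-entropy
# with fused Triton kernels before the model is instantiated.
apply_liger_kernel_to_llama()

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
# ...then finetune exactly as you normally would (HF Trainer, Axolotl, etc.)
```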

Qwen-2 VL - SOTA image and video understanding + open weights mini VLM

You may already know that we love the folks at Qwen here on ThursdAI, not only because Junyang Lin is a frequent co-host and we get to hear about their releases as soon as they come out (they seem to release them on Thursdays around the time of the live show, I wonder why!)

But also because they are committed to open source, and have released two models, 7B and 2B, with a full Apache 2 license!

First of all, their Qwen-2 VL 72B model is now SOTA on many benchmarks, beating GPT-4, Claude 3.5 and other much bigger models. This is insane. I literally had to pause Junyang and repeat what he said: this is a 72B param model that beats GPT-4o on document understanding, on math, on general visual Q&A.

Additional Capabilities & Smaller models

They have added new capabilities to these models, like being able to handle arbitrary resolutions, but the one I'm most excited about is video understanding. These models can now understand up to 20 minutes of video, and it's not just "split the video into 10 frames and caption each image"; no, these models understand video progression, and if I understand correctly how they do it, it's quite genius.

They embed the video's time progression into the model using a new technique called M-RoPE, which turns the temporal progression into rotary positional embeddings.

Now, the 72B model is currently available via API, but we do get 2 new small models with an Apache 2 license, and they are NOT too shabby either!

The 7B (HF) and 2B (HF) Qwen-2 VL models are small enough to run completely on your machine, and the 2B version scores better than GPT-4o mini on OCRBench, for example!

I can't wait to finish writing and go play with these models!
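If you want to poke at the small ones yourself, here's a rough sketch of loading the 2B model with Hugging Face transformers. The model id and class names are my assumptions based on the HF release, so double-check the model card before copying this:

```python
# pip install -U transformers pillow  (needs a transformers version with Qwen2-VL support)
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor  # class name assumed
from PIL import Image

model_id = "Qwen/Qwen2-VL-2B-Instruct"  # assumed repo name, check the model card
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("receipt.png")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "What is the total on this receipt?"},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```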

Big Companies & LLM APIs

The biggest news this week came from Cerebras Systems, a relatively unknown company that shattered the world record for LLM inference out of the blue (and came on the show to talk about how they are doing it).

Cerebras - fastest LLM inference on wafer scale chips

Cerebras introduced the concept of wafer-scale chips to the world. If you picture a typical microchip, it's maybe the size of a postage stamp; GPUs are bigger; well, Cerebras makes chips the size of an iPad (72 square inches), the largest commercial chips in the world.

And now, they created an inference stack on top of those chips and showed that they have the fastest inference in the world. How fast? Well, they can serve Llama 3.1 8B at a whopping 1822t/s. No really, these are INSANE speeds. As I was writing this, I copied all the words I had so far, went to inference.cerebras.ai, asked it to summarize, pasted and hit send, and I immediately got a summary!

"The really simple explanation is we basically store the entire model, whether it's 8B or 70B or 405B, entirely on the chip. There's no external memory, no HBM. We have 44 gigabytes of memory on chip."James Wang

They not only store the whole model (405B coming soon), they store it in full fp16 precision as well, so they don't quantize the models. Right now, they are serving it with an 8K-token context window, and we had a conversation about their next steps being to give developers more context.

The whole conversation is well worth listening to; James and Ian were awesome to chat with, and while they do have a waitlist as they gradually roll out their release, James said to DM him on X and mention ThursdAI, and he'll put you through, so you'll be able to get an OpenAI-compatible API key and test this insane speed.

P.S. - we also did an independent verification of these speeds using Weave, and found Cerebras to be quite incredible for agentic purposes; you can read our report here and see the Weave dashboard here.
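Since the endpoint is OpenAI-compatible, once you have a key, calling it should look roughly like the sketch below. The base URL and model name here are my guesses, not something Cerebras confirmed on the show, so check their docs:

```python
# pip install openai
from openai import OpenAI

# Any OpenAI-compatible endpoint works with the stock client by overriding base_url.
client = OpenAI(
    api_key="YOUR_CEREBRAS_KEY",
    base_url="https://api.cerebras.ai/v1",  # assumed endpoint, check their docs
)

resp = client.chat.completions.create(
    model="llama3.1-70b",  # assumed model id
    messages=[{"role": "user", "content": "Summarize this week's AI news in one sentence."}],
)
print(resp.choices[0].message.content)
```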

Anthropic - unlocking just-in-time applications with artifacts for all

Well, if you aren't paying for Claude, maybe this will convince you. This week, Anthropic announced that Artifacts are available to all users, not only their paid customers.

Artifacts are a feature in Claude that is basically a side pane (and from this week, a drawer in their mobile apps) that allows you to see what Claude is building by rendering the web application almost on the fly. They have also trained Claude to work with that interface, so it knows about the different files, etc.

Effectively, this turns Claude into a web developer that will build mini web applications (without backend) for you, on the fly, for any task you can think of.

Drop in a design and it'll build a mock of it, drop some data in a CSV and it'll build an interactive one-time dashboard visualizing that data, or just ask it to build an app helping you split the bill between friends by uploading a picture of a bill.

Artifacts are shareable and remixable, so you can build something and share it with friends. So here you go, an artifact I made by dropping my notes into Claude and asking for a Magic 8 Ball that will spit out a random fact from today's edition of ThursdAI. I also provided Claude with an 8 Ball image, but it didn't work due to restrictions, so instead I just uploaded that image to Claude and asked it to recreate it with SVG! And voilà, a completely unnecessary app that works!

Google’s Gemini Keeps Climbing the Charts (But Will It Be Enough?)

Sensing a disturbance in the AI force (probably from that Cerebras bombshell), Google rolled out a series of Gemini updates, including a new experimental Gemini 1.5 Pro (0827) with sharper coding skills and logical reasoning. According to LMSys, it’s already nipping at the heels of ChatGPT 4o and is number 2!

Their Gemini 1.5 Flash model got a serious upgrade, vaulting to the #6 position on the arena. And to add to the model madness, they even released a Gemini Flash 8B parameter version for folks who need that sweet spot between speed and size.

Oh, and those long-awaited Gems are finally starting to roll out. But get ready to open your wallet – this feature (preloading Gemini with custom context and capabilities) is a paid-tier exclusive. But hey, at least Imagen-3 is cautiously returning to the image generation game!

AI Art & Diffusion

Doom Meets Stable Diffusion: AI Dreams in 20FPS Glory (GameNGen)

The future of video games is, uh, definitely going to be interesting. Just as everyone thought AI would be conquering Go or Chess, it seems we've stumbled into a different battlefield: first-person shooters. 🤯

This week, researchers at DeepMind blew everyone's minds with their GameNGen research. What did they do? They trained Stable Diffusion 1.4 on Doom, and I'm not just talking about static images - I'm talking about generating actual Doom gameplay in near real time. Think 20FPS Doom running on nothing but the magic of AI.

The craziest part to me is this quote "Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation"

FAL Drops the LORA Training Time Bomb (and I Get a New Meme!)

As you can see, I still haven't stopped making custom AI generations with Flux and customizing them by training LoRAs. Two weeks ago this used to take 45 minutes, a week ago 20 minutes, and now the wizards at FAL have created a new trainer that shrinks training times down to less than 5 minutes!

So, given the upcoming Polaris Dawn mission, the first SpaceX commercial spacewalk, I trained a SpaceX astronaut LoRA and then combined my face with it, and voilà, here I am as a SpaceX astronaut!

BTW, because they are awesome, Jonathan and Simo (the magician behind this new trainer) came on the show, announced the new trainer, and also gave all listeners of ThursdAI a coupon to train a LoRA effectively for free, just use this link and start training! (btw I get nothing out of this, just trying to look out for my listeners!)

That's it for this week. Well, almost: just as we were wrapping up, magic.dev announced a new funding round of $320 million, and that they have models capable of 100M-token context windows plus a coding product to go with them, but didn't release anything yet. Sam Altman tweeted that ChatGPT now has over 200 million active users and that OpenAI will collaborate with the AI Safety Institute.

Ok now officially that's it! See you next week, when it's going to be 🍁 already brrr

ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.



This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe

📅 AI21 Jamba 1.5, DIY Meme Faces, 8yo codes with AI and a Doomsday LLM Device?!

Thursday, August 22, 2024 - Duration: 01:41:39

Hey there, Alex here with an end-of-summer edition of our show, which did not disappoint. Today is the official anniversary of Stable Diffusion 1.4, can you believe it?

It's the second week in a row that we have an exclusive LLM launch on the show (after Emozilla announced Hermes 3 on last week's show), and spoiler alert, we may have something cooking for next week as well!

This edition of ThursdAI is brought to you by W&B Weave, our LLM observability toolkit, letting you evaluate LLMs for your own use-case easily

Also this week, we've covered both ends of AI progress: a doomerist CEO saying "Fck Gen AI" vs an 8yo coder, and I continued to geek out on putting myself into memes (I promised I'll stop... at some point), so buckle up, let's take a look at another crazy week:

TL;DR

* Open Source LLMs

* AI21 releases Jamba1.5 Large / Mini hybrid Mamba MoE (X, Blog, HF)

* Microsoft Phi 3.5 - 3 new models including MoE (X, HF)

* BFCL 2 - Berkeley Function Calling Leaderboard V2 (X, Blog, Leaderboard)

* NVIDIA - Mistral Nemo Minitron 8B - Distilled / Pruned from 12B (HF)

* Cohere paper proves - code improves intelligence (X, Paper)

* MOHAWK - transformer → Mamba distillation method (X, Paper, Blog)

* AI Art & Diffusion & 3D

* Ideogram launches v2 - new img diffusion king 👑 + API (X, Blog, Try it)

* Midjourney is now on web + free tier (try it finally)

* Flux keeps getting better, cheaper, faster + adoption from OSS (X, X, X)

* Procreate hates generative AI (X)

* Big CO LLMs + APIs

* Grok 2 full is finally available on X - performs well on real time queries (X)

* OpenAI adds GPT-4o Finetuning (blog)

* Google API updates - 1000 pages PDFs + LOTS of free tokens (X)

* This weeks Buzz

* Weights & Biases Judgement Day SF Hackathon in September 21-22 (Sign up to hack)

* Video

* Hotshot - new video model - trained by 4 guys (try it, technical deep dive)

* Luma Dream Machine 1.5 (X, Try it)

* Tools & Others

* LMStudio 0.0.3 update - local RAG, structured outputs with any model & more (X)

* Vercel - Vo now has chat (X)

* Ark - a completely offline device - offline LLM + worlds maps (X)

* Ricky's Daughter coding with cursor video is a must watch (video)

The Best of the Best: Open Source Wins with Jamba, Phi 3.5, and Surprise Function Calling Heroes

We kick things off this week by focusing on what we love the most on ThursdAI, open-source models! We had a ton of incredible releases this week, starting off with something we were super lucky to have live, the official announcement of AI21's latest LLM: Jamba.

AI21 Officially Announces Jamba 1.5 Large/Mini – The Powerhouse Architecture Combines Transformer and Mamba

While we covered the original Jamba release on the show back in April, Jamba 1.5 is an updated powerhouse. It's two models, Large and Mini, both MoE, and both still a hybrid architecture of Transformers + Mamba that tries to get the best of both worlds.

Itay Dalmedigos, technical lead at AI21, joined us on the ThursdAI stage for an exclusive first look, giving us the full rundown on this developer-ready model with an awesome 256K context window, but it's not just the size – it’s about using that size effectively.

AI21 measured the effective context use of their model on the new RULER benchmark released by NVIDIA, an iteration of the needle in the haystack and showed that their models have full utilization of context, as opposed to many other models.

“As you mentioned, we’re able to pack many, many tokens on a single GPU. Uh, this is mostly due to the fact that we are able to quantize most of our parameters", Itay explained, diving into their secret sauce, ExpertsInt8, a novel quantization technique specifically designed for MoE models.

Oh, and did we mention Jamba is multilingual (eight languages and counting), natively supports structured JSON, function calling, document digestion… basically everything developers dream of? They even chucked in citation generation: as its long context can contain full documents, your RAG app may not even need to chunk anything, and the citations can cite full documents!

Berkeley Function Calling Leaderboard V2: Updated + Live (link)

Ever wondered how to measure the real-world magic of those models boasting "I can call functions! I can do tool use! Look how cool I am!" 😎? Enter the Berkeley Function Calling Leaderboard (BFCL) 2, a battleground where models clash to prove their function calling prowess.

Version 2 just dropped, and this ain't your average benchmark, folks. It's armed with a "Live Dataset" - a dynamic, user-contributed treasure trove of real-world queries, rare function documentations, and specialized use-cases spanning multiple languages. Translation: NO more biased, contaminated datasets. BFCL 2 is as close to the real world as it gets.

So, who’s sitting on the Function Calling throne this week? Our old friend Claude 3.5 Sonnet, with an impressive score of 73.61. But breathing down its neck is GPT 4-0613 (the OG Function Calling master) with 73.5. That's right, the one released a year ago, the first one with function calling, in fact the first LLM with function calling as a concept IIRC!

Now, prepare for the REAL plot twist. The top-performing open-source model isn’t some big name, resource-heavy behemoth. It’s a tiny little underdog called Functionary Medium 3.1, a finetuned version of Llama 3.1 that blew everyone away. It even outscored both versions of Claude 3 Opus AND GPT 4 - leaving folks scrambling to figure out WHO created this masterpiece.

“I’ve never heard of this model. It's MIT licensed from an organization called MeetKai. Have you guys heard about Functionary Medium?” I asked, echoing the collective bafflement in the space. Yep, turns out there’s gold hidden in the vast landscape of open source models, just waiting to be unearthed ⛏️.

Microsoft updates Phi 3.5 - 3 new models including an MoE + MIT license

Three new Phis dropped this week, including an MoE one and a revamped vision one. They look very decent on benchmarks yet again, with the mini version (3.8B) seemingly beating Llama 3.1 8B on a few benchmarks.

However, as before, the excitement is met with caution, because Phi models tend to look great on benchmarks, but when folks actually talk to them, they're usually not as impressed.

Terry from BigCodeBench also saw a significant decrease in coding ability for Phi 3.5 vs 3.1

Of course, we're not complaining; the models were released with 128K context and an MIT license.

The thing I'm most excited about is the vision model updates, it has been updated with "multi-frame image understanding and reasoning" which is a big deal! This means understanding videos more natively across scenes.

This weeks Buzz

Hey, if you're reading this while sitting in the Bay Area and you don't have plans for exactly a month from now, why don't you come and hack with me? (Register Free)

Announcing the first W&B hackathon, Judgement Day, which is going to be focused on LLM-as-a-judge! Come hack on innovative LLM-as-a-judge ideas, UIs, evals and more, meet other like-minded hackers and AI engineers, and win great prizes!

🎨 AI Art: Ideogram Crowns Itself King, Midjourney Joins the Internet & FLUX everywhere

While there was little news from big LLM labs this week, there is a LOT of AI art news, which is fitting to celebrate 2 year Stable Diffusion 1.4 anniversary!

👑 Ideogram v2: Text Wizardry and API Access (But No Loras… Yet?)

With significantly improved realism, and likely the best text generation across all models out there, Ideogram v2 just took over the AI image generation game! Just look at that text sharpness!

They now offer a selection of styles (Realistic, Design, 3D, Anime) and any aspect ratios you'd like and also, brands can now provide color palettes to control the outputs!

Adding to this is a new API offering (0.8¢ per image for the main model, 0.5¢ for the new turbo model of v2!) and a new iOS app. They also added the option (for premium users only) to search through a billion generations and their prompts, which is a great offering as well, as sometimes you don't even know what to prompt.

They claim a significant improvement over Flux [pro] and DALL-E 3 in text, alignment and overall quality; interesting that MJ was not compared!

Meanwhile, Midjourney finally launched a website and a free tier, so no longer do you have to learn to use Discord to even try Midjourney.

Meanwhile Flux enjoys the fruits of Open Source

While Ideogram and MJ fight it out on the closed-source side, Black Forest Labs enjoys the fruits of having released their weights in the open.

FAL just released an update that makes LoRAs run 2.5x faster and 2.5x cheaper, CivitAI has LoRAs for pretty much every character and celebrity ported to FLUX already, different techniques like ControlNet Unions, IP-Adapters and more are being trained as we speak, and tutorials upon tutorials are being released on how to customize these models for free (shoutout to my friend Matt Wolfe for this one).

You can now train your own face on fal.ai, replicate.com and astria.ai, and thanks to Astria, I was able to find some old generations of my LoRAs from the 1.5 days (not quite 1.4, but still enough to show the difference between then and now) and whoa.

🤔 Is This AI Tool Necessary, Bro?

Let's end with a topic that stirred up a hornet's nest of opinions this week: Procreate, a beloved iPad design app, publicly declared their "fing hate" for generative AI.

Yeah, you read that right. Hate. The CEO, in a public statement went FULL scorched earth - proclaiming that AI-powered features would never sully the pristine code of their precious app.

“Instead of trying to bridge the gap, he’s creating more walls", Wolfram commented, echoing the general “dude… what?” vibe in the space. “It feels marketeerial”, I added, pointing out the obvious PR play (while simultaneously acknowledging the very REAL, very LOUD segment of the Procreate community that cheered this decision).

Here’s the thing: you can hate the tech. You can lament the potential demise of the human creative spark. You can rail against the looming AI overlords. But one thing’s undeniable: this tech isn't going anywhere.

Meanwhile, 8yo coders lean fully into AI

As a contrast to this doomerism take, just watch this video of Ricky Robinette's eight-year-old daughter building a Harry Potter website in 45 minutes, using nothing but a chat interface in Cursor. No coding knowledge. No prior experience. Just prompts and the power of AI ✨.

THAT’s where we’re headed, folks. It might be terrifying. It might be inspiring. But it’s DEFINITELY happening. Better to understand it, engage with it, and maybe try to nudge it in a positive direction, than burying your head in the sand and muttering “I bleeping hate this progress” like a cranky, Luddite hermit. Just sayin' 🤷‍♀️.

AI Device to reboot civilization (if needed)

I was scrolling through my feed (as I do VERY often, to bring you this every week) and I saw this and super quickly decided to invite the author to the show to talk about it.

Adam Cohen Hillel has prototyped an AI hardware device, but this one isn't trying to record you or be your friend. No, this one comes with offline LLMs finetuned with health and bio information, survival tactics, and all of the world's maps, and it works completely offline!

This, to me, was a very exciting use for an LLM: a distilled version of all human knowledge, buried in a Faraday cage, with replaceable batteries, that runs on solar and can help you survive in case something really bad happens (think a solar flare that takes out the electrical grid, or an EMP device). While improbable, I thought this was a great idea and had a nice chat with the creator; you should definitely give this one a listen, and if you want to buy one, he is going to sell them soon here.

That's it for this week. There have been a few updates from the big labs: OpenAI has opened fine-tuning for GPT-4o, and you can use your WandB API key in there to track those runs, which is cool; the Gemini API now accepts incredibly large PDF files (up to 1000 pages); and Grok 2 is finally on X (the full model, not the mini from last week).

See you next week (we will have another deep dive!)



This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe

ThursdAI - June 13th, 2024 - Apple Intelligence recap, Elon's reaction, Luma's Dream Machine, AI Engineer invite, SD3 & more AI news from this past week

Thursday, June 13, 2024 - Duration: 01:46:25

Happy Apple AI week everyone (well, those of us who celebrate, some don't) as this week we finally got told what Apple is planning to do with this whole generative AI wave and presented Apple Intelligence (which is AI, get it? they are trying to rebrand AI!)

This week's pod and newsletter main focus will be Apple Intelligence, of course, as it was for most people, judging by how the market reacted ($AAPL grew over $360B in the few days after this announcement) and how many people watched each live stream (10M at the time of this writing watched the WWDC keynote on YouTube, compared to 4.5M for the OpenAI GPT-4o event and 1.8M for Google I/O).

On the pod we also geeked out on new eval frameworks and benchmarks, including a chat with the authors of MixEval, which I wrote about last week, and a new benchmark called LiveBench from Abacus and Yann LeCun.

Plus a new video model from Luma and finally SD3, let's go! 👇

TL;DR of all topics covered:

* Apple WWDC recap and Apple Intelligence (X)

* This Weeks Buzz

* AI Engineer expo in SF (June 25-27) come see my talk, it's going to be Epic (X, Schedule)

* Open Source LLMs

* Microsoft Samba - 3.8B MAMBA + Sliding Window Attention beating Phi 3 (X, Paper)

* Sakana AI releases LLM squared - LLMs coming up with preference algorithms to train better LLMS (X, Blog)

* Abacus + Yann LeCun release LiveBench.AI - impossible-to-game benchmark (X, Bench)

* Interview with MixEval folks about achieving 96% arena accuracy with 5000x less price

* Big CO LLMs + APIs

* Mistral announced a 600M series B round

* Revenue at OpenAI DOUBLED in the last 6 month and is now at $3.4B annualized (source)

* Elon drops lawsuit vs OpenAI

* Vision & Video

* Luma drops DreamMachine - SORA like short video generation in free access (X, TRY IT)

* AI Art & Diffusion & 3D

* Stable Diffusion Medium weights are here (X, HF, FAL)

* Tools

* Google releases GenType - create an alphabet with diffusion Models (X, Try It)

Apple Intelligence

Technical LLM details

Let's dive right into what wasn't shown in the keynote: in a 6-minute deep dive video from the developer State of the Union and in a follow-up post on their machine learning blog, Apple shared some very exciting technical details about their on-device models and the orchestration that will become Apple Intelligence.

Namely, on device they have trained a bespoke 3B parameter LLM, which was trained on licensed data and uses a bunch of cutting-edge modern techniques to achieve quite incredible on-device performance. Stuff like GQA, speculative decoding, and a very unique type of quantization (which they claim is almost lossless).

"To maintain model quality, we developed a new framework using LoRA adapters that incorporates a mixed 2-bit and 4-bit configuration strategy — averaging 3.5 bits-per-weight — to achieve the same accuracy as the uncompressed models [...] on iPhone 15 Pro we are able to reach time-to-first-token latency of about 0.6 millisecond per prompt token, and a generation rate of 30 tokens per second"

These small models (they also have a bespoke image diffusion model) are going to be finetuned with a lot of LoRA adapters for specific tasks like summarization, query handling, mail replies, urgency and more, which gives their foundation models the ability to specialize on the fly for the task at hand; the adapters can also be cached in memory for optimal performance.

Personal and Private (including in the cloud)

While these models are small, they also benefit from two more things on device: a vector store of your stuff (contacts, recent chats, calendar, photos) that they call the semantic index, and a new thing Apple is calling App Intents, which developers can expose (and the OS apps already do) and which allow the LLM to use tools, like moving files, extracting data across apps, and performing actions. This already makes the AI much more personal and helpful, as it has in its context things about me and what my apps can do on my phone.

Handoff to the Private Cloud (and then to OpenAI)

What the local 3B LLM + context can't handle, it hands off to the cloud, in what Apple claims is a very secure way called Private Cloud Compute: new inference infrastructure in the cloud, on Apple Silicon, with Secure Enclave and Secure Boot, ensuring that the LLM sessions that run inference on your data are never stored, that even Apple can't access those sessions, not to mention train their LLMs on your data.

Here are some benchmarks Apple posted for their on-device 3B model and an unknown-size server model, comparing them to GPT-4-Turbo (not 4o!) on unnamed benchmarks they came up with.

In cases where Apple Intelligence cannot help you with a request (I'm still unclear when this would actually happen), iOS will now show you this dialog, suggesting you use ChatGPT from OpenAI, marking a deal with OpenAI (in which apparently nobody pays anybody, so Apple isn't getting paid by OpenAI to be placed there, nor does Apple pay OpenAI for the additional compute, tokens, and inference).

Implementations across the OS

So what will people actually be able to do with this intelligence? I'm sure that Apple will add much more in the next versions of iOS, but at least for now, Siri is getting an LLM brain transplant and is going to be much smarter and more capable, from understanding natural speech better (and just having better ears; the on-device speech-to-text is improved and is really good now in the iOS 18 beta) to being able to use App Intents to do actions for you across several apps.

Other features across the OS will use Apple Intelligence to prioritize your notifications, summarize group chats that are going off, and provide built-in tools for rewriting, summarizing, and turning any text anywhere into anything else. Basically, many of the tasks you'd use ChatGPT for are now built into the OS itself, for free.

Apple is also adding AI art diffusion features like Genmoji (the ability to generate any emoji you can think of, like a chef's kiss, or a seal with a clown nose), and while this sounds absurd, I've never been in a Slack or a Discord that didn't have its own unique custom emojis uploaded by its members.

And one last feature I'll highlight is Image Playground, Apple's take on generating images, which works not just from text but with a contextual understanding of your conversation, lets you create with autosuggested concepts instead of just text prompts, and is going to be available to all developers to bake into their apps.

Elon is SALTY - and it's not because of privacy

I wasn't sure whether to include this segment, but in what became my most viral tweet since the beginning of this year, I posted about Elon muddying the water about what Apple actually announced, and called it a psyop that worked. Much of the mainstream media, and definitely the narrative on X, turned to what Elon thinks about those announcements rather than the announcements themselves; just look at this insane reach.

We've covered Elon vs OpenAI before (a lawsuit that he actually withdrew this week, because emails came out showing he knew and was OK with OpenAI not being open), so it's no surprise that when Apple decided to partner with OpenAI and not, say... xAI, Elon would promote absolutely incorrect and ignorant takes to take over the airwaves, like claiming he will ban Apple devices from all his companies, or that OpenAI will get access to train on your iPhone data.

This weeks BUZZ (Weights & Biases Update)

Hey, if you're reading this, it's very likely that you've already registered for or at least heard of ai.engineer, and if you haven't, well, I'm delighted to tell you that we're sponsoring this awesome event in San Francisco, June 25-27. Not only are we official sponsors, both Lukas (the co-founder and CEO) and I will be there giving talks (mine will likely be crazier than his), and we'll have a booth there, so if you're coming, make sure to come by my talk (or Lukas's if you're a VP and are signed up for that exclusive track).

Everyone in our corner of the world is going to be there. Swyx told me that many of the foundation model labs are coming, OpenAI, Anthropic, Google, and there are going to be tons of tracks (my talk is of course in the Evals track; come, really, I might embarrass myself on stage for eternity, you don't want to miss this).

Swyx kindly provided listeners and readers of ThursdAI with a special coupon feeltheagi so even more of a reason to try and convince your boss and come see me on stage in a costume (I've said too much!)

Vision & Video

Luma drops DreamMachine - SORA like short video generation in free access (X, TRY IT)

In an absolute surprise, Luma AI, a company that (used to) specialize in crafting 3D models, has released a free-access video model similar to SORA and Kling (which we covered last week) that generates 5-second videos (and doesn't require a Chinese phone # haha).

It's free to try, and supports text to video, image to video, cinematic prompt instructions, great and cohesive narrative following, character consistency and a lot more.

Here's a comparison of the famous SORA videos and LDM (Luma Dream Machine) videos that was provided to me on X by AmebaGPT; however, it's worth noting that these are cherry-picked SORA videos, while LDM is likely a much smaller and quicker model, and folks are already creating some incredible things!

AI Art & Diffusion & 3D

Stable Diffusion Medium weights are here (X, HF, FAL)

It's finally here (well, I'm using "finally" carefully here, and really hoping that this isn't the last thing Stability AI releases): the weights for Stable Diffusion 3 are available on HuggingFace! SD3 offers improved photorealism and awesome prompt adherence, like asking for multiple subjects doing multiple things.

It's also pretty good at typography and fairly resource-efficient compared to previous versions, though I'm still waiting for the super turbo distilled versions that will likely come soon!

ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

And that's it for this week folks, it's been a hell of a week. I really do appreciate each and every one of you who makes it to the end, reading and engaging, and I would love to ask for feedback, so if anything didn't resonate, too long / too short, or on the podcast itself, too much info, too little info, please do share, I will take it into account 🙏 🫡

Also, we're coming up to the 52nd week I've been sending these, which will mark ThursdAI BirthdAI for real (the previous one was for the live shows) and I'm very humbled that so many of you are now reading, sharing and enjoying learning about AI together with me 🙏

See you next week,

Alex



This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe

📅 ThursdAI - Jun 6th - 👑 Qwen2 Beats Llama-3! Jina vs. Nomic for Multimodal Supremacy, new Chinese SORA, Suno & Udio user uploads & more AI news

Friday, June 7, 2024 - Duration: 01:43:45

Hey hey! This is Alex! 👋

Some podcasts have 1 or maaaybe 2 guests an episode; we had 6! guests today, each of whom had an announcement, an open source release, or a breaking news story that we've covered! (PS, this edition is very multimodal, so click into the Substack as videos don't play in your inbox)

As you know my favorite thing is to host the folks who make the news to let them do their own announcements, but also, hitting that BREAKING NEWS button when something is actually breaking (as in, happened just before or during the show) and I've actually used it 3 times this show!

It's not every week that we get to announce a NEW SOTA open model with the team that worked on it. Junyang (Justin) Lin from Qwen is a friend of the pod and a frequent co-host, and today gave us the breaking news of this month, as Qwen2 72B is beating Llama-3 70B on most benchmarks! That's right, a new state-of-the-art open LLM was announced on the show, and Justin went deep into the details 👏 (so don't miss this conversation, listen wherever you get your podcasts).

We also chatted about SOTA multimodal embeddings with the Jina folks (Bo Wang and Han Xiao) and Zach from Nomic, dove into an open source compute grant with FAL's Batuhan Taskaya, and much more!

TL;DR of all topics covered:

* Open Source LLMs

* Alibaba announces Qwen 2 - 5 model suite (X, HF)

* Jina announces Jina-Clip V1 - multimodal embeddings beating CLIP from OAI (X, Blog, Web Demo)

* Nomic announces Nomic-Embed-Vision (X, BLOG)

* MixEval - arena-style rankings that track Chatbot Arena model rankings with 2000x less time (5 minutes) and 5000x less cost ($0.60) (X, Blog)

* Vision & Video

* Kling - open access video model SORA competitor from China (X)

* This Weeks Buzz

* WandB supports Mistral new finetuning service (X)

* Register to my June 12 workshop on building Evals with Weave HERE

* Voice & Audio

* StableAudio Open - X, BLOG, TRY IT

* Suno launches "upload your audio" feature to select few - X

* Udio - upload your own audio feature - X

* AI Art & Diffusion & 3D

* Stable Diffusion 3 weights are coming on June 12th (Blog)

* JasperAI releases Flash Diffusion (X, TRY IT, Blog)

* Big CO LLMs + APIs

* Group of ex-OpenAI sign a new letter - righttowarn.ai

* A hacker releases TotalRecall - a tool to extract all the info from MS Recall Feature (Github)

Open Source LLMs

QWEN 2 - new SOTA open model from Alibaba (X, HF)

This is definitely the biggest news for this week, as the folks at Alibaba released a very surprising and super high quality suite of models, spanning from a tiny 0.5B model to a new leader in open models, Qwen 2 72B

To add to the distance from Llama-3, these new models support a wide range of context lengths, all large, with the 7B and 72B supporting up to 128K context.

Justin mentioned on stage that actually finding sequences of longer context lengths is challenging, and this is why they are only at 128K.

In terms of advancements, the highlight is advanced Code and Math capabilities, which are likely to contribute to overall model advancements across other benchmarks as well.

It's also important to note that all models (besides the 72B) are now released under an Apache 2 license to help folks actually use them globally, and speaking of global reach, these models have been natively trained on 27 additional languages, making them considerably better at multilingual prompts!

One additional amazing thing was that a finetune was released by Eric Hartford and the Cognitive Computations team, and AFAIK this is the first time a new model drops with an external finetune. Justin literally said: "It is quite amazing. I don't know how they did that. Well, our teammates don't know how they did that, but, uh, it is really amazing when they use the Dolphin dataset to train it."

Here's the Dolphin finetune metrics and you can try it out here

ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

Jina-Clip V1 and Nomic-Embed-Vision SOTA multimodal embeddings

It's quite remarkable that we got 2 separate SOTA of a similar thing during the same week, and even more cool that both companies came to talk about it on ThursdAI!

First we welcomed back Bo Wang from Jina (joined by Han Xiao, the CEO), and Bo talked about multimodal embeddings that beat OpenAI's CLIP (which both conceded was a very low bar).

Jina CLIP V1 is Apache 2 open sourced, while Nomic Embed beats it on benchmarks but is CC-BY-NC (non-commercially) licensed; in most cases though, if you're embedding, you'd likely use an API, and both companies offer these embeddings via their respective APIs.

One thing to note about Nomic, is that they have mentioned that these new embeddings are backwards compatible with the awesome Nomic embed endpoints and embeddings, so if you've used that, now you've gone multimodal!

Because these models are fairly small, there are now web versions, thanks to transformers.js, of both Jina and Nomic Embed (caution, this will download large-ish files), built by none other than our friend Xenova.

If you're building any type of multimodal semantic search, these two embedding systems now open up all your RAG needs for multimodal data!
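As a sketch of what that looks like in practice: once your images and your text query are embedded into the same space (by whichever of the two APIs you pick), multimodal retrieval is just a cosine-similarity lookup. The embed_* functions below are stand-ins for the real API calls, stubbed with random vectors so the example runs:

```python
import numpy as np

# Stand-ins for the Jina / Nomic embedding APIs: in practice these would be HTTP calls
# returning one vector per input; random vectors keep the sketch self-contained.
def embed_images(paths, dim=768):
    return np.random.randn(len(paths), dim)

def embed_texts(texts, dim=768):
    return np.random.randn(len(texts), dim)

def cosine_sim(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

image_vecs = embed_images(["product1.jpg", "product2.jpg"])
query_vec = embed_texts(["red running shoes"])
scores = cosine_sim(query_vec, image_vecs)
print(int(scores.argmax()))  # index of the image closest to the text query
```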

This weeks Buzz (What I learned with WandB this week)

Mistral announced built-in finetuning support, with a simple WandB integration! (X)

Also, my workshop about building evals 101 is coming up next week, June 12. I'm excited to share with you a workshop that we wrote for an in-person crowd, please register here

and hope to see you next week!

Vision & Video

New SORA like video generation model called KLING in open access (DEMO)

This one has to be seen to be believed. Out of nowhere, an obscure (to us) Chinese company, kuaishou.com, dropped a landing page with tons of videos that are clearly AI generated, and they all look very close to SORA quality, way surpassing everything else we've seen in this category (Runway, Pika, SVD).

And they claim that they offer support for it via their App (but you need apparently a Chinese phone number, so not for me)

It's really hard to believe that this quality already exists outside of a frontier lab full of GPUs like OpenAI, and it's now in waitlist mode, whereas SORA is "coming soon".

Voice & Audio

Stability open sources Stable Audio Open (X, BLOG, TRY IT)

A new open model from Stability is always fun, and while we wait for SD3 to drop weights (June 12! we finally have a date), we get this awesome model from the Dadabots team at Stability.

It's able to generate 47 seconds of audio, and is awesome at generating loops, drums and other non-vocal stuff, so not quite where Suno/Udio are, but the samples are very clean and sound very good. Prompt: New York Subway

They focused the model on being able to get finetuned on a specific drummer's style, for example, and on being open and specializing in samples and sound effects rather than melodies or finished full songs, but it has some decent skills with simple prompts like "progressive house music".

This model has a non commercial license and can be played with here

Suno & Udio let users upload their own audio!

This one is big, so big in fact, that I am very surprised that both companies announced this exact feature the same week.

Suno has reached out to me and a bunch of other creators, and told us that we are now able to upload our own clips, be it someone playing solo guitar, or even whistling and have Suno remix it into a real proper song.

In this example, this is a very viral video, this guy sings at a market selling fish (to ladies?) and Suno was able to create this remix for me, with the drop, the changes in his voice, the melody, everything, it’s quite remarkable!

AI Art & Diffusion

Flash Diffusion from JasperAI / Clipdrop team (X, TRY IT, Blog, Paper)

Last but definitely not least, we now have a banger of a diffusion update from the Clipdrop team (who were doing amazing things before Stability bought them and then sold them to JasperAI).

Diffusion models like Stable Diffusion often take 30-40 inference steps to get you the image, searching for your prompt through latent space, you know?

Well, recently there have been tons of these new distillation methods: models that are like students, which learn from a teacher model (Stable Diffusion XL, for example) and distill the same capability down to a few steps (sometimes as few as 2!).

Often the results are distilled models that can run in real time, like SDXL Turbo, Lightning SDXL, etc.
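To make the step-count difference concrete, here's roughly what running one of those distilled models looks like with diffusers, using SDXL-Turbo as the example (the one-step / no-guidance settings follow its model card; a different distilled checkpoint, Flash Diffusion included, may want slightly different settings):

```python
# pip install diffusers transformers accelerate
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16
).to("cuda")

# A distilled model needs 1-4 steps instead of the usual 30-40, and turbo-style
# models are typically run without classifier-free guidance.
image = pipe(
    "a watercolor fox in an autumn forest",
    num_inference_steps=1,
    guidance_scale=0.0,
).images[0]
image.save("fox.png")
```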

Now Flash Diffusion achieves State-of-the-Art (SOTA) performance metrics, specifically in terms of Fréchet Inception Distance (FID) and CLIP Score. These metrics are the default for evaluating the quality and relevance of generated images.

And Jasper has open sourced the whole training code to allow for reproducibility which is very welcome!

Flash Diffusion also covers not only image generation, but also inpainting and upscaling, allowing it to be applied to other pipelines to speed them up as well.

This is all for this week, I mean, there are TONS more stuff we could have covered, and we did mention them on the pod, but I aim to serve as a filter to the most interesting things as well so, until next week 🫡



This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe

📅 ThursdAI - May 30 - 1000 T/s inference w/ SambaNova, <135ms TTS with Cartesia, SEAL leaderboard from Scale & more AI news

Friday, May 31, 2024 - Duration: 01:52:52

Hey everyone, Alex here!

Can you believe it's already the end of May? And that two huge AI company conferences are behind us (Google I/O, MSFT Build), and Apple's WWDC is just 10 days ahead! Exciting!

I was really looking forward to today's show, had quite a few guests today, I'll add all their socials below the TL;DR so please give them a follow and if you're only in reading mode of the newsletter, why don't you give the podcast a try 🙂 It's impossible for me to add the density of knowledge that's being shared on stage for 2 hours here in the newsletter!

Also, before we dive in, I’m hosting a free workshop soon, about building evaluations from scratch, if you’re building anything with LLMs in production, more than welcome to join us on June 12th (it’ll be virtual)

TL;DR of all topics covered:

* Open Source LLMs

* Mistral open weights Codestral - 22B dense coding model (X, Blog)

* Nvidia open sources NV-Embed-v1 - Mistral based SOTA embeddings (X, HF)

* HuggingFace Chat with tool support (X, demo)

* Aider beats SOTA on Swe-Bench with 26% (X, Blog, Github)

* OpenChat - Sota finetune of Llama3 (X, HF, Try It)

* LLM 360 - K2 65B - fully transparent and reproducible (X, Paper, HF, WandB)

* Big CO LLMs + APIs

* Scale announces SEAL Leaderboards - with private Evals (X, leaderboard)

* SambaNova achieves >1000T/s on Llama-3 full precision

* Groq hits back with breaking 1200T/s on Llama-3

* Anthropic tool support in GA (X, Blogpost)

* OpenAI adds GPT4o, Web Search, Vision, Code Interpreter & more to free users (X)

* Google Gemini & Gemini Flash are topping the evals leaderboards, in GA(X)

* Gemini Flash finetuning coming soon

* This weeks Buzz (What I learned at WandB this week)

* Sponsored a Mistral hackathon in Paris

* We have an upcoming workshop in 2 parts - come learn with me

* Vision & Video

* LLama3-V - Sota OSS VLM (X, Github)

* Voice & Audio

* Cartesia AI - super fast SSM based TTS with very good sounding voices (X, Demo)

* Tools & Hardware

* Jina Reader (https://jina.ai/reader/)

* Co-Hosts and Guests

* Rodrigo Liang (@RodrigoLiang) & Anton McGonnell (@aton2006) from SambaNova

* Itamar Friedman (@itamar_mar) Codium

* Arjun Desai (@jundesai) - Cartesia

* Nisten Tahiraj (@nisten) - Cohost

* Wolfram Ravenwolf (@WolframRvnwlf)

* Eric Hartford (@erhartford)

* Maziyar Panahi (@MaziyarPanahi)

Scale SEAL leaderboards (Leaderboard)

Scale AI has announced their new initiative, called SEAL leaderboards, which aims to provide yet another point of reference in how we understand frontier models and their performance against each other.

We've of course been sharing LMSys arena rankings here, and the OpenLLM leaderboard from HuggingFace; however, there are issues with both these approaches, and Scale is approaching measurement in a different way, focusing on very private benchmarks and datasets curated by their experts (like Riley Goodside).

The focus of SEAL is private and novel assessments across Coding, Instruction Following, Math, Spanish and more, and the main reason they keep this private, is so that models won't be able to train on these benchmarks if they leak to the web, and thus show better performance due to data contamination.

They are also using Elo scores (Bradley-Terry), and I love this footnote from the actual website:

"To ensure leaderboard integrity, we require that models can only be featured the FIRST TIME when an organization encounters the prompts"

This means they are taking the contamination thing very seriously and it's great to see such dedication to being a trusted source in this space.
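For those unfamiliar with the Bradley-Terry / Elo scoring mentioned above, here's a tiny illustration of the idea (a generic textbook formula, not Scale's actual implementation):

```python
def win_probability(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Bradley-Terry / Elo-style probability that model A beats model B.

    The 400-point scale is the conventional Elo choice; ratings are fit so that
    these predicted win rates match the observed pairwise preferences.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

# A model rated 100 points higher is expected to win ~64% of head-to-head comparisons.
print(round(win_probability(1300, 1200), 2))  # 0.64
```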

Specifically interesting also that on their benchmarks, GPT-4o is not better than Turbo at coding, and definitely not by 100 points like it was announced by LMSys and OpenAI when they released it!

Gemini 1.5 Flash (and Pro) in GA and showing impressive performance

As you may remember from my Google I/O recap, I was really impressed with Gemini Flash, and I felt that it went under the radar for many folks. Given its throughput speed, 1M context window, multimodality and price tier, I strongly believed that Google was onto something here.

Well, this week not only was I proven right, I didn't actually realize how right I was 🙂 as we heard breaking news from Logan Kilpatrick during the show: the models are now in GA, Gemini Flash gets upgraded to 1000 RPM (requests per minute), and finetuning is coming and will be free of charge!

Not only will finetuning not cost you anything, inference on your tuned model is going to cost the same, which is very impressive.

There was a sneaky price adjustment from the announced pricing to the GA pricing that doubled the price of output tokens, but even despite that, Gemini Flash at $0.35/1M tokens for input and $1.05/1M tokens for output is probably the best deal there is right now for LLMs of this level.
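For a back-of-the-envelope sense of what those rates mean in practice (using the numbers quoted above, so recheck the pricing page before budgeting anything):

```python
# Rough cost of a single Gemini 1.5 Flash call at the GA rates quoted above ($ per 1M tokens).
INPUT_RATE, OUTPUT_RATE = 0.35, 1.05
input_tokens, output_tokens = 100_000, 2_000
cost = input_tokens / 1e6 * INPUT_RATE + output_tokens / 1e6 * OUTPUT_RATE
print(f"${cost:.4f}")  # ≈ $0.0371 for a 100K-token input and a 2K-token output
```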

This week it was also confirmed, both on LMSys and on the Scale SEAL leaderboards, that Gemini Flash is a very good coding LLM, beating Claude Sonnet and Llama-3 70B!

SambaNova + Groq competing at 1000T/s speeds

What a week for inference speeds!

SambaNova (an AI startup founded in 2017, with $1.1B in investment from Google Ventures, Intel Capital, Samsung and SoftBank) announced that they broke the 1000T/s inference barrier on Llama-3-8B in full precision mode, using their custom hardware called the RDU (reconfigurable dataflow unit).

As you can see, this is incredibly fast, really, try it yourself here.

Seeing this, the folks at Groq, who held the previous record for super fast inference (as I reported just in February), decided not to let this slide, and released an incredible 20% improvement on their own inference of Llama-3-8B, getting to 1200T/s, showing that they are very competitive.

This bump in throughput is really significant: many inference providers that use GPUs aren't even hitting 200T/s, and Groq improved their inference by that amount within 1 day of being challenged.

I had the awesome pleasure to have Rodrigo the CEO on the show this week to chat about SambaNova and this incredible achievement, their ability to run this in full precision, and future plans, so definitely give it a listen.

This weeks Buzz (What I learned with WandB this week)

This week was buzzing at Weights & Biases! After co-hosting a hackathon with Meta a few weeks ago, we co-hosted another hackathon, this time with Mistral, in Paris (where we also announced our new integration with their finetuning service!).

The organizers, Cerebral Valley, invited us to participate, and it was amazing to see the many projects that use WandB and Weave in their finetuning presentations, including friend of the pod Maziyar Panahi, whose team nabbed 2nd place (you can read about their project here) 👏

Also, I'm going to do a virtual workshop together with my colleague Anish, about prompting and building evals, something we know a thing or two about, it's free and I would very much love to invite you to register and learn with us!

Cartesia AI (try it)

Hot off the press, we're getting a new audio TTS model, based on the state space model architecture (remember Mamba?), from a new startup called Cartesia AI, who aim to bring real-time intelligence to on-device compute!

The most astonishing thing they released was actually the speed with which the model starts to generate voices, under 150ms, which is effectively instant, and it's a joy to play with their playground; just look at how fast it started generating this intro I recorded using their awesome 1920s radio host voice.

Co-founded by Albert Gu, Karan Goel and Arjun Desai (who joined the pod this week) they have shown incredible performance but also showed that transformer alternative architectures like SSMs can really be beneficial for audio specifically, just look at this quote!

On speech, a parameter-matched and optimized Sonic model trained on the same data as a widely used Transformer improves audio quality significantly (20% lower perplexity, 2x lower word error, 1 point higher NISQA quality).

With lower latency (1.5x lower time-to-first-audio), faster inference speed (2x lower real-time factor) and higher throughput (4x)

In Open Source news:

Mistral released Codestral 22B - their flagship code model with a new non commercial license

Codestral is now available under the new Mistral license for non-commercial R&D use. With a larger context window of 32K, Codestral outperforms all other models in RepoBench, a long-range evaluation for code generation. Its fill-in-the-middle capability is favorably compared to DeepSeek Coder 33B.

Codestral is supported in VSCode via a plugin and is accessible through their API, Le Platforme, and Le Chat.

HuggingFace Chat with tool support (X, demo)

This one is really cool, HF added Cohere's Command R+ with tool support and the tools are using other HF spaces (with ZeroGPU) to add capabilities like image gen, image editing, web search and more!

LLM 360 - K2 65B - fully transparent and reproducible (X, Paper, HF, WandB)

The awesome team at LLM 360 released K2 65B, an open source model that comes very close to LLama-70B on benchmarks, but most importantly, they open source everything, from code to datasets to technical write-ups; they even open sourced their WandB plots 👏

This is so important to the open source community that we must highlight and acknowledge the awesome effort from LLM360 of open sourcing as much as they have!

Tools - Jina reader

In the tools category, while we haven't discussed this on the pod, I really wanted to highlight Jina Reader. We've had Bo from Jina AI talk to us about embeddings in past episodes, and since then the Jina folks released this awesome tool that takes any URL and parses it into a nice markdown format that's very digestible to LLMs.

You can pass any url, and it even does vision understanding! And today they released PDF understanding as well so you can pass the reader PDF files and have it return a nicely formatted text!

The best part, it's free! (for now at least!)
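If you want to try it from code, the whole trick (as I understand it, so treat the exact endpoint as an assumption and check Jina's docs) is to prefix the page you want with their reader URL:

```python
# Minimal sketch of using Jina Reader to get an LLM-friendly markdown version
# of a web page. Assumes the r.jina.ai prefix endpoint from Jina's announcement;
# no API key needed at the time of writing.
import urllib.request

page_url = "https://example.com/some-article"   # hypothetical page to parse
reader_url = f"https://r.jina.ai/{page_url}"    # prefix the URL with the reader endpoint

with urllib.request.urlopen(reader_url) as resp:
    markdown = resp.read().decode("utf-8")

print(markdown[:500])  # clean markdown, ready to stuff into an LLM prompt
```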

And that’s a wrap for today, see you guys next week, and if you found any of this interesting, please share with a friend 🙏



This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe

📅 ThursdAI - May 23 - OpenAI troubles, Microsoft Build, Phi-3 small/large, new Mistral & more AI news

jeudi 23 mai 2024Duration 01:43:00

Hello hello everyone, this is Alex, typing these words from beautiful Seattle (really, it only rained once while I was here!) where I'm attending Microsoft biggest developer conference BUILD.

This week we saw OpenAI get in the news from multiple angles, none of them positive, and Microsoft clapped back at Google from last week with tons of new AI product announcements (CoPilot vs Gemini) and a few new PCs with NPUs (Neural Processing Units) that run alongside the CPU/GPU combo we're familiar with. Those NPUs allow local AI to run on these devices, making them AI native devices!

While I'm here I also had the pleasure to participate in the original AI Tinkerers meetup thanks to my friend Joe Heitzberg, who operates and runs aitinkerers.org (of which we are a local branch in Denver), and it was amazing to see tons of folks who listen to ThursdAI + read the newsletter and talk about Weave and evaluations with all of them! (Btw, on the left is Vik from Moondream, which we covered multiple times.)

Ok let's get to the news:

TL;DR of all topics covered:

* Open Source LLMs

* HuggingFace commits 10M in ZeroGPU (X)

* Microsoft open sources Phi-3 mini, Phi-3 small (7B) Medium (14B) and vision models w/ 128K context (Blog, Demo)

* Mistral 7B 0.3 - Base + Instruct (HF)

* LMSys created a "hard prompts" category (X)

* Cohere for AI releases Aya 23 - 3 models, 101 languages, (X)

* Big CO LLMs + APIs

* Microsoft Build recap - New AI native PCs, Recall functionality, Copilot everywhere

* Will post a dedicated episode to this on Sunday

* OpenAI pauses GPT-4o Sky voice because Scarlet Johansson complained

* Microsoft AI PCs - Copilot+ PCs (Blog)

* Anthropic - Scaling Monosemanticity paper - about mapping the features of an LLM (X, Paper)

* Vision & Video

* OpenBMB - MiniCPM-Llama3-V 2.5 (X, HuggingFace)

* Voice & Audio

* OpenAI pauses Sky voice due to ScarJo hiring legal counsel

* Tools & Hardware

* Humane is looking to sell (blog)

Open Source LLMs

Microsoft open sources Phi-3 mini, Phi-3 small (7B) Medium (14B) and vision models w/ 128K context (Blog, Demo)

Just in time for Build, Microsoft has open sourced the rest of the Phi family of models, specifically the small (7B) and the Medium (14B) models on top of the mini one we just knew as Phi-3.

All the models have a small context version (4K and 8K) and a large that goes up to 128K (tho they recommend using the small if you don't need that whole context) and all can run on device super quick.

Those models have an MIT license, so use them as you will, and they give incredible performance relative to their size on benchmarks. Phi-3 mini received an interesting split in the vibes: it was really good for reasoning tasks but not very creative in its writing, so some folks dismissed it, but it's hard to dismiss these new releases, especially when the benchmarks are that great!
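If you want to kick the tires locally, a minimal sketch with Hugging Face transformers looks something like the following; the exact repo id is my assumption based on Microsoft's naming, so double check the model card on the Hub:

```python
# Minimal sketch: running a Phi-3 checkpoint locally with transformers.
# The model id below is an assumption based on Microsoft's naming scheme;
# check the Hub for the exact repo you want (mini/small/medium, 4K/8K vs 128K context).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed/hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,  # the release shipped custom modeling code, so this may be needed
)

messages = [{"role": "user", "content": "Explain why the sky is blue in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```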

LMsys just updated their arena to include a hard prompts category (X), which selects for complex, specific and knowledge-based prompts and scores the models on those. Phi-3 mini actually gets a big boost in ELO ranking when filtered on hard prompts and beats GPT-3.5 😮 Can't wait to see how the small and medium versions perform on the arena.

Mistral gives us function calling in Mistral 0.3 update (HF)

Just in time for the Mistral hackathon in Paris, Mistral has released an update to the 7B model (and likely will update the MoE 8x7B and 8x22B Mixtrals) with function calling and a new vocab.

This is awesome all around because function calling is important for agentic capabilities, and it's about time all companies have it; apparently the way Mistral has it built in matches the Cohere Command R way and is already supported in Ollama, using raw mode.

Big CO LLMs + APIs

OpenAI is not having a good week - Sky voice is paused, employees complain

OpenAI is in hot water this week, starting with pausing the Sky voice (arguably the best, most natural sounding voice out of the ones that launched) due to complaints from Scarlett Johansson about this voice being similar to hers. Scarlett's appearance in the movie Her, and Sam Altman tweeting "her" to celebrate the release of the incredible GPT-4o voice mode, were all talked about when ScarJo released a statement saying she was shocked when her friends and family told her that OpenAI's new voice mode sounds just like her.

Spoiler, it doesn't really, and they hired an actress and have had this voice out since September last year, as they outlined in their blog following ScarJo's complaint.

Now, whether or not there's legal precedent here, given that Sam Altman reached out to Scarlett twice, including once a few days before the event, I won't speculate, but for me, personally, not only does Sky not sound like ScarJo, it was my favorite voice even before they demoed it, and I'm really sad that it's paused, and I think it's unfair to the actress who was hired for her voice. See her own statement:

Microsoft Build - CoPilot all the things

I have recorded a BUILD recap with Ryan Carson from Intel AI and will be posting that as its own episode on Sunday, so look forward to that, but for now, here are the highlights from BUILD:

* Copilot everywhere, Microsoft builds the CoPilot as a platform

* AI native laptops with NPU chips for local AI

* Recall, an on-device AI that lets you search through everything you saw or typed with natural language

* Github Copilot Workspace + Extensions

* Microsoft stepping into education by sponsoring Khan Academy free for all teachers in the US

* Copilot Team member and Agent - Copilot will do things proactively as your team member

* GPT-4o voice mode is coming to windows and to websites!

Hey, if you like reading this, can you share with 1 friend? It’ll be an awesome way to support this pod/newsletter!

Anthropic releases the Scaling Monosemanticity paper

This is quite a big thing that happened this week for Mechanistic Interpretability and Alignment, with Anthropic releasing a new paper and examples of their understanding of what LLM "thinks".

They have done incredible work in this area, and now they have scaled it up all the way to production models like Claude 3 Sonnet, which shows that this work can actually identify which "features" are causing which tokens to be output.

In the work they highlighted features such as "deception", "bad code" and even a funny one called "Golden Gate bridge" and showed that clamping these features can affect the model outcomes.

Once these features have been identified, they can be turned on or off with various levels of strength; for example, they turned the Golden Gate Bridge feature up to the maximum, and the model thought it was the Golden Gate Bridge.

While a funny example, they also found features for racism, bad / wrong code, inner conflict, gender bias, sycophancy and more, you can play around with some examples here and definitely read the full blog if this interests you, but overall it shows incredible promise in alignment and steer-ability of models going forward on large scale

This weeks Buzz (What I learned with WandB this week)

I was demoing Weave all week long in Seattle, first at the AI Tinkerers event, and then at MSFT BUILD.

They had me record a video of my talk ahead of time, and then give a 5 minute demo on stage (which was not stressful at all!), so here's the pre-recorded video, which turned out really well!

Also, we're sponsoring the Mistral Hackathon this weekend in Paris, so if you're in EU and want to hack with us, please go, it's hosted by Cerebral Valley and HuggingFace and us →

Vision

Phi-3 mini Vision

In addition to Phi-3 small and Phi-3 Medium, Microsoft released Phi-3 mini with vision, which does an incredible job understanding text and images! (You can demo it right here)

Interestingly, the Phi-3 mini with vision has 128K context window which is amazing and even beats Mistral 7B as a language model! Give it a try

OpenBMB - MiniCPM-Llama3-V 2.5 (X, HuggingFace, Demo)

Two state of the art vision models in one week? Well that's incredible. A company I hadn't heard of, OpenBMB, has released MiniCPM 7B, trained on top of LLama3, and they claim that it outperforms Phi-3 vision.

They claim that it has GPT-4 vision level performance, achieving a 700+ score on OCRBench and surpassing proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max and Gemini Pro

In my tests, Phi-3 performed a bit better, I showed both the same picture, and Phi was more factual on the hard prompts:

Phi-3 Vision:

And that's it for this week's newsletter, look out for the Sunday special full MSFT Build recap and definitely give the whole talk a listen, it's full of my co-hosts and their great analysis of this weeks events!



This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe

📅 ThursdAI - May 16 - OpenAI GPT-4o, Google IO recap, LLama3 hackathon, Yi 1.5, Nous Hermes Merge & more AI news

vendredi 17 mai 2024Duration 01:54:23

Wow, holy s**t, insane, overwhelming, incredible, the future is here!, "still not there", there are many more words to describe this past week. (TL;DR at the end of the blogpost)

I had a feeling it's going to be a big week, and the companies did NOT disappoint, so this is going to be a very big newsletter as well.

As you may have read last week, I was very lucky to be in San Francisco the weekend before Google IO, to co-host a hackathon with Meta LLama-3 team, and it was a blast, I will add my notes on that in This weeks Buzz section.

Then on Monday, we all got to watch the crazy announcements from OpenAI, namely a new flagship model called GPT-4o (we were right, it previously was im-also-a-good-gpt2-chatbot) that's twice as fast, 50% cheaper (in English, significantly more so in other languages, more on that later) and is Omni (that's the o), which means it is trained end to end with voice, vision and text on the inputs, and can generate text, voice and images on the output.

A true MMIO (multimodal on inputs and outputs, that's not the official term) is here and it has some very very surprising capabilities that blew us all away. Namely the ability to ask the model to "talk faster" or "more sarcasm in your voice" or "sing like a pirate", though, we didn't yet get that functionality with the GPT-4o model, it is absolutely and incredibly exciting. Oh and it's available to everyone for free!

That's GPT-4 level intelligence, for free for everyone, without having to log in!

What's also exciting was how immediate it was. Apparently not only is the model itself faster (unclear if it's due to newer GPUs or distillation or some other crazy advancements or all of the above), but training an end to end omnimodel reduces the latency enough to make it an incredibly immediate conversation partner, one that you can interrupt, ask to recover from a mistake, and it can hold a conversation very very well.

So well, that it indeed seemed like the Waifu future (digital girlfriends/wives) is very close for some folks who would want it. While we didn't get to try it (we got GPT-4o but not the new voice mode, as Sam confirmed), OpenAI released a bunch of videos of their employees chatting with Omni (that's my nickname, use it if you'd like), and many online highlighted how thirsty / flirty it sounded. I downloaded all the videos for an X thread and named one girlfriend.mp4, and well, just judge for yourself why:

Ok, that's not all that OpenAI updated or shipped, they also updated the Tokenizer which is incredible news to folks all around, specifically, the rest of the world. The new tokenizer reduces the previous "foreign language tax" by a LOT, making the model way way cheaper for the rest of the world as well

One last announcement from OpenAI was the desktop app experience, and this one, I actually got to use a bit, and it's incredible. MacOS only for now, this app comes with a launcher shortcut (kind of like Raycast) that lets you talk to ChatGPT right then and there, without opening a new tab, without additional interruptions, and it can even understand what you see on the screen, help you understand code, or jokes, or look up information. Here's just one example I just had over at X. And sure, you could always do this with another tab, but the ability to do it without a context switch is a huge win.

OpenAI did their demo 1 day before Google IO, but even during the excitement about Google IO, they announced that Ilya is not only alive, but is also departing from OpenAI, which was followed by an announcement from Jan Leike (who co-headed the superalignment team together with Ilya) that he left as well. This to me seemed like well executed timing to dampen the Google news a bit.

Google is BACK, backer than ever, Alex's Google IO recap

On Tuesday morning I showed up to the Shoreline Amphitheatre in Mountain View, together with a creators/influencers delegation, as we all watched the incredible firehose of announcements that Google had prepared for us.

TL;DR - Google is adding Gemini and AI into all its products across Workspace (Gmail, Chat, Docs) and into other services like Photos, where you'll now be able to ask your photo library for specific moments. They introduced over 50 product updates and I don't think it makes sense to cover all of them here, so I'll focus on what we do best.

"Google with do the Googling for you"

Gemini 1.5 pro is now their flagship model (remember Ultra? where is that? 🤔) and has been extended to 2M tokens in the context window! Additionally, we got a new model called Gemini Flash, which is way faster and very cheap (up to 128K, then it becomes 2x more expensive)

Gemini Flash is multimodal as well and has 1M context window, making it an incredible deal if you have any types of videos to process for example.

Kind of hidden but important was a caching announcement, which IMO is a big deal, big enough that it could pose a serious risk to RAG based companies. Google claims they have a way to cache the LLM activation layers for most of your context, so a developer won't have to pay for repeatedly sending the same thing over and over again (which happens in most chat applications), and it will significantly speed up work with larger context windows.

They also mentioned Gemini Nano, an on-device Gemini that's also multimodal, that can monitor calls in real time, for example for older folks, and alert them about being scammed; one of the cooler announcements was that Nano is going to be baked into the Chrome browser.

With the Gemmas being upgraded, there's not a product at Google that Gemini is not going to get infused into, and while they counted 131 "AI" mentions during the keynote, I'm pretty sure Gemini was mentioned way more!

Project Astra - A universal AI agent helpful in everyday life

After a few of the announcements from Sundar, (newly knighted) Sir Demis Hassabis came out and talked about DeepMind research, AlphaFold 3 and then turned to project Astra.

This demo was really cool and kind of similar to the GPT-4o conversation, but also different. I'll let you just watch it yourself:

TK: project astra demo

And this is no fake, they actually had booths with Project Astra test stations and I got to chat with it (I came back 3 times) and had a personal demo from Josh Woodward (VP of Labs) and it works, and works fast! It sometimes disconnects and sometimes there are misunderstandings, like when multiple folks are speaking, but overall it's very very impressive.

Remember the infamous video with the rubber ducky that was edited by Google and caused a major uproar when we found out? It's basically that, on steroids, real, and quite quite fast.

Astra has a decent short term memory, so if you ask it where something was, it will remember, and Google cleverly used that trick to also show that they are working on augmented reality glasses with Astra built in, which would make amazing sense.

Open Source LLMs

Google open sourced PaliGemma VLM

Giving us something in the open source department, adding to previous models like RecurrentGemma, Google has uploaded a whopping 116 different checkpoints of a new VLM called PaliGemma to the hub, which is a State of the Art vision model at 3B.

It's optimized for finetuning for different workloads such as Visual Q&A, Image and short video captioning and even segmentation!

They also mentioned that Gemma 2 is coming next month, will be a 27B parameter model that's optimized to run on a single TPU/GPU.

Nous Research Hermes 2 Θ (Theta) - their first Merge!

Collaborating with Charles Goddard from Arcee (the creators of MergeKit), Teknium and friends merged the recently trained Hermes 2 Pro with Llama 3 Instruct to get a model that performs well on all the tasks that LLama-3 is good at, while maintaining the capabilities of Hermes (function calling, JSON mode)

Yi releases 1.5 with apache 2 license

The folks at 01.ai release Yi 1.5, with 6B, 9B and 34B (base and chat finetunes)

Showing decent benchmarks on Math and Chinese, 34B beats LLama on some of these tasks while being 2x smaller, which is very impressive

This weeks Buzz - LLama3 hackathon with Meta

Before all the craziness that was announced this week, I participated in and judged the first ever Llama-3 hackathon. It was quite incredible, with over 350 hackers participating and Groq, Lambda, Meta, Ollama and others sponsoring and giving talks and workshops; it was an incredible 24 hours at Shack15 in SF (where Cerebral Valley hosts their hackathons)

Winning hacks were really innovative, ranging from completely open source smart glasses for under $20, to an LLM debate platform with an LLM judge on any moral issue, and one project that was able to jailbreak Llama by doing some advanced LLM arithmetic. Kudos to the teams for winning, and it was amazing to see how many of them adopted Weave as their observability framework, as it was really easy to integrate.

Oh and I got to co-judge with the 🐐 of HuggingFace

This is all the notes for this week, even though there was a LOT lot more, check out the TL;DR and see you here next week, which I'll be recording from Seattle, where I'll be participating in the Microsoft BUILD event, so we'll see Microsoft's answer to Google IO as well. If you're coming to BUILD, come by our booth and give me a high five!

TL;DR of all topics covered:

* OpenAI Announcements

* GPT-4o

* Voice mode

* Desktop App

* Google IO recap:

* Google Gemini

* Gemini 1.5 Pro: Available globally to developers with a 2-million-token context window, enabling it to handle larger and more complex tasks.

* Gemini 1.5 Flash: A faster and less expensive version of Gemini, optimized for tasks requiring low latency.

* Gemini Nano with Multimodality: An on-device model that processes various inputs like text, photos, audio, web content, and social videos.

* Project Astra: An AI agent capable of understanding and responding to live video and audio in real-time.

* Google Search

* AI Overviews in Search Results: Provides quick summaries and relevant information for complex search queries.

* Video Search with AI: Allows users to search by recording a video, with Google's AI processing it to pull up relevant answers.

* Google Workspace

* Gemini-powered features in Gmail, Docs, Sheets, and Meet: Including summarizing conversations, providing meeting highlights, and processing data requests.

* "Chip": An AI teammate in Google Chat that assists with various tasks by accessing information across Google services.

* Google Photos

* "Ask Photos": Allows users to search for specific items in photos using natural language queries, powered by Gemini.

* Video Generation

* Veo Generative Video: Creates 1080p videos from text prompts, offering cinematic effects and editing capabilities.

* Other Notable AI Announcements

* NotebookLM: An AI tool to organize and interact with various types of information (documents, PDFs, notes, etc.), allowing users to ask questions about the combined information.

* Video Overviews (Prototyping): A feature within NotebookLM that generates audio summaries from uploaded documents.

* Code VR: A generative video AI model capable of creating high-quality videos from various prompts.

* AI Agents: A demonstration showcasing how AI agents could automate tasks across different software and systems.

* Generative Music: Advancements in AI music generation were implied but not detailed.

* Open Source LLMs

* Google PaliGemma 3B - sota open base VLM (Blog)

* Gemma 2 - 27B coming next month

* Hermes 2 Θ (Theta) - Merge of Hermes Pro & Llama-instruct (X, HF)

* Yi 1.5 - Apache 2 licensed 6B, 9B and 34B (X)

* Tiger Lab - MMLU-pro - a harder MMLU with 12K questions (X, HuggingFace)

* This weeks Buzz (What I learned with WandB this week)

* Llama3 hackathon with Meta, Cerebral Valley, HuggingFace and Weights & Biases

* Vision & Video

* Google announces VEO - High quality cinematic generative video generation (X)

* AI Art & Diffusion & 3D

* Google announces Imagen3 - their latest Gen AI art model (Blog)

* Tools

* Cursor trained a model that does 1000tokens/s and editing 😮 (X)



This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe

📅 ThursdAI - May 9 - AlphaFold 3, im-a-good-gpt2-chatbot, Open Devin SOTA on SWE-Bench, DeepSeek V2 super cheap + interview with OpenUI creator & more AI news

vendredi 10 mai 2024Duration 01:47:51

Hey 👋 (show notes and links a bit below)

This week has been a great AI week, however, it does feel like a bit "quiet before the storm" with Google I/O on Tuesday next week (which I'll be covering from the ground in Shoreline!) and rumors that OpenAI is not just going to let Google have all the spotlight!

Early this week, we got 2 new models on LMsys, im-a-good-gpt2-chatbot and im-also-a-good-gpt2-chatbot, and we've now confirmed that they are from OpenAI, and folks have been testing them with logic puzzles, role play and have been saying great things, so maybe that's what we'll get from OpenAI soon?

Also on the show today, we had a BUNCH of guests, and as you know, I love chatting with the folks who make the news, so we've been honored to host Xingyao Wang and Graham Neubig core maintainers of Open Devin (which just broke SOTA on Swe-Bench this week!) and then we had friends of the pod Tanishq Abraham and Parmita Mishra dive deep into AlphaFold 3 from Google (both are medical / bio experts).

Also this week, OpenUI from Chris Van Pelt (Co-founder & CIO at Weights & Biases) has been blowing up, taking #1 Github trending spot, and I had the pleasure to invite Chris and chat about it on the show!

Let's delve into this (yes, this is I, Alex the human, using Delve as a joke, don't get triggered 😉)

TL;DR of all topics covered (trying something new, my Raw notes with all the links and bulletpoints are at the end of the newsletter)

* Open Source LLMs

* OpenDevin getting SOTA on Swe-Bench with 21% (X, Blog)

* DeepSeek V2 - 236B (21B Active) MoE (X, Try It)

* Weights & Biases OpenUI blows over 11K stars (X, Github, Try It)

* LLama-3 120B Chonker Merge from Maxime Labonne (X, HF)

* Alignment Lab open sources Buzz - 31M rows training dataset (X, HF)

* xLSTM - new transformer alternative (X, Paper, Critique)

* Benchmarks & Eval updates

* LLama-3 still in 6th place (LMsys analysis)

* Reka Core gets awesome 7th place and Qwen-Max breaks top 10 (X)

* No upsets in LLM leaderboard

* Big CO LLMs + APIs

* Google DeepMind announces AlphaFold-3 (Paper, Announcement)

* OpenAI publishes their Model Spec (Spec)

* OpenAI tests 2 models on LMsys (im-also-a-good-gpt2-chatbot & im-a-good-gpt2-chatbot)

* OpenAI joins Coalition for Content Provenance and Authenticity (Blog)

* Voice & Audio

* Udio adds in-painting - change parts of songs (X)

* 11Labs joins the AI Audio race (X)

* AI Art & Diffusion & 3D

* ByteDance PuLID - new high quality ID customization (Demo, Github, Paper)

* Tools & Hardware

* Went to the Museum with Rabbit R1 (My Thread)

* Co-Hosts and Guests

* Graham Neubig (@gneubig) & Xingyao Wang (@xingyaow_) from Open Devin

* Chris Van Pelt (@vanpelt) from Weights & Biases

* Nisten Tahiraj (@nisten) - Cohost

* Tanishq Abraham (@iScienceLuvr)

* Parmita Mishra (@prmshra)

* Wolfram Ravenwolf (@WolframRvnwlf)

* Ryan Carson (@ryancarson)

Open Source LLMs

Open Devin getting a whopping 21% on SWE-Bench (X, Blog)

Open Devin started as a tweet from our friend Junyang Lin (on the Qwen team at Alibaba) to get an open source alternative to the very popular Devin code agent from Cognition Lab (recently valued at $2B 🤯) and 8 weeks later, with tons of open source contributions, >100 contributors, they have almost 25K stars on Github, and now claim a State of the Art score on the very hard Swe-Bench Lite benchmark beating Devin and Swe-Agent (with 18%)

They have done so by using the CodeAct framework developed by Xingyao, and it's honestly incredible to see how an open source project can catch up to and beat a very well funded AI lab within 8 weeks! Kudos to the OpenDevin folks for the organization, and amazing results!

DeepSeek v2 - huge MoE with 236B (21B active) parameters (X, Try It)

The folks at DeepSeek are releasing this huge MoE (the biggest we've seen in terms of experts) with 160 experts, and 6 experts activated per forward pass. It's a similar trend to the Snowflake release, just extended even further. They also introduce a lot of technical details and optimizations to the KV cache.

With benchmark results getting close to GPT-4, DeepSeek wants to take the crown for the cheapest smartest model you can run, not only in open source btw; they are now offering this model at an incredible $0.28/1M tokens, that's 28 cents per 1M tokens!

The closest models in price were Haiku at $0.25 and GPT-3.5 at $0.50. This is quite an incredible deal for a model with 32K (128K in open source) context and these metrics.

Also notable is the training cost; they claim that it took them 1/5 the price of what Llama-3 cost Meta, which is also incredible. Unfortunately, running this model locally is a no-go for most of us 🙂

I would mention here that metrics are not everything, as this model fails quite humorously on my basic logic tests

LLama-3 120B chonker Merge from Maxime LaBonne (X, HF)

We've covered merges before, and we've had the awesome Maxime Labonne talk to us at length about model merging on ThursdAI, but I've been waiting for Llama-3 merges, and Maxime did NOT disappoint!

A whopping 120B llama (Maxime added 50 layers to the 70B Llama3) is doing the rounds, and folks are claiming that Maxime achieved AGI 😂 It's really funny, this model, is... something else.

Here's just one example that Maxime shared, as it goes into an existential crisis about a very simple logic question. A question that Llama-3 answers ok with some help, but this... I've never seen this. Don't forget that merging involves no additional training, it's mixing layers from the same model, so... we still have no idea what merging does to a model, but... some brain damage is definitely occurring.

Oh and also it comes up with words!

ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

Big CO LLMs + APIs

Open AI publishes Model Spec (X, Spec, Blog)

OpenAI publishes and invites engagement and feedback on their internal set of rules for how their models should behave. Anthropic has something similar with Constitutional AI.

I specifically liked the new chain of command (Platform > Developer > User > Tool) rebranding they added to the models, making OpenAI the Platform, changing "system" prompts to "developer" and having user be the user. Very welcome renaming and clarifications (h/t Swyx for his analysis)

Here is a summarized version of OpenAI's new rules of robotics (thanks to Ethan Mollick)

* follow the chain of command: Platform > Developer > User > Tool

* Comply with applicable laws

* Don't provide info hazards

* Protect people's privacy

* Don't respond with NSFW contents

A very welcome effort from OpenAI; showing this spec in the open and inviting feedback is greatly appreciated!

This comes on top of a pretty big week for OpenAI, announcing an integration with Stack Overflow, Joining the Coalition for Content Provenance and Authenticity + embedding watermarks in SORA and DALL-e images, telling us they have built a classifier that detects AI images with 96% certainty!

im-a-good-gpt2-chatbot and im-also-a-good-gpt2-chatbot

Following last week's gpt2-chat mystery, Sam Altman trolled us with this tweet

And then we got 2 new models on LMSys, im-a-good-gpt2-chatbot and im-also-a-good-gpt2-chatbot, and the timeline exploded with folks trying all their best logic puzzles on these two models trying to understand what they are, are they GPT5? GPT4.5? Maybe a smaller version of GPT2 that's pretrained on tons of new tokens?

I think we may see the answer soon, but it's clear that both these models are really good, doing well on logic (better than Llama-70B, and sometimes Claude Opus as well)

And the speculation is pretty much over, we know OpenAI is behind them after seeing this oopsie on the Arena 😂

you can try these models as well, they seem to be very favored in the random selection of models, but they show up only in battle mode so you have to try a few times https://chat.lmsys.org/

Google DeepMind announces AlphaFold3 (Paper, Announcement)

Developed by DeepMind and Isomorphic Labs, AlphaFold previously predicted the structure of nearly every protein known to science, and now AlphaFold 3 was announced, which can predict the structure of other biological complexes as well, paving the way for new drugs and treatments.

What's new here, is that they are using diffusion, yes, like Stable Diffusion, starting with noise and then denoising to get a structure, and this method is 50% more accurate than existing methods.

If you'd like more info about this very important paper, look no further than the awesome Two Minute Papers YouTube channel, which did a thorough analysis here, and listen to the Isomorphic Labs podcast with Weights & Biases CEO Lukas on Gradient Dissent

They also released AlphaFold server, a free research tool allowing scientists to access these capabilities and predict structures for non commercial use, however it seems that it's somewhat limited (from a conversation we had with a researcher on stage)

This weeks Buzz (What I learned with WandB this week)

This week was amazing for open source and Weights & Biases; it's not every week that a side project from a CIO blows up on... well, everywhere. #1 trending on Github for TypeScript and 6th overall, OpenUI (Github) has passed 12K stars as people are super excited about being able to build UIs with LLMs, but in the open source.

I had the awesome pleasure to host Chris on the show as he talked about the inspiration and future plans, and he gave everyone his email to send him feedback (a decision which I hope he doesn't regret 😂) so definitely check out the last part of the show for that.

Meanwhile here's my quick tutorial and reaction about OpenUI, but just give it a try here and build something cool!

Vision

I was told some news, but out of respect for the team I decided not to include it in the newsletter ahead of time; expect open source to come close to GPT4-V next week 👀

Voice & Audio

11 Labs joins the AI music race (X)

Breaking news from 11Labs that happened during the show (but we didn't notice): they are stepping into the AI music scene, and it sounds pretty good!

Udio adds Audio Inpainting (X, Udio)

This is really exciting, Udio decided to prove their investment and ship something novel!

Inpainting has been around in diffusion models, and now selecting a piece of a song on Udio and having Udio rework it is so seamless it will definitely come to every other AI music product, given how powerful this is!

Udio also announced their pricing tiers this week, and it seems that this is the first feature that requires subscription

AI Art & Diffusion

ByteDance PuLID for no train ID Customization (Demo, Github, Paper)

It used to take a LONG time to finetune something like Stable Diffusion to generate an image of your face using DreamBooth, then things like LoRA started making this much easier but still needed training.

The latest crop of approaches for AI art customization is called ID Customization, and ByteDance just released a novel, training-free version called PuLID which works very fast with very decent results! (Really, try it on your own face.) Previous works like InstantID and IPAdapter are also worth calling out, however PuLID seems to be the state of the art here! 🔥

And that's it for the week, well who am I kidding, there's so much more we covered and I just didn't have the space to go deep into everything, but definitely check out the podcast episode for the whole conversation. See you next week, it's going to be 🔥 because of IO and ... other things 👀



This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe

ThursdAI - May 2nd - New GPT2? Copilot Workspace, Evals and Vibes from Reka, LLama3 1M context (+ Nous finetune) & more AI news

vendredi 3 mai 2024Duration 01:49:03

Hey 👋 Look it May or May not be the first AI newsletter you get in May, but it's for sure going to be a very information dense one. As we had an amazing conversation on the live recording today, over 1K folks joined to listen to the first May updates from ThursdAI.

As you May know by now, I just love giving the stage to folks who are the creators of the actual news I get to cover from week to week, and this week, we had again, 2 of those conversations.

First we chatted with Piotr Padlewski from Reka, the author on the new Vibe-Eval paper & Dataset which they published this week. We've had Yi and Max from Reka on the show before, but it was Piotr's first time and he was super super knowledgeable, and was really fun to chat with.

Specifically, as we at Weights & Biases launch a new product called Weave (which you should check out at https://wandb.me/weave), I'm getting a LOT more interested in evaluations and LLM scoring, and in fact, we started the whole show today with a full segment on evals and vibe checks, and covered a new paper from Scale about overfitting.

The second deep dive was with my friend Idan Gazit, from GithubNext, about the new iteration of Github Copilot, called Copilot Workspace. It was a great one, and you should definitely give that one a listen as well

TL;DR of all topics covered + show notes

* Scores and Evals

* No notable changes, LLama-3 is still #6 on LMsys

* gpt2-chat came and went (in depth chan writeup)

* Scale checked for Data Contamination on GSM8K using GSM-1K (Announcement, Paper)

* Vibes-Eval from Reka - a set of multimodal evals (Announcement, Paper, HF dataset)

* Open Source LLMs

* Gradient releases 1M context window LLama-3 finetune (X)

* MaziyarPanahi/Llama-3-70B-Instruct-DPO-v0.4 (X, HF)

* Nous Research - Hermes Pro 2 - LLama 3 8B (X, HF)

* AI Town is running on Macs thanks to Pinokio (X)

* LMStudio releases their CLI - LMS (X, Github)

* Big CO LLMs + APIs

* Github releases Copilot Workspace (Announcement)

* AI21 - releases Jamba Instruct w/ 256K context (Announcement)

* Google shows Med-Gemini with some great results (Announcement)

* Claude releases IOS app and Team accounts (X)

* This weeks Buzz

* We're heading to SF to sponsor the biggest LLama-3 hackathon ever with Cerebral Valley (X)

* Check out my video for Weave our new product, it's just 3 minutes (Youtube)

* Vision & Video

* Intern LM open sourced a bunch of LLama-3 and Phi based VLMs (HUB)

* And they are MLXd by the "The Bloke" of MLX, Prince Canuma (X)

* AI Art & Diffusion & 3D

* ByteDance releases Hyper-SD - Stable Diffusion in a single inference step (Demo)

* Tools & Hardware

* Still haven't open the AI Pin, and Rabbit R1 just arrived, will open later today

* Co-Hosts and Guests

* Piotr Padlewski (@PiotrPadlewski) from Reka AI

* Idan Gazit (@idangazit) from Github Next

* Wing Lian (@winglian)

* Nisten Tahiraj (@nisten)

* Yam Peleg (@yampeleg)

* LDJ (@ldjconfirmed)

* Wolfram Ravenwolf (@WolframRvnwlf)

* Ryan Carson (@ryancarson)

Scores and Evaluations

New corner in today's pod and newsletter given the focus this week on new models and comparing them to existing models.

What is GPT2-chat and who put it on LMSys? (and how do we even know it's good?)

For a very brief period this week, a new mysterious model appeared on LMSys, and was called gpt2-chat. It only appeared on the Arena, and did not show up on the leaderboard, and yet, tons of sleuths from 4chan to reddit to X started trying to figure out what this model was and wasn't.

Folks started analyzing the tokenizer, the output schema, tried to get the system prompt and gauge the context length. Many folks were hoping that this is an early example of GPT4.5 or something else entirely.

It did NOT help that uncle SAMA posted a tweet and then edited it to remove the "-", and it was unclear if he was trolling again or foreshadowing a completely new release, or an old GPT-2 retrained on newer data, or something else.

The model was really surprisingly good, solving logic puzzles better than Claude Opus, and having quite amazing step by step thinking, and able to provide remarkably informative, rational, and relevant replies. The average output quality across many different domains places it on, at least, the same level as high-end models such as GPT-4 and Claude Opus.

Whatever this model was, the hype around it made LMSYS add a clarification to their terms and temporarily take the model down for now. And we're waiting to hear more news about what it is.

Reka AI gives us Vibe-Eval a new multimodal evaluation dataset and score (Announcement, Paper, HF dataset)

Reka keeps surprising, with only 20 people in the company, their latest Reka Core model is very good in multi modality, and to prove it, they just released a new paper + a new method of evaluating multi modal prompts on VLMS (Vision enabled Language Models)

Their new open benchmark + open dataset consists of this format:

And I was very happy to hear from one of the authors on the paper @PiotrPadlewski on the pod, where he mentioned that they were trying to create a dataset that was going to be very hard for their own model (Reka Core) and just decided to keep evaluating other models on it.

They had 2 main objectives : (i) vibe checking multimodal chat models for day-to-day tasks and (ii) deeply challenging and probing the capabilities of present frontier models. To this end, the hard set contains > 50% questions that all frontier models answer incorrectly

Chatting with Piotr about it, he mentioned that not only did they do a dataset, they actually used Reka Core as a Judge to score the replies from all models on that dataset and found that using their model in this way roughly correlates to non-expert human judgement! Very very interesting stuff.

The "hard" set is ... well hard!

Piotr concluded that if folks want to do research, they will provide free API access to Reka for that, so hit them up over DMs if you want to take this eval for a spin on your new shiny VLM (or indeed verify the metrics they put up)

Scale tests for eval dataset contamination with GSM-1K (Announcement, Paper)

Scale.ai is one of the most prominent companies in AI you may never have heard of; they are valued at $13B and have pivoted from data processing for autonomous vehicles to being the darling of the government, with agreements from the DoD for data pipelines and evaluation for the US military.

They have released a new paper as well, creating (but not releasing) a new dataset that matches the GSM8K (Grade School Math) dataset and evaluation that many frontier companies love to showcase in their release benchmarks with some surprising results!

So the Scale folks created (but did not release) a dataset called GSM-1K, which tracks and is similar to the public GSM-8K dataset, and tested a bunch of existing models on their new one to see the correlation; where the difference was very stark, they assume that some models overfitted on (or even had their training data contaminated by) the publicly available GSM8K.

On one end, models like Mistral or Phi do up to 10% worse on GSM1k compared to GSM8k. On the other end, models like Gemini, Claude, or GPT show basically no signs of being overfit.

The author goes on to say that overfitting doesn't necessarily mean it's a bad model, and highlights Phi-3, which has a 10% difference between their new GSM-1K score and GSM-8K, but still answers 68% of their dataset correctly, while being a tiny 3.8B parameter model.

It seems that Scale has noticed how much interest there is in actually understanding how models perform, and is stepping into the evaluation game by building (but not releasing, so they don't leak) datasets. Jim Fan's tweet (and Scale CEO Alex Wang's QT) seem to agree that this is the right positioning for Scale (as they don't have models of their own and so can be neutral, like Moody's)

Open Source LLMs

LLama-3 gets 1M context window + Other LLama-3 news

In the second week of the LLama-3 corner, we are noticing a significant ramp in all things Llama-3, first with the context length. The same folks from last week, Gradient, have spent cycles and upscaled/stretched LLama-3 to a whopping 1 million tokens in the context window (Llama-3 8B Gradient Instruct 1048k), with a very decent Needle in a Haystack result.

The main problem? Transformers have quadratic attention scaling issues for longer context, so this isn't something that you'd be able to run on your mac (nay, on your cluster) any time soon, and it's almost only theoretical at this point.

The upside? We had Wing Lian (from Axolotl) on the show, and he talked about a new method called LoRD (which is now part of MergeKit) which is a way to extract Loras from models.

Think of it as LLM arithmetic: you take the base model (llama-3 in this case) and the finetune (Llama-3 8B Gradient Instruct 1048k) and simply run a command like so:

mergekit-extract-lora llama-3-8B-gradient-instruct-1048K llama-3-8B just-the-context-lora [--no-lazy-unpickle] --rank=desired_rank

And boom, in theory, you have a tiny extracted LoRA file that contains only the difference between these two models, the base and its finetune.

It's really exciting stuff to be able to do brain surgery on these models and extract only one specific essence!
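Once you have that extracted adapter, you could (in theory) apply it back onto the base model with PEFT; here's a rough sketch, reusing the hypothetical output path from the command above and an assumed base repo id:

```python
# Rough sketch: loading the base Llama-3 8B and attaching the LoRA that was
# extracted with mergekit-extract-lora above. The adapter path matches the
# hypothetical output name from that command; the base repo id is an assumption
# (the Meta repos are gated, so you may need to request access first).
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",   # assumed base repo id
    device_map="auto",
)

# Attach only the "difference" captured by the extracted LoRA
model = PeftModel.from_pretrained(base, "just-the-context-lora")

# Optionally bake the adapter back in to get a single standalone checkpoint
merged = model.merge_and_unload()
```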

First LLama-3 finetunes that beat the instruct version

The folks at Nous Research give us a new Hermes-Pro on top of Llama-3 8B (X, HF) that beats the Llama-3 Instruct on benchmarks, which is apparently very hard to do, given that Meta created a LOT of human labeled instructions (10M or so) and gave us a really really good instruct model.

Nous Hermes 2 pro is also giving Llama-3 additional superpowers like function calling and tool use, specifically mentioning that this is the model to use if you do any type of agentic stuff

This new version of Hermes maintains its excellent general task and conversation capabilities - but also excels at Function Calling, JSON Structured Outputs, and has improved on several other metrics as well, scoring a 90% on our function calling evaluation built in partnership with Fireworks.AI, and an 84% on our structured JSON Output evaluation.

Kudos Teknium1, Karan and @intrstllrninja on this release, can't wait to try it out 🫡

LMStudio gives us a CLI (Github)

And speaking of "trying it out", you guys know that my recommended way of running these local models is LMStudio, and no, Yagil didn't sponsor ThursdAI haha I just love how quickly this piece of software became the go to locally for me running these models.

Well, during ThursdAI I got a #breakingNews ping from their discord that LM Studio now has a CLI (command line interface), which allows you to load/unload models and run the webserver from the command line (kind of similar to Ollama)

And since LM Studio exposes an OpenAI compatible completions API once the models are loaded, you are now able to use these models with a simple change to your script like so:

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

Which is amazing and I'm very happy about this option, as this opens the door to tons of automations and evaluation possibilities (with something like Weave), in fact while writing this, I downloaded the model from HuggingFace, loaded a web-server and ran my first prompts, and it all took like 5 minutes, and is very easy to do!
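For completeness, here's what a full round trip against the local LM Studio server looks like with the OpenAI Python SDK; the model name below is a placeholder, since LM Studio serves whatever model you currently have loaded:

```python
# Minimal sketch of hitting LM Studio's local OpenAI-compatible server.
# The model name is a placeholder; use the identifier of whatever model
# you have loaded in LM Studio (many versions largely ignore this field).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="local-model",  # placeholder, not a real repo id
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me one fun fact about llamas."},
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)
```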

This weeks Buzz (What happens in Weights & Biases this week)

I have so much to share, but I don't want to overwhelm the newsletter, so here we go. First of all, I'm flying out to SF again in a few weeks to sponsor and judge the first ever LLama-3 hackathon, together with Meta, hosted by the fine folks at Cerebral Valley (sign up and come hack!)

Cerebral Valley is hosting their events at this beautiful place called Shack15, which I've mentioned before on the newsletter, and I'm excited to finally take part in one of their events!

The second part I can't wait to tell you about: a week after that, I'm going to the Microsoft BUILD conference in Seattle, and will be representing Weights & Biases at that huge event (which last year featured Andrej Karpathy giving his State of GPT talk)

Here's a video I recorded for that event, which I worked really hard on, and would love some feedback. Please also let me know if you notice anything that an AI did in this video 👀 There's... something

As always, if you're attending any of these events, and see me, please do come say hi and give me a high five. I love meeting ThursdAI community folks in the wild, it really makes up for the fact that I'm working remotely from Denver and really makes this whole thing worth it!

Big Companies & APIs

Github’s new Copilot Workspace in Technical Preview

I was very happy to have friend of the pod Idan Gazit, Senior Director of Research at GitHub Next, the place in Github that comes up with incredible stuff (including where Copilot was born) to talk to us about Copilot's next iteration after the chat experience, workspace!

Workspace is indeed that, a workspace for you and copilot to start working together, on github issues specifically, taking into context more than just 1 file, and breaking down the task into planning, iteration and human feedback.

It looks really slick, and per Idan, uses a LOT of tokens of gpt-4-turbo, and I've had a chance to get in there and play around.

They break down every task into a specification that Copilot comes up with, which you can iteratively work on until you get the required result, then into planning mode, where you see a whole plan, and then Copilot gets to work and starts iterating on your task.

Does this remind you of anything? AGENTS, you may yell in your head as you read these words. However, I recommend you listen to Idan in our chat on the pod, because his take on agents is: we don't want these tools to replace us, we want them to help us, and what is an agent anyway, this word is very overused. And I have to agree, given the insane valuations we've seen in agent startups like Cognition Labs with Devin.

I've taken Workspace for a spin, and asked it for a basic task to translate a repo documentation into Russian, a task I know LLMs are really good at, and it identified all the README files in the repo, and translated them beautifully, but then it didn't place those new translations into a separate folder like I asked, a case Idan admitted they didn't yet build for, and hey, this is why this is a Technical Preview, you just can't build an LLM based product behind the scenes and release it, you need feedback, and evaluations on your product from actual users!

You can see my whole session here, in this nice link they give to be able to share (and fork if you have access) a workspace

The integration into Github is quite amazing; there's now a text box everywhere on Github that lets you ask for changes to a repo in natural language, plus a Raycast extension that allows you to basically kickstart a whole repo using Copilot Workspace from anywhere

And here's the result inside a new workspace 👇

I will run this later and see if it actually worked, given that Idan also mentioned, that Copilot does NOT run the code it writes, but it does allow me to easily do so via GIthub Codespaces (a bit confusing of a naming between the two!) and spin up a machine super quick.

I strongly recommend to listen to Idan on the pod because he went into a lot of detail about additional features, where they are planning to take this in the future etc'

I can go on and on, but I need to play with all the amazing new tools and models we just got today (and also start editing the podcast it's almost 4PM and I have 2 hours to send it!) so with that, thank you for reading , and see you next time 🫡



This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe

📅 ThursdAI - April 25 - Phi-3 3.8B impresses, LLama-3 gets finetunes, longer context & ranks top 6 in the world, Snowflake's new massive MoE and other AI news this week

vendredi 26 avril 2024Duration 01:21:34

Hey hey folks, happy ThursdAI 🎉

Not a lot of house-keeping here, just a reminder that if you're listening or reading from Europe, our European fullyconnected.com conference is happening on May 15 in London, and you're more than welcome to join us there. I will have quite a few event updates in the upcoming show as well.

Besides this, this week has been a very exciting one for smaller models, as Microsoft teased and then released Phi-3 with an MIT license, a tiny model that can run on most Macs with just 3.8B parameters, and it's really punching above its weight. To a surprising and even eyebrow-raising degree! Let's get into it 👇

ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

TL;DR of all topics covered:

* Open Source LLMs

* Microsoft open sources Phi-3 (X, HF)

* LLama3 70B top 5 (now top 6) on LMsys (LMsys Arena)

* Snowflake open sources Arctic - A massive hybrid MoE (X, Try it, HF)

* Evolutionary Model merges support in MergeKit (Blog)

* Llama-3 8B finetunes roundup - Longer Context (128K) and Dolphin & Bagel Finetunes

* HuggingFace FINEWEB - a massive 45TB (the GPT4 of datasets) and 15T tokens high quality web data dataset (HF)

* Cohere open sourced their chat interface (X)

* Apple open sources OpenElm 4 models + training library called corenet (HF, Github, Paper)

* Big CO LLMs + APIs

* Google Gemini 1.5 pro is #2 on LMsys arena

* Devin is now worth 2BN and Perplexity is also a Unicorn

* A new comer called Augment (backed by Eric Schmidt) is now coming out of stealth (X)

* Vision & Video

* Adobe releases VideoGigaGAN - high quality upscaler with temporal consistency (paper)

* TLDraw autocomplete UI demo (X)

* This Weeks Buzz - What I learned in WandB this week

* Joe Spisak talk about Llama3 on Stage at WandB Fully connected (Full Talk, TLDR)

* Voice & Audio

* Play.ai (previously play.ht) releases conversational Voice AI platform (X)

* AI Art & Diffusion & 3D

* IMGsys.org- like LMsys but for image generation model + leaderboard from FAL (try it)

* Tools & Hardware

* Rabbit R1 release party & no shipping update in sight

* I'm disillusioned about my AI Pin and will return it

Open Source LLMs

Llama-3 1 week-aversary 🎂 - Leaderboard ranking + finetunes

Well, it's exactly 1 week since we got Llama-3 from Meta and as expected, the rankings show a very very good story. (also it was downloaded over 1.2M times and already has 600 derivatives on HuggingFace)

Just on Monday, Llama-3 70B (the bigger version) took the incredible 5th place (now down to 6th) on LMSys, and more surprising, given that the Arena now has category filters (you can filter by English only, Longer chats, Coding etc) if you switch to English Only, this model shows up 2nd and was number 1 for a brief period of time.

So just to sum up, an open weights model that you can run on most current consumer hardware is overtaking GPT-4-04-09, Claude Opus etc'

This seems dubious, because, well, while it's amazing, it's clearly not at the level of Opus or the latest GPT-4 if you've used it; in fact it fails some basic logic questions in my tests. But it's a good reminder that it's really hard to know which model outperforms which, that the arena ALSO has a bias (in who is using it, for example), and that evals are not a perfect way to explain which models are better.

However, LMsys is a big component of the overall vibes based eval in our community and Llama-3 is definitely a significant drop and it's really really good (even the smaller one)

One not so surprising thing about it is that the Instruct version is also really really good, so much so that the first finetune from Eric Hartford's Dolphin series (Dolphin-2.8-LLama3-70B) improves just a little bit over Meta's own instruct version, which is done very well.

Per Joe Spisak (Program Manager @ Meta AI) chat at the Weights & Biases conference last week (which you can watch below) he said "I would say the magic is in post-training. That's where we are spending most of our time these days. Uh, that's where we're generating a lot of human annotations." and they with their annotation partners, generated up to 10 million annotation pairs, both PPO and DPO and then did instruct finetuning.

So much so that Jeremy Howard suggests to finetune their instruct version rather than the base model they released.

We also covered that despite the first reactions to the 8K context window, the community quickly noticed that extending the context window for LLama-3 is possible via existing techniques like RoPE scaling, YaRN and a new PoSE method. Wing Lian (maintainer of the Axolotl finetuning library) is stretching the model to almost a 128K context window and running needle-in-a-haystack tests, and it seems very promising!

Microsoft releases Phi-3 (Announcement, Paper, Model)

Microsoft didn't really let Meta take the open-models spotlight, following up with an incredible report and a model release that's MIT licensed, tiny (3.8B parameters), and performs very, very well even against Llama-3 70B.

Phi is a family of models from Microsoft trained on a synthetic, high-quality dataset modeled after the textbooks-is-all-you-need/TinyStories approach.

The chart is quite incredible: the smallest (mini) Phi-3 is beating Llama-3-8B AND Mixtral on MMLU, BigBench, and HumanEval. To simplify, this TINY 3.8B model, half the size of one Mixtral expert, beats Mixtral and the newly released Llama-3-8B on most benchmarks, not to mention GPT-3.5!

It's honestly quite a crazy chart to look at, which raises the question, did this model train on these benchmarks? 🤔

I still haven't seen definitive proof that the folks at Microsoft trained on any benchmark data; I did see engagement from them and a complete denial. However, we did see early attempts at using Phi-3 where quantized versions and wrong end-token formatting were very prevalent, which seems to have shaped the early opinion that this model's real-world performance is detached from its very high scores.

Not to mention that, the model being new, there's confusion about how to use it; see the thread from Anton Bacaj about HuggingFace potentially using the wrong end token to finish conversations.
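If you're running it locally and getting endless or cut-off generations, a minimal sketch of the workaround (assuming `<|end|>` is the intended end-of-turn marker; check the model card) is to pass the stop token to generate() explicitly instead of relying on defaults:

```python
# A minimal sketch of working around end-token confusion with Phi-3:
# explicitly pass the end-of-turn token id to generate().
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, device_map="auto")

messages = [{"role": "user", "content": "In one sentence, what is RoPE scaling?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

end_id = tokenizer.convert_tokens_to_ids("<|end|>")  # stop at end-of-turn, not just <|endoftext|>
out = model.generate(inputs, max_new_tokens=128, eos_token_id=[end_id, tokenizer.eos_token_id])
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```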

Now to the actual performance of this tiny model: I asked it a simple logic-based question that trips many models, even ones good with logic (Opus and GPT-4 usually answer it correctly), and it performed very well (here's a comparison with Llama-3-70B, which didn't do as well).

Additionally, their tokenizer is very interesting: it has terms that receive a full token, things like function_list, calc, ghreview, ghissue, and others, which highlight some interesting potential use-cases they have planned for this set of models, or give us a hint at its training process and why it's so good.
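If you want to poke at this yourself, here's a quick sketch for inspecting a tokenizer's added vocabulary and checking whether a given term maps to a single token. The terms below are just the ones mentioned above; they may show up under slightly different exact spellings (e.g., wrapped as special tokens), so treat this as exploratory:

```python
# A quick sketch for exploring which terms get a single token in Phi-3's tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True)

# Tokens that were added on top of the base vocabulary (special/placeholder tokens)
print(tok.get_added_vocab())

# Check whether a given term maps to a single token id
for term in ["function_list", "calc", "ghreview", "ghissue"]:
    ids = tok.encode(term, add_special_tokens=False)
    print(term, "->", ids, "(single token)" if len(ids) == 1 else "")
```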

Snowflake open sources Arctic - a massive 480B MoE Hybrid with Apache 2 license (X, Try it, HF)

Snowflake is a name I haven't yet mentioned on ThursdAI, and this field is getting crowded, but they just released something interesting (+ a LOT of open source, including training code, checkpoints, research insights, etc.)

The thing I found most interesting is the massive 128-expert MoE combined with a hybrid architecture: not quite a pure MoE and definitely not a dense model.

They claim to have found, based on DeepSpeed research, that training many-but-condensed experts with more expert choices works well for them.
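To make the dense-plus-MoE hybrid idea a bit more concrete, here's a toy PyTorch sketch (definitely not Snowflake's actual code, and the dimensions are shrunk way down): a regular dense MLP runs alongside a bank of many small experts, only the top-2 of which are active per token, and the two outputs are summed.

```python
# A toy sketch of a dense + MoE hybrid block (not Snowflake Arctic's real code).
# A dense MLP is always active; a router picks top_k of many small experts per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridDenseMoEBlock(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=128, d_expert=32, top_k=2):
        super().__init__()
        # Dense path: always active for every token
        self.dense = nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        # MoE path: many small "condensed" experts, only top_k active per token
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.SiLU(), nn.Linear(d_expert, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                                # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)         # (tokens, n_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)     # (tokens, top_k)
        moe_out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in idx[:, k].unique():                 # only run the selected experts
                mask = idx[:, k] == e
                moe_out[mask] += weights[mask, k, None] * self.experts[e](x[mask])
        return self.dense(x) + moe_out                   # combine dense and MoE paths

x = torch.randn(8, 64)
print(HybridDenseMoEBlock()(x).shape)  # torch.Size([8, 64])
```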

You can give this model a try here, and I have, using the same 2 questions I asked Phi and Llama; honestly, I found the model not that great at logic. But it was really fast considering the total size, so inference optimization for this type of architecture is definitely geared towards enterprise (as is the training cost: they claim it cost just under $2 million to train).

Big CO LLMs + APIs

Not a lot of super interesting things in this corner, besides Gemini 1.5 Pro (the one with the 1M context window) finally appearing in the Arena and taking the amazing #2 spot (pushing Llama-3 70B down to number 6 on the same day it appeared there, lol).

This is very impressive, and I gotta wonder what happened with Gemini Ultra if Pro with a larger context beats it outright. It's indeed very good, but not THAT good if you use it on simple logic problems and don't use the whole context length.

I suspect we'll hear much more about their AI stuff during the upcoming Google I/O (which I was invited to and am going to cover).

Additionally, we've had quite a few AI unicorns born, with Perplexity becoming a freshly minted unicorn after an additional round of funding, and Devin, the 6-month-old agent startup, reaching a $2 billion valuation 😮

This weeks Buzz (What I learned with WandB this week)

It's been exactly 1 week since our conference in SF, and since Joe Spisak, by complete chance, announced Meta Llama 3 live on stage a few hours after it was officially released.

In this week's buzz, I'm very happy to bring you that recording, as promised last week.

I'll also share that our newly announced LLM observability tool, Weave, launched officially during the conference, and it'll be my job to get you to use it 🙂 Shoutout to those in the ThursdAI community who have already used it and provided feedback; it's really helpful!

AI Art & Diffusion

The fine folks at FAL.ai have launched the LMsys.org for images and called it.... IMGsys.org 🙂 It's an adversarial arena with different image generators (all hosted on Fal, I assume) that lets users choose which image is "better", admittedly a vague term.

But it's really fun, give it a try!

Tools & Hardware

Rabbit R1 first impressions

We finally got a tease of R1 from Rabbit, as the first customers started receiving this device (where's mine?? I didn't even get a tracking number)

Based on the presentation (which I watched so you don't have to), the response time, which was one of the most talked-about negatives of the AI Pin, seems very decent. We're going to see a lot of reviews, but I'm very excited about my Rabbit 👏 🐇

Apparently I wasn't as fast as I thought on the pre-order, so I'll have to wait patiently; meanwhile, check out this review from Riley Brown.

That's the deep dive for this week, for the rest of the coverage, please listen to the episode and if you liked it, share with a friend!

I'll also be traveling quite a bit in the next two months: I'll be in Seattle for MSFT BUILD and in San Francisco (more on this soon) a couple of times. Hope to meet some of you, please come say hi! 🫡



This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
