164 points by jsnider3 66 days ago | 132 comments

vessenes 66 days ago [-]

I'm using Pro. It's definitely a "hand it to the team and have them schedule a meeting to get back to me" speed tool. But, it "feels" better to me than o3, and significantly better than gemini/claude for that use case. I do trust it more on confabulations; my current trust hierarchy would be o3-pro -> o3 -> gemini -> claude opus -> (a bunch of stuff) -> 4o.

That said, I'd like this quality with a relatively quick tool using model; I'm not sure what else I'd want to call it "AGI" at that point.

qwertox 66 days ago [-]

What are you using it for? It's not like that wouldn't matter.

With coding using anything is always a hit and miss, so I prefer to have faster models where I can throw away the chat if it turns into an idiot.

Would I wait 15 minutes for a transcription from Python to Rust if I don't know what the result will be? No.

Would I wait 15 minutes if I'd be a mathematician working on some kind of proof? Probably yes.

AaronAPU 66 days ago [-]

I feed most of my questions/code to 4o, Gemini, o3-pro (in that order). By the time I’ve read through 4o, Gemini is ready. Etc.

It’s the progressive jpg download of 2025. You can short circuit after the first model which gives a good enough response.

phillco 66 days ago [-]

Do you have any specific tooling for querying all three at once beyond just copy paste?

grw_ 66 days ago [-]

open-webui has support for doing this- https://imgur.com/a/XUEVgCT

gavinray 66 days ago [-]

https://github.com/caiyongji/ChatMultiAI

rubslopes 66 days ago [-]

OpenRouter lets you send the same prompt to several models.

plufz 66 days ago [-]

How do you reason about the energy consumption/climate impact of feeding the same question to three models? Im not saying there is a clear answer here, would just be interesting to hear your thinking.

true_religion 66 days ago [-]

How much energy does an AI model use during inferencing versus a human being?

This is a rhetorical question.

Sure we aren’t capturing every last externality, but optimization of large systems should be pushed toward the creators and operators of those systems. Customers shouldn’t have to validate environmental impact every time they spend 0.05 dollars to use a machine.

kridsdale1 66 days ago [-]

I actually did the math on this last year some time. For gpt4 or so. Attempted to derive a per-user energy use value. Based on known data LLM training used many hundreds of times the energy use of agriculture and transport costs to feed a human to do equivalent mental work. Inference was much lower. But the climate critique of AI doesn’t distinguish.

blharr 66 days ago [-]

100x more inefficient than a human in only food is pretty efficient. Consider that humans in the developed world spend far more in energy on heating/AC, transportation, housing, lawn care, refrigeration, washers and dryers, etc, and an LLM can probably be several factors more efficient.

I don't really understand the critique of GPT-4 in particular. GPT-4 cost >$100 Million to train. But likely less than 1 billion. Even if they pissed out $100 million in pure greenhouse gases, that'd be a drop in the bucket compared to, say 1/1000 of the US military's contributions

ben_w 66 days ago [-]

That sounds on the low side?

Does that "hundreds" include the cost of training one human to do the work, or enough humans to do the full range of tasks that an LLM can do? It's not like-for-like unless it's the full range of capabilities.

Given the training gets amortised over all uses until the model becomes obsolete (IDK, let's say 9 months?), I'd say details like this do matter — while I want the creation to be climate friendly just in its own right anyway, once it's made, greater or lesser use does very little:

As a rough guess, let's say that any given extra use of a model is roughly equivalent to turning API costs into kWh of electricity. So, at energy cost of $0.1/kWh, GPT-4.1-mini is currently about 62,500 tokens per kWh.

IDK the typical speed of human thought (and it probably doesn't map well to tokens), but for the sake of a rough guide, I think most people reading a book of that length would take something around 3 hours? Which means if the models burn electricity at about 333 W, they equal the performance (speed) of a human, whose biological requirements are on average 100 W… except 100 W is what you get from dividing 2065 kcal by 24h, and humans not only sleep, but object to working all waking hours 7 days a week, so those 3 hours of wall-clock time come with about 9 hours of down-time (40 hour work week/(7 days times 24 hours/day) ~= 1/4), making the requirements for 3 hours work into 12 hours of calories, or the equivalent of 400 W.

But that's for reading a book. Humans could easily spend months writing a book that size, so an AI model good enough to write 62,500 useful tokens could easily be (2 months * 2065 kcal/day = 144 kWh), at $0.1/kWh around $14.4, or $230/megatoken price range, and still more energy efficient than a human doing the same task.

I've not tried o3*, but I have tried o1, and I don't think o1 can write a book-sized artefact that's worth reading. But well architected code isn't a single monolith function with global state like a book can be, you can break everything down usefully and if one piece doesn't fit the style of the rest it isn't the end of the world, so it may be fine for code.

* I need to "verify my organisation", but also I'm a solo nerd right now, not an organisation… if they'd say I'm good, then that verification seems not very important?

plufz 66 days ago [-]

I totally agree that the environmental cost SHOULD be pushed towards the creators, but as long as that doesn't happen, is the moral thing as a consumer to just carry on using it? This is not a rhetorical question.

Transporting something with a car using fossil fuel usually uses less energy than if a human did the same thing by hand, that doesnt mean fossil fuel is environmentally friendly. LLM:s does not decrease the population even if it can do human tasks. If the LLM is used for the good of humanity it is probably a win, but I mean obviously a lot of the use of AI is not.

I use LLM:s as well, I'm just saying, I dont think it is a totally strange question to ponder over the energy use of different use cases with LLM:s.

themanmaran 66 days ago [-]

The same way you might reason about the climate impact of having a youtube video on in the background I expect.

AaronAPU 66 days ago [-]

I don’t have nearly a luxurious enough life for that to be a blip on my radar of concerns.

omikun 66 days ago [-]

Likely how you reason about driving to the beach or flying to a vacation destination. Or playing a game in 4k high quality with ray tracing turned on.

plufz 66 days ago [-]

Yeah, I try to avoid those things as much as possible. Eight years since I was last on an airplane. I'm not sure how that is relevant to my question?

dfsegoat 66 days ago [-]

It's a tough question and I do things the same way.

I feel like we are in awkward phase of: "We know this has severe environmental impact - but we need to know if these tools are actually going to be useful and worth adopting..." - so it seems like just keeping the environmental question at the forefront will be important as things progress.

plufz 66 days ago [-]

I agree, we are totally in an exploratory phase. I think its always a little scary when so much money and big companies are involved though.

naming_the_user 66 days ago [-]

I don’t think about it at all. When they cost money we’ll care more about it.

Until then, the choice is being made by the entities funding all of this.

Y_Y 66 days ago [-]

Are you setting the "reasoning effort"? I find going from the default (medium) to high makes a big difference on coding tasks for openai reasoning models.

pas 66 days ago [-]

what/how does that work internally?

Y_Y 66 days ago [-]

https://learn.microsoft.com/en-us/azure/ai-services/openai/h...

It's just the number of tokens it's willing to expend in the little internal dialogue before it has to spit out an answer.

lelanthran 65 days ago [-]

> That said, I'd like this quality with a relatively quick tool using model; I'm not sure what else I'd want to call it "AGI" at that point.

Am I the only one who is looking around at the AI industry and seeing only developers and artists being replaced with AI?

It's hardly AGI when it can't replace a salesperson, or an accountant, or a lawyer, or a teacher, or ...

All the headlines I am seeing are software development related. In this post, you yourself are using s/ware development as a measure for how good/bad AI is.

achierius 66 days ago [-]

Ideally it should be able to do things outside of the realm of programming with strong reliability (at least as strong as human experts), as well as be able to pick up new skills and learn new facts dynamically.

JamesBarney 66 days ago [-]

I haven't tested o3-pro yet enough to have a good hierarchy of confabulation.

I use AI a lot to double check my code via a code review what I've found is

Gemini - really good at contextual reasoning. Doesn't confabulate bugs that don't exist. Is really good at finding issues related to large context. (this method calls this method, and it does it with a value that could be this)

Sonnet/Opus - Seems to be the more creative. More likely to confabulate bugs that don't exist, but also most likely to catch a bug o3 and gemini missed.

o3 - Somewhere in the middle

bananapub 66 days ago [-]

what do you trust it to do?

the only example uses I see written about on HN appear to basically be Substack users asking o3 marketing questions and then writing substack posts about it, and a smattering of vague posts about debugging.

vessenes 66 days ago [-]

Long form research reporting.

Example: Pull together a list of the top 20 startups funded in Germany this year, valuation, founder and business model. Estimate which is most likely to want to take on private equity investment from a lower mid market US PE fund, as well as which would be most suitable taking into consideration their business model, founders and market; write an approach letter in english and in german aimed at getting a meeting. make sure that it's culturally appropriate for german startup founders.

I have no idea what the output of this query would be by the way, but it's one I would trust to get right on

* the list of startups

* the letter and its cultural sensitivity

* broad strokes of what the startup is doing

Stuff I'd "trust but verify" would be

* Names of the founders

* Size of company and target market

Stuff I'd double check / keep my own counsel on

* Suitability and why (note that o3 pro is def. better at this than o3 which is already not bad; it has some genuinely novel and good ideas, but often misses things.)

leptons 66 days ago [-]

This is all stuff I would expect an LLM to "hallucinate" about. Every bit of it.

thelock85 66 days ago [-]

I recently tried a version of this landscape analysis within a space I understand very well (CA college access nonprofits) and was shocked at how few organizations were named, let alone described in detail. Even worse, the scope and reach of the named orgs were pretty off the mark. My best guess is that they were the SEO winners of the past.

steveklabnik 66 days ago [-]

These tools can search the web to find this kind of data, and show you what they searched. Double checking is essential because hallucinations are still possible, but it's not like in the past where it would just try to make up the data from its training set. That said, it also may find bad data and give you a summary of that, which isn't a direct hallucination, but can still be inaccurate. This is why checking the sources is helpful too.

majormajor 66 days ago [-]

I wouldn't expect it to hallucinate but how do you evaluate it's ability to distinguish spam from good info? I.e. the "the first four pages of google results is all crap nowdays" problem.

steveklabnik 66 days ago [-]

By looking at the pages it looked at and deciding for yourself, just like you would with a web search you invoked yourself. I’ve generally found it to use trustworthy stuff like Stack Overflow, Wikipedia, and university websites. But I also haven’t used it in this way that much or for very serious things. I’d imagine more obscure questions are more likely to end up involving less trustworthy sites.

leptons 66 days ago [-]

>By looking at the pages it looked at and deciding for yourself, just like you would with a web search you invoked yourself.

Or you could just cut out the middleman(bot) and just do the search yourself, since you're going to have to anyway to verify what the "AI" wrote. It's just all so stupid that society is rushing towards this iffy-at-best technology when we still need to do the same work anyway to verify it isn't bullshitting us. Ugh, I hate this timeline.

steveklabnik 66 days ago [-]

You could. What's correct depends on your needs. Search engines aren't useless because you have to make sure that the pages they return to you are quality, and so neither is using an agent in this way.

leptons 63 days ago [-]

This is why I won't use these "AI" tools, because they obfuscate the bullshit. Since I have to check what it said anyway, it's easier if I just do a search and see where the info is coming from, and I can tell good from bad sources really easily. Putting an LLM in front off it is like polishing a turd - garbage in, garbage out.

vessenes 66 days ago [-]

Well you’d be wrong in this case: Deep research will trigger a series of web searches first then reach out to tooling for follow ups as needed; most of the facts will be grounded in the sources it finds.

With no deep research - agreed; too recent to believe info is accurately stored in the model weights.

bananapub 66 days ago [-]

why would you trust it to get any of that right? things like "top 20 startups in Germany" sound hard to determine.

how do you validate all of that is actually correct?

jazzyjackson 66 days ago [-]

A lot of stuff doesn't need to be accurate, it just needs to be enough information to act on.

Like how there's a ton of psychics, tarot and palm readers around Wall St.

bananapub 66 days ago [-]

That’s fine, but no one - not Sam Altman, not the fans on HN - are promoting them as $120/million token clairvoyants, they’re claiming they are srs bzns “iq maxxing” research tools.

If OP had suggested that they were just medium-quality nonsense generators I would have just agreed and not replied.

lukeschlather 66 days ago [-]

I don't think it's necessarily a question of trust, it's a question of cost/benefit, and I can apply this just as much to myself. I have been using a lot more SQL queries lately when I use ChatGPT, because I trust it pretty well to write gnarly queries with subqueries and CASE statements. Things that I wouldn't write myself because it's not worth the time to make the query correct, but ChatGPT can do it in seconds.

I had an example where o1 really wowed me - something I don't want to post on the internet because I want to use it to test models. In that case I was thinking through a problem where I had made an incorrect mathematical assumption. I explained my reasoning to o1 and it was able to point out the flaw in my reasoning, along with some examples mathematical expressions that disproved my thinking.

The funny thing in this case it basically functioned as a rubber duck. When it started producing a response I had deduced essentially what it told me - but it was pretty nice to see the detailed reasoning with examples that might've taken me a few more minutes to work out. And I never would've produced a little report explaining in detail why I was wrong, I would've just adjusted my thinking. Having the report was helpful.

lovich 66 days ago [-]

I’ve been using it in my job search by handing it stuff like the hn whose hiring threads, giving it a list of criteria i care about, and have it scour those posts for matching jobs, and then chase down all the companies posting and see if they have anything on their corporate site matching my descriptions.

Then I have it take those matches and try and chase down the hiring manager based on public info.

I did it at first just to see if it was possible, but I am getting direct emails that have been accurate a handful of times and I never would have gotten that on my own

bananapub 66 days ago [-]

This is a good data point - I guess another dimension is incompleteness-tolerance. An LLM is absolutely going to miss some but for your case that doesn’t matter very much.

Thank you!

IamLoading 66 days ago [-]

The time o3 pro takes is so annoying. I still need some time to get used to that.

agambrahma 66 days ago [-]

Hmm, Gemini + O3 > Claude-Opus for ... what kinds of things?

jes5199 66 days ago [-]

I haven’t tried pro yet but just yesterday I asked O3 to review a file and I saw a message in the chain-of-thought like “it’s going to be hard to give a comprehensive answer within the time limit” so now I’m tempted

A_D_E_P_T 66 days ago [-]

Chat just isn't the best format for something that takes 15-20 minutes (on average) to come up with a response. Email would unironically be better. Send a very long and detailed prompt, like a business email, and get a response back whenever it's ready. Then you can refine the prompt in another email, etc.

But I should note that o3-pro has been getting faster for me lately. At first every damn thing, however simple, took 15+ minutes. Today I got a few answers back within 5 minutes.

throw14082020 66 days ago [-]

Don't you chat with someone over various apps, but then get a notification minutes, hours, or days later? Email is just an instance of chat?

So I disagree. Are you building an AI agent accessible over email?

A_D_E_P_T 66 days ago [-]

Here's how I view it:

Chat is more of a "conversational" medium, where you dash off short and informal messages and expect a very quick if not immediate response.

Email is more of a "correspondence" medium, where you write longer and sometimes much more formal messages, and expect a response within ~24 hours.

The latter would be the better way to use ultra-compute-heavy/long-delay models like o3-Pro, because you don't expect the flow of a conversation, and it would nudge you to be much more detailed in your prompt. In fact, if it had a larger context window, an email-format o3-Pro would be perfect for lawyers and for certain types of data analysis.

I'm not building anything, but if you'd like to, lmk.

snissn 66 days ago [-]

I’ve found throw the problem at 3 o3 pros and have another one evaluate and synthesize works really well

ActionHank 66 days ago [-]

So like, a whole forest of trees per query is what we're saying here?

LeafItAlone 66 days ago [-]

Ideally just a few split atoms

kridsdale1 66 days ago [-]

Now You’re Playing With Agent Power!

b0a04gl 66 days ago [-]

when o3 pricing dropped 80%, most wrote the entire model family off as a downgrade (including me). but usage patterns flipped people finally ran real tasks through it. it's one of the few that holds state across fragmented prompts without collapsing context. used it to audit a messy auth flow spread over 6 services. didn't shortcut, didn't hallucinate edge cases. slow, but deliberate. in kahneman terms, it runs system 2 by default. many still benchmark on token speed, missing what actually matters

lubujackson 66 days ago [-]

I have been using o3 almost exclusively in Cursor now for my "vibe coding" project. I was able to get to a point with faster models before hitting a thrashing problem of forgetting about structure/not updating types/no using right types/ignoring existing functions, etc. Even when providing specific context. o3 rarely hits those issues and can happily implement a fully feature without breaking anything that touches multiple files. Speed is definitely an issue, but much less hassle on the back side.

viraptor 66 days ago [-]

Did you do anything interesting to start using it? I checked o3 when the price dropped and it kept falling basic tasks. ¾ of command runs and edits failed. Claude and Gemini both worked just fine with the same prompts.

lubujackson 64 days ago [-]

Yes, my codebase has gotten bigger and I hit a rut where every change was blowing up one way or another. The fix was to create a few Cursor rules files. You can have Cursor make this - one to define the overall structure of the project, one to define where dataclass types are (I have a Python codebase), a few different ones for generic patterns for what needs updating when adding new code paths. In Cursor you can also select when Cursor consults these files, every time or when doing specified changes.

This has made the biggest difference with o3. If I want to "add a new module" I first explain the goal and have it create a plan (no code). Then I have it save that plan to a TODO.txt. And I just have it start implementing it. I spot check the code along the way to make sure its going down the correct path. I run a few "continue implementing" until it has created tests that pass. Then I test things myself. If all that checks out, I actually look at the code. Most of the time now everything is functioning and I have just a few minor tweaks.

Because I'm working on a new project I let some things stay a little wrong because I know I can focus on an across-the-board solution to it later, with a single prompt. The most important thing is to have everything well-typed and well-structured. LLMs thrive in structure.

viraptor 64 days ago [-]

I didn't mean the finer details of what's produced. I just can't get o3 to reliably edit files or produce terminal commands. I get lots of errors in the Cursor UI itself, saying something like invalid command, which the agent seems to retry on multiple ways, failing each time.

lysecret 66 days ago [-]

This feels very Ai generated.

mettamage 66 days ago [-]

Some people write in similar ways yea. I've also been accused of writing as an AI.

But we're still human mate.

Stop discriminating or actually solve the problem. I've had enough of this attitude.

cshimmin 66 days ago [-]

almost as though the AIs were trained on a corpus of text written by... humans

SkyPuncher 66 days ago [-]

Feels like a lot of software engineers I work with (including myself at times).

Short, concise statements that don't necessarily string together sequentially. However, they still aggregate to a holistic, meaningful thought. No that much different that how a lot of code is written.

motoxpro 66 days ago [-]

I would say the opposite. Unless the person has a lot of custom instructions going on. Getting sentences like "but usage patterns flipped people finally ran real tasks through it." seem like it would take some amount of work.

gala8y 66 days ago [-]

Actually, it does not.

WXLCKNO 66 days ago [-]

Really doesn't to be fair and I feel like I spot so many AI comments every day.

b0a04gl 66 days ago [-]

yes im agi by the way

kridsdale1 66 days ago [-]

hi agi we’ve been trying so hard to find you

swyx 66 days ago [-]

> Arena has gotten quite silly if treated as a comprehensive measure (as in Gemini 2.5 Flash is rated above o3)

> The problem with o3-pro is that it is slow.

well maybe Arena is not that silly then. poorly argued/organized article.

highfrequency 66 days ago [-]

What is the difference between o3 pro and deep research? From a glance, both seem to take 10-15mins to respond and use o3 as the base model.

starik36 66 days ago [-]

I've tried o3 Pro for my use cases (parsing emails in the legal profession) and didn't have better results than the non pro.

In fact, o1-preview has given me more consistently correct results than any other model. But it's being sunset next month so I have to move to o3.

AaronAPU 66 days ago [-]

IMO 4o is much better at people-parsing. The reasoning models o1-pro / o3-pro are really good at writing code and solving algorithmic problems.

starik36 66 days ago [-]

I've tried it with various models. And 4o is really good given that it returns data at least 10 times faster. But if you ask it to fill out a Json document, o3 (or other reasoning models) is still better, more correct and predictable. Or at least, better enough to justify waiting a minute for the API call to return vs 3-5 seconds.

resters 66 days ago [-]

what is people parsing?

starik36 66 days ago [-]

The email from the lawyer might mention lots of names. Who are the plaintiffs, who are defendants, their attorneys, assistants, or insurance adjusters. The model parses out who is who and connects names to titles to email addresses.

resters 66 days ago [-]

Interesting, that's what I thought it meant, but didn't realize it was a term of art.

AaronAPU 66 days ago [-]

Things like inferring the meaning of “people parsing” when it isn’t explicitly defined but can be implied by context.

Not strict rational A+B=C, nuance.

ActionHank 66 days ago [-]

Out of interest, how widespread would you say this usage is amongst your peers in the legal profession?

starik36 66 days ago [-]

ChatGPT is pretty widespread. The only obstacle in the past was the fear that confidential documents might be used for training. OpenAI fixed that with a business account type that guarantees no training.

As far as usage of API for business processes (like document processing) - I can't say.

AtlasBarfed 66 days ago [-]

Guarantees ...... How?

You should assume Facebook level morality.

starik36 66 days ago [-]

I hear you. But the in-house lawyer read and approved the SLA. So all asses are covered!

franze 66 days ago [-]

I use Claude Code a lot. A lot lot. I make it do Atomic Git commits for me. When it gets stuck and instead of just saying so starts to refactor half of the codebase, I jump back to commit where the issue first appeared and get a summary of the involved files. Those in full text (not files) into o3 pro. And you can be sure it finds the issue or gives a direction where the issue does not appear. Would love o3-pro as am MCP so whenever Claude Code goes on a "lets refactor everything" coding spree it just asks o3 pro.

BeetleB 66 days ago [-]

Sounds like you're doing the equivalent of Aider's architect mode (use one model for the reasoning, and another for the code changes).

I would encourage you to try it. It's generally (much) cheaper doing stuff in Aider, but if you're paying a monthly subscription and using it a lot, Claude Code may be cheaper...

jgalt212 66 days ago [-]

> When it gets stuck and instead of just saying so starts to refactor half of the codebase

That's pretty scary.

franze 66 days ago [-]

Atomic Commits.

I put this into Claude.md and need to remind it every other hour. But yeah, you need to jump back every few hours or so.

nevertoolate 66 days ago [-]

Can you give an example what claude works on autonomously for hours? I only use the chat, maybe I’m just not prompting well, but I throw away almost everything claude writes and solve it in significantly less lines of code using the proper abstractions.

franze 66 days ago [-]

currently i am coding a node/react/ts firebase app that allows dynamic multiagent workflows to automate content workflows (a workflow.json defines call this model and the pass this part of the output of that model to that model and then combine it with this model to do that)

my setup is claude code in yolo mode with playwright MCP + browser MCP (to do stuff in the logged i firebase web interface) plus search enabled.

the prototype was developed via firebase studio until i reached a dead end there, then i used claude code to rip out firebase genkit and hooked in google-genai, openai, ...

the whole codebase goes into google gemini studio (caus the million token window) to write tickets, more tickets and even more tickets.

claude code then has the job to implemt these tickets (create a detailed tasklist for each ticket first) and then code it until done. end of each tasklist is a working playwright end to end test with verified output.

and atomic commits.

i hooked anydesk to my computer so i can check i at some point to tell to to continue or to read Claude.md again (the meta instructions which basically tells it to not to fallbacks, mock data or cheat in amy other way.)

ever fourth ticket is refactoring for sinplicity and documentation.

the tickets mist be updated before each commit and moved to the do done folder only when 100 tested ok.

so yeah, when i wale up in the morning either magic happend and the tockets are all done. or it got stuck and refactores half the codebase. in that case it works for an hoor to go over all git commits to find out where it went wrong.

what i need are multiple coding agent which challenge each other at crucial points.

throw234234234 66 days ago [-]

I have to ask a probably naive question - after the initial boilerplate/scaffolding is this actually any faster than just typing in the code you want? Or using the standard AI flow before these long task agents? It feels like you are juggling and bouncing async tools, doubling back on output, and constant trial and error to get things working.

I'm sure lots of code is being generated, but I do wonder about the effectiveness ratio of it when I read comments like above. Like there is a sweet spot after initial scaffold where its easier just to express yourself in code?

ActionHank 66 days ago [-]

Yeah, so far, I've only seen cases where the work is extremely simple and using pervasively used libraries and solutions to create widely implemented solutions. Add something a little out there and things start to unravel.

rotcev 66 days ago [-]

I use O3-pro not as a coding model, but as a strategic assistant. For me, the long delay between responses makes the model unsuitable for coding workflows, however, it is actually a feature when it comes to getting answers to hard questions impacting my (or my friend's/family's) day to day life.

metalrain 66 days ago [-]

"'take your profits’ in quality versus quantity is up to you."

As mainly AI invester not AI user, I think profitability is great importance. It has been race to top so far, soon we see race to the bottom.

resters 66 days ago [-]

Right! We are in a sense lucky to be getting access to actual state-of-the-art models. Soon the actual model may be kept internal and the customers will get "good enough for solid ROI" distilled versions that can be hosted profitably.

boole1854 66 days ago [-]

Here are my own anecdotes from using o3-pro recently.

My primary use cases where I am willing to wait 10-20 minutes for an answer from the "big slow" model (o3-pro) is code reviews of large amounts of code. I have been comparing results on this task from the three models above.

Oddly, I see many cases where each model will surface issues that the other two miss. In previous months when running this test (e.g., Claude 3.7 Sonnet vs o1-pro vs earlier Gemini), that wasn't the case. Back then, the best model (o1-pro) would almost always find all the issues that the other models found. But now it seems they each have their own blindspots (although they are also all better than the previous generation of models).

With that said, I am seeing Claude Opus 4 (w/extended thinking) be distinctly worse at missing problems which o3-pro and Gemini find. It seems fairly consistent that Opus will be the worst out of the three (despite sometimes noticing things the others do not).

Whether o3-pro or Gemini 2.5 Pro is better is less clear. o3-pro will report more issues, but it also has a tendency to confabulate problems. My workflow involves providing the model with a diff of all changes, plus the full contents of the files that were changed. o3-pro seems to have a tendency to imagine and report problems in the files that were not provided to it. It also has an odd new failure mode, which is very consistent: it gets confused by the fact that I provide both the diff and the full file contents. It "sees" parts of the same code twice and will usually report that there has accidentally been some code duplicated. Base o3 does this as well. None of the other models get confused in that way, and I also do not remember seeing that failure mode with o1-pro.

Nevertheless, it seems o3-pro can sometimes find real issues that Gemini 2.5 Pro and Opus 4 cannot more often than vice versa.

Back in the o1-pro days, it was fairly straightforward in my testing for this use case that o1-pro was simply better across the board. Now with o3-pro compared particularly with Gemini 2.5 Pro, it's no longer clear whether the bonus of occasionally finding a problem that Gemini misses is worth the trouble of (1) waiting way longer for an answer and (2) sifting through more false positives.

My other common code-related use case is actually writing code. Here, Claude Code (with Opus 4) is amazing and has replaced all my other use of coding models, including Cursor. I now code almost exclusively by peer programming with Claude Code, allowing it to be the code writer while I oversee and review. The OpenAI competitor to Claude Code, called Codex CLI, feels distinctly undercooked. It has a recurring problem where it seems to "forget" that it is an agent that needs to go ahead and edit files, and it will instead start to offer me suggestions about how I can make the change. It also hallucinates running commands on a regular basis (e.g., I tell it to commit the changes we've done, and outputs that it has done so, but it has not.)

So where will I spend my $200 monthly model budget? Answer: Claude, for nearly unlimited use of Claude Code. For highly complex tasks, I switch to Gemini 2.5 Pro, which is still free in AI Studio. If I can wait 10+ minutes, I may hand it to o3-pro. But once my ChatGPT Pro subscription expires this month, I may either stop using o3-pro altogether, or I may occasionally use it as a second opinion by paying on-demand through the API.

JamesBarney 66 days ago [-]

> With that said, I am seeing Claude Opus 4 (w/extended thinking) be distinctly worse at missing problems which o3-pro and Gemini find. It seems fairly consistent that Opus will be the worst out of the three (despite sometimes noticing things the others do not).

I've found the same thing. That claude is more likely miss a bug than o3 or gemini but more likely to catch something o3 and gemini missed. If I had to pick one model I'd pick o3 or gemini, but if I had to pick a second model I'd pick opus.

It's also seems to have a much higher false positive rate where as gemini seems to have the lowest false positive rate.

Basically o3 and gemini are better, but also more correlated which gives opus a lot of value.

throwdbaaway 66 days ago [-]

For the code review use case, maybe can try to create the diff with something like `git diff -U99999`, and then send only the diff.

thebiggening 65 days ago [-]

I feel really sorry for anyone using o3. It is really, really bad...

thebiggening 65 days ago [-]

not being ironic I would rather commit seppuku than use o3

jacobsenscott 66 days ago [-]

[flagged]

crubier 66 days ago [-]

Writing a Pull Request can take me 8 hours. Reviewing a Pull Request of the same size takes me 30min. Here you go.

Y_Y 66 days ago [-]

P ⊆ NP

Nition 66 days ago [-]

Yeah this is the big benefit of current LLM AI even with the mistakes and hallucinations IMO. All the things that are hard to answer but easy to verify.

Not just programming. e.g. You have a complex medical problem. Hard to ask Google. Ask AI, it gives you some possible answers, then you can search up those. Or you want to identify a plant. AI looks at your photo and tells you what it is, then search that name to verify.

There were existing ways to do some of these things, but this covers all of them.

infecto 66 days ago [-]

A bit of a meta topic but the one thing that probably grinds my nerves more than it should is this style of comments that are not extremely additive and simply posit some idea without experience or backing statements. Happens a lot in these LLM discussions. Perhaps there is genuine curiosity but it seems to always read as an objection to the idea of these tools coming from someone who has not used them.

ezst 66 days ago [-]

To me it's a useful reminder that those tools are nothing but text generating algorithms, optimised to produce compelling answers, irrespective whether they are truthful or not, having no concept of what's factual, and completely missing the ability to give up when asked for impossible or unreasonable answers outside of their training data.

In essence, they are only adequate in niche situations (like creative writing, marketing, placeholder during iterative design, …) where there's no such social contract and assumption that people operate in good faith and do their best diligence not to deceive others.

Pretending otherwise, not pushing back when LLMs are clearly used outside of those contexts, or dressing them into what they are not (thinking machines, search engines, knowledge archives, …) is doing the work of useful idiots defending tech oligarchs and data thieves against their own interests.

And yeah, I get it, naysayers are annoying. Doesn't mean they are wrong or their voices shouldn't be heard at a time where the legality and ethics of all this are being debated.

Brendinooo 66 days ago [-]

>In essence, they are only adequate in niche situations (like creative writing, marketing, placeholder during iterative design, …)

Placeholders, sure. But I'm surprised you don't know about the utility it has in a programming context.

I was kicking the tires on Claude for a personal thing the other day - "given this data and this output I want, write a utility function" - and it quickly gave me something that gave me exactly what I needed on the first try.

I also had an extended session recently where it really helped me with the "naming is hard" problem. Naming is hard and it's nice to have a rubber duck that has a certain understanding of your code and is able to throw a bunch of ideas at you quickly.

I could go on, but hopefully you get the point.

And I'll emphasize your "placeholder" point: sometimes it's not even about the solution it gives, it's about the process of articulating the problem and getting something out there. A lot of times the first iteration of anything is the hardest to complete, and it can really cut down the time to achieve that.

ezst 64 days ago [-]

As I said, I don't reject LLMs for every situation (what you describe would probably fit the "iterative design" category). But as soon as this personal affair meets the wider world, inevitably becomes too big and transient to be thoroughly understood, and accelerates the rate of minimally-supervised code reviews sent to very human peers ; is where I have a bigger issue: individual convenience shouldn't incur more work and effort to others, nor become an increased risk to stakeholders.

It is in that sense that I see LLMs challenge "social contracts": their use is infectious (probably also in the GPL-sense, be warned) and should be agreed upon by all parties upfront (do you want to be on the receiving end of a software that you pay good money to have developed while knowing that its developers have little understanding of its workings? How many times cheaper is cheap enough to pay-off the amount of technical and institutional debt/risk? Can LLMs effectively become that cheap?).

infecto 66 days ago [-]

I tried reading your comment but it was a little difficult, there was a lot of hints of ChatGPT fluff. Ultimately saying a LLM is only useful in niche situations is a tell of the boat passing you by. I am not advocating that they solve all problems but they are definitely more useful than simply “niche” workflows.

ezst 64 days ago [-]

> there was a lot of hints of ChatGPT fluff.

What does that mean? Are you implying that I used a LLM to formulate my comment? A comment about the dangers of LLMs for civic and respectful discourse?

> Ultimately saying a LLM is only useful in niche situations

is a fact. LLMs are not something new or obscure, we have plenty of open-source/weights implementations keeping in check the outlandish marketing promises of the big players and their "SOTA" models.

> …is a tell of the boat passing you by.

is an opinion, giving an emotional twist to something that really doesn't deserve one. All it shows is that you bought into LLMs being more than what they really are, and welcome the (admittedly cool and tempting) marketing above the truth and your critical and independent thinking on the subject.

infecto 64 days ago [-]

You’re stating opinion as fact while accusing others of being emotional. Ironically, your own comment reads like a reaction to hype, not a sober assessment of capability. Saying LLMs are “only useful in niche situations” is not a fact—it’s a perspective, just like mine. We’re all operating off personal anecdotes and incomplete information. The difference is I acknowledge that.

We can debate usefulness all day, but reducing the opposing view to “you fell for marketing” isn’t critical thinking.

And yes your reply unfortunately read exactly like ChatGPT would respond which I found both amusing but also hard to read as the logic was overly verbose.

ezst 63 days ago [-]

> We’re all operating off personal anecdotes and incomplete information. The difference is I acknowledge that.

I'm sorry but I can't assume that your comment comes in good faith. You are the one making the exceptional claims here, not me. The burden of the proof falls on you.

From my side, I've provided clear use-cases where LLMs are adequate, and large categories of use-cases where they are provably not (misguided usage in those by a number of people is no counter-argument).

The extent of your argumentation is annoyingly nihilist "we don't know everything about LLMs, so I'm entitled to believe anything I want".

> We can debate usefulness all day, but reducing the opposing view to “you fell for marketing” isn’t critical thinking.

The reason why I'm pulling marketing into this is because that's the reason why people generally believe that LLMs are more capable than they really are. Placarding LLMs everywhere and over-selling their abilities is a coordinated effort by a very few immensely rich tech companies. Not by Machine Learning scientists or Computational Neuroscientists who have been mapping this same space for decades.

> And yes your reply unfortunately read exactly like ChatGPT would respond

Then I suppose it's a compliment, taken as someone whose native language obviously isn't English.

infecto 62 days ago [-]

Still struggling to find anything in your wall of words that actually backs up your claims. As I always say—YMMV. On my side, we’ve seen real areas of value from LLMs. Maybe in your field or experience, that value doesn’t exist. Fair enough. But presenting your observations as objective fact is a stretch.

Nice chatting, but it’s clear you’re out of productive things to say.

ezst 61 days ago [-]

> Still struggling to find anything in your wall of words that actually backs up your claims.

I mean, one easy thing you could do is to explain how LLMs are adequate as

>> thinking machines, search engines, knowledge archives, …

How much practical experience do you have with LLMs? Have you even tried running models locally? Then how have you not experienced yourself tweaking few parameters and gotten the same model to say one thing and its opposite just after? Through your extensive use of commercial LLMs, aren't you getting daily occurrences of confabulations? Making stuff up is what defines¹ LLMs, and that's not me saying it.

If your field is so inconsequential that reliability and faithfulness of your output has no practical merit or repercussion, I'm glad that you get some entertainment out of this. Still doesn't make it the norm, though.

¹: https://arxiv.org/abs/2409.05746

nevertoolate 66 days ago [-]

On the other hand I only see downvotes and _never_ an answer on how you are using llms. The anecdotal 8hours to 30 minutes PR sounds great, but in my experience it just won’t happen. How can you set up llm to work autonomously for _hours_? If it is continuous “pair” work I just don’t see the 30 min work solving a beefy PR. In 8 hours coding with a well thought out plan / re-planning, testing one can finish quite interesting stuff. 30 minutes is basically nothing - and this is kinda what you get with an llm in my experience. How do you do it?

infecto 66 days ago [-]

Either your head is in the sand or you have not been on HN much. Practically every week there are multiple articles that make the front page either for or against LLM tooling, in each of those people, including myself, will share anecdotes from their own experiences. And you will also get a number of folks like the person I responded too who post a negative or positive comment without any evidence or attempt to learn on their own before making a claim. You say you never see people talking about their experiences but there is usually nothing but experiences being posted in these comments every week. It’s an absurd claim.

Don’t misjudge me either, I am not suggesting there should not be critical comments but that if you make a critical comment you should at least have an idea what your talking about.

nevertoolate 65 days ago [-]

I’m not sure if you have read my question, but I can tell you think I don’t know what is happening in with the llm scene. I’m aware, late? into the game, doing my own experiments. I asked a specific question - have spent ~100 hours in claude chat already implementing production features, prototypes and scripts. Anyways. :)

infecto 65 days ago [-]

I read and replied to the bit that matters. Anyways ;)

huxley 66 days ago [-]

Not necessarily, you don’t need to know the answer, the fabulation might:

* give an error

* return the wrong result

* not be internally consistent with the rest of the content

* be logically impossible

* be factually impossible

* have basic errors

It is entirely possible (and quite common) to know something is wrong without knowing what a right answer is.

ashdksnndck 66 days ago [-]

If you’re referring to the first chart in OP, “comparative evaluations with human testers”, it’s measuring how often o3-pro gave a better answer than o3. It’s not reporting a 63% accuracy rate.

wahnfrieden 66 days ago [-]

Many types of work are time-consuming to produce, and quick to verify.

Sateeshm 66 days ago [-]

I am curious. What are a few examples?

wahnfrieden 66 days ago [-]

Many code-writing tasks.

How long do your teams take to write vs review PRs? How long does it take to review a test case and run it vs write the implementation under test? Or to verify that a fix a regressed test now completes successfully? How long does it take you to do a "design review" of a rendered webpage vs to create a static webpage? How long does it take to evaluate a performance optimization vs write it?

imiric 66 days ago [-]

If your team takes a disproportionately shorter amount of time to review PRs than to write them, I guarantee that your code base has many issues that would've been caught by a more thorough reviewer. Reviewing code doesn't mean slapping a quick "LGTM!" because you trust the author.

> How long does it take to review a test case and run it vs write the implementation under test?

If you blindly trust a passing test and don't review it as production code, I have a bridge to sell you.

> How long does it take to evaluate a performance optimization vs write it?

Factoring in the time to review that the optimization didn't introduce a regression, and isn't a hack that will cause other issues later: the difference shouldn't be too large.

Yes, code usually takes more time and effort to write, but if it's not thoroughly read, understood, and reviewed, it can cause havoc someone will have to deal with later.

This idea that just because LLMs help you write code quicker will make you or the team more productive is delusional. It's just kicking the can down the road. You can ignore it, but sooner or later someone will have to handle it. And you better hope that it happens before it impacts your users.

infecto 66 days ago [-]

I don’t know why it would be delusional. There is definite opportunity for these tools to be force multipliers, it depends on the team and the codebase but I am not sure how you can so easily say it’s delusional. The vast majority of code in companies is not novel. The unique parts, the business logic, is usually not requiring specialization and the rest of the code is all the same.

steveklabnik 66 days ago [-]

"my tests are failing, and I don't know why. can you investigate?"

arrowsmith 66 days ago [-]

Unless P=NP

add-sub-mul-div 66 days ago [-]

The majority of people just want to go home at 5 after putting in as little effort as possible. Their bosses just want to save money in the short term. The interests could not be more aligned and optimized.

bananapub 66 days ago [-]

you're being silly. there are definitely cases in life where "verifying an answer" is much less effort than "producing an answer" (public key cryptography is built on this!). an obvious example is "writing boring code". I can much more quickly review the code to a simple little custom web app than I can sit down and write it. that's great! as a bonus, no one dies if my little dashboard crashes on invalid input or whatever. another thing might be marketing copy - no one really cares if it's good or not and 500 "OK" words on a topic might take an hour to write but five minutes to read and correct the grammar of.

an example of things that are the opposite is "public policy development", which is why it's simply malicious that various corrupt oligarchs are pushing for it to be used for such things.

so, a simple model for you to understand why other people might find these tools useful for some things:

- low stakes - doesn't matter that much if the output isn't Top Quality, either because it's easy to fix or it just doesn't matter

- enormous gap in cost between generation and review - e.g. coding

- review systems exist and are used - I don't care very much if my coworkers use an LLM to write code or not, since all the code gets reviewed by someone else, and if the proposer of the change doesn't even bother to check it themselves then they pay the social cost for it

imiric 66 days ago [-]

The intentional disregard for software quality in your comment is honestly disturbing.

If your quality threshold is so low that you can tolerate crashes on invalid input, you will certainly cut corners when building software for others. I wouldn't want to use a piece of software you wrote, let alone have you on my team.

> I don't care very much if my coworkers use an LLM to write code or not, since all the code gets reviewed by someone else

Ah, yes, let's kick the can down the road.

> if the proposer of the change doesn't even bother to check it themselves then they pay the social cost for it

... The side effects of shoddy code are not redeemed by "paying a social cost". They negatively impact your users, and thus the bottom line of your company.

bananapub 66 days ago [-]

> The intentional disregard for software quality in your comment is honestly disturbing.

why? my shitty sales dashboard at work doesn't control a rocket or a pacemaker. the crappy my-weird-org-mode-table-to-re-arranged-CSV convertor isn't either.

not all software is safety critical, and in any case, I'm the human who ran the code generator and then the code and I'm responsible for my dashboard crashing or my convertor deleting the photos of my cat.

should an LLM replace human code review? no. can I use an LLM for my own dumb projects? of course.

iLoveOncall 66 days ago [-]

> My experience so far is that waiting a long time is annoying, sufficiently annoying that you often won’t want to wait.

My solution for this has been to use non-reasoning models, and so far in 90% of the situations I have received the exact same results from both.

jasonjmcghee 66 days ago [-]

On the complete other end of the spectrum, I found deep research (whether it's actually performing searches or not) to be a significant upgrade in quality. But you need to be cool with having to wait 15-30 minutes. It's certainly not for everything, but definitely worth trying.

It tends to output significantly longer and more detailed output. So when you want that kind of thing- works well. Especially if you need up to date stuff or want to find related sources.

joshstrange 66 days ago [-]

Deep research is very cool, no doubt, but run it on a problem space you are familiar with and you will see the shortcomings.

Anytime I do my own “deep” research I like to then throw the same problem at OpenAI and see how well it fares. Often it misses things or gets things subtly wrong. The results look impressive so it’s easy to fool people and I’m not saying the results are useless, I’ve absolutely gotten value out of it, but I don’t love using it for anything I actually care about.

bcrosby95 66 days ago [-]

I view the results more as a starting point than an end unto itself. For that I think it's pretty useful.

joshstrange 66 days ago [-]

Absolutely, I agree it's useful as a starting point, sometimes it's all I need (if it's low-stakes and I just wanted a bit more data). I was just cautioning "trusting" it completely, since it's very easy to fall into that trap (I've done it).

matwood 66 days ago [-]

Same, it will pull enough sources together that I end up with an idea of where to go next.

pinoy420 66 days ago [-]

[dead]

Loading comments...

vessenes 66 days ago [-]

That said, I'd like this quality with a relatively quick tool using model; I'm not sure what else I'd want to call it "AGI" at that point.

qwertox 66 days ago [-]

What are you using it for? It's not like that wouldn't matter.

With coding using anything is always a hit and miss, so I prefer to have faster models where I can throw away the chat if it turns into an idiot.

Would I wait 15 minutes for a transcription from Python to Rust if I don't know what the result will be? No.

Would I wait 15 minutes if I'd be a mathematician working on some kind of proof? Probably yes.

AaronAPU 66 days ago [-]

I feed most of my questions/code to 4o, Gemini, o3-pro (in that order). By the time I’ve read through 4o, Gemini is ready. Etc.

It’s the progressive jpg download of 2025. You can short circuit after the first model which gives a good enough response.

phillco 66 days ago [-]

Do you have any specific tooling for querying all three at once beyond just copy paste?

grw_ 66 days ago [-]

open-webui has support for doing this- https://imgur.com/a/XUEVgCT

gavinray 66 days ago [-]

https://github.com/caiyongji/ChatMultiAI

rubslopes 66 days ago [-]

OpenRouter lets you send the same prompt to several models.

plufz 66 days ago [-]

true_religion 66 days ago [-]

How much energy does an AI model use during inferencing versus a human being?

This is a rhetorical question.

kridsdale1 66 days ago [-]

blharr 66 days ago [-]

ben_w 66 days ago [-]

That sounds on the low side?

* I need to "verify my organisation", but also I'm a solo nerd right now, not an organisation… if they'd say I'm good, then that verification seems not very important?

plufz 66 days ago [-]

I use LLM:s as well, I'm just saying, I dont think it is a totally strange question to ponder over the energy use of different use cases with LLM:s.

themanmaran 66 days ago [-]

The same way you might reason about the climate impact of having a youtube video on in the background I expect.

AaronAPU 66 days ago [-]

I don’t have nearly a luxurious enough life for that to be a blip on my radar of concerns.

omikun 66 days ago [-]

Likely how you reason about driving to the beach or flying to a vacation destination. Or playing a game in 4k high quality with ray tracing turned on.

plufz 66 days ago [-]

Yeah, I try to avoid those things as much as possible. Eight years since I was last on an airplane. I'm not sure how that is relevant to my question?

dfsegoat 66 days ago [-]

It's a tough question and I do things the same way.

plufz 66 days ago [-]

I agree, we are totally in an exploratory phase. I think its always a little scary when so much money and big companies are involved though.

naming_the_user 66 days ago [-]

I don’t think about it at all. When they cost money we’ll care more about it.

Until then, the choice is being made by the entities funding all of this.

Y_Y 66 days ago [-]

Are you setting the "reasoning effort"? I find going from the default (medium) to high makes a big difference on coding tasks for openai reasoning models.

pas 66 days ago [-]

what/how does that work internally?

Y_Y 66 days ago [-]

https://learn.microsoft.com/en-us/azure/ai-services/openai/h...

It's just the number of tokens it's willing to expend in the little internal dialogue before it has to spit out an answer.

lelanthran 65 days ago [-]

> That said, I'd like this quality with a relatively quick tool using model; I'm not sure what else I'd want to call it "AGI" at that point.

Am I the only one who is looking around at the AI industry and seeing only developers and artists being replaced with AI?

It's hardly AGI when it can't replace a salesperson, or an accountant, or a lawyer, or a teacher, or ...

All the headlines I am seeing are software development related. In this post, you yourself are using s/ware development as a measure for how good/bad AI is.

achierius 66 days ago [-]

JamesBarney 66 days ago [-]

I haven't tested o3-pro yet enough to have a good hierarchy of confabulation.

I use AI a lot to double check my code via a code review what I've found is

Sonnet/Opus - Seems to be the more creative. More likely to confabulate bugs that don't exist, but also most likely to catch a bug o3 and gemini missed.

o3 - Somewhere in the middle

bananapub 66 days ago [-]

what do you trust it to do?

vessenes 66 days ago [-]

Long form research reporting.

I have no idea what the output of this query would be by the way, but it's one I would trust to get right on

* the list of startups

* the letter and its cultural sensitivity

* broad strokes of what the startup is doing

Stuff I'd "trust but verify" would be

* Names of the founders

* Size of company and target market

Stuff I'd double check / keep my own counsel on

* Suitability and why (note that o3 pro is def. better at this than o3 which is already not bad; it has some genuinely novel and good ideas, but often misses things.)

leptons 66 days ago [-]

This is all stuff I would expect an LLM to "hallucinate" about. Every bit of it.

thelock85 66 days ago [-]

steveklabnik 66 days ago [-]

majormajor 66 days ago [-]

I wouldn't expect it to hallucinate but how do you evaluate it's ability to distinguish spam from good info? I.e. the "the first four pages of google results is all crap nowdays" problem.

steveklabnik 66 days ago [-]

leptons 66 days ago [-]

>By looking at the pages it looked at and deciding for yourself, just like you would with a web search you invoked yourself.

steveklabnik 66 days ago [-]

leptons 63 days ago [-]

vessenes 66 days ago [-]

With no deep research - agreed; too recent to believe info is accurately stored in the model weights.

bananapub 66 days ago [-]

why would you trust it to get any of that right? things like "top 20 startups in Germany" sound hard to determine.

how do you validate all of that is actually correct?

jazzyjackson 66 days ago [-]

A lot of stuff doesn't need to be accurate, it just needs to be enough information to act on.

Like how there's a ton of psychics, tarot and palm readers around Wall St.

bananapub 66 days ago [-]

That’s fine, but no one - not Sam Altman, not the fans on HN - are promoting them as $120/million token clairvoyants, they’re claiming they are srs bzns “iq maxxing” research tools.

If OP had suggested that they were just medium-quality nonsense generators I would have just agreed and not replied.

lukeschlather 66 days ago [-]

lovich 66 days ago [-]

Then I have it take those matches and try and chase down the hiring manager based on public info.

I did it at first just to see if it was possible, but I am getting direct emails that have been accurate a handful of times and I never would have gotten that on my own

bananapub 66 days ago [-]

This is a good data point - I guess another dimension is incompleteness-tolerance. An LLM is absolutely going to miss some but for your case that doesn’t matter very much.

Thank you!

IamLoading 66 days ago [-]

The time o3 pro takes is so annoying. I still need some time to get used to that.

agambrahma 66 days ago [-]

Hmm, Gemini + O3 > Claude-Opus for ... what kinds of things?

jes5199 66 days ago [-]

A_D_E_P_T 66 days ago [-]

But I should note that o3-pro has been getting faster for me lately. At first every damn thing, however simple, took 15+ minutes. Today I got a few answers back within 5 minutes.

throw14082020 66 days ago [-]

Don't you chat with someone over various apps, but then get a notification minutes, hours, or days later? Email is just an instance of chat?

So I disagree. Are you building an AI agent accessible over email?

A_D_E_P_T 66 days ago [-]

Here's how I view it:

Chat is more of a "conversational" medium, where you dash off short and informal messages and expect a very quick if not immediate response.

Email is more of a "correspondence" medium, where you write longer and sometimes much more formal messages, and expect a response within ~24 hours.

I'm not building anything, but if you'd like to, lmk.

snissn 66 days ago [-]

I’ve found throw the problem at 3 o3 pros and have another one evaluate and synthesize works really well

ActionHank 66 days ago [-]

So like, a whole forest of trees per query is what we're saying here?

LeafItAlone 66 days ago [-]

Ideally just a few split atoms

kridsdale1 66 days ago [-]

Now You’re Playing With Agent Power!

b0a04gl 66 days ago [-]

lubujackson 66 days ago [-]

viraptor 66 days ago [-]

lubujackson 64 days ago [-]

viraptor 64 days ago [-]

lysecret 66 days ago [-]

This feels very Ai generated.

mettamage 66 days ago [-]

Some people write in similar ways yea. I've also been accused of writing as an AI.

But we're still human mate.

Stop discriminating or actually solve the problem. I've had enough of this attitude.

cshimmin 66 days ago [-]

almost as though the AIs were trained on a corpus of text written by... humans

SkyPuncher 66 days ago [-]

Feels like a lot of software engineers I work with (including myself at times).

motoxpro 66 days ago [-]

gala8y 66 days ago [-]

Actually, it does not.

WXLCKNO 66 days ago [-]

Really doesn't to be fair and I feel like I spot so many AI comments every day.

b0a04gl 66 days ago [-]

yes im agi by the way

kridsdale1 66 days ago [-]

hi agi we’ve been trying so hard to find you

swyx 66 days ago [-]

> Arena has gotten quite silly if treated as a comprehensive measure (as in Gemini 2.5 Flash is rated above o3)

> The problem with o3-pro is that it is slow.

well maybe Arena is not that silly then. poorly argued/organized article.

highfrequency 66 days ago [-]

What is the difference between o3 pro and deep research? From a glance, both seem to take 10-15mins to respond and use o3 as the base model.

starik36 66 days ago [-]

I've tried o3 Pro for my use cases (parsing emails in the legal profession) and didn't have better results than the non pro.

In fact, o1-preview has given me more consistently correct results than any other model. But it's being sunset next month so I have to move to o3.

AaronAPU 66 days ago [-]

IMO 4o is much better at people-parsing. The reasoning models o1-pro / o3-pro are really good at writing code and solving algorithmic problems.

starik36 66 days ago [-]

resters 66 days ago [-]

what is people parsing?

starik36 66 days ago [-]

resters 66 days ago [-]

Interesting, that's what I thought it meant, but didn't realize it was a term of art.

AaronAPU 66 days ago [-]

Things like inferring the meaning of “people parsing” when it isn’t explicitly defined but can be implied by context.

Not strict rational A+B=C, nuance.

ActionHank 66 days ago [-]

Out of interest, how widespread would you say this usage is amongst your peers in the legal profession?

starik36 66 days ago [-]

As far as usage of API for business processes (like document processing) - I can't say.

AtlasBarfed 66 days ago [-]

Guarantees ...... How?

You should assume Facebook level morality.

starik36 66 days ago [-]

I hear you. But the in-house lawyer read and approved the SLA. So all asses are covered!

franze 66 days ago [-]

BeetleB 66 days ago [-]

Sounds like you're doing the equivalent of Aider's architect mode (use one model for the reasoning, and another for the code changes).

I would encourage you to try it. It's generally (much) cheaper doing stuff in Aider, but if you're paying a monthly subscription and using it a lot, Claude Code may be cheaper...

jgalt212 66 days ago [-]

> When it gets stuck and instead of just saying so starts to refactor half of the codebase

That's pretty scary.

franze 66 days ago [-]

Atomic Commits.

I put this into Claude.md and need to remind it every other hour. But yeah, you need to jump back every few hours or so.

nevertoolate 66 days ago [-]

franze 66 days ago [-]

my setup is claude code in yolo mode with playwright MCP + browser MCP (to do stuff in the logged i firebase web interface) plus search enabled.

the prototype was developed via firebase studio until i reached a dead end there, then i used claude code to rip out firebase genkit and hooked in google-genai, openai, ...

the whole codebase goes into google gemini studio (caus the million token window) to write tickets, more tickets and even more tickets.

and atomic commits.

ever fourth ticket is refactoring for sinplicity and documentation.

the tickets mist be updated before each commit and moved to the do done folder only when 100 tested ok.

what i need are multiple coding agent which challenge each other at crucial points.

throw234234234 66 days ago [-]

ActionHank 66 days ago [-]

rotcev 66 days ago [-]

metalrain 66 days ago [-]

"'take your profits’ in quality versus quantity is up to you."

As mainly AI invester not AI user, I think profitability is great importance. It has been race to top so far, soon we see race to the bottom.

resters 66 days ago [-]

boole1854 66 days ago [-]

Here are my own anecdotes from using o3-pro recently.

Nevertheless, it seems o3-pro can sometimes find real issues that Gemini 2.5 Pro and Opus 4 cannot more often than vice versa.

JamesBarney 66 days ago [-]

It's also seems to have a much higher false positive rate where as gemini seems to have the lowest false positive rate.

Basically o3 and gemini are better, but also more correlated which gives opus a lot of value.

throwdbaaway 66 days ago [-]

For the code review use case, maybe can try to create the diff with something like `git diff -U99999`, and then send only the diff.

thebiggening 65 days ago [-]

I feel really sorry for anyone using o3. It is really, really bad...

thebiggening 65 days ago [-]

not being ironic I would rather commit seppuku than use o3

jacobsenscott 66 days ago [-]

[flagged]

crubier 66 days ago [-]

Writing a Pull Request can take me 8 hours. Reviewing a Pull Request of the same size takes me 30min. Here you go.

Y_Y 66 days ago [-]

P ⊆ NP

Nition 66 days ago [-]

Yeah this is the big benefit of current LLM AI even with the mistakes and hallucinations IMO. All the things that are hard to answer but easy to verify.

There were existing ways to do some of these things, but this covers all of them.

infecto 66 days ago [-]

ezst 66 days ago [-]

And yeah, I get it, naysayers are annoying. Doesn't mean they are wrong or their voices shouldn't be heard at a time where the legality and ethics of all this are being debated.

Brendinooo 66 days ago [-]

>In essence, they are only adequate in niche situations (like creative writing, marketing, placeholder during iterative design, …)

Placeholders, sure. But I'm surprised you don't know about the utility it has in a programming context.

I could go on, but hopefully you get the point.

ezst 64 days ago [-]

infecto 66 days ago [-]

ezst 64 days ago [-]

> there was a lot of hints of ChatGPT fluff.

What does that mean? Are you implying that I used a LLM to formulate my comment? A comment about the dangers of LLMs for civic and respectful discourse?

> Ultimately saying a LLM is only useful in niche situations

is a fact. LLMs are not something new or obscure, we have plenty of open-source/weights implementations keeping in check the outlandish marketing promises of the big players and their "SOTA" models.

> …is a tell of the boat passing you by.

infecto 64 days ago [-]

We can debate usefulness all day, but reducing the opposing view to “you fell for marketing” isn’t critical thinking.

And yes your reply unfortunately read exactly like ChatGPT would respond which I found both amusing but also hard to read as the logic was overly verbose.

ezst 63 days ago [-]

> We’re all operating off personal anecdotes and incomplete information. The difference is I acknowledge that.

I'm sorry but I can't assume that your comment comes in good faith. You are the one making the exceptional claims here, not me. The burden of the proof falls on you.

The extent of your argumentation is annoyingly nihilist "we don't know everything about LLMs, so I'm entitled to believe anything I want".

> We can debate usefulness all day, but reducing the opposing view to “you fell for marketing” isn’t critical thinking.

> And yes your reply unfortunately read exactly like ChatGPT would respond

Then I suppose it's a compliment, taken as someone whose native language obviously isn't English.

infecto 62 days ago [-]

Nice chatting, but it’s clear you’re out of productive things to say.

ezst 61 days ago [-]

> Still struggling to find anything in your wall of words that actually backs up your claims.

I mean, one easy thing you could do is to explain how LLMs are adequate as

>> thinking machines, search engines, knowledge archives, …

¹: https://arxiv.org/abs/2409.05746

nevertoolate 66 days ago [-]

infecto 66 days ago [-]

Don’t misjudge me either, I am not suggesting there should not be critical comments but that if you make a critical comment you should at least have an idea what your talking about.

nevertoolate 65 days ago [-]

infecto 65 days ago [-]

I read and replied to the bit that matters. Anyways ;)

huxley 66 days ago [-]

Not necessarily, you don’t need to know the answer, the fabulation might:

* give an error

* return the wrong result

* not be internally consistent with the rest of the content

* be logically impossible

* be factually impossible

* have basic errors

It is entirely possible (and quite common) to know something is wrong without knowing what a right answer is.

ashdksnndck 66 days ago [-]

wahnfrieden 66 days ago [-]

Many types of work are time-consuming to produce, and quick to verify.

Sateeshm 66 days ago [-]

I am curious. What are a few examples?

wahnfrieden 66 days ago [-]

Many code-writing tasks.

imiric 66 days ago [-]

> How long does it take to review a test case and run it vs write the implementation under test?

If you blindly trust a passing test and don't review it as production code, I have a bridge to sell you.

> How long does it take to evaluate a performance optimization vs write it?

Factoring in the time to review that the optimization didn't introduce a regression, and isn't a hack that will cause other issues later: the difference shouldn't be too large.

Yes, code usually takes more time and effort to write, but if it's not thoroughly read, understood, and reviewed, it can cause havoc someone will have to deal with later.

infecto 66 days ago [-]

steveklabnik 66 days ago [-]

"my tests are failing, and I don't know why. can you investigate?"

arrowsmith 66 days ago [-]

Unless P=NP

add-sub-mul-div 66 days ago [-]

bananapub 66 days ago [-]

an example of things that are the opposite is "public policy development", which is why it's simply malicious that various corrupt oligarchs are pushing for it to be used for such things.

so, a simple model for you to understand why other people might find these tools useful for some things:

- low stakes - doesn't matter that much if the output isn't Top Quality, either because it's easy to fix or it just doesn't matter

- enormous gap in cost between generation and review - e.g. coding

imiric 66 days ago [-]

The intentional disregard for software quality in your comment is honestly disturbing.

> I don't care very much if my coworkers use an LLM to write code or not, since all the code gets reviewed by someone else

Ah, yes, let's kick the can down the road.

> if the proposer of the change doesn't even bother to check it themselves then they pay the social cost for it

... The side effects of shoddy code are not redeemed by "paying a social cost". They negatively impact your users, and thus the bottom line of your company.

bananapub 66 days ago [-]

> The intentional disregard for software quality in your comment is honestly disturbing.

why? my shitty sales dashboard at work doesn't control a rocket or a pacemaker. the crappy my-weird-org-mode-table-to-re-arranged-CSV convertor isn't either.

should an LLM replace human code review? no. can I use an LLM for my own dumb projects? of course.

iLoveOncall 66 days ago [-]

> My experience so far is that waiting a long time is annoying, sufficiently annoying that you often won’t want to wait.

My solution for this has been to use non-reasoning models, and so far in 90% of the situations I have received the exact same results from both.

jasonjmcghee 66 days ago [-]

It tends to output significantly longer and more detailed output. So when you want that kind of thing- works well. Especially if you need up to date stuff or want to find related sources.

joshstrange 66 days ago [-]

Deep research is very cool, no doubt, but run it on a problem space you are familiar with and you will see the shortcomings.

bcrosby95 66 days ago [-]

I view the results more as a starting point than an end unto itself. For that I think it's pretty useful.

joshstrange 66 days ago [-]

matwood 66 days ago [-]

Same, it will pull enough sources together that I end up with an idea of where to go next.

pinoy420 66 days ago [-]

[dead]