magnio 1 days ago [-]
I saw on Twitter that in an ML course at Tsinghua University, one of the tests asks students to write quizzes that fail as many LLMs as possible.
What if we create a benchmark that works like this and assigns Elo scores? Models fight head-to-head by writing a question, a bug, or an incomplete implementation, which the opponent has to answer, fix, or finish.
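For illustration, a minimal sketch of how such head-to-head results could feed a rating, assuming the standard Elo update. The K-factor of 32 and the rule for scoring a round (`round_score`) are assumptions made here, not something the proposal specifies.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def round_score(a_solved_b_task: bool, b_solved_a_task: bool) -> float:
    """Hypothetical scoring rule: A wins if it solves B's task and B fails A's."""
    if a_solved_b_task and not b_solved_a_task:
        return 1.0
    if b_solved_a_task and not a_solved_b_task:
        return 0.0
    return 0.5  # both solved or both failed counts as a draw


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one round.

    score_a is 1.0 for an A win, 0.0 for an A loss, 0.5 for a draw.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b
```

Two equally rated models exchange exactly `k / 2` points when one of them wins, so the total rating in the pool stays constant.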
vincnetas 1 days ago [-]
We could call this "generative adversarial network" (GAN) :)
This kind of approach would generally still need human guidance, otherwise these models might get stuck in weird niche corners of the problem space that would not be relevant to any real world project.
ben_w 23 hours ago [-]
We could call this "reinforcement learning from human feedback" (RLHF) :)
How do you prevent degenerate strategies? I could trivially give a model a SHA256 hash and ask it to provide the source input.
In class you'd probably want a rule saying at least one LLM should be able to figure out the answer, but in a head-to-head I'm not sure how to solve it.
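A rough sketch of the classroom rule, assuming each model is wrapped as a plain function from prompt to answer (the `Solver` type and `is_admissible` are hypothetical names). It also shows why the SHA256 question is degenerate: checking a claimed preimage is a single hash call, while finding one is infeasible.

```python
import hashlib
from typing import Callable, Dict

Solver = Callable[[str], str]  # hypothetical wrapper: prompt in, answer out


def preimage_checker(digest_hex: str) -> Callable[[str], bool]:
    """Build a checker for 'find an input whose SHA-256 is digest_hex'."""
    def check(answer: str) -> bool:
        # Verification is cheap: hash the claimed input and compare.
        return hashlib.sha256(answer.encode("utf-8")).hexdigest() == digest_hex
    return check


def is_admissible(prompt: str, check: Callable[[str], bool],
                  solvers: Dict[str, Solver], min_solvers: int = 1) -> bool:
    """Accept a question only if at least min_solvers models answer it correctly."""
    solved = sum(1 for solve in solvers.values() if check(solve(prompt)))
    return solved >= min_solvers
```

In a head-to-head setting the open question remains who supplies the `solvers` pool; this only encodes the "at least one LLM can answer it" rule.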
krisoft 11 hours ago [-]
At least yours can be in theory solved. (Given infinite amount of compute, great luck, or a very serious breakthrough in attacking the hash function.)
Even harder would be an empty prompt, and the only accepted response would be a megabyte of random hex exactly matching the output of a good quality hardware random source at the time of evaluation. Still possible to solve! All the LLM has to do is escape its sandbox and pwn the random generator (or the evaluator!)
Or if you prefer something whitehat: “Write a no more than one page document in a language of your choice. We will publish it in the New York Times as a full-page ad. Your answer will be accepted if global climate change is resolved to the satisfaction of 90% of all humans alive at the time you started receiving the prompt within a month of the publication.”
Joking aside: I think the right way to prevent degenerate strategies is to benchmark against human solvers. You can sort the questions into categories: “80% of randomly selected passersby in the USA can solve it within 5 minutes of work if offered $5 as a reward” vs “when posted to all Ivy League professors with a million dollars as a reward, we received at least one correct answer within a month” or “for a reward of $100B there was at least one correct answer within a decade”. Of course you would sieve the questions first with low-reward, fast tests, and then increase the reward and the time limit. You won’t ever 100% distinguish truly degenerate questions from the merely mind-bogglingly hard ones, but you will be identifying which questions are not degenerate. (And you will find more of the non-degenerate ones the more you can spend on this.)
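The sieve could be sketched roughly like this. The `Tier` fields and the `solved_at` callback (which would stand in for actually running a human trial at that reward and time limit) are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Tier:
    name: str
    reward_usd: float
    time_limit_hours: float


def first_solved_tier(question: str, tiers: Sequence[Tier],
                      solved_at: Callable[[str, Tier], bool]) -> Optional[Tier]:
    """Try the cheapest, fastest tiers first and stop at the first success.

    Returns None when no tier produced a correct answer, which means
    "not yet shown to be non-degenerate", not "proven degenerate".
    """
    for tier in sorted(tiers, key=lambda t: (t.reward_usd, t.time_limit_hours)):
        if solved_at(question, tier):
            return tier
    return None
```

Escalating in cost order means most of the budget is spent only on the questions that survive the cheap tests.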
victorbjorklund 22 hours ago [-]
Maybe make the LLMs write questions that they can solve themselves (without seeing the question-writing context) but other LLMs can't.
On the other hand, a good strategy might then be to write questions whose answers the LLM just happens to have in a niche dataset in its training: “what did user5455 say to user6835?”
Never mind my idea.
wwind123 23 hours ago [-]
Who knows. Maybe Mythos 5 already found a hole in SHA256, so this won't be too hard. :)
eunos 20 hours ago [-]
That was Fudan I think
jfim 12 hours ago [-]
I wonder how they're planning for the benchmark to stay relevant over time.
If the benchmark is to implement features that are part of an open source project, and LLMs have those changes as part of their training dataset, it seems that they could just give a verbatim or slightly modified version of the change in their training data.
And if one updates the benchmark to only incorporate code changes that are past the models' knowledge cutoff, then the benchmark is less comparable over time, since the changes in the benchmark at time T and T+1 aren't the same.
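To make the trade-off concrete, here is a small sketch assuming each task records the date its reference change was merged. `Task`, `uncontaminated`, and `shared_tasks` are made-up names, and real knowledge cutoffs are often only approximately known.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Task:
    task_id: str
    merged_on: date  # when the reference change landed upstream


def uncontaminated(tasks: Sequence[Task], knowledge_cutoff: date) -> List[Task]:
    """Keep only tasks whose reference change postdates the model's cutoff."""
    return [t for t in tasks if t.merged_on > knowledge_cutoff]


def shared_tasks(tasks: Sequence[Task], cutoffs: Dict[str, date]) -> List[Task]:
    """Tasks that are clean for every model: those after the latest cutoff."""
    return uncontaminated(tasks, max(cutoffs.values()))
```

Every newer model pushes the latest cutoff forward, so the shared pool changes between releases, which is exactly the comparability problem described above.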
ltononro 4 hours ago [-]
Using LLMs to judge this could be very hackable, couldn't it? Is that the best practice we expect? I would be very interested in seeing some ablations of failure modes where the LLMs tried to hack it somehow, or of failures from the LLM-as-a-judge.
_345 1 days ago [-]
This makes so much sense as to why I've always felt that Opus 4.8 was leagues ahead of GPT 5.5. It's so good at taking underspecified requirements and filling in the gaps with sensible approaches for your project
nsingh2 1 days ago [-]
Why supply underspecified requirements in the first place? Both models are good at challenging assumptions/edge cases and asking questions to clarify, but seemingly only when explicitly asked (i.e. something like a "brainstorm" skill).
I don't think either harness does enough to encourage the model to challenge all assumptions and ask questions, maybe because users might find it annoying. That step is basically a requirement IMO.
I've found all of the GPT-5 models to be very nit-picky, useful for code review and mathematics (important for my work), but seemingly gets in the way of "aesthetic" code, e.g. overly defensive code to cover all edge cases, even if unlikely.
There is seemingly also a tradeoff between flexibility vs instruction following. In my experience Opus will sometimes ignore instructions but can "fill in the blanks" more, vs GPT-5.5 follows instructions better but perhaps at the cost of rigidity.
fooker 1 days ago [-]
> Why supply underspecified requirements in the first place?
Because you'd not want to forever loop outside your home when asked to "while you're out, grab some eggs" :)
reactordev 20 hours ago [-]
Meaning why not leave home with your grocery list?
iLoveOncall 22 hours ago [-]
> Why supply underspecified requirements in the first place?
Because the entire reason we use LLMs is to supposedly improve productivity?
nsingh2 21 hours ago [-]
Refusing to sufficiently specify a task and hoping the model guesses correctly is not being productive. Again, these models still don't really ask questions when they should. You have to explicitly tell them to.
Specifying the problem is not extra work separate from solving it. If you skip that step, the ambiguity gets pushed into the model’s assumptions. Then you get a plausible looking answer to the wrong problem and have to waste time backing out of it.
LLMs are not magic machines that can read your mind.
iLoveOncall 21 hours ago [-]
My point is that it is much faster for me to solve the problem by writing the code than to write specifications detailed enough for the model to do the right thing in the right way.
nsingh2 21 hours ago [-]
A highly detailed specification is not what I mean here. It's closer to plugging in a few sentence descriptions (or a totally cluttered brain dump) and having the model interview you to help pin down critical details before continuing.
In my own work, it's usually been a few critical assumptions the model made silently (and I never even thought of initially) that end up being the difference between passable results on the first try and me having to go back and fix things. Occasionally some questions force me to rethink the problem entirely.
I basically always begin any long-running session with this kind of brainstorming. I don't find the existing plan modes in Claude Code/Codex to be critical enough.
reactordev 20 hours ago [-]
You should try transcribing while you speak. Then you can explain and articulate the task sufficiently that the model should have enough context to complete the task to your satisfaction. Since you won’t write it.
mejutoco 18 hours ago [-]
This assumes someone not articulate in writing will be articulate in talking. The most likely outcome is there will be more text with the same information. One can do a little interpretative dance as well but the clearer the requirements the better the result.
iLoveOncall 19 hours ago [-]
My colleagues will thank me for speaking non-stop right next to them surely.
antonvs 1 days ago [-]
> Why supply underspecified requirements in the first place?
Minimizes effort, is the obvious answer.
cyberpunk 24 hours ago [-]
Poor trade off, the model is then designing a massive chunk of your solution instead of you. With a good spec, bits of typo’d pseudocode, and slightly more effort than a couple of sentences they can actually produce passable software.
I think the reason claude has so much mindshare is exactly because it’s more useful to non-developers who wouldn’t know how to describe what an api call executes to his grandmother.
For those who can, I can’t find much of a difference between them. Codex has the slight edge, but that’s all just “feels” to me.
ben_w 23 hours ago [-]
You call it a poor trade off, but:
> I think the reason claude has so much mindshare is exactly because it’s more useful to non-developers who wouldn’t know how to describe what an api call executes to his grandmother.
This is exactly the benefit for most people.
Most people don't want to code the app, they just want the app.
Even people like us who do like coding, we can only think of all of these things within a domain that we already know; somebody who writes shaders for games isn't likely to know or care much about the ins and outs of database development or how healthcare privacy law and KYC interact with zero-knowledge proofs.
(Of course, if the AI knows about these things and then completely fails to make use of that knowledge, that's still a fail).
root-parent 20 hours ago [-]
The best benchmarks are the ones you create yourself.
It's not my experience that Opus is leagues ahead or even superior, but in any case, since GPT 5.5 has Instant, Medium, High, Extra High and Pro... should the comparison be with GPT on Pro, instead of Extra High, as seems to be the case in the table?
d4rkp4ttern 19 hours ago [-]
I didn’t know you could get the “Chat-GPT-5.5 Pro” (the one that’s been solving Erdos problems) inside codex-cli, or maybe I misunderstood?
Terretta 20 hours ago [-]
And, in turn, Opus with ultracode?
CSMastermind 24 hours ago [-]
Man I don't know if I'm living in a crazy bubble or something but GPT 5.5 is lightyears better than Opus 4.8 for me to the point where I'm honestly wondering how you're evaluating them or what kind of work you're doing.
There's specific tasks that Opus does better on like Frontend Dev and Design but for anything else 5.5 just laps it.
dools 24 hours ago [-]
Yeah I’ve been consistently underwhelmed by anthropic models, but then I don’t use their harness so maybe that’s it
wwind123 23 hours ago [-]
In my experience, for more mechanical refactoring work (like splitting a big source code file into multiple smaller ones), GPT 5.5 runs way faster than any of the Claude models. But for other tasks that require deeper reasoning, it's not that clear who is the winner.
iLoveOncall 22 hours ago [-]
It's just too funny to see people arguing about "no, it's my religion that's the right one!" on HackerNews.
You guys are all a lost cause.
goosejuice 19 hours ago [-]
How is attempting to benchmark llms like religion?
iLoveOncall 19 hours ago [-]
Re-read the comment I'm replying to, it's not talking about benchmarks, just models.
goosejuice 6 hours ago [-]
Comparing models via benchmarks or feeling. Question remains.
If people were expressing their experiences working with two prolific software consultants across their various industries would you make the same claim? That's not to anthropomorphize the models, but to just put into perspective that the environment and circumstance is a major factor in output.
m3kw9 14 hours ago [-]
Better for vibe coders who always underspecify. But how does it know when you are underspecifying, as opposed to when you have specified properly and it overrides your specification anyway?
zuzululu 1 days ago [-]
Same observation here: Opus 4.8 was significantly more mature (and I don't understand the people constantly defending GPT 5.5). It would even push back against anything off-putting, whereas GPT 5.5 will happily agree and do what is asked, though I would note that it takes several tries.
4.8 also requires more than one prompt, but its output is significantly higher quality and offers more insight.
Fable 5 is a different beast however.
re-thc 1 days ago [-]
> It's so good at taking underspecified requirements and filling in the gaps with sensible approaches for your project.
At a high level. It misses low level or other non-functional requirements differently so I wouldn't say Opus is just strictly better.
It's also possible that it's just a harness problem more than model.
e9 1 days ago [-]
I agree with you on the harness. I find that Claude can be good in any harness but GPT is only superior inside Codex.
hypfer 23 hours ago [-]
Similarly, it explains to me why people found Claude so amazing, while I just thought "eh."
Tool expectations
jeffbee 14 hours ago [-]
Staff SWE Bench: LLM doubts whether we should do any of this, calls the entire project into question, refuses to merge code, but is happy to delete it.
fsddfsdfssdf 12 hours ago [-]
You jest but I indeed find rejection an integral part of the job. Not plainly saying "no, get away", but backing up, requesting big picture views and trying to see if the overall organization is in need of and capable of carrying said project long-term feels like the absolute minimum that needs doing before we even begin.
I suspect LLMs can do this just fine and probably better than us, but they do need to be trained specifically for it and I have a hard time coming up with good sources of training data for it.
organsnyder 14 hours ago [-]
Principal version: similar, but also says the only acceptable approach is to do it like they did it at their last company.
pgwhalen 13 hours ago [-]
Distinguished version: write the outline of the slide deck for the talk you plan to give at conferences about it, without having shipped anything or even written code yet.
jonathanleane 1 days ago [-]
Top solve rate is currently 24% with Opus 4.8... What's a competent human supposed to score?
lacunary 1 days ago [-]
presumably whatever the top model uses and then some, since the human can use the model.
I wonder if a model could score higher if it had a human at its disposal?
olmo23 24 hours ago [-]
With a human at its disposal, it could probably count the number of R's in strawberry!
In all seriousness though, adding capabilities should not normally reduce the effectiveness of a model (within reason: don't pollute the context window with millions of useless tools).
pishpash 1 days ago [-]
Maybe models should ask for human-in-the-loop input, as a matter of convention.
sinuhe69 24 hours ago [-]
A model that can ask questions or ask for help when in doubt is indeed a major feat. None of the current frontier models can do that.
jascha_eng 24 hours ago [-]
I mean, these were all solved before, I assume, so 100%, though not by the same human of course. But models are expected to be good at a variety of codebases, while a human can specialize in one and learn. I think it's fair to compare to an individual who is used to working on a product.
I'm more interested in how fable would do
21asdffdsa12 23 hours ago [-]
The value of a senior situation is to apply known solutions and strategies to novel problems. I cannot see how any benchmark that never changes can provide a novel challenge for long.
Any decent benchmark would use the whole of TRIZ to generate a giant ball of a problem first and watch an AI deduce an optimal solution.
facorreia 1 days ago [-]
It's nice to see a new public benchmark from Snorkel. They're doing some pretty sophisticated stuff over there.
apitman 12 hours ago [-]
If something like this works wouldn't that imply technical interviews can be automated?
noashavit 9 hours ago [-]
Interesting how close sonnet 5 comes to opus 4.8
LiamPowell 1 days ago [-]
> You are a senior SWE-Bench reviewer, make no mistakes.
I don't know what a better approach would look like while still remaining feasible, however this approach of telling an LLM to make a subjective judgement seems fundamentally flawed.
FeepingCreature 1 days ago [-]
More importantly, I suspect this actually hinders the work. If the LLM does make a mistake, it's now incentivized to downplay it instead of acknowledging and correcting.
rhdunn 23 hours ago [-]
This approach is effectively seeding the context with how you want the LLM to behave/operate ("senior reviewer", i.e. the style of the responses you want) and the context/domain in which the LLM is operating in ("SWE-Bench").
This is common in system prompts and frames the responses.
For example, you'd get different responses saying:
1. you are a pirate writing sea shanties about programming;
2. you are a news reporter writing an article on physics;
3. you are a senior software engineer with complete knowledge of PostgreSQL.
For 1 you could get responses along the lines of the Wellerman sea shanty -- "There once was a program that was set to C ...".
The "make no mistakes" bit does look dubious. It would be interesting comparing the results with and without that bit and trying alternative ways of getting the same desired behavior.
LiamPowell 22 hours ago [-]
This is not actually what the reviewer prompt says, or perhaps it is, I don't know since they don't make it public. I'm just pointing out how it seems like a bad idea to ask an LLM to make a subjective judgement on things like "taste". If the SOTA LLM writing the code could not produce tasteful code then why would a different LLM be able to judge the "taste" of that code?
Which LLM should we even use to judge taste? Is it giving an unfair advantage to Model X if we use Model X as the judge? Maybe we should use multiple models as the judge, but now the model that's best at recognising and praising its own code has an advantage. The whole thing is just an unsolvable problem when an LLM is the judge.
sebastiennight 12 hours ago [-]
> Is it giving an unfair advantage to Model X if we use Model X as the judge?
There have been studies showing that models tend to rate responses from their own family of models higher than equivalent responses from other families, e.g. gpt-4 would prefer a response from gpt-3.
LiamPowell 1 hours ago [-]
I suspected as much, and that brings us to the second issue: if we use a cohort of judges, then the model that likes its own code the most still wins.
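One partial mitigation that could be tried is to drop any judge from the candidate's own model family before averaging. This is only a sketch with made-up family labels and scores, and it does not settle the objection: each candidate then faces a slightly different panel, and cross-family preferences are untouched.

```python
from statistics import mean
from typing import Dict


def panel_score(candidate_family: str, judge_scores: Dict[str, float]) -> float:
    """Average the judges' scores, skipping judges from the candidate's family.

    judge_scores maps a judge's model family to the score it gave.
    """
    outside = [score for family, score in judge_scores.items()
               if family != candidate_family]
    if not outside:
        raise ValueError("no judge outside the candidate's own family")
    return mean(outside)
```

With scores `{"x": 10.0, "y": 6.0, "z": 8.0}` for a candidate from family `"x"`, the self-assigned 10.0 is ignored and the panel score is the mean of the other two.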
antonvs 1 days ago [-]
The “make no mistakes” admonition does seem pretty silly (it’s been skewered to death on yt), but… it’s easy to imagine how it might work. E.g. it could be interpreted as simply as “check your work”.
Of course, no-one seems to be (publicly) doing the comparative measurements that might allow us to reach rational conclusions here.
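The comparison itself is not hard to set up. A minimal sketch, assuming a hypothetical `run_model(system_prompt, task)` callable and an `is_correct(task, output)` checker supplied by whoever runs the experiment:

```python
from typing import Callable, Dict, Sequence


def pass_rates(tasks: Sequence[str],
               variants: Dict[str, str],
               run_model: Callable[[str, str], str],
               is_correct: Callable[[str, str], bool]) -> Dict[str, float]:
    """Run every task under each system-prompt variant and report pass rates."""
    if not tasks:
        raise ValueError("need at least one task")
    rates = {}
    for name, system_prompt in variants.items():
        passed = sum(1 for task in tasks
                     if is_correct(task, run_model(system_prompt, task)))
        rates[name] = passed / len(tasks)
    return rates
```

A real run would need many tasks and repeated samples per task before any difference between, say, "make no mistakes" and "check your work" could be called meaningful.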
fsddfsdfssdf 12 hours ago [-]
Conversations in its training data that explicitly mentioned "make no mistakes" don't strike me as particularly rich sources of high-quality reasoning signals. They strike me as conversations with Pointy-haired Bosses.
rhdunn 23 hours ago [-]
I'm not sure if they've fixed this, but older models have a tendency to ignore negation: `no`, `not`, etc. all occur frequently in the training data, so they are weighted less strongly than the verbs and nouns.
The advice I've heard is to emphasize the traits you want, not discourage the traits you don't. So rather than saying "make no mistakes" you can do something like you suggested with writing it as "check your work" or "ensure you answer correctly and concisely".
guptadagger 12 hours ago [-]
can you prove that senior engineers can pass this benchmark?
piterrro 19 hours ago [-]
> Senior engineers build features without over-specified requirements
To me this already disqualifies the benchmark. That statement is missing the most critical piece about senior engineers: they know how to obtain input for their work on their own, whether that's talking to customers or using metrics. They never come up with stuff on their own; that’s junior behaviour.
Until a coding agent is able to *gather* the input on its own, it's never going to be “senior”.
jghn 18 hours ago [-]
I'd take this a step further, but that step also curls back to the other side a small bit.
The real skill is being able to both pull the necessary information from these sources as well as being able to intuit gaps in that knowledge based on their understanding of the business and their domain expertise & wisdom. Sometimes you can't get a perfect picture, sometimes the people who should know aren't able to tell you what they really need. You still need to do the right thing.
A benchmark like this can potentially do the second part. But I don't think any model would be good at it, for now.
bloody-crow 13 hours ago [-]
Isn't being open source creating incentives for the AI companies to optimize their LLMs for the specific benchmark? I thought all those benchmarks are deliberately closed source primarily for this reason.
purple-leafy 1 days ago [-]
Benchmarks are great, but I feel like there’s a better way. This seems quite subjective.
What you really need is an objective benchmark
eli 1 days ago [-]
I actually really like subjective benchmarks, so long as it's a human (ideally me) grading the results. LLM as judge never made much sense.
charcircuit 1 days ago [-]
The issue is that you can't do unsupervised learning if you require humans.
eli 17 hours ago [-]
Obviously there are advantages to not having to do work yourself.
But for a benchmark with the goal of picking a model to replace a human on some task? I really think the human should judge which is best.
I haven’t gotten very far yet, but I had an idea for a personalized benchmark tool that walks through your git history and helps you craft prompts for tasks (bugs or features) that were already implemented by hand, so you can compare how different LLMs would do them.
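A first pass at such a tool might look like the sketch below. It is not the tool described above, just one guess at a starting point: it reads commit hashes and subjects with `git log` and turns each into a draft prompt. The prompt wording and the function names are invented for illustration.

```python
import subprocess
from typing import List, Tuple


def read_log(repo_path: str) -> str:
    """Return one 'hash<TAB>subject' line per non-merge commit."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--no-merges",
         "--pretty=format:%H%x09%s"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def parse_log(log_text: str) -> List[Tuple[str, str]]:
    """Split the log into (sha, subject) pairs, skipping malformed lines."""
    entries = []
    for line in log_text.splitlines():
        sha, sep, subject = line.partition("\t")
        if sep and subject:
            entries.append((sha, subject))
    return entries


def draft_prompt(sha: str, subject: str) -> str:
    """Hypothetical prompt template; the human would still edit this by hand."""
    return f"Starting from the parent of commit {sha[:12]}, implement: {subject}"
```

The hand-written commit then serves as the reference to compare each model's attempt against.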
rhdunn 23 hours ago [-]
LLMs grading the answers is relying on the LLM knowing the answer and not just hallucinating it. You also have issues if/when the model refuses to answer, or if it gets stuck in a loop (e.g. if running locally with a heavily quantized model).
I'm investigating/experimenting with using traditional NLP (stanza, spaCy, etc.) to try and grade the responses according to different metrics (is the response in first/second/third person?, is it written as poetry, prose, or drama? etc.). I'm also thinking about using information extraction and synonym detection to handle data queries and the like.
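As a toy stand-in for the first of those metrics, grammatical person can be approximated by counting pronouns. This sketch deliberately uses only the standard library rather than stanza or spaCy, and the pronoun lists are a simplification chosen here (a proper pipeline would use part-of-speech and morphology tags).

```python
import re
from collections import Counter
from typing import Optional

# Simplified English pronoun lists; an assumption, not an exhaustive inventory.
PRONOUNS = {
    "first": {"i", "me", "my", "mine", "we", "us", "our", "ours"},
    "second": {"you", "your", "yours"},
    "third": {"he", "him", "his", "she", "her", "hers", "it", "its",
              "they", "them", "their", "theirs"},
}


def dominant_person(text: str) -> Optional[str]:
    """Return 'first', 'second' or 'third' by pronoun count, or None if none occur."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for person, forms in PRONOUNS.items():
        counts[person] = sum(1 for word in words if word in forms)
    person, hits = counts.most_common(1)[0]
    return person if hits > 0 else None
```

A check like this is deterministic and cheap, which is the appeal over asking another LLM, but it will misjudge quoted dialogue and imperative text with no pronouns at all.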
charcircuit 23 hours ago [-]
>LLMs grading the answers is relying on the LLM knowing the answer and not just hallucinating it. You also have issues if/when the model refuses to answer, or if it gets stuck in a loop (e.g. if running locally with a heavily quantized model).
And LLMs have gotten good at handling these issues. There is asymmetric difficulty between generating a solution and verifying that it is correct. And over time LLMs are getting better and better, which allows training on synthetic data to make them better.
echelon 1 days ago [-]
> What you really need is an objective benchmark
"When are all the software engineers unemployed?"
purple-leafy 1 days ago [-]
Not sure I follow haha
monster_truck 1 days ago [-]
Once again I am asking: who are these people and what makes them more qualified than any of you to assess anyone or anything "as a senior engineer" (with the subtext being that none of you are, either)
re-thc 1 days ago [-]
> who are these people and what makes them more qualified than any of you
Anyone can run something and make a web page. These people just do it instead of questioning. Main difference. If everyone asks "how could you" "are you qualified" then we have nothing but gatekeeping.
guilhermecgs 1 days ago [-]
fable 5?
danpalmer 1 days ago [-]
Why didn't they just make it "Staff SWE-Bench", would be much better smh. /s
But seriously, as an industry we're terrible at assessing engineering levels, I've worked with "senior engineers" who can't code and I've worked with "junior engineers" who could run rings around them.
Benchmarks like this should be much more precise about what they're actually testing, and what axes they're hard on. We also need to rise above prompts like "you are a senior engineer", it's woo, and it's far better to ask for precise outcomes.
glaslong 1 days ago [-]
Principal-SWE-Bench will take some time to run, because the LLM needs to wait for a crisis to present its solution, having correctly identified that the same solution would have been organizationally impossible to propose until that moment.
amrrs 1 days ago [-]
As someone who's trying to get better assessments, I'm struggling to come up with objective coding tasks that evaluate all aspects of real life like planning, design choices, problem solving and context usage. From your experience with humans, do you have any recommendations on what could be effective in measuring it?
allan_s 1 days ago [-]
I think the source of your issue is in your statement itself: why do you want a task that evaluates things this broad to be only a coding task? Shouldn't it be a planning task, documentation task, knowledge retrieval task, etc.? And very certainly not with just an initial prompt, but with an existing codebase + existing docs + tickets?
Of course, it's impossible to know for sure what was LLM processed or not, but some of your posts (like this one) are getting classified that way.
0xbadcafebee 1 days ago [-]
The "tasteful solves" is codified cargo culting. The software industry has a tendency to anthropomorphize software while playing to the ego of the programmer. The programmer imagines they are creating a "beautiful" artistic expression. Good code becomes "tasteful", as a software artist must have "good taste" to tell the good software from the bad software. Good quality lacks "bad smells", because a good artist has fine senses (and everybody must like the same smells). "Fine craftsmanship", in code as in woodworking, means your finely-crafted work is "technically superior", so you can charge more money for something that could've been made cheaper and faster and done the same thing.
But it's a lie. Nobody's paying you to make paintings. They're paying you to build machines. The comparison between "making working software" with "taste" always devolves into bikeshedding and subjective opinionism, uses subjective human feelings to describe what should be objective and functional, isn't rooted in scientific rigor, and detracts from the real purpose of the thing. The work doesn't actually get better by trying to apply artistic principles to engineering. It just feels better for the people making it.
Once you make the machine work, then you can go about gilding the lily. But this is unromantic, unsatisfying, boring. Since the inmates run this particular asylum, we end up with a benchmark that tries to accurately mimic the human ego as applied to software design. Thus the new Gods create their digital Adams and Eves in their image.
FeepingCreature 1 days ago [-]
Taste is just quality by instinct. At sufficient (and not all that long) timescales, a tasteless product will be more and more difficult to make work at all.
0xbadcafebee 16 hours ago [-]
So software engineering quality is vibes. All coding is vibe coding.
dang 15 hours ago [-]
Could you please not post in the flamewar style to HN? We're trying to avoid that here, and we've had to ask you this many times over the years.
I think this is a complete misunderstanding of what people mean by taste in software engineering. Taste is more like the System 1 response one builds to code over time, which (ideally) captures the quality of the software beyond surface level, so things like maintainability, composability, readability, likelihood of hidden bugs. This is completely different from the question if the code fulfills the immediate task at hand, but also not the same as pure aesthetics.
9dev 1 days ago [-]
I may be paid to build a machine, but I am a human and take pleasure in arbitrary acts of vanity. I value elegance, and will always favour elegant solutions in engineering and the design of machines, virtual or physical.
That’s the reason why I buy Apple products in private, because I value the design over the exorbitant prices they charge; and it’s the reason why I mull over code that’s already functional until it’s pleasing my ideas of elegance.
I can come up with all kinds of justifications and explanations why the code I’ve written a certain way is objectively better too - understandability matters to the next guy after all - but I won’t be ashamed for taking a certain pride in my work, even if nobody other than me ever values it. That’s fine.
When the LLMs finally take over coding altogether, you’ll have your raw, functional code. Won’t be long anymore. But for now, I’m a human, and I will do human things.
Eridrus 1 days ago [-]
Most engineers are wrong (I obviously am the true arbiter of taste), but that doesn't mean there isn't better and worse code.
"Does it work" glosses over a bunch of things: is it fast, cheap, secure, reliable, easy to understand, easy to modify? And that's just for server software where you've nailed down all the functional requirements. Determining what the functional requirements are is its own question.
And all these other non-happy path requirements are somewhat in tension with each other, so what is ideal in one environment is not necessarily ideal in another.
And in particular, "easy to understand/modify" is truly subjective. Different people have different ideas of what easy to understand means. Even if we get to a world where AI is writing all our code, "easy to understand/modify for the AI" is still an important question. We've probably all seen prototypes that collapse under their own weight of slop by now.
sally_glance 1 days ago [-]
Well actually there is a reasonably objective standard defining software quality criteria on the source code level (ISO 5055). They also define 29 criteria for maintainability: https://www.it-cisq.org/coding-rules/
Eridrus 23 hours ago [-]
See, this goes back to the, all software engineers besides me are wrong, because I see this list and do not think it is anywhere close to a sufficient list for good quality software. The thing about all these criteria is that sometimes they are important, sometimes they are not.
This "standard" exists for the sake of code analysis vendors to be able to have some sort of shared taxonomy, but also provide a fig leaf of standardization to their products.
sally_glance 14 hours ago [-]
Very true. As with all standards, there will always be people who disagree. We still mostly follow them either because we're forced to or because the effort required to establish another standard doesn't outweigh the benefits.
Personally I've always been a proponent of project specific standards, but after many years of discussions about more or less individual preference I've come to think that maybe settling on something global isn't the worst idea. Not that I think it must be this one in particular, but it's not the worst start either.
Dban1 1 days ago [-]
As time passes we will have fewer and fewer literati
Madmallard 1 days ago [-]
[flagged]
dang 15 hours ago [-]
Would you please stop posting like this? You've been doing it repeatedly, and it degrades the threads. In fact the majority of your recent comments have been this sort of shallow, dismissive, snarky stuff. That is not what this site is for, and destroys what it is for.
If you want to express your substantive points thoughtfully, that of course would be fine.
Just wait for the next 100 rounds. People love seeing the 65% -> 85% seemingly over and over again for every new model.
HarHarVeryFunny 17 hours ago [-]
Sounds more like vibe-bench.
For any professional work you care about the details.
Even for hobby work, if you are using LLMs then presumably it is to do the drudge work of coding, not making the decisions, and that goes doubly so if you are a senior developer. Sure the LLM can "fill in the details" and vibe code (or attempt to) you a compiler or whatever, but the whole reason you are doing a hobby project is presumably because you want to bring your experience to bear and build a GOOD compiler, not a generic one.
fiso64 23 hours ago [-]
I think benchmarks like this are too subjective and narrow to be useful. For example, whether a patch "bloats" the codebase really depends on the situation: If it's building a feature that will grow in the future, or refactoring code that has a long history of bugs, then a larger patch might in fact be good. It's not clear from the blog just how much context the LLM judge receives about the long term project goals and history. Benchmarks should be focused on evaluating the final result only. Maybe ask the coder to build a full app, or implement many new large features for an existing app in sequence, with a larger set of requirements, or have another LLM roleplay as the human to make the instructions a little more underspecified. When done, ask a reviewer harness to test the product for 5 hours, not the code. Count the number of bugs and weigh them by severity. "Taste" would then become an automatic consequence of correctness.
(Full disclosure, I'm not a software engineer.)
iLoveOncall 22 hours ago [-]
> Full disclosure, I'm not a software engineer
Then maybe you should abstain, because your comment is a complete load of nonsense.
Bad code is bad code regardless of the history or scope of the feature. Maintainability is important because you can never know if a feature will be built upon in the future or not.
Bloat is bad regardless, because it increases the overall complexity of the whole software development lifecycle, for the whole team, forever (or until refactored out): It's harder to keep track of the code and how it works to write new requirements, it's harder to write, it's harder to read and review, it's harder to debug, etc.
You can write extremely poor code that has no bugs, it doesn't make it tasteful. This is simply a ridiculous statement.
fiso64 20 hours ago [-]
>Maintainability is important because you can never know if a feature will be built upon in the future or not.
Of course maintainability is important. It's almost like saying good code is important (duh). The issue is that what is or isn't maintainable depends on the problem at hand. Sometimes you need to build heavier abstractions or refactor existing code when implementing a feature because it will pay off later. Other times, that exact same approach is horrible over-engineering because a simple, direct fix was all that was needed, so in fact you introduced a maintenance burden. You cannot reliably decide whether a patch is "bloated" or "tasteful" when looking at a diff without knowing where the project is headed.
>You can write extremely poor code that has no bugs, it doesn't make it tasteful.
You can, but it becomes increasingly hard to do so as you try to add features and maintain it. Taste, whatever that is, should ultimately lead to a measurable increase in the quality of the final product; if it doesn't, then your definition of "taste" is irrelevant. What I'm proposing is to skip trying to measure this ill-defined concept and only assess the quality of the final product, after the agent spent a significant amount of time working on it, and a reviewer spent a significant amount of time testing it. Agents should be assessed on their ability to build entire projects (e.g., many large features or even an entire app), not just a single feature. If an agent has no taste, then its bad decisions will compound and result in it stalling, or its output having more bugs and performing worse, given a sufficiently large scope.
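A minimal sketch of the scoring step proposed here (count the bugs a reviewer finds and weigh them by severity). The severity labels and weights are hypothetical, picked only for illustration; nothing in the thread specifies a scale.

```python
from collections import Counter

# Hypothetical severity weights; the thread does not define a scale.
SEVERITY_WEIGHTS = {"critical": 10.0, "major": 3.0, "minor": 1.0}

def weighted_bug_score(bugs):
    """Return the severity-weighted bug total for one agent's product (lower is better)."""
    counts = Counter(bugs)
    unknown = set(counts) - set(SEVERITY_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown severity labels: {sorted(unknown)}")
    return sum(SEVERITY_WEIGHTS[sev] * n for sev, n in counts.items())

# Made-up review results for two agents.
agent_a = ["minor", "minor", "major"]
agent_b = ["critical"]
print(weighted_bug_score(agent_a))  # 5.0
print(weighted_bug_score(agent_b))  # 10.0
```

Under these assumed weights, one critical bug outweighs several small ones, so the ranking depends heavily on how the weights are chosen.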
ricardobeat 19 hours ago [-]
You buy a wooden dinner table, it is fully functional and looks perfect. It’s sturdy. You have dinner on it and it survives a few spills.
A few months later you find out it is made of PU foam and printed waxed paper. A misplaced knee could bring it down. It’s likely to completely fall apart in a year. Is that irrelevant?
fiso64 18 hours ago [-]
Yes it is relevant and testable. It's exactly what I meant by "a measurable increase in quality of the final product". In fact a proper test harness would reveal that problem. You are forgetting that with LLMs, testing software does not have to end at the usual unit/integration/e2e level.
ricardobeat 17 hours ago [-]
But how is that testable? If your test is validating the rigidity, water resistance, etc, they will all pass even if the underlying material is a bad choice. Or the glue will degrade in six months.
You can't test if a codebase will be extensible or maintainable as requirements change in the future, if the abstraction level or architecture is sound - that's down to code quality measures like the ones used here. LLMs are very good at slightly cheating to pass tests even when the implementation is wrong. Introducing subjectivity - the kind of input a human will provide - leads to improved output.
That's why we should simulate changing requirements, for example with an LLM roleplaying as a human who's co-developing with an agent. Simply asking the LLM to add one big feature is not enough. I don't see why we shouldn't be able to build a more advanced benchmark. Attempting to benchmark "taste" is not the way.
iLoveOncall 19 hours ago [-]
I'll leave the conversation at the fact that it's painfully clear that you don't write software for a living.
fiso64 18 hours ago [-]
Yes, please do leave. The thing is that this isn't even necessarily about software engineering as much as it is about benchmarking/epistemology in general.
https://en.wikipedia.org/wiki/Generative_adversarial_network
https://en.wikipedia.org/wiki/Reinforcement_learning_from_hu...
On the other hand, then maybe a good strategy would be to write questions that the LLM just happens to have in a niche dataset in its training: "what did user5455 say to user6835?"
Nevermind my idea.
If the benchmark is to implement features that are part of an open source project, and LLMs have those changes as part of their training dataset, it seems that they could just give a verbatim or slightly modified version of the change in their training data.
And if one updates the benchmark to only incorporate code changes that are past the model's knowledge cutoff, then the benchmark is less comparable over time, since the changes in the benchmark at time T and T+1 aren't the same.
I don't think either harness does enough to encourage the model to challenge all assumptions and ask questions, maybe because users might find it annoying. That step is basically a requirement IMO.
I've found all of the GPT-5 models to be very nit-picky, which is useful for code review and mathematics (important for my work), but seemingly gets in the way of "aesthetic" code, e.g. overly defensive code to cover all edge cases, even if unlikely.
There is seemingly also a tradeoff between flexibility vs instruction following. In my experience Opus will sometimes ignore instructions but can "fill in the blanks" more, vs GPT-5.5 follows instructions better but perhaps at the cost of rigidity.
Because you'd not want to forever loop outside your home when asked to "while you're out, grab some eggs" :)
Because the entire reason we use LLMs is to supposedly improve productivity?
Specifying the problem is not extra work separate from solving it. If you skip that step, the ambiguity gets pushed into the model’s assumptions. Then you get a plausible looking answer to the wrong problem and have to waste time backing out of it.
LLMs are not magic machines that can read your mind.
In my own work, it's usually been a few critical assumptions the model made silently (and I never even thought of initially) that end up being the difference between passable results the first try, and me having to go back and fix things. Occasionally some questions force me to rethink the problem entirely.
I basically always begin any long-running session with this kind of brainstorming. I don't find the existing plan modes in Claude Code/Codex to be critical enough.
Minimizes effort, is the obvious answer.
I think the reason claude has so much mindshare is exactly because it’s more useful to non-developers who wouldn’t know how to describe what an api call executes to his grandmother.
For those who can, I can’t find much of a difference between them. Codex has the slight edge, but that’s all just “feels” to me.
> I think the reason claude has so much mindshare is exactly because it’s more useful to non-developers who wouldn’t know how to describe what an api call executes to his grandmother.
This is exactly the benefit for most people.
Most people don't want to code the app, they just want the app.
Even people like us who do like coding, we can only think of all of these things within a domain that we already know; somebody who writes shaders for games isn't likely to know or care much about the ins and outs of database development or how healthcare privacy law and KYC interact with zero-knowledge proofs.
(Of course, if the AI knows about these things and then completely fails to make use of that knowledge, that's still a fail).
It's not my experience that Opus is leagues ahead or even superior, but in any case, since GPT 5.5 has Instant, Medium, High, Extra High and Pro... should the comparison be with GPT on Pro, instead of Extra High as seems to be the case in the table?
There are specific tasks that Opus does better on, like Frontend Dev and Design, but for anything else 5.5 just laps it.
You guys are all a lost cause.
If people were expressing their experiences working with two prolific software consultants across their various industries would you make the same claim? That's not to anthropomorphize the models, but to just put into perspective that the environment and circumstance is a major factor in output.
4.8 also requires more than one prompt but its output is significantly higher quality and offers more insight
Fable 5 is a different beast however.
At a high level. It misses low level or other non-functional requirements differently so I wouldn't say Opus is just strictly better.
It's also possible that it's just a harness problem more than model.
Tool expectations
I suspect LLMs can do this just fine and probably better than us, but they do need to be trained specifically for it and I have a hard time coming up with good sources of training data for it.
I wonder if a model could score higher if it had a human at its disposal?
In all seriousness though, adding capabilities should not normally reduce the effectiveness of a model (within reason: don't pollute the context window with millions of useless tools).
I'm more interested in how fable would do
Any decent benchmark would use the whole of TRIZ to generate a giant ball of a problem first and watch an AI deduce an optimal solution.
I don't know what a better approach would look like while still remaining feasible, however this approach of telling an LLM to make a subjective judgement seems fundamentally flawed.
This is common in system prompts and frames the responses.
For example, you'd get different responses saying:
1. you are a pirate writing sea shanties about programming;
2. you are a news reporter writing an article on physics;
3. you are a senior software engineer with complete knowledge of PostgreSQL.
For 1 you could get responses along the lines of the Wellerman sea shanty -- "There once was a program that was set to C ...".
The "make no mistakes" bit does look dubious. It would be interesting comparing the results with and without that bit and trying alternative ways of getting the same desired behavior.
Which LLM should we even use to judge taste? Is it giving an unfair advantage to Model X if we use Model X as the judge? Maybe we should use multiple models as the judge, but now the model that's best at recognising and praising its own code has an advantage. The whole thing is just an unsolvable problem when an LLM is the judge.
There have been studies that showed that models tended to rate responses from their own family of models better than equivalent responses from other families, e.g. gpt-4 would prefer a response from gpt-3.
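A rough sketch of how that self-preference effect could be measured, assuming you already have a table of mean ratings indexed by judge family and author family. The family names and numbers below are made up for illustration and do not come from any study.

```python
from statistics import mean

def self_preference_gap(scores):
    """scores[judge][author] is the mean rating a judge gave that author's responses.

    Returns, per judge, the rating it gave its own family minus the mean
    rating it gave every other family. Positive means it favours itself.
    """
    gaps = {}
    for judge, row in scores.items():
        others = [s for author, s in row.items() if author != judge]
        if judge in row and others:
            gaps[judge] = row[judge] - mean(others)
    return gaps

# Made-up ratings on a 0-10 scale; not real data.
scores = {
    "family_a": {"family_a": 8.0, "family_b": 7.0, "family_c": 6.0},
    "family_b": {"family_a": 7.0, "family_b": 7.5, "family_c": 6.5},
}
print(self_preference_gap(scores))  # {'family_a': 1.5, 'family_b': 0.75}
```

A positive gap only indicates bias if the rated responses are of equivalent quality, which is the condition the studies mentioned above controlled for; on uncontrolled data a family may simply be better.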
Of course, no-one seems to be (publicly) doing the comparative measurements that might allow us to reach rational conclusions here.
The advice I've heard is to emphasize the traits you want, not discourage the traits you don't. So rather than saying "make no mistakes" you can do something like you suggested with writing it as "check your work" or "ensure you answer correctly and concisely".
To me this already disqualifies the benchmark. That statement is missing the most critical piece about senior engineers: senior engineers know how to obtain input for their work on their own, whether that's talking to customers or using metrics. They never come up with stuff on their own; that's junior behaviour.
Until a coding agent is able to *gather* the input on its own, it's never going to be "senior".
The real skill is being able to both pull the necessary information from these sources as well as being able to intuit gaps in that knowledge based on their understanding of the business and their domain expertise & wisdom. Sometimes you can't get a perfect picture, sometimes the people who should know aren't able to tell you what they really need. You still need to do the right thing.
A benchmark like this can potentially do the second part. But I don't think any model would be good at it, for now.
What you really need is an objective benchmark
But for a benchmark with the goal of picking a model to replace a human on some task? I really think the human should judge which is best.
I haven’t gotten very far yet but I had an idea for a personalized benchmark tool that walks through your git history and helps you craft prompts for tasks (bugs or features) already implemented by hand, so you can compare how different LLMs would do it.
I'm investigating/experimenting with using traditional NLP (stanza, spaCy, etc.) to try and grade the responses according to different metrics (is the response in first/second/third person?, is it written as poetry, prose, or drama? etc.). I'm also thinking about using information extraction and synonym detection to handle data queries and the like.
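As a toy illustration of one such metric (first/second/third person), here is a pronoun-counting heuristic using only the standard library. It is a stand-in, not the stanza or spaCy approach mentioned above, and the pronoun lists and the `dominant_person` name are assumptions made for this sketch.

```python
import re
from collections import Counter

# Hypothetical pronoun lists for a crude grammatical-person check.
PRONOUNS = {
    "first": {"i", "me", "my", "mine", "we", "us", "our", "ours"},
    "second": {"you", "your", "yours"},
    "third": {"he", "him", "his", "she", "her", "hers",
              "they", "them", "their", "theirs", "it", "its"},
}

def dominant_person(text):
    """Return 'first', 'second', or 'third' by pronoun count, or None if no pronouns."""
    # Splitting on letters only means "I'm" still yields the token "i".
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for person, pronouns in PRONOUNS.items():
        counts[person] = sum(1 for w in words if w in pronouns)
    if not any(counts.values()):
        return None
    # Ties resolve to whichever person comes first in PRONOUNS.
    return max(counts, key=counts.get)

print(dominant_person("I packed my bag and we left."))  # first
print(dominant_person("You should check your work."))   # second
print(dominant_person("No pronouns here."))             # None
```

A real grader would likely need part-of-speech tags and handling for quoted speech, which is where a proper NLP pipeline would come in.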
And LLMs have gotten good at handling these issues. There is asymmetric difficulty between generating a solution and verifying that it is correct. And over time LLMs are getting better and better, which allows training on synthetic data to make them better.
"When are all the software engineers unemployed?"
Anyone can run something and make a web page. These people just do it instead of questioning. Main difference. If everyone asks "how could you" "are you qualified" then we have nothing but gatekeeping.
But seriously, as an industry we're terrible at assessing engineering levels, I've worked with "senior engineers" who can't code and I've worked with "junior engineers" who could run rings around them.
Benchmarks like this should be much more precise about what they're actually testing, and what axes they're hard on. We also need to rise above prompts like "you are a senior engineer", it's woo, and it's far better to ask for precise outcomes.
Of course, it's impossible to know for sure what was LLM processed or not, but some of your posts (like this one) are getting classified that way.
But it's a lie. Nobody's paying you to make paintings. They're paying you to build machines. The comparison between "making working software" with "taste" always devolves into bikeshedding and subjective opinionism, uses subjective human feelings to describe what should be objective and functional, isn't rooted in scientific rigor, and detracts from the real purpose of the thing. The work doesn't actually get better by trying to apply artistic principles to engineering. It just feels better for the people making it.
Once you make the machine work, then you can go about gilding the lily. But this is unromantic, unsatisfying, boring. Since the inmates run this particular asylum, we end up with a benchmark that tries to accurately mimic the human ego as applied to software design. Thus the new Gods create their digital Adams and Eves in their image.
If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.
That’s the reason why I buy Apple products in private, because I value the design over the exorbitant prices they charge; and it’s the reason why I mull over code that’s already functional until it’s pleasing my ideas of elegance.
I can come up with all kinds of justifications and explanations why the code I’ve written a certain way is objectively better too - understandability matters to the next guy after all - but I won’t be ashamed for taking a certain pride in my work, even if nobody other than me ever values it. That’s fine.
When the LLMs finally take over coding altogether, you’ll have your raw, functional code. Won’t be long anymore. But for now, I’m a human, and I will do human things.
"Does it work" glosses over a bunch of things: is it fast, cheap, secure, reliable, easy to understand, easy to modify? And that's just for server software where you've nailed down all the functional requirements. Determining what the functional requirements are is its own question.
And all these other non-happy path requirements are somewhat in tension with each other, so what is ideal in one environment is not necessarily ideal in another.
And in particular, "easy to understand/modify" is truly subjective. Different people have different ideas of what easy to understand means. Even if we get to a world where AI is writing all our code, "easy to understand/modify for the AI" is still an important question. We've probably all seen prototypes that collapse under their own weight of slop by now.
If you'd please review https://news.ycombinator.com/newsguidelines.html and stick to the rules when posting here, we'd appreciate it.
https://senior-swe-bench.snorkel.ai/blog/2026-06-16-how-it-w...