103 comments

  • dvt 13 hours ago
    An alarming number of people don't understand that LLMs work via purely stochastic processes, so I'm happy to see in-depth pieces like this. I'm looking for a job and maybe this is why it's so hard to get a callback these days: resumes are just dumped in some LLM black hole and no one really knows how it works. The author says:

    > temperature 0.1 — low, supposedly nudging the model toward deterministic outputs

    This is not correct (and is briefly touched on later in the piece when he sets temperature to 0), temperature is not some kind of "deterministic" switch, but rather it affects the sampling distribution (which becomes more "spiky"—but is still very much a distribution).

    • miki123211 11 hours ago
      In theory, temperature 0 does make the LLM deterministic.

      Well, in theory theory, temperature 0 doesn't really exist. Mathematically, as lim temperature->0, the distribution gets spikier and spikier, the most likely sample goes to almost-but-not-quite infinity and the rest go to almost-but-not-quite 0. In practice, temperature=0 is literally a separate branch of an if statement that just picks the most common sample (using the actual formula that works for non-zero values would cause a zero division).

      However, due to things such as batching and even different kinds of floating point imprecisions for different algorithm implementations, the probability distribution itself often differs run-by-run, so what you sample from it also differs.

      • sobellian 13 minutes ago
        Even if it's deterministic that doesn't mean it isn't arbitrary. I can achieve determinism at any temperature by saving the seed. But that wouldn't make rejects feel much better knowing that if a bit was flipped in an arbitrary seed they would be scored differently.
      • sigmoid10 11 hours ago
        >in theory theory, temperature 0 doesn't really exist.

        It does exist very much, even if you go to pure math. Look at the softmax function and take the limit as T->0. It becomes a dirac-delta function. I.e. in a discrete setting (like for LLMs with a finite set of output tokens), probability P becomes one for argmax and 0 for everything else. Only in coding practice it is easier to implement T=0 as a simple if check that directly chooses argmax instead of calculating the limit of some function that includes 1/T quotients. But setting T to zero is, in both theory and practice, turning the usual probability function into greedy sampling.
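
        A minimal sketch of that "if check", assuming a plain list of logits and Python's standard library only. The function names and the example logits are hypothetical, not taken from any particular inference engine. It prints how the temperature-scaled softmax concentrates as T shrinks, and handles T == 0 with a separate argmax branch:

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax for temperature > 0 (numerically stabilised)."""
    m = max(logits)
    # Subtracting the max keeps every exponent <= 0, so exp() cannot overflow.
    weights = [math.exp((x - m) / temperature) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

def sample_token(logits, temperature, rng):
    """Pick a token index; temperature == 0 is handled as plain argmax."""
    if temperature == 0:
        # Greedy branch: no division by zero; max() returns the first maximum.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5]  # hypothetical logits with a unique maximum
for t in (1.0, 0.1, 0.01):
    print(t, [round(p, 6) for p in softmax_with_temperature(logits, t)])
print(sample_token(logits, 0, random.Random(0)))  # greedy pick: index 0
```

        With a unique maximum, the T -> 0 limit and the argmax branch agree, which is the point being made here. Ties are a separate question.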

        • 317070 8 hours ago
          > Look at the softmax function and take the limit as T->0. It becomes a dirac-delta function.

          In pure math, it does not always do that. It becomes a dirac-delta comb with equal weight on every maximum. There can be more than 1 maximum. Setting the temperature to zero turns into greedy sampling, but greedy sampling is not necessarily deterministic as you can have multiple equally optimal options.
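
          For reference, the limit with ties can be written out as follows. The notation is an assumption for illustration: z_i are the logits, z_max is their maximum, and M is the set of indices that attain it.

```latex
\[
p_i(T) = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}
       = \frac{e^{(z_i - z_{\max})/T}}{\sum_j e^{(z_j - z_{\max})/T}},
\qquad
M = \{\, i : z_i = z_{\max} \,\}
\]
\[
\lim_{T \to 0^+} p_i(T) =
\begin{cases}
  1/|M| & \text{if } i \in M, \\
  0     & \text{otherwise.}
\end{cases}
\]
```

          Every term with z_j < z_max has a negative exponent that goes to minus infinity, so it vanishes, while each term with z_j = z_max stays at exactly 1. With a single maximum this gives probability 1 on the argmax. With logits (1, 0, 1) it gives (1/2, 0, 1/2).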

          • sigmoid10 7 hours ago
            That is not a problem for LLMs, because in practice floating point inaccuracies (in particular after exponentiation) prevent values from being exactly equal. That's why greedy sampling generally produces deterministic output for LLMs. The real gotchas are elsewhere (like with batch inference as we've seen with earlier GPTs). But unlike what the earlier comment says, this is a non-issue mathematically.
            • skissane 7 hours ago
              > That is not a problem for LLMs, because in practice floating point inaccuracies (in particular after exponentiation) prevent values from being exactly equal

              Any two tokens ending up with the exact same logit is very unlikely, but not impossible; and as the number of output tokens grows, the odds that it will eventually happen get higher and higher.

              I suppose, to ensure determinism, rank by logit then token ID, so you still have a deterministic winner even if occasionally two tokens get precisely identical logits.
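
              A small sketch of that tie-break, assuming logits arrive as a plain Python list indexed by token ID. The function name is hypothetical:

```python
def greedy_pick(logits):
    """Return the index of the largest logit; ties go to the lowest token ID."""
    # Key orders by higher logit first, then lower token ID, so the winner is unique.
    return min(range(len(logits)), key=lambda token_id: (-logits[token_id], token_id))

print(greedy_pick([1.0, 0.0, 1.0]))  # tokens 0 and 2 tie; prints 0
```

              The explicit key makes the rule visible rather than relying on whatever a library's argmax happens to do with duplicates.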

            • StilesCrisis 5 hours ago
              "Makes unlikely" is very different from "prevents."

              If there's one counterexample, it's not really deterministic.

              • rkozik1989 4 hours ago
                Exactly. Consider the scenario where laws are at play and violating them could cost companies thousands. Recently my father received a 'request for address' letter addressed to me at his nursing home; the building has always been a nursing home, and he's in his mid-70s. That's very obviously a violation of the Fair Debt Collection Practices Act. Imagine the implications if the law firm in question used an AI-assisted data-enriching product to find this information. That SaaS company is liable not only to that one law firm but to every law firm that uses its software. It's potentially a federal class action lawsuit.

                My point is, deterministic logic matters in certain circumstances 100% of the time. Forcing the LLM to make something unlikely is not good enough because a series of mistakes could very quickly bankrupt the company.

                • Lerc 3 hours ago
                  >My point is, deterministic logic matters in certain circumstances 100% of the time. Forcing the LLM to make something unlikely is not good enough because a series of mistakes could very quickly bankrupt the company.

                  If your argument is that the danger of equal values being selected inconsistently breaks determinism, that's a trivial problem to solve.

                  Any non-infinite precision numbering system is by definition at the limits of its precision when equal values occur. If you need to order such values you can extend the precision and add on a deterministically unique tiny value (position, order encountered, etc.). Your original values stay in the same precision range but they are now unique.

                  It's usually more likely that you want to sacrifice a little precision for determinism, so you can quantise to allocate the range where you apply the unique ID.

                  For example, if you had an array of 256 fp32 values but you required them to be unique, you can lop off 8 bits of mantissa and replace them with each value's index in the array. Every value is then unique.

                  Granted token dictionaries make for some fairly hefty indexes now, but the principle applies in general, it's easily solvable if you are prepared to spend some precision or do some extra calculation.
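
                  A rough sketch of the fp32 example, using only Python's `struct` module. The helper name and the sample values are hypothetical, and it assumes finite inputs (overwriting mantissa bits of an infinity would turn it into a NaN):

```python
import struct

def tag_with_index(value, index):
    """Overwrite the low 8 mantissa bits of an fp32 value with an index in [0, 255]."""
    if not 0 <= index < 256:
        raise ValueError("index must fit in 8 bits")
    bits = struct.unpack("<I", struct.pack("<f", value))[0]  # fp32 bit pattern
    bits = (bits & 0xFFFFFF00) | index                       # clear low 8 bits, insert index
    return struct.unpack("<f", struct.pack("<I", bits))[0]

values = [1.5, 1.5, 1.5, -2.25]  # hypothetical array containing duplicates
tagged = [tag_with_index(v, i) for i, v in enumerate(values)]
print(tagged)
print(len(set(tagged)) == len(tagged))  # True: every tagged value is distinct
```

                  Each tagged value differs from the original by only a small relative amount, since only the lowest 8 of the 23 mantissa bits change. For negative values a larger index pushes the value further from zero, so anyone who needs "lowest index wins" ordering would have to account for the sign.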

            • 317070 2 hours ago
              > for LLMs, because in practice floating point inaccuracies (in particular after exponentiation) prevent values from being exactly equal.

              In one thinking trace of 10k tokens, with fp16 or bf16 logits, I don't reckon a collision is rare? There are only 65k floating point numbers with that accuracy. And an agent can quickly rack up 100k tokens, so while not every token will have such a collision of equiprobable logits, it is not rare.
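
              To make the collision concrete, here is a small sketch that rounds two distinct values to IEEE 754 half precision with Python's `struct` module (format code `e`). The two logits are made up for illustration:

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

a, b = 10.001, 10.002            # hypothetical logits, distinct as doubles
print(a == b)                    # False
print(to_fp16(a) == to_fp16(b))  # True: both round to 10.0 in fp16
```

              Between 8 and 16 the spacing of fp16 values is 2^-7 (about 0.0078), so any two logits that land in the same gap collapse to the same number. This says nothing about how often that happens to the top two logits in a real model.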

        • thaumasiotes 8 hours ago
          > It becomes a dirac-delta function. I.e. in a discrete setting (like for LLMs with a finite set of output tokens), probability P becomes one for argmax and 0 for everything else. Only in coding practice it is easier to implement T=0 as a simple if check that directly chooses argmax instead of calculating the limit of some function that includes 1/T quotients.

          I don't understand the distinction you're drawing. A Dirac delta function is a "simple if check".

          • sigmoid10 7 hours ago
            The point is that the case T=0 doesn't just "exist" as a special code branch - it is still well defined mathematically without any change to the output function. What the above comment refers to with the extra "if" check is just a limitation of computers not liking to divide anything by zero, even if the actual function exists and is well behaved at zero. It is not some weird or special theoretical construction.
            • StilesCrisis 5 hours ago
              Floating point defines n/0 the same as math. It's infinity as long as n isn't zero.
              • simiones 5 hours ago
                In almost all forms of math, the value n/0 is undefined. It's definitely not infinity, for two reasons - depending on the value of n, it can be negative; and neither inf nor -inf are numbers, so they can't be the result of an equation (unless you look at transfinite equations).

                What you can do in math is talk about the limit of a series of fractions as the denominator approaches 0, and that's where you get some relation to infinity or -infinity. But the limit can also be any other number, if the numerator also gets closer to 0; or it can not exist, if the function oscillates.

                • StilesCrisis 4 hours ago
                  I explicitly didn't say "infinity or negative infinity" because I didn't think that level of pedantry would be needed here on HN. I guess I was wrong.
                  • simiones 1 hour ago
                    That's not the problem, and this is not just pedantry. It's just not correct to say that n/0 = inf, nor even to say that positive_n / 0 = inf, in any normal math context.

                    For example, if you accepted that n/0 = inf just like n/1 = n, then you'd conclude that n/0 + 3 = inf + 3 = inf, so n/0 + 3 = n/0, so 3 = 0. Or you'd want to do weird things like asking what is sin(inf).

                  • throw-the-towel 3 hours ago
                    All discussions of mathematics assume maximal possible pedantry.
                  • jdiff 4 hours ago
                    It's not positive or negative infinity. It is simply undefined. Math has many conventions, and you can define your own convention that it does equal some flavor of infinity, but that is only a convention, and not a universal one.
              • freehorse 4 hours ago
                > as long as n isn't zero

                Which is the case with softmax function, as for T=0 you end up with a fraction that either becomes 0/0 or inf/inf [0]. So you do need branching as floating point arithmetic is not gonna get you there.

                [0] except for weights that are exactly 0

                edit: thinking more about it, one could always express the softmax formula in ways that this could work with floating point arithmetic but it would be very inefficient and sort of pointless
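
                A quick sketch of what the unbranched formula does at T=0 under IEEE 754 semantics. It assumes NumPy is available (plain Python floats raise `ZeroDivisionError` instead of returning inf), and the function name and logits are hypothetical:

```python
import numpy as np

def naive_softmax(logits, temperature):
    """Textbook softmax(logits / T) with no special case for T == 0."""
    # Silence the floating-point warnings so the raw IEEE results are visible.
    with np.errstate(divide="ignore", invalid="ignore", over="ignore"):
        scaled = np.asarray(logits, dtype=np.float64) / np.float64(temperature)
        weights = np.exp(scaled)        # exp(+inf) = inf, exp(-inf) = 0
        return weights / weights.sum()  # inf / inf = nan

# Positive logits come out as nan, the negative one as 0.
print(naive_softmax([2.0, 1.0, -1.0], 0.0))
```

                The entries that should have carried the probability mass end up as nan, so there is no maximum left to pick, which is why a branch (or a reformulated expression) is needed.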

      • teiferer 6 hours ago
        > Mathematically, as lim temperature->0, the distribution gets spikier and spikier, the most likely sample goes to almost-but-not-quite infinity and the rest go to almost-but-not-quite 0.

        That's not how limits work. As the temperature goes to 0, the rest goes to 0. That's it. The "almost-but-not-quite" is part of the "goes to".

        Let's say f(x) = 3x+1. It's a continuous function. If we let x go to 10, f(x) goes to 31. Not "almost-but-not-quite 31". No, to 31. (If you don't have a continuous function then it's the same argument, but less intuitive to illustrate.)

      • pmarreck 1 hour ago
        It is not deterministic because the order of computations in a typical multithreaded system is not deterministic and also because when combined with the devil that is IEEE754, it gets even less deterministic.
      • msdz 6 hours ago
        > However, due to things such as batching and even different kinds of floating point imprecisions for different algorithm implementations, the probability distribution itself often differs run-by-run, so what you sample from it also differs.

        Exactly. While I’m assuming this won’t be news for most here, for those that are still new and/or curious about some more explanation on e.g. the floating-point imprecisions, see this nice article: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...

      • lelandbatey 10 hours ago
        As I understood it, the "randomness" affecting what is selected at any temperature still comes from a PRNG or CSPRNG (or whatever RNG you want, maybe a hardware one), and if you were to swap that out with something deterministic you'd get the same results every time (barring non-determinism in other parts of the OS/drivers/maybe even hardware).

        But theoretically, the output of every LLM is seed-driven (or could be if you wrote the software to isolate it) just like any computer software. It's just none of the software written (even llama.cpp AFAIK) chooses to support stable-seeding due to the changes in stuff like CPU/Vulkan/CUDA/Metal differences making it difficult to make consistent.

        They could though! Hopefully one day someone implements it into the mainstream LLM-engine software and it gets exposed in the APIs serving the models. It'd do a lot to show folks the "internals" of these models.
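
        As an illustration of the seed-driven part only, here is a sketch using Python's standard `random` module. The distribution is a made-up stand-in for a model's next-token probabilities, and the function name is hypothetical:

```python
import random

def sample_sequence(probs, seed, length):
    """Draw `length` token indices from a fixed distribution using a seeded PRNG."""
    rng = random.Random(seed)  # private generator, independent of global state
    return rng.choices(range(len(probs)), weights=probs, k=length)

probs = [0.8, 0.19, 0.01]  # hypothetical next-token distribution
run_a = sample_sequence(probs, seed=1234, length=20)
run_b = sample_sequence(probs, seed=1234, length=20)
print(run_a == run_b)  # True: same seed and same distribution give the same draws
```

        This only holds when `probs` is bit-for-bit identical between runs. If the numbers feeding the sampler differ slightly, the same seed can still produce different tokens.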

        • toolslive 10 hours ago
          It's probably due to the fact that it's a cloud service. You have no guarantee that your next request will go to the same machine. So even with an identical seed, and temp 0 you might get different hardware and hence different accuracy/noise in the floating point operations.
          • rightbyte 8 hours ago
            How can there be noise in floating point operations? I could buy something like completion order for parallelized batches, i.e. adding a+b+c is different from a+c+b etc.
            • StilesCrisis 5 hours ago
              IEEE-754 doesn't mandate exact results for functions like exp(x). It mandates things like "within 2 ULP of the true answer." Hardware vendors are free to implement these functions in any way that meets the error tolerance.
            • toolslive 6 hours ago
              While the IEEE 754 standard ensures that individual basic operations are deterministic and strictly bounded, it does not guarantee that an entire program will yield bit-identical results on all CPUs.

              CPUs and their execution environments introduce subtle hardware variations, architecture choices, and compiler optimizations that break bit-level consistency.

              (same for GPU/TPU, ...)

              • vlovich123 5 hours ago
                Parent is correct - the math is very deterministic if you can guarantee it’s running repeatedly on the same machine and you’re not processing “random” requests in parallel. The compiler is irrelevant because once the code is generated it’s not getting recompiled and thus isn’t a source of non determinism (and generally if you don’t touch the math the compiler will consistently emit the same underlying machine code).
                • simiones 5 hours ago
                  This sub-thread was about cloud environments, where different requests may be served by different hardware. And it's in fact very likely that there will be a mix of different hardware from different vendors, in any particular LLM cloud for now.
              • throwaway173738 5 hours ago
                It is, after all, a fundamentally voltage-based process, and the logical “no-man’s land” is chosen to limit the likelihood of a weak component producing faulty logic, but it’s impractical to run through the set of all possible starting states and to verify that after an unbounded number of clock steps the machine reaches a predictable end state on all of the devices being manufactured.
        • microtonal 10 hours ago
          Stable seeding is not enough. A lot of modern, fast compute kernels are nondeterministic. Floating point multiplication/addition is not strictly associative and e.g. reductions can combine results from different threads in different orders (e.g. through atomic ops). You can write kernels to be deterministic, but it is generally less efficient.
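
          A tiny illustration of the non-associativity, in plain Python doubles. The three values are arbitrary:

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # one grouping of the sum
right = a + (b + c)  # the same three numbers, grouped differently

print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6
```

          A parallel reduction that combines partial sums in whatever order threads finish is effectively choosing between groupings like these at run time.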
          • vlovich123 5 hours ago
            They are only non-deterministic when you’re doing batching and a kernel ends up running across a “random” set of token streams. If you’re only processing one user’s request, they’re very much deterministic.
        • nok22kon 9 hours ago
          that's incorrect in the presence of batching. it's tough work making it truly deterministic:

          https://x.com/FireworksAI_HQ/status/2069873437217276015

          • vidarh 9 hours ago
            It's not that hard. What is hard is making it truly deterministic and retain high throughput.
        • gaflo 5 hours ago
          PRNG is deterministic.
      • nullc 7 hours ago
        If you make an exact integer implementation and run with temp=0 it's deterministic.

        You don't even need temperature 0, just make a random seed for the sampler part of the input and then it's deterministic as a function of the input.

        But running autoregressive models at temp=0 tends to expose pathological behavior, because the training process produces a function with a lot of gain so it's prone to feedback on its own noise.

      • chrisjj 9 hours ago
        > However, due to things such as batching and even different kinds of floating point imprecisions for different algorithm implementations, the probability distribution itself often differs run-by-run

        The implementation does not often differ run by run.

        • skissane 7 hours ago
          > The implementation does not often differ run by run.

          If you use a cluster, or even multiple clusters, and they have non-identical hardware, then two consecutive runs could end up being routed to nodes having different GPU models with slightly different floating point behaviour, or even software differences (e.g. newer GPU offers some feature usable to speed up calculations which older model lacked; same code can use the feature when it is available, fall back to slower alternative if it isn’t). The larger your scale, the greater the odds it will happen.

    • PaulHoule 1 hour ago
      The whole problem of text understanding is a problem of reasoning under uncertainty, that is, you can't really be sure which witch people are talking about all the time. A person you might hire might be successful or unsuccessful at the role, no matter what hiring process you use. Two people might look at the same resume and come to different conclusions. Two patients with the same symptoms and clinical presentation might have different diseases, etc.

      I don't buy the story that the old AI died primarily due to the cost of knowledge base maintenance [1], but rather the lack of a universal system of reasoning over uncertainty.

      For me it's a running gag that Spock was always saying things like "Captain, we have a 21% probability of surviving this mission" when Bayes teaches us your probability distribution has a probability distribution, "we have a β(5,1) chance of surviving this mission" is more like it.

      To that end it wouldn't be too crazy to run a resume through that machine 100 times and look at the probability distribution of the score.

      [1] then again I am the kind of maniac who will sort images on a tablet lying in bed until my visual system malfunctions
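
      A back-of-the-envelope sketch of the "distribution over a probability" idea, in plain Python. It assumes each run is reduced to pass/fail and starts from a uniform Beta(1, 1) prior; the run counts and function names are invented for illustration:

```python
def beta_summary(alpha, beta):
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    variance = (alpha * beta) / (total ** 2 * (total + 1))
    return mean, variance ** 0.5

def posterior_from_runs(passes, fails, prior=(1, 1)):
    """Beta posterior over a pass rate after repeated runs (uniform prior by default)."""
    return prior[0] + passes, prior[1] + fails

# Hypothetical outcome: the same resume passes the screen 62 times out of 100 runs.
alpha, beta = posterior_from_runs(passes=62, fails=38)
print(beta_summary(alpha, beta))
print(beta_summary(5, 1))  # the Beta(5, 1) case: mean 5/6, with a wide spread
```

      The standard deviation is the part a single "21%" leaves out: it says how much the estimate itself could move.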

    • mywittyname 58 minutes ago
      > This is not correct

      Several of my claimed AI-expert colleagues repeat this as though it's gospel. I've heard "set the temperature to 0 so we get consistent results" more times than I can count.

      • Terr_ 15 minutes ago
        I imagine it's much like game-developers saying: "Set a fixed seed so we get consistent gameplay results."

        Yeah, it can work, but it is subject to so many potential pitfalls that you can't assume it'll work. It's a property you have to actively design for and rigorously test to be sure the system can deliver it for your use-case.

    • vessenes 7 hours ago
      To be clear, temperature 0 is deterministic and will produce the same output for exact duplicate inputs, across all seed choices.

      Provided:

      * If it’s MoE we are talking about, that the duplicate inputs are for the whole batch (yes, your batch neighbours can impact your choice of experts. Blergh.)

      * Your kernels are deterministic

      * There’s no system wide effort switch that responds to, e.g. work load across the cluster (for a thinking model)

      Upshot:

      Temperature 0 is not deterministic in probably any existing cloud infra, but it could be for edge inference pretty reliably.

      To your quibble on 0.1 being more deterministic - I think it’s a pretty fair summary - we’re going to sample much more from the ‘temp 0’ answer at 0.1 than we would at temp 0.9, no?

      • Dylan16807 7 hours ago
        Even then it's deterministic in the way a hash function is deterministic. Change one letter and you can get a completely different output. What people actually want is something continuous.
        • vessenes 4 hours ago
          Agreed on the desire for continuous behavior. That said, in a modern LLM, is this hash analogy accurate? I would be surprised if a single letter changed most zero temp force ranked outputs.

          E.g:

          “Where is the Eiffel Tower Located? One word only.”

          “Where is the Effel Tower located? One word only.”

          “Where is the Eiffel Tower located? One wor only.”

          I’d be very surprised if those got different answers from even a small local model at temp 0.

          • knome 1 hour ago
            For a single word response, perhaps.

            But for anything else I wouldn't.

            The entire chain will be affected from the different tokenization on down. Even if it lands in roughly the same semantic area, it doesn't mean it will land there with anything like the same syntactic selections. Anywhere there were multiple near-tokens could easily select a different route based on even minor fluctuations in the starting conditions. It's chaotic.

          • forlorn_mammoth 35 minutes ago
            "You are a helpful/less assistant"

            Give it a try. 4 letter difference. Add a few 100 tokens describing the task, such that the change becomes a tiny fraction of the input.

            Discontinuities everywhere.

        • guhcampos 7 hours ago
          This is it. People mistake deterministic for precise/exact/correct. It's not.
    • aesthesia 12 hours ago
      A distribution with all probability mass on one outcome is deterministic, so in principle, setting temperature to 0 _should_ result in deterministic outputs. There are a few reasons it might not, but I don't think any of these apply when running a local model like the author did.
      • 317070 11 hours ago
        > so in principle, setting temperature to 0 _should_ result in deterministic outputs

        It is a common misconception, but it is not true even in principle. If I have 2 or more logits which are equal to the maximum of my logits, I will sample uniformly random from them with any temperature, even zero. Sampling from softmax([1, 0, 1]) is still stochastic at temperature 0, because the limit is to sample uniformly from the first or the last element.

        Anyway: "GPUs don't do deterministic matrix multiplications" is the biggest source of randomness in LLMs. GPUs put the associativity of the sums in matrix multiplications in arbitrary order, and this has a huge impact on the logits coming out of the neural network.

        • jstanley 9 hours ago
          > "GPUs don't do deterministic matrix multiplications" is the biggest source of randomness in LLMs.

          But this isn't a fundamental property of LLMs, it's just an implementation detail. It's pretty obvious that if you evaluate the matrix multiplications correctly and deterministically sample from the highest-probability outputs, you will have a deterministic LLM.

          • vbarrielle 9 hours ago
            It may be an implementation detail, but in practice, if the only way to get a deterministic output is to run on the CPU, then it's not going to be usable.
            • 317070 8 hours ago
              Actually, Google's TPUs are also deterministic!
            • Dylan16807 7 hours ago
              You can tell GPUs what order to do math instructions in.
        • EvgeniyZh 11 hours ago
          You don't have to sample uniformly. You could take the lowest index of all maxima. But yeah, the main source of randomness is non-deterministic matmul, and temperature does nothing with it
        • DougBTX 9 hours ago
          > GPUs put the associativity of the sums in matrix multiplications in arbitrary order

          That’s user-controlled too, not an inherent property of GPUs:

          https://docs.pytorch.org/docs/2.12/generated/torch.use_deter...

          • vbarrielle 9 hours ago
            The matrix multiplication is only deterministic for sparse-dense products under these settings:

            > torch.bmm() when called on sparse-dense CUDA tensors

            And it's not listed under the operations that raise an exception otherwise, so I'm not sure the docs promise that dense-dense matrix-matrix products are deterministic.

            • DougBTX 6 hours ago
              Oh, thanks, that’s interesting, I thought it covered that too!
      • easygenes 12 hours ago
        There are. If the kernels are nondeterministic (e.g. timing issues) there are minor changes between runs, on a single system, even with greedy decoding enabled (typically what temperature=0 achieves).
      • IshKebab 12 hours ago
        Setting the temperature to 0 should give deterministic results but that's not any better - it's just hiding the huge variance by only taking one sample.
      • croes 10 hours ago
        So you would get always the same result, but it could be the wrong one
        • srdjanr 10 hours ago
          Of course, nothing can guarantee the right answer from LLMs
      • valzam 12 hours ago
        I mean the easiest explanation would be that the model harness doesn't always take the most likely token but does top-k sampling or similar. Temperature just means that probabilities get more and more equalized, boosting the chance that an unlikely token gets picked. But even with temp 0 you could have 0.8 T1, 0.19 T2, ... and sometimes sample T2.
        • aesthesia 12 hours ago
          No, this can't happen at temperature 0. The formula defining temperature-adjusted softmax isn't strictly defined at 0, but taking the limit (in the case where all logits are distinct) results in probability 1 being placed on the largest logit. Samplers will typically special case temperature 0 and pick the most likely token at each step.
          • dvt 12 hours ago
            This is a very authoritative answer that should be more nuanced and caveated as implementation-dependent. In some cases, repetition penalties take precedence over sampling; top_k and top_p can also be handled before or after the temperature step. In other cases, `0` is turned into like 1e-10 or some super tiny float value (which can drift if you do any arithmetic with it). Routing, quantization, etc. can also have an effect on sampling. And yes, in some cases, setting temperature to 0 can mean "pure greedy decoding" which makes the decoder about as deterministic as it can get.
    • margalabargala 1 hour ago
      > I'm happy to see in-depth pieces like this

      It's somewhat ironic that this "in depth" piece was written by an LLM as well.

    • lelanthran 5 hours ago
      > temperature is not some kind of "deterministic" switch, but rather it affects the sampling distribution (which becomes more "spiky"—but is still very much a distribution).

      You're correct. The confusion arises because we use the word "non-deterministic" when we mean "probabilistic".

      I tried to explain it better: https://www.lelanthran.com/chap15/content.html

    • make3 11 hours ago
      A more spiky distribution is exactly what makes the distribution closer to deterministic. That's not the point though. Even with greedy (deterministic) decoding, it is still a black box that reacts to its inputs in unpredictable ways. Switching one word around might lead to different scores, for example.
      • fluoridation 7 hours ago
        Yeah, this is the forest that the people arguing about math trees are missing. It doesn't matter that the algorithm is deterministic if the algorithm passes the input through a cryptographic hash function to make a yes/no decision. The result may be perfectly reproducible and still nonsensical in its distribution with respect to its input domain.
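
        A toy version of that hash-based decision, using Python's `hashlib`. The decision rule and the two input strings are invented for illustration:

```python
import hashlib

def hash_decision(text):
    """Reproducible but arbitrary yes/no: parity of the first SHA-256 byte."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return digest[0] % 2 == 0

resume_a = "Ten years of backend experience."  # hypothetical input
resume_b = "Ten years of backend experience!"  # one character changed

print(hash_decision(resume_a) == hash_decision(resume_a))  # True: fully reproducible
print(hashlib.sha256(resume_a.encode("utf-8")).hexdigest()[:16])
print(hashlib.sha256(resume_b.encode("utf-8")).hexdigest()[:16])
```

        The same input always gets the same verdict, yet the two digests share nothing recognisable, so the verdict for a near-identical input is effectively a coin flip.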
    • bhanu786 10 hours ago
      Agree
    • mtharrison 5 hours ago
      Small refinement: the underlying model isn’t stochastic at all. The forward pass is a deterministic function of the weights and input, it just produces a probability distribution over the next token. The stochasticity is an optional sampling step layered on top, not something inherent to LLMs. Greedy/argmax decoding (or temperature 0) makes the whole thing deterministic.

      So “purely stochastic” overstates it a bit: the distribution is computed deterministically, and you choose whether to sample from it or not.

      • simiones 4 hours ago
        There are more layers to this problem, if we want to get into the details. The LLM is defined in terms of floating point operations, and those are not actually fully deterministic, on most hardware and in most performant implementations.

        IEEE 754 only specifies precision requirements for certain operations, not precise bit patterns (e.g. for exponentials). So, at least in principle, the same hardware performing the same operation could produce different results at different times, as long as they are close enough to the theoretical answer. I'm not sure if any hardware actually works like this.

        IEEE 754 also specifies that many of the basic arithmetic operations are not associative - so any reordering (which is common when batching multiple queries at the same time) will introduce indeterminacy from the perspective of your own query (that is the result for your query will change depending on what other query happens to be processed at the same time, which is not under your control).

        Finally, even if we take the case when a query is processed alone, and even if one particular hardware is completely deterministic, the result will be different on different hardware - which can again look like non-determinism if you're sending your query to a load balancer.

        So, the math for LLMs is deterministic in theory, but implemented with non-deterministic approximations & optimizations in practice, and their results are then normally used only as a probability distribution to be sampled from.

    • spwa4 11 hours ago
      > An alarming number of people don't understand that LLMs work via purely stochastic processes ...

      I've been studying AI for 20 years. What really needs to be added to this statement is:

      "An alarming number of people don't understand that LLMs work via purely stochastic processes - and so does human thinking. People do NOT arrive at the same conclusion if merely the weather's different. Worse: with human thinking not only do most people not think this is real, a subset of people will actively fight the idea. Of course, depending on the weather"

      • mahogany 7 hours ago
        Every time people point out a limitation or constraint of LLMs, I see a comment that is to the effect of “but humans…”. I don’t understand why this comparison is relevant to this particular thread. Is it just an amusing similarity?
        • efromvt 4 hours ago
          I think it's often useful to push the conversation toward "we built a system for humans that dealt with this; what from that is or is not applicable for agents in the same context?" Humans randomizing resume review for screening is pretty well known; I've seen companies try to fight it with things like hiding information, panel reviews, etc - it's unclear to me how effective those would be for agents (honestly, it was unclear how effective those were for humans). I was depressed about the hiring process before we had AI screening and I remain depressed about it.
        • castlecrasher2 2 hours ago
          It may seem trite but the point is that if separate humans were assigned the same task the LLM was here the results would be similarly non-deterministic.
      • smusamashah 11 hours ago
        We expect computers to be consistent on the other hand. A calculator will always give you the same answer unless some chip gets struck by a particle. LLMs are on computers and should be fairly consistent too.
        • vidarh 8 hours ago
          And this lies at the heart of the problem.

          We expect computers to be consistent despite running programs that are not designed to be consistent.

          This despite the fact that we have lots of experience of programs running on computers that produce wildly inconsistent outputs.

          But for some reason some people choose to assume LLMs should act like a calculator instead of any of those programs.

          • chrisjj 8 hours ago
            > This despite the fact that we have lots of experience of programs running on computers that produce wildly inconsistent outputs.

            The average user has very little. A word processor with inconsistent pagination or a spreadsheet with inconsistent totals is rightly seen as faulty.

            • vidarh 4 hours ago
              The average user is familiar with games.
            • newswasboring 7 hours ago
              Yeah but daily tools have lots of complexity which appears as non determinism (if we are thinking only UX, not actual determinism). For example, try moving an image in the word doc. I have been using MS word my entire life it seems, still don't know what the rules are lol.
              • chrisjj 6 hours ago
                You're using a mouse? I have no problem getting reliable output from reliable input - through keyboard.
      • thisisit 7 hours ago
        The same person is not going to give you three different answers within a span of minutes. Especially when nothing fundamental has changed. People might or might not update their views depending on their biases.
        • rkuodys 6 hours ago
          I'm pretty sure the personality tests are created specifically for the reason that a single person can have fundamentally different (or conflicting) beliefs about himself in a matter of minutes. You can say "I am an honest person" and the next minute you can say "I never lie" - and both cannot be true for an average person.
      • miki123211 10 hours ago
        What's even worse, different humans have different weights.

        If you train two different LLMs and replace what data they "see" in batch n, that doesn't affect the data they see in batch n+1, or any further batches. In LLMs, you can introduce "noise" into the training process, but that noise doesn't really compound.

        Humans learn from experience, not from data, and their experiences at age n shape what experiences they seek (and hence train on) at age n+1. A small amount of "noise" injected into their "training", let's say hearing a group of friends discuss a movie while their identical twin goes to the bathroom, can compound into them watching that movie, which can compound into them forming an identity around that genre, and so on, until they're two completely different people, trained on completely different "data mixtures".

        • chrisjj 8 hours ago
          > What's even worse, different humans have different weights.

          Far worse would be different humans having the same weights.

      • mnky9800n 11 hours ago
        Test retest reliability is a thing in psychometrics.
        • spwa4 9 hours ago
          Ah cool. So there is data? How consistent are humans?

          What I'd really love is an actual number for a "human hallucination rate". How often will a random human

          1) claim something that is wrong

          2) defend the wrong claim and/or logic even when the problem is pointed out to them

          (and this of course outside of the usual topics. In politics? I don't care. In religion? Don't care (well, maybe a bit more than politics). Let's say in physics or popular logic or something like that)

          • mnky9800n 6 hours ago
            There is evidence that children will oscillate between understanding and not understanding while learning topics. Philip Sadler at Harvard published about this but I can't find the paper I'm thinking of on his Google Scholar. Too many papers!

            But moreover, to verify a test item you need to make sure that people will select the same answers under the same conditions at different times. People generally forget the specific questions they were asked if you ask them the same questions a month later, so being able to get them to answer the same way each time is important. It is assumed the people have some static knowledge of a topic in this scenario.

            If you want to consider a statistical examination of how people answer tests and how we assess knowledge and other things in people through surveying you can read about item response theory and rasch analysis.
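
            For a concrete feel of the test-retest idea: one common estimate is simply the Pearson correlation between scores from two sittings of the same test. A small sketch, where the two score lists are invented numbers and `pearson_r` is a hand-rolled helper, not a psychometrics library call:

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: the same six people, tested twice a month apart.
first_sitting = [12, 15, 9, 20, 17, 11]
second_sitting = [13, 14, 10, 19, 18, 9]

retest_r = pearson_r(first_sitting, second_sitting)
print(f"test-retest r = {retest_r:.3f}")
```

            The same calculation could be run on repeated LLM scores for the same set of resumes, which would put the "how consistent" question on the same scale for both.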

      • cyanydeez 8 hours ago
        a studied example is sampling judicial decisions before lunch and after lunch. judges are more lenient on a full stomach.
        • ThrowawayR2 3 hours ago
          That was a single study and its finding is at the very least disputed, if not debunked, e.g. https://news.ycombinator.com/item?id=41091803
        • WhrRTheBaboons 7 hours ago
          how did they account for sampling bias? a judge might leave easier cases for after lunch. people with control over their schedules usually ease themselves back into it after breaks.
    • nok22kon 9 hours ago
      it's a bad idea in general to use non-1.0 temperature. there is a reason labs are strongly recommending using 1.0.

      using low temperature is more deterministic, but the cost is the model becomes "dumber"

      • tipsytoad 9 hours ago
        1.0 is actually pretty arbitrary and way too high as a general rule. Something like 0.3 is a more sensible default
        • programjames 2 hours ago
          1.0 is "natural units". If your energy corresponds to nats, you should be using temperature 1.0. If your energy corresponds to bits, you should be using temperature ln(2) ~= 0.7. The optimization pressure is

               max nats = max entropy + energy / temperature
          
          
          Why might energy correspond to bits or nats? Imagine your goal is to play as many interesting games of chess as possible in a tournament. This implies you have to keep winning. If you look at the RL environment from the right perspective, you can turn it into optimizing bits or nats.
        • 317070 8 hours ago
          If RL was used to train the model, the model will have been trained on its own sequences. Those will have been generated with a temperature of 1.0. They must be, otherwise you would get a premature collapse or explosion of your entropy if the temperature was respectively lower or higher.

          After that RL step, you want to stick to the RL distribution, and so keep a temperature of 1.0. Other temperatures will drive the model out-of-distribution.

          That is why the sampling step for agents or thinking LLMs is usually kept at a temperature of 1.0.
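
          A rough sketch of what the temperature knob does to the sampling distribution, assuming the usual softmax-over-logits formulation. The logits are invented and the function names are illustrative, not any vendor's API. T = 0 is rejected here because the formula divides by T; as noted upthread, implementations usually special-case it as argmax.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as greedy argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_nats(probs):
    """Shannon entropy in nats; lower means a spikier distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical logits for four candidate tokens.
example_logits = [2.0, 1.0, 0.5, -1.0]

for t in (0.3, 1.0, 2.0):
    probs = softmax_with_temperature(example_logits, t)
    print(t, [round(p, 3) for p in probs], round(entropy_nats(probs), 3))
```

          At T = 1.0 the probabilities are exactly the model's own softmax output; anything else sharpens or flattens it, which is the sense in which other temperatures move sampling away from the distribution the model was trained under.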

        • zipy124 9 hours ago
          It really depends on the application does it not? I'm not an LLM guy, but for creative tasks like storytelling wouldn't you want a higher temperature usually? Happy to gain insight from anyone with experience here :)
        • embedding-shape 9 hours ago
          Heavily depends on the model architecture and the implementation though, I don't think you can say what values are better than others without first specifying those, otherwise it's straight up guessing, ironically.
        • nullc 7 hours ago
          If you use a model in a configuration far from where it was RLed you get no warranty. (you also get no warranty the other way, however)
      • jldugger 3 hours ago
        Would 1.0 have fixed the wide variance in scoring?
      • codeflo 9 hours ago
        It can be useful for pure translation tasks and stuff like that where you explicitly don't want creativity of any kind.
      • vidarh 9 hours ago
        Plenty of setups default to lower values than 1.0.
    • bluechair 13 hours ago
      Willing to be corrected but I believe this type of automated resume filtering is illegal. Not saying it never happens but my understanding is it is not typical.
      • thayne 12 hours ago
        I would expect that to depend on jurisdiction.

        I don't know for sure, but I would be surprised if it was illegal in my particular US state. You might be able to argue the AI has inherent biases that introduce illegal discrimination in the hiring process, but my understanding is that winning a case like that would be very difficult, especially since most employers are very cagey about their hiring process and why they made a decision.

      • small_scombrus 12 hours ago
        They don't need to actually filter/blackhole to have virtually the same effect.

        Show someone a list of resumes with an "applicant score*" and they'll naturally ignore the ones with a low ranking

        *scores are generated with AI, mistakes may be made, use only as a guide and verify results

      • ivan_gammel 12 hours ago
        In situations when you get hundreds of applications for one open position (real market now), whatever reduces your pool to the size a human can handle, works. You can preserve some diversity metrics in the process. This particular filtering is rather primitive, but LLM as a first filter can definitely do the job. You may burn less tokens than the hourly rate of your HR and it will be fairer than just dumping 50% of unread CVs in trash.
        • 369548684892826 11 hours ago
          Great until someone realises you’ve filtered out minority groups from the application process (most developers are men so maybe the LLM decided they’re the best fit, but you’ll never know exactly why it screwed you over) and you suddenly have an expensive lawsuit
          • TeMPOraL 8 hours ago
            LLMs are DEI-aware, as over the past few years, their vendors all had various high profile news stories with their models and their default biases, so it's more likely they'll heavily discriminate in favor of minority candidates, not against them. Still, in both cases it would indicate whoever is operating the system is doing a really, really lazy job. It's really not hard to test and supervise LLMs on tasks where they give you a mere 2-10x leverage, and prompt adherence today is much better than it was 3 years ago.
          • ivan_gammel 3 hours ago
            What "not so smart" person would filter minority groups out of the process in 2026? It's more likely that a 90/10 gender imbalance will be converted to 60/40 or even 50/50. Diverse teams are more fun and stable.
          • cyanydeez 8 hours ago
            this happened a decade ago when a US court tried to make sentencing decisions via ML. it was easily demonstrated that the training data was flawed because the justice system was flawed, so the data it was trained on was weighted against minorities because it oversampled them: police routinely oversample, and poverty forces oversampling

            nonetheless, people will defend history as perfect and say those samples, like nepo babies, are "perfect".

      • elric 11 hours ago
        Under GDPR, you have the right to request manual processing whenever personal data is processed automatically to make a decision about you that has "significant impact". Not being hired seems like it would qualify.
      • dgellow 11 hours ago
        Illegal where?
  • dathinab 6 hours ago
    And this + the tendency for AI to "prefer" AI produced code + some other AI biases is why this is most likely highly illegal to use in the EU due to violating anti discrimination laws in multiple ways.

    To be clear:

    - randomly filtering "too many" resumes is pretty much allowed (I think)

    - but must be actually random, independent of the resume (and can be in multiple layers, i.e. random filter > pre-select > random filter > select)

    - this isn't the case for AI, as the random aspect is not independent of the actual resume evaluation

    - in general you can't make sure the AI doesn't apply systematic biases, and there is high indication that it does do so

    - for humans you can train them and order them to ignore their biases. This won't work reliably either, _but now you have delegated the responsibility for illegal biases to the hiring personnel violating the order_. With AI you are responsible no matter what you tell it. Lastly, you can technically show/prove that a specific AI is highly biased in specific contexts, which for human employees is technically possible but not really practical. So this moves "specific, mostly deniable" cases into "systematic, proven bias" territory. Or in other words, legal risk goes from "limited/no issue" to "people can systematically f-you over if they know you use AI for hiring".
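
    To make the "actually random, independent of the resume" distinction concrete, here is a minimal sketch of a pre-filter that never looks at the content, only at positions in the pile. `random_prefilter`, the seed value and the placeholder pile are all invented for illustration; this is not legal advice on what counts as compliant.

```python
import random


def random_prefilter(num_applications, keep, seed):
    """Choose which applications survive using only their positions."""
    if not 0 <= keep <= num_applications:
        raise ValueError("keep must be between 0 and num_applications")
    rng = random.Random(seed)  # seeded so the draw can be reproduced in an audit
    return sorted(rng.sample(range(num_applications), keep))


# Hypothetical pile; the strings stand in for whole resumes.
pile = ["resume A", "resume B", "resume C", "resume D", "resume E", "resume F"]

kept_indices = random_prefilter(len(pile), keep=3, seed=2026)
shortlist = [pile[i] for i in kept_indices]
print(kept_indices, shortlist)
```

    Because the function only receives a count, swapping any resume for a different one cannot change which positions are drawn. An LLM score with sampling noise has no such property: the noise sits on top of an evaluation of the content.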

    • jerf 4 hours ago
      Everything is correlated to everything [1].

      Which means there's a good chance this is somehow correlated in one way or another to race/gender/other protected classes in the US, just by the math of everything being correlated to everything.

      Which means this is one good lawsuit away from being illegal in the US as well. It doesn't even necessarily have to "win", just do well enough in court to scare away anyone else from using this.

      And boy oh boy would I hate to be on the receiving end of this lawsuit, trying to prove that my AI screener is completely in compliance with all hiring laws. That sounds like a nightmare.

      [1]: https://gwern.net/everything

      • oceansweep 3 hours ago
        Already happening with Workday in California:

        https://news.bloomberglaw.com/litigation/workday-loses-bid-t...

      • torben-friis 4 hours ago
        Would the accused party have to prove compliance? Or would non compliance have to be proved by the accuser?

        Honest question, I'm not American.

        • jerf 3 hours ago
          "Innocent until proven guilty" is a criminal court concept. This would be a civil suit. Those use different standards, like "preponderance of the evidence". I agree that if the claimant had to prove the AI system is violating employment law that that would be a hard bar to clear, but showing on the preponderance of the evidence is something that would have me a lot more nervous if I was on the receiving end of the lawsuit.

          This is a highly general answer to a complicated topic; my main point is more that this is not going to be held to the standard of "beyond reasonable doubt", which would be hard to meet.

          [1]: https://www.law.cornell.edu/wex/preponderance_of_the_evidenc...

      • nonethewiser 1 hour ago
        >Which means there's a good chance this is somehow correlated in one way or another to race/gender/other protected classes in the US, just by the math of everything being correlated to everything.

        >Which means this is one good lawsuit away from being illegal in the US as well.

        Uhh.. what? No that doesn't follow at all.

        Screening resumes in a way that correlates to race, gender, etc. is not illegal. This is a fundamental distinction. The law is you cannot use those as filters. But the outcomes likely will be correlated. In fact to ensure they are not correlated you'd have to break the law and control for race, gender etc. Which is racism.

        The models don't even get race as an input. If they did and they used it to select then yeah, that lawsuit sounds like it has merit. But a mere correlation in outcomes? In no way illegal whatsoever.

      • DiscourseFan 2 hours ago
        I wouldn't doubt that lawsuits for employment discrimination for any company (and I suppose it was most of them) that used LLMs in hiring processes will become a very lucrative business. They are all open to civil suits at this point.
        • AnimalMuppet 1 hour ago
          And, if there aren't enough lawyers to do all that work, you could use AI to file the suits.

          I'll let you decide whether that's a dream or a nightmare...

    • CobrastanJorji 53 minutes ago
      > randomly filtering "too many" resumes is pretty much allowed (I think)

      It's totally fine to filter out resumes in a completely random, content-independent way. Grabbing the fourth resume down in the pile and offering them the job is a perfectly fair albeit stupid way to make a hiring decision. However, AIs are very, very good at capturing biases, and it would not at all surprise me if an AI told to filter resumes is going to end up filtering with some biases for things that you definitely do not want to filter on, like the name of the candidate. And it might be that every resume that claims it fixed a typo in a major open source project gets a pass, but resumes that only list their own projects get rejected 60% of the time, so you're losing more good candidates than bad.

    • District5524 4 hours ago
      I'm not sure it is very easy to show this is a breach of non-discrimination requirements, like under Council Directive 2000/78/EC for employment.

      Due to acting like an irrational gambling machine, I agree it can have an unwanted indirect discrimination effect in general. But it will probably not differentiate "on the grounds of religion or belief, disability, age or sexual orientation". It is possible, but that would take a lot of work for the lawyers to prove to the court.

      I believe the more interesting part is the EU AI Act (still not in force in this regard until 2 December 2027). This will clearly be a high-risk AI system: "AI systems intended to be used for the recruitment or selection of natural persons, in particular to place targeted job advertisements, to analyse and filter job applications, and to evaluate candidates".

      Which does not mean prohibited, but it could later turn out that LLMs will be excluded from being used in high-risk AI use cases (falling under article 6 with no exemptions).

      Considering that none of the standards are published yet, I have absolutely no idea how they will ensure compliance with the following parts of Article 10 when using LLMs for such tasks: "(f) examination in view of possible biases that are likely to affect the health and safety of persons, have a negative impact on fundamental rights or lead to discrimination prohibited under Union law, especially where data outputs influence inputs for future operations; (g) appropriate measures to detect, prevent and mitigate possible biases identified according to point (f)"

      I don't think that's technically possible to do so with LLMs in general at the moment, even with the full cooperation of the model providers. Maybe you can do some meaningful audits for smaller models. But the EU AI Act may end up excluding all the generic "using-LLM-but-not-entirely-sure-why" vibe coded approaches from high-risk use cases (in Annex III). Which would make sense.

      https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng

      • dathinab 2 hours ago
        The EU AI Act got hijacked by huge corporations with last-minute changes which moved it from "could probably work" to "catastrophe".

        Even on 2 December 2027 it might intentionally not be enforced at all for a while because of that, though I think the goal is currently to amend it before then.

        > that LLMs will be excluded from being used in high-risk AI use cases

        no, it won't, I can guarantee you this. At best they will get additional restrictions over time, as things go wrong. Anyone who could make this happen has way too much interest in not making it happen. (Most/all? EU country legal systems are overloaded to the point of not working correctly anymore, and have been since before AI-generated lawsuits and other AI nonsense started. I won't go into detail, but many believe AI assistance (for certain tasks, always with a human making any final decisions) is the only way to get out of this mess).

        > standards are published yet

        or exist,

        like seriously this isn't a case of there being non public WIP standards which will pin all the nitty bitty details down, but cases of state agencies (and in last instance judges) having to decide if a specific standard (or implementation) is sufficient or not.

        but also to some degree it shouldn't be tightly coupled to tech standards as there are often many ways to implement the things the law requires and accepting only one is undesirable (and likely wouldn't legally hold up). But having tech standards which are a "guaranteed to be enough if you comply with" (but not the only valid way) would have been preferable, bringing us to the next point

        > have absolutely no idea how they will ensure compliance

        nor do they know. The original, non-big-corpo-hijacked version had exceptions for most companies affected now. So it would only have affected a handful of huge companies, which have many of the things required already in place, in some form or another. Most likely this would have played out as these companies presenting how their measures are "sufficient" and the agencies then evaluating them and potentially requiring some changes, going back and forth over a longer duration, leading to documented cases of rough technical standards about "what is sufficient" that they could then pass to other organizations in the future. But now the law affects not just a handful of companies but thousands, if not tens of thousands. Many are not staffed in a way where such a process could work, or even able to do the necessary documentation to show "compliance"...

        So from a practicability POV, if enforced starting 2027, it currently excludes close to _any_ (meaningful) use of AI, down to a trivial linear regression or similar. Including any "old school ML/AI" any Bank uses for risk assessment.

        Banking stopping running in December and there not being any (meaningful) AI startups or adoption at all is not something anyone (in power in any state organ) wants to see, so guess how much it will be enforced ;)

        And as mentioned the chance of AI as technology being excluded "in general" is close to none. Maybe specific usages could be excluded (and/or are already excluded) but that's it.

        Oh and as a bonus, a malicious reading of f+g removes any proper privacy protections for any AI usage in high risk contexts, where it is often most relevant... (a more sane reading allows it, with ... tricks).

    • buzer 5 hours ago
      > this is most likely highly illegal to use in the EU due to violating anti discrimination laws in multiple ways.

      It's generally illegal under GDPR Article 22.

      > The data subject shall have the right not to be subject to a decision based solely on automated processing, including profiling, which produces legal effects concerning him or her or similarly significantly affects him or her.

      Exceptions in 22(2) are unlikely to apply. It's hard to argue that it's truly necessary (a) and consent (c) is almost always unavailable in employment context. (b) might apply, but it requires specific law in EU or Member State to authorize it.

      • bluGill 4 hours ago
        For C: I'm not sure how EU laws work, but ethics says that someone who needs a job cannot give consent since the possibility of a job if they give consent could be a bribe. See a lawyer for how it works in your country.
        • dathinab 3 hours ago
          also not fully sure, but AFAIK there are limits to how far you can waive this right, in the context of things like TOS, simple opt-in fields on forms etc.

          Like YT would have loved to make you opt out of it (and probably has it in their TOS) but there were multiple cases of courts forcing them to handle it properly in the past as far as I remember.

          My _guess_ is that at least if you don't sign a proper contract you can always force a human reevaluation. But also only that (so only semi useful). Also even with a proper contract it's unclear if it would be possible in this specific case due to the contract being fundamentally one-sided/unfair and semi-forced on you if it were widespread on the market for the specific job you are trying to get.

          • bluGill 3 hours ago
            Those limits exist too, but even if the law doesn't give limits, ethics does.
        • buzer 3 hours ago
          That's why I said consent usually cannot be used in employment context. I wouldn't rule it out 100% for everything employment related, but application screening is unlikely to qualify for those rare cases.
      • dathinab 3 hours ago
        this isn't quite how GDPR Article 22 works

        There is a difference between having a right you can't waive, which is very similar to something being forbidden, and having a right you can fully or partially waive.

        Furthermore, to some degree you are only "subject to a decision based on ..." if the decision has an effect on you.

        In practice wrt. Article 22 this means companies can make a "decision solely based on automated processing[..]" iff they give you a (realistic) chance to object to it in which case they will do a human review of the decision where a human confirms/changes this decision based on reviewing the involved information.

        There is a lot of gray area around what a "chance to object" means and when a human review makes a decision no longer "solely based on automated processing" (a human just saying the AI was right clearly doesn't count, but a human constructing a case for why they would have decided the same way, based on why the AI made the decision, can count, iff it's reasonable to assume a human might have come to the same decision had it only been reviewed by a human).

        Or in other words, GDPR Article 22 is just "so-so" meaningful in the context of hiring.

        Like if the AI made a mistake they have to reevaluate it, but as long as there are other similarly qualified competitors (that they did hire/are in the process of hiring) it's quite easy to come up with a reason why those are a better choice for them. Or go through the motions of you being in round 2 or 3 of hiring and then find an excuse to not hire you.

        • buzer 2 hours ago
          Mostly yes.

          Note the chance to object must be given before the decision is made, i.e. it is not enough to offer a human review after the fact. The human must also have a meaningful chance to actually affect the decision.

          If the decision is based on purely objective facts that are actually necessary (like you must have certain license) then human and computer always coming to same decision is likely correct and compliant, but as soon as you start putting in subjective criteria and human agrees with 100% of computer denials it becomes a lot harder to demonstrate that human is actually able to affect the decision as required by Article 5. Note that demonstration burden is on controller, not on data subject/DPA.

          Objective criteria also isn't always enough by itself. If both human and computer calculate the same credit score and you must score X points to get a loan then human isn't actually able to affect the decision. Essentially the credit score calculation itself ends up being the automated decision rather than the formal rejection that is later given to data subject.

  • ryukoposting 12 hours ago
    At this point we might as well adopt that joke where you blindly throw away half the resumes because you don't want to hire unlucky people.
    • taffronaut 8 hours ago
      At one point in the past a major UK medical school adopted random selection for qualified candidates (Barts and The London School of Medicine and Dentistry - part of Queen Mary University of London). The approach benefitted qualified students from less well-off backgrounds vs those who can afford to win at the ever more elaborate (manual at the time) hurdles of resume assessment criteria and effectively game the system. There was an orchestrated campaign against the lottery around "Why gamble with would-be doctors?". Random selection was quietly dropped.
    • agnosticmantis 10 hours ago
      A person's total luck is constant over a lifetime. The remaining half of the candidates already spent some of their luck in this selection, so they'll be on average less lucky than the discarded half.
      • bee_rider 4 hours ago
        But, however you structure the selection process the people who get picked are the ones who’ve expended some luck (like, if you throw away half the resumes, but then pick the resumes out of the trashcan, the ones you plucked out are still the lucky ones).

        I see two possible solutions.

        1) Most people won’t be using up most of their luck on this one thing. I mean they’ve got their whole lifetime worth of luck, so you just need to make sure to pick people who still have plenty left. In other words, ageism and/or picking people who’ve never accomplished much are the solutions!

        2) We assume working for the company is a lucky outcome. If you make the company a really unpleasant place to work, people will have to use their luck to dodge it. However, luck can only be evaluated against other possible outcomes. The plan, then, should be to set up a competitor (possibly a front) that is a really nice place to work. They’ll act as the “lucky outcome expenditure dump.”

      • t-3 8 hours ago
        No, luck would be some expression of the difference between the average and the individual outcomes - it only exists relative to a population at the point in time when it is measured.
      • throwawaythekey 9 hours ago
        > A person's total luck is constant over a lifetime

        Ah yes, the much revered cosmological fairness constraint.

        • cyanydeez 8 hours ago
          everyone knows luck is tied to the wealth-gravity and increases as the inverse distance to the density of matter. but because it's relative, everyone thinks they have the same luck when not observing others.
      • sfn42 2 hours ago
        This is not at all how probability works. Luck is not a resource one spends. If you flip heads 500 times in a row with a fair coin, the next coin flip is still 50/50.
      • latexr 9 hours ago
        Even assuming that was genuinely how luck works, the conclusion does not follow from the premise because it’s obvious not everyone “starts with” the same amount of luck to spend.
        • lobocinza 16 minutes ago
          assuming luck is spendable
        • addandsubtract 7 hours ago
          But assuming a random draw, you're more likely to select people with higher luck.
      • CuriouslyC 7 hours ago
        Donald Trump disproves the fixed luck hypothesis (and the Karma hypothesis!)
    • zipy124 8 hours ago
      Or more to the point. There are generally far more qualified applicants than job roles. That is training and education greatly expanded over the last couple of decades to produce more and more job seekers, whilst job creation hasn't really kept pace.
    • pjio 10 hours ago
      This hurts more than it should.
    • citrin_ru 7 hours ago
      Maybe LLM resume screening is a symptom of a bigger problem - with tens of candidates per vacancy employers can screen resumes badly and even throw half of the resumes away and still hire someone qualified.
      • AbsurdCensor 4 hours ago
        That's really what it is, or at least what I've noticed.

        Any position you have these days is inundated with applications. Most don't meet the qualifications (because in a lot of places say in the US you must apply to jobs to keep with benefits, regardless of what you are applying for), and for the remaining, you'll find that there will always be some that are all similarly qualified. Who do you hire for one position? It sometimes just comes down to luck.

        AI doing the job of filtering I can't imagine making the process easier, and more applications are just going to get tossed because of it.

    • latortuga 3 hours ago
      The author made this exact joke in TFA.
  • jerrythegerbil 13 hours ago
    > I fail 65% of the time. Same exact resume, different luck.

    As someone who’s run hiring pipelines for technical roles in the past few years, that’s actually a fantastic number. I objectively hate saying that, but it’s true.

    35% chance of elevating a technical individual to the next stage with no effort? I’ve seen as many as 100+ applicants an hour even when including a domain specific screener question. That’s 35 “screened” applicants in an hour. Were valid candidates screened out? Yes. Do you still have a candidate pool 35x larger than you need? Unfortunately, also yes.

    The volume of applicants is SO HIGH such that your chances of getting moved to the next stage are actually markedly worse if AI isn’t involved. If you didn’t apply immediately (using an AI bot) there’s 50+ people ahead of you, and an exhausted technical leader if they ever make it to your resume.

    Referral bonuses exist for a reason.

    • PufPufPuf 11 hours ago
      In that case, I have a pre-screening system to sell you. Through state of the art technology, it only lets through the best* 1% of applications.

      *According to our proprietary, undisclosed, non-deterministic metric, which may or may not be Math.random

    • ludicrousdispla 11 hours ago
      So the logical solution is for candidates to submit multiple applications with slight variations to their contact info, "John Schmidt", "John J. Schmidt", "John J. J. Schmidt", "John Jacob J. Schmidt", "J. J. Jingleheimer Schmidt", etc.
      • ambicapter 3 hours ago
        It's a good day to have 3 middle names.
      • yuliyp 2 hours ago
        Hey, that's my name too!
    • kyralis 13 hours ago
      Is it? Or is it a 65% chance of a resume getting ignored before a single human sees it, reducing your pipeline's likelihood of catching qualified candidates by the same?

      Gates that reduce resume flow-through are only useful if their reduction is correlated with quality. Otherwise they're just dragging out your hiring process or unnecessarily causing you to ultimately lower your hiring bars.
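
      A rough simulation of the point above, not a model of any real ATS. It assumes candidate quality and gate noise are both standard normal, and that `rho` (a made-up parameter) is the correlation between the gate's score and true quality. The gate keeps the top 35% by score.

```python
import random
import statistics


def mean_quality_after_gate(rho, n=10_000, keep=0.35, seed=0):
    """Mean true quality of candidates who pass a top-`keep` score gate.

    rho is the assumed correlation between the gate's score and true quality.
    """
    rng = random.Random(seed)
    candidates = []
    for _ in range(n):
        quality = rng.gauss(0.0, 1.0)   # hidden "true" quality
        noise = rng.gauss(0.0, 1.0)     # everything the gate gets wrong
        score = rho * quality + (1 - rho**2) ** 0.5 * noise
        candidates.append((score, quality))
    candidates.sort(reverse=True)       # best score first
    passed = candidates[: int(n * keep)]
    return statistics.fmean(q for _, q in passed)


if __name__ == "__main__":
    for rho in (0.0, 0.3, 0.8):
        result = mean_quality_after_gate(rho)
        print(f"rho={rho:.1f}  mean quality of survivors={result:+.2f}")
```

      With `rho=0.0` the survivors look like a random draw from the pool, so the gate only shrinks the pile. Any lift in average quality comes from the correlation.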

      • jerrythegerbil 12 hours ago
        > Gates that reduce resume flow-through are only useful if their reduction is correlated with quality.

        The volume is infeasible to review everyone for quality, even at an hour scale. The conclusion and solution is inevitable, though I wish it were different. 35% is actually really good if you’re not coming in through a referral.

        The current reality is <1% and the person reviewing you is exhausted.

        • falsemyrmidon 11 hours ago
          You may as well just randomly pick 65 to discard, if your only goal is to reduce the number for review.
          • ayuhito 6 hours ago
            That’s exactly it for large scale hiring with finite resources.

            It’s all probabilities in the end. And if an LLM gives you a more relevant pool vs random distribution, that’s still a net benefit.

        • sevenzero 12 hours ago
          What an inhumane way of looking at this. Hiring is deeply flawed, you know it, and yet you keep job postings open for weeks/months in case "the one" magically appears on your doorstep instead of just interviewing 10-20 people and just picking one...

          Corpo bullshittery at its finest.

          • LinXitoW 9 hours ago
            What's the alternative? Everyone's up in arms, but I see ZERO viable alternatives proposed.

            If you have 1000 applications for every job, and you know that a bunch of these applications are "a bad fit", to put it mildly, you have to filter. And you cannot realistically give every resume a good, human look. By the time HR would be done, the market has already moved on five times.

            So, what is the real difference between being overlooked because HR could only look at the first 100 resumes, or the AI filtered all 1000 resumes down to 100? In the end, a fuckton of potentially great people get their feelings hurt either way.

            • RugnirViking 6 hours ago
              Great question. The alternative is not accepting 1000 applicants. Nobody said you have to keep up your job posting for two weeks, or two hours for that matter. Stop once you have enough. Enough is defined by whatever number you would have filtered to. In the rare case none of the first ten applicants were appropriate, just open it again until you've got another tranche.
              • jarito 6 hours ago
                You are assuming quality applicants are evenly distributed in terms of time of application - they aren’t. If you cut off at 100, you will only get a sample of people spewing fully automated application bots which mostly aren’t what you want.
                • MichaelDickens 4 hours ago
                  If that's true, then it suggests an easy fix: leave your application up for four hours, then discard all applications you get for the first two.
              • Arodex 6 hours ago
                That's just another type of randomness (who was online during the short time the posting was opened).
                • Xirdus 4 hours ago
                  "Being online during the short time" heavily favors bots. In a way, AI screening tools saved us from the future of everybody buying resume-spamming-as-a-service because it became as important to use these as getting a college degree.
                • RugnirViking 5 hours ago
                  Right. But if you go online and look for a job, then the ones that are open at that moment will actually read your application.
                • sevenzero 6 hours ago
                  At least this would not force applicants to fine tune their applications to the latest LLM bullshit bingo.
            • Xirdus 4 hours ago
              > If you have 1000 applications for every job, and you know that a bunch of these applications are "a bad fit", to put it mildly, you have to filter. And you cannot realistically give every resume a good, human look.

              At 10 seconds per resume, it would take you 3 hours to go through all 1000 resumes. I don't know what you consider "good" and "human", but my human eyes could easily do good enough, fully manual pre-screening at a rate of 1 requisition per day.

            • bee_rider 3 hours ago
              It’s weird because unemployment is still quite low, right?

              Maybe a platform could be designed where candidates have one account for multiple companies, and the number of applications on the platform is limited to, say, ten per person per month or something. To get people to be selective. I don’t think this should be the only way to apply, but maybe the companies involved could look there first.

            • kasey_junk 7 hours ago
              If your hiring pipeline is employing a filter that a) is not better than a random chance and b) is expensive to implement get rid of the filter.

              Instead of spending all those resources on resume filtering, hire resume blind. Instead of using llms for a thing they are bad at (subjective decision making) use them to build a deterministic process that isn’t.

              Use work sample hiring as the filter. Make the work sample automatic to sign up for and judge.

            • sevenzero 9 hours ago
              >instead of just interviewing 10-20 people and just pick one

              Here's a realistic proposition. HR just wants to inflate numbers so that they seem busy looking for the right fit. Keep posting open for 1 week, manually filter for another week, invite people, employ one. Plenty of people with degrees looking for jobs right now, I don't see what's the issue with just trying one. Companies desperately look for the "magic" applicant that checks all boxes, while also trying to pay them almost minimum wage.

        • Brian_K_White 12 hours ago
          This reasoning isn't.
      • bagels 13 hours ago
        The goal for the interviewer is to have a much higher ratio of good/bad candidates after the first screening. This means the more costly time you spend on the second step has a better return.
      • aesthesia 12 hours ago
        So the question is: is the score given by this system correlated with candidate quality? I don't think this post gives enough data to know.
    • mrhottakes 1 hour ago
      Sounds like you're pretty bad at hiring pipelines.
    • recursivecaveat 10 hours ago
      If you have no requirements for accuracy, you can just advance 35% of applicants at random.

      If the first 50 people who apply are all bots, why are you reading resumes in order of submission?

    • wodenokoto 6 hours ago
      One of the first things you do when hiring is to set a period and randomize the order of resumes when reviewing, because early application is not a strong signal.
    • spike021 11 hours ago
      there have got to be better ways to optimize pipelines. maybe set a limit on number of applications for a role based on the number you/your team can reliably go through them. if more are needed then open the role for another wave of applications.
    • lowbloodsugar 12 hours ago
      Except the bit about ranking a decades long S3 engineer lower than an intern with GitHub repo.
    • IshKebab 10 hours ago
      I wonder if you could solve this for programming specifically as follows:

      1. Give them some easy leetcode questions. Nothing that a competent programmer would have any problem with.

      2. If they pass, ask for a deposit of like $20. Shouldn't be an issue for people who are actually serious.

      3. Do more simple leetcode questions but this time on zoom so you can tell if they are using AI. If they pass that they get the deposit back.

      (Yeah I know there are real-time interview cheat AI programs but based on what I've seen on demos of them it's super obvious when they're being used.)

      Probably not practical but just a thought!

      • jghn 3 hours ago
        I'm not going to do any of those 3 things for a would-be employer.
        • IshKebab 2 hours ago
          They don't seem like unreasonable things to me so I guess it also helps filter out unreasonable people!
      • never_inline 6 hours ago
        This selects for desperation.
    • dvt 13 hours ago
      [dead]
  • CM30 8 hours ago
    I think what's more worrying to me (if other systems work like this ATS) is that it seems to judge based on a bunch of factors that will probably disqualify a ton of decent to good participants.

    For example, 65 points are given for a mix of personal projects and open source contributions. Which is great if your one and only interest is in tech, and you don't have a family, dependents or a second/third job. If you have any of those other things, well the odds seem like they're incredibly stacked against you.

    And it makes me wonder how many of these systems are stacked in favour of wealthy people with a near special interest level of obsession with tech and no worries outside of going to college/working a single job in their industry of choice.

    • thewebguyd 1 hour ago
      Yeah, the over valuing personal/open source projects is worrying and kind of sucks. I can use myself as an example, I don't do personal projects really, outside of work. My only actual programming work experience is during work hours for my employer. My hobbies are tech-adjacent (3D printing, some hardware/arduino stuff, photography) but they aren't "make a bunch of projects and put them on github" type hobbies. I'm certainly not going to make some BS fake CRUD or SaaS apps just to show off for potential employers, what a waste of time.

      I, intentionally, have zero online presence in that regard. You won't find any public repos on my github, I don't blog, etc. It's even infected the ops/sysadmin side of the field (where I work), and that's somehow even worse. Like of course I don't have a bunch of environment specific scripts on my GH, why would I? It's irrelevant to anyone that doesn't work in my department at my current employer.

    • bob001 8 hours ago
      [flagged]
      • doodaddy 5 hours ago
        I know that some think this is just some cold hard straight talk but this style of individualistic thinking lacks empathy. And more practically, it’s a trap.

        In context, the “doing things” and “opportunities” that we’re talking about are jobs, careers. So by promoting the idea that one must work harder or longer to get or keep a career that they’ve already built sounds like a path to opt-in servitude.

      • Schiendelman 6 hours ago
        In hiring, we pass laws to prevent abuses. In many countries and soon a few states, being asked to work outside of work hours is considered an abuse. Expecting that someone does work related activity outside of work hours is something I would actually consider regulating out of the application process!
      • danmaz74 7 hours ago
        Of course life isn't fair. But here the result is that companies will ignore potentially great candidates who dedicate all their programming time to their job and instead consider candidates who may be not just worse programmers, but also more interested in their hobbies (or padding their CV) than doing their job.

        I'm saying this as somebody who most of the time has some side project going on.

        • bob001 7 hours ago
          [flagged]
          • danmaz74 7 hours ago
            > There's many great candidates

            Perhaps for top-paying companies, but that's never been my experience when I was involved in interviewing and hiring.

      • Grombobulous 4 hours ago
        “Fair” is one thing, “systemically impossible to even approach fair” is another.

        For example, you can’t “conscious long-term effort” your way out of being stop and frisked by cops because you were walking while black.

        This setup isn’t even good for employers. Having your job as your hobby doesn’t automatically make you better at your job.

  • orbital-decay 7 hours ago
    This word (determinism) has a magical effect of warping any online posts it touches. Once you hear it you can almost guarantee it's going to be misguided. At least this time it's actual determinism (same input = same output), not arbitrary unrelated things.

    Determinism matters for reproducibility, but do you really want these outputs to be reproducible in this particular case? Making LLM outputs deterministic is relatively trivial, you have to use batch-invariant kernels (if you use batching) and either set the temperature to 0 (don't do that, randomized sampling is here for a reason) or fix the seed (better). It's readily available in a few systems. But this won't make the result more useful, it will just obscure the fact that the agent is genuinely not sure about it - look at the range of the scores it gives! It still won't predict anything but the score will stay the same each time. Do you really want that?

    What happens here is they're supplying too little information (just a resume, which is almost at the noise level) and expecting a reply with too broad implications. This is a basic design mistake regardless of whether it uses LLMs. All surveys, tests, laws, and voting systems are extremely sensitive to framing because they work off too little information. But they also don't exist in vacuum, unlike this thing.
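
    A toy sketch of the temperature-versus-seed distinction from the middle paragraph, assuming a plain softmax sampler over three made-up logits. It uses only the standard library and does not touch batching or kernel-level effects, which this example cannot reproduce. `sample_token` and `generate` are illustrative names, not any inference library's API.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Sample an index from logits; temperature == 0 falls back to argmax."""
    if temperature == 0:
        # Separate greedy branch: dividing by zero is undefined.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(logits, temperature, seed, steps=8):
    """Draw `steps` tokens from the same distribution with a seeded RNG."""
    rng = random.Random(seed)  # fixed seed -> repeatable draws
    return [sample_token(logits, temperature, rng) for _ in range(steps)]


if __name__ == "__main__":
    logits = [2.0, 1.5, 0.3]  # made-up scores for three tokens
    print(generate(logits, 1.0, seed=42))
    print(generate(logits, 1.0, seed=42))  # same seed, same sequence
    print(generate(logits, 0.0, seed=7))   # greedy: always index 0
```

    Fixing the seed keeps the sampling distribution intact and only makes the draws repeatable. Temperature 0 throws the distribution away and keeps the argmax.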

    • nonethewiser 49 minutes ago
      I made a similar comment on a different post. Non-determinism does not necessarily mean it cannot reliably reach the correct output (although sometimes it does mean that). Las Vegas algorithms are non-deterministic and 100% accurate. The tradeoff is the time it takes to reach the correct answer is highly variable.

      To contextualize this insight in your post and basically just repeat what you are saying: The mistake is not using a non-deterministic system. The mistake could be, in some sense, using it too little. Re-evaluating the same resume 5 times and seeing a high variance in scores is a more useful signal than evaluating it once.
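
      A minimal sketch of the "score it five times" idea. `noisy_score` is a hypothetical stand-in for one LLM scoring call, simulated here with Gaussian noise; it is not the ATS from the article or any real API.

```python
import random
import statistics


def noisy_score(resume_text, rng):
    """Stand-in for one LLM scoring call (hypothetical, simulated)."""
    base = 40 + len(resume_text) % 30  # arbitrary deterministic part
    return max(0.0, min(100.0, base + rng.gauss(0.0, 12.0)))


def evaluate_repeatedly(resume_text, runs=5, seed=0):
    """Score the same resume several times and summarize the spread."""
    rng = random.Random(seed)
    scores = [noisy_score(resume_text, rng) for _ in range(runs)]
    return {
        "scores": scores,
        "mean": statistics.fmean(scores),
        "stdev": statistics.stdev(scores),  # large spread = unsure scorer
    }


if __name__ == "__main__":
    report = evaluate_repeatedly("ten years of distributed systems work")
    print(report)
```

      A reviewer could treat a high `stdev` as "the scorer does not know" and route that resume to a human rather than trusting any single number.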

    • programjames 1 hour ago
      Nondeterminism is also a feature, not a bug. If you don't want people to optimize against your filtering process, you have to make it somewhat nondeterministic. For example, better candidates are exponentially more likely to pass the filter, instead of a hard cut-off at the top-100. Then it becomes no longer worthwhile to Goodhart the filtering process, because it barely increases your chances and there are so many more places you can use your time better.
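
      One way such a filter could look, as a sketch only. It assumes scores on a 0-100 scale and a made-up softness parameter `tau`, and uses Efraimidis-Spirakis weighted sampling without replacement. None of this comes from the ATS under discussion.

```python
import math
import random


def stochastic_filter(scores, k, tau, rng):
    """Pick k distinct candidates; weight grows exponentially with score.

    Efraimidis-Spirakis sampling: each item gets key u ** (1 / w) and the
    k largest keys win. Keys are compared in log space here.
    """
    top = max(scores.values())
    keyed = []
    for name, score in scores.items():
        # Weight is in (0, 1]; assumes (score - top) / tau stays well
        # above the point where exp() would underflow to zero.
        weight = math.exp((score - top) / tau)
        u = 1.0 - rng.random()  # uniform in (0, 1], so log(u) is defined
        keyed.append((math.log(u) / weight, name))
    keyed.sort(reverse=True)
    return [name for _, name in keyed[:k]]


if __name__ == "__main__":
    rng = random.Random(0)
    pool = {"a": 90, "b": 70, "c": 50, "d": 30}
    print(stochastic_filter(pool, 2, 10.0, rng))
```

      Higher scores still win far more often, but no score guarantees a pass and no score rules one out.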
      • 12_throw_away 23 minutes ago
        > If you don't want people to optimize against your filtering process, you have to make it somewhat nondeterministic.

        I'm sorry, I'm not following this at all. When you say "better candidates are exponentially more likely to pass the filter", we're still talking about a metric, yes? A metric that can be optimized? Why would switching from a hard cutoff to some sort of stochastic filter weighted by this metric discourage optimization?

    • RugnirViking 6 hours ago
      This. Human judges and examiners are famously not deterministic even though we would wish it were so - we've probably all heard the thing of harsher sentences being given in the hour before lunch.
      • nonethewiser 45 minutes ago
        >we've probably all heard the thing of harsher sentences being given in the hour before lunch

        That suggests determinism though.

        I mean I agree with you overall. Either human decision making is a system so complex it appears non-deterministic, or it is deterministic. Practically speaking, we are non-deterministic.

        Let's not conflate non-deterministic with inaccurate though. Non-deterministic systems can be 100% accurate. https://en.wikipedia.org/wiki/Las_Vegas_algorithm
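
        For a concrete case, randomized quickselect is a standard Las Vegas algorithm: the answer is always right, and only the amount of work depends on the random pivots. A minimal sketch, assuming `0 <= k < len(items)`:

```python
import random


def quickselect(items, k, rng):
    """Return the k-th smallest item (0-based) using random pivots."""
    pool = list(items)
    while True:
        pivot = rng.choice(pool)
        lower = [x for x in pool if x < pivot]
        equal = [x for x in pool if x == pivot]
        if k < len(lower):
            pool = lower                       # answer is below the pivot
        elif k < len(lower) + len(equal):
            return pivot                       # the pivot is the answer
        else:
            k -= len(lower) + len(equal)       # skip everything <= pivot
            pool = [x for x in pool if x > pivot]


if __name__ == "__main__":
    data = [31, 4, 15, 9, 26, 5, 35, 8]
    for seed in range(3):
        # Different seeds take different paths to the same result.
        print(seed, quickselect(data, 3, random.Random(seed)))
```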

      • groundzeros2015 1 hour ago
        > harsher sentences being given in the hour before lunch.

        Implicit bias theory sparked a massive number of studies that suggested everything influenced you from the color of the room, to what the person said to you before entering.

        It’s been really hard to replicate and the conclusions that have been drawn are contradictory.

  • joshmn 4 hours ago
    I ran the ATS myself and had a similarly quirky experience. I was in the 70s because it couldn't find my GitHub profile, and then it didn't like some of the popular Ruby libraries I'm the author of.

    After a few runs it picked things up appropriately. I always got dinged on formal education though.

    This stuff is gross.

    • fernandopj 4 hours ago
      Similar to my experience. Put me around 65 in some runs, because it didn't like I don't have contributions to OSS.

      Also, it doesn't pick up certifications or awards. I tried some PRs people are suggesting with enhancements (https://github.com/Zem-0/hiring-agent), it helps, but overall their ATS is hugely biased towards people with large GitHub contributions to OSS.

  • seanieb 8 hours ago
    It's always amazed me that a tech company will pay $300,000+ for a good engineer because talent is so hard to find... meanwhile their recruiter operates unsupported and has a very different idea about what good looks like. Their ATS black-holes >50% of the resumes because its filtering heuristics are garbage, because recruiting selected the ATS because it has a Gmail integration or something, and the ATS's filtering technology was not reviewed by anyone in the engineering or data teams.
  • Aurornis 12 hours ago
    > The default model is gemma3:4b

    That’s a tiny model. No LLM is going to be a perfect and repeatable judge, but a tiny 4B model is like plugging an RNG into this system.

    This whole exercise feels like someone vibe coded an ATS and got it to the point where the tests were passing because they decided they should have an open source ATS project.

    • danpalmer 11 hours ago
      This sort of model is fine for small problems, when used in the right way. I think there's probably a version of Resume analysis that would work well with this model, but "hey clanker, what projects has this person done" is not the way. You need extraction, cleanup, probably OCR to compare and further clean up, multiple analysis passes per signal with LLMs, judges, etc. None of that needs to be large models, you'll get marginally better performance, but there's very little context, these models will perform well when used correctly.
  • a4isms 3 hours ago
    Feels like "I Don't Hire Unlucky People" all over again, but with extra tokenmaxxing steps.

    https://neonrocket.com/2014/05/rescued-from-the-ashes-i-dont...

  • zx8080 4 hours ago
    This is the new AI reality everyone around is wanting: nondeterministic computing.

    There is another name for it: a waste of electricity.

    But wait, not waste! Consumers paid for it fully, with nice profit margins.

    You and me, paid.

    Try using google flights, or booking.com: the prices shown in search results list are frequently significantly different from those in a single result. It's a nondeterministic compute when it's easy to spot it. But it's not always that easy.

    It's all sad, to be honest.

    • reactordev 4 hours ago
      There should be laws against displaying wrong prices or different prices for who you are…
  • robertlagrant 7 hours ago
    I tried this with my CV, and it somehow scored me bonus points for GSoC!

       BONUS POINTS: 5.0
      ------------------------------
         Google Summer of Code (GSoC) participation: +5
    
    Even though I've never done this, and don't claim to have done it in my CV.
  • 0xbadcafebee 5 hours ago
    This insanity only exists because the tech industry is standard-less. No formal education needed, no formal training requirement, no apprenticeship, no software building code, no professional organization. Resumes have never been a good predictor of success - and why would they be?? Even if they're truthful and it's "impressive looking", that doesn't give you any assurance of knowledge, of who they learned under, what they learned, that they passed some minimum criteria. We might as well be rolling dice. So why not an LLM that randomly assigns scores?
    • groundzeros2015 1 hour ago
      Do you think fields that have formal criteria don’t use resumes with keywords? I bet Lawyers look for school names and big law firms all the time.

      Credentialing helps maintain a quality floor. Does this person have basic employable skill? Nothing more. It actually doesn’t help you identify levels of talent and skill which is a universal hiring problem.

      We do have a credential - a CS degree. And you can see it is a mixed signal. Employers can choose of their own free will to take risks on employees that do have this credential, or not.

      Mandating by law that you must have a CS degree doesn’t seem to help our field as we famously have high performers across the spectrum of formal education.

    • conductr 3 hours ago
      I have no data to lean on other than my experience and intuition but I’d say that’s not the case. My domain is corporate finance, which encompasses a lot of structured roles and certifications, yet I consistently feel the resume is just a poor device for making any judgement calls. Having people summarize their career into 1-2 pages of bullet points just doesn’t mean much. Especially now that keyword packing is a thing. It’s just meant as an introduction/sniff test to open the door for a conversation. Then it allows for deeper, more probing questions to be asked. This is where you’ll assess how impactful their contribution to a project actually was. Were they really living up to your definition of a manager, or were they more so an IC that had a lot of responsibility? Stuff like that.

      > Resumes have never been a good predictor of success

      Applies broadly to the world, it’s not unique to tech

      • 0xbadcafebee 1 hour ago
        The problem is we have too many applicants to phone screen them all. For a lot of jobs today you end up with 10,000 applications, which is why these automated resume-skimming systems exist, but unfortunately this page shows how they basically don't work
        • conductr 1 hour ago
          People seem to hit a wall when flooded by resumes. They feel like there's some needle in the haystack they need to find and it’s overwhelming. But you don’t have to read all of them. Or talk to all of them. Or use a system like this to filter.

          If you know what you’re looking for, you just start skimming them and maybe ranking them based on your own rubric. If it’s an obvious “no” you can usually tell within 5 seconds skim. Once you have a handful of high ranking ones, stop, and talk to them. Repeat as necessary until you have a short list of people you’d want to hire. There might be 9900/10000 resumes you never even looked at and maybe one of them would have been slightly better but you can’t let perfection be the enemy of progress. Stand by your convictions of feeling the candidate is qualified and capable and meets what you expect and hire them, get back to business.

          Having been in “talent shortage” mode for a long while I’d rather have 10000 resumes than 3. Having to pick one from a suboptimal selection is an awful position to be in, but sometimes a necessity.

  • gs17 12 hours ago
    I'm a little confused, is this an ATS system that anyone actually uses? If not, I'm not sure how it's better than just asking ChatGPT to score your resume out of 100. Why would you want to optimize your resume for a system no one is using to score it?
    • Bukhmanizer 11 hours ago
      I would assume at least hackerrank is?

      I don’t think the point of a lot of this is to optimize your resume. It’s to show how arbitrary these systems are.

    • marticode 8 hours ago
      From my understanding this one is used for hiring tech workers only. The (very) widely used Workday application system for ex seems to have its own built-in ATS.
    • petesergeant 11 hours ago
      (Almost) everyone’s using some kind of ATS, every ATS is adding AI auto-ranking (and has been trying to for 15 years), and almost all HR people feel like they have too many obviously bad CVs to read. Whether or not someone is using this ATS specifically, if you submit several CVs to several places, your CV is going into at least one magical 8-ball.
    • 40four 11 hours ago
      “I'm a little confused, is this an ATS system that anyone actually uses?”

      You read my mind. If the answer is “no”, then we can ignore this.

      • another-dave 9 hours ago
        For one, if you go on to Hacker Rank's "Screen" page, they mention the product is used by Stripe/AirBnB/LinkedIn/Atlassian/IBM etc etc. I imagine that there's plenty more companies using it too.

        But I'd also assume that their competitors are doing something similar so I don't think we as an industry can just ignore that it's happening.

        • gs17 2 hours ago
          > HackerRank Screen compresses the top of the hiring funnel by replacing manual resume reviews and unstructured phone screens with structured, auto-scored assessments

          That seems to be a different type of product.

        • 40four 5 hours ago
          Interesting, thanks. I admittedly spent zero time looking into it :)

          I’m surprised open source contributions count for so much. The first thing I thought was “is that something people actually list in a resume?”. But it looks like it pulls your GitHub account and appends that information.

          That's kind of unfortunate for anyone who doesn’t use GitHub.

  • achalxyz 2 hours ago
    If I know the truth value of p and I also know p=>q, then an LLM would be able to deduce the truth value of q - even if the statements aren’t exactly in this form. Generally, LLMs are good with logical inference.

    But logical inference itself is limited. You still have to find out if p is true or not - the ground truth.

    How do you find that? You would be able to define in the prompt that if resume has p, infer q and do this. But determining the truth value of p is something LLM cannot do.

    It’s not a limitation of the LLM. It’s the limitation of logic itself. You take 10 humans and give them the resumes with the same rubrics as the LLM. You’ll get a similar range of scores because everyone would assign different values.

    The issue is not in logical inference. It’s in determining the value of p, which takes much more than logic. And current LLMs are limited to being logical.

  • rsanek 5 hours ago
    It's fair to call out issues with the tool. But I think for individuals searching for jobs, using LLMs as the scapegoat for why it's hard to find a role is not terribly helpful.

    In my experience, cold-applying has always worked essentially as a black hole, and LLMs haven't changed that much. The reality is that alternative avenues are always necessary to get the job you want. That could be a third-party recruiter; reaching out to a hiring manager on LinkedIn; or using your network to get referrals. Those continue to work whether the company is using a bone-headed tool like this or not.

    • us-merul 5 hours ago
      I entered an interview with a hiring manager where they had received a "summary" of my resume that contained information blatantly not in my experience. The recruiter claimed they mixed my name up with another applicant, but the summary the hiring manager showed me had parts that were correct.
  • makeavish 12 hours ago
    Hiring and job search has been so hard and AI has amplified the existing problems instead of solving any.
    • sevenzero 12 hours ago
      Wdym, cant you just litter your applications with buzzwords and other bs to automatically get a high score in these systems?
      • szszrk 11 hours ago
        HR market is basically an early google rigging era, where you can place hundreds of keywords at the footer (white text on white background) to start popping up on random searches.
      • makeavish 8 hours ago
        I have been on both sides of the market. And it sucks so bad at both ends. Companies which deeply care about their next hire are struggling to hire, and actual great people looking for work are outcompeted by AI slop and AI bulk applying.

        It is actually a very hard to solve problem.

        • CuriouslyC 7 hours ago
          The mind blower is that this spam and slop is just lowering the job market to the quality of every other capitalist market. Poor hiring manager has to look through 1000 applicants, 950 of which are spam? How many ads are shoved down your throat every day, and how many products are you actually looking for info on?

          Chickens coming home to roost.

  • kailpa1 10 hours ago
    From `resume_evaluation_system_message.jinja`

    > *SCORES MUST NEVER DEPEND ON THE FOLLOWING FACTORS:*

    > - College, university, or educational institution name

    > - CGPA, GPA, or academic grades

    I don't understand why they would omit these factors from the evaluation.

    • swiftcoder 9 hours ago
      > I don't understand why they would omit these factors from the evaluation.

      Only hiring MIT graduates sounds great to a lot of tech folks! Automatically rejecting applicants from HBCUs, however, sounds like a lawsuit

      As to the GPA thing, I think it's just to stop the LLM glomming onto an obvious numerical grade? LLMs like to rank things by obvious dimensions, and whether someone had a 4.0 or a 3.8 in grad school makes very little difference to their performance 10 years down the line.

    • ceejayoz 4 hours ago
      https://qz.com/1427621/companies-are-on-the-hook-if-their-hi...

      > But it didn’t. After the company trained the algorithm on 10 years of its own hiring data, the algorithm reportedly became biased against female applicants. The word “women,” like in women’s sports, would cause the algorithm to specifically rank applicants lower. After Amazon engineers attempted to fix that problem, the algorithm still wasn’t up to snuff and the project was ended.

      And in another org:

      > After an audit of the algorithm, the resume screening company found that the algorithm found two factors to be most indicative of job performance: their name was Jared, and whether they played high school lacrosse. Girouard’s client did not use the tool.

      https://www.npr.org/2024/04/11/1243713272/resume-bias-study-...

      > Their working paper, published this month and titled "A Discrimination Report Card," found that the typical employer called back the presumably white applicants around 9% more than Black ones. That number rose to roughly 24% for the worst offenders.

      It'll discriminate by proxy, basically.

    • bulder 7 hours ago
      I don't understand why they'd hand over those data points over to the model in the first place. If it's in the context window, it's impacting the output. To ensure that no weight is placed on those factors, they should be sanitizing them out before handing the data over to the model.
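
      A minimal sketch of that kind of pre-filtering, assuming the resume has already been parsed into a dict before it reaches the scoring prompt. The field names (`name`, `education`, `institution`, `gpa`, and so on) are hypothetical and not taken from the repo's actual schema:

```python
import copy

# Hypothetical field names; the real parser's schema may differ.
TOP_LEVEL_FIELDS_TO_DROP = {"name", "gender", "date_of_birth", "photo_url"}
EDUCATION_FIELDS_TO_DROP = {"institution", "gpa", "cgpa", "grades"}


def sanitize_resume(resume: dict) -> dict:
    """Return a copy of the parsed resume without fields the scorer must ignore."""
    cleaned = copy.deepcopy(resume)  # leave the caller's data untouched
    for field in TOP_LEVEL_FIELDS_TO_DROP:
        cleaned.pop(field, None)
    for entry in cleaned.get("education", []):
        for field in EDUCATION_FIELDS_TO_DROP:
            entry.pop(field, None)
    return cleaned


# Made-up example input, only to show the shape of the data.
parsed = {
    "name": "Jane Doe",
    "skills": ["Python", "SQL"],
    "education": [{"degree": "BSc", "institution": "Some University", "gpa": 3.9}],
}
print(sanitize_resume(parsed))
```

      This only covers structured fields. A school name or GPA mentioned inside free text (a project description, say) would still reach the model, so it narrows the leak rather than closing it.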
    • sph 10 hours ago
      Hopefully so that people like me, that dropped out of high school yet have had a successful career as a self-taught engineer, have a chance. [1]

      Just kidding, my resumes are sent to /dev/null like everybody else’s.

      ——

      1: In fact, I will be controversial and say that self-taught engineers tend to be the strongest in their own particular niche, because they are powered by sheer desire to learn and improve. I am routinely appalled by how many people go on forums to ask how to learn a new thing, completely unable to self-direct their learning. I blame the modern school system.

      • kailpa1 9 hours ago
        I'm a self-taught programmer as well, who dropped out of university, and these factors being omitted would benefit me as well, but I feel like good grades and a good university are still indicators of someone being, or being capable of becoming, a good programmer.

        This system would drop a Harvard top graduate for someone having a year of experience in some outsourcing firm.

        • Schiendelman 6 hours ago
          Unfortunately, graduating from Harvard is a very good predictor of whether your parents were wealthy, and also that you are less likely to be black.

          I worked for a very large job board for the last six years, it's the one you're thinking of. What we found is that the outcomes of paying attention to what school you went to are almost entirely discriminatory, and not predictors of success.

        • goosejuice 9 hours ago
          > I feel like good grades and a good university are still indicators of someone being or is capable of becoming a good programmer.

          Really depends on the program. In my undergrad program there were some very smart CS students who got great grades that really struggled with the programming. Smart and capable people can be bad at programming and lack many qualities that make for a good hire.

          • kailpa1 7 hours ago
            Sure, but isn't this kind of person the exception? I feel like most of the time good grades mean good programming skills
        • sph 9 hours ago
          I started in an outsourcing firm (body rental actually) but I definitely get your point. Maybe they optimize for real world experience, or rather, how one is used to workplace politics and logistics. The top grad will have higher expectations, and all they want is a cog for the Machine.
          • kailpa1 7 hours ago
            Yep, I don't know either, but I guess they have their reasons for this.
  • nimithryn 1 hour ago
    Oh ok. So I'll just have to apply 4-5 times to every job to be sure I'm considered. Sounds like a good equilibrium!
  • mrhottakes 1 hour ago
    Yep, any day now AI is going to be so good we'll never need to think again. What's that, it's just a really expensive random number generator?
  • tasuki 11 hours ago
    > Sometimes my projects “lack architectural complexity”

    Well done you! It is difficult to avoid architectural complexity, but imho well worth it.

  • bartread 7 hours ago
    The takeaway from this for me is that, using an LLM to score anything takes multiple (maybe even many) runs and the result you’ll get is, at best, a sane-ish distribution.

    Which sort of sounds workable until you scale it up to larger datasets, where at some point compute/time/energy costs will render it non-viable.

    I am sure there’s some reasonable rule of thumb estimation on distribution that could be applied based off fewer runs per data artifact, but you’re always going to be trading off against confidence by doing this.

    Beyond this, I’d bet that almost no implemented systems that use LLMs for scoring, ranking, or decision making use such a multi-run approach. Partly because people don’t understand their behaviour is stochastic, perhaps because a lot of people without a background in statistics don’t understand what stochastic actually means, and no doubt partly because of budget concerns: if you have to ask an LLM to do the same thing 10, 50, 100 times to get a sufficiently good result, then the cost saving argument is either weakened or completely destroyed.

    There is at least one more aspect worth considering in the specific case of resumes/CVs: is the inconsistency of scoring by LLM worse than the inconsistency of scoring by a human following a similar process?

    Because the reality is that, even for an experienced recruiter, reviewing hundreds or thousands of resumes or CVs gets pretty fatiguing. People get hungry, bored, tired, restless, irritable, etc.

    That inevitably leads to inconsistencies creeping in, so there’s always an element of “luck” (or, perhaps better, uncertainty) as to whether your resume/CV passes screening.

    So is that inconsistency better or worse with LLM screening? I don’t know. But, at least, if it’s not worse maybe it doesn’t matter for this specific use case. And if it’s notably better then maybe it’s raised the bar on what “good enough” screening looks like?

    (And I’m sure other use cases warrant similar, “does it matter?”, questions, with the answers no doubt landing differently.)

    • CuriouslyC 7 hours ago
      My experience with benchmarks and evals is that it can take ~20 runs of a problem for the distribution of answers to start to converge. Ideally you'd know the convergence properties of your algorithm ahead of time and make a Bayesian solution that makes the uncertainty explicit.
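
      A rough sketch of what making the uncertainty explicit could look like for the resume case, assuming you already have the scores from repeated runs on the same resume. The cutoff, the uniform Beta(1, 1) prior, and the function name are assumptions for illustration, not something from the article or the repo:

```python
import math
import statistics


def summarize_runs(scores: list[float], cutoff: float) -> dict:
    """Summarize repeated LLM scores for one resume against a pass cutoff."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)  # sample standard deviation
    passes = sum(1 for s in scores if s >= cutoff)
    # Assumed uniform Beta(1, 1) prior on the pass probability, updated with
    # the observed pass/fail counts: posterior is Beta(passes + 1, fails + 1).
    alpha = passes + 1
    beta = (n - passes) + 1
    return {
        "runs": n,
        "mean": mean,
        "stdev": stdev,
        "std_error": stdev / math.sqrt(n),
        "observed_pass_rate": passes / n,
        "posterior_pass_prob": alpha / (alpha + beta),  # posterior mean
    }


# Made-up scores, only to show usage.
print(summarize_runs([78, 91, 84, 88, 72, 86, 90, 69, 85, 83], cutoff=85))
```

      With few runs the posterior mean stays close to 0.5 whatever the observed pass rate, which is a fair reflection of how little two or three runs tell you.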
  • saidnooneever 10 hours ago
    Count to three, no more, no less. Four shalt thou not count, neither count thou two—excepting that thou then proceed to three. Five is right out.
  • Arch-TK 4 hours ago
    The list of "bonus" criteria and how they come about makes me feel sick.

    I am not currently looking for employment, nor am I currently particularly worried about future prospects if I was suddenly in the position of looking for employment.

    But if I ended up in a position with nothing to lean on but scattering my CV everywhere, well…

    A lot of my major contributions are littered across the internet, private, or even just verbal/consultancy. They're things I did for free, in my spare time.

    I also avoid GitHub. If you just look at my GitHub page for extra context, you would likely miss that delivering that very GitHub page likely involved a few bits of code I wrote.

    Now, I could do a better job of trying to document this stuff, so it could be easier to find… But also I can't quite imagine how that would work.

  • morphology 3 hours ago
    It's funny that even after all these years and all this money invested in technology, we still haven't come up with anything better than word-of-mouth for hiring great people. Many serial founders have said that, despite the most stringent interview processes and the most sophisticated filtering pipelines, they still have a higher hit rate with people they've worked with in the past.

    This isn't to diminish the whispernet. Rather, it shows just how many important signals cannot be quantized.

    • makeavish 12 minutes ago
      True, I have found it to be valid as well
  • bsoles 4 hours ago
    > 35 points for open source contributions

    > 30 for personal projects

    These are insane weights for scoring a software engineer's resume.

    • morphology 3 hours ago
      Insane how? I would expect more points for open source contributions. It is trivial to create a personal project, but that does not carry with it any indicator of quality. Having your work accepted by other maintainers is one indicator at least.
  • jedimastert 5 hours ago
    The blog post itself has pretty strong un-copy-edited ChatGPT vibes.
  • a3w 3 hours ago
    What does ATS mean? Neither github repo nor article explain that.
    • Bedon292 3 hours ago
      Probably: Applicant Tracking System. Used for tracking the people who apply to each of your openings, and the hiring workflow. Where this would likely be used to neck down all the applicants before a person actually looks at them to make judgement calls on who to move forward in the process.
  • davidpapermill 11 hours ago
    A better way to reformulate this problem is for the LLM to be tasked with making a _comparative_ judgement between two CVs. This should prove much more reliable, especially if you give it a third “too close to call” option. You can also ask for clear justifications of preference.
    • srdjanr 10 hours ago
      That's a good idea.

      The only drawback I see is that you should compare every pair of CVs for best results, and that grows quadratically with the number of CVs. Of course you can settle for fewer comparisons and not perfect results. But then I'm not sure if you can hit a good ratio of quality and token spend.

      • skribb 10 hours ago
        Could probably do an elo system and sample pairs. E.g.

        1. Set the elo of all CVs to 1000 elo

        2. Randomly pair up CVs and compare. Winners gain elo, losers lose elo.

        3. Repeat #2 for a few iterations, then remove bottom X% of CVs.

        4. Repeat 2-3 until the amount of remaining CVs is small enough to do an exhaustive comparison.

        I don't have a mathematical proof, but I suspect that this is a decent cost-effective approximation of comparing every pair (depending on the parameters)

      • swiftcoder 9 hours ago
        > you should compare every pair of CVs for best results

        Or compare each one to a reference set? Take 5 resumes of existing employees, rank all candidates against that set, maybe you get some useful level prediction into the bargain

      • davidpapermill 6 hours ago
        I'd just do a quick filter, probably deterministic, then perform a deeper comparison on the selected few.
  • pu_pe 10 hours ago
    He tried with a tiny model (gemma3:4b), got a range from 66 to 99. Then tried again with a small model (gemini 3.1 flash lite), the range was 48 to 64. Would a frontier model be more consistent? Perhaps this tool was optimized for more capable models?
    • srdjanr 10 hours ago
      It makes sense to me intuitively (though I'm not sure if my reasoning is actually correct).

      A worse model may not "know" enough to distinguish between a 70 and a 100 candidate, so it's expected that its output has high variance. But a better model might "know" enough, so it can be more confident and thus more consistent.

  • dc3k 13 hours ago
    Disregarding the fact that this thing is completely broken, its grading rubric is ridiculous to begin with (as was mentioned in the article itself, but I must reiterate how completely stupid this is):

    > 35 points for open source contributions

    > 30 for personal projects

    I don't contribute to open source or have personal projects because I don't spend my free time doing what I do 40 hours a week to make a living. My 15 years of work experience is worth a maximum of 25%, so any company using this idiotic system would pass on me immediately. Open source and personal projects are fine, but in no sane world are they worth 65% of a resume's score.

    • adrianN 13 hours ago
      They are selecting for people who are fine working in their free time. If you contribute to open source you are more likely to contribute to the company on weekends. If instead you have other hobbies or a family that takes up non-work hours you are more likely to drop your pen after forty hours.
      • matheusmoreira 12 hours ago
        Maybe they're selecting for intrinsic motivation. People who enjoy programming to the point they do it for fun, not just because it pays.

        Free software work doesn't imply we work for free. We work on our projects, the stuff that we actually enjoy working on. Nobody is going to work on corporate products without adequate compensation.

        • lukan 11 hours ago
          "Nobody is going to work on corporate products without adequate compensation."

          I guess there sadly are many nobodies who do this to hope to become somebody.

          • matheusmoreira 11 hours ago
            If the open source work is part of a hiring pipeline, sure. Contribute to some repository and have it serve as a resume that gets you hired is also a form of compensation. If the work is also enjoyable, then it's a win either way.
      • emj 12 hours ago
        You might have numbers on that but after working in a place with a strict no more than 40 hour policy my view is that people overwork for many reasons. Being an open source enthusiast is not one of them.
      • another-dave 9 hours ago
        > If you contribute to open source you are more likely to contribute to the company on weekends

        I wonder if that assumption is borne out in reality though?

        I'd imagine if someone's OSS contributions are enough of a factor that it's worth hiring them, they're not going to drop it on a whim to work extra hours on the day job.

        (Assuming you weed out open source contributions like "I made a todo list app in React but licenced it as MIT" or "I fixed a typo in the docs for NextJS". )

      • stevesimmons 12 hours ago
        I'm not sure that follows. I stopped making open source contributions when I switched from mature companies to startups.

        Now all my "non-work" time is spent on startup work. And none of that is visible via GitHub.

  • sleepynoodle 7 hours ago
    I really don't understand this constant changing of numbers. I have tried a bunch of ATS reviewers and every time on the same resume I get different numbers. It's weird and unreliable. I understand the need for doing this to filter through thousands of CVs but maybe there is a better way. Like a take home test at the beginning or a test of some kind.
    • chrisandchris 7 hours ago
      I would say people that think the LLM is doing a better job than they are, are in for a treat. I did expect the results to be of the same quality as if a human does the job - it averages out and has a big error margin.
      • sleepynoodle 6 hours ago
        No wonder I don't get calls. I don't have a separate CV for every application. Good luck to me then!
  • ipython 4 hours ago
    Don't forget DOGE using LLMs to consider which contracts to "munch", based upon a prompt: https://github.com/slavingia/va/blob/35e3ff1b9e0eb1c8aaaebf3....
  • realty_geek 10 hours ago
    Why doesn't something like this exist for real estate? A popular open source AVM (automated valuation model) that helps home sellers get an idea of what their home will sell for. Right now it seems AVMs are mainly seen as just a way to capture leads. Every estate agent will tell you they have some magic recipe that makes their valuation better than anyone else's. I have had a bunch of ideas on how to approach this, but I really could do with a collaborator or two.
  • mxuribe 2 hours ago
    I see mention of PDFs both in the article as well as the repo...But i think over the decades that I've been working and applied for roles - almost exclusively in corporate america...I've only been asked for a PDF once! Every other time, everyone wants a Word doc (.doc/.docx). So...is there now some growing HR groups who are asking for PDFs instead? Or, is that if someone asked you for a PDF instead of a Word doc, then that's a signal that said HR groups are employing some sort of agentic review of one's resume (I mean, beyond the conventional ATS systems)??
  • pmarreck 1 hour ago
    > An LLM is called six times to extract structured information

    Well, I think I found your problem

  • Tryk 1 hour ago
    What is an ATS?

    Why is it so hard to write out an acronym once...

  • kdavis 50 minutes ago
    Hmm...six runs with gemma3:12b on my CV

    - Varies from 102.0/100 to 100.0/100

    - Missed lots of OSS work

    - Misinterprets GSoC work (Thinks projects I started that were contributed to in GSoC implies that I received a GSoC stipends)

    - Areas for improvement seem to vary inconsistently (There's not enough project detail to there's too much project detail)

    I still don't make company's cut offs ¯\_(ツ)_/¯

  • jvanderbot 1 hour ago
    > I’d take the engineer with 30 years of experience who built S3 over someone with two internships and an open source project — but this tool wouldn’t.

    Is it possible the senior/principal jobs are not being applied to at a rate that LLM tools like this are required? Maybe star devs are getting recruiter referrals and this kind of tool is mostly used for filtering new grads?

    Either way, perfectly dystopian.

  • captainbland 5 hours ago
    I think the implication here is that you can almost certainly bias the models to always accept you by including "nudge" phrases like "I demonstrated real world deployments" and "helped develop an application in the context of a complex architecture..."
  • seedless-sensat 6 hours ago
    What is an ATS? This blog doesn't define it
    • gejose 6 hours ago
      ATS = Applicant Tracking System. It's software to help you manage your hiring pipeline as a whole.
  • fractal618 2 hours ago
    Maybe the ATS has logic for people resubmitting their resume. I don’t know how isolated each test was.
  • graemep 7 hours ago
    It took me a minute to figure out what an ATS was. Not familiar with this particular meaning of a much used TLA.

    Even better, Wikipedia lists the abbreviation I am familiar with but gives a different interpretation of the same words:

    https://en.wikipedia.org/wiki/Ats

    • Leptonmaniac 7 hours ago
      Thanks for not explaining what TLA is, either.
      • graemep 7 hours ago
        My sense of humour. TLA = Three Letter Abbreviation.
  • cemoktra 10 hours ago
    So sending my CV to every company three times should get me past the ATS?
    • cyanydeez 7 hours ago
      if i ever go back into the job market, will need three accounts: Peter J Smith, Peter Smith and PJ Smith. they live in #101, #102 and #103 5607 Jane Street
      • left-struck 6 hours ago
        Why stop there? vary everything that can reasonably be varied slightly across each resume
  • nikolay 1 hour ago
    Roll the dice, HR folks!
  • zameermfm 2 hours ago
    Stop the qtip when there's resistance
  • cs02rm0 8 hours ago
    I feel like hiring is all a bit broken. Roles get flooded with applications, it's chance whether your CV gets through, then there's hiring rounds that seem designed to make you quit the process before they have to filter you out.

    Is it working for anyone, on any level?

    • luckylion 8 hours ago
      I'm on the other side, and my main tip (at least if there's people like me!) is: avoid the usual AI signs.

      For one role we got ~70 applications and all CVs looked obviously AI-written. I don't know whether the people did actually do any of the things mentioned and I don't have the time to find out, so the AI-written CVs are a discard-signal for me. (Either those people delegated a very important task to AI and didn't even bother to check, or they are bad using AI and don't know -- I want neither)

      Any CVs that signal they were actually written by a person I will actually look at.

      • quectophoton 6 hours ago
        > For one role we got ~70 applications and all CVs looked obviously AI-written.

        Were those ~70 applications all of them, or were those ~70 applications the result of an AI filtering from a larger amount?

        If the latter, are you sure your AI is not filtering out the hand-written CVs and giving you the ones that have been AI-assisted or AI-written (with or without "the usual AI signs")?

  • YossarianFrPrez 9 hours ago
    Looking at the linked scoring prompt (resume_evaluation_criteria.jinja) [0], I immediately see several red flags that suggest the output won't be reliable. (I'm developing an LLM intensive application where the stakes are high enough that I need the LLM output to be reasonably correct.)

    [0] https://github.com/interviewstreet/hiring-agent/blob/main/pr...

    In no particular order:

    1. The prompt is trying to get the system to do all of the evaluation steps at once. Instead, the system should break down the task of resume evaluation into its subcomponents and have separate prompts for each component. Like "evaluating open source contributions" should be its own task. Same with "assessing the complexity of software projects on the resume." Fwiw, each of the tasks contained within the prompt is woefully underspecified.

    2. The prompt leaves spreads of ~10 points up to the LLM, when it's doubtful that humans are that well calibrated. Take for example:

      > SCORING CRITERIA Open Source (0-35 points) 
      HIGH SCORES (25-35 points):
       - Contributions to popular open source projects (1000+ stars)
       - Significant contributions to well-known projects
       - Google Summer of Code (GSoC) participation
       - Substantial community involvement
    
    Are all of these 35-point examples? Is one a 26-point example? If not, what's the difference? If an expert can't reliably make the judgement, the LLM is going to struggle too. One partial fix is to get rid of the ranges and just say all of these are worth 30 points. An additive point scheme would be better...

    3. The authors of this prompt have left an incredible number of judgement calls up to the LLM, when that's the very thing you want to minimize. Using the same example as above...

    - Are all contributions to open source projects with 1000+ stars equal?

    - What counts as a "significant contribution"? Doesn't that imply that the LLM has to know or read through all of the commits in like the last ~6 months at minimum for the project to understand what the given contribution meant to the project? That itself isn't impossible with tool usage, but again, that'd be a separate task.

    - What on earth counts as "Substantial community involvement"? Why didn't the prompt authors define this, or at least give a few examples?

    Honestly at this point maybe someone should build a tool that scans prompts for adjectives...

    4. This sort of thing is just asking for trouble:

      > SCORES MUST NEVER DEPEND ON:
       Candidate's name, gender, or personal demographic information
    
    
    Just remove this stuff before you send the rest of the resume to the LLM. Even if you ask it not to, it's not a person, it's a very fancy statistical distribution generator. All of the input (including the name) will affect the distribution that gets generated. (This one is not unlike Andreessen's "don't be a sycophant" prompt.)

    5. Obviously this one depends on the LLM in question, but instead of writing things like:

      > DO NOT RETURN A RESUME SUMMARY. RETURN ONLY THE SCORING EVALUATION IN THE SPECIFIED JSON FORMAT. Analyze the following resume and provide a JSON response with this EXACT structure (all fields are required):...
    
    
    The system should utilize the "structured output" option, which guarantees a fixed output format. Also, fwiw, the JSON should force the LLM to pick between categorical options as much as possible. Forced-choice structured output should, at least in theory, cut down on hallucinatory responses and constrain judgement calls.

    6. One major thing that's not in the prompt is anything about traceability. This system should be designed so that humans can review the logs and make sure this is working as intended.

    7. Another thing that is missing in the file is what I'll call evidence of a theory of coding / coder quality. Most of the examples are designed to have the LLM assess proxies for code quality, not code quality itself. Surely both should be taken into account?

    I'm not an expert at evaluating coders. But two pretty basic LLM-answerable things I would ask are: How well do a candidate's 5 most recent commit messages match the contents of those commits? Do the claimed technical skills on the resume match their GitHub code? (i.e., if they say they know R, is there any evidence of that on their GitHub?)

    8. The prompt also seems unaware of what it's asking the LLM to do:

      > LIVE DEMO BONUS: Projects with working live demos should receive 10-20% higher scores
    
    
    This implies that the LLM can use tools, but even then, I'd be pretty wary of its ability to fully execute this part of the prompt without more detailed instructions, examples, and guidance. There are very likely tons of edge cases here.
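
    To make points 2 and 5 a bit more concrete, here is a minimal sketch of a forced-choice, additive version of the open source section. The questions, option names, and point values are all made up for illustration and are not the repo's rubric. The idea is that the LLM only picks one option per question, and plain code does the arithmetic:

```python
import json

# Hypothetical forced-choice questions with fixed point values per answer.
RUBRIC = {
    "oss_project_popularity": {"none": 0, "under_1000_stars": 5, "1000_plus_stars": 10},
    "oss_contribution_type": {"none": 0, "docs_or_typo": 2, "bug_fix": 6, "feature": 10},
    "gsoc_participant": {"no": 0, "yes": 10},
}


def score_open_source(raw_response: str) -> int:
    """Validate a forced-choice JSON answer and add up the fixed points."""
    answers = json.loads(raw_response)
    total = 0
    for question, options in RUBRIC.items():
        choice = answers.get(question)
        if choice not in options:
            # Reject anything outside the allowed options instead of guessing.
            raise ValueError(f"{question}: unexpected answer {choice!r}")
        total += options[choice]
    return total


# Stand-in for what the model would return; no real API call is made here.
example = (
    '{"oss_project_popularity": "1000_plus_stars", '
    '"oss_contribution_type": "bug_fix", "gsoc_participant": "no"}'
)
print(score_open_source(example))  # 10 + 6 + 0 = 16
```

    In a real pipeline the same option lists would presumably go into the provider's structured-output schema as enums, so the model cannot return anything else in the first place. The validation step is still worth keeping as a backstop, and the stored answers double as the audit trail from point 6.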
  • rkuska 12 hours ago
    This reminds me of my former CTO. He would take bunch of CVs and randomly throw some of them in a bin. He didn’t want to work with “unlucky” people.
    • psalaun 12 hours ago
      I thought this was only an old urban legend; some people actually use this technique? Especially in a trade supposed to be led by people trained in sciences?
      • gregates 10 hours ago
        Given how often it's been mentioned here, it's likely that this is an urban legend that people are pretending to have first-hand knowledge of for karma. In a trade that's supposed to be led by people trained in sciences, no less!

        (A more charitable interpretation would be that aforementioned CTO was making a joke that didn't land.)

        • cyanydeez 7 hours ago
          or it's so old, people would make the joke and interns would repeat it unwittingly. no one has to consciously be lying for this type of meme to continue spreading.
      • subscribed 6 hours ago
        That'd be pretty gross for a CTO if it were real.
      • aquariusDue 10 hours ago
        It's OK! We can disguise it as the Secretary Problem and it'll be fine, we could even write a post on the company blog about it. /s

        https://en.wikipedia.org/wiki/Secretary_problem

    • hahahaa 12 hours ago
      The problem is with this system he only worked with unlucky people.
  • bryanrasmussen 8 hours ago
    >If your company’s cutoff sits at 85, I fail 65% of the time. Same exact resume, different luck.

    Your resume's reception is always affected by random factors, only now you are able to test, debug and technically critique the randomness.

    • xorcist 6 hours ago
      I think the question is why bother with an LLM if randomness is decisive?

      Just roll the dice. I mean, it's not the worst you can do to narrow a subset.

      • bryanrasmussen 4 hours ago
        right, and it's cheaper, but people want the illusion of determinism. Some people say that they want determinism, but if they do nothing to assure themselves it is deterministic I think it is fair they really only want a good enough illusion.
  • bhanu786 10 hours ago
    ATS resume checkers usually look at keywords, formatting, and spacing, and score accordingly, as if checking against some reference format. Depending on how far a resume departs from that format, it might get low scores.
  • speedgoose 9 hours ago
    Many em dashes and a "This is not, it is…" later, I think this article would have been a much better critique if it didn't use an LLM to (re)write some parts of it.
    • another-dave 9 hours ago
      I always find it funny when a technical crowd starts picking on em dashes as a sure sign of AI. I mean, are keyboard shortcuts really that difficult for developers? Some of us always knew how to use correct punctuation, even before LLMs existed.

      Also, neither "this is not" or "it is" appear at all in the article?

      • speedgoose 9 hours ago
        It’s a lot of them. It’s a style. I know some people who used them before and use them less nowadays.

        > This non-determinism isn’t a bug you can just fine-tune away, it’s a fundamental design flaw.

        • actionfromafar 7 hours ago
          Funny how something which was catchy at one point makes my skin crawl now.
  • neya 12 hours ago
    I wonder how is this even legal? The only useful job the HR departments are ever required to do - they decide to automate it? Aside from being a daycare for adults, what exactly does HR accomplish? It's clearly NOT on the side of employees, but this seems like they're clearly NOT on the side of employers, either.

    While resumes are being filtered left and right, they just make TikToks on the company's dime [1]. What a sad state of affairs.

    [1] https://www.youtube.com/shorts/wSug80Vg5JU

    • srdjanr 10 hours ago
      They could be using this just to throw out the obviously bad CVs, and then manually go over the rest. I'm not sure if they do this in practice, but the tech itself can be useful.

      Also if HR was really useless (or actively hurting the company) they wouldn't still have a job (or they'll lose it eventually). No one likes burning money for no reason. So obviously they are doing something useful.

      • syockit 9 hours ago
        The last time I heard HR being completely let go was with a fintech company Bolt. Then again, that company was midsized, around 200-500 people or so. For larger companies, it's going to be difficult to even realize that HR is redundant in the first place.
  • 0xpgm 11 hours ago
    With such kind of ATS systems, is it still a thing to optimize for a one page resume that is easy for a human reviewer to scan, or just include enough buzzwords and external links to try and please the LLM?
    • jorisw 10 hours ago
      I wouldn't assume based on this one thread/article that this is what you need to optimize your resume for. Nor that a majority or even significant group of reviewers is even using LLMs. I've been involved in hiring pipelines and never even thought of using LLMs to review incoming candidates.

      However given the time constraints reviewers have, yes, the former (making a resume easy to consume quickly) is a huge help.

  • dev_l1x_be 7 hours ago
    Did anyone try to prompt hack this setup?
  • ChicagoDave 11 hours ago
    I was inspired by this. I made a Claude skill to take my resume and compare it to any job description to point out viability and gaps. Pretty cool skill. I'll post it somewhere.
  • steve_j_choi 12 hours ago
    This could be used as a good way to self-evaluate one's current position from the company's point of view. you would tweak prompts and guidelines that are expected from the company and see how you score
    • hahahaa 12 hours ago
      I sort of hope we land on 2 agents, one working for the candidate and one for the employer, doing a screen round. Salary compatibility could be negotiated by a 3rd party bot that knows both parties' ranges and what would be needed at each end of the range, and figures out yes/no whether it's worth going ahead. Such a time saver.
  • jackjd 6 hours ago
    I've done similar things and used GitHub Copilot to scan a folder of 40 CVs and rate them -> I then review the top 10 CVs and comment on every rating whether I agree or disagree and why -> I then asked AI to re-rate all the CVs according to my comments. -> I then reviewed all the CVs against their ratings; the AI did a much better job for that 2nd round after its learnings.

    It took more time than if I just reviewed the 40 CVs myself, but that was an experiment, and I think it shows the AIs can be trained on your comments. And if there is enough training and a good knowledge system that allows AI to apply the learning in those trainings, it can eventually become a lot more accurate at this task?

  • quink 12 hours ago
    "A computer can never be held accountable, therefore a computer must never make a management decision."
    • 12_throw_away 4 minutes ago
      Corollary: If a computer makes a business decision, the person who delegated the decision to the computer must be held accountable.

      Consequence: All business decisions will eventually be delegated to computers via sufficiently convoluted and untraceable processes such that no manager can ever be held accountable.

  • wielebny 6 hours ago
    This seems extremely illegal in Europe.
  • padolsey 9 hours ago
    This is just the 'LLM judge', very badly implemented without any scientific prudence. What a joke. To be terse: you cannot rely on LLMs to provide standardized scores against arbitrary criteria. To get close to 'reliable' you would need highly tested rubrics, grounded in human decision-making, and you'd need to avoid all the measurement biases these things are riddled with... positional/order effects, anchoring on whatever numbers you stuffed into your own prompt, scale-format sensitivity (a 1–5 and an A–E scale give different answers for the same input), holistic-vs-isolated context effects, and lovely examples like where adding a "be unbiased" instruction makes it more biased. I've studied this at length. You cannot even _begin_ to approach this problem seriously without held-out validation, inter-rater agreement, and ground truth. This repo is just a quagmire of wishful vibes with random numbers littered throughout.
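
    For readers unfamiliar with inter-rater agreement, here is a rough sketch of one common measure, Cohen's kappa, computed between an LLM's pass/fail labels and a human reviewer's labels on the same resumes. Using kappa and pass/fail labels here is an assumption made for illustration; the repo under discussion does not do this:

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Fraction of items on which both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

    A kappa near 1 means the two raters agree far more than chance would predict; a value near 0 means the agreement is no better than chance.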
  • jdw64 10 hours ago
    It seems like the design is flawed, probably because the scoring structure and conditions are wrong. And originally, due to the nature of LLMs, even if the input is unstructured, when you design something like a RAG system, you usually need to create a verifiable evidence table. Even with that, the scores are still probabilistic by nature, but at least they stay within an error distribution that I can verify. But it doesn't seem like there's any such evaluation criteria here.

    Typically, retrieval should be tied to evaluation metrics, evidence should be linked to scores, and you also need to account for parsing errors.

    But personally, I'm weak to these kinds of ATS systems (ugly appearance, non-native English speaker, didn't go to a good university), so if this kind of filtering existed, I probably would have never had a job in my entire life. Come to think of it, even now I don't have a proper job—I just bid on projects at the lowest price and implement them. So maybe it doesn't really matter whether such a system exists or not

  • brikym 11 hours ago
    So that's where the Windows XP file copy dialog author now works.
  • nnevatie 8 hours ago
    > An LLM is called

    Hooray for incidental non-determinism.

  • suzukivenom 7 hours ago
    Never understood how people can think an .md file can actually evaluate a human being.
    • JrProgrammer 7 hours ago
      They are not evaluating human beings though. They evaluate a textual representation of a human being’s work experience.

      Not that I agree with this AI approach but when hiring, the real test begins after this initial hurdle

  • polynomial 3 hours ago
    The most concerning thing here is the temperature problem. If your harness isn't providing deterministic output at a temperature setting of 0.0, it is broken.
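
    A minimal sketch of temperature sampling, assuming the usual softmax-with-temperature formulation and a separate greedy branch at exactly 0 (as described upthread). `sample_token` is a hypothetical helper, not any inference engine's actual API:

```python
import math
import random
from typing import List, Optional


def sample_token(logits: List[float], temperature: float,
                 rng: Optional[random.Random] = None) -> int:
    """Pick a token index from raw logits at the given temperature."""
    if temperature == 0.0:
        # Greedy branch: plain argmax, which avoids dividing by zero.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

    The greedy branch is deterministic only when the logits themselves are bit-identical between runs. If batching or floating-point ordering nudges two near-tied logits past each other, the argmax flips even at temperature 0.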
  • weare138 4 hours ago
    I'm from genx. This has been a serious issue in the tech industry for decades even before AI somehow made it worse. The real problem is resumes themselves. It's an outdated format that was originally designed for completely different industries that just doesn't work with ours. And this is a great example of what I'm talking about:

    The scoring is out of 100, with up to 20 bonus points on top:

    35 points for open source contributions

    30 for personal projects

    25 for work experience

    10 for technical skills

    Up to 20 bonus points for startup experience, a portfolio site, a technical blog, etc.
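
    As a sketch of how those caps combine, assuming each category is simply clamped to its maximum and summed (how the repo actually combines them is not stated here, and the field names are made up):

```python
# Category caps as listed above; the key names are hypothetical.
CAPS = {
    "open_source": 35,
    "personal_projects": 30,
    "work_experience": 25,
    "technical_skills": 10,
}
BONUS_CAP = 20


def total_score(category_points: dict, bonus_points: float = 0.0) -> float:
    """Clamp each category to its cap, then add the capped bonus."""
    base = sum(
        min(max(category_points.get(name, 0.0), 0.0), cap)
        for name, cap in CAPS.items()
    )
    bonus = min(max(bonus_points, 0.0), BONUS_CAP)
    return base + bonus
```

    Under that assumption the maximum is 120 on a nominal 100-point scale.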

    All the AI is doing is trying to suss out the candidate's portfolio, which is really what we should be submitting when we apply for a position instead of being forced to somehow condense it to a set of BS business-speak bullet points. Especially when employers are now deploying AI systems just to figure out what's in a candidate's portfolio to begin with.

    When all you have is a hammer every problem is a nail. The process itself is broken. We need to kill the outdated concept of resumes before it kills the industry.

  • justinhj 4 hours ago
    "If your company’s cutoff sits at 85, I fail 65% of the time. Same exact resume, different luck."

    Sounds like they have replicated the existing recruitment process
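
    The failure rate in that quote is just the fraction of repeated runs that land below the cutoff. A tiny sketch, with made-up scores standing in for re-runs of one unchanged resume (these are not the article's numbers):

```python
from typing import Sequence


def fail_rate(scores: Sequence[float], cutoff: float) -> float:
    """Fraction of repeated runs whose score falls below the cutoff."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(s < cutoff for s in scores) / len(scores)


# Hypothetical scores from re-running one unchanged resume; not real data.
runs = [78, 91, 83, 88, 74, 86, 80, 92, 79, 84]
```

    With these invented numbers, `fail_rate(runs, 85)` is 0.6 while `fail_rate(runs, 20)` is 0.0, which is why the position of the cutoff matters so much.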

  • maxignol 10 hours ago
    Are many people using the HackerRank ATS?
  • swingboy 8 hours ago
    I’ve always assumed any LLM output that was some type of rating or score was bullshit. Unless the LLM writes a Python script to calculate the score (and even then…) then the score it outputs is just the next most likely token, taking into account temperature and what not.

    You see a lot of frameworks for things like spec-driven development make use of scoring how good the spec/design/plan is and it’s like, uhhh…

    • joelthelion 8 hours ago
      > is just the next most likely token, taking into account temperature and what not.

      This doesn't mean anything. All LLM output is like that.

      That said, I agree that LLMs are terrible at grading stuff, except perhaps if you give them a very detailed evaluation grid.

  • cyberax 13 hours ago
    Ah... The AI learned the old HR trick: take 50% of resumes and throw them out without looking. Rationale: "we don't need unlucky losers".
    • worldthruword 11 hours ago
      There are plenty of resumes in the sea. Assuming thorough mixing and statistically speaking, throwing out 50% of resumes is a good enough heuristic.
  • thrance 7 hours ago
    I cope by telling myself that I probably wouldn't want to work for a company that used an LLM to filter my resume out.
  • nullc 7 hours ago
    The true test of HackerRank is whether you can set up a system that combines a document editing / paraphrasing LLM with gradient descent on the HackerRank LLM to turn your arbitrary resume into a reliable 120 out of 100.

    One of the weird properties of other people using LLMs is the potential of having oracle access to your opponent. Even if you don't have their exact LLM a good guess at it may be a better model of the opponent than you ever had before.

  • carb 9 hours ago
    It's a good analysis but the AI slop writing makes me not trust you've reviewed this and I'm unable to finish or subscribe. I'm sure you're a great blogger but this is holding it back!
  • maxignol 5 hours ago
    Lol next time I’ll just apply with 4 accounts and maybe get in once.
  • rvz 10 hours ago
    I see.

    > LLM is called six times to extract structured information

    Followed by

    > The default model is gemma3:4b, running at temperature 0.1 — low, supposedly nudging the model toward deterministic outputs.

    This is exactly why hiring is even more broken: the people looking for candidates are just as unqualified, if not more so.

    Using much weaker LLMs to replace the person in charge of the final judgement call is the wrong solution as this is a plain old social problem.

    Even if you wanted to use LLMs for this case, the default configuration and model choice are laughably flawed. This LLM can’t be trusted as it doesn’t even know what it is reading.

    The correct solution is either advanced OCR with keyword ranking with a basic filter or a far stronger LLM that excels at document / vision parsing benchmarks with an experienced person making the final judgement call in case the technology misses a critical detail.

    Rather than using this less accurate one that hallucinates out its decision depending on a dice roll.
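
    A rough sketch of the keyword-ranking-plus-filter option, assuming resume text has already been extracted (the OCR step is out of scope here) and that keywords are single words. The function names and weights are hypothetical:

```python
import re
from typing import Dict, List, Tuple


def keyword_score(text: str, weights: Dict[str, float]) -> float:
    """Sum the weights of keywords that appear as whole words in the text."""
    tokens = set(re.findall(r"[a-z0-9+#]+", text.lower()))
    return sum(w for kw, w in weights.items() if kw.lower() in tokens)


def rank_resumes(resumes: Dict[str, str], weights: Dict[str, float],
                 min_score: float = 0.0) -> List[Tuple[str, float]]:
    """Score every resume, drop those below min_score, sort best first."""
    scored = [(name, keyword_score(text, weights)) for name, text in resumes.items()]
    kept = [item for item in scored if item[1] >= min_score]
    # Ties are broken by name so the ordering is fully reproducible.
    return sorted(kept, key=lambda item: (-item[1], item[0]))
```

    Unlike a sampled LLM score, this gives the same ranking for the same input every time, at the cost of missing anything not phrased as an expected keyword.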

    • chrisjj 6 hours ago
      > an experienced person making the final judgement call in case the technology misses a critical detail.

      That would fail to meet the objective of reducing the costs of hiring an experienced person - the entire point of outsourcing to a chatbot.

  • Traubenfuchs 10 hours ago
    This actually makes a lot of sense, it's testing the luck of the candidate through the rng feeding the LLM. You wouldn't want to hire unlucky employees after all! Hiring managers of the past would solve this by throwing every second resume in the trash, now this is a built in feature of ATS.
  • eudamoniac 3 hours ago
    I applied to Posthog twice, a couple weeks apart, and was rejected both times at 1:06am on Monday, exactly. So they are obviously using this sort of thing. Just thought I'd name and shame where I can.
  • mihaaly 10 hours ago
    That so many people are willing to take part in this kind of robotic practice in human employment makes me think many now consider it as unavoidable as global warming and would rather play along: adapting their career (life) to it, sculpting it toward a specific look, doing things that will earn them points on some arbitrary test run. That feels dangerous to me, leading to a superficial workforce: people who are not good at something, including judging a problem and its solution, but good at manipulation.

    Speculative thought only, of course.

  • conartist6 6 hours ago
    Boooo you whores
  • glouwbug 13 hours ago
    I guess at least HR doesn’t have to read 1,000 resumes. Heck, to be frank, could they make sense of the first 10 resumes?
  • vanessa1211 6 hours ago
    I have a love hate relationship with ATS
  • zuzululu 9 hours ago
    this is why i dont feel sorry for working 3 remote jobs
  • yieldcrv 12 hours ago
    This will get patched, as in I'll optimize my resume for this, and so will so many other people that any edge disintegrates.
  • nautilus12 5 hours ago
    “The demands of ritual are always stronger than those of reason.”
  • nekusar 6 hours ago
    There's a whole lotta analysis and math and bar charts.

    But the big question I want to know is "Why did I score that?" And these slop machines absolutely cannot explain anything. That's the root problem with LLMs as a whole. There is no way to describe WHY an llm makes a decision.

    Was it because they are a woman? Does the woman's name have more pregnancies than other names? Was it because their job history makes the person look older (over 40)? Is the person black, or is the name read as black? Is the name or address attributed to higher criminal tendencies?

    But no, you don't get to know ANY of that. Slop machine says 66/100, if you're lucky enough to even get a number. Usually it's a 30-second rejection, or a rejection at GMT 0:00 when the batch is processed and you summarily failed.

  • psychoslave 8 hours ago
    >You might as well throw out half the resumes and tell the applicants you don’t fuck with bad luck.

    Hmm, well, maybe with the nuance of elite class reproduction (which doesn’t prevent showcasing a few class-jumpers in case anyone criticizes the perfect meritocracy being run), that’s basically what people get: a crude truth, but truth nonetheless.

    Oh don’t take it personally. Your own bespoke hand-tailored process of course is different, it does give the opportunity to everyone to reach the most accomplished version of themselves beyond what they ever dare to dream.

    It won’t help though with the systematic failure of aiming to provide an accessible path to flourish for everyone and letting no one behind.

    Again, this is no fault of any specific player, but as long as a majority feel compelled to move within the frame of the game with few winners that merit all they got in contrast to large stock of inept losers, the outcomes are no wonder.

  • nicodjimenez 7 hours ago
    I actually just built an ATS for my company Mathpix. But it never occurred to me to use resumes. Basically we have a set of company values and a specific open ended questionnaire to gauge the fit:

    https://mathpix.com/careers/apply

    Then internally we have dashboards and sorting based on AI agent scoring. I noticed the scoring is imperfect but still saves a lot of time. Candidates scored at or below 2/5 are reliably bad and candidates above 4/5 are consistently impressive and leave thoughtful answers.

    The biggest thing is not using resumes. You can’t reliably gauge applicants without a writing sample and resumes are the worst form of writing sample. Also you need to be intentional about who you’re hiring for, both to craft the questions as well as grade the responses.

    • mdorazio 6 hours ago
      This seems likely to be worse. How do you screen out people who point an LLM at your values and ask it to answer your questions in a way likely to appeal to a recruiter using an LLM to score the responses?
  • sp2hari 6 hours ago
    HackerRank CTO & author of this repo here

    There's no better feeling than building something open source and watching it take off. Nine months ago, I built a simple hiring agent to solve one very real problem.

    Things it is not: It's not an ATS. We don't use it to screen our open roles. Our customers don't use it either.

    Here's what it is: Every year at HackerRank, we get 50,000 to 60,000 intern applications. No human can read that many resumes well. So I built something to rank them, helping me decide which resumes to read first.

    [This was before we built AI Interviewer (Chakra) to automate the first round of interviews, so candidates are no longer rejected based on their resumes alone.]

    Two things worth clarifying since I've seen them come up in this thread:

    The default model is gemma3:4b because it's what runs locally on most laptops - no cloud API needed. Actual resumes are evaluated using a top Gemini model. The repo ships with a demo config, not the production one.

    The cutoff score was set very low — the system was designed to rank resumes, not reject them. Only resumes at the very bottom of the distribution were filtered out. The vast majority passed through to human review, where the real decisions were made.

    Over the last week, it's taken on a life of its own. People are cloning it, running their own resumes through it, opening issues, sending PRs.

    I contributed to open source a lot in college. Somewhere along the way, I drifted away from it. This week reminded me how good that feeling is. This thread has also given me more ideas than I expected. The critiques here are sharp and I'm already thinking about how to act on them. Improvements are coming.

    • orsorna 6 hours ago
      You know you're not writing for LinkedIn? Platitudes about drifting away, or about watching your project "succeed" by being really popular, are not relevant to the main concerns raised by this piece. Particularly brushing off the non-deterministic score calculation.
    • beardedwizard 6 hours ago
      I'm a bit disappointed to see "The critiques here are sharp", a Claude tell, in a response which (to me) is trying to subtly argue that hackerrank is not overly reliant on LLMs.

      I'm not sure if your intent was to come across as having written this yourself, but it did not have the effect of improving my perception that this approach is flawed.

      I was also disappointed that you didn't address the variability in scores. I'm inferring that you believe the larger model takes care of the main observation in the post, but I don't really see you directly addressing the points.

      Maybe it's just me.

      • sp2hari 5 hours ago
        There is variability in scores and that's expected given we are ultimately using an LLM to score. At least, when I used it 7 months ago, the only way I could avoid it was by keeping the cutoff score low (as low as 10 or 20).

        Reading this thread, I'm hoping to minimize the variability even further (even though I know it can't be fully removed).
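
        One way to damp that variability, sketched under the assumption that a resume can be scored several times and the runs aggregated. `score_once` stands in for whatever function calls the model; this is not what the repo currently does:

```python
from statistics import median
from typing import Callable, Dict, List, Tuple


def rank_by_median(resumes: Dict[str, str],
                   score_once: Callable[[str], float],
                   runs: int = 5,
                   cutoff: float = 20.0) -> List[Tuple[str, float]]:
    """Score each resume `runs` times, rank by median, drop only the very low."""
    ranked = []
    for name, text in resumes.items():
        mid = median(score_once(text) for _ in range(runs))
        if mid >= cutoff:  # low cutoff: filter only the bottom of the distribution
            ranked.append((name, mid))
    return sorted(ranked, key=lambda item: (-item[1], item[0]))
```

        The median shrinks the effect of a single unlucky run but does not remove the noise, and it multiplies the number of model calls by `runs`.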

    • DiskoHexyl 1 hour ago
      >>It's not an ATS.

      >>No human can read that many resumes well. So I built something to rank them, helping me decide which resumes to read first

      Translation: it's an ATS.

      >>the system was designed to rank resumes, not reject them

      >>Only resumes at the very bottom of the distribution were filtered out

      Translation: it was designed to reject the CVs

    • rendaw 5 hours ago
      Do you read all ~50,000 then? Just with the ranked ones first?

      Or are you using it to screen? I'm confused.

      • sp2hari 3 hours ago
        There are some with very low scores that were ignored (like < 20).

        The rest, the ones with good scores (more than 40K of them at least), were reviewed manually.

    • jkhdigital 6 hours ago
      Saw this comment at the top with 0 replies and thought “How is that possible??” and then saw the “0 minutes ago” timestamp. Only on HN can you stumble into the comments section just moments after a CTO, founder, author, etc. left unfiltered remarks about the exact topic of the post. Never change HN.
      • lewispollard 6 hours ago
        Depends how "unfiltered" you consider LLM output to be.
    • rizsyed1 6 hours ago
      Thank you for your fantastic work!