A 10 year old Xeon is all you need

(point.free)

192 points | by cafkafk 4 hours ago

28 comments

deng 55 minutes ago
Nice post and technically impressive work. I agree we need to understand the build pipeline and be able to do things locally. However, depending on your electricity cost, it might not make sense financially. These old servers are not energy efficient at all (I'm guessing that old Xeon server will easily pull 200W on load), and that model is currently at 0.1$/0.3$ per 1M tokens (with 76 tps and 262k context) in Openrouter (also, these servers are LOUD).
EDIT: I stand corrected, 200W is apparently way too high of an estimate. I used to run a bunch of old Xeon servers and they slurped watts like crazy, but I can't remember which ones exactly those were.
- toast0 35 minutes ago
  2620v4 is not a power slurping beast. Depending on the server board, it might not be either. Servers are often loud, but it depends.
  There's a lot of budget hosting built around chips like these, and they're surprisingly power efficient.
- jansommer 51 minutes ago
  It should be closer to 85W on load. And it's incredibly silent on even a low end cooler. I rarely get above 50° Celsius.
  - deng 40 minutes ago
    OK, then you're in luck. I had a bunch of old 1U rack servers and even in the next room it was too annoying to run them (they had a bunch of 40mm fans which always ran at full speed, because in a server room, no one can hear you scream).
    - jansommer 26 minutes ago
      Could it just be really bad cooling? Looking at the 9800X3D, it seems like it's running in a similar range wrt TDP unless you really push the 9800X3D. I'm comparing with desktop CPUs because that's what my workload is. The CPU governor is set to performance (no schedutil). No audible change in fan speed during heavy compilation or gaming (very silent humming), and I don't have any fans besides cheap intake, CPU and exhaust fans (1 each) + an excessive amount of dust.
      - deng 13 minutes ago
        These servers had no fan control whatsoever, they always ran full blast. That's not untypical for rack servers, because as written: they are designed for server rooms, and you're supposed to wear ear protection there anyway... Yes, I could've modified them, but I ditched them because running them simply made no sense (especially the high idle power consumption was ridiculous).
  - consp 38 minutes ago
    Only when you remove it from the original server or enable low fan mode (if available). Most 1U/2U cases will happily blow at full speed well over 90 dB.
    You likely need to replace the flow-through server chassis system with an active "normal" cooler to achieve a bit of silence.
    85W might be about right. My old server CPU is in the same ballpark, and when compiling kernels it reached about 90W in power usage. If you want to keep it running: idle is not very low power unless you have one of the "low power" L versions, keep that in mind.
    - tjoff 18 minutes ago
      Get a 4U case, many options if you want to combine it with a NAS. Not hard to cool and keep somewhat quiet. If you can store it in a closet or something that helps too.
      Well, you can use it for lots of other things as well.
      Compared to the cloud you can probably save up to buy a new server every month. And don't underestimate the gains of having something to experiment on and play with.
cafkafk 4 hours ago
Hi HN. I wrote this post after getting frustrated by the lack of ways to run the new Gemma 4 Drafter models, and mainstream tools not prioritizing this, and hiding all the performance levers.
I ended up getting a modern 26B MoE model (Gemma 4) running at reading speed on an old recycled server with a single Xeon E5-2620 v4 and 128GB of DDR3 RAM (and no GPU). It took a lot of work, but it actually worked out somehow.
I've also linked the quants at the end, but they're not gonna run unless you use the ik_llama-cpp fork I mention, see other posts for more details.
I'm not an ML engineer, so I'm by no means an expert, and the server is busy acting as a Nix cache, but if you have any questions, I can try to answer them on a best-effort basis.
- Sweepi 1 hour ago
  "-t 8 matches physical cores. The machine has 16 SMT threads but only 8 cores. On a memory-bound workload, oversubscribing threads adds scheduling cost without adding throughput: the cores are waiting on DDR3, not on each other."
  But ... isn't that a classic use case for SMT? Giving T1 sth. to do while T0 is waiting on DDR(3) and vice versa?
  I also don't understand the explanation of "--cpu-moe". If an expert has ~ 4.0 GiB of parameters, why does optimizing the sequence of experts minimize cache thrashing? With 20 MiB of L3 cache vs 4.0 GiB of parameters, it won't cache any noticeable amount of the parameters, will it?
  As mentioned by others, only some Intel Xeon E5-2xxx v4 did support DDR3, and according to Intel, the E5-2620 v4 is not one of them.
  - zamadatix 43 minutes ago
    > But ... isnt that a classic use case for SMT? Giving T1 sth. to do while T0 is waiting on DDR(3) and vise-versa?
    Waiting in terms of latency. When the bus is mostly empty and it takes a while to make a round trip it's great to try to find a few extra passengers to put on it. When the buses are all completely full adding the extra riders just makes the bus stop that much more chaotic.
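    A rough sketch of the "full bus" arithmetic, not a measurement. It assumes quad-channel DDR4-2133 (as other commenters suggest for this CPU), about 4B active parameters per token (read off the "A4B" in the model name), and about 8.5 bits per weight for Q8_0. If each generated token has to stream the active weights from RAM once, peak bandwidth caps the decode rate no matter how many threads are waiting:

```python
# Upper bound on decode speed for a memory-bound model.
# Every input below is an assumption for illustration, not a measured value.

CHANNELS = 4                  # assumed: quad-channel memory controller
TRANSFERS_PER_SEC = 2133e6    # assumed: DDR4-2133
BYTES_PER_TRANSFER = 8        # one 64-bit channel moves 8 bytes per transfer

ACTIVE_PARAMS = 4e9           # assumed from the "A4B" in the model name
BITS_PER_WEIGHT = 8.5         # assumed: Q8_0 = 32 int8 weights + one 16-bit scale

L3_BYTES = 20 * 2**20         # the 20 MiB L3 mentioned in the comment above

peak_bandwidth = CHANNELS * TRANSFERS_PER_SEC * BYTES_PER_TRANSFER  # bytes/s
bytes_per_token = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8
max_tokens_per_sec = peak_bandwidth / bytes_per_token
l3_fraction = L3_BYTES / bytes_per_token

print(f"peak bandwidth:     {peak_bandwidth / 1e9:.1f} GB/s")
print(f"bytes per token:    {bytes_per_token / 1e9:.2f} GB")
print(f"decode upper bound: {max_tokens_per_sec:.1f} tokens/s")
print(f"L3 holds about {l3_fraction:.2%} of the weights touched per token")
```

    Under these assumptions the ceiling is set by bandwidth, not by thread count, and the L3 holds well under 1% of the weights read per token.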
- gdjdhdheb 1 hour ago
  You sure you got DDR3 .. I have 2 e5 v4 rigs at home and both have ddr4 ... Unless I am wrong and 2011-3 supports ddr3 and ddr4
  - lightedman 1 hour ago
    The first two generations supported DDR3 only. Haswell (v3) and Broadwell (v4) brought DDR4 support.
- fragmede 4 hours ago
  (purple on black is really hard to read)
  You say it runs "at reading speed". Have you benchmarked it?
  - cafkafk 3 hours ago
    > (purple on black is really hard to read)
    Noted, and agree (it looks like it has also already been clicked, which I dislike). I honestly need to redo the themes.
    > You say it runs "at reading speed". Have you benchmarked it?
    At some point a few weeks ago, yes I think so, but I didn't write it down for some reason... so I'll have to find a time when it's not busy and do it again without a noisy system. Right now the system is noisy, but that said doing it like this:
    llama-cli --model gemma-4-26B-A4B-it-Q8_0.gguf --model-draft gemma-4-26B-A4B-t-assistant-GGUF/wikitext-2-raw_ik-llama-mtp_drafter-conservative/gemma-4-26B-A4B-it-assistant-Q8_0.gguf --spec-type mtp --draft-max 3 --draft-p-min 0.0 --color -sm graph -smgs -sas -mea 256 --split-mode-f32 --temp 0.7 --cpu-moe -t 8 --flash-attn on --mla-use 3 --merge-up-gate-experts --special --mlock --run-time-repack --spec-autotune --no-kv-offload --parallel 8 --jinja -p "Why is the sky blue?" -n 128
    Gives:
```
  llama_print_timings:        load time =   83911.65 ms
  llama_print_timings:      sample time =      26.99 ms /   128 runs   (    0.21 ms per token,  4742.15 tokens per second)
  llama_print_timings: prompt eval time =     343.41 ms /     7 tokens (   49.06 ms per token,    20.38 tokens per second)
  llama_print_timings:        eval time =   10639.36 ms /   127 runs   (   83.77 ms per token,    11.94 tokens per second)
  llama_print_timings:       total time =   11114.98 ms /   134 tokens
```
    So 11.94 tokens per second while it's also playing binary cache and CI builder.
    When I do it properly, I'll add it to the blog as well!
    - anon-3988 2 hours ago
      I am pretty sure llama.cpp has its own benchmarking binary that you can use.
      - mft_ 42 minutes ago
        llama-bench is part of the llama-cpp package, but from recent experimentation, the settings it is able to (or is documented to?) accept lag behind somewhat. Not sure whether it would accept all of the esoteric settings in the article?
    - ekianjo 1 hour ago
      20 tokens per second for eval time is the killer here. It means you can't use this to process any meaningful amount of text.
      A GPU typically processes close to 1000 tokens/s during eval.
      - boutell 40 minutes ago
        I'm pretty sure eval time is token generation time where it's actually outputting new tokens. If you're getting a thousand per second on that, I'd love to know on what.
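        For anyone mixing up the two numbers, here is a small sketch that recomputes both rates from the timings posted above. The two log lines are copied from that output; the regex and the function name are just my own illustration. "prompt eval" is prompt processing (prefill), "eval" is token generation:

```python
import re

# Two lines copied from the llama_print_timings output quoted in the thread.
LOG = """\
llama_print_timings: prompt eval time =     343.41 ms /     7 tokens
llama_print_timings:        eval time =   10639.36 ms /   127 runs
"""

# Captures the phase name, the elapsed milliseconds, and the token count.
PATTERN = re.compile(
    r"llama_print_timings:\s+(.+?) time\s*=\s*([\d.]+) ms\s*/\s*(\d+)"
)


def tokens_per_second(log: str) -> dict[str, float]:
    """Return tokens per second for each phase found in the log text."""
    rates = {}
    for phase, millis, count in PATTERN.findall(log):
        rates[phase] = int(count) / (float(millis) / 1000.0)
    return rates


rates = tokens_per_second(LOG)
for phase, rate in rates.items():
    print(f"{phase:>12}: {rate:.2f} tokens/s")
```

        So the 20.38 figure is prefill on a 7-token prompt, and 11.94 is generation.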
- arpinum 2 hours ago
  How many watts is that setup? Cool you got it to work, but maybe only useful for vintage / retro computing rather than practical if the energy consumption makes it economically wasteful.
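  A back-of-the-envelope sketch of the energy cost per million generated tokens, using figures quoted elsewhere in this thread (85W on load, 11.94 tokens/s, $0.30 per 1M tokens on Openrouter, which I'm assuming is the output price). The electricity price is a hypothetical placeholder, so plug in your own:

```python
# Energy cost of generating 1M tokens locally versus a quoted API price.
# Inputs are taken from comments in this thread, except PRICE_PER_KWH,
# which is a hypothetical value chosen only for illustration.

POWER_WATTS = 85.0             # load estimate quoted by a commenter
TOKENS_PER_SEC = 11.94         # generation rate from the posted timings
PRICE_PER_KWH = 0.30           # hypothetical electricity price, $/kWh
API_PRICE_PER_M_OUTPUT = 0.30  # quoted price, assumed to be $/1M output tokens

seconds_per_million = 1_000_000 / TOKENS_PER_SEC
kwh_per_million = POWER_WATTS * seconds_per_million / 3_600_000  # W*s -> kWh
local_cost = kwh_per_million * PRICE_PER_KWH

print(f"time for 1M tokens:   {seconds_per_million / 3600:.1f} h")
print(f"energy for 1M tokens: {kwh_per_million:.2f} kWh")
print(f"local energy cost:    ${local_cost:.2f} per 1M tokens")
print(f"quoted API price:     ${API_PRICE_PER_M_OUTPUT:.2f} per 1M tokens")
```

  This ignores idle draw, hardware cost, and the fact that the box is doing other work at the same time, so treat it as an order-of-magnitude comparison only.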
- shevy-java 1 hour ago
  Would you consider improving the website's layout? Right now I find it below average quality and very distracting. Whether you are an engineer or not is not really important; great engineers can write horrible text or use a layout that is not ideal, for instance.
throwaway2027 1 hour ago
Glad to see other people realizing this. I've been running Gemma 26B-A4B Q4 on a 2012 Xeon with 16GB to 24GB of RAM in a container. It's getting around 8 to 12 tokens per second. Obviously it's not comparable to huge contexts and running it on a GPU and the image decoder in llama.cpp is super slow compared to a GPU but for some small automation tasks and general trivia questions it's decent. The speed is just enough to not have to wait for it to finish so you can read along.
Here's my setup. You may want to figure out what the best optimizations are for your specific CPU, like AVX2, because mine didn't have most of them. I did try MTP briefly but I wasn't getting performance improvements. You could play around with the batch sizes for cache or context, or go even lower to Q2, and don't overcommit on threads either, but I would suggest either defaults or trying out llama-bench. This isn't by any means the best, I assume, but it worked decently for me and I sometimes swap out Gemma for Qwen. You could also lower q8_0 to q4_0 for more context, but some say it could hurt quality, and I have noticed that too on some models.
# Building
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DGGML_OPENMP=ON
# Running
export OPENBLAS_NUM_THREADS=4
export OMP_NUM_THREADS=4
OPENBLAS_NUM_THREADS=4 OMP_NUM_THREADS=4 \
llama.cpp/build/bin/llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0.00 --jinja --host 0.0.0.0 --port 8080 --cache-type-k q8_0 --cache-type-v q8_0 --threads 4 --threads-batch 4 --ctx-size 8192 -n 8192 --batch-size 2048 --ubatch-size 512 --no-mmap --mlock --chat-template-kwargs '{"enable_thinking":false}' --no-mmproj -np 1 -fa 1
phaser 1 hour ago
What intrigues me the most about AI progress, is not AGI or the model du jour by $AI_UNICORN, but rather what can be run locally. I remember having an amusing, but rather useless model in a beefy gaming PC that I had 6 years ago; and now, something that’s a hundred times better on my M5 laptop.
Should the market react to the memory shortage and the progress of Apple silicon continue at the same pace, what we’ll be able to run locally in 6 years will be very exciting. Or frightening.
Also I don’t know what this means for the valuation of the AI companies. I remember asking about this very idea to one of their employees at an event and instead of answering he bailed out to grab a cocktail.
- MAXPOOL 1 hour ago
  Things you are not supposed to talk about:
  - There is no "moat" (lasting, easy-to-defend technological edge) in AI model businesses. There are just short-term advantages.
  - An AI business is a capital-intensive business, just like old factories. Data centers are expensive, models are energy-hungry, and the hardware inside must be replaced every 3–4 years.
  - Smaller, specialized models eat margins from below. Transcription, voice, or image detection do not need large models.
  There is no reason to expect high margins like you can in traditional software business. Benefits of AI go mostly to consumers.
  edit: There is potential for economies of scale. Few megacorps can strive for cost advantage when they achieve scale (Microsoft, Google, Amazon and Meta)
  - twoodfin 10 minutes ago
    All true.
    It does seem like the structural characteristics we’ve observed so far suggest there is a kind of flywheel from short-term to long-term advantage due to the capital requirements at various levels.
    If you’re Nvidia, making the best GPUs today, the expanding wavefront of demand is consuming them with volume and margins to give you a huge edge in building out the best next generation of GPUs. Similar to how the mobile wave gave TSMC sustained advantage for about a decade now.
    I’m guessing this is also what we’re seeing as Anthropic and OpenAI swap spots in the token-vendor market.
- fooker 1 hour ago
  What you can run locally in consumer hardware is progressing pretty well.
  If you get a not-quite-the-best gaming GPU like a 5080, you can run local models that are better than the state of the art from early 2025. Depending on what you want to do, you might have to switch models. The one size fits all huge models are still a data center thing.
- skdb476 1 hour ago
  It's a convenience thing. You can run a whole lot of stuff locally, from Wikipedia to social media/email/video servers, whatever. Most people with a full time job and 2 kids don't do it, because who has the time and energy to patch and maintain the ever growing complexity of this stuff? These systems will keep growing more complex. That also means more bugs. Age old tradeoff between freedom and convenience.
- rienbdj 1 hour ago
  Training AI models to drive valuation reminds me of high frequency trading
robotswantdata 11 minutes ago
Granite Rapids or Sapphire Rapids are very underrated for MoE inference loads. But you need a GPU for the KV cache.
Plus many boards also support CXL for RAM expansion over PCIe 5!
Source: building a hybrid inference business for regulated industry workloads.
jansommer 1 hour ago
The E5-2620 v4 is great. Have been using it for 10 years now. Wanted to upgrade until I saw current prices. I have 64 GB DDR4. Paired it with an RX 9060 XT 16 GB and games run as fast as ever. Perhaps the CPU is a slight bottleneck in DOOM The Dark Ages, but I'm at 60 fps, so no problem. Light LLM on the GPU is a no-brainer, and it's cool to see that things can be tuned to run ok on the CPU. I bought a 2667 v4 a month ago for $30. I'd expect it to give a decent performance boost but I just haven't had the need for it yet. If I were pushing into LLMs like in the article I'd probably upgrade, because the 2667 can handle slightly faster RAM.
- throwaway2037 28 minutes ago
  > The E5-2620 v4 is great. Have been using it for 10 years now.
  10 years? Damn, that is a long time. I always assumed that heat-induced damage will kill a CPU after a certain amount of time (5-7 years). Am I wrong here? I assume yes. Or are CPUs much stronger/tougher than in the bad old days?
  - jansommer 6 minutes ago
    A quick search on Xeon production suggests that it goes through rather rigorous testing. I wouldn't be surprised if server CPUs in a desktop PC last longer. I can't overclock it either, and that probably helps with its lifespan as well. But yeah, the fact that it actually powers on when I click the button and isn't a limiting factor after 10 years is quite something.
FartyMcFarter 13 minutes ago
I may have missed this in the article, but:
What was the net effect of the optimisations? How much faster did it get?
vhaudiquet 2 hours ago
The E5 2620-v4 only supports DDR4.
cykros 1 hour ago
Does this mean my 15 year old Phenom is too old? But it has 16 gb of DDR3 RAM!
Admittedly web browsers and it don't get along that well. Literally the only thing that drags though on my Slackware 15 system, and even then usually only when it gets to around 15 or so open tabs.
car 1 hour ago
Similar recent posting with optimizations for older Xeon:
High-Performance AI on a Budget: Optimizing llama.cpp for Qwen3.5 Inference on a Dual-GPU HP Z440
https://news.ycombinator.com/item?id=47320244
Hasan121212 28 minutes ago
I think one overlooked advantage of older Xeon systems is their availability. Many people can experiment with local AI deployments at a fraction of the cost of building a brand-new setup.
haunter 1 hour ago
And this is one of those CPUs which had dual slot motherboards so you can have double the fun (and power bill)
https://pcpartpicker.com/products/motherboard/#s=20028,20029...
anon-3988 1 hour ago
I tried to run gemma 4 on this CPU and it did not go well
https://www.techpowerup.com/cpu-specs/ryzen-7-4800u.c2281
It is way too slow
potus_kushner 4 hours ago
@cafkafk got a recommendation for a good model that fits into 64GB and leaves a couple GB free for other tasks ?
- cafkafk 3 hours ago
  Honestly, at this point you're probably looking at a smaller model, for the Gemma series I'd go with Gemma 4 E4B with drafters, but that's just a hunch from using it on my laptop (where I do have a RTX 4060 M and 96gb ram).
  So you'd change the invocation slightly here, but a lot of things you can potentially reuse.
  That said, the Gemma 4 E4B models have so far in my experience been... not great when it comes to long context, but they are very passable for basic tasks, and even seem surprisingly okay at tool calls.
  - sleepyeldrazi 1 hour ago
    Have you tested Qwen3.6 35B? Putting aside the capability claims for that model (which I support, but are not my point here), that 35B has smaller active parameter count than the gemma 4 26B, potentially making both prefill and decode faster out of the box, and has MTP heads built in the model and well supported (you may need to make sure you download a quant that didn't strip them off, as some do to preserve space). I would be curious to see your numbers there too. And if you do test this, please go for a clean one and not a fine-tuned one.
  - potus_kushner 3 hours ago
    i tried the Q4_K_M model from unsloth with your Q4_K_M drafter, but the required memory to load everything is 72GB. odd. otoh i could load Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled.IQ4_XS.gguf and it requires just ~18 GB:
    ~/ik_llama.cpp[main]$ build/bin/llama-cli --model ~/models/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled.IQ4_XS.gguf --spec-type mtp --draft-max 3 --draft-p-min 0.0 --spec-autotune -cnv --color --jinja --special -smgs -sas -mea 256 --temp 0.7 -t 6 --parallel 6 --cpu-moe --merge-up-gate-experts --flash-attn on --mla-use 3 --mlock --run-time-repack --no-kv-offload
    works pretty fast, at about 15 t/s:
    llama_print_timings: sample time = 45.28 ms / 404 runs ( 0.11 ms per token, 8921.67 tokens per second)
    llama_print_timings: prompt eval time = 949.42 ms / 51 tokens ( 18.62 ms per token, 53.72 tokens per second)
    llama_print_timings: eval time = 24067.08 ms / 400 runs ( 60.17 ms per token, 16.62 tokens per second)
    llama_print_timings: total time = 242192.55 ms / 451 tokens
    so i wonder why the params used by the quantized qwen model use way less memory than the ones of gemma.
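    A rough sketch for sanity-checking weight sizes: parameters times bits per weight. The bits-per-weight values are approximate assumptions on my part (about 4.25 for IQ4_XS, about 4.85 for Q4_K_M), and the parameter counts are read off the model names. It covers weights only, not KV cache, the drafter, or runtime buffers:

```python
# Approximate weight footprint of a quantized model.
# Bits-per-weight values are rough assumptions; real GGUF files mix
# quantization types per tensor, so actual sizes will differ somewhat.

APPROX_BITS_PER_WEIGHT = {
    "IQ4_XS": 4.25,   # assumed average
    "Q4_K_M": 4.85,   # assumed average
}


def weight_gigabytes(total_params: float, quant: str) -> float:
    """Estimate weight size in GB (10^9 bytes) for a given quant type."""
    return total_params * APPROX_BITS_PER_WEIGHT[quant] / 8 / 1e9


qwen_gb = weight_gigabytes(35e9, "IQ4_XS")    # 35B taken from the model name
gemma_gb = weight_gigabytes(26e9, "Q4_K_M")   # 26B taken from the model name

print(f"Qwen 35B  IQ4_XS: ~{qwen_gb:.1f} GB of weights")
print(f"Gemma 26B Q4_K_M: ~{gemma_gb:.1f} GB of weights")
```

    The Qwen estimate lines up with the ~18 GB observed. By the same estimate the Gemma weights alone would be far below 72GB, so under these assumptions the difference comes from something other than the quantized parameters.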
asimovDev 2 hours ago
I have an ancient DDR3 Xeon that doesn't support any AVX (dual x5690 and 96GB 1333 MHz RAM). You reckon it would even build / run at all?
- qwertox 2 hours ago
  CPU (2012)
```
  Model name: Intel(R) Xeon(R) CPU E3-1265L V2 @ 2.50GHz
```
  Mainboard
```
  Product Name: P8Z77 WS
```
  GPU
```
  05:00.0 VGA compatible controller: NVIDIA Corporation AD106 [GeForce RTX 4060 Ti 16GB] (rev a1)
  05:00.1 Audio device: NVIDIA Corporation AD106M High Definition Audio Controller (rev a1)
```
  Memory: 32GB
  This works.
- cafkafk 2 hours ago
  Loading will take some minutes, but at 96 GB you can squeeze the model in and have around ~10 GB of headroom, although depending on the Xeon, you may have to downgrade to E4B instead. Should still work though.
- tgtweak 2 hours ago
  It may work - depending on your ram speeds it might not even be that much slower.
- burnt-resistor 1 hour ago
  I run Win 11 Enterprise on an el cheapo spare parts Xeon E3-1275 V2 + 32 GiB DDR3-2133 + Gigabyte GA-B75M-D3H rev. 1.2 (TPM support)
NSUserDefaults 2 hours ago
How about the iMac Pro? Would that work? I was able to put 128gb in it (not as easy as the regular iMac but possible).
- wazoox 2 hours ago
  I've been running various models on a Mac Pro 2013 (8 cores, 32 GB RAM) at about 8 to 10 t/s for months. It's not fast, but it's more than enough for many actual tasks, in particular background tasks. An iMac pro will do just as well I suppose.
  - fooker 1 hour ago
    What are the tasks that do well with 8-10 t/s ?
egorfine 1 hour ago
This and the previous one are insanely good articles. Thank you!
gigatexal 1 hour ago
What kind of tokens per second did the OP get? I saw nothing written about this.
- urbandw311er 1 hour ago
  11.94 tokens/sec (from another answer above)
hparadiz 2 hours ago
I'm now staring at a 10 year old 4U with 256 GB of DDR4 and thinking hmmmmm
ezconnect 40 minutes ago
When you use the page up and page down keys while reading that blog, the first line on the screen is obscured by the floating bar or whatever it is. It is not even needed for reading.
Eonexus 4 hours ago
I wonder what the tokens per second actually are. Yes, it does say "reading speed" but that varies for everyone, no?
- cafkafk 3 hours ago
  That is a very fair point! I just ran a not very scientific benchmark with the system under load, and posted the raw logs in a sibling comment above, but the short answer is that it's hitting 11.94 tokens per second for generation - while it's also being a binary cache and CI build server.
  Totally just vibes based, I think it goes up to 20+ tps when it's not under load (and that's me trying to be conservative). For context, reading speed at 250 wpm would be around 5 to 6 tokens per second.
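  The reading-speed conversion, as a quick sketch. The tokens-per-word ratio is an assumption (roughly 1.3 is a common rule of thumb for English text; it varies by tokenizer and language):

```python
# Convert a reading speed in words per minute to tokens per second.
# TOKENS_PER_WORD is an assumed rule-of-thumb ratio, not a measured value.

WORDS_PER_MINUTE = 250
TOKENS_PER_WORD = 1.3   # assumption; depends on tokenizer and text

words_per_second = WORDS_PER_MINUTE / 60
reading_tokens_per_sec = words_per_second * TOKENS_PER_WORD

print(f"{WORDS_PER_MINUTE} wpm is about {reading_tokens_per_sec:.1f} tokens/s")
```

  With that ratio, 250 wpm lands inside the 5 to 6 tokens per second range.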
  - Eonexus 3 hours ago
    Huh, that's actually not bad at all! Sure, it's not at the speed of a GPU, but still, 20 tps is cromulent for a CPU.
rvba 49 minutes ago
As someone doing this for fun on a windows 11 machine (96gb ram, 5090 24gb) I wonder if I need any flags to keep the model in memory and avoid swapping to ssd?
I use LM studio and qwen3.5 35B - but never figured out if it is swapping or not.
On an unrelated note, does anyone know a model that can help with this use case:
https://news.ycombinator.com/item?id=48301635
christkv 3 hours ago
Makes you wonder if it's possible to squeeze more tps out of a strix halo system using the 16 zen5 cores as well as the gpu.
- Havoc 2 hours ago
  In general you’re mem bandwidth constrained so cpu vs gpu often ends up similar on APUs
  - fulafel 1 hour ago
    There are ways to trade off compute power for memory bandwidth (like MTP and other speculative decoding approaches). The CPU and GPU would need to be able to share the same cache for this to work. In the Strix Halo case the GPU has a private cache on the GPU die I think, which is the snag.
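    To make the compute-for-bandwidth trade concrete, here is a minimal sketch of the usual expected-tokens estimate for speculative decoding. It assumes each drafted token is accepted independently with the same probability, which is a simplification; the acceptance probabilities below are hypothetical, and the draft length of 3 mirrors the `--draft-max 3` used in the thread:

```python
# Expected tokens produced per verification pass of the large model,
# assuming each drafted token is accepted independently with probability p.
# This is a simplified model, not a description of any specific engine.


def expected_tokens_per_step(accept_prob: float, draft_len: int) -> float:
    """Expected tokens per target-model pass: (1 - p^(k+1)) / (1 - p)."""
    if accept_prob >= 1.0:
        return float(draft_len + 1)  # every draft accepted, plus one bonus token
    return (1 - accept_prob ** (draft_len + 1)) / (1 - accept_prob)


DRAFT_LEN = 3  # mirrors --draft-max 3 from the commands in this thread

for p in (0.0, 0.5, 0.7, 0.9):  # hypothetical acceptance probabilities
    gain = expected_tokens_per_step(p, DRAFT_LEN)
    print(f"accept prob {p:.1f}: {gain:.2f} tokens per pass")
```

    If a verification pass costs roughly one read of the active weights, this number is approximately how many tokens you get per weight read, which is why it helps most when memory bandwidth is the bottleneck.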
- cafkafk 3 hours ago
  If you get the inference engine to route the heavy matrix math to the GPU and the speculative drafting to the CPU without choking on latency it's probably gonna be very fast.
  Would love to see the benchmarks if someone actually pulls something like that off.
shevy-java 1 hour ago
The webpage's layout is just horrible. Scrolling is also non-default - and thus rather annoying; I had to stop after two scroll events. Why do people think they need so much fancy effects or non-standard behaviour, if their alleged goal is to get information across to other people?
SXX 1 hour ago
Now we need someone to try running Kimi K2.6 on an old Xeon and DDR3. After all, these platforms do support up to 768GB RAM.
- Havoc 11 minutes ago
  It’ll work but yield a token per minute. With ancient servers the throughput is the limiting aspect not mem size
nurettin 2 hours ago
I also run a Qwen 3.6 moe A4B on old hardware. I set it up with
numactl --membind=1
so its memory allocations are constrained to a single NUMA node, which speeds up token generation a little.
bflesch 2 hours ago
Might consider going for even older CPUs which don't have the Intel ME ring -3 thing which is full of backdoors
- bflesch 57 minutes ago
  I appreciate the downvotes without any reasoning. It's a fact that newer Intel CPUs have Intel ME which was not in older CPUs and significantly increases attack surface if you are not living in a five eyes state.
hypfer 1 hour ago
> The argument for speculative decoding is stronger on CPU than on GPU.
Uh. Uuuh.
No?
___
Also
> While a GPU has a massive pool of ultra-fast High-Bandwidth Memory (HBM), a CPU relies on small, lightning-fast “caches” (L1, L2, L3) built directly onto the processor chip.
What purpose does the quoting of "caches" serve there? Is this AI writing written by that model running on that host?