9 comments

  • zozbot234 11 minutes ago
    The github page mentions that a naïve mmap approach is bottlenecked by per-page overhead. Can this be mitigated by setting up explicit "huge" pages? (2M using the CONT PTE feature if the "native" page size is 16k; 32M using a PMD level block mapping; or 1G using the CONT PMD feature.) Does macOS support this out of the box?
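On the Linux side, a minimal sketch of that mitigation, assuming transparent huge pages (THP) requested via `madvise`. This is the generic Linux 2M-THP path, not the Apple-Silicon CONT-PTE/PMD mappings asked about above; as far as I know macOS exposes no public API for superpage control on Apple Silicon:

```python
import mmap

# Sketch: ask the Linux kernel to back an anonymous mapping with
# transparent huge pages, reducing per-page fault/TLB overhead.
# MADV_HUGEPAGE is advisory; the kernel may still use base pages.
SIZE = 64 * 1024 * 1024  # 64 MiB region

buf = mmap.mmap(-1, SIZE)  # anonymous, private mapping
if hasattr(mmap, "MADV_HUGEPAGE"):       # Linux-only constant
    try:
        buf.madvise(mmap.MADV_HUGEPAGE)  # hint, not a guarantee
    except OSError:
        pass  # THP disabled here; mapping still works with base pages

# Touch one byte per 4K page so the pages are actually faulted in.
for off in range(0, SIZE, 4096):
    buf[off] = 0
print("mapped", SIZE // (1024 * 1024), "MiB")
buf.close()
```

For file-backed weights the equivalent hints would be `MAP_HUGETLB` at mmap time or filesystem-dependent THP support, which is much less uniform.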
  • JSR_FDED 38 minutes ago
    This is a very impressive result. If I understand correctly the bottleneck is the SSD in this architecture - the author seems to get almost 15GB/s - but I seem to remember the max b/w was about 8GB/s. What am I missing?
    • Roxxik 11 minutes ago
      IO is very bursty in these setups. When the router results are in, you can start loading experts from the SSD. For that brief moment the SSD is saturated.

      Outside of that the SSD is idling.

      Table 3 shows, for K=4 experts, an IO of 943 MB/Tok at 3.15 Tok/s, giving an average IO of 2970 MB/s, far below what the SSD could do.
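The back-of-envelope math behind those numbers, with the ~15 GB/s burst figure from upthread used to estimate how often the SSD is actually busy:

```python
# Check of the Table 3 numbers quoted above (K=4 experts).
io_per_token_mb = 943   # MB of expert weights read per token
tokens_per_sec = 3.15   # decode speed

avg_bandwidth = io_per_token_mb * tokens_per_sec
print(f"average SSD bandwidth: {avg_bandwidth:.0f} MB/s")  # 2970 MB/s

peak = 15000  # MB/s, roughly the burst rate mentioned upthread
print(f"duty cycle: {avg_bandwidth / peak:.0%}")  # ~20%; idle the rest
```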

      I'm not sure, but not all expert weights are used immediately. Maybe they could do async reads for the down tensors, parallelizing compute with IO.

      Not sure if this works on a Mac; I only tested my larger-than-RAM setup on Linux with io_uring O_DIRECT reads, and I saw that about 20% of total reads finish while my fused up/gate matmul is already running.
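A toy sketch of that overlap, with a thread pool standing in for io_uring and made-up names (`read_down_weights`, `fused_up_gate`) for the two stages: the down-tensor read is submitted first, the up/gate compute runs while the IO is in flight, and the result is only awaited when it is actually needed.

```python
import concurrent.futures

def read_down_weights(expert_id):
    # Stand-in for an O_DIRECT / io_uring read of the expert's down tensor.
    return b"\x00" * 1024

def fused_up_gate(x, expert_id):
    # Stand-in for the fused up/gate matmul on the accelerator.
    return [v * 2 for v in x]

def expert_forward(pool, x, expert_id):
    fut = pool.submit(read_down_weights, expert_id)  # IO starts now
    h = fused_up_gate(x, expert_id)                  # compute overlaps the read
    down = fut.result()                              # wait only for what's left
    return h, down

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    h, down = expert_forward(pool, [1, 2, 3], expert_id=0)
    print(len(down))  # 1024
```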

      Edit: Typos

    • rado 33 minutes ago
      The MacBook Pro M5 Pro and M5 Max do have SSDs that fast.
  • bertili 32 minutes ago
    Very impressive! I wonder if there is a similar path for Linux using system memory instead of SSD? Hell, maybe even a case for the return of some kind of ROMs of weights?
  • rvz 48 minutes ago
    The technical write-up is great, but Mac users shouldn't get too excited just yet about running 300B+ parameter models locally, as the TPS isn't that good.

    >...at 4.4+ tokens/second

    And that is with 4-bit quantization.

    > The entire 209GB model streams from SSD through a custom Metal compute pipeline.

    This is my main problem.

    If I were to run this on a Mac SSD, 24/7 for heavy usage such as Openclaw, that is going to significantly reduce the lifetime of the SSD.

    Can't imagine using this long-term right now, but improvements will follow. Still a great write-up anyway.

    • Roxxik 39 minutes ago
      Does an SSD meaningfully degrade under read-only workloads?
      • JSR_FDED 34 minutes ago
        Nope, reads don’t cause wear
    • etiam 29 minutes ago
      > If I were to run this on a Mac SSD, 24/7 for heavy usage such as Openclaw, that is going to significantly reduce the lifetime of the SSD.

      How sure are you about that? I've never looked closely at how a large mixture-of-experts LLM switches between expert modules, but staying on roughly the same topic (as it often would when editing the same codebase), I wouldn't be surprised if the switches in expert composition were fairly rare and fairly small, and to the extent switching does happen, it causes repeated reads from the flash disk rather than writes.
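One toy way to reason about it, assuming uniform-random top-K routing, which is the pessimistic case: a real router staying on one topic should repeat experts more often, and every repeated expert is a read that can be served from cache instead of the SSD.

```python
import random

# Toy model: per token, the router picks K of N experts. Only experts
# not already resident from the previous token need an SSD fetch.
# Uniform-random routing is an assumption, not measured behavior.
N_EXPERTS, K, TOKENS = 128, 4, 1000
random.seed(0)

cached = set()
fetches = 0
for _ in range(TOKENS):
    chosen = set(random.sample(range(N_EXPERTS), K))
    fetches += len(chosen - cached)  # experts not already resident
    cached = chosen                  # only last token's experts stay hot
print(f"avg experts fetched per token: {fetches / TOKENS:.2f} (of K={K})")
```

With random routing almost all K experts change each token; the interesting empirical question is how far below that a real router sits on a narrow workload.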

      • frotaur 13 minutes ago
        Afaik the experts are not usually very interpretable, and I'd generally be surprised if at least one didn't change every token. I don't know what happens in practice, but I know that at least during training, nothing is done to minimize the number of expert switches between tokens.
    • hrmtst93837 25 minutes ago
      If you want decent throughput and don't care about burning SSD write cycles on a box that was never meant to act like a tiny inference server, a used server with actual RAM is still the cheaper and less silly option. I wouldn't expect Apple's warranty team to be much help.
      • K0balt 1 minute ago
        Is it doing a bunch of SSD writes?
  • harshhhhhhhhh 49 minutes ago
    Seems promising, this is the way. Can someone benchmark this?
    • frwickst 48 minutes ago
      I'm getting 6.55t/s using the Qwen3.5-397B-A17B-4bit model with the command: ./infer --prompt "Explain quantum computing" --tokens 100

      MacBook Pro M5 Pro (64GB RAM)

      • logicallee 19 minutes ago
        can you post the final result (or as far as you got before you killed it) to show us how cohesive and good it is? I'd like to see an example of the output of this.
  • pdyc 27 minutes ago
    Impressive. I wish someone would take a stab at using this technique on mobile GPUs; even if it doesn't use storage it would still be a win. I'm running llama.cpp on an Adreno 830 with OpenCL and I'm getting a pathetic 2-3 t/s for output tokens.
  • vilequeef 23 minutes ago
    Why so much RAM?
    • vilequeef 8 minutes ago
      Oh Mac, unified. Sometimes it takes a downvote