Show HN: Gemma 4 Multimodal Fine-Tuner for Apple Silicon

(github.com)

94 points | by MediaSquirrel 3 hours ago

5 comments

LuxBennu 2 hours ago
I run Whisper large-v3 on an M2 Max with 96GB, and even with just inference, memory gets tight on longer audio; I can only imagine what fine-tuning looks like. Does 64GB vs. 96GB make a meaningful difference for Gemma 4 fine-tuning, or does it just push the OOM wall back a bit? I've been wanting to try local fine-tuning on Apple Silicon, but the tooling gap has kept me on inference only so far.
- MediaSquirrel 2 hours ago
  Memory usage increases quadratically with sequence length, so using shorter sequences during fine-tuning keeps memory from blowing up. On my 64GB machine, I'm limited to input sequences of about 2,000 tokens, given that the average output for my fine-tuning task is around 1,000 tokens (~3k tokens total).
  - LuxBennu 6 minutes ago
    Ah, that makes sense; quadratic scaling is brutal. So with 96GB I'd probably get somewhere around 4-5k total sequence length before hitting the wall, which is still pretty limiting for anything multimodal. Do you use gradient checkpointing, or is it not worth the speed tradeoff at these sizes?
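For a rough sanity check on that estimate, here is a back-of-the-envelope sketch. It assumes the only data point is the one quoted above (about 3k total tokens on 64GB) and that memory scales as a pure power of sequence length once a fixed cost is subtracted. The function `max_tokens` and its `fixed_gb` and `exponent` parameters are hypothetical names for illustration, not part of the posted tool. Under strictly quadratic scaling the 96GB limit comes out nearer 3.7k tokens; the 4-5k range matches roughly linear scaling instead.

```python
def max_tokens(known_tokens, known_mem_gb, new_mem_gb, fixed_gb=0.0, exponent=2.0):
    """Scale a known sequence-length limit to a different memory budget.

    Assumption: memory used beyond `fixed_gb` (weights, optimizer state, etc.)
    grows as tokens ** exponent. exponent=2.0 models quadratic scaling,
    exponent=1.0 models linear scaling.
    """
    ratio = (new_mem_gb - fixed_gb) / (known_mem_gb - fixed_gb)
    return int(known_tokens * ratio ** (1.0 / exponent))


# Data point from the thread: ~3k total tokens fit on a 64GB machine.
quadratic = max_tokens(3000, 64, 96)                # 3000 * sqrt(1.5)
linear = max_tokens(3000, 64, 96, exponent=1.0)     # 3000 * 1.5

print(f"quadratic scaling: ~{quadratic} tokens")
print(f"linear scaling:    ~{linear} tokens")
```

Setting `fixed_gb` to a nonzero value (the real figure is not given in the thread) raises the quadratic estimate somewhat, since the extra 32GB is then a larger share of the memory that actually scales with sequence length.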
craze3 2 hours ago
Nice! I've been wanting to try local audio fine-tuning. Hopefully it works with music vocals too.
yousifa 2 hours ago
This is super cool, will definitely try it out! Nice work
dsabanin 2 hours ago
Thanks for doing this. Looks interesting, I'm going to check it out soon.
- MediaSquirrel 2 hours ago
  You are welcome! It was a fun side quest.
pivoshenko 1 hour ago
nice!