Every Byte Matters

(fzakaria.com)

189 points | by ingve 7 hours ago

22 comments

  • moring 4 hours ago
    The article shows nicely how "every byte matters" is false. First, it starts off by talking about the cost of a new field, when the actual topic is array-of-structs vs. struct-of-arrays. Then, this:

    > How much of an impact can this have?
    >
    > Reading is:alive (1 byte) Across 1M Monsters

    You aren't reading one byte here, you are reading 1M bytes! Of course, optimizing the access to 1M bytes is something to consider. Optimizing the access to one byte isn't.

    The article is definitely worth reading IMHO, but it really needs a better headline!

    • jayd16 4 hours ago
      Even more so, it shows that SoA data structure means you can add fields to your 1M monsters with little impact.
      • gmueckl 2 hours ago
        This is valid for sequential scanning of the data. The CPU will fill whole cache lines at once with the arrays that do get used and the algorithm touches all the field instances in the array.

        Now think about random access to single struct instances instead: the CPU loads a cache line worth of data for each field and uses only one element out of the whole cache line. This is much worse than a compact structure representation of the same data.

        SoA is not universally better.
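
A minimal Java sketch of the two layouts being compared, under stated assumptions: the `Monster` fields and all class and method names are hypothetical, not taken from the article. Note that in Java the AoS variant is really an array of references, so each element is a separate heap object.

```java
import java.util.Random;

/** Hypothetical monster layouts; field names are illustrative only. */
final class MonsterLayouts {
    // Array-of-structs: each monster is one object holding all of its fields.
    static final class Monster {
        boolean alive;
        int health;
        double x, y;
    }

    // Struct-of-arrays: one primitive array per field, indexed by monster id.
    static final class Monsters {
        final boolean[] alive;
        final int[] health;
        final double[] x, y;

        Monsters(int n) {
            alive = new boolean[n];
            health = new int[n];
            x = new double[n];
            y = new double[n];
        }
    }

    // Sequential scan over AoS: every iteration dereferences a separate object.
    static int countAlive(Monster[] monsters) {
        int count = 0;
        for (Monster m : monsters) {
            if (m.alive) count++;
        }
        return count;
    }

    // Sequential scan over SoA: touches only the contiguous 'alive' array.
    static int countAlive(Monsters monsters) {
        int count = 0;
        for (boolean a : monsters.alive) {
            if (a) count++;
        }
        return count;
    }

    // Random access to one monster in SoA: one lookup per field array,
    // so potentially one cache line fetched per field that is read.
    static double distanceFromOrigin(Monsters monsters, int i) {
        return Math.hypot(monsters.x[i], monsters.y[i]);
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        Random rng = new Random(42);
        Monster[] aos = new Monster[n];
        Monsters soa = new Monsters(n);
        for (int i = 0; i < n; i++) {
            boolean alive = rng.nextBoolean();
            aos[i] = new Monster();
            aos[i].alive = alive;
            soa.alive[i] = alive;
        }
        System.out.println("AoS alive: " + countAlive(aos));
        System.out.println("SoA alive: " + countAlive(soa));
    }
}
```

The scan over `Monsters.alive` is the case the article measures; `distanceFromOrigin` is the random-access case, where every field read lands in a different array.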

        • jayd16 2 hours ago
          No it's not always better and I didn't mean to imply it was. I was simply saying that the article argues against its title.

          In both cases you want to think about locality of the next read and structure the data accordingly.

      • notatyrannosaur 3 hours ago
        > you can add fields to your 1M monsters with little impact.

        Great for this access pattern, but I wouldn't make a general statement like that. This is the same thing as row-oriented vs column-oriented databases, OLTP vs OLAP. SoA is weak if you are adding/removing monsters more often than accessing a single "hot" field.

        • Altern4tiveAcc 2 hours ago
          > SoA is weak if you are adding/removing monsters more often than accessing a single "hot" field.

          Why is that? Genuinely curious. Does "weak" mean that it performs worse than AoS, or that the gains aren't as significant versus AoS?

          • tsimionescu 2 hours ago
            It's because removing a monster with 20 fields from an SoA structure means resizing 20 arrays. Removing the same monster from an AoS array involves resizing a single array, which you're going to process in a very cache friendly way.
            • vouwfietsman 19 minutes ago
              I'm not sure why anybody would at the same time be implementing SoA AND resizing 20 arrays for a single delete, those things seem to be on either ends of the "I care about performance" spectrum.
            • Altern4tiveAcc 1 hour ago
              Assuming ordering isn't a concern, can't you just have a field called "removed" and skip those when iterating?

              Or swap it with the last monster, and keep an index for the last monster alive.
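
The swap-with-last idea can be sketched roughly as follows, assuming a fixed-capacity pool with just two fields. `SwapRemovePool` and its members are hypothetical names chosen for illustration.

```java
/** Hypothetical SoA pool where removal swaps in the last live element. */
final class SwapRemovePool {
    private final int[] health;
    private final double[] x;
    private int size; // slots [0, size) hold live monsters

    SwapRemovePool(int capacity) {
        health = new int[capacity];
        x = new double[capacity];
    }

    // Appends a monster and returns its index.
    int add(int hp, double posX) {
        if (size == health.length) throw new IllegalStateException("pool is full");
        health[size] = hp;
        x[size] = posX;
        return size++;
    }

    // Removes the monster at index i by overwriting it with the last live one.
    // No array is resized, but ordering is lost and the moved monster's index changes.
    void removeAt(int i) {
        if (i < 0 || i >= size) throw new IndexOutOfBoundsException("index " + i);
        int last = size - 1;
        health[i] = health[last];
        x[i] = x[last];
        size = last;
    }

    int size() {
        return size;
    }

    int healthAt(int i) {
        if (i < 0 || i >= size) throw new IndexOutOfBoundsException("index " + i);
        return health[i];
    }

    // Iteration needs no "removed" check because live slots stay dense.
    long totalHealth() {
        long sum = 0;
        for (int i = 0; i < size; i++) sum += health[i];
        return sum;
    }

    public static void main(String[] args) {
        SwapRemovePool pool = new SwapRemovePool(8);
        pool.add(10, 0.0);
        pool.add(20, 1.0);
        pool.add(30, 2.0);
        pool.removeAt(0);
        System.out.println("size=" + pool.size() + " total=" + pool.totalHealth());
    }
}
```

The cost is one copy per field array on each removal, and any external reference to the last monster's old index becomes stale.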

              • marcosdumay 1 hour ago
                Then you have to read the "removed" field on every field read on every operation.

                SoA is only useful when you don't read multiple fields for most operations.

          • jayd16 2 hours ago
            Presumably they're referring to resizing the arrays.
            • gmueckl 2 hours ago
              Array resizing is avoidable with an embedded free list if ordering is of no concern.
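
One way to read "embedded free list" is sketched below, assuming a fixed-capacity pool in which a dead slot's `health` entry is reused to store the index of the next free slot. `FreeListPool` and its members are hypothetical names.

```java
/** Hypothetical SoA pool that recycles slots through a free list stored in the data itself. */
final class FreeListPool {
    private static final int NONE = -1;

    private final boolean[] alive;
    private final int[] health; // for a free slot: index of the next free slot
    private int freeHead;

    FreeListPool(int capacity) {
        alive = new boolean[capacity];
        health = new int[capacity];
        // Chain every slot into the free list: 0 -> 1 -> ... -> capacity-1 -> NONE.
        for (int i = 0; i < capacity; i++) {
            health[i] = (i + 1 < capacity) ? i + 1 : NONE;
        }
        freeHead = capacity > 0 ? 0 : NONE;
    }

    // Takes the first free slot; indices of other monsters never change.
    int add(int hp) {
        if (freeHead == NONE) throw new IllegalStateException("pool is full");
        int slot = freeHead;
        freeHead = health[slot];
        health[slot] = hp;
        alive[slot] = true;
        return slot;
    }

    // Pushes the slot back onto the free list; no array is resized or shifted.
    void remove(int slot) {
        if (!alive[slot]) throw new IllegalArgumentException("slot " + slot + " is not alive");
        alive[slot] = false;
        health[slot] = freeHead;
        freeHead = slot;
    }

    // Iteration has to consult 'alive', because free slots hold list links, not health.
    long totalHealth() {
        long sum = 0;
        for (int i = 0; i < health.length; i++) {
            if (alive[i]) sum += health[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        FreeListPool pool = new FreeListPool(3);
        pool.add(10);
        int b = pool.add(20);
        pool.add(30);
        pool.remove(b);
        int reused = pool.add(5);
        System.out.println("reused slot " + reused + ", total=" + pool.totalHealth());
    }
}
```

Unlike swap-remove, indices stay stable, but every scan pays for reading the `alive` array as well, which is the cost raised elsewhere in the thread.
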
        • keynha 2 hours ago
          [dead]
      • celrod 3 hours ago
        Yes. I think one of the big advantages of SoA is that you only pay for the fields you're currently using. If you need a field somewhere, you can add it and only pay the cost of iterating it where you need it.
    • bronlund 3 hours ago
      Every Struct Matters
  • noelwelsh 6 hours ago
    The JVM is currently pretty bad for memory allocation. Every object (i.e. not a primitive) has a header that IIRC is 12 bytes. But there is good news in JVM land: this will be reduced to 8 bytes in the next JVM release, and Project Valhalla will give the tools to do away with headers entirely in some cases. Project Valhalla also has tools to manage off-heap memory, which is important in many cases.

    The JVM is an odd place where it requires too much heap to compete with the AOT compiled languages, but its startup time is too slow compared to interpreted languages. I think these enhancements are essential to keep the platform relevant.

    • pron 5 hours ago
      > Every object (i.e. not a primitive) has a header that IIRC is 12 bytes. But there is good news in JVM land: this will be reduced to 8 bytes in the next JVM release

      Since JDK 25 it's already 64 bits with the `-XX:+UseCompactObjectHeaders` flag [1], but in JDK 27 it will be the default [2].

      > where it requires too much heap to compete with the AOT compiled languages

      Not to compete but to beat, and not too much, but the right amount. Low level languages are optimised for control, not performance (that control translates to better performance in smaller programs, and to worse performance in larger programs), and their particular constraints prevent them from enjoying certain important optimisations, especially those offered by JIT compilation and moving collectors, which remove some overheads that AOT compilers and free-list allocators incur. Their memory management is forced (by their constraints) to optimise for footprint rather than speed.

      There are common misunderstandings about memory management and why moving collectors were created to reduce the CPU overheads of malloc/free, especially in large programs, in exchange for what is effectively free RAM. This is why moving collectors are chosen by the languages that are unconstrained enough to use them and have the resources to implement them (Java, .NET, V8). With the exception of Zig (and even there it requires some effort), it's hard for low level languages to use the basic optimisation that's behind moving collectors. I gave a talk about how moving collectors optimise memory management at the last Java One, and it should be available on YouTube soonish [3].

      > but its startup time is too slow compared to interpreted languages

      That hasn't been the case for some time. You are right, though, that startup/warmup time is worse than in AOT compiled languages, and that is the tradeoff of optimising JITs: reduce the overheads associated with AOT compilation in large programs in exchange for warmup.

      Both startup and warmup have already been improved thanks to Project Leyden's "AOT cache" [4], but it will never be as low as C.

      In general, the tradeoff is between optimisations that help large programs vs optimisations that help small programs.

      [1]: https://openjdk.org/jeps/519

      [2]: https://openjdk.org/jeps/534

      [3]: I can't reproduce the full talk (which goes into the maths of memory management) here but what happened with moving collectors was that until very recently (open source low-latency moving collectors are newer than ChatGPT), they required pauses and so weren't suitable for programs requiring low latencies. As a result, many developers either forgot or never learnt just how incredibly efficient moving collectors are. But the key is that because accessing RAM by necessity requires CPU, using CPU effectively captures RAM even if it's not used by the program. Bringing the CPU and RAM usage into a good balance is more efficient than trying to minimise one or the other. This is also the reason why hardware (physical or virtual) is packaged within a very narrow band of RAM/core ratio.

      [4]: https://www.youtube.com/watch

      • AlotOfReading 4 hours ago

        > In general, the tradeoff is between optimisations that help large programs vs optimisations that help small programs.

        Do you have concrete examples of large scale Java programs that are significantly more performant than comparable programs in native languages like C++? My understanding was that this dynamic hadn't fundamentally changed much since the 2010s, when Java was able to occasionally edge out a win in 1-2 benchmarks and would lose handily in others. My experience is that large scale Java programs remain a bit of a bear even after significant optimization effort (e.g. Bazel).

        There are of course plenty of optimizations the JVM does that aren't possible AOT, but that doesn't imply an automatic win at large scales, as Rust demonstrates.

        • pron 3 hours ago
          > Do you have concrete examples of large scale Java programs that are significantly more performant than comparable programs in native languages like C++?

          Yes. I was working in a place that made large sensor-fusion applications, air-traffic control applications, and logistical planning, each in the 2-8MLOC range. Over time, we ported all of them from C++ to Java because C++'s performance overheads were too annoying to work around.

          Of course, in principle it's always possible to match and perhaps even exceed Java's performance in a low-level language, but in practice it becomes ever more difficult as the program grows (and the cost remains with maintenance forever). The reason is that as programs grow, patterns become less regular (e.g. the variance in object lifetimes grows), the need for concurrency grows (and so the need for sharing objects among threads and for lock free data structures), and more general constructs are used (e.g. more dynamic dispatch). Improvements in modern allocators, as well as LTO and PGO have helped, but not enough to match the extent of optimisations you can do once you're free of the design constraints of low-level control and the focus on the worst case.

          Java's thesis (not initially, but from very early on) was to rely on optimisations that can't be effectively employed by low-level languages because of their constraints, such as efficient memory management that benefits from being able to move most pointers in a program, and highly aggressive speculative optimisations (that are nondeterministic and can fail, resulting in deoptimisation). These optimisations tend to be global, and so they don't restrict program structure much, keeping maintenance costs lower, but they do help the average case at the cost of harming the worst case, which is a tradeoff that programs written in low-level languages don't want, and of course, it doesn't give the low-level control that's the entire point of low-level languages. Proving that thesis took a while, and longer in some aspects than others (moving collectors that don't pause were first released to a wide audience three years ago).

          Of course, the differences aren't huge because the hot paths are typically small enough that they can be improved without adding too much cost (and hot paths require some manual optimisation in all languages), but gaining some performance as a side effect of significantly lowering costs is nice.

          > There are of course plenty of optimizations the JVM does that aren't possible AOT, but that doesn't imply an automatic win at large scales, as Rust demonstrates.

          I don't know what it is that Rust demonstrates given how few large scale projects have chosen it, but I've seen nothing to indicate that it doesn't suffer from the same performance issues as C++ compared to Java. In fact, someone I know who works at one of the world's largest tech companies told me that his team lead really wanted to do something in Rust, so they ported a small-to-medium service from Java to Rust. The result was such a huge performance drop that it wouldn't meet their minimum requirements. They were then forced to spend an additional 6 to 12 months carefully hand-optimising their Rust code until it matched Java's performance, but the result is such that all future maintenance will be more expensive. This is the exact same pattern I've seen with C++.

          It's interesting that 20 years ago the people who said Java can't beat C++ on performance were experienced low-level programmers who had little or no experience with Java (and they were also right on several axes at the time). Today the people who say that are those with little experience with low-level languages (and are under the impression that low level languages are universally fast), but they will eventually learn about their fundamental performance issues just as we did decades ago.

          I think that Rust in particular has made people without much experience in low-level programming (among which Rust has made much more inroads than among those with a lot of experience in low-level programming) believe a certain story, namely that the problem with low level languages was memory safety and that that was the reason so many large programs switched to Java despite the performance sacrifices they had to make. Now that Rust fixes that problem, they can have their cake and eat it too! In reality, memory safety was indeed one of the several significant problems with low level languages that Java sought to fix, but another was the performance issues low level languages suffer from as they get large (making good performance ever more costly). The tradeoff isn't performance (in large programs there might even be a performance gain) but low-level control, as that is what low-level languages are about. That was what they offered back then, and it's still what they offer now. Rust was first designed twenty years ago, back when things still looked a certain way (which is why, IMO, it repeated most of C++'s design mistakes), but these days I think that a better, more modern design of low-level languages is more focused on control, leaving large programs to high-level languages. Lack of memory safety has, without a doubt, been one of the things that made low-level languages less palatable to "ordinary" applications, but it was far from the only one.

          Anyway, I'm sure the debate of which is faster, C++ (/Rust/Zig) or Java, will continue, and frankly, due to the nature of modern hardware, compiler, and runtime optimisations these days (when the question of the cost of some individual operation is all but meaningless and our ability to extrapolate from the performance of one program to another is close to nil), it largely comes down to empirical questions such as which program patterns are more or less common in the field and in which domains, as there are code and workload patterns that could give an advantage to either one.

          • jandrewrogers 10 minutes ago
            I’ve done performance-engineering for decades in Java, C++, and C for both data analytics and supercomputing/HPC. Java performs significantly worse than C++ in all cases without exception. This is the result you should expect from first principles; something has gone horribly wrong with your software optimization if Java is faster than C++ or even Rust.

            There are good reasons to use Java in environments that care about performance. Absolute performance can be traded for other concerns while still being good. It is why I did so much performance-engineering work in the language.

            Most performance is architectural in nature. Extremely granular control of scheduling is a prerequisite. System languages provide that control if you want it, Java does not.

            When you design software in Java, you accept that some software architectures are not available to you. If you care about performance, you would not port a software architecture optimized around the limitations of Java to a systems language.

          • WhitneyLand 54 minutes ago
            “they ported a small-to-medium service from Java to Rust. The result was such a huge performance drop that it wouldn't meet their minimum requirements”

            That result would say less about performance of languages than it would about competency of developers with a language.

            I just don’t buy that a task could be assigned to two teams with comparable expertise and domain knowledge in Rust and Java, and have the Rust result be at a “huge” performance deficit.

            No, don’t believe that was an apples to apples comparison.

          • AlotOfReading 3 hours ago

            > I don't know what it is that Rust demonstrates given that few large scale projects have chosen it, but I've seen nothing to indicate that it doesn't suffer from the same performance issues as C++ compared to Java.

            The point of bringing up Rust is that it also gives the compiler much more information to optimize on than C++, but actual performance is comparable or slightly worse in most benchmarks because the quality of C++ codegen is so high. Some of those Rust advantages are exactly the same things that have been touted as major advantages for Java over C++, like escape analysis and lifetimes.

            > Of course, in principle it's always possible to match and perhaps even exceed Java's performance in a low-level language, but in practice it becomes ever more difficult as the program grows (and the cost remains with maintenance forever).

            Sure, which is why I asked for real examples of whatever you consider a "large scale" program. I wasn't able to find anything via search before I replied, and the wiki page on Java performance [0] is repeating what I understood.

            [0] https://en.wikipedia.org/wiki/Java_performance

            • pron 2 hours ago
              > Some of those Rust advantages are exactly the same things that have been touted as major advantages for Java over C++, like escape analysis and lifetimes.

              These aren't the biggest advantages. I would say that the biggest ones are aggressive speculative optimisations that allow inlining of virtual calls (by default, up to a depth of 15 calls) and the ability to freely move pointers, which allows alternatives to free-list-based memory management. Low-level languages can't afford pervasive speculative optimisation (as they're focused on the worst case) and can't allow most of their pointers to be moved (because they often share them directly with the hardware and/or device drivers).

              > and the wiki page on Java performance [0] is repeating what I understood.

              That may be because the information on that page seems to be up to date to 2011-2. Java is now on version 26, BTW.

              • AlotOfReading 37 minutes ago
                LLVM does speculative devirtualization as well these days, though it's not as aggressive as Hotspot. High-performance native code tries to avoid deep dynamic hierarchies anyway, so it's mitigated by cultural practices.

                GCs are definitely a strong point for Java, but most high-performance code can be rewritten to avoid pummeling memory management. This used to be common for Java in financial applications, not sure if it still is.

                C++ has evolved its own compacting GCs like oilpan [0] for applications where high performance is inherently tied to allocation. Oilpan runs into pointer issues and isn't remotely comparable to G1GC or ZGC, but I think the speed of V8 speaks for itself. Rust allows you to drop in non free-list based allocators and GCs (e.g. Bumpalo), but they're relatively immature.

                > That may be because the information on that page seems to be up to date to 2011-2. Java is now on version 26, BTW.

                The last time I dove into JVM internals was around the same time. I figured that someone who's worked with it more recently might have better examples than what's easily searchable.

                [0] https://chromium.googlesource.com/v8/v8/+/main/include/cppgc...

            • gf000 2 hours ago
              Slightly off topic -- java-related wiki pages are notoriously bad and possibly biased for some reason. They are laughably outdated and have a bunch of non-objective sentences that paint a much worse picture of the language than deserved.

              I have even tried removing/rewriting some of the questionable sentences but my edits weren't accepted.

          • tealpod 2 hours ago
            We compiled one of our Java apps to a native binary using GraalVM (for encryption and secret management needs). A side effect is that the native binary's performance is excellent, and app startup time is also significantly lower than with the JVM version.

            I am not sure how it compares with C++, Rust and Zig, but we benchmarked it against a similar Go binary, and the Java native version's performance (load tests) is similar to the Go binary's. Only the RAM usage of the Java native binary is 3 times that of the Go binary (and the JVM app took almost 10 times more RAM than the Go version).

            • pron 2 hours ago
              The RAM difference is primarily because both Native Image (what you call Graal VM) and Go use much simpler and less efficient memory management techniques. HotSpot uses much more RAM by design as there are inefficiencies caused by using too little of it. Memory management - and especially very sophisticated approaches that are only used by the best resourced teams - is an especially misunderstood aspect.

              I gave a talk on the subject that I hope will be published soon, and while I can't reproduce it here, let me give an example that offers some basic intuition. Imagine needing to do some computation in two ways on a machine with 1GB of free RAM. You could run for 10s, taking up 100% CPU and consuming 80MB of RAM, or for 9s, taking up 100% CPU and consuming 800MB of RAM. The second is more efficient, despite taking up 10x more RAM and saving "only" 10% of CPU, regardless of the relative cost of RAM and CPU. This is because taking up 100% of the CPU effectively captures 100% of RAM (as no other program can use it), so both programs capture the entire 1GB; only the second one captures it for a second less. This scales to non-extreme situations because accessing RAM requires CPU, so using CPU means capturing RAM whether you use it or not. So HotSpot uses it if it can use it to balance the CPU utilisation.
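
The arithmetic in that example can be written out as a small sketch. It only encodes the simplified model from the comment (a program at 100% CPU ties up all of the machine's RAM for as long as it runs), and it assumes 1GB = 1000MB for round numbers. `RamCapture` and its method names are hypothetical.

```java
/** Hypothetical helper for the "captured RAM" model described in the comment above. */
final class RamCapture {
    // RAM that no other program can use while this one saturates the CPU,
    // integrated over the run time (in MB-seconds).
    static long capturedMbSeconds(long machineRamMb, long runSeconds) {
        return machineRamMb * runSeconds;
    }

    // RAM the program actually allocates, integrated over the run time.
    static long allocatedMbSeconds(long usedMb, long runSeconds) {
        return usedMb * runSeconds;
    }

    public static void main(String[] args) {
        long machineRamMb = 1000; // assumption: 1GB treated as 1000MB

        // Variant A: 10s at 100% CPU using 80MB. Variant B: 9s at 100% CPU using 800MB.
        System.out.println("A allocated: " + allocatedMbSeconds(80, 10) + " MB*s");
        System.out.println("B allocated: " + allocatedMbSeconds(800, 9) + " MB*s");
        System.out.println("A captured:  " + capturedMbSeconds(machineRamMb, 10) + " MB*s");
        System.out.println("B captured:  " + capturedMbSeconds(machineRamMb, 9) + " MB*s");
    }
}
```

Measured by allocation, B looks far more expensive; measured by what the rest of the machine loses, B is the cheaper of the two.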

              In some situations it may not matter, and I assume that if Native Image and Go work just as well for you, then the workload isn't very high, but under high workloads, this can matter a lot.

      • layer8 51 minutes ago
        What do you mean by “control”?
      • pharrington 4 hours ago
        Your Project Leyden's "AOT cache" Youtube link is broken, did you mean to link to https://www.youtube.com/watch?v=fiBNDT9r_4I?
    • kakacik 6 hours ago
      Most of real world use of Java platform has next to 0 concerns like those. Some more niche use case may benefit, good, but overall success map isn't changing anytime soon. Reasons for its long term success lie elsewhere.
      • FartyMcFarter 5 hours ago
        Android Java apps' memory consumption is definitely a relevant concern.
        • gf000 2 hours ago
          It doesn't even run "JavaTM", but some bastard child that is in like ~5 years delay compared to OpenJDK.
      • re-thc 3 hours ago
        Not true. Lots of large Java deployments with millions to billions in cloud spend. The Java part of it isn’t 0.

        Memory isn’t free. CPU isn’t free.

        • gf000 2 hours ago
          And java uses very little CPU compared to most other languages. It's right after manual memory managed languages like C/C++, and is the first managed language according to a paper about how "green" each language is.

          But there is a semi-fundamental tradeoff here, you either use more CPU to use less memory or the reverse. Java can be dynamically configured for either end (though defaults to less CPU by not running the GC unnecessarily).

  • ChrisMarshallNY 3 hours ago
    I started off with Machine Code, on a device with 256 bytes (not KB) of RAM. That was 256 bytes, to install the executable, reserve the stack, and set up the heap.

    We often used bit (not byte) fields, to convey information.

    Made life challenging.

    However, being able to be sloppy has its definite advantages. It takes a long time to design highly-optimized stuff. If just declaring a couple of new properties takes thirty seconds, and designing a bitfield takes an hour, then we have some real cost-savings, there.

    That said, it's easy to get crazy, these days. I just spent a couple of days, chasing down greedy memory hogs. These were operations that ate gigabytes of memory. I determined that the real culprit was actually Apple MapKit, and figured out a simple workaround, but it took a long time to get there. If I suspect the OS, then it's usually my fault, and trying everything before going back to the OS takes time.

    • Obscurity4340 3 hours ago
      How do you deal with all the daemons and automatic crap that does this on Mac? Isnt it all reinforced by SIP?
      • ChrisMarshallNY 2 hours ago
        I think all operating systems have these.

        In this one case, allocating a MapView via storyboard, caused some kind of cascading strong reference stuff.

        Simply allocating it programmatically, fixed it.

        Took awhile to get there, though.

  • pron 6 hours ago
    > The cost of each new field is rarely considered

    Most developers, in Java and in most other languages, do not consider the cost of every field, but I can tell you that people who need micro-optimisations certainly do care, and in Java's standard library, layout is very much a concern (except, as always, you want to optimise what really matters; there's no point in optimising something that is unlikely to be a hot spot in a real program). Sometimes, though, you want to intentionally spread out the layout to avoid cache line sharing when concurrency is involved. You will find such examples in the standard library, too.
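
A rough sketch of what "spreading out the layout" can look like in user code, using manual padding through a class hierarchy. This is not how the standard library does it (the JDK has an internal `@Contended` annotation for that purpose); it assumes 64-byte cache lines and assumes the JVM keeps superclass fields ahead of subclass fields, neither of which is guaranteed. All names are hypothetical.

```java
/** Hypothetical manually padded counters, intended to keep two hot fields on separate cache lines. */
final class PaddedCounters {
    // 7 longs = 56 bytes of padding before and after the hot field.
    // Assumption: 64-byte cache lines and superclass-first field layout.
    static class LeftPadding { long p0, p1, p2, p3, p4, p5, p6; }
    static class HotField extends LeftPadding { volatile long value; }
    static final class PaddedCounter extends HotField { long q0, q1, q2, q3, q4, q5, q6; }

    // Two threads, each the sole writer of its own counter, so 'value++' is safe here.
    static long[] runTwoWriters(int increments) {
        PaddedCounter a = new PaddedCounter();
        PaddedCounter b = new PaddedCounter();
        Thread ta = new Thread(() -> { for (int i = 0; i < increments; i++) a.value++; });
        Thread tb = new Thread(() -> { for (int i = 0; i < increments; i++) b.value++; });
        ta.start();
        tb.start();
        try {
            ta.join();
            tb.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new long[] { a.value, b.value };
    }

    public static void main(String[] args) {
        long[] result = runTwoWriters(1_000_000);
        System.out.println(result[0] + " " + result[1]);
    }
}
```

Each `PaddedCounter` is deliberately larger than it needs to be, which is the opposite of "every byte matters": the extra bytes buy independence between writer threads. Whether it helps on a given machine is something only a benchmark can show.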

    • re-thc 3 hours ago
      > Most developers, in Java and in most other languages, do not consider the cost of every field

      Are you saying most developers are bad? It’s the equivalent of most employees don’t consider the cost of every action to the employer and is how company spend blows up.

      • pron 3 hours ago
        I'm saying that most developers aren't writing code where layout is a primary contributor to the program's performance. Even in performance-sensitive applications, only a minority of the team are working on the hot spots.

        And speaking about costs, knowing what to optimise is the key to software performance. Improving the performance of an operation by 10000x will improve the performance of your program by less than 1% if the operation is only 1% of the profile to begin with. So I'm only saying that most developers don't work on code where the layout is very significant, but some certainly do.
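
The 10000x figure can be checked with a few lines of arithmetic (this is Amdahl's law). The sketch below is illustrative only; `Amdahl` and its method names are hypothetical.

```java
/** Hypothetical helper: how much total time is saved by speeding up one part of a program. */
final class Amdahl {
    // Run time after the optimisation, as a fraction of the original run time.
    // 'fraction' is the share of the profile taken by the optimised operation.
    static double remainingTime(double fraction, double localSpeedup) {
        return (1.0 - fraction) + fraction / localSpeedup;
    }

    // Share of the original run time that the optimisation removes.
    static double timeSaved(double fraction, double localSpeedup) {
        return 1.0 - remainingTime(fraction, localSpeedup);
    }

    public static void main(String[] args) {
        // An operation that is 1% of the profile, made 10000x faster.
        System.out.printf("cold path: %.4f%% saved%n", 100 * timeSaved(0.01, 10_000));
        // A hot path that is 90% of the profile, made only 2x faster.
        System.out.printf("hot path:  %.4f%% saved%n", 100 * timeSaved(0.90, 2));
    }
}
```

With a 1% share, the saving is capped at 1% no matter how large the local speedup is, while a modest 2x on a 90% hot path removes 45% of the run time.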

        • re-thc 3 hours ago
          > I'm saying that most developers aren't writing code where layout is a primary contributor to the program's performance.

          I've heard this theory before. This isn't just about performance and I don't buy it.

          I've seen too many examples of "this is just a temporary solution so it doesn't matter". >3 years later that "temporary solution" was still there and at the heart of many operations, yet it's now too hard and too costly to fix.

          I've also seen the "this is a quick hack. No 1 uses it. It doesn't go through any hot paths. All good." You know what happens? Years later, every service literally goes through it. Again, it's too hard to fix.

          In the real world these "theories" are really loose. The only fix is everyone should be aware of what they are doing and do it properly. The "it might not happen" etc. mindset is dangerous.

          • pron 2 hours ago
            This has absolutely nothing to do with what I said. I wasn't referring to people who think that program performance doesn't matter (although I'm sure there are many of those) but to people working on code that either doesn't impact the overall program's performance much or it does but not due to layout. The number of developers working on code where layout is a major contributor to performance is relatively low, and this includes people working on programs where layout does impact performance significantly (because even in such a program, that particular hot path is not touched by every developer).
            • re-thc 1 hour ago
              > but to people working on code that either doesn't impact the overall program's performance much or it does but not due to layout

              And that's the problem. Who decides that? How do you know and that's my problem with it. Things always change. It's always temporary, not in the hot path, doesn't matter etc until it does.

              So what is considered "doesn't impact" often comes back to bite.

              • pron 1 hour ago
                That is why profiling is the only way to good performance. It's what lets you know what matters, and it's the only thing that does or can. I've been doing low-level (as well as high level) programming for more than 25 years, and I don't know in advance what is more efficient than what. An operation that was inefficient in the program I wrote yesterday under high contention or bad branch prediction could be efficient in the program I'll write tomorrow. I can only know that if I profile my specific program (and when writing code for different architectures, I need to profile my program on all of them, because what's efficient on x86-64 may be inefficient on Aarch64 or vice-versa). The days we could tell that something is efficient or not, except for the obvious cases, are gone. Computers, at both the hardware and software infrastructure layers, don't work like that anymore.

                If your profile shows you a hot path that's responsible for 90% of the time your program spends, any second optimising anything outside of it harms your performance, as it's a second spent on low ROI instead of high ROI.

          • gf000 2 hours ago
            Then what is it that you are saying? That I should use JMH to determine the best layout for my helper class that will be initialized 3 times? Like most of the software (by line of code) is boring plumbing from one service to another with some dumb business logic sprinkled in. Something like a single config option for your database driver matters orders of magnitude more in many types of applications.

            It's much more niche to work on stuff where such changes actually matter, like much much more people write boring CRUD backends than those who write physics simulators and audio processing pipelines combined.

            • re-thc 2 hours ago
              Consider the cost of every field, of every action.

              Understand the language, the memory model, etc. Don't do "it works on my machine". Understand the architecture, layout, implications etc.

              E.g. if you need an int and not a long you should clearly use an int. Wait until you do this every time and things blow up and it's too "hard" to change.

              It's called be aware of your actions. Take responsibility of what you do.

              > It's much more niche to work on stuff where such changes actually matter,

              Not true and that's why there's so much wastage.

              A lot of things matter. I've seen more times than the other way that simple awareness and changes can pay for my salary, e.g. not updating to newer EC2 instances when they get released in AWS. Even in a mid size company that was hundreds to thousands in savings.

              I've seen CI/CD pipelines where the developers never considered caching and it takes hours to run. It's not free. When every PR and update (hundreds a day) triggers a run it's a cost and a cost not just on machines but developer time waiting.

              I can list a lot more examples and everyone in the chain can contribute.

              • pron 2 hours ago
                > Consider the cost of every field, of every action.

                This runs counter to most modern software performance principles. Thanks to modern hardware optimisations (cache hierarchy, ILP, branch prediction), modern compiler optimisations (aggressive inlining that leads to a much wider view), and increased concurrency, the notion of some action having a fixed cost lost most of its meaning about 20 years ago, and increasingly so since. Because how fast some action is now depends on a much broader context of what else is going on in the program (and the machine), action X can be faster than Y in one program and the same or slower than Y in another.

                Because it's nearly impossible to generalise (and so what was true in your previous program may not be true in your current one unless they're nearly identical), the advice is to first profile your program so that you know how fast or slow different parts are in the context of your particular program and then to focus the optimisation efforts on the hot paths in your program. Otherwise, you may end up spending effort where it makes no difference, and this comes at the cost of optimising what matters, overall harming performance.

                Taking responsibility means being smart about directing your resources to where they can have the most impact.
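
                A minimal sketch of "profile first", using Python's standard cProfile and pstats modules. The functions `hot`, `cold`, and `main` are made-up stand-ins for a real program; the point is only that the report, not intuition, says where the time goes.

```python
import cProfile
import io
import pstats


def hot():
    # Hypothetical expensive routine: dominates the run time.
    return sum(i * i for i in range(200_000))


def cold():
    # Hypothetical cheap routine: optimising it would change nothing.
    return sum(range(100))


def main():
    for _ in range(5):
        hot()
    cold()


# Collect a profile of one run of main().
profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Render the five most expensive entries by cumulative time into a string.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
print(report)
```

                Only after reading a report like this does it make sense to decide which function deserves attention.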

      • Retr0id 3 hours ago
        Most likely they just have other priorities. A lot of code is not at all performance-sensitive, or is bottlenecked by some other factor.
      • perching_aix 50 minutes ago
        No, it means the opposite.
      • nathanielks 3 hours ago
        If the previous commenter won't say that, I will
      • LoganDark 3 hours ago
        It doesn't take a "bad developer" to not consider the cost of every field...
  • manoDev 53 minutes ago
    Tip: to get LN cache sizes on a Mac, the command is

        $ sysctl -a | grep "l.*cachesize" | gnumfmt --field=2 --to=si
        hw.perflevel1.l1icachesize:   132k
        hw.perflevel1.l1dcachesize:   66k
        hw.perflevel1.l2cachesize:    4,2M
        hw.perflevel0.l1icachesize:   197k
        hw.perflevel0.l1dcachesize:   132k
        hw.perflevel0.l2cachesize:      13M
        hw.l1icachesize:   132k
        hw.l1dcachesize:   66k
        hw.l2cachesize:    4,2M
    
    And the equivalent to LEVEL1_DCACHE_LINESIZE is

        $ sysctl -a | grep hw.cachelinesize
        hw.cachelinesize: 128
  • forinti 6 hours ago
    So if you need speed, you just have to swallow your OO programmer's pride and put your data in arrays.
    • jayd16 4 hours ago
      If you have hot loops with millions of iterations at a time, structure your code accordingly. It's not anti-OO to choose the right data structure for the job.
    • bob1029 5 hours ago
      And avoid moving said data between physical threads as much as possible.

      Most of the bottlenecks I see are not due to the organization of data. Unnecessary communication of data is the #1 offender.

      • burnt-resistor 3 hours ago
        Working set and algorithm diagonalization (work independence) FTW. Immutable data structures and copying often help to avoid cache invalidation penalties.
    • theandrewbailey 6 hours ago
      Maybe someone can write an OO language where arrays of structs are automatically stored as structs of arrays.

      mild /s

      • fp64 5 hours ago
        Odin has some helpers, was one of the more interesting features I found, but never tried. Not sure if you want to consider Odin OO, but well https://odin-lang.org/docs/overview/#soa-struct-arrays
        • the__alchemist 2 hours ago
          Odin is heavily inspired by the lang he or she is referring to!
          • fp64 2 hours ago
            A sibling comment also mentioned Jai. Not sure what I am missing that the original post was explicitly referring to Jai, some inside joke maybe?

            I am sorry, I only know Odin. Jai is this cult on reddit/discord, right? You get access if you socialize enough or something? Not my thing. Not for a language.

            • theandrewbailey 1 hour ago
              (original poster here)

              I was just throwing out an idea. I had no idea there were already implementations! Because, to my knowledge, conventional popular languages like C/C++/C#/Java/JS/Python don't do that, and automatically doing that (under certain conditions) feels like an easy performance win.

              • jevndev 21 minutes ago
                For what it’s worth, a common example of the capabilities of c++26 reflection is exactly this use case. I can’t remember where I first saw it, but this article [0] showcases the technique pretty well. It’s opt-in so not the compiler optimization that you’re imagining but still neat that it’s possible

                [0] https://brevzin.github.io/c++/2025/05/02/soa/

            • the__alchemist 1 hour ago
              Ah. So, the context (Which I read too far into evidently): 1: One of Jai's initial primary marketing points was to address exactly this: SoA performance with AoS ergonomics. 2: Odin is (or was initially) inspired by Jai.
      • Mizza 6 hours ago
        Are you talking about Zig's MultiArrayList?
        • alex7o 5 hours ago
          He is talking about Jai, the programming language from Jonathan Blow, which is quite cool but there is no way to access it.
      • tlb 5 hours ago
        There's a package to do this in Julia: https://juliaarrays.github.io/StructArrays.jl/stable/
      • gryn 2 hours ago
        something like this https://crates.io/crates/columnar ?
  • recursivedoubts 3 hours ago
    When you are developing games, sometimes.

    When you are developing most other applications every byte does not matter. What matters much more is overall system architecture, collapsing unnecessary abstraction layers that some developers (especially java developers) seem to love and optimizing your datastore access.

    As always, profile profile profile.

    A company I worked for spent a violent couple of man-decades flipping our proprietary scripting language from interpreted to bytecode generation, obviously with tons of bugs and subtle semantic changes, and it ended up boosting overall system performance by about 30%. We could have done nothing over that period of time and hardware advances would have made a bigger impact.

  • ssiddharth 6 hours ago
    Slight tangent, but every ms, μs, and ns counts too. We've gotten awfully carefree with response times and wasted compute cycles.
  • rao-v 3 hours ago
    I always find it odd that major languages don't have a built-in way of asking for an array of objects to be optimized as SoA or AoS
    • jayd16 2 hours ago
      It doesn't quite make sense to keep object identity at the language level. Inherently the data in the arrays cannot be the same memory as the data in the objects' fields.

      To get the speed up, you can't just abstract it as an access pattern because it's tied to the specific way the memory is laid out.

      If you were trying to make some kind of collection type that could be queried by both row and column, you would need to store it both ways at all times and also keep both representations in sync, which also defeats the purpose, somewhat.

      I feel like if you're trying to do this pattern then it doesn't make sense to also keep the objects.
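
      A rough sketch of what giving up the objects can look like, in Python with the standard `array` module. The names `Monsters` and `MonsterView` and the two fields are hypothetical. The "object" shrinks to an index into the columns, and a row-like view holds no data of its own.

```python
from array import array


class Monsters:
    """Struct-of-arrays container: one contiguous typed array per field."""

    def __init__(self):
        self.health = array("i")    # signed ints, stored contiguously
        self.is_alive = array("B")  # one unsigned byte per monster

    def add(self, health, is_alive=True):
        self.health.append(health)
        self.is_alive.append(1 if is_alive else 0)
        return len(self.health) - 1  # the index is the monster's identity

    def count_alive(self):
        # Column scan: touches only the is_alive array.
        return sum(self.is_alive)


class MonsterView:
    """Row-like handle: stores only (container, index), never field data."""

    def __init__(self, monsters, index):
        self._monsters = monsters
        self._index = index

    @property
    def health(self):
        return self._monsters.health[self._index]

    @health.setter
    def health(self, value):
        self._monsters.health[self._index] = value


monsters = Monsters()
first = monsters.add(10)
second = monsters.add(0, is_alive=False)

view = MonsterView(monsters, first)
view.health = 7  # writes straight into the health column
```

      Because the view reads and writes the columns directly, there is no second copy to keep in sync, at the price of every field access going through an index.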

  • Luff 4 hours ago
    Yes we should end the hateful rhetoric of most and least significant bytes. Every Byte Matters.
    • diabllicseagull 3 hours ago
      We'll get there, bit by bit.
    • zabzonk 4 hours ago
      We need an ending to byte-sizeism as well.
    • moi2388 3 hours ago
      In combination with “What colour are your bits” I do not see this ending well..
  • compiler-guy 1 hour ago
    SoA can be a big win. But so can plain AoS, just depends on the access pattern.

    Profiling important workloads matters. Without that everything else is guesswork.

  • SuperV1234 3 hours ago
    Data Oriented Design rocks. It was the subject for my CppCon 2025 keynote: https://youtube.com/watch?v=SzjJfKHygaQ
    • setheron 2 hours ago
      Add it to my watch list!
  • coldcity_again 6 hours ago
    I love to see stuff like this. As an active Vectrex gamedev and PC/Amiga sizecoder I strongly agree with the sentiment!
  • nasretdinov 3 hours ago
    Ideally you'd want to go further and actually store the is_alive as a bit mask and use SIMD instructions to filter out zeroes for example.
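
    A minimal sketch of the bit-mask half of that idea, in plain Python and without any real SIMD. Each monster's flag becomes one bit of a single integer, so 1M flags need roughly 125 KB instead of 1 MB. The helper names `pack_alive`, `is_alive`, and `count_alive` are hypothetical.

```python
def pack_alive(flags):
    """Pack an iterable of booleans into one int; bit i belongs to monster i."""
    mask = 0
    for i, alive in enumerate(flags):
        if alive:
            mask |= 1 << i
    return mask


def is_alive(mask, i):
    """Test a single monster's bit."""
    return (mask >> i) & 1 == 1


def count_alive(mask):
    """Population count: number of set bits in the mask."""
    return bin(mask).count("1")


flags = [True, False, True, True, False, False, True, False, True]
mask = pack_alive(flags)

# Bytes needed to hold one bit per monster, rounded up.
packed_bytes = (len(flags) + 7) // 8
```

    In a compiled language the same layout lets one wide instruction test many monsters at once, which is where the SIMD win would come from.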
  • AxelWickman 4 hours ago
    Cool read. The AoS vs SoA speaks for itself.
  • readthenotes1 2 hours ago
    "In that time, you get used to huge classes. New functionality? Just add a new method and field to the class"

    I guess this is one reason why object-orientation has such a bad reputation.

    I once worked at a bank where the OO mentor had taught people that the only object they needed was "Tape" and had them replicate the structure of data on the old spooled tape reels.

    The struct of arrays reminds me of this optimization.

  • yas_hmaheshwari 6 hours ago
    Off topic: I thought I was about to read an article about the Iran war or some geopolitical news when I read fzakaria :-)
  • RickJWagner 5 hours ago
    That’s a great read. I wish more people wrote like that.
    • fdegmecic 5 hours ago
      CppCon 2014: Mike Acton "Data-Oriented Design and C++"

      Andrew Kelley: A Practical Guide to Applying Data Oriented Design (DoD)

      you should check these two talks out then.

      • lionkor 3 hours ago
        The first is quite famous in data oriented design/programming circles, the second one is up there, too. Both very much worth watching.
  • coolThingsFirst 5 hours ago
    Why doesn’t the machine fill up the other cache lines as well? Why is it only 64 bytes and then a miss?
    • masklinn 5 hours ago
      They will absolutely do that (prefetching, they can even eagerly load what’s on the other side of a pointer).

      However it requires additional hardware to recognize patterns which benefit from prefetching, and every time the CPU prefetches data which ends up not being used it has both burned energy and memory bandwidth, and evicted data from the cache which might be needed (cache pollution).

    • spiffyk 4 hours ago
      A cache line is simply the unit of data a CPU cache works with (generally 64 bytes, because someone somewhere has probably determined that that is the best line size for general use), much like there are units of data like bytes (8 bits nowadays, but there have been weird ones historically), pages (varies between hardware as well, and may be OS-configurable), etc.

      As TFA mentions, a CPU does some predictions about what cache lines to prefetch, e.g. when you do sequential reads. Moreover, the x86_64 instruction set provides a prefetch instruction through which you are able to give the CPU a hint "hey, I'm gonna be using this soon, prepare accordingly, pretty please".

      Still, the utility of prefetching is diminished if you only use a single byte from each cache line, because the mechanism generally depends on you doing other work while the next cache line is being fetched. So really the best case scenario is to take as much time as possible to work with what is already fetched, so that there is time for the next unit of data to be fetched in the meantime.
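
      The arithmetic can be sketched as follows, assuming 64-byte cache lines and a hypothetical 64-byte Monster struct (the struct size is an assumption for illustration, not a figure from the article). The function counts how many cache lines a scan of one 1-byte field touches for a given stride between consecutive items.

```python
CACHE_LINE = 64  # bytes per cache line (assumed; some machines use 128)
N = 1_000_000    # number of monsters


def lines_touched(n_items, stride, line=CACHE_LINE):
    """Cache lines touched when reading a 1-byte field from each item.

    Assumes the first item starts on a cache-line boundary and items are
    `stride` bytes apart.
    """
    if stride >= line:
        # Consecutive fields are at least one line apart: one line per item.
        return n_items
    # Fields are denser than lines, so every line up to the last byte is hit.
    last_byte = (n_items - 1) * stride
    return last_byte // line + 1


aos_lines = lines_touched(N, stride=64)  # flag embedded in a 64-byte struct
soa_lines = lines_touched(N, stride=1)   # flags packed in their own array

aos_bytes_fetched = aos_lines * CACHE_LINE
soa_bytes_fetched = soa_lines * CACHE_LINE
```

      Under these assumptions the AoS scan pulls in 64 bytes for every useful byte, while the SoA scan uses every byte it fetches.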

    • Liquid_Fire 5 hours ago
      It might sometimes prefetch the surrounding lines as well, but ultimately cache space is limited, so there is a trade-off. Every time you fill a line, you are throwing away something else that was cached there previously, which you may need again in the near future.
  • burnt-resistor 3 hours ago
    I'm curious if anyone has had to write a JNI extension for a hot (CPU, GPU, RAM) section the JVM was unable to effectively JIT and/or optimize enough.