Performance optimisation is one of those topics where the gap between what’s taught and what’s practiced is enormous. Textbooks focus on algorithmic complexity. Online discussions obsess over micro-benchmarks. In practice, the performance that matters in game development is about data layout, memory access patterns, and knowing when to optimise and — just as importantly — when to stop.

At Relish Games, our work with C++ and frameworks like HGE means we deal with performance-sensitive code regularly. This is our practical guide to the optimisation techniques that actually move the needle in 2D game development.

Rule Zero: Profile First

The single most important optimisation principle: never optimise without profiling data. Your intuition about where the bottleneck is will be wrong more often than you expect — even experienced developers routinely guess the wrong hot spot.

Profiling Workflow

  1. Reproduce the performance problem — identify the specific scenario where performance is inadequate
  2. Profile under realistic conditions — profile a release build with realistic game data; debug builds and synthetic benchmarks have very different performance characteristics from what players will run
  3. Identify the hot path — the specific functions and code paths consuming the most time
  4. Optimise the hot path — not the code that “looks slow” but the code that IS slow
  5. Measure the improvement — quantify the gain to decide whether more optimisation is needed

Profiling Tools

  • Visual Studio Profiler: Integrated, sampling-based, good for getting started
  • Intel VTune: Detailed hardware counter analysis, excellent for cache analysis
  • Tracy: Open-source frame profiler designed for games — shows frame breakdowns visually
  • Manual timing: Simple QueryPerformanceCounter blocks around suspected hot paths

For game development specifically, frame-time profiling (how long each frame takes, broken down by system) is more useful than traditional function-level profiling.
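As a sketch of what frame-time profiling can look like, the hypothetical FrameTimer below accumulates per-system times across a frame; std::chrono stands in for QueryPerformanceCounter so the snippet stays portable, and the class name and interface are illustrative, not a real profiler's API:

```cpp
#include <chrono>
#include <string>
#include <unordered_map>

// Hypothetical per-system frame timer: accumulates milliseconds per
// named system each frame; timings() returns the breakdown for display.
class FrameTimer {
    std::unordered_map<std::string, double> ms_;

public:
    // Times a callable and adds the elapsed milliseconds under `name`.
    template <typename F>
    void time(const std::string& name, F&& fn) {
        auto start = std::chrono::steady_clock::now();
        fn();
        auto end = std::chrono::steady_clock::now();
        ms_[name] +=
            std::chrono::duration<double, std::milli>(end - start).count();
    }

    void reset() { ms_.clear(); }  // call once at the start of each frame

    const std::unordered_map<std::string, double>& timings() const {
        return ms_;
    }
};
```

A frame loop would wrap each system — `timer.time("physics", ...)`, `timer.time("render", ...)` — and display the breakdown, which answers "which system blew the budget this frame?" far faster than a flat function profile.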

Cache-Friendly Data Layouts

Modern CPUs are fast. Modern memory is (relatively) slow. The gap between CPU speed and memory latency is the defining performance characteristic of contemporary hardware. The cache hierarchy bridges this gap — but only if your data access patterns cooperate.

Array of Structures (AoS) vs Structure of Arrays (SoA)

AoS — the natural approach:

struct Entity {
    float x, y;          // Position
    float vx, vy;        // Velocity
    int health;           // HP
    int spriteId;         // Visual
    bool active;          // State
};
std::vector<Entity> entities;

SoA — the cache-friendly approach:

struct Entities {
    std::vector<float> x, y;
    std::vector<float> vx, vy;
    std::vector<int> health;
    std::vector<int> spriteId;
    std::vector<uint8_t> active;  // uint8_t, not bool: std::vector<bool> is bit-packed and returns proxy references
};

When your update loop only needs position and velocity (physics update), AoS loads entire Entity structs into cache lines, wasting bandwidth on health, spriteId, and active. SoA loads only the relevant arrays, using every byte of cache effectively.

The practical tradeoff: SoA is harder to work with and adds complexity. Use it for hot loops that process many entities — physics updates, collision checks, rendering batches. Keep AoS for low-frequency systems where code clarity matters more than cache efficiency.
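To make the cache argument concrete, here is a minimal sketch of the hot physics loop over a reduced SoA store — `PhysicsArrays` and `integrate` are illustrative names holding just the four arrays the loop touches:

```cpp
#include <cstddef>
#include <vector>

// Reduced SoA store for the physics update: only the arrays the
// hot loop actually reads and writes.
struct PhysicsArrays {
    std::vector<float> x, y;    // positions
    std::vector<float> vx, vy;  // velocities
};

// Every cache line fetched here is fully used — no health, spriteId,
// or active bytes ride along.
void integrate(PhysicsArrays& e, float dt) {
    const std::size_t n = e.x.size();
    for (std::size_t i = 0; i < n; ++i) {
        e.x[i] += e.vx[i] * dt;
        e.y[i] += e.vy[i] * dt;
    }
}
```

Loops of this shape are also the easiest for the compiler to auto-vectorise, which pays off again in the SIMD section below.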

Memory Allocation Patterns

Dynamic memory allocation (new, malloc) is expensive compared to computation. In game code, allocation patterns are predictable — you know roughly how many entities, particles, and projectiles you’ll have.

Pool Allocators

Pre-allocate a fixed array of objects and hand them out on demand:

template<typename T, size_t N>
class ObjectPool {
    T objects[N];
    bool used[N] = {};
    
public:
    // Linear scan for a free slot — O(N), fine for modest pool sizes.
    // Switch to a free-list for O(1) acquire if N grows large.
    T* acquire() {
        for (size_t i = 0; i < N; i++) {
            if (!used[i]) {
                used[i] = true;
                return &objects[i];
            }
        }
        return nullptr;  // pool exhausted — callers must handle this
    }
    
    void release(T* obj) {
        size_t idx = obj - objects;  // assumes obj came from this pool
        used[idx] = false;
    }
};

Best for: Entities that are frequently created and destroyed — bullets, particles, temporary effects. This maps directly to how HGE’s particle system manages large numbers of short-lived objects.

Arena Allocators

Allocate memory linearly from a large pre-allocated block. Perfect for per-frame temporary allocations where everything can be freed at once:

class ArenaAllocator {
    char* memory;
    size_t offset = 0;
    size_t capacity;
    
public:
    explicit ArenaAllocator(size_t bytes)
        : memory(new char[bytes]), capacity(bytes) {}
    ~ArenaAllocator() { delete[] memory; }
    
    void* alloc(size_t size) {
        size = (size + 7) & ~size_t(7);        // round up so pointers stay 8-byte aligned
        if (offset + size > capacity)
            return nullptr;                     // arena exhausted
        void* ptr = memory + offset;
        offset += size;
        return ptr;
    }
    
    void reset() { offset = 0; }               // "frees" everything at once
};

Best for: Temporary calculations, string building, render command lists — anything allocated during a frame and discarded after.

Hot Loop Optimisation

The inner loops that run thousands of times per frame deserve focused attention.

Reduce Work Per Iteration

Before optimising how you do the work, check whether you can do less work:

  • Early-out conditions: Skip entities that can’t possibly be relevant (off-screen, inactive, too far away)
  • Spatial partitioning: Grids, quad-trees, or hash maps that limit collision checks to nearby entities
  • Culling: Don’t process or render anything outside the camera view
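Spatial partitioning is the least obvious of these, so here is a minimal uniform-grid sketch — `UniformGrid` and its interface are illustrative, and the integer truncation toward zero around the origin is good enough for a sketch (use floor for exact behaviour with negative coordinates):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical uniform grid: buckets entity indices by cell so a
// collision query scans only the 3x3 neighbourhood, not every entity.
class UniformGrid {
    float cellSize_;
    std::unordered_map<int64_t, std::vector<int>> cells_;

    // Packs a cell coordinate pair into one map key.
    int64_t key(int cx, int cy) const {
        return (int64_t(cx) << 32) ^ uint32_t(cy);
    }

public:
    explicit UniformGrid(float cellSize) : cellSize_(cellSize) {}

    void insert(int entity, float x, float y) {
        cells_[key(int(x / cellSize_), int(y / cellSize_))].push_back(entity);
    }

    // Entities in the cell containing (x, y) plus its 8 neighbours.
    std::vector<int> nearby(float x, float y) const {
        std::vector<int> out;
        int cx = int(x / cellSize_), cy = int(y / cellSize_);
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {
                auto it = cells_.find(key(cx + dx, cy + dy));
                if (it != cells_.end())
                    out.insert(out.end(), it->second.begin(), it->second.end());
            }
        return out;
    }
};
```

With a cell size close to the largest entity's collision radius, pairwise checks drop from O(n²) over all entities to roughly O(n) times the local density.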

Minimise Branching

Branch mispredictions stall the CPU pipeline. In hot loops:

  • Sort data to make branches predictable (all active entities first, then inactive)
  • Use branchless techniques where simple (conditional moves, arithmetic instead of branches)
  • Avoid virtual function calls in tight loops — the indirect jump is a branch the CPU can’t predict well
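A tiny example of the arithmetic-instead-of-branch idea — `applyDamage` is a made-up helper, shown both ways:

```cpp
// Branchy version — mispredicts when the active flags are unpredictable:
//     if (active) health -= damage;
// Branchless version — multiply by the flag instead, so there is
// nothing for the branch predictor to guess:
int applyDamage(int health, int damage, bool active) {
    return health - damage * int(active);  // inactive contributes 0
}
```

In practice, measure before and after: branchless code can be slower than a well-predicted branch, so this only pays off when the profiler shows misprediction stalls.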

SIMD for Game Math

Single Instruction, Multiple Data instructions process 4 or 8 values simultaneously. Game math — position updates, distance calculations, sprite transformations — is naturally SIMD-friendly:

// Scalar: one entity per iteration
for (int i = 0; i < count; i++)
    positions[i] += velocities[i] * dt;

// SIMD: four floats per instruction. Assumes the arrays are 16-byte
// aligned (required by _mm_load_ps) and count is a multiple of 4 —
// otherwise use _mm_loadu_ps and a scalar tail loop for the remainder.
__m128 dt_vec = _mm_set1_ps(dt);
for (int i = 0; i < count; i += 4) {
    __m128 pos = _mm_load_ps(&positions[i]);
    __m128 vel = _mm_load_ps(&velocities[i]);
    _mm_store_ps(&positions[i], _mm_add_ps(pos, _mm_mul_ps(vel, dt_vec)));
}

Practical note: Modern compilers auto-vectorise simple loops. Check the compiler output (Visual Studio: /Qvec-report:2) before writing manual SIMD. Let the compiler do it first; hand-write SIMD only where the compiler fails.

Rendering Optimisation for 2D Games

Sprite Batching

The single biggest rendering optimisation for 2D games: batch sprites that share the same texture into single draw calls.

Instead of:

Draw sprite 1 (texture A) → 1 draw call
Draw sprite 2 (texture A) → 1 draw call  
Draw sprite 3 (texture B) → 1 draw call
Draw sprite 4 (texture A) → 1 draw call

Sort and batch:

Draw sprites 1, 2, 4 (texture A) → 1 draw call
Draw sprite 3 (texture B)        → 1 draw call

Going from 4 draw calls to 2 might not seem significant, but in a real scene with hundreds of sprites, batching can reduce draw calls from 500+ to under 20.

The HGE sprite system handles batching internally, but if you’re building your own engine, implementing efficient batching is a high-priority task.
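The core of the sort-and-batch idea can be sketched in a few lines — this is an illustration of the technique, not HGE's internal implementation, and `Sprite` and `countDrawCalls` are made-up names:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Minimal sprite record: textureId decides which batch it joins.
struct Sprite {
    int textureId;
    float x, y;
};

// Sort by texture, then count runs of equal textureId — each run
// becomes a single draw call when the batch is flushed.
int countDrawCalls(std::vector<Sprite>& sprites) {
    std::sort(sprites.begin(), sprites.end(),
              [](const Sprite& a, const Sprite& b) {
                  return a.textureId < b.textureId;
              });
    int calls = 0;
    for (std::size_t i = 0; i < sprites.size(); ++i)
        if (i == 0 || sprites[i].textureId != sprites[i - 1].textureId)
            ++calls;  // texture changed: previous batch is flushed
    return calls;
}
```

A real batcher would also respect draw order for overlapping sprites (sort by layer first, texture second), but the counting logic above matches the four-sprite example: two textures, two draw calls.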

Texture Atlas Efficiency

Pack sprites into atlases to minimise texture switching. Beyond basic packing, consider:

  • Group sprites by rendering order (background atlas, entity atlas, UI atlas)
  • Pack frequently co-rendered sprites into the same atlas
  • Pad each sprite by a pixel or two to prevent bleeding from texture filtering, and keep atlas dimensions power-of-two for older hardware compatibility

Compiler Optimisation Flags

Don’t underestimate what the compiler does for you:

  • Release build configuration: Obvious, but debug builds can be 10–50x slower
  • Link-time optimisation (LTO): Enables cross-file inlining and dead code elimination
  • Profile-guided optimisation (PGO): Compile, profile, recompile with profile data
  • Architecture-specific flags: /arch:AVX2 enables newer SIMD instructions

What We’d Do in Practice

  1. Profile first, always — no exceptions
  2. Fix the algorithm before fixing the implementation — O(n²) to O(n log n) beats any micro-optimisation
  3. Use SoA for entity systems with more than a hundred entities
  4. Pool-allocate anything created and destroyed frequently
  5. Batch rendering and use texture atlases from day one
  6. Let the compiler auto-vectorise before writing manual SIMD
  7. Set a frame budget and stop optimising when you’re within it

Performance optimisation is satisfying work, but it’s a means to an end. The goal is a smooth player experience, not the fastest possible code. Once your frame times are consistently within budget, move on to making the game better.

Explore HGE’s approach to performance in the engine documentation, or discuss optimisation techniques with other developers in our community forum.