Let’s bring the cinematic experience of linear film music to non-linear interactive games: tight synchronization, intentional music and fluid interactivity, able to acknowledge individual moments big and small. Imagine it as if the composer watched a video of your linear playthrough and composed the music for that movie—except in real-time.
Music has always been influenced by technology. Seemingly simple technological advances—new instruments like the piano, the valved trumpet or even well temperament—provoked major changes in the way music was written and performed. Since the 20th century, the advent of recording technology, synthesizers and new mediums and distribution paradigms has created entirely new meta-fields of music: film scoring and game music, to name but a few.
Game music bears a lot of similarity to film music from a production standpoint. One could also draw a loose analogy between early game scores and music for silent film: both favor a general approach to scoring a scene, with little specificity or acknowledgment of subtle shifts. As the medium has matured, game composers have also borrowed many of the film world’s techniques and traditions for scoring narrative and emotion. The interactive nature of games and the historically limited performance capabilities of gaming devices, however, mean there are also a lot of fundamental differences. Certain film scoring techniques that listeners and composers take for granted—a musical flourish with the hero’s theme when they enter—can be difficult or impossible to achieve in an interactive score.
The pace of technological innovation has been particularly rapid for game music. We’ve already gone through two major sea changes in how game music sounds and is written. I would argue we’re on the cusp of a third.
The first wave was defined by on-device synthesis. Certain adaptive techniques are effortless: Sonic gets the fast shoes and whatever music is playing speeds up. Music could be transposed or generated algorithmically on the fly, though practical applications of this technique were few and far between because of the infancy of the medium and lack of available processing power. Standard libraries like DirectMusic had features for doing algorithmic music. Dedicated sound cards with synthesis and DSP were standard in both computers and consoles.
The second wave began with the introduction of larger storage mediums: CDs, DVDs and hard drives. All of a sudden, game consoles and computers could store entire recordings, from rock to full orchestra. This also led to an influx of film and TV composers, fueled by larger and larger budgets for games. The quality of the music and audio itself grew substantially, but at the expense of adaptive flexibility. Trivial techniques from the first wave, like speeding up the music, suddenly became technically expensive and difficult.
Music is best operated on in the domain of notes and durations. That representation can be manipulated in musical terms. It also happens to be a much more compact representation. While you can do some musical tweaks in the audio domain, it’s much more difficult. Once you enter the audio domain, something as simple as playing the B section before the A section becomes technically difficult. You can’t simply seek into the middle of your audio clip: the tails of the preceding material are baked into the recording at that point, so you’ll hear remnants of the A section as well. You have to work around these limitations by generating more and more pieces of rendered content: separate clips for the A and B sections so you can start with either and then stitch them together on the fly.
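As a minimal sketch of that workaround (the clip names and durations are hypothetical), each section is exported as its own clip so playback can begin anywhere and sections can be stitched in any order:

```python
# Hypothetical pre-rendered clips, one per musical section (durations in seconds).
clips = {"A": 16.0, "B": 12.0, "outro": 4.0}

def schedule(order, start_time=0.0):
    """Stitch clips back-to-back, returning (clip, start_time) pairs."""
    events, t = [], start_time
    for name in order:
        events.append((name, t))
        t += clips[name]
    return events

# Start from the B section directly -- impossible with a single rendered file.
print(schedule(["B", "A", "outro"]))
# [('B', 0.0), ('A', 12.0), ('outro', 28.0)]
```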
Sonic 3D Blast (1996). The same game was released on both Sega Genesis (cartridge) and Sega Saturn (CD), with entirely different scores by different composers. The Genesis version uses the same music speed-up trick from the earlier games. Due to the technical limitations of using audio instead of synthesis, the Saturn version instead transitions to an entirely new track while running fast.
In the first wave, music was represented in musical terms. Interactivity was continuous: tempos could be adjusted, music could be transposed, harmonized or restructured on the fly.
In the second wave, interactivity became discrete. Music was represented as audio. What we gained in quality, from recordings to evocative sound design, we lost in flexibility. The primary tools of interactivity became transitions and layers, i.e. horizontal or vertical interactivity. These technological limitations greatly constrain what’s possible for composers in scoring a scene. Action scenes might feature a drum layer that fades in to build intensity. Pads and drones become a necessity to allow easy transitions at any point to a subsequent section. Sync with the gameplay is generally painted in broad brushstrokes and fairly loose, for a number of reasons we’ll discuss later.
I believe we are on the cusp of a new, third wave of video game music. Increased drive speed and capacity, more processing power and RAM, better tooling for music creators and better virtual instruments all open new technological frontiers. Our limitations right now are due not only to a lack of novel technology but also to a lack of imagination. Both composers and game developers are constrained by their tools: middleware like Wwise and FMOD is designed with audio first and foremost. While these tools provide music functionality and are used for it, their main strength is their real-time audio engine and editing suite for interactive sound design.
Most composers don’t know that they could be asking for more. They’re too busy working within the constraints of what is available to them, and they are doing fantastic things within those constraints that hint at future possibilities if the technology could empower them. Nor can they dream up new features and ask the game developers to implement them: the scope is far too large, too domain specific and too complicated to manage in the context of a single game. Moreover, few game studios even have a dedicated audio programmer on staff, and that engineer is not necessarily experienced in the audio domain; it might be their responsibility simply because someone needs to do it. Music is a further complex specialization within audio. Lastly, because composers are almost always independent contractors rather than full-time staff, there is little incentive for, or dialog around, creating large-scale, technically involved musical systems.
Our goal should be to elevate the audio quality and cinematic nuance of the second wave while bringing back and enhancing the musical flexibility of the first. Some are working towards AI-generated scores, which may be viable far in the future (a fourth wave?). For now, this approach loses the artistic intentionality and nuance of the composer. Others are prioritizing a full return to synthesis with real-time sampling, incorporating all our advances in music creation tools from the past two decades. This approach may yield interesting results but would be a regression on the quality front, regardless of how good the sampling and performance AI techniques are (more on that later).
We need a hybrid approach, one that combines both recordings and synthesis. We also need an entirely new suite of tools and techniques, spanning the game engine, the music middleware and the composer’s music tools, from DAWs to virtual instruments. We need to think outside the box to solve some of these problems, because they are very much unsolved, or even physically impossible, in the general case.
My goal is to bring the detail-oriented world of film scoring into a fully interactive context. Let’s step back for a moment and ask why these things are all important.
Being intentional and acknowledging specific moments musically is hard. It’s magnitudes harder in an interactive context. Certain things are not possible without a full-featured audio middleware. Current middlewares are still insufficient to solve certain problems and need coding support in-engine. Composers and game designers need tools that allow them to target their score as accurately and seamlessly as possible.
My general thesis is that the more specificity we can create, the more impactful interactive scores can be. At one far end, we have the interactivity equivalent of Mickey Mousing: tying elements of the musical score directly to game actions like jumping. Opposite that, we have drones and broad-brush textures that play throughout a scene, unresponsive to what’s going on in the gameplay except to set the tone and mood. Neither approach is inherently good or bad—they’re simply design decisions and may be appropriate depending on the game or scene.
Most of the time, the effective solutions will probably lie somewhere in the middle: music that acknowledges specific moments with tight synchronization while creating a seamless tapestry contextualizing that.
What follows is a discussion of some of the problems facing any such approach that hopes to achieve that level of specificity.
Problems with Current Practices
- Audio latency comes in many forms as a byproduct of digital signal processing algorithms, buffering in audio I/O devices and internal pipelining. Audio latency tends to be on the order of tens of milliseconds.
- Music systems add musical latency to create more musical transitions when crossfading between different elements or inserting transitional material. Musical latency is often on the order of seconds.
Latency is one of the single largest challenges in interactive music. Generally speaking, latency is the time between a system receiving some input and outputting new data based on it. In interactive music systems, there are many different types of latency. For those familiar with digital audio, latency often refers to the general time delay of the audio system, e.g. the time between pressing a note on your keyboard and the system generating a new buffer of audio and outputting it to your speakers (perhaps ~1-10 ms). Some real-time audio processing algorithms add additional latency so they can effectively look ahead before producing their results. A limiter, for example, might delay the signal by a few milliseconds to analyze the incoming signal’s volume before deciding how to compress it.
Interactive music systems often add latency to delay changes until certain musical boundaries. If the music is transitioning from track A to track B, an immediate transition would likely be very jarring. It breaks the temporal fabric of the music, even if the two tracks have the same tempo and key. To avoid this, transitions are often designed to wait until certain points, e.g. the downbeat of a measure or the arrival of a beat, to start the next track. This delay between the input (request to play B) and the output (playing B) is a sort of musical latency.
Sometimes there is additional latency in the music itself. This could be something like one or two bars of musical material to further smooth that transition from A to B, or it could be an intro to track B or an outro to track A. These “transition fragments” are small interludes of music that play for musical cohesion but delay the desired output.
Latency is a problem because it causes loose synchronization: it decouples the user or in-game action from the musical result. Too much latency and the listener might not even feel the musical change occurred because of the triggering event. Imagine if sound effects had this problem. The player would fire a laser, yet you wouldn’t hear a response for a second or two. It would sound like a mistake. Music is of course more abstract and synchronized changes are relatively infrequent. Nevertheless, we rely on those precise musical shifts in traditional media regularly, where being off by a second or even a couple frames can significantly alter how the scene is perceived and emotionally received. Being out of sync can be problematic enough to turn a dramatic scene into comedy.
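To make the scale of musical latency concrete, here is a small sketch (not any particular middleware’s API) computing how long a quantized transition must wait:

```python
def musical_latency(request_time, tempo_bpm, beats_per_bar, grid="bar"):
    """Seconds until the next allowed transition point after request_time.

    grid="bar" quantizes transitions to downbeats, grid="beat" to any beat.
    """
    beat = 60.0 / tempo_bpm
    period = beat * beats_per_bar if grid == "bar" else beat
    return (period - request_time % period) % period

# At 120 BPM in 4/4, a bar lasts 2 seconds. A request 0.75 s into the bar
# waits 1.25 s for the next downbeat, but only 0.25 s for the next beat.
print(musical_latency(0.75, 120, 4, grid="bar"))   # 1.25
print(musical_latency(0.75, 120, 4, grid="beat"))  # 0.25
```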
- Musical latency is high so achieving tight synchronization is difficult.
- User interactivity makes perfect synchronization impossible, even in contexts like cutscenes with limited interactivity. Design concessions must be made across the gameplay or music system.
In traditional, non-interactive media scoring, composers will often synchronize musical changes down to the level of individual frames. Large musical shifts in tone and feeling will align with cuts in the picture or a pivotal moment for a character. The music might be somber as the hero’s getting pummeled and on the brink of defeat. But when help arrives unexpectedly, the music takes a dramatic turn towards hopeful and heroic! Perhaps another hero comes swooping in and their musical theme signals their triumphant return.
In media scoring, we tend to talk about sync in terms of lining up a specific musical moment with a particular time in the fixed video. This concept translates well to interactive music for games. The question then becomes what interactive element cues the music so it can align with the desired point. From here on out, I’ll refer to these moments in both the visual and the music as sync points. We want to make sure that the sync point in the music plays at precisely the same time as the sync point in the visual, or at least within some minimum acceptable threshold.
The real-time nature of interactive music makes frame level sync an unsolvable problem in the general case. Due primarily to musical latency, the synchronization with gameplay tends to be loose, on the granularity of a measure. Musical latency adds a large amount of time—potentially seconds or longer—before the desired musical synchronization point arrives. On top of that, since music affects the player’s perception and emotional state, it can also affect their choices and actions. Music building to a climactic finish could encourage a player to follow through or it could even cause the player to pull back.
Even in cutscenes—the closest a game ever gets to film or TV—musical sync is a non-trivial problem. Cutscenes are often entered via user interaction, so steps must be taken to transition out of any existing music and into the cutscene music ensuring the sync to the animations is correct. Any sort of performance hiccup in either the visual or audio engine (perhaps caused by running on an under-powered or over-burdened computer) could push the music out of alignment. On top of that, many cutscenes feature interactive elements that can change the timing of the visuals, whether it’s dialog continuation or the ability to pause the game. The music and music systems must be designed to accommodate those interactions to preserve the desired level of sync.
All these limitations around sync have been treated as locked-in facts. As a result, composers and audio teams mostly write music and design music systems around loose synchronization. Music tends towards fairly broad-brush tone setting. It tends not to acknowledge individual moments or actions the way film and TV scores commonly do. There are some rare exceptions, especially in the case of generative music driven directly by gameplay.
- Variety often means hand-curating a large amount of content.
- Musical variety traditionally means switching between discrete chunks of content. Continuous variation or altering existing music on the fly is mostly limited to things like fading in layers.
In traditional media scores, music cues have a limited, fixed duration. Game music cues, by contrast, may continue for anywhere from a few seconds to 30 minutes or more, depending on usage and context. Unlike a film, most games involve revisiting areas, whether that’s a menu screen or a hometown. As a result, music in games tends to loop to accommodate the indeterminate duration of the “scene.” Looping music is not a panacea, though. It can be grating for players to hear the same music over and over.
Looping and reprising the same music are both missed opportunities to score the player’s experience. As your character goes on their journey, the world is changing around them. When the character returns home, they are no longer the same character that went out into the world. The music should reflect that in some way, whether it’s a permanent shift or temporary based on their last quest. Films rarely reprise the exact same music twice for this reason. Even if the main theme appears, it will almost certainly be recontextualized to reflect the emotional and narrative journey.
This sounds great on paper, but it’s nearly impossible to achieve in practice. In film and television, the plot is fixed and the composer can sculpt the musical narrative. In games, however, we only have a vague idea of what a player will do. When the possible variations are simple and well-defined, composers and designers can create multiple pieces of music or variations to fulfill this goal. Perhaps a town can be visited during the day or night, so there are two different versions of the music depending on the time of day. This becomes more complicated if time actually passes in the game, so the player could be standing there as it transitions from day to night.
Another simple example is competitive games like team-based first-person shooters. If your team is losing, the music could sound more dire; if your team is winning, more optimistic and heroic. The question then becomes: how much granularity do you need? Do you need “it’s neck and neck” music? What about complete domination?
These approaches, while effective, can quickly lead to a huge amount of additional content to accommodate the design needs. This musical content needs to be generated ahead of time; it’s discrete and pre-planned. What does it mean to be 20% in the lead, and how does the music change if your lead grows or if it shrinks?
Some games add continuous change on top of the fixed, discrete content via synchronized layers that fade in or out depending on the game state. Perhaps if you’re near death, intense drums start fading in on top of the existing track. While the layer volume is a continuous parameter, it’s still a discrete piece of content. Generally speaking, the volume of that track doesn’t substantially alter the perceived feeling except between the far ends of audible and inaudible.
Other music systems might rearrange their subsections horizontally to create variety, e.g. a simple playback system that has ten different one measure fragments that can be played back-to-back in any random order.
- Jumping between or even within tracks in the audio domain makes achieving musical transitions challenging.
Transitioning between two pieces of music or subsections of the same piece is yet another thorny problem. In the traditional media landscape, composers know what music comes before and after and can treat that handover with care for the incoming and outgoing keys, tempos, timing, etc. In interactive music, those transitions between music cues are much more varied. Firstly, the player’s action could trigger the transition at any point during your looping track. Further, this transition could move to multiple possible destination tracks as well! Imagine the player can walk into any of three doors and it will change the music depending on that choice. Even looping can be viewed as a specific type of musical transition.
In the audio domain, the best tools we have at our disposal are a combination of crossfades and hand-written transition music to bridge the gap between disparate sections. To avoid very high latency (e.g. only transitioning when the music reaches the end of its loop) and loose synchronization, there are often lots of possible moments where the music can transition, temporally quantized to metric boundaries of some sort. A transition that arrives at a downbeat in the new music where a downbeat in the old music would have occurred will be much smoother than if the new music were to start immediately.
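In practice, those allowed transition points might be authored per track. A sketch of the lookup, with hypothetical marker times:

```python
import bisect

# Hypothetical composer-marked transition points (in seconds) for one track,
# placed on musically sensible boundaries rather than on every beat.
TRANSITION_POINTS = [0.0, 4.0, 8.0, 10.0, 16.0, 24.0]
LOOP_LENGTH = 32.0

def next_transition(playhead, points=TRANSITION_POINTS):
    """Earliest allowed transition at or after the playhead, wrapping at the loop."""
    i = bisect.bisect_left(points, playhead)
    if i < len(points):
        return points[i]
    return points[0] + LOOP_LENGTH  # past the last marker: wait for the loop

print(next_transition(9.2))   # 10.0
print(next_transition(25.0))  # 32.0
```

The spacing of the markers directly sets the worst-case musical latency: the gap between 16.0 and 24.0 means a request at 16.1 waits almost eight seconds.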
Since we have multiple transition points per track, we need to audit each of those transitions. For music that only has a few different possible harmonic structures and is transitioning to a piece in a similar harmonic space, this may well be straightforward and not require too much additional content. As soon as there are multiple possible destinations, though, things become much more complicated to account for on all fronts: musical system design, composition and implementation.
Combinatorial Explosion of Content
- Transitioning from one of M source tracks to one of N destination tracks, with P transition points per source track requires auditing M×N×P total transitions.
- One way to reduce latency is to increase the number of transition points, creating substantially more content.
- The tooling and workflows aren’t good for either creating or auditing transitions.
The common solution to many aspects of the above problems is more content. Low latency transitions? Write lots of custom transitions. Tight synchronization? Write lots of tiny interstitials. More variety? Write more tracks, longer and with more layers.
As the complexity of these interactive systems increases, we run into headwinds on multiple fronts. The thought of writing, recording, editing, mixing and implementing hundreds of one- or two-measure transition fragments sounds tedious, repetitive, time consuming and expensive. The composer’s DAW is simply not designed with interactive music in mind, so something as simple as auditioning transitions is an involved task: the composer must manually recreate each transition, or an approximation thereof. Simply conceptualizing each of those fragments and where they need to occur sounds mind-numbing.
Any given transition is straightforward enough, but the sum endeavor quickly cascades out of scope. Suppose for example we have a single music track that could change to a second track at any moment. Our initial track has a fixed tempo and harmonic home with a similar feel throughout, but roughly 10 different chords or unique spots that would need individually crafted transitions to be musical. Next, suppose instead we will transition to one of two possible tracks. Those carefully crafted transition fragments likely won’t work for this other track as well unless the composer severely constrained their approach to accommodate that. Now, the composer must write 20 transitions. Transition to one of three possible destination tracks? 30 transitions.
It gets even worse if there are multiple variations you’re transitioning from. Suppose you have a “night” variation of the source track. Now you’re transitioning from one of two possible tracks to another track: again 20 transitions. Transitioning from one of two possible tracks to one of three other tracks? 60 transitions. If there are multiple layers, you’ll need to account for each of those individually as well. And ten transitions per track may itself be optimistic, assuming transitions only on measure boundaries. The more dynamic a track is and the tighter the synchronization needed, the more transition points will be required.
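The M×N×P arithmetic above is worth spelling out:

```python
def transitions_to_audit(sources, destinations, points_per_source):
    """Hand-crafted transitions to write and audit: M x N x P."""
    return sources * destinations * points_per_source

print(transitions_to_audit(1, 1, 10))  # one track to one track: 10
print(transitions_to_audit(1, 3, 10))  # three destinations: 30
print(transitions_to_audit(2, 3, 10))  # plus a night variation: 60
print(transitions_to_audit(2, 3, 20))  # tighter sync, more points: 120
```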
Suppose instead you want to score a multiplayer Star Wars game where leitmotifs enter when a player does something cool on screen. An X-Wing swoops in to blow up a TIE Fighter, so we get a musical flourish for the rebels. We can do this either by creating a musical fragment that briefly interrupts the current track or by layering something on top of the existing track. In either case, we’ll need a number of different variations depending on where in the source track the moment occurs. For the former, we need a bunch of different transitions to and from the fragment, depending on the complexity of the source material. For the latter, we also need a bunch of variations: a melody that works over one harmony or feel may not work musically when placed elsewhere.
The sheer amount of additional content required to glue things together quickly explodes into an impossible mountain of content. Composers’ tools aren’t built to create or audition these variations and transition fragments. On top of that, there may simply be too many possible configurations to reasonably consider and address.
Brief Tangent/Rant: Generative Music via AI
Discussions about interactive music almost always seem to turn to vague, hand-wavy statements about how AI will solve these problems in a fully adaptive manner, coupled with some sort of real-time synthesis engine to model performances and generate the audio. Create sliders for emotions and you can set the music to be 20% sad and 80% heroic! I would argue the whole conception is ill-formed from the start. But taking it at face value for a moment, while this may be viable a decade or two from now at the quality of contemporary scores, there are fundamental issues with this approach worth discussing.
Supposing we had the ability to generate the underlying music, the task of actually rendering that score into audio would be a massive undertaking. Current composer setups—top of the line desktops with hundreds of gigabytes of RAM and networked coprocessors—struggle to do this task on a regular basis. Doing that same amount of work on a consumer PC or a console is simply unrealistic. Doing the processing in The Cloud™ might work but would be incredibly complex and financially costly. It would also add latency when one of our primary goals is to reduce latency.
What the AI is actually doing under the hood, how it does that and what the composer has control over is fairly ill-defined. What do we even mean by AI? It seems to have become somewhat synonymous with deep learning and artificial neural networks over the past decade. It might mean something like providing some sort of seed material, a genre or instrumentation, and defining which parts are interactive and what that means (c.f. AIVA).
While it sounds like a good approach at first blush, I don’t find it compelling in the general case. You sacrifice too much intentionality and specificity when the music system becomes a black box. The best contemporary AI systems are built around generalities and pastiche. They’re fundamentally imitation engines, regurgitating their best mashup of the datasets they were trained on.
AI certainly has an important role to play in these music systems, but not at the scope of generating the entire score. Its scope will be more narrowly targeted at the level of specific tools.
The Hybrid Approach
Our problems fall into two overarching categories: sync and musical flexibility.
Even if we had perfect synthesis that could replicate the quality of recorded, performed and mixed audio, we would still need to solve the sync problem. How do we handle musical latency and tight sync in a musical way? I propose two solutions: sync the music to the game and/or sync the game to the music. The former we can achieve by doing some clever music editing on the fly driven by continuous gameplay predictions. For the latter, we can tweak gameplay and animations to align with sync points in the music.
Both sync approaches are useful in different scenarios. Syncing the music is more generally useful but it’s musically non-deterministic. Syncing the game is especially useful around tightly scripted moments without user control like cutscenes. Both can be used together as well.
Given that synthesis isn’t perfect, we want to preserve the musical flexibility of synthesis but with the quality of recorded audio. I propose a number of new paradigms for manipulating our audio in a musical way, combined with new tooling to enable those approaches.
Tightly Synchronizing the Music to the Game
Music Editor in a Box
Musical latency is necessary to prevent awkward musical conjunctions. That typically means quantizing transition points to metric boundaries like the downbeat of a measure. If we could instead transition on an arbitrary beat within a measure, we’ve reduced our latency from up to a measure to up to a single beat. If we could do a fraction of a beat, even better!
Traditional media composers and music editors actually do this sort of thing on a regular basis. In their quest to align with specific synchronization points or to adjust existing music to fit a new edit of a film, composers will shorten or elongate bars, resulting in somewhat odd time signatures in the midst of an otherwise steady click. Rather than simply waiting for the next possible transition, we should attempt to shorten that duration by music editing on the fly, with constraints on what precisely that means.
A potential approach could be borrowed from the image processing domain. Seam carving is a content-aware image resizing algorithm that preserves important content by removing a contiguous squiggle of pixels with the least “energy.” Finding a musical equivalent, an algorithm that removes subdivisions of a phrase while preserving its fundamental character, should be possible. In the worst case, this can be a composer-defined plan for how to shorten or extend phrases on a track-by-track basis.
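A sketch of what this “musical seam carving” might look like at the beat level, assuming each beat has been assigned an energy score, whether by analysis or by the composer:

```python
def shorten_phrase(beat_energies, beats_to_remove):
    """Drop the lowest-'energy' beats, keeping the rest in order.

    The energy scores might come from analysis or a composer-authored plan.
    Returns the indices of the beats that survive the cut.
    """
    removable = sorted(range(len(beat_energies)),
                       key=lambda i: beat_energies[i])[:beats_to_remove]
    cut = set(removable)
    return [i for i in range(len(beat_energies)) if i not in cut]

# A two-bar phrase in 4/4: downbeats carry the most weight, offbeats the least.
energies = [9, 2, 5, 1, 8, 3, 6, 2]
print(shorten_phrase(energies, 2))  # [0, 2, 4, 5, 6, 7]
```

Unlike true seam carving, this greedy version ignores contiguity; a real system would also need rules about which beats can neighbor each other after the cut.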
Predictive Game Engine
The more advance notice we get that a transition is coming, the more we can prepare. If the engine can give us a prediction of when certain game events will occur, we can optimistically start the transition early. We can use tools like our music editor in a box to adjust measure durations, either by removing a few beats here and there or by inserting beats.
These predictions need to be either fairly conservative, so we don’t get many false positives, or the transitions themselves need to be cancellable. In the latter case, we want an aborted transition to be as musically seamless as possible. If the player is about to deliver the killing blow, perhaps the music starts swelling because the engine has predicted they’re about to succeed. If the player suddenly dies instead, the music needs a way to abort the transition into some other musical moment.
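A minimal sketch of such an optimistic, cancellable transition, assuming the engine attaches a probability to each prediction (the class and threshold are illustrative):

```python
class OptimisticTransition:
    """Sketch of a cancellable, optimistically-started musical transition."""

    CONFIDENCE = 0.8  # act only on conservative, high-probability predictions

    def __init__(self):
        self.state = "idle"

    def on_prediction(self, probability):
        # The engine predicts the sync event is likely: start preparing early.
        if self.state == "idle" and probability >= self.CONFIDENCE:
            self.state = "preparing"

    def on_event(self, happened):
        if self.state != "preparing":
            return "no-op"
        self.state = "idle"
        # Prediction came true: land the sync point. Otherwise bail out through
        # an escape fragment that resolves as seamlessly as possible.
        return "commit" if happened else "abort"

t = OptimisticTransition()
t.on_prediction(0.9)     # the engine thinks the killing blow is imminent
print(t.on_event(True))  # commit
```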
Tightly Synchronizing the Game to the Music
Most of the time, our goal is to synchronize the music to the game. At times, it’s easier to adapt the game subtly to align with the music. Musical latency is a locked-in fact. We can take steps to reduce it but it will always be there in some small amount. If we can measure how much time until we hit our musical sync point and the same for our visual sync point, we can try to adjust the game to bring them closer together.
The first component we need is a way to look ahead in the interactive music system to figure out how far that sync point actually is. This is actually non-trivial: it depends on simulating the music playback with all its transitions faster than real-time, often based on a hypothetical game state.
Once we have the ability to determine the time until the sync point in both the music and the gameplay, we can design the game and its animations to achieve tighter sync. We can speed up or slow down animations in small ways, insert or remove small delays, etc. to make the visuals take more or less time to get to the same place.
It’s like the music editor in a box, but in reverse: a picture editor in a box. Combine the two with the predictive engine and lookahead, and you have some effective tools for elastically adapting both the music and gameplay for tight synchronization.
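A sketch of the picture-editor side, assuming we can query both remaining times; the clamping threshold is an illustrative guess at what stays imperceptible:

```python
def animation_scale(visual_time_left, music_time_left, max_adjust=0.1):
    """Playback-rate multiplier that nudges the visuals toward the musical
    sync point, clamped so the change (hopefully) stays imperceptible."""
    if music_time_left <= 0:
        return 1.0
    scale = visual_time_left / music_time_left
    return max(1.0 - max_adjust, min(1.0 + max_adjust, scale))

# The music reaches its sync point in 2.0 s, the animation in 1.9 s:
# play the animation 5% slower so both land together.
print(animation_scale(1.9, 2.0))  # 0.95
# Too far apart to fix invisibly: clamp to the maximum 10% adjustment.
print(animation_scale(1.0, 2.0))
```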
Musical Flexibility in the Audio Domain
Composing with Systems: A Top Down Approach
One of the primary features of the hybrid approach is moving the musical system design more directly into the composer’s process. The composer knows their music most intimately. They’re often already implicitly designing their music around the systems required for the game, keeping in mind structure and accounting for transitions, layers and other interactive elements.
If you were to ask a composer about the construction of a piece of interactive music, I suspect they could elaborate at length on the design process behind the piece, its interactive aspects and how they structured the music to accommodate those design goals. Abstracting music to a systems-level view allows us to encode flexibility in musical terms.
In many ways, even something as seemingly fundamental as sheet music is a system. The sheet music isn’t the music itself—it’s a set of instructions for how human players should create a version of the music. Each time the players perform it, the music that comes out is different because the rules system allows for a certain amount of inherent variation (and humans aren’t robots). Another analogy would be to the graphical notation of experimental music scores and other forms of indeterminate music from the mid-20th century. Or even something like the musical dice games of the 18th century.
In the example of Sonic jumping on the fast shoes, it’s a single piece of music with a system that plays back the piece at different speeds. Not only should the composer be aware of those two different speeds, they need to make sure the music makes sense at both. Ideally though, you wouldn’t simply duplicate the two sequences in your DAW: you would write the piece once and then describe the interactive element as a change of tempo. Once we do it that way, we can ask our software to generate both variants. If you change a note in one, it will affect the other and vice versa.
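A toy sketch of the write-once idea: author the notes in beats, and let the tool derive each tempo variant at render time. The riff and tempo values are invented; a real tool would of course drive a DAW render rather than compute timestamps:

```python
def render_times(note_beats: list, bpm: float) -> list:
    """Convert beat positions (the single canonical source) into
    seconds at a given tempo."""
    return [beat * 60.0 / bpm for beat in note_beats]

riff = [0.0, 0.5, 1.0, 1.5]          # one canonical sequence, in beats
normal = render_times(riff, 120.0)   # regular gameplay tempo
fast = render_times(riff, 150.0)     # Sonic grabbed the fast shoes
```

Editing `riff` changes both renders at once, which is exactly the property we want from a systems-level description.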
If we can include a systems-level view in composing tools, we can achieve all sorts of impressive results. Suppose you had written a track and you wanted to vary the instrumentation or other musical parameters based on a bunch of input variables. If it’s a simple matter of additional layers that fade in, like drums, it’s easy enough. But suppose there are multiple variables driving this system, including randomness: it gets dramatically more complicated, with a lot more material to generate. Once we’ve abstracted our piece into a systems-level design, we can ask our tool to generate all the material we need for us and tell the middleware how to assemble it all.
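The combinatorial bookkeeping is exactly the kind of thing a tool should own. As a sketch, with invented variable names, enumerating every render the system implies is just a Cartesian product that a batch-bounce job could then walk:

```python
import itertools

def enumerate_variants(variables: dict) -> list:
    """Expand every combination of interactive variables into a
    list of render descriptions for the DAW to bounce."""
    names = list(variables)
    return [dict(zip(names, combo))
            for combo in itertools.product(*(variables[n] for n in names))]

renders = enumerate_variants({
    "intensity": ["low", "high"],
    "ensemble": ["strings", "brass", "full"],
})
# 2 x 3 = 6 renders to bounce and hand off to the middleware
```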
The above systems are what I would consider top-down systems. They are rules systems for interactivity applied on top of an existing piece of music. Perhaps the music was designed with the system in mind but there’s a fairly specific concept for a piece of music. They’re manipulating existing material in some way to achieve flexibility.
You can also imagine a use case for bottom-up systems that are more generative in nature. Bottom-up systems are more fine-grained. Rather than describing exactly which notes to play, these systems describe how to write the notes. Instead of writing a specific ostinato motor, for example, you could instead define it in the abstract: describe a metric pattern, the shape and the pitch content, and let a note generator fill in the rest of the details. You could combine a number of similar abstract systems to create a piece of music based on your rule set. There’s no longer a canonical version of the track—it’s been reduced to a set of rules- and constraints-based systems.
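A minimal sketch of such a generator, with an invented representation: a metric pattern plus an allowed pitch set stands in for the ostinato, and a seeded random choice fills in the details so a given variant stays reproducible:

```python
import random

def generate_ostinato(pattern: list, pitches: list, seed: int) -> list:
    """pattern: 1 = play a note, 0 = rest, one entry per 16th-note slot.
    pitches: allowed MIDI note numbers (the pitch content).
    Returns a MIDI pitch or None per slot; seeding makes it repeatable."""
    rng = random.Random(seed)
    return [rng.choice(pitches) if hit else None for hit in pattern]

# E minor pitch content over a driving 8-slot motor pattern
line = generate_ostinato([1, 0, 1, 1, 0, 1, 1, 0], [52, 55, 59, 64], seed=7)
```

A real system would also encode the “shape” (contour constraints, voice-leading rules) rather than choosing pitches independently, but the principle of rules in, notes out is the same.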
Bottom-up systems are useful tools. Many composers use rules-based systems as part of their compositional process so it seems intuitive to expose them in an interactive context so that the music engine can take advantage of their indeterminism. There are a number of potential pitfalls, however.
First, there’s a real danger that intentionality will suffer. This is a fairly vague idea but there’s a risk that such systems will lose creative direction and land in a musical Uncanny Valley: things that clearly sound like music but don’t make coherent “musical sense.” The best solution here is to provide as much musical structure as possible. The more such systems are driven by the game, the more forgiving our musical ears will be. Consider Carl Stalling’s brilliant music for Looney Tunes. On its own, the music can be rather wacky and nonsensical. But combined with the visuals, everything clicks.
Second, real-time content generation is hard. The more work we are doing at the level of individual notes, the harder it will be to keep the audio quality high, let alone manage on-device performance. We can work around these issues by limiting the scope of the tools and applying them judiciously.
Since we’re working with audio data, we’re going to need to use crossfades to stitch things together. Tweaking crossfades is a pain, though, and in a truly dynamic context, we won’t necessarily have a way to audition them in advance. Therefore, we need to make a sort of musical crossfade that’s aware of the content of the stems, perhaps combined with MIDI sequence data. This way, we can create more effective crossfades than simple equal-power curves and the like can achieve. We can determine whether we just need to let a ring-out tail through or whether we’re actually moving between two pieces of musical material, and how best to do that. In addition to being necessary for dynamic stitching, similar tools could save tremendous time editing deterministic material.
Detailed stem breakouts will allow for more elegant transitions. The more stems, the more graceful our music edits and dovetailing can be. Any transition system built on top of these musical crossfades should coordinate across the various stems based on their content. It should account for a melodic line carrying over the seam just as well as the low percussion that ends with a single hit and its reverb tail. This is a standard task for a music editor and it’s worth encoding a similar style of judgement and craft.
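As a sketch of that per-stem judgement, assume each stem carries a couple of (hypothetical) MIDI-derived facts about the material at the cut; the seam planner then picks a strategy stem by stem rather than applying one global crossfade:

```python
from dataclasses import dataclass

@dataclass
class StemInfo:
    name: str
    active_at_seam: bool   # is musical material sounding at the cut?
    ends_with_hit: bool    # e.g. a final percussion hit with a reverb tail

def seam_strategy(stem: StemInfo) -> str:
    """Choose a transition treatment for one stem based on its content."""
    if not stem.active_at_seam:
        return "hard_cut"      # nothing sounding: a plain cut is invisible
    if stem.ends_with_hit:
        return "ring_out"      # let the hit's tail decay over the seam
    return "crossfade"         # sustained material: overlap and fade

plan = {s.name: seam_strategy(s) for s in [
    StemInfo("melody", active_at_seam=True, ends_with_hit=False),
    StemInfo("low_perc", active_at_seam=True, ends_with_hit=True),
    StemInfo("pad", active_at_seam=False, ends_with_hit=False),
]}
```

Real content analysis would be far richer (phrase boundaries, sustain vs. transient energy, reverb-tail length), but the output is the same kind of per-stem plan a music editor would make by hand.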
Baking Dynamic Music
If one of the problems is that we can’t do full real-time synthesis, perhaps we can make do with generating certain material in advance and rendering it out for use in-engine. We can automate this creation in the DAW inside the composer’s template, so it will be using any recorded audio, virtual instruments and effects present. We’ll borrow the term “baking” from the graphics & game world, where it’s used to describe precomputing and caching expensive calculations like lighting and shadows.
Creating hundreds of tiny transition fragments is tedious, time-consuming and difficult. We can’t make these transitions in-engine because we might not have the proper assets to build them dynamically.
Given a set of transition points and destinations, we should be able to generate a bunch of short transition fragments. This would use basic AI techniques alongside constraints and design goals provided by the composer. In addition, it may be acceptable to slow down or speed up, depending on the source and destination tracks.
Next, the tool can automate the export and stitching together of these fragments in-engine. This might mean bouncing out whole transitions from the DAW to be used as-is. That way, the composer could do some sanity checks on the AI-generated transitions. Or perhaps it would just export some key building blocks that can be assembled dynamically in-engine.
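The planning half of that pipeline can be sketched simply: pair every source exit with every destination entry whose tempo is close enough to stretch between, and hand one bake job per pair to the DAW. The exit/entry names, tempos and the stretch tolerance below are all illustrative values:

```python
def plan_transitions(exits: list, entries: list,
                     max_tempo_ratio: float = 1.15) -> list:
    """exits/entries: (label, bpm) pairs. Returns (source, destination)
    pairs whose tempo gap is small enough to bridge by stretching."""
    plans = []
    for src, src_bpm in exits:
        for dst, dst_bpm in entries:
            ratio = max(src_bpm, dst_bpm) / min(src_bpm, dst_bpm)
            if ratio <= max_tempo_ratio:
                plans.append((src, dst))  # one baked fragment per pair
    return plans

jobs = plan_transitions(
    [("explore_bar17", 92.0), ("explore_bar33", 92.0)],
    [("combat_intro", 100.0), ("combat_intro_fast", 140.0)],
)
```

Here the 140 bpm entry is rejected as too far from 92 bpm to stretch; those seams would need composed transition material instead.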
Baked Motifs & Layers
In cases where musical material might be overlaid rather than a transition, we can do something similar. Suppose we want a bunch of thematic fragments to be able to appear at any point within a certain section. We can create tools for generating all those variants from the underlying audio and MIDI representation and render them out. The composer may need to specify constraints for those variations, e.g. altering the notes via transposition or to follow the harmonic structure.
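A tiny sketch of the transposition case: write the motif once relative to a root, then bake one copy per chord in the section so a stinger can land anywhere and still fit the harmony. The motif and progression values are invented:

```python
def bake_motif_variants(motif: list, chord_roots: list) -> dict:
    """motif: MIDI pitches written relative to a root of 0.
    Returns one transposed copy per chord root in the progression,
    keyed by root, ready to render out as baked audio."""
    return {root: [p + root for p in motif] for root in chord_roots}

# a fanfare fragment (root, third, fifth, octave) over a I-IV-V in C
variants = bake_motif_variants([0, 4, 7, 12], [60, 65, 67])
```

Following the harmonic structure properly would mean mapping scale degrees rather than transposing chromatically, but the baking workflow is identical either way.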
Or suppose we want a whole bunch of possible variations on instrumentation, dynamics, or processing. We can automate the generation of all those variants and then communicate with the middleware side to ingest them. This sort of automation makes increasingly complex workflows viable.
Real-time Note Sampler
Given that we have high-quality samplers and synthesizers in DAWs, there’s no reason we can’t do some amount of synthesis in real time. We shouldn’t ask the engine to render the entire track, but there’s room to render specific elements that we expect to be especially dynamic or reactive.
Typical samplers used in DAWs like Kontakt, PLAY et al. are far too general purpose and unpredictable. They’re not optimized enough for this use case and someone would need to port them to a number of platforms to be able to run on consoles.
The best approach would be a custom sample engine designed for pseudo-real-time use in game engines. This should be combined with tooling to re-sample and generate a new instrument in the composer’s DAW to capture any effects in their mix chain as well.
Even in the case of synthesizers, it may be better to sample them in the DAW rather than embed synthesis engines. Composers rarely use instruments—synths especially—without modifying their sound via additional signal processing and effects of some sort. Even if you directly ported a composer’s favorite synthesizer instrument, you’d still need to port the entire FX chain to achieve the same sound in-engine. Alternatively, we can trade the CPU cost and code complexity for disk and RAM usage by sampling the instrument. Sampling allows us to fully capture the sound as it exists in the DAW.
The next question is what kind of content to feed such a sampler. AI-based systems could come into play here, using some combination of prewritten material as a seed for an algorithm that would generate off of it and based on the surrounding material. The impressive work of Sony CSL, Google’s Magenta, AIVA and Amper may be relevant here.
Real-time Phrase Sampler
The phrase sampler is a sampler of sorts working on large timescales so that individual musical phrases can be rearranged on the fly. This could be used in conjunction with the baked material and musical crossfades to dynamically stitch together all the elements we generated.
Another use case would be a sort of phrase based resampler. This technique was used to great effect in Rise of the Tomb Raider with a music system designed by Intelligent Music Systems to resample drums and rearrange them on the fly. This has the benefit of preserving the audio quality while providing variability.
Real-time restructuring is an easy, effective replacement for traditional looping. Instead of looping at the end of the piece, we define a number of subsections that can be reordered within the piece. We can create a policy for what rearrangements are acceptable. Certain sections might not be allowed next to each other, for example, or we may want a round-robin approach where every segment must be played at least once before any can repeat.
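Such a policy is small enough to sketch directly. Here, assuming invented section names, forbidden adjacencies and a round-robin rule are combined into a single next-section chooser:

```python
import random

def next_section(history: list, sections: list,
                 forbidden_after: dict, rng: random.Random) -> str:
    """Pick the next subsection to play: prefer sections not yet heard
    (round robin), never repeat the last one, and respect the
    forbidden-adjacency table."""
    last = history[-1]
    unplayed = [s for s in sections if s not in history]
    pool = unplayed if unplayed else sections  # round robin first
    allowed = [s for s in pool
               if s != last and s not in forbidden_after.get(last, set())]
    return rng.choice(allowed or pool)  # fall back if over-constrained
```

Driven by a seeded generator, the same policy can even replay a given arrangement deterministically, which is handy for auditioning.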
It is possible to do some amount of automatic restructuring as well. Toy projects like The Infinite Jukebox (built on analysis technology from The Echo Nest, co-founded by Tristan Jehan) use algorithms to decompose a piece of music into individual beats, find similarities based on sonic signatures, and then create seamless junctions between the most similar beats. A similar approach could easily be applied to interactive scores to allow them to loop infinitely while dynamically varying their structure. Given our additional knowledge about the musical structure via the MIDI and the stems, we can use automation to plan these junctions in advance and bake the individual sections to make the jumps even more seamless.
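The junction-finding step can be sketched abstractly: give each beat a feature vector (in practice chroma, timbre and loudness descriptors; here just toy numbers) and link the most similar non-adjacent beats as jump candidates:

```python
import math

def find_junctions(features: list, threshold: float) -> list:
    """features: one feature vector per beat. Returns (from_beat,
    to_beat) pairs whose distance is under the threshold, skipping
    trivially adjacent beats where a jump would be pointless."""
    jumps = []
    for i, a in enumerate(features):
        for j, b in enumerate(features):
            if abs(i - j) <= 1:
                continue  # same beat or its neighbor: not a real jump
            if math.dist(a, b) < threshold:
                jumps.append((i, j))
    return jumps
```

With MIDI and stems available, the distance function could compare actual harmonic and rhythmic content instead of audio-derived signatures, which should make the candidate junctions far more reliable.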
A similar approach may also be viable as a transitioning scheme to more quickly shift into a new piece of music by finding a highly related location in the destination track to jump to, in cases where specific musical sync is less important than the general feel change.
In an ideal world, we would have a tool that is both a music middleware and a DAW—a unified creative suite for authoring interactive music. The music world is blessed (or plagued, depending on your point of view) by a wide variety of tools that are all in everyday, successful use by composers. Quite simply, composers are not looking to replace their main composing tool with another DAW. It would be a hard sell.
Until the day comes when someone can build a unified model, the best approach is incremental. At the base level, that means creating tools that integrate with audio middlewares like Wwise and FMOD, and tools for DAWs to automate certain content creation tasks. It’ll take clever solutions to connect the DAW and the middleware and find ways to streamline that process.
At a certain point, as the audio and sync features progress, this incremental approach will likely become untenable. That means it’s time for a new music-focused middleware. This music middleware could couple more tightly with both the DAW and the game engine.
Applying the Hybrid Approach
The Final Fantasy Battle Problem
Or: How to musically bridge two disparate pieces of music with a synchronized transition moment
One of my favorite toy examples in interactive music is what I like to call the Final Fantasy Problem. In a traditional Final Fantasy or other Japanese Role Playing Game (JRPG), the character is typically walking along until they encounter an enemy (whether at random or on screen). They are thrown into a combat gameplay mode with entirely new music for the battle. Then, when the battle ends, a victory fanfare plays. The transitions between all these elements can be rather jarring. Moreover, with the exception of special boss battles, the music is essentially always the same, no matter the context of what’s going on in the game outside this battle or how the battle went.
Suppose we wanted to take a different approach. Instead, let’s say we want to smooth out the transitions so whatever the victory music is, it reflects the outside world and how the battle went. Moreover, we want to transition to that victory track as musically as possible; no more hard cut. Lastly, let’s make the end of combat even more satisfying by tightening the synchronization so the music thunders to a climax right when the player defeats the last enemy. We may not want this on every combat or it will get repetitive but it can be an effective musical spotlight to deploy.
This problem isn’t specific to Final Fantasy and other JRPGs either. It’s a generally hard problem! We have to figure out how to approach transitioning between disparate tracks that need multiple degrees of variation (overall context in the game, plus context from the battle). And we want to transition with a lead in that anticipates the player action so we can align a musical flourish with the gameplay.
Let’s apply the tools described above to solve this scenario. First, let’s talk about the sync problem. The game state is fairly well-defined in a game like this. We know the health of the enemies, the player actions available, what any AI-driven entities might do, etc., and can form an aggregate prediction of when combat will end. With that prediction in hand, we can decide when to start our transition to the flourish. If, as the crucial sync point gets closer, we find our prediction is off, we can use our music editor in a box to either extend or shorten the buildup. Additionally or alternatively, after the last player input that would deliver the killing blow, we can adjust the game timing and insert or remove some animations to tweak the sync.
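As a toy sketch of that prediction loop, assume the combat system exposes aggregate numbers like remaining enemy health and recent damage throughput (the damage model and padding factor below are invented stand-ins):

```python
def predict_combat_end(enemy_hp: float, damage_per_second: float,
                       confidence_pad: float = 1.25) -> float:
    """Conservative seconds-until-victory estimate used to schedule the
    musical buildup; padded so we rarely overshoot the real ending."""
    if damage_per_second <= 0:
        return float("inf")  # no progress: don't start the flourish
    return (enemy_hp / damage_per_second) * confidence_pad

def buildup_adjustment(predicted_end: float, buildup_len: float) -> float:
    """Seconds the music editor in a box must add (+) or cut (-) from
    the buildup so its climax lands on the predicted combat end."""
    return predicted_end - buildup_len
```

Re-running the prediction every beat and feeding the delta to the elastic edit (or, late in the fight, to the game-timing adjustments) keeps the climax converging on the killing blow.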
If the player doesn’t look like they’re going to succeed, we can build in some way to bail out of the flourish. This may be the musical equivalent of putting your hand up for a high five and trying to pass it off as waving at someone in the distance instead to save face. That said, it should be a rare occurrence in some sense as long as our predictions are conservative. Also, it’s okay for the player to feel like they fell short! We just don’t want it to be musically goofy (probably).
Solving the variety problem is a little trickier and depends on what the composer wants to do. At its most basic, this is a content problem. If the game is well-defined enough, that may mean writing a bunch of different variations on a victory fanfare depending on where the player is in the game. Alternatively, we can think in terms of a systems-based design to encode the variety. For example, we could have the combat music dynamically ritard from wherever it was into a slower variant of the music. We could use musical crossfades and stems to perform the job of a music editor on the fly and stitch the two pieces together in a musical way, with some stems tailing over as needed.
While we’ve had thirty years of excellent game music, there are still so many fascinating aspects of interactive music to explore. The field has undergone a major paradigm shift already: from the flexibility of on-device synthesis to the expansive sonic possibilities of audio playback. Major gains in processing power, storage capacity and speed enabled that earlier shift; the catalyst then was hard drives and the CD-ROM ushering in the age of digital music creation and distribution. This time around, the change may well be driven by more powerful CPUs and high-speed, high-capacity SSDs.
The hardware groundwork has been laid. What’s needed now to usher in the third wave is software to take advantage of all that firepower and creative thinking about how we can make game music more detailed and intentional. Composers and game developers need new tools to effectively design, build and integrate interactive scores. But we also need to come to a cultural consensus on best practices for scoring interactivity.
This whole document is laying out a vision for game music as something akin to interactive film music. I’ve talked a bit about what that could look like, why that would be interesting, why it’s near impossible with current capabilities and how new software could help achieve it.
I don’t claim to have all the answers or even the best solutions. Instead, I hope this long-winded essay will spark a larger discussion of some sort. At the very least, I hope it has gotten you thinking about the limitations of our current practices and the world of possibilities opened by novel approaches.
Please reach out and let me know what you think!