April 2013

Volume 26 Number 04

DirectX Factor - Streaming and Manipulating Audio Files in Windows 8

By Charles Petzold | April 2013

Many Windows users these days have a Music Library on their hard drives containing perhaps thousands or tens of thousands of MP3 and WMA files. To play this music on the PC, generally such users run the Windows Media Player or the Windows 8 Music application. But for programmers, it’s good to know we can write our own programs to play these files. Windows 8 provides programming interfaces to access the Music Library, obtain information about the individual music files (such as artist, title and duration), and play these files using MediaElement.

MediaElement is the easy approach, and of course there are alternatives that make the job more difficult but also add a lot of versatility. With two DirectX components, Media Foundation and XAudio2, it’s possible for an application to get much more involved in this process. You can load chunks of decompressed audio data from music files, and analyze that data or manipulate it in some way before (or instead of) playing the music. Have you ever wondered what a Chopin Étude sounds like when played backward at half speed? Well, neither have I, but one of the programs accompanying this article will let you find out.

Pickers and Bulk Access

Certainly the easiest way for a Windows 8 program to access the Music Library is through the FileOpenPicker, which can be initialized in a C++ program for loading audio files like this:

FileOpenPicker^ fileOpenPicker = ref new FileOpenPicker();
fileOpenPicker->SuggestedStartLocation = 
  PickerLocationId::MusicLibrary;
fileOpenPicker->FileTypeFilter->Append(".wma");
fileOpenPicker->FileTypeFilter->Append(".mp3");
fileOpenPicker->FileTypeFilter->Append(".wav");

Call PickSingleFileAsync to display the FileOpenPicker and let the user select a file.

For a free-form exploration of the folders and files, it’s also possible for the application manifest file to indicate it wants more extensive access to the Music Library. The program can then use the classes in the Windows::Storage::BulkAccess namespace to enumerate the folders and music files on its own.

Regardless of the approach you take, each file is represented by a StorageFile object. From that object you can obtain a thumbnail, which is an image of the music’s album cover (if it exists). From the Properties property of StorageFile, you can obtain a MusicProperties object, which provides the artist, album, track name, duration and other standard information associated with the music file.

By calling OpenAsync on that StorageFile, you can also open it for reading and obtain an IRandomAccessStream object, and even read the entire file into memory. If it’s a WAV file, you might consider parsing the file, extracting the waveform data and playing the sound through XAudio2, as I’ve described in recent installments of this column.

But if it’s an MP3 or WMA file, that’s not so easy. You’ll need to decompress the audio data, and that’s a job you’ll probably not want to take on yourself. Fortunately, the Media Foundation APIs include facilities to decompress MP3 and WMA files and put the data into a form that can be passed directly to XAudio2 for playing.

Another approach to getting access to decompressed audio data is through an audio effect that’s attached to a MediaElement. I hope to demonstrate this technique in a later article.

Media Foundation Streaming

To use the Media Foundation functions and interfaces I’ll be discussing here, you’ll need to link your Windows 8 program with the mfplat.lib and mfreadwrite.lib import libraries, and you’ll need #include statements for mfapi.h, mfidl.h and mfreadwrite.h in your pch.h file. (Also, be sure to include initguid.h before mfapi.h or you’ll get link errors that might leave you baffled for many unproductive hours.) If you’re also using XAudio2 to play the files (as I’ll be doing here), you’ll need the xaudio2.lib import library and xaudio2.h header file.

Among the downloadable code for this column is a Windows 8 project named StreamMusicFile that demonstrates pretty much the minimum code necessary to load a file from the PC’s Music Library, decompress it through Media Foundation and play it through XAudio2. A button invokes the FileOpenPicker, and after you’ve selected a file, the program displays some standard information (as shown in Figure 1) and immediately begins playing the file. By default, the volume Slider at the bottom is set to 0, so you’ll need to increase that to hear anything. There’s no way to pause or stop the file except by terminating the program or bringing another program to the foreground.

Figure 1 The StreamMusicFile Program Playing a Music File

In fact, the program doesn’t stop playing a music file even if you click the button and load in a second file. Instead, you’ll discover that both files play at the same time, but probably not in any type of coherent synchronization. So that’s something this program can do that the Windows 8 Music application and Media Player cannot: play multiple music files simultaneously!

The method in Figure 2 shows how the program uses an IRandomAccessStream from a StorageFile to create an IMFSourceReader object capable of reading an audio file and delivering chunks of uncompressed audio data.

Figure 2 Creating and Initializing an IMFSourceReader

ComPtr<IMFSourceReader> MainPage::CreateSourceReader(IRandomAccessStream^ randomAccessStream)
{
  // Start up Media Foundation
  HRESULT hresult = MFStartup(MF_VERSION);
  // Create an IMFByteStream to wrap the IRandomAccessStream
  ComPtr<IMFByteStream> mfByteStream;
  hresult = MFCreateMFByteStreamOnStreamEx((IUnknown *)randomAccessStream,
                                            &mfByteStream);
  // Create an attribute for low latency operation
  ComPtr<IMFAttributes> mfAttributes;
  hresult = MFCreateAttributes(&mfAttributes, 1);
  hresult = mfAttributes->SetUINT32(MF_LOW_LATENCY, TRUE);
  // Create the IMFSourceReader
  ComPtr<IMFSourceReader> mfSourceReader;
  hresult = MFCreateSourceReaderFromByteStream(mfByteStream.Get(),
                                               mfAttributes.Get(),
                                               &mfSourceReader);
  // Create an IMFMediaType for setting the desired format
  ComPtr<IMFMediaType> mfMediaType;
  hresult = MFCreateMediaType(&mfMediaType);
  hresult = mfMediaType->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Audio);
  hresult = mfMediaType->SetGUID(MF_MT_SUBTYPE, MFAudioFormat_Float);
  // Set the media type in the source reader
  hresult = mfSourceReader->SetCurrentMediaType(MF_SOURCE_READER_FIRST_AUDIO_STREAM,
                                          0, mfMediaType.Get());
  return mfSourceReader;
}

For clarity, Figure 2 excludes all code dealing with errant HRESULT return values. The actual code throws exceptions of type COMException, but the program doesn’t catch these exceptions as a real application would.

In short, this method uses the IRandomAccessStream to create an IMFByteStream object encapsulating the input stream, and then uses that to create an IMFSourceReader, which can perform the actual decompression.

Notice the use of an IMFAttributes object to specify a low-latency operation. This isn’t strictly required, and you can set the second argument to the MFCreateSourceReaderFromByteStream function to nullptr. However, as the file is being read and played, the hard drive is being accessed and you don’t want those disk operations to create audible gaps in the playback. If you’re really nervous about this problem, you might consider reading the entire file into an InMemoryRandomAccessStream object and using that for creating the IMFByteStream.

When a program uses Media Foundation to decompress an audio file, the program has no control over the sampling rate of the uncompressed data it receives from the file, or the number of channels. This is governed by the file. However, the program can specify that the samples be in one of two different formats: 16-bit integers (used for CD audio) or 32-bit floating-point values (the C float type). Internally, XAudio2 uses 32-bit floating-point samples, so fewer internal conversions are required if 32-bit floating-point samples are passed to XAudio2 for playing the file. I decided to go that route in this program. Accordingly, the method in Figure 2 specifies the format of the audio data it desires with the two identifiers MFMediaType_Audio and MFAudioFormat_Float. If decompressed data is required, the only alternative to this second identifier is MFAudioFormat_PCM for 16-bit integer samples.

At this point, we have an object of type IMFSourceReader poised to read and decompress chunks of an audio file.

Playing the File

I originally wanted to have all the code for this first program in the MainPage class, but I also wanted to use an XAudio2 callback function. That’s a problem because (as I discovered) a Windows Runtime type like MainPage can’t implement a non-Windows Runtime interface like IXAudio2VoiceCallback, so I needed a second class, which I called AudioFilePlayer.

After obtaining an IMFSourceReader object from the method shown in Figure 2, MainPage creates a new AudioFilePlayer object, also passing to it an IXAudio2 object created in the MainPage constructor:

new AudioFilePlayer(pXAudio2, mfSourceReader);

From there, the AudioFilePlayer object is entirely on its own and pretty much self-contained. That’s how the program can play multiple files simultaneously.

To play the music file, AudioFilePlayer needs to create an IXAudio2SourceVoice object. This requires a WAVEFORMATEX structure indicating the format of the audio data to be passed to the source voice, and that should be consistent with the audio data being delivered by the IMFSourceReader object. You can probably guess at the correct parameters (such as two channels and a 44,100 Hz sampling rate), and if you get the sampling rate wrong, XAudio2 can perform sample rate conversions internally. Still, it’s best to obtain a WAVEFORMATEX structure from the IMFSourceReader and use that, as shown in the AudioFilePlayer constructor in Figure 3.

Figure 3 The AudioFilePlayer Constructor in StreamMusicFile

AudioFilePlayer::AudioFilePlayer(ComPtr<IXAudio2> pXAudio2,
                                 ComPtr<IMFSourceReader> mfSourceReader)
{
  this->mfSourceReader = mfSourceReader;
  // Get the Media Foundation media type
  ComPtr<IMFMediaType> mfMediaType;
  HRESULT hresult = mfSourceReader->GetCurrentMediaType(
                        MF_SOURCE_READER_FIRST_AUDIO_STREAM,
                        &mfMediaType);
  // Create a WAVEFORMATEX from the media type
  WAVEFORMATEX* pWaveFormat;
  unsigned int waveFormatLength;
  hresult = MFCreateWaveFormatExFromMFMediaType(mfMediaType.Get(),
                                                &pWaveFormat,
                                                &waveFormatLength);
  // Create the XAudio2 source voice
  hresult = pXAudio2->CreateSourceVoice(&pSourceVoice, pWaveFormat,
                                        XAUDIO2_VOICE_NOPITCH, 1.0f, this);
  // Free the memory allocated by MFCreateWaveFormatExFromMFMediaType
  CoTaskMemFree(pWaveFormat);
  // Clear the end-of-file flag before the first read can set it
  endOfFile = false;
  // Submit two buffers
  SubmitBuffer();
  SubmitBuffer();
  // Start the voice playing
  pSourceVoice->Start();
}

Getting that WAVEFORMATEX structure is a bit of a nuisance that involves a memory block that must then be explicitly freed, but by the conclusion of the AudioFilePlayer constructor, the file is ready to be played.
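If you did guess at the format rather than query the source reader, the derived fields of the structure would have to agree with one another. The following is a minimal sketch in standard C++, not code from StreamMusicFile. WaveFormatSketch is a hypothetical stand-in that mirrors the fields of WAVEFORMATEX so the example compiles without the Windows headers, and MakeFloatFormat is likewise an invented helper. The value 3 is the WAVE_FORMAT_IEEE_FLOAT tag.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in that mirrors the fields of WAVEFORMATEX so the
// sketch compiles without the Windows headers.
struct WaveFormatSketch
{
  std::uint16_t wFormatTag;
  std::uint16_t nChannels;
  std::uint32_t nSamplesPerSec;
  std::uint32_t nAvgBytesPerSec;
  std::uint16_t nBlockAlign;
  std::uint16_t wBitsPerSample;
  std::uint16_t cbSize;
};

// Value of WAVE_FORMAT_IEEE_FLOAT in the Windows headers
const std::uint16_t FormatTagIeeeFloat = 3;

// Hypothetical helper: describes 32-bit floating-point audio for a
// guessed channel count and sampling rate.
WaveFormatSketch MakeFloatFormat(std::uint16_t channels,
                                 std::uint32_t sampleRate)
{
  assert(channels > 0 && sampleRate > 0);
  WaveFormatSketch format = {};
  format.wFormatTag = FormatTagIeeeFloat;
  format.nChannels = channels;
  format.nSamplesPerSec = sampleRate;
  format.wBitsPerSample = 32;
  // One block holds one sample for every channel
  format.nBlockAlign =
    static_cast<std::uint16_t>(channels * format.wBitsPerSample / 8);
  // Bytes consumed by one second of playback
  format.nAvgBytesPerSec = sampleRate * format.nBlockAlign;
  format.cbSize = 0;
  return format;
}
```

For two channels at 44,100 Hz this gives a block alignment of 8 bytes and 352,800 bytes per second.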

To keep the memory footprint of such a program to a minimum, the file should be read and played in small chunks. Both Media Foundation and XAudio2 are very conducive to this approach. Each call to the ReadSample method of the IMFSourceReader object obtains access to the next block of uncompressed data until the file is entirely read. For a sampling rate of 44,100 Hz, two channels and 32-bit floating-point samples, my experience is that these blocks are usually 16,384 or 32,768 bytes in size, and sometimes as little as 12,288 bytes (but always a multiple of 4,096), indicating about 35 to 100 milliseconds of audio each.
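The relationship between block size and playing time is simple arithmetic. Here is a small sketch in standard C++; BlockDurationMs is a hypothetical helper, not part of the downloadable code.

```cpp
#include <cassert>

// Hypothetical helper: playing time in milliseconds of a block of
// uncompressed audio in the given format.
double BlockDurationMs(int blockBytes, int sampleRate,
                       int channels, int bytesPerSample)
{
  assert(sampleRate > 0 && channels > 0 && bytesPerSample > 0);
  double bytesPerSecond =
    static_cast<double>(sampleRate) * channels * bytesPerSample;
  return 1000.0 * blockBytes / bytesPerSecond;
}
```

At 44,100 Hz with two channels of 4-byte samples, one second of audio occupies 352,800 bytes, so blocks of 12,288, 16,384 and 32,768 bytes last roughly 35, 46 and 93 milliseconds.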

Following each call to the ReadSample method of IMFSourceReader, a program can simply allocate a local block of memory, copy the data to that, and then submit that local block to the IXAudio2SourceVoice object with SubmitSourceBuffer.

AudioFilePlayer uses a two-buffer approach to play the file: While one buffer is being filled with data, the other buffer is playing. Figure 4 shows the entire process, again without the checks for errors.

Figure 4 The Audio-Streaming Pipeline in StreamMusicFile

void AudioFilePlayer::SubmitBuffer()
{
  // Get the next block of audio data
  int audioBufferLength;
  byte * pAudioBuffer = GetNextBlock(&audioBufferLength);
  if (pAudioBuffer != nullptr)
  {
    // Create an XAUDIO2_BUFFER for submitting audio data
    XAUDIO2_BUFFER buffer = {0};
    buffer.AudioBytes = audioBufferLength;
    buffer.pAudioData = pAudioBuffer;
    buffer.pContext = pAudioBuffer;
    HRESULT hresult = pSourceVoice->SubmitSourceBuffer(&buffer);
  }
}
byte * AudioFilePlayer::GetNextBlock(int * pAudioBufferLength)
{
  // Get an IMFSample object
  ComPtr<IMFSample> mfSample;
  DWORD flags = 0;
  HRESULT hresult = mfSourceReader->ReadSample(MF_SOURCE_READER_FIRST_AUDIO_STREAM,
                                               0, nullptr, &flags, nullptr,
                                               &mfSample);
  // Check if we’re at the end of the file
  if (flags & MF_SOURCE_READERF_ENDOFSTREAM)
  {
    endOfFile = true;
    *pAudioBufferLength = 0;
    return nullptr;
  }
  // If not, convert the data to a contiguous buffer
  ComPtr<IMFMediaBuffer> mfMediaBuffer;
  hresult = mfSample->ConvertToContiguousBuffer(&mfMediaBuffer);
  // Lock the audio buffer and copy the samples to local memory
  uint8 * pAudioData = nullptr;
  DWORD audioDataLength = 0;
  hresult = mfMediaBuffer->Lock(&pAudioData, nullptr, &audioDataLength);
  byte * pAudioBuffer = new byte[audioDataLength];
  CopyMemory(pAudioBuffer, pAudioData, audioDataLength);
  hresult = mfMediaBuffer->Unlock();
  *pAudioBufferLength = audioDataLength;
  return pAudioBuffer;
}
// Callback methods from IXAudio2VoiceCallback
void _stdcall AudioFilePlayer::OnBufferEnd(void* pContext)
{
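  // Remember to free the audio buffer! The context pointer is the byte
  // array allocated in GetNextBlock, so restore its type before deleting.
  delete[] static_cast<byte *>(pContext);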
  // Either submit a new buffer or clean up
  if (!endOfFile)
  {
    SubmitBuffer();
  }
  else
  {
    pSourceVoice->DestroyVoice();
    HRESULT hresult = MFShutdown();
  }
}

To get temporary access to the audio data, the program needs to call Lock and then Unlock on an IMFMediaBuffer object representing the new block of data. In between those calls, the GetNextBlock method in Figure 4 copies the block into a newly allocated byte array.

The SubmitBuffer method in Figure 4 is responsible for setting the fields of an XAUDIO2_BUFFER structure in preparation for submitting the audio data for playing. Notice how this method also sets the pContext field to the allocated audio buffer. This pointer is passed to the OnBufferEnd callback method seen toward the end of Figure 4, which can then delete the array memory.

When a file has been entirely read, the next ReadSample call sets an MF_SOURCE_READERF_ENDOFSTREAM flag and the IMFSample object is null. The program responds by setting an endOfFile field variable. At this time, the other buffer is still playing, and a last call to OnBufferEnd will occur, which uses the occasion to release some system resources.

There’s also an OnStreamEnd callback method that’s triggered by setting the XAUDIO2_END_OF_STREAM flag in the XAUDIO2_BUFFER, but it’s hard to use in this context. The problem is that you can’t set that flag until you receive an MF_SOURCE_READERF_ENDOFSTREAM flag from the ReadSample call. But SubmitSourceBuffer does not allow null buffers or buffers of zero size, which means you have to submit a non-empty buffer anyway, even though no more data is available!
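One conceivable way around this, which StreamMusicFile does not use, is to stay one block ahead of the voice, so that by the time a block is submitted the program already knows whether another follows. The sketch below shows only that look-ahead idea in standard C++. LookAheadReader and its readBlock callback are hypothetical names; the callback stands in for the ReadSample logic and returns false at the end of the stream.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

typedef std::vector<unsigned char> Block;

// Hypothetical wrapper that always holds one block in reserve.
class LookAheadReader
{
public:
  // reader fills a block and returns true, or returns false when the
  // stream has ended (the role played by the end-of-stream flag).
  explicit LookAheadReader(std::function<bool(Block&)> reader)
    : readBlock(std::move(reader))
  {
    assert(static_cast<bool>(readBlock));
    hasPending = readBlock(pending);
  }

  // Hands out the next block. isLast is set to true when no further block
  // follows, which is the point at which a caller could set
  // XAUDIO2_END_OF_STREAM on the buffer it submits.
  bool Next(Block& block, bool& isLast)
  {
    if (!hasPending)
      return false;
    block = std::move(pending);
    pending.clear();  // leave the moved-from vector in a known state
    hasPending = readBlock(pending);
    isLast = !hasPending;
    return true;
  }

private:
  std::function<bool(Block&)> readBlock;
  Block pending;
  bool hasPending;
};
```

The price is one extra block held in memory and one extra read before playback can begin.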

Spinning a Record Metaphor

Of course, passing audio data from Media Foundation to XAudio2 is not nearly as easy as using the Windows 8 MediaElement, and hardly worth the effort unless you’re going to do something interesting with the audio data. You can use XAudio2 to set some special effects (such as echo or reverb), and in the next installment of this column I’ll apply XAudio2 filters to sound files.

Meanwhile, Figure 5 shows a program named DeeJay that displays an on-screen record and rotates it as the music is playing at a default rate of 33 1/3 revolutions per minute.

Figure 5 The DeeJay Program

Not shown is an application bar with a Load File button and two sliders: one for the volume and another to control the playback speed. The speed slider has values that range from -3 to 3 and indicate a speed ratio. The default value is 1. A value of 0.5 plays back the file at half speed, a value of 3 plays the file back three times as fast, a value of 0 essentially pauses playback, and negative values play the file backward (perhaps allowing you to hear hidden messages encoded in the music).

Because this is Windows 8, of course you can also spin the record with your fingers, thus justifying the program’s name. DeeJay allows for single-finger rotation with inertia, so you can give the record a good spin in either direction. You can also tap the record to move the “needle” to that location.

I very, very, very much wanted to implement this program in a manner similar to the StreamMusicFile project with alternating calls to ReadSample and SubmitSourceBuffer. But problems arose when attempting to play the file backward. I really needed IMFSourceReader to support a ReadPreviousSample method, but it does not.

What IMFSourceReader does support is a SetCurrentPosition method that allows you to move to a previous location in the file. However, subsequent ReadSample calls begin returning blocks earlier than that position. Most of the time, a series of calls to ReadSample eventually meet up at the same block as the last ReadSample call before SetCurrentPosition, but sometimes they don’t, and that made it just too messy.

I eventually gave up, and the program simply loads the entire uncompressed audio file into memory. To keep the memory footprint down, I specified 16-bit integer samples rather than 32-bit floating-point samples, but still it’s about 10MB of memory per minute of audio, and loading in a long movement of a Mahler symphony would commandeer about 300MB.
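Those memory figures follow directly from the sample format. A quick sketch in standard C++, where UncompressedBytes is a hypothetical helper and the 30-minute duration used below is just an illustrative guess at a long symphonic movement:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: bytes needed to hold a whole file of
// uncompressed audio in memory.
std::int64_t UncompressedBytes(int sampleRate, int channels,
                               int bytesPerSample, int seconds)
{
  assert(sampleRate > 0 && channels > 0 &&
         bytesPerSample > 0 && seconds >= 0);
  return static_cast<std::int64_t>(sampleRate) *
         channels * bytesPerSample * seconds;
}
```

For 16-bit stereo at 44,100 Hz, one minute comes to 10,584,000 bytes and 30 minutes to 317,520,000 bytes. The same minute in 32-bit floating point would need 21,168,000 bytes.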

Those Mahler symphonies also mandated that the entire file-loading method be executed in a secondary thread, a job that’s greatly simplified by the create_task function available in Windows 8.

To ease working with the individual samples, I created a simple structure named AudioSample:

struct AudioSample
{
  short Left;
  short Right;
};

So instead of working with an array of bytes, the AudioFilePlayer class in this program works with an array of AudioSample values. However, this means that the program is basically hardcoded for stereo files. If it loads an audio file that doesn’t have exactly two channels, it can’t play that file!

The asynchronous file-reading method stores the data it obtains in a structure I call LoadedAudioFileInfo:

struct LoadedAudioFileInfo
{
  AudioSample* pBuffer;
  int bufferLength;
  WAVEFORMATEX waveFormat;
};

The pBuffer is the big block of memory, and bufferLength is the product of the sampling rate (probably 44,100 Hz) and the duration of the file in seconds. This structure is passed directly to the AudioFilePlayer class. A new AudioFilePlayer is created for each loaded file, and it replaces any previous AudioFilePlayer instance. For cleaning up, AudioFilePlayer has a destructor that deletes the big array holding the entire file, as well as two smaller arrays used for submitting buffers to the IXAudio2SourceVoice object.

The keys to playing the file forward and backward at various speeds are two fields in AudioFilePlayer of type double: audioBufferIndex and speedRatio. The audioBufferIndex variable points to a location within the big array containing the entire uncompressed file. The speedRatio variable is set to the same values as the slider, -3 through 3. When the AudioFilePlayer needs to transfer audio data from the big buffer into the smaller buffers for submission, it increments audioBufferIndex by speedRatio for each sample. The resultant audioBufferIndex is (in general) between two file samples, so the method in Figure 6 performs an interpolation to derive a value that’s then transferred to the submission buffer.

Figure 6 Interpolating Between Two Samples in DeeJay

AudioSample AudioFilePlayer::InterpolateSamples()
{
  double left1 = 0, left2 = 0, right1 = 0, right2 = 0;
  if (pAudioBuffer != nullptr)
  {
    // The integer part of the index selects two neighboring samples;
    // the fractional part is the weight given to the second one
    int index1 = (int)audioBufferIndex;
    int index2 = index1 + 1;
    double weight = audioBufferIndex - index1;
    if (index1 >= 0 && index1 < audioBufferLength)
    {
      left1 = (1 - weight) * pAudioBuffer[index1].Left;
      right1 = (1 - weight) * pAudioBuffer[index1].Right;
    }
    if (index2 >= 0 && index2 < audioBufferLength)
    {
      left2 = weight * pAudioBuffer[index2].Left;
      right2 = weight * pAudioBuffer[index2].Right;
    }
  }
  AudioSample audioSample;
  audioSample.Left = (short)(left1 + left2);
  audioSample.Right = (short)(right1 + right2);
  return audioSample;
}
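To see how the fractional index and the speed ratio work together, here is a self-contained sketch in standard C++ of filling a submission buffer. It is not the DeeJay code: InterpolateAt and FillSubmissionBuffer are hypothetical free functions, the big buffer is a std::vector, the order of interpolating and then advancing is my assumption, and the index is split with std::floor rather than an integer cast so that positions just below zero are handled the same way as the rest.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct AudioSample
{
  short Left;
  short Right;
};

// Linear interpolation at a fractional index. Positions outside the
// source buffer contribute silence.
AudioSample InterpolateAt(const std::vector<AudioSample>& source,
                          double index)
{
  double lower = std::floor(index);
  double weight = index - lower;
  long index1 = static_cast<long>(lower);
  long index2 = index1 + 1;
  long length = static_cast<long>(source.size());
  double left = 0, right = 0;
  if (index1 >= 0 && index1 < length)
  {
    left += (1 - weight) * source[index1].Left;
    right += (1 - weight) * source[index1].Right;
  }
  if (index2 >= 0 && index2 < length)
  {
    left += weight * source[index2].Left;
    right += weight * source[index2].Right;
  }
  AudioSample result;
  result.Left = static_cast<short>(left);
  result.Right = static_cast<short>(right);
  return result;
}

// Hypothetical fill routine: produces count output samples, moving the
// read position by speedRatio after each one. A negative speedRatio
// walks backward through the source.
std::vector<AudioSample> FillSubmissionBuffer(
  const std::vector<AudioSample>& source,
  double& audioBufferIndex, double speedRatio, int count)
{
  assert(count >= 0);
  std::vector<AudioSample> output;
  output.reserve(count);
  for (int i = 0; i < count; i++)
  {
    output.push_back(InterpolateAt(source, audioBufferIndex));
    audioBufferIndex += speedRatio;
  }
  return output;
}
```

With a speed ratio of 0.5, every source sample yields two output samples, which is why the music plays at half speed and an octave lower. With a ratio of -1, the source samples come out in reverse order.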

The Touch Interface

To keep the program simple, the entire touch interface consists of a Tapped event (to position the “needle” at a different location on the record) and three Manipulation events: the ManipulationStarting handler initializes single-finger rotation; the ManipulationDelta handler sets a speed ratio for the AudioFilePlayer that overrides the speed ratio from the slider; and the ManipulationCompleted handler restores the speed ratio in AudioFilePlayer to the slider value after all inertial movement has completed.

Rotational velocity values are directly available from event arguments of the ManipulationDelta handler. These are in units of degrees of rotation per millisecond. A standard long-playing record speed of 33 1/3 revolutions per minute is equivalent to 200° per second, or 0.2° per millisecond, so I merely needed to divide the value in the ManipulationDelta event by 0.2 to obtain the speed ratio I required.
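The unit conversion can be written out as a short sketch in standard C++. Both helper names are hypothetical and do not appear in the DeeJay project.

```cpp
#include <cassert>

// Hypothetical helper: degrees of rotation per millisecond for a
// turntable speed given in revolutions per minute.
double RpmToDegreesPerMs(double rpm)
{
  // 360 degrees per revolution, 60,000 milliseconds per minute
  return rpm * 360.0 / 60000.0;
}

// Hypothetical helper: playback speed ratio for an angular velocity in
// degrees per millisecond, relative to a nominal turntable speed.
double SpeedRatioFromAngularVelocity(double degreesPerMs,
                                     double nominalRpm)
{
  assert(nominalRpm != 0);
  return degreesPerMs / RpmToDegreesPerMs(nominalRpm);
}
```

A finger moving the record at 0.1° per millisecond therefore corresponds to half speed, and a negative velocity produces a negative ratio for backward play.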

However, I discovered that the velocities reported by the ManipulationDelta event are quite erratic, so I had to smooth them out with some simple logic involving a field variable named smoothVelocity:

smoothVelocity = 0.95 * smoothVelocity +
                 0.05 * args->Velocities.Angular / 0.2;
pAudioFilePlayer->SetSpeedRatio(smoothVelocity);
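This kind of smoothing is an exponential moving average: each new reading gets a small weight and the accumulated value keeps the rest. A stand-alone sketch in standard C++ makes its behavior easy to try out. VelocitySmoother is a hypothetical class, and the weight of 0.05 used in the note below matches the snippet above.

```cpp
#include <cassert>

// Hypothetical stand-alone version of the smoothing step.
class VelocitySmoother
{
public:
  // newWeight is the share given to each new reading (0.05 above)
  explicit VelocitySmoother(double newWeight)
    : newWeight(newWeight), smoothVelocity(0)
  {
    assert(newWeight > 0 && newWeight <= 1);
  }

  // Blends one new speed ratio into the running value and returns it
  double Update(double speedRatio)
  {
    smoothVelocity = (1 - newWeight) * smoothVelocity +
                     newWeight * speedRatio;
    return smoothVelocity;
  }

private:
  double newWeight;
  double smoothVelocity;
};
```

With a weight of 0.05, a steady input needs 45 updates before the smoothed value reaches 90 percent of it, so a smaller weight gives a calmer but more sluggish response.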

On a real turntable, you can stop rotation by simply pressing your finger on the record. But that doesn’t work here. Actual movement of your finger is necessary for Manipulation events to be generated, so to stop the record you need to press and then move your finger (or mouse or pen) a bit.

The inertial deceleration logic also doesn’t match up with reality. This program allows inertial movement to finish entirely before restoring the speed ratio to the value indicated by the slider. In reality, that slider value should exhibit a type of pull on the inertial values, but that would have complicated the logic considerably.

Besides, I couldn’t really detect an “unnatural” inertial effect. Undoubtedly a real DJ would feel the difference right away.


Charles Petzold is a longtime contributor to MSDN Magazine and the author of “Programming Windows, 6th edition” (O’Reilly Media, 2012), a book about writing applications for Windows 8. His Web site is charlespetzold.com.

Thanks to the following technical experts for reviewing this article: Richard Fricks (Microsoft)