@Tony Thanks for the question. The default audio streaming format is WAV (16 kHz or 8 kHz, 16-bit, mono PCM). Beyond WAV / PCM, the compressed input formats listed in the documentation are also supported through GStreamer.
Here is the doc for supported input formats and samples.
The Python script linked below runs speech-to-text on audio files of any size:
https://github.com/caiomsouza/Microsoft-Cognitive-Services/blob/master/speech-to-text/speech-to-text-all-files_large_files.py
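If you want to check whether a file already matches the default WAV format mentioned above before sending it, something like the following may help. It is only a rough sketch using Python's standard `wave` module; the function name `is_default_wav_format` is made up for illustration and is not part of the Speech SDK or of the linked script.

```python
import wave

# Sample rates taken from the default format described above (8 kHz or 16 kHz).
SUPPORTED_RATES = {8000, 16000}


def is_default_wav_format(path):
    """Return True if the file is a mono, 16-bit PCM WAV at 8 kHz or 16 kHz.

    Hypothetical helper for illustration only.
    """
    try:
        with wave.open(path, "rb") as wav:
            return (
                wav.getnchannels() == 1                     # mono
                and wav.getsampwidth() == 2                 # 2 bytes = 16-bit
                and wav.getframerate() in SUPPORTED_RATES   # 8 kHz or 16 kHz
                and wav.getcomptype() == "NONE"             # uncompressed PCM
            )
    except (wave.Error, EOFError):
        # Not a WAV file the wave module can read (for example MP3 or an empty file).
        return False
```

Files that fail this check would need to be converted first, or sent as one of the compressed formats if GStreamer is set up.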