@Mahesh Ch Thanks for the question. Could you please share the sample code you are trying to run?
Azure provides speaker identification within Speech Services, but in a call-center scenario the customer does not need to identify who is speaking, and cannot train the model beforehand with speaker voices, since a new caller comes in every time. They only need to distinguish the different voices when converting speech to text.
The Microsoft Cognitive Services Batch Transcription API has the ability to identify the two voices separately (e.g., Speaker 0 = Agent, Speaker 1 = Customer when there are two speakers) when converting speech to text.
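If it helps, here is a minimal Python sketch of submitting a batch transcription job with diarization enabled. It assumes the v3.1 REST endpoint; `YOUR_REGION`, `YOUR_KEY`, and the recording URL are placeholders you would replace with your own values:

```python
# Minimal sketch: submit a batch transcription job with diarization enabled,
# then poll for the result and print each phrase with its speaker number.
import time
import requests

endpoint = "https://YOUR_REGION.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions"
headers = {"Ocp-Apim-Subscription-Key": "YOUR_KEY", "Content-Type": "application/json"}

body = {
    "displayName": "call-center-sample",
    "locale": "en-US",
    "contentUrls": ["https://example.com/call-recording.wav"],  # placeholder URL
    "properties": {
        # Ask the service to label each recognized phrase with a speaker
        # number when two speakers are present.
        "diarizationEnabled": True,
        "wordLevelTimestampsEnabled": True,
    },
}

job = requests.post(endpoint, headers=headers, json=body).json()
status_url = job["self"]

# Poll until the job finishes.
while True:
    status = requests.get(status_url, headers=headers).json()
    if status["status"] in ("Succeeded", "Failed"):
        break
    time.sleep(10)

if status["status"] == "Succeeded":
    files = requests.get(status_url + "/files", headers=headers).json()
    for f in files["values"]:
        if f["kind"] == "Transcription":
            result = requests.get(f["links"]["contentUrl"]).json()
            # With diarization enabled, each recognized phrase carries a
            # "speaker" field you can map to Agent/Customer on your side.
            for phrase in result["recognizedPhrases"]:
                print(phrase.get("speaker"), phrase["nBest"][0]["display"])
```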
Here is a blog post documenting this, with sample code that combines Speech with Text Analytics: https://azure.microsoft.com/en-us/blog/using-text-analytics-in-call-centers/
Video Indexer supports transcription, speaker diarization (enumeration), and emotion recognition from both the text and the tone of the voice. Additional insights are available as well, e.g., topic inference, language identification, brand detection, translation, etc. You can consume it via the video or audio-only APIs for COGS optimization.
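As a rough illustration, here is a sketch of the Video Indexer upload-and-index flow using the audio-only preset; `LOCATION`, `ACCOUNT_ID`, `ACCESS_TOKEN`, and the media URL are placeholders you would fill in from your Video Indexer account:

```python
# Minimal sketch of the Video Indexer upload-and-index flow (audio-only preset).
# LOCATION, ACCOUNT_ID, ACCESS_TOKEN, and the media URL below are placeholders.
import time
import requests

base = "https://api.videoindexer.ai/LOCATION/Accounts/ACCOUNT_ID"
token = {"accessToken": "ACCESS_TOKEN"}

# Upload by URL; "AudioOnly" selects the cheaper audio pipeline (COGS).
video = requests.post(
    f"{base}/Videos",
    params={
        **token,
        "name": "call-recording",
        "videoUrl": "https://example.com/call-recording.wav",  # placeholder
        "indexingPreset": "AudioOnly",
    },
).json()

# Poll the index until processing completes.
while True:
    index = requests.get(f"{base}/Videos/{video['id']}/Index", params=token).json()
    if index["state"] == "Processed":
        break
    time.sleep(10)

# Each transcript line carries a speakerId from diarization.
for line in index["videos"][0]["insights"]["transcript"]:
    print(line.get("speakerId"), line["text"])
```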