mahnln-8089 asked:

Has WinSock Registered IO performance degraded since 2012?

EDIT: keep in mind that the following description is the initial version of the "problem" - please also read any comments I added below to get the full picture of what I have tried already. Generally speaking, it seems that RIO itself (i.e. as an API) is not causing the problem. Something between the NIC and (my) user space code has to be configured wrongly or is outright broken.



I have recently written a UDP receiver based on WinSock Registered IO (RIO), using the somewhat acceptable documentation MS provides for this API. The resulting performance is very disappointing: somewhat stable single-socket performance of around 180k packets per second, and around 260k packets per second when using multiple RSS queues (i.e. multiple sockets). The code has been profiled using Tracy and the results suggest that everything is in working order.
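As a rough sanity check (assuming every packet carries a full 1500-byte Ethernet payload plus the standard 38 bytes of per-packet wire overhead), these rates translate to only a fraction of the 10 Gb/s link:

```python
# Convert an observed packet rate to wire-level throughput.
# Assumes full 1500-byte payloads and 38 B of per-packet wire overhead
# (14 B Ethernet header + 4 B FCS + 8 B preamble + 12 B inter-frame gap).
def pps_to_gbps(pps, payload_bytes=1500):
    wire_bits = (payload_bytes + 38) * 8
    return pps * wire_bits / 1e9

print(f"{pps_to_gbps(180_000):.2f} Gb/s")  # single socket -> 2.21 Gb/s
print(f"{pps_to_gbps(260_000):.2f} Gb/s")  # multiple RSS queues -> 3.20 Gb/s
```

So even the multi-queue result leaves roughly two thirds of the link idle.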

Up until now one might be tempted to think that I have just done a bad job at writing the code I gathered these numbers from. Therefore, I searched online and found blog posts from 2012 that suggest single-socket performance of at least 450k packets per second. I downloaded the code and ran it on my test setup (back-to-back 10 Gb/s connected machines) and got performance similar to mine.

And now it gets interesting: after writing an email to the blogger I got the code from, he tried it out on his machines and also came to the conclusion that "RIO performance is disappointing" and that basic blocking IO outperforms it (at least in this single socket setup). Unfortunately, he does not have the time for further investigation on why the code from 2012 results in worse performance on better hardware.

All tests were performed on Windows 10 Pro based machines and at least on my side with (presumably) appropriate NIC configuration and no SMT/HyperThreading. The system was not under any load while conducting the tests.

The question I initially asked could be rephrased as: did the RIO API change such that code written in 2012 now performs very badly? Or: what does a "modern" approach to using RIO look like? Or: should RIO be used at all in 2021?


windows-api · windows-10-network

Since nobody seems to be interested in providing assistance related to the main question, maybe someone is kind enough to answer one of the questions at the end of this post. It's just ridiculous how bad Windows 10 networking performance is (at least in my tests) and I would really like to know what to do about it.


"What to do about it" is meant in the sense of WinSock use - and please don't link me to material like this: https://docs.microsoft.com/en-us/windows/win32/winsock/high-performance-windows-sockets-applications-2 - it's not even remotely helpful and mainly focuses on TCP (which I'm not interested in).


How recently are we talking? I explored the performance of RIO myself roughly a year ago (Win10) and my results were promising; around 660 kpps on a single pinned core with ~2008 hardware from one steered flow.

What sort of analysis have you done? I recall while profiling discovering considerable contention in a few NDIS filter drivers (QoS and the component responsible for providing traffic filtering services; the name of which escapes me).

Have the target machines been configured for this sort of workload? One thing that comes to mind is ethernet flow control; this can easily prevent you from saturating a link with small frames.
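To put that in perspective, a quick back-of-the-envelope calculation (assuming standard Ethernet wire overhead, preamble and inter-frame gap included) gives the packet rate a link must sustain at line rate:

```python
# Packet rate required to saturate a link at a given payload size.
# Per-frame wire overhead: 14 B header + 4 B FCS + 8 B preamble
# + 12 B inter-frame gap = 38 B on top of the payload.
def line_rate_pps(link_gbps, payload_bytes):
    wire_bits = (payload_bytes + 38) * 8
    return link_gbps * 1e9 / wire_bits

print(f"{line_rate_pps(10, 46):,.0f} pps")    # minimum-size frames (46 B payload)
print(f"{line_rate_pps(10, 1500):,.0f} pps")  # standard 1500 B MTU
```

At minimum frame size a 10 Gb/s link carries roughly 14.88 Mpps, and even at the standard MTU about 813 kpps - so small frames expose flow-control and per-packet processing limits long before the link is full.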


Can't state any numbers with respect to when there could have been a degradation - I only came across RIO in early 2021.

I don't see a problem with respect to NIC settings. Things like flow control have been disabled. Other settings (interrupts, receive buffers, ...) have also been adjusted appropriately.

However, I didn't look at anything between the NIC and my (user space) code, mainly because I'm not proficient at these levels. Nevertheless, the more I'm working on this problem, the more I'm suspecting some NDIS/driver level issue. How did you cope with the mentioned contention? How did you find/profile it in the first place?

The only obvious issue I'm seeing is related to the massive amount of CPU time spent on interrupt processing. Maybe it's that high due to the aforementioned NDIS level contention?

Edit: my own code has been profiled, with the result that it's basically not producing relevant load, especially when compared to the interrupt load.


I still would be interested in feedback to my previous response. Just a link on where to start reading/learning about profiling these driver level issues would be really appreciated.


Sorry, but I am going to bump this thread until someone reacts to it. Even a hint on how to get official support from MS would be appreciated - I will certainly not be able to pay for it, but maybe someone has an idea.


Bumping again...
I finally found something from MS that tries to show how performant Windows networking can/should be: https://github.com/microsoft/ctsTraffic/issues
It also supports RIO but actually uses it in a very simple manner (i.e. no use of the RIO_MSG_COMMIT_ONLY, RIO_MSG_DONT_NOTIFY or RIO_MSG_DEFER flags for the receive call). I am surely going to try its performance but would also like to hear about the experiences of other people.


Bumping again...
Unfortunately ctsTraffic (linked in the previous post) is unstable and its UDP support seems rather limited. Why is seemingly everything Microsoft produces focused on TCP?


The performance tests that were successful showed that a sensible UDP performance test is not possible, as the sending side is already unable to supply the necessary rate of standard Ethernet MTU packets. So either there is something seriously wrong with the system/NIC configuration (I highly doubt that the NIC side of things is a problem), or ctsTraffic is just slow, or there is some hidden problem.

The related GitHub issue for reference: https://github.com/microsoft/ctsTraffic/issues/11


You will need a Linux computer with iPerf to measure UDP performance.

mahnln-8089 answered:

Is this such a far fetched question that nobody is interested in getting to the bottom of these observations? Is RIO really that dead?

Unfortunately I am unable to test this on Windows Server based machines and I would be happy to at least get some feedback from someone with an appropriate setup.

If this is the wrong forum, please relocate this thread, as I was unable to pinpoint the right one myself.


I actually used RIO and it's really good for 10 Gb/s performance. I currently use it on Windows 10, but I'm facing an issue where running the same application multiple times across separate interfaces cuts the bandwidth instead of increasing it. I can't seem to run my RIO application 4x to fully saturate four 10 Gb/s links.


First, thank you for actually taking the time to reply to this question!

Could you elaborate on what you mean by "it's really good for 10Gb/s performance"? For example, how many packets per second can you receive in a stable manner (single and multi socket)? How large is your MTU? Are you going with TCP or UDP? Do you have an example/resource describing what you did to get to the point you are at?

It seems that you are also facing similar problems. Were you ever actually able to get your system to receive 40 Gb/s? Maybe through the use of a filter driver or another operating system under "lab conditions"?

pzdu-2129 answered:

So I don't think UDP performance has degraded; I think it has actually improved because of the new NDIS RSS improvements. I was able to increase rates from 10G -> 40G by tuning the RSS profiles. Registered IO works fine with packet reordering in the RSS queues.

The biggest thing that did help was having a CPU with strong single-thread performance.

And if that's still too slow, DPDK is making good progress on Windows.


Thanks for the answer.

Could you elaborate a bit on what you did with respect to tuning RSS queues? I am not able to find a setting related to reordering in the powershell RSS docs (https://docs.microsoft.com/en-us/powershell/module/netadapter/Set-NetAdapterRss?view=windowsserver2019-ps). Maybe there is something else I am missing?

While I would love to use DPDK I am not able to do so due to constraints of the project I am working on.


Do a little research: NDIS has packet reordering baked in, so you don't need to enable it yourself. You do have to change the RSS settings to match what you need, though.

Registered IO will be more than sufficient for what you need. Just implement it with a good ring buffer in your C code.


Believe it or not, I have done my fair share of research, including on RSS settings. It seems that I am missing some keyword, because nothing comes up that differs from the PowerShell material I linked to in my previous post.

So, I don't want a complete solution, I just need a keyword to push me in the right direction.

Or are these PowerShell commands what I need? I have worked with them, but maybe I gave up too early.


You are talking about receiving 40 Gb/s of traffic but are not stating the packet size used. If we assume jumbo frames, then 40 Gb/s results in around 560 kpps. While I assume that my setup would not be able to handle this (i.e. stability goes down the drain at around 300 to 400 kpps), it's still not what I would expect to be a "good" packet rate - especially if you are using hardware that really makes use of RSS queues (i.e. 8 to 16 and not the meagre 4 I am using).
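That jumbo-frame figure can be checked quickly (assuming a 9000-byte jumbo payload and standard Ethernet wire overhead):

```python
# 40 Gb/s with 9000-byte jumbo payloads: each frame occupies
# 9000 + 38 bytes on the wire (header, FCS, preamble, inter-frame gap).
wire_bits = (9000 + 38) * 8
pps = 40e9 / wire_bits
print(f"{pps:,.0f} pps")  # roughly 553k packets per second
```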

What MattJames-8087 stated in his comment is what I'm expecting to see. Not because my final application will work like this, but because I want my application to perform well and not waste too many cycles on inefficient IO. With what I have right now I can safely stream the 10 Gb/s using jumbo frames, but I assume that my solution is inefficient, at least at the RSS/NDIS configuration level, to the point where the rest of the application has no CPU time left (at least on around 4 to 6 year old hardware).
