I am seeing the following issue on an AKS Kubernetes cluster with >100 nodes: I have every node lock the same file, which is located on a shared AzureFile mount, and the 100th node to request the lock gets back errno 122 ("Disk quota exceeded"). (I am doing this so a 100-node computation platform can parse a dataset in parallel.)
volumes:
- azureFile:
    readOnly: false
    secretName: azure-secret
    shareName: aksshare
This always happens on exactly the 100th node to request the concurrent file lock, and I have never seen it with fewer than 100 nodes, so I am assuming there is some hard limit on the number of concurrent locks allowed.
Specifically, I was curious whether anyone has seen this, and whether there is some configuration setting that could be increased to allow more concurrent locks.
To simplify the scenario, I wrote a small C program (shown below) that reproduces the problem.
(Other data points:
* I did try having 100 pids on a single Kubernetes node lock the same file on an Azure Files mount, and that worked.
* I also had the 100 separate Kubernetes nodes each lock a hostPath file, and that worked. (I would expect that to work, since those are different files on each host, but it was a sanity check.)
)
Simple repro-program:
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    struct flock fltest = {0};   /* l_start/l_len of 0 => lock whole file */
    fltest.l_type = F_RDLCK;
    fltest.l_whence = SEEK_SET;

    int fd = open(argv[1], O_RDONLY, 0);
    printf("opened file: %s, fd:%d errno:%d\n", argv[1], fd, errno);
    if (fd < 0)
        return 1;

    int irc = fcntl(fd, F_SETLK, &fltest);
    printf("locked file: %s, %d errno %d\n", argv[1], irc, errno);
    if (irc == 0)
    {
        printf("Waiting 30 seconds while holding lock.\n");
        sleep(30);
    }

    fltest.l_type = F_UNLCK;
    irc = fcntl(fd, F_SETLK, &fltest);
    printf("Lock released irc:%d errno %d\n", irc, errno);
    return 0;
}