Knowing that my multithreaded code never shares file stream objects between threads, except for stdout/stdin/stderr, I modified all the stdio calls that operate on the session->out object to use the ${foo}_unlocked() version where available. I also changed the stream locking scheme to "by caller" with __fsetlocking(), which means no implicit locking is done even by functions that don't have an _unlocked() counterpart (fprintf, for example).
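For readers who want to try the same thing, here is a minimal sketch of the two techniques, assuming glibc (which ships stdio_ext.h). The helper names stream_disable_locking() and put_line_unlocked() are made up for illustration and are not taken from the actual module.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* expose the *_unlocked() declarations on glibc */
#endif
#include <stdio.h>
#include <stdio_ext.h>       /* __fsetlocking(), FSETLOCKING_* (glibc assumed) */

/* Hypothetical helper: hand locking of `fp` over to the caller.
 * Only safe if the stream is never used by two threads at once.
 * Returns the locking mode that was in effect before the call. */
static int stream_disable_locking(FILE *fp)
{
    return __fsetlocking(fp, FSETLOCKING_BYCALLER);
}

/* Hypothetical helper: write a string plus newline, one character at a
 * time, without taking the stream lock. Returns 0 on success, EOF on error. */
static int put_line_unlocked(FILE *fp, const char *s)
{
    for (; *s != '\0'; s++) {
        if (putc_unlocked((unsigned char)*s, fp) == EOF)
            return EOF;
    }
    return putc_unlocked('\n', fp) == EOF ? EOF : 0;
}
```

The idea would be to call stream_disable_locking() once, right after a per-session stream is opened; after that, even plain fprintf() on that stream skips the implicit lock.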
Below are the results of this change. pop00 is running the module with no implicit locking in glibc, and pop01 is running the previous version with the default locking behavior. I did not change any other stdio calls (such as the ones that operate on the courierpop3dsizelist file).
Note that this is during an extremely low load period (Sunday morning). The change has made the user CPU line on pop00 practically shadow the system CPU line. I will post updated graphs on Monday, which is the highest load period; so far the change appears to have a dramatic effect on user CPU utilization.
Also note that these pop servers are behind LVS, configured to distribute connections evenly across the two servers. Before the change, their CPU utilization was practically identical.
[CPU utilization graphs: pop00 | pop01]
Now, the interesting question is: how much GNU/Linux software is paying the penalty of implicit stdio locking where it is utterly useless? For those of you who use getc/fgetc/getchar with the blind assumption that it is optimized, macroified, or otherwise magically efficient, think again... we ain't in libc5 anymore.
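If a stream really is shared between threads, the per-character lock can still be avoided with the standard POSIX pattern: take the lock once with flockfile(), use getc_unlocked() inside the loop, then release it with funlockfile(). A small sketch, where count_lines() is a hypothetical example function and not something from the module:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* expose getc_unlocked()/flockfile() declarations */
#endif
#include <stdio.h>

/* Hypothetical example: count newline characters from the current
 * position to EOF, locking the stream once instead of once per getc(). */
static long count_lines(FILE *fp)
{
    long lines = 0;
    int c;

    flockfile(fp);                         /* one lock for the whole loop */
    while ((c = getc_unlocked(fp)) != EOF) {
        if (c == '\n')
            lines++;
    }
    funlockfile(fp);                       /* release before returning */
    return lines;
}
```

With plain getc() the same loop would lock and unlock the stream for every single character read.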
Additional graph of pop00 showing the cutover from the locking version to the non-locking version just after 6AM:
Update:
Some graphs from "Monday Morning Mail Madness". Monday morning is usually
our high water mark, due to new customers/migrations and the general trend of
Mondays being the busiest mail day of the week. The stratification is quite
impressive:
[CPU utilization graphs: pop00 | pop01]
Addendum: I considered the scaling of the connection counter hash as a possible cause, but the utilization doesn't spike only on user CPU, and the hash executes entirely in user space, so it is probably not the cause. We've also tried a different kernel since these tests, and it gave some dramatic improvements. We found a correlation between the nonlinear climb in user CPU and RPC retransmissions (the mailstore is on NFS...). Upgrading the kernel seems to have made handling the retransmission case more efficient, but after some time the problem started happening again: the new kernel had only shifted the curve over, and it was still nonlinear as our scale grew. Ultimately, we resolved the RPC retransmission CPU consumption problem by switching to NFS over TCP, which eliminated the problem altogether and didn't cause any other noticeable problems.
The next major improvement in scalability was discovering sigprocmask is a killer with lots of threads