It's been known for a while that courier's RETR/TOP function is very inefficient, but we had largely left it alone because the code is in production and "working". Since we are moving towards running POP and delivery on the same machine (so that POP can take advantage of the buffer cache populated by deliveries and hit the backend storage less), I have been making relatively minor changes to make the POP server more efficient.

Below is the RETR/TOP main I/O loop we have been running (still mostly courier's):

	for(lastc = 0; (c = getc_unlocked(f)) >= 0; lastc = c) {
#ifdef RETR_THROTTLE
		counter++;
#endif

		if(lastc == '\n') {
			if(lptr) {
				if(inheader) {
					if(c == '\n')	inheader=0;
				} else if((*lptr)-- == 0) break;
			}

			if (c == '.') {
				if(fprintf(session->out, ".") < 0) goto _output_error_free_path_close_f;
			}
		}

		if (c == '\n')	{
			if(fprintf(session->out, "\r") < 0) goto _output_error_free_path_close_f;
		}

		if(fprintf(session->out, "%c", c) < 0) goto _output_error_free_path_close_f;

#ifdef RETR_THROTTLE
		if(!(counter %= 4096)) nanosleep(&io_delay, NULL);
#endif
	}

Note that this is not entirely original courier code: it has an I/O throttle integrated, and it was changed to use getc_unlocked() in a previous optimization pass.

The switch to getc_unlocked() helped, but glibc still makes this a function call (if you read books like K&R's The C Programming Language, you would expect it to be a fast macro). So I changed the loop to stop using stdio for reading the message and to read() into a buffer instead. stdio is still used for writing to the output stream (I am saving that for another day, as it will probably mean changing the whole module to not use stdio anywhere for the output stream). Another change is dropping the format string from the output calls, which is wasteful when all you are writing out are individual chars. There are other minor changes too: for example, the I/O throttle no longer keeps a counter to modulo for the nanosleep, and instead sleeps once per read().
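To make the format-string point concrete, here is a minimal sketch of the two per-byte output styles side by side. The names emit_fprintf, emit_fputc and capture are made up for illustration and are not part of courier; the sketch assumes glibc (fputc_unlocked() is an extension there) and only shows that both styles emit identical bytes, it does not measure anything.

```c
#define _GNU_SOURCE	/* fputc_unlocked() is a glibc extension */
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Old style: fprintf() has to parse the "%c" format for every byte. */
static int emit_fprintf(FILE *out, const char *s, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (fprintf(out, "%c", s[i]) < 0) return -1;
	return 0;
}

/* New style: no format parsing and no per-call stream locking. */
static int emit_fputc(FILE *out, const char *s, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (fputc_unlocked(s[i], out) == EOF) return -1;
	return 0;
}

/* Hypothetical helper: run one emitter against a temp file and copy
 * what it wrote into res. Returns the number of bytes captured. */
static size_t capture(int (*emit)(FILE *, const char *, size_t),
                      const char *s, char *res, size_t cap)
{
	FILE *out = tmpfile();
	size_t n;

	assert(out != NULL);
	if (emit(out, s, strlen(s)) < 0) {
		fclose(out);
		return 0;
	}
	rewind(out);
	n = fread(res, 1, cap, out);
	fclose(out);
	return n;
}
```

Calling capture() once with each emitter on the same string should give byte-for-byte identical buffers; the difference between the two is purely the work done per call.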

Here is the new code:

	while((ret = read(f, outbuf, BUFSIZ)) > 0) {
		for(booty = 0; booty < ret; booty++) {
			c = outbuf[booty];

			if(lastc == '\n') {
				if(lptr) {
					if(inheader) {
						if(c == '\n') inheader = 0;
					} else if((*lptr)-- == 0) break;
				}

				if(c == '.') {
					if(fputc_unlocked('.', session->out) == EOF) goto _output_error_free_path_close_f;
				}
			}

			if(c == '\n') {
				if(fputc_unlocked('\r', session->out) == EOF) goto _output_error_free_path_close_f;
			}

			if(fputc_unlocked(c, session->out) == EOF) goto _output_error_free_path_close_f;

			lastc = c;
		}
		if(booty < ret) break;	/* inner loop hit the TOP line limit: stop reading too */
#ifdef RETR_THROTTLE
		nanosleep(&io_delay, NULL);
#endif
	}
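
For anyone who wants to poke at the loop outside the POP server, here is a stand-alone sketch of the same logic: LF to CRLF conversion, dot-stuffing, and the TOP line limit. The names send_message and filter_string are hypothetical, session->out is replaced by a plain FILE *, the throttle is left out, and lastc is assumed to start at 0 as in the old loop. It assumes glibc for fputc_unlocked().

```c
#define _GNU_SOURCE	/* fputc_unlocked() is a glibc extension */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from courier): copy a message from descriptor fd
 * to out, turning "\n" into "\r\n" and doubling a '.' that starts a line.
 * If lptr is non-NULL, stop once *lptr body lines have been sent (TOP).
 * Returns 0 on success, -1 on a read or write error.
 */
static int send_message(int fd, FILE *out, long *lptr)
{
	char buf[BUFSIZ];
	ssize_t ret = 0, i;
	int c, lastc = 0, inheader = 1, done = 0;

	while (!done && (ret = read(fd, buf, sizeof buf)) > 0) {
		for (i = 0; i < ret; i++) {
			c = (unsigned char)buf[i];

			if (lastc == '\n') {
				if (lptr) {
					if (inheader) {
						if (c == '\n') inheader = 0;
					} else if ((*lptr)-- == 0) {
						done = 1;	/* leave the outer loop too */
						break;
					}
				}
				if (c == '.' && fputc_unlocked('.', out) == EOF) return -1;
			}
			if (c == '\n' && fputc_unlocked('\r', out) == EOF) return -1;
			if (fputc_unlocked(c, out) == EOF) return -1;
			lastc = c;
		}
	}
	return ret < 0 ? -1 : 0;
}

/* Hypothetical test helper: feed a short string through send_message()
 * via a pipe and capture the output as a NUL-terminated string in res. */
static int filter_string(const char *in, long *lptr, char *res, size_t cap)
{
	int p[2], rc, made;
	FILE *out = tmpfile();
	size_t n;

	made = pipe(p);
	assert(out != NULL && made == 0);
	if (write(p[1], in, strlen(in)) < 0) return -1;	/* small input fits the pipe buffer */
	close(p[1]);
	rc = send_message(p[0], out, lptr);
	close(p[0]);
	rewind(out);
	n = fread(res, 1, cap - 1, out);
	res[n] = '\0';
	fclose(out);
	return rc;
}
```

Passing NULL for lptr behaves like RETR (the whole message is sent); passing a pointer to a line count behaves like TOP, sending the headers, the blank line, and at most that many body lines.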

And here are the results. pop00 runs the new code and pop01 the old; both are identical machines behind LVS, which splits connections 50/50 between them. Aggregate connections at the peak of the day (~11am) are ~16,000/minute.
[Figures: two graphs for pop00 (new code), two graphs for pop01 (old code)]