Coding Challenge I

Pages: 1 2 3 4 5 6 7 8 9

Solution from CD

Here’s my code challenge entry. The compiler is MSVC6 SP5. I did all the work in the IDE, but it can also be compiled from the command line using: cl pi.c /G6 /O2 /FAs

I experimented with some optimization settings, but since they didn’t make much of a difference, I just stuck with the default optimize for speed setting and told it to target P6 in the code generation section. I didn’t notice any significant speed difference between P6 and blend, but I did see it aligning the loop entry points in the assembly listing as is good for the P6 and perhaps some other modern processors.

I only implemented two basic optimizations because although other things were possible, I didn’t see much point in spending time on anything not in the inner loop after seeing how many more times that inner loop is executed compared to everything else.

The first optimization was switching most of the computations to use unsigned ints instead of signed ints as the values do not go negative (p will underflow on the last iteration, but this value is not used since the loop then exits). Using unsigned instead of signed allows the compiler to use div instructions instead of idiv instructions. div is faster than idiv on some processors. It also allows it to use an xor reg, reg instead of cdq instruction to set up edx preceding the div, which may also make a difference on some processors. Actually, both of these don’t really make a difference on my P6, but div is a bit faster than idiv on P5 (but they’re old) and a little faster on K7. xor instead of cdq is better for P5, but is most likely insignificant for P6 and K7. I’m not really sure how these two things may or may not affect P7, but they should either be the same or better.

I didn’t notice the second optimization making much of a difference, but I slid all of the values in the array a one place and made it one entry larger so that the loop counter i could be used more efficiently inside the inner loop. The new one only uses just i instead of i and i + 1. I think it was good for getting one instruction out of the inner loop, which didn’t make a noticeable difference on my P6 because of the divide latency.

What’s more interesting is that I discovered that there are two different MSVC6 SP5s (well, really an optional update…). My system at home generated one div instructions in the inner loop, but when I tried it on my system at work with (what I thought was) the same compiler, I noticed that it generated two div instructions in the inner loop. I discovered that the difference was the Processor Pack update I had installed at home, but not at work. See http://msdn.microsoft.com/vstudio/downloads/ppack/default.asp for more information. It turns out this updates a few more things in the compiler than just support for new instructions. One big thing is of course the one div instead of two div instructions with this program. So I guess you can call my entry MSVC6 SP5 plus Processor Pack now. I was interested in what else changed, so I recompiled a big project at work and took a quick look through some of the assembly listings to see what else might be different. Here are a few things I noticed:

  • I saw one place where it lifted a common load above an if/else instead of doing it twice.
  • Some Pentium 4 friendly code generation. shl reg, 1 was replaced with add reg, reg.
  • Some string intrinsics were significantly different.

#include <stdlib.h>
#include <stdio.h>

void ComputePi(int numdigits, char* pi) {
	int alength = 10 * numdigits / 3;
	unsigned int* a = (unsigned int*) malloc((alength + 1) * sizeof(unsigned int));
	int piLength = 0;
	int nines = 0;
	int predigit = 0;
	int i, j;

	for (i = 0; i < (alength + 1); ++i) {
		a[i] = 2;
	}

	for (j = 0; j < numdigits; ++j) {
		unsigned int q = 0;
		unsigned int p = 2 * alength – 1;

		i = alength + 1;
		while (–i > 0) {
			unsigned int x = 10*a[i] + q*(i);
			a[i] = x % p;
			q = x / p;
			p -= 2;
		}

		a[0+1] = q % 10;
		q /= 10;
		if (q == 9) {
			++nines;
		}
		else if (q == 10) {
			int k;

			pi[piLength] = (char)(predigit + 1 + ‘0’);
			for (k = 1; k <= nines; ++k) {
				pi[piLength + k] = ‘0’;
			}
			piLength += nines + 1;
			predigit = 0;
			nines = 0;
		}
		else {
			int k;

			pi[piLength] = (char)(predigit + ‘0’);
			predigit = q;
			for (k = 1; k <= nines; ++k) {
				pi[piLength + k] = ‘9’;
			}
			piLength += nines + 1;
			nines = 0;
		}
	}
	pi[piLength] = (char)(predigit + ‘0’);
	pi[piLength+1] = ‘\0’;

	free(a);
}

int main(int argc, char** argv) {
	int numdigits;
	char* pi;

	if (argc <= 1) {
		fprintf(stderr, “usage: pi #DIGITS [FILE]”);
		return 1;
	}

	numdigits = atoi(argv[1]);
	pi = (char*) malloc(numdigits+1);
	ComputePi(numdigits, pi);

	if (argc > 2) {
		FILE* fp = fopen(argv[2], “w”);
		if (fp == NULL)
		{
		   fprintf(stderr, “Cannot open %s\n”, argv[2]);
		   return 2;
		}
		fputs(pi, fp);
		fputc(‘\n’, fp);
		fclose(fp);
	}
	else {
		puts(pi);
	}

	free(pi);

	return 0;
}

Pages: « Prev   1 2 3 4 5 6 7 8 9   Next »

Discuss (16 comments)