19 Feb 2022
The 1999 revision to the C programming language brought several interesting changes and additions to the standard. Among those are variable length arrays, or VLAs for short, which allow array types to have lengths that are determined at runtime. These can show up in different contexts, but for the purposes of this post we will be looking at VLAs as parameters to functions. For example:
void f(int n, float v[n]);
Since array parameters decay to their corresponding pointer type, that declaration is functionally equivalent, and compatible (6.7.5.3p7) to the following:
void f(int n, float *v);
The length of v
in the first declaration is given by the value of a previous
parameter of the function, which can only be known at runtime. However, in a
function prototype (i.e. not a function definition) like the above, the length
of a VLA parameter is never computed and essentially ignored (6.7.5.2p5).
But in a function definition, the length is evaluated and must be greater than
zero. In practice this is rather useless because the sizeof operator doesn't
work as one might expect it to with array parameters (6.5.3.4-note88):
void f(int n, float param[n]) {
float local[n];
int a = sizeof local,
b = sizeof param;
printf("n = %d\n", n);
printf("sizeof local = %d\n", a); // this will be n * sizeof(float)
printf("sizeof param = %d\n", b); // but this will be sizeof(float *)
}
In other words, using sizeof with a VLA will evaluate the runtime-known length
and calculate the array size based on that, except when that VLA is a function
parameter, then, for whatever reason, it is the size of the corresponding
pointer type (in this example, sizeof local === n * sizeof(float)
but sizeof param == sizeof(float *)
). The length of a VLA parameter may be used when e.g.
computing indices when accessing multi-dimensional variable length arrays.
Alas, the standard mandates that the variable array length be computed when
the function is called. Of course, the expression in between the square
brackets is not limited to simple expressions like n
, so one can write
something like:
void f(int a, int b, int m[(a + b) / 2]) {}
or
void f(int x, int b[abs(x)]) {}
or even
void f(int v[getchar()]) {}
The following program should give you an idea of the kind of constructs that these rules allow for (try it on compiler explorer):
int main(int argc, char *argv[printf("Hello")])
{
printf(" world!\n");
}
The length expression of the VLA is evaluated before any of the statements in
main
. I couldn't find anywhere in the standard saying whether this evaluation
order is well-defined but it is what clang and gcc do, and luckily, it does not
matter for the sake of this article, as we will see shortly.
Let us refer to the subset of C99 where function bodies must be empty as disembodied C. You might naturally ask yourself what things can be accomplished in this subset (though you can probably guess the answer from the title of this page). In other words, what can you do in C if you are limited to just evaluating expressions and no statements?
Using the comma operator, we can sequence arbitrarily many expressions for their side effects, so
void f() {
printf("Press enter to confirm: ");
getchar();
printf("thanks.\n");
}
becomes
void f(char _[(
printf("Press enter to confirm: "),
getchar(),
printf("thanks.\n"),
1
)]) {}
In a comma expression, the operands are evaluated left-to-right and the value
of the last operand is the resulting value of the whole expression. The 1
at the end ensures that the evaluated array length is >0 (to avoid UB).
Functions in disembodied C are going to need a dummy parameter where the VLA
length expression evaluation can take place. For consistency, we will denote
it as char _[...]
and give it the value ""
(the empty string) when calling
said functions (note that the value we give it doesn't actually matter, though
its size should be at least as big as the computed VLA size to avoid UB).
If-else statements can be replaced with ternary conditional expressions, such that
void f(int n) {
if (n < 0)
printf("negative!");
else if (n > 0)
printf("positive!");
else
printf("zero!");
}
becomes
void f(int n, char _[(
(n < 0) ?
printf("negative!")
: (n > 0) ?
printf("positive!")
:
printf("zero!")
, 1
)]) {}
Remember that the VLA length expression can access previous function arguments, so parameter passing is essentially unchanged.
We cannot return values, but we can use out parameters by taking advantage of the fact that assignments are expressions in C, so instead of
int imax(int a, int b) {
return a > b ? a : b;
}
we can write
void imax(int *out, int a, int b, char _[
(*out = a > b ? a : b),
1
]) {}
We cannot define local variables inside of expressions, but we can just add extra function parameters to use as temporaries, rewriting
void fswapf(float *a, float *b) {
float tmp = *a;
*a = *b;
*b = tmp;
}
as
static void fswapf_aux(float *a, float *b, float tmp, char _[(
tmp = *a,
*a = *b,
*b = tmp,
1
)]) {}
void fswapf(float *a, float *b, char _[(
fswapf_aux(a, b, 0, ""), 1
)]) {}
Alternatively, if re-entrancy and thread-safety are disregarded, we could just use global (static) variables.
What about loops? Clearly we cannot use while
or for
inside expressions.
Thankfully, they are unnecessary thanks to recursion. For example:
float sum(float *v, size_t n) {
float sum = 0.0f;
for (size_t i = 0; i < n; ++i)
sum += v[i];
return sum;
}
can be expressed as
/* the forward declaration is necessary */
static void sum_aux(float *out, float *v, size_t n, char *);
static void sum_aux(float *out, float *v, size_t n, char _[(
(n > 0) ? (
*out += *v,
sum_aux(out, v + 1, n - 1, ""),
1
) : 1
)]) {}
void sum(float *out, float *v, size_t n, char _[(
*out = 0.0f,
sum_aux(out, v, n, ""),
1
)]) {}
In fact, any imperative-style loop can be turned into an equivalent recursive loop (as any functional programmer will be happy to demonstrate), though since C lacks closures or any form of anonymous functions, it can get quite unwieldy (I hope you like auxiliary functions).
The astute reader might point out that these two versions of sum
are not
equivalent because the recursive definition may cause a stack overflow for
large enough values of n
. This is unfortunately true, and the major hurdle
for the practicality of disembodied C, but does not preclude Turing
completeness (an ideal Turing machine has infinite memory at its
disposal). Luckily, modern compilers are smart, and if we write our functions
carefully to be tail recursive, they will often be able to perform tail
call elimination, removing the risk of a stack overflow. For the above
example, both gcc and clang are able to optimize the tail call (see on
compiler explorer).
Although not relevant to standard C99, in case you had the idea of using gcc statement expressions to bypass these limitations, the compiler will stop you right on your tracks since it doesn't allow those outside of function bodies.
After all of that, it seems that basically anything can be done in disembodied C, so, is there something that can't be expressed in this C99 subset?
With the rules that we've laid down so far, yes. In particular:
It is not possible in the general case to use APIs that require callbacks.
For example, look at the standard qsort
function:
void qsort(void *base, size_t nmemb, size_t size,
int (*compar)(const void *, const void *));
The compar
parameter is a pointer to a function that should return the
result of comparing two values in the given array. Since its parameters are
void pointers, we can't have the VLA argument we need to perform computations
(arrays of void
are not valid C), and we also cannot provide an int
return value. Thus there is no way (that I know of) to use this function in
standards-compliant disembodied C.
This is not a dealbreaker since we could just reimplement qsort
;).
We cannot access the program arguments. The entry point of a program in disembodied C looks like this:
int main(int argc, char *argv[ /* code here ... */ ]) {}
In the VLA length expression for argv
, argc
is in scope, but argv
is
not.
Fortunately there is a way to work around this: we can use the common
extension where main
can take a third argument which is a pointer to the
environment. According to the standard (J.5.1), this is implemented in
hosted environments. In practice, POSIX environments and Microsoft both
support it. Then, our entry point looks like this instead:
int main(int argc, char **argv, char *envp[/* ... */]) {}
Now argv
is in scope within the envp
array length expression.
Side note: even though we can't return
, main
is the exception to the rule
that reaching the closing }
of a function returning non-void is verboten,
and is equivalent to returning zero (5.1.2.2.3). If we need to terminate
with an exit code other than zero, we can use exit()
.
It seems that with enough effort anything can be done with these peculiar restrictions. As an example, I wrote implementations of the MD5 and SHA256 hash functions as well as Conway's Game of Life using SDL for graphics. I was also working on a more interesting and "useful" program but I kind of lost motivation halfway through. I'll update this article if I ever come back to finish that.
In case it wasn't clear enough, there is zero practical reason to ever do any of this. I hope you'll agree it's a fun party trick though :).