Quines
This is a blog post detailing my speaker notes for a session on quines.
quines
A quine is a program which prints its own listing. It's not a hack per se, it's more of a mathematical property of any turing complete language. It's based on the fixed point theorem.
It depends on nothing…. there are programs able to print their own values, and they are very efficient at it. Since it's a consequence of a mathematical theorem, the interest often lies in being able to write the smallest quine. Constructing a trivial quine is often a matter of patience, since the fixed point theorem also allows us to be able to find such a representation.
A first attempt at creating a quine
The central idea is to be able to use some data to represent the code being printed, while also including the data in the printed out format as well.
For C, we need a string, to, well, print a string. More importantly, we need a string to print itself.
#include <stdio.h>
int main (void) {
char *s="#include <stdio.h>\n\nint main (void) {\n";
printf(s); printf("char *s=\"%s\";\n",s);
}
We can use a lot of backslashes to work with the escaped backslash issue, or we can use characters to print them.
#include <stdio.h>
int main(void) {
char *s1="#include <stdio.h>%c%cint main(void) {%c";
char *s2=" char *s1=%c%s%c;%c char *s2=%c%s%c;%c";
char n='\n', q='"';
printf(s1,n,n,n,n,n);
printf(s2,q,s1,q,n,q,s2,q,n);
}
We're somewhat there. This might seem like a never ending procedure, but it ends.
#include <stdio.h>
int main(void) {
char *s1="#include <stdio.h>%c%cint main(void) {%c";
char *s2=" char *s%c=%c%s%c;%c char *s%c=%c%s%c;%c";
char *s3=" char n='%cn', q='%c', b='%c%c';%c";
char *sp=" printf(";
char *s4="%ss1,n,n,n);%c";
char *s5="%ss2,'1',q,s1,q,n,'2',q,s2,q,n);%ss2,'3',q,s3,q,n,'p',q,sp,q,n);%c";
char *s6="%ss2,'4',q,s4,q,n,'5',q,s5,q,n);%ss2,'6',q,s6,q,n,'7',q,s7,q,n);%c";
char *s7="%ss2,'8',q,s8,q,n,'9',q,s9,q,n);%ss2,'0',q,s0,q,n,'x',q,sx,q,n);%c";
char *s8="%ss3,b,q,b,b,n);%ss4,sp,n);%ss5,sp,sp,n);%c";
char *s9="%ss6,sp,sp,n);%ss7,sp,sp,n);%ss8,sp,sp,sp,n);%c";
char *s0="%ss9,sp,sp,sp,n);%ss0,sp,sp,n,n,n);%c return 0;%c}%c";
char *sx="--- INTERESTING STUFF ABOUNDS ---";
char n='\n', q='"', b='\\';
printf(s1,n,n,n);
printf(s2,'1',q,s1,q,n,'2',q,s2,q,n); printf(s2,'3',q,s3,q,n,'p',q,sp,q,n);
printf(s2,'4',q,s4,q,n,'5',q,s5,q,n); printf(s2,'6',q,s6,q,n,'7',q,s7,q,n);
printf(s2,'8',q,s8,q,n,'9',q,s9,q,n); printf(s2,'0',q,s0,q,n,'x',q,sx,q,n);
printf(s3,b,q,b,b,n); printf(s4,sp,n); printf(s5,sp,sp,n);
printf(s6,sp,sp,n); printf(s7,sp,sp,n); printf(s8,sp,sp,sp,n);
printf(s9,sp,sp,sp,n); printf(s0,sp,sp,n,n,n);
return 0;
}
This is a relatively bad quine. Better ones are yet to come. Also, note the fact that we're not
using sx as a format string anywhere.
The principle of writing quines in imperative programming
The basic idea goes as follows:
"It is impossible in most programming languages for a program to manipulate itself."
In the sense, atleast the textual representation. We make it possible anyway. We write the program in two parts, one which is the code, and the other which is the data representing the code.
The code uses data to print the code, and then uses the data to print the data. Obviously it has to manipulate the data first, before it tries to print the data itself. However, it's perfectly doable.
This is a great example. This is also the sort of quine you'd get from directly applying the fixed point theorem.
The fixed point theorem
This is the quintessential Kleene's Fixed Point Theorem. There are many in math, this is one of them. More importantly, this is central in the theory of computability.
What this essentially says is that given a total function, we can construct/find a program that'll do the same even when the function is applied to it.
The notion of fixpoint itself comes from lattice theory, where a function has pre-fixpoints and post-fixpoints, both of which either converge to the top, or to a fixpoint. Now, I'm not gonna argue over whether top is a fixpoint or not.
- Computable functions: A computable function with several integer inputs is one that can be calculated by some program. Simply put, you can write a function using an "algorithm".
- Partial functions: A partial function is one which gives outputs for some elements in the domain, but not necessarily all elements in the domain.
- Total functions: A function which always produces an output for any value in its domain.
A good example of partial function would be this:
int mod3(int x) {
if (x % 3 == 0) {
while (1) {}
} else {
return x % 3;
}
}
Please don't be pedantic and tell me this is an algorithm. I know. We're coming to that.
- Algorithm: Anything that computes a function. As mentioned above, this code block is an algorithm.
A listing of programs
There is a very trivial way to list all the programs. When I say a listing, I mean, for a program, associate it with a unique number, and the number with a unique program. Not all numbers have to be in the listing though, we can only use numbers convenient to us.
A very trivial way to do this for computer programs is to convert their ascii code into decimal, since their ascii code is technically already a number in base 256.
The universality theorem
This basically states that there exists an "algorithm" that can run any other algorithm. This is made possible by, well, a lot of reasons, but primarily because algorithms can be fed as input and read by this new machine.
This is like an interpreter. Given the input, along with the program, it will be able to run the program on the input. A good example will be a shell. A shell can run a program that can run on its own as well.
./file.sh
bash ./file.sh
The s-m-n theorem
This is preceisely the opposite of universality theorem. It states that a program which takes in k arguments can be used to generate a function that takes in k - 1 arguments and uses a hardcoded value for the first one. In essense, we created a new program that may or may not use the old program to compute the value. More formally:
If there's an algorithm n (represented as a number), we parse the actions of n and then use it with a hardcoded value y to construct g (represented as a number yet again) to run n with g and the k - 1 arguments.
The proof
We basically prove the following, for any computable total function \(h\), there exists an index/program such that \(n(\dots) = h(n)(\dots)\)
What is the function \(h\) in our case? It's the print function. Basically, one which takes the program and produces a new one that prints it. So, what does \(n\) do? It does the same thing that \(h(n)\) does. Print the listing of \(n\).
Now, the proof:
For a given program \(t\), we consider \(s(t, t)\), which is given by the s-m-n theorem. It performs what \(t\) would do when given itself as an input. Now, we have, from the universality theorem, a program \(m\) that takes \(t\) as the input and does what \(h(s(t,t))\) would have done.
I'm somewhat bending what the universality theorem says, but you get the essence. This is also perfectly provable. Using the same ideas that are used for universality theorem.
It follows that \(s(m, m)\) is the fixpoint we're looking for. By definition of \(s\), this is equivalent to \(m(m,\dots)\), and by the definition of \(m\), this is equal to \(h(s(m, m))\).
Now the existence of \(m\) in the proof using universality theorem may make it seem that we need an interpreter for this. But if you understand carefully, we only need an interpreter for programs of the form \(h(\dots)\), which is to say we need an interpreter that takes the arguments and evaluates it while also printing it.
We take a program \(t\), which takes an arguement. This argument is given by the variable data to the
program. \(s(t,t)\) is the program obtained by setting the data to the representation of the program.
Now, \(h(s(t, t))\) does nothing but print the listing of the program \(t\) with the data set to itself.
We're almost at the edge of what we want, theoretically atleast. We take \(m(t,\dots)\) and basically use
\(s(m,m)\) to create a sort of self-reference. This idea is very central to the theory of
computability. In fact, an alternate proof of this is also obtained via the venerated diagonalisation
argument.
This also has other amusing applications. Particularly, the function \(h\) need not just be printing the program. It can be any total function in existence, and the proof would still hold (construction, however, is a different story).
Quines in Biology - Introns and saRNA
Remember this example?
#include <stdio.h>
int main(void) {
char *s1="#include <stdio.h>%c%cint main(void) {%c";
char *s2=" char *s%c=%c%s%c;%c char *s%c=%c%s%c;%c";
char *s3=" char n='%cn', q='%c', b='%c%c';%c";
char *sp=" printf(";
char *s4="%ss1,n,n,n);%c";
char *s5="%ss2,'1',q,s1,q,n,'2',q,s2,q,n);%ss2,'3',q,s3,q,n,'p',q,sp,q,n);%c";
char *s6="%ss2,'4',q,s4,q,n,'5',q,s5,q,n);%ss2,'6',q,s6,q,n,'7',q,s7,q,n);%c";
char *s7="%ss2,'8',q,s8,q,n,'9',q,s9,q,n);%ss2,'0',q,s0,q,n,'x',q,sx,q,n);%c";
char *s8="%ss3,b,q,b,b,n);%ss4,sp,n);%ss5,sp,sp,n);%c";
char *s9="%ss6,sp,sp,n);%ss7,sp,sp,n);%ss8,sp,sp,sp,n);%c";
char *s0="%ss9,sp,sp,sp,n);%ss0,sp,sp,n,n,n);%c return 0;%c}%c";
char *sx="--- INTRONS GO BRRR ---";
char n='\n', q='"', b='\\';
printf(s1,n,n,n);
printf(s2,'1',q,s1,q,n,'2',q,s2,q,n); printf(s2,'3',q,s3,q,n,'p',q,sp,q,n);
printf(s2,'4',q,s4,q,n,'5',q,s5,q,n); printf(s2,'6',q,s6,q,n,'7',q,s7,q,n);
printf(s2,'8',q,s8,q,n,'9',q,s9,q,n); printf(s2,'0',q,s0,q,n,'x',q,sx,q,n);
printf(s3,b,q,b,b,n); printf(s4,sp,n); printf(s5,sp,sp,n);
printf(s6,sp,sp,n); printf(s7,sp,sp,n); printf(s8,sp,sp,sp,n);
printf(s9,sp,sp,sp,n); printf(s0,sp,sp,n,n,n);
return 0;
}
The string sx is a very specific string over here, something that quine writers and geneticists alike
call an intron.
Introns
An intron is any noncoding sequence in a gene that are removed during the process of protein synthesis. They're not there in the final protein codings.
Their reason for being there is essentially to serve as markers for transcription and splicing purposes.
The reason introns can exist is as follows:
When we use the function \(s\) from The s-m-n theorem, we basically link the data as hardcoded into the program, which was otherwise just input. However, that function is free to add anything else apart from the data as well, which is where such bits come from. Note that they're not functional. They can be replaced with anything you want and the result will be the same.
saRNA
Self-replicating RNA is a biological equivalent of whatever we've been doing so far. It contains two Open Reading Frameworks (contiguous sequence from a start codon to the next stop codon):
- Replicase Machinery ORF: This part encodes non-structural proteins. These proteins eventually form a polymerase that's used for replicating the RNA. This is the encoding of the RNA's code, aka, the RNA polymerase that creates more of it.
- Gene of interest ORF: This is effectively like an intron, this is extra data that's produced.
Bootstrapping and Self-Healing
A quine is a bunch of data and an active part - the code. Now, there might be multiple programs that have the same behaviour on the same input (think sorting algorithms). So, if you can get an equivalent code that does the same thing, you'll be able to extract exactly the precise quine that the data represents. This implies that quines are self-healing. If the code used to process the data is changed, as long as it can still process the data, it can be recovered.
For other fancier functions, it depends. For instance, here is a program that includes the key to decryption in the code. Without that, needless to say, the program cannot be used.
We can classify the code into two sections based on this: key code which performs the basic functions needed for interpreting the encoding of data, and irrelevant code which can be removed, and subsequently healed/bootstrapped by the existing key code.
Contrary to the code, the data, once lost, is lost forever. Unless, of course, it is an intron.
Using Homoiconicity for quines
Consider the following quine in emacs-lisp
(defun line-write (x) (message "%s" (prin1-to-string x)))
(defun d (l) (mapc #'line-write l))
(defun mid () (message "(do '("))
(defun end () (message "))"))
(defun do (l) (d l) (mid) (d l) nil)
(do '(
(defun line-write (x) (message "%s" (prin1-to-string x)))
(defun d (l) (mapc #'line-write l))
(defun mid () (message "(do '("))
(defun end () (message "))"))
(defun do (l) (d l) (mid) (d l) nil)
))
nil
This is like the regular good old quine we've been writing so far.
It prints the result to the *Messages* buffer (which is why the nil
quirk at the end).
However, we can do better with languages like lisp, which allow for homoiconicity. Since the code we write can literally be interpreted as data (it's a polymorphic list after all), and vice-versa (a polymorphic list is equivalent to a function call with the first element as a function), we have this:
(defvar x '(
progn
(message "(defvar x '(")
(mapc (lambda (term) (prin1 term) (message "")) x)
(message "))")
(message "")
(message "(eval x)")
(message "")
nil
))
(eval x)
nil
This literally encoded the code as data, and since elisp can parse data as code, we save a lot more effort than anything else. While we showed that the expressive power of a language that cannot manipulate its AST is the same as one that does, this is still much more amazing in my opinion.
A similar quine in bash.
Other interesting "quines"
These aren't quines per se, their function \(h\) isn't the printing function. However, we can use different \(h\) to produce some interesting programs. This one in elisp for example, 'interprets' its own listing. Basically saying, it starts interpreting itself, which interprets itself, which interprets itself, and so on…
It does, however, lead to an error eventually (emacs keeps creating new objects and can't collect fast enough).
((lambda (expr) (eval `(,expr ,expr)))
'(lambda (expr) (eval `(,expr ,expr))))
