Rewriting cat using other utilities
2023-09-16
Last week I installed Plan 9 on my new RPi 400. I had wanted to try it for a
long time, but I never had the hardware for it.
After some hours of exploring the system, its internals, its philosophy and its
code, I realized how simply the core utilities are implemented. In this article
I will focus on cat(1).
For reference, this is the code:
#include <u.h>
#include <libc.h>

void
cat(int f, char *s)
{
	char buf[8192];
	long n;

	while((n=read(f, buf, (long)sizeof buf))>0)
		if(write(1, buf, n)!=n)
			sysfatal("write error copying %s: %r", s);
	if(n < 0)
		sysfatal("error reading %s: %r", s);
}

void
main(int argc, char *argv[])
{
	int f, i;

	argv0 = "cat";
	if(argc == 1)
		cat(0, "");
	else for(i=1; i<argc; i++){
		f = open(argv[i], OREAD);
		if(f < 0)
			sysfatal("can't open %s: %r", argv[i]);
		else{
			cat(f, argv[i]);
			close(f);
		}
	}
	exits(0);
}
I was surprised that I understood what the code does within the first minute of
reading it. I cannot say the same about the GNU implementation of cat(1). With
all due respect, it's full of crap that makes the utility harder to grasp than
it should be.
That made me wonder: is it possible to take a modern implementation of cat(1),
let's say the one from OpenBSD, strip it of all its flags, and get the same
functionality back by piping into other system utilities? For reference, this
is the man page of OpenBSD cat(1) [1].
The following are the results of this exercise:
cat -b : awk '{if (NF) { c++; print c $s } else { print $s }}'
cat -e : sed -n 'l'
cat -n : awk '{print NR $s}'
cat -s : sed '/^$/N;/^\n$/D'
cat -t : sed 's/ /^I/g' # the white space is a tab
cat -u : N/A
cat -v : sed -n 'l' | sed 's/.$//'
1. -b Number the lines, but don't count blank lines.
This can easily be achieved using awk(1) and a simple if statement. There is no
need to put additional code in cat(1) if another tool already does the job.
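As a rough sketch of the awk approach, the snippet below wraps the one-liner in
a shell function and feeds it three lines. The function name number_nonblank is
made up for illustration, and the single space after the number is an
assumption; real cat implementations lay the number out differently.

```shell
# number_nonblank is a hypothetical helper name, not part of any real cat.
# NF is the number of fields on the line; it is 0 for an empty line.
number_nonblank() {
    awk '{ if (NF) { c++; print c " " $0 } else { print $0 } }'
}

printf 'first\n\nsecond\n' | number_nonblank
# prints:
# 1 first
#
# 2 second
```

One caveat: NF is also zero for lines that contain only spaces or tabs, so this
sketch leaves them unnumbered, which may not match what cat -b does with such
lines.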
2. -e Print a dollar sign ('$') at the end of each line. Implies the -v option
to display non-printing characters.
This uses sed(1) with a simple command, which is documented as follows:
(The letter ell.) Write the pattern space to the standard output in a visually
unambiguous form.
We can take this as another wording for the cat(1) -v option:
Displays non-printing characters so they are visible.
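To make the comparison concrete, here is a small sketch of what sed's l command
prints for a line containing a tab. The function name show_ends is hypothetical.
The match with cat -e is only approximate: sed writes backslash escapes such as
\t and three-digit octal codes, while cat -v uses the ^X and M-x notation, cat -e
on its own leaves tabs alone, and sed folds long lines.

```shell
# show_ends is a hypothetical helper name used only for this illustration.
# -n suppresses the default output so that only the 'l' rendering is printed.
show_ends() {
    sed -n 'l'
}

printf 'one\ttwo\n' | show_ends
# prints: one\ttwo$
```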
3. -n Number the output lines, starting at 1.
Again, a simple awk script.
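If the output should look closer to what cat -n usually prints, awk's printf can
format the number. This is a sketch that assumes the common layout of a
right-aligned, six-column line number followed by a tab; the function name
number_lines is made up for illustration.

```shell
# number_lines is a hypothetical helper name.
# Assumed layout: line number right-aligned in 6 columns, then a tab, then the line.
number_lines() {
    awk '{ printf "%6d\t%s\n", NR, $0 }'
}

printf 'a\nb\n' | number_lines
```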
4. -s Squeeze multiple adjacent empty lines, causing the output to be single
spaced.
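This sed script is a bit more complicated. When the current line is empty, N
appends the next input line to the pattern space. If the pattern space then
holds nothing but a single newline, meaning two empty lines in a row, D deletes
up to and including that newline and restarts the cycle with what is left, so a
run of empty lines collapses into one.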
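The effect of the N and D pair is easier to see on a concrete input. The sketch
below runs the same sed script over a text with three consecutive empty lines;
the function name squeeze_blank is hypothetical.

```shell
# squeeze_blank is a hypothetical helper name.
# /^$/N     on an empty line, append the next input line to the pattern space
# /^\n$/D   if that leaves only a newline (two empty lines), drop the first
#           one and restart the cycle without reading new input
squeeze_blank() {
    sed '/^$/N;/^\n$/D'
}

printf 'a\n\n\n\nb\n' | squeeze_blank
# prints:
# a
#
# b
```

The input deliberately ends with a non-empty line: what N does when there is no
next line to read differs between sed implementations.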
5. -t Print tab characters as '^I'. Implies the -v option to display
non-printing characters.
Again, a simple sed script.
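A literal tab inside a sed expression is easy to lose when copying text around,
so this sketch builds the tab with printf first and splices it into the
expression. The function name show_tabs and the variable tab are made up for
illustration, and the sketch covers only the tab replacement, not the implied
-v behaviour.

```shell
# show_tabs and tab are hypothetical names used only for this illustration.
# printf '\t' produces a real tab character that is spliced into the sed script.
tab=$(printf '\t')

show_tabs() {
    sed "s/$tab/^I/g"
}

printf 'a\tb\n' | show_tabs
# prints: a^Ib
```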
6. -u The output is guaranteed to be unbuffered (see setvbuf(3)).
I found no solution for this. There is GNU stdbuf, but it's full of nasty hacks
and, because it relies on LD_PRELOAD, it does not work with statically linked
or setuid binaries. There is also expect/unbuffer [2], but it's written in Tcl
and I don't think OpenBSD ships a Tcl toolchain out of the box.
I guess calling setvbuf from cat and exposing -u as a flag doesn't hurt that
much.
7. -v Displays non-printing characters so they are visible.
Same as -e, but we also drop the last character from each line, which is the
'$'. I don't understand why the dollar sign is not printed with -v as well.
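A sketch of the two-stage pipeline, with one deliberate change from the
one-liner in the table: instead of 's/.$//', which drops whatever the last
character happens to be, it removes only a trailing '$'. That matters when sed
folds a long line, because the folded pieces end in a backslash rather than a
dollar sign. The function name show_nonprinting is hypothetical.

```shell
# show_nonprinting is a hypothetical helper name.
# The first sed renders the line unambiguously and marks its end with '$';
# the second sed strips that marker and nothing else.
show_nonprinting() {
    sed -n 'l' | sed 's/\$$//'
}

printf 'a\001b\n' | show_nonprinting
# prints: a\001b
```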
In conclusion, the cat implementation can be as short as 35 lines (let's say 40
if we also add -u to the implementation), and we can extend it using only two
system utilities, awk and sed.
"Modern" cat implementations are not following the Unix philosophy of "do one
thing and do it well". From [3]: `c_a_t_ -- concatenate and print` is what cat
was designed to do. One can argue that printing can be customized in multiple
ways, as we already saw in OpenBSD cat, but this conflicts with the philosophy
mentioned above.
On the other hand, GNU's (secondary) philosophy is to make utilities as easy to
use as possible. This means that they don't care if they embed the same code in
two utilities as long as their utilities are easy to use. Writing the sed or
awk scripts from my examples is not straightforward, I have to admit that. But
maybe they should not be easy to write, at the end of the day.
Another conclusion is that I don't want to stop here with minimizing the system
utilities and discovering existing alternatives. So I will continue this
journey by analyzing other system utilities in the same way I did in this
article.
Update 2023-09-19:
This is the full implementation of cat using awk and sed. I also fixed some
inconsistencies found above:
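#!/bin/sh
# Start with every flag unset so values from the environment cannot leak in.
B= E= N= S= T= U= V=
while getopts "benstuv" arg
do
	case $arg in
	b) B=1 ;;
	e) E=1 ;;
	n) N=1 ;;
	s) S=1 ;;
	t) T=1 ;;
	u) U=1 ;;	# accepted, but not implemented in this pipeline
	v) V=1 ;;
	esac
done
shift $(($OPTIND - 1))
# Each stage applies its filter when the flag is set, or passes data through _cat.
# Plain [ ] tests are used so the script does not depend on the [[ ]] extension.
_cat "$@" \
| if [ "$S" = 1 ]; then sed '/^$/N;/^\n$/D'; else _cat; fi \
| if [ "$V" = 1 ] || [ "$T" = 1 ] || [ "$E" = 1 ]; then sed -n 'l'; else _cat; fi \
| if [ "$T" = 1 ]; then sed 's/\\t/^I/g'; else _cat; fi \
| if [ "$B" = 1 ]; then awk '{if (NF) { c++; print c " " $0 } else { print $0 }}'; else _cat; fi \
| if [ "$N" = 1 ]; then awk '{print NR " " $0}'; else _cat; fi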
_cat is the program listed above with some small modifications so that it can be
compiled on OpenBSD.
[1] https://man.openbsd.org/cat.1
[2] https://github.com/aeruder/expect/blob/master/example/unbuffer
[3] http://harmful.cat-v.org/cat-v/unix_prog_design.pdf