Rewriting cat using other utilities
  2023-09-16

In the past week I installed Plan 9 on my new RPi 400. This is something I
had wanted to try for a long time, but I never had the hardware for it.
After some hours of exploring the system, its internals, the philosophy and
the code, I realized how simply the core utilities are implemented. In this
article I will focus on cat(1). For reference, this is the code:

    #include <u.h>
    #include <libc.h>

    void
    cat(int f, char *s)
    {
        char buf[8192];
        long n;

        while((n=read(f, buf, (long)sizeof buf))>0)
            if(write(1, buf, n)!=n)
                sysfatal("write error copying %s: %r", s);
        if(n < 0)
            sysfatal("error reading %s: %r", s);
    }

    void
    main(int argc, char *argv[])
    {
        int f, i;

        argv0 = "cat";
        if(argc == 1)
            cat(0, "");
        else for(i=1; i<argc; i++){
            f = open(argv[i], OREAD);
            if(f < 0)
                sysfatal("can't open %s: %r", argv[i]);
            else{
                cat(f, argv[i]);
                close(f);
            }
        }
        exits(0);
    }

I was surprised that I understood what the code does within the first minute
of reading it. I cannot say the same thing about the cat(1) implementation
from GNU. With all due respect, it's full of crap that makes the utility
harder to grasp than it should be.

After that, the following thought came to my mind: is it possible to take a
modern implementation of cat(1), let's say the one from OpenBSD, strip it of
all the flags that it implements, and achieve the same functionality using
pipes to other system utilities? For reference, this is the man page of
OpenBSD cat(1) [1].

The following are the results of this exercise:

    cat -b : awk '{if (NF) { c++; print c $s } else { print $s }}'
    cat -e : sed -n 'l'
    cat -n : awk '{print NR $s}'
    cat -s : sed '/^$/N;/^\n$/D'
    cat -t : sed 's/ /^I/g' # the white space is a tab
    cat -u : N/A
    cat -v : sed -n 'l' | sed 's/.$//'

1. -b
   Number the lines, but don't count blank lines.

   This can easily be achieved using awk(1) and a simple if statement. There
   is no need to put additional code in cat(1) if another utility already
   does the job.

2. -e
   Print a dollar sign (‘$’) at the end of each line. Implies the -v option
   to display non-printing characters.

   This uses sed(1) with a single command, l, which is documented as
   follows:

       (The letter ell.) Write the pattern space to the standard output in a
       visually unambiguous form.

   We can take this as another wording for the cat(1) -v option: "Displays
   non-printing characters so they are visible."

3. -n
   Number the output lines, starting at 1.

   Again, a simple awk script.

4. -s
   Squeeze multiple adjacent empty lines, causing the output to be single
   spaced.

   This sed script is a bit more complicated. When it matches an empty line,
   N appends the next input line to the pattern space. If the pattern space
   then consists of nothing but a newline, meaning two empty lines in a row,
   D deletes everything up to and including the first newline and restarts
   the cycle with what is left.

5. -t
   Print tab characters as ‘^I’. Implies the -v option to display
   non-printing characters.

   Again, a simple sed script.

6. -u
   The output is guaranteed to be unbuffered (see setvbuf(3)).

   I found no solution for this. There is GNU stdbuf, but it's full of nasty
   hacks and, because it uses LD_PRELOAD, it does not work with statically
   linked binaries or setuid binaries. There is also expect/unbuffer [2],
   but it's written in Tcl and I don't think OpenBSD ships a Tcl toolchain
   out of the box. I guess calling setvbuf from cat and exposing -u as a
   flag doesn't hurt that much.

7. -v
   Displays non-printing characters so they are visible.

   Same as -e, but we also drop the last character from each line, which is
   a '$'. I don't understand why the dollar sign is not printed with -v as
   well.

In conclusion, the cat implementation can be as short as 35 lines (let's say
40 if we also add -u to the implementation) and we can extend it using only
two system utilities, i.e. awk and sed.

"Modern" cat implementations are not following the Unix philosophy of "do
one thing and do it well". From [3]: `c_a_t_ -- concatenate and print` is
what cat was designed to do. One can argue that printing can be customized
in multiple ways, as we already saw in OpenBSD cat, but this conflicts with
the philosophy mentioned above.

On the other hand, GNU's (secondary) philosophy is to make utilities as easy
to use as possible. This means that they don't care if they embed the same
code in two utilities as long as their utilities are easy to use. Writing
the sed or awk scripts from my examples is not straightforward, I have to
admit that. But maybe they should not be easy to write, at the end of the
day.

Another conclusion is that I don't want to stop here with the minimization
of the system utilities and with the discovery of existing alternatives. So
I will continue this journey by analyzing other system utilities in the same
way I did in this article.

Update 2023-09-19:

This is the full implementation of cat using awk and sed. I also fixed some
inconsistencies found above:

    #!/bin/sh

    while getopts "benstuv" arg
    do
        case $arg in
            b) B=1 ;;
            e) E=1 ;;
            n) N=1 ;;
            s) S=1 ;;
            t) T=1 ;;
            u) U=1 ;;
            v) V=1 ;;
        esac
    done
    shift $(($OPTIND - 1))

    _cat "$@" \
    | if [[ $S -eq 1 ]]; then sed '/^$/N;/^\n$/D'; else _cat; fi \
    | if [[ $V -eq 1 ]] || [[ $T -eq 1 ]] || [[ $E -eq 1 ]]; then sed -n 'l'; else _cat; fi \
    | if [[ $T -eq 1 ]]; then sed 's/\\t/^I/g'; else _cat; fi \
    | if [[ $B -eq 1 ]]; then awk '{if (NF) { c++; print c " " $0 } else { print $0 }}'; else _cat; fi \
    | if [[ $N -eq 1 ]]; then awk '{print NR " " $0}'; else _cat; fi

_cat is the program listed above with some small modifications so that it
can be compiled on OpenBSD.

[1] https://man.openbsd.org/cat.1
[2] https://github.com/aeruder/expect/blob/master/example/unbuffer
[3] http://harmful.cat-v.org/cat-v/unix_prog_design.pdf
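
A quick way to check the awk and sed replacements for -n, -b and -s is to
wrap each one in a small shell function and feed it sample input. This is
only a minimal sketch: the function names are made up for illustration, and
the numbering uses the plain "number, space, line" format from the update
rather than the padded format that a real cat prints.

```shell
#!/bin/sh
# Hypothetical helper names; each wraps one pipeline from the article.

# cat -n: number every input line.
number_all() {
    awk '{ print NR " " $0 }'
}

# cat -b: number only lines that contain at least one field.
number_nonblank() {
    awk '{ if (NF) { c++; print c " " $0 } else { print $0 } }'
}

# cat -s: squeeze runs of empty lines into a single empty line.
squeeze_blank() {
    sed '/^$/N;/^\n$/D'
}

# Sample input: three empty lines between "a" and "b".
printf 'a\n\n\n\nb\n' | squeeze_blank | number_nonblank
```

Chaining the functions mirrors how the flags combine in the script from the
update: squeezing happens first, numbering last.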