YOU ARE NOT IMMUNE TO PROPAGANDA

You might be wondering how I got into this situation. Well, it all started with OCaml. I was getting somewhat unhappy about OCaml, its language-making tooling and my EPROL implementation in general. I thought to myself that in the year of our lord 2026 (well, it's not the most accurate to Jesus' birth, but whatever...) we should have better tooling for making languages. And then I remembered that I was not making BAST in OCaml because I had figured out it was the best option out there, but because I had just decided to do something in OCaml first, and only later decided that something should be a compiler.

And so I looked for what's out there.

I looked, and found Chumsky, a decent-looking parser library that promises good error messages and easy error recovery. The only problem is that it is written in Rust, which I've never used, but I've heard it has quite a steep learning curve and likes to not allow you to do anything remotely fun. But since I couldn't find any similar library for any other language (maybe Haskell, but I was already failed by one functional language, so I'm not doing that one), and I generally take every excuse I can to try a new language, I guess I'm doing Rust now.

Time to don the programmer socks I guess.

Made with Lunapic BTW

I chose to use Logos for lexing, mostly because Chumsky has examples of being used with it.

repo, current commit: f7bcaeb94606b2d60687d843fad1e8188754a8d3, current plan

So, how did it go? Well, as per usual, I did a lot of things without it feeling like much, and now I have no idea how to write about it all while keeping the word count reasonable. I decided that the correct way would be to tunnel from lexing to codegen while avoiding any other problem areas, such as error reporting, and then polish the base once I know the architecture can at least somewhat work. So I implemented a minimal subset of the language that can compile a simple WASM-4 hello world.

IMPORT "env" "textUtf8" AS textUtf8 (I32 ptr, len, x, y)

VAR
I32 hello := "Hello,",
    world := "world!",
    helloLen := 6,
    worldLen := 6,
END

PROC start
EXPORT "start"
DO END

PROC update
EXPORT "update"
I32 hx := 10, hy := 10,
    wx := 20, wy := 40,
DO
  textUtf8(hello, helloLen, hx, hy);
  textUtf8(world, worldLen, wx, wy)
END

Well, it can do a bit more, but not by much. Namespaces work, constants work, datatypes work, basic unary and binary operators at least look like they work. Procedures can be called and take args, but they cannot return anything. That's about it. Some memory stuff that will eventually be passed as a command line argument is hardcoded, but hey, I'm still happy about it (though I feel like the progress could be a bit faster, tbh...).

The next step will be error handling and reporting. But for now, something about Rust, compiler design and wasm.

Lexing with Logos is very straightforward, just as it should be. Just like OCaml, Rust has enums that can hold values, which became the main way to pass data around the program. Unlike OCaml, however, Rust enums can be easily printed and compared, so that's already a HUGE upgrade. Another great upgrade is that Rust has a built-in testing system, in which tests can occupy the same file as the code they relate to, so that's nice.
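A minimal sketch of what that combination looks like (a made-up token type for illustration, not my actual lexer):

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Ident(String),
    Number(i64),
    Colon,
}

// Debug makes printing any token trivial.
fn describe(t: &Token) -> String {
    format!("{:?}", t)
}

// Tests can live in the same file as the code they cover.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn tokens_compare() {
        assert_eq!(Token::Number(42), Token::Number(42)); // PartialEq at work
        assert_ne!(Token::Colon, Token::Ident("x".into()));
    }
}
```

Running 'cargo test' picks these up automatically, no extra framework needed.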

Another interesting part of Rust, one that I'm actually quite fond of, is its OOP system. It is quite similar to Go's. You have regular types, and interfaces, which you then implement for whatever type you feel like. No classes, no inheritance (I think), just as it should be. Well, Rust interface implementations must be explicitly stated, unlike Go, where you just write methods and interface implementation is implied, but I don't actually mind that all that much.
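A tiny sketch of that (invented names, nothing from the actual compiler):

```rust
// A trait is declared much like a Go interface...
trait Pretty {
    fn pretty(&self) -> String;
}

// ...and can be implemented for whatever type you feel like.
struct Span {
    start: usize,
    end: usize,
}

// Unlike Go, the implementation must be stated explicitly.
impl Pretty for Span {
    fn pretty(&self) -> String {
        format!("{}..{}", self.start, self.end)
    }
}
```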

Another interesting part of Rust's OOP is the idea of deriving interfaces. This basically lets you implement some default form of an interface for your type. For example, say we have a simple enum type:

enum List<T> {
  Atom(T),
  Cons(Box<List<T>>, Box<List<T>>),
  Nil
}

It cannot do much on its own, besides holding values. The 'Box<_>' syntax allows recursive enum types, as otherwise the type would have infinite size (supposedly). It is a mild inconvenience, as you need to write:

// types are implied
let lst = List::Cons(Box::new(List::Atom(37)),
                     Box::new(List::Nil));

but when reading, it acts just like a reference, so it is not all that bad. The reason to not just use a reference in the enum is that you would then need to handle lifetimes, which I will not get into right now, as I'm not qualified to talk about them, but the fewer of them you have to worry about, the better.
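For the curious, here is roughly what the reference-based version would look like; note how the 'a lifetime parameter infects the whole type and every use of it:

```rust
// The Box-free alternative: every node borrows its children, so the
// enum needs a lifetime parameter that spreads to everything it touches.
enum RefList<'a, T> {
    Atom(T),
    Cons(&'a RefList<'a, T>, &'a RefList<'a, T>),
    Nil,
}

fn main() {
    // Children must outlive the parent node, so they get bound first.
    let nil = RefList::Nil;
    let atom = RefList::Atom(37);
    let lst = RefList::Cons(&atom, &nil);
    // matches! checks the variant without needing PartialEq.
    assert!(matches!(lst, RefList::Cons(_, _)));
}
```

With Box, the list owns its children outright and none of this bookkeeping is needed.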

But back to deriving. By simply writing:

#[derive(Debug, PartialEq, Clone)]
enum List<T> {
  Atom(T),
  Cons(Box<List<T>>, Box<List<T>>),
  Nil
}

we can now print, compare (recursively) and copy (very important) the list. Just like that. Nice! You can always define the interface (or trait, as they are called in Rust) implementation yourself, but most of the time, you won't need to.
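To see the derives in action (repeating the definition so the snippet stands alone):

```rust
#[derive(Debug, PartialEq, Clone)]
enum List<T> {
    Atom(T),
    Cons(Box<List<T>>, Box<List<T>>),
    Nil,
}

fn main() {
    let a = List::Cons(Box::new(List::Atom(37)), Box::new(List::Nil));
    let b = a.clone();   // Clone: a deep copy of the whole list
    assert_eq!(a, b);    // PartialEq: recursive comparison
    println!("{:?}", a); // Debug: prints Cons(Atom(37), Nil)
}
```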

But what about the parser, is it any good? Well, I would first recommend you read through its official introduction, it is quite good, but once you get past the initial hit of this being your procedure signature:

// yes, this is a real language people use willingly, (me included)
fn declare_namespace<'tokens, 'src: 'tokens, I>(
) -> impl Parser<'tokens, I, Vec<&'src str>, extra::Err<Rich<'tokens, Token<'src>>>>
where
    I: ValueInput<'tokens, Token = Token<'src>, Span = SimpleSpan>,

it is quite relaxing actually. For example, here is a function for matching a namespace declaration (that is, ': some.idents'):

fn declare_namespace<'tokens, 'src: 'tokens, I>(
) -> impl Parser<'tokens, I, Vec<&'src str>, extra::Err<Rich<'tokens, Token<'src>>>>
where
    I: ValueInput<'tokens, Token = Token<'src>, Span = SimpleSpan>,
{
    just(Token::Colon)
        .ignore_then(
            select! { Token::Ident(s) => s }
            .separated_by(just(Token::Period))
            .at_least(1)
            .collect::<Vec<_>>()
        )
}

First, we specify what we want to return, that is the 'Vec<&'src str>' part of the return type; you don't need to worry about the rest of the type. The ''src' part is a lifetime. You need to worry about this one a bit, but I will not elaborate any further.

It starts by matching a colon with 'just(Token::Colon)'. The 'Token' type is my own type defined in the lexer. Each part of the pattern gets collected and passed as a tuple to the output. You can then map them and stuff, which is nice. Here, we want to ignore the colon, so we call '.ignore_then'. Then we define the namespace part, which is a set of idents separated by periods.

The 'select!' macro (macros in Rust end with bangs) is used to extract the string from the ident. This would match ': foo.bar' and ': foo', but since Chumsky also accepts empty lists by default, it would also match ':', which is not ideal, so we specify that at least one item is required. Finally, we collect all the outcomes into a nice little vector, which is then implicitly returned, because Rust.

Let's look at one more example:

fn optional_export<'tokens, 'src: 'tokens, I>(
) -> impl Parser<'tokens, I, Option<&'src str>, extra::Err<Rich<'tokens, Token<'src>>>>
where
    I: ValueInput<'tokens, Token = Token<'src>, Span = SimpleSpan>,
{
    choice((
        just(Token::Export)
            .ignore_then(select!{Token::String(s) => s})
            .map(Some),
        empty().to(None),
    ))
}

This one matches either 'EXPORT "foo"' or nothing, and returns an option, which is either 'Some(T)' or 'None'. The first interesting thing is 'choice'. It takes a tuple of arbitrary length (thus the double parens), and it can match any one of its contents, with earlier ones having higher priority.

'.map(Some)' here takes 'Some' as a function and applies it to the outcome string. The second branch matches 'empty()', which always matches and does not consume any tokens, and it's mixed with '.to', which discards whatever it's applied to and replaces it with its argument, in this case 'None'.
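The '.map(Some)' trick works because enum variant constructors in Rust are ordinary function values; a quick illustration outside of Chumsky:

```rust
fn main() {
    // Variant constructors like Some are plain functions, so they can
    // be passed to map directly instead of writing |s| Some(s).
    let names = vec!["foo", "bar"];
    let wrapped: Vec<Option<&str>> = names.into_iter().map(Some).collect();
    assert_eq!(wrapped, vec![Some("foo"), Some("bar")]);
}
```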

Well, that's a lot of text just to say that the parser exists. Looks like I will talk about the semantic analysis in the next b-log. It's not like it is in all that good a state anyways.

There will be some notes on WASM tho. First of all, WASM USES A HARVARD ARCHITECTURE. That means that program and data live in two separate memory spaces. That would not be all that bad, it's not like I'm planning to write self-modifying programs or anything, but it does not end there. Procedures, data and variables all have their own memory. This means that you can't just take a pointer to a variable. Languages like C usually solve this with a shadow stack; I plan to solve it by not allowing the creation of pointers to variables.

Also, while function and global declarations can be mixed as much as you want, imports must always come first, followed by data definitions. Data definitions require specifying an offset (well, mostly; there is an extension that does not require it, but Wasm3 does not have it, I think).

Interesting.

Also, one more thing to mention about Rust. It has a lot of types. One of the things I did not like about OCaml was that I felt I just couldn't use it without the LSP constantly telling me what needed to be written differently to produce the correct types and calls and stuff. I don't think an LSP should be required to code.

While Rust's stricter syntax does make this somewhat better, I still need the LSP to tell me where I have borrowing/ownership problems. Yes, this is mostly because I'm new to this memory management scheme, so I naturally get things wrong, and I might get better by next time, but if I still keep getting everything wrong, I don't think it's good language design when the developer can't consistently get things right without extra tooling. But as I said, we'll see how much better I get by the next b-logs.

peace!()