So you know (or probably don't) how I tried to make a WASM programming language before? (also, I just noticed that you can use both '/~/name/' and '/~name/' path; nice!) Well, I'm doing it again, except properly this time.
So, what is different, what is the same, and why am I inflicting this on myself again? Well, I'm still going to use ocamllex and Menhir. I'm still targeting WASM with the main goal of developing WASM-4 games. But no MoonBit this time. Just straight up spitting out WASM (by which I mean spitting out WAT and then assembling it with a separate assembler. Modularity FTW!)
I can achieve all this by going in the completely opposite direction. While BAST was super experimental and high-level with lax syntax, dynamic typing, garbage collection and all that stuff, EPROL will (hopefully) be a simple ALGOL-inspired, low-level language trying to be as close to the target WASM as is comfortable. (Which means I'm not doing stack-based things; it's not like WASM is good at being a stack-based language anyways...) I'll just leave the current ('design'/idea) document I'm working off of here!
As you can see, it's generally way more conservative than BAST, but I'm in a more traditional, low-level mood these days, so that's what I'm going for. This way, I can at least finish it and be sure that it will all fit on the WASM-4.
So, there are a few things that EPROL will need that BAST didn't have. First of all, I will actually do some semantic analysis. I'm not sure how I got away without semantic analysis in BAST (well, it wasn't actually any good, so that's probably how), but I will actually need to do it here. I also want some half-decent compiler errors, which mostly just means reporting on which line bad stuff happened. I have no idea how to do this properly, though.
If I was a good hacker, I would already know this. If I was a decent hacker, I would read a lot on how to make a compiler and then do it properly. Instead, I'm just hoping things will go well and I'm just packing position data with every (token / AST node) until I figure out which ones actually need it and which ones don't.
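For what it's worth, here is a minimal sketch of the "pack a position with everything" approach. It assumes that 'loc' is simply the standard library's 'Lexing.position' (the real 'loc' in EPROL may well be something else), and the names 'located', 'string_of_loc' and 'error_at', as well as the file name, are made up for illustration:

```ocaml
(* Assumption: loc is the stdlib lexer position; EPROL's real loc may differ. *)
type loc = Lexing.position

(* Hypothetical wrapper: any token or AST node paired with where it starts. *)
type 'a located = loc * 'a

(* Render "file:line:column"; the column is the offset from the line start. *)
let string_of_loc (p : loc) : string =
  Printf.sprintf "%s:%d:%d" p.Lexing.pos_fname p.Lexing.pos_lnum
    (p.Lexing.pos_cnum - p.Lexing.pos_bol + 1)

(* Build an error message that points at a located token or node. *)
let error_at ((l, _) : 'a located) (msg : string) : string =
  Printf.sprintf "%s: error: %s" (string_of_loc l) msg

let () =
  (* Made-up position: line 3, fifth character of that line. *)
  let pos =
    { Lexing.pos_fname = "game.eprol"; pos_lnum = 3; pos_bol = 20; pos_cnum = 24 }
  in
  print_endline (error_at (pos, "x") "unknown identifier")
```

One thing to keep in mind with ocamllex: 'pos_lnum' only advances if the lexer calls 'Lexing.new_line' whenever it matches a newline, otherwise every error ends up on line 1.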
So far, I have only done lexing and parsing, no analysis just yet. But I have a few words of wisdom to share anyway. First of all, namespaced identifier definition. I want to store identifiers in a data structure like so:
type var_data = {
  name: string;
  namespace: string list;
}
and idnt = (loc * var_data)
There are a few challenges to this. First of all, I need to extract only the string from all the matched idents, which also carry their location. Second, Menhir does not like repeating lists followed by a similar pattern (as far as I can tell, with one token of lookahead it can't know whether the next PERIOD continues the list or ends it), so the following does not work:
idnt:
  (* will not match properly *)
  | nmsp = separated_list(PERIOD, IDENT); PERIOD; name = IDENT;
    { ... } (* still need to extract strings from IDENTs *)
  | name = IDENT
    { ... }
What I ended up with is:
idnt:
  (* accumulate strings of leading idents until the last one is the base *)
  | names = separated_nonempty_list(PERIOD, IDENT)
    { let rec aux acc = function
        | [] -> failwith "impossible"
        | [(l, s)] -> (l, { name = s; namespace = List.rev acc })
        | (_, s) :: rest -> aux (s :: acc) rest
      in aux [] names }
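The same folding logic can be tried outside the grammar as plain OCaml. This is only a sketch: 'loc' is replaced by an 'int' stand-in, the function name 'idnt_of_names' is made up, and the example identifiers are invented:

```ocaml
(* Assumption: an int stands in for the real location type. *)
type loc = int

type var_data = { name : string; namespace : string list }
type idnt = loc * var_data

(* The last located identifier becomes the base name (and supplies the
   location); all the ones before it become the namespace, in order. *)
let idnt_of_names (names : (loc * string) list) : idnt =
  let rec aux acc = function
    | [] -> invalid_arg "idnt_of_names: empty list"
    | [ (l, s) ] -> (l, { name = s; namespace = List.rev acc })
    | (_, s) :: rest -> aux (s :: acc) rest
  in
  aux [] names

let () =
  (* Models the source text gfx.sprite.draw with made-up locations. *)
  let l, v = idnt_of_names [ (1, "gfx"); (5, "sprite"); (12, "draw") ] in
  Printf.printf "%d %s [%s]\n" l v.name (String.concat "." v.namespace)
```

Note that the resulting location is the one of the last identifier, not of the whole dotted path; whether that is what the error messages should point at is a separate decision.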
The second problem is statement lists. First, I want semicolons to be statement separators and not terminators, which means that the last statement in a block does not need to end with a semicolon. Second, I also want to allow trailing, leading and repeating semicolons. (In original Pascal you couldn't do that.) Third, I also want control structures to end their statement automatically with 'END', so that the following is valid:
LOOP
  x := y;
  y := x;
  IF x DO
    foo()
  ELSE
    bar();
    baz();
  END
  x := y
END
So, how did it go? Not well. I don't have all the progress, because I suck at the whole b-logging thing, but I DO have a solution (END):
stmt_list:
  | list(SEMICOLON); lst = option(stmt_chain)
    { match lst with
      | Some chain -> chain
      | None -> [] }

stmt_chain:
  | s = stmt; nonempty_list(SEMICOLON); rest = stmt_chain
    { s :: rest }
  | s = control_block; list(SEMICOLON); rest = stmt_chain
    { s :: rest }
  | s = stmt; list(SEMICOLON)
    { [s] }
  | s = control_block; list(SEMICOLON)
    { [s] }

stmt:
  ... all statements needing semicolon ...

control_block:
  ... all statements that don't need semicolon separation ...
Basically the solution is (as it often happens to be) recursion. It also happened to be a useful approach for allowing trailing commas in lists.
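To make the recursion easier to poke at, here is a rough model of the same rules as an ordinary OCaml function over an already-classified token list. This is not how the Menhir parser works internally; the 'tok' type and the function names are hypothetical, and statements are reduced to plain strings:

```ocaml
(* Hypothetical tokens: a plain statement, an END-terminated block, or ';'. *)
type tok = Stmt of string | Block of string | Semi

(* Drop a run of semicolons and report whether at least one was dropped. *)
let rec skip_semis seen = function
  | Semi :: rest -> skip_semis true rest
  | toks -> (seen, toks)

(* Mirrors stmt_chain: a plain statement needs at least one ';' before the
   next item, while a control block may be followed by zero or more. *)
let rec chain = function
  | [] -> Ok []
  | Semi :: _ -> Error "unexpected ';'" (* unreachable after skip_semis *)
  | ((Stmt s | Block s) as t) :: rest ->
      let seen, rest' = skip_semis false rest in
      let needs_semi = (match t with Stmt _ -> true | _ -> false) in
      if needs_semi && (not seen) && rest' <> [] then
        Error ("missing ';' after " ^ s)
      else Result.map (fun tl -> s :: tl) (chain rest')

(* Mirrors stmt_list: leading semicolons, then an optional chain. *)
let stmt_list toks = chain (snd (skip_semis false toks))

let () =
  match stmt_list [ Semi; Stmt "a"; Semi; Semi; Block "if"; Stmt "b"; Semi ] with
  | Ok names -> print_endline (String.concat ", " names)
  | Error msg -> print_endline msg
```

The point is only that each rule consumes one item plus its separators and then recurses on whatever is left, which is the same shape as the grammar rules.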
...
I have just looked into the Menhir manual to see if there is any builtin for trailing separators, and no, there does not seem to be any, but there IS 'delimited', 'terminated', 'preceded', 'separated_pair' and more! I should probably read through it and rewrite a few things, but well, we'll see.
Anyways, here's the repo, might be of use at some point. Current commit: 'faefe6e32d8473fd95aec3515ba931182a32c6ff'