repo, current commit: 4b34b5d688404a56648a7b2d6e65ecf417b7000e
So, as I stated last time, my codebase was not in the best hands. I can't really do much about that, but it was also not in a good state, which I can do at least something about.
The main problem was 'src/semantic.rs'. This file held all of semantic analysis and IR generation, which are done together in my compiler. These two phases are the core of a compiler, so as you might have guessed, a single file may not be sufficient. A prime example of this was the 'main' AST processing function, which at one point reached the 48 spaces of indentation I wrote about. It now sits at 32 spaces max. What can I say, sometimes you have to nest a lot of things in Rust.
I have split the file into 5 files:
Yes, I will keep the 'semantic_' part there. It might be better to put them in a separate directory, or even a separate module, but I just feel like doing it this way.
What I feared the most was (a bit unsurprisingly) the borrow checker. The process of analysing a file consists of multiple steps (consts, vars, procedures...) that pass around a single shared, mutable state. They also used to pass around an IR list, but I factored that out, and now each part just produces its own list. I feared that the different functions would fight over the mutable reference, but it went surprisingly smoothly. Yes, I had to specify a separate lifetime once, but the rest just worked. I guess Rust is not as much against mutable state as I thought it was. Neat!
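The shape is roughly this; a minimal sketch with made-up names (the real State carries symbol tables and more), which also shows why the borrow checker stays calm: each step takes the &mut in turn, so the borrows never overlap.

struct State {
    // symbol tables, scope stack, collected errors...
}

struct Ir; // one IR instruction

fn analyze_consts(state: &mut State) -> Vec<Ir> { /* ... */ Vec::new() }
fn analyze_vars(state: &mut State) -> Vec<Ir> { /* ... */ Vec::new() }
fn analyze_procs(state: &mut State) -> Vec<Ir> { /* ... */ Vec::new() }

fn analyze(state: &mut State) -> Vec<Ir> {
    // Each step borrows the shared state mutably, one at a time,
    // and returns its own IR list, stitched together here.
    let mut ir = analyze_consts(state);
    ir.extend(analyze_vars(state));
    ir.extend(analyze_procs(state));
    ir
}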
Overall, the refactoring, while not the most exciting, did improve the quality of the code and will make future development easier. I should refactor more often.
I shared my hello world example before. It used to look like this:
IMPORT "env" "textUtf8" AS textUtf8 (I32 ptr, len, x, y)
VAR
I32 hello := "Hello,",
world := "world!",
helloLen := 6,
worldLen := 6,
END
PROC start
EXPORT "start"
DO END
PROC update
EXPORT "update"
I32 hx := 10, hy := 10,
wx := 20, wy := 40,
DO
textUtf8(hello, helloLen, hx, hy);
textUtf8(world, worldLen, wx, wy)
END
And yes, it does in fact compile and run, but I don't like that I have to specify the string length in a separate variable, especially since I do already store the string length before the string. I would like something more like this:
IMPORT "env" "textUtf8" AS textUtf8 (I32 ptr, len, x, y)
VAR
I32 hello := "Hello,",
world := "world!",
END
PROC start
EXPORT "start"
DO END
PROC update
EXPORT "update"
I32 hx := 10, hy := 10,
wx := 20, wy := 40,
DO
(* negative first I32 before the string *)
textUtf8(hello, hello[-1], hx, hy)
textUtf8(world, world[-1], wx, wy)
END
Ok, let's tackle the comments first. Comments are best handled at the lexer level, so that the parser does not have to worry about them at all. For lexing, I'm using Logos. It can handle custom callbacks on match like so:
#[logos(skip(r"\(\*", comment_callback))]
After digging a bit through the documentation, I found out how to advance the lexer and read what's left of the input, so I wrote a simple function to handle nested comments.
fn comment_callback<'a, 'b: 'a>(
    lex: &'a mut logos::Lexer<'b, Token<'b>>,
) -> logos::Skip {
    // The skip pattern has already consumed the opening `(*`,
    // so start at depth 1 and scan the remainder.
    let mut depth = 1;
    while depth > 0 {
        let remainder = lex.remainder();
        if remainder.starts_with("(*") {
            depth += 1;
            lex.bump(2);
        } else if remainder.starts_with("*)") {
            depth -= 1;
            lex.bump(2);
        } else if let Some(c) = remainder.chars().next() {
            lex.bump(c.len_utf8()); // advance one character, staying on UTF-8 boundaries
        } else {
            break; // unterminated comment: swallow the rest of the file
        }
    }
    logos::Skip
}
I think the code is quite self-explanatory: advance through the source string, count the encountered '(*' and '*)', and when the comment ends, hand control back to the regular lexer. Currently, when there are more opening than closing delimiters, it does not complain and just ignores the rest of the file. I'm not bothered by it enough to fix it right now.
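For context, this is roughly how the attribute and the callback hang together; the token variants here are made-up stand-ins, not my actual token set:

use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"[ \t\r\n]+")]               // ordinary whitespace
#[logos(skip(r"\(\*", comment_callback))]  // nested comments, handled above
enum Token<'src> {
    #[token("VAR")]
    Var,
    #[regex(r"[A-Za-z][A-Za-z0-9]*", |lex| lex.slice())]
    Ident(&'src str),
    // ...the rest of the tokens
}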
Now to the indexing. Indexing in EPROL is a bit more complex: an accessor consists of a result type, an offset length and an index. The default result type is 'I32' and the default offset length is the byte length of the type (except in the full three-part form, where an empty offset slot means 1). The index can be negative, as I plan to store some metadata, such as length, in front of the data structure a pointer points to.
[2]       -> I32 at offset 8B
[F64:-2]  -> F64 at offset -16B
[I64:3:2] -> I64 at offset 6B
[I64::1]  -> I64 at offset 1B
[::2]     -> I32 at offset 2B
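Spelled out, the byte offset is simply the offset length times the index, with the defaults filled in. A quick sanity check of those examples (a throwaway sketch, not compiler code):

fn byte_offset(offset_len: i64, index: i64) -> i64 {
    offset_len * index
}

fn main() {
    assert_eq!(byte_offset(4, 2), 8);    // [2]       -> default I32, 4B each
    assert_eq!(byte_offset(8, -2), -16); // [F64:-2]  -> F64 is 8B
    assert_eq!(byte_offset(3, 2), 6);    // [I64:3:2] -> explicit offset length 3
    assert_eq!(byte_offset(1, 1), 1);    // [I64::1]  -> empty offset slot means 1
    assert_eq!(byte_offset(1, 2), 2);    // [::2]     -> type and offset length both defaulted
}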
Accessors also provide extra types normally not usable in EPROL: 'I8', 'I16', 'U8', 'U16', 'U32' and 'U64'.
The parser itself is quite simple, but it has one catch: it needs to parse an expression, which itself can contain accessors, and Chumsky is not very happy about this kind of recursion. So instead, I pass the expression parser to the accessor parser as an argument.
fn accessor<'tokens, 'src: 'tokens, I, P>(
    exp: P,
) -> impl Parser<'tokens, I, Accessor<'src>, extra::Err<Rich<'tokens, Token<'src>>>>
where
    I: ValueInput<'tokens, Token = Token<'src>, Span = SimpleSpan>,
    P: Parser<'tokens, I, Expr<'src>, extra::Err<Rich<'tokens, Token<'src>>>> + Clone,
{
    just(Token::LSquare)
        .ignore_then(choice((
            // [type:len:count]
            simple_type()
                .or_not()
                .then_ignore(just(Token::Colon))
                .then(possibly_negative_int().or_not())
                .then_ignore(just(Token::Colon))
                .then(exp.clone())
                .map(|((typ, len), count)| Accessor {
                    typ: typ.unwrap_or(Type::I32),
                    offset_len: len.unwrap_or(1),
                    offset: count,
                }),
            // [type:offset]
            simple_type()
                .then_ignore(just(Token::Colon))
                .then(exp.clone())
                .map(|(typ, offset)| Accessor {
                    typ: typ.clone(),
                    offset_len: type_len(typ),
                    offset,
                }),
            // [offset]
            exp.map(|offset| Accessor {
                typ: Type::I32,
                offset_len: 4,
                offset,
            }),
        )))
        .then_ignore(just(Token::RSquare))
}
The processing of accessors is not all that hard. I just generate some pointer arithmetic AST, which gets evaluated like any other expression, followed by a load instruction. Writing to an accessor is almost the same; I just emit a store instruction instead.
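Assuming a simple stack-machine style IR (every name in this sketch is made up for illustration; my actual AST and IR look different), the read path is roughly:

// Hypothetical AST/IR just for this sketch.
#[derive(Clone, Copy)]
enum Type { I32, I64 }

enum Expr {
    IntLit(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

enum Ir {
    PushConst(i64),
    Add,
    Mul,
    Load(Type),
    Store(Type),
}

struct Accessor {
    typ: Type,
    offset_len: i64,
    offset: Expr, // the index expression
}

// Evaluate an expression onto the stack.
fn lower_expr(e: Expr, out: &mut Vec<Ir>) {
    match e {
        Expr::IntLit(n) => out.push(Ir::PushConst(n)),
        Expr::Add(a, b) => {
            lower_expr(*a, out);
            lower_expr(*b, out);
            out.push(Ir::Add);
        }
        Expr::Mul(a, b) => {
            lower_expr(*a, out);
            lower_expr(*b, out);
            out.push(Ir::Mul);
        }
    }
}

// Reading `base[acc]`: build `base + offset_len * index` as ordinary
// arithmetic AST, evaluate it like any other expression, then load.
// A write is identical except it ends with Ir::Store(acc.typ).
fn lower_accessor_load(base: Expr, acc: Accessor, out: &mut Vec<Ir>) {
    let addr = Expr::Add(
        Box::new(base),
        Box::new(Expr::Mul(
            Box::new(Expr::IntLit(acc.offset_len)),
            Box::new(acc.offset),
        )),
    );
    lower_expr(addr, out);
    out.push(Ir::Load(acc.typ));
}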
That would work, but I don't want to access the length with '[-1]'. That is cryptic and unintuitive. I want to use 's.len' instead, like so:
IMPORT "env" "textUtf8" AS textUtf8 (I32 ptr, len, x, y)
ACCESSOR len : s [-1]
VAR
I32 hello := "Hello,",
world := "world!",
END
PROC start
EXPORT "start"
DO END
PROC update
EXPORT "update"
I32 hx := 10, hy := 10,
wx := 20, wy := 40,
DO
textUtf8(hello, s.len|hello, hx, hy)
textUtf8(world, s.len|world, wx, wy)
END
This was actually pretty easy. The parser for 'ACCESSOR' is very straightforward. I added another entry type to the scope for named accessors, and during evaluation I just construct a regular accessor AST and let that one get evaluated instead. I only need to check that the index is known at compile time, but that is the same check as for constants.
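Continuing the sketch from before (same made-up types), a named accessor is just a stored template whose index must be a compile-time constant; resolving one builds a regular Accessor and lowers it the usual way:

use std::collections::HashMap;

// A named accessor, as declared by `ACCESSOR len : s [-1]`.
struct NamedAccessor {
    typ: Type,
    offset_len: i64,
    index: i64, // must be known at compile time, like a constant
}

// `s.len|hello`: look the name up in scope and lower it as a regular read.
fn lower_named_read(
    scope: &HashMap<String, NamedAccessor>,
    name: &str,
    base: Expr,
    out: &mut Vec<Ir>,
) -> Result<(), String> {
    let named = scope
        .get(name)
        .ok_or_else(|| format!("unknown accessor '{name}'"))?;
    let acc = Accessor {
        typ: named.typ,
        offset_len: named.offset_len,
        offset: Expr::IntLit(named.index),
    };
    lower_accessor_load(base, acc, out);
    Ok(())
}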
Recently (as of writing this b-log), I played a bit with Modula-2. In short, it is a scam. Your time is better spent on more sensible languages, such as Odin, Nim, or Algol68.
But as its Vim support is (reasonably so) not good, I decided to call upon the machine spirit to write me a plugin, which after some manual tweaking turned out quite well. Here it is!
Why do I mention Modula-2 here? Well, I decided to reuse the work and turn it into an EPROL plugin, since EPROL and Modula-2 have relatively similar syntax.
I would like to eventually try writing a Tree-sitter parser for EPROL, but this plugin will suffice for now.
It supports filetype detection, syntax highlighting, indentation, comments and even optional abbreviations, so that you don't have to type the keywords in all-caps yourself. (EPROL will still require them to be all-caps, though.)
So, what is next? Well, I would like to have 's.len' automatically defined for me, so I want to add some multi-file support with automatic inclusion, as specified in the design document. As for what comes after that, we'll see.