My blog on Gemini

My previous post was a brief intro and what I feel about the Gemini protocol. All said and done, I thought that it wouldn’t hurt to “expand” my blog to the Gemini space too. This blog is about what I had to do to get my blog to the Gemini space. The blog is at gemini://gemini.ctrl-c.club/~dhruva

Tl;dr — LOL no, read the whole thing.

Server and Hosting

The first thing I needed was a server to host my Gemini capsule. For my HTTP site, I use GitHub pages, which is super easy to set up because I don’t need to manage my own server. But is there something similar for Gemini?

Turns out, it’s even better. Tildeverse is a universe of Tilde server instances. What’s a tilde server? It’s a public server which you can request for an account on, and you can just… use it. So I signed up for an account on the ctrl-c club tilde instance. In about a day I received my ssh details, and I was up and running in about 15 mins.

Using Franklin

My first approach was the most straightforward one. Git clone my blog, install Julia, run the Franklin build step, then do something to the HTML so that it would output the gmi file. The problem with this however is that Julia and the .julia directories are HUGE. My disk usage quickly hit the soft limit of 1Gb, and I really did not want to exploit the courtesy of the ctrl-c club.

I could try and attempt to only parse the markdown posts to gmi, but then all the Franklin specific content would have to be reparsed. I definitely don’t want to reimplement Franklin.jl. What then?

Scraping the blog

So the constraint is that I cannot build the HTML files on the tilde server, but I need to build it before I can convert it to Gemini files. This catch-22 situation can only be solved by putting the Franklin build step on some other device, and only do the HTML → Gemini part of it on the tilde server.

Why not just get the rendered HTML directly from the website and convert it? But for this I would need to find the correct locations. Well, thank god for sitemap.xml.

So I curl the sitemap.xml, grep for “posts/”, sed out the useless parts, and for each link, download and convert. Voilà, one script to sync the Gemini server with the HTML blog!

Some HTML cleanup

While the previous solution works from a technical point of view, the end result is… illegible. The problem is that the HTML page has a lot of links and layout which any HTML → gmi converter attempts to convert, leading to a lot of noise. I needed to only convert the content part of the page, not the entire thing.

However, the content div only has a class which differentiates it from the rest of the divs. While regexing the start of the content is easy, finding the closing div tag is pretty difficult with regex(technically impossible, as XML is a CFG, and HTML isn’t even that, but it can be done with a little extended regex and prayers to god.). I would need an entire HTML parser from some language, and write a proper program which would parse the HTML to get me the right div. This was again going into the territory of too much work.

All I needed was a way to demarcate the content. I could do that with comments, but those would be removed on build. So the trick I came up with was to add an extra contentthing tag where the content would be in the build templates. Since contentthing is not defined by anyone at all, no browser should render the page any differently than before. Writing a regex to find the content inside was now pretty easy. The irony however is that Gemini was made to get rid of non-standard HTML.

Another tiny issue was that Franklin added an extra p block inside a blockquote. This I fixed with a sed replace.

The HTML→gmi script

This was pretty straightforward. html2gmi is a go script which does the job very well.

index.gmi

While the posts were nicely converted, there was no way that the blog index page could be translated directly. I anyway needed an index page which was unique to the Gemini capsule. This I did by writing a template index.gmi which a python script fills with the posts. Now my capsule is complete.

Conclusion

In the end, the final script is this -

#! /bin/bash
myf() {
    out=$(basename $(dirname $1)).gmi
    echo $1
    curl -sL $1 |
        sed -E "s/(.*<contentthing>|<\/contentthing>.*)//g" |
        sed -E "s/<blockquote> <p>/<blockquote>/g" |
        sed -E "s/<\/p> <\/blockquote>/<\/blockquote>/g" |
        $HOME/go/bin/html2gmi -met -l 1 -o $HOME/public_gemini/$out
}
export -f myf
rm -rf $HOME/public_gemini/*
curl -s https://dhruvasambrani.github.io/blog/sitemap.xml -- | grep .*/posts/2.* | sed -E "s/([ ]*<\/*loc>)//g" | xargs -I{} -- bash -c 'myf "{}"'
python3 makeindex.py

where makeindex.py is

import os
def listposts():
    files = os.listdir("../public_gemini/")
    files.sort()
    return("\n".join(["=> "+_file for _file in files]))
with open("template.gmi") as f:
    s = f.read()
    final = s.replace("{{ listposts }}", listposts())
    with open("../public_gemini/index.gmi", "w") as out:
        out.write(final)

By simply running . makegemini.sh, I can recreate my blog entirely in Gemini.

Things I can improve

Definitely, there are still some things I can improve

The posts listed in the Gemini only have the filename and not the post title. This makes it hard for viewers to find a particular post
Post pages do not link back to the index page
Internal links may not work
HTML → Gemini isn’t the best, because scripts attempt to retain content while ignoring formatting, where in some places, it may be better to just leave the content on the whole.

But on the whole, I’m pleased with the present output.