analognowhere (I've written that word a lot today) is a website where Prahou posts his images.
There are a lot of posts, however, and the names are not always the most descriptive, so it can be quite challenging to find one specific post. For this reason (and because I was bored) I made a simple web application to search the archives.
You get a simple website that downloads all the analognowhere data locally. Then you write a bunch of keywords into a search bar. The keywords are looked for one by one in each post's title, subtitle and description. (The presence of a description is the main thing making this project possible; the subtitle is usually even less descriptive than the title.)
After the scan is complete, previews of all the posts that match all the keywords will be shown. Clicking a preview will send you to the post proper.
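The matching described above boils down to a simple "every keyword must appear somewhere in the post's text" filter. A minimal sketch, assuming the posts are held in an array of objects with `title`, `subtitle` and `description` fields (the field names and sample data here are my own, not from the actual app):

```javascript
// Hypothetical shape of the locally stored data.
const POSTS = [
  { path: "_/example", title: "Example", subtitle: "one", description: "a post about rain" },
  { path: "_/other",   title: "Other",   subtitle: "two", description: "a post about sun" },
];

// A post matches when every keyword appears in at least one of its text fields.
function search(query, posts = POSTS) {
  const keywords = query.toLowerCase().split(/\s+/).filter(k => k.length > 0);
  return posts.filter(post => {
    const haystack = [post.title, post.subtitle, post.description]
      .join(" ")
      .toLowerCase();
    return keywords.every(k => haystack.includes(k));
  });
}
```

With this shape, `search("rain")` would return only the first sample post, while `search("post about")` would match both.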
Getting it up and running was quite simple.
The state of the website (loading, loaded, loading error) is handled by toggling the visibility of multiple <div>s.
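One way to do that visibility switching, sketched here with hypothetical element ids (one `<div>` per state; the actual ids in the app may differ):

```javascript
// One <div> per state; only the current state's div is shown.
const STATES = ["loading", "loaded", "error"];

function showState(state) {
  for (const id of STATES) {
    document.getElementById(id).style.display = (id === state) ? "block" : "none";
  }
}
```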
Both the index and posts (example) are very scrapable. While I usually scrape with regex (it's fun), this time I decided to use JavaScript's built-in XML parser.
```javascript
const parser = new DOMParser();
// parts of a page
const fragment = parser.parseFromString(src, 'text/xml');
// whole page
const page = parser.parseFromString(src, 'text/html');
```
This returns a Document, which can be handled like any other DOM node. (Sadly, the DOM is not implemented in Node...) Since the site is nicely structured, getting data out of it is not that hard. One thing to note is that the 'href' property stores the path as an absolute URL, protocol and all, so I had to extract just the relevant part like so:
```javascript
const site = PARSER.parseFromString(siteHTML, "text/html");
const links = Array.from(site.getElementsByClassName("archive")[0].children)
  .map(li => li.firstChild.href.match(/_.+/)[0]);
```
The search bar itself is just a <form>. Turns out you can call JavaScript from links. I will surely have some fun with this information later.
```html
<form action="javascript:search()">
  <b>INPUT A LIST OF KEYWORDS HERE</b><br>
  <input type="search" id="query" name="query" placeholder="keywords...">
  <input type="submit" value="Search">
</form>
```
I ran into slight problems with accessing the website because of the heresy called Same-Origin Policy. (Something something security something something CORS... Dealing with this makes me want to install Windows 95 and write this from there.)
To circumvent this, I used a proxy.
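The post does not name the proxy actually used, but the usual pattern is to prefix the target URL with a CORS proxy that answers with permissive headers. A sketch with a placeholder proxy address:

```javascript
// Placeholder address; the real proxy used is not named in the post.
const PROXY = "https://example-cors-proxy.invalid/";

function proxyUrl(url) {
  return PROXY + encodeURIComponent(url);
}

// Routing the request through the proxy sidesteps the Same-Origin Policy,
// because the proxy responds with permissive CORS headers.
async function fetchThroughProxy(url) {
  const resp = await fetch(proxyUrl(url));
  if (!resp.ok) throw new Error(`proxy fetch failed: ${resp.status}`);
  return resp.text();
}
```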
But that was not enough.
While this approach worked, it took way too long to load (not surprising, as it had to fetch every single post), so I took a different approach: generate the data beforehand and then just load it from the server. Well, not really. Turns out SOP is not happy about loading local files from 'file://' addresses either, so I made things simple and generate the entire 'script.js' file with the data baked in. (Fun!)
For server-side generation, I chose Ruby. This means going back to scraping with regex, which was actually way easier. Ruby has a built-in HTTP/HTTPS client, which works somewhat like 'fetch'.
```ruby
require 'uri'
require 'net/http'

resp = Net::HTTP.get_response(URI("https://analognowhere.com/_/archive/")).body
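The regex scraping and the "bake the data into script.js" step could look something like this. The regex, the helper names, and the output shape are all my own assumptions, not the script's actual code:

```ruby
require 'json'

# Hypothetical: pull post paths out of the archive HTML with a regex.
# The real pattern depends on the site's markup.
def extract_paths(html)
  html.scan(%r{href="https://analognowhere\.com(/_[^"]+)"}).flatten
end

# Emit a JavaScript file with the scraped data baked in as a constant,
# so the page never has to fetch anything cross-origin at runtime.
def bake(posts)
  "const POSTS = #{JSON.generate(posts)};\n"
end
```

The output of `bake` would then be prepended to (or written as) 'script.js' before the rest of the front-end code.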
I run this script weekly with a systemd timer, since updates are not all that frequent.
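A weekly systemd timer is a pair of units along these lines (unit names, paths and schedule here are illustrative, not the actual setup):

```ini
# analognowhere-scrape.timer (name illustrative)
[Unit]
Description=Regenerate analognowhere search data weekly

[Timer]
OnCalendar=weekly
Persistent=true

[Install]
WantedBy=timers.target

# analognowhere-scrape.service (matching oneshot unit; path is a placeholder)
[Unit]
Description=Regenerate analognowhere search data

[Service]
Type=oneshot
ExecStart=/usr/bin/ruby /opt/analognowhere/scrape.rb
```

`Persistent=true` makes the timer catch up after downtime, which suits an infrequent scrape like this.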