Ennuicastr Technical Details
This document is for the benefit of the technically minded, and describes Ennuicastr's implementation details. If you're not technically minded or don't care about the details, you've found the wrong document for you.
First, let's address the elephant in the room: live voice chat isn't Ennuicastr at all. That's just Jitsi and WebRTC. I make no claims of originality for live voice chat, and it's a great thing that WebRTC is now standard in every browser.
Ennuicastr consists of three components: the server, the client, and the processing software. The server stores and processes audio data, the client produces audio data and sends it to the server, and the processing software turns raw audio data into usable audio streams for editing. Technically, the web site, which allows users to create and download recordings, is a fourth component, but that's just glue.
An Ennuicastr client maintains two separate WebSocket connections to the server. One is exclusively for timing information, and the other is for data transmission. Every ten seconds, the timing socket sends the local time to the server, and the server replies with both the client's original time and the server's time (in terms of the active recording). The client uses the difference between its current local time and the echoed time to approximate the round-trip time to the server, and combines that with the server's time to estimate the current correct time. It is the role of the client to timestamp each frame of audio with when it belongs in the recording!
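As a rough illustration, a client-side clock estimate along these lines might look like the following sketch (the message framing and the names here are hypothetical, not Ennuicastr's actual protocol):

```typescript
// Minimal sketch of the timing-socket idea. Assumes the server echoes back
// { clientTime, serverTime } as JSON; the real wire format is different.
let clockOffset = 0; // estimated (serverTime - localTime), in ms

function startTimeSync(timingSock: WebSocket) {
    setInterval(() => {
        timingSock.send(JSON.stringify({ clientTime: performance.now() }));
    }, 10000); // every ten seconds

    timingSock.onmessage = (ev: MessageEvent) => {
        const { clientTime, serverTime } = JSON.parse(ev.data);
        const rtt = performance.now() - clientTime; // approximate round trip
        // Assume the reply was generated halfway through the round trip.
        clockOffset = serverTime + rtt / 2 - performance.now();
    };
}

// The client can then stamp any local event with an estimated recording time:
function recordingTime(localTime: number): number {
    return localTime + clockOffset;
}
```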
Using libav.js (a simple port of the venerable ffmpeg's libraries to WebAssembly and JavaScript), the client encodes the audio it receives from the mic as either Opus or FLAC, configured in either case to use 20ms frames. In the case of FLAC, the data is converted to 24-bit (the browser audio libraries always deliver audio data as 64-bit floats, so it's impossible to know what the original quality was; 24-bit is the highest-quality option available). Using the information from the timing socket, it stamps each frame with its correct time and sends it off to the server. Since each frame is stamped with its time, there's no need to send silent frames; if the voice activity detector is used, then VAD-off frames simply aren't sent at all. The client buffers two seconds of audio even while the VAD is off, and sends it when the VAD switches on, so there's not that annoying VAD-characteristic “click” when the VAD kicks in.
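The VAD pre-buffering can be sketched roughly like this (the frame type and the sendFrame callback are made up for illustration; this is not the actual Ennuicastr code):

```typescript
// Sketch of the two-second VAD pre-buffer over 20ms encoded frames.
interface EncodedFrame { granule: number; data: Uint8Array; } // one 20ms Opus/FLAC frame

const PREBUFFER_MS = 2000;
const FRAME_MS = 20;
const prebuffer: EncodedFrame[] = [];

function handleFrame(frame: EncodedFrame, vadActive: boolean,
                     sendFrame: (f: EncodedFrame) => void) {
    if (vadActive) {
        // Flush the buffered lead-in first, so speech onsets aren't clipped.
        for (const buffered of prebuffer)
            sendFrame(buffered);
        prebuffer.length = 0;
        sendFrame(frame);
    } else {
        // VAD off: don't send, but keep the last two seconds around.
        prebuffer.push(frame);
        while (prebuffer.length > PREBUFFER_MS / FRAME_MS)
            prebuffer.shift();
    }
}
```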
If video recording is active, the client uses the MediaRecorder API to capture video, and libav.js to fix the timestamps live. Safari only supports the MPEG-4 container format, which does not provide enough header information to decode live, so Ennuicastr “cheats” on that platform by capturing a short (1-second) video, using its header information to learn how to decode, then performing a streaming capture using that information. Video data never touches the server; it's sent over WebRTC to the host, or just saved locally.
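A bare-bones streaming capture with MediaRecorder looks roughly like the following sketch (it omits the libav.js timestamp fixing and the Safari header-probe trick described above):

```typescript
// Simplified sketch of chunked video capture; onChunk is a hypothetical
// callback that would forward data over WebRTC or save it locally.
function captureVideo(stream: MediaStream, onChunk: (chunk: Blob) => void): MediaRecorder {
    const mimeType = MediaRecorder.isTypeSupported("video/webm")
        ? "video/webm"   // most browsers
        : "video/mp4";   // Safari: MPEG-4 only
    const recorder = new MediaRecorder(stream, { mimeType });
    recorder.ondataavailable = (ev: BlobEvent) => {
        if (ev.data.size > 0)
            onChunk(ev.data);
    };
    recorder.start(1000); // deliver a chunk roughly every second
    return recorder;
}
```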
That's... really all there is to the client. Add a dancing waveform rendered to an HTML5 canvas and a weasel experiencing ennui, and you've got Ennuicastr. The host client also has a master socket that can change the recording mode, and the UI to send messages on it, but that's pretty trivial.
You might assume that at this point I'd say “the server is where all the complexity is”, but no. The server just maintains a recording mode, makes sure connections have the right key, handles pricing, and writes the audio packets received from the clients to the disk. It writes using the Ogg file format, in which frames are always marked with their timestamps, and (barring some fix-ups for errors or abuse) just passes through the timestamps from the clients directly. The audio files it produces are technically valid by the Ogg specification, but useless because the audio data is so erratically timed. No software can reliably handle audio data like that.
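As an illustration, the timestamp passthrough with a basic sanity check might look like this tiny sketch (the clamping policy is purely illustrative, not the server's actual fix-up logic):

```typescript
// Pass the client's claimed timestamp through, clamped to a plausible range.
// All times here are milliseconds relative to the start of the recording.
function acceptTimestamp(claimed: number, recordingStart: number, now: number): number {
    const earliest = 0;                          // nothing before the recording started
    const latest = now - recordingStart + 60000; // nothing absurdly far in the future
    return Math.min(Math.max(claimed, earliest), latest);
}
```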
So, the server itself is pretty trivial, and intentionally so. Oh yeah, it also responds to time requests to keep the timestamps in sync.
The processing is arguably where all the complexity is, but even there, it's not that much. The most unique piece of infrastructure is oggcorrect, which lays out the timestamped data on a continuous track, filling any gaps and quashing any overflows. Gaps can arise from actual gaps in the data (e.g. VAD-off mode), or simply from the client not being continuously connected, and overflows arise from one clock running faster than others. After oggcorrect, the file is passed into good ol' ffmpeg, which does the actual processing.
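Conceptually, the gap-filling and overflow-quashing can be sketched like this (a simplified model, not oggcorrect's actual implementation, which works on Ogg pages rather than an in-memory array):

```typescript
// Lay timestamped 20ms frames onto a continuous timeline: fill gaps with
// silence, and drop ("quash") frames that would land before the cursor.
const FRAME_MS = 20;

function layOut(frames: { time: number; data: Uint8Array }[],
                silentFrame: Uint8Array): Uint8Array[] {
    const out: Uint8Array[] = [];
    let cursor = 0; // current position on the output timeline, in ms
    for (const frame of frames) {
        // Fill any gap before this frame with silence.
        while (frame.time >= cursor + FRAME_MS) {
            out.push(silentFrame);
            cursor += FRAME_MS;
        }
        // Overflow: this frame belongs before data already emitted
        // (one clock ran faster than the others), so drop it.
        if (frame.time < cursor)
            continue;
        out.push(frame.data);
        cursor += FRAME_MS;
    }
    return out;
}
```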
Overall, Ennuicastr is a triumph of knowing which components to write oneself and which components to use off-the-shelf. Synchronization is the most important part, and the part most frequently screwed up, so that's where most of the effort went: Nearly every component that doesn't directly relate to synchronization is off-the-shelf.
To be quite blunt, this design is also the correct way to do it. The clients do the initial encoding, but I'm not stuck with lossy, since I use a library for it. The server component which actually needs to be soft real-time has nothing interesting; it never even decodes audio! The processing is expensive, but can easily be made low-priority, and even relegated to a subset of CPU cores if need be. By moving all of the hard work to post-processing, Ennuicastr scales extremely well.