Juho Snellman's Weblog

LLMs are cheap

jsnell@iki.fi — Mon, 02 Jun 2025 22:00:00 GMT

This post is making a point - generative AI is relatively cheap - that might seem so obvious it doesn't need making. I'm mostly writing it because I've repeatedly had the same discussion in the past six months where people claim the opposite. Not only is the misconception still around, but it's not even getting less frequent. This is mainly written to have a document I can point people at, the next time it repeats.

It seems to be a common, if not a majority, belief that Large Language Models (in the colloquial sense of "things that are like ChatGPT") are very expensive to operate. This then leads to a ton of innumerate analyses about how AI companies must be obviously doomed, as well as a myopic view on how consumer AI businesses can/will be monetized.

It's an understandable mistake, since inference was indeed very expensive at the start of the AI boom, and those costs were talked about a lot. But inference has gotten cheaper even faster than models have gotten better, and nobody has an intuition for something becoming 1000x cheaper in two years. It just doesn't happen. It doesn't help that the common pricing model ("$ per million tokens") is very hard to visualize.

So let's compare LLMs to web search. I'm choosing search as the comparison since it's in the same vicinity and since it's something everyone uses and nobody pays for, not because I'm suggesting that ungrounded generative AI is a good substitute for search.

(It should also go without saying that these are just my personal opinions.)

What is the price of a web search?

Here's the public API pricing for some companies operating their own web search infrastructure, retrieved on 2025-05-02:

The Gemini API pricing lists a "Grounding with Google Search" feature at $35/1k queries. I believe that's the best number we can get for Google, since they don't publish prices for a "raw" search result API.
The Bing Search API is priced at $15/1k queries at the cheapest tier.
Brave has a price of $5/1k searches at the cheapest tier. Though there's something very strange about their pricing structure, with the unit pricing increasing as the quota increases, which is the opposite of what you'd expect. The tier with real quota is priced at $9/1k searches.

So there's a range of prices, but not a horribly wide one, and with the engines you'd expect to be of higher quality also having higher prices.

What is the price of LLMs in a similar domain?

To make a reasonable comparison between those search prices and LLM prices, we need two numbers:

How many tokens are output per query?
What's the price per token?

I picked a few arbitrary queries from my search history, and phrased them as questions, and ran them on Gemini 2.5 Flash (thinking mode off) in AI Studio:

[When was the term LLM first used?] -> 361 tokens, 2.5 seconds
[What are the top javascript game engines?] -> 1145 tokens, 7.6 seconds
[What are the typical carry-on bag size limits in europe?] -> 506 tokens, 3.4 seconds
[List the 10 largest power outages in history] -> 583 tokens, 3.7 seconds

Note that I'm not judging the quality of the answers here. The purpose is just to get rough numbers for how large typical responses are. A 500-1000 token range seems like a reasonable estimate.

What's the price of a token? The pricing is sometimes different for input and output tokens. Input tokens tend to be cheaper, and our inputs are very short compared to the outputs, so for simplicity let's consider all the tokens to be outputs. Here's the pricing of some relevant models, retrieved on 2025-05-02:

Model	Price / 1M tokens
Gemma 3 27B	$0.20 (source)
Qwen3 30B A3B	$0.30 (source)
Gemini 2.0 Flash	$0.40 (source)
GPT-4.1 nano	$0.40 (source)
Gemini 2.5 Flash Preview	$0.60 (source)
Deepseek V3	$1.10 (source)
GPT-4.1 mini	$1.60 (source)
Deepseek R1	$2.19 (source)
Claude 3.5 Haiku	$4.00 (source)
GPT-4.1	$8.00 (source)
Gemini 2.5 Pro Preview	$10.00 (source)
Claude 3.7 Sonnet	$15.00 (source)
o3	$40.00 (source)

If we assume the average query uses 1k tokens, these prices would be directly comparable to the prices per 1k search queries. That's convenient.

The low end of that spectrum is at least an order of magnitude cheaper than even the cheapest search API, and even the models at the low end are pretty capable. The high end is about on par with the highest end of search pricing. Comparing a pair that's midrange on quality, Bing Search vs. Gemini 2.5 Flash, the LLM is 1/25th the price.
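The comparison above is just a few lines of arithmetic, sketched here in Python. The prices are the ones quoted in this post; the function names are made up for illustration, and the 1,000 tokens per query default is the assumption stated above rather than a measured average.

```python
# Prices quoted in this post (USD).
# Search APIs are priced per 1k queries, LLMs per 1M output tokens.
SEARCH_PER_1K_QUERIES = {"Google (Gemini grounding)": 35.00, "Bing": 15.00, "Brave": 5.00}
LLM_PER_1M_TOKENS = {"Gemma 3 27B": 0.20, "Gemini 2.5 Flash Preview": 0.60, "o3": 40.00}


def llm_cost_per_1k_queries(price_per_1m_tokens, tokens_per_query=1000):
    """Cost of answering 1k queries, treating every token as an output token."""
    tokens = 1000 * tokens_per_query
    return price_per_1m_tokens * tokens / 1_000_000


def search_to_llm_ratio(search_name, model_name, tokens_per_query=1000):
    """How many times more expensive the search API is than the LLM."""
    llm = llm_cost_per_1k_queries(LLM_PER_1M_TOKENS[model_name], tokens_per_query)
    return SEARCH_PER_1K_QUERIES[search_name] / llm


if __name__ == "__main__":
    for model in LLM_PER_1M_TOKENS:
        ratio = search_to_llm_ratio("Bing", model)
        print(f"Bing vs {model}: search is {ratio:.1f}x the LLM price")
```

Changing `tokens_per_query` is the quickest way to see how sensitive the conclusion is to the response-length assumption.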

Note that many of the above models have cheaper pricing in exchange for more flexible scheduling (Anthropic, Google and OpenAI give a 50% discount for batch requests, Deepseek is 50%-75% cheaper during off-peak hours). I've not included those cheaper options in the table to keep things comparable, but the presence of those cheaper tiers is worth keeping in mind when thinking about the next section...

Objection!

I know some people are going to have objections to this back-of-the-envelope calculation, and a lot of them will be totally legit concerns. I'll try to address some of them preemptively. Slightly different assumptions can easily lead to clawing back 10% here and 50% there. But I don't see how to bridge a 25x gap just for breaking even, let alone making the AI significantly more expensive. If you want to play around with different assumptions, there's a little calculator widget below.

Surely the typical LLM response is longer than that - I already picked the upper end of what the (very light) testing suggested as a reasonable range for the type of question that I'd use web search for. There's a lot of use cases where the inputs and outputs are going to be much longer (e.g. coding), but then you'd need to also switch the comparison to something in that same domain as well.

The LLM API prices must be subsidized to grab market share -- i.e. the prices might be low, but the costs are high - I don't think they are, for a few reasons. I'd instead assume APIs are typically profitable on a unit basis. I have not found any credible analysis suggesting otherwise.

First, there's not that much motive to gain API market share with unsustainably cheap prices. Any gains would be temporary, since there's no long-term lock-in, and better models are released weekly. Data from paid API queries will also typically not be used for training or tuning the models, so getting access to more data wouldn't explain it. Note that it's not just that you'd be losing money on each of these queries for no benefit, you're losing the compute that could be spent on training, research, or more useful types of inference.

Second, some of those models have been released with open weights and API access is also available from third-party providers who would have no motive to subsidize inference. (Or the number in the table isn't even first party hosting -- I sure can't figure out what the Vertex AI pricing for Gemma 3 is). The pricing of those third-party hosted APIs appears competitive with first-party hosted APIs. For example, the Artificial Analysis summary on Deepseek R1 hosting.

Third, Deepseek released actual numbers on their inference efficiency in February. Those numbers suggest that their normal R1 API pricing has about 80% margins when considering the GPU costs, though not any other serving costs.

Fourth, there are a bunch of first-principles analyses on what the cost structure of models with various architectures should be. Those are of course mathematical models, but those costs line up pretty well with the observed end-user pricing of models whose architecture is known. See the references section for links.

The search API prices amortize building and updating the search index, LLM inference is based on just the cost of inference - This seems pretty likely to be true, actually? But the effect can't really be that large for a popular model: e.g. the allegedly leaked OpenAI financials claimed $2B/year spent on inference vs. $3B/year on training. Given the crazy growth of inference volumes (e.g. Google recently claimed a 50x increase in token volumes in the last year) the training costs are getting amortized much more effectively.

The search API prices must have higher margins than LLM inference - It's possible. I certainly don't know what the margins of any Search API providers are, though it seems fair to assume they're pretty robust. But, well, see the point above about Deepseek's released numbers on the R1 profit margins.

Also, it seems quite plausible that some Search providers would accept lower margins, since at least Microsoft execs have testified under oath that they'd be willing to pay more for the iOS query stream than their revenue, just to get more usage data.

Web search returns results 20x-100x faster than an LLM finishes the query, how could it be more expensive? - Search latency can be improved by parallelizing the problem, while LLM inference is (for now) serial in nature. The task of predicting a single token can be parallelized, but you can't predict all the output tokens at once.

But OpenAI made a loss, and they don't expect to make profit for years! - That's because a huge proportion of their usage is not monetized at all, despite the usage pattern being ideal for it. OpenAI reportedly made a loss of $5B in 2024. They also reportedly have 500M MAUs. To reach break-even, they'd just need to monetize (e.g. with ads) those free users for an average of $10/year, or $1/month. A $1 ARPU for a service like this would be pitifully low.

If the reported numbers are true, OpenAI doesn't actually have high costs for a consumer service that popular. High costs are what you'd expect to see if the cost of inference were the problem. They just have a very low per-user revenue, by choice.
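The break-even figure is easy to check. This is a minimal sketch using only the reported numbers cited above (a roughly $5B loss and roughly 500M MAUs); the function name is made up, and spreading the loss evenly over all monthly users is a simplifying assumption.

```python
def break_even_arpu(annual_loss_usd, monthly_active_users):
    """Extra revenue per user needed to cover the loss, as (per year, per month)."""
    per_year = annual_loss_usd / monthly_active_users
    return per_year, per_year / 12


if __name__ == "__main__":
    # Reported figures cited above: ~$5B loss in 2024, ~500M MAUs.
    per_year, per_month = break_even_arpu(5e9, 500e6)
    print(f"${per_year:.2f}/year, ${per_month:.2f}/month")
```

That works out to $10 per user per year, or a bit under the $1/month that the text rounds to.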

If you want to play around with different assumptions, here's a calculator:


Why does this matter?

I mean, you're right to ask that. Nothing really matters and eventually we'll all be dead.

But it is interesting how many people have built their mental model for the near future on a premise that was true for only a brief moment. Some things that will come as a surprise to them even assuming all progress stops right now:

There's an argument advanced by some people about how low prices mean it'll be impossible for AI companies to ever recoup model training costs. The thinking seems to be that it's just the prices that have been going down, but not the costs, and the low prices must be an unprofitable race to the bottom for what little demand there is. What's happening and will continue to happen instead is that as costs go down, the prices go down too, and demand increases as new uses become viable. For an example, look at the OpenRouter API traffic volumes, both in aggregate and in the relative share of cheaper models.

This post was mainly about APIs, but consumer usage will have exactly the same cost structure, just a different monetization structure. And given how low the unit costs must be, advertising isn't merely viable but lucrative.

From this it follows that the financials of frontier AI labs are a lot better than some innumerate pundits would have you believe. They're making a loss because they're not under pressure to be profitable, and aren't actively trying to monetize consumer traffic yet. This could well be a land grab unlike APIs, since unpaid consumer queries may be used for training while paid API queries typically are not. Even the subscription pricing might be there mainly for demand management rather than trying to run a profit.

The real cost problem isn't going to be with the LLMs themselves, it's with all the backend services that AI agents will want to access if even a rudimentary form of the agentic vision actually materializes. Running the AI is already cheap, will keep getting cheaper, and will always have a monetization model of some sort since it's what the end user is interacting with. Neither of those is true for the end-user services that have been turned into AI backends without their consent. An AI trying to, I don't know, book concert tickets whenever a band I like plays in my town will probably be phenomenally expensive to its third-party backends (e.g. scraping ticket sites). Those sites will be uncompensated for the expense while also removing their actual revenue streams.

I don't really know how that plays out.

Obviously many service owners will try to make unauthorized scraping harder, but that's a very hard problem to solve on the web. Maybe some of them give up on the web entirely, and move to mobile where they can at least get device attestations. Some might just give up on the open web, and require all usage to be signed in, with account creation being gated on something scarce. Some might become unviable and close up shop entirely.

If/when that happens, what's the play on the AI agent side? Will they choose an escalating adversarial arms race with increasingly dodgy tactics, or will they eventually decide that it's better to pay for the services they use? The former seems unsustainable. If the latter, then it feels like the core engineering challenge becomes one of building data provider backends optimized specifically for AI use, with the goal of scaling to massive volumes and cheaper unit prices, with the trade-off being higher latency, lower reliability and lower quality. That could be quite interesting from a systems perspective. (Yes, I'm aware of MCP, but it's a solution to an orthogonal issue.)

But one thing I'm confident won't be happening is that it's the AIs that turn out to be too expensive to run.

Additional reading

Below are some additional references that were not worked into the main narrative (this article was long-winded enough already).

Inference economics of language models (2025) - A mathematical model for estimating the cost structure, latency/cost tradeoffs, optimal cluster size, and optimal batching based on the LLM architecture.
LLM Inference Economics from First Principles (2025) - A very detailed cost-per-token computation on the cost structure of one specific model, Llama 3.3 70B.
Observations About LLM Inference Pricing (2025) - Analysis of the economics driven by pricing data rather than first-principles cost structure; concludes that proprietary models have very significant markups.
Large Language Models Search Architecture And Cost (2023) - Analysis on the cost of integrating LLMs into search; the LLM cost data is no longer very relevant due to the age of the article (GPT-3.5) but it uses a different way of estimating the search cost structure.

Web Environment Integrity vs. Private Access Tokens - They're the same thing!

jsnell@iki.fi — Tue, 25 Jul 2023 18:30:00 GMT

I've seen a lot of discussions in the last week about the Web Environment Integrity proposal. Quite predictably from the moment it got called things like "DRM for the web", people have been arguing passionately against it on HN, GitHub issues, etc. The basic claims seem to be that it's going to turn the web into a walled garden, kill ad blockers, kill all small browsers, kill all small operating systems, kill accessibility tools like screen readers, etc.

The Web Environment Integrity proposal is basically:

A website can request an attestation from the browser
The browser forwards the attestation requests to an attester
The attester checks properties like hardware and software integrity
If they check out, the attester creates a token and signs it with its private key.
The attester hands off the signed token to the browser, which in turn sends it to the website.
The website checks that the token was signed by a trusted attester

Here's a funny thing I suspect few of those commenters know: A very similar mechanism already exists on the web, and is already deployed in production browsers (Safari), operating systems (iOS, OS X), and hosting infrastructure (Cloudflare, Fastly). That mechanism is Private Access Tokens / Privacy Pass.

Here's what PATs (as deployed by Apple, and on by default) do to the best of my understanding:

A website can request an attestation from the browser
The browser forwards the attestation requests to an attester
The attester checks properties like hardware and software integrity.
If they check out, the attester calls the website's trusted token issuer
The issuer checks whether to trust the attester and whether the information passed by the attester is sufficient, and then issues a token signed by its private key
The attester hands off the signed token to the browser, which passes it to the website.
The website checks that the token was signed by a trusted token issuer

This launch was hailed in the tech press as a win for privacy and security, not as an attempt to kill accessibility tools or build a walled garden. [1]

You might notice that the basic operating model of the two protocols is almost exactly the same. So is their intended use. From the "DRM for websites" perspective, I don't think there is a difference.
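To make the similarity concrete, here is a toy model of the two flows in Python. It is a sketch under heavy assumptions: HMAC with a shared key stands in for the real public-key signatures, the device check is a hypothetical boolean, and every function and field name is invented for illustration rather than taken from either spec.

```python
import hashlib
import hmac

# Stand-in for a signature scheme: HMAC with a shared key. The real protocols
# use public-key signatures, so the website would only hold a public key.


def sign(key, message):
    return hmac.new(key, message, hashlib.sha256).digest()


def website_accepts(trusted_key, challenge, token):
    """Website-side check, identical in both flows: was this signed by a key I trust?"""
    return token is not None and hmac.compare_digest(sign(trusted_key, challenge), token)


def device_checks_out(device):
    # Hypothetical integrity check; neither spec defines the exact set of checks.
    return device.get("genuine", False)


def wei_flow(attester_key, device, challenge):
    """WEI-style: the attester checks the device and signs the token itself."""
    return sign(attester_key, challenge) if device_checks_out(device) else None


def pat_flow(issuer_key, trusted_attesters, attester_name, device, challenge):
    """PAT-style: the attester checks the device, then a separate issuer signs."""
    if not device_checks_out(device):
        return None
    if attester_name not in trusted_attesters:  # the issuer decides whom to trust
        return None
    return sign(issuer_key, challenge)
```

From the website's point of view the only difference is whose key it trusts. `website_accepts` does not change between the two flows.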

With both WEI and PATs, the website would be able to ask Apple to verify that the request is coming from a genuine non-jailbroken iPhone running Safari, and block the ones running Firefox on Linux. And in both, the intent is not for the API to be used for that kind of outright blocking.

Neither lists e.g. checking whether the browser is running an ad blocker extension as a use case. Both would have just the same technical capabilities for making that kind of thing happen, by just having the attester check for it, and I bet that in both cases the attester would be equally unmotivated in actually providing that kind of attestation.

It's also not that PATs would somehow make it easier for people to spin up new attesters for small or new platforms. Want to run your own attester for PATs? You could, but the issuers you care about will not trust it. [2]

Now, the technologies aren't quite identical, but the distinctions are subtle and would just matter for exactly the kind of anti-abuse work that both of the proposals were ostensibly meant for. The big one is the WEI proposal including the ability to content-bind the attestation to a specific operation. It's a feature anyone trying to use a feature like this for abuse prevention would think is needed, but that adds no power to the theorized "DRM for the web" use case. There is also a more obvious difference between the two, with whether the attester and issuer are the same entity or split. But that too is irrelevant in the discussion on how the technology could be misused. [3]

In principle there could also be differences in the exact things that the APIs allow attesting for. But neither standard defines the exact set of attestations, just the mechanisms.

Given the DRM narrative would have worked exactly the same for the two projects, why such a different reception? I can only think of two differences, both social rather than technical.

One is that the PAT (and related Privacy Pass) draft standards were written in the IETF and are dense standardese. There was no plaintext explainer. Effectively nobody outside of the internet standardization circles read those drafts, and if they had they wouldn't have known whether they needed to be outraged or not. The first time it actually broke through to the public was when Apple implemented it.

The other is the framing. PATs were sold to the public exclusively as a way of seeing fewer captchas. Who wouldn't want fewer captchas? WEI was pitched as a bunch of fairly abstract use cases and mostly from the perspective of the service provider, not for how it'd improve the user experience by reducing the need for invasive challenges and data collection.

This isn't the first time I've seen two attempts at a really similar project, with one getting lauded while the other gets trashed for something that's common to both. But it is the one where the two things are the most similar, and it feels like it should be instructive somehow.

If the takeaway is that standards proposals should be opaque and kept away from the public for as long as possible, before being launched straight to prod based on a draft spec, that'd be bad. If it's that standards proposals should be carefully written to highlight the benefit for the end user, even starting from the first draft, that's probably pretty good? And if it's that only Apple can launch any browser features without a massive backlash, it seems pretty damn bad.

[1] Just to be clear, the one significant HN discussion on PATs had similar arguments about it being DRM, so my claim is not that absolutely everyone loved PATs. But it didn't actually get traction as a hacker cause celebre, and as far as I can see the general media coverage was broadly positive.

[2] What's the process for getting Cloudflare or Fastly to trust a non-Apple attester anyway? I can't find any documentation.

[3] The split version seems kind of superior for deployment, since it means each site needs to only care about a single key (their chosen issuer). This makes e.g. the creation of a new attester a lot more tractable. You only need to convince half a dozen issuers to trust your new attester and ingest the keys, not try to sign up every single website in the world one by one.

A monorepo misconception - atomic cross-project commits

jsnell@iki.fi — Wed, 21 Jul 2021 11:00:00 GMT

In articles and discussions about monorepos, there's one frequently alleged key benefit: atomic commits across the whole tree let you make changes to both a library's implementation and the clients in a single commit. Many authors even go as far as to claim that this is the only benefit of monorepos.

I like monorepos, but that particular claim makes no sense! It's not how you'd actually make backwards incompatible changes, such as interface refactorings, in a large monorepo. Instead the process would be highly incremental, and more like the following:

Push one commit to change the library, such that it supports both the old and new behavior with different interfaces.
Once you're sure the commit from stage 1 won't be reverted, push N commits to switch each of the N clients to use the new interface.
Once you're sure the commits from stage 2 won't be reverted, push one commit to remove the old implementation and interface from the library.
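The three stages above can be sketched on a toy library. The names (`area`, `rect_area`) and the interface change itself are hypothetical, picked only to show the shape of each commit.

```python
import warnings

# Stage 1: the library grows a new interface next to the old one.
# `area` (old) and `rect_area` (new) are hypothetical names for illustration.


def rect_area(*, width, height):
    """New interface: keyword-only arguments."""
    return width * height


def area(dims):
    """Old interface: takes a (width, height) tuple. Kept until all clients migrate."""
    warnings.warn("area() is deprecated, use rect_area()", DeprecationWarning, stacklevel=2)
    width, height = dims
    return rect_area(width=width, height=height)


# Stage 2: each client is switched in its own small, easily reverted commit.
def client_before():
    return area((3, 4))


def client_after():
    return rect_area(width=3, height=4)


# Stage 3: once no callers remain, a final commit deletes `area`.
```

The deprecation warning is one cheap way to discourage new uses of the old interface while stage 2 is in progress.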

There's a bunch of reasons why this is a nicer sequencing than a single atomic commit, but they're mostly variations on the theme: mitigating risks. If something breaks, you want as few things as possible to break at once, and for the rollback to a known-good state to be simple. Here's how the risks are mitigated at the various stages in the process:

There is nothing risky at all about the first commit. It is just adding new code that's not yet used by anyone.
The commits for changing the clients can be done gradually, starting with the ones that the library owners are themselves working on, the projects that are most likely to detect bugs, or the clients that are most forgiving to errors. Depending on the risk profile of the change, you might even use these commits as a form of staged rollout, where you'll wait to see if the previous clients report any problems in production before sending the next batch of commits for code review.
The final commit to remove the old implementation can only break a minimal number of clients: the ones that just started using the library between the removal commit being reviewed and pushed, and did so using the old interface. The ideal environment would have tooling in place to prevent that kind of backsliding from happening in the first place (e.g. lint warnings on new uses of deprecated interfaces).

If anything goes wrong in stage 2, it's trivial to revert a commit that's only touching a couple of files. By contrast, reverting a commit that's spanning hundreds of projects would be quite painful, especially if the repo has any kind of per-directory ACLs (which I think is mandatory for a big monorepo). It gets worse if the breakage isn't detected immediately, since the more code that the single change is affecting, the less likely it is that the revert applies cleanly.

If anything goes wrong in stage 3, it would also have gone wrong when using atomic commits. But with atomic commits the breakage in stage 3 is far more likely, since the new users will naturally use the old interface (the new one doesn't exist yet in their view of the world), and since the window between start of code review and committing will be wider. And again, the rollback will be far easier with the commit that's only touching the library and not the clients.

There are some additional reasons why the huge commit will be annoying. For example, getting a clean presubmit CI run will become progressively harder the more projects a single commit is changing.

Sure, the atomic commit will save a little bit of work in not needing to have the implementation support both interfaces at once. But that tiny saving is just not a worthwhile tradeoff when compared to how much work wrangling the huge commit would be.

It's particularly easy to see that the "atomic changes across the whole repo" story is rubbish when you move away from libraries, and also consider code that has any kind of more complicated deployment lifecycle, for example the interactions between services and client binaries that communicate over an RPC interface. Obviously you can't do an atomic change in that case, since you need to continue supporting the old server implementation until all client binaries have been upgraded (and are rollback-safe). The same goes for changes to database schemas, command line tools, synchronized client-side JavaScript + backend changes, etc.

I think it's true that monorepos make refactoring easier. So that's not the problem. It's also true that they have atomic commits across projects. But the two facts have nothing to do with each other. The reasons monorepos make refactoring simpler all boil down to everyone in the organization having a shared view of what the current state is:

A monorepo will, in practice, mean trunk-based development. You'll know that everybody really is on HEAD rather than actually doing their development on some year-old branch.
And conversely, you'll know that every user of the library is using your library from HEAD rather than pinning it to some year-old version.
It's trivial to find all the current callers, so that you know which clients need to be updated. (Once you've solved the highly non-trivial problem of having any kind of monorepo tooling at scale, of course.)

In theory you could do the exact same thing with multirepos assuming sufficient tool support, discipline about code organization, enforced trunk-based development in all repositories, a master list of all repositories in the org, and defaulting to all repositories being readable by every engineer with no hidden silos. That's all technically doable, but I suspect not culturally compatible with using multirepos in the first place.

Where does this misconception come from? It's certainly present in the Google monorepo paper, which somewhat contradicts itself on this. On one hand, they describe exactly this form of atomic refactoring as a benefit of monorepos:

The ability to make atomic changes is also a very powerful feature of the monolithic model. A developer can make a major change touching hundreds or thousands of files across the repository in a single consistent operation. For instance, a developer can rename a class or function in a single commit and yet not break any builds or tests.

But when it comes to what the actual refactoring workflow is, the process that's described is quite different:

A team of Google developers will occasionally undertake a set of wide-reaching code-cleanup changes to further maintain the health of the codebase. The developers who perform these changes commonly separate them into two phases. With this approach, a large backward-compatible change is made first. Once it is complete, a second smaller change can be made to remove the original pattern that is no longer referenced.

I suspect what happened here was that the atomic commits were identified as a benefit in the abstract, with refactoring being used as an illustration of a use case. This was then quite understandably read as a practical example of how you'd work with a monorepo.

There might be a few cases where atomic commits across the whole repository are the right solution, but it has to be exceedingly rare. The example of renaming a function with thousands of callers, for example, is probably better handled by just temporarily aliasing the function, or by temporarily defining the new function in terms of the old. (But this does suggest that languages, both programming languages and IDLs, should make aliasing and indirection easy for as many constructs as possible).
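In a language where functions are ordinary values, the temporary alias is a one-liner. A minimal sketch in Python, with both function names invented for the example:

```python
def compute_checksum(data):
    """New name; the implementation lives here from now on."""
    return sum(data) % 256


# Temporary alias so the existing callers keep working while they are
# migrated in small batches. Deleted in a final cleanup commit.
calc_cksum = compute_checksum
```

Both names refer to the same function object, so callers of either one see identical behavior during the migration.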

Are there organizations with a large monorepo where atomic cross-project commits are routinely used to change both the implementation and the clients?

Computing multiple hash values in parallel with AVX2

jsnell@iki.fi — Sun, 19 Mar 2017 12:00:00 GMT

I wanted to compute some hash values in a very particular way, and couldn't find any existing implementations. The special circumstances were:

The keys are short (not sure exactly what size they'll end up, but almost certainly in the 12-40 byte range).
The keys are all of the same length.
I know the length at compile time.
I have a batch of keys to process at once.

Given the above constraints, it seems obvious that doing multiple keys in a batch with SIMD could speed things up over computing each one individually. Now, typically small data sizes aren't a good sign for SIMD. But that's not the case here, since the core problem parallelizes so neatly.

After a couple of false starts, I ended up with a version of xxHash32 that computes hash values for 8 keys at the same time using AVX2. The code is at parallel-xxhash.

Benchmarks

Before heading off into the weeds with the details, below are a couple of pretty graphs showing performance with different key sizes for a few different implementations: CityHash64 since it's been my default hash function for years, xxHash64 since my parallel implementation was based on xxHash32, and MetroHash64 since I saw people suggesting it was the fastest option for small keys. I did not include FarmHash since it was consistently slower than CityHash for all key sizes.

Finally, to isolate the benefits of specializing for the statically known key sizes, I've included a scalar version of xxHash32. It has exactly the same structure as the parallel version, except for not using SIMD [0].

All implementations computed hashes for the same number of keys; the parallel implementations did it 8 keys at a time, the others did them sequentially. The tests were run on an i7-6700 and GCC 6.3.0, with -O3 -march=native -fno-strict-aliasing. The benchmark code is in the repository, but you'll need to bring your own copies of the external hash table libraries.

First, let's look at the time taken per key for key sizes relevant to my use case (this graph is 4-72 bytes, but as mentioned before the most interesting range for me is around 12-40 bytes):

That looks pretty nice, with very significant speedups compared to the alternatives on all the key sizes. With larger key sizes the parallel Murmur3 (my first try) quickly runs out of steam, but the parallel xxHash32 stayed ahead of the pack. We'll switch to showing time per byte rather than time per key here.

And at 512 bytes or so, the time per byte has flattened out completely:

Don't look under the rug

So what are the downsides? Why wouldn't everyone use this?

The most glaring problem is that most applications don't do hash computations in parallel. Either it's going to be fundamentally impossible, or at least it will require a major restructuring.

Second, I've swept a small detail under the rug: the parallel implementations were using column-major order for the data. It's the natural way to structure this. The timings above do not include a row-major to column-major conversion step. That's because my application was already using column-major anyway. But if that weren't the case, it's totally possible that the conversion step would wipe away a good chunk of the gains. (What about scatter-gather? See below).
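For readers unfamiliar with the layout, here is a rough Python model of the row-major to column-major conversion for one batch of 8 keys. It is only an illustration of the data layout: the function name is invented, and little-endian 32-bit words are an assumption, not necessarily what the repository does.

```python
import struct

LANES = 8  # one AVX2 register holds eight 32-bit lanes


def to_column_major(keys):
    """Turn 8 equal-length keys (row-major bytes) into a list of word columns.

    columns[i][lane] is the i-th little-endian 32-bit word of key number `lane`.
    """
    assert len(keys) == LANES
    key_len = len(keys[0])
    assert key_len % 4 == 0 and all(len(k) == key_len for k in keys)
    rows = [struct.unpack(f"<{key_len // 4}I", k) for k in keys]
    return [list(col) for col in zip(*rows)]
```

Each column then corresponds to one vector load: word i of all eight keys sits side by side, so one SIMD instruction advances all eight hash states by one step.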

Third, I suspect that most uses of hash tables use strings as keys. This code will not work at all in that use case. Not only do the sizes of keys have to be statically known, but (another detail I skimmed over above) they also need to be a multiple of 4 bytes long. Basically, I want to use structures as hash keys; not sure how many other people also need that.

And fourth, the parallel implementations were using the 32-bit variants of the algorithms due to reasons that I'll explain later. That does not make the benchmarks unfair (the 64-bit versions are faster than the 32-bit ones). But some applications will need those extra bits in the hash value. This code can't provide it.

So while this should work fine for me (though that still remains to be seen), it might not be a very large ecological niche.

What's interesting about this?

Converting from the scalar version to the parallel version is a fairly mindless process, not many insights to be had in that part. But while doing this, I bumped into some interesting aspects on the periphery.

Rotates

All the fast and high quality hash functions I looked at seemed to be descendants of Murmur, and used rotates as their primitive of choice for moving bits down. This is most likely because x86 has a dedicated rotate instruction, while most other methods require two instructions, e.g. shift+xor. For AVX that's not the case, and you need to synthesize the rotate from two shifts and a xor/or.

Based on some quick testing, a single-instruction replacement could give a 40% speedup, and a two instruction replacement a 20% speedup. There's not a huge number of single instruction options available though: horizontal 16-bit addition/subtraction, or the 8-bit shuffles. I suspect neither would work very well due to the effects aligning at an 8 bit boundary. With two instructions a shift+xor is probably the best option. Would be interesting to see if the best speed/quality tradeoff is different for AVX than for x86.
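To make the instruction counting concrete, here's a minimal scalar sketch of the two options: a rotate spelled out as two shifts plus an or (three operations when there's no native rotate), and the cheaper shift+xor. The function names are made up for illustration and this is not the code from the repository; the lane loop just stands in for what a SIMD register would do to 8 keys at once.

```cpp
#include <cassert>
#include <cstdint>

// Rotate left, written the way a SIMD version without a rotate instruction
// has to do it: two shifts combined with an or. Requires 0 < r < 32.
inline std::uint32_t rotl32(std::uint32_t x, unsigned r) {
  assert(r > 0 && r < 32);
  return (x << r) | (x >> (32 - r));
}

// The two-operation alternative: a shift followed by an xor.
inline std::uint32_t shift_xor(std::uint32_t x, unsigned r) {
  assert(r < 32);
  return x ^ (x >> r);
}

// Apply the rotate to 8 independent lanes, one lane per key.
inline void rotl32_lanes(std::uint32_t (&lanes)[8], unsigned r) {
  for (int i = 0; i < 8; ++i) {
    lanes[i] = rotl32(lanes[i], r);
  }
}
```

Note that the two are not interchangeable inside an existing hash function: swapping the rotate for a shift+xor changes the output, so it would be a new function that needs its own quality testing.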

Multiplies

These days new hash functions are mostly built with 64*64->64 multiplies. We won't have that in SIMD until AVX-512 (and given the way things are going, I wonder if a general purpose CPU using AVX-512 will actually ever launch). Synthesizing a 64-bit multiply from 32-bit multiplies doesn't seem viable for this use case. So for this use case, we really want to look at the hash functions defined a few years ago rather than the latest hotness.

Memory layout

Like I mentioned earlier, my data is already in column-major order so I didn't need to worry about wrangling that. But at one point I thought that it'd be nice to provide an alternate version that would work on row-major data. That's what scatter-gather is for, right?

Nope, the gather instructions are just unbelievably slow, and additionally for some reason prevented compilers from unrolling the Murmur3 loop, for a 4x performance loss. (Even on GCC 6.3 and clang 3.8). In theory the xxHash inner loop should be better for the gather instructions, since at least there you're not depending on the compiler unrolling to get multiple parallel loads going. But the results there were only marginally less bad.
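For readers who haven't dealt with the two layouts, here's a minimal sketch of what an explicit row-major to column-major conversion pass looks like. It assumes keys made of 32-bit words and batches of 8 keys; `kLanes` and `to_column_major` are hypothetical names, not part of the benchmark code, and nothing here says how this pass would perform relative to gather.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of keys hashed together (8 x 32-bit lanes).
constexpr std::size_t kLanes = 8;

// Row-major input: rows[k] holds all the words of key k.
// Column-major output: out[w * kLanes + k] holds word w of key k, so the
// w-th word of all 8 keys ends up contiguous in memory.
inline std::vector<std::uint32_t> to_column_major(
    const std::array<std::vector<std::uint32_t>, kLanes>& rows) {
  const std::size_t words = rows[0].size();
  std::vector<std::uint32_t> out(words * kLanes);
  for (std::size_t k = 0; k < kLanes; ++k) {
    assert(rows[k].size() == words);  // every key has the same size
    for (std::size_t w = 0; w < words; ++w) {
      out[w * kLanes + k] = rows[k][w];
    }
  }
  return out;
}
```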

Auto-vectorization

After having written the version using intrinsics, it occurred to me that I really should have started off with just writing out plain C++ with the same semantics, and see if it auto-vectorizes. Because this really looks like it should be a very easy case. And while the transformation from a scalar version to using intrinsics is not too bad, the transformation to standard C++ expressing the same order of operations on the same memory layout is easier yet. The theoretically auto-vectorizable code certainly looks very pretty compared to the AVX intrinsic soup.

But ignoring aesthetics, the results were mixed. GCC 6.3 seemed to vectorize everything perfectly. GCC 4.9 [1] missed something (I didn't track down exactly what) that cost about 25% performance. And Clang 3.8 did nothing at all, with the plain-C++ version being 150% slower than the version using intrinsics. So still a bit on the fragile side. But this is the best showing for auto-vectorization that I've experienced so far.

(The GCC 4.9 case is particularly annoying; it would have been easy to write the auto-vectorizable version first, see the speedups and think auto-vectorization was working, but miss that it was still leaving a lot of performance on the table).
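As an illustration of what "plain C++ with the same semantics" can look like, here's a sketch of one xxHash32-style round applied to 8 lanes of column-major data. The constants and the round structure (multiply, rotate by 13, multiply) follow the xxHash32 reference as I understand it; the function names are hypothetical, and whether a given compiler actually vectorizes the loop is exactly the fragile part described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 8;
constexpr std::uint32_t kPrime1 = 0x9E3779B1u;  // xxHash32 PRIME32_1
constexpr std::uint32_t kPrime2 = 0x85EBCA77u;  // xxHash32 PRIME32_2

// One xxHash32-style round for a single accumulator.
inline std::uint32_t round_scalar(std::uint32_t acc, std::uint32_t input) {
  acc += input * kPrime2;
  acc = (acc << 13) | (acc >> 19);  // rotate left by 13
  return acc * kPrime1;
}

// The same round for 8 keys at once. `words` holds the same word position
// taken from 8 different keys (column-major). The loop body has no
// dependencies between lanes, which is what gives the compiler a chance
// to turn it into SIMD code.
inline void round_lanes(std::uint32_t* acc, const std::uint32_t* words) {
  for (std::size_t i = 0; i < kLanes; ++i) {
    acc[i] = round_scalar(acc[i], words[i]);
  }
}
```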

32-bit output values

The other advantage of 64-bit operations is that the natural implementation will end up producing a 64-bit hash value. Now, for normal hash tables I'm totally OK with a single 32-bit hash value. But there's some use cases like Cuckoo hash tables or Bloom Filters where one would really like more key material.

Before moving from Murmur3 to xxHash, I experimented a bit with a version that would not only compute results for multiple different keys at once, but also do it with multiple different seed values. It was actually pretty efficient. I didn't end up redoing that work for the xxHash version though. Primarily since I don't actually need that version right now, and secondarily since I'm actually not sure whether the different seed values will give different enough outputs for use in probabilistic data structures.

(If anyone knows for sure whether the last bit is true or not, please let me know).

Is there a faster non-parallel hash function here?

As mentioned multiple times, computing multiple keys in parallel is a very niche use case. But based on the benchmark graphs for large key sizes, I wonder if there's a decent non-parallel hash function hidden here: compute the 8 32-bit streams in parallel and combine them at the end (or at certain block boundaries). After all, that's already what xxHash does on a smaller scale.

This seems like something that people would already have explored in the quest for faster and faster hashing for large key sizes. But I can't find any trace of such an implementation. Maybe everyone had already moved to 64-bit multiplies by the time AVX2 started to be widely deployed and 32-bit multiplies became the faster option again. Or maybe 32-bit hash values for large key sizes aren't actually a useful point in the design space.

Designing hash functions is hard. I explicitly did not want to invent a new one here, but just re-implement existing algorithms. I even went so far as to add in the "mix in the length of the key" steps, just so that I could verify my code against the reference implementations. Sure, it's a useless step given the length is constant. But it doesn't cost that much to do either, and lets me not worry about accidentally destroying the hash quality.

But if I wanted to burn some brain cycles on designing one and a lot of CPU cycles on running SMHasher... 32-bit multiplies + shift-xor, working 64 bytes at a time, and code organized in a way that makes it easy to auto-vectorize could be a pretty interesting place to start from.

Footnotes

[0] Note that I tried to make sure to isolate this to specializing for the key size, not to e.g. be able to hoist any computations outside the benchmark loop. AFAIK all implementations went through the same number of non-inlined function calls.

[1] Yes, it's a couple of years old. But that's Debian stable for you. And to be honest, a year ago our main compiler at work was still GCC 4.4. Compared to that, 4.9 feels pretty darn luxurious.

I've been writing ring buffers wrong all these years

jsnell@iki.fi — Tue, 13 Dec 2016 17:00:00 GMT

So there I was, implementing a one element ring buffer. Which, I'm sure you'll agree, is a perfectly reasonable data structure.

It was just surprisingly annoying to write, due to reasons we'll get to in a bit. After giving it a bit of thought, I realized I'd always been writing ring buffers "wrong", and there was a better way.

Array + two indices

There are two common ways of implementing a queue with a ring buffer.

One is to use an array as the backing storage plus two indices to the array; read and write. To shift a value from the head of the queue, index into the array by the read index, and then increment the read index. To push a value to the back, index into the array by the write index, store the value in that offset, and then increment the write index.

Both indices will always be in the range 0..(capacity - 1). This is done by masking the value after an index gets incremented.

That implementation looks basically like:

uint32 read;
uint32 write;
mask(val)  { return val & (array.capacity - 1); }
inc(index) { return mask(index + 1); }
push(val)  { assert(!full()); array[write] = val; write = inc(write); }
shift()    { assert(!empty()); ret = array[read]; read = inc(read); return ret; }
empty()    { return read == write; }
full()     { return inc(write) == read; }
size()     { return mask(write - read); }

The downside of this representation is that you always waste one element in the array. If the array is 4 elements, the queue can hold at most 3. Why? Well, an empty buffer will have a read pointer that's equal to the write pointer; a buffer with capacity N and size N would also have a read pointer equal to the write pointer. Like this:

The 0 and 4 element cases are indistinguishable, so we need to prevent one from ever happening. Since empty queues are kind of necessary, it follows that the latter case needs to go. The queue has to be defined as full when one element in the array is still unused. And that's the way I've always done it.

Losing one element isn't a huge deal when the ring buffer has thousands of elements. But when the array is supposed to have just one element... That's 100% overhead, 0% payload!
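Here's a minimal C++ rendering of the pseudocode above that makes the wasted slot easy to see. The class name `MaskedRing` and the template parameters are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Ring buffer with two masked indices, following the pseudocode above.
// N must be a power of two; the queue holds at most N - 1 elements.
template <typename T, std::uint32_t N>
class MaskedRing {
  static_assert(N != 0 && (N & (N - 1)) == 0,
                "capacity must be a power of two");

 public:
  bool empty() const { return read_ == write_; }
  bool full() const { return inc(write_) == read_; }
  std::uint32_t size() const { return mask(write_ - read_); }

  void push(const T& val) {
    assert(!full());
    array_[write_] = val;
    write_ = inc(write_);
  }

  T shift() {
    assert(!empty());
    T ret = array_[read_];
    read_ = inc(read_);
    return ret;
  }

 private:
  static std::uint32_t mask(std::uint32_t val) { return val & (N - 1); }
  static std::uint32_t inc(std::uint32_t index) { return mask(index + 1); }

  T array_[N] = {};
  std::uint32_t read_ = 0;   // always in the range 0..N-1
  std::uint32_t write_ = 0;  // always in the range 0..N-1
};
```

A `MaskedRing<int, 4>` reports `full()` after three pushes, and a freshly constructed `MaskedRing<int, 1>` is both `empty()` and `full()` at the same time, so it can never hold anything.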

Array + index + length

The alternative is to use one index field and one length field. Shifting an element indexes into the array by the read index, increments the read index, and then decrements the length. Pushing an element writes to the slot that is "length" elements after the read index, and then increments the length. That looks something like this:

uint32 read;
uint32 length;
mask(val)  { return val & (array.capacity - 1); }
inc(index) { return mask(index + 1); }
push(val)  { assert(!full()); array[mask(read + length++)] = val; }
shift()    { assert(!empty()); --length; ret = array[read]; read = inc(read); return ret; }
empty()    { return length == 0; }
full()     { return length == array.capacity; }
size()     { return length; }

This uses the full capacity of the array, with the code not getting much more complex.

But at least I've never liked this representation. The most common use for ring buffers is for it to be the intermediary between a concurrent reader and writer (be it two threads, two processes sharing memory, or a software process communicating with hardware). And for that, the index + length representation is kind of miserable. Both the reader and the writer will be writing to the length field, which is bad for caching. The read index and the length will also need to always be read and updated atomically, which would be awkward.

(Obviously my one element ring buffer wasn't going to be used in a concurrent setting. But it's a matter of principle.)

Array + two unmasked indices

So is there an option that gets the benefits of both representations, without introducing a third state variable? (Whether it's two indices + a size, or two indices + some kind of a full vs. empty flag). Turns out there is, and it's really simple. It uses two indices, but with one tweak compared to the first solution: don't squash the indices into the correct range when they are incremented, but only when they are used to index into the array. Instead you let them grow unbounded, and eventually wrap around to zero once the unsigned integer overflows. So:

This reclaims the wasted slot. The code modifying the indices also becomes simpler, since the clumsy ordering of increments vs. array accesses was only needed for maintaining the invariant that the index is always in range.

uint32 read;
uint32 write;
mask(val)  { return val & (array.capacity - 1); }
push(val)  { assert(!full()); array[mask(write++)] = val; }
shift()    { assert(!empty()); return array[mask(read++)]; }

Checking the status of the ring also gets simpler:

empty()    { return read == write; }
full()     { return size() == array.capacity; }
size()     { return write - read; }

This all works, assuming the following restrictions:

The implementation language supports wraparound on unsigned integer overflow. If it doesn't, this approach doesn't really buy anything. (What will happen in these languages is that the indices get promoted to bignums which will be bad, or they get promoted to doubles which will be worse. So you'll need to manually restrict their range anyway).
The capacity must always be a power of two. (Edit: This limitation does not come just from the definition of mask using a bitwise and. It applies even if mask were defined using modular arithmetic or a conditional. It's required for the code to be correct on unsigned integer overflow.)
The maximum capacity can only be half the range of the index data types. (So 2^31 when using 32 bit unsigned integers). In a way that could be interpreted as stealing the top bit of the index to function as a flag. But the case against flags isn't so much the extra memory as having to maintain the extra state.

All of those seem like non-issues. What kind of a monster would make a non-power of two ring anyway?
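For completeness, here's a minimal C++ sketch of the unmasked version. The class name `Ring` is made up, and the constructor argument for the starting index exists only so that the wraparound case is easy to exercise; a real implementation would just start at zero.

```cpp
#include <cassert>
#include <cstdint>

// Ring buffer with two unmasked indices. The indices are only masked when
// they are used to index into the array, and otherwise grow until they
// wrap around on unsigned overflow. N must be a power of two.
template <typename T, std::uint32_t N>
class Ring {
  static_assert(N != 0 && (N & (N - 1)) == 0,
                "capacity must be a power of two");

 public:
  // `start` is a hypothetical testing hook: both indices begin there.
  explicit Ring(std::uint32_t start = 0) : read_(start), write_(start) {}

  bool empty() const { return read_ == write_; }
  bool full() const { return size() == N; }
  // Unsigned subtraction gives the right answer even after write_ has
  // wrapped around and read_ has not.
  std::uint32_t size() const { return write_ - read_; }

  void push(const T& val) {
    assert(!full());
    array_[mask(write_++)] = val;
  }

  T shift() {
    assert(!empty());
    return array_[mask(read_++)];
  }

 private:
  static std::uint32_t mask(std::uint32_t val) { return val & (N - 1); }

  T array_[N] = {};
  std::uint32_t read_;
  std::uint32_t write_;
};
```

With this version a `Ring<int, 1>` can actually hold its one element. The power of two restriction is easy to see with a smaller index type: with 8-bit indices and a capacity of 3, index 255 maps to slot 0 (255 = 3 * 85), and the next index wraps around to 0, which maps to slot 0 again.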

This is of course not a new invention. The earliest instance I could find with a bit of searching was from 2004, with Andrew Morton mentioning it in a code review so casually that it seems to have been a well established trick. But the vast majority of implementations I looked at do not do this.

So here's the question: Why do people use the version that's inferior and more complicated? I must have written a dozen ring buffers over the years, and before being forced to really think about it, I'd always just used the first definition. I can understand why a textbook wouldn't take advantage of unsigned integer wraparound. But it seems like it should be exactly the kind of cleverness that hackers would relish using and passing on.

Could it just be tradition? It seems likely that this is the kind of thing one learns by osmosis, and then never revisits. But even so, you'd expect the "good" implementations to push out the "bad" ones at some point, which doesn't seem to be happening in this case.
Is it resistance to having code actually take advantage of integer overflow, rather than it be a sign of a bug?
Are non-power of two capacities for ring buffers actually common?

Join me next week for the exciting sequel to this post, "I've been tying my shoelaces wrong all these years".

Ratas - A hierarchical timer wheel

jsnell@iki.fi — Thu, 28 Jul 2016 01:00:00 GMT

Last week I needed a timer wheel for a hobby project. That's a data structure that's been reimplemented over and over in the last three decades, but for various reasons I couldn't get excited by any of the freely available ones. Obviously this means that one more implementation was needed, hence Ratas - a hierarchical timer wheel. Unfortunately my vacation ran out before I could get back to the original project, but that's the nature of yak shaving.

In this post I'll first explain briefly what timer wheels are - you might want to read one of the references instead if you've got the time - and then go into more detail on why I wrote a new one.

(Hierarchical) timer wheels

Timer wheels are one way of implementing timer queues, which in turn are used to schedule events to happen at some future time. If you have tens or hundreds of timers, it doesn't matter much how they're stored. An unsorted list will do just fine. To handle millions of timers you need something a bit more sophisticated.

A timer wheel is effectively a ring buffer of linked lists of events, and a pointer to the ring buffer. Each slot corresponds to a specific timer tick, and contains the head of a linked list. The linked list contains the events that should happen on that tick. So something like this.

(The pointer is in red, the wheel slots in gray, the events in orange, and the numbers show the time the slot/event is associated with.)

For every tick of time, the pointer moves forward one slot. The slot that was passed now refers to a time one full rotation in the future. The first slot is no longer tick 0, but tick 8:

Any events in the slot will get executed as the pointer passes it. So on the next tick events a and b are executed, and removed from the ring.

A timer wheel has O(1) time complexity and cheap constant factors for the important operations of inserting or removing timers. Various kinds of sorted sequences (lists, trees, heaps) will scale worse, and the constant factors tend to be larger. But a basic timer wheel only works for a limited time range. The question is how you extend them to work when the timer range is larger than the size of the ring.
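A basic wheel like the one described above fits in a few lines. This is a minimal sketch under simplifying assumptions, not how Ratas is implemented: events are plain integer ids, each slot is a `std::vector` rather than an intrusive linked list, and the names `SimpleWheel`, `schedule` and `advance` are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// A single-level timer wheel with 8 slots of one tick each.
class SimpleWheel {
 public:
  static constexpr std::uint64_t kSlots = 8;  // must be a power of two

  // Schedule event `id` to fire `delta` ticks from now. The wheel can only
  // represent times up to one full rotation ahead.
  void schedule(int id, std::uint64_t delta) {
    assert(delta > 0 && delta <= kSlots);
    slots_[(now_ + delta) & (kSlots - 1)].push_back(id);
  }

  // Move the pointer forward by one tick and return the events in the
  // slot it lands on. The slot is left empty, ready for reuse one full
  // rotation later.
  std::vector<int> advance() {
    ++now_;
    std::vector<int> fired;
    fired.swap(slots_[now_ & (kSlots - 1)]);
    return fired;
  }

  std::uint64_t now() const { return now_; }

 private:
  std::uint64_t now_ = 0;
  std::vector<int> slots_[kSlots];
};
```

Both operations touch exactly one slot, which is where the O(1) behavior comes from; the assertion in `schedule` is the limited time range mentioned above.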

One solution is the hierarchical timer wheel, which layers multiple simple timer wheels running at different resolutions on top of each other. Each wheel has its own slots and its own pointer.

When an event is scheduled far enough in the future that it does not fit the innermost (core) wheel, it instead gets scheduled on one of the outer wheels. So something like the following, where the timers scheduled for ticks 9 and 13 have been stored in the first slot of the second wheel:

Usually a timer tick will just advance the pointer on the core wheel, working exactly like the simple timer wheel. That's true all the way up to and including tick 7 in this example.

But when the pointer of the core wheel wraps around, the pointer for the second wheel will advance by one slot. The events contained in that slot will either be executed or promoted to the correct slot in the core ring. (As happens here for events e and f).

This obviously generalizes to more than two layers. A full rotation of any single wheel will advance the pointer by one on the next layer of wheels.
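The promotion step can be sketched with just two layers of 8 slots each. As with any toy version, this makes simplifying assumptions that are not taken from Ratas: events are integer ids stored in vectors, and `TwoLevelWheel`, `Event`, `schedule` and `advance` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Event {
  int id;
  std::uint64_t when;  // absolute tick on which the event should fire
};

// Two wheels of 8 slots: the core wheel has 1-tick slots, the outer wheel
// has 8-tick slots.
class TwoLevelWheel {
 public:
  static constexpr std::uint64_t kSlots = 8;
  static constexpr std::uint64_t kBits = 3;  // log2(kSlots)

  void schedule(int id, std::uint64_t delta) {
    assert(delta > 0 && delta < kSlots * kSlots);
    const std::uint64_t when = now_ + delta;
    if (delta < kSlots) {
      core_[when & (kSlots - 1)].push_back(Event{id, when});
    } else {
      outer_[(when >> kBits) & (kSlots - 1)].push_back(Event{id, when});
    }
  }

  // Advance time by one tick and return the ids of the events that fire.
  std::vector<int> advance() {
    ++now_;
    if ((now_ & (kSlots - 1)) == 0) {
      // The core wheel wrapped around, so the outer pointer moves by one
      // slot. Everything in that slot is promoted to the core wheel.
      std::vector<Event> promoted;
      promoted.swap(outer_[(now_ >> kBits) & (kSlots - 1)]);
      for (const Event& e : promoted) {
        core_[e.when & (kSlots - 1)].push_back(e);
      }
    }
    std::vector<Event> due;
    due.swap(core_[now_ & (kSlots - 1)]);
    std::vector<int> fired;
    for (const Event& e : due) {
      assert(e.when == now_);
      fired.push_back(e.id);
    }
    return fired;
  }

 private:
  std::uint64_t now_ = 0;
  std::vector<Event> core_[kSlots];
  std::vector<Event> outer_[kSlots];
};
```

Scheduling events 9 and 13 ticks ahead at time zero puts both in the same outer slot; they move to the core wheel when time reaches tick 8, and then fire on their exact ticks.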

The 1987 paper Hashed and Hierarchical Timing Wheels: Data Structures for the Efficient Implementation of a Timer Facility by Varghese & Lauck is where the concept of hierarchical timing wheels [0] was introduced. As usual, Adrian Colyer's summary is great if you don't have time to read the full paper.

Single-file implementation, no dependencies

The main reason I couldn't use existing implementations was that the best ones were deeply embedded in larger systems, and would have taken a lot of work to extract into stand-alone libraries. Ratas is a single-file implementation with no external dependencies beyond C++11. So it should be pretty easy to drop into any C++ project.

In a related but equally important point, Ratas doesn't have any internal time source, its notion of time is driven from the outside by the user of the library. This is actually a more important property than might first appear. There are a lot of event loop libraries that have timer queues of some sort, but the event loops by their nature want to be in control of execution. I needed a component to use as part of building a custom event loop instead [1].

Some of the libraries I looked at had interesting implementation strategies. (Wow, DPDK uses skiplists for the timer queue?). It would have been great to run some deterministic benchmarks comparing all these implementations. But in general even a crude hack job to extract the minimum viable free-standing implementation out of them was too much work.

Limiting number of events triggered in a single timestep

One operation that I've wanted in the past is limiting the number of timers triggered in a single call to the timer's advance method. If exceeded, the timer wheel should bail out early, let the application do some work, and then continue where it left off. This way an unfortunate clump of timers can't starve the main processing loop for too long. [2]

Of course the main application loop will need special logic, and do more frequent calls to timer processing than normal until the backlog has been dealt with.

This is trivial with a fully ordered event queue. With a timer wheel it means making the timer processing re-entrant as a whole (including the contents of the wheel being modified while it's still in the state between ticks). Luckily there's a way to do this with very little extra wheel-level state and with no changes at all to the event scheduling or canceling logic. So the overhead on the fast path is pretty much immeasurable.

Optimize for high occupancy, not low

One of the perceived problems of timer wheels is that while event insertion and deletion are O(1) operations, finding out the time remaining until the next event triggers is O(m+n) where m is the total number of timer wheel slots and n is the number of events. Specifically, you need to walk through the wheel until you find a slot with some events. Then depending on the resolution of the slot, you might need to walk through the full event chain of that slot. In addition to the algorithmic complexity, both of these operations are effectively pointer chasing, so there will be a lot of cache pollution.

There are various tricks that can be used to make this operation cheaper. For example each wheel could have a bitmap that parallels the array of slots in the wheel. If the bitmap has a 1 in a given position, the slot is non-empty. This allows the implementation to short-circuit the search by skipping the empty slots completely.
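The bitmap trick could look something like the following sketch for a 64-slot wheel, where bit i of the bitmap is set when slot i is non-empty. The function name is made up, and the bit-scanning loop is written portably; a real implementation would more likely use a count-trailing-zeros instruction for that part.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned kSlots = 64;  // one bitmap bit per wheel slot

// Returns how many slots ahead of `pointer` the next non-empty slot is
// (0 if the slot under the pointer is itself non-empty), or kSlots if the
// whole wheel is empty.
inline unsigned slots_until_nonempty(std::uint64_t bitmap, unsigned pointer) {
  assert(pointer < kSlots);
  if (bitmap == 0) return kSlots;
  // Rotate the bitmap so that bit 0 corresponds to the current slot.
  std::uint64_t rotated =
      pointer == 0 ? bitmap
                   : (bitmap >> pointer) | (bitmap << (kSlots - pointer));
  unsigned distance = 0;
  while ((rotated & 1) == 0) {
    rotated >>= 1;
    ++distance;
  }
  return distance;
}
```

The lookup itself never touches the slots. The cost is on the other side: every insertion and removal has to keep the matching bit up to date.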

I am not convinced this class of optimizations is a good tradeoff. They speed up a couple of operations: looking up time of next event, advancing the time by a lot of ticks in one go, but only if the timer wheel is at a low utilization. In exchange you're paying a small extra cost on every insert and deletion operation. [4]

And here's the thing. If the wheel is at a low utilization, it means that the program as a whole is at low utilization. That's exactly the situation where I don't care about the performance of a component like this. Performance only starts to matter once the system is under load. When the system as a whole is under heavy load, so are the timer wheels. At that point these lookup operations will short-circuit almost immediately, so the optimizations do nothing. On the other hand, that's exactly when the extra overhead added to insertions and deletions will be most significant.

There is however one useful thing we can do on the interface side to help with the original issue. Every time I've used any kind of ticks_until_next_event()-like functionality, it's with a pattern like this:

  Tick sleep_usec = std::min(timers.ticks_until_next_event(), 1000);

There's some upper bound on how large a result I'm interested in. Even if there are no timer events to be handled in the next millisecond, there's something non-timer driven that I know will need to happen. So let's just tell the timer wheel what the upper bound is:

  Tick sleep_usec = timers.ticks_until_next_event(1000);

This allows the timer wheel to short circuit the search, and return as soon as it's clear that the wheel contains no events scheduled to happen before that threshold.
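The short circuit is easiest to see on a single wheel with one-tick slots. The following is only a sketch of the idea with a made-up free function operating on per-slot event counts; it is not the Ratas implementation, which also has to deal with the outer wheels.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// slot_sizes[i] is the number of events in slot i, and `pointer` is the
// slot of the current tick. Returns the number of ticks until the next
// event fires, or `max_ticks` if nothing fires within that bound.
inline std::uint64_t ticks_until_next_event(
    const std::vector<int>& slot_sizes, std::size_t pointer,
    std::uint64_t max_ticks) {
  const std::size_t n = slot_sizes.size();
  // Never look further than the caller cares about, or than one rotation.
  const std::uint64_t limit = std::min<std::uint64_t>(max_ticks, n);
  for (std::uint64_t d = 1; d <= limit; ++d) {
    if (slot_sizes[(pointer + d) % n] > 0) return d;
  }
  return max_ticks;
}
```

With a small bound the loop inspects only a handful of slots no matter how large or how empty the wheel is.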

Another option for reducing the cost of ticks_until_next_event is to allow it to return a lower bound rather than an accurate result. If the process reaches a slot with more than one tick of granularity, don't walk through the chain of events but return the lower bound of that slot's tick range. I didn't do this since again that feels like optimizing for the uninteresting case, and since the API would then need separate accurate and approximate operations. (Having just the approximate operation available seems wrong).

Range-based scheduling

A lot of timer events get scheduled over and over again, but never executed [3]. Often the exact timing of their execution doesn't matter, but there is a lot of expensive churn as the timer is adjusted by minute amounts. To reduce the churn, Ratas includes a second scheduling interface that takes a range of acceptable times rather than a single exact time. For example:

timers.schedule_in_range(event, 100000, 101000);

It is then up to the implementation to decide on the optimal scheduling. Right now the implementation of schedule_in_range decides immediately on a single tick to schedule the event for. All information on the range is lost after that, rather than maintained in the long term. The decision is currently done as follows:

If the timer is already scheduled in the right interval, just do nothing.
Prefer scheduling the timer on an exact slot boundary, so that it can be executed in-place rather than promoted to the inner wheel.
Prefer scheduling the timer as late in the range as possible. This is important to maximize the efficiency of the first point.

Why reify the timing immediately, and drop the range information? Mostly because it's not clear there's a good time to use that information later on. The main times we're going to look at the event again is when it's up for execution, or when it gets rescheduled. In the first case the range gives us nothing. In the second case we've already gotten a new and improved range.
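One way the second and third preferences could be turned into code is to clear as many low bits of the end of the range as possible while staying inside the range. The sketch below does that under assumptions of my own choosing: ticks are absolute, each wheel level has 256 slots (`kWidthBits` = 8), and `pick_tick` is a hypothetical name. It doesn't cover the first preference, which needs to know the event's current schedule.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned kWidthBits = 8;  // assumed: 256 slots per wheel level

// Pick a tick in [start, end]. Prefer ticks aligned to the slot boundary
// of as coarse a wheel level as possible, and among those the latest one.
inline std::uint64_t pick_tick(std::uint64_t start, std::uint64_t end) {
  assert(start <= end);
  std::uint64_t best = end;  // the latest tick, with no alignment at all
  for (unsigned shift = kWidthBits; shift < 64; shift += kWidthBits) {
    // Latest multiple of 2^shift that is not after `end`.
    const std::uint64_t aligned = (end >> shift) << shift;
    if (aligned < start) break;  // no boundary of this level in the range
    best = aligned;
  }
  return best;
}
```

For the range 100000 to 101000 this picks 100864, the last multiple of 256 in the range. For a range that contains no slot boundary at all, such as 10 to 20, it falls back to the end of the range.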

Using this additional interface didn't make as big a difference as I was hoping for in my benchmark, just about 10%. The reason is that only one of my four timer types was a good match for range-based scheduling. But I don't know that it'd be appropriate for a much higher proportion of real world use cases, so that's fair enough.

Approximate event scheduling is of course not a new concept in any way [5]. And even if approximate scheduling is not supported by the timer library, some parts of it you can emulate on the application level. But even so it seems like something you'd want integrated properly in the library. Native functionality is always more comfortable to use than wrappers.

Finally, there's one more alternate strategy to mitigate the churn of timers creeping later and later. If a timer is already active and is rescheduled further into the future, update its scheduled time but leave the event structure on the ring in exactly the same place. When the event comes up on the wheel, the timer wheel checks the execution time, notices the time is still in the future, and reschedules the event to its proper location instead of executing the callback.

The theory sounds plausible, but in my testing this mostly resulted in small slowdowns. It's of course dependent on the workload, but it's definitely too fragile to use as a default and too obscure to be worth an option. I'd also hate losing the invariant that a slot only contains events with the correct tick.

A benchmark program

I wrote a little benchmark that creates a configurable number of events with a mix of different behaviors:

Timers with a short duration that get scheduled constantly and are almost always executed.
Timers with a long duration that are scheduled once, and eventually get executed.
Timers with a long duration that are constantly rescheduled to happen later, and thus never get executed.
Timers with a medium duration that get scheduled rarely and get executed once every time they get rescheduled.
Timers with a medium duration that get scheduled rarely, and are either rescheduled to happen earlier or canceled.

Sets of these timers are grouped into units, with each unit consisting of a total of 10 separate timer events. The work units run at slightly different cycle lengths to make the access patterns vary a bit over time. There's also a long completely idle period at the end.

The following results were generated by running the benchmark with various different amounts of 'work units'. Each unit contains a total of 10 timers (but not all will be active at the same time), and during the life of a test will schedule about 70k events and end up executing about 23k of them. The test runtime measured in virtual timer ticks will always be constant, so increasing the number of work units will increase the average wheel occupancy rather than cause more to be done in serial.

This is a log-log graph, i.e. both the axes are logarithmic. Despite all the important operations being constant time in theory, performance degrades non-linearly and really hits a wall around 256k work units, where doubling the workload increases runtime by a factor of 5. It's almost certainly a cache issue. The i7-2640M I ran the tests on has a 4MB cache, so for the 128k work unit case it's already a struggle to keep even the events on the core wheel in L3. So the suboptimal scaling is probably to be expected.

In terms of absolute performance numbers, the 32k work unit workload will do ~120M scheduling operations plus ~40M event executions per second. And on top of that the minor work that the application does in the event handlers. That seems decent for one core of a 5 year old laptop CPU.

No comparative benchmarks, sorry. As I mentioned before, it was hard to find anything to benchmark against. My benchmark program appears to trigger some kind of a performance bug in the only fully featured standalone timer queue I found, which made it perform two orders of magnitude slower than it realistically should have. The primitive operations are much faster than that. I'm sure the problem will get fixed, but right now it doesn't make for a very useful comparison.

Features that didn't make it

There is intentionally no support for repeating timers. I think that kind of thing should be done by the timer explicitly rescheduling itself.

The timer events have a vtable due to the virtual execute() method. Originally everything was parametrized at compile time by the callback type, so all events in a single wheel could use the same execute(). But that really did not work with the MemberTimerEvent, where the callback is a combination of a member function specified statically and an object specified dynamically. I wasn't willing to give up that feature, so the vtable was a lesser evil.

But it might be neat to allow parametrizing TimerWheel with a specific TimerEvent. If you need a heterogenous wheel, instantiate the template with the current TimerEventInterface. If you can live with a homogenous wheel, instantiate with some other implementation that has a non-virtual execute(). I didn't do it since it exceeded my personal template tolerance, and since homogenous wheels don't feel like a very compelling use case anyway.

Footnotes

[0] I use the term "timer wheel" instead of the original "timing wheel". The former is what I'd always seen them called until seeing this paper title.

[1] I wrote earlier about why packet processing applications might want a different kind of event loop than typical server applications, and at another time about why deterministic control of time is important for testability.

[2] Interestingly this is the opposite of what operating system kernels want to do. They'd prefer to batch as many timers together as possible. The difference here is that I'm thinking of a single-threaded non-locking application, while modern operating systems are all about concurrency.

[3] Think of a timer that deallocates resources after a period of idleness. These might be re-scheduled after every single operation on the resources.

[4] Or maybe not so small a cost; at least I found it tricky to maintain the bitmap without adding an extra back pointer in each event structure or each slot, either of which would be bad news. You could do it without that backpointer by requiring that timers are canceled via the timer wheel, but then you can't have the timers be automatically canceled on destruction. That would be totally unacceptable.

[5] For example Linux already applies a percentage-based timer slack, though for the purpose of trying to batch as many timer executions together as possible. The LWN article on a proposed replacement to the Linux timer wheel is a good read.

json-to-multicsv - Convert hierarchical JSON to multiple CSV files

jsnell@iki.fi — Tue, 12 Jan 2016 14:30:00 GMT

Introduction

json-to-multicsv is a little program to convert a JSON file to one or more CSV files in a way that preserves the hierarchical structure of nested objects and lists. It's the kind of dime a dozen data munging tool that's too trivial to talk about, but I'll write a bit anyway for a couple of reasons.

The first one is that I spent an hour looking for an existing tool that did this and didn't find one. Lots of converters to other formats, all of which seem to assume the JSON is effectively going to be a list of records, but none that supported arbitrary nesting. Did I just somehow manage to miss all the good ones? Or is this truly something that nobody has ever needed to do?

Second, this is as good an excuse as any to start talking a bit about some patterns in how command line programs get told what to do (I'd use the word "configured", except that's not quite right).

What and why?

I needed to produce some data for someone else to analyze, but the statistics package they were using could not import JSON files with any non-trivial structure. Instead the data needed to be provided as multiple CSV files that can be joined together by the appropriate columns.

As a simplified example, instead of this:

{
  "item 1": {
    "title": "The First Item",
    "genres": ["sci-fi", "adventure"],
    "rating": {
      "mean": 9.5,
      "votes": 190
     }
  },
  "item 2": {
    "title": "The Second Item",
    "genres": ["history", "economics"],
    "rating": {
      "mean": 7.4,
      "votes": 865
   },
   "sales": [
     { "count": 76, "country": "us" },
     { "count": 13, "country": "de" },
     { "count": 4, "country": "fi" }
   ]
  }
}

My "customer" needed this:

item.csv

item._key	item.rating.mean	item.rating.votes	item.title
"item 1"	9.5	190	"The First Item"
"item 2"	7.4	865	"The Second Item"

item.genres.csv

genres	item._key	item.genres._key
sci-fi	"item 1"	1
adventure	"item 1"	2
history	"item 2"	1
economics	"item 2"	2

item.sales.csv

item._key	item.sales._key	sales.count	sales.country
"item 2"	1	76	us
"item 2"	2	13	de
"item 2"	3	4	fi

One way to do this would have been to just change the program I used to produce the output. That would have been a bit annoying, since the CSV output codepath would have been almost completely separate from the JSON one (which was basically just a JSON::encode_json on the natural data structure). It's almost easier to just have a generic converter than one specific to that one app (the documentation is as long as the program itself). The only question is how to configure the generic mechanism for the specific case.

How command line tools get run

Could this "just work" out of the box with no settings at all? Not really, there's multiple ways of interpreting the data. A compound value could mean either the addition of more columns (ratings in the example) or adding rows to another CSV file (sales in the example). Consistently choosing the first interpretation would not work at all, while in the latter case you'd get really awkward entity-attribute-value-style output.
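To make the two interpretations concrete, here is a minimal Python sketch (not the tool's actual code) applied to the rating object from the example. The function names and the "attribute"/"value" column names are made up for illustration.

```python
rating = {"mean": 9.5, "votes": 190}

def as_columns(prefix, obj):
    """Interpretation 1: each field becomes an extra column on the parent row."""
    return {f"{prefix}.{key}": value for key, value in obj.items()}

def as_rows(parent_key, obj):
    """Interpretation 2: each field becomes a row in a separate table."""
    return [
        {"item._key": parent_key, "attribute": key, "value": value}
        for key, value in obj.items()
    ]

print(as_columns("item.rating", rating))
# {'item.rating.mean': 9.5, 'item.rating.votes': 190}

for row in as_rows("item 1", rating):
    print(row)
```

The second form is the entity-attribute-value shape: it works for any object, but every consumer then has to pivot the rows back into columns.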

Ok, so some configuration is needed. What kind of options do we have for doing that? Command line flags tend to be the simplest to start with, though they'll often eventually become complex either by developing ordering dependencies between flags (to express different semantics) or by the values developing some kind of complicated internal structure.

Both of those actually happen for this tool. To run it, you need to pass in multiple --path command line options, each containing a pattern and the action to take for values whose path matches the pattern. (Only the first matching action is taken.) For the above example those flags were:

   --path /:table:item
   --path /*/rating:column
   --path /*/sales:table:sales
   --path /*/genres:table:genres

Scalar values have an automatic fallback handler that just outputs the value as a column, but for compound data fields not finding a match is an error. In these cases the error message will print out some suggestions on what command line arguments could be added to resolve the error, for example:

Don't know how to handle object at /*/appendix/. Suggestions:
 --path /*/appendix/:table:name
 --path /*/appendix/:column
 --path /*/appendix/:row
 --path /*/appendix/:ignore
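The behaviour described above (first matching pattern wins, scalars fall back to a column, unmatched compound values produce an error with suggestions) could be sketched in Python roughly as follows. This is an illustration under assumptions, not the tool's implementation: the names parse_path_flag, matches and find_action are hypothetical, and the matching rule (a `*` matches exactly one path segment) is a guess at the semantics.

```python
def parse_path_flag(spec):
    """Split 'pattern:action[:name]' into (pattern, action, name or None)."""
    pattern, action, *rest = spec.split(":")
    return pattern, action, (rest[0] if rest else None)

def segments(path):
    """Split a path like '/item 1/rating' into its non-empty segments."""
    return [s for s in path.split("/") if s]

def matches(pattern, path):
    """Assumed rule: same number of segments, '*' matches any one segment."""
    p, q = segments(pattern), segments(path)
    return len(p) == len(q) and all(a == "*" or a == b for a, b in zip(p, q))

def find_action(rules, path, value):
    """Return (action, name) for the first rule whose pattern matches."""
    for pattern, action, name in rules:
        if matches(pattern, path):
            return action, name
    if not isinstance(value, (dict, list)):
        return "column", None  # automatic fallback for scalar values
    kind = "object" if isinstance(value, dict) else "list"
    hints = "\n".join(
        f" --path {path}:{a}" for a in ("table:name", "column", "row", "ignore")
    )
    raise ValueError(f"Don't know how to handle {kind} at {path}. Suggestions:\n{hints}")

rules = [parse_path_flag(s) for s in (
    "/:table:item",
    "/*/rating:column",
    "/*/sales:table:sales",
    "/*/genres:table:genres",
)]

print(find_action(rules, "/item 1/rating", {"mean": 9.5, "votes": 190}))
# ('column', None)
```

Putting the suggestions in the error is what makes the iterative workflow possible: run the tool, copy one of the suggested flags, run it again.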

The next option would be feeding some kind of a schema file to the tool, which would then be used to guide the process. For example if the schema says that a type of object has a static set of fields, those fields are probably columns. If it has an unknown set of keys, it's probably more like tabular data.

The problem is that writing the schema would be a bit of a pain, and it would be much harder for the conversion tool to guide the user through an iterative process of getting the schema definition right. One could maybe generate a schema file from the data file itself, and edit any bits that the autodetection goes wrong. Schema generators do exist, for example jsonschema.net, but at least that one doesn't have enough knobs to tweak to even get this basic example right. And the mistakes are such that fixing them would take a fair bit of work. Reliable automated schema generation would make for some pretty epic yak shaving in the context of this tiny tool.

Maybe if people really did write JSON schemas for everything it would make sense to use that existing infrastructure. But I've never seen one of those in the wild, the spec is complicated, and JSON schemas are not particularly well suited to this use case. (Really you'd want a custom schema format, but then it's completely guaranteed that there's no pre-existing schema file to use).

And here's the thing... It's not just this specific case. It never feels like any kind of declarative schema is the right solution. In a couple of decades of writing data munging scripts I can remember just a single case of basing the solution on an external description of the data. And that single exception had several people working on the tool full time. Sure, it's great to have a schema of some sort for your data interchange or storage format, for use in validation, code generation, automated generation of example data, or other things like that. But for actually processing it? It's just an incredibly rare pattern.

And finally, could this be a use case for a special purpose language? If schemas feel like a rarity, little languages are the opposite. Especially in classic Unix they are ubiquitous.

As a recovering programming language addict, I have to be deeply suspicious every time a new language looks like the right solution. Is it really? Or is this just an excuse to fall off the wagon again, and implement a language? (Not a big language, man. Just a little one, to take the edge off).

It's also clear that the general idea of a JSON processing language is solid. Some already exist (e.g. jq), but there could be room for multiple approaches. Writing sample programs to see what a language for JSON processing and transformation might look like was a fun way to spend a couple of hours on the boring "no internet" leg of a train journey. ("It could have this awk-like structure of toplevel pattern matching clauses, but on paths instead of rows of text, and with a recursive main loop instead of a streaming one, and and and...").

If I kind of wanted to write this, the idea is good, and an initial implementation is not an unreasonable amount of work, why not do it? Well, even if a script written in this hypothetical language to translate from hierarchical to tabular data would have been pretty simple, it would still have been a program that the user of the tool needs to write in a dodgy DSL. And since the language would have been much more generic than a mere conversion tool, it would also have been impossible to guide the user through a process of iteratively building the right configuration (like is now done via the error messages).

In all likelihood it'd mean that nobody else would ever use the tool for the original purpose. The less powerful and less flexible version is just going to be more useful purely due to simplicity.

So sanity prevailed this time. But tune in for the next post for an earlier example of where my self control failed.

The most obsolete infrastructure money could buy - my worst job ever

jsnell@iki.fi — Tue, 01 Sep 2015 17:30:00 GMT

Today marks the 10th anniversary of the most bizarre, and possibly the saddest, job I ever took.

The year was 2005. My interest in writing a content management system in Java for the company that bought our startup had been steadily draining away, while my real passion was working on compilers and other programming language infrastructure (mostly SBCL). One day I spotted a job advert looking for compiler people, which was a rare occurrence in that time and place. I breezed through the job interview, but did not ask the right questions and ignored a couple of warning signs. Oops.

It turned out to be a bit of an adventure in retrocomputing.

The bizarre

This was the former internal tools unit of a very large company, let's call them X. For some reason X had split off the unit and sold (given?) it to a moderately large consulting company, whom we shall call Y. I was going to work at Y. The reason they needed compiler people was that they were about to take over the maintenance of a C compiler suite (compiler, linker, assembler, etc). Except I'd misunderstood them as taking over the maintenance from X. That wasn't the case. Actually the compiler was from another very large company, Z, who were discontinuing all support. So X bought the source code from Z for very significant $$$, and needed somebody (Y) to actually do something with it. In fact it wasn't even just one compiler suite as I'd initially understood, it was two. Woo, double the compilers to play with!

I started in September, but some schedules had slipped and we wouldn't actually have anything to work with for a month or two. So I had plenty of time to acclimatize there. Which is good, because it's like I'd stepped into some strange parallel dimension where the 80s never ended. You know, the kind of place where you need access to some old documentation, and eventually find it's stored in an ingenious in-house source control system built on top of RCS.

For example on my first day I found that X was running what was supposedly the largest VAXcluster remaining in the world, for doing their production builds. Yes, dozens of VAXen running VMS, working as a cross-compile farm, producing x86 code. You might wonder a bit about the viability of the VAX as a computing platform in the year 2005. Especially for something as cpu-bound as compiling. But don't worry, one of my new coworkers had as their current task evaluating whether this should be migrated to VMS/Alpha or to VMS/VAX running under a VAX emulator on x86-64! [0]

Why did this company need to maintain a specific C compiler anyway? Well, they had their own ingenious in-house programming language that you could think of as an imperative Erlang with a Pascal-like syntax that was compiled to C source [1]. I have no real data on how much code was written in that language, but it'd have to be tens of millions of lines at a minimum.

The result of compiling this C code would then be run on an ingenious in-house operating system that was written in, IIRC, the late 80s. This operating system used the 386's segment registers to implement multitasking and message passing. For this, they needed a compiler with much more support for segment registers than normal. Now, you might wonder about the wisdom of relying on segment registers heavily in the year 2005. After all, use of segment registers had been getting slower and slower with every generation of CPUs, and in x86-64 the segmentation support was essentially removed. But don't worry, there was a project underway to migrate all of this code to run on Solaris instead [2].

After a couple of months of twiddling my thumbs and mostly reading up on all this mysterious infrastructure, a huge package arrived addressed to this compiler project. But... We were supposed to get a source dump. Why does the package need two men to carry it? Did somebody play a practical joke on us, and send the source as printouts?

Why, it's the server that we'll use for compiling one of the compiler suites once we get the source code! An Intel System/86 with a genuine 80286 CPU, running Intel Xenix 286 3.5. The best way to interface with all this computing power is over a 9600 bps serial port. Luckily the previous owners were kind enough to pre-install Kermit on the spacious 40MB hard drive of the machine, and I didn't need to track down a floppy drive or a Xenix 286 Kermit or rz/sz binary. God, what primitive pieces of crap that machine and OS were.

You might wonder about the wisdom of using a 15-20 year old machine as the sole method of building a piece of software. It's dog slow and obviously will break sooner or later. In fact I raised this very issue and suggested maybe imaging the hard drive and getting everything running virtualized. That idea was nixed: since the machine was old and fragile, we couldn't risk poking around inside it. It'd be really hard to replace; when they went hunting for this machine from antique computer specialists, they only found two remaining working units [3].

This might be a good time to say that computationally speaking, I was raised by the wolves on a SunOS 4 server (which I ended up sysadmining for a few hundred users). My personal email was still going over UUCP in 2005. The highlight of my previous weekend (in 2015, when I'm writing this) was finding what looks like a partial source repository for a Lisp implementation written before I was born, and which appeared to have been completely lost to time. It was on a copy of some old backup tapes from an ITS server, and I don't even remember how or when those ended up on my harddrive. Which is to say, I like old computer systems more than is reasonable.

But even by my standards this level of computational archeology was going a bit too far. And the rabbit hole still had a little bit deeper to go.

A couple of weeks later the source drop arrived. I'll talk about the other compiler later, let's tackle this one that needed to be built on a 286 first.

So it was written in PL/M. (Wait, is that even a thing? That's not a thing, right?). And it was last modified in the mid 80s. I'd like to say the build instructions were generated using a typewriter, but it could be that my memory is playing tricks on that. Some of the components didn't build cleanly, and required various Makefile tweaks with excruciating round trip times for every test. Because, you know, this is a 286.

The hard drive wasn't large enough for all of the components either, so the process of rebuilding everything would be:

Upload the linker source tarball over the 9600 bps serial connection from a Linux server acting as a frontend
Unpack it
Build
Download the linker binary back to safety
Remove the source and the build artifacts
Repeat the same for all five components of the system.

Just the data transfers for each component took an hour. But after a long time fighting with it I had a script that with a single keystroke generated bit-identical binaries when compared to the ones that had apparently been in use for almost the last 20 years.

I was pretty worried though, it'd be really hard to actually make any use of this source. There was no documentation except for the build instructions, we'd need to reverse engineer everything. There wouldn't be any training from company Z either; frankly it would be a miracle if anyone who originally worked on the software was still with the company. Nobody knew PL/M. The roundtrip time from making a change on the build machine to having a binary on a machine capable of actually running it was at least an hour. And we didn't have a source level debugger for this, so that'd mean an hour just to add a single debug printf. (Wait, not a debug printf of course. A debug whatever-it-is-that-PL/M-uses-for-io). It'd be pure pain.

I expressed these concerns, and was told not to worry.
- "Oh, we'll never want to make changes to this compiler, not enough code is compiled with it these days for that to be worth it. The more modern suite is the important one."
- "Wait? I just spent a month elbow deep in PL/M and Xenix/286 over a 9600bps Kermit connection, and you're telling me we're never going to actually use any of this?!"
- "Right, we just needed to verify that we really got what we bought."

I didn't really know whether to be happy about not having to do any more work on that crap, or angry about the waste of time.

The sad

That concluded the bizarre retrocomputing part of the story. We now get to the part with sad dysfunctional corporate politics. If you're just reading this for the laughs, maybe just skip to the end.

The more modern compiler suite wasn't a spring chicken either. It had to be compiled specifically on Visual Studio 6. There were again no design docs, nor tests. The lack of tests was explained as being due to third party IP concerns. The lack of documentation we never got an answer for.

Unlike the truly ancient compilers, this one was easy to build. But what could we possibly do with it? So I read through the compiler, tried to understand what each file did, did some experiments and wrote some notes.

We arranged a big meeting with senior engineers from all the relevant departments of X. The agenda was to figure out what improvements they'd want in the compiler. It was pretty dispiriting. Half of them seemed to think it'd be better not to touch it at all, since we'd probably just break it. Even those who weren't completely opposed to changes couldn't think of anything they really needed. Finally someone took pity on me, and noted that the compiler isn't very smart about scheduling segment register loads, and those were expensive operations. Maybe that could be improved?

After the meeting one of the managers told me that it was really our job to come up with projects that the customer wanted to buy, not the other way around. And it usually couldn't just be a general project for minor improvements, it'd need clear and ideally measurable goals. The projects would also need to be pretty large to justify all the overhead. It should go without saying that this is an absolutely insane way of doing platform development, but it's something that follows directly from the incentives of the two parties. How anyone at X thought that anything good would come out of this, I don't know.

But never mind that. Our initial project for taking over the compiler maintenance was well funded, and vague enough that it was easy to argue that proving capability of shipping some kind of improvements is a core deliverable. We could at least proceed with the only improvement anyone had shown any interest in.

So I implemented a new peephole optimizer stage for the segment registers, and even got the code reviewed by the original authors of the compiler when they came over from Z to give us a training session. It seemed to work, but as mentioned above we didn't have a test suite and building one would take a long time and a lot of work. (Excellent! We can propose that as a project later!).

We couldn't even run any of the production code since that would require the ingenious in-house operating system. The only way to get any performance numbers and confidence in the changes being correct would be to schedule a load test in X's test lab. Unfortunately weeks and weeks of discussion over that never got us both the lab time and the people from their side who would have been needed. It's of course understandable; whether these compiler changes got released or not wouldn't make a difference to these people, who had their own actual work to do. But it also made it very hard to see how we could ship this change. The justification would be improved performance, but with no numbers it'd be a hollow claim.

That's when it dawned on me that there was never going to be any real compiler work there. These special compilers would not really matter once X migrated away from the custom OS, which would have to happen. Oh, sure it'd need to be "maintained" just in case a customer running 20 year old code needed a bugfix. Given the dysfunctional processes, it seemed pretty clear that the costs for any improvements would be massive in the short active development life these systems had remaining. They'd probably spent a seven figure sum on this project as insurance, but actually doing something with the code? No way.

All of this infrastructure was just going to be on life support while it was being replaced by new systems that would in turn be obsoleted in five years. But nothing would ever actually go away, all this cruft would just accumulate and accumulate, nominally supported for ever. And you'd have these extremely good engineers doing this completely insane work, having been moved from working at a prestigious high tech company to a despised consulting firm.

And how do you even get out of that job? I imagined myself in a job interview in 2010, trying to explain how useful my extensive knowledge of Xenix, PL/M build systems, and VMS would be to my prospective new employer. There might be a time when you just stop keeping up with tech, but doing it in your 20s is really not that time :)

Coda

So I quit without arranging for another job first, assuming that something would probably turn up. In an amazing display of serendipity, during my notice period ITA Software posted to the SBCL mailing list that they wanted to pay somebody to work on SBCL improvements for them, which was pretty much my dream gig at the time [4]. Perfect timing.

Ok, that's all. You can now proceed with the one-upping with stories of developing new production software on a physical IBM 1401 in this millennium, or something ;-)

Footnotes

[0] I don't know the outcome of that evaluation.

[1] No, transpiling is not a word no matter how much you people try to make it one.

[2] And that leads us to the question of whether Solaris is really what you want to be migrating to in 2005.

[3] Wait?! There are two machines in the world that you think can be used to build this software, and we only bought one of them? "Oh, yes. It would have been a pretty good idea to buy the second one as a spare. Let's do that now!"

[4] Of course if one worry with the job at Y was that I'd be unemployable due to only having worked on boring and obsolete technology, one might wonder about the long term career prospects in Common Lisp compilers. But look, in 2006 CL was really going places!

Updated zlib benchmarks

jsnell@iki.fi — Fri, 05 Jun 2015 20:30:00 GMT

Last year I wrote a small benchmark suite to benchmark the various zlib optimization forks that were floating around. There's a couple of reasons to update those results. First, there were major optimizations added to the Cloudflare fork. And second, there's now a new entrant, zlib-ng, which merges in the changes from both the Intel and Cloudflare versions but also drops support for old architectures and cleans up the code in general.

I'll write a bit less commentary this time, so that the results will be easier to update in the future without a new post. The big change compared to the 2014-08 results is that the Cloudflare version is now significantly faster particularly on high compression levels, but there are smaller improvements on all compression levels. Except for compression level 1, it seems like the preferable version now for pure speed.

Zlib-ng showed a massive slowdown in decompression speed compared to all other versions until compiled with --zlib-compat (only relevant for minigzip, not necessary for general use of the library), and is much slower with compression level 1 than the Intel version despite apparently using the new quick deflate strategy. On other levels it closely shadows the Intel results.

Versions used:

baseline	50893291621658f355bc5b4d450a8d06a563053d
cloudflare	a80420c63532c25220a54ea0980667c02303460a
intel	e176b3c23ace88d5ded5b8f8371bbab6d7b02ba8
zlib-ng	4b1728a261e32e08bc5403f391ba65bfe5f4ba57

Flags used:

All:	`CFLAGS='-msse4.2 -mpclmul -O3'`
zlib-ng:	`--zlib-compat`
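As a rough illustration of what the result tables report, the following Python sketch measures compression ratio (compressed size divided by original size) and execution time per compression level, and expresses a time as a percentage of a baseline. It is not the benchmark suite used for these results: it uses whatever zlib Python's standard `zlib` module is linked against, and the input data and function names are made up.

```python
import time
import zlib

def bench(data, level, iterations=10):
    """Return (compression ratio, mean seconds per compress call)."""
    start = time.perf_counter()
    for _ in range(iterations):
        compressed = zlib.compress(data, level)
    elapsed = (time.perf_counter() - start) / iterations
    return len(compressed) / len(data), elapsed

def relative(time_s, baseline_s):
    """Execution time as a rounded percentage of the baseline time."""
    return round(100 * time_s / baseline_s)

# Made-up, highly repetitive input; real benchmarks need realistic files.
sample = b"<html><body>" + b"<p>hello world</p>" * 5000 + b"</body></html>"

for level in (1, 3, 5, 9):
    ratio, seconds = bench(sample, level)
    print(f"level {level}: ratio {ratio:.2f}, {seconds * 1000:.3f} ms")

# Decompressing the executable: cloudflare 1.10s vs baseline 1.32s.
print(relative(1.10, 1.32))  # 83
```

Comparing forks this way would require building each one and running it against the same inputs, which this sketch does not attempt.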

Decompression

	baseline		cloudflare		intel		zlib-ng
decompress executable (50 iterations)
Execution time	1.32s±0.00	(100%)	1.10s±0.00	(83%)	1.30s±0.01	(98%)	1.31s±0.01	(99%)
decompress html (50 iterations)
Execution time	0.76s±0.00	(100%)	0.65s±0.00	(85%)	0.75s±0.00	(98%)	0.76s±0.00	(100%)
decompress jpeg (50 iterations)
Execution time	0.20s±0.00	(100%)	0.12s±0.00	(60%)	0.20s±0.01	(101%)	0.20s±0.00	(100%)
decompress pngpixels (50 iterations)
Execution time	0.87s±0.00	(100%)	0.65s±0.00	(75%)	0.85s±0.00	(98%)	0.86s±0.00	(99%)

Compression level 1

	baseline		cloudflare		intel		zlib-ng
compress executable -1 (10 iterations)
Compression ratio	0.37		0.37		0.46		0.46
Execution time	0.75s±0.01	(100%)	0.52s±0.01	(69%)	0.29s±0.00	(38%)	0.46s±0.01	(61%)
compress html -1 (10 iterations)
Compression ratio	0.39		0.37		0.54		0.54
Execution time	0.38s±0.00	(100%)	0.27s±0.00	(71%)	0.19s±0.00	(49%)	0.28s±0.00	(73%)
compress jpeg -1 (10 iterations)
Compression ratio	1.00		1.00		1.05		1.05
Execution time	0.65s±0.01	(100%)	0.53s±0.01	(81%)	0.24s±0.00	(36%)	0.40s±0.00	(61%)
compress pngpixels -1 (10 iterations)
Compression ratio	0.17		0.17		0.23		0.23
Execution time	0.44s±0.01	(100%)	0.27s±0.01	(60%)	0.18s±0.00	(40%)	0.26s±0.00	(57%)

Compression level 3

	baseline		cloudflare		intel		zlib-ng
compress executable -3 (10 iterations)
Compression ratio	0.35		0.36		0.36		0.36
Execution time	1.10s±0.02	(100%)	0.62s±0.01	(56%)	0.73s±0.02	(66%)	0.69s±0.01	(63%)
compress html -3 (10 iterations)
Compression ratio	0.36		0.35		0.35		0.35
Execution time	0.61s±0.00	(100%)	0.37s±0.00	(59%)	0.43s±0.00	(69%)	0.41s±0.00	(66%)
compress jpeg -3 (10 iterations)
Compression ratio	1.00		1.00		1.00		1.00
Execution time	0.62s±0.00	(100%)	0.51s±0.00	(82%)	0.55s±0.00	(88%)	0.56s±0.00	(90%)
compress pngpixels -3 (10 iterations)
Compression ratio	0.15		0.15		0.16		0.16
Execution time	0.85s±0.01	(100%)	0.44s±0.00	(51%)	0.46s±0.00	(54%)	0.44s±0.01	(51%)

Compression level 5

	baseline		cloudflare		intel		zlib-ng
compress executable -5 (10 iterations)
Compression ratio	0.33		0.34		0.34		0.34
Execution time	1.61s±0.00	(100%)	0.93s±0.01	(57%)	0.93s±0.00	(57%)	0.91s±0.01	(56%)
compress html -5 (10 iterations)
Compression ratio	0.34		0.33		0.33		0.33
Execution time	0.99s±0.01	(100%)	0.57s±0.00	(57%)	0.53s±0.00	(53%)	0.52s±0.01	(52%)
compress jpeg -5 (10 iterations)
Compression ratio	1.00		1.00		1.00		1.00
Execution time	0.64s±0.00	(100%)	0.53s±0.00	(83%)	0.74s±0.01	(116%)	0.74s±0.00	(115%)
compress pngpixels -5 (10 iterations)
Compression ratio	0.14		0.14		0.14		0.14
Execution time	1.23s±0.01	(100%)	0.61s±0.01	(49%)	0.61s±0.00	(49%)	0.59s±0.00	(47%)

Compression level 9

	baseline		cloudflare		intel		zlib-ng
compress executable -9 (10 iterations)
Compression ratio	0.33		0.33		0.33		0.33
Execution time	9.55s±0.01	(100%)	4.07s±0.01	(42%)	7.53s±0.01	(78%)	7.34s±0.01	(76%)
compress html -9 (10 iterations)
Compression ratio	0.33		0.33		0.33		0.33
Execution time	2.81s±0.01	(100%)	1.64s±0.00	(58%)	2.54s±0.01	(90%)	2.48s±0.02	(88%)
compress jpeg -9 (10 iterations)
Compression ratio	1.00		1.00		1.00		1.00
Execution time	0.64s±0.00	(100%)	0.53s±0.00	(82%)	0.58s±0.01	(90%)	0.59s±0.00	(93%)
compress pngpixels -9 (10 iterations)
Compression ratio	0.12		0.12		0.12		0.12
Execution time	26.58s±0.05	(100%)	14.24s±0.02	(53%)	21.43s±0.02	(80%)	19.40s±0.03	(72%)

"It's like an OkCupid for voting" - the Finnish election engines

jsnell@iki.fi — Mon, 11 May 2015 17:00:00 GMT

Have I ever told you about the time I built an "OkCupid for elections" for the communists?

No? That's strange, I tend to get good mileage out of that story during election season. Unfortunately for the story to make any sense, you'll need a bit of absolutely fascinating background information on how elections work in Finland, and especially how websites that tell people whom to vote for became an integral part of it.

The Finnish electoral system

Like many other countries, Finland has a representative electoral system. The country is divided into 13 districts, each district elects some number (average of ~15) of representatives into the parliament. The number of representatives in each district is determined based on population, while the distribution of the seats in each district is based on the proportion of the vote that each party got there. The proportional distribution of seats is done using the D'Hondt method.
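For readers who haven't seen the D'Hondt method, here is a minimal Python sketch of the seat allocation for a single district. The vote counts are invented for illustration. Finland's official procedure assigns comparison numbers to individual candidates, which is not how this sketch works, but the per-party seat counts come out the same.

```python
def dhondt(votes, seats):
    """Allocate `seats` among parties given {party: votes} using D'Hondt."""
    won = {party: 0 for party in votes}
    for _ in range(seats):
        # Each party's quotient is its votes divided by (seats won so far + 1);
        # the next seat goes to the party with the highest quotient.
        best = max(votes, key=lambda p: votes[p] / (won[p] + 1))
        won[best] += 1
    return won

# Hypothetical district with 8 seats.
print(dhondt({"A": 100_000, "B": 80_000, "C": 30_000, "D": 20_000}, 8))
# {'A': 4, 'B': 3, 'C': 1, 'D': 0}
```

Note how party D gets nothing with 20,000 votes: the method slightly favours larger parties, especially in districts with few seats.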

That's pretty standard fare at least for continental Europe. The odd bit is in how the distribution of seats is done within each party. There is no predetermined party list at all, the ordering of candidates within a party is determined purely by votes. And what's more, it's not even possible to give a generic vote to a party as a whole; each vote is equally a statement on which party you'd like to win the seats, and on which specific person in that party you'd like to get one of the seats that the party wins.
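The double meaning of each vote can be sketched in a few lines of Python. The candidate names and vote counts are invented, and the number of seats each party won is taken as given, however it was computed.

```python
def party_totals(candidate_votes):
    """{party: {candidate: votes}} -> {party: total votes}.

    There is no separate party vote: a party's total is simply the sum
    of the votes cast for its candidates.
    """
    return {party: sum(c.values()) for party, c in candidate_votes.items()}

def elected(candidate_votes, seats_won):
    """Fill each party's seats with its candidates in order of personal votes."""
    result = {}
    for party, candidates in candidate_votes.items():
        ranked = sorted(candidates, key=candidates.get, reverse=True)
        result[party] = ranked[: seats_won.get(party, 0)]
    return result

# Hypothetical district.
votes = {
    "A": {"Aino": 5000, "Antti": 3000, "Anna": 1000},
    "B": {"Bertil": 4000, "Birgitta": 2500},
}
print(party_totals(votes))                # {'A': 9000, 'B': 6500}
print(elected(votes, {"A": 2, "B": 1}))   # {'A': ['Aino', 'Antti'], 'B': ['Bertil']}
```

In this made-up example Antti is elected with fewer personal votes than Bertil's party colleague Birgitta would need, because the seats are first split between parties and only then between people.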

This is fundamentally different from systems where voting for a person rather than just the list requires extra effort. It has a couple of sweet advantages, but also a bunch of disadvantages.

The advantages of the system I can think of are:

It decreases the power of party elites by removing from them the decision on how to order a list, and instead giving it to the people who voted for the party.
It encourages demographic diversity in the candidate pool. It makes a lot of tactical sense for every party to have candidates of all ages, of both genders, of different educational backgrounds, and different ethnic groups. Some people won't vote purely based on issues, but for example specifically want to vote for some group they have an affinity for, or feel is under-represented. Each party will want to have a good spread of candidates to make sure that they don't lose any voters due to a lack of a compatible candidate.
This is of course pure identity politics, but somehow it feels a lot less objectionable when the effect is progressive. It's worth noting that while maximizing the demographic diversity of candidates makes sense for each party, the same might not apply to ideology / opinions. Candidates that don't fit into any of the wings of a party will just make it unclear what the party really stands for.

The downsides of the system come mostly from overloading a single vote to mean two things. You vote for a candidate, and the party gets a vote as a side effect.

The meaning of a vote is obfuscated. If your ideal candidate is not in your ideal party, what's the right action? Should you try to ensure that they get elected at the cost of helping the wrong party, or should you vote for a suboptimal candidate in the right party?
I'm pretty sure that the rational choice there is to vote for the party first. The party distribution of the parliament has more impact than the exact distribution of people. And unless you're pretty certain that your preferred candidate is "on the bubble" within the party, it's more likely that a single vote will swing a seat between two parties than it is to swing a seat between two individuals in the same party.
If the more important choice is the party, why is the candidate at the forefront of the decision?
Since some people will ignore the maxim to vote for the party first and person second, it makes sense for parties to run candidates with good name recognition in an attempt to hoover in some of these voters.
So the candidate list will have TV celebrities, beauty contest winners, Olympic medalists and so on. And occasionally they'll even get elected, bumping off "serious" politicians from the list. (Please note that I don't want to imply that the celebrities would automatically be incompetent at that job, that's not true at all. But their actual or perceived level of competence is not what gets them elected, it's their fame.)
I'm pretty sure that there's even some level of "no publicity is bad publicity" in the system as a result. The most obvious recent example is the most notorious corrupt politician in the country (guilty of soliciting for bribes as a minister, and I think the only person expelled from the parliament during my lifetime) getting re-elected once again. That's just the kind of cronyism you'd hope to get rid of by eliminating the party list, but it clearly doesn't always work.
For a candidate who is serious about getting elected, the main competition is the other people in that party in the same district. If as a candidate you don't have any natural name recognition, you have to get it by advertising. This creates a situation where even a nominally party-dominated system can still have an uncomfortably large level of either personal fundraising, or election campaigns being self-funded by the rich.
The problem with campaign fundraising is of course corruption, or the perception of corruption. I give 10k € to your personal election campaign, and after getting elected you help solve some nasty zoning issues that my company has. The last large scale election funding scandal happened only 8 years ago, and implicated a disturbing number of high profile politicians. And the problem with funding a large campaign from your personal wealth is that it ensures that the rich have a much higher chance of getting elected than the poor.
The issue with competition within the party being potentially more important than the competition between parties can also have other odd consequences. Like the bizarre case of two candidates from opposing (and not particularly compatible) parties doing a joint TV advertising campaign. They were in different districts, so neither was personally hurt by increasing the success chances of the other one. But they were directly undermining their own party in the other district.
Even if you're trying to be a "good" voter and make your decision based on something else than fame or budget, how are you actually going to decide which of the 200 candidates in the district or even the 30 candidates in one party to vote for? Nobody has the time to do deep research on all of them.

You might notice that a lot of these issues boil down to what's essentially a discovery problem. If you could somehow solve that and make people aware of candidates that they might want to vote for based on some criteria other than name recognition, surely this would be an absolutely awesome election system?

The election engines

The solution to all these discovery problems that Finland arrived at in the mid-90s was the vaalikone, which literally translates to "the election machine", or a bit less literally to "the election engine". The first one was made by the national broadcasting corporation YLE. In later elections other news outlets added their own versions, followed by the political parties and various kinds of trade / lobbying organizations. There's probably at least a couple of dozen active ones for any given election.

The core concept is simple. The site has a bunch of multiple choice questions. 20 questions is a typical amount. Some questions are related to the hot political topics of the day, some with stale / evergreen topics (why yes, let's ask people about NATO once again), others with cultural values, and yet others with economic ones. A month or two before the elections, the candidates answer the questions, possibly tell how important they consider that issue, and they might even write a bit of text explaining their answer.

During the campaign the voters will use the site to answer the same questions, and get a match percentage with all the candidates in the voter's district. For the really picky users, some of the modern implementations even allow you to restrict the results e.g. only to candidates in a given age range. You can now perhaps see why I use the quote in the title of this post when explaining the idea ;-)

And do these sites matter? Apparently over half the voters use one or more of them. In the 20-30 year old age group it's up to three quarters. And the results aren't just ignored. One study says 40% of the users had the results from the site affect their voting behavior in some way. Another says that a sixth outright voted for the top rated candidate. These numbers are not insignificant.

(Note: I am aware that similar sites exist in other countries. But my impression is that they are nowhere near as central to the process. If I'm wrong about it, I'd love to hear about counterexamples.)

The problem with the concept

So the issues have been solved, right? Use the recommendations from the website to narrow the candidate pool down to a handful, do deeper research on those candidates, and select the best one. Informed democracy has been saved!

Unfortunately no. While the concept is simple, I don't know if it really should be used for anything more than entertainment.

The most critical problem is that the algorithm, the question selection, the question phrasing, and even the reporting UI have a huge effect on the results. It's no different from e.g. the way polling results could be distorted by these kinds of methodological things. But unlike polls, these tools are directly intended to affect voting behavior.
One party or candidate can easily be the best match on one site and a horrible match on another. Someone from the Green party can be branded as the most right-wing MP in the country based on the results. Filtering the results to just show extremely left-wing politicians might show up several people from the conservative party due to (presumably) some kind of uninitialized data. These are not hypothetical examples, but events that actually happened.
As an example on the power of the UI, a site that specially highlights candidates from the best matching party (no matter how small the margin is to the 2nd best match) is going to give a dramatically different view from a site that has a single sorted list of candidates.
Reliance on these kinds of tools puts quite a lot of unauditable power into very few hands, and feels distinctly anti-democratic even if you assume that no malice is involved. Some people might argue that a similar if not greater amount of power is placed in the hands of e.g. the people moderating televised election debates. The difference is that at the end of a debate nobody gives you a personalized recommendation on what your opinion should be. The viewer would need to apply at least some thought to the process, rather than trust a number with too many significant digits that was spit out by an ostensibly neutral website.
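The results are often flaky and inconsistent. False positives are fine; you shouldn't be just blindly voting for the single top candidate with no thought at all (even if some people are), and further research will weed out the bad matches. But further research can't rescue a false negative: a well-matched candidate who gets ranked low never makes it onto your shortlist in the first place.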
While a voter has an incentive to answer the questions honestly, the candidates do not. Nobody will get raked over the coals if their views happen to evolve between answering, and actually having to vote on an issue, except for a handful of the most contentious issues. Answering a multiple choice question simply does not appear to count as an election promise. (Note: This year one of the sites phrased a few of their questions in the form of explicit yes / no votes on some potential bills; that seemed like a clever idea).
It is an interesting question how exactly an unscrupulous candidate should "optimize" their answers to maximize their chances of getting elected, or if they could do it at all. Most likely the problem is completely intractable without insider knowledge of the algorithms and the answers of other candidates. If you assume that can't happen, the worst you have to deal with is just plain political dishonesty and trying to avoid unnecessary controversy.
The whole idea of trying to compute a match between a voter and a party based on the candidates of that party is fundamentally flawed. The candidates are not the party. The party-wide match is computed based on the aggregate matches of the candidates with the voter. But the full pool of candidates is irrelevant. What actually matters is the pool of people who get elected, and that is obviously not yet known at this time!
A cynic might say that even that's not important, and what really matters are the opinions of the party leader and their coterie, but if you go down that road you might as well just vote randomly.

All this is perhaps better illustrated by a real world example.

This year one of the major election engine sites produced a 72-77% match for me with six of the parties, and significantly lower ratings for the remaining two "serious" parties. When I redid the test a few weeks later, the results for those parties were 70-79% in roughly the same order; so the margin of error just from the uncertainty in my answers was at least a couple of percentage points.

Now, this list of six top matches included the conservatives, the xenophobes, the greens, the social democrats, the agrarian centrists, and the language-centric Swedish people's party. All of them were rated as basically equally good within the margin of error, despite some of them being polar opposites. It's just absurd that all these parties spanning most of the ideological spectrum could be equally acceptable to me. I have no idea of what algorithm was used, but it was certainly a good way to give the impression that my vote doesn't matter since every party is the same.

And the UI issue I mentioned earlier is relevant here too. One of those six almost equally good matches was promoted an order of magnitude more than the others on the result page, based on a difference of just a couple of percentage points. And as it happens, that party was reported as an abysmal 50% match on another major site.

Anecdotally, this isn't just a useless idea. It's dangerously deceptive at least as currently implemented.

How the sausage was once made

Oh yes, I promised to tell a story!

The year is 2003. Our first startup had been recently acquihired for what was effectively peanuts compared to our big dotcom bubble dreams. But hey, anything beats not making payroll next month.

One of our customers was the 4th largest party in the country, created in the early 90s from the merger of two struggling communist parties. At that time it was expected for each party to have an election engine on their website, with answers just from their candidates. Yes, a party-specific election engine is a pretty stupid idea. Doesn't matter, since everyone else will have one too. And someone had sold them one for way too cheap, maybe as a sweetener for a website redesign done a bit earlier.

Especially given how rough web development was in 2003 compared to now, the amount of time budgeted was pitiful. I don't remember the totals, but when broken down it'd be like half a day for a CRUD app to gather answers from the candidates, a couple of hours to make some kind of ranking algorithm, a couple of hours for design, and so on.

As was often the case with these underbudget one-off projects, it was given to me and I rolled a quick flat-file Perl special for the data collection and polling. But I had no good ideas on the ranking algorithm. So I explained the problem in a mangled way to a statistician friend, probably stopped listening to the answer too soon, and ended up with the impression that a normal (Pearson) correlation between the answers of a candidate and voter would be appropriate; just map the answers from strongly disagree / disagree / don't care / agree / strongly agree to a 1-5 linear scale, and normalize the -1 to 1 results to a 0 - 100 range instead.
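A minimal Python sketch of the scoring as described above. It is an illustration, not the original Perl code, and the names SCALE, pearson and match_percentage are hypothetical:

```python
from math import sqrt

# Assumed mapping of the five answer options onto a linear 1-5 scale.
SCALE = {
    "strongly disagree": 1,
    "disagree": 2,
    "don't care": 3,
    "agree": 4,
    "strongly agree": 5,
}


def pearson(xs, ys):
    """Plain Pearson correlation of two equally long answer lists.

    Raises ZeroDivisionError if either list has no variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def match_percentage(voter_answers, candidate_answers):
    """Map the correlation from the -1..1 range onto 0..100."""
    voter = [SCALE[a] for a in voter_answers]
    candidate = [SCALE[a] for a in candidate_answers]
    r = pearson(voter, candidate)
    return (r + 1) / 2 * 100


# Hypothetical answer sheet for a five-question engine.
answers = ["strongly disagree", "disagree", "don't care", "agree", "strongly agree"]
same = match_percentage(answers, answers)
opposite = match_percentage(answers, list(reversed(answers)))
```

With these inputs, identical answer sheets score 100 and exactly mirrored ones score 0.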

In reality it wasn't appropriate for a number of reasons. The answers were ordinal and non-linear. The results would be undefined if someone answered every question the same way. Due to the linear transformation property of the correlation computation, someone who answered every question with various degrees of "disagree" could be a pretty decent match with someone who answered everything with various degrees of "agree". And it'd be susceptible to overweighting certain issues if the set of questions wasn't constructed very carefully, but included many questions that correlated closely with each other. (IIRC the questions were supplied by the party HQ, so the chances of them having been carefully constructed to be non-correlating are pretty damn low).
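Two of these failure modes are easy to show with made-up numbers. The sketch below assumes the same 1-5 mapping and 0-100 normalization as described earlier; the answer lists are invented for illustration:

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation; divides by zero if a list has no variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# A voter who disagrees with everything (1 = strongly disagree, 2 = disagree).
voter = [1, 2, 1, 2]
# A candidate who agrees with everything (4 = agree, 5 = strongly agree).
candidate = [4, 5, 4, 5]

# Correlation ignores the constant offset between the two answer sets,
# so these two come out as a perfect match.
shifted_match = (pearson(voter, candidate) + 1) / 2 * 100

# Answering every question the same way leaves no variance,
# so the correlation is undefined (here: a division by zero).
try:
    pearson([3, 3, 3, 3], candidate)
    constant_is_defined = True
except ZeroDivisionError:
    constant_is_defined = False
```

Here shifted_match comes out as 100 even though the two people never agree on a single question, and the all-"don't care" voter gets no result at all.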

But never mind that. I was young, stupid, ignorant of statistics, and just happy that I had something working so quickly. Time to test the system! So I took it out for a spin in the Helsinki voting district. Top match 45%. Try another district. And another, and another. The results were horrible everywhere. But at least it worked superficially, so I sent an email to my coworkers and asked them to kick the tires a bit.

Everyone in the same office, my old startup friends, reported similar results.

We can't possibly ship something like this! They're going to cancel their contract if we give them an election engine that gives at best 50% matches and just loses them votes. So we started whiteboarding ways to nudge the numbers in the right direction without being too obvious. Maybe normalize to a 0 to 1 range first, take a square root, and then normalize to the 0 to 100 range? So 50% becomes around 70%. Is that too aggressive?
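For the record, the whiteboard idea amounts to something like the following sketch. The function name nudged_percentage is hypothetical:

```python
from math import sqrt


def nudged_percentage(raw_percentage):
    """Rescale 0..100 to 0..1, take the square root, and scale back to 0..100."""
    unit = raw_percentage / 100
    return sqrt(unit) * 100


# The endpoints stay put while everything in between gets pulled upwards.
nudged_45 = nudged_percentage(45)
nudged_50 = nudged_percentage(50)
```

A raw 50% turns into about 70.7%, and the 45% top match from Helsinki into about 67%.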

But in the middle of this brainstorming, before we made any changes, the project manager walked into our office. "Guys, you need to change the site. The match percentages are way too high. This will be really embarrassing, it looks like the numbers have been cooked."

Turns out that even if we weren't exactly Ayn Rand-quoting liberalists, entrepreneurial 20-somethings weren't quite the right people to use as test users for the ex-communist party website, while the middle aged mother of three in the office next door had a slightly different viewpoint. After we stopped laughing about the situation, we concluded that the algorithm must be about right, and shipped it. (There might have been a couple of kludges added on top, like inserting a non-integer dummy vote into each answer set, to guarantee some variance so that the results would at least always be defined).

I guess it must have worked ok, since we never heard any complaints. And it being a party-specific election engine, the odds are that it did not materially affect the voting behavior of anyone let alone the final election results. It was only a few years later that I properly understood what a shoddy system we'd created, and how irresponsible building it was.

Even if I might not really like the results I get from the major election engine sites, they must surely be doing a better job than this. But it's still hard to shake off the feeling that automating away a basic human right is the wrong solution to the problem.
