<rss version='2.0'><channel><title>Juho Snellman's Weblog</title><link>https://www.snellman.net/blog/</link><description>Lisp, Perl Golf</description><item><title>LLMs are cheap</title><link>https://www.snellman.net/blog/archive/2025-06-02-llms-are-cheap/</link><description>
&lt;p&gt;This post is making a point - generative AI is relatively cheap -
that might seem so obvious it doesn&#039;t need making. I&#039;m mostly writing it because I&#039;ve
repeatedly had the same discussion in the past six months where people claim the opposite.
Not only is the misconception still around, but it&#039;s not even getting less common. This
is mainly written so that I have a document to point people at the next time it comes up.&lt;/p&gt;

&lt;p&gt;It seems to be a common, if not a majority, belief that Large
Language Models (in the colloquial sense of &amp;quot;things that are like
ChatGPT&amp;quot;) are very expensive to operate. This then leads to a ton
of innumerate analyses about how AI companies must be obviously
doomed, as well as a myopic view on how consumer AI businesses can/will
be monetized.&lt;/p&gt;
&lt;p&gt;It&#039;s an understandable mistake, since inference was indeed very
expensive at the start of the AI boom, and those costs were talked about
a lot. But inference has gotten cheaper even faster
than models have gotten better, and nobody has an intuition for
something becoming 1000x cheaper in two years. It just doesn&#039;t happen.
It doesn&#039;t help that the common pricing model (&amp;quot;$ per million tokens&amp;quot;)
is very hard to visualize.&lt;/p&gt;
&lt;p&gt;So let&#039;s compare LLMs to web search. I&#039;m choosing search as the
comparison since it&#039;s in the same vicinity and since it&#039;s
something everyone uses and nobody pays for, not because I&#039;m suggesting that
ungrounded generative AI is a good substitute for search.&lt;/p&gt;

&lt;read-more&gt;&lt;/read-more&gt;

&lt;p&gt;(It should also go without saying that these are just my personal
opinions.)&lt;/p&gt;
&lt;h3 id=&quot;what-does-a-web-search-cost&quot;&gt;What is the price of a web search?&lt;/h3&gt;
&lt;p&gt;Here&#039;s the public API pricing for some companies operating their own
web search infrastructure, retrieved on 2025-05-02:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;a
href=&quot;https://ai.google.dev/gemini-api/docs/pricing&quot;&gt;Gemini API pricing&lt;/a&gt;
lists a &amp;quot;Grounding with Google Search&amp;quot; feature at $35/1k queries. I
believe that&#039;s the best number we can get for Google, since they don&#039;t publish
prices for a &quot;raw&quot; search result API.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;a
href=&quot;https://www.microsoft.com/en-us/bing/apis/pricing&quot;&gt;Bing Search API&lt;/a&gt;
is priced at $15/1k queries at the cheapest tier.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a
href=&quot;https://brave.com/search/api/&quot;&gt;Brave&lt;/a&gt;
has a price of $5/1k searches at the cheapest tier. Though there&#039;s something very
strange about their pricing structure, with the unit pricing increasing as
the quota increases, which is the opposite of what you&#039;d expect. The
tier with real quota is priced at $9/1k searches.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So there&#039;s a range of prices, but not a horribly wide one, and with the
engines you&#039;d expect to be of higher quality also having higher prices.&lt;/p&gt;
&lt;h3 id=&quot;what-does-equivalent-llm-usage-cost&quot;&gt;What is the price of LLMs in a similar domain?&lt;/h3&gt;
&lt;p&gt;To make a reasonable comparison between those search prices and LLM
prices, we need two numbers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How many tokens are output per query?&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;What&#039;s the price per token?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I picked a few arbitrary queries from my search history, phrased
them as questions, and ran them on Gemini 2.5 Flash (thinking mode off)
in AI Studio:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;[When was the term LLM first used?] -&amp;gt; 361 tokens, 2.5
seconds&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;[What are the top javascript game engines?] -&amp;gt; 1145 tokens, 7.6
seconds&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;[What are the typical carry-on bag size limits in europe?] -&amp;gt; 506
tokens, 3.4 seconds&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;[List the 10 largest power outages in history] -&amp;gt; 583 tokens, 3.7
seconds&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Note that I&#039;m not judging the quality of the answers here. The
purpose is just to get rough numbers for how large typical responses
are. A 500-1000 token range seems like a reasonable estimate.&lt;/p&gt;
&lt;p&gt;What&#039;s the price of a token? The pricing is sometimes different for input
and output tokens. Input tokens tend to be cheaper, and our inputs are
very short compared to the outputs, so for simplicity let&#039;s consider all
the tokens to be outputs. Here&#039;s the pricing of some relevant models,
retrieved on 2025-05-02:&lt;/p&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col style=&quot;width: 50%&quot; /&gt;
&lt;col style=&quot;width: 50%&quot; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&quot;header&quot;&gt;
&lt;th style=&quot;text-align: left;&quot;&gt;Model&lt;/th&gt;
&lt;th style=&quot;text-align: left;&quot;&gt;Price / 1M tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&quot;odd&quot;&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;Gemma 3 27B&lt;/td&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;$0.20 (&lt;a
href=&quot;https://openrouter.ai/google/gemma-3-27b-it&quot;&gt;source&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;even&quot;&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;Qwen3 30B A3B&lt;/td&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;$0.30 (&lt;a
href=&quot;https://openrouter.ai/qwen/qwen3-30b-a3b&quot;&gt;source&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;odd&quot;&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;Gemini 2.0 Flash&lt;/td&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;$0.40 (&lt;a
href=&quot;https://ai.google.dev/gemini-api/docs/pricing&quot;&gt;source&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;even&quot;&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;GPT-4.1 nano&lt;/td&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;$0.40 (&lt;a
href=&quot;https://openai.com/api/pricing/&quot;&gt;source&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;odd&quot;&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;Gemini 2.5 Flash Preview&lt;/td&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;$0.60 (&lt;a
href=&quot;https://ai.google.dev/gemini-api/docs/pricing&quot;&gt;source&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;even&quot;&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;Deepseek V3&lt;/td&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;$1.10 (&lt;a
href=&quot;https://api-docs.deepseek.com/quick_start/pricing&quot;&gt;source&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;odd&quot;&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;GPT-4.1 mini&lt;/td&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;$1.60 (&lt;a
href=&quot;https://openai.com/api/pricing/&quot;&gt;source&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;even&quot;&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;Deepseek R1&lt;/td&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;$2.19 (&lt;a
href=&quot;https://api-docs.deepseek.com/quick_start/pricing&quot;&gt;source&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;odd&quot;&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;Claude 3.5 Haiku&lt;/td&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;$4.00 (&lt;a
href=&quot;https://www.anthropic.com/pricing#anthropic-api&quot;&gt;source&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;even&quot;&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;GPT-4.1&lt;/td&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;$8.00 (&lt;a
href=&quot;https://openai.com/api/pricing/&quot;&gt;source&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;odd&quot;&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;Gemini 2.5 Pro Preview&lt;/td&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;$10.00 (&lt;a
href=&quot;https://ai.google.dev/gemini-api/docs/pricing&quot;&gt;source&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;even&quot;&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;Claude 3.7 Sonnet&lt;/td&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;$15.00 (&lt;a
href=&quot;https://www.anthropic.com/pricing#anthropic-api&quot;&gt;source&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;odd&quot;&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;o3&lt;/td&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;$40.00 (&lt;a
href=&quot;https://openai.com/api/pricing/&quot;&gt;source&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;If we assume the average query uses 1k tokens, then 1k queries use 1M tokens, and these prices are
directly comparable to the prices per 1k search queries. That&#039;s convenient.&lt;/p&gt;
&lt;p&gt;The low end of that spectrum is at least an order of magnitude
cheaper than even the cheapest search API, and even the models at the
low end are pretty capable. The high end is about on par with the
highest end of search pricing. For a pair that&#039;s midrange on quality,
Bing Search ($15/1k queries) vs. Gemini 2.5 Flash ($0.60/1M tokens), the LLM is
1/25th the price.&lt;/p&gt;
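&lt;p&gt;The arithmetic behind that comparison can be sketched in a few lines of Python.
The two prices are the ones quoted in this post; the function name and the default of
1,000 tokens per query are assumptions made for illustration, not measurements.&lt;/p&gt;
&lt;pre&gt;
```python
def llm_price_per_1k_queries(price_per_million_tokens, tokens_per_query=1000):
    """Price in dollars of answering 1,000 queries with an LLM.

    Assumes every token is billed at the output rate, as in the text.
    """
    tokens_for_1k_queries = 1000 * tokens_per_query
    return price_per_million_tokens * tokens_for_1k_queries / 1_000_000


# Prices quoted in this post.
BING_PER_1K_QUERIES = 15.00            # dollars per 1k searches
GEMINI_25_FLASH_PER_1M_TOKENS = 0.60   # dollars per 1M tokens

llm_price = llm_price_per_1k_queries(GEMINI_25_FLASH_PER_1M_TOKENS)
ratio = BING_PER_1K_QUERIES / llm_price

print(f"LLM: ${llm_price:.2f} per 1k queries")
print(f"Search is {ratio:.0f}x the LLM price")
```
&lt;/pre&gt;
&lt;p&gt;Changing &lt;code&gt;tokens_per_query&lt;/code&gt; shows how much longer the responses
would need to be before the two prices meet.&lt;/p&gt;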
&lt;p&gt;Note that many of the above models have cheaper pricing in exchange
for more flexible scheduling (Anthropic, Google and OpenAI give a 50%
discount for batch requests, Deepseek is 50%-75% cheaper during off-peak
hours). I&#039;ve not included those cheaper options in the table to keep
things comparable, but the presence of those cheaper tiers is worth
keeping in mind when thinking about the next section...&lt;/p&gt;
&lt;h3 id=&quot;objection&quot;&gt;Objection!&lt;/h3&gt;
&lt;p&gt;I know some people are going to have objections to this
back-of-the-envelope calculation, and a lot of them will be totally
legit concerns. I&#039;ll try to address some of them preemptively. Slightly
different assumptions can easily lead to clawing back 10% here and 50%
there. But I don&#039;t see how to bridge a 25x gap just for breaking even,
let alone making the AI significantly more expensive. If you want to play
around with different assumptions, there&#039;s a little calculator widget
below.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Surely the typical LLM response is longer than that&lt;/strong&gt;
- I already picked the upper end of what the (very light) testing
suggested as a reasonable range for the type of question that I&#039;d use
web search for. There&#039;s a lot of use cases where the inputs and outputs
are going to be much longer (e.g. coding), but then you&#039;d need to also
switch the comparison to something in that same domain as well.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The LLM API prices must be subsidized to grab market
share -- i.e. the prices might be low, but the costs are high&lt;/strong&gt; -
I don&#039;t think they are, for a few reasons. I&#039;d instead assume
APIs are typically profitable on a unit basis. I have not found any
credible analysis suggesting otherwise.&lt;/p&gt;
&lt;p&gt;First, there&#039;s not that much motive to gain API market share
with unsustainably cheap prices. Any gains would be temporary, since
there&#039;s no long-term lock-in, and better models are released weekly.
Data from paid API queries will also typically not be used for training
or tuning the models, so getting access to more data wouldn&#039;t explain
it. Note that it&#039;s not just that you&#039;d be losing money on each of
these queries for no benefit, you&#039;re losing the compute that could
be spent on training, research, or more useful types of inference.&lt;/p&gt;
&lt;p&gt;Second, some of those models have been released with open weights and
API access is also available from third-party providers who would have
no motive to subsidize inference. (Or the number in the table isn&#039;t even
first party hosting -- I sure can&#039;t figure out what the Vertex AI
pricing for Gemma 3 is). The pricing of those third-party hosted APIs
appears competitive with first-party hosted APIs. For example, see the
&lt;a href=https://artificialanalysis.ai/models/deepseek-r1/providers&gt;Artificial Analysis
summary on Deepseek R1 hosting&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Third, Deepseek released &lt;a
href=&quot;https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md&quot;&gt;actual
numbers&lt;/a&gt; on their inference efficiency in February. Those numbers
suggest that their normal R1 API pricing has about 80% margins
when considering the GPU costs, though not any other serving costs.
&lt;/p&gt;
&lt;p&gt;Fourth, there are a bunch of first-principles analyses of what the cost structure
of models with various architectures should be. Those are of course mathematical
models, but the costs they predict line up pretty well with the observed end-user
pricing of models whose architecture is known. See the additional reading section for
links.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The search API prices amortize building and updating the
search index, LLM inference is based on just the cost of
inference&lt;/strong&gt; - This seems pretty likely to be true, actually? But
the effect can&#039;t really be &lt;em&gt;that&lt;/em&gt; large for a popular model:
e.g. the allegedly leaked OpenAI financials claimed $2B/year spent on
inference vs. $3B/year on training. Given the crazy growth of
inference volumes (e.g. Google recently claimed a &lt;a href=https://blog.google/technology/ai/io-2025-keynote/&gt;
50x increase in token volumes in the last year&lt;/a&gt;) the training costs
are getting amortized much more effectively.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The search API prices must have higher margins than LLM
inference&lt;/strong&gt; - It&#039;s possible. I certainly don&#039;t know what the
margins of any Search API providers are, though it seems fair to assume
they&#039;re pretty robust. But, well, see the point above about Deepseek&#039;s
released numbers on the R1 profit margins.&lt;/p&gt;
&lt;p&gt;Also, it seems quite plausible that some Search providers would
accept lower margins, since at least Microsoft execs have testified
under oath that they&#039;d be willing to pay more for the iOS query stream
than their revenue, just to get more usage data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Web search returns results 20x-100x faster than an LLM
finishes the query, how could it be more expensive?&lt;/strong&gt; - Search
latency can be improved by parallelizing the problem, while LLM
inference is (for now) serial in nature. The task of predicting a single token can
be parallelized, but you can&#039;t predict all the output tokens at
once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But OpenAI made a loss, and they don&#039;t expect to make profit for years!&lt;/strong&gt; -
That&#039;s because a huge proportion of their usage is not monetized at all,
despite the usage pattern being ideal for it. OpenAI reportedly made a
loss of $5B in 2024. They also reportedly have 500M MAUs. To reach break-even,
they&#039;d just need to monetize (e.g. with ads) those free users for an average of $10/year,
or a bit under $1/month. A $1/month ARPU for a service like this would be pitifully low.
&lt;/p&gt;

&lt;p&gt;If the reported numbers are true, OpenAI doesn&#039;t actually have
high costs for a consumer service that popular. High costs are what you&#039;d
expect to see if the cost of inference was the problem. They just
have a very low per-user revenue, by choice.&lt;/p&gt;
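&lt;p&gt;The break-even arithmetic can be written out as a rough sketch. Both inputs are
the reported (unverified) figures quoted above, and spreading the loss evenly over
all monthly active users is a simplifying assumption; the function name is made up
for illustration.&lt;/p&gt;
&lt;pre&gt;
```python
REPORTED_LOSS_2024 = 5_000_000_000  # dollars, as reported
REPORTED_MAU = 500_000_000          # monthly active users, as reported


def break_even_arpu(annual_loss, users):
    """Extra revenue per user needed to cover the loss: (per year, per month)."""
    per_year = annual_loss / users
    return per_year, per_year / 12


per_year, per_month = break_even_arpu(REPORTED_LOSS_2024, REPORTED_MAU)
print(f"${per_year:.2f} per user per year, ${per_month:.2f} per month")
```
&lt;/pre&gt;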

&lt;h3 id=&quot;widget&quot;&gt;Sensitivity analysis&lt;/h3&gt;

&lt;p&gt;
  If you want to play around with different assumptions, here&#039;s
  a calculator:
&lt;/p&gt;

&lt;a href=/blog/stc/files/llm-cost-widget.html target=_blank&gt;Open in new tab&lt;/a&gt;
&lt;iframe id=&quot;widget&quot; src=&quot;/blog/stc/files/llm-cost-widget.html&quot; width=&quot;100%&quot; scrolling=&quot;no&quot; style=&quot;border-width: 0&quot;
        onload=&quot;this.style.height = this.contentWindow.document.body.scrollHeight + &#039;px&#039;&quot;&gt;
&lt;/iframe&gt;

&lt;h3 id=&quot;why-does-this-matter&quot;&gt;Why does this matter?&lt;/h3&gt;
&lt;p&gt;I mean, you&#039;re right to ask that. Nothing really matters and
eventually we&#039;ll all be dead.&lt;/p&gt;
&lt;p&gt;But it is interesting how many people have built their mental
model for the near future on a premise that was true for only a brief
moment. Some things that will come as a surprise to them even assuming
all progress stops right now:&lt;/p&gt;
&lt;p&gt;There&#039;s an argument advanced by some people about how low
prices mean it&#039;ll be impossible for AI companies to ever recoup
model training costs. The thinking seems to be that it&#039;s just
the prices that have been going down, but not the costs, and
the low prices must be an unprofitable race to the bottom for
what little demand there is. What&#039;s happening and will continue
to happen instead is that as costs go down, the prices go down too,
and demand increases as new uses become viable. For an example,
look at the &lt;a href=https://openrouter.ai/rankings&gt;OpenRouter
API traffic volumes&lt;/a&gt;, both in aggregate and in the relative
share of cheaper models.
&lt;/p&gt;
&lt;p&gt;This post was mainly about APIs, but consumer usage will have
exactly the same cost structure, just a different monetization
structure. And given how low the unit costs must be, advertising isn&#039;t merely
viable but lucrative.&lt;/p&gt;
&lt;p&gt;From this it follows that the financials of frontier AI labs are a
lot better than some innumerate pundits would have you believe. They&#039;re
making a loss because they&#039;re not under pressure to be profitable, and
aren&#039;t actively trying to monetize consumer traffic yet. Unlike with APIs, this
could well be a land grab, since unpaid consumer queries
may be used for training while paid API queries typically are not.
Even the subscription pricing might be there
mainly for demand management rather than trying to run a profit.&lt;/p&gt;
&lt;p&gt;The real cost problem isn&#039;t going to be with the LLMs themselves,
it&#039;s with all the backend services that AI agents will want to access
if even a rudimentary form of the agentic vision actually materializes.
Running the AI is already cheap, will keep getting cheaper, and will always
have a monetization model of some sort since it&#039;s what the end user is
interacting with. Neither of those is true for the end-user services
that have been turned into AI backends without their consent. An AI
trying to, I don&#039;t know, book concert tickets whenever a band I like
plays in my town will probably be phenomenally expensive to its
third-party backends (e.g. scraping ticket sites). Those sites will be
uncompensated for the expense while also losing their actual revenue
streams.&lt;/p&gt;
&lt;p&gt;I don&#039;t really know how that plays out.&lt;/p&gt;
&lt;p&gt;Obviously many service owners will try to make unauthorized scraping
harder, but that&#039;s a very hard problem to solve on the web. Maybe some
of them give up on the web entirely, and move to mobile where they can
at least get device attestations. Some might just give up on the open
web, and require all usage to be signed in, with account creation being
gated on something scarce. Some might become unviable and close up shop
entirely.&lt;/p&gt;
&lt;p&gt;If/when that happens, what&#039;s the play on the AI agent side? Will they
choose an escalating adversarial arms race with increasingly dodgy
tactics, or will they eventually decide that it&#039;s better to pay for the
services they use? The former seems unsustainable. If the latter, then
it feels like the core engineering challenge becomes one of building data
provider backends optimized specifically for AI use, with the goal of scaling to
massive volumes and cheaper unit prices, with the trade-off being higher latency, lower
reliability and lower quality.
That could be quite interesting from a systems perspective. (Yes, I&#039;m aware
of &lt;a href=https://www.anthropic.com/news/model-context-protocol&gt;MCP&lt;/a&gt;,
but it&#039;s a solution to an orthogonal issue.)
&lt;/p&gt;
&lt;p&gt;But one thing I&#039;m confident won&#039;t be happening is that it&#039;s the AIs that
turn out to be too expensive to run.&lt;/p&gt;

&lt;h3&gt;Additional reading&lt;/h3&gt;

&lt;p&gt;
Below are some additional references that were not worked into the main
narrative (this article was long-winded enough already).
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=https://arxiv.org/html/2506.04645v1&gt;Inference economics of language models&lt;/a&gt; - (2025) A mathematical model for estimating the cost structure, latency/cost tradeoffs, optimal cluster size, and optimal batching based on the LLM architecture.

&lt;li&gt;&lt;a href=https://www.tensoreconomics.com/p/llm-inference-economics-from-first&gt;LLM Inference Economics from First Principles&lt;/a&gt; - (2025) A very detailed cost-per-token computation on the cost structure of one specific model, Llama 3.3 70B.

&lt;li&gt;&lt;a href=https://www.lesswrong.com/posts/mRKd4ArA5fYhd2BPb/observations-about-llm-inference-pricing&gt;Observations About LLM Inference Pricing&lt;/a&gt; - (2025) Analysis of the economics driven by pricing data rather than first-principles cost structure; concludes that proprietary models have very significant markups.

&lt;li&gt;&lt;a href=https://semianalysis.com/2023/02/13/peeling-the-onions-layers-large-language/&gt;Large Language Models Search Architecture And Cost&lt;/a&gt; - (2023) Analysis on the cost of integrating LLMs into search; the LLM cost data is no longer very relevant due to the age of the article (GPT-3.5) but it uses a different way of estimating the search cost structure.
&lt;/ul&gt;
</description><author>jsnell@iki.fi</author><category>GENERAL</category><pubDate>Mon, 02 Jun 2025 22:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2025-06-02-llms-are-cheap/</guid></item><item><title>Web Environment Integrity vs. Private Access Tokens - They&#039;re the same thing!</title><link>https://www.snellman.net/blog/archive/2023-07-25-web-integrity-api-vs-private-access-tokens/</link><description>
&lt;p&gt;
I&#039;ve seen a lot of discussions in the last week about the &lt;a
href=https://github.com/RupertBenWiser/Web-Environment-Integrity/blob/main/explainer.md&gt;Web
Environment Integrity&lt;/a&gt; proposal. Quite predictably from the moment
it got called things like &amp;quot;DRM for the web&amp;quot;, people have
been arguing passionately against it on HN, GitHub issues, etc. The
basic claims seem to be that it&#039;s going to turn the web into a walled
garden, kill ad blockers, kill all small browsers, kill all small operating systems,
kill accessibility tools like screen readers, etc.

&lt;p&gt;
The Web Environment Integrity proposal is basically:

&lt;ul&gt;
&lt;li&gt;A website can request an attestation from the browser
&lt;li&gt;The browser forwards the attestation requests to an attester
&lt;li&gt;The attester checks properties like hardware and software integrity
&lt;li&gt;If they check out, the attester creates a token and signs it with its private key.
&lt;li&gt;The attester hands off the signed token to the browser, which in turn sends it to the website.
&lt;li&gt;The website checks that the token was signed by a trusted attester
&lt;/ul&gt;

&lt;p&gt;
Here&#039;s a funny thing I suspect few of those commenters know: A
very similar mechanism already exists on the web, and is already
deployed in production browsers (Safari), operating systems (&lt;a href=https://developer.apple.com/news/?id=huqjyh7k&gt;iOS, macOS&lt;/a&gt;),
and hosting infrastructure (&lt;a href=https://blog.cloudflare.com/eliminating-captchas-on-iphones-and-macs-using-new-standard/&gt;Cloudflare&lt;/a&gt;, &lt;a href=https://www.fastly.com/blog/private-access-tokens-and-the-future-of-anti-fraud&gt;Fastly&lt;/a&gt;). That mechanism is
&lt;a href=https://www.ietf.org/archive/id/draft-private-access-tokens-01.html&gt;Private Access Tokens&lt;/a&gt; /
&lt;a href=https://datatracker.ietf.org/doc/draft-ietf-privacypass-architecture/13/&gt;Privacy Pass&lt;/a&gt;.

&lt;p&gt;
Here&#039;s what PATs (as deployed by Apple, and on by default) do to the best of my understanding:

&lt;ul&gt;
&lt;li&gt;A website can request an attestation from the browser
&lt;li&gt;The browser forwards the attestation requests to an attester
&lt;li&gt;The attester checks properties like hardware and software integrity.
&lt;li&gt;If they check out, the attester calls the website&#039;s trusted token issuer
&lt;li&gt;The issuer checks whether to trust the attester and whether the information passed by the attester is sufficient, and then issues a token signed by its private key
&lt;li&gt;The attester hands off the signed token to the browser, which passes it to the website.
&lt;li&gt;The website checks that the token was signed by a trusted token issuer
&lt;/ul&gt;
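
&lt;p&gt;
The split between attester and issuer in the steps above can be sketched as a toy
Python model. This is only an illustration under loud assumptions: the function
names, the challenge value and the attester name are all made up, the browser hop
is left out, and an HMAC with a shared secret stands in for the public-key
signatures a real deployment would use, purely so the sketch runs on the standard
library.

&lt;pre&gt;
```python
import hashlib
import hmac
import secrets

# Stand-in for the issuer's signing key. Real deployments use public-key
# signatures; a shared-secret HMAC is used only to keep the sketch runnable.
ISSUER_KEY = secrets.token_bytes(32)


def issuer_issue(attester_name, trusted_attesters, challenge):
    """Issuer: sign the challenge only if it trusts the attester."""
    if attester_name not in trusted_attesters:
        return None
    return hmac.new(ISSUER_KEY, challenge, hashlib.sha256).digest()


def attester_attest(device_ok, challenge, trusted_attesters):
    """Attester: check the device, then ask the issuer for a token."""
    if not device_ok:
        return None
    return issuer_issue("example-attester", trusted_attesters, challenge)


def website_verify(challenge, token):
    """Website: accept only tokens produced with the issuer's key."""
    if token is None:
        return False
    expected = hmac.new(ISSUER_KEY, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, token)


challenge = secrets.token_bytes(16)   # hypothetical value sent by the website
trusted = {"example-attester"}        # attesters the issuer trusts

good = attester_attest(True, challenge, trusted)
bad = attester_attest(False, challenge, trusted)
untrusted = attester_attest(True, challenge, set())

print(website_verify(challenge, good))       # True
print(website_verify(challenge, bad))        # False
print(website_verify(challenge, untrusted))  # False
```
&lt;/pre&gt;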

&lt;p&gt;
This launch was hailed in the tech press as a win for privacy and security, not
as an attempt to kill accessibility tools or build a walled garden.
&lt;a id=&#039;fnref1&#039;&gt;[&lt;a href=&#039;#fn1&#039;&gt;1&lt;/a&gt;]

&lt;p&gt;
You might notice that the basic operating model of the two protocols
is almost exactly the same. So is their intended use. From the
&amp;quot;DRM for websites&amp;quot; perspective, I don&#039;t think there is a
difference.

&lt;p&gt;
With both WEI and PATs, the website would be able to ask Apple to verify that the
request is coming from a genuine non-jailbroken iPhone running
Safari, and block the ones running Firefox on Linux. And in both,
the intent is not for the API to be used for that kind of outright blocking.

&lt;p&gt;
Neither lists e.g. checking whether the browser is running an ad blocker
extension as a use case. Both would have just the same technical capabilities
for making that kind of thing happen, by just having the attester check for it,
and I bet that in both cases the attester would be equally unmotivated in
actually providing that kind of attestation.

&lt;p&gt;
It&#039;s also not that PATs would somehow make it easier for people
to spin up new attesters for small or new platforms. Want to
run your own attester for PATs? You could, but the issuers you care
about will not trust it. &lt;a id=&#039;fnref2&#039;&gt;[&lt;a href=&#039;#fn2&#039;&gt;2&lt;/a&gt;]

&lt;p&gt;
Now, the technologies aren&#039;t quite identical, but the distinctions are
subtle and would just matter for exactly the kind of anti-abuse work
that both of the proposals were ostensibly meant for. The big one is
the WEI proposal including the ability to content-bind the attestation
to a specific operation. It&#039;s something anyone trying to use a mechanism
like this for abuse prevention would think is needed, but that adds no
power to the theorized &amp;quot;DRM for the web&amp;quot; use case. There is also a
more obvious difference between the two, namely whether the attester and
issuer are the same entity or split. But that too is irrelevant
in the discussion on how the technology could be misused.
&lt;a id=&#039;fnref3&#039;&gt;[&lt;a href=&#039;#fn3&#039;&gt;3&lt;/a&gt;]

&lt;p&gt;
In principle there could also be differences in the exact things that
the APIs allow attesting for. But neither standard defines the
exact set of attestations, just the mechanisms.

&lt;p&gt;
Given the DRM narrative would have worked exactly the same for the two
projects, why such a different reception? I can only think of two
differences, both social rather than technical.

&lt;p&gt;
One is that the PAT (and related Privacy Pass) draft standards were
written in the IETF and are dense standardese. There was no plaintext
explainer. Effectively nobody outside of the internet standardization
circles read those drafts, and if they had they wouldn&#039;t have known
whether they needed to be outraged or not. The first time it actually
broke through to the public was when Apple implemented it.

&lt;p&gt;
The other is the framing. PATs were sold to the public exclusively as
a way of seeing fewer captchas. Who wouldn&#039;t want fewer captchas? WEI
was pitched as a bunch of fairly abstract use cases and mostly from
the perspective of the service provider, not for how it&#039;d improve the
user experience by reducing the need for invasive challenges and data
collection.

&lt;p&gt;
This isn&#039;t the first time I&#039;ve seen two attempts at a really similar
project, with one getting lauded while the other gets trashed for
something that&#039;s common to both. But it is the one where the two
things are the most similar, and it feels like it should be
instructive somehow.

&lt;p&gt;
If the takeaway is that standards proposals should be opaque and kept
away from the public for as long as possible, before being launched
straight to prod based on a draft spec, that&#039;d be bad. If it&#039;s that
standards proposals should be carefully written to highlight the
benefit for the end user, even starting from the first draft, that&#039;s
probably pretty good? And if it&#039;s that only Apple can launch any
browser features without a massive backlash, it seems pretty damn bad.

&lt;hr&gt;
  &lt;div class=footnotes&gt;

    &lt;p&gt;
    &lt;a id=&#039;fn1&#039;&gt;[&lt;a href=&#039;#fnref1&#039;&gt;1&lt;/a&gt;] Just to be clear, the
    &lt;a href=https://news.ycombinator.com/item?id=31751203&gt;one significant HN discussion on PATs&lt;/a&gt; had
    similar arguments about it being DRM, so my claim is not that
    absolutely everyone loved PATs. But it didn&#039;t actually get traction as a
    hacker cause celebre, and as far as I can see the general media
    coverage was broadly positive.
    &lt;/p&gt;

    &lt;p&gt;
    &lt;a id=&#039;fn2&#039;&gt;[&lt;a href=&#039;#fnref2&#039;&gt;2&lt;/a&gt;] What&#039;s the process for
    getting Cloudflare or Fastly to trust a non-Apple attester anyway? I
    can&#039;t find any documentation.
    &lt;/p&gt;

    &lt;p&gt;
    &lt;a id=&#039;fn3&#039;&gt;[&lt;a href=&#039;#fnref3&#039;&gt;3&lt;/a&gt;] The split version seems
    kind of superior for deployment, since it means each site needs to only
    care about a single key (their chosen issuer). This makes e.g. the
    creation of a new attester a lot more tractable. You only need to
    convince half a dozen issuers to trust your new attester and
    ingest the keys, not try to sign up every single website in the
    world one by one.
    &lt;/p&gt;
  &lt;/div&gt;
</description><author>jsnell@iki.fi</author><category>GENERAL</category><pubDate>Tue, 25 Jul 2023 18:30:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2023-07-25-web-integrity-api-vs-private-access-tokens/</guid></item><item><title>A monorepo misconception - atomic cross-project commits</title><link>https://www.snellman.net/blog/archive/2021-07-21-monorepo-atomic/</link><description>

&lt;p&gt;

In articles and discussions about &lt;a
href=&#039;https://en.wikipedia.org/wiki/Monorepo&#039;&gt;monorepos&lt;/a&gt;, there&#039;s
one frequently alleged key benefit: atomic commits across the whole
tree let you make changes to both a library&#039;s implementation and the
clients in a single commit. Many authors even go as far as to claim that
this is the only benefit of monorepos.

&lt;p&gt;
I like monorepos, but that particular claim makes no sense! It&#039;s not
how you&#039;d actually make backwards incompatible changes, such as
interface refactorings, in a large monorepo. Instead the process would
be highly incremental, and more like the following:

&lt;ol&gt;
&lt;li&gt;Push one commit to change the library, such that it supports both
the old and new behavior with different interfaces.
&lt;li&gt;Once you&#039;re sure the commit from stage 1 won&#039;t be reverted, push N
commits to switch each of the N clients to use the new interface.
&lt;li&gt;Once you&#039;re sure the commits from stage 2 won&#039;t be reverted, push
one commit to remove the old implementation and interface from the
library.
&lt;/ol&gt;

&lt;read-more&gt;&lt;/read-more&gt;

&lt;p&gt;
There&#039;s a bunch of reasons why this is a nicer sequencing than a
single atomic commit, but they&#039;re mostly variations on the theme:
mitigating risks. If something breaks, you want as few things as
possible to break at once, and for the rollback to a known-good state
to be simple. Here&#039;s how the risks are mitigated at the various
stages in the process:

&lt;ol&gt;
&lt;li&gt;There is nothing risky at all about the first commit. It is just
adding new code that&#039;s not yet used by anyone.

&lt;li&gt;The commits for changing the clients can be done gradually,
starting with the ones that the library owners are themselves working
on, the projects that are most likely to detect bugs, or the clients
that are most forgiving to errors. Depending on the risk profile of
the change, you might even use these commits as a form of staged
rollout, where you&#039;ll wait to see if the previous clients report any
problems in production before sending the next batch of commits
for code review.

&lt;li&gt;The final commit to remove the old implementation can only break a
minimal number of clients: the ones that just started using the library
between the removal commit being reviewed and pushed, and did so
using the old interface. The ideal environment would have tooling in
place to prevent that kind of backslipping from happening in the first
place (e.g. lint warnings on new uses of deprecated interfaces).
&lt;/ol&gt;

&lt;p&gt;
If anything goes wrong in stage 2, it&#039;s trivial to revert a commit
that&#039;s only touching a couple of files. By contrast, reverting a
commit that&#039;s spanning hundreds of projects would be quite painful,
especially if the repo has any kind of per-directory ACLs (which I
think is mandatory for a big monorepo). It gets worse if the breakage
isn&#039;t detected immediately, since the more code the single
change is affecting, the less likely it is that the revert applies
cleanly.

&lt;p&gt;
If anything goes wrong in stage 3, it would also have gone wrong when
using atomic commits. But with atomic commits the breakage in stage 3
is far more likely, since the new users will naturally use the old
interface (the new one doesn&#039;t exist yet in their view of the world),
and since the window between start of code review and committing will
be wider. And again, the rollback will be far easier with the commit
that&#039;s only touching the library and not the clients.

&lt;p&gt;
There are some additional reasons why the huge commit will be
annoying. For example, getting a clean presubmit CI run will become
progressively harder the more projects a single commit is changing.

&lt;p&gt;
Sure, the atomic commit will save a little bit of work in not needing
to have the implementation support both interfaces at once. But that
tiny saving is just not a worthwhile tradeoff when compared to how
much work wrangling the huge commit would be.

&lt;p&gt;
It&#039;s particularly easy to see that the &amp;quot;atomic changes across the
whole repo&amp;quot; story is rubbish when you move away from
libraries, and also consider code that has any kind of more
complicated deployment lifecycle, for example the interactions between
services and client binaries that communicate over an RPC
interface. Obviously you can&#039;t do an atomic change in that case,
since you need to continue supporting the old server implementation
until all client binaries have been upgraded (and are rollback-safe).
The same goes for changes to database schemas, command line
tools, synchronized client-side Javascript + backend changes, etc.

&lt;p&gt;
I think it&#039;s true that monorepos make refactoring easier. So that&#039;s
not the problem. It&#039;s also true that they have atomic commits across
projects. But the two facts have nothing to do with each other. The reasons
monorepos make refactoring simpler all boil down to everyone in the
organization having a shared view of what the current state is:

&lt;ul&gt;

&lt;li&gt;A monorepo will, in practice, mean trunk-based
development. You&#039;ll know that everybody really is on HEAD rather than
actually doing their development on some year-old branch.
&lt;li&gt;And conversely, you&#039;ll know that every user of the library is
using your library from HEAD rather than pinning it to some year-old
version.
&lt;li&gt;It&#039;s trivial to find all the current callers, so that you know
which clients need to be updated. (Once you&#039;ve solved the highly
non-trivial problem of having any kind of monorepo tooling at scale,
of course.)
&lt;/ul&gt;

&lt;p&gt;
In theory you could do the exact same thing with
multirepos assuming sufficient tool support, discipline about code
organization, enforced trunk-based development in all repositories,
a master list of all repositories in the org, and defaulting to all
repositories being readable by every engineer with no hidden silos.
That&#039;s all &lt;i&gt;technically&lt;/i&gt; doable, but I suspect not culturally
compatible with using multirepos in the first place.

&lt;p&gt;
Where does this misconception come from? It&#039;s certainly present in the
&lt;a href=&#039;https://research.google/pubs/pub45424/&#039;&gt;Google monorepo paper&lt;/a&gt;,
which somewhat contradicts itself on this. On one hand,
they describe exactly this form of atomic refactoring as a benefit of
monorepos:

&lt;blockquote&gt;
The ability to make atomic changes is also a very powerful
feature of the monolithic model. A developer can make a major change
touching hundreds or thousands of files across the repository in a
single consistent operation. For instance, a developer can rename a
class or function in a single commit and yet not break any builds or
tests.
&lt;/blockquote&gt;

&lt;p&gt;
But when it comes to what the actual refactoring workflow is, the
process that&#039;s described is quite different:

&lt;blockquote&gt;
A team of Google developers will occasionally undertake a set of
wide-reaching code-cleanup changes to further maintain the health of
the codebase. The developers who perform these changes commonly
separate them into two phases. With this approach, a large
backward-compatible change is made first. Once it is complete, a
second smaller change can be made to remove the original pattern that
is no longer referenced.
&lt;/blockquote&gt;

&lt;p&gt;
I suspect what happened here was that the atomic commits were identified
as a benefit in the abstract, with refactoring being used as an
illustration of a use case. This was then quite understandably read
as a practical example of how you&#039;d work with a monorepo.

&lt;p&gt;
There might be a few cases where atomic commits across the whole
repository are the right solution, but they have to be exceedingly
rare. Renaming a function with thousands of callers,
for example, is probably better handled by just temporarily aliasing
the function, or by temporarily defining the new function in terms of
the old. (But this does suggest that languages, both programming languages and
IDLs, should make aliasing and indirection easy for as many constructs
as possible).
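
&lt;p&gt;
As a rough sketch of the aliasing approach (in Python, with the made-up
function names &lt;code&gt;get_user&lt;/code&gt; and &lt;code&gt;fetch_user&lt;/code&gt; that
don&#039;t come from any real library), the old name can be kept around as a thin
deprecated wrapper while the callers are migrated one small commit at a time:

```python
import warnings


def get_user(user_id):
    """New interface: the name that clients should migrate to."""
    return {"id": user_id}


def fetch_user(user_id):
    """Old interface, kept temporarily as an alias for get_user.

    Hypothetical example: it warns on use so that new callers are
    discouraged, and is deleted in the final commit once no callers remain.
    """
    warnings.warn(
        "fetch_user is deprecated, use get_user",
        DeprecationWarning,
        stacklevel=2,
    )
    return get_user(user_id)
```

&lt;p&gt;
The last commit in the sequence then only needs to delete
&lt;code&gt;fetch_user&lt;/code&gt;, which is trivial to review and to revert.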

&lt;p&gt;
Are there organizations with a large monorepo where atomic
cross-project commits are routinely used to change both the
implementation and the clients?

</description><author>jsnell@iki.fi</author><category>GENERAL</category><pubDate>Wed, 21 Jul 2021 11:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2021-07-21-monorepo-atomic/</guid></item><item><title>Computing multiple hash values in parallel with AVX2</title><link>https://www.snellman.net/blog/archive/2017-03-19-parallel-hashing-with-avx2/</link><description>
  &lt;p&gt;
    I wanted to compute some hash values in a very particular way, and
    couldn&#039;t find any existing implementations. The special circumstances
    were:
  &lt;/p&gt;
  &lt;ul&gt;
    &lt;li&gt;The keys are short (not sure exactly what size they&#039;ll end up,
      but almost certainly in the 12-40 byte range).
    &lt;li&gt;The keys are all of the same length.
    &lt;li&gt;I know the length at compile time.
    &lt;li&gt;I have a batch of keys to process at once.
  &lt;/ul&gt;

  &lt;p&gt;
    Given the above constraints,
    it seems obvious that doing multiple keys in a batch with SIMD
    could speed things up over computing each one individually.  Now,
    typically small data sizes aren&#039;t a good sign for SIMD. But that&#039;s
    not the case here, since the core problem parallelizes so neatly.

  &lt;p&gt;
    After a couple of false starts, I ended up with a version of
    &lt;a href=&#039;https://github.com/Cyan4973/xxHash&#039;&gt;xxHash32&lt;/a&gt;
    that computes hash values for 8 keys at the same time using AVX2. The code
    is at &lt;a href=&#039;https://github.com/jsnell/parallel-xxhash&#039;&gt;parallel-xxhash&lt;/a&gt;.
  &lt;/p&gt;

  &lt;read-more&gt;&lt;/read-more&gt;

  &lt;h3&gt;Benchmarks&lt;/h3&gt;

  &lt;p&gt;
    Before heading off into the weeds with the details, below are a
    couple of pretty graphs showing performance with different key
    sizes for a few different implementations: CityHash64 since it&#039;s
    been my default hash function for years, xxHash64 since my
    parallel implementation was based on xxHash32, and MetroHash64
    since I saw people suggesting it was the fastest option for small
    keys. I did not include FarmHash since it was consistently
    slower than CityHash for all key sizes.
  &lt;/p&gt;

  &lt;p&gt;
    Finally, to isolate the benefits of specializing for the
    statically known key sizes, I&#039;ve included a scalar version of
    xxHash32. It has exactly the same structure as the parallel
    version, except for not using SIMD &lt;a id=&#039;fnref0&#039;&gt;[&lt;a href=&#039;#fn0&#039;&gt;0&lt;/a&gt;].
  &lt;/p&gt;

  &lt;p&gt;
    All implementations computed hashes for the same number of keys;
    the parallel implementations did it 8 keys at a time, the others
    did them sequentially. The tests were run on an i7-6700 with GCC
    6.3.0, with &lt;code&gt;-O3 -march=native
      -fno-strict-aliasing&lt;/code&gt;. The benchmark code is in the
    repository, but you&#039;ll need to bring your own copies of the
    external hash table libraries.
  &lt;/p&gt;

  &lt;p&gt;
    First, let&#039;s look at the time taken per key for key sizes relevant
    to my use case (this graph is 4-72 bytes, but as mentioned before
    the most interesting range for me is around 12-40 bytes):
  &lt;/p&gt;

  &lt;img src=&#039;/blog/stc/images/parallel-xxhash/bench-4-72.png&#039;&gt;

  &lt;p&gt;
    That looks pretty nice, with very significant speedups compared to
    the alternatives on all the key sizes. With larger key sizes the
    parallel Murmur3 (my first try) quickly runs out of steam, but the
    parallel xxHash32 stays ahead of the pack. We&#039;ll switch to
    showing time per byte rather than time per key here.
  &lt;/p&gt;

  &lt;img src=&#039;/blog/stc/images/parallel-xxhash/bench-64-256.png&#039;&gt;

  &lt;p&gt;
    And at 512 bytes or so, the time per byte has flattened out
    completely:
  &lt;/p&gt;

  &lt;img src=&#039;/blog/stc/images/parallel-xxhash/bench-all.png&#039;&gt;

  &lt;h3&gt;Don&#039;t look under the rug&lt;/h3&gt;

  &lt;p&gt;
    So what are the downsides? Why wouldn&#039;t everyone use this?
  &lt;/p&gt;

  &lt;p&gt;The most glaring problem is that most applications
    don&#039;t do hash computations in parallel. Either it&#039;s going to
    be fundamentally impossible, or at least it will require
    a major restructuring.

  &lt;p&gt;
    Second, I&#039;ve swept a small detail under the rug: the parallel
    implementations were using &lt;a href=&#039;https://en.wikipedia.org/wiki/Row-_and_column-major_order&#039;&gt;column-major order&lt;/a&gt; for the
    data. It&#039;s the natural way to structure this. The timings above do
    not include a row-major to column-major conversion step. That&#039;s
    because my application was already using column-major anyway. But
    if that weren&#039;t the case, it&#039;s totally possible that the
    conversion step would wipe away a good chunk of the
    gains. (What about scatter-gather? See below).
  &lt;/p&gt;
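
  &lt;p&gt;
    To make the layout concrete, here is a small sketch in Python rather
    than the C++ of the actual implementation. The function name
    &lt;code&gt;to_column_major&lt;/code&gt; and the toy keys are made up for
    illustration. The assumption is that in column-major order word N of
    every key in a batch is stored contiguously, so that one vector load
    can pick up the same word from all 8 keys:
  &lt;/p&gt;

```python
def to_column_major(keys):
    """Transpose a batch of equal-length keys.

    keys is a list of keys, each a list of 32-bit words (row-major).
    The result has one row per word position, holding that word
    from every key in the batch.
    """
    word_count = len(keys[0])
    assert all(len(key) == word_count for key in keys)
    return [list(column) for column in zip(*keys)]


# A batch of 8 keys, each 3 words (12 bytes) long.
batch = [[10 * k, 10 * k + 1, 10 * k + 2] for k in range(8)]
columns = to_column_major(batch)
```

  &lt;p&gt;
    This transposition is the conversion step that the timings above
    leave out.
  &lt;/p&gt;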

  &lt;p&gt;Third, I suspect that most uses of hash tables use strings as
    keys. This code will not work at all in that use case. Not only
    do the sizes of keys have to be statically known, but (another
    detail I skimmed over above) they also need to be a multiple of
    4 bytes long. Basically, I want to use structures as hash keys;
    not sure how many other people also need that.
  &lt;/p&gt;

  &lt;p&gt;And fourth, the parallel implementations were using the 32-bit
    variants of the algorithms due to reasons that I&#039;ll explain
    later. That does not make the benchmarks unfair (the 64-bit
    versions are faster than the 32-bit ones). But some applications
    will need those extra bits in the hash value. This code can&#039;t
    provide it.

  &lt;p&gt;
    So while this should work fine for me (though that still remains
    to be seen), it might not be a very large ecological niche.

  &lt;h3&gt;What&#039;s interesting about this?&lt;/h3&gt;

  &lt;p&gt;
    Converting from the &lt;a href=&#039;https://github.com/jsnell/parallel-xxhash/blob/58e9966/src/parallel-xxhash.h#L237&#039;&gt;scalar version&lt;/a&gt; to the &lt;a href=&#039;https://github.com/jsnell/parallel-xxhash/blob/58e9966/src/parallel-xxhash.h#L140&#039;&gt;parallel version&lt;/a&gt; is a
    fairly mindless process, not many insights to be had in that
    part. But while doing this, I bumped into some interesting aspects
    on the periphery.
  &lt;/p&gt;

  &lt;h4&gt;Rotates&lt;/h4&gt;

  &lt;p&gt;
    All the fast and high quality hash functions I looked at seemed to
    be descendants of Murmur, and used rotates as their primitive of
    choice for moving bits down. This is most likely because x86 has a
    dedicated rotate instruction, while most other methods require two
    instructions, e.g. shift+xor. For AVX that&#039;s not the case, and you
    need to synthesize the rotate from two shifts and a xor/or.
  &lt;/p&gt;

  &lt;p&gt;
    Based on some quick testing, a single-instruction replacement could
    give a 40% speedup, and a two instruction replacement a 20% speedup.
    There&#039;s not a huge number of single instruction options available
    though: horizontal 16-bit addition/subtraction, or the 8-bit shuffles.
    I suspect neither would work very well due to the effects aligning at
    an 8 bit boundary.
    With two instructions a shift+xor is probably the best option. Would
    be interesting to see if the best speed/quality tradeoff is different
    for AVX than for x86.
  &lt;/p&gt;

  &lt;h4&gt;Multiplies&lt;/h4&gt;

  &lt;p&gt;
    These days new hash functions are mostly built with 64*64-&amp;gt;64
    multiplies.  We won&#039;t have that in SIMD until AVX-512 (and given the way
    things are going, I wonder if a general purpose CPU using AVX-512
    will actually ever launch). Synthesizing a 64-bit multiply from
    32-bit multiplies doesn&#039;t seem viable for this use case. So for
    this use case, we really want to look at the hash functions
    defined a few years ago rather than the latest hotness.
  &lt;/p&gt;

  &lt;h4&gt;Memory layout&lt;/h4&gt;

  &lt;p&gt;
    Like I mentioned earlier, my data is already in column-major order
    so I didn&#039;t need to worry about wrangling that. But at one point I
    thought that it&#039;d be nice to provide an alternate version that
    would work on row-major data. That&#039;s what scatter-gather is for,
    right?
  &lt;/p&gt;

  &lt;p&gt;
    Nope, the gather instructions are just unbelievably slow, and
    additionally for some reason prevented compilers unrolling the
    Murmur3 loop, for a 4x performance loss. (Even on GCC 6.3 and
    clang 3.8). In theory the xxHash inner loop should be better for
    the gather instructions, since at least there you&#039;re not depending
    on the compiler unrolling to get multiple parallel loads
    going. But the results there were only marginally less bad.
  &lt;/p&gt;

  &lt;h4&gt;Auto-vectorization&lt;/h4&gt;

  &lt;p&gt;
    After having written the version using intrinsics, it occurred to
    me that I really should have started off with just writing out
    plain C++ with the same semantics, and see if it auto-vectorizes.
    Because this really looks like it should be a very easy case.
    And while the transformation from a scalar version to using
    intrinsics is not too bad, the transformation to standard C++ expressing
    the same order of operations on the same memory layout is
    easier yet. The &lt;a href=&#039;https://github.com/jsnell/parallel-xxhash/blob/58e9966/src/parallel-xxhash.h#L182&#039;&gt;theoretically auto-vectorizable&lt;/a&gt;
    code certainly looks very pretty compared to the AVX intrinsic
    soup.
  &lt;/p&gt;

  &lt;p&gt;
    But ignoring aesthetics, the results were mixed. GCC 6.3 seemed to vectorize everything
    perfectly. GCC 4.9 &lt;a id=&#039;fnref1&#039;&gt;[&lt;a href=&#039;#fn1&#039;&gt;1&lt;/a&gt;] missed something (I didn&#039;t track down exactly
    what) that cost about 25% performance. And Clang 3.8 did nothing
    at all, with the plain-C++ version being 150% slower than the
    version using intrinsics. So still a bit on the fragile side. But
    this is the best showing for auto-vectorization that I&#039;ve
    experienced so far.
  &lt;/p&gt;

  &lt;p&gt;(The GCC 4.9 case is particularly annoying; it would have been
    easy to write the auto-vectorizable version first, see the speedups
    and think auto-vectorization was working, but miss that it was
    still leaving a lot of performance on the table).
  &lt;/p&gt;

  &lt;h4&gt;32-bit output values&lt;/h4&gt;

  &lt;p&gt;
    The other advantage of 64-bit operations is that the natural
    implementation will end up producing a 64-bit hash value. Now, for
    normal hash tables I&#039;m totally OK with a single 32-bit hash
    value. But there are some use cases like Cuckoo hash tables or
    Bloom Filters where one would really like more key material.
  &lt;/p&gt;

  &lt;p&gt;
    Before moving from Murmur3 to xxHash, I experimented a bit with a
    version that would not only compute results for multiple different
    keys at once, but also do it with multiple different seed values.
    It was actually pretty efficient. I didn&#039;t end up redoing that work
    for the xxHash version though. Primarily since I don&#039;t actually need
    that version right now, and secondarily since I&#039;m actually not
    sure of whether the different seed values will give different enough
    outputs for use in probabilistic data structures.
  &lt;/p&gt;

  &lt;p&gt;(If anyone knows for sure whether the last bit is true or not,
    please let me know).&lt;/p&gt;

  &lt;h4&gt;Is there a faster non-parallel hash function here?&lt;/h4&gt;

  &lt;p&gt;
    As mentioned multiple times, computing multiple keys in parallel
    is a very niche use case. But based on the benchmark graphs for
    large key sizes, I wonder if there&#039;s a decent non-parallel hash
    function hidden here: computing 8 32-bit streams in parallel and
    combining them at the end (or at certain block boundaries). After
    all, that&#039;s already what xxHash does on a smaller scale.
  &lt;/p&gt;

  &lt;p&gt;
    This seems like something that people would already have explored
    in the quest for faster and faster hashing for large key sizes.
    But I can&#039;t find any trace of such an implementation. Maybe
    everyone had already moved to 64-bit multiplies by the time AVX2
    started to be widely deployed and 32-bit multiplies became the
    faster option again. Or maybe 32-bit hash values for large key
    sizes aren&#039;t actually a useful point in the design space.
  &lt;/p&gt;

  &lt;p&gt;
    Designing hash functions is hard. I explicitly did not want to
    invent a new one here, but just re-implement existing algorithms.
    I even went so far as to add in the &amp;quot;mix in the length of the key&amp;quot;
    steps, just so that I could verify my code against the reference
    implementations. Sure, it&#039;s a useless step given the length is
    constant. But it doesn&#039;t cost that much to do either, and lets
    me not worry about accidentally destroying the hash quality.
  &lt;/p&gt;

  &lt;p&gt;
    But if I wanted to burn some brain cycles on designing one and a
    lot of CPU cycles on running SMHasher... 32-bit multiplies +
    shift-xor, working 64 bytes at a time, and code organized in a way
    that makes it easy to auto-vectorize could be a pretty interesting
    place to start from.
  &lt;/p&gt;

  &lt;h3&gt;Footnotes&lt;/h3&gt;

  &lt;div class=footnotes&gt;

    &lt;p&gt;
  &lt;a id=&#039;fn0&#039;&gt;[&lt;a href=&#039;#fnref0&#039;&gt;0&lt;/a&gt;] Note that I tried to make sure to isolate this to specializing
  for the key size, not to e.g. be able to hoist any computations
  outside the benchmark loop. AFAIK all implementations went through
  the same number of non-inlined function calls.

    &lt;p&gt;
  &lt;a id=&#039;fn1&#039;&gt;[&lt;a href=&#039;#fnref1&#039;&gt;1&lt;/a&gt;] Yes, it&#039;s a couple of years old. But that&#039;s Debian stable for
  you. And to be honest, a year ago our main compiler at work was
  still GCC 4.4. Compared to that, 4.9 feels pretty darn luxurious.
  &lt;/div&gt;
</description><author>jsnell@iki.fi</author><category>GENERAL</category><pubDate>Sun, 19 Mar 2017 12:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2017-03-19-parallel-hashing-with-avx2/</guid></item><item><title>I&#039;ve been writing ring buffers wrong all these years</title><link>https://www.snellman.net/blog/archive/2016-12-13-ring-buffers/</link><description>
&lt;p&gt;
So there I was, implementing a one element ring buffer. Which,
I&#039;m sure you&#039;ll agree, is a perfectly reasonable data structure.

&lt;p&gt;
It was just surprisingly annoying to write, due to reasons we&#039;ll
get to in a bit. After giving it a bit of thought, I realized I&#039;d always
been writing ring buffers &amp;quot;wrong&amp;quot;, and there was a better way.

&lt;read-more&gt;&lt;/read-more&gt;


&lt;h3&gt;Array + two indices&lt;/h3&gt;

&lt;p&gt;
There are two common ways of implementing a queue with a ring buffer.

&lt;p&gt; One is to use an array as the backing storage plus two indices
into the array: read and write. To shift a
value from the head of the queue, index into the array by the read
index, and then increment the read index. To push a value to the back,
index into the array by the write index, store the value in that
offset, and then increment the write index.

&lt;p&gt; Both indices will always be in the range 0..(capacity - 1). This
is done by masking the value after an index gets incremented.

&lt;p&gt;
That implementation looks basically like:

&lt;pre&gt;
uint32 read;
uint32 write;
mask(val)  { return val &amp; (array.capacity - 1); }
inc(index) { return mask(index + 1); }
push(val)  { assert(!full()); array[write] = val; write = inc(write); }
shift()    { assert(!empty()); ret = array[read]; read = inc(read); return ret; }
empty()    { return read == write; }
full()     { return inc(write) == read; }
size()     { return mask(write - read); }
&lt;/pre&gt;

&lt;p&gt;
The downside of this representation is that you always waste one element
in the array. If the array is 4 elements, the queue can hold at most 3. Why?
Well, an empty buffer will have a read pointer that&#039;s equal
to the write pointer; a buffer with capacity N and size N would also
have a read pointer equal to the write pointer. Like this:

&lt;p&gt;
&lt;img src=&#039;/blog/stc/images/rb-normal.png&#039;&gt;&lt;/img&gt;

&lt;p&gt;
The 0 and 4 element cases are indistinguishable, so we need to prevent one
from ever happening.  Since empty queues are kind of necessary, it follows
that the latter case needs to go. The queue has to be defined
as full when one element in the array is still unused. And that&#039;s the
way I&#039;ve always done it.

&lt;p&gt;
Losing one element isn&#039;t a huge deal when the ring buffer has thousands
of elements. But when the array is supposed to have just one element...
That&#039;s 100% overhead, 0% payload!

&lt;h3&gt;Array + index + length &lt;/h3&gt;

&lt;p&gt;
The alternative is to use one index field and one length field. Shifting
an element indexes into the array by the read index, increments the read
index, and then decrements the length. Pushing an element writes to
the slot that is &amp;quot;length&amp;quot; elements after the read index, and then
increments the length. That looks something like this:

&lt;pre&gt;
uint32 read;
uint32 length;
mask(val)  { return val &amp; (array.capacity - 1); }
inc(index) { return mask(index + 1); }
push(val)  { assert(!full()); array[mask(read + length++)] = val; }
shift()    { assert(!empty()); --length; ret = array[read]; read = inc(read); return ret; }
empty()    { return length == 0; }
full()     { return length == array.capacity; }
size()     { return length; }
&lt;/pre&gt;

&lt;p&gt;
This uses the full capacity of the array, with the code not getting much more
complex.

&lt;p&gt;
But at least I&#039;ve never liked this representation. The most common use
for a ring buffer is as the intermediary between a concurrent
reader and writer (be it two threads, two processes sharing memory, or
a software process communicating with hardware). And for that, the
index + length representation is kind of miserable. Both the reader and
the writer will be writing to the length field, which is bad for
caching. The read index and the length will also need to always be
read and updated atomically, which would be awkward.

&lt;p&gt;
(Obviously my one element ring buffer wasn&#039;t going to be used in a
concurrent setting. But it&#039;s a matter of principle.)

&lt;h3&gt;Array + two unmasked indices&lt;/h3&gt;

&lt;p&gt; So is there an option that gets the benefits of both
representations, without introducing a third state variable?
(Whether it&#039;s two indices + a size, or two indices + some kind
of a full vs. empty flag). Turns out there is, and it&#039;s really
simple. It uses two indices, but with one tweak compared to the
first solution: don&#039;t squash the indices into the correct
range when they are incremented, but only when they are used to index into
the array. Instead you let them grow unbounded, and eventually
wrap around to zero once the unsigned integer overflows. So:

&lt;p&gt;
&lt;img src=&#039;/blog/stc/images/rb-nowrap.png&#039;&gt;&lt;/img&gt;

&lt;p&gt;
This reclaims the wasted slot.
The code modifying the indices also becomes simpler, since the
clumsy ordering of increments vs. array accesses was only needed for
maintaining the invariant that the index is always in range.

&lt;pre&gt;
uint32 read;
uint32 write;
mask(val)  { return val &amp; (array.capacity - 1); }
push(val)  { assert(!full()); array[mask(write++)] = val; }
shift()    { assert(!empty()); return array[mask(read++)]; }
&lt;/pre&gt;

&lt;p&gt;
Checking the status of the ring also gets simpler:

&lt;pre&gt;
empty()    { return read == write; }
full()     { return size() == array.capacity; }
size()     { return write - read; }
&lt;/pre&gt;
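
&lt;p&gt;
For something that actually runs, here&#039;s a rough Python transliteration
of the above. Python integers don&#039;t wrap around, so this sketch emulates
uint32 overflow by reducing the indices modulo 2**32. That emulation and the
class name &lt;code&gt;Ring&lt;/code&gt; are assumptions of the sketch. It also uses
modulo rather than a bitwise and for the mask, which is equivalent for power of
two capacities.

```python
WORD = 2**32  # emulate uint32 wraparound


class Ring:
    def __init__(self, capacity):
        # Capacity must be a power of two, at most half the index range.
        assert capacity in [2**n for n in range(32)]
        self.array = [None] * capacity
        self.read = 0
        self.write = 0

    def mask(self, index):
        return index % len(self.array)

    def size(self):
        return (self.write - self.read) % WORD

    def empty(self):
        return self.read == self.write

    def full(self):
        return self.size() == len(self.array)

    def push(self, val):
        assert not self.full()
        self.array[self.mask(self.write)] = val
        self.write = (self.write + 1) % WORD

    def shift(self):
        assert not self.empty()
        val = self.array[self.mask(self.read)]
        self.read = (self.read + 1) % WORD
        return val


# A one element ring now holds one element.
ring = Ring(1)
ring.push("a")
```

&lt;p&gt;
After the push, the one element ring reports both &lt;code&gt;full()&lt;/code&gt;
and a size of 1, which the first representation could not do.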

&lt;p&gt;
This all works, assuming the following restrictions:

&lt;ul&gt;
&lt;li&gt; The implementation language supports wraparound on unsigned
  integer overflow.
  If it doesn&#039;t, this approach doesn&#039;t really buy anything. (What will
  happen in these languages is that the indices get promoted to bignums
  which will be bad, or they get promoted to doubles which will be worse.
  So you&#039;ll need to manually restrict their range anyway).
&lt;li&gt; The capacity must always be a power of two. (&lt;b&gt;Edit&lt;/b&gt;: This
  limitation does not come just from the definition of &lt;code&gt;mask&lt;/code&gt;
  using a bitwise &lt;code&gt;and&lt;/code&gt;.
  It applies even if mask were defined using modular arithmetic or a
  conditional. It&#039;s required for the code to be correct on unsigned
  integer overflow.)
&lt;li&gt; The maximum capacity can only be half the range of the index
  data types. (So 2^31 when using 32 bit unsigned integers).
  In a way that could be interpreted as stealing the top bit of the
  index to function as a flag. But the case against flags isn&#039;t so much
  the extra memory as having to maintain the extra state.
&lt;/ul&gt;
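
&lt;p&gt;
To see why the power of two restriction is about overflow rather than about
the mask, consider a toy version with 8 bit indices (chosen here only to keep
the numbers small) and a modulo-based mask. The helper name
&lt;code&gt;slots_around_wrap&lt;/code&gt; is made up for this sketch. It lists the
array slots used by four consecutive pushes, starting two pushes before the
index wraps from 255 to 0:

```python
INDEX_RANGE = 2**8  # toy 8-bit indices, to keep the numbers small


def slots_around_wrap(capacity):
    """Array slots used by the four pushes that straddle index overflow."""
    indices = [(INDEX_RANGE - 2 + i) % INDEX_RANGE for i in range(4)]
    return [index % capacity for index in indices]


broken = slots_around_wrap(3)  # indices 254, 255, 0, 1
working = slots_around_wrap(4)
```

&lt;p&gt;
With a capacity of 3 the slots are 2, 0, 0, 1, so slot 0 gets written twice in
a row and an element is clobbered. With a capacity of 4 the sequence is
2, 3, 0, 1, and the ring keeps going around in order.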

&lt;p&gt;
All of those seem like non-issues. What kind of a monster would make
a non-power of two ring anyway?

&lt;p&gt; This is of course not a new invention. The earliest instance I
could find with a bit of searching was from 2004, with Andrew Morton
&lt;a href=&#039;http://lkml.iu.edu/hypermail/linux/kernel/0409.1/2709.html&#039;&gt;mentioning
it in a code review&lt;/a&gt; so casually that it seems to have been a
well established trick. But the vast majority of implementations
I looked at do not do this.

&lt;p&gt;
So here&#039;s the question: Why do people use the version that&#039;s inferior
and more complicated? I must have written a dozen ring buffers over
the years, and before being forced to really think about it, I&#039;d always
just used the first definition. I can understand why a textbook wouldn&#039;t
take advantage of unsigned integer wraparound. But it seems like it
should be exactly the kind of cleverness that hackers would relish
using and passing on.

&lt;ul&gt;
&lt;li&gt;Could it just be tradition? It seems likely that
this is the kind of thing one learns by osmosis, and then never
revisits. But even so, you&#039;d expect the &amp;quot;good&amp;quot;
implementations to push out the &amp;quot;bad&amp;quot; ones at some
point, which doesn&#039;t seem to be happening in this case.
&lt;li&gt;Is it resistance to having code actually take advantage of integer
overflow, rather than it be a sign of a bug?
&lt;li&gt;Are non-power of two capacities for ring buffers actually
common?
&lt;/ul&gt;

&lt;p&gt;
Join me next week for the exciting sequel to this post, &amp;quot;I&#039;ve
been tying my shoelaces wrong all these years&amp;quot;.

</description><author>jsnell@iki.fi</author><category>GENERAL</category><pubDate>Tue, 13 Dec 2016 17:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2016-12-13-ring-buffers/</guid></item><item><title>Ratas - A hierarchical timer wheel</title><link>https://www.snellman.net/blog/archive/2016-07-27-ratas-hierarchical-timer-wheel/</link><description>
&lt;p&gt;
  Last week I needed a timer wheel for a hobby project. That&#039;s a
  data structure that&#039;s been reimplemented over and over in
  the last three decades, but
  for various reasons I couldn&#039;t get excited by any of the
  freely available ones. Obviously this means that one more
  implementation was needed, hence
  &lt;a href=&#039;https://github.com/jsnell/ratas/&#039;&gt; Ratas - a
  hierarchical timer wheel&lt;/a&gt;. Unfortunately my vacation ran
  out before I could get back to the original project, but that&#039;s the
  nature of yak shaving.
&lt;/p&gt;

&lt;p&gt;
  In this post I&#039;ll first explain briefly what timer
  wheels are - you might want to read one of the references
  instead if you&#039;ve got the time - and then go into more detail
  on why I wrote a new one.
&lt;/p&gt;

&lt;read-more&gt;&lt;/read-more&gt;

&lt;h3&gt;(Hierarchical) timer wheels&lt;/h3&gt;

&lt;p&gt;
  Timer wheels are one way of implementing timer queues, which
  in turn are used to schedule events to happen at some
  future time. If you have tens or hundreds of timers, it doesn&#039;t
  matter much how they&#039;re stored. An unsorted list will do just
  fine. To handle millions of timers you need something a bit more
  sophisticated.
&lt;/p&gt;

&lt;p&gt;
  A timer wheel is effectively a ring buffer of
  linked lists of events, and a pointer to the ring buffer.
  Each slot corresponds to a specific timer tick, and contains
  the head of a linked list. The linked list contains the
  events that should happen on that tick. So something like this.
&lt;/p&gt;

&lt;img style=&#039;padding-left: 5ex;&#039; src=&#039;/blog/stc/images/wheels1.a.png&#039;/&gt;

&lt;p&gt;
(The pointer is in red, the wheel slots in gray, the events
in orange, and the numbers show the time the slot/event is
associated with.)

&lt;p&gt;
  For every tick of time, the pointer moves forward one slot.
  The slot that was passed will now refer one full rotation to
  the future. The first slot is no longer tick 0, but tick 8:
&lt;/p&gt;

&lt;img style=&#039;padding-left: 5ex;&#039; src=&#039;/blog/stc/images/wheels1.b.png&#039;/&gt;

&lt;p&gt;
  Any events in the slot will get executed as the pointer
  passes it. So on the next tick events a and b are executed,
  and removed from the ring.
&lt;/p&gt;

&lt;img style=&#039;padding-left: 5ex;&#039; src=&#039;/blog/stc/images/wheels1.c.png&#039;/&gt;

&lt;p&gt;
  A timer wheel has O(1) time complexity and cheap constant
  factors for the important operations of inserting or removing
  timers. Various kinds of sorted sequences (lists, trees, heaps)
  will scale worse, and the constant factors tend to be larger.
  But a basic timer wheel only works for a limited time range.
  The question is how you extend them to work when the timer
  range is larger than the size of the ring.
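
&lt;p&gt;
  As a rough illustration of the structure described above, here is a
  minimal single-level wheel. This is a sketch and not the Ratas code:
  the &lt;code&gt;SimpleWheel&lt;/code&gt; class and its method names are made up
  for this example, and it uses &lt;code&gt;std::list&lt;/code&gt; for brevity where
  a real implementation would more likely use intrusive lists, so that
  cancellation can be O(1) too.
&lt;/p&gt;

&lt;pre&gt;
#include &amp;lt;cstddef&amp;gt;
#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;functional&amp;gt;
#include &amp;lt;list&amp;gt;
#include &amp;lt;utility&amp;gt;
#include &amp;lt;vector&amp;gt;

// Hypothetical single-level timer wheel (illustration only).
class SimpleWheel {
public:
    using Callback = std::function&amp;lt;void()&amp;gt;;

    // num_slots must be greater than zero.
    explicit SimpleWheel(std::size_t num_slots) : slots_(num_slots) {}

    // Schedule cb to run delta ticks from now. Fails if the delay is
    // zero or longer than one full rotation of the wheel.
    bool schedule(std::uint64_t delta, Callback cb) {
        if (delta == 0 || delta &amp;gt; slots_.size()) return false;
        slots_[(now_ + delta) % slots_.size()].push_back(std::move(cb));
        return true;
    }

    // Advance the pointer by one slot and run the events stored there.
    void tick() {
        ++now_;
        std::list&amp;lt;Callback&amp;gt; due;
        due.swap(slots_[now_ % slots_.size()]);
        for (Callback&amp;amp; cb : due) cb();
    }

private:
    std::uint64_t now_ = 0;
    std::vector&amp;lt;std::list&amp;lt;Callback&amp;gt;&amp;gt; slots_;
};
&lt;/pre&gt;

&lt;p&gt;
  Both operations touch exactly one slot, which is where the O(1)
  behavior comes from. The check in &lt;code&gt;schedule&lt;/code&gt; is the
  limited time range mentioned above.
&lt;/p&gt;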

&lt;p&gt;
  One solution is the hierarchical timer wheel, which layers
  multiple simple timer wheels running at different resolutions
  on top of each other. Each wheel has its own slots and its own
  pointer.
&lt;/p&gt;

&lt;p&gt; When an event is scheduled far enough in the future that it does
 not fit the innermost (core) wheel, it instead gets scheduled on one
 of the outer wheels. So something like the following, where the timers
 scheduled for ticks 9 and 13 have been stored in the first slot of the
second wheel:
&lt;/p&gt;

&lt;img style=&#039;padding-left: 5ex;&#039; src=&#039;/blog/stc/images/wheels2.b.png&#039;/&gt;


&lt;p&gt;Usually a timer tick will just advance the pointer on the core wheel,
working exactly like the simple timer wheel. That&#039;s true all the way
up to and including tick 7 in this example.
&lt;/p&gt;

&lt;img style=&#039;padding-left: 5ex;&#039; src=&#039;/blog/stc/images/wheels2.c.png&#039;/&gt;

&lt;p&gt;
  But when the pointer of the core wheel wraps around, the pointer
  for the second wheel will advance by one slot. The events contained
  in that slot will either be executed or promoted to the correct
  slot in the core ring. (As happens here for events e and f).
&lt;/p&gt;

&lt;img style=&#039;padding-left: 5ex;&#039; src=&#039;/blog/stc/images/wheels2.d.png&#039;/&gt;

&lt;p&gt;
  This obviously generalizes to more than two layers. A full rotation
  of any single wheel will advance the pointer by one on the next
  layer of wheels.
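
&lt;p&gt;
  One way to express the choice of wheel is sketched below. It assumes
  8 slots on every wheel and absolute tick numbers. The
  &lt;code&gt;Placement&lt;/code&gt; struct and &lt;code&gt;place&lt;/code&gt; function
  are hypothetical names for this example, not part of Ratas.
&lt;/p&gt;

&lt;pre&gt;
#include &amp;lt;cstdint&amp;gt;

// Assumed for this sketch: every wheel has 8 slots.
const std::uint64_t kNumSlots = 8;

struct Placement {
    int level;           // 0 is the core wheel
    std::uint64_t slot;  // slot index within that wheel
};

// Pick the wheel and slot for an event due at absolute tick when,
// given the current absolute tick now. Requires when &amp;gt; now.
Placement place(std::uint64_t now, std::uint64_t when) {
    int level = 0;
    std::uint64_t span = 1;  // ticks covered by one slot at this level
    // Move outward until the event is less than one rotation away at
    // the resolution of the current wheel.
    while (when / span - now / span &amp;gt;= kNumSlots) {
        span *= kNumSlots;
        ++level;
    }
    return Placement{level, (when / span) % kNumSlots};
}
&lt;/pre&gt;

&lt;p&gt;
  With the current tick at 0, ticks 9 and 13 both end up in the same
  slot of the second wheel. Calling the same function again once the
  current tick is 8 places tick 9 on the core wheel, which is the
  promotion step.
&lt;/p&gt;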

&lt;p&gt;
  The 1987 paper &lt;a href=&#039;http://www.cs.columbia.edu/~nahum/w6998/papers/sosp87-timing-wheels.pdf&#039;&gt;Hashed and Hierarchical Timing Wheels: Data
  Structures for the Efficient Implementation of a Timer
  Facility&lt;/a&gt; by Varghese &amp;amp; Lauck is where the concept of
  hierarchical timing wheels &lt;a id=&#039;fnref0&#039;&gt;[&lt;a href=&#039;#fn0&#039;&gt;0&lt;/a&gt;] was introduced.

  As usual, Adrian Colyer&#039;s
  &lt;a href=&#039;https://blog.acolyer.org/2015/11/23/hashed-and-hierarchical-timing-wheels/&#039;&gt;summary&lt;/a&gt; is great if you don&#039;t have time
  to read the full paper.
&lt;/p&gt;

&lt;h3&gt;Single-file implementation, no dependencies&lt;/h3&gt;
&lt;p&gt;
  The main reason I couldn&#039;t use existing implementations was that the
  best ones were deeply embedded in larger systems, and would have
  taken a lot of work to extract into stand-alone libraries. Ratas is
  a single-file implementation with no external dependencies beyond
  C++11. So it should be pretty easy to drop into any C++ project.

&lt;p&gt;
  In a related but equally important point, Ratas doesn&#039;t have any
  internal time source, its notion of time is driven from the outside
  by the user of the library. This is actually a more important
  property than might first appear. There are a lot of event loop
  libraries that have timer queues of some sort, but the event loops
  by their nature want to be in control of execution. I needed a
  component to use as part of building a custom event loop instead &lt;a
  id=&#039;fnref1&#039;&gt;[&lt;a href=&#039;#fn1&#039;&gt;1&lt;/a&gt;].

&lt;p&gt;
  Some of the libraries I looked at had interesting implementation
  strategies. (Wow, DPDK uses skiplists for the timer queue?). It
  would have been great to run some deterministic benchmarks comparing
  all these implementations. But in general even a crude hack job
  to extract the minimum viable free-standing implementation out of
  them was too much work.
&lt;/p&gt;

&lt;h3&gt;Limiting number of events triggered in a single timestep&lt;/h3&gt;
&lt;p&gt;
  One operation that I&#039;ve wanted in the past is limiting the number of
  timers triggered in a single call to the
  timer&#039;s &lt;code&gt;advance&lt;/code&gt; method. If exceeded, the timer
  wheel should bail out early, let the application do some work,
  and then continue where it left off. This way an unfortunate
  clump of timers can&#039;t starve the main processing loop for too
  long. &lt;a id=&#039;fnref2&#039;&gt;[&lt;a href=&#039;#fn2&#039;&gt;2&lt;/a&gt;]
&lt;/p&gt;

&lt;p&gt;
  Of course the main application loop will need special logic
  for this, and must call timer processing more frequently than
  normal until the backlog has been dealt with.
&lt;/p&gt;

&lt;p&gt;
  This is trivial with a fully ordered event queue. With a timer
  wheel it means making the timer processing re-entrant as a
  whole (including the contents of the wheel being modified
  while it&#039;s still in the state between ticks).  Luckily there&#039;s
  a way to do this with very little extra wheel-level state and
  with no changes at all to the event scheduling or canceling
  logic. So the overhead on the fast path is pretty much
  immeasurable.
&lt;/p&gt;

&lt;h3&gt;Optimize for high occupancy, not low&lt;/h3&gt;

&lt;p&gt;
  One of the perceived problems of timer wheels is that while event
  insertion and deletion are O(1) operations, finding out the
  time remaining until the next event triggers is O(m+n) where m
  is the total number of timer wheel slots and n is the number of
  events.  Specifically, you need to walk through the wheel
  until you find a slot with some events. Then depending
  on the resolution of the slot, you might need to walk through
  the full event chain of that slot. In addition to the algorithmic
  complexity, both of these operations are effectively pointer
  chasing, so there will be a lot of cache pollution.

&lt;p&gt;
  There are various tricks that can be used to make this
  operation cheaper. For example each wheel could have a bitmap
  that parallels the array of slots in the wheel. If the bitmap
  has a 1 in a given position, the slot is non-empty. This
  allows the implementation to short-circuit the search by
  skipping the empty slots completely.
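
&lt;p&gt;
  A sketch of that bitmap lookup, assuming a wheel of exactly 64 slots
  so that the bitmap fits in one machine word. The function name is
  hypothetical, and the bit-counting loop is kept portable. A real
  implementation would presumably use a count-trailing-zeros
  instruction instead.
&lt;/p&gt;

&lt;pre&gt;
#include &amp;lt;cstdint&amp;gt;

// Assumes a 64-slot wheel. Bit i of bitmap is set when slot i is
// non-empty. Returns the number of slots between pos and the next
// non-empty slot (0 if the slot at pos is non-empty), or -1 if the
// whole wheel is empty.
int distance_to_next_event(std::uint64_t bitmap, unsigned pos) {
    if (bitmap == 0) return -1;
    pos %= 64;
    // Rotate the bitmap so that the slot at pos becomes bit 0.
    std::uint64_t rotated =
        pos == 0 ? bitmap : (bitmap &amp;gt;&amp;gt; pos) | (bitmap &amp;lt;&amp;lt; (64 - pos));
    // Count the empty slots before the first non-empty one.
    int distance = 0;
    while ((rotated &amp;amp; 1) == 0) {
        rotated &amp;gt;&amp;gt;= 1;
        ++distance;
    }
    return distance;
}
&lt;/pre&gt;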

&lt;p&gt;
  I am not convinced this class of optimizations is a good
  tradeoff. They speed up a couple of operations (looking up the
  time of the next event, advancing the time by a lot of ticks in
  one go), but only if the timer wheel is at a low
  utilization. In exchange you&#039;re paying a small extra cost on
  every insert and deletion operation. &lt;a id=&#039;fnref4&#039;&gt;[&lt;a href=&#039;#fn4&#039;&gt;4&lt;/a&gt;]

&lt;p&gt;
  And here&#039;s the thing. If the wheel is at a low utilization, it
  means that the program as a whole is at low
  utilization. That&#039;s exactly the situation where I don&#039;t care
  about the performance of a component like this. Performance
  only starts to matter once the system is under load. When the
  system as a whole is under heavy load, so are the timer
  wheels. At that point these lookup operations will
  short-circuit almost immediately, so the optimizations do
  nothing. On the other hand, that&#039;s exactly when the extra overhead
  added to insertions and deletions will be most significant.

&lt;p&gt;
  There is however one useful thing we can do on the interface
  side to help with the original issue. Every time I&#039;ve used any
  kind of &lt;code&gt;ticks_until_next_event()&lt;/code&gt;-like
  functionality, it&#039;s with a pattern like this:
&lt;/p&gt;

&lt;pre&gt;
  Tick sleep_usec = std::min(timers.ticks_until_next_event(), 1000);&lt;/pre&gt;

&lt;p&gt;
  There&#039;s some upper bound on how large a result I&#039;m interested
  in. Even if there are no timer events to be handled in the next
  millisecond, there&#039;s something non-timer driven that I know
  will need to happen. So let&#039;s just tell the timer wheel what
  that upper bound is:

&lt;pre&gt;
  Tick sleep_usec = timers.ticks_until_next_event(1000);&lt;/pre&gt;

&lt;p&gt;
  This allows the timer wheel to short circuit the search, and
  return as soon as it&#039;s clear that the wheel contains no events
  scheduled to happen before that threshold.
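
&lt;p&gt;
  For a single wheel the early exit could look roughly like the
  following. This is a simplified sketch, not the Ratas implementation:
  the function name is made up, and the wheel is reduced to a vector
  holding the number of events in each slot.
&lt;/p&gt;

&lt;pre&gt;
#include &amp;lt;cstddef&amp;gt;
#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;vector&amp;gt;

// counts[i] is the number of events in slot i, pos is the index of the
// slot the pointer is currently at. Returns the number of ticks until
// the next event, or max if nothing is due within max ticks.
std::uint64_t bounded_ticks_until_next_event(
        const std::vector&amp;lt;int&amp;gt;&amp;amp; counts,
        std::size_t pos,
        std::uint64_t max) {
    // Never look further than max slots ahead, or one full rotation.
    for (std::uint64_t i = 1; i &amp;lt;= max &amp;amp;&amp;amp; i &amp;lt;= counts.size(); ++i) {
        if (counts[(pos + i) % counts.size()] &amp;gt; 0) return i;
    }
    return max;
}
&lt;/pre&gt;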

&lt;p&gt;
  Another option for reducing the cost
  of &lt;code&gt;ticks_until_next_event&lt;/code&gt; is to allow it
  to return a lower bound rather than an accurate result.
  If the process reaches a slot with more than
  one tick of granularity, don&#039;t walk through the chain of
  events but return the lower bound of that slot&#039;s tick
  range. I didn&#039;t do this since again that feels like
  optimizing for the uninteresting case, and since the API would
  then need separate accurate and approximate operations.
  (Having just the approximate operation available seems wrong).
&lt;/p&gt;

&lt;h3&gt;Range-based scheduling&lt;/h3&gt;

&lt;p&gt;
  A lot of timer events get scheduled over and over again, but
  never executed &lt;a id=&#039;fnref3&#039;&gt;[&lt;a href=&#039;#fn3&#039;&gt;3&lt;/a&gt;]. Often
  the exact timing of their execution
  doesn&#039;t matter, but there is a lot of expensive churn as the
  timer is adjusted by minute amounts. To reduce the churn, Ratas
  includes a second scheduling interface that takes a range of
  acceptable times rather than a single exact time. For example:
&lt;/p&gt;

&lt;pre&gt;timers.schedule_in_range(event, 100000, 101000);&lt;/pre&gt;

&lt;p&gt;
  It is then up to the implementation to decide on the optimal
  scheduling. Right now the implementation
  of &lt;code&gt;schedule_in_range&lt;/code&gt; decides immediately on a
  single tick to schedule the event for. All information on the
  range is lost after that, rather than maintained in the long
  term. The decision is currently done as follows:
&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt; If the timer is already scheduled in the right interval, just do nothing.
  &lt;li&gt; Prefer scheduling the timer on an exact slot boundary, so that
    it can be executed in-place rather than promoted to the inner
    wheel.
  &lt;li&gt; Prefer scheduling the timer as late in the range as possible.
    This is important to maximize the efficiency of the first point.
&lt;/ul&gt;
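
&lt;p&gt;
  The second and third rules can be sketched as below. The assumptions
  here are mine rather than taken from the Ratas source: 256 slots per
  wheel, 8 wheels, a range given in absolute ticks, and the coarsest
  possible slot boundary winning over a later but less aligned tick.
  The first rule (leave an already acceptable timer alone) is left out,
  and &lt;code&gt;pick_tick&lt;/code&gt; is a hypothetical name.
&lt;/p&gt;

&lt;pre&gt;
#include &amp;lt;cstdint&amp;gt;

const int kWidthBits = 8;  // assumed: 256 slots per wheel
const int kNumLevels = 8;  // assumed: 8 wheels cover a 64-bit tick

// Pick a tick in the inclusive range [start, end]. Requires
// start &amp;lt;= end. Prefers the latest tick that lies on a slot boundary
// of the outermost wheel possible, and falls back to end.
std::uint64_t pick_tick(std::uint64_t start, std::uint64_t end) {
    for (int level = kNumLevels - 1; level &amp;gt; 0; --level) {
        // Number of ticks covered by one slot on this wheel.
        std::uint64_t granularity = std::uint64_t(1) &amp;lt;&amp;lt; (kWidthBits * level);
        if (granularity &amp;gt; end) continue;
        // Latest slot boundary of this wheel that is not after end.
        std::uint64_t candidate = end - end % granularity;
        if (candidate &amp;gt;= start) return candidate;
    }
    return end;
}
&lt;/pre&gt;

&lt;p&gt;
  Under these assumptions a range of 100000 to 101000 resolves to tick
  100864, the last multiple of 256 inside the range.
&lt;/p&gt;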

&lt;p&gt;
  Why reify the timing immediately, and drop the range
  information? Mostly because it&#039;s not clear there&#039;s a good time
  to use that information later on. The main times we&#039;re going to
  look at the event again is when it&#039;s up for execution, or when
  it gets rescheduled. In the first case the range gives us nothing.
  In the second case we&#039;ve already gotten a new and improved range.
&lt;/p&gt;

&lt;p&gt;
  Using this additional interface didn&#039;t make as big a difference as I
  was hoping for in my &lt;a href=&#039;#benchmarks&#039;&gt;benchmark&lt;/a&gt;,
  just about 10%. The reason is
  that only one of my four timer types was a good match
  for range-based scheduling. But I don&#039;t know that it&#039;d be appropriate
  for a much higher proportion of real world use cases, so that&#039;s
  fair enough.

&lt;p&gt;
  Approximate event scheduling is of course not a new concept in any
  way &lt;a href=&#039;#fn5&#039; name=&#039;fnref5&#039;&gt;[5]&lt;/a&gt;.
  And even if approximate scheduling is
  not supported by the timer library, some parts of it you can
  emulate on the application level. But even so it seems like
  something you&#039;d want integrated properly in the library.
  Native functionality is always more comfortable to use than
  wrappers.
&lt;/p&gt;

&lt;p&gt;
Finally, there&#039;s one more alternate strategy to mitigate the churn of
timers creeping later and later. If a timer is already active and is
rescheduled further into the future, update its scheduled time but
leave the event structure on the ring in exactly the same place. When
the event comes up on the wheel, the timer wheel checks the execution
time, notices the time is still in the future, and reschedules the
event to its proper location instead of executing the callback.

&lt;p&gt;
The theory sounds plausible, but in my testing this mostly resulted in
small slowdowns. It&#039;s of course dependent on the workload, but it&#039;s
definitely too fragile to use as a default and too obscure to be worth
an option. I&#039;d also hate losing the invariant that a slot only contains
events with the correct tick.

&lt;a name=&#039;benchmarks&#039;&gt;&lt;/a&gt;
&lt;h3&gt;A benchmark program&lt;/h3&gt;

&lt;p&gt;
I wrote a &lt;a
href=&#039;https://github.com/jsnell/ratas/blob/master/src/test/test_benchmark.cc&#039;&gt;little benchmark&lt;/a&gt; that creates a configurable number of events with a mix
of different behaviors:

&lt;ul&gt;
&lt;li&gt; Timers with a short duration that get
  scheduled constantly and are almost always executed.
&lt;li&gt; Timers with a long duration that are scheduled
  once, and eventually get executed.
&lt;li&gt; Timers with a long duration that are constantly
  rescheduled to happen later, and thus never get executed.
&lt;li&gt; Timers with a medium duration that get scheduled rarely
  and get executed once every time they get rescheduled.
&lt;li&gt; Timers with a medium duration that get scheduled rarely,
  and are either rescheduled to happen earlier or canceled.
&lt;/ul&gt;

&lt;p&gt;
  Sets of these timers are grouped into units, with each unit
  consisting of a total of 10 separate timer events. The work units
  run at slightly different cycle lengths to make the access
  patterns vary a bit over time. There&#039;s also a long completely idle
  period at the end.

&lt;p&gt;
  The following results were generated by running the benchmark with
  various different amounts of &#039;work units&#039;. Each unit contains a
  total of 10 timers (but not all will be active at the same time), and
  during the life of a test will schedule about 70k events and end
  up executing about 23k of them. The test runtime measured in
  virtual timer ticks will always be constant, so increasing the
  number of work units will increase the average wheel occupancy
  rather than cause more to be done in serial.

&lt;img src=&#039;/blog/stc/images/ratas-bench.png&#039;&gt;&lt;/img&gt;

&lt;p&gt;
  This is a log-log graph, i.e. both the axes are logarithmic.
  Despite all the important operations being constant time in theory,
  performance degrades non-linearly and really hits a wall around
  256k work units, where doubling the workload increases runtime
  by a factor of 5. It&#039;s almost certainly a cache issue.
  The i7-2640M I ran the tests on has a 4MB cache, so for the 128k
  work unit case it&#039;s already a struggle to keep even the events on
  the core wheel in L3. So the suboptimal scaling is probably to be
  expected.
&lt;/p&gt;

&lt;p&gt;
 In terms of absolute performance numbers, the 32k work
 unit workload will do ~120M scheduling operations plus ~40M
 event executions per second. And on top of that the minor work that
 the application does in the event handlers. That seems decent for
 one core of a 5 year old laptop CPU.
&lt;/p&gt;

&lt;p&gt;
  No comparative benchmarks, sorry. As I mentioned before, it was
  hard to find anything to benchmark against. My benchmark
  program appears to trigger some kind of a performance bug in the
  only fully featured standalone timer queue I found, which made
  it perform two orders of magnitude slower than it realistically
  should have. The primitive operations
  are much faster than that. I&#039;m sure the problem will get fixed,
  but right now it doesn&#039;t make for a very useful comparison.
&lt;/p&gt;

&lt;h3&gt;Features that didn&#039;t make it&lt;/h3&gt;

&lt;p&gt;
There is intentionally no support for repeating timers. I think that
kind of thing should be done by the timer explicitly rescheduling
itself.

&lt;p&gt; The timer events have a vtable due to the virtual
&lt;code&gt;execute()&lt;/code&gt; method. Originally everything was parametrized
at compile time by the callback type, so all events in a single wheel
could use the same &lt;code&gt;execute()&lt;/code&gt;. But that really did not
work with the &lt;code&gt;MemberTimerEvent&lt;/code&gt;, where the callback is a
combination of a member function specified statically and an object
specified dynamically. I wasn&#039;t willing to give up that feature, so
the vtable was a lesser evil.

&lt;p&gt;
But it might be neat to allow parametrizing &lt;code&gt;TimerWheel&lt;/code&gt;
with a specific &lt;code&gt;TimerEvent&lt;/code&gt;. If you need a heterogeneous
wheel, instantiate the template with the current &lt;code&gt;
TimerEventInterface&lt;/code&gt;. If you can live with a homogeneous wheel,
instantiate with some other implementation that has a non-virtual
&lt;code&gt;execute()&lt;/code&gt;. I didn&#039;t do it since it exceeded my personal
template tolerance, and since homogeneous wheels don&#039;t feel like a
very compelling use case anyway.


&lt;h3&gt;Footnotes&lt;/h3&gt;

&lt;div class=&#039;footnotes&#039;&gt;
&lt;p&gt;
&lt;a id=&#039;fn0&#039;&gt;[&lt;a href=&#039;#fnref0&#039;&gt;0&lt;/a&gt;] I use the term &amp;quot;timer wheel&amp;quot; instead of the original &amp;quot;timing
wheel&amp;quot;. The former is what I&#039;d always seen them called
until seeing this paper title.

&lt;p&gt;
&lt;a id=&#039;fn1&#039;&gt;[&lt;a href=&#039;#fnref1&#039;&gt;1&lt;/a&gt;] I wrote earlier about why packet processing applications
  might want a &lt;a href=&#039;https://www.snellman.net/blog/archive/2015-10-01-flow-disruptor/#timers&#039;&gt;different kind of event loop&lt;/a&gt; than typical server applications, and at another time about &lt;a href=&#039;https://www.snellman.net/blog/archive/2015-07-09-unit-testing-a-tcp-stack/#time&#039;&gt;why deterministic control of time is important for testability&lt;/a&gt;.

&lt;p&gt;
&lt;a id=&#039;fn2&#039;&gt;[&lt;a href=&#039;#fnref2&#039;&gt;2&lt;/a&gt;] Interestingly this is the opposite of what operating system kernels want to do. They&#039;d prefer to batch as many timers together as possible. The difference here is that I&#039;m thinking of a single-threaded non-locking application, while modern operating systems are all about concurrency.

&lt;p&gt;
&lt;a id=&#039;fn3&#039;&gt;[&lt;a href=&#039;#fnref3&#039;&gt;3&lt;/a&gt;] Think of a timer that deallocates
resources after a period of idleness. These might be re-scheduled after
every single operation on the resources.

&lt;p&gt;
&lt;a id=&#039;fn4&#039;&gt;[&lt;a href=&#039;#fnref4&#039;&gt;4&lt;/a&gt;] Or maybe not so small
  a cost; at least I found it tricky to maintain the bitmap without
  adding an extra back pointer in each event structure
  or each slot, either of which would be bad news. You could do
  it without that backpointer by requiring that timers
  are canceled via the timer wheel, but then you can&#039;t have the
  timers be automatically canceled on destruction. That would be totally
  unacceptable.

&lt;p&gt;
&lt;a id=&#039;fn5&#039;&gt;[&lt;a href=&#039;#fnref5&#039;&gt;5&lt;/a&gt;] For example Linux already
  applies a percentage-based timer slack, though for the purpose
  of trying to batch as many timer executions together as possible.
  The LWN article on a &lt;a href=&#039;https://lwn.net/Articles/646950/&#039;&gt;
  proposed replacement to the Linux timer wheel&lt;/a&gt; is a good read.

&lt;/div&gt;
</description><author>jsnell@iki.fi</author><category>GENERAL</category><pubDate>Thu, 28 Jul 2016 01:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2016-07-27-ratas-hierarchical-timer-wheel/</guid></item><item><title>json-to-multicsv - Convert hierarchical JSON to multiple CSV files</title><link>https://www.snellman.net/blog/archive/2016-01-12-json-to-multicsv/</link><description>
      &lt;h2&gt;Introduction&lt;/h2&gt;

      &lt;p&gt;
        &lt;a href=&#039;http://github.com/jsnell/json-to-multicsv&#039;&gt;json-to-multicsv&lt;/a&gt; is a little program to convert a JSON file to one or more CSV files in a way that preserves the hierarchical structure of nested objects and lists. It&#039;s the kind of dime a dozen data munging tool that&#039;s too trivial to talk about, but I&#039;ll write a bit anyway for a couple of reasons.

      &lt;p&gt;
        The first one is that I spent an hour looking for an existing tool that did this and didn&#039;t find one. Lots of converters to other formats, all of which seem to assume the JSON is effectively going to be a list of records, but none that supported arbitrary nesting. Did I just somehow manage to miss all the good ones? Or is this truly something that nobody has ever needed to do?

      &lt;p&gt;
        Second, this is as good an excuse as any to start talking a bit about some patterns in how command line programs get told what to do (I&#039;d use the word &amp;quot;configured&amp;quot;, except that&#039;s not quite right).
      &lt;/p&gt;

&lt;read-more&gt;&lt;/read-more&gt;

      &lt;h2&gt;What and why?&lt;/h2&gt;

      &lt;p&gt;
        I needed to produce some data for someone else to analyze, but
        the statistics package they were using could not import JSON
        files with any non-trivial structure. Instead the data needed
        to be provided as multiple CSV files that can be joined
        together by the appropriate columns.

      &lt;p&gt;
        As a simplified example, instead of this:

&lt;pre&gt;
{
  &amp;quot;item 1&amp;quot;: {
    &amp;quot;title&amp;quot;: &amp;quot;The First Item&amp;quot;,
    &amp;quot;genres&amp;quot;: [&amp;quot;sci-fi&amp;quot;, &amp;quot;adventure&amp;quot;],
    &amp;quot;rating&amp;quot;: {
      &amp;quot;mean&amp;quot;: 9.5,
      &amp;quot;votes&amp;quot;: 190
     }
  },
  &amp;quot;item 2&amp;quot;: {
    &amp;quot;title&amp;quot;: &amp;quot;The Second Item&amp;quot;,
    &amp;quot;genres&amp;quot;: [&amp;quot;history&amp;quot;, &amp;quot;economics&amp;quot;],
    &amp;quot;rating&amp;quot;: {
      &amp;quot;mean&amp;quot;: 7.4,
      &amp;quot;votes&amp;quot;: 865
   },
   &amp;quot;sales&amp;quot;: [
     { &amp;quot;count&amp;quot;: 76, &amp;quot;country&amp;quot;: &amp;quot;us&amp;quot; },
     { &amp;quot;count&amp;quot;: 13, &amp;quot;country&amp;quot;: &amp;quot;de&amp;quot; },
     { &amp;quot;count&amp;quot;: 4, &amp;quot;country&amp;quot;: &amp;quot;fi&amp;quot; }
   ]
  }
}
&lt;/pre&gt;

&lt;p&gt;
My &amp;quot;customer&amp;quot; needed this:

&lt;p&gt;
&lt;b&gt;item.csv&lt;/b&gt;
&lt;table&gt;
&lt;tr&gt;&lt;td style=&#039;background-color: #eaa&#039;&gt;item._key&lt;td&gt;item.rating.mean&lt;td&gt;item.rating.votes&lt;td&gt;item.title
&lt;tr&gt;&lt;td style=&#039;background-color: #eaa&#039;&gt;&amp;quot;item 1&amp;quot;&lt;td&gt;9.5&lt;td&gt;190&lt;td&gt;&amp;quot;The First Item&amp;quot;
&lt;tr&gt;&lt;td style=&#039;background-color: #eaa&#039;&gt;&amp;quot;item 2&amp;quot;&lt;td&gt;7.4&lt;td&gt;865&lt;td&gt;&amp;quot;The Second Item&amp;quot;
&lt;/table&gt;

&lt;p&gt;
&lt;b&gt;item.genres.csv&lt;/b&gt;
&lt;table&gt;
  &lt;tr&gt;&lt;td&gt;genres&lt;td style=&#039;background-color: #eaa&#039;&gt;item._key&lt;td&gt;item.genres._key
  &lt;tr&gt;&lt;td&gt;sci-fi&lt;td style=&#039;background-color: #eaa&#039;&gt;&amp;quot;item 1&amp;quot;&lt;td&gt;1
  &lt;tr&gt;&lt;td&gt;adventure&lt;td style=&#039;background-color: #eaa&#039;&gt;&amp;quot;item 1&amp;quot;&lt;td&gt;2
  &lt;tr&gt;&lt;td&gt;history&lt;td style=&#039;background-color: #eaa&#039;&gt;&amp;quot;item 2&amp;quot;&lt;td&gt;1
  &lt;tr&gt;&lt;td&gt;economics&lt;td style=&#039;background-color: #eaa&#039;&gt;&amp;quot;item 2&amp;quot;&lt;td&gt;2
&lt;/table&gt;

&lt;p&gt;
&lt;b&gt;item.sales.csv&lt;/b&gt;
&lt;table&gt;
&lt;tr&gt;&lt;td style=&#039;background-color: #eaa&#039;&gt;item._key&lt;td&gt;item.sales._key&lt;td&gt;sales.count&lt;td&gt;sales.country
&lt;tr&gt;&lt;td style=&#039;background-color: #eaa&#039;&gt;&amp;quot;item 2&amp;quot;&lt;td&gt;1&lt;td&gt;76&lt;td&gt;us
&lt;tr&gt;&lt;td style=&#039;background-color: #eaa&#039;&gt;&amp;quot;item 2&amp;quot;&lt;td&gt;2&lt;td&gt;13&lt;td&gt;de
&lt;tr&gt;&lt;td style=&#039;background-color: #eaa&#039;&gt;&amp;quot;item 2&amp;quot;&lt;td&gt;3&lt;td&gt;4&lt;td&gt;fi
&lt;/table&gt;

&lt;p&gt;
One way to do this would have been to just change the program I used
to produce the output. That would have been a bit annoying since the
CSV output codepath would have been basically completely separate from
the JSON one (which was basically just
a &lt;code&gt;JSON::encode_json&lt;/code&gt; on the natural data structure). It&#039;s
almost easier to just have a generic converter than one specific for
that one app (the documentation is as long as the program itself). The
only question is how to configure the generic mechanism for the
specific case.

&lt;h2&gt;How command line tools get run&lt;/h2&gt;

&lt;p&gt;
Could this &amp;quot;just work&amp;quot; out of the box with no settings at all?
Not really, there&#039;s
multiple ways of interpreting the data. A compound value could mean
either the addition of more columns (ratings in the example) or adding
rows to another CSV file (sales in the example). Consistently choosing
the first interpretation would not work at all, while in the latter
case you&#039;d get really awkward &lt;a href=&#039;https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80%93value_model&#039;&gt;entity-attribute-value&lt;/a&gt;-style output.

&lt;p&gt;
  Ok, so some configuration is needed. What kind of options do we have for
  doing that? Command line flags tend to be the simplest to start with,
  though they&#039;ll often eventually become complex either by developing
  ordering dependencies between flags (to express different semantics)
  or by the values developing some kind of complicated internal structure.

&lt;p&gt;
  Both of those actually happen for this tool. To run it, you need to
  pass in multiple --path command line options, each containing a pair
  of a pattern and the action to take for values whose path matches
  the pattern. (Just the first matching action is taken). For the above
  example those flags were:

&lt;pre&gt;
   --path /:table:item
   --path /*/rating:column
   --path /*/sales:table:sales
   --path /*/genres:table:genres
&lt;/pre&gt;

&lt;p&gt;
Scalar values have an automatic fallback handler that just outputs the
value as a column, but for compound data fields not finding a match is
an error. In these cases the error message will print out some
suggestions on what command line arguments could be added to resolve
the error, for example:

&lt;pre&gt;
Don&#039;t know how to handle object at /*/appendix/. Suggestions:
 --path /*/appendix/:table:name
 --path /*/appendix/:column
 --path /*/appendix/:row
 --path /*/appendix/:ignore
&lt;/pre&gt;

&lt;p&gt;
The next option would be feeding some kind of a schema file to the
tool, which would then be used to guide the process. For example if
the schema says that a type of object has a static set of fields,
those fields are probably columns. If it has an unknown set of keys,
it&#039;s probably more like tabular data.

&lt;p&gt;
The problem is that writing the schema would be a bit of a pain, and
it would be much harder for the conversion tool to guide the user
through an iterative process of getting the schema definition right.
One could maybe generate a schema file from the data file itself, and
edit any bits that the autodetection goes wrong. Schema generators do
exist, for example
&lt;a href=&#039;http://jsonschema.net/&#039;&gt;jsonschema.net&lt;/a&gt;, but at least that
one doesn&#039;t have enough knobs to tweak to even get this basic example
right. And the mistakes are such that fixing them would take a fair
bit of work. Reliable automated schema generation would make for some
pretty epic yak shaving in the context of this tiny tool.

&lt;p&gt;
Maybe if people really did write JSON schemas for everything it would
make sense to use that existing infrastructure. But I&#039;ve never seen
one of those in the wild, the spec is complicated, and JSON
schemas are not particularly well suited to this use case. (Really
you&#039;d want a custom schema format, but then it&#039;s completely guaranteed
that there&#039;s no pre-existing schema file to use).

&lt;p&gt;
And here&#039;s the thing... It&#039;s not just this specific case. It never
feels like any kind of declarative schema is the right solution. In a
couple of decades of writing data munging scripts I can remember just
a single case of basing the solution on an external description of the
data. And that single exception had several people working on the tool
full time. Sure, it&#039;s great to have a schema of some sort for your
data interchange or storage format, for use in validation, code
generation, automated generation of example data, or other things like
that. But for actually processing it? It&#039;s just an incredibly rare
pattern.

&lt;p&gt;
And finally, could this be a use case for a special purpose language?
If schemas feel like a rarity, little languages are the
opposite. Especially in classic Unix they are ubiquitous.

&lt;p&gt;
As a recovering programming language addict, I have to be deeply
suspicious every time a new language looks like the right solution. Is
it really? Or is this just an excuse to fall off the wagon again, and
implement a language? (Not a big language, man. Just a little one, to
take the edge off).

&lt;p&gt;
It&#039;s also clear that the general idea of a JSON processing language is
solid. Some already exist
(e.g. &lt;a href=&#039;https://stedolan.github.io/jq/&#039;&gt;jq&lt;/a&gt;), but there could
be room for multiple approaches. Writing sample programs to see what a
language for JSON processing and transformation might look like was a
fun way to spend a couple of hours on the boring &amp;quot;no internet&amp;quot; leg of
a train journey. (&amp;quot;It could have this awk-like structure of toplevel
pattern matching clauses, but on paths instead of rows of text, and
with a recursive main loop instead of a streaming one, and and
and...&amp;quot;).


&lt;p&gt;
If I kind of wanted to write this, the idea is good, and an initial
implementation is not an unreasonable amount of work, why not do it?
Well, even if a script written in this hypothetical language to
translate from hierarchical to tabular data would have been pretty
simple, it would still have been a program that the user of the tool
needs to write in a dodgy DSL. And since the language would have
been much more generic than a mere conversion tool, it would also
have been impossible to guide the user through a process of iteratively
building the right configuration (as is now done via the error messages).

&lt;p&gt;
In all likelihood it&#039;d mean that nobody else would ever use the tool for the
original purpose. The less powerful and less flexible version is just
going to be more useful purely due to simplicity.

&lt;p&gt;
So sanity prevailed this time. But tune in for the next post for an
earlier example of where my self control failed.

</description><author>jsnell@iki.fi</author><category>GENERAL</category><category>PERL</category><pubDate>Tue, 12 Jan 2016 14:30:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2016-01-12-json-to-multicsv/</guid></item><item><title>The most obsolete infrastructure money could buy - my worst job ever</title><link>https://www.snellman.net/blog/archive/2015-09-01-the-most-obsolete-infrastructure-money-could-buy/</link><description>
&lt;p&gt;
Today marks the 10th anniversary of the most bizarre, and possibly
the saddest, job I ever took.

&lt;p&gt;
The year was 2005. My interest in writing a content management
system in Java for the company that bought our startup had been
steadily draining away, while my real passion was working on
compilers and other programming language infrastructure
(mostly &lt;a href=&#039;http://www.sbcl.org/&#039;&gt;SBCL&lt;/a&gt;). One day I spotted a
job advert looking for compiler people, which was a rare occurrence in
that time and place. I breezed through the job interview, but did not
ask the right questions and ignored a couple of warning signs. Oops.

&lt;p&gt;
It turned out to be a bit of an adventure in retrocomputing.

&lt;read-more&gt;&lt;/read-more&gt;

&lt;h3&gt;The bizarre&lt;/h3&gt;

&lt;p&gt;
This was the former internal tools unit of a very large company, let&#039;s
call them X. For some reason X had split off the unit and sold
(given?) it to a moderately large consulting company, whom we shall
call Y. I was going to work at Y. The reason they needed compiler
people was that they were about to take over the maintenance of a C
compiler suite (compiler, linker, assembler, etc). Except I&#039;d
misunderstood them as taking over the maintenance from X. That wasn&#039;t
the case. Actually the compiler was from another very large company,
Z, who were discontinuing all support. So X bought the source code
from Z for very significant $$$, and needed somebody (Y) to actually
do something with it. In fact it wasn&#039;t even just one compiler suite
as I&#039;d initially understood, it was two. Woo, double the compilers to
play with!

&lt;p&gt;
I started in September, but some schedules had slipped and we wouldn&#039;t
actually have anything to work with for a month or two. So I had
plenty of time to acclimatize there. Which is good, because it&#039;s like
I&#039;d stepped into some strange parallel dimension where the 80s never
ended. You know, the kind of place where you need access to some old
documentation, and eventually find it&#039;s stored in an ingenious in-house
source control system built on top of &lt;a href=&#039;https://en.wikipedia.org/wiki/Revision_Control_System&#039;&gt;RCS&lt;/a&gt;.

&lt;p&gt;
For example, on my first day I found that X was running what was
supposedly the largest VAXcluster remaining in the world, for doing their
production builds. Yes, dozens of &lt;a href=&#039;https://en.wikipedia.org/wiki/VAX&#039;&gt;VAX&lt;/a&gt;en running &lt;a href=&#039;https://en.wikipedia.org/wiki/OpenVMS&#039;&gt;VMS&lt;/a&gt;, working as a
cross-compile farm, producing x86 code. You might wonder a
bit about the viability of the VAX as a computing platform in the year
2005. Especially for something as CPU-bound as compiling. But don&#039;t
worry, one of my new coworkers had as their current task evaluating
whether this should be migrated to VMS/Alpha or to VMS/VAX running
under a VAX emulator on x86-64! &lt;a href=&#039;#ftnt0&#039;
name=&#039;ftnt_ref0&#039;&gt;[0]&lt;/a&gt;

&lt;p&gt;
Why did this company need to maintain a specific C compiler anyway?
Well, they had their own ingenious in-house programming language that
you could think of as an imperative Erlang with a Pascal-like syntax
that was compiled to C source &lt;a href=&#039;#ftnt1&#039;
name=&#039;ftnt_ref1&#039;&gt;[1]&lt;/a&gt;. I have no real data on how much
code was written in that language, but it&#039;d have to be tens of
millions of lines at a minimum.


&lt;p&gt;
The result of compiling this C code would then be run on an ingenious
in-house operating system that was written in, IIRC, the late
80s. This operating system used the 386&#039;s segment registers to
implement multitasking and message passing. For this, they needed
a compiler with much more support for segment registers than normal.
Now, you might wonder about the wisdom of relying on segment registers
heavily in the year 2005. After all use of segment registers had been
getting slower and slower with every generation of CPUs, and in x86-64
the segmentation support was essentially removed. But don&#039;t worry,
there was a project underway to migrate all of this code to run on
Solaris instead &lt;a href=&#039;#ftnt2&#039; name=&#039;ftnt_ref2&#039;&gt;[2]&lt;/a&gt;.

&lt;p&gt;
After a couple of months of twiddling my thumbs and mostly reading
up on all this mysterious infrastructure, a huge package arrived
addressed to this compiler project. But... We were supposed to get
a source dump. Why does the package need two men to carry it? Did
somebody play a practical joke on us, and send the source as
printouts?

&lt;p&gt;
Why, it&#039;s the server that we&#039;ll use for compiling one of the compiler
suites once we get the source code! An Intel System/86 with a genuine
80286 CPU, running Intel &lt;a href=&#039;https://en.wikipedia.org/wiki/Xenix&#039;&gt;Xenix&lt;/a&gt; 286 3.5. The best way to interface with all
this computing power is over a 9600 bps serial port. Luckily the
previous owners were kind enough to pre-install &lt;a href=&#039;https://en.wikipedia.org/wiki/Kermit_(protocol)&#039;&gt;Kermit&lt;/a&gt; on the spacious
40MB hard drive of the machine, and I didn&#039;t need to track down a
floppy drive or a Xenix 286 Kermit or rz/sz binary. God, what primitive
pieces of crap that machine and OS were.

&lt;p&gt;
You might wonder about the wisdom of using a 15-20 year old machine as
the sole method of building a piece of software. It&#039;s dog slow and
obviously will break sooner or later. In fact I raised this very issue
and suggested maybe imaging the hard drive and getting everything
running virtualized. That idea was nixed: since the machine was old and
fragile, we couldn&#039;t risk poking around inside it. It&#039;d be really
hard to replace; when they went hunting for this machine from antique
computer specialists, they only found two remaining working
units &lt;a href=&#039;#ftnt3&#039; name=&#039;ftnt_ref3&#039;&gt;[3]&lt;/a&gt;.

&lt;p&gt;
This might be a good time to say that computationally speaking,
I was raised by the wolves on a
SunOS 4 server (which I ended up sysadmining for a few hundred users).
My personal email was still going over &lt;a href=&#039;https://en.wikipedia.org/wiki/UUCP&#039;&gt;UUCP&lt;/a&gt; in 2005.
The highlight of my previous weekend (in 2015, when I&#039;m writing this)
was finding what looks like a partial source repository for a Lisp
implementation written before I was born, and which appeared to have
been completely lost to time. It was on a copy of some old backup
tapes from an &lt;a
href=&#039;https://en.wikipedia.org/wiki/Incompatible_Timesharing_System&#039;&gt;ITS&lt;/a&gt;
server, and I don&#039;t even remember how or when those ended up on my hard drive.
Which is to say, I like old computer systems more than is reasonable.

&lt;p&gt;
But even by my standards this level of computational archeology was
going a bit too far. And the rabbit hole still had a little bit deeper
to go.

&lt;p&gt;
A couple of weeks later the source drop arrived. I&#039;ll talk about the
other compiler later, let&#039;s tackle this one that needed to be built
on a 286 first.

&lt;p&gt;
So it was written in &lt;a href=&#039;https://en.wikipedia.org/wiki/PL/M&#039;&gt;PL/M&lt;/a&gt;. (Wait, is that even a
thing? That&#039;s not a thing, right?). And it was last modified in the mid 80s. I&#039;d like to say the
build instructions were generated using a typewriter, but it could be
that my memory is playing tricks on me. Some of the components
didn&#039;t build cleanly, and required various Makefile tweaks with
excruciating round trip times for every test. Because, you know,
this is a 286.

&lt;p&gt;
The hard drive wasn&#039;t large enough for all of the components either,
so the process of rebuilding everything would be:

&lt;ul&gt;
  &lt;li&gt;Upload the linker source tarball over the 9600 bps serial connection from a Linux server acting as a frontend
  &lt;li&gt;Unpack it
  &lt;li&gt;Build
  &lt;li&gt;Download the linker binary back to safety
  &lt;li&gt;Remove the source and the build artifacts
  &lt;li&gt;Repeat the same for all five components of the system.
&lt;/ul&gt;

&lt;p&gt;
Just the data transfers for each component took an hour.  But after a
long time fighting with it I had a script that with a single keystroke
generated bit-identical binaries when compared to the ones that had
apparently been in use for almost the last 20 years.

&lt;p&gt;
I was pretty worried though: it&#039;d be really hard to actually make any
use of this source. There was no documentation except for the build
instructions, so we&#039;d need to reverse engineer everything. There wouldn&#039;t be
any training from company Z either; frankly it&#039;d be a miracle if
anyone who originally worked on the software was still with the company. Nobody
knew PL/M. The roundtrip time from making a change on the build
machine to having a binary on a machine capable of actually running
it was at least an hour. And we didn&#039;t have a source level debugger for
this, so that&#039;d mean an hour just to add a single debug &lt;code&gt;printf&lt;/code&gt;.
(Wait, not a debug &lt;code&gt;printf&lt;/code&gt; of course. A debug
whatever-it-is-that-PL/M-uses-for-io). It&#039;d be pure pain.

&lt;p&gt;
I expressed these concerns, and was told not to worry.
&lt;br&gt;- &amp;quot;Oh, we&#039;ll never want to make changes to this compiler, not
enough code is compiled with it these days for that to be worth it. The more
modern suite is the important one.&amp;quot;
&lt;br&gt;- &amp;quot;Wait? I just spent a month elbow deep in PL/M and
Xenix/286 over a 9600bps Kermit connection, and you&#039;re telling me
we&#039;re never going to actually use any of this?!&amp;quot;
&lt;br&gt;- &amp;quot;Right, we just needed to verify that we really got what we
bought.&amp;quot;

&lt;p&gt;
I didn&#039;t really know whether to be happy about not having to do any
more work on that crap, or angry about the waste of time.

&lt;h3&gt;The sad&lt;/h3&gt;

&lt;p&gt;
That concluded the bizarre retrocomputing part of the story. We
now get to the part with sad dysfunctional corporate politics. If
you&#039;re just reading this for the laughs, maybe just skip to the end.

&lt;p&gt;
The more modern compiler suite wasn&#039;t a spring chicken either. It
had to be compiled specifically on Visual Studio 6. There were again
no design docs, nor tests. The lack of tests was explained as being
due to third party IP concerns. The lack of documentation we never
got an answer for.

&lt;p&gt;
Unlike the truly ancient compilers, this one was easy to build. But
what could we possibly do with it? So I read through the compiler,
tried to understand what each file did, did some experiments and wrote
some notes.

&lt;p&gt;
We arranged a big meeting with senior engineers from all the relevant
departments of X. The agenda was to figure out what improvements
they&#039;d want in the compiler. It was pretty dispiriting. Half of
them seemed to think it&#039;d be better not to touch it at all, since we&#039;d
probably just break it. Even those who weren&#039;t completely opposed to
changes couldn&#039;t think of anything they really needed. Finally
someone took pity on me, and noted that the compiler isn&#039;t very smart
about scheduling segment register loads, and those were expensive
operations. Maybe that could be improved?

&lt;p&gt;
After the meeting one of the managers told me that it was really our job
to come up with projects that the customer wanted to buy, not the other
way around. And it usually couldn&#039;t just be a general project for minor
improvements, it&#039;d need clear and ideally measurable goals.
The projects would also need to be pretty large to justify all the
overhead. It should go without saying that this is an absolutely
insane way of doing platform development, but it&#039;s something that
follows directly from the incentives of the two parties.  How anyone
at X thought that anything good would come out of this, I don&#039;t know.

&lt;p&gt;
But never mind that. Our initial project for taking over the compiler
maintenance was well funded, and vague enough that it was easy to
argue that proving our capability to ship some kind of improvements was
a core deliverable. We could at least proceed with the only improvement
anyone had shown any interest in.

&lt;p&gt;
So I implemented a new peephole optimizer stage for the segment
registers, and even got the code reviewed by the original authors of
the compiler when they came over from Z to give us a training session.
It seemed to work, but as mentioned above we didn&#039;t have a test suite
and building one would take a long time and a lot of work. (Excellent!
We can propose that as a project later!).

&lt;p&gt;
We couldn&#039;t even run any of the production code since that would
require the ingenious in-house operating system. The only way to get
any performance numbers and confidence in the changes being correct
would be to schedule a load test in X&#039;s test lab. Unfortunately weeks
and weeks of discussion over that never got us both the lab time and
the people from their side who would have been needed. It&#039;s of course
understandable; whether these compiler changes got released or not
wouldn&#039;t make a difference to these people, who had their own actual
work to do. But it also made it very hard to see how we could ship
this change. The justification would be improved performance, but
with no numbers it&#039;d be a hollow claim.

&lt;p&gt;
That&#039;s when it dawned on me that there was never going to be any real
compiler work there. These special compilers would not really matter
once X migrated away from the custom OS, which would have to happen.
Oh, sure it&#039;d need to be &amp;quot;maintained&amp;quot; just in case a
customer running 20 year old code needed a bugfix.
Given the dysfunctional processes, it seemed pretty clear that the
costs for any improvements would be massive in the short active
development life these systems had remaining. They&#039;d probably spent a
seven figure sum on this project as insurance, but actually doing
something with the code? No way.

&lt;p&gt;
All of this infrastructure was just going to be on life support while
it was being replaced by new systems that would in turn be obsoleted
in five years. But nothing would ever actually go away, all this cruft
would just accumulate and accumulate, nominally supported for
ever. And you&#039;d have these extremely good engineers doing this
completely insane work, having been moved from working at a prestigious high
tech company to a despised consulting firm.

&lt;p&gt;
And how do you even get out of that job? I imagined myself in a job
interview in 2010, trying to explain how useful my extensive knowledge
of Xenix, PL/M build systems, and VMS would be to my prospective new
employer.  There might be a time when you just stop keeping up with
tech, but doing it in your 20s is really not that time :)

&lt;h3&gt;Coda&lt;/h3&gt;

&lt;p&gt;
So I quit without arranging for another job first, assuming that
something would probably turn up. In an amazing
display of serendipity, during my notice period ITA Software posted to
the SBCL mailing list that they wanted to pay somebody to work on SBCL
improvements for them, which was pretty much my dream gig at the time
&lt;a href=&#039;#ftnt4&#039; name=&#039;ftnt_ref4&#039;&gt;[4]&lt;/a&gt;. Perfect timing.

&lt;p&gt;
Ok, that&#039;s all. You can now proceed with the one-upping with stories
of developing new production software on a physical IBM 1401 in this
millennium, or something ;-)

&lt;h3&gt;Footnotes&lt;/h3&gt;

&lt;div class=&#039;footnotes&#039;&gt;

&lt;p&gt;
&lt;a href=&#039;#ftnt_ref0&#039; name=&#039;ftnt0&#039;&gt;[0]&lt;/a&gt;
I don&#039;t know the outcome of that evaluation.

&lt;p&gt;
&lt;a href=&#039;#ftnt_ref1&#039; name=&#039;ftnt1&#039;&gt;[1]&lt;/a&gt;
No, transpiling is not a word no matter how much you people
try to make it one.

&lt;p&gt;
&lt;a href=&#039;#ftnt_ref2&#039; name=&#039;ftnt2&#039;&gt;[2]&lt;/a&gt;
And that leads us to the question of whether Solaris is really
what you want to be migrating to in 2005.

&lt;p&gt;
&lt;a href=&#039;#ftnt_ref3&#039; name=&#039;ftnt3&#039;&gt;[3]&lt;/a&gt;
Wait?! There are two machines in the world that you think can be used to build
this software, and we only bought one of them? &amp;quot;Oh, yes. It would
have been a pretty good idea to buy the second one as a spare. Let&#039;s
do that now!&amp;quot;

&lt;p&gt;
&lt;a href=&#039;#ftnt_ref4&#039; name=&#039;ftnt4&#039;&gt;[4]&lt;/a&gt;
Of course if one worry with the job at Y was that I&#039;d be unemployable
due to only having worked on boring and obsolete technology, one might
wonder about the long term career prospects in Common Lisp compilers.
But look, in 2006 CL was really going places!
&lt;/div&gt;
</description><author>jsnell@iki.fi</author><category>GENERAL</category><category>HISTORY</category><pubDate>Tue, 01 Sep 2015 17:30:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2015-09-01-the-most-obsolete-infrastructure-money-could-buy/</guid></item><item><title>Updated zlib benchmarks</title><link>https://www.snellman.net/blog/archive/2015-06-05-updated-zlib-benchmarks/</link><description>

&lt;p&gt; Last year I wrote a &lt;a
href=&#039;https://github.com/jsnell/zlib-bench&#039;&gt;small benchmark&lt;/a&gt; suite
to benchmark the various zlib optimization forks that were floating
around. There&#039;s a couple of reasons to update those &lt;a
href=&#039;https://www.snellman.net/blog/archive/2014-08-04-comparison-of-intel-and-cloudflare-zlib-patches.html&#039;&gt;results&lt;/a&gt;. First,
there were major optimizations added to the &lt;a href=&#039;https://github.com/cloudflare/zlib&#039;&gt;Cloudflare fork&lt;/a&gt;. And
second, there&#039;s now a new entrant, &lt;a
href=&#039;https://github.com/Dead2/zlib-ng&#039;&gt;zlib-ng&lt;/a&gt; which merges in
the changes from both the Intel and Cloudflare versions but also drops
support for old architectures and cleans up the code in general.

&lt;p&gt;
I&#039;ll write a bit less commentary this time, so that the results will
be easier to update in the future without a new post. The big change
compared to the 2014-08 results is that the Cloudflare version is
now significantly faster particularly on high compression levels, but
there are smaller improvements on all compression levels. Except
for compression level 1, it seems like the preferable version now for
pure speed.

&lt;read-more&gt;&lt;/read-more&gt;

&lt;p&gt;
Zlib-ng showed a massive slowdown in decompression
speed compared to all other versions until compiled with &lt;code&gt;--zlib-compat&lt;/code&gt; (only relevant for minigzip, not necessary for general use of
the library),
and is much slower with
compression level 1 than the Intel version despite apparently using
the new &lt;code&gt;quick&lt;/code&gt; deflate strategy. On other levels it
closely shadows the Intel results.

&lt;p&gt;
Versions used:
&lt;table&gt;
    &lt;tr&gt;&lt;td&gt;baseline&lt;td&gt;50893291621658f355bc5b4d450a8d06a563053d
    &lt;tr&gt;&lt;td&gt;cloudflare&lt;td&gt;a80420c63532c25220a54ea0980667c02303460a
    &lt;tr&gt;&lt;td&gt;intel&lt;td&gt;e176b3c23ace88d5ded5b8f8371bbab6d7b02ba8
    &lt;tr&gt;&lt;td&gt;zlib-ng&lt;td&gt;4b1728a261e32e08bc5403f391ba65bfe5f4ba57
&lt;/table&gt;

&lt;p&gt;
Flags used:
&lt;table&gt;
  &lt;tr&gt;&lt;td&gt;All: &lt;td&gt;&lt;code&gt;CFLAGS=&#039;-msse4.2 -mpclmul -O3&#039;&lt;/code&gt;
  &lt;tr&gt;&lt;td&gt;zlib-ng: &lt;td&gt;&lt;code&gt;--zlib-compat&lt;/code&gt;
&lt;/table&gt;

&lt;h4&gt;Decompression&lt;/h4&gt;
&lt;table&gt;  &lt;tr&gt;    &lt;td&gt;&lt;td colspan=2&gt;baseline&lt;td colspan=2&gt;cloudflare&lt;td colspan=2&gt;intel&lt;td colspan=2&gt;zlib-ng  &lt;tr&gt;&lt;td colspan=4&gt;&lt;b&gt;decompress executable (50 iterations)&lt;/b&gt;  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex; width: 20ex;&#039;&gt;Execution time&lt;/td&gt;&lt;td&gt;1.32s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(100%)&lt;td&gt;1.10s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(83%)&lt;td&gt;1.30s&amp;plusmn;0.01&lt;td style=&#039;padding-right: 2ex&#039;&gt;(98%)&lt;td&gt;1.31s&amp;plusmn;0.01&lt;td style=&#039;padding-right: 2ex&#039;&gt;(99%)  &lt;tr&gt;&lt;td colspan=4&gt;&lt;b&gt;decompress html (50 iterations)&lt;/b&gt;  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex; width: 20ex;&#039;&gt;Execution time&lt;/td&gt;&lt;td&gt;0.76s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(100%)&lt;td&gt;0.65s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(85%)&lt;td&gt;0.75s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(98%)&lt;td&gt;0.76s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(100%)  &lt;tr&gt;&lt;td colspan=4&gt;&lt;b&gt;decompress jpeg (50 iterations)&lt;/b&gt;  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex; width: 20ex;&#039;&gt;Execution time&lt;/td&gt;&lt;td&gt;0.20s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(100%)&lt;td&gt;0.12s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(60%)&lt;td&gt;0.20s&amp;plusmn;0.01&lt;td style=&#039;padding-right: 2ex&#039;&gt;(101%)&lt;td&gt;0.20s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(100%)  &lt;tr&gt;&lt;td colspan=4&gt;&lt;b&gt;decompress pngpixels (50 iterations)&lt;/b&gt;  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex; width: 20ex;&#039;&gt;Execution time&lt;/td&gt;&lt;td&gt;0.87s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 
2ex&#039;&gt;(100%)&lt;td&gt;0.65s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(75%)&lt;td&gt;0.85s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(98%)&lt;td&gt;0.86s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(99%)&lt;/table&gt;

&lt;h4&gt;Compression level 1&lt;/h4&gt;
&lt;table&gt;  &lt;tr&gt;    &lt;td&gt;&lt;td colspan=2&gt;baseline&lt;td colspan=2&gt;cloudflare&lt;td colspan=2&gt;intel&lt;td colspan=2&gt;zlib-ng  &lt;tr&gt;&lt;td colspan=4&gt;&lt;b&gt;compress executable -1 (10 iterations)&lt;/b&gt;  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex;&#039;&gt;Compression ratio&lt;/td&gt;&lt;td colspan=2&gt;0.37&lt;td colspan=2&gt;0.37&lt;td colspan=2&gt;0.46&lt;td colspan=2&gt;0.46  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex; width: 20ex;&#039;&gt;Execution time&lt;/td&gt;&lt;td&gt;0.75s&amp;plusmn;0.01&lt;td style=&#039;padding-right: 2ex&#039;&gt;(100%)&lt;td&gt;0.52s&amp;plusmn;0.01&lt;td style=&#039;padding-right: 2ex&#039;&gt;(69%)&lt;td&gt;0.29s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(38%)&lt;td&gt;0.46s&amp;plusmn;0.01&lt;td style=&#039;padding-right: 2ex&#039;&gt;(61%)  &lt;tr&gt;&lt;td colspan=4&gt;&lt;b&gt;compress html -1 (10 iterations)&lt;/b&gt;  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex;&#039;&gt;Compression ratio&lt;/td&gt;&lt;td colspan=2&gt;0.39&lt;td colspan=2&gt;0.37&lt;td colspan=2&gt;0.54&lt;td colspan=2&gt;0.54  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex; width: 20ex;&#039;&gt;Execution time&lt;/td&gt;&lt;td&gt;0.38s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(100%)&lt;td&gt;0.27s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(71%)&lt;td&gt;0.19s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(49%)&lt;td&gt;0.28s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(73%)  &lt;tr&gt;&lt;td colspan=4&gt;&lt;b&gt;compress jpeg -1 (10 iterations)&lt;/b&gt;  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex;&#039;&gt;Compression ratio&lt;/td&gt;&lt;td colspan=2&gt;1.00&lt;td colspan=2&gt;1.00&lt;td colspan=2&gt;1.05&lt;td colspan=2&gt;1.05  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex; width: 20ex;&#039;&gt;Execution time&lt;/td&gt;&lt;td&gt;0.65s&amp;plusmn;0.01&lt;td style=&#039;padding-right: 
2ex&#039;&gt;(100%)&lt;td&gt;0.53s&amp;plusmn;0.01&lt;td style=&#039;padding-right: 2ex&#039;&gt;(81%)&lt;td&gt;0.24s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(36%)&lt;td&gt;0.40s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(61%)  &lt;tr&gt;&lt;td colspan=4&gt;&lt;b&gt;compress pngpixels -1 (10 iterations)&lt;/b&gt;  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex;&#039;&gt;Compression ratio&lt;/td&gt;&lt;td colspan=2&gt;0.17&lt;td colspan=2&gt;0.17&lt;td colspan=2&gt;0.23&lt;td colspan=2&gt;0.23  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex; width: 20ex;&#039;&gt;Execution time&lt;/td&gt;&lt;td&gt;0.44s&amp;plusmn;0.01&lt;td style=&#039;padding-right: 2ex&#039;&gt;(100%)&lt;td&gt;0.27s&amp;plusmn;0.01&lt;td style=&#039;padding-right: 2ex&#039;&gt;(60%)&lt;td&gt;0.18s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(40%)&lt;td&gt;0.26s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(57%)&lt;/table&gt;
&lt;h4&gt;Compression level 3&lt;/h4&gt;
&lt;table&gt;  &lt;tr&gt;    &lt;td&gt;&lt;td colspan=2&gt;baseline&lt;td colspan=2&gt;cloudflare&lt;td colspan=2&gt;intel&lt;td colspan=2&gt;zlib-ng  &lt;tr&gt;&lt;td colspan=4&gt;&lt;b&gt;compress executable -3 (10 iterations)&lt;/b&gt;  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex;&#039;&gt;Compression ratio&lt;/td&gt;&lt;td colspan=2&gt;0.35&lt;td colspan=2&gt;0.36&lt;td colspan=2&gt;0.36&lt;td colspan=2&gt;0.36  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex; width: 20ex;&#039;&gt;Execution time&lt;/td&gt;&lt;td&gt;1.10s&amp;plusmn;0.02&lt;td style=&#039;padding-right: 2ex&#039;&gt;(100%)&lt;td&gt;0.62s&amp;plusmn;0.01&lt;td style=&#039;padding-right: 2ex&#039;&gt;(56%)&lt;td&gt;0.73s&amp;plusmn;0.02&lt;td style=&#039;padding-right: 2ex&#039;&gt;(66%)&lt;td&gt;0.69s&amp;plusmn;0.01&lt;td style=&#039;padding-right: 2ex&#039;&gt;(63%)  &lt;tr&gt;&lt;td colspan=4&gt;&lt;b&gt;compress html -3 (10 iterations)&lt;/b&gt;  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex;&#039;&gt;Compression ratio&lt;/td&gt;&lt;td colspan=2&gt;0.36&lt;td colspan=2&gt;0.35&lt;td colspan=2&gt;0.35&lt;td colspan=2&gt;0.35  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex; width: 20ex;&#039;&gt;Execution time&lt;/td&gt;&lt;td&gt;0.61s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(100%)&lt;td&gt;0.37s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(59%)&lt;td&gt;0.43s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(69%)&lt;td&gt;0.41s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(66%)  &lt;tr&gt;&lt;td colspan=4&gt;&lt;b&gt;compress jpeg -3 (10 iterations)&lt;/b&gt;  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex;&#039;&gt;Compression ratio&lt;/td&gt;&lt;td colspan=2&gt;1.00&lt;td colspan=2&gt;1.00&lt;td colspan=2&gt;1.00&lt;td colspan=2&gt;1.00  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex; width: 20ex;&#039;&gt;Execution time&lt;/td&gt;&lt;td&gt;0.62s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 
2ex&#039;&gt;(100%)&lt;td&gt;0.51s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(82%)&lt;td&gt;0.55s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(88%)&lt;td&gt;0.56s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(90%)  &lt;tr&gt;&lt;td colspan=4&gt;&lt;b&gt;compress pngpixels -3 (10 iterations)&lt;/b&gt;  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex;&#039;&gt;Compression ratio&lt;/td&gt;&lt;td colspan=2&gt;0.15&lt;td colspan=2&gt;0.15&lt;td colspan=2&gt;0.16&lt;td colspan=2&gt;0.16  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex; width: 20ex;&#039;&gt;Execution time&lt;/td&gt;&lt;td&gt;0.85s&amp;plusmn;0.01&lt;td style=&#039;padding-right: 2ex&#039;&gt;(100%)&lt;td&gt;0.44s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(51%)&lt;td&gt;0.46s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(54%)&lt;td&gt;0.44s&amp;plusmn;0.01&lt;td style=&#039;padding-right: 2ex&#039;&gt;(51%)&lt;/table&gt;
&lt;h4&gt;Compression level 5&lt;/h4&gt;
&lt;table&gt;  &lt;tr&gt;    &lt;td&gt;&lt;td colspan=2&gt;baseline&lt;td colspan=2&gt;cloudflare&lt;td colspan=2&gt;intel&lt;td colspan=2&gt;zlib-ng  &lt;tr&gt;&lt;td colspan=4&gt;&lt;b&gt;compress executable -5 (10 iterations)&lt;/b&gt;  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex;&#039;&gt;Compression ratio&lt;/td&gt;&lt;td colspan=2&gt;0.33&lt;td colspan=2&gt;0.34&lt;td colspan=2&gt;0.34&lt;td colspan=2&gt;0.34  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex; width: 20ex;&#039;&gt;Execution time&lt;/td&gt;&lt;td&gt;1.61s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(100%)&lt;td&gt;0.93s&amp;plusmn;0.01&lt;td style=&#039;padding-right: 2ex&#039;&gt;(57%)&lt;td&gt;0.93s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(57%)&lt;td&gt;0.91s&amp;plusmn;0.01&lt;td style=&#039;padding-right: 2ex&#039;&gt;(56%)  &lt;tr&gt;&lt;td colspan=4&gt;&lt;b&gt;compress html -5 (10 iterations)&lt;/b&gt;  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex;&#039;&gt;Compression ratio&lt;/td&gt;&lt;td colspan=2&gt;0.34&lt;td colspan=2&gt;0.33&lt;td colspan=2&gt;0.33&lt;td colspan=2&gt;0.33  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex; width: 20ex;&#039;&gt;Execution time&lt;/td&gt;&lt;td&gt;0.99s&amp;plusmn;0.01&lt;td style=&#039;padding-right: 2ex&#039;&gt;(100%)&lt;td&gt;0.57s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(57%)&lt;td&gt;0.53s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(53%)&lt;td&gt;0.52s&amp;plusmn;0.01&lt;td style=&#039;padding-right: 2ex&#039;&gt;(52%)  &lt;tr&gt;&lt;td colspan=4&gt;&lt;b&gt;compress jpeg -5 (10 iterations)&lt;/b&gt;  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex;&#039;&gt;Compression ratio&lt;/td&gt;&lt;td colspan=2&gt;1.00&lt;td colspan=2&gt;1.00&lt;td colspan=2&gt;1.00&lt;td colspan=2&gt;1.00  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex; width: 20ex;&#039;&gt;Execution time&lt;/td&gt;&lt;td&gt;0.64s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 
2ex&#039;&gt;(100%)&lt;td&gt;0.53s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(83%)&lt;td&gt;0.74s&amp;plusmn;0.01&lt;td style=&#039;padding-right: 2ex&#039;&gt;(116%)&lt;td&gt;0.74s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(115%)  &lt;tr&gt;&lt;td colspan=4&gt;&lt;b&gt;compress pngpixels -5 (10 iterations)&lt;/b&gt;  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex;&#039;&gt;Compression ratio&lt;/td&gt;&lt;td colspan=2&gt;0.14&lt;td colspan=2&gt;0.14&lt;td colspan=2&gt;0.14&lt;td colspan=2&gt;0.14  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex; width: 20ex;&#039;&gt;Execution time&lt;/td&gt;&lt;td&gt;1.23s&amp;plusmn;0.01&lt;td style=&#039;padding-right: 2ex&#039;&gt;(100%)&lt;td&gt;0.61s&amp;plusmn;0.01&lt;td style=&#039;padding-right: 2ex&#039;&gt;(49%)&lt;td&gt;0.61s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(49%)&lt;td&gt;0.59s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(47%)&lt;/table&gt;
&lt;h4&gt;Compression level 9&lt;/h4&gt;
&lt;table&gt;  &lt;tr&gt;    &lt;td&gt;&lt;td colspan=2&gt;baseline&lt;td colspan=2&gt;cloudflare&lt;td colspan=2&gt;intel&lt;td colspan=2&gt;zlib-ng  &lt;tr&gt;&lt;td colspan=4&gt;&lt;b&gt;compress executable -9 (10 iterations)&lt;/b&gt;  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex;&#039;&gt;Compression ratio&lt;/td&gt;&lt;td colspan=2&gt;0.33&lt;td colspan=2&gt;0.33&lt;td colspan=2&gt;0.33&lt;td colspan=2&gt;0.33  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex; width: 20ex;&#039;&gt;Execution time&lt;/td&gt;&lt;td&gt;9.55s&amp;plusmn;0.01&lt;td style=&#039;padding-right: 2ex&#039;&gt;(100%)&lt;td&gt;4.07s&amp;plusmn;0.01&lt;td style=&#039;padding-right: 2ex&#039;&gt;(42%)&lt;td&gt;7.53s&amp;plusmn;0.01&lt;td style=&#039;padding-right: 2ex&#039;&gt;(78%)&lt;td&gt;7.34s&amp;plusmn;0.01&lt;td style=&#039;padding-right: 2ex&#039;&gt;(76%)  &lt;tr&gt;&lt;td colspan=4&gt;&lt;b&gt;compress html -9 (10 iterations)&lt;/b&gt;  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex;&#039;&gt;Compression ratio&lt;/td&gt;&lt;td colspan=2&gt;0.33&lt;td colspan=2&gt;0.33&lt;td colspan=2&gt;0.33&lt;td colspan=2&gt;0.33  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex; width: 20ex;&#039;&gt;Execution time&lt;/td&gt;&lt;td&gt;2.81s&amp;plusmn;0.01&lt;td style=&#039;padding-right: 2ex&#039;&gt;(100%)&lt;td&gt;1.64s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(58%)&lt;td&gt;2.54s&amp;plusmn;0.01&lt;td style=&#039;padding-right: 2ex&#039;&gt;(90%)&lt;td&gt;2.48s&amp;plusmn;0.02&lt;td style=&#039;padding-right: 2ex&#039;&gt;(88%)  &lt;tr&gt;&lt;td colspan=4&gt;&lt;b&gt;compress jpeg -9 (10 iterations)&lt;/b&gt;  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex;&#039;&gt;Compression ratio&lt;/td&gt;&lt;td colspan=2&gt;1.00&lt;td colspan=2&gt;1.00&lt;td colspan=2&gt;1.00&lt;td colspan=2&gt;1.00  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex; width: 20ex;&#039;&gt;Execution time&lt;/td&gt;&lt;td&gt;0.64s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 
2ex&#039;&gt;(100%)&lt;td&gt;0.53s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(82%)&lt;td&gt;0.58s&amp;plusmn;0.01&lt;td style=&#039;padding-right: 2ex&#039;&gt;(90%)&lt;td&gt;0.59s&amp;plusmn;0.00&lt;td style=&#039;padding-right: 2ex&#039;&gt;(93%)  &lt;tr&gt;&lt;td colspan=4&gt;&lt;b&gt;compress pngpixels -9 (10 iterations)&lt;/b&gt;  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex;&#039;&gt;Compression ratio&lt;/td&gt;&lt;td colspan=2&gt;0.12&lt;td colspan=2&gt;0.12&lt;td colspan=2&gt;0.12&lt;td colspan=2&gt;0.12  &lt;tr&gt;    &lt;td style=&#039;padding-left: 2ex; width: 20ex;&#039;&gt;Execution time&lt;/td&gt;&lt;td&gt;26.58s&amp;plusmn;0.05&lt;td style=&#039;padding-right: 2ex&#039;&gt;(100%)&lt;td&gt;14.24s&amp;plusmn;0.02&lt;td style=&#039;padding-right: 2ex&#039;&gt;(53%)&lt;td&gt;21.43s&amp;plusmn;0.02&lt;td style=&#039;padding-right: 2ex&#039;&gt;(80%)&lt;td&gt;19.40s&amp;plusmn;0.03&lt;td style=&#039;padding-right: 2ex&#039;&gt;(72%)&lt;/table&gt;

</description><author>jsnell@iki.fi</author><category>GENERAL</category><pubDate>Fri, 05 Jun 2015 20:30:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2015-06-05-updated-zlib-benchmarks/</guid></item><item><title>&amp;quot;It&#039;s like an OkCupid for voting&amp;quot; - the Finnish election engines</title><link>https://www.snellman.net/blog/archive/2015-05-11-okcupid-for-voting-the-finnish-election-engines/</link><description>
&lt;p&gt;
Have I ever told you about the time I built an &amp;quot;OkCupid
for elections&amp;quot; for the communists?

&lt;p&gt;
No? That&#039;s strange, I tend to get good mileage out of that story
during election season. Unfortunately for the story to make any sense,
you&#039;ll need a bit of &lt;i&gt;absolutely fascinating&lt;/i&gt; background
information on how elections work in Finland, and especially how
websites that tell people whom to vote for became an integral part of
it.

&lt;read-more&gt;&lt;/read-more&gt;

&lt;h3&gt;The Finnish electoral system&lt;/h3&gt;

&lt;p&gt;
Like many other countries, Finland has a representative electoral
system. The country is divided into 13 districts, and each district elects
some number (average of ~15) of representatives into the
parliament. The number of representatives in each district is
determined based on population, while the distribution of the seats in
each district is based on the proportion of the vote that each party
got there. The proportional distribution of seats is done using
the &lt;a href=&#039;http://en.wikipedia.org/wiki/D%27Hondt_method&#039;&gt;D&#039;Hondt
method&lt;/a&gt;.
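
&lt;p&gt;
For readers who have not run into D&#039;Hondt before, here is a minimal Python sketch of the general highest-averages idea. The function name, the party labels and the vote counts are all made up for illustration, and this is not meant as a description of the official Finnish counting procedure (ties, for instance, are simply broken by dictionary order here).

```python
def dhondt(votes, seats):
    """Allocate seats to parties with the D'Hondt highest-averages rule."""
    allocation = {party: 0 for party in votes}
    for _ in range(seats):
        # The quotient of a party is votes / (seats won so far + 1);
        # the party with the largest quotient takes the next seat.
        winner = max(votes, key=lambda p: votes[p] / (allocation[p] + 1))
        allocation[winner] += 1
    return allocation


# Hypothetical vote counts for a single district with 8 seats.
print(dhondt({"A": 100000, "B": 80000, "C": 30000, "D": 20000}, 8))
```

&lt;p&gt;
With these invented numbers the split comes out as 4, 3, 1 and 0 seats, so the smallest party gets nothing despite having more than a twelfth of the vote.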

&lt;p&gt;
That&#039;s pretty standard fare at least for continental Europe.
The odd bit is in how the distribution of
seats is done within each party. There is no predetermined
&lt;a href=&#039;http://en.wikipedia.org/wiki/Party-list_proportional_representation&#039;&gt;party
list&lt;/a&gt; at all, the ordering of candidates within a party is determined
purely by votes. And what&#039;s more, it&#039;s not even possible to give a generic
vote to a party as a whole; each vote is equally a statement on which party
you&#039;d like to win the seats, and on which specific person in that party
you&#039;d like to get one of the seats that the party wins.

&lt;p&gt;
This is fundamentally different from systems where voting for a person
rather than just the list requires extra effort. It has a couple of
sweet advantages, but also a bunch of disadvantages.

&lt;p&gt;
The advantages of the system I can think of are:

&lt;ul&gt;
&lt;li&gt;
  &lt;p&gt;
  It decreases the power of party elites by removing from them the
  decision on how to order a list, and instead giving it to the people
  who voted for the party.

&lt;li&gt;&lt;p&gt;
It encourages demographic diversity in the candidate pool.
  It makes a lot of tactical sense for every party to have candidates of
  all ages, of both genders, of different educational backgrounds, and
  different ethnic groups. Some people won&#039;t
  vote purely based on issues, but for example specifically want to
  vote for some group they have an affinity for, or feel is
  under-represented. Each party will want to have a good spread of
  candidates to make sure that they don&#039;t lose any voters due to a
  lack of a compatible candidate.

  &lt;p&gt;
    This is of course pure identity politics, but somehow it feels a
    lot less objectionable when the effect is progressive. It&#039;s worth
    noting that while maximizing the demographic diversity of
    candidates makes sense for each party, the same might not apply to
    ideology / opinions. Candidates that don&#039;t fit into any of the
    wings of a party will just make it unclear what the party really
    stands for.
&lt;/ul&gt;

&lt;p&gt;
The downsides of the system come mostly from overloading a single vote
to mean two things. You vote for a candidate, and the party gets a
vote as a side effect.

&lt;ul&gt;
&lt;li&gt;
  &lt;p&gt;
  The meaning of a vote is obfuscated. If your ideal candidate is not
  in your ideal party, what&#039;s the right action? Should you try to
  ensure that they get elected at the cost of helping the wrong party,
  or should you vote for a suboptimal candidate in the right party?

  &lt;p&gt;I&#039;m pretty sure that the rational choice there is to vote for the
  party first. The party distribution of the parliament has more
  impact than the exact distribution of people. And unless you&#039;re
  pretty certain that your preferred candidate is &amp;quot;on the
  bubble&amp;quot; within the party, it&#039;s more likely that a single vote
  will swing a seat between two parties than it is to swing a seat
  between two individuals in the same party.

  &lt;p&gt;If the more important choice is the party, why is the candidate
    at the forefront of the decision?
&lt;li&gt;
  &lt;p&gt;
  Since some people will ignore the maxim to vote for the party first
  and person second, it makes sense for parties to run candidates with
  good name recognition in an attempt to hoover in some of these voters.

  &lt;p&gt;
   So the candidate list will have TV celebrities, beauty contest
  winners, Olympic medalists and so on. And occasionally they&#039;ll even
  get elected bumping off &amp;quot;serious&amp;quot; politicians from the
  list. (Please note that I don&#039;t want to imply that the celebrities
  would automatically be incompetent at that job, that&#039;s not true at
  all. But their actual or perceived level of competence is not what
  gets them elected, it&#039;s their fame.)

  &lt;p&gt;
    I&#039;m pretty sure that there&#039;s even some level of &amp;quot;no publicity
    is bad publicity&amp;quot; in the system as a result. The most obvious
    recent example is the most notorious corrupt politician in the
    country (guilty of soliciting for bribes as a minister, and I
    think the only person expelled from the parliament during my
    lifetime) getting re-elected once again. That&#039;s just the
    kind of cronyism you&#039;d hope to get rid of by eliminating the party
    list, but it clearly doesn&#039;t always work.

&lt;li&gt;
  &lt;p&gt;
    For a candidate who is serious about getting elected the main
    competition are the other people in that party in the same
    district. If as a candidate you don&#039;t have any natural name
    recognition, you have to get it by advertising. This creates a
    situation where even a nominally party-dominated system can still
    have an uncomfortably large level of either personal fundraising, or
    election campaigns being self-funded by the rich.

  &lt;p&gt;
    The problem with campaign fundraising is of course corruption, or the
    perception of corruption. I give 10k &amp;euro; to your personal
    election campaign, you after getting elected help in solving some
    nasty zoning issues that my company has. The last large scale
    election funding scandal happened only 8 years ago, and implicated
    a disturbing number of high profile politicians.
    And the problem with funding a large campaign from your personal
    wealth is that it ensures that the rich have a much
    higher chance of getting elected than the poor.

  &lt;p&gt;
    The issue with competition within the party being potentially more
    important than the competition between parties can also have other
    odd consequences. Like the bizarre case of two candidates
    from opposing (and not particularly compatible) parties doing a
    joint TV advertising campaign. They were in different districts,
    so neither was personally hurt by increasing the success chances
    of the other one. But they were directly undermining their own
    party in the other district.
&lt;li&gt;
  &lt;p&gt;
  Even if you&#039;re trying to be a &amp;quot;good&amp;quot; voter and make your
  decision based on something other than fame or budget, how are you
  actually going to decide which of the 200 candidates in the district
  or even the 30 candidates in one party to vote for? Nobody has the
  time to do deep research on all of them.
&lt;/ul&gt;

&lt;p&gt;
You might notice that a lot of these issues boil down to what&#039;s
essentially a discovery problem. If you could somehow solve that and
make people aware of candidates that they might want to vote for based
on some criteria other than name recognition, surely this would be an
absolutely awesome election system?

&lt;h3&gt;The election engines&lt;/h3&gt;

&lt;p&gt;
The solution to all these discovery problems that Finland arrived at in
the mid-90s was the &lt;i&gt;vaalikone&lt;/i&gt;, which literally translates to
&amp;quot;the election machine&amp;quot;, or a bit less literally to &amp;quot;the
election engine&amp;quot;. The first one was made by the national
broadcasting corporation YLE, in later elections other news outlets
added their own versions, followed by the political parties and
various kinds of trade / lobbying organizations. There&#039;s probably
at least a couple of dozen active ones for any given election.

&lt;p&gt;
The core concept is simple. The site has a bunch of multiple choice
questions. 20 questions is a typical amount. Some questions
deal with the hot political topics of the day, some with stale /
evergreen topics (why yes, let&#039;s ask people about NATO once again),
others with cultural values, and yet others with economic ones. A
month or two before the elections, the candidates answer the
questions, possibly tell how important they consider that issue, and
they might even write a bit of text explaining their
answer.

&lt;p&gt;
During the campaign the voters will use the site to answer the same
questions, and get a match percentage with all the candidates in the
voter&#039;s district. For the really picky users, some of the modern
implementations even allow you to restrict the results e.g. only to
candidates in a given age range. You can now perhaps see why I use the
quote in the title of this post when explaining the idea ;-)

&lt;p&gt;
And do these sites matter? Apparently over half the voters use
one or more of them. In the 20-30 year old age group it&#039;s up to three
quarters. And the results aren&#039;t just ignored. One study says 40% of
the users had the results from the site affect their voting behavior
in some way. Another says that a sixth outright voted for the top
rated candidate. These numbers are not insignificant.

&lt;p&gt;
(Note: I am aware that similar sites exist in other countries. But
my impression is that they are nowhere near as central to the process.
If I&#039;m wrong about it, I&#039;d love to hear about counterexamples.)

&lt;h3&gt;The problem with the concept&lt;/h3&gt;

&lt;p&gt;
So the issues have been solved, right? Use the recommendations from
the website to narrow the candidate pool down to a handful, do deeper
research on those candidates, and select the best one. Informed
democracy has been saved!

&lt;p&gt;
Unfortunately no. While the concept is simple, I don&#039;t know if it
really should be used for anything more than entertainment.

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;
  The most critical problem is that the algorithm, the question
  selection, the question phrasing, and even the reporting UI have a
  huge effect on the results. It&#039;s no different from e.g. the way polling
  results could be distorted by these kinds of methodological
  things. But unlike polls, these tools are directly intended to
  affect voting behavior.

&lt;p&gt;
  One party or candidate can easily be the best match on one site and
  a horrible match on another. Someone from the Green party can be
  branded as the most right-wing MP in the country based on the
  results. Filtering the results to just show extremely left-wing
  politicians might show up several people from the conservative party
  due to (presumably) some kind of uninitialized data. These are
  not hypothetical examples, but events that actually happened.

&lt;p&gt;
  As an example on the power of the UI, a site that specially
  highlights candidates from the best matching party (no matter how
  small the margin is to the 2nd best match) is going to give a
  dramatically different view from a site that has a single sorted
  list of candidates.

&lt;p&gt;
  Reliance on these kinds of tools puts quite a lot of unauditable
  power into very few hands, and feels distinctly anti-democratic even
  if you assume that no malice is involved. Some people might argue
  that a similar if not greater amount of power is placed in the hands
  of e.g. the people moderating televised election debates.  The
  difference is that at the end of a debate nobody gives you
  a personalized recommendation on what your opinion should be. The
  viewer would need to apply at least some thought to the process,
  rather than trust a number with too many significant digits that was
  spit out by an ostensibly neutral website.

&lt;li&gt;
  &lt;p&gt;
    The results are often flaky and inconsistent. False positives are
    fine; you shouldn&#039;t be just blindly voting for the single top
    candidate with no thought at all (even if some people are). But
    that strategy doesn&#039;t work with false negatives.

&lt;li&gt;
  &lt;p&gt;
    While a voter has an incentive to answer the questions honestly,
    the candidates do not. Nobody will get raked over the coals if
    their views happen to evolve between answering, and actually
    having to vote on an issue, except for a handful of the most
    contentious issues. Answering a multiple choice question simply
    does not appear to count as an election promise. (Note: This year
    one of the sites phrased a few of their questions in the form of
    explicit yes / no votes on some potential bills; that seemed like
    a clever idea).

  &lt;p&gt;
    It is an interesting question how exactly an unscrupulous
    candidate should &amp;quot;optimize&amp;quot; their answers to maximize
    their chances of getting elected, or if they could do it at all.
    Most likely the problem is completely intractable without insider
    knowledge of the algorithms and the answers of other candidates.
    If you assume that can&#039;t happen, the worst you have to deal with
    is just plain political dishonesty and trying to avoid unnecessary
    controversy.

&lt;li&gt;
&lt;p&gt;
  The whole idea of trying to compute a match between a voter and
  a party based on the candidates of that party is fundamentally
  flawed. &lt;b&gt;The candidates are not the party&lt;/b&gt;. The party wide
  match is computed based on the aggregate matches of the candidates
  with the voter. But the full pool of candidates is irrelevant. What
  actually matters is the pool of people who get elected, and that is
  obviously not yet known at this time!

&lt;p&gt;
  A cynic might say that even that&#039;s not important, and
  what &lt;i&gt;really&lt;/i&gt; matters are the opinions of the party leader and
  their coterie, but if you go down that road you might as well just
  vote randomly.

&lt;/ul&gt;

&lt;p&gt;
  All this is perhaps better illustrated by a real world example.

&lt;p&gt;
  This year one of the major election engine sites produced a
  72-77% match for me with six of the parties, and significantly lower
  ratings for the remaining two &amp;quot;serious&amp;quot; parties. When I
  redid the test a few weeks later, the results for those parties were
  70-79% in roughly the same order; so the margin of error just from
  the uncertainty in my answers was at least a couple of percentage
  points.

&lt;p&gt;
  Now, this list of six top matches included the conservatives, the
  xenophobes, the greens, the social democrats, the agrarian
  centrists, and the language-centric Swedish people&#039;s party. All of
  them were rated as basically equally good within the margin of
  error, despite some of them being polar opposites. It&#039;s just absurd
  that all these parties spanning most of the ideological spectrum
  could be equally acceptable
  to me. I have no idea of what algorithm was used, but it was
  certainly a good way to give the impression that my vote doesn&#039;t
  matter since every party is the same.

&lt;p&gt;
  And the UI issue I mentioned earlier is relevant here too. One of
  those six almost equally good matches was promoted an order of
  magnitude more than the others on the result page, based on a
  difference of just a couple of percentage points. And as it happens,
  that party was reported as an abysmal 50% match on another major
  site.

&lt;p&gt;
  Anecdotally, this isn&#039;t just a useless idea. It&#039;s dangerously
  deceptive at least as currently implemented.

&lt;h3&gt;How the sausage was once made&lt;/h3&gt;

&lt;p&gt;
Oh yes, I promised to tell a story!

&lt;p&gt;
The year is 2003. Our first startup had been recently acquihired for
what was effectively peanuts compared to our big dotcom bubble
dreams. But hey, anything beats not making payroll next month.

&lt;p&gt;
One of our customers was the 4th largest party in the country, created
in the early 90s from the merger of two struggling communist parties. At
that time it was expected for each party to have an election engine on
their website, with answers just from their candidates. Yes, a
party-specific election engine is a pretty stupid idea. Doesn&#039;t
matter, since everyone else will have one too.  And someone had sold
them one for way too cheap, maybe as a sweetener for a website
redesign done a bit earlier.

&lt;p&gt;
Especially given how rough web development was in 2003 compared to
now, the amount of time budgeted was pitiful. I don&#039;t remember the
totals, but when broken down it&#039;d be like half a day for a CRUD app to
gather answers from the candidates, a couple of hours to make some
kind of ranking algorithm, a couple of hours for design, and so on.

&lt;p&gt;
As was often the case with these underbudget one-off projects, it was
given to me and I rolled a quick flat-file Perl special for the data
collection and polling. But I had no good ideas on the ranking
algorithm. So I explained the problem in a mangled way to a
statistician friend, probably stopped listening to the answer too
soon, and ended up with the impression that a normal (Pearson)
correlation between the answers of a candidate and voter would be
appropriate; just map the answers from strongly disagree / disagree /
don&#039;t care / agree / strongly agree to a 1-5 linear scale, and normalize
the -1 to 1 results to a 0 - 100 range instead.
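
&lt;p&gt;
As a rough reconstruction of that scheme (in Python rather than the original Perl, which I am not quoting here), the whole thing fits in a few lines. The names SCALE, pearson and match_percent are placeholders picked for this sketch, and the sample answers are invented.

```python
from math import sqrt

# Assumed mapping of the five answer options to a linear 1-5 scale.
SCALE = {
    "strongly disagree": 1,
    "disagree": 2,
    "don't care": 3,
    "agree": 4,
    "strongly agree": 5,
}


def pearson(xs, ys):
    """Plain Pearson correlation of two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def match_percent(voter, candidate):
    """Map the correlation from the -1..1 range to a 0..100 match score."""
    r = pearson([SCALE[a] for a in voter], [SCALE[a] for a in candidate])
    return (r + 1) / 2 * 100


# Invented answers to four questions.
voter = ["agree", "strongly agree", "disagree", "don't care"]
mirror = ["disagree", "strongly disagree", "agree", "don't care"]
print(match_percent(voter, voter))   # 100.0
print(match_percent(voter, mirror))  # 0.0
```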

&lt;p&gt;
In reality it wasn&#039;t appropriate for a number of reasons. The answers
were ordinal and non-linear. The results would be undefined if someone
answered every question the same way. Due to the linear transformation
property of the correlation computation, someone who answered every
question with various degrees of &amp;quot;disagree&amp;quot; could be a
pretty decent match with someone who answered everything
with various degrees of
&amp;quot;agree&amp;quot;. And it&#039;d be susceptible to overweighting
certain issues if the set of questions wasn&#039;t constructed very
carefully, but included many questions that correlated closely with
each other. (IIRC the questions were supplied by the party HQ, so the
chances of them having been carefully constructed to be non-correlating
are pretty damn low).
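
&lt;p&gt;
Two of those failure modes are easy to reproduce. The sketch below uses a plain Pearson correlation on answers already mapped to the 1-5 scale; the function names and the answer lists are invented for the demonstration.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation of two equally long answer lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)  # ZeroDivisionError if a variance is 0


def match_percent(voter, candidate):
    return (pearson(voter, candidate) + 1) / 2 * 100


# Answers on the 1-5 scale (1 = strongly disagree, 5 = strongly agree).
disagreer = [1, 2, 2, 1]  # disagrees with everything, to varying degrees
agreer = [4, 5, 5, 4]     # agrees with everything, to varying degrees
print(match_percent(disagreer, agreer))  # 100.0

indifferent = [3, 3, 3, 3]  # same answer to every question
try:
    match_percent(indifferent, agreer)
except ZeroDivisionError:
    print("undefined: no variance in the answers")
```

&lt;p&gt;
The first pair never agrees on a single question, yet scores a perfect match, because correlation only looks at how the answers move relative to each person&#039;s own average.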

&lt;p&gt;
But never mind that. I was young, stupid, ignorant of statistics, and
just happy that I had something working so quickly. Time to test the
system! So I took it out for a spin in the Helsinki voting
district. Top match 45%. Try another district. And another, and
another. The results were horrible everywhere. But at least it worked
superficially, so I sent an email to my coworkers and asked them to kick
the tires a bit.

&lt;p&gt;
Everyone in the same office, my old startup friends, reported similar
results.

&lt;p&gt;
We can&#039;t possibly ship something like this! They&#039;re going to cancel
their contract if we give them an election engine that gives at best
50% matches and just loses them votes. So we started whiteboarding
ways to nudge the numbers in the right direction without being too
obvious. Maybe normalize to a 0 to 1 range first, take a square root,
and then normalize to the 0 to 100 range? So 50% becomes around
70%. Is that too aggressive?
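
&lt;p&gt;
To put numbers on that whiteboard idea (which, to be clear, was only ever an idea), here is the square root nudge as a small Python sketch. The function name and the sample percentages are made up.

```python
from math import sqrt


def nudged(percent):
    """Scale 0..100 down to 0..1, take the square root, scale back up."""
    return sqrt(percent / 100) * 100


for raw in (25, 45, 50, 75, 100):
    print(raw, round(nudged(raw), 1))
```

&lt;p&gt;
A raw 50% comes out at about 70.7%, while 0% and 100% stay put, so the low and middle scores get inflated the most.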

&lt;p&gt;
But in the middle of this brainstorming, before we did any changes,
the project manager walked to our office. &amp;quot;Guys, you need to
change the site. The match percentages are way too high. This will be
really embarrassing, it looks like the numbers have been
cooked.&amp;quot;

&lt;p&gt;
Turns out that even if we weren&#039;t exactly Ayn Rand-quoting
liberalists, entrepreneurial 20-somethings weren&#039;t quite the
right people to use as test users for the ex-communist party website,
while the middle aged mother of three in the office next door had a
slightly different viewpoint. After we stopped laughing about the
situation, we concluded that the algorithm must be about right, and
shipped it. (There might have been a couple of kludges added
on top, like inserting a non-integer dummy vote into each answer set,
to guarantee some variance so that the results would at least always
be defined).
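
&lt;p&gt;
I no longer have the code, so the following Python sketch is only a guess at what such a kludge could look like. The dummy value 3.5 and all the names are assumptions made for this example; the only thing taken from the description above is that a non-integer answer gets added to every answer set.

```python
from math import sqrt

# Hypothetical dummy answer; any value no real answer can take would do.
DUMMY = 3.5


def pearson(xs, ys):
    """Plain Pearson correlation of two equally long answer lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def match_percent(voter, candidate):
    """Match score with a dummy answer appended to both answer sets."""
    xs = list(voter) + [DUMMY]
    ys = list(candidate) + [DUMMY]
    return (pearson(xs, ys) + 1) / 2 * 100


# A voter who answers 3 to everything no longer causes a division by zero.
print(round(match_percent([3, 3, 3, 3], [1, 2, 4, 5]), 1))
```

&lt;p&gt;
Since real answers are integers from 1 to 5, the extra value guarantees that no answer set is constant. It also quietly counts as one question on which everybody agrees, which is part of why it deserves to be called a kludge.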

&lt;p&gt;
I guess it must have worked ok, since we never heard any complaints.
And it being a party-specific election engine, the odds are that it
did not materially affect the voting behavior of anyone let alone the
final election results. It was only a few years later that I properly
understood what a shoddy system we&#039;d created, and how irresponsible
building it was.

&lt;p&gt;
Even if I might not really like the results I get from the major
election engine sites, they must surely be doing a better job than
this. But it&#039;s still hard to shake off the feeling that automating
away a basic human right is the wrong solution to the problem.

</description><author>jsnell@iki.fi</author><category>GENERAL</category><pubDate>Mon, 11 May 2015 17:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2015-05-11-okcupid-for-voting-the-finnish-election-engines/</guid></item></channel></rss>