<rss version='2.0'><channel><title>Juho Snellman's Weblog</title><link>https://www.snellman.net/blog/</link><description>Lisp, Perl Golf</description><item><title>LLMs are cheap</title><link>https://www.snellman.net/blog/archive/2025-06-02-llms-are-cheap/</link><description>
&lt;p&gt;This post is making a point - generative AI is relatively cheap -
that might seem so obvious it doesn&#039;t need making. I&#039;m mostly writing it because I&#039;ve
repeatedly had the same discussion in the past six months where people claim the opposite.
Not only is the misconception still around, but it&#039;s not even getting less frequent. This
is mainly written to have a document I can point people at, the next time it repeats.&lt;/p&gt;

&lt;p&gt;It seems to be a common, if not a majority, belief that Large
Language Models (in the colloquial sense of &amp;quot;things that are like
ChatGPT&amp;quot;) are very expensive to operate. This then leads to a ton
of innumerate analyses about how AI companies must be obviously
doomed, as well as a myopic view on how consumer AI businesses can/will
be monetized.&lt;/p&gt;
&lt;p&gt;It&#039;s an understandable mistake, since inference was indeed very
expensive at the start of the AI boom, and those costs were talked about
a lot. But inference has gotten cheaper even faster
than models have gotten better, and nobody has an intuition for
something becoming 1000x cheaper in two years. It just doesn&#039;t happen.
It doesn&#039;t help that the common pricing model (&amp;quot;$ per million tokens&amp;quot;)
is very hard to visualize.&lt;/p&gt;
&lt;p&gt;So let&#039;s compare LLMs to web search. I&#039;m choosing search as the
comparison since it&#039;s in the same vicinity and since it&#039;s
something everyone uses and nobody pays for, not because I&#039;m suggesting that
ungrounded generative AI is a good substitute for search.&lt;/p&gt;

&lt;read-more&gt;&lt;/read-more&gt;

&lt;p&gt;(It should also go without saying that these are just my personal
opinions.)&lt;/p&gt;
&lt;h3 id=&quot;what-does-a-web-search-cost&quot;&gt;What is the price of a web search?&lt;/h3&gt;
&lt;p&gt;Here&#039;s the public API pricing for some companies operating their own
web search infrastructure, retrieved on 2025-05-02:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;a
href=&quot;https://ai.google.dev/gemini-api/docs/pricing&quot;&gt;Gemini API pricing&lt;/a&gt;
lists a &amp;quot;Grounding with Google Search&amp;quot; feature at $35/1k queries. I
believe that&#039;s the best number we can get for Google, since they don&#039;t publish
prices for a &quot;raw&quot; search result API.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;a
href=&quot;https://www.microsoft.com/en-us/bing/apis/pricing&quot;&gt;Bing Search API&lt;/a&gt;
is priced at $15/1k queries at the cheapest tier.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a
href=&quot;https://brave.com/search/api/&quot;&gt;Brave&lt;/a&gt;
has a price of $5/1k searches at the cheapest tier. Though there&#039;s something very
strange about their pricing structure, with the unit pricing increasing as
the quota increases, which is the opposite of what you&#039;d expect. The
tier with real quota is priced at $9/1k searches.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So there&#039;s a range of prices, but not a horribly wide one, and with the
engines you&#039;d expect to be of higher quality also having higher prices.&lt;/p&gt;
&lt;h3 id=&quot;what-does-equivalent-llm-usage-cost&quot;&gt;What is the price of LLMs in a similar domain?&lt;/h3&gt;
&lt;p&gt;To make a reasonable comparison between those search prices and LLM
prices, we need two numbers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How many tokens are output per query?&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;What&#039;s the price per token?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I picked a few arbitrary queries from my search history, and phrased
them as questions, and ran them on Gemini 2.5 Flash (thinking mode off)
in AI Studio:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;[When was the term LLM first used?] -&amp;gt; 361 tokens, 2.5
seconds&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;[What are the top javascript game engines?] -&amp;gt; 1145 tokens, 7.6
seconds&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;[What are the typical carry-on bag size limits in europe?] -&amp;gt; 506
tokens, 3.4 seconds&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;[List the 10 largest power outages in history] -&amp;gt; 583 tokens, 3.7
seconds&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Note that I&#039;m not judging the quality of the answers here. The
purpose is just to get rough numbers for how large typical responses
are. A 500-1000 token range seems like a reasonable estimate.&lt;/p&gt;
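&lt;p&gt;As a rough check on that range, here&#039;s a small Python sketch that summarizes the four measurements above. The variable names are made up for illustration; the numbers are the ones from the list.&lt;/p&gt;

```python
# (query, output tokens, seconds) as measured in the list above.
samples = [
    ("When was the term LLM first used?", 361, 2.5),
    ("What are the top javascript game engines?", 1145, 7.6),
    ("What are the typical carry-on bag size limits in europe?", 506, 3.4),
    ("List the 10 largest power outages in history", 583, 3.7),
]

total_tokens = sum(tokens for _, tokens, _ in samples)
total_seconds = sum(seconds for _, _, seconds in samples)

# Average response size and overall generation speed across the samples.
mean_tokens = total_tokens / len(samples)
tokens_per_second = total_tokens / total_seconds

print(f"mean response size: {mean_tokens:.0f} tokens")
print(f"overall throughput: {tokens_per_second:.0f} tokens/s")
```

&lt;p&gt;With only four samples this is an illustration rather than a measurement, but the mean lands inside the 500-1000 token range.&lt;/p&gt;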
&lt;p&gt;What&#039;s the price of a token? The pricing is sometimes different for input
and output tokens. Input tokens tend to be cheaper, and our inputs are
very short compared to the outputs, so for simplicity let&#039;s consider all
the tokens to be outputs. Here&#039;s the pricing of some relevant models,
retrieved on 2025-05-02:&lt;/p&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col style=&quot;width: 50%&quot; /&gt;
&lt;col style=&quot;width: 50%&quot; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&quot;header&quot;&gt;
&lt;th style=&quot;text-align: left;&quot;&gt;Model&lt;/th&gt;
&lt;th style=&quot;text-align: left;&quot;&gt;Price / 1M tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&quot;odd&quot;&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;Gemma 3 27B&lt;/td&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;$0.20 (&lt;a
href=&quot;https://openrouter.ai/google/gemma-3-27b-it&quot;&gt;source&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;even&quot;&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;Qwen3 30B A3B&lt;/td&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;$0.30 (&lt;a
href=&quot;https://openrouter.ai/qwen/qwen3-30b-a3b&quot;&gt;source&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;odd&quot;&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;Gemini 2.0 Flash&lt;/td&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;$0.40 (&lt;a
href=&quot;https://ai.google.dev/gemini-api/docs/pricing&quot;&gt;source&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;even&quot;&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;GPT-4.1 nano&lt;/td&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;$0.40 (&lt;a
href=&quot;https://openai.com/api/pricing/&quot;&gt;source&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;odd&quot;&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;Gemini 2.5 Flash Preview&lt;/td&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;$0.60 (&lt;a
href=&quot;https://ai.google.dev/gemini-api/docs/pricing&quot;&gt;source&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;even&quot;&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;Deepseek V3&lt;/td&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;$1.10 (&lt;a
href=&quot;https://api-docs.deepseek.com/quick_start/pricing&quot;&gt;source&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;odd&quot;&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;GPT-4.1 mini&lt;/td&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;$1.60 (&lt;a
href=&quot;https://openai.com/api/pricing/&quot;&gt;source&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;even&quot;&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;Deepseek R1&lt;/td&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;$2.19 (&lt;a
href=&quot;https://api-docs.deepseek.com/quick_start/pricing&quot;&gt;source&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;odd&quot;&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;Claude 3.5 Haiku&lt;/td&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;$4.00 (&lt;a
href=&quot;https://www.anthropic.com/pricing#anthropic-api&quot;&gt;source&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;even&quot;&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;GPT-4.1&lt;/td&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;$8.00 (&lt;a
href=&quot;https://openai.com/api/pricing/&quot;&gt;source&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;odd&quot;&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;Gemini 2.5 Pro Preview&lt;/td&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;$10.00 (&lt;a
href=&quot;https://ai.google.dev/gemini-api/docs/pricing&quot;&gt;source&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;even&quot;&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;Claude 3.7 Sonnet&lt;/td&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;$15.00 (&lt;a
href=&quot;https://www.anthropic.com/pricing#anthropic-api&quot;&gt;source&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;odd&quot;&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;o3&lt;/td&gt;
&lt;td style=&quot;text-align: left;&quot;&gt;$40.00 (&lt;a
href=&quot;https://openai.com/api/pricing/&quot;&gt;source&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;If we assume the average query uses 1k tokens, these prices would be
directly comparable to the prices per 1k search queries. That&#039;s convenient.&lt;/p&gt;
&lt;p&gt;The low end of that spectrum is at least an order of magnitude
cheaper than even the cheapest search API, and even the models at the
low end are pretty capable. The high end is about on par with the
highest end of search pricing. Comparing a pair that&#039;s midrange on
quality, Bing Search vs. Gemini 2.5 Flash, the LLM comes out at
1/25th the price.&lt;/p&gt;
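&lt;p&gt;The arithmetic behind that comparison can be sketched in a few lines of Python. The function and variable names are hypothetical, the prices are copied from the lists and table above, and the 1k tokens per query figure is the same assumption as in the text.&lt;/p&gt;

```python
TOKENS_PER_QUERY = 1_000  # assumption from the text: about 1k tokens per answer

# USD per 1M tokens, from the table above.
llm_price_per_million_tokens = {
    "Gemma 3 27B": 0.20,
    "Gemini 2.5 Flash Preview": 0.60,
    "Claude 3.7 Sonnet": 15.00,
    "o3": 40.00,
}

# USD per 1k queries, from the search API list above.
search_price_per_thousand_queries = {
    "Brave": 5.00,
    "Bing": 15.00,
    "Google (Gemini grounding)": 35.00,
}


def llm_cost_per_thousand_queries(price_per_million_tokens, tokens_per_query=TOKENS_PER_QUERY):
    """USD cost of answering 1k queries at the given token price."""
    tokens_needed = 1_000 * tokens_per_query
    return price_per_million_tokens * tokens_needed / 1_000_000


def price_ratio(search_name, model_name, tokens_per_query=TOKENS_PER_QUERY):
    """How many times more expensive the search API is than the LLM."""
    llm_cost = llm_cost_per_thousand_queries(
        llm_price_per_million_tokens[model_name], tokens_per_query
    )
    return search_price_per_thousand_queries[search_name] / llm_cost


print(price_ratio("Bing", "Gemini 2.5 Flash Preview"))
print(price_ratio("Brave", "Gemma 3 27B"))
```

&lt;p&gt;Changing the tokens-per-query argument is the quickest way to see how sensitive the ratio is to the response length assumption.&lt;/p&gt;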
&lt;p&gt;Note that many of the above models have cheaper pricing in exchange
for more flexible scheduling (Anthropic, Google and OpenAI give a 50%
discount for batch requests, Deepseek is 50%-75% cheaper during off-peak
hours). I&#039;ve not included those cheaper options in the table to keep
things comparable, but the presence of those cheaper tiers is worth
keeping in mind when thinking about the next section...&lt;/p&gt;
&lt;h3 id=&quot;objection&quot;&gt;Objection!&lt;/h3&gt;
&lt;p&gt;I know some people are going to have objections to this
back-of-the-envelope calculation, and a lot of them will be totally
legit concerns. I&#039;ll try to address some of them preemptively. Slightly
different assumptions can easily lead to clawing back 10% here and 50%
there. But I don&#039;t see how to bridge a 25x gap just for breaking even,
let alone making the AI significantly more expensive. If you want to play
around with different assumptions, there&#039;s a little calculator widget
below.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Surely the typical LLM response is longer than that&lt;/strong&gt;
- I already picked the upper end of what the (very light) testing
suggested as a reasonable range for the type of question that I&#039;d use
web search for. There&#039;s a lot of use cases where the inputs and outputs
are going to be much longer (e.g. coding), but then you&#039;d need to also
switch the comparison to something in that same domain as well.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The LLM API prices must be subsidized to grab market
share -- i.e. the prices might be low, but the costs are high&lt;/strong&gt; -
I don&#039;t think they are, for a few reasons. I&#039;d instead assume
APIs are typically profitable on a unit basis. I have not found any
credible analysis suggesting otherwise.&lt;/p&gt;
&lt;p&gt;First, there&#039;s not that much motive to gain API market share
with unsustainably cheap prices. Any gains would be temporary, since
there&#039;s no long-term lock-in, and better models are released weekly.
Data from paid API queries will also typically not be used for training
or tuning the models, so getting access to more data wouldn&#039;t explain
it. Note that it&#039;s not just that you&#039;d be losing money on each of
these queries for no benefit, you&#039;re losing the compute that could
be spent on training, research, or more useful types of inference.&lt;/p&gt;
&lt;p&gt;Second, some of those models have been released with open weights and
API access is also available from third-party providers who would have
no motive to subsidize inference. (Or the number in the table isn&#039;t even
first party hosting -- I sure can&#039;t figure out what the Vertex AI
pricing for Gemma 3 is). The pricing of those third-party hosted APIs
appears competitive with first-party hosted APIs. For an example, see the
&lt;a href=https://artificialanalysis.ai/models/deepseek-r1/providers&gt;Artificial Analysis
summary on Deepseek R1 hosting&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Third, Deepseek released &lt;a
href=&quot;https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md&quot;&gt;actual
numbers&lt;/a&gt; on their inference efficiency in February. Those numbers
suggest that their normal R1 API pricing has about 80% margins
when considering the GPU costs, though not any other serving costs.
&lt;/p&gt;
&lt;p&gt;Fourth, there are a bunch of first-principles analyses of what the cost structure
of models with various architectures should be. Those are of course mathematical
models, but those costs line up pretty well with the observed end-user
pricing of models whose architecture is known. See the references section for
links.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The search API prices amortize building and updating the
search index, LLM inference is based on just the cost of
inference&lt;/strong&gt; - This seems pretty likely to be true, actually? But
the effect can&#039;t really be &lt;em&gt;that&lt;/em&gt; large for a popular model:
e.g. the allegedly leaked OpenAI financials claimed $2B/year spent on
inference vs. $3B/year on training. Given the crazy growth of
inference volumes (e.g. Google recently claimed a &lt;a href=https://blog.google/technology/ai/io-2025-keynote/&gt;
50x increase in token volumes in the last year&lt;/a&gt;) the training costs
are getting amortized much more effectively.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The search API prices must have higher margins than LLM
inference&lt;/strong&gt; - It&#039;s possible. I certainly don&#039;t know what the
margins of any Search API providers are, though it seems fair to assume
they&#039;re pretty robust. But, well, see the point above about Deepseek&#039;s
released numbers on the R1 profit margins.&lt;/p&gt;
&lt;p&gt;Also, it seems quite plausible that some Search providers would
accept lower margins, since at least Microsoft execs have testified
under oath that they&#039;d be willing to pay more for the iOS query stream
than their revenue, just to get more usage data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Web search returns results 20x-100x faster than an LLM
finishes the query, how could it be more expensive?&lt;/strong&gt; - Search
latency can be improved by parallelizing the problem, while LLM
inference is (for now) serial in nature. The task of predicting a single token can
be parallelized, but you can&#039;t predict all the output tokens at
once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But OpenAI made a loss, and they don&#039;t expect to make profit for years!&lt;/strong&gt; -
That&#039;s because a huge proportion of their usage is not monetized at all,
despite the usage pattern being ideal for it. OpenAI reportedly made a
loss of $5B in 2024. They also reportedly have 500M MAUs. To reach break-even,
they&#039;d just need to monetize (e.g. with ads) those free users for an average of $10/year,
or a bit under $1/month. A $1 ARPU for a service like this would be pitifully low.
&lt;/p&gt;
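&lt;p&gt;For reference, the break-even arithmetic as a Python sketch. It uses only the two reported figures quoted above and, as a simplification, treats every monthly active user as a free user. The variable names are mine.&lt;/p&gt;

```python
# Reported figures quoted in the text; treat both as rough estimates.
annual_loss_usd = 5_000_000_000      # reported 2024 loss
monthly_active_users = 500_000_000   # reported MAUs

# Extra revenue needed per user to cover the loss.
breakeven_per_user_per_year = annual_loss_usd / monthly_active_users
breakeven_per_user_per_month = breakeven_per_user_per_year / 12

print(f"per year:  ${breakeven_per_user_per_year:.2f}")
print(f"per month: ${breakeven_per_user_per_month:.2f}")
```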

&lt;p&gt;If the reported numbers are true, OpenAI doesn&#039;t actually have
high costs for a consumer service that popular, which is what you&#039;d
expect to see if the high cost of inference was the problem. They just
have a very low per-user revenue, by choice.&lt;/p&gt;

&lt;h3 id=&quot;widget&quot;&gt;Sensitivity analysis&lt;/h3&gt;

&lt;p&gt;
  If you want to play around with different assumptions, here&#039;s
  a calculator:
&lt;/p&gt;

&lt;a href=/blog/stc/files/llm-cost-widget.html target=_blank&gt;Open in new tab&lt;/a&gt;
&lt;iframe id=&quot;widget&quot; src=&quot;/blog/stc/files/llm-cost-widget.html&quot; width=&quot;100%&quot; scrolling=&quot;no&quot; style=&quot;border-width: 0&quot;
        onload=&quot;this.style.height = this.contentWindow.document.body.scrollHeight + &#039;px&#039;&quot;&gt;
&lt;/iframe&gt;

&lt;h3 id=&quot;why-does-this-matter&quot;&gt;Why does this matter?&lt;/h3&gt;
&lt;p&gt;I mean, you&#039;re right to ask that. Nothing really matters and
eventually we&#039;ll all be dead.&lt;/p&gt;
&lt;p&gt;But it is interesting how many people have built their mental
model for the near future on a premise that was true for only a brief
moment. Some things that will come as a surprise to them even assuming
all progress stops right now:&lt;/p&gt;
&lt;p&gt;There&#039;s an argument advanced by some people about how low
prices mean it&#039;ll be impossible for AI companies to ever recoup
model training costs. The thinking seems to be that it&#039;s just
the prices that have been going down, but not the costs, and
the low prices must be an unprofitable race to the bottom for
what little demand there is. What&#039;s happening and will continue
to happen instead is that as costs go down, the prices go down too,
and demand increases as new uses become viable. For an example,
look at the &lt;a href=https://openrouter.ai/rankings&gt;OpenRouter
API traffic volumes&lt;/a&gt;, both in aggregate and in the relative
share of cheaper models.
&lt;/p&gt;
&lt;p&gt;This post was mainly about APIs, but consumer usage will have
exactly the same cost structure, just a different monetization
structure. And given how low the unit costs must be, advertising isn&#039;t merely
viable but lucrative.&lt;/p&gt;
&lt;p&gt;From this it follows that the financials of frontier AI labs are a
lot better than some innumerate pundits would have you believe. They&#039;re
making a loss because they&#039;re not under pressure to be profitable, and
aren&#039;t actively trying to monetize consumer traffic yet. This
could well be a land grab unlike APIs, since unpaid consumer queries
may be used for training while paid API queries typically are not.
Even the subscription pricing might be there
mainly for demand management rather than trying to run a profit.&lt;/p&gt;
&lt;p&gt;The real cost problem isn&#039;t going to be with the LLMs themselves,
it&#039;s with all the backend services that AI agents will want to access
if even a rudimentary form of the agentic vision actually materializes.
Running the AI is already cheap, will keep getting cheaper, and will always
have a monetization model of some sort since it&#039;s what the end user is
interacting with. Neither of those is true for the end-user services
that have been turned into AI backends without their consent. An AI
trying to, I don&#039;t know, book concert tickets whenever a band I like
plays in my town will probably be phenomenally expensive to its
third-party backends (e.g. scraping ticket sites). Those sites will be
uncompensated for the expense while also losing their actual revenue
streams.&lt;/p&gt;
&lt;p&gt;I don&#039;t really know how that plays out.&lt;/p&gt;
&lt;p&gt;Obviously many service owners will try to make unauthorized scraping
harder, but that&#039;s a very hard problem to solve on the web. Maybe some
of them give up on the web entirely, and move to mobile where they can
at least get device attestations. Some might just give up on the open
web, and require all usage to be signed in, with account creation being
gated on something scarce. Some might become unviable and close up shop
entirely.&lt;/p&gt;
&lt;p&gt;If/when that happens, what&#039;s the play on the AI agent side? Will they
choose an escalating adversarial arms race with increasingly dodgy
tactics, or will they eventually decide that it&#039;s better to pay for the
services they use? The former seems unsustainable. If the latter, then
it feels like the core engineering challenge becomes one of building data
provider backends optimized specifically for AI use, with the goal of scaling to
massive volumes and cheaper unit prices, with the trade-off being higher latency, lower
reliability and lower quality.
That could be quite interesting from a systems perspective. (Yes, I&#039;m aware
of &lt;a href=https://www.anthropic.com/news/model-context-protocol&gt;MCP&lt;/a&gt;,
but it&#039;s a solution to an orthogonal issue.)
&lt;/p&gt;
&lt;p&gt;But one thing I&#039;m confident won&#039;t be happening is that it&#039;s the AIs that
turn out to be too expensive to run.&lt;/p&gt;

&lt;h3&gt;Additional reading&lt;/h3&gt;

&lt;p&gt;
Below are some additional references that were not worked into the main
narrative (this article was long-winded enough already).
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=https://arxiv.org/html/2506.04645v1&gt;Inference economics of language models&lt;/a&gt; (2025) - A mathematical model for estimating the cost structure, latency/cost tradeoffs, optimal cluster size, and optimal batching based on the LLM architecture.

&lt;li&gt;&lt;a href=https://www.tensoreconomics.com/p/llm-inference-economics-from-first&gt;LLM Inference Economics from First Principles&lt;/a&gt; - (2025) A very detailed cost-per-token computation on the cost structure of one specific model, Llama 3.3 70B.

&lt;li&gt;&lt;a href=https://www.lesswrong.com/posts/mRKd4ArA5fYhd2BPb/observations-about-llm-inference-pricing&gt;Observations About LLM Inference Pricing&lt;/a&gt; - (2025) Analysis of the economics driven by pricing data rather than first-principles cost structure; concludes that proprietary models have very significant markups.

&lt;li&gt;&lt;a href=https://semianalysis.com/2023/02/13/peeling-the-onions-layers-large-language/&gt;Large Language Models Search Architecture And Cost&lt;/a&gt; - (2023) Analysis on the cost of integrating LLMs into search; the LLM cost data is no longer very relevant due to the age of the article (GPT-3.5) but it uses a different way of estimating the search cost structure.
&lt;/ul&gt;
</description><author>jsnell@iki.fi</author><category>GENERAL</category><pubDate>Mon, 02 Jun 2025 22:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2025-06-02-llms-are-cheap/</guid></item><item><title>Web Environment Integrity vs. Private Access Tokens - They&#039;re the same thing!</title><link>https://www.snellman.net/blog/archive/2023-07-25-web-integrity-api-vs-private-access-tokens/</link><description>
&lt;p&gt;
I&#039;ve seen a lot of discussions in the last week about the &lt;a
href=https://github.com/RupertBenWiser/Web-Environment-Integrity/blob/main/explainer.md&gt;Web
Environment Integrity&lt;/a&gt; proposal. Quite predictably from the moment
it got called things like &amp;quot;DRM for the web&amp;quot;, people have
been arguing passionately against it on HN, GitHub issues, etc. The
basic claims seem to be that it&#039;s going to turn the web into a walled
garden, kill ad blockers, kill all small browsers, kill all small operating systems,
kill accessibility tools like screen readers, etc.

&lt;p&gt;
The Web Environment Integrity proposal is basically:

&lt;ul&gt;
&lt;li&gt;A website can request an attestation from the browser
&lt;li&gt;The browser forwards the attestation requests to an attester
&lt;li&gt;The attester checks properties like hardware and software integrity
&lt;li&gt;If they check out, the attester creates a token and signs it with its private key.
&lt;li&gt;The attester hands off the signed token to the browser, which in turn sends it to the website.
&lt;li&gt;The website checks that the token was signed by a trusted attester
&lt;/ul&gt;

&lt;p&gt;
Here&#039;s a funny thing I suspect few of those commenters know: A
very similar mechanism already exists on the web, and is already
deployed in production browsers (Safari), operating systems (&lt;a href=https://developer.apple.com/news/?id=huqjyh7k&gt;iOS, OS X&lt;/a&gt;),
and hosting infrastructure (&lt;a href=https://blog.cloudflare.com/eliminating-captchas-on-iphones-and-macs-using-new-standard/&gt;Cloudflare&lt;/a&gt;, &lt;a href=https://www.fastly.com/blog/private-access-tokens-and-the-future-of-anti-fraud&gt;Fastly&lt;/a&gt;). That mechanism is
&lt;a href=https://www.ietf.org/archive/id/draft-private-access-tokens-01.html&gt;Private Access Tokens&lt;/a&gt; /
&lt;a href=https://datatracker.ietf.org/doc/draft-ietf-privacypass-architecture/13/&gt;Privacy Pass&lt;/a&gt;.

&lt;p&gt;
Here&#039;s what PATs (as deployed by Apple, and on by default) do to the best of my understanding:

&lt;ul&gt;
&lt;li&gt;A website can request an attestation from the browser
&lt;li&gt;The browser forwards the attestation requests to an attester
&lt;li&gt;The attester checks properties like hardware and software integrity.
&lt;li&gt;If they check out, the attester calls the website&#039;s trusted token issuer
&lt;li&gt;The issuer checks whether to trust the attester and whether the information passed by the attester is sufficient, and then issues a token signed by its private key
&lt;li&gt;The attester hands off the signed token to the browser, which passes it to the website.
&lt;li&gt;The website checks that the token was signed by a trusted token issuer
&lt;/ul&gt;
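&lt;p&gt;
To make the split between attester and issuer concrete, here&#039;s a toy Python simulation of the steps above. It is only a sketch under loud assumptions: all class and function names are hypothetical, the browser is left out since it only relays messages, and an HMAC stands in for the real signature scheme. With actual signatures the website would verify the token against the issuer&#039;s public key instead of calling back to the issuer.

```python
import hashlib
import hmac
import secrets


class Issuer:
    """Hypothetical token issuer trusted by the website."""

    def __init__(self, trusted_attesters):
        self._key = secrets.token_bytes(32)  # stand-in for a private signing key
        self._trusted_attesters = set(trusted_attesters)

    def issue(self, attester_name, challenge):
        # Only issue tokens when the request is vouched for by a trusted attester.
        if attester_name not in self._trusted_attesters:
            return None
        return hmac.new(self._key, challenge, hashlib.sha256).digest()

    def verify(self, challenge, token):
        # Assumption: HMAC is symmetric, so verification needs the issuer key.
        expected = hmac.new(self._key, challenge, hashlib.sha256).digest()
        return hmac.compare_digest(expected, token)


class Attester:
    """Hypothetical attester, e.g. run by the device vendor."""

    def __init__(self, name, issuer):
        self.name = name
        self._issuer = issuer

    def attest(self, device_ok, challenge):
        # Check device properties; only then ask the issuer for a token.
        if not device_ok:
            return None
        return self._issuer.issue(self.name, challenge)


def website_accepts(issuer, challenge, token):
    """The website accepts only tokens its chosen issuer vouches for."""
    return token is not None and issuer.verify(challenge, token)


issuer = Issuer(trusted_attesters=["big-vendor"])
challenge = secrets.token_bytes(16)  # sent by the website to the browser

trusted = Attester("big-vendor", issuer)
homemade = Attester("my-own-attester", issuer)

print(website_accepts(issuer, challenge, trusted.attest(True, challenge)))
print(website_accepts(issuer, challenge, trusted.attest(False, challenge)))
print(website_accepts(issuer, challenge, homemade.attest(True, challenge)))
```

&lt;p&gt;
Only the first request is accepted: the second fails the device check, and the third comes from an attester the issuer doesn&#039;t trust.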

&lt;p&gt;
This launch was hailed in the tech press as a win for privacy and security, not
as an attempt to kill accessibility tools or build a walled garden.
&lt;a id=&#039;fnref1&#039;&gt;[&lt;a href=&#039;#fn1&#039;&gt;1&lt;/a&gt;]

&lt;p&gt;
You might notice that the basic operating model of the two protocols
is almost exactly the same. So is their intended use. From the
&amp;quot;DRM for websites&amp;quot; perspective, I don&#039;t think there is a
difference.

&lt;p&gt;
With both WEI and PATs, the website would be able to ask Apple to verify that the
request is coming from a genuine non-jailbroken iPhone running
Safari, and block the ones running Firefox on Linux. And in both,
the intent is not for the API to be used for that kind of outright blocking.

&lt;p&gt;
Neither lists e.g. checking whether the browser is running an ad blocker
extension as a use case. Both would have just the same technical capabilities
for making that kind of thing happen, by just having the attester check for it,
and I bet that in both cases the attester would be equally unmotivated in
actually providing that kind of attestation.

&lt;p&gt;
It&#039;s also not that PATs would somehow make it easier for people
to spin up new attesters for small or new platforms. Want to
run your own attester for PATs? You could, but the issuers you care
about will not trust it. &lt;a id=&#039;fnref2&#039;&gt;[&lt;a href=&#039;#fn2&#039;&gt;2&lt;/a&gt;]

&lt;p&gt;
Now, the technologies aren&#039;t quite identical, but the distinctions are
subtle and would just matter for exactly the kind of anti-abuse work
that both of the proposals were ostensibly meant for. The big one is
the WEI proposal including the ability to content-bind the attestation
to a specific operation. It&#039;s something anyone trying to use a feature
like this for abuse prevention would think is needed, but it adds no
power to the theorized &amp;quot;DRM for the web&amp;quot; use case. There is also a
more obvious difference between the two: whether the attester and
issuer are the same entity or split. But that too is irrelevant
in the discussion on how the technology could be misused.
&lt;a id=&#039;fnref3&#039;&gt;[&lt;a href=&#039;#fn3&#039;&gt;3&lt;/a&gt;]

&lt;p&gt;
In principle there could also be differences in the exact things that
the APIs allow attesting for. But neither standard defines the
exact set of attestations, just the mechanisms.

&lt;p&gt;
Given the DRM narrative would have worked exactly the same for the two
projects, why such a different reception? I can only think of two
differences, both social rather than technical.

&lt;p&gt;
One is that the PAT (and related Privacy Pass) draft standards were
written in the IETF and are dense standardese. There was no plaintext
explainer. Effectively nobody outside of the internet standardization
circles read those drafts, and if they had they wouldn&#039;t have known
whether they needed to be outraged or not. The first time it actually
broke through to the public was when Apple implemented it.

&lt;p&gt;
The other is the framing. PATs were sold to the public exclusively as
a way of seeing fewer captchas. Who wouldn&#039;t want fewer captchas? WEI
was pitched as a bunch of fairly abstract use cases and mostly from
the perspective of the service provider, not for how it&#039;d improve the
user experience by reducing the need for invasive challenges and data
collection.

&lt;p&gt;
This isn&#039;t the first time I&#039;ve seen two attempts at a really similar
project, with one getting lauded while the other gets trashed for
something that&#039;s common to both. But it is the one where the two
things are the most similar, and it feels like it should be
instructive somehow.

&lt;p&gt;
If the takeaway is that standards proposals should be opaque and kept
away from the public for as long as possible, before being launched
straight to prod based on a draft spec, that&#039;d be bad. If it&#039;s that
standard proposals should be carefully written to highlight the
benefit for the end user, even starting from the first draft, that&#039;s
probably pretty good? And if it&#039;s that only Apple can launch any
browser features without a massive backlash, it seems pretty damn bad.

&lt;hr&gt;
  &lt;div class=footnotes&gt;

    &lt;p&gt;
    &lt;a id=&#039;fn1&#039;&gt;[&lt;a href=&#039;#fnref1&#039;&gt;1&lt;/a&gt;] Just to be clear, the
    &lt;a href=https://news.ycombinator.com/item?id=31751203&gt;one significant HN discussion on PATs&lt;/a&gt; had
    similar arguments about it being DRM, so my claim is not that
    absolutely everyone loved PATs. But it didn&#039;t actually get traction as a
    hacker cause celebre, and as far as I can see the general media
    coverage was broadly positive.
    &lt;/p&gt;

    &lt;p&gt;
    &lt;a id=&#039;fn2&#039;&gt;[&lt;a href=&#039;#fnref2&#039;&gt;2&lt;/a&gt;] What&#039;s the process for
    getting Cloudflare or Fastly to trust a non-Apple attester anyway? I
    can&#039;t find any documentation.
    &lt;/p&gt;

    &lt;p&gt;
    &lt;a id=&#039;fn3&#039;&gt;[&lt;a href=&#039;#fnref3&#039;&gt;3&lt;/a&gt;] The split version seems
    kind of superior for deployment, since it means each site needs to only
    care about a single key (their chosen issuer). This makes e.g. the
    creation of a new attester a lot more tractable. You only need to
    convince half a dozen issuers to trust your new attester and
    ingest the keys, not try to sign up every single website in the
    world one by one.
    &lt;/p&gt;
  &lt;/div&gt;
</description><author>jsnell@iki.fi</author><category>GENERAL</category><pubDate>Tue, 25 Jul 2023 18:30:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2023-07-25-web-integrity-api-vs-private-access-tokens/</guid></item><item><title>A monorepo misconception - atomic cross-project commits</title><link>https://www.snellman.net/blog/archive/2021-07-21-monorepo-atomic/</link><description>

&lt;p&gt;

In articles and discussions about &lt;a
href=&#039;https://en.wikipedia.org/wiki/Monorepo&#039;&gt;monorepos&lt;/a&gt;, there&#039;s
one frequently alleged key benefit: atomic commits across the whole
tree let you make changes to both a library&#039;s implementation and the
clients in a single commit. Many authors even go as far as to claim that
this is the only benefit of monorepos.

&lt;p&gt;
I like monorepos, but that particular claim makes no sense! It&#039;s not
how you&#039;d actually make backwards incompatible changes, such as
interface refactorings, in a large monorepo. Instead the process would
be highly incremental, and more like the following:

&lt;ol&gt;
&lt;li&gt;Push one commit to change the library, such that it supports both
the old and new behavior with different interfaces.
&lt;li&gt;Once you&#039;re sure the commit from stage 1 won&#039;t be reverted, push N
commits to switch each of the N clients to use the new interface.
&lt;li&gt;Once you&#039;re sure the commits from stage 2 won&#039;t be reverted, push
one commit to remove the old implementation and interface from the
library.
&lt;/ol&gt;

&lt;read-more&gt;&lt;/read-more&gt;

&lt;p&gt;
There&#039;s a bunch of reasons why this is a nicer sequencing than a
single atomic commit, but they&#039;re mostly variations on the theme:
mitigating risks. If something breaks, you want as few things as
possible to break at once, and for the rollback to a known-good state
to be simple. Here&#039;s how the risks are mitigated at the various
stages in the process:

&lt;ol&gt;
&lt;li&gt;There is nothing risky at all about the first commit. It is just
adding new code that&#039;s not yet used by anyone.

&lt;li&gt;The commits for changing the clients can be done gradually,
starting with the ones that the library owners are themselves working
on, the projects that are most likely to detect bugs, or the clients
that are most forgiving to errors. Depending on the risk profile of
the change, you might even use these commits as a form of staged
rollout, where you&#039;ll wait to see if the previous clients report any
problems in production before sending the next batch of commits
for code review.

&lt;li&gt;The final commit to remove the old implementation can only break a
minimal number of clients: the ones that just started using the library
between the removal commit being reviewed and pushed, and did so
using the old interface. The ideal environment would have tooling in
place to prevent that kind of backslipping from happening in the first
place (e.g. lint warnings on new uses of deprecated interfaces).
&lt;/ol&gt;

&lt;p&gt;
If anything goes wrong in stage 2, it&#039;s trivial to revert a commit
that&#039;s only touching a couple of files. By contrast, reverting a
commit that&#039;s spanning hundreds of projects would be quite painful,
especially if the repo has any kind of per-directory ACLs (which I
think is mandatory for a big monorepo). It gets worse if the breakage
isn&#039;t detected immediately, since the more code that the single
change is affecting, the less likely it is that the revert applies
cleanly.

&lt;p&gt;
If anything goes wrong in stage 3, it would also have gone wrong when
using atomic commits. But with atomic commits the breakage in stage 3
is far more likely, since the new users will naturally use the old
interface (the new one doesn&#039;t exist yet in their view of the world),
and since the window between start of code review and committing will
be wider. And again, the rollback will be far easier with the commit
that&#039;s only touching the library and not the clients.

&lt;p&gt;
There are some additional reasons why the huge commit will be
annoying. For example, getting a clean presubmit CI run will become
progressively harder the more projects a single commit is changing.

&lt;p&gt;
Sure, the atomic commit will save a little bit of work in not needing
to have the implementation support both interfaces at once. But that
tiny saving is just not a worthwhile tradeoff when compared to how
much work wrangling the huge commit would be.

&lt;p&gt;
It&#039;s particularly easy to see that the &amp;quot;atomic changes across the
whole repo&amp;quot; story is rubbish when you move away from
libraries, and also consider code that has any kind of more
complicated deployment lifecycle, for example the interactions between
services and client binaries that communicate over an RPC
interface. Obviously you can&#039;t do an atomic change in that case,
since you need to continue supporting the old server implementation
until all client binaries have been upgraded (and are rollback-safe).
The same goes for changes to database schemas, command line
tools, synchronized client-side Javascript + backend changes, etc.

&lt;p&gt;
I think it&#039;s true that monorepos make refactoring easier. So that&#039;s
not the problem. It&#039;s also true that they have atomic commits across
projects. But the two facts have nothing to do with each other. The reasons
monorepos make refactoring simpler all boil down to everyone in the
organization having a shared view of what the current state is:

&lt;ul&gt;

&lt;li&gt;A monorepo will, in practice, mean trunk-based
development. You&#039;ll know that everybody really is on HEAD rather than
actually doing their development on some year-old branch.
&lt;li&gt;And conversely, you&#039;ll know that every user of the library is
using your library from HEAD rather than pinning it to some year-old
version.
&lt;li&gt;It&#039;s trivial to find all the current callers, so that you know
which clients need to be updated. (Once you&#039;ve solved the highly
non-trivial problem of having any kind of monorepo tooling at scale,
of course.)
&lt;/ul&gt;

&lt;p&gt;
In theory you could do the exact same thing with
multirepos assuming sufficient tool support, discipline about code
organization, enforced trunk-based development in all repositories,
a master list of all repositories in the org, and defaulting to all
repositories being readable by every engineer with no hidden silos.
That&#039;s all &lt;i&gt;technically&lt;/i&gt; doable, but I suspect not culturally
compatible with using multirepos in the first place.

&lt;p&gt;
Where does this misconception come from? It&#039;s certainly present in the
&lt;a href=&#039;https://research.google/pubs/pub45424/&#039;&gt;Google monorepo paper&lt;/a&gt;,
which somewhat contradicts itself on this. On one hand,
they describe exactly this form of atomic refactoring as a benefit of
monorepos:

&lt;blockquote&gt;
The ability to make atomic changes is also a very powerful
feature of the monolithic model. A developer can make a major change
touching hundreds or thousands of files across the repository in a
single consistent operation. For instance, a developer can rename a
class or function in a single commit and yet not break any builds or
tests.
&lt;/blockquote&gt;

&lt;p&gt;
But when it comes to what the actual refactoring workflow is, the
process that&#039;s described is quite different:

&lt;blockquote&gt;
A team of Google developers will occasionally undertake a set of
wide-reaching code-cleanup changes to further maintain the health of
the codebase. The developers who perform these changes commonly
separate them into two phases. With this approach, a large
backward-compatible change is made first. Once it is complete, a
second smaller change can be made to remove the original pattern that
is no longer referenced.
&lt;/blockquote&gt;

&lt;p&gt;
I suspect what happened here was that the atomic commits were identified
as a benefit in the abstract, with refactoring being used as an
illustration of a use case. This was then quite understandably read
as a practical example of how you&#039;d work with a monorepo.

&lt;p&gt;
There might be a few cases where atomic commits across the whole
repository are the right solution, but they have to be exceedingly
rare. The example of renaming a function with thousands of callers,
for example, is probably better handled by just temporarily aliasing
the function, or by temporarily defining the new function in terms of
the old. (But this does suggest that languages, both programming languages and
IDLs, should make aliasing and indirection easy for as many constructs
as possible).
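
&lt;p&gt;
As a minimal sketch of the aliasing approach (in Python, with hypothetical
function names that don&#039;t come from any real codebase), the old name can be
kept as a thin wrapper around the new one. The wrapper emits a deprecation
warning, which is one possible way to discourage new uses while callers
are migrated one commit at a time:

```python
import warnings


def fetch_user_record(user_id):
    """New name; holds the real implementation."""
    return {"id": user_id}


def get_user(user_id):
    """Old name, kept temporarily as a thin alias for the new function."""
    warnings.warn(
        "get_user is deprecated; use fetch_user_record instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not the alias
    )
    return fetch_user_record(user_id)
```

&lt;p&gt;
Once the last caller has been switched over, removing &lt;code&gt;get_user&lt;/code&gt;
is a small commit that only touches the library.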

&lt;p&gt;
Are there organizations with a large monorepo where atomic
cross-project commits are routinely used to change both the
implementation and the clients?

</description><author>jsnell@iki.fi</author><category>GENERAL</category><pubDate>Wed, 21 Jul 2021 11:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2021-07-21-monorepo-atomic/</guid></item><item><title>Writing a procedural puzzle generator</title><link>https://www.snellman.net/blog/archive/2019-05-14-procedural-puzzle-generator/</link><description>
  &lt;p&gt;
    This blog post describes the level generator for my puzzle game
    &lt;a href=&#039;https://linjat.snellman.net&#039;&gt;Linjat&lt;/a&gt;. The post is
    standalone, but might be a bit easier to digest if you play
    through a few levels. The &lt;a href=&#039;https://github.com/jsnell/linjat/&#039;&gt;source code&lt;/a&gt; is available; anything discussed below is in
    &lt;code&gt;src/main.cc&lt;/code&gt;.
  &lt;/p&gt;

  &lt;p&gt;
    A rough outline of this post:
    &lt;ul&gt;
      &lt;li&gt;Linjat is a logic game of covering all the numbers and dots
        on a grid with lines.
      &lt;li&gt;The puzzles are procedurally generated by a combination of a
        solver, a generator, and an optimizer.
      &lt;li&gt;The &lt;a href=&#039;#solver&#039;&gt;solver&lt;/a&gt; tries to solve puzzles the
        way a human would, and assigns a score for how interesting
        a given puzzle is.
      &lt;li&gt;The &lt;a href=&#039;#generator&#039;&gt;puzzle generator&lt;/a&gt; is designed
        such that it&#039;s easy to change one part of the puzzle (the
        numbers) and have other parts of the puzzle (the dots) get
        re-organized such that the puzzle remains solvable.
      &lt;li&gt;A &lt;a href=&#039;#optimizer&#039;&gt;puzzle optimizer&lt;/a&gt; repeatedly
        solves levels and generates new variations from the most
        interesting ones that have been found so far.
    &lt;/ul&gt;
  &lt;/p&gt;

&lt;read-more&gt;&lt;/read-more&gt;

  &lt;a name=&#039;rules&#039;&gt;&lt;/a&gt;
  &lt;h3&gt;The rules&lt;/h3&gt;

  &lt;p&gt;
    To understand how the level generator works, you unfortunately
    must first know the rules of the game. Luckily the rules are very
    simple. The puzzle consists of a grid containing empty squares,
    numbers, and dots. Like this:
  &lt;/p&gt;

  &lt;img src=&#039;/blog/stc/images/procedural-puzzle/image5.png&#039;&gt;

  &lt;p&gt;
    The goal is to draw a vertical or horizontal line through each of
    the numbers, with three constraints:
  &lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt;The line going through a number must be of the same length
      as the number.
    &lt;li&gt;The lines can&#039;t cross.
    &lt;li&gt;All the dots need to be covered by a line.
  &lt;/ul&gt;

  &lt;p&gt;
    Like this:
  &lt;/p&gt;

  &lt;img src=&#039;/blog/stc/images/procedural-puzzle/image3.png&#039;&gt;

  &lt;p&gt;
    Whee! The game is all designed, the UI is implemented, now all I
    need are a few hundred good puzzles, and we&#039;re good to go. And for
    a game like this, there&#039;s really no point in trying to make those
    puzzles by hand. That&#039;s a job for a computer.
  &lt;/p&gt;

  &lt;a name=&#039;requirements&#039;&gt;&lt;/a&gt;
  &lt;h3&gt;Requirements&lt;/h3&gt;

  &lt;p&gt;
    What makes for a good puzzle in this game? I tend to think of
    puzzle games as coming in two categories. There&#039;s the ones where
    you&#039;re exploring a complicated state space from the start to the
    end (something like &lt;a href=&#039;https://en.wikipedia.org/wiki/Sokoban&#039;&gt;Sokoban&lt;/a&gt;
    or &lt;a href=&#039;https://en.wikipedia.org/wiki/Rush_Hour_(puzzle)&#039;&gt;Rush Hour&lt;/a&gt;), and where it might not
    even be obvious exactly what states exist in the game. Then there are
    ones where all the states are known at the start, and you&#039;re
    slowly whittling the state space down by process of elimination
    (e.g. &lt;a href=&#039;https://en.wikipedia.org/wiki/Sudoku&#039;&gt;Sudoku&lt;/a&gt;
    or &lt;a href=&#039;https://en.wikipedia.org/wiki/Nonogram&#039;&gt;Picross&lt;/a&gt;).
    This game is clearly in the latter category.
  &lt;/p&gt;

  &lt;p&gt;
    Now, players have very different expectations for these two
    different kinds of puzzles. For this latter kind there&#039;s a very
    strong expectation that the puzzle is solvable just with
    deduction, and that there should never be a need for backtracking
    / guessing / trial and error. &lt;a id=&#039;fnref0&#039;&gt;[&lt;a href=&#039;#fn0&#039;&gt;0&lt;/a&gt;] &lt;a id=&#039;fnref1&#039;&gt;[&lt;a href=&#039;#fn1&#039;&gt;1&lt;/a&gt;]
  &lt;/p&gt;

  &lt;p&gt;
    It&#039;s not enough to know if a puzzle can be solved with just
    logic. In addition to that we need to have some idea of how good
    the produced puzzles are. Otherwise most of the levels might be
    just trivial dross. In an ideal world this could also be used for
    building a smooth progression curve, where the levels get
    progressively harder as the player progresses through the game.
  &lt;/p&gt;

  &lt;a name=&#039;solver&#039;&gt;&lt;/a&gt;
  &lt;h3&gt;The solver&lt;/h3&gt;

  &lt;p&gt;
    The first step to meeting the above requirements is a solver for
    the game that&#039;s optimized for this purpose. A backtracking
    brute-force solver will be fast and accurate at telling whether
    the puzzle is solvable, and could also be changed to determine
    whether the solution is unique. But it
    can&#039;t give any idea of how challenging the puzzle actually
    is, since that&#039;s not how a human would solve these puzzles.
    The solver needs to imitate humans.
  &lt;/p&gt;

  &lt;p&gt;
    How does a human solve this puzzle? There&#039;s a couple of obvious
    moves, which the tutorial teaches:
  &lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt;
      &lt;p&gt;
        If a dot can only be reached from one number, the line from
        that number should be extended to cover the dot. Here the dot
        can only be reached from the three, not the four:
      &lt;/p&gt;
      &lt;img src=&#039;/blog/stc/images/procedural-puzzle/image4.png&#039;&gt;
      &lt;p&gt;Leading to:&lt;/p&gt;
      &lt;img src=&#039;/blog/stc/images/procedural-puzzle/image1.png&#039;&gt;
    &lt;li&gt;
      &lt;p&gt;
        If the line doesn&#039;t fit in one orientation, it must be placed in the other orientation instead. In the above example the 4 can no longer be placed vertically, so we know it has to be horizontal. Like this:
      &lt;/p&gt;
      &lt;img src=&#039;/blog/stc/images/procedural-puzzle/image2.png&#039;&gt;
    &lt;li&gt;
      &lt;p&gt;
        If a line of size X is known to be in a certain orientation
        and there isn&#039;t enough space to fit a line of X spaces on both
        sides, some of the squares in the middle must be covered. For
        example if in the above example the &amp;quot;4&amp;quot; had been a &amp;quot;3&amp;quot; instead,
        we wouldn&#039;t know whether it extended all the way to the right
        or to the left of the board. But we would know it must cover
        the two middle squares:
      &lt;/p&gt;
      &lt;img src=&#039;/blog/stc/images/procedural-puzzle/image6.png&#039;&gt;
  &lt;/ul&gt;
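
  &lt;p&gt;
    The third rule is just interval arithmetic along one axis. Here&#039;s
    a minimal sketch in Python, assuming the free cells around the
    number form one contiguous run; the function name and the
    coordinate convention are made up for illustration and are not
    taken from &lt;code&gt;src/main.cc&lt;/code&gt;:
  &lt;/p&gt;

```python
def forced_cells(lo, hi, pos, length):
    """Cells covered by every possible placement of a line on one axis.

    lo..hi is the inclusive run of free cells that contains the number,
    pos is the cell holding the number, length is the value of the number.
    """
    first_start = max(lo, pos - length + 1)  # leftmost start that still covers pos
    last_start = min(pos, hi - length + 1)   # rightmost start that still fits
    if first_start > last_start:
        return []  # the line does not fit in this orientation at all
    # Every placement covers the cells from last_start to first_start + length - 1.
    return list(range(last_start, first_start + length))
```

  &lt;p&gt;
    With a row of four free cells and a 3 sitting in the second cell,
    &lt;code&gt;forced_cells(0, 3, 1, 3)&lt;/code&gt; gives the two middle cells,
    matching the example above.
  &lt;/p&gt;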

  &lt;p&gt;
    This kind of thinking is the meat and potatoes of the game. You
    figure a way to extend one line a little bit, make that move, and
    then inspect the board again since that hopefully gave you
    the information to make a new deduction elsewhere. Writing a
    solver that follows these rules would be enough to determine
    if a human &lt;i&gt;could&lt;/i&gt; solve the puzzle without backtracking.
  &lt;/p&gt;

  &lt;p&gt;
    It doesn&#039;t really say anything about how hard or interesting the
    level is though. In addition to the solvability, we need to
    somehow quantify the difficulty.
  &lt;/p&gt;

  &lt;p&gt;
    The obvious first idea for a scoring function is that a puzzle
    that takes more moves to finish is the harder one. That&#039;s probably
    a good metric in other games, but in this one the number of valid
    moves that the player has at any one time is probably more
    important. If there are 10 possible deductions a player could
    make, they&#039;ll find one of those very quickly. If there&#039;s only one
    valid move, it&#039;ll take longer.
  &lt;/p&gt;

  &lt;p&gt;
    So as a first approximation you want the solution tree to be deep
    and narrow: there&#039;s a long dependency chain of moves from start to
    finish, and at any one time there are only a few ways of moving
    forward on the chain. &lt;a id=&#039;fnref2&#039;&gt;[&lt;a href=&#039;#fn2&#039;&gt;2&lt;/a&gt;]
  &lt;/p&gt;

  &lt;p&gt;
    How do you figure out the width and depth of the tree? Just
    solving the puzzle once and evaluating the produced tree doesn&#039;t
    give a precise answer. The exact order in which you make the moves
    will end up affecting the shape of the tree. You&#039;d need to look at
    all the possible solutions, and do something like optimize for the
    best worst-case. Now, I&#039;m no stranger to &lt;a href=&#039;https://www.snellman.net/blog/archive/2018-07-23-optimizing-breadth-first-search/&#039;&gt;brute-forcing
      puzzle game search graphs&lt;/a&gt;, but for this project I wanted a
    single-pass solver rather than any kind of exhaustive search.
    Due to the optimization phase, the goal was for the solver runtime to be
    measured in microseconds rather than seconds.
  &lt;/p&gt;

  &lt;p&gt;
    So I skipped the exhaustive search. Instead my solver doesn&#039;t actually make
    one move at a time, but solves the puzzle by layers: given
    a state, find all valid moves that could be made. Then apply all
    of those moves at once. Finally start over from the new state. The
    number of layers and the maximum number of moves ever found in a
    single layer are then used as proxies for the depth and the width
    of the search tree as a whole.
  &lt;/p&gt;
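
  &lt;p&gt;
    The layer idea can be sketched on an abstract model, where each
    deduction is just a name plus the set of deductions that must
    have been made before it becomes available. This is an
    illustrative Python sketch under that assumption, not the actual
    solver, which works on the board itself:
  &lt;/p&gt;

```python
def solve_by_layers(prerequisites):
    """Return (depth, width) for a puzzle, or None if the solver gets stuck.

    prerequisites maps each move to the set of moves it depends on.
    """
    done = set()
    depth = 0
    width = 0
    while len(prerequisites) > len(done):
        # Find every move that is available in the current state...
        layer = [
            move
            for move, needed in prerequisites.items()
            if move not in done and needed.issubset(done)
        ]
        if not layer:
            return None  # no deduction applies: not solvable by logic alone
        # ...and apply all of them at once before looking again.
        done.update(layer)
        depth += 1
        width = max(width, len(layer))
    return depth, width
```

  &lt;p&gt;
    For example, &lt;code&gt;{"a": set(), "b": set(), "c": {"a"}, "d": {"b", "c"}}&lt;/code&gt;
    is solved in three layers, the widest of which has two moves.
  &lt;/p&gt;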

  &lt;p&gt;
    Here&#039;s what the solution for one of the harder puzzles looks like
    with this model (click on the thumbnail to expand). Dotted lines are
    the lines that were extended on that solver layer, solid ones
    didn&#039;t change. Green lines are of the right length, red are not
    yet complete.
  &lt;/p&gt;

  &lt;a href=&#039;/blog/stc/images/procedural-puzzle/solver.png&#039; target=&#039;_blank&#039;&gt;
    &lt;img src=&#039;/blog/stc/images/procedural-puzzle/solver.png&#039; width=&#039;700&#039;&gt;&lt;/img&gt;
  &lt;/a&gt;

  &lt;p&gt;
    The next problem is that not all moves a player makes are created
    equal. What was listed at the start of this section is really just
    common sense. Here&#039;s an example of a more complicated deduction rule,
    which would require some more thought to find. Consider a board like:
  &lt;/p&gt;

  &lt;img src=&#039;/blog/stc/images/procedural-puzzle/rule-square.png&#039;&gt;&lt;/img&gt;

  &lt;p&gt;
    The dots at C and D can only be covered by the 5 and the middle 4
    (and neither piece can cover both of them at the same time). This
    means that the middle 4 needs to cover one of the two, and thus
    can&#039;t be used to cover A. Instead A has to be covered with the
    lower left 4.
  &lt;/p&gt;

  &lt;p&gt;
    It&#039;d clearly be silly to treat this chain of deductions the same
    as a one-step &amp;quot;this dot can only be reached from that number&amp;quot;. Can
    these more complex rules just be weighted more heavily in the
    scoring function?  Unfortunately not with the layer-based solver,
    since it&#039;s not guaranteed to find the lowest-cost solution. This isn&#039;t
    just a theoretical concern: in practice it&#039;s pretty common for a part
    of the board to be solvable in either a single complex deduction
    or a chain of several much simpler moves. The layer-based solver
    basically finds the shortest path, not the cheapest one, and that
    can&#039;t just be fixed in the scoring function.
  &lt;/p&gt;

  &lt;p&gt;
    The method I ended up using was to change the solver such that
    each layer consists of only one kind of deduction. The algorithm
    goes through the deduction rules in a rough order of
    difficulty. If a rule finds any moves, they&#039;re applied and the
    iteration is over, and the next iteration starts the list over
    from the beginning.
  &lt;/p&gt;

  &lt;p&gt;
    The solution is then scored by assigning each layer a cost based
    on the single rule used for it. This is still not guaranteed to
    find the cheapest solution, but with a good selection of weights
    it&#039;ll at least not find an expensive solution if a cheap solution
    exists.
  &lt;/p&gt;
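
  &lt;p&gt;
    A rough Python sketch of that loop, assuming the state is a set
    of known facts and each rule is a function returning the facts it
    can derive. The rule names and costs in the usage note are
    invented for illustration; the real rules and weights live in the
    game&#039;s source:
  &lt;/p&gt;

```python
def solve_by_rule_layers(rules, state, is_solved):
    """Return (total_cost, rules_used), or None if no rule applies.

    rules is a list of (name, cost, find_moves) ordered from the
    easiest deduction to the hardest.
    """
    state = set(state)
    total_cost = 0
    used = []
    while not is_solved(state):
        for name, cost, find_moves in rules:
            moves = find_moves(state) - state
            if moves:
                # One layer uses exactly one kind of rule; then start
                # over from the easiest rule again.
                state = state | moves
                total_cost += cost
                used.append(name)
                break
        else:
            return None  # stuck: none of the rules found a move
    return total_cost, used
```

  &lt;p&gt;
    Because the scan restarts from the cheapest rule after every
    layer, an expensive rule only gets charged when nothing cheaper
    is available at that point.
  &lt;/p&gt;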

  &lt;p&gt;
    It also seems to map out pretty well to how humans solve the
    puzzle. You look for the gimmes first, and only start thinking
    hard once there are no easy moves.
  &lt;/p&gt;

  &lt;a name=&#039;generator&#039;&gt;&lt;/a&gt;
  &lt;h3&gt;The generator&lt;/h3&gt;

  &lt;p&gt;
    The previous section took care of figuring out if a level is any
    good or not. But that alone isn&#039;t enough, you also need to somehow
    generate levels for the solver to score. It&#039;s quite unlikely that
    a randomly generated level would be solvable, let alone
    interesting.
  &lt;/p&gt;

  &lt;p&gt;
    The key idea (which is by no means novel) is to interleave the
    solver and the generator. Let&#039;s start with a puzzle that&#039;s
    probably unsolvable, consisting just of numbers 2-5 placed in
    random locations on the grid:
  &lt;/p&gt;

  &lt;img src=&#039;/blog/stc/images/procedural-puzzle/add-dots-start.png&#039;&gt;&lt;/img&gt;

  &lt;p&gt;
    The solver runs until it can&#039;t make any more progress:
  &lt;/p&gt;

  &lt;img src=&#039;/blog/stc/images/procedural-puzzle/add-dots-blocked.png&#039;&gt;&lt;/img&gt;

  &lt;p&gt;
    The generator then adds some information to the puzzle, in the
    form of a dot, and continues solving.
  &lt;/p&gt;

  &lt;img src=&#039;/blog/stc/images/procedural-puzzle/add-dots-one.png&#039;&gt;&lt;/img&gt;

  &lt;p&gt;
    In this case that one added dot is not enough to allow the solver to make
    any progress. So the generator will keep on adding more dots until
    the solver is happy:
  &lt;/p&gt;

  &lt;img src=&#039;/blog/stc/images/procedural-puzzle/add-dots-more.png&#039;&gt;&lt;/img&gt;

  &lt;p&gt;
    And then the solver resumes normal operation:
  &lt;/p&gt;

  &lt;img src=&#039;/blog/stc/images/procedural-puzzle/add-dots-resume.png&#039;&gt;&lt;/img&gt;

  &lt;p&gt;
    This process continues either until the puzzle is solved or there
    is no more information to add (i.e. every space that&#039;s reachable
    from a number is covered by a dot).
  &lt;/p&gt;
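
  &lt;p&gt;
    The control flow of this interleaving is small enough to sketch
    in a few lines of Python. The three callbacks are hypothetical
    stand-ins for the real solver and generator, and the puzzle is
    assumed to be mutated in place:
  &lt;/p&gt;

```python
def generate(puzzle, solve_step, add_dot, is_solved):
    """Alternate between solving and adding dots.

    solve_step(puzzle) makes one deduction and returns False when stuck.
    add_dot(puzzle) adds one dot and returns False when no cell is left.
    Returns True if the puzzle ended up solvable, False otherwise.
    """
    while True:
        # Let the solver run until it cannot make any more progress.
        while solve_step(puzzle):
            pass
        if is_solved(puzzle):
            return True
        # Stuck: add a bit more information, or give up if there is none left.
        if not add_dot(puzzle):
            return False
```

  &lt;p&gt;
    A &lt;code&gt;False&lt;/code&gt; return corresponds to the failure case
    where every reachable space already has a dot and the solver is
    still stuck.
  &lt;/p&gt;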

  &lt;p&gt;
    This method works only if the new information that&#039;s being added
    can&#039;t invalidate any of the previously made deductions. That would
    be tough to do when adding numbers to the grid &lt;a id=&#039;fnref3&#039;&gt;[&lt;a href=&#039;#fn3&#039;&gt;3&lt;/a&gt;].
    But adding new dots to the board has that property,
    at least given the deduction rules I&#039;m using in this program.
  &lt;/p&gt;

  &lt;p&gt;
    Where should the algorithm add the dots? What I ended up doing was
    to add them in the empty space that could have been covered by the
    most lines in the starting state, so each dot tends to give as
    little information as possible. There is no attempt to add it
    specifically to a location where it&#039;ll be useful in advancing the
    puzzle at the point where the solver got stuck. This produces a
    pretty neat effect where most of the dots will be totally useless
    at the start of the puzzle, which makes the puzzle seem harder
    than it is. There are all these apparent moves you could make, but
    somehow none of them quite work out. The puzzle generator ends up
    being a bit of a jerk.
  &lt;/p&gt;
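
  &lt;p&gt;
    Here&#039;s a simplified Python sketch of that placement heuristic.
    It assumes a number of value N can reach the N-1 nearest cells in
    each of the four directions, and it ignores lines being blocked by
    other numbers, which the real generator would have to account for.
    The function names and the tie-breaking order are made up for
    illustration:
  &lt;/p&gt;

```python
def reach_counts(width, height, numbers):
    """Count how many numbers could cover each empty cell at the start.

    numbers maps (row, col) to the value of the number in that cell.
    """
    counts = {}
    for (row, col), value in numbers.items():
        for dist in range(1, value):
            for cell in (
                (row + dist, col),
                (row - dist, col),
                (row, col + dist),
                (row, col - dist),
            ):
                r, c = cell
                in_bounds = r >= 0 and height > r and c >= 0 and width > c
                if in_bounds and cell not in numbers:
                    counts[cell] = counts.get(cell, 0) + 1
    return counts


def pick_dot_cell(width, height, numbers, dots):
    """Pick the dot-free cell reachable from the most numbers, or None."""
    counts = reach_counts(width, height, numbers)
    candidates = {cell: n for cell, n in counts.items() if cell not in dots}
    if not candidates:
        return None  # every reachable cell already has a dot
    # Sorting first makes ties resolve to the smallest (row, col).
    return max(sorted(candidates), key=candidates.get)
```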

  &lt;p&gt;
    This process will not always produce a solution, but it&#039;s pretty
    fast (on the order of 50-100 microseconds) so it can just be repeated a
    bunch of times until it generates a level. Unfortunately it&#039;ll
    generally produce a mediocre puzzle. There are too many obvious
    moves right at the start, the board gets filled in very quickly
    and the solution tree is quite shallow.
  &lt;/p&gt;

  &lt;a name=&#039;optimizer&#039;&gt;&lt;/a&gt;
  &lt;h3&gt;The optimizer&lt;/h3&gt;

  &lt;p&gt;
    The above process produced a mediocre puzzle. In the final stage,
    we use that level as a seed for an optimization process. The
    process works as follows.
  &lt;/p&gt;

  &lt;p&gt;
    The optimizer sets up a pool of up to 10 puzzle variants. The pool
    is initialized with the newly generated random puzzle. On each
    iteration, the optimizer selects one puzzle from the pool and
    mutates it.
  &lt;/p&gt;

  &lt;p&gt;
    The mutation removes all the dots, and then changes the numbers a
    bit (e.g. reduce/increase the value of a randomly selected number,
    or move a number to a different location on the grid). It might be
    possible to apply multiple mutations to the board in one go. We then
    run the solver in the special level-generation mode described in
    the previous section. This adds enough dots to the puzzle to make
    it solvable again.
  &lt;/p&gt;

  &lt;p&gt;
    After that, we run the solver again, this time in the normal
    mode. During this run, the solver keeps track of a) the depth of
    the solution tree, b) how often each of the various kinds of rules
    was needed, c) how wide the solution tree was at times. The puzzle
    is scored based on the above criteria. The scoring function will
    basically prefer deep and narrow solutions, and at higher
    difficulty levels also rewards puzzles that require use of one or
    more of the advanced deduction rules.
  &lt;/p&gt;

  &lt;p&gt;
    The new puzzle is then added to the pool. If the pool ever
    contains more than 10 puzzles, the worst one is discarded.
  &lt;/p&gt;

  &lt;p&gt;
    This process is repeated a number of times (anything from 10k to 50k
    iterations seemed to be fine). After that, the version of the puzzle with the
    highest score is saved into the puzzle&#039;s level database. This is
    what the progress of the best puzzle looks like through
    one optimization run:
  &lt;/p&gt;

  &lt;a href=&#039;/blog/stc/images/procedural-puzzle/progress-opt.png&#039; target=&#039;_blank&#039;&gt;
    &lt;img src=&#039;/blog/stc/images/procedural-puzzle/progress-opt.png&#039; width=&#039;700&#039;&gt;&lt;/img&gt;
  &lt;/a&gt;
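
  &lt;p&gt;
    The pool-based loop described above could look roughly like this
    in Python. This is only a sketch: &lt;code&gt;mutate&lt;/code&gt; and
    &lt;code&gt;score&lt;/code&gt; are hypothetical callbacks standing in for
    the mutate-and-regenerate step and the solver-based scoring, and
    picking the parent uniformly at random is an assumption, since the
    selection policy isn&#039;t spelled out here:
  &lt;/p&gt;

```python
import random


def optimize(seed, mutate, score, iterations, pool_size=10, rng=None):
    """Hill-climb with a small pool; return (best_score, best_puzzle)."""
    rng = rng or random.Random(0)
    pool = [(score(seed), seed)]
    for _ in range(iterations):
        # Select one puzzle from the pool and mutate it.
        _, parent = rng.choice(pool)
        child = mutate(parent, rng)
        pool.append((score(child), child))
        # If the pool has grown too large, discard the worst entry.
        if len(pool) > pool_size:
            pool.remove(min(pool, key=lambda entry: entry[0]))
    return max(pool, key=lambda entry: entry[0])
```

  &lt;p&gt;
    Since only the worst entry is ever discarded, the best score in
    the pool can never get worse from one iteration to the next.
  &lt;/p&gt;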

  &lt;p&gt;
    I tried a few other ways of structuring the optimization as
    well. One version used simulated annealing, the others were
    genetic algorithms with different crossover operations. None of
    these performed as well as the naive pool of hill-climbers.
  &lt;/p&gt;

  &lt;a name=&#039;unique&#039;&gt;&lt;/a&gt;
  &lt;h3&gt;Unique single solution&lt;/h3&gt;

  &lt;p&gt;
    There&#039;s an interesting complication that arises when the puzzle
    has a single unique solution. Is it valid for the player to assume
    that&#039;s the case, and make deductions based on that? Is it fair for
    the puzzle generator to assume that the player will do so?
  &lt;/p&gt;

  &lt;p&gt;
    In a post on HN, I mentioned four options for how to deal with
    this:
  &lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt;State the &amp;quot;only a single solution&amp;quot; rule up front, and
      make the puzzle generator generate levels that require this
      form of deduction. This sucks, since it&#039;ll make the rules far more
      complicated to understand. And it&#039;s also exactly the kind of
      detail people would forget.
    &lt;li&gt;Don&#039;t guarantee a single solution: have potentially multiple
      solutions, and accept any of them. This doesn&#039;t really solve
      the problem, it just moves it around.
    &lt;li&gt;Punt, and just assume this is a very rare event that won&#039;t
      matter in practice. (This was the original implementation.)
    &lt;li&gt;Change the puzzle generator such that it doesn&#039;t generate
      puzzles where knowing that the solution is unique helps.
      (Probably the right thing to do, but also extra work.)
  &lt;/ul&gt;

  &lt;p&gt;
    I originally went with the third option, and that was a horrible
    mistake. It turns out that I&#039;d only considered one way in which
    the uniqueness of the solution leaks information, and that&#039;s
    indeed pretty rare. But there are others, and one was present in
    basically every level I&#039;d generated, and often kind of trivialized
    the solution. So in May 2019 I updated the Hard and Expert mode
    levels to go with the last option instead.
  &lt;/p&gt;

  &lt;p&gt;
    The most annoying case is the 2 with the dotted line in the
    following board:
  &lt;/p&gt;

  &lt;img src=&#039;/blog/stc/images/procedural-puzzle/uncontested.png&#039;&gt;

  &lt;p&gt;
    Why could a sneaky player make that deduction? The 2 can cover
    any of the 4 adjacent squares. None of them have any dots, so they
    don&#039;t necessarily need to be covered by anything. And the square
    that&#039;s downwards doesn&#039;t have any overlap with other pieces. If
    there&#039;s a single solution, it has to be the case that other pieces
    cover the other three squares, and the 2 covers the downwards square.
  &lt;/p&gt;

  &lt;p&gt;
    The solution is to add some dots when these cases are detected,
    like this:
  &lt;/p&gt;

  &lt;img src=&#039;/blog/stc/images/procedural-puzzle/ambiguate.png&#039;&gt;

  &lt;p&gt;
    Another common case was the dotted 2 on this board:
  &lt;/p&gt;

  &lt;img src=&#039;/blog/stc/images/procedural-puzzle/unique.png&#039;&gt;

  &lt;p&gt;
    Nothing distinguishes the squares to the left and up of the 2.
    Neither has a dot, and neither is reachable from any other number.
    Any solution where the 2 covers the upward square would have a
    matching solution where it covers the leftward square instead, and
    vice versa. If there&#039;s a single unique solution, it can&#039;t be either
    and thus the 2 must cover the downward square instead.
  &lt;/p&gt;

  &lt;p&gt;
    This kind of case I just solved by the &amp;quot;if it hurts, just don&#039;t do
    it&amp;quot; method. I.e. having the solver use this rule very early on in
    the priority list, and assigning these moves a large negative
    weight. Puzzles with this kind of property will mostly end up
    discarded by the optimizer, and the few that make it through will
    be discarded when doing the final level selection for the
    published game.
  &lt;/p&gt;

  &lt;p&gt;
    This is not an exhaustive list; I found a lot of other
    unique-solution rules when adversarially play-testing. But most of
    them felt like they were rare and difficult enough to find that
    they&#039;re not really shortcuts. If somebody solves a puzzle using
    that kind of deduction, I&#039;m not going to begrudge them that.
  &lt;/p&gt;

  &lt;h3&gt;Conclusion&lt;/h3&gt;

  &lt;p&gt;
    The game was originally designed as an experiment for procedural
    puzzle generation. The game design and the generator go hand in
    hand, so the exact techniques won&#039;t be directly applicable to
    existing games.
  &lt;/p&gt;

  &lt;p&gt;
    The part I can&#039;t answer is whether putting this much effort into
    the procedural generation was worth it. The feedback from
    players has been pretty inconsistent when it comes to the level design.
    A common theme for positive comments has been about how the
    puzzles always feel like there&#039;s a clever gotcha in there.
    The most common complaint has been that there&#039;s not
    enough of a difficulty gradient in the game.
  &lt;/p&gt;

  &lt;p&gt;
    I have a couple of other puzzle games in an embryonic stage, and
    feel good enough about this generator that I&#039;d probably at least
    try similar procedural generation methods for those too. One
    thing I&#039;d definitely do differently the next time around is
    to do adversarial playtesting from the start.
  &lt;/p&gt;

  &lt;h3&gt;Footnotes&lt;/h3&gt;

  &lt;div class=footnotes&gt;
    &lt;p&gt;
      &lt;a id=&#039;fn0&#039;&gt;[&lt;a href=&#039;#fnref0&#039;&gt;0&lt;/a&gt;] Or at least that&#039;s what I believed. But when I observed a
        bunch of players in person, about half of them just
        made guesses and then iterated on those guesses. Oh, well.
    &lt;/p&gt;
    &lt;p&gt;
      &lt;a id=&#039;fn1&#039;&gt;[&lt;a href=&#039;#fnref1&#039;&gt;1&lt;/a&gt;] Anyone reading this should also read
        &lt;a href=&#039;https://magnushoff.com/minesweeper/&#039;&gt;Solving Minesweeper
          and making it better&lt;/a&gt; by Magnus Hoff, which has a fascinating
        twist on the perceived need for puzzle games with hidden information
        to be guaranteed solvable.
    &lt;/p&gt;
    &lt;p&gt;
      &lt;a id=&#039;fn2&#039;&gt;[&lt;a href=&#039;#fnref2&#039;&gt;2&lt;/a&gt;] Just to be clear, this depth / narrowness of the tree is a
        metric that I thought was meaningful to this puzzle, not something
        that&#039;s going to be applicable to all or even most puzzles. For
        example there&#039;s a &lt;a href=&#039;https://web.archive.org/web/20130703141244/http://www.thinkfun.com/microsite/rushhour/creating2500challenges&#039;&gt;good argument&lt;/a&gt;
        to be made that a Rush Hour puzzle
        is interesting if there are multiple paths to the solution
        of almost but not quite the same length. But that&#039;s because Rush
        Hour is a game of finding the shortest solution, not just some
        solution.
    &lt;/p&gt;
    &lt;p&gt;
      &lt;a id=&#039;fn3&#039;&gt;[&lt;a href=&#039;#fnref3&#039;&gt;3&lt;/a&gt;] With the exception of adding 1s. The first version of the puzzle
        didn&#039;t have the dots, and the plan was to have the generator add 1s
        when it needed to add more information. But that felt a little too
        constrained.
    &lt;/p&gt;
  &lt;/div&gt;
</description><author>jsnell@iki.fi</author><category>GAMES</category><pubDate>Tue, 14 May 2019 15:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2019-05-14-procedural-puzzle-generator/</guid></item><item><title>Optimizing a breadth-first search</title><link>https://www.snellman.net/blog/archive/2018-07-23-optimizing-breadth-first-search/</link><description>
&lt;img src=&#039;/blog/stc/images/sb-thumb.png&#039; style=&#039;float: right; margin: 16px&#039;&gt;

&lt;p&gt;
  A couple of months ago I finally had to admit I wasn&#039;t smart enough to
  solve a few of the  levels in &lt;a href=&#039;http://snakebird.noumenongames.com/&#039;&gt;
    Snakebird&lt;/a&gt;, a puzzle game.
  The only way to salvage
  some pride was to write a solver, and pretend that writing
  a program to do the solving is basically as good as having solved
  the problem myself. The C++ code for the resulting program
  is &lt;a href=&#039;https://github.com/jsnell/snakebird&#039;&gt;on GitHub&lt;/a&gt;.
  Most of what&#039;s discussed in the post is implemented in
  &lt;a href=&#039;https://github.com/jsnell/snakebird/blob/master/src/search.h&#039;&gt;
     search.h&lt;/a&gt; and
  &lt;a href=&#039;https://github.com/jsnell/snakebird/blob/master/src/compress.h&#039;&gt;
    compress.h&lt;/a&gt;. This post deals mainly with optimizing a
  breadth-first search that&#039;s estimated to use 50-100GB of memory to
  run on a memory budget of 4GB.
&lt;/p&gt;

&lt;p&gt;
  There will be a follow up post that deals with the specifics of the game.
  For this post, all you need to know is
  that I could not see any good alternatives to the brute force
  approach, since none of the usual tricks worked. There are a lot of states
  since there are multiple movable or pushable objects, and the shape of
  some of them matters and changes during the game.
  There were no viable conservative
  heuristics for algorithms like A* to narrow down the search
  space. The search graph was directed and implicit, so
  searching both forward and backward simultaneously was not possible.
  And a single move could cause the state to change in a lot of unrelated
  ways, so nothing like &lt;a href=https://en.wikipedia.org/wiki/Zobrist_hashing&gt;
    Zobrist hashing&lt;/a&gt; was going to be viable.
&lt;/p&gt;

&lt;p&gt;
  A back of the envelope calculation suggested that the biggest
  puzzle was going to have on the order of 10 billion states after
  eliminating all symmetries. Even after packing the state
  representation as tightly as possible, the state size was on the
  order of 8-10 bytes depending on the puzzle. 100GB of memory would
  be trivial at work, but this was my home machine with 16GB of
  RAM. And since Chrome needs 12GB of that, my actual memory budget
  was more like 4GB. Anything in excess of that would have to go to
  disk (the spinning rust kind).
&lt;/p&gt;

&lt;read-more&gt;&lt;/read-more&gt;

&lt;p&gt;
  How do we fit 100GB of data into 4GB of RAM?  Either a) the states
  would need to be compressed to 1/20th of their original already
  optimized size, b) the algorithm would need to be able to
  efficiently page state to disk and back, c) a combination of the
  above, or d) I should buy more RAM or rent a big VM for a few
  days. Option D was out of the question due to being boring. Options
  A and C seemed out of the question after a proof of concept with
  gzip: a 50MB blob of states compressed to about 35MB. That&#039;s about 7
  bytes per state, while my budget was more like 0.4 bytes per
  state. So option B it was, even though a breadth-first search looks
  pretty hostile to secondary storage.
&lt;/p&gt;

&lt;h2&gt;Table of contents&lt;/h2&gt;

&lt;p&gt;
  This is a somewhat long post, so here&#039;s a brief overview of the
  sections ahead:
&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&#039;#textbook-bfs&#039;&gt;A textbook BFS&lt;/a&gt; - What&#039;s the normal formulation
    of breadth-first search like, and why is it not suitable for storing
    parts of the state on disk?
  &lt;li&gt;&lt;a href=&#039;#sort-merge&#039;&gt;A sort + merge BFS&lt;/a&gt; - Changing the algorithm
    to efficiently do deduplications in batches.
  &lt;li&gt;&lt;a href=&#039;#compression&#039;&gt;Compression&lt;/a&gt; - Reducing the memory
    use by 100x with a combination of off-the-shelf and custom compression.
  &lt;li&gt;&lt;a href=&#039;#cheated&#039;&gt;Oh no, I&#039;ve cheated!&lt;/a&gt; - The first
    few sections glossed over something; it&#039;s not enough to know there
    is a solution, we need to know what the solution is.
    In this section the basic algorithm is updated to carry around
    enough data to reconstruct a solution from the final state.
  &lt;li&gt;&lt;a href=&#039;#sort-merge-multiple-outputs&#039;&gt;Sort + merge with multiple outputs&lt;/a&gt; -
    Keeping more state totally negates the compression gains. The
    sort + merge algorithm needs to be updated to keep two outputs:
    one that compresses well used during the search, and another
    that&#039;s just used to reconstruct the solution after one is found.
  &lt;li&gt;&lt;a href=&#039;#swapping&#039;&gt;Swapping&lt;/a&gt; - Swapping on Linux sucks
    even more than I thought.
  &lt;li&gt;&lt;a href=&#039;#compression-before-merging&#039;&gt;Compressing new states before merging&lt;/a&gt; - So far the memory optimizations have just been concerned with the visited
    set. But it turns out that the list of newly generated states is much
    larger than one might think. This section shows a scheme for representing
    the new states more efficiently.
  &lt;li&gt;&lt;a href=&#039;#parent-states&#039;&gt;Saving space on the parent states&lt;/a&gt; -
    Investigate some CPU/memory tradeoffs for reconstructing the solution
    at the end.
  &lt;li&gt;&lt;a href=&#039;#did-not-work&#039;&gt;What didn&#039;t or might not work&lt;/a&gt; -
    Some things that looked promising but I ended up reverting, and
    others that research suggested would work but my intuition said
    wouldn&#039;t for this case.
&lt;/ul&gt;

&lt;a id=&#039;textbook-bfs&#039;&gt;&lt;/a&gt;
&lt;h2&gt;A textbook BFS&lt;/h2&gt;

&lt;p&gt;
  So what does a breadth-first search look like, and why would it be
  disk-unfriendly? Before this little project I&#039;d only ever seen
  variants of the textbook formulation, something like this:
&lt;/p&gt;

&lt;pre&gt;
def bfs(graph, start, end):
    visited = {start}
    todo = [start]
    while todo:
        node = todo.pop_first()
        if node == end:
            return True
        for kid in adjacent(node):
            if kid not in visited:
                visited.add(kid)
                todo.push_back(kid)
    return False
&lt;/pre&gt;

&lt;p&gt;
  As the program produces new candidate nodes, each node is checked
  against a hash table of already visited nodes. If it&#039;s already
  present in the hash table, we ignore the node. Otherwise it&#039;s added
  both to the queue and the hash table. Sometimes the &#039;visited&#039;
  information is carried in the nodes rather than in a side-table; but
  that&#039;s a dodgy optimization to start with, and totally impossible
  when the graph is implicit rather than explicit.
&lt;/p&gt;

&lt;p&gt;
  Why is a hash table problematic? Because hash tables will tend to
  have a totally random memory access pattern. If they don&#039;t, it&#039;s a
  bad hash function and the hash table will probably perform terribly
  due to collisions. This random access pattern can cause performance
  issues even when the data fits in memory: an access to a huge
  hash table is pretty likely to cause both a cache and TLB miss. But
  if a significant chunk of the data is actually on disk rather than
  in memory? It&#039;d be disastrous: something on the order of 10ms per
  lookup.
&lt;/p&gt;

&lt;p&gt; With 10G unique states we&#039;d be looking at about four months of
  waiting for disk IO just for the hash table accesses. That can&#039;t
  work; the problem absolutely needs to be transformed such that the
  program can process big batches of data in one go.  &lt;/p&gt;

&lt;a id=&#039;sort-merge&#039;&gt;&lt;/a&gt;
&lt;h2&gt;A sort + merge BFS&lt;/h2&gt;

&lt;p&gt;
  If we wanted to batch the data access as much as possible, what would
  be the maximum achievable coarseness? Since the program can&#039;t know which nodes
  to process on depth layer N+1 before layer N has been fully processed, it seems
  obvious that we have to do our deduplication of states at least once
  per depth.
&lt;/p&gt;

&lt;p&gt;
  Dealing with a whole layer at one time allows ditching hash tables,
  and representing the visited set and the new states as sorted
  streams of some sort (e.g. file streams, arrays, lists). We can
  trivially find the new visited set with a set union on the streams, and
  equally trivially find the todo set with a set difference.
&lt;/p&gt;

&lt;p&gt;
  The two set operations can be combined to work on a single pass
  through both streams. Basically peek into both streams, process the
  smaller element, and then advance the stream that the element came
  from (or both streams if the elements at the head were equal).  In
  either case, add the element to the new visited set. When advancing
  just the stream of new states, also add the element to the new todo
  set:
&lt;/p&gt;

&lt;pre&gt;
def bfs(graph, start, end):
    visited = Stream()
    todo = Stream()
    visited.add(start)
    todo.add(start)
    while todo:
        new = []
        for node in todo:
            if node == end:
                return True
            for kid in adjacent(node):
                new.push_back(kid)
        new_stream = Stream()
        for node in new.sorted().uniq():
            new_stream.add(node)
        todo, visited = merge_sorted_streams(new_stream, visited)
    return False

# Merges sorted streams new and visited. Return a sorted stream of
# elements that were just present in new, and another sorted
# stream containing the elements that were present in either or
# both of new and visited.
def merge_sorted_streams(new, visited):
    out_todo, out_visited = Stream(), Stream()
    while visited or new:
        if visited and new:
            if visited.peek() == new.peek():
                out_visited.add(visited.pop())
                new.pop()
            elif visited.peek() &lt; new.peek():
                out_visited.add(visited.pop())
            elif visited.peek() &gt; new.peek():
                out_todo.add(new.peek())
                out_visited.add(new.pop())
        elif visited:
            out_visited.add(visited.pop())
        elif new:
            out_todo.add(new.peek())
            out_visited.add(new.pop())
    return out_todo, out_visited
&lt;/pre&gt;

&lt;p&gt;
  The data access pattern is now perfectly linear and predictable;
  there are no random accesses at all during the merge. Disk latency
  thus becomes irrelevant, and the only thing that matters is
  throughput.
&lt;/p&gt;
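
&lt;p&gt;
  For anyone who wants to run the idea rather than read pseudocode,
  here&#039;s a minimal Python sketch of the same sort + merge search. It
  assumes plain sorted lists in place of the abstract Stream, and the
  function names and the toy successor function are made up for
  illustration; they aren&#039;t taken from the actual solver.
&lt;/p&gt;

```python
def merge_sorted(new, visited):
    """Merge two sorted, duplicate-free lists in a single pass.

    Returns (todo, merged): todo holds the elements found only in new,
    merged holds the union of both inputs. Both outputs are sorted.
    """
    todo, merged = [], []
    i = j = 0
    while i != len(new) or j != len(visited):
        if j == len(visited) or (i != len(new) and visited[j] > new[i]):
            # Head of new is the smaller one (or visited is exhausted),
            # so this is a state we have not seen before.
            todo.append(new[i])
            merged.append(new[i])
            i += 1
        else:
            # Head of visited is smaller or equal. On equality the copy
            # in new is a duplicate and gets dropped.
            if i != len(new) and new[i] == visited[j]:
                i += 1
            merged.append(visited[j])
            j += 1
    return todo, merged


def sort_merge_bfs(start, end, adjacent):
    """Return True if end is reachable from start.

    Deduplication happens once per depth layer rather than once per node.
    """
    visited = [start]
    todo = [start]
    while todo:
        new = []
        for node in todo:
            if node == end:
                return True
            new.extend(adjacent(node))
        # Sort and deduplicate the whole layer in one batch, then merge.
        todo, visited = merge_sorted(sorted(set(new)), visited)
    return False


# Toy implicit graph (an assumption for the demo): from n you can move
# to n + 1 or 2 * n, as long as the result stays at or below 50.
def toy_adjacent(n):
    return [m for m in (n + 1, 2 * n) if 50 >= m]


print(sort_merge_bfs(1, 37, toy_adjacent))
print(sort_merge_bfs(5, 3, toy_adjacent))
```

&lt;p&gt;
  With on-disk streams the structure would stay the same; only the
  list indexing would be replaced by sequential reads and writes.
&lt;/p&gt;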

&lt;p&gt;
  What does the theoretical performance look like with the simplified
  data distribution of 100 depth levels and 100M states per depth?
  The average state will be both read and written 50 times. That&#039;s
  10 bytes/state * 5G states * 50 = 2.5TB. My hard drive can supposedly
  read and write at a sustained
  100MB/s, which would mean (2 * 2.5TB) / (100MB/s) =~ 50k seconds
  =~ 13 hours spent on the IO. That&#039;s a couple of orders of magnitude
  better than the earlier four month estimate!
&lt;/p&gt;
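
&lt;p&gt;
  The same back of the envelope estimate written out as a few lines of
  Python, in case you want to plug in your own numbers. The constants
  are just the figures from the paragraph above; the variable names
  are mine.
&lt;/p&gt;

```python
BYTES_PER_STATE = 10
TOTAL_STATES = 5 * 10**9           # 5G states, as in the estimate above
PASSES_PER_STATE = 50              # average times a state is read and written
DISK_BYTES_PER_SEC = 100 * 10**6   # 100MB/s sustained throughput

bytes_one_way = BYTES_PER_STATE * TOTAL_STATES * PASSES_PER_STATE
io_seconds = 2 * bytes_one_way / DISK_BYTES_PER_SEC   # reads plus writes
io_hours = io_seconds / 3600

print(bytes_one_way / 10**12, "TB in each direction")
print(io_seconds, "seconds of IO")
print(round(io_hours, 1), "hours of IO")
```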

&lt;p&gt;
  It&#039;s worth noting that this simplistic model is not considering the
  size of the newly generated states. Before the merge step, they need
  to be kept in-memory for the sorting + deduplication. We&#039;ll look
  closer at that in a later section.
&lt;/p&gt;

&lt;a id=&#039;compression&#039;&gt;&lt;/a&gt;
&lt;h2&gt;Compression&lt;/h2&gt;

&lt;p&gt;
  In the introduction I mentioned that compressing the states didn&#039;t
  look very promising in the initial experiments, with a 30% compression
  ratio. But after the above algorithm change the states are now ordered.
  That should be a lot easier to compress.
&lt;/p&gt;

&lt;p&gt;
  To test this theory, I used zstd on a puzzle of 14.6M states, with each
  state being 8 bytes. After the sorting they compressed to an average of
  1.4 bytes per state. That seems like a solid improvement. Not quite
  enough to run the whole program in memory, but it could plausibly
  cut the disk IO to just a couple of hours.
&lt;/p&gt;

&lt;p&gt;
  Is there any way to do better than a state of the art general
  purpose compression algorithm, if you know something about the
  structure of the data? Almost certainly. One good example is the PNG
  format. Technically the compression is just a standard Deflate
  pass. But rather than compress the raw image data, the image is
  first transformed
  using &lt;a href=&#039;https://www.w3.org/TR/PNG-Filters.html&#039;&gt;PNG filters&lt;/a&gt;.
  A PNG filter is basically a formula for predicting the value of a
  byte in the raw data from the value of the same byte on the
  previous row and/or the same byte of the previous pixel. For example the
  &#039;up&#039; filter transforms each byte by subtracting the previous row&#039;s
  value from it during compression, and doing the inverse when
  decompressing. Given the kinds of images PNG is meant for, the
  result will probably mostly consist of zeroes or numbers close to
  zero. Deflate can compress these far better than the raw data.
&lt;/p&gt;

&lt;p&gt;
  Can we apply a similar idea to the state records of the BFS? Seems
  like it should be possible. Just like in PNGs, there&#039;s a fixed row
  size, and we&#039;d expect adjacent rows to be very similar. The first
  tries with a subtraction/addition filter followed by zstd resulted
  in another 40% improvement in compression ratios: 0.87 bytes per
  state. The filtering operations are trivial, so this was basically
  free from a CPU consumption point of view.
&lt;/p&gt;
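
&lt;p&gt;
  As a rough illustration of what such a subtraction filter looks
  like, here&#039;s a small Python sketch. It makes a few assumptions:
  zlib from the standard library stands in for zstd, the records are a
  fixed 8 bytes wide, and the input is synthetic random data rather
  than real puzzle states, so the sizes it prints say nothing about
  the ratios quoted above.
&lt;/p&gt;

```python
import random
import zlib

ROW = 8  # bytes per state record (assumed fixed width)


def sub_filter(data, row=ROW):
    """Replace each byte with its difference (mod 256) from the byte one row up."""
    out = bytearray(data[:row])  # the first row has no predecessor
    for i in range(row, len(data)):
        out.append((data[i] - data[i - row]) % 256)
    return bytes(out)


def sub_unfilter(data, row=ROW):
    """Inverse of sub_filter: add back the already reconstructed previous row."""
    out = bytearray(data[:row])
    for i in range(row, len(data)):
        out.append((data[i] + out[i - row]) % 256)
    return bytes(out)


# Synthetic stand-in for a sorted set of states: random 40-bit values,
# stored big-endian in 8 bytes each and sorted.
rng = random.Random(0)
states = sorted(rng.getrandbits(40) for _ in range(20000))
raw = b"".join(s.to_bytes(ROW, "big") for s in states)

filtered = sub_filter(raw)
assert sub_unfilter(filtered) == raw  # the filter is lossless

print("compressed, no filter:", len(zlib.compress(raw, 9)))
print("compressed, filtered: ", len(zlib.compress(filtered, 9)))
```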

&lt;p&gt;
  It wasn&#039;t clear if one could do a lot better than that, or whether
  this was a practical limit. In image data there&#039;s a reasonable
  expectation of similarity between adjacent bytes of the same row.
  For the state data that&#039;s not true. But actually slightly more
  sophisticated filters could still improve on that number. The
  one I ended up using worked like this:
&lt;/p&gt;

&lt;p&gt;
  Let&#039;s assume we have adjacent rows R1 = [1, 2, 3, 4] and R2 = [1, 2,
  6, 4].  When outputting R2, we compare each byte to the same byte on
  the previous row, with a 0 for match and 1 for mismatch: diff = [0,
  0, 1, 0]. We then emit that bitmap encoded as a VarInt, followed by
  just the bytes that did not match the previous row. In this example, the
  two bytes &#039;0b00000100 6&#039;. This filter
  alone compressed the benchmark to 2.2 bytes / state. But combining
  this filter + zstd got it down to 0.42 bytes / state. Or to put it
  another way, that&#039;s 3.36 bits per state, which is just a little bit
  over what the back of the envelope calculation suggested was needed
  to fit in RAM.
&lt;/p&gt;
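
&lt;p&gt;
  Here&#039;s a small Python sketch of that filter and its inverse. Two
  details are my assumptions rather than something spelled out above:
  bit i of the bitmap stands for byte i of the row (which matches the
  example), and the first row is diffed against an all-zero row. The
  real implementation lives in compress.h and may well differ.
&lt;/p&gt;

```python
def encode_varint(n):
    """Encode a non-negative int as a base-128 VarInt, low group first."""
    out = bytearray()
    while True:
        n, low = divmod(n, 128)  # peel off the low 7 bits
        if n == 0:
            out.append(low)
            return bytes(out)
        out.append(low + 128)    # continuation bit: more groups follow


def decode_varint(buf, pos):
    """Decode a VarInt starting at buf[pos]; return (value, next position)."""
    value, scale = 0, 1
    while True:
        byte = buf[pos]
        pos += 1
        value += (byte % 128) * scale
        scale *= 128
        if 128 > byte:           # no continuation bit, this was the last group
            return value, pos


def encode_rows(rows):
    """Per row, emit a VarInt bitmap of changed byte positions plus the changed bytes."""
    out = bytearray()
    prev = bytes(len(rows[0])) if rows else b""
    for row in rows:
        changed = [i for i in range(len(row)) if row[i] != prev[i]]
        out += encode_varint(sum(2 ** i for i in changed))
        out += bytes(row[i] for i in changed)
        prev = row
    return bytes(out)


def decode_rows(buf, width):
    """Rebuild the rows by patching the previous row with the changed bytes."""
    rows, pos = [], 0
    prev = bytes(width)
    while pos != len(buf):
        bitmap, pos = decode_varint(buf, pos)
        row = bytearray(prev)
        for i in range(width):
            if (bitmap // 2 ** i) % 2:
                row[i] = buf[pos]
                pos += 1
        prev = bytes(row)
        rows.append(prev)
    return rows


rows = [bytes([1, 2, 3, 4]), bytes([1, 2, 6, 4])]
encoded = encode_rows(rows)
print(list(encoded))
assert decode_rows(encoded, 4) == rows
```

&lt;p&gt;
  The last two bytes of the output are the bitmap 0b00000100 and the
  changed byte 6, as in the example above. The output of this filter
  would then be fed to the general purpose compressor.
&lt;/p&gt;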

&lt;p&gt;
  In practice the compression ratios improve as the sorted sets get
  more dense. Once the search gets to a point where memory starts
  becoming an issue, the compression ratios can get a lot better than
  that. The largest problem turned out to have 4.6G distinct visited states in
  the end. These states took 405MB when sorted and compressed with the
  above scheme. That&#039;s &lt;b&gt;0.7 bits per state&lt;/b&gt;. The compression and
  decompression end up taking about 25% of the program&#039;s CPU time,
  but that seems like a great tradeoff for cutting memory use to 1/100th.
&lt;/p&gt;

&lt;p&gt;
  The filter above does feel a bit wasteful due to the VarInt
  header on every row. It seems like it should be easy to improve on it with very
  little extra cost in CPU or complexity. I tried a bunch of other
  variants that transposed the data to a column-major order, or wrote
  the bitmasks in bigger blocks, etc. These variants invariably got much
  better compression ratio by themselves, but then didn&#039;t do as well
  when the output of the filter was compressed with zstd. It wasn&#039;t
  just due to some quirk of zstd either; the results were similar with
  gzip and bzip2. I don&#039;t have any great theories on why this
  particular encoding ended up compressing much better than the
  alternatives.
&lt;/p&gt;

&lt;p&gt;
  Another mystery is that the compression ratio ended up far better when the
  data was sorted little-endian rather than big-endian. I initially thought
  it was due to the little-endian sort ending up with more leading zeros
  on the VarInt-encoded bitmask. But this difference persisted even for
  filters that didn&#039;t have such dependencies.
&lt;/p&gt;

&lt;p&gt;
  (There&#039;s a lot of research on compressing sorted sets of integers,
  since they&#039;re a basic building block of search engines. I didn&#039;t find
  a lot on compressing sorted fixed-size records though, and didn&#039;t want
  to start jumping through the hoops of representing my data as arbitrary
  precision integers.)
&lt;/p&gt;

&lt;a id=&#039;cheated&#039;&gt;&lt;/a&gt;
&lt;h2&gt;Oh no, I&#039;ve cheated!&lt;/h2&gt;

&lt;p&gt;
  You might have noticed that the above pseudocode implementations of
  BFS were only returning a boolean for solution found / not found.
  That&#039;s not very useful. For most purposes you need to be
  able to produce a list of the exact steps of the solution, not
  just state that a solution exists.
&lt;/p&gt;

&lt;p&gt;
  On the surface the solution is easy. Rather than collect sets of
  states, collect mappings from states to a parent state. Then after
  finding a solution, just trace back the list of parent states from
  the end to the start. For the hash table based solution, it&#039;d be
  something like:
&lt;/p&gt;

&lt;pre&gt;
def bfs(graph, start, end):
    visited = {start: None}
    todo = [start]
    while todo:
        node = todo.pop_first()
        if node == end:
            return trace_solution(node, visited)
        for kid in adjacent(node):
            if kid not in visited:
                visited[kid] = node
                todo.push_back(kid)
    return None

def trace_solution(state, visited):
  if state is None:
    return []
  return trace_solution(visited[state], visited) + [state]
&lt;/pre&gt;

&lt;p&gt;
  Unfortunately this will totally kill the compression gains
  from the last section, for two reasons. First, the core assumption was that adjacent rows
  would be very similar. That was true when we just looked at the
  states themselves. But there is no reason to believe that&#039;s going to
  be true for the parent states; they&#039;re effectively random data.
  Second, the sort + merge solution has to read and write back all
  seen states on each iteration. To maintain the state / parent state
  mapping, we&#039;d also have to read and write all this badly compressing
  data to disk on each iteration.
&lt;/p&gt;

&lt;a id=&#039;sort-merge-multiple-outputs&#039;&gt;&lt;/a&gt;
&lt;h2&gt;Sort + merge with multiple outputs&lt;/h2&gt;

&lt;p&gt;
  The program only needs the state/parent mappings at the very end,
  when tracing back the solution. We can thus maintain two data
  structures in parallel. &#039;Visited&#039; is still the set of visited
  states, and gets recomputed during the merge just like before.
  &#039;Parents&#039; is a mostly sorted list of state/parent pairs, which
  doesn&#039;t get rewritten. Instead the new states + their parents get
  appended to &#039;parents&#039; after each merge operation.
&lt;/p&gt;

&lt;pre&gt;
def bfs(graph, start, end):
    parents = Stream()
    visited = Stream()
    todo = Stream()
    parents.add((start, None))
    visited.add(start)
    todo.add(start)
    while todo:
        new = []
        for node in todo:
            if node == end:
                return trace_solution(node, parents)
            for kid in adjacent(node):
                new.push_back((kid, node))
        new_stream = Stream()
        for node in new.sorted().uniq():
            new_stream.add(node)
        todo, visited = merge_sorted_streams(new_stream, visited, parents)
    return None

# Merges sorted streams new and visited. New contains pairs of
# key + value (just the keys are compared), visited contains just
# keys.
#
# Returns a sorted stream of keys that were just present in new,
# another sorted stream containing the keys that were present in either or
# both of new and visited. Also adds the keys + values to the parents
# stream for keys that were only present in new.
def merge_sorted_streams(new, visited, parents):
    out_todo, out_visited = Stream(), Stream()
    while visited or new:
        if visited and new:
            visited_head = visited.peek()
            new_head = new.peek()[0]
            if visited_head == new_head:
                out_visited.add(visited.pop())
                new.pop()
            elif visited_head &lt; new_head:
                out_visited.add(visited.pop())
            elif visited_head &gt; new_head:
                out_todo.add(new_head)
                out_visited.add(new_head)
                parents.add(new.pop())
        elif visited:
            out_visited.add(visited.pop())
        elif new:
            out_todo.add(new.peek()[0])
            out_visited.add(new.peek()[0])
            parents.add(new.pop())
    return out_todo, out_visited
&lt;/pre&gt;

&lt;p&gt;
This gives us the best of both worlds from a runtime and working
set perspective, but does mean using more secondary storage. A
separate copy of the visited states grouped by depth turns out
to also be useful later on for other reasons.
&lt;/p&gt;

&lt;a id=&#039;swapping&#039;&gt;&lt;/a&gt;
&lt;h2&gt;Swapping&lt;/h2&gt;

&lt;p&gt;
  Another detail ignored in the snippets of pseudocode is that there
  is no explicit code for disk IO, just an abstract interface
  Stream. The Stream might be a file stream or an in-memory array, but
  we&#039;ve been ignoring that implementation detail. Instead the
  pseudocode is concerned with having a memory access pattern that
  would be disk friendly. In a perfect world that&#039;d be enough, and the
  virtual memory subsystem of the OS would take care of the rest.
&lt;/p&gt;

&lt;p&gt;
  At least with Linux that doesn&#039;t seem to be the case. At one point
  (before the working set had been shrunk to fit in memory) I&#039;d gotten
  the program to run in about 11 hours when the data was stored mostly
  on disk. I then switched the program to use anonymous pages instead
  of file-backed ones, and set up sufficient swap on the same
  disk. After three days the program had gotten a quarter of the way
  through, and was still getting slower over time. My optimistic
  estimate was that it&#039;d finish in 20 days.
&lt;/p&gt;

&lt;p&gt;
  Just to be clear, this was exactly the same code and &lt;i&gt;exactly the
  same access pattern&lt;/i&gt;. The only thing that changed was whether the
  memory was backed by an explicit on-disk file or by swap. It&#039;s
  pretty much axiomatic that swapping tends to totally destroy
  performance on Linux, whereas normal file IO doesn&#039;t. I&#039;d
  always assumed it was due to programs having the gall to treat RAM
  as something to be randomly accessed. But that wasn&#039;t the case here.
&lt;/p&gt;

&lt;p&gt;
  Turns out that file-backed and anonymous pages are not treated
  identically by the VM subsystem after all. They&#039;re kept in separate
  LRU caches with different expiration policies, and they also appear
  to have different readahead / prefetching properties.
&lt;/p&gt;

&lt;p&gt;
  So now I know: Linux swapping will probably not work well even under
  optimal circumstances. If parts of the address space are likely to
  be paged out for a while, it&#039;s better to arrange manually for them to
  be file-backed than to trust swap. I did it by implementing a custom
  vector class that started off as a purely in-memory implementation, and
  after a size threshold is exceeded switches to mmap on an unlinked
  temporary file.
&lt;/p&gt;
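
&lt;p&gt;
  The actual class is C++, but the idea can be sketched in a few lines
  of Python. Everything here is hypothetical: the SpillBuffer name,
  the byte-oriented append-only interface, and the doubling growth
  policy are my choices for the example, not a description of the
  real vector class.
&lt;/p&gt;

```python
import mmap
import tempfile


class SpillBuffer:
    """Append-only byte buffer that moves to an mmap-ed temp file once it grows large."""

    def __init__(self, threshold):
        self.threshold = threshold  # size in bytes at which we stop using plain memory
        self.size = 0
        self.mem = bytearray()      # in-memory storage while the buffer is small
        self.file = None
        self.map = None

    def append(self, data):
        if self.map is None and self.size + len(data) > self.threshold:
            self._spill()
        if self.map is None:
            self.mem += data
        else:
            self._reserve(self.size + len(data))
            self.map[self.size:self.size + len(data)] = data
        self.size += len(data)

    def _spill(self):
        # TemporaryFile gives a file with no visible name (already unlinked on POSIX).
        self.file = tempfile.TemporaryFile()
        capacity = max(2 * self.size, mmap.PAGESIZE)
        self.file.truncate(capacity)
        self.map = mmap.mmap(self.file.fileno(), capacity)
        self.map[:self.size] = self.mem
        self.mem = bytearray()

    def _reserve(self, needed):
        capacity = len(self.map)
        if needed > capacity:
            while needed > capacity:
                capacity *= 2
            # Grow the file and map it again; the data lives in the file.
            self.map.close()
            self.file.truncate(capacity)
            self.map = mmap.mmap(self.file.fileno(), capacity)

    def view(self):
        """Return a copy of the stored bytes."""
        return bytes(self.mem) if self.map is None else self.map[:self.size]


buf = SpillBuffer(threshold=16)
buf.append(b"abcdefgh")             # still in memory
buf.append(b"ijklmnopqrstuvwxyz")   # crosses the threshold, now file-backed
print(buf.map is not None, buf.view())
```

&lt;p&gt;
  The point of the design is that the pages are file-backed from the
  kernel&#039;s point of view, so they get the page cache treatment
  instead of the swap treatment.
&lt;/p&gt;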

&lt;a id=&#039;compression-before-merging&#039;&gt;&lt;/a&gt;
&lt;h2&gt;Compressing new states before merging&lt;/h2&gt;

&lt;p&gt;
  In the simplified performance model the assumption was that there
  would be 100M new states per depth. That turned out not to be too
  far off reality (the most difficult puzzle peaked at about 150M
  unique new states from one depth layer). But it&#039;s also not the right
  thing to measure; the working set before the merge isn&#039;t related to
  just the unique states, but all the states that were output for this
  iteration. This measure peaks at 880M output states / depth. These 880M
  states a) need to be accessed with a random access pattern for the sorting,
  b) can&#039;t be compressed efficiently due to not being sorted, and c)
  need to be stored along with the parent state. That&#039;s a roughly 16GB
  working set.
&lt;/p&gt;

&lt;p&gt;
  The obvious solution would be to use some form of external sorting.
  Just write all the states to disk, do an external sort, do a
  deduplication, and then execute the merge just as before. This is
  the solution I went with first, but while it mostly solved problem
  A, it did nothing for B and C.
&lt;/p&gt;

&lt;p&gt;
  The alternative I ended up with was to collect the states into an
  in-memory array. If the array grows too large (e.g. more than 100M
  elements), it&#039;s sorted, deduplicated and compressed. This gives us
  a bunch of sorted runs of states, with no duplicates inside the run
  but potentially some between the runs. The code for merging the
  new and visited states is fundamentally the same; it&#039;s still based
  on walking through the streams in lockstep. The only change is that
  instead of walking through just the two streams, there&#039;s a separate
  stream for each of the sorted runs of new states.
&lt;/p&gt;

&lt;p&gt;
  The compression ratios for these 100M state runs are of course not
  quite as good as for compressing the set of all visited states. But
  even so, it cuts down both the working set and the disk IO
  requirements by a ton. There&#039;s a little bit of extra CPU from having
  to maintain a priority queue of streams, but it was still a great
  tradeoff.
&lt;/p&gt;
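
&lt;p&gt;
  A minimal Python sketch of the multi-run merge, under some
  simplifying assumptions: the runs are plain sorted lists instead of
  compressed streams, the parent states are left out, and
  heapq.merge from the standard library plays the role of the priority
  queue of streams. The function names are made up for the example.
&lt;/p&gt;

```python
import heapq
from itertools import groupby


def make_runs(states, run_size):
    """Split an unsorted sequence into sorted, duplicate-free runs of bounded size."""
    runs, buf = [], []
    for s in states:
        buf.append(s)
        if len(buf) == run_size:
            runs.append(sorted(set(buf)))
            buf = []
    if buf:
        runs.append(sorted(set(buf)))
    return runs


def merge_runs(runs, visited):
    """Merge all runs against the sorted visited list in a single pass.

    Returns (todo, merged): todo holds states seen in some run but not
    in visited, merged is the new visited set. Both are sorted.
    """
    # heapq.merge walks the runs in lockstep; groupby collapses the
    # duplicates that can still exist between different runs.
    new = (key for key, _ in groupby(heapq.merge(*runs)))
    todo, merged = [], []
    j = 0
    for state in new:
        # Copy over every visited state smaller than the current new state.
        while j != len(visited) and state > visited[j]:
            merged.append(visited[j])
            j += 1
        if j != len(visited) and visited[j] == state:
            continue  # already visited; it is copied on a later step
        todo.append(state)
        merged.append(state)
    merged.extend(visited[j:])
    return todo, merged


runs = make_runs([5, 1, 5, 3, 1, 9, 7, 3], run_size=3)
print(runs)
print(merge_runs(runs, [3, 4]))
```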

&lt;a id=&#039;parent-states&#039;&gt;&lt;/a&gt;
&lt;h2&gt;Saving space on the parent states&lt;/h2&gt;

&lt;p&gt;
  At this point the vast majority of the space used by this program is
  spent on storing the parent states, so that we can reconstruct the
  solution after finding it. They are unlikely to compress well, but
  is there maybe a CPU/memory tradeoff to be made?
&lt;/p&gt;

&lt;p&gt;
  What we need is a mapping from a state S&#039; at depth D+1 to its parent
  state S at depth D. If we could iterate all possible parent states
  of S&#039;, we could simply check if any of them appear at depth D in our
  visited set. (We&#039;ve already produced the visited set grouped by
  depth as a convenient byproduct when outputting the state/parent
  mappings from merge). Unfortunately that doesn&#039;t work for this
  problem; it&#039;s simply too hard to generate all the possible states S
  given S&#039;. It&#039;d probably work just fine for many other search problems
  though.
&lt;/p&gt;

&lt;p&gt;
  If we can only generate the state transitions forward, not backward,
  how about just doing that then? Let&#039;s iterate through all the states at
  depth D, and see what output states they have. If some state produces S&#039;
  as an output, we&#039;ve found a workable S. The issue with the plan is that
  it increases the total CPU usage of the program by 50%. (Not 100%, since
  on average we find S after looking at half the states of depth D).
&lt;/p&gt;

&lt;p&gt;
  So I don&#039;t like either of the extremes, but at least there is a
  CPU/memory tradeoff available there. Is there maybe a more palatable
  option somewhere in the middle? What I ended up doing was to not
  store the pair (S&#039;, S), but instead (S&#039;, H(S)), where H is an 8 bit
  hash function. To find an S given S&#039;, again iterate through all the
  states at depth D. But before doing anything else, compute the same
  hash. If the output doesn&#039;t match H(S), this isn&#039;t the state we&#039;re
  looking for, and we can just skip it. This optimization means doing
  the expensive re-computation for just 1/256 of the states, which is a
  negligible CPU increase, while cutting down the memory spent
  for storing the parent states from 8-10 bytes to 1 byte.
&lt;/p&gt;
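
&lt;p&gt;
  Sketched in Python, the reconstruction could look something like
  the following. The hash function (one byte of BLAKE2b over the
  repr of the state), the per-depth lists in &quot;layers&quot; and the
  dictionary of parent hashes are all assumptions for the sake of a
  runnable example; the real program uses its own state encoding and
  on-disk layout.
&lt;/p&gt;

```python
import hashlib


def h8(state):
    """8-bit hash of a state (hypothetical choice: a one byte BLAKE2b digest)."""
    return hashlib.blake2b(repr(state).encode(), digest_size=1).digest()[0]


def find_parent(child, parent_hash, prev_layer, adjacent):
    """Find a state in prev_layer that produces child.

    The stored 8-bit hash is only a filter; a candidate that passes it
    is still verified by regenerating its successors.
    """
    for candidate in prev_layer:
        if h8(candidate) != parent_hash:
            continue                      # cheap rejection for most candidates
        if child in adjacent(candidate):  # the expensive re-computation
            return candidate
    return None


def trace_solution(end, depth, layers, parent_hashes, adjacent):
    """Walk back from end (found at the given depth) to the start state at depth 0."""
    path = [end]
    for d in range(depth, 0, -1):
        child = path[-1]
        path.append(find_parent(child, parent_hashes[child], layers[d - 1], adjacent))
    return path[::-1]


# Toy example: successors of n are n + 1 and 2 * n, starting from 1.
def toy_adjacent(n):
    return [n + 1, 2 * n]


layers = [[1], [2], [3, 4], [5, 6, 8]]  # visited states grouped by depth
parent_hashes = {2: h8(1), 3: h8(2), 4: h8(2), 5: h8(4), 6: h8(3), 8: h8(4)}
print(trace_solution(6, 3, layers, parent_hashes, toy_adjacent))
```

&lt;p&gt;
  A hash collision just means one wasted re-computation, never a wrong
  answer, since every candidate is checked against the actual
  successors.
&lt;/p&gt;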

&lt;a id=&#039;did-not-work&#039;&gt;&lt;/a&gt;
&lt;h2&gt;What didn&#039;t or might not work&lt;/h2&gt;

&lt;p&gt;
  The previous sections go through a sequence of high level
  optimizations that worked. There were other things that I tried
  that didn&#039;t work, or that I found in the literature but decided
  would not actually work in this particular case. Here&#039;s a non-exhaustive
  list.
&lt;/p&gt;

&lt;p&gt;
  At one point I was not recomputing the full visited set at every
  iteration. Instead it was kept as multiple sorted runs, and those
  runs were occasionally compacted. The benefit was fewer disk writes
  and less CPU spent on compression. The downside was more code
  complexity and a worse compression ratio. I originally thought this
  design made sense since in my setup writes were more expensive than
  reads. But in the end the compression ratio was worse by a factor of
  2. The tradeoffs are non-obvious, but in the end I reverted back to
  the simpler form.
&lt;/p&gt;

&lt;p&gt;
  There is a little bit of research done into executing huge breadth
  first searches for implicit graphs on secondary storage;
  a &lt;a href=&#039;https://www.cs.helsinki.fi/u/bmmalone/heuristic-search-fall-2013/Korf2008.pdf&#039;&gt;2008 survey paper&lt;/a&gt; is a good starting point. As one might
  guess, the idea of doing the deduplication in a batch with
  sort+merge, on secondary store, isn&#039;t novel. The surprising part is that it was
  apparently only discovered in 1993. That&#039;s pretty late! There
  are then some later proposals for secondary storage breadth first
  search that don&#039;t require a sorting step.
&lt;/p&gt;

&lt;p&gt;
  One of them was to map the states to integers, and to maintain an
  in-memory bitmap of the visited states. This is totally useless for
  my case, since the sizes of the encodable vs. actually reachable
  state spaces are so different. And I&#039;m a bit doubtful about there
  being any interesting problems where this approach works.
&lt;/p&gt;

&lt;p&gt;
  The other viable sounding alternative is based on temporary hash tables.
  The visited states are stored unsorted in a file. Store the outputs from
  depth D in a hash table. Then iterate through the visited states, and
  look them up in the hash table. If the element is found in the hash table,
  remove it. After iterating through the whole file, only the non-duplicates
  remain. They can then be appended to the file, and used to initialize the
  todo list for the next iteration. If the number of outputs is so large that
  the hash table doesn&#039;t fit in memory, both the files and the hash tables
  can be partitioned using the same criteria (e.g. top bits of state), with
  each partition getting processed independently.
&lt;/p&gt;
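
&lt;p&gt;
  As a rough illustration of the hash-based alternative described above, here is a
  minimal Python sketch. It is not taken from the cited papers or from my solver: the
  function names are made up for this example, a plain list stands in for the
  on-disk file of visited states, and states are assumed to be non-negative integers.
&lt;/p&gt;

```python
def dedup_with_hash_table(visited, new_states):
    """Return the states in new_states that do not appear in visited.

    visited stands in for the unsorted file of already seen states,
    new_states for the outputs generated at depth D.
    """
    candidates = set(new_states)       # the temporary hash table
    for state in visited:              # one sequential pass over the file
        candidates.discard(state)      # remove anything seen before
    return candidates


def bfs_step(visited, new_states):
    """Append the non-duplicates to visited and return the next todo list."""
    fresh = sorted(dedup_with_hash_table(visited, new_states))
    # Sorting is only here to make the demo deterministic; the
    # technique itself does not need any ordering.
    visited.extend(fresh)
    return fresh


def partition_by_top_bits(states, state_bits, top_bits):
    """Split states by their top bits so each part can be processed alone."""
    parts = [[] for _ in range(2 ** top_bits)]
    low_bits = state_bits - top_bits
    for state in states:
        parts[state // 2 ** low_bits].append(state)
    return parts
```

&lt;p&gt;
  If both the visited file and the new outputs are partitioned with the same
  &lt;code&gt;top_bits&lt;/code&gt;, partition N of one only ever needs to be compared against
  partition N of the other.
&lt;/p&gt;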

&lt;p&gt;
  While there are &lt;a href=&#039;https://pdfs.semanticscholar.org/d9b5/ca0e84ebf8566c34cf218aba1789af6d3111.pdf&#039;&gt;benchmarks&lt;/a&gt; claiming the hash-based approach is
  roughly 30% faster than sort+merge, the benchmarks don&#039;t really seem to
  consider compression. I just don&#039;t see how giving up the compression gains
  could be worth it, so didn&#039;t experiment with these approaches at all.
&lt;/p&gt;

&lt;p&gt;
  The other relevant branch of research that seemed promising was
  database query optimization. The deduplication problem seems very
  much related to database joins, with exactly the same
  &lt;a href=&#039;https://15721.courses.cs.cmu.edu/spring2018/papers/19-hashjoins/schuh-sigmod2016.pdf&#039;&gt;sort vs. hash&lt;/a&gt; &lt;a href=&#039;http://www.vldb.org/pvldb/vol7/p85-balkesen.pdf&#039;&gt;dilemma&lt;/a&gt;. Obviously some of these findings should
  carry over to a search problem. The difference
  might be that the output of a database join is transient, while the
  outputs of a BFS deduplication persist for the rest of the computation.
  It feels like that changes the tradeoffs: it&#039;s not just about how to
  process one iteration most efficiently, it&#039;s also about having the
  outputs in the optimal format for the next iteration.
&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;
  That concludes the things I learned from this project that seem
  generally applicable to other brute force search problems. These
  tricks combined to get the hardest puzzles of the game from
  an effective memory footprint of 50-100GB to 500MB, and degrading
  gracefully if the problem exceeds available memory and spills to
  disk. It is also
  50% faster than a naive hash table based state deduplication
  even for puzzles that fit into memory.
&lt;/p&gt;

&lt;p&gt;
  The next post will deal with optimizing grid-based spatial puzzle
  games in general, as well as some issues specific just to this
  particular game.
&lt;/p&gt;

&lt;p&gt;
  In the meantime, Snakebird is available at least on &lt;a href=&#039;https://store.steampowered.com/app/357300/Snakebird/&#039;&gt;Steam&lt;/a&gt;,
  &lt;a href=&#039;https://play.google.com/store/apps/details?id=com.NoumenonGames.SnakeBird_Touch&amp;hl=en&#039;&gt;Google Play&lt;/a&gt;, and the &lt;a href=&#039;https://itunes.apple.com/us/app/snakebird/id1087075743?mt=8&#039;&gt;App Store&lt;/a&gt;. I recommend it for anyone interested
  in a very hard but fair puzzle game.
&lt;/p&gt;
</description><author>jsnell@iki.fi</author><category>GAMES</category><pubDate>Mon, 23 Jul 2018 16:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2018-07-23-optimizing-breadth-first-search/</guid></item><item><title>Numbers and tagged pointers in early Lisp implementations</title><link>https://www.snellman.net/blog/archive/2017-09-04-lisp-numbers/</link><description>
  &lt;p&gt;
    There was a bit
    of &lt;a href=&#039;https://news.ycombinator.com/item?id=15121859&#039;&gt;discussion
    on HN about data representations in dynamic languages&lt;/a&gt;, and
    specifically having values that are either pointers or immediate
    data, with the two cases being distinguished by use of tag bits in
    the pointer value:
  &lt;/p&gt;

  &lt;blockquote&gt;
    &lt;blockquote&gt;
      If there&#039;s one takeway/point of interest that I&#039;d recommend looking at, it&#039;s the novel way that Ruby shares a pointer value between actual pointers to memory and special &quot;immediate&quot; values that simply occupy the pointer value itself [1].
    &lt;/blockquote&gt;
    This is usual in Lisp (compilers/implementations) and i wouldn&#039;t be surprised if it was invented on the seventies once large (i.e. 36-bit long) registers were available.
  &lt;/blockquote&gt;

  &lt;p&gt;I was going to nitpick a bit with the following:&lt;/p&gt;

  &lt;blockquote&gt;
  &lt;p&gt;
    The core claim here is correct; embedding small immediates inside
    pointers is not a novel technique. It&#039;s a good guess that it was
    first used in Lisp systems. But it can&#039;t be the case that its
    invention is tied into large word sizes, those were in wide use
    well before Lisp existed. (The early Lisps mostly ran on 36 bit
    computers.)
  &lt;/p&gt;

  &lt;p&gt;
   It seems more likely that this was tied into the general migration
   from word-addressing to byte-addressing. Due to alignment constraints,
   byte-addressed pointers to word-sized objects will always have unused
   bits around. It&#039;s harder to arrange for that with a word-addressed
   system.
  &lt;/p&gt;
  &lt;/blockquote&gt;

  &lt;p&gt;
    But the latter part of that was speculation, maybe I should try to
    check the facts first before being tediously pedantic?
    Good call, since that speculation was wrong. Let&#039;s take a tour
    through some early Lisp implementations, and look at how they
    represented data in general, and numbers in particular.
  &lt;/p&gt;

&lt;read-more&gt;&lt;/read-more&gt;

  &lt;h3&gt;Table of Contents&lt;/h3&gt;

  &lt;ul&gt;
  &lt;li&gt; &lt;a href=&#039;#problem&#039;&gt;The problem with integers&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#lispi&#039;&gt;LISP I&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#lisp1.5&#039;&gt;LISP 1.5&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#pdp1&#039;&gt;Basic PDP-1 LISP&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#m460&#039;&gt;M 460 LISP&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#pdp6&#039;&gt;PDP-6 LISP&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#bbn&#039;&gt;BBN LISP&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#conclusion&#039;&gt;Conclusion&lt;/a&gt;
  &lt;/ul&gt;

  &lt;a name=&#039;problem&#039;&gt;&lt;/a&gt;
  &lt;h3&gt;The problem with integers&lt;/h3&gt;

  &lt;p&gt;
    Before we get started, let&#039;s state the problem that tagged pointers
    solve. In a
    dynamically typed programming language, the language
    implementation must be able to distinguish between values of
    different types. The obvious implementation is boxing; all values
    are treated as blobs of memory allocated somewhere on the heap,
    with an envelope containing metadata such as the type and (maybe)
    the size of the object.
  &lt;/p&gt;

  &lt;p&gt;But this means that integers now have tons of overhead. They use
    up heap space, need to be garbage collected, and new memory needs
    to be constantly allocated for the results of arithmetic
    operations. Since integers are so critical to almost all kinds of
    computing, it would be great to minimize the overhead.  And
    ultimately, to eliminate the overhead completely by encoding small
    integers as recognizably invalid pointers.
  &lt;/p&gt;

  &lt;a name=&#039;lispi&#039;&gt;&lt;/a&gt;
  &lt;h3&gt;LISP I&lt;/h3&gt;

  &lt;p&gt;
    I wasn&#039;t super hopeful about finding out exactly what numbers looked like
    in the original Lisp implementation. As far as I know, the source
    code hasn&#039;t been preserved. Now, the original paper describing Lisp
    (&lt;a href=&#039;http://edge.cs.drexel.edu/regli/Classes/Lisp_papers/McCarthy-original-LISP-paper-recursive.pdf&#039;&gt;
    Recursive Functions of Symbolic Expressions and their Computation
    by Machine, Part I
    &lt;/a&gt;) isn&#039;t quite as theoretical as the title suggests. For
    example it describes the memory allocator and garbage collector on
    a reasonable systems level. But it doesn&#039;t mention numbers at all;
    this is a system for symbolic computation, so numbers might as
    well not exist.
  &lt;/p&gt;

  &lt;p&gt;
    The &lt;a href=&#039;https://kyber.io/rawvids/LISP_I_Programmers_Manual_LISP_I_Programmers_Manual.pdf&#039;&gt;
      LISP I Programmer&#039;s Manual&lt;/a&gt; from 1960 is more illuminating, though not
      entirely consistent. In one place the manual claims that LISP I
      only supports floats, and you&#039;ll need to wait until LISP II to
      use integers. But the rest of the document happily describes the
      exact memory layout of integers, so who can tell.
  &lt;/p&gt;

  &lt;p&gt;
    A floating point value looks like this:
  &lt;/p&gt;

  &lt;img src=&#039;/blog/stc/images/lisp-numbers/lisp1.0-float.png&#039; /&gt;

  &lt;p&gt;
    Let&#039;s say we have the value 1.0 in a LISP I program. This value
    is actually a pointer to a word. How do we know what the type of the
    pointed to word is? If the upper half of that word is -1, it&#039;s a
    symbol. Otherwise it&#039;s a cons. (The use of -1.0 and 1.0 as the
    example floats in this picture is unfortunate, since it looks like
    the -1.0 and -1 are somehow related. That&#039;s not the case, -1 is
    the universal tag value for atoms, and independent of the exact
    floating point values.)
  &lt;/p&gt;

  &lt;p&gt;
    So the number 1.0 is a symbol? Technically yes, since at this stage of Lisp&#039;s
    evolution everything is either a symbol or a cons. There are no
    other atoms. We can find out if the symbol represents a number by
    following the linked list starting from the &lt;code&gt;cdr&lt;/code&gt; of
    the symbol (a pointer stored in the lower half of the word). If
    we find the symbol &lt;code&gt;NUMB&lt;/code&gt; on the list, it&#039;s some kind
    of number. If we find the symbol &lt;code&gt;FLO&lt;/code&gt;, it&#039;s a floating
    point number, and the property list will be pointing to a word
    that contains the raw floating point value that this number
    represents.
  &lt;/p&gt;

  &lt;p&gt;
    There&#039;s a detail here that&#039;s kind of amazing. Notice that 1.0 and
    -1.0 share the same list structure. The only difference is that
    -1.0 has the symbol &lt;code&gt;MINUS&lt;/code&gt; in the list, after which
    the list merges with the list of 1.0. What a fabulously
    inefficient representation! Not only do you have to do a bunch of
    pointer chasing just to find the actual value of a number, but
    then you&#039;ll get to do it again to find out the sign!
  &lt;/p&gt;

  &lt;p&gt;
    The question I can&#039;t answer just from reading this document is how
    exactly the raw floating point value is handled. Surely the
    garbage collector must know not to interpret those raw
    bits as pointer data? There is a very detailed example of the
    memory layout for an integer on pages 94-95, but even with that
    example I just don&#039;t see where the type information is stored.
    It&#039;s clearly not based on address ranges (the raw values are mixed
    in with the other words), nor the pointer value (all the pointers
    are stored as 2&#039;s complement), nor the 6 unused bits in the
    machine word.
  &lt;/p&gt;

  &lt;p&gt;
    Suggestions welcome. My best guess is that the example is
    inaccurate.
  &lt;/p&gt;

  &lt;a name=&#039;lisp1.5&#039;&gt;&lt;/a&gt;
  &lt;h3&gt;LISP 1.5&lt;/h3&gt;

  &lt;p&gt;The LISP 1.5 Programmer&#039;s Manual from 1962 explains in a very
    concise manner how numbers worked in that implementation:
  &lt;/p&gt;

  &lt;img src=&#039;/blog/stc/images/lisp-numbers/lisp1.5-int.png&#039; /&gt;

  &lt;p&gt;Numbers are still considered to be symbols, and symbols are
    still marked with -1 as the &lt;code&gt;car&lt;/code&gt;. But the standard
    symbol property list is now gone; instead the symbol is pointing
    directly to the memory that stores the raw integer value. How
    does the program know not to follow that pointer as a list? As the
    document says, that&#039;s specified by &quot;certain bits in the tag&quot;.
  &lt;/p&gt;

  &lt;p&gt;
    The tag? What&#039;s the tag? The IBM 704 had a 36-bit word size but
    just a 15 bit address space. The words were split (on the ISA
    level) into a 3 bit &quot;prefix&quot;, 15 bit &quot;address&quot;, 3 bit &quot;tag&quot;, and
    15 bit &quot;decrement&quot;.  Since Lisp values are pointers, only the two
    15 bit regions are useful for that. One of the 3 bit regions has
    been repurposed by the Lisp implementation to mark the pointers
    to raw data.
  &lt;/p&gt;

  &lt;p&gt;
    This is a clear improvement over LISP I, but a number is still
    represented as an untagged pointer to a tagged pointer to the raw
    value. Why is the intermediate word there at all, why not go
    directly with a tagged pointer to the raw value? Maybe code size?
  &lt;/p&gt;

  &lt;p&gt;
    In parallel to that, the address space has now been split into
    multiple separate pieces, with the cons cells being allocated from
    a different range of addresses than plain data like numbers and
    string segments. It could well be that the tagged pointer is
    irrelevant to the GC, which just makes its decisions on what&#039;s a
    pointer based on whether the pointer is contained in the &quot;full
    word space&quot; or the &quot;free space&quot;. The tags would then be used
    just for implementing &lt;code&gt;NUMBERP&lt;/code&gt;.
  &lt;/p&gt;

  &lt;a name=&#039;pdp1&#039;&gt;&lt;/a&gt;
  &lt;h3&gt;Basic PDP-1 LISP&lt;/h3&gt;

  &lt;p&gt;
    For a L. Peter Deutsch joint,
    &lt;a href=&#039;http://s3data.computerhistory.org/pdp-1/DEC.pdp_1.1964.102650371.pdf&#039;&gt;
      The LISP implementation for the PDP-1 Computer
    &lt;/a&gt; proves to be a surprisingly unsatisfying document. It&#039;s almost
    exclusively user documentation, with no information on the systems
    architecture. Well, except a full source code listing. Guess we&#039;ll
    have to look at that, then.
    &lt;code&gt;NUMBERP&lt;/code&gt; is the easiest starting point:
  &lt;/p&gt;

&lt;pre&gt;
/// (&quot;is a number&quot;)
/NUMBERP
nmp,    lac i 100
        and (jmp
        sad (jmp
        jmp tru
        jmp fal
&lt;/pre&gt;

  &lt;p&gt;The main thing that needs to be known from the rest of the code is
    that the interpreter stores a pointer to the Lisp value that&#039;s
    currently being operated on at address &lt;code&gt;100&lt;/code&gt; (octal).
  &lt;/p&gt;

  &lt;p&gt;First &lt;code&gt;&quot;lac i 100&quot;&lt;/code&gt; follows the pointer to read the
    first data word of the value into the accumulator. The next line
    looks bizarre; due to the way the PDP-1 macro-assembler
    works, &lt;code&gt;&quot;and (jmp&quot;&lt;/code&gt; effectively means &lt;code&gt;&quot;and
    600000&quot;&lt;/code&gt;. So this instruction is masking away all but the
    top two bits of the accumulator, and &lt;code&gt;&quot;sad
    (jmp&quot;&lt;/code&gt; is checking whether the result of the masking equals octal
    &lt;code&gt;600000&lt;/code&gt;. It appears that there is nothing special about the
    pointer to a number, but numbers are identified by having the top
    two bits set in the pointed-to value.
  &lt;/p&gt;

  &lt;p&gt;The next step in understanding the layout is the code for reading
  the raw value of a number.&lt;/p&gt;

&lt;pre&gt;
/get numeric value
vag,    lio i 100
        cla
        rcl 2s
        sas (3
        jmp qi3
        idx 100
        lac i 100
        rcl 8s
        rcl 8s
        jmp x
&lt;/pre&gt;

  &lt;p&gt;&lt;code&gt;&quot;lio i 100&quot;&lt;/code&gt; loads the current Lisp value into the IO register.
    &lt;code&gt;&quot;cla&quot;&lt;/code&gt; sets the accumulator to zero. &lt;code&gt;&quot;rcl
    2s&quot;&lt;/code&gt; then rotates the combination of the IO register and
    accumulator by 2 bits.  The accumulator now contains as its
    low bits the previous high two bits of the IO register. &lt;code&gt;&quot;sas
    (3&quot;&lt;/code&gt; compares the accumulator to 3; if they&#039;re not equal we
    jump to qi3 (the error routine for &quot;non-numeric arg for
    arith&quot;). &lt;code&gt;&quot;idx 100&quot;&lt;/code&gt; moves the pointer to the next word
    of the value, and &lt;code&gt;&quot;lac i 100&quot;&lt;/code&gt; reads that word into
    the accumulator. And finally the combination of the two registers
    is rotated by 16 bits, so that we end up with the raw 18 bit value
    in the accumulator. Written out step by step the process looks
    like this:
  &lt;/p&gt;

  &lt;pre&gt;
    . == Bit with value of 0
    ! == Bit with value of 1
    ? == Bit with unknown value
    0-9, A-H == bits of the integer value

    X                    X+1
------------------------------------------------
    [!!23456789ABCDEFGH] [................01]

    IO                   AC
------------------------------------------------
Load IO from address X
    [!!23456789ABCDEFGH] [??????????????????]
Clear AC
    [!!23456789ABCDEFGH] [..................]
Rotate left by 2
    [23456789ABCDEFGH..] [................!!]
Check AC == 3
Load AC from address X+1
    [23456789ABCDEFGH..] [................01]
Rotate left by 8
    [ABCDEFGH..........] [........0123456789]
Rotate left by 8
    [..................] [0123456789ABCDEFGH]
&lt;/pre&gt;
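
  &lt;p&gt;
    The same walk-through can be replayed in a few lines of Python. This is only a
    sketch of my reading of the listing, not a PDP-1 emulator: the helper names are
    invented here, the combined register is assumed to be AC followed by IO (which is
    what the diagram above implies), and the value is treated as a raw 18-bit pattern
    with no sign handling.
  &lt;/p&gt;

```python
WORD_BITS = 18
WORD = 2 ** WORD_BITS          # one PDP-1 word holds 18 bits
NUMBER_TAG = 3                 # both top bits set, as tested by "sas (3"


def rcl(ac, io, n):
    """Rotate the combined 36-bit AC:IO register left by n bits."""
    combined = ac * WORD + io
    wrapped = combined // 2 ** (2 * WORD_BITS - n)
    rotated = (combined * 2 ** n) % (WORD * WORD) + wrapped
    return rotated // WORD, rotated % WORD


def encode_number(value):
    """Build the words at X and X+1 for an 18-bit value (hypothetical helper)."""
    low16 = value % 2 ** 16
    high2 = value // 2 ** 16
    return NUMBER_TAG * 2 ** 16 + low16, high2


def get_numeric_value(x0, x1):
    """Mirror of the vag routine; returns the raw 18-bit value."""
    io = x0                     # lio i 100
    ac = 0                      # cla
    ac, io = rcl(ac, io, 2)     # rcl 2s: the tag bits land in AC
    if ac != NUMBER_TAG:        # sas (3, then jmp qi3 on mismatch
        raise TypeError("non-numeric arg for arith")
    ac = x1                     # idx 100, then lac i 100
    ac, io = rcl(ac, io, 8)     # rcl 8s
    ac, io = rcl(ac, io, 8)     # rcl 8s
    return ac
```

  &lt;p&gt;
    Round-tripping a value through &lt;code&gt;encode_number&lt;/code&gt; and
    &lt;code&gt;get_numeric_value&lt;/code&gt; should give back the original bits, and a first
    word without both tag bits set should end up in the error path.
  &lt;/p&gt;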

  &lt;p&gt;
    Clearly an integer is now represented by a pointer to two words
    that has a special tag in the high bits of the first word. This
    implementation got rid of the extra layer of indirection in LISP
    1.5; an integer is now just a pointer to tagged data. But we&#039;re
    still left with the storage of a one-word integer requiring three
    words.
  &lt;/p&gt;

  &lt;p&gt;
    Why use a layout that requires shuffling data around this much,
    instead of just having the tag in X and the raw value in X+1?  It
    seems awfully inconvenient. My best guess is that the top 1-2 bits
    of the second word are reserved for the GC, e.g. for use as mark
    bits. But understanding exactly how the GC works is maybe a project
    for another day.
  &lt;/p&gt;

  &lt;a name=&#039;m460&#039;&gt;&lt;/a&gt;
  &lt;h3&gt;M 460 LISP&lt;/h3&gt;

  &lt;p&gt;
    Before starting research for this article, I&#039;d never heard of the
    early Lisp implementation for the Univac M 460. A description of
    the system can be found in the 1964
    collection &lt;a href=&#039;https://scholar.google.com/scholar?cluster=1071332420478270292&amp;hl=en&amp;as_sdt=0,5&amp;sciodt=0,5&#039;&gt;
    The programming language LISP: Its operation and applications
    &lt;/a&gt;.
  &lt;/p&gt;

  &lt;blockquote&gt;
    Numbers and print names are placed in free storage using the
    device that sufficiently small (i.e., less than 2^10) half-word
    quantities appear to point into the bit table area and so don&#039;t
    cause the garbage collector any trouble. A number is stored as a
    list of words (a flag-word and from 1 to 3 number words, as
    required), each number word containing in its CAR part 10
    significant bits and sign. Thus an integer whose absolute value
    is less than 2^11 will occupy the same amount of storage
    (2 words) as in 7090 LISP 1.5.
  &lt;/blockquote&gt;

  &lt;p&gt;
    This is another bit of progress! The key insight on the road to
    tagged pointers is that invalid parts of the address space can be
    used to distinguish between pointers and immediate data. Another
    important insight in this paper is that most numbers in a program
    are going to be small, so it might make sense to have variable
    representations for numbers of different magnitude. But it&#039;s not a
    full realization of the concept yet, immediate small numbers are
    not accessible directly by the user. They are internal to the
    implementation, used as a building block for boxed integers of
    various levels of inefficiency.
  &lt;/p&gt;

  &lt;p&gt;
    The paper gets even better once we get a few more pages in, since
    for characters M 460 Lisp does take that final step:
  &lt;/p&gt;

  &lt;blockquote&gt;
Each character in the character set available on the M 460
(including tab, carriage return, and others) is represented internally
by an 8-bit code (6 bits for the character (up to case),
1 bit for case, and 1 bit for color). To facilitate the manipulation
of character strings within our LISP system, we permit
such character literals to appear in list structure as if they
were atoms, i.e. pointers to property lists. These literals can,
where necessary, be distinguished from atoms since they are less
than 2^8 in magnitude and hence, viewed as pointers, don&#039;t point
into free storage (where, as in 7090 LISP, property lists are
stored). The predicate charp simply makes this magnitude test.
  &lt;/blockquote&gt;

  &lt;p&gt;
    That&#039;s about as clear a case of embedding immediate data in
    pointers as it gets. It&#039;s just that the tag is rather large (22
    highest bits, rather than the 1-4 lowest bits you&#039;d expect today).
    And it&#039;s also dealing with characters rather than numbers, so
    let&#039;s carry on with the investigation a bit longer.
  &lt;/p&gt;

  &lt;a name=&#039;pdp6&#039;&gt;&lt;/a&gt;
  &lt;h3&gt;PDP-6 LISP&lt;/h3&gt;

  &lt;p&gt;
    The June 1966 report on
    &lt;a href=&#039;https://dspace.mit.edu/bitstream/handle/1721.1/5899/AIM-098.pdf&#039;&gt;PDP-6 LISP&lt;/a&gt;
    has the following to say on integers:
  &lt;/p&gt;

  &lt;blockquote&gt;
    Fixed-point numbers &amp;gt;= 0 and &amp;lt; about 4000 are represented by
    a &quot;pointer&quot; 1 greater than their value, and no additional list structure.
    All other numbers use a pointer to full-word space as part of an atom
    header with a FIXNUM or FLONUM indicator.
  &lt;/blockquote&gt;

  &lt;p&gt;
    This is starting to get close to the modern fixnum, except for no
    facility for immediate negative numbers and a tiny range. (This is
    a machine with 36 bit words and 18 bit pointers; one would hope
    for a bit more than 12 bits for immediate integers).
  &lt;/p&gt;

  &lt;a name=&#039;bbn&#039;&gt;&lt;/a&gt;
  &lt;h3&gt;BBN LISP&lt;/h3&gt;

  &lt;p&gt;
    &lt;a href=&#039;http://www.dtic.mil/dtic/tr/fulltext/u2/647601.pdf&#039;&gt;
      Structure of a LISP system using two-level storage
    &lt;/a&gt; is a wonderful systems design paper from November 1966,
    describing BBN LISP for a PDP-1 with 16K of
    core memory, 88K of absurdly slow drum memory, and no hardware
    paging support. How do you make efficient use of the drum memory?
    By some clever data layout, software-driven paging, and a
    locality-optimizing memory allocator.
  &lt;/p&gt;

  &lt;p&gt;
    So it&#039;s actually a paper I thought was totally worth reading just
    for its own sake. But for the purposes of this post, this is the
    money quote:
  &lt;/p&gt;

  &lt;blockquote&gt;
LISP assumes that it is operating in an environment containing
128K words, that is from 0 to 400,000 octal. Only 88K actually
exist on the drum. The remaining portion of the address space
is used for representation of small integers between -32,767
and 32,767 (offset by 300,000 octal), as described below.
  &lt;/blockquote&gt;

  &lt;p&gt;
    The paper describes a machine with both an 18-bit word size
    and address space, with 16-bit signed fixnums embedded in the
    pointers. That&#039;s about as good as it gets. (Though not quite
    optimal; they&#039;re using bit 17 as the integer tag, but what
    happened to bit 18? The paper doesn&#039;t say, but odds are that
    it&#039;s again a GC mark bit).
  &lt;/p&gt;
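
  &lt;p&gt;
    To make the encoding concrete, here is a small Python sketch of how such a fixnum
    could be packed into and recovered from an 18-bit pointer. The offset and the range
    come from the quote above; the function names, and the assumption that every
    pointer with the top bit set is treated as a small integer, are mine rather than
    the paper&#039;s.
  &lt;/p&gt;

```python
SMALL_INT_OFFSET = 0o300000   # "offset by 300,000 octal"
TAG_BOUNDARY = 0o200000       # assumed: pointers from here up are small integers
MAX_SMALL_INT = 32767


def box_small_int(n):
    """Encode n as an 18-bit pointer value; no memory is allocated."""
    if n not in range(-MAX_SMALL_INT, MAX_SMALL_INT + 1):
        raise OverflowError("needs a heap-allocated representation")
    return SMALL_INT_OFFSET + n


def is_small_int(pointer):
    """True when the top bit of the 18-bit pointer is set."""
    return pointer // TAG_BOUNDARY == 1


def unbox_small_int(pointer):
    """Recover the integer by removing the offset again."""
    return pointer - SMALL_INT_OFFSET
```

  &lt;p&gt;
    With these definitions -32,767 maps to 200,001 octal and 32,767 maps to 377,777
    octal, so the whole range sits in the upper half of the 128K address space.
  &lt;/p&gt;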

  &lt;p&gt;
    The particularly observant reader might have noticed that this
    machine had 104K words of physical memory, but the described
    tagging scheme only leaves 64K words addressable. What&#039;s up with
    that? On one level it&#039;s exactly what M 460 LISP and PDP-6 Lisp
    were doing: that 40K of address space stores things that can&#039;t be
    directly pointed to from another Lisp value. But those other
    implementations were just opportunistically reusing the
    parts of address space that contained native code.
  &lt;/p&gt;

  &lt;p&gt;By contrast, BBN LISP carefully arranged for there to exist as
    much of such storage as possible, and for it to be located above
    the address 200,000 (octal).&lt;/p&gt;

  &lt;img src=&#039;/blog/stc/images/lisp-numbers/bbn-lisp-layout.png&#039; /&gt;

  &lt;p&gt;The most clever example of that is the representation of
    symbols. The first implementations we saw just implemented symbols
    as a list of properties indexed by name (e.g. name, value cell,
    function cell, etc). An obvious optimization is to allocate a
    symbol as a single larger block of memory with fixed slots for the
    most common properties, and a generic property list slot to
    contain anything else.&lt;/p&gt;

  &lt;p&gt;
    What BBN Lisp does instead is allocate a symbol in multiple
    separate blocks rather than a single contiguous one. A pointer to
    the symbol will point to the block of value cells, so reading the
    value cell is trivial. What if you want to read another property,
    e.g. the function? We look at the offset of the value cell pointer
    to the start of the value cell block, and access the function cell
    block at the same offset. In modern parlance it ends up as a
    structure-of-arrays layout rather than an array-of-structures.
  &lt;/p&gt;

  &lt;p&gt;
    In addition to getting more address space for fixnums, they also
    got exactly the same kind of locality improvements that a
    structure-of-arrays layout would be used for today. So it was an
    all-around neat optimization.
  &lt;/p&gt;
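
  &lt;p&gt;
    A toy Python model of that lookup might look like the following. The block
    addresses and helper names are invented for the example and a dictionary stands in
    for memory; the only part taken from the paper&#039;s description is that a symbol
    pointer is the address of its value cell, and that other cells live at the same
    offset in another block.
  &lt;/p&gt;

```python
VALUE_BLOCK_START = 0o100      # hypothetical base addresses, not from the paper
FUNCTION_BLOCK_START = 0o200

memory = {}                    # maps an address to its contents


def define_symbol(index, value, function):
    """Place a symbol's cells at the same offset in two separate blocks."""
    pointer = VALUE_BLOCK_START + index
    memory[pointer] = value
    memory[FUNCTION_BLOCK_START + index] = function
    return pointer             # a symbol pointer is the address of its value cell


def value_cell(symbol):
    """Reading the value needs no offset arithmetic at all."""
    return memory[symbol]


def function_cell(symbol):
    """Other properties are found at the same offset in their own block."""
    offset = symbol - VALUE_BLOCK_START
    return memory[FUNCTION_BLOCK_START + offset]
```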

  &lt;p&gt;
    There is also an &lt;a href=&#039;http://www.softwarepreservation.org/projects/LISP/bbnlisp/BBN940LispPrelimSpec_Oct1966.pdf&#039;&gt;early design document for BBN 940 LISP&lt;/a&gt; from almost the same time as the above paper. It appears
to describe the kind of elaborate tagging scheme that a modern Lisp
might use, and places the tags in the low bits where they&#039;re easier to
test for/eliminate. And they even call heap-allocated numbers &quot;boxed&quot;!
I had no idea this terminology was in use 50 years ago. The relevant
section:
  &lt;/p&gt;

  &lt;blockquote&gt;
&lt;p&gt;
There will be a maximum of 16 pointer types of
objects in the 940 LISP System. These are (numbered in octal)
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; 00. S-expressions (nonatomic)
&lt;li&gt; 01. Identifiers (literal atoms)
&lt;li&gt; 02. Small Integers
&lt;li&gt; 03. Boxed Large Integers
&lt;li&gt; 04. Boxed Floating Point Numbers
&lt;li&gt; 05. Compiled Function - Lambda Type
&lt;li&gt; 06. Compiled Function - Lambda Type - Indef Args
&lt;li&gt; 07. Compiled Function - Mu Type - Args Paired
&lt;li&gt; 10. Compiled Function - Mu Type - List of Args
&lt;li&gt; 11. Compiled Function - Macro
&lt;li&gt; 12. Array - Pointers
&lt;li&gt; 13. Array - Integers
&lt;li&gt; 14. Array - FP #s
&lt;li&gt; 15. Strings - Packed Character Arrays
&lt;li&gt; 16.
&lt;li&gt; 17. Pushdown List Pointers
&lt;/ul&gt;

&lt;p&gt;
Each pointer will be contained in one 940 word of 24
bits. Bits 0 and 1 will be nominally empty, and may in some
cases be used by the system (e.g. bit 0 for garbage collection)
or perhaps even the user (in S-expressions). The four bits
2-5 will contain the type number for this pointer. The 18
bits 6-23 will contain an effective address (in the LISP
drum file) where the referenced information is stored.
&lt;/p&gt;

&lt;/blockquote&gt;

&lt;p&gt;It looks like they ended up not using this design
    for BBN 940 LISP, and it instead uses an extended version of the
    segmented memory scheme from the PDP-1 implementation described
    earlier in this section. But even if these particular bits
    weren&#039;t practical to use with that hardware, at this point
    just about all the ideas for tagged pointers have definitely
    been invented.&lt;/p&gt;

  &lt;a name=&#039;conclusion&#039;&gt;&lt;/a&gt;
  &lt;h3&gt;Conclusion&lt;/h3&gt;

  &lt;p&gt;
    The initial LISP I implementation in 1960 had the least efficient
    implementation of numbers this side of Church numerals, where even
    just getting the value might imply chasing half a dozen
    pointers. But new implementations optimized that layout
    aggressively. By 1964, the M 460 LISP implementation had arrived
    at the general solution of using pointers to invalid parts of
    the address space for storing immediate data, but user-accessible
    integers were still boxed; the only use for the unboxed integers
    was as an internal building block. In 1966 PDP-6 LISP applied the
    idea of tagged immediate data to tiny positive integers, and the
    PDP-1 based BBN LISP took the idea to the logical conclusion, and
    allowed immediate storage of integers of almost the full machine
    word.
  &lt;/p&gt;

  &lt;p&gt;
    I would not have guessed that these optimizations were discovered
    and applied so early and so aggressively. It&#039;s also noteworthy
    that this was independent of the machine word size, the address
    space size, and the addressing mode of the machine. The first fully
    fledged implementation I found was on a machine with 18 bit words, 18
    bits of address space, and word-addressing. That should have been
    just about the worst case!
  &lt;/p&gt;

  &lt;p&gt;There&#039;s an interesting tangent with how MacLISP ended up
    reversing this progress in the &#039;70s and going back to boxed
    integers, since they wanted to have just a single integer
    representation. I won&#039;t go into the details since this post
    already grew longer than intended. But for those interested in the
    subject &lt;a href=&#039;https://dspace.mit.edu/bitstream/handle/1721.1/6279/AIM-421.pdf&#039;&gt;AI Memo 421&lt;/a&gt; is a fun read.
  &lt;/p&gt;

  &lt;p&gt;
    Was the technique definitely first used in Lisp? These
    implementations are early enough that there aren&#039;t a ton of other
    possibilities. The only ones I can think of would be APL and
    Dartmouth BASIC. If anyone can find documentation on earlier
    uses of storing immediate data in tagged pointers, please
    let me know and I&#039;ll edit the article.
  &lt;/p&gt;
</description><author>jsnell@iki.fi</author><category>LISP</category><category>HISTORY</category><pubDate>Mon, 04 Sep 2017 15:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2017-09-04-lisp-numbers/</guid></item><item><title>Why PS4 downloads are so slow</title><link>https://www.snellman.net/blog/archive/2017-08-19-slow-ps4-downloads/</link><description>
  &lt;p&gt;
    Game downloads on PS4 have a reputation of being very slow, with many people
    reporting downloads being an order of magnitude faster on Steam or
    Xbox. This had long been on my list of things to look into, but at
    a pretty low priority.  After all, the PS4 operating system is
    based on a reasonably modern FreeBSD (9.0), so there should not be
    any crippling issues in the TCP stack. The implication is that the
    problem is something boring, like an inadequately dimensioned CDN.
  &lt;/p&gt;

  &lt;p&gt;
    But then I heard that people were successfully using local HTTP
    proxies as a workaround. It should be pretty rare for that to
    actually help with download speeds, which made this sound like a
    much more interesting problem.
  &lt;/p&gt;

&lt;read-more&gt;&lt;/read-more&gt;

  &lt;p&gt;
    This is going to be a long-winded technical post.  If you&#039;re not
    interested in the details of the investigation but just want a
    recommendation on speeding up PS4 downloads, skip straight to the
    &lt;a href=&#039;#conclusions&#039;&gt;conclusions&lt;/a&gt;.
  &lt;/p&gt;

  &lt;h3&gt;Background&lt;/h3&gt;

  &lt;p&gt;
    Before running any experiments, it&#039;s good to have a mental model
    of how the thing we&#039;re testing works, and where the problems might
    be. If nothing else, it will guide the initial experiment design.
  &lt;/p&gt;

  &lt;p&gt;
    The speed of a steady-state TCP connection is basically defined by
    three numbers: the amount of data the client is willing to receive on
    a single round-trip (TCP receive window), the amount of data the
    server is willing to send on a single round-trip (TCP congestion
    window), and the round trip latency between the client and the server (RTT).
    To a first approximation, the connection speed will be:
  &lt;/p&gt;

  &lt;pre&gt;
    speed = min(rwin, cwin) / RTT
&lt;/pre&gt;

  &lt;p&gt;
    With this model, how could a proxy speed up the connection?  Well,
    with a proxy the original connection will be split into two mostly
    independent parts; one connection between the client and the
    proxy, and another between the proxy and the server. The speed of
    the end-to-end connection will be determined by the slower of
    those two independent connections:
  &lt;/p&gt;

  &lt;pre&gt;
    speed_proxy_client = min(client rwin, proxy cwin) / client-proxy RTT
    speed_server_proxy = min(proxy rwin, server cwin) / proxy-server RTT
    speed = min(speed_proxy_client, speed_server_proxy)
&lt;/pre&gt;

  &lt;p&gt;
    With a local proxy the client-proxy RTT will be very low; that
    connection is almost guaranteed to be the faster one. The
    improvement will have to be from the server-proxy connection being
    somehow better than the direct client-server one. The RTT will not
    change, so there are just two options: either the client has a
    much smaller receive window than the proxy, or the client is
    somehow causing the server&#039;s congestion window to
    decrease. (E.g. the client is randomly dropping received packets,
    while the proxy isn&#039;t).
  &lt;/p&gt;
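
  &lt;p&gt;
    As a rough sketch, the model above can be written out in a few lines
    of Python. The function names and all of the window sizes and RTTs
    below are made up for illustration; they are not measurements.
  &lt;/p&gt;

```python
def tcp_speed(rwin_bytes, cwin_bytes, rtt_seconds):
    """First-order throughput in bytes per second: min(rwin, cwin) / RTT."""
    return min(rwin_bytes, cwin_bytes) / rtt_seconds


def proxied_speed(client_rwin, proxy_cwin, client_proxy_rtt,
                  proxy_rwin, server_cwin, proxy_server_rtt):
    """End-to-end speed through a proxy: the slower of the two legs."""
    client_leg = tcp_speed(client_rwin, proxy_cwin, client_proxy_rtt)
    server_leg = tcp_speed(proxy_rwin, server_cwin, proxy_server_rtt)
    return min(client_leg, server_leg)


# Hypothetical numbers, chosen only for illustration.
KB = 1024
direct = tcp_speed(128 * KB, 1024 * KB, 0.030)
proxied = proxied_speed(128 * KB, 1024 * KB, 0.001,
                        1024 * KB, 1024 * KB, 0.030)
print(f"direct:  {direct * 8 / 1e6:.1f} Mbps")
print(f"proxied: {proxied * 8 / 1e6:.1f} Mbps")
```

  &lt;p&gt;
    With a small client receive window and a local proxy that has a large
    one, the 1ms client-proxy leg stops being the limit, and the
    proxy-server leg gets to use the larger window.
  &lt;/p&gt;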

  &lt;p&gt;
    Out of these two theories, the receive window one should be much
    more likely, so we should concentrate on it first. But that just
    replaces our original question with a new one: why would the
    client&#039;s receive window be so low that it becomes a noticeable
    bottleneck? There&#039;s a fairly limited number of causes for low
    receive windows that I&#039;ve seen in the wild, and they don&#039;t really
    seem to fit here.
  &lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt; Maybe the client doesn&#039;t support the TCP window scaling option,
      while the proxy does. Without window scaling, the receive window
      will be limited to 64kB. But since we know Sony started with a
      TCP stack that supports window scaling, they would have had to
      go out of their way to disable it. Slow downloads, for no benefit.
    &lt;li&gt; Maybe the actual downloader application is very slow. The operating
      system is supposed to have a certain amount of buffer space available
      for each connection. If the network is delivering data to the OS
      faster than the application is reading it, the buffer will start to
      fill up, and the OS will reduce the receive window as a form
      of back-pressure. But this can&#039;t be the reason; if the application
      is the bottleneck, it&#039;ll be a bottleneck with or without the
      proxy.
    &lt;li&gt; The operating system is trying to dynamically scale the
      receive window to match the actual network conditions, but
      something is going wrong. This would be interesting, so it&#039;s
      what we&#039;re hoping to find.
  &lt;/ul&gt;

  &lt;p&gt;
    The initial theories are in place, let&#039;s get digging.
  &lt;/p&gt;

  &lt;h3&gt;Experiment #1&lt;/h3&gt;

  &lt;p&gt;
    For our first experiment, we&#039;ll start a PSN download on a baseline
    non-Slim PS4, firmware 4.73. The network connection of the PS4 is
    bridged through a Linux machine, where we can add latency to the
    network using &lt;code&gt;tc netem&lt;/code&gt;. By varying the added latency,
    we should be able to find out two things: whether the receive
    window really is the bottleneck, and whether the receive window
    is being automatically scaled by the operating system.
  &lt;/p&gt;

  &lt;p&gt;
    This is what the client-server RTTs (measured from a packet
    capture using TCP timestamps) look like for the experimental
    period. Each dot represents 10 seconds of time for a single
    connection, with the Y axis showing the minimum RTT seen for that
    connection in those 10 seconds.
  &lt;/p&gt;

  &lt;a href=&#039;https://www.snellman.net/blog/stc/images/ps4-dl/dl1-rtt-full.png&#039; target=&#039;_blank&#039;&gt;
    &lt;img src=&#039;https://www.snellman.net/blog/stc/images/ps4-dl/dl1-rtt-thumb.png&#039;&gt;
  &lt;/a&gt;

  &lt;p&gt;
    The next graph shows the amount of data sent by the server in one
    round trip in red, and the receive windows advertised by the
    client in blue.
  &lt;/p&gt;

  &lt;a href=&#039;https://www.snellman.net/blog/stc/images/ps4-dl/dl1-win-full.png&#039; target=&#039;_blank&#039;&gt;
    &lt;img src=&#039;https://www.snellman.net/blog/stc/images/ps4-dl/dl1-win-thumb.png&#039;&gt;
  &lt;/a&gt;

  &lt;p&gt;
    First, since the blue dots are staying constantly at about 128kB,
    the operating system doesn&#039;t appear to be doing any kind of
    receive window scaling based on the RTT. (So much for that
    theory). Though at the very right end of the graph the receive
    window shoots up to 650kB, so it isn&#039;t totally
    fixed either.
  &lt;/p&gt;

  &lt;p&gt;
    Second, is the receive window the bottleneck here? If so, the
    blue dots would be close to the red dots. This is the case
    until about 10:50. And then mysteriously the bottleneck moves to
    the server.
  &lt;/p&gt;

  &lt;p&gt;
    So we didn&#039;t find quite what we were looking for, but there are a
    couple of very interesting things that are correlated with events
    on the PS4.
  &lt;/p&gt;

  &lt;p&gt;The download was in the foreground for the whole duration of the
    test. But that doesn&#039;t mean it was the only thing running on the
    machine. The Netflix app was still running in the background,
    completely idle &lt;a id=&#039;fnref1&#039;&gt;[&lt;a href=&#039;#fn1&#039;&gt;1&lt;/a&gt;]. When the background app was
    closed at 11:00, the receive window increased dramatically. This
    suggests a second experiment, where different applications are
    opened / closed / left running in the background.
  &lt;/p&gt;

  &lt;p&gt;
    The time where the receive window stops being the bottleneck is
    very close to the PS4 entering rest mode. That looks like another
    thing worth investigating. Unfortunately the timing turns out to be
    a coincidence, and rest mode is a red herring here. &lt;a id=&#039;fnref2&#039;&gt;[&lt;a href=&#039;#fn2&#039;&gt;2&lt;/a&gt;]
  &lt;/p&gt;

  &lt;h3&gt;Experiment #2&lt;/h3&gt;

  &lt;p&gt;
    Below is a graph of the receive windows for a second download,
    annotated with the timing of various noteworthy events.
  &lt;/p&gt;

  &lt;a href=&#039;https://www.snellman.net/blog/stc/images/ps4-dl/dl2-rwin-full.png&#039; target=&#039;_blank&#039;&gt;
    &lt;img src=&#039;https://www.snellman.net/blog/stc/images/ps4-dl/dl2-rwin-thumb.png&#039;&gt;
  &lt;/a&gt;

  &lt;p&gt;
    The differences in receive windows at different times are
    striking. And more important, the changes in the receive
    windows correspond very well to specific things I did on
    the PS4.
  &lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt; When the download was started, the game Styx: Shards of
      Darkness was running in the background (just idling in the title
      screen). The download was limited by a receive window of under
      7kB. This is an incredibly low value; it&#039;s basically going to
      cause the downloads to take &lt;b&gt;100 times longer than they should&lt;/b&gt;.
      And this was not a coincidence; whenever that game
      was running, the receive window would be that low.
    &lt;li&gt; Having an app running (e.g. Netflix, Spotify) limited the
      receive window to 128kB, for about a 5x reduction in potential
      download speed.
    &lt;li&gt; Moving apps, games, or the download window to the foreground
      or background didn&#039;t have any effect on the receive window.
    &lt;li&gt; Launching some other games (Horizon: Zero Dawn, Uncharted 4,
      Dreadnought) seemed to have the same effect as running an app.
    &lt;li&gt; Playing an online match in a networked game (Dreadnought) caused the
      receive window to be artificially limited to 7kB.
    &lt;li&gt; Playing around in a non-networked game (Horizon: Zero Dawn)
      had a very inconsistent effect on the receive window, with the
      effect seemingly depending on the intensity of gameplay. This
      looks like a genuine resource restriction (download process
      getting variable amounts of CPU), rather than an artificial
      limit.
    &lt;li&gt; I ran a speedtest at a time when downloads were limited to
      7kB receive window. It got a decent receive window of over
      400kB; the conclusion is that the artificial receive window
      limit appears to only apply to PSN downloads.
    &lt;li&gt; Putting the PS4 into rest mode had no effect.
    &lt;li&gt; Built-in features of the PS4 UI, like the web browser,
      do not count as apps.
    &lt;li&gt; When a game was started (causing the previously running game
      to be stopped automatically), the receive window could increase
      to 650kB for a very brief period of time. Basically it appears
      that the receive window gets unclamped when the old game stops,
      and then clamped again a few seconds later when the new game
      actually starts up.
  &lt;/ul&gt;

  &lt;p&gt;
    I did a few more test runs, and all of them seemed
    to support the above findings. The only additional information
    from that testing is that the rest mode behavior was dependent
    on the PS4 settings. Originally I had it set up to suspend apps
    when in rest mode. If that setting was disabled, the apps would
    be closed when entering rest mode, and the downloads would
    proceed at full speed.
  &lt;/p&gt;

  &lt;p&gt;A 7kB receive window will be absolutely crippling for any user.
     A 128kB window might be ok for users who have CDN servers very
     close by, or who don&#039;t have a particularly fast internet connection. For
     example at my location, a 128kB receive window would cap the downloads at about
     35Mbps to 75Mbps depending on which CDN the DNS RNG happens to give me.
     The lowest two speed tiers for my ISP are 50Mbps and 200Mbps.
     So either the 128kB would not be a noticeable problem (50Mbps) or it&#039;d mean
     that downloads are artificially limited to 25% speed (200Mbps).
  &lt;/p&gt;
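
  &lt;p&gt;
    For anyone who wants to redo that arithmetic for their own connection,
    here is a small sketch. It works backwards from the caps quoted above to
    the RTT they imply, and then shows what the other observed window sizes
    would allow at the same RTT. It assumes 1kB = 1024 bytes and ignores
    protocol overhead, so treat the results as ballpark figures.
  &lt;/p&gt;

```python
KB = 1024  # assumption: the window sizes are in units of 1024 bytes


def implied_rtt_ms(window_bytes, speed_mbps):
    """RTT at which a window of this size caps throughput at speed_mbps."""
    return window_bytes * 8 / (speed_mbps * 1e6) * 1000


def capped_speed_mbps(window_bytes, rtt_ms):
    """Throughput ceiling for a given window and RTT, in Mbps."""
    return window_bytes * 8 / (rtt_ms / 1000) / 1e6


for speed in (35, 75):
    rtt = implied_rtt_ms(128 * KB, speed)
    print(f"{speed} Mbps cap with a 128kB window implies an RTT of about {rtt:.0f} ms")
    for window_kb in (7, 128, 650):
        cap = capped_speed_mbps(window_kb * KB, rtt)
        print(f"  {window_kb:3d}kB window: about {cap:.1f} Mbps")
```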

  &lt;a name=&#039;conclusions&#039;&gt;&lt;/a&gt;
  &lt;h3&gt;Conclusions&lt;/h3&gt;

  &lt;p&gt;If any applications are running, the PS4 appears to change the
    settings for PSN store downloads, artificially restricting their
    speed. Closing the other applications will remove the limit. There
    are a few important details:
  &lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt; Just leaving the other applications running in the background will
      &lt;b&gt;not help&lt;/b&gt;. The exact same limit is applied whether the download
      progress bar is in the foreground or not.
    &lt;li&gt; Putting the PS4 into rest mode might or might not help,
      depending on your system settings.
    &lt;li&gt;The artificial limit applies only to the PSN store downloads.
      It does &lt;b&gt;not&lt;/b&gt; affect e.g. the built-in speedtest. This
      is why the speedtest might report much higher speeds than the
      actual downloads, even though both are delivered from the same
      CDN servers.
    &lt;li&gt; Not all applications are equal; most of them will cause the
      connections to slow down by up to a factor of 5. Some
      games will cause a difference of about a factor of 100. Some
      games will start off with the factor of 5, and then migrate to
      the factor of 100 once you leave the start menu and start playing.
    &lt;li&gt; The above limits are artificial. In addition to that,
      actively playing a game can cause game downloads to slow down.
      This appears to be due to a genuine lack of CPU resources (with
      the game understandably having top priority).
  &lt;/ul&gt;

  &lt;p&gt;
    So if you&#039;re seeing slow downloads, just closing all the running
    applications might be worth a shot. (But it&#039;s obviously not
    guaranteed to help. There are other causes for slow downloads as
    well, this will just remove one potential bottleneck).
    To close the running applications, you&#039;ll need to
    long-press the PS button on the controller, and then select &amp;quot;Close
    applications&amp;quot; from the menu.
  &lt;/p&gt;

  &lt;p&gt;
    The PS4 doesn&#039;t make it very obvious exactly what programs are
    running. For games, the interaction model is that opening a new
    game closes the previously running one. This is not how other apps
    work; they remain in the background indefinitely until you
    explicitly close them.
  &lt;/p&gt;

  &lt;p&gt;
    And it gets worse than that. If your PS4 is configured to
    suspend any running apps when put to rest mode, you can seemingly
    power on the machine into a clean state, and still have a hidden
    background app that&#039;s causing the OS to limit your PSN download
    speeds.
  &lt;/p&gt;

  &lt;p&gt;
    This might explain some of the superstitions about this on the
    Internet. There are people who swear that putting the machine to
    rest mode helps with speeds, others who say it does nothing. Or
    how after every firmware update people will report increased
    download speeds. Odds are that nothing actually changed in the
    firmware; it&#039;s just that those people had done their first full
    reboot in a while, and finally had a system without a background
    app running.
  &lt;/p&gt;

  &lt;h3&gt;Speculation&lt;/h3&gt;

  &lt;p&gt;
    Those were the facts as I see them. Unfortunately this raises some
    new questions, which can&#039;t be answered experimentally. With no
    facts, there&#039;s no option except to speculate wildly!
  &lt;/p&gt;

  &lt;p&gt;&lt;b&gt;Q: Is this an intentional feature? If so, what is its purpose?&lt;/b&gt;&lt;/p&gt;

  &lt;p&gt;
    Yes, it must be intentional. The receive window changes very
    rapidly when applications or games are opened/closed, but not for
    any other reason. It&#039;s not any kind of subtle operating system
    level behavior; it&#039;s most likely the PS4 UI explicitly
    manipulating the socket receive buffers.
  &lt;/p&gt;
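
  &lt;p&gt;
    I have no visibility into what the PS4 code actually does, but on any
    system with a BSD-style sockets API an application can bound its own
    receive window by requesting a smaller socket receive buffer. A minimal
    sketch of that generic mechanism, with a hypothetical helper name and
    an arbitrary 128kB value:
  &lt;/p&gt;

```python
import socket


def make_clamped_socket(rcvbuf_bytes):
    """Create a TCP socket with a requested receive buffer of rcvbuf_bytes.

    The buffer size bounds the receive window the OS can advertise.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf_bytes)
    return sock


sock = make_clamped_socket(128 * 1024)
# The OS may round or adjust the request (Linux, for example, reports
# double the requested value to account for bookkeeping overhead).
effective = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print("effective receive buffer:", effective, "bytes")
sock.close()
```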

  &lt;p&gt;
    But why? I think the idea here must be to not allow the network
    traffic of background downloads to take resources away from the
    foreground use of the PS4. For example if I&#039;m playing an online
    shooter, it makes sense to harshly limit the background download
    speeds to make sure the game is getting ping times that are
    both low and predictable. So there&#039;s at least some point in that
    7kB receive window limit in some circumstances.
  &lt;/p&gt;

  &lt;p&gt;
    It&#039;s harder to see what the point of the 128kB receive window
    limit for running any app is. A single game download from some
    random CDN isn&#039;t going to muscle out Netflix or Youtube... The
    only thing I can think of is that they&#039;re afraid that multiple
    simultaneous downloads, e.g. due to automatic updates, might cause
    problems for playing video. But even that seems like a stretch.
  &lt;/p&gt;

  &lt;p&gt;
    There&#039;s an alternate theory that this is due to some non-network
    resource constraints (e.g. CPU, memory, disk). I don&#039;t think that
    works. If the CPU or disk were the constraint, just having the
    appropriate priorities in place would automatically take care of
    this. If the download process gets starved of CPU or disk
    bandwidth due to a low priority, the receive buffer would fill up
    and the receive window would scale down dynamically, exactly when
    needed. And the amounts of RAM we&#039;re talking about here are
    minuscule on a machine with 8GB of RAM; less than a megabyte.
  &lt;/p&gt;

  &lt;p&gt;&lt;b&gt;Q: Is this feature implemented well?&lt;/b&gt;&lt;/p&gt;

  &lt;p&gt;
    Oh dear God, no. It&#039;s hard to believe just how sloppy this
    implementation is.
  &lt;/p&gt;

  &lt;p&gt;
    The biggest problem is that the limits get applied based just on
    what games/applications are currently running.  That&#039;s just
    insane; what matters should be which games/applications someone is
    currently using. Especially in a console UI, it&#039;s a totally
    reasonable expectation that the foreground application gets
    priority. If I&#039;ve got the download progress bar in the foreground,
    the system had damn well better give that download priority. Not some
    application that was started a month ago, and hasn&#039;t been used
    since. Applying these limits in rest mode with suspended
    apps is beyond insane.
  &lt;/p&gt;

  &lt;p&gt;
    Second, these limits get applied per-connection.  So if you&#039;ve got
    a single download going, it&#039;ll get limited to 128kB of receive
    window. If you&#039;ve got five downloads, they&#039;ll all get 128kB, for a
    total of 640kB. That means the effectiveness of the &amp;quot;make sure downloads
    don&#039;t clog the network&amp;quot; policy depends purely on how many downloads
    are active. That&#039;s rubbish. This is all controlled on the
    application level, and the application knows how many downloads
    are active. If there really were an optimal static receive window
    X, it should just be split evenly across all the downloads.
  &lt;/p&gt;
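
  &lt;p&gt;
    To make the difference concrete, here is a toy comparison of the two
    policies. The 128kB figure is the observed per-connection limit; using
    it as the shared budget as well is just an assumption for the example.
  &lt;/p&gt;

```python
KB = 1024


def per_connection_total(limit_bytes, downloads):
    """Aggregate window when every connection gets the full limit."""
    return limit_bytes * downloads


def split_budget(budget_bytes, downloads):
    """Per-connection window when one fixed budget is shared evenly."""
    return budget_bytes // downloads


for n in (1, 2, 5):
    total = per_connection_total(128 * KB, n) // KB
    share = split_budget(128 * KB, n) // KB
    print(f"{n} downloads: per-connection policy totals {total}kB, "
          f"shared budget gives {share}kB each")
```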

  &lt;p&gt;
    Third, the core idea of applying a static receive window as a
    means of fighting bufferbloat is just fundamentally broken.
    Using the receive window as the rate limiting mechanism just
    means that the actual transfer rate will depend on the RTT
    (this is why a local proxy helps). For this kind of thing to
    work well, you can&#039;t have the rate limit depend on the
    RTT. You also can&#039;t just have somebody come up with a number
    once, and apply that limit to everyone. The limit needs to
    depend on the actual network conditions.
  &lt;/p&gt;

   &lt;p&gt;
    There are ways to detect how congested the downlink is in the
    client-side TCP stack. The proper fix would be to implement them,
    and adjust the receive window of low-priority background downloads
    if and only if congestion becomes an issue. That would actually be
    a pretty valuable feature for this kind of appliance. But I can
    kind of forgive this one; it&#039;s not an off the shelf feature, and
    maybe Sony doesn&#039;t employ any TCP kernel hackers.
  &lt;/p&gt;

  &lt;p&gt;
    Fourth, whatever method is being used to decide on whether a game
    is network-latency sensitive is broken. It&#039;s absurd that a demo of
    a single-player game idling in the initial title screen would
    cause the download speeds to be totally crippled. This really
    should be limited to actual multiplayer titles, and ideally just
    to periods where someone is actually playing the game online.
    Just having the game running should not be enough.
  &lt;/p&gt;

  &lt;p&gt;&lt;b&gt;Q: How can this still be a problem, 4 years after launch?&lt;/b&gt;&lt;/p&gt;

  &lt;p&gt;
    I have no idea. Sony must know that the PSN download speeds have
    been the butt of jokes for years. It&#039;s probably the biggest
    complaint people have with the system. So it&#039;s hard to believe
    that nobody was ever given the task of figuring out why it&#039;s
    slow. And this is not rocket science; anyone bothering to look
    into it would find these problems in a day.&lt;/p&gt;

  &lt;p&gt;
    But it seems equally impossible that they know of the cause, but
    decided not to apply any of the trivial fixes to it. (Hell, it
    wouldn&#039;t even need to be a proper technical fix. It could just be
    a piece of text saying that downloads will work faster with all
    other apps closed).
  &lt;/p&gt;

  &lt;p&gt;
    So while it&#039;s possible to speculate in an informed manner about
    other things, this particular question will remain as an open
    mystery.  Big companies don&#039;t always get things done very
    efficiently, eh?
  &lt;/p&gt;

  &lt;h3&gt;Footnotes&lt;/h3&gt;

  &lt;div class=footnotes&gt;

    &lt;p&gt; &lt;a id=&#039;fn1&#039;&gt;[&lt;a href=&#039;#fnref1&#039;&gt;1&lt;/a&gt;]
      How idle? So idle that I hadn&#039;t even logged in; the app
        was in the login screen.
    &lt;/p&gt;

    &lt;p&gt;
      &lt;a id=&#039;fn2&#039;&gt;[&lt;a href=&#039;#fnref2&#039;&gt;2&lt;/a&gt;] To be specific, the
        slowdown is caused by the artificial latency changes. The PS4
        downloads files in chunks, and each chunk can be served from a
        different CDN. The CDN that was being used from 10:51 to 11:00
        was using a delay-based congestion control algorithm, and
        reacting to the extra latency by reducing the amount of data
        sent. The CDN used earlier in the connection was using a
        packet-loss based congestion control algorithm, and did not
        slow down despite seeing the latency change in exactly the same
        pattern.
    &lt;/p&gt;
  &lt;/div&gt;
</description><author>jsnell@iki.fi</author><category>NETWORKING</category><category>GAMES</category><pubDate>Sat, 19 Aug 2017 19:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2017-08-19-slow-ps4-downloads/</guid></item><item><title>The mystery of the hanging S3 downloads</title><link>https://www.snellman.net/blog/archive/2017-07-20-s3-mystery/</link><description>
  &lt;p&gt;
    A coworker was experiencing a strange problem with their Internet
    connection at home. Large downloads from most sites worked
    fine. The exception was that downloads from Amazon S3 would get
    up to a good speed (500Mbps), stall completely for a few seconds,
    restart for a while, stall again, and eventually hang
    completely. The problem seemed to be specific to S3;
    downloads from generic AWS VMs were ok.
  &lt;/p&gt;

  &lt;p&gt;
    What could be going on? It shouldn&#039;t be a problem with
    the ISP, or anything south of that: after all, connections to other
    sites were working. It should not be a problem between the ISP
    and Amazon, or there would have been problems with AWS too.
    But it also seems very unlikely that S3 would
    have a trivially reproducible problem causing large downloads to hang.
    It&#039;s not like this is some minor use case of the service.
  &lt;/p&gt;

  &lt;p&gt;
    If it had been a problem with e.g. viewing Netflix, one might
    suspect some kind of targeted traffic shaping. But an ISP
    throttling or forcibly closing connections to S3 but not to AWS in
    general? That&#039;s just silly talk.
  &lt;/p&gt;

  &lt;p&gt;
    The normal troubleshooting tips like reducing the MTU didn&#039;t help
    either. This sounded like a fascinating networking whodunit,
    so I couldn&#039;t resist butting in after hearing about it through
    the grapevine.
  &lt;/p&gt;

&lt;read-more&gt;&lt;/read-more&gt;

  &lt;h3&gt;The packet captures&lt;/h3&gt;

  &lt;p&gt;
    The first step of debugging pretty much any networking problem is getting
    a packet capture from as many points in the network as possible. In this
    case we only had one capture point: the client machine. The problem
    could not be reproduced on anything but S3, and obviously taking a capture
    from S3 was not an option. Nor did we have access to any devices elsewhere
    on the traffic path. &lt;a id=&#039;fnref0&#039;&gt;[&lt;a href=&#039;#fn0&#039;&gt;0&lt;/a&gt;]
  &lt;/p&gt;

  &lt;p&gt;
    A superficial check of the ACK stream showed the following pattern.
    The traffic would be humming along nicely; from the sequence numbers
    we can see that about 57MB have already been downloaded in the first
    2.5 seconds.
  &lt;/p&gt;

&lt;pre&gt;00:00:02.543596 client &amp;gt; server: Flags [.], ack &lt;b&gt;57657817&lt;/b&gt;
00:00:02.543623 client &amp;gt; server: Flags [.], ack 57661318
00:00:02.543682 client &amp;gt; server: Flags [.], ack 57667046
&lt;/pre&gt;

  &lt;p&gt;Then, a single packet loss occurs. We can tell from the SACK block that 1432 bytes of payload are missing. That&#039;s almost certainly a single packet.&lt;/p&gt;
&lt;pre&gt;&lt;b&gt;00:00:02.543734&lt;/b&gt; client &amp;gt; server: Flags [.], ack &lt;b&gt;57667046&lt;/b&gt;,
    options [sack 1 {&lt;b&gt;57668478&lt;/b&gt;:57669910}]
&lt;/pre&gt;

  &lt;p&gt;After the single packet loss, more data continues to be delivered
    with no problems. In the next 100ms a further 6MB gets delivered. But the
    missing data never arrives.&lt;/p&gt;
&lt;pre&gt;...
00:00:02.648316 client &amp;gt; server: Flags [.], ack 57667046,
    options [sack 1 {57668478:63829515}]
&lt;b&gt;00:00:02.648371&lt;/b&gt; client &amp;gt; server: Flags [.], ack 57667046,
    options [sack 1 {57668478:&lt;b&gt;63830947&lt;/b&gt;}]
&lt;/pre&gt;
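
  &lt;p&gt;
    The numbers quoted above fall straight out of the ACK and SACK fields.
    A quick sketch of the arithmetic, using the values from the capture
    lines (the helper names are my own):
  &lt;/p&gt;

```python
def sack_hole_bytes(ack, sack_left):
    """Bytes missing between the cumulative ACK and the first SACK block."""
    return sack_left - ack


def sacked_bytes(sack_left, sack_right):
    """Bytes received out of order, as covered by one SACK block."""
    return sack_right - sack_left


# Values copied from the capture lines above.
ack = 57667046
hole = sack_hole_bytes(ack, 57668478)
delivered = sacked_bytes(57668478, 63830947)
elapsed = 2.648371 - 2.543734

print("missing bytes:", hole)
print(f"delivered past the hole: {delivered / 1e6:.2f} MB in {elapsed * 1000:.0f} ms")
```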

  &lt;p&gt;In fact, no further ACKs are sent for 4 seconds. And even then the hole is not
    filled by one 1432 byte packet like we expected, but by two 512 byte packets and one
    408 byte one. There&#039;s also an RTT-sized delay between the first and second
    packets.
  &lt;/p&gt;

  &lt;pre&gt;00:00:&lt;b&gt;06.751691&lt;/b&gt; client &amp;gt; server: Flags [.], ack &lt;b&gt;57667558&lt;/b&gt;,
    options [sack 1 {57668478:63830947}]
00:00:&lt;b&gt;06.792592&lt;/b&gt; client &amp;gt; server: Flags [.], ack &lt;b&gt;57668070&lt;/b&gt;,
    options [sack 1 {57668478:63830947}]
00:00:06.796277 client &amp;gt; server: Flags [.], ack &lt;b&gt;63830947&lt;/b&gt;
&lt;/pre&gt;

  &lt;p&gt;
    After that, the connection continues merrily along, but the exact same thing
    happens 3 seconds later.
  &lt;/p&gt;

  &lt;p&gt;What can we tell from this? Clearly the actual server would be
    retransmitting the lost packet much more quickly than with a 4 second
    delay. It also would not be re-packetizing the 1432 byte packet into three
    pieces. Instead what must be happening is that each retransmitted copy
    is getting lost. After a few seconds RFC 4821-style path MTU probing kicks in,
    and a smaller packet gets retransmitted. For some reason this retransmission
    makes it through; this makes the sender believe that the path MTU has been
    reduced, and it starts sending smaller packets.&lt;/p&gt;

  &lt;p&gt;Again this suggests there&#039;s something dodgy going on with MTUs, but as
    mentioned in the beginning, reducing the MTU did not help.&lt;/p&gt;

  &lt;p&gt;But it also suggests a mechanism for why the connection eventually hangs
    completely, rather than alternating between stalling and recovering.
    There&#039;s a limit to how far
    the MSS can be reduced. If nothing else, the segments will need to
    have at least one byte of payload. In practice most operating systems have
    a much higher limit on the MSS (something in the 80-160 byte range is
    typical). If even packets of the minimum size aren&#039;t making it through,
    the server can&#039;t react by sending smaller packets.&lt;/p&gt;

  &lt;p&gt;With the information from the ACK stream exhausted, it&#039;s time
    to look at the packets in both directions. And what do you know?
    We actually see the earlier retransmissions at the client, with
    beautiful exponential backoff.
    The packets were not lost in the network, but were silently rejected by
    the client for some reason.&lt;/p&gt;

 &lt;pre&gt;00:00:02.685557 server &amp;gt; client: Flags [.], seq 57667046:&lt;b&gt;57668478&lt;/b&gt;, ack 4257, length 1432
00:00:02.960249 server &amp;gt; client: Flags [.], seq 57667046:57668478, ack 4257, length 1432
00:00:03.500500 server &amp;gt; client: Flags [.], seq 57667046:57668478, ack 4257, length 1432
00:00:04.580168 server &amp;gt; client: Flags [.], seq 57667046:57668478, ack 4257, length 1432
00:00:06.751657 server &amp;gt; client: Flags [.], seq 57667046:&lt;b&gt;57667558&lt;/b&gt;, ack 4257, length 512
00:00:06.751691 client &amp;gt; server: Flags [.], ack 57667558, win 65528,
    options [sack 1 {57668478:63830947}]
00:00:06.792565 server &amp;gt; client: Flags [.], seq &lt;b&gt;57667558:57668070&lt;/b&gt;, ack 4257, length 512
00:00:06.792567 server &amp;gt; client: Flags [.], seq &lt;b&gt;57668070:57668478&lt;/b&gt;, ack 4257, length 408
00:00:06.792592 client &amp;gt; server: Flags [.], ack 57668070,
    options [sack 1 {57668478:63830947}]
&lt;/pre&gt;
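
  &lt;p&gt;
    The backoff is easy to check from the timestamps. This sketch takes the
    arrival times of the five copies of the segment starting at sequence
    number 57667046 and prints the gap between consecutive copies, plus the
    ratio of each gap to the previous one:
  &lt;/p&gt;

```python
# Arrival times (seconds) of the five copies of the segment starting at
# sequence number 57667046, copied from the capture above.
arrivals = [2.685557, 2.960249, 3.500500, 4.580168, 6.751657]

gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
ratios = [b / a for a, b in zip(gaps, gaps[1:])]

for gap in gaps:
    print(f"gap: {gap * 1000:.0f} ms")
for ratio in ratios:
    print(f"ratio to previous gap: {ratio:.2f}")
```

  &lt;p&gt;
    Each gap is close to double the previous one, which is what a
    retransmission timer with exponential backoff should look like.
  &lt;/p&gt;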

  &lt;p&gt;There are really just two reasons this would happen. The IP or
    TCP checksum could be wrong. But how could it be wrong for the
    same packet six times in a row? That&#039;s crazy talk, the expected packet
    corruption rate is more like one in a million. Alternatively
    the packet is too large. But damn it, we know that&#039;s not the
    problem, no matter how well this case is matching the common pattern.
    Let&#039;s just have a look at the checksums, to rule it out...&lt;/p&gt;

&lt;pre&gt;server &amp;gt; client: Flags [.], cksum &lt;b&gt;0x0000&lt;/b&gt; (incorrect -&amp;gt; &lt;b&gt;0xd7a7&lt;/b&gt;), seq 57667046:57668478, ack 4257, length 1432
server &amp;gt; client: Flags [.], cksum 0x0000 (incorrect -&amp;gt; 0xd7a7), seq 57667046:57668478, ack 4257, length 1432
server &amp;gt; client: Flags [.], cksum 0x0000 (incorrect -&amp;gt; 0xd7a7), seq 57667046:57668478, ack 4257, length 1432
...
&lt;/pre&gt;

  &lt;p&gt;Oh... Every single copy of that packet had a checksum of 0 instead of the
    expected checksum of 0xd7a7. (Checksums of 0 are often not real errors,
but just artifacts of checksum offload: the packets get captured by software
before the checksum is computed by hardware.
That&#039;s not the case here; these are packets we&#039;re receiving rather than
transmitting.) And it gets crazier when we look at the next
    instance of the problem a few seconds later.&lt;/p&gt;

&lt;pre&gt;server &amp;gt; client: Flags [.], cksum 0x0000 (incorrect -&amp;gt; 0xd7a7), seq 70927740:70928764, ack 4709, length 1024
server &amp;gt; client: Flags [.], cksum 0x0000 (incorrect -&amp;gt; 0xd7a7), seq 70927740:70928764, ack 4709, length 1024
server &amp;gt; client: Flags [.], cksum 0x0000 (incorrect -&amp;gt; 0xd7a7), seq 70927740:70928764, ack 4709, length 1024
...
&lt;/pre&gt;

  &lt;p&gt;It&#039;s the exact same problem, all the way down to the problem
    appearing specifically with a TCP checksum of 0xd7a7. Further
    analysis of the captures verified that this was a systematic
    problem and not a coincidence. &lt;b&gt;Packets with an expected checksum of
    0xd7a7 would always have the checksum replaced with
    0. Packets with any other expected checksum would work just fine.&lt;/b&gt;
    &lt;a id=&#039;fnref1&#039;&gt;[&lt;a href=&#039;#fn1&#039;&gt;1&lt;/a&gt;].&lt;/p&gt;

  &lt;p&gt;This explains why the path MTU probing temporarily fixes the problem:
    the repacketized segments have different checksums, and make it through
    unharmed.&lt;/p&gt;
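
  &lt;p&gt;To make that concrete, here&#039;s a minimal sketch in Python of the 16-bit
    one&#039;s complement checksum that TCP uses. It&#039;s a simplification: a real TCP
    checksum also covers the TCP header and a pseudo-header, while this only
    sums a made-up payload. The function name and the data are placeholders
    for illustration, not something taken from the captures.&lt;/p&gt;

&lt;pre&gt;
def internet_checksum(data):
    # Sum the data as big-endian 16-bit words (RFC 1071 style).
    if len(data) % 2:
        data += bytes(1)  # pad odd-length input with a zero byte
    total = sum(data[0::2]) * 256 + sum(data[1::2])
    # Fold the carries back into the low 16 bits.
    while total // 65536:
        total = total % 65536 + total // 65536
    # The checksum is the complement of the folded sum.
    return 65535 - total

payload = bytes(range(256)) * 8  # 2048 bytes of made-up data

# The same byte stream cut into segments of different lengths.
original = internet_checksum(payload[:1432])
resegmented = internet_checksum(payload[:1024])
print(hex(original), hex(resegmented))
&lt;/pre&gt;

  &lt;p&gt;The two segments start at the same byte but end at different ones, so
    their checksums differ, and a resegmented packet is very unlikely to land
    on the one bad value again.&lt;/p&gt;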

  &lt;h3&gt;TCP Timestamps&lt;/h3&gt;

  &lt;p&gt;So, a problem internal to S3 is causing this very specific kind
    of packet corruption then?&lt;/p&gt;

  &lt;p&gt;Not so fast! It turns out that most TCP implementations would
    work around this kind of corruption by accident. The reason for
    that is TCP Timestamps. And while you don&#039;t need to actually know
    much about TCP Timestamps to understand this story, I have been
    looking for an excuse to rant about them.
  &lt;/p&gt;

  &lt;p&gt;With TCP Timestamps, every TCP packet will contain a TCP option
    with two extra values. One of them is the sender&#039;s latest
    timestamp. The other is an echo of the latest timestamp the sender
    received from the other party. For example here the client is
    sending the timestamp 805, and the server is echoing it back:&lt;/p&gt;

  &lt;pre&gt;client &amp;gt; server: Flags [.], ack 89,
    options [TS val &lt;b&gt;805&lt;/b&gt; ecr 10087]
server &amp;gt; client: Flags [P.], seq 89:450, ack 569,
    options [TS val 10112 ecr &lt;b&gt;805&lt;/b&gt;]
&lt;/pre&gt;
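
  &lt;p&gt;As a rough sketch of what that option looks like on the wire: the
    Timestamps option has kind 8 and length 10, followed by the two 32-bit
    values, and stacks commonly put two NOP options in front to pad it to 12
    bytes. The helper names below are made up for illustration; the values
    are the ones from the trace above.&lt;/p&gt;

&lt;pre&gt;
import struct

TCPOPT_NOP = 1        # one-byte padding option
TCPOPT_TIMESTAMP = 8  # Timestamps option kind
TIMESTAMP_LEN = 10    # kind + length + TSval + TSecr

def build_timestamp_option(tsval, tsecr):
    # NOP, NOP, kind, length, then two big-endian 32-bit values.
    return struct.pack("!BBBBII", TCPOPT_NOP, TCPOPT_NOP,
                       TCPOPT_TIMESTAMP, TIMESTAMP_LEN, tsval, tsecr)

def parse_timestamp_option(raw):
    nop1, nop2, kind, length, tsval, tsecr = struct.unpack("!BBBBII", raw)
    if (nop1, nop2, kind, length) != (TCPOPT_NOP, TCPOPT_NOP,
                                      TCPOPT_TIMESTAMP, TIMESTAMP_LEN):
        raise ValueError("not a NOP-padded Timestamps option")
    return tsval, tsecr

client_option = build_timestamp_option(805, 10087)
server_option = build_timestamp_option(10112, 805)

print(len(client_option))                     # size added to each packet
print(parse_timestamp_option(server_option))  # (TSval, TSecr) from the server
&lt;/pre&gt;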

  &lt;p&gt;
    TCP Timestamps were added to TCP very early on, for two
    reasons, neither of which was very compelling in retrospect.&lt;/p&gt;

  &lt;p&gt;Reason number one was PAWS, Protection Against Wrapped
    Sequences. The idea was that very fast connections might
    require huge TCP window sizes, and minor packet reordering/duplication
    might cause an old packet to be interpreted as a new packet, due to the
    32 bit sequence number having wrapped around. I don&#039;t think that
    world ever really arrived, and PAWS is irrelevant to practically
    all TCP use cases.&lt;/p&gt;

  &lt;p&gt;The other original reason for timestamps was to enable TCP
    senders to measure RTTs in the presence of packet loss. But this
    can also be done with TCP Selective ACKs, a feature that&#039;s much
    more useful in general (and thus was widely deployed a lot sooner,
    despite being standardized later).
  &lt;/p&gt;

  &lt;p&gt;In exchange for these dubious benefits, every TCP packet (both
    data segments and pure control packets) is bloated by 12 bytes.
    This is in contrast to something like selective ACKs, where most
    packets don&#039;t grow in size. You only pay for selective ACKs when
    packets are lost or reordered. I &lt;a href=&#039;https://www.snellman.net/blog/archive/2016-12-01-quic-tou/&#039;&gt;think that the debuggability
    of network protocols is important&lt;/a&gt;, but with TCP you get basically
    everything you need from other sources. TCP timestamps have a high
    fixed cost, but give very little additional power.
  &lt;/p&gt;

  &lt;p&gt;If TCP Timestamps suck so much, why does everyone use
    them? I don&#039;t know anyone else&#039;s reasons for sure. I ended up
    implementing them purely due to an interoperability issue with the
    FreeBSD TCP stack. Basically FreeBSD uses a small static receive
    window for connections without TCP timestamps, while with TCP
    timestamps on it&#039;d scale the receive window up as necessary.
    With connections with even a bit of latency, you needed
    TCP timestamps to avoid the receive window becoming a bottleneck.
    (This was &lt;a href=&#039;https://svnweb.freebsd.org/base?view=revision&amp;revision=316676&#039;&gt;fixed in FreeBSD a few months ago&lt;/a&gt;. Yay!).&lt;/p&gt;

  &lt;p&gt;Now, performance of FreeBSD clients isn&#039;t a big deal for me as long as
    the connections work. But you know who else uses a FreeBSD-derived
    TCP stack? Apple. And when it comes to mobile networks, performance
    of iOS devices is about as important as it gets. Anyone who cares about
    large transfers to iOS or OS X clients must use TCP Timestamps,
    no matter how distasteful they find the feature.&lt;/p&gt;

  &lt;p&gt;&lt;i&gt;&quot;But Juho, what does any of this have to do with S3?&quot;&lt;/i&gt;, you ask.
    Well, S3 is one of those rare services that disable
    timestamps. And that actually makes for a big difference
    in this case. With timestamps, each retransmitted copy of a packet would use a
    different timestamp value &lt;a id=&#039;fn2&#039;&gt;[&lt;a href=&#039;#fnref2&#039;&gt;2&lt;/a&gt;].
    And when any part of the TCP header changes, odds are that the
      checksum changes as well. Even if some packets are lost due to
      having the magic checksum, at least the retransmissions will
      make it through promptly.
  &lt;/p&gt;
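
  &lt;p&gt;A small sketch of why that helps, under heavy simplification: the
    checksum below covers only a NOP-padded Timestamps option plus a made-up
    payload, not a full TCP header and pseudo-header. The function names and
    the second timestamp value (10137) are assumptions for illustration.&lt;/p&gt;

&lt;pre&gt;
import struct

def internet_checksum(data):
    # 16-bit checksum over big-endian words, carries folded back in.
    if len(data) % 2:
        data += bytes(1)
    total = sum(data[0::2]) * 256 + sum(data[1::2])
    while total // 65536:
        total = total % 65536 + total // 65536
    return 65535 - total

def segment(payload, tsval=None, tsecr=None):
    # Simplified segment: optional Timestamps option followed by the payload.
    if tsval is None:
        return payload
    option = struct.pack("!BBBBII", 1, 1, 8, 10, tsval, tsecr)
    return option + payload

payload = bytes(range(200))  # made-up data

# Without timestamps a retransmission is byte-for-byte identical.
plain_first = internet_checksum(segment(payload))
plain_retransmit = internet_checksum(segment(payload))

# With timestamps the retransmission carries a newer TSval.
ts_first = internet_checksum(segment(payload, 10112, 805))
ts_retransmit = internet_checksum(segment(payload, 10137, 805))

print(plain_first == plain_retransmit)  # identical bytes on both attempts
print(ts_first == ts_retransmit)        # one header word differs
&lt;/pre&gt;

  &lt;p&gt;Without the option, every retransmission has the same checksum as the
    original and meets the same fate. With it, the retransmission is a
    different packet as far as the checksum is concerned.&lt;/p&gt;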

  &lt;p&gt;To check this theory, I asked for a test with TCP timestamps
    disabled on the client. And immediately large downloads from
    anywhere - even the ISP&#039;s own speedtest server - started hanging.
    Success!&lt;/p&gt;

  &lt;h3&gt;Conclusion&lt;/h3&gt;

  &lt;p&gt;With this information I suggested my coworker call his ISP, and
    report the problem.
    He was smarter than that, and ran one more test: switching the
    cable modem from router mode to bridging mode. Bam, the problem
    was gone. In retrospect this makes sense: in router mode the cable
    modem needs to update the checksums for each packet that passes
    through the device. In bridging mode there&#039;s no NAT, so no
    checksum update is needed.
  &lt;/p&gt;
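
  &lt;p&gt;How this particular modem does the update is unknown. As an illustration
    of what such an update involves, here&#039;s a sketch of the incremental method
    from RFC 1624, where the checksum is adjusted for each rewritten 16-bit
    word instead of being recomputed from scratch. The addresses, the helper
    names and the packet contents are made up, and the checksum again covers
    only these bytes rather than a real TCP segment.&lt;/p&gt;

&lt;pre&gt;
def internet_checksum(data):
    # 16-bit checksum over big-endian words, carries folded back in.
    if len(data) % 2:
        data += bytes(1)
    total = sum(data[0::2]) * 256 + sum(data[1::2])
    while total // 65536:
        total = total % 65536 + total // 65536
    return 65535 - total

def ones_add(a, b):
    # 16-bit ones-complement addition with end-around carry.
    total = a + b
    return total % 65536 + total // 65536

def update_checksum(checksum, old_word, new_word):
    # RFC 1624, equation 3: HC' = ~(~HC + ~m + m')
    total = ones_add(65535 - checksum, 65535 - old_word)
    total = ones_add(total, new_word)
    return 65535 - total

def words(data):
    # Split an even-length byte string into big-endian 16-bit words.
    return [data[i] * 256 + data[i + 1] for i in range(0, len(data), 2)]

public_addr = bytes([203, 0, 113, 7])    # example address before NAT
private_addr = bytes([192, 168, 0, 10])  # example address after NAT
rest = bytes(range(40))                  # stand-in for the rest of the packet

before = internet_checksum(public_addr + rest)
updated = before
for old_word, new_word in zip(words(public_addr), words(private_addr)):
    updated = update_checksum(updated, old_word, new_word)

recomputed = internet_checksum(private_addr + rest)
print(hex(updated), hex(recomputed))
&lt;/pre&gt;

  &lt;p&gt;Either way, a NAT has to touch the checksum of every packet it
    translates, which gives a buggy implementation the opportunity to mangle
    it. A bridge has no reason to touch it at all.&lt;/p&gt;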

  &lt;p&gt;And that&#039;s how a dodgy cable modem caused downloads to fail with
    one service, but one service only. I&#039;ve seen many kinds of packet
    corruption before, but never anything that was so absurdly specific.
  &lt;/p&gt;

  &lt;h3&gt;Footnotes&lt;/h3&gt;

  &lt;div class=footnotes&gt;
    &lt;p&gt;
      &lt;a id=&#039;fn0&#039;&gt;[&lt;a href=&#039;#fnref0&#039;&gt;0&lt;/a&gt;] There are techniques
  around for routing the traffic such that we would have had a
  measurement point. One would have been using something like a VPN or
  a Socks proxy. But that&#039;s such a fundamental change to the traffic
  pattern that it doesn&#039;t make for a very interesting test. Odds are
  that the problem would just go away when you do that. The other
  option would be to use a fully transparent generic TCP proxy on some
  server with a public IP, have the client connect to the TCP
  proxy and the proxy connect to the actual server. But setting that
        up is tedious; certainly not worth doing as a first step.
    &lt;/p&gt;

    &lt;p&gt;It&#039;s also pretty common to only have one trace point to start
      with.  For analysis I&#039;d do for actual work purposes, we pretty
      often have just a trace from somewhere in the middle of the
      path, but nothing from the client or the server. Getting traces
      from multiple points is so much trouble that we usually need
      to roughly pinpoint the problem first with single-point packet
      capture, and only then ask for more trace points.&lt;/p&gt;

    &lt;p&gt;
  &lt;a id=&#039;fn1&#039;&gt;[&lt;a href=&#039;#fnref1&#039;&gt;1&lt;/a&gt;] As far as I can tell 0xd7a7 has no interesting special
    properties. The bytes are not printable ASCII characters. 0xd7a7
    isn&#039;t a value with any special significance in another TCP header
    field either. There are ways to screw up TCP checksum computations, but
    I think they&#039;re mostly to do with the way 0x0 and 0xffff are both
    zero values in a one&#039;s complement system.

    &lt;p&gt;
  &lt;a id=&#039;fn2&#039;&gt;[&lt;a href=&#039;#fnref2&#039;&gt;2&lt;/a&gt;] Assuming sensible timestamp resolution, not the rather impractical 500ms tick that e.g. OpenBSD uses.
  &lt;/div&gt;
</description><author>jsnell@iki.fi</author><category>NETWORKING</category><pubDate>Thu, 20 Jul 2017 16:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2017-07-20-s3-mystery/</guid></item><item><title>I don&#039;t want no &#039;wantarray&#039;</title><link>https://www.snellman.net/blog/archive/2017-07-18-wantarray/</link><description>
  &lt;p&gt;
    A while back, I got a bug report for
    &lt;a href=https://www.snellman.net/blog/archive/2016-01-12-json-to-multicsv/&gt;json-to-multicsv&lt;/a&gt;. The user was getting the following error for
    any input file, including the one used as an example
    in the documentation:
  &lt;/p&gt;

  &lt;pre&gt;
    , or } expected while parsing object/hash, at character offset 2 (before &quot;n&quot;)&lt;/pre&gt;

  &lt;p&gt;The full facts of the matter were:&lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt; The JSON parser was failing on the third character of the
      file.
    &lt;li&gt; That was also the end of the first line in the
      file. (I.e. the first line of the JSON file contained just the
      opening bracket).
    &lt;li&gt; The user was running it on Windows.
    &lt;li&gt; The same input file worked fine for me on Linux.
  &lt;/ul&gt;

  &lt;read-more&gt;&lt;/read-more&gt;

  &lt;p&gt;
    Now, there&#039;s an obvious root cause here. It&#039;s almost
    impossible not to blame this on Windows using CR-LF line endings,
    where Unix uses just LF. The pattern match is irresistible: works
    on Linux, fails on Windows, fails at the end of the first line. And I
    almost answered the email based on this assumption.

  &lt;p&gt;Except... Something feels off with that theory. What would be the
    root cause here? &quot;&lt;i&gt;Wow, I can&#039;t believe that the JSON spec
    missed specifying the CR as whitespace&lt;/i&gt;&quot;? No, that makes no
    sense, nobody would define a text-based file format that
    sloppily. [0]

  &lt;p&gt;How about: &quot;&lt;i&gt;Wow, I can&#039;t believe the JSON module
    of a major programming language has a bug making it fail on all
      inputs on a major operating system, and it took a decade for anyone
      to notice&lt;/i&gt;&quot;. That doesn&#039;t seem plausible either.

  &lt;p&gt;
    So I tried to reproduce the problem, by making a file with DOS
    line endings and running it through the script on Linux. That
    worked fine. Hm. Put in some invalid garbage, and you get a parser
    error as expected. Double-hm. But the error message I got was very
    different from that in the bug report. Could it be that it&#039;s using
    a totally different JSON module altogether?

  &lt;p&gt;
    Turns out that&#039;s basically what was going on. Perl&#039;s JSON module
    doesn&#039;t actually do any parsing itself. It&#039;s mostly a
    shim layer; the actual work is done by one of several
    different parser modules. On Linux, I&#039;d been getting &lt;code&gt;JSON::XS&lt;/code&gt;
    as the backend (&lt;code&gt;XS&lt;/code&gt; is Perl-talk for &quot;native code&quot;). In cases
    where &lt;code&gt;JSON::XS&lt;/code&gt; is not available, the shim module would use
    a pure Perl fallback, e.g. &lt;code&gt;JSON::PP&lt;/code&gt;.

  &lt;p&gt;Ok, so force the JSON module to dispatch to &lt;code&gt;JSON::PP&lt;/code&gt;.
    Success! Problem reproduced. Guess it really was a buggy parser after
    all. Remove the DOS line endings, just to be sure... And it&#039;s still
    failing. WTF?
  &lt;/p&gt;

  &lt;p&gt;A bit more digging revealed that the error message was actually
    a lie. The problem wasn&#039;t with the whitespace, but with
    there being an end of file right after said whitespace. The input
    to &lt;code&gt;JSON::PP&lt;/code&gt; contained just a single line, not the whole
    file! At that point, the actual problem becomes obvious and the
    fix trivial:

  &lt;pre&gt;
-    my $json = decode_json read_file $file;
+    my $json = decode_json scalar read_file $file;&lt;/pre&gt;

  &lt;p&gt;I was using the &lt;code&gt;read_file&lt;/code&gt; function from
    &lt;code&gt;File::Slurp&lt;/code&gt; to read the contents of the file.
    Unfortunately that function behaves differently in scalar and list
    contexts. In scalar context, it returns the contents of the file
    in a single string. In list context, an array of strings. What had
    to be happening was that the context was changing based on the
    backend.
  &lt;/p&gt;

  &lt;p&gt;
    And just why would changing the parser backend change the context
    for that &lt;code&gt;read_file&lt;/code&gt; call? As it happens, the &lt;code&gt;JSON&lt;/code&gt;
    module does not actually define &lt;code&gt;decode_json&lt;/code&gt;, but
    directly aliases to the matching function in the backend. For
    example:
  &lt;/p&gt;

  &lt;pre&gt;*{&quot;JSON::decode_json&quot;} = \&amp;{&quot;JSON::XS::decode_json&quot;};&lt;/pre&gt;

  &lt;p&gt;
    &lt;code&gt;JSON::XS&lt;/code&gt; declares the function with a &lt;code&gt;$&lt;/code&gt;
    prototype forcing the argument to be evaluated in scalar context.
    &lt;code&gt;JSON::PP&lt;/code&gt; uses no prototype, and thus its arguments
    default to being evaluated in list context.
  &lt;/p&gt;

  &lt;h2&gt;The blame game&lt;/h2&gt;

   &lt;p&gt;So, that&#039;s the bug. But what was the real culprit?
     I could come up with the following suspects.&lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt; Me, for using &lt;code&gt;File::Slurp&lt;/code&gt; for this
      in the first place. &lt;i&gt;&quot;Oh, I just always pass a file-handle to
      &lt;code&gt;decode_json&lt;/code&gt;&quot;&lt;/i&gt; said one coworker when I described
      this bug. And that would indeed have side-stepped the problem, and
      &lt;code&gt;read_file&lt;/code&gt; is just saving a couple of lines of code.
      But it&#039;s exactly the couple of lines of code I don&#039;t want to be writing:
      pairing up file opens/closes, and boilerplate error handling.
    &lt;li&gt; Me, for not realizing that the code was only working by
      accident.  I knew &lt;code&gt;read_file&lt;/code&gt; works differently in
      scalar and list contexts. I also knew this case needed scalar context,
      and had no special reason to believe that &lt;code&gt;decode_json&lt;/code&gt;
      would provide it. The default assumption should have been for this
      code not to work. When it did, I should not have accepted it, but
      figured out why it worked and whether it was guaranteed to work
      in the future.
    &lt;li&gt;The &lt;code&gt;JSON&lt;/code&gt; module, for not explicitly documenting
      the inconsistent prototypes as part of the interface. I don&#039;t know
      that anyone would actually notice that in the documentation though.
      It might end up as just cover-your-ass documentation.
    &lt;li&gt; The &lt;code&gt;JSON&lt;/code&gt; module, for directly exposing the
      backend functions with aliasing, for a minimal performance gain.
      It&#039;s a shim: isn&#039;t the whole point to hide away the
      implementation differences from the user?
    &lt;li&gt;The &lt;code&gt;File::Slurp&lt;/code&gt; module, for using &lt;code&gt;wantarray&lt;/code&gt;
      to switch behavior of &lt;code&gt;read_file&lt;/code&gt; based on the context.
    &lt;li&gt;Perl for having the concept of different contexts in the first place.
    &lt;li&gt;Perl for allowing random library code to detect different contexts
      via &lt;code&gt;wantarray&lt;/code&gt;.
  &lt;/ul&gt;

  &lt;p&gt;The thing that really sticks out to me here is overloading
     of &lt;code&gt;File::Slurp::read_file&lt;/code&gt; based on the
     context. Returning a file as a single string vs. an array of
     lines are very different operations. There is absolutely no
     reason for them to share a name. It&#039;d be simpler to implement, simpler to use,
     and simpler to document. It&#039;s even already in a library, so it&#039;s
     not like there would be any kind of namespace pollution by using
     different names. (Unlike for the uses of context-sensitive
     overloading in core Perl. Sure, &lt;code&gt;count&lt;/code&gt; would probably
     make more sense than &lt;code&gt;scalar grep&lt;/code&gt;. But it would be a
     new name in the global namespace).
  &lt;/p&gt;

   &lt;p&gt;What about &lt;code&gt;wantarray&lt;/code&gt;? It&#039;s what&#039;s enabling this
     bogus overloading in the first place. I&#039;ve been using Perl for 20
     years, writing some pretty hairy stuff. As far as I can remember,
     I haven&#039;t used &lt;code&gt;wantarray&lt;/code&gt; once. And what&#039;s more,
     I don&#039;t remember ever using a library that used it to good
     effect. The reason context-sensitivity works in core Perl
     is the limited set of operations. One can reasonably learn the
     entire set of context-sensitive operations, and their (sometimes
     surprising) behavior. It&#039;s a lot less reasonable to expect people to
     learn this for arbitrary amounts of user code.
   &lt;/p&gt;

   &lt;p&gt;It&#039;s a bit unfortunate that function aliasing can cause action
      at a distance like this. But at least that&#039;s a feature with
      solid use cases.&lt;/p&gt;

   &lt;p&gt;So I think that&#039;s where I fall on this. It&#039;s all because of a horrible
     and mostly unnecessary language feature, used for particularly bad effect
     in a library. It feels like avoiding this kind of problem on the consumer
     side is almost impossible; it&#039;d just require superhuman levels of attention
     to detail. Avoiding it on the producer side is really easy:
     &lt;code&gt;wantarray&lt;/code&gt;: just say no.&lt;/p&gt;

  &lt;h3&gt;Footnotes&lt;/h3&gt;
  &lt;div class=footnotes&gt;

  &lt;a id=&#039;fn0&#039;&gt;[&lt;a href=&#039;#fnref0&#039;&gt;0&lt;/a&gt;] Did you nod and agree at &lt;i&gt;&quot;that makes no sense&quot;&lt;/i&gt;? Haha. The
    original JSON spec does say that &lt;i&gt;&quot;whitespace can be inserted
    between any two tokens&quot;&lt;/i&gt;, but doesn&#039;t actually define whitespace.
   &lt;/div&gt;

</description><author>jsnell@iki.fi</author><category>PERL</category><pubDate>Tue, 18 Jul 2017 18:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2017-07-18-wantarray/</guid></item><item><title>The origins of XXX as FIXME</title><link>https://www.snellman.net/blog/archive/2017-04-17-xxx-fixme/</link><description>
&lt;p&gt;
  The token &lt;code&gt;XXX&lt;/code&gt; is frequently used in source code
  comments as a way of marking some code as needing attention.
  (Similar to a &lt;code&gt;FIXME&lt;/code&gt; or &lt;code&gt;TODO&lt;/code&gt;, though at
  least to me &lt;code&gt;XXX&lt;/code&gt; signals something far to the hacky
  end of the spectrum, and perhaps even outright broken).

&lt;p&gt;
  It&#039;s a bit of an odd and non-obvious string though, unlike
  &lt;code&gt;FIXME&lt;/code&gt; and &lt;code&gt;TODO&lt;/code&gt;. Where did this
  convention come from? I did a little bit of light software archaeology
  to try to find out. To start with, my guesses in order were:

&lt;ul&gt;
  &lt;li&gt; MIT (since it sometimes feels like that&#039;s the source of
    90% of ancient hacker shibboleths)
  &lt;li&gt; Early Unix (probably the most influential codebase that&#039;s
    ever existed)
  &lt;li&gt; Some kind of DEC thing (because really, all the world was
    a PDP)
&lt;/ul&gt;

&lt;read-more&gt;&lt;/read-more&gt;

&lt;h3&gt;Other uses of &lt;code&gt;XXX&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;
  It turns out that &lt;code&gt;XXX&lt;/code&gt; and &lt;code&gt;xxx&lt;/code&gt; are
  incredibly annoying things to search for in old code. I&#039;d bet it&#039;s
  the most common sequence of 3+ identical letters in source code.
  That means there&#039;s a ton of false positives to sift through.
  Here&#039;s a few examples of the kind of stuff that will be found.

&lt;p&gt;
  By far the most common use of &lt;code&gt;XXX&lt;/code&gt; in old code is for it to
  be some kind of a &lt;b&gt;template placeholder&lt;/b&gt;. This makes some sense;
  &lt;code&gt;x&lt;/code&gt; for an unknown value has an obvious long history
  that predates computing. These templates might be used to describe
  the exact data layout of something, like in the following bits from the
  Apollo guidance computer:
&lt;/p&gt;

&lt;pre&gt;
# 17    ASTRONAUT TOTAL ATTITUDE      3COMP   XXX.XX DEG FOR EACH
# 18    AUTO MANEUVER BALL ANGLES     3COMP   XXX.XX DEG FOR EACH
# 19    BYPASS ATTITUDE TRIM MANEUVER 3COMP   XXX.XX DEG FOR EACH
# 20    ICDU ANGLES                   3COMP   XXX.XX DEG FOR EACH
# 21    PIPAS                         3COMP   XXXXX. PULSES FOR EACH
# 22    NEW ICDU ANGLES               3COMP   XXX.XX DEG FOR EACH
# 23    SPARE
# 24    DELTA TIME FOR AGC CLOCK      3COMP   00XXX. HRS. DEC ONLY
&lt;/pre&gt;

&lt;p&gt;Or as just a wildcard for a bunch of related names, like in
  this Lisp Machine source code:&lt;/p&gt;

&lt;pre&gt;
;Q-FASL-xxxx refers to functions which load into the cold load, and
; return a &quot;Q&quot;, i.e. a list of data-type and address-expression.
;M-FASL-xxxx refers to functions which load into Maclisp, and
; return a Lisp object.
&lt;/pre&gt;

&lt;p&gt;Or as actual templates-as-program, where parts of
  an input remain while others (those marked with &lt;code&gt;XXX&lt;/code&gt;)
  are programmatically replaced. For example, temporary file generation
  in UNIXv5:&lt;/p&gt;

&lt;pre&gt;
                f = ranname(&quot;/usr/lpd/dfxxx&quot;);
&lt;/pre&gt;

&lt;p&gt;And finally, it could denote parts of persistent data structures
  that were reserved for future use (or no longer used), for example
  in CPM:&lt;/p&gt;

&lt;pre&gt;
/* THE FILE CONTROL BLOCK FORMAT IS SH0WN BELOW:
   --------------------------------------------------------
   /    1 BY / 8 BY / 3 BY / 1 BY /2BY/1 BY/ 16 BY /
   /F1LETYPE/   NAME / EXT / REEL NO/XXX/RCNT/DM0 DM15/
   --------------------------------------------------------

   FILETYPE     :       0E5H IF AVAILABLE (OTHERWISE UNDEFINED NOW)
...
   XXX          :       UNUSED FOR NOW
   RCNT         :       RECORD COUNT IN FILE (0 TO , 127)
&lt;/pre&gt;

&lt;p&gt;
  A less savoury use of &lt;code&gt;XXX&lt;/code&gt; is as an identifier for
  something that didn&#039;t even qualify to have a real name. Most
  commonly it&#039;d be the name of a branch target, like in a very
  early version of the C compiler:
&lt;/p&gt;

&lt;pre&gt;
    xxx:
        if (o==KEYW) {
                if (cval==EXTERN) {
                        o = symbol();
                        goto xxx;
                }
&lt;/pre&gt;

&lt;p&gt;It could also be used to name variables. The following is from the
  FORTRAN II compiler for the IBM 704 from 1958. (I don&#039;t read 704
  assembler, so maybe I&#039;m misinterpreting what&#039;s going on in that
  program. It seems funny enough that I wanted to include it here
  anyway).
&lt;/p&gt;

&lt;pre&gt;
XXXXXX SYN 0  THE APPEARANCE OF THIS SYMBOL IN   F4400370
       REM       THE LISTING INDICATES THAT ITS  F4400380
       REM       VALUE IS SET BY THE PROGRAM.    F4400390
&lt;/pre&gt;

&lt;p&gt;
  Some DEC code seems to have gone really overboard with this, with
  single source files having half a dozen different &lt;code&gt;XXXYYY&lt;/code&gt;
  identifiers. (Sorry, had to use YYY as the placeholder there for
  obvious reasons).

&lt;p&gt;
  Finally, there are all kinds of bizarre one-off uses. TENEX seems
  to have used &lt;code&gt;XXX&lt;/code&gt; for implementing rubout. That is,
  when you&#039;d press backspace to delete something you&#039;ve typed, it&#039;d
  print out XXX on the teletype to mark the deletion. (Rather than
  try to move the cursor back). Some kind of proto-instant messaging
  program from 1976 written in Interlisp that I found would just print
  &lt;code&gt;XXX&lt;/code&gt; as the error message for invalid user input.

&lt;p&gt;
  Now, sorry if the above parts were kind of tedious. But there is
  actually a point here. Turns out that &lt;code&gt;XXX&lt;/code&gt; is
  a really stupid marker to use for
  a &lt;code&gt;FIXME&lt;/code&gt;. Looking at the Panda TOPS-20 distribution,
  there are 3083 instances of XXX, none of which are &lt;code&gt;FIXME&lt;/code&gt;s.
  Just about anything else would be easier to
  find. This makes its use as one of the three main
  &lt;code&gt;FIXME&lt;/code&gt;-markers all the more puzzling.
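
&lt;p&gt;
  To make the false-positive problem concrete, here&#039;s a small Python sketch
  of a filter that only reports &lt;code&gt;XXX&lt;/code&gt; when it appears as a
  standalone token inside a C-style comment. The regular expression is an
  assumption made up for illustration, not the method used for the searches
  described in this post, and the sample lines are taken from snippets quoted
  elsewhere in it.
&lt;/p&gt;

&lt;pre&gt;
import re

# Standalone XXX token inside a C-style comment on a single line.
C_COMMENT_XXX = re.compile(r"/\*.*\bXXX\b.*\*/")

samples = [
    "# 17    ASTRONAUT TOTAL ATTITUDE      3COMP   XXX.XX DEG FOR EACH",
    "   XXX          :       UNUSED FOR NOW",
    "        storeword(i,width(i));  /* XXX */",
    "/* XXX */                      /* tcp_close(tp, UCLOSED); */",
]

hits = [line for line in samples if C_COMMENT_XXX.search(line)]
for line in hits:
    print(line.strip())
&lt;/pre&gt;

&lt;p&gt;
  A filter like this drops the placeholder uses, but it can&#039;t tell a real
  &lt;code&gt;FIXME&lt;/code&gt; from a comment that just happens to contain the same
  token. That part still needs a human reading the surrounding code.
&lt;/p&gt;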

&lt;h3&gt;&lt;code&gt;XXX&lt;/code&gt; as a &lt;code&gt;FIXME&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;
  To get the negative results out of the way, there is absolutely no
  sign of this being an MIT or DEC thing. &lt;code&gt;XXX&lt;/code&gt; as &lt;code&gt;FIXME&lt;/code&gt;
  doesn&#039;t appear on ITS or TOPS-20 disks, nor does it appear in any of
  the mountains of really old Lisp code that I happened to have around;
  I don&#039;t think it makes it to Lisp-land until the mid-&#039;80s. It&#039;s
  also absent in smaller collections of old code from other sources.
&lt;/p&gt;

&lt;p&gt;
  No, this seems to definitely be a Unix thing. There are a couple of
  interesting possibilities in early BSD. First, there&#039;s the following
  lines in a package of &lt;a href=https://github.com/dspinellis/unix-history-repo/commit/b41454192b6489951f36873ca3a792e9b1a73c92&gt;troff macros that
    first appeared in 2BSD, with a copyright date of 1978&lt;/a&gt;:

&lt;pre&gt;
..
.de (t                 \&quot; XXX temp ref to (z
.(z \\$1 \\$2
..
.de )t                 \&quot; XXX temp ref to )t
.)z \\$1 \\$2
&lt;/pre&gt;

&lt;p&gt;I&#039;m pretty sure these are not actually a &lt;code&gt;FIXME&lt;/code&gt;.
  It looks like the convention in this code was to mark
  &lt;code&gt;.de&lt;/code&gt; commands with three character tags depending
  on their type, as explained in the beginning of the file:

&lt;pre&gt;
+.\&quot;	Code on .de commands:
+.\&quot;		***	a user interface macro.
+.\&quot;		&amp;&amp;&amp;	a user interface macro which is redefined
+.\&quot;			when used to be the real thing.
+.\&quot;		$$$	a macro which may be redefined by the user
+.\&quot;			to provide variant functions.
+.\&quot;		---	an internal macro.
&lt;/pre&gt;

&lt;p&gt;These lines seem to have been commands that didn&#039;t fit into
  those existing categories, and needed a new tag.&lt;/p&gt;

&lt;p&gt;
  Next up, there&#039;s a bunch of very promising looking changes to
  the troff C source in the summer of 1980. Stuff like:
&lt;/p&gt;

&lt;pre&gt;
if(j == &#039; &#039;){
        storeword(i,width(i));  /* XXX */
        continue;
}
&lt;/pre&gt;

&lt;p&gt;
  That certainly looks like a classic &lt;code&gt;FIXME&lt;/code&gt;. But I think
  this is another dead end. It turns out that after this change there are
  37 &lt;code&gt;/* XXX */&lt;/code&gt; comments in code that didn&#039;t use to have any.
  And when comparing to Unix v7 source code, it looks like basically
  every single line that was changed got marked with one. So it&#039;s
  unlikely that these are actual &lt;code&gt;FIXME&lt;/code&gt;s. I think this
  was just the author making sure they could identify their changes,
  in case they wanted to reintegrate with &quot;upstream&quot;.
&lt;/p&gt;

&lt;p&gt;Soon after that BSD moves to SCCS, and we start getting fine-grained
  changes rather than huge code-dumps. From there, it&#039;s easy to find
  &lt;a href=&#039;https://github.com/dspinellis/unix-history-repo/commit/9e295a2f65c046125ece0ad68f142f59df4c3400&#039;&gt;the first &lt;code&gt;/* XXX */&lt;/code&gt; commit&lt;/a&gt;
  from Nov 9, 1981. This one is interesting in a few ways:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;This is definitely a &lt;code&gt;FIXME&lt;/code&gt;; just a few very special
    parts of the code got tagged, and many of them got rewritten soon
    after.
  &lt;li&gt;After this commit, the use of &lt;code&gt;/* XXX */&lt;/code&gt; starts
    spreading quickly through the BSD codebase and eventually to
    other authors.
  &lt;li&gt;A closer reading of the commit shows something interesting:
    a bunch of &lt;code&gt;/* ### */&lt;/code&gt; comments. Going through the earlier
    history, it seems that Bill Joy had been marking
    his &lt;code&gt;FIXME&lt;/code&gt;s with &lt;code&gt;###&lt;/code&gt;,
    and halfway through this commit changed to using
    &lt;code&gt;XXX&lt;/code&gt;. I don&#039;t know why, or whether these two markers were
    intended to have slightly different semantics (like &lt;code&gt;###&lt;/code&gt;
    was code that needed to be fixed, &lt;code&gt;XXX&lt;/code&gt; was code that
    was commented out and needed to be fixed and re-enabled).
    But &lt;code&gt;XXX&lt;/code&gt; quickly became the preferred form.
&lt;/ul&gt;

&lt;pre&gt;
                        if (rcv_empty(tp)) {                    /* 16 */
-                               tcp_close(tp, UCLOSED);
+                               sowakeup(tp-&gt;t_socket); /* ### */
+/* XXX */                      /* tcp_close(tp, UCLOSED); */
                                nstate = CLOSED;
                        } else
&lt;/pre&gt;

&lt;p&gt;(On a personal note, as someone who goes out of their way to read through any
  published TCP stacks, I&#039;m kind of amused that a search for a piece of random
  historical trivia led me to a damn TCP stack.)&lt;/p&gt;


&lt;p&gt;Leaving it at that seems like a good story. And I&#039;d already
  checked basically all of Bell Labs code that I could find. It&#039;s not
  in Unix v2-v7 and not in the Programmer&#039;s Workbench. But then I
  decided to check Unix v1 just for completeness&#039; sake, and got very
  confused. Because...
&lt;/p&gt;

&lt;pre&gt;
/ XXX fix me, I dont quite understand what to do here or
/ what is done in the similar code below e407:
/ cmp   r5, u.count / see if theres enough room
/ bgt   1f
mov     r5,u.count / read text+data into core
&lt;/pre&gt;

&lt;p&gt;WTF? It doesn&#039;t get any clearer than that. But where did it come from?
  And if this convention was used at
  Bell Labs in 1970, where did &lt;code&gt;XXX&lt;/code&gt; disappear for a
  decade?&lt;/p&gt;

&lt;p&gt;Turns out this was a false alarm. The only reason we have
  the Unix v1 source code in the first place is that a team of people
  transcribed the source from PDF scans to text. Then they went on to
  make it possible to compile the code and run it in an emulator. As
  part of this latter work, a block of code was added to the source.
  And a bit unfortunately it was this patched version rather than the
  &quot;original&quot; that made it to the Unix History Repo.  This comment was
  actually from 2008, not 1971.
&lt;/p&gt;

&lt;p&gt;There&#039;s actually an interesting story behind that extra block of
  code, &lt;a href=https://www.usenix.org/legacy/events/usenix09/tech/full_papers/toomey/toomey.pdf&gt;as told by Toomey&lt;/a&gt;. After finally getting the v1
  kernel transcribed, compiled, and running, they hit the
  problem of only having two userland programs available:
  &lt;code&gt;init&lt;/code&gt; and &lt;code&gt;sh&lt;/code&gt;. Everything else was using
  a more recent executable header. To be able to do anything at all
  with the system, they needed to add support for &quot;0407 binaries&quot; as
  opposed to the &quot;0405&quot; ones the kernel supported natively.

&lt;p&gt;
  What about C code outside of Unix distributions? It&#039;s actually kind
  of hard to find any of that from before 1982.
  There might be an earlier instance in Gosling Emacs, though it
  differs from the modern form by going for a full 9 &lt;code&gt;X&lt;/code&gt;s:
&lt;/p&gt;

&lt;pre&gt;
#ifdef HalfBaked
/*    sigset (SIGINT, InterruptKey); *//*XXXXXXXXX*/
    sigset (SIGINT, InterruptKey);/*XXXXXXXXX*/
#endif
&lt;/pre&gt;

&lt;p&gt;
And there&#039;s a Changelog entry from July 1981, which seems to match
up perfectly with both the functionality of the code, and the surrounding
&lt;code&gt;ifdef&lt;/code&gt;:
&lt;/p&gt;

&lt;pre&gt;
Tue Jul  7 12:51:44 1981  James Gosling  (jag at VLSI-Vax)
        ... I also installed Dave
        Dyer&#039;s hack to allow ^G&#039;s to interrupt execution immediatly.  This
        has a rather major bug, and is the reason that I didn&#039;t implement
        it a long time ago: if you type ^G while Emacs is doing output,
        then all queued-but-not-printed characters get lost and Emacs no
        longer has any idea of what the screen looks like. It is pretty
        much impossible for Emacs to tell whether or not this has
        happened. You end up having to type ^L now and then.  The
        &quot;HalfBaked&quot; switch in config.h controls the compilation of this
        facility, ...
&lt;/pre&gt;

&lt;p&gt;But thankfully this code has RCS history starting from 1986, and
 somebody did in fact edit this code in 1986 with no functional changes,
 but adding the commented out copy and the
 &lt;code&gt;XXXXXXXXX&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;
 #ifdef HalfBaked
-    sigset (SIGINT, InterruptKey);
+/*    sigset (SIGINT, InterruptKey); *//*XXXXXXXXX*/
+    sigset (SIGINT, InterruptKey);/*XXXXXXXXX*/
 #endif
&lt;/pre&gt;

&lt;p&gt;
  And those are the only signs of &lt;code&gt;XXX&lt;/code&gt; in
  applications that could predate the BSD usage. Both were
  red herrings, caused by how difficult it is to actually find
  pristine copies of source code that old. It was very lucky that
  the Gosling Emacs comment was added after the code was put into RCS,
  and not during the five-year interval between the original commit
  and the project starting to use RCS.
&lt;/p&gt;

&lt;p&gt;So it seems likely that this convention was invented by Bill Joy
  in BSD. If he wasn&#039;t the first one, he was certainly the one that
  popularized it. Why he chose to switch to the rather inconvenient
  &lt;code&gt;XXX&lt;/code&gt; from &lt;code&gt;###&lt;/code&gt; is unclear.
  &lt;/p&gt;

&lt;p&gt;If you can find an earlier occurrence (or know of good collections
  of pre-1981 C source code), please let me know and
  I&#039;ll update the post.&lt;/p&gt;

</description><author>jsnell@iki.fi</author><category>HISTORY</category><pubDate>Mon, 17 Apr 2017 18:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2017-04-17-xxx-fixme/</guid></item></channel></rss>