My blog now has simple comment support. This turned out to be a much better way to procrastinate than I'd expected. Until now, everything in the system had been stored in files, but I decided that comments were better off being stored in a database. In retrospect, this probably wasn't such a good idea.

The initial coding took a few hours, much of which was spent figuring out the right way to use CLSQL. I was especially bitten by the caching that CLSQL does, which caused some problems that were hard to diagnose. After being bitten a couple of times I just turned it off. Having caching on by default seems like a bad choice.

The real fun started once I decided to check that the system would still work with a light load, and ran ab (ApacheBench) with 5 concurrent processes accessing the web server. It failed on alarmingly many requests. So I got to spend most of the day debugging threads.

First problem in SB-BSD-SOCKETS. gethostbyname and gethostbyaddr return data in statically allocated buffers, which will be overwritten by the next call. SB-BSD-SOCKETS accounts for this by copying the data immediately after the call. However, it's possible for one thread to overwrite the buffer before another thread has had time to copy the data to safety. Boom!
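The shape of the race, and of one possible fix, can be sketched with a stand-in for the static buffer. Everything below is illustrative only; the real code lives in SB-BSD-SOCKETS and libc:

```lisp
;; Illustrative sketch: *STATIC-BUFFER* stands in for libc's
;; statically allocated hostent buffer, LOOKUP for gethostbyname.
(defvar *static-buffer* nil)
(defvar *resolver-lock* (sb-thread:make-mutex :name "resolver"))

(defun lookup (name)
  ;; Like gethostbyname: the result lives in a shared buffer that
  ;; the next call will overwrite.
  (setf *static-buffer* (copy-seq name))
  *static-buffer*)

(defun safe-lookup (name)
  ;; Hold one lock across both the call and the copy-out, so no
  ;; other thread can clobber the buffer between the two steps.
  (sb-thread:with-mutex (*resolver-lock*)
    (lookup name)
    (copy-seq *static-buffer*)))
```

Copying immediately after the call, as SB-BSD-SOCKETS does, shrinks the window; holding a lock across call-plus-copy closes it completely.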

Second problem in the SBCL internal caches, which caused occasional nonsense errors like "STRING is a bad type specifier for sequences". Surprisingly easy to trigger once you figure out what's going on:

;; Generate a random bounded INTEGER type specifier.
(defun random-type (n)
  `(integer ,(random n) ,(+ n (random n))))

;; SUBTYPEP should keep giving the same answer for the same pair of
;; types; assert that it does, many times over.
(defun one-test ()
  (dotimes (i 10000)
    (let ((type1 (random-type 500))
          (type2 (random-type 500)))
      (let ((a (subtypep type1 type2)))
        (dotimes (i 100)
          (assert (eq (subtypep type1 type2) a))))))
  (format t "ok~%")
  (force-output))

;; Run ten of these checks in parallel to stress the internal caches.
(defun test ()
  (dotimes (i 10)
    (sb-thread:make-thread #'one-test)))

The heavy-handed solution is to sprinkle some magic pixie locks on all the functions created by DEFINE-HASH-CACHE. Unfortunately these functions are called very often, and the mutex overhead caused a 50% slowdown in the average page generation time. Definitely not committable in this state. Unfortunately the locks need to be recursive, so spinlocks as currently implemented were not an option.
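A sketch of what that pixie-lock sprinkling amounts to (my illustration, not the actual DEFINE-HASH-CACHE patch): every cache access gets wrapped in a single recursive mutex.

```lisp
;; Illustration only, not the real SBCL patch: guard a cached
;; operation with one global recursive lock.
(defvar *cache-lock* (sb-thread:make-mutex :name "hash-cache"))

(defun locked-subtypep (type1 type2)
  ;; WITH-RECURSIVE-LOCK rather than WITH-MUTEX, because computing a
  ;; missing cache entry can re-enter the type machinery and try to
  ;; grab the same lock again.
  (sb-thread:with-recursive-lock (*cache-lock*)
    (subtypep type1 type2)))
```

Every call now pays a mutex acquire and release on top of the cache probe, which is where the 50% slowdown comes from.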

Third problem between keyboard and chair, though I'll happily assign some of the blame to CLSQL. I forgot to specify the database for one call to SELECT, and it ended up using *DEFAULT-DATABASE*. This wouldn't have been too bad, except that WITH-DATABASE has a really strange feature: instead of just binding the connection to the specified variable, it will also SETF it before establishing the new binding. I.e.

CL-USER> (progn
           (setf clsql::*default-database* nil)
           (clsql:with-database (clsql::*default-database* *db-spec*))
           clsql::*default-database*)
#<CLSQL-POSTGRESQL-SOCKET:POSTGRESQL-SOCKET-DATABASE localhost:7432/blog/jsnell CLOSED {100289EF71}>

Due to the above points the same connection ended up getting used from multiple threads at the same time in certain circumstances, with predictably bad results. I spent a couple of merry hours debugging this as an fd-stream problem before realizing the mistake.
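The fix itself was mechanical: pass the connection explicitly everywhere instead of letting SELECT fall through to *DEFAULT-DATABASE*. A sketch, with made-up table and variable names (the :DATABASE keyword and the bracket reader syntax are CLSQL's):

```lisp
(clsql:enable-sql-reader-syntax)

;; Sketch with hypothetical names: each request handler passes its
;; own connection DB, so nothing depends on the global default.
(defun comments-for-entry (db entry-id)
  (clsql:select 'comment
                :where [= [entry-id] entry-id]
                :database db))
```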

After fixing these, the Araneida instance has now survived without errors for 100000 requests with 10 concurrent client processes, and another 100000 with 20 clients. That should suffice for now, and it doesn't really matter that handling the average page request takes 25ms instead of 18ms.