status update from +William Stein
Hi SageMathCloud Users,
A week ago, I moved all of SageMathCloud to Google Compute Engine, and rolled out a storage backend for SMC that relied on btrfs send/receive. The resulting week was one of the most painful weeks of my life. At UW, SMC ran on 19 computers, each having 1TB of SSD, 96GB (or more) of RAM, and 16 cores. Squishing all of this into something affordable on GCE meant shrinking resources by an order of magnitude, which required rewriting or rethinking most of the backend to be more efficient. Also, there are over 150,000 projects, so just dealing with all that data is pretty daunting. Efficiency is important because Google isn't giving me a penny anymore to help with this (they gave us $60K in credits for last year), and SageMath, Inc. is paying for it using borrowed money in the hope that there will be enough paying customers soon.
Unfortunately, it turns out that btrfs streaming isn't as robust as I had hoped, especially under heavy load: it would regularly crash the OS and cause other performance problems. Just in case, I had SMC regularly store incremental rsync backups to Google Nearline storage, so if you find that files you made during the last week are missing now, they are likely very easy for me to restore.
Over the last week, I wrote a new, much more robust and efficient storage backend that doesn't use btrfs streaming, and switched everything over to it last night. This new storage system works well so far, and I intend to stick with it.
There are some loose ends remaining. For example, the snapshots you can browse in a project are made periodically across all projects (not just yours), so they don't currently reflect when you used your project, though that will change. They are really just symbolic links into the /snapshots directory, which includes snapshots across all projects. Right now the display shows less than a day of snapshots, but in fact there are many more; I just need to write some more code to present them properly.
Reducing other resources (e.g., the number of web servers) also exposed other issues, which I've tracked down and fixed.
Despite running on far fewer cores and much less RAM, right now SMC feels much, much faster than it did before. The Google network is extremely fast, the Haswell Intel processors they provide have 45MB caches and so are much faster in the arithmetic benchmarks I've tried, and the local disk (where your project files sit when you're using them) is a fast PCIe SSD. I've also been fixing a lot of little issues over the last 3 weeks that could lead to things feeling slow intermittently or to connections being dropped (most recently, a couple of issues this afternoon).