Big numbers, little metal
I’m finally getting caught up with my RSS feeds again (without declaring bankruptcy this time) and I found a couple of announcements regarding site traffic that I thought were pretty cool.
Stackoverflow
Love or hate them, Team Atwood-Spolsky has turned the Stack Exchange platform into *the* Question+Answer platform. Perhaps somewhat surprisingly, they’ve had virtually no major scaling challenges that I’ve heard of. The latest performance note from Atwood claimed that the site is “almost comically overprovisioned” with 10 webservers. Last I heard they were somewhere around the 200 million pageviews per month mark.
A disclaimer for that last link: It’s clear that Digg made some serious engineering missteps with the latest revision of their platform, but they also have to deal with bursts of traffic that probably aren’t seen in the same magnitude on something like Stack Overflow. On the other hand, I could be way off the mark. Just my $0.02.
Reddit hits 1b pageviews
My favorite online free-time black hole, Reddit, announced yesterday that they’d surpassed the 1 billion pageviews per month mark. They’re peaking around 1250 requests per second on the app servers, or 2500+ on the cache nodes [link]. It’s nothing short of amazing that they keep things running relatively smoothly given those bursts.
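To put those monthly numbers on a per-second footing, here is a rough back-of-the-envelope sketch. It assumes a 30-day month and perfectly even traffic, neither of which is true, and it treats pageviews and app-server requests as comparable units, which they aren't quite. The function name is my own. Take the result as a ballpark, nothing more.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_per_second(pageviews_per_month, days_in_month=30):
    """Average pageviews per second, assuming evenly spread traffic."""
    return pageviews_per_month / (days_in_month * SECONDS_PER_DAY)


# Figures quoted in this post
reddit_avg = average_per_second(1_000_000_000)
stackoverflow_avg = average_per_second(200_000_000)

# Reddit's quoted app-server peak compared to its average pageview rate
# (assumption: one pageview is roughly one app-server request)
peak_to_average = 1250 / reddit_avg

print(f"Reddit average:         {reddit_avg:.0f}/s")
print(f"Stack Overflow average: {stackoverflow_avg:.0f}/s")
print(f"Reddit peak vs average: {peak_to_average:.1f}x")
```

Under those assumptions Reddit averages somewhere under 400 pageviews per second, so the quoted peak is a bit over three times the average.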
Beyond all the awkwardly funny or just plain compelling content on the site, I’ve been fascinated by Reddit ever since I first happened upon it for a couple of reasons:
- It was founded by two 22-year-old college grads and was picked for a round of funding from Y Combinator (you should be reading Hacker News)
- The team running the site is tiny. Really tiny. Last I checked there were 6 people behind the site, and that includes development, hosting/ops, design, everything.
- They essentially gobbled up most of Digg’s userbase. I left Digg quite a while back, though I no longer remember exactly why other than it had lost the tech focus that piqued my interest in the beginning.
- They survived the Dreaded Rewrite. Reddit was originally written in Lisp but was rewritten in Python in 2005 and continues to run on Pylons to this day.
I think what I find most fascinating about the site is that it’s completely run in the cloud. Reddit’s growth has come with its share of major engineering challenges, but so far they’ve been able to overcome them. As of October, they were running everything on 112 EC2 instances.
The Hacker News post is full of some other people sharing their numbers, and sites like HighScalability have more info on some bigger name sites. Read up and write ye some efficient code!
Visualizing Data
I love data visualizations.
Raw data can tell you a lot by itself, but putting that data through a process that creates a visual context is a great way to make people fanatical about your software. For example, I’m absolutely addicted to checking my Flickr stats.

It’s strangely fascinating to log in day after day and see how many people took the time to flip through my photos when I don’t spend an iota of time trying to drive traffic there. While I don’t get a large amount of traffic, it’s exciting to try to draw conclusions based on the view stats.
Charts and graphs can make tedious or otherwise boring information seem downright exciting. Take Zipdecode, for example. Zip codes are not very interesting. However, slap them into a Processing app and suddenly people are interested enough to spend some time figuring out how zip codes work. In fact, I’m willing to bet that the Zipdecode example is powerful enough that you can learn more about zip codes in 30 seconds than you’ve learned in the rest of your time scrawling them on envelopes.
Twitter is another interesting point of study pertaining to what data conveys about an individual. Enter Twitter Stats: a Perl script that chews on some data from Twitter’s API and spits back some slick graphs that go beyond the traditional Twitter context of, “What are you doing?”

What does this say about me? I’m more of a night owl than an early bird. Wednesday morning I’m more active than any other morning, and for quite a span of time. I’m also more apt to post on Wednesday than any other day with Saturday seeing the least activity of all (I try to stay unplugged on weekends to some extent). Overall? Yeah, it’s pretty accurate.
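The counting behind graphs like these is simple enough to sketch. This is not how the Twitter Stats Perl script works (I haven't read its source); it's just a minimal Python illustration of bucketing post timestamps by hour and by weekday. The sample timestamps are made up, and the function name is my own.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def activity_histograms(timestamps):
    """Count posts per hour of day and per weekday."""
    by_hour = Counter(ts.hour for ts in timestamps)
    by_weekday = Counter(WEEKDAYS[ts.weekday()] for ts in timestamps)
    return by_hour, by_weekday


# Made-up sample data standing in for timestamps pulled from an API
sample = [
    datetime(2008, 3, 5, 9, 15),   # Wednesday morning
    datetime(2008, 3, 5, 10, 40),  # Wednesday morning
    datetime(2008, 3, 5, 23, 5),   # Wednesday night
    datetime(2008, 3, 6, 22, 30),  # Thursday night
    datetime(2008, 3, 8, 14, 0),   # Saturday afternoon
]

by_hour, by_weekday = activity_histograms(sample)
busiest_day, busiest_count = by_weekday.most_common(1)[0]
print(f"Busiest day: {busiest_day} ({busiest_count} posts)")
```

Feed the two counters into any charting tool and you have the hour-of-day and day-of-week graphs.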
But it doesn’t end there.
I was late to jump on the Google Reader bandwagon. OK, I just started last week. They have graphs too!

According to Google, I don’t read shit in the morning.
When I worked for the IT department at MSUM, we used Cacti to track the basic server vitals: memory consumption, processor utilization, spam statistics, that sort of thing. Of course, it’s useful to see some quick stats from SpamAssassin on the command line, but what’s truly useful is being able to see what changes over time.
Take the WebOps Visualization photo pool, started by some of the Flickr server admins. Flickr handles an insane amount of data every day. It’s wildly fascinating to see what other organizations are doing with their mundane data.