PostgreSQL 9.0 createdb Revelations (Updated)
2011-08-27 15:18:16 by jdixon
One of my first projects at Heroku has been to modernize our shared PostgreSQL offering (working with @asenchi). As we get closer to internal testing of our new service, @markimbriaco asked for benchmarks looking for any bottlenecks in PostgreSQL 9.x when creating large quantities of small databases. We've seen instances where Pg 8.3 will start to choke after 2000 databases on the same server and we're hoping that 9.x alleviates this issue.
My initial test was overly simplistic but still revealed some interesting patterns. I started with createdb on the command-line, generating 8000 roles and empty databases, serially. The results were promising, with PostgreSQL 9.0.4 (Ubuntu 10.04) able to scale up without any noticeable increase in latency. Unfortunately, it's not a terribly useful benchmark given the absence of any workload. And yet, I couldn't help but notice a pattern in the scatter plot:
Notice the gap between 500 and 600 ms? I don't have an explanation for this, but I suspect that Pg has an internal condition that triggers for actions that take 500 ms or longer. Regardless, our primary expectations had been met. Whatever bottleneck 8.3 demonstrated when creating databases on a server with large quantities of existing small databases appears to be fixed in 9.0.
The next test was to run a similar sequence with our new application server. It offers an internal RESTful API using Sinatra and Sequel to provision and manage customer databases on shared servers. The results for this run were even more enlightening. Check out the stratification:
Not only is the initial gap (around 400ms) even more pronounced, but you can see a pattern of latency introduced at 200ms intervals after the initial 400ms delay. I have no explanation for this, but I wanted to publish these results and see if anyone else has a guess as to what might be causing these patterns.
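For anyone who wants to check their own numbers for the same kind of banding, here is a minimal sketch that groups latencies into fixed-width buckets and lists the empty ones in between. The helper names (`bucket_counts`, `empty_buckets`), the 100 ms bucket width and the sample values are all made up for illustration; none of them come from the benchmark data.

```ruby
# Count samples per fixed-width latency bucket (width in milliseconds).
# Keys are the lower bound of each bucket, e.g. 300 covers 300..399.
def bucket_counts(latencies_ms, width = 100)
  counts = Hash.new(0)
  latencies_ms.each { |ms| counts[(ms / width).floor * width] += 1 }
  counts.sort.to_h
end

# Lower bounds of the empty buckets that sit between the smallest
# and largest occupied bucket. These are the visible "gaps".
def empty_buckets(counts, width = 100)
  return [] if counts.empty?
  lo, hi = counts.keys.minmax
  (lo..hi).step(width).reject { |bucket| counts.key?(bucket) }
end

# Made-up sample values, for illustration only.
sample = [310, 340, 390, 620, 640, 810, 830]
counts = bucket_counts(sample)
puts counts.inspect
puts empty_buckets(counts).inspect
```

A narrower bucket width (say 50 ms) would show whether the bands really sit on 200 ms boundaries or only appear to at a coarse resolution.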
UPDATE: To rule out any distortion caused by GNU time, I ran another test using Ruby's Time class to get a more accurate representation. In the simplest terms, we start the clock with Time.now, connect to the database (no caching), create a role, create the database and stop the clock. Output is logged and then imported into Excel for plotting. I think the results speak for themselves (measured in milliseconds):
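The timing approach described above can be sketched roughly as follows. This is not the actual benchmark script: `time_ms`, `provision`, `run_benchmark` and `to_csv` are hypothetical names, and `provision` just sleeps as a stand-in for the real connect, create role, create database sequence so the sketch runs without a database.

```ruby
# Time a block of work in milliseconds using Time.now, as the post describes.
def time_ms
  started = Time.now
  yield
  ((Time.now - started) * 1000.0).round(2)
end

# Hypothetical stand-in for "connect, create role, create database".
# A real run would open a fresh connection here (no caching) and issue
# the CREATE ROLE and CREATE DATABASE statements.
def provision(_index)
  sleep(0.01)
end

# Run the provisioning step serially and collect [index, elapsed_ms] pairs.
def run_benchmark(count)
  (1..count).map { |i| [i, time_ms { provision(i) }] }
end

# Emit one "index,milliseconds" line per run for import into a spreadsheet.
def to_csv(results)
  results.map { |i, ms| "#{i},#{ms}" }.join("\n")
end

puts to_csv(run_benchmark(5))
```

Time.now follows the wall clock, so a clock adjustment in the middle of a run can skew a sample. Process.clock_gettime(Process::CLOCK_MONOTONIC) is an alternative on newer Rubies if that matters.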
- Comments (1)
Giant Robots Are Cool and Shit, But Seriously...
2011-07-15 16:54:34 by jdixon
I'm pleased to see so many people interested in the #monitoringsucks movement/campaign/whatever. My last post seemed to resonate with a lot of you out there. I'm excited to hear discussions surrounding APIs, command-line monitors, monitoring frameworks, etc. But I think a major thrust of my article was missed. It's not just that Nagios can be a pain in the ass, or that we need a modular monitoring system. What I'm trying to emphasize is that monolithic monitoring systems are bad and not suited for the task at hand.
Some very smart systems people (and developers) are trying to solve this problem in the open-source arena. Unfortunately, while they're attempting to diagnose and cure the problems in contemporary monitoring systems, they continue to architect big honking inflexible software projects. When I refer to "the Voltron of monitoring systems" I'm not talking about an enormous fucking automaton of monitoring, alerting and trending components. I mean that each component should exist independently of the others, with a stable data format and communications API. Any single component should be easily replaceable and deprecated. Authors should strive for competition because it makes the inclusive architecture that much stronger.
Realistically I see one of three things happening over the next 12-18 months:
- A community forms around a reasonable set of defined components and begins cranking out useful bits. Over time we have what resembles a useful ecosphere of monitoring tools and users.
- Motivated developers continue to solve the issues affecting monitoring software, but in their own walled garden projects. We benefit from a larger pool of projects to choose from, but they all continue to suffer from NIH syndrome.
- I'm disregarded as a nutcase. Nothing changes and we continue to use the same crappy ubiquitous software.
At this point I think the most likely outcome is a combination of the first two. It's hard for anyone to justify working on a disassociated component when the related components it needs to be useful might never be developed. On the other hand, if someone working on a monolithic project has the foresight to break up the bits into a true Service Oriented Architecture, then it would be feasible for external developers to fork individual units.
- Comments (6)
Monitoring Sucks. Do Something About It.
2011-07-07 23:45:30 by jdixon
For as long as I can remember, systems administrators have bitched about the state of monitoring. Now, depending on who you ask, you might get a half dozen (or more) answers as to what "monitoring" actually means. Monitoring is most commonly used as a casual catch-all term to describe one or more pieces of software that perform host and service monitoring and basic trending (graphs or charts). But in most cases, these complaints are targeted at software responsible for daily fault detection and notifications for IT shops and Web Operations. The usual whipping boy is Nagios, a popular open-source monitoring project that supports a universe of host and service checks, notifications, escalations and more.
Nagios has been the "lesser of all evils" for quite some time. Its cost (free), extensibility (high) and configuration flexibility have helped it achieve significant adoption levels across a variety of industries and a range of business sizes, from small one-man web startups to Fortune 500 enterprises. It's been forked multiple times and is recognized by industry analysts as a force to be reckoned with. Regardless, those who use it do so with a fair amount of hostility. Ask around and you're likely to find more users who stay with Nagios because it's "good enough" than those who actually like it. So why doesn't Nagios have more competition in the open-source marketplace? Largely because writing an entire monitoring system from scratch is an enormous undertaking. Ok, does that mean we should keep improving Nagios (or forking it... again)? Perhaps.
- Comments (6)
Trending with Purpose
2011-03-18 13:52:44 by jdixon
I threw together a presentation on short notice this week for an internal tele-conference about Trending with Purpose. The end result was much better than I might have expected (even given my penchant for procrastinating). Although much of the content is specific to applications currently in use at $DAYJOB, I think there's something to take out of it even if you're not using these tools.
The content is intended for developers who might not use (or know how to use) application profiling data to complement their operations' monitoring and trending efforts. Special props to the Orbitz.com developers for open-sourcing their Graphite graphing tool, as well as John Allspaw and the Etsy Engineering team for their work on StatsD, and for generally serving as innovators in the Web Operations industry.
Special note: These slides were thrown together in rapid fashion. Anyone who experiences violent reactions to Gill Sans Italic should not download this slideshow. You have been warned.
The slides are available here.
- Comments (2)
New Year's Resolutions
2010-01-01 22:28:02 by jdixon
I'm not sure how effective it is to post these here, but I'm hopeful that having them in cyberspace will help keep me motivated. I'm hereafter calling these goals rather than resolutions. The latter, to me, implies something that you begin immediately. This cold-turkey approach virtually guarantees failure. The moment you trip up, the subconscious immediately considers them a lost cause and reverts to the old behavior. Calling them goals, I think, sets a more optimistic tone and allows me to gradually adopt the preferred conduct.
Without further ado, my personal list of goals for this year (in no particular order)...
- Comments (1)
Business Metrics
2009-09-15 23:58:42 by jdixon
Somewhere between our first corrupt filesystem and an unlikely ascent to CTO, all Systems Administrators are taught to monitor their systems. We're trained to monitor the health of our computers and trend the usage for capacity planning and analytics. A Nagios is deployed; eventually complemented by Cacti; both of which are inevitably supplanted by Something Enterprise (TM). Services are checked, change is managed, and reports are reportified.
Have you asked yourself, what value does this offer my company? Perhaps you've correlated your database connection breakdown time with website load time. Or you noticed that the FULL backups on Sunday coincide with excessive packet loss on your Seattle firewalls. Besides buffing out some of the rough edges on your operational capabilities, how does this data work for you?
- Comments (1)
Noit Grows Hair on Your Chest
2009-08-15 14:09:13 by jdixon
Todd Hoff over at High Scalability takes a look at Reconnoiter. He went through the [currently] arduous task of installing and configuring it manually; setting up checks can be a hairy experience. But the end result seems to justify the initial pain. It's a very exciting (and useful) application that will only get better as the #noit devs continue to hack on it.
As an Ops guy over at OmniTI, I've been fortunate to watch Reconnoiter's incubation process. Theo Schlossnagle is probably one of the smartest guys in this industry and he gets scalability issues. We've batted around ideas about network trend and analysis tools before (e.g. NFDB) so naturally I'm anxious to see where Noit takes us.
- Comments (0)