Using messaging for scalability

I don’t understand why this whole debate about scalability is so focused on choice of the programming language.

At Runa, we provide a service to our immediate customers who are all online retailers. Their website integrates with our services in a couple of ways (via Javascript inserts and via traditional http calls) – and each time a shopper browses any of these online market-places, our services are called. Lots of times – a dozen times per retailer page-view on average.

Our load also grows in large step-functions. Each time we get a new customer, our services get called by another set of users (as all our customers’ customers get added on). We need our services to keep up with all this demand.

Finally, lets take an example of one of the things we provide via these services – dynamic pricing for products. Obviously, the response of such calls needs to be in real-time – since the price has to be shown next to the product being browsed.

So we have a load as well as a response-time concern – as most typical web-scale services do.

Our approach has been to favor simplicity – very early on we introduced a messaging backbone.

messaging_for_scalability.png

Despite the fact that this picture looks a bit more complex than it would without the RabbitMQ portion, this has allowed us to do a few things -

  • For those service calls that don’t need immediate responses – (for instance, our client websites send us data that we analyze later, or we need to send an email notification) – we just drop these onto an appropriate queue. An asynchronous processor picks up the message, and does the needful.
  • For those services that need responses immediate responses, the call is handled synchronously by one of the application servers.
  • For those services that are heavier in terms of the computation required, we split the request into pieces and have them run on separate machines. A simple federation model is used to coordinate the responses and they’re combined to return the result to the requester.
  • With the above in place, and by ensuring that each partial service handler is completely stateless, we can scale by simply adding more machines. Transparently. The same is true for all the asynchronous processors.

As an aside – I’ve written a mini-framework to help with those last couple of bits. It is called Swarmiji – and once it works (well) in production, I will open-source it. It may be useful for other folks who want a sort of message-passing based parallel programming system in Clojure.

So anyway, with this messaging system in place, we can do a lot of things with individual services. Importantly, we can try different approaches to implementing them – including trying different technologies and even languages.

IMHO, you can’t really have a conversation about scalability without context of how much load you’re talking about. When you’re just getting started with the system, this is moot – you don’t want to optimize early at all – and you can (mostly) ignore the issue altogether.

When you get to the next level – asynchronicity can get you far – and this side-steps the whole discussion of whether the language you’re using is efficient or not. Ruby is as good a choice as any in this scenario – Python or Java or most other languages will leave you in the same order of magnitude of capability. The key here is high-level design, and not doing too many obviously stupid things in the code.

When you do get crazy large (like Google or whatever), then you can start looking at ways to squeeze more out of each box – and here it may be possible that using the right language can be an important issue. I know Google even has a bunch of people working on compilers – purely to squeeze more out of the generated code . When you have tens of thousands of computers in your server farms, a 2% increase is probably worth a lot of money.

Still, this choice of language issue should be treated with caution. I’m personally of the opinion that programmer productivity is more important than raw language efficiency. That is one reason why we’re writing most of our system in a lisp (Clojure) this time around. The other thing is that these days runtimes are not what they used to be – Ruby code (and Python and Clojure and Scala) can all run on the JVM. So you can get the benefits of all those years of R&D basically for free.

Finally, a word on our messaging system. We’re using RabbitMQ – and it is *fast*, reliable and scalable. It truly is the backbone of our system – allowing us to pick and choose technology, language, and approach. It’s also a good way to minimize risk – a sub-system can be replaced with no impact to the rest of the pieces.

Anyway – take all of this with many grains of salt – after all, we’re not the size of Google (or even Twitter) – so what do I know?

Startup logbook – v0.1 – Clojure in production

Late last night, around 3 AM to be exact, we made one of our usual releases to production (unfortunately, we’re still in semi-stealth mode, so I’m not talking a lot about our product yet – I will in a few weeks). There was nothing particularly remarkable about the release from a business point of view. It had a couple of enhancements to functionality, and a couple of bug-fixes.

What was rather interesting, at least to the geek in me, was that it was something of a technology milestone for us. It was the first production release of a new architecture that I’d been working on over the past 2-3 weeks. Our service contains several pieces, and until this release we had a rather traditional architecture – we were using ruby on rails for all our back-end logic (eg. real-time analytics and pricing calculations) as well as the user-facing website. And we were using mysql for our storage requirements.

This release has seen our production system undergo some changes. Going forward, the rails portion will continue to do what it was originally designed for – supporting a user-facing web-UI. The run-time service is now a combination of rails and a cluster of clojure processes.

When data needs to be collected (for further analytics down the line), the rails application simply drops JSON messages on a queue (we’re using the excellent erlang-based RabbitMQ), and one of a cluster of clojure processes picks it up, processes it, and stores it in an HBase store. Since each message can result in several actions that need to be performed (and these are mostly independent), clojure’s safe concurrency helps a lot. And since its a lisp, the code is just so much shorter than equivalent ruby could ever be.

Currently, all business rules, analytics, and pricing calculations are still being handled by the ruby/rails code. Over the next few releases we’re looking to move away from this – to instead let the clojure processes do most of the heavy lifting.

We’re hoping we can continue to do this in a highly incremental fashion, as the risk of trying to get this perfect the first time is too high. We absolutely need to get the feedback that only production can give us – so we’re more sure that we’re building the thing right.

The last few days have been the most fun I’ve had in any job so far. Besides learning clojure, and hadoop/ hbase pretty much at the same time (and getting paid for doing that!), it has also been a great opportunity to do this as incrementally as possible. I strongly believe in set-based engineering methods, and this is the approach I took with this as well – currently, we haven’t turned off the ruby/rails/mysql system – it is doing essentially the same thing that the new clojure/hbase system is doing. We’re looking to build the rest of the system out (incrementally), ensure it works (and repeat until it does) – before turning off the (nearly legacy) ruby system.

I’ll keep posting as I progress on this front. Overall, we’re very excited at the possibilities that using clojure represents – and hey, if it turns out to be a mistake – we’ll throw it out instead.

Design for throwability

Build one to throw away, you will anyhow. — Fred Brooks

Fred Brooks said what he did many, many years ago but it is probably just as true today. How many times have you and your team gotten a few months into a project only to realize all the design mistakes you made? Ask any engineer, and they’ll tell you they would build it right the second time.

This is just reality, the nature of discovering the complexity of the domain or the technology or the usage pattern or whatever else you didn’t know about when you started.

On the other hand, there’s this [what Joel Spolsky says about rewriting software] -

It is the single worst strategic mistake that any software company can make — Joel Spolsky

So… what gives? The answer, IMHO, is basically two things -

1. Understand and internalize the idea of the strangler application

2. Architect your system in such a way to support strangling it later

In essence, this means that because it would be a bad idea to rewrite the entire system from scratch, it must be built in a way so as to enable swapping out components of it as they are rewritten (or perhaps heavily refactored).

The architecture must draw from an approach called  concurrent set-based engineering (CSBE) – and indeed, sometimes each logical component would have more than one implementation. At Runa, two components of our system actually have two implementations each. And in each case, they’re both running in production – in parallel.

The way we accomplish this is through very loose-coupling. Additionally, because we take a very release-driven approach to our software process, our architecture evolves according to our current needs… and we refactor and extend things as new requirements are prioritized. At all times, despite our super-short release-cycles, our goal is to always have a version of the system in production. Whenever our pipeline tells us that a peice of the existing design may not work in the long term, we start to work on the replacements – more than one, and in parallel.

We run them in what I’ve been calling shadow-mode. This implies that its not quite part of the official system, but is running in order to prove some design hypothesis. Once everyone involved is satisfied with the results, we pick the most suitable sub-system and decommission  the other contenders (including the old one). At Runa, we achieve much of our inter-component loose-coupling via messaging (our current choice is RabbitMQ).

To summarize – we design everything  with one over-arching goal in mind  – the thing will be thrown away someday, and be replaced with another. As I said before, this enforces a few things -

1. Loose coupling

2. Clear interfaces between components

3. Good automated system testing!

About that last point – because we have many moving parts, functional testing becomes even more important. We currently use Selenium for true functional testing (Runa is a web-based service) – and a variety of other home-grown tools for custom systems testing. Not only do automated system tests tell us that the collaborating set of components are working right, but they also allow us to change things with impunity – knowing that we’ll know if things break.

This thinking is what I’ve been jokingly calling Design For Throwability – and it’s been working rather nicely. It’s essentially a design philosophy that embraces CSBE – and is especially useful for small startups where everything is changing quickly – almost by definition.

Software development agility – bottom-up design and domain-specific languages (DSLs)

It’s been nearly three months that I’ve been working at Runa. And in that time I’ve worked with the team to get a lean and agile process in place, and while we’re not there yet, things are getting more and more focused every day.

I’ve also been working to significantly re-factor the code that existed, as we get ready to get the first beta release out the door. In this post, I’d like to talk about one aspect of our design philosophy which I’m glad to say has been working out really well – bottom-up design (and an associated DSL).

Before we decided to change the system, it had been designed in the usual top-down manner. Runa operates in the online retail space – we provide web-based services to online merchants, and while we have a grand vision for where we want our platform to go, our first offering is a promotions service. It allows merchants to run campaigns based on a variety of criteria. Merchants are a fussy lot, and they like to control every aspect of campaigns – they want to be able to tweak all kinds of things about it.

In true agile fashion, our initial release allows them to select only a few parameters. Our system, however, needs to be extensible, so that as we learn more about what our merchants need, we can implement these things and give it to them. Quickly. And all of this needs to be done in a changing environment with lots of back-and-forth between the business team and the dev team.

So here’s what we came up with – it is an example of what we call a campaign template -

campaign_template :recapture_campaign do
  title is 'Recapture'
  subtitle is 'Reduce cart abandonment rate'
  description is 'This campaign presents offers to shoppers as they abandon their shopping cart.'

  #adding criteria here
  accept required time_period_criteria with defaults start_date('1/1/08'), end_date('12/31/10')
  accept required product_criteria with no defaults

  hardcode visit_count_criteria with number('1')

  #more criteria
  reject optional url_referrer_criteria with no defaults

  inside context :view_badge do
    never applicable
  end

  inside context :abandon_cart do
    allow only customer_type_criteria with customer_type('visitor')
  end

  inside context :cart do
    allow only user_action_criteria with user_action('accepted_recapture')
  end

end

A campaign template behaves like a blue-print of an actual campaign. Sort of like the relationship between a class and an object. In a sense, this is a higher-order description of just such a relationship. The (computer) language now lets us speak in the domain of the business.

There are a couple of reasons why our core business logic is written like this -

a) It lets us communicate easily with the business. Whenever a question about a rule comes up, I bring up the associated template up on the screen, and make them read the ‘code’. Once they agree thats what they mean, it just works, because it is real code.

b) Since this is in essence a way to script the domain model, it has forced a certain design upon it. All the objects evolved in a bottom-up manner, and each does a very specific thing. It lends to a very highly de-coupled design where objects collaborate together to achieve the higher goal, but each is very independent of the other.

c) This notation makes several things easier. One, the actual business rules are described here, and they just work. The other thing is that we’re able to use this same representation for other things – for example, our merchant GUI is completely auto-generated off these template files. Menu items, navigation, saving, editing, error-reporting, everything is generated.

This allows very fast turn around time for implementing new concepts, or making changes to existing ones.

It’s an internal DSL written in Ruby, and does whatever it can without any extra parsing, as you can probably imagine. I will write about the specifics of how this is implemented in future posts. For the moment, I would like to stress, however, the importance of the bottom-up approach. Because our domain model is made up of many small objects (instead of a few larger ones), each representing a tiny aspect of the domain, we’re able to put them together and fashion more complex beasts out of them. And the combinations are limitless, bounded only by business rules. This is where the power of this approach lies.

The DSL itself is almost only a convenience!

Perfect vs. Shipped

I overheard an amusing comment in my team-room the other day, I think it might have been Kris Kemper who said it – “Anyone who knows Ruby On Rails has a half-done personal project that’s going nowhere”. How true. I have at-least three.

The thing is, in my mind I’m always envisioning these grand cathedrals, and even when I do start work on any one of them, I never seem to quite finish them. Or I don’t complete everything properly (or quite enough to be production-ready), and the application is never quite done.

I think it has to do with a lack of focus. I find myself thrashing between the hundreds of things that interest me, and I end up with a ton of unfinished work. I’m very much into lean software methods, and I know that all I’m doing by operating this way is creating a lot of inventory. I seem to be able to use lean and other workflow management techniques at work, but in the world of my personal projects, I seem to be at a loss.

Sometimes, it has to do with trying to get everything perfect. After all, since it is a personal project, I feel like I don’t have a delivery dead-line, so I can take the time to get it right. Which leads me down the rabbit-hole of perfection and cathedral building, with no real end. Cause there probably ain’t anything called perfection.

When consulting for our various clients, I’ve a clear idea in my mind about the compromises and trade-offs needed between design, architecture, refactoring, and delivery. And I aggressively do whatever might be needed to prod the folks along (be they developers, or product-owners) to get the thing done and into production. After all, there’s always another iteration coming up, and there’s always a next release.

So why the heck can’t I seem to do the same thing when I’m working on a nights-and-weekends project?

On DSLs

I’ve been busy, working on two very interesting projects these past several weeks (on the side – my current project for work is unbelievably mind-numbing). Both these projects are in Ruby and use Ruby On Rails. There, got that out of the way.

I’ve always been in search of software designs that result in a clean DSL (Domain Specific Language)… lisp style, bottom up, bring the language up to the problem, rather than the other way around, etc. What I’ve managed so far is a set of mini-DSLs in my applications. One for each module, so to speak.

And perhaps, that is the way things work in general; rather than the unrealistic goal of building *one* DSL which is supposed to capture your entire domain – build several, smaller ones, which each capture aspects of the domain.