9 thoughts
last posted Aug. 30, 2014, 8:41 p.m.

I started a new job about 4 months ago. One of my soft criteria was a company not using Python. I love Python, but it's been my primary language since 2001, and I'd been feeling like I wanted to stretch. And I had this hypothesis that jumping in the deep end was the best way to do that. So what did I learn?


Just reading the source of the library you call might not be enough: sometimes you have to follow the thread further. The faraday-http-cache middleware looked like it'd work out of the box, so long as my service supplied an ETag.

Today I discovered that it was hashing the entire set of request headers to determine the cache key, not just those that might Vary.


There's also this magical thing that Ruby seems to do, where you can omit the {} around your keyword arguments, and it collects them up into a single hash. I'm not sure how I feel about that.
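A quick sketch of what I mean (the method and names here are invented for illustration):

```ruby
# A plain positional Hash parameter -- no keyword syntax in the definition.
def connect(host, options)
  "#{host}:#{options[:port]}"
end

# Braces omitted at the call site; Ruby collects the trailing
# key: value pairs into a single Hash argument.
connect("db.example.com", port: 5432)
# same as: connect("db.example.com", { port: 5432 })
```

So the caller gets keyword-argument ergonomics even though the method just takes a Hash.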


Things I've developed some expertise with in the last four months: Redis, DynamoDB, Node.js, RabbitMQ, Heroku. Some of these were entirely new to me (DynamoDB, Node.js, Heroku), and some were tools I'd used, but only as a consumer (i.e., RabbitMQ by way of Celery).

repost from golang

Something that's bitten me a few times -- although I'm slowly internalizing it -- is the fact that while *Person is a pointer to a Person struct, *Individual is not a pointer to "any type implementing Individual".

*Individual is a pointer to the interface itself.

A parameter that's an interface type doesn't tell you anything about whether it's a pointer or a copy. Either can implement the interface, depending on how the methods are declared.
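Here's a minimal sketch of the distinction (types are made up for illustration):

```go
package main

import "fmt"

// Individual is a hypothetical interface.
type Individual interface {
	Name() string
}

// Person implements Individual via a pointer receiver, so it's
// *Person (not Person) that satisfies the interface.
type Person struct{ name string }

func (p *Person) Name() string { return p.name }

// Right: take the interface type itself. The caller can pass a
// *Person, or any other type satisfying Individual.
func Greet(i Individual) string { return "hi " + i.Name() }

// Wrong: func Greet(i *Individual) would demand a pointer to an
// interface value, which is almost never what you want.

func main() {
	fmt.Println(Greet(&Person{name: "Ada"}))
}
```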


There are lots of things that seem obviously wrong only once you discover the problem. Up until that point, you can read the code over and over as you work on it, and never see the issue.

It seems like JavaScript Promises are particularly prone to this sort of sharp edge.

Today's example:

    this.worker.doSomething()
      .then(...)
      .catch(...)

This works fine, so long as this.worker actually has a doSomething method that returns a Promise.

In our case, this used to have a worker property, which became a worker() method, which meant this.worker.doSomething() stopped working.
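A contrived sketch of the failure mode (names invented; the real code was more involved):

```javascript
class Service {
  // `worker` used to be a property set in the constructor;
  // after a refactor it became a method.
  worker() {
    return { doSomething: () => Promise.resolve("done") };
  }
}

const svc = new Service();

// The old call site, unchanged after the refactor:
try {
  svc.worker.doSomething(); // `svc.worker` is now a function object,
} catch (e) {               // so `.doSomething` on it is undefined
  console.log(e instanceof TypeError); // true
}
```

Note that the break is a synchronous TypeError: it happens before any .then or .catch in the chain ever gets a chance to run, so the Promise-level error handling never sees it.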

I like the way Promises make my code look at a glance, but they fundamentally provide illusions: they make async code look synchronous (linear), and they use words like catch that we're used to seeing with different semantics.

It'll be interesting to see how I feel about them in another 90 days.


DynamoDB is a hosted NoSQL database, part of Amazon AWS. We're using it as the primary data store for our messaging system.

Dynamo pricing is based on three factors: dataset size, provisioned read capacity, and provisioned write capacity. Write capacity is by far the most expensive (approx 3x as expensive as reads).

The provisioned I/O capacity is interesting: as you reach your provisioned capacity, calls to Dynamo begin returning HTTP 400 responses indicating throttling. That's the cue for your application to back off and retry. I'll come back to throttling shortly.

When you start out -- say with no data, and 3000 reads and 1000 writes per second -- all of your data is in a single partition. As you add data or scale up your provisioned capacity, Amazon transparently splits your data into additional partitions to keep up. This is one of the selling points: that you don't have to worry about sharding or rebalancing.

It's not just your data that gets split when you hit the threshold of a partition: it's the provisioned capacity, as well. So if you have your table provisioned for 1000 writes per second and 3000 reads per second, and your data approaches the capacity of a single partition, it will be split into two partitions. Each partition will be allocated 500 writes per second and 1500 reads per second.

DynamoDB works best with evenly distributed keys and access, so that shouldn't be a problem. But it could be: if you try to make 600 writes per second to data that all happens to live in a single partition, you'll be throttled even though you think you have excess capacity.

Provisioning that I/O capacity is important to get right: it's not sufficient to turn the dial all the way to 11 from day 1. That's because Dynamo will also split a partition based on provisioned I/O capacity. A single partition is targeted roughly at the 1000/3000 level, so doubling that to 2000/6000 will also cause a split, regardless of how much data you have.

Splits due to provisioned I/O capacity -- particularly when you dramatically increase the capacity for a high volume ingest -- are the source of dilution.

"Dilution" is the euphemism Amazon uses to refer to a situation where the provisioned I/O is distributed across so many partitions that your effective throughput is "diluted". So why would this happen? Well, remember that a partition can be split when either data size or provisioned I/O increases.

Partitions only split, they are never consolidated.

So if you decide that you want an initial ingest to complete at a much faster rate than your application will sustain in production, and you increase the provisioned I/O to match, you're effectively diluting your future I/O performance by artificially increasing the number of partitions.
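Back-of-the-envelope, using the per-partition targets above. The real partitioning formula is Amazon's internal business and has changed over time, so treat this as an illustration of the shape of the problem, not the actual algorithm:

```python
import math

# One plausible reading of the numbers above: a partition handles
# roughly 3000 reads/s or 1000 writes/s, and splits keep per-partition
# load under those targets. (Illustrative only -- not AWS's formula.)
def partitions_for(reads_per_sec, writes_per_sec):
    return max(1,
               math.ceil(reads_per_sec / 3000),
               math.ceil(writes_per_sec / 1000))

# Crank writes to 10,000/s for a one-time ingest:
ingest_partitions = partitions_for(3000, 10000)       # -> 10 partitions

# Dial back to a steady-state 1,000 writes/s. Partitions never merge,
# so those writes are now spread evenly across all 10 partitions:
per_partition_writes = 1000 / ingest_partitions       # -> 100 writes/s each
```

Any single hot key can now only absorb about a tenth of the write rate it could have before the ingest.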

Whomp whomp.


RabbitMQ topic exchanges are a powerful tool: they let you publish messages to an exchange and route each one to zero or more queues in parallel. They're rapidly becoming the Rabbit tool we turn to first; we've used them as fanout exchanges by "accident", and been happy for the flexibility later.

This weekend we dropped some data on the ground, though, because they route to zero or more queues. If you publish to a routing key that doesn't have a queue bound, the message is dutifully routed... nowhere.

You can set an Alternate Exchange for any Rabbit exchange for handling just this situation: any message that an exchange can't handle will be routed to its alternate exchange, if set.

I know now that when using an exchange, you should always configure an AE for it. It gives you visibility into whether you have routing bugs, as well as a way to recover from those bugs (by replaying the messages into the correct queue).
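To make the failure mode concrete, here's a small pure-Python simulation of topic routing with an AE fallback. The queue and binding names are made up, and real declarations would go through your AMQP client (typically by passing an "alternate-exchange" argument when declaring the exchange):

```python
import re

def topic_matches(binding, routing_key):
    # AMQP topic wildcards: '*' matches exactly one dot-separated word;
    # '#' matches zero or more (simplified here to one-or-more).
    words = []
    for word in binding.split("."):
        if word == "#":
            words.append(".+")
        elif word == "*":
            words.append("[^.]+")
        else:
            words.append(re.escape(word))
    return re.fullmatch(r"\.".join(words), routing_key) is not None

def route(bindings, routing_key):
    """Queues the message reaches, or the AE-bound queue if none match."""
    matched = [queue for queue, binding in bindings
               if topic_matches(binding, routing_key)]
    return matched or ["unrouted"]  # "unrouted" stands in for the AE's queue

bindings = [("emails", "notify.email.*"), ("sms", "notify.sms.*")]

route(bindings, "notify.email.signup")  # -> ["emails"]
route(bindings, "notify.push.signup")   # -> ["unrouted"]; without an AE,
                                        #    this message would just vanish
```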


This week I had to add caching to a specific HTTP endpoint. The endpoint is implemented in Express/Node.js, and unfortunately has caching semantics that mean a general purpose cache (e.g., Varnish) isn't appropriate. If I were still working in Django, I'd have either decorated the function, or written a piece of middleware to handle the caching. So my first exploration was in that direction.

I read the Express docs, and then I read the Express code. I'm not in love with NPM's dependency resolution model, but I do like that all of your dependencies are just in a sub-directory for easy perusal: no site-packages, dist-packages, virtualenv, etc to explain.

I learned that doing the sort of wrapping that's so natural in Django with Express middleware meant overwriting a bunch of methods on the response to catch the output as it streams by. If you just override end, you're probably too late.

For an example, check out the compress middleware. It makes for some difficult reading.
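The core of the pattern looks something like this simplified sketch (the middleware name is mine, and real Express responses have more surface area to cover):

```javascript
// Wrap res.write and res.end so every chunk is seen as it streams by --
// the same trick compress uses, here stashing the body for a cache.
function captureBody(req, res, next) {
  const chunks = [];
  const originalWrite = res.write;
  const originalEnd = res.end;

  res.write = function (chunk) {
    chunks.push(Buffer.from(chunk));          // observe the chunk in flight
    return originalWrite.call(this, chunk);   // then pass it along
  };

  res.end = function (chunk) {
    if (chunk) chunks.push(Buffer.from(chunk));
    // The full body is only known here; a real middleware would
    // hand it to the cache at this point.
    res.capturedBody = Buffer.concat(chunks).toString();
    return originalEnd.call(this, chunk);
  };

  next();
}
```

Overriding only end misses everything already flushed through write, which is the "probably too late" problem above.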

I'm more aware today that there are layers to learning: there are facts, and there are implications. I knew the fact that Express was good for streaming responses, but hadn't considered what the implications of that were on how I'd write collaborating code.
