Sunday, February 22, 2015

Learning Mongo, part 2: Go faster, please

Continued from Part 1: Hello, NoSQL

First steps

To start with, I had collected about 130,000 sensor data records, in the form of CSV files generated by my OBD logger.

I defined map and reduce functions for Mongo (see part 1), and wrote a statistics class in Python that accumulates values in a plain list.
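The Python side is nothing fancy. It looks roughly like this (a simplified sketch; the database, collection, and field names are placeholders, not my actual schema):

import pymongo

class ListStats(object):
    """Collects values in a list and computes summary statistics on demand."""
    def __init__(self):
        self.values = []

    def add(self, value):
        self.values.append(value)

    def summary(self):
        n = len(self.values)
        mean = sum(self.values) / float(n)
        variance = sum((v - mean) ** 2 for v in self.values) / float(n)
        return {'count': n, 'mean': mean, 'variance': variance,
                'min': min(self.values), 'max': max(self.values)}

# Example run over a batch of documents ('obd', 'readings', and 'rpm' are placeholders):
collection = pymongo.MongoClient().obd.readings
stats = ListStats()
for doc in collection.find().limit(1000):
    stats.add(doc['rpm'])
print(stats.summary())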

This table shows the results, in documents processed per millisecond (larger is better), for batches of the given size.

Document Count    Map Reduce (docs/ms)    Python (docs/ms)
10¹               0.9                     13.6
10²               11.2                    43.1
10³               21.0                    65.6
10⁴               25.0                    66.6
10⁵               25.1                    68.0

Based on these numbers, not only was Python faster, it was more than twice as fast at every batch size.

This led me to believe one of two things: Map Reduce on mongo is stupid, or I was doing it wrong. Is map reduce pointless unless you have a really huge number of documents or a lot of shards? Maybe, but somehow I find it more likely that I don't know what I'm doing.

I tried messing with the map and reduce functions a bit, but discovered that small tweaks in the calculations didn't affect the results much. This led me back to the internet to try and figure out what I was doing wrong.

Eventually, I found a mention of something called "jsMode". With it enabled, Mongo keeps the intermediate data as JavaScript objects rather than converting back and forth to BSON between the map and reduce steps. That means fewer copies, which saves time. Great, I'll just turn that on and ...

Document Count    Map Reduce, JS Mode (docs/ms)    Python (docs/ms)
10¹               0.8                              13.6
10²               8.3                              43.1
10³               42.1                             65.6
10⁴               67.4                             66.6
10⁵               70.6                             68.0

Ok, now it's starting to get better. JS Mode helps significantly, but we're still just breaking even.
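For reference, here's roughly what one of these runs looks like from pymongo with jsMode turned on. The collection, field, and output names are placeholders, and I'm assuming pymongo passes extra keyword arguments like jsMode and limit straight through to the underlying mapReduce command:

from bson.code import Code
import pymongo

collection = pymongo.MongoClient().obd.readings  # placeholder names

# Emit partial sums so the reduce step stays correct even if it runs more than once per key.
mapper = Code("function() { emit('rpm', {count: 1, sum: this.rpm}); }")
reducer = Code("""
    function(key, values) {
        var out = {count: 0, sum: 0};
        values.forEach(function(v) { out.count += v.count; out.sum += v.sum; });
        return out;
    }
""")

out = collection.map_reduce(mapper, reducer, "rpm_stats",
                            jsMode=True, limit=100000)  # limit = batch size under test
print(out.find_one())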

Ramping up

130,000 documents sounds like a lot to a human, but this is the product of one sensor over a few months, collecting maybe 1000 data points per day. If I had a few more sensors or increased my collection frequency, that number could easily increase by an order of magnitude. Plus, there are certain economies of scale when scanning through data. Mongo is touted for its prowess with "Big data."

I'm fresh out of Big data, so I faked it by inserting my data nine more times into the database.
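The faking itself was nothing clever; something along these lines, assuming the whole collection fits comfortably in memory:

# Read the originals once, dropping _id so Mongo assigns fresh ones,
# then insert copies nine more times.
originals = list(collection.find({}, {'_id': 0}))
for _ in range(9):
    # Insert copies of each dict, since pymongo adds an _id to the documents it inserts.
    collection.insert([dict(doc) for doc in originals])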

Results with ~1.3 million documents in the collection:

Document Count    Map Reduce, JS Mode (docs/ms)    Python (docs/ms)
10¹               1.8                              4.2
10²               14.0                             6.8
10³               52.3                             7.3
10⁴               64.2                             7.3
10⁵               69.2                             7.4
10⁶               49.7                             6.9

Oh dear, something terrible has happened. The increased index size has reduced the overall performance quite a bit.

And, with this many documents, I can see where the performance peaks are. The maximum performance is somewhere between 100k and 1000k documents.

This is not what I would call a satisfying result. For one, the Python performance is absurd. 7.4 documents per millisecond? That's almost 13.5 seconds to process 100,000 documents. Plus, Map Reduce performance has dropped by 30%. We can do better.

In thinking about this, I realized that in python, in order to compute statistics on one field in my document, I was actually fetching the entire document (40 fields), while Map Reduce was only getting exactly one. This isn't really fair.

This led me to discover that you can tell Mongo to return only particular fields by passing a projection as the second argument to the query:

result = db.collection.find({}, {'your-field': 1})
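One detail: Mongo still returns _id alongside your field unless you exclude it explicitly, so the truly one-field version is:

result = db.collection.find({}, {'your-field': 1, '_id': 0})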

Using that, Python can now ask for exactly one field. Plus, the list-based stats code starts to slow way down once the list holds more than 10,000 items or so, so I know I can still make Python go faster by dropping the list and computing values with single-variable accumulators.
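The accumulator version of the stats class ends up looking something like this (again, a sketch rather than the exact code):

class StreamingStats(object):
    """Tracks count, sum, and sum of squares instead of keeping every value."""
    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, value):
        self.count += 1
        self.total += value
        self.total_sq += value * value

    def mean(self):
        return self.total / self.count

    def variance(self):
        m = self.mean()
        return self.total_sq / self.count - m * m

# Feed it exactly one field per document:
stats = StreamingStats()
for doc in db.collection.find({}, {'your-field': 1, '_id': 0}):
    stats.add(doc['your-field'])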

With those modifications to the Python code, the results now look like this:

Document Count    Map Reduce, JS Mode (docs/ms)    Python, Accumulators, 1 Field (docs/ms)
10¹               1.8                              13.9
10²               14.0                             42.5
10³               52.3                             62.7
10⁴               64.2                             64.7
10⁵               69.2                             66.0
10⁶               49.7                             42.6

Well, that's more reasonable. Python is now back in the running, thanks to a fair fight. However, Mongo is barely in the lead, and then only when the number of documents starts to get large.

Back to the drawing board.

It was about this time that my attention started wandering over to CouchDB, Redis, and other NoSQL-y things. I really don't want to use CouchDB, mostly because I didn't like MemBase, and Couch implies relaxation, not performance. Yes, I reject this database because I don't like its name (as a PDX resident, I also feel the need to pronounce it Cooch-DB. If you're from here you should know why). Redis is fine, but not a document store. And there are a million other NoSQL databases. Sigh.

The internet then found me a thing called Robomongo, which is a really slick UI for Mongo servers. And as I installed it and started poking around, what did I find but a database object category called "Functions."

Eureka

It makes sense that Mongo, which embeds V8 and is generally javascripty, would be able to store functions server-side for use in Map Reduce or whatever else.

Presumably, one of the reasons the map reduce functions were slow before is that there is overhead in compiling them. Based on the performance results, I would guess that the compiled functions aren't cached anywhere either. Mongo probably just compiles & executes them, and then throws them away again. That's a lot of overhead.

Given the ability to store functions in the database, hopefully a stored function gets compiled once and cached, avoiding that overhead.

So, time to run the tests again, this time with the map & reduce functions stored inside the db, against the same accumulator-based Python code as before.
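Storing the functions and pointing map reduce at them looks something like this. The exact wiring is the part I'm least sure is idiomatic: I'm using pymongo's system_js helper to save the functions into system.js, and assuming that functions stored there are visible inside the map reduce JavaScript context (field, collection, and output names are placeholders again):

from bson.code import Code

# Save the real logic server-side; pymongo exposes the system.js collection via db.system_js.
db.system_js.stats_map = "function() { emit('rpm', {count: 1, sum: this.rpm}); }"
db.system_js.stats_reduce = """
    function(key, values) {
        var out = {count: 0, sum: 0};
        values.forEach(function(v) { out.count += v.count; out.sum += v.sum; });
        return out;
    }
"""

# The map/reduce arguments still have to be functions, so pass thin wrappers
# that just call the stored ones.
result = db.collection.map_reduce(
    Code("function() { stats_map.apply(this); }"),
    Code("function(key, values) { return stats_reduce(key, values); }"),
    "rpm_stats",
    jsMode=True,
)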

Document Count    Map Reduce, Stored Functions, JS Mode (docs/ms)    Python, Accumulators, 1 Field (docs/ms)
10¹               1.5                                                13.9
10²               18.1                                               42.5
10³               61.7                                               62.7
10⁴               128.4                                              64.7
10⁵               123.7                                              66.0
10⁶               71.0                                               42.6

And finally we have a race worth watching. Python takes an early lead, then Map Reduce comes in strong from behind. They're neck and neck through 1000 items, then Map Reduce is off like a rocket from 10k to 100k, and finally both tire out at 1000k.

Thus, with a few small tweaks, I've been able to increase the map reduce performance by about 5x. That's not bad.

Continued in part 3, What do all these numbers mean?
