Sunday, February 22, 2015

Mongo Test Results

Learning mongo, part 3: Test Results

If you haven't read part 1 and part 2, suffice it to say that I got mongo going, loaded some data into it, and was surprised to find that map reduce was no faster than doing the same work in Python. Then I figured out a few tricks to make things a bit faster.

The Apparatus

My client machine is a new Core i5 iMac, running Python 3.4.

My mongo server machine is an abused Core 2 Duo T9600 @ 2.8GHz running Linux Mint 17 and MongoDB 2.4. It features a startling 4GB of RAM and a 500GB disk that I found in a pile. The T9600 gets a score of 1956 from PassMark, not exactly a stellar score for 2015, but plenty good for what I use it for (the i5 scores 6763).

Both systems have the kind of disks with moving parts, and are connected to each other with ancient catX ethernet cables (where X at one time was probably 5, I'm not so sure anymore) through an old consumer-grade gigabit switch, which I often keep under a pile of laundry so it stays warm.

The test methodology

I ran tests against a database of 1.3 million documents, averaging about 1,400 bytes and 40 fields apiece.

I ran each test ten times, and then I averaged the results.
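
For reference, here is a minimal sketch of the sort of timing harness that implies; it's my reconstruction rather than the actual test script, the hostname and field name are placeholders, and for the map reduce tests the timed body would be the map_reduce() call rather than a find() loop.

import time
from statistics import mean, stdev
from pymongo import MongoClient

db = MongoClient('mongo-server').db_name      # hostname and db name are placeholders

def one_run(count, field='rpm'):              # 'rpm' is a made-up field name
    """Time one pass over `count` documents; return documents per millisecond."""
    start = time.perf_counter()
    for doc in db.collection.find({}, {field: 1}).limit(count):
        pass                                  # the per-test stats code goes here
    elapsed_ms = (time.perf_counter() - start) * 1000
    return count / elapsed_ms

rates = [one_run(10 ** 5) for _ in range(10)]
print('%.1f docs/ms, σ=%.1f' % (mean(rates), stdev(rates)))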

Results

Map Reduce: The functions mentioned in part 1 are loaded and run, and then again with jsMode turned on.

Python: A simple list-based stats program: the selected value is pulled from the database, stored in a list, and then computations are run on the list.

Python, Accumulators: Stats are computed using accumulator variables, to avoid list overhead.

Map Reduce 2: The same functions as the other map reduce, except that they are stored in the database first.

The count shows the number of documents that were processed (via limit).

The numbers in the columns are the average number of documents processed per millisecond (higher numbers are better), followed by the standard deviation, σ (lower is better). The winners are shown in boldface.

| Count | Map Reduce | Map Reduce, jsMode | Python | Python, Accumulators | Map Reduce 2 | Map Reduce 2, JS Mode |
|---|---|---|---|---|---|---|
| 10⁰ | 0.08 σ=0.09 | 0.3 σ=0.03 | 1.2 σ=0.07 | **1.5 σ=0.02** | 0.2 σ=0.07 | 0.2 σ=0.08 |
| 10¹ | 0.7 σ=0.8 | 1.8 σ=0.7 | 8.3 σ=0.2 | **13.9 σ=0.8** | 1.9 σ=0.7 | 1.5 σ=0.8 |
| 10² | 6.8 σ=4.2 | 14.0 σ=4.5 | 31.5 σ=0.6 | **42.5 σ=1.9** | 16.9 σ=6.2 | 18.1 σ=6.6 |
| 10³ | 20.8 σ=3.5 | 52.3 σ=10.2 | 46.7 σ=2.5 | 62.7 σ=0.5 | **65.9 σ=29.5** | 61.7 σ=30.8 |
| 10⁴ | 24.74 σ=0.7 | 64.2 σ=6.6 | 50.3 σ=0.6 | 64.7 σ=0.1 | 119.6 σ=10.5 | **128.4 σ=23.3** |
| 10⁵ | 24.72 σ=0.7 | 69.2 σ=1.5 | 50.9 σ=0.7 | 66.0 σ=0.5 | 121.5 σ=5.0 | **123.7 σ=6.2** |
| 10⁶ | 22.5 σ=0.2 | 49.7 σ=4.3 | 37.2 σ=0.8 | 42.6 σ=1.5 | 69.2 σ=1.8 | **71.0 σ=2.2** |

Observations

For some reason, I measured a fairly high standard deviation when running the map reduce functions internal to the database. Without knowing what is going on in there, I must speculate that this variability is caused by something like garbage collection, or some other external factors, such as the server dipping into virtual memory.

As with many things in computing, there is always some amount of overhead in doing the calculations (setting up a network socket, seeking to the right place on disk, etc.). As such, there is some efficiency with scale. Presumably, nobody would bother running Map Reduce on 1 element. The data shows this pretty clearly.

The table seems to show that maximum performance is achieved somewhere between 10⁴ and 10⁵ documents. As the number approaches 10⁶ documents, the performance goes down, quite dramatically in some cases. If these numbers are indicative of how Mongo works in general, then aggregations should be limited to smallish batches.

With map reduce, processing a million documents takes about 14 seconds. However, ten batches of one hundred thousand documents should take about 8.08 seconds, apparently almost twice as fast as doing the million in a single batch. A hundred batches of ten thousand documents should take about 7.8 seconds, which may be slightly faster still.
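
That estimate is just throughput arithmetic, using the documents-per-millisecond figures for Map Reduce 2 with jsMode from the table above:

# measured rates (docs/ms) from the results table, Map Reduce 2, JS Mode column
rate = {10**6: 71.0, 10**5: 123.7, 10**4: 128.4}

def estimated_seconds(total_docs, batch_size):
    """Estimated wall time to chew through total_docs in batches of batch_size."""
    batches = total_docs / batch_size
    ms_per_batch = batch_size / rate[batch_size]
    return batches * ms_per_batch / 1000

print(estimated_seconds(10**6, 10**6))   # ~14.1 s: one big batch
print(estimated_seconds(10**6, 10**5))   # ~8.08 s: ten batches
print(estimated_seconds(10**6, 10**4))   # ~7.8 s: a hundred batches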

Where to do aggregation

If the data is small (i.e., fewer than 10,000 documents), then it's probably better to stay in Python. In reality, it's probably better to stay in Python anyway, since logic really doesn't belong in the data store.

If the data you're aggregating is bigger than that but smaller than 1,000,000 documents, still do it in Python (but break the work up into chunks). I ran some quick multithreaded Python tests and was able to churn through 1,000,000 documents in about 8.2 seconds (σ=2.4), faster than any of the other tests. I imagine you'd get similar results by running multiple Map Reduce jobs, but I haven't measured that.
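
For the curious, the multithreaded version is shaped roughly like this. It's a sketch rather than the exact script I ran: the hostname and field name ('rpm') are placeholders, and it assumes every document has that field.

from concurrent.futures import ThreadPoolExecutor
from pymongo import MongoClient

collection = MongoClient('mongo-server').db_name.collection   # names are placeholders
CHUNK = 100000

def chunk_stats(offset):
    """Fold one chunk's worth of a single field into sum/count/min/max."""
    total = count = 0
    low = high = None
    for doc in collection.find({}, {'rpm': 1}).skip(offset).limit(CHUNK):
        value = float(doc['rpm'])
        total += value
        count += 1
        low = value if low is None else min(low, value)
        high = value if high is None else max(high, value)
    return total, count, low, high

with ThreadPoolExecutor(max_workers=4) as pool:
    chunks = list(pool.map(chunk_stats, range(0, 1000000, CHUNK)))

total = sum(c[0] for c in chunks)
count = sum(c[1] for c in chunks)
print('mean:', total / count, 'min:', min(c[2] for c in chunks), 'max:', max(c[3] for c in chunks))

One caveat: skip() still makes the server walk past all of the skipped documents, so for anything serious you'd want to carve the chunks out of ranges on an indexed field instead.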

If you have more than a million documents, I have no advice. Mongo is noticeably slower with 10⁶ documents in the datastore than it was with only 10⁵. With more than a million, I'd expect that it would continue to get slower. Check back in a few months once I've collected more data.

Obvious performance improvements

It's the CPU, stupid

I ran the in-database Map Reduce function in jsMode on my iMac. As mentioned above, the iMac's CPU is roughly 6763/1956 = 3.4 times faster than the Core 2. And, surprisingly, all of my tests ran a bit over three times faster. This tells me that mongo's performance is very CPU-dependent (I had expected to lose some performance to I/O). However, even though the tests ran faster, they showed the same flattening and dip in performance at the larger batch sizes.

| Count | Map Reduce 2, jsMode | Python, Accumulators |
|---|---|---|
| 10⁰ | 0.3 σ=0.12 | 3.9 σ=0.2 |
| 10¹ | 1.6 σ=1.7 | 23.6 σ=2.0 |
| 10² | 39.1 σ=2.9 | 119.7 σ=3.7 |
| 10³ | 126.1 σ=45.2 | 168.8 σ=11.7 |
| 10⁴ | 211.9 σ=45.3 | 177.9 σ=12.0 |
| 10⁵ | 229.8 σ=21.7 | 187.6 σ=5.9 |
| 10⁶ | 220.0 σ=10.6 | 190.8 σ=2.0 |

The relatively high standard deviation is likely due to running the client and server on the same system, and/or too many cores in use interfering with my CPU's ability to keep its turbo spun up.

Use a SQL database

Yeah, that may be kind of silly, since this post is about Mongo. However, Postgres (the not-too-fast-snail of databases) performs this aggregation in about 950 ms on 1,000,000 rows (on the T9600). For comparison, that's a bit over 1000 documents/ms, ten times faster than Mongo on the same server, and 4.5x faster than Mongo on the Core i5.
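
On the Postgres side, this is the bread-and-butter case for SQL aggregates. A minimal sketch of the equivalent query from Python via psycopg2 (the table and column names are made up for illustration, not my actual schema):

import psycopg2

# table and column names are illustrative placeholders
conn = psycopg2.connect('dbname=telemetry')
with conn, conn.cursor() as cur:
    cur.execute('SELECT count(rpm), sum(rpm), min(rpm), max(rpm) FROM sensor_readings')
    print(cur.fetchone())   # (count, sum, min, max) for the whole table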

For non-relational data, an SQL database is definitely a bit awkward. But SQL database developers have been learning how to make their products fast and practical for the last forty years, so it's not as though they are useless when it comes to storing data and crunching numbers.

I'm also not the only one who's noticed that Postgres is faster than Mongo. That, and Postgres has that new JSON column.

Maybe my comment above should have said that postgres is the racing snail of databases.

A point for mongo, though: it inserts much faster than postgres. As you might imagine, it takes a while to get a million rows of data across the wire, but postgres took roughly 30% longer than mongo.

Conclusion

As they say, premature optimization is the root of all evil, about 97% of the time. However, this was an exercise for me to get a feel for how my database performs under some load. I've just scratched the surface of performance tuning, and I'm still very much getting used to mongo.

My simple example probably doesn't do Map reduce any justice. I suspect that where it really shines is when you have many millions of documents, lots of collections, many sharded databases, and have a big aggregation pipeline.

My needs are pretty modest by comparison, so it may simply be that my data isn't "big data" enough for it to matter.

But in terms of simplicity and utility, it's hard to beat db.collection.insert({whatever: you_want}). Mongo's documentation makes it seem like setting up replication isn't much harder than that either. So while Postgres may be faster in (cheap) computation time, mongo is way faster in (expensive) developer time.

Most importantly, I'm having a bit of fun learning about Mongo. I look forward to stowing it in my programming toolkit.

Go faster, please.

Learning Mongo, part 2: Go faster, please

Continued from Part 1: Hello, NoSQL

First steps

To start with, I had collected about 130,000 sensor data records, in the form of CSV files generated by my OBD logger.

I defined map and reduce functions for mongo (see part 1), and used a statistics class I wrote in Python that accumulates values in a list.

This chart shows the result, in number of documents processed per millisecond (larger numbers are better) from batches of the given size.

| Document Count | Map Reduce | Python |
|---|---|---|
| 10¹ | 0.9 | 13.6 |
| 10² | 11.2 | 43.1 |
| 10³ | 21.0 | 65.6 |
| 10⁴ | 25.0 | 66.6 |
| 10⁵ | 25.1 | 68.0 |

Based on this information, not only was Python faster, but it was more than twice as fast.

This led me to believe one of two things: map reduce on mongo is stupid, or I was doing it wrong. Is map reduce pointless unless you have a really huge number of documents or a lot of shards? Maybe, but somehow I find it more likely that I don't know what I'm doing.

I tried messing with the map and reduce functions a bit, but discovered that small tweaks to the calculations didn't affect the results much. This led me back to the internet to try to figure out what I was doing wrong.

Eventually, I found a mention of something called "jsMode". In this mode, Mongo keeps the intermediate data as JavaScript objects, rather than converting back and forth to BSON between the map and reduce phases. Fewer conversions means fewer copies, which saves time. Great, I'll just turn that on and ...
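
From pymongo, extra keyword arguments to map_reduce() get passed through as options on the underlying mapReduce command, so turning it on should look something like this (mapper and reducer being the same JavaScript function strings as in part 1):

from bson.code import Code

# jsMode=True is passed straight through to the mapReduce command
db.collection.map_reduce(Code(mapper), Code(reducer), 'output_collection', jsMode=True)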

| Document Count | Map Reduce, jsMode | Python |
|---|---|---|
| 10¹ | 0.8 | 13.6 |
| 10² | 8.3 | 43.1 |
| 10³ | 42.1 | 65.6 |
| 10⁴ | 67.4 | 66.6 |
| 10⁵ | 70.6 | 68.0 |

Ok, now it's starting to get better. JS Mode helps significantly, but we're still just breaking even.

Ramping up

130,000 documents sounds like a lot to a human, but this is the product of one sensor over a few months, collecting maybe 1,000 data points per day. If I had a few more sensors or increased my collection frequency, that number could easily increase by an order of magnitude. Plus, there are certain economies of scale when scanning through data. Mongo is touted for its prowess with "big data."

I'm fresh out of Big data, so I faked it by inserting my data nine more times into the database.
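
Roughly like this (a sketch of the idea rather than the script I used; re-running the CSV loader nine more times would work just as well):

# read everything once, then insert nine more copies with fresh _ids
originals = list(db.collection.find())        # ~130k small documents fits in memory
for _ in range(9):
    for doc in originals:
        doc.pop('_id', None)                  # let mongo assign a new _id
        db.collection.insert(doc)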

Results with ~1.3 million documents in the collection: Map Reduce in JS Mode vs. Python.

| Document Count | Map Reduce, JS Mode | Python |
|---|---|---|
| 10¹ | 1.8 | 4.2 |
| 10² | 14.0 | 6.8 |
| 10³ | 52.3 | 7.3 |
| 10⁴ | 64.2 | 7.3 |
| 10⁵ | 69.2 | 7.4 |
| 10⁶ | 49.7 | 6.9 |

Oh dear, something terrible has happened. The increased index size has reduced the overall performance quite a bit.

And, with this many documents, I can see where the performance peaks are. The maximum performance seems to be somewhere around 100k documents.

This is not what I would call a satisfying result. For one, the Python performance is absurd. 7.4 documents per millisecond? That's almost 13.5 seconds to process 100,000 documents. Plus, Map Reduce performance has dropped by 30%. We can do better.

In thinking about this, I realized that in python, in order to compute statistics on one field in my document, I was actually fetching the entire document (40 fields), while Map Reduce was only getting exactly one. This isn't really fair.

This led me to discover that you can tell Mongo to return only particular fields from a query, by passing a projection that names the fields you want:

result = db.collection.find({}, {'your-field': 1})

Using that, Python can now ask for exactly one field. Plus, the list-based stats code starts to slow way down once the list exceeds 10,000 items or so, so I know I can make Python go faster still by dropping the list and computing values with single-variable accumulators.
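
The accumulator version is shaped roughly like this (the field name is a stand-in, and this is the gist rather than my actual stats class):

count = 0
total = 0.0
low = float('inf')
high = float('-inf')

# fetch only the one field, and fold each value into running totals
for doc in db.collection.find({}, {'rpm': 1}).limit(100000):
    value = float(doc['rpm'])
    count += 1
    total += value
    low = min(low, value)
    high = max(high, value)

print({'count': count, 'sum': total, 'min': low, 'max': high, 'mean': total / count})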

With those modifications to the Python code, the results now look like this:

| Document Count | Map Reduce, JS Mode | Python, Accumulators, 1 Field |
|---|---|---|
| 10¹ | 1.8 | 13.9 |
| 10² | 14.0 | 42.5 |
| 10³ | 52.3 | 62.7 |
| 10⁴ | 64.2 | 64.7 |
| 10⁵ | 69.2 | 66.0 |
| 10⁶ | 49.7 | 42.6 |

Well, that's more reasonable. Python is now back in the running, thanks to a fair fight. However, Mongo is barely in the lead, and then only when the number of documents starts to get large.

Back to the drawing board.

It was about this time that my attention started wandering over to CouchDB, Redis, and other NoSQL-y things. I really don't want to use CouchDB, mostly because I didn't like MemBase, and Couch implies relaxation, not performance. Yes, I reject this database because I don't like its name (as a PDX resident, I also feel the need to pronounce it Cooch-DB; if you're from here, you should know why). Redis is fine, but it's not a document store. And there are a million other NoSQL databases. Sigh.

The internet then found me a thing called Robomongo, which is a really slick UI for mongo servers. And, as I installed it and started poking around, what did I find but a database object category called "Functions."

Eureka

It makes sense that Mongo, which runs on V8, and is javascripty, would be able to store functions server-side for use in Map Reduce or whatever else.

Presumably, one of the reasons the map reduce functions were slow before is because there is overhead to compile them. Based on the performance results, I would guess that the compilation isn't stored anywhere either. Mongo probably just compiles & executes them, and then throws them away again. This is a lot of overhead.

Given the ability to store functions in the database, hopefully they get compiled and cached, avoiding that overhead.

So, time to run the tests again, this time with the map & reduce functions inside the db. And just for fun, I'll refactor the python stats code to use accumulators instead of lists.
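
For the record, the functions get stored in the special system.js collection. My understanding (and it is an assumption on my part) is that anything saved there is visible to the server-side JavaScript context, so the map and reduce arguments can shrink to thin wrappers around the stored functions; something like this, using the old save() helper:

from bson.code import Code

# save the part 1 functions server-side; _id becomes the function's name in JS
db.system.js.save({'_id': 'mapit', 'value': Code('function () { ... }')})
db.system.js.save({'_id': 'reduceit', 'value': Code('function (key, values) { ... }')})

# thin wrappers that just call the stored functions
mapper = Code('function () { return mapit.call(this); }')
reducer = Code('function (key, values) { return reduceit(key, values); }')

db.collection.map_reduce(mapper, reducer, 'output_collection', jsMode=True)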

| Document Count | Map Reduce 2, JS Mode | Python, Accumulators |
|---|---|---|
| 10¹ | 1.5 | 13.9 |
| 10² | 18.1 | 42.5 |
| 10³ | 61.7 | 62.7 |
| 10⁴ | 128.4 | 64.7 |
| 10⁵ | 123.7 | 66.0 |
| 10⁶ | 71.0 | 42.6 |

And finally we have a race worth watching. Python takes an early lead, then Map Reduce comes in strong from behind. They're neck and neck through 1000 items, and then Map Reduce is off like a rocket from 10k — 100k, and finally both tire out at 1000k.

Thus, with a few small tweaks, I've been able to increase the map reduce performance by about 5x. That's not bad.

Continued in part 3, What do all these numbers mean?

Hello, NoSQL

Learning Mongo, part 1: Hello, NoSQL

A few months ago, I attached an OBD sensor to my car, and I've been collecting loads of telemetry ever since. This data is screaming out for statistics.

I had been loading up data in memory to run calculations. But my data was starting to get to the point where storing it all in memory was becoming inconvenient (really big lists in Python tend to get slow).

So I started to think about data storage options. Normally, I would use an SQL database to store my data and do calculations. SQL is an old friend of mine, but this data isn't relational; it's a bunch of data points from different sensors. If I used an SQL database, I would get none of the benefits of the relational model, and all of the inflexibility of structured query language.

MongoDB is a well-known non-relational database (a document store). I've never used it before, but this task seemed suitable for mongo, so I figured it was a reasonable first choice.

Data, meet Mongo

For the last couple weeks, I've been learning how to feed mongo data and get it back out.

One of the big stumbling blocks for me initially was overcoming the urge to model the data, as I would if I were using an SQL database. That step just seems unnecessary in mongo. I don't even have to create a database; I can just reference one and mongo makes sure it's there.

My source data comes in CSV format. Each row has a date stamp, followed by about 40 sensor data values.

Ultimately, I just stuffed each row into a mongo 'document' and stored it away. My loading algorithm is essentially this:

from csv import DictReader
from pymongo import MongoClient

db = MongoClient().db_name            # mongo creates the database on first use

with open(csv_file, encoding='utf-8') as f:
    csv = DictReader(f)               # each row becomes a dict keyed by the CSV header
    for row in csv:
        db.collection.insert(row)     # one document per row

Inserting data with mongo is absurdly easy. Part of me still wants to model something, but I still don't know how best to go about that in Mongo.

Aggregation

In mongo-land I have access to a map reduce function.

After scratching my head for a while, I produced a map reduce function that computed some marginally useful stats: min, max, sum, and count.

// Map
function mapit() {
    var item = this['field'];
    emit(0, {sum: item, min: item, max: item, count: 1});
}

// Reduce
function reduceit(key, values) {
    var reduced = values[0];

    for (var i = 1; i < values.length; i++) {
        reduced.sum += values[i].sum;
        reduced.min = Math.min(reduced.min, values[i].min);
        reduced.max = Math.max(reduced.max, values[i].max);
        reduced.count += values[i].count;
    }

    return reduced;
}

For a brief explanation: The map function's job is to fetch data, and assign it to a key. In this case, there is exactly one key, which I call 0. Then, the reduce function receives a key and a list of values that were emit-ed for that key. It then does something like add all the values up and return the result.

Unfortunately, Mongo requires functions to be written in JavaScript.

From Python, you do the map reduce like so:

from pymongo import MongoClient
from bson.code import Code

# connect to mongo
client = MongoClient()
db = client.db_name

# define your functions (bson.Code marks them as JavaScript code for the server)
mapper = Code('''function () { ... };''')
reducer = Code('''function (key, values) { ... };''')

# map reduce!
db.collection.map_reduce(mapper, reducer, 'output_collection')

The map reduce does its thing, and then you look in your output_collection for the results.

>>> from pprint import pprint
>>> result = db.output_collection.find_one()
>>> pprint(result)
{'_id': 1.0,
 'value': {'count': 1000.0,
           'max': 459.1000061,
           'min': 109.0,
           'sum': 317039.99999936076}}

Although this was exciting, it seemed slower than manually computing stats in python. The mapping and reducing happens close to the data, so surely it should be faster!?

From what I've read about Mongo, doing this kind of thing in the database isn't supposed to be fast. Mongo itself is supposed to be screaming fast, but for queries, not number crunching. Maybe I just need to use it as a data store and keep the computation at the application level.

Since I am new to mongo, I have no intuition to rely on to determine if it's "Fast" or "Slow" in mongo terms. So, I indulged this tangential question, and spent an afternoon measuring.

Continued in part 2: Go faster, please