Learning Mongo, part 1: Hello, NoSQL
A few months ago, I attached an OBD sensor to my car, and I've been collecting loads of telemetry ever since. This data is screaming out for statistics.
I had been loading up data in memory to run calculations. But my data was starting to get to the point where storing it all in memory was becoming inconvenient (really big lists in Python tend to get slow).
So I started to think about data storage options. Normally, I would use an SQL database to store my data and do calculations. SQL is an old friend of mine, but this data isn't relational; it's a bunch of data points from different sensors. If I used an SQL database, I would get none of the benefits of the relational model and all of the inflexibility of SQL.
MongoDB is a well-known non-relational database (a document store). I've never used it before. This task seemed suitable for mongo, so I figured it was a reasonable first choice.
Data, meet Mongo
For the last couple of weeks, I've been learning how to feed mongo data and get it back out.
One of the big stumbling blocks for me initially was overcoming the urge to model the data, as I would if I were using an SQL database. That step just seems unnecessary in mongo. I don't even have to create a database; I can just reference one and mongo makes sure it's there.
My source data comes in CSV format. Each row has a date stamp, followed by about 40 sensor data values.
Ultimately, I just stuffed each row into a mongo 'document' and stored it away. My loading algorithm is essentially this:
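from csv import DictReader
# csv_file is the path to the sensor log; db is a pymongo database handle
with open(csv_file, encoding='utf-8', newline='') as f:
    reader = DictReader(f)
    for row in reader:
        # each CSV row becomes one document
        db.collection.insert_one(row)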
Inserting data with mongo is absurdly easy. Part of me still wants to model something, but I still don't know how best to go about that in Mongo.
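One detail of the loading loop is worth illustrating: DictReader yields every value as a string, so documents inserted straight from it store text rather than numbers. Below is a minimal sketch of converting a row before inserting it. The helper name convert_row and the column names in the sample are hypothetical, and the in-memory sample stands in for the real 40-column file.

```python
from csv import DictReader
from io import StringIO


def convert_row(row):
    """Return a copy of row with numeric-looking strings parsed as floats."""
    converted = {}
    for name, value in row.items():
        try:
            converted[name] = float(value)
        except (TypeError, ValueError):
            # Not a number (e.g. the date stamp or an empty cell): keep as-is.
            converted[name] = value
    return converted


# Hypothetical two-sensor sample; the real file has about 40 sensor columns.
sample = StringIO("timestamp,rpm,speed\n2014-01-01 08:00:00,1500,42.5\n")
rows = [convert_row(row) for row in DictReader(sample)]
print(rows)
```

In the loading loop, the converted dictionary would be passed to the insert call in place of the raw row.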
Aggregation
In mongo-land I have access to a map reduce function.
After scratching my head for a while, I produced a map reduce function that computed some marginally useful stats: min, max, sum, and count.
// Map: emit one value per document, all under the single key 0
function mapit() {
    var item = this['field'];
    emit(0, {sum: item, min: item, max: item, count: 1});
}
// Reduce: fold all values emitted for a key into one value of the same shape
function reduceit(key, values) {
    var reduced = values[0];
    for (var i = 1; i < values.length; i++) {
        reduced.sum += values[i].sum;
        reduced.min = Math.min(reduced.min, values[i].min);
        reduced.max = Math.max(reduced.max, values[i].max);
        reduced.count += values[i].count;
    }
    return reduced;
}
For a brief explanation: the map function's job is to fetch data and assign it to a key. In this case, there is exactly one key, which I call 0. Then, the reduce function receives a key and a list of values that were emitted for that key. It combines them (here by summing, taking the min and max, and counting) and returns the result.
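To make the flow concrete, here is a rough pure-Python imitation of what the map and reduce steps do. It is only a sketch of the idea, not how Mongo runs it internally. The function names and the sample documents are made up for illustration.

```python
from collections import defaultdict


def map_doc(doc, field):
    """Emit one (key, value) pair per document, like the map function."""
    item = doc[field]
    return 0, {"sum": item, "min": item, "max": item, "count": 1}


def reduce_values(key, values):
    """Fold all values emitted for one key into a single value."""
    reduced = dict(values[0])  # copy so the input list is not mutated
    for value in values[1:]:
        reduced["sum"] += value["sum"]
        reduced["min"] = min(reduced["min"], value["min"])
        reduced["max"] = max(reduced["max"], value["max"])
        reduced["count"] += value["count"]
    return reduced


def map_reduce(docs, field):
    """Group emitted values by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        key, value = map_doc(doc, field)
        grouped[key].append(value)
    return {key: reduce_values(key, values) for key, values in grouped.items()}


# Hypothetical sensor documents.
docs = [{"rpm": 1500.0}, {"rpm": 900.0}, {"rpm": 2100.0}]
stats = map_reduce(docs, "rpm")
print(stats)
```

The reduced value has the same shape as the emitted one. That matters because Mongo may call the reduce function more than once for a key, feeding earlier results back in as inputs.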
Unfortunately, Mongo requires functions to be written in JavaScript.
From Python, you do the map reduce like so:
from pymongo import MongoClient
# connect to mongo
client = MongoClient()
db = client.db_name
# define your functions
mapper = '''function () { ... };'''
reducer = '''function (key, values) { ... };'''
# map reduce! (Collection.map_reduce is a PyMongo 3.x API; it was removed in PyMongo 4.0)
db.collection.map_reduce(mapper, reducer, 'output_collection')
The map reduce does its thing, and then you look in your output_collection for the results.
>>> from pprint import pprint
>>> result = db.output_collection.find_one()
>>> pprint(result)
{'_id': 1.0,
'value': {'count': 1000.0,
'max': 459.1000061,
'min': 109.0,
'sum': 317039.99999936076}}
Although this was exciting, it seemed slower than manually computing stats in Python. The mapping and reducing happens close to the data, so surely it should be faster!?
From what I've read about Mongo, doing this kind of thing in the database isn't supposed to be fast. Mongo itself is supposed to be screaming fast, but for queries, not number crunching. Maybe I just need to use it as a data store and keep the computation at the application level.
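For comparison, keeping the computation at the application level could look roughly like the sketch below: one pass over the documents, tracking the same four stats. The function name field_stats and the sample documents are hypothetical. With pymongo, the iterable would presumably be a cursor such as db.collection.find(), but a plain list works for illustration.

```python
def field_stats(docs, field):
    """Compute sum, min, max and count of one field in a single pass."""
    stats = None
    for doc in docs:
        item = doc[field]
        if stats is None:
            # First document seeds every statistic.
            stats = {"sum": item, "min": item, "max": item, "count": 1}
        else:
            stats["sum"] += item
            stats["min"] = min(stats["min"], item)
            stats["max"] = max(stats["max"], item)
            stats["count"] += 1
    return stats  # None when docs is empty


# Hypothetical documents standing in for a query result.
docs = [{"speed": 42.5}, {"speed": 0.0}, {"speed": 88.0}]
print(field_stats(docs, "speed"))
```

Because it consumes any iterable one document at a time, this never needs the whole data set in a Python list.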
Since I am new to mongo, I have no intuition to rely on to determine if it's "Fast" or "Slow" in mongo terms. So, I indulged this tangential question, and spent an afternoon measuring.
Continued in part 2: Go faster, please