monary's Introduction

Introduction

MongoDB is a document-oriented database, organized for quick access to records (or rows) of data. When doing analytics on a large data set, it is often desirable to have it in a column-oriented format. Columns of data may be thought of as mathematical vectors, and a wealth of techniques exist for gathering statistics about data that is stored in vector form.

For small to medium-sized collections, it is possible to materialize several columns of data in the memory of a modern PC. For example, an array of 100 million double-precision numbers consumes 800 million bytes, or about 0.75 GB. For larger problems, it's still possible to materialize a substantial portion of the data, or to work with data in multiple segments. (Very large problems require more powerful weapons, such as map/reduce.)
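The memory estimate above is simple arithmetic, and a short sketch makes it easy to redo for other column sizes. The helper name `column_bytes` is made up for this illustration.

```python
import struct

def column_bytes(num_values, fmt="d"):
    """Bytes needed to store num_values packed values of struct format fmt."""
    return num_values * struct.calcsize(fmt)

n = 100_000_000                  # 100 million values
size = column_bytes(n)           # "d" is a double-precision float
print(size, "bytes")
print(round(size / 2**30, 3), "GiB")  # dividing by 2**30 gives the binary "GB"
```

Several such columns fit comfortably in a few gigabytes of RAM, which is the point of the paragraph above.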

Extracting column data from MongoDB using Python is fairly straightforward. In PyMongo, collection.find() generates a sequence of dictionary objects. When dealing with millions of records, the trick is not to keep these dictionaries in memory, as they tend to be large. Fortunately, it's easy to move the data into arrays as it is loaded.

First, let's create 3.5 million rows of test data:
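A minimal sketch of such a loader follows. It makes several assumptions that the text does not state: each record holds five random floats in fields named x1 through x5 (five values per record matches the "200,000+ values per second" figure quoted later), the rows are inserted in batches of 1000, and `collection` is any object with PyMongo's `insert_many` method.

```python
import random

NUM_BATCHES = 3500   # 3500 batches * 1000 records = 3.5 million records
BATCH_SIZE = 1000
FIELDS = ["x1", "x2", "x3", "x4", "x5"]  # assumed field names

def make_batch(batch_size, rng):
    """Return a list of records, each holding one random float per field."""
    return [
        {name: rng.uniform(0, k) for k, name in enumerate(FIELDS, start=1)}
        for _ in range(batch_size)
    ]

def insert_test_data(collection, num_batches=NUM_BATCHES,
                     batch_size=BATCH_SIZE, seed=0):
    """Insert the test rows batch by batch; return the number of rows inserted."""
    rng = random.Random(seed)
    total = 0
    for _ in range(num_batches):
        batch = make_batch(batch_size, rng)
        collection.insert_many(batch)  # one round trip per batch
        total += len(batch)
    return total
```

With a local mongod running, this could be called as insert_test_data(pymongo.MongoClient("localhost").mydb.collection), where the database and collection names are placeholders.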

Here's an example that uses numpy arrays:
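The sketch below shows one way to do this, assuming records with five numeric fields named x1 through x5 (the field names are an assumption, not taken from the text). The arrays are allocated once, and each dictionary is discarded as soon as its values have been copied.

```python
import numpy as np

FIELDS = ["x1", "x2", "x3", "x4", "x5"]  # assumed field names

def load_columns(records, num_records, fields=FIELDS):
    """Copy the named fields of each record into preallocated float64 arrays."""
    arrays = [np.zeros(num_records) for _ in fields]
    for i, record in enumerate(records):
        for array, name in zip(arrays, fields):
            array[i] = record[name]
        # The dictionary is not retained, so only the arrays occupy memory.
    return arrays
```

With PyMongo, the arguments could be collection.find() and collection.count_documents({}). Taking numpy.mean of each returned array is a quick check that the data arrived.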

With 3.5 million records, this query takes 85 seconds on an EC2 Large instance running Ubuntu 10.10 64-bit, and takes 88 seconds on my MacBook Pro (2.66 GHz Intel Core 2 Duo with 8 GB RAM).

These timings might seem impressive, given that they're loading 200,000+ values per second. However, closer examination reveals that much of that time is spent by pymongo as it reads each query result and transforms the BSON result to a Python dictionary. (If you watch the CPU usage, you'll see Python is using 90% or more of the CPU.)

Monary

It is possible to get (much) more speed from the query if we bypass the PyMongo driver. To demonstrate this, I've developed monary, a simple C library and accompanying Python wrapper which make use of the MongoDB C driver. The code is designed to accept a list of desired fields, and to load exactly those fields from the BSON results into some provided array storage.

Here's an example of the same query using monary:
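A hedged sketch of what that query might look like is shown here. It assumes that Monary can be used as a context manager and that its query method takes a database name, a collection name, a query spec, a list of field names, and a matching list of type names, which is how I understand the library's documented usage. The database, collection, and field names are placeholders, and the `connect` parameter exists only so the function can be tried without Monary installed.

```python
FIELDS = ["x1", "x2", "x3", "x4", "x5"]  # assumed field names

def query_columns(host="127.0.0.1", connect=None):
    """Load the named fields of every record into one array per field."""
    if connect is None:
        from monary import Monary  # imported lazily; requires monary to be installed
        connect = Monary
    with connect(host) as client:
        return client.query(
            "mydb",                     # database name (placeholder)
            "collection",               # collection name (placeholder)
            {},                         # query spec: match every record
            FIELDS,                     # fields to extract from each record
            ["float64"] * len(FIELDS),  # one Monary type name per field
        )
```

The type names come from the list of supported types further down. No Python dictionary is built per record, which is where the time savings come from.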

Monary is able to perform the same query in 4 seconds flat, for a rate of about 4 million values per second (20 times faster!). Here's a quick summary of how this Monary query stacks up against PyMongo:

  • PyMongo Insert -- EC2: 102 s -- Mac: 76 s
  • PyMongo Query -- EC2: 85 s -- Mac: 88 s
  • Monary Query -- EC2: 5.4 s -- Mac: 3.8 s
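The throughput and speedup figures can be recomputed from these timings. The sketch below assumes five values per record, which is what the "200,000+ values per second" figure implies for 3.5 million records in 85 seconds; the helper names are illustrative.

```python
RECORDS = 3_500_000
VALUES_PER_RECORD = 5  # assumption inferred from the quoted throughput

# Query timings in seconds, copied from the summary above.
TIMINGS = {
    "PyMongo Query": {"EC2": 85, "Mac": 88},
    "Monary Query": {"EC2": 5.4, "Mac": 3.8},
}

def values_per_second(seconds):
    """Values loaded per second for a full-collection query."""
    return RECORDS * VALUES_PER_RECORD / seconds

def speedup(machine):
    """How many times faster the Monary query is on the given machine."""
    return TIMINGS["PyMongo Query"][machine] / TIMINGS["Monary Query"][machine]

for name, runs in TIMINGS.items():
    for machine, seconds in runs.items():
        print(f"{name} on {machine}: {values_per_second(seconds):,.0f} values/s")
for machine in ("EC2", "Mac"):
    print(f"Speedup on {machine}: {speedup(machine):.1f}x")
```

The per-machine speedups land on either side of the "20 times faster" headline figure, which is based on the rounded 4-second timing.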

Of course, this test has created some fairly ideal circumstances: it's querying for every record in the collection, the records contain only the queried data (plus ObjectIDs), and the database is running locally. Performance may degrade if we use a remote server, if the records are larger, or if we query for only a subset of the records (requiring either that more records be scanned, or that an index be used).

Monary now knows about the following types:

  • id (Mongo's 12-byte ObjectId)
  • int8
  • int16
  • int32
  • int64
  • float32
  • float64
  • bool
  • date (stored as int64, milliseconds since epoch)
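Because dates arrive as int64 milliseconds since the epoch, they need converting before they can be used as timestamps. A minimal sketch using only the standard library follows; the function names are illustrative and not part of Monary.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def millis_to_datetime(ms):
    """Convert milliseconds since the Unix epoch to a timezone-aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=int(ms))

def datetime_to_millis(dt):
    """Convert a timezone-aware datetime back to whole milliseconds since the epoch."""
    return (dt - EPOCH) // timedelta(milliseconds=1)
```

Adding a timedelta to the epoch, rather than dividing by 1000 and calling fromtimestamp, avoids floating-point rounding and also handles dates before 1970.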

Monary's source code is available on Bitbucket. It includes a copy of the Mongo C driver, and requires compilation and installation, which can be done via the included "setup.py" file. (The installation script works, but is in a somewhat rough state. Any help from a distutils guru would be greatly appreciated!) To run Monary from Python, you will need to have the pymongo and numpy packages installed.

Monary has been slowly gaining functionality (including the recent additions of more numeric types and the date type). Here are some planned future improvements:

  • Support for string / binary types

    (I hope to develop Monary to support some reasonable mapping of most BSON types onto array storage.)

  • Support for fetching nested fields (e.g. "x.y")

  • Remove dependencies on PyMongo and NumPy (possibly)

    (Currently these must be installed in order to use Monary.)

monary's People

Contributors

ksuarz, machyne, dbeach24

monary's Issues

Monary() giving runtime error: ffi_prep_cif_var failed

Hello,
I have a mongo database that I want to explore using Monary(). The pymongo client works quite well and I am able to explore it using the MongoClient in pymongo. But when I use Monary(), I get the following runtime error:

  File "<stdin>", line 1, in <module>
  File "/home/shubhang/.local/lib/python3.9/site-packages/monary/monary.py", line 373, in __init__
    self.connect(host, port, username, password, database,
  File "/home/shubhang/.local/lib/python3.9/site-packages/monary/monary.py", line 436, in connect
    self._connection = cmonary.monary_connect(
RuntimeError: ffi_prep_cif_var failed

I tried Monary('localhost',27017) to explicitly identify the localhost as well, but to no avail. I still get the same RuntimeError.

Here is the mongod status in case it helps:

mongod.service - MongoDB Database Server
     Loaded: loaded (/lib/systemd/system/mongod.service; enabled; vendor preset: enabled>
     Active: active (running) since Mon 2022-03-14 08:40:33 CDT; 1 day 1h ago
       Docs: https://docs.mongodb.org/manual
   Main PID: 32645 (mongod)
     Memory: 2.9G
        CPU: 10min 32.377s
     CGroup: /system.slice/mongod.service
             └─32645 /usr/bin/mongod --config /etc/mongod.conf

Aggregation failures

If you perform an invalid aggregation, you get a RuntimeError because the counting stage of the aggregation pipeline will fail. This also happens if you're not connected to a proper mongod.

We probably can't do anything about this for now, but if anyone ever gets to integrate with the Python standard logging module we might be able to emit a better message.

This might happen with the query as well, but I'm not sure.
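Until something like that exists in the library, a caller can get a more helpful message by wrapping the call. This is only a sketch of the idea using the standard logging module; the logger name and the helper `call_with_logging` are hypothetical and not part of Monary.

```python
import logging

logger = logging.getLogger("monary_example")  # hypothetical logger name

def call_with_logging(operation, *args, **kwargs):
    """Run operation; if it raises RuntimeError, log a hint and re-raise."""
    try:
        return operation(*args, **kwargs)
    except RuntimeError:
        logger.exception(
            "%s failed; check that the pipeline or query is valid "
            "and that a mongod is reachable",
            getattr(operation, "__name__", repr(operation)),
        )
        raise
```

The original exception still propagates, so existing error handling keeps working; the log entry just records the likely causes named above.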

import monary raises OSError

Hello,

I'm using Mac OS X 10.10 with Anaconda 3

I installed MongoDB

$ brew install mongodb --dev

I tested my install with pymongo

>>> import pymongo
>>> client = pymongo.MongoClient("localhost", 27017)
>>> db = client.test
>>> db.name
u'test'
>>> db.my_collection
Collection(Database(MongoClient('localhost', 27017), u'test'), u'my_collection')
>>> db.my_collection.insert_one({"x": 10}).inserted_id
ObjectId('4aba15ebe23f6b53b0000000')
>>> db.my_collection.insert_one({"x": 8}).inserted_id
ObjectId('4aba160ee23f6b543e000000')
>>> db.my_collection.insert_one({"x": 11}).inserted_id
ObjectId('4aba160ee23f6b543e000002')
>>> db.my_collection.find_one()
{u'x': 10, u'_id': ObjectId('4aba15ebe23f6b53b0000000')}
>>> for item in db.my_collection.find():
...     print item["x"]
...
10
8
11
>>> db.my_collection.create_index("x")
u'x_1'
>>> for item in db.my_collection.find().sort("x", pymongo.ASCENDING):
...     print item["x"]
...
8
10
11
>>> [item["x"] for item in db.my_collection.find().limit(2).skip(1)]
[8, 11]

It was working fine

I would like to try Monary
so I also installed mongo-c-driver https://github.com/mongodb/mongo-c-driver

git clone https://github.com/mongodb/mongo-c-driver.git
cd mongo-c-driver
./autogen.sh
make
sudo make install

Then, I installed Monary

$ pip install monary

but running ipython

$ ipython

and importing monary

import monary

raised

OSError: dlopen(//anaconda/lib/python3.4/site-packages/monary/libcmonary.so, 6): Library not loaded: libcrypto.1.0.0.dylib
  Referenced from: //anaconda/lib/python3.4/site-packages/monary/libcmonary.so
  Reason: image not found

Any idea?

Kind regards
