NoSQL Berlin Meetup Notes

Posted by phillip Sat, 31 Oct 2009 19:39:00 GMT

“The world is diverse. Act accordingly.”
—Prof. Dr. Stefan Edlich, in his talk on object databases.

Where do you store your data? In a relational database, of course. It’s so convenient to use the persistence store we are used to, the one that has been there for us since the day we started programming. But in the spirit of using the right tool for the job, and making our lives easier, it pays off to know other persistent stores: those which aren’t based on the RDBMS/SQL paradigm. They promise to be better suited to some of the problems we face day to day, among them mapping the real world to persistent storage, scaling, and reliability.

The NoSQL meetup in Berlin gave a great overview of this active and growing scene, and shed some light on the characteristics of the main tools. Here are some rough notes from the meetup. For the full monty, all videos and slides of the talks are available at the NoSQL Berlin website. Thanks, guys, for the perfect organization!

Consistency in Key-Value Stores (Monika Moser)

The only talk which wasn’t about a specific database. It gave an introduction to the problems and solutions that we face when working with many database servers (nodes). Since written data has to be distributed across many physical machines, there is a noticeable delay until every node has received the updated data: the replication lag. Only once the replication lag has passed do all nodes contain the same (i.e. consistent) data.

Two types of consistency were distinguished: Strong consistency (updated data is immediately available to all processes in the system) and eventual consistency (at some point in time all processes will get the update).

Strong consistency is usually expensive to implement on larger systems and isn’t always necessary, so eventual consistency is often acceptable. Depending on the use case, one can go for one of these subtypes of eventual consistency:

  • “read your writes” consistency

The process that wrote the data will always get the latest data. Other processes may still get old data for some time.

  • session consistency

A special case of the above: only the session that wrote the data is guaranteed to get the latest data immediately.

  • monotonic read consistency

After a process has read the new data, all of its following reads get it, too. So once a process has seen the new data, the old data doesn’t appear to it again.
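To make these guarantees a bit more tangible, here is a minimal sketch of a toy cluster with replication lag. It is not taken from the talk: the `Replica`, `Cluster` and `Session` classes are made up for illustration, and a real store tracks versions per key and replicates in the background. The session remembers the newest version it has written or read and refuses to go back behind it, which gives it read-your-writes and monotonic reads.

```python
class Replica:
    """One storage node holding a versioned copy of a single value."""

    def __init__(self):
        self.version = 0
        self.value = None


class Cluster:
    """Toy cluster: a write lands on one replica and spreads only on replicate()."""

    def __init__(self, size):
        self.replicas = [Replica() for _ in range(size)]
        self.latest = 0

    def write(self, value, node=0):
        self.latest += 1
        replica = self.replicas[node]
        replica.version, replica.value = self.latest, value
        return self.latest

    def replicate(self):
        """End of the replication lag: copy the newest state to every node."""
        newest = max(self.replicas, key=lambda r: r.version)
        for replica in self.replicas:
            replica.version, replica.value = newest.version, newest.value


class Session:
    """Client session that remembers the highest version it wrote or read."""

    def __init__(self, cluster):
        self.cluster = cluster
        self.seen = 0

    def write(self, value, node=0):
        self.seen = max(self.seen, self.cluster.write(value, node))

    def read(self, node):
        """Read from `node`, falling back to a fresh enough replica if it is stale."""
        replica = self.cluster.replicas[node]
        if replica.version < self.seen:
            replica = next(r for r in self.cluster.replicas if r.version >= self.seen)
        self.seen = max(self.seen, replica.version)
        return replica.value


cluster = Cluster(3)
writer, other = Session(cluster), Session(cluster)

writer.write("v1", node=0)
stale_for_other = other.read(node=2)    # None: replica 2 has not caught up yet
fresh_for_writer = writer.read(node=2)  # "v1": the writer's session reads its own write
cluster.replicate()
after_lag = other.read(node=2)          # "v1": eventually everybody sees the update
```

The other session gets old data until replication has happened, which is exactly the window that eventual consistency allows.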

Monika went on to describe the CAP theorem (choose 2 of 3 for your storage setup: partition tolerance, availability, consistency), the reasons strong consistency is expensive, and the Paxos algorithm (a good trade-off between fault tolerance and consistency). See the slides and video for details!

My personal summary of the talk: good overview with lots of pointers to further info. I’ll look into the details I didn’t grasp once I actually need them.

Redis, Fast and Furious (Mathias Meyer)

Redis is awesome, I heard someone say.

Oh … Redis is also like memcached, but with extra features: persistence and additional commands (increment values, sets, push/pop, sorting), all over a simple text-based protocol. It is also slower than memcached, but not by so much that you would care.

It is also like K.I.T.T., if you believe Mathias.

According to Mathias, Redis is put to good use when storing statistical data (as long as it fits in memory!) and implementing worker queues.
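The two use cases can be sketched without a running server. The following is a small in-memory stand-in, not a Redis client: `TinyStore` and the key names are made up for illustration, and only the method names follow the Redis commands INCR, LPUSH and RPOP.

```python
from collections import defaultdict, deque


class TinyStore:
    """In-memory stand-in mimicking three Redis commands (not a Redis client)."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.lists = defaultdict(deque)

    def incr(self, key):
        """Like INCR: add 1 to a counter and return the new value."""
        self.counters[key] += 1
        return self.counters[key]

    def lpush(self, key, value):
        """Like LPUSH: push onto the head of a list and return the new length."""
        self.lists[key].appendleft(value)
        return len(self.lists[key])

    def rpop(self, key):
        """Like RPOP: pop from the tail of a list, or return None when it is empty."""
        return self.lists[key].pop() if self.lists[key] else None


store = TinyStore()

# Statistics: count page views per URL.
for url in ["/home", "/about", "/home"]:
    store.incr("hits:" + url)

# Worker queue: producers push at the head, workers pop from the tail,
# so jobs come out in the order they went in.
store.lpush("jobs", "resize-image-1")
store.lpush("jobs", "send-mail-2")

processed = []
job = store.rpop("jobs")
while job is not None:
    processed.append(job)
    job = store.rpop("jobs")
```

With a real Redis server, the same pattern works across processes, because the counters and lists live in the server rather than in the client.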

Peer-to-peer Applications with CouchDB (Jan Lehnardt)

  • Jan contradicted himself on the first slide. It read: “Relax.” Then he started a 10_000 WPM (words per minute) presentation that still managed to raise my interest in CouchDB again. The presentation was about “what can it do” rather than “how to do it”. A good choice.
  • a nice explanation: “CouchDB is built ‘of the Web’”. REST, JSON and HTTP are core technologies of the database.
  • Learning curve: store full documents (JSON), not relations. No data normalization into tables => make developers happy, not computers.
  • meant to be robust: append-only design for the database file. On a crash, old data is not damaged.
  • scales out (horizontally). Does Master-Master replication. Scaling out is not built in, but CouchDB is prepared for it (use couchdb-lounge). A scaled CouchDB cluster then looks like a single DB from the outside.
  • scales down (runs on small devices). Own your own data, take it with you on your device.
  • incremental map-reduce: after updates, only the affected documents get reindexed
  • as with any document-oriented database, store full documents as JSON, not relations. Good tip in the Q&A: a document is something that will be updated and used as a whole. “Put stuff into separate documents when it is updated separately.” There’s no clear guideline, however; it depends on the use case.
  • RESTful HTTP: “text-based protocol is not slower than binary” / “all HTTP infrastructure and tools can be used”
  • BBC uses CouchDB in production, after a survey/comparison of storage solutions.
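The incremental map-reduce idea from the list above can be sketched in a few lines. This is an illustration only, not CouchDB’s implementation: `IncrementalView` and `tags_of` are hypothetical names, and the sketch re-runs the reduce step on every query, whereas CouchDB also stores intermediate reduce results. The point is that the map function runs once per changed document, not once per document in the database.

```python
class IncrementalView:
    """Toy view index: caches map output per document, re-maps only changed docs."""

    def __init__(self, map_fn, reduce_fn):
        self.map_fn = map_fn
        self.reduce_fn = reduce_fn
        self.rows = {}       # doc id -> list of (key, value) rows emitted for it
        self.map_calls = 0   # how often map_fn ran, to show the saved work

    def update(self, doc_id, doc):
        """Called for a new or changed document: re-map just this one."""
        self.map_calls += 1
        self.rows[doc_id] = list(self.map_fn(doc))

    def query(self):
        """Group all cached rows by key and reduce each group."""
        grouped = {}
        for rows in self.rows.values():
            for key, value in rows:
                grouped.setdefault(key, []).append(value)
        return {key: self.reduce_fn(values) for key, values in grouped.items()}


def tags_of(doc):
    """Map function: emit (tag, 1) for every tag of a document."""
    for tag in doc.get("tags", []):
        yield tag, 1


view = IncrementalView(tags_of, sum)
docs = {
    "a": {"title": "Relax", "tags": ["couchdb", "nosql"]},
    "b": {"title": "Notes", "tags": ["nosql"]},
}
for doc_id, doc in docs.items():
    view.update(doc_id, doc)
before = view.query()

# Only document "b" changes, so only "b" is mapped again.
view.update("b", {"title": "Notes", "tags": ["nosql", "berlin"]})
after = view.query()
```

After the update, `view.map_calls` is 3 rather than 4: document "a" was never touched again.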

Riak (Martin Scholl)

  • document-oriented DB like CouchDB. “Riak combines a decentralized key-value store, a flexible map/reduce engine, and a friendly HTTP/JSON query interface to provide a database ideally suited for Web applications.”
  • 100% awesome. Though disputable, even more awesome than CouchDB.
  • the Riak “Data-Sphere” consists of: Bucket x Key x Document
  • GET/POST/PUT /jiak/<bucket>/<key>
  • travel the graph/links between documents with map/reduce
  • but: travelling links is expensive (map/reduce results are not cached, although you can implement caching yourself)
  • Bucket: can have as many keys as you want
  • chainable map/reduce stages: a unique feature of Riak
  • “It is extensible and configurable in many ways. Riak is a perfect fit for building reliable and scalable custom data storage systems.”
  • unfortunately my brain went offline through the second half of the talk … See the video
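As far as I understood it, travelling links with chained map/reduce stages works roughly like the following sketch. None of this is Riak’s API: the in-memory `store`, the stage functions and `run` are made up for illustration, and the document layout with a `links` list is an assumption. Each stage consumes the output of the previous one.

```python
# Hypothetical in-memory "data sphere": (bucket, key) -> document.
store = {
    ("people", "anna"): {"name": "Anna", "links": [("people", "ben"), ("people", "cleo")]},
    ("people", "ben"): {"name": "Ben", "links": [("people", "cleo")]},
    ("people", "cleo"): {"name": "Cleo", "links": []},
}


def follow_links(doc):
    """Map stage: emit every document that this document links to."""
    return [store[ref] for ref in doc["links"]]


def extract_name(doc):
    """Map stage: emit just the name of a document."""
    return [doc["name"]]


def unique_sorted(values):
    """Reduce stage: collapse duplicates and sort the result."""
    return sorted(set(values))


def run(inputs, stages):
    """Run ("map", fn) and ("reduce", fn) stages in order, chaining their outputs."""
    data = list(inputs)
    for kind, fn in stages:
        if kind == "map":
            data = [out for item in data for out in fn(item)]
        else:
            data = fn(data)
    return data


# Two link hops from Anna, then names, then a final reduce.
friends_of_friends = run(
    [store[("people", "anna")]],
    [
        ("map", follow_links),
        ("map", follow_links),
        ("map", extract_name),
        ("reduce", unique_sorted),
    ],
)
```

Every extra hop is another full stage over all intermediate documents, which gives a feeling for why travelling links gets expensive without caching.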

MongoDB (Mathias Stearn)

Quote: “I won’t tell you MongoDB is awesome. But I hope you’ll know it is after my talk.”

  • Mongo as in “HuMONGOus” scaling
  • Schemaless; data organized into Databases and Collections (like tables). But document-oriented (not a K/V store)
  • Good when you don’t know up front what you will be looking for (example: logfile analysis), and want to store everything.
  • They extended JSON with the data types Date, Int32/64, OID and Binary, and called it BSON (B as in binary).
  • Wants to integrate with the native language as well as possible, i.e. “db.users.find({$where: "this.a + this.b >= 42"})” instead of “RestClient.get 'http://example.com/resource'”. And, by the way: old-school C++.
  • changing only part of a document is possible. features: $set, $inc, $push, $pull, $remove for subdocuments
  • “you can put all data in one place, MongoDB”. Get rid of RDBMS.
  • works for 1 billion documents.
  • map/reduce + finalizers.
  • uses the eventual consistency model (see first talk)
  • uses memory-mapped (mmap) database files, so the OS kernel automatically uses available RAM for them
  • async modifications: no server response, client doesn’t wait. Good for bulk inserts.
  • good for: websites, complex objects, high and low volume sites, real-time analysis.
  • bad for: complex transactions, business intelligence
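To illustrate what changing only part of a document means, here is a sketch that applies a few of the modifiers from the list above to a plain dictionary. It is not MongoDB code: `apply_update` is a hypothetical helper, it handles top-level fields only, and it covers just `$set`, `$inc`, `$push` and `$pull`. In MongoDB itself the server applies the modifiers, so the client never has to send the whole document back.

```python
import copy


def apply_update(doc, update):
    """Return a copy of `doc` with the modifiers in `update` applied."""
    result = copy.deepcopy(doc)
    for op, fields in update.items():
        for field, arg in fields.items():
            if op == "$set":
                result[field] = arg                                  # overwrite or add a field
            elif op == "$inc":
                result[field] = result.get(field, 0) + arg           # increment a number
            elif op == "$push":
                result.setdefault(field, []).append(arg)             # append to an array
            elif op == "$pull":
                result[field] = [v for v in result.get(field, []) if v != arg]  # remove matches
            else:
                raise ValueError("unsupported modifier: " + op)
    return result


user = {"name": "phillip", "visits": 1, "tags": ["ruby", "sql"]}

updated = apply_update(user, {
    "$set": {"city": "Berlin"},
    "$inc": {"visits": 1},
    "$push": {"tags": "nosql"},
})
updated = apply_update(updated, {"$pull": {"tags": "sql"}})
```

The update document only names the fields that change; everything else in the stored document stays as it was.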

4th Generation Object Databases (Stefan Edlich)

  • object databases have been around for 15 years
  • “no impedance mismatch”. These DBs are very nice to work with in OOP: there is no need to disassemble objects into tables and reassemble them afterwards. Just dump full objects, and load them again.
  • but: when refactoring code, DB has to change. No insulation between code and data.
  • looks as if they are alive and well in a specific niche (extremely large datasets)
  • db4o is a simple way to try an OO database.
  • typical applications: transportation networks, tree structures, social graphs, object traversal, capture space (grok this!). He gave an example of one OODB application which stores 3.2 million objects per second, 1 TB of data per hour.
  • no convincing answer why OO databases haven’t entered mainstream, while OO programming has—it sounds like such a good idea. My impression was they are a great tool for specific uses (high performance, huge scale), but exotic and commercial solutions with high up-front investment.
  • Somehow I wonder if document-oriented databases will make it, when object-oriented DBs haven’t …

A talk not held: Neo4j – The Benefits of Graph Databases

There was no talk on graph databases, which are designed to store nodes and the relationships between them, as in social networks (nodes = people, relationships = connections/friendships). Slideshare to the rescue: it has “Neo4j – The Benefits of Graph Databases”. There was also a talk on Neo4j at NoSQLEast.
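For a rough idea of the data model, here is a minimal in-memory sketch of nodes, typed relationships and a traversal. It has nothing to do with Neo4j’s actual API: the `Graph` class, the `FRIEND` relationship type and the names are made up for illustration.

```python
from collections import deque


class Graph:
    """Toy graph: nodes are plain strings, relationships are typed and undirected."""

    def __init__(self):
        self.edges = {}   # node -> set of (relationship type, other node)

    def relate(self, a, rel, b):
        self.edges.setdefault(a, set()).add((rel, b))
        self.edges.setdefault(b, set()).add((rel, a))

    def within(self, start, rel, max_depth):
        """Breadth-first traversal: nodes reachable over `rel` in at most max_depth hops."""
        depth = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if depth[node] == max_depth:
                continue
            for r, other in self.edges.get(node, ()):
                if r == rel and other not in depth:
                    depth[other] = depth[node] + 1
                    queue.append(other)
        del depth[start]
        return depth


g = Graph()
g.relate("anna", "FRIEND", "ben")
g.relate("ben", "FRIEND", "cleo")
g.relate("cleo", "FRIEND", "dan")

# Friends and friends of friends, with their distance from Anna.
nearby = g.within("anna", "FRIEND", 2)   # {"ben": 1, "cleo": 2}
```

In a relational database the same question needs one self-join per hop, which is the kind of query graph databases are meant to make cheap.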
