After years of waiting, Cassandra joins its open source counterparts MySQL, MariaDB, PostgreSQL, and MongoDB as a cloud Database-as-a-Service (DBaaS). As Asha Barbaschow reported, Amazon Keyspaces for Apache Cassandra has now reached general availability. That sets the stage for real competition in what was previously a gap in the market. Where there were no managed Cassandra cloud database services, there will now be at least two. With DataStax already offering a beta service based on a completely different design approach, the market has a real choice.
As Barbaschow noted, Keyspaces is designed to make it easy to migrate on-premises workloads to the cloud. But the real question is whether this will make Cassandra, a database that has never been known for its ease of use, accessible to a wider audience.
Diamond in the rough
It’s ironic. Apache Cassandra was arguably the first NoSQL platform to introduce an operational database that was truly distributed by nature. But it is also one of the last to get its own managed DBaaS cloud service, something that AWS, and DataStax, have been seeing a lot of demand for. Both have had managed services in preview for the past few months, and now AWS has made its service generally available. AWS offers its own optimized implementation of Cassandra, which it terms a “serverless Apache Cassandra compliant service.”
Cassandra’s strength has always been scale and performance: it was one of the first truly distributed databases to support multi-master operation. Its biggest challenge has been that the database can be very complex to implement.
For example, tasks such as setup, backups, and garbage collection and compaction (the key to maintaining data consistency in a distributed database) required advanced skills because the tooling was so low-level. Part of the challenge is the inherent complexity of designing highly distributed databases. And then there is the challenge of how to model the data. In the relational world, you design tables based on expected queries, with indexes as shortcuts for queries that would otherwise require lots of joins. In Cassandra, best practice is likewise to lay out the data based on expected read and write patterns, but there are some important differences. You can index data in Cassandra, but in practice, denormalizing data across the cluster (or multiple clusters) is the best practice. As with any distributed, denormalized system, there is also the issue of getting the workload balance right.
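To make the modeling point concrete, here is a minimal sketch in plain Python, not Cassandra itself, of what designing around expected queries and denormalizing can look like. The table names, fields, and helper functions are hypothetical and chosen only for illustration; the dictionaries stand in for tables whose keys play the role of partition keys.

```python
from collections import defaultdict

# Hypothetical "tables": each maps a partition key to the rows stored under it.
orders_by_customer = defaultdict(list)  # serves "all orders for a customer"
orders_by_product = defaultdict(list)   # serves "all orders for a product"


def record_order(order_id, customer_id, product_id, amount):
    """Write the same order once per table, one table per expected query."""
    row = {
        "order_id": order_id,
        "customer_id": customer_id,
        "product_id": product_id,
        "amount": amount,
    }
    # Each table gets its own copy of the row: that duplication is the
    # denormalization, traded for reads that need no join.
    orders_by_customer[customer_id].append(dict(row))
    orders_by_product[product_id].append(dict(row))


def orders_for_customer(customer_id):
    """Answer the first query with a single partition lookup."""
    return list(orders_by_customer.get(customer_id, []))


def orders_for_product(product_id):
    """Answer the second query with a single partition lookup."""
    return list(orders_by_product.get(product_id, []))


record_order("o1", "alice", "widget", 20)
record_order("o2", "alice", "gadget", 35)
record_order("o3", "bob", "widget", 20)

print([o["order_id"] for o in orders_for_customer("alice")])  # ['o1', 'o2']
print([o["order_id"] for o in orders_for_product("widget")])  # ['o1', 'o3']
```

The cost of this layout is that every write touches several tables, and a partition key that attracts a disproportionate share of rows (a very popular product, say) becomes a hot spot. That is the workload balance problem in miniature.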
What about DataStax?
All this comes as DataStax is on the home stretch of preparing its Astra managed cloud Cassandra service, which we expect is likely to debut on Google Cloud. In the meantime, it is business as usual for DataStax Enterprise on AWS, which remains supported on EC2.
We expect DataStax to initially offer the pure Apache deployment (rather than DataStax Enterprise), designed for multiple public clouds, when it comes out of beta. While AWS’s implementation will differ, AWS is looking at opportunities where it can contribute features back to the open source community.
AWS seeks to simplify matters by offering Keyspaces as a serverless offering. This takes a page from DynamoDB, which is also serverless. As a managed service for an open source database, AWS takes an approach that comes straight out of its Amazon Aurora and DocumentDB playbooks: implement an open source database in a cloud-native architecture that separates storage from compute, with specific features optimized for AWS storage engines. The name Keyspaces refers to the top-level database container that controls replication of database objects in Apache Cassandra.
Going serverless makes life easier by eliminating the tasks of provisioning, patching, and managing servers. It also removes the need to run compactions manually, because Keyspaces has its own storage optimization that avoids the need to use Apache Cassandra’s tombstone mechanism for marking deleted data; this optimization eliminates the need to provision extra storage space to keep housing the deleted data. Being serverless, Keyspaces supports autoscaling of compute, priced either by the number of reads and writes or by the provisioned throughput level (for example, the ability to handle 50,000 reads or writes per second).
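For readers unfamiliar with tombstones, the toy store below sketches the general idea: a delete is recorded as a marker rather than removing data in place, and space is reclaimed only when segments are later merged. This is a simplified illustration in Python, not how Apache Cassandra or Keyspaces is actually implemented; the class and method names are hypothetical, and the sketch leaves out details such as the grace period Cassandra observes before purging tombstones.

```python
TOMBSTONE = object()  # sentinel value that marks a key as deleted


class TinyStore:
    """Toy append-only store made of segments, oldest first."""

    def __init__(self):
        self.segments = [{}]  # the last segment is the one being written

    def put(self, key, value):
        self.segments[-1][key] = value

    def delete(self, key):
        # A delete is just another write: older segments are left untouched.
        self.segments[-1][key] = TOMBSTONE

    def flush(self):
        """Seal the current segment and start a new one."""
        self.segments.append({})

    def get(self, key):
        # Search newest to oldest so the most recent write wins.
        for segment in reversed(self.segments):
            if key in segment:
                value = segment[key]
                return None if value is TOMBSTONE else value
        return None

    def stored_entries(self):
        """Count everything physically stored, including dead data."""
        return sum(len(segment) for segment in self.segments)

    def compact(self):
        """Merge segments, keep the newest value per key, drop deleted keys."""
        merged = {}
        for segment in self.segments:  # oldest to newest: later writes overwrite
            merged.update(segment)
        live = {k: v for k, v in merged.items() if v is not TOMBSTONE}
        self.segments = [live]


store = TinyStore()
store.put("a", 1)
store.put("b", 2)
store.flush()
store.delete("a")

print(store.get("a"))          # None: the tombstone hides the old value
print(store.stored_entries())  # 3: old "a", "b", and the tombstone for "a"
store.compact()
print(store.stored_entries())  # 1: only "b" remains
```

Until compaction runs, the deleted value and its marker both occupy space, which is why operators have to plan for storage headroom and schedule compactions.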
As part of the AWS portfolio, Keyspaces integrates with its core security, identity and compliance services, such as AWS Identity and Access Management (IAM) for access management; Key Management Service (KMS) for encryption at rest; and Amazon CloudWatch for monitoring.
Like DynamoDB, Keyspaces encrypts all data at rest. And like DynamoDB, Aurora, and DocumentDB, Keyspaces automatically maintains three replicas that can be spread across different Availability Zones (AZs) in a region for durability and performance. But there is a subtle difference, since Keyspaces also carries over Apache Cassandra’s multi-master capability, a feature not available in Aurora or DocumentDB. While DynamoDB already has a cross-region multi-master capability called Global Tables, at launch Keyspaces will not have cross-region support. But we wouldn’t be surprised if a Global Tables-like feature materialized for Keyspaces down the road.
So let’s take a look at how the new AWS service stacks up against Apache Cassandra and the data platform that Cassandra is often compared to: DynamoDB.
Comparisons with Apache Cassandra
Since Keyspaces is an AWS implementation of Cassandra, there are some differences from the Apache platform. For example, Apache Cassandra can commit writes to any node, wherever it is located, while Keyspaces can currently write only to nodes within the same region. Another difference is that, at launch, Keyspaces does not support all CQL (Cassandra Query Language) features; AWS states that it omitted CQL features that would not be compatible with serverless operation, along with others it considered “experimental.”
There are other subtle differences in keyspace and table management, system table storage, load balancing, and range deletes, along with differences in best practices for tuning CQL queries and sizing partitions. For example, in Apache Cassandra, partition size best practices are to keep the number of values below 100,000 and the size on disk below 100 Mbytes; Keyspaces, by contrast, places no limit on partition size. AWS does, however, limit rows to a maximum of 1 Mbyte.
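As a rough illustration of how those figures might be applied when sizing a data model, the sketch below checks one partition against them. The function name and the warning labels are hypothetical, and treating 1 Mbyte as 1,000,000 bytes is an assumption; the thresholds themselves are simply the numbers quoted above.

```python
# Figures quoted in the text; 1 Mbyte = 1,000,000 bytes is an assumption.
MBYTE = 1_000_000
CASSANDRA_PARTITION_VALUES_GUIDELINE = 100_000      # best practice, not enforced
CASSANDRA_PARTITION_BYTES_GUIDELINE = 100 * MBYTE   # best practice, not enforced
KEYSPACES_MAX_ROW_BYTES = 1 * MBYTE                 # limit enforced by AWS


def check_partition(row_sizes_bytes, value_count):
    """Return warnings for one partition.

    row_sizes_bytes: size in bytes of each row in the partition
    value_count: total number of values stored in the partition
    """
    warnings = []
    if value_count >= CASSANDRA_PARTITION_VALUES_GUIDELINE:
        warnings.append("cassandra: partition holds 100,000 or more values")
    if sum(row_sizes_bytes) >= CASSANDRA_PARTITION_BYTES_GUIDELINE:
        warnings.append("cassandra: partition is 100 Mbytes or larger")
    if any(size > KEYSPACES_MAX_ROW_BYTES for size in row_sizes_bytes):
        warnings.append("keyspaces: a row exceeds 1 Mbyte")
    return warnings


# A small partition passes every check.
print(check_partition([500] * 10, value_count=50))
# A single 2 Mbyte row is within the Cassandra guidelines but over the row limit.
print(check_partition([2 * MBYTE], value_count=5))
```

The first two checks are guidelines for Apache Cassandra only; per the comparison above, only the row limit applies to Keyspaces.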
Comparisons with DynamoDB
Under the covers, the two are very different databases. DynamoDB follows a simpler key-value schema, whereas Cassandra implements a wide-column model that is more complex and handles partitions differently. As we noted in our commentary after AWS announced Keyspaces at re:Invent, the use cases for both databases (as distributed operational platforms) are similar, and the biggest difference is likely to be one of developer preference.
Initially, DynamoDB was the recommended destination on AWS for distributed NoSQL databases, as it was positioned as a platform capable of handling both key-value and document data. In fact, Cassandra and DynamoDB share a common lineage: the designers of Apache Cassandra applied a number of principles from Amazon’s original Dynamo research paper, and Amazon’s Dynamo and SimpleDB databases were the ancestors of DynamoDB. Since then, AWS has significantly diversified its NoSQL database portfolio with DocumentDB, Neptune, Timestream, ElastiCache, and others to target different use cases and data types.
But Cassandra continued to distinguish itself as a multi-master distributed database, meaning it could accept writes across instances spread across different data centers. While AWS says Cassandra was not the model, a few years back it found that DynamoDB customers were demanding replication across multiple regions, which is how Global Tables originated.
In developing Keyspaces, AWS took some lessons from DynamoDB; besides serverless operation, it customized automated partition management for balancing read and write loads in the new service. There are some features, such as an authentication plug-in for short-term credentials, that AWS has already made available on GitHub. AWS may also contribute a comparable server-side component to the Apache Cassandra project, which would enable customers running the database on EC2 to manage access to their clusters in a similar way.
A bigger stage for Cassandra?
With Keyspaces, Cassandra becomes the latest open source database for which AWS offers a managed service. Despite its barriers to entry, Apache Cassandra has become one of the most popular databases out there, ranked eleventh on DB-Engines. A managed cloud service should expand that audience.
But if it is so popular, what took Cassandra so long to get to the cloud? Look no further than the top five databases on DB-Engines: apart from Oracle and SQL Server, the open source databases MySQL, PostgreSQL, and MongoDB round out the top five. First things first.
Beyond that, the answer to why it took so long is also the answer to why a managed cloud service is so desperately needed: the complexity of the platform and the lack of decent tools (we will probably get complaints about this, but the tools available are not very intuitive). The good news is that introducing a managed cloud service will tackle the infrastructure half of the problem. But the database designer still has to define the data model, something that a managed service cannot automate on its own. There are some good white papers available, and AWS has a NoSQL Workbench tool for DynamoDB that could conceivably be adapted to Cassandra. Ultimately, we would like to see a visual tool that provides a guided approach to developing the schema. That’s the missing link. We hope AWS or DataStax, or preferably both, will step up to the plate there.