How the Apache Cassandra community stopped fighting to build its best release yet

Commentary: In this interview with an Apache Cassandra project leader, learn about the 4.0 beta release’s improved stability, its bug fixes, and the confluence of vendors and users behind it.

Image: iStock/agsandrew

Over the years, the Apache Cassandra community has demonstrated the best and worst of open source collaboration. But a funny thing happened on the way to Cassandra’s 4.0 (beta) release: A sometimes fractious family of contributors came together to deliver something truly exceptional. Already one of the world’s most popular databases (currently ranked #10 on DB-Engines.com), Cassandra promises new levels of stability in its 4.0 beta release while also rediscovering its flair. As Instaclustr CTO Ben Bromhead put it, “I’m an absolute sucker for process and quality improvement and Cassandra 4.0 has this in spades, but the improvements around Netty and Zero Copy Streaming also look super cool.”

To learn more about the release, and why enterprises comfortable in their relational data models should care, I talked to Josh McKenzie, Apache Cassandra Committer and PMC (Project Management Committee) Member.

An open source community grows up and together

Of course, if you’ve been paying attention to the world of databases over the years, you know that data is not sitting comfortably in the tidy rows and columns of relational databases. Modern data often doesn’t fit. Asked about this, McKenzie noted, “We don’t know what the data of tomorrow looks like,” making it critical to rely on open source while also exploring non-relational approaches to data management. 

While Cassandra has long been a popular option with enterprises, for years the community neglected key stability issues. What had once been a strength became a weakness.

But this is also where Cassandra becomes such an interesting success story. For years I’ve argued that unless users of open source contribute back, open source won’t achieve its maximum impact. Vendors are nice, but open source users have unique perspectives on how to improve software. 

In the case of Cassandra, some of its key users include Apple, Netflix, and Instagram, who increased their participation in the project even as some vendors reduced their participation. But the 4.0 release represents a near-perfect confluence of vendors and users coming together to make Cassandra dramatically better, as McKenzie pointed out:

The Cassandra community is incredibly robust at this point. While it’s somewhat bimodal between contributors employed by DataStax and Apple with regards to the lines of code in the 4.0 release, the number of humans involved and contributors scratching their own itch represents the majority of commits on the project. While committers are of course involved in every merge to the code-base (as per the Apache Way), on 60+% of the tickets the other side of that work is someone that’s contributing their time and energy into the project. That kind of diversity is crucial to the long-term resilience of an open-source community and we’re quite happy with how things look on that front coming up to 4.0.

One key area where users, in particular, have contributed is Cassandra stability.

Making Apache Cassandra stable…together

As McKenzie related, Cassandra 4.0’s improved stability comes, in large part, from “a significant amount of real-world workload testing” at big contributors, which replay real use cases through the system to ensure clusters are healthy both in a mixed-version state (i.e., during an upgrade) and post-upgrade. For example, Netflix engineers have done performance testing at scale.

The result? As McKenzie related, the 4.0 release has over 30% more bug fixes and improvements in it than the 3.0 release and “is the best tested, most stable .0 release of Cassandra ever.” The addition of Zero Copy Streaming, mentioned above, means scaling clusters will be up to 5x faster without vnodes (virtual nodes), and recovery from hardware failure should be 5x faster, as well. “We’ve never seen the community really rally around quality and stability in this way,” he said.

At the same time, the addition of full query logging, real-time audit logging, and workload replay adds significant new visibility into what people are doing in the database and how it is being administered. Ultimately, therefore, “4.0 is targeting everyone that runs Cassandra, making all the core basics of how it’s used more robust, visible, and elastic,” said McKenzie. The result? Better than 20% performance improvements in many of the workloads the community has been using to regression test.

Not bad. 

As for what comes next (in Cassandra 5.0), we’re “moving towards a pluggable, modular storage engine and adding new ways to visualize and explore the data in your system, all while keeping the scale and availability guarantees users demand from the database,” McKenzie noted. Furthermore, he stressed, “We’re keenly aware that Cassandra needs to keep evolving to keep up with innovation in other adjacent and complementary spaces and meet users where they are, helping them solve the interesting, fast-paced problems they’re looking to solve in modern, cloud-native application development.”

Because the Cassandra community has learned how to blend vendors and users, it’s well poised to deliver on this promise.

Disclosure: I work for AWS but the views expressed herein are mine, and don’t reflect those of my employer.


How to speed up Apache web loads with mod_pagespeed

If your Apache page load times are slow, speed them up with mod_pagespeed. Jack Wallen shows you how.

Image: Jack Wallen

Are your company websites being served up by the Apache web server? If so, are they loading as quickly as you’d like? If not, you could throw more hardware at the situation or you could use a handy Apache module (created by Google) that automatically optimizes web pages.

That module is called mod_pagespeed, and it compresses things like JavaScript, CSS, JPEG, and PNG files so your pages load faster.

I’m going to show you how to install this module and how to access its web-based admin page.

What you’ll need

The only things you’ll need for this to work are a server running the Apache web server and a user with sudo privileges. I’m going to demonstrate this using the Ubuntu Server 18.04 platform. If you use a different platform, you’ll need to alter the installation process for the module.

How to install Apache

In case you’re coming at this without a server running Apache, let’s install it. On Ubuntu Server, Apache is installed with a single command. Log in to your server and issue the command:

sudo apt-get install apache2 -y

When that command completes, start and enable the server with the commands:

sudo systemctl start apache2
sudo systemctl enable apache2

With Apache up and running, it’s time to install mod_pagespeed.

How to install mod_pagespeed

In order to install mod_pagespeed, you first need to download the necessary .deb file. To do this, issue the command:

wget https://dl-ssl.google.com/dl/linux/direct/mod-pagespeed-stable_current_amd64.deb

When the file download has completed, install the module with the command:

sudo dpkg -i mod-pagespeed-stable_current_amd64.deb

Finally, restart Apache with the command:

sudo systemctl restart apache2
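
Before moving on, it can be worth confirming that Apache actually loaded the module. The following is a minimal sketch, not part of the official install steps. It assumes the package registers itself under the name pagespeed_module and that apache2ctl -M prints one loaded module per line; has_pagespeed_module is just an illustrative helper name.

```shell
# Hypothetical helper: succeeds if a module list read from stdin
# mentions pagespeed_module (assumed module name).
has_pagespeed_module() {
  grep -q 'pagespeed_module'
}

# Intended use on the server (not executed here):
#   sudo apache2ctl -M | has_pagespeed_module && echo "mod_pagespeed is loaded"
```

If the helper reports nothing, re-running the dpkg step and restarting Apache is a reasonable first thing to try.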

How to test the module

With mod_pagespeed installed, let’s make sure it’s running. Issue the command:

sudo curl -D- http://localhost | head

You should see the version of mod_pagespeed printed out in the X-Mod-Pagespeed response header, along with various other bits of information (Figure A).

Figure A

Our mod_pagespeed installation was successful.
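
If you would rather check for the module from a script than read the headers by eye, the version can be pulled out of the response directly. This is a rough sketch that assumes the module announces itself in an X-Mod-Pagespeed header; pagespeed_version is a hypothetical helper name, not something the module ships.

```shell
# Hypothetical helper: read HTTP response headers on stdin and print
# the value of the first X-Mod-Pagespeed header (empty if absent).
pagespeed_version() {
  grep -i '^x-mod-pagespeed:' | head -n 1 | cut -d: -f2- | tr -d ' \r'
}

# Intended use on the server (not executed here):
#   curl -s -D- -o /dev/null http://localhost | pagespeed_version
```

An empty result suggests the module is not rewriting responses for that site, in which case it is worth re-checking the installation steps above.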

How to access the admin panel

The mod_pagespeed module includes a handy admin panel that displays the statistics of the pages served up by Apache. By default, this panel is only accessible from localhost. Seeing as how we’ve installed it on a headless server, we need to make it available from anywhere on our LAN. 

To enable the admin panel, open the configuration file with the command:

sudo nano /etc/apache2/mods-available/pagespeed.conf

Scroll down to the bottom of the file and look for the sections that start with:

<Location /pagespeed_admin>

And:

<Location /pagespeed_global_admin>

In both of those sections, you’ll need to add the following line under Allow from 127.0.0.1:

Allow from all

Once you’ve taken care of that, save and close the file. Restart Apache with the command:

sudo systemctl restart apache2
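
To double-check the edit without scrolling through the whole file again, a small awk filter can list the Allow rules inside each Location block. This is only a sketch: list_allow_rules is a hypothetical helper name, and it assumes the admin sections are written as <Location ...> blocks with one directive per line, as in the stock pagespeed.conf.

```shell
# Hypothetical helper: print every <Location ...> header in the given
# config file, followed by the Allow directives found inside that block.
list_allow_rules() {
  awk '
    /^[ \t]*<Location[ \t]/ { in_loc = 1; sub(/^[ \t]+/, ""); print; next }
    /^[ \t]*<\/Location>/   { in_loc = 0; next }
    in_loc && tolower($1) == "allow" { sub(/^[ \t]+/, ""); print "  " $0 }
  ' "$1"
}

# Intended use on the server (not executed here):
#   list_allow_rules /etc/apache2/mods-available/pagespeed.conf
```

After the change described above, both admin sections should show an Allow from all line in the output.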

Point a web browser to http://SERVER_IP/pagespeed_admin/ (where SERVER_IP is the IP address of the server hosting Apache). You should be presented with the mod_pagespeed admin panel (Figure B).

Figure B

The mod_pagespeed admin panel is ready to view.

You can now monitor the statistics of your Apache sites, optimized with the help of mod_pagespeed.

And that’s all there is to it. Enjoy the newfound speed of your page load times.


Why the Apache Lucene and Solr “divorce” is better for developers and users

Commentary: A decade ago Apache Lucene and Apache Solr merged to improve both projects. The projects recently split for the same reason, which is a really good thing for users of search services.

Image: photo_Pawel, Getty Images/iStockphoto

It’s very possible that you rely on Apache Lucene and Apache Solr every day, whether you’re looking for jobs on LinkedIn, trying to find that “bird-carries-shark” video on Twitter, or looking up random facts on Wikipedia. It’s also very possible that you have no clue how Lucene/Solr work, or how they’re developed. As such, you can be forgiven for not noticing that a few weeks back the Lucene/Solr community voted to break up, breaking Solr out from under Lucene and reversing the merger of the two a decade earlier, which you also likely missed. 

And yet the designation of Solr as a top-level Apache Software Foundation project matters, and not just for the developers who contribute to one or the other (or both). While disentangling the two projects (build infrastructure, source code, etc.) will take time, users will benefit. Here’s how.

Making life easier for the kingmakers

While most people reading this won’t have any familiarity with Lucene, Solr, or Elasticsearch (a distributed search application that relies on Lucene), we use them every day. Lucene is a full-text search engine library, whereas Solr is a full-text search engine web application built on Lucene. One way to think about Lucene and Solr is as a car and its engine. The engine is Lucene; the car is Solr. A wide array of companies (Ford, Salesforce, etc.) use Solr to provide search on their websites without needing to build an application to make use of the Lucene library. Others want to fiddle more with the dials and knobs of Lucene and don’t rely on Solr. 

Regardless, the two projects have been tightly bound since 2010 when the Lucene and Solr project management committees (PMC) voted to merge the two projects because “there was a lot of code duplication and interaction between Solr and Lucene back then,” as Dawid Weiss explained. Keeping the two together has become a burden over time. Solr depends on Lucene, but Lucene doesn’t depend on Solr, and tying Lucene to Solr has, among other things, made it harder to innovate the Lucene code at a pace many of its developers would like. 

The two projects have continued to attract healthy, largely independent development communities, with new feature work happening in one or the other, not both. This divergence isn’t complete, of course. As Mike Sokolov noted, “A substantial number of people commit to both, over time, although most people do not. Also, relatively few commits span both projects. Some do though, and it’s certainly worth considering what the workflow for such changes would be like in the split world.” Even so, forcing them to join at the hip, though it once made sense as a way to retire some technical debt, no longer makes sense. 

None of which would matter to the average user of LinkedIn, except that this separation promises to improve developer productivity for Lucene and Solr. If developers are the new kingmakers, as analyst firm Redmonk is wont to say, then making developers as productive as possible matters a great deal. So how does this split promise to help developers?

First, the split will make development for the respective projects more nimble. According to Weiss:

Precommit/test times. These are crazy high. If we split into two projects we can pretty much cut all of Lucene testing out of Solr (and likewise), making development a bit more fun again.

Build system itself and source release packaging. The current combined codebase is a *beast* to maintain. Working with gradle on both projects at once made me realise how little the two have in common. The code layout, the dependencies, even the workflow of people working on these projects… The build (both ant and gradle) is full of Solr and Lucene-specific exceptions and hooks that could be more elegantly solved if moved to each project independently.

Second, separating the two allows their respective developers to focus on making Lucene (or Solr) as great as possible; so, a developer who makes API changes to Lucene will no longer need to make corresponding changes to Solr. This, in turn, allows both projects to release according to feature readiness, rather than waiting on each other. Given that Lucene tends to move at a fast pace of feature development, it means faster releases and improvements to the search services users depend upon. 

This split won’t make the front page of The New York Times, unfortunately. However, those searching for articles in the Times will benefit. Because, of course, the Times relies on Lucene-powered Elasticsearch for that search functionality. 

Disclosure: I work at AWS, but this article reflects my views, not those of my employer.
