Difficulty: Beginner
Estimated Time: 20 minutes

Are you looking for the best way to import data into your Apache Cassandra™ database? We have your back on this!

There are four common approaches to loading data into Cassandra:

  • Use the CQL COPY command, which is too slow for any significant dataset
  • Use an integrated Apache Spark™ solution like DSE, which is a great solution, but is very difficult if you are using open source Cassandra
  • Write your own custom program, which works, but usually is a lot of unnecessary work
  • Use DSBulk, which is a tool that will handle 99% of your data import tasks

In this scenario, we'll give you a taste of what DSBulk can do by showing you how to:

  • Load a simple csv file
  • Use DSBulk in an ETL pipeline by using STDIN
  • Map file field values to columns
  • Select lines from a file
  • Change the file field delimiters
  • Unload data

We'll get you started, but there many more options you can use with DSBulk. Take a look at the docs and this great blog on DSBulk for the details.

ver 0.005

In this scenario, we learned how to:

  • Load a simple csv file
  • Use DSBulk in an ETL pipeline by using STDIN
  • Map file field values to columns
  • Select lines from a file
  • Change the file field delimiters
  • Unload data

Having completed this scenario successfully, you should have a basic understanding of DSBulk.

Hot Topic: DSBulk

Step 1 of 6

Loading Simple Data

Let's start by downloading DSBulk, expanding it and placing it in our path.

wget https://downloads.datastax.com/dsbulk/dsbulk-1.5.0.tar.gz tar -xvf dsbulk-1.5.0.tar.gz export PATH=/root/dsbulk-1.5.0/bin:$PATH

We have a sample data file we want to load into a table. Take a look at the first few lines of this file.

clear head -n 4 data-files/movies-comma.csv

Notice that the first line is a header line with labels for the fields within the file. Let's use this header line to construct a table to hold this file.

cqlsh -e " CREATE KEYSPACE movies WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }; CREATE TABLE movies.movie_metadata( imdb_id text, overview text, release_date text, title text, average_rating float, PRIMARY KEY(imdb_id));"

Now, we are ready to load data into the table.

dsbulk load -url data-files/movies-comma.csv -k movies -t movie_metadata

Here are the parameters we used:

  • load says we want to move data into a table. There are two other DSBulk commands: unload and count. We will explore these later.
  • -url data-files/movies-comma.csv tells DSBulk where the input is. This can be a file (or files) as in this example, a directory, or even a remote URL.
  • -k movies this is the name of the target keyspace where we will store the data.
  • -t movie_metadata this is the name of the target table where we will store the data.

Let's look at the table.

cqlsh -e "SELECT COUNT(*) FROM movies.movie_metadata;"

NOTE: The previous unconstrained query requires a full table scan. We would not advise using this type of query on production databases because a full table scan can seriously impact performance. However, for educational purposes on small tables, it's appropriate.

When we count the records, we see we now have 100 rows, so DSBulk loaded the data as expected! That was easy!

Of course, DSBulk reported that it loaded the 100 rows. So, now we know we can trust what DSBulk tells us :).

Notice that the names of the fields in the file's header line match the names of the columns in the table. This makes loading straight forward. In later steps, we'll see how to handle the situation where we don't have the header/column alignment.

Now, you try it.

We'll create the table and give you a file, and you figure out the DSBulk command to load the data.

Here's the command to create the table.

clear cqlsh -e "CREATE TABLE movies.movie_profits( year int, movie text, us_gross decimal, budget decimal, run_time int, critic_score int, PRIMARY KEY(movie));"

Here's what the file looks like.

head -n 4 data-files/movie-profits-comma.csv

Construct and execute the DSBulk command to load this data file.

Solution dsbulk load -url data-files/movie-profits-comma.csv -k movies -t movie_profits

Count the lines and compare to what DSBulk reported to make sure DSBulk loaded all the data. Remember, because of the header line you should see 101 lines in the file, but DSBulk only loads 100 rows.

wc -l data-files/movie-profits-comma.csv

Great! You can load simple data into Cassandra with DSBulk!