Insert Spark Dataset[(String, Map[String, String])] to Cassandra Table - scala

I have a Spark Dataset of type Dataset[(String, Map[String, String])].
I need to insert it into a Cassandra table.
The key of each (String, Map[String, String]) tuple will become the primary key of the row in Cassandra.
The Map will go into the same row, in a column named ColumnNameValueMap.
The Dataset can have millions of rows.
I also want to do this in an optimal way (e.g. batch inserts, etc.).
My Cassandra table structure is:
CREATE TABLE SampleKeyspace.CassandraTable (
RowKey text PRIMARY KEY,
ColumnNameValueMap map<text,text>
);
Please suggest how to do this.

All you need is the Spark Cassandra Connector (ideally version 2.5.0, which was just released). It provides read & write support for datasets, so in your case it will be just:
import org.apache.spark.sql.cassandra._
your_data.write.cassandraFormat("CassandraTable", "SampleKeyspace").mode("append").save()
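Since the Dataset in the question is a tuple Dataset, its columns are named _1 and _2 by default, so they need to be renamed to match the table's columns first. A minimal sketch (assuming ds is the Dataset[(String, Map[String, String])] from the question):
import org.apache.spark.sql.cassandra._

// Rename the tuple fields so they match the Cassandra column names
val df = ds.toDF("RowKey", "ColumnNameValueMap")

df.write
  .cassandraFormat("CassandraTable", "SampleKeyspace")
  .mode("append")
  .save()
The connector handles batching and parallelism of the writes itself; it can be tuned further through the spark.cassandra.output.* parameters.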
If your table doesn't exist yet, you can create it based on the structure of the dataset itself - there are two functions: createCassandraTable & createCassandraTableEx. It's better to use the second, as it provides more control over table creation.
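For example, a hedged sketch of creating the table from the dataset's schema before the first write (assuming the 2.5.x API and the renamed df from the sketch above):
import com.datastax.spark.connector._

// Create SampleKeyspace.CassandraTable from the DataFrame schema,
// with RowKey as the partition key (no clustering columns needed here)
df.createCassandraTable(
  "SampleKeyspace",
  "CassandraTable",
  partitionKeyColumns = Some(Seq("RowKey"))
)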
P.S. You can find more about the 2.5.0 release in the following blog post.

Related

pyspark - insert generated primary key in dataframe

I have a dataframe, and for each row I want to insert it into a Postgres database and get the generated primary key back into the dataframe. I can't find a good way to do this.
I tried with an RDD but it doesn't work (pg8000 get inserted id into dataframe).
I think it is possible with this process:
loop on dataframe.collect() in order to process the SQL insert
make a SQL select into a second dataframe
join the first dataframe with the second
But I think this is not optimal.
Do you have any idea?
I'm using pyspark in aws glue job. Thanks.
The only things that you can optimize are the data insertion and connectivity.
As you mentioned, you have two operations in total: one is inserting the data and the other is collecting the inserted data. Based on my understanding, neither Spark JDBC nor a Python connector like psycopg2 will return the primary keys of the data you inserted, so you need to do that part separately.
Back to your question:
You don't need a for loop to do the inserting, or .collect() to convert back to Python objects. You can use the Spark JDBC data source with the PostgreSQL driver to write the dataframe directly:
df\
.write.mode('append').format('jdbc')\
.option('driver', 'org.postgresql.Driver')\
.option('url', url)\
.option('dbtable', table_name)\
.option('user', user)\
.option('password', password)\
.save()
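Here url is a standard JDBC connection string (e.g. jdbc:postgresql://<host>:5432/<dbname>), and the PostgreSQL JDBC driver jar has to be available to the Glue job.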

How did spark RDD map to Cassandra table?

I am new to Spark, and recently I saw some code saving data in RDD form to a Cassandra table. But I am not able to figure out how it does the column mapping. It neither uses a case class nor specifies any column names, as in the code below:
rdd
.map(x => (x._1, x._2, x._3)) // x is a List here
.repartitionByCassandraReplica(keyspace, tableName)
.saveToCassandra(keyspace, tableName)
Since x here is simply an element of a List[(Int, String, Int)], which is not a case class, there is no name mapping to the Cassandra table. So is there a definite column order in the Cassandra table that matches the order of the fields we specify in the code?
This mapping relies on the order of columns in the Cassandra table definition, which is as follows:
Partition key columns, in the specified order
Clustering columns, in the specified order
The rest of the columns, sorted alphabetically by name
The Spark Cassandra Connector expects these columns from the table definition to be matched to the order of the fields in the Scala tuple. You can see that in the source code of the TupleColumnMapper class.
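For example, an illustrative sketch (the keyspace, table and column names are made up, and sc is assumed to be the SparkContext):
import com.datastax.spark.connector._

// Hypothetical table:
//   CREATE TABLE ks.events (
//     user_id text, event_time timestamp, payload text, category text,
//     PRIMARY KEY (user_id, event_time)
//   );
// Expected tuple order: partition key (user_id), clustering column (event_time),
// then the remaining columns alphabetically (category, payload).
val rows = sc.parallelize(Seq(
  ("user-1", new java.util.Date(), "click", "home-page")
))
rows.saveToCassandra("ks", "events")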

How to create an empty dataframe using a Hive external table?

I am using the code below to create a dataframe (Spark Scala) from a Hive external table, but the dataframe also loads the data into it. I need an empty DF created using the Hive external table's schema. I am using Spark Scala for this.
val table1 = sqlContext.table("db.table")
How can I create an empty dataframe using the Hive external table?
You can just do:
val table1 = sqlContext.table("db.table").limit(0)
This will give you an empty df with only the schema. Because of lazy evaluation, it also does not take longer than just loading the schema.
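If you prefer to build the empty DataFrame explicitly from the table's schema, a roughly equivalent sketch:
import org.apache.spark.sql.Row

// Take only the schema from the Hive table and pair it with an empty RDD
val schema = sqlContext.table("db.table").schema
val emptyDF = sqlContext.createDataFrame(sqlContext.sparkContext.emptyRDD[Row], schema)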

How to insert rows into Cassandra if they don't exist using the spark-cassandra connector?

I want to write to Cassandra from a data frame, and I want to exclude rows that already exist (i.e. by primary key - although upserts happen, I don't want to change the other columns), using the spark-cassandra connector. Is there a way we can do that?
Thanks.!
You can use the ifNotExists WriteConf option, which was introduced in this PR.
It works like so:
import com.datastax.spark.connector._
import com.datastax.spark.connector.writer.WriteConf
val writeConf = WriteConf(ifNotExists = true)
rdd.saveToCassandra(keyspaceName, tableName, writeConf = writeConf)
You can do
sparkConf.set("spark.cassandra.output.ifNotExists", "true")
With this config:
if the partition key and clustering columns match a row that already exists in Cassandra, the write is ignored
otherwise, the write is performed
https://docs.datastax.com/en/cql/3.1/cql/cql_reference/insert_r.html#reference_ds_gp2_1jp_xj__if-not-exists
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
Srinu, this all boils down to "read before write" no matter whether you are using Spark or not.
But there is IF NOT EXISTS clause:
If the column exists, it is updated. The row is created if none exists. Use IF NOT EXISTS to perform the insertion only if the row does not already exist. Using IF NOT EXISTS incurs a performance hit associated with using Paxos internally. For information about Paxos, see the Cassandra 2.1 documentation or the Cassandra 2.0 documentation.
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/insert_r.html

How do I create a Hive external table from Avro files written using Databricks?

The code below is how the data was written into HDFS using Scala. What is the HQL syntax to create a Hive table to query this data?
import com.databricks.spark.avro._
val path = "/user/myself/avrodata"
dataFrame.write.avro(path)
The examples I find require providing an avro.schema.literal describing the schema or an avro.schema.url pointing to the actual Avro schema.
In the spark-shell all I would need to do to read this is:
scala> import com.databricks.spark.avro._
scala> val df = sqlContext.read.avro("/user/myself/avrodata")
scala> df.show()
So I cheated to get this to work. Basically, I created a temporary table and used HQL to create and populate a table from the temp table. This method uses the metadata from the temporary table to create the Avro target table that I wanted to create and populate. If the data frame can create a temporary table from its schema, why can it not save the table as Avro?
dataFrame.registerTempTable("my_tmp_table")
sqlContext.sql(s"create table ${schema}.${tableName} stored as avro as select * from my_tmp_table")