How does a Spark RDD map to a Cassandra table? - scala

I am new to Spark, and recently I saw some code that saves data from an RDD to a Cassandra table. But I cannot figure out how it does the column mapping. It neither uses a case class nor specifies any column names, as in the code below:
rdd
.map(x => (x._1, x._2, x._3)) // x is a List here
.repartitionByCassandraReplica(keyspace, tableName)
.saveToCassandra(keyspace, tableName)
Since x here is simply a List[(Int, String, Int)], not a case class, there is no name-based mapping to the Cassandra table. So is there a definite order of columns in the Cassandra table that matches the order of the values we specify in the code?

This mapping relies on the order of columns in the Cassandra table definition, which is as follows:
Partition key columns, in the specified order
Clustering columns, in the specified order
The remaining columns, sorted alphabetically by name
The Spark Cassandra Connector expects these columns from the table definition, in that order, to match the order of the fields in the Scala tuple. You can see this in the source code of the TupleColumnMapper class.
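If you don't want to depend on that implicit ordering, you can also spell the mapping out explicitly with SomeColumns. A minimal sketch (the column names id, name and score are made up for illustration, they are not from your table):
import com.datastax.spark.connector._

// Hypothetical table: CREATE TABLE ks.tbl (id int, name text, score int, PRIMARY KEY (id));
// SomeColumns makes the tuple-to-column mapping explicit instead of relying on the
// partition key / clustering / alphabetical ordering described above.
rdd
  .map(x => (x._1, x._2, x._3))
  .repartitionByCassandraReplica(keyspace, tableName)
  .saveToCassandra(keyspace, tableName, SomeColumns("id", "name", "score"))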

Related

Why is Pyspark/Snowflake insensitive to underscores in column names?

Working in Python 3.7 and pyspark 3.2.0, we're writing out a PySpark dataframe to a Snowflake table using the following statement, where mode usually equals 'append'.
df.write \
.format('snowflake') \
.option('dbtable', table_name) \
.options(**sf_configs) \
.mode(mode)\
.save()
We've made the surprising discovery that this write can be insensitive to underscores in column names -- specifically, a dataframe with the column "RUN_ID" is successfully written out to a table with the column "RUNID" in Snowflake, with the columns mapped accordingly. We're curious why this is so (I'm particularly wondering if the pathway runs through a LIKE statement somewhere, or if there's something interesting in the Snowflake table definition) and we're looking for documentation of this behavior (assuming it's a feature, not a bug).
According to the docs, the Snowflake connector defaults to mapping by column order rather than by name; see the column_mapping parameter.
The connector must map columns from the Spark data frame to the Snowflake table. This can be done based on column names (regardless of order), or based on column order (i.e. the first column in the data frame is mapped to the first column in the table, regardless of column name).
By default, the mapping is done based on order. You can override that by setting this parameter to name, which tells the connector to map columns based on column names. (The name mapping is case-insensitive.)
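So if you want name-based matching instead, set that parameter explicitly on the write. A sketch in Scala to match the rest of this thread (the same option works from PySpark; sfConfigs and tableName stand in for your connection options and target table):
// Sketch: switch the Snowflake Spark connector to name-based column mapping.
// sfConfigs is a placeholder Map[String, String] of connection options.
df.write
  .format("snowflake")
  .options(sfConfigs)
  .option("dbtable", tableName)
  .option("column_mapping", "name") // map by column name (case-insensitive) instead of by position
  .mode("append")
  .save()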

pyspark - insert generated primary key in dataframe

I have a dataframe, and for each row I want to insert it into a Postgres database and get the generated primary key back into the dataframe. I can't find a good way to do this.
I'm trying with an RDD but it doesn't work (pg8000 get inserted id into dataframe).
I think it is possible with this process:
loop on dataframe.collect() in order to process the sql insert
make a sql select for a second dataframe
join the first dataframe with the second
But I think this is not optimized.
Do you have any ideas?
I'm using PySpark in an AWS Glue job. Thanks.
The only things you can optimize are the data insertion and the connectivity.
As you mentioned, you have two operations in total: one is inserting the data and the other is collecting the inserted data. Based on my understanding, neither Spark JDBC nor a Python connector like psycopg2 will return the primary key of the data you inserted. Therefore, you need to do it separately (see the read-back sketch after the code below).
Back to your question:
You don't need a for loop to do the inserting, or .collect() to convert back to Python objects. You can use the Spark PostgreSQL JDBC data source to do it directly with the dataframe:
df\
.write.mode('append').format('jdbc')\
.option('driver', 'org.postgresql.Driver')\
.option('url', url)\
.option('dbtable', table_name)\
.option('user', user)\
.option('password', password)\
.save()
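For the second half, collecting the generated keys, the usual pattern is a separate read of the table followed by a join back onto the original dataframe. A rough sketch in Scala to match the rest of this thread (the same options work from PySpark; id and natural_key are hypothetical column names, you need some business key to join on):
// Sketch: after the append above, read the table back to pick up the generated primary keys,
// then join them onto the original dataframe via a hypothetical natural/business key column.
val withIds = spark.read
  .format("jdbc")
  .option("driver", "org.postgresql.Driver")
  .option("url", url)
  .option("dbtable", table_name)
  .option("user", user)
  .option("password", password)
  .load()
  .select("id", "natural_key") // "id" = generated primary key (assumed name)

val result = df.join(withIds, Seq("natural_key"), "left")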

Insert Spark Dataset[(String, Map[String, String])] to Cassandra Table

I have a Spark Dataset of type Dataset[(String, Map[String, String])].
I have to insert the same into a Cassandra table.
Here, the key in the Dataset[(String, Map[String, String])] will become the primary key of the row in Cassandra.
The Map in the Dataset[(String, Map[String, String])] will go into the same row, in a column ColumnNameValueMap.
The Dataset can have millions of rows.
I also want to do it in an optimal way (e.g. batch inserts, etc.).
My Cassandra table structure is:
CREATE TABLE SampleKeyspace.CassandraTable (
RowKey text PRIMARY KEY,
ColumnNameValueMap map<text,text>
);
Please suggest how to do the same.
All you need is the Spark Cassandra Connector (better to take version 2.5.0, which was just released). It provides read & write support for Datasets, so in your case it will be just:
import org.apache.spark.sql.cassandra._
your_data.write.cassandraFormat("CassandraTable", "SampleKeyspace").mode("append").save()
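One detail this write relies on: the Dataset's column names must match the table's columns, and a Dataset of tuples defaults to _1 / _2. A minimal sketch of the rename (assuming the unquoted identifiers from your CREATE TABLE, which Cassandra folds to lower case):
// Sketch: rename the tuple's default columns (_1, _2) to the Cassandra column names
// before the write above; unquoted CQL identifiers are stored in lower case.
val toWrite = your_data.toDF("rowkey", "columnnamevaluemap")
After that, the cassandraFormat write above applies to toWrite unchanged.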
If your table doesn't exist yet, you can create it based on the structure of the Dataset itself - there are two functions: createCassandraTable & createCassandraTableEx. It's better to use the second, as it provides more control over table creation.
P.S. You can find more about the 2.5.0 release in the following blog post.

How to create an empty dataframe using a Hive external table?

I am using the code below to create a dataframe in Spark Scala from a Hive external table, but the dataframe also loads the data. I need an empty DF created from the Hive external table's schema.
val table1 = sqlContext.table("db.table")
How can I create an empty dataframe using a Hive external table?
You can just do:
val table1 = sqlContext.table("db.table").limit(0)
This will give you an empty df with only the schema. Because of lazy evaluation, it also does not take longer than just loading the schema.
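A quick way to convince yourself (names from the question; just a sketch):
val table1 = sqlContext.table("db.table").limit(0)
table1.printSchema() // schema comes from the Hive external table
println(table1.count()) // 0 -- no rows are read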

Spark DataFrame duplicate column names while using mergeSchema

I have a huge Spark DataFrame which I create using the following statement
val df = sqlContext.read.option("mergeSchema", "true").parquet("parquet/partitions/path")
Now, when I try to do a column rename or select operation on the above DataFrame, it fails saying ambiguous columns were found, with the following exception:
org.apache.spark.sql.AnalysisException: Reference 'Product_Type' is ambiguous, could be Product_Type#13, Product_Type#235
Looking at the columns, I found there are two, Product_Type and Product_type, which seem to be the same column with a one-letter case difference, created by schema merges over time. I don't mind keeping the duplicate columns, but Spark's sqlContext for some reason doesn't like it.
I believe the spark.sql.caseSensitive config is true by default, so I don't know why it fails. I am using Spark 1.5.2. I am new to Spark.
By default, the spark.sql.caseSensitive property is false, so before your rename or select statement you should set the property to true:
sqlContext.sql("set spark.sql.caseSensitive=true")