How to load rows from a Cassandra table as a DataFrame in Spark? - scala

I can load a whole Cassandra table as a DataFrame as below
val tableDf = sparkSession.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> table, "keyspace" -> keyspace))
.load()
But I couldn't find a way to fetch rows by primary key, something like
select * from table where key = ''
Is there a way to do this?

val tableDf = sparkSession.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> table, "keyspace" -> keyspace))
.load()
.filter("key='YOUR_KEY'")
With this, the spark-cassandra-connector will use predicate pushdown and fetch only the required data.
Dataframes and Predicate pushdown
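To verify that the filter is actually pushed down to Cassandra rather than applied in Spark, you can inspect the physical plan; a minimal sketch (YOUR_KEY is a placeholder, and the exact plan output depends on the connector version):
val keyedDf = sparkSession.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> table, "keyspace" -> keyspace))
  .load()
  .filter("key = 'YOUR_KEY'")
// the pushed-down predicate on key should show up in the scan node of the plan
keyedDf.explain()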

The Java way to do the same is:
SparkSession sparkSession = SparkSession.builder().appName("Spark Sql Job").master("local[*]")
.config("spark.sql.warehouse.dir", "file:///c:/tmp/spark-warehouse")
.config("spark.cassandra.connection.host", "localhost")
.config("spark.cassandra.connection.port", "9042").getOrCreate();
SQLContext sqlCtx = sparkSession.sqlContext();
Dataset<Row> rowsDataset = sqlCtx.read().format("org.apache.spark.sql.cassandra").option("keyspace", "myschema")
.option("table", "mytable").load();
rowsDataset.show();
It should be the same for Scala, I believe.

Related

Group by and count on all columns of a Spark DataFrame

I want to perform a group by on each column of the DataFrame using Spark SQL. The DataFrame will have approx. 1000 columns.
I have tried iterating over all the columns in the DataFrame and performing groupBy on each column, but the program takes more than 1.5 hours to execute.
val df = sqlContext
.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> "exp", "keyspace" -> "testdata"))
.load()
val groupedData = df.columns.map(c => df.groupBy(c).count().take(10).toList)
println("Printing grouped data: " + groupedData.toList)
If I have columns in the DataFrame, for example Name and Amount, then the output should be like:
GroupBy on column Name:
Name Count
Jon 2
Ram 5
David 3
GroupBy on column Amount:
Amount Count
1000 4
2525 3
3000 3
I want the group by result for each column.
The only way I can see a speed-up here is to cache the df straight after reading it.
Unfortunately, each computation is independent and has to be done; there is no workaround.
Something like this can speed things up a little, but not by much:
val df = sqlContext
.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> "exp", "keyspace" -> "testdata"))
.load()
.cache()
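With the cached DataFrame, the per-column loop from the question stays the same and the Cassandra table only has to be read once; a minimal sketch (keeping the take(10) from the question):
val groupedData = df.columns.map(c => (c, df.groupBy(c).count().take(10).toList))
groupedData.foreach { case (c, rows) =>
  // print the first 10 (value, count) rows for each column
  println(s"GroupBy on column $c:")
  rows.foreach(println)
}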

Why could a streaming join of queries over Kafka topics take so long?

I'm using Spark Structured Streaming and joining two streams from Kafka topics.
I noticed that the streaming query takes around 15 seconds for each record. In the screenshot below, stage id 2 takes 15s. Why could that be?
The code is as follows:
val kafkaTopic1 = "demo2"
val kafkaTopic2 = "demo3"
val bootstrapServer = "localhost:9092"
val spark = SparkSession
.builder
.master("local")
.getOrCreate
import spark.implicits._
val df1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServer)
.option("subscribe", kafkaTopic1)
.option("failOnDataLoss", false)
.load
val df2 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServer)
.option("subscribe", kafkaTopic2)
.option("failOnDataLoss", false)
.load
val order_details = df1
.withColumn(...)
.select(...)
val invoice_details = df2
.withColumn(...)
.where(...)
order_details
.join(invoice_details)
.where(order_details.col("s_order_id") === invoice_details.col("order_id"))
.select(...)
.writeStream
.format("console")
.option("truncate", false)
.start
.awaitTermination()
Code-wise everything works fine. The only problem is the time to join the two streams. How could this query be optimised?
It's quite possible that the execution time is unsatisfactory because of the master URL, i.e. .master("local"), which gives Spark only a single core. Change it to local[*] at the very least and you should find the join faster.
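For reference, a minimal sketch of that change (everything else in the question's code stays the same):
val spark = SparkSession
  .builder
  .master("local[*]") // use all available cores instead of a single one
  .getOrCreate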

Is there any way to load a Spark SQL result set (DataFrame) back into Cassandra?

I have queried Cassandra using Spark in Scala and have the result as a DataFrame.
Is there any way to write this result back to Cassandra tables?
df.write
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> "t_payu_df", "keyspace" -> "ks_payu"))
.save()
This will work.
You can also specify a SaveMode (Overwrite, Append, ErrorIfExists).
Example with SaveMode:
df.write
.format("org.apache.spark.sql.cassandra")
.mode(SaveMode.Overwrite)
.options(Map( "table" -> "t_payu_df", "keyspace" -> "ks_payu"))
.save()
For more details, see the DataFrame documentation.
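Note that SaveMode lives in org.apache.spark.sql, so it needs to be imported; a minimal sketch appending to the same table instead of overwriting it:
import org.apache.spark.sql.SaveMode

df.write
  .format("org.apache.spark.sql.cassandra")
  .mode(SaveMode.Append) // add rows, keep the existing data
  .options(Map("table" -> "t_payu_df", "keyspace" -> "ks_payu"))
  .save()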

SparkSQL - Read parquet file directly

I am migrating from Impala to SparkSQL, using the following code to read a table:
my_data = sqlContext.read.parquet('hdfs://my_hdfs_path/my_db.db/my_table')
How do I invoke SparkSQL on the DataFrame above, so I can run something like:
'select col_A, col_B from my_table'
After creating a DataFrame from the parquet file, you have to register it as a temp table to run SQL queries on it.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.parquet("src/main/resources/peopleTwo.parquet")
df.printSchema
// after registering it as a table you will be able to run SQL queries on it
df.registerTempTable("people")
sqlContext.sql("select * from people").collect.foreach(println)
With plain SQL
JSON, ORC, Parquet, and CSV files can be queried without first registering them as a table or DataFrame.
// This is Spark 2.x code; you can do the same with sqlContext as well
val spark: SparkSession = SparkSession.builder.master("set_the_master").getOrCreate
spark.sql("select col_A, col_B from parquet.`hdfs://my_hdfs_path/my_db.db/my_table`")
.show()
Suppose that you have the parquet file ventas4 in HDFS:
hdfs://localhost:9000/sistgestion/sql/ventas4
In this case, the steps are:
Load the SQL context:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Read the parquet File:
val ventas=sqlContext.read.parquet("hdfs://localhost:9000/sistgestion/sql/ventas4")
Register a temporary table:
ventas.registerTempTable("ventas")
Execute the query (you can use toJSON to get the rows as JSON, or collect() to get them as Rows):
sqlContext.sql("select * from ventas").toJSON.foreach(println(_))
sqlContext.sql("select * from ventas").collect().foreach(println(_))
Use the following code in IntelliJ:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

def groupPlaylistIds(): Unit = {
  Logger.getLogger("org").setLevel(Level.ERROR)
  val spark = SparkSession.builder.appName("FollowCount")
    .master("local[*]")
    .getOrCreate()
  val sc = spark.sqlContext
  val d = sc.read.format("parquet").load("/Users/CCC/Downloads/pq/file1.parquet")
  d.printSchema()
  // keep only rows where col1 is not just a dash
  val d1 = d.select("col1").filter(col("col1") =!= "-")
  val d2 = d1.filter(col("col1").startsWith("searchcriteria"))
  d2.groupBy("col1").count().sort(col("count").desc).show(100, false)
}

PySpark - read recursive Hive table

I have a Hive table that has multiple sub-directories in HDFS, something like:
/hdfs_dir/my_table_dir/my_table_sub_dir1
/hdfs_dir/my_table_dir/my_table_sub_dir2
...
Normally I set the following parameters before I run a Hive script:
set hive.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;
set hive.supports.subdirectories=true;
set mapred.input.dir.recursive=true;
select * from my_db.my_table;
I'm trying to do the same using PySpark,
conf = (SparkConf().setAppName("My App")
...
.set("hive.input.dir.recursive", "true")
.set("hive.mapred.supports.subdirectories", "true")
.set("hive.supports.subdirectories", "true")
.set("mapred.input.dir.recursive", "true"))
sc = SparkContext(conf = conf)
sqlContext = HiveContext(sc)
my_table = sqlContext.sql("select * from my_db.my_table")
and end up with an error like:
java.io.IOException: Not a file: hdfs://hdfs_dir/my_table_dir/my_table_sub_dir1
What's the correct way to read a Hive table with sub-directories in Spark?
What I have found is that these values must be prefixed with spark., as in:
.set("spark.hive.mapred.supports.subdirectories","true")
.set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive","true")
Try setting them through sqlContext.sql() prior to executing the query:
sqlContext.sql("SET hive.mapred.supports.subdirectories=true")
sqlContext.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")
my_table = sqlContext.sql("select * from my_db.my_table")
Try setting them through the SparkSession builder before executing the query:
sparkSession = (SparkSession
.builder
.appName('USS - Unified Scheme of Sells')
.config("hive.metastore.uris", "thrift://probighhwm001:9083", conf=SparkConf())
.config("hive.input.dir.recursive", "true")
.config("hive.mapred.supports.subdirectories", "true")
.config("hive.supports.subdirectories", "true")
.config("mapred.input.dir.recursive", "true")
.enableHiveSupport()
.getOrCreate()
)