Please see the code below and let me know where I am going wrong.
Using:
DSE Version - 5.1.0
Connected to Test Cluster at 172.31.16.45:9042.
[cqlsh 5.0.1 | Cassandra 3.10.0.1652 | DSE 5.1.0 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
Thanks
Cassandra Table :
cqlsh:tdata> select * from map;
 sno | name
-----+------
   1 |  One
   2 |  Two
-------------------------------------------
scala> :showSchema tdata
========================================
Keyspace: tdata
========================================
Table: map
----------------------------------------
- sno : Int (partition key column)
- name : String
scala> val rdd = sc.cassandraTable("tdata", "map")
scala> rdd.foreach(println)
I am not getting anything here, not even an error.
You have hit a very common Spark issue. Your println code is being executed on the remote executor JVMs, so the output goes to the STDOUT of each executor JVM process, not to your driver console. If you want to bring the data back to the driver JVM before printing, you need a collect call.
rdd
.collect //Change from RDD to local collection
.foreach(println)
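If the RDD is large, collect will pull every row into the driver's memory. A hedged alternative for a quick look is take, which only brings back a bounded sample (the count of 10 below is arbitrary):

rdd
  .take(10)          // fetch at most 10 rows to the driver
  .foreach(println)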
I have the following expression,
val pageViews = spark.sql(
s"""
|SELECT
| proposal,
| MIN(timestamp) AS timestamp,
| MAX(page_view_after) AS page_view_after
|FROM page_views
|GROUP BY proposalId
|""".stripMargin
).createOrReplaceTempView("page_views")
I want to convert it into one that uses the Dataset API:
val pageViews = pageViews.selectExpr("proposal", "MIN(timestamp) AS timestamp", "MAX(page_view_after) AS page_view_after").groupBy("proposal")
The problem is I can't call createOrReplaceTempView on this one - the build fails.
My question is how do I convert the first one into the second one and create a TempView out of that?
You can get rid of the SQL expression altogether by using Spark SQL's functions
import org.apache.spark.sql.functions._
as below
pageViews
  .groupBy("proposal")
  .agg(min("timestamp").as("timestamp"), max("page_view_after").as("page_view_after"))
Assuming you have a DataFrame available with the name pageViews -
Use -
pageViews
.groupBy("proposal")
.agg(expr("min(timestamp) AS timestamp"), expr("max(page_view_after) AS page_view_after"))
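With either aggregation above, the temp-view part of the question can then be handled by calling createOrReplaceTempView on the resulting DataFrame. A minimal sketch, assuming pageViews is a DataFrame and using an illustrative view name:

import org.apache.spark.sql.functions._

val aggregated = pageViews
  .groupBy("proposal")
  .agg(min("timestamp").as("timestamp"), max("page_view_after").as("page_view_after"))

aggregated.createOrReplaceTempView("page_views_agg")  // now queryable via spark.sql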
I have an issue with inserting a Spark DataFrame into a Hive table. Can anyone please help me out? HDP version 3.1, Spark version 2.3. Thanks in advance.
// ORIGINAL CODE PART
import org.apache.spark.SparkContext
import com.hortonworks.spark.sql.hive.llap.HiveWarehouseSessionImpl
import org.apache.spark.sql.DataFrame
import com.hortonworks.hwc.HiveWarehouseSession
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
val hive = com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder.session(spark).build()
/*
Some transformation operations happened and the output of the transformation is stored in val result
*/
val result = {
num_records
.union(df.transform(profile(heatmap_cols2type)))
}
result.createOrReplaceTempView("out_temp"); //Create tempview
scala> result.show()
+-----+--------------------+-----------+------------------+------------+-------------------+
| type| column| field| value| order| date|
+-----+--------------------+-----------+------------------+------------+-------------------+
|TOTAL| all|num_records| 737| 0|2019-12-05 18:10:12|
| NUM|available_points_...| present| 737| 0|2019-12-05 18:10:12|
hive.setDatabase("EXAMPLE_DB")
hive.createTable("EXAMPLE_TABLE").ifNotExists().column("`type`", "String").column("`column`", "String").column("`field`", "String").column("`value`","String").column("`order`", "bigint").column("`date`", "TIMESTAMP").create()
hive.executeUpdate("INSERT INTO TABLE EXAMPLE_DB.EXAMPLE_TABLE SELECT * FROM out_temp");
-----ERROR of Original code----------------
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10001]: Line 1:86 Table not found 'out_temp'
What I tried as an alternative (since Hive and Spark use independent catalogs, per the HWC documentation on write operations):
spark.sql("SELECT type, column, field, value, order, date FROM out_temp").write.format("HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR").option("table", "wellington_profile").save()
-------ERROR of Alternative Step----------------
java.lang.ClassNotFoundException: Failed to find data source: HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:639)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:241)
... 58 elided
Caused by: java.lang.ClassNotFoundException: HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR.DefaultSource
My question is:
Instead of saving out_temp as a temp view in Spark, is there any way to directly create the table in Hive?
Is there any way to insert into a Hive table from a Spark DataFrame?
Thank you everyone for your time!
result.write.save("example_table.parquet")

import org.apache.spark.sql.SaveMode
result.write.mode(SaveMode.Overwrite).saveAsTable("EXAMPLE_TABLE")
You can read about this in more detail here.
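As a side note on the alternative attempt above: the connector name was passed as a quoted string, which is why the ClassNotFoundException shows HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR verbatim. A hedged sketch of the HWC write path, reusing the hive session and names from the original code (save mode and database handling may need adjusting for your HWC version):

import com.hortonworks.hwc.HiveWarehouseSession

// HIVE_WAREHOUSE_CONNECTOR is a constant, not a string literal
result.write
  .format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
  .option("table", "EXAMPLE_TABLE")
  .save()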
I am trying to save a DataFrame as an external table which will be queried both with Spark and possibly with Hive, but somehow I cannot query or see any data with Hive. It works in Spark.
Here is how to reproduce the problem:
scala> println(spark.conf.get("spark.sql.catalogImplementation"))
hive
scala> spark.conf.set("hive.exec.dynamic.partition", "true")
scala> spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
scala> spark.conf.set("spark.sql.sources.bucketing.enabled", true)
scala> spark.conf.set("hive.enforce.bucketing","true")
scala> spark.conf.set("optimize.sort.dynamic.partitionining","true")
scala> spark.conf.set("hive.vectorized.execution.enabled","true")
scala> spark.conf.set("hive.enforce.sorting","true")
scala> spark.conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
scala> spark.conf.set("hive.metastore.uris", "thrift://localhost:9083")
scala> var df = spark.range(20).withColumn("random", round(rand()*90))
df: org.apache.spark.sql.DataFrame = [id: bigint, random: double]
scala> df.head
res19: org.apache.spark.sql.Row = [0,46.0]
scala> df.repartition(10, col("random")).write.mode("overwrite").option("compression", "snappy").option("path", "s3a://company-bucket/dev/hive_confs/").format("orc").bucketBy(10, "random").sortBy("random").saveAsTable("hive_random")
19/08/01 19:26:55 WARN HiveExternalCatalog: Persisting bucketed data source table `default`.`hive_random` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
Here is how I query in hive:
Beeline version 2.3.4-amzn-2 by Apache Hive
0: jdbc:hive2://localhost:10000/default> select * from hive_random;
+------------------+
| hive_random.col |
+------------------+
+------------------+
No rows selected (0.213 seconds)
But it works fine in spark:
scala> spark.sql("SELECT * FROM hive_random").show
+---+------+
| id|random|
+---+------+
| 3| 13.0|
| 15| 13.0|
...
| 8| 46.0|
| 9| 65.0|
+---+------+
There is a warning after your saveAsTable call, and that's where the hint lies:
'Persisting bucketed data source table default.hive_random into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.'
The reason is that saveAsTable creates RDD partitions but not Hive partitions; the workaround is to create the table via HQL before calling DataFrame.saveAsTable.
I would suggest trying a couple of things. First, try setting the Hive execution engine to use Spark.
set hive.execution.engine=spark;
Second, try to create the external table in the metastore and then save the data to that table, as sketched below.
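A minimal sketch of that second suggestion, reusing the schema and S3 path from the question (the DDL is issued through spark.sql here, though beeline works just as well; treat it as illustrative):

// Create a Hive-native external table up front...
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS hive_random (id BIGINT, random DOUBLE)
  STORED AS ORC
  LOCATION 's3a://company-bucket/dev/hive_confs/'
""")

// ...then append into it; insertInto targets the existing Hive definition
// instead of persisting Spark-specific table metadata.
df.write.mode("append").insertInto("hive_random")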
The semantics of bucketed tables in Spark and Hive are different.
The documentation has details of the differences in semantics.
It states that:
Data is written to bucketed tables but the output does not adhere with expected
bucketing spec. This leads to incorrect results when one tries to consume the
Spark written bucketed table from Hive.
Workaround: If reading from both engines is the requirement, writes need to happen from Hive
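If bucketing from the Spark side is not a hard requirement, another option worth trying (hedged, based on the warning text itself) is to drop bucketBy/sortBy so that the metadata Spark persists stays Hive-compatible:

// Same write as in the question, minus the bucketing clauses.
df.repartition(10, col("random"))
  .write
  .mode("overwrite")
  .option("compression", "snappy")
  .option("path", "s3a://company-bucket/dev/hive_confs/")
  .format("orc")
  .saveAsTable("hive_random")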
I am learning Scala now and noticed something I don't understand. I have a result generated via sqlContext from a registered temp table, backed by a DataFrame derived from an RDD; the RDD comes from an HDFS file exported from MySQL.
Raw hdfs data can be retrieved from here (2MB):
https://github.com/mdivk/175Scala/tree/master/data/part-00000
Here is the query used in MySQL:
select avg(order_item_subtotal) from order_items;
+--------------------------+
| avg(order_item_subtotal) |
+--------------------------+
| 199.32066922046081 |
+--------------------------+
In Scala, via sqlContext:

scala> val res = sqlContext.sql("select avg(order_item_subtotal) from order_items")
+------------------+
|               _c0|
+------------------+
|199.32066922046081|
+------------------+
So they are exactly the same, which is expected.
RDD (please use the data file from https://github.com/mdivk/175Scala/tree/master/data/part-00000):
val orderItems = sc.textFile("/public/retail_db/order_items")
val orderItemsRevenue = orderItems.map(oi => oi.split(",")(4).toFloat)
val totalRev = orderItemsRevenue.reduce((total, revenue) => total + revenue)
res4: Float = 3.4326256E7
val cnt = orderItemsRevenue.count
val avgRev = totalRev/cnt
avgRev: Float = 199.34178
As you can see, avgRev is 199.34178, not the 199.32066922046081 we calculated above in MySQL and via sqlContext.
I do not think this is an acceptable discrepancy, but I could be wrong. Am I missing anything here?
It would be appreciated if you can help me understand this. Thank you.
Maybe your MySQL table is using double, so you could try to change:
val orderItemsRevenue = orderItems.map(oi => oi.split(",")(4).toFloat)
into
val orderItemsRevenue = orderItems.map(oi => oi.split(",")(4).toDouble)
Since you need more precise results, use Double instead of Float. Float is 32 bits and Double is 64; Float is useful when memory usage is critical, but otherwise you should almost always use Double.
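A quick, self-contained illustration (with made-up values, not the question's data) of why a 32-bit Float accumulator drifts when summing tens of thousands of rows:

// A Float carries only about 7 significant decimal digits, so adding many
// ~200.0 values into a multi-million total loses the low-order digits.
val xs = Seq.fill(100000)(199.32f)
val floatAvg  = xs.sum / xs.size                   // accumulates rounding error
val doubleAvg = xs.map(_.toDouble).sum / xs.size   // keeps ~15-16 significant digits
println(s"Float: $floatAvg, Double: $doubleAvg")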
I'm facing an issue with the Spark Cassandra connector in Scala while updating a table in my keyspace.
Here is my piece of code:
val query = "UPDATE " + COLUMN_FAMILY_UNIQUE_TRAFFIC + DATA_SET_DEVICE +
            " SET a= a + " + b + " WHERE x=" +
            x + " AND y=" + y +
            " AND z=" + z
println(query)
val KeySpace = new CassandraSQLContext(sparkContext)
KeySpace.setKeyspace(KEYSPACE)
KeySpace.sql(query)
When I execute this code, I get an error like this:
Exception in thread "main" java.lang.RuntimeException: [1.1] failure: ``insert'' expected but identifier UPDATE found
Any idea why this is happening?
How can I fix this?
The UPDATE of a table with a counter column is feasible via the spark-cassandra-connector. You will have to use DataFrames and the DataFrameWriter method save with mode "append" (or SaveMode.Append if you prefer). Check the code in DataFrameWriter.scala.
For example, given a table:
cqlsh:test> SELECT * FROM name_counter ;
 name    | surname | count
---------+---------+-------
    John |   Smith |   100
   Zhang |     Wei |  1000
 Angelos |   Papas |    10
The code should look like this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val updateRdd = sc.parallelize(Seq(Row("John", "Smith", 1L),
                                   Row("Zhang", "Wei", 2L),
                                   Row("Angelos", "Papas", 3L)))
val tblStruct = new StructType(
  Array(StructField("name", StringType, nullable = false),
        StructField("surname", StringType, nullable = false),
        StructField("count", LongType, nullable = false)))
val updateDf = sqlContext.createDataFrame(updateRdd, tblStruct)

updateDf.write.format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test", "table" -> "name_counter"))
  .mode("append")
  .save()
After UPDATE:
 name    | surname | count
---------+---------+-------
    John |   Smith |   101
   Zhang |     Wei |  1002
 Angelos |   Papas |    13
The DataFrame construction can be simpler: implicitly convert an RDD (or a local Seq) to a DataFrame with import sqlContext.implicits._ and .toDF().
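For example, a hedged sketch of that shorter variant (same sqlContext, keyspace, and table as above):

import sqlContext.implicits._

// Column names are supplied to toDF; the tuple types match the table schema.
val updateDf = Seq(("John", "Smith", 1L), ("Zhang", "Wei", 2L), ("Angelos", "Papas", 3L))
  .toDF("name", "surname", "count")

updateDf.write.format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test", "table" -> "name_counter"))
  .mode("append")
  .save()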
Check the full code for this toy application:
https://github.com/kyrsideris/SparkUpdateCassandra/tree/master
Since versions are very important here, the above apply to Scala 2.11.7, Spark 1.5.1, spark-cassandra-connector 1.5.0-RC1-s_2.11, Cassandra 3.0.5.
DataFrameWriter has been designated @Experimental since 1.4.0.
I believe that you cannot update natively through the Spark connector. See the documentation:
"The default behavior of the Spark Cassandra Connector is to overwrite collections when inserted into a cassandra table. To override this behavior you can specify a custom mapper with instructions on how you would like the collection to be treated."
So you'll want to actually INSERT a new record with an existing key.