How to test a Spark SQL Query without Scala - scala

I am trying to figure out how to test Spark SQL queries against a Cassandra database -- kind of like you would in SQL Server Management Studio. Currently I have to open the Spark console and type Scala commands, which is really tedious and error-prone.
Something like:
scala> val query = csc.sql("select * from users")
scala> query.collect().foreach(println)
Especially with longer queries this can be a real pain.
This seems like a terribly inefficient way to test whether your query is correct and what data you will get back. The other issue is that when your query is wrong you get back a mile-long error message and have to scroll up the console to find it. How do I test my Spark queries without using the console or writing my own application?

You could use bin/spark-sql to avoid constructing a Scala program and just write SQL.
In order to use bin/spark-sql you may need to rebuild your Spark with -Phive and -Phive-thriftserver.
More information is available in Building Spark. Note: do not build against Scala 2.11; the Thrift server dependencies do not seem to be ready at the moment.

You can write the SQL in a file, read it into a variable in your testing script, and pass it to ssc.sql(file.read()) [Python way].
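A minimal sketch of that idea, assuming a SparkSession named spark and a query saved to a file (the path queries/users.sql is just a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-test").getOrCreate()

# 'queries/users.sql' is a placeholder; put the query you want to test in any file
with open("queries/users.sql") as f:
    query = f.read()

result = spark.sql(query)
result.show(20, truncate=False)  # print a readable sample instead of collect()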
But it seems you are looking for something else -- perhaps a testing approach?

Here is one example:
[donghua#vmxdb01 ~]$ $SPARK_HOME/bin/spark-sql --packages datastax:spark-cassandra-connector:2.0.0-M2-s_2.11 --conf spark.cassandra.connection.host=127.0.0.1
spark-sql> select * from kv where value > 2;
Error in query: Table or view not found: kv; line 1 pos 14
spark-sql> create TEMPORARY TABLE kv USING org.apache.spark.sql.cassandra OPTIONS (table "kv",keyspace "mykeyspace", cluster "Test Cluster",pushdown "true");
16/10/12 08:28:09 WARN SparkStrategies$DDLStrategy: CREATE TEMPORARY TABLE kv USING... is deprecated, please use CREATE TEMPORARY VIEW viewName USING... instead
Time taken: 4.008 seconds
spark-sql> select * from kv;
key1 1
key4 4
key3 3
key2 2
Time taken: 2.253 seconds, Fetched 4 row(s)
spark-sql> select substring(key,1,3) from kv;
key
key
key
key
Time taken: 1.328 seconds, Fetched 4 row(s)
spark-sql> select substring(key,1,3),count(*) from kv group by substring(key,1,3);
key 4
Time taken: 3.518 seconds, Fetched 1 row(s)
spark-sql>

Related

pyspark dataframe reference vs value

I am learning pyspark. I'm trying to build a DataFrame from SQL, for example:
DF=spark.sql("with a as (select ....) select ...")
My SQL is a little complex, so it takes 20 minutes to execute.
I feel like DF is a reference to my SQL, meaning that when I execute DF.head(10) it takes 20 minutes, then the next step DF.count() also takes 20 minutes, etc.
I'd like to have a DataFrame like in pandas, with the values in RAM, where DF.head(10) and DF.count() take a few seconds.
The only way I can think of is to use "create table", for example:
xx=spark.sql("create table yyy as with a as (select ....) select ...")
DF=sqlContext.sql("select * from yyy")
It works but it looks strange to me.
What are the best practices for creating a DataFrame in pyspark from complex SQL? I would like to skip the "create table" step.
I'd like to have a DataFrame like in pandas, with the values in RAM, where DF.head(10) and DF.count() take a few seconds.
Pandas loads your data into memory the moment you read it; that's why it's lightning-fast. But remember that the amount of data you can load is limited by your computer's memory.
My SQL is a little complex, so it takes 20 minutes to execute. I feel like DF is a reference to my SQL, meaning that when I execute DF.head(10) it takes 20 minutes, then DF.count() also takes 20 minutes, etc.
Spark does not load the data when you read it. It only reads data when an action such as count, head, or collect is triggered (cache by itself is lazy and only takes effect once an action runs).
The only way I can think of is to use "create table"
Yes, creating a table is also an action, and your query is executed entirely; the next time you read the table, Spark doesn't have to re-compute it. The alternative to creating a table is caching: you can do something like DF.cache().count(), Spark will load the entire dataset into memory, and all later actions will be much faster.
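As a rough illustration (using a generated DataFrame to stand in for the expensive SQL in the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# stand-in for the long-running spark.sql(...) query from the question
DF = spark.range(0, 1000000).selectExpr("id", "id % 10 AS bucket")

# cache() only marks the DataFrame for caching; the first action (count)
# runs the plan once and materializes the result in memory
DF.cache()
DF.count()

# later actions read from the cache instead of re-running the plan
DF.head(10)
DF.count()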

Create table in spark taking a lot of time

We have a Databricks table-creation script like this:
finalDF.write.format('delta').option("mergeSchema", "true").mode('overwrite').save(table_path)
spark.sql("CREATE TABLE IF NOT EXISTS {}.{} USING DELTA LOCATION '{}' ".format('GOLDDB', table, table_path))
Initially, during the first load, we have just one file in table_path. The job runs incrementally and files accumulate every day, so after 10 incremental loads it takes around 10 hours to complete. Could you please help me with how to optimise the load? Is it possible to merge files?
I just tried removing some files for testing purposes, but it failed with an error that some files present in the log file are missing; this occurs when you manually delete the files.
Please suggest how to optimize this query.
Instead of write + create table you can just do everything in one step using the path option + saveAsTable:
finalDF.write.format('delta')\
.option("mergeSchema", "true")\
.option("path", table_path)\
.mode('overwrite')\
.saveAsTable(table_name) # like 'GOLDDB.name'
To clean up old data you need to use the VACUUM command (doc); you may also need to decrease the retention from the default 30 days (see the doc on the delta.logRetentionDuration option).
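For example, something like this (a sketch only -- the table name is hypothetical, and the retention values should be chosen based on how far back you need to be able to restore):

# assumes a Delta Lake / Databricks environment; GOLDDB.my_table is a hypothetical name
table = "GOLDDB.my_table"

# shorten how long old entries are kept in the transaction log (default: 30 days)
spark.sql("ALTER TABLE {} SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 7 days')".format(table))

# delete data files no longer referenced by the table and older than 7 days (168 hours)
spark.sql("VACUUM {} RETAIN 168 HOURS".format(table))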

Spark Job simply stalls when querying full cassandra table

I have a rather peculiar problem. In a DSE Spark analytics engine I produce frequent stats that I store to Cassandra in a small table. Since I keep the table trimmed and it is supposed to serve a web interface with consolidated information, I simply want to query the whole table in Spark and send the results over an API. I have tried two methods for this:
// Method 1
val a = Try(sc.cassandraTable[Data](keyspace, table).collect()).toOption
// Method 2
val query = "SELECT * FROM keyspace.table"
val df = spark.sqlContext.sql(query)
val list = df.collect()
I am doing this in a Scala program. When I use method 1, the Spark job mysteriously gets stuck, showing stage 10 of 12 forever. I verified this in the logs and on the Spark jobs page. When I use the second method it simply tells me that no such table exists:
Unknown exception: org.apache.spark.sql.AnalysisException: Table or view not found: keyspace1.table1; line 1 pos 15;
'Project [*]
+- 'UnresolvedRelation keyspace1.table1
Interestingly, I tested both methods in the Spark shell on the cluster and they work just fine. My program has plenty of other queries done using method 1 and they all work fine; the key difference is that in each of them the main partition key always has a condition on it, unlike in this query (and that holds true for this particular table too).
Here is the table structure:
CREATE TABLE keyspace1.table1 (
userid text,
stat_type text,
event_time bigint,
stat_value double,
PRIMARY KEY (userid, stat_type))
WITH CLUSTERING ORDER BY (stat_type ASC)
Any solid diagnosis of the problem or a workaround would be much appreciated.
When you do a select * without a where clause in Cassandra, you're actually performing a full range query. This is not an intended use case in Cassandra (aside from peeking at the data, perhaps). Just for the fun of it, try replacing it with select * from keyspace.table limit 10 and see if it works -- it might...
Anyway, my gut feeling says your problem isn't with Spark but with Cassandra. If you have visibility into Cassandra metrics, look at the range query latencies.
Now, if your code above is complete, the reason that method 1 freezes while method 2 doesn't is that method 1 contains an action (collect), while method 2 doesn't involve any Spark action, just schema inference. Should you add df.collect to method 2, you would face the same issue with Cassandra.

How to make a Cassandra connection using CQL used to create a table?

I am new to Tableau. I went through the site before posting this question and didn't find an answer matching my question.
I have successfully established a connection to Cassandra using the "DataStax Cassandra ODBC driver 64-bit Windows"; everything is fine, and I filled in all the details like keyspace name and table name as per the documentation available on the DataStax site.
But when I drag the available table to the canvas it keeps loading for minutes. What the database guy told me about the data is that there are many millions of rows per day, we have 6 months of data, and the data gets updated every 10 minutes; it's for a reputed wind energy company.
My client has given me the CQL used for creating the table:
create table abc_data_test.machine_data
(machine_id text, tag text, timestamp timestamp, value double,
PRIMARY KEY((machine_id, tag), timestamp))
WITH CLUSTERING ORDER BY(timestamp DESC)
AND compression = { 'sstable_compression' : 'LZ4Compressor' };
Where should I put this code?
I tried to insert it on the connection page and it gives an error. I am getting a new custom SQL error (I placed the code in "New Custom SQL").
The timer is still running, and can be seen as:
processing request: connecting to datasource, Elapsed time 87:09
The error from New Custom SQL is:
An error occured while commuicating with the datasource. [DataStax][CassandraODBC] (10) Error while executing a query in Cassandra:33562624: line 1.11 no viable alternative at input '1' (SELECT [TOP]1...)
I'm using Windows 10 64-bit, DataStax ODBC driver 64-bit version 2.4.1, and DSE 4.8 and later.
You cannot pass DDL SQL into the custom SQL box. If the Cassandra connection supports the Initial SQL option, you could pass it there; then your custom SQL would be some sort of select statement. Otherwise, create the table in Cassandra and then connect to that table from Tableau.

Prepare access to many DB tables for a possible later usage with spark

I am taking my first steps with Spark and am currently looking into ways to import some data from a database via a JDBC driver.
My plan is to prepare access to many tables from the DB for possible later use by another team with pure Spark SQL commands, so they can focus on the data and no longer have any contact with the code.
My connection to the DB is working, and so far I have found two working ways to get some data.
Way 1:
sqlContext.read.jdbc(url,"tab3",myProp).registerTempTable("tab3")
Way 2:
case class RowClass_TEST (COL1:String, COL2:String)
val myRDD_TEST = new JdbcRDD(sc, () => DriverManager.getConnection(url, username, pw),
  "select * from TEST where ? < ?", 0, 1, 1,
  row => RowClass_TEST(row.getString("COL1"), row.getString("COL2")))
myRDD_TEST.toDF().registerTempTable("TEST")
But both ways have some drawbacks:
Way 1 is not so fast if you have to prepare a large number of tables that are not used later.
(I traced 5 JDBC commands during the execution of the example: create connection, login, settings, query for the header, terminate connection.)
Way 2 works very fast, but the Scala case class has a heavy limitation:
you can only set up 22 values with this kind of class.
So is there an easy solution to set up way 2 without a case class?
I want to access some DB tables with more than 22 columns.
I have already tried to get it working, but my Scala know-how is not good enough yet.
You can write something like this:
sqlContext.load("jdbc",
Map(
"url" -> "jdbc:mysql://dbConnectionString",
"dbtable" ->
"(SELECT * FROM someTable WHERE someField > 10 ) AS a"
)
).registerTempTable("tmp_table")