I have a rather peculiar problem. In a DSE spark analytics engine I produce frequent stats that I store to cassandra in a small table. Since I keep the table trimmed and it is supposed to serve a web interface with consolidated information, I simply want to query the whole table in spark and send the results over an API. I have tried two methods for this:
val a = Try(sc.cassandraTable[Data](keyspace, table).collect()).toOption
val query = "SELECT * FROM keyspace.table"
val df = spark.sqlContext.sql(query)
val list = df.collect()
I am doing this in a scala program. When I use method 1, spark job mysteriously gets stuck showing stage 10 of 12 forever. Verified in logs and spark jobs page. When I use the second method it simply tells me that no such table exists:
Unknown exception: org.apache.spark.sql.AnalysisException: Table or view not found: keyspace1.table1; line 1 pos 15;
'Project [*]
+- 'UnresolvedRelation keyspace1.table1
Interestingly, I tested both methods in spark shell on the cluster and they work just fine. My program has plenty of other queries done using method 1 and they all work fine, the key difference being that in each of them the main partition key always has a condition on it unlike in this query (holds true for this particular table too).
Here is the table structure:
CREATE TABLE keyspace1.table1 (
userid text,
stat_type text,
event_time bigint,
stat_value double,
PRIMARY KEY (userid, stat_type))
WITH CLUSTERING ORDER BY (stat_type ASC)
Any solid diagnosis of the problem or a work around would be much appreciated
When you do select * without where clause in cassandra, you're actually performing a full range query. This is not intended use case in cassandra (aside from peeking at the data perhaps). Just for the fun of it, try replacing with select * from keyspace.table limit 10 and see if it works, it might...
Anyway, my gut feeling says you're problem isn't with spark, but with cassandra. If you have visibility for cassandra metrics, look for the range query latencies.
Now, if your code above is complete - the reason that method 1 freezes, while method 2 doesn't, is that method 1 contains an action (collect), while method 2 doesn't involve any spark action, just schema inference. Should you add to method 2 df.collect you will face the same issue with cassandra
Related
I am observing some weired issue. I am not sure whether it is lack of my knowledge in spark or what.
I have a dataframe as shown in below code. I create a tempview from it and I am observing that after merge operation, that tempview becomes empty. Not sure why.
val myDf = getEmployeeData()
myDf.createOrReplaceTempView("myView")
// Result1: Below lines display all the records
myDf.show()
spark.Table("myView").show()
// performing merge operation
val sql = s"""MERGE INTO employee AS a
USING myView AS b
ON a.Id = b.Id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *"""
spark.sql(sql)
// Result2: ISSUE is here. myDf & mvView both are empty
myDf.show()
spark.Table("myView").show()
Edit
getEmployeeData method performs join between two dataframes and returns the result.
df1.as(df1Alias).join(df2.as(df2Alias), expr(joinString), "inner").filter(finalFilterString).select(s"$df1Alias.*")
Dataframes in Spark are lazily evaluated, ie. not executed until an action like .show, .collect is executed or they're used in SQL DDL operation. This also means that if you refer to it once more, it will get reevaluated again.
Assuming there's no other background activity that can mess up, apparently your function getEmployeeData, depends on employee table. It gets executed both before and after the merge and might yield different result.
To prevent it you can checkpoint the dataframe:
myView.checkpoint()
or explicitly materialize it:
myView.write.saveAsTable("myViewMaterialized")
and later refer to the materialized version.
I agree with the points that #Kombajn zbozowy said related to Dataframes in spark are lazily evaluated and will be reevaluated again if you call action on it once more.
I would like to point out that, it is normal expected behavior and might not have to do anything with the merge operation.
For example, the dataframe that you get as an output of the join only contains inserts and you perform those inserts on the target table using dataframe write api, and then if you run the df.show() command again it would show you output as empty because when it is reevaluating the content of the dataframe by performing join it won't get any difference and will not output any records...
Same holds true for merge operation as it also updates the target table with insert/updates and when you rerun it won't show any output rows.
I have a difficulty when working with data frames in spark with Scala. If I have a data frame that I want to extract a column of unique entries, when I use groupBy I don't get a data frame back.
For example, I have a DataFrame called logs that has the following form:
machine_id | event | other_stuff
34131231 | thing | stuff
83423984 | notathing | notstuff
34131231 | thing | morestuff
and I would like the unique machine ids where event is thing stored in a new DataFrame to allow me to do some filtering of some kind. Using
val machineId = logs
.where($"event" === "thing")
.select("machine_id")
.groupBy("machine_id")
I get a val of Grouped Data back which is a pain in the butt to use (or I don't know how to use this kind of object properly). Having got this list of unique machine id's, I then want to use this in filtering another DataFrame to extract all events for individual machine ids.
I can see I'll want to do this kind of thing fairly regularly and the basic workflow is:
Extract unique id's from a log table.
Use unique ids to extract all events for a particular id.
Use some kind of analysis on this data that has been extracted.
It's the first two steps I would appreciate some guidance with here.
I appreciate this example is kind of contrived but hopefully it explains what my issue is. It may be I don't know enough about GroupedData objects or (as I'm hoping) I'm missing something in data frames that makes this easy. I'm using spark 1.5 built on Scala 2.10.4.
Thanks
Just use distinct not groupBy:
val machineId = logs.where($"event"==="thing").select("machine_id").distinct
Which will be equivalent to SQL:
SELECT DISTINCT machine_id FROM logs WHERE event = 'thing'
GroupedData is not intended to be used directly. It provides a number of methods, where agg is the most general, which can be used to apply different aggregate functions and convert it back to DataFrame. In terms of SQL what you have after where and groupBy is equivalent to something like this
SELECT machine_id, ... FROM logs WHERE event = 'thing' GROUP BY machine_id
where ... has to be provided by agg or equivalent method.
A group by in spark followed by aggregation and then a select statement will return a data frame. For your example it should be something like:
val machineId = logs
.groupBy("machine_id", "event")
.agg(max("other_stuff") )
.select($"machine_id").where($"event" === "thing")
I am trying to figure out how to test Spark SQL queries against a Cassandra database -- kind of like you would in SQL Server Management Studio. Currently I have to open the Spark Console and type Scala commands which is really tedious and error prone.
Something like:
scala > var query = csc.sql("select * from users");
scala > query.collect().foreach(println)
Especially with longer queries this can be a real pain.
This seems like a terribly inefficient way to test if your query is correct and what data you will get back. The other issue is when your query is wrong you get back a mile long error message and you have to scroll up the console to find it. How do I test my spark queries without using the console or writing my own application?
You could use bin/spark-sql to avoid construct Scala program and just write SQL.
In order to use bin/spark-sql you may need to rebuild your spark with -Phive and -Phive-thriftserver.
More informations on Building Spark. Note: do not build against Scala2.11, thrift server dependencies seem not ready for the moment.
You can write SQL in a file, read it in a variable in your testing script and set ssc.sql(file.read()) [Python way]
But it seems you are looking for something else. A test approach may be?
Here is one example:
[donghua#vmxdb01 ~]$ $SPARK_HOME/bin/spark-sql --packages datastax:spark-cassandra-connector:2.0.0-M2-s_2.11 --conf spark.cassandra.connection.host=127.0.0.1
spark-sql> select * from kv where value > 2;
Error in query: Table or view not found: kv; line 1 pos 14
spark-sql> create TEMPORARY TABLE kv USING org.apache.spark.sql.cassandra OPTIONS (table "kv",keyspace "mykeyspace", cluster "Test Cluster",pushdown "true");
16/10/12 08:28:09 WARN SparkStrategies$DDLStrategy: CREATE TEMPORARY TABLE kv USING... is deprecated, please use CREATE TEMPORARY VIEW viewName USING... instead
Time taken: 4.008 seconds
spark-sql> select * from kv;
key1 1
key4 4
key3 3
key2 2
Time taken: 2.253 seconds, Fetched 4 row(s)
spark-sql> select substring(key,1,3) from kv;
key
key
key
key
Time taken: 1.328 seconds, Fetched 4 row(s)
spark-sql> select substring(key,1,3),count(*) from kv group by substring(key,1,3);
key 4
Time taken: 3.518 seconds, Fetched 1 row(s)
spark-sql>
Let's assume we have a Cassandra cluster with RF = N and a table containing wide rows.
Our table could have an index something like this: pk / ck1 / ck2 / ....
If we create an RDD from a row in the table as follows:
val wide_row = sc.cassandraTable(KS, TABLE).select("c1", "c2").where("pk = ?", PK)
I notice that one Spark node has 100% of the data and the others have none. I assume this is because the spark-cassandra-connector has no way of breaking down the query token range into smaller sub ranges because it's actually not a range -- it's simply the hash of PK.
At this point we could simply call redistribute(N) to spread the data across the Spark cluster before processing, but this has the effect of moving data across the network to nodes that already have the data locally in Cassandra (remember RF = N)
What we would really like is to have each Spark node load a subset (slice) of the row locally from Cassandra.
One approach which came to mind is to generate an RDD containing a list of distinct values of the first cluster key (ck1) when pk = PK. We could then use mapPartitions() to load a slice of the wide row based on each value of ck1.
Assuming we already have our list values for ck1, we could write something like this:
val ck1_list = .... // RDD
ck1_list.repartition(ck1_list.count().toInt) // create a partition for each value of ck1
val wide_row = ck1_list.mapPartitions(f)
Within the partition iterator, f(), we would like to call another function g(pk, ck1) which loads the row slice from Cassandra for partition key pk and cluster key ck1. We could then apply flatMap to ck1_list so as to create a fully distributed RDD of the wide row without any shuffing.
So here's the question:
Is it possible to make a CQL call from within a Spark task? What driver should be used? Can it be set up only once an reused for subsequent tasks?
Any help would be greatly appreciated, thanks.
For the sake of future reference, I will explain how I solved this.
I actually used a slightly different method to the one outlined above, one which does not involve calling Cassandra from inside Spark tasks.
I started off with ck_list, a list of distinct values for the first cluster key when pk = PK. The code is not shown here, but I actually downloaded this list directly from Cassandra in the Spark driver using CQL.
I then transform ck_list into a list of RDDS. Next we combine the RDDs (each one representing a Cassandra row slice) into one unified RDD (wide_row).
The cast on CassandraRDD is necessary because union returns type org.apache.spark.rdd.RDD
After running the job I was able to verify that the wide_row had x partitions where x is the size of ck_list. A useful side effect is that wide_row is partitioned by the first cluster key, which is also the key I want to reduce by. Hence even more shuffling is avoided.
I don't know if this is the best way to achieve what I wanted, but it certainly works.
val ck_list // list first cluster key values where pk = PK
val wide_row = ck_list.map( ck =>
sc.cassandraTable(KS, TBL)
.select("c1", "c2").where("pk = ? and ck1 = ?", PK, ck)
.asInstanceOf[org.apache.spark.rdd.RDD]
).reduce( (x, y) => x.union(y) )
Classic issue, new framework -- thus problem.
PostgreSQL + Scala + ScalaQuery. I have Master table with serial (autincrement) id and Slave table also with serial id.
I need to insert one master record and several slaves. I have to do it within transaction (to have ability to cancel all), so I cannot run a query after inserting master to find out id. As far as I see SQ "insert" method does not return any reference to inserted master record.
So how to do it?
SQ Examples cover this however without autoincremented field, so such solution (pre-set ids) is not applicable here.
If I understand it correctly this is not possible for now in automatic way. If one is not afraid, this can be done this way. Obtaining the id of last insert (per each master record insertion):
postgreSQL function for last inserted ID
Then using it in SQ:
http://groups.google.com/group/scalaquery/browse_thread/thread/faa7d3e5842da82e
This code shows the MySql way. I'm posting it to the list for
posterity's sake.
val scopeIdentity = SimpleFunction.nullaryLong
val inserted = Actions.insert(
"cat", "eats", "dog)
//Print out the count of inserted records. println(inserted )
//Print out the primary key for the last inserted record.
println(Query(scopeIdentity).first)
//Regards //Bryan
But since for auto incremented fields you have to use projections excluding autoinc fields, and then inserting tuples instead of named record types, there is a question if it is not worth to hold breath until SQ will support this directly.
Note I am SQ newbie, I might just misinform you.