Pass columnNames dynamically to cassandraTable().select() - scala

I'm reading query off of a file at run-time and executing it on the SPark+Cassandra environment.
I'm executing :
sparkContext.cassandraTable.("keyspaceName", "colFamilyName").select("col1", "col2", "col3").where("some condition = true")
Query in FIle :
select col1, col2, col3
from keyspaceName.colFamilyName
where somecondition = true
Here Col1,col2,col3 can vary depending on the query parsed from the file.
Question :
How do I pick columnName from query and pass them to select() and runtime.
I have tried many ways to do it :
1. dumbest thing done (which obviously threw an error) -
var str = "col1,col2,col3"
var selectStmt = str.split("\\,").map { x => "\"" + x.trim() + "\"" }.mkString(",")
var queryRDD = sc.cassandraTable().select(selectStmt)
Any ideas are welcome.
Side Notes :
1. I do not want to use cassandraCntext becasue it will be depricated/ removed in next realase (https://docs.datastax.com/en/datastax_enterprise/4.5/datastax_enterprise/spark/sparkCCcontext.html)
2. I'm on
- a. Scala 2.11
- b. spark-cassandra-connector_2.11:1.6.0-M1
- c. Spark 1.6

Use Cassandra Connector
Your use case sounds like you actually want to use CassandraConnector Objects. These give you a direct access to a per ExecutorJVM session pool and are ideal for just executing random queries. This will end up being much more efficient than creating an RDD for each query.
This would look something like
rddOfStatements.mapPartitions( it =>
CassandraConnector.withSessionDo { session =>
it.map(statement =>
session.execute(statement))})
But you most likely would want to use executeAsync and handle the futures separately for better performance.
Programatically specifying columns in cassandraTable
The select method takes ColumnRef* which means you need to pass in some number of ColumnRefs. Normally there is an implicit conversion from String --> ColumnRef which is why you can pass in just a var-args of strings.
Here it's a little more complicated because we want to pass var args of another type so we end up with double implicits and Scala doesn't like that.
So instead we pass in ColumnName objects as varargs (:_*)
========================================
Keyspace: test
========================================
Table: dummy
----------------------------------------
- id : java.util.UUID (partition key column)
- txt : String
val columns = Seq("id", "txt")
columns: Seq[String] = List(id, txt)
//Convert the strings to ColumnNames (a subclass of ColumnRef) and treat as var args
sc.cassandraTable("test","dummy")
.select(columns.map(ColumnName(_)):_*)
.collect
Array(CassandraRow{id: 74f25101-75a0-48cd-87d6-64cb381c8693, txt: hello world})
//Only use the first column
sc.cassandraTable("test","dummy")
.select(columns.map(ColumnName(_)).take(1):_*)
.collect
Array(CassandraRow{id: 74f25101-75a0-48cd-87d6-64cb381c8693})
//Only use the last column
sc.cassandraTable("test","dummy")
.select(columns.map(ColumnName(_)).takeRight(1):_*)
.collect
Array(CassandraRow{txt: hello world})

Related

Spark Scala Dataset Validations using UDF and its Performance

I'm new to Spark Scala. I have implemented an solution for Dataset validation for multiple columns using UDF rather than going through individual columns in for loop. But i dint know how this is working faster and i have to explain it was the better solution.
The columns for data validation will be received at run time, so we cannot hard-coded the column names in code. And also the comments column needs to be updated with the column name when column value got failed in validation.
Old Code,
def doValidate(data: Dataset[Row], columnArray: Array[String], validValueArrays: Array[String]): Dataset[Row] = {
var ValidDF: Dataset[Row] = data
var i:Int = 0
for (s <- columnArray) {
var list = validValueArrays(i).split(",")
ValidDF = ValidDF.withColumn("comments",when(ValidDF.col(s).isin(list: _*),concat(lit(col("comments")),lit(" Error: Invalid Records in: ") ,lit(s))).otherwise(col("comments")))
i = i + 1
}
return ValidDF;
}
New Code,
def validateColumnValues(data: Dataset[Row], columnArray: Array[String], validValueArrays: Array[String]): Dataset[Row] = {
var ValidDF: Dataset[Row] = data
var checkValues = udf((row: Row, comment: String) => {
var newComment = comment
for (s: Int <- 0 to row.length-1) {
var value = row.get(s)
var list = validValueArrays(s).split(",")
if(!list.contains(value))
{
newComment = newComment + " Error:Invalid Records in: " + columnArray(s) +";"
}
}
newComment
});
ValidDF = ValidDF.withColumn("comments",checkValues(struct(columnArray.head, columnArray.tail: _*),col("comments")))
return ValidDF;
}
columnArray --> Will have list of columns
validValueArrays --> Will have Valid Values Corresponding to column array position. The multiple valid values will be , separated.
I want to know which one better or any other better approach to do it. When i tested new code looks better. And also what is the difference between this two logic's as i read UDF is a black-box for Spark. And in this case the UDF will affect performance in any case?
I need to correct some closed bracket before running it. One '}' to be removed when you return the validDF. I still get a runtime analysis error.
It is better to avoid UDF as a UDF implies deserialization to process the data in classic Scala and then reserialize it. However, if your requirement cannot be archived using in build SQL function, then you have to go for UDF but you must make sure you review the SparkUI for performance and plan of execution.

Scala Spark not returning value outside loop [duplicate]

I am new to Scala and Spark and would like some help in understanding why the below code isn't producing my desired outcome.
I am comparing two tables
My desired output schema is:
case class DiscrepancyData(fieldKey:String, fieldName:String, val1:String, val2:String, valExpected:String)
When I run the below code step by step manually, I actually end up with my desired outcome. Which is a List[DiscrepancyData] completely populated with my desired output. However, I must be missing something in the code below because it returns an empty list (before this code gets called there are other codes that is involved in reading tables from HIVE, mapping, grouping, filtering, etc etc etc):
val compareCols = Set(year, nominal, adjusted_for_inflation, average_private_nonsupervisory_wage)
val key = "year"
def compare(table:RDD[(String, Iterable[Row])]): List[DiscrepancyData] = {
var discs: ListBuffer[DiscrepancyData] = ListBuffer()
def compareFields(fieldOne:String, fieldTwo:String, colName:String, row1:Row, row2:Row): DiscrepancyData = {
if (fieldOne != fieldTwo){
DiscrepancyData(
row1.getAs(key).toString, //fieldKey
colName, //fieldName
row1.getAs(colName).toString, //table1Value
row2.getAs(colName).toString, //table2Value
row2.getAs(colName).toString) //expectedValue
}
else null
}
def comparison() {
for(row <- table){
var elem1 = row._2.head //gets the first element in the iterable
var elem2 = row._2.tail.head //gets the second element in the iterable
for(col <- compareCols){
var value1 = elem1.getAs(col).toString
var value2 = elem2.getAs(col).toString
var disc = compareFields(value1, value2, col, elem1, elem2)
if (disc != null) discs += disc
}
}
}
comparison()
discs.toList
}
I'm calling the above function as such:
var outcome = compare(groupedFiltered)
Here is the data in groupedFiltered:
(1991,CompactBuffer([1991,7.14,5.72,39%], [1991,4.14,5.72,39%]))
(1997,CompactBuffer([1997,4.88,5.86,39%], [1997,3.88,5.86,39%]))
(1999,CompactBuffer([1999,5.15,5.96,39%], [1999,5.15,5.97,38%]))
(1947,CompactBuffer([1947,0.9,2.94,35%], [1947,0.4,2.94,35%]))
(1980,CompactBuffer([1980,3.1,6.88,45%], [1980,3.1,6.88,48%]))
(1981,CompactBuffer([1981,3.15,6.8,45%], [1981,3.35,6.8,45%]))
The table schema for groupedFiltered:
(year String,
nominal Double,
adjusted_for_inflation Double,
average_provate_nonsupervisory_wage String)
Spark is a distributed computing engine. Next to "what the code is doing" of classic single-node computing, with Spark we also need to consider "where the code is running"
Let's inspect a simplified version of the expression above:
val records: RDD[List[String]] = ??? //whatever data
var list:mutable.List[String] = List()
for {record <- records
entry <- records }
{ list += entry }
The scala for-comprehension makes this expression look like a natural local computation, but in reality the RDD operations are serialized and "shipped" to executors, where the inner operation will be executed locally. We can rewrite the above like this:
records.foreach{ record => //RDD.foreach => serializes closure and executes remotely
record.foreach{entry => //record.foreach => local operation on the record collection
list += entry // this mutable list object is updated in each executor but never sent back to the driver. All updates are lost
}
}
Mutable objects are in general a no-go in distributed computing. Imagine that one executor adds a record and another one removes it, what's the correct result? Or that each executor comes to a different value, which is the right one?
To implement the operation above, we need to transform the data into our desired result.
I'd start by applying another best practice: Do not use null as return value. I also moved the row ops into the function. Lets rewrite the comparison operation with this in mind:
def compareFields(colName:String, row1:Row, row2:Row): Option[DiscrepancyData] = {
val key = "year"
val v1 = row1.getAs(colName).toString
val v2 = row2.getAs(colName).toString
if (v1 != v2){
Some(DiscrepancyData(
row1.getAs(key).toString, //fieldKey
colName, //fieldName
v1, //table1Value
v2, //table2Value
v2) //expectedValue
)
} else None
}
Now, we can rewrite the computation of discrepancies as a transformation of the initial table data:
val discrepancies = table.flatMap{case (str, row) =>
compareCols.flatMap{col => compareFields(col, row.next, row.next) }
}
We can also use the for-comprehension notation, now that we understand where things are running:
val discrepancies = for {
(str,row) <- table
col <- compareCols
dis <- compareFields(col, row.next, row.next)
} yield dis
Note that discrepancies is of type RDD[Discrepancy]. If we want to get the actual values to the driver we need to:
val materializedDiscrepancies = discrepancies.collect()
Iterating through an RDD and updating a mutable structure defined outside the loop is a Spark anti-pattern.
Imagine this RDD being spread over 200 machines. How can these machines be updating the same Buffer? They cannot. Each JVM will be seeing its own discs: ListBuffer[DiscrepancyData]. At the end, your result will not be what you expect.
To conclude, this is a perfectly valid (not idiomatic though) Scala code but not a valid Spark code. If you replace RDD with an Array it will work as expected.
Try to have a more functional implementation along these lines:
val finalRDD: RDD[DiscrepancyData] = table.map(???).filter(???)

How to run two SparkSql queries in parallel in Apache Spark

First, let me write the part of the code I want to execute in .scala file on spark.
This is my source file. It has structured data with four fields
val inputFile = sc.textFile("hdfs://Hadoop1:9000/user/hduser/test.csv")
I have declared a case class to store the data from file into table with four columns
case class Table1(srcIp: String, destIp: String, srcPrt: Int, destPrt: Int)
val inputValue = inputFile.map(_.split(",")).map(p => Table1(p(0),p(1),p(2).trim.toInt,p(3).trim.toInt)).toDF()
inputValue.registerTempTable("inputValue")
Now, let's say, I want to run following two queries. How can I run these queries in parallel as they are mutually independent. I feel, if I could run them in parallel, it can reduce the execution time. Right now, they are executed serially.
val primaryDestValues = sqlContext.sql("SELECT distinct destIp FROM inputValue")
primaryDestValues.registerTempTable("primaryDestValues")
val primarySrcValues = sqlContext.sql("SELECT distinct srcIp FROM inputValue")
primarySrcValues.registerTempTable("primarySrcValues")
primaryDestValues.join(primarySrcValues, $"destIp" === $"srcIp").select($"destIp",$"srcIp").show(
May be you can look in direction of Futures/Promises. There is a method in SparkContext submitJob which return you future with results. So may this you can fire two jobs and then collect results from futures.
I have not tried this method yet. Just an assumption.
No idea why you want to use sqlContext in the first place, and don't keep things simple.
val inputValue = inputFile.map(_.split(",")).map(p => (p(0),p(1),p(2).trim.toInt,p(3).trim.toInt))
Assuming p(0) = destIp, p(1)=srcIp
val joinedValue = inputValue.map{case(destIp, srcIp, x, y) => (destIp, (x, y))}
.join(inputFile.map{case(destIp, srcIp, x, y) => (srcIp, (x, y))})
.map{case(ip, (x1, y1), (x2, y2)) => (ip, destX, destY, srcX, srcY)}
Now it will be parallezied, and you can even control number of partitions you want using colasce
You can skip the two DISTINCT and do one at the end:
inputValue.select($"srcIp").join(
inputValue.select($"destIp"),
$"srcIp" === $"destIp"
).distinct().show
That's a nice question. This can be executed in parallel using the par in the array. For this you have customize your code accordingly.
Declare an array with two items in it (your can name this as per your wish). Write your code inside each case statement which you need to execute in parallel.
Array("destIp","srcIp").par.foreach { i =>
{
i match {
case "destIp" => {
val primaryDestValues = sqlContext.sql("SELECT distinct destIp FROM inputValue")
primaryDestValues.registerTempTable("primaryDestValues")
}
case "srcIp" => {
val primarySrcValues = sqlContext.sql("SELECT distinct srcIp FROM inputValue")
primarySrcValues.registerTempTable("primarySrcValues")
}}}
}
Once both of the case statement's execution is completed, your below code will be executed.
primaryDestValues.join(primarySrcValues, $"destIp" === $"srcIp").select($"destIp",$"srcIp").show()
Note : If you remove par from the code, it will run sequentially
The other option is to create another sparksession inside the code and execute sql using that sparksession variable. But this is little risky and has be used very carefully

Formatting the join rdd - Apache Spark

I have two key value pair RDD, I join the two rdd's and I saveastext file, here is the code:
val enKeyValuePair1 = rows_filter6.map(line => (line(8) -> (line(0),line(4),line(10),line(5),line(6),line(14),line(1),line(9),line(12),line(13),line(3),line(15),line(7),line(16),line(2),line(14))))
val enKeyValuePair = DATA.map(line => (line(0) -> (line(2),line(3))))
val final_res = enKeyValuePair1.leftOuterJoin(enKeyValuePair)
val output = final_res.saveAsTextFile("C:/out")
my output is as follows:
(534309,((17999,5161,45005,00000,XYZ,,29.95,0.00),None))
How can i get rid of all the parenthesis?
I want my output as follows:
534309,17999,5161,45005,00000,XYZ,,29.95,0.00,None
When outputing to a text file Spark will just use the toString representation of the element in the RDD. If you want control over the format, then, tou can do one last transform of the data to a String before the call to saveAsTextFile.
Luckily the tuples that arise form using the Spark API can be pulled apart using destructuring. In your example I'd do:
val final_res = enKeyValuePair1.leftOuterJoin(enKeyValuePair)
val formatted = final_res.map { tuple =>
val (f1,((f2,f3,f4,f5,f6,f7,f8,f9),f10)) = tuple
Seq(f1, f2, f3, f4, f5, f6, f7, f8, f9, f10).mkString(",")
}
formatted.saveAsTextFile("C:/out")
The first val line will take the tuple that is passed into the map function and assign the components to the values on the left. The second line creates a temporary Seq with the fields in the order you want displayed and then invokes mkString(",") to join the fields using a comma.
In cases with fewer fields or you're just hacking away at a problem on the REPL, a slight alternate to the above can also be used by using pattern matching on the partial function passed to map.
simpleJoinedRdd.map { case (key,(left,right)) => s"$key,$left,$right"}}
While that does allow you do make it a single line expression it can throw Exceptions if the data in the RDD don't match the pattern provided, as opposed to the earlier example where the compiler will complain if the tuple parameter cannot be destructured into the expected form.
You can do something like this:
import scala.collection.JavaConversions._
val output = sc.parallelize(List((534309,((17999,5161,45005,1,"XYZ","",29.95,0.00),None))))
val result = output.map(p => p._1 +=: p._2._1.productIterator.toBuffer += p._2._2)
.map(p => com.google.common.base.Joiner.on(", ").join(p.iterator))
I used guava to format string but there is porbably scala way of doing this.
do a flatmap before saving. Or, you can write a simple format function and use it in map.
Adding a bit code, just to show how it can be done. function formatOnDemand can be anything
test = sc.parallelize([(534309,((17999,5161,45005,00000,"XYZ","",29.95,0.00),None))])
print test.collect()
print test.map(formatOnDemand).collect()
def formatOnDemand(t):
out=[]
out.append(t[0])
for tok in t[1][0]:
out.append(tok)
out.append(t[1][1])
return out
>>>
[(534309, ((17999, 5161, 45005, 0, 'XYZ', '', 29.95, 0.0), None))]
[[534309, 17999, 5161, 45005, 0, 'XYZ', '', 29.95, 0.0, None]]

Apache Spark map and reduce with passing values

I have a simple map and reduce job over an RDD loaded from Cassandra.
The code looks something like this
sc.cassandraTable("app","channels").select("id").toArray.foreach((o) => {
val orders = sc.cassandraTable("fam", "table")
.select("date", "f2", "f3", "f4")
.where("id = ?", o("id")) # This o("id") is the ID i want later append to the finished list
val month = orders
.map( oo => {
var total_revenue = List(oo.getIntOption("f2"), oo.getIntOption("f3"), oo.getIntOption("f4")).flatten.reduce(_ + _)
(getDateAs("hour", oo.getDate("date")), total_revenue)
})
.reduceByKey(_ + _)
})
So this code summs the revenue up and returns something like this
(2014-11-23 18:00:00, 12412)
(2014-11-23 19:00:00, 12511)
Now I want to save this back to a Cassandra Table revenue_hour but i need the ID somehow in that list, something like that.
(2014-11-23 18:00:00, 12412, "CH1")
(2014-11-23 19:00:00, 12511, "CH1")
How can I make this work with more then just a (key, value) list? How can i pass along more values, which should not be transformed, instead just passed through to the end so I can save it back to Cassandra?
Maybe you could use a class and work with it through the flow. I mean, define RevenueHour class
case class RevenueHour(date: java.util.Date,revenue: Long, id: String)
Then built an intermediate RevenueHour in the map phase and then another one in the reduce phase.
val map: RDD[(Date, RevenueHour)] = orders.map(row =>
(
getDateAs("hour", oo.getDate("date")),
RevenueHour(
row.getDate("date"),
List(row.getIntOption("f2"),row.getIntOption("f3"),row.getIntOption("f4")).flatten.reduce(_ + _),
row.getString("id")
)
)
).reduceByKey((o1: RevenueHour, o2: RevenueHour) => RevenueHour(getDateAs("hour", o1.date), o1.revenue + o2.revenue, o1.id))
I use o1 RevenueHour because both o1 and o2 will have same key and same id (because the where clause before).
Hope it helps.
The approach presented on the question is sequencing the processing of data by iterating over a array of ids and applying a Spark job on only a (potentially small) subset of the data.
Without knowing how is the relation between the 'channels' and 'table' data, I see two options to fully utilize the ability of Spark of processing data in parallel:
Option 1
If the data on the 'table' table (called "orders" from here on) contains all the set of ids that we require in the report, we could apply the reporting logic to the whole table:
Based on the question, we will use this C* schema:
CREATE TABLE example.orders (id text,
date TIMESTAMP,
f2 decimal,
f3 decimal,
f4 decimal,
PRIMARY KEY(id, date)
);
It makes is a lot easier to access cassandra data by providing a case class that represents the schema of the table:
case class Order(id: String, date:Long, f2:Option[BigDecimal], f3:Option[BigDecimal], f4:Option[BigDecimal]) {
lazy val total = List(f2,f3,f4).flatten.sum
}
Then we can define an rdd based on the cassandra table. When we provide the case class as type, the spark-cassandra driver can directly perform a conversion for our convenience:
val ordersRDD = sc.cassandraTable[Order]("example", "orders").select("id", "date", "f2", "f3", "f4")
val revenueByIDPerHour = ordersRDD.map{order => ((order.id, getDateAs("hour", order.date)), order.total)}.reduceByKey(_ + _)
And finally save back to Cassandra:
revenueByIDPerHour.map{ case ((id,date), revenue) => (id, date, revenue)}
.saveToCassandra("example","revenue", SomeColumns("id", "date", "total"))
Option 2
if the ids contained in the ("app","channels") table should be used to filter the set of ids (e.g. valid ids), then, we can join the ids from this table with the orders. The job will be similar to the previous on, with the addition of:
val idRDD = sc.cassandraTable("app","channels").select("id").map(_.getString)
val ordersRDD = sc.cassandraTable[Order]("example", "orders").select("id", "date", "f2", "f3", "f4")
val validOrders = idRDD.join(ordersRDD.map(order => (id,order))
These two ways illustrate how to work with Cassandra and Spark, making use of the distributed nature of Spark's operations. It should also be considerably faster then executing a query for each ID in the 'channels' table.