List[Row] to RDD[CassandraRow] conversion in Scala

I've following code :-
val result = session.execute("Select * from table where imei= '" + imei + "'")
val list = result.all()
val sCollection = list.asScala
val rdd = sc.parallelize(Seq(sCollection))
I'm trying to convert this List[Row] to RDD[CassandraRow], and I found somewhere that the list needs to be converted to a Scala collection before making it an RDD. But when I try to run this, it gives the error:
value asScala is not a member of java.util.List[com.datastax.driver.core.Row]
Where am I going wrong, and what can be done to resolve this?
Thanks,

You missed import scala.collection.JavaConverters._ at the beginning. However, I don't recommend the solution you've written, because it's not scalable.
There is a Spark-Cassandra connector that can load data into Spark in a distributed (scalable) way.
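A minimal sketch of both routes, reusing the list and imei values from your snippet; the keyspace name is a made-up placeholder, and the second part uses the connector's cassandraTable/where entry points, which give you an RDD[CassandraRow] directly:

import scala.collection.JavaConverters._

// Quick fix for the compile error: convert the Java list, then parallelize it directly.
// Note: no extra Seq(...) wrapper, otherwise you get an RDD with a single element.
val rdd = sc.parallelize(list.asScala)

// Scalable alternative: read through the Spark-Cassandra connector and push the filter down to Cassandra.
import com.datastax.spark.connector._

val cassandraRdd = sc.cassandraTable("my_keyspace", "table").where("imei = ?", imei)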

Related

Converting from java.util.List to spark dataset

I am still very new to Spark and Scala, but very familiar with Java. I have a Java jar with a function that returns a List (java.util.List) of Integers, but I want to convert these to a Spark Dataset so I can append it to another column and then perform a join. Is there any easy way to do this? I've tried things similar to this code:
val testDSArray : java.util.List[Integer] = new util.ArrayList[Integer]()
testDSArray.add(4)
testDSArray.add(7)
testDSArray.add(10)
val testDS : Dataset[Integer] = spark.createDataset(testDSArray, Encoders.INT())
but it gives me a compiler error (cannot resolve overloaded method).
If you look at the type signature you will see that in Scala the encoder is passed in a second (and implicit) parameter list.
You may:
Pass it in another parameter list.
val testDS = spark.createDataset(testDSArray)(Encoders.INT)
Don't pass it, and let Scala's implicit mechanism resolve it.
import spark.implicits._
val testDS = spark.createDataset(testDSArray)
Convert the Java List to a Scala one first.
import collection.JavaConverters._
import spark.implicits._
val testDS = testDSArray.asScala.toDS()
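Whichever variant you choose, a quick check of the result (the three values are the ones added above):

testDS.show()   // prints a single "value" column containing 4, 7 and 10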

Get Json Key value from List[Row] with Scala

Let's say that I have a List[Row] such as {"name":"abc","salary":"somenumber","id":"1"},{"name":"xyz","salary":"some_number_2","id":"2"}
How do I get the JSON key-value pair with Scala? Let's assume that I want to get the value of the key "salary". Is the below one right?
val rows = List[Row] //Assuming that rows has the list of rows
for(row <- rows){
row.get(0).+("salary")
}
If you have a List[Row], I assume that you had a DataFrame and did collectAsList. If you collect/collectAsList, that means you:
Can no longer use Spark SQL operations
Cannot run your calculations in parallel on the nodes in your cluster. At this point everything is executed in your driver.
I would recommend keeping it as a DataFrame and then doing:
val salaries = df.select("salary")
Then you can do further calculations on the salaries, show them or collect or persist them somewhere.
If you choose to use a Dataset (which is like a typed DataFrame), then you could do
val salaries = dataSet.map(_.salary)
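For example, a minimal self-contained sketch; the Employee case class and its field names are assumptions matching the JSON keys:

import spark.implicits._

case class Employee(name: String, salary: String, id: String)

val dataSet = Seq(Employee("abc", "somenumber", "1"), Employee("xyz", "some_number_2", "2")).toDS()
val salaries = dataSet.map(_.salary)
salaries.show()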
Using Spray Json:
import spray.json._
import DefaultJsonProtocol._
object sprayApp extends App {
val list = List("""{"name":"abc","salary":"somenumber","id":"1"}""", """{"name":"xyz","salary":"some_number_2","id":"2"}""")
val jsonAst = list.map(_.parseJson)
for(l <- jsonAst) {
println(l.asJsObject.getFields("salary")(0))
}
}

error: value saveAsTextFile is not a member of scala.collection.Map[String,Long]

I tried all the possible ways, importing all possible libraries and checking the answers to every question related to saveAsTextFile or saveAsSequenceFile, but nothing helped. Hence I am starting a new thread.
I am getting the error "error: value saveAsTextFile is not a member of scala.collection.Map[String,Long]" at countResult.saveAsTextFile("tmp/testfile") while trying to save an RDD to HDFS. I am following the steps below.
1.scala> import org.apache.spark.SparkFiles
import org.apache.spark.SparkFiles
2.scala> val countrdd = sc.parallelize(Array( "hadoop","spark","hadoop","spark")).map( k => (k,1))
countrdd: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[17] at map at :28
3.scala> val countResult = countrdd.countByKey()
countResult: scala.collection.Map[String,Long] = Map(spark -> 2, hadoop -> 2)
4.scala> countResult.saveAsTextFile("tmp/testfile")
:33: error: value saveAsTextFile is not a member of scala.collection.Map[String,Long]
countResult.saveAsTextFile("tmp/testfile")
Note: I am using Spark 2.X version on standalone cluster.
Methods like saveAsTextFile are only available on RDDs.
You can perform any number of transformations; as long as the result is still an RDD, you can use a method like this.
But once you apply an action like countByKey, the result is a plain Scala Map on the driver and such methods are no longer available.
Instead of countByKey you can use reduceByKey; you can find more detail about this under the RDD API Examples section of the Spark documentation.
Or you can try this code:
val countrdd = sc.parallelize(Array("hadoop", "spark", "hadoop", "spark"))
val findRDD = countrdd.map(word => (word, 1))
  .reduceByKey(_ + _)
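Since findRDD is still an RDD, saving it then works as expected (using the same output path as in your example):
findRDD.saveAsTextFile("tmp/testfile")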
Hope this clears your issue
Thanks

Not able to apply function to Spark Dataframe Column

I am trying to apply a function to one of my dataframe columns to convert the values. The values in the column are like "20160907", and I need them to be like "2016-09-07".
I wrote a function like this:
def convertDate(inDate:String ): String = {
val year = inDate.substring(0,4)
val month = inDate.substring(4,6)
val day = inDate.substring(6,8)
return year+'-'+month+'-'+day
}
And in my spark scala code, I am using this:
def final_Val {
val oneDF = hiveContext.read.orc("/tmp/new_file.txt")
val convertDate_udf = udf(convertDate _)
val convertedDf = oneDF.withColumn("modifiedDate", convertDate_udf(col("EXP_DATE")))
convertedDf.show()
}
Surprisingly, in the spark shell I am able to run this without any error. In the Scala IDE I am getting the compilation error below:
Multiple markers at this line:
not enough arguments for method udf: (implicit evidence$2:
reflect.runtime.universe.TypeTag[String], implicit evidence$3: reflect.runtime.universe.TypeTag[String])org.apache.spark.sql.UserDefinedFunction. Unspecified value parameters evidence$2, evidence$3.
I am using Spark 1.6.2, Scala 2.10.5
Can someone please tell me what I am doing wrong here?
I tried the same code with different functions, like in this post: stackoverflow.com/questions/35227568/applying-function-to-spark-dataframe-column.
I do not get any compilation issues with that code, so I am unable to find the issue with mine.
From what I have learned in a Spark Summit course, you should use the sql.functions methods as much as possible. Before implementing your own UDF, check whether there is an existing function in the sql.functions package that does the same work. When you use the built-in functions, Spark can do a lot of optimizations for you and is not obliged to serialize and deserialize your data from and to JVM objects.
To achieve the result you want, I propose this solution:
import org.apache.spark.sql.functions.{from_unixtime, unix_timestamp}
import spark.implicits._

val oneDF = spark.sparkContext.parallelize(Seq("19931001", "19931001")).toDF("EXP_DATE")
val convertedDF = oneDF.withColumn("modifiedDate", from_unixtime(unix_timestamp($"EXP_DATE", "yyyyMMdd"), "yyyy-MM-dd"))
convertedDF.show()
This gives the following result:
+--------+------------+
|EXP_DATE|modifiedDate|
+--------+------------+
|19931001| 1993-10-01|
|19931001| 1993-10-01|
+--------+------------+
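If you still want to use your own UDF instead of the built-in functions, here is a minimal sketch along the same lines; the EXP_DATE column name comes from your code, and the DataFrame oneDF is assumed to already exist:

import org.apache.spark.sql.functions.{col, udf}

def convertDate(inDate: String): String = {
  val year = inDate.substring(0, 4)
  val month = inDate.substring(4, 6)
  val day = inDate.substring(6, 8)
  s"$year-$month-$day"
}

// wrap the plain Scala function as a UDF and apply it to the column
val convertDate_udf = udf(convertDate _)
val convertedDf = oneDF.withColumn("modifiedDate", convertDate_udf(col("EXP_DATE")))
convertedDf.show()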
Hope this helps. Best regards.

addition of two dataframe integer values in Scala/Spark

So I'm new to both Scala and Spark so it may be kind of a dumb question...
I have the following code :
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val df = sc.parallelize(List(1,2,3)).toDF();
df.foreach( value => println( value(0) + value(0) ) );
error: type mismatch;
found : Any
required: String
What is wrong with it? How do I tell "this is an integer not an any"?
I tried value(0).toInt but "value toInt is not a member of Any".
I tried List(1: Integer, 2: Integer, 3: Integer) but I cannot convert it into a DataFrame afterward...
A Spark Row is an untyped container. If you want to extract anything other than Any, you have to use a typed extractor method or pattern matching over the Row (see Spark: extracting values from a Row):
df.rdd.map(value => value.getInt(0) + value.getInt(0)).collect.foreach(println)
In practice there should rarely be a reason to extract these values at all. Instead, you can operate directly on the DataFrame:
df.select($"_1" + $"_1")