Not able to apply function to Spark Dataframe Column - scala

I am trying to apply a function to one of my dataframe columns to convert the values. The values in the column are like "20160907"; I need them to be "2016-09-07".
I wrote a function like this:
def convertDate(inDate: String): String = {
  val year = inDate.substring(0, 4)
  val month = inDate.substring(4, 6)
  val day = inDate.substring(6, 8)
  year + '-' + month + '-' + day
}
And in my Spark Scala code, I am using this:
def final_Val {
  val oneDF = hiveContext.read.orc("/tmp/new_file.txt")
  val convertToDate_udf = udf(convertToDate _)
  val convertedDf = oneDF.withColumn("modifiedDate", convertToDate_udf(col("EXP_DATE")))
  convertedDf.show()
}
Surprisingly, in the spark shell I am able to run this without any error. In the Scala IDE I am getting the compilation error below:
Multiple markers at this line:
not enough arguments for method udf: (implicit evidence$2:
reflect.runtime.universe.TypeTag[String], implicit evidence$3: reflect.runtime.universe.TypeTag[String])org.apache.spark.sql.UserDefinedFunction. Unspecified value parameters evidence$2, evidence$3.
I am using Spark 1.6.2, Scala 2.10.5
Can someone please tell me what I am doing wrong here?
I tried the same code with different functions, like in this post: stackoverflow.com/questions/35227568/applying-function-to-spark-dataframe-column.
I am not getting any compilation issues with that code; I am unable to find the issue with mine.

From what I have learned in a Spark Summit course, you should use the sql.functions methods as much as possible. Before implementing your own UDF, check whether there is already a function in the sql.functions package that does the same work. With the built-in functions, Spark can do a lot of optimizations for you, and it is not obliged to serialize and deserialize your data from and to JVM objects.
To achieve the result you want, I propose this solution:
import org.apache.spark.sql.functions.{from_unixtime, unix_timestamp}
import spark.implicits._
val oneDF = spark.sparkContext.parallelize(Seq("19931001", "19931001")).toDF("EXP_DATE")
val convertedDF = oneDF.withColumn("modifiedDate", from_unixtime(unix_timestamp($"EXP_DATE", "yyyyMMdd"), "yyyy-MM-dd"))
convertedDF.show()
This gives the following results:
+--------+------------+
|EXP_DATE|modifiedDate|
+--------+------------+
|19931001| 1993-10-01|
|19931001| 1993-10-01|
+--------+------------+
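As a side note, the compiler error in the question means the implicit TypeTag evidence for the UDF's input and output types could not be supplied; in an IDE that often points to scala-reflect missing from the build path or a Scala version mismatch. Also note that the snippet wraps convertToDate while the method defined above is convertDate. A minimal, untested sketch of the UDF route, with the types spelled out explicitly and the name matching the method above:
import org.apache.spark.sql.functions.{col, udf}
// Hypothetical fix sketch: wrap the method actually defined above and state the
// result/argument types explicitly instead of relying on inference.
val convertDate_udf = udf[String, String](convertDate _)
val convertedDf = oneDF.withColumn("modifiedDate", convertDate_udf(col("EXP_DATE")))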
Hope this helps. Best regards.

Related

Find columns to select, for spark.read(), from another Dataset - Spark Scala

I have a Dataset[Year] that has the following schema:
case class Year(day: Int, month: Int, Year: Int)
Is there any way to collect it into a collection of the case class, keeping the current schema?
I have tried:
println("Print -> "+ds.collect().toList)
But the result was:
Print -> List([01,01,2022], [31,01,2022])
I expected something like:
Print -> List(Year(01,01,2022), Year(31,01,2022))
I know that with a map I can adjust it, but I am trying to create a generic method that accepts any schema, and for this I cannot add a map doing the conversion.
That is my method:
class SchemeList[A] {
  def set[A](ds: Dataset[A]): List[A] = {
    ds.collect().toList
  }
}
Apparently the method has the correct return signature, but when the program runs, it fails with an error:
val setYears = new SchemeList[Year]
val YearList: List[Year] = setYears.set(df)
Exception in thread "main" java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to schemas.Schemas$Year
Based on your additional information in your comment:
I need this list to use as variables when creating another dataframe via JDBC (I need to make a specific select within PostgreSQL). Is there a more performant way to pass values from a dataframe as parameters in a select?
Given your initial dataset:
val yearsDS: Dataset[Year] = ???
and that you want to do something like:
val desiredColumns: Array[String] = ???
spark.read.jdbc(..).select(desiredColumns.head, desiredColumns.tail: _*)
You could find the column names of yearsDS by doing:
val desiredColumns: Array[String] = yearsDS.columns
Spark achieves this by using def schema, which is defined on Dataset.
You can see the definition of def columns here.
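For reference, columns is essentially a thin wrapper over schema, so the following is roughly equivalent (using the yearsDS value from above):
// Take the field names straight out of the schema.
val desiredColumns: Array[String] = yearsDS.schema.fields.map(_.name)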
Maybe you got a DataFrame, not a Dataset.
Try to use "as" to transform the DataFrame into a Dataset, like this:
val year = Year(1, 1, 1)
val years = Array(year, year).toList

import spark.implicits._
val df = spark
  .sparkContext
  .parallelize(years)
  .toDF("day", "month", "Year")
  .as[Year]

println(df.collect().toList)
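If you also want the generic helper from the question to work without a hand-written map, one option (a sketch under the assumption that the input is really a DataFrame, not a guaranteed drop-in fix) is to require an implicit Encoder so the Row-to-case-class conversion happens inside the method:
import org.apache.spark.sql.{DataFrame, Encoder}

class SchemeList[A] {
  // The implicit Encoder[A] tells Spark how to turn each Row into an A,
  // so a plain DataFrame can be converted before collecting.
  def set(df: DataFrame)(implicit enc: Encoder[A]): List[A] =
    df.as[A].collect().toList
}

// Usage, with import spark.implicits._ in scope:
// val yearList: List[Year] = new SchemeList[Year].set(df)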

Converting from java.util.List to spark dataset

I am still very new to Spark and Scala, but very familiar with Java. I have a Java jar with a function that returns a List (java.util.List) of Integers, but I want to convert these to a Spark Dataset so I can append it to another column and then perform a join. Is there any easy way to do this? I've tried things similar to this code:
val testDSArray : java.util.List[Integer] = new util.ArrayList[Integer]()
testDSArray.add(4)
testDSArray.add(7)
testDSArray.add(10)
val testDS : Dataset[Integer] = spark.createDataset(testDSArray, Encoders.INT())
but it gives me compiler errors (cannot resolve overloaded method).
If you look at the type signature you will see that in Scala the encoder is passed in a second (and implicit) parameter list.
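For reference, the relevant overload looks roughly like this in the Scala API (the context bound is what desugars into the implicit evidence parameter):
// On SparkSession (simplified):
def createDataset[T : Encoder](data: java.util.List[T]): Dataset[T]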
You may:
Pass it in another parameter list.
val testDS = spark.createDataset(testDSArray)(Encoders.INT)
Don't pass it, and let Scala's implicit mechanism resolve it.
import spark.implicits._
val testDS = spark.createDataset(testDSArray)
Convert the Java List to a Scala one first.
import collection.JavaConverters._
import spark.implicits._
val testDS = testDSArray.asScala.toDS()

List[Row] to RDD[CassandraRow] conversion in Scala

I have the following code:
val result = session.execute("Select * from table where imei= '" + imei + "'")
val list = result.all()
val sCollection = list.asScala
val rdd = sc.parallelize(Seq(sCollection))
I'm trying to convert a List[Row] to an RDD[CassandraRow]. I found somewhere that we need to convert this list to a Scala collection before making it an RDD, but when I try to run it, I get this error:
value asScala is not a member of java.util.List[com.datastax.driver.core.Row]
Where am I going wrong, and what can be done to resolve this?
Thanks,
You missed import scala.collection.JavaConverters._ at the beginning. However, I don't recommend the solution you've written, because it's not scalable.
There is the Spark Cassandra Connector, which can load data into Spark in a distributed (scalable) way.
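For completeness, a minimal sketch of the connector approach (the keyspace and table names here are placeholders; adjust them to your schema):
import com.datastax.spark.connector._ // spark-cassandra-connector

// Reads the table as a distributed RDD[CassandraRow] and pushes the
// imei predicate down to Cassandra instead of collecting rows on the driver.
val rdd = sc.cassandraTable("my_keyspace", "my_table")
  .where("imei = ?", imei)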

Unit testing with Spark dataframes

I'm trying to test a part of my program which performs transformations on dataframes.
I want to test several different variations of these dataframes, which rules out the option of reading a specific DF from a file.
And so my questions are:
Is there any good tutorial on how to perform unit testing with Spark and dataframes, especially regarding the creation of the dataframes?
How can I create these different several-line dataframes without a lot of boilerplate and without reading them from a file?
Are there any utility classes for checking for specific values inside a dataframe?
I obviously googled that before but could not find anything which was very useful. Among the more useful links I found were:
Running a basic unit test with a dataframe
Custom made assertions with DF
It would be great if the examples/tutorials were in Scala, but I'll take whatever language you've got.
Thanks in advance
This link shows how we can programmatically create a data frame with a schema. You can keep the data in separate traits and mix them in with your tests. For instance,
// This example assumes CSV data, but the same approach should work for other formats as well.
trait TestData {
  val data1 = List(
    "this,is,valid,data",
    "this,is,in-valid,data"
  )
  val data2 = ...
}
Then with ScalaTest, we can do something like this.
class MyDFTest extends FlatSpec with Matchers {
  "method" should "perform this" in new TestData {
    // You can access your test data here. Use it to create the DataFrame.
    // Your test here.
  }
}
To create the DataFrame, you can have a few util methods like the ones below.
def schema(types: Array[String], cols: Array[String]) = {
  val datatypes = types.map {
    case "String" => StringType
    case "Long" => LongType
    case "Double" => DoubleType
    // Add more types here based on your data.
    case _ => StringType
  }
  StructType(cols.indices.map(x => StructField(cols(x), datatypes(x))).toArray)
}
def df(data: List[String], types: Array[String], cols: Array[String]) = {
  val rdd = sc.parallelize(data)
  val parser = new CSVParser(',') // e.g. opencsv's CSVParser
  val split = rdd.map(line => parser.parseLine(line))
  val rows = split.map(arr => Row(arr(0), arr(1), arr(2), arr(3)))
  sqlContext.createDataFrame(rows, schema(types, cols))
}
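A hypothetical usage inside a test, tying the TestData trait and the helpers above together (the column names and types here are made up for illustration, and the df helper is assumed to be in scope):
"my method" should "see two input rows" in new TestData {
  val input = df(data1, Array("String", "String", "String", "String"), Array("c1", "c2", "c3", "c4"))
  input.count() shouldBe 2
}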
I am not aware of any utility classes for checking specific values in a DataFrame. But I think it should be simple to write one using the DataFrame APIs.
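For instance, a minimal order-insensitive equality check could look like this (a sketch only; it collects both sides to the driver, so it is meant for small test data):
import org.apache.spark.sql.DataFrame

def assertDataFrameEquals(expected: DataFrame, actual: DataFrame): Unit = {
  // Compare structure first, then the row contents regardless of ordering.
  assert(expected.schema == actual.schema,
    s"Schemas differ: ${expected.schema} vs ${actual.schema}")
  assert(expected.collect().toSet == actual.collect().toSet,
    "DataFrames contain different rows")
}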
You could use SharedSQLContext and SharedSparkSession that Spark uses for its own unit tests. Check my answer for examples.
For those looking to achieve something similar in Java, you can start by using this project to initialize a SparkContext within your unit tests: https://github.com/holdenk/spark-testing-base
I personally had to mimic the file structure of some AVRO files. So I used avro-tools (https://avro.apache.org/docs/1.8.2/gettingstartedjava.html#download_install) to dump a few of my binary records as JSON using the following command:
java -jar $AVRO_HOME/avro tojson largeAvroFile.avro | head -3
Then, using this small helper method, you can convert the output JSON into a DataFrame to use in your unit tests.
private DataFrame getDataFrameFromList() {
    SQLContext sqlContext = new SQLContext(jsc());
    ImmutableList<String> elements = ImmutableList.of(
        "{\"header\":{\"appId\":\"myAppId1\",\"clientIp\":\"10.22.63.3\",\"createdDate\":\"2017-05-10T02:09:59.984Z\"}}",
        "{\"header\":{\"appId\":\"myAppId1\",\"clientIp\":\"11.22.63.3\",\"createdDate\":\"2017-05-11T02:09:59.984Z\"}}",
        "{\"header\":{\"appId\":\"myAppId1\",\"clientIp\":\"12.22.63.3\",\"createdDate\":\"2017-05-11T02:09:59.984Z\"}}"
    );
    JavaRDD<String> parallelize = jsc().parallelize(elements);
    return sqlContext.read().json(parallelize);
}

call of distinct and map together throws NPE in spark library

I am unsure if this is a bug. If you do something like this:
// d:spark.RDD[String]
d.distinct().map(x => d.filter(_.equals(x)))
you will get a Java NPE. However, if you do a collect immediately after distinct, all will be fine.
I am using spark 0.6.1.
Spark does not support nested RDDs or user-defined functions that refer to other RDDs, hence the NullPointerException; see this thread on the spark-users mailing list.
It looks like your current code is trying to group the elements of d by value; you can do this efficiently with the groupBy() RDD method:
scala> val d = sc.parallelize(Seq("Hello", "World", "Hello"))
d: spark.RDD[java.lang.String] = spark.ParallelCollection@55c0c66a
scala> d.groupBy(x => x).collect()
res6: Array[(java.lang.String, Seq[java.lang.String])] = Array((World,ArrayBuffer(World)), (Hello,ArrayBuffer(Hello, Hello)))
What about the windowing example provided in the Spark 1.3.0 streaming programming guide?
val dataset: RDD[(String, String)] = ...
val windowedStream = stream.window(Seconds(20))...
val joinedStream = windowedStream.transform { rdd => rdd.join(dataset) }
SPARK-5063 causes the example to fail, since the join is being called from within the transform method on an RDD.