Spark - Scala: "error: not found: value transform"

In my implemented code I get the following error:
error: not found: value transform
.withColumn("min_date", array_min(transform('min_date,
^
I have been unable to resolve this. I already have the following import statements:
import sqlContext.implicits._
import org.apache.spark.sql.functions.split
import org.apache.spark.sql.functions._
I'm using Apache Zeppelin to execute this.
Here is a sample of the dataset I'm using, followed by the full code for reference:
1004,bb5469c5|2021-09-19 01:25:30,4f0d-bb6f-43cf552b9bc6|2021-09-25 05:12:32,1954f0f|2021-09-19 01:27:45,4395766ae|2021-09-19 01:29:13,
1018,36ba7a7|2021-09-19 01:33:00,
1020,23fe40-4796-ad3d-6d5499b|2021-09-19 01:38:59,77a90a1c97b|2021-09-19 01:34:53,
1022,3623fe40|2021-09-19 01:33:00,
1028,6c77d26c-6fb86|2021-09-19 01:50:50,f0ac93b3df|2021-09-19 01:51:11,
1032,ac55-4be82f28d|2021-09-19 01:54:20,82229689e9da|2021-09-23 01:19:47,
val users = sc.textFile("path to file").map(x=>x.replaceAll("\\(","")).map(x=>x.replaceAll("\\)","")).map(x=>x.replaceFirst(",","*")).toDF("column")
val tempDF = users.withColumn("_tmp", split($"column", "\\*")).select(
$"_tmp".getItem(0).as("col1"),
$"_tmp".getItem(1).as("col2")
)
val output = tempDF.withColumn("min_date", split('col2 , ","))
.withColumn("min_date", array_min(transform('min_date,
c => to_timestamp(regexp_extract(c, "\\|(.*)$", 1)))))
.show(10,false)

There is no method in functions with the signature transform(c: Column, fn: Column => Column) before Spark 3.0 (it was added in 3.0.0), so you are either running an older version, importing the wrong object, or trying to do something else.

You are probably using a Spark version earlier than 3.x, where this Scala DataFrame API transform does not exist. With Spark 3.x your code works fine.
I noted that I could not get it to work with 2.4 either (not enough time to dig deeper), but have a look here: Higher Order functions in Spark SQL
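For completeness, here is a hedged sketch of a Spark 2.4 workaround: the Column-based transform overload is missing there, but the SQL higher-order function transform can still be reached through expr. Column names mirror the question's code; the backslash in the regex is escaped once for the Scala string and once for the SQL string literal, so it may need adjusting for your setup.
tempDF
  .withColumn("min_date", split('col2, ","))
  // the Column-based transform is missing before Spark 3.0, but the SQL higher-order function works via expr in 2.4+
  .withColumn("min_date",
    array_min(expr("transform(min_date, c -> to_timestamp(regexp_extract(c, '\\\\|(.*)$', 1)))")))
  .show(10, false)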

Related

Converting from java.util.List to spark dataset

I am still very new to Spark and Scala, but very familiar with Java. I have a Java jar that has a function returning a List (java.util.List) of Integers, but I want to convert these to a Spark Dataset so I can append it to another column and then perform a join. Is there any easy way to do this? I've tried things similar to this code:
val testDSArray : java.util.List[Integer] = new util.ArrayList[Integer]()
testDSArray.add(4)
testDSArray.add(7)
testDSArray.add(10)
val testDS : Dataset[Integer] = spark.createDataset(testDSArray, Encoders.INT())
but it gives me a compiler error (cannot resolve overloaded method).
If you look at the type signature you will see that in Scala the encoder is passed in a second (and implicit) parameter list.
You may:
1. Pass it in the second parameter list:
val testDS = spark.createDataset(testDSArray)(Encoders.INT)
2. Don't pass it, and let Scala's implicit mechanism resolve it:
import spark.implicits._
val testDS = spark.createDataset(testDSArray)
3. Convert the Java List to a Scala one first:
import collection.JavaConverters._
import spark.implicits._
val testDS = testDSArray.asScala.toDS()
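Since the original question also mentions appending these values to another DataFrame and joining, here is a rough, hedged sketch of one positional approach. otherDF, its single string column letter, and the column names value/idx are all hypothetical, and aligning by position only makes sense if both sides have a stable row order:
import scala.collection.JavaConverters._
import spark.implicits._

// index the list values by position
val valuesWithIdx = testDSArray.asScala.toSeq
  .map(_.intValue)
  .zipWithIndex
  .map { case (v, i) => (v, i.toLong) }
  .toDF("value", "idx")

// index the hypothetical otherDF by position as well
val otherWithIdx = otherDF.rdd.zipWithIndex()
  .map { case (row, idx) => (row.getString(0), idx) }
  .toDF("letter", "idx")

// equi-join on the positional index, then drop it
val joined = otherWithIdx.join(valuesWithIdx, "idx").drop("idx")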

Not able to write SequenceFile in Scala for Array[NullWritable, ByteWritable]

I have a byte array in Scala: val nums = Array[Byte](1,2,3,4,5,6,7,8,9) (or you can take any other byte array).
I want to save it as a sequence file in HDFS. Below is the code I am writing in the Scala console.
import org.apache.hadoop.io.compress.GzipCodec
nums.map(x => (NullWritable.get(), new ByteWritable(x))).saveAsSequenceFile("/yourPath", classOf[GzipCodec])
But, it's giving following error:
error: value saveAsSequenceFile is not a member of Array[(org.apache.hadoop.io.NullWritable, org.apache.hadoop.io.ByteWritable)]
You need to import these classes as well (in the Scala console):
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.io.ByteWritable
The method saveAsSequenceFile is available on an RDD, not on an array. So first you need to lift your array into an RDD, and then you will be able to call saveAsSequenceFile:
val v = sc.parallelize(Array(("owl",3), ("gnu",4), ("dog",1), ("cat",2), ("ant",5)), 2)
v.saveAsSequenceFile("hd_seq_file")
http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html
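Applied to the original nums array, a minimal sketch could look like the following (assuming sc is available in the shell; note that saveAsSequenceFile takes the codec wrapped in an Option):
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.io.{ByteWritable, NullWritable}

// lift the byte array into an RDD first, then build the (key, value) pairs inside the tasks
sc.parallelize(nums)
  .map(x => (NullWritable.get(), new ByteWritable(x)))
  .saveAsSequenceFile("/yourPath", Some(classOf[GzipCodec]))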

Scala - Encoder missing for type stored in dataset

I am trying to run the following command in Spark 2.2:
val x_test0 = cn_train.map( { case row => row.toSeq.toArray } )
And I keep getting the following error:
error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
I have already imported implicits._ through the following commands:
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
The error message tells you that it cannot find an Encoder for a heterogeneous Array to save it in a Dataset. But you can get an RDD of Arrays like this:
cn_train.rdd.map{ row => row.toSeq.toArray }
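If you would rather stay in the Dataset API, a hedged alternative (assuming string values are acceptable for your use case) is to map each row to a homogeneous Array[String], for which spark.implicits._ does provide an encoder:
import spark.implicits._

// each Row becomes an Array[String]; nulls are kept as null
val x_test0 = cn_train.map(row => row.toSeq.map(v => if (v == null) null else v.toString).toArray)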

Scala not able to save as sequence file in RDD, as per doc it is allowed

I am using Spark 1.6. As per the official doc it is allowed to save an RDD in sequence file format; however, I notice the following for my RDD textFile:
scala> textFile.saveAsSequenceFile("products_sequence")
<console>:30: error: value saveAsSequenceFile is not a member of org.apache.spark.rdd.RDD[String]
I googled and found similar discussions that seem to suggest this works in PySpark. Is my understanding of the official doc wrong? Can saveAsSequenceFile() be used in Scala?
saveAsSequenceFile is only available when you have key-value pairs in the RDD, because it is defined in PairRDDFunctions:
https://spark.apache.org/docs/2.1.1/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
You can see that the API definition takes a K and a V. If you change your code above to something like this:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.rdd._
object SequenceFile extends App {
  val conf = new SparkConf().setAppName("sequenceFile").setMaster("local[1]")
  val sc = new SparkContext(conf)
  val rdd: RDD[(String, String)] = sc.parallelize(List(("foo", "foo1"), ("bar", "bar1"), ("baz", "baz1")))
  rdd.saveAsSequenceFile("foo.seq")
  sc.stop()
}
This works perfectly and you will get the foo.seq output. The reason the above works is that we have an RDD of key-value pairs, not just an RDD[String].
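For the original RDD[String], a hedged sketch of one way to turn it into key-value pairs first (using the line index as a synthetic key; any key/value types with Writable conversions would work):
// give each line a synthetic Long key so the RDD becomes a pair RDD
val pairs = textFile.zipWithIndex().map { case (line, idx) => (idx, line) }
pairs.saveAsSequenceFile("products_sequence")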

adding two columns from a data frame in scala

I have two columns, age and salary, stored in a DataFrame. I just want to write Scala code to add these values column-wise. I tried
val age_1 = df.select("age")
val salary_1=df.select("salary")
val add = age_1+salary_1
but it gives me an error. Please help.
In the following, spark is an instance of SparkSession, so the import has to come after spark has been instantiated.
$-notation can be used here by importing spark implicits with
import spark.implicits._
then use the $-notation:
val add = df.select($"age" + $"salary")
Final Scala code:
import spark.implicits._
val add = df.select($"age" + $"salary")
Apache doc
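If you want to keep the original columns and just add the sum as a new one, a hedged variant (the column name total is hypothetical):
import spark.implicits._

// element-wise sum of age and salary kept alongside the original columns
val withTotal = df.withColumn("total", $"age" + $"salary")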