Not able to write SequenceFile in Scala for Array[NullWritable, ByteWritable] - scala

I have a byte array in Scala: val nums = Array[Byte](1,2,3,4,5,6,7,8,9) (or take any other byte array).
I want to save it as a sequence file in HDFS. Below is the code I am writing in the Scala console:
import org.apache.hadoop.io.compress.GzipCodec
nums.map(x => (NullWritable.get(), new ByteWritable(x))).saveAsSequenceFile("/yourPath", classOf[GzipCodec])
But it's giving the following error:
error: value saveAsSequenceFile is not a member of Array[(org.apache.hadoop.io.NullWritable, org.apache.hadoop.io.ByteWritable)]
You need to import these classes as well (in the Scala console):
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.io.ByteWritable

The method saveAsSequenceFile is available on an RDD, not on an array. So first you need to lift your array into an RDD, and then you will be able to call saveAsSequenceFile:
val v = sc.parallelize(Array(("owl",3), ("gnu",4), ("dog",1), ("cat",2), ("ant",5)), 2)
v.saveAsSequenceFile("hd_seq_file")
http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html
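Applying the same approach to the byte array from the question might look like the following sketch. It assumes a Spark shell where sc is the SparkContext, and uses the optional codec parameter of saveAsSequenceFile; it is an illustration, not part of the original answer.
import org.apache.hadoop.io.{ByteWritable, NullWritable}
import org.apache.hadoop.io.compress.GzipCodec

// Lift the local array into an RDD first; saveAsSequenceFile then becomes
// available because the RDD contains (key, value) pairs of Writable types.
val nums = Array[Byte](1, 2, 3, 4, 5, 6, 7, 8, 9)
sc.parallelize(nums)
  .map(x => (NullWritable.get(), new ByteWritable(x)))
  .saveAsSequenceFile("/yourPath", Some(classOf[GzipCodec]))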

Related

Spark - Scala: "error: not found: value transform"

In my code I get the following error:
error: not found: value transform
.withColumn("min_date", array_min(transform('min_date,
^
I have been unable to resolve this. I already have the following import statements:
import sqlContext.implicits._
import org.apache.spark.sql.functions.split
import org.apache.spark.sql.functions._
I'm using Apache Zeppelin to execute this.
Here is the full code for reference and a sample of the dataset I'm using:
1004,bb5469c5|2021-09-19 01:25:30,4f0d-bb6f-43cf552b9bc6|2021-09-25 05:12:32,1954f0f|2021-09-19 01:27:45,4395766ae|2021-09-19 01:29:13,
1018,36ba7a7|2021-09-19 01:33:00,
1020,23fe40-4796-ad3d-6d5499b|2021-09-19 01:38:59,77a90a1c97b|2021-09-19 01:34:53,
1022,3623fe40|2021-09-19 01:33:00,
1028,6c77d26c-6fb86|2021-09-19 01:50:50,f0ac93b3df|2021-09-19 01:51:11,
1032,ac55-4be82f28d|2021-09-19 01:54:20,82229689e9da|2021-09-23 01:19:47,
val users = sc.textFile("path to file")
  .map(x => x.replaceAll("\\(", ""))
  .map(x => x.replaceAll("\\)", ""))
  .map(x => x.replaceFirst(",", "*"))
  .toDF("column")
val tempDF = users.withColumn("_tmp", split($"column", "\\*")).select(
$"_tmp".getItem(0).as("col1"),
$"_tmp".getItem(1).as("col2")
)
val output = tempDF.withColumn("min_date", split('col2 , ","))
.withColumn("min_date", array_min(transform('min_date,
c => to_timestamp(regexp_extract(c, "\\|(.*)$", 1)))))
.show(10,false)
There is no transform(c: Column, fn: Column => Column) in functions before Spark 3.x, so you're either running an older version, importing the wrong object, or trying to do something else.
You are probably using a version of Spark earlier than 3.x, where this Scala DataFrame API transform does not exist. With Spark 3.x your code works fine.
I could not get it to work with 2.4, I noted; not enough time to dig further, but have a look here: Higher Order functions in Spark SQL
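On Spark 2.4 the Column-based transform is missing from functions, but the SQL higher-order function of the same name is available, so one possible workaround is to go through expr. This is a sketch (not from the original answers) that reuses the tempDF and column names from the question:
import org.apache.spark.sql.functions._

// transform(...) is written as SQL here because functions.transform(Column, Column => Column)
// only exists from Spark 3.0 onwards; array_min and the SQL transform exist since 2.4.
val output = tempDF
  .withColumn("min_date", split(col("col2"), ","))
  .withColumn("min_date",
    array_min(expr("""transform(min_date, c -> to_timestamp(regexp_extract(c, '\\|(.*)$', 1)))""")))
output.show(10, false)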

Converting from java.util.List to spark dataset

I am still very new to Spark and Scala, but very familiar with Java. I have a Java jar with a function that returns a List (java.util.List) of Integers, but I want to convert these to a Spark Dataset so I can append it to another column and then perform a join. Is there any easy way to do this? I've tried code similar to this:
val testDSArray : java.util.List[Integer] = new java.util.ArrayList[Integer]()
testDSArray.add(4)
testDSArray.add(7)
testDSArray.add(10)
val testDS : Dataset[Integer] = spark.createDataset(testDSArray, Encoders.INT())
but it gives me a compiler error (cannot resolve overloaded method).
If you look at the type signature you will see that in Scala the encoder is passed in a second (and implicit) parameter list.
You may:
Pass it in another parameter list.
val testDS = spark.createDataset(testDSArray)(Encoders.INT)
Don't pass it, and let Scala's implicit mechanism resolve it.
import spark.implicits._
val testDS = spark.createDataset(testDSArray)
Convert the Java List to a Scala one first.
import collection.JavaConverters._
import spark.implicits._
val testDS = testDSArray.asScala.toDS()
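For reference, here is a minimal self-contained sketch of the first option; the SparkSession setup is an assumption added for completeness, not part of the original answer:
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

val spark = SparkSession.builder.master("local[*]").appName("listToDS").getOrCreate()

// The Encoder sits in createDataset's second, implicit parameter list,
// so it can be supplied explicitly here or resolved via spark.implicits._.
val testDSArray: java.util.List[Integer] = new java.util.ArrayList[Integer]()
testDSArray.add(4)
testDSArray.add(7)
testDSArray.add(10)

val testDS: Dataset[Integer] = spark.createDataset(testDSArray)(Encoders.INT)
testDS.show()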

Scala - Encoder missing for type stored in dataset

I am trying to run the following command in Spark 2.2:
val x_test0 = cn_train.map( { case row => row.toSeq.toArray } )
And I keep getting the following error:
error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
I have already imported implicits._ through the following commands:
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
The error message tells you that it cannot find an Encoder for a heterogeneous Array to save it in a Dataset. But you can get an RDD of Arrays like this:
cn_train.rdd.map{ row => row.toSeq.toArray }
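A minimal sketch of how the result of that workaround is typed, assuming cn_train is a DataFrame (Dataset[Row]); dropping to the RDD API sidesteps the Encoder requirement entirely:
// Row.toSeq returns Seq[Any], so the result is an RDD of Array[Any] --
// no Encoder is needed because the RDD API is not Encoder-based.
val x_test0: org.apache.spark.rdd.RDD[Array[Any]] =
  cn_train.rdd.map(row => row.toSeq.toArray)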

Scala not able to save as sequence file in RDD, as per doc it is allowed

I am using Spark 1.6. According to the official doc it is possible to save an RDD in sequence file format; however, for my RDD textFile I notice:
scala> textFile.saveAsSequenceFile("products_sequence")
<console>:30: error: value saveAsSequenceFile is not a member of org.apache.spark.rdd.RDD[String]
I googled and found similar discussions that seem to suggest this works in PySpark. Is my understanding of the official doc wrong? Can saveAsSequenceFile() be used in Scala?
saveAsSequenceFile is only available when you have key-value pairs in the RDD. The reason for this is that it is defined in PairRDDFunctions:
https://spark.apache.org/docs/2.1.1/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
You can see that the API definition takes a K and a V.
If you change your code above to:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.rdd._

object SequenceFile extends App {
  val conf = new SparkConf().setAppName("sequenceFile").setMaster("local[1]")
  val sc = new SparkContext(conf)
  val rdd: RDD[(String, String)] = sc.parallelize(List(("foo", "foo1"), ("bar", "bar1"), ("baz", "baz1")))
  rdd.saveAsSequenceFile("foo.seq")
  sc.stop()
}
This works perfectly and you will get the foo.seq output. The reason the above works is that the RDD holds key-value pairs and is not just an RDD[String].
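If the starting point really is an RDD[String] like the question's textFile, one common pattern (a sketch, not part of the original answer) is to pair each line with a NullWritable key so that saveAsSequenceFile becomes available:
import org.apache.hadoop.io.NullWritable

// Turning RDD[String] into RDD[(NullWritable, String)] brings the implicit
// conversion that provides saveAsSequenceFile into scope.
textFile
  .map(line => (NullWritable.get(), line))
  .saveAsSequenceFile("products_sequence")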

spark error RDD type not found when creating RDD

I am trying to create an RDD of case class objects. Eg.,
// sqlContext from the previous example is used in this example.
// createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
import sqlContext.createSchemaRDD
val people: RDD[Person] = ... // An RDD of case class objects, from the previous example.
// The RDD is implicitly converted to a SchemaRDD by createSchemaRDD, allowing it to be stored using Parquet.
people.saveAsParquetFile("people.parquet")
I am trying to complete the part from the previous example by writing:
case class Person(name: String, age: Int)
// Create an RDD of Person objects and register it as a table.
val people: RDD[Person] = sc.textFile("/user/root/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")
I get the following error:
<console>:28: error: not found: type RDD
val people: RDD[Person] =sc.textFile("/user/root/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
Any idea on what went wrong?
Thanks in advance!
The issue here is the explicit RDD[Person] type annotation. It looks like RDD isn't imported by default in spark-shell, which is why Scala is complaining that it can't find the RDD type. Try running import org.apache.spark.rdd.RDD first.
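With that import in place the snippet from the question compiles; a minimal sketch, assuming the same people.txt layout and a spark-shell session where sc is available:
import org.apache.spark.rdd.RDD

case class Person(name: String, age: Int)

// Each line of people.txt is assumed to be "name,age".
val people: RDD[Person] = sc.textFile("/user/root/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))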