Here is a working example of how Either is used:
val a: Either[Int, String] = {
  if (true)
    Left(42) // return an Int
  else
    Right("Hello, world") // return a String
}
But the below doesn't work:
The condition "text" is just to determine if the input file is text file or parquet file
val a: Either[org.apache.spark.rdd.RDD[String], org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]] = {
  if (text)
    spark.sparkContext.textFile(input_path + "/lineitem.tbl") // read in text file as rdd
  else
    sparkSession.read.parquet(input_path + "/lineitem").rdd //read in parquet file as df, convert to rdd
}
It gives me type mismatch errors:
<console>:33: error: type mismatch;
found : org.apache.spark.rdd.RDD[String]
required: scala.util.Either[org.apache.spark.rdd.RDD[String],org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]]
spark.sparkContext.textFile(input_path + "/lineitem.tbl") // read in text file as rdd
^
<console>:35: error: type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
required: scala.util.Either[org.apache.spark.rdd.RDD[String],org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]]
sparkSession.read.parquet(input_path + "/lineitem").rdd //read in parquet file as df, convert to rdd
Your working example tells you exactly what to do. Just wrap the two expressions returned by Spark in Left and Right:
val a: Either[org.apache.spark.rdd.RDD[String], org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]] = {
  if (text)
    Left(spark.sparkContext.textFile(input_path + "/lineitem.tbl")) // read in text file as rdd
  else
    Right(sparkSession.read.parquet(input_path + "/lineitem").rdd) //read in parquet file as df, convert to rdd
}
Left and Right are two classes, both extending Either. You can create instances with new Left(expression) and new Right(expression). Since both of them are case classes, the new keyword can be omitted and you can simply use Left(expression) and Right(expression).
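A minimal sketch (not from the original post) of how such an Either can be consumed afterwards, e.g. with a pattern match:
a match {
  case Left(textRdd) => println(s"read a text file with ${textRdd.count()} lines") // RDD[String]
  case Right(rowRdd) => println(s"read a parquet file with ${rowRdd.count()} rows") // RDD[Row]
}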
I'm trying to rearrange the columns of a DataFrame in Spark Scala with this code:
def performTransformations(commonArgs: Map[String, Any], dataDf: Dataset[Row]): Dataset[Row] = {
  // Create local var as a copy of data
  var data = dataDf
  // ... all the transformations here ...
  val data2 = data.select(reorderedColNames: _*)
  data = data2
}
reorderedColNames is an array that has all the column names in the order I want.
But I am getting this error:
error: type mismatch;
[ERROR] found : Unit
[ERROR] required: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
How can I fix this? Thanks.
I have tried to rearrange the columns with other methods, but I wasn't able to.
According to your comment, the problem is that your function is declared to return Dataset[Row], but it actually returns Unit, because the last statement of the function is a variable assignment (data = data2).
Change your function to:
def performTransformations(commonArgs: Map[String, Any], dataDf: Dataset[Row]): Dataset[Row] = {
  // Create local var as a copy of data
  var data = dataDf
  // all the transformations here
  data.select(reorderedColNames: _*)
}
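Since the last expression of a Scala method is its return value, the local var can also be dropped entirely; a minimal sketch (assuming reorderedColNames is in scope and standing in for whatever transformations you apply):
def performTransformations(commonArgs: Map[String, Any], dataDf: Dataset[Row]): Dataset[Row] =
  dataDf
    // ... all the transformations here ...
    .select(reorderedColNames: _*) // the last expression is what the method returns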
When I try to decompress a gzip file, I get an error.
My code:
val file_inp = new FileInputStream("Textfile.txt.gzip")
val file_out = new FileOutputStream("Textfromgzip.txt")
val gzInp = new GZIPInputStream(new BufferedInputStream(file_inp))
while (gzInp.available != -1) {
  file_out.write(gzInp.read)
}
file_out.close()
Output:
scala:25: error: ambiguous reference to overloaded definition,
both method read in class GZIPInputStream of type (x$1: Array[Byte], x$2: Int, x$3:
Int)Int
and method read in class InflaterInputStream of type ()Int
match expected type ?
file_out.write(gzInp.read)
^
one error found
If anyone knows about this error, please help me.
A working model for gzip compression and decompression in Scala:
// imports used throughout the example
import java.io.{FileInputStream, FileOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// create a text file and write something to it
val file_txt = new FileOutputStream("Textfile1.txt")
file_txt.write("hello this is a sample text ".getBytes)
file_txt.write(10) // 10 -> '\n', i.e. a newline
file_txt.write("hello it is second line".getBytes)
file_txt.write(10)
file_txt.write("the good day".getBytes)
file_txt.close()
// gzip the file (InputStream.readAllBytes requires Java 9+)
val file_ip = new FileInputStream("Textfile1.txt")
val file_gzOut = new GZIPOutputStream(new FileOutputStream("Textfile1.txt.gz"))
file_gzOut.write(file_ip.readAllBytes)
file_ip.close()
file_gzOut.close()
// extract the gzip file
val gzFile = new GZIPInputStream(new FileInputStream("Textfile1.txt.gz"))
val txtFile = new FileOutputStream("Textfrom_gz.txt")
txtFile.write(gzFile.readAllBytes)
gzFile.close()
txtFile.close()
Don't forget to close the files; leaving them open leads to "truncated gzip input".
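As for the error in the question: file_out.write is itself overloaded, so the compiler cannot decide which overload of read is meant when it is referenced without parentheses. A minimal sketch (not from the original answer) of a copy loop that avoids this by calling read() explicitly and stopping at -1 (end of stream):
import java.io.{BufferedInputStream, FileInputStream, FileOutputStream}
import java.util.zip.GZIPInputStream

val gzInp = new GZIPInputStream(new BufferedInputStream(new FileInputStream("Textfile.txt.gzip")))
val fileOut = new FileOutputStream("Textfromgzip.txt")
var b = gzInp.read() // explicit () selects the zero-argument overload returning Int
while (b != -1) {
  fileOut.write(b) // write one byte at a time
  b = gzInp.read()
}
gzInp.close()
fileOut.close()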
I'm using Spark. I want to save the value 2.484, which is repeated 13849 times, in a parquet file instead of printing it to the console.
implicit class Rep(n: Int) {
  def times[A](f: => A): Seq[A] = { 1 to n map (_ => f) }
}
val myHis = 13849.times { println("2.4848911616270923") }
This code repeats the value 2.484. How can I save it in a parquet file?
Try this:
import org.apache.spark.sql._
val myHis: Dataset[String] = ???
val path: String = ??? // the path you want
myHis.write.mode(SaveMode.Overwrite).parquet(path)
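A more concrete sketch that fills in the ??? placeholders; the session name and output path are assumptions, not from the original answer:
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("repeat-to-parquet").getOrCreate()
import spark.implicits._

// build a Dataset[String] with the value repeated 13849 times, instead of println-ing it
val myHis = Seq.fill(13849)("2.4848911616270923").toDS()

myHis.write.mode(SaveMode.Overwrite).parquet("/tmp/myHis.parquet") // hypothetical path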
How do I convert a Dataset object to a DataFrame? In my example, I read a JSON file into a DataFrame and convert it to a Dataset. In the Dataset I add an additional attribute (newColumn) and then convert it back to a DataFrame. Here is my example code:
val empData = sparkSession.read.option("header", "true").option("inferSchema", "true").option("multiline", "true").json(filePath)
.....
import sparkSession.implicits._
val res = empData.as[Emp]
//for (i <- res.take(4)) println(i.name + " ->" + i.newColumn)
val s = res.toDF();
s.printSchema()
}
case class Emp(name: String, gender: String, company: String, address: String) {
val newColumn = if (gender == "male") "Not-allowed" else "Allowed"
}
But I expected the new column newColumn to appear in the s.printSchema() output, and it is not happening. Why? How can I achieve this?
The schema of the output with a Product encoder is determined solely by its constructor signature. Therefore anything that happens in the class body is simply discarded.
You can do this instead:
empData.map(x => (x, x.newColumn)).toDF("value", "newColumn")
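An alternative sketch (not from the original answer): compute the extra column with DataFrame functions instead of in the case-class body, assuming a gender column exists in the JSON:
import org.apache.spark.sql.functions.{col, lit, when}

val s = empData.toDF()
  .withColumn("newColumn", when(col("gender") === "male", lit("Not-allowed")).otherwise(lit("Allowed")))
s.printSchema() // newColumn now appears in the schema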
I am working with Spark and Scala, and there is a requirement to save a Map[String, String] to disk so that a different Spark application can read it.
(x,1),(y,2)...
To Save:
sc.parallelize(itemMap.toSeq).coalesce(1).saveAsTextFile(fileName)
I am doing a coalesce as the data is only 450 rows.
But when I read it back, I am not able to convert it back to a Map[String, String]:
val myMap = sc.textFile(fileName).zipWithUniqueId().collect.toMap
the data comes back as:
((x,1),0),((y,2),1)...
What is the possible solution?
Thanks.
Loading a text file results in RDD[String], so you will have to deserialize your string representations of the tuples.
You can either parse the string representation of each tuple, "(v1,v2)", or change your save operation to add a delimiter between the two tuple values. To parse the string directly:
val d = spark.sparkContext.textFile(fileName)
val myMap = d.map(s => {
  val parsedVals = s.substring(1, s.length - 1).split(",")
  (parsedVals(0), parsedVals(1))
}).collect.toMap
Alternatively, you can change your save operation to create a delimiter (like a comma) and parse the structure that way:
// parallelize first: a plain Seq has no saveAsTextFile
sc.parallelize(itemMap.toSeq.map(kv => kv._1 + "," + kv._2)).coalesce(1).saveAsTextFile(fileName)
val myMap = spark.sparkContext.textFile(fileName)
  .map(_.split(","))
  .map(d => (d(0), d(1)))
  .collect.toMap
Method "collectAsMap" exists in "PairRDDFunctions" class, means, applicable only for RDD with two values RDD[(K, V)].
If this function call is required, can be organized with code below. Dataframe is used for store in csv format ant avoid hand-made parsing
import spark.implicits._ // needed for toDF on an RDD

val originalMap = Map("x" -> 1, "y" -> 2)
// write
spark.sparkContext.parallelize(originalMap.toSeq).coalesce(1).toDF("k", "v").write.csv(path)
// read
val restoredDF = spark.read.csv(path)
val restoredMap = restoredDF.rdd.map(r => (r.getString(0), r.getString(1))).collectAsMap()
println("restored map: " + restoredMap)
Output:
restored map: Map(y -> 2, x -> 1)