addition of two dataframe integer values in Scala/Spark - scala

So I'm new to both Scala and Spark, so this may be kind of a dumb question...
I have the following code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val df = sc.parallelize(List(1,2,3)).toDF();
df.foreach( value => println( value(0) + value(0) ) );
error: type mismatch;
found : Any
required: String
What is wrong with it? How do I tell it "this is an Integer, not an Any"?
I tried value(0).toInt but got "value toInt is not a member of Any".
I tried List(1:Integer, 2:Integer, 3:Integer) but I cannot convert that into a DataFrame afterwards...

Spark Row is an untyped container. If you want to extract anything other than Any you have to use a typed extractor method or pattern matching over the Row (see Spark extracting values from a Row):
df.rdd.map(value => value.getInt(0) + value.getInt(0)).collect.foreach(println)
In practice there should be no reason to extract these values at all. Instead you can operate directly on the DataFrame:
df.select($"_1" + $"_1")

Related

How can I convert BinaryType to Array[Byte] when calling Scala UDF in Spark?

I've written the following UDF in Scala:
import java.io.{ByteArrayOutputStream, ByteArrayInputStream}
import java.util.zip.{GZIPInputStream}
def Decompress(compressed: Array[Byte]): String = {
val inputStream = new GZIPInputStream(new ByteArrayInputStream(compressed))
val output = scala.io.Source.fromInputStream(inputStream).mkString
return output
}
val decompressUdf = (compressed: Array[Byte]) => {
Decompress(compressed)
}
spark.udf.register("Decompress", decompressUdf)
I'm then attempting to call the UDF with the following:
val sessionsRawDF =
sessionRawDF
.withColumn("WebsiteSession", decompressUdf(sessionRawDF("body")))
.select(
current_timestamp().alias("ingesttime"),
current_timestamp().cast("date").alias("p_ingestdate"),
col("partition"),
col("enqueuedTime"),
col("WebsiteSession").alias("Json")
)
When I run this, I get the following error:
command-130062350733681:9: error: type mismatch;
found: org.apache.spark.sql.Column
required: Array[Byte]
decompressUdf(col("WebsiteSession")).alias("Json")
I was under the impression Spark would implicitly get the value and convert from the Spark type to Array[Byte] in this case.
Would someone please help me understand what's going on? I've been fighting this for a while and am not sure what else to try.
You need to convert the Scala function to a Spark UDF first, before you can register it. For example:
val decompressUdf = udf(Decompress _)
spark.udf.register("Decompress", decompressUdf)
In fact, there is no need to register the UDF if you're just using it in the DataFrame API. You can simply run the first line and use decompressUdf. Registering is only needed if you want to use the UDF in SQL.
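Putting it together, a minimal sketch of both ways to use it (this assumes the Decompress function and the sessionRawDF with its binary body column from the question, plus a SparkSession named spark):
import org.apache.spark.sql.functions.{col, udf}
// Wrap the plain Scala function so it operates on Columns
val decompressUdf = udf(Decompress _)
// DataFrame API: no registration needed
val withJson = sessionRawDF.withColumn("Json", decompressUdf(col("body")))
// SQL: registration is required
spark.udf.register("Decompress", decompressUdf)
sessionRawDF.createOrReplaceTempView("sessions")
val fromSql = spark.sql("SELECT Decompress(body) AS Json FROM sessions")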

Adding new column using existing one using Spark Scala

Hi, I want to add a new column to each row of a DataFrame using the existing columns. I am trying this in Spark Scala like this...
df is a DataFrame containing a variable number of columns, which can only be determined at run time.
// Added new column "docid"
val df_new = appContext.sparkSession.sqlContext.createDataFrame(df.rdd, df.schema.add("docid", DataTypes.StringType))
df_new.map(x => {
import appContext.sparkSession.implicits._
val allVals = (0 to x.size).map(x.get(_)).toSeq
val values = allVals ++ allVals.mkString("_")
Row.fromSeq(values)
})
But this is giving an error in Eclipse itself:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
not enough arguments for method map: (implicit evidence$7: org.apache.spark.sql.Encoder[org.apache.spark.sql.Row])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]. Unspecified value parameter evidence$7.
Please help.
concat_ws from the functions object can help.
This code adds the docid field
df = df.withColumn("docid", concat_ws("_", df.columns.map(df.col(_)):_*))
assuming all columns of df are strings.
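As a worked sketch (assuming a SparkSession named spark; the toy columns here are made up for illustration):
import org.apache.spark.sql.functions.concat_ws
// Hypothetical two-column DataFrame of strings
val sample = spark.createDataFrame(Seq(("a", "b"), ("c", "d"))).toDF("col1", "col2")
// Join the values of every existing column with "_" into a new docid column
val withDocid = sample.withColumn("docid", concat_ws("_", sample.columns.map(sample.col(_)): _*))
// withDocid now contains ("a", "b", "a_b") and ("c", "d", "c_d")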

Changing columns that are string in Spark GraphFrame

I'm using GraphFrame in Spark 2.0 and Scala.
I need to remove double quotes from the columns that are of string type (out of many columns).
I'm trying to do so using a UDF as follows:
import org.apache.spark.sql.functions.udf
val removeDoubleQuotes = udf( (x:Any) =>
x match{
case s:String => s.replace("\"","")
case other => other
}
)
And I get the following error since type Any is not supported in GraphFrame.
java.lang.UnsupportedOperationException: Schema for type Any is not
supported
What is a workaround for that?
I think you don't actually have a column of type Any, and you can't return different data types from a UDF; a UDF needs to return a single data type.
If your column is String then you can create udf as
import org.apache.spark.sql.functions.udf
val removeDoubleQuotes = udf( (x:String) => s.replace("\"",""))

Converting String RDD to Int RDD

I am new to Scala. I want to know, when processing large datasets with Scala in Spark, whether it is possible to read the data as an Int RDD instead of a String RDD.
I tried the below:
val intArr = sc
.textFile("Downloads/data/train.csv")
.map(line=>line.split(","))
.map(_.toInt)
But I am getting the error:
error: value toInt is not a member of Array[String]
I need to convert to an Int RDD because down the line I need to do the following:
val vectors = intArr.map(p => Vectors.dense(p))
which requires the values to be numeric.
Any kind of help is truly appreciated. Thanks in advance.
As far as I understand, one line should create one vector, so it should go like this:
val result = sc
.textFile("Downloads/data/train.csv")
.map(line => line.split(","))
.map(numbers => Vectors.dense(numbers.map(_.toDouble)))
numbers.map(_.toDouble) converts every element of the array to a Double (Vectors.dense expects an Array[Double]), and Vectors.dense then builds one vector per line.
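A fuller sketch with the import spelled out, keeping the intermediate numeric RDD the question asks for (this assumes org.apache.spark.mllib.linalg.Vectors; swap in org.apache.spark.ml.linalg.Vectors if you are on the ML API):
import org.apache.spark.mllib.linalg.Vectors
// RDD[Array[Double]] -- Vectors.dense expects Double values, not Int
val numericArr = sc
  .textFile("Downloads/data/train.csv")
  .map(_.split(",").map(_.toDouble))
// One dense vector per line
val vectors = numericArr.map(arr => Vectors.dense(arr))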

toDF() not handling RDD

I have an RDD of Rows called RowRDD. I am simply trying to convert it into a DataFrame. From the examples I have seen on the internet in various places, I see that I should be trying RowRDD.toDF(). I am getting the error:
value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
It doesn't work because Row is not a Product type, and createDataFrame with a single RDD argument is defined only for RDD[A] where A <: Product.
If you want to use RDD[Row] you have to provide a schema as the second argument. If you think about it, this should be obvious: Row is just a container of Any, and as such it doesn't provide enough information for schema inference.
Assuming this is the same RDD as defined in your previous question, the schema is easy to generate:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.rdd.RDD
val rowRdd: RDD[Row] = ???
val schema = StructType(
(1 to rowRdd.first.size).map(i => StructField(s"_$i", StringType, false))
)
val df = sqlContext.createDataFrame(rowRdd, schema)
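A minimal runnable sketch of the same approach, with a made-up three-column RDD[Row] standing in for the real one:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
// Hypothetical stand-in data
val sampleRowRdd = sc.parallelize(Seq(Row("a", "b", "c"), Row("d", "e", "f")))
val sampleSchema = StructType(
  (1 to sampleRowRdd.first.size).map(i => StructField(s"_$i", StringType, false))
)
val sampleDf = sqlContext.createDataFrame(sampleRowRdd, sampleSchema)
sampleDf.printSchema()  // _1, _2, _3: all strings, non-nullable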