Adding a new column using existing columns in Spark Scala - scala

Hi, I want to add a new column derived from the existing columns in each row of a DataFrame, and I am trying this in Spark Scala like this...
df is a DataFrame containing a variable number of columns, which can only be determined at run time.
// Added new column "docid"
val df_new = appContext.sparkSession.sqlContext.createDataFrame(df.rdd, df.schema.add("docid", DataTypes.StringType))
df_new.map(x => {
  import appContext.sparkSession.implicits._
  val allVals = (0 to x.size).map(x.get(_)).toSeq
  val values = allVals ++ allVals.mkString("_")
  Row.fromSeq(values)
})
But this gives an error in Eclipse itself:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
not enough arguments for method map: (implicit evidence$7: org.apache.spark.sql.Encoder[org.apache.spark.sql.Row])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]. Unspecified value parameter evidence$7.
Please help.

concat_ws from the functions object can help.
This code adds the docid field
df = df.withColumn("docid", concat_ws("_", df.columns.map(df.col(_)):_*))
assuming all columns of df are strings.
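If some columns are not strings, a minimal sketch (an assumption, not part of the original answer) is to cast everything to string before concatenating:
import org.apache.spark.sql.functions.{col, concat_ws}
// Cast every column to string so concat_ws works regardless of the original types
val df_new = df.withColumn(
  "docid",
  concat_ws("_", df.columns.map(c => col(c).cast("string")): _*)
)
This also handles the "variable number of columns" requirement, since df.columns is resolved at run time.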

Related

Find columns to select, for spark.read(), from another Dataset - Spark Scala

I have a Dataset[Year] that has the following schema:
case class Year(day: Int, month: Int, Year: Int)
Is there any way to make a collection of the current schema?
I have tried:
println("Print -> "+ds.collect().toList)
But the result was:
Print -> List([01,01,2022], [31,01,2022])
I expected something like:
Print -> List(Year(01,01,2022), Year(31,01,2022))
I know that with a map I can adjust it, but I am trying to create a generic method that accepts any schema, and for this I cannot add a map doing the conversion.
That is my method:
class SchemeList[A]{
def set[A](ds: Dataset[A]): List[A] = {
ds.collect().toList
}
}
Apparently the method's return type has the correct signature, but when the job runs, it throws an error:
val setYears = new SchemeList[Year]
val YearList: List[Year] = setYears.set(df)
Exception in thread "main" java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to schemas.Schemas$Year
Based on your additional information in your comment:
I need this list to use as variables when creating another dataframe via jdbc (I need to make a specific select within postgresql). Is there a more performant way to pass values from a dataframe as parameters in a select?
Given your initial dataset:
val yearsDS: Dataset[Year] = ???
and that you want to do something like:
val desiredColumns: Array[String] = ???
spark.read.jdbc(..).select(desiredColumns.head, desiredColumns.tail: _*)
You could find the column names of yearsDS by doing:
val desiredColumns: Array[String] = yearsDS.columns
Spark derives this from def schema, which is defined on Dataset; def columns is just a thin wrapper that returns the field names of that schema.
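Putting the two pieces together, a rough sketch could look like this (jdbcUrl, tableName and connectionProperties are placeholders):
val desiredColumns: Array[String] = yearsDS.columns  // e.g. Array("day", "month", "Year")
val fromJdbc = spark.read
  .jdbc(jdbcUrl, tableName, connectionProperties)
  .select(desiredColumns.head, desiredColumns.tail: _*)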
Maybe you got a DataFrame, not a Dataset.
Try using as to transform the DataFrame into a Dataset, like this:
val year = Year(1, 1, 1)
val years = Array(year, year).toList
import spark.implicits._
val df = spark
  .sparkContext
  .parallelize(years)
  .toDF("day", "month", "Year")
  .as[Year]
println(df.collect().toList)
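With the Dataset in hand, the generic method from the question should now return proper case class instances, for example:
val setYears = new SchemeList[Year]
val yearList: List[Year] = setYears.set(df)
println(yearList)  // List(Year(1,1,1), Year(1,1,1))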

Unsupported operation exception from spark: Schema for type org.apache.spark.sql.types.DataType is not supported

Spark Streaming:
I am receiving a dataframe that consists of two columns. The first column is a string that contains a JSON string, and the second column holds the schema for each value in the first column.
Batch: 0
-------------------------------------------
+--------------------+--------------------+
| value| schema|
+--------------------+--------------------+
|{"event_time_abc...|`event_time_abc...|
+--------------------+--------------------+
The table is stored in the val input (an immutable variable). I am using the DataType.fromDDL function to parse the JSON string column with its schema in the following way:
val out= input.select(from_json(col("value").cast("string"),ddl(col("schema"))))
where ddl wraps the predefined Spark (Scala) function DataType.fromDDL(_: String): DataType, which I have registered as a UDF so that I can use it on a whole column instead of a single string only. I have done it in the following way:
val ddl:UserDefinedFunction = udf(DataType.fromDDL(_:String):DataType)
and here is the final transformation on both columns, value and schema, of the input table:
val out = input.select(from_json(col("value").cast("string"),ddl(col("schema"))))
However, I get an exception from the registration at this line:
val ddl:UserDefinedFunction = udf(DataType.fromDDL(_:String):DataType)
The error is:
java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.types.DataType is not supported
If I use:
val out = input.select(from_json(col("value").cast("string"),DataType.fromDDL("`" + "event_time_human2"+"`" +" STRING")).alias("value"))
then it works, but as you can see I am only passing a manually typed string (taken from the schema column) to the function DataType.fromDDL(_: String): DataType.
So how can I apply this function to the whole column without registration, or is there another way to register the function?
EDIT: from_json's first argument requires a column, while its second argument requires a schema and not a column. Hence, I guess a manual approach is required to parse each value field with its corresponding schema field. After some investigation I found out that DataFrames do not support DataType as a column type.
Since a bounty has been set on this question, I would like to provide additional information regarding the data and schema. The schema is defined in DDL (string type) and can be parsed with the fromDDL function. The value is a simple JSON string that will be parsed with the schema we derive using fromDDL.
The basic idea is that each value has its own schema and needs to be parsed with the corresponding schema. A new column should be created where the result will be stored.
Data:
Here is one example of the data:
value = {"event_time_human2":"09:45:00 +0200 09/27/2021"}
schema = "`event_time_human2` STRING"
There is no need to convert to a proper time format; a plain string is fine.
This is in a streaming context, so not all approaches work.
Schemas are applied and validated before runtime, that is, before the Spark code is executed on the executors. Parsed schemas must be part of the execution plan, therefore schema parsing can't be executed dynamically as you intended. This is the reason you see the exception
java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.types.DataType is not supported
only for the UDF. Consequently, DataType.fromDDL should be used only inside driver code and not in runtime/executor code, which is the code within your UDF function. Inside the UDF function, Spark has already applied the schemas you specified on the driver side to the imported data. This is the reason you can't use DataType.fromDDL directly in your UDF: it is essentially useless there. All of the above means that inside UDF functions we can only use primitive Scala/Java types and some wrappers provided by the Spark API, e.g. WrappedArray.
An alternative could be to collect all the schemas on the driver and then build a map of (schema, dataframe) pairs, one for each schema.
Keep in mind that collecting data to the driver is an expensive operation, and it only makes sense if you have a reasonable number of unique schemas, i.e. at most a few thousand. Also, applying these schemas to each dataset has to be done sequentially in the driver, which is quite expensive too, so the suggested solution will only work efficiently if the number of unique schemas is limited.
With that in mind, your code could look as follows:
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.StructType
import spark.implicits._
val df = Seq(
  ("""{"event_time_human2":"09:45:00 +0200 09/27/2021", "name":"Pinelopi"}""", "`event_time_human2` STRING, name STRING"),
  ("""{"first_name":"Pin", "last_name":"Halen"}""", "first_name STRING, last_name STRING"),
  ("""{"code":993, "full_name":"Sofia Loren"}""", "code INT, full_name STRING")
).toDF("value", "schema")

val schemaDf = df.select("schema").distinct()

val dfBySchema = schemaDf.collect().map { row =>
  val schemaValue = row.getString(0)
  val ddl = StructType.fromDDL(schemaValue)
  val filteredDf = df.where($"schema" === schemaValue)
    .withColumn("value", from_json($"value", ddl))
  (schemaValue, filteredDf)
}.toMap
// Map(
// `event_time_human2` STRING, name STRING -> [value: struct<event_time_human2: string, name: string>, schema: string],
// first_name STRING, last_name STRING -> [value: struct<first_name: string, last_name: string>, schema: string],
// code INT, full_name STRING -> [value: struct<code: int, full_name: string>, schema: string]
// )
Explanation: first we gather the unique schemas with schemaDf.collect(). Then we iterate over the schemas and filter the initial df based on the current schema. We also use from_json to convert the current string value column according to that specific schema.
Note that one column can't hold values of different data types; this is why we create a separate df for each schema rather than one final df.
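As a follow-up, each entry of the map can then be processed with its own, now strongly typed, value column; for example, using one of the schema strings above as the key:
val codeDf = dfBySchema("code INT, full_name STRING")
codeDf.select($"value.code", $"value.full_name").show()
// expect a single row: 993, Sofia Loren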

Scala - Encoder missing for type stored in dataset

I am trying to run the following command in Scala 2.2
val x_test0 = cn_train.map( { case row => row.toSeq.toArray } )
And I keep getting the following error:
error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
I have already imported implicits._ through the following commands:
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
The error message tells you that it cannot find an Encoder for a heterogeneous Array to save it in a Dataset. But you can get an RDD of Arrays like this:
cn_train.rdd.map{ row => row.toSeq.toArray }
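If you prefer to stay in the Dataset API, one alternative (a sketch, not part of the original answer) is to provide a kryo-based encoder for the heterogeneous array:
import org.apache.spark.sql.{Encoder, Encoders}
// Kryo can serialize Array[Any], which no built-in encoder supports;
// the resulting Dataset stores the arrays as opaque binary blobs
implicit val arrayEncoder: Encoder[Array[Any]] = Encoders.kryo[Array[Any]]
val x_test0 = cn_train.map(row => row.toSeq.toArray)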

Dropping columns by data type in Scala Spark

df1.printSchema() prints out the column names and the data type that they possess.
df1.drop($"colName") will drop columns by their name.
Is there a way to adapt this command to drop by the data-type instead?
If you are looking to drop specific columns of a dataframe based on their types, the snippet below should help. In this example, I have a dataframe with two columns, of type String and Int respectively. I drop my String field (all fields of type String would be dropped) from the schema based on its type.
import sqlContext.implicits._
val df = sc.parallelize(('a' to 'l').map(_.toString) zip (1 to 10)).toDF("c1","c2")
val newDf = df.schema.fields
  .collect({ case x if x.dataType.typeName == "string" => x.name })
  .foldLeft(df)({ case (dframe, field) => dframe.drop(field) })
The resulting newDf is org.apache.spark.sql.DataFrame = [c2: int], i.e. only the Int column remains.
Here is a fancy way to do it in Scala:
val categoricalFeatColNames = df.schema.fields filter { _.dataType.isInstanceOf[org.apache.spark.sql.types.StringType] } map { _.name }
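Either way, the collected names can then be passed to drop in a single call (a small sketch building on the snippet above):
// Drop every string-typed column at once
val withoutStrings = df.drop(categoricalFeatColNames: _*)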

addition of two dataframe integer values in Scala/Spark

So I'm new to both Scala and Spark so it may be kind of a dumb question...
I have the following code :
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val df = sc.parallelize(List(1,2,3)).toDF();
df.foreach( value => println( value(0) + value(0) ) );
error: type mismatch;
found : Any
required: String
What is wrong with it? How do I tell "this is an integer not an any"?
I tried value(0).toInt but "value toInt is not a member of Any".
I tried List(1:Integer, 2:Integer, 3:Integer) but I cannot convert it into a dataframe afterwards...
Spark Row is an untyped container. If you want to extract anything other than Any, you have to use a typed extractor method or pattern matching over the Row (see Spark: extracting values from a Row):
df.rdd.map(value => value.getInt(0) + value.getInt(0)).collect.foreach(println)
In practice there is rarely a reason to extract these values at all. Instead you can operate directly on the DataFrame:
df.select($"_1" + $"_1")