How to create a column reference dynamically? - scala

I have DataFrame df with the following structure:
root
|-- author: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- client: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- outbound_link: array (nullable = true)
| |-- element: string (containsNull = true)
|-- url: string (nullable = true)
I run this code:
val sourceField = "outbound_link" // set automatically
val targetField = "url" // set automatically
val nodeId = "client" // set automatically
val result = df.as("df1").join(df.as("df2"),
    $"df1."+sourceField === $"df2."+targetField
  ).groupBy(
    ($"df1."+nodeId).as("nodeId_1"),
    ($"df2."+nodeId).as("nodeId_2")
  )
  .agg(
    count("*") as "value", max($"df1."+timestampField) as "timestamp"
  )
  .toDF("source", "target", "value", "timestamp")
But I get the error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: syntax error in attribute name: df1.;
For some reason, the variables sourceField and targetField are not visible inside the join operation. These variables are not empty and contain the names of fields. I must use variables because I define them automatically in a previous step of the code.

An interesting case indeed. Look at $"df1."+sourceField and think about when $"df1." gets converted to a Column: the $ interpolator is applied to "df1." alone (and fails right there), before any concatenation with sourceField can happen.
scala> val sourceField = "id"
sourceField: String = id
scala> $"df1."+sourceField
org.apache.spark.sql.AnalysisException: syntax error in attribute name: df1.;
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:151)
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:170)
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.quotedString(unresolved.scala:142)
at org.apache.spark.sql.Column.<init>(Column.scala:137)
at org.apache.spark.sql.ColumnName.<init>(Column.scala:1203)
at org.apache.spark.sql.SQLImplicits$StringToColumn.$(SQLImplicits.scala:45)
... 55 elided
Replace $"df1."+sourceField to use col or column functions and you should be fine.
scala> col(s"df1.$sourceField")
res7: org.apache.spark.sql.Column = df1.id
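Putting that back into the original query, the whole thing can be written with string interpolation and col. This is a sketch that keeps the question's variable names (including timestampField, which is assumed to be set earlier, as in the question):
import org.apache.spark.sql.functions.{col, count, max}

// Build every column reference with col + string interpolation instead of $"df1." + field
val result = df.as("df1").join(df.as("df2"),
    col(s"df1.$sourceField") === col(s"df2.$targetField")
  ).groupBy(
    col(s"df1.$nodeId").as("nodeId_1"),
    col(s"df2.$nodeId").as("nodeId_2")
  )
  .agg(count("*").as("value"), max(col(s"df1.$timestampField")).as("timestamp"))
  .toDF("source", "target", "value", "timestamp")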

Related

Select column based on Map

I have the following dataframe df with the following schema:
|-- type: string (nullable = true)
|-- record_sales: array (nullable = false)
| |-- element: string (containsNull = false)
|-- record_marketing: array (nullable = false)
| |-- element: string (containsNull = false)
and a map
val typemap = Map("sales" -> "record_sales", "marketing" -> "record_marketing")
I want a new column "record" that is either the value of record_sales or record_marketing based on the value of type.
I've tried some variants of this:
val typeMapCol = typedLit(typemap)
val df2 = df.withColumn("record", col(typeMapCol(col("type"))))
But nothing has worked. Does anyone have any idea? Thanks!
You can iterate over the map typemap and use the when function to build case/when expressions depending on the value of the type column, then pick the first non-null result with coalesce:
val recordCol = typemap.map{case (k,v) => when(col("type") === k, col(v))}.toSeq
val df2 = df.withColumn("record", coalesce(recordCol: _*))
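For reference, with the typemap above this expands to the equivalent hand-written expression (a sketch of what the generated column looks like):
import org.apache.spark.sql.functions.{coalesce, col, when}

// Equivalent to the map-driven version for the two keys in typemap
val df2 = df.withColumn("record",
  coalesce(
    when(col("type") === "sales", col("record_sales")),
    when(col("type") === "marketing", col("record_marketing"))
  ))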

Error while adding a new utf8 string column to Row in Scala spark

I am trying to add a new column in each row of DataFrame like this
def addNamespace(iter: Iterator[Row]): Iterator[Row] = {
  iter.map (row => {
    println(row.getString(0))
    // Row.fromSeq(row.toSeq ++ Array[String]("shared"))
    val newseq = row.toSeq ++ Array[String]("shared")
    Row(newseq: _*)
  })
  iter
}
def transformDf(source: DataFrame)(implicit spark: SparkSession): DataFrame = {
  val newSchema = StructType(source.schema.fields ++ Array(StructField("namespace", StringType, nullable = true)))
  val df = spark.sqlContext.createDataFrame(source.rdd.mapPartitions(addNamespace), newSchema)
  df.show()
  df
}
But I keep getting this error on the line df.show(): Caused by: java.lang.RuntimeException: org.apache.spark.unsafe.types.UTF8String is not a valid external type for schema of string
Can somebody please help me figure this out? I have searched around in multiple posts, but whatever I have tried gives me this error.
I have also tried val again = sourceDF.withColumn("namespace", functions.lit("shared")) but it has the same issue.
Schema of already read data
root
|-- name: string (nullable = true)
|-- data: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- description: string (nullable = true)
| |-- activates_on: timestamp (nullable = true)
| |-- expires_on: timestamp (nullable = true)
| |-- created_by: string (nullable = true)
| |-- created_on: timestamp (nullable = true)
| |-- updated_by: string (nullable = true)
| |-- updated_on: timestamp (nullable = true)
| |-- properties: map (nullable = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
Caused by: java.lang.RuntimeException:
org.apache.spark.unsafe.types.UTF8String is not a valid external type
for schema of string
means Spark cannot interpret the value as a string type for the newly added "namespace" column.
This clearly indicates a datatype mismatch error at the Catalyst level.
See the Spark code here:
override def eval(input: InternalRow): Any = {
  val result = child.eval(input)
  if (checkType(result)) {
    result
  } else {
    throw new RuntimeException(s"${result.getClass.getName}$errMsg")
  }
}
and the error message is s" is not a valid external type for schema of ${expected.catalogString}".
So a UTF8String is not a real String: you need to encode/decode it before passing it as a string type, otherwise Catalyst will not understand what you are passing.
How to fix it?
The SO answer below addresses how to encode/decode between UTF8String and String; you may need to adapt the suitable solution to your case.
https://stackoverflow.com/a/5943395/647053
string decode utf-8
Note: an online UTF-8 encoder/decoder tool is handy for pasting in sample data and converting it to a string. Try that first.
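As a minimal sketch of that idea applied to the question's addNamespace (an assumption on my part, not code from the linked answer): convert any UTF8String values back to plain String before rebuilding the Row, and return the mapped iterator.
import org.apache.spark.sql.Row
import org.apache.spark.unsafe.types.UTF8String

def addNamespace(iter: Iterator[Row]): Iterator[Row] =
  iter.map { row =>
    // Decode internal UTF8String values into ordinary Strings before rebuilding the Row
    val decoded = row.toSeq.map {
      case s: UTF8String => s.toString
      case other         => other
    }
    Row((decoded :+ "shared"): _*)
  }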

select column of a dataframe using udf

I use spark-shell and want to create a dataframe (df2) from another dataframe (df1) using select and a udf. But there is an error when I try to show df2 with df2.show(1).
var df1 = sql(s"select * from table_1")
val slice = udf ((items: Array[String]) => if (items == null) items
  else {
    if (items.size <= 20)
      items
    else
      items.slice(0, 20)
  })
var df2 = df1.select($"col1", slice($"col2"))
and the df1 schema is:
scala> df1.printSchema
root
|-- col1: string (nullable = true)
|-- col2: array (nullable = true)
| |-- element: string (containsNull = true)
scala> df2.printSchema
root
|-- col1: string (nullable = true)
|-- UDF(col2): array (nullable = true)
| |-- element: string (containsNull = true)
error:
Failed to execute user defined function($anonfun$1: (array<string>) => array<string>)
Using Seq[String] instead of Array[String] in the udf resolved the issue.
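For completeness, the corrected udf only changes the parameter type; the logic stays the same:
import org.apache.spark.sql.functions.udf

// Spark passes array columns to Scala udfs as Seq, not Array
val slice = udf((items: Seq[String]) =>
  if (items == null) items
  else if (items.size <= 20) items
  else items.slice(0, 20))

var df2 = df1.select($"col1", slice($"col2"))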

How to get an element in WrappedArray: result of Dataset.select("x").collect()?

I am a beginner in Spark/Scala. I would like to extract a value (Double) from the Array selected from a Dataset. The simplified major steps are shown below. How can I extract each value [Double] in the
last val wpA? Something like val p1 = wpA(1). I failed to convert it to a normal array with wpA.toArray.
Thank you in advance for your help.
case class Event(eventId: Int, n_track: Int, px:ArrayBuffer[Double],py: ArrayBuffer[Double], pz: ArrayBuffer[Double],ch: ArrayBuffer[Int], en: ArrayBuffer[Double])
---
val rawRdd = sc.textFile("expdata/rawdata.bel").map(_.split("\n"))
val eventRdd = rawRdd.map(x => buildEvent(x(0).toString))
val dataset = sqlContext.createDataset[Event](eventRdd)
dataset.printSchema()
root
|-- eventId: integer (nullable = false)
|-- n_track: integer (nullable = false)
|-- px: array (nullable = true)
| |-- element: double (containsNull = false)
|-- py: array (nullable = true)
| |-- element: double (containsNull = false)
|-- pz: array (nullable = true)
| |-- element: double (containsNull = false)
|-- ch: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- en: array (nullable = true)
| |-- element: double (containsNull = false)
val dataFrame = dataset.select("px")
val dataRow = dataFrame.collect()
val wpA = dataRow(1)(0)
println(wpA)
WrappedArray(-0.99205, 0.379417, 0.448819,.....)
When you write:
val wpA = dataRow(1)(0)
You get a variable of type Any, because org.apache.spark.sql.Row.apply(Int) (the method called here on the result of dataRow(1)) returns Any.
Since you know the expected type of the first item (index 0) of this row, you should use Row.getAs[T](Int) and indicate that you expect a WrappedArray. Then the compiler will know that wpA is an array and you'll be able to use any of its methods (including the apply method, which takes an Int and can be called with parens alone):
import scala.collection.mutable
val wpA = dataRow(1).getAs[mutable.WrappedArray[Double]](0)
println(wpA) // WrappedArray(-0.99205, 0.379417, 0.448819,.....)
println(wpA(0)) // -0.99205
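Once wpA is typed as a WrappedArray[Double], the usual collection methods work as well, so the conversion the question attempted now succeeds:
val asArray: Array[Double] = wpA.toArray   // compiles now, since wpA is no longer typed as Any
val firstThree = wpA.take(3)               // WrappedArray(-0.99205, 0.379417, 0.448819)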

Renaming column names of a DataFrame in Spark Scala

I am trying to convert all the headers / column names of a DataFrame in Spark-Scala. As of now I have come up with the following code, which only replaces a single column name.
for (i <- 0 to origCols.length - 1) {
  df.withColumnRenamed(
    df.columns(i),
    df.columns(i).toLowerCase
  );
}
If structure is flat:
val df = Seq((1L, "a", "foo", 3.0)).toDF
df.printSchema
// root
// |-- _1: long (nullable = false)
// |-- _2: string (nullable = true)
// |-- _3: string (nullable = true)
// |-- _4: double (nullable = false)
the simplest thing you can do is to use toDF method:
val newNames = Seq("id", "x1", "x2", "x3")
val dfRenamed = df.toDF(newNames: _*)
dfRenamed.printSchema
// root
// |-- id: long (nullable = false)
// |-- x1: string (nullable = true)
// |-- x2: string (nullable = true)
// |-- x3: double (nullable = false)
If you want to rename individual columns you can use either select with alias:
df.select($"_1".alias("x1"))
which can be easily generalized to multiple columns:
val lookup = Map("_1" -> "foo", "_3" -> "bar")
df.select(df.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*)
or withColumnRenamed:
df.withColumnRenamed("_1", "x1")
which can be combined with foldLeft to rename multiple columns:
lookup.foldLeft(df)((acc, ca) => acc.withColumnRenamed(ca._1, ca._2))
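Applied to the original question (lowercasing every column name), either form works, for example:
// rename everything at once
val lowered = df.toDF(df.columns.map(_.toLowerCase): _*)

// or fold withColumnRenamed over the columns
val lowered2 = df.columns.foldLeft(df)((acc, c) => acc.withColumnRenamed(c, c.toLowerCase))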
With nested structures (structs) one possible option is renaming by selecting a whole structure:
val nested = spark.read.json(sc.parallelize(Seq(
"""{"foobar": {"foo": {"bar": {"first": 1.0, "second": 2.0}}}, "id": 1}"""
)))
nested.printSchema
// root
// |-- foobar: struct (nullable = true)
// | |-- foo: struct (nullable = true)
// | | |-- bar: struct (nullable = true)
// | | | |-- first: double (nullable = true)
// | | | |-- second: double (nullable = true)
// |-- id: long (nullable = true)
@transient val foobarRenamed = struct(
  struct(
    struct(
      $"foobar.foo.bar.first".as("x"), $"foobar.foo.bar.second".as("y")
    ).alias("point")
  ).alias("location")
).alias("record")
nested.select(foobarRenamed, $"id").printSchema
// root
// |-- record: struct (nullable = false)
// | |-- location: struct (nullable = false)
// | | |-- point: struct (nullable = false)
// | | | |-- x: double (nullable = true)
// | | | |-- y: double (nullable = true)
// |-- id: long (nullable = true)
Note that it may affect nullability metadata. Another possibility is to rename by casting:
nested.select($"foobar".cast(
"struct<location:struct<point:struct<x:double,y:double>>>"
).alias("record")).printSchema
// root
// |-- record: struct (nullable = true)
// | |-- location: struct (nullable = true)
// | | |-- point: struct (nullable = true)
// | | | |-- x: double (nullable = true)
// | | | |-- y: double (nullable = true)
or:
import org.apache.spark.sql.types._
nested.select($"foobar".cast(
StructType(Seq(
StructField("location", StructType(Seq(
StructField("point", StructType(Seq(
StructField("x", DoubleType), StructField("y", DoubleType)))))))))
).alias("record")).printSchema
// root
// |-- record: struct (nullable = true)
// | |-- location: struct (nullable = true)
// | | |-- point: struct (nullable = true)
// | | | |-- x: double (nullable = true)
// | | | |-- y: double (nullable = true)
For those of you interested in the PySpark version (it is actually the same in Scala):
merchants_df_renamed = merchants_df.toDF(
'merchant_id', 'category', 'subcategory', 'merchant')
merchants_df_renamed.printSchema()
Result:
root
|-- merchant_id: integer (nullable = true)
|-- category: string (nullable = true)
|-- subcategory: string (nullable = true)
|-- merchant: string (nullable = true)
def aliasAllColumns(t: DataFrame, p: String = "", s: String = ""): DataFrame =
{
t.select( t.columns.map { c => t.col(c).as( p + c + s) } : _* )
}
In case it isn't obvious, this adds a prefix and a suffix to each of the current column names. This can be useful when you have two tables with one or more columns having the same name and you wish to join them while still being able to disambiguate the columns in the resulting table. It sure would be nice if there were a similar way to do this in "normal" SQL.
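A quick usage sketch (a and b are hypothetical DataFrames that share column names, including id):
val left  = aliasAllColumns(a, s = "_a")   // id -> id_a, name -> name_a, ...
val right = aliasAllColumns(b, s = "_b")
val joined = left.join(right, left("id_a") === right("id_b"))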
Suppose the dataframe df has 3 columns id1, name1, price1
and you wish to rename them to id2, name2, price2
val list = List("id2", "name2", "price2")
import spark.implicits._
val df2 = df.toDF(list:_*)
df2.columns.foreach(println)
I found this approach useful in many cases.
Sometimes column names in a SQL Server or MySQL table have the format below:
Ex: Account Number, customer number
But Hive tables do not support column names containing spaces, so use the solution below to rename such columns (for example, Account Number becomes account_number).
Solution:
val renamedColumns = df.columns.map(c => df(c).as(c.replaceAll(" ", "_").toLowerCase()))
df = df.select(renamedColumns: _*)
Two-table join without renaming the join key:
// method 1: create a new DF
day1 = day1.toDF(day1.columns.map(x => if (x.equals(key)) x else s"${x}_d1"): _*)
// method 2: use withColumnRenamed
for ((x, y) <- day1.columns.filter(!_.equals(key)).map(x => (x, s"${x}_d1"))) {
  day1 = day1.withColumnRenamed(x, y)
}
works!