In Scala I have a List[String] that I want to add as a new column to an existing DataFrame.
Original DF:
Name  | Date
======|===========
Rohan | 2007-12-21
...   | ...
...   | ...
Suppose I want to add a new column, Department.
Expected DF:
Name | Date       | Department
=====|============|===========
Rohan| 2007-12-21 | Comp
...  | ...        | ...
...  | ...        | ...
How can I do this in Scala?
One way to do it is to create a DataFrame from the names and the list values, and then join the two DataFrames on the Name column.
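A minimal sketch of that idea, assuming a SparkSession named spark and that the existing DataFrame is called originalDF (both names, and the pairs below, are placeholders):
import spark.implicits._
// Illustrative (Name, Department) pairs built from your list
val deptDF = Seq(("Rohan", "Comp")).toDF("Name", "Department")
// A left join keeps every row of the original DataFrame
val withDept = originalDF.join(deptDF, Seq("Name"), "left")
withDept.show()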
This is what eventually solved my issue:
val newrows = dataset.rdd.zipWithIndex.map(_.swap)
  .join(spark.sparkContext.parallelize(results).zipWithIndex.map(_.swap))
  .values
  .map { case (row: Row, x: String) => Row.fromSeq(row.toSeq :+ x) }
I would still appreciate an exact explanation of how it works.
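Roughly: zipWithIndex pairs every row (and every list element) with its positional index, swap turns that index into the key, join matches row i with list element i, and Row.fromSeq(row.toSeq :+ x) appends the element to the row. newrows is an RDD[Row], so to get a DataFrame back you still have to extend the schema; a sketch, assuming the new column is called "Department":
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
// Extend the old schema with the new string column (the column name is assumed here)
val newSchema = StructType(dataset.schema.fields :+ StructField("Department", StringType, nullable = true))
// Rebuild a DataFrame from the RDD[Row] produced above
val newDF = spark.createDataFrame(newrows, newSchema)
newDF.show()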
Related
I need to add a new column to DataFrame DF1, but the new column's value should be calculated using the values of other columns in that DataFrame. Which of the other columns are to be used is specified in another DataFrame, DF2.
e.g. DF1:
|protocolNo|serialNum|testMethod |testProperty|
+----------+---------+-----------+------------+
|Product1  |AB       |testMethod1|TP1         |
|Product2  |CD       |testMethod2|TP2         |
DF2:
|action|type |value                    |exploded     |
+------+-----+-------------------------+-------------+
|append|hash |[protocolNo]             |protocolNo   |
|append|text |_                        |_            |
|append|hash |[serialNum,testProperty] |serialNum    |
|append|hash |[serialNum,testProperty] |testProperty |
The value of the exploded column in DF2 is a column name of DF1 whenever the value of the type column is hash.
Required:
A new column should be created in DF1, and its value should be calculated as
hash[protocolNo]_hash[serialNumTestProperty], where in place of each column name its corresponding row value should be used.
E.g. for row 1 of DF1, the column value should be
hash[Product1]_hash[ABTP1]
which after hashing will result in something like abc-df_egh-45e.
The above procedure should be followed for each and every row of DF1.
I've tried using map and the withColumn function with a UDF on DF1, but inside the UDF the value of the outer DataFrame is not accessible (it gives a NullPointerException), and I'm also not able to pass a DataFrame as input to a UDF.
Input DFs would be DF1 and DF2 as mentioned above.
Desired Output DF:
|protocolNo|serialNum|testMethod |testProperty|newColumn     |
+----------+---------+-----------+------------+--------------+
|Product1  |AB       |testMethod1|TP1         |abc-df_egh-4je|
|Product2  |CD       |testMethod2|TP2         |dfg-df_ijk-r56|
The newColumn values shown are after hashing.
Instead of working with DF2 directly, you can translate DF2 into case class specifications, e.g.
case class Spec(columnName: String, inputColumns: Seq[String], actions: String*)
Create instances of the above class:
val specifications = Seq(
  Spec("new_col_name", Seq("serialNum", "testProperty"), "hash", "append")
)
Then you can fold the specifications over your DataFrame:
import org.apache.spark.sql.DataFrame
val transformed = specifications
  .foldLeft(dtFrm)((df: DataFrame, spec: Spec) => df.transform(transformColumn(spec)))
def transformColumn(spec: Spec)(df: DataFrame): DataFrame = {
  spec.actions.foldLeft(df)((df: DataFrame, action: String) =>
    action match {
      // match on the action here and append the derived column with df.withColumn
      case "append" => df
      case _        => df
    })
}
Syntax may not be correct
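As a hedged illustration only, the "append a hashed combination of the input columns" action could look like this (appendHashColumn is a hypothetical helper introduced here, built on the Spec class above):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat_ws, hash}
// Sketch: hash each input column and join the hashes with "_" in a new column
def appendHashColumn(spec: Spec)(df: DataFrame): DataFrame =
  df.withColumn(spec.columnName, concat_ws("_", spec.inputColumns.map(c => hash(col(c))): _*))
// e.g. DF1.transform(appendHashColumn(Spec("newColumn", Seq("serialNum", "testProperty"), "hash", "append")))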
Since DF2 holds the column names that will be used to calculate the new column in DF1, I have assumed that DF2 will not be a huge DataFrame.
The first step is to filter DF2 and get the column names that we want to pick from DF1.
val hashColumns = DF2.filter(col("type") === "hash").select(col("exploded")).collect
Now, hashColumns holds the columns that we want to use to calculate the hash in newColumn. hashColumns is an Array[Row]; we need to turn it into a Column that can be applied while creating newColumn in DF1.
val newColumnHash = hashColumns.map(f=>hash(col(f.getString(0)))).reduce(concat_ws("_",_,_))
The above line converts each Row into a Column with the hash function applied to it, and we reduce them by concatenating with "_". Now the task becomes simple: we just need to apply this to DF1.
DF1.withColumn("newColumn",newColumnHash).show(false)
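For completeness, the functions used in the snippets above come from the standard SQL functions object:
import org.apache.spark.sql.functions.{col, concat_ws, hash}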
Hope this helps!
var columnnames= "callStart_t,callend_t" // Timestamp column names are dynamic input.
scala> df1.show()
+------+------------+--------+----------+
|  name| callStart_t|personid| callend_t|
+------+------------+--------+----------+
| Bindu|  1080602418|       2|1080602419|
|Raphel|  1647964576|       5|1647964576|
|   Ram|  1754536698|       9|1754536699|
+------+------------+--------+----------+
Code which I tried:
val newDf = df1.withColumn("callStart_Time", to_utc_timestamp(from_unixtime($"callStart_t"/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))
val newDf = df1.withColumn("callend_Time", to_utc_timestamp(from_unixtime($"callend_t"/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))
Here, I don't want new columns for the conversion (from_unixtime to to_utc_timestamp); I want to convert the existing columns themselves.
Example Output
+------+--------------------+--------+--------------------+
|  name|         callStart_t|personid|           callend_t|
+------+--------------------+--------+--------------------+
| Bindu| 1970-01-13 04:40:02|       2| 1970-01-13 04:40:02|
|Raphel| 1970-01-20 06:16:04|       5| 1970-01-20 06:16:04|
|   Ram| 1970-01-21 11:52:16|       9| 1970-01-21 11:52:16|
+------+--------------------+--------+--------------------+
Note: The Timestamp column names are dynamic.
How can I process each column dynamically?
Just use the same name for the column and it will replace it:
val newDf = df1.withColumn("callStart_t", to_utc_timestamp(from_unixtime($"callStart_t"/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))
val newDf = df1.withColumn("callend_t", to_utc_timestamp(from_unixtime($"callend_t"/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))
To make it dynamic, just use the relevant string. For example:
val colName = "callend_t"
val newDf = df.withColumn(colName , to_utc_timestamp(from_unixtime(col(colName)/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))
For multiple columns you can do:
val columns=Seq("callend_t", "callStart_t")
val newDf = columns.foldLeft(df1){ case (curDf, colName) => curDf.withColumn(colName , to_utc_timestamp(from_unixtime(col(colName)/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))}
Note: as stated in the comments, the division by 1000 is not needed.
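Since the question says the timestamp column names arrive as a comma-separated string (the columnnames variable above), the column list can be derived from it; a small sketch:
import org.apache.spark.sql.functions.{col, from_unixtime, to_utc_timestamp}
val columns = columnnames.split(",").map(_.trim).toSeq
val newDf = columns.foldLeft(df1) { case (curDf, colName) =>
  curDf.withColumn(colName, to_utc_timestamp(from_unixtime(col(colName) / 1000, "yyyy-MM-dd hh:mm:ss"), "Europe/Berlin"))
}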
After a series of validations over a DataFrame,
I obtain a List of String with certain values like this:
List[String]=(lvalue1, lvalue2, lvalue3, ...)
And I have a DataFrame with values like this:
dfield 1 | dfield 2 | dfield 3
______________________________
dvalue1  | dvalue2  | dvalue3
dvalue1  | dvalue2  | dvalue3
I want to append the values of the List at the beginning of my DataFrame, in order to get a new DF with something like this:
dfield 1 | dfield 2 | dfield 3 | dfield 4 | dfield 5 | dfield 6
_______________________________________________________________
lvalue1  | lvalue2  | lvalue3  | dvalue1  | dvalue2  | dvalue3
lvalue1  | lvalue2  | lvalue3  | dvalue1  | dvalue2  | dvalue3
I have found something using a UDF. Could this be correct for my purpose?
Regards.
TL;DR Use select or withColumn with the lit function.
I'd use the lit function with the select operator (or withColumn).
lit(literal: Any): Column Creates a Column of literal value.
A solution could be as follows.
import org.apache.spark.sql.functions.{col, lit}
val values = List("lvalue1", "lvalue2", "lvalue3")
val dfields = values.indices.map(idx => s"dfield ${idx + 1}")
val dataset = Seq(
("dvalue1", "dvalue2", "dvalue3"),
("dvalue1", "dvalue2", "dvalue3")
).toDF("dfield 1", "dfield 2", "dfield 3")
val offsets = dataset.
columns.
indices.
map { idx => idx + values.size + 1 }
val offsetDF = offsets.zip(dataset.columns).
foldLeft(dataset) { case (df, (off, col)) => df.withColumnRenamed(col, s"dfield $off") }
val newcols = values.zip(dfields).
map { case (v, dfield) => lit(v) as dfield } :+ col("*")
scala> offsetDF.select(newcols: _*).show
+--------+--------+--------+--------+--------+--------+
|dfield 1|dfield 2|dfield 3|dfield 4|dfield 5|dfield 6|
+--------+--------+--------+--------+--------+--------+
| lvalue1| lvalue2| lvalue3| dvalue1| dvalue2| dvalue3|
| lvalue1| lvalue2| lvalue3| dvalue1| dvalue2| dvalue3|
+--------+--------+--------+--------+--------+--------+
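The withColumn route mentioned at the start works as well; a sketch reusing values, dfields, offsets and offsetDF from above:
val appended = values.zip(dfields).foldLeft(offsetDF) {
  case (df, (v, dfield)) => df.withColumn(dfield, lit(v))
}
// put the literal columns first, then the renamed original columns
appended.select((dfields ++ offsets.map(off => s"dfield $off")).map(col): _*).show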
I'm trying to change the schema of an existing dataframe to the schema of another dataframe.
DataFrame 1:
Column A | Column B | Column C | Column D
"a" | 1 | 2.0 | 300
"b" | 2 | 3.0 | 400
"c" | 3 | 4.0 | 500
DataFrame 2:
Column K | Column B | Column F
"c" | 4 | 5.0
"b" | 5 | 6.0
"f" | 6 | 7.0
So I want to apply the schema of the first DataFrame to the second. All the columns which are the same remain. The columns in DataFrame 2 that are not in DataFrame 1 get deleted. The others become "NULL".
Output
Column A | Column B | Column C | Column D
"NULL" | 4 | "NULL" | "NULL"
"NULL" | 5 | "NULL" | "NULL"
"NULL" | 6 | "NULL" | "NULL"
So I came up with a possible solution:
val schema = df1.schema
val newRows: RDD[Row] = df2.map(row => {
val values = row.schema.fields.map(s => {
if(schema.fields.contains(s)){
row.getAs(s.name).toString
}else{
"NULL"
}
})
Row.fromSeq(values)
})
sqlContext.createDataFrame(newRows, schema)
Now, as you can see, this will not work because the schema contains String, Int and Double types, while all my row values are Strings.
This is where I'm stuck: is there a way to automatically convert the type of my values to match the schema?
If the schema is flat, I would simply map over the pre-existing schema and select the required columns:
val exprs = df1.schema.fields.map { f =>
if (df2.schema.fields.contains(f)) col(f.name)
else lit(null).cast(f.dataType).alias(f.name)
}
df2.select(exprs: _*).printSchema
// root
// |-- A: string (nullable = true)
// |-- B: integer (nullable = false)
// |-- C: double (nullable = true)
// |-- D: integer (nullable = true)
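Note that contains(f) compares whole StructFields (name, type and nullability), so a column that exists in df2 under the same name but with a different type is also replaced by null. If you would rather match by name only and cast, a hedged variant (df2Columns and exprsByName are names introduced here):
val df2Columns = df2.columns.toSet
val exprsByName = df1.schema.fields.map { f =>
  if (df2Columns.contains(f.name)) col(f.name).cast(f.dataType).alias(f.name)
  else lit(null).cast(f.dataType).alias(f.name)
}
df2.select(exprsByName: _*)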
Working in 2018 (Spark 2.3), reading a .sas7bdat:
Scala
val sasFile = "file.sas7bdat"
val dfSas = spark.sqlContext.sasFile(sasFile)
val myManualSchema = dfSas.schema //getting the schema from another dataframe
val df = spark.read.format("csv").option("header","true").schema(myManualSchema).load(csvFile)
PS: spark.sqlContext.sasFile uses the saurfang library; you can skip that part of the code and get the schema from any other DataFrame.
Below are simple PySpark steps to achieve the same:
from pyspark.sql import functions as f
df = <dataframe whose schema needs to be copied>
df_tmp = <dataframe with result with fewer fields>
# Note: field names from df_tmp must match with field names from df
df_tmp_cols = [colmn.lower() for colmn in df_tmp.columns]
for col_dtls in df.dtypes:
    col_name, dtype = col_dtls
    if col_name.lower() in df_tmp_cols:
        df_tmp = df_tmp.withColumn(col_name, f.col(col_name).cast(dtype))
    else:
        df_tmp = df_tmp.withColumn(col_name, f.lit(None).cast(dtype))
df_fin = df_tmp.select(df.columns)  # Final dataframe
You could simply do a left join on your DataFrames with a query like this:
SELECT foo.`Column A`, foo.`Column B`, foo.`Column C`, foo.`Column D` FROM foo LEFT JOIN bar ON foo.`Column B` = bar.`Column B`
Please check out the answer by zero323 in this post:
Spark specify multiple column conditions for dataframe join
Thanks,
Charles.
I am facing a strange problem with Apache Spark (using the Scala API). There are two DataFrame objects, let's call them beans and relation.
The beans dataframe consists of two columns, named id and data. Consider that all ids are unique and data holds a text representation of some action or a target of an action.
The relation DataFrame defines relationship between the actions and their targets. It consists of two columns: actionId and targetId
(look at the code snippet below to view a table representation of the DataFrame objects)
Basically, I am trying to alias the beans DataFrame as two new DataFrame objects, actions and targets, and then join them via the relation DataFrame.
Here's some code to illustrate what is going on:
// define the SQL context using an existing SparkContext
val sqlContext = new SQLContext(sparkContext)
// ...
// Produce the following DataFrame objects:
// beans: relation:
// +--------+--------+ +----------+----------+
// | id | data | | actionId | targetId |
// +--------+--------+ +----------+----------+
// | a | save | | a | 1 |
// +--------+--------+ +----------+----------+
// | b | delete | | b | 2 |
// +--------+--------+ +----------+----------+
// | c | read | | c | 3 |
// +--------+--------+ +----------+----------+
// | 1 | file |
// +--------+--------+
// | 2 | os |
// +--------+--------+
// | 3 | book |
// +--------+--------+
case class Bean(id: String, data: String)
case class Relation(actionId: String, targetId: String)
val beans = sqlContext.createDataFrame(
Bean("a", "save") :: Bean("b", "delete") :: Bean("c", "read") ::
Bean("1", "file") :: Bean("2", "os") :: Bean("3", "book") :: Nil
)
val relation = sqlContext.createDataFrame(
Relation("a", "1") :: Relation("b", "2") :: Relation("c", "3") :: Nil
)
// alias beans as "actions" and "targets" to avoid ambiguity
val actions = beans as "actions"
val targets = beans as "targets"
// join actions and targets via relation
actions.join(relation, actions("id") === relation("actionId"))
.join(targets, targets("id") === relation("targetId"))
.select(actions("id") as "actionId", targets("id") as "targetId",
actions("data") as "action", targets("data") as "target")
.show()
The desired output of this snippet is
// desired output
// +----------+----------+--------+--------+
// | actionId | targetId | action | target |
// +----------+----------+--------+--------+
// | a | 1 | save | file |
// +----------+----------+--------+--------+
// | b | 2 | delete | os |
// +----------+----------+--------+--------+
// | c | 3 | read | book |
// +----------+----------+--------+--------+
However, the real (and strange) output is an empty DataFrame
+--------+--------+------+------+
|actionId|targetId|action|target|
+--------+--------+------+------+
+--------+--------+------+------+
I suspected that there was an issue with joining a DataFrame to itself, but the example in Usage of spark DataFrame “as” method proves this suspicion wrong.
I am working with Spark 1.4.1 and Scala 2.10.4 but got the same result on Spark 1.5.1 and Scala 2.11.7
Changing the schema of the DataFrame objects is not an option. Any suggestions?
Solution
Refer to zero323's response. If you are getting an error message like this
error: value $ is not a member of StringContext
actions.join(relation, $"actions.id" === $"actionId")
^
be sure to add the following statement
import sqlContext.implicits._
There is a subtle difference between what you do here and the example you've linked. In the linked answer I use Column objects directly; here you use the apply method on a DataFrame. To see the difference just type both in a REPL:
scala> actions("actions.id")
res59: org.apache.spark.sql.Column = id
scala> col("actions.id")
res60: org.apache.spark.sql.Column = actions.id
For an alias to be properly recognized you have to use Column objects directly; otherwise the alias is simply stripped. It means you need a query like this:
actions.join(relation, $"actions.id" === $"actionId")
.join(targets, $"targets.id" === $"targetId")
or
import org.apache.spark.sql.functions.col
actions.join(relation, col("actions.id") === col("actionId"))
.join(targets, col("targets.id") === col("targetId"))
to make it work. Of course, using col on the RHS is strictly optional here. You could have used apply as before.
If you prefer to use apply you can rename join columns:
val targets = beans.withColumnRenamed("id", "_targetId")
val actions = beans.withColumnRenamed("id", "_actionId")
actions.join(relation, actions("_actionId") === relation("actionId"))
.join(targets, targets("_targetId") === relation("targetId"))
Solution
I would split it in two phases, so:
val beans = sqlContext.createDataFrame(
Bean("a", "save") ::
Bean("b", "delete") ::
Bean("c", "read") ::
Bean("1", "file") ::
Bean("2", "os") ::
Bean("3", "book") ::
Nil
)
val relation = sqlContext.createDataFrame(
Relation("a", "1") ::
Relation("b", "2") ::
Relation("c", "3") ::
Nil
)
// "add" action
val step1 = beans.join(relation, beans("id") === relation("actionId"))
.select(
relation("actionId"),
relation("targetId"),
beans("data").as("action")
)
// "add" target column
val result = step1.join( beans, beans("id") === relation("targetId"))
.select(
step1("actionId"),
step1("targetId"),
step1("action"),
beans("data").as("target")
)
result.show
Remark
Still, it seems unusual and smelly to keep the different beans ("a", "b", "c") in the same table as ("1", "2", "3").