This is my current code:
impcomp = ['connectors', 'contract_no', 'document_confidentiality', 'document_type', 'drawing_no', 'equipment_specifications', 'external_drawings', 'is_psi', 'line_numbers', 'owner_no', 'plant', 'project_title', 'psi_category', 'revision', 'revision_date', 'revision_status', 'tags', 'unit']
for el in impcomp:
    df1 = df3.select(df[Pk[1]], posexplode_outer(df3[el]))
    df1 = df1.where(df1.pos != '1')
    df1 = df1.drop('pos')
    df1 = df1.withColumnRenamed('col', el)
    dfu = df4.join(df1, df4.DocumentNo == df1.DocumentNo, "left")
display(dfu)
What I want is for the loop to process each and every element and append its column to the main dataframe (dfu). Instead, my current code overwrites the previous element's column, leaving the final dataframe (dfu) as just dfu + the 'unit' column. Is there any way for me to store the value for each column that is iterated in the for loop without overwriting the previous element?
Expected result:
Document | Author | connectors | contract_no | ... | unit
A        | AA     | 12         | C13         | ... | Z12
Current result:
Document | Author | unit
A        | AA     | Z12
thanks in advance
dfs = []
for el in impcomp:
    df1 = df3.select(df[Pk[1]], posexplode_outer(df3[el]))
    df1 = df1.where(df1.pos != '1')
    df1 = df1.drop('pos')
    df1 = df1.withColumnRenamed('col', el)
    dfs.append(df1[el])
df6 = reduce(df4.union(dfs))
I have tried this, but it returns an error:
AttributeError: 'list' object has no attribute '_jdf'
Related
In Spark (Scala), after the application jar is submitted to Spark, is it possible for the jar to fetch many strings from a database table, convert each string to a Catalyst Expression, convert that expression to a UDF, use the UDF to filter rows in another DataFrame, and finally union the results of each UDF?
(The expression needs some or all columns of the DataFrame, but which columns are needed is unknown at the time the jar's code is written; the schema of the DataFrame is known at development time.)
An example:
expression 1: "id == 1"
expression 2: "name == \"andy\""
DataFrame:
row 1: id = 1, name = "red", age = null
row 2: id = 2, name = "andy", age = 20
row 3: id = 3, name = "juliet", age = 21
The final result should be the first two rows.
Note: it is not acceptable to simply concatenate the two expressions with an or, because I need to track which expression produced each result row.
Edited: filter for each argument, then union all the results.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

val df = spark.read.option("header","true").option("inferSchema","true").csv("test1.csv")
val args = Array("id == 1", "name == \"andy\"")
val filters = args.zipWithIndex
var dfs = Array[DataFrame]()

filters.foreach {
  case (filter, index) =>
    val tempDf = df.filter(filter).withColumn("index", lit(index))
    dfs = dfs :+ tempDf
}
val resultDF = dfs.reduce(_ unionAll _)
resultDF.show(false)
+---+----+----+-----+
|id |name|age |index|
+---+----+----+-----+
|1 |red |null|0 |
|2 |andy|20 |1 |
+---+----+----+-----+
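As a side note, the same result can be built without the mutable Array, by mapping over the filters and reducing the per-filter DataFrames directly (a sketch reusing df and args from above; on Spark 2.x, union can be used in place of the deprecated unionAll):
import org.apache.spark.sql.functions.lit

// One filtered DataFrame per expression, tagged with its index, then reduced into one result
val result = args.zipWithIndex
  .map { case (filter, index) => df.filter(filter).withColumn("index", lit(index)) }
  .reduce(_ union _)
result.show(false)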
Original: why not just put the string in the filter?
val df = spark.read.option("header","true").option("inferSchema","true").csv("test.csv")
val condition = "id == 1 or name == \"andy\""
df.filter(condition).show(false)
+---+----+----+
|id |name|age |
+---+----+----+
|1 |red |null|
|2 |andy|20 |
+---+----+----+
Is there something I have missed?
I need to add a new column to dataframe DF1, but the new column's value should be calculated using the values of other columns present in that DF. Which of the other columns to use is given in another dataframe, DF2.
e.g. DF1 -
|protocolNo|serialNum|testMethod |testProperty|
+----------+---------+-----------+------------+
|Product1  | AB      |testMethod1| TP1        |
|Product2  | CD      |testMethod2| TP2        |
DF2 -
|action|type|value                    |exploded    |
+------+----+-------------------------+------------+
|append|hash|[protocolNo]             |protocolNo  |
|append|text|_                        |_           |
|append|hash|[serialNum,testProperty] |serialNum   |
|append|hash|[serialNum,testProperty] |testProperty|
Now, the value of the exploded column in DF2 will be a column name of DF1 whenever the value of the type column is hash.
Required -
A new column should be created in DF1. Its value should be calculated like below:
hash[protocolNo]_hash[serialNumTestProperty] ~~~ here, in place of the column names, their corresponding row values should be used.
e.g. for row 1 of DF1, the value should be
hash[Product1]_hash[ABTP1]
which will result in something like abc-df_egh-45e after hashing.
The above procedure should be followed for each and every row of DF1.
I've tried using map and the withColumn function with a UDF on DF1. But inside the UDF, the outer dataframe's value is not accessible (it gives a NullPointerException), and I'm also not able to pass a DataFrame as input to a UDF.
Input DFs would be DF1 and DF2 as mentioned above.
Desired Output DF-
|protocolNo|serialNum|testMethod |testProperty| newColumn |
+----------+---------+------------+------------+----------------+
|Product1 | AB |testMethod1 | TP1 | abc-df_egh-4je |
|Product2 | CD |testMethod2 | TP2 | dfg-df_ijk-r56 |
The newColumn value shown is after hashing.
Instead of using DF2 directly, you can translate DF2 into a case class like Spec, e.g.:
import org.apache.spark.sql.DataFrame

case class Spec(columnName: String, inputColumns: Seq[String], valueType: String, action: String)
Create instances of the above class:
val specifications = Seq(
  Spec("new_col_name", Seq("serialNum", "testProperty"), "hash", "append")
)
Then you can process the columns as below:
val transformed = specifications
  .foldLeft(dtFrm)((df: DataFrame, spec: Spec) => df.transform(transformColumn(spec)))

def transformColumn(spec: Spec)(df: DataFrame): DataFrame = {
  spec.action match {
    case "append" =>
      // match on spec.valueType (e.g. "hash"), build the value from spec.inputColumns,
      // then append it with df.withColumn(spec.columnName, ...)
      df
    case _ => df
  }
}
The syntax above may not be exactly correct, but it shows the structure.
Since DF2 has the column names that will be used to calculate a new column in DF1, I have assumed that DF2 will not be a huge dataframe.
The first step is to filter DF2 and get the column names that we want to pick from DF1.
val hashColumns = DF2.filter('type==="hash").select('exploded).collect
Now hashColumns holds the columns that we want to use to calculate the hash in newColumn. hashColumns is an Array of Row, and we need it to be a Column that can be applied while creating newColumn in DF1.
val newColumnHash = hashColumns.map(f=>hash(col(f.getString(0)))).reduce(concat_ws("_",_,_))
The above line converts each Row into a Column with the hash function applied to it, and we reduce them while concatenating with _. Now the task becomes simple: we just need to apply this to DF1.
DF1.withColumn("newColumn",newColumnHash).show(false)
Hope this helps!
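For reference, here is a minimal self-contained sketch of the approach above, using the example data from the question (the value column of DF2 is omitted since only type and exploded are used, and spark.implicits._ is assumed to be in scope, as in spark-shell):
import org.apache.spark.sql.functions.{col, concat_ws, hash}

val DF1 = Seq(
  ("Product1", "AB", "testMethod1", "TP1"),
  ("Product2", "CD", "testMethod2", "TP2")
).toDF("protocolNo", "serialNum", "testMethod", "testProperty")

val DF2 = Seq(
  ("append", "hash", "protocolNo"),
  ("append", "text", "_"),
  ("append", "hash", "serialNum"),
  ("append", "hash", "testProperty")
).toDF("action", "type", "exploded")

// Collect the DF1 column names flagged as "hash" in DF2
val hashColumns = DF2.filter($"type" === "hash").select($"exploded").collect
// Hash each referenced column and concatenate the hashes with "_"
val newColumnHash = hashColumns.map(f => hash(col(f.getString(0)))).reduce(concat_ws("_", _, _))
DF1.withColumn("newColumn", newColumnHash).show(false)
Note that Spark's built-in hash function returns an integer (Murmur3) hash, so the concrete values will not look like the alphanumeric abc-df_egh-45e example, but the structure of newColumn is the same.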
I have a table which has a column containing an array, like this -
Student_ID | Subject_List | New_Subject
1 | [Mat, Phy, Eng] | Chem
I want to append the new subject into the subject list and get the new list.
Creating the dataframe -
val df = sc.parallelize(Seq((1, Array("Mat", "Phy", "Eng"), "Chem"))).toDF("Student_ID","Subject_List","New_Subject")
I have tried this with a UDF as follows -
def append_list = (arr: Seq[String], s: String) => {
arr :+ s
}
val append_list_UDF = udf(append_list)
val df_new = df.withColumn("New_List", append_list_UDF($"Subject_List",$"New_Subject"))
With the UDF, I get the required output:
Student_ID | Subject_List | New_Subject | New_List
1 | [Mat, Phy, Eng] | Chem | [Mat, Phy, Eng, Chem]
Can we do it without a UDF? Thanks.
In Spark 2.4 or later, a combination of array and concat should do the trick:
import org.apache.spark.sql.functions.{array, concat}
import org.apache.spark.sql.Column
def append(arr: Column, col: Column) = concat(arr, array(col))
df.withColumn("New_List", append($"Subject_List",$"New_Subject")).show
+----------+---------------+-----------+--------------------+
|Student_ID| Subject_List|New_Subject| New_List|
+----------+---------------+-----------+--------------------+
| 1|[Mat, Phy, Eng]| Chem|[Mat, Phy, Eng, C...|
+----------+---------------+-----------+--------------------+
but I wouldn't expect serious performance gains here.
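Since the New_List value is truncated in the output above, show can be called with truncation disabled to display the full array:
df.withColumn("New_List", append($"Subject_List", $"New_Subject")).show(false)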
import org.apache.spark.sql.functions.{collect_list, explode}

val df = Seq((1, Array("Mat", "Phy", "Eng"), "Chem"),
  (2, Array("Hindi", "Bio", "Eng"), "IoT"),
  (3, Array("Python", "R", "scala"), "C")).toDF("Student_ID", "Subject_List", "New_Subject")
df.show(false)

val final_df = df.withColumn("exploded", explode($"Subject_List"))
  .select($"Student_ID", $"exploded")
  .union(df.select($"Student_ID", $"New_Subject"))
  .groupBy($"Student_ID")
  .agg(collect_list($"exploded") as "Your_New_List")
final_df.show(false)
var columnnames= "callStart_t,callend_t" // Timestamp column names are dynamic input.
scala> df1.show()
+------+------------+--------+----------+
| name| callStart_t|personid| callend_t|
+------+------------+--------+----------+
| Bindu|1080602418 | 2|1080602419|
|Raphel|1647964576 | 5|1647964576|
| Ram|1754536698 | 9|1754536699|
+------+------------+--------+----------+
Code which I tried:
val newDf = df1.withColumn("callStart_Time", to_utc_timestamp(from_unixtime($"callStart_t"/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))
val newDf = df1.withColumn("callend_Time", to_utc_timestamp(from_unixtime($"callend_t"/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))
Here, I don't want new columns for the conversion (from_unixtime to to_utc_timestamp); I want to convert the existing columns themselves.
Example Output
+------+---------------------+--------+--------------------+
| name| callStart_t |personid| callend_t |
+------+---------------------+--------+--------------------+
| Bindu|1970-01-13 04:40:02 | 2|1970-01-13 04:40:02 |
|Raphel|1970-01-20 06:16:04 | 5|1970-01-20 06:16:04 |
| Ram|1970-01-21 11:52:16 | 9|1970-01-21 11:52:16 |
+------+---------------------+--------+--------------------+
Note: The Timestamp column names are dynamic.
How do I get each column dynamically?
Just use the same name for the column and it will replace it:
val newDf = df1.withColumn("callStart_t", to_utc_timestamp(from_unixtime($"callStart_t"/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))
val newDf = df1.withColumn("callend_t", to_utc_timestamp(from_unixtime($"callend_t"/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))
To make it dynamic, just use the relevant string. For example:
val colName = "callend_t"
val newDf = df.withColumn(colName , to_utc_timestamp(from_unixtime(col(colName)/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))
For multiple columns you can do:
val columns=Seq("callend_t", "callStart_t")
val newDf = columns.foldLeft(df1) { case (curDf, colName) =>
  curDf.withColumn(colName, to_utc_timestamp(from_unixtime(col(colName)/1000, "yyyy-MM-dd hh:mm:ss"), "Europe/Berlin"))
}
Note: as stated in the comments, the division by 1000 is not needed.
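Since the timestamp column names arrive as a comma-separated string (columnnames in the question), the Seq can also be built dynamically. A small sketch, with the division by 1000 dropped per the note above (convertedDf is just an illustrative name):
import org.apache.spark.sql.functions.{col, from_unixtime, to_utc_timestamp}

val columnnames = "callStart_t,callend_t"   // dynamic input, as in the question
val dynamicColumns = columnnames.split(",").map(_.trim)

val convertedDf = dynamicColumns.foldLeft(df1) { case (curDf, colName) =>
  curDf.withColumn(colName,
    to_utc_timestamp(from_unixtime(col(colName), "yyyy-MM-dd hh:mm:ss"), "Europe/Berlin"))
}
convertedDf.show(false)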
I'm trying to change the schema of an existing dataframe to the schema of another dataframe.
DataFrame 1:
Column A | Column B | Column C | Column D
"a" | 1 | 2.0 | 300
"b" | 2 | 3.0 | 400
"c" | 3 | 4.0 | 500
DataFrame 2:
Column K | Column B | Column F
"c" | 4 | 5.0
"b" | 5 | 6.0
"f" | 6 | 7.0
So I want to apply the schema of the first dataframe to the second: all the columns that are the same remain, the columns in dataframe 2 that are not in dataframe 1 get deleted, and the others become "NULL".
Output
Column A | Column B | Column C | Column D
"NULL" | 4 | "NULL" | "NULL"
"NULL" | 5 | "NULL" | "NULL"
"NULL" | 6 | "NULL" | "NULL"
So I came up with a possible solution:
val schema = df1.schema
val newRows: RDD[Row] = df2.map(row => {
val values = row.schema.fields.map(s => {
if(schema.fields.contains(s)){
row.getAs(s.name).toString
}else{
"NULL"
}
})
Row.fromSeq(values)
})
sqlContext.createDataFrame(newRows, schema)
Now, as you can see, this will not work because the schema contains String, Int and Double, while all my row values are Strings.
This is where I'm stuck, is there a way to automatically convert the type of my values to the schema?
If the schema is flat, I would simply map over the pre-existing schema and select the required columns:
import org.apache.spark.sql.functions.{col, lit}

val exprs = df1.schema.fields.map { f =>
  if (df2.schema.fields.contains(f)) col(f.name)
  else lit(null).cast(f.dataType).alias(f.name)
}

df2.select(exprs: _*).printSchema
// root
// |-- A: string (nullable = true)
// |-- B: integer (nullable = false)
// |-- C: double (nullable = true)
// |-- D: integer (nullable = true)
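For the example data in the question, selecting with the same expressions then yields roughly the following (only Column B is carried over; the shortened column names follow the printSchema output above):
df2.select(exprs: _*).show
// +----+---+----+----+
// |   A|  B|   C|   D|
// +----+---+----+----+
// |null|  4|null|null|
// |null|  5|null|null|
// |null|  6|null|null|
// +----+---+----+----+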
Working in 2018 (Spark 2.3), reading a .sas7bdat:
Scala
val sasFile = "file.sas7bdat"
val dfSas = spark.sqlContext.sasFile(sasFile)
val myManualSchema = dfSas.schema //getting the schema from another dataframe
val df = spark.read.format("csv").option("header","true").schema(myManualSchema).load(csvFile)
P.S.: spark.sqlContext.sasFile uses the saurfang library; you can skip that part of the code and get the schema from any other dataframe.
Below are simple PySpark steps to achieve the same:
from pyspark.sql import functions as f

df = <dataframe whose schema needs to be copied>
df_tmp = <dataframe with result with fewer fields>
# Note: field names from df_tmp must match field names from df
df_tmp_cols = [colmn.lower() for colmn in df_tmp.columns]
for col_dtls in df.dtypes:
    col_name, dtype = col_dtls
    if col_name.lower() in df_tmp_cols:
        df_tmp = df_tmp.withColumn(col_name, f.col(col_name).cast(dtype))
    else:
        df_tmp = df_tmp.withColumn(col_name, f.lit(None).cast(dtype))
df_fin = df_tmp.select(df.columns)  # Final dataframe
You could simply do a left join on your dataframes with a query like this:
SELECT foo.`Column A`, foo.`Column B`, foo.`Column C`, foo.`Column D` FROM foo LEFT JOIN bar ON foo.`Column C` = bar.`Column C`
Please check out the answer by @zero323 in this post:
Spark specify multiple column conditions for dataframe join
Thanks,
Charles.