I'm trying to change the schema of an existing dataframe to the schema of another dataframe.
DataFrame 1:
Column A | Column B | Column C | Column D
"a" | 1 | 2.0 | 300
"b" | 2 | 3.0 | 400
"c" | 3 | 4.0 | 500
DataFrame 2:
Column K | Column B | Column F
"c" | 4 | 5.0
"b" | 5 | 6.0
"f" | 6 | 7.0
I want to apply the schema of the first dataframe to the second: columns present in both remain, columns in dataframe 2 that are not in dataframe 1 are dropped, and columns from dataframe 1 that are missing in dataframe 2 are filled with NULL.
Output
Column A | Column B | Column C | Column D
"NULL" | 4 | "NULL" | "NULL"
"NULL" | 5 | "NULL" | "NULL"
"NULL" | 6 | "NULL" | "NULL"
So I came up with a possible solution:
val schema = df1.schema
val newRows: RDD[Row] = df2.map(row => {
  val values = row.schema.fields.map(s => {
    if (schema.fields.contains(s)) {
      row.getAs(s.name).toString
    } else {
      "NULL"
    }
  })
  Row.fromSeq(values)
})
sqlContext.createDataFrame(newRows, schema)
As you can see, this will not work, because the schema contains String, Int and Double columns while all of my row values are Strings.
This is where I'm stuck: is there a way to automatically convert my values to the types in the schema?
If the schema is flat, I would simply map over the pre-existing schema and select the required columns:
import org.apache.spark.sql.functions.{col, lit}

val exprs = df1.schema.fields.map { f =>
  if (df2.schema.fields.contains(f)) col(f.name)
  else lit(null).cast(f.dataType).alias(f.name)
}

df2.select(exprs: _*).printSchema
// root
// |-- A: string (nullable = true)
// |-- B: integer (nullable = false)
// |-- C: double (nullable = true)
// |-- D: integer (nullable = true)
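Note that df2.schema.fields.contains(f) compares the whole StructField (name, data type and nullability), so a column present in both frames but with a different type would still be replaced by a typed null. If you only want to match on the column name, a small variation (just a sketch, assuming column names are unique and matched case-sensitively) would be:
import org.apache.spark.sql.functions.{col, lit}

val df2Columns = df2.columns.toSet
val exprsByName = df1.schema.fields.map { f =>
  if (df2Columns.contains(f.name)) col(f.name).cast(f.dataType).alias(f.name)  // shared column: keep it, cast to df1's type
  else lit(null).cast(f.dataType).alias(f.name)                                // missing column: fill with a typed null
}

df2.select(exprsByName: _*)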
Working in 2018 (Spark 2.3), reading a .sas7bdat file:
Scala
val sasFile = "file.sas7bdat"
val dfSas = spark.sqlContext.sasFile(sasFile)
val myManualSchema = dfSas.schema //getting the schema from another dataframe
val df = spark.read.format("csv").option("header","true").schema(myManualSchema).load(csvFile)
P.S.: spark.sqlContext.sasFile uses the saurfang library; you could skip that part of the code and get the schema from any other dataframe.
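For illustration, a minimal sketch of that idea without the SAS part (the file names here are just placeholders): take the schema from any DataFrame you already have and pass it to the CSV reader.
// Hypothetical file names, purely for illustration.
val dfRef = spark.read.parquet("reference.parquet")  // any existing DataFrame will do
val df = spark.read
  .format("csv")
  .option("header", "true")
  .schema(dfRef.schema)                              // reuse its schema
  .load("data.csv")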
Below are simple PySpark steps to achieve the same:
from pyspark.sql import functions as f

df = <dataframe whose schema needs to be copied>
df_tmp = <dataframe with fewer fields onto which the schema is applied>
# Note: field names from df_tmp must match the field names from df
df_tmp_cols = [colmn.lower() for colmn in df_tmp.columns]
for col_dtls in df.dtypes:
    col_name, dtype = col_dtls
    if col_name.lower() in df_tmp_cols:
        # Column exists in both: cast it to the target type
        df_tmp = df_tmp.withColumn(col_name, f.col(col_name).cast(dtype))
    else:
        # Column missing in df_tmp: fill it with typed nulls
        df_tmp = df_tmp.withColumn(col_name, f.lit(None).cast(dtype))
df_fin = df_tmp.select(df.columns)  # Final dataframe
You could simply do a left join on your dataframes with a query like this:
SELECT foo.`Column A`, bar.`Column B`, foo.`Column C`, foo.`Column D` FROM foo LEFT JOIN bar ON foo.`Column B` = bar.`Column B`
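In the DataFrame API, the same join would look roughly like this (a sketch, assuming foo and bar are the two dataframes and Column B is the shared key):
// Left join on the shared column; columns with spaces in their names are referenced via the parent frame.
val joined = foo.join(bar, foo("Column B") === bar("Column B"), "left")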
Please check out the answer by @zero323 in this post:
Spark specify multiple column conditions for dataframe join
Thanks,
Charles.
I am stuck on a problem where I have a dataframe with 2 columns having this schema:
scala> df1.printSchema
root
|-- actions: string (nullable = true)
|-- event_id: string (nullable = true)
The actions column actually contains an array of objects, but its type is string, and hence I can't use explode here.
Sample data :
------------------------------------------------------------------------------------------------------------------
| event_id | actions |
------------------------------------------------------------------------------------------------------------------
| 1 | [{"name": "Vijay", "score": 843},{"name": "Manish", "score": 840}, {"name": "Mayur", "score": 930}] |
------------------------------------------------------------------------------------------------------------------
There are some other keys present in each object of actions, but for simplicity I have taken 2 here.
I want to convert this to the format below.
OUTPUT:
---------------------------------------
| event_id | name | score |
---------------------------------------
| 1 | Vijay | 843 |
---------------------------------------
| 2 | Manish | 840 |
---------------------------------------
| 3 | Mayur | 930 |
---------------------------------------
How can I do this with a Spark dataframe?
I have tried to read the actions column using
val df2= spark.read.option("multiline",true).json(df1.rdd.map(row => row.getAs[String]("actions")))
but here I am not able to map event_id to each line.
You can do this by using the from_json function.
This function has 2 inputs:
A column from which to read the JSON string
A schema with which to parse the JSON string
That would look something like this:
import spark.implicits._
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types._
// Reading in your data
val df = spark.read.option("sep", ";").option("header", "true").csv("./csvWithJson.csv")
df.show(false)
+--------+---------------------------------------------------------------------------------------------------+
|event_id|actions |
+--------+---------------------------------------------------------------------------------------------------+
|1 |[{"name": "Vijay", "score": 843},{"name": "Manish", "score": 840}, {"name": "Mayur", "score": 930}]|
+--------+---------------------------------------------------------------------------------------------------+
// Creating the necessary schema for the from_json function
val actionsSchema = ArrayType(
  new StructType()
    .add("name", StringType)
    .add("score", IntegerType)
)
// Parsing the json string into our schema, exploding the column to make one row
// per json object in the array and then selecting the wanted columns,
// unwrapping the parsedActions column into separate columns
val parsedDf = df
  .withColumn("parsedActions", explode(from_json(col("actions"), actionsSchema)))
  .drop("actions")
  .select("event_id", "parsedActions.*")
parsedDf.show(false)
+--------+------+-----+
|event_id| name|score|
+--------+------+-----+
| 1| Vijay| 843|
| 1|Manish| 840|
| 1| Mayur| 930|
+--------+------+-----+
Hope this helps!
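One thing worth knowing: with the default PERMISSIVE mode, from_json simply returns null for rows whose actions string cannot be parsed, so malformed rows can be inspected separately, for example (just a sketch):
// Rows whose JSON failed to parse end up with a null parsed value.
val withParsed = df.withColumn("parsed", from_json(col("actions"), actionsSchema))
withParsed.filter(col("parsed").isNull && col("actions").isNotNull).show(false)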
This is my current code:
impcomp = ['connectors', 'contract_no', 'document_confidentiality', 'document_type', 'drawing_no', 'equipment_specifications', 'external_drawings', 'is_psi', 'line_numbers', 'owner_no', 'plant', 'project_title', 'psi_category', 'revision', 'revision_date', 'revision_status', 'tags', 'unit']
for el in impcomp:
    df1 = df3.select(df[Pk[1]], posexplode_outer(df3[el]))
    df1 = df1.where(df1.pos != '1')
    df1 = df1.drop('pos')
    df1 = df1.withColumnRenamed('col', el)
    dfu = df4.join(df1, df4.DocumentNo == df1.DocumentNo, "left")
display(dfu)
What I want is for the loop to append a column for each and every element to the main dataframe (dfu). But instead, my current code overwrites the previous element's column, leaving the final dataframe (dfu) as just dfu + the 'unit' column. Is there any way for me to keep the column for each element processed in the for loop without overwriting the previous one?
Expected result:
Document | Author | connectors | contract_no .........|unit|
A | AA | 12 | C13 |Z12
current result:
Document | Author | unit|
A | AA | Z12
thanks in advance
dfs = []
for el in impcomp:
    df1 = df3.select(df[Pk[1]], posexplode_outer(df3[el]))
    df1 = df1.where(df1.pos != '1')
    df1 = df1.drop('pos')
    df1 = df1.withColumnRenamed('col', el)
    dfs.append(df1[el])
df6 = reduce(df4.union(dfs))
I have tried this but it returns an error:
AttributeError: 'list' object has no attribute '_jdf'
I have a PySpark column of type string containing an array of dictionaries.
x = {"a":1,"b":[{"type":"abc","unitValue":"4.4"}]}
I want to cast the string into an array of structs, but while doing that the fields in the new column are getting populated as null.
Databricks run time - 8.3 (includes Apache Spark 3.1.1, Scala 2.12)
My dataframe looks like:
from pyspark.sql.functions import *
from pyspark.sql.types import *
inputSchema = StructType([StructField("a", StringType(), True),
                          StructField("b", StringType(), True)])
jsonStruct = StructType([StructField("type", StringType(), True),
                         StructField("unitValue", StringType(), True)])
df = spark.createDataFrame(data=[x], schema=inputSchema)
df.show()
+---+---------------------------+
| a| b |
+---+---------------------------+
| 1|[{type=abc, unitValue=4.4}]|
+---+---------------------------+
df.printSchema()
root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
I am using the from_json function to achieve this, but the values are getting populated as null:
df1 = df.withColumn("newvalue",from_json(col("b"),jsonStruct,{"mode" : "PERMISSIVE"}))
display(df1)
+---+----------------------------+----------------------------------+
| a| b | newvalue |
+---+----------------------------+----------------------------------+
| 1|[{type=abc, unitValue=xyz}] |{"type": null, "unitValue": null} |
+---+----------------------------+----------------------------------+
Can someone please help me here
The JSON structure in column b is not proper. After creating the dataframe, : gets replaced by =.
Either you have to make b a string while declaring the variable itself, or you have to replace = with : using regexp_replace().
x = {"a":1,"b":'[{"type":"abc","unitValue":"4.4"}]'}
And you also need to change the JSON schema as shown below.
jsonStruct = ArrayType(StructType([
    StructField("type", StringType(), True),
    StructField("unitValue", StringType(), True)]), True)
This problem is specific to spark 3.0.0 and above.
Databricks Link to issue : https://kb.databricks.com/scala/from-json-null-spark3.html
I have found the resolution as well.
inputSchema = StructType([StructField("a",StringType(),True),StructField("b",StringType(),True)])
jsonStruct = ArrayType(StructType([StructField("type",StringType(),True),StructField("unitValue",StringType(),True)]),True)
x = {"a":1,"b":'[{"type":"abc","unitValue":"xyz"}]'}
df = spark.createDataFrame(data=[x], schema=inputSchema)
df = df.withColumn("b",regexp_replace('b', '=', ':').cast(StringType()))
df1 = df.withColumn("newvalue",from_json(col("b"),jsonStruct,{"mode" : "PERMISSIVE"}))
display(df1)
+---+----------------------------+----------------------------------+
| a| b | newvalue |
+---+----------------------------+----------------------------------+
| 1|[{type=abc, unitValue=xyz}] |{"type": "abc", "unitValue": "xyz"} |
+---+----------------------------+----------------------------------+
I need to add a new column to dataframe DF1, but the new column's value should be calculated using the values of other columns present in that DF. Which of the other columns are to be used is given in another dataframe, DF2.
eg. DF1
|protocolNo|serialNum|testMethod |testProperty|
+----------+---------+------------+------------+
|Product1 | AB |testMethod1 | TP1 |
|Product2 | CD |testMethod2 | TP2 |
DF2:
|action|type |value                    |exploded     |
+------+-----+-------------------------+-------------+
|append|hash |[protocolNo]             |protocolNo   |
|append|text |_                        |_            |
|append|hash |[serialNum,testProperty] |serialNum    |
|append|hash |[serialNum,testProperty] |testProperty |
Now, the values of the exploded column in DF2 are column names of DF1 whenever the value of the type column is hash.
Required -
A new column should be created in DF1. Its value should be calculated like below:
hash[protocolNo]_hash[serialNumTestProperty] ~~~ here, in place of the column names, their corresponding row values should be used.
e.g. for Row1 of DF1, the column value should be
hash[Product1]_hash[ABTP1]
This will result in something like abc-df_egh-45e after hashing.
The above procedure should be followed for each and every row of DF1.
I've tried using map and the withColumn function with a UDF on DF1. But inside the UDF the outer dataframe's values are not accessible (it gives a NullPointerException), and I'm also not able to pass a DataFrame as input to a UDF.
Input DFs would be DF1 and DF2 as mentioned above.
Desired Output DF-
|protocolNo|serialNum|testMethod |testProperty| newColumn |
+----------+---------+------------+------------+----------------+
|Product1 | AB |testMethod1 | TP1 | abc-df_egh-4je |
|Product2 | CD |testMethod2 | TP2 | dfg-df_ijk-r56 |
newColumn value is after hashing
Instead of working with DF2 directly, you can translate DF2 into case class instances like a Specification, e.g.
case class Spec(columnName: String, inputColumns: Seq[String], actions: Seq[String])
Create instances of the above class:
val specifications = Seq(
  Spec("new_col_name", Seq("serialNum", "testProperty"), Seq("hash", "append"))
)
Then you can process the columns as below:
val transformed = specifications
  .foldLeft(DF1)((df: DataFrame, spec: Spec) => df.transform(transformColumn(spec)))

def transformColumn(spec: Spec)(df: DataFrame): DataFrame = {
  spec.actions.foldLeft(df)((df: DataFrame, action: String) => {
    action match {
      case "append" => df // match on the action here and append the result with df.withColumn
      case _ => df
    }
  })
}
The syntax may not be exactly correct, but this is the idea.
Since DF2 has the column names that will be used to calculate a new column from DF1, I have made the assumption that DF2 will not be a huge dataframe.
The first step would be to filter DF2 and get the column names that we want to pick from DF1.
val hashColumns = DF2.filter('type==="hash").select('exploded).collect
Now hashColumns will have the columns that we want to use to calculate the hash in newColumn. hashColumns is an Array of Row, and we need it to become a Column that can be applied while creating newColumn in DF1.
val newColumnHash = hashColumns.map(f=>hash(col(f.getString(0)))).reduce(concat_ws("_",_,_))
The above line converts each Row into a Column with the hash function applied to it, and we reduce them while concatenating with _. Now the task becomes simple: we just need to apply this to DF1.
DF1.withColumn("newColumn",newColumnHash).show(false)
Hope this helps!
I am looking to convert null values nested in an Array of String to empty strings in Spark. The data is in a dataframe. I plan on running a reduce function after making the dataframe null safe; not sure if that helps in answering the question. I am using Spark 1.6.
Schema:
root
|-- carLineName: array (nullable = true)
| |-- element: string (containsNull = true)
Example input:
+--------------------+
|carLineName |
+--------------------+
|[null,null,null] |
|[null, null] |
|[Mustang, null] |
|[Pilot, Jeep] |
Desired output:
+--------------------+
|carLineName |
+--------------------+
|[,,] |
|[,] |
|[Mustang,] |
|[Pilot, Jeep] |
My attempt:
val safeString: Seq[String] => Seq[String] = s => if (s == null) "" else s
val udfSafeString = udf(safeString)
The input to the UDF is a sequence of strings, not a single string. Since that is the case, you need to map over it. You can do this as follows:
import org.apache.spark.sql.functions.udf

val udfSafeString = udf((arr: Seq[String]) => {
  arr.map(s => if (s == null) "" else s)
})
df.withColumn("carLineName", udfSafeString($"carLineName"))