I have the following requirement: I need to collapse all the non-null column values into a single row.
DataFrame:
TRXN_CD TRXN_BR CODE
A NULL NULL
NULL CD NULL
NULL NULL MOR
The expected output is as below, so that it can be loaded into a table.
Output Dataframe:
TRXN_CD TRXN_BR CODE
A CD MOR
You can use the first function with ignorenulls set to True, so that null values are skipped:
import pyspark.sql.functions as F
...
df = df.select(*[F.first(c, True).alias(c) for c in df.columns])
df.show(truncate=False)
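For reference, here is a minimal, self-contained sketch of the same approach, recreating the example DataFrame from the question (the SparkSession setup and the use of None to represent NULL are assumptions made for the sake of a runnable example):
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# recreate the example DataFrame from the question (None represents NULL)
df = spark.createDataFrame(
    [("A", None, None), (None, "CD", None), (None, None, "MOR")],
    ["TRXN_CD", "TRXN_BR", "CODE"])

# first(col, ignorenulls=True) on every column acts as a global aggregation,
# returning the first non-null value of each column in a single row
df = df.select(*[F.first(c, True).alias(c) for c in df.columns])
df.show(truncate=False)
# expected result: one row with TRXN_CD=A, TRXN_BR=CD, CODE=MOR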
I am trying to find the null fields present in a dataframe and concatenate their names into a new field in the same dataframe.
Input dataframe looks like this
name    state   number
James   CA      100
Julia   Null    Null
Null    CA      200
Expected Output
name    state   number   Null Fields
James   CA      100
Julia   Null    Null     state,number
Null    CA      200      name
My code looks like this but it is failing. Please help me here.
from pyspark.sql import functions as F
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
data2 = [("James","CA","100"),
("Julia",None,None),
(None,"CA","200")]
schema = StructType([ \
StructField("name",StringType(),True), \
StructField("state",StringType(),True), \
StructField("number",StringType(),True)
])
df = spark.createDataFrame(data=data2,schema=schema)
cols = ["name","state","number"]
df.show()
def null_constraint_check(df,cols):
df_null_identifier = df.withColumn("NULL Fields",\
[F.count(F.when(F.col(c).isNull(), c)) for c in cols])
return df_null_identifier
df1 = null_constraint_check(df,cols)
The error I am getting is:
AssertionError: col should be Column
Your approach is correct; you only need to make a small change in null_constraint_check:
[F.count(...)] is a list of columns, and withColumn expects a single column as its second parameter. One way to get there is to concatenate all elements of the list using concat_ws:
def null_constraint_check(df,cols):
    df_null_identifier = df.withColumn("NULL Fields",
        F.concat_ws(",",*[F.when(F.col(c).isNull(), c) for c in cols]))
    return df_null_identifier
I have also removed the F.count because your question says that you want the names of the null columns.
The result is:
+-----+-----+------+------------+
| name|state|number| NULL Fields|
+-----+-----+------+------------+
|James| CA| 100| |
|Julia| null| null|state,number|
| null| CA| 200| name|
+-----+-----+------+------------+
How can I retrieve nothing (zero rows) out of a Spark dataframe?
I need something like this,
df.where("1" === "2")
I need this so that I can do a left join with another dataframe.
Basically, I am trying to avoid data skew while joining two dataframes by splitting the rows with null and non-null keys, joining them separately, and then unioning the results.
df1 has 300M records, out of which 200M records have null keys.
df2 has another 300M records.
So to join them, I split df1 into null-key and non-null-key parts and join each with df2 separately. To join the null-key part with df2, I don't need any records from df2.
I can just add the columns from df2 to the null-key df1,
but I am curious to see whether we have something like this in Spark,
df.where("1" === "2")
as we do in RDBMS SQL.
There are many different ways, for example limit:
df.limit(0)
where with Column:
import org.apache.spark.sql.functions._
df.where(lit(false))
where with String expression:
df.where("false")
1 = 2 expressed as
df.where("1 = 2")
or
df.where(lit(1) === lit(2))
would work as well, but are more verbose than required.
The where function calls the filter function internally, so you can use filter as well:
import org.apache.spark.sql.functions._
df.filter(lit(1) === lit(2))
or
import org.apache.spark.sql.functions._
df.filter(expr("1 = 2"))
or
df.filter("1 = 2")
or
df.filter("false")
or
import org.apache.spark.sql.functions._
df.filter(lit(false))
Any expression that would return false in the filter function would work.
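Tying this back to the skew-avoidance plan in the question, a rough sketch of the overall split/join/union pattern could look like the following (shown in PySpark for brevity; the Scala version is analogous, and the join column name "key" is a hypothetical stand-in for your real key column):
from pyspark.sql import functions as F

# split df1 into rows with a null join key and rows with a real key
df1_null_keys = df1.where(F.col("key").isNull())
df1_with_keys = df1.where(F.col("key").isNotNull())

# regular join only for the rows that can actually match
joined = df1_with_keys.join(df2, on="key", how="left")

# joining the null-key rows against an "empty" df2 simply appends df2's columns as nulls
joined_nulls = df1_null_keys.join(df2.where(F.lit(False)), on="key", how="left")

result = joined.unionByName(joined_nulls)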
I have a Spark dataframe as
id name address
1 xyz nc
null
..blank line....
3 pqr stw
I need to remove rows 2 and 3 from the dataframe and need the following output:
id name address
1 xyz nc
3 pqr stw
I have tried using
df1.filter(($"id" =!= "") && ($"id".isNotNull)).filter(($"name" =!= "") && ($"name".isNotNull))
But here I need to do it for every single column, iterating column by column. Is there a way I can do it at the entire row level, without iterating over the columns?
You can use the following logic
import org.apache.spark.sql.functions._
import scala.collection.mutable

def filterEmpty = udf((cols: mutable.WrappedArray[String]) => cols.map(_.equals("")).contains(true))

df.na.fill("").filter(filterEmpty(array("id", "name", "address")) =!= true).show(false)
where filterEmpty is a udf function which returns true if any of the columns contains an empty value.
na.fill("") replaces all null values to empty value in the dataframe.
and filter function filters out the unnecessary rows.
I hope the answer is helpful
I have a pyspark dataframe with 100 cols:
df1=[(col1,string),(col2,double),(col3,bigint),..so on]
I have another pyspark dataframe df2 with same col count and col names but different datatypes.
df2=[(col1,bigint),(col2,double),(col3,string),..so on]
How do I make the datatypes of all the columns in df2 the same as the ones present in df1 for their respective columns?
It should happen iteratively, and if the datatypes already match then the column should not be changed.
If, as you said, the column names and column count match, then you can simply loop over the schema of df1 and cast the columns of df2 to the dataTypes of df1:
import pyspark.sql.functions as F

df2 = df2.select([F.col(c.name).cast(c.dataType) for c in df1.schema])
You can use the cast function:
from pyspark.sql import functions as f

# get the (column name, datatype) pairs for each DF
df1_schema = df1.dtypes
df2_schema = df2.dtypes

# iterate through the columns and cast those which differ in type
for (c1, d1), (c2, d2) in zip(df1_schema, df2_schema):
    # check if the datatypes are the same, otherwise cast to df1's type
    if d1 != d2:
        df2 = df2.withColumn(c2, f.col(c2).cast(d1))
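For illustration, here is a minimal end-to-end sketch of this loop on two small hypothetical dataframes (not the 100-column ones from the question):
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# df1 defines the target types; df2 has the same columns with different types
df1 = spark.createDataFrame([("a", 1.0, 10)], ["col1", "col2", "col3"])   # string, double, bigint
df2 = spark.createDataFrame([(5, 2.0, "20")], ["col1", "col2", "col3"])   # bigint, double, string

# cast each df2 column to the corresponding df1 type when they differ
for (c1, d1), (c2, d2) in zip(df1.dtypes, df2.dtypes):
    if d1 != d2:
        df2 = df2.withColumn(c2, f.col(c2).cast(d1))

assert df1.dtypes == df2.dtypes   # the schemas now line up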
I have a Spark dataframe called "df_array"; it will always return a single array as output, like below.
arr_value
[M,J,K]
I want to extract its value and add it to another dataframe.
Below is the code I was executing:
val new_df = old_df.withColumn("new_array_value", df_array.col("UNCP_ORIG_BPR"))
but my code always fails with "org.apache.spark.sql.AnalysisException: resolved attribute(s)".
Can someone help me with this?
The operation needed here is a join.
You'll need to have a common column in both dataframes, which will be used as the "key".
After the join you can select which columns to be included in the new dataframe.
More details can be found here:
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
join(other, on=None, how=None)
Joins with another DataFrame, using the given join expression.
Parameters:
other – Right side of the join
on – a string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an equi-join.
how – str, default ‘inner’. One of inner, outer, left_outer, right_outer, leftsemi.
The following performs a full outer join between df1 and df2.
>>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()
[Row(name=None, height=80), Row(name=u'Bob', height=85), Row(name=u'Alice', height=None)]
If you know that df_array has only one record, you can collect it to the driver using first() and then use it as an array of literal values to create a column in any DataFrame:
import org.apache.spark.sql.functions._
import scala.collection.mutable
// first - collect that single array to driver (assuming array of strings):
val arrValue = df_array.first().getAs[mutable.WrappedArray[String]](0)
// now use lit() function to create a "constant" value column:
val new_df = old_df.withColumn("new_array_value", array(arrValue.map(lit): _*))
new_df.show()
// +--------+--------+---------------+
// |old_col1|old_col2|new_array_value|
// +--------+--------+---------------+
// | 1| a| [M, J, K]|
// | 2| b| [M, J, K]|
// +--------+--------+---------------+
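If you are working in PySpark rather than Scala, a rough equivalent sketch of the same idea (collect the single array with first(), then build a literal array column) might look like this, assuming the same df_array and old_df from the question:
from pyspark.sql import functions as F

# collect the single-row array value to the driver, e.g. ['M', 'J', 'K']
arr_value = df_array.first()[0]

# build a constant array column from the collected literal values
new_df = old_df.withColumn("new_array_value", F.array(*[F.lit(x) for x in arr_value]))
new_df.show()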