I am trying to find the null fields present in a dataframe and concatenate the names of all those fields into a new field in the same dataframe.
Input dataframe looks like this:

+-----+-----+------+
| name|state|number|
+-----+-----+------+
|James|   CA|   100|
|Julia| Null|  Null|
| Null|   CA|   200|
+-----+-----+------+
Expected Output:

+-----+-----+------+------------+
| name|state|number| Null Fields|
+-----+-----+------+------------+
|James|   CA|   100|            |
|Julia| Null|  Null|state,number|
| Null|   CA|   200|        name|
+-----+-----+------+------------+
My code looks like this but it is failing. Please help me here.
from pyspark.sql import functions as F
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
data2 = [("James","CA","100"),
         ("Julia",None,None),
         (None,"CA","200")]
schema = StructType([ \
    StructField("name",StringType(),True), \
    StructField("state",StringType(),True), \
    StructField("number",StringType(),True)
])
df = spark.createDataFrame(data=data2,schema=schema)
cols = ["name","state","number"]
df.show()
def null_constraint_check(df,cols):
    df_null_identifier = df.withColumn("NULL Fields",\
        [F.count(F.when(F.col(c).isNull(), c)) for c in cols])
    return df_null_identifier
df1 = null_constraint_check(df,cols)
Error I am getting
AssertionError: col should be Column
Your approach is correct, you only have to make a small change in null_constraint_check:
[F.count(...)] is a list of columns and withColumn expects a single column as second parameter. One way to get there is to concatenate all elements of the list using concat_ws:
def null_constraint_check(df,cols):
    df_null_identifier = df.withColumn("NULL Fields",
        F.concat_ws(",",*[F.when(F.col(c).isNull(), c) for c in cols]))
    return df_null_identifier
I have also removed the F.count because your question says that you want the names of the null columns.
The result is:
+-----+-----+------+------------+
| name|state|number| NULL Fields|
+-----+-----+------+------------+
|James| CA| 100| |
|Julia| null| null|state,number|
| null| CA| 200| name|
+-----+-----+------+------------+
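If you also want the number of null fields per row (closer to the original F.count idea), here is a minimal sketch that keeps both columns; it assumes the same df and cols as above, and the helper name is just illustrative:

from pyspark.sql import functions as F

def null_constraint_check_with_count(df, cols):
    # Concatenate the names of the null columns, as above
    df = df.withColumn("NULL Fields",
        F.concat_ws(",", *[F.when(F.col(c).isNull(), c) for c in cols]))
    # Count nulls per row: cast each isNull() boolean to int and sum them
    df = df.withColumn("NULL Count",
        sum(F.col(c).isNull().cast("int") for c in cols))
    return df

null_constraint_check_with_count(df, cols).show()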
Related
I have a dataframe with column FN and a list of a subset of these column values
e.g.
FN
---
ABC
DEF
GHI
JKL
MNO

List: ["GHI","DEF"]
I want to add a column to my dataframe where, if the column value exists in the list, I record its position within the list. That is, my end DF would be:
+---+---+
| FN|POS|
+---+---+
|ABC|   |
|DEF|  1|
|GHI|  0|
|JKL|   |
|MNO|   |
+---+---+
My code is as follows
from pyspark.sql.functions import udf, when, col, lit
from pyspark.sql.types import StringType

l = ["GHI","DEF"]
x = udf(lambda fn, p = l: p.index(fn), StringType())
df = df.withColumn('POS', when(col("FN").isin(l), x(col("FN"))).otherwise(lit('')))
But when running I get a "Job aborted due to stage failure" exception with a series of other exceptions, the only meaningful part being "ValueError: 'JKL' is not in list" (JKL being another value in my FN column that is not in the list).
If instead of "p.index(fn)" I just use "fn", I get the correct column values in my new column. Similarly, if I use "p.index("DEF")", I get "1" back, so individually these are working. Any ideas why the exceptions?
TIA
EDIT: I have managed to work around this by doing an if-else within the lambda (see the sketch below), which almost implies that the lambda is executed prior to the "isin" check within the withColumn statement.
What I would like to know (other than whether the above is true) is: does anyone have a suggestion on how to achieve this in a better manner?
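For reference, a minimal sketch of the if-else workaround mentioned in the edit above (assuming the same df, list l and column FN); since the UDF itself guards against values that are not in the list, it no longer matters whether Spark evaluates it before the isin check:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

l = ["GHI","DEF"]

# Return the position as a string if the value is in the list, otherwise an empty string
safe_pos = udf(lambda fn, p=l: str(p.index(fn)) if fn in p else '', StringType())

df = df.withColumn('POS', safe_pos(col("FN")))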
Here is my try. I have made a dataframe from the given list and joined them.
from pyspark.sql.functions import *
l = ['GHI','DEF']
m = [(l[i], i) for i in range(0, len(l))]
df2 = spark.createDataFrame(m).toDF('FN', 'POS')
df1 = spark.createDataFrame(['POS','ABC','DEF','GHI','JKL','MNO'], "string").toDF('FN')
df1.join(df2, ['FN'], 'left').show()
+---+----+
| FN| POS|
+---+----+
|JKL|null|
|MNO|null|
|DEF| 1|
|POS|null|
|GHI| 0|
|ABC|null|
+---+----+
I need to add a new column to dataframe DF1, but the new column's value should be calculated using other columns' values present in that DF. Which of the other columns are to be used will be given in another dataframe DF2.
e.g. DF1 -

+----------+---------+-----------+------------+
|protocolNo|serialNum|testMethod |testProperty|
+----------+---------+-----------+------------+
|Product1  |AB       |testMethod1|TP1         |
|Product2  |CD       |testMethod2|TP2         |
+----------+---------+-----------+------------+
DF2 -

+------+-----+------------------------+------------+
|action|type |value                   |exploded    |
+------+-----+------------------------+------------+
|append|hash |[protocolNo]            |protocolNo  |
|append|text |_                       |_           |
|append|hash |[serialNum,testProperty]|serialNum   |
|append|hash |[serialNum,testProperty]|testProperty|
+------+-----+------------------------+------------+
Now, the value of the exploded column in DF2 will be a column name of DF1 if the value of the type column is hash.
Required -
A new column should be created in DF1. The value should be calculated like below -
hash[protocolNo]_hash[serialNumTestProperty] ~~~ here, in place of the column names, their corresponding row values should come.
e.g. for Row1 of DF1, the column value should be
hash[Product1]_hash[ABTP1]
This will result in something like abc-df_egh-45e after hashing.
The above procedure should be followed for each and every row of DF1.
I've tried using map and withColumn functions using a UDF on DF1, but inside the UDF the outer dataframe's values are not accessible (gives a Null Pointer Exception), and I'm also not able to pass a DataFrame as input to a UDF.
Input DFs would be DF1 and DF2 as mentioned above.
Desired Output DF -

+----------+---------+-----------+------------+--------------+
|protocolNo|serialNum|testMethod |testProperty|newColumn     |
+----------+---------+-----------+------------+--------------+
|Product1  |AB       |testMethod1|TP1         |abc-df_egh-4je|
|Product2  |CD       |testMethod2|TP2         |dfg-df_ijk-r56|
+----------+---------+-----------+------------+--------------+
The newColumn value is the result after hashing.
Instead of DF2, you can translate DF2 into instances of a case class like Spec, e.g.
case class Spec(columnName: String, inputColumns: Seq[String], action: String, types: String*)
Create instances of the above class:
val specifications = Seq(
  Spec("new_col_name", Seq("serialNum","testProperty"), "hash", "append")
)
Then you can process the columns as below:
val transformed = specifications
  .foldLeft(dtFrm)((df: DataFrame, spec: Spec) => df.transform(transformColumn(spec)))

def transformColumn(spec: Spec)(df: DataFrame): DataFrame = {
  spec.types.foldLeft(df)((df: DataFrame, t: String) => {
    t match {
      case "append" =>
        // have a case match on the action and do that, then append with df.withColumn
        df
    }
  })
}
Syntax may not be correct
Since DF2 has the column names that will be used to calculate a new column from DF1, I have made this assumption that DF2 will not be a huge Dataframe.
First step would be to filter DF2 and get the column names that we want to pick from DF1.
val hashColumns = DF2.filter('type==="hash").select('exploded).collect
Now, hashColumns will have the columns that we want to use to calculate the hash in newColumn. hashColumns is an Array[Row]. We need this to be a Column that can be applied while creating newColumn in DF1.
val newColumnHash = hashColumns.map(f=>hash(col(f.getString(0)))).reduce(concat_ws("_",_,_))
The above line converts each Row into a Column with the hash function applied to it, and we reduce them while concatenating with _. Now the task becomes simple: we just need to apply this to DF1.
DF1.withColumn("newColumn",newColumnHash).show(false)
Hope this helps!
I have pyspark dataframe with two columns with datatypes as
[('area', 'int'), ('customer_play_id', 'int')]
+----+----------------+
|area|customer_play_id|
+----+----------------+
| 100|         8606738|
| 110|         8601843|
| 130|         8602984|
+----+----------------+
I want to cast the column area to str using pyspark commands, but I am getting an error as below.
I tried the following:
str(df['area']): but it didn't change the datatype to str
df.area.astype(str): gave "TypeError: unexpected type: "
df['area'].cast(str): same error as above
Any help will be appreciated
I want datatype of area as string using pyspark dataframe operation
You can simply do any of these:
Option1:
df1 = df.select('*',df.area.cast("string"))
select - All the columns you want in df1 should be mentioned in select
Option2:
df1 = df.selectExpr("*","cast(area as string) AS new_area")
selectExpr - All the columns you want in df1 should be mentioned in selectExpr
Option3:
df1 = df.withColumn("new_area", df.area.cast("string"))
withColumn will add new column (additional to existing columns of df)
"*" in select and selectExpr represent all the columns.
Use the withColumn function to change the data type or values of a field in Spark, e.g. as shown below:
import pyspark.sql.functions as F
df = df.withColumn("area",F.col("area").cast("string"))
You can use this kind of UDF function (note the example targets FloatType, while the actual conversion below still uses cast):
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

# Identity UDF declared with a FloatType return type
tofloatfunc = udf(lambda x: x, FloatType())

changedTypedf = df.withColumn("Column_name", df["Column_name"].cast(FloatType()))
How can I convert a dataframe to a tuple that includes the datatype for each column?
I have a number of dataframes with varying sizes and types. I need to be able to determine the type and value of each column and row of a given dataframe so I can perform some actions that are type-dependent.
So for example say I have a dataframe that looks like:
+-------+-------+
| foo | bar |
+-------+-------+
| 12345 | fnord |
| 42 | baz |
+-------+-------+
I need to get
Seq(
(("12345", "Integer"), ("fnord", "String")),
(("42", "Integer"), ("baz", "String"))
)
or something similarly simple to iterate over and work with programmatically.
Thanks in advance and sorry for what is, I'm sure, a very noobish question.
If I understand your question correctly, then the following should be your solution.
import spark.implicits._

val df = Seq(
  (12345, "fnord"),
  (42, "baz"))
  .toDF("foo", "bar")
This creates dataframe which you already have.
+-----+-----+
| foo| bar|
+-----+-----+
|12345|fnord|
| 42| baz|
+-----+-----+
The next step is to extract the dataType of each field from the schema of the dataFrame and create a list of them.
val fieldTypesList = df.schema.map(struct => struct.dataType)
The next step is to convert the dataframe rows into lists via the RDD and map each value to its dataType from the list created above.
val dfList = df.rdd.map(row => row.toString().replace("[","").replace("]","").split(",").toList)
val tuples = dfList.map(list => list.map(value => (value, fieldTypesList(list.indexOf(value)))))
Now if we print it
tuples.foreach(println)
It would give
List((12345,IntegerType), (fnord,StringType))
List((42,IntegerType), (baz,StringType))
Which you can iterate over and work with programmatically
I have multiple duplicate columns (due to joins). If I try to call them by alias, I get an ambiguous reference error:
Reference 'customers_id' is ambiguous, could be: customers_id#13, customers_id#85, customers_id#130
Is there a way to reference a column in a Scala Spark Dataframe by its order in the Dataframe or by numeric ID, not by an alias? The sanitized names suggest that columns do have an id assigned (13, 85, 130 in the example above).
LATER EDIT:
I found out that I can reference a specific column via the original dataframe it was in. But, while I can use OriginalDataframe.customer_id in the select function, the withColumnRenamed function only accepts a string alias, so I cannot rename the duplicate column in the final dataframe.
So, I guess the end question is:
Is there a way to reference a column that has a duplicate alias, that works with all functions that require a string alias as argument?
LATER EDIT 2:
Renaming seemed to have worked via adding a new column and dropping one of the current ones:
joined_dataframe = joined_dataframe.withColumn("renamed_customers_id", original_dataframe("customers_id")).drop(original_dataframe("customers_id"))
But, I'd like to keep my question open:
Is there a way to reference a column that has a duplicate alias (so, using something other than alias) in a way that all functions which expect a string alias accept it?
One way to get out of such a situation would be to create a new Dataframe using the old one's rdd, but with a new schema, in which you can name each column as you'd like. This, of course, requires you to explicitly describe the entire schema, including the type of each column. As long as the new schema you provide matches the number of columns, and the column types, of the old Dataframe - this should work.
For example - starting with a Dataframe with two columns named type, we can rename them to type1 and type2:
df.show()
// +---+----+----+
// | id|type|type|
// +---+----+----+
// | 1| AAA| aaa|
// | 1| BBB| bbb|
// +---+----+----+
import org.apache.spark.sql.types._

val newDF = sqlContext.createDataFrame(df.rdd, new StructType()
  .add("id", IntegerType)
  .add("type1", StringType)
  .add("type2", StringType)
)
newDF.show()
// +---+-----+-----+
// | id|type1|type2|
// +---+-----+-----+
// | 1| AAA| aaa|
// | 1| BBB| bbb|
// +---+-----+-----+
The main problem is the join; I use Python.
h1.createOrReplaceTempView("h1")
h2.createOrReplaceTempView("h2")
h3.createOrReplaceTempView("h3")
joined1 = h1.join(h2, (h1.A == h2.A) & (h1.B == h2.B) & (h1.C == h2.C), 'inner')
Result dataframe columns:
A B Column1 Column2 A B Column3 ...
I don't like this, but the join must be implemented like this:
joined1 = h1.join(h2, [*argv], 'inner')
We assume argv = ["A", "B", "C"]
Result columns:
A B column1 column2 column3 ...
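A minimal runnable sketch of the difference (the names h1, h2, A, B, C follow the example above; the sample data is made up purely for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Made-up sample rows just to show the two join styles
h1 = spark.createDataFrame([(1, 2, 3, "x")], ["A", "B", "C", "Column1"])
h2 = spark.createDataFrame([(1, 2, 3, "y")], ["A", "B", "C", "Column3"])

# Joining on an explicit condition keeps both copies of A, B and C
cond_join = h1.join(h2, (h1.A == h2.A) & (h1.B == h2.B) & (h1.C == h2.C), 'inner')
print(cond_join.columns)   # ['A', 'B', 'C', 'Column1', 'A', 'B', 'C', 'Column3']

# Joining on a list of column names keeps a single copy of each join key
argv = ["A", "B", "C"]
list_join = h1.join(h2, argv, 'inner')
print(list_join.columns)   # ['A', 'B', 'C', 'Column1', 'Column3']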