PySpark - get string position from list

PySpark - get string position from list - pyspark

I have a dataframe with column FN and a list of a subset of these column values
e.g.
**FN**
ABC
DEF
GHI
JKL
MNO
List:
["GHI","DEF"]
I want to add a column to my dataframe where, if the column value exists in the List, I record the position within the list, that is my end DF
FN POS
ABC
DEF 1
GHI 0
JKL
MNO
My code is as follows
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
l = ["GHI","DEF"]
x = udf(lambda fn, p = l: p.index(fn), StringType())
df = df.withColumn('POS', when(col("FN").isin(l), x(col("FN"))).otherwise(lit('')))
But when running I get a "Job aborted due to stage failure" exception with a series of other exceptions, the only meaningful part being "ValueError: 'JKL' is not in list" (JKL being a random other column in my DF column)
If instead of "p.index(fn)" I just enter "fn", I get the correct column values in my new column, similarly if I use "p.index("DEF")", I get "1" back so individually these are working, any ideas why the exceptions?
TIA
EDIT: I have managed to go around this by doing an if-else within the lambda which is almost implying that it is executing the lambda prior to the "isin" check within the withColumn statement.
What I would like to know (other than whether the above is true), does anyone have a better suggestion on how to achieve this in a better manner?

Here is my try. I have made a dataframe for the given list and join them.
from pyspark.sql.functions import *
l = ['GHI','DEF']
m = [(l[i], i) for i in range(0, len(l))]
df2 = spark.createDataFrame(m).toDF('FN', 'POS')
df1 = spark.createDataFrame(['POS','ABC','DEF','GHI','JKL','MNO'], "string").toDF('FN')
df1.join(df2, ['FN'], 'left').show()
+---+----+
| FN| POS|
+---+----+
|JKL|null|
|MNO|null|
|DEF| 1|
|POS|null|
|GHI| 0|
|ABC|null|
+---+----+

Related

Using pandas udf without looping in pyspark

So suppose I have a big spark dataframe .I dont know how many columns.
(the solution has to be in pyspark using pandas udf. Not a different approach)
I want to perform an action on all columns. So it's ok to loop inside on all columns
But I dont want to loop through rows. I want it to act on the column at once.
I didnt find on the internet how this could be done.
Suppose I have this datafrme
A B C
5 3 2
1 7 0
Now I want to send to pandas udf to get sum of each row.
Sum
10
8
Number of columns not known.
I can do it inside the udf by looping row at a time. But I dont want. I want it to act on all rows without looping. And I allow looping through columns if needed.
One option I tried is combining all colmns to array column
ARR
[5,3,2]
[1,7,0]
But even here it doesnt work for me without looping.
I send this column to the udf and then inside I need to loop through its rows and sum each value of the list-row.
It would be nice if I could seperate each column as a one and act on the whole column at once
How do I act on the column at once? Without looping through the rows?
If I loop through the rows I guess it's no better than a regular python udf

I wouldnt go to pandas udfs, resort to udfs it cant be done in pyspark. Anyway code for both below
df = spark.read.load('/databricks-datasets/asa/small/small.csv', header=True,format='csv')
sf = df.select(df.colRegex("`.*rrDelay$|.*pDelay$`"))
#sf.show()
columns = ["id","ArrDelay","DepDelay"]
data = [("a", 81.0,3),
("b", 36.2,5),
("c", 12.0,5),
("d", 81.0,5),
("e", 36.3,5),
("f", 12.0,5),
("g", 111.7,5)]
sf = spark.createDataFrame(data=data,schema=columns)
sf.show()
# Use aggregate function
new = (sf.withColumn('sums', array(*[x for x in ['ArrDelay','DepDelay'] ]))#Create an array of values per row on desired columns
.withColumn('sums', expr("aggregate(sums,cast(0 as double), (c,i)-> c+i)"))# USE aggregate to sum
).show()
#use pandas udf
sch= sf.withColumn('v', lit(90.087654623)).schema
def sum_s(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
for pdf in iterator:
yield pdf.assign(v=pdf.sum(1))
sf.mapInPandas(sum_s, schema=sch).show()

here's a simple way to do it
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.sql import Window
from functools import reduce
df = spark.createDataFrame(
[
(5,3,2),
(1,7,0),
],
["A", "B", "C"],
)
cols = df.columns
calculate_sum = reduce(lambda a, x: a+x, map(col, cols))
df = (
df
.withColumn(
"sum",calculate_sum
)
)
df.show()
output:
+---+---+---+---+
| A| B| C|sum|
+---+---+---+---+
| 5| 3| 2| 10|
| 1| 7| 0| 8|
+---+---+---+---+

How Can I find all the Null fields in Dataframe

I am trying to find out the null fields present in a dataframe and concatenate all the fields in a new field in the same dataframe.
Input dataframe looks like this
name
state
number
James
CA
100
Julia
Null
Null
Null
CA
200
Expected Output
name
state
number
Null Fields
James
CA
100
Julia
Null
Null
state,number
Null
CA
200
name
My code looks like this but it is failing. Please help me here.
from pyspark.sql import functions as F
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
data2 = [("James","CA","100"),
("Julia",None,None),
(None,"CA","200")]
schema = StructType([ \
StructField("name",StringType(),True), \
StructField("state",StringType(),True), \
StructField("number",StringType(),True)
])
df = spark.createDataFrame(data=data2,schema=schema)
cols = ["name","state","number"]
df.show()
def null_constraint_check(df,cols):
df_null_identifier = df.withColumn("NULL Fields",\
[F.count(F.when(F.col(c).isNull(), c)) for c in cols])
return df_null_identifier
df1 = null_constraint_check(df,cols)
Error I am getting
AssertionError: col should be Column

Your approach is correct, you only have to make a small change in null_constraint_check:
[F.count(...)] is a list of columns and withColumn expects a single column as second parameter. One way to get there is to concatenate all elements of the list using concat_ws:
def null_constraint_check(df,cols):
df_null_identifier = df.withColumn("NULL Fields",
F.concat_ws(",",*[F.when(F.col(c).isNull(), c) for c in cols]))
return df_null_identifier
I have also removed the F.count because your question says that you want the names of the null columns.
The result is:
+-----+-----+------+------------+
| name|state|number| NULL Fields|
+-----+-----+------+------------+
|James| CA| 100| |
|Julia| null| null|state,number|
| null| CA| 200| name|
+-----+-----+------+------------+

Spark generate a list of column names that contains(SQL LIKE) a string

This one below is a simple syntax to search for a string in a particular column uisng SQL Like functionality.
val dfx = df.filter($"name".like(s"%${productName}%"))
The questions is How do I grab each and every column NAME that contained the particular string in its VALUES and generate a new column with a list of those "column names" for every row.
So far this is the approach I took but stuck as I cant use spark-sql "Like" function inside a UDF.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._
import spark.implicits._
val df1 = Seq(
(0, "mango", "man", "dit"),
(1, "i-man", "man2", "mane"),
(2, "iman", "mango", "ho"),
(3, "dim", "kim", "sim")
).toDF("id", "col1", "col2", "col3")
val df2 = df1.columns.foldLeft(df1) {
(acc: DataFrame, colName: String) =>
acc.withColumn(colName, concat(lit(colName + "="), col(colName)))
}
val df3 = df2.withColumn("merged_cols", split(concat_ws("X", df2.columns.map(c=> col(c)):_*), "X"))
Here is a sample output. Note that here there are only 3 columns but in the real job I'll be reading multiple tables which can contain dynamic number of columns.
+--------------------------------------------+
|id | col1| col2| col3| merged_cols
+--------------------------------------------+
0 | mango| man | dit | col1, col2
1 | i-man| man2 | mane | col1, col2, col3
2 | iman | mango| ho | col1, col2
3 | dim | kim | sim|
+--------------------------------------------+

This can be done using a foldLeft over the columns together with when and otherwise:
val e = "%man%"
val df2 = df1.columns.foldLeft(df.withColumn("merged_cols", lit(""))){(df, c) =>
df.withColumn("merged_cols", when(col(c).like(e), concat($"merged_cols", lit(s"$c,"))).otherwise($"merged_cols"))}
.withColumn("merged_cols", expr("substring(merged_cols, 1, length(merged_cols)-1)"))
All columns that satisfies the condition e will be appended to the string in the merged_cols column. Note that the column must exist for the first append to work so it is added (containing an empty string) to the dataframe when sent into the foldLeft.
The last row in the code simply removes the extra , that is added in the end. If you want the result as an array instead, simply adding .withColumn("merged_cols", split($"merged_cols", ",")) would work.
An alternative appraoch is to instead use an UDF. This could be preferred when dealing with many columns since foldLeft will create multiple dataframe copies. Here regex is used (not the SQL like since that operates on whole columns).
val e = ".*man.*"
val concat_cols = udf((vals: Seq[String], names: Seq[String]) => {
vals.zip(names).filter{case (v, n) => v.matches(e)}.map(_._2)
})
val df2 = df.withColumn("merged_cols", concat_cols(array(df.columns.map(col(_)): _*), typedLit(df.columns.toSeq)))
Note: typedLit can be used in Spark versions 2.2+, when using older versions use array(df.columns.map(lit(_)): _*) instead.

Scala Spark - split vector column into separate columns in a Spark DataFrame

I have a Spark DataFrame where I have a column with Vector values. The vector values are all n-dimensional, aka with the same length. I also have a list of column names Array("f1", "f2", "f3", ..., "fn"), each corresponds to one element in the vector.
some_columns... | Features
... | [0,1,0,..., 0]
to
some_columns... | f1 | f2 | f3 | ... | fn
... | 0 | 1 | 0 | ... | 0
What is the best way to achieve this? I thought of one way which is to create a new DataFrame with createDataFrame(Row(Features), featureNameList) and then join with the old one, but it requires spark context to use createDataFrame. I only want to transform the existing data frame. I also know .withColumn("fi", value) but what do I do if n is large?
I'm new to Scala and Spark and couldn't find any good examples for this. I think this can be a common task. My particular case is that I used the CountVectorizer and wanted to recover each column individually for better readability instead of only having the vector result.

One way could be to convert the vector column to an array<double> and then using getItem to extract individual elements.
import org.apache.spark.sql.functions._
import org.apache.spark.ml._
val df = Seq( (1 , linalg.Vectors.dense(1,0,1,1,0) ) ).toDF("id", "features")
//df: org.apache.spark.sql.DataFrame = [id: int, features: vector]
df.show
//+---+---------------------+
//|id |features |
//+---+---------------------+
//|1 |[1.0,0.0,1.0,1.0,0.0]|
//+---+---------------------+
// A UDF to convert VectorUDT to ArrayType
val vecToArray = udf( (xs: linalg.Vector) => xs.toArray )
// Add a ArrayType Column
val dfArr = df.withColumn("featuresArr" , vecToArray($"features") )
// Array of element names that need to be fetched
// ArrayIndexOutOfBounds is not checked.
// sizeof `elements` should be equal to the number of entries in column `features`
val elements = Array("f1", "f2", "f3", "f4", "f5")
// Create a SQL-like expression using the array
val sqlExpr = elements.zipWithIndex.map{ case (alias, idx) => col("featuresArr").getItem(idx).as(alias) }
// Extract Elements from dfArr
dfArr.select(sqlExpr : _*).show
//+---+---+---+---+---+
//| f1| f2| f3| f4| f5|
//+---+---+---+---+---+
//|1.0|0.0|1.0|1.0|0.0|
//+---+---+---+---+---+

Spark merge dataframe with mismatching schemas without extra disk IO

I would like to merge 2 dataframes with (potentially) mismatching schemas
org.apache.spark.sql.DataFrame = [name: string, age: int, height: int]
org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> A.unionAll(B)
would result in :
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the left table has 2 columns and the right has 3;
I would like to do this from within Spark.
However, the Spark docs only propose to write the whole 2 dataframes out to a directory and read them back in using spark.read.option("mergeSchema", "true").
link to docs
So a union doesn't help me out, and neither does the documentation. I would like to keep this extra I/O out of my job if at all possible. Am I missing some undocumented info, or is it not possible (yet)?

You can append a null column to frame B and after union 2 frames:
import org.apache.spark.sql.functions._
val missingFields = A.schema.toSet.diff(B.schema.toSet)
var C: DataFrame = null
for (field <- missingFields){
C = A.withColumn(field.name, expr("null"));
}
A.unionAll(C)

parquet schema merging is disabled by default, turn on this option by:
(1) set global option: spark.sql.parquet.mergeSchema=true
(2) write code: sqlContext.read.option("mergeSchema", "true").parquet("my.parquet")

Here's a pyspark solution.
It assumes that if the merge can't take place because one dataframe is missing a column contained in the other, then the right thing is to add the missing column with null values.
On the other hand, if the merge can't take place because the two dataframes share a column with conflicting type or nullability, then the right thing is to raise a TypeError (because that's a conflict you probably want to know about).
def harmonize_schemas_and_combine(df_left, df_right):
left_types = {f.name: f.dataType for f in df_left.schema}
right_types = {f.name: f.dataType for f in df_right.schema}
left_fields = set((f.name, f.dataType, f.nullable) for f in df_left.schema)
right_fields = set((f.name, f.dataType, f.nullable) for f in df_right.schema)
# First go over left-unique fields
for l_name, l_type, l_nullable in left_fields.difference(right_fields):
if l_name in right_types:
r_type = right_types[l_name]
if l_type != r_type:
raise TypeError, "Union failed. Type conflict on field %s. left type %s, right type %s" % (l_name, l_type, r_type)
else:
raise TypeError, "Union failed. Nullability conflict on field %s. left nullable %s, right nullable %s" % (l_name, l_nullable, not(l_nullable))
df_right = df_right.withColumn(l_name, lit(None).cast(l_type))
# Now go over right-unique fields
for r_name, r_type, r_nullable in right_fields.difference(left_fields):
if r_name in left_types:
l_type = right_types[r_name]
if r_type != l_type:
raise TypeError, "Union failed. Type conflict on field %s. right type %s, left type %s" % (r_name, r_type, l_type)
else:
raise TypeError, "Union failed. Nullability conflict on field %s. right nullable %s, left nullable %s" % (r_name, r_nullable, not(r_nullable))
df_left = df_left.withColumn(r_name, lit(None).cast(r_type))
return df_left.union(df_right)

Thanks #conradlee! I modified your solution to allow union by adding casting and removing nullability check. It worked for me.
def harmonize_schemas_and_combine(df_left, df_right):
'''
df_left is the main df; we try to append the new df_right to it.
Need to do three things here:
1. Set other claim/clinical features to NULL
2. Align schemas (data types)
3. Align column orders
'''
left_types = {f.name: f.dataType for f in df_left.schema}
right_types = {f.name: f.dataType for f in df_right.schema}
left_fields = set((f.name, f.dataType) for f in df_left.schema)
right_fields = set((f.name, f.dataType) for f in df_right.schema)
# import pdb; pdb.set_trace() #pdb debugger
# I. First go over left-unique fields:
# For columns in the main df, but not in the new df: add it as Null
# For columns in both df but w/ different datatypes, use casting to keep them consistent w/ main df (Left)
for l_name, l_type in left_fields.difference(right_fields): #1. find what Left has, Right doesn't
if l_name in right_types: #2A. if column is in both, then something's off w/ the schema
r_type = right_types[l_name] #3. tell me what's this column's type in Right
df_right = df_right.withColumn(l_name,df_right[l_name].cast(l_type)) #4. keep them consistent w/ main df (Left)
print("Casting magic happened on column %s: Left type: %s, Right type: %s. Both are now: %s." % (l_name, l_type, r_type, l_type))
else: #2B. if Left column is not in Right, add a NULL column to Right df
df_right = df_right.withColumn(l_name, F.lit(None).cast(l_type))
# Make sure Right columns are in the same order of Left
df_right = df_right.select(df_left.columns)
return df_left.union(df_right)

Here is another solution for this. I used rdd union because dataFrame union operation doesnt support multiple dataFrames.
Note - This should not be used to merge lot of dataFrames with different schema. The cost of adding null columns to dataFrames will result quickly in out of memory errors. (i.e: trying to merge 1000 dataFrames with 10 columns missing will result in 10,000 transformations)
If your use case it to read a dataFrame from storage with different schema that is composed from multiple paths with different schemas, a much better option would be to have your data saved as parquet in the first place and then use the 'mergeSchema' option when reading the dataFrame.
def unionDataFramesAndMergeSchema(spark, dfsList):
'''
This function can perform a union between x dataFrames with different schemas.
Non-existing columns will be filled with null.
Note: If a column exist in 2 dataFrames with different types, an exception will be thrown.
:example:
>>> df1 = spark.createDataFrame([
>>> {
>>> 'A': 1,
>>> 'B': 1,
>>> 'C': 1
>>> }])
>>> df2 = spark.createDataFrame([
>>> {
>>> 'A': 2,
>>> 'C': 2,
>>> 'DNew' : 2
>>> }])
>>> unionDataFramesAndMergeSchema(spark,[df1,df2]).show()
>>> +---+----+---+----+
>>> | A| B| C|DNew|
>>> +---+----+---+----+
>>> | 2|null| 2| 2|
>>> | 1| 1| 1|null|
>>> +---+----+---+----+
:param spark: The Spark session.
:param dfsList: A list of dataFrames.
:return: A union of all dataFrames, with schema merged.
'''
if len(dfsList) == 0:
raise ValueError("DataFrame list is empty.")
if len(dfsList) == 1:
logging.info("The list contains only one dataFrame, no need to perform union.")
return dfsList[0]
logging.info("Will perform union between {0} dataFrames...".format(len(dfsList)))
columnNamesAndTypes = {}
logging.info("Calculating unified column names and types...")
for df in dfsList:
for columnName, columnType in dict(df.dtypes).iteritems():
if columnNamesAndTypes.has_key(columnName) and columnNamesAndTypes[columnName] != columnType:
raise ValueError(
"column '{0}' exist in at least 2 dataFrames with different types ('{1}' and '{2}'"
.format(columnName, columnType, columnNamesAndTypes[columnName]))
columnNamesAndTypes[columnName] = columnType
logging.info("Unified column names and types: {0}".format(columnNamesAndTypes))
logging.info("Adding null columns in dataFrames if needed...")
newDfsList = []
for df in dfsList:
newDf = df
dfTypes = dict(df.dtypes)
for columnName, columnType in dict(columnNamesAndTypes).iteritems():
if not dfTypes.has_key(columnName):
# logging.info("Adding null column for '{0}'.".format(columnName))
newDf = newDf.withColumn(columnName, func.lit(None).cast(columnType))
newDfsList.append(newDf)
dfsWithOrderedColumnsList = [df.select(columnNamesAndTypes.keys()) for df in newDfsList]
logging.info("Performing a flat union between all dataFrames (as rdds)...")
allRdds = spark.sparkContext.union([df.rdd for df in dfsWithOrderedColumnsList])
return allRdds.toDF()

If you read both data frames from storage files you can just use predefined schema:
val schemaForRead =
StructType(List(
StructField("userId", LongType,true),
StructField("dtEvent", LongType,true),
StructField("goodsId", LongType,true)
))
val dfA = spark.read.format("parquet").schema(schemaForRead).load("/tmp/file1.parquet")
val dfB = spark.read.format("parquet").schema(schemaForRead).load("/tmp/file2.parquet")
val dfC = dfA.union(dfB)
Note that schema in files file1 and file2 can be different and can differ form schemaForRead. If file1 doesn't contain field from schemaForRead dataframe A will have empty field with null's. If file contains additional field not presented in schemaForRead dataframe just wouldn't have it.

Here's the version in Scala also answered here -
( Spark - Merge / Union DataFrame with Different Schema (column names and sequence) to a DataFrame with Master common schema ) -
It takes List of dataframe to be unioned .. Provided same named columns in all the dataframe should have same datatype..
def unionPro(DFList: List[DataFrame], spark: org.apache.spark.sql.SparkSession): DataFrame = {
/**
* This Function Accepts DataFrame with same or Different Schema/Column Order.With some or none common columns
* Creates a Unioned DataFrame
*/
import spark.implicits._
val MasterColList: Array[String] = DFList.map(_.columns).reduce((x, y) => (x.union(y))).distinct
def unionExpr(myCols: Seq[String], allCols: Seq[String]): Seq[org.apache.spark.sql.Column] = {
allCols.toList.map(x => x match {
case x if myCols.contains(x) => col(x)
case _ => lit(null).as(x)
})
}
// Create EmptyDF , ignoring different Datatype in StructField and treating them same based on Name ignoring cases
val masterSchema = StructType(DFList.map(_.schema.fields).reduce((x, y) => (x.union(y))).groupBy(_.name.toUpperCase).map(_._2.head).toArray)
val masterEmptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], masterSchema).select(MasterColList.head, MasterColList.tail: _*)
DFList.map(df => df.select(unionExpr(df.columns, MasterColList): _*)).foldLeft(masterEmptyDF)((x, y) => x.union(y))
}
Here is the sample test for it -
val aDF = Seq(("A", 1), ("B", 2)).toDF("Name", "ID")
val bDF = Seq(("C", 1, "D1"), ("D", 2, "D2")).toDF("Name", "Sal", "Deptt")
unionPro(List(aDF, bDF), spark).show
Which gives output as -
+----+----+----+-----+
|Name| ID| Sal|Deptt|
+----+----+----+-----+
| A| 1|null| null|
| B| 2|null| null|
| C|null| 1| D1|
| D|null| 2| D2|
+----+----+----+-----+

if you are using spark version > 2.3.0 then you can use the unionByName in-built function to get the required output.
Link to the Git Repo that contains the code for the unionByName code:
https://github.com/apache/spark/blame/cee4ecbb16917fa85f02c635925e2687400aa56b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L1894