I have simple dataframe consisting of 3 most important columns: ID, CAT and SUBCAT.
I want to group results including only rows with some type of SUBCAT.
Everything works fine with below code:
subcategories = ["AA", "AB", "BA", "BB"]

df_grouped = df \
    .groupby("ID") \
    .agg(
        collect_set(when(col("SUBCAT").isin(subcategories),
                         struct(*[df[columnName] for columnName in restOfColumns]))))
But when I add a "nested" when to differentiate the list of allowed SUBCATs per CAT:
df_grouped = df \
    .groupby("ID") \
    .agg(
        collect_set(when(col("SUBCAT").isin(
            when((col("CAT") == lit("A")), array([lit("AA"), lit("AB")]))
            .otherwise(array([lit("BA"), lit("BB")]))
        ), struct(*[df[columnName] for columnName in restOfColumns]))))
I start receiving this exception:
cannot resolve '(df.SUBCAT IN (CASE WHEN (df.CAT = 'A') THEN array('AA', 'AB') ELSE array('BA', 'BB') END))' due to data type mismatch: Arguments must be same type but were: string != array<string>
I have read similar topics here where people got similar errors, but not with the same type of query. Is such a "nested" when a limitation of PySpark, or is my query wrong?
In the first snippet, the string in the column is compared with the strings in the list.
But in the second case, since when only supports returning literal and column expressions (according to this documentation), the nested when you pass to isin returns an array of string literals.
cannot resolve '(df.SUBCAT IN (CASE WHEN (df.CAT = 'A') THEN array('AA', 'AB') ELSE array('BA', 'BB') END))' due to data type mismatch: Arguments must be same type but were: string != array<string>
That’s why it gives the data type mismatch exception above: the string in the column is being compared with an array of strings.
In cases like this, it is better to use another approach rather than a nested when, as mentioned in the comments.
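One alternative, as a minimal sketch (it reuses the subcategory lists from the question and the same restOfColumns placeholder), is to express the allowed CAT/SUBCAT combinations as a single boolean condition instead of nesting when inside isin:

from pyspark.sql.functions import col, collect_set, struct, when

# One boolean expression per CAT, combined with | (OR); this mirrors the
# when/otherwise logic: CAT == "A" allows AA/AB, everything else allows BA/BB.
allowed = (
    ((col("CAT") == "A") & col("SUBCAT").isin(["AA", "AB"])) |
    ((col("CAT") != "A") & col("SUBCAT").isin(["BA", "BB"]))
)

df_grouped = df \
    .groupby("ID") \
    .agg(collect_set(when(allowed, struct(*[df[c] for c in restOfColumns]))))

Here SUBCAT (a string) is compared with plain string lists on both branches, so there is no string vs. array comparison and the condition resolves cleanly.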
I'm trying to add exploded columns to a dataframe:
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Convenience function for turning JSON strings into DataFrames.
def jsonToDataFrame(json, schema=None):
    # SparkSessions are available with Spark 2.0+
    reader = spark.read
    if schema:
        reader.schema(schema)
    return reader.json(sc.parallelize([json]))
schema = StructType().add("a", MapType(StringType(), IntegerType()))

events = jsonToDataFrame("""
{
  "a": {
    "b": 1,
    "c": 2
  }
}
""", schema)

display(
  events.withColumn("a", explode("a").alias("x", "y"))
)
However, I'm hitting the following error:
AnalysisException: The number of aliases supplied in the AS clause does not match the number of columns output by the UDTF expected 2 aliases but got a
Any ideas?
In the end, I used the following:
display(
  events.select(explode("a").alias("x", "y"), *[c for c in events.columns])
)
This approach uses select to specify the columns to return.
The first argument explodes the data:
explode("a").alias("x", "y")
The second argument specifies that all existing columns should be included in the select:
*[c for c in events.columns]
Note that I'm prefixing the list with * - this sends each column name as a separate parameter.
Simpler Method
The API docs specify:
Parameters
    cols : str, Column, or list
        column names (string) or expressions (Column). If one of the column names is ‘*’, that column is expanded to include all columns in the current DataFrame.
We can simplify the first approach by passing in "*" to select all the columns:
display(
  events.select("*", explode("a").alias("x", "y"))
)
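As a side note (not part of the original answer): when a map column is exploded without an alias, the resulting columns are simply named key and value, so alias("x", "y") is only needed when you want different names:

# Exploded map columns default to "key" and "value" when no alias is given.
display(
  events.select("*", explode("a"))
)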
Hi, I'm starting to use PySpark and want to put in a when and otherwise condition:
df_1 = df.withColumn("test", when(df.first_name == df2.firstname & df.last_namne == df2.lastname, "1. Match on First and Last Name").otherwise ("No Match"))
I get the below error and wanted some assistance to understand why the above is not working.
Both df.first_name and df.last_name are strings, and df2.firstname and df2.lastname are strings too.
Error:
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
Thanks in advance
There are several issues in your statement:
For df.withColumn(), you cannot use df and df2 columns in one statement. First join the two dataframes using df.join(df2, on="some_key", how="left/right/full").
Enclose the and condition of "when" clause in round brackets: (df.first_name == df2.firstname) & (df.last_name == df2.lastname)
The string literals of "when" and "otherwise" should be enclosed in lit() like: lit("1. Match on First and Last Name") and lit("No Match").
There is possibly a typo in your field name df.last_namne.
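Putting those points together, a rough sketch of what the corrected statement could look like (some_key is just the placeholder join key from point 1, not a column from the original question):

from pyspark.sql.functions import when, lit

# 1. Join the two dataframes first.
joined = df.join(df2, on="some_key", how="left")

# 2. Parenthesize each comparison before combining them with &.
df_1 = joined.withColumn(
    "test",
    when(
        (df.first_name == df2.firstname) & (df.last_name == df2.lastname),
        lit("1. Match on First and Last Name"),
    ).otherwise(lit("No Match")),
)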
I'm using spark scala. I've two dataframes that I want to join and select all columns from the first and a few from the second.
This is my code, which doesn't work:
val df = df1.join(df2,
  df1("a") <=> df2("a")
    && df1("b") <=> df2("b"),
  "left").select(df1("*"),       // ---> is this correct?
    df2("c AS d", "e AS f"))     // ---> fails here
This fails with the following error,
too many arguments for method apply: (colName: String)org.apache.spark.sql.Column in class Dataset
df2("c AS d", "e AS f"))
I couldn't find a different method in the API to do it.
How do I do this?
Try using aliases. I don't know the Scala syntax, but the code below (Python/PySpark) joins the two tables and gets all columns from one table and some columns from the other:
import pyspark.sql.functions as f

df1_col = df1.columns

resultdf = df1.alias('left_table') \
    .join(df2.alias('right_table'), f.col('left_table.col1') == f.col('right_table.col1')) \
    .select(
        [f.col('left_table.' + xx) for xx in df1_col]
        + [f.col('right_table.col2'), f.col('right_table.col3'), f.col('right_table.col4')])
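If you also need the renaming part of the question (c AS d, e AS f), .alias() on each column does that. A small sketch on top of the code above, using the column names from the question as placeholders:

resultdf = df1.alias('left_table') \
    .join(df2.alias('right_table'), f.col('left_table.col1') == f.col('right_table.col1')) \
    .select(
        [f.col('left_table.' + xx) for xx in df1.columns]
        + [f.col('right_table.c').alias('d'), f.col('right_table.e').alias('f')])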
I would like to parse and get the value of a specific key from a PySpark SQL dataframe with the format below.
I was able to achieve this with a UDF, but it takes almost 20 minutes to process 40 columns with a JSON size of 100 MB. I tried explode as well, but it gives separate rows for each array element; I need only the value of a specific key in a given array of structs.
Format
array<struct<key:string,value:struct<int_value:string,string_value:string>>>
Function to get a specific key's value
def getValueFunc(searcharray, searchkey):
    for val in searcharray:
        if val["key"] == searchkey:
            if val["value"]["string_value"] is not None:
                actual = val["value"]["string_value"]
                return actual
            elif val["value"]["int_value"] is not None:
                actual = val["value"]["int_value"]
                return str(actual)
            else:
                return "---"
.....
getValue = udf(getValueFunc, StringType())
....
# register the udf so it can also be called by name from Spark SQL
spark.udf.register("getValue", getValue)
.....
df.select(getValue(col("event_params"), lit("category")).alias("event_category"))
For Spark 2.4.0+, you can use Spark SQL's filter() function to find the first array element which matches key == searchkey and then retrieve its value. Below is a Spark SQL snippet template (with searchkey as a variable) to do the first part mentioned above.
stmt = '''filter(event_params, x -> x.key == "{}")[0]'''.format(searchkey)
Run the above stmt with the expr() function, assign the value (a StructType) to a temporary column f1, and then use the coalesce() function to retrieve the non-null value.
from pyspark.sql.functions import expr
df.withColumn('f1', expr(stmt)) \
    .selectExpr("coalesce(f1.value.string_value, string(f1.value.int_value), '---') AS event_category") \
    .show()
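To see the whole approach in one place, here is a self-contained sketch (the sample rows, the id column and the searchkey value are made up for illustration; the event_params schema follows the format from the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()

# Sample data matching array<struct<key:string,value:struct<int_value:string,string_value:string>>>
schema = ("id INT, event_params ARRAY<STRUCT<key: STRING, "
          "value: STRUCT<int_value: STRING, string_value: STRING>>>")
data = [
    (1, [{"key": "category", "value": {"int_value": None, "string_value": "news"}}]),
    (2, [{"key": "category", "value": {"int_value": "42", "string_value": None}}]),
    (3, [{"key": "other", "value": {"int_value": None, "string_value": "x"}}]),
]
df = spark.createDataFrame(data, schema)

searchkey = "category"
stmt = '''filter(event_params, x -> x.key == "{}")[0]'''.format(searchkey)

# Rows with no matching key yield a NULL struct, so coalesce falls back to '---'.
df.withColumn('f1', expr(stmt)) \
    .selectExpr("id", "coalesce(f1.value.string_value, string(f1.value.int_value), '---') AS event_category") \
    .show()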
Let me know if you have any problem running the above code.
I have a dataset with a StringType column which contains nulls. I want to replace each null value with a string. I was trying the following:
val renameDF = DF
  .withColumn("code", when($"code".isNull, lit("NON")).otherwise($"code"))
But I am getting the following exception:
org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN
(del.code IS NULL) THEN 'NON' ELSE del.code END' due to
data type mismatch: THEN and ELSE expressions should all be same type
or coercible to a common type;
How can I make the string a column type compatible with $"code"?
This is weird, I just tried this snippet:
val df = Seq("yoyo","yaya",null).toDF("code")
df.withColumn("code", when($"code".isNull,lit("NON")).otherwise($"code")).show
And this is working fine. Can you share your Spark version? Did you import the Spark implicits? Are you sure your column is StringType?