PySpark error when converting boolean column to pandas

I'm trying to use the toPandas() function of pyspark on a simple dataframe with an id column (int), a score column (float) and a "pass" column (boolean).
My problem is that whenever I call the function I get this error:
> raise AttributeError("module {!r} has no attribute "
"{!r}".format(__name__, attr))
E AttributeError: module 'numpy' has no attribute 'bool'
/usr/local/lib/python3.8/site-packages/numpy/__init__.py:284: AttributeError
Column:
0 False
1 False
2 False
3 True
Name: pass, dtype: bool
Column<'pass'>
Do I need to manually convert this column to a different type?
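For reference, here is a minimal reproduction sketch of the setup described above (the column names and the "pass" values are taken from the question; the id and score values are made up):
# Minimal sketch of the dataframe described in the question.
# Column names and the "pass" values come from the question; scores are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(0, 0.7, False), (1, 0.4, False), (2, 0.2, False), (3, 0.9, True)],
    ["id", "score", "pass"],
)

# On some NumPy/PySpark version combinations this raises
# AttributeError: module 'numpy' has no attribute 'bool'
pdf = df.toPandas()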

Related

convert column with 0 to float in pyspark

I'm trying to convert a column to double or float; however, the column has 0 values, so I'm getting errors when I try to use that column after applying the cast.
df = (df.withColumn('received_sp_click_l1wk' ,df['received_sp_click_l1wk'].cast("double")))
This doesn't return any error; however, applying any function to the cast column returns errors:
df.head(7)
TypeError: field received_sp_click_l1wk: FloatType can not accept object 0 in type <class 'int'>
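For reference, here is a self-contained sketch of the cast described above (the column name comes from the question; the data values are made up):
# Sketch of the cast from the question, on hypothetical data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(0,), (3,), (0,)], ["received_sp_click_l1wk"])
df = df.withColumn("received_sp_click_l1wk",
                   df["received_sp_click_l1wk"].cast("double"))
df.printSchema()  # received_sp_click_l1wk: double (nullable = true)
df.head(7)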

How to apply nltk.pos_tag on pyspark dataframe

I'm trying to apply POS tagging on one of my tokenized columns, called "removed", in a pyspark dataframe.
I'm trying with
nltk.pos_tag(df_removed.select("removed"))
But all I get is: ValueError: Cannot apply 'in' operator against a column: please use 'contains' in a string column or 'array_contains' function for an array column.
How can I make it work?
It seems the answer is in the error message: the input of pos_tag should be a string, and you provide a column as input. You should apply pos_tag on each row of your column, using the function withColumn.
For example, you start by writing:
my_new_df = df_removed.withColumn("removed", nltk.pos_tag(df_removed.removed))
You can also do:
my_new_df = df_removed.select("removed").rdd.map(lambda x: nltk.pos_tag(x)).toDF()
Here is the documentation.
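Note that nltk.pos_tag is a plain Python function rather than a Spark expression, so in practice it usually needs to be wrapped in a UDF before it can be used inside withColumn. Below is a minimal sketch of that approach, assuming the "removed" column holds an array of token strings and the NLTK tagger data is available on the executors:
# Sketch: wrap nltk.pos_tag in a UDF so it runs on each row of the
# "removed" column (assumed to be an array of token strings).
# Requires the 'averaged_perceptron_tagger' NLTK data on the executors.
import nltk
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType

pos_tag_udf = udf(
    lambda tokens: [list(pair) for pair in nltk.pos_tag(tokens)]
    if tokens is not None else None,
    ArrayType(ArrayType(StringType())),  # each element is a [token, tag] pair
)

my_new_df = df_removed.withColumn("removed_tagged", pos_tag_udf(col("removed")))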

How to set 'int' datatype for a column with n/a values

I have a table in postgres with a column 'col' containing age values. This column also contains n/a values.
When I apply a condition of age < 15, I get the error below:
[Code: 0, SQL State: 22P02] ERROR: invalid input syntax for integer: "n/a"
I am using the query below to handle the n/a values, but I am still getting the same error:
ALTER TABLE tb
    ADD COLUMN col CHARACTER VARYING;
UPDATE tb
SET col = CASE
              WHEN age::int <= 15 THEN 'true'
              ELSE 'false'
          END;
Please note that 'age' is stored as text in my table. I have two questions here:
How can I set the datatype while creating the initial table (in the create table statement)?
How can I handle n/a values in the above case statement?
Thanks
You should really fix your data model and store numbers in integer columns.
You can get around your current problem by converting your invalid "numbers" to null:
UPDATE tb
SET col = CASE
              WHEN nullif(age, 'n/a')::int <= 15 THEN 'true'
              ELSE 'false'
          END;
And it seems col should be a boolean rather than a text column as well.

PySpark Parsing nested array of struct

I would like to parse and get the value of a specific key from a PySpark SQL dataframe with the format below.
I was able to achieve this with a UDF, but it takes almost 20 minutes to process 40 columns with a JSON size of 100 MB. I tried explode as well, but it gives separate rows for each array element, and I only need the value of a specific key in a given array of structs.
Format
array<struct<key:string,value:struct<int_value:string,string_value:string>>>
Function to get the value of a specific key
def getValueFunc(searcharray, searchkey):
    for val in searcharray:
        if val["key"] == searchkey:
            if val["value"]["string_value"] is not None:
                actual = val["value"]["string_value"]
                return actual
            elif val["value"]["int_value"] is not None:
                actual = val["value"]["int_value"]
                return str(actual)
            else:
                return "---"
.....
getValue = udf(getValueFunc, StringType())
....
# register the name rank udf template
spark.udf.register("getValue", getValue)
.....
df.select(getValue(col("event_params"), lit("category")).alias("event_category"))
For Spark 2.4.0+, you can use Spark SQL's filter() function to find the first array element that matches key == searchkey and then retrieve its value. Below is a Spark SQL snippet template (with searchkey as a variable) that does the first part mentioned above.
stmt = '''filter(event_params, x -> x.key == "{}")[0]'''.format(searchkey)
Run the above stmt with the expr() function, assign the value (a StructType) to a temporary column f1, and then use the coalesce() function to retrieve the first non-null value.
from pyspark.sql.functions import expr
df.withColumn('f1', expr(stmt)) \
.selectExpr("coalesce(f1.value.string_value, string(f1.value.int_value),'---') AS event_category") \
.show()
Let me know if you have any problem running the above code.
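For completeness, here is a small self-contained sketch of the filter()/coalesce() approach above, run against a tiny dataframe with the array<struct<...>> layout from the question (the data values are made up):
# Sketch: tiny dataframe with the layout from the question (hypothetical values),
# then the filter()/coalesce() steps from the answer above (Spark 2.4+).
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()

data = [
    ([{"key": "category", "value": {"int_value": None, "string_value": "news"}},
      {"key": "level", "value": {"int_value": "3", "string_value": None}}],),
]
schema = ("event_params array<struct<key:string,"
          "value:struct<int_value:string,string_value:string>>>")
df = spark.createDataFrame(data, schema)

searchkey = "category"
stmt = '''filter(event_params, x -> x.key == "{}")[0]'''.format(searchkey)

df.withColumn("f1", expr(stmt)) \
  .selectExpr("coalesce(f1.value.string_value, string(f1.value.int_value), '---') AS event_category") \
  .show()
# -> event_category: news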

Spark - Getting Type mismatch when assigning a string label to null values

I have a dataset with a StringType column which contains nulls. I want to replace each null value with a string. I was trying the following:
val renameDF = DF
.withColumn("code", when($"code".isNull,lit("NON")).otherwise($"code"))
But I am getting the following exception:
org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN
(del.code IS NULL) THEN 'NON' ELSE del.code END' due to
data type mismatch: THEN and ELSE expressions should all be same type
or coercible to a common type;
How can I make the string literal a type compatible with $"code"?
This is weird; I just tried this snippet:
val df = Seq("yoyo","yaya",null).toDF("code")
df.withColumn("code", when($"code".isNull,lit("NON")).otherwise($"code")).show
And this works fine. Can you share your Spark version? Did you import the Spark implicits? Are you sure your column is a StringType?