convert column with 0 to float in pyspark

I'm trying to convert a column to double or float; however, the column has 0 values, so I'm getting errors when I try to use that column after applying the cast.
df = df.withColumn('received_sp_click_l1wk', df['received_sp_click_l1wk'].cast("double"))
This doesn't return any error, but applying any function to the cast column returns errors:
df.head(7)
TypeError: field received_sp_click_l1wk: FloatType can not accept object 0 in type <class 'int'>
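For context, this error usually comes from PySpark's Python-side schema verification rather than from the cast itself: if the schema declares a float/double field while the underlying Python values are ints, the rows fail verification when an action such as head() finally evaluates them. A minimal sketch of that situation, under the assumption that the DataFrame was built from Python data with a manually declared schema (the reproduction is illustrative, not taken from the question):
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, FloatType, LongType

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([(0,), (3,)])

# Declaring FloatType while the values are Python ints reproduces the error;
# verification happens lazily, when head() (or any other action) runs.
bad = spark.createDataFrame(rdd, StructType([StructField("received_sp_click_l1wk", FloatType())]))
# bad.head()  # TypeError: field received_sp_click_l1wk: FloatType can not accept object 0 in type <class 'int'>

# Declaring a type that matches the raw data and casting afterwards avoids the mismatch.
good = spark.createDataFrame(rdd, StructType([StructField("received_sp_click_l1wk", LongType())]))
good = good.withColumn("received_sp_click_l1wk", good["received_sp_click_l1wk"].cast("double"))
good.head(2)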

Related

Pyspark error when converting boolean column to pandas

I'm trying to use the toPandas() function of pyspark on a simple dataframe with an id column (int), a score column (float) and a "pass" column (boolean).
My problem is that whenever I call the function I get this error:
> raise AttributeError("module {!r} has no attribute "
"{!r}".format(__name__, attr))
E AttributeError: module 'numpy' has no attribute 'bool'
/usr/local/lib/python3.8/site-packages/numpy/__init__.py:284: AttributeError
Column:
0 False
1 False
2 False
3 True
Name: pass, dtype: bool
Column<'pass'>
Do I need to manually convert this column to a different type?
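This looks like the known incompatibility between NumPy 1.24+ (which removed the deprecated numpy.bool alias) and older PySpark releases that still reference that alias when converting BooleanType columns in toPandas(). A workaround sketch, using the column name from the question; upgrading PySpark or pinning numpy below 1.24 are the cleaner fixes:
from pyspark.sql import functions as F

# Sidestep the BooleanType conversion by casting before toPandas();
# whether this is needed depends on the PySpark/NumPy versions installed.
pdf = df.withColumn("pass", F.col("pass").cast("integer")).toPandas()
pdf["pass"] = pdf["pass"].astype(bool)  # restore booleans on the pandas side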

null values while changing data type from string type to integer type in Databricks

I have a table that contains a column of numbers like (959, 1189, ...). When I check the column type, I find it is string type, so I changed the column to integer type. The problem is that once the column is integer type, it shows null values that didn't exist before in place of some values (every number > 999, for example 1232). This is how I'm changing the data type; any help?
from pyspark.sql.types import IntegerType

dfnumber2 = dfnumber.withColumn(
    "Offres d'emploi",
    dfnumber["Offres d'emploi"].cast(IntegerType()),
)
dfnumber2.printSchema()
The values may be too big for the int type, so PySpark turns them into null when the cast fails; perhaps try casting to double type instead:
from pyspark.sql.types import DoubleType

dfnumber2 = dfnumber.withColumn(
    "Offres d'emploi",
    dfnumber["Offres d'emploi"].cast(DoubleType()),
)
dfnumber2.printSchema()
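If the double cast still produces nulls, it can help to look directly at the strings that fail to cast. A small diagnostic sketch, assuming dfnumber is the original DataFrame with the string-typed column; stray characters such as spaces or thousands separators would also make a cast return null:
from pyspark.sql import functions as F

emploi = F.col("Offres d'emploi")

# Show the original strings whose integer cast comes back null -- these are
# the values being lost, whatever the underlying reason.
dfnumber.where(emploi.cast("int").isNull() & emploi.isNotNull()).select("Offres d'emploi").show(20, truncate=False)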

How to apply nltk.pos_tag on pyspark dataframe

I'm trying to apply POS tagging to one of my tokenized columns, called "removed", in a PySpark dataframe.
I'm trying with
nltk.pos_tag(df_removed.select("removed"))
But all I get is: ValueError: Cannot apply 'in' operator against a column: please use 'contains' in a string column or 'array_contains' function for an array column.
How can I make it work?
It seems the answer is in the error message: pos_tag expects a list of tokens (or a string), while you are passing it a whole column. You should apply pos_tag to each row of your column; with withColumn that means wrapping it in a UDF (a full sketch follows below this answer).
For example, with such a UDF you could write:
my_new_df = df_removed.withColumn("removed", pos_tag_udf(df_removed.removed))
You can also do:
my_new_df = df_removed.select("removed").rdd.map(lambda row: (nltk.pos_tag(row[0]),)).toDF()
Here you have the documentation.
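For completeness, a self-contained sketch of the UDF approach mentioned above. The column name removed comes from the question; the output schema is one choice among several, and nltk (with its tagger data) is assumed to be installed on the executors:
import nltk
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# One (word, tag) struct per token; pos_tag runs row by row inside the UDF.
tag_schema = ArrayType(StructType([
    StructField("word", StringType()),
    StructField("tag", StringType()),
]))

@F.udf(returnType=tag_schema)
def pos_tag_udf(tokens):
    # tokens is the Python list held in a single cell of the "removed" column
    return nltk.pos_tag(tokens) if tokens else []

my_new_df = df_removed.withColumn("removed", pos_tag_udf(F.col("removed")))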

Spark - Getting Type mismatch when assigning a string label to null values

I have a dataset with a StringType column which contains nulls. I wanted to replace each null value with a string. I was trying the following:
val renameDF = DF
.withColumn("code", when($"code".isNull,lit("NON")).otherwise($"code"))
But I am getting the following exception:
org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN
(del.code IS NULL) THEN 'NON' ELSE del.code END' due to
data type mismatch: THEN and ELSE expressions should all be same type
or coercible to a common type;
How can I make the string literal compatible with the column type of $"code"?
This is weird; I just tried this snippet:
val df = Seq("yoyo","yaya",null).toDF("code")
df.withColumn("code", when($"code".isNull,lit("NON")).otherwise($"code")).show
And this is working fine. Can you share your Spark version? Did you import the Spark implicits? Are you sure your column is StringType?

How to handle nulls in SparkSQL Dataframes

This is the code that I am following:
val ebayds = sc.textFile("/user/spark/xbox.csv")
case class Auction(auctionid: String, bid: Float, bidtime: Float, bidder: String, bidderrate: Int, openbid: Float, price: Float)
val ebay = ebayds.map(a=>a.split(",")).map(p=>Auction(p(0),p(1).toFloat,p(2).toFloat,p(3),p(4).toInt,p(5).toFloat,p(6).toFloat)).toDF()
ebay.select("auctionid").distinct.count
The error that I am getting is:
For input string: ""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
Use DataFrameNaFunctions
DataFrame fill(double value)
Returns a new DataFrame that replaces null values in numeric columns with value.
DataFrame fill(double value, scala.collection.Seq cols)
(Scala-specific) Returns a new DataFrame that replaces null values in the specified numeric columns.
Example usage:
df.na.fill(0.0, Seq("your columnname"))
For that column, null values will be replaced with 0.0 (or whatever default value you pass).
replace is also useful for replacing empty strings with default values.
replace
public DataFrame replace(String col, java.util.Map replacement)
Replaces values matching keys in the replacement map with the corresponding values. Key and value of the replacement map must have the same type, and can only be doubles or strings. If col is "*", then the replacement is applied on all string columns or numeric columns.
import com.google.common.collect.ImmutableMap;
// Replaces all occurrences of 1.0 with 2.0 in column "height".
df.replace("height", ImmutableMap.of(1.0, 2.0));
// Replaces all occurrences of "UNKNOWN" with "unnamed" in column "name".
df.replace("name", ImmutableMap.of("UNKNOWN", "unnamed"));
// Replaces all occurrences of "UNKNOWN" with "unnamed" in all string columns.
df.replace("*", ImmutableMap.of("UNKNOWN", "unnamed"));
Parameters: col - name of the column to apply the value replacement; replacement - value replacement map, as explained above
Returns: (undocumented)
Since: 1.3.1
For example:
df.na.replace("your column", Map("" -> "0"))
(Keys and values in the replacement map must have the same type, so an empty string has to be replaced by another string here.)
This worked for me. It returned a dataframe. Here A and B are columns, and "unknown" and 1.0 are the values used to fill the nulls in those columns.
df.na.fill(Map("A" -> "unknown","B" -> 1.0))
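For anyone doing the same from PySpark, the equivalent DataFrameNaFunctions calls look like this; a short sketch with placeholder column names, mirroring the Scala/Java examples above:
# PySpark equivalents of the fill/replace calls above.
df = df.na.fill(0.0, subset=["your_column_name"])           # nulls -> 0.0 in one column
df = df.na.fill({"A": "unknown", "B": 1.0})                  # per-column fill values
df = df.na.replace("UNKNOWN", "unnamed", subset=["name"])    # value replacement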