I am trying to change a column from string to integer, but I am getting a NoneType error as shown. I am not sure what is going wrong here.
In cell #113 you ended the chain with printSchema(), which has return type None. Remove that call from the assignment and your data_use will be a DataFrame.
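A minimal sketch of the idea, with hypothetical data standing in for your notebook's DataFrame (the column name my_col is an assumption):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("1",), ("2",)], ["my_col"])  # hypothetical data

# Keep the cast and the schema print separate: printSchema() returns None,
# so assigning its result leaves you with a NoneType instead of a DataFrame.
data_use = df.withColumn("my_col", F.col("my_col").cast("int"))
data_use.printSchema()  # call it on its own, purely for inspection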
I need to assign one DataFrame's column values using another DataFrame's column. For this I wrote the following:
DF1.withColumn("hic_num", lit(DF2.select("hic_num")))
And got this error:
SparkRuntimeException: the feature is not supported: literal for [HICN: string] of class org.apache.spark.sql.Dataset.
Please help me with the above.
lit stands for literal and is, as the name suggests, a literal, or a constant. A column is not a constant.
You can do .withColumn("hic_num2", col("hic_num")); you do not have to wrap the column in a literal.
Also, in your example, you are trying to create a new column called hic_num with the value of hic_num, which does not make sense.
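If the goal is to bring DF2's hic_num values into DF1, the usual approach is a join rather than lit(); a sketch, assuming the two DataFrames share a key column (the name member_id here is hypothetical):

# lit() only wraps constants; to copy values across DataFrames, join on a shared key.
result = DF1.join(
    DF2.select("member_id", "hic_num"),  # "member_id" is an assumed join key
    on="member_id",
    how="left",
)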
I have existing code, shown below, in Scala/Spark, where I am supposed to replace repartition() with coalesce(). But after changing it to coalesce it does not compile, reporting a datatype mismatch because it is treating the arguments as column names.
How can I change the existing code to coalesce (with column names), or is there no way to do it?
As I am new to Scala, any suggestion would help and be appreciated. Do let me know if you need any more details. Thanks!
val accountList = AccountList(MAPR_DB, src_accountList)
  .filterByAccountType("GAMMA")
  .fetchOnlyAccountsToProcess.df
  .repartition($"Account", $"SecurityNo", $"ActivityDate")

val accountNos = broadcast(accountList.select($"AccountNo", $"Account").distinct)
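For reference, Dataset.coalesce only accepts a target number of partitions and cannot partition by columns, which is why the column arguments are rejected; repartition is the variant that takes columns. A PySpark sketch of the same API distinction (the Scala signatures mirror it, and the DataFrame here is a toy stand-in):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).withColumnRenamed("id", "Account")  # toy stand-in for accountList

df.coalesce(4)                # OK: coalesce only reduces the partition count
df.repartition("Account")     # OK: repartition accepts column names
df.repartition(4, "Account")  # OK: a partition count plus columns
# df.coalesce("Account")      # fails: coalesce has no column-based form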
I'm working with a column inside a PySpark DataFrame that is of string type, and I'm trying to isolate the string values that start with a number and then convert that column to a new string based on that condition.
How can I apply something to a whole column to check whether the first character is a number?
I'm guessing you're asking for something like this:
newdf = (
    # True when the first character casts to an integer, i.e. the value starts with a digit.
    DF.withColumn("is_start_with_digit", DF.MyColumn.substr(1, 1).cast("int").isNotNull())
)
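If the goal is then to build the new string column from that check, the same test can feed a conditional expression; a sketch, where the num_ prefix is just an assumed example of the desired transformation:

from pyspark.sql import functions as F

starts_with_digit = F.col("MyColumn").substr(1, 1).cast("int").isNotNull()

newdf = DF.withColumn(
    "MyColumnNew",
    F.when(starts_with_digit, F.concat(F.lit("num_"), F.col("MyColumn")))  # assumed replacement
     .otherwise(F.col("MyColumn")),
)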
I have been scratching my head over a problem in PySpark.
I want to conditionally apply a UDF on a column depending on whether it is NULL or not. One constraint is that I do not have access to the DataFrame at the location where I am writing the code; I only have access to a Column object.
Thus, I cannot simply do:
df.where(my_col.isNull()).select(my_udf(my_col)).toPandas()
Therefore, having only access to a Column object, I was writing the following:
my_res_col = F.when(my_col.isNull(), F.lit(0.0)).otherwise(my_udf(my_col))
And then later do:
df.select(my_res_col).toPandas()
Unfortunately, for some reason that I do not know, I still receive NULLs in my UDF, forcing me to check for NULL values directly in my UDF.
I do not understand why the isNull() is not preventing rows with NULL values from calling the UDF.
Any insight on this matter would be greatly appreciated.
I thank you in advance for your help.
Antoine
I am not sure about your data. Does it contain NaN? Spark handles null and NaN differently.
Differences between null and NaN in spark? How to deal with it?
So can you just try the below and check if it solves the issue?
import pyspark.sql.functions as F
my_res_col = F.when(my_col.isNull() | F.isnan(my_col), F.lit(0.0)).otherwise(my_udf(my_col))
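Applied end to end it would look like the sketch below, with a toy null-safe lambda standing in for my_udf (kept defensive because, as the question notes, the UDF may still be handed NULLs despite the when() guard):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,), (None,), (float("nan"),)], ["my_col"])

my_udf = F.udf(lambda x: None if x is None else x * 2.0, DoubleType())  # toy stand-in
my_col = F.col("my_col")

my_res_col = F.when(my_col.isNull() | F.isnan(my_col), F.lit(0.0)).otherwise(my_udf(my_col))
df.select(my_res_col.alias("result")).toPandas()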
Problem statement
I have a table called employee from which I am creating a DataFrame. There are some columns which don't have any records. I want to remove those columns from the DataFrame. I also don't know how many columns of the DataFrame have no records in them.
You cannot remove a column from a DataFrame in place, AFAIK!
What you can do is make another DataFrame from the old one and select only the columns that you actually want!
Example:
The old DataFrame's schema looks like this: (id, name, badColumn, email)
Then:
val newDf = oldDF.select("id", "name", "email")
Or there is one more thing that you can use:
the .drop() function on a DataFrame, which takes column names, drops them, and returns a new DataFrame!
You can find out more about it here: https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.sql.Dataset#drop(col:org.apache.spark.sql.Column):org.apache.spark.sql.DataFrame
I hope this solves your use case!
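Since the question also says the empty columns are not known in advance, one option is to count the non-null values per column and pass the empty ones to drop(); a PySpark sketch of that idea (the employee data here is hypothetical):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
oldDF = spark.createDataFrame(
    [(1, "a", None, "a@x.com"), (2, "b", None, "b@x.com")],
    schema="id INT, name STRING, badColumn STRING, email STRING",
)

# Count the non-null values in every column in a single pass.
counts = oldDF.select([F.count(F.col(c)).alias(c) for c in oldDF.columns]).first()

# Columns with a zero count have no records; drop them all at once.
empty_cols = [c for c in oldDF.columns if counts[c] == 0]
newDF = oldDF.drop(*empty_cols)
newDF.printSchema()  # only id, name and email remain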