I want to use the values in t5 to replace some missing values in t4. I searched for existing code, but it doesn't work for me.
Current:
example of current
Goal:
example of target
df is a Spark dataframe. Code:
pdf = df.toPandas()
from pyspark.sql.functions import coalesce
pdf.withColumn("t4", coalesce(pdf.t4, pdf.t5))
Error: 'DataFrame' object has no attribute 'withColumn'
Also, I tried the following code previously; it didn't work either.
new_pdf=pdf['t4'].fillna(method='bfill', axis="columns")
Error: No axis named columns for object type
As the error indicates, .withColumn() is not a method of pandas dataframes but of Spark dataframes. Note that .toPandas() turns your pdf into a pandas dataframe, so if you want to use .withColumn(), avoid that conversion.
UPDATE:
If pdf is a pandas dataframe you can do:
pdf['t4'] = pdf['t4'].fillna(pdf['t5'])
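If you would rather stay on the Spark side and skip .toPandas() entirely (as noted above), a minimal sketch using coalesce directly on the Spark dataframe, assuming df still carries the t4 and t5 columns:

from pyspark.sql.functions import coalesce

# Fill missing t4 values from t5 without converting to pandas
df = df.withColumn("t4", coalesce(df.t4, df.t5))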
I need to assign one DF's column value using another DF's column. For this I wrote the code below:
DF1.withColumn("hic_num", lit(DF2.select("hic_num")))
And got error:
SparkRuntimeException: the feature is not supported: literal for [HICN: string] of class org.apache.spark.sql.Dataset.
Please help me with the above.
lit stands for literal and is, as the name suggests, a literal, or a constant. A column is not a constant.
You can do .withColumn("hic_num2", col("hic_num")); you do not have to wrap the column in lit().
Also, in your example, you are trying to create a new column called hic_num with the value of hic_num, which does not make sense.
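A minimal, self-contained version of that suggestion, assuming hic_num is already a column of DF1 (names kept from the question):

from pyspark.sql.functions import col

# Reference an existing column with col(); lit() is only for constants
DF1 = DF1.withColumn("hic_num2", col("hic_num"))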
I have a piece of code that I want to translate into a Pandas UDF in PySpark but I'm having a bit of trouble understanding whether or not you can use conditional statements.
def is_pass_in(df):
    x = list(df["string"])
    result = []
    for i in x:
        if "pass" in i:
            result.append("YES")
        else:
            result.append("NO")
    df["result"] = result
    return df
The code is super simple: all I'm trying to do is iterate through a column where each row contains a sentence. I want to check if the word "pass" is in that sentence and, if so, append "YES" to a list that will later become a column right next to the df["string"] column. I've tried to do this using a Pandas UDF, but the error messages I'm getting are something I don't understand because I'm new to Spark. Could someone point me in the right direction?
There is no need to use a UDF. This can be done in pyspark as follows. Even in pandas, I would advise you not to do what you have done; use np.where() instead.
df.withColumn('result', when(col('string').contains('pass'), 'YES').otherwise('NO')).show()
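A self-contained sketch of both routes, with the column name taken from the question and two made-up sample sentences:

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

spark = SparkSession.builder.getOrCreate()

# Tiny made-up sample so the sketch runs on its own
df = spark.createDataFrame([("this sentence contains pass",), ("this one does not",)], ["string"])

# Spark: no UDF needed, a conditional column expression is enough
df.withColumn("result", when(col("string").contains("pass"), "YES").otherwise("NO")).show()

# pandas equivalent with np.where(), as suggested above
pdf = pd.DataFrame({"string": ["this sentence contains pass", "this one does not"]})
pdf["result"] = np.where(pdf["string"].str.contains("pass"), "YES", "NO")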
I'm trying to create a dataset from a json-string within a dataframe in Databricks 3.5 (Spark 2.2.1). In the code block below 'jsonSchema' is a StructType with the correct layout for the json-string which is in the 'body' column of the dataframe.
val newDF = oldDF.select(from_json($"body".cast("string"), jsonSchema))
This returns a dataframe where the root object is
jsontostructs(CAST(body AS STRING)):struct
followed by the fields in the schema (looks correct). When I try another select on the newDF
val transform = newDF.select($"propertyNameInTheParsedJsonObject")
it throws the exception
org.apache.spark.sql.AnalysisException: cannot resolve '`columnName`' given
input columns: [jsontostructs(CAST(body AS STRING))];;
I'm apparently missing something. I hoped from_json would return a dataframe I could manipulate further.
My ultimate objective is to cast the json-string within the oldDF body-column to a dataset.
from_json returns a struct (or array<struct<...>>) column. It means it is a nested object. If you've provided a meaningful name:
val newDF = oldDF.select(from_json($"body".cast("string"), jsonSchema) as "parsed")
and the schema describes a plain struct you could use standard methods like
newDF.select($"parsed.propertyNameInTheParsedJsonObject")
otherwise please follow the instructions for accessing arrays.
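For reference, the same pattern in PySpark, with old_df standing in for the original dataframe and a placeholder schema in place of jsonSchema (the field name is only an example):

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

# Placeholder schema; replace with the real layout of the JSON in the body column
json_schema = StructType([StructField("propertyNameInTheParsedJsonObject", StringType())])

new_df = old_df.select(from_json(col("body").cast("string"), json_schema).alias("parsed"))

# "parsed" is a struct column, so nested fields are reached with dot notation
new_df.select("parsed.propertyNameInTheParsedJsonObject").show()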
I am trying to save my dataframe as a parquet file with one partition per day, so I'm trying to use the date column. However, I want to write one file per partition, so I'm using repartition($"date"), but I keep getting errors:
I get the errors "cannot resolve symbol repartition" and "value $ is not a member of StringContext" when I use:
DF.repartition($"date")
.write
.mode("append")
.partitionBy("date")
.parquet("s3://file-path/")
And this error, "Type mismatch, expected: Column, actual: String", when I use:
DF.repartition("date")
.write
.mode("append")
.partitionBy("date")
.parquet("s3://file-path/")
However, this works fine without any error.
DF.write.mode("append").partitionBy("date").parquet("s3://file-path/")
Can't we use the date type in repartition? What's wrong here?
To use the $ symbol in place of col(), you first need to import spark.implicits._. Here, spark is an instance of a SparkSession, hence the import must be done after the SparkSession has been created. A simple example:
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
This import also enables other functionality, such as converting RDDs to DataFrames or Datasets with toDF() and toDS() respectively.
The following code does an aggregation and creates a column with a list datatype:
df.groupBy(
    "column_name_1"
).agg(
    expr("collect_list(column_name_2) AS column_name_3")
)
So it seems it is possible to have 'list' as a column datatype in a dataframe.
I was wondering if I can write a udf that returns a custom datatype, for example a Python dict?
The list is a representation of Spark's array datatype. You can try using the map datatype (pyspark.sql.types.MapType).
An example of something which creates it is pyspark.sql.functions.create_map, which creates a map from several columns.
That said, if you want to create a custom aggregation function to do anything not already available in pyspark.sql.functions, you will need to use Scala.
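A short sketch of both suggestions, assuming column_name_2 holds strings (names kept from the question above):

from pyspark.sql.functions import create_map, col, lit, udf
from pyspark.sql.types import MapType, StringType

# 1) Build a map column out of existing columns with create_map (keys and values alternate)
df = df.withColumn("as_map", create_map(lit("column_name_2"), col("column_name_2")))

# 2) A Python udf may return a dict as long as its return type is declared as MapType
dict_udf = udf(lambda s: None if s is None else {"original": s, "upper": s.upper()},
               MapType(StringType(), StringType()))
df = df.withColumn("as_dict", dict_udf(col("column_name_2")))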