I want to use the values in t5 to replace some missing values in t4. I searched for existing code, but it doesn't work for me.
Current:
example of current
Goal:
example of target
df is a Spark dataframe. Code:
pdf = df.toPandas()
from pyspark.sql.functions import coalesce
pdf.withColumn("t4", coalesce(pdf.t4, pdf.t5))
Error: 'DataFrame' object has no attribute 'withColumn'
Also, I tried the following code previously; it didn't work either.
new_pdf=pdf['t4'].fillna(method='bfill', axis="columns")
Error: No axis named columns for object type
As the error indicates, .withColumn() is not a method of pandas dataframes but of Spark dataframes. Note that .toPandas() turns your pdf into a pandas dataframe, so if you want to use .withColumn(), avoid that conversion.
UPDATE:
If pdf is a pandas dataframe you can do:
pdf['t4'] = pdf['t4'].fillna(pdf['t5'])
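If you would rather stay on the Spark side and skip .toPandas() entirely (as noted above), a minimal sketch using coalesce directly on the Spark dataframe, assuming df still carries the t4 and t5 columns:

from pyspark.sql.functions import coalesce

# Fill missing t4 values from t5 without converting to pandas
df = df.withColumn("t4", coalesce(df.t4, df.t5))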
I need to assign one DF's column value using another DF's column. For this I wrote the code below:
DF1.withColumn("hic_num", lit(DF2.select("hic_num")))
And got error:
SparkRuntimeException: the feature is not supported: literal for [HICN: string] of class org.apache.spark.sql.Dataset.
Please help me with the above.
lit stands for literal and is, as the name suggests, a literal, or a constant. A column is not a constant.
You can do .withColumn("hic_num2", col("hic_num")); you do not have to wrap the column in lit().
Also, in your example, you are trying to create a new column called hic_num with the value of hic_num, which does not make sense.
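A minimal, self-contained version of that suggestion, assuming hic_num is already a column of DF1 (names kept from the question):

from pyspark.sql.functions import col

# Reference an existing column with col(); lit() is only for constants
DF1 = DF1.withColumn("hic_num2", col("hic_num"))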
I have a piece of code that I want to translate into a Pandas UDF in PySpark but I'm having a bit of trouble understanding whether or not you can use conditional statements.
def is_pass_in(df):
    x = list(df["string"])
    result = []
    for i in x:
        if "pass" in i:
            result.append("YES")
        else:
            result.append("NO")
    df["result"] = result
    return df
The code is super simple: all I'm trying to do is iterate through a column where each row contains a sentence. I want to check if the word "pass" is in that sentence and, if so, append "YES" to a list that will later become a column right next to the df["string"] column. I've tried to do this using a Pandas UDF, but the error messages I'm getting are something I don't understand because I'm new to Spark. Could someone point me in the right direction?
There is no need to use a UDF. This can be done in pyspark as follows. Even in pandas, I would advise you not to do what you have done; use np.where() instead.
df.withColumn('result', when(col('string').contains('pass'), 'YES').otherwise('NO')).show()
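A self-contained sketch of both routes, with the column name taken from the question and two made-up sample sentences:

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

spark = SparkSession.builder.getOrCreate()

# Tiny made-up sample so the sketch runs on its own
df = spark.createDataFrame([("this sentence contains pass",), ("this one does not",)], ["string"])

# Spark: no UDF needed, a conditional column expression is enough
df.withColumn("result", when(col("string").contains("pass"), "YES").otherwise("NO")).show()

# pandas equivalent with np.where(), as suggested above
pdf = pd.DataFrame({"string": ["this sentence contains pass", "this one does not"]})
pdf["result"] = np.where(pdf["string"].str.contains("pass"), "YES", "NO")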
I'm trying to create a dataset from a json-string within a dataframe in Databricks 3.5 (Spark 2.2.1). In the code block below 'jsonSchema' is a StructType with the correct layout for the json-string which is in the 'body' column of the dataframe.
val newDF = oldDF.select(from_json($"body".cast("string"), jsonSchema))
This returns a dataframe where the root object is
jsontostructs(CAST(body AS STRING)):struct
followed by the fields in the schema (looks correct). When I try another select on the newDF
val transform = newDF.select($"propertyNameInTheParsedJsonObject")
it throws the exception
org.apache.spark.sql.AnalysisException: cannot resolve '`columnName`' given
input columns: [jsontostructs(CAST(body AS STRING))];;
I'm apparently missing something. I hoped from_json would return a dataframe I could manipulate further.
My ultimate objective is to cast the json-string within the oldDF body-column to a dataset.
from_json returns a struct (or array<struct<...>>) column. It means it is a nested object. If you've provided a meaningful name:
val newDF = oldDF.select(from_json($"body".cast("string"), jsonSchema) as "parsed")
and the schema describes a plain struct you could use standard methods like
newDF.select($"parsed.propertyNameInTheParsedJsonObject")
otherwise please follow the instructions for accessing arrays.
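For reference, the same pattern in PySpark, with old_df standing in for the original dataframe and a placeholder schema in place of jsonSchema (the field name is only an example):

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

# Placeholder schema; replace with the real layout of the JSON in the body column
json_schema = StructType([StructField("propertyNameInTheParsedJsonObject", StringType())])

new_df = old_df.select(from_json(col("body").cast("string"), json_schema).alias("parsed"))

# "parsed" is a struct column, so nested fields are reached with dot notation
new_df.select("parsed.propertyNameInTheParsedJsonObject").show()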
I am trying to save my dataframe as a parquet file with one partition per day, so I'm trying to use the date column. However, I want to write one file per partition, so I'm using repartition($"date"), but I keep getting errors:
I get the errors "cannot resolve symbol repartition" and "value $ is not a member of StringContext" when I use:
DF.repartition($"date")
.write
.mode("append")
.partitionBy("date")
.parquet("s3://file-path/")
And this error, "Type mismatch, expected: Column, actual: String", when I use:
DF.repartition("date")
.write
.mode("append")
.partitionBy("date")
.parquet("s3://file-path/")
However, this works fine without any error.
DF.write.mode("append").partitionBy("date").parquet("s3://file-path/")
Can't we use the date type in repartition? What's wrong here?
To use the $ symbol in place of col(), you first need to import spark.implicits._. Here, spark is an instance of a SparkSession, hence the import must be done after the SparkSession has been created. A simple example:
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
This import also enables other functionality, such as converting RDDs to DataFrames or Datasets with toDF() and toDS() respectively.
The following code does an aggregation and creates a column with a list datatype:
df.groupBy(
    "column_name_1"
).agg(
    expr("collect_list(column_name_2) AS column_name_3")
)
So it seems it is possible to have 'list' as a column datatype in a dataframe.
I was wondering if I can write a udf that returns a custom datatype, for example a Python dict?
The list is a representation of Spark's array datatype. You can try using the map datatype (pyspark.sql.types.MapType).
An example of something which creates it is pyspark.sql.functions.create_map, which creates a map from several columns.
That said, if you want to create a custom aggregation function to do anything not already available in pyspark.sql.functions, you will need to use Scala.
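A short sketch of both suggestions, assuming column_name_2 holds strings (names kept from the question above):

from pyspark.sql.functions import create_map, col, lit, udf
from pyspark.sql.types import MapType, StringType

# 1) Build a map column out of existing columns with create_map (keys and values alternate)
df = df.withColumn("as_map", create_map(lit("column_name_2"), col("column_name_2")))

# 2) A Python udf may return a dict as long as its return type is declared as MapType
dict_udf = udf(lambda s: None if s is None else {"original": s, "upper": s.upper()},
               MapType(StringType(), StringType()))
df = df.withColumn("as_dict", dict_udf(col("column_name_2")))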