This question already has answers here:
How do I detect if a Spark DataFrame has a column
(11 answers)
Closed 2 years ago.
I have a large DataFrame in which I need to check whether a particular column (column_A) exists. If the column exists, some processing needs to happen based on it; otherwise, different processing needs to run.
This is what I am currently trying:
try:
    input_df = input_df.withColumn("column_A", input_df["column_A"].cast(StringType()))
    # do some processing
except:
    input_df = input_df.drop('column_B')
There must be a better way of achieving it. Thanks in advance
I am not sure what the "better" way would be, but this works:
if "id" in df.columns:
print("There is id")
else:
print("There is no id")
# There is id
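If the same check is needed from the Scala API, here is a minimal sketch applying it to the scenario in the question (the cast-or-drop branches mirror the original try/except; the column names come from the question):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StringType

// Branch on whether the column exists instead of relying on try/except
def process(inputDf: DataFrame): DataFrame =
  if (inputDf.columns.contains("column_A"))
    // column_A exists: cast it to string, then continue with the column_A-specific processing
    inputDf.withColumn("column_A", inputDf("column_A").cast(StringType))
  else
    // column_A is missing: fall back to the other branch from the question
    inputDf.drop("column_B")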
This question already has answers here:
How to show full column content in a Spark Dataframe?
(18 answers)
Closed 2 years ago.
My data looks like this:
H1234|1234|1999-12-03.3.22.34.132456
G1345|2345|1998-11-03-12.22.45.23456
I stored this data in a List[String] and converted it to a DataFrame by doing:
val dataframe = list.map(r => r.split("\\|")).map(r => (r(0), r(1), r(2))).toDF("ID", "Number", "Timestamp")
but when I use dataframe.show, the Timestamp column comes out as below:
(FYI every value is a String)
Timestamp
1999-12-03.3.22....
1998-11-03-12.22...
Could you please tell me how to solve this?
Your data is actually intact; show is just truncating the output. Call dataframe.show(false) to disable the truncation.
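For reference, these overloads exist in the Scala API (dataframe here is the DataFrame from the question):

dataframe.show()           // default: values longer than 20 characters are truncated
dataframe.show(false)      // full column content, no truncation
dataframe.show(20, false)  // explicit row count, no truncation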
This question already has an answer here:
Spark SQL filter multiple fields
(1 answer)
Closed 3 years ago.
I tried searching for this on Stack Overflow, but I couldn't find an answer. Is there Spark syntax that filters rows where two or more columns share the same value? For instance, something like:
dataFrame.filter($"col01" == $"col02"== $"col03")
Yes, there is. You almost had it: column equality in Spark uses three equals signs (===), and each pair of columns should be compared explicitly, with the conditions combined using &&:
dataFrame.filter($"col01" === $"col02" && $"col02" === $"col03")
Example:
val df = spark.sparkContext.parallelize(Array((1,1,1),(1,2,3))).toDF("col01","col02","col03")
df.filter($"col01" === $"col02"=== $"col03").show(false)
Result:
+-----+-----+-----+
|col01|col02|col03|
+-----+-----+-----+
|1    |1    |1    |
+-----+-----+-----+
Only the row in which all three columns are equal is kept.
This question already has answers here:
About how to add a new column to an existing DataFrame with random values in Scala
(2 answers)
Closed 4 years ago.
We have a UUID udf:
import java.util.UUID
val idUdf = udf(() => idgen.incrementAndGet.toString + "_" + UUID.randomUUID)
spark.udf.register("idgen", idUdf)
The issue is that when running count, show, or write, each of those ends up with a different value for the udf result.
df.count() // generates a UUID for each row
df.show() // regenerates a UUID for each row
df.write.parquet(path) // .. you get the picture ..
What approaches might be taken to retain a single UUID result for a given row? The first thought would be to invoke a remote key-value store keyed on some unique combination of other stable fields within each row. That is of course expensive, both due to the lookup per row and the configuration and maintenance of the remote KV store. Are there other mechanisms to achieve stability for these unique ID columns?
Just define your udf as nondeterministic by calling:
val idUdf = udf(() => idgen.incrementAndGet.toString + "_" + UUID.randomUUID)
.asNondeterministic()
Marking the udf as non-deterministic stops the optimizer from duplicating or re-evaluating the expression within a plan, so each row gets a single value per execution. Separate actions (count, show, write) still re-run the plan, so cache or checkpoint the DataFrame after adding the column if the values must stay identical across actions.
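A minimal sketch of the whole pattern, assuming idgen is an AtomicLong counter (it is not defined in the question) and that df and path are the same ones used above. The cache() call is the addition that keeps the generated IDs stable across actions; it is best-effort, since cached blocks can be evicted, so checkpointing or writing out and reading back is the fully reliable route:

import java.util.UUID
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.sql.functions.udf

val idgen = new AtomicLong()   // assumed counter backing the id prefix

// Non-deterministic, so the optimizer will not duplicate or re-evaluate the expression
val idUdf = udf(() => idgen.incrementAndGet.toString + "_" + UUID.randomUUID)
  .asNondeterministic()

// Cache after adding the column so later actions reuse the already computed IDs
val withId = df.withColumn("uuid", idUdf()).cache()

withId.count()                 // materializes the IDs
withId.show()                  // reuses the cached IDs
withId.write.parquet(path)     // reuses the cached IDs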
This question already has answers here:
Write to multiple outputs by key Spark - one Spark job
(10 answers)
Closed 4 years ago.
If there were a Spark RDD like this:
id | data
----------
1 | "a"
1 | "b"
2 | "c"
3 | "d"
How could I output this to separate JSON text files, grouped by the id, such that part-0000-1.json contains rows "a" and "b", part-0000-2.json contains "c", etc.?
df.write.partitionBy("col").json(<path_to_file>)
is what you need.
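A minimal sketch, under the assumption that the data starts as (id, data) pairs like the table above; the column names, the output path, and the toDF conversion step are illustrative:

import spark.implicits._

// Illustrative input matching the table in the question
val rdd = spark.sparkContext.parallelize(Seq((1, "a"), (1, "b"), (2, "c"), (3, "d")))

rdd.toDF("id", "data")
  .write
  .partitionBy("id")
  .json("/tmp/output")   // illustrative path

// Output is grouped into one directory per key rather than one file per key:
// /tmp/output/id=1/part-....json  contains "a" and "b"
// /tmp/output/id=2/part-....json  contains "c", and so on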
Thanks to @thebluephantom, I was able to understand what was going wrong.
I was fundamentally misunderstanding Spark. When I initially did df.write.partitionBy("col").json(<path_to_file>) as @thebluephantom suggested, I was confused as to why my output was split across many different files.
I have since added .repartition(1) to collect all the data onto a single node, and then partitionBy("col") to split the data there into multiple file outputs. My final code is:
latestUniqueComments
.repartition(1)
.write
.mode(SaveMode.Append)
.partitionBy("_manual_file_id")
.format("json")
.save(outputFile)
This question already has an answer here:
Computing difference between Spark DataFrames
(1 answer)
Closed 5 years ago.
I see that EXCEPT and NOT IN are the same in SQL, and in Spark we have the except function.
The documentation is there, but can anyone give an example of how to implement this in Scala?
The path to the DataFrame class does contain the word 'sql', but it's still a class that you can create and use directly in Scala. You can call the except function:
val df_final = df1.except(df2)
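A small self-contained example (the DataFrames and the id column are made up for illustration):

import spark.implicits._

val df1 = Seq(1, 2, 3, 4).toDF("id")
val df2 = Seq(3, 4).toDF("id")

// Rows of df1 that do not appear in df2, with duplicates removed (like SQL EXCEPT)
val dfFinal = df1.except(df2)
dfFinal.show()   // leaves the rows with id 1 and 2 (row order may vary)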