How to rename columns starting with 20 dynamically Pyspark - pyspark

I have columns in my dataframe like this where the columns starting with 20 were generated dynamically.
I want to rename the columns starting with 20 to 2019_p, 2020_p, 2021_p dynamically.
How do I achieve this?

This should work:
from pyspark.sql.functions import col

df.select(*[col(c).alias(f"{c}_p") if c.startswith("20") else col(c) for c in df.columns])
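For a self-contained illustration (the DataFrame and column names below are made up for the example), the same pattern renames only the 20xx columns and leaves the rest untouched:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Example data; "2019", "2020", "2021" stand in for the dynamically generated columns.
df = spark.createDataFrame([("a", 1, 2, 3)], ["id", "2019", "2020", "2021"])

renamed = df.select(*[col(c).alias(f"{c}_p") if c.startswith("20") else col(c) for c in df.columns])
print(renamed.columns)  # ['id', '2019_p', '2020_p', '2021_p']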

Related

How to add trailer/footer in csv dataframe azure blob pyspark

I have a solution which goes like this:
df1 --> dataframe 1 with 50 columns of data
df2 --> dataframe 2 with the footer/trailer, having 3 columns of data like Trailer, count of rows, date
so I added the remaining 47 columns as "", "", ""... and so on
so that I can union the 2 dataframes:
df3=df1.union(df2)
Now if I want to save:
df3.coalesce(1).write.format("com.databricks.spark.csv")\
.option("header","true").mode("overwrite")\
.save(output_blob_path);
now I am getting the footer as well,
like this: Trailer,400,20210805,"","","","","","",""... and so on.
If anyone can suggest how to remove these ,"","","",... double quotes from the last row
when I save this file to the blob container,
it would be very helpful.
You can try defining the schema so that the entire row is treated as a single column for both files, and then perform the union. That way you don't need to add extra columns to dataframe 2 and then get stuck in the tricky situation of removing the extra columns after the union. A rough sketch of this idea is shown below.
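A minimal PySpark sketch of that approach (the wasbs:// paths are placeholders, and it assumes the header line already exists in the data file, since .text() does not add one):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read each file as a single string column ("value"), so no padding columns are needed.
data_lines = spark.read.text("wasbs://container@account.blob.core.windows.net/data.csv")
trailer_lines = spark.read.text("wasbs://container@account.blob.core.windows.net/trailer.csv")

# Union the single-column DataFrames and write each row back out exactly as-is,
# so the trailer row is not padded with "","","",...
combined = data_lines.union(trailer_lines)
combined.coalesce(1).write.mode("overwrite").text(output_blob_path)  # output_blob_path as in the question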

Scala Spark-> Select first 15 columns from a DataFrame

I am trying to get the first 15 columns from a DataFrame that contains more than 500 cols. But I don't know how to do it because it is my first time using Scala Spark.
I was searching but didn't find anything, just how to get cols by name, for example:
val df2 = df.select("firstColName", "secondColeName")
How can I do this by index?
Thanks in advance!
Scala example:
df.selectExpr(df.columns.take(15):_*)
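For reference on the PySpark side (this page's tag), the same positional selection can be done by slicing df.columns; a small sketch:

# Select the first 15 columns of a PySpark DataFrame by position
df.select(*df.columns[:15])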

Drop rows of Spark DataFrame that contain specific value in column using Scala

I am trying to drop rows of a Spark DataFrame which contain a specific value in a specific column.
For example, if I have the following DataFrame, I'd like to drop all rows which have "two" in column "A". So I'd like to drop the rows with index 1 and 2.
I want to do this using Scala 2.11 and Spark 2.4.0.
    A      B  C
0   one    0  0
1   two    2  4
2   two    4  8
3   one    6  12
4   three  7  14
I tried something like this:
df = df.filer(_.A != "two")
or
df = df.filter(df("A") != "two")
Anyway, neither of them worked. Any suggestions on how that can be done?
Try:
df.filter(not($"A".contains("two")))
Or if you look for exact match:
df.filter(not($"A".equalTo("two")))
I finally found the solution in a very old post:
Is there a way to filter a field not containing something in a spark dataframe using scala?
The trick which does it is the following:
df = df.where(!$"A".contains("two"))
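Since this page is tagged pyspark, the equivalent filter on the Python side (a sketch, not part of the original answer) would be:

from pyspark.sql.functions import col

# Keep only rows whose column "A" does not contain "two"
df = df.filter(~col("A").contains("two"))

# Or, for an exact match
df = df.filter(col("A") != "two")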

How to populate a Spark DataFrame column based on another column's value?

I have a use-case where I need to select certain columns from a dataframe containing at least 30 columns and millions of rows.
I'm loading this data from a Cassandra table using Scala and Apache Spark.
I selected the required columns using: df.select("col1","col2","col3","col4")
Now I have to perform a basic groupBy operation to group the data according to src_ip,src_port,dst_ip,dst_port and I also want to have the latest value from a received_time column of the original dataframe.
I want a dataframe with distinct src_ip values with their count and latest received_time in a new column as last_seen.
I know how to use .withColumn and also, I think that .map() can be used here.
Since I'm relatively new in this field, I really don't know how to proceed further. I could really use your help to get done with this task.
Assuming you have a dataframe df with src_ip,src_port,dst_ip,dst_port and received_time, you can try:
val mydf = df.groupBy(col("src_ip"), col("src_port"), col("dst_ip"), col("dst_port"))
  .agg(count("received_time").as("row_count"), max(col("received_time")).as("max_received_time"))
The above calculates, for each combination of the groupBy columns, the count of received timestamps as well as the max timestamp for that group.
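For comparison, a PySpark sketch of the same aggregation that also renames the max to last_seen, as asked in the question (column names assumed from the question):

from pyspark.sql.functions import count, max as max_

mydf = (
    df.groupBy("src_ip", "src_port", "dst_ip", "dst_port")
      .agg(
          count("received_time").alias("row_count"),
          max_("received_time").alias("last_seen"),
      )
)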

Creating a pyspark.sql.dataframe out of two columns in two different pyspark.sql.dataframes in PySpark

Assume the following two Dataframes in pyspark with equal number of rows:
df1:
 |_ Column1a
 |_ Column1b
df2:
 |_ Column2a
 |_ Column2b
I wish to create a new DataFrame "df" which has Column1a and Column2a only. What could be the best possible solution for it?
Denny Lee's answer is the way.
It involves creating another column on both DataFrames that holds a Unique_Row_ID for every row, then performing a join on Unique_Row_ID, and finally dropping Unique_Row_ID if it is no longer needed. A rough sketch of that approach is shown below.
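A minimal PySpark sketch of that idea; the way the Unique_Row_ID is generated here (row_number over monotonically_increasing_id) is an assumption for illustration, not necessarily the exact code from the linked answer:

from pyspark.sql.functions import monotonically_increasing_id, row_number
from pyspark.sql.window import Window

# Give every row in each DataFrame a sequential Unique_Row_ID.
w = Window.orderBy(monotonically_increasing_id())
df1_id = df1.withColumn("Unique_Row_ID", row_number().over(w))
df2_id = df2.withColumn("Unique_Row_ID", row_number().over(w))

# Join on the row ID and keep only the two columns of interest; the ID is dropped by the select.
df = df1_id.join(df2_id, on="Unique_Row_ID").select("Column1a", "Column2a")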