How to populate a Spark DataFrame column based on another column's value? - scala

I have a use case where I need to select certain columns from a DataFrame containing at least 30 columns and millions of rows.
I'm loading this data from a Cassandra table using Scala and Apache Spark.
I selected the required columns using: df.select("col1","col2","col3","col4")
Now I have to perform a basic groupBy operation to group the data according to src_ip, src_port, dst_ip, dst_port, and I also want the latest value from the received_time column of the original DataFrame.
I want a DataFrame of distinct src_ip values with their count and the latest received_time in a new column called last_seen.
I know how to use .withColumn, and I think .map() could also be used here.
Since I'm relatively new to this field, I don't know how to proceed further; I could really use your help with this task.

Assuming you have a DataFrame df with src_ip, src_port, dst_ip, dst_port and received_time, you can try:
val mydf = df
  .groupBy(col("src_ip"), col("src_port"), col("dst_ip"), col("dst_port"))
  .agg(
    count("received_time").as("row_count"),
    max(col("received_time")).as("max_received_time")
  )
The above calculates, for each combination of the groupBy columns, the count of received timestamps as well as the maximum (latest) timestamp for that group.
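Since the question asks for the latest timestamp under the name last_seen, here is a small follow-up sketch (assuming the snippet above): it adds the imports the snippet needs and renames the aggregated column.

import org.apache.spark.sql.functions.{col, count, max}

// Rename the aggregate to match the name requested in the question.
val result = mydf.withColumnRenamed("max_received_time", "last_seen")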

Related

Pyspark filter a dataframe based on another dataframe containing distinct values

I have a df1 based on distinct values, containing two columns, date and value. There is a df2 that has multiple columns but contains the date and value columns. For each distinct value from df1, I want to filter df2 so that the records before the corresponding date from df1 are dropped. It would be rather easy for a single distinct value; I could use something like a filter by value and then gt(lit(date)). However, I have over 500 such distinct pairs in df1. A single operation takes around 20 minutes, so if I use a loop it is computationally not feasible. Perhaps somebody can advise me on a better methodology here.
I have tried multiple methodologies but nothing has worked yet.
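One common way to avoid a 500-iteration loop is to turn the pairs into a join: broadcast df1 (it is small) onto df2 on value and keep only rows at or after the matching date. A sketch, written with the Scala DataFrame API and using the value/date column names from the question:

import org.apache.spark.sql.functions.{broadcast, col}

// Rename df1's date so it does not clash with df2's date after the join.
val cutoffs = df1.withColumnRenamed("date", "cutoff_date")

val filtered = df2
  .join(broadcast(cutoffs), Seq("value"))     // ~500 pairs: small enough to broadcast
  .filter(col("date") >= col("cutoff_date"))  // drop records before df1's date
  .drop("cutoff_date")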

PySpark: How to count the number of distinct values from two columns?

I have a DataFrame with two columns, id1 and id2, and what I'd like is to count the number of distinct values across these two columns. Essentially this is count(set(id1 + id2)).
How can I do that with PySpark?
Thanks!
Please note that this isn't a duplicate, as I'd like PySpark to calculate the count(). Of course it's possible to get the two lists id1_distinct and id2_distinct and put them in a set(), but that doesn't seem like the proper solution when dealing with big data, and it's not really in the PySpark spirit.
You can combine the two columns into one using union, and get the countDistinct:
import pyspark.sql.functions as F
cnt = df.select('id1').union(df.select('id2')).select(F.countDistinct('id1')).head()[0]
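For reference, the same union-then-countDistinct idea as a sketch in Scala (assuming a DataFrame df with columns id1 and id2):

import org.apache.spark.sql.functions.countDistinct

val cnt = df.select("id1")
  .union(df.select("id2"))       // stack both id columns into one
  .agg(countDistinct("id1"))     // distinct count over the combined column
  .head()
  .getLong(0)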

Splitting a column data as per delimiter

I have a Spark (1.4) dataframe where the data in a column is like "1-2-3-4-5-6-7-8-9-10-11-12". I want to split the data into multiple columns. Please note that the number of fields can vary from 1 to 12; it's not fixed.
P.S. we are using Scala API.
Edit:
Editing over the original question. I have the delimited string as below:
"ABC-DEF-PQR-XYZ"
From this string I need to create delimited strings in separate columns as below. Please note that this string is in a column in DF.
Original column: ABC-DEF-PQR-XYZ
New col1 : ABC
New col2 : ABC-DEF
New col3 : ABC-DEF-PQR
New col4 : ABC-DEF-PQR-XYZ
Please note that there can be up to 12 such new columns which need to be derived from the original field. Also, the string in the original column might vary, i.e. sometimes 1 part, sometimes 2, but the max can be 12.
Hope I have articulated the problem statement clearly.
Thanks!
You can use explode and pivot. Here is some sample data:
import pyspark.sql.functions as f

df = sc.parallelize([["1-2-3-4-5-6-7-8-9-10-11-12"], ["1-2-3-4"], ["1-2-3-4-5-6-7-8-9-10"]]).toDF(schema=["col"])
Now add a unique id to rows so that we can keep track of which row the data belongs to:
df=df.withColumn("id", f.monotonically_increasing_id())
Then split the columns by delimiter - and then explode to get a long-form dataset:
df=df.withColumn("col_split", f.explode(f.split("col", "\-")))
Finally pivot on id to get back to wide form:
df.groupby("id")
.pivot("col_split")
.agg(f.max("col_split"))
.drop("id").show()

Fastest way to get specific record in Spark DataFrame

I want to collect a specific Row from a Spark 1.6 DataFrame which originates from a partitioned Hive table (the table is partitioned by a String column named date and saved as Parquet).
A record is unambiguously identified by date, section, sample.
In addition, I have the following constraints:
date and section are Strings, sample is a Long
date is unique and the Hive table is partitioned by date, but there may be more than one file on HDFS for each date
section is also unique across the dataframe
sample is unique for a given section
So far I use this query, but it takes quite a long time to execute (~25 seconds using 10 executors):
sqlContext.table("mytable")
  .where($"date" === date)
  .where($"section" === section)
  .where($"sample" === sample)
  .collect()(0)
I also tried replacing collect()(0) with take(1)(0), which is not faster.
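One direction worth trying (a sketch only, using the same table and variables as above, and not verified to be faster on this data) is to combine the filters and limit to a single row so Spark can stop scanning as soon as the record is found:

val row = sqlContext.table("mytable")
  .where($"date" === date && $"section" === section && $"sample" === sample)
  .limit(1)
  .collect()
  .head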

Creating a pyspark.sql.dataframe out of two columns in two different pyspark.sql.dataframes in PySpark

Assume the following two Dataframes in pyspark with equal number of rows:
df1:
 |_ Column1a
 |_ Column1b
df2:
 |_ Column2a
 |_ Column2b
I wish to create a new DataFrame "df" which has Column1a and Column2a only. What could be the best possible solution for it?
Denny Lee's answer is the way.
It involves adding a column on both DataFrames that serves as a Unique_Row_ID for every row. We then perform a join on Unique_Row_ID, and then drop Unique_Row_ID if required.
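A minimal sketch of that idea (assuming df1 and df2 as above; note that monotonically_increasing_id() only lines up across two DataFrames when their partitioning is identical, so an RDD zipWithIndex-based ID is safer in general):

import org.apache.spark.sql.functions.monotonically_increasing_id

val df1WithId = df1.withColumn("Unique_Row_ID", monotonically_increasing_id())
val df2WithId = df2.withColumn("Unique_Row_ID", monotonically_increasing_id())

// Join on the generated ID and keep only the two wanted columns (the ID is not selected).
val df = df1WithId
  .join(df2WithId, Seq("Unique_Row_ID"))
  .select("Column1a", "Column2a")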