How do I merge 3 DataFrames in Spark-Scala? I have no idea how to do this, and I couldn't find a similar example on Stack Overflow.
I have 3 similar DataFrames: the same column names and the same number of columns. The only difference is the values in the rows.
DataFrame1:
+----+------+----+---+
|type| Model|Name|ID |
+----+------+----+---+
| 1 |wdasd |xyzd|111|
| 1 |wd |zdfd|112|
| 1 |bdp |2gfs|113|
+----+------+----+---+
DataFrame2:
+----+------+----+---+
|type| Model|Name|ID |
+----+------+----+---+
| 2 |wdasd |xyzd|221|
| 2 |wd |zdfd|222|
| 2 |bdp |2gfs|223|
+----+------+----+---+
DataFrame3:
+----+------+----+---+
|type| Model|Name|ID |
+----+------+----+---+
| 3 |AAAA |N_AM|331|
| 3 |BBBB |NA_M|332|
| 3 |CCCC |MA_N|333|
+----+------+----+---+
And I want to get this type of DataFrame:
MergeDataFrame:
+----+------+----+---+
|type| Model|Name|ID |
+----+------+----+---+
| 1 |wdasd |xyzd|111|
| 1 |wd |zdfd|112|
| 1 |bdp |2gfs|113|
| 2 |wdasd |xyzd|221|
| 2 |wd |zdfd|222|
| 2 |bdp |2gfs|223|
| 3 |AAAA |N_AM|331|
| 3 |BBBB |NA_M|332|
| 3 |CCCC |MA_N|333|
+----+------+----+---+
Spark provides union and unionAll. unionAll is deprecated in favor of union, so I would use union as below:
dataFrame1.union(dataFrame2).union(dataFrame3)
Note that in order to union data frames, they must have exactly the same column names in exactly the same order (union resolves columns by position, not by name).
See the Spark docs for Dataset.union.
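For completeness, here is a minimal, self-contained sketch of the whole thing (the DataFrame names and the SparkSession setup are illustrative, built from the example data above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("union-example").master("local[*]").getOrCreate()
import spark.implicits._

val dataFrame1 = Seq((1, "wdasd", "xyzd", 111), (1, "wd", "zdfd", 112), (1, "bdp", "2gfs", 113))
  .toDF("type", "Model", "Name", "ID")
val dataFrame2 = Seq((2, "wdasd", "xyzd", 221), (2, "wd", "zdfd", 222), (2, "bdp", "2gfs", 223))
  .toDF("type", "Model", "Name", "ID")
val dataFrame3 = Seq((3, "AAAA", "N_AM", 331), (3, "BBBB", "NA_M", 332), (3, "CCCC", "MA_N", 333))
  .toDF("type", "Model", "Name", "ID")

// union resolves columns by position, so all three inputs share the same schema here.
val merged = dataFrame1.union(dataFrame2).union(dataFrame3)
merged.show()

If the column order ever differs between the inputs, unionByName (available since Spark 2.3) resolves columns by name instead of by position.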
Related
I have the PySpark dataframe below:
cust | amount |
----------------
A | 5 |
A | 1 |
A | 3 |
B | 4 |
B | 4 |
B | 2 |
C | 2 |
C | 1 |
C | 7 |
C | 5 |
I need to group by the column 'cust' and calculate the average per group.
Expected result:
cust | avg_amount
-------------------
A    | 3
B    | 3.333
C    | 3.75
I've been using the code below, but it gives me an error.
data.withColumn("avg_amount", F.avg("amount"))
Any idea how I can compute this average?
Your withColumn call fails because avg is an aggregate function, so it needs a grouping (or a window) to be computed over. Use groupBy to count the number of transactions and compute the average amount per customer:
from pyspark.sql import functions as F
data = data.groupBy("cust") \
    .agg(
        F.count("*").alias("nb_transactions"),
        F.avg("amount").alias("avg_amount")
    )
I'm new to Spark and Scala. I was trying to use the mapPartitions function on a Spark dataframe to iterate over its rows and derive a new column based on the value of another column from the previous row.
Input Dataframe:
+------------+----------+-----------+
| order_id |person_id | create_dt |
+------------+----------+-----------+
| 1 | 1 | 2020-01-11|
| 2 | 1 | 2020-01-12|
| 3 | 1 | 2020-01-13|
| 4 | 1 | 2020-01-14|
| 5 | 1 | 2020-01-15|
| 6 | 1 | 2020-01-16|
+------------+----------+-----------+
From the above dataframe, I want to use the mapPartitions function to call a Scala method that takes an Iterator[Row] as a parameter and produces output rows with a new column, date_diff. The new column is the date difference between the create_dt of the current row and that of the previous row.
Expected output dataframe:
+------------+----------+-----------+-----------+
| order_id |person_id | create_dt | date_diff |
+------------+----------+-----------+-----------+
| 1 | 1 | 2020-01-11| NA |
| 2 | 1 | 2020-01-12| 1 |
| 3 | 1 | 2020-01-13| 1 |
| 4 | 1 | 2020-01-14| 1 |
| 5 | 1 | 2020-01-15| 1 |
| 6 | 1 | 2020-01-16| 1 |
+------------+----------+-----------+-----------+
Code I tried so far:
// Read input data
val input_data = sc.parallelize(Seq((1,1,"2020-01-11"), (2,1,"2020-01-12"), (3,1,"2020-01-13"), (4,1,"2020-01-14"), (5,1,"2020-01-15"), (6,1,"2020-01-16"))).toDF("order_id", "person_id","create_dt")
//Generate output data using mapPartitions and call getDateDiff method
val output_data = input_data.mapPartitions(getDateDiff).show()
//getDateDiff method to iterate over each row and derive the date difference
def getDateDiff(srcItr: scala.collection.Iterator[Row]) : Iterator[Row] = {
  for (row <- srcItr) { row.get(2) }
  /* derive date difference and generate output row */
}
Could someone help me with how to write the getDateDiff method to get the expected output?
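One possible sketch (not a definitive implementation): keep the previous row's date in a local variable while mapping over the partition's iterator, and supply an explicit Row encoder (Spark 2.x-style RowEncoder) for the widened schema. Note that mapPartitions only sees the rows of a single partition, so this assumes each person's rows are sorted by create_dt and live in one partition; a Window with lag and datediff is usually the more robust way to express "previous row" logic.

import java.time.LocalDate
import java.time.temporal.ChronoUnit
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Carry the previous create_dt while walking the iterator of one partition.
def getDateDiff(srcItr: Iterator[Row]): Iterator[Row] = {
  var prevDate: Option[LocalDate] = None
  srcItr.map { row =>
    val curDate = LocalDate.parse(row.getString(2))   // create_dt is the 3rd column
    val diff = prevDate
      .map(p => ChronoUnit.DAYS.between(p, curDate).toString)
      .getOrElse("NA")                                // first row has no previous row
    prevDate = Some(curDate)
    Row.fromSeq(row.toSeq :+ diff)                    // append date_diff to the row
  }
}

// mapPartitions needs an encoder for the new row shape (original columns + date_diff).
val outSchema = StructType(input_data.schema.fields :+ StructField("date_diff", StringType))
val output_data = input_data.mapPartitions(getDateDiff)(RowEncoder(outSchema))
output_data.show()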
I have a very big Spark 2.3 dataframe like this:
-------------------------
| col_key | col1 | col2 |
-------------------------
| AA | 1 | 2 |
| AB | 2 | 1 |
| AA | 2 | 3 |
| AC | 1 | 2 |
| AA | 3 | 2 |
| AC | 5 | 3 |
-------------------------
I need to "split" this dataframe by the values in the col_key column and save each split part to a separate csv file, so I have to get smaller dataframes like
-------------------------
| col_key | col1 | col2 |
-------------------------
| AA | 1 | 2 |
| AA | 2 | 3 |
| AA | 3 | 2 |
-------------------------
and
-------------------------
| col_key | col1 | col2 |
-------------------------
| AC | 1 | 2 |
| AC | 5 | 3 |
-------------------------
and so on.
Each resulting dataframe needs to be saved as a separate csv file.
The number of keys is not big (20-30), but the total amount of data is large (~200 million records).
I have a solution where each part of the data is selected in a loop and then saved to a file:
val keysList = df.select("col_key").distinct().map(r => r.getString(0)).collect.toList
keysList.foreach(k => {
  val dfi = df.where($"col_key" === lit(k))
  SaveDataByKey(dfi, path_to_save)
})
It works correctly, but the bad part of this solution is that every selection by key causes a full pass through the whole dataframe, and it takes too much time.
I think there must be a faster solution, where we pass through the dataframe only once and route every record to the "right" result dataframe (or directly to a separate file). But I don't know how to do it :)
Maybe someone has ideas about it?
Also, I prefer to use Spark's DataFrame API because it provides the fastest way of processing the data (so using RDDs is not desirable, if possible).
You can partition by the key column when writing and save as csv; Spark then writes each key's data to its own subdirectory in a single pass over the dataframe.
yourDF
  .write
  .partitionBy("col_key")
  .csv("/path/to/save")
Why don't you try this?
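A follow-up note on the output layout (assuming default settings): partitionBy creates one subdirectory per key (col_key=AA/, col_key=AC/, ...), drops col_key from the file contents, and each subdirectory may contain several part files. If you want roughly one file per key, you can repartition by the key column before writing, for example:

import org.apache.spark.sql.functions.col

df.repartition(col("col_key"))   // co-locate each key's rows in a single task
  .write
  .partitionBy("col_key")        // one subdirectory per distinct col_key value
  .csv("/path/to/save")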
I have a Spark DF with the following structure:
+--------------------------------------+
| user| time | counts |
+--------------------------------------+
| 1 | 2018-06-04 16:00:00.0 | 5 |
| 1 | 2018-06-04 17:00:00.0 | 7 |
| 1 | 2018-06-04 17:30:00.0 | 7 |
| 1 | 2018-06-04 18:00:00.0 | 8 |
| 1 | 2018-06-04 18:30:00.0 | 10 |
| 1 | 2018-06-04 19:00:00.0 | 9 |
| 1 | 2018-06-04 20:00:00.0 | 7 |
| 2 | 2018-06-04 17:00:00.0 | 4 |
| 2 | 2018-06-04 18:00:00.0 | 4 |
| 2 | 2018-06-04 18:30:00.0 | 5 |
| 2 | 2018-06-04 19:30:00.0 | 7 |
| 3 | 2018-06-04 16:00:00.0 | 6 |
+--------------------------------------+
It was obtained from an event-log table using the following code:
ranked.groupBy($"user", sql.functions.window($"timestamp", "30 minutes"))
  .agg(sum("id").as("counts"))
  .withColumn("time", $"window.start")
As can be seen from the time column, not all 30-minute intervals registered events for every user, i.e. not all user groups have the same number of rows. I'd like to impute the missing time values (possibly with NAs or 0s) and create a table (or RDD) like the following:
+-----------------------------------------------------------------------------+
| user| 2018-06-04 16:00:00 | 2018-06-04 16:30:00 | 2018-06-04 17:00:00 | ... |
+-----------------------------------------------------------------------------+
| 1 | 5 | NA (or 0) | 7 | ... |
| 2 | NA (or 0) | NA (or 0) | 4 | ... |
| 3 | 6 | NA (or 0) | NA (or 0) | ... |
+-----------------------------------------------------------------------------+
The transpose of the table above (with a time column and one column for the counts of each user) would theoretically work too, but I am not sure it would be optimal Spark-wise, as I have almost a million different users.
How can I perform a table restructuring like the one described?
If each time window appears for at least one user, a simple pivot would do the trick (and put null for missing values). With millions of rows, that should be the case.
val reshaped_df = df.groupBy("user").pivot("time").agg(sum('counts))
In case a column is still missing, you can check the list of existing columns with reshaped_df.columns. Generate the list of columns that you expect (expected_columns) and then add whichever are missing as follows:
val expected_columns = ???

var result = reshaped_df
expected_columns.foreach { c =>
  if (!result.columns.contains(c))
    result = result.withColumn(c, lit(null))
}
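Purely as an illustration of one way to build expected_columns (the time range below is hypothetical, and the exact column-name format depends on how Spark renders the pivoted time values, so check reshaped_df.columns first):

import java.sql.Timestamp
import java.time.LocalDateTime

// Hypothetical bounds: every 30-minute slot between a known start and end.
val start = LocalDateTime.parse("2018-06-04T16:00:00")
val end   = LocalDateTime.parse("2018-06-04T20:00:00")

val expected_columns: Seq[String] =
  Iterator.iterate(start)(_.plusMinutes(30))
    .takeWhile(!_.isAfter(end))
    .map(t => Timestamp.valueOf(t).toString)   // e.g. "2018-06-04 16:00:00.0"
    .toSeq

As a side note, passing the expected values explicitly, as in pivot("time", values), also spares Spark the extra job it otherwise runs to discover the distinct pivot values.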
I have a data set like this, which I am reading from a csv file and converting into an RDD using Scala.
+-----------+-----------+----------+
|   recent  |    Freq   |  Monitor |
+-----------+-----------+----------+
|         1 |      1234 |   199090 |
|         4 |      2553 |   198613 |
|         6 |      3232 |   199090 |
|         1 |      8823 |   498831 |
|         7 |      2902 |   890000 |
|         8 |      7991 |   081097 |
|         9 |      7391 |   432370 |
|        12 |      6138 |   864981 |
|         7 |      6812 |   749821 |
+-----------+-----------+----------+
How do I sort the data on all columns?
Thanks
Suppose your input RDD/DataFrame is called df.
To sort recent in descending order, and Freq and Monitor both in ascending order, you can do:
import org.apache.spark.sql.functions._
val sorted = df.sort(desc("recent"), asc("Freq"), asc("Monitor"))
You can use df.orderBy(...) as well; it's an alias of sort().
For a plain RDD, csv.sortBy(r => (r.recent, r.freq)) or something equivalent should do it.
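To make that concrete, here is a minimal sketch; the case class and field names are hypothetical, assuming the csv rows were parsed into it. sortBy with a tuple key sorts lexicographically, so listing all three fields sorts on all columns:

import org.apache.spark.rdd.RDD

// Hypothetical record type for the parsed csv rows.
case class Record(recent: Int, freq: Int, monitor: String)

// Sorts by recent, then freq, then monitor, all ascending.
def sortAll(csv: RDD[Record]): RDD[Record] =
  csv.sortBy(r => (r.recent, r.freq, r.monitor))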