Dataset:
GroupID Name_of_books
101 book1, book2, book3, book4
102 book10, book12, book13, book14
Required output:
101 book1
101 book2
101 book3
101 book4
102 book10
102 book12
102 book13
102 book14
You can use the explode function as:
import org.apache.spark.sql.functions._
val resultDF = df.select($"GroupID", explode($"Name_of_books").as("Name_of_books"))
or with withColumn:
val resultDF = df.withColumn("Name_of_books", explode($"Name_of_books"))
This works if the column is of Array or Map type.
If you have a comma-separated string value, you need to split it first and then apply explode:
val resultDF = df.select($"GroupID", explode(split($"Name_of_books", ",")).as("Name_of_books"))
Hope this helps!
I have a Spark dataframe in the below format, where each unique id can have a maximum of 3 rows, as given by the rank column.
id pred prob rank
485 9716 0.19205872 1
729 9767 0.19610429 1
729 9716 0.186840048 2
729 9748 0.173447074 3
818 9731 0.255104463 1
818 9748 0.215499913 2
818 9716 0.207307154 3
I want to reshape this so that each id has just one row, with the pred and prob values spread across multiple columns differentiated by the rank variable (as a column postfix).
id pred_1 prob_1 pred_2 prob_2 pred_3 prob_3
485 9716 0.19205872
729 9767 0.19610429 9716 0.186840048 9748 0.173447074
818 9731 0.255104463 9748 0.215499913 9716 0.207307154
I am not able to figure out how to do it in PySpark.
Sample code for input data creation:
# Loading the requisite packages
from pyspark.sql.functions import col, explode, array, struct, expr, sum, lit
# Creating the DataFrame
df = sqlContext.createDataFrame([(485,9716,19,1),(729,9767,19,1),(729,9716,18,2), (729,9748,17,3), (818,9731,25,1), (818,9748,21,2), (818,9716,20,3)],('id','pred','prob','rank'))
df.show()
This is a pivot-on-multiple-columns problem. Try:
import pyspark.sql.functions as F
df_pivot = df.groupBy('id').pivot('rank').agg(F.first('pred').alias('pred'), F.first('prob').alias('prob')).orderBy('id')
df_pivot.show(truncate=False)
I have a table in an RDBMS which I'm loading into a dataframe (DF1):
1 employee_id
2 employee_name
3 salary
4 designation
And I have a dataframe(DF2) with the following:
_c0 _c1 _c2 _c3
101 monali 70000 developer
102 Amy 70000 developer
103 neha 65000 tester
How do I define the schema for DF2 from DF1? I want DF2 to have the schema that is defined in the above table.
expected output:
employee_id employee_name salary designation
101 monali 70000 developer
102 Amy 70000 developer
103 neha 65000 tester
I want to make it parameterized.
You can create a function mapColumnNames that takes two parameters: the dataframe containing the column names (which I call the columns dataframe) and the dataframe whose columns you want to rename (which I call the data dataframe).
This function first retrieves the id and name of each column in the columns dataframe as a list of tuples. Then it iterates over this list, applying the withColumnRenamed method to the data dataframe on each iteration.
Then you can call this function mapColumnNames with DF1 as the columns dataframe and DF2 as the data dataframe.
Below is the complete code:
import org.apache.spark.sql.DataFrame

def mapColumnNames(columns: DataFrame, data: DataFrame): DataFrame = {
  // Collect (zero-based column index, target column name) pairs from the columns dataframe.
  val columnNames = columns.collect().map(x => (x.getInt(0) - 1, x.getString(1)))
  // Rename _c0, _c1, ... in the data dataframe to the collected names, one at a time.
  columnNames.foldLeft(data)((data, columnName) => {
    data.withColumnRenamed(s"_c${columnName._1}", columnName._2)
  })
}
val output = mapColumnNames(DF1, DF2)
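For reference, a quick way to exercise this with the sample data from the question (just a sketch; the two column names given to DF1 here are my own assumptions):
import spark.implicits._  // assumes a SparkSession named `spark` is in scope

val DF1 = Seq((1, "employee_id"), (2, "employee_name"), (3, "salary"), (4, "designation"))
  .toDF("column_id", "column_name")
val DF2 = Seq((101, "monali", 70000, "developer"), (102, "Amy", 70000, "developer"), (103, "neha", 65000, "tester"))
  .toDF("_c0", "_c1", "_c2", "_c3")

// Should print a dataframe whose columns are employee_id, employee_name, salary, designation.
mapColumnNames(DF1, DF2).show()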
It wasn't clear what schema your df1 holds, so I used a reference to the column named "1" to fetch the column values:
val columns = df1.select($"1").collect()
Otherwise, we can get all the columns associated with the first dataframe:
val columns = df1.schema.fieldNames.map(col(_))
and then use select with the fetched columns for our new dataframe:
val newDF = df2.select(columns: _*)
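If df1 really holds the (index, name) pairs shown in the question, another minimal sketch is to collect the names in index order and hand them all to toDF at once (the ordering step is an assumption on my part):
// Collect the target column names from df1 ordered by their index,
// then rename all of df2's columns in one go with toDF.
// Assumes df1's first column is the 1-based index and its second column is the name.
val names = df1.collect()
  .map(row => (row.getInt(0), row.getString(1)))
  .sortBy(_._1)
  .map(_._2)

val renamedDF = df2.toDF(names: _*)
renamedDF.printSchema()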
I'm using pyspark 3.0.1. I have a dataframe df with following details
ID Class dateEnrolled dateStarted
32 1 2016-01-09 2016-01-26
25 1 2016-01-09 2016-01-10
33 1 2016-01-16 2016-01-05
I need to replace dateEnrolled with the latest of the two date fields, and my data should look like:
ID Class dateEnrolled dateStarted
32 1 2016-01-26 2016-01-26
25 1 2016-01-10 2016-01-10
33 1 2016-01-16 2016-01-05
Can you suggest how to do that?
You can use greatest:
import pyspark.sql.functions as F
df2 = df.withColumn('dateEnrolled', F.greatest('dateEnrolled', 'dateStarted'))
I have 2 dataframes. I want to take the distinct values of one column and link them with all the rows of the other dataframe. For example:
Dataframe 1 : df1 contains
scenarioId
---------------
101
102
103
Dataframe 2 : df2 contains columns
trades
-------------------------------------
isin price
ax11 111
re32 909
erre 445
Expected output
trades
----------------
isin price scenarioid
ax11 111 101
re32 909 101
erre 445 101
ax11 111 102
re32 909 102
erre 445 102
ax11 111 103
re32 909 103
erre 445 103
Note that I don't have a common column to join the 2 dataframes on. Please suggest.
What you need is a cross join, i.e. the Cartesian product:
val result = df1.crossJoin(df2)
although I do not recommend it, as the amount of data grows very fast. You'll get all possible pairs, i.e. the elements of the Cartesian product (the number of rows will be the number of rows in df1 times the number of rows in df2).
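If df1 (the scenario ids) is small, a broadcast hint can help; a sketch under that assumption:
import org.apache.spark.sql.functions.broadcast

// Pair every trade with every distinct scenarioId; broadcasting the small
// scenario dataframe lets Spark stream the trades dataframe against an
// in-memory copy of the scenarios instead of shuffling it.
val result = df2.crossJoin(broadcast(df1.select("scenarioId").distinct()))
result.show()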
I have the below DataFrame; it has keys with different dates, and I would like to display the latest date together with the count for each id-key pair.
Input data as below:
id key date
11 222 1/22/2017
11 222 1/22/2015
11 222 1/22/2016
11 223 9/22/2017
11 223 1/22/2010
11 223 1/22/2008
Code I have tried:
val counts = df.groupBy($"id",$"key").count()
I am getting the below output,
id key count
11 222 3
11 223 3
However, I want the output to be as below:
id key count maxDate
11 222 3 1/22/2017
11 223 3 9/22/2017
One way would be to transform the date into unixtime, do the aggregation and then convert it back again. These conversions to and from unixtime can be performed with unix_timestamp and from_unixtime respectively. When the date is in unixtime, the latest date can be selected by finding the maximum value. The only possible downside of this approach is that the date format must be explicitly given.
val dateFormat = "MM/dd/yyyy"
val df2 = df.withColumn("date", unix_timestamp($"date", dateFormat))
.groupBy($"id",$"key").agg(count("date").as("count"), max("date").as("maxDate"))
.withColumn("maxDate", from_unixtime($"maxDate", dateFormat))
Which will give you:
+---+---+-----+----------+
| id|key|count| maxDate|
+---+---+-----+----------+
| 11|222| 3|01/22/2017|
| 11|223| 3|09/22/2017|
+---+---+-----+----------+
Perform an agg on both fields
df.groupBy($"id", $"key").agg(count($"date"), max($"date"))
Output:
+---+---+-----------+-----------+
| _1| _2|count(date)| max(date)|
+---+---+-----------+-----------+
| 11|222| 3| 1/22/2017|
| 11|223| 3| 9/22/2017|
+---+---+-----------+-----------+
Edit: The as option proposed in the other answer is pretty good too.
Edit: The comment below is true. You need to convert to a proper date format. You can check the other answer, which converts to a timestamp, or use a UDF:
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions
import org.apache.spark.sql.functions.{count, max}

// Re-render MM/dd/yyyy strings as yyyy/MM/dd so that string ordering (and hence max)
// matches chronological ordering, then convert back after the aggregation.
val simpleDateFormatOriginal: SimpleDateFormat = new SimpleDateFormat("MM/dd/yyyy")
val simpleDateFormatDestination: SimpleDateFormat = new SimpleDateFormat("yyyy/MM/dd")

val toyyyymmdd = (s: String) => {
  simpleDateFormatDestination.format(simpleDateFormatOriginal.parse(s))
}
val toddmmyyyy = (s: String) => {
  simpleDateFormatOriginal.format(simpleDateFormatDestination.parse(s))
}

val toyyyymmddudf = functions.udf(toyyyymmdd)
val toddmmyyyyudf = functions.udf(toddmmyyyy)

df.withColumn("date", toyyyymmddudf($"date"))
  .groupBy($"id", $"key")
  .agg(count($"date"), max($"date").as("maxDate"))
  .withColumn("maxDate", toddmmyyyyudf($"maxDate"))