Lookup DF:
+-----------+------------------+
|seller_name|             codes|
+-----------+------------------+
|      BlueR|[5944, 5813, 5812]|
|       jack|[4814, 5734, 5968]|
|  Cwireless|[7349, 7399, 5999]|
|        Tea|[4899, 5813, 8398]|
+-----------+------------------+
Base DF:
seller_name | raw_code
BlueR | 5813
jack | 5968
Cwireless | 7865
Tea | 5999
Tea | 5813
blueR | 5678
jack | 9999
Tea | null
If the seller_name in the Base DF is present in the Lookup DF, and the raw_code for that seller in the Base DF is one of the elements in the Lookup DF codes array, then the value should be retained as is; but if the raw_code is anything other than the elements in that array, it should be replaced by the first element of the array for that seller.
edit: if the seller_name of the Base DF is not present in the Lookup DF, then the raw_code value should be retained as it is.
Expected Output DF:
seller_name | revised_code
blueR | 5813
jack | 5968
Cwireless | 7349
Tea | 4899
Tea | 5813
blueR | 5678
jack | 4814
Tea | 4899
How can I implement this feature?
Broadcast the small lookUpDf and left join it with baseDf, then use a UDF to check whether raw_code is contained in codes: if it is, return raw_code, otherwise return the first value of the codes array.
import org.apache.spark.sql.functions._

// keep raw_code when the seller has no lookup entry or the code is allowed; otherwise use the first allowed code
val retainUdf = udf((rawCode: Int, codes: Seq[Int]) =>
  if (codes == null || codes.isEmpty) rawCode else if (codes.contains(rawCode)) rawCode else codes.head)

baseDf.join(broadcast(lookUpDf), Seq("seller_name"), "left")
  .select(col("seller_name"), retainUdf(col("raw_code"), col("codes")).as("raw_code"))
which should give you
+-----------+--------+
|seller_name|raw_code|
+-----------+--------+
|BlueR |5813 |
|jack |5968 |
|Cwireless |7349 |
|Tea |4899 |
|Tea |5813 |
|blueR |5678 |
|jack |4814 |
+-----------+--------+
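As a side note, on Spark 2.4+ the same check can also be written without a UDF using the built-in array functions; a minimal sketch, assuming codes is an array<int> column:

import org.apache.spark.sql.functions._

// keep raw_code when there is no lookup entry or it is already an allowed code,
// otherwise fall back to the first allowed code
val result = baseDf.join(broadcast(lookUpDf), Seq("seller_name"), "left")
  .select(
    col("seller_name"),
    when(col("codes").isNull || array_contains(col("codes"), col("raw_code")), col("raw_code"))
      .otherwise(element_at(col("codes"), 1))
      .as("revised_code"))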
I hope the answer is helpful
I have the following stream dataframe
+-------+--------------------------------------+
| name  | orderOfHobbies                       |
+-------+--------------------------------------+
| Liza  | [singing, painting]                  |
| Inter | [singing, singing]                   |
| Ovin  | [singing, playing, reading, singing] |
+-------+--------------------------------------+
I want to find the favorite hobby of each student. If every hobby occurs the same number of times for a student, I want to drop that record; e.g. Liza's record will be dropped. Since singing occurs most frequently for Inter and Ovin, singing will be their favorite hobby.
Expected output
+-------+---------------+
| name  | favoriteHobby |
+-------+---------------+
| Inter | singing       |
| Ovin  | singing       |
+-------+---------------+
You can use a UDF:
val favoriteUDF = udf((hobbies: Seq[String]) => {
  // count how often each distinct hobby occurs
  val counts = hobbies.groupBy(identity).map { case (h, occ) => h -> occ.size }
  // if there is more than one hobby and they all occur equally often, mark the row to be dropped
  if (counts.size > 1 && counts.values.toSet.size == 1) "invalid"
  else counts.maxBy(_._2)._1
})
val df2 = df.select(
col("name"),
favoriteUDF(col("orderOfHobbies")).as("favoriteHobby")
).filter("favoriteHobby != 'invalid'")
df2.show
+-----+-------------+
| name|favoriteHobby|
+-----+-------------+
|Inter| singing|
| Ovin| singing|
+-----+-------------+
So I have a data set like
{"customer":"customer-1","attributes":{"att-a":"att-a-7","att-b":"att-b-3","att-c":"att-c-10","att-d":"att-d-10","att-e":"att-e-15","att-f":"att-f-11","att-g":"att-g-2","att-h":"att-h-7","att-i":"att-i-5","att-j":"att-j-14"}}
{"customer":"customer-2","attributes":{"att-a":"att-a-9","att-b":"att-b-7","att-c":"att-c-12","att-d":"att-d-4","att-e":"att-e-10","att-f":"att-f-4","att-g":"att-g-13","att-h":"att-h-4","att-i":"att-i-1","att-j":"att-j-13"}}
{"customer":"customer-3","attributes":{"att-a":"att-a-10","att-b":"att-b-6","att-c":"att-c-1","att-d":"att-d-1","att-e":"att-e-13","att-f":"att-f-12","att-g":"att-g-9","att-h":"att-h-6","att-i":"att-i-7","att-j":"att-j-4"}}
{"customer":"customer-4","attributes":{"att-a":"att-a-9","att-b":"att-b-14","att-c":"att-c-7","att-d":"att-d-4","att-e":"att-e-8","att-f":"att-f-7","att-g":"att-g-14","att-h":"att-h-9","att-i":"att-i-13","att-j":"att-j-3"}}
I have flattened the data in the DF like this
+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+
| att-a| att-b| att-c| att-d| att-e| att-f| att-g| att-h| att-i| att-j| customer|
+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+
| att-a-7| att-b-3|att-c-10|att-d-10|att-e-15|att-f-11| att-g-2| att-h-7| att-i-5|att-j-14| customer-1|
| att-a-9| att-b-7|att-c-12| att-d-4|att-e-10| att-f-4|att-g-13| att-h-4| att-i-1|att-j-13| customer-2|
I want to complete the compareColumns function,
which compares the columns of the two dataframes (userDF and flattenedDF) and returns a new DF as in the sample output.
How do I do that? For example, compare each row and column in flattenedDF with userDF and increment a count when they match, e.g. att-a with att-a, att-b with att-b.
def getCustomer(customerID: String)(dataFrame: DataFrame): DataFrame = {
  dataFrame.filter($"customer" === customerID).toDF()
}

def compareColumns(customerID: String)(dataFrame: DataFrame): DataFrame = {
  val userDF = dataFrame.transform(getCustomer(customerID))
  userDF.printSchema()
  userDF
}
Sample Output:
+-----------+----------------+
| customer  |similarity_score|
+-----------+----------------+
|customer-1 |              -1|   <- it is the same as the reference customer, so ignore it with '-1'
|customer-12|               2|
|customer-3 |               2|
|customer-44|               5|
|customer-5 |               1|
|customer-6 |              10|
+-----------+----------------+
Thanks
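One possible way to sketch the compareColumns function (a minimal sketch, not a definitive implementation): take the reference customer's single row, compare every att-* column of flattenedDF against it, and sum the matches per row; the -1 marks the reference customer itself. The column names ("customer" plus the att-* columns) are taken from the flattened schema shown above.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Sketch only: per row, count how many att-* columns equal the reference customer's values
def compareColumns(customerID: String)(dataFrame: DataFrame): DataFrame = {
  val attCols = dataFrame.columns.filter(_ != "customer")
  // the single reference row for the given customer
  val userRow = dataFrame.filter(col("customer") === customerID).head()
  // 1 for every matching attribute, 0 otherwise, summed across all att-* columns
  val score = attCols
    .map(c => when(col(c) === lit(userRow.getAs[String](c)), 1).otherwise(0))
    .reduce(_ + _)
  dataFrame
    .withColumn("similarity_score",
      when(col("customer") === customerID, lit(-1)).otherwise(score))
    .select("customer", "similarity_score")
}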
I am creating a dataframe using
val snDump = table_raw
.applyMapping(mappings = Seq(
("event_id", "string", "eventid", "string"),
("lot-number", "string", "lotnumber", "string"),
("serial-number", "string", "serialnumber", "string"),
("event-time", "bigint", "eventtime", "bigint"),
("companyid", "string", "companyid", "string")),
caseSensitive = false, transformationContext = "sn")
.toDF()
.groupBy(col("eventid"), col("lotnumber"), col("companyid"))
.agg(collect_list(struct("serialnumber", "eventtime")).alias("snetlist"))
.createOrReplaceTempView("sn")
I have data like this in the df
eventid | lotnumber | companyid | snetlist
123 | 4q22 | tu56ff | [[12345,67438]]
456 | 4q22 | tu56ff | [[12346,67434]]
258 | 4q22 | tu56ff | [[12347,67455], [12333,67455]]
999 | 4q22 | tu56ff | [[12348,67459]]
I want to explode it and put the data into 2 columns in my table. To do that, this is what I am doing:
val serialNumberEvents = snDump.select(col("eventid"), col("lotnumber"), explode(col("snetlist")).alias("serialN"), explode(col("snetlist")).alias("eventT"), col("companyid"))
Also tried
val serialNumberEvents = snDump.select(col("eventid"), col("lotnumber"), col($"snetlist.serialnumber").alias("serialN"), col($"snetlist.eventtime").alias("eventT"), col("companyid"))
but it turns out that explode can only be used once, and I get an error in the select, so how do I use explode (or something else) to achieve what I am trying to do?
eventid | lotnumber | companyid | serialN | eventT |
123 | 4q22 | tu56ff | 12345 | 67438 |
456 | 4q22 | tu56ff | 12346 | 67434 |
258 | 4q22 | tu56ff | 12347 | 67455 |
258 | 4q22 | tu56ff | 12333 | 67455 |
999 | 4q22 | tu56ff | 12348 | 67459 |
I have looked at a lot of Stack Overflow threads but none of them helped me. It is possible that such a question has already been answered, but my understanding of Scala is very limited, which may be why I did not understand the answer. If this is a duplicate, could someone direct me to the correct answer? Any help is appreciated.
First, explode the array into a temporary struct column, then unpack it:
val serialNumberEvents = snDump
.withColumn("tmp",explode((col("snetlist"))))
.select(
col("eventid"),
col("lotnumber"),
col("companyid"),
// unpack struct
col("tmp.serialnumber").as("serialN"),
col("tmp.eventtime").as("serialT")
)
The trick is to pack the columns you want to explode in an array (or struct), use explode on the array and then unpack them.
val col_names = Seq("eventid", "lotnumber", "companyid", "snetlist")
val data = Seq(
(123, "4q22", "tu56ff", Seq(Seq(12345,67438))),
(456, "4q22", "tu56ff", Seq(Seq(12346,67434))),
(258, "4q22", "tu56ff", Seq(Seq(12347,67455), Seq(12333,67455))),
(999, "4q22", "tu56ff", Seq(Seq(12348,67459)))
)
val snDump = spark.createDataFrame(data).toDF(col_names: _*)
val serialNumberEvents = snDump.select(col("eventid"), col("lotnumber"), explode(col("snetlist")).alias("snetlist"), col("companyid"))
val exploded = serialNumberEvents.select($"eventid", $"lotnumber", $"snetlist".getItem(0).alias("serialN"), $"snetlist".getItem(1).alias("eventT"), $"companyid")
exploded.show()
Note that my snetlist has the schema Array(Array) rather than Array(Struct). You can simply get this by creating an array instead of a struct out of your columns.
Another approach, if you need to explode twice, is as follows (a different example, but it demonstrates the point):
val flattened2 = df.select($"director", explode($"films.actors").as("actors_flat"))
val flattened3 = flattened2.select($"director", explode($"actors_flat").as("actors_flattened"))
See Is there an efficient way to join two large Datasets with (deeper) nested array field? for a slightly different context, but same approach applies.
This answer is in response to your assertion that you can only explode once.
I have a Spark DataFrame of customers as shown below.
#SparkR code
customers <- data.frame(custID = c("001", "001", "001", "002", "002", "002", "002"),
date = c("2017-02-01", "2017-03-01", "2017-04-01", "2017-01-01", "2017-02-01", "2017-03-01", "2017-04-01"),
value = c('new', 'good', 'good', 'new', 'good', 'new', 'bad'))
customers <- createDataFrame(customers)
display(customers)
custID| date | value
--------------------------
001 | 2017-02-01| new
001 | 2017-03-01| good
001 | 2017-04-01| good
002 | 2017-01-01| new
002 | 2017-02-01| good
002 | 2017-03-01| new
002 | 2017-04-01| bad
In the first month's observation for a custID, the customer gets a value of 'new'. Thereafter they are classified as 'good' or 'bad'. However, it is possible for a customer to revert from 'good' or 'bad' back to 'new' if they open a second account. When this happens I want to tag the customer with '2' instead of '1', to indicate that they opened a second account, as shown below. How can I do this in Spark? Either SparkR or PySpark commands work.
#What I want to get
custID| date | value | tag
--------------------------------
001 | 2017-02-01| new | 1
001 | 2017-03-01| good | 1
001 | 2017-04-01| good | 1
002 | 2017-01-01| new | 1
002 | 2017-02-01| good | 1
002 | 2017-03-01| new | 2
002 | 2017-04-01| bad | 2
In pyspark:
from pyspark.sql import SparkSession, Window, functions as f
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
# df is equal to your customers dataframe
df = spark.read.load('file:///home/zht/PycharmProjects/test/text_file.txt', format='csv', header=True, sep='|').cache()

# rank the 'new' rows per customer: the rank is the account number
df_new = df.filter(df['value'] == 'new').withColumn('tag', f.rank().over(Window.partitionBy('custID').orderBy('date')))
# non-'new' rows start with a null tag
df = df_new.union(df.filter(df['value'] != 'new').withColumn('tag', f.lit(None)))
# carry the latest non-null tag forward within each customer, ordered by date
df = df.withColumn('tag', f.collect_list('tag').over(Window.partitionBy('custID').orderBy('date'))) \
       .withColumn('tag', f.udf(lambda x: x.pop(), IntegerType())('tag'))
df.show()
And output:
+------+----------+-----+---+
|custID| date|value|tag|
+------+----------+-----+---+
| 001|2017-02-01| new| 1|
| 001|2017-03-01| good| 1|
| 001|2017-04-01| good| 1|
| 002|2017-01-01| new| 1|
| 002|2017-02-01| good| 1|
| 002|2017-03-01| new| 2|
| 002|2017-04-01| bad| 2|
+------+----------+-----+---+
By the way, pandas can do this easily.
This can be done using the following piece of code.
First, filter out all the records with "new" and rank them per customer:

createOrReplaceTempView(customers, "df")   # register the customers DataFrame as the view "df"
df_new <- sql("select * from df where value = 'new'")
createOrReplaceTempView(df_new, "df_new")
df_new <- sql("select *, row_number() over (partition by custID order by date) as tag from df_new")
createOrReplaceTempView(df_new, "df_new")

Then join back and keep the latest tag seen on or before each date:

df <- sql("select custID, date, value, max(tag) as tag from
           (select t1.*, t2.tag from df t1 left outer join df_new t2 on
            t1.custID = t2.custID and t1.date >= t2.date) t group by 1,2,3")
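As a side note, the same tag can also be computed as a running count of 'new' rows per customer; a minimal Scala sketch of that idea (the question asks for SparkR or PySpark, but the window logic translates directly), assuming the customers DataFrame from the question:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// the tag is simply how many 'new' rows have been seen so far for the customer, in date order
val w = Window.partitionBy("custID").orderBy("date")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val tagged = customers.withColumn("tag", sum(when(col("value") === "new", 1).otherwise(0)).over(w))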
I have a sample dataframe
df_that_I_have
+---------+---------+-------+
| country | members | some |
+---------+---------+-------+
| India | 50 | 1 |
+---------+---------+-------+
| Japan | 20 | 3 |
+---------+---------+-------+
| India | 20 | 1 |
+---------+---------+-------+
| Japan | 10 | 3 |
+---------+---------+-------+
and I want a dataframe that looks like this
df_that_I_want
+---------+---------+-------+
| country | members | some |
+---------+---------+-------+
| India | 70 | 10 | // 5 * Sum of "some" for India, i.e. (1 + 1)
+---------+---------+-------+
| Japan | 30 | 30 | // 5 * Sum of "some" for Japan, i.e. (3 + 3)
+---------+---------+-------+
The second dataframe has the sum of members and the sum of some multiplied by 5.
This is what I'm doing to achieve this
val df_that_I_want = df_that_I_have
.select(df_that_I_have("country"),
df_that_I_have.groupBy("country").sum("members"),
5 * df_that_I_have.groupBy("country").sum("some")) //Problem here
But the compiler does not allow me to do this because apparently I can't multiply 5 with a column.
How can I multiply an integer value by the sum of some for each country?
You can try the lit function.
scala> val df_that_I_have = Seq(("India",50,1),("India",20,1),("Japan",20,3),("Japan",10,3)).toDF("Country","Members","Some")
df_that_I_have: org.apache.spark.sql.DataFrame = [Country: string, Members: int, Some: int]
scala> val df1 = df_that_I_have.groupBy("country").agg(sum("members"), sum("some") * lit(5))
df1: org.apache.spark.sql.DataFrame = [country: string, sum(members): bigint, ((sum(some),mode=Complete,isDistinct=false) * 5): bigint]
scala> val df_that_I_want= df1.select($"Country",$"sum(Members)".alias("Members"), $"((sum(Some),mode=Complete,isDistinct=false) * 5)".alias("Some"))
df_that_I_want: org.apache.spark.sql.DataFrame = [Country: string, Members: bigint, Some: bigint]
scala> df_that_I_want.show
+-------+-------+----+
|Country|Members|Some|
+-------+-------+----+
| India| 70| 10|
| Japan| 30| 30|
+-------+-------+----+
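Note that the auto-generated column name ((sum(some),mode=Complete,isDistinct=false) * 5) comes from an older Spark version, so on newer versions it is safer to alias the aggregates directly; a small sketch of that, not from the original answer:

val df_that_I_want = df_that_I_have
  .groupBy("Country")
  .agg(
    sum("Members").alias("Members"),
    (sum("Some") * lit(5)).alias("Some"))   // lit(5) turns the integer into a literal column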
Please try this
df_that_I_have.groupBy("country").agg(sum("members"), sum("some") * lit(5))
The lit function is used for creating a column of a literal value, which is 5 here.
Since you cannot multiply by 5 directly in this expression, lit(5) creates a column containing 5 and the multiplication is done with that column.