I have the following expression,
val pageViews = spark.sql(
s"""
|SELECT
| proposal,
| MIN(timestamp) AS timestamp,
| MAX(page_view_after) AS page_view_after
|FROM page_views
|GROUP BY proposalId
|""".stripMargin
).createOrReplaceTempView("page_views")
I want to convert it into one that uses the Dataset API:
val pageViews = pageViews.selectExpr("proposal", "MIN(timestamp) AS timestamp", "MAX(page_view_after) AS page_view_after").groupBy("proposal")
The problem is I can't call createOrReplaceTempView on this one - the build fails.
My question is how do I convert the first one into the second one and create a TempView out of that?
You can get rid of the SQL expression altogether by using Spark SQL's functions
import org.apache.spark.sql.functions._
as below
pageViews
.groupBy("proposal")
.agg(max("timestamp").as("timestamp"),max("page_view_after").as("page_view_after"))
Considering you have a dataframe available with the name pageViews, use:
pageViews
.groupBy("proposal")
.agg(expr("min(timestamp) AS timestamp"), expr("max(page_view_after) AS page_view_after"))
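To answer the temp-view part of the question: the build most likely fails because groupBy on its own returns a RelationalGroupedDataset, which has no createOrReplaceTempView; once agg is applied you get a DataFrame back and can register the view. A minimal sketch, assuming pageViews is the source DataFrame:
import org.apache.spark.sql.functions.{min, max}

val aggregated = pageViews
  .groupBy("proposal")
  .agg(
    min("timestamp").as("timestamp"),
    max("page_view_after").as("page_view_after")
  )

// The aggregated result is a plain DataFrame, so a temp view can be created from it directly
aggregated.createOrReplaceTempView("page_views")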
I'm trying to modify a column from my dataFrame by removing the suffix from all the rows under that column and I need it in Scala.
The values from the column have different lengths and also the suffix is different.
For example, I have the following values:
09E9894DB868B70EC3B55AFB49975390-0_0_0_0_0
0978C74C69E8D559A62F860EA36ADF5E-28_3_1
0C12FA1DAFA8BCD95E34EE70E0D71D10-0_3_1
0D075AA40CFC244E4B0846FA53681B4D_0_1_0_1
22AEA8C8D403643111B781FE31B047E3-0_1_0_0
I need to remove everything after the "_" so that I can get the following values:
09E9894DB868B70EC3B55AFB49975390-0
0978C74C69E8D559A62F860EA36ADF5E-28
0C12FA1DAFA8BCD95E34EE70E0D71D10-0
0D075AA40CFC244E4B0846FA53681B4D
22AEA8C8D403643111B781FE31B047E3-0
As @werner pointed out in his comment, substring_index provides a simple solution to this. It is not necessary to wrap it in a call to selectExpr.
Whereas @AminMal has provided a working solution using a UDF, a native Spark function is preferable for performance when one is available.[1]
import org.apache.spark.sql.functions.{col, substring_index}
import spark.implicits._ // needed for .toDF on a local collection (assumes a SparkSession named spark)

val df = List(
  "09E9894DB868B70EC3B55AFB49975390-0_0_0_0_0",
  "0978C74C69E8D559A62F860EA36ADF5E-28_3_1",
  "0C12FA1DAFA8BCD95E34EE70E0D71D10-0_3_1",
  "0D075AA40CFC244E4B0846FA53681B4D_0_1_0_1",
  "22AEA8C8D403643111B781FE31B047E3-0_1_0_0"
).toDF("col0")

// Keep everything before the first "_"
df
  .withColumn("col0", substring_index(col("col0"), "_", 1))
  .show(false)
gives:
+-----------------------------------+
|col0 |
+-----------------------------------+
|09E9894DB868B70EC3B55AFB49975390-0 |
|0978C74C69E8D559A62F860EA36ADF5E-28|
|0C12FA1DAFA8BCD95E34EE70E0D71D10-0 |
|0D075AA40CFC244E4B0846FA53681B4D |
|22AEA8C8D403643111B781FE31B047E3-0 |
+-----------------------------------+
[1] Is there a performance penalty when composing spark UDFs
Currently I have several Dataset[UserRecord] instances, and they look like this:
case class UserRecord(
Id: String,
ts: Timestamp,
detail: String
)
Let's call the several datasets datasets.
Previously I tried this
datasets.reduce(_ union _)
  .groupBy("Id")
  .agg(collect_list(struct("ts", "detail")))
  .as[(String, Seq[DetailRecord])]
but this code gives me an OOM error. I think the root cause is collect_list.
Now I'm wondering if I can do the groupBy and agg for each dataset first and then join the results together to solve the OOM issue. Any other good advice is welcome too :)
I have an IndexedSeq of datasets that look like this:
|name| lists         |
| x  |[[1,2], [3,4]] |

|name| lists         |
| y  |[[5,6], [7,8]] |

|name| lists            |
| x  |[[9,10], [11,12]] |
How can I combine them to get a Dataset that looks like
|name| lists |
| x |[[1,2], [3,4],[9,10], [11,12]]|
| y |[[5,6], [7,8]] |
I tried ds.reduce(_ union _) but it didn't seem to work
You can aggregate after union:
val ds2 = ds.reduce(_ unionAll _).groupBy("name").agg(flatten(collect_list("lists")).as("lists"))
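For completeness, a minimal runnable sketch of that idea, assuming ds is the IndexedSeq of datasets from the question (union can be used in place of the older unionAll, and flatten requires Spark 2.4+):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, collect_list, flatten}

// Union all datasets, then merge each name's collected arrays into a single array
val combined: DataFrame = ds
  .reduce(_ union _)
  .groupBy("name")
  .agg(flatten(collect_list(col("lists"))).as("lists"))

combined.show(false)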
I am new to Apache Spark. I want to find the unique products among the stores using Scala Spark.
Data in the file is like below, where the 1st column in each row represents the store name.
Sears,shoe,ring,pan,shirt,pen
Walmart,ring,pan,hat,meat,watch
Target,shoe,pan,shirt,hat,watch
I want the output to be
Only Walmart has Meat.
only Sears has Pen.
I tried the below in Scala Spark and I am able to get the unique products, but I don't know how to get the store name of those products. Please help.
val filerdd = sc.textFile("file:///home/hduser/stores_products")
val uniquerdd = filerdd
  .map(x => x.split(","))
  .map(x => Array(x(1), x(2), x(3), x(4), x(5)))
  .flatMap(x => x)
  .map(x => (x, 1))
  .reduceByKey((a, b) => a + b)
  .filter(x => x._2 == 1)
uniquerdd holds - Array((pen,1),(meat,1))
Now I want to find in which rows of filerdd these products are present, and display the output as below:
Only Walmart has Meat.
Only Sears has Pen.
can you please help me to get the desired output?
The DataFrame API is probably easier than the RDD API for this. You can explode the list of products and filter those with count = 1.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named spark

val df = spark.read.csv("filepath")

val result = df.select(
  $"_c0".as("store"),
  // every column except the first holds a product; explode into one row per (store, product)
  explode(array(df.columns.tail.map(col): _*)).as("product")
).withColumn(
  "count",
  count("*").over(Window.partitionBy("product"))
).filter(
  "count = 1"
).select(
  format_string("Only %s has %s.", $"store", $"product").as("output")
)
result.show(false)
+----------------------+
|output |
+----------------------+
|Only Walmart has meat.|
|Only Sears has pen. |
+----------------------+
I have several dataframes which contain a single column each. Let's say I have 4 such dataframes, all with one column. How can I form a single dataframe by combining all of them?
val df = xmldf.select(col("UserData.UserValue._valueRef"))
val df2 = xmldf.select(col("UserData.UserValue._title"))
val df3 = xmldf.select(col("author"))
val df4 = xmldf.select(col("price"))
To combine, I am trying this, but it doesn't work:
var newdf = df
newdf = newdf.withColumn("col1",df1.col("UserData.UserValue._title"))
newdf.show()
It errors out saying that fields of one column are not present in another. I am not sure how I can combine these 4 dataframes together. They don't have any common column.
df2 looks like this:
+---------------+
| _title|
+---------------+
|_CONFIG_CONTEXT|
|_CONFIG_CONTEXT|
|_CONFIG_CONTEXT|
+---------------+
and df looks like this:
+-----------+
|_valuegiven|
+-----------+
| qwe|
| dfdfrt|
| dfdf|
+-----------+
df3 and df4 are also in the same format. I want a dataframe like the one below:
+-----------+---------------+
|_valuegiven| _title|
+-----------+---------------+
| qwe|_CONFIG_CONTEXT|
| dfdfrt|_CONFIG_CONTEXT|
| dfdf|_CONFIG_CONTEXT|
+-----------+---------------+
I used this:
val newdf = xmldf.select(col("UserData.UserValue._valuegiven"),col("UserData.UserValue._title") )
newdf.show()
But I am getting the column names on the fly, so I would need to append them on the fly as well, and I don't know in advance exactly how many columns I will get. That is why I cannot use the above command.
Your goal is a little unclear. You may be asking how to join these dataframes, but perhaps you just want to select those 4 columns:
val newdf = xmldf.select($"UserData.UserValue._valueRef", $"UserData.UserValue._title", $"author", $"price")
newdf.show
If you really want to join all these dataframes, you'll need to join them all and select the appropriate fields.
If the goal is to get 4 columns from xmldf into a new dataframe you shouldn't be splitting it into 4 dataframes in the first place.
You can select multiple columns from a dataframe by providing additional column names in the select function.
val newdf = xmldf.select(
col("UserData.UserValue._valueRef"),
col("UserData.UserValue._title"),
col("author"),
col("price"))
newdf.show()
So I looked at various ways, and finally Ram Ghadiyaram's answer in Solution 2 does what I wanted to do. Using this approach, you can combine any number of columns on the fly. Basically, you need to create an index by which you can join the dataframes together and, after joining, drop the index column altogether.
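For reference, a minimal sketch of that index-join idea, assuming all dataframes have the same number of rows in the same order; withRowIndex and row_idx are illustrative names, not part of any Spark API:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Attach a consecutive row index to a dataframe via zipWithIndex
def withRowIndex(df: DataFrame): DataFrame = {
  val schema = StructType(df.schema.fields :+ StructField("row_idx", LongType, nullable = false))
  val rdd = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
  df.sparkSession.createDataFrame(rdd, schema)
}

// Join the single-column dataframes on the index, then drop it
val combined = Seq(df, df2, df3, df4)
  .map(withRowIndex)
  .reduce((a, b) => a.join(b, Seq("row_idx")))
  .drop("row_idx")

combined.show()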
Every row in the dataframe contains a CSV-formatted string line plus another simple string, so what I'm trying to get at the end is a dataframe composed of the fields extracted from the line string together with the category.
So I proceeded as follows to explode the line string
val df = stream.toDF("line","category")
.map(x => x.getString(0))......
In the end I manage to get a new dataframe composed of the line fields, but I can't carry the category over to the new dataframe.
I can't join the new dataframe with the initial one either, since the common field id was not a separate column at first.
Sample of input :
line | category
"'1';'daniel';'dan#gmail.com'" | "premium"
Sample of output:
id | name | email | category
1 | "daniel"| "dan#gmail.com"| "premium"
Any suggestions, thanks in advance.
If the structure of the strings in the line column is fixed as mentioned in the question, then the following simple solution should work. The split built-in function is used to split the string into an array, and then the elements are selected from the array and aliased to get the final dataframe.
import org.apache.spark.sql.functions._
df.withColumn("line", split(col("line"), ";"))
.select(col("line")(0).as("id"), col("line")(1).as("name"), col("line")(2).as("email"), col("category"))
.show(false)
which should give you
+---+--------+---------------+--------+
|id |name |email |category|
+---+--------+---------------+--------+
|'1'|'daniel'|'dan@gmail.com'|premium |
+---+--------+---------------+--------+
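Note that the values still carry the single quotes from the original line string; if they should be stripped to match the desired output, one hedged option is to remove them before splitting, for example:
import org.apache.spark.sql.functions.{col, regexp_replace, split}

// Drop the single quotes first, then split on ';' exactly as above
df.withColumn("line", split(regexp_replace(col("line"), "'", ""), ";"))
  .select(col("line")(0).as("id"), col("line")(1).as("name"), col("line")(2).as("email"), col("category"))
  .show(false)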
I hope the answer is helpful