How should I count enum values in a DataFrame using Scala? - scala

I'm new to Scala. I have a Spark DataFrame like the one below:
userid|productid|enumXX
1     |3        |1
2     |3        |1
3     |4        |2
1     |3        |3
The enumXX values are 1, 2 and 3; it's an enum type, defined as follows:
object enumXX extends Enumeration {
  type EnumXX = Value
  val apple = Value(1)
  val balana = Value(2)
  val orign = Value(3)
}
I would like to group by userid and productid and count how many apple, balana and orign values each group has. How should I do this in Scala?
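One possible approach (a minimal sketch, assuming Spark 1.6+ where pivot is available, and that the DataFrame above is named df) is to pivot on the enum column:
import org.apache.spark.sql.functions.count

// Group by user and product, then pivot on enumXX so each enum value
// (1 = apple, 2 = balana, 3 = orign) becomes its own count column.
val counts = df
  .groupBy("userid", "productid")
  .pivot("enumXX", Seq(1, 2, 3))
  .agg(count("enumXX"))
  .toDF("userid", "productid", "apple", "balana", "orign")

counts.show()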

Related

SparkSQL custom defined function in when clause

I have a DataFrame like this:
id val1 val2
------------
1 v11 v12
2 v21 v22
3 v31 v32
4 v41 v42
5 v51 v52
6 v61 v62
Each row represents a person who may belong to one or more groups. I have a function that takes the values for each row and determines whether that person meets the criteria for a particular group:
def isInGroup(group: Int)(id: String, v1: String, v2: String): Boolean
and I'm trying to output a DataFrame like this:
Group1 Group2 Group3 Group4
---------------------------
3 0 6 1
Here's my code so far, which doesn't work. Unfortunately, the when clause only takes a parameter of type Column, so my function doesn't work, and user-defined functions don't work either. I'd really like to stick with the select/struct/as way of doing it if possible.
val summaryDF = dataDF
  .select(struct(
    sum(when(isInGroup(1)($"id", $"val1", $"val2"), value = 1)).as("Group1"),
    sum(when(isInGroup(2)($"id", $"val1", $"val2"), value = 1)).as("Group2"),
    sum(when(isInGroup(3)($"id", $"val1", $"val2"), value = 1)).as("Group3"),
    sum(when(isInGroup(4)($"id", $"val1", $"val2"), value = 1)).as("Group4")
  ))
As I showed in my previous answer, you'll need a UDF:
import org.apache.spark.sql.functions.udf

def isInGroupUDF(group: Int) = udf(isInGroup(group) _)

sum(when(
  isInGroupUDF(1)($"id", $"val1", $"val2"), 1
)).as("Group1")
If you want to avoid listing the columns each time, you can, for example, use default arguments:
def isInGroupUDF(group: Int, id: Column = $"id",
                 v1: Column = $"val1", v2: Column = $"val2") = {
  val f = udf(isInGroup(group) _)
  f(id, v1, v2)
}
sum(when(
  isInGroupUDF(1), 1
)).as("Group1")

Spark Scala. Using an external "dataframe" variable in a map

I have two dataframes,
val df1 = sqlContext.csvFile("/data/testData.csv")
val df2 = sqlContext.csvFile("/data/someValues.csv")
df1=
startTime  name  cause1  cause2
15679      CCY   5       7
15683            2       5
15685            1       9
15690            9       6
df2=
cause  description  causeType
3      Xxxxx        cause1
1      xxxxx        cause1
3      xxxxx        cause2
4      xxxxx
2      Xxxxx
and I want to apply a complex function getTimeCust to both cause1 and cause2 to determine a final cause, then look up the description of this final cause code in df2. I need a new DataFrame (or RDD) with the following columns:
startTime name cause descriptionCause
My attempt was:
val rdd2 = df1.map(row => {
  val (cause, descriptionCause) = getTimeCust(row.getInt(2), row.getInt(3), df2)
  Row(row(0), row(1), cause, descriptionCause)
})
If I run this code I get a NullPointerException because df2 is not visible inside the map.
The function getTimeCust(Int, Int, DataFrame) works well outside the map.
Use df1.join(df2, <join condition>) to join your dataframes together, then select the fields you need from the joined dataframe.
You can't use spark's distributed structures (rdd, dataframe, etc) in code that runs on an executor (like inside a map).
Try something like this:
def f1(cause1: Int, cause2: Int): Int = ??? // some logic to calculate the cause

import org.apache.spark.sql.functions.udf

val dfCause = df1.withColumn("df1_cause", udf(f1 _)($"cause1", $"cause2"))
val dfJoined = dfCause.join(df2, dfCause("df1_cause") === df2("cause"))
dfJoined.select("cause", "description").show()
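If you also need the output columns the question asks for (startTime, name, cause, descriptionCause), a small follow-up sketch, assuming the column names shown above:
// Keep the original columns plus the resolved cause and its description.
dfJoined
  .select($"startTime", $"name", $"df1_cause".as("cause"),
          $"description".as("descriptionCause"))
  .show()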
Thank you @Assaf. Thanks to your answer and to Spark UDFs with DataFrames, I have resolved this problem. The solution is:
val getTimeCust = udf((cause1: Int, cause2: Int) => {
  var lastCause = 0
  var categoryCause = ""
  var descCause = ""
  lastCause = .............
  categoryCause = ........
  (lastCause, categoryCause)
})
and then call the UDF as:
val dfWithCause = df1.withColumn("df1_cause", getTimeCust($"cause1", $"cause2"))
And finally the join:
val dfFinale = dfWithCause.join(df2,
  dfWithCause.col("df1_cause._1") === df2.col("cause") &&
    dfWithCause.col("df1_cause._2") === df2.col("causeType"),
  "outer")

Spark MLlib - Scala

I have a dataset containing the Customer_ID and the movies that each customer has seen.
I am analyzing the patterns across movies, e.g. if customer X sees movie Y then he also sees movie Z.
I have already grouped my dataset by Customer_ID and have this sample of the data:
Customer_ID,Movie_ID
1, 2,1,3
2, 1
3, 3,6,8
What I want is to ignore the Customer_ID column and keep only the list of movies, like this:
2,1,3
1
3,6,8
How can I do this? My code is:
val data = sc.textFile("FILE")

case class Movies(Customer_ID: String, Movie_ID: String)

def csvToMyClass(line: String) = {
  val split = line.split(',')
  Movies(split(0), split(1))
}

val df = data.map(csvToMyClass).toDF("Customer_ID", "Movie_ID")
df.show

val movies = df.groupBy(col("Customer_ID"))
  .agg(collect_list(col("Movie_ID")) as "Movie_ID")
  .withColumn("Movie_ID", concat_ws(",", col("Movie_ID")))
  .rdd
Thanks
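One possible way to get just the movie lists and drop Customer_ID, sketched under the assumption that df is built as in the snippet above (same grouping and column names):
import org.apache.spark.sql.functions.{col, collect_list, concat_ws}

// Group as before, then keep only the comma-separated movie list column.
val movieLists = df
  .groupBy(col("Customer_ID"))
  .agg(concat_ws(",", collect_list(col("Movie_ID"))).as("Movie_ID"))
  .select("Movie_ID")

// One comma-separated string per customer, e.g. "2,1,3", "1", "3,6,8".
val asStrings = movieLists.rdd.map(_.getString(0))
asStrings.collect().foreach(println)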

Need to generate sequence ids without using groupby in Spark Scala

I want to generate a sequence number column (Seq_No) that increments whenever Product_IDs changes in the table. In my input table I have only Product_IDs and I want the output with Seq_No. We cannot use GroupBy or row_number over a partition in SQL, as it is not supported in our Scala setup.
Logic: Seq_No = 1
for (i = 2 to No_of_Rows)
  when Product_IDs(i) != Product_IDs(i-1) then Seq_No(i) = Seq_No(i-1) + 1
  else Seq_No(i) = Seq_No(i-1)
end as Seq_No
Product_IDs Seq_No
ID1 1
ID1 1
ID1 1
ID2 2
ID3 3
ID3 3
ID3 3
ID3 3
ID1 4
ID1 4
ID4 5
ID5 6
ID3 7
ID6 8
ID6 8
ID5 9
ID5 9
ID4 10
So I want to increment Seq_No whenever the current Product_ID is not equal to the previous one. The input table has only one column, Product_IDs, and we want Product_IDs with Seq_No using Spark Scala.
I would probably just write a function to generate the sequence numbers:
scala> val getSeqNum: String => Int = {
  var prevId = ""
  var n = 0
  (id: String) => {
    if (id != prevId) {
      n += 1
      prevId = id
    }
    n
  }
}
getSeqNum: String => Int = <function1>
scala> for { id <- Seq("foo", "foo", "bar") } yield getSeqNum(id)
res8: Seq[Int] = List(1, 1, 2)
UPDATE:
I'm not quite clear on what you want beyond that, Nikhil, and I am not a Spark expert, but I imagine you want something like
val rdd = ??? // Hopefully you know how to get the RDD
for {
  (id, col2, col3) <- rdd // assuming the entries are tuples
  seqNum = getSeqNum(id)
} yield ??? // Hopefully you know how to transform the entries
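A slightly more concrete sketch of that idea, reusing getSeqNum from above and assuming the input is an RDD[String] holding just the Product_IDs column. Note that the mutable counter inside getSeqNum only matches the pseudocode if the rows are processed in order within a single partition, hence the coalesce(1); this is an illustration rather than a scalable solution:
import org.apache.spark.rdd.RDD

val productIds: RDD[String] = ??? // hypothetical: the Product_IDs column as an RDD[String]

// One partition means a single ordered pass, so the counter increments
// exactly when the product id changes from the previous row.
val withSeqNo = productIds
  .coalesce(1)
  .map(id => (id, getSeqNum(id)))

withSeqNo.collect().foreach { case (id, n) => println(s"$id $n") }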

Spark RDD mapping one row of data into multiple rows

I have a text file with data that look like this:
Type1 1 3 5 9
Type2 4 6 7 8
Type3 3 6 9 10 11 25
I'd like to transform it into an RDD with rows like this:
1 Type1
3 Type1
3 Type3
......
I started with a case class:
case class MyData(uid: Int, gid: String)
I'm new to Spark and Scala, and I can't seem to find an example that does this.
It seems you want something like this?
rdd.flatMap(line => {
  val splitLine = line.split(' ').toList
  splitLine match {
    case (gid: String) :: rest => rest.map((x: String) => MyData(x.toInt, gid))
    case Nil => Nil // empty lines produce no output and keep the match exhaustive
  }
})
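For reference, a small self-contained usage sketch around that snippet; the file name "data.txt" and the printed format are assumptions, not from the original post:
case class MyData(uid: Int, gid: String)

// Hypothetical input path; each line looks like "Type1 1 3 5 9".
val rdd = sc.textFile("data.txt")

val expanded = rdd.flatMap { line =>
  line.split(' ').toList match {
    case gid :: rest => rest.map(x => MyData(x.toInt, gid))
    case Nil         => Nil
  }
}

// Prints rows such as "1 Type1", "3 Type1", "3 Type3", ...
expanded.collect().foreach(d => println(s"${d.uid} ${d.gid}"))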