Difference between these two count methods in Spark - Scala

I have been doing a count of "games" using spark-sql. The first way is like so:
val gamesByVersion = dataframe.groupBy("game_version", "server").count().withColumnRenamed("count", "patch_games")
val games_count1 = gamesByVersion.where($"game_version" === 1 && $"server" === 1)
The second is like this:
val gamesDf = dataframe.
groupBy($"hero_id", $"position", $"game_version", $"server").count().
withColumnRenamed("count", "hero_games")
val games_count2 = gamesDf.where($"game_version" === 1 && $"server" === 1).agg(sum("hero_games"))
For all intents and purposes dataframe just has the columns hero_id, position, game_version and server.
However, games_count1 ends up being about 10, while games_count2 ends up being 50. Obviously these two counting methods are not equivalent, or something else is going on, but I am trying to figure out: what is the reason for the difference between them?

I guess it is because in the first query you group by only 2 columns, while in the second you group by 4. Therefore, you may have fewer distinct groups when grouping on just two columns.
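As a rough sketch of that point (assuming dataframe has exactly the four columns mentioned in the question), you can count how many distinct groups each grouping produces:
val coarseGroups = dataframe.select("game_version", "server").distinct().count()
val fineGroups = dataframe.select("hero_id", "position", "game_version", "server").distinct().count()
// Grouping by more columns can only split groups further, so fineGroups >= coarseGroups.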

Related

Spark program to check if a given keyword exists in a huge text file or not

To find out whether a given keyword exists in a huge text file or not, I came up with the two approaches below.
Approach 1 (PySpark):
def keywordExists(line):
    if line.find("my_keyword") > -1:
        return 1
    return 0

lines = sparkContext.textFile("test_file.txt")
isExist = lines.map(keywordExists)
total = isExist.reduce(lambda a, b: a + b)
print("Found" if total > 0 else "Not Found")
Approach 2 (Scala):
val keyword = "my_keyword"
val rdd = sparkContext.textFile("test_file.txt")
val count = rdd.filter(line => line.contains(keyword)).count()
println(if (count > 0) "Found" else "Not Found")
The main difference is that the first one maps and then reduces, whereas the second one filters and then counts. Could anyone suggest which is more efficient?
I would suggest:
val wordFound = !rdd.filter(line=>line.contains(keyword)).isEmpty()
Benefit: the search can be stopped once one occurrence of the keyword is found.
See also: Spark: Efficient way to test if an RDD is empty
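For completeness, a minimal sketch of how that check could replace the counting in the original program (assuming the same sparkContext and file as in the question):
val keyword = "my_keyword"
val rdd = sparkContext.textFile("test_file.txt")
// isEmpty only needs to find the first matching line, so Spark can stop scanning early
val wordFound = !rdd.filter(line => line.contains(keyword)).isEmpty()
println(if (wordFound) "Found" else "Not Found")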

Spark, Scala, Databricks, combine and add columns

Using Spark/Scala to attempt a "simple" query. I have a file which, after the first line of code below runs, looks like this:
EmpReg,EmpOT,RegPay,OTPay
Alice,Alice,400,20
Bob,Bob,300,0
Carol,Carol,450,120
Dan,Dan,400,200
Ellen,Ellen,360,40
The first and third columns (EmpReg, RegPay) come from one source and the second and fourth columns (EmpOT, OTPay) come from a second source. My objective is output that looks like this:
Emp,Pay
Alice,420
Bob,300
Carol,570
Dan,600
Ellen,400
Here is the code that I have been trying, or at least what I have saved:
var q2 = q.join(q1, q("EmpReg") === q1("EmpOT"), "fullouter")
//q2 = q2.select("EmpReg", ($"RegPay" + $"OTPay"))
//q2 = q2.groupBy($"EmpReg".sum($"RegPay" + $"OTPay"))
var add = q2.select(($"RegPay" + $"OTPay"))
//q2 = q2.sum("RegPay", "OTPay")
//q2 = q2.groupBy("EmpReg", "EmpOT")
//var q2 = q.join(q1).where("EmpReg") === "EmpOT"))
//q2 = q2.select("EmpReg").sum("RegPay", "OTPay")
//q2.show
add.show
[q] is the first file, which represents regular pay. [q1] is the second file, which represents overtime pay. [q2] is the combination shown in the first example above. Primary keys are [EmpReg] and [EmpOT]. I don't really need to combine [EmpReg] and [EmpOT] since they are the same, and it doesn't make any difference which I use.
I really need to add [RegPay] and [OTPay] to get [Pay], but for the life of me I can't get it to work. The lines commented out return various errors. I can add the two pay columns, and I can select an appropriate employee column, but I can't seem to do both in one query. I am constrained to use Scala on Databricks. Otherwise, I might do something like this:
select q.EmpReg as Emp, (q.RegPay + q1.OTPay) as Pay
from q join q1 on q.EmpReg = q1.EmpOT
(Why can't things ever be simple?)
You can use an approach similar to your SQL query:
val q2 = q.join(q1, q("EmpReg") === q1("EmpOT"), "fullouter")
val add = q2.select(q("EmpReg").as("Emp"), (q("RegPay") + q1("OTPay")).as("Pay"))
Your code has this line
q2.select("EmpReg", ($"RegPay" + $"OTPay"))
which should work if you add $ before "EmpReg". You can't mix strings and Columns in the same select statement; that works in PySpark but not in Scala.
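For reference, a minimal sketch of the corrected select using only $ column references (assuming import spark.implicits._ is in scope and q2 is the joined DataFrame; the variable name is illustrative):
val pay = q2.select($"EmpReg".as("Emp"), ($"RegPay" + $"OTPay").as("Pay"))
pay.show()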

PySpark filtering gives inconsistent behavior

So I have a data set where I do some transformations and the last step is to filter out rows that have a 0 in a column called frequency. The code that does the filtering is super simple:
def filter_rows(self, name: str = None, frequency_col: str = 'frequency', threshold: int = 1):
    df = getattr(self, name)
    df = df.where(df[frequency_col] >= threshold)
    setattr(self, name, df)
    return self
The problem is very strange behavior: if I set a rather high threshold like 10, it works fine and filters out all the rows below 10, but if I make the threshold just 1, it does not remove the 0s! Here is an example of the former (threshold=10):
{"user":"XY1677KBTzDX7EXnf-XRAYW4ZB_vmiNvav7hL42BOhlcxZ8FQ","domain":"3a899ebbaa182778d87d","frequency":12}
{"user":"lhoAWb9U9SXqscEoQQo9JqtZo39nutq3NgrJjba38B10pDkI","domain":"3a899ebbaa182778d87d","frequency":9}
{"user":"aRXbwY0HcOoRT302M8PCnzOQx9bOhDG9Z_fSUq17qtLt6q6FI","domain":"33bd29288f507256d4b2","frequency":23}
{"user":"RhfrV_ngDpJex7LzEhtgmWk","domain":"390b4f317c40ac486d63","frequency":14}
{"user":"qZqqsNSNko1V9eYhJB3lPmPp0p5bKSq0","domain":"390b4f317c40ac486d63","frequency":11}
{"user":"gsmP6RG13azQRmQ-RxcN4MWGLxcx0Grs","domain":"f4765996305ccdfa9650","frequency":10}
{"user":"jpYTnYjVkZ0aVexb_L3ZqnM86W8fr082HwLliWWiqhnKY5A96zwWZKNxC","domain":"f4765996305ccdfa9650","frequency":15}
{"user":"Tlgyxk_rJF6uE8cLM2sArPRxiOOpnLwQo2s","domain":"f89838b928d5070c3bc3","frequency":17}
{"user":"qHu7fpnz2lrBGFltj98knzzbwWDfU","domain":"f89838b928d5070c3bc3","frequency":11}
{"user":"k0tU5QZjRkBwqkKvMIDWd565YYGHfg","domain":"f89838b928d5070c3bc3","frequency":17}
And now here is some of the data with threshold=1:
{"user":"KuhSEPFKACJdNyMBBD2i6ul0Nc_b72J4","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"EP1LomZ3qAMV3YtduC20","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"UxulBfshmCro-srE3Cs5znxO5tnVfc0_yFps","domain":"d69cb6f62b885fec9b7d","frequency":1}
{"user":"v2OX7UyvMVnWlDeDyYC8Opk-va_i8AwxZEsxbk","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"4hu1uE2ucAYZIrNLeOY2y9JMaArFZGRqjgKzlKenC5-GfxDJQQbLcXNSzj","domain":"68b588cedbc66945c442","frequency":0}
{"user":"5rFMWm_A-7N1E9T289iZ65TIR_JG_OnZpJ-g","domain":"68b588cedbc66945c442","frequency":1}
{"user":"RLqoxFMZ7Si3CTPN1AnI4hj6zpwMCJI","domain":"68b588cedbc66945c442","frequency":1}
{"user":"wolq9L0592MGRfV_M-FxJ5Wc8UUirjqjMdaMDrI","domain":"68b588cedbc66945c442","frequency":0}
{"user":"9spTLehI2w0fHcxyvaxIfo","domain":"68b588cedbc66945c442","frequency":1}
I should note that before this step I perform some other transformations. I've noticed weird behavior in Spark in the past: sometimes doing very simple things like this after a join or a union gives very strange results, and eventually the only solution is to write the data out, read it back in, and do the operation in a completely separate script. I hope there is a better solution than this!

How to do equality check of two DataFrames?

I have the below scenario:
I have 2 dataframes containing only 1 column.
Let's say:
DF1 = (1, 2, 3, 4, 5)
DF2 = (3, 6, 7, 8, 9, 10)
Basically those values are keys, and I am creating a parquet file of DF1 if the keys in DF1 are not in DF2 (in the current example it should return false). My current way of achieving this requirement is:
val df1count = DF1.count
val df2count = DF2.count
val diffDF = DF2.except(DF1)
val diffCount = diffDF.count
if (diffCount == (df2count - df1count)) true
else false
The problem with this approach is that I am calling actions 4 times, which is surely not the best way. Can someone suggest the most effective way of achieving this?
You can use intersect to get the values common to both DataFrames, and then check if it's empty:
DF1.intersect(DF2).take(1).isEmpty
That uses only one action (take(1)), and a fairly quick one at that.
Here is a check for whether Dataset first is equal to Dataset second, based on their symmetric difference being empty:
val isEqual = first.except(second).union(second.except(first)).count() == 0
Try an intersection combined with counts; this ensures both that the contents are the same and that the number of values in both is the same:
val intersectCount = DF1.intersect(DF2).count()
val check = (intersectCount == DF1.count()) && (intersectCount == DF2.count())
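As a small side note, if you are on Spark 2.4 or later, Dataset.isEmpty can express the same single-action check as the intersect answer above (a sketch only; on older versions .take(1).isEmpty behaves the same):
// true when DF1 and DF2 share no values, which is the condition the question asks about
val noCommonKeys = DF1.intersect(DF2).isEmpty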

Imputing the dataset with mean of class label causing crash in filter operation

I have a csv file that contains numeric values.
val row = withoutHeader.map { line =>
  val arr = line.split(',')
  for (h <- 0 until arr.length) {
    if (arr(h).trim == "") {
      val abc = avgrdd.filter { case ((x, y), z) => x == h && y == arr(dependent_col_index).toDouble } // crashing here
      arr(h) = // imputing with the value above
    }
  }
  arr.mkString(",")
}
This is a snippet of the code where I am trying to impute the missing values with the mean of the class labels.
avgrdd contains the averages for the key-value pairs, where the key is a pair of the column index and the class label value. This avgrdd is calculated using combiners, which I can see are computing the results correctly.
dependent_col_index is the column containing the class labels.
The line with filter is crashing with a null pointer exception.
On removing this line, the original array is the output (comma separated).
I am confused why the filter operation is causing a crash.
Please suggest how to fix this issue.
Example
col1,dependent_col_index
4,1
8,0
,1
21,1
21,0
,1
25,1
,0
34,1
The mean for class 1 is 84/4 = 21 and the mean for class 0 is 29/2 = 14.5.
Required Output
4,1
8,0
21,1
21,1
21,0
21,1
25,1
14.5,0
34,1
Thanks !!
You are trying to execute an RDD transformation inside another RDD transformation. Remember that you cannot use an RDD inside another RDD transformation; this causes an error.
The way to proceed is the following:
1. Transform the source RDD withoutHeader into an RDD of pairs <Class, Value> of the correct type (Long in your case). Cache it.
2. Calculate avgrdd on top of withoutHeader. This should be an RDD of pairs <Class, AvgValue>.
3. Join the withoutHeader RDD and avgrdd together; this way, for each row you would have a structure <Class, <Value, AvgValue>>.
4. Execute map on top of the result to replace the missing Value with AvgValue.
Another option might be to split the RDD into two parts at step 3 (one part being the rows with missing values, the other the rows with non-missing values), join avgrdd only with the part containing the missing values, and afterwards make a union of the two parts. This would be faster if only a small fraction of the values is missing.
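Below is a rough sketch of the join-based plan above, under assumptions that are not in the original post: each line has the shape value,label (a single value column, as in the example), missing values are empty strings, sc is the SparkContext, and the file path is hypothetical.
val withoutHeader = sc.textFile("data.csv")            // hypothetical input path
  .zipWithIndex().filter(_._2 > 0).map(_._1)           // drop the header line

// step 1: (label, rawValue) pairs keyed by the class label, cached for reuse
val byLabel = withoutHeader.map { line =>
  val Array(value, label) = line.split(",", -1)
  (label, value)
}.cache()

// step 2: per-label mean computed from the non-missing values only
val avgByLabel = byLabel
  .filter { case (_, v) => v.trim.nonEmpty }
  .mapValues(v => (v.toDouble, 1L))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
  .mapValues { case (total, count) => total / count }

// steps 3 and 4: join each row with its label's average and fill in the blanks
val imputed = byLabel.join(avgByLabel).map { case (label, (value, avg)) =>
  val filled = if (value.trim.isEmpty) avg.toString else value
  s"$filled,$label"
}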