examine the result of groupBy followed by orderBy in spark - pyspark

I have dataframe, including two columns, e.g., "ID" and "Time". I would like to quickly check the result of grouped by ID and based on the time information.
I can do this
df.groupBy('ID').orderBy('Time')
But I cannot add show after that because show() is not a attribute of either orderBy or groupBy. How to quickly check this kind of operation result.

For a quick check (and if you do not care too much about performance) you can use GroupedData.applyInPandas() to print out each group separately as Pandas Dataframe:
def debug(pandas_df):
print(pandas_df.sort_values("Time"))
return pandas_df
df = df.groupBy("ID").applyInPandas(lambda pandas_df: debug(pandas_df), df.schema)
df.<call action>()
In a multi node environment, the debug messages will be printed out on the executors.

Related

Databricks - All records from Dataframe/Tempview get removed after merge

I am observing some weired issue. I am not sure whether it is lack of my knowledge in spark or what.
I have a dataframe as shown in below code. I create a tempview from it and I am observing that after merge operation, that tempview becomes empty. Not sure why.
val myDf = getEmployeeData()
myDf.createOrReplaceTempView("myView")
// Result1: Below lines display all the records
myDf.show()
spark.Table("myView").show()
// performing merge operation
val sql = s"""MERGE INTO employee AS a
USING myView AS b
ON a.Id = b.Id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *"""
spark.sql(sql)
// Result2: ISSUE is here. myDf & mvView both are empty
myDf.show()
spark.Table("myView").show()
Edit
getEmployeeData method performs join between two dataframes and returns the result.
df1.as(df1Alias).join(df2.as(df2Alias), expr(joinString), "inner").filter(finalFilterString).select(s"$df1Alias.*")
Dataframes in Spark are lazily evaluated, ie. not executed until an action like .show, .collect is executed or they're used in SQL DDL operation. This also means that if you refer to it once more, it will get reevaluated again.
Assuming there's no other background activity that can mess up, apparently your function getEmployeeData, depends on employee table. It gets executed both before and after the merge and might yield different result.
To prevent it you can checkpoint the dataframe:
myView.checkpoint()
or explicitly materialize it:
myView.write.saveAsTable("myViewMaterialized")
and later refer to the materialized version.
I agree with the points that #Kombajn zbozowy said related to Dataframes in spark are lazily evaluated and will be reevaluated again if you call action on it once more.
I would like to point out that, it is normal expected behavior and might not have to do anything with the merge operation.
For example, the dataframe that you get as an output of the join only contains inserts and you perform those inserts on the target table using dataframe write api, and then if you run the df.show() command again it would show you output as empty because when it is reevaluating the content of the dataframe by performing join it won't get any difference and will not output any records...
Same holds true for merge operation as it also updates the target table with insert/updates and when you rerun it won't show any output rows.

SCALA: How to use collect function to get the latest modified entry from a dataframe?

I have a scala dataframe with two columns:
id: String
updated: Timestamp
From this dataframe I just want to get out the latest date, for which I use the following code at the moment:
df.agg(max("updated")).head()
// returns a row
I've just read about the collect() function, which I'm told to be
safer to use for such a problem - when it runs as a job, it appears it is not aggregating the max on the whole dataset, it looks perfectly fine when it is running in a notebook -, but I don't understand how it should
be used.
I found an implementation like the following, but I could not figure how it should be used...
df1.agg({"x": "max"}).collect()[0]
I tried it like the following:
df.agg(max("updated")).collect()(0)
Without (0) it returns an Array, which actually looks good. So idea is, we should apply the aggregation on the whole dataset loaded in the drive, not just the partitioned version, otherwise it seems to not retrieve all the timestamps. My question now is, how is collect() actually supposed to work in such a situation?
Thanks a lot in advance!
I'm assuming that you are talking about a spark dataframe (not scala).
If you just want the latest date (only that column) you can do:
df.select(max("updated"))
You can see what's inside the dataframe with df.show(). Since df are immutable you need to assign the result of the select to another variable or add the show after the select().
This will return a dataframe with just one row with the max value in "updated" column.
To answer to your question:
So idea is, we should apply the aggregation on the whole dataset loaded in the drive, not just the partitioned version, otherwise it seems to not retrieve all the timestamp
When you select on a dataframe, spark will select data from the whole dataset, there is not a partitioned version and a driver version. Spark will shard your data across your cluster and all the operations that you define will be done on the entire dataset.
My question now is, how is collect() actually supposed to work in such a situation?
The collect operation is converting from a spark dataframe into an array (which is not distributed) and the array will be in the driver node, bear in mind that if your dataframe size exceed the memory available in the driver you will have an outOfMemoryError.
In this case if you do:
df.select(max("Timestamp")).collect().head
You DF (that contains only one row with one column which is your date), will be converted to a scala array. In this case is safe because the select(max()) will return just one row.
Take some time to read more about spark dataframe/rdd and the difference between transformation and action.
It sounds weird. First of all you donĀ“t need to collect the dataframe to get the last element of a sorted dataframe. There are many answers to this topics:
How to get the last row from DataFrame?

Spark DataFrame row count is inconsistent between runs

When I am running my spark job (version 2.1.1) on EMR, each run counts a different amount of rows on a dataframe. I first read data from s3 to 4 different dataframes, these counts are always consistent an then after joining the dataframes, the result of the join have different counts. afterwards I also filter the result and that also has a different count on each run. The variations are small, 1-5 rows difference but it's still something I would like to understand.
This is the code for the join:
val impJoinKey = Seq("iid", "globalVisitorKey", "date")
val impressionsJoined: DataFrame = impressionDsNoDuplicates
.join(realUrlDSwithDatenoDuplicates, impJoinKey, "outer")
.join(impressionParamterDSwithDateNoDuplicates, impJoinKey, "left")
.join(chartSiteInstance, impJoinKey, "left")
.withColumn("timestamp", coalesce($"timestampImp", $"timestampReal", $"timestampParam"))
.withColumn("url", coalesce($"realUrl", $"url"))
and this is for the filter:
val impressionsJoined: Dataset[ImpressionJoined] = impressionsJoinedFullDay.where($"timestamp".geq(new Timestamp(start.getMillis))).cache()
I have also tried using filter method instead of where, but with same results
Any thought?
Thanks
Nir
is it possible that one of the data sources changes over over time?
since impressionsJoined is not cached, spark will reevaluate it from scratch on every action, and that includes reading the data again from the source.
try caching impressionsJoined after the join.

Getting the value of a DataFrame column in Spark

I am trying to retrieve the value of a DataFrame column and store it in a variable. I tried this :
val name=df.select("name")
val name1=name.collect()
But none of the above is returning the value of column "name".
Spark version :2.2.0
Scala version :2.11.11
There are couple of things here. If you want see all the data collect is the way to go. However in case your data is too huge it will cause drive to fail.
So the alternate is to check few items from the dataframe. What I generally do is
df.limit(10).select("name").as[String].collect()
This will provide output of 10 element. But now the output doesn't look good
So, 2nd alternative is
df.select("name").show(10)
This will print first 10 element, Sometime if the column values are big it generally put "..." instead of actual value which is annoying.
Hence there is third option
df.select("name").take(10).foreach(println)
Takes 10 element and print them.
Now in all the cases you won't get a fair sample of the data, as the first 10 data will be picked. So to truely pickup randomly from the dataframe you can use
df.select("name").sample(.2, true).show(10)
or
df.select("name").sample(.2, true).take(10).foreach(println)
You can check the "sample" function on dataframe
The first will do :)
val name = df.select("name") will return another DataFrame. You can do for example name.show() to show content of the DataFrame. You can also do collect or collectAsMap to materialize results on driver, but be aware, that data amount should not be too big for driver
You can also do:
val names = df.select("name").as[String].collect()
This will return array of names in this DataFrame

Using groupBy in Spark and getting back to a DataFrame

I have a difficulty when working with data frames in spark with Scala. If I have a data frame that I want to extract a column of unique entries, when I use groupBy I don't get a data frame back.
For example, I have a DataFrame called logs that has the following form:
machine_id | event | other_stuff
34131231 | thing | stuff
83423984 | notathing | notstuff
34131231 | thing | morestuff
and I would like the unique machine ids where event is thing stored in a new DataFrame to allow me to do some filtering of some kind. Using
val machineId = logs
.where($"event" === "thing")
.select("machine_id")
.groupBy("machine_id")
I get a val of Grouped Data back which is a pain in the butt to use (or I don't know how to use this kind of object properly). Having got this list of unique machine id's, I then want to use this in filtering another DataFrame to extract all events for individual machine ids.
I can see I'll want to do this kind of thing fairly regularly and the basic workflow is:
Extract unique id's from a log table.
Use unique ids to extract all events for a particular id.
Use some kind of analysis on this data that has been extracted.
It's the first two steps I would appreciate some guidance with here.
I appreciate this example is kind of contrived but hopefully it explains what my issue is. It may be I don't know enough about GroupedData objects or (as I'm hoping) I'm missing something in data frames that makes this easy. I'm using spark 1.5 built on Scala 2.10.4.
Thanks
Just use distinct not groupBy:
val machineId = logs.where($"event"==="thing").select("machine_id").distinct
Which will be equivalent to SQL:
SELECT DISTINCT machine_id FROM logs WHERE event = 'thing'
GroupedData is not intended to be used directly. It provides a number of methods, where agg is the most general, which can be used to apply different aggregate functions and convert it back to DataFrame. In terms of SQL what you have after where and groupBy is equivalent to something like this
SELECT machine_id, ... FROM logs WHERE event = 'thing' GROUP BY machine_id
where ... has to be provided by agg or equivalent method.
A group by in spark followed by aggregation and then a select statement will return a data frame. For your example it should be something like:
val machineId = logs
.groupBy("machine_id", "event")
.agg(max("other_stuff") )
.select($"machine_id").where($"event" === "thing")