I have a PySpark DataFrame with 3 columns: 'client', 'product', 'date'. I want to run a groupBy operation:
df.groupBy("product", "date").agg(F.countDistinct("client"))
So I want to count the number of clients that bought a product on each day. This is causing heavily skewed data (in fact, the job fails with an out-of-memory error). I have been learning about salting techniques. As I understand it, salting can be used with 'sum' or 'count' by adding a new column to the groupBy and performing a second aggregation, but I do not see how to apply it here because of the countDistinct aggregation method.
How can I apply it in this case?
I would recommend not using countDistinct at all here and instead achieving what you want with two aggregations in a row, especially since your data is skewed. It can look like the following:
import pyspark.sql.functions as F

new_df = (df
    .groupBy("product", "date", "client")
    .agg({})  # getting unique ("product", "date", "client") tuples
    .groupBy("product", "date")
    .agg(F.count('*').alias('clients'))
)
The first aggregation ensures that you have a DataFrame with one row per distinct ("product", "date", "client") tuple; the second counts the number of clients for each ("product", "date") pair. This way you don't need to worry about skew anymore, since Spark can perform partial (map-side) aggregations for you (as opposed to countDistinct, which is forced to send every individual "client" value for each ("product", "date") pair to a single node).
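If the empty agg({}) looks cryptic, here is a minimal equivalent sketch (assuming the same df with 'client', 'product', 'date' columns) that spells out the deduplication step explicitly:
import pyspark.sql.functions as F

# Keep one row per distinct (product, date, client) combination, then count
# rows per (product, date); this equals countDistinct("client") per group.
clients_per_product_day = (
    df.dropDuplicates(["product", "date", "client"])
      .groupBy("product", "date")
      .agg(F.count("*").alias("clients"))
)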
I'm trying to filter a dataset by order status. This is my code:
df1 = all_in_all_df.groupBy("productName") \
    .agg(F.max('orderItemSubTotal')) \
    .filter(col("orderStatus") == "CLOSED") \
    .show()
But when I run the code, I get the following error:
AnalysisException: cannot resolve 'orderStatus' given input columns: [max(orderItemSubTotal), productName];
'Filter ('orderStatus = CLOSED)
Removing the .filter() lets the query display a result, but I need to filter the data.
The aggregation restricts the resulting columns to the ones used for the grouping (in the group by clause) and the results of the aggregation.
Thus, there is no orderStatus column anymore.
If you want to be able to filter on it, either do it before the aggregation (but then only the filtered rows are taken into account for the aggregation) or add it to the group by clause (the aggregation is then made per status, not globally, but in this second case all statuses, with their related aggregations, remain available), as sketched below.
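A minimal sketch of both options, assuming the same all_in_all_df and column names from the question:
from pyspark.sql import functions as F

# Option 1: filter first, then aggregate; only CLOSED orders are counted
closed_max = (
    all_in_all_df
    .filter(F.col("orderStatus") == "CLOSED")
    .groupBy("productName")
    .agg(F.max("orderItemSubTotal").alias("maxSubTotal"))
)

# Option 2: keep orderStatus as part of the grouping key and
# pick out the CLOSED rows from the aggregated result
max_by_status = (
    all_in_all_df
    .groupBy("productName", "orderStatus")
    .agg(F.max("orderItemSubTotal").alias("maxSubTotal"))
    .filter(F.col("orderStatus") == "CLOSED")
)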
Neo4j documentation mentions that:
4.3. Aggregating functions
To calculate aggregated data, Cypher offers aggregation, analogous to SQL’s GROUP BY.
Aggregating functions take a set of values and calculate an aggregated
value over them. Examples are avg() that calculates the average of
multiple numeric values, or min() that finds the smallest numeric or
string value in a set of values. When we say below that an aggregating
function operates on a set of values, we mean these to be the result
of the application of the inner expression (such as n.age) to all the
records within the same aggregation group.
Aggregation can be computed over all the matching subgraphs, or it can
be further divided by introducing grouping keys. These are
non-aggregate expressions, that are used to group the values going
into the aggregate functions.
Assume we have the following return statement:
RETURN n, count(*)
We have two return expressions: n, and count(*). The first, n, is not an aggregate function, and so it will be the grouping key. The
latter, count(*), is an aggregate expression. The matching subgraphs
will be divided into different buckets, depending on the grouping
key. The aggregate function will then be run on these buckets,
calculating an aggregate value per bucket.
I cannot figure out how to:
RETURN n, m, COLLECT(n);
for example, and only use n as the grouping key, not both n and m.
This is not possible in Cypher, as it does an implicit group by, as you have learned from the documentation. It is quite similar to SQL, except that there you have to add the GROUP BY clause explicitly.
What you can do is use subqueries, or split the query into two parts, where you first aggregate the data and then iterate over each node again in the second part.
I have the following dataframe data:
root
|-- userId: string
|-- product: string
|-- rating: double
and the following query:
val result = sqlContext.sql("select userId, collect_list(product), collect_list(rating) from data group by userId")
My question is: do product and rating in the aggregated arrays match each other? That is, do the product and the rating from the same row end up at the same index in their respective aggregated arrays?
Update:
Starting from Spark 2.0.0, collect_list can be used on a struct type, so we can do a single collect_list on a combined column. For versions before 2.0.0, collect_list can only be used on primitive types.
I believe there is no explicit guarantee that all the arrays will have the same order. Spark SQL uses multiple optimizations, and under certain conditions there is no guarantee that all aggregations are scheduled at the same time (one example is aggregation with DISTINCT). Since an exchange (shuffle) results in a nondeterministic order, it is theoretically possible that the orders will differ.
So while it should work in practice, it could be risky and introduce some hard-to-detect bugs.
If you use Spark 2.0.0 or later, you can aggregate non-atomic columns with collect_list:
SELECT userId, collect_list(struct(product, rating)) FROM data GROUP BY userId
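For reference, roughly the same aggregation through the DataFrame API, as a PySpark sketch (assuming data is available as a DataFrame with those columns):
from pyspark.sql import functions as F

# Each array element is a (product, rating) struct taken from the same row,
# so the pairing is preserved by construction
result = (
    data.groupBy("userId")
        .agg(F.collect_list(F.struct("product", "rating")).alias("items"))
)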
If you use an earlier version, you can try using explicit partitioning and ordering:
WITH tmp AS (
SELECT * FROM data DISTRIBUTE BY userId SORT BY userId, product, rating
)
SELECT userId, collect_list(product), collect_list(rating)
FROM tmp
GROUP BY userId
How to sort Array[Row] by given column index in Scala?
I'm using RDD[Row].collect(), which gives me an Array[Row], but I want to sort it based on a given column index.
I have already implemented quick-sort logic and it works, but it involves too many for loops.
I would like to use a Scala built-in API which can do this task with the minimum amount of code.
It would be much more efficient to sort the DataFrame before collecting it: once you collect it, you lose the distributed (and parallel) computation. You can use the DataFrame's sort, for example, to sort in ascending order by column "col1":
val sorted = dataframe.sort(asc("col1"))
I am running a Spark job, and at some point I want to connect to an Elasticsearch server to fetch some data and add it to an RDD. The code I am using looks like this:
input.mapPartitions(records => {
  val elcon = new ElasticSearchConnection
  val client: TransportClient = elcon.openConnection()
  val newRecs = records.flatMap(record => {
    val response = client.prepareGet("index", "indexType", record.id.toString).execute().actionGet()
    val newRec = processRec(record, response)
    newRec
  }) // end of flatMap
  client.close()
  newRecs
}) // end of mapPartitions
My problem is that the client.close() call is executed before the flatMap operation has finished, which of course results in an exception. The code works if I move the creation and closing of the connection inside the flatMap, but that would create a huge number of connections. Is it possible to make sure that client.close() is called after the flatMap operation has finished?
Making a blocking call for each item in your RDD to fetch the corresponding Elasticsearch document is causing the problem. It is generally advised to avoid blocking calls.
There is an alternative approach using Elasticsearch for Hadoop's Spark support:
Read the Elasticsearch index/type as another RDD and join it with your RDD.
Include the right version of the ES-Hadoop dependency.
import org.elasticsearch.spark._
val esRdd = sc.esRDD("index/indexType") // returns a pair RDD of (_id, Map of all key-value pairs for all fields)
val inputPairRdd = input.map(record => (record.id.toString, record)) // key your records by id, as we want to join on the id
inputPairRdd.join(esRdd).map(rec => processResponse(rec._2._1, rec._2._2)) // join on id: the joined value is (inputRecord, esRecord)
Hope this helps.
PS: This will still not alleviate the lack of co-location (i.e. each document with a given _id may come from a different shard of the index). A better approach would have been to achieve co-location of the input RDD and the ES index's documents at the time the ES index was created.