Multiple aggregations on the same column using agg in PySpark

I am not able to get multiple metrics using agg, as shown below.

table.select("date_time")\
    .withColumn("date", to_timestamp("date_time"))\
    .agg({'date_time': 'max', 'date_time': 'min'}).show()

The second aggregation overwrites the first one. Can someone help me get multiple aggregations on the same column?

I can't replicate this to make sure it works, but note that a Python dict cannot hold two entries with the same key, so {'date_time': 'max', 'date_time': 'min'} collapses to just the min aggregation. Instead of using a dict for your aggregations, pass column expressions like this:
from pyspark.sql.functions import to_timestamp, min, max

table.select("date_time")\
    .withColumn("date", to_timestamp("date_time"))\
    .agg(min('date_time'), max('date_time')).show()
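If the output column names matter, the same call can carry explicit aliases. A minimal sketch, assuming table is a DataFrame with a string date_time column (min/max are aliased on import here only to avoid shadowing Python's builtins):

from pyspark.sql.functions import to_timestamp, min as min_, max as max_

(table
    .withColumn("date", to_timestamp("date_time"))
    # One output row: the earliest and latest date_time, with readable column names.
    .agg(min_("date_time").alias("min_date_time"),
         max_("date_time").alias("max_date_time"))
    .show())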

Related

How to place a variable in a PySpark groupBy agg query

Hi, I have a query in which I want to place variable data into the groupBy query. I tried it like this, but it is not working:

dd2=(dd1.groupBy("hours").agg({'%s':'%s'})%(columnname1,input1))

columnname1 contains 'total' and input1 contains the kind of aggregation required, such as mean or stddev. I want this query to be dynamic.
Try this:

dd2 = dd1.groupBy("hours").agg({'{}'.format(columnname1): '{}'.format(input1)})
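As a side note not found in the original answer, the .format() calls are not strictly needed, since dict keys and values can be variables directly. An equivalent sketch, assuming columnname1 and input1 hold plain strings such as 'total' and 'mean':

# Equivalent: pass the variables straight in as the dict key and value.
dd2 = dd1.groupBy("hours").agg({columnname1: input1})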

Can we use a find query inside map in MongoDB?

I need to perform some aggregation on an existing table and then use the aggregated table to perform the map-reduce. The aggregated table is a sort of temporary table, used only so that it can feed the map-reduce; the record set in this temporary table reaches around 8M documents.
What would be a way to avoid the temporary table? One option could be to write a find() query inside the map() function and emit the aggregated result (initially stored in the aggregation table), but I have not been able to implement this.
Is there a way? Please help.
You can use the "query" parameter of MongoDB MapReduce. With this parameter, the data sent to the map function is filtered before processing.
More info is in the MapReduce documentation.
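As an illustration only (not from the original answer), here is a minimal PyMongo sketch of that query parameter via the mapReduce database command. The database, collection, and field names (mydb, events, status, category, amount) are hypothetical, and mapReduce is deprecated in recent MongoDB releases:

from bson.code import Code
from pymongo import MongoClient

db = MongoClient().mydb  # hypothetical connection and database name

map_js = Code("function () { emit(this.category, this.amount); }")
reduce_js = Code("function (key, values) { return Array.sum(values); }")

# The "query" filter is applied before map(), so only matching documents
# are processed; no temporary collection is needed.
result = db.command(
    "mapReduce",
    "events",                    # source collection (hypothetical)
    map=map_js,
    reduce=reduce_js,
    query={"status": "active"},  # pre-filter the input documents
    out={"inline": 1},           # return results inline instead of writing a collection
)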

Hadoop - producing multi-column output (MongoDB)

I am using Hadoop to apply map-reduce to my MongoDB database. I am able to execute the sample in this link.
Right now I can only get a key, value pair in the output collection after the map-reduce job has executed. I wonder whether it is possible to save multiple columns in a map-reduce output collection, or an embedded document in the value column?
Thanks.
Yes - use BSONWritable as your reducer output class, and create a BSONWritable object with as many columns as you need.
See example here:
https://github.com/mongodb/mongo-hadoop/blob/master/examples/treasury_yield/src/main/java/com/mongodb/hadoop/examples/treasury/TreasuryYieldReducer.java

pymongo sort grouped results

I need to group and sort by date_published some documents stored in MongoDB using PyMongo. The group part went just fine :) but when I add .sort() to the query it keeps failing, no matter what I try :(
Here is my query:

db.activities.group(keyf_code, cond, {}, reduce_code)

I want to sort by a field called "published" (a timestamp). I tried

db.activities.group(keyf_code, cond, {}, reduce_code).sort({"published": -1})

and many more variations, without any success. Any ideas?
You can't currently sort with group in MongoDB. You can use MapReduce instead, which does support a sort option; there is also an enhancement request to support group with sort here.
Although MongoDB doesn't do what you want, you can always use Python to do the sorting:
from operator import itemgetter

result = db.activities.group(keyf_code, cond, {}, reduce_code)
result = sorted(result, key=itemgetter("published"), reverse=True)

MongoDB: How to execute a query on the result of another query (nested queries)?

I need to apply a set of filters (queries) to a collection. By default, MongoDB applies an AND operator to all queries submitted to the find function. Instead of one combined AND, I need to apply each query sequentially (one by one): run the first query and get a set of documents, run the second query on the result of the first query, and so on.
Is this possible?

db.list.find({..q1..}).find({..q2..}).find({..q3..});

instead of:

db.list.find({..q1..}, {..q2..}, {..q3..});

Why do I need this? Because the second query needs to apply an aggregate function to the result of the first query, instead of applying the aggregate to the whole collection.
Yes, this is possible in MongoDB; you can write nested queries as the requirement demands, and I have built nested MongoDB queries in my own application. If you are familiar with SQL syntax, compare this with SQL's IN:

select cname from table where cid in (select .....)

In the same way, you can create nested MongoDB queries across different collections as well.
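The original answer gives no code; as an illustration only, one way to express this "query on the result of a query" pattern in PyMongo is to run the first find, collect the matching _ids, and feed them to the second query with $in before aggregating. The database, collection, and field names below (mydb, list, status, priority, score) are hypothetical:

from pymongo import MongoClient

db = MongoClient().mydb  # hypothetical database name

# First query: select the candidate documents and keep only their _ids.
first_ids = [doc["_id"] for doc in db.list.find({"status": "open"}, {"_id": 1})]

# Second query: restrict to the first result set, then aggregate over it only.
pipeline = [
    {"$match": {"_id": {"$in": first_ids}, "priority": {"$gte": 3}}},
    {"$group": {"_id": None, "avg_score": {"$avg": "$score"}}},
]
result = list(db.list.aggregate(pipeline))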