PySpark - grouped data with count() and sorting possible? - pyspark

I have a DataFrame with location and gender as string values, and I want to look at the top 20 locations with male and female count splits, in descending order. This is the code I have so far, but it's not sorted in descending order. How can I do that?
display(Markdown("**Top 20 locations** with highest active users split by sex ratio (in \%):"))
pivotDF = datingDF.groupBy("location").pivot("sex").count()
pivotDF.show(truncate=False)
+-------------------------+----+----+
|location                 |f   |m   |
+-------------------------+----+----+
|mill valley, california  |176 |139 |
|london, united kingdom   |null|1   |
|west oakland, california |3   |4   |
|freedom, california      |1   |null|
|columbus, ohio           |null|1   |
|rochester, michigan      |1   |null|
|mountain view, california|106 |278 |
|magalia, california      |null|1   |
|san rafael, california   |340 |415 |
|nicasio, california      |1   |2   |
|santa cruz, california   |null|5   |
|moss beach, california   |3   |5   |
|muir beach, california   |null|1   |
|larkspur, california     |35  |45  |
|san quentin, california  |1   |1   |
|kentfield, california    |7   |11  |
|montara, california      |9   |3   |
|brooklyn, new york       |1   |2   |
|utica, michigan          |null|1   |
|burlingame, california   |154 |207 |
+-------------------------+----+----+

You can use orderBy:
orderBy(*cols, **kwargs)
Returns a new DataFrame sorted by the specified column(s).
Parameters:
cols – list of Column or column names to sort by.
ascending – boolean or list of boolean (default True). Sort ascending vs. descending. Specify list for multiple sort orders. If a list is specified, length of the list must equal length of the cols.
datingDF.groupBy("location").pivot("sex").count().orderBy("F","M",ascending=False)
In case you want one ascending and the other descending, you can do something like this:
datingDF.groupBy("location").pivot("sex").count().orderBy("F","M",ascending=[1,0])
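If you specifically want the top 20 locations, you can chain limit after the sort. A minimal sketch, assuming the pivoted columns come out as the lowercase f and m shown in the output above:

# Sort by both sex counts in descending order, then keep only the first 20 rows.
topDF = (datingDF.groupBy("location")
                 .pivot("sex")
                 .count()
                 .orderBy("f", "m", ascending=False)
                 .limit(20))
topDF.show(20, truncate=False)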

I didn't get exactly how you want to sort: by the sum of the f and m columns, or by multiple columns.
For the sum (this assumes the usual import, from pyspark.sql import functions as F):
pivotDF = pivotDF.orderBy((F.col('f') + F.col('m')).desc())
For multiple columns:
pivotDF = pivotDF.orderBy(F.col('f').desc(), F.col('m').desc())
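One caveat with the sum approach: the pivot leaves null where a location has no rows for one sex, and null + anything is null in Spark SQL, so those locations fall to the bottom of a descending sort. A minimal sketch that treats missing counts as zero and keeps the top 20 (the total column name is just an example):

# Replace missing counts with 0, sort by the combined count, keep the top 20.
from pyspark.sql import functions as F

pivotDF = (pivotDF
           .withColumn("total", F.coalesce(F.col("f"), F.lit(0)) + F.coalesce(F.col("m"), F.lit(0)))
           .orderBy(F.col("total").desc())
           .limit(20))
pivotDF.show(truncate=False)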

Related

Select max common Date from differents DataFrames (Scala Spark)

I have different DataFrames and I want to select the max common Date among them. For example, I have the following DataFrames:
+--------------+-------+
|Date          | value |
+--------------+-------+
|2015-12-14    |5      |
|2017-11-19    |1      |
|2016-09-02    |1      |
|2015-12-14    |3      |
|2015-12-14    |1      |
+--------------+-------+

+--------------+-------+
|Date          | value |
+--------------+-------+
|2015-12-14    |5      |
|2017-11-19    |1      |
|2016-09-02    |1      |
|2015-12-14    |3      |
|2015-12-14    |1      |
+--------------+-------+

+--------------+-------+
|Date          | value |
+--------------+-------+
|2015-12-14    |5      |
|2012-12-21    |1      |
|2016-09-02    |1      |
|2015-12-14    |3      |
|2015-12-14    |1      |
+--------------+-------+
The selected date would be 2016-09-02 because it is the max date that exists in all 3 DataFrames (the date 2017-11-19 is not in the third DF).
I am trying to do it with agg(max), but that way I just get the highest date of a single DataFrame:
df1.select("Date").groupBy("Date").agg(max("Date"))
Thanks in advance!
You can do semi joins to get the common dates, and aggregate the maximum date. No need to group by date because you want to get its maximum.
val result = df1.join(df2, Seq("Date"), "left_semi").join(df3, Seq("Date"), "left_semi").agg(max("Date"))
You can also use intersect:
val result = df1.select("Date").intersect(df2.select("Date")).intersect(df3.select("Date")).agg(max("Date"))
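The same idea carries over to PySpark if that is what you are working in. A minimal sketch, assuming three DataFrames df1, df2 and df3 that each have a Date column:

# Keep only the dates present in all three DataFrames, then take the maximum.
from pyspark.sql import functions as F

common_max = (df1.join(df2, ["Date"], "left_semi")
                 .join(df3, ["Date"], "left_semi")
                 .agg(F.max("Date").alias("max_common_date")))
common_max.show()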

SQL Select Unique Values Each Column

I'm looking to select unique values from each column of a table and output the results into a single table. Take the following example table:
+------+---------------+------+---------------+
|col1  |col2           |col_3 |col_4          |
+------+---------------+------+---------------+
|1     |"apples"       |A     |"red"          |
|2     |"bananas"      |A     |"red"          |
|3     |"apples"       |B     |"blue"         |
+------+---------------+------+---------------+
the ideal output would be:
+------+---------------+------+---------------+
|col1  |col2           |col_3 |col_4          |
+------+---------------+------+---------------+
|1     |"apples"       |A     |"red"          |
|2     |"bananas"      |B     |"blue"         |
|3     |               |      |               |
+------+---------------+------+---------------+
Thank you!
Edit: My actual table has many more columns, so ideally the SQL query can be done via a SELECT * as opposed to 4 individual select queries within the FROM statement.
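One way to get that shape is to rank the distinct values of each column independently and then join the ranked columns back together on the rank; in SQL this can be expressed with ROW_NUMBER() and FULL JOINs. Below is a minimal sketch of the same idea in PySpark (this page's main topic); all names in it are illustrative.

# Rank the distinct values of each column, then align the columns by rank.
from functools import reduce
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "apples", "A", "red"),
     (2, "bananas", "A", "red"),
     (3, "apples", "B", "blue")],
    ["col1", "col2", "col_3", "col_4"])

def ranked_distinct(frame, col):
    # Number each column's distinct values 1, 2, 3, ...
    return (frame.select(col).distinct()
                 .withColumn("rn", F.row_number().over(Window.orderBy(col))))

parts = [ranked_distinct(df, c) for c in df.columns]
result = (reduce(lambda left, right: left.join(right, "rn", "full_outer"), parts)
          .orderBy("rn")
          .drop("rn"))
result.show()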

Query over 3 tables and different columns

I would like to do a fairly simple query, but I can't figure out how to join the tables together. I am new to the world of SQL, and after reading the documentation for the JOIN and SELECT clauses, I still can't figure this one out.
Here are my 3 tables:
Seller
|SELLER_ID|NUMBER|FIRST_NAME|LAST_NAME|TEAM_NR|
|        1|105   |John      |Smith    |1      |
|        2|106   |James     |Brown    |3      |
|        3|107   |Jane      |Doe      |3      |
|        4|108   |Nicole    |Sanchez  |2      |
Service
|SERVICE_ID|CODE|NAME     |PRICE|SELLER_ID|CLIENT_ID|
|         1| 502|BLAHBLAH |200  |2        |2        |
|         2| 503|BLAHBLAH2|175  |1        |3        |
|         3| 504|BLAHBLAH3|250  |3        |2        |
|         4| 505|BLAHBLAH4|130  |2        |4        |
Client
|CLIENT_ID|NUMBER |FIRST_NAME | LAST_NAME |
|        1|51     |JOHN       | ADAMS     |
|        2|52     |MARY       | BRYANT    |
|        3|53     |FRANCIS    | JOHNSON   |
|        4|55     |BEN        | CASTLE    |
The goal of this query would be to figure out which team (TEAM_NR from Seller) sold the most services in a month, based on the total amount sold (sum of PRICE from Service).
The result should display FIRST_NAME, LAST_NAME and TEAM_NR of everyone in the "winning" team.
I already looked for help on Stack Overflow and Google and tried adapting what I found to my tables, but it didn't pan out.
Thank You!
SELECT S.FIRST_NAME, S.LAST_NAME, S.TEAM_NR, sum(R.PRICE) Winning
FROM Seller S
LEFT JOIN Service R ON (S.SELLER_ID=R.SELLER_ID)
GROUP BY S.TEAM_NR, S.FIRST_NAME, S.LAST_NAME
EDIT: You don't even need a join on the Client table.
EDIT 2: All non-aggregated fields in the SELECT have to be in the GROUP BY.

OrientDB spatial query to find all pairs within X km of each other

I'm testing out the OrientDB spatial module. I've put together a simple dataset with coordinates for a few of the geoglyphs in the Nazca Lines (nazca_lines.csv):
Name,Latitude,Longitude
Hummingbird,-14.692131,-75.148892
Monkey,-14.706940,-75.138532
Condor,-14.697444,-75.126208
Spider,-14.694145,-75.122381
Spiral,-14.688277,-75.122746
Hands,-14.694459,-75.113881
Tree,-14.693898,-75.114520
Astronaut,-14.745222,-75.079755
Dog,-14.706401,-75.130788
Wing,-14.680309,-75.100385
Parrot,-14.689463,-75.107498
I create a spatial index using:
CREATE INDEX GeoGlyph.index.Location
ON GeoGlyph(Latitude,Longitude) SPATIAL ENGINE LUCENE
I can generate a list of nodes that are within, say, 2 km of a specific geoglyph using a query like the one I generated in this Stack Overflow question:
SELECT $temp.Name AS SourceName, Name AS TargetName, $distance.format("%.4f") AS Distance
FROM GeoGlyph
LET $temp = first((SELECT * FROM GeoGlyph WHERE Name = "Tree"))
WHERE [Latitude,Longitude,$spatial]
NEAR [$temp.Latitude, $temp.Longitude,{"maxDistance":2}]
ORDER BY Distance
which gives me this result:
+----+----------+----------+--------+
|#   |SourceName|TargetName|Distance|
+----+----------+----------+--------+
|0   |Tree      |Tree      |0.0000  |
|1   |Tree      |Hands     |0.0884  |
|2   |Tree      |Spider    |0.9831  |
|3   |Tree      |Spiral    |1.0883  |
|4   |Tree      |Condor    |1.5735  |
+----+----------+----------+--------+
This is nice, but I can only find the nodes relative to a specific node. I would like to expand this to ask for all pairs of nodes that are within 2km of each other.
A result that I'm interested in would look something like this:
+----+-----------+-----------+--------+
|#   |SourceName |TargetName |Distance|
+----+-----------+-----------+--------+
|1   |Hummingbird|Monkey     |1.6314  |
|2   |Monkey     |Dog        |1.8035  |
|3   |Dog        |Condor     |0.9349  |
|4   |Dog        |Spider     |1.5487  |
|5   |Condor     |Spider     |0.6772  |
|6   |Condor     |Spiral     |1.2685  |
|7   |Condor     |Tree       |1.5735  |
|8   |Condor     |Hands      |1.6150  |
|9   |Spider     |Spiral     |0.6797  |
...
Any ideas?
You should use the new spatial module feature with the OPoint type and the ST_Distance_Sphere function:
http://orientdb.com/docs/2.2/Spatial-Index.html#stdistancesphere-from-orientdb-224

SSRS - Sum of Lookup Values within a group

I'm attempting to create a report which will pull data from multiple inventory databases, and based on the item counts in the various systems provide a Cost Estimate from another data source. The report has two levels of grouping, and I need to sum the cost values within each group only (not a total for all groups).
As an illustration, my result set looks like this:
MODEL |COMPONENT |Sys1 |Sys2 |Sys3 |Sys1ID |Match |Cost
=======================================================
Car   |          |     |     |     |       |      |620 <--- Sum of component costs
      |Wheel     |4    |8    |10   |       |Sys1  |40
      |wheel1
      |wheel2
      |wheel3
      |Brakes    |0    |9    |11   |       |Sys2  |80
      |Horn      |0    |0    |50   |       |Sys3  |500
-------------------------------------------------------
Truck |          |     |     |     |       |      |980 <--- Sum of component costs
      |Wheel     |0    |0    |10   |       |Sys3  |400
      |Brakes    |0    |9    |11   |       |Sys2  |80
      |Horn      |0    |0    |50   |       |Sys3  |500
The table shows the unique ID for any matching items in Sys1, and the Sys1 column contains a COUNT() aggregate for these items. The Sys2 and Sys3 columns return material counts based on a lookup of Model+Component in two other datasources.
The Match column indicates which inventory to service from, in order of priority, based on whether assets exist (Sys1, then Sys2, then Sys3). Finally, the Cost column is populated based on a lookup into a fourth dataset, which is formatted as:
MODEL |COMPONENT |System |Cost
==============================
Car   |Wheel     |Sys1   |40
Car   |Wheel     |Sys2   |50
Car   |Wheel     |Sys3   |300
Car   |Brakes    |Sys1   |60
Car   |Brakes    |Sys2   |80
Car   |Brakes    |Sys3   |900
...
Everything is set up and working, apart from the Sum of Costs within the group. Does anyone know how to resolve this?
For reference, my original attempt to sum these used this expression:
=SUM(ReportItems!Estimate_Cost.Value)
which yields the following error message:
The Value expression for the textrun 'Textbox39.Paragraphs[0].TextRuns[0]'
uses an aggregate function on a report item. Aggregate functions can be used
only on report items contained in page headers and footers.