Creating a pivot table in PySpark

Creating a pivot table in PySpark - pyspark

New to PySpark and would like to make a table that counts the unique pairs of values from two columns and shows the average of another column over all rows with those pairs of values. My code so far is:
df1 = df.withColumn('trip_rate', df.total_amount / df.trip_distance)
df1.groupBy('PULocationID', 'DOLocationID').count().orderBy('count', ascending=False).show()
I want to add the average of the trip rate for each unique pair as a column. Can you help me please?

Related

How do I Map and Remap string values to Int or double in scala

I have a data file of some columns. I am performing some mathematical computations on the values for that purpose I want to map my non Integer value columns to Int and then after the operations on the values I want to remap them.
Following are my columns values
atom_id,molecule_id,element,type,charge
d100_1,d100,c,22,-0.128
d100_10,d100,h,3,0.132
d100_11,d100,c,29,0.002
d100_12,d100,c,22,-0.128
d100_13,d100,c,22,-0.128
Suppose I want to map only 2 columns and then remap those columns values only. I have searched for methods and found String Indexer but it maps all of the columns of the DF, I need to map only specific columns and then remap the values of those specific columns. Any help will be appreciated.
//edited Part
I have the following columns in my DataFrame
ind1,inda,logp,lumo,mutagenic,element
1,0,4.23,-1.246,yes,c
1,0,4.62,-1.387,yes,b
0,0,2.68,-1.034,no,h
1,0,6.26,-1.598,yes,c
1,0,2.4,-3.172,yes,a
Basically I am writing the code for synthetic Data Generation based on the given input data, so I want to use column values i.e ind1,inda,logp,lumo,mutagenic,element. single row at a time and after applying some math functions on it I will get a row which will consist of 6 values and each value will be representing the corresponding column value.
Now the problem is that all column values are of type double except mutagenic and element. I want to map this mutagenic and element columns to double values for example yes to 0 and No to 1 so that I can use them and then when I will receive the output row then I will reverse map that generated mutagenic value back to the corresponding string value using that mapping function.
Hope so I am clear this time

Indexing into MAtlab table based on max value

I have a table T2 with 5 columns and 5 rows. The table columns are 'FirstName','Height','Shoesize','Gender' and 'Profession'.
I have to create a second table containing the 'FirstName','Height' and 'Profession' of the person with the maximum 'Shoesize'.
So far, I found the index of the maximum 'Shoesize'.
[m,i] = max(T2{:,3})
However I am struggling to index into the table to abstract the relevant values. Any help is highly appreciated!

efficient aggregation (sum) on a single column Data Frame in spark scala

I have a Spark Data Frame with a single column and large number of rows (in billions). I am trying to calculate the sum of the values in each row using the code shown below. However, it is very slow. Is there an efficient way to calculate the sum?
val df = sc.parallelize(Array(1,3,5,6,7,10,30)).toDF("colA")
df.show()
df.agg(sum("colA")).first().get(0) //very slow
Similar query was posted here: How to sum the values of one column of a dataframe in spark/scala
The focus of this query is however about efficiency.

compute difference between 2 ranked columns in tableau

I am trying to find the difference between 2 columns in tableau. The catch though here is that each column is ranked based on a value. The difference i need is between these 2 ranked columns
The rank is computed using the table calculations rank function. Attaching the picture for more information

I am assuming "current" and "prior" are calculated fields.
Just create a new calculated field, here I'll call it "Result". In this field just minus your one from the other so:
[Current - Prior]
Then pull this new field into your measures values on your sheet.

Compare column values in consecutive rows in Scala

I am new to Spark Scala. I am having a situation where I have to compare values of a particular column in a data set, for example:
Source Data
Source Destination Distance
Austin Houston 200
Dallas Houston 400
Kansas Dallas 700
Resultant
Source1 Destination1 Distance1 Source2 Destination2 Distance2 DistDiff
Dallas Houston 400 Kansas Dallas 700 300
As per the situation, I have to compare the distance of subsequent rows and if the difference is more than or equals to 300 then save the records in the Resultant data set
700 - 400 = 300
The examples which I have encountered are having functions which are executing on per row basis on any particular data set, however my scenario is to play with consecutive rows.

You mentioned you can sort rows by datetime. So, assuming it's sorted using sortBy or sortByKey to create an ordered rdd, and also assuming you have an even number of rows (so each row has another one to calculate difference with) you can:
Give each row an index using zipWithIndex.
Split the RDD into two RDDs, one with even-numbered indices and one with odd-numbered indices, by filtering on the index created.
zip the split RDDs together, creating a new RDD of Tuple2 with even-indexed rows on the left and odd-indexed rows on the right.
map the result to calculate the difference between left/right of each row.

This may be done in the following way:
Add an index column to your sorted rdd
Make sure the rdd has an even number of rows N
Make an rdd rdd_even1 to contain the even rows with indices [0, N-2]
Make an rdd rdd_odd1 to contain the odd rows [1, N-1]
Make an rdd rdd_even2 to contain the even rows [2, N-2]
Make an rdd rdd_odd2 to contain the odd rows [1, N-3]
Now you need to repartition rdd_even1 and *rdd_odd1 before zipping, because zipping won't work if both rdd's do not have the same number of elements in all partitions (in pyspark at least). You can do it in memory using collect and parallelize, but most likely you have to write the rdd's to HDFS and re-read them, controlling for the paritioning
Do the same for rdd_even2 and rdd_odd2
zip the rdd's from step 7 to rdd_zip1
zip the rdd's from step 8 to rdd_zip2
Call rdd_zip1.union(rdd_zip2)
Now you can call map() on the union to get your "resultant" with the required differences
Good luck.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse