How to find the max value from multiple columns in a dataframe in Spark [duplicate] - scala

This question already has an answer here:
Scala/Spark dataframes: find the column name corresponding to the max
(1 answer)
Closed 3 years ago.
I have an input Spark dataframe as follows:
sample A B C D
1 1 3 5 7
2 6 8 10 9
3 6 7 8 1
I need to find the max among the A, B, C, D columns, which are subject marks.
I need to create a new dataframe with max_marks as the new column.
sample A B C D max_marks
1 1 3 5 7 7
2 6 8 10 9 10
3 6 7 8 1 8
I have done this using Scala as:
val cols = df.columns.toSeq
val df1 = cols.foldLeft(df) { (df, colName) => df.withColumn("max_sub", max(colName)) }
df1.show()
I am getting the error message:
Exception in thread "main" org.apache.spark.sql.AnalysisException: grouping expression sequence is empty
This dataframe has about 100 columns, so how can I iterate over it?
It would be helpful to iterate over the dataframe, since the columns for which the mean has to be found are only about 10 out of the 100, and the dataframe has about 10,000 records.
I am looking to pass the columns dynamically, without giving the column names manually, which means looping over the columns I choose and performing any mathematical operation on them.

There are many ways to accomplish this; one of them would be to use map.
Here is simple pseudo-code to do what you want (it won't work as written, but I think the idea is clear):
df = df.withColumn("max_sub", "A")
df.map({ x => {
    max = "A"
    maxVal = 0
    for col in x {
        if (col != "max_sub" && x.col > maxVal) {
            max = col
            maxVal = x.col
        }
    }
    x.max_sub = max
    x
}})
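For the concrete goal in the question (a row-wise max over a chosen set of columns), a minimal sketch using Spark's built-in greatest function could look like the following; the markCols list is an assumption standing in for whatever subset of the ~100 columns you select dynamically:
import org.apache.spark.sql.functions.{col, greatest}

// hypothetical, dynamically chosen subset of the mark columns
val markCols = Seq("A", "B", "C", "D")

// greatest() is evaluated per row, so no groupBy is required
val dfWithMax = df.withColumn("max_marks", greatest(markCols.map(col): _*))
dfWithMax.show()
Unlike max, greatest is not an aggregate function, which is why this approach avoids the "grouping expression sequence is empty" error.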

Related

Performing random trials in pyspark

I have recently been learning PySpark and wanted to apply it to one of my problems. Basically, I want to perform random trials on each record in a dataframe. My dataframe is structured as below.
order_id,order_date,distribution,quantity
O1,D1,3 4 4 5 6 7 8 ... ,10
O2,D2,1 6 9 10 12 16 18 ..., 20
O3,D3,7 12 15 16 18 20 ... ,50
Here the distribution column holds 100 percentile points, with each value space separated.
I want to loop through each of these rows in the dataframe, randomly select a point in the distribution, add that many days to order_date, and create a new column arrival_date.
At the end I want to get the avg(quantity) by arrival_date, so my final dataframe should look like:
arrival_date,qty
A1,5
A2,10
What I have achieved so far is below:
import random
import datetime

df = spark.read.option("header", True).csv("/tmp/test.csv")

def randSample(row):
    order_id = row.order_id
    quantity = int(row.quantity)
    data = []
    for i in range(1, 20):
        # pick a random point from the space-separated 'distribution' column
        n = random.randint(0, 99)
        randnum = int(float(row.distribution.split(" ")[n]))
        arrival_date = datetime.datetime.strptime(row.order_date.split(" ")[0], "%Y-%m-%d") + datetime.timedelta(days=randnum)
        data.append((arrival_date, quantity))
    return data
finalRDD = df.rdd.map(randSample)
The calculations look correct; however, finalRDD is structured as a list of lists, as below:
[
[(),(),(),()]
,[(),(),(),()]
,[(),(),(),()]
,[(),(),(),()]
]
Each list inside the main list is a single record, and each tuple inside the nested list is a trial of that record.
Basically, I want the final output as flattened records, so that I can compute the average.
[
(),
(),
(),
]

str_detect for multiple patterns

I am using str_detect within the stringr package and I am having trouble searching a string with more than one pattern.
Here is the code I am using; however, it is not returning anything even though my vector ("Notes-Title") contains these patterns.
filter(str_detect(`Notes-Title`, c("quantity","single")))
The logic I want to code is:
Search each row and filter it if it contains the string "quantity" or "single".
You need to use the | separator in your search pattern, all within a single set of quotation marks:
> words <- c("quantity", "single", "double", "triple", "awful")
> set.seed(1234)
> df = tibble(col = sample(words,10, replace = TRUE))
> df
# A tibble: 10 x 1
col
<chr>
1 triple
2 single
3 awful
4 triple
5 quantity
6 awful
7 triple
8 single
9 single
10 triple
> df %>% filter(str_detect(col, "quantity|single"))
# A tibble: 4 x 1
col
<chr>
1 single
2 quantity
3 single
4 single

Create new binary column based off of join in spark

My situation is I have two spark data frames, dfPopulation and dfSubpopulation.
dfSubpopulation is just that, a subpopulation of dfPopulation.
I would like a clean way to create a new column in dfPopulation that is a binary indicator of whether the row is present in dfSubpopulation. E.g. what I want is to create the new DataFrame dfPopulationNew:
dfPopulation = X Y key
1 2 A
2 2 A
3 2 B
4 2 C
5 3 C
dfSubpopulation = X Y key
1 2 A
3 2 B
4 2 C
dfPopulationNew = X Y key inSubpopulation
1 2 A 1
2 2 A 0
3 2 B 1
4 2 C 1
5 3 C 0
I know this could be done fairly simply with a SQL statement; however, given that a lot of Spark's optimization now targets the DataFrame construct, I would like to use that.
Using Spark SQL compared to DataFrame operations should make no difference from a performance perspective; the execution plan is the same. That said, here is one way to do it using a join:
import org.apache.spark.sql.functions.lit

val dfPopulationNew = dfPopulation.join(
    dfSubpopulation.withColumn("inSubpopulation", lit(1)),
    Seq("X", "Y", "key"),
    "left_outer")
  .na.fill(0, Seq("inSubpopulation"))
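If dfSubpopulation might contain duplicate rows (an assumption about the data, not something stated in the question), a small variation of the same join deduplicates the keys first, so the left join cannot multiply rows in dfPopulation:
import org.apache.spark.sql.functions.lit

// keep only the join keys and drop duplicates before flagging them
val flags = dfSubpopulation
  .select("X", "Y", "key")
  .distinct()
  .withColumn("inSubpopulation", lit(1))

val dfPopulationNew = dfPopulation
  .join(flags, Seq("X", "Y", "key"), "left_outer")
  .na.fill(0, Seq("inSubpopulation"))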

create a new matrix from values obtained iterating through other matrices

In MATLAB I have 4 matrices which are all 1 (row) by 4 (columns): ABCD, EFGH, IJKL, MNOP.
Their names are also stored in a list:
Stock_List2 = {'ABCD' 'EFGH' 'IJKL' 'MNOP'} which is a 1-by-4 cell.
I want to iterate through the list and create a new matrix called "display" which takes the values of the individual matrices and places them underneath each other.
I am trying something like:
for e = 1:length(Stock_List2)
    display(e) = eval(strcat(Stock_List2)(e))
end
However, I am getting the following error, which truthfully may well just mean that I'm way off the mark:
Error: ()-indexing must appear last in an index expression.
As an example, if the original matrices are as follows:
ABCD 1 2 3 4
DEFG 5 6 7 8
HIJK 9 8 7 6
LMNO 5 4 3 2
I would like the final output, i.e. the 'display' matrix, to be a 4-by-4 matrix looking like:
display
1 2 3 4
5 6 7 8
9 8 7 6
5 4 3 2
If I understood right, you want to vertically concatenate the matrices ABCD, EFGH, IJKL and MNOP, saving them in the matrix "display".
You could do:
display = [ABCD; EFGH; IJKL; MNOP]
or:
for i=1:length(Stock_List2)
display(i,:) = Stock_List2{i}
end
Apologies if what I wanted wasn't clear. I've got the following from a colleague, which achieves the desired result:
for e=1:length(Stock_List2)
eval(strcat('display_mat(e,:) = ',Stock_List2{e}));
end

select rows by comparing columns using HDFStore

How can I select rows by comparing two columns from an HDF5 file using pandas? The HDF5 file is too big to load into memory. For example, I want to select rows where column A and column B are equal. The dataframe is saved in the file 'mydata.hdf5'. Thanks.
import pandas as pd
store = pd.HDFStore('mydata.hdf5')
df = store.select('mydf',where='A=B')
This doesn't work. I know that store.select('mydf', where='A==12') will work, but I want to compare columns A and B. The example data looks like this:
A B C
1 1 3
1 2 4
. . .
2 2 5
1 3 3
You cannot directly do this, but the following will work:
In [23]: df = DataFrame({'A' : [1,2,3], 'B' : [2,2,2]})
In [24]: store = pd.HDFStore('test.h5',mode='w')
In [26]: store.append('df',df,data_columns=True)
In [27]: store.select('df')
Out[27]:
A B
0 1 2
1 2 2
2 3 2
In [28]: store.select_column('df','A') == store.select_column('df','B')
Out[28]:
0 False
1 True
2 False
dtype: bool
This should be pretty efficient.