Scala/Spark: How to do an outer join based on common columns

I have two dataframes:
The first dataframe contains temperature information.
The second dataframe contains precipitation information.
I read those files and created the dataframes like this:
val dataRecordsTemp = sc.textFile(tempFile).map { rec =>
  val splittedRec = rec.split("\\s+")
  Temparature(splittedRec(0), splittedRec(1), splittedRec(2), splittedRec(3), splittedRec(4))
}.map { x => Row.fromSeq(x.getDataFields()) }

val headerFieldsForTemp = Seq("YEAR", "MONTH", "DAY", "MAX_TEMP", "MIN_TEMP")
val schemaTemp = StructType(headerFieldsForTemp.map { f => StructField(f, StringType, nullable = true) })
val dfTemp = session.createDataFrame(dataRecordsTemp, schemaTemp)
  .orderBy(desc("year"), desc("month"), desc("day"))

println("Printing temperature data ...............................")
dfTemp.select("YEAR", "MONTH", "DAY", "MAX_TEMP", "MIN_TEMP").take(10).foreach(println)
val dataRecordsPrecip = sc.textFile(precipFile).map { rec =>
  val splittedRec = rec.split("\\s+")
  Precipitation(splittedRec(0), splittedRec(1), splittedRec(2), splittedRec(3), splittedRec(4), splittedRec(5))
}.map { x => Row.fromSeq(x.getDataFields()) }

val headerFieldsForPrecipitation = Seq("YEAR", "MONTH", "DAY", "PRECIPITATION", "SNOW", "SNOW_COVER")
val schemaPrecip = StructType(headerFieldsForPrecipitation.map { f => StructField(f, StringType, nullable = true) })
val dfPrecip = session.createDataFrame(dataRecordsPrecip, schemaPrecip)
  .orderBy(desc("year"), desc("month"), desc("day"))

println("Printing precipitation data ...............................")
dfPrecip.select("YEAR", "MONTH", "DAY", "PRECIPITATION", "SNOW", "SNOW_COVER").take(10).foreach(println)
I have to join the two dataframes on the common columns (year, month, day). The input files have a header, and the output file should have a header as well. The first file has temperature information, for example:
year month day min-temp max-temp
2017 12 13 13 25
2017 12 16 25 32
2017 12 25 34 56
The second file has precipitation information, for example:
year month day precipitation snow snow-cover
2018 7 6 0.00 0.0 0
2017 12 13 0.04 0.0 0
2017 12 16 0.4 0.04 1
My expected output should be (ordered by date ascending; blank where no value is found):
year month day min-temp max-temp precipitation snow snow-cover
2017 12 13 13 25 0.04 0.0 0
2017 12 16 25 32 0.4 0.04 1
2017 12 25 34 56
2018 7 6 0.00 0.0 0
May I get help on how to do that in Scala?

You need to outer join these two datasets and then order the result. Note that your expected output is in ascending date order, so use asc rather than desc:
import org.apache.spark.sql.functions._
dfTemp
  .join(dfPrecip, Seq("year", "month", "day"), "outer")
  .orderBy(asc("year"), asc("month"), asc("day"))
  .na.fill("")
If you don't need blank values and are fine with null, you can omit .na.fill("").
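Since the question also asks for the output file to carry a header, here is a minimal sketch of writing the joined result as CSV with a header row; joined stands for the result of the join above, and "output/joined" is a placeholder path:
// Assign the join above to a val first, e.g. val joined = dfTemp.join(...)
joined
  .coalesce(1) // single output file; drop this for large data
  .write
  .option("header", "true")
  .csv("output/joined")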
Hope it helps!

Related

Error in post hoc test for lmer(): both multcomp() and emmeans()

I have a dataset of measurements of "Y" at different locations, and I am trying to determine how variable Y is influenced by variables A, B, and D by running a lmer() model and analyzing the results. However, when I reach the post hoc step, I receive an error when trying to analyze.
Here is an example of my data:
table <- " ID location A B C D Y
1 1 AA 0 0.6181587 -29.67 14.14 168.041
2 2 AA 1 0.5816176 -29.42 14.21 200.991
3 3 AA 2 0.4289670 -28.57 13.55 200.343
4 4 AA 3 0.4158891 -28.59 12.68 215.638
5 5 AA 4 0.3172721 -28.74 12.28 173.299
6 6 AA 5 0.1540603 -27.86 14.01 104.246
7 7 AA 6 0.1219355 -27.18 14.43 128.141
8 8 AA 7 0.1016643 -26.86 13.75 179.330
9 9 BB 0 0.6831649 -28.93 17.03 210.066
10 10 BB 1 0.6796935 -28.54 18.31 280.249
11 11 BB 2 0.5497743 -27.88 17.33 134.023
12 12 BB 3 0.3631052 -27.48 16.79 142.383
13 13 BB 4 0.3875498 -26.98 17.81 136.647
14 14 BB 5 0.3883785 -26.71 17.56 142.179
15 15 BB 6 0.4058061 -26.72 17.71 109.826
16 16 CC 0 0.8647298 -28.53 11.93 220.464
17 17 CC 1 0.8664036 -28.39 11.59 326.868
18 18 CC 2 0.7480748 -27.61 11.75 322.745
19 19 CC 3 0.5959143 -26.81 13.27 170.064
20 20 CC 4 0.4849077 -26.77 14.68 118.092
21 21 CC 5 0.3584687 -26.65 15.65 95.512
22 22 CC 6 0.3018285 -26.33 16.11 71.717
23 23 CC 7 0.2629121 -26.39 16.16 60.052
24 24 DD 0 0.8673077 -27.93 12.09 234.244
25 25 DD 1 0.8226558 -27.96 12.13 244.903
26 26 DD 2 0.7826429 -27.44 12.38 252.485
27 27 DD 3 0.6620447 -27.23 13.84 150.886
28 28 DD 4 0.4453213 -27.03 15.73 102.787
29 29 DD 5 0.3720257 -27.13 16.27 109.201
30 30 DD 6 0.6040217 -27.79 16.41 101.509
31 31 EE 0 0.8770987 -28.62 12.72 239.036
32 32 EE 1 0.8504547 -28.47 12.92 220.600
33 33 EE 2 0.8329484 -28.45 12.94 174.979
34 34 EE 3 0.8181102 -28.37 13.17 138.412
35 35 EE 4 0.7942685 -28.32 13.69 121.330
36 36 EE 5 0.7319724 -28.22 14.62 111.851
37 37 EE 6 0.7014828 -28.24 15.04 110.447
38 38 EE 7 0.7286984 -28.15 15.18 121.831"
#Create a dataframe with the above table
df <- read.table(text=table, header = TRUE)
df
# Make sure location is a factor
df$location<-as.factor(df$location)
Here is my model:
# Load libraries
library(ggplot2)
library(pscl)
library(lmtest)
library(lme4)
library(car)
mod = lmer(Y ~ A * B * poly(D, 2) * (1|location), data = df)
summary(mod)
plot(mod)
I now need to determine what variables significantly influence Y. Thus I ran Anova() from the package car (output pasted here).
Anova(mod)
# Analysis of Deviance Table (Type II Wald chisquare tests)
#
# Response: Y
# Chisq Df Pr(>Chisq)
# A 8.2754 1 0.004019 **
# B 0.0053 1 0.941974
# poly(D, 2) 40.4618 2 1.636e-09 ***
# A:B 0.1709 1 0.679348
# A:poly(D, 2) 1.6460 2 0.439117
# B:poly(D, 2) 5.2601 2 0.072076 .
# A:B:poly(D, 2) 0.6372 2 0.727175
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This suggests that:
A significantly influences Y
B does not significantly influence Y
D significantly influences Y
So next I would run a post hoc test for each of these variables, but this is where I run into issues. I have tried using both emmeans and multcomp packages below:
library(emmeans)
emmeans(mod, list(pairwise ~ A), adjust = "tukey")
# NOTE: Results may be misleading due to involvement in interactions
# Error in if ((misc$estType == "pairs") && (paste(c("", by), collapse = ",") != :
# missing value where TRUE/FALSE needed
pairs(emmeans(mod, "A"))
# NOTE: Results may be misleading due to involvement in interactions
# Error in if ((misc$estType == "pairs") && (paste(c("", by), collapse = ",") != :
# missing value where TRUE/FALSE needed
library(multcomp)
summary(glht(mod, linfct = mcp(A = "Tukey")), test = adjusted("fdr"))
# Error in h(simpleError(msg, call)) :
# error in evaluating the argument 'object' in selecting a method for function 'summary': Variable(s) ‘depth’ of class ‘integer’ is/are not contained as a factor in ‘model’.
This is the first time I've run an ANOVA/post hoc test on a lmer() model, and though I've read a few introductory sites for this model, I'm not sure I am testing it correctly. Any help would be appreciated.
If I am looking at the data correctly, A is the variable that has values of 0, 1, ..., 7. Now look at your anova table, where you see that A has only 1 d.f., not 7 as it should for a factor having 8 levels. That means your model is taking A to be a numerical predictor -- which is rather meaningless. Make A into a factor and re-fit the model. You'll have better luck.
I also think you meant to have + (1|location) at the end of the model formula, rather than having the random effects interacting with some of the polynomial effects.
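Putting both suggestions together, a minimal sketch (same variable names as in the question, lme4 and emmeans loaded as above; note that with this small example dataset the full three-way interaction may be over-parameterized):
# Treat A as a factor rather than a numeric predictor
df$A <- as.factor(df$A)
# Random intercept for location is added with +, not crossed with the fixed effects
mod <- lmer(Y ~ A * B * poly(D, 2) + (1|location), data = df)
# Post hoc pairwise comparisons on the re-fitted model
emmeans(mod, pairwise ~ A, adjust = "tukey")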

Filling a calendar using Arrayformula or LOOKUP

I've made a calendar sheet and would like to fill it using an Arrayformula or some kind of Lookup.
The problem is that the formula in each cell is different. Do I need them all to be the same, or is it possible to write an Arrayformula that applies a different formula to each line?
I spent ages getting the calendar code working but would now like to simplify the code and I'm not sure what my next step should be:
https://docs.google.com/spreadsheets/d/1u_J7bmOFyDlYXhcL5dW3CHFJ1esySAKK_yPc6nFTdLA/edit?usp=sharing
Any advice would be much appreciated.
I've added a new sheet in your file called 'Aresvik'.
The green cells have new formulas.
Cell B3 can be =date(B1,1,1)
Then each successive month can be =eomonth(B3,0)+1, =eomonth(J3,0)+1 etc.
The date formula in cell B5 is:
=arrayformula(iferror(vlookup(sequence(7,7,1),{array_constrain(sequence(40,1),day(eomonth(B3,0))+weekday(B3,3),1),query({flatten(split(rept(",",day(eomonth(B3,0))-1),",",0,0));sequence(day(eomonth(B3,0)),1,1)},"offset "&day(eomonth(B3,0))-weekday(B3,3)&" ",0)},2,false),))
It can be copied to each of the other cells below Mo, so B5 will change to J5, R5, Z5, etc.
Notes
The concept revolves around using the SEQUENCE function to generate a grid of numbers, 6 rows, 7 columns:
sequence(6,7)
which looks like this:
1 2 3 4 5 6 7
8 9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25 26 27 28
29 30 31 32 33 34 35
36 37 38 39 40 41 42
Then using these numbers in a VLOOKUP to get a corresponding date for the calendar. If the first of the month falls on a Thursday (April 2021), the vlookup range needs 3 gaps at the top of the list of dates. player0 has a more elegant solution than my original query using offset, so I've incorporated it below. Cell Z3 is the date 1/4/2021:
=arrayformula(
iferror(
vlookup(sequence(6,7),
{sequence(day(eomonth(Z3,0))+weekday(Z3,2),1,0),
{iferror(sequence(weekday(Z3,2),1)/0,);sequence(day(eomonth(Z3,0)),1,Z3)}},
2,false)
,))
The first column in the vlookup range is:
sequence(day(eomonth(Z3,0))+weekday(Z3,2),1,0)
which is an array of numbers from 0, corresponding with the number of days in the month plus the number of gaps before the 1st day.
The second column in the vlookup range is:
{iferror(sequence(weekday(Z3,2),1)/0,);sequence(day(eomonth(Z3,0)),1,Z3)}
It is an array of two stacked pieces in the format {x;y}, where y sits below x because of the ;.
These are the gaps: iferror(sequence(weekday(Z3,2),1)/0,), followed by the date numbers: sequence(day(eomonth(Z3,0)),1,Z3)
(Example below is May 2021, a month starting on a Saturday):
0
1
2
3
4
5
6 44317
7 44318
8 44319
9 44320
10 44321
11 44322
12 44323
13 44324
14 44325
15 44326
16 44327
17 44328
18 44329
19 44330
20 44331
21 44332
22 44333
23 44334
24 44335
25 44336
26 44337
27 44338
28 44339
29 44340
30 44341
31 44342
32 44343
33 44344
34 44345
35 44346
36 44347
The vlookup takes each number in the initial sequence (6x7 layout), and brings back the corresponding date from col2 in the range, based on a match in col1.
When the first day of the month is a Monday, iferror(sequence(weekday(BB1,2),1)/0,) generates a gap in col2 of the vlookup range. This is why col1 in the vlookup range has to start with 0.
I've updated the sheet at https://docs.google.com/spreadsheets/d/1u_J7bmOFyDlYXhcL5dW3CHFJ1esySAKK_yPc6nFTdLA/edit#gid=68642071
Values on the calendar are dates so the formatting has to be d.
If you want numbers, then use:
=arrayformula(
iferror(
vlookup(sequence(6,7),
{sequence(day(eomonth(Z3,0))+weekday(Z3,2),1,0),
{iferror(sequence(weekday(Z3,2),1)/0,);sequence(day(eomonth(Z3,0)),1)}},
2,false)
,))
shorter solution:
=INDEX(IFNA(VLOOKUP(SEQUENCE(6, 7), {SEQUENCE(DAY(EOMONTH(B3, ))+WEEKDAY(B3, 2), 1, ),
{IFERROR(ROW(INDIRECT("1:"&WEEKDAY(B3, 2)))/0); SEQUENCE(DAY(EOMONTH(B3, )), 1, B3)}}, 2, )))

How to join two different dataframes in PySpark?

I have two dataframes. One has columns X, year, month, and measure, plus columns X1 and X2, which correspond to the first day and the second day. The first dataframe is:
X year month measure X1 X2
1 1 2014 12 Max.TemperatureF 64 42
2 2 2014 12 Mean.TemperatureF 52 38
3 3 2014 12 Min.TemperatureF 39 33
The second dataframe, where I only have the days:
X3 X4 X5 X6 X7
1 51 43 42 45
2 44 37 34 42
3 37 30 26 38
So I want to join the two dataframes and obtain, in PySpark:
X year month measure X1 X2 X3 X4 X5 X6
1 1 2014 12 Max.TemperatureF 64 42 1 51 43 42
2 2 2014 12 Mean.TemperatureF 52 38 2 44 37 34
3 3 2014 12 Min.TemperatureF 39 33 3 37 30 26
I have joined them, but I get one dataframe stacked above the other instead of the rows remaining side by side:
from functools import reduce
from pyspark.sql import DataFrame

def unionAll(*dfs):
    return reduce(DataFrame.unionAll, dfs)

td = unionAll(*[weather1, weather2])
X year month measure X1 X2
1 1 2014 12 Max.TemperatureF 64 42
2 2 2014 12 Mean.TemperatureF 52 38
3 3 2014 12 Min.TemperatureF 39 33
X3 X4 X5 X6
1 51 43 42 45
2 44 37 34 42
3 37 30 26 38
So this is the wrong kind of join.
I suppose what you are trying to do is join two tables. To join two tables you need a common column, and since you don't have one, you will have to create something. This is how I would tackle it:
# A column of one DataFrame cannot be referenced from another DataFrame in
# PySpark, so generate a matching row index on each side and join on that.
from pyspark.sql.functions import monotonically_increasing_id
# Caveat: the generated ids only line up row-by-row when both DataFrames
# have the same partitioning; for small data, coalesce(1) first to be safe.
weather1 = weather1.coalesce(1).withColumn('row_id', monotonically_increasing_id())
weather2 = weather2.coalesce(1).withColumn('row_id', monotonically_increasing_id())
# Join the two tables on the generated index
td = weather1.join(weather2, 'row_id').drop('row_id')
This should solve the problem.
Joining two different dataframes in PySpark?
Assuming the two DataFrames are df and df1 respectively, with X1 as the common column, register them as temp views first and then join in SQL:
df.createOrReplaceTempView("df")
df1.createOrReplaceTempView("df1")
sqlContext.sql("select * from df, df1 where df.X1 == df1.X1").show()
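For reference, the same join is a one-liner with the DataFrame API (assuming both frames share the X1 column):
td = df.join(df1, on='X1')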

Decision tree error in train classification in Scala

val pdata = sc.parallelize(Seq(data))
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()
// Split the data into training and test sets (30% held out for testing)
val splits = parsedData.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))
// Train a DecisionTree model.
val numClasses = 2
val categoricalFeaturesInfo = {}
val impurity = "gini"
val maxDepth = 5
val maxBins = 32
val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
I have written this code to build a decision tree classification model on my data; the first column is the label (the value to predict).
It throws an error stating "overloaded method value trainClassifier with alternatives:".
Here is my sample input data:
1 2 50 12500 98
1 0 13 3250 28
1 1 16 4000 35
1 2 20 5000 45
0 1 24 6000 77
0 4 4 1000 4
1 2 7 1750 14
0 1 12 3000 35
1 2 9 2250 22
1 5 46 11500 98
0 4 23 5750 58
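For reference, a likely cause of that overload error is categoricalFeaturesInfo: trainClassifier expects a Map[Int, Int], but {} has type Unit, so none of the overloads match. A minimal corrected sketch, assuming the space-separated sample data above with the label in the first column:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

val parsedData = data.map { line =>
  val parts = line.split("\\s+")
  // First value is the label; the rest are features
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
}.cache()

val splits = parsedData.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

val numClasses = 2
// An empty Map[Int, Int] means "treat all features as continuous"
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 5
val maxBins = 32

val model = DecisionTree.trainClassifier(trainingData, numClasses,
  categoricalFeaturesInfo, impurity, maxDepth, maxBins)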

MATLAB beginner: median, mode and binning

I am a beginner with MATLAB and I am struggling with this assignment. Can anyone guide me through it?
Consider the data given below:
x = [ 1 , 48 , 81 , 2 , 10 , 25 , ,14 , 18 , 53 , 41, 56, 89,0, 1000, , ...
34, 47, 455, 21, , 22, 100 ];
Once the data is loaded, see if you can find any:
Outliers or
Missing data in the data file
Correct the missing values using median, mode and noisy data using median binning, mean binning and bin boundaries.
This isn't so bad. First, take a look at the distribution of your data. The majority of the values have double digits; the outliers are the single-digit values, or those that are much larger than double digits. Mind you, this is totally subjective, so someone else may tell you that the single digits are part of your data too. The missing data are the blank spaces between the commas. Let's write some MATLAB code to change these to NaN (not-a-number), because if you copy and paste the definition above directly into MATLAB it will give you a syntax error: when you define a vector explicitly this way, every element has to be present.
To do this, put the statement in a string first, use regexprep to insert a NaN wherever a comma is followed (after a space) by another comma, and then use eval to run the resulting string as an actual MATLAB statement:
x = '[ 1 , 48 , 81 , 2 , 10 , 25 , ,14 , 18 , 53 , 41, 56, 89,0, 1000, , 34, 47, 455, 21, , 22, 100 ];'
y = eval(regexprep(x, ', ,', ', NaN, '));
If we display this data, we get:
y =
Columns 1 through 6
1 48 81 2 10 25
Columns 7 through 12
NaN 14 18 53 41 56
Columns 13 through 18
89 0 1000 NaN 34 47
Columns 19 through 23
455 21 NaN 22 100
As such, to answer our first question, any values that are missing are denoted as NaN and those numbers that are bigger than double digits are outliers.
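For reference, a sketch of an eval-free alternative: strip the brackets and semicolon with regexprep, split on commas, and let str2double map the empty fields straight to NaN (strsplit requires R2013a or newer):
y = str2double(strsplit(regexprep(x, '[\[\];]', ''), ','));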
For the next question, we extract the values that are not missing, compute the mean and median of those, and fill the NaN positions with the mean and the median respectively. For the bin boundaries, this amounts to filling each missing value with the value to its left (or right, depending on your definition, but let's use left). As such:
yMissing = isnan(y); %// Which values are missing?
y_noNaN = y(~yMissing); %// Extract the non-missing values
meanY = mean(y_noNaN); %// Get the mean
medianY = median(y_noNaN); %// Get the median
%// Output - Fill in missing values with median
yMedian = y;
yMedian(yMissing) = medianY;
%// Same for mean
yMean = y;
yMean(yMissing) = meanY;
%// Bin boundaries
yBinBound = y;
yBinBound(yMissing) = y(find(yMissing)-1);
The mean and median for the data of the non-missing values is:
meanY =
105.8500
medianY =
37.5000
The outputs for each of these, in addition to the original data with the missing values looks like:
format bank; %// Do this to show just the first two decimal places for compact output
format compact;
y =
Columns 1 through 5
1 48 81 2 10
Columns 6 through 10
25 NaN 14 18 53
Columns 11 through 15
41 56 89 0 1000
Columns 16 through 20
NaN 34 47 455 21
Columns 21 through 23
NaN 22 100
yMean =
Columns 1 through 5
1.00 48.00 81.00 2.00 10.00
Columns 6 through 10
25.00 105.85 14.00 18.00 53.00
Columns 11 through 15
41.00 56.00 89.00 0 1000.00
Columns 16 through 20
105.85 34.00 47.00 455.00 21.00
Columns 21 through 23
105.85 22.00 100.00
yMedian =
Columns 1 through 5
1.00 48.00 81.00 2.00 10.00
Columns 6 through 10
25.00 37.50 14.00 18.00 53.00
Columns 11 through 15
41.00 56.00 89.00 0 1000.00
Columns 16 through 20
37.50 34.00 47.00 455.00 21.00
Columns 21 through 23
37.50 22.00 100.00
yBinBound =
Columns 1 through 5
1.00 48.00 81.00 2.00 10.00
Columns 6 through 10
25.00 25.00 14.00 18.00 53.00
Columns 11 through 15
41.00 56.00 89.00 0 1000.00
Columns 16 through 20
1000.00 34.00 47.00 455.00 21.00
Columns 21 through 23
21.00 22.00 100.00
If you take a look at each of the output values, this fills in our data with the mean, median and also the bin boundaries as per the question.
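The question also asks about smoothing the noisy values by binning. A minimal sketch under assumed choices (equal-size bins of 4 on the sorted, median-filled data):
binSize = 4; %// Assumed bin size
ySorted = sort(yMedian); %// Binning operates on sorted data
n = numel(ySorted);
ySmoothMean = ySorted; %// Will hold values smoothed by bin means
ySmoothMedian = ySorted; %// Will hold values smoothed by bin medians
for k = 1 : ceil(n / binSize)
    idx = (k-1)*binSize + 1 : min(k*binSize, n);
    ySmoothMean(idx) = mean(ySorted(idx));
    ySmoothMedian(idx) = median(ySorted(idx));
end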