Decision tree error in trainClassifier in Scala

val pdata = sc.parallelize(Seq(data))
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()
// Split the data into training and test sets (30% held out for testing)
val splits = parsedData.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))
// Train a DecisionTree model.
val numClasses = 2
val categoricalFeaturesInfo = {}
val impurity = "gini"
val maxDepth = 5
val maxBins = 32
val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
I have written this code to build a decision tree classification model on my data; the first column is the label column.
It throws an error stating: "overloaded method value trainClassifier with alternatives:"
Here is my sample input data:
1 2 50 12500 98
1 0 13 3250 28
1 1 16 4000 35
1 2 20 5000 45
0 1 24 6000 77
0 4 4 1000 4
1 2 7 1750 14
0 1 12 3000 35
1 2 9 2250 22
1 5 46 11500 98
0 4 23 5750 58
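The likely cause of that overload error: in Scala, val categoricalFeaturesInfo = {} is an empty block, which evaluates to () of type Unit, while trainClassifier expects a Map[Int, Int] in that position, so none of the overloads match. A minimal sketch of the fix (assuming the MLlib RDD API used in the code above):
// categoricalFeaturesInfo must be a Map[Int, Int];
// an empty map means all features are treated as continuous.
val categoricalFeaturesInfo = Map[Int, Int]()

val model = DecisionTree.trainClassifier(trainingData, numClasses,
  categoricalFeaturesInfo, impurity, maxDepth, maxBins)
Note also that the sample rows are space-separated, so line.split(',') will not split them; the parsing step may need to split on whitespace instead.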

Related

Replace values of one pyspark dataframe with another

I have a pyspark dataframe df2 :-
ID Total_Count Final_A Final_B Final_C Final_D
11 80 36 30 8 6
4  80 36 30 8 6
13 65 30 24 6 5
12 56 26 21 5 4
2  65 30 24 6 5
1  56 26 21 5 4
I have another dataframe df1 :-
ID Total_Count A  B  C  D
4  80 0  0  3  0
11 80 0  0  0  0
13 65 0  0  0  0
12 56 0  4  0  0
2  65 0  0  0  0
1  56 0  0  0  0
10 34 10 10 10 4
I want to replace the values in df1 with the values from df2 for each matching ID (the primary key).
Expected df1 :-
ID Total_Count A  B  C  D
11 80 36 30 8  6
4  80 36 30 8  6
13 65 30 24 6  5
12 56 26 21 5  4
2  65 30 24 6  5
1  56 26 21 5  4
10 34 10 10 10 4
df2 = spark.read.option("header","True").option("inferSchema","True").csv("df1.csv")
df1 = spark.read.option("header","True").option("inferSchema","True").csv("df2.csv")
df2 = df2.withColumnRenamed("ID", 'df2_ID').withColumnRenamed("Total_Count", 'df2_Total_Count')
final_df = df1.join(df2, (df1.ID == df2.df2_ID) & (df1.Total_Count == df2.df2_Total_Count), "left")
from pyspark.sql.functions import when
for i in ('A', 'B', 'C', 'D'):
    final_df = final_df.withColumn(i, when(final_df[i] == 0, final_df["Final_{}".format(i)]).otherwise(final_df[i]))
cols = df2.columns
final_df = final_df.drop(*cols)
An alternative answer using coalesce after a left join on ID (assuming df2 still has its original column names):
from pyspark.sql.functions import coalesce

df = df1.join(df2.select('ID', 'Final_A', 'Final_B', 'Final_C', 'Final_D'), ['ID'], 'left')
df = df.withColumn('A', coalesce(df['Final_A'], df['A'])).\
    withColumn('B', coalesce(df['Final_B'], df['B'])).\
    withColumn('C', coalesce(df['Final_C'], df['C'])).\
    withColumn('D', coalesce(df['Final_D'], df['D']))
df1 = df.select('ID', 'Total_Count', 'A', 'B', 'C', 'D')
df1.show()
Rows with a match in df2 get non-null Final_* values, which coalesce prefers; rows that exist only in df1 (such as ID 10) get nulls from the left join, so coalesce falls back to the original A-D values, matching the expected output above.

Pivot table and onehot in pyspark

I have a pyspark data frame which looks like -
id age cost gender
1 38 230 M
2 40 832 M
3 53 987 F
1 38 764 M
4 63 872 F
5 21 763 F
I want my data frame to look like -
id age cost gender M F
1 38 230 M 1 0
2 40 832 M 1 0
3 53 987 F 0 1
1 38 764 M 1 0
4 63 872 F 0 1
5 21 763 F 0 1
4 63 1872 F 0 1
Using Python (pandas) I can manage it in the following way -
final_df = pd.concat([df.drop(['gender'], axis=1), pd.get_dummies(df['gender'])], axis=1)
How to manage in pyspark?
You just need to add 2 columns:
from pyspark.sql import functions as F

final_df = df.select(
    "id",
    "age",
    "cost",
    "gender",
    F.when(F.col("gender") == F.lit("M"), 1).otherwise(0).alias("M"),
    F.when(F.col("gender") == F.lit("F"), 1).otherwise(0).alias("F"),
)
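If the categories are not known ahead of time, here is a sketch of a dynamic variant (my addition, not part of the original answer; it assumes the distinct values of gender are few enough to collect to the driver):
from pyspark.sql import functions as F

# Collect the distinct category values, then build one 0/1 column per value.
genders = [r["gender"] for r in df.select("gender").distinct().collect()]
final_df = df.select(
    "id", "age", "cost", "gender",
    *[F.when(F.col("gender") == g, 1).otherwise(0).alias(g) for g in genders],
)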

Scala/Spark : How to do outer join based on common columns

I have 2 dataframes:
The first dataframe contains temperature information.
The second dataframe contains precipitation information.
I read those files and created the dataframes as:
val dataRecordsTemp = sc.textFile(tempFile).map{ rec =>
  val splittedRec = rec.split("\\s+")
  Temparature(splittedRec(0), splittedRec(1), splittedRec(2), splittedRec(3), splittedRec(4))
}.map{ x => Row.fromSeq(x.getDataFields()) }

val headerFieldsForTemp = Seq("YEAR", "MONTH", "DAY", "MAX_TEMP", "MIN_TEMP")
val schemaTemp = StructType(headerFieldsForTemp.map{ f => StructField(f, StringType, nullable = true) })
val dfTemp = session.createDataFrame(dataRecordsTemp, schemaTemp)
  .orderBy(desc("year"), desc("month"), desc("day"))

println("Printing temperature data ...............................")
dfTemp.select("YEAR", "MONTH", "DAY", "MAX_TEMP", "MIN_TEMP").take(10).foreach(println)

val dataRecordsPrecip = sc.textFile(precipFile).map{ rec =>
  val splittedRec = rec.split("\\s+")
  Precipitation(splittedRec(0), splittedRec(1), splittedRec(2), splittedRec(3), splittedRec(4), splittedRec(5))
}.map{ x => Row.fromSeq(x.getDataFields()) }

val headerFieldsForPrecipitation = Seq("YEAR", "MONTH", "DAY", "PRECIPITATION", "SNOW", "SNOW_COVER")
val schemaPrecip = StructType(headerFieldsForPrecipitation.map{ f => StructField(f, StringType, nullable = true) })
val dfPrecip = session.createDataFrame(dataRecordsPrecip, schemaPrecip)
  .orderBy(desc("year"), desc("month"), desc("day"))

println("Printing precipitation data ...............................")
dfPrecip.select("YEAR", "MONTH", "DAY", "PRECIPITATION", "SNOW", "SNOW_COVER").take(10).foreach(println)
I have to join the 2 RDDs on the common columns (year, month, day). The input files have a header and the output file will have the header as well. The 1st file has temperature information, for example:
year month day min-temp max-temp
2017 12 13 13 25
2017 12 16 25 32
2017 12 25 34 56
The 2nd file has precipitation information, for example:
year month day precipitation snow snow-cover
2018 7 6 0.00 0.0 0
2017 12 13 0.04 0.0 0
2017 12 16 0.4 0.04 1
My expected output should be (ordered by date ascending; if no value is found then blank):
year month day min-temp max-temp precipitation snow snow-cover
2017 12 13 13 25 0.04 0.0 0
2017 12 16 25 32 0.4 0.04 1
2017 12 25 34 56
2018 7 6 0.00 0.0 0
May I get help on how to do that in Scala?
You need to outer join these two datasets and then order the result like this:
import org.apache.spark.sql.functions._
dfTemp
  .join(dfPrecip, Seq("year", "month", "day"), "outer")
  .orderBy(desc("year"), desc("month"), desc("day"))
  .na.fill("")
If you don't need blank values and are fine with null, then you may omit .na.fill("").
Hope it helps!
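One caveat (my reading, not part of the original answer): the expected output above is ordered ascending by date, while desc sorts newest first, so you may want asc instead:
import org.apache.spark.sql.functions.asc

dfTemp
  .join(dfPrecip, Seq("year", "month", "day"), "outer")
  .orderBy(asc("year"), asc("month"), asc("day"))
  .na.fill("")
And since the schema stores year/month/day as strings, consider casting them to integers before ordering to avoid lexicographic surprises (e.g. "7" sorting after "12").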

Scala for loop yield

I'm new to Scala, so I'm messing around with an example from Programming in Scala: A Comprehensive Step-by-Step Guide, 2nd Edition:
// Returns a row as a sequence
def makeRowSeq(row: Int) =
  for (col <- 1 to 10) yield {
    val prod = (row * col).toString
    val padding = " " * (4 - prod.length)
    padding + prod
  }

// Returns a row as a string
def makeRow(row: Int) = makeRowSeq(row).mkString

// Returns table as a string with one row per line
def multiTable() = {
  val tableSeq = // a sequence of row strings
    for (row <- 1 to 10)
      yield makeRow(row)
  tableSeq.mkString("\n")
}
When calling multiTable() the above code outputs:
   1   2   3   4   5   6   7   8   9  10
   2   4   6   8  10  12  14  16  18  20
   3   6   9  12  15  18  21  24  27  30
   4   8  12  16  20  24  28  32  36  40
   5  10  15  20  25  30  35  40  45  50
   6  12  18  24  30  36  42  48  54  60
   7  14  21  28  35  42  49  56  63  70
   8  16  24  32  40  48  56  64  72  80
   9  18  27  36  45  54  63  72  81  90
  10  20  30  40  50  60  70  80  90 100
This makes sense but if I try to change the code in multiTable() to be something like:
def multiTable() = {
  val tableSeq = // a sequence of row strings
    for (row <- 1 to 10)
      yield makeRow(row) {
        2
      }
  tableSeq.mkString("\n")
}
The 2 changes the output, but I'm not sure where it's being used to manipulate the output, and I can't seem to find a similar example searching here or on Google. Any input would be appreciated!
makeRow(row) {2}
and
makeRow(row)(2)
and
makeRow(row).apply(2)
are all equivalent.
makeRow(row) is of type String, one row of the table, so effectively you are picking the character at index 2 of each row string. That is why you see 9 spaces and one 1 in your output: each product is padded to width 4, so index 2 of rows 1-9 (e.g. "   1...") is still padding, while row 10 starts with "  10", whose index 2 is '1'.
def multiTable() = {
  val tableSeq = // a sequence of row strings
    for (row <- 1 to 10)
      yield makeRow(row) {2}
  tableSeq.mkString("\n")
}
is equivalent to applying a map on each row like
def multiTable() = {
  val tableSeq = // a sequence of row strings
    for (row <- 1 to 10)
      yield makeRow(row)
  tableSeq.map(_(2)).mkString("\n")
}
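A quick way to check this in the REPL (using the makeRow definition above):
val row1 = makeRow(1)  // "   1   2   3   4   5   6   7   8   9  10"
row1(2)                // ' ': index 2 of "   1..." is still padding
makeRow(10)(2)         // '1': row 10 starts with "  10", so index 2 is '1'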

Matlab isn't incrementing my variable

I have the following matrix declared in Matlab:
EmployeeData =
1 20 100000 42 14
2 15 95000 35 14
3 18 70000 28 14
4 10 85000 35 14
5 10 40000 21 12
6 4 45000 14 8
7 3 50000 21 10
8 5 55000 21 14
9 1 25000 14 7
10 2 50000 21 9
42 4 100000 42 10
Where column 1 represents ID numbers, column 2 years, column 3 salary, column 4 vacation days, and column 5 sick days. I am trying to find the maximum value of a column (in this case the salary column) and print out the ID associated with that value. If more than one employee has the maximum value, all the IDs with that maximum should be shown. Here is how I naively implemented it:
>> maxVal = [];
>> j = 1;
>> for i = EmployeeData(:, 3)
       if i == max(EmployeeData(:, 3))
           maxVal = [maxVal EmployeeData(j, 1)];
       end
       j = j + 1;
   end
But it shows maxVal to be [] in my workspace variables, instead of [1 42] as I expected. Upon inserting a disp(i) in the for loop above the if to debug, I get the following output:
100000
95000
70000
85000
40000
45000
50000
55000
25000
50000
Just like I expected. But when I switch out that disp(i) with a disp(j), I get this for my output:
1
What am I doing wrong? Should this not work?
MATLAB for loops iterate over the columns of the loop expression, so looping over a column vector gives a single iteration with i bound to the whole column. Transpose it to a row vector to iterate element by element:
for i = EmployeeData(:, 3)'  % NOTE THE TRANSPOSE
    ...
end
EDIT: Note that you can do what you're trying to do without a for loop, using logical indexing:
maxVal = EmployeeData(EmployeeData(:,3) == max(EmployeeData(:,3)),1);
Is this what you want?
>> EmployeeData(EmployeeData(:,3)==max(EmployeeData(:,3)),1)
ans =
     1
    42