Dropping constant columns in a csv file - scala

i would like to drop columns which are constant in a dataframe , here what i did , but i see that it tooks some much time to do it , special while writing the dataframe into the csv file , please any help to optimize the code to take less time
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("inferSchema", "true").option("header", "false").csv("D:\\ProcessDataSet\\anis_data\\Set _1Mud Pumps_Merged.csv")
val aggregations = df.drop("DateTime").columns.map(c => stddev("c").as(c))
val df2 = df.agg(aggregations.head, aggregations.tail: _*)
val columnsToKeep: Seq[String] = (df2.first match {
case r : Row => r.toSeq.toArray.map(_.asInstanceOf[Double])
}).zip(df.columns)
.filter(_._1 != 0) // your special condition is in the filter
.map(_._2) // keep just the name of the column
// select columns with stddev != 0
val finalResult = df.select(columnsToKeep.head, columnsToKeep.tail : _*)
finalResult.write.option("header",true).csv("D:\\ProcessDataSet\\dataWithoutConstant\\Set _1Mud Pumps_MergedCleaned.csv")
}

I think there is no much room left for optimization. You are doing the right thing.
Maybe what you can try is to cache() your dataframe df.
df is used in two separate Spark actions so it is loaded twice.
Try :
...
val df = spark.read.option("inferSchema", "true").option("header", "false").csv("D:\\ProcessDataSet\\anis_data\\Set _1Mud Pumps_Merged.csv")
df.cache()
val aggregations = df.drop("DateTime").columns.map(c => stddev("c").as(c))
...

Related

dataframe.select, select dataframe columns from file

I am trying to create a child dataframe from parent dataframe. but I have more than 100 cols to select.
so in Select statement can I give the columns from a file?
val Raw_input_schema=spark.read.format("text").option("header","true").option("delimiter","\t").load("/HEADER/part-00000").schema
val Raw_input_data=spark.read.format("text").schema(Raw_input_schema).option("delimiter","\t").load("/DATA/part-00000")
val filtered_data = Raw_input_data.select(all_cols)
how can I send the columns names from file in all_cols
I would assume you would read file somewhere from hdfs or from shared config file? Reason for this, that on the cluster this code, would be executed on individual node etc.
In this case I would approach this with next pice of code:
import org.apache.spark.sql.functions.col
val lines = Source.fromFile("somefile.name.csv").getLines
val cols = lines.flatMap(_.split(",")).map( col(_)).toArray
val df3 = df2.select(cols :_ *)
Essentially, you just have to provide array of strings and use :_ * notation for variable number of arguments.
finally this worked for me;
val Raw_input_schema=spark.read.format("csv").option("header","true").option("delimiter","\t").load("headerFile").schema
val Raw_input_data=spark.read.format("csv").schema(Raw_input_schema).option("delimiter","\t").load("dataFile")
val filtered_file = sc.textFile("filter_columns_file").map(cols=>cols.split("\t")).flatMap(x=>x).collect().toList
//or
val filtered_file = sc.textFile(filterFile).map(cols=>cols.split("\t")).flatMap(x=>x).collect().toList.map(x => new Column(x))
val final_df=Raw_input_data.select(filtered_file.head, filtered_file.tail: _*)
//or
val final_df = Raw_input_data.select(filtered_file:_*)'

Better way to join in spark scala [duplicate]

I would like to join two spark-scala dataframes on multiple columns dynamically. I would to avoid hard coding column name comparison as shown in the following statments;
val joinRes = df1.join(df2, df1("col1") == df2("col1") and df1("col2") == df2("col2"))
The solution for this query already exists in pyspark version --provided in the following link
PySpark DataFrame - Join on multiple columns dynamically
I would like to code the same code using spark-scala
In scala you do it in similar way like in python but you need to use map and reduce functions:
val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._
val df1 = List("a,b", "b,c", "c,d").toDF("col1","col2")
val df2 = List("1,2", "2,c", "3,4").toDF("col1","col2")
val columnsdf1 = df1.columns
val columnsdf2 = df2.columns
val joinExprs = columnsdf1
.zip(columnsdf2)
.map{case (c1, c2) => df1(c1) === df2(c2)}
.reduce(_ && _)
val dfJoinRes = df1.join(df2,joinExprs)

dynamically join two spark-scala dataframes on multiple columns without hardcoding join conditions

I would like to join two spark-scala dataframes on multiple columns dynamically. I would to avoid hard coding column name comparison as shown in the following statments;
val joinRes = df1.join(df2, df1("col1") == df2("col1") and df1("col2") == df2("col2"))
The solution for this query already exists in pyspark version --provided in the following link
PySpark DataFrame - Join on multiple columns dynamically
I would like to code the same code using spark-scala
In scala you do it in similar way like in python but you need to use map and reduce functions:
val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._
val df1 = List("a,b", "b,c", "c,d").toDF("col1","col2")
val df2 = List("1,2", "2,c", "3,4").toDF("col1","col2")
val columnsdf1 = df1.columns
val columnsdf2 = df2.columns
val joinExprs = columnsdf1
.zip(columnsdf2)
.map{case (c1, c2) => df1(c1) === df2(c2)}
.reduce(_ && _)
val dfJoinRes = df1.join(df2,joinExprs)

How to add corresponding Integer values in 2 different DataFrames

I have two DataFrames in my code with exact same dimensions, let's say 1,000,000 X 50. I need to add corresponding values in both dataframes. How to achieve that.
One option would be to add another column with ids, union both DataFrames and then use reduceByKey. But is there any other more elegent way?
Thanks.
Your approach is good. Another option can be two take the RDD and zip those together and then iterate over those to sum the columns and create a new dataframe using any of the original dataframe schemas.
Assuming the data types for all the columns are integer, this code snippets should work. Please note that, this has been done in spark 2.1.0.
import spark.implicits._
val a: DataFrame = spark.sparkContext.parallelize(Seq(
(1, 2),
(3, 6)
)).toDF("column_1", "column_2")
val b: DataFrame = spark.sparkContext.parallelize(Seq(
(3, 4),
(1, 5)
)).toDF("column_1", "column_2")
// Merge rows
val rows = a.rdd.zip(b.rdd).map{
case (rowLeft, rowRight) => {
val totalColumns = rowLeft.schema.fields.size
val summedRow = for(i <- (0 until totalColumns)) yield rowLeft.getInt(i) + rowRight.getInt(i)
Row.fromSeq(summedRow)
}
}
// Create new data frame
val ab: DataFrame = spark.createDataFrame(rows, a.schema) // use any of the schemas
ab.show()
Update:
So, I tried to experiment with the performance of my solution vs yours. I tested with 100000 rows and each row has 50 columns. In case of your approach it has 51 columns, the extra one is for the ID column. In a single machine(no cluster), my solution seems to work a bit faster.
The union and group by approach takes about 5598 milliseconds.
Where as my solution takes about 5378 milliseconds.
My assumption is the first solution takes a bit more time because of the union operation of the two dataframes.
Here are the methods which I created for testing the approaches.
def option_1()(implicit spark: SparkSession): Unit = {
import spark.implicits._
val a: DataFrame = getDummyData(withId = true)
val b: DataFrame = getDummyData(withId = true)
val allData = a.union(b)
val result = allData.groupBy($"id").agg(allData.columns.collect({ case col if col != "id" => (col, "sum") }).toMap)
println(result.count())
// result.show()
}
def option_2()(implicit spark: SparkSession): Unit = {
val a: DataFrame = getDummyData()
val b: DataFrame = getDummyData()
// Merge rows
val rows = a.rdd.zip(b.rdd).map {
case (rowLeft, rowRight) => {
val totalColumns = rowLeft.schema.fields.size
val summedRow = for (i <- (0 until totalColumns)) yield rowLeft.getInt(i) + rowRight.getInt(i)
Row.fromSeq(summedRow)
}
}
// Create new data frame
val result: DataFrame = spark.createDataFrame(rows, a.schema) // use any of the schemas
println(result.count())
// result.show()
}

Scala Spark: Performance issue renaming huge number of columns

To be able to work with columnnames of my DataFrame without escaping the . I need a function to "validify" all columnnames - but none of the methods I tried does the job in a timely manner (I'm aborting after 5 minutes).
The dataset I'm trying my algorithms on is the golub Dataset (get it here). It's a 2.2MB CSV file with 7200 columns. Renaming all columns should be a matter of seconds
Code to read the CSV in
var dfGolub = spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv("golub_merged.csv")
.drop("_c0") // drop the first column
.repartition(numOfCores)
Attempts to rename columns:
def validifyColumnnames1(df : DataFrame) : DataFrame = {
import org.apache.spark.sql.functions.col
val cols = df.columns
val colsRenamed = cols.map(name => col(name).as(name.replaceAll("\\.","")))
df.select(colsRenamed : _*)
}
def validifyColumnnames2[T](df : Dataset[T]) : DataFrame = {
val newColumnNames = ArrayBuffer[String]()
for(oldCol <- df.columns) {
newColumnNames += oldCol.replaceAll("\\.","")
}
df.toDF(newColumnNames : _*)
}
def validifyColumnnames3(df : DataFrame) : DataFrame = {
var newDf = df
for(col <- df.columns){
newDf = newDf.withColumnRenamed(col,col.replaceAll("\\.",""))
}
newDf
}
Any ideas what is causing this performance issue?
Setup: I'm running Spark 2.1.0 on Ubuntu 16.04 in local[24] mode on a machine with 16cores * 2 threads and 96GB of RAM
Assuming you know the types you can simply create the schema instead of infering it (infering the schema costs performance and might even be wrong for csv).
Lets assume for simplicity you have the file example.csv as follows:
A.B, A.C, A.D
a,3,1
You can do something like this:
val scehma = StructType(Seq(StructField("A_B",StringType),StructField("A_C", IntegerType), StructField("AD", IntegerType)))
val df = spark.read.option("header","true").schema(scehma).csv("example.csv")
df.show()
+---+---+---+
|A_B|A_C| AD|
+---+---+---+
| a| 3| 1|
+---+---+---+
IF you do not know the info in advance you can use infer schema as you did before, then you can use the dataframe to generate the schema:
val fields = for {
x <- df.schema
} yield StructField(x.name.replaceAll("\\.",""), x.dataType, x.nullable)
val schema = StructType(fields)
and the reread the dataframe using that schema as before