I am new to Scala and Spark. I have a requirement to create dataframes dynamically by reading a file, where each line of the file is a query, and at the end join all the dataframes and store the result in a file.
I wrote the basic code below, but I'm having trouble creating the dataframes dynamically.
import org.apache.spark.sql.SQLContext._
import org.apache.spark.sql.SQLConf
import scala.io.Source
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val empFile = "/user/sri/sample2.txt"
sqlContext.load("com.databricks.spark.csv", Map("path" -> empFile, "header" -> "true")).registerTempTable("emp")
var cnt=0;
val filename ="emp.sql"
for (line <- Source.fromFile(filename).getLines)
{
println(line)
cnt += 1
//var dis: String = "emp"+cnt
val "emp"+cnt = sqlContext.sql("SELECT \"totalcount\", count(*) FROM emp")
println(dis)
//val dis = sqlContext.sql("SELECT \"totalcount\", count(*) FROM emp")
}
println(cnt)
exit
Please help me, and suggest a better way to do this if there is one.
What error are you getting? I assume your code won't compile, considering this line:
val "emp"+cnt = sqlContext.sql("SELECT \"totalcount\", count(*) FROM emp")
In Scala, defining the name of some construct (val "emp"+cnt in your case) programmatically is not easily possible.
In your case, you could use a collection to hold the results.
val queries = (for (line <- Source.fromFile(filename).getLines) yield {
  sqlContext.sql("SELECT \"totalcount\", count(*) FROM emp")
}).toList
val cnt = queries.length
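Since the original goal was also to combine the per-query results and store them in a file, here is a minimal follow-up sketch. It assumes every line of emp.sql is a complete SQL statement whose results are union-compatible; unionAll is used because no join key is given, the output path is hypothetical, and it executes each line instead of the hard-coded SQL above.
val results = Source.fromFile(filename).getLines.toList.map(q => sqlContext.sql(q))
// Combine all per-query DataFrames (Spark 1.x API) and write them out via spark-csv.
val combined = results.reduce(_ unionAll _)
combined.write.format("com.databricks.spark.csv").option("header", "true").save("/user/sri/output")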
Related
I am trying to create a child dataframe from a parent dataframe, but I have more than 100 columns to select.
So, in the select statement, can I supply the columns from a file?
val Raw_input_schema=spark.read.format("text").option("header","true").option("delimiter","\t").load("/HEADER/part-00000").schema
val Raw_input_data=spark.read.format("text").schema(Raw_input_schema).option("delimiter","\t").load("/DATA/part-00000")
val filtered_data = Raw_input_data.select(all_cols)
How can I pass the column names from the file in all_cols?
I would assume you read the file from HDFS or from a shared config file? The reason is that, on a cluster, this code would be executed on individual nodes, etc.
In this case I would approach it with the next piece of code:
import scala.io.Source
import org.apache.spark.sql.functions.col
val lines = Source.fromFile("somefile.name.csv").getLines
val cols = lines.flatMap(_.split(",")).map(col(_)).toArray
val df3 = df2.select(cols: _*)
Essentially, you just have to provide an array of Column objects and use the : _* notation to pass it as a variable number of arguments.
Finally, this worked for me:
val Raw_input_schema=spark.read.format("csv").option("header","true").option("delimiter","\t").load("headerFile").schema
val Raw_input_data=spark.read.format("csv").schema(Raw_input_schema).option("delimiter","\t").load("dataFile")
// Option 1: read the column names as plain strings
val filtered_file = sc.textFile("filter_columns_file").map(cols => cols.split("\t")).flatMap(x => x).collect().toList
val final_df = Raw_input_data.select(filtered_file.head, filtered_file.tail: _*)
// Option 2: wrap the names as Column objects (requires import org.apache.spark.sql.Column)
val filtered_cols = sc.textFile(filterFile).map(cols => cols.split("\t")).flatMap(x => x).collect().toList.map(x => new Column(x))
val final_df = Raw_input_data.select(filtered_cols: _*)
In my code I turn a DataFrame into an RDD to run a function through a map call. Each run has a couple of seconds of overhead, so if I run the function call like this:
var iterator = 1
val inputDF = spark.sql("select * from DF")
var columnPosition = Array("Column_4")
columnPosition = columnPosition ++ Array("Column_9")
var selectedDF = inputDF
var intrimDF = inputDF
var finalDF = inputDF
while (iterator <= columnPosition.length) {
  selectedDF = finalDF.selectExpr("foreign_key", columnPosition(iterator - 1))
  intrimDF = selectedDF.rdd.map(x => (x(0), action(x(1)))).toDF.selectExpr("_1 as some_key", "_2 as " + columnPosition(iterator - 1)).joinBack.changeColumnPositions.dropOriginalColumn.renameColumns // placeholders for the join-back and column cleanup steps
  finalDF = intrimDF
  iterator = iterator + 1
}
It runs in ~90 seconds for a large job because of the join. What I am trying to do is build it like below, to cut out the join entirely and have it be dynamic.
val inputDF = spark.sql("select * from DF")
val intrimDF = inputDF.rdd.map(x=>(x(0),x(1),x(2),action(x(3)),x(4),x(5),x(6),x(7),action(x(8))))
val columnStatement = ??? // create an Array with the columnName changes
val finalDF = intrimDF.selectExpr(columnStatement :_*)
The issue is I can't get past the hard-coding side of the problem. Below is an example of what I want to try to do by dynamically setting the mapping call.
val mappingStatement = "x=>(x(0),x(1),x(2),action(x(3)),x(4),x(5),x(6),x(7),action(x(8)))"
val intrimDF = inputDF.rdd.map(mappingStatement)
Everything I have tried has failed:
1. Calling it using the map() function
2. Setting an Array and passing it as : _*
3. Trying to build it as a call, but it doesn't like being dynamic
Hope this makes sense!
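One way to cut out the join and still keep the map call dynamic is to rebuild each Row in a single pass, applying the function only at the chosen column positions. This is just a sketch under the assumption that action accepts the raw column value and returns something compatible with the original column type; names like targetIdx and transformedRDD are mine.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType
// Positions (indices) of the columns that should go through action().
val targetIdx = columnPosition.map(inputDF.columns.indexOf(_)).toSet
// Reuse the original schema, assuming action() preserves each column's type.
val schema: StructType = inputDF.schema
val transformedRDD = inputDF.rdd.map { row =>
  Row.fromSeq(row.toSeq.zipWithIndex.map { case (v, i) =>
    if (targetIdx.contains(i)) action(v) else v
  })
}
val finalDF = spark.createDataFrame(transformedRDD, schema)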
I would like to join two Spark Scala dataframes on multiple columns dynamically. I would like to avoid hard-coding the column name comparisons as shown in the following statement:
val joinRes = df1.join(df2, df1("col1") === df2("col1") && df1("col2") === df2("col2"))
A solution for this already exists in the PySpark version, provided in the following link:
PySpark DataFrame - Join on multiple columns dynamically
I would like to write the same code using Spark Scala.
In Scala you do it in a similar way to Python, but you need to use the map and reduce functions:
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._

val df1 = List(("a", "b"), ("b", "c"), ("c", "d")).toDF("col1", "col2")
val df2 = List(("1", "2"), ("2", "c"), ("3", "4")).toDF("col1", "col2")

val columnsdf1 = df1.columns
val columnsdf2 = df2.columns

val joinExprs = columnsdf1
  .zip(columnsdf2)
  .map { case (c1, c2) => df1(c1) === df2(c2) }
  .reduce(_ && _)

val dfJoinRes = df1.join(df2, joinExprs)
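If the join columns carry the same names in both DataFrames, a shorter variant (assuming a plain inner join is wanted) is to pass the shared names directly, which also drops the duplicated join columns from the result:
val dfJoinRes2 = df1.join(df2, columnsdf1.intersect(columnsdf2).toSeq)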
I have a case class in scala
case class TestDate (id: String, loginTime: java.sql.Date)
I created two RDDs of type TestDate.
I wanted to do an inner join on the two RDDs where the values of the loginTime column are equal. Please find the code snippet below:
firstRDD.toDF.registerTempTable("firstTable")
secondRDD.toDF.registerTempTable("secondTable")
val res = sqlContext.sql("select * from firstTable INNER JOIN secondTable on to_date(firstTable.loginTime) = to_date(secondTable.loginTime)")
I'm not getting any exception, but I'm not getting the correct answer either.
It does a cartesian join, and some random dates are generated in the result.
The issue was due to a wrong format given while creating the date object. When the format was rectified, it worked fine.
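For reference, a sketch of building the loginTime with an explicit format; the pattern and sample values here are assumptions, so adjust them to the real input.
import java.sql.Date
import java.text.SimpleDateFormat
// Assumed input pattern "yyyy-MM-dd"; use whatever matches the source strings.
val fmt = new SimpleDateFormat("yyyy-MM-dd")
val login = new Date(fmt.parse("2017-03-21").getTime)
val record = TestDate("user1", login)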
You can try using another approach:
val df1 = firstRDD.toDF
val df2 = secondRDD.toDF
val res = df1.join(df2, Seq("loginTime"))
If it doesn't work, you can try casting your dates to string:
import org.apache.spark.sql.functions.col

val df1 = firstRDD.toDF.withColumn("loginTimeStr", col("loginTime").cast("string"))
val df2 = secondRDD.toDF.withColumn("loginTimeStr", col("loginTime").cast("string"))
val res = df1.join(df2, Seq("loginTimeStr"))
Finally, maybe the problem is that you also need the ID column in the join?
val df1 = firstRDD.toDF
val df2 = secondRDD.toDF
val res = df1.join(df2, Seq("id", "loginTime"))
With Spark 1.6, I am trying to save arrays to a Hive table myTable consisting of two columns, each of type array<double>:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._
val x = Array(1.0,2.0,3.0,4.0)
val y = Array(-1.0,-2.0,-3.0,-4.0)
val mySeq = Seq(x,y)
val df = sc.parallelize(mySeq).toDF("x","y")
df.write.insertInto("myTable")
But then I get the message:
error: value toDF is not a member of org.apache.spark.rdd.RDD[Array[Double]]
val df = sc.parallelize(mySeq).toDF("x","y")
What is the correct way to do this simple task?
I'm assuming the actual structure you're going for looks like this:
x|y
1.0|-1.0
2.0|-2.0
3.0|-3.0
4.0|-4.0
For this the code you want is this:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._
val x = Array(1.0,2.0,3.0,4.0)
val y = Array(-1.0,-2.0,-3.0,-4.0)
val mySeq = x.zip(y)
val df = sc.parallelize(mySeq).toDF("x","y")
df.write.insertInto("myTable")
Essentially you need a collection of row-like objects (e.g. an Array of tuples). It'd be better to use a case class, as mentioned in another comment, as opposed to just the tuple; a sketch of that variant follows.
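A minimal sketch of that case-class variant (the name Point is hypothetical, and it assumes the same zipped data as above):
// A case class gives the DataFrame named, typed columns.
case class Point(x: Double, y: Double)
val pointSeq = x.zip(y).map { case (a, b) => Point(a, b) }
val pointDF = sc.parallelize(pointSeq).toDF()  // column names come from the case class fields
pointDF.write.insertInto("myTable")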