Spark SQL Scala - Fetch column names in JDBCRDD

I am new to Spark and Scala. I am trying to fetch the contents of a procedure in SQL Server to use in Spark SQL. For that, I am importing the data via a JdbcRDD in Scala (Eclipse) and making an RDD from the procedure.
After creating the RDD, I register it as a temporary table and then use sqlContext.sql("Select query to select particular columns"). But when I enter column names in the select query, it throws an error because I do not have column names in either the RDD or the temporary table.
Please find below my code:
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.sql.SQLContext

val driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
val url = XXXX
val username = XXXX
val password = XXXX
val query = "select A, B, C, D from Random_Procedure where ID_1 = ? and ID_2 = ?"
// New SparkContext
val sc = new SparkConf().setMaster("local").setAppName("Amit")
val sparkContext = new SparkContext(sc)
val rddData = new JdbcRDD(sparkContext,
  () => DriverManager.getConnection(url, username, password),
  query, 1, 0, 1,
  (x: ResultSet) => x.getString("A") + ", " + x.getString("B") + ", " +
    x.getString("C") + ", " + x.getString("D")).cache()
val sqlContext = new SQLContext(sparkContext)
import sqlContext.implicits._
val dataFrame = rddData.toDF
dataFrame.registerTempTable("Data")
sqlContext.sql("select A from Data").collect.foreach(println)
When I run this code, it throws an error: cannot resolve 'code' given input columns _1;
But when I run:
sqlContext.sql("select * from Data").collect.foreach(println)
It prints all columns A, B, C, D
I believe I did not fetch column names in the JdbcRDD that I created, hence they are not accessible in the temporary table. I need help.

The problem is that you create a JdbcRDD object while you need a DataFrame. An RDD simply doesn't contain information about the mapping from your tuples to column names, so you should create the DataFrame from a JDBC source as explained in the programming guide. Moreover:
Spark SQL also includes a data source that can read data from other
databases using JDBC. This functionality should be preferred over
using JdbcRDD
Also note that DataFrames were added in Spark 1.3.0. If you use an older version, you have to work with org.apache.spark.sql.SchemaRDD.
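For example, here is a minimal sketch of reading through the JDBC data source instead (assuming Spark 1.4+, where sqlContext.read is available; the JDBC source does not accept ? parameters, so the filter values below are hard-coded placeholders):
// The subquery alias lets Spark infer the schema, including the column names A, B, C, D.
val jdbcDF = sqlContext.read
  .format("jdbc")
  .option("driver", driver)
  .option("url", url)
  .option("user", username)
  .option("password", password)
  .option("dbtable", "(select A, B, C, D from Random_Procedure where ID_1 = 1 and ID_2 = 1) tmp")
  .load()

jdbcDF.registerTempTable("Data")
sqlContext.sql("select A from Data").collect.foreach(println)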

Related

Not able to insert Value using SparkSql

I need to insert some values into my Hive table using Spark SQL. I'm using the code below.
import java.time.LocalDateTime
val filepath: String = "/user/usename/filename.csv"
val fileName: String = filepath
val result = fileName.split("/")
val fn = result(3)            // filename
val e = LocalDateTime.now()   // timestamp
First I tried Insert Into ... Values, but then I found this feature is not available in Spark SQL:
val ds = sparksession.sql(s"insert into mytable (filepath, filename, Start_Time) values ('${filepath}', '${fn}', '${e}')")
Is there any other way to insert these values using Spark SQL? (mytable is empty; I need to load this table every day.)
You can directly use the Spark DataFrame write API to insert data into the table.
If you do not have a Spark DataFrame yet, first create one using spark.createDataFrame(), then write the data as follows:
df.write.insertInto("name of hive table")
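For example, a minimal sketch (assuming Spark 2.x with the sparksession used above in scope, and hypothetical column names that match the target table's column order):
import java.time.LocalDateTime

val filepath = "/user/usename/filename.csv"
val fn = filepath.split("/")(3)        // filename
val e = LocalDateTime.now().toString   // timestamp rendered as a string

// Build a one-row DataFrame whose columns are in the same order as the Hive table,
// then append it with insertInto (the table must already exist).
val df = sparksession.createDataFrame(Seq((filepath, fn, e)))
  .toDF("filepath", "filename", "Start_Time")

df.write.insertInto("dbname.tablename")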
The code below worked for me. Since I needed to use variables in my DataFrame, I first created a DataFrame from the selected data and then saved it into the Hive table using df.write.insertInto(tablename).
import java.time.LocalDateTime
val filepath: String = "/user/usename/filename.csv"
val fileName: String = filepath
val result = fileName.split("/")
val fn = result(3)            // filename
val e = LocalDateTime.now()   // timestamp
val df1=sparksession.sql(s" select '${filepath}' as file_path,'${fn}' as filename,'${e}' as Start_Time")
df1.write.insertInto("dbname.tablename")

spark scala reducekey dataframe operation

I'm trying to do a count in Scala with a DataFrame. My data has 3 columns, and I've already loaded the data and split it by tab. So I want to do something like this:
val file1 = file.map(line => line.split("\t"))
val x = file1.map(line => (line(0), line(2).toInt)).reduceByKey(_ + _, 1)
I want to put the data in a DataFrame, but I'm having some trouble with the syntax:
val file = file.map(line=>line.split("\t")).toDF
val file.groupby(line(0))
.count()
Can someone help check if this is correct?
Spark needs to know the schema of the DataFrame.
There are many ways to specify the schema; here is one option (a second option using a case class is sketched after this snippet):
val df = file
  .map(line => line.split("\t"))
  .map(l => (l(0), l(1).toInt)) // at this point Spark knows the number of columns and their types
  .toDF("a", "b")               // give the columns names for ease of use
df
  .groupBy("a")
  .count()
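Another option, sketched here under the assumption that the usual implicits import (e.g. import spark.implicits._) is in scope, is to derive the schema from a case class, which also names the columns:
case class Record(a: String, b: Int)   // hypothetical field names

val df = file
  .map(line => line.split("\t"))
  .map(l => Record(l(0), l(1).toInt))  // schema (names and types) comes from Record
  .toDF()

df.groupBy("a").count().show()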

Lookup in Spark dataframes

I am using Spark 1.6 and I would like to know how to implement a lookup in DataFrames.
I have two dataframes employee & department.
Employee Dataframe
-------------------
Emp Id | Emp Name
------------------
1 | john
2 | David
Department Dataframe
--------------------
Dept Id | Dept Name | Emp Id
-----------------------------
1 | Admin | 1
2 | HR | 2
I would like to lookup emp id from the employee table to the department table and get the dept name. So, the resultset would be
Emp Id | Dept Name
-------------------
1 | Admin
2 | HR
How do I implement this lookup UDF feature in Spark? I don't want to use JOIN on both the dataframes.
As already mentioned in the comments, joining the dataframes is the way to go.
You can use a lookup, but I think there is no "distributed" solution, i.e. you have to collect the lookup-table into driver memory. Also note that this approach assumes that EmpID is unique:
import org.apache.spark.sql.functions._
import sqlContext.implicits._
import scala.collection.Map
val emp = Seq((1,"John"),(2,"David"))
val deps = Seq((1,"Admin",1),(2,"HR",2))
val empRdd = sc.parallelize(emp)
val depsDF = sc.parallelize(deps).toDF("DepID","Name","EmpID")
val lookupMap = empRdd.collectAsMap()
def lookup(lookupMap:Map[Int,String]) = udf((empID:Int) => lookupMap.get(empID))
val combinedDF = depsDF
  .withColumn("empNames", lookup(lookupMap)($"EmpID"))
My initial thought was to pass the empRdd to the UDF and use the lookup method defined on PairRDD, but this of course does not work because you cannot have Spark actions (i.e. lookup) within transformations (i.e. the UDF).
EDIT:
If your empDf has multiple columns (e.g. Name, Age), you can use this:
val empRdd = empDf.rdd.map { row =>
  (row.getInt(0), (row.getString(1), row.getInt(2)))
}
val lookupMap = empRdd.collectAsMap()
def lookup(lookupMap: Map[Int, (String, Int)]) =
  udf((empID: Int) => lookupMap.lift(empID))
depsDF
  .withColumn("lookup", lookup(lookupMap)($"EmpID"))
  .withColumn("empName", $"lookup._1")
  .withColumn("empAge", $"lookup._2")
  .drop($"lookup")
  .show()
Since you say you already have DataFrames, it's pretty easy; follow these steps:
1) Create a SQLContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
2) Create temporary tables for the DataFrames, e.g.:
EmployeeDataframe.createOrReplaceTempView("EmpTable")
3) Query using SQL:
val MatchingDetails = sqlContext.sql("SELECT DISTINCT E.EmpID, DeptName FROM EmpTable E inner join DeptTable G on " +
  "E.EmpID=G.EmpID")
Starting with some "lookup" data, there are two approaches:
Method #1 -- using a lookup DataFrame
// use a DataFrame (via a join)
val lookupDF = sc.parallelize(Seq(
  ("banana", "yellow"),
  ("apple", "red"),
  ("grape", "purple"),
  ("blueberry", "blue")
)).toDF("SomeKeys", "SomeValues")
Method #2 -- using a map in a UDF
// turn the above DataFrame into a map which a UDF uses
val Keys = lookupDF.select("SomeKeys").collect().map(_(0).toString).toList
val Values = lookupDF.select("SomeValues").collect().map(_(0).toString).toList
val KeyValueMap = Keys.zip(Values).toMap
def ThingToColor(key: String): String = {
  if (key == null) return ""
  val firstword = key.split(" ")(0) // fragile!
  val result: String = KeyValueMap.getOrElse(firstword, "not found!")
  return (result)
}
val ThingToColorUDF = udf( ThingToColor(_: String): String )
Take a sample data frame of things that will be looked up:
val thingsDF = sc.parallelize(Seq(
  ("blueberry muffin"),
  ("grape nuts"),
  ("apple pie"),
  ("rutabaga pudding")
)).toDF("SomeThings")
Method #1 is to join on the lookup DataFrame
Here, rlike does the matching, and null appears where there is no match. Both columns of the lookup DataFrame get added.
val result_1_DF = thingsDF.join(lookupDF, expr("SomeThings rlike SomeKeys"),
"left_outer")
Method #2 is to add a column using the UDF
Here, only one column is added, and the UDF can return a non-null value. However, if the lookup data is very large, it may fail to "serialize" as required to send it to the workers in the cluster.
val result_2_DF = thingsDF.withColumn("AddValues",ThingToColorUDF($"SomeThings"))
Which gives you each thing with its looked-up color in the new AddValues column.
In my case I had some lookup data that was over 1 million values, so Method #1 was my only choice.
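A side note, not from the original answer: when the map is large but still fits in executor memory, broadcasting it is a common way to avoid serializing it into every task. A rough sketch reusing the KeyValueMap and thingsDF defined above:
// Ship the lookup map to each executor once as a read-only broadcast variable.
val kvBroadcast = sc.broadcast(KeyValueMap)
val ThingToColorBcUDF = udf { key: String =>
  if (key == null) ""
  else kvBroadcast.value.getOrElse(key.split(" ")(0), "not found!")
}
val result_3_DF = thingsDF.withColumn("AddValues", ThingToColorBcUDF($"SomeThings"))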

Extract value from scala TimeStampType

I have a schemaRDD created from a hive query
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val rdd = sqlContext.sql("Select * from mytime")
My RDD contains the following schema
StructField(id,StringType,true)
StructField(t,TimestampType,true)
We have our own custom database and want to save the TimestampType value as a string, but I could not find a way to extract the value and save it as a string.
Can you help? Thanks!
What happens if you change your query to:
SELECT id, cast(t as STRING) from mytime
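If you would rather do the conversion on the Scala side, here is a rough sketch (assuming Spark 1.3+, where Row.getAs is available and mapping over the rows of the query result yields a plain RDD):
import java.sql.Timestamp

// Pull out each row's timestamp and render it as a string.
val asStrings = rdd.map { row =>
  val id = row.getString(0)
  val t = row.getAs[Timestamp](1)             // column "t" is a TimestampType
  (id, if (t == null) null else t.toString)   // e.g. "2015-01-01 12:34:56.0"
}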

Joining two HDFS files in Spark

I want to join two files from HDFS using the Spark shell.
Both files are tab-separated and I want to join on the second column.
I tried the code below, but it is not giving any output.
val ny_daily = sc.parallelize(List("hdfs://localhost:8020/user/user/NYstock/NYSE_daily"))
val ny_daily_split = ny_daily.map(line =>line.split('\t'))
val enKeyValuePair = ny_daily_split.map(line => (line(0).substring(0, 5), line(3).toInt))
val ny_dividend= sc.parallelize(List("hdfs://localhost:8020/user/user/NYstock/NYSE_dividends"))
val ny_dividend_split = ny_dividend.map(line =>line.split('\t'))
val enKeyValuePair1 = ny_dividend_split.map(line => (line(0).substring(0, 4), line(3).toInt))
enKeyValuePair1.join(enKeyValuePair)
But I am not getting any information on how to join the files on a particular column.
Please suggest.
I am not getting any information on how to join the files on a particular column
RDDs are joined on their keys, so you decided which column to join on when you wrote:
val enKeyValuePair = ny_daily_split.map(line => (line(0).substring(0, 5), line(3).toInt))
...
val enKeyValuePair1 = ny_dividend_split.map(line => (line(0).substring(0, 4), line(3).toInt))
Your RDDs will be joined on the values coming from line(0).substring(0, 5) and line(0).substring(0, 4).
You can find the join function (and many other useful functions) here, and the Spark Programming Guide is a great reference for understanding how Spark works.
I tried the code below, but it is not giving any output
In order to see the output, you have to ask Spark to print it:
enKeyValuePair1.join(enKeyValuePair).foreach(println)
Note: to load data from files you should use sc.textFile(); sc.parallelize() is only used to make RDDs out of Scala collections.
The following code should do the job:
val ny_daily_split = sc.textFile("hdfs://localhost:8020/user/user/NYstock/NYSE_daily").map(line =>line.split('\t'))
val ny_dividend_split = sc.textFile("hdfs://localhost:8020/user/user/NYstock/NYSE_dividends").map(line =>line.split('\t'))
val enKeyValuePair = ny_daily_split.map(line => line(0).substring(0, 5) -> line(3).toInt)
val enKeyValuePair1 = ny_dividend_split.map(line => line(0).substring(0, 4) -> line(3).toInt)
enKeyValuePair1.join(enKeyValuePair).foreach(println)
By the way, you mentioned that you want to join on the second column, but you are actually using line(0); is this intended?
Hope this helps!