Not able to insert values using Spark SQL - Scala

I need to insert some values into my Hive table using Spark SQL. I'm using the code below:
import java.time.LocalDateTime

val filepath: String = "/user/usename/filename.csv"
val fileName: String = filepath
val result = fileName.split("/")
val fn = result(3)            // filename
val e = LocalDateTime.now()   // timestamp
First I tried using INSERT INTO ... VALUES, but then I found this feature is not available in Spark SQL:
val ds = sparksession.sql(s"insert into mytable (filepath, filename, Start_Time) values ('${filepath}', '${fn}', '${e}')")
Is there any other way to insert these values using Spark SQL? (mytable is empty; I need to load this table every day.)

You can directly use the Spark DataFrame write API to insert data into the table.
If you do not have a Spark DataFrame yet, first create one using spark.createDataFrame(), then write the data as follows:
df.write.insertInto("name of hive table")
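For example, a minimal sketch, assuming the SparkSession is called sparksession and the Hive table dbname.mytable already exists with its columns in this order (the column names here are illustrative):
import java.time.LocalDateTime

val filepath = "/user/usename/filename.csv"
val fn = filepath.split("/")(3)        // filename.csv
val e = LocalDateTime.now().toString   // load timestamp

// Build a one-row DataFrame and append it to the existing Hive table.
val df = sparksession.createDataFrame(Seq((filepath, fn, e)))
  .toDF("file_path", "filename", "Start_Time")
df.write.insertInto("dbname.mytable")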

The code below worked for me. Since I need to use variables in my DataFrame, I first created a DataFrame from the selected data, then saved it into the Hive table using df.write.insertInto(tablename).
val filepath: String = "/user/usename/filename.csv"
val fileName: String = filepath
val result = fileName.split("/")
val fn = result(3)            // filename
val e = LocalDateTime.now()   // timestamp
val df1 = sparksession.sql(s"select '${filepath}' as file_path, '${fn}' as filename, '${e}' as Start_Time")
df1.write.insertInto("dbname.tablename")
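One thing to keep in mind: insertInto resolves columns by position rather than by name, so the SELECT above has to produce the columns in the same order as they are defined in the Hive table, and the table must already exist before the write.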

Related

Convert header (column names) to new dataframe

I have a dataframe with headers, for example outputDF. I now want to take outputDF.columns and create a new dataframe with just one row, which contains the column names.
I then want to union both these dataframes and write the result, with option("header", "false"), to HDFS.
How do I do that?
Below is an example:
val df = spark.read.csv("path")
val newDf = df.columns.toSeq.toDF
val unionDf = df.union(newDf)
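A possible sketch of what the question is after, assuming Spark 2.x with a SparkSession named spark: build a one-row DataFrame from the column names, cast the data columns to string so the schemas line up, and union the two before writing (the output path is illustrative).
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val df = spark.read.option("header", "true").csv("path")

// One row whose values are the column names, with an all-string schema.
val headerSchema = StructType(df.columns.map(c => StructField(c, StringType, nullable = true)))
val headerDf = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row.fromSeq(df.columns.toSeq))),
  headerSchema)

// Cast every data column to string so union() sees matching types.
val stringDf = df.select(df.columns.map(c => df(c).cast("string")): _*)

val unionDf = headerDf.union(stringDf)
unionDf.write.option("header", "false").csv("hdfs_output_path")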

How to use a string as an expression/argument in Scala/Spark?

I am trying to add a lot more columns to a dataframe using its existing columns. However, Scala dataframes are immutable, which makes it difficult to do this iteratively. So I came up with a for loop that outputs a string (see the sample code below, which stores the entire statement I want to apply to the Spark dataframe).
val train_df = sqlContext.sql("select * from someTable")
/*for loop output is similar to the Str variable as below*/
var Str = ".withColumn(\"newCol1\",$\"col1\").withColumn(\"newCol2\",$\"col2\").withColumn(\"newCol3\",$\"col3\")"
/* Below is what I am trying to do */
val train_df_new = train_df.Str
So, how can I save the expression/argument in a string and reuse it in scala/spark to add all those new columns at once to a new dataframe?
Use a foldLeft instead. Here a Map with the old and new column names is used:
val m = Map(("col1", "newCol1"), ("col2", "newCol2"), ("col3", "newCol3"))
val train_df_new = m.keys.foldLeft(train_df)((df, c) => df.withColumnRenamed(c, m(c)))
Instead of withColumnRenamed, any other per-column transformation of the dataframe can be plugged in here.
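For instance, a hedged variant closer to the original intent of adding new columns (rather than renaming), using the same foldLeft idea:
// Keep the old columns and add new ones that copy their values.
val m = Map("col1" -> "newCol1", "col2" -> "newCol2", "col3" -> "newCol3")
val train_df_added = m.foldLeft(train_df) { case (df, (oldCol, newCol)) =>
  df.withColumn(newCol, df(oldCol))
}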

How to create a dataframe using the value of another dataframe

I am getting the suppId DataFrame using the code below.
val suppId = sqlContext.sql("SELECT supp_id FROM supplier")
The DataFrame returns a single value or multiple values.
Now I want to create a DataFrame using the value of supp_id from the suppId DataFrame, but I don't understand how to write this.
I have written the code below, but it is not working.
val nonFinalPE = sqlContext.sql("select * from pmt_expr")
nonFinalPE.where("supp_id in suppId(supp_id)")
It took me a second to figure out what you're trying to do. But it looks like you want rows from nonFinalPE that are also in suppId. You'd get this by doing an inner join of the two data frames, which would look like below:
val suppId = sqlContext.sql("SELECT supp_id FROM supplier")
val nonFinalPE = sqlContext.sql("select * from pmt_expr")
val joinedDF = nonFinalPE.join(suppId, nonFinalPE("???") === suppId("supp_id"), "inner")
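If the goal is just the "supp_id in (...)" filter, a left-semi join is a hedged alternative that keeps only nonFinalPE's columns; the column name supp_id in pmt_expr is an assumption here:
// Rows of nonFinalPE whose supp_id appears in suppId; no columns from suppId are kept.
val filtered = nonFinalPE.join(suppId, nonFinalPE("supp_id") === suppId("supp_id"), "leftsemi")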

Extract value from Scala TimestampType

I have a schemaRDD created from a Hive query:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val rdd = sqlContext.sql("Select * from mytime")
My RDD contains the following schema
StructField(id,StringType,true)
StructField(t,TimestampType,true)
We have our own custom database and want to save the TimestampType as a string, but I could not find a way to extract the value and save it as a string.
Can you help? Thanks!
What happens if you change your query to:
SELECT id, cast(t as STRING) from mytime
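If the conversion has to happen on the Scala side instead, a minimal sketch (assuming the schemaRDD above, with id as the first field and t as the second) pulls the value out of each Row:
// Convert each row's TimestampType value to a plain string.
val asStrings = rdd.map { row =>
  val id = row.getString(0)
  val t  = row(1)                      // java.sql.Timestamp for TimestampType
  (id, if (t == null) null else t.toString)
}
asStrings.take(5).foreach(println)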

Spark SQL Scala - Fetch column names in JDBCRDD

I am new to Spark and Scala. I am trying to fetch the contents of a procedure in SQL Server to use in Spark SQL. For that, I am importing the data via JdbcRDD in Scala (Eclipse) and making an RDD from the procedure.
After creating the RDD, I register it as a temporary table and then use sqlContext.sql("Select query to select particular columns"). But when I enter column names in the select query, it throws an error, as I do not have column names in either the RDD or the temporary table.
Please find below my code:
val driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
val url = XXXX
val username = XXXX
val password = XXXX
val query = "select A, B, C, D from Random_Procedure where ID_1 = ? and ID_2 = ?"
// New SparkContext
val sc = new SparkConf().setMaster("local").setAppName("Amit")
val sparkContext = new SparkContext(sc)
val rddData = new JdbcRDD(sparkContext,
  () => DriverManager.getConnection(url, username, password),
  query, 1, 0, 1,
  (x: ResultSet) => x.getString("A") + ", " + x.getString("B") + ", " +
    x.getString("C") + ", " + x.getString("D")).cache()
val sqlContext = new SQLContext(sparkContext)
import sqlContext.implicits._
val dataFrame = rddData.toDF
dataFrame.registerTempTable("Data")
sqlContext.sql("select A from Data").collect.foreach(println)
When I run this code, it throws an error: cannot resolve 'code' given input columns _1;
But when I run:
sqlContext.sql("select * from Data").collect.foreach(println)
It prints all columns A, B, C, D
I believe I did not fetch column names in the JdbcRDD that I created, hence they are not accessible in the temporary table. I need help.
The problem is that you create a JdbcRDD object while you need a DataFrame. An RDD simply doesn't contain information about the mapping from your tuples to column names. So you should create a DataFrame from a JDBC source, as explained in the programming guide. Moreover:
Spark SQL also includes a data source that can read data from other
databases using JDBC. This functionality should be preferred over
using JdbcRDD
Also notice that DataFrames were added in Spark 1.3.0. If you use an older version, you have to work with org.apache.spark.sql.SchemaRDD.
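A minimal sketch of the DataFrame-based approach (the JDBC data source via sqlContext.read, available since Spark 1.4); the values for ID_1 and ID_2 are placeholders here, since this data source does not take the JdbcRDD-style ? bounds:
// Read the result set through the JDBC data source; the query is wrapped
// as a derived table so it can be passed as "dbtable".
val jdbcDF = sqlContext.read.format("jdbc")
  .option("url", url)
  .option("driver", driver)
  .option("user", username)
  .option("password", password)
  .option("dbtable",
    "(select A, B, C, D from Random_Procedure where ID_1 = 1 and ID_2 = 2) as src")
  .load()

jdbcDF.registerTempTable("Data")
sqlContext.sql("select A from Data").collect().foreach(println)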