How to create a RDD from RC file using data which is partitioned in the hive table - scala

CREATE TABLE employee_details(
emp_first_name varchar(50),
emp_last_name varchar(50),
emp_dept varchar(50)
)
PARTITIONED BY (
emp_doj varchar(50),
emp_dept_id int )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat';
Location of the hive table stored is /data/warehouse/employee_details
I have a hive table employee loaded with data and is partitioned by emp_doj ,emp_dept_id and FileFormat is RC file format.
I would like to process the data in the table using the spark-sql without using the hive-context(simply using sqlContext).
Could you please help me in how to load partitioned data of the hive table into an RDD and convert to DataFrame

If you are using Spark 2.0, you can do it in this way.
val spark = SparkSession
.builder()
.appName("Spark Hive Example")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate()
import spark.implicits._
import spark.sql
// Queries are expressed in HiveQL
sql("SELECT * FROM src").show()

Related

Create a snowflake table from pyspark using SQLContext

I want to create a snowflake table from my pyspark code as below:
import pyspark.sql import SparkSession
import pyspark.sql.context import SQLContext
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
sqlContext.sql("create or replace table NEW_TABLE (id integer, desc varchar)")
I am getting this error
Your code will create Hive table not Snowflake table. You'd have to write by dataframe like this
sfOptions = {
'sfUrl': '...',
'sfUser': '...',
'sfPassword': '...',
...
}
(df
.write
.format('snowflake')
.mode(mode)
.options(**sfOptions)
.save()
)
OR, if you really want to trigger a single Snowflake query from Spark, you can use Snowflake runQuery API
query = "create or replace table NEW_TABLE (id integer, desc varchar)"
spark._jvm.net.snowflake.spark.snowflake.Utils.runQuery(sfOptions, query)

Spark dataframe cast column for Kudu compatibility

(I am new to Spark, Impala and Kudu.) I am trying to copy a table from an Oracle DB to an Impala table having the same structure, in Spark, through Kudu. I am getting an error when the code tries to map an Oracle NUMBER to a Kudu data type. How can I change the data type of a Spark DataFrame to make it compatible with Kudu?
This is intended to be a 1-to-1 copy of data from Oracle to Impala. I have extracted the Oracle schema of the source table and created a target Impala table with the same structure (same column names and a reasonable mapping of data types). I was hoping that Spark+Kudu would map everything automatically and just copy the data. Instead, Kudu complains that it cannot map DecimalType(38,0).
I would like to specify that "column #1, with name SOME_COL, which is a NUMBER in Oracle, should be mapped to a LongType, which is supported in Kudu".
How can I do that?
// This works
val df: DataFrame = spark.read
.option("fetchsize", 10000)
.option("driver", "oracle.jdbc.driver.OracleDriver")
.jdbc("jdbc:oracle:thin:#(DESCRIPTION=...)", "SCHEMA.TABLE_NAME", partitions, props)
// This does not work
kuduContext.insertRows(df.toDF(colNamesLower: _*), "impala::schema.table_name")
// Error: No support for Spark SQL type DecimalType(38,0)
// See https://github.com/cloudera/kudu/blob/master/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/SparkUtil.scala
// So let's see the Spark data types
df.dtypes.foreach{case (colName, colType) => println(s"$colName: $colType")}
// Spark data type: SOME_COL DecimalType(38,0)
// Oracle data type: SOME_COL NUMBER -- no precision specifier; values are int/long
// Kudu data type: SOME_COL BIGINT
Apparently, we can specify a custom schema when reading from a JDBC data source.
connectionProperties.put("customSchema", "id DECIMAL(38, 0), name STRING")
val jdbcDF3 = spark.read
.jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)
That worked. I was able to specify a customSchema like so:
col1 Long, col2 Timestamp, col3 Double, col4 String
and with that, the code works:
import spark.implicits._
val df: Dataset[case_class_for_table] = spark.read
.option("fetchsize", 10000)
.option("driver", "oracle.jdbc.driver.OracleDriver")
.jdbc("jdbc:oracle:thin:#(DESCRIPTION=...)", "SCHEMA.TABLE_NAME", partitions, props)
.as[case_class_for_table]
kuduContext.insertRows(df.toDF(colNamesLower: _*), "impala::schema.table_name")

How to add a new column to an existing dataframe while also specifying the datatype of it?

I have a dataframe: yearDF obtained from reading an RDBMS table on Postgres which I need to ingest in a Hive table on HDFS.
val yearDF = spark.read.format("jdbc").option("url", connectionUrl)
.option("dbtable", s"(${execQuery}) as year2017")
.option("user", devUserName)
.option("password", devPassword)
.option("numPartitions",10)
.load()
Before ingesting it, I have to add a new column: delete_flag of datatype: IntegerType to it. This column is used to mark a primary-key whether the row is deleted in the source table or not.
To add a new column to an existing dataframe, I know that there is the option: dataFrame.withColumn("del_flag",someoperation) but there is no such option to specify the datatype of new column.
I have written the StructType for the new column as:
val delFlagColumn = StructType(List(StructField("delete_flag", IntegerType, true)))
But I don't understand how to add this column with the existing dataFrame: yearDF. Could anyone let me know how to add a new column along with its datatype, to an existing dataFrame ?
import org.apache.spark.sql.types.IntegerType
df.withColumn("a", lit("1").cast(IntegerType)).show()
Though casting is not required if you are passing lit(1) as spark will infer the schema for you. But if you are passing as lit("1") it will cast it to Int

Create Hive Table from Spark using API, rather than SQL?

I want to create a hive table with partitions.
The schema for the table is:
val schema = StructType(StructField(name,StringType,true),StructField(age,IntegerType,true))
I can do this with Spark-SQL using:
val query = "CREATE TABLE some_new_table (name string, age integer) USING org.apache.spark.sql.parquet OPTIONS (path '<some_path>') PARTITIONED BY (age)"
spark.sql(query)
When I try to do with Spark API (using Scala), the table is filled with data. I only want to create an empty table and define partitions. This is what I am doing, what I am doing wrong :
val df = spark.createDataFrame(sc.emptyRDD[Row], schema)
val options = Map("path" -> "<some_path>", "partitionBy" -> "age")
df.sqlContext().createExternalTable("some_new_table", "org.apache.spark.sql.parquet", schema, options);
I am using Spark-2.1.1.
If you skip partitioning. can try with saveAsTable:
spark.createDataFrame(sc.emptyRDD[Row], schema)
.write
.format("parquet")
//.partitionBy("age")
.saveAsTable("some_new_table")
Spark partitioning and Hive partitioning are not compatible, so if you want access from Hive you have to use SQL: https://issues.apache.org/jira/browse/SPARK-14927

SparkSQL - Read parquet file directly

I am migrating from Impala to SparkSQL, using the following code to read a table:
my_data = sqlContext.read.parquet('hdfs://my_hdfs_path/my_db.db/my_table')
How do I invoke SparkSQL above, so it can return something like:
'select col_A, col_B from my_table'
After creating a Dataframe from parquet file, you have to register it as a temp table to run sql queries on it.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.parquet("src/main/resources/peopleTwo.parquet")
df.printSchema
// after registering as a table you will be able to run sql queries
df.registerTempTable("people")
sqlContext.sql("select * from people").collect.foreach(println)
With plain SQL
JSON, ORC, Parquet, and CSV files can be queried without creating the table on Spark DataFrame.
//This Spark 2.x code you can do the same on sqlContext as well
val spark: SparkSession = SparkSession.builder.master("set_the_master").getOrCreate
spark.sql("select col_A, col_B from parquet.`hdfs://my_hdfs_path/my_db.db/my_table`")
.show()
Suppose that you have the parquet file ventas4 in HDFS:
hdfs://localhost:9000/sistgestion/sql/ventas4
In this case, the steps are:
Charge the SQL Context:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Read the parquet File:
val ventas=sqlContext.read.parquet("hdfs://localhost:9000/sistgestion/sql/ventas4")
Register a temporal table:
ventas.registerTempTable("ventas")
Execute the query (in this line you can use toJSON to pass a JSON format or you can use collect()):
sqlContext.sql("select * from ventas").toJSON.foreach(println(_))
sqlContext.sql("select * from ventas").collect().foreach(println(_))
Use the following code in intellij:
def groupPlaylistIds(): Unit ={
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession.builder.appName("FollowCount")
.master("local[*]")
.getOrCreate()
val sc = spark.sqlContext
val d = sc.read.format("parquet").load("/Users/CCC/Downloads/pq/file1.parquet")
d.printSchema()
val d1 = d.select("col1").filter(x => x!='-')
val d2 = d1.filter(col("col1").startsWith("searchcriteria"));
d2.groupBy("col1").count().sort(col("count").desc).show(100, false)
}