Include Hive query in a Pyspark program - pyspark

I am new to Hive and I need to create a new Hive table via HiveContext in a PySpark program.
How do I do that? Please help me with this issue. Thanks a lot!

import pyspark
from pyspark.sql import HiveContext

# sc is the SparkContext provided by the pyspark shell
sqlCtx = HiveContext(sc)

# Build a DataFrame, expose it as a temporary table, then create a Hive table from it
test_df = sqlCtx.createDataFrame([(1, 'metric1', 10), (2, 'metric2', 20), (3, 'metric3', 30)], ['id', 'metric', 'score'])
test_df.registerTempTable("df_table")
sqlCtx.sql("CREATE TABLE df_hive_table AS SELECT * FROM df_table")

Related

Create a snowflake table from pyspark using SQLContext

I want to create a Snowflake table from my PySpark code as below:
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
sqlContext.sql("create or replace table NEW_TABLE (id integer, desc varchar)")
I am getting an error.
Your code will create a Hive table, not a Snowflake table. You would have to write via a DataFrame instead, like this:
sfOptions = {
    'sfUrl': '...',
    'sfUser': '...',
    'sfPassword': '...',
    ...
}

(df
    .write
    .format('snowflake')
    .mode(mode)  # e.g. 'overwrite' or 'append'
    .options(**sfOptions)
    .save()
)
OR, if you really want to trigger a single Snowflake query from Spark, you can use the Snowflake connector's runQuery API:
query = "create or replace table NEW_TABLE (id integer, desc varchar)"
spark._jvm.net.snowflake.spark.snowflake.Utils.runQuery(sfOptions, query)

Is it possible to reference a PySpark DataFrame using its rdd id?

If I "overwrite" a df using the same naming convention in PySpark such as in the example below, am I able to reference it later on using the rdd id?
df = spark.createDataFrame([('Abraham', 'Lincoln')], ['first_name', 'last_name'])
df.checkpoint()
df.show()
print(df.rdd.id())

from pyspark.sql.functions import concat_ws

df = df.select(df.first_name, df.last_name,
               concat_ws(' ', df.first_name, df.last_name).alias('full_name'))
df.checkpoint()
df.show()
print(df.rdd.id())

Writing data from Kafka to Hive using PySpark - stuck

I am quite new to Spark and started with PySpark; I am learning to push data from Kafka to Hive using PySpark.
from os.path import abspath

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json

warehouseLocation = abspath("spark-warehouse")
spark = SparkSession.builder.appName("sparkstreaming").getOrCreate()

df = (spark.read.format("kafka")
      .option("startingoffsets", "earliest")
      .option("kafka.bootstrap.servers", "kafka-server1:66,kafka-server2:66")
      .option("kafka.security.protocol", "SSL")
      .option("kafka.ssl.keystore.location", "mykeystore.jks")
      .option("kafka.ssl.keystore.password", "mykeystorepassword")
      .option("subscribe", "json_stream")
      .load()
      .selectExpr("CAST(value AS STRING)"))

json_schema = df.schema
df1 = df.select(from_json(col("value"), json_schema).alias("data")).select("data.*")
The above is not working. Once the data is extracted, I want to insert it into a Hive table.
As I am completely new to this, I am looking for help.
Appreciated in advance! :)
from os.path import expanduser, join, abspath
from pyspark.sql import SparkSession
from pyspark.sql import Row
# warehouse_location points to the default location for managed databases and tables
warehouse_location = abspath('spark-warehouse')
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()
# spark is an existing SparkSession
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

SparkSQL - Read parquet file directly

I am migrating from Impala to SparkSQL, using the following code to read a table:
my_data = sqlContext.read.parquet('hdfs://my_hdfs_path/my_db.db/my_table')
How do I invoke Spark SQL on the DataFrame above, so that I can run something like:
'select col_A, col_B from my_table'
After creating a DataFrame from a Parquet file, you have to register it as a temp table to run SQL queries on it.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.parquet("src/main/resources/peopleTwo.parquet")
df.printSchema
// after registering as a table you will be able to run sql queries
df.registerTempTable("people")
sqlContext.sql("select * from people").collect.foreach(println)
With plain SQL
JSON, ORC, Parquet, and CSV files can be queried without first creating a table or a DataFrame.
// Spark 2.x code; you can do the same with sqlContext as well
val spark: SparkSession = SparkSession.builder.master("set_the_master").getOrCreate
spark.sql("select col_A, col_B from parquet.`hdfs://my_hdfs_path/my_db.db/my_table`")
.show()
Suppose that you have the parquet file ventas4 in HDFS:
hdfs://localhost:9000/sistgestion/sql/ventas4
In this case, the steps are:
Load the SQL context:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Read the parquet File:
val ventas=sqlContext.read.parquet("hdfs://localhost:9000/sistgestion/sql/ventas4")
Register a temporary table:
ventas.registerTempTable("ventas")
Execute the query (on this line you can use toJSON to get JSON output, or use collect()):
sqlContext.sql("select * from ventas").toJSON.foreach(println(_))
sqlContext.sql("select * from ventas").collect().foreach(println(_))
Use the following code in IntelliJ:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

def groupPlaylistIds(): Unit = {
  Logger.getLogger("org").setLevel(Level.ERROR)

  val spark = SparkSession.builder.appName("FollowCount")
    .master("local[*]")
    .getOrCreate()
  val sc = spark.sqlContext

  val d = sc.read.format("parquet").load("/Users/CCC/Downloads/pq/file1.parquet")
  d.printSchema()

  // Drop placeholder rows, then keep only the rows of interest
  val d1 = d.select("col1").filter(col("col1") =!= "-")
  val d2 = d1.filter(col("col1").startsWith("searchcriteria"))
  d2.groupBy("col1").count().sort(col("count").desc).show(100, false)
}

How to create an RDD from an RC file using data that is partitioned in a Hive table

CREATE TABLE employee_details(
    emp_first_name varchar(50),
    emp_last_name varchar(50),
    emp_dept varchar(50)
)
PARTITIONED BY (
    emp_doj varchar(50),
    emp_dept_id int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat';
The Hive table is stored at /data/warehouse/employee_details.
I have a Hive table loaded with data; it is partitioned by emp_doj and emp_dept_id, and its file format is RCFile.
I would like to process the data in the table using Spark SQL without using the HiveContext (simply using sqlContext).
Could you please help me with how to load the partitioned data of the Hive table into an RDD and convert it to a DataFrame?
If you are using Spark 2.0, you can do it this way:
import java.io.File

import org.apache.spark.sql.SparkSession

// Location of the default warehouse for managed databases and tables
val warehouseLocation = new File("spark-warehouse").getAbsolutePath

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._
import spark.sql

// Queries are expressed in HiveQL
sql("SELECT * FROM src").show()