PySpark - read recursive Hive table

I have a Hive table that has multiple sub-directories in HDFS, something like:
/hdfs_dir/my_table_dir/my_table_sub_dir1
/hdfs_dir/my_table_dir/my_table_sub_dir2
...
Normally I set the following parameters before I run a Hive script:
set hive.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;
set hive.supports.subdirectories=true;
set mapred.input.dir.recursive=true;
select * from my_db.my_table;
I'm trying to do the same using PySpark:
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = (SparkConf().setAppName("My App")
        ...
        .set("hive.input.dir.recursive", "true")
        .set("hive.mapred.supports.subdirectories", "true")
        .set("hive.supports.subdirectories", "true")
        .set("mapred.input.dir.recursive", "true"))
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
my_table = sqlContext.sql("select * from my_db.my_table")
but I end up with an error like:
java.io.IOException: Not a file: hdfs://hdfs_dir/my_table_dir/my_table_sub_dir1
What's the correct way to read a Hive table with sub-directories in Spark?

What I have found is that these values must be prefixed with spark, as in:
.set("spark.hive.mapred.supports.subdirectories","true")
.set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive","true")

Try setting them through sqlContext.sql() before executing the query:
sqlContext.sql("SET hive.mapred.supports.subdirectories=true")
sqlContext.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")
my_table = sqlContext.sql("select * from my_db.my_table")

Try setting them through the SparkSession used to execute the query:
from pyspark.sql import SparkSession

sparkSession = (SparkSession
                .builder
                .appName('USS - Unified Scheme of Sells')
                .config("hive.metastore.uris", "thrift://probighhwm001:9083")
                .config("hive.input.dir.recursive", "true")
                .config("hive.mapred.supports.subdirectories", "true")
                .config("hive.supports.subdirectories", "true")
                .config("mapred.input.dir.recursive", "true")
                .enableHiveSupport()
                .getOrCreate()
                )
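With Hive support enabled on that session, the query can then be run directly on it; a minimal follow-up, reusing my_db.my_table from the question:
my_table = sparkSession.sql("select * from my_db.my_table")
my_table.show()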

Related

Trying to create Data frame from a file with delimiter '|'

I want to load a text file that uses the delimiter "|" into a DataFrame in Spark.
One way is to create an RDD and use toDF to build the DataFrame. However, I was wondering if I can create the DataFrame directly.
As of now I am using the command below:
val productsDF = sqlContext.read.text("/user/danishdshadab786/paper2/products/")
For Spark 2.x
val df = spark.read.format("csv")
.option("delimiter", "|")
.load("/user/danishdshadab786/paper2/products/")
For Spark < 2.0
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("delimiter", "|")
.load("/user/danishdshadab786/paper2/products/")
You can add more options, such as option("header", "true") to read the header, in the same statement.
You can specify the delimiter in the 'read' options:
spark.read
.option("delimiter", "|")
.csv("/user/danishdshadab786/paper2/products/")

Create dataframe with header using header and data file

I have two files, data.csv and headers.csv. I want to create a DataFrame in Spark/Scala that uses the headers.
var data = spark.sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(data_path)
Can you help me customize the above lines to do this?
You can read headers.csv using the above method and then use the schema of the headers DataFrame to read data.csv, as below:
val headersDF = sqlContext
.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("path to headers.csv")
val schema = headersDF.schema
val dataDF = sqlContext
.read
.format("com.databricks.spark.csv")
.schema(schema)
.load("path to data.csv")
I hope the answer is helpful

Spark JDBC returning dataframe only with column names

I am trying to connect to a Hive table using Spark JDBC, with the following code:
val df = spark.read.format("jdbc").
option("driver", "org.apache.hive.jdbc.HiveDriver").
option("user","hive").
option("password", "").
option("url", jdbcUrl).
option("dbTable", tableName).load()
df.show()
but all I get back is an empty DataFrame with modified column names, like this:
--------------|---------------|
tableName.uuid|tableName.name |
--------------|---------------|
I've tried to read the DataFrame in a lot of ways, but the result is always the same.
I'm using JDBC Hive Driver, and this HiveTable is located in an EMR cluster. The code also runs in the same cluster.
Any help will be really appreciated.
Thank you all.
Please set fetchsize in the options; it should work.
Dataset<Row> referenceData = sparkSession.read()
    .option("fetchsize", "100")
    .format("jdbc")
    .option("url", jdbc.getJdbcURL())
    .option("user", "")
    .option("password", "")
    .option("dbtable", hiveTableName)
    .load();

SparkSQL - Read parquet file directly

I am migrating from Impala to SparkSQL, using the following code to read a table:
my_data = sqlContext.read.parquet('hdfs://my_hdfs_path/my_db.db/my_table')
How do I invoke SparkSQL on the data loaded above, so it can return the result of something like:
'select col_A, col_B from my_table'
After creating a DataFrame from a Parquet file, you have to register it as a temp table to run SQL queries on it.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.parquet("src/main/resources/peopleTwo.parquet")
df.printSchema
// after registering as a table you will be able to run sql queries
df.registerTempTable("people")
sqlContext.sql("select * from people").collect.foreach(println)
With plain SQL
JSON, ORC, Parquet, and CSV files can be queried without first creating a table or DataFrame.
// This is Spark 2.x code; you can do the same with sqlContext as well
val spark: SparkSession = SparkSession.builder.master("set_the_master").getOrCreate
spark.sql("select col_A, col_B from parquet.`hdfs://my_hdfs_path/my_db.db/my_table`")
.show()
Suppose that you have the parquet file ventas4 in HDFS:
hdfs://localhost:9000/sistgestion/sql/ventas4
In this case, the steps are:
Load the SQL context:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Read the Parquet file:
val ventas=sqlContext.read.parquet("hdfs://localhost:9000/sistgestion/sql/ventas4")
Register a temporary table:
ventas.registerTempTable("ventas")
Execute the query (here you can use toJSON to get JSON output, or use collect()):
sqlContext.sql("select * from ventas").toJSON.foreach(println(_))
sqlContext.sql("select * from ventas").collect().foreach(println(_))
Use the following code in IntelliJ:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

def groupPlaylistIds(): Unit = {
  Logger.getLogger("org").setLevel(Level.ERROR)
  val spark = SparkSession.builder.appName("FollowCount")
    .master("local[*]")
    .getOrCreate()
  val sc = spark.sqlContext
  val d = sc.read.format("parquet").load("/Users/CCC/Downloads/pq/file1.parquet")
  d.printSchema()
  // keep rows where col1 is not "-"
  val d1 = d.select("col1").filter(col("col1") =!= "-")
  val d2 = d1.filter(col("col1").startsWith("searchcriteria"))
  d2.groupBy("col1").count().sort(col("count").desc).show(100, false)
}

Save values in Spark

I am trying to read and write data from my local folder, but the data I get out is not in the form I expect.
val data =sc.textFile("/user/cts367689/datagen.txt")
val a = data.map(line => (line.split(",")(0).toInt + line.split(",")(4).toInt, line.split(",")(3), line.split(",")(2)))
a.saveAsTextFile("/user/cts367689/sparkoutput")
Output:
(526,female,avil)
(635,male,avil)
(983,male,paracetamol)
(342,female,paracetamol)
(158,female,avil)
How can I save the output as below? I need to remove the brackets.
Expected Result:
526,female,avil
635,male,avil
983,male,paracetamol
342,female,paracetamol
158,female,avil
val a = data.map(line =>
  (line.split(",")(0).toInt + line.split(",")(4).toInt) + "," +
    line.split(",")(3) + "," +
    line.split(",")(2)
)
Try this instead of returning the values wrapped in (), which creates a tuple and is what prints the brackets.
Spark is capable of handling unstructured files; you are using one of those functions.
For CSV (comma-separated values) files there are some good libraries that do the same.
You can have a look at this link.
For your question, the answer is shown below.
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

SQLContext sqlContext = new SQLContext(sc);
DataFrame df = sqlContext.read()
    .format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "false")
    .load("/user/cts367689/datagen.txt");
df.select("id", "gender", "name").write()
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .save("/user/cts367689/sparkoutput");
Use:
val a = data.map(line => line.split(",")(0).toInt + line.split(",")(4).toInt + "," + line.split(",")(3) + "," + line.split(",")(2))