issue insert data in hive create small part files - scala

i am processing more than 1000000 records of json file i am reading file line by line and extract requried key values
(json are mix structure is not fix. so i am parsing and generate requried json element) and generate json string simillar to json_string variable and push to hive table data are store properly but at hadoop apps/hive/warehouse/jsondb.myjson_table folder contain small part files. every insert query the new (.1 to .20 kb)part file will be created. beacuse of that if i run simple query on hive as it will take more than 30 min. showing sample code of my logic this iterate multipal times for new records to inesrt in hive.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("SparkSessionZipsExample").enableHiveSupport().getOrCreate()
var json_string = """{"name":"yogesh_wagh","education":"phd" }"""
val df = spark.read.json(Seq(json_string).toDS)
//df.write.format("orc").saveAsTable("bds_data1.newversion");
df.write.mode("append").format("orc").insertInto("bds_data1.newversion");
i have also try to add hive property to merge the files but it wont work,
i have also try to create table from existing table for combine small part file to one 256 mb files..
please share sample code to insert multipal records and append record in part file.

I think each of those individual inserts creating a new part file.
You could create dataset/dataframe of these json strings and then save it to hive table.
you could merge the existing small file using hive ddl ALTER TABLE table_name CONCATENATE;

Related

Difference Between df.wirte and CREATE TABLE USING

I have always been under the impression that the following code create a Delta table,
data.write.format("delta").save("/path/to/delta-table")
This creates the files, sure, however, I noticed today that when I look at the Data section of Databricks, under the hive_metastore, this table does not show up.
In order for this table to show up there, I have to do something like,
CREATE TABLE some_table USING DELTA LOCATION "/path/to/delta-table"
What exactly is going on here? Was I wrong in my understanding that the .write operation creates a table? What is the difference between these commands?
DataFrameWriter has following methods:
def save(path: String): Unit
Saves the content of the DataFrame at the specified path.
def saveAsTable(tableName: String): Unit
Saves the content of the DataFrame as the specified table.
What you did by .save("/path/to/delta-table") was saving the data in delta format in the filesystem. In order for the table to be visible in data catalog (aka. metastore) you need to run CREATE TABLE providing the location.
You can write data using .saveAsTable("delta-table") - that would write the data under a path managed by the metastore and register the table in one step.

write only when all tables are valid with databricks and delta table

I'm looping through some CSV files in a folder. I want to write these CSV files as delta table only if they are all valid. Each CSV files in a folder as different name and schemas. I want to reject the entire folder and all the files it contains until data are fixed. I'm running a lot of test but ultimately I have to actually write the files as delta table with the following loop (simplified for this question):
for f in files:
# read csv
df = spark.read.csv(f, header=True, schema=schema)
# writing to already existing delta table
df.write.format("delta").save('path/' + f)
Is there a callback mechanism so the write method is executed only if all the dataframe doesn't returns any errors? Delta table schema enforcement is pretty rigid which is great, but errors can pop at any time despite all the test I'm running before passing these files in this loop.
union is not an option because I want to handle this by date and each files has different schemas and names.
You can use df.union() or df.unionByName() to read all of your files into a single dataframe. Then that one is either written fully or fails.
# Create empty dataframe with schema to fill up
emptyRDD = spark.sparkContext.emptyRDD()
df = spark.createDataFrame(emptyRDD,schema)
for f in files:
# read csv
dfNext = spark.read.csv(f, header=True, schema=schema)
df = df.unionByName(dfNext)
df.write.format("delta").save(path)

Spark-Optimization Techniques

Hi I have 90 GB data In CSV file I'm loading this data into one temp table and then from temp table to orc table using select insert command but for converting and loading data into orc format its taking 4 hrs in spark sql.Is there any kind of optimization technique which i can use to reduce this time.As of now I'm not using any kind of optimization technique I'm just using spark sql and loading data from csv file to table(textformat) and then from this temp table to orc table(using select insert)
using spark submit as:
spark-submit \
--class class-name\
--jar file
or can I add any extra Parameter in spark submit for improving the optimization.
scala code(sample):
All Imports
object sample_1 {
def main(args: Array[String]) {
//sparksession with enabled hivesuppport
var a1=sparksession.sql("load data inpath 'filepath' overwrite into table table_name")
var b1=sparksession.sql("insert into tablename (all_column) select 'ALL_COLUMNS' from source_table")
}
}
First of all, you don't need to store the data in the temp table to write into hive table later. You can straightaway read the file and write the output using the DataFrameWriter API. This will reduce one step from your code.
You can write as follows:
val spark = SparkSession.builder.enableHiveSupport().getOrCreate()
val df = spark.read.csv(filePath) //Add header or delimiter options if needed
inputDF.write.mode("append").format(outputFormat).saveAsTable(outputDB + "." + outputTableName)
Here, the outputFormat will be orc, the outputDB will be your hive database and outputTableName will be your Hive table name.
I think using the above technique, your write time will reduce significantly. Also, please mention the resources your job is using and I may be able to optimize it further.
Another optimization you can use is to partition your dataframe while writing. This will make the write operation faster. However, you need to decide the columns on which to partition carefully so that you don't end up creating a lot of partitions.

Hive SaveAsTable creates a new Parquet table file for every run

I have the following Scala code that I use to write data from a json file to a table in Hive.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
val conf = new SparkConf().setAppName("App").setMaster("local")
import org.apache.spark.sql.hive._
val hiveContext = new HiveContext(sc)
val stg_comments = hiveContext.read.schema(buildSchema()).json(<path to json file)
comment.write.mode("append").saveAsTable(<table name>)
My json data has newline and carriage return characters in it's field values and hence, I cannot simply insert records in Hive (because Hive tables by default do not store newline and carriage returns in the data values) and hence, I need to use SaveAsTable option. The issue here is that every time a json file is read and new records are appended to the existing table, a new parquet file is created in the table directory in Hive warehouse directory. This leads to really small small parquet files in the directory. I would like the data to be appended to the existing parquet file. Do we know how to do that? Thanks!
This is an expected behavior. There is no append-to-existing file option here. Each job has its own set of tasks, each task has its own output file. repartitioning before rewrite can reduce number of files written, but not prevent creating new files.
If number of files becomes a problem, you have to run a separate job to read existing small files and merge into larger chunks.

Incrementally adding to a Hive table w/Scala + Spark 1.3

Our cluster has Spark 1.3 and Hive
There is a large Hive table that I need to add randomly selected rows to.
There is a smaller table that I read and check a condition, if that condition is true, then I grab the variables I need to then query for the random rows to fill. What I did was do a query on that condition, table.where(value<number), then make it an array by using take(num rows). Then since all of these rows contain the information I need on which random rows are needed from the large hive table, I iterate through the array.
When I do the query I use ORDER BY RAND() in the query (using sqlContext). I created a var Hive table ( to be mutable) adding a column from the larger table. In the loop, I do a unionAll newHiveTable = newHiveTable.unionAll(random_rows)
I have tried many different ways to do this, but am not sure what is the best way to avoid CPU and temp disk use. I know that Dataframes aren't intended for incremental adds.
One thing I have though now to try is to create a cvs file, write the random rows to that file incrementally in the loop, then when the loop is finished, load the cvs file as a table, and do one unionAll to get my final table.
Any feedback would be great. Thanks
I would recommend that you create an external table with hive, defining the location, and then let spark write the output as csv to that directory:
in Hive:
create external table test(key string, value string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
LOCATION '/SOME/HDFS/LOCATION'
And then from spark with the aide of https://github.com/databricks/spark-csv , write the dataframe to csv files and appending to the existing ones:
df.write.format("com.databricks.spark.csv").save("/SOME/HDFS/LOCATION/", SaveMode.Append)