How to write data back to BigQuery using Databricks? - pyspark

I would like to upload my DataFrame to a BigQuery table using Databricks. I used the code below and got the following error.
bucket = "databricks-ci"
table = "custom-bidder.ciupdate.myTable"
df.write.format("bigquery").mode("overwrite").option("temporaryGcsBucket", bucket).option("table", table).save()
Error Message
I created a new bucket called "databricks-ci" and a dataset called "ciupdate", and gave my table the name "myTable". My project is "custom-bidder".
I am not sure why it's not loading. Can anyone advise?

Related

Delta Lake Data Load Datatype mismatch

I am loading data from SQL Server into Delta Lake tables. Recently I had to repoint the source to another table (same columns), but the data types are different in the new table. This causes an error while loading data into the Delta table. I get the following error:
Failed to merge fields 'COLUMN1' and 'COLUMN1'. Failed to merge incompatible data types LongType and DecimalType(32,0)
The command I use to write data to the Delta table:
DF.write.mode("overwrite").format("delta").option("mergeSchema", "true").save("s3 path")
The only option I can think of right now is to set overwriteSchema to true.
But this will rewrite my target schema completely. I am concerned that any sudden change in the source schema will replace the existing target schema without any notification or alert.
Also, I can't explicitly convert these columns, because the Databricks notebook I am using is a parametrized one used to load data from source to target (we read a CSV file that contains all the details about the target table, source table, partition key, etc.).
Is there any better way to tackle this issue?
Any help is much appreciated!
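One way to address the "no notification" concern is to compare the source and target column types before writing and fail loudly on a mismatch, rather than letting overwriteSchema replace the target silently. The following is only a minimal sketch of that idea, not a Delta Lake feature: the function names are hypothetical, and it assumes both schemas can be reduced to column-name to type-string dictionaries (for a PySpark DataFrame, via df.schema.fields and dataType.simpleString()).

```python
def schema_as_dict(df):
    """Map column name -> Spark type string, e.g. {"COLUMN1": "bigint"}.

    Assumes `df` is a PySpark DataFrame (hypothetical helper).
    """
    return {f.name: f.dataType.simpleString() for f in df.schema.fields}


def find_type_conflicts(source_types, target_types):
    """Return {column: (source_type, target_type)} for columns that exist in
    both schemas but with different types. Added or dropped columns are not
    reported here."""
    return {
        col: (src_type, target_types[col])
        for col, src_type in source_types.items()
        if col in target_types and target_types[col] != src_type
    }


def check_before_write(source_types, target_types):
    """Raise instead of silently rewriting the target schema."""
    conflicts = find_type_conflicts(source_types, target_types)
    if conflicts:
        raise ValueError(f"Type mismatch between source and target: {conflicts}")


# In the notebook this could be called before the write, for example
# (target_path is a hypothetical variable holding the Delta table location):
#   target_df = spark.read.format("delta").load(target_path)
#   check_before_write(schema_as_dict(DF), schema_as_dict(target_df))
```

Because the check works on plain dictionaries, it stays generic for a parametrized notebook: no column names need to be hard-coded, and the raised error can feed whatever alerting the job already has.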

Difference Between df.write and CREATE TABLE USING

I have always been under the impression that the following code creates a Delta table,
data.write.format("delta").save("/path/to/delta-table")
This creates the files, sure, however, I noticed today that when I look at the Data section of Databricks, under the hive_metastore, this table does not show up.
In order for this table to show up there, I have to do something like,
CREATE TABLE some_table USING DELTA LOCATION "/path/to/delta-table"
What exactly is going on here? Was I wrong in my understanding that the .write operation creates a table? What is the difference between these commands?
DataFrameWriter has the following methods:
def save(path: String): Unit
Saves the content of the DataFrame at the specified path.
def saveAsTable(tableName: String): Unit
Saves the content of the DataFrame as the specified table.
What you did with .save("/path/to/delta-table") was save the data in Delta format in the filesystem. For the table to be visible in the data catalog (aka the metastore), you need to run CREATE TABLE and provide the location.
Alternatively, you can write the data using .saveAsTable("delta_table"), which writes the data under a path managed by the metastore and registers the table in one step.
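The two routes can be put side by side in PySpark. This is a rough sketch, assuming a Databricks or Spark session with Delta available; the function names are made up for illustration, and "some_table" and the path are the ones from the question.

```python
def create_table_sql(table_name, path):
    """Build the statement that registers existing Delta files in the metastore."""
    return f"CREATE TABLE IF NOT EXISTS {table_name} USING DELTA LOCATION '{path}'"


def save_then_register(df, spark, path, table_name):
    """Route 1: write files to a path, then register them explicitly.

    save() only writes files; the table appears in the catalog after the
    CREATE TABLE statement runs.
    """
    df.write.format("delta").save(path)
    spark.sql(create_table_sql(table_name, path))


def save_as_managed_table(df, table_name):
    """Route 2: write and register in one step.

    The metastore decides where the files live.
    """
    df.write.format("delta").saveAsTable(table_name)


# Example calls (df and spark come from the notebook session):
#   save_then_register(df, spark, "/path/to/delta-table", "some_table")
#   save_as_managed_table(df, "some_table")
```

With route 1 the table is external: dropping it from the metastore leaves the files at the path in place. With route 2 the table is managed, so dropping it also removes the data.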

How to configure an AWS Glue crawler to read a CSV file having commas in the dataset?

I have data as follows in a CSV file in an S3 bucket:
"Name"|"Address"|"Age"
----------------------
"John"|"LA,USA"|"27"
I have created a crawler, which created the table, but when I query the data in Athena I get the following data:
How do I configure the AWS Glue crawler to create a catalog table that reads the above data?
You must have figured it out already, but I thought this answer would benefit anyone who visits this question.
This can be resolved either by using a crawler classifier or by modifying the table properties after the table is created.
Using classifier:
Create a classifier with "Quote symbol" set.
Add the classifier to the crawler you create.
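To see what a quote symbol changes, here is a small local illustration with Python's csv module. It is not Glue code and says nothing about Glue's internals; it only shows, on the sample rows from the question (without the dashed separator line), how a parser that knows the delimiter and the quote character treats "LA,USA".

```python
import csv
import io

# Sample rows from the question: pipe-delimited, every value double-quoted.
raw = '"Name"|"Address"|"Age"\n"John"|"LA,USA"|"27"\n'


def parse(text, **reader_options):
    """Parse CSV text with the given csv.reader options into a list of rows."""
    return list(csv.reader(io.StringIO(text), **reader_options))


# Delimiter and quote symbol both declared: quotes are stripped and the
# comma inside "LA,USA" is plain data.
with_quote_symbol = parse(raw, delimiter="|", quotechar='"')

# Quote handling turned off: the quote characters stay in the values.
without_quote_symbol = parse(raw, delimiter="|", quoting=csv.QUOTE_NONE)

print(with_quote_symbol[1])     # ['John', 'LA,USA', '27']
print(without_quote_symbol[1])  # ['"John"', '"LA,USA"', '"27"']
```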
Or you can modify the table's SerDe properties by editing the table (after the crawler creates it):

Issue: inserting data into Hive creates small part files

I am processing a JSON file with more than 1000000 records. I read the file line by line and extract the required key values.
(The JSON has a mixed structure that is not fixed, so I parse it and generate the required JSON elements.) I then generate a JSON string similar to the json_string variable and push it to a Hive table. The data is stored properly, but the hadoop apps/hive/warehouse/jsondb.myjson_table folder contains small part files: every insert query creates a new part file (.1 to .20 KB). Because of that, even a simple query on Hive takes more than 30 minutes. Below is sample code for my logic; it iterates multiple times to insert new records into Hive.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("SparkSessionZipsExample").enableHiveSupport().getOrCreate()
import spark.implicits._ // required for Seq(...).toDS outside spark-shell
var json_string = """{"name":"yogesh_wagh","education":"phd" }"""
val df = spark.read.json(Seq(json_string).toDS)
//df.write.format("orc").saveAsTable("bds_data1.newversion");
df.write.mode("append").format("orc").insertInto("bds_data1.newversion");
I have also tried adding a Hive property to merge the files, but it doesn't work.
I have also tried creating a table from the existing table to combine the small part files into single 256 MB files.
Please share sample code to insert multiple records and append records to a part file.
I think each of those individual inserts is creating a new part file.
You could create a Dataset/DataFrame of these JSON strings and then save it to the Hive table.
You could merge the existing small files using the Hive DDL ALTER TABLE table_name CONCATENATE;
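The batching suggestion could look roughly like this in PySpark (the question uses Scala; the same calls exist there). This is a sketch under assumptions: the function name and the num_files parameter are hypothetical, all JSON strings are collected into a list first, and the target table already exists with a column order that matches the DataFrame, since insertInto resolves columns by position.

```python
def insert_json_batch(spark, json_strings, table_name, num_files=1):
    """Parse many JSON strings into one DataFrame and append them with a
    single insert, so Hive gets a few larger files instead of one tiny part
    file per record."""
    # One RDD holding all the JSON strings, parsed in a single pass.
    rdd = spark.sparkContext.parallelize(json_strings)
    df = spark.read.json(rdd)
    # coalesce() caps the number of output files written by this insert.
    df.coalesce(num_files).write.mode("append").insertInto(table_name)


# Example call (spark is a Hive-enabled session from the notebook or job):
#   records = ['{"name":"yogesh_wagh","education":"phd"}', ...]
#   insert_json_batch(spark, records, "bds_data1.newversion")
```

For a very large input, the list could be flushed in chunks (say, every few hundred thousand records) so that each insert still produces reasonably sized files without holding everything in driver memory.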

Spring: store data in JdbcTemplate (H2 DB) permanently

I am starting to learn Spring and have run into some issues with spring-jdbc.
First, I tried running the example from https://spring.io/guides/gs/relational-data-access/ and it worked. Then I commented out the lines that drop and create the tables (http://pastebin.com/zcJHsL1P), so that the data would not be overwritten but just read from the DB and shown. However, Spring gave me this error:
Table "CUSTOMERS" not found; SQL statement: ...
So, my question is: what should I do to store my database permanently? I don't want to recreate the database every time; I want to create it once and update it.
P.S. I used the H2 database. Maybe the problem is in this DB?
That piece of code looks like you are "prototyping" something; so it's easier to automatically create a new database (schema, tables, data) on the fly, execute and/or test whatever you want to...and finish the execution.
If you want to persist your data and only modify/update it, either use H2 with the "file layout" or use MySQL, PostgreSQL, etc.
By the way, the reason you are getting Table "CUSTOMERS" not found; SQL statement: ... is because you are using H2 as an in-memory database and every time you start your application you need to re-create the tables and populate them with data.
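The in-memory versus file-backed difference is easy to reproduce outside Spring. The sketch below uses Python's built-in sqlite3 module as a stand-in for H2, so it is an analogy rather than H2 or Spring code; in H2 terms the same choice is made through the JDBC URL (jdbc:h2:mem:... versus jdbc:h2:file:...). Closing and reopening the connection plays the role of restarting the application.

```python
import os
import sqlite3
import tempfile


def table_survives_restart(database):
    """Create a table, close the connection (the "restart"), reconnect,
    and report whether the table is still there."""
    conn = sqlite3.connect(database)
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT)")
    conn.commit()
    conn.close()

    conn = sqlite3.connect(database)
    found = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name='customers'"
    ).fetchone()
    conn.close()
    return found is not None


# In-memory database: everything is gone once the connection closes.
in_memory = table_survives_restart(":memory:")

# File-backed database: the table is still there after reconnecting.
with tempfile.TemporaryDirectory() as tmp:
    on_disk = table_survives_restart(os.path.join(tmp, "customers.db"))

print(in_memory, on_disk)  # False True
```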