How can I fix this error in a Spark application? - scala

I'm loading data from HDFS into ClickHouse with a Spark job and get this error: DB::Exception: Too many parts (306). Merges are processed much slower than inserts. (TOO_MANY_PARTS) (version 22.3.44 (official build))
The data is in Parquet, about 34 GB in total.
Path to a Parquet file: "hdfs://host:8020/user/stat/year=2022/month=1/day=1/101.parq".
My settings look like this:
val df = spark.read.parquet("hdfs://host:8020/user/stat/year=2022/")
df.write
.format("jdbc")
.mode("append")
.option("driver","cc.blynk.clickhouse.ClickHouseDriver")
.option("url", "jdbc:clickhouse://host:8123/default")
.option("user", "login")
.option("password", "pass")
.option("dbtable", "table")
.save()
I'm new to Scala & Spark,
so thanks for any advice.

Use async_insert
https://clickhouse.com/docs/en/operations/settings/settings/#async-insert
.option("async_insert", "1")
or use Engine=Buffer tables on older ClickHouse versions:
https://clickhouse.com/docs/en/engines/table-engines/special/buffer/
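For reference, a minimal sketch of how async_insert could be attached to the original write. This assumes the cc.blynk ClickHouse JDBC driver forwards URL query parameters as ClickHouse settings (the exact mechanism depends on the driver version), so treat it as a starting point rather than a confirmed fix:
val df = spark.read.parquet("hdfs://host:8020/user/stat/year=2022/")
df.write
.format("jdbc")
.mode("append")
.option("driver", "cc.blynk.clickhouse.ClickHouseDriver")
// async_insert=1 lets the server buffer small inserts into bigger parts;
// wait_for_async_insert=1 makes the client wait until the buffer is flushed.
.option("url", "jdbc:clickhouse://host:8123/default?async_insert=1&wait_for_async_insert=1")
.option("user", "login")
.option("password", "pass")
.option("dbtable", "table")
.save()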

Related

How to make spark commit in each batch when using batchsize and writing into RDBMS (ORACLE, MsSQL, etc,...)?

I am new to Spark and am attempting to insert 50 million records into an RDBMS. The RDBMS could be Oracle, MsSQL, or anything else. Below is the sample code:
df.write
.format("jdbc")
.mode("append")
.option("truncate", false)
.option("driver", ****)
.option("url", ****)
.option("user", ****)
.option("password", ****)
.option("dbtable", "TABLE_NAME")
.option("batchsize", 100000)
.save()
My assumption is that when using the "batchsize" option, it will perform inserts in batches of 100k records.
The problem is that it is not committing batch by batch. After a point the undo tablespace fills up with uncommitted records and I get the error below:
ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 6)
java.sql.BatchUpdateException: ORA-30036: unable to extend segment by
8 in undo tablespace '***'
My requirement is to perform an insert & commit for every batch, based on the "batchsize" value.
The "isolationLevel" option did the trick for me. With "isolationLevel" set to NONE, Spark does not wrap each partition's writes in a transaction, so the connection stays in the driver's default auto-commit mode and each JDBC batch is committed as it is executed rather than at the end of the task.
Reference link - here
isolationLevel - The transaction isolation level, which applies to
current connection. It can be one of NONE, READ_COMMITTED,
READ_UNCOMMITTED, REPEATABLE_READ, or SERIALIZABLE, corresponding to
standard transaction isolation levels defined by JDBC's Connection
object, with default of READ_UNCOMMITTED. This option applies only to
writing. Please refer the documentation in java.sql.Connection.
Below is the sample code:
df.write
.format("jdbc")
.mode("append")
.option("truncate", false)
.option("driver", ****)
.option("url", ****)
.option("user", ****)
.option("password", ****)
.option("dbtable", "TABLE_NAME")
.option("isolationLevel", "NONE")
.option("batchsize", 100000)
.save()

How to connect to oracle database in Spark using Scala

I have Spark 3.1.2 and Scala 2.12.8. I want to connect to an Oracle database, read a table, and then show it, using this code:
import org.apache.spark.sql.SparkSession
object readTable extends App{
val spark = SparkSession
.builder
.master("local[*]")
.appName("Spark SQL basic example")
.config("spark.some.config.option", "some-value")
.getOrCreate()
val jdbcDF = spark.read
.format("jdbc")
.option("url", "jdbc:oracle:thin:#x.x.x.x:1521:orcldb")
.option("dbtable", "orcl.brnc_grp")
.option("user", "orcl")
.option("password", "xxxxxxx")
.load()
jdbcDF.show()
}
When I run the code, I receive this error:
Exception in thread "main" java.sql.SQLException: No suitable driver
Would you please guide me how to solve this problem?
Any help would be really appreciated.
Download the Oracle JDBC jar from here and place it in the $SPARK_HOME/jars folder.
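As an illustration only, here is a minimal sketch of the same read once the ojdbc jar (e.g. ojdbc8.jar) is on the classpath; if you cannot copy it into $SPARK_HOME/jars, passing it via spark-submit --jars also works. The driver class named below, oracle.jdbc.OracleDriver, is the one shipped in recent ojdbc jars, and naming it explicitly also helps avoid "No suitable driver":
import org.apache.spark.sql.SparkSession

object readTable extends App {
val spark = SparkSession
.builder
.master("local[*]")
.appName("Spark SQL basic example")
.getOrCreate()
// Note the '@' in the thin URL: jdbc:oracle:thin:@host:port:SID
val jdbcDF = spark.read
.format("jdbc")
.option("url", "jdbc:oracle:thin:@x.x.x.x:1521:orcldb")
.option("driver", "oracle.jdbc.OracleDriver")
.option("dbtable", "orcl.brnc_grp")
.option("user", "orcl")
.option("password", "xxxxxxx")
.load()
jdbcDF.show()
}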

Unable to write Spark's DataFrame to hive using presto

I'm writing some code to save a DataFrame to a Hive database using Presto:
df.write.format("jdbc")
.option("url", "jdbc:presto://myurl/hive?user=user/default")
.option("driver","com.facebook.presto.jdbc.PrestoDriver")
.option("dbtable", "myhivetable")
.mode("overwrite")
.save()
This should work, but it actually raises an exception:
java.lang.IllegalArgumentException: Can't get JDBC type for array<string>

Error when importing into a spark dataframe a mssql table named as an integer

I have a table in MSSQL named dbo.1table and I need to load it into a DataFrame and later save it as an Avro file, but I can't even load it as a DataFrame.
I tested my code with tables whose names contain only the characters a-z and it works; I also tried converting the table name with toString(), but nothing has worked so far.
I expect to get a DataFrame and then save it as an Avro file.
Instead I get the following error:
val DFDimAccountOperator = spark.read.format("jdbc")
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
.option("url", connection)
.option("dbtable", "dbo.1table")
.option("user", userId)
.option("password", pwd).load()
DFDimAccountOperator.write.format("avro").save("conversionTypes/testinAVro13")
Exception in thread "main" com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near '.1'. at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:262)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1621)
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:592)
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:522)
at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7194)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:2935)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:248)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:223)
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:444)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:61)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at com.aon.ukbi.TomsExample$.runJdbcDatasetExample(TomsExample.scala:27)
at com.aon.ukbi.TomsExample$.main(TomsExample.scala:16)
For making a connection between MSSQL and Spark, you need to add the sqljdbc jar to the $SPARK_HOME/jars location, restart your spark-shell, and paste this line into the Spark shell:
scala> val DFDimAccountOperator = spark.read.format("jdbc").option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver").option("url", "jdbc:sqlserver://xxxxxxxx.xxx:1433;database=xx;user=xxxxxx;password=xxxxxx").option("dbtable", "xxxxxx").load()
Restart and re-run the code (replace the xxxxxx placeholders with appropriate values).
After this, you can write your DataFrame in whichever format you want:
DFDimAccountOperator.write.format("avro").save("conversionTypes/testinAVro13")
Hope this helps. Let me know if you have further queries; if it solves your purpose, accept the answer. Happy Hadoop!

Write dataframe to Teradata in Spark

I have values in a DataFrame, and I have created a table structure in Teradata. My requirement is to load the DataFrame into Teradata, but I am getting an error.
I have tried the following code:
df.write.format("jdbc")
.option("driver","com.teradata.jdbc.TeraDriver")
.option("url","organization.td.intranet")
.option("dbtable",s"select * from td_s_zm_brainsdb.emp")
.option("user","userid")
.option("password","password")
.mode("append")
.save()
I got an error :
java.lang.NullPointerException at
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:93)
at
org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:518)
at
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
... 48 elided
I changed the url option to make it look like a JDBC URL, and ran the following command:
df.write.format("jdbc")
.option("driver","com.teradata.jdbc.TeraDriver")
.option("url","jdbc:teradata//organization.td.intranet,CHARSET=UTF8,TMODE=ANSI,user=G01159039")
.option("dbtable",s"select * from td_s_zm_brainsdb.emp")
.option("user","userid")
.option("password","password")
.mode("append")
.save()
Still I am getting the error:
java.lang.NullPointerException at
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:93)
at
org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:518)
at
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
... 48 elided
I have included the following jars with the --jars option:
tdgssconfig-16.10.00.03.jar
terajdbc4-16.10.00.03.jar
teradata-connector-1.2.1.jar
Teradata version: 15
Spark version: 2
Change the jdbc_url and dbtable to the following:
.option("url", "jdbc:teradata://organization.td.intranet/Database=td_s_zm_brainsdb")
.option("dbtable", "emp")
Also note that in Teradata there are no row locks, so the above will take a table lock; i.e. it will not be efficient, and parallel writes from Spark JDBC are not possible.
Teradata's native tools (FastLoad / BTEQ combinations) will work.
Another option, which requires a complicated setup, is Teradata QueryGrid; this is super fast and uses Presto behind the scenes.
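Putting those two options back into the original write, a minimal sketch (untested; the host, database, and credentials are the placeholders from the question) would look like this:
df.write
.format("jdbc")
.option("driver", "com.teradata.jdbc.TeraDriver")
.option("url", "jdbc:teradata://organization.td.intranet/Database=td_s_zm_brainsdb")
.option("dbtable", "emp") // a plain table name, not a SELECT statement
.option("user", "userid")
.option("password", "password")
.mode("append")
.save()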
I found the actual issue.
The JDBC URL should be in the following form:
val jdbcUrl = s"jdbc:teradata://${jdbcHostname}/database=${jdbcDatabase},user=${jdbcUsername},password=${jdbcPassword}"
It was causing the exception because I didn't supply the username and password.
Below is code useful for reading data from a Teradata table:
val df = spark.read.format("jdbc")
.option("driver", "com.teradata.jdbc.TeraDriver")
.option("url", "jdbc:teradata://organization.td.intranet/Database=td_s_zm_brainsdb")
.option("dbtable", "(select * from td_s_zm_brainsdb.emp) AS t")
.option("user", "userid")
.option("password", "password")
.load()
This will create data frame in Spark.
For writing data back to the database, below is the statement (saving data to a JDBC source):
df.write
.format("jdbc")
.option("url", "jdbc:teradata://organization.td.intranet/Database=td_s_zm_brainsdb")
.option("dbtable", "schema.tablename")
.option("user", "username")
.option("password", "password")
.mode("append")
.save()