Error while using the truncateTable method of JdbcUtils on a Postgres table from Spark - postgresql

I am trying to truncate the table postgre_table from Spark using JdbcUtils, but it throws the error below:
<console>:71: error: value truncateTable is not a member of object org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils
val trucate_table = JdbcUtils.truncateTable()
I am using the code below:
import org.apache.spark.sql.execution.datasources.jdbc._
import java.sql.DriverManager
import java.sql.Connection
val connection : Connection = DriverManager.getConnection(postgres_host + postgres_database,postgres_username,postgres_password)
val table_existing = JdbcUtils.tableExists(connection, postgres_host + postgres_database, postgre_table)
JdbcUtils.truncateTable(connection, postgres_host + postgres_database, postgre_table)
I am able to drop the table, but not truncate it. I can see the truncateTable method in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
Please suggest a solution and how to use it in Databricks.

It looks like your compile-time and runtime Spark libraries are different versions. Please make sure the runtime version matches the compile-time version of Spark. The method seems to be available only from version 2.1 onwards.
It is available from this release:
https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
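If the cluster really is on an older Spark, or you want to avoid depending on an internal API whose signature changes between releases, here is a minimal sketch that truncates through plain JDBC instead (it reuses the same connection variables as in the question; the table name is interpolated as-is):
import java.sql.{Connection, DriverManager}
// Truncate via a plain JDBC statement; this works regardless of the Spark version on the cluster.
val connection: Connection = DriverManager.getConnection(postgres_host + postgres_database, postgres_username, postgres_password)
try {
  val stmt = connection.createStatement()
  stmt.executeUpdate(s"TRUNCATE TABLE $postgre_table") // postgre_table holds the table name, as in the question
  stmt.close()
} finally {
  connection.close()
}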

Related

Could not instantiate EventHubSourceProvider for Azure Databricks

Following the steps documented for Structured Streaming in PySpark, I'm unable to create a DataFrame in PySpark from the Azure Event Hub I have set up in order to read the stream data.
The error message is:
java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.eventhubs.EventHubsSourceProvider could not be instantiated
I have installed the following Maven libraries (com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.12 is unavailable), but none appear to work:
com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.15
com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.6
I have also tried ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString), but the error message returned is:
java.lang.NoSuchMethodError: org.apache.spark.internal.Logging.$init$(Lorg/apache/spark/internal/Logging;)V
The connection string is correct as it is also used in a console application that writes to the Azure Event Hub and that works.
Can someone point me in the right direction, please? The code in use is as follows:
from pyspark.sql.functions import *
from pyspark.sql.types import *
# Event Hub Namespace Name
NAMESPACE_NAME = "*myEventHub*"
KEY_NAME = "*MyPolicyName*"
KEY_VALUE = "*MySharedAccessKey*"
# The connection string to your Event Hubs Namespace
connectionString = "Endpoint=sb://{0}.servicebus.windows.net/;SharedAccessKeyName={1};SharedAccessKey={2};EntityPath=ingestion".format(NAMESPACE_NAME, KEY_NAME, KEY_VALUE)
ehConf = {}
ehConf['eventhubs.connectionString'] = connectionString
# For 2.3.15 version and above, the configuration dictionary requires that connection string be encrypted.
# ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
df = spark \
.readStream \
.format("eventhubs") \
.options(**ehConf) \
.load()
To resolve the issue, I did the following:
Uninstall azure event hub library versions
Install com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.15 library version from Maven Central
Restart cluster
Validate by re-running code provided in the question
I received this same error when installing libraries with the version number com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.* on a Spark cluster running Spark 3.0 with Scala 2.12.
For anyone else finding this via Google: check whether you have the correct Scala library version. In my case, my cluster is Spark 3 with Scala 2.12.
Changing the "2.11" in the library version from the tutorial I was using to "2.12", so that it matches my cluster runtime version, fixed the issue.
I had to take this a step further: in the format method I had to pass the provider class directly:
.format("org.apache.spark.sql.eventhubs.EventHubsSourceProvider")
Check the cluster Scala version and the library version. Uninstall the older libraries and install:
com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.17
in the shared workspace (right-click and install library) and also on the cluster.
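As a quick sanity check before choosing the artifact suffix, here is a small sketch (run in a Scala notebook cell; the values in the comments are hypothetical examples) that shows which Spark and Scala versions the cluster is actually running:
// The azure-eventhubs-spark artifact's _2.11 / _2.12 suffix must match the cluster's Scala version.
println(spark.version)                        // e.g. 3.0.1
println(scala.util.Properties.versionString)  // e.g. version 2.12.10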

Can Apache Beam detect the schema (column names) of a Parquet file like Spark and Pandas?

I am new to Apache Beam and I come from the Spark world, where the API is very rich.
How can I get the schema of a Parquet file using Apache Beam, without loading the data into memory? The data can sometimes be huge, and I am only interested in knowing the columns, and optionally the column types.
The language is Python.
The storage system is Google Cloud Storage, and the Apache Beam job must be run in Dataflow.
FYI, I have tried the following, as suggested on Stack Overflow:
from pyarrow.parquet import ParquetFile
ParquetFile(source).metadata
First, it didn't work when I gave it a gs://.. path, giving me this error: error: No such file or directory
Then I tried with a local file on my machine, and slightly changed the code to:
from pyarrow.parquet import ParquetFile
ParquetFile(source).metadata.schema
And so I could get the columns:
<pyarrow._parquet.ParquetSchema object at 0x10927cfd0>
name: BYTE_ARRAY
age: INT64
hobbies: BYTE_ARRAY String
But this solution, as it seems to me, requires me to get the file locally (onto the Dataflow worker??), and it doesn't use Apache Beam.
Any (better) solution?
Thank you!
I'm happy I could come up with a hand-made solution after reading the source code of apache_beam.io.parquetio:
import pyarrow.parquet as pq
from apache_beam.io.parquetio import _ParquetSource
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '<json_key_path>'
ps = _ParquetSource("", None, None, None) # file_pattern, min_bundle_size, validate, columns
with ps.open_file("<GCS_path_of_parquet_file>") as f:
    pf = pq.ParquetFile(f)
    print(pf.metadata.schema)

Spark ElasticSearch Configuration - Reading Elastic Search from Spark

I am trying to read data from Elasticsearch via Spark Scala. I see a lot of posts addressing this question; I have tried all the options they mention, but nothing seems to work for me.
JAR Used - elasticsearch-hadoop-5.6.8.jar (Used elasticsearch-spark-5.6.8.jar too without any success)
Elastic Search Version - 5.6.8
Spark - 2.3.0
Scala - 2.11
Code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.elasticsearch.spark._
val spark = SparkSession.builder.appName("elasticSpark").master("local[*]").getOrCreate()
val reader = spark.read.format("org.elasticsearch.spark.sql").option("es.index.auto.create", "true").option("spark.serializer", "org.apache.spark.serializer.KryoSerializer").option("es.port", "9200").option("es.nodes", "xxxxxxxxx").option("es.nodes.wan.only", "true").option("es.net.http.auth.user","xxxxxx").option("es.net.http.auth.pass", "xxxxxxxx")
val read = reader.load("index/type")
Error:
ERROR rest.NetworkClient: Node [xxxxxxxxx:9200] failed (The server xxxxxxxxxxxxx failed to respond); no other nodes left - aborting...
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:294)
at org.elasticsearch.spark.sql.SchemaUtils$.discoverMappingAndGeoFields(SchemaUtils.scala:98)
at org.elasticsearch.spark.sql.SchemaUtils$.discoverMapping(SchemaUtils.scala:91)
at org.elasticsearch.spark.sql.ElasticsearchRelation.lazySchema$lzycompute(DefaultSource.scala:129)
at org.elasticsearch.spark.sql.ElasticsearchRelation.lazySchema(DefaultSource.scala:129)
at org.elasticsearch.spark.sql.ElasticsearchRelation$$anonfun$schema$1.apply(DefaultSource.scala:133)
at org.elasticsearch.spark.sql.ElasticsearchRelation$$anonfun$schema$1.apply(DefaultSource.scala:133)
at scala.Option.getOrElse(Option.scala:121)
at org.elasticsearch.spark.sql.ElasticsearchRelation.schema(DefaultSource.scala:133)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:432)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
... 53 elided
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[xxxxxxxxxxx:9200]]
at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:149)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:461)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:425)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:429)
at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:155)
at org.elasticsearch.hadoop.rest.RestClient.remoteEsVersion(RestClient.java:655)
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:287)
... 65 more
Apart from this, I have also tried the properties below, without any success:
option("es.net.ssl.cert.allow.self.signed", "true")
option("es.net.ssl.truststore.location", "<path for elasticsearch cert file>")
option("es.net.ssl.truststore.pass", "xxxxxx")
Please note that the Elasticsearch node is on a Unix edge node and is http://xxxxxx:9200 (mentioning it just in case that makes any difference to the code).
What am I missing here? Any other properties? Please help.
Use the JAR below, which supports Spark 2+, instead of the elasticsearch-hadoop or elasticsearch-spark jar:
https://mvnrepository.com/artifact/org.elasticsearch/elasticsearch-spark-20_2.11/5.6.8
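For reference, a minimal sketch of the read once that Spark 2.x connector (elasticsearch-spark-20_2.11:5.6.8) is attached to the cluster; the host, credentials, and index name are placeholders mirroring the question:
val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "xxxxxxxxx")
  .option("es.port", "9200")
  .option("es.nodes.wan.only", "true")
  .option("es.net.http.auth.user", "xxxxxx")
  .option("es.net.http.auth.pass", "xxxxxxxx")
  .load("index/type")
df.printSchema()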

Scala Notebook Type Mismatch error when creating Event Hub for streaming tweets

I want to send messages from a Twitter application to an Azure event hub. However, I am getting an error that says:
notebook:20: error: type mismatch;
found : java.util.concurrent.ExecutorService
required: java.util.concurrent.ScheduledExecutorService
val eventHubClient = EventHubClient.create(connStr.toString(), pool)
I do not know how to call EventHubClient.create correctly now. Please help.
I am referring to the code from this link:
https://learn.microsoft.com/en-us/azure/azure-databricks/databricks-stream-from-eventhubs
I have also tried the solution from the link Stream data into Azure Databricks using Event Hubs, and it doesn't work for me.
The cluster version is 5.2 (includes Apache Spark 2.4.0, Scala 2.11), which should include the Java SE 8 libraries that provide ScheduledExecutorService. Also, the attached libraries are com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.9 and org.twitter4j:twitter4j-core:4.0.7, so again all the prerequisites are met.
The code is:
import java._
import java.util._
import scala.collection.JavaConverters._
import com.microsoft.azure.eventhubs._
import java.util.concurrent._
import java.util.concurrent.ExecutorService
import java.util.concurrent.ScheduledExecutorService
val pool = Executors.newFixedThreadPool(1)
val eventHubClient = EventHubClient.create(connStr.toString(), pool)
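The compiler message itself points at the fix: in this library version, EventHubClient.create expects a ScheduledExecutorService, while Executors.newFixedThreadPool returns a plain ExecutorService. A minimal sketch of the change, assuming the same connStr as above:
import java.util.concurrent.{Executors, ScheduledExecutorService}
import com.microsoft.azure.eventhubs.EventHubClient
// Build a ScheduledExecutorService instead of a fixed thread pool, as the create signature requires.
val pool: ScheduledExecutorService = Executors.newScheduledThreadPool(1)
val eventHubClient = EventHubClient.create(connStr.toString(), pool)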

Can't import sqlContext.implicits._ without an error through Jupyter

When I try to use import sqlContext.implicits._ in my Jupyter notebook, I get the following error:
Name: Compile Error
Message: <console>:25: error: stable identifier required, but $iwC.this.$VAL10.sqlContext.implicits found.
import sqlContext.implicits._
^
I've tried this locally and it works, but it does not function properly on my Jupyter Notebook server (which is hosted on EC2). I have tried importing different libraries related to this, but unfortunately cannot get it to work.
You need to instantiate a sqlContext like so:
val sqlC = new org.apache.spark.sql.SQLContext(sc)
import sqlC.implicits._
You should have seen this error:
stable identifier required
You have to use the val keyword instead of the var keyword, since val is equivalent to the const or final keywords.
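A quick usage check once the implicits are imported from a val (the example data here is made up):
// toDF comes from sqlC.implicits._ and only resolves because sqlC is a stable (val) identifier.
val df = Seq((1, "a"), (2, "b")).toDF("id", "value")
df.show()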