Spark SQL using Spark Thrift Server - Scala

I want to access tables registered in Spark through a JDBC-like service, using the Thrift service provided by Spark.
I couldn't find any documentation for this on Google; can anyone tell me how to use the Thrift server to access Spark tables?
Also, what is the lifetime of these tables in memory? Will they reside in memory for as long as the Thrift server is running?

The Thrift server documentation is located in the Spark SQL reference page under Distributed SQL Engine (http://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine).
If you cache a query, the cached result will stay in memory until the Thrift server is stopped.
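As a rough sketch (not from the documentation), assuming the Thrift server has been started with sbin/start-thriftserver.sh on its default port 10000 and the Hive JDBC driver (org.apache.hive:hive-jdbc) is on the classpath, a Scala client could look like this; the host, credentials, and the table name my_table are placeholders:

    import java.sql.DriverManager

    object ThriftServerClient {
      def main(args: Array[String]): Unit = {
        // The Thrift server speaks the HiveServer2 protocol, so the Hive JDBC driver is used.
        Class.forName("org.apache.hive.jdbc.HiveDriver")
        val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
        try {
          val stmt = conn.createStatement()
          // Optionally pin a table in memory; it stays cached while the Thrift server runs.
          stmt.execute("CACHE TABLE my_table")
          val rs = stmt.executeQuery("SELECT * FROM my_table LIMIT 10")
          while (rs.next()) {
            println(rs.getString(1))
          }
        } finally {
          conn.close()
        }
      }
    }

Note that, as far as I know, temporary views registered in a separate Spark application are not visible to a standalone Thrift server; the tables need to be saved to the shared metastore (or the Thrift server started within the same Spark context) to be queryable this way.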

Related

What are the proper properties for Kafka connector when using Oracle DB

I am learning about Kafka Connect and would like to use Oracle as my database.
I am having trouble with the properties.
Is there any setting/property that I am missing in order to fix this error?
According to the docs:
The JDBC source and sink connectors use the Java Database Connectivity (JDBC) API that enables applications to connect to and use a wide range of database systems. In order for this to work, the connectors must have a JDBC Driver for the particular database systems you will use.
So in your case you have to install the Oracle JDBC driver.
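Not from the docs, but as a quick way to check the driver installation: the sketch below assumes ojdbc8.jar is on the classpath and uses placeholder connection details (db-host, service name ORCLPDB1, credentials). The same driver jar also has to be placed somewhere the Kafka Connect JDBC connector can load it (its plugin/lib directory):

    import java.sql.DriverManager

    object OracleDriverCheck {
      def main(args: Array[String]): Unit = {
        // Fails with ClassNotFoundException if the Oracle JDBC driver is missing.
        Class.forName("oracle.jdbc.OracleDriver")

        // Placeholder connection details -- replace with your own.
        val url = "jdbc:oracle:thin:@//db-host:1521/ORCLPDB1"
        val conn = DriverManager.getConnection(url, "myuser", "mypassword")
        try {
          println(s"Connected: ${!conn.isClosed}")
        } finally {
          conn.close()
        }
      }
    }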

Is there a way to connect to multiple databases in multiple hosts using Kafka Connect?

I need to get data from Informix databases using Kafka Connect. The scenario is this: I have 50 Informix databases residing on 50 hosts. What I have understood from reading about Kafka Connect is that we need to install Kafka Connect on each host to get the data from the database residing on that host. My question is this: is there a way I can create the connectors centrally for these 50 hosts, instead of installing on each of them, and pull data from the databases?
Kafka Connect JDBC does not have to run on the database host, just as other JDBC clients don't, so you can have a Kafka Connect cluster that is larger or smaller than your database pool.
Informix also seems to have something called "CDC Replication Engine for Kafka", however, which might be worth looking into, as CDC generally puts less load on the database.
You don't need any additional software installation on the system where the Informix server is running. I am not fully clear about the question or the type of operation you plan to do. If you are planning to set up a real-time replication type of scenario, then you may have to invoke the CDC API; a one-time setup of the CDC API at the server is needed, and the API can then be invoked through any Informix database driver. If you plan to read existing data from table(s) and pump it into Kafka topics, then no additional setup is needed on the server side. You could connect to all 50 database servers from a single program (remotely) and then pump those records into the Kafka topic(s); a sketch of that approach follows below. Depending on the programming language you are using, you can choose the appropriate Informix database driver.
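As a minimal sketch of that last suggestion (one program connecting to all the databases remotely and producing to Kafka), assuming the Informix JDBC driver and the Kafka client library are on the classpath; the URLs, credentials, table, and topic names are placeholders:

    import java.sql.DriverManager
    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object InformixToKafka {
      def main(args: Array[String]): Unit = {
        // One JDBC URL per remote Informix host (placeholders; extend to all 50).
        val jdbcUrls = Seq(
          "jdbc:informix-sqli://host1:9088/mydb:INFORMIXSERVER=server1",
          "jdbc:informix-sqli://host2:9088/mydb:INFORMIXSERVER=server2"
        )

        val props = new Properties()
        props.put("bootstrap.servers", "kafka-broker:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)

        // Requires the Informix JDBC driver on the classpath.
        Class.forName("com.informix.jdbc.IfxDriver")

        try {
          jdbcUrls.foreach { url =>
            val conn = DriverManager.getConnection(url, "informix", "password")
            try {
              val rs = conn.createStatement().executeQuery("SELECT id, payload FROM my_table")
              while (rs.next()) {
                producer.send(new ProducerRecord[String, String]("informix-topic",
                  rs.getString("id"), rs.getString("payload")))
              }
            } finally {
              conn.close()
            }
          }
        } finally {
          producer.close()
        }
      }
    }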

Postgres streaming using JDBC Kafka Connect

I am trying to stream changes in my Postgres database using the Kafka Connect JDBC Connector. I am running into issues upon startup as the database is quite big and the query dies every time as rows change in between.
What is the best practice for starting off the JDBC Connector on really huge tables?
Assuming you can't pause the workload on the database that you're streaming the contents from in order to allow the initialisation to complete, I would look at Debezium.
In fact, depending on your use case, I would look at Debezium regardless :) It lets you do true CDC against Postgres (and MySQL and MongoDB), and is a Kafka Connect plugin just like the JDBC Connector is so you retain all the benefits of that.

Akka Persistence: migrating from jdbc (postgres) to cassandra

I have a running project using the akka-persistence-jdbc plugin with PostgreSQL as a backend.
Now I want to migrate to akka-persistence-cassandra.
But how can I convert the existing events (more than 4 GB in Postgres) to Cassandra?
Should I write a manual migration program that reads from Postgres and writes in the right format to Cassandra?
This is a classic migration problem. There are multiple solutions for this.
Spark SQL and Spark Cassandra Connector: the Spark JDBC API (i.e. Spark DataFrames / Spark SQL) allows you to read from any JDBC source. Read the data in chunks by partitioning it, otherwise you will run out of memory; partitioning also makes the migration parallel. Then write the data into Cassandra with the Spark Cassandra Connector. This is by far the simplest and most efficient way I have used in my own tasks (see the sketch at the end of this answer).
Java Agents: an agent can be written with plain JDBC or other libraries and then write to Cassandra with the DataStax driver. A Spark program runs across multiple machines in a multi-threaded way and recovers automatically if something goes wrong, whereas an agent you write by hand runs on a single machine and the multi-threading also has to be coded yourself.
Kafka Connectors: Kafka is a message broker. It can be used indirectly for the migration. Kafka Connect has connectors that can read from and write to different databases; you can use the JDBC source connector to read from Postgres and a Cassandra sink connector to write to Cassandra. It's not that easy to set up, but it has the advantage of "no coding involved".
ETL Systems: some ETL systems have support for Cassandra, but I haven't personally tried any of them.
I saw several advantages in using Spark with the Spark Cassandra Connector and Spark SQL for the migration, some of them being:
The code was concise; it was hardly 40 lines.
It runs on multiple machines (and, again, multi-threaded on each machine).
Job progress and statistics are visible in the Spark master UI.
Fault tolerance: if a Spark node goes down or a thread/worker fails, the job is automatically restarted on another node, which is good for very long-running jobs.
If you don't know Spark, then writing an agent is okay for 4 GB of data.
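For those who do know Spark, here is a minimal sketch of the first option, under some assumptions: the spark-cassandra-connector package is on the classpath, the journal table has a numeric ordering column to partition on, and the hosts, credentials, keyspace, and table names are all placeholders. The columns would still need to be mapped to the schema expected by akka-persistence-cassandra before writing:

    import org.apache.spark.sql.SparkSession

    object JournalMigration {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("postgres-to-cassandra-migration")
          // Contact point for the Spark Cassandra Connector (placeholder host).
          .config("spark.cassandra.connection.host", "cassandra-host")
          .getOrCreate()

        // Read the Postgres journal in parallel partitions to avoid running out of memory.
        val journal = spark.read
          .format("jdbc")
          .option("url", "jdbc:postgresql://pg-host:5432/akka")
          .option("dbtable", "journal")
          .option("user", "akka")
          .option("password", "secret")
          .option("partitionColumn", "ordering")
          .option("lowerBound", "1")
          .option("upperBound", "10000000")
          .option("numPartitions", "50")
          .load()

        // Write to the target Cassandra table (after mapping columns to its schema).
        journal.write
          .format("org.apache.spark.sql.cassandra")
          .option("keyspace", "akka")
          .option("table", "messages")
          .mode("append")
          .save()

        spark.stop()
      }
    }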

HBase for file I/O, and a way to connect to HDFS from a remote client

Please be aware before you read that I'm not fluent in English.
I'm new to NoSQL and am now trying to use HBase for file storage; I'll store files in HBase as binary.
I don't need any statistics. All I need is file storage.
Is this recommended?
I am worried about I/O speed.
Actually, because I couldn't find any way to connect to HDFS without Hadoop, I want to try HBase for file storage. I can't set up Hadoop on the client computer. I was trying to find some libraries, like JDBC for an RDBMS, that would help the client connect to HDFS to get files, but I couldn't find anything, so I chose HBase instead of a connection library.
Can I get any help from someone?
It really depends on your file sizes. In HBase it is generally not recommended to store files or LOBs; the default max keyvalue size is 10 MB. I have raised that limit and run tests with values larger than 100 MB, but you do risk OOMEs on your region servers, since they have to hold the entire value in memory, so configure your JVMs' memory with care.
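If you do go the HBase route despite that, a minimal sketch in Scala (placeholder ZooKeeper quorum; a table named files with column family f is assumed to exist already) would look roughly like this; note the raised hbase.client.keyvalue.maxsize for values over the 10 MB default:

    import java.nio.file.{Files, Paths}
    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes

    object HBaseFileStore {
      def main(args: Array[String]): Unit = {
        val conf = HBaseConfiguration.create()
        conf.set("hbase.zookeeper.quorum", "zk-host")          // placeholder
        // Raise the client-side limit if your files exceed the 10 MB default.
        conf.set("hbase.client.keyvalue.maxsize", "104857600") // 100 MB

        val connection = ConnectionFactory.createConnection(conf)
        try {
          // Assumes a table 'files' with column family 'f' already exists.
          val table = connection.getTable(TableName.valueOf("files"))
          val bytes = Files.readAllBytes(Paths.get("/path/to/local/file.bin"))

          val put = new Put(Bytes.toBytes("file.bin")) // row key = file name
          put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("data"), bytes)
          table.put(put)
          table.close()
        } finally {
          connection.close()
        }
      }
    }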
When this type of question is asked on the hbase-users mailing list, the usual response is to recommend using HDFS if your files can be large.
You should be able to use Thrift to connect to HDFS to bypass installing the Hadoop client on your client computer.