How to read an HBase table as a PySpark DataFrame?

Is it possible to read HBase tables directly as PySpark DataFrames without using Hive, Phoenix, or the Spark-HBase connector provided by Hortonworks?
I'm relatively new to HBase and couldn't find a direct Python example for converting HBase tables into PySpark DataFrames. Most of the examples I saw were in either Scala or Java.

You can connect to HBase via Phoenix. Sample code:
df = sqlContext.read.format('jdbc').options(
    driver='org.apache.phoenix.jdbc.PhoenixDriver',
    url='jdbc:phoenix:url:port:/hbase-unsecure',
    dbtable='table_name').load()
You may need the Phoenix Spark connector JARs phoenix-spark-4.7.0-HBase-1.1.jar and phoenix-4.7.0-HBase-1.1-client.jar on the classpath.
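A minimal sketch of wiring those JARs in from PySpark (the paths, ZooKeeper host and port below are placeholders; depending on your deployment you may need --jars or spark.driver.extraClassPath instead):

from pyspark.sql import SparkSession

# Paths below are placeholders; point them at your local copies of the Phoenix JARs.
spark = (SparkSession.builder
         .appName('hbase-via-phoenix')
         .config('spark.jars',
                 '/path/to/phoenix-spark-4.7.0-HBase-1.1.jar,'
                 '/path/to/phoenix-4.7.0-HBase-1.1-client.jar')
         .getOrCreate())

df = (spark.read.format('jdbc')
      .option('driver', 'org.apache.phoenix.jdbc.PhoenixDriver')
      .option('url', 'jdbc:phoenix:zk-host:2181:/hbase-unsecure')  # assumed ZooKeeper quorum/port
      .option('dbtable', 'table_name')
      .load())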

Related

How to write a Spark dataframe into Kinesis Stream?

I am creating a DataFrame from a Kafka topic using Spark Streaming.
I want to write the DataFrame to a Kinesis producer.
I understand that there is no official API for this as of now. There are multiple APIs available on the internet, but sadly none of them worked for me.
Spark version: 2.2
Scala: 2.11
I tried using https://github.com/awslabs/kinesis-kafka-connector and built the jar, but I am getting errors due to conflicting package names between this jar and the Spark API. Please help.
Here is the code, for others:
spark-shell --jars spark-sql-kinesis_2.11-2.2.0.jar,spark-sql-kafka-0-10_2.11-2.1.0.jar,spark-streaming-kafka-0-10-assembly_2.10-2.1.0.jar \
  --files kafka_client_jaas_spark.conf \
  --properties-file gobblin_migration.conf \
  --conf spark.port.maxRetries=100 \
  --driver-java-options "-Djava.security.auth.login.config=kafka_client_jaas_spark.conf" \
  --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=kafka_client_jaas_spark.conf" \
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=kafka_client_jaas_spark.conf"
import java.io.File
import org.apache.commons.lang3.exception.ExceptionUtils
import org.apache.spark.SparkException
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import scala.sys.process._
import org.apache.log4j.{ Logger, Level, LogManager, PropertyConfigurator }
import org.apache.spark.sql.streaming.Trigger
val streamingInputDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "bootstrap server")
  .option("subscribe", "<kafkatopic>")
  .option("startingOffsets", "latest")
  .option("failOnDataLoss", "false")
  .option("kafka.security.protocol", "SASL_PLAINTEXT")
  .load()

val xdf = streamingInputDF.select(
  col("partition").cast("String").alias("partitionKey"),
  col("value").alias("data"))

xdf.writeStream
  .format("kinesis")
  .option("checkpointLocation", "<hdfspath>")
  .outputMode("Append")
  .option("streamName", "kinesisstreamname")
  .option("endpointUrl", "kinesisendpoint")
  .option("awsAccessKeyId", "accesskey")
  .option("awsSecretKey", "secretkey")
  .start()
  .awaitTermination()
For the jar spark-sql-kinesis_2.11-2.2.0.jar, go to Qubole, download the package for your Spark version, and build the jar.
If you are behind a corporate network, set the proxy before launching Spark:
export http_proxy=http://server-ip:port/
export https_proxy=https://server-ip:port/
Kafka Connect is a service to which you can POST your connector specification (Kinesis in this case), and it then takes care of running the connector. It also supports quite a few transformations while processing the records. Kafka Connect plugins are not intended to be used with Spark applications.
If your use case requires some business logic while processing the records, then you could go with either the Spark Streaming or the Structured Streaming approach.
If you want to take a Spark-based approach, below are the two options I can think of.
Use Structured Streaming. You could use a Structured Streaming connector for Kinesis. You can find one here; there may be others too, but this is the only stable and open-source connector I am aware of. You can find an example of using Kinesis as a sink here.
Use the Kinesis Producer Library (KPL) or the aws-java-sdk-kinesis library to publish records from your Spark Streaming application; using the KPL is the preferred approach here. You could do mapPartitions, create a Kinesis client per partition, and publish the records using these libraries. There are plenty of examples in the AWS docs for these two libraries. A rough sketch of this per-partition pattern follows below.
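A minimal, hedged sketch of that per-partition publishing pattern in Python (boto3 stands in here for the KPL/Java SDK mentioned above; the region, stream name and record shape are assumptions):

import json
import boto3

def publish_partition(records):
    # One Kinesis client per partition, created on the executor.
    client = boto3.client('kinesis', region_name='us-east-1')  # assumed region
    batch = [{'Data': json.dumps(r).encode('utf-8'),   # assumes JSON-serializable records
              'PartitionKey': str(hash(str(r)))}        # assumed partition-key choice
             for r in records]
    if batch:
        # put_records accepts at most 500 records per call; larger partitions
        # need to be chunked accordingly.
        client.put_records(StreamName='my-stream', Records=batch)  # assumed stream name

# dstream is the DStream produced by the Spark Streaming job.
dstream.foreachRDD(lambda rdd: rdd.foreachPartition(publish_partition))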

Is it possible to load a database directly from HDFS into Spark as a DataFrame?

I have MongoDB and Spark running on Zeppelin, both sharing the same HDFS. MongoDB produces a .wt database stored in that same HDFS.
I want to load the database collection produced by MongoDB from HDFS into a Spark DataFrame.
Is it possible to load the database directly from HDFS into Spark as a DataFrame, or do I need to use a MongoDB Spark connector?
I would not recommend reading or modifying the internal WiredTiger storage engine's *.wt files. First, these internal files are subject to change without notice (they are not a public-facing API); also, any unintended modification to these files may leave the database in an invalid/corrupt state.
You can utilise the MongoDB Spark Connector to load data from MongoDB into Spark. This connector is designed, developed and optimised for reading and writing data between MongoDB and Apache Spark. For example, by accessing data via the database, the client can utilise the database indexes to optimise read operations.
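A minimal sketch of such a read from PySpark (assuming the MongoDB Spark Connector 2.x/3.x package is on the classpath; the URI, database and collection names are placeholders, and the newer 10.x connector line uses format("mongodb") with slightly different options):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('mongo-read')
         .config('spark.mongodb.input.uri',
                 'mongodb://mongo-host:27017/mydb.mycollection')  # placeholder URI
         .getOrCreate())

df = spark.read.format('mongo').load()
df.printSchema()
df.show(5)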
See also:
GitHub demo: Docker for MongoDB, Apache Spark and Apache Zeppelin
GitHub demo: Docker for MongoDB and Apache Spark

Store Spark DStream to mongodb

I'm working with Apache Spark Streaming (NO SPARK SQL) and I would like to store the results of my script in MongoDB.
How can I do that? I found some tutorials, but they only work with Apache Spark SQL.
Thanks
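A minimal sketch of one common approach, assuming pymongo is installed on the executors and using placeholder connection details, is to write from foreachRDD/foreachPartition:

from pymongo import MongoClient

def save_partition(records):
    # One client per partition, created on the executor.
    client = MongoClient('mongodb://mongo-host:27017')  # placeholder host
    collection = client['mydb']['results']              # placeholder db/collection
    docs = [{'value': r} for r in records]              # adapt to your record shape
    if docs:
        collection.insert_many(docs)
    client.close()

# dstream is the DStream produced by the streaming script.
dstream.foreachRDD(lambda rdd: rdd.foreachPartition(save_partition))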

Akka Persistence: migrating from jdbc (postgres) to cassandra

I have a running project using the akka-persistence-jdbc plugin with PostgreSQL as the backend.
Now I want to migrate to akka-persistence-cassandra.
But how can I convert the existing events (more than 4 GB in Postgres) to Cassandra?
Should I write a manual migration program that reads from Postgres and writes, in the right format, to Cassandra?
This is a classic migration problem. There are multiple solutions for this.
Spark SQL and Spark Cassandra Connector: The Spark JDBC (also called Spark DataFrame / Spark SQL) API allows you to read from any JDBC source. Read it in chunks by segmenting it, otherwise you will run out of memory; segmentation also makes the migration parallel. Then write the data to Cassandra with the Spark Cassandra Connector. This is by far the simplest and most efficient way I have used in my own tasks; a minimal sketch appears at the end of this answer.
Java agents: A Java agent can be written using plain JDBC or other libraries and then write to Cassandra with the DataStax driver. A Spark program runs on multiple machines in a multi-threaded way and recovers automatically if something goes wrong, whereas an agent written manually runs on a single machine and the multi-threading also needs to be coded.
Kafka connectors: Kafka is a message broker and can be used indirectly for the migration. Kafka has connectors that can read from and write to different databases; you can use the JDBC connector to read from Postgres and the Cassandra connector to write to Cassandra. It's not that easy to set up, but it has the advantage of involving no coding.
ETL systems: Some ETL systems have support for Cassandra, but I haven't personally tried any.
Some advantages I saw in using Spark Cassandra Connector and Spark SQL for the migration:
The code was concise, hardly 40 lines.
Multi-machine (and multi-threaded on each machine).
Job progress and statistics in the Spark master UI.
Fault tolerance: if a Spark node goes down or a thread/worker fails, the job is automatically restarted on another node, which is good for very long-running jobs.
If you don't know Spark, writing an agent is okay for 4 GB of data.
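A hedged, minimal sketch of the first option in PySpark (the connection details, table names and partitioning column are assumptions, and the PostgreSQL JDBC driver plus the Spark Cassandra Connector package are assumed to be available to the job):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('journal-migration')
         .config('spark.cassandra.connection.host', 'cassandra-host')  # placeholder
         .getOrCreate())

# Read the journal in parallel chunks by segmenting on a numeric column;
# the akka-persistence-jdbc 'ordering' column is assumed here.
events = (spark.read.format('jdbc')
          .option('url', 'jdbc:postgresql://pg-host:5432/akka')  # placeholder
          .option('dbtable', 'journal')                          # placeholder table
          .option('user', 'akka')
          .option('password', 'secret')                          # placeholder credentials
          .option('partitionColumn', 'ordering')
          .option('lowerBound', '1')
          .option('upperBound', '10000000')                      # rough upper bound
          .option('numPartitions', '32')
          .load())

# Whatever mapping converts the JDBC journal layout to the Cassandra journal
# layout goes here; it is highly plugin-version specific.
transformed = events  # placeholder for the real mapping

(transformed.write
    .format('org.apache.spark.sql.cassandra')
    .options(table='messages', keyspace='akka')                  # placeholder target
    .mode('append')
    .save())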

connect to memsql from pyspark shell

Is it possible to connect to MemSQL from pyspark?
I heard that MemSQL recently built the Streamliner infrastructure on top of pyspark to allow for custom Python transformations.
But does this mean I can run pyspark or submit a Python Spark job that connects to MemSQL?
Yes to both questions.
Streamliner is the best approach if your aim is to get data into MemSQL or perform transformation during ingest. How to use Python with Streamliner: http://docs.memsql.com/latest/spark/memsql-spark-interface-python/
You can also query MemSQL from a Spark application. Details on that here: http://docs.memsql.com/latest/spark/spark-sql-pushdown/
You can also run a Spark shell. See http://docs.memsql.com/latest/ops/cli/SPARK-SHELL/ & http://docs.memsql.com/latest/spark/admin/#launching-the-spark-shell
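From the PySpark shell, a hedged sketch of a direct query using Spark's generic JDBC reader (MemSQL speaks the MySQL wire protocol; the host, credentials and table name are placeholders, and a MySQL or MariaDB JDBC driver JAR is assumed to be on the classpath):

df = (spark.read.format('jdbc')
      .option('url', 'jdbc:mysql://memsql-host:3306/mydb')  # placeholder host/db
      .option('driver', 'com.mysql.jdbc.Driver')            # or org.mariadb.jdbc.Driver
      .option('dbtable', 'my_table')                        # placeholder table
      .option('user', 'root')
      .option('password', 'secret')                         # placeholder credentials
      .load())

df.show(5)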