How to save output from a Cassandra table using Spark - Scala

I want to save the output/rows read from a Cassandra table to a file in either CSV or JSON format, using Spark 1.6.3:
scala> val results = sqlContext.sql("select * from myks.mytable")
scala> results.write.option("header", "true").save("/tmp/xx.csv") // writes to the cfs:// filesystem
I am not able to find an option to write to the OS filesystem as a CSV or JSON file.
Appreciate any help!

Use the Spark Cassandra Connector to read the data from the Cassandra table into Spark:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/0_quick_start.md
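A minimal sketch of how that could look on Spark 1.6.x, assuming the spark-cassandra-connector and the spark-csv package are on the classpath and spark.cassandra.connection.host is configured (myks.mytable is the table from the question; the output paths are placeholders):

// read the Cassandra table into a DataFrame via the connector's data source
val results = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "myks", "table" -> "mytable"))
  .load()

// file:/// forces the output onto the local OS filesystem instead of the
// default filesystem (cfs:// on DSE, hdfs:// on Hadoop)
results.write
  .format("com.databricks.spark.csv")   // CSV needs the spark-csv package on Spark 1.6
  .option("header", "true")
  .save("file:///tmp/xx_csv")

// JSON support is built in
results.write.json("file:///tmp/xx_json")

The file:/// prefix is what keeps the output off cfs://; on a multi-node cluster each executor writes its partitions to its own local disk, so coalesce(1) (or collecting to the driver) may be needed if you want a single local file.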

Related

How to convert Hadoop Avro, Parquet, and text files to CSV without Spark

I have HDFS versions of Avro, Parquet, and text files. Unfortunately, I can't use Spark to convert them to CSV. I saw from an earlier SO question that this doesn't seem to be possible: How to convert HDFS file to csv or tsv.
Is this possible, and if so, how do I do this?
This will help you read Avro files (just avoid schema evolution/modifications...).
Example.
As for Parquet, you can use parquet-mr; take a look at ParquetReader.
Example: ignore the Spark usage there; they only use it to create a Parquet file that is then read back with ParquetReader.
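For the Avro side, a rough sketch in plain Scala using the Avro Java API (no Spark), assuming the file has been copied out of HDFS locally (e.g. with hdfs dfs -get) and org.apache.avro:avro is on the classpath; the file names are placeholders:

import java.io.{File, PrintWriter}
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import scala.collection.JavaConverters._

// open the Avro container file and pull the field names from its embedded schema
val reader = new DataFileReader[GenericRecord](new File("data.avro"), new GenericDatumReader[GenericRecord]())
val fields = reader.getSchema.getFields.asScala.map(_.name)

val out = new PrintWriter(new File("data.csv"))
out.println(fields.mkString(","))   // header row
while (reader.hasNext) {
  val record = reader.next()
  // note: values containing commas or newlines would need proper CSV quoting
  out.println(fields.map(f => String.valueOf(record.get(f))).mkString(","))
}
out.close()
reader.close()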
Hope it helps

How to use Spark Streaming to get data from an HBase table using Scala

I am trying to identify a solution to read data from an HBase table using Spark Streaming and write the data to another HBase table.
I found numerous samples on the internet that create a DStream to get data from HDFS files, but I was unable to find any examples that get data from HBase tables.
For example, if I have an HBase table 'SAMPLE' with columns 'name' and 'activeStatus', how can I retrieve the data from the table SAMPLE based on the activeStatus column using Spark Streaming (new data)?
Any example of retrieving data from an HBase table using Spark Streaming is welcome.
Regards,
Adarsh K S
You can connect to HBase from Spark in multiple ways:
Hortonworks Spark HBase connector (SHC): https://github.com/hortonworks-spark/shc
Unicredit hbase-rdd: https://github.com/unicredit/hbase-rdd
Hortonworks SHC reads HBase directly into a DataFrame using a user-defined catalog, whereas hbase-rdd reads it as an RDD that can be converted to a DataFrame with toDF. hbase-rdd also has a bulk write option (writing HFiles directly), which is preferred for massive writes.
What you need is a library that enables Spark to interact with HBase. Hortonworks' SHC is such an extension:
https://github.com/hortonworks-spark/shc
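A minimal sketch of the SHC approach, assuming shc-core is on the classpath; the column family name "cf" and the filter value are assumptions, and note that Spark Streaming has no built-in HBase receiver, so a query like this is typically re-run inside each batch or on a schedule:

import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// catalog mapping for the SAMPLE table from the question
val catalog = """{
  |"table":{"namespace":"default", "name":"SAMPLE"},
  |"rowkey":"key",
  |"columns":{
    |"rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
    |"name":{"cf":"cf", "col":"name", "type":"string"},
    |"activeStatus":{"cf":"cf", "col":"activeStatus", "type":"string"}
  |}
|}""".stripMargin

// read HBase into a DataFrame through the SHC data source
val df = sqlContext.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

// filter on the activeStatus column (the value "true" is a placeholder)
df.filter(df("activeStatus") === "true").show()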

Read PDF file in Apache Spark DataFrames

We can read an Avro file using the code below:
val df = spark.read.format("com.databricks.spark.avro").load(path)
Is it possible to read PDF files using Spark DataFrames?
You cannot read a PDF and store it in a DataFrame directly, because Spark cannot infer the columns of the DataFrame (a PDF basically doesn't have a standard schema). So if you want to get data out of a PDF, first convert it to CSV or Parquet, then read from that file and create a DataFrame, since it has a defined schema.
Visit this GitBook to learn more about the read formats available for loading data as a DataFrame:
DataFrameReader — Loading Data From External Data Sources
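A minimal sketch of that convert-then-read approach, assuming the PDF has already been turned into CSV by an external tool (e.g. Tabula); the path is a placeholder:

// read the CSV produced by the external converter into a DataFrame
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/tmp/extracted_from_pdf.csv")

df.printSchema()
df.show(5)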

Spark dataframe CSV vs Parquet

I am a beginner in Spark and trying to understand the mechanics of Spark DataFrames.
I am comparing the performance of SQL queries on a Spark SQL DataFrame when loading data from CSV versus Parquet. My understanding is that once the data is loaded into a Spark DataFrame, it shouldn't matter where the data was sourced from (CSV or Parquet). However, I see a significant performance difference between the two. I am loading the data using the following commands and then writing queries against it.
dataframe_csv = sqlcontext.read.format("csv").load()
dataframe_parquet = sqlcontext.read.parquet()
Please explain the reason for the difference.
The reason you see different performance between CSV and Parquet is that Parquet is a columnar storage format while CSV is plain text. Columnar storage achieves a smaller storage size, and it also lets Spark read only the columns (and, with predicate pushdown, only the row groups) a query actually needs, whereas a CSV file has to be parsed line by line in full, so queries against Parquet are usually faster.
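One way to see this for yourself (paths and column names are hypothetical, assuming Spark 2.x or the spark-csv package on 1.6) is to compare the physical plans of the same query against both sources with explain():

val csvDF = sqlContext.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")   // extra full pass over the file just to guess types
  .load("/data/events.csv")

val parquetDF = sqlContext.read.parquet("/data/events.parquet")

// compare the physical plans of the same query against both sources
csvDF.select("user_id").filter(csvDF("country") === "US").explain()
parquetDF.select("user_id").filter(parquetDF("country") === "US").explain()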

Sending data from my spark code to redshift

I have Spark code written in Scala. My code reads an XML file and extracts all the info in it. The goal is to store the info from the XML into Redshift tables.
Is it possible to send the data directly from my Scala Spark code to Redshift without using S3?
Cheers!
If you're using Spark SQL, you can read your XML data into a DataFrame using spark-xml and then write it into Redshift tables using spark-redshift.
You can also take a look at this question.
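A rough sketch of that pipeline, assuming the spark-xml and spark-redshift packages are on the classpath; the row tag, JDBC URL, table name and bucket are placeholders. Note that spark-redshift itself stages the DataFrame in S3 (the tempdir option) and then issues a Redshift COPY, so S3 is still involved under the hood:

// read the XML into a DataFrame (rowTag is a placeholder for your record element)
val xmlDF = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record")
  .load("/path/to/input.xml")

// write to Redshift; the data is staged in the S3 tempdir, then loaded with COPY
xmlDF.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshift-host:5439/mydb?user=USER&password=PASS")
  .option("dbtable", "my_schema.my_table")
  .option("tempdir", "s3n://my-bucket/spark-redshift-tmp/")
  .option("forward_spark_s3_credentials", "true")
  .mode("append")
  .save()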
You can do row-level inserts using pre-prepared SQL statements from your Python/Java code, but it will be extremely inefficient if you are going to insert more than a few records.