Spark: Create a DataFrame from an InputStream? - scala

I want to avoid writing the entire stream to a file and then loading that file into a DataFrame. What's the right way to do this?

You can look at Spark Streaming and the SqlNetworkWordCount example, which show that this can be solved by creating a singleton SparkSession from the SparkContext used by Spark Streaming.
Going through those links should give you a better idea of how DataFrames are created from a streaming RDD.
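A minimal sketch of that pattern, adapted from the SqlNetworkWordCount example. The socket source, batch interval, and the Record case class are placeholders for your actual stream:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Lazily instantiated singleton SparkSession, shared across micro-batches
object SparkSessionSingleton {
  @transient private var instance: SparkSession = _
  def getInstance(sparkConf: SparkConf): SparkSession = {
    if (instance == null) {
      instance = SparkSession.builder.config(sparkConf).getOrCreate()
    }
    instance
  }
}

case class Record(line: String)

val conf = new SparkConf().setAppName("StreamToDataFrame")
val ssc = new StreamingContext(conf, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)   // assumed source; replace with your stream

lines.foreachRDD { rdd =>
  // Build a DataFrame directly from the streaming RDD, no intermediate file
  val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)
  import spark.implicits._
  val df = rdd.map(Record(_)).toDF()
  df.createOrReplaceTempView("records")
  spark.sql("SELECT count(*) FROM records").show()
}

ssc.start()
ssc.awaitTermination()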

Related

Read hdfs data to Spark DF without mentioning file type

Is there any way to read HDFS data into a Spark DataFrame without explicitly mentioning the file type? Something like:
spark.read.format("auto_detect").option("header", "true").load(inputPath)
We can meet this requirement by using scala.sys.process or Python's subprocess and splitting the extension off a part file. But can we achieve this without using any subprocess or sys.process?
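For what it's worth, one way to do the same extension check from within Scala (no subprocess) is the Hadoop FileSystem API. A rough sketch, where inputPath and the extension-to-format mapping are assumptions of mine:

import org.apache.hadoop.fs.{FileSystem, Path}

// Look at the first data file under inputPath and pick a reader format
// from its extension. Compressed extensions (e.g. ".csv.gz") and the
// fallback format would need to be adapted to your data.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val firstFile = fs.listStatus(new Path(inputPath))
  .map(_.getPath.getName)
  .find(name => !name.startsWith("_") && !name.startsWith("."))   // skip _SUCCESS etc.
  .getOrElse(sys.error(s"No data files found under $inputPath"))

val format = firstFile.split('.').lastOption match {
  case Some("csv")     => "csv"
  case Some("json")    => "json"
  case Some("orc")     => "orc"
  case Some("parquet") => "parquet"
  case _               => "parquet"   // assumed default
}

val df = spark.read.format(format).option("header", "true").load(inputPath)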

How to use spark streaming to get data from HBASE table using scala

I am trying to find a way to read data from an HBase table using Spark Streaming and write it to another HBase table.
I found numerous samples on the internet that create a DStream from HDFS files, but I was unable to find any examples that read from HBase tables.
For example, if I have an HBase table 'SAMPLE' with columns 'name' and 'activeStatus', how can I retrieve the data from SAMPLE based on the activeStatus column using Spark Streaming (new data only)?
Any example that reads from an HBase table using Spark Streaming is welcome.
Regards,
Adarsh K S
You can connect to HBase from Spark in multiple ways:
Hortonworks Spark HBase connector (SHC):
https://github.com/hortonworks-spark/shc
Unicredit hbase-rdd: https://github.com/unicredit/hbase-rdd
Hortonworks SHC reads HBase directly into a DataFrame using a user-defined catalog, whereas hbase-rdd reads it as an RDD, which can then be converted to a DataFrame with the toDF method. hbase-rdd also has a bulk-write option (writing HFiles directly), which is preferred for massive writes.
What you need is a library that enables Spark to interact with HBase. Hortonworks' SHC is such an extension:
https://github.com/hortonworks-spark/shc
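A minimal sketch of the SHC approach for the SAMPLE table from the question. The namespace, row key layout, and column family ("cf") are assumptions; adjust the catalog to your actual schema:

import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog
import org.apache.spark.sql.functions.col

// User-defined catalog mapping HBase cells to DataFrame columns
val catalog =
  s"""{
     |"table":{"namespace":"default", "name":"SAMPLE"},
     |"rowkey":"key",
     |"columns":{
     |  "rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
     |  "name":{"cf":"cf", "col":"name", "type":"string"},
     |  "activeStatus":{"cf":"cf", "col":"activeStatus", "type":"string"}
     |}
     |}""".stripMargin

val df = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

// Filter on the activeStatus column; the "true" value is an assumption
val active = df.filter(col("activeStatus") === "true")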

Anyway to log information from a UDF in Databricks-spark?

I am creating some Scala UDFs to process data in a DataFrame, and I am wondering whether it is possible to log information from within a UDF. Any examples of how to do this?
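One common pattern is to get a logger inside the UDF body so nothing non-serializable is captured. A minimal sketch using slf4j (which Spark bundles); note that these messages land in the executor logs (visible via the Spark UI or cluster logs), not in the driver/notebook output. The DataFrame `df` and column name "raw" are placeholders:

import org.apache.spark.sql.functions.{col, udf}
import org.slf4j.LoggerFactory

// Logger is looked up inside the lambda to avoid serialization issues
val normalize = udf { (s: String) =>
  val log = LoggerFactory.getLogger("NormalizeUdf")
  if (s == null) {
    log.warn("normalize received a null input")
    null
  } else {
    s.trim.toLowerCase
  }
}

val cleaned = df.withColumn("clean", normalize(col("raw")))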

Setting Spark Properties on Dataframes

I am fairly new to development in Spark and Scala.
I am able to set properties at runtime on the Spark session using the config method, like below:
val spark = SparkSession.builder()
  .master("local")
  .config("spark.files.overwrite", true)
  .getOrCreate()
The above allows me to set properties at the Spark session level, but I want to set properties at the DataFrame level. Regarding this I have a few questions:
Is there any way to achieve this?
If yes, will it affect the parallelism achieved by Spark?
You can use a different format (and choose whether to overwrite) each time you write a DataFrame:
CSV with compression:
df.coalesce(1).write.format("com.databricks.spark.csv")
  .mode("overwrite")
  .option("header", "true")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save(tempLocationFileName)
CSV without compression:
df.coalesce(1).write.format("com.databricks.spark.csv")
  .mode("overwrite")
  .option("header", "true")
  .save(tempLocationFileName)

Is it possible to convert apache ignite rdd to spark rdd in scala

I am new to Apache Ignite as well as to Spark.
Can anyone help with an example of converting an Ignite RDD to a Spark RDD in Scala?
Update:
Use case:
I will receive DataFrames of HBase tables. I will execute some logic to build a report out of them and save it to an Ignite RDD, and the same Ignite RDD will be updated for each table. Once all the tables are processed, the final Ignite RDD will be converted to a Spark (or Java) RDD, and a last rule will be executed on that RDD. To run that rule, I need the RDD to be converted into a DataFrame, and that DataFrame will be saved as the final report in Hive.
What do you mean by converting? IgniteRDD is a Spark RDD; technically it's a subtype of the RDD trait.
Spark internally has many types of RDD: MappedRDD, HadoopRDD, LogicalRDD, and so on. IgniteRDD is just one possible type of RDD, and after some transformations it will also be wrapped by another RDD type, e.g. MappedRDD.
You can also write your own RDD :)
Example from documentation:
val cache = igniteContext.fromCache("partitioned")
val result = cache.filter(_._2.contains("Ignite")).collect()
After filtering the cache RDD, the type will be different: the IgniteRDD will be wrapped in a FilteredRDD. However, it is still an implementation of the RDD trait.
Update after comment:
First, have you imported the implicits? import spark.implicits._
SparkSession has various createDataFrame methods that will convert your RDD into a DataFrame / Dataset.
If that still doesn't help, please share the error you're getting while creating the DataFrame, along with a code example.
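For example, reusing the cache RDD from the documentation snippet above, a minimal sketch of turning it into a DataFrame. The key/value types and the Hive table name are assumptions:

import spark.implicits._

// IgniteRDD extends RDD[(K, V)], so the standard tuple-to-DataFrame
// conversions apply. Normalize to plain String pairs for simplicity.
val df = cache
  .map { case (key, value) => (key.toString, value.toString) }
  .toDF("key", "value")

// or explicitly via createDataFrame:
val df2 = spark
  .createDataFrame(cache.map { case (k, v) => (k.toString, v.toString) })
  .toDF("key", "value")

// e.g. persist the final report to Hive
df.write.mode("overwrite").saveAsTable("final_report")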