Implement Scala read code with a PySpark file as source - scala

I need to build a module in Scala whose source data comes from two modules written in PySpark. Can you help me read data from the PySpark modules into the Scala module?
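A common approach is to avoid reading "from PySpark" directly: have the two PySpark modules persist their output in a language-neutral format such as Parquet on shared storage (HDFS, S3, a Hive table), and have the Scala module read those paths. Below is a minimal sketch of the Scala side, assuming hypothetical HDFS output paths and a shared id column; adjust the paths and the join to your data.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("read-pyspark-output").getOrCreate()

// Hypothetical locations where the two PySpark modules persist their results as Parquet
val moduleA = spark.read.parquet("hdfs:///data/module_a/output")
val moduleB = spark.read.parquet("hdfs:///data/module_b/output")

// From here on, the data is an ordinary Spark DataFrame in Scala
moduleA.join(moduleB, Seq("id")).show()

Since both languages run on the same Spark engine and file formats, nothing Python-specific has to cross the boundary.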

Related

Read HDFS data into a Spark DataFrame without mentioning the file type

Is there any approach to read HDFS data into a Spark DataFrame without explicitly mentioning the file type?
spark.read.format("auto_detect").option("header", "true").load(inputPath)
We can achieve the above requirement by using scala.sys.process._ or Python's subprocess module and splitting the extension of a part file. But can we achieve this without using any subprocess or sys.process?
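One subprocess-free option is to inspect the part files through the Hadoop FileSystem API that Spark already ships with, and dispatch on the extension. A rough sketch, assuming the last extension of a part file is enough to identify the format (compressed names such as part-00000.csv.gz would need extra handling):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SparkSession}

def readAutoDetect(spark: SparkSession, inputPath: String): DataFrame = {
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  // Find the first part file under the input path and take its extension
  val firstPart = fs.listStatus(new Path(inputPath))
    .map(_.getPath.getName)
    .find(_.startsWith("part-"))
    .getOrElse(throw new IllegalArgumentException(s"No part files under $inputPath"))
  val extension = firstPart.split("\\.").last.toLowerCase

  extension match {
    case "parquet" => spark.read.parquet(inputPath)
    case "orc"     => spark.read.orc(inputPath)
    case "json"    => spark.read.json(inputPath)
    case "csv"     => spark.read.option("header", "true").csv(inputPath)
    case other     => throw new IllegalArgumentException(s"Cannot detect format from extension: $other")
  }
}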

How to convert a Scala spark.sql.DataFrame to a pandas DataFrame

I want to convert a Scala DataFrame into a pandas DataFrame.
val collection = spark.read.sqlDB(config)
collection.show()
// should be something like: df = collection
You are asking for a way of using a Python library from Scala, which is a bit unusual. Are you sure you have to do that? Maybe you already know this, but Scala DataFrames have a rich API that will probably give you the functionality you need from pandas.
If you still need pandas, I would suggest writing the data that you need to a file (a CSV, for example). Then a Python application can load that file into a pandas DataFrame and work from there.
Trying to create a pandas object from Scala is probably overcomplicating things (and I am not sure it is currently possible).
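As a concrete sketch of that handoff, the Scala side could persist the DataFrame as CSV to a location of your choice (the path here is illustrative):

// Scala side: write the data somewhere a Python job can pick it up
collection.write
  .option("header", "true")
  .mode("overwrite")
  .csv("hdfs:///tmp/collection_for_pandas")

A separate Python job can then load that directory with pandas.read_csv (or glob the part files) and continue from there.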
If you want to use a pandas-style API in Spark code, you can install the Koalas Python library; whatever function you want from the pandas API can then be embedded directly in your Spark code.
To install Koalas:
pip install koalas

Load XML files from a folder with PySpark

I want to load XML files from a specific folder with PySpark, but I don't want to use the com.databricks.spark.xml package. Every example I find uses com.databricks.spark.xml.
Is there any way to read XML files without this package?
Can you use xml.etree.ElementTree (imported as ET)? If so, write a Python function that parses the XML with it and wrap it in a UDF. Read the XML files into PySpark as RDDs and parse them with the UDF.

Importing the spark.sql.catalyst library in Spark

I am trying to find a way to extract the table names from a Spark SQL query.
The answer given here is in Scala: How to get table names from SQL query?
I want to port this to PySpark.
For that, I want to import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation (or its equivalent) into PySpark.
Can this be done?
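For reference, the Scala approach from the linked answer looks roughly like the sketch below; the UnresolvedRelation API has changed between Spark versions, and from PySpark you would have to reach it through the py4j gateway (e.g. via spark._jsparkSession) rather than a normal import, so treat this as a starting point only.

import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation

val plan = spark.sessionState.sqlParser.parsePlan(
  "SELECT a.id FROM db1.orders a JOIN db2.customers b ON a.cust_id = b.id")

// Collect every unresolved relation, i.e. every table referenced by the query
val tables = plan.collect { case r: UnresolvedRelation => r.tableName }
tables.foreach(println)  // e.g. db1.orders, db2.customers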

Difference between file write using Spark and Scala, and the advantages?

df.write
  .format("com.databricks.spark.csv")
  .save("filepath/selectedDataset.csv")
vs
scala.tools.nsc.io.File("/Users/saravana-6868/Desktop/hello.txt").writeAll("String")
In the above code, I write a file using both a Spark DataFrame and plain Scala. What is the difference between the two approaches?
The first piece of code uses the Spark API to write the DataFrame to a file in CSV format. With it you can write to HDFS or the local file system, and you can repartition and parallelize the write. The second piece of code uses a plain Scala API that can only write to the local file system and cannot be parallelized. The first leverages the whole cluster to do its work; the second does not.