Difference between sparksession text and textfile methods? [duplicate] - scala

This question already has an answer here:
Difference between sc.textFile and spark.read.text in Spark
(1 answer)
Closed 3 years ago.
I am working with the Spark Scala shell and trying to create a DataFrame and a Dataset from a text file.
For reading a text file there are two options, the text and textFile methods, as follows:
scala> spark.read.
csv format jdbc json load option options orc parquet schema table text textFile
Here is how I am getting a Dataset and a DataFrame from each of these methods:
scala> val df = spark.read.text("/Users/karanverma/Documents/logs1.txt")
df: org.apache.spark.sql.DataFrame = [value: string]
scala> val df = spark.read.textFile("/Users/karanverma/Documents/logs1.txt")
df: org.apache.spark.sql.Dataset[String] = [value: string]
So my question is: what is the difference between the two methods for a text file?
When should I use which method?

As far as I've noticed, they provide almost the same functionality.
The difference is that spark.read.text loads the data as a DataFrame (an untyped Dataset[Row] with a single string column named "value"), while spark.read.textFile loads it as a strongly typed Dataset[String].
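For illustration, here is a minimal sketch (the path is hypothetical) showing that the two results can be converted into each other:
import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._
val df: DataFrame = spark.read.text("/path/to/logs1.txt")           // Dataset[Row] with one string column named "value"
val ds: Dataset[String] = spark.read.textFile("/path/to/logs1.txt") // typed Dataset[String]
val dsFromDf: Dataset[String] = df.as[String]  // DataFrame -> Dataset[String]
val dfFromDs: DataFrame = ds.toDF("value")     // Dataset[String] -> DataFrame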
Hope it helps.

Related

Sort every column of a dataframe in spark scala

I am working in Spark & Scala and have a dataframe with several hundred columns. I would like to sort the dataframe by every column. Is there any way to do this in Scala/Spark?
I have tried:
val sortedDf = actualDF.sort(actualDF.columns)
but .sort does not support Array[String] input.
This question has been asked before: Sort all columns of a dataframe, but there is no Scala answer.
Thank you to @blackbishop for the answer to this:
val dfSortedByAllItsColumns = actualDF.sort(actualDF.columns.map(col): _*)
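For completeness, here is a small self-contained sketch (the example data and column names are made up just to illustrate); the only extra piece needed is the col import:
import org.apache.spark.sql.functions.col
import spark.implicits._
// Toy dataframe with made-up columns.
val actualDF = Seq((3, "b"), (1, "a"), (2, "c")).toDF("id", "label")
// Sort by every column, in the order they appear in the schema.
val dfSortedByAllItsColumns = actualDF.sort(actualDF.columns.map(col): _*)
dfSortedByAllItsColumns.show()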

Difference between type DataSet[Row] and sql.DataFrame in Spark Scala [duplicate]

This question already has answers here:
Spark: How can DataFrame be Dataset[Row] if DataFrame's have a schema
(2 answers)
Closed 2 years ago.
I am confused about two data types, Dataset[Row] and sql.DataFrame. Various documents state that a DataFrame is nothing but a Dataset[Row]; so what is sql.DataFrame?
Below is the code where I see different types returned.
Can you please explain the difference between these?
The code below returns type Dataset[Row] (per the method's return type shown in IntelliJ):
serverDf.select(from_json(col("value"), schema) as "event")
.select("*")
.filter(col("event.type").isin(eventTypes: _*))
The code snippet below returns type sql.DataFrame:
serverDf.select(from_json(col("value"), schema) as "event")
.select("*")
Thanks in advance
They are the same thing, as stated in the documentation:
Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.
It's just a type alias:
type DataFrame = Dataset[Row]
They might show different result types in IntelliJ because the methods have different declared signatures: select is declared to return a DataFrame, while filter returns Dataset[T] (here Dataset[Row]).
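Because of the alias, values of the two types are assignable to each other without any conversion; a short sketch reusing serverDf from the question:
import org.apache.spark.sql.{DataFrame, Dataset, Row}
// select is declared to return DataFrame, filter returns Dataset[T] (here T = Row),
// but both are the same type at runtime.
val asDataFrame: DataFrame = serverDf.select("*")
val asDatasetOfRow: Dataset[Row] = asDataFrame   // compiles as-is, no conversion needed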

How to Create data frame schema from a header file

I have 2 data files:
One file is a header file and the other is a data file.
The header file has 2 columns (Id, Tags): header.txt
Id,Tags
Now I am trying to create a DataFrame schema out of the header file. (I have to use this approach because in the real data there are on the order of 1000 columns in header.txt and data.txt, so manually creating a case class with 1000 columns is not possible.)
val dataFile=sparkSession.read.format("text").load("data.txt")
val headerFile=sparkSession.sparkContext.textFile("header.txt")
val fields = headerFile.flatMap(x => x.split(","))
  .map(fieldName => StructField(fieldName, StringType, true))
val schema = StructType(fields)
But the last line fails with Cannot resolve overloaded method StructType.
Can someone please help?
StructType needs an array of StructField, but fields here is an RDD[StructField] (built from an RDD[String]), so collect the RDD to create the StructType.
val fields = headerFile.flatMap(x => x.split(","))
  .map(fieldName => StructField(fieldName, StringType, true))
val schema = StructType(fields.collect)
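The built schema can then be applied when reading the data file; a sketch assuming data.txt is comma-delimited without a header row:
// Read the data file using the schema built from header.txt.
val dataDf = sparkSession.read
  .format("csv")
  .option("header", "false")
  .schema(schema)
  .load("data.txt")
dataDf.printSchema()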

How to parse a csv string into a Spark dataframe using scala?

I would like to convert an RDD containing records of strings, like the ones below, to a Spark dataframe.
"Mike,2222-003330,NY,34"
"Kate,3333-544444,LA,32"
"Abby,4444-234324,MA,56"
....
The schema line is not inside the same RDD, but in another variable:
val header = "name,account,state,age"
So now my question is, how do I use the above two, to create a dataframe in Spark? I am using Spark version 2.2.
I did search and found a post:
Can I read a CSV represented as a string into Apache Spark using spark-csv.
However, it's not exactly what I need, and I can't figure out a way to modify that code to work in my case.
Your help is greatly appreciated.
The easier way would probably be to start from the CSV file and read it directly as a dataframe (by specifying the schema). You can see an example here: Provide schema while reading csv file as a dataframe.
When the data already exists in an RDD you can use toDF() to convert to a dataframe. This function also accepts column names as input. To use this functionality, first import the spark implicits using the SparkSession object:
val spark: SparkSession = SparkSession.builder.getOrCreate()
import spark.implicits._
Since the RDD contains strings, it first needs to be converted to tuples representing the columns of the dataframe. In this case it will be an RDD[(String, String, String, Int)], since there are four columns (the last column, age, is changed to Int to illustrate how it can be done).
Assuming the input data are in rdd:
val header = "name,account,state,age"
val df = rdd.map(row => row.split(","))
.map{ case Array(name, account, state, age) => (name, account, state, age.toInt)}
.toDF(header.split(","):_*)
Resulting dataframe:
+----+-----------+-----+---+
|name| account|state|age|
+----+-----------+-----+---+
|Mike|2222-003330| NY| 34|
|Kate|3333-544444| LA| 32|
|Abby|4444-234324| MA| 56|
+----+-----------+-----+---+
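For a self-contained test, the input RDD can be built from the sample strings in the question before running the snippet above:
// Sample records taken from the question.
val rdd = spark.sparkContext.parallelize(Seq(
  "Mike,2222-003330,NY,34",
  "Kate,3333-544444,LA,32",
  "Abby,4444-234324,MA,56"
))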

saving spark rdd in ORC format [duplicate]

This question already has an answer here:
Converting CSV to ORC with Spark
(1 answer)
Closed 6 years ago.
I am trying to save my RDD in ORC format.
val data: RDD[MyObject] = createMyData()
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
data.toDF.write.format("orc").save(outputPath)
It compiles fine but it doesn't work.
I get the following exception:
ERROR ApplicationMaster: User class threw exception: java.lang.AssertionError: assertion failed: The ORC data source can only be used with HiveContext.
java.lang.AssertionError: assertion failed: The ORC data source can only be used with HiveContext.
I would like to avoid using hive to do this, because my data is in hdfs and it is not related to any hive table. Is there any workaround?
It works fine for Parquet format.
Thanks in advance.
Persisting data in ORC format to persistent storage (like HDFS) is only available with the HiveContext.
As a workaround you can register it as a temporary table, something like this:
data.toDF.write.mode("overwrite").orc("myDF.orc")
val orcDF = sqlCtx.read.orc("myDF.orc")
orcDF.registerTempTable("<Table Name>")
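Once registered, the temporary table can be queried back through the same context; a sketch using a made-up table name:
// "my_orc_table" is a hypothetical name standing in for <Table Name> above.
val resultDF = sqlCtx.sql("SELECT * FROM my_orc_table")
resultDF.show()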
As of now, saving as ORC can only be done with a HiveContext, so the approach will be like this:
val data: RDD[MyObject] = createMyData()
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._
data.toDF.write.format("orc").save(outputPath)