Creating a Spark DataFrame from a single string - scala

I'm trying to take a hardcoded String and turn it into a 1-row Spark DataFrame (with a single column of type StringType) such that:
val fizz = "buzz"
would result in a DataFrame whose .show() output looks like:
+-----+
| fizz|
+-----+
| buzz|
+-----+
My best attempt thus far has been:
val rawData = List("fizz")
val df = sqlContext.sparkContext.parallelize(Seq(rawData)).toDF()
df.show()
But I get the following runtime error:
java.lang.ClassCastException: org.apache.spark.sql.types.ArrayType cannot be cast to org.apache.spark.sql.types.StructType
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:413)
at org.apache.spark.sql.SQLImplicits.rddToDataFrameHolder(SQLImplicits.scala:155)
Any ideas as to where I'm going awry? Also, how do I set "buzz" as the row value for the fizz column?
Update:
Trying:
sqlContext.sparkContext.parallelize(rawData).toDF()
I get a DF that looks like:
+----+
| _1|
+----+
|buzz|
+----+

Try:
sqlContext.sparkContext.parallelize(rawData).toDF()
Wrapping rawData in Seq(...) produced an RDD[List[String]], so toDF tried to encode each List as an ArrayType column rather than as a row, hence the ClassCastException.
In 2.0 you can:
import spark.implicits._
rawData.toDF
Optionally provide a sequence of names for toDF:
sqlContext.sparkContext.parallelize(rawData).toDF("fizz")
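Putting it together for the exact output asked for ("fizz" as the column name, "buzz" as the row value), a minimal sketch assuming a Spark 2.x SparkSession named spark:
import spark.implicits._

// One row, one StringType column named "fizz" holding the value "buzz"
val df = Seq("buzz").toDF("fizz")
df.show()
// +----+
// |fizz|
// +----+
// |buzz|
// +----+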

In Java, the following works:
List<String> textList = Collections.singletonList("yourString");
SQLContext sqlContext = new SQLContext(sparkContext);
Dataset<Row> data = sqlContext
    .createDataset(textList, Encoders.STRING())
    .withColumnRenamed("value", "text");
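The Scala analogue of that createDataset approach, as a sketch (assuming a Spark 2.x SparkSession named spark, whose implicits supply the String encoder):
import spark.implicits._

// createDataset yields a single "value" column, which is then renamed
val data = spark
  .createDataset(Seq("yourString"))
  .withColumnRenamed("value", "text")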

Related

Spark: Split is not a member of org.apache.spark.sql.Row

Below is my code from Spark 1.6. I am trying to convert it to Spark 2.3, but I am getting an error when using split.
Spark 1.6 code:
val file = spark.textFile(args(0))
val mapping = file.map(_.split('\t')).map(a => a(1))
mapping.saveAsTextFile(args(1))
Spark 2.3 code:
val file = spark.read.text(args(0))
val mapping = file.map(_.split('\t')).map(a => a(1)) //Getting Error Here
mapping.write.text(args(1))
Error Message:
value split is not a member of org.apache.spark.sql.Row
Unlike spark.textFile, which returns an RDD[String],
spark.read.text returns a DataFrame, which is essentially an RDD[Row]. You can perform the map with a partial function, as shown in the following example:
// /path/to/textfile:
// a b c
// d e f
import org.apache.spark.sql.Row
val df = spark.read.text("/path/to/textfile")
df.map{ case Row(s: String) => s.split("\\t") }.map(_(1)).show
// +-----+
// |value|
// +-----+
// | b|
// | e|
// +-----+
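Alternatively, the built-in split function from org.apache.spark.sql.functions avoids pattern matching on Row entirely. A sketch against the same df (the column name "second" is just illustrative):
import org.apache.spark.sql.functions.{col, split}

// split produces an array column; getItem(1) selects its second element
df.select(split(col("value"), "\\t").getItem(1).as("second")).show
// +------+
// |second|
// +------+
// |     b|
// |     e|
// +------+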

How to rename column headers in a scala dataframe

How can I do string.replace("fromstr", "tostr") on a Scala DataFrame?
As far as I can see, withColumnRenamed performs the replace on all columns and not just the headers.
withColumnRenamed renames column names only; the data remains the same. If you need to change the row values themselves, you can use one of the following:
import sparkSession.implicits._
import org.apache.spark.sql.functions._
val inputDf = Seq("to_be", "misc").toDF("c1")
val resultd1Df = inputDf
  .withColumn("c2", regexp_replace($"c1", "^to_be$", "not_to_be"))
  .select($"c2".as("c1"))
resultd1Df.show()

val resultd2Df = inputDf
  .withColumn("c2", when($"c1" === "to_be", "not_to_be").otherwise($"c1"))
  .select($"c2".as("c1"))
resultd2Df.show()

def replace(mapping: Map[String, String]) = udf(
  (from: String) => mapping.get(from).orElse(Some(from))
)

val resultd3Df = inputDf
  .withColumn("c2", replace(Map("to_be" -> "not_to_be"))($"c1"))
  .select($"c2".as("c1"))
resultd3Df.show()
Input dataframe:
+-----+
| c1|
+-----+
|to_be|
| misc|
+-----+
Result dataframe:
+---------+
| c1|
+---------+
|not_to_be|
| misc|
+---------+
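For the header-only rename the question title asks about, withColumnRenamed alone is enough; a minimal sketch against the same input DataFrame:
// Renames the header from "c1" to "word"; the row values are untouched
val renamedDf = inputDf.withColumnRenamed("c1", "word")
renamedDf.show()
// +-----+
// | word|
// +-----+
// |to_be|
// | misc|
// +-----+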
You can find the list of available Spark functions in the org.apache.spark.sql.functions scaladoc.

How to extract week day as a number from a Spark dataframe with the Scala API

I have a date column in a DataFrame, stored as a string in the 2017-01-01 12:15:43 timestamp format.
Now I want to get the weekday number (1 to 7) from that column using the DataFrame API, not Spark SQL, like below:
df.select(weekday(col("colname")))
I found a way to do this in Python and SQL but not in Scala. Can anybody help me with this?
In SQLContext:
sqlContext.sql("select date_format(to_date('2017-01-01'),'W') as week")
This works the same way in Scala:
scala> spark.version
res1: String = 2.3.0
scala> spark.sql("select date_format(to_date('2017-01-01'),'W') as week").show
// +----+
// |week|
// +----+
// | 1|
// +----+
or
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> val df = Seq("2017-01-01").toDF("date")
df: org.apache.spark.sql.DataFrame = [date: string]
scala> df.select(date_format(to_date(col("date")), "W")).show
+-------------------------------+
|date_format(to_date(`date`), W)|
+-------------------------------+
| 1|
+-------------------------------+
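Note that the 'W' pattern actually returns the week of the month, which only coincidentally equals 1 here. For the weekday number (1 to 7) the question asks for, the SimpleDateFormat pattern 'u' (1 = Monday) or, in Spark 2.3+, the dayofweek function (1 = Sunday) is closer; a sketch (the names df2, u_mon1, dow_sun1 are just illustrative):
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell
import org.apache.spark.sql.functions.{col, date_format, dayofweek, to_date}

val df2 = Seq("2017-01-01").toDF("date") // 2017-01-01 was a Sunday
df2.select(
  date_format(to_date(col("date")), "u").as("u_mon1"), // "7" (1 = Monday)
  dayofweek(to_date(col("date"))).as("dow_sun1")       // 1 (1 = Sunday)
).show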

Conversion of RDD to DataFrame using .toDF() When CSV data read using SparkContext (Not sqlContext)

I am completely new to Spark SQL. Please help me, anyone.
My specific question is whether we can convert the RDD hospitalDataText to a DataFrame (using .toDF()), where hospitalDataText has read the CSV file using SparkContext (not using sqlContext.read.csv("path")).
SO WHY CAN'T WE WRITE header.toDF()? If I try to convert the variable header RDD to a DataFrame, it throws the error: value toDF is not a member of String. Why? My main purpose is to view the data of the variable header RDD using the .show() function, so why am I unable to convert the RDD to a DataFrame? Please check the code given below! It looks like a DOUBLE STANDARD :'(
scala> val hospitalDataText = sc.textFile("/Users/TheBhaskarDas/Desktop/services.csv")
hospitalDataText: org.apache.spark.rdd.RDD[String] = /Users/TheBhaskarDas/Desktop/services.csv MapPartitionsRDD[39] at textFile at <console>:33
scala> val header = hospitalDataText.first() //Remove the header
header: String = uhid,locationid,doctorid,billdate,servicename,servicequantity,starttime,endtime,servicetype,servicecategory,deptname
scala> header.toDF()
<console>:38: error: value toDF is not a member of String
header.toDF()
^
scala> val hospitalData = hospitalDataText.filter(a => a != header)
hospitalData: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[40] at filter at <console>:37
scala> val m = hospitalData.toDF()
m: org.apache.spark.sql.DataFrame = [value: string]
scala> println(m)
[value: string]
scala> m.show()
+--------------------+
| value|
+--------------------+
|32d84f8b9c5193838...|
|32d84f8b9c5193838...|
|213d66cb9aae532ff...|
|222f8f1766ed4e7c6...|
|222f8f1766ed4e7c6...|
|993f608405800f97d...|
|993f608405800f97d...|
|fa14c3845a8f1f6b0...|
|6e2899a575a534a1d...|
|6e2899a575a534a1d...|
|1f1603e3c0a0db5e6...|
|508a4fbea4752771f...|
|5f33395ae7422c3cf...|
|5f33395ae7422c3cf...|
|4ef07783ce800fc5d...|
|70c13902c9c9ccd02...|
|70c13902c9c9ccd02...|
|a950feff6911ab5e4...|
|b1a0d427adfdc4f7e...|
|b1a0d427adfdc4f7e...|
+--------------------+
only showing top 20 rows
scala> m.show(1)
+--------------------+
| value|
+--------------------+
|32d84f8b9c5193838...|
+--------------------+
only showing top 1 row
scala> m.show(1,true)
+--------------------+
| value|
+--------------------+
|32d84f8b9c5193838...|
+--------------------+
only showing top 1 row
scala> m.show(1,2)
+-----+
|value|
+-----+
| 32|
+-----+
only showing top 1 row
You keep saying header is an RDD, while the output you posted clearly shows that it is a String: first() does not return an RDD. You can't call show() on a String, but you can use println.
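If the goal is specifically to view the header through show(), one workaround is to lift the String into a one-row DataFrame first; a sketch, assuming the spark-shell implicits are in scope:
// header is a plain String, so wrap it in a Seq before calling toDF
Seq(header).toDF("header").show(false)
// show(false) disables truncation so the full header line is visible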

Dataframe creation in Scala

wordsDF = sqlContext.createDataFrame([('cat',), ('elephant',), ('rat',), ('rat',), ('cat', )], ['word'])
This is a way of creating a DataFrame from a list of tuples in Python. How can I do this in Scala? I'm new to Scala and I'm having trouble figuring it out.
Any help will be appreciated!
One simple way,
val df = sc.parallelize(List( (1,"a"), (2,"b") )).toDF("key","value")
and so df.show
+---+-----+
|key|value|
+---+-----+
| 1| a|
| 2| b|
+---+-----+
Refer to the worked example in Programmatically Specifying the Schema for constructing a DataFrame with createDataFrame.
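A sketch of that programmatic-schema approach, mirroring the Python example above (assuming an existing SparkContext sc and SQLContext sqlContext):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Build an RDD[Row] plus a matching schema, then combine the two
val rows = sc.parallelize(Seq(Row("cat"), Row("elephant"), Row("rat"), Row("rat"), Row("cat")))
val schema = StructType(Seq(StructField("word", StringType, nullable = true)))
val wordsDF = sqlContext.createDataFrame(rows, schema)
wordsDF.show()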
To create a DataFrame, you first need to create an SQLContext:
val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// This import implicitly converts an RDD to a DataFrame; after it you can use the .toDF method
import sqlContext.implicits._
Now you can create DataFrames:
val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
You can learn more about creating DataFrames in the Spark SQL programming guide.