StructType can not accept object? - pyspark

How do I resolve this issue?
rdd.collect()  # ['3e866d48b59e8ac8aece79597df9fb4c', ...]
rdd.toDF()  # Can not infer schema for type: <type 'str'>
myschema = StructType([StructField("col1", StringType(), True)])
rdd.toDF(myschema).show()
# StructType can not accept object "3e866d48b59e8ac8aece79597df9fb4c" in type <type 'str'>

It seems you have:
rdd = sc.parallelize(['3e866d48b59e8ac8aece79597df9fb4c'])
which is a one-dimensional data structure, while a DataFrame is two-dimensional; mapping each element to a tuple solves the problem:
rdd.map(lambda x: (x,)).toDF().show()
+--------------------+
| _1|
+--------------------+
|3e866d48b59e8ac8a...|
+--------------------+
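If you also want your explicit schema (or just a proper column name) instead of the default _1, you can pass it to toDF after the map. A minimal sketch, assuming the rdd and myschema defined above:

from pyspark.sql.types import StructType, StructField, StringType

myschema = StructType([StructField("col1", StringType(), True)])
# Wrap each string in a 1-tuple so every element becomes a one-column row:
rdd.map(lambda x: (x,)).toDF(myschema).show()
# Equivalently, just supply the column name:
rdd.map(lambda x: (x,)).toDF(["col1"]).show()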

Related

DataFrame to RDD[(String, String)] conversion

I want to convert an org.apache.spark.sql.DataFrame to an org.apache.spark.rdd.RDD[(String, String)] in Databricks. Can anyone help?
Background (and a better solution is also welcome): I have a Kafka stream which (after some steps) becomes a two-column DataFrame. I would like to put this into a Redis cache, with the first column as the key and the second column as the value.
More specifically, the type of the input is lastContacts: org.apache.spark.sql.DataFrame = [serialNumber: string, lastModified: bigint]. I try to write it into Redis as follows:
sc.toRedisKV(lastContacts)(redisConfig)
The error message looks like this:
notebook:20: error: type mismatch;
found : org.apache.spark.sql.DataFrame
(which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
required: org.apache.spark.rdd.RDD[(String, String)]
sc.toRedisKV(lastContacts)(redisConfig)
I already played around with some ideas (like the .rdd function), but none helped.
You can use df.map(row => ...) to convert the DataFrame to an RDD if you want to map a row to a different RDD element.
For example:
val df = Seq(("table1",432),
("table2",567),
("table3",987),
("table1",789)).
toDF("tablename", "Code").toDF()
df.show()
+---------+----+
|tablename|Code|
+---------+----+
| table1| 432|
| table2| 567|
| table3| 987|
| table1| 789|
+---------+----+
val rddDf = df.rdd.map(r => (r(0), r(1))) // Type: RDD[(Any, Any)]
or:
val rdd = df.rdd.map(r => (r(0).toString, r(1).toString)) // Type: RDD[(String, String)]
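For anyone doing the same conversion from PySpark, a minimal equivalent sketch, assuming a two-column DataFrame df like the one above:

# PySpark: DataFrame -> RDD of (str, str) pairs, first column as key.
pair_rdd = df.rdd.map(lambda row: (str(row[0]), str(row[1])))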
Please refer to https://community.hortonworks.com/questions/106500/error-in-spark-streaming-kafka-integration-structu.html regarding AnalysisException: Queries with streaming sources must be executed with writeStream.start().
You need to wait for the termination of the query using query.awaitTermination() to prevent the process from exiting while the query is active.
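A minimal PySpark sketch of that pattern, assuming a streaming DataFrame streamingDf (the console sink is just an example):

# Start the streaming query and block until it terminates.
query = streamingDf.writeStream.format("console").start()
query.awaitTermination()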

Scala and Spark: RDD to DataFrame creation from a dictionary

Can you please let me know how to create a DataFrame from the following code?
val x = List(Map("col1" -> "foo", "col2" -> "bar"))
val RDD = sc.parallelize(x)
The input is as shown above, i.e. an RDD[Map[String, String]].
I want to convert it into a DataFrame with col1 and col2 as column names and foo and bar as a single row.
You can create a case class, convert the Maps in the RDD to instances of the case class, and then toDF should work:
case class r(col1: Option[String], col2: Option[String])
RDD.map(m => r(m.get("col1"), m.get("col2"))).toDF.show
+----+----+
|col1|col2|
+----+----+
| foo| bar|
+----+----+
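For comparison, a PySpark sketch of the same idea, building a Row per dictionary (assuming an active SparkSession and SparkContext sc):

from pyspark.sql import Row

x = [{"col1": "foo", "col2": "bar"}]
sc.parallelize(x).map(lambda m: Row(col1=m.get("col1"), col2=m.get("col2"))).toDF().show()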

Conversion of RDD to DataFrame using .toDF() when CSV data is read using SparkContext (not sqlContext)

I am completely new to Spark SQL. Can anyone please help me?
My specific question is whether we can convert the RDD hospitalDataText to a DataFrame (using .toDF()) when hospitalDataText has read the CSV file using SparkContext (not sqlContext.read.csv("path")).
So why can't we write header.toDF()? When I try to convert the variable header to a DataFrame, it throws the error: value toDF is not a member of String. Why? My main purpose is to view the data of the variable header using the .show() function, so why am I unable to convert that to a DataFrame? Please check the code given below; it looks like a double standard :'(
scala> val hospitalDataText = sc.textFile("/Users/TheBhaskarDas/Desktop/services.csv")
hospitalDataText: org.apache.spark.rdd.RDD[String] = /Users/TheBhaskarDas/Desktop/services.csv MapPartitionsRDD[39] at textFile at <console>:33
scala> val header = hospitalDataText.first() // Grab the header line
header: String = uhid,locationid,doctorid,billdate,servicename,servicequantity,starttime,endtime,servicetype,servicecategory,deptname
scala> header.toDF()
<console>:38: error: value toDF is not a member of String
header.toDF()
^
scala> val hospitalData = hospitalDataText.filter(a => a != header)
hospitalData: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[40] at filter at <console>:37
scala> val m = hospitalData.toDF()
m: org.apache.spark.sql.DataFrame = [value: string]
scala> println(m)
[value: string]
scala> m.show()
+--------------------+
| value|
+--------------------+
|32d84f8b9c5193838...|
|32d84f8b9c5193838...|
|213d66cb9aae532ff...|
|222f8f1766ed4e7c6...|
|222f8f1766ed4e7c6...|
|993f608405800f97d...|
|993f608405800f97d...|
|fa14c3845a8f1f6b0...|
|6e2899a575a534a1d...|
|6e2899a575a534a1d...|
|1f1603e3c0a0db5e6...|
|508a4fbea4752771f...|
|5f33395ae7422c3cf...|
|5f33395ae7422c3cf...|
|4ef07783ce800fc5d...|
|70c13902c9c9ccd02...|
|70c13902c9c9ccd02...|
|a950feff6911ab5e4...|
|b1a0d427adfdc4f7e...|
|b1a0d427adfdc4f7e...|
+--------------------+
only showing top 20 rows
scala> m.show(1)
+--------------------+
| value|
+--------------------+
|32d84f8b9c5193838...|
+--------------------+
only showing top 1 row
scala> m.show(1,true)
+--------------------+
| value|
+--------------------+
|32d84f8b9c5193838...|
+--------------------+
only showing top 1 row
scala> m.show(1,2)
+-----+
|value|
+-----+
| 32|
+-----+
only showing top 1 row
You keep saying header is an RDD while the output you posted clearly shows that header is a String. first() does not return an RDD. You can't use show() on a String, but you can use println.
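If you really do want show(), a minimal sketch of the idea in PySpark (assuming the same file read with sc.textFile):

header = sc.textFile("/Users/TheBhaskarDas/Desktop/services.csv").first()
print(header)  # header is a plain string, so print it directly
# Or wrap it in a one-row DataFrame if you insist on show():
sc.parallelize([(header,)]).toDF(["value"]).show(truncate=False)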

Creating a Spark DataFrame from a single string

I'm trying to take a hardcoded String and turn it into a 1-row Spark DataFrame (with a single column of type StringType) such that:
String fizz = "buzz"
Would result with a DataFrame whose .show() method looks like:
+-----+
| fizz|
+-----+
| buzz|
+-----+
My best attempt thus far has been:
val rawData = List("fizz")
val df = sqlContext.sparkContext.parallelize(Seq(rawData)).toDF()
df.show()
But I get the following runtime error:
java.lang.ClassCastException: org.apache.spark.sql.types.ArrayType cannot be cast to org.apache.spark.sql.types.StructType
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:413)
at org.apache.spark.sql.SQLImplicits.rddToDataFrameHolder(SQLImplicits.scala:155)
Any ideas as to where I'm going awry? Also, how do I set "buzz" as the row value for the fizz column?
Update:
Trying:
sqlContext.sparkContext.parallelize(rawData).toDF()
I get a DF that looks like:
+----+
| _1|
+----+
|buzz|
+----+
Try:
sqlContext.sparkContext.parallelize(rawData).toDF()
In 2.0 you can:
import spark.implicits._
rawData.toDF
Optionally provide a sequence of names for toDF:
sqlContext.sparkContext.parallelize(rawData).toDF("fizz")
In Java, the following works:
List<String> textList = Collections.singletonList("yourString");
SQLContext sqlContext = new SQLContext(sparkContext);
Dataset<Row> data = sqlContext
    .createDataset(textList, Encoders.STRING())
    .withColumnRenamed("value", "text");
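And for completeness, a one-liner PySpark sketch that produces the desired fizz/buzz frame (assuming an existing SparkSession spark):

spark.createDataFrame([("buzz",)], ["fizz"]).show()
# +----+
# |fizz|
# +----+
# |buzz|
# +----+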

sqlContext.createDataFrame from Row with schema. pyspark: TypeError: IntegerType can not accept object in type <type 'unicode'>

After spending way too much time figuring out why I get the following error
pyspark: TypeError: IntegerType can not accept object in type <type 'unicode'>
while trying to create a dataframe based on Rows and a schema, I noticed the following:
With a Row inside my rdd called rddRows looking as follows:
Row(a="1", b="2", c=3)
and my dfSchema defined as:
dfSchema = StructType([
    StructField("c", IntegerType(), True),
    StructField("a", StringType(), True),
    StructField("b", StringType(), True)
])
creating a dataframe as follows:
df = sqlContext.createDataFrame(rddRows, dfSchema)
raises the above-mentioned error, because Spark only considers the order of the StructFields in the schema and does not match their names against the names of the Row fields.
In other words, in the above example, I noticed that Spark tries to create a DataFrame that would look as follows (if there were no TypeError, e.g. if everything were of type String):
+---+---+---+
| c | b | a |
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
Is this really expected, or is it some sort of bug?
EDIT: the rddRows are created along these lines:
def createRows(dic):
    res = Row(a=dic["a"], b=dic["b"], c=int(dic["c"]))
    return res

rddRows = rddDict.map(createRows)
where rddDict is a parsed JSON file.
The Row constructor sorts the keys if you provide keyword arguments (take a look at the Row source code). When I found out about that, I ended up sorting my schema accordingly before applying it to the dataframe:
sorted_fields = sorted(dfSchema.fields, key=lambda x: x.name)
sorted_schema = StructType(fields=sorted_fields)
df = sqlContext.createDataFrame(rddRows, sorted_schema)
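An alternative sketch that sidesteps the key sorting entirely: emit plain tuples in the schema's field order instead of keyword-argument Rows, since tuples are matched to the schema positionally (assuming rddDict and dfSchema as above):

# Build tuples in the schema's order (c, a, b) instead of keyword Rows.
def createTuple(dic):
    return (int(dic["c"]), dic["a"], dic["b"])

df = sqlContext.createDataFrame(rddDict.map(createTuple), dfSchema)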