Assign SQL schema to Spark DataFrame - pyspark

I'm converting my team's legacy Redshift SQL code to Spark SQL code. All the Spark examples I've seen define the schema in a non-SQL way using StructType and StructField and I'd prefer to define the schema in SQL, since most of my users know SQL but not Spark.
This is the ugly workaround I'm doing now. Is there a more elegant way that doesn't require defining an empty table just so that I can pull the SQL schema?
create_table_sql = '''
CREATE TABLE public.example (
    id LONG,
    example VARCHAR(80)
)'''
spark.sql(create_table_sql)

schema = spark.sql("DESCRIBE public.example").collect()

s3_data = spark.read\
    .option("delimiter", "|")\
    .csv(
        path="s3a://"+s3_bucket_path,
        schema=schema
    )\
    .saveAsTable('public.example')

Yes, there is a way to create a schema from a string, although it does not look exactly like SQL. You can use:
from pyspark.sql.types import _parse_datatype_string
_parse_datatype_string("id: long, example: string")
This will create the following schema:
StructType(List(StructField(id,LongType,true),StructField(example,StringType,true)))
You can also parse a more complex schema:
schema = _parse_datatype_string("customers array<struct<id: long, name: string, address: string>>")
StructType(
  List(StructField(
    customers, ArrayType(
      StructType(
        List(
          StructField(id, LongType, true),
          StructField(name, StringType, true),
          StructField(address, StringType, true)
        )
      ), true), true)
  )
)
You can check for more examples here

Adding to what has already been said: building a schema (e.g. StructType-based or JSON) is more straightforward in Scala Spark than in PySpark:
> import org.apache.spark.sql.types.StructType
> val s = StructType.fromDDL("customers array<struct<id: long, name: string, address: string>>")
> s
res3: org.apache.spark.sql.types.StructType = StructType(StructField(customers,ArrayType(StructType(StructField(id,LongType,true),StructField(name,StringType,true),StructField(address,StringType,true)),true),true))
> s.prettyJson
res9: String =
{
  "type" : "struct",
  "fields" : [ {
    "name" : "customers",
    "type" : {
      "type" : "array",
      "elementType" : {
        "type" : "struct",
        "fields" : [ {
          "name" : "id",
          "type" : "long",
          "nullable" : true,
          "metadata" : { }
        }, {
          "name" : "name",
          "type" : "string",
          "nullable" : true,
          "metadata" : { }
        }, {
          "name" : "address",
          "type" : "string",
          "nullable" : true,
          "metadata" : { }
        } ]
      },
      "containsNull" : true
    },
    "nullable" : true,
    "metadata" : { }
  } ]
}
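If the goal is to feed that schema straight back into a reader (as in the original question), here is a minimal sketch; the bucket path below is a placeholder, not taken from the question:
import org.apache.spark.sql.types.StructType

// Build the schema from a DDL string and reuse it when reading.
val schema = StructType.fromDDL("id long, example string")

val df = spark.read
  .option("delimiter", "|")
  .schema(schema)
  .csv("s3a://my-bucket/my-path")   // hypothetical location

df.write.saveAsTable("public.example")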

Related

"Not a data file" error while reading an Avro file

I have a file with data in Avro format. I would like to read this data into a GenericRecord (or any other data structure) so that I can send it from Kafka to Spark.
I tried to use DataFileReader, however the result was this error:
Exception in thread "main" java.io.IOException: Not a data file.
at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:105)
Here is the code that produced it:
import java.io.File
import scala.io.Source
import org.apache.avro.Schema
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
val schema = Source.fromFile(schemaPath).mkString
val parser = new Schema.Parser
val avroSchema = parser.parse(schema)
val avroDataFile = new File(dataPath)
val avroReader = new GenericDatumReader[GenericRecord](avroSchema)
val dataFileReader = new DataFileReader[GenericRecord](avroDataFile, avroReader)
//THIS LINE PRODUCED ERROR
How can I fix this error?
This is what my Avro data schema looks like:
{
  "type" : "record",
  "namespace" : "input_data",
  "name" : "testUser",
  "fields" : [
    {"name" : "name", "type" : "string", "default" : "NONE"},
    {"name" : "age", "type" : "int", "default" : -1},
    {"name" : "phone", "type" : "string", "default" : "NONE"},
    {"name" : "city", "type" : "string", "default" : "NONE"},
    {"name" : "country", "type" : "string", "default" : "NONE"}
  ]
}
And this is the data I tried to read (it was generated by this tool):
{
  "name" : "O= ~usP3\u0001\bY\u0011k\u0001",
  "age" : 585392215,
  "phone" : "\u0012\u001F#\u001FH]e\u0015UW\u0000\fo",
  "city" : "aWi\u001B'\u000Bh\u00163\u001A_I\u0001\u0001L",
  "country" : "]H\u001Dl(n!Sr}oVCH"
}
{
  "name" : "\u0011Y~\fV\u001Dv%4\u0006;\u0012",
  "age" : -2045540864,
  "phone" : "UyOdgny-hA",
  "city" : "\u0015f?\u0000\u0015oN{\u0019\u0010\u001D%",
  "country" : "eY>c\u0010j\u0002[\u001CdDQ"
}
...
Well, that data is not Avro, it is JSON.
If it were binary Avro data, you would not be able to read the file as plain text without first running the avro-tools.jar tojson action on it.
If you look at the usage doc of the generator tool, JSON is the default output:
-j, --json: Encode outputted data in JSON format (default)
To actually get Avro, use the arguments -s schema.avsc -b -o out.avro.
There are also other ways to generate test data in Kafka.
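If you want a file that DataFileReader can actually open, another option is to write a real Avro container file yourself. A minimal sketch using the Avro Java API; the record values and output path are placeholders, not taken from the question:
import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}

// Build one record matching the posted schema and write it as a binary Avro container file.
val schema = new Schema.Parser().parse(new File(schemaPath))
val record = new GenericData.Record(schema)
record.put("name", "Alice")        // sample values, not from the question
record.put("age", 30)
record.put("phone", "NONE")
record.put("city", "NONE")
record.put("country", "NONE")

val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
writer.create(schema, new File("/tmp/test.avro"))   // placeholder output path
writer.append(record)
writer.close()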

How to transform a nested MongoDB table into a Spark DataFrame

I have a nested MongoDB table and its document structure looks like this:
{
  "_id" : "35228334dbd1090f6117c5a0011b56b0",
  "brasidas" : [
    {
      "key" : "buy",
      "value" : 859193
    }
  ],
  "crawl_time" : NumberLong(1526296211997),
  "date" : "2018-05-11",
  "id" : "44874f4c8c677087bcd5f829b2843e66",
  "initNumber" : 0,
  "repurchase" : 0,
  "source_url" : "http://query.sse.com.cn/commonQuery.do?jsonCallBack=jQuery11120015170331124618408_1526262411932&isPagination=true&sqlId=COMMON_SSE_SCSJ_CJGK_ZQZYSHG_JYSLMX_L&beginDate&endDate&securityCode&pageHelp.pageNo=1&pageHelp.beginPage=1&pageHelp.cacheSize=1&pageHelp.endPage=1&pageHelp.pageSize=25",
  "stockCode" : "600020",
  "stockName" : "ZYGS",
  "type" : "SSE"
}
I want to transform it into a Spark DataFrame and extract the "key" and "value" fields of "brasidas" into their own columns, like this:
initNumber repurchase key value stockName type date
50000 50000 buy 286698 shgf SSE 2015/3/30
But there is a problem with the shape of "brasidas": it appears in three forms:
[{ "key" : "buy", "value" : 286698 }]
[{ "value" : 15311500, "key" : "buy_free" }, { "value" : 0, "key" : "buy_limited" }]
[{ "key" : ""buy_free" " }, { "key" : "buy_limited" }]
So when I use Scala to define a StructType, it does not fit every document; I can only keep "brasidas" as a single column and cannot split it by "key". This is what I get:
initNumber repurchase brasidas stockName type date
50000 50000 [{ "key" : "buy", "value" : 286698 }] shgf SSE 2015/3/30
This is the code for reading the MongoDB documents:
val readpledge = ReadConfig(Map("uri" -> (mongouri_beehive + ".pledge")))
val pledge = getMongoDB.readCollection(sc, readpledge, "initNumber", "repurchase", "brasidas", "stockName", "type", "date")
  .selectExpr("cast(initNumber as int) initNumber", "cast(repurchase as int) repurchase", "brasidas", "stockName", "type", "date")
If you call df.printSchema() you will probably see that brasidas came through as an ArrayType, most likely an array of maps.
So I'd suggest implementing some sort of UDF that takes that array as a parameter and transforms it the way you need:
def arrayProcess(arr: Seq[AnyRef]): Seq[AnyRef] = ???
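Alternatively, if the column actually comes through as array<struct<key, value>> (an assumption worth checking with printSchema), a plain explode can produce the requested columns without writing a UDF. A minimal sketch:
import org.apache.spark.sql.functions.{col, explode_outer}

// Assumes brasidas is array<struct<key: string, value: long>>; adjust if it is an array of maps.
// explode_outer (Spark 2.2+) keeps rows whose array is empty or null.
val flattened = pledge
  .withColumn("brasidas", explode_outer(col("brasidas")))
  .select(
    col("initNumber"), col("repurchase"),
    col("brasidas.key").as("key"), col("brasidas.value").as("value"),
    col("stockName"), col("type"), col("date")
  )
flattened.show(false)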

SCALA: Reading JSON file with the path provided

I have a scenario where I will be getting different JSON results from multiple APIs, and I need to read a specific value from each response.
For instance, my JSON response is as below, and I need a format the user can provide that tells me which value to read (for example lat). I don't want a hard-coded approach; the user can provide the node to read in some other JSON or text file:
{
  "name" : "Watership Down",
  "location" : {
    "lat" : 51.235685,
    "long" : -1.309197
  },
  "residents" : [ {
    "name" : "Fiver",
    "age" : 4,
    "role" : null
  }, {
    "name" : "Bigwig",
    "age" : 6,
    "role" : "Owsla"
  } ]
}
You can get a key from the JSON using the Scala JSON parser as below. I'm defining a function to get lat, which you can make generic as per your needs so that you only have to change the function.
import scala.util.parsing.json.JSON
val json =
  """
    |{
    |  "name" : "Watership Down",
    |  "location" : {
    |    "lat" : 51.235685,
    |    "long" : -1.309197
    |  },
    |  "residents" : [ {
    |    "name" : "Fiver",
    |    "age" : 4,
    |    "role" : null
    |  }, {
    |    "name" : "Bigwig",
    |    "age" : 6,
    |    "role" : "Owsla"
    |  } ]
    |}
  """.stripMargin
val jsonObject = JSON.parseFull(json).get.asInstanceOf[Map[String, Any]]
val latLambda: (Map[String, Any] => Option[Double]) = _.get("location")
  .map(_.asInstanceOf[Map[String, Any]]("lat").toString.toDouble)
assert(latLambda(jsonObject) == Some(51.235685))
The expanded version of the function:
val latitudeLambda = new Function[Map[String, Any], Double] {
  override def apply(input: Map[String, Any]): Double = {
    input("location").asInstanceOf[Map[String, Any]]("lat").toString.toDouble
  }
}
Make the function generic so that once you know which key you want from the JSON, you only change the function and apply it to the parsed JSON.
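For example, one way to make it generic is to let the user supply a dot-separated path such as "location.lat" (the path format is my assumption, not part of the question) and fold over it. A minimal sketch:
// Walk a dot-separated path through nested Maps produced by JSON.parseFull.
def extractPath(parsed: Map[String, Any], path: String): Option[Any] =
  path.split('.').foldLeft(Option[Any](parsed)) {
    case (Some(m: Map[_, _]), key) => m.asInstanceOf[Map[String, Any]].get(key)
    case _                         => None
  }

// With the jsonObject parsed above:
assert(extractPath(jsonObject, "location.lat") == Some(51.235685))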
Hope it helps. But there are nicer APIs out there, such as the Play JSON library. You can simply use:
import play.api.libs.json._
val jsonVal = Json.parse(json)
val lat = (jsonVal \ "location" \ "lat").get

How to convert column to vector type?

I have an RDD in Spark where the objects are based on a case class:
ExampleCaseClass(user: User, stuff: Stuff)
I want to use Spark's ML pipeline, so I convert this to a Spark data frame. As part of the pipeline, I want to transform one of the columns into a column whose entries are vectors. Since I want the length of that vector to vary with the model, it should be built into the pipeline as part of the feature transformation.
So I attempted to define a Transformer as follows:
class MyTransformer extends Transformer {
  val uid = ""
  val num: IntParam = new IntParam(this, "", "")

  def setNum(value: Int): this.type = set(num, value)
  setDefault(num -> 50)

  def transform(df: DataFrame): DataFrame = {
    ...
  }

  def transformSchema(schema: StructType): StructType = {
    val inputFields = schema.fields
    StructType(inputFields :+ StructField("colName", ???, true))
  }

  def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}
How do I specify the DataType of the resulting field (i.e. fill in the ???)? It will be a Vector of some simple class (Boolean, Int, Double, etc). It seems VectorUDT might have worked, but that's private to Spark. Since any RDD can be converted to a DataFrame, any case class can be converted to a custom DataType. However I can't figure out how to manually do this conversion, otherwise I could apply it to some simple case class wrapping the vector.
Furthermore, if I specify a vector type for the column, will VectorAssembler correctly process the vector into separate features when I go to fit the model?
Still new to Spark and especially to the ML Pipeline, so appreciate any advice.
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType

def transformSchema(schema: StructType): StructType = {
  val inputFields = schema.fields
  StructType(inputFields :+ StructField("colName", VectorType, true))
}
VectorType (available since Spark 2.0) makes the otherwise private VectorUDT publicly available:
package org.apache.spark.ml.linalg

import org.apache.spark.annotation.{DeveloperApi, Since}
import org.apache.spark.sql.types.DataType

/**
 * :: DeveloperApi ::
 * SQL data types for vectors and matrices.
 */
@Since("2.0.0")
@DeveloperApi
object SQLDataTypes {

  /** Data type for [[Vector]]. */
  val VectorType: DataType = new VectorUDT

  /** Data type for [[Matrix]]. */
  val MatrixType: DataType = new MatrixUDT
}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

case class MyVector(vector: Vector)

val vectorDF = Seq(
  MyVector(Vectors.dense(1.0, 3.4, 4.4)),
  MyVector(Vectors.dense(5.5, 6.7))
).toDF

vectorDF.printSchema
root
 |-- vector: vector (nullable = true)

println(vectorDF.schema.fields(0).dataType.prettyJson)
{
  "type" : "udt",
  "class" : "org.apache.spark.mllib.linalg.VectorUDT",
  "pyClass" : "pyspark.mllib.linalg.VectorUDT",
  "sqlType" : {
    "type" : "struct",
    "fields" : [ {
      "name" : "type",
      "type" : "byte",
      "nullable" : false,
      "metadata" : { }
    }, {
      "name" : "size",
      "type" : "integer",
      "nullable" : true,
      "metadata" : { }
    }, {
      "name" : "indices",
      "type" : {
        "type" : "array",
        "elementType" : "integer",
        "containsNull" : false
      },
      "nullable" : true,
      "metadata" : { }
    }, {
      "name" : "values",
      "type" : {
        "type" : "array",
        "elementType" : "double",
        "containsNull" : false
      },
      "nullable" : true,
      "metadata" : { }
    } ]
  }
}
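Regarding the follow-up question about VectorAssembler: it accepts numeric, boolean, and vector input columns and flattens vector columns into the assembled output, so a vector-typed column works fine as a feature input. A minimal sketch (column names are made up):
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

// Note: the DataFrame-based ml Pipeline expects org.apache.spark.ml.linalg vectors (not mllib).
val df = Seq(
  (1.0, Vectors.dense(0.1, 0.2)),
  (2.0, Vectors.dense(0.3, 0.4))
).toDF("x", "vec")

val assembled = new VectorAssembler()
  .setInputCols(Array("x", "vec"))
  .setOutputCol("features")
  .transform(df)

assembled.show(false)   // features = [x, vec(0), vec(1)] per row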

Casbah: cast from BasicDBObject to my type

I have a collection in the database that looks like below:
Question
{
  "_id" : ObjectId("52b3248a43fa7cd2bc4a2d6f"),
  "id" : 1001,
  "text" : "Which is a valid java access modifier?",
  "questype" : "RADIO_BUTTON",
  "issourcecode" : true,
  "sourcecodename" : "sampleques",
  "examId" : 1000,
  "answers" : [
    {
      "id" : 1,
      "text" : "private",
      "isCorrectAnswer" : true
    },
    {
      "id" : 2,
      "text" : "personal",
      "isCorrectAnswer" : false
    },
    {
      "id" : 3,
      "text" : "protect",
      "isCorrectAnswer" : false
    },
    {
      "id" : 4,
      "text" : "publicize",
      "isCorrectAnswer" : false
    }
  ]
}
I have case classes that represent the Question and the Answer. The Question case class has a List of Answer objects. I tried to convert the result of the find operation from a DBObject to my Answer type:
def toList[T](dbObj: DBObject, key: String): List[T] =
  (List[T]() ++ dbObj(key).asInstanceOf[BasicDBList]) map { _.asInstanceOf[T] }
The above operation, when I call it like this:
toList[Answer](dbObj, "answers") map {y => Answer(y.id,y.text, y.isCorrectAnswer)}
fails with the following exception:
com.mongodb.BasicDBObject cannot be cast to domain.content.Answer
Why should it fail? Is there a way to convert the DBObject to Answer type that I want?
You have to retrieve values from BasicDBObject, cast them and then populate the Answer class:
Answer class:
case class Answer(id:Int,text:String,isCorrectAnswer:Boolean)
toList, which I changed to return List[Answer]:
def toList(dbObj: DBObject, key: String): List[Answer] =
  dbObj.get(key).asInstanceOf[BasicDBList].map { o =>
    Answer(
      o.asInstanceOf[BasicDBObject].as[Int]("id"),
      o.asInstanceOf[BasicDBObject].as[String]("text"),
      o.asInstanceOf[BasicDBObject].as[Boolean]("isCorrectAnswer")
    )
  }.toList
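If you also want to populate the enclosing Question, a sketch along the same lines; the Question fields below are assumed from the posted document, so adjust them to your actual case class:
// Assumed shape of the Question case class; adjust to your real definition.
case class Question(id: Int, text: String, questype: String, examId: Int, answers: List[Answer])

def toQuestion(dbObj: DBObject): Question = {
  val q = dbObj.asInstanceOf[BasicDBObject]
  Question(
    q.as[Int]("id"),
    q.as[String]("text"),
    q.as[String]("questype"),
    q.as[Int]("examId"),
    toList(dbObj, "answers")
  )
}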