Consuming RESTful API and converting to Dataframe in Apache Spark - scala

I am trying to read the output of a RESTful API URL and convert it directly to a DataFrame, as follows:
package trials
import org.apache.spark.sql.SparkSession
import org.json4s.jackson.JsonMethods.parse
import scala.io.Source.fromURL
object DEF {
implicit val formats = org.json4s.DefaultFormats
case class Result(success: Boolean,
message: String,
result: Array[Markets])
case class Markets(
MarketCurrency:String,
BaseCurrency:String,
MarketCurrencyLong:String,
BaseCurrencyLong:String,
MinTradeSize:Double,
MarketName:String,
IsActive:Boolean,
Created:String,
Notice:String,
IsSponsored:String,
LogoUrl:String
)
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.appName(s"${this.getClass.getSimpleName}")
.config("spark.sql.shuffle.partitions", "4")
.master("local[*]")
.getOrCreate()
import spark.implicits._
val parsedData = parse(fromURL("https://bittrex.com/api/v1.1/public/getmarkets").mkString).extract[Array[Result]]
val mySourceDataset = spark.createDataset(parsedData)
mySourceDataset.printSchema
mySourceDataset.show()
}
}
The error is as follows and it repeats for every record:
Caused by: org.json4s.package$MappingException: Expected collection but got JObject(List((success,JBool(true)), (message,JString()), (result,JArray(List(JObject(List((MarketCurrency,JString(LTC)), (BaseCurrency,JString(BTC)), (MarketCurrencyLong,JString(Litecoin)), (BaseCurrencyLong,JString(Bitcoin)), (MinTradeSize,JDouble(0.01435906)), (MarketName,JString(BTC-LTC)), (IsActive,JBool(true)), (Created,JString(2014-02-13T00:00:00)), (Notice,JNull), (IsSponsored,JNull), (LogoUrl,JString(https://bittrexblobstorage.blob.core.windows.net/public/6defbc41-582d-47a6-bb2e-d0fa88663524.png))))))))) and mapping Result[][Result, Result]
at org.json4s.reflect.package$.fail(package.scala:96)

The structure of the JSON returned from this URL is:
{
"success": boolean,
"message": string,
"result": [ ... ]
}
So the Result class should match this structure:
case class Result(success: Boolean,
message: String,
result: List[Markets])
Update
And I also refined slightly the Markets class:
case class Markets(
MarketCurrency: String,
BaseCurrency: String,
MarketCurrencyLong: String,
BaseCurrencyLong: String,
MinTradeSize: Double,
MarketName: String,
IsActive: Boolean,
Created: String,
Notice: Option[String],
IsSponsored: Option[Boolean],
LogoUrl: String
)
End-of-update
But the main issue is in the extraction of the main data part from the parsed JSON:
val parsedData = parse(fromURL("{url}").mkString).extract[Array[Result]]
The root of the returned structure is not an array, but corresponds to Result. So it should be:
val parsedData = parse(fromURL("{url}").mkString).extract[Result]
Then, I suppose you do not need to load the wrapper into the DataFrame, but rather the Markets inside it. That is why it should be loaded like this:
val mySourceDataset = spark.createDataset(parsedData.result)
And it finally produces the DataFrame:
+--------------+------------+------------------+----------------+------------+----------+--------+-------------------+------+-----------+--------------------+
|MarketCurrency|BaseCurrency|MarketCurrencyLong|BaseCurrencyLong|MinTradeSize|MarketName|IsActive| Created|Notice|IsSponsored| LogoUrl|
+--------------+------------+------------------+----------------+------------+----------+--------+-------------------+------+-----------+--------------------+
| LTC| BTC| Litecoin| Bitcoin| 0.01435906| BTC-LTC| true|2014-02-13T00:00:00| null| null|https://bittrexbl...|
| DOGE| BTC| Dogecoin| Bitcoin|396.82539683| BTC-DOGE| true|2014-02-13T00:00:00| null| null|https://bittrexbl...|
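
As a quick check of the extraction step, here is a minimal sketch that parses an inline sample instead of calling the API. The sample JSON, the object name ExtractRootObject and the trimmed-down Market class are assumptions made for illustration; only the shape (a root object wrapping a result array) comes from the answer above.

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

object ExtractRootObject {
  implicit val formats: Formats = DefaultFormats

  // Trimmed-down, hypothetical version of the Markets class (three fields only).
  case class Market(MarketCurrency: String, BaseCurrency: String, Notice: Option[String])

  // The root of the document is a single object, not an array.
  case class Result(success: Boolean, message: String, result: List[Market])

  // Hand-written sample with the same shape as the API response.
  val sample: String =
    """{"success": true, "message": "", "result": [
      |  {"MarketCurrency": "LTC", "BaseCurrency": "BTC", "Notice": null}
      |]}""".stripMargin

  // Extract the root object first, then take the list inside it.
  def markets: List[Market] = parse(sample).extract[Result].result

  def main(args: Array[String]): Unit = println(markets)
}
```

The list returned by markets is what would be handed to spark.createDataset. Declaring Notice as an Option lets the JSON null be extracted as None.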

Related

Map different value to the case class property during serialization and deserialization using Jackson

I am trying to deserialize this JSON using Jackson library -
{
"name": "abc",
"ageInInt": 30
}
To the case class Person
case class Person(name: String, @JsonProperty(value = "ageInInt") @JsonAlias(Array("ageInInt")) age: Int)
but I am getting -
No usable value for age
Did not find value which can be converted into int
org.json4s.package$MappingException: No usable value for age
Did not find value which can be converted into int
Basically, I want to deserialize the JSON so that the key ageInInt maps to the field age.
Here is the complete code:
val json =
"""{
|"name": "Tausif",
|"ageInInt": 30
|}""".stripMargin
implicit val format = DefaultFormats
println(Serialization.read[Person](json))
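The exception is thrown by json4s (Serialization.read), which does not read Jackson annotations. To use the JsonProperty annotation, deserialize with Jackson itself and register DefaultScalaModule with your JsonMapper: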
import com.fasterxml.jackson.databind.json.JsonMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.core.`type`.TypeReference
import com.fasterxml.jackson.annotation.JsonProperty
val mapper = JsonMapper.builder()
.addModule(DefaultScalaModule)
.build()
case class Person(name: String, @JsonProperty(value = "ageInInt") age: Int)
val json =
"""{
|"name": "Tausif",
|"ageInInt": 30
|}""".stripMargin
val person: Person = mapper.readValue(json, new TypeReference[Person]{})
println(person) // Prints Person(Tausif,30)
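
If staying with json4s is preferred, one alternative (not part of the answer above) is to rename the field in the parsed tree before extraction. This is a minimal sketch, assuming json4s-jackson is on the classpath; the object name RenameFieldSketch and the helper readPerson are hypothetical.

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

object RenameFieldSketch {
  implicit val formats: Formats = DefaultFormats

  // Plain case class, no Jackson annotations needed for this approach.
  case class Person(name: String, age: Int)

  // Rename "ageInInt" to "age" in the JSON tree, then extract as usual.
  def readPerson(json: String): Person =
    parse(json)
      .transformField { case ("ageInInt", value) => ("age", value) }
      .extract[Person]

  def main(args: Array[String]): Unit =
    println(readPerson("""{"name": "Tausif", "ageInInt": 30}"""))
}
```

Note that transformField is applied to matching fields at every nesting level, so this only fits documents where the name ageInInt is not reused for something else.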

Convert PreparedStatement object to JSON in Scala

I am trying to convert a PreparedStatement (the object used for sending SQL statements to the database) to JSON with Scala.
So far, I've discovered that the best way to convert an object to JSON in Scala is to use the net.liftweb library.
But when I tried it, I got an empty JSON object.
This is the code:
import java.sql.DriverManager
import net.liftweb.json._
import net.liftweb.json.Serialization.write
object Main {
def main (args: Array[String]): Unit = {
implicit val formats = DefaultFormats
val jdbcSqlConnStr = "sqlserverurl**"
val conn = DriverManager.getConnection(jdbcSqlConnStr)
val statement = conn.prepareStatement("exec select_all")
val piedPierJSON2= write(statement)
println(piedPierJSON2)
}
}
This is the result:
{}
When I used a case class I created myself, the conversion worked:
case class Person(name: String, address: Address)
case class Address(city: String, state: String)
val p = Person("Alvin Alexander", Address("Talkeetna", "AK"))
val piedPierJSON3 = write(p)
println(piedPierJSON3)
This is the result
{"name":"Alvin Alexander","address":{"city":"Talkeetna","state":"AK"}}
I understood where the problem was: PreparedStatement is an interface, and none of its subtypes are serializable...
I'm going to try to wrap it in a different class.
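
The wrapping idea could look roughly like the sketch below: instead of serializing the JDBC object, serialize a plain case class that describes it. The names StatementWrapperSketch, StatementInfo and toJson are hypothetical, and which fields are worth capturing is an assumption; the original post does not say what the JSON should contain.

```scala
import net.liftweb.json._
import net.liftweb.json.Serialization.write

object StatementWrapperSketch {
  implicit val formats: Formats = DefaultFormats

  // Hypothetical wrapper: plain data describing the statement,
  // not the JDBC PreparedStatement object itself.
  case class StatementInfo(sql: String, parameters: List[String])

  // A case class with plain fields serializes the same way Person did.
  def toJson(info: StatementInfo): String = write(info)

  def main(args: Array[String]): Unit =
    println(toJson(StatementInfo("exec select_all", Nil)))
}
```

The SQL text and parameters have to be captured when the statement is built, because the PreparedStatement interface does not offer a standard way to read them back.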

How to create NULLable Flink table columns from Scala case classes that contain Option types

I would like to create a DataSet (or DataStream) from a collection of case classes that contain Option values.
In the created table, columns resulting from Option values should contain either NULL or the actual primitive value.
This is what I tried:
import java.sql.Timestamp
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.TableEnvironment
import org.apache.flink.table.api.scala._
import org.apache.flink.types.Row
object OptionExample {
case class Event(time: Timestamp, id: String, value: Option[Int])
def main(args: Array[String]): Unit = {
val env = ExecutionEnvironment.getExecutionEnvironment
val tEnv = TableEnvironment.getTableEnvironment(env)
val data = env.fromCollection(Seq(
Event(Timestamp.valueOf("2018-01-01 00:01:00"), "a", Some(3)),
Event(Timestamp.valueOf("2018-01-01 00:03:00"), "a", None),
Event(Timestamp.valueOf("2018-01-01 00:03:00"), "b", Some(7)),
Event(Timestamp.valueOf("2018-01-01 00:02:00"), "a", Some(5))
))
val table = tEnv.fromDataSet(data)
table.printSchema()
// root
// |-- time: Timestamp
// |-- id: String
// |-- value: Option[Integer]
val result = table
.groupBy('id)
.select('id, 'value.avg as 'averageValue)
// Print results
val ds: DataSet[Row] = result.toDataSet
ds.print()
}
}
But this causes an Exception in the aggregation part...
org.apache.flink.table.api.ValidationException: Expression avg('value) failed on input check: avg requires numeric types, get Option[Integer] here
...so with this approach Option does not get converted into a numeric type with NULLs as described above.
How can I achieve this with Flink?
(I'm coming from Apache Spark, where Datasets created from case classes with Options behave this way. I would like to achieve something similar with Flink.)
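
One possible direction, not verified against Flink here, is to convert each Option into a nullable boxed value before the data reaches the table. The sketch below shows only that conversion in plain Scala; the names OptionToNullableSketch and toNullable are hypothetical, and how the converted value is then handed to Flink is left open.

```scala
object OptionToNullableSketch {
  // Some(n) becomes a boxed java.lang.Integer, None becomes null.
  def toNullable(value: Option[Int]): java.lang.Integer =
    value.map(Int.box).orNull

  def main(args: Array[String]): Unit = {
    val values: Seq[Option[Int]] = Seq(Some(3), None, Some(7))
    values.map(toNullable).foreach(println)
  }
}
```

Whether the resulting column is then treated as a nullable numeric type depends on how the stream or data set is typed on the Flink side, which this sketch does not cover.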

How to use functions.explode to flatten elements in a DataFrame

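I've written this piece of code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, Row, SparkSession, functions}
case class RawPanda(id: Long, zip: String, pt: String, happy: Boolean, attributes: Array[Double])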
case class PandaPlace(name: String, pandas: Array[RawPanda])
object TestSparkDataFrame extends App{
System.setProperty("hadoop.home.dir", "E:\\Programmation\\Libraries\\hadoop")
val conf = new SparkConf().setAppName("TestSparkDataFrame").set("spark.driver.memory","4g").setMaster("local[*]")
val session = SparkSession.builder().config(conf).getOrCreate()
import session.implicits._
def createAndPrintSchemaRawPanda(session:SparkSession):DataFrame = {
val newPanda = RawPanda(1,"M1B 5K7", "giant", true, Array(0.1, 0.1))
val pandaPlace = PandaPlace("torronto", Array(newPanda))
val df =session.createDataFrame(Seq(pandaPlace))
df
}
val df2 = createAndPrintSchemaRawPanda(session)
df2.show
+--------+--------------------+
| name| pandas|
+--------+--------------------+
|torronto|[[1,M1B 5K7,giant...|
+--------+--------------------+
val pandaInfo2 = df2.explode(df2("pandas")) {
case Row(pandas: Seq[Row]) =>
pandas.map{
case (Row(
id: Long,
zip: String,
pt: String,
happy: Boolean,
attrs: Seq[Double])) => RawPanda(id, zip, pt , happy, attrs.toArray)
}
}
pandaInfo2.show
+--------+--------------------+---+-------+-----+-----+----------+
| name| pandas| id| zip| pt|happy|attributes|
+--------+--------------------+---+-------+-----+-----+----------+
|torronto|[[1,M1B 5K7,giant...| 1|M1B 5K7|giant| true|[0.1, 0.1]|
+--------+--------------------+---+-------+-----+-----+----------+
The problem is that the explode function as I used it is deprecated, so I would like to recalculate the pandaInfo2 DataFrame using the method advised in the warning:
use flatMap() or select() with functions.explode() instead
But when I do:
val pandaInfo = df2.select(functions.explode(df2("pandas")))
I obtain the same result as I had in df2.
I don't know how to proceed with flatMap or functions.explode.
How can I use flatMap or functions.explode to obtain the result that I want (the one in pandaInfo2)?
I've seen this post and this other one but none of them helped me.
Calling select with the explode function returns a DataFrame where the Array pandas is "broken up" into individual records. Then, if you want to "flatten" the structure of the resulting single "RawPanda" per record, you can select the individual columns using a dot-separated "route":
import org.apache.spark.sql.functions.explode
val pandaInfo2 = df2.select($"name", explode($"pandas") as "pandas")
.select($"name", $"pandas",
$"pandas.id" as "id",
$"pandas.zip" as "zip",
$"pandas.pt" as "pt",
$"pandas.happy" as "happy",
$"pandas.attributes" as "attributes"
)
A less verbose version of the exact same operation would be:
import org.apache.spark.sql.Encoders // going to use this to "encode" case class into schema
val pandaColumns = Encoders.product[RawPanda].schema.fields.map(_.name)
val pandaInfo3 = df2.select($"name", explode($"pandas") as "pandas")
.select(Seq($"name", $"pandas") ++ pandaColumns.map(f => $"pandas.$f" as f): _*)
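
The question also asks about flatMap. A minimal sketch of that route, assuming a typed Dataset of PandaPlace and a hypothetical FlatPanda case class for the flattened rows (neither the class nor the helper flatten appears in the original post):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Top-level case classes so that Spark can derive encoders for them.
case class RawPanda(id: Long, zip: String, pt: String, happy: Boolean, attributes: Array[Double])
case class PandaPlace(name: String, pandas: Array[RawPanda])
// Hypothetical flattened row: the place name plus the fields of one panda.
case class FlatPanda(name: String, id: Long, zip: String, pt: String, happy: Boolean, attributes: Seq[Double])

object FlatMapSketch {
  // Pure function: one PandaPlace becomes one FlatPanda per contained panda.
  def flatten(place: PandaPlace): Seq[FlatPanda] =
    place.pandas.toSeq.map { p =>
      FlatPanda(place.name, p.id, p.zip, p.pt, p.happy, p.attributes.toSeq)
    }

  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder()
      .appName("FlatMapSketch")
      .master("local[*]")
      .getOrCreate()
    import session.implicits._

    val places = session.createDataset(Seq(
      PandaPlace("torronto", Array(RawPanda(1, "M1B 5K7", "giant", true, Array(0.1, 0.1))))
    ))

    // flatMap emits zero or more output rows per input row.
    val flat: DataFrame = places.flatMap(place => flatten(place)).toDF()
    flat.show()
    session.stop()
  }
}
```

Unlike the select-based versions, this sketch drops the nested pandas column and keeps only the flattened fields.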

Flink Scala join between two Streams doesn't seem to work

I want to join two streams (json) coming from a Kafka producer.
The code works if I filter the data, but it does not seem to work when I join the streams. I want to print the joined stream to the console, but nothing appears.
This is my code:
import java.util.Properties
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.json4s._
import org.json4s.native.JsonMethods
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
object App {
def main(args: Array[String]): Unit = {
case class Data(location: String, timestamp: Long, measurement: Int, unit: String, accuracy: Double)
case class Sensor(sensor_name: String, start_date: String, end_date: String, data_schema: Array[String], data: Data, stt: Stt)
case class Datas(location: String, timestamp: Long, measurement: Int, unit: String, accuracy: Double)
case class Sensor2(sensor_name: String, start_date: String, end_date: String, data_schema: Array[String], data: Datas, stt: Stt)
val properties = new Properties();
properties.setProperty("bootstrap.servers", "0.0.0.0:9092");
properties.setProperty("group.id", "test");
val env = StreamExecutionEnvironment.getExecutionEnvironment
val consumer1 = new FlinkKafkaConsumer010[String]("topics1", new SimpleStringSchema(), properties)
val stream1 = env
.addSource(consumer1)
val consumer2 = new FlinkKafkaConsumer010[String]("topics2", new SimpleStringSchema(), properties)
val stream2 = env
.addSource(consumer2)
val s1 = stream1.map { x => {
implicit val formats = DefaultFormats
JsonMethods.parse(x).extract[Sensor]
}
}
val s2 = stream2.map { x => {
implicit val formats = DefaultFormats
JsonMethods.parse(x).extract[Sensor2]
}
}
val s1t = s1.assignAscendingTimestamps { x => x.data.timestamp }
val s2t = s2.assignAscendingTimestamps { x => x.data.timestamp }
val j1pre = s1t.join(s2t)
.where(_.data.unit)
.equalTo(_.data.unit)
.window(TumblingEventTimeWindows.of(Time.seconds(2L)))
.apply((g, s) => (s.sensor_name, g.sensor_name, s.data.measurement))
env.execute()
}
}
I think the problem is in the assignment of the timestamp. I suspect that assignAscendingTimestamps on the two sources is not the right function.
The JSON produced by the Kafka producer has a field data.timestamp that should be assigned as the timestamp, but I don't know how to manage that.
I also thought that I might have to give a time window batch (as in Spark) to the incoming tuples, but I'm not sure this is the right solution.
I think your code needs just some minor adjustments. First of all, as you want to work in event time, you should set the appropriate TimeCharacteristic:
import org.apache.flink.streaming.api.TimeCharacteristic
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
Also, the code you pasted is missing a sink for the stream. If you want to print to the console, add:
j1pre.print()
The rest of your code seems fine.
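
When checking the output, it may help to remember that two events can only be joined if they fall into the same tumbling window. The sketch below shows the alignment arithmetic for a window with zero offset and non-negative timestamps in milliseconds, so Time.seconds(2) corresponds to a size of 2000. It is plain Scala for illustration, not Flink's API, and the names windowStart and sameWindow are hypothetical.

```scala
object TumblingWindowSketch {
  // Start of the tumbling window that contains the given timestamp
  // (zero offset, non-negative timestamps in milliseconds).
  def windowStart(timestampMs: Long, sizeMs: Long): Long =
    timestampMs - (timestampMs % sizeMs)

  // Two events are candidates for the join only if they share a window.
  def sameWindow(a: Long, b: Long, sizeMs: Long): Boolean =
    windowStart(a, sizeMs) == windowStart(b, sizeMs)

  def main(args: Array[String]): Unit = {
    println(windowStart(4500L, 2000L))       // 4000
    println(sameWindow(1999L, 2000L, 2000L)) // false: adjacent windows
    println(sameWindow(2000L, 3999L, 2000L)) // true: both in [2000, 4000)
  }
}
```

With event time, the result of a window is emitted only once the watermark has passed the end of that window, so output can lag behind the input.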