Consuming RESTful API and converting to Dataframe in Apache Spark


I am trying to convert the output of a RESTful API URL directly into a DataFrame in the following way:
package trials

import org.apache.spark.sql.SparkSession
import org.json4s.jackson.JsonMethods.parse
import scala.io.Source.fromURL

object DEF {
  implicit val formats = org.json4s.DefaultFormats

  case class Result(success: Boolean,
                    message: String,
                    result: Array[Markets])

  case class Markets(
    MarketCurrency: String,
    BaseCurrency: String,
    MarketCurrencyLong: String,
    BaseCurrencyLong: String,
    MinTradeSize: Double,
    MarketName: String,
    IsActive: Boolean,
    Created: String,
    Notice: String,
    IsSponsored: String,
    LogoUrl: String
  )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName(s"${this.getClass.getSimpleName}")
      .config("spark.sql.shuffle.partitions", "4")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._
    val parsedData = parse(fromURL("https://bittrex.com/api/v1.1/public/getmarkets").mkString).extract[Array[Result]]
    val mySourceDataset = spark.createDataset(parsedData)
    mySourceDataset.printSchema
    mySourceDataset.show()
  }
}
The error is as follows and it repeats for every record:
Caused by: org.json4s.package$MappingException: Expected collection but got JObject(List((success,JBool(true)), (message,JString()), (result,JArray(List(JObject(List((MarketCurrency,JString(LTC)), (BaseCurrency,JString(BTC)), (MarketCurrencyLong,JString(Litecoin)), (BaseCurrencyLong,JString(Bitcoin)), (MinTradeSize,JDouble(0.01435906)), (MarketName,JString(BTC-LTC)), (IsActive,JBool(true)), (Created,JString(2014-02-13T00:00:00)), (Notice,JNull), (IsSponsored,JNull), (LogoUrl,JString(https://bittrexblobstorage.blob.core.windows.net/public/6defbc41-582d-47a6-bb2e-d0fa88663524.png))))))))) and mapping Result[][Result, Result]
at org.json4s.reflect.package$.fail(package.scala:96)

The structure of the JSON returned from this URL is:
{
  "success": boolean,
  "message": string,
  "result": [ ... ]
}
So the Result class should be aligned with this structure:
case class Result(success: Boolean,
                  message: String,
                  result: List[Markets])
Update
I also refined the Markets class slightly, so that the fields that come back as null in the response (Notice and IsSponsored) are Options:
case class Markets(
  MarketCurrency: String,
  BaseCurrency: String,
  MarketCurrencyLong: String,
  BaseCurrencyLong: String,
  MinTradeSize: Double,
  MarketName: String,
  IsActive: Boolean,
  Created: String,
  Notice: Option[String],
  IsSponsored: Option[Boolean],
  LogoUrl: String
)
End-of-update
But the main issue is in the extraction of the data from the parsed JSON:
val parsedData = parse(fromURL("{url}").mkString).extract[Array[Result]]
The root of the returned structure is not an array; it is a single object that corresponds to Result. So it should be:
val parsedData = parse(fromURL("{url}").mkString).extract[Result]
Then, I assume you do not want to load the wrapper into the DataFrame, but rather the Markets inside it. That is why it should be loaded like this:
val mySourceDataset = spark.createDataset(parsedData.result)
And it finally produces the DataFrame:
+--------------+------------+------------------+----------------+------------+----------+--------+-------------------+------+-----------+--------------------+
|MarketCurrency|BaseCurrency|MarketCurrencyLong|BaseCurrencyLong|MinTradeSize|MarketName|IsActive| Created|Notice|IsSponsored| LogoUrl|
+--------------+------------+------------------+----------------+------------+----------+--------+-------------------+------+-----------+--------------------+
| LTC| BTC| Litecoin| Bitcoin| 0.01435906| BTC-LTC| true|2014-02-13T00:00:00| null| null|https://bittrexbl...|
| DOGE| BTC| Dogecoin| Bitcoin|396.82539683| BTC-DOGE| true|2014-02-13T00:00:00| null| null|https://bittrexbl...|
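
The extraction fix above can be tried without Spark or network access. The following is a minimal sketch, assuming json4s-jackson is on the classpath; the object name ExtractSketch, the trimmed-down Market class, the helper extractMarkets and the inline sample JSON are illustrative and not part of the original code.

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

object ExtractSketch {
  implicit val formats: Formats = DefaultFormats

  // Trimmed-down version of Markets (assumption: only a few fields are kept for brevity).
  case class Market(MarketCurrency: String,
                    BaseCurrency: String,
                    MinTradeSize: Double,
                    Notice: Option[String])

  // The root of the response is a single object, not an array.
  case class Result(success: Boolean, message: String, result: List[Market])

  // Hand-written sample that mimics the shape of the API response.
  val sampleJson: String =
    """{
      |  "success": true,
      |  "message": "",
      |  "result": [
      |    {"MarketCurrency": "LTC", "BaseCurrency": "BTC", "MinTradeSize": 0.01435906, "Notice": null}
      |  ]
      |}""".stripMargin

  // Extract the wrapper first, then return only the inner list of markets.
  def extractMarkets(json: String): List[Market] =
    parse(json).extract[Result].result
}
```

The list returned by extractMarkets is what would then be passed to spark.createDataset.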

Related

Map different value to the case class property during serialization and deserialization using Jackson

I am trying to deserialize this JSON using the Jackson library:
{
"name": "abc",
"ageInInt": 30
}
into the case class Person:
case class Person(name: String, @JsonProperty(value = "ageInInt") @JsonAlias(Array("ageInInt")) age: Int)
but I am getting:
No usable value for age
Did not find value which can be converted into int
org.json4s.package$MappingException: No usable value for age
Did not find value which can be converted into int
Basically, I want to deserialize the JSON so that the key ageInInt maps to the field age.
Here is the complete code:
val json =
"""{
|"name": "Tausif",
|"ageInInt": 30
|}""".stripMargin
implicit val format = DefaultFormats
println(Serialization.read[Person](json))

The exception in the question is thrown by json4s (org.json4s.package$MappingException), and json4s extraction does not use Jackson annotations such as @JsonProperty. Deserialize with a Jackson JsonMapper instead, and register DefaultScalaModule with it:
import com.fasterxml.jackson.databind.json.JsonMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.core.`type`.TypeReference
import com.fasterxml.jackson.annotation.JsonProperty
val mapper = JsonMapper.builder()
  .addModule(DefaultScalaModule)
  .build()
case class Person(name: String, @JsonProperty(value = "ageInInt") age: Int)
val json =
"""{
|"name": "Tausif",
|"ageInInt": 30
|}""".stripMargin
val person: Person = mapper.readValue(json, new TypeReference[Person]{})
println(person) // Prints Person(Tausif,30)
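
If you would rather stay with json4s, one option is to rename the field on the parsed JSON tree before extracting it. This is only a minimal sketch, assuming json4s-jackson is on the classpath; the object name RenameSketch and the helper readPerson are hypothetical.

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

object RenameSketch {
  implicit val formats: Formats = DefaultFormats

  // Plain case class without any Jackson annotations.
  case class Person(name: String, age: Int)

  // Rename the JSON key "ageInInt" to "age" on the AST, then extract as usual.
  def readPerson(json: String): Person =
    parse(json)
      .transformField { case JField("ageInInt", value) => JField("age", value) }
      .extract[Person]
}
```

The rename is applied to every field called ageInInt anywhere in the tree, which is fine for a flat document like this one but may be too broad for nested structures.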

Convert PreparedStatement object to JSON in Scala

I am trying to convert a PreparedStatement (the object used for sending SQL statements to the database) to JSON with Scala.
So far, I've discovered that the best way to convert an object to JSON in Scala is to use the net.liftweb library.
But when I tried it, I got an empty JSON object.
This is the code:
import java.sql.DriverManager
import net.liftweb.json._
import net.liftweb.json.Serialization.write

object Main {
  def main(args: Array[String]): Unit = {
    implicit val formats = DefaultFormats
    val jdbcSqlConnStr = "sqlserverurl**"
    val conn = DriverManager.getConnection(jdbcSqlConnStr)
    val statement = conn.prepareStatement("exec select_all")
    val piedPierJSON2 = write(statement)
    println(piedPierJSON2)
  }
}
This is the result:
{}
When I used a case class that I created myself, the conversion worked:
case class Person(name: String, address: Address)
case class Address(city: String, state: String)
val p = Person("Alvin Alexander", Address("Talkeetna", "AK"))
val piedPierJSON3 = write(p)
println(piedPierJSON3)
This is the result:
{"name":"Alvin Alexander","address":{"city":"Talkeetna","state":"AK"}}

I understood where the problem was: PreparedStatement is an interface, and none of its subtypes are serializable.
I'm going to try to wrap it in a different class.
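
The wrapping idea mentioned above could look roughly like this: copy the pieces you care about into a plain case class and serialize that instead of the JDBC object. This is a minimal sketch, assuming lift-json is on the classpath; StatementInfo, its fields and the describe helper are hypothetical names and not part of JDBC or lift-json.

```scala
import net.liftweb.json._
import net.liftweb.json.Serialization.write

object StatementJsonSketch {
  implicit val formats: Formats = DefaultFormats

  // Hypothetical wrapper: holds only plain data that lift-json can reflect on.
  case class StatementInfo(sql: String, params: List[String])

  // Serializes the wrapper, not the JDBC PreparedStatement itself.
  def describe(sql: String, params: List[String]): String =
    write(StatementInfo(sql, params))
}
```

Calling StatementJsonSketch.describe("exec select_all", Nil) yields a JSON object with the sql and params fields, in the same way as the Person example in the question.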

How to create NULLable Flink table columns from Scala case classes that contain Option types

I would like to create a DataSet (or DataStream) from a collection of case classes that contain Option values.
In the created table, columns resulting from Option values should contain either NULL or the actual primitive value.
This is what I tried:
import java.sql.Timestamp

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.TableEnvironment
import org.apache.flink.table.api.scala._
import org.apache.flink.types.Row

object OptionExample {
  case class Event(time: Timestamp, id: String, value: Option[Int])

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val tEnv = TableEnvironment.getTableEnvironment(env)
    val data = env.fromCollection(Seq(
      Event(Timestamp.valueOf("2018-01-01 00:01:00"), "a", Some(3)),
      Event(Timestamp.valueOf("2018-01-01 00:03:00"), "a", None),
      Event(Timestamp.valueOf("2018-01-01 00:03:00"), "b", Some(7)),
      Event(Timestamp.valueOf("2018-01-01 00:02:00"), "a", Some(5))
    ))
    val table = tEnv.fromDataSet(data)
    table.printSchema()
    // root
    //  |-- time: Timestamp
    //  |-- id: String
    //  |-- value: Option[Integer]
    val result = table
      .groupBy('id)
      .select('id, 'value.avg as 'averageValue)
    // Print results
    val ds: DataSet[Row] = result.toDataSet
    ds.print()
  }
}
But this causes an Exception in the aggregation part...
org.apache.flink.table.api.ValidationException: Expression avg('value) failed on input check: avg requires numeric types, get Option[Integer] here
...so with this approach Option does not get converted into a numeric type with NULLs as described above.
How can I achieve this with Flink?
(I'm coming from Apache Spark, where Datasets created from case classes with Options behave this way. I would like to achieve something similar with Flink.)

How to use the functions.explode to flatten element in dataFrame

I've written this piece of code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, Row, SparkSession, functions}

case class RawPanda(id: Long, zip: String, pt: String, happy: Boolean, attributes: Array[Double])
case class PandaPlace(name: String, pandas: Array[RawPanda])

object TestSparkDataFrame extends App {
  System.setProperty("hadoop.home.dir", "E:\\Programmation\\Libraries\\hadoop")
  val conf = new SparkConf().setAppName("TestSparkDataFrame").set("spark.driver.memory", "4g").setMaster("local[*]")
  val session = SparkSession.builder().config(conf).getOrCreate()
  import session.implicits._

  def createAndPrintSchemaRawPanda(session: SparkSession): DataFrame = {
    val newPanda = RawPanda(1, "M1B 5K7", "giant", true, Array(0.1, 0.1))
    val pandaPlace = PandaPlace("torronto", Array(newPanda))
    session.createDataFrame(Seq(pandaPlace))
  }

  val df2 = createAndPrintSchemaRawPanda(session)
  df2.show
+--------+--------------------+
| name| pandas|
+--------+--------------------+
|torronto|[[1,M1B 5K7,giant...|
+--------+--------------------+
  val pandaInfo = df2.explode(df2("pandas")) {
    case Row(pandas: Seq[Row]) =>
      pandas.map {
        case Row(
          id: Long,
          zip: String,
          pt: String,
          happy: Boolean,
          attrs: Seq[Double]) => RawPanda(id, zip, pt, happy, attrs.toArray)
      }
  }
  pandaInfo.show
+--------+--------------------+---+-------+-----+-----+----------+
| name| pandas| id| zip| pt|happy|attributes|
+--------+--------------------+---+-------+-----+-----+----------+
|torronto|[[1,M1B 5K7,giant...| 1|M1B 5K7|giant| true|[0.1, 0.1]|
+--------+--------------------+---+-------+-----+-----+----------+
The problem is that the explode method as I used it is deprecated, so I would like to recalculate the pandaInfo DataFrame using the method advised in the warning:
use flatMap() or select() with functions.explode() instead
But when I do:
val pandaInfo = df2.select(functions.explode(df2("pandas")))
I obtain the same result as I had in df2.
I don't know how to use flatMap or functions.explode here.
How could I use flatMap or functions.explode to obtain the result that I want (the one in pandaInfo)?
I've seen this post and this other one but none of them helped me.

Calling select with the explode function returns a DataFrame where the pandas array is "broken up" into individual records. Then, if you want to "flatten" the structure of the resulting single RawPanda per record, you can select the individual columns using a dot-separated path:
import org.apache.spark.sql.functions.explode

val pandaInfo2 = df2.select($"name", explode($"pandas") as "pandas")
  .select($"name", $"pandas",
    $"pandas.id" as "id",
    $"pandas.zip" as "zip",
    $"pandas.pt" as "pt",
    $"pandas.happy" as "happy",
    $"pandas.attributes" as "attributes"
  )
A less verbose version of the exact same operation would be:
import org.apache.spark.sql.Encoders // used to derive the schema (and column names) from the case class
val pandaColumns = Encoders.product[RawPanda].schema.fields.map(_.name)
val pandaInfo3 = df2.select($"name", explode($"pandas") as "pandas")
  .select(Seq($"name", $"pandas") ++ pandaColumns.map(f => $"pandas.$f" as f): _*)
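
The deprecation warning also mentions flatMap. A typed flatMap over a Dataset of PandaPlace is another way to get one row per panda; the following is a minimal sketch of that variant, not the approach from the answer above. It assumes Spark SQL is on the classpath, redefines the two case classes at top level so that encoders can be derived, and uses the hypothetical names FlatMapSketch and flattenPandas. Unlike the select version, it does not keep the original pandas column.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Top-level case classes, same shape as in the question.
case class RawPanda(id: Long, zip: String, pt: String, happy: Boolean, attributes: Array[Double])
case class PandaPlace(name: String, pandas: Array[RawPanda])

object FlatMapSketch {
  // Emits one tuple per panda, carrying the place name along, then names the columns.
  def flattenPandas(spark: SparkSession, places: Seq[PandaPlace]): DataFrame = {
    import spark.implicits._
    spark.createDataset(places)
      .flatMap { place =>
        place.pandas.toSeq.map(p => (place.name, p.id, p.zip, p.pt, p.happy, p.attributes))
      }
      .toDF("name", "id", "zip", "pt", "happy", "attributes")
  }
}
```

With an existing DataFrame such as df2, the same flatMap could be applied after df2.as[PandaPlace].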

Flink Scala join between two Streams doesn't seem to work

I want to join two streams (JSON) coming from a Kafka producer.
The code works if I only filter the data, but it does not seem to work when I join the streams. I want to print the joined stream to the console, but nothing appears.
This is my code:
import java.util.Properties
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.json4s._
import org.json4s.native.JsonMethods
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

object App {
  def main(args: Array[String]): Unit = {
    case class Data(location: String, timestamp: Long, measurement: Int, unit: String, accuracy: Double)
    case class Sensor(sensor_name: String, start_date: String, end_date: String, data_schema: Array[String], data: Data, stt: Stt)
    case class Datas(location: String, timestamp: Long, measurement: Int, unit: String, accuracy: Double)
    case class Sensor2(sensor_name: String, start_date: String, end_date: String, data_schema: Array[String], data: Datas, stt: Stt)

    val properties = new Properties()
    properties.setProperty("bootstrap.servers", "0.0.0.0:9092")
    properties.setProperty("group.id", "test")

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val consumer1 = new FlinkKafkaConsumer010[String]("topics1", new SimpleStringSchema(), properties)
    val stream1 = env
      .addSource(consumer1)
    val consumer2 = new FlinkKafkaConsumer010[String]("topics2", new SimpleStringSchema(), properties)
    val stream2 = env
      .addSource(consumer2)

    val s1 = stream1.map { x =>
      implicit val formats = DefaultFormats
      JsonMethods.parse(x).extract[Sensor]
    }
    val s2 = stream2.map { x =>
      implicit val formats = DefaultFormats
      JsonMethods.parse(x).extract[Sensor2]
    }

    val s1t = s1.assignAscendingTimestamps { x => x.data.timestamp }
    val s2t = s2.assignAscendingTimestamps { x => x.data.timestamp }

    val j1pre = s1t.join(s2t)
      .where(_.data.unit)
      .equalTo(_.data.unit)
      .window(TumblingEventTimeWindows.of(Time.seconds(2L)))
      .apply((g, s) => (s.sensor_name, g.sensor_name, s.data.measurement))

    env.execute()
  }
}
I think the problem is in the assignment of the timestamps. I suspect that assignAscendingTimestamps on the two sources is not the right function.
The JSON produced by the Kafka producer has a field data.timestamp that should be assigned as the timestamp, but I don't know how to manage that.
I also thought that I might have to give a time window batch (as in Spark) to the incoming tuples, but I'm not sure this is the right solution.

I think your code needs just some minor adjustments. First of all, as you want to work in event time, you should set the appropriate TimeCharacteristic (this needs import org.apache.flink.streaming.api.TimeCharacteristic):
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
Also, the code you pasted is missing a sink for the stream. If you want to print to the console, add this before env.execute():
j1pre.print()
The rest of your code seems fine.
