Spark Structured Streaming: Schema Inference in Scala

I'm trying to infer a dynamic JSON schema from a Kafka topic. I found this piece of code in a blog, which infers the schema using PySpark:
def read_kafka_topic(topic):
    df_json = (spark.read
        .format("kafka")
        .option("kafka.bootstrap.servers", kafka_broker)
        .option("subscribe", topic)
        .option("startingOffsets", "earliest")
        .option("endingOffsets", "latest")
        .option("failOnDataLoss", "false")
        .load()
        .withColumn("value", expr("string(value)"))
        .filter(col("value").isNotNull())
        .select("key", expr("struct(offset, value) r"))
        .groupBy("key").agg(expr("max(r) r"))
        .select("r.value"))

    df_read = spark.read.json(
        df_json.rdd.map(lambda x: x.value), multiLine=True)
I tried the equivalent in Scala:
val df_read = spark.read.json(df_json.rdd.map(x => x))
But I'm getting the error below:
cannot be applied to (org.apache.spark.rdd.RDD[org.apache.spark.sql.Row])
val df_read = spark.read.json(df_json.rdd.map(x => x))
Any fix? Kindly help.
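For reference, spark.read.json expects JSON strings (an RDD[String] or a Dataset[String]) rather than an RDD[Row]; a minimal sketch of the batch-mode Scala equivalent, assuming df_json is built exactly as in the PySpark snippet above, would be:

import spark.implicits._

// Extract the Kafka value column as a Dataset[String] of JSON documents and
// let the batch reader infer the schema from it.
val df_read = spark.read.json(df_json.select($"value").as[String])
val inferredSchema = df_read.schema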

RDDs are not supported in Structured Streaming, and Structured Streaming does not do schema inference by default: the schema needs to be defined up front.
E.g. for a file source:
val dataSchema = "Recorded_At timestamp, Device string, Index long, Model string, User string, _corrupt_record String, gt string, x double, y double, z double"
val dataPath = "dbfs:/mnt/training/definitive-guide/data/activity-data-stream.json"
val initialDF = spark
.readStream // Returns DataStreamReader
.option("maxFilesPerTrigger", 1) // Force processing of only 1 file per trigger
.schema(dataSchema) // Required for all streaming DataFrames
.json(dataPath) // The stream's source directory and file type
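If typing such a schema by hand is a concern, a common workaround (a sketch, not part of the original example; streamDF is just an illustrative name) is to infer the schema once with a one-off batch read of the same path and pass it to the streaming reader:

// Infer the schema from a batch read of the same directory, then reuse it
// for the stream, which requires a schema up front.
val inferredSchema = spark.read.json(dataPath).schema

val streamDF = spark
  .readStream
  .option("maxFilesPerTrigger", 1)
  .schema(inferredSchema)
  .json(dataPath)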
E.g. for a Kafka source, as the Databricks training materials show:
spark.conf.set("spark.sql.shuffle.partitions", sc.defaultParallelism)
val kafkaServer = "server1.databricks.training:9092" // US (Oregon)
// kafkaServer = "server2.databricks.training:9092" // Singapore
val editsDF = spark.readStream // Get the DataStreamReader
.format("kafka") // Specify the source format as "kafka"
.option("kafka.bootstrap.servers", kafkaServer) // Configure the Kafka server name and port
.option("subscribe", "en") // Subscribe to the "en" Kafka topic
.option("startingOffsets", "earliest") // Rewind stream to beginning when we restart notebook
.option("maxOffsetsPerTrigger", 1000) // Throttle Kafka's processing of the streams
.load() // Load the DataFrame
.select($"value".cast("STRING")) // Cast the "value" column to STRING
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DoubleType, BooleanType, TimestampType}
lazy val schema = StructType(List(
StructField("channel", StringType, true),
StructField("comment", StringType, true),
StructField("delta", IntegerType, true),
StructField("flag", StringType, true),
StructField("geocoding", StructType(List( // (OBJECT): Added by the server, field contains IP address geocoding information for anonymous edit.
StructField("city", StringType, true),
StructField("country", StringType, true),
StructField("countryCode2", StringType, true),
StructField("countryCode3", StringType, true),
StructField("stateProvince", StringType, true),
StructField("latitude", DoubleType, true),
StructField("longitude", DoubleType, true)
)), true),
StructField("isAnonymous", BooleanType, true),
StructField("isNewPage", BooleanType, true),
StructField("isRobot", BooleanType, true),
StructField("isUnpatrolled", BooleanType, true),
StructField("namespace", StringType, true), // (STRING): Page's namespace. See https://en.wikipedia.org/wiki/Wikipedia:Namespace
StructField("page", StringType, true), // (STRING): Printable name of the page that was edited
StructField("pageURL", StringType, true), // (STRING): URL of the page that was edited
StructField("timestamp", TimestampType, true), // (STRING): Time the edit occurred, in ISO-8601 format
StructField("url", StringType, true),
StructField("user", StringType, true), // (STRING): User who made the edit or the IP address associated with the anonymous editor
StructField("userURL", StringType, true),
StructField("wikipediaURL", StringType, true),
StructField("wikipedia", StringType, true) // (STRING): Short name of the Wikipedia that was edited (e.g., "en" for the English)
))
import org.apache.spark.sql.functions.from_json
val jsonEdits = editsDF.select(
from_json($"value", schema).as("json"))
...
...
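From there (a continuation sketch, not part of the original notebook; editsFinalDF is an illustrative name), the parsed struct can be flattened into top-level columns:

// Expand the parsed JSON struct into individual columns for downstream use.
val editsFinalDF = jsonEdits.select($"json.*")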

Related

Cannot resolve overloaded method 'createDataFrame'

The following code:
val data1 = Seq(("Android", 1, "2021-07-24 12:01:19.000", "play"), ("Android", 1, "2021-07-24 12:02:19.000", "stop"),
("Apple", 1, "2021-07-24 12:03:19.000", "play"), ("Apple", 1, "2021-07-24 12:04:19.000", "stop"))
val schema1 = StructType(Array(StructField("device_id", StringType, true),
StructField("video_id", IntegerType, true),
StructField("event_timestamp", StringType, true),
StructField("event_type", StringType, true)
))
val spark = SparkSession.builder()
.enableHiveSupport()
.appName("PlayStop")
.getOrCreate()
var transaction=spark.createDataFrame(data1, schema1)
produces the error:
Cannot resolve overloaded method 'createDataFrame'
Why?
And how to fix it?
If your schema consists of default StructField settings, the easiest way to create a DataFrame is to simply apply toDF() (note that this requires import spark.implicits._):
val transaction = data1.toDF("device_id", "video_id", "event_timestamp", "event_type")
To specify a custom schema definition, note that createDataFrame() takes an RDD[Row] and a schema as its parameters. In your case, you could transform data1 into an RDD[Row] like below (each tuple must become a Row, e.g. via Row.fromTuple):
val transaction = spark.createDataFrame(sc.parallelize(data1.map(Row.fromTuple)), schema1)
An alternative is to use toDF, followed by rdd, which exposes the DataFrame (i.e. Dataset[Row]) as an RDD[Row]:
val transaction = spark.createDataFrame(data1.toDF.rdd, schema1)
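Putting it together, a minimal self-contained sketch (assuming a local SparkSession; the master setting is an assumption) could look like:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("PlayStop").master("local[*]").getOrCreate()

val data1 = Seq(("Android", 1, "2021-07-24 12:01:19.000", "play"),
                ("Apple", 1, "2021-07-24 12:03:19.000", "play"))

val schema1 = StructType(Array(
  StructField("device_id", StringType, true),
  StructField("video_id", IntegerType, true),
  StructField("event_timestamp", StringType, true),
  StructField("event_type", StringType, true)))

// Convert each tuple to a Row so the (RDD[Row], StructType) overload applies.
val rowRDD = spark.sparkContext.parallelize(data1.map(Row.fromTuple))
val transaction = spark.createDataFrame(rowRDD, schema1)
transaction.show(false)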

How to save a data frame to MongoDB in Spark using a custom value for the _id column

val transctionSchema = StructType(Array(
StructField("School_id", StringType, true),
StructField("School_Year", StringType, true),
StructField("Run_Type", StringType, true),
StructField("Bus_No", StringType, true),
StructField("Route_Number", StringType, true),
StructField("Reason", StringType, true),
StructField("Occurred_On", DateType, true),
StructField("Number_Of_Students_On_The_Bus", IntegerType, true)))
val dfTags = sparkSession.read
  .option("header", true)
  .schema(transctionSchema)
  .option("dateFormat", "yyyyMMddhhmm")
  .csv("src/main/resources/9_bus-breakdown-and-delays_case_study.csv")
  .toDF("School_id", "School_Year", "Run_Type", "Bus_No", "Route_Number", "Reason", "Occurred_On", "Number_Of_Students_On_The_Bus")
import sparkSession.implicits._
val writeConfig = WriteConfig(Map("collection" -> "bus_Details", "writeConcern.w" -> "majority"), Some(WriteConfig(sparkSession)))
dfTags.show(5)
I have a data frame with the columns School_id / School_Year / Run_Type / Bus_No / Route_Number / Reason / Occurred_On. I want to save this data in the MongoDB collection bus_Details such that _id in the Mongo collection holds the data from the School_id column of the data frame.
I saw a post where it was suggested to define the collection with a validator like the one below, but it is not working:
properties: {
School_id: {
bsonType: "string",
id:"true"
description: "must be a string and is required"
}
Please help..
You can create a duplicate column named _id in your dataframe, like:
val dfToSave = dfTags.withColumn("_id", $"School_id")
And then save this DataFrame to MongoDB.
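For example, a sketch using the connector's MongoSpark helper and the writeConfig already defined in the question:

import com.mongodb.spark.MongoSpark

// MongoDB uses _id as the document key, so each saved document's _id will
// carry the School_id value copied above.
MongoSpark.save(dfToSave, writeConfig)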

Spark: import data frame to mongodb (scala)

Given the following data frame in spark:
Name,LicenseID_1,TypeCode_1,State_1,LicenseID_2,TypeCode_2,State_2,LicenseID_3,TypeCode_3,State_3
"John","123ABC",1,"WA","456DEF",2,"FL","789GHI",3,"CA"
"Jane","ABC123",5,"AZ","DEF456",7,"CO","GHI789",8,"GA"
How could I use Scala in Spark to write this into MongoDB as a collection of documents like the following:
{ "Name" : "John",
"Licenses" :
{
[
{"LicenseID":"123ABC","TypeCode":"1","State":"WA" },
{"LicenseID":"456DEF","TypeCode":"2","State":"FL" },
{"LicenseID":"789GHI","TypeCode":"3","State":"CA" }
]
}
},
{ "Name" : "Jane",
"Licenses" :
{
[
{"LicenseID":"ABC123","TypeCode":"5","State":"AZ" },
{"LicenseID":"DEF456","TypeCode":"7","State":"CO" },
{"LicenseID":"GHI789","TypeCode":"8","State":"GA" }
]
}
}
I tried to do this but got blocked at the following code:
val customSchema = StructType(Array(
  StructField("Name", StringType, true),
  StructField("LicenseID_1", StringType, true),
  StructField("TypeCode_1", StringType, true),
  StructField("State_1", StringType, true),
  StructField("LicenseID_2", StringType, true),
  StructField("TypeCode_2", StringType, true),
  StructField("State_2", StringType, true),
  StructField("LicenseID_3", StringType, true),
  StructField("TypeCode_3", StringType, true),
  StructField("State_3", StringType, true)))
val license = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").schema(customSchema).load("D:\\test\\test.csv")
case class License(LicenseID:String, TypeCode:String, State:String)
case class Data(Name:String, Licenses: Array[License])
val transformedData = license.map(data => Data(data(0),Array(License(data(1),data(2),data(3)),License(data(4),data(5),data(6)),License(data(7),data(8),data(9)))))
<console>:46: error: type mismatch;
found : Any
required: String
val transformedData = license.map(data => Data(data(0),Array(License(data(1),data(2),data(3)),License(data(4),data(5),data(6)),License(data(7),data(8),data(9)))))
...
Not sure what your question is; adding an example of how to save data with Spark and MongoDB:
https://docs.mongodb.com/spark-connector/current/
https://docs.mongodb.com/spark-connector/current/scala-api/
import org.apache.spark.sql.SparkSession
import com.mongodb.spark.sql._
val sc: SparkContext // An existing SparkContext.
val sparkSession = SparkSession.builder().getOrCreate()
//mongo spark helper
val df = MongoSpark.load(sparkSession) // Uses the SparkConf
Read
sparkSession.loadFromMongoDB() // Uses the SparkConf for configuration
sparkSession.loadFromMongoDB(ReadConfig(Map("uri" -> "mongodb://example.com/database.collection"))) // Uses the ReadConfig
sparkSession.read.mongo()
sparkSession.read.format("com.mongodb.spark.sql").load()
// Set custom options:
sparkSession.read.mongo(customReadConfig)
sparkSession.read.format("com.mongodb.spark.sql").options.
(customReadConfig.asOptions).load()
The connector provides the ability to persist data into MongoDB.
MongoSpark.save(centenarians.write.option("collection", "hundredClub"))
MongoSpark.load[Character](sparkSession, ReadConfig(Map("collection" ->
"data"), Some(ReadConfig(sparkSession)))).show()
Alternative ways to save data (dataFrameWriter here is a DataFrameWriter, e.g. df.write):
dataFrameWriter.mongo()
dataFrameWriter.format("com.mongodb.spark.sql").save()
Adding .toString fixed the issue and I was able to save to MongoDB in the format I wanted:
val transformedData = license.map(data => Data(data(0).toString,
  Array(License(data(1).toString, data(2).toString, data(3).toString),
        License(data(4).toString, data(5).toString, data(6).toString),
        License(data(7).toString, data(8).toString, data(9).toString))))
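Alternatively (a sketch, not the original poster's code), Row's typed getters avoid the Any-to-String mismatch in the first place:

// Use the typed getters so each field comes back as a String directly.
val transformedData = license.map { row =>
  Data(row.getString(0),
       Array(License(row.getString(1), row.getString(2), row.getString(3)),
             License(row.getString(4), row.getString(5), row.getString(6)),
             License(row.getString(7), row.getString(8), row.getString(9))))
}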

Save a 2D list into a DataFrame in Scala Spark

I have a 2d list of the following format with the name tuppleSlides:
List(List(10,4,2,4,5,2,6,2,5,7), List(10,4,2,4,5,2,6,2,5,7), List(10,4,2,4,5,2,6,2,5,7), List(10,4,2,4,5,2,6,2,5,7))
I have created the following schema:
val schema = StructType(
Array(
StructField("1", IntegerType, true),
StructField("2", IntegerType, true),
StructField("3", IntegerType, true),
StructField("4", IntegerType, true),
StructField("5", IntegerType, true),
StructField("6", IntegerType, true),
StructField("7", IntegerType, true),
StructField("8", IntegerType, true),
StructField("9", IntegerType, true),
StructField("10", IntegerType, true) )
)
and I am creating a dataframe like so:
val tuppleSlidesDF = sparkSession.createDataFrame(tuppleSlides, schema)
but it won't even compile. How am I supposed to do it properly?
Thank you.
You need to convert the 2D list to an RDD[Row] before creating a data frame:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val rdd = sc.parallelize(tuppleSlides).map(Row.fromSeq(_))
sqlContext.createDataFrame(rdd, schema)
# res7: org.apache.spark.sql.DataFrame = [1: int, 2: int, 3: int, 4: int, 5: int, 6: int, 7: int, 8: int, 9: int, 10: int]
Also note in spark 2.x, sqlContext is replaced with spark:
spark.createDataFrame(rdd, schema)
# res1: org.apache.spark.sql.DataFrame = [1: int, 2: int ... 8 more fields]

Array Index Out Of Bounds when running Logistic Regression from Spark MLlib

I'm trying to run a Logistic Regression model on the KDD dataset using Scala and the Spark MLlib library. I have gone through multiple websites, tutorials and forums, but I still can't figure out why my code is not working. It must be something simple, but I just don't get it and I'm feeling blocked at this moment. Here is what (I think) I'm doing:
1. Create a Spark context.
2. Create a SQL context.
3. Load the paths for the training and test data files.
4. Define the schema for the data to work with, i.e. the columns we are going to use (names and types) from the KDD dataset.
5. Read the file with the training data.
6. Read the file with the test data.
7. Filter the input data to ensure only numeric values for every column (I just drop the three StringType columns).
8. Since the Logistic Regression model needs a column called "features" with all features packed within a single vector, I create such a column via the VectorAssembler.
9. I keep only the columns named "label" and "features", which are essential for the Logistic Regression model.
10. I use a StringIndexer to transform the values in the "label" column into Doubles, otherwise Logistic Regression complains that it can't work with StringType.
11. I set the hyperparameters for the Logistic Regression model, indicating the label and features columns.
12. I attempt to train the model (via the fit method).
Below you can find the code:
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._
import org.apache.spark.{SparkConf, SparkContext}
object LogisticRegressionV2 {
val settings = new Settings() // Here I define the proper values for the training and test files paths
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("LogisticRegressionV2").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val trainingPath = settings.rootFolder + settings.dataFolder + settings.trainingDataFileName
val testPath = settings.rootFolder + settings.dataFolder + settings.testFileName
val kddSchema = StructType(Array(
StructField("duration", IntegerType, true),
StructField("protocol_type", StringType, true),
StructField("service", StringType, true),
StructField("flag", StringType, true),
StructField("src_bytes", IntegerType, true),
StructField("dst_bytes", IntegerType, true),
StructField("land", IntegerType, true),
StructField("wrong_fragment", IntegerType, true),
StructField("urgent", IntegerType, true),
StructField("hot", IntegerType, true),
StructField("num_failed_logins", IntegerType, true),
StructField("logged_in", IntegerType, true),
StructField("num_compromised", IntegerType, true),
StructField("root_shell", IntegerType, true),
StructField("su_attempted", IntegerType, true),
StructField("num_root", IntegerType, true),
StructField("num_file_creations", IntegerType, true),
StructField("num_shells", IntegerType, true),
StructField("num_access_files", IntegerType, true),
StructField("num_outbound_cmds", IntegerType, true),
StructField("is_host_login", IntegerType, true),
StructField("is_guest_login", IntegerType, true),
StructField("count", IntegerType, true),
StructField("srv_count", IntegerType, true),
StructField("serror_rate", DoubleType, true),
StructField("srv_serror_rate", DoubleType, true),
StructField("rerror_rate", DoubleType, true),
StructField("srv_rerror_rate", DoubleType, true),
StructField("same_srv_rate", DoubleType, true),
StructField("diff_srv_rate", DoubleType, true),
StructField("srv_diff_host_rate", DoubleType, true),
StructField("dst_host_count", IntegerType, true),
StructField("dst_host_srv_count", IntegerType, true),
StructField("dst_host_same_srv_rate", DoubleType, true),
StructField("dst_host_diff_srv_rate", DoubleType, true),
StructField("dst_host_same_src_port_rate", DoubleType, true),
StructField("dst_host_srv_diff_host_rate", DoubleType, true),
StructField("dst_host_serror_rate", DoubleType, true),
StructField("dst_host_srv_serror_rate", DoubleType, true),
StructField("dst_host_rerror_rate", DoubleType, true),
StructField("dst_host_srv_rerror_rate", DoubleType, true),
StructField("label", StringType, true)
))
val rawTraining = sqlContext.read
.format("csv")
.option("header", "true")
.schema(kddSchema)
.load(trainingPath)
val rawTest = sqlContext.read
.format("csv")
.option("header", "true")
.schema(kddSchema)
.load(testPath)
val trainingNumeric = rawTraining.drop("service").drop("protocol_type").drop("flag")
val trainingAssembler = new VectorAssembler()
//.setInputCols(trainingNumeric.columns.filter(_ != "label"))
.setInputCols(Array("duration", "src_bytes", "dst_bytes", "land", "wrong_fragment", "urgent", "hot",
"num_failed_logins", "logged_in", "num_compromised", "root_shell", "su_attempted", "num_root",
"num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds", "is_host_login",
"is_guest_login", "count", "srv_count", "serror_rate", "srv_serror_rate", "rerror_rate", "srv_rerror_rate",
"same_srv_rate", "diff_srv_rate", "srv_diff_host_rate", "dst_host_count", "dst_host_srv_count",
"dst_host_same_srv_rate", "dst_host_diff_srv_rate", "dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate",
"dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate", "dst_host_srv_rerror_rate"))
.setOutputCol("features")
val trainingAssembled = trainingAssembler.transform(trainingNumeric).select("label", "features")
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
.fit(trainingAssembled)
val trainingData = labelIndexer.transform(trainingAssembled).select("indexedLabel", "features")
trainingData.show(false)
val lr = new LogisticRegression()
.setMaxIter(2)
.setRegParam(0.3)
.setElasticNetParam(0.8)
.setLabelCol("indexedLabel")
.setFeaturesCol("features")
val predictions = lr.fit(trainingData)
sc.stop()
}
}
As you can see, it is simple code, but I get a "java.lang.ArrayIndexOutOfBoundsException: 1" when the execution reaches the line:
val predictions = lr.fit(trainingData)
And I just don't know why. If you have any clue about this issue, it would be much appreciated. Many thanks in advance.
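As a first diagnostic step (a sketch, not a fix), it can help to confirm that the CSV actually parsed into the expected columns and that the labels indexed as expected before calling fit:

// Diagnostic sketch: verify the parsed schema, column count, and label
// distribution before fitting the model.
rawTraining.printSchema()
println(s"columns = ${rawTraining.columns.length}, rows = ${rawTraining.count()}")
trainingData.groupBy("indexedLabel").count().show(false)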