Spark: import data frame to mongodb (scala) - mongodb

Given the following data frame in spark:
Name,LicenseID_1,TypeCode_1,State_1,LicenseID_2,TypeCode_2,State_2,LicenseID_3,TypeCode_3,State_3
"John","123ABC",1,"WA","456DEF",2,"FL","789GHI",3,"CA"
"Jane","ABC123",5,"AZ","DEF456",7,"CO","GHI789",8,"GA"
How could I use Scala in Spark to write this into MongoDB as a collection of documents like the following:
{ "Name" : "John",
"Licenses" :
{
[
{"LicenseID":"123ABC","TypeCode":"1","State":"WA" },
{"LicenseID":"456DEF","TypeCode":"2","State":"FL" },
{"LicenseID":"789GHI","TypeCode":"3","State":"CA" }
]
}
},
{ "Name" : "Jane",
"Licenses" :
{
[
{"LicenseID":"ABC123","TypeCode":"5","State":"AZ" },
{"LicenseID":"DEF456","TypeCode":"7","State":"CO" },
{"LicenseID":"GHI789","TypeCode":"8","State":"GA" }
]
}
}
I tried to do this but got blocked at the following code:
val customSchema = StructType(Array(
  StructField("Name", StringType, true),
  StructField("LicenseID_1", StringType, true),
  StructField("TypeCode_1", StringType, true),
  StructField("State_1", StringType, true),
  StructField("LicenseID_2", StringType, true),
  StructField("TypeCode_2", StringType, true),
  StructField("State_2", StringType, true),
  StructField("LicenseID_3", StringType, true),
  StructField("TypeCode_3", StringType, true),
  StructField("State_3", StringType, true)))
val license = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(customSchema)
  .load("D:\\test\\test.csv")
case class License(LicenseID:String, TypeCode:String, State:String)
case class Data(Name:String, Licenses: Array[License])
val transformedData = license.map(data => Data(data(0),Array(License(data(1),data(2),data(3)),License(data(4),data(5),data(6)),License(data(7),data(8),data(9)))))
<console>:46: error: type mismatch;
found : Any
required: String
val transformedData = license.map(data => Data(data(0),Array(License(data(1),data(2),data(3)),License(data(4),data(5),data(6)),License(data(7),data(8),data(9)))))
...

I'm not sure exactly what your question is, but here are examples of how to read and write data between Spark and MongoDB:
https://docs.mongodb.com/spark-connector/current/
https://docs.mongodb.com/spark-connector/current/scala-api/
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import com.mongodb.spark.sql._
val sc: SparkContext // An existing SparkContext.
val sparkSession = SparkSession.builder().getOrCreate()
//mongo spark helper
val df = MongoSpark.load(sparkSession) // Uses the SparkConf
Read
sparkSession.loadFromMongoDB() // Uses the SparkConf for configuration
sparkSession.loadFromMongoDB(ReadConfig(Map("uri" -> "mongodb://example.com/database.collection"))) // Uses the ReadConfig
sparkSession.read.mongo()
sparkSession.read.format("com.mongodb.spark.sql").load()
// Set custom options:
sparkSession.read.mongo(customReadConfig)
sparkSession.read.format("com.mongodb.spark.sql").options.
(customReadConfig.asOptions).load()
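The customReadConfig referenced above is not defined in this snippet; a minimal sketch of building one (the URI and read preference are placeholder values):
import com.mongodb.spark.config.ReadConfig
val customReadConfig = ReadConfig(
  Map("uri" -> "mongodb://example.com/database.collection",
      "readPreference.name" -> "secondaryPreferred"))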
The connector provides the ability to persist data into MongoDB.
MongoSpark.save(centenarians.write.option("collection", "hundredClub"))
MongoSpark.load[Character](sparkSession, ReadConfig(Map("collection" ->
"data"), Some(ReadConfig(sparkSession)))).show()
Alternative ways to save data:
// dataFrameWriter is a DataFrameWriter, i.e. something like df.write
dataFrameWriter.mongo()
dataFrameWriter.format("com.mongodb.spark.sql").save()

Adding .toString fixed the issue and I was able to save to MongoDB in the format I wanted.
val transformedData = license.map(data => Data(
  data(0).toString,
  Array(
    License(data(1).toString, data(2).toString, data(3).toString),
    License(data(4).toString, data(5).toString, data(6).toString),
    License(data(7).toString, data(8).toString, data(9).toString))))
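For completeness, a minimal sketch of writing the transformed data out with the connector, following the MongoSpark.save pattern quoted above (the collection name "licenses" is a placeholder, and the MongoDB URI is assumed to be configured via spark.mongodb.output.uri on the session):
import com.mongodb.spark.MongoSpark
// transformedData is the Dataset[Data] built above; each Data row becomes one
// document with a nested Licenses array, matching the desired output.
MongoSpark.save(transformedData.toDF().write.option("collection", "licenses").mode("overwrite"))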

Related

Scala spark-shell: schema function structType type mismatch

I'm learning Scala to work with Spark and having difficulty with return types:
Code:
def createSchema(name: String): StructType = {
  if (name == "test01") {
    StructType(
      List(
        StructField("id", StringType, true),
        StructField("score", DoubleType, true)
      ))
  }
}
Produces:
error: type mismatch;
found : Unit
required: org.apache.spark.sql.types.StructType
Without the argument and the if condition, the function works as expected.
I understand that the if expression is the last expression evaluated and that it sets the return type to Unit.
I have tried a val definition (and other variations) without success.
Code:
def createSchema(name: String): StructType = {
  val struct: StructType = if (name == "test01") {
    StructType(
      List(
        StructField("id", StringType, true),
        StructField("score", DoubleType, true)
      ))
  }
  struct
}
Produces:
error: type mismatch;
found : Unit
required: org.apache.spark.sql.types.StructType
var struct: StructType = if (name == "test01") {
^
Appreciate any help understanding the type mismatch errors and solutions.
Solution for the test function using if (as a learning exercise).
Code:
def createSchema(name: String): StructType = {
  val struct = if (name == "test01") {
    StructType(
      List(
        StructField("id", StringType, true),
        StructField("score", DoubleType, true)
      ))
  }
  else {
    StructType(
      List(
        StructField("col1", StringType, true),
        StructField("col2", StringType, true)
      ))
  }
  struct
}
Thanks for your help and explanation.
Because you have not defined an else branch, the compiler inserts else () for you, and () has type Unit. The type of the whole if expression is therefore the least common supertype of all its branches, which is not a StructType. Keep in mind that the if condition could also be false, in which case the body of your method produces no value (= Unit).
You could verify this by typing in
val struct = if (name == "test01") {
StructType(
List(
StructField("id", StringType, true),
StructField("score", DoubleType, true)
)
)
}
into your Scala REPL to see that it returns:
struct: Any = ()
Declare an else branch or, as you have already done, remove the if condition.
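For comparison, adding an else branch (with the same imports as above) gives the expected type in the REPL:
val struct = if (true) {
  StructType(List(StructField("id", StringType, true)))
} else {
  StructType(List(StructField("col1", StringType, true)))
}
// struct: org.apache.spark.sql.types.StructType = ...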

how to save data frame in Mongodb in spark using custom value for _id column

val transctionSchema = StructType(Array(
StructField("School_id", StringType, true),
StructField("School_Year", StringType, true),
StructField("Run_Type", StringType, true),
StructField("Bus_No", StringType, true),
StructField("Route_Number", StringType, true),
StructField("Reason", StringType, true),
StructField("Occurred_On", DateType, true),
StructField("Number_Of_Students_On_The_Bus", IntegerType, true)))
val dfTags = sparkSession.read.option("header", true).schema(transctionSchema).
option("dateFormat", "yyyyMMddhhmm")
.csv("src/main/resources/9_bus-breakdown-and-delays_case_study.csv").
toDF("School_id", "School_Year", "Run_Type", "Bus_No", "Route_Number", "Reason", "Occurred_On", "Number_Of_Students_On_The_Bus")
import sparkSession.implicits._
val writeConfig = WriteConfig(Map("collection" -> "bus_Details", "writeConcern.w" -> "majority"), Some(WriteConfig(sparkSession)))
dfTags.show(5)
I have a data frame with the columns School_id, School_Year, Run_Type, Bus_No, Route_Number, Reason, and Occurred_On. I want to save this data to the MongoDB collection bus_Details so that _id in the collection holds the value from the School_id column of the data frame.
I saw a post where it was suggested to define the collection as follows, but it is not working:
properties: {
  School_id: {
    bsonType: "string",
    id: "true",
    description: "must be a string and is required"
  }
}
Please help.
You can create a duplicate of the School_id column named _id in your DataFrame:
val dfToSave = dfTags.withColumn("_id", $"School_id")
and then save this to MongoDB.
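Putting the answer together with the writeConfig already defined in the question, a minimal sketch (assuming the connector's MongoSpark and WriteConfig imports):
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.WriteConfig
// Duplicate School_id as _id so MongoDB uses it as the document key.
val dfToSave = dfTags.withColumn("_id", $"School_id")
// Reuse the WriteConfig from the question (collection bus_Details, majority write concern).
MongoSpark.save(dfToSave.write.mode("append"), writeConfig)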

Define StructType as input datatype of a Function Spark-Scala 2.11 [duplicate]

This question already has an answer here:
Defining a UDF that accepts an Array of objects in a Spark DataFrame? (closed as a duplicate 3 years ago)
I'm trying to write a Spark UDF in Scala, and I need to define the function's input data type.
I have a schema variable with the StructType shown below.
import org.apache.spark.sql.types._
val relationsSchema = StructType(
Seq(
StructField("relation", ArrayType(
StructType(Seq(
StructField("attribute", StringType, true),
StructField("email", StringType, true),
StructField("fname", StringType, true),
StructField("lname", StringType, true)
)
), true
), true)
)
)
I'm trying to write a Function like below
val relationsFunc: Array[Map[String,String]] => Array[String] = _.map(do something)
val relationUDF = udf(relationsFunc)
input.withColumn("relation",relationUDF(col("relation")))
The above code throws the exception below:
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(relation)' due to data type mismatch: argument 1 requires array<map<string,string>> type, however, '`relation`' is of array<struct<attribute:string,email:string,fname:string,lname:string>> type.;;
'Project [relation#89, UDF(relation#89) AS proc#273]
if I give the input type as
val relationsFunc: StructType => Array[String] =
I'm not able to implement the logic, as _.map gives me metadata, field names, etc.
Please advise how to define relationsSchema as the input data type in the function below.
val relationsFunc: ? => Array[String] = _.map(somelogic)
Your structure under relation is an array of Rows (Spark passes an ArrayType column to a UDF as a Seq of Rows), so your function should have the following signature:
val relationsFunc: Seq[Row] => Array[String]
then you can access your data either by position or by name, i.e.:
{r: Row => r.getAs[String]("email")}
Check the mapping table in the documentation to determine the data type representations between Spark SQL and Scala: https://spark.apache.org/docs/2.4.4/sql-reference.html#data-types
Your relation field is a Spark SQL complex type of type StructType, which is represented by Scala type org.apache.spark.sql.Row so this is the input type you should be using.
I used your code to create this complete working example that extracts email values:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.Row
val relationsSchema = StructType(
Seq(
StructField("relation", ArrayType(
StructType(
Seq(
StructField("attribute", StringType, true),
StructField("email", StringType, true),
StructField("fname", StringType, true),
StructField("lname", StringType, true)
)
), true
), true)
)
)
val data = Seq(
  // one outer Row per record; the relation field is a sequence of struct Rows
  Row(Seq(Row("1", "johnny#example.com", "Johnny", "Appleseed")))
)
val df = spark.createDataFrame(
spark.sparkContext.parallelize(data),
relationsSchema
)
// ArrayType columns arrive in a UDF as a Seq of Rows, not an Array
val relationsFunc = (relation: Seq[Row]) => relation.map(_.getAs[String]("email"))
val relationUdf = udf(relationsFunc)
df.withColumn("relation", relationUdf(col("relation")))

How to generate datasets dynamically based on schema?

I have multiple schemas like the one below, with different column names and data types.
I want to generate test/simulated data with Scala as a DataFrame for each schema and save it to a parquet file.
Below is the example schema (from a sample json) to generate data dynamically with dummy values in it.
val schema1 = StructType(
  List(
    StructField("a", DoubleType, true),
    StructField("aa", StringType, true),
    StructField("p", LongType, true),
    StructField("pp", StringType, true)
  )
)
I need an RDD or DataFrame like the following, with 1000 rows each, based on the columns in the above schema.
val data = Seq(
Row(1d, "happy", 1L, "Iam"),
Row(2d, "sad", 2L, "Iam"),
Row(3d, "glad", 3L, "Iam")
)
Basically, there are 200 such datasets for which I need to generate data dynamically; writing a separate program for each schema is simply not feasible for me.
Please help me with your ideas or an implementation, as I am new to Spark.
Is it possible to generate dynamic data based on schemas of different types?
Following @JacekLaskowski's advice, you could generate dynamic data using ScalaCheck generators (Gen) based on the fields/types you are expecting.
It could look like this:
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SaveMode}
import org.scalacheck._
import scala.collection.JavaConverters._
val dynamicValues: Map[(String, DataType), Gen[Any]] = Map(
("a", DoubleType) -> Gen.choose(0.0, 100.0),
("aa", StringType) -> Gen.oneOf("happy", "sad", "glad"),
("p", LongType) -> Gen.choose(0L, 10L),
("pp", StringType) -> Gen.oneOf("Iam", "You're")
)
val schemas = Map(
"schema1" -> StructType(
List(
StructField("a", DoubleType, true),
StructField("aa", StringType, true),
StructField("p", LongType, true),
StructField("pp", StringType, true)
)),
"schema2" -> StructType(
List(
StructField("a", DoubleType, true),
StructField("pp", StringType, true),
StructField("p", LongType, true)
)
)
)
val numRecords = 1000
schemas.foreach {
case (name, schema) =>
// create a data frame
spark.createDataFrame(
// of #numRecords records
(0 until numRecords).map { _ =>
// each of them a row
Row.fromSeq(schema.fields.map(field => {
// with fields based on the schema's fieldname & type else null
dynamicValues.get((field.name, field.dataType)).flatMap(_.sample).orNull
}))
}.asJava, schema)
// store to parquet
.write.mode(SaveMode.Overwrite).parquet(name)
}
ScalaCheck is a framework for generating data; you generate the raw data based on the schema using your custom generators.
Visit the ScalaCheck documentation for details.
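As a quick illustration of what the generators above produce (the values in the comments are examples, since sampling is random):
import org.scalacheck.Gen
Gen.choose(0.0, 100.0).sample             // e.g. Some(42.7)
Gen.oneOf("happy", "sad", "glad").sample  // e.g. Some("glad")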
You could do something like this
import org.apache.spark.SparkConf
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.json4s
import org.json4s.JsonAST._
import org.json4s.jackson.JsonMethods._
import scala.util.Random
object Test extends App {
val structType: StructType = StructType(
List(
StructField("a", DoubleType, true),
StructField("aa", StringType, true),
StructField("p", LongType, true),
StructField("pp", StringType, true)
)
)
val spark = SparkSession
.builder()
.master("local[*]")
.config(new SparkConf())
.getOrCreate()
import spark.implicits._
val df = createRandomDF(structType, 1000)
def createRandomDF(structType: StructType, size: Int, rnd: Random = new Random()): DataFrame ={
// use `until` so exactly `size` rows are generated
spark.read.schema(structType).json((0 until size).map { _ => compact(randomJson(rnd, structType)) }.toDS())
}
def randomJson(rnd: Random, dataType: DataType): JValue = {
dataType match {
case v: DoubleType =>
json4s.JDouble(rnd.nextDouble())
case v: StringType =>
JString(rnd.nextString(10))
case v: IntegerType =>
JInt(rnd.nextInt())
case v: LongType =>
JInt(rnd.nextLong())
case v: FloatType =>
JDouble(rnd.nextFloat())
case v: BooleanType =>
JBool(rnd.nextBoolean())
case v: ArrayType =>
val size = rnd.nextInt(10)
JArray(
(0 to size).map(_ => randomJson(rnd, v.elementType)).toList
)
case v: StructType =>
JObject(
v.fields.flatMap {
f =>
if (f.nullable && rnd.nextBoolean())
None
else
Some(JField(f.name, randomJson(rnd, f.dataType)))
}.toList
)
}
}
}
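The question also asks to save the generated data to parquet; a minimal sketch, placed inside the same object right after val df (the output path "schema1" is a placeholder):
// write the randomly generated DataFrame as parquet
df.write.mode("overwrite").parquet("schema1")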

Array Index Out Of Bounds when running Logistic Regression from Spark MLlib

I'm trying to run a Logistic Regression model over the KDD dataset using Scala and the Spark MLlib library. I have gone through multiple websites, tutorials, and forums, but I still can't figure out why my code is not working. It must be something simple, but I just don't get it and I'm feeling blocked at the moment. Here is what (I think) I'm doing:
Create a Spark Context.
Create a SQL Context.
Load paths for training and test data files.
Define the schema for the data to work with. That is, the columns we are going to use (names and types) with the KDD dataset.
Read the file with training data.
Read the file with the test data.
Filter input data to ensure only numeric values for every column (I just drop the three StringType columns).
Since the Logistic Regression model needs a column called "features" with all the features packed into a single vector, I create that column via the "VectorAssembler" function.
I just keep the columns named "label" and "features", which are essential for the Logistic Regression model.
I use the "StringIndexer" function in order to transform the values from the "label" column into Doubles, otherwise Logistic Regression complies saying it can't work with StringType.
I set the hyperparameters for the Logistic Regression model, indicating the Label and Features columns.
I attempt to train the model (via the "fit" method).
Below you can find the code:
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._
import org.apache.spark.{SparkConf, SparkContext}
object LogisticRegressionV2 {
val settings = new Settings() // Here I define the proper values for the training and test files paths
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("LogisticRegressionV2").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val trainingPath = settings.rootFolder + settings.dataFolder + settings.trainingDataFileName
val testPath = settings.rootFolder + settings.dataFolder + settings.testFileName
val kddSchema = StructType(Array(
StructField("duration", IntegerType, true),
StructField("protocol_type", StringType, true),
StructField("service", StringType, true),
StructField("flag", StringType, true),
StructField("src_bytes", IntegerType, true),
StructField("dst_bytes", IntegerType, true),
StructField("land", IntegerType, true),
StructField("wrong_fragment", IntegerType, true),
StructField("urgent", IntegerType, true),
StructField("hot", IntegerType, true),
StructField("num_failed_logins", IntegerType, true),
StructField("logged_in", IntegerType, true),
StructField("num_compromised", IntegerType, true),
StructField("root_shell", IntegerType, true),
StructField("su_attempted", IntegerType, true),
StructField("num_root", IntegerType, true),
StructField("num_file_creations", IntegerType, true),
StructField("num_shells", IntegerType, true),
StructField("num_access_files", IntegerType, true),
StructField("num_outbound_cmds", IntegerType, true),
StructField("is_host_login", IntegerType, true),
StructField("is_guest_login", IntegerType, true),
StructField("count", IntegerType, true),
StructField("srv_count", IntegerType, true),
StructField("serror_rate", DoubleType, true),
StructField("srv_serror_rate", DoubleType, true),
StructField("rerror_rate", DoubleType, true),
StructField("srv_rerror_rate", DoubleType, true),
StructField("same_srv_rate", DoubleType, true),
StructField("diff_srv_rate", DoubleType, true),
StructField("srv_diff_host_rate", DoubleType, true),
StructField("dst_host_count", IntegerType, true),
StructField("dst_host_srv_count", IntegerType, true),
StructField("dst_host_same_srv_rate", DoubleType, true),
StructField("dst_host_diff_srv_rate", DoubleType, true),
StructField("dst_host_same_src_port_rate", DoubleType, true),
StructField("dst_host_srv_diff_host_rate", DoubleType, true),
StructField("dst_host_serror_rate", DoubleType, true),
StructField("dst_host_srv_serror_rate", DoubleType, true),
StructField("dst_host_rerror_rate", DoubleType, true),
StructField("dst_host_srv_rerror_rate", DoubleType, true),
StructField("label", StringType, true)
))
val rawTraining = sqlContext.read
.format("csv")
.option("header", "true")
.schema(kddSchema)
.load(trainingPath)
val rawTest = sqlContext.read
.format("csv")
.option("header", "true")
.schema(kddSchema)
.load(testPath)
val trainingNumeric = rawTraining.drop("service").drop("protocol_type").drop("flag")
val trainingAssembler = new VectorAssembler()
//.setInputCols(trainingNumeric.columns.filter(_ != "label"))
.setInputCols(Array("duration", "src_bytes", "dst_bytes", "land", "wrong_fragment", "urgent", "hot",
"num_failed_logins", "logged_in", "num_compromised", "root_shell", "su_attempted", "num_root",
"num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds", "is_host_login",
"is_guest_login", "count", "srv_count", "serror_rate", "srv_serror_rate", "rerror_rate", "srv_rerror_rate",
"same_srv_rate", "diff_srv_rate", "srv_diff_host_rate", "dst_host_count", "dst_host_srv_count",
"dst_host_same_srv_rate", "dst_host_diff_srv_rate", "dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate",
"dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate", "dst_host_srv_rerror_rate"))
.setOutputCol("features")
val trainingAssembled = trainingAssembler.transform(trainingNumeric).select("label", "features")
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
.fit(trainingAssembled)
val trainingData = labelIndexer.transform(trainingAssembled).select("indexedLabel", "features")
trainingData.show(false)
val lr = new LogisticRegression()
.setMaxIter(2)
.setRegParam(0.3)
.setElasticNetParam(0.8)
.setLabelCol("indexedLabel")
.setFeaturesCol("features")
val predictions = lr.fit(trainingData)
sc.stop()
}
}
As you can see, it is a simple code, but I get a "java.lang.ArrayIndexOutOfBoundsException: 1" when the execution reaches the line:
val predictions = lr.fit(trainingData)
And I just don't know why. If you have any clue about this issue, it would be very much appreciated. Many thanks in advance.