I have a 2D list named tuppleSlides, in the following format:
List(List(10,4,2,4,5,2,6,2,5,7), List(10,4,2,4,5,2,6,2,5,7), List(10,4,2,4,5,2,6,2,5,7), List(10,4,2,4,5,2,6,2,5,7))
I have created the following schema:
val schema = StructType(
Array(
StructField("1", IntegerType, true),
StructField("2", IntegerType, true),
StructField("3", IntegerType, true),
StructField("4", IntegerType, true),
StructField("5", IntegerType, true),
StructField("6", IntegerType, true),
StructField("7", IntegerType, true),
StructField("8", IntegerType, true),
StructField("9", IntegerType, true),
StructField("10", IntegerType, true) )
)
and I am creating a DataFrame like so:
val tuppleSlidesDF = sparkSession.createDataFrame(tuppleSlides, schema)
but it won't even compile. How am I supposed to do it properly?
Thank you.
You need to convert the 2D list to an RDD[Row] before creating a DataFrame:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val rdd = sc.parallelize(tuppleSlides).map(Row.fromSeq(_))
sqlContext.createDataFrame(rdd, schema)
# res7: org.apache.spark.sql.DataFrame = [1: int, 2: int, 3: int, 4: int, 5: int, 6: int, 7: int, 8: int, 9: int, 10: int]
Also note that in Spark 2.x, sqlContext is replaced by spark (the SparkSession):
spark.createDataFrame(rdd, schema)
# res1: org.apache.spark.sql.DataFrame = [1: int, 2: int ... 8 more fields]
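For completeness, here is a minimal end-to-end sketch for Spark 2.x. It assumes spark is your SparkSession and rebuilds the same schema programmatically rather than listing the ten fields by hand:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// the 2D list from the question
val tuppleSlides = List(
  List(10, 4, 2, 4, 5, 2, 6, 2, 5, 7),
  List(10, 4, 2, 4, 5, 2, 6, 2, 5, 7)
)

// same schema as above, built programmatically: columns "1" .. "10", all nullable ints
val schema = StructType((1 to 10).map(i => StructField(i.toString, IntegerType, nullable = true)))

// createDataFrame needs an RDD[Row] (or a java.util.List[Row]), not a List[List[Int]]
val rowRdd = spark.sparkContext.parallelize(tuppleSlides).map(Row.fromSeq(_))
val tuppleSlidesDF = spark.createDataFrame(rowRdd, schema)
tuppleSlidesDF.show()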
I have been trying to run some experiments on datasets in Zeppelin 0.9 running locally. However, I am running into NPEs when performing operations on Datasets. The same operations seem to work on DataFrames. Here is an example of what is failing:
import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
case class Person(firstname: String, middlename: String, lastname: String, id: String, gender: String, salary: Int)
val simpleData = Seq(Row("James","","Smith","36636","M",3000),
Row("Michael","Rose","","40288","M",4000),
Row("Robert","","Williams","42114","M",4000),
Row("Maria","Anne","Jones","39192","F",4000),
Row("Jen","Mary","Brown","","F",-1)
)
val simpleSchema = StructType(Array(
StructField("firstname",StringType,true),
StructField("middlename",StringType,true),
StructField("lastname",StringType,true),
StructField("id", StringType, true),
StructField("gender", StringType, true),
StructField("salary", IntegerType, true)
))
val df = spark.createDataFrame(
spark.sparkContext.parallelize(simpleData),simpleSchema).as[Person]
df.filter( x => x.firstname == "James").show()
This is the error that I get:
java.lang.NullPointerException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:70)
at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
at scala.Option.map(Option.scala:146)
at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:485)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
at scala.Option.getOrElse(Option.scala:121)
You have to define the Person case class in a different cell from the one where you use as[Person].
Cell 1:
import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
case class Person(firstname: String, middlename: String, lastname: String, id: String, gender: String, salary: Int)
Cell 2:
val simpleData = Seq(Row("James","","Smith","36636","M",3000),
Row("Michael","Rose","","40288","M",4000),
Row("Robert","","Williams","42114","M",4000),
Row("Maria","Anne","Jones","39192","F",4000),
Row("Jen","Mary","Brown","","F",-1)
)
val simpleSchema = StructType(Array(
StructField("firstname",StringType,true),
StructField("middlename",StringType,true),
StructField("lastname",StringType,true),
StructField("id", StringType, true),
StructField("gender", StringType, true),
StructField("salary", IntegerType, true)
))
val df = spark.createDataFrame(
spark.sparkContext.parallelize(simpleData),simpleSchema).as[Person]
df.filter( x => x.firstname == "James").show()
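As a side note, if you are constructing the data yourself anyway, a sketch like the following (assuming the same Person case class defined in its own cell, as above) skips the Row/schema plumbing and builds a Dataset[Person] directly:
Cell 3 (sketch):
import spark.implicits._

val people = Seq(
  Person("James", "", "Smith", "36636", "M", 3000),
  Person("Michael", "Rose", "", "40288", "M", 4000)
).toDS()

people.filter(p => p.firstname == "James").show()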
I'm trying to infer a dynamic JSON schema from a Kafka topic. I found this piece of code in a blog, which infers the schema using PySpark:
from pyspark.sql.functions import col, expr

def read_kafka_topic(topic):
    df_json = (spark.read
               .format("kafka")
               .option("kafka.bootstrap.servers", kafka_broker)
               .option("subscribe", topic)
               .option("startingOffsets", "earliest")
               .option("endingOffsets", "latest")
               .option("failOnDataLoss", "false")
               .load()
               .withColumn("value", expr("string(value)"))
               .filter(col("value").isNotNull())
               .select("key", expr("struct(offset, value) r"))
               .groupBy("key").agg(expr("max(r) r"))
               .select("r.value"))
    df_read = spark.read.json(
        df_json.rdd.map(lambda x: x.value), multiLine=True)
I tried the equivalent in Scala:
val df_read = spark.read.json(df_json.rdd.map(x => x))
But I'm getting the error below:
cannot be applied to (org.apache.spark.rdd.RDD[org.apache.spark.sql.Row])
val df_read = spark.read.json(df_json.rdd.map(x=>x))
Any fix? Kindly help.
RDDs are not supported in Structured Streaming, and Structured Streaming does not allow schema inference: the schema needs to be defined.
E.g., for a file source:
val dataSchema = "Recorded_At timestamp, Device string, Index long, Model string, User string, _corrupt_record String, gt string, x double, y double, z double"
val dataPath = "dbfs:/mnt/training/definitive-guide/data/activity-data-stream.json"
val initialDF = spark
.readStream // Returns DataStreamReader
.option("maxFilesPerTrigger", 1) // Force processing of only 1 file per trigger
.schema(dataSchema) // Required for all streaming DataFrames
.json(dataPath) // The stream's source directory and file type
E.g., for a Kafka source, as Databricks teaches it:
spark.conf.set("spark.sql.shuffle.partitions", sc.defaultParallelism)
val kafkaServer = "server1.databricks.training:9092" // US (Oregon)
// kafkaServer = "server2.databricks.training:9092" // Singapore
val editsDF = spark.readStream // Get the DataStreamReader
.format("kafka") // Specify the source format as "kafka"
.option("kafka.bootstrap.servers", kafkaServer) // Configure the Kafka server name and port
.option("subscribe", "en") // Subscribe to the "en" Kafka topic
.option("startingOffsets", "earliest") // Rewind stream to beginning when we restart notebook
.option("maxOffsetsPerTrigger", 1000) // Throttle Kafka's processing of the streams
.load() // Load the DataFrame
.select($"value".cast("STRING")) // Cast the "value" column to STRING
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DoubleType, BooleanType, TimestampType}
lazy val schema = StructType(List(
StructField("channel", StringType, true),
StructField("comment", StringType, true),
StructField("delta", IntegerType, true),
StructField("flag", StringType, true),
StructField("geocoding", StructType(List( // (OBJECT): Added by the server, field contains IP address geocoding information for anonymous edit.
StructField("city", StringType, true),
StructField("country", StringType, true),
StructField("countryCode2", StringType, true),
StructField("countryCode3", StringType, true),
StructField("stateProvince", StringType, true),
StructField("latitude", DoubleType, true),
StructField("longitude", DoubleType, true)
)), true),
StructField("isAnonymous", BooleanType, true),
StructField("isNewPage", BooleanType, true),
StructField("isRobot", BooleanType, true),
StructField("isUnpatrolled", BooleanType, true),
StructField("namespace", StringType, true), // (STRING): Page's namespace. See https://en.wikipedia.org/wiki/Wikipedia:Namespace
StructField("page", StringType, true), // (STRING): Printable name of the page that was edited
StructField("pageURL", StringType, true), // (STRING): URL of the page that was edited
StructField("timestamp", TimestampType, true), // (STRING): Time the edit occurred, in ISO-8601 format
StructField("url", StringType, true),
StructField("user", StringType, true), // (STRING): User who made the edit or the IP address associated with the anonymous editor
StructField("userURL", StringType, true),
StructField("wikipediaURL", StringType, true),
StructField("wikipedia", StringType, true) // (STRING): Short name of the Wikipedia that was edited (e.g., "en" for the English)
))
import org.apache.spark.sql.functions.from_json
val jsonEdits = editsDF.select(
from_json($"value", schema).as("json"))
...
...
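Whichever source you use, note that a streaming DataFrame only starts executing once you attach a sink and start the query; a minimal sketch with a console sink (purely illustrative, continuing from jsonEdits above):
val query = jsonEdits.writeStream
  .format("console")       // print each micro-batch to the console for inspection
  .outputMode("append")
  .start()

query.awaitTermination()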
I am new to Spark/Scala.
I have created the RDD below by loading data from multiple paths. Now I want to create a DataFrame from it for further operations.
Below is the schema the DataFrame should have:
schema[UserId, EntityId, WebSessionId, ProductId]
rdd.foreach(println)
545456,5615615,DIKFH6545614561456,PR5454564656445454
875643,5485254,JHDSFJD543514KJKJ4
545456,5615615,DIKFH6545614561456,PR5454564656445454
545456,5615615,DIKFH6545614561456,PR5454564656445454
545456,5615615,DIKFH6545614561456,PR54545DSKJD541054
264264,3254564,MNXZCBMNABC5645SAD,PR5142545564542515
732543,8765984,UJHSG4240323545144
564574,6276832,KJDXSGFJFS2545DSAS
Will anyone please help me?
I have tried this by defining a schema class and mapping it against the RDD, but I get the error
"ArrayIndexOutOfBoundsException: 3"
If you treat your columns as String, you can create the DataFrame with the following:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val rdd: RDD[Row] = ???
val df = spark.createDataFrame(rdd, StructType(Seq(
  StructField("userId", StringType, false),
  StructField("EntityId", StringType, false),
  StructField("WebSessionId", StringType, false),
  StructField("ProductId", StringType, true))))
Note that you must map your RDD to an RDD[Row] for the compiler to let you use the createDataFrame method. For the missing fields you can declare the columns as nullable in the DataFrame schema.
In your example you are using spark.sparkContext.textFile(), which returns an RDD[String], meaning that each element of your RDD is a line. But you need an RDD[Row], so you have to split each string by commas, like this:
val list =
List("545456,5615615,DIKFH6545614561456,PR5454564656445454",
"875643,5485254,JHDSFJD543514KJKJ4",
"545456,5615615,DIKFH6545614561456,PR5454564656445454",
"545456,5615615,DIKFH6545614561456,PR5454564656445454",
"545456,5615615,DIKFH6545614561456,PR54545DSKJD541054",
"264264,3254564,MNXZCBMNABC5645SAD,PR5142545564542515",
"732543,8765984,UJHSG4240323545144","564574,6276832,KJDXSGFJFS2545DSAS")
val FilterReadClicks = spark.sparkContext.parallelize(list)
val rows: RDD[Row] = FilterReadClicks.map(line => line.split(",")).map { arr =>
  // foldLeft with `b :: a` builds the list back-to-front, so the field order is reversed
  val row = Row.fromSeq(arr.foldLeft(List[Any]())((a, b) => b :: a))
  // pad rows that are missing the fourth field with an empty string
  if (row.length == 4) row
  else Row.fromSeq(row.toSeq :+ "")
}
rows.foreach(el => println(el.toSeq))
val df = spark.createDataFrame(rows, StructType(Seq(
StructField("userId", StringType, false),
StructField("EntityId", StringType, false),
StructField("WebSessionId", StringType, false),
StructField("ProductId", StringType, true))))
df.show()
+------------------+------------------+------------+---------+
| userId| EntityId|WebSessionId|ProductId|
+------------------+------------------+------------+---------+
|PR5454564656445454|DIKFH6545614561456| 5615615| 545456|
|JHDSFJD543514KJKJ4| 5485254| 875643| |
|PR5454564656445454|DIKFH6545614561456| 5615615| 545456|
|PR5454564656445454|DIKFH6545614561456| 5615615| 545456|
|PR54545DSKJD541054|DIKFH6545614561456| 5615615| 545456|
|PR5142545564542515|MNXZCBMNABC5645SAD| 3254564| 264264|
|UJHSG4240323545144| 8765984| 732543| |
|KJDXSGFJFS2545DSAS| 6276832| 564574| |
+------------------+------------------+------------+---------+
With the rows RDD you will be able to create the DataFrame.
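Note that the foldLeft above reverses the field order, which is why ProductId ends up in the first printed column. If you want to keep the original column order, a minimal alternative sketch is to pad the missing trailing field instead of folding:
val orderedRows: RDD[Row] = FilterReadClicks.map(_.split(",")).map { arr =>
  // keep the original order; pad any missing trailing fields with ""
  Row.fromSeq(arr.toSeq ++ Seq.fill(4 - arr.length)(""))
}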
I have multiple schemas like the one below, with different column names and data types.
I want to generate test/simulated data with a DataFrame in Scala for each schema and save it to a Parquet file.
Below is an example schema (from a sample JSON) for which data should be generated dynamically with dummy values:
val schema1 = StructType(
List(
StructField("a", DoubleType, true),
StructField("aa", StringType, true)
StructField("p", LongType, true),
StructField("pp", StringType, true)
)
)
I need an RDD/DataFrame like this, with 1000 rows each, based on the columns in the above schema:
val data = Seq(
Row(1d, "happy", 1L, "Iam"),
Row(2d, "sad", 2L, "Iam"),
Row(3d, "glad", 3L, "Iam")
)
Basically, there are 200 such datasets for which I need to generate data dynamically; writing a separate program for each schema is practically impossible for me.
Please help me with your ideas or an implementation, as I am new to Spark.
Is it possible to generate dynamic data based on schemas of different types?
Following @JacekLaskowski's advice, you could generate dynamic data using ScalaCheck generators (Gen), based on the fields/types you are expecting.
It could look like this:
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SaveMode}
import org.scalacheck._
import scala.collection.JavaConverters._
val dynamicValues: Map[(String, DataType), Gen[Any]] = Map(
("a", DoubleType) -> Gen.choose(0.0, 100.0),
("aa", StringType) -> Gen.oneOf("happy", "sad", "glad"),
("p", LongType) -> Gen.choose(0L, 10L),
("pp", StringType) -> Gen.oneOf("Iam", "You're")
)
val schemas = Map(
"schema1" -> StructType(
List(
StructField("a", DoubleType, true),
StructField("aa", StringType, true),
StructField("p", LongType, true),
StructField("pp", StringType, true)
)),
"schema2" -> StructType(
List(
StructField("a", DoubleType, true),
StructField("pp", StringType, true),
StructField("p", LongType, true)
)
)
)
val numRecords = 1000
schemas.foreach {
case (name, schema) =>
// create a data frame
spark.createDataFrame(
// of #numRecords records
(0 until numRecords).map { _ =>
// each of them a row
Row.fromSeq(schema.fields.map(field => {
// with fields based on the schema's fieldname & type else null
dynamicValues.get((field.name, field.dataType)).flatMap(_.sample).orNull
}))
}.asJava, schema)
// store to parquet
.write.mode(SaveMode.Overwrite).parquet(name)
}
ScalaCheck is a framework for generating data; here you generate raw data matching the schema using your custom generators.
See the ScalaCheck documentation for details.
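If you want to sanity-check a single generator in isolation (for example in spark-shell), a quick sketch:
import org.scalacheck.Gen

// sample returns an Option because generation can fail
val salaryGen: Gen[Int] = Gen.choose(1000, 5000)
println(salaryGen.sample) // e.g. Some(3217)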
You could do something like this:
import org.apache.spark.SparkConf
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.json4s
import org.json4s.JsonAST._
import org.json4s.jackson.JsonMethods._
import scala.util.Random
object Test extends App {
val structType: StructType = StructType(
List(
StructField("a", DoubleType, true),
StructField("aa", StringType, true),
StructField("p", LongType, true),
StructField("pp", StringType, true)
)
)
val spark = SparkSession
.builder()
.master("local[*]")
.config(new SparkConf())
.getOrCreate()
import spark.implicits._
val df = createRandomDF(structType, 1000)
def createRandomDF(structType: StructType, size: Int, rnd: Random = new Random()): DataFrame ={
spark.read.schema(structType).json((0 to size).map { _ => compact(randomJson(rnd, structType))}.toDS())
}
def randomJson(rnd: Random, dataType: DataType): JValue = {
dataType match {
case v: DoubleType =>
json4s.JDouble(rnd.nextDouble())
case v: StringType =>
JString(rnd.nextString(10))
case v: IntegerType =>
JInt(rnd.nextInt())
case v: LongType =>
JInt(rnd.nextLong())
case v: FloatType =>
JDouble(rnd.nextFloat())
case v: BooleanType =>
JBool(rnd.nextBoolean())
case v: ArrayType =>
val size = rnd.nextInt(10)
JArray(
(0 to size).map(_ => randomJson(rnd, v.elementType)).toList
)
case v: StructType =>
JObject(
v.fields.flatMap {
f =>
if (f.nullable && rnd.nextBoolean())
None
else
Some(JField(f.name, randomJson(rnd, f.dataType)))
}.toList
)
}
}
}
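Since the question also asks to save the generated data to a Parquet file, a short usage sketch would go inside the object above, right after df is created (the output path here is just an illustrative assumption):
// persist the generated DataFrame; "simulated/schema1" is an assumed path
df.write.mode("overwrite").parquet("simulated/schema1")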
I'm trying to run a Logistic Regression model over the KDD dataset using Scala and the Spark MLlib library. I have gone through multiple websites, tutorials and forums, but I still can't figure out why my code is not working. It must be something simple, but I just don't get it and I'm feeling blocked at this moment. Here is what (I think) I'm doing:
1. Create a Spark Context.
2. Create a SQL Context.
3. Load the paths for the training and test data files.
4. Define the schema for the data to work with, that is, the columns we are going to use (names and types) from the KDD dataset.
5. Read the file with the training data.
6. Read the file with the test data.
7. Filter the input data to ensure only numeric values for every column (I just drop the three StringType columns).
8. Since the Logistic Regression model needs a column called "features" with all the features packed within a single vector, I create such a column via the "VectorAssembler" function.
9. I keep only the columns named "label" and "features", which are essential for the Logistic Regression model.
10. I use the "StringIndexer" function to transform the values of the "label" column into Doubles; otherwise Logistic Regression complains that it can't work with StringType.
11. I set the hyperparameters for the Logistic Regression model, indicating the label and features columns.
12. I attempt to train the model (via the "fit" method).
Below you can find the code:
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._
import org.apache.spark.{SparkConf, SparkContext}
object LogisticRegressionV2 {
val settings = new Settings() // Here I define the proper values for the training and test files paths
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("LogisticRegressionV2").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val trainingPath = settings.rootFolder + settings.dataFolder + settings.trainingDataFileName
val testPath = settings.rootFolder + settings.dataFolder + settings.testFileName
val kddSchema = StructType(Array(
StructField("duration", IntegerType, true),
StructField("protocol_type", StringType, true),
StructField("service", StringType, true),
StructField("flag", StringType, true),
StructField("src_bytes", IntegerType, true),
StructField("dst_bytes", IntegerType, true),
StructField("land", IntegerType, true),
StructField("wrong_fragment", IntegerType, true),
StructField("urgent", IntegerType, true),
StructField("hot", IntegerType, true),
StructField("num_failed_logins", IntegerType, true),
StructField("logged_in", IntegerType, true),
StructField("num_compromised", IntegerType, true),
StructField("root_shell", IntegerType, true),
StructField("su_attempted", IntegerType, true),
StructField("num_root", IntegerType, true),
StructField("num_file_creations", IntegerType, true),
StructField("num_shells", IntegerType, true),
StructField("num_access_files", IntegerType, true),
StructField("num_outbound_cmds", IntegerType, true),
StructField("is_host_login", IntegerType, true),
StructField("is_guest_login", IntegerType, true),
StructField("count", IntegerType, true),
StructField("srv_count", IntegerType, true),
StructField("serror_rate", DoubleType, true),
StructField("srv_serror_rate", DoubleType, true),
StructField("rerror_rate", DoubleType, true),
StructField("srv_rerror_rate", DoubleType, true),
StructField("same_srv_rate", DoubleType, true),
StructField("diff_srv_rate", DoubleType, true),
StructField("srv_diff_host_rate", DoubleType, true),
StructField("dst_host_count", IntegerType, true),
StructField("dst_host_srv_count", IntegerType, true),
StructField("dst_host_same_srv_rate", DoubleType, true),
StructField("dst_host_diff_srv_rate", DoubleType, true),
StructField("dst_host_same_src_port_rate", DoubleType, true),
StructField("dst_host_srv_diff_host_rate", DoubleType, true),
StructField("dst_host_serror_rate", DoubleType, true),
StructField("dst_host_srv_serror_rate", DoubleType, true),
StructField("dst_host_rerror_rate", DoubleType, true),
StructField("dst_host_srv_rerror_rate", DoubleType, true),
StructField("label", StringType, true)
))
val rawTraining = sqlContext.read
.format("csv")
.option("header", "true")
.schema(kddSchema)
.load(trainingPath)
val rawTest = sqlContext.read
.format("csv")
.option("header", "true")
.schema(kddSchema)
.load(testPath)
val trainingNumeric = rawTraining.drop("service").drop("protocol_type").drop("flag")
val trainingAssembler = new VectorAssembler()
//.setInputCols(trainingNumeric.columns.filter(_ != "label"))
.setInputCols(Array("duration", "src_bytes", "dst_bytes", "land", "wrong_fragment", "urgent", "hot",
"num_failed_logins", "logged_in", "num_compromised", "root_shell", "su_attempted", "num_root",
"num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds", "is_host_login",
"is_guest_login", "count", "srv_count", "serror_rate", "srv_serror_rate", "rerror_rate", "srv_rerror_rate",
"same_srv_rate", "diff_srv_rate", "srv_diff_host_rate", "dst_host_count", "dst_host_srv_count",
"dst_host_same_srv_rate", "dst_host_diff_srv_rate", "dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate",
"dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate", "dst_host_srv_rerror_rate"))
.setOutputCol("features")
val trainingAssembled = trainingAssembler.transform(trainingNumeric).select("label", "features")
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
.fit(trainingAssembled)
val trainingData = labelIndexer.transform(trainingAssembled).select("indexedLabel", "features")
trainingData.show(false)
val lr = new LogisticRegression()
.setMaxIter(2)
.setRegParam(0.3)
.setElasticNetParam(0.8)
.setLabelCol("indexedLabel")
.setFeaturesCol("features")
val predictions = lr.fit(trainingData)
sc.stop()
}
}
As you can see, it is simple code, but I get a "java.lang.ArrayIndexOutOfBoundsException: 1" when the execution reaches the line:
val predictions = lr.fit(trainingData)
And I just don't know why. If you have any clue about this issue, it would be very much appreciated. Many thanks in advance.