Dynamically setting schema for spark.createDataFrame - pyspark

So I am trying to dynamically set the type of each column in the schema.
I have seen the code schema = StructType([StructField(header[i], StringType(), True) for i in range(len(header))]) on Stack Overflow.
But how can I change this into a conditional statement?
For example: if the header is in list1 then IntegerType, if it is in list2 then DoubleType, else StringType?

A colleague answered this for me:
schema = StructType([
    StructField(header[i], DateType(), True)
    if header[i] in dateFields
    else StructField(header[i], StringType(), True)
    for i in range(len(header))])

Related

How to define a schema for json to be used in from_json to parse out values

I am trying to come up with a schema definition to parse out information from a dataframe string column. I am using from_json for that. I need help in defining the schema, which I am somehow not getting right.
Here is the JSON I have:
[
  {
    "sectionid": "838096e332d4419191877a3fd40ed1f4",
    "sequence": 0,
    "questions": [
      {
        "xid": "urn:com.mheducation.openlearning:lms.assessment.author:qastg.global:assessment_item:2a0f52fb93954f4590ac88d90888be7b",
        "questionid": "d36e1d7eeeae459c8db75c7d2dfd6ac6",
        "quizquestionid": "d36e1d7eeeae459c8db75c7d2dfd6ac6",
        "qtype": "3",
        "sequence": 0,
        "subsectionsequence": -1,
        "type": "80",
        "question": "<p>This is a simple, 1 question assessment for automation testing</p>",
        "totalpoints": "5.0",
        "scoring": "1",
        "scoringrules": "{\"type\":\"perfect\",\"points\":5.0,\"pointsEach\":null,\"rules\":[]}",
        "inputoption": "0",
        "casesensitive": "0",
        "suggestedscoring": "1",
        "suggestedscoringrules": "{\"type\":\"perfect\",\"points\":5.0,\"pointsEach\":null,\"rules\":[]}",
        "answers": [
          "1"
        ],
        "options": []
      }
    ]
  }
]
I want to parse this information out, which will result in the columns:
sectionid, sequence, xid, question.sequence, question.question (question text), answers
Here is what I have. I have defined a schema for testing like this:
import org.apache.spark.sql.types.{StringType, ArrayType, StructType, StructField}

val schema = new StructType()
  .add("sectionid", StringType, true)
  .add("sequence", StringType, true)
  .add("questions", StringType, true)
  .add("answers", StringType, true)

finalDF = finalDF
  .withColumn("parsed", from_json(col("enriched_payload.transformed"), schema))
But I am getting NULL in the result columns; the reason, I think, is that my schema is not right.
I am struggling to come up with the right definition. How do I come up with the correct JSON schema definition?
I am using Spark 3.0.
Try the code below.
import org.apache.spark.sql.types._

val schema = ArrayType(
  new StructType()
    .add("sectionid", StringType, true)
    .add("sequence", LongType, true)
    .add("questions", ArrayType(
      new StructType()
        .add("answers", ArrayType(StringType, true), true)
        .add("casesensitive", StringType, true)
        .add("inputoption", StringType, true)
        .add("options", ArrayType(StringType, true), true)
        .add("qtype", StringType, true)
        .add("question", StringType, true)
        .add("questionid", StringType, true)
        .add("quizquestionid", StringType, true)
        .add("scoring", StringType, true)
        .add("scoringrules", StringType, true)
        .add("sequence", LongType, true)
        .add("subsectionsequence", LongType, true)
        .add("suggestedscoring", StringType, true)
        .add("suggestedscoringrules", StringType, true)
        .add("totalpoints", StringType, true)
        .add("type", StringType, true)
        .add("xid", StringType, true)
    ))
)
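With that schema, here is a minimal sketch of how the requested columns could be pulled out of the parsed array. The explode steps and alias names are illustrative and not part of the original answer; enriched_payload.transformed is the column name taken from the question.

import org.apache.spark.sql.functions.{col, explode, from_json}

val parsedDF = finalDF
  .withColumn("parsed", from_json(col("enriched_payload.transformed"), schema))
  .withColumn("section", explode(col("parsed")))      // one row per section
  .withColumn("q", explode(col("section.questions"))) // one row per question
  .select(
    col("section.sectionid").as("sectionid"),
    col("section.sequence").as("sequence"),
    col("q.xid").as("xid"),
    col("q.sequence").as("question_sequence"),
    col("q.question").as("question"),
    col("q.answers").as("answers")
  )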

Parquet file is not partitioned Spark

I am trying to save a partitioned Spark dataframe as Parquet to a temporary directory for unit tests; however, for some reason the partitions are not created. The data itself is saved into the directory and can be used for tests.
Here is the method I have created for that:
def saveParquet(df: DataFrame, partitions: String*): String = {
  val path = createTempDir()
  df.repartition(1).parquet(path)(partitions: _*)
  path
}

val feedPath: String = saveParquet(feedDF.select(feed.schema), "processing_time")
This method works for various dataframes with various schemas, but for some reason it does not generate partitions for this one. I have logged the resulting path and it looks like this:
/var/folders/xg/fur_diuhg83b2ba15ih2rt822000dhst/T/testutils-samples8512758291/jf81n7bsj-95hs-573n-b73h-7531ug04515
But it should look like this:
/var/folders/xg/fur_diuhg83b2ba15ih2rt822000dhst/T/testutils-samples8512758291/jf81n7bsj-95hs-573n-b73h-7531ug04515/processing_time=1591714800000/part-some-random-numbersnappy.parquet
I have checked that the data and all the columns are read just fine before partitioning; as soon as the partition call is made, this problem occurs. Also, I ran a regex on the directories, which failed with a match error on the test samples: s".*processing_time=([0-9]+)/.*parquet".r
So what could be the reason for this problem? How else can I partition the dataframe?
The dataframe schema looks like this:
val schema: StructType = StructType(
  Seq(
    StructField("field1", StringType),
    StructField("field2", LongType),
    StructField("field3", StringType),
    StructField("field4Id", IntegerType, nullable = true),
    StructField("field4", FloatType, nullable = true),
    StructField("field5Id", IntegerType, nullable = true),
    StructField("field5", FloatType, nullable = true),
    StructField("field6Id", IntegerType, nullable = true),
    StructField("field6", FloatType, nullable = true),
    // partition keys
    StructField("processing_time", LongType)
  )
)
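For reference, here is a minimal sketch of how partitioned Parquet output is usually written through the DataFrameWriter API, assuming the createTempDir() helper from the question. This is not the original code, only an illustration of where the partitionBy call goes.

def saveParquet(df: DataFrame, partitions: String*): String = {
  val path = createTempDir() // helper from the question
  df.repartition(1)
    .write
    .partitionBy(partitions: _*) // creates processing_time=... subdirectories
    .parquet(path)
  path
}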

how to save data frame in Mongodb in spark using custom value for _id column

val transctionSchema = StructType(Array(
  StructField("School_id", StringType, true),
  StructField("School_Year", StringType, true),
  StructField("Run_Type", StringType, true),
  StructField("Bus_No", StringType, true),
  StructField("Route_Number", StringType, true),
  StructField("Reason", StringType, true),
  StructField("Occurred_On", DateType, true),
  StructField("Number_Of_Students_On_The_Bus", IntegerType, true)))

val dfTags = sparkSession.read.option("header", true).schema(transctionSchema)
  .option("dateFormat", "yyyyMMddhhmm")
  .csv("src/main/resources/9_bus-breakdown-and-delays_case_study.csv")
  .toDF("School_id", "School_Year", "Run_Type", "Bus_No", "Route_Number", "Reason", "Occurred_On", "Number_Of_Students_On_The_Bus")

import sparkSession.implicits._

val writeConfig = WriteConfig(Map("collection" -> "bus_Details", "writeConcern.w" -> "majority"), Some(WriteConfig(sparkSession)))

dfTags.show(5)
I have a data frame with the columns School_id/School_Year/Run_Type/Bus_No/Route_Number/Reason/Occurred_On. I want to save this data in the MongoDB collection bus_Details such that _id in the Mongo collection holds the data from the School_id column of the data frame.
I saw some posts where it was suggested to define the collection like this, but it is not working:
properties: {
  School_id: {
    bsonType: "string",
    id: "true",
    description: "must be a string and is required"
  }
}
Please help.
You can create a duplicate column named _id in your dataframe like:
val dfToSave = dfTags.withColumn("_id", $"School_id")
and then save this to MongoDB.
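A sketch of the save step with the mongo-spark-connector, reusing the writeConfig built above (the exact MongoSpark.save signature may differ between connector versions, so treat this as an assumption rather than the definitive call):

import com.mongodb.spark.MongoSpark

// Copy School_id into _id so MongoDB uses it as the document key,
// then write with the WriteConfig defined earlier.
val dfToSave = dfTags.withColumn("_id", $"School_id")
MongoSpark.save(dfToSave, writeConfig)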

Spark error when using except on a dataframe with MapType

I am seeing the error "Cannot have map type columns in DataFrame which calls set operations" when using Spark MapType.
Below is the sample code I wrote to reproduce it. I understand this is happening because MapType objects are not hashable, but I have a use case where I need to do the following.
val schema1 = StructType(Seq(
  StructField("a", MapType(StringType, StringType, true)),
  StructField("b", StringType, true)
))

val df = spark.read.schema(schema1).json("path")
val filteredDF = df.filter($"b" === "apple")
val otherDF = df.except(filteredDF)
Any suggestions for workarounds?
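One possible workaround (a sketch, not a definitive fix): serialize the map column to a JSON string before the set operation and parse it back afterwards. This assumes a Spark version where to_json and from_json support map columns.

import org.apache.spark.sql.functions.{col, from_json, to_json}
import org.apache.spark.sql.types.{MapType, StringType}

// Work on a stringified copy of the map column so except() can compare rows,
// then restore the original MapType afterwards.
val stringified = df.withColumn("a", to_json(col("a")))
val filteredStr = stringified.filter(col("b") === "apple")

val otherDF = stringified
  .except(filteredStr)
  .withColumn("a", from_json(col("a"), MapType(StringType, StringType, true)))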

Array Index Out Of Bounds when running Logistic Regression from Spark MLlib

I'm trying to run a Logistic Regression model over the KDD dataset using Scala and the Spark MLlib library. I have gone through multiple sites, tutorials and forums, but I still can't figure out why my code is not working. It must be something simple, but I just don't get it, and I'm feeling blocked at this moment. Here is what (I think) I'm doing:
1. Create a Spark Context.
2. Create a SQL Context.
3. Load paths for the training and test data files.
4. Define the schema for the data to work with, that is, the columns we are going to use (names and types) with the KDD dataset.
5. Read the file with the training data.
6. Read the file with the test data.
7. Filter the input data to ensure only numeric values for every column (I just drop the three StringType columns).
8. Since the Logistic Regression model needs a column called "features" with all features packed within a single vector, I create such a column via the "VectorAssembler" function.
9. I just keep the columns named "label" and "features", which are essential for the Logistic Regression model.
10. I use the "StringIndexer" function in order to transform the values from the "label" column into Doubles; otherwise Logistic Regression complains, saying it can't work with StringType.
11. I set the hyperparameters for the Logistic Regression model, indicating the Label and Features columns.
12. I attempt to train the model (via the "fit" method).
Below you can find the code:
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._
import org.apache.spark.{SparkConf, SparkContext}
object LogisticRegressionV2 {

  val settings = new Settings() // Here I define the proper values for the training and test files paths

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LogisticRegressionV2").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val trainingPath = settings.rootFolder + settings.dataFolder + settings.trainingDataFileName
    val testPath = settings.rootFolder + settings.dataFolder + settings.testFileName

    val kddSchema = StructType(Array(
      StructField("duration", IntegerType, true),
      StructField("protocol_type", StringType, true),
      StructField("service", StringType, true),
      StructField("flag", StringType, true),
      StructField("src_bytes", IntegerType, true),
      StructField("dst_bytes", IntegerType, true),
      StructField("land", IntegerType, true),
      StructField("wrong_fragment", IntegerType, true),
      StructField("urgent", IntegerType, true),
      StructField("hot", IntegerType, true),
      StructField("num_failed_logins", IntegerType, true),
      StructField("logged_in", IntegerType, true),
      StructField("num_compromised", IntegerType, true),
      StructField("root_shell", IntegerType, true),
      StructField("su_attempted", IntegerType, true),
      StructField("num_root", IntegerType, true),
      StructField("num_file_creations", IntegerType, true),
      StructField("num_shells", IntegerType, true),
      StructField("num_access_files", IntegerType, true),
      StructField("num_outbound_cmds", IntegerType, true),
      StructField("is_host_login", IntegerType, true),
      StructField("is_guest_login", IntegerType, true),
      StructField("count", IntegerType, true),
      StructField("srv_count", IntegerType, true),
      StructField("serror_rate", DoubleType, true),
      StructField("srv_serror_rate", DoubleType, true),
      StructField("rerror_rate", DoubleType, true),
      StructField("srv_rerror_rate", DoubleType, true),
      StructField("same_srv_rate", DoubleType, true),
      StructField("diff_srv_rate", DoubleType, true),
      StructField("srv_diff_host_rate", DoubleType, true),
      StructField("dst_host_count", IntegerType, true),
      StructField("dst_host_srv_count", IntegerType, true),
      StructField("dst_host_same_srv_rate", DoubleType, true),
      StructField("dst_host_diff_srv_rate", DoubleType, true),
      StructField("dst_host_same_src_port_rate", DoubleType, true),
      StructField("dst_host_srv_diff_host_rate", DoubleType, true),
      StructField("dst_host_serror_rate", DoubleType, true),
      StructField("dst_host_srv_serror_rate", DoubleType, true),
      StructField("dst_host_rerror_rate", DoubleType, true),
      StructField("dst_host_srv_rerror_rate", DoubleType, true),
      StructField("label", StringType, true)
    ))

    val rawTraining = sqlContext.read
      .format("csv")
      .option("header", "true")
      .schema(kddSchema)
      .load(trainingPath)

    val rawTest = sqlContext.read
      .format("csv")
      .option("header", "true")
      .schema(kddSchema)
      .load(testPath)

    val trainingNumeric = rawTraining.drop("service").drop("protocol_type").drop("flag")

    val trainingAssembler = new VectorAssembler()
      //.setInputCols(trainingNumeric.columns.filter(_ != "label"))
      .setInputCols(Array("duration", "src_bytes", "dst_bytes", "land", "wrong_fragment", "urgent", "hot",
        "num_failed_logins", "logged_in", "num_compromised", "root_shell", "su_attempted", "num_root",
        "num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds", "is_host_login",
        "is_guest_login", "count", "srv_count", "serror_rate", "srv_serror_rate", "rerror_rate", "srv_rerror_rate",
        "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate", "dst_host_count", "dst_host_srv_count",
        "dst_host_same_srv_rate", "dst_host_diff_srv_rate", "dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate",
        "dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate", "dst_host_srv_rerror_rate"))
      .setOutputCol("features")

    val trainingAssembled = trainingAssembler.transform(trainingNumeric).select("label", "features")

    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")
      .fit(trainingAssembled)

    val trainingData = labelIndexer.transform(trainingAssembled).select("indexedLabel", "features")
    trainingData.show(false)

    val lr = new LogisticRegression()
      .setMaxIter(2)
      .setRegParam(0.3)
      .setElasticNetParam(0.8)
      .setLabelCol("indexedLabel")
      .setFeaturesCol("features")

    val predictions = lr.fit(trainingData)

    sc.stop()
  }
}
As you can see, it is simple code, but I get a "java.lang.ArrayIndexOutOfBoundsException: 1" when the execution reaches the line:
val predictions = lr.fit(trainingData)
And I just don't know why. If you have any clue about this issue, it would be very much appreciated. Many thanks in advance.
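Not an answer, but before the fit it can help to rule out malformed input, since rows that do not match the 42-column schema, or an unexpected label set, can surface much later as index errors. Here is a diagnostic sketch reusing kddSchema, trainingPath and rawTraining from the code above; DROPMALFORMED behaviour varies between Spark versions, so treat this as an assumption rather than a confirmed fix.

// Reread the training file, silently dropping rows that do not match the schema,
// and compare the row counts.
val strict = sqlContext.read
  .format("csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .schema(kddSchema)
  .load(trainingPath)

println(s"all rows: ${rawTraining.count()}, well-formed rows: ${strict.count()}")

// Also check how many distinct labels StringIndexer will produce.
rawTraining.select("label").distinct().show(false)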