How to save a DataFrame to a pickle file using PySpark

I have to save a DataFrame to a pickle file, but it returns an error:
df.saveAsPickleFile(path)
AttributeError: 'Dataframe' object has no attribute 'saveAsPickleFile'

saveAsPickleFile is a method of RDD, not of DataFrame. See the documentation:
http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=pickle
So you can just call:
df.rdd.saveAsPickleFile(filename)
To load it back from the file, run:
pickleRdd = sc.pickleFile(filename).collect()
df2 = spark.createDataFrame(pickleRdd)

Related

Spark save and read Array[Byte] type

I have serialized an object to the Array[Byte] type and saved it to a parquet file as StructField("byteArrayObject", ArrayType(ByteType), nullable = true). When I try to read it using row.getAs[Array[Byte]]("byteArrayObject"), I get the error:
scala.collection.mutable.WrappedArray$ofRef cannot be cast to [B
Anyone know what the problem is?
Spark deserializes array values as WrappedArray. Try the following:
import scala.collection.mutable.WrappedArray
import java.{lang => jl}
row
  .getAs[WrappedArray[jl.Byte]]("byteArrayObject")
  .map(_.byteValue)
  .array
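For example, the conversion can be wrapped in a small helper (a sketch; bytesFrom is just an illustrative name, not a Spark API):
import scala.collection.mutable.WrappedArray
import java.{lang => jl}
import org.apache.spark.sql.Row

// Convert a column written as ArrayType(ByteType) back into an Array[Byte]
def bytesFrom(row: Row, column: String): Array[Byte] =
  row.getAs[WrappedArray[jl.Byte]](column)
    .map(_.byteValue)
    .toArray

// usage: val bytes = bytesFrom(row, "byteArrayObject")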

Not able to map records in csv into objects of a class in Scala / Spark

I have a Jupyter notebook running a spylon-kernel (Scala / Spark).
Currently, I am trying to load records from a CSV into an RDD and then map each record to objects of the Weather class as follows:
val lines = scala.io.Source.fromFile("/path/to/nycweather.csv").mkString
println(lines)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
//Next, you need to import a library for creating a SchemaRDD. Type this:
import sqlContext.implicits._
//Create a case class in Scala that defines the schema of the table. Type in:
case class Weather(date: String, temp: Int, precipitation: Double)
//Create the RDD of the Weather object:
val weather = sc.textFile("/path/to/nycweather.csv")
  .map(_.split(","))
  .map(w => Weather(w(0), w(1).trim.toInt, w(2).trim.toDouble))
  .toDF()
//It all works fine until the last line above.
//But when I run this line of code:
weather.first()
It fails with an error message (which has a couple more lines that I omitted for readability).
Could someone explain why I am getting this error and suggest code changes to solve it?
You are using the older RDD syntax for reading a CSV. There is an easier way to read a CSV:
val weather1 = spark.read.csv("path to nycweather.csv").toDF("date","temp","precipitation")
weather1.show()
The input file contains the following data:
1/1/2010,30,35.0
2/4/2015,35,27.9
Result
+--------+----+-------------+
| date|temp|precipitation|
+--------+----+-------------+
|1/1/2010| 30| 35.0|
|2/4/2015| 35| 27.9|
+--------+----+-------------+
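If typed Weather objects are still needed rather than a plain DataFrame, the same reader can feed the case class defined in the question. A minimal sketch, assuming the CSV has no header row; inferSchema is what turns temp and precipitation into Int and Double:
import spark.implicits._

val weatherDs = spark.read
  .option("inferSchema", "true") // let Spark infer Int/Double column types from the data
  .csv("/path/to/nycweather.csv")
  .toDF("date", "temp", "precipitation")
  .as[Weather] // Weather is the case class defined in the question

weatherDs.first() // Weather(1/1/2010,30,35.0)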

Spark failing to deserialize a record when creating Dataset

I'm reading a large number of CSVs from S3 (everything under a key prefix) and creating a strongly-typed Dataset.
val events: DataFrame = cdcFs.getStream()

events
  .withColumn("event", lit("I"))
  .withColumn("source", lit(sourceName))
  .as[TradeRecord]
where TradeRecord is a case class that can normally be deserialized via the SparkSession implicits. However, for a certain batch, a record fails to deserialize. Here's the error (stack trace omitted):
Caused by: java.lang.NullPointerException: Null value appeared in non-nullable field:
- field (class: "scala.Long", name: "deal")
- root class: "com.company.trades.TradeRecord"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
deal is a field of TradeRecord that should never be null in the source data (S3 objects), so it's not an Option.
Unfortunately the error message doesn't give me any clue as to what the CSV data looks like, or even which CSV file it's coming from. The batch consists of hundreds of files, so I need a way to narrow this down to at most a few files to investigate the issue.
As suggested by user10465355, you can load the data:
val events: DataFrame = ???
Filter the mismatched records:
val mismatched = events.where($"deal".isNull)
Add the file name:
import org.apache.spark.sql.functions.input_file_name
val tagged = mismatched.withColumn("_file_name", input_file_name)
Optionally, add the chunk and offset:
import org.apache.spark.sql.functions.{spark_partition_id, monotonically_increasing_id, shiftLeft, shiftRight}

tagged
  .withColumn("chunk", spark_partition_id())
  .withColumn(
    "offset",
    monotonically_increasing_id() - shiftLeft(shiftRight(monotonically_increasing_id(), 33), 33))
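The offending files can then be listed, for example (a sketch):
tagged
  .select($"_file_name")
  .distinct()
  .show(false) // prints the full paths of the files that contain null deal values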
Here's the solution I came up with (I'm using Spark Structured Streaming):
import org.apache.spark.sql.functions.{input_file_name, lit}

val stream = spark.readStream
  .format("csv")
  .schema(schema) // a StructType defined elsewhere
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "corruptRecord")
  .load(path)
// If debugging, check for any corrupted CSVs
if (log.isDebugEnabled) { // org.apache.spark.internal.Logging trait
  import spark.implicits._
  stream
    .filter($"corruptRecord".isNotNull)
    .withColumn("input_file", input_file_name)
    .select($"input_file", $"corruptRecord")
    .writeStream
    .format("console")
    .option("truncate", false)
    .start()
}
val events = stream
  .withColumn("event", lit("I"))
  .withColumn("source", lit(sourceName))
  .as[TradeRecord]
Basically, if the Spark log level is set to DEBUG or lower, the DataFrame is checked for corrupted records and any such records are printed out together with their file names. Eventually the program tries to cast the DataFrame to a strongly typed Dataset[TradeRecord] and fails.
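The same PERMISSIVE / corrupt-record options also work for a plain batch read, which may be simpler when streaming is not required. A minimal sketch under the same schema and path assumptions as above:
import org.apache.spark.sql.functions.input_file_name
import spark.implicits._

val batch = spark.read
  .format("csv")
  .schema(schema) // the same StructType as above
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "corruptRecord")
  .load(path)

batch
  .filter($"corruptRecord".isNotNull)
  .withColumn("input_file", input_file_name)
  .select($"input_file", $"corruptRecord")
  .show(false)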

Parse JSON data with Apache Spark and Scala

I have a file in which each line is a JSON object except for the first few words (see attached image). I want to parse this type of file using Spark and Scala. I have tried sqlContext.read.json("path to json file"), but it gives me an error (corrupt data) because the whole line is not a JSON object. How do I parse this file into a SQL DataFrame?
Try this:
val rawRdd = sc.textFile("path-to-the-file")
val jsonRdd = rawRdd.map(_.substring(32)) //32 - number of first characters to ignore
val df = spark.read.json(jsonRdd)
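On Spark 2.2+, spark.read.json on an RDD[String] is deprecated in favour of a Dataset[String]; the same approach can be written as follows (a sketch):
import spark.implicits._

val jsonDs = rawRdd.map(_.substring(32)).toDS() // Dataset[String] instead of RDD[String]
val df2 = spark.read.json(jsonDs)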

spark scala dataframes - create an object with attributes in a json file

I have a JSON file of the format:
{"x1": 2, "y1": 6, "x2":3, "y2":7}
I have a Scala class
class test(x: Int, y: Int)
Using Spark, I am trying to read this file and create two test objects for each line of the JSON file. For example,
{"x1": 2, "y1": 6, "x2":3, "y2":7} should create
test1 = new test(2,6) and
test2 = new test(3,7)
Then, for each line of the JSON file, I want to call a function that takes two test objects as parameters, e.g. callFunction(test1, test2).
How do I do this with Spark? I see methods that will convert the rows of a JSON file into a list of objects, but no way to create multiple objects from the attributes of a single row.
val conf = new SparkConf()
  .setAppName("Example")
  .setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val coordinates = sqlContext.read.json("c:/testfile.json")
//NOT SURE HOW TO DO THE FOLLOWING
//test1 = new Test(attr1 of json file, attr2 of json file)
//test2 = new Test(attr3 of json file, attr4 of json file)
//callFunction(test1,test2)
//collect the result of callFunction
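One possible approach (a sketch, not from the original thread) is to collect the DataFrame and build two test objects per row. Note that spark.read.json infers whole numbers as Long, hence the .toInt conversions; Test and callFunction below are illustrative stand-ins for the user's own class and function:
case class Test(x: Int, y: Int) // stands in for the user's test class

def callFunction(a: Test, b: Test): String = // placeholder for the user's own function
  s"$a -> $b"

val results = coordinates
  .select("x1", "y1", "x2", "y2")
  .collect() // fine for small files; for large data, map over a Dataset instead
  .map { row =>
    callFunction(
      Test(row.getAs[Long]("x1").toInt, row.getAs[Long]("y1").toInt),
      Test(row.getAs[Long]("x2").toInt, row.getAs[Long]("y2").toInt))
  }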