How to log malformed rows from Scala Spark DataFrameReader csv - scala

The documentation for the Scala Spark DataFrameReader csv method suggests that Spark can log the malformed rows detected while reading a .csv file.
- How can one log the malformed rows?
- Can one obtain a val or var containing the malformed rows?
The option from the linked documentation is:
maxMalformedLogPerPartition (default 10): sets the maximum number of malformed rows Spark will log for each partition. Malformed records beyond this number will be ignored
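For context, this is a regular DataFrameReader option; below is a minimal Scala sketch of setting it together with DROPMALFORMED (the path and schema are placeholders, and the exact logging behaviour varies between Spark versions, so check your executor logs):
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("field1", StringType, true),
  StructField("field2", IntegerType, true)))

// with DROPMALFORMED the non-matching rows are dropped from the result and
// (up to the configured per-partition limit) reported in the logs
val df = spark.read.format("csv")
  .schema(schema)
  .option("mode", "DROPMALFORMED")
  .option("maxMalformedLogPerPartition", 10)
  .load("/path/to/file.csv")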

Based on this Databricks example, you need to explicitly add the "_corrupt_record" column to the schema definition when you read in the file. Something like this worked for me in pyspark 2.4.4:
from pyspark.sql.types import *

my_schema = StructType([
    StructField("field1", StringType(), True),
    ...
    StructField("_corrupt_record", StringType(), True)
])

my_data = spark.read.format("csv") \
    .option("path", "/path/to/file.csv") \
    .schema(my_schema) \
    .load()

my_data.count()  # force reading the csv
corrupt_lines = my_data.filter("_corrupt_record is not NULL")
corrupt_lines.take(5)
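Since the question is tagged Scala, roughly the same thing in Scala (a sketch; column names and the path are placeholders) also gives you a val holding the malformed rows:
import org.apache.spark.sql.types._

val mySchema = StructType(Seq(
  StructField("field1", StringType, true),
  // ... your remaining columns ...
  StructField("_corrupt_record", StringType, true)))

val myData = spark.read.format("csv")
  .schema(mySchema)
  .load("/path/to/file.csv")

myData.cache()   // some Spark versions refuse queries that touch only _corrupt_record on uncached data
myData.count()   // force reading the csv
val corruptLines = myData.filter("_corrupt_record is not null")
corruptLines.show(5, false)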

If you are using Spark 2.3, check the _corrupt_record special column ... according to several Spark discussions "it should work", so after the read, filter the rows where that column is non-empty - those should be your errors ... you could also check the input_file_name() SQL function.
If you are using a version lower than 2.3 you should implement a custom read/record solution, because according to my tests the _corrupt_record column does not work for the CSV data source ...
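Since the answer above only hints at a "custom read" for older versions, here is one possible shape of that idea (a sketch, not the answerer's code): read the file as plain text, validate each line yourself, and keep the failures. The expected column count and the naive comma split are assumptions; a real file with quoted commas needs a proper CSV parser.
val lines = spark.read.textFile("/path/to/file.csv")   // Dataset[String]

val expectedCols = 3   // assumed for illustration
def looksValid(line: String): Boolean = line.split(",", -1).length == expectedCols

val badLines  = lines.filter(line => !looksValid(line))
val goodLines = lines.filter(line => looksValid(line))

// log them, or write them somewhere, instead of just printing
badLines.collect().foreach(l => println(s"Malformed: $l"))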

I've expanded on klucar's answer here by loading the csv, making a schema from the non-corrupted records, adding the corrupted record column, using the new schema to load the csv and then looking for corrupted records.
from pyspark.sql.types import StructField, StringType
from pyspark.sql.functions import col
file_path = "/path/to/file"
mode = "PERMISSIVE"
schema = spark.read.options(mode=mode).csv(file_path).schema
schema = schema.add(StructField("_corrupt_record", StringType(), True))
df = spark.read.options(mode=mode).schema(schema).csv(file_path)
df.cache()
df.count()
df.filter(col("_corrupt_record").isNotNull()).show()
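The same flow translates fairly directly to Scala (a sketch with a placeholder path; add .option("inferSchema", "true") if you also want typed columns in the inferred schema):
import org.apache.spark.sql.types.{StructField, StringType}
import org.apache.spark.sql.functions.col

val filePath = "/path/to/file"
val mode = "PERMISSIVE"

// infer a schema from the file, then append the corrupt-record column
val inferred = spark.read.option("mode", mode).csv(filePath).schema
val schema   = inferred.add(StructField("_corrupt_record", StringType, true))

val df = spark.read.option("mode", mode).schema(schema).csv(filePath)
df.cache()
df.count()
df.filter(col("_corrupt_record").isNotNull).show()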

Related

pyspark json dataframe created with all null values

I have created a dataframe from a JSON file. However, the dataframe is created with the full schema but with all values as null. It is a valid JSON file.
df = spark.read.json(path)
When I display the data using df.display(), all I can see is null in the dataframe. Can anyone tell me what could be the issue?
Reading the JSON file without enabling the multiline option might be the cause of this.
Please go through the sample demonstration.
My sample json.
[{"id":1,"first_name":"Amara","last_name":"Taplin"},
{"id":2,"first_name":"Gothart","last_name":"McGrill"},
{"id":3,"first_name":"Georgia","last_name":"De Miranda"},
{"id":4,"first_name":"Dukie","last_name":"Arnaud"},
{"id":5,"first_name":"Mellicent","last_name":"Scathard"}]
I got null values when multiline was not used.
With multiline enabled I got the proper result.
df= spark.read.option('multiline', True).json('/FileStore/tables/Sample1_json.json')
df.display()
If you also want to provide the schema externally, you can do it like this.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([StructField('first_name', StringType(), True),
                     StructField('id', IntegerType(), True),
                     StructField('last_name', StringType(), True)])
df= spark.read.schema(schema).option('multiline', True).json('/FileStore/tables/Sample1_json.json')
df.show()

How to append an index column to a spark data frame using spark streaming in scala?

I am using something like this:
df.withColumn("idx", monotonically_increasing_id())
But I get an exception as it is NOT SUPPORTED:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Expression(s): monotonically_increasing_id() is not supported with streaming DataFrames/Datasets;;
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:143)
at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:250)
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:316)
Any ideas on how to add an index or row number column to a Spark streaming dataframe in Scala?
Full stacktrace: https://justpaste.it/5bdqr
There are a few operations that cannot exist anywhere in a streaming plan of Spark Streaming, and unfortunately monotonically_increasing_id() is one of them. To double-check this fact, transformed1 below fails with exactly the error from your question; here is a reference on this check in the Spark source code:
import org.apache.spark.sql.functions._
val df = Seq(("one", 1), ("two", 2)).toDF("foo", "bar")
val schema = df.schema
df.write.parquet("/tmp/out")
val input = spark.readStream.format("parquet").schema(schema).load("/tmp/out")
val transformed1 = input.withColumn("id", monotonically_increasing_id())
transformed1.writeStream.format("parquet")
  .option("format", "append")
  .option("path", "/tmp/out2")
  .option("checkpointLocation", "/tmp/checkpoint_path")
  .outputMode("append")
  .start()
import org.apache.spark.sql.expressions.Window
val windowSpecRowNum = Window.partitionBy("foo").orderBy("foo")
val transformed2 = input.withColumn("row_num", row_number.over(windowSpecRowNum))
transformed2.writeStream.format("parquet")
  .option("format", "append")
  .option("path", "/tmp/out2")
  .option("checkpointLocation", "/tmp/checkpoint_path")
  .outputMode("append")
  .start()
Also, I tried to add indexing with a Window over a column in the DF - transformed2 in the snippet above - and it also failed, but with a different error:
"Non-time-based windows are not supported on streaming
DataFrames/Datasets"
You can find all the unsupported operator checks for Spark Streaming here - it seems the traditional ways of adding an index column in batch Spark do not work in Spark Streaming.
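One workaround that does not appear in the answer above (so treat it as an assumption to verify) is to assign the ids inside foreachBatch, available from Spark 2.4: there each micro-batch is a plain batch DataFrame, so monotonically_increasing_id() is allowed again, although the ids are only unique within a batch unless you offset them yourself, e.g. using batchId.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.monotonically_increasing_id

input.writeStream
  .option("checkpointLocation", "/tmp/checkpoint_path")
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // inside foreachBatch this is a static DataFrame, so the streaming check no longer rejects it;
    // batchId could be folded into idx if you need uniqueness across batches
    batchDF.withColumn("idx", monotonically_increasing_id())
      .write.mode("append").parquet("/tmp/out2")
  }
  .start()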

Inferschema detecting column as string instead of double from parquet in pyspark

Problem -
I am reading a parquet file in pyspark using Azure Databricks. There are columns which have a lot of nulls and contain decimal values; these columns are read as string instead of double.
Is there any way of inferring the proper data type in pyspark?
Code -
To read parquet file -
df_raw_data = sqlContext.read.parquet(data_filename[5:])
The output of this is a dataframe with more than 100 columns, most of which should be of type double, but printSchema() shows them as string.
P.S -
I have a parquet file which can have dynamic columns, hence defining a struct for the dataframe does not work for me. I used to convert the Spark dataframe to pandas and use convert_objects, but that does not work as the parquet file is huge.
You can define the schema using StructType and then provide this schema in the schema option while loading the data.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

fileSchema = StructType([
    StructField('atm_id', StringType(), True),
    StructField('atm_street_number', IntegerType(), True),
    StructField('atm_zipcode', IntegerType(), True),
    StructField('atm_lat', DoubleType(), True),
])

df_raw_data = spark.read \
    .format("parquet") \
    .schema(fileSchema) \
    .load(data_filename[5:])

Handling schema mismatches in Spark

I am reading a csv file using Spark in Scala.
The schema is predefined and I am using it for reading.
This is the example code:
// create the schema
val schema= StructType(Array(
StructField("col1", IntegerType,false),
StructField("col2", StringType,false),
StructField("col3", StringType,true)))
// Initialize Spark session
val spark: SparkSession = SparkSession.builder
.appName("Parquet Converter")
.getOrCreate
// Create a data frame from a csv file
val dataFrame: DataFrame =
spark.read.format("csv").schema(schema).option("header", false).load(inputCsvPath)
From what I read, when reading CSV with Spark using a schema there are 3 options:
Set mode to DROPMALFORMED --> this will drop the lines that don't match the schema
Set mode to PERMISSIVE --> this will set the whole line to null values
Set mode to FAILFAST --> this will throw an exception when a mismatch is discovered
What is the best way to combine the options? The behaviour I want is to get the schema mismatches, print them as errors, and ignore those lines in my data frame.
Basically, I want a combination of FAILFAST and DROPMALFORMED.
Thanks in advance
This is what I eventually did:
I added to the schema the "_corrupt_record" column, for example:
val schema= StructType(Array(
StructField("col1", IntegerType,true),
StructField("col2", StringType,false),
StructField("col3", StringType,true),
StructField("_corrupt_record", StringType, true)))
Then I read the CSV using PERMISSIVE mode (it is the Spark default):
val dataFrame: DataFrame = spark.read.format("csv")
.schema(schema)
.option("header", false)
.option("mode", "PERMISSIVE")
.load(inputCsvPath)
Now my data frame holds an additional column that holds the rows with schema mismatches.
I filtered the rows that have mismatched data and printed it:
val badRows = dataFrame.filter("_corrupt_record is not null")
badRows.cache()
badRows.show()
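To also ignore those lines in the data frame afterwards (the DROPMALFORMED half of what was asked), the clean rows can be kept and the helper column dropped:
val goodRows = dataFrame
  .filter("_corrupt_record is null")
  .drop("_corrupt_record")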
Just use DROPMALFORMED and follow the log. If malformed records are present, they are dumped to the log, up to the limit set by the maxMalformedLogPerPartition option.
spark.read.format("csv")
.schema(schema)
.option("header", false)
.option("mode", "DROPMALFORMED")
.option("maxMalformedLogPerPartition", 128)
.load(inputCsvPath)

Spark Scala - java.util.NoSuchElementException & Data Cleaning

I have had a similar problem before, but I am looking for a generalizable answer. I am using spark-corenlp to get sentiment scores on e-mails. Sometimes, sentiment() crashes on some input (maybe it's too long, maybe it has an unexpected character). It does not tell me it crashes on some instances, and just returns the Column sentiment('email). Thus, when I try to show() beyond a certain point or save() my data frame, I get a java.util.NoSuchElementException because sentiment() must have returned nothing at that row.
My initial code loads the data and applies sentiment() as shown in the spark-corenlp API.
val customSchema = StructType(Array(
StructField("contactId", StringType, true),
StructField("email", StringType, true))
)
// Load dataframe
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("delimiter","\t") // Delimiter is tab
.option("parserLib", "UNIVOCITY") // Parser, which deals better with the email formatting
.schema(customSchema) // Schema of the table
.load("emails") // Input file
val sent = df.select('contactId, sentiment('email).as('sentiment)) // Add sentiment analysis output to dataframe
I tried to filter for null and NaN values:
val sentFiltered = sent.filter('sentiment.isNotNull)
.filter(!'sentiment.isNaN)
.filter(col("sentiment").between(0,4))
I even tried to do it via SQL query:
sent.registerTempTable("sent")
val test = sqlContext.sql("SELECT * FROM sent WHERE sentiment IS NOT NULL")
I don't know what input is making spark-corenlp crash. How can I find out? Otherwise, how can I filter these non-existing values from col("sentiment")? Or should I try catching the exception and ignoring the row? Is that even possible?
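On the last point: catching the exception per row is possible if you can call the underlying scoring function yourself inside a UDF and turn failures into nulls. The sketch below is an assumption, not the spark-corenlp API: scoreSentiment stands in for whatever actually computes the score for one e-mail.
import scala.util.Try
import org.apache.spark.sql.functions.udf
import sqlContext.implicits._   // for the 'symbol column syntax

// placeholder for the real per-e-mail scoring call
def scoreSentiment(text: String): Int = ???

// failures become null in the DataFrame instead of crashing the job at show()/save()
val safeSentiment = udf { text: String => Try(scoreSentiment(text)).toOption }

val sent = df.select('contactId, safeSentiment('email).as("sentiment"))
val sentFiltered = sent.filter('sentiment.isNotNull)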