PySpark XML processing - Ignoring bad records - pyspark

I am processing a large XML file using the Spark XML library (HyukjinKwon:spark-xml:0.1.1-s_2.11). The processing fails with an AnalysisException for a couple of records, and I would like to keep processing the file while ignoring those records.
I have the code below for processing the XML. I tried the 'DROPMALFORMED' option, but it didn't help.
df = (spark.read.format("xml")
.option("rootTag","Articles")
.option("rowTag", "Article")
.option("inferSchema", "true")
.option("mode", "DROPMALFORMED")
.load("/mnt/RawAdl2/problemfile.xml"))
AnalysisException: "cannot resolve '['Affiliation']' due to data type mismatch: argument 2 requires integral type, however, ''Affiliation'' is of string type.;
I would like to drop the malformed records and continue with the processing of the file. Is there any other option I could try? Appreciate the inputs!
EDIT: Looking at the source code (link), the malformed-record mode does appear to be supported by the library. As I am not well versed in Scala, I am not sure whether I am using the correct syntax for this option. Please advise.
After going through the source code, I also tried the option below, but no luck:
.option("mode", "DROP_MALFORMED_MODE")

Try setting the badRecordsPath option:
.option("badRecordsPath", "/tmp/badRecordsPath")
https://docs.databricks.com/spark/latest/spark-sql/handling-bad-records.html
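Wired into the full read, it would look roughly like the sketch below (Scala syntax; the option names are identical from PySpark). Note this assumes a Databricks runtime, where badRecordsPath is supported: records that cannot be parsed are written to that path instead of failing the whole job.
// A minimal sketch only; the paths are taken from the question.
val df = spark.read.format("xml")
  .option("rootTag", "Articles")
  .option("rowTag", "Article")
  .option("inferSchema", "true")
  .option("badRecordsPath", "/tmp/badRecordsPath")
  .load("/mnt/RawAdl2/problemfile.xml")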

Related

Scala Spark - Cannot resolve a column name

This should be pretty straightforward, but I'm having an issue with the following code:
val test = spark.read
.option("header", "true")
.option("delimiter", ",")
.csv("sample.csv")
test.select("Type").show()
test.select("Provider Id").show()
test is a dataframe like so:
Type    Provider Id
A       asd
A       bsd
A       csd
B       rrr
Exception in thread "main" org.apache.spark.sql.AnalysisException:
cannot resolve '`Provider Id`' given input columns: [Type, Provider Id];;
'Project ['Provider Id]
It selects and shows the Type column just fine, but I couldn't get it to work for Provider Id. I wondered whether it was because the column name has a space, so I tried using backticks and removing/replacing the space, but nothing seemed to work. It also runs fine with Spark 3.x, but fails with Spark 2.1.x (and I need to use 2.1.x).
Additional: I tried changing the CSV column order from Type, Provider Id to Provider Id, Type. The error was then the opposite: Provider Id shows, but Type throws the exception.
Any suggestions?
test.printSchema()
You can use the output of printSchema() to see exactly how Spark read your columns in, then use those names in your code.
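As a rough sketch (assuming printSchema() reveals stray whitespace or an invisible character such as a BOM in the header name, which is a common cause of this in older Spark versions), you could normalise the column names after reading and then select as usual:
// Sketch only; assumes the mismatch comes from whitespace around the header.
val test = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .csv("sample.csv")

test.printSchema()  // shows the column names exactly as Spark parsed them

val cleaned = test.toDF(test.columns.map(_.trim): _*)  // normalise the names
cleaned.select("Provider Id").show()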

Read oneline file into dataframe

I have the task of reading a one-line JSON file into Spark. I've thought about either modifying the input file so that it fits spark.read.json(path), or reading the whole file and modifying it in memory to make it fit, as shown below:
import spark.implicits._
val file = sc.textFile(path).collect()(0)
val data = file.split("},").map(json => s"$json}")
val ds = data.toSeq.toDF()
Is there a way of directly reading the json or read the one line file into multiple rows?
Edit:
Sorry, I didn't clearly explain the JSON format; all the JSON objects are on the same line:
{"key":"value"},{"key":"value2"},{"key":"value2"}
If imported with spark.read.json(path), it only takes the first value.
Welcome to SO, HugoDife! I believe a single-line load is what spark.read.json() already does, and you are perhaps looking for this answer. If not, you may want to update your question with a data example.
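For illustration, one possible approach (a sketch only; it assumes the file really is a single line of comma-separated objects as shown, and the path is a placeholder) is to wrap the line in [ ] so it becomes one JSON array, which Spark's JSON reader expands into one row per element:
// Sketch: read the single line, wrap it in brackets to form a valid JSON
// array, and let spark.read.json parse it into one row per element.
import spark.implicits._

val line = spark.sparkContext.textFile("path/to/oneline.json").first()
val df = spark.read.json(Seq(s"[$line]").toDS())
df.show()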

Why does Spark fail with "value write is not a member of org.apache.spark.sql.DataFrameReader [error]"?

I have two almost identical write-to-DB Scala statements; however, one throws an error and the other doesn't, and I don't understand how to fix it. Any ideas?
This statement works:
df_pm_visits_by_site_trn.write.format("jdbc").option("url", db_url_2).option("dbtable", "pm_visits_by_site_trn").option("user", db_user).option("password", db_pwd).option("truncate","true").mode("overwrite").save()
This one doesn't work and throws a compile error:
df_trsnss.write.format("jdbc").option("url", db_url_2).option("dbtable", "df_trsnss").option("user", db_user).option("password", db_pwd).option("truncate","true").mode("overwrite").save()
_dev.scala:464: value write is not a member of org.apache.spark.sql.DataFrameReader [error]
df_trsnss.write.format("jdbc").option("url",
db_url_2).option("dbtable", "trsnss").option("user",
db_user).option("password",
db_pwd).option("truncate","true").mode("overwrite").save()
If I delete the second write statement, or simply comment it out, the whole code compiles with no errors.
Based on the error message, df_trsnss is a DataFrameReader, not a DataFrame. You likely forgot to call load.
val df_trsnss = spark.read.format("csv")
instead of
val df_trsnss = spark.read.format("csv").load("...")
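For completeness, a minimal sketch under that assumption (the CSV path and the connection variables are placeholders): calling .load(...) is what turns the DataFrameReader into a DataFrame, which then has a write member.
// Sketch only; path and connection values are placeholders.
val df_trsnss = spark.read
  .format("csv")
  .option("header", "true")
  .load("/path/to/trsnss.csv")   // .load(...) returns a DataFrame

df_trsnss.write                  // write is now available
  .format("jdbc")
  .option("url", db_url_2)
  .option("dbtable", "trsnss")
  .option("user", db_user)
  .option("password", db_pwd)
  .option("truncate", "true")
  .mode("overwrite")
  .save()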

Using SparkSession, when I read a Parquet file I get a "type not yet supported" error message

Using SparkSession, when I read the Parquet file I get the following error.
My code:
val df = spark.read.parquet("/Users/shaokai.li/Downloads/test.parquet")
error:
org.apache.spark.sql.AnalysisException: Parquet type not yet supported: INT64 (TIMESTAMP_MILLIS);
I searched online for a long time and could not find the answer.
I hope someone can answer this for me, thanks!

spark scala issue uploading csv

I am trying to load a CSV file into a temp table so that I can query it, and I am having two issues.
First: I tried loading the CSV into a DataFrame (this CSV has some empty fields) and I didn't find a way to do it. I found someone in another post suggesting:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("cars.csv")
but it gives me an error saying "Failed to load class for data source: com.databricks.spark.csv".
Then I read the file as a text file, without the headings, as:
val sqlContext = new org.apache.spark.sql.SQLContext(sc);
import sqlContext.implicits._;
case class cars(id: Int, name: String, licence: String);
val carsDF = sc.textFile("../myTests/cars.csv").map(_.split(",")).map(p => cars( p(0).trim.toInt, p(1).trim, p(2).trim) ).toDF();
carsDF.registerTempTable("cars");
val dgp = sqlContext.sql("SELECT * FROM cars");
dgp.show()
This gives an error because one of the licence fields is empty. I tried to handle this issue when building the DataFrame, but it did not work.
I could obviously go into the CSV file and fix it by adding a null, but I do not want to do that because there are a lot of fields and it could be problematic. I want to fix it programmatically, either when I create the DataFrame or in the class...
Any other thoughts, please let me know as well.
To be able to use spark-csv you have to make sure it is available. In interactive mode the simplest solution is to use the --packages argument when you start the shell:
bin/spark-shell --packages com.databricks:spark-csv_2.10:1.1.0
Regarding manual parsing: working with CSV files, especially malformed ones like cars.csv, requires much more work than simply splitting on commas. Some things to consider:
how to detect the CSV dialect, including the method of string quoting
how to handle quotes and newline characters inside strings
how to handle malformed lines
In the case of the example file you have to at least:
filter empty lines
read the header
map lines to fields, providing a default value if a field is missing (see the sketch after this list)
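A minimal sketch of that manual approach (assuming a cars.csv with a header line and three comma-separated fields id, name, licence; the class name and default value are illustrative only):
// Sketch only: filter the header and empty lines, and substitute a default
// when the licence field is missing or empty.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

case class Car(id: Int, name: String, licence: String)

val raw = sc.textFile("../myTests/cars.csv")
val header = raw.first()

val carsDF = raw
  .filter(_ != header)                  // drop the header line
  .filter(_.trim.nonEmpty)              // drop empty lines
  .map(_.split(",", -1).map(_.trim))    // -1 keeps trailing empty fields
  .map(f => Car(
    f(0).toInt,
    f(1),
    if (f.length > 2 && f(2).nonEmpty) f(2) else "unknown"  // default licence
  ))
  .toDF()

carsDF.registerTempTable("cars")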
Here you go. Remember to check the delimiter for your CSV.
// create spark session
val spark = org.apache.spark.sql.SparkSession.builder
.master("local")
.appName("Spark CSV Reader")
.getOrCreate()
// read csv
val df = spark.read
.format("csv")
.option("header", "true") //reading the headers
.option("mode", "DROPMALFORMED")
.option("delimiter", ",")
.load("/your/csv/dir/simplecsv.csv")
// create a table from dataframe
df.createOrReplaceTempView("tableName")
// run your sql query
val sqlResults = spark.sql("SELECT * FROM tableName")
// display the results (display() is Databricks-specific; use sqlResults.show() outside Databricks)
display(sqlResults)