Spark Scala: read tab-delimited files and replace empty values - scala

I have a set of tab-delimited files which I have to read and save in the database (Cassandra). I can load all the tables that have data in every column, but some tables have empty values in some of the columns and those are not getting inserted.
I tried the below,
sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("delimiter", "/t").option("nullValue"," ").load(path)
and also
sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("delimiter", "/t").option("nullValue"," ").option(""," ").load(path)
Neither option loaded the data. Any inputs?

I think I figured it out:
var df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("delimiter", "\t").option("treatEmptyValuesAsNulls", "true").option("nullValue","").load(path)
This turns every empty value into null, and then
var df1 = df.na.fill(" ",df.columns)
I had to create another DataFrame for the fill to take effect. I still need to work out how to fill dynamically based on the dtypes.
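A minimal sketch of one way to do the dtype-based fill (untested, assumes the df above): split the column names by dtype and fill each group with a type-appropriate default.
import org.apache.spark.sql.types._
// Group the columns by dtype so each group can get its own default.
val stringCols  = df.schema.fields.collect { case f if f.dataType == StringType => f.name }
val numericCols = df.schema.fields.collect { case f if f.dataType.isInstanceOf[NumericType] => f.name }
// Blanks for string columns, zero for numeric columns.
val dfFilled = df.na.fill(" ", stringCols).na.fill(0L, numericCols)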

Related

Saving pyspark dataframe with complicated schema in plain text for testing

How do I make clean test data for pyspark? I have figured something out that seems pretty good, but parts seem a little awkward, so I'm posting.
Let's say I have a dataframe df with a complicated schema and a small number of rows. I want test data checked into my repo. I don't want a binary file. At this point, I'm not sure of the best way to proceed, but I'm thinking I have a file like
test_fn.py
and it has this in it
schema_str='struct<eventTimestamp:timestamp,list_data:array<struct<valueA:string,valueB:string,valueC:boolean>>>'
to get the schema in text format, using the df.schema.simpleString() function. Then to get the rows, I do
lns = [row.json_txt for row in df.select((F.to_json(F.struct('*'))).alias('json_txt')).collect()]
Now I put those lines in my test_fn.py file, or I could have a .json file in the repo.
Now to run the test, I have to make a dataframe with the correct schema and data from this text. It seems the only way Spark will parse the simple string is if I create a dataframe with it; that is, I can't pass that simple string to the from_json function? So this is a little awkward, which is why I thought I'd post:
schema2 = spark.createDataFrame(data=[], schema=schema_str).schema
lns = ...  # say I read the lns back from above
df_txt = spark.createDataFrame(data=lns, schema=T.StringType())
I see df_txt just has one column called 'value'
df_json = df_txt.select(F.from_json('value', schema=schema2).alias('xx'))
sel = ['xx.%s' % nm for nm in df_json.select('xx').schema.fields[0].dataType.fieldNames()]
df2 = df_json.select(*sel)
Now df2 should be the same as the original df, which I see is the case using the deepdiff module.

Spark: Empty JSON Files reading from Directory

I'm reading from a path, say /json/*/myfiles_*.json
I'm then flattening the JSON using explode. This causes an error since I have some empty files. How do I tell it to ignore empty files or somehow filter them out?
I can detect individual files by checking if the head is empty, but I need to do this on the whole collection of files matched by the wildcard path when building the dataframe.
So the answer seems to be that I need to provide a schema explicitly, because it can't infer one from an empty file, as you would expect!
e.g.
val schemadf = sqlContext.read.json(schemapath) //infer schema from file with data or do manually
val schema = schemadf.schema
val raw = sqlContext.read.schema(schema).json(monthfile)
val prep = raw.withColumn("MyArray", explode($"MyArray"))
  .select($"ID", $"name", $"CreatedAt")
display(prep)
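If you would rather filter the empty files out up front, here is a rough sketch (untested, and the wildcard path is illustrative) that expands the glob with the Hadoop FileSystem API and keeps only non-empty files; this only helps if the empty files are truly zero bytes.
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(sc.hadoopConfiguration)
val nonEmptyPaths = fs.globStatus(new Path("/json/*/myfiles_*.json")) // illustrative wildcard
  .filter(_.getLen > 0)          // keep only files that actually contain data
  .map(_.getPath.toString)
val rawFiltered = sqlContext.read.schema(schema).json(nonEmptyPaths: _*)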

Write Header only CSV record from Spark Scala DataFrame

My requirement is to write only the header CSV record using a Spark Scala DataFrame. Can anyone help me with this?
val OHead1 = "/xxxxx/xxxx/xxxx/xxx/OHead1/"
val sc = sparkFile.sparkContext
val outDF = csvDF.select("col_01", "col_02", "col_03").schema
sc.parallelize(Seq(outDF.fieldNames.mkString("\t"))).coalesce(1).saveAsTextFile(s"$OHead1")
The above works and creates the header in the CSV with a tab delimiter. Since I am using a Spark session, I am creating the sparkContext in the second line. outDF is my dataframe created before these statements.
Two things are outstanding; can one of you help me?
1. The above working code is not overwriting the files, so every time I need to delete the files manually. I could not find an overwrite option; can you help me?
2. Since I am doing a select statement and taking the schema, will it be considered an action and start another lineage for this statement? If so, this would degrade the performance.
If you need to output only the header, you can use this code:
df.schema.fieldNames.reduce(_ + "," + _)
It will create a CSV line with the names of the columns.
I tested it, and the solution below did not affect performance.
val OHead1 = "/xxxxx/xxxx/xxxx/xxx/OHead1/"
val sc = sparkFile.sparkContext
val outDF = csvDF.select("col_01", "col_02", "col_03").schema
sc.parallelize(Seq(outDF.fieldNames.mkString("\t"))).coalesce(1).saveAsTextFile(s"$OHead1")
I got a solution to handle this situation: define the columns in the configuration file and write those columns to a file. Here is the snippet.
val Header = prop.getProperty("OUT_HEADER_COLUMNS").replaceAll("\"","").replaceAll(",","\t")
scala.tools.nsc.io.File(s"$HeadOPath").writeAll(s"$Header")
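On question 1 (overwriting), one possible sketch, assuming sparkFile is the SparkSession used above: write the header line through the DataFrame writer, which supports mode("overwrite") so the old output does not have to be deleted manually.
import sparkFile.implicits._   // sparkFile is assumed to be the SparkSession from the question
val header = csvDF.select("col_01", "col_02", "col_03").schema.fieldNames.mkString("\t")
Seq(header).toDF("value")      // a single string column, as the text writer expects
  .coalesce(1)
  .write
  .mode("overwrite")           // replaces any existing files under the path
  .text(OHead1)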

Bypass last line of each file in Spark (Scala)

This question is related to this.
I am processing an S3 folder containing csv.gz files in Spark. Each csv.gz file has a header that contains column names. This has been solved by the above SO link and the solution looks like this:
val rdd = sc.textFile("s3://.../my-s3-path").mapPartitions(_.drop(1))
The problem now is that it looks like some of the files have a newline ('\n') at the end (we are not sure which files). So when converting the RDD to a DataFrame, I'm getting an error. The question now is:
How do I get rid of the last line of each file if it is '\n'?
Why not use a simple filter:
val rdd = sc.textFile("s3...").filter(line => !line.equalsIgnoreCase("\n")).mapPartition...
Or filter out any empty line:
val rdd = sc.textFile("s3...").filter(line => !line.trim().isEmpty)...

Spark Scala issue uploading CSV

I am trying to upload a CSV file into a tempTable so that I can query it, and I am having two issues.
First: I tried uploading the CSV to a DataFrame, and this CSV has some empty fields... and I didn't find a way to do it. I found someone in another post suggesting to use:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("cars.csv")
but it gives me an error saying "Failed to load class for data source: com.databricks.spark.csv"
Then I uploaded the file and read it as a text file, without the headings, as:
val sqlContext = new org.apache.spark.sql.SQLContext(sc);
import sqlContext.implicits._;
case class cars(id: Int, name: String, licence: String);
val carsDF = sc.textFile("../myTests/cars.csv").map(_.split(",")).map(p => cars( p(0).trim.toInt, p(1).trim, p(2).trim) ).toDF();
carsDF.registerTempTable("cars");
val dgp = sqlContext.sql("SELECT * FROM cars");
dgp.show()
This gives an error because one of the licence fields is empty... I tried to handle this issue when building the data frame, but it did not work.
I can obviously go into the CSV file and fix it by adding a null, but I do not want to do this because there are a lot of fields and it could be problematic. I want to fix it programmatically, either when I create the dataframe or in the class...
If you have any other thoughts, please let me know as well.
To be able to use spark-csv you have to make sure it is available. In interactive mode the simplest solution is to use the --packages argument when you start the shell:
bin/spark-shell --packages com.databricks:spark-csv_2.10:1.1.0
Regarding manual parsing: working with CSV files, especially malformed ones like cars.csv, requires much more work than simply splitting on commas. Some things to consider:
how to detect csv dialect, including method of string quoting
how to handle quotes and new line characters inside strings
how to handle malformed lines
In the case of the example file you have to at least (see the sketch after this list):
filter empty lines
read header
map lines to fields, providing a default value if a field is missing
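A minimal sketch of that manual approach (it assumes the cars.csv layout and the sqlContext.implicits._ import from the question; the defaults are illustrative):
case class Car(id: Int, name: String, licence: String)
val carsManualDF = sc.textFile("../myTests/cars.csv")
  .filter(_.trim.nonEmpty)                                                // filter empty lines
  .mapPartitionsWithIndex((idx, it) => if (idx == 0) it.drop(1) else it)  // skip the header line
  .map { line =>
    val fields = line.split(",", -1).map(_.trim)  // -1 keeps trailing empty fields
    Car(
      fields(0).toInt,
      if (fields.length > 1) fields(1) else "",   // default when a field is missing
      if (fields.length > 2) fields(2) else ""
    )
  }
  .toDF()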
Here you go. Remember to check the delimiter for your CSV.
// create spark session
val spark = org.apache.spark.sql.SparkSession.builder
.master("local")
.appName("Spark CSV Reader")
.getOrCreate()
// read csv
val df = spark.read
.format("csv")
.option("header", "true") //reading the headers
.option("mode", "DROPMALFORMED")
.option("delimiter", ",")
.load("/your/csv/dir/simplecsv.csv")
// create a table from dataframe
df.createOrReplaceTempView("tableName")
// run your sql query
val sqlResults = spark.sql("SELECT * FROM tableName")
// display sql results
display(sqlResults)