Skip the first and last line of a pipe-delimited file with 26 columns and make it into a DataFrame using Scala - scala

HD|20211210
DT|D-|12/22/2017|12/22/2017 09:41:45.828000|11/01/2017|01/29/2018 14:46:10.666000|1.2|1.2|ABC|ABC|123|123|4554|023|11/01/2017|ACDF|First|0012345||f|ABCD|ABCDEFGH|ABCDEFGH||||
DT|D-|12/25/2017|12/25/2017 09:24:20.202000|12/13/2017|01/29/2018 07:52:23.607000|6.4|6.4|ABC|ABC|123|123|4540|002|12/13/2017|ACDF|First|0012345||f|ABC|ABCDEF|ABCDEFGH||||
TR|0000000002
The file name is Datafile.Dat, and the Scala version is 2.11.
I need to create a header DataFrame from the first line (excluding "HD|"), a trailer DataFrame from the last line (excluding "TR|"), and finally the actual data DataFrame by skipping both the first and last lines and excluding "DT|" from each remaining line.
Please help me with this.

I see you have a defined schema for your DataFrame (excluding the first and last rows).
What you can do is read the file with '|' as the separator
and enable "DROPMALFORMED" mode:
val schema = ??? // define your schema here (a StructType for the 26 data columns)
val df = spark.read.option("mode", "DROPMALFORMED").option("delimiter", "|").option("header", "true").schema(schema).csv("Datafile.Dat")
Another way is to use zipWithIndex.
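A minimal sketch of the zipWithIndex approach, assuming a SparkSession named spark (for example in spark-shell) and the Datafile.Dat layout shown above; the DataFrame and column names here are only illustrative:
import spark.implicits._

val lines   = spark.sparkContext.textFile("Datafile.Dat")
val total   = lines.count()
val indexed = lines.zipWithIndex()

// first line without the "HD|" prefix
val headerDF = indexed.filter(_._2 == 0).map(_._1.stripPrefix("HD|")).toDF("header")

// last line without the "TR|" prefix
val trailerDF = indexed.filter(_._2 == total - 1).map(_._1.stripPrefix("TR|")).toDF("trailer")

// everything in between: drop "DT|" and split on '|', keeping trailing empty columns
val dataDF = indexed
  .filter { case (_, i) => i != 0 && i != total - 1 }
  .map { case (line, _) => line.stripPrefix("DT|").split("\\|", -1) }
  .toDF("cols")
Here dataDF holds each record as a single array column; from there you can map the 26 entries onto your actual schema.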

Related

Adding field delimiter ";" after the last column in the header file

I'm new to DataStage and trying to create a sequential file with ";" as the delimiter.
I would like to add the delimiter just after the last column in the header;
please see the example below for a better understanding.
Currently I have this in my sequential file:
SERVICE_ID;OFFER_ID;MINIMUM;MAXIMUM
19441;162887;;;
19442;162889;;;
Expected result, with a delimiter after the last column of the header:
SERVICE_ID;OFFER_ID;MINIMUM;MAXIMUM;
19441;162887;;;
19442;162889;;;
How can I do that, please?
Use the Final Delimiter property in the Sequential File stage format properties.

how to return empty field in Spark

I am trying to check for incomplete records and identify the bad records in Spark.
For example, in the sample test.txt file below, each line is a record with columns separated by \t:
L1C1 L1C2 L1C3 L1C4
L2C1 L2C2 L2C3
L3C1 L3C2 L3C3 L3C4
scala> sc.textFile("test.txt").filter(_.split("\t").length < 4).collect.foreach(println)
L2C1 L2C2 L2C3
The second line is reported as having fewer columns.
How should I parse it without ignoring the trailing empty column in the second line?
Scala's String.split removes trailing empty substrings.
The behavior matches Java's; to keep all the substrings, including the empty ones, call it with a negative limit:
"L2C1 L2C2 L2C3 ".split("\t",-1)

Converting csv to parquet in spark gives error if csv column headers contain spaces

I have a CSV file which I am converting to Parquet files using the Databricks library in Scala. I am using the code below:
val spark = SparkSession.builder().master("local[*]").config("spark.sql.warehouse.dir", "local").getOrCreate()
var csvdf = spark.read.format("org.apache.spark.csv").option("header", true).csv(csvfile)
csvdf.write.parquet(csvfile + "parquet")
The above code works fine if there are no spaces in my column headers. But if any CSV file has spaces in the column headers, it doesn't work and errors out, stating the column headers are invalid. My CSV files are delimited by ,.
Also, I cannot change the column names of the CSV. The column names have to stay as they are, even if they contain spaces, since they are given by the end user.
Any idea how to fix this?
Per @CodeHunter's request:
Sadly, the Parquet file format does not allow spaces in column names;
the error it spits out when you try is: contains invalid character(s) among " ,;{}()\n\t=".
ORC also does not allow spaces in column names :(
Most SQL engines don't support column names with spaces, so you'll probably be best off converting your columns to your preference of foo_bar or fooBar, or something along those lines.
I would rename the offending columns in the DataFrame, changing spaces to underscores, before saving. This could be done with select "foo bar" as "foo_bar", or with .withColumnRenamed("foo bar", "foo_bar").
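A minimal sketch of that renaming, applied to every column before writing (it reuses csvdf and csvfile from the question):
// replace spaces with underscores in every column name, then write as before
val cleaned = csvdf.columns.foldLeft(csvdf) { (df, name) =>
  df.withColumnRenamed(name, name.replace(" ", "_"))
}
cleaned.write.parquet(csvfile + "parquet")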

How to skip multiple header lines in CSV files with pyspark

I have a text file that has several header lines starting with #. How can I skip these lines using pyspark?
Is there anything like lines.startswith in pyspark?
As given in the documentation, there is a comment parameter which can be set to # to skip lines starting with that character.
For example:
df = sql.read.csv(path, comment="#", inferSchema=True, header=True)

Remove file trailer record using scalding or scala

I am trying to use Pipe (cascading.pipe.Pipe) for reading a file.
Every record in the file follows a schema except the trailer record; hence, whenever the pipe-reading code executes, it throws an exception because the trailer record doesn't match the schema.
The pipeline looks like:
fieldlst:List(col1, col2, col3)
val filteredInput = Csv(inputFilePath, separator = "|", fields = fieldlst, skipHeader = true)
.read
Can anybody tell me a solution for this? Removing the trailer record by reading and rewriting the file seems like a simple solution, but for that I would have to read and write the entire file, which can be very large.
Rather than using the Csv pipe, you can use TextLine and then split your records on '|'.
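A minimal sketch of that idea inside a Scalding Job; it assumes the trailer can be recognized because it does not split into the expected number of '|'-separated fields, and the field names col1..col3 are only illustrative (a header line, if present, would need a similar check):
val filteredInput = TextLine(inputFilePath)
  .read
  // keep only lines that split into the expected three fields, dropping the trailer
  .filter('line) { line: String => line.split("\\|", -1).length == 3 }
  // replace the raw line with the three named columns
  .mapTo('line -> ('col1, 'col2, 'col3)) { line: String =>
    val parts = line.split("\\|", -1)
    (parts(0), parts(1), parts(2))
  }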