I have a text file with several header lines starting with #. How can I skip these lines using PySpark?
Is there anything like lines.startswith in PySpark?
As given in the documentation, there is a comment parameter which can be set to "#" to skip lines starting with that character.
Example:
df = spark.read.csv(path, comment="#", inferSchema=True, header=True)
HD|20211210
DT|D-|12/22/2017|12/22/2017 09:41:45.828000|11/01/2017|01/29/2018 14:46:10.666000|1.2|1.2|ABC|ABC|123|123|4554|023|11/01/2017|ACDF|First|0012345||f|ABCD|ABCDEFGH|ABCDEFGH||||
DT|D-|12/25/2017|12/25/2017 09:24:20.202000|12/13/2017|01/29/2018 07:52:23.607000|6.4|6.4|ABC|ABC|123|123|4540|002|12/13/2017|ACDF|First|0012345||f|ABC|ABCDEF|ABCDEFGH||||
TR|0000000002
The file name is Datafile.Dat, Scala version 2.11.
I need to create a header DataFrame from the first line (excluding the "HD|" prefix), a trailer DataFrame from the last line (excluding the "TR|" prefix), and the actual DataFrame from the remaining lines (excluding the "DT|" prefix from each line).
Please help me on this.
I see you have a defined schema for your DataFrame (except for the first and last rows).
What you can do is read the file with '|' as the separator
and enable "DROPMALFORMED" mode:
schema = 'define your schema here'
df = spark.read.option("mode","DROPMALFORMED").option("delimiter","|").option("header","true").schema(schema).csv("Datafile.Dat")
Another way is to use zipWithIndex.
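With zipWithIndex you pair every line with its index, then filter on the index and strip the three-character tag. A minimal sketch of that logic in plain Python (enumerate stands in for rdd.zipWithIndex(); in Spark you would filter the indexed RDD the same way and pass the results to createDataFrame; the sample lines are abbreviated from the question):

```python
# Sample lines in the HD|/DT|/TR| layout from the question (fields elided).
lines = [
    "HD|20211210",
    "DT|D-|12/22/2017|...",
    "DT|D-|12/25/2017|...",
    "TR|0000000002",
]
n = len(lines)
indexed = list(enumerate(lines))  # stand-in for rdd.zipWithIndex()

header  = [l[3:] for i, l in indexed if i == 0]          # first line, "HD|" stripped
trailer = [l[3:] for i, l in indexed if i == n - 1]      # last line, "TR|" stripped
data    = [l[3:] for i, l in indexed if 0 < i < n - 1]   # middle lines, "DT|" stripped

print(header, trailer, len(data))  # -> ['20211210'] ['0000000002'] 2
```

In Spark the three filtered RDDs would each become a DataFrame with their own schema.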
I am trying to read hundreds of .dat files, skipping the header lines (I do not know beforehand how many I need to skip). The number of header lines varies from 1 to 20, and each header line begins with either "$" or "!".
A sample data (left column - node, right column - microstructure) has always two columns and looks like the following:
!===
!Comment
$Material
1 1.452E-001
2 1.446E-001
3 1.459E-001
I tried the following code, assuming I know beforehand that there are 3 header lines:
fid = fopen('Graphite_Node_Test.dat') ;
data = textscan(fid,'%f %f','HeaderLines',3) ;
fclose(fid);
This solution works if the number of header lines is known. How can I change the code so that it can read the .dat file without knowing the number of header lines beginning with either "$" or "!" sign?
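One way is to count the leading lines that start with "$" or "!" first, then pass that count to textscan's 'HeaderLines' option. The counting logic is sketched here in Python for illustration only (in MATLAB the same loop can be written with fgetl and startsWith, rewinding the file with frewind before calling textscan):

```python
def count_header_lines(lines):
    # Count leading lines that start with "$" or "!"; stop at the first data line.
    n = 0
    for line in lines:
        if line.lstrip().startswith(("$", "!")):
            n += 1
        else:
            break
    return n

sample = ["!===", "!Comment", "$Material", "1 1.452E-001", "2 1.446E-001"]
print(count_header_lines(sample))  # -> 3
```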
I added a dummy column at the beginning of my data export to a CSV file to get rid of control characters and some specific string values, as shown below, using a pipe '|' delimiter. The data comes from a Teradata FastExport using UTF-8.
y^CDUMMYCOLUMN|
<86>^ADUMMYCOLUMN|
<87>^ADUMMYCOLUMN|
<94>^ADUMMYCOLUMN|
{^ADUMMYCOLUMN|
_^ADUMMYCOLUMN|
y^CDUMMYCOLUMN|
[^ADUMMYCOLUMN|
k^ADUMMYCOLUMN|
m^ADUMMYCOLUMN|
<82>^ADUMMYCOLUMN|
c^ADUMMYCOLUMN|
<8e>^ADUMMYCOLUMN|
<85>^ADUMMYCOLUMN|
This is completely random and not every row has these special characters. I'm sure I'm missing something here. I'm using sed to get rid of dummycolumn and control characters.
$ sed -e 's/.*DUMMYCOLUMN|//;/^$/d' data.csv > data_output.csv
After running this command, I am still left with the random values below.
<86>
<87>
<85>
<94>
<8a>
<85>
<8e>
I could have written a sed statement to remove the first three characters of each row, but this prefix does not appear in every row. Also, the row count is 400 million.
Current output:
y^CDUMMYCOLUMN|COLUMN1|COLUMN2|COLUMN3
<86>^ADUMMYCOLUMN|6218915846|36596|12
<87>^ADUMMYCOLUMN|9822354765|35325|33
t^ADUMMYCOLUMN|6788793999|111|12
g^ADUMMYCOLUMN|6090724004|7017|12
_^ADUMMYCOLUMN|IC-21357688806502|111|12
<8e>^ADUMMYCOLUMN|9682027117|35335|33
v^ADUMMYCOLUMN|6406807681|121|12
h^ADUMMYCOLUMN|6346768510|121|12
V^ADUMMYCOLUMN|6130452510|7017|12
Desired output:
COLUMN1|COLUMN2|COLUMN3
6218915846|36596|12
9822354765|35325|33
6788793999|111|12
6090724004|7017|12
IC-21357688806502|111|12
9682027117|35335|33
6406807681|121|12
6346768510|121|12
6130452510|7017|12
Please help.
Thank you.
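One likely culprit: in a UTF-8 locale, sed's `.` may refuse to match bytes that are not valid UTF-8, and `<86>`, `<8e>`, etc. look exactly like a pager's rendering of such raw bytes, so the pattern never reaches DUMMYCOLUMN on those rows. Prefixing the command with LC_ALL=C (byte-wise matching) is the usual fix to try. The same cleanup can also be done byte-wise in Python, independent of locale; strip_dummy below is a hypothetical helper name:

```python
def strip_dummy(line: bytes) -> bytes:
    # Drop everything up to and including the "DUMMYCOLUMN|" marker,
    # whatever control or invalid bytes precede it; pass other lines through.
    marker = b"DUMMYCOLUMN|"
    i = line.find(marker)
    return line[i + len(marker):] if i != -1 else line

print(strip_dummy(b"\x86\x01DUMMYCOLUMN|6218915846|36596|12"))
```

Because this works on bytes, the invalid-UTF-8 prefix is irrelevant; applied line by line it handles files of any size in a single streaming pass.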
I have a CSV file which I am converting to Parquet using the Databricks library in Scala. I am using the code below:
val spark = SparkSession.builder().master("local[*]").config("spark.sql.warehouse.dir", "local").getOrCreate()
var csvdf = spark.read.format("org.apache.spark.csv").option("header", true).csv(csvfile)
csvdf.write.parquet(csvfile + "parquet")
Now the above code works fine if there are no spaces in my column headers. But if any CSV file has spaces in the column headers, it fails with an error about invalid column headers. My CSV files are delimited by ",".
Also, I cannot change the column names in the CSV. The column names have to stay as they are, spaces included, because they are supplied by the end user.
Any idea on how to fix this?
per #CodeHunter's request
Sadly, the Parquet file format does not allow spaces in column names;
the error it spits out when you try is: contains invalid character(s) among " ,;{}()\n\t=".
ORC also does not allow spaces in column names :(
Most SQL engines don't support column names with spaces either, so you'll probably be best off converting your columns to your preference of foo_bar or fooBar or something along those lines.
I would rename the offending columns in the DataFrame, changing spaces to underscores, before saving. This could be done with selectExpr("`foo bar` as foo_bar") or with .withColumnRenamed("foo bar", "foo_bar").
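A sketch of that rename in PySpark terms, where sanitize is a hypothetical helper and the rejected character set is taken from the error message above. The name mangling itself is plain Python, so it is shown standalone; with a DataFrame you would apply it via df.toDF(*sanitize(df.columns)):

```python
def sanitize(columns):
    # Replace every character parquet rejects (" ,;{}()\n\t=") with "_".
    bad = ' ,;{}()\n\t='
    return ["".join("_" if ch in bad else ch for ch in name) for name in columns]

# With a Spark DataFrame you would then do:
#   csvdf = csvdf.toDF(*sanitize(csvdf.columns))
print(sanitize(["foo bar", "a=b", "plain"]))  # -> ['foo_bar', 'a_b', 'plain']
```

This keeps the end user's original headers in the CSV untouched and only mangles the names in the Parquet output.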
I'm having a tough time using StreamingContext to read a CSV and send each row to another method for further processing. I tried splitting on newlines, but the split happens after three columns (there are about 10 columns per row):
val lines = ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/tmpFolder/")
lines.map { row =>
  val columnValues = row.split("\n")
  (columnValues(0), "\n")
}.print()
If I open the CSV in Excel, there are about 10 values per row. If I open the same file in Sublime or some other text editor, there appears to be a newline after those first 3 values. Not sure if it's an encoding thing or just the way Sublime displays it. In any case, I'm trying to get the entire row in Spark - not sure if there's a way to do that.
ssc.textFileStream internally creates a file stream and starts splitting on the newline character. But your data contains text qualifiers (quoted fields):
1996, Jeep, "Grand Cherokee, MUST SELL!
air", moon roof, loaded, 4799.00
Here some of the text is in double quotes and the record spans multiple lines. If you split this data by "," you get:
[1996, Jeep, "Grand Cherokee,MUST SELL!]
and the other data points are lost, because you are splitting by comma. To avoid that, you can use sqlContext:
df = (sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("multiLine", "true")
    .option("quoteMode", "ALL")
    .load(path))
Or you can pre-process your CSV using the Univocity parser to handle multi-line records, double quotes and other special characters, put the cleaned files into the directory, and start your ssc.textFileStream after that.
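To see why splitting on newlines or commas breaks quoted records, note how a quote-aware parser treats the sample above as a single record. Python's csv module is used here only as a stand-in for any RFC-4180-style parser (Spark's multiLine CSV reader and Univocity handle quoting the same way):

```python
import csv
import io

# The two physical lines from the example, as one raw string.
raw = '1996,Jeep,"Grand Cherokee, MUST SELL!\nair",moon roof,loaded,4799.00\n'
rows = list(csv.reader(io.StringIO(raw)))

print(len(rows))    # one logical record, despite the embedded newline
print(rows[0][2])   # the quoted field keeps its comma and its newline
```

A naive split on "\n" would have produced two half-records here, which is exactly the symptom described in the question.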