Writing pyspark dataframe to text file without trailing spaces in hdfs - pyspark

I have a PySpark dataframe with the following data:
id name address
101010 xyz abc jkdjfkd kdfjd
101001 ygbgnt bnbnb dfkdjfd kdjfkd kdfjkd
I want to write this data to a text file with fixed-length columns and a '|' delimiter. The delimiter should come at a fixed position after each column: the first pipe at position 7, the second at position 23, and the third at position 52.
The file format and lengths should be:
101010|xyz abc |jkdjfkd kdfjd |
101001|ygbgnt bnbnb |dfkdjfd kdjfkd kdfjkd |
I am using rpad to achieve this format, but with that solution I get trailing spaces after the name and address columns. I do not want trailing spaces.
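For reference, a minimal PySpark sketch of the rpad approach described above (the column names come from the sample data; the widths 6, 15 and 28 are derived from the stated pipe positions 7, 23 and 52; the output path is a placeholder):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("101010", "xyz abc", "jkdjfkd kdfjd"),
     ("101001", "ygbgnt bnbnb", "dfkdjfd kdjfkd kdfjkd")],
    ["id", "name", "address"])

# pad each column to its fixed width, then glue the pieces together with '|'
fixed = df.select(F.concat(
    F.rpad("id", 6, " "), F.lit("|"),
    F.rpad("name", 15, " "), F.lit("|"),
    F.rpad("address", 28, " "), F.lit("|")).alias("value"))

# a DataFrame with a single string column can be written as plain text (e.g. to HDFS)
fixed.write.text("/tmp/fixed_width_output")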

Related

How to remove a dynamic string from a CSV file using sed?

I added a dummy column at the beginning of my data export to a CSV file (pipe '|' delimited) so that I could get rid of control characters and some specific string values, as shown below. The data comes from a Teradata FastExport using UTF-8.
'''
y^CDUMMYCOLUMN|
<86>^ADUMMYCOLUMN|
<87>^ADUMMYCOLUMN|
<94>^ADUMMYCOLUMN|
{^ADUMMYCOLUMN|
_^ADUMMYCOLUMN|
y^CDUMMYCOLUMN|
[^ADUMMYCOLUMN|
k^ADUMMYCOLUMN|
m^ADUMMYCOLUMN|
<82>^ADUMMYCOLUMN|
c^ADUMMYCOLUMN|
<8e>^ADUMMYCOLUMN|
<85>^ADUMMYCOLUMN|
'''
This is completely random and not every row has these special characters. I'm sure I'm missing something here. I'm using sed to get rid of the dummy column and the control characters.
'''$ sed -e 's/.*DUMMYCOLUMN|//;/^$/d' data.csv > data_output.csv'''
After running this command, I'm still left with these random values:
'''
<86>
<87>
<85>
<94>
<8a>
<85>
<8e>
'''
I could have written a sed statement to remove the first three characters from each row, but this prefix does not appear in every row. On top of that, the row count is 400 million.
Current output.
y^CDUMMYCOLUMN|COLUMN1|COLUMN2|COLUMN3
<86>^ADUMMYCOLUMN|6218915846|36596|12
<87>^ADUMMYCOLUMN|9822354765|35325|33
t^ADUMMYCOLUMN|6788793999|111|12
g^ADUMMYCOLUMN|6090724004|7017|12
_^ADUMMYCOLUMN|IC-21357688806502|111|12
<8e>^ADUMMYCOLUMN|9682027117|35335|33
v^ADUMMYCOLUMN|6406807681|121|12
h^ADUMMYCOLUMN|6346768510|121|12
V^ADUMMYCOLUMN|6130452510|7017|12
Desired Output
COLUMN1|COLUMN2|COLUMN3
6218915846|36596|12
9822354765|35325|33
6788793999|111|12
6090724004|7017|12
IC-21357688806502|111|12
9682027117|35335|33
6406807681|121|12
6346768510|121|12
6130452510|7017|12
Please help.
Thank you.
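One possible approach, sketched in Python rather than sed: treat each line as raw bytes and keep only what comes after the DUMMYCOLUMN| marker, so the stray control and multibyte characters never need to be matched at all (file names are taken from the sed command above; performance on 400 million rows is untested):
MARKER = b"DUMMYCOLUMN|"
with open("data.csv", "rb") as src, open("data_output.csv", "wb") as dst:
    for line in src:
        # keep only the part after the marker; pass lines without it through unchanged
        _, sep, rest = line.partition(MARKER)
        dst.write(rest if sep else line)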

how to return empty field in Spark

I am trying to check for incomplete records and identify the bad records in Spark.
For example, a sample test.txt file in record format, with columns separated by \t:
L1C1 L1C2 L1C3 L1C4
L2C1 L2C2 L2C3
L3C1 L3C2 L3C3 L3C4
scala> sc.textFile("test.txt").filter(_.split("\t").length < 4).collect.foreach(println)
L2C1 L2C2 L2C3
The second line is printed as having fewer columns.
How should I parse it without ignoring the empty column at the end of the second line?
It is the split method in Scala that removes trailing empty substrings.
The behavior is the same as in Java; to keep all substrings, including trailing empty ones, call it with a negative limit:
"L2C1 L2C2 L2C3 ".split("\t",-1)

multi-character separator in `set datafile separator "|||"` doesn't work

I have an input file example.data with a triple-pipe as separator, dates in the first column, and also some more or less unpredictable text in the last column:
2019-02-01|||123|||345|||567|||Some unpredictable textual data with pipes|,
2019-02-02|||234|||345|||456|||weird symbols # and commas, and so on.
2019-02-03|||345|||234|||123|||text text text
When I try to run the following gnuplot5 script
set terminal png size 400,300
set output 'myplot.png'
set datafile separator "|||"
set xdata time
set timefmt "%Y-%m-%d"
set format x "%y-%m-%d"
plot "example.data" using 1:2 with linespoints
I get the following error:
line 8: warning: Skipping data file with no valid points
plot "example.data" using 1:2 with linespoints
^
"time.gnuplot", line 8: x range is invalid
Even stranger, if I change the last line to
plot "example.data" using 1:4 with linespoints
then it works. It also works for 1:7 and 1:10, but not for other numbers. Why?
When using the
set datafile separator "chars"
syntax, the string is not treated as one long separator. Instead, every character listed between the quotes becomes a separator on its own. From [Janert, 2016]:
If you provide an explicit string, then each character in the string will be
treated as a separator character.
Therefore,
set datafile separator "|||"
is actually equivalent to
set datafile separator "|"
and a line
2019-02-05|||123|||456|||789
is treated as if it had ten columns, of which only the columns 1,4,7,10 are non-empty.
Workaround
Find some other character that is unlikely to appear in the dataset (in the following, I'll assume \t as an example). If you can't dump the dataset with a different separator, use sed to replace ||| by \t:
sed 's/|||/\t/g' example.data > modified.data # in the command line
then proceed with
set datafile separator "\t"
and modified.data as input.
You basically gave the answer yourself.
If you can influence the separator in your data, use a separator which typically does not occur in your data or text. I always thought \t was made for that.
If you cannot influence the separator in your data, use an external tool (awk, Python, Perl, ...) to modify your data. In these languages it is probably a "one-liner". gnuplot has no direct replace function.
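For example, a minimal Python sketch of that preprocessing step (file names as in the sed workaround above):
# rewrite the file with a tab instead of the ||| separator
with open("example.data") as src, open("modified.data", "w") as dst:
    dst.write(src.read().replace("|||", "\t"))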
If you don't want to install external tools and want to ensure platform independence, there is still a way to do it with gnuplot. Not just a "one-liner", but there is almost nothing you can't also do with gnuplot ;-).
Edit: simplified version with the input from #Ethan (https://stackoverflow.com/a/54541790/7295599).
Assuming your data is in a datablock named $Data, the following code replaces ||| with \t and puts the result into $DataOutput.
### Replace string in dataset
reset session
$Data <<EOD
# data with special string separators
2019-02-01|||123|||345|||567|||Some unpredictable textual data with pipes|,
2019-02-02|||234|||345|||456|||weird symbols # and commas, and so on.
2019-02-03|||345|||234|||123|||text text text
EOD
# replace string function
# prefix RS_ to avoid variable name conflicts
replaceStr(s,s1,s2) = (RS_s='', RS_n=1, (sum[RS_i=1:strlen(s)] \
((s[RS_n:RS_n+strlen(s1)-1] eq s1 ? (RS_s=RS_s.s2, RS_n=RS_n+strlen(s1)) : \
(RS_s=RS_s.s[RS_n:RS_n], RS_n=RS_n+1)), 0)), RS_s)
set print $DataOutput
do for [RS_j=1:|$Data|] {
print replaceStr($Data[RS_j],"|||","\t")
}
set print
print $DataOutput
### end of code
Output:
# data with special string separators
2019-02-01 123 345 567 Some unpredictable textual data with pipes|,
2019-02-02 234 345 456 weird symbols # and commas, and so on.
2019-02-03 345 234 123 text text text

Converting csv to parquet in spark gives error if csv column headers contain spaces

I have a CSV file which I am converting to Parquet files using the Databricks library in Scala. I am using the code below:
val spark = SparkSession.builder().master("local[*]").config("spark.sql.warehouse.dir", "local").getOrCreate()
var csvdf = spark.read.format("org.apache.spark.csv").option("header", true).csv(csvfile)
csvdf.write.parquet(csvfile + "parquet")
The above code works fine as long as there are no spaces in my column headers. But if any CSV file has spaces in the column headers, it doesn't work and errors out stating that the column headers are invalid. My CSV files are comma-delimited.
Also, I cannot remove the spaces from the column names of the CSV. The column names have to stay as they are, even if they contain spaces, as they are given by the end user.
Any idea on how to fix this?
per #CodeHunter's request
sadly, the parquet file format does not allow for spaces in column names;
the error that it'll spit out when you try is: contains invalid character(s) among " ,;{}()\n\t=".
ORC also does not allow for spaces in column names :(
Most sql-engines don't support column names with spaces, so you'll probably be best off converting your columns to your preference of foo_bar or fooBar or something along those lines
I would rename the offending columns in the dataframe, changing spaces to underscores, before saving. This could be done with a select ("foo bar" as "foo_bar") or with .withColumnRenamed("foo bar", "foo_bar").
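A rough PySpark sketch of that renaming step (the question's code is Scala, but the idea is the same; the input and output paths here are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
csvdf = spark.read.option("header", True).csv("input.csv")

# replace spaces in every column name with underscores before writing Parquet
renamed = csvdf.toDF(*[c.replace(" ", "_") for c in csvdf.columns])
renamed.write.parquet("output_parquet")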

Modelica combiTimeTable

I have a few questions regarding CombiTimeTables. I tried to import a txt file (3 columns: time plus 2 measured signals) into a CombiTimeTable.
- Does the txt file have to have the following header: #1; double K(x,y)?
- Is it right that the table name in the CombiTimeTable has to be the same as the variable name after double (in my case K)?
- I get errors if I try to connect two outputs of the table (column 1 and column 2). Do I have to specify how many columns I want to import?
And: why do I have to use "/" instead of "\" in the path?
Modelica Code:
Modelica.Blocks.Sources.CombiTimeTable combiTimeTable(
tableOnFile=true,
tableName="K",
fileName="D:/test.txt")
Thank you very much!
The standard text file format for CombiTables is:
#1
double K(4,3)
0 1 10
1 3 20
2 5 30
3 7 40
In this case, the "tableName" parameter I would set as a modifier on the CombiTable (or CombiTimeTable) is "K". And yes, the numbers in parentheses indicate the dimensions of the data to the tool, in this case 4 rows and 3 columns.
Regarding the path separator "/" vs. "\": the backslash "\" is the path separator on Windows, whereas the forward slash "/" is the path separator on Unix-like systems (e.g. Linux). The issue is that in most libraries the backslash is used as an escape character. For example, "\n" indicates a new line and "\t" indicates a tab, so if my file name string were "D:\nextfolder\table.txt", it would actually look something like:
D:
extfolder able.txt
Depending on your Modelica simulation tool however it might correct this. So if you used a file selection dialog box to choose your file, the tool should automatically switch the file separator character to the forward slash "/" and your text would look like:
combiTimeTable(
tableOnFile=true,
tableName="K",
fileName="D:/nextfolder/table.txt",
columns=2:3)
If you are getting errors in your connect statement, I would guess you might have forgotten the "columns" parameter. The default value for this parameter comes from the "table" parameter (which is empty by default, zero rows by two columns), not from the data in the file. So when you are reading data from a file, you need to set this parameter explicitly (e.g. columns=2:3 as above) so that the table provides the outputs you want to connect.