Input Data:
key,date,value
10,20180701,a10
11,20180702,a11
12,20180702,a12
13,20180702,a13
14,20180702,a14
15,20180702,a15
16,20180702,a16
17,20180702,a17
18,20180702,a18
19,20180702,a19
1 ,20180701,a1
2 ,20180701,a2
3 ,20180701,a3
4 ,20180701,a4
5 ,20180701,a5
6 ,20180701,a6
7 ,20180701,a7
8 ,20180701,a8
9 ,20180701,a9
Code:
val rawData = sc.textFile(.....)
val datadf:DataFrame=rawData.toDF
After reading the data into a DataFrame with columns key, date, value:
datadf.coalesce(1).orderBy(desc("key")).drop(col("key")).write.mode("overwrite").partitionBy("date").text("hdfs://path/")
I am trying to order the rows by the key column and then drop that column before saving to HDFS (as a single file for each day).
However, the order is not preserved in the output files.
If I do not use coalesce, the order is preserved, but multiple files are generated.
Output:
/20180701/part-xxxxxxx.txt
a1
a9
a6
a4
a5
a3
a7
a8
a2
a10
/20180702/part-xxxxxxx.txt
a18
a12
a13
a19
a15
a16
a17
a11
a14
Expected output:
/20180701/part-xxxxxxx.txt
a1
a2
a3
a4
a5
a6
a7
a8
a9
a10
/20180702/part-xxxxxxx.txt
a11
a12
a13
a14
a15
a16
a17
a18
a19
The following code should get you started (this uses Spark 2.1):
import org.apache.spark.sql.types.StructType
val schema = new StructType().add($"key".int).add($"date".string).add($"value".string)
val df = spark.read.schema(schema).option("header","true").csv("source.txt")
df.coalesce(1).orderBy("key").drop("key").write.mode("overwrite").partitionBy("date").csv("hdfs://path/")
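The schema above declares key as an integer, which matters because string keys sort lexicographically ("10" before "2"). As a rough illustration of the expected per-date result, here is a plain-Python sketch with no Spark involved; the function name values_per_date is hypothetical and it assumes the header line has already been removed:

```python
from collections import defaultdict


def values_per_date(lines):
    """Group 'key,date,value' lines by date and order values by numeric key.

    Hypothetical helper for illustration only; expects data lines without
    the header row.
    """
    grouped = defaultdict(list)
    for line in lines:
        key, date, value = line.split(",")
        # int() tolerates the trailing space in keys such as "1 ".
        grouped[date].append((int(key), value))
    # Sort each date's rows by the integer key, then keep only the value.
    return {
        date: [value for _, value in sorted(pairs)]
        for date, pairs in grouped.items()
    }


if __name__ == "__main__":
    sample = ["10,20180701,a10", "2 ,20180701,a2", "1 ,20180701,a1"]
    print(values_per_date(sample))
```

Comparing the keys as integers is what puts a10 after a9 instead of right after a1.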
Related
I have a problem with an Eloquent statement.
I have two tables
Table 1 (ORDER_DETAILS)
FORM_ID  ORDER_ID  DETAIL_ID  VALUE
A1       X1        B1         Test
A1       X1        B3         10;20
A2       X2        B10        Test 2
A2       X2        B20        1;2
A2       X2        B30        A;B;C
A3       X3        B200       X;Y;Z
A3       X3        B300       Test 3
A4       X4        B1000      L;M;O
Table 2 (FORM_DETAIL)
FORM_ID  DETAIL_ID  MOD
A1       B1         Text
A1       B2         Input
A1       B3         Select
A2       B10        Input
A2       B20        Select
A2       B30        Select
A3       B100       Text
A3       B200       Select
A3       B300       Input
A4       B1000      Select
A4       B2000      Text
A4       B3000      Text
A4       B4000      Input
A4       B5000      Input
Now I would like to combine them, for example when I query ORDER_ID X1:
FORM_ID  ORDER_ID  DETAIL_ID  VALUE
A1       X1        B1         Test
A1       X1        B2         null
A1       X1        B3         10;20
or ORDER_ID X4:
FORM_ID  ORDER_ID  DETAIL_ID  VALUE
A4       X4        B1000      L;M;O
A4       X4        B2000      null
A4       X4        B3000      null
A4       X4        B4000      null
A4       X4        B5000      null
All rows from table 2 (FORM_DETAIL) should always be displayed, and each one
should show the VALUE from table 1 (ORDER_DETAILS) if one exists for the
requested ORDER_ID (for example X4), or null otherwise. It is important that
FORM_ID and ORDER_ID are identical in every row of the result.
Can you maybe help me?
Sorry, I am still a Laravel beginner :o)
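The result described above has the shape of a LEFT JOIN from FORM_DETAIL to ORDER_DETAILS, with the ORDER_ID filter placed in the join condition rather than in the WHERE clause. The following is only a sketch of that SQL, run through Python's sqlite3 module so it can be tried without a Laravel setup. The helper names build_demo_db and form_rows_for_order are hypothetical, and the data is a subset of the tables above. In Laravel the same join would presumably be expressed with the query builder's leftJoin and a closure, but that is an assumption, not tested here.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database holding part of the sample data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ORDER_DETAILS "
        "(FORM_ID TEXT, ORDER_ID TEXT, DETAIL_ID TEXT, VALUE TEXT)"
    )
    conn.execute(
        'CREATE TABLE FORM_DETAIL (FORM_ID TEXT, DETAIL_ID TEXT, "MOD" TEXT)'
    )
    conn.executemany(
        "INSERT INTO ORDER_DETAILS VALUES (?, ?, ?, ?)",
        [
            ("A1", "X1", "B1", "Test"),
            ("A1", "X1", "B3", "10;20"),
            ("A4", "X4", "B1000", "L;M;O"),
        ],
    )
    conn.executemany(
        "INSERT INTO FORM_DETAIL VALUES (?, ?, ?)",
        [
            ("A1", "B1", "Text"), ("A1", "B2", "Input"), ("A1", "B3", "Select"),
            ("A4", "B1000", "Select"), ("A4", "B2000", "Text"),
            ("A4", "B3000", "Text"), ("A4", "B4000", "Input"),
            ("A4", "B5000", "Input"),
        ],
    )
    return conn


def form_rows_for_order(conn, order_id):
    """Return every FORM_DETAIL row of the order's form, VALUE or None."""
    sql = """
        SELECT f.FORM_ID, ? AS ORDER_ID, f.DETAIL_ID, o.VALUE
        FROM FORM_DETAIL AS f
        LEFT JOIN ORDER_DETAILS AS o
               ON o.FORM_ID = f.FORM_ID
              AND o.DETAIL_ID = f.DETAIL_ID
              AND o.ORDER_ID = ?
        WHERE f.FORM_ID = (SELECT FORM_ID FROM ORDER_DETAILS
                           WHERE ORDER_ID = ? LIMIT 1)
        ORDER BY f.rowid
    """
    return conn.execute(sql, (order_id, order_id, order_id)).fetchall()


if __name__ == "__main__":
    for row in form_rows_for_order(build_demo_db(), "X1"):
        print(row)
```

Keeping the ORDER_ID condition inside the ON clause is what preserves the FORM_DETAIL rows that have no matching value. Moving it to WHERE would filter those rows out again.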
I have a huge pyspark dataframe with segments and their subsegments, like this:
SegmentId SubSegmentStart SubSegmentEnd
1 a1 a2
1 a2 a3
2 b1 b2
3 c1 c2
3 c3 c4
3 c2 c3
I need to group the records by SegmentId and add a new Index column that orders the subsegments of each segment into a chain, using their start and end points.
So I need to get the following dataframe:
SegmentId SubSegmentStart SubSegmentEnd Index
1 a1 a2 0
1 a2 a3 1
2 b1 b2 0
3 c1 c2 0
3 c3 c4 2
3 c2 c3 1
How can I do this in PySpark?
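The chaining logic itself can be sketched in plain Python before worrying about how to distribute it. The sketch below assumes each segment forms exactly one unbranched chain (every start and every end appears once) and raises otherwise. The names chain_indexes and add_chain_index are hypothetical. In PySpark such per-group logic could possibly be applied with a grouped-map function (for example groupBy("SegmentId").applyInPandas(...) in Spark 3.x), which is not shown here.

```python
from collections import defaultdict


def chain_indexes(subsegments):
    """Map each (start, end) pair of one segment to its chain position."""
    by_start = {start: end for start, end in subsegments}
    ends = {end for _, end in subsegments}
    if len(by_start) != len(subsegments) or len(ends) != len(subsegments):
        raise ValueError("starts and ends must be unique within a segment")
    # The chain head is the only start point that is not also an end point.
    heads = [start for start in by_start if start not in ends]
    if len(heads) != 1:
        raise ValueError("expected exactly one chain head")
    index = {}
    current = heads[0]
    while current in by_start:
        nxt = by_start[current]
        index[(current, nxt)] = len(index)
        current = nxt
    if len(index) != len(subsegments):
        raise ValueError("subsegments do not form a single chain")
    return index


def add_chain_index(rows):
    """Append the chain index to each (segment_id, start, end) row."""
    grouped = defaultdict(list)
    for segment_id, start, end in rows:
        grouped[segment_id].append((start, end))
    positions = {seg: chain_indexes(pairs) for seg, pairs in grouped.items()}
    return [
        (seg, start, end, positions[seg][(start, end)])
        for seg, start, end in rows
    ]


if __name__ == "__main__":
    rows = [(3, "c1", "c2"), (3, "c3", "c4"), (3, "c2", "c3")]
    print(add_chain_index(rows))
```

The original row order is kept and only the index is attached, which matches the layout of the desired dataframe.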
I have two PySpark dataframes, A and B. I want to inner join them and select all columns from the first dataframe and a few columns from the second.
A_df
id column1 column2 column3 column4
1 A1 A2 A3 A4
2 A1 A2 A3 A4
3 A1 A2 A3 A4
4 A1 A2 A3 A4
B_df
id column1 column2 column3 column4 column5 column6
1 B1 B2 B3 B4 B5 B6
2 B1 B2 B3 B4 B5 B6
3 B1 B2 B3 B4 B5 B6
4 B1 B2 B3 B4 B5 B6
joined_df
id column1 column2 column3 column4 column5 column6
1 A1 A2 A3 A4 B5 B6
2 A1 A2 A3 A4 B5 B6
3 A1 A2 A3 A4 B5 B6
4 A1 A2 A3 A4 B5 B6
I am trying the code below:
joined_df = (A_df.alias('A_df').join(B_df.alias('B_df'),
on = A_df['id'] == B_df['id'],
how = 'inner')
.select('A_df.*',B_df.column5,B_df.column6))
But it gives a weird result in which the values in the columns are interchanged. How can I achieve the desired result? Thanks in advance.
What is the problem? Everything works as expected.
df1 = spark.read.option("header","true").option("inferSchema","true").csv("test1.csv")
df2 = spark.read.option("header","true").option("inferSchema","true").csv("test2.csv")
df1.alias('a').join(df2.alias('b'), ['id'], 'inner') \
.select('a.*', 'b.column5', 'b.column6').show()
+---+-------+-------+-------+-------+-------+-------+
| id|column1|column2|column3|column4|column5|column6|
+---+-------+-------+-------+-------+-------+-------+
|  1|     A1|     A2|     A3|     A4|     B5|     B6|
|  2|     A1|     A2|     A3|     A4|     B5|     B6|
|  3|     A1|     A2|     A3|     A4|     B5|     B6|
|  4|     A1|     A2|     A3|     A4|     B5|     B6|
+---+-------+-------+-------+-------+-------+-------+
I have a TSV file that I want to parse. There are empty fields in all columns resulting in displacement of the order of columns, so that not all the values I get using a certain column number actually come from that column.
Some fields contain long strings with empty space inside them. Also, some columns contain potential delimiters like ; | :
Input file
columnA columnB columnC columnD
A1 B1 C1 D1
B2 C2 D2
A3 D3
A4 B4 D4
Desired output
columnA columnB columnC columnD
A1      B1      C1      D1
        B2      C2      D2
A3                      D3
A4      B4              D4
$ file myfile
ASCII English text, with very long lines
$ awk '-F\t' '{print NF}' myfile | sort | uniq -c | tail -n
247871 136
I have found this code posted in reply to a similar question (https://unix.stackexchange.com/questions/29023/how-to-display-tsv-csv-in-console-when-empty-cells-are-missed-by-column-t), but I cannot make this work for my file:
sed ':x s/\(^\|\t\)\t/\1 \t/; t x' < file.tsv | column -t -s $'\t'
(The problem persists after importing into Excel.)
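For comparison, a parser that splits strictly on tabs keeps empty fields in place, so a value can never slide into a neighbouring column. Below is a minimal sketch using Python's csv module. The helper names read_tsv and align and the "-" placeholder are assumptions for illustration, and quoting is disabled on the assumption that quote characters inside the long text fields are literal.

```python
import csv
import io


def read_tsv(text):
    """Split TSV text into rows, keeping empty fields as empty strings."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
    )
    return [row for row in reader]


def align(rows, placeholder="-"):
    """Render rows as padded columns, marking empty cells with a placeholder."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad short rows and make empty cells visible.
    shown = [
        [cell if cell else placeholder for cell in row + [""] * (width - len(row))]
        for row in rows
    ]
    col_widths = [max(len(row[i]) for row in shown) for i in range(width)]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(row, col_widths)).rstrip()
        for row in shown
    )


if __name__ == "__main__":
    sample = "columnA\tcolumnB\tcolumnC\tcolumnD\n\tB2\tC2\tD2\nA3\t\t\tD3\n"
    print(align(read_tsv(sample)))
```

Because the split is on the tab character only, spaces and characters such as ; | : inside a field are left untouched.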
FieldEmpty=' '
FieldSize=${#FieldEmpty}
sed "
s/A/&/
t B
s/^ */ ${FieldEmpty}/
t B
: B
s/B/&/
t C
s/^ .\{${FieldSize}\}/&${FieldEmpty}/
t C
: C
s/C/&/
t D
s/^ \(.\{${FieldSize}\}\)\{2\}/&${FieldEmpty}/
t D
: D
s/D/&/
t
s/^ \(.\{${FieldSize}\}\)\{3\}/&${FieldEmpty}/
" YourFile
If more columns are used, an iterative approach should be used instead (same test/"insert" concept).
On my AIX/KSH (so it should behave the same as GNU sed with --posix -e):
$ cat YourFile
columnA columnB columnC columnD
A1 B1 C1 D1
B2 C2 D2
A3 D3
A4 B4 D4
$ FieldEmpty=' ';FieldSize=${#FieldEmpty};echo $FieldSize
11
$sed "..." YourFile
columnA columnB columnC columnD
A1 B1 C1 D1
B2 C2 D2
A3 D3
A4 B4 D4
If your file is tab-separated, you should use tab as the field separator in awk, like this:
$ column -t -s $'\t' file
columnA columnB columnC columnD
A1 1 B1 2 C1 3 D1 4
B2 2 C2 4 D2 4
A3 1 D3 4
A4 1 B4 2 D4 4
$xxd file
0000000: 636f 6c75 6d6e 4109 636f 6c75 6d6e 4209 columnA.columnB.
0000010: 636f 6c75 6d6e 4309 636f 6c75 6d6e 440a columnC.columnD.
0000020: 4131 2031 0942 3120 3209 4331 2033 0944 A1 1.B1 2.C1 3.D
0000030: 3120 340a 0942 3220 3209 4332 2034 0944 1 4..B2 2.C2 4.D
0000040: 3220 340a 4133 2031 0909 0944 3320 340a 2 4.A3 1...D3 4.
0000050: 4134 2031 0942 3420 3209 0944 3420 340a A4 1.B4 2..D4 4.
$ awk -F'\t' '{
    for (i=1; i<=NF; i++) {
        printf "%-8s ", $i
    }
    print ""
}' file
columnA  columnB  columnC  columnD
A1 1     B1 2     C1 3     D1 4
         B2 2     C2 4     D2 4
A3 1                       D3 4
A4 1     B4 2              D4 4
I have a report with 4 columns:
ColumnA|ColumnB|ColumnC|ColumnD
Row1 A1 B1 C1 D1
Row2 A1 B1 C1 D2
Row3 A1 B1 C1 D1
Row4 A1 B1 C1 D2
Row5 A1 B1 C1 D1
I tried grouping on the 4 columns, but the output had a blank line after every row.
In this report I would like to get the output as:
ColumnA|ColumnB|ColumnC|ColumnD
Row1 A1 B1 C1 D1
Row2 A1 B1 C1 D2
<-------------an empty space ----------->
Row3 A1 B1 C1 D1
Row4 A1 B1 C1 D2
<-------------an empty space ----------->
Row5 A1 B1 C1 D1
How can I achieve the above output?
A standard group by would sort the record like this:
ColumnA|ColumnB|ColumnC|ColumnD
Row1 A1 B1 C1 D1
Row3 A1 B1 C1 D1
Row5 A1 B1 C1 D1
Row2 A1 B1 C1 D2
Row4 A1 B1 C1 D2
Since you don't have a standard grouping, another approach may work. You basically want a blank line after the D2 value. This will only work if you always have D2 values at the end of a group.
Create a new blank detail section under the main section
Detail one
A1 B1 C1 D1
Detail two
<blank>
Then put a conditional suppress expression on detail two
ColumnD <> "D2"
Then whenever D2 is present the blank detail section will be displayed.
You can use a Formula instead of a field Value for grouping.
select Column4
case D1 : "Group1"
case D2 : "Group2"
case D3 : "Group3"
case D4 : "Group3"
case D5 : "Group3"
case D6 : "Group4"
default "Group5"
Is that your problem?
The blank lines can be generated as Group Footer.