Generate a CSV file formatted in Spark Scala - scala

I'm trying to generate a CSV file in Spark Scala from a DataFrame of 540,000 records. The DataFrame has only 5 fields, but one of them is empty. When I generate the CSV file, the empty field (string type) is written as a pair of quotes ("") and I want it to be left empty instead, without quotes. An example:
My DF:
+----------+------------+--------------+---------+------+
| c.date   | c.cdoffice | c.cdemployee | c.cdage | c.CM |
+----------+------------+--------------+---------+------+
| 20220630 | 0000       | 6678         |         | PR   |
| 20220630 | 0000       | 3971         |         | PR   |
| 20220630 | 0000       | 4861         |         | ME   |
| 20220630 | 0000       | 2191         |         | IT   |
| 20220630 | 0000       | 2673         |         | PI   |
+----------+------------+--------------+---------+------+
and this is what I apply:
salida.coalesce(1).write
  .format("csv")
  .option("header", true)
  .option("delimiter", ";")
  //.option("quote", "\u0000")   // I tried to put this and it doesn't work
  //.option("quoteAll", false)   // this doesn't work either
  //.option("quote", "")         // more of the same
  .mode("overwrite")
  .save(path)
This generates:
date;cdoffice;cdemployee;cdage;CM
20220630;0000;6687;"";PR
20220630;0000;3971;"";PR
20220630;0000;4861;"";ME
20220630;0000;2191;"";IT
20220630;0000;2673;"";PI
Expected:
date;cdoffice;cdemployee;cdage;CM
20220630;0000;6687;;PR
20220630;0000;3971;;PR
20220630;0000;4861;;ME
20220630;0000;2191;;IT
20220630;0000;2673;;PI
Any recommendations or suggestions? Thanks!
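One thing worth trying (my own suggestion, not something from the original post): Spark's CSV writer has an emptyValue option whose write-side default is a quoted empty string, which is exactly the "" showing up in the output. Overriding it, or converting empty strings to null before writing (nulls are written as nothing by default), should give unquoted empty fields. A minimal sketch reusing salida and path from above; the exact behaviour of emptyValue can vary between Spark versions, so the null-conversion variant is the more conservative one:

import org.apache.spark.sql.functions.{col, lit, when}

// Option 1: tell the CSV writer to emit empty strings as nothing instead of ""
salida.coalesce(1).write
  .format("csv")
  .option("header", true)
  .option("delimiter", ";")
  .option("emptyValue", "")
  .mode("overwrite")
  .save(path)

// Option 2: turn empty strings into nulls first; nulls are written unquoted by default
val salidaSinVacios = salida.withColumn(
  "cdage",
  when(col("cdage") === "", lit(null)).otherwise(col("cdage"))
)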

Related

How to explode a string column based on a specific delimiter in Spark Scala

I want to explode a string column based on a specific delimiter (| in my case).
I have a dataset like this:
+-----+--------------+
|Col_1|Col_2 |
+-----+--------------+
| 1 | aa|bb |
| 2 | cc |
| 3 | dd|ee |
| 4 | ff |
+-----+--------------+
I want an output like this:
+-----+--------------+
|Col_1|Col_2 |
+-----+--------------+
| 1 | aa |
| 1 | bb |
| 2 | cc |
| 3 | dd |
| 3 | ee |
| 4 | ff |
+-----+--------------+
Use the explode and split functions, and use \\ to escape |.
val df1 = df.select(col("Col_1"), explode(split(col("Col_2"),"\\|")).as("Col_2"))
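For a self-contained check, here is a sketch that reproduces the sample input and applies the same expression (the DataFrame construction is my own and assumes a SparkSession named spark, as in spark-shell):

import spark.implicits._
import org.apache.spark.sql.functions.{col, explode, split}

// hypothetical reproduction of the input shown above
val df = Seq((1, "aa|bb"), (2, "cc"), (3, "dd|ee"), (4, "ff")).toDF("Col_1", "Col_2")

val df1 = df.select(col("Col_1"), explode(split(col("Col_2"), "\\|")).as("Col_2"))
df1.show()  // one output row per delimited token, with Col_1 repeated for each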

PySpark Return Exact Match from list of strings

I have a dataset as follows:
| id | text               |
|----|--------------------|
| 01 | hello world        |
| 02 | this place is hell |
I also have a list of keywords I'm searching for:
Keywords = ['hell', 'horrible', 'sucks']
When using the following solution with .rlike() or .contains(), sentences with either partial or exact matches to the list of words are returned as true. I would like only exact matches to be returned.
Current code:
from pyspark.sql import functions as F

KEYWORDS = 'hell|horrible|sucks'
df = (
    df
    .select(
        F.col('id'),
        F.col('text'),
        F.when(F.col('text').rlike(KEYWORDS), 1).otherwise(0).alias('keyword_found')
    )
)
Current output:
| id | text               | keyword_found |
|----|--------------------|---------------|
| 01 | hello world        | 1             |
| 02 | this place is hell | 1             |
Expected output:
| id | text               | keyword_found |
|----|--------------------|---------------|
| 01 | hello world        | 0             |
| 02 | this place is hell | 1             |
Try the code below; I have only changed the keyword pattern:
from pyspark.sql.functions import col,when
data = [["01","hello world"],["02","this place is hell"]]
schema =["id","text"]
df2 = spark.createDataFrame(data, schema)
df2.show()
+---+------------------+
| id| text|
+---+------------------+
| 01| hello world|
| 02|this place is hell|
+---+------------------+
KEYWORDS = '(hell|horrible|sucks)$'
df = (
    df2
    .select(
        col('id'),
        col('text'),
        when(col('text').rlike(KEYWORDS), 1).otherwise(0).alias('keyword_found')
    )
)
df.show()
+---+------------------+-------------+
| id| text|keyword_found|
+---+------------------+-------------+
| 01| hello world| 0|
| 02|this place is hell| 1|
+---+------------------+-------------+
Let me know if you need more help on this.
This should work:
Keywords = 'hell|horrible|sucks'
df = (
    df.select(
        F.col('id'),
        F.col('text'),
        F.when(F.col('text').rlike('(' + Keywords + r')(\s|$)'), 1).otherwise(0).alias('keyword_found')
    )
)
| id | text               | keyword_found |
|----|--------------------|---------------|
| 01 | hello world        | 0             |
| 02 | this place is hell | 1             |
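For reference, a Spark Scala sketch of the same idea (my own, not part of either answer above); a word-boundary \b on both sides also rules out left-side partial matches such as "shell":

import org.apache.spark.sql.functions.{col, when}

// assumes a DataFrame df with columns id and text, as in the example above
val keywords = "hell|horrible|sucks"
val pattern = s"\\b($keywords)\\b"   // \b = word boundary, so "hello" and "shell" do not match

val result = df.select(
  col("id"),
  col("text"),
  when(col("text").rlike(pattern), 1).otherwise(0).alias("keyword_found")
)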

How to remove rows from pyspark dataframe using pattern matching?

I have a pyspark dataframe read from a CSV file that has a value column which contains hexadecimal values.
| date     | part  | feature | value        |
|----------|-------|---------|--------------|
| 20190503 | par1 | feat2 | 0x0 |
| 20190503 | par1 | feat3 | 0x01 |
| 20190501 | par2 | feat4 | 0x0f32 |
| 20190501 | par5 | feat9 | 0x00 |
| 20190506 | par8 | feat2 | 0x00f45 |
| 20190507 | par1 | feat6 | 0x0e62300000 |
| 20190501 | par11 | feat3 | 0x000000000 |
| 20190501 | par21 | feat5 | 0x03efff |
| 20190501 | par3 | feat9 | 0x000 |
| 20190501 | par6 | feat5 | 0x000000 |
| 20190506 | par5 | feat8 | 0x034edc45 |
| 20190506 | par8 | feat1 | 0x00000 |
| 20190508 | par3 | feat6 | 0x00000000 |
| 20190503 | par4 | feat3 | 0x0c0deffe21 |
| 20190503 | par6 | feat4 | 0x0000000000 |
| 20190501 | par3 | feat6 | 0x0123fe |
| 20190501 | par7 | feat4 | 0x00000d0 |
The requirement is to remove rows whose value column contains values like 0x0, 0x00, 0x000, etc., which evaluate to decimal 0 (zero). The number of 0's after '0x' varies across the dataframe. I tried removing them through pattern matching, but I wasn't successful.
from pyspark.sql.types import StructField, StructType, StringType
from pyspark.sql.functions import regexp_extract, col

myFile = sc.textFile("file.txt")
header = myFile.first()
fields = [StructField(field_name, StringType(), True) for field_name in header.split(',')]
schema = StructType(fields)
myFile_header = myFile.filter(lambda l: "date" in l)
myFile_NoHeader = myFile.subtract(myFile_header)
myFile_df = myFile_NoHeader.map(lambda line: line.split(",")).toDF(schema)
## this is the pattern match I tried
result = myFile_df.withColumn('Test', regexp_extract(col('value'), '(0x)(0\1*\1*)', 2))
result.show()
The other approach I used was a UDF:
def convert_value(x):
    return int(x, 16)
Using this UDF in PySpark gives me:
ValueError: invalid literal for int() with base 16: value
I don't really understand your regular expression, but if you want to match all strings that consist of 0x followed only by zeros, you can use ^0x0+$. Filtering with a regular expression can be achieved with rlike, and the tilde (~) negates the match.
l = [('20190503', 'par1', 'feat2', '0x0'),
('20190503', 'par1', 'feat3', '0x01'),
('20190501', 'par2', 'feat4', '0x0f32'),
('20190501', 'par5', 'feat9', '0x00'),
('20190506', 'par8', 'feat2', '0x00f45'),
('20190507', 'par1', 'feat6', '0x0e62300000'),
('20190501', 'par11', 'feat3', '0x000000000'),
('20190501', 'par21', 'feat5', '0x03efff'),
('20190501', 'par3', 'feat9', '0x000'),
('20190501', 'par6', 'feat5', '0x000000'),
('20190506', 'par5', 'feat8', '0x034edc45'),
('20190506', 'par8', 'feat1', '0x00000'),
('20190508', 'par3', 'feat6', '0x00000000'),
('20190503', 'par4', 'feat3', '0x0c0deffe21'),
('20190503', 'par6', 'feat4', '0x0000000000'),
('20190501', 'par3', 'feat6', '0x0123fe'),
('20190501', 'par7', 'feat4', '0x00000d0')]
columns = ['date', 'part', 'feature', 'value']
df=spark.createDataFrame(l, columns)
expr = "^0x0+$"
df.filter(~ df["value"].rlike(expr)).show()
Output:
+--------+-----+-------+------------+
| date| part|feature| value|
+--------+-----+-------+------------+
|20190503| par1| feat3| 0x01|
|20190501| par2| feat4| 0x0f32|
|20190506| par8| feat2| 0x00f45|
|20190507| par1| feat6|0x0e62300000|
|20190501|par21| feat5| 0x03efff|
|20190506| par5| feat8| 0x034edc45|
|20190503| par4| feat3|0x0c0deffe21|
|20190501| par3| feat6| 0x0123fe|
|20190501| par7| feat4| 0x00000d0|
+--------+-----+-------+------------+
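For completeness, the same filter in Spark Scala reads almost identically (a sketch, assuming a DataFrame df with a string value column):

import org.apache.spark.sql.functions.col

// keep only rows whose value is NOT "0x" followed solely by zeros
val result = df.filter(!col("value").rlike("^0x0+$"))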

Get the row corresponding to the latest timestamp in a Spark Dataset using Scala

I am relatively new to Spark and Scala. I have a dataframe which has the following format:
| Col1 | Col2 | Col3 | Col_4 | Col_5 | Col_TS                  | Col_7 |
|------|------|------|-------|-------|-------------------------|-------|
| 1234 | AAAA | 1111 | afsdf | ewqre | 1970-01-01 00:00:00.0   | false |
| 1234 | AAAA | 1111 | ewqrw | dafda | 2017-01-17 07:09:32.748 | true  |
| 1234 | AAAA | 1111 | dafsd | afwew | 2015-01-17 07:09:32.748 | false |
| 5678 | BBBB | 2222 | afsdf | qwerq | 1970-01-01 00:00:00.0   | true  |
| 5678 | BBBB | 2222 | bafva | qweqe | 2016-12-08 07:58:43.04  | false |
| 9101 | CCCC | 3333 | caxad | fsdaa | 1970-01-01 00:00:00.0   | false |
What I need to do is to get the row that corresponds to the latest timestamp.
In the example above, the keys are Col1, Col2 and Col3. Col_TS represents the timestamp and Col_7 is a boolean that determines the validity of the record.
What I want to do is to find a way to group these records based on the keys and retain the one that has the latest timestamp.
So the output of the operation in the dataframe above should be:
| Col1 | Col2 | Col3 | Col_4 | Col_5 | Col_TS                  | Col_7 |
|------|------|------|-------|-------|-------------------------|-------|
| 1234 | AAAA | 1111 | ewqrw | dafda | 2017-01-17 07:09:32.748 | true  |
| 5678 | BBBB | 2222 | bafva | qweqe | 2016-12-08 07:58:43.04  | false |
| 9101 | CCCC | 3333 | caxad | fsdaa | 1970-01-01 00:00:00.0   | false |
I came up with a partial solution, but this way I can only return a DataFrame with the key columns the records are grouped on, not the other columns.
df = df.groupBy("Col1","Col2","Col3").agg(max("Col_TS"))
| Col1 | Col2 | Col3 | max(Col_TS)             |
|------|------|------|-------------------------|
| 1234 | AAAA | 1111 | 2017-01-17 07:09:32.748 |
| 5678 | BBBB | 2222 | 2016-12-08 07:58:43.04  |
| 9101 | CCCC | 3333 | 1970-01-01 00:00:00.0   |
Can someone help me in coming up with a Scala code for performing this operation?
You can use a window function as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val windowSpec = Window.partitionBy("Col1", "Col2", "Col3").orderBy(col("Col_TS").desc)

df.withColumn("maxTS", first("Col_TS").over(windowSpec))
  .select("*").where(col("maxTS") === col("Col_TS"))
  .drop("maxTS")
  .show(false)
You should get output like the following:
+----+----+----+-----+-----+-----------------------+-----+
|Col1|Col2|Col3|Col_4|Col_5|Col_TS                 |Col_7|
+----+----+----+-----+-----+-----------------------+-----+
|5678|BBBB|2222|bafva|qweqe|2016-12-08 07:58:43.04 |false|
|1234|AAAA|1111|ewqrw|dafda|2017-01-17 07:09:32.748|true |
|9101|CCCC|3333|caxad|fsdaa|1970-01-01 00:00:00.0  |false|
+----+----+----+-----+-----+-----------------------+-----+
One option is to first order the DataFrame by Col_TS, then group by Col1, Col2 and Col3 and take the last item from each of the other columns:
import org.apache.spark.sql.functions.{col, last}

val val_columns = Seq("Col_4", "Col_5", "Col_TS", "Col_7").map(x => last(col(x)).alias(x))
(df.orderBy("Col_TS")
  .groupBy("Col1", "Col2", "Col3")
  .agg(val_columns.head, val_columns.tail: _*)
  .show)
+----+----+----+-----+-----+--------------------+-----+
|Col1|Col2|Col3|Col_4|Col_5| Col_TS|Col_7|
+----+----+----+-----+-----+--------------------+-----+
|1234|AAAA|1111|ewqrw|dafda|2017-01-17 07:09:...| true|
|9101|CCCC|3333|caxad|fsdaa|1970-01-01 00:00:...|false|
|5678|BBBB|2222|bafva|qweqe|2016-12-08 07:58:...|false|
+----+----+----+-----+-----+--------------------+-----+
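Another common pattern for this kind of latest-row-per-key problem (my own sketch, not taken from the answers above) is row_number over a window; unlike orderBy followed by groupBy/last, the ordering here is explicit and guaranteed:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val w = Window.partitionBy("Col1", "Col2", "Col3").orderBy(col("Col_TS").desc)

val latest = df
  .withColumn("rn", row_number().over(w))   // rn = 1 is the most recent Col_TS per key
  .filter(col("rn") === 1)
  .drop("rn")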

Missing spark partition column in partition table

I am creating a partitioned parquet file in HDFS with a datasource.
The datasource looks like:
scala> sqlContext.sql("select * from parquetFile").show()
+--------+-----------------+
|area_tag| vin|
+--------+-----------------+
| 0|LSKG5GC19BA210794|
| 0|LSKG5GC15BA210372|
| 0|LSKG5GC18BA210107|
| 0|LSKG4GC16BA211971|
| 0|LSKG4GC19BA210233|
| 0|LSKG5GC17BA210017|
| 0|LSKG4GC19BA211785|
| 0|LSKG4GC15BA210004|
| 0|LSKG4GC12BA211739|
| 0|LSKG4GC18BA210238|
| 0|LSKG4GC13BA210261|
| 0|LSKG5GC16BA210106|
| 0|LSKG4GC1XBA210287|
| 0|LSKG4GC10BA210265|
| 0|LSKG5GC10CA210118|
| 0|LSKG5GC16BA212289|
| 0|LSKG5GC1XBA211016|
| 0|LSKG5GC15CA210194|
| 0|LSKG5GC12CA210119|
| 0|LSKG4GC19BA211379|
+--------+-----------------+
I create the partition with the following commands (I did it in the Spark shell):
scala> val df1 = sqlContext.sql("select * from parquetFile where area_tag=0")
scala> df1.write.parquet("/tmp/test_table3/area_tag=0")
scala> val p1 = sqlContext.read.parquet("/tmp/test_table3")
When I print the data by loading from the partitioned table, it shows:
scala> p1.show()
+--------+-----------------+
|area_tag| vin|
+--------+-----------------+
| |LSKG5GC19BA210794|
| |LSKG5GC15BA210372|
| |LSKG5GC18BA210107|
| |LSKG4GC16BA211971|
| |LSKG4GC19BA210233|
| |LSKG5GC17BA210017|
| |LSKG4GC19BA211785|
| |LSKG4GC15BA210004|
| |LSKG4GC12BA211739|
| |LSKG4GC18BA210238|
| |LSKG4GC13BA210261|
| |LSKG5GC16BA210106|
| |LSKG4GC1XBA210287|
| |LSKG4GC10BA210265|
| |LSKG5GC10CA210118|
| |LSKG5GC16BA212289|
| |LSKG5GC1XBA211016|
| |LSKG5GC15CA210194|
| |LSKG5GC12CA210119|
| |LSKG4GC19BA211379|
+--------+-----------------+
only showing top 20 rows
The partition column is missing. What happened to the column? Is it a bug?
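No answer is included in this dump, but for reference, the usual way to get a partitioned parquet layout is to let Spark create the area_tag=<value> directories itself via partitionBy instead of encoding the partition value in the path by hand (a sketch, not from the original thread):

// let Spark create the partition directories and keep track of the partition column
sqlContext.sql("select * from parquetFile")
  .write
  .partitionBy("area_tag")
  .parquet("/tmp/test_table3")

// on read, area_tag is reconstructed from the directory names
val p1 = sqlContext.read.parquet("/tmp/test_table3")
p1.show()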