I am new to Scala and I have been having trouble fetching distinct text values from each cell of a row. My dataframe looks somewhat like the one below. My intent is to eliminate duplicate skills for each candidate id.
+------------+-----------------+----------+--------+
|candidate_id|skills           |join_date |location|
+------------+-----------------+----------+--------+
|1789s3      |java; c++ ; java |2012-09-22|Mumbai  |
|agduch23    |ppt ; ppt ; miner|2018-02-02|Banglore|
|sgdtev      |office 365;      |2019-03-10|Noida   |
+------------+-----------------+----------+--------+
My final resultant dataframe should look somewhat like this -
+------------+-----------+----------+--------+
|candidate_id|skills     |join_date |location|
+------------+-----------+----------+--------+
|1789s3      |java; c++  |2012-09-22|Mumbai  |
|agduch23    |ppt; miner |2018-02-02|Banglore|
|sgdtev      |office 365;|2019-03-10|Noida   |
+------------+-----------+----------+--------+
I use the following command in SQL to do this.
string_agg(ARRAY_TO_STRING(ARRAY((select distinct skill from unnest(split(skills_agg, '; ')) as skill)), '; ')) as skills_distinct
Is there a way I can do this in Scala without using SQL?
Thanks in advance
If you're using Spark 3.0 or greater, you can remove duplicates by splitting your string into an array with split, then using the function array_distinct to remove duplicates, and finally rebuilding the string with concat_ws, as follows:
import org.apache.spark.sql.functions.{array_distinct, col, concat_ws, split}
dataframe.withColumn("skills", concat_ws("; ", array_distinct(split(col("skills"), "; "))))
You can find all the functions you can use in the Scaladoc of the Scala API's functions object.
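For reference, here is a minimal end-to-end sketch with the sample data above; the regex split "\\s*;\\s*" is an assumption, used to absorb the inconsistent spacing around the semicolons in the sample rows:

import org.apache.spark.sql.functions.{array_distinct, col, concat_ws, split}
import spark.implicits._   // assumes a spark-shell / existing SparkSession named spark

// Sample data mirroring the question
val df = Seq(
  ("1789s3", "java; c++ ; java", "2012-09-22", "Mumbai"),
  ("agduch23", "ppt ; ppt ; miner", "2018-02-02", "Banglore"),
  ("sgdtev", "office 365;", "2019-03-10", "Noida")
).toDF("candidate_id", "skills", "join_date", "location")

// Split on ";" with optional surrounding whitespace so entries like "ppt " and "ppt"
// compare equal, drop duplicates with array_distinct, then rebuild the string.
val result = df.withColumn("skills",
  concat_ws("; ", array_distinct(split(col("skills"), "\\s*;\\s*"))))

result.show(false)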
I am trying to convert the below Hive SQL statement into a Spark DataFrame expression and I am getting an error.
trim(regexp_extract(message_comment_txt, '(^.*paid\\s?\\$?)(.*?)(\\s?toward.*)', 2))
Sample data: message_comment_txt = "DAY READER, paid 12.76 toward the cost"
I need to get the output as 12.76
Please help me with the equivalent Spark DataFrame statement.
Try the paid\\s+(.*?)\\s+toward regex.
df.withColumn("extract",regexp_extract(col("message_comment_txt"),"paid\\s+(.*?)\\s+toward",1)).show(false)
//for case insensitive
df.withColumn("extract",regexp_extract(col("message_comment_txt"),"(?i)paid\\s+(.*?)\\s+(?i)toward",1)).show(false)
//+--------------------------------------+-------+
//|message_comment_txt |extract|
//+--------------------------------------+-------+
//|DAY READER, paid 12.76 toward the cost|12.76 |
//+--------------------------------------+-------+
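If you prefer to keep the original Hive pattern unchanged (including the optional $ sign), the same expression translates almost literally; a sketch, assuming the same column name as in the question:

import org.apache.spark.sql.functions.{col, regexp_extract, trim}

// Same pattern as the original Hive expression; group 2 is the amount, trim strips stray spaces.
df.withColumn("extract",
  trim(regexp_extract(col("message_comment_txt"), "(^.*paid\\s?\\$?)(.*?)(\\s?toward.*)", 2)))
  .show(false)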
Hi, I am trying to add a column to my Spark DataFrame, calculating its value from an existing column. I am writing the code below.
val df1=spark.sql("select id,dt1,salary frm dbdt1.tabledt1")
val df2=df1.withColumn("new_date",WHEN (month(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM- yyyy')))
IN (01,02,03)) THEN
CONCAT(CONCAT(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM- yyyy')))-1,'-'),
substr(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM-yyyy'))),3,4))
.otherwise(CONCAT(CONCAT(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM- yyyy'))),'-')
,SUBSTR(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM-yyyy')))+1,3,4))))
But it always shows the error: unclosed character literal. Can someone please guide me on how I should add this new column or modify the existing code?
Incorrect syntax in many places. First, I suggest you look at a few Spark SQL examples online and also the org.apache.spark.sql.functions API documentation, because your uses of WHEN, CONCAT, and IN are all incorrect.
Scala strings are enclosed in double quotes; you appear to be using SQL string syntax.
'dd-MM-yyyy' should be "dd-MM-yyyy"
To reference a column dt1 on DataFrame df1 you can use one of the following:
df1("dt1")
col("dt1") // if you import org.apache.spark.sql.functions.col
$"dt1" // if you import spark.implicits._ locally
For example:
from_unixtime(unix_timestamp(col("dt1")), "dd-MM-yyyy")
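Putting the pieces together, here is a minimal sketch of how the whole expression might look with when/otherwise; the column names, the "dd-MM-yyyy" input format, and the fiscal-year interpretation are assumptions carried over from the question:

import org.apache.spark.sql.functions._

// Parse dt1 once and reuse it; assumes dt1 is a "dd-MM-yyyy" string.
val parsed = to_date(from_unixtime(unix_timestamp(col("dt1"), "dd-MM-yyyy")))
val yr     = year(parsed)

// Months 1-3 map to the previous year's label (e.g. 2019 -> "2018-19"),
// everything else to the current one (e.g. 2019 -> "2019-20").
val df2 = df1.withColumn("new_date",
  when(month(parsed).isin(1, 2, 3),
       concat((yr - 1).cast("string"), lit("-"), substring(yr.cast("string"), 3, 2)))
    .otherwise(
       concat(yr.cast("string"), lit("-"), substring((yr + 1).cast("string"), 3, 2))))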
Spark version 1.6.0, Scala version 2.10.5.
I have a spark-sql dataframe df like this,
+---------------+-------------------------------+
|address        |attributes                     |
+---------------+-------------------------------+
|1314 44 Avenue |Tours, Mechanics, Shopping     |
|115 25th Ave   |Restaurant, Mechanics, Brewery |
+---------------+-------------------------------+
From this dataframe, I would like values as below,
Tours, Mechanics, Shopping, Brewery
If I do this,
df.select(df("attributes")).collect().foreach(println)
I get,
[Tours, Mechanics, Shopping]
[Restaurant, Mechanics, Brewery]
I thought I could use flatMap instead, and found this, so I tried to put this into a variable using,
val allValues = df.withColumn(df("attributes"), explode("attributes"))
but I am getting an error:
error: type mismatch;
found   : org.apache.spark.sql.Column
required: String
I was thinking if I can get an output using explode I can use distinct to get the unique values after flattening them.
How can I get the desired output?
I strongly recommend you use a Spark 2.x version. In Cloudera, when you issue "spark-shell", it launches the 1.6.x version; however, if you issue "spark2-shell", you get the 2.x shell. Check with your admin.
But if you need a Spark 1.6 RDD-based solution, try this.
import spark.implicits._
import scala.collection.mutable

val df = Seq(("1314 44 Avenue", Array("Tours", "Mechanics", "Shopping")),
             ("115 25th Ave", Array("Restaurant", "Mechanics", "Brewery"))).toDF("address", "attributes")

df.rdd.flatMap(x => x.getAs[mutable.WrappedArray[String]]("attributes")).distinct().collect.foreach(println)
Results:
Brewery
Shopping
Mechanics
Restaurant
Tours
If the "attribute" column is not an array, but comma separated string, then use the below one which gives you same results
val df = Seq(("1314 44 Avenue","Tours,Mechanics,Shopping"),
("115 25th Ave","Restaurant,Mechanics,Brewery")).toDF("address","attributes")
df.rdd.flatMap( x => x.getAs[String]("attributes").split(",") ).distinct().collect.foreach(println)
The problem is that withColumn expects a String as its first argument (the name of the added column), but you're passing it a Column here: df.withColumn(df("attributes"), ...).
You only need to pass "attributes" as a String.
Additionally, you need to pass a Column to the explode function, but you're passing a String - to make it a column you can use df("columnName") or the Scala shorthand $ syntax, $"columnName".
Hope this example can help you.
import org.apache.spark.sql.functions._
val allValues = df.select(explode($"attributes").as("attributes")).distinct
Note that this will only preserve the attributes Column, since you want the distinct elements on that one.
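If attributes is stored as a comma-separated string rather than an array (as the printed table in the question suggests), you can split it first and then explode; a sketch, assuming a comma (with optional following space) as the separator:

import org.apache.spark.sql.functions.{explode, split}

// Split the string into an array, explode it into one attribute per row, then deduplicate.
val allValues = df.select(explode(split($"attributes", ",\\s*")).as("attributes")).distinct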
I am trying to use PostgreSQL Full Text Search. I read that the stop words (words ignored for indexing) are implemented via dictionary. But I would like to give the user limited control over the stop words (inserting new ones), so I grouped them in a table.
From the example below:
select strip(to_tsvector('simple', texto)) from longtxts where id = 23;
I can get the vector:
{'alta' 'aluno' 'cada' 'do' 'em' 'leia' 'livro' 'pedir' 'que' 'trecho' 'um' 'voz'}
And now I would like to remove the elements from the stopwords table:
select array(select palavra_proibida from stopwords);
That returns the array:
{a,as,ao,aos,com,default,e,eu,o,os,da,das,de,do,dos,em,lhe,na,nao,nas,no,nos,ou,por,para,pra,que,sem,se,um,uma}
Then, following documentation:
ts_delete(vector tsvector, lexemes text[]) returns tsvector
    removes any occurrence of lexemes in lexemes from vector
    example: ts_delete('fat:2,4 cat:3 rat:5A'::tsvector, ARRAY['fat','rat'])
I tried a lot. For example:
select ts_delete((select strip(to_tsvector('simple', texto)) from longtxts where id = 23), array[(select palavra_proibida from stopwords)]);
But I always receive the error:
ERROR: function ts_delete(tsvector, character varying[]) does not exist
LINE 1: select ts_delete((select strip(to_tsvector('simple', texto))...
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
Could anyone help me? Thanks in advance!
ts_delete was introduced in PostgreSQL 9.6. Based on the error message, you're using an earlier version. You may try select version(); to be sure.
When you land on the PostgreSQL online documentation with a web search, it may correspond to any version. The version is in the URL and there's a "This page in another version" set of links at the top of each page to help switching to the equivalent doc for a different version.
I am quite new to Scala / Spark and I have been thrown into the deep end. I have been trying hard for several weeks to find a solution to a seemingly simple problem in Scala 2.11.8, but have been unable to find a good one.
I have a large database in CSV format, close to 150 GB, with plenty of null values, which needs to be reduced and cleaned based on the values of individual columns.
The schema of the original CSV file is as follows:
Column 1: Double
Column 2: Integer
Column 3: Double
Column 4: Double
Column 5: Integer
Column 6: Double
Column 7: Integer
So, I want to conditionally map through all the rows of the CSV file and export the results to another CSV file with the following conditions for each row:
If the value for column 4 is not null, then the values for columns 4, 5, 6 and 7 of that row should be stored as an array called lastValuesOf4to7. (In the dataset if the element in column 4 is not null, then columns 1, 2 and 3 are null and can be ignored)
If the value of column 3 is not null, then the values of columns 1, 2 and 3 and the four elements from the lastValuesOf4to7 array, as described above, should be exported as a new row into another CSV file called condensed.csv. (In the dataset if the element in column 3 is not null, then columns 4, 5, 6 & 7 are null and can be ignored)
So in the end I should get a csv file called condensed.csv, which has 7 columns.
I have tried using the following code in Scala but have not been able to progress further:
import scala.io.Source

object structuringData {
  def main(args: Array[String]) {
    val data = Source.fromFile("/path/to/file.csv")
    var lastValuesOf4to7 = Array("0","0","0","0")
    val lines = data.getLines // Get the lines of the file
    val splitLine = lines.map(s => s.split(',')).toArray // This gives an out of memory error since the original file is huge.
    data.close
  }
}
As you can see from the code above, I have tried to move it into an array but have been unable to progress further since I am unable to process each line individually.
I am quite certain that there must be a straightforward solution to processing CSV files in Scala / Spark.
Use the spark-csv package, then use a SQL query to query the data and apply the filters according to your use case, and export the result at the end.
If you are using Spark 2.0.0, CSV support is built into spark-sql; if you are using an older version, add the spark-csv dependency accordingly.
You can find a link to the spark-csv here.
You can also look at the example here: http://blog.madhukaraphatak.com/analysing-csv-data-in-spark/
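For example, with Spark 2.0 the CSV reader is built in, so the loading-and-filtering part described above might look like the sketch below; the column names and paths are illustrative, and the carry-forward of columns 4-7 would still need an explicit row ordering to be well defined:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("condense").getOrCreate()
import spark.implicits._

// Read the raw CSV; column names here are made up for readability.
val df = spark.read
  .option("header", "false")
  .option("inferSchema", "true")
  .csv("/path/to/file.csv")
  .toDF("c1", "c2", "c3", "c4", "c5", "c6", "c7")

// Keep only rows where column 3 is present, as a first filtering step.
val filtered = df.filter($"c3".isNotNull)
filtered.write.csv("/path/to/condensed")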
Thank you for the response. I managed to create a solution myself using a Bash script. I had to start with a blank condensed.csv file first. My code shows how easy it was to achieve this:
#!/bin/bash
OLDIFS=$IFS
IFS=","
last1=0
last2=0
last3=0
last4=0
while read f1 f2 f3 f4 f5 f6 f7
do
    if [[ $f4 != "" ]];
    then
        last1=$f4
        last2=$f5
        last3=$f6
        last4=$f7
    elif [[ $f3 != "" ]];
    then
        echo "$f1,$f2,$f3,$last1,$last2,$last3,$last4" >> path/to/condensed.csv
    fi
done < $1
IFS=$OLDIFS
If the script is saved with the name extractcsv.sh then it should be run using the following format:
$ ./extractcsv.sh path/to/original/file.csv
This only goes to confirm my observation that ETL is easier in Bash than in Scala. Thank you for your help, though.
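For completeness, the same carry-forward logic can also be written in plain Scala by streaming the lines instead of materializing them into an array (which is what caused the out-of-memory error in the original attempt); a sketch under the same column assumptions as the Bash script:

import java.io.PrintWriter
import scala.io.Source

object ExtractCsv {
  def main(args: Array[String]): Unit = {
    val source = Source.fromFile(args(0))
    val out    = new PrintWriter("path/to/condensed.csv")
    // Carry the last seen values of columns 4-7 across rows, like last1..last4 in the script.
    var lastValuesOf4to7 = Array("0", "0", "0", "0")
    try {
      for (line <- source.getLines()) {      // streamed, never fully held in memory
        val f = line.split(",", -1)          // limit -1 keeps trailing empty fields
        if (f(3).nonEmpty)
          lastValuesOf4to7 = Array(f(3), f(4), f(5), f(6))
        else if (f(2).nonEmpty)
          out.println((Seq(f(0), f(1), f(2)) ++ lastValuesOf4to7).mkString(","))
      }
    } finally {
      source.close()
      out.close()
    }
  }
}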