How to regroup several files into one? - scala

I'm a beginner with Spark, and I have to regroup all the data stored in several files into one.
Note: I have already used Talend, and my goal is to do the same thing but with Spark (Scala).
Example:
File 1:
id | attr1.1 | attr1.2 | attr1.3
1 | aaa | aab | aac
2 | aad | aae | aaf
File 2:
id | attr2.1 | attr2.2 | attr2.3
1 | lll | llm | lln
2 | llo | llp | llq
File 3:
id | attr3.1 | attr3.2 | attr3.3
1 | sss | sst | ssu
2 | ssv | ssw | ssx
Output wanted:
id |attr1.1|attr1.2|attr1.3|attr2.1|attr2.2|attr2.3|attr3.1|attr3.2|attr3.3
1 | aaa | aab | aac | lll | llm | lln | sss | sst | ssu
2 | aad | aae | aaf | llo | llp | llq | ssv | ssw | ssx
I have 9 files about orders, customers, items, ... and several hundred thousand lines, which is why I have to use Spark. Fortunately, the data can be tied together by ids.
The file format is .csv.
Final objective: to produce some visualizations from the file generated by Spark.
Question: Can you give me some clues on how to do this task, please? I have seen several approaches with RDDs or DataFrames but I am completely lost...
Thanks

You didn't specify anything about the original file format, so assuming you've already got the files loaded as DataFrames f1, f2, ... you can create a unified DataFrame by joining them:
val unified = f1.join(f2, f1("id") === f2("id")).join(f3, f1("id") === f3("id")) // ... and so on for the remaining files
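Since the files are .csv, here is a minimal, hedged sketch of the full flow; the paths file1.csv, file2.csv, file3.csv and unified_output are hypothetical, and a header row is assumed:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("regroup-files").getOrCreate()

// Hypothetical input paths; adjust to your real file locations
val f1 = spark.read.option("header", "true").csv("file1.csv")
val f2 = spark.read.option("header", "true").csv("file2.csv")
val f3 = spark.read.option("header", "true").csv("file3.csv")

// Joining on Seq("id") keeps a single id column in the result
val unified = f1.join(f2, Seq("id")).join(f3, Seq("id"))

unified.write.option("header", "true").csv("unified_output")
Joining with Seq("id") is just one option; the explicit f1("id") === f2("id") form from the answer works too, it simply keeps the duplicate id columns unless you drop them.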

Related

Compare consecutive rows and extract words(excluding the subsets) in spark

I am working on a Spark DataFrame. The input DataFrame looks like below (Table 1). I need to write logic to get the keywords with the maximum length for each session id. There can be multiple keywords in the output for each session id. The expected output looks like Table 2.
Input dataframe:
(Table 1)
|-----------+------------+-----------------------------------|
| session_id| value      | Timestamp                         |
|-----------+------------+-----------------------------------|
| 1         | cat        | 2021-01-11T13:48:54.2514887-05:00 |
| 1         | catc       | 2021-01-11T13:48:54.3514887-05:00 |
| 1         | catch      | 2021-01-11T13:48:54.4514887-05:00 |
| 1         | par        | 2021-01-11T13:48:55.2514887-05:00 |
| 1         | part       | 2021-01-11T13:48:56.5514887-05:00 |
| 1         | party      | 2021-01-11T13:48:57.7514887-05:00 |
| 1         | partyy     | 2021-01-11T13:48:58.7514887-05:00 |
| 2         | fal        | 2021-01-11T13:49:54.2514887-05:00 |
| 2         | fall       | 2021-01-11T13:49:54.3514887-05:00 |
| 2         | falle      | 2021-01-11T13:49:54.4514887-05:00 |
| 2         | fallen     | 2021-01-11T13:49:54.8514887-05:00 |
| 2         | Tem        | 2021-01-11T13:49:56.5514887-05:00 |
| 2         | Temp       | 2021-01-11T13:49:56.7514887-05:00 |
|-----------+------------+-----------------------------------|
Expected Output:
(Table 2)
|-----------+------------|
| session_id| value      |
|-----------+------------|
| 1         | catch      |
| 1         | partyy     |
| 2         | fallen     |
| 2         | Temp       |
|-----------+------------|
Solution I tried:
I added another column called col_length which captures the length of each word in the value column. Later on I tried to compare each row with its subsequent row to see if it has the maximum length. But this solution only works partly.
val df = spark.read.parquet("/project/project_name/abc")
val dfM = df.select($"session_id",$"value",$"Timestamp").withColumn("col_length",length($"value"))
val ts = Window
.orderBy("session_id")
.rangeBetween(Window.unboundedPreceding, Window.currentRow)
val result = dfM
.withColumn("running_max", max("col_length") over ts)
.where($"running_max" === $"col_length")
.select("session_id", "value", "Timestamp")
Current Output:
|-----------+------------|
| session_id| value      |
|-----------+------------|
| 1         | catch      |
| 2         | fallen     |
|-----------+------------|
Multiple columns do not work inside an orderBy clause with a window function, so I didn't get the desired output; I got 1 row per session id. Any suggestions would be highly appreciated. Thanks in advance.
You can solve it by using the lead function:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lead, length}

// A row is the finished keyword when the next value typed in the same session
// is shorter (a new word has started) or there is no next value at all.
val windowSpec = Window.partitionBy("session_id").orderBy("Timestamp")
dfM
  .withColumn("lead", lead("value", 1).over(windowSpec))
  .filter((length(col("lead")) < length(col("value"))) || col("lead").isNull)
  .drop("lead")
  .show()
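To try it out without the parquet source, a small hedged sketch that rebuilds Table 1 in memory (assumes spark.implicits._ is in scope):
import spark.implicits._

// Sample data copied from Table 1
val dfM = Seq(
  (1, "cat",    "2021-01-11T13:48:54.2514887-05:00"),
  (1, "catc",   "2021-01-11T13:48:54.3514887-05:00"),
  (1, "catch",  "2021-01-11T13:48:54.4514887-05:00"),
  (1, "par",    "2021-01-11T13:48:55.2514887-05:00"),
  (1, "part",   "2021-01-11T13:48:56.5514887-05:00"),
  (1, "party",  "2021-01-11T13:48:57.7514887-05:00"),
  (1, "partyy", "2021-01-11T13:48:58.7514887-05:00"),
  (2, "fal",    "2021-01-11T13:49:54.2514887-05:00"),
  (2, "fall",   "2021-01-11T13:49:54.3514887-05:00"),
  (2, "falle",  "2021-01-11T13:49:54.4514887-05:00"),
  (2, "fallen", "2021-01-11T13:49:54.8514887-05:00"),
  (2, "Tem",    "2021-01-11T13:49:56.5514887-05:00"),
  (2, "Temp",   "2021-01-11T13:49:56.7514887-05:00")
).toDF("session_id", "value", "Timestamp")
Timestamp is kept as a string here; since all values share the same format and offset, ordering by it still matches chronological order, and running the snippet above on this DataFrame should print the four rows of Table 2.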

Fast split Spark dataframe by keys in some column and save as different dataframes

I have a very big Spark 2.3 DataFrame like this:
-------------------------
| col_key | col1 | col2 |
-------------------------
| AA      | 1    | 2    |
| AB      | 2    | 1    |
| AA      | 2    | 3    |
| AC      | 1    | 2    |
| AA      | 3    | 2    |
| AC      | 5    | 3    |
-------------------------
I need to "split" this dataframe by values in col_key column and save each splitted part in separate csv file, so I have to get smaller dataframes like
-------------------------
| col_key | col1 | col2 |
-------------------------
| AA      | 1    | 2    |
| AA      | 2    | 3    |
| AA      | 3    | 2    |
-------------------------
and
-------------------------
| col_key | col1 | col2 |
-------------------------
| AC      | 1    | 2    |
| AC      | 5    | 3    |
-------------------------
and so on.
Each resulting DataFrame needs to be saved as a different csv file.
The number of keys is not big (20-30), but the total amount of data is (~200 million records).
I have a solution where, in a loop, every part of the data is selected and then saved to a file:
val keysList = df.select("col_key").distinct().map(r => r.getString(0)).collect.toList
keysList.foreach(k => {
  val dfi = df.where($"col_key" === lit(k))
  SaveDataByKey(dfi, path_to_save)
})
It works correctly, but the bad side of this solution is that every selection of data by key causes a full pass through the whole DataFrame, and it takes too much time.
I think there must be a faster solution, where we pass through the DataFrame only once and during this pass put every record into the "right" result DataFrame (or directly into a separate file). But I don't know how to do it :)
Maybe someone has ideas about it?
Also, I prefer to use Spark's DataFrame API because it provides the fastest way of processing the data (so using RDDs is not desirable, if possible).
You need to partitionBy that column when writing the csv; each key's data is then written to its own subdirectory (as one or more csv part files):
yourDF
  .write
  .partitionBy("col_key")
  .csv("/path/to/save")
Why don't you try this?
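For reference, a hedged sketch of what the output layout looks like and how a single key can be read back (the /path/to/save location is the one from the answer):
// partitionBy produces one subdirectory per key value:
//   /path/to/save/col_key=AA/part-....csv
//   /path/to/save/col_key=AB/part-....csv
//   ...

// Read back just one key (col_key itself is encoded in the directory name):
val aaOnly = spark.read.csv("/path/to/save/col_key=AA")

// Or read everything back; partition discovery restores the col_key column:
val all = spark.read.csv("/path/to/save")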

Talend: how to split column data into rows

I have one table:
| id | head1| head2 | head3|
| 1 | fv1 | fw1,fw2,fw3| fv3 |
| 2 | sv2 | sw1,sw2,sw3| sv4 |
And I would like to have the following:
| id | head2 |
| 1 | fw1 |
| 1 | fw2 |
| 1 | fw3 |
| 2 | sw1 |
| 2 | sw2 |
| 2 | sw3 |
So I would like to split the comma-delimited content of some columns and then copy it over into a different table as rows, for search purposes.
Which Talend component should I use to achieve this? Is that possible?
tNormalize should help you with this problem.
Just select "," as field separator, and head2 as the column to normalize.
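If you ever need the same transformation in Spark (Scala) rather than Talend, a minimal hedged sketch, assuming the table has already been loaded as a DataFrame named df:
import org.apache.spark.sql.functions.{col, explode, split}

// Split head2 on "," and emit one row per resulting value
val normalized = df.select(col("id"), explode(split(col("head2"), ",")).as("head2"))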

Talend Get most recent row for each instance of a given key

We have two data sources that we need to consolidate into one, keeping only the most recent row for each key. The sources have the same columns. The key is the Machine column.
Input :
tFileInputDelimited A
+---------+------+-------------------+
| Machine | Desc | LastScanTimestamp |
+---------+------+-------------------+
| M01     | AA   | 25                |
| M02     | AB   | 23                |
| M03     | AC   | 28                |
+---------+------+-------------------+
tFileInputDelimited B
+---------+------+-------------------+
| Machine | Desc | LastScanTimestamp |
+---------+------+-------------------+
| M02     | BB   | 25                |
| M03     | BC   | 27                |
| M04     | BD   | 26                |
+---------+------+-------------------+
Output wanted:
+---------+------+-------------------+
| Machine | Desc | LastScanTimestamp |
+---------+------+-------------------+
| M01     | AA   | 25                |
| M02     | BB   | 25                |
| M03     | AC   | 28                |
| M04     | BD   | 26                |
+---------+------+-------------------+
Our question is similar to this SQL question:
SQL query to get most recent row for each instance of a given key
We can't use a SQL query to do it. We found a way to do it with tUnite, tAggregateRow and tMap, but it is not really elegant or maintainable.
Thanks a lot.
I found an answer on the Talend Forum: https://www.talendforge.org/forum/viewtopic.php?id=49655
tFileInputDelimited ---> tUnite ---> tSortRow ---> tUniqRow ---> tLogRow
                      /
tFileInputDelimited -/
tSortRow: sort by LastScanTimestamp desc
tUniqRow: key = Machine
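For comparison, the same union / sort desc / keep-newest-per-key logic expressed in Spark (Scala), as a hedged sketch assuming the two inputs are loaded as DataFrames dfA and dfB with the columns shown above:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Put both sources together, then keep the row with the newest LastScanTimestamp per Machine
val w = Window.partitionBy("Machine").orderBy(col("LastScanTimestamp").desc)
val latest = dfA.union(dfB)
  .withColumn("rn", row_number().over(w))
  .where(col("rn") === 1)
  .drop("rn")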

How to sort the data on multiple columns in apache spark scala?

I have a data set like this, which I am reading from a csv file and converting into an RDD using Scala.
+-----------+-----------+----------+
| recent    | Freq      | Monitor  |
+-----------+-----------+----------+
| 1         | 1234      | 199090   |
| 4         | 2553      | 198613   |
| 6         | 3232      | 199090   |
| 1         | 8823      | 498831   |
| 7         | 2902      | 890000   |
| 8         | 7991      | 081097   |
| 9         | 7391      | 432370   |
| 12        | 6138      | 864981   |
| 7         | 6812      | 749821   |
+-----------+-----------+----------+
How can I sort the data on all columns?
Thanks
Suppose your input RDD/DataFrame is called df.
To sort recent in descending order, and Freq and Monitor both in ascending order, you can do:
import org.apache.spark.sql.functions._
val sorted = df.sort(desc("recent"), asc("Freq"), asc("Monitor"))
You can use df.orderBy(...) as well, it's an alias of sort().
csv.sortBy(r => (r.recent, r.freq)) or equivalent should do it
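A minimal hedged sketch of that RDD approach, assuming a comma-separated file with a header row and an illustrative case class (Monitor is kept as a String to preserve leading zeros such as 081097):
case class Record(recent: Int, freq: Int, monitor: String)

// Hypothetical parsing of the csv into Record objects
val lines = sc.textFile("data.csv")
val header = lines.first()
val records = lines
  .filter(_ != header)
  .map(_.split(","))
  .map(a => Record(a(0).trim.toInt, a(1).trim.toInt, a(2).trim))

// A tuple key sorts by recent first, then Freq, then Monitor (all ascending)
val sorted = records.sortBy(r => (r.recent, r.freq, r.monitor))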