combine two different csv files and make them into one - scala

I want to join csv1 with csv2 to produce final_csv; the schema only has String-type columns (file contents are as follows):
csv1
emp_name designation salary_col
smith manager 40000
john analyst 35000
adam sr.engineer 50000
eve QA 36000
mills sr.manager 44000
csv2
emp_name designation advance_salary_col
smith manager 2000
john analyst 3030
adam sr.engineer 5044
eve QA 3600
mills sr.manager 4500
final_csv
emp_name designation salary_col advance_salary_col
smith manager 40000 2000
john analyst 35000 3030
adam sr.engineer 50000 5044
eve QA 36000 3600
mills sr.manager 44000 4500
I tried a few methods (union, intersect, unionByName) but I am getting null values for all of the columns in my final_df in Scala:
val emp_dataDf1 = spark.read.format("csv")
  .option("header", "true")
  .load("data/emp_data1.csv")

val emp_dataDf2 = spark.read.format("csv")
  .option("header", "true")
  .load("/data/emp_data2.csv")

val final_df = emp_dataDf1.union(emp_dataDf2)

This is a join. See docs about SQL joins and joins in Spark.
val final_df = emp_dataDf1.join(emp_dataDf2, Seq("emp_name", "designation"))

NOTE: you need to specify the separator for the csv files; if you are using spaces (or any separator, for that matter), make sure you pass the exact string literal.
Here is how you can do it:
val csv1 = spark
  .read
  .option("header", "true")
  .option("sep", " ")
  .csv("your_csv1_file")

val csv2 = spark
  .read
  .option("header", "true")
  .option("sep", " ")
  .csv("your_csv2_file")
val joinExpression = Seq("emp_name", "designation")
csv1.join(csv2, joinExpression, "inner").show(false)
/* output
+--------+-----------+----------+------------------+
|emp_name|designation|salary_col|advance_salary_col|
+--------+-----------+----------+------------------+
|smith   |manager    |40000     |2000              |
|john    |analyst    |35000     |3030              |
|adam    |sr.engineer|50000     |5044              |
|eve     |QA         |36000     |3600              |
|mills   |sr.manager |44000     |4500              |
+--------+-----------+----------+------------------+
*/
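If you also need to write the joined result out as final_csv, a minimal sketch building on the csv1/csv2/joinExpression values above (the output directory name "final_csv" is just a placeholder) could be:
// assuming the joined dataframe from above
val final_df = csv1.join(csv2, joinExpression, "inner")

// coalesce(1) produces a single part file; note that Spark still writes it
// into a directory called "final_csv" rather than a bare file
final_df
  .coalesce(1)
  .write
  .mode("overwrite")
  .option("header", "true")
  .option("sep", " ")
  .csv("final_csv")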

Related

Split a file with a space delimiter where the column data also contains spaces

Hi, I have a data file which uses space as the delimiter, and the data in some columns also contains spaces. How can I split it using a Spark program written in Scala?
Sample data file: student.txt
3 columns: Name, Address, Id
Name Address Id
Abhi Rishii Bangalore,Karnataka 1234
Rinki siyty Hydrabad,Andra 2345
Output data frame should be:
+-----------+---------+---------+----+
|Name       |City     |State    |Id  |
+-----------+---------+---------+----+
|Abhi Rishii|Bangalore|Karnataka|1234|
|Rinki siyty|Hydrabad |Andra    |2345|
+-----------+---------+---------+----+
Your file is a tab delimited file.
You can use Spark's csv reader to read this file directly into a dataframe.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.split

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val studentDf = spark.read.format("csv") // Spark's csv reader handles both TSV and CSV
  .option("header", "true")
  .option("delimiter", "\t")             // set the delimiter to tab
  .load("student.txt")
  .withColumn("_tmp", split($"Address", "\\,"))
  .withColumn("City", $"_tmp".getItem(0))
  .withColumn("State", $"_tmp".getItem(1))
  .drop("_tmp")
  .drop("Address")

studentDf.show()
+-----------+---------+---------+----+
|Name       |City     |State    |Id  |
+-----------+---------+---------+----+
|Abhi Rishii|Bangalore|Karnataka|1234|
|Rinki siyty|Hydrabad |Andra    |2345|
+-----------+---------+---------+----+
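If the file really were space separated (rather than tab separated), a hedged alternative is to read each line as plain text and pull the fields apart with a regular expression. The pattern below assumes the Id is the last numeric token and the Address is the single comma-separated token right before it; the variable names are just for this sketch:
import org.apache.spark.sql.functions.{regexp_extract, split}
import spark.implicits._

// read the raw lines (a single string column named "value") and drop the header row
val lines = spark.read.text("student.txt")
  .filter(!$"value".startsWith("Name"))

// capture groups: (Name, may contain spaces) (Address, no spaces) (numeric Id)
val pattern = "^(.+?)\\s+(\\S+)\\s+(\\d+)\\s*$"

val studentDf2 = lines.select(
  regexp_extract($"value", pattern, 1).as("Name"),
  split(regexp_extract($"value", pattern, 2), ",").getItem(0).as("City"),
  split(regexp_extract($"value", pattern, 2), ",").getItem(1).as("State"),
  regexp_extract($"value", pattern, 3).as("Id")
)

studentDf2.show()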

Format csv file with column creation in Spark scala

I have a csv file, as below
It has 6 rows, with the top row as the header; the header is read as "Students Marks" and the dataframe treats it as a single column. Now I want to separate it into two columns with their data; "Student" and "Marks" are separated by a space.
df.show()
_______________
##Student Marks##
---------------
A 10;20;10;20
A 20;20;30;10
B 10;10;10;10
B 20;20;20;10
B 30;30;30;20
Now I want to transform this csv table into two columns, Student and Marks. Also, for every student the marks should add up, something like below:
Student | Marks
A | 30;40;40;30
B | 60;60;60;40
I have tried the below but it is throwing an error:
df.withColumn("_tmp", split($"Students Marks","\\ ")).select($"_tmp".getItem(0).as("col1"),$"_tmp".getItem(1).as("col2")).drop("_tmp")
You can read the csv file with the delimiter you want and calculate the result as below:
val df = spark.read
  .option("header", true)
  .option("delimiter", " ")
  .csv("path to csv")
After you get the dataframe df:
val resultDF = df.withColumn("split", split($"Marks", ";"))
  .withColumn("a", $"split"(0))
  .withColumn("b", $"split"(1))
  .withColumn("c", $"split"(2))
  .withColumn("d", $"split"(3))
  .groupBy("Student")
  .agg(concat_ws(";", array(
    Seq(sum($"a"), sum($"b"), sum($"c"), sum($"d")): _*
  )).as("Marks"))
resultDF.show(false)
Output:
+-------+-------------------+
|Student|Marks              |
+-------+-------------------+
|B      |60.0;60.0;60.0;40.0|
|A      |30.0;40.0;40.0;30.0|
+-------+-------------------+
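The sums come back as doubles (60.0 rather than 60) because summing the string columns resolves to a double aggregate. If you want the output to match the expected 60;60;60;40 exactly, a hedged tweak of the same aggregation (relying on the same split/sum/concat_ws functions as above) is to cast each sum to int before concatenating:
// same aggregation, but each per-position sum is cast to int so the output reads 60;60;60;40
val resultIntDF = df.withColumn("split", split($"Marks", ";"))
  .withColumn("a", $"split"(0))
  .withColumn("b", $"split"(1))
  .withColumn("c", $"split"(2))
  .withColumn("d", $"split"(3))
  .groupBy("Student")
  .agg(concat_ws(";",
    sum($"a").cast("int"),
    sum($"b").cast("int"),
    sum($"c").cast("int"),
    sum($"d").cast("int")
  ).as("Marks"))
resultIntDF.show(false)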
Three Ideas. The first one is to read the file, split it by space and then create the dataFrame:
val df = sqlContext.read
.format("csv")
.option("header", "true")
.option("delimiter", " ")
.load("your_file.csv")
The second one is to read the file into a dataframe and split it:
df.withColumn("Student", split($"Students Marks", " ").getItem(0))
  .withColumn("Marks", split($"Students Marks", " ").getItem(1))
  .drop("Students Marks")
The last one is your solution. It should work; since the select only keeps the two columns you list, _tmp is not carried along, so it works without the .drop("_tmp"):
df.withColumn("_tmp", split($"Students Marks", " "))
  .select($"_tmp".getItem(0).as("Student"), $"_tmp".getItem(1).as("Marks"))

data transformations in scala/spark

brand,month,price
abc,jan, - \n
abc,feb, 29 \n
abc,mar, - \n
abc,apr, 45.23 \n
bb-c,jan, 34 \n
bb-c,feb,-35 \n
bb-c,mar, - \n
sum(price) groupby(brand)
Challenges:
1) csv file available in an xl sheet
2) trim the extra spaces in price
3) replace non-numeric (" - ") with zero
4) sum the price grouped by brand
What I have done so far:
-- read the csv file into df1
-- changed the price data type from string to double
-- created and registered a temp table on df1
-- but I am still facing an issue with the trim and with replacing zero for the non-numeric values
Can someone please help me with this issue?
Theoretically:
A simple use of sqlContext to read the csv file, the regexp_replace built-in function to clean the price strings (followed by a cast to double), and a groupBy with a sum aggregation should get you your desired output.
Programmatically:
//1)csv file available in xl sheet
val df = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .option("header", true)
  .load("path to the csv file")
df.show(false)
//+-----+-----+------+
//|brand|month|price |
//+-----+-----+------+
//|abc |jan | - |
//|abc |feb | 29 |
//|abc |mar | - |
//|abc |apr | 45.23|
//|bb-c |jan | 34 |
//|bb-c |feb |-35 |
//|bb-c |mar | - |
//+-----+-----+------+
import org.apache.spark.sql.functions._
//2)trim the extra spaces in price
//3)replace non-numeric(" - ") with zero
df.withColumn("price", regexp_replace(col("price"), "[\\s+a-zA-Z- :]", "").cast("double"))
//4)sum the price group by brand
.groupBy("brand")
.agg(sum("price").as("price_sum"))
.show(false)
//+-----+-----------------+
//|brand|price_sum        |
//+-----+-----------------+
//|abc  |74.22999999999999|
//|bb-c |69.0             |
//+-----+-----------------+
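If you want the " - " entries to become an explicit 0.0 rather than a null (sum skips nulls, so the totals come out the same either way), a hedged variant of the same step using coalesce could look like this (it relies on the wildcard functions import above):
// same cleanup as above, but the null produced by " - " is coalesced to an explicit 0.0
df.withColumn("price",
    coalesce(regexp_replace(col("price"), "[\\s+a-zA-Z- :]", "").cast("double"), lit(0.0)))
  .groupBy("brand")
  .agg(sum("price").as("price_sum"))
  .show(false)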
I hope the answer is helpful

Using spark to merge data in sorted order to csv files

I have a data set like this:
name time val
---- ----- ---
fred 04:00 111
greg 03:00 123
fred 01:00 411
fred 05:00 921
fred 11:00 157
greg 12:00 333
And csv files in some folder, one for each unique name from the data set:
fred.csv
greg.csv
The contents of fred.csv, for example, looks like this:
00:00 222
10:00 133
My goal is to efficiently merge the dataset to the CSV's in sorted time order so that fred.csv, for example, ends up like this:
00:00 222
01:00 411
04:00 111
05:00 921
10:00 133
In reality, there are thousands of unique names, not just two. I use the union and sort functions to add rows in order, but I have not been successful with partitionBy, foreach, or coalesce in getting the rows to their proper CSV files.
Import and declare the necessary variables:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local")
  .appName("Partition Sort Demo")
  .getOrCreate()

import spark.implicits._
Create dataframe from source file
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("csv/file/location")
//df.show()
+----+-----+---+
|name| time|val|
+----+-----+---+
|fred|04:00|111|
|greg|03:00|123|
|fred|01:00|411|
|fred|05:00|921|
|fred|11:00|157|
|greg|12:00|333|
+----+-----+---+
Now repartition the dataframe by name, sort within each partition, and then save them:
//repartition
val repartitionedDf = df.repartition($"name")

for {
  // fetch the distinct names in the dataframe to use as file names
  distinctName <- df.dropDuplicates("name").collect.map(_(0))
} yield {
  import org.apache.spark.sql.functions.lit
  repartitionedDf.select("time", "val")
    .filter($"name" === lit(distinctName)) // filter df by name
    .coalesce(1)
    .sortWithinPartitions($"time")         // sort
    .write.mode("overwrite").csv("location/" + distinctName + ".csv") // save
}
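Note that .csv(...) writes a directory of part files named like location/fred.csv, not a single bare file, and the loop above does not yet merge in the rows that already exist in the old per-name files. A hedged sketch of the "merge" part for one name, assuming the old file lives at "existing_csvs/fred.csv", is space separated, and has no header (both the paths and that layout are assumptions):
// read the existing fred.csv, union it with the new rows for fred, then sort and write back out
val newFred = df.filter($"name" === "fred")
  .select($"time".cast("string").as("time"), $"val".cast("string").as("val"))

val existingFred = spark.read
  .option("sep", " ")
  .csv("existing_csvs/fred.csv") // assumed location of the old file
  .toDF("time", "val")

val mergedFred = newFred.union(existingFred).sort($"time")

// zero-padded HH:MM strings sort correctly as text
mergedFred.coalesce(1)
  .write.mode("overwrite")
  .option("sep", " ")
  .csv("merged/fred.csv") // assumed output location
The same pattern can be applied inside the loop above for every distinct name.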

In spark and scala, how to convert or map a dataframe to specific columns info?

Scala, Spark, IntelliJ IDEA.
I have a dataframe (multiple rows, multiple columns) from a CSV file, and I want to map it to another specific set of column info.
I am thinking of a Scala class (not a case class, because the column count is > 22) or map()..., but I don't know how to convert it.
Example
A dataframe from a CSV file:
+----+-------+------+
| No | price | name |
+----+-------+------+
| 1  | 100   | "A"  |
| 2  | 200   | "B"  |
+----+-------+------+
The other specific column info:
=> {product_id, product_name, seller}
First, product_id maps to 'No'.
Second, product_name maps to 'name'.
Third, seller is null or "" (an empty string).
So, finally, I want a dataframe that has the other column info:
+------------+--------------+--------+
| product_id | product_name | seller |
+------------+--------------+--------+
| 1          | "A"          |        |
| 2          | "B"          |        |
+------------+--------------+--------+
If you already have a dataframe (e.g. old_df):
val new_df = old_df.withColumnRenamed("No", "product_id")
  .withColumnRenamed("name", "product_name")
  .drop("price")
  .withColumn("seller", ... ) // e.g. lit(""), matching the question's "null or empty string"
Let's say your CSV file is "products.csv". First you have to load it in Spark; you can do that using:
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // use the first line of all files as the header
  .option("inferSchema", "true") // automatically infer data types
  .load("products.csv")
Once the data is loaded you will have all the column names in the dataframe df. As you mentioned, your column names will be "No", "price", "name".
To change the name of a column you just have to use the withColumnRenamed API of the dataframe:
val renamedDf = df.withColumnRenamed("No", "product_id")
  .withColumnRenamed("name", "product_name")
Your renamedDf will have the column names you assigned.
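To also drop price and add the empty seller column described in the question, a hedged follow-up on renamedDf could be:
import org.apache.spark.sql.functions.lit

// drop the unused price column and add seller as an empty string
// (use lit(null).cast("string") instead if you prefer an actual null)
val finalDf = renamedDf
  .drop("price")
  .withColumn("seller", lit(""))

finalDf.show()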