Processing a CSV file with multiple delimiters in Spark Scala

I have the following data:
"Ankit BE, "BBD"",123,abcd
I read this CSV file into a Spark DataFrame using Scala:
val df = spark.read.csv("filepath")
Output:
+------------------------+---+----+
|_c0 |_c1|_c2 |
+------------------------+---+----+
|"Ankit BE, "BBD"" |123|abcd|
+------------------------+---+----+
I want output like below:
+------------------------+---+----+
|_c0 |_c1|_c2 |
+------------------------+---+----+
|Ankit BE |BBD|123 |
+------------------------+---+----+
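One possible approach (a sketch, assuming every line has the shape "<name>, "<CODE>"",<number>,<rest> like the sample row): because the inner quotes are not escaped the way standard CSV expects, read each line as plain text and pull the fields out with regular expressions instead of relying on the CSV parser.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
// read the file as plain text so the malformed quoting is not interpreted
val raw = spark.read.text("filepath")
val parsed = raw.select(
  // text between the opening quote and the first comma  -> Ankit BE
  regexp_extract($"value", "^\"([^,]+),", 1).as("_c0"),
  // text inside the inner quotes (assumed alphabetic)   -> BBD
  regexp_extract($"value", "\"([A-Za-z]+)\"\"", 1).as("_c1"),
  // digits between the closing quotes and the next comma -> 123
  regexp_extract($"value", "\",(\\d+),", 1).as("_c2")
)
parsed.show(false)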

Related

How to add new rows to DataFrame?

Is there a way in Spark to append a string to a data frame and upload that data frame to S3 as a txt file?
I have built a DataFrame by reading a text file from S3:
val DF = spark.read.textFile("s3_path/file_name.txt")
DF.show(200,false)
+----------------------------------+
|value |
+----------------------------------+
|country:india |
|address:xyz |
After this I need to append and update some strings in that file and upload it back to S3 at the same location.
Expected output:
+----------------------------------+
|value |
+----------------------------------+
|country:abcd |
|address:xyz |
|pin:1234 |
This is a union operation: it returns a new Dataset containing the union of rows in this Dataset and another Dataset.
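A minimal sketch of what that can look like for this example; the output path is a placeholder, and rewriting country:india to country:abcd is assumed from the expected output:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val df = spark.read.text("s3_path/file_name.txt")
// update an existing line
val updated = df.withColumn("value",
  when($"value" === "country:india", lit("country:abcd")).otherwise($"value"))
// append a new line via union
val result = updated.union(Seq("pin:1234").toDF("value"))
// write back to S3; overwriting the directory you are reading from in the same
// job is unsafe, so write to a separate (placeholder) location and move it after
result.coalesce(1).write.mode("overwrite").text("s3_path/file_name_updated")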

Split a file on spaces where the column data also contains spaces

Hi, I have a data file that uses whitespace as the delimiter, and the data in each column also contains spaces. How can I split it in a Spark program using Scala?
Sample data file: student.txt
3 columns: Name, Address, Id
Name Address Id
Abhi Rishii Bangalore,Karnataka 1234
Rinki siyty Hydrabad,Andra 2345
Output Data frame should be:
+------------+---------+---------+----+
|Name        |City     |State    |Id  |
+------------+---------+---------+----+
|Abhi Rishii |Bangalore|Karnataka|1234|
|Rinki siyty |Hydrabad |Andra    |2345|
+------------+---------+---------+----+
Your file is a tab-delimited file, so you can use Spark's CSV reader to read it directly into a DataFrame.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val studentDf = spark.read.format("csv") // the "csv" format handles both TSV and CSV
  .option("header", "true")
  .option("delimiter", "\t") // set the delimiter to tab
  .load("student.txt")
  .withColumn("_tmp", split($"Address", "\\,"))
  .withColumn("City", $"_tmp".getItem(0))
  .withColumn("State", $"_tmp".getItem(1))
  .drop("_tmp")
  .drop("Address")
studentDf.show()
+------------+---------+---------+----+
|Name        |City     |State    |Id  |
+------------+---------+---------+----+
|Abhi Rishii |Bangalore|Karnataka|1234|
|Rinki siyty |Hydrabad |Andra    |2345|
+------------+---------+---------+----+

Create a single schema dataframe when reading multiple csv files under a directory

I have thousands of CSV files that have similar but non-identical headers under a single directory. The structure is as follows:
path/to/files/unique_parent_directory/*.csv
One csv file can be:
|Column_A|Column_B|Column_C|Column_D|
|V1 |V2 |V3 |V4 |
The second CSV file can be:
|Column_A|Column_B|Column_E|Column_F|
|V5 |V6 |V7 |V8 |
The result I want is a single Spark DataFrame that merges the files correctly without overlapping columns; the output for the previous example should look like this:
|Column_A|Column_B|Column_C|Column_D|Column_E|Column_F|
|V1 |V2 |V3 |V4 |Null |Null |
|V5 |V6 |Null |Null |V7 |V8 |
The code I am using to create the dataframes is:
val df = sparkSession.read
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.option("mergeSchema", "true")
.load("path/to/files/unique_parent_directory/*.csv")
.persist(StorageLevel.MEMORY_AND_DISK_SER)
But I get the following result:
|Column_A|Column_B|Column_C|Column_D|
|V1 |V2 |V3 |V4 |
|V5 |V6 |V7 |V8 |
Is there a way to obtain the desired dataframe without running a header unification process?
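Note that mergeSchema is a Parquet/ORC option and is ignored by the CSV reader, which is why only the first header's columns show up. One hedged workaround (a sketch, assuming Spark 3.1+ for unionByName with allowMissingColumns) is to read each file with its own header and merge the DataFrames by column name, filling missing columns with null:
import org.apache.spark.sql.DataFrame
import org.apache.hadoop.fs.{FileSystem, Path}
// list the individual CSV files under the directory
val fs = FileSystem.get(sparkSession.sparkContext.hadoopConfiguration)
val files = fs.listStatus(new Path("path/to/files/unique_parent_directory"))
  .map(_.getPath.toString)
  .filter(_.endsWith(".csv"))
// read each file separately, then merge by name (Spark 3.1+)
val merged: DataFrame = files
  .map(f => sparkSession.read.option("header", "true").option("inferSchema", "true").csv(f))
  .reduce((a, b) => a.unionByName(b, allowMissingColumns = true))
Reading thousands of files individually adds planning overhead on the driver, so this trades speed for avoiding a separate header-unification pass.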

How to split column into multiple columns in Spark 2?

I am reading the data from HDFS into a DataFrame using Spark 2.2.0 and Scala 2.11.8:
val df = spark.read.text(outputdir)
df.show()
I see this result:
+--------------------+
| value|
+--------------------+
|(4056,{community:...|
|(56,{community:56...|
|(2056,{community:...|
+--------------------+
If I run df.head(), I see more details about the structure of each row:
[(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})]
I want to get the following output:
+---------+----------+
| id | value|
+---------+----------+
|4056 |1 |
|56 |56 |
|2056 |20 |
+---------+----------+
How can I do it? I tried using .map(row => row.mkString(",")),
but I don't know how to extract the data into the format shown above.
The problem is that you are getting the data as a single column of strings. The data format is not really specified in the question (ideally it would be something like JSON), but given what we know, we can use a regular expression to extract the number on the left (id) and the community field:
import org.apache.spark.sql.functions.regexp_extract
import spark.implicits._
val r = """\((\d+),\{.*community:(\d+).*\}\)"""
df.select(
  regexp_extract($"value", r, 1).as("id"),
  regexp_extract($"value", r, 2).as("community")
).show()
A bunch of regular expressions should give you the required result:
import org.apache.spark.sql.functions._
import spark.implicits._
df.select(
  regexp_extract($"value", "^\\(([0-9]+),.*$", 1) as "id",
  explode(split(regexp_extract($"value", "^\\(([0-9]+),\\{(.*)\\}\\)$", 2), ",")) as "value"
).withColumn("value", split($"value", ":")(1))
If your data is always of the following format
(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})
Then you can simply use the split and regexp_replace built-in functions to get your desired output DataFrame:
import org.apache.spark.sql.functions._
df.select(
  regexp_replace(split(col("value"), ",")(0), "\\(", "").as("id"),
  regexp_replace(split(col("value"), ",")(1), "\\{community:", "").as("value")
).show()
I hope the answer is helpful

In Spark and Scala, how do I convert or map a DataFrame to specific column info?

Scala. Spark. IntelliJ IDEA.
I have a DataFrame (multiple rows, multiple columns) from a CSV file, and I want to map it to another set of specific column info.
I think a Scala class (not a case class, because the column count is > 22) or map() could work, but I don't know how to do the conversion.
Example
A DataFrame from the CSV file:
----------------------
| No | price| name |
----------------------
| 1 | 100 | "A" |
----------------------
| 2 | 200 | "B" |
----------------------
The other specific column info:
=> {product_id, product_name, seller}
First, product_id maps to 'No'.
Second, product_name maps to 'name'.
Third, seller is null or "" (empty string).
So, finally, I want a DataFrame that has the other column info.
-----------------------------------------
| product_id | product_name | seller |
-----------------------------------------
| 1 | "A" | |
-----------------------------------------
| 2 | "B" | |
-----------------------------------------
If you already have a DataFrame (e.g. old_df):
val new_df = old_df.withColumnRenamed("No", "product_id")
  .withColumnRenamed("name", "product_name")
  .drop("price")
  .withColumn("seller", ...)
Let's say your CSV file is "products.csv". First you have to load it into Spark; you can do that using:
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load("products.csv")
Once the data is loaded you will have all the column names in the DataFrame df. As you mentioned, your column names will be "No", "price", "name". To change the name of a column you just have to use the withColumnRenamed API of the DataFrame:
val renamedDf = df.withColumnRenamed("No", "product_id")
  .withColumnRenamed("name", "product_name")
Your renamedDf will have the column names you have assigned.