I am using Spark 2.4 and I am trying to read a tab-delimited file; however, while it does read the file, it does not parse the delimiter correctly.
Test file, e.g.,
$ cat file.tsv
col1 col2
1 abc
2 def
The file is tab delimited correctly:
$ cat -A file.tsv
col1^Icol2$
1^Iabc$
2^Idef$
I have tried both delimiter="\t" and sep="\t", but neither gives the expected result.
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("delimiter", "\t") \
    .option("inferSchema", "true") \
    .load("file.tsv")

df = spark.read.load("file.tsv",
                     format="csv",
                     sep="\t",
                     inferSchema="true",
                     header="true")
The result of the read is a single column string.
df.show(10,False)
+---------+
|col1 col2|
+---------+
|1 abc |
|2 def |
+---------+
Am I doing something wrong, or do I have to preprocess the file to convert tabs to pipes before reading?
Related
Hi, I have a data file that uses a space as the delimiter, and the data in the columns also contains spaces. How can I split it using a Spark program in Scala?
Sample data file:
student.txt
3 columns:
Name
Address
Id
Name Address Id
Abhi Rishii Bangalore,Karnataka 1234
Rinki siyty Hydrabad,Andra 2345
Output Data frame should be:
+-----------+---------+---------+----+
|Name       |City     |State    |Id  |
+-----------+---------+---------+----+
|Abhi Rishii|Bangalore|Karnataka|1234|
|Rinki siyty|Hydrabad |Andra    |2345|
+-----------+---------+---------+----+
Your file is a tab delimited file.
You can use Spark's csv reader to read this file directly into a dataframe.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.split

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val studentDf = spark.read.format("csv")   // use "csv" for both TSV and CSV
  .option("header", "true")
  .option("delimiter", "\t")               // set the delimiter to a tab
  .load("student.txt")
  .withColumn("_tmp", split($"Address", "\\,"))
  .withColumn("City", $"_tmp".getItem(0))
  .withColumn("State", $"_tmp".getItem(1))
  .drop("_tmp")
  .drop("Address")
  .select("Name", "City", "State", "Id")   // order the columns to match the expected output

studentDf.show()
+-----------+---------+---------+----+
|Name       |City     |State    |Id  |
+-----------+---------+---------+----+
|Abhi Rishii|Bangalore|Karnataka|1234|
|Rinki siyty|Hydrabad |Andra    |2345|
+-----------+---------+---------+----+
I have thousands of CSV files that have similar but non-identical headers under a single directory. The structure is as follows:
path/to/files/unique_parent_directory/*.csv
One csv file can be:
|Column_A|Column_B|Column_C|Column_D|
|V1      |V2      |V3      |V4      |
The second CSV file can be:
|Column_A|Column_B|Column_E|Column_F|
|V5      |V6      |V7      |V8      |
The result I want to create is a single Spark dataframe that merges the files correctly without overlapping columns; the output for the previous example should look like this:
|Column_A|Column_B|Column_C|Column_D|Column_E|Column_F|
|V1      |V2      |V3      |V4      |Null    |Null    |
|V5      |V6      |Null    |Null    |V7      |V8      |
The code I am using to create the dataframes is:
val df = sparkSession.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("mergeSchema", "true")
  .load("path/to/files/unique_parent_directory/*.csv")
  .persist(StorageLevel.MEMORY_AND_DISK_SER)
But I get the following result:
|Column_A|Column_B|Column_C|Column_D|
|V1      |V2      |V3      |V4      |
|V5      |V6      |V7      |V8      |
Is there a way to obtain the desired dataframe without running a header unification process?
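A rough sketch (an assumption, not code from this thread) of one way to get the desired result: read each file separately so every header is kept, pad the missing columns with nulls, and union the dataframes using a fixed column order. This is effectively schema unification done in code; the file names below are placeholders, and everything is read as strings to keep the union type-safe.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

// Placeholder file list; in practice you would list the directory contents.
val files = Seq(
  "path/to/files/unique_parent_directory/file1.csv",
  "path/to/files/unique_parent_directory/file2.csv")

// Read every file on its own so each one keeps its own header.
val dfs = files.map(f => sparkSession.read.option("header", "true").csv(f))

// The union of all column names across the files.
val allCols = dfs.flatMap(_.columns).distinct

// Add any missing column as a null string column.
def padMissing(df: DataFrame): DataFrame =
  allCols.foldLeft(df)((acc, c) =>
    if (acc.columns.contains(c)) acc else acc.withColumn(c, lit(null).cast("string")))

// Align the column order and union everything into one dataframe.
val merged = dfs.map(df => padMissing(df).select(allCols.map(col): _*)).reduce(_ union _)
merged.show(false)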
I have a csv file, as below.
It has 6 rows, with the top row as the header; the header is read as "Students Marks".
The dataframe is treating them as one column; now I want to separate the two columns and their data. "Student" and "Marks" are separated by a space.
df.show()
_______________
##Student Marks##
---------------
A 10;20;10;20
A 20;20;30;10
B 10;10;10;10
B 20;20;20;10
B 30;30;30;20
Now I want to transform this csv table into two columns, Student and Marks. Also, for every student, the marks should be added up, something like below:
Student | Marks
A | 30;40;40;30
B | 60;60;60;40
I have tried the below, but it is throwing an error:
df.withColumn("_tmp", split($"Students Marks","\\ ")).select($"_tmp".getItem(0).as("col1"),$"_tmp".getItem(1).as("col2")).drop("_tmp")
You can read the csv file with the delimiter you want and calculate the result as below.
val df = spark.read
  .option("header", true)
  .option("delimiter", " ")
  .csv("path to csv")
After you get the dataframe df:
val resultDF = df.withColumn("split", split($"Marks", ";"))
  .withColumn("a", $"split"(0))
  .withColumn("b", $"split"(1))
  .withColumn("c", $"split"(2))
  .withColumn("d", $"split"(3))
  .groupBy("Student")
  .agg(concat_ws(";",
    array(Seq(sum($"a"), sum($"b"), sum($"c"), sum($"d")): _*)
  ).as("Marks"))
resultDF.show(false)
Output:
+-------+-------------------+
|Student|Marks |
+-------+-------------------+
|B |60.0;60.0;60.0;40.0|
|A |30.0;40.0;40.0;30.0|
+-------+-------------------+
Three ideas. The first one is to read the file with space as the delimiter and create the dataframe directly:
val df = sqlContext.read
  .format("csv")
  .option("header", "true")
  .option("delimiter", " ")
  .load("your_file.csv")
The second one is to read the file into a dataframe and then split the column:
df.withColumn("Student", split($"Students Marks"," ").getItem(0))
.withColumn("Marks", split($"Students Marks"," ").getItem(1))
.drop("Students Marks")
The last one is your solution. It should work; since the select does not include $"_tmp", the result will not contain that column, so it should work even without the .drop("_tmp"):
df.withColumn("_tmp", split($"Students Marks"," "))
.select($"_tmp".getItem(0).as("Student"),$"_tmp".getItem(1).as("Marks"))
I want to run the following code on each file that I read from DBFS (Databricks FileSystem). I tested it on all files that are in a folder, but I want to make similar calculations for each file in the folder, one by one:
// a-e are calculated fields
val df2=Seq(("total",a,b,c,d,e)).toDF("file","total","count1","count2","count3","count4")
//schema is now an empty dataframe
val final1 = schema.union(df2)
Is that possible? I guess reading it from dbfs should be done differently as well, from what I do now:
val df1 = spark
  .read
  .format("csv")
  .option("header", "true")
  .option("delimiter", ",")
  .option("inferSchema", "true")
  .load("dbfs:/Reports/*.csv")
  .select("lot of ids")
Thank you a lot in advance for the ideas :)
As discussed, you have 3 options here.
In my example I am using the following 3 datasets:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|1 |100 |200 |
|2 |300 |400 |
+----+----+----+
+----+----+----+
|col1|col2|col3|
+----+----+----+
|3 |60 |80 |
|4 |12 |100 |
|5 |20 |10 |
+----+----+----+
+----+----+----+
|col1|col2|col3|
+----+----+----+
|7 |20 |40 |
|8 |30 |40 |
+----+----+----+
First you create your schema (it is faster to define the schema explicitly than to infer it):
import org.apache.spark.sql.types._
val df_schema =
  StructType(
    List(
      StructField("col1", IntegerType, true),
      StructField("col2", IntegerType, true),
      StructField("col3", IntegerType, true)))
Option 1:
Load all CSVs at once with:
val df1 = spark
  .read
  .option("header", "false")
  .option("delimiter", ",")
  .option("inferSchema", "false")
  .schema(df_schema)
  .csv("file:///C:/data/*.csv")
Then apply your logic to the whole dataset, grouping by the file name.
Precondition: you must find a way to attach the file name to each row, as sketched below.
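For example, a minimal sketch (assuming input_file_name(), which is also mentioned in a later answer, is used for tagging the rows, and using a placeholder count/sum as the per-file logic):
import org.apache.spark.sql.functions.{count, input_file_name, lit, sum}

// Tag every row with the file it came from, then aggregate per file.
// The aggregations below are placeholders for your own per-file calculations.
val perFile = df1
  .withColumn("FileName", input_file_name())
  .groupBy("FileName")
  .agg(count(lit(1)).as("total"), sum("col1").as("sum_col1"))

perFile.show(false)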
Option 2:
Load the csv files from the directory. Then iterate over the files, creating a dataframe for each csv. Inside the loop apply your logic to each csv. Finally, at the end of each iteration, append (union) the result into a second dataframe which will store your accumulated results.
Attention: please be aware that a large number of files might cause a very big DAG and subsequently a huge execution plan. To avoid this, you can persist the current results or call collect. In the example below I assumed that persist or collect gets executed every bufferSize iterations. You can adjust or even remove this logic according to the number of csv files.
This is sample code for the second option:
import java.io.File
import org.apache.spark.sql.Row
import spark.implicits._
val dir = "C:\\data_csv\\"
val csvFiles = new File(dir).listFiles.filter(_.getName.endsWith(".csv"))
val bufferSize = 10
var indx = 0
//create an empty df which will hold the accumulated results
var bigDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], df_schema)
csvFiles.foreach { path =>
  val tmp_df = spark
    .read
    .option("header", "false")
    .option("delimiter", ",")
    .option("inferSchema", "false")
    .schema(df_schema)
    .csv(path.getPath)

  // execute your custom logic/calculations with tmp_df

  if ((indx + 1) % bufferSize == 0) {
    // If the buffer size is reached:
    // 1. call bigDf.persist() or bigDf.collect()
    // 2. if you use collect(), load the results into bigDf again
  }

  bigDf = bigDf.union(tmp_df)
  indx = indx + 1
}
bigDf.show(false)
This should output:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|1 |100 |200 |
|2 |300 |400 |
|3 |60 |80 |
|4 |12 |100 |
|5 |20 |10 |
|7 |20 |40 |
|8 |30 |40 |
+----+----+----+
Option 3:
The last option is to use the built-in spark.sparkContext.wholeTextFiles.
This is the code to load all csv files into an RDD:
val data = spark.sparkContext.wholeTextFiles("file:///C:/data_csv/*.csv")
val df = spark.createDataFrame(data)
df.show(false)
And the output:
+--------------------------+--------------------------+
|_1 |_2 |
+--------------------------+--------------------------+
|file:/C:/data_csv/csv1.csv|1,100,200 |
| |2,300,400 |
|file:/C:/data_csv/csv2.csv|3,60,80 |
| |4,12,100 |
| |5,20,10 |
|file:/C:/data_csv/csv3.csv|7,20,40 |
| |8,30,40 |
+--------------------------+--------------------------+
spark.sparkContext.wholeTextFiles returns a key/value RDD in which the key is the file path and the value is the file content.
This requires extra code to parse the content of _2, which is the body of each csv (a rough sketch follows below). In my opinion this adds overhead in terms of performance and maintainability of the program, therefore I would avoid it.
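For illustration, a sketch of that extra parsing step (an assumption, not code from this answer): split each file's content into lines, split every line on commas, and keep the file path next to each row.
import spark.implicits._

val parsedDf = spark.sparkContext
  .wholeTextFiles("file:///C:/data_csv/*.csv")
  .flatMap { case (path, content) =>
    // One record per line of the file, keeping the originating path.
    content.split("\\r?\\n").filter(_.trim.nonEmpty).map { line =>
      val cols = line.split(",")
      (path, cols(0).trim.toInt, cols(1).trim.toInt, cols(2).trim.toInt)
    }
  }
  .toDF("file", "col1", "col2", "col3")

parsedDf.show(false)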
Let me know if you need further clarifications
I am adding to the answer that @Alexandros Biratsis provided.
One can use the first approach as below, adding the file name as a separate column within the same dataframe that holds all the data from the multiple files.
val df1 = spark
  .read
  .option("header", "false")
  .option("delimiter", ",")
  .option("inferSchema", "false")
  .schema(df_schema)
  .csv("file:///C:/data/*.csv")
  .withColumn("FileName", input_file_name())
Here input_file_name() is a function that adds the name of the source file to each row within the DataFrame. It is a built-in function in Spark.
To use this function you need to import the namespace below:
import org.apache.spark.sql.functions._
One can find the documentation for the function at https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/functions.html
I would advise against the second approach suggested by @Alexandros Biratsis, which unions and persists the temporary dataframes: it works for a small number of files, but as the number of files grows it becomes too slow, sometimes times out, and the driver gets shut down unexpectedly.
I would like to thank Alexandros for the answer, as it gave me a way to move forward with the problem.
I have a dataset containing the two rows below:
s.no,name,Country
101,xyz,India,IN
102,abc,UnitedStates,US
I am trying to have the comma treated as a delimiter for each column, except for the last column, where I want to keep it as part of the value (e.g. "India,IN"), and get the output using spark-shell. I tried the code below, but it gives me a different output.
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("delimiter", ",").option("escape", "\"").load("/user/username/data.csv").show()
The output it has given me is
+----+-----+------------+
|s.no| name| Country|
+----+-----+------------+
| 101| xyz| India|
| 102| abc|UnitedStates|
+----+-----+------------+
But I am expecting the output to be like below. What am I missing here? Can anyone help me?
s.no name Country
101 xyz India,IN
102 abc UnitedStates,US
I suggest reading all the fields by providing a schema and ignoring the header present in the data, as below:
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.{concat, lit}
import spark.implicits._

case class Data(sno: String, name: String, country: String, country1: String)

val schema = Encoders.product[Data].schema

val df = spark.read
  .option("header", true)
  .schema(schema)
  .csv("data.csv")
  .withColumn("Country", concat($"country", lit(", "), $"country1"))
  .drop("country1")

df.show(false)
Output:
+---+----+----------------+
|sno|name|Country |
+---+----+----------------+
|101|xyz |India, IN |
|102|abc |UnitedStates, US|
+---+----+----------------+
Hope this helps!