How can we load a non-delimited (fixed-width) text file using Spark Scala and save it as a CSV file when the column lengths are given dynamically?

For example: suppose the schema has three columns, Name, Address, and Age, and each line in the file is 92 characters long, where the first 50 characters are the name, the next 40 are the address, and the last 2 are the age. Given that these column lengths can vary and are supplied dynamically, how can I read such a file, make it delimited, and save it as a CSV file?
I could not work out how to approach this at all.

Your question has two parts.
Read file
I assume you have a file named input.txt like this:
1 John 100
2 Jack 200
3 Jonah300
10JJ 400
I also assume these are the columns:
// Column Name, Length
val columns = Vector(("id", 2), ("name", 5), ("value", 3))
First, compute the starting position of each column:
case class ColumnInfo(name: String, length: Int, position: Int)
val columnInfos = columns.tail.foldLeft(Vector(ColumnInfo(columns.head._1, columns.head._2, 1))) { (acc, current) =>
  acc :+ ColumnInfo(current._1, current._2, acc.last.position + acc.last.length)
}
This will be the result:
Vector(ColumnInfo(id,2,1), ColumnInfo(name,5,3), ColumnInfo(value,3,8))
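The columns vector above is hard-coded, but since the question says the lengths are supplied dynamically, the same vector can be built at runtime. A minimal sketch, assuming the spec arrives as a string such as "name:50,address:40,age:2" (a hypothetical format; adapt the parsing to whatever you actually receive):
def parseColumnSpec(spec: String): Vector[(String, Int)] =
  spec.split(",").toVector.map { entry =>
    val Array(name, length) = entry.split(":") // hypothetical "name:length" entries
    (name.trim, length.trim.toInt)
  }

// parseColumnSpec("name:50,address:40,age:2")
// => Vector((name,50), (address,40), (age,2))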
Now, you can read and parse this file using this code:
import org.apache.spark.sql.functions._

val sparkCols = columnInfos map { columnInfo =>
  trim(substring(col("value"), columnInfo.position, columnInfo.length)) as columnInfo.name
}

val df = spark.read
  .text("input.txt")
  .select(sparkCols: _*)

df.show()
This will be the result:
+---+-----+-----+
| id| name|value|
+---+-----+-----+
| 1| John| 100|
| 2| Jack| 200|
| 3|Jonah| 300|
| 10| JJ| 400|
+---+-----+-----+
Save file
You can save the file using this code:
df.repartition(1).write.option("header", true).csv("output.csv")
This will be the result:
id,name,value
1,John,100
2,Jack,200
3,Jonah,300
10,JJ,400
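One caveat: .csv("output.csv") actually creates a directory named output.csv, and repartition(1) only guarantees that it contains a single part file. If you need one plainly-named file, you can rename that part file afterwards. A minimal sketch using Hadoop's FileSystem API (the paths are just placeholders):
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// locate the single part file Spark wrote inside the output directory
val partFile = fs.globStatus(new Path("output.csv/part-*"))(0).getPath
// move it next to the directory under a plain name
fs.rename(partFile, new Path("output_final.csv"))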

Related

How to transform this dataset to the following dataset

Input
+--------+--------+-----------+------+
|emp_name|emp_area|       dept|   zip|
+--------+--------+-----------+------+
|     ram|     USA|    "Sales"|805912|
|    sham|     USA|    "Sales"|805912|
|     ram|  Canada|"Marketing"|805912|
|     ram|     USA|    "Sales"|805912|
|    sham|     USA|"Marketing"|805912|
+--------+--------+-----------+------+
Desired output
+--------+---------+----------+---------+----------+
| feature|top1 name|top1 value|top2 name|top2 value|
+--------+---------+----------+---------+----------+
|emp_name|      ram|         3|     sham|         2|
|emp_area|      Usa|         4|   canada|         1|
|    dept|    sales|         3|Marketing|         3|
|     zip|   805912|         5|       NA|        NA|
+--------+---------+----------+---------+----------+
I started by dynamically generating the counts for each column, but I am unable to collect them into a single dataset:
val features = ds.columns.toList
for (e <- features) {
  val ds1 = ds.groupBy(e).count().sort(desc("count")).limit(5).withColumnRenamed("count", e + "_count")
}
Now, how do I collect all of these values into one dataframe and transform it into the desired output?
Here's a slightly verbose approach. You can map each column to a dataframe with one row, which corresponds to one row of the desired output. Add NA columns where necessary, rename the columns to the desired names, and finally do a unionAll to combine the one-row dataframes.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val top = 2

val result = ds.columns.map(
  c => ds.groupBy(c).count()
    .withColumn("rn", row_number().over(Window.orderBy(desc("count"))))
    .filter(s"rn <= $top")
    .groupBy().pivot("rn")
    .agg(first(col(c)), first(col("count")))
    .select(lit(c), col("*"))
).map(df =>
  if (df.columns.size != 1 + top * 2)
    df.select(List(col("*")) ::: (1 to (top * 2 + 1 - df.columns.size)).toList.map(x => lit("NA")): _*)
  else df
).map(df =>
  df.toDF(List("feature") ::: (1 to top).toList.flatMap(x => Seq(s"top$x name", s"top$x value")): _*)
).reduce(_ unionAll _)
result.show
+--------+---------+----------+---------+----------+
| feature|top1 name|top1 value|top2 name|top2 value|
+--------+---------+----------+---------+----------+
|emp_name| ram| 3| sham| 2|
|emp_area| USA| 4| Canada| 1|
| dept| Sales| 3|Marketing| 2|
| zip| 805912| 5| NA| NA|
+--------+---------+----------+---------+----------+
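A small follow-up: unionAll is deprecated from Spark 2.0 onwards in favour of union, so on newer versions the final combine step simply becomes .reduce(_ union _), with everything else unchanged.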

How to read a CSV file into a dataframe when the header is delimited with "," and the rest of the rows are separated with "|"

I have a CSV file whose header is comma-separated while the rest of the rows are separated with another delimiter, "|". How do I handle this mixed-delimiter scenario? Please advise.
import org.apache.spark.sql.{DataFrame, SparkSession}

var df1: DataFrame = null
df1 = spark.read.option("header", "true").option("delimiter", ",").option("inferSchema", "false")
  .option("ignoreLeadingWhiteSpace", "true").option("ignoreTrailingWhiteSpace", "true")
  .csv("/testing.csv")

df1.show(10)
This command parses the header correctly, but all of the data is displayed in the first column and the remaining columns show null values.
Read the CSV first, then split the first column and create a new dataframe from the pieces.
df.show
+---------+----+-----+
|       Id|Date|Value|
+---------+----+-----+
|1|2020|30|null| null|
|1|2020|40|null| null|
|2|2020|50|null| null|
|2|2020|40|null| null|
+---------+----+-----+
import org.apache.spark.sql.functions._
import spark.implicits._

val cols = df.columns
val index = 0 to cols.size - 1
val expr = index.map(i => col("array")(i))

df.withColumn("array", split($"Id", "\\|"))
  .select(expr: _*).toDF(cols: _*).show
+---+----+-----+
| Id|Date|Value|
+---+----+-----+
| 1|2020| 30|
| 1|2020| 40|
| 2|2020| 50|
| 2|2020| 40|
+---+----+-----+
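Putting it together for the file from the question, a minimal sketch (assuming, as in your attempt, that the file is at /testing.csv and that reading it with the "," delimiter parses the header correctly but leaves each data row as one "|"-separated string in the first column):
import org.apache.spark.sql.functions._

// header is parsed with ",", so the column names are already correct;
// every data row lands in the first column as a single "|"-separated string
val raw = spark.read.option("header", "true").csv("/testing.csv")

val cols = raw.columns
val parts = (0 until cols.size).map(i => col("array")(i))

val fixed = raw
  .withColumn("array", split(col(cols.head), "\\|"))
  .select(parts: _*)
  .toDF(cols: _*)

fixed.show()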

Dynamically pass arguments to a function in Scala

I have records as strings with 1000 comma-delimited fields in a dataframe, like:
"a,b,c,d,e.......up to 1000" - 1st record
"p,q,r,s,t ......up to 1000" - 2nd record
I am using the solution suggested in this Stack Overflow question:
Split 1 column into 3 columns in spark scala
df.withColumn("_tmp", split($"columnToSplit", "\\.")).select($"_tmp".getItem(0).as("col1"),$"_tmp".getItem(1).as("col2"),$"_tmp".getItem(2).as("col3")).drop("_tmp")
However, in my case I have 1000 columns, which I have in a JSON schema and can retrieve like this:
val column_seq: Seq[String] = Schema_func.map(_.name)
for (i <- 0 to column_seq.length - 1) { println(i + " " + column_seq(i)) }
which prints:
0 col1
1 col2
2 col3
3 col4
Now I need to pass all of these indexes and column names into the select part of that expression, i.e. into
$"_tmp".getItem(0).as("col1"), $"_tmp".getItem(1).as("col2"), ...
Since I can't write out the long statement with all 1000 columns by hand, is there an effective way to pass all these arguments from the JSON schema above to the select function, so that I can split the columns, add the header, and then convert the DF to Parquet?
You can build a series of org.apache.spark.sql.Column, where each one is the result of selecting the right item and has the right name, and then select these columns:
val columns: Seq[Column] = Schema_func.map(_.name)
  .zipWithIndex // attach index to names
  .map { case (name, index) => $"_tmp".getItem(index) as name }

val result = df
  .withColumn("_tmp", split($"columnToSplit", "\\."))
  .select(columns: _*)
For example, for this input:
case class A(name: String)
val Schema_func = Seq(A("c1"), A("c2"), A("c3"), A("c4"), A("c5"))
val df = Seq("a.b.c.d.e").toDF("columnToSplit")
The result would be:
// +---+---+---+---+---+
// | c1| c2| c3| c4| c5|
// +---+---+---+---+---+
// | a| b| c| d| e|
// +---+---+---+---+---+
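One thing worth adjusting for the records in the question: they are comma-delimited ("a,b,c,..."), whereas the example above splits on a dot, so there the split pattern should be a comma instead:
val result = df
  .withColumn("_tmp", split($"columnToSplit", ","))
  .select(columns: _*)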

Writing RDD Data to CSV with Dynamic Columns in Spark - Scala

I am reading multiple files from an HDFS directory, and for each file the generated data is printed using:
frequencies.foreach(x => println(x._1 + ": "+x._2))
And the printed data is (for File1.txt):
'text': 45
'data': 100
'push': 150
The key can be different for other files like (File2.txt):
'data': 45
'lea': 100
'jmp': 150
The key is not necessarily the same in all the files. I want all the file data to be written to a .csv file in the following format:
Filename   text  data  push  lea  jmp
File1.txt    45   100   150    0    0
File2.txt     0    45     0  100  150
....
Can someone please help me find a solution to this problem?
If your files aren't too big, you can do this without Spark. Here is my example code; the CSV layout is a bit old-style and doesn't exactly match your expected output, but you can tweak it easily.
import java.io.PrintWriter
import scala.io.Source
import org.apache.hadoop.fs._

val sparkSession = ... // I created it only to retrieve the hadoop configuration; you can create your own Configuration instead.
val inputPath = ...
val outputPath = ...

val fs = FileSystem.get(sparkSession.sparkContext.hadoopConfiguration)

// read the content of every .txt file into an Array of (fileName, Map[key, value])
val filesContent = fs.listStatus(new Path(inputPath)).filter(_.isFile).map(_.getPath).filter(_.getName.endsWith(".txt"))
  .map(s => (s.getName, Source.fromInputStream(fs.open(s)).getLines()
    .map(_.split(":").map(_.trim))
    .filter(_.length == 2)
    .map(p => (p.head, p.last)).toMap))

// create a default Map with all possible keys, each defaulting to "0"
val listKeys = filesContent.flatMap(_._2.keys).distinct.map(s => (s, "0")).toMap

val csvContent = filesContent.map(s => (s._1, listKeys ++ s._2))
  .map(s => (s._1, s._2.values.mkString(",")))
  .map(s => s"${s._1},${s._2}")
  .mkString("\n")

val csvHeader = ("Filename" +: listKeys.keys.toList).mkString(",")
val csv = csvHeader + "\n" + csvContent

new PrintWriter(fs.create(new Path(outputPath))) {
  write(csv)
  close()
}
I'd suggest creating one dataframe for all the files inside your directory and then using a pivot to reshape the data accordingly:
val df1 = sc.parallelize(Array(
  ("text", 45),
  ("data", 100),
  ("push", 150))).toDF("key", "value").withColumn("Filename", lit("File1"))

val df2 = sc.parallelize(Array(
  ("data", 45),
  ("lea", 100),
  ("jump", 150))).toDF("key", "value").withColumn("Filename", lit("File2"))

val df = df1.unionAll(df2)
df.show
+----+-----+--------+
| key|value|Filename|
+----+-----+--------+
|text| 45| File1|
|data| 100| File1|
|push| 150| File1|
|data| 45| File2|
| lea| 100| File2|
|jump| 150| File2|
+----+-----+--------+
val finalDf = df.groupBy($"Filename").pivot("key").agg(first($"value") ).na.fill(0)
finalDf.show
+--------+----+----+---+----+----+
|Filename|data|jump|lea|push|text|
+--------+----+----+---+----+----+
| File1| 100| 0| 0| 150| 45|
| File2| 45| 150|100| 0| 0|
+--------+----+----+---+----+----+
You can write it as a CSV using DataFrameWriter
df.write.csv(..)
The hard part with this is creating one dataframe per file, each with an extra column holding the Filename it was created from; a minimal sketch of that step follows.
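This sketch assumes frequencies is the per-file collection of (key, count) pairs from the question and fileName is a name you track yourself while looping over the files (both names are placeholders here):
import org.apache.spark.sql.functions.lit
// assumes the usual spark-shell implicits (spark.implicits._) are in scope for toDF

// inside your per-file loop: turn the (key, count) pairs into a dataframe tagged with its file name
val perFileDf = frequencies.toDF("key", "value").withColumn("Filename", lit(fileName))

// collect the per-file dataframes into a Seq as you go, then combine and pivot exactly as above:
// val df = perFileDfs.reduce(_ unionAll _)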

Scala-Spark (version 1.5.2) Dataframes split error

I have an input file foo.txt with the following content:
c1|c2|c3|c4|c5|c6|c7|c8|
00| |1.0|1.0|9|27.0|0||
01|2|3.0|4.0|1|10.0|1|1|
I want to transform it to a Dataframe to perform some Sql queries:
var text = sc.textFile("foo.txt")
var header = text.first()
var rdd = text.filter(row => row != header)
case class Data(c1: String, c2: String, c3: String, c4: String, c5: String, c6: String, c7: String, c8: String)
Up to this point everything is OK; the problem comes with the next statement:
var df = rdd.map(_.split("\\|")).map(p => Data(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7))).toDF()
If I try to print df with df.show, I get an error message:
scala> df.show()
java.lang.ArrayIndexOutOfBoundsException: 7
I know that the error might be due to the split call. I also tried to split foo.txt using the following syntax:
var df = rdd.map(_.split("""|""")).map(p => Data(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7))).toDF()
And then I get something like this:
scala> df.show()
+------+---------+----------+-----------+-----+-----------+----------------+----------------+
| c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 |
+------+---------+----------+-----------+-----+-----------+----------------+----------------+
| 0| 0| || | || 1| .| 0|
| 0| 1| || 2| || 3| .| 0|
+------+---------+----------+-----------+-----+-----------+----------------+----------------+
Therefore, my question is: how can I correctly load this file into a Dataframe?
EDIT: The error comes from the first row, because of the || field with no space in between. Depending on the example, this kind of field definition either works fine or crashes.
This is because one of your rows yields fewer fields than the others: split without a limit drops trailing empty strings, so the first data row (which ends with ||) produces only 7 fields:
scala> var df = rdd.map(_.split("\\|")).map(_.length).collect()
df: Array[Int] = Array(7, 8)
You can pad the rows manually (but you need to handle each case explicitly):
val df = rdd.map(_.split("\\|")).map { row =>
  row match {
    case Array(a, b, c, d, e, f, g, h) => Data(a, b, c, d, e, f, g, h)
    case Array(a, b, c, d, e, f, g)    => Data(a, b, c, d, e, f, g, " ")
  }
}.toDF()
scala> df.show()
+---+---+---+---+---+----+---+---+
| c1| c2| c3| c4| c5| c6| c7| c8|
+---+---+---+---+---+----+---+---+
| 00| |1.0|1.0| 9|27.0| 0| |
| 01| 2|3.0|4.0| 1|10.0| 1| 1|
+---+---+---+---+---+----+---+---+
EDIT:
A more generic solution would be something like this:
val df = rdd.map(_.split("\\|", -1)).map(_.slice(0, 8))
  .map(p => Data(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7))).toDF()
If you assume that you always have the right number of delimiters, it is safe to use this syntax and truncate the last (empty) value.
My suggestion would be to use Databricks' CSV parser.
Link: https://github.com/databricks/spark-csv
To load your example, I created a sample file similar to yours:
c1|c2|c3|c4|c5|c6|c7|c8|
00| |1.0|1.0|9|27.0|0||
01|2|3.0|4.0|1|10.0|1|1|
To create the dataframe, use the code below:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .option("delimiter", "|") // default is ","
  .load("foo.txt")

df.show
I got the output below (the trailing empty column comes from the trailing "|" on every line):
+---+---+---+---+---+----+---+----+---+
| c1| c2| c3| c4| c5| c6| c7| c8| |
+---+---+---+---+---+----+---+----+---+
| 0| |1.0|1.0| 9|27.0| 0|null| |
| 1| 2|3.0|4.0| 1|10.0| 1| 1| |
+---+---+---+---+---+----+---+----+---+
This way you do not have to bother about parsing the file yourself; you get a dataframe directly.
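The question targets Spark 1.5.2, but for what it's worth, from Spark 2.0 onwards the CSV reader is built in, so the same thing works without the external package; a minimal sketch:
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "|")
  .csv("foo.txt")

df.show()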