Spark each file to a dataset row - scala

I have many files in a directory, each file containing text spanning multiple lines.
Currently, I use the following code to read all of those files into a Spark Dataset (Spark > 2.0):
val ddf = spark.read.text("file:///input/*")
However, this creates a dataset where each row is a line, not a whole file. I'd like each file (as a single string) to be one row in the dataset.
How can I achieve this without iterating over each file and reading it in separately as an RDD?

Use wholeTextFiles() on SparkContext
val rdd: RDD[(String, String)] = spark.sparkContext
.wholeTextFiles("file/path/to/read/as/rdd")
SparkContext.wholeTextFiles lets you read a directory containing
multiple small text files, and returns each of them as (filename,
content) pairs. This is in contrast with textFile, which would return
one record per line in each file.
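
If the goal is a Dataset with one row per file, the wholeTextFiles pairs can be converted directly. A minimal sketch, assuming the file:///input/* path from the question and a SparkSession named spark:

import spark.implicits._

// each element is (file path, entire file content), so each file becomes one row
val filesDF = spark.sparkContext
  .wholeTextFiles("file:///input/*")
  .toDF("path", "content")

filesDF.show(false)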

An alternative to the wholeTextFiles answer above would be to group by input_file_name. Given the structure:
evan#vbox>~/junk/so> find .
.
./d2
./d2/t.txt
./d1
./d1/t.txt
evan#vbox>~/junk/so> cat */*.txt
d1_1
d1_2
d2_1
d2_2
We can collect lists based on the input files like so:
scala> val ddf = spark.read.textFile("file:///home/evan/junk/so/*").
| select($"value", input_file_name as "fName")
ddf: org.apache.spark.sql.DataFrame = [value: string, fName: string]
scala> ddf.show(false)
+-----+----------------------------------+
|value|fName                             |
+-----+----------------------------------+
|d2_1 |file:///home/evan/junk/so/d2/t.txt|
|d2_2 |file:///home/evan/junk/so/d2/t.txt|
|d1_1 |file:///home/evan/junk/so/d1/t.txt|
|d1_2 |file:///home/evan/junk/so/d1/t.txt|
+-----+----------------------------------+
scala> ddf.groupBy("fName").agg(collect_list($"value") as "value").
| drop("fName").show
+------------+
|       value|
+------------+
|[d1_1, d1_2]|
|[d2_1, d2_2]|
+------------+
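
If each file is wanted as one single string rather than a list of lines, concat_ws can stitch the collected lines back together. A minimal sketch, assuming a "\n" separator and that org.apache.spark.sql.functions._ is in scope, as in the spark-shell session above:

ddf.groupBy("fName")
  .agg(concat_ws("\n", collect_list($"value")) as "content")
  .drop("fName")
  .show(false)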

How to parse List of Values from a column of a file into Spark sql Dataframe

I am still new to Spark and Scala. I have a requirement to extract the first partition from every table in Hive. I have extracted the table list into a separate text file and loaded it as a sequence, but I have no idea how to substitute each value of the sequence into "show partitions test_hive_database.<table_name>".
scala> import scala.io.Source
import scala.io.Source
scala> val filename = "text_tables.txt"
filename: String = text_tables.txt
Sample file containing the table list:
TABLE_NAME_A101
TABLE_NAME_A102
TABLE_NAME_A103
TABLE_NAME_B101
TABLE_NAME_C101
scala> val linestable =
scala.io.Source.fromFile("text_tables.txt").getLines.toSeq
linestable: Seq[String] = Stream(TABLE_NAME_A101, ?)
Below is the sample first partition from one table, where I have concatenated the table name with the partition value.
scala> sql("show partitions test_hive_database.TABLE_NAME_A101").withColumn("new_column",concat(lit("TABLE_NAME_A101,"),'partition)).select("new_column").show(1,false)
+------------------------------------+
|new_column                          |
+------------------------------------+
|TABLE_NAME_A101,dta_ld_dt=2018-01-23|
+------------------------------------+
only showing top 1 row
I tried with a for comprehension:
scala> for(e <- linestable) yield (sql("show partitions test_hive_database.$e").withColumn("new_column",concat(lit("$e , "),'partition)).select("new_column").show(1,false))
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input '$' expecting {'SELECT', 'FROM', 'ADD'
Expected Result
+------------------------------------+
|new_column                          |
+------------------------------------+
|TABLE_NAME_A101,dta_ld_dt=2018-01-23|
|TABLE_NAME_A102,dta_ld_dt=2018-02-28|
|TABLE_NAME_A103,dta_ld_dt=2018-03-31|
|TABLE_NAME_B101,dta_ld_dt=2018-04-30|
|TABLE_NAME_C101,dta_ld_dt=2019-01-30|
+------------------------------------+
Actual result:
I am getting an error, and I am not sure this approach is correct.
How do I parse single-column values from a file into a Spark SQL query (as the table name) and get the results appended to a CSV file?
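
A possible fix, sketched below: the ParseException comes from sending the literal text "$e" to the SQL parser, so the table name has to be spliced in with an s-interpolated string. This assumes a spark-shell session (so org.apache.spark.sql.functions._ and spark.implicits._ are in scope) and a hypothetical output path:

// build one single-row DataFrame per table, interpolating the table name into the SQL
val perTable = for (table <- linestable) yield {
  sql(s"show partitions test_hive_database.$table")
    .withColumn("new_column", concat(lit(s"$table,"), 'partition))
    .select("new_column")
    .limit(1)                      // keep only the first partition, as in the example above
}

// union the per-table results and append them to a CSV file (path is a placeholder)
val combined = perTable.reduce(_ union _)
combined.show(false)
combined.write.mode("append").csv("/tmp/first_partitions")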

check data size spark dataframes

I have the following question:
I am working with the following CSV file:
""job"";""marital"""
""management"";""married"""
""technician"";""single"""
I loaded it into a Spark DataFrame (see the code below).
My aim is to check the length and type of each field in the DataFrame according to the set of rules below:
col type
job char10
marital char7
I started implementing the check of the length of each field, but I am getting a compilation error:
val data = spark.read.option("inferSchema", "true").option("header", "true").csv("file:////home/user/Desktop/user/file.csv")
data.map(line => {
val fields = line.toString.split(";")
fields(0).size
fields(1).size
})
The expected output should be:
List(10,10)
As for checking the types, I don't have any idea how to implement it since we are using DataFrames. Is there a function for verifying the data format?
Thanks a lot in advance for your replies.
I see you are trying to use a DataFrame, but since there are multiple double quotes you can read the file as a textFile, remove them, and convert to a DataFrame as below:
import org.apache.spark.sql.functions._
import spark.implicits._

// read the file as plain text and strip all double quotes
val raw = spark.read.textFile("path to file")
  .map(_.replaceAll("\"", ""))

// separate the header row from the data rows and build the DataFrame
val header = raw.first
val data = raw.filter(row => row != header)
  .map { r => val x = r.split(";"); (x(0), x(1)) }
  .toDF(header.split(";"): _*)
With data.show(false) you get:
+----------+-------+
|job       |marital|
+----------+-------+
|management|married|
|technician|single |
+----------+-------+
To calculate the size, you can use withColumn with the length function and adapt it as you need.
data.withColumn("jobSize", length($"job"))
.withColumn("maritalSize", length($"marital"))
.show(false)
Output:
+----------+-------+-------+-----------+
|job       |marital|jobSize|maritalSize|
+----------+-------+-------+-----------+
|management|married|10     |7          |
|technician|single |10     |6          |
+----------+-------+-------+-----------+
All the column types are String.
Hope this helps!
You are using a DataFrame, so when you use the map method you are processing a Row in your lambda; line is a Row.
Row.toString returns a string representation of the Row, in your case two StructFields typed as String.
If you want to use map and process your Row, you have to get the values inside the fields manually, with getAs[String] or getString.
Usually when you use DataFrames you work in column logic, as in SQL, using select, where..., or directly the SQL syntax.
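
A minimal sketch of that Row-based approach, assuming the cleaned data DataFrame from the first answer (string columns job and marital) and that spark.implicits._ is imported for the tuple encoder:

val fieldSizes = data.map { row =>
  // pull the values out of the Row explicitly, then measure them
  val job     = row.getAs[String]("job")
  val marital = row.getAs[String]("marital")
  (job.length, marital.length)
}

fieldSizes.show(false)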

Stringbuilder to RDD

I have a StringBuilder (sb) with the data below in the Scala IDE:
CellId,Date,Time,MeasType,MeasResult
251498240,2016-12-02,20:45:00,RRC.ConnEstabAtt.emergency,0
251498240,2016-12-02,20:45:00,RRC.ConnEstabAtt.highPriorityAccess,0
251498240,2016-12-02,20:45:00,RRC.ConnEstabAtt.mt-Access,4
Now I want to convert this string into an RDD using Scala. Please help me.
I am using this code, but with no luck. Thanks in advance.
val headerFile = sc.parallelize(sb)
headerFile.collect()
StringBuilder is used to build strings from a mutable sequence of characters, so whatever you add to the builder is appended to become one string.
You need to separate the added strings so they can be used as a list of strings in the SparkContext.
Assuming the strings are added with a trailing line feed, you can split the string builder on line feeds and transform the result into an RDD:
val headerFile = sc.parallelize(sb.toString.split("\n"))
headerFile.collect()
To visualize the data, you would have to print it or save it to a file.
If you want to convert it to a DataFrame before saving, you can do it as below:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// split the builder into lines: the first line is the header, the rest are data rows
val data = sb.toString.split("\n")
val schema = StructType(data.head.split(",").map(StructField(_, StringType, true)))
val rdd = sc.parallelize(data.tail.map(line => Row.fromSeq(line.split(","))))
spark.createDataFrame(rdd, schema).show(false)
which should give you
+---------+----------+--------+-----------------------------------+----------+
|CellId   |Date      |Time    |MeasType                           |MeasResult|
+---------+----------+--------+-----------------------------------+----------+
|251498240|2016-12-02|20:45:00|RRC.ConnEstabAtt.emergency         |0         |
|251498240|2016-12-02|20:45:00|RRC.ConnEstabAtt.highPriorityAccess|0         |
|251498240|2016-12-02|20:45:00|RRC.ConnEstabAtt.mt-Access         |4         |
+---------+----------+--------+-----------------------------------+----------+
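
Picking up the save-to-file remark above, the DataFrame can also be written out directly. A minimal sketch; the output path, format, and options are assumptions:

val df = spark.createDataFrame(rdd, schema)
df.write.mode("overwrite").option("header", "true").csv("/tmp/meas_output")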

creating dataframe by loading csv file using scala in spark

But the CSV file has extra double quotes added, which puts all the columns into a single column.
There are four columns, a header, and 2 rows:
"""SlNo"",""Name"",""Age"",""contact"""
"1,""Priya"",78,""Phone"""
"2,""Jhon"",20,""mail"""
val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("delimiter",",").option("inferSchema","true").load ("bank.csv")
df: org.apache.spark.sql.DataFrame = ["SlNo","Name","Age","contact": string]
What you can do is read it using the SparkContext, replace all " characters with nothing, and use zipWithIndex() to separate the header from the data so that a custom schema and a row RDD can be created. Finally, just use the row RDD and the schema in sqlContext's createDataFrame API.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

//reading the text file, removing quotes, splitting on commas and finally zipping with index
val rdd = sc.textFile("bank.csv").map(_.replaceAll("\"", "").split(",")).zipWithIndex()
//separating the header to form the schema
val header = rdd.filter(_._2 == 0).flatMap(_._1).collect()
val schema = StructType(header.map(StructField(_, StringType, true)))
//separating the data to form the row rdd
val rddData = rdd.filter(_._2 > 0).map(x => Row.fromSeq(x._1))
//creating the dataframe
sqlContext.createDataFrame(rddData, schema).show(false)
You should be getting
+----+-----+---+-------+
|SlNo|Name |Age|contact|
+----+-----+---+-------+
|1   |Priya|78 |Phone  |
|2   |Jhon |20 |mail   |
+----+-----+---+-------+
I hope the answer is helpful

Issue in splitting data in a txt file while converting from RDD to DataFrame in Spark using Scala

I am reading data from a text file as an RDD and converting it into a DataFrame, but I am not getting the desired output.
Code -
val myFile = sc.textFile("car.txt")
val df = myFile.map(_.split(" ")).map(line => Text(line(0))).toDF()
df.show()
where Text is the case class:
case class Text(field: String)
Data in the car.txt file -
hyundai honda
honda maruti
maruti honda
Output when executing -
+-------+
|  field|
+-------+
|hyundai|
|  honda|
| maruti|
+-------+
Why am I not getting all the data from the text file in the DataFrame?
It's because you are splitting the data on spaces and then only keeping the first element (the first word): line(0).
If you just want the lines, you can drop the .map(_.split(" ")) step and use the whole line (no (0)), as sketched below.
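
A minimal sketch of that suggestion, reusing sc and the Text case class from the question (and assuming the implicits needed for toDF are in scope, as in the original code):

// keep every full line as one row
val myFile = sc.textFile("car.txt")
val df = myFile.map(line => Text(line)).toDF()
df.show()

// alternatively, if every word should become its own row, flatten the split instead
val words = myFile.flatMap(_.split(" ")).map(Text(_)).toDF()
words.show()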