How to parse delimited fields with some (sub)fields empty? - scala

I use Spark 2.1.1 and Scala 2.11.8 in spark-shell.
My input dataset is something like :
2017-06-18 00:00:00 , 1497769200 , z287570731_serv80i:7:175 , 5:Re
2017-06-18 00:00:00 , 1497769200 , p286274731_serv80i:6:100 , 138
2017-06-18 00:00:00 , 1497769200 , t219420679_serv37i:2:50 , 5
2017-06-18 00:00:00 , 1497769200 , v290380588_serv81i:12:800 , 144:Jo
2017-06-18 00:00:00 , 1497769200 , z292902510_serv83i:4:45 , 5:Re
2017-06-18 00:00:00 , 1497769200 , v205454093_serv75i:5:70 , 50:AK
It is saved as a CSV file which is read using sc.textFile("input path")
After a few transformations, this is the output of the RDD I have:
(String, String) = ("Re ",7)
I get this by executing
val tid = read_file.map { line =>
  val arr = line.split(",")
  (arr(3).split(":")(1), arr(2).split(":")(1))
}
My input RDD is:
( z287570731_serv80i:7:175 , 5:Re )
( p286274731_serv80i:6:100 , 138 )
( t219420679_serv37i:2:50 , 5 )
( v290380588_serv81i:12:800 , 144:Jo )
( z292902510_serv83i:4:45 , 5:Re )
As can be observed, in the first entry's column 2 I have
5:Re
of which I'm getting the output
("Re ",7)
However, when I reach the second row, column 2 is 138, which according to the format should be
138:null
but gives ArrayIndexOutOfBoundsException on executing
tid.collect()
How can I correct this so that null is displayed with 138 and 5 for the second and third rows respectively? I tried to do it this way:
tid.filter(x => x._1 != null )

The problem is that you expect at least two parts at a given position while there may be only one.
The following is the line that causes the issue.
{var arr = line.split(","); (arr(3).split(":")(1),arr(2).split(":")(1))});
After you do line.split(",") you then do arr(3).split(":")(1) and also arr(2).split(":")(1).
There's too much assumption about the format here, and the code gets beaten by missing values.
but gives ArrayIndexOutOfBoundsException on executing
That's because you access the elements at indices 3 and 2 but have only 2 elements (!)
scala> sc.textFile("input.csv").
  map { line => line.split(",").toSeq }.
  foreach(println)
WrappedArray(( z287570731_serv80i:7:175 , 5:Re ))
WrappedArray(( p286274731_serv80i:6:100 , 138 ))
The problem has almost nothing to do with Spark. It's a regular Scala problem where the data is not where you expect it.
scala> val arr = "hello,world".split(",")
arr: Array[String] = Array(hello, world)
Note that what's above is just pure Scala.
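For completeness, a minimal pure-Scala sketch (illustrative values, not your exact data) of where the exception comes from once a field has no : separator:
scala> val parts = "138".split(":")
parts: Array[String] = Array(138)
scala> parts(1)   // throws java.lang.ArrayIndexOutOfBoundsException: 1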
Solution 1 - Spark Core's RDDs
Given the following dataset...
2017-06-18 00:00:00 , 1497769200 , z287570731_serv80i:7:175 , 5:Re
2017-06-18 00:00:00 , 1497769200 , p286274731_serv80i:6:100 , 138
2017-06-18 00:00:00 , 1497769200 , t219420679_serv37i:2:50 , 5
2017-06-18 00:00:00 , 1497769200 , v290380588_serv81i:12:800 , 144:Jo
2017-06-18 00:00:00 , 1497769200 , z292902510_serv83i:4:45 , 5:Re
2017-06-18 00:00:00 , 1497769200 , v205454093_serv75i:5:70 , 50:AK
...I'd do the following:
val solution = sc.textFile("input.csv").
  map { line => line.split(",") }.
  map { case Array(_, _, third, fourth) => (third, fourth) }.
  map { case (third, fourth) =>
    val Array(_, a @ _*) = fourth.split(":")
    val Array(_, right, _) = third.split(":")
    (a.headOption.orNull, right)
  }
scala> solution.foreach(println)
(Re,7)
(null,6)
(Re,4)
(null,2)
(AK,5)
(Jo,12)
Solution 2 - Spark SQL's DataFrames
I strongly recommend using Spark SQL for such data transformations. As you said, you are new to Spark, so why not start from the right place, which is exactly Spark SQL.
val solution = spark.
  read.
  csv("input.csv").
  select($"_c2" as "third", $"_c3" as "fourth").
  withColumn("a", split($"fourth", ":")).
  withColumn("left", $"a"(1)).
  withColumn("right", split($"third", ":")(1)).
  select("left", "right")
scala> solution.show(false)
+----+-----+
|left|right|
+----+-----+
|Re  |7    |
|null|6    |
|null|2    |
|Jo  |12   |
|Re  |4    |
|AK  |5    |
+----+-----+
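Since the sample data has spaces around the delimiters, the extracted values may carry stray whitespace. A hedged follow-up, reusing the solution DataFrame above together with the trim function from org.apache.spark.sql.functions (already in scope in spark-shell):
val trimmed = solution.
  withColumn("left", trim($"left")).
  withColumn("right", trim($"right"))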

If your data is as below in a file
( z287570731_serv80i:7:175 , 5:Re )
( p286274731_serv80i:6:100 , 138 )
( t219420679_serv37i:2:50 , 5 )
( v290380588_serv81i:12:800 , 144:Jo )
( z292902510_serv83i:4:45 , 5:Re )
Then you can use
val tid = sc.textFile("path to the input file")
  .map(line => line.split(","))
  .map(array => {
    if (array(1).contains(":")) (array(1).split(":")(1).replace(")", "").trim, array(0).split(":")(1))
    else (null, array(0).split(":")(1))
  })
tid.foreach(println)
which should give you output as
(Re,7)
(null,6)
(null,2)
(Jo,12)
(Re,4)
But if you have data as
2017-06-18 00:00:00 , 1497769200 , z287570731_serv80i:7:175 , 5:Re
2017-06-18 00:00:00 , 1497769200 , p286274731_serv80i:6:100 , 138
2017-06-18 00:00:00 , 1497769200 , t219420679_serv37i:2:50 , 5
2017-06-18 00:00:00 , 1497769200 , v290380588_serv81i:12:800 , 144:Jo
2017-06-18 00:00:00 , 1497769200 , z292902510_serv83i:4:45 , 5:Re
2017-06-18 00:00:00 , 1497769200 , v205454093_serv75i:5:70 , 50:AK
2017-06-18 00:00:00 , 1497769200 , z287096299_serv80i:19:15000 , 39:Re
Then you need to do
val tid = sc.textFile("path to the input file")
  .map(line => line.split(","))
  .map(array => {
    if (array(3).contains(":")) (array(3).split(":")(1).replace(")", "").trim, array(2).split(":")(1))
    else (null, array(2).split(":")(1))
  })
tid.foreach(println)
And you should have output as
(Re,7)
(null,6)
(null,2)
(Jo,12)
(Re,4)
(AK,5)
(Re,19)

ArrayIndexOutOfBoundsException occurs because the element will not be there if no : is present in the second element of the tuple.
You can check whether : is present in the second element of each tuple, and then use map to get an intermediate RDD on which you can run your current query.
val rdd = sc.parallelize(Array(
  ("z287570731_serv80i:7:175", "5:Re"),
  ("p286274731_serv80i:6:100", "138"),
  ("t219420679_serv37i:2:50", "5"),
  ("v290380588_serv81i:12:800", "144:Jo"),
  ("z292902510_serv83i:4:45", "5:Re")))
rdd.map { x =>
  val idx = x._2.lastIndexOf(":")
  if (idx == -1) (x._1, x._2 + ":null")
  else (x._1, x._2)
}
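For example, assuming the result of the map above is assigned to a val (padded is just an illustrative name), a sketch of running your current query on it; note that the missing value comes back as the string "null", not a real null:
val padded = rdd.map { x =>
  val idx = x._2.lastIndexOf(":")
  if (idx == -1) (x._1, x._2 + ":null") else (x._1, x._2)
}
padded
  .map(x => (x._2.split(":")(1), x._1.split(":")(1)))
  .foreach(println)
// prints (Re,7), (null,6), (null,2), (Jo,12), (Re,4) in some order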
There are obviously better (fewer lines of code) ways to do what you want to accomplish, but as a beginner it's good to lay out each step in a single command so it's easily readable and understandable, especially with Scala where you can stop global warming with a single line of code.

Related

Scala spark + encoder issues

I am working on a problem where I need to add a new column that holds the length of all characters across all columns.
My sample data set:
ItemNumber,StoreNumber,SaleAmount,Quantity, Date
2231 , 1 , 400 , 2 , 19/01/2020
2145 , 3 , 500 , 10 , 14/01/2020
The expected output would be
19 20
The ideal output I am expecting to build is the data frame with a new column Length added:
ItemNumber,StoreNumber,SaleAmount,Quantity, Date , Length
2231 , 1 , 400 , 2 , 19/01/2020, 19
2145 , 3 , 500 , 10 , 14/01/2020, 20
My code
val spark = SparkSession.builder()
  .appName("SimpleNewIntColumn").master("local").enableHiveSupport().getOrCreate()

val df = spark.read.option("header", "true").csv("./data/sales.csv")

var schema = new StructType
df.schema.toList.map {
  each => schema = schema.add(each)
}
val encoder = RowEncoder(schema)

val charLength = (row: Row) => {
  var len: Int = 0
  row.toSeq.map(x => {
    x match {
      case a: Int    => len = len + a.toString.length
      case a: String => len = len + a.length
    }
  })
  len
}

df.map(row => charLength(row))(encoder) // ERROR - Required Encoder[Int] Found EncoderExpression[Row]

df.withColumn("Length", ?)
I have two issues
1) How to solve the error "ERROR - Required Encoder[Int] Found EncoderExpression[Row]"?
2) How do I add the output of charLength function as new column value? - df.withColumn("Length", ?)
Thank you.
Gurupraveen
If you are just trying to add a column with the total length of that Row, you can simply concat all the columns cast to String and use the length function:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType
val concatCol = concat(df.columns.map(col(_).cast(StringType)):_*)
df.withColumn("Length", length(concatCol))
Output:
+----------+-----------+----------+--------+----------+------+
|ItemNumber|StoreNumber|SaleAmount|Quantity|      Date|Length|
+----------+-----------+----------+--------+----------+------+
|      2231|          1|       400|       2|19/01/2020|    19|
|      2145|          3|       500|      10|14/01/2020|    20|
+----------+-----------+----------+--------+----------+------+
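One caveat: concat returns null as soon as any of the concatenated columns is null, which would make Length null for such rows. A hedged variant that counts nulls as length 0 by coalescing each column to an empty string first:
val concatColSafe = concat(df.columns.map(c => coalesce(col(c).cast(StringType), lit(""))): _*)
df.withColumn("Length", length(concatColSafe))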

Fill in missing weeks within a given date interval in Spark (Scala)

Consider the following DataFrame:
val df = Seq("20140101", "20170619")
  .toDF("date")
  .withColumn("date", to_date($"date", "yyyyMMdd"))
  .withColumn("week", date_format($"date", "Y-ww"))
The code yields:
date: date
week: string
date        week
2014-01-01  2014-01
2017-06-19  2017-25
What I would like to do is thicken the dataframe so I'm left with one row for each week in the interval between 2014-01 and 2017-25. The date column isn't important so it can be discarded.
This needs to be done over a myriad of customer / product id combinations, so I'm looking for an efficient solution, preferably using nothing beyond java.sql.date and the built-in date functionalities in Spark.
Check this out. I have used the default "Sunday" as the week start day.
scala> import java.time._
import java.time._
scala> import java.time.format._
import java.time.format._
scala> val a = java.sql.Date.valueOf("2014-01-01")
a: java.sql.Date = 2014-01-01
scala> val b = java.sql.Date.valueOf("2017-12-31")
b: java.sql.Date = 2017-12-31
scala> val a1 = a.toLocalDate.toEpochDay.toInt
a1: Int = 16071
scala> val b1 = b.toLocalDate.toEpochDay.toInt
b1: Int = 17531
scala> val c1 = (a1 until b1).map(LocalDate.ofEpochDay(_)).map(x => (x,x.format(DateTimeFormatter.ofPattern("Y-ww")),x.format(DateTimeFormatter.ofPattern("E")) ) ).filter( x=> x._3 =="Sun" ).map(x => (java.sql.Date.valueOf(x._1),x._2) ).toMap
c1: scala.collection.immutable.Map[java.sql.Date,String] = Map(2014-06-01 -> 2014-23, 2014-11-02 -> 2014-45, 2017-11-05 -> 2017-45, 2016-10-23 -> 2016-44, 2014-11-16 -> 2014-47, 2014-12-28 -> 2015-01, 2017-04-30 -> 2017-18, 2015-01-04 -> 2015-02, 2015-10-11 -> 2015-42, 2014-09-07 -> 2014-37, 2017-09-17 -> 2017-38, 2014-04-13 -> 2014-16, 2014-10-19 -> 2014-43, 2014-01-05 -> 2014-02, 2016-07-17 -> 2016-30, 2015-07-26 -> 2015-31, 2016-09-18 -> 2016-39, 2015-11-22 -> 2015-48, 2015-10-04 -> 2015-41, 2015-11-15 -> 2015-47, 2015-01-11 -> 2015-03, 2016-12-11 -> 2016-51, 2017-02-05 -> 2017-06, 2016-03-27 -> 2016-14, 2015-11-01 -> 2015-45, 2017-07-16 -> 2017-29, 2015-05-24 -> 2015-22, 2017-06-18 -> 2017-25, 2016-03-13 -> 2016-12, 2014-11-09 -> 2014-46, 2014-09-21 -> 2014-39, 2014-01-26 -> 2014-05...
scala> val df = Seq( (c1) ).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: map<date,string>]
scala> val df2 = df.select(explode('a).as(Seq("dt","wk")) )
df2: org.apache.spark.sql.DataFrame = [dt: date, wk: string]
scala> df2.orderBy('dt).show(false)
+----------+-------+
|dt        |wk     |
+----------+-------+
|2014-01-05|2014-02|
|2014-01-12|2014-03|
|2014-01-19|2014-04|
|2014-01-26|2014-05|
|2014-02-02|2014-06|
|2014-02-09|2014-07|
|2014-02-16|2014-08|
|2014-02-23|2014-09|
|2014-03-02|2014-10|
|2014-03-09|2014-11|
|2014-03-16|2014-12|
|2014-03-23|2014-13|
|2014-03-30|2014-14|
|2014-04-06|2014-15|
|2014-04-13|2014-16|
|2014-04-20|2014-17|
|2014-04-27|2014-18|
|2014-05-04|2014-19|
|2014-05-11|2014-20|
|2014-05-18|2014-21|
+----------+-------+
only showing top 20 rows
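If you prefer to skip the intermediate Map, here is a hedged sketch of the same idea that generates the days on the driver, derives the week label, and keeps one row per week (column names are illustrative; it assumes spark.implicits._ is in scope, as in spark-shell, and the locale-dependent Y-ww formatting may differ slightly from Spark's date_format):
import java.time.format.DateTimeFormatter

val start = java.sql.Date.valueOf("2014-01-01").toLocalDate
val end   = java.sql.Date.valueOf("2017-06-19").toLocalDate
val fmt   = DateTimeFormatter.ofPattern("Y-ww")

val weeksDF = Iterator.iterate(start)(_.plusDays(1))
  .takeWhile(!_.isAfter(end))
  .map(d => (java.sql.Date.valueOf(d), d.format(fmt)))  // one (date, week) pair per day
  .toSeq
  .toDF("dt", "wk")
  .dropDuplicates("wk")                                 // keep one row per week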

Dataframe Aggregation

I have a dataframe DF with the following structure :
ID, DateTime, Latitude, Longitude, otherArgs
I want to group my data by ID and time window, and keep information about the location (For example the mean of the grouped latitude and the mean of the grouped longitude)
I successfully got a new dataframe with data grouped by id and time using :
DF.groupBy($"ID",window($"DateTime","2 minutes")).agg(max($"ID"))
But I lose my location data doing that.
What I am looking for is something that would look like this for example:
DF.groupBy($"ID",window($"DateTime","2 minutes"),mean("latitude"),mean("longitude")).agg(max($"ID"))
Returning only one row for each ID and time window.
EDIT :
Sample input :
DF : ID, DateTime, Latitude, Longitude, otherArgs
0 , 2018-01-07T04:04:00 , 25.000, 55.000, OtherThings
0 , 2018-01-07T04:05:00 , 26.000, 56.000, OtherThings
1 , 2018-01-07T04:04:00 , 26.000, 50.000, OtherThings
1 , 2018-01-07T04:05:00 , 27.000, 51.000, OtherThings
Sample output :
DF : ID, window(DateTime), Latitude, Longitude
0 , (2018-01-07T04:04:00 : 2018-01-07T04:06:00) , 25.5, 55.5
1 , (2018-01-07T04:04:00 : 2018-01-07T04:06:00) , 26.5, 50.5
Here is what you can do: you need to use mean with the aggregation.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DateType

val df = Seq(
  (0, "2018-01-07T04:04:00", 25.000, 55.000, "OtherThings"),
  (0, "2018-01-07T04:05:00", 26.000, 56.000, "OtherThings"),
  (1, "2018-01-07T04:04:00", 26.000, 50.000, "OtherThings"),
  (1, "2018-01-07T04:05:00", 27.000, 51.000, "OtherThings")
).toDF("ID", "DateTime", "Latitude", "Longitude", "otherArgs")
  // convert String to DateType for DateTime
  .withColumn("DateTime", $"DateTime".cast(DateType))
df.groupBy($"id", window($"DateTime", "2 minutes"))
.agg(
mean("Latitude").as("lat"),
mean("Longitude").as("long")
)
.show(false)
Output:
+---+---------------------------------------------+----+----+
|id |window                                       |lat |long|
+---+---------------------------------------------+----+----+
|1  |[2018-01-06 23:59:00.0,2018-01-07 00:01:00.0]|26.5|50.5|
|0  |[2018-01-06 23:59:00.0,2018-01-07 00:01:00.0]|25.5|55.5|
+---+---------------------------------------------+----+----+
You should use the .agg() method for the aggregation.
Perhaps this is what you mean?
DF
  .groupBy(
    'ID,
    window('DateTime, "2 minutes")
  )
  .agg(
    mean("latitude").as("latitudeMean"),
    mean("longitude").as("longitudeMean")
  )

How to collect and process column-wise data in Spark

I have a dataframe that contains 7 days of 24-hour data, so it has 144 columns.
id d1h1 d1h2 d1h3 ..... d7h24
aaa 21 24 8 ..... 14
bbb 16 12 2 ..... 4
ccc 21 2 7 ..... 6
What I want to do is to find the max 3 values for each day:
id d1 d2 d3 .... d7
aaa [22,2,2] [17,2,2] [21,8,3] [32,11,2]
bbb [32,22,12] [47,22,2] [31,14,3] [32,11,2]
ccc [12,7,4] [28,14,7] [11,2,1] [19,14,7]
import org.apache.spark.sql.functions._

var df = ...
val first3 = udf((list: Seq[Double]) => list.slice(0, 3))

for (i <- 1 to 7) {
  val columns = (1 to 24).map(x => "d" + i + "h" + x)
  df = df
    .withColumn("d" + i, first3(sort_array(array(columns.head, columns.tail: _*), false)))
    .drop(columns: _*)
}
This should give you what you want. In fact, for each day I aggregate the 24 hours into an array column, sort it in descending order, and select the first 3 elements.
Define pattern:
val p = "^(d[1-7])h[0-9]{1,2}$".r
Group columns:
import org.apache.spark.sql.functions._

val cols = df.columns.tail
  .groupBy { case p(d) => d }
  .map { case (c, cs) =>
    val sorted = sort_array(array(cs map col: _*), false)
    array(sorted(0), sorted(1), sorted(2)).as(c)
  }
And select:
df.select($"id" +: cols.toSeq: _*)

Replace new line (\n) character in csv file - spark scala

Just to illustrate the problem, I have taken a test CSV file, but in a real-case scenario it has to handle more than a terabyte of data.
I have a CSV file where the columns are enclosed by quotes ("col1"). When the data was imported, one column turned out to contain newline characters (\n). This is leading to a lot of problems when I want to save the data as Hive tables.
My idea was to replace the \n character with a "|" pipe in Spark.
Here is what I have achieved so far:
// 1. read a csv file
val test = sqlContext.load(
  "com.databricks.spark.csv",
  Map("path" -> "test_set.csv", "header" -> "true", "inferSchema" -> "true", "delimiter" -> ",", "quote" -> "\"", "escape" -> "\\", "parserLib" -> "univocity"))

// 2. convert to dataframe
val dataframe = test.toDF()

// 3. print
dataframe.foreach(println)

// 4. replace not working for me
dataframe.map(row => {
  val row4 = row.getAs[String](4)
  val make = row4.replaceAll("[\r\n]", "|")
  (make)
}).collect().foreach(println)
Sample set :
(17 , D73 ,525, 1 ,testing\n , 90 ,20.07.2011 ,null ,F10 , R)
(17 , D73 ,526, 1 ,null , 89 ,20.07.2011 ,null ,F10 , R)
(17 , D73 ,529, 1 ,once \n again, 10 ,20.07.2011 ,null ,F10 , R)
(17 , D73 ,531, 1 ,test3\n , 10 ,20.07.2011 ,null ,F10 , R)
Expected result set :
(17 , D73 ,525, 1 ,testing| , 90 ,20.07.2011 ,null ,F10 , R)
(17 , D73 ,526, 1 ,null , 89 ,20.07.2011 ,null ,F10 , R)
(17 , D73 ,529, 1 ,once | again, 10 ,20.07.2011 ,null ,F10 , R)
(17 , D73 ,531, 1 ,test3| , 10 ,20.07.2011 ,null ,F10 , R)
What worked for me:
val rep = "\n123\n Main Street\n".replaceAll("[\\r\\n]", "|")
rep: String = |123| Main Street|
But why am I not able to do it on a tuple basis?
val dataRDD = lines_wo_header.map(line => line.split(";")).map(row => (row(0).toLong, row(1).toString,
  row(2).toLong, row(3).toLong,
  row(4).toString, row(5).toLong,
  row(6).toString, row(7).toString, row(8).toString, row(9).toString))

dataRDD.map(row => {
  val wert = row._5.replaceAll("[\\r\\n]", "|")
  (row._1, row._2, row._3, row._4, wert, row._6, row._7, row._8, row._9, row._10)
}).collect().foreach(println)
Spark --version 1.3.1
If you can use Spark SQL 1.5 or higher, you may consider using the functions available for columns. Assuming you don't know (or don't have) the names for the columns, you can do as in the following snippet:
val df = test.toDF()
import org.apache.spark.sql.functions._
val newDF = df.withColumn(df.columns(4), regexp_replace(col(df.columns(4)), "[\\r\\n]", "|"))
If you know the name of the column, you can replace df.columns(4) by its name in both occurrences.
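For example, a hypothetical variant assuming the fifth column is named description:
val newDF = df.withColumn("description", regexp_replace(col("description"), "[\\r\\n]", "|"))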
I hope that helps.
Cheers.
My idea was to replace the \n character with "|" pipe in spark.
I tried the replaceAll method but it was not working. Here is an alternative to achieve the same:
val test = sq.load(
  "com.databricks.spark.csv",
  Map("path" -> "file:///home/veda/sample.csv", "header" -> "false", "inferSchema" -> "true", "delimiter" -> ",", "quote" -> "\"", "escape" -> "\\", "parserLib" -> "univocity"))

val dataframe = test.toDF()

val mapped = dataframe.map({
  row => {
    val str = row.get(0).toString()
    var fnal = new StringBuilder(str)

    // replace newLine
    var newLineIndex = fnal.indexOf("\\n")
    while (newLineIndex != -1) {
      fnal.replace(newLineIndex, newLineIndex + 2, "|")
      newLineIndex = fnal.indexOf("\\n")
    }

    // replace carriage returns
    var cgIndex = fnal.indexOf("\\r")
    while (cgIndex != -1) {
      fnal.replace(cgIndex, cgIndex + 2, "|")
      cgIndex = fnal.indexOf("\\r")
    }

    (fnal.toString()) // tuple modified
  }
})
mapped.collect().foreach(println)
Note: You might want to move the duplicate code to a separate function, e.g. as sketched below.
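A hedged sketch of such a helper (name and signature are illustrative; it mirrors the two while loops above):
def replaceAllLiteral(sb: StringBuilder, target: String, replacement: String): StringBuilder = {
  var idx = sb.indexOf(target)
  while (idx != -1) {
    sb.replace(idx, idx + target.length, replacement)
    idx = sb.indexOf(target)
  }
  sb
}
// usage inside the map, e.g.:
// replaceAllLiteral(fnal, "\\n", "|")
// replaceAllLiteral(fnal, "\\r", "|")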
Multiline support for CSV was added in Spark 2.2 (see the JIRA), and Spark 2.2 is not yet released.
I faced the same issue and resolved it with the help of a Hadoop input format and reader.
Copy the InputFormat and reader classes from git and implement them like this:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// implementation
JavaPairRDD<LongWritable, Text> rdd =
    context.newAPIHadoopFile(path, FileCleaningInputFormat.class, null, null, new Configuration());
JavaRDD<String> inputWithMultiline = rdd.map(s -> s._2().toString());
Another solution: use CSVInputFormat from Apache Crunch to read the CSV file, then parse each CSV line using opencsv:
sparkContext.newAPIHadoopFile(path, CSVInputFormat.class, null, null, new Configuration()).map(s -> s._2().toString());
Apache Crunch Maven dependency:
<dependency>
    <groupId>org.apache.crunch</groupId>
    <artifactId>crunch-core</artifactId>
    <version>0.15.0</version>
</dependency>