Convert CSV string into list of strings - scala

val lines : String = ("a1 , test1 , test2 , a2 , test3 , test4")
I'd like to convert this to a list of Strings, where each string in the list contains 3 elements, so the above is converted to a 2-element list of strings containing "a1 , test1 , test2" and "a2 , test3 , test4".
One option I have considered is to iterate over each CSV element in the string and, on every third element, add the previous elements to a new string. Is there a more functional approach?

grouped partitions the elements into fixed-size groups of n:
scala> lines.split(",").grouped(3).toList
res0: List[Array[String]] = List(Array("a1 ", " test1 ", " test2 "), Array(" a2 ", " test3 ", " test4"))

The answer by @Brian suffices; for an output formatted as
"a1 , test1 , test2" and "a2 , test3 , test4"
consider for instance
scala> val groups = lines.split(",").grouped(3).map { _.mkString(",").trim }.toList
groups: List[String] = List(a1 , test1 , test2, a2 , test3 , test4)
Then
scala> groups(0)
res1: String = a1 , test1 , test2
and
scala> groups(1)
res2: String = a2 , test3 , test4
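
If the surrounding whitespace is unwanted, here is a minimal sketch (assuming the same lines value as above) that trims each field before regrouping:

// Trim each field, then group into threes and rejoin
val trimmed = lines.split(",").map(_.trim).grouped(3).map(_.mkString(", ")).toList
// trimmed: List("a1, test1, test2", "a2, test3, test4")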

Related

getting the values of a column with keys - spark scala

I have a Map[String, Int] like this
val map1 = Map( "S" -> 1 , "T" -> 2, "U" -> 3)
and a DataFrame with a column called mappedcol (type array[string]). The first and second rows of the column are [S,U] and [U,U]. I would like to map every row of this column to the values of its keys, so I would have [1,3] instead of [S,U] and [3,3] instead of [U,U]. How can I do this effectively?
Thanks
The map can be transformed into an SQL expression based on transform and when:
var ex = "transform(value, v -> case ";
for ((k,v) <- map1) ex += s"when v = '${k}' then ${v} "
ex += "else 99 end)"
ex now contains the string
transform(value, v -> case when v = 'S' then 1 when v = 'T' then 2 when v = 'U' then 3 else 99 end)
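As a side note, the same expression string can also be built without a var, for instance with mkString (a small sketch over the same map1; the order of the case branches follows the map's iteration order, which does not affect the result):

// Build one "when ... then ..." fragment per map entry and join them into the CASE expression
val ex2 = map1
  .map { case (k, v) => s"when v = '$k' then $v" }
  .mkString("transform(value, v -> case ", " ", " else 99 end)")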
This expression can now be used to calculate a new column:
import org.apache.spark.sql.functions._
df.withColumn("result", expr(ex)).show();
Output:
+---+------+------+
| id| value|result|
+---+------+------+
| 1|[S, U]|[1, 3]|
| 2|[U, U]|[3, 3]|
+---+------+------+
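
If building SQL strings feels fragile, a hedged alternative sketch is a plain Scala UDF doing the same lookup (this assumes the same map1 and a DataFrame df with an array<string> column named "value", as in the example above):

import org.apache.spark.sql.functions._

// Look each array element up in map1, defaulting to 99 like the CASE expression
val lookup = udf((xs: Seq[String]) => xs.map(x => map1.getOrElse(x, 99)))
df.withColumn("result", lookup(col("value"))).show()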

How to get distinct count value in scala

I want to get the distinct count per key, equivalent to this query, in Scala:
select
key,
count(distinct suppKey)
from
file
group by
key ;
I wrote this code in Scala, but it isn't working:
val count= file.map(line=> (line.split('|')(0),line.split('|')(1)).distinct().count())
I split the line because the key is the first field in the file and suppKey is the second.
File:
1|52|3956|337.0
1|77|4069|357.8
1|7|14|35.2
2|3|8895|378.4
2|3|4969|915.2
2|3|8539|438.3
2|78|3025|306.3
Expected output:
1|3
2|2
Instead of a file, for simpler testing, I use a String:
scala> val s="""1|52|3956|337.0
| 1|77|4069|357.8
| 1|7|14|35.2
| 2|3|8895|378.4
| 2|3|4969|915.2
| 2|3|8539|438.3
| 2|78|3025|306.3"""
scala> s.split("\n").map (line => {val sp = line.split ('|'); (sp(0), sp(1))}).distinct.groupBy (_._1).map (e => (e._1, e._2.size))
res198: scala.collection.immutable.Map[String,Int] = Map(2 -> 2, 1 -> 3)
IMHO, we need a groupBy to specify what to group over, and then count per group.
Done in the Spark REPL; test.txt is the file with the text you've provided:
val d = sc.textFile("test.txt")
d.map(x => (x.split("\\|")(0), x.split("\\|")(1))).distinct.countByKey
scala.collection.Map[String,Long] = Map(2 -> 2, 1 -> 3)
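
For completeness, a hedged DataFrame sketch of the same aggregation (assuming a SparkSession named spark is available and test.txt has no header, so the default column names _c0.._c3 apply):

import org.apache.spark.sql.functions._

// Read the pipe-delimited file, then count distinct suppKeys per key
val df = spark.read.option("delimiter", "|").csv("test.txt")
df.groupBy("_c0").agg(countDistinct("_c1").as("suppKeyCount")).show()
// expected counts: key 1 -> 3, key 2 -> 2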

Spark dataframe explode pair of lists

My dataframe has 2 columns which look like this:
col_id     | col_name
-----------------------------
[id1, id2] | [name1, name2]
-----------------------------
[id3, id4] | [name3, name4]
....
so for each row, there are 2 matching arrays of the same length in columns id and name. What I want is to get each pair of id/name as a separate row like:
col_id| col_name
-----------
id1 | name1
-----------
id2 | name2
....
explode seems like the function to use but I can't seem to get it to work. What I tried is:
rdd.explode(col("col_id"), col("col_name")) ({
  case row: Row =>
    val ids: java.util.List[String] = row.getList(0)
    val names: java.util.List[String] = row.getList(1)
    var res: Array[(String, String)] = new Array[(String, String)](ids.size)
    for (i <- 0 until ids.size) {
      res :+ (ids.get(i), names.get(i))
    }
    res
})
This, however, returns only nulls, so it might just be my poor knowledge of Scala. Can anyone point out the issue?
Looks like the last 10 minutes out of the past 1-2 hours did the trick, lol. (The original attempt kept returning nulls because res :+ (ids.get(i), names.get(i)) builds a new array and discards it, leaving res full of nulls.) This works just fine:
import scala.collection.JavaConverters._  // for .asScala

df.explode(col("id"), col("name")) ({
  case row: Row =>
    val ids: List[String] = row.getList(0).asScala.toList
    val names: List[String] = row.getList(1).asScala.toList
    ids zip names
})
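
Since the explode method on DataFrames is deprecated in later Spark versions, here is a hedged alternative sketch with arrays_zip (assuming Spark 2.4+, array columns named "id" and "name" as above, and that arrays_zip names the struct fields after the input columns, as it does for direct column references):

import org.apache.spark.sql.functions._

// Zip the two arrays element-wise into an array of structs, then explode one pair per row
val exploded = df
  .withColumn("pair", explode(arrays_zip(col("id"), col("name"))))
  .select(col("pair.id").as("col_id"), col("pair.name").as("col_name"))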

Read CSV File to OrderedMap

I'm reading a CSV file and adding the data to a Map in Scala.
import java.io.{BufferedReader, File, FileInputStream, InputStreamReader}
import scala.collection.JavaConverters._
import org.apache.commons.csv.{CSVFormat, CSVParser}

val br = new BufferedReader(new InputStreamReader(new FileInputStream(new File(fileName)), "UTF-8"))
val inputFormat = CSVFormat.newFormat(delimiter.charAt(0)).withHeader().withQuote('"')
val csvRecords = new CSVParser(br, inputFormat).getRecords.asScala
val buffer = for (csvRecord <- csvRecords; if csvRecords != null && csvRecords.nonEmpty)
  yield csvRecord.toMap.asScala
buffer.toList
But as the Map is not ordered I'm not able to read the columns in order. Is there any way to read the csvRecords in order?
The CSV file contains comma-separated values along with a header. The output should be a List[mutable.LinkedHashMap[String, String]], something like [["fname", "A", "lname", "B"], ["fname", "C", "lname", "D"]].
The above code is working but it is not preserving the order. For example, if the CSV file contains the columns in the order fname, lname, the output map has lname first and fname last.
If I understand your question correctly, here's one way to create a list of LinkedHashMaps with elements in order:
// Assuming your CSV File has the following content:
fname,lname,grade
John,Doe,A
Ann,Cole,B
David,Jones,C
Mike,Duke,D
Jenn,Rivers,E
import collection.mutable.LinkedHashMap
// Get indexed header from CSV
val indexedHeader = io.Source.fromFile("/path/to/csvfile").
  getLines.take(1).next.
  split(",").
  zipWithIndex
indexedHeader: Array[(String, Int)] = Array((fname,0), (lname,1), (grade,2))
// Aggregate LinkedHashMap using foldLeft
val ListOfLHM = for (csvRecord <- csvRecords) yield
  indexedHeader.foldLeft(LinkedHashMap[String, String]())(
    (acc, x) => acc += (x._1 -> csvRecord.get(x._2))
  )
ListOfLHM: scala.collection.mutable.Buffer[scala.collection.mutable.LinkedHashMap[String,String]] = ArrayBuffer(
Map(fname -> John, lname -> Doe, grade -> A),
Map(fname -> Ann, lname -> Cole, grade -> B),
Map(fname -> David, lname -> Jones, grade -> C),
Map(fname -> Mike, lname -> Duke, grade -> D),
Map(fname -> Jenn, lname -> Rivers, grade -> E)
)
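
A dependency-free sketch of the same idea, under the assumption that no field contains embedded commas or quotes (commons-csv handles those cases properly, this does not):

import scala.collection.mutable.LinkedHashMap
import scala.io.Source

// Read all lines, split the header once, and zip it with each data row in order
val allLines = Source.fromFile("/path/to/csvfile").getLines.toList
val header = allLines.head.split(",")
val rows: List[LinkedHashMap[String, String]] =
  allLines.tail.map(line => LinkedHashMap(header.zip(line.split(",")): _*))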

Using Apache Spark to union two Lists of Tuples

I'm attempting to union two RDDs:
val u1 = sc.parallelize(List ( ("a" , (1,2)) , ("b" , (1,2))))
val u2 = sc.parallelize(List ( ("a" , ("3")) , ("b" , (2))))
I receive this error:
scala> u1 union u2
<console>:17: error: type mismatch;
found : org.apache.spark.rdd.RDD[(String, Any)]
required: org.apache.spark.rdd.RDD[(String, (Int, Int))]
Note: (String, Any) >: (String, (Int, Int)), but class RDD is invariant in type T.
You may wish to define T as -T instead. (SLS 4.5)
u1 union u2
^
The String in each of the above tuples is a key.
Is it possible to union these two RDDs?
Once u1 and u2 are unioned I intend to use groupBy to group each item according to its key.
The issue you are facing is actually explained by the compiler: you are trying to union values of type (Int, Int) with values of type Any. The Any comes in as the common supertype of String and Int in this statement: sc.parallelize(List ( ("a" , ("3")) , ("b" , (2)))). This might be an error or might be intended.
In any case, I would try to make the values converge to a common type before the union.
Given that tuples of different arity and element types are unrelated types, I'd consider some other container that is easier to transform.
Assuming that the "3" above is actually a 3 (Int):
val au1 = sc.parallelize(List ( ("a" , Array(1,2)) , ("b" , Array(1,2))))
val au2 = sc.parallelize(List ( ("a" , Array(3)) , ("b" , Array(2))))
au1 union au2
org.apache.spark.rdd.RDD[(String, Array[Int])] = UnionRDD[10] at union at <console>:17
res: Array[(String, Array[Int])] = Array((a,Array(1, 2)), (b,Array(1, 2)), (a,Array(3)), (b,Array(2)))
Once u1 and u2 are unioned I intend to use groupBy to group each item according to its key.
If you intend to group both RDDs by key, you may consider using join instead of union; that gets the job done at once:
au1 join au2
res: Array[(String, (Array[Int], Array[Int]))] = Array((a,(Array(1, 2),Array(3))), (b,(Array(1, 2),Array(2))))
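Alternatively, if you do want union followed by groupBy as stated, a hedged sketch on the Array-valued RDDs from above:

// Group the unioned pairs by key and flatten the per-key arrays into one list
val grouped = (au1 union au2).groupByKey().mapValues(_.flatten.toList)
grouped.collect()
// e.g. Array((a,List(1, 2, 3)), (b,List(1, 2, 2)))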
If the "3" above is actually a "3" (String): I'd consider to map the values first to a common type. Either all strings or all ints. It will make the data easier to manipulate than having Any as type. Your life will be easier.
If you want to use a (key, value) RDD with any value type (I see you are trying an RDD with an (Int, Int), an Int and a String), you can define the type of your RDD on creation:
val u1:org.apache.spark.rdd.RDD[(String, Any)] = sc.parallelize(List ( ("a" , (1,2)) , ("b" , (1,2))))
val u2: org.apache.spark.rdd.RDD[(String, Any)] = sc.parallelize(List ( ("a" , ("3")) , ("b" , (2))))
Then the union will work because it's the union between the same types.
Hope it helps