Get words and values Between Parentheses in Scala-Spark - scala

here is my data :
doc1: (Does,1) (just,-1) (what,0) (was,1) (needed,1) (to,0) (charge,1) (the,0) (Macbook,1)
doc2: (Pro,1) (G4,-1) (13inch,0) (laptop,1)
doc3: (Only,1) (beef,0) (was,1) (it,0) (no,-1) (longer,0) (lights,-1) (up,0) (the,-1)
etc...
and i want to extract words and values and then store them in two separated matrices , matrix_1 is (docID words) and matrix_2 is (docID values) ;

input.txt
=========
doc1: (Does,1) (just,-1) (what,0) (was,1) (needed,1) (to,0) (charge,1) (the,0) (Macbook,1)
doc2: (Pro,1) (G4,-1) (13inch,0) (laptop,1)
doc3: (Only,1) (beef,0) (was,1) (it,0) (no,-1) (longer,0) (lights,-1) (up,0) (the,-1)
val inputText = sc.textFile("input.txt")
var digested = input.map(line => line.split(":"))
.map(row => row(0) -> row(1).trim.split(" "))
.map(row => row._1 -> row._2.map(_.stripPrefix("(").stripSuffix(")").trim.split(",")))
var matrix_1 = digested.map(row => row._1 -> row._2.map( a => a(0)))
var matrix_2 = digested.map(row => row._1 -> row._2.map( a => a(1)))
gives:
List(
(doc1 -> Does,just,what,was,needed,to,charge,the,Macbook),
(doc2 -> Pro,G4,13inch,laptop),
(doc3 -> Only,beef,was,it,no,longer,lights,up,the)
)
List(
(doc1 -> 1,-1,0,1,1,0,1,0,1),
(doc2 -> 1,-1,0,1),
(doc3 -> 1,0,1,0,-1,0,-1,0,-1)
)

Related

scala getting read,write and reject records with simplified regex

I'm working on a log file to parse the read/written/rejected records using scala and convert them into a Map. The values are present in different lines - "read" followed by "written" in next line and then "rejected"..
The snippet of the code I'm using is
val log_text =
"""
|server.net|Wed Apr 8 05:44:24 2018|acct_reformat.000||finish|
| 120 records ( 7200 bytes) read
| 100 records ( 6000 bytes) written
| 20 records ( 1200 bytes) rejected|
|server.net|Wed Apr 8 05:44:24 2018|acct_reformat_rfm_logs
""".stripMargin
val read_pat = """(\d+) (records) (.*)""".r
val write_pat = """(?s)records .*? (\d+) (records)(.*)""".r
val reject_pat = """(?s).* (\d+) (records)""".r
val read_recs = read_pat.findAllIn(log_text).matchData.map( m=> m.subgroups(0) ).take(1).mkString
val write_recs = write_pat.findAllIn(log_text).matchData.map( m=> m.subgroups(0) ).take(1).mkString
val reject_recs = reject_pat.findAllIn(log_text).matchData.map( m=> m.subgroups(0) ).take(1).mkString
val log_summ = List("Read",read_recs,"Write",write_recs,"Reject",reject_recs).sliding(2,2).map( p => p match { case List(x,y) => (x,y)}).toMap
which results in
log_summ: scala.collection.immutable.Map[String,String] = Map(Read -> 120, Write -> 100, Reject -> 20)
Somehow I feel, I'm doing it in a roundabout/redundant way.. Is there a better way to accomplish this?.
Given the similarity of the read/write/reject text, you could simplify the multiple Regex matching patterns into a generic one and use zip to generate your Map, as shown below:
val pattern = """(\d+) records .*""".r
val keys = List("Read", "Write", "Reject")
val values = pattern.findAllIn(log_text).matchData.map(_.subgroups(0)).toList
// values: List[String] = List(120, 100, 20)
val log_summ = (keys zip values).toMap
// log_summ: scala.collection.immutable.Map[String,String] =
// Map(Read -> 120, Write -> 100, Reject -> 20)
Looks fine to me. Just three things to improve:
1) IntelliJ is your friend. It gives you two intentions immediately:
m.subgroups(0) -> m.subgroups.head
map(p => p match { case List(x, y) => (x, y) }) -> map { case List(x, y) => (x, y) }
2) DRY. Don't repeat read/write/reject related code three times. Just keep it somewhere once. E.g.:
case class Processor(name: String, patternString: String) {
lazy val pattern: Regex = patternString.r
}
val processors = Seq(
Processor("Read", """(\d+) (records) (.*)"""),
Processor("Write", """(?s)records .*? (\d+) (records)(.*)"""),
Processor("Reject", """(?s).* (\d+) (records)"""),
)
def read_recs(processor: Processor) = processor.pattern.findAllIn(log_text).matchData.map(m => m.subgroups.head).take(1).mkString
3) List[Tuple2] can be converted to a Map with a simple toMap
val log_summ = processors.map(processor => processor.name -> read_recs(processor)).toMap
It can be done in a single pass if you're willing to use the log's wording for the Map keys.
val Pattern = raw"(\d+) records .*\) ([^|]+)".r.unanchored
log_text.split("\n").flatMap{
case Pattern(num, typ) => Some(typ -> num)
case _ => None
}.toMap
//res0: immutable.Map[String,String] = Map(read -> 120, written -> 100, rejected -> 20)

toMap on List of List generates error

Given that this example:
val myList = List("age=21", "name=xyz", "profession=Tester", "city=cuba", "age=43", "name=abc", "profession=Programmer", "city=wellington")
val myMap = myList.map(text => text.split("=")).map(a => (a(0) -> a(1))).toMap
works fine, returning:
myList: List[String] = List(age=21, name=xyz, profession=Tester, city=cuba, age=43, name=abc, profession=Programmer, city=wellington)
myMap: scala.collection.immutable.Map[String,String] = Map(age -> 43, name -> abc, profession -> Programmer, city -> wellington)
I am wondering why the following which is just N sets of values:
val myList = List("age=21", "name=xyz", "profession=Tester", "city=cuba", "age=43", "name=abc", "profession=Programmer", "city=Sydney")
val myMap = myList.grouped(4).toList.map(text => text.split("=")).map(a => (a(0) -> a(1))).toMap
generates the error, and how to solve:
notebook:9: error: value split is not a member of List[String]
val myMap = myList.grouped(4).toList.map(text => text.split("=")).map(a => (a(0) -> a(1))).toMap
I must be missing something elementary here.
myList.grouped(4).toList returns a nested list – List[List[String]].
To transform the grouped sublists into Maps:
val myMap = myList.grouped(4).toList.
map(_.map(_.split("=")).map(a => (a(0) -> a(1))).toMap)
// myMap: List[scala.collection.immutable.Map[String,String]] = List(
// Map(age -> 21, name -> xyz, profession -> Tester, city -> cuba),
// Map(age -> 43, name -> abc, profession -> Programmer, city -> Sydney)
// )

Scala For Comprehension with Filter

I am using Scala's for comprehension to produce a modified facetFilter. If a value in facetFilter doesn't exist in allFacets, it should be filtered out. Currently, the newFacetFilter doesn't filter at all.
val allFacets = Map(
"band_material" -> Map("Rubber" -> 11),
"dial_color" -> Map("Ivory" -> 68, "Salmon"-> 3))
val facetFilter =
Map("band_material" -> List("Yellow Gold Plated", "Rubber"),
"dial_color" -> List("Ivory"))
val newFacetFilter =
for {
(k,v) <- allFacets
(facetName, facetArr) <- facetFilter
aFacet <- facetArr
if k != facetName || !v.contains(aFacet)
} yield (facetName -> facetArr)
Current Output of newFacetFilter:
Map("band_material" -> List("Yellow Gold Plated", "Rubber"), "dial_color" -> List("Ivory"))
Expected Output of newFacetFilter:
Map("band_material" -> List("Rubber"), "dial_color" -> List("Ivory"))
See this fiddle
Try this:
val newFacetFilter =
for ((k,vs) <- facetFilter)
yield (k, vs filter allFacets(k).contains)
Output:
Map(band_material -> List(Rubber), dial_color -> List(Ivory))
OK, if we are done with edits, I think this is what you want...
val allFacets = Map(
"band_material" -> Map(
"Rubber" -> 11
),
"dial_color" -> Map(
"Ivory" -> 68,
"Salmon"-> 3
)
)
val facetFilter = Map(
"band_material" -> List("Yellow Gold Plated", "Rubber"),
"dial_color" -> List("Ivory"),
"case_material" -> List(),
"movement" -> List(),
"price_range" -> List(),
"gender" -> List()
)
val newFacetFilter = for {
(facetName, facetArr) <- facetFilter
(k,v) <- allFacets
if k == facetName
} yield (facetName, facetArr intersect v.keys.toList)
We simply iterate both maps and when we have the same keys, we intersect the two lists.
Edit: There is a more efficient way, using the Map's get function instead of just iterating everything and ignoring non-matches.
val newFacetFilter = facetFilter.flatMap {
case (n, fs) =>
allFacets.get(n).map(n -> _.keys.toList.intersect(fs))
}
So we take each facetFilter entry ((n, fs)), check allFacets for n, then intersect the optional result with our list fs. If n did not exist, we propagate None and it is flattened out by flatMap.

Extracting Values from a Map[String, Any] where Any is a Map itself

I have a Map which contains another Map in its value field. Here is an example of some records ;
(8702168053422489,Map(sequence -> 5, id -> 8702168053422489, type -> List(AppExperience, Session), time -> 527780267713))
(8702170626376335,Map(trackingInfo -> Map(trackId -> 14183197, location -> Browse, listId -> 3393626f-98e3-4973-8d38-6b2fb17454b5_27331247X28X6839X1506087469573, videoId -> 80161702, rank -> 0, row -> 1, imageKey -> boxshot|AD_e01f4a50-7e2b-11e7-a327-12789459b73e|en, requestId -> 662d92c2-6a1c-41a6-8ac4-bf2ae9f1ce68-417037), id -> 8702170626376335, sequence -> 59, time -> 527780275219, type -> List(NavigationLevel, Session), view -> details))
(8702168347359313,Map(muting -> false, id -> 8702168347359313, level -> 1, type -> List(Volume)))
(8702168321522401,Map(utcOffset -> 3600, type -> List(TimeZone), id -> 8702168321522401))
(8702171157207449,Map(trackingInfo -> Map(trackId -> 14183197, location -> Browse, listId -> 3393626f-98e3-4973-8d38-6b2fb17454b5_27331247X28X6839X1506087469573, videoId -> 80161356, rank -> 0, row -> 1, imageKey -> boxshot|AD_e01f4a50-7e2b-11e7-a327-12789459b73e|en, requestId -> 662d92c2-6a1c-41a6-8ac4-bf2ae9f1ce68-417037), id -> 8702171157207449, sequence -> 72, startOffset -> 0, time -> 527780278061, type -> List(StartPlay, Action, Session)))
The actual records I've interested in are the ones that contain trackingInfo, records 2 and 5.
What I would like to do is extract those and then extract some of the keys from there such as trackId. Something like this;
val trackingInfo = json("trackingInfo").asInstanceOf[Map[String, Any]]
val videoId = trackingInfo("videoId").asInstanceOf[Int]
val id = json("id").asInstanceOf[Long]
val sequence = json("sequence").asInstanceOf[Int]
val time = json("time").asInstanceOf[Long]
val eventType = json.get("type").getOrElse(List("")).asInstanceOf[List[String]]
To the extract the inner map, I've tired;
myMap.map {case (k,v: collection.Map[_,_]) => v.toMap case _ => }
Which brings back the inner map but as a scala.collection.immutable.Iterable[Any] which leaves me in a puzzle on extracting values from it.
Any help is appreciated
Let's say you have a real map (I cut it a little bit)
val data: Map[ BigInt, Any ] = Map(
BigInt( 8702168053422489L ) -> Map("sequence" -> "5", "id" -> BigInt( 8702168053422489L ) ),
BigInt( 8702170626376335L ) -> Map("trackingInfo" -> Map("trackId" -> BigInt( 14183197 ), "location" -> "Browse" ), "id" -> BigInt( 8702170626376335L ) ),
BigInt( 8702168347359313L ) -> Map("id" -> BigInt( 8702168347359313L ) ),
BigInt( 8702168321522401L ) -> Map("id" -> BigInt( 8702168321522401L ) ),
BigInt( 8702171157207449L ) -> Map("trackingInfo" -> Map("trackId" -> BigInt( 14183197 ), "location" -> "Browse" ), "id" -> BigInt( 8702171157207449L ) )
)
And you want to get records which have a trackingInfo key
val onlyWithTracking = data.filter( ( row ) => {
val recordToFilter = row._2 match {
case trackRecord: Map[ String, Any ] => trackRecord
case _ => Map( "trackId" -> Map() )
}
recordToFilter.contains( "trackingInfo" )
} )
And then process those records in some way
onlyWithTracking.foreach( ( row ) => {
val record = row._2 match {
case trackRecord: Map[ String, Any ] => trackRecord
case _ => Map( "trackingInfo" -> Map() )
}
val trackingInfo = record( "trackingInfo" ) match {
case trackRow: Map[ String, Any ] => trackRow
case _ => Map( "trackId" -> "error" )
}
val trackId = trackingInfo( "trackId" )
println( trackId )
} )
With this pattern matching I'm trying to ensure that using keys like trackingInfo or trackId is somewhat safe. You should implement more strict approach.

how to convert map into key value pair

Given below:
test: Array[scala.collection.immutable.Map[String,Any]] = Array(
Map(_c3 -> "foobar", _c5 -> "impt", _c0 -> Key1, _c4 -> 20.0, _c1 -> "next", _c2 -> 1.0),
Map(_c3 -> "high", _c5 -> "low", _c0 -> Key2, _c4 -> 19.0, _c1 -> "great", _c2 -> 0.0),
Map(_c3 -> "book", _c5 -> "game", _c0 -> Key3, _c4 -> 42.0, _c1 -> "name", _c2 -> 0.5)
)
How can I transform it to Key Value pairs based on _c0 that only include Strings?
like below
Key1 foobar
Key1 impt
Key1 next
Key2 high
Key2 low
Key2 great
Key3 book
Key3 game
Key3 name
Please check this out
test.map(
_.filter(!_._2.toString.matches("[+-]?\\d+.?\\d+"))
).flatMap(
data =>
{
val key = data.getOrElse("_c0", "key_not_found")
data
.filter(_._1 != "_c0")
.map(
key +" "+_._2.toString()
)
}
)
Try this method
import org.apache.spark.sql.functions._
# first extract all values which are string
val rdd = sc.parallelize(test).map(x => (x.getOrElse("_c0","no key").toString -> (x - "_c0").values.filter(_.isInstanceOf[String]).asInstanceOf[List[String]]))
val df = spark.createDataFrame(rdd).toDF("key", "vals")
# use explode function to add new rows
df.withColumn("vals", explode(col("vals"))).show()
How about:
test
.map(row => row.getOrElse(_c0, "") -> (row - _c0).values.filter(_.isInstanceOf[String]))
.flatMap { case (key, innerList) => innerList.map(key -> _) }