Here is my data:
doc1: (Does,1) (just,-1) (what,0) (was,1) (needed,1) (to,0) (charge,1) (the,0) (Macbook,1)
doc2: (Pro,1) (G4,-1) (13inch,0) (laptop,1)
doc3: (Only,1) (beef,0) (was,1) (it,0) (no,-1) (longer,0) (lights,-1) (up,0) (the,-1)
etc...
I want to extract the words and the values and store them in two separate matrices: matrix_1 holds (docID, words) and matrix_2 holds (docID, values).
input.txt
=========
doc1: (Does,1) (just,-1) (what,0) (was,1) (needed,1) (to,0) (charge,1) (the,0) (Macbook,1)
doc2: (Pro,1) (G4,-1) (13inch,0) (laptop,1)
doc3: (Only,1) (beef,0) (was,1) (it,0) (no,-1) (longer,0) (lights,-1) (up,0) (the,-1)
val inputText = sc.textFile("input.txt")
val digested = inputText.map(line => line.split(":"))
.map(row => row(0) -> row(1).trim.split(" "))
.map(row => row._1 -> row._2.map(_.stripPrefix("(").stripSuffix(")").trim.split(",")))
val matrix_1 = digested.map(row => row._1 -> row._2.map(a => a(0)))
val matrix_2 = digested.map(row => row._1 -> row._2.map(a => a(1)))
gives:
List(
(doc1 -> Does,just,what,was,needed,to,charge,the,Macbook),
(doc2 -> Pro,G4,13inch,laptop),
(doc3 -> Only,beef,was,it,no,longer,lights,up,the)
)
List(
(doc1 -> 1,-1,0,1,1,0,1,0,1),
(doc2 -> 1,-1,0,1),
(doc3 -> 1,0,1,0,-1,0,-1,0,-1)
)
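For experimenting without a Spark context, the same parsing can be sketched on plain Scala collections. This is only an illustration: the names `Token`, `parsed`, `words` and `values` are made up here, the sample lines are copied from input.txt above, and the values are assumed to be optionally signed integers.

```scala
// Two sample lines copied from input.txt above.
val lines = List(
  "doc1: (Does,1) (just,-1) (what,0) (was,1) (needed,1) (to,0) (charge,1) (the,0) (Macbook,1)",
  "doc2: (Pro,1) (G4,-1) (13inch,0) (laptop,1)"
)

// Matches one "(word,value)" token; the value is assumed to be an optionally signed integer.
val Token = """\((.+),(-?\d+)\)""".r

// Parse each line into (docID, list of (word, value) pairs).
val parsed: List[(String, List[(String, Int)])] = lines.map { line =>
  val parts = line.split(":", 2)
  val docId = parts(0).trim
  val pairs = parts(1).trim.split("\\s+").toList.collect {
    case Token(word, value) => (word, value.toInt) // malformed tokens are skipped
  }
  docId -> pairs
}

// Split the pairs into the two requested structures.
val words  = parsed.map { case (id, ps) => id -> ps.map(_._1) }
val values = parsed.map { case (id, ps) => id -> ps.map(_._2) }
```

Using a regex extractor instead of stripPrefix/stripSuffix means a malformed token is dropped rather than causing an index error, which may or may not be what you want for real data.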
I'm working on a log file to parse the read/written/rejected record counts using Scala and convert them into a Map. The values appear on separate lines: "read", then "written" on the next line, and then "rejected".
The snippet of code I'm using is:
val log_text =
"""
|server.net|Wed Apr 8 05:44:24 2018|acct_reformat.000||finish|
| 120 records ( 7200 bytes) read
| 100 records ( 6000 bytes) written
| 20 records ( 1200 bytes) rejected|
|server.net|Wed Apr 8 05:44:24 2018|acct_reformat_rfm_logs
""".stripMargin
val read_pat = """(\d+) (records) (.*)""".r
val write_pat = """(?s)records .*? (\d+) (records)(.*)""".r
val reject_pat = """(?s).* (\d+) (records)""".r
val read_recs = read_pat.findAllIn(log_text).matchData.map( m=> m.subgroups(0) ).take(1).mkString
val write_recs = write_pat.findAllIn(log_text).matchData.map( m=> m.subgroups(0) ).take(1).mkString
val reject_recs = reject_pat.findAllIn(log_text).matchData.map( m=> m.subgroups(0) ).take(1).mkString
val log_summ = List("Read",read_recs,"Write",write_recs,"Reject",reject_recs).sliding(2,2).map( p => p match { case List(x,y) => (x,y)}).toMap
which results in
log_summ: scala.collection.immutable.Map[String,String] = Map(Read -> 120, Write -> 100, Reject -> 20)
Somehow I feel I'm doing it in a roundabout, redundant way. Is there a better way to accomplish this?
Given the similarity of the read/write/reject text, you could simplify the multiple Regex matching patterns into a generic one and use zip to generate your Map, as shown below:
val pattern = """(\d+) records .*""".r
val keys = List("Read", "Write", "Reject")
val values = pattern.findAllIn(log_text).matchData.map(_.subgroups(0)).toList
// values: List[String] = List(120, 100, 20)
val log_summ = (keys zip values).toMap
// log_summ: scala.collection.immutable.Map[String,String] =
// Map(Read -> 120, Write -> 100, Reject -> 20)
Looks fine to me. Just three things to improve:
1) IntelliJ is your friend. It gives you two intentions immediately:
m.subgroups(0) -> m.subgroups.head
map(p => p match { case List(x, y) => (x, y) }) -> map { case List(x, y) => (x, y) }
2) DRY. Don't repeat read/write/reject related code three times. Just keep it somewhere once. E.g.:
import scala.util.matching.Regex

case class Processor(name: String, patternString: String) {
lazy val pattern: Regex = patternString.r
}
val processors = Seq(
Processor("Read", """(\d+) (records) (.*)"""),
Processor("Write", """(?s)records .*? (\d+) (records)(.*)"""),
Processor("Reject", """(?s).* (\d+) (records)"""),
)
def read_recs(processor: Processor) = processor.pattern.findAllIn(log_text).matchData.map(m => m.subgroups.head).take(1).mkString
3) List[Tuple2] can be converted to a Map with a simple toMap
val log_summ = processors.map(processor => processor.name -> read_recs(processor)).toMap
It can be done in a single pass if you're willing to use the log's wording for the Map keys.
val Pattern = raw"(\d+) records .*\) ([^|]+)".r.unanchored
log_text.split("\n").flatMap{
case Pattern(num, typ) => Some(typ -> num)
case _ => None
}.toMap
//res0: immutable.Map[String,String] = Map(read -> 120, written -> 100, rejected -> 20)
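If numeric counts are more useful than strings, the single-pass idea can be extended a little. The following is a sketch rather than a drop-in replacement: `logText`, `Line` and `counts` are names chosen for this example, and the pattern assumes every count line has the shape "N records ( M bytes) kind".

```scala
// Same sample log as in the question.
val logText =
  """
    |server.net|Wed Apr 8 05:44:24 2018|acct_reformat.000||finish|
    | 120 records ( 7200 bytes) read
    | 100 records ( 6000 bytes) written
    | 20 records ( 1200 bytes) rejected|
    |server.net|Wed Apr 8 05:44:24 2018|acct_reformat_rfm_logs
    |""".stripMargin

// Captures the record count and the trailing keyword (read/written/rejected).
// unanchored lets the pattern match anywhere inside a line.
val Line = """(\d+) records \(.*\) (\w+)""".r.unanchored

// One pass over the lines; lines that do not match are ignored by collect.
val counts: Map[String, Int] =
  logText.split("\n").collect {
    case Line(num, kind) => kind -> num.toInt
  }.toMap
// counts: Map(read -> 120, written -> 100, rejected -> 20)
```

Because the keys come from the log text itself, this does not depend on the three lines appearing in a fixed order, unlike the zip-based version.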
Given that this example:
val myList = List("age=21", "name=xyz", "profession=Tester", "city=cuba", "age=43", "name=abc", "profession=Programmer", "city=wellington")
val myMap = myList.map(text => text.split("=")).map(a => (a(0) -> a(1))).toMap
works fine, returning:
myList: List[String] = List(age=21, name=xyz, profession=Tester, city=cuba, age=43, name=abc, profession=Programmer, city=wellington)
myMap: scala.collection.immutable.Map[String,String] = Map(age -> 43, name -> abc, profession -> Programmer, city -> wellington)
I am wondering why the following which is just N sets of values:
val myList = List("age=21", "name=xyz", "profession=Tester", "city=cuba", "age=43", "name=abc", "profession=Programmer", "city=Sydney")
val myMap = myList.grouped(4).toList.map(text => text.split("=")).map(a => (a(0) -> a(1))).toMap
generates the following error, and how I can solve it:
notebook:9: error: value split is not a member of List[String]
val myMap = myList.grouped(4).toList.map(text => text.split("=")).map(a => (a(0) -> a(1))).toMap
I must be missing something elementary here.
myList.grouped(4).toList returns a nested list – List[List[String]].
To transform the grouped sublists into Maps:
val myMap = myList.grouped(4).toList.
map(_.map(_.split("=")).map(a => (a(0) -> a(1))).toMap)
// myMap: List[scala.collection.immutable.Map[String,String]] = List(
// Map(age -> 21, name -> xyz, profession -> Tester, city -> cuba),
// Map(age -> 43, name -> abc, profession -> Programmer, city -> Sydney)
// )
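A slightly more defensive variant of the same idea is sketched below. It assumes that entries without an "=" should simply be skipped, and that a value may itself contain "="; the name `records` is only illustrative.

```scala
val myList = List("age=21", "name=xyz", "profession=Tester", "city=cuba",
                  "age=43", "name=abc", "profession=Programmer", "city=Sydney")

// One Map per group of four entries.
val records: List[Map[String, String]] =
  myList.grouped(4).map { group =>
    group.flatMap { entry =>
      // Limit 2 keeps any further "=" inside the value.
      entry.split("=", 2) match {
        case Array(k, v) => Some(k -> v)
        case _           => None // no "=" found: skip the entry
      }
    }.toMap
  }.toList
// records(0): Map(age -> 21, name -> xyz, profession -> Tester, city -> cuba)
// records(1): Map(age -> 43, name -> abc, profession -> Programmer, city -> Sydney)
```

This also shows why the first, ungrouped snippet returned only the second person's data: a single Map keeps one value per key, so the later age/name/profession/city entries overwrite the earlier ones.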
I am using Scala's for comprehension to produce a modified facetFilter. If a value in facetFilter doesn't exist in allFacets, it should be filtered out. Currently, the newFacetFilter doesn't filter at all.
val allFacets = Map(
"band_material" -> Map("Rubber" -> 11),
"dial_color" -> Map("Ivory" -> 68, "Salmon"-> 3))
val facetFilter =
Map("band_material" -> List("Yellow Gold Plated", "Rubber"),
"dial_color" -> List("Ivory"))
val newFacetFilter =
for {
(k,v) <- allFacets
(facetName, facetArr) <- facetFilter
aFacet <- facetArr
if k != facetName || !v.contains(aFacet)
} yield (facetName -> facetArr)
Current Output of newFacetFilter:
Map("band_material" -> List("Yellow Gold Plated", "Rubber"), "dial_color" -> List("Ivory"))
Expected Output of newFacetFilter:
Map("band_material" -> List("Rubber"), "dial_color" -> List("Ivory"))
Try this:
val newFacetFilter =
for ((k,vs) <- facetFilter)
yield (k, vs filter allFacets(k).contains)
Output:
Map(band_material -> List(Rubber), dial_color -> List(Ivory))
OK, if we are done with edits, I think this is what you want...
val allFacets = Map(
"band_material" -> Map(
"Rubber" -> 11
),
"dial_color" -> Map(
"Ivory" -> 68,
"Salmon"-> 3
)
)
val facetFilter = Map(
"band_material" -> List("Yellow Gold Plated", "Rubber"),
"dial_color" -> List("Ivory"),
"case_material" -> List(),
"movement" -> List(),
"price_range" -> List(),
"gender" -> List()
)
val newFacetFilter = for {
(facetName, facetArr) <- facetFilter
(k,v) <- allFacets
if k == facetName
} yield (facetName, facetArr intersect v.keys.toList)
We simply iterate both maps and when we have the same keys, we intersect the two lists.
Edit: There is a more efficient way, using the Map's get function instead of just iterating everything and ignoring non-matches.
val newFacetFilter = facetFilter.flatMap {
case (n, fs) =>
allFacets.get(n).map(n -> _.keys.toList.intersect(fs))
}
So we take each facetFilter entry ((n, fs)), check allFacets for n, then intersect the optional result with our list fs. If n did not exist, we propagate None and it is flattened out by flatMap.
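To make the behaviour for missing keys concrete, here is a small self-contained sketch of the lookup-based approach. It is a variation, not the exact code above: it uses filter instead of intersect so that the order of the filter list is preserved, and the "gender" entry is included only to show that facets unknown to allFacets are dropped.

```scala
val allFacets = Map(
  "band_material" -> Map("Rubber" -> 11),
  "dial_color"    -> Map("Ivory" -> 68, "Salmon" -> 3)
)

val facetFilter = Map(
  "band_material" -> List("Yellow Gold Plated", "Rubber"),
  "dial_color"    -> List("Ivory"),
  "gender"        -> List.empty[String] // not present in allFacets
)

val newFacetFilter: Map[String, List[String]] = facetFilter.flatMap {
  case (name, wanted) =>
    // get returns None for unknown facets, which flatMap then discards.
    allFacets.get(name).map(known => name -> wanted.filter(known.contains))
}
// newFacetFilter: Map(band_material -> List(Rubber), dial_color -> List(Ivory))
```

If unknown facets should be kept with an empty list instead of being removed, `allFacets.getOrElse(name, Map.empty)` inside a plain map would be one way to do that.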
I have a Map which contains another Map in its value field. Here are some example records:
(8702168053422489,Map(sequence -> 5, id -> 8702168053422489, type -> List(AppExperience, Session), time -> 527780267713))
(8702170626376335,Map(trackingInfo -> Map(trackId -> 14183197, location -> Browse, listId -> 3393626f-98e3-4973-8d38-6b2fb17454b5_27331247X28X6839X1506087469573, videoId -> 80161702, rank -> 0, row -> 1, imageKey -> boxshot|AD_e01f4a50-7e2b-11e7-a327-12789459b73e|en, requestId -> 662d92c2-6a1c-41a6-8ac4-bf2ae9f1ce68-417037), id -> 8702170626376335, sequence -> 59, time -> 527780275219, type -> List(NavigationLevel, Session), view -> details))
(8702168347359313,Map(muting -> false, id -> 8702168347359313, level -> 1, type -> List(Volume)))
(8702168321522401,Map(utcOffset -> 3600, type -> List(TimeZone), id -> 8702168321522401))
(8702171157207449,Map(trackingInfo -> Map(trackId -> 14183197, location -> Browse, listId -> 3393626f-98e3-4973-8d38-6b2fb17454b5_27331247X28X6839X1506087469573, videoId -> 80161356, rank -> 0, row -> 1, imageKey -> boxshot|AD_e01f4a50-7e2b-11e7-a327-12789459b73e|en, requestId -> 662d92c2-6a1c-41a6-8ac4-bf2ae9f1ce68-417037), id -> 8702171157207449, sequence -> 72, startOffset -> 0, time -> 527780278061, type -> List(StartPlay, Action, Session)))
The records I'm actually interested in are the ones that contain trackingInfo, i.e. records 2 and 5.
What I would like to do is extract those records and then extract some of their keys, such as trackId. Something like this:
val trackingInfo = json("trackingInfo").asInstanceOf[Map[String, Any]]
val videoId = trackingInfo("videoId").asInstanceOf[Int]
val id = json("id").asInstanceOf[Long]
val sequence = json("sequence").asInstanceOf[Int]
val time = json("time").asInstanceOf[Long]
val eventType = json.get("type").getOrElse(List("")).asInstanceOf[List[String]]
To extract the inner map, I've tried:
myMap.map {case (k,v: collection.Map[_,_]) => v.toMap case _ => }
This brings back the inner map, but as a scala.collection.immutable.Iterable[Any], which leaves me puzzled about how to extract values from it.
Any help is appreciated.
Let's say you have a real map (I trimmed it a little):
val data: Map[ BigInt, Any ] = Map(
BigInt( 8702168053422489L ) -> Map("sequence" -> "5", "id" -> BigInt( 8702168053422489L ) ),
BigInt( 8702170626376335L ) -> Map("trackingInfo" -> Map("trackId" -> BigInt( 14183197 ), "location" -> "Browse" ), "id" -> BigInt( 8702170626376335L ) ),
BigInt( 8702168347359313L ) -> Map("id" -> BigInt( 8702168347359313L ) ),
BigInt( 8702168321522401L ) -> Map("id" -> BigInt( 8702168321522401L ) ),
BigInt( 8702171157207449L ) -> Map("trackingInfo" -> Map("trackId" -> BigInt( 14183197 ), "location" -> "Browse" ), "id" -> BigInt( 8702171157207449L ) )
)
And you want to get records which have a trackingInfo key
val onlyWithTracking = data.filter( ( row ) => {
val recordToFilter = row._2 match {
case trackRecord: Map[ String, Any ] => trackRecord
case _ => Map( "trackId" -> Map() )
}
recordToFilter.contains( "trackingInfo" )
} )
And then process those records in some way
onlyWithTracking.foreach( ( row ) => {
val record = row._2 match {
case trackRecord: Map[ String, Any ] => trackRecord
case _ => Map( "trackingInfo" -> Map() )
}
val trackingInfo = record( "trackingInfo" ) match {
case trackRow: Map[ String, Any ] => trackRow
case _ => Map( "trackId" -> "error" )
}
val trackId = trackingInfo( "trackId" )
println( trackId )
} )
With this pattern matching I'm trying to ensure that using keys like trackingInfo or trackId is somewhat safe. You should implement a stricter approach.
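One possible shape for a stricter approach is sketched below, using Option and collect so that records with a missing or wrongly typed field are skipped instead of being replaced by placeholder maps. The case class `Tracked` and the trimmed sample `data` are hypothetical, and the field types (Int for trackId and videoId) are assumptions based on the casts in the question.

```scala
// Trimmed sample data; ids are Long and inner values are untyped, as in the question.
val data: Map[Long, Map[String, Any]] = Map(
  8702168053422489L -> Map[String, Any]("sequence" -> 5, "id" -> 8702168053422489L),
  8702170626376335L -> Map[String, Any](
    "trackingInfo" -> Map[String, Any]("trackId" -> 14183197, "videoId" -> 80161702),
    "id"           -> 8702170626376335L,
    "sequence"     -> 59
  )
)

// Hypothetical target type for the extracted fields.
final case class Tracked(id: Long, trackId: Int, videoId: Int)

val tracked: List[Tracked] = data.toList.flatMap { case (id, record) =>
  for {
    // Keep the record only if trackingInfo exists and is a Map.
    info <- record.get("trackingInfo").collect {
              case m: Map[_, _] => m.asInstanceOf[Map[String, Any]]
            }
    // Each field must be present and have the expected runtime type.
    trackId <- info.get("trackId").collect { case i: Int => i }
    videoId <- info.get("videoId").collect { case i: Int => i }
  } yield Tracked(id, trackId, videoId)
}
// tracked: List(Tracked(8702170626376335, 14183197, 80161702))
```

The cast to Map[String, Any] is still unchecked because of type erasure, so the individual field lookups carry the real type checks.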
Given below:
test: Array[scala.collection.immutable.Map[String,Any]] = Array(
Map(_c3 -> "foobar", _c5 -> "impt", _c0 -> Key1, _c4 -> 20.0, _c1 -> "next", _c2 -> 1.0),
Map(_c3 -> "high", _c5 -> "low", _c0 -> Key2, _c4 -> 19.0, _c1 -> "great", _c2 -> 0.0),
Map(_c3 -> "book", _c5 -> "game", _c0 -> Key3, _c4 -> 42.0, _c1 -> "name", _c2 -> 0.5)
)
How can I transform it into key-value pairs keyed on _c0 that include only the String values?
Like below:
Key1 foobar
Key1 impt
Key1 next
Key2 high
Key2 low
Key2 great
Key3 book
Key3 game
Key3 name
Please check this out
test.map(
_.filter(!_._2.toString.matches("[+-]?\\d+(\\.\\d+)?")) // drop numeric values; the dot is escaped so it only matches a decimal point
).flatMap(
data =>
{
val key = data.getOrElse("_c0", "key_not_found")
data
.filter(_._1 != "_c0")
.map(
key +" "+_._2.toString()
)
}
)
Try this method
import org.apache.spark.sql.functions._
// first extract all values which are strings
val rdd = sc.parallelize(test).map(x => (x.getOrElse("_c0","no key").toString -> (x - "_c0").values.collect { case s: String => s }.toList))
val df = spark.createDataFrame(rdd).toDF("key", "vals")
// use the explode function to turn each list element into its own row
df.withColumn("vals", explode(col("vals"))).show()
How about:
test
.map(row => row.getOrElse("_c0", "") -> (row - "_c0").values.filter(_.isInstanceOf[String]))
.flatMap { case (key, innerList) => innerList.map(key -> _) }
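For reference, here is a self-contained sketch of the same idea on plain collections that can be pasted into a REPL. The sample `test` array is reconstructed from the question with every key and Key1/Key2 written as String literals (an assumption, since the printed output does not show quotes), and the values are sorted by column name only to make the output order deterministic.

```scala
val test: Array[Map[String, Any]] = Array(
  Map[String, Any]("_c0" -> "Key1", "_c1" -> "next", "_c2" -> 1.0,
                   "_c3" -> "foobar", "_c4" -> 20.0, "_c5" -> "impt"),
  Map[String, Any]("_c0" -> "Key2", "_c1" -> "great", "_c2" -> 0.0,
                   "_c3" -> "high", "_c4" -> 19.0, "_c5" -> "low")
)

val pairs: Array[(String, String)] = test.flatMap { row =>
  row.get("_c0") match {
    case Some(key: String) =>
      (row - "_c0").toList
        .sortBy(_._1)                                // stable order by column name
        .collect { case (_, v: String) => key -> v } // keep only String values
    case _ => Nil                                    // rows without a String key are skipped
  }
}

pairs.foreach { case (k, v) => println(s"$k $v") }
// Key1 next
// Key1 foobar
// Key1 impt
// Key2 great
// Key2 high
// Key2 low
```

Matching on the runtime type with collect avoids both the regex on toString and the unchecked cast to List[String].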