I have an RDD of JSON rows:
val jsons = sc.textFile("hdfs://" + directory + "articles_json/*/*").flatMap(_.split("\n")).
  map(x => JSON.parseFull(x))
Each JSON object has a "dc:title" field, and I want to create an RDD of these titles together with their indexes.
val titles_rdd = jsons.filter(x => x.isDefined).
  map(x => x.get.asInstanceOf[Map[String, Any]].
    get("dc:title").get.asInstanceOf[String]).zipWithIndex()
But I don't understand: should I use .get in x => x.get.asInstanceOf inside map, or just x => x.asInstanceOf? And the same question applies to the .get after get("dc:title").
Did you try with sqlContext? Parsing is much simpler with this.
https://spark.apache.org/docs/1.1.0/sql-programming-guide.html#json-datasets
It would help if you could post a sample of your JSON.
EDITED:
I assume this is your question. You have a list,
scala> val a = List(Some(1),Some(2),Some(3),None,Some(4))
a: List[Option[Int]] = List(Some(1), Some(2), Some(3), None, Some(4))
and you want to know whether you should retrieve the values like this,
scala> val b = a.filter{_.isDefined}.map{x => x.get.asInstanceOf[Int]}
b: List[Int] = List(1, 2, 3, 4)
or like this:
scala> val b = a.filter{_.isDefined}.map{x => x.asInstanceOf[Int]}
If you run the second version, you get the following exception:
java.lang.ClassCastException: scala.Some cannot be cast to java.lang.Integer
  at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:105)
  at $anonfun$2.apply(<console>:8)
  at $anonfun$2.apply(<console>:8)
  at scala.collection.immutable.List.map(List.scala:272)
  ... 33 elided
The reason is simple: you want the value that sits inside the Some, but asInstanceOf only tries to cast the Some itself; it does not unwrap it.
In your example above, on line 2,
map (x => x...)
x is an Option (a Some once the filter has run). If you want its value, you have to call get; a cast alone will not give you the value. The same applies to the .get after get("dc:title"), because Map.get also returns an Option.
The link below may be of some help.
http://www.scala-lang.org/api/current/index.html#scala.Some
Please let me know if anything is still unclear.
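As a possible alternative to filter(_.isDefined) followed by get, the unwrapping can be done with flatten, flatMap and collect, which skip missing values instead of throwing. The following is only a minimal sketch: a plain List stands in for the RDD, and the sample records are invented for illustration.

```scala
// Plain Scala collections stand in for the RDD; the records below are made up.
val parsed: List[Option[Any]] = List(
  Some(Map("dc:title" -> "First article")),
  None,                                    // a line that failed to parse
  Some(Map("dc:creator" -> "nobody")),     // parsed, but has no title
  Some(Map("dc:title" -> "Second article"))
)

val titles: List[(String, Int)] = parsed
  .flatten                                 // drop None, unwrap Some
  .collect { case m: Map[_, _] => m.asInstanceOf[Map[String, Any]] }
  .flatMap(_.get("dc:title"))              // Map.get returns an Option, flattened here
  .collect { case s: String => s }         // keep only titles that are Strings
  .zipWithIndex

println(titles)
```

Records without a title are dropped silently here, whereas the .get version throws NoSuchElementException on them; which behaviour is preferable depends on the data.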
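If all the lines parse into JSON objects (i.e. no arrays), you could use a for comprehension:
val titles_rdd = (for {
  json <- jsons                                            // Option[Any] from JSON.parseFull
  jmap <- json.map(_.asInstanceOf[Map[String, Any]])       // the cast is needed: parseFull returns Any
  jtitle <- jmap.get("dc:title")                           // skips objects without a title
} yield jtitle).zipWithIndex()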
I have an array of Some() values wrapping Map[String, String], such as
Array[Option[Any]] = Array(Some(Map(String, String)))
I want to return it as
Array(Map(String, String))
I've tried a few different ways of extracting it.
Let's say
val x = Array(Some(Map(String, String)))
val x1 = for (i <- 0 until x.length) yield { x.apply(i) }
but this returns IndexedSeq(Some(Map)), which is not what I want.
I tried pattern matching,
x.foreach { i =>
  i match {
    case Some(value) => value
    case _ => println("nothing")
  }
}
Another thing I tried, which was somewhat successful, was
x.apply(0).get.asInstanceOf[Map[String, String]]
This does roughly what I want, but it only gets the 0th element of the array, and I want all the maps in the array.
How can I extract Map type out of Some?
If you want an Array[Any] from your Array[Option[Any]], you can use this for expression:
for {
opt <- x
value <- opt
} yield value
This will put the values of all the non-empty Options inside a new array.
It is equivalent to this:
x.flatMap(_.toArray[Any])
Here, all options will be converted to an array of either 0 or 1 element. All these arrays will then be flattened back to one single array containing all the values.
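To illustrate the equivalence described above, here is a small sketch with an invented sample array; the variable names are hypothetical. It also includes flatten, which should give the same result.

```scala
// Sample input, made up for illustration.
val x: Array[Option[Any]] = Array(Some(Map("a" -> "1")), None, Some(Map("b" -> "2")))

// The for expression from the answer above.
val viaFor: Array[Any] = for {
  opt   <- x
  value <- opt
} yield value

// The flatMap form, and flatten as a shorter spelling of the same idea.
val viaFlatMap: Array[Any] = x.flatMap(_.toArray[Any])
val viaFlatten: Array[Any] = x.flatten

println(viaFor.toList)
```

Arrays compare by reference in Scala, so the results are converted with toList before comparing them.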
Generally, the pattern is to use transformations on the Option[T], like map, flatMap, filter, etc.
The problem is that we need a type cast to retrieve the underlying Map[String, String] from Any. So we'll use flatten to drop any None values and unwrap the Options, and asInstanceOf to retrieve the type:
scala> val y = Array(Some(Map("1" -> "1")), Some(Map("2" -> "2")), None)
y: Array[Option[scala.collection.immutable.Map[String,String]]] = Array(Some(Map(1 -> 1)), Some(Map(2 -> 2)), None)
scala> y.flatten.map(_.asInstanceOf[Map[String, String]])
res7: Array[Map[String,String]] = Array(Map(1 -> 1), Map(2 -> 2))
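If the array really is typed Array[Option[Any]], a pattern-matching collect is one way to avoid a blanket asInstanceOf. This is only a sketch under assumptions: the sample values are invented, and keys and values are converted with toString because type erasure prevents checking Map[String, String] at runtime.

```scala
// Sample input, made up for illustration; 42 is a deliberately non-Map element.
val x: Array[Option[Any]] =
  Array(Some(Map("1" -> "1")), None, Some(Map("2" -> "2")), Some(42))

// Keep only the elements that are a Some wrapping a Map; skip everything else.
val maps: Array[Map[String, String]] = x.collect {
  case Some(m: Map[_, _]) =>
    m.iterator.map { case (k, v) => k.toString -> v.toString }.toMap
}

println(maps.toList)
```

Elements that are None or that wrap something other than a Map are skipped rather than causing a ClassCastException.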
Also, if you are dealing with just a single value, you can try Some("test").head, and for null simply Some(null).flatten.
In my main program I receive inputs like this:
key1=value1 key2=value2
What I want is to create a map out of it. I know the imperative way of doing this: take the Array[String], iterate over it with foreach, split each element on "=", and use the resulting key and value to build a Map.
Is there a good functional and readable way to achieve this?
It would also be great if I could avoid a mutable Map and the initial dummy value.
def initialize(strings: Array[String]): Unit = {
  val m = collection.mutable.Map("dummy" -> "dummyval")
  strings.foreach(
    s => {
      val keyVal: Array[String] = s.split("=")
      m += keyVal(0) -> keyVal(1)
    })
  println(m)
}
You can just use toMap.
However, converting from an array to a tuple is not quite trivial; see "How to convert an Array to a Tuple?"
scala> val ar = Array("key1=value1","key2=value2")
ar: Array[String] = Array(key1=value1, key2=value2)
scala> ar.collect(_.split("=") match { case Array(x,y) => (x,y)}).toMap
res10: scala.collection.immutable.Map[String,String] = Map(key1 -> value1, key2 -> value2)
You may have to call Function.unlift for IntelliJ to accept it:
val r = ar.collect(Function.unlift(_.split("=") match { case Array(x, y) => Some((x, y)) })).toMap
Similar to the above, but using only map:
ar.map(_.split("=")).map(a=>(a(0), a(1))).toMap
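A slightly more defensive variant of the snippets above is sketched below. The function name parseArgs and the sample inputs are hypothetical. It splits on the first "=" only, so values may themselves contain "=", and it silently skips arguments that have no "=" at all instead of throwing.

```scala
// Hypothetical helper: turns "key=value" strings into an immutable Map.
def parseArgs(args: Array[String]): Map[String, String] =
  args.iterator
    .map(_.split("=", 2))                    // limit 2: split on the first '=' only
    .collect { case Array(k, v) => k -> v }  // skip entries without '='
    .toMap

val parsedArgs = parseArgs(Array("key1=value1", "key2=a=b", "broken"))
println(parsedArgs)
```

Whether malformed arguments should be skipped or reported is a design choice; reporting them would need an explicit check instead of collect.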
You can use Scopt to do the command line argument parsing in a neat way.
This post is essentially about how to build joint and marginal histograms from a (String, String) RDD. I posted the code that I eventually used below as the answer.
I have an RDD that contains a set of tuples of type (String, String). Since they aren't unique, I want to see how many times each (String, String) combination occurs, so I use countByValue like so:
val PairCount = Pairs.countByValue().toSeq
which gives me tuples of the form ((String, String), Long), where the Long is the number of times that the (String, String) tuple appeared.
These Strings can be repeated in different combinations, and I essentially want to run a word count on this PairCount variable, so I tried something like this to start:
PairCount.map(x => (x._1._1, x._2))
But the output that this produces is String1->1, String2->1, String3->1, etc.
How do I output a key-value pair from a map job in this case, where the key is one of the String values from the inner tuple and the value is the Long value from the outer tuple?
Update:
@vitalii gets me almost there. The answer gets me to a Seq[(String, Long)], but what I really need is to turn that into a map so that I can run reduceByKey on it afterwards. When I run
PairCount.flatMap { case ((x, y), n) => Seq(x -> n) }.toMap
for each unique x I get x->1.
For example, the line above generates mom->1, dad->1 even if the tuples coming out of the flatMap included (mom,30), (dad,59), (mom,2), (dad,14), in which case I would expect toMap to provide mom->30, dad->59, mom->2, dad->14. However, I'm new to Scala, so I might be misinterpreting the functionality.
How can I convert the Tuple2 sequence to a map so that I can reduce on the map keys?
If I understand the question correctly, you need flatMap:
val pairCountRDD = pairs.countByValue() // RDD[((String, String), Int)]
val res : RDD[(String, Int)] = pairCountRDD.flatMap { case ((s1, s2), n) =>
  Seq(s1 -> n, s2 -> n)
}
Update: I didn't quite understand what your final goal is, but here are a few more examples that may help you. By the way, the code above is incorrect: I missed the fact that countByValue returns a Map, not an RDD.
val pairs = sc.parallelize(
  List(
    "mom" -> "dad", "dad" -> "granny", "foo" -> "bar", "foo" -> "baz", "foo" -> "foo"
  )
)
// don't use countByValue: if pairs is large you will run out of memory
val pairCountRDD = pairs.map(x => (x, 1)).reduceByKey(_ + _)
val wordCount = pairs.flatMap { case (a, b) => Seq(a -> 1, b -> 1) }.reduceByKey(_ + _)
wordCount.take(10)
// count in how many pairs each word occurs, as key or as value:
val wordPairCount = pairs.flatMap { case (a, b) =>
  if (a == b) {
    Seq(a -> 1)
  } else {
    Seq(a -> 1, b -> 1)
  }
}.reduceByKey(_ + _)
wordPairCount.take(10)
To get the histograms for the (String, String) RDD, I used this code:
val Hist_X = histogram.map(x => (x._1-> 1.0)).reduceByKey(_+_).collect().toMap
val Hist_Y = histogram.map(x => (x._2-> 1.0)).reduceByKey(_+_).collect().toMap
val Hist_XY = histogram.map(x => (x-> 1.0)).reduceByKey(_+_)
where histogram is the (String, String) RDD.
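On the toMap question raised in the update: a Map holds exactly one value per key, so calling toMap on a sequence with repeated keys keeps only the last value for each key. The sketch below uses plain Scala collections and the sample numbers from the question; groupBy plus sum is shown as a local stand-in for what reduceByKey(_ + _) does on an RDD.

```scala
// Sample data taken from the question's example.
val counts: Seq[(String, Long)] =
  Seq("mom" -> 30L, "dad" -> 59L, "mom" -> 2L, "dad" -> 14L)

// toMap cannot hold duplicate keys: the last value for each key wins.
val lastWins: Map[String, Long] = counts.toMap

// Summing per key on a local collection.
val summed: Map[String, Long] =
  counts.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }

println(lastWins)
println(summed)
```

This is why reducing should happen before (or instead of) converting to a Map: once the duplicates have been collapsed by toMap, the other values are gone.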
Currently I have a structure like this:
Array[(Int, Array[(String, Int)])], and I want to use reduceByKey on the Array[(String, Int)] that sits inside each tuple of the outer array. I tried code like
//data is in Array[(Int, Array[(String, Int)])] structure
val result = data.map(l => (l._1, l._2.reduceByKey(_ + _)))
The error says that Array[(String, Int)] does not have a method called reduceByKey, and I understand that this method can only be used on an RDD. So my question is: is there any way to get reduceByKey-like behaviour, not necessarily with this exact method, in the nested structure?
Thanks, guys.
You can simply use the collection's own methods here (foldLeft in the example below, which uses a List in place of the Array), as the inner value is a plain collection and not an RDD (assuming you really meant the outer wrapper to be an RDD):
val data = sc.parallelize(List((1, List(("foo", 1), ("foo", 1)))))
data.map(l => (l._1, l._2.foldLeft(List[(String, Int)]())((accum, curr) => {
  val accumAsMap = accum.toMap
  accumAsMap.get(curr._1) match {
    case Some(value: Int) => (accumAsMap + (curr._1 -> (value + curr._2))).toList
    case None => curr :: accum
  }
}))).collect
Ultimately, it seems that you do not understand what an RDD is, so you might want to read some of the docs on them.
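If the data really is a local Array rather than an RDD, a shorter way to get reduceByKey-like behaviour on the inner arrays is groupBy followed by a sum. This is a minimal sketch: the sample data and the helper name reduceByKeyLocal are made up for illustration, and the inner result is a Map instead of an Array.

```scala
// Sample data in the shape from the question; the values are invented.
val data: Array[(Int, Array[(String, Int)])] = Array(
  1 -> Array("foo" -> 1, "foo" -> 1, "bar" -> 3),
  2 -> Array("baz" -> 2)
)

// Hypothetical helper: sums the values of each key in a local collection.
def reduceByKeyLocal(pairs: Array[(String, Int)]): Map[String, Int] =
  pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }

val result: Array[(Int, Map[String, Int])] =
  data.map { case (id, pairs) => id -> reduceByKeyLocal(pairs) }

println(result.toList)
```

Calling .toArray on each inner Map would restore the original Array[(String, Int)] shape if that is required.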
I'm having a real brain fart here. I'm working with the Play Framework. I have a method which takes a map and turns it into a HTML select element. I had a one-liner to take a list of objects and convert it into a map of two of the object's fields, id and name. However, I'm a Java programmer and my Scala is weak, and I've only gone and forgotten the syntax of how I did it.
I had something like
organizations.all.map {org => /* org.prop1, org.prop2 */ }
Can anyone complete the commented part?
I would suggest:
organizations.all.map { org => (org.id, org.name) }.toMap
e.g.
scala> case class T(val a : Int, val b : String)
defined class T
scala> List(T(1, "A"), T(2, "B"))
res0: List[T] = List(T(1,A), T(2,B))
scala> res0.map(t => (t.a, t.b))
res1: List[(Int, String)] = List((1,A), (2,B))
scala> res0.map(t => (t.a, t.b)).toMap
res2: scala.collection.immutable.Map[Int,String] = Map(1 -> A, 2 -> B)
You could also take an intermediary List out of the equation and go straight to the Map like this:
case class Org(prop1:String, prop2:Int)
val list = List(Org("foo", 1), Org("bar", 2))
val map:Map[String,Int] = list.map(org => (org.prop1, org.prop2))(collection.breakOut)
Using collection.breakOut as the implicit CanBuildFrom allows you to basically skip a step in the process of getting a Map from a List.
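Note that collection.breakOut was removed in Scala 2.13. A sketch of one way to avoid the intermediate List without it is to map over an iterator and call toMap at the end; the Org class and sample values are the ones from the answer above.

```scala
case class Org(prop1: String, prop2: Int)
val list = List(Org("foo", 1), Org("bar", 2))

// iterator is lazy, so no intermediate List of tuples is built before toMap.
val map: Map[String, Int] = list.iterator.map(org => org.prop1 -> org.prop2).toMap

println(map)
```

For small lists the difference from list.map(...).toMap is negligible, so the plainer form is usually fine.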