Is there something like Python's ast.literal_eval() in Scala?

I'm trying to parse strings that look like:
info_string = "[(1, 10, 10, 2), (2, 20, 12, 3), (3, 42, 53, 1)]"
and I would like to get a list of arrays, i.e.
info_list = [(1, 10, 10, 2), (2, 20, 12, 3), (3, 42, 53, 1)]
In Python I would just do:
import ast
info_list = ast.literal_eval(info_string)
Is there similar functionality in Scala?
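The Scala standard library has no direct literal_eval equivalent, but for input this regular a small regex-based parser is enough. A minimal sketch, assuming the fields are always integers (it is not a general-purpose evaluator):
// Pull out each "(...)" group, then split and convert its fields.
val infoString = "[(1, 10, 10, 2), (2, 20, 12, 3), (3, 42, 53, 1)]"
val tuplePattern = """\(([^)]*)\)""".r
val infoList: List[Array[Int]] =
  tuplePattern.findAllMatchIn(infoString)
    .map(_.group(1).split(",").map(_.trim.toInt))
    .toList
// infoList: List(Array(1, 10, 10, 2), Array(2, 20, 12, 3), Array(3, 42, 53, 1))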

Related

groupByKey in order in PySpark

rrr = sc.parallelize([1, 2, 3])
fff = sc.parallelize([5, 6, 7, 8])
test = rrr.cartesian(fff)
Here's test:
[(1, 5),(1, 6),(1, 7),(1, 8),
(2, 5),(2, 6),(2, 7),(2, 8),
(3, 5),(3, 6),(3, 7),(3, 8)]
Is there a way to preserve the order after calling groupByKey:
test.groupByKey().mapValues(list).take(2)
The output is this, where each list is in arbitrary order:
Out[255]: [(1, [8, 5, 6, 7]), (2, [5, 8, 6, 7]), (3, [6, 8, 7, 5])]
The desired output is:
[(1, [5,6,7,8]), (2, [5,6,7,8]), (3, [5,6,7,8])]
How to achieve this?
You can add one more mapValues to sort the lists:
result = test.groupByKey().mapValues(list).mapValues(sorted)
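Since the parent question here is about Scala: a minimal sketch of the same fix in Spark's Scala API, assuming an existing SparkContext sc (groupByKey gives no ordering guarantee in either API, so the sort is applied the same way):
val rrr = sc.parallelize(Seq(1, 2, 3))
val fff = sc.parallelize(Seq(5, 6, 7, 8))
rrr.cartesian(fff).groupByKey().mapValues(_.toList.sorted).take(2)
// e.g. Array((1,List(5, 6, 7, 8)), (2,List(5, 6, 7, 8)))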

Combine two different RDDs with two different sets of data but the same key

RDD_1 contains rows like the following:
(u'id2875421', 2, datetime.datetime(2016, 3, 14, 17, 24, 55), datetime.datetime(2016, 3, 14, 17, 32, 30), 1, -73.9821548461914, 40.76793670654297, -73.96463012695312, 40.765602111816406, u'N', 455)
RDD_2 contains rows like the following:
(u'id2875421', 1.9505895451732258)
What I'm trying to do is get an rdd in the form of
(u'id2875421', 2, datetime.datetime(2016, 3, 14, 17, 24, 55), datetime.datetime(2016, 3, 14, 17, 32, 30), 1, 1.9505895451732258, u'N', 455)
So I'm trying to replace the location columns with a distance column.
rdd1.join(rdd2) gives me:
(u'id1585324', (1, 0.9773030754631484))
and rdd1.union(rdd2) gives me:
(u'id2875421', 2, datetime.datetime(2016, 3, 14, 17, 24, 55), datetime.datetime(2016, 3, 14, 17, 32, 30), 1, -73.9821548461914, 40.76793670654297, -73.96463012695312, 40.765602111816406, u'N', 455)
If I understand correctly, just convert the first RDD into a pair RDD and then join:
rdd1.keyBy(lambda x: x[0]) \
    .join(rdd2) \
    .map(lambda x: x[1][0][:5] + (x[1][1],) + x[1][0][9:]) \
    .collect()
#[(u'id2875421',
# 2,
# datetime.datetime(2016, 3, 14, 17, 24, 55),
# datetime.datetime(2016, 3, 14, 17, 32, 30),
# 1,
# 1.9505895451732258,
# u'N',
# 455)]
Here keyBy() makes x[0] the key of each rdd1 element, keeping the original tuple as the value; after joining with rdd2, map() picks out the fields you want in the final tuple.

How to convert a List[List[Long]] to a List[List[Int]]?

What's the best way to convert a List[List[Long]] to a List[List[Int]] in Scala?
For example, given the following list of type List[List[Long]]
val l: List[List[Long]] = List(List(11, 10, 11, 10, 11), List(8, 19, 24, 0, 2))
how can it be converted to List[List[Int]]?
You can also use the cats library for this and compose the List functors:
import cats.Functor
import cats.implicits._
import cats.data._
val l: List[List[Long]] = List(List(11, 10, 11, 10, 11), List(8, 19, 24, 0, 2))
Functor[List].compose[List].map(l)(_.toInt)
//or
Nested(l).map(_.toInt).value
And one more pure Scala approach (not actually safe: the cast compiles because of type erasure, but reading an element will throw a ClassCastException at runtime, since the boxed values are still java.lang.Long):
val res:List[List[Int]] = l.asInstanceOf[List[List[Int]]]
Try l.map(_.map(_.toInt)) like so
val l: List[List[Long]] = List(List(11, 10, 11, 10, 11), List(8, 19, 24, 0, 2))
l.map(_.map(_.toInt))
which should give
res2: List[List[Int]] = List(List(11, 10, 11, 10, 11), List(8, 19, 24, 0, 2))
Do this only if you are completely sure that the values won't overflow Int.
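If that assumption is in doubt, a hedged variant with an explicit bounds check fails loudly instead of silently wrapping (toIntChecked is an illustrative helper, not a library function):
// Reject values outside the Int range instead of letting toInt truncate them.
def toIntChecked(x: Long): Int = {
  require(x >= Int.MinValue && x <= Int.MaxValue, s"$x does not fit in an Int")
  x.toInt
}
val safe: List[List[Int]] = l.map(_.map(toIntChecked))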
val l1: List[List[Long]] = List(List(11, 10, 11, 10, 11), List(8, 19, 24, 0, 2))
val l2: List[List[Int]] = l1.map(list => list.map(long => long.toInt))
(Basically, every time you want to transform a List into another List, use map).
This can be achieved with a simple transformation on the collection using the map function.
map works by applying a function to each element of a list. In your case the lists are nested, so you need to apply map twice, as in the example below:
val x : List[List[Long]] = List(List(11, 10, 11, 10, 11), List(8, 19, 24, 0, 2))
println(x)
val y :List[List[Int]]= x.map(a => a.map(_.toInt))
println(y)
Output :
List(List(11, 10, 11, 10, 11), List(8, 19, 24, 0, 2))
List(List(11, 10, 11, 10, 11), List(8, 19, 24, 0, 2))

How to store each element in a dictionary and count the values with PySpark?

I want to count how often each element occurs, using a dictionary. I tried this code:
from collections import defaultdict
def f_items(data, steps=0):
    items = defaultdict(int)
    for element in data:
        if element in data:
            items[element] += 1
        else:
            items[element] = 1
    return items.items()
data = [[1, 2, 3, 'E'], [1, 2, 3, 'E'], [5, 2, 7, 112, 'A']]
rdd = sc.parallelize(data)
items = rdd.flatMap(lambda data: [y for y in f_items(data)], True)
print (items.collect())
The output of this code is shown below:
[(1, 1), (2, 1), (3, 1), ('E', 1), (1, 1), (2, 1), (3, 1), ('E', 1), (5, 1), (2, 1), (7, 1), (112, 1), ('A', 1)]
But it should show the following result:
[(1, 2), (2, 3), (3, 3), ('E', 2), (5, 1), (7, 1), (112, 1), ('A', 1)]
How to achieve this?
Your last step should be a reduceByKey call on the items RDD:
final_items = items.reduceByKey(lambda x,y: x+y)
print (final_items.collect())
You can look into this link to see some examples of reduceByKey in Scala, Java and Python.
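For reference, a minimal sketch of the same pattern in Spark's Scala API, assuming an existing SparkContext sc:
// Sum the values for each key, exactly as reduceByKey does above.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
pairs.reduceByKey(_ + _).collect()
// e.g. Array((a,2), (b,1))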

Spark - Remove intersecting elements between two array-type columns

I have a dataframe like this:
+---------+--------------------+----------------------------+
| Name| rem1| quota |
+---------+--------------------+----------------------------+
|Customer_3|[258, 259, 260, 2...|[1, 2, 3, 4, 5, 6, 7,..500]|
|Customer_4|[18, 19, 20, 27, ...|[1, 2, 3, 4, 5, 6, 7,..500]|
|Customer_5|[16, 17, 51, 52, ...|[1, 2, 3, 4, 5, 6, 7,..500]|
|Customer_6|[6, 7, 8, 9, 10, ...|[1, 2, 3, 4, 5, 6, 7,..500]|
|Customer_7|[0, 30, 31, 32, 3...|[1, 2, 3, 4, 5, 6, 7,..500]|
I would like to remove the values in rem1 from quota and create the result as a new column. I have tried:
val dfleft = dfpci_remove2.withColumn("left",$"quota".filter($"rem1"))
<console>:123: error: value filter is not a member of org.apache.spark.sql.ColumnName
Please advise.
You can't use filter on a Column that way; instead, you can write a UDF as below:
import org.apache.spark.sql.functions.udf
// quota diff rem1 keeps the quota values that do not appear in rem1
val filterList = udf((quota: Seq[Int], rem1: Seq[Int]) => quota diff rem1)
df.withColumn("left", filterList($"quota", $"rem1"))
This should give you the expected result.
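As a side note: if your Spark version is 2.4 or later, the built-in array_except function computes the same difference without a UDF (assuming the same column names as above):
import org.apache.spark.sql.functions.array_except
df.withColumn("left", array_except($"quota", $"rem1"))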
Hope this helps!