Search in list OCL - Eclipse

We have a list of items and we want to compare it, element by element, with another list; the result should be a list of the items that are not present in both lists, with no duplicates.
For example:
L1={S1, S2, S3, S4, S5, S6, S7, S8, S9, S10}, L2={S1, S4, S7, S9}, listresult={S2, S3, S5, S6, S8, S10}

Since the description is not entirely clear, I'll try to figure out a solution for you anyway:
let L1 : Sequence(String) = Sequence {'1', '2', '3', '4', '5', '6', '7', '8', '9', '10'},
L2 : Sequence(String) = Sequence {'1', '4', '7', '9' }
in L1->reject(x | L2->includes(x))
results:
'2'
'3'
'5'
'6'
'8'
'10'
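For what it's worth, since the question could be read either as "L1 minus L2" or as a symmetric difference, here is a plain Python illustration (not OCL, with names invented for the example) of both readings; they coincide here because every element of L2 also occurs in L1:
# Illustrative Python sketch, not OCL.
L1 = ['S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'S7', 'S8', 'S9', 'S10']
L2 = ['S1', 'S4', 'S7', 'S9']

difference = [x for x in L1 if x not in L2]                 # L1 minus L2, like reject()
symmetric = [x for x in L1 + L2 if (x in L1) != (x in L2)]  # items in exactly one list

print(difference)  # ['S2', 'S3', 'S5', 'S6', 'S8', 'S10']
print(symmetric)   # same result here, since every element of L2 is also in L1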


PySpark: union column values from comma-separated strings into an array

I have a PySpark task to union multiple dataframes on id. Each dataframe has a column of comma-separated strings, i.e.
df1 = [("1", "a,b,c,"),
       ("2", "i,j,k"),
       ("3", "x,y,z")]
df2 = [("1", "b,d,e"),
       ("2", "l,m,n"),
       ("3", "x")]
Now I want to union this column's values for each entry, i.e.
df3 = [("1", "a,b,c,d,e"),
       ("2", "i,j,k,l,m,n"),
       ("3", "x,y,z")]
Is there a function to do that?
What you are looking for is the array_union function.
from pyspark.sql import functions as F

data1 = [
    ('1', 'a,b,c'),
    ('2', 'i,j,k'),
    ('3', 'x,y,z')
]
data2 = [
    ('1', 'b,d,e'),
    ('2', 'l,m,n'),
    ('3', 'x')
]
df1 = spark.createDataFrame(data1, ['id', 'values1'])
df2 = spark.createDataFrame(data2, ['id', 'values2'])

# Split both columns into arrays, take their set union, and join back into a string
df = df1.join(df2, 'id') \
    .select('id',
            F.array_join(F.array_union(F.split('values1', ','), F.split('values2', ',')), ',').alias('values'))
df.show(truncate=False)
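Assuming an active SparkSession named spark (as in a pyspark shell), the result for the sample data above should look roughly like this (row order may vary):
+---+-----------+
|id |values     |
+---+-----------+
|1  |a,b,c,d,e  |
|2  |i,j,k,l,m,n|
|3  |x,y,z      |
+---+-----------+
Note that array_union also removes duplicates, which is why 'x' is not repeated for id 3, matching the desired df3.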

How to turn a disorganized text file into an Array[String] in Scala?

I have a text file that looks like this:
10 10
54 129 155 559 10.00 10.00
99999 3 15 15 15 15 15 15
15 15
120 195 258 744 10.00 10.00
3 99999 15 15 15 15 15 15
15 15
The number of ints/doubles per line can vary.
I can't simply read it line by line because the number of values per line is not constant. I've been trying split, mkString and such with no success.
val lines = Source.fromFile(s"/tmp/$filepath")
.getLines.mkString
.split("\n").mkString
.split(" ").map(_.trim)
When I try to read it like:
lines(0).toInt
It returns: [NumberFormatException: For input string: ""]
I need it to look like this:
A = Array('10', '10', '54', '129', '155', '559', '10.00', '10.00', '99999', '3', '15', '15', '15', '15', '15', '15', '15', '15', '120', '195', '258', '744', '10.00', '10.00', '3', '99999', '15', '15', '15', '15', '15', '15', '15', '15')
Not sure what you wanted with all those mkStrings there; joining everything into one string and then splitting on single spaces is exactly what produces the empty tokens behind that NumberFormatException. Anyway, this here works just fine:
io.Source.fromFile("input.txt").getLines.flatMap(_.trim.split(" +")).toArray

Spark, Scala: How to remove empty lines either from RDD or from dataframe?

I am using Spark with Scala, and I have some empty rows in an RDD. I need to remove them from the RDD.
I tried it as:
val valfilteredRow = rddRow.filter(row => row!=null && row.length>0)
However it did not work.
The rows in the RDD look like this (printed with valfilteredRow.collect().foreach(println)):
[,AAGGOO]
[,AAAOOO]
[,GGGGGII]
[]
[,UGGG]
Suppose you have the following sequence:
val seq = Seq(
  ",AAGGOO",
  ",AAAOOO",
  ",GGGGGII",
  "",
  ",UGGG"
)
With a DataFrame:
val df = seq.toDF("Column_name")
df.show(false)
+--------------+
|Column_name |
+--------------+
|,AAGGOO |
|,AAAOOO |
|,GGGGGII |
| |
|,UGGG |
+--------------+
df.filter(row => !(row.mkString("").isEmpty && row.length>0)).show(false)
+--------------+
|Column_name |
+--------------+
|,AAGGOO |
|,AAAOOO |
|,GGGGGII |
|,UGGG |
+--------------+
With an RDD:
val rdd = sc.parallelize(seq)
val filteredRdd = rdd.filter(row => !row.isEmpty)
filteredRdd.foreach(println)
,AAGGOO
,AAAOOO
,GGGGGII
,UGGG
If your RDD is of type RDD[String], then you can do:
rdd.filter(_.length>0).collect
I don't know Scala but here is what I did in Pyspark:
Suppose you have an input file like:
Banana,23,Male,5,11,2017

Dragon,28,Male,1,11,2017
Dragon,28,Male,1,11,2017
The 2nd line is empty.
import csv
rdd = sc.textFile(PATH_TO_FILE).mapPartitions(lambda line: csv.reader(line, delimiter=','))
>>> rdd.take(10)
[['Banana', '23', 'Male', '5', '11', '2017'], [], ['Dragon', '28', 'Male', '1', '11', '2017'], ['Dragon', '28', 'Male', '1', '11', '2017']]
You can see that the second element is empty, so we filter it out by checking the length of each element, which should be greater than one.
>>> rdd = sc.textFile(PATH_TO_FILE).mapPartitions(lambda line: csv.reader(line,delimiter=',')).filter(lambda line: len(line) > 1)
>>> rdd.take(10)
[['Banana', '23', 'Male', '5', '11', '2017'], ['Dragon', '28', 'Male', '1', '11', '2017'], ['Dragon', '28', 'Male', '1', '11', '2017']]
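If the goal is simply to drop blank lines, a slightly simpler variant (an untested sketch, reusing the same hypothetical PATH_TO_FILE) is to filter the raw text lines before parsing them as CSV:
import csv

rdd = (sc.textFile(PATH_TO_FILE)
         .filter(lambda line: line.strip())  # drop empty/whitespace-only lines
         .mapPartitions(lambda lines: csv.reader(lines, delimiter=',')))
rdd.take(10)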

How to create possible sets and sum dictionary values with the same key on an RDD in PySpark?

I have a sample of the data below, and I wrote code to build dictionaries of item combinations and sum the values of identical keys.
import itertools
d = [frozenset({'112', 'a', 'e'}), frozenset({'112', 'a', 'e', 'd'})]
rdd = sc.parallelize(d)
def f_itemset(data):
    d = {}
    for i in range(1, len(data)+1):
        for x in itertools.combinations(data, i+1):
            if x not in d:
                d[x] = 1
            else:
                d[x] += 1
    return d

Ck = rdd.map(lambda s: sorted([l for l in s])).map(lambda x: f_itemset(x))
print(Ck.collect())
The output is shown below.
[{('112', 'a'): 1, ('112', 'e'): 1, ('a', 'e'): 1, ('112', 'a', 'e'): 1}, {('112', 'a'): 1, ('112', 'd'): 1, ('112', 'e'): 1, ('a', 'd'): 1, ('a', 'e'): 1, ('d', 'e'): 1, ('112', 'a', 'd'): 1, ('112', 'a', 'e'): 1, ('112', 'd', 'e'): 1, ('a', 'd', 'e'): 1, ('112', 'a', 'd', 'e'): 1}]
But, I want the output is:
[{('112', 'a'): 2, ('112', 'e'): 2, ('a', 'e'): 2, ('112', 'a', 'e'): 2, ('112', 'd'): 1, ('a', 'd'): 1, ('d', 'e'): 1, ('112', 'a', 'd'): 1, ('112', 'd', 'e'): 1, ('a', 'd', 'e'): 1, ('112', 'a', 'd', 'e'): 1}]
Please, anyone, advise me.
I omitted some of your initial statements and added a reduceByKey step to achieve the counting. Unfortunately, reduceByKey by default only processes flat (key, value) pairs rather than dictionaries; if you really want to stick with dictionaries, you have to write your own method for the reduction (a rough sketch of that is shown after the result below). Otherwise, this code can help you:
import itertools
d = [frozenset({'112', 'a', 'e'}), frozenset({'112', 'a', 'e', 'd'})]
rdd = sc.parallelize(d)
def f_itemset(data):
    l = list()
    for i in range(1, len(data)+1):
        for x in itertools.combinations(data, i+1):
            l.append(x)
    return l
Ck = rdd.map(lambda s: sorted([l for l in s])).flatMap(lambda x: f_itemset(x)).map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y)
print(Ck.collect())
Result:
[(('112', 'e'), 2), (('a', 'd', 'e'), 1), (('112', 'd'), 1), (('112', 'a'), 2), (('a', 'e'), 2), (('112', 'a', 'd', 'e'), 1), (('a', 'd'), 1), (('d', 'e'), 1), (('112', 'a', 'e'), 2), (('112', 'a', 'd'), 1), (('112', 'd', 'e'), 1)]
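If you do want to keep the per-record dictionaries that the question's f_itemset produces, one possible reduction (a minimal sketch, not part of the original answer) is to merge the dictionaries yourself, summing the counts of keys that appear in both:
def merge_counts(d1, d2):
    # Combine two count dictionaries, adding up the values of shared keys.
    merged = dict(d1)
    for key, count in d2.items():
        merged[key] = merged.get(key, 0) + count
    return merged

# f_itemset here is the dictionary-returning version from the question.
Ck = rdd.map(lambda s: sorted([l for l in s])).map(f_itemset)
print(Ck.reduce(merge_counts))
This yields a single merged dictionary with the summed counts the question asks for.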

Nested delimiter problem with gawk/sed

I have this text that I need to split:
[{names: {en: 'UK 100', es: 'UK 100'}, status: 'A', displayed: 'Y', start_time: '2011-05-12 00:00:00', start_time_xls: {en: '12th of May 2011 00:00 am', es: '12 May 2011 00:00 am'}, suspend_at: '2011-05-12 15:14:02', is_off: 'Y', score_home: '', score_away: '', bids_status: '', period_id: '', curr_period_start_time: '', score_extra_info: '', settled: 'N', ev_id: 2666872, ev_type_id: 10744, type_name: '|UK 100|'}, {names: {en: 'US 30', es: 'US 30'}, status: 'A', displayed: 'Y', start_time: '2011-05-12 00:00:00', start_time_xls: {en: '12th of May 2011 00:00 am', es: '12 May 2011 00:00 am'}, suspend_at: '2011-05-12 15:13:45', is_off: 'Y', score_home: '', score_away: '', bids_status: '', period_id: '', curr_period_start_time: '', score_extra_info: '', settled: 'N', ev_id: 2666879, ev_type_id: 10745, type_name: '|US 30|'}, {names: {en: 'Germany 30', es: 'Germany 30'}, status: 'A', displayed: 'Y', start_time: '2011-05-12 00:00:00', start_time_xls: {en: '12th of May 2011 00:00 am', es: '12 May 2011 00:00 am'}, suspend_at: '2011-05-12 15:13:52', is_off: 'Y', score_home: '', score_away: '', bids_status: '', period_id: '', curr_period_start_time: '', score_extra_info: '', settled: 'N', ev_id: 2666884, ev_type_id: 10748, type_name: '|Germany 30|'}, {names: {en: 'France 40', es: 'France 40'}, status: 'A', displayed: 'Y', start_time: '2011-05-12 00:00:00', start_time_xls: {en: '12th of May 2011 00:00 am', es: '12 May 2011 00:00 am'}, suspend_at: '2011-05-12 15:13:38', is_off: 'Y', score_home: '', score_away: '', bids_status: '', period_id: '', curr_period_start_time: '', score_extra_info: '', settled: 'N', ev_id: 2666882, ev_type_id: 10747, type_name: '|France 40|'}, {names: {en: 'US 500', es: 'US 500'}, status: 'A', displayed: 'Y', start_time: '2011-05-12 00:00:00', start_time_xls: {en: '12th of May 2011 00:00 am', es: '12 May 2011 00:00 am'}, suspend_at: '2011-05-12 15:14:30', is_off: 'Y', score_home: '', score_away: '', bids_status: '', period_id: '', curr_period_start_time: '', score_extra_info: '', settled: 'N', ev_id: 2666890, ev_type_id: 10749, type_name: '|US 500|'}, {names: {en: 'Spain 35', es: 'Spain 35'}, status: 'A', displayed: 'Y', start_time: '2011-05-12 00:00:00', start_time_xls: {en: '12th of May 2011 00:00 am', es: '12 May 2011 00:00 am'}, suspend_at: '2011-05-12 15:13:51', is_off: 'Y', score_home: '', score_away: '', bids_status: '', period_id: '', curr_period_start_time: '', score_extra_info: '', settled: 'N', ev_id: 2666886, ev_type_id: 10750, type_name: '|Spain 35|'}],
I've tried variants of these, but keep getting caught by the 'inner' delimiters that I DON'T want to split on:
gawk -F "[" -v RS="," "NF{print $0}" text.txt
How can I split this: (1) first on the outer "{", ignoring the inner "{"s, and (2) then on the commas, ignoring commas inside curly braces? I then want to output only one or two fields, like this:
suspend_at: '2011-05-12 15:14:02', ev_id: 2666872, ev_type_id: 10744, type_name: '|UK 100|'
Thanks in advance.
As already stated, if Perl is acceptable:
% perl -MText::ParseWords -nle'
/suspend|ev_(id|type)|type_name/ and print for parse_line("[{},]",0, $_);
' infile
suspend_at: 2011-05-12 15:14:02
ev_id: 2666872
ev_type_id: 10744
type_name: |UK 100|
suspend_at: 2011-05-12 15:13:45
ev_id: 2666879
ev_type_id: 10745
type_name: |US 30|
suspend_at: 2011-05-12 15:13:52
ev_id: 2666884
ev_type_id: 10748
type_name: |Germany 30|
suspend_at: 2011-05-12 15:13:38
ev_id: 2666882
ev_type_id: 10747
type_name: |France 40|
suspend_at: 2011-05-12 15:14:30
ev_id: 2666890
ev_type_id: 10749
type_name: |US 500|
suspend_at: 2011-05-12 15:13:51
ev_id: 2666886
ev_type_id: 10750
type_name: |Spain 35|
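For comparison, if Python is also an option, a rough sketch of the same extraction (hypothetical file name data.txt, and assuming the four fields always appear in that order inside each {...} record, as in the sample) could look like:
import re

# Extract suspend_at, ev_id, ev_type_id and type_name from each record.
pattern = re.compile(
    r"suspend_at: '([^']*)'.*?"
    r"ev_id: (\d+), ev_type_id: (\d+), type_name: '([^']*)'"
)

with open('data.txt') as f:
    text = f.read()

for suspend_at, ev_id, ev_type_id, type_name in pattern.findall(text):
    print(f"suspend_at: {suspend_at}, ev_id: {ev_id}, "
          f"ev_type_id: {ev_type_id}, type_name: {type_name}")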