Nested delimiter problem with gawk/sed

I have this text that I need to split:
[{names: {en: 'UK 100', es: 'UK 100'}, status: 'A', displayed: 'Y', start_time: '2011-05-12 00:00:00', start_time_xls: {en: '12th of May 2011 00:00 am', es: '12 May 2011 00:00 am'}, suspend_at: '2011-05-12 15:14:02', is_off: 'Y', score_home: '', score_away: '', bids_status: '', period_id: '', curr_period_start_time: '', score_extra_info: '', settled: 'N', ev_id: 2666872, ev_type_id: 10744, type_name: '|UK 100|'}, {names: {en: 'US 30', es: 'US 30'}, status: 'A', displayed: 'Y', start_time: '2011-05-12 00:00:00', start_time_xls: {en: '12th of May 2011 00:00 am', es: '12 May 2011 00:00 am'}, suspend_at: '2011-05-12 15:13:45', is_off: 'Y', score_home: '', score_away: '', bids_status: '', period_id: '', curr_period_start_time: '', score_extra_info: '', settled: 'N', ev_id: 2666879, ev_type_id: 10745, type_name: '|US 30|'}, {names: {en: 'Germany 30', es: 'Germany 30'}, status: 'A', displayed: 'Y', start_time: '2011-05-12 00:00:00', start_time_xls: {en: '12th of May 2011 00:00 am', es: '12 May 2011 00:00 am'}, suspend_at: '2011-05-12 15:13:52', is_off: 'Y', score_home: '', score_away: '', bids_status: '', period_id: '', curr_period_start_time: '', score_extra_info: '', settled: 'N', ev_id: 2666884, ev_type_id: 10748, type_name: '|Germany 30|'}, {names: {en: 'France 40', es: 'France 40'}, status: 'A', displayed: 'Y', start_time: '2011-05-12 00:00:00', start_time_xls: {en: '12th of May 2011 00:00 am', es: '12 May 2011 00:00 am'}, suspend_at: '2011-05-12 15:13:38', is_off: 'Y', score_home: '', score_away: '', bids_status: '', period_id: '', curr_period_start_time: '', score_extra_info: '', settled: 'N', ev_id: 2666882, ev_type_id: 10747, type_name: '|France 40|'}, {names: {en: 'US 500', es: 'US 500'}, status: 'A', displayed: 'Y', start_time: '2011-05-12 00:00:00', start_time_xls: {en: '12th of May 2011 00:00 am', es: '12 May 2011 00:00 am'}, suspend_at: '2011-05-12 15:14:30', is_off: 'Y', score_home: '', score_away: '', bids_status: '', period_id: '', curr_period_start_time: '', score_extra_info: '', settled: 'N', ev_id: 2666890, ev_type_id: 10749, type_name: '|US 500|'}, {names: {en: 'Spain 35', es: 'Spain 35'}, status: 'A', displayed: 'Y', start_time: '2011-05-12 00:00:00', start_time_xls: {en: '12th of May 2011 00:00 am', es: '12 May 2011 00:00 am'}, suspend_at: '2011-05-12 15:13:51', is_off: 'Y', score_home: '', score_away: '', bids_status: '', period_id: '', curr_period_start_time: '', score_extra_info: '', settled: 'N', ev_id: 2666886, ev_type_id: 10750, type_name: '|Spain 35|'}],
I've tried variants of this, but I keep getting caught by the 'inner' delimiters that I DON'T want to split on:
gawk -F "[" -v RS="," "NF{print $0}" text.txt
How can I split this (1) first on the main "{", ignoring the inner "{"s, and (2) then on the commas, ignoring commas between curly braces? I then want to output only one or two fields, like this:
suspend_at: '2011-05-12 15:14:02', ev_id: 2666872, ev_type_id: 10744, type_name: '|UK 100|'
Thanks in advance.

As already stated, if Perl is acceptable (parse_line from Text::ParseWords splits on the given delimiter characters while leaving quoted substrings intact, and with a keep argument of 0 it also strips the quotes):
% perl -MText::ParseWords -nle'
/suspend|ev_(id|type)|type_name/ and print for parse_line("[{},]",0, $_);
' infile
suspend_at: 2011-05-12 15:14:02
ev_id: 2666872
ev_type_id: 10744
type_name: |UK 100|
suspend_at: 2011-05-12 15:13:45
ev_id: 2666879
ev_type_id: 10745
type_name: |US 30|
suspend_at: 2011-05-12 15:13:52
ev_id: 2666884
ev_type_id: 10748
type_name: |Germany 30|
suspend_at: 2011-05-12 15:13:38
ev_id: 2666882
ev_type_id: 10747
type_name: |France 40|
suspend_at: 2011-05-12 15:14:30
ev_id: 2666890
ev_type_id: 10749
type_name: |US 500|
suspend_at: 2011-05-12 15:13:51
ev_id: 2666886
ev_type_id: 10750
type_name: |Spain 35|
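If you would rather stay in gawk, here is a rough sketch along the same lines, to be treated as a starting point rather than a drop-in solution. It relies on GNU awk treating a multi-character RS as a regular expression, so the input is split only at the outer "}, {" boundaries, and the wanted fields are then pulled out of each record with match(); \047 is the octal escape for a single quote, so the script can live inside shell single quotes:
gawk -v RS='\\}, *\\{' '{
    s = i = t = n = ""                              # reset the fields for each record
    if (match($0, /suspend_at: \047[^\047]*\047/)) s = substr($0, RSTART, RLENGTH)
    if (match($0, /ev_id: [0-9]+/))                i = substr($0, RSTART, RLENGTH)
    if (match($0, /ev_type_id: [0-9]+/))           t = substr($0, RSTART, RLENGTH)
    if (match($0, /type_name: \047[^\047]*\047/))  n = substr($0, RSTART, RLENGTH)
    print s ", " i ", " t ", " n
}' text.txt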

Related

How do I count duplicate rows where the Start Date is the same as the previous row's End Date, using PostgreSQL?

How do I add a counter (1, 2, 3, ...) to consecutive duplicate rows where a row's Start Date equals the previous row's End Date (or a row's End Date equals the next row's Start Date), using the sample data below?
DROP TABLE IF EXISTS "Count";
CREATE TABLE "Count" ("Code" bigint,
"Start Date" timestamp without time zone,
"End Date" timestamp without time zone,
"Extended" character varying(255),
"Transaction ID" bigint,
"ID #" character varying(255));
INSERT INTO "Count" ("Code",
"Start Date",
"End Date",
"Extended",
"Transaction ID" , "ID #" )
VALUES (589373857, '06/23/2021 13:03:57', '06/23/2021 14:03:57', '0', 5843454160, '299SC4'),
(589373857, '06/23/2021 13:19:22', '06/23/2021 13:43:22', '0', 5843383960, '299SC4'),
(589373857, '06/23/2021 15:10:05', '06/23/2021 16:10:05', '0', 5843727220, '299SC4'),
(589373857, '06/23/2021 16:10:05', '06/23/2021 17:10:05', 'yes', 5843596000, '299SC4'),
(588633059, '06/23/2021 15:01:23', '06/23/2021 17:25:23', '0', 5843649640, '306SC4'),
(583079912, '06/23/2021 09:30:02', '06/23/2021 09:42:02', '0', 5843471800, '338SC4'),
(582553030, '06/23/2021 09:04:53', '06/23/2021 09:14:53', '0', 5843402680, '364SC4'),
(583163101, '06/23/2021 09:16:11', '06/23/2021 09:28:11', '0', 5843518420, '364SC4'),
(584063039, '06/23/2021 15:02:13', '06/23/2021 15:12:13', '0', 5843671240, '364SC4'),
(582833001, '06/23/2021 09:01:54', '06/23/2021 10:01:54', '0', 5843480620, '438SC5'),
(585803021, '06/23/2021 10:01:54', '06/23/2021 11:01:54', 'yes', 5843524900, '438SC5'),
(585803021, '06/23/2021 11:01:54', '06/23/2021 12:01:54', 'yes', 5843522560, '438SC5'),
(585803021, '06/23/2021 12:35:23', '06/23/2021 13:11:23', '0', 5843479540, '438SC5'),
(585803021, '06/23/2021 13:11:23', '06/23/2021 13:23:23', 'yes', 5843551360, '438SC5'),
(585803021, '06/23/2021 13:23:23', '06/23/2021 13:35:23', 'yes', 5843495380, '438SC5'),
(585803021, '06/23/2021 14:11:37', '06/23/2021 15:11:37', '0', 5843489260, '438SC5'),
(585803021, '06/23/2021 15:11:37', '06/23/2021 16:11:37', 'yes', 5843723260, '438SC5'),
(585803021, '06/23/2021 16:11:37', '06/23/2021 16:47:37', 'yes', 5843623540, '438SC5'),
(585803021, '06/23/2021 16:47:37', '06/23/2021 17:11:37', 'yes', 5843561260, '438SC5'),
(585803021, '06/23/2021 17:11:37', '06/23/2021 17:23:37', 'yes', 5843673940, '438SC5'),
(585803021, '06/23/2021 17:47:02', '06/23/2021 18:59:59', '0', 5843639380, '577SC5'),
(584453015, '06/23/2021 15:39:24', '06/23/2021 16:39:24', '0', 5843734420, '584SC5');
I would like the result of the query to look like the sample table below. Please help. Thank you.
**Code Start Date End Date Extended Transaction ID ID # Counter**
589373857 6/23/2021 13:03 6/23/2021 14:03 0 5843454160 299SC4 0
589373857 6/23/2021 13:19 6/23/2021 13:43 0 5843383960 299SC4 0
589373857 6/23/2021 15:10 6/23/2021 16:10 0 5843727220 299SC4 1
589373857 6/23/2021 16:10 6/23/2021 17:10 yes 5843596000 299SC4 2
588633059 6/23/2021 15:01 6/23/2021 17:25 0 5843649640 306SC4 0
583079912 6/23/2021 9:30 6/23/2021 9:42 0 5843471800 338SC4 0
582553030 6/23/2021 9:04 6/23/2021 9:14 0 5843402680 364SC4 0
583163101 6/23/2021 9:16 6/23/2021 9:28 0 5843518420 364SC4 0
584063039 6/23/2021 15:02 6/23/2021 15:12 0 5843671240 364SC4 0
582833001 6/23/2021 9:01 6/23/2021 10:01 0 5843480620 438SC5 1
585803021 6/23/2021 10:01 6/23/2021 11:01 yes 5843524900 438SC5 2
585803021 6/23/2021 11:01 6/23/2021 12:01 yes 5843522560 438SC5 3
585803021 6/23/2021 12:35 6/23/2021 13:11 0 5843479540 438SC5 1
585803021 6/23/2021 13:11 6/23/2021 13:23 yes 5843551360 438SC5 2
585803021 6/23/2021 13:23 6/23/2021 13:35 yes 5843495380 438SC5 3
585803021 6/23/2021 14:11 6/23/2021 15:11 0 5843489260 438SC5 1
585803021 6/23/2021 15:11 6/23/2021 16:11 yes 5843723260 438SC5 2
585803021 6/23/2021 16:11 6/23/2021 16:47 yes 5843623540 438SC5 3
585803021 6/23/2021 16:47 6/23/2021 17:11 yes 5843561260 438SC5 4
585803021 6/23/2021 17:11 6/23/2021 17:23 yes 5843673940 438SC5 5
585803021 6/23/2021 17:47 6/23/2021 18:59 0 5843639380 577SC5 0
584453015 6/23/2021 15:39 6/23/2021 16:39 0 5843734420 584SC5 0
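One possible way to get that counter (just a sketch, not from the original thread) is to treat it as a gaps-and-islands problem. Assuming rows chain whenever a row's "Start Date" equals the previous row's "End Date" for the same "ID #", number the rows inside each chain and report 0 for rows that do not chain with anything; the helper names new_chain and chain_id below are only for illustration.
WITH flagged AS (
    SELECT *,
           -- 1 whenever a row does NOT continue the previous row's interval
           CASE WHEN "Start Date" = LAG("End Date") OVER w THEN 0 ELSE 1 END AS new_chain
    FROM "Count"
    WINDOW w AS (PARTITION BY "ID #" ORDER BY "Start Date")
), grouped AS (
    SELECT *,
           -- the running sum of the flags gives every chain its own id
           SUM(new_chain) OVER (PARTITION BY "ID #" ORDER BY "Start Date") AS chain_id
    FROM flagged
)
SELECT "Code", "Start Date", "End Date", "Extended", "Transaction ID", "ID #",
       CASE WHEN COUNT(*) OVER (PARTITION BY "ID #", chain_id) > 1
            THEN ROW_NUMBER() OVER (PARTITION BY "ID #", chain_id ORDER BY "Start Date")
            ELSE 0
       END AS "Counter"
FROM grouped
ORDER BY "ID #", "Start Date";
Rows that belong to a chain of two or more get 1, 2, 3, ... within the chain, and isolated rows get 0, which matches the sample output above.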

How to turn a disorganized text file into an Array[String] in Scala?

I have a text file that looks like this:
10 10
54 129 155 559 10.00 10.00
99999 3 15 15 15 15 15 15
15 15
120 195 258 744 10.00 10.00
3 99999 15 15 15 15 15 15
15 15
The number of ints/doubles per line can vary.
I can't just read it line by line because the number of values per line is not constant. I've been trying with split, mkString and such, with no success.
val lines = Source.fromFile(s"/tmp/$filepath")
.getLines.mkString
.split("\n").mkString
.split(" ").map(_.trim)
When I try to read it like:
lines(0).toInt
It returns: [NumberFormatException: For input string: ""]
I need it to look like this:
A = Array('10', '10', '54', '129', '155', '559', '10.00', '10.00', '99999', '3', '15', '15', '15', '15', '15', '15', '15', '15', '120', '195', '258', '744', '10.00', '10.00', '3', '99999', '15', '15', '15', '15', '15', '15', '15', '15')
Not sure what all those mkString calls were meant to do (joining the lines back together with no separator glues numbers from adjacent lines together, and splitting on single spaces can leave empty strings behind). Anyway, this works just fine:
io.Source.fromFile("input.txt").getLines.flatMap(_.trim.split(" +")).toArray
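If the file can also contain blank or whitespace-only lines, a slightly more defensive variant (just a sketch) is to split on runs of whitespace and drop the empty tokens:
import scala.io.Source

// Split every line on runs of whitespace and keep only non-empty tokens,
// so blank lines and stray spaces cannot produce "" entries.
val tokens: Array[String] =
  Source.fromFile("input.txt").getLines()
    .flatMap(_.split("\\s+"))
    .filter(_.nonEmpty)
    .toArray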

Spark, Scala: How to remove empty lines either from an RDD or from a DataFrame?

I am using Spark with Scala, and I have some empty rows in my RDD. I need to remove them from the RDD.
I tried it as:
val valfilteredRow = rddRow.filter(row => row!=null && row.length>0)
However, it did not work.
The rows in the RDD look like this (printed with valfilteredRow.collect().foreach(println)):
[,AAGGOO]
[,AAAOOO]
[,GGGGGII]
[]
[,UGGG]
Suppose you have the following sequence:
val seq = Seq(
",AAGGOO",
",AAAOOO",
",GGGGGII",
"",
",UGGG"
)
With a DataFrame:
val df = seq.toDF("Column_name")
df.show(false)
+--------------+
|Column_name |
+--------------+
|,AAGGOO |
|,AAAOOO |
|,GGGGGII |
| |
|,UGGG |
+--------------+
df.filter(row => !(row.mkString("").isEmpty && row.length>0)).show(false)
+--------------+
|Column_name |
+--------------+
|,AAGGOO |
|,AAAOOO |
|,GGGGGII |
|,UGGG |
+--------------+
With an RDD:
val rdd = sc.parallelize(seq)
val filteredRdd = rdd.filter(row => !row.isEmpty)
filteredRdd.foreach(println)
,AAGGOO
,AAAOOO
,GGGGGII
,UGGG
If your RDD is of type RDD[String], then you can do:
rdd.filter(_.length>0).collect
I don't know Scala, but here is what I did in PySpark:
Suppose you have an input file like:
Banana,23,Male,5,11,2017
Dragon,28,Male,1,11,2017
Dragon,28,Male,1,11,2017
The 2nd line of the file is empty (the blank line is not visible above).
rdd = sc.textFile(PATH_TO_FILE).mapPartitions(lambda line: csv.reader(line,delimiter=','))
>>> rdd.take(10)
[['Banana', '23', 'Male', '5', '11', '2017'], [], ['Dragon', '28', 'Male', '1', '11', '2017'], ['Dragon', '28', 'Male', '1', '11', '2017']]
You can see that the second element is empty, so we filter it out by checking each element's length and keeping only elements whose length is greater than one.
>>> rdd = sc.textFile(PATH_TO_FILE).mapPartitions(lambda line: csv.reader(line,delimiter=',')).filter(lambda line: len(line) > 1)
>>> rdd.take(10)
[['Banana', '23', 'Male', '5', '11', '2017'], ['Dragon', '28', 'Male', '1', '11', '2017'], ['Dragon', '28', 'Male', '1', '11', '2017']]
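One more note on the Scala side of the question: the rows print as [,AAGGOO], which looks like Row objects rather than plain strings. If rddRow really is an RDD[Row], a filter along these lines might be closer to what was needed (just a sketch; rddRow is the variable from the question):
import org.apache.spark.sql.Row

// Keep only rows whose concatenated field values contain something besides
// whitespace; empty rows such as [] are dropped.
val filteredRow = rddRow.filter((row: Row) => row != null && row.mkString("").trim.nonEmpty)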

How to create possible itemsets and sum dictionary values for the same key on an RDD in PySpark?

I have a sample of data below, and I wrote code to build a dictionary of itemsets; I want to sum the values of the same key.
import itertools
d = [frozenset({'112', 'a', 'e'}), frozenset({'112', 'a', 'e', 'd'})]
rdd = sc.parallelize(d)
def f_itemset(data):
    d = {}
    for i in range(1, len(data)+1):
        for x in itertools.combinations(data, i+1):
            if x not in d:
                d[x] = 1
            else:
                d[x] += 1
    return d
Ck = rdd.map(lambda s: sorted([l for l in s])).map(lambda x: f_itemset(x))
print(Ck.collect())
The output is shown below.
[{('112', 'a'): 1, ('112', 'e'): 1, ('a', 'e'): 1, ('112', 'a', 'e'): 1}, {('112', 'a'): 1, ('112', 'd'): 1, ('112', 'e'): 1, ('a', 'd'): 1, ('a', 'e'): 1, ('d', 'e'): 1, ('112', 'a', 'd'): 1, ('112', 'a', 'e'): 1, ('112', 'd', 'e'): 1, ('a', 'd', 'e'): 1, ('112', 'a', 'd', 'e'): 1}]
But I want the output to be:
[{('112', 'a'): 2, ('112', 'e'): 2, ('a', 'e'): 2, ('112', 'a', 'e'): 2, ('112', 'd'): 1, ('a', 'd'): 1, ('d', 'e'): 1, ('112', 'a', 'd'): 1, ('112', 'd', 'e'): 1, ('a', 'd', 'e'): 1, ('112', 'a', 'd', 'e'): 1}]
Please, anyone, advise me.
I omitted some of your initial statements and included an additional reduceByKey step to achieve the counting. Unfortunately, reduceByKey only works on key-value pairs by default; if you really want to stick to dictionaries, you have to write your own reduction function. Otherwise, this code should help you:
import itertools
d = [frozenset({'112', 'a', 'e'}), frozenset({'112', 'a', 'e', 'd'})]
rdd = sc.parallelize(d)
def f_itemset(data):
    l = list()
    for i in range(1, len(data)+1):
        for x in itertools.combinations(data, i+1):
            l.append(x)
    return l
Ck = rdd.map(lambda s: sorted([l for l in s])).flatMap(lambda x: f_itemset(x)).map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y)
print(Ck.collect())
Result:
[(('112', 'e'), 2), (('a', 'd', 'e'), 1), (('112', 'd'), 1), (('112', 'a'), 2), (('a', 'e'), 2), (('112', 'a', 'd', 'e'), 1), (('a', 'd'), 1), (('d', 'e'), 1), (('112', 'a', 'e'), 2), (('112', 'a', 'd'), 1), (('112', 'd', 'e'), 1)]
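If a single merged dictionary is wanted (as in the desired output above) rather than a list of pairs, the collected result can simply be converted, for example:
# Turn the (itemset, count) pairs into one dict with the same shape
# as the desired output in the question.
counts = dict(Ck.collect())
print(counts)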

Search in list OCL

We have a list of items and we want to compare it element by element with another list of items; the result should be a list of the items that are not present in both lists, without duplicates.
For Example:
L1={S1, S2, S3, S4, S5, S6, S7, S8, S9, S10}, L2={S1, S4, S7, S9}, listresult={S2, S3, S5, S6, S8, S10}
Given that the description is not entirely clear, I'll try to figure out a solution for you anyway:
let L1 : Sequence(String) = Sequence {'1', '2', '3', '4', '5', '6', '7', '8', '9', '10'},
L2 : Sequence(String) = Sequence {'1', '4', '7', '9' }
in L1->reject(x | L2->includes(x))
results:
'2'
'3'
'5'
'6'
'8'
'10'
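As a side note (a sketch, assuming the operands can be treated as sets): OCL also defines a difference operation on Set, so the same elements can be obtained with
L1->asSet() - L2->asSet()
although, unlike reject on a Sequence, this does not preserve the original order.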