I have a PySpark requirement to union multiple dataframes on id. Each dataframe has a certain column with comma-separated strings, i.e.:
df1=[("1", "a,b,c"),
     ("2", "i,j,k"),
     ("3", "x,y,z")]
df2=[("1", "b,d,e"),
     ("2", "l,m,n"),
     ("3", "x")]
Now I want to union this column's values for each entry together, i.e.:
df3=[("1", "a,b,c,d,e"),
     ("2", "i,j,k,l,m,n"),
     ("3", "x,y,z")]
Is there a function to do that?
What you are looking for is the array_union function: split each string into an array, union the two arrays, and join the result back into a comma-separated string.
import pyspark.sql.functions as F

data1 = [
    ('1', 'a,b,c'),
    ('2', 'i,j,k'),
    ('3', 'x,y,z')
]
data2 = [
    ('1', 'b,d,e'),
    ('2', 'l,m,n'),
    ('3', 'x')
]
df1 = spark.createDataFrame(data1, ['id', 'values1'])
df2 = spark.createDataFrame(data2, ['id', 'values2'])
df = df1.join(df2, 'id') \
    .select('id',
            F.array_join(F.array_union(F.split('values1', ','),
                                       F.split('values2', ',')), ',').alias('values'))
df.show(truncate=False)
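Since you mentioned multiple dataframes, here is a minimal sketch for folding in more than two (union_csv_columns is a name made up for illustration; it assumes every dataframe has the id column plus exactly one comma-separated string column):

from functools import reduce
import pyspark.sql.functions as F

def union_csv_columns(dfs, id_col='id', out_col='values'):
    # Rename each dataframe's value column so the join result is unambiguous.
    renamed = []
    for i, df in enumerate(dfs):
        value_col = [c for c in df.columns if c != id_col][0]
        renamed.append(df.withColumnRenamed(value_col, 'v%d' % i))
    joined = reduce(lambda a, b: a.join(b, id_col), renamed)
    # Split every value column into an array and fold array_union over them.
    merged = reduce(F.array_union,
                    [F.split('v%d' % i, ',') for i in range(len(dfs))])
    return joined.select(id_col, F.array_join(merged, ',').alias(out_col))

union_csv_columns([df1, df2]).show(truncate=False)

Note that array_union also deduplicates, which is why id 3 comes out as x,y,z rather than x,y,z,x.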
I have a text file that looks like this:
10 10
54 129 155 559 10.00 10.00
99999 3 15 15 15 15 15 15
15 15
120 195 258 744 10.00 10.00
3 99999 15 15 15 15 15 15
15 15
The amount of ints/doubles per line can vary.
I can't read it line by line because the number of values on each line is not constant. I've been trying with split, mkString and such, with no success.
val lines = Source.fromFile(s"/tmp/$filepath")
  .getLines.mkString
  .split("\n").mkString
  .split(" ").map(_.trim)
When I try to read it like:
lines(0).toInt
It returns: [NumberFormatException: For input string: ""]
I need it to look like this:
A = Array('10', '10', '54', '129', '155', '559', '10.00', '10.00', '99999', '3', '15', '15', '15', '15', '15', '15', '15', '15', '120', '195', '258', '744', '10.00', '10.00', '3', '99999', '15', '15', '15', '15', '15', '15', '15', '15')
Not sure what you wanted with all those mkStrings there... Anyway, this here works just fine (splitting on the regex " +" collapses runs of spaces, which is what was leaving empty strings in your array and triggering the NumberFormatException):
io.Source.fromFile("input.txt").getLines.flatMap(_.trim.split(" +")).toArray
I am using Spark with Scala, and I have some empty rows in an RDD. I need to remove them from the RDD.
I tried it as:
val valfilteredRow = rddRow.filter(row => row!=null && row.length>0)
However it did not work.
The rows in the RDD look like this (printed with valfilteredRow.collect().foreach(println)):
[,AAGGOO]
[,AAAOOO]
[,GGGGGII]
[]
[,UGGG]
Suppose you have the following sequence :
val seq = Seq(
  ",AAGGOO",
  ",AAAOOO",
  ",GGGGGII",
  "",
  ",UGGG"
)
With a DataFrame:
val df = seq.toDF("Column_name")
df.show(false)
+--------------+
|Column_name |
+--------------+
|,AAGGOO |
|,AAAOOO |
|,GGGGGII |
| |
|,UGGG |
+--------------+
df.filter(row => !(row.mkString("").isEmpty && row.length>0)).show(false)
+--------------+
|Column_name |
+--------------+
|,AAGGOO |
|,AAAOOO |
|,GGGGGII |
|,UGGG |
+--------------+
With an RDD:
val rdd = sc.parallelize(seq)
val filteredRdd = rdd.filter(row => !row.isEmpty)
filteredRdd.foreach(println)
,AAGGOO
,AAAOOO
,GGGGGII
,UGGG
If your RDD is of type RDD[String] then you can do it like this:
rdd.filter(_.length>0).collect
I don't know Scala but here is what I did in PySpark:
Suppose you have an input file like:
Banana,23,Male,5,11,2017

Dragon,28,Male,1,11,2017
Dragon,28,Male,1,11,2017
The 2nd line is empty.
import csv

rdd = sc.textFile(PATH_TO_FILE).mapPartitions(lambda line: csv.reader(line, delimiter=','))
>>> rdd.take(10)
[['Banana', '23', 'Male', '5', '11', '2017'], [], ['Dragon', '28', 'Male', '1', '11', '2017'], ['Dragon', '28', 'Male', '1', '11', '2017']]
You can see that the second element is empty, so we filter on the length of each element, which should be greater than one.
>>> rdd = sc.textFile(PATH_TO_FILE).mapPartitions(lambda line: csv.reader(line,delimiter=',')).filter(lambda line: len(line) > 1)
>>> rdd.take(10)
[['Banana', '23', 'Male', '5', '11', '2017'], ['Dragon', '28', 'Male', '1', '11', '2017'], ['Dragon', '28', 'Male', '1', '11', '2017']]
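A variant of the same idea (a sketch, reusing the assumed PATH_TO_FILE): drop blank lines before they ever reach the csv reader. This also keeps legitimate one-field rows, which the len(line) > 1 test would discard.

import csv

rdd = (sc.textFile(PATH_TO_FILE)
         .filter(lambda line: line.strip())  # keep only non-blank lines
         .mapPartitions(lambda lines: csv.reader(lines, delimiter=',')))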
I have a sample of data below. I wrote code that builds a dictionary of itemset counts per row, and now I want to sum the values of entries that have the same key across the dictionaries.
import itertools
d = [frozenset({'112', 'a', 'e'}), frozenset({'112', 'a', 'e', 'd'})]
rdd = sc.parallelize(d)
def f_itemset(data):
    d = {}
    for i in range(1, len(data)+1):
        for x in itertools.combinations(data, i+1):
            if x not in d:
                d[x] = 1
            else:
                d[x] += 1
    return d

Ck = rdd.map(lambda s: sorted([l for l in s])).map(lambda x: f_itemset(x))
print(Ck.collect())
The output is shown below.
[{('112', 'a'): 1, ('112', 'e'): 1, ('a', 'e'): 1, ('112', 'a', 'e'): 1}, {('112', 'a'): 1, ('112', 'd'): 1, ('112', 'e'): 1, ('a', 'd'): 1, ('a', 'e'): 1, ('d', 'e'): 1, ('112', 'a', 'd'): 1, ('112', 'a', 'e'): 1, ('112', 'd', 'e'): 1, ('a', 'd', 'e'): 1, ('112', 'a', 'd', 'e'): 1}]
But I want the output to be:
[{('112', 'a'): 2, ('112', 'e'): 2, ('a', 'e'): 2, ('112', 'a', 'e'): 2, ('112', 'd'): 1, ('a', 'd'): 1, ('d', 'e'): 1, ('112', 'a', 'd'): 1, ('112', 'd', 'e'): 1, ('a', 'd', 'e'): 1, ('112', 'a', 'd', 'e'): 1}]
Please, anyone, advise me.
I omitted some of your initial statements and added a reduceByKey step to achieve the counting. reduceByKey operates on an RDD of (key, value) pairs, so f_itemset now emits a flat list of combinations which is then mapped to (combination, 1) pairs. If you really want to stick to dictionaries you have to write your own reduction method (a sketch follows the result below). Otherwise this code can help you.
import itertools

d = [frozenset({'112', 'a', 'e'}), frozenset({'112', 'a', 'e', 'd'})]
rdd = sc.parallelize(d)

def f_itemset(data):
    l = list()
    for i in range(1, len(data)+1):
        for x in itertools.combinations(data, i+1):
            l.append(x)
    return l

Ck = (rdd.map(lambda s: sorted([l for l in s]))
         .flatMap(lambda x: f_itemset(x))
         .map(lambda x: (x, 1))
         .reduceByKey(lambda x, y: x + y))
print(Ck.collect())
Result:
[(('112', 'e'), 2), (('a', 'd', 'e'), 1), (('112', 'd'), 1), (('112', 'a'), 2), (('a', 'e'), 2), (('112', 'a', 'd', 'e'), 1), (('a', 'd'), 1), (('d', 'e'), 1), (('112', 'a', 'e'), 2), (('112', 'a', 'd'), 1), (('112', 'd', 'e'), 1)]
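If you really do want a single dictionary shaped like your desired output, here is a minimal sketch of the custom reduction mentioned above, using collections.Counter (f_itemset_dict and merge_counts are illustrative names):

import itertools
from collections import Counter

def f_itemset_dict(data):
    # One Counter per row, counting each combination of 2..len(data) items once.
    return Counter(x
                   for i in range(2, len(data) + 1)
                   for x in itertools.combinations(data, i))

def merge_counts(a, b):
    a.update(b)  # Counter.update adds the counts of shared keys
    return a

Ck = rdd.map(lambda s: f_itemset_dict(sorted(s))).reduce(merge_counts)
print(dict(Ck))

reduce merges the per-row Counters into one, so the result is a single dict of combination -> count, matching the shape in the question.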
I have this text that I need to split:
[{names: {en: 'UK 100', es: 'UK 100'}, status: 'A', displayed: 'Y', start_time: '2011-05-12 00:00:00', start_time_xls: {en: '12th of May 2011 00:00 am', es: '12 May 2011 00:00 am'}, suspend_at: '2011-05-12 15:14:02', is_off: 'Y', score_home: '', score_away: '', bids_status: '', period_id: '', curr_period_start_time: '', score_extra_info: '', settled: 'N', ev_id: 2666872, ev_type_id: 10744, type_name: '|UK 100|'}, {names: {en: 'US 30', es: 'US 30'}, status: 'A', displayed: 'Y', start_time: '2011-05-12 00:00:00', start_time_xls: {en: '12th of May 2011 00:00 am', es: '12 May 2011 00:00 am'}, suspend_at: '2011-05-12 15:13:45', is_off: 'Y', score_home: '', score_away: '', bids_status: '', period_id: '', curr_period_start_time: '', score_extra_info: '', settled: 'N', ev_id: 2666879, ev_type_id: 10745, type_name: '|US 30|'}, {names: {en: 'Germany 30', es: 'Germany 30'}, status: 'A', displayed: 'Y', start_time: '2011-05-12 00:00:00', start_time_xls: {en: '12th of May 2011 00:00 am', es: '12 May 2011 00:00 am'}, suspend_at: '2011-05-12 15:13:52', is_off: 'Y', score_home: '', score_away: '', bids_status: '', period_id: '', curr_period_start_time: '', score_extra_info: '', settled: 'N', ev_id: 2666884, ev_type_id: 10748, type_name: '|Germany 30|'}, {names: {en: 'France 40', es: 'France 40'}, status: 'A', displayed: 'Y', start_time: '2011-05-12 00:00:00', start_time_xls: {en: '12th of May 2011 00:00 am', es: '12 May 2011 00:00 am'}, suspend_at: '2011-05-12 15:13:38', is_off: 'Y', score_home: '', score_away: '', bids_status: '', period_id: '', curr_period_start_time: '', score_extra_info: '', settled: 'N', ev_id: 2666882, ev_type_id: 10747, type_name: '|France 40|'}, {names: {en: 'US 500', es: 'US 500'}, status: 'A', displayed: 'Y', start_time: '2011-05-12 00:00:00', start_time_xls: {en: '12th of May 2011 00:00 am', es: '12 May 2011 00:00 am'}, suspend_at: '2011-05-12 15:14:30', is_off: 'Y', score_home: '', score_away: '', bids_status: '', period_id: '', curr_period_start_time: '', score_extra_info: '', settled: 'N', ev_id: 2666890, ev_type_id: 10749, type_name: '|US 500|'}, {names: {en: 'Spain 35', es: 'Spain 35'}, status: 'A', displayed: 'Y', start_time: '2011-05-12 00:00:00', start_time_xls: {en: '12th of May 2011 00:00 am', es: '12 May 2011 00:00 am'}, suspend_at: '2011-05-12 15:13:51', is_off: 'Y', score_home: '', score_away: '', bids_status: '', period_id: '', curr_period_start_time: '', score_extra_info: '', settled: 'N', ev_id: 2666886, ev_type_id: 10750, type_name: '|Spain 35|'}],
I've tried variants of this, but keep getting caught by the 'inner' delimiters that I DON'T want to split on:
gawk -F "[" -v RS="," "NF{print $0}" text.txt
How can I split them: (1) first on the main "{", ignoring the inner "{"s; (2) then on the commas, ignoring commas between curly braces? I then want to output only one or two fields, like this:
suspend_at: '2011-05-12 15:14:02', ev_id: 2666872, ev_type_id: 10744, type_name: '|UK 100|'
Thanks in advance.
As already stated, if Perl is acceptable:
% perl -MText::ParseWords -nle'
/suspend|ev_(id|type)|type_name/ and print for parse_line("[{},]",0, $_);
' infile
suspend_at: 2011-05-12 15:14:02
ev_id: 2666872
ev_type_id: 10744
type_name: |UK 100|
suspend_at: 2011-05-12 15:13:45
ev_id: 2666879
ev_type_id: 10745
type_name: |US 30|
suspend_at: 2011-05-12 15:13:52
ev_id: 2666884
ev_type_id: 10748
type_name: |Germany 30|
suspend_at: 2011-05-12 15:13:38
ev_id: 2666882
ev_type_id: 10747
type_name: |France 40|
suspend_at: 2011-05-12 15:14:30
ev_id: 2666890
ev_type_id: 10749
type_name: |US 500|
suspend_at: 2011-05-12 15:13:51
ev_id: 2666886
ev_type_id: 10750
type_name: |Spain 35|
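And if Python is an option as well, a rough sketch along the same lines (it assumes the whole blob lives in input.txt and that, as in your sample, the records nest braces only one level deep):

import re

text = open('input.txt').read()

# Match each top-level {...} record, allowing one level of nested {...}
# (the names / start_time_xls sub-objects).
record_re = re.compile(r'\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\}')
# Quoted values like suspend_at: '...' and bare numeric values like ev_id: 2666872.
quoted_re = re.compile(r"(\w+): '([^']*)'")
number_re = re.compile(r"(\w+): (\d+)")

for record in record_re.findall(text):
    fields = dict(quoted_re.findall(record) + number_re.findall(record))
    print(', '.join('%s: %s' % (k, fields.get(k, ''))
                    for k in ('suspend_at', 'ev_id', 'ev_type_id', 'type_name')))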