Change the keys in a pair RDD [pyspark]

Can I change the key in a pair RDD?
I have created a normal RDD from a CSV file (NAME, AGE, NATIONALITY) using the sc.textFile command.
I want to create a pair RDD with NATIONALITY as the key and (name, age) as the value.
t1 = rdd.map(lambda x: (x.split(",")[2], x))
But t1.keys() doesn't show the keys, nor does t1.values() show the values.
I am using Python. Can you help me create this? In Scala we have an option to do the same.

You have to call collect() on the RDD, e.g. t1.keys().collect(), to print the keys. Check the example below; it works for me.
>>> rdd= sc.parallelize([['Mike',25,'XXX'],['Sam',45,'YYY'],['Jim',26,'ZZZ']])
>>> rdd.collect()
[['Mike', 25, 'XXX'], ['Sam', 45, 'YYY'], ['Jim', 26, 'ZZZ']]
# Make nationality the key and (name, age) the value
>>> t1=rdd.map(lambda x:(x[2],(x[0],x[1])))
>>> t1.collect()
[('XXX', ('Mike', 25)), ('YYY', ('Sam', 45)), ('ZZZ', ('Jim', 26))]
>>> t1.keys().collect()
['XXX', 'YYY', 'ZZZ']
>>> t1.values().collect()
[('Mike', 25), ('Sam', 45), ('Jim', 26)]
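Since the question starts from sc.textFile, the lines still need to be split before re-keying. A minimal sketch of that flow, assuming a hypothetical people.csv whose lines look like "Mike,25,XXX" with no header:
# hypothetical input file; each line looks like "Mike,25,XXX"
rdd = sc.textFile("people.csv")
# split each line, then key by nationality (index 2) with (name, age) as the value
t1 = rdd.map(lambda line: line.split(",")) \
        .map(lambda x: (x[2], (x[0], int(x[1]))))
t1.keys().collect()    # e.g. ['XXX', 'YYY', 'ZZZ'] for the sample rows above
t1.values().collect()  # e.g. [('Mike', 25), ('Sam', 45), ('Jim', 26)]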

Related

'list' object has no attribute 'map' in pyspark error

""" df = sc.textFile("/content/Shakespeare.txt")
llist = df.collect()
for line in llist:
t= simple_tokenize(line)
rdd2 = t.map(lambda word: (word,1)) # error on this line
rdd3 = rdd2.reduceByKey(lambda a,b: a+b)
"""
I am facing an error on rdd2. Can someone please help?
I think you want a simple word count using an RDD. You can do it like this:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark import SparkFiles
# read file from url
url="https://raw.githubusercontent.com/brunoklein99/deep-learning-notes/master/shakespeare.txt"
spark.sparkContext.addFile(url)
df=spark.read.csv(SparkFiles.get("shakespeare.txt"), header=True)
df.show(4)
+------------------------------------------+
|THE SONNETS                               |
+------------------------------------------+
|by William Shakespeare                    |
|From fairest creatures we desire increase |
|That thereby beauty's rose might never die|
|But as the riper should by time decease   |
+------------------------------------------+
only showing top 4 rows
# convert to rdd taking only the strings of the row
rdd=df.rdd.map(lambda x: x["THE SONNETS"])
rdd.take(4)
['by William Shakespeare',
'From fairest creatures we desire increase',
"That thereby beauty's rose might never die",
'But as the riper should by time decease']
# you can also parallelize a Python list of strings
data=["From fairest creatures we desire increase",
"That thereby beauty's rose might never die",
"But as the riper should by time decease",
"His tender heir might bear his memory",
]
rdd=spark.sparkContext.parallelize(data)
Now run the basic three steps: split into words with flatMap, map each word to (word, 1), and reduce by key.
rdd1=rdd.flatMap(lambda x: x.split(" "))
rdd2=rdd1.map(lambda word: (word,1))
rdd3=rdd2.reduceByKey(lambda a,b: a+b)
rdd3.take(20)
[('by', 66),
('William', 1),
('Shakespeare', 1),
('From', 14),
('fairest', 5),
('creatures', 2),
('we', 11),
('desire', 6),
('increase', 3),
('That', 83),
('thereby', 1),
("beauty's", 16),
('rose', 5),
('might', 19),
('never', 10),
('die', 5),
('But', 89),
('as', 66),
('the', 311),
('riper', 2)]
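As for the error in the original snippet: simple_tokenize(line) returns a plain Python list, and a list has no map method. The tokenizing has to happen inside the RDD pipeline with flatMap instead of in a driver-side loop over collect(). A minimal sketch, assuming simple_tokenize turns a line into a list of words:
df = sc.textFile("/content/Shakespeare.txt")
# tokenize inside the RDD rather than collecting lines to the driver
rdd1 = df.flatMap(lambda line: simple_tokenize(line))
rdd2 = rdd1.map(lambda word: (word, 1))
rdd3 = rdd2.reduceByKey(lambda a, b: a + b)
rdd3.take(20)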

how to make a new column by pairing elements of another column?

I have a big DataFrame and I want to make pairs from the elements of one of its columns.
col
['summer','book','hot']
['g','o','p']
Output (the pairs for the rows above):
new_col
['summer','book'],['summer','hot'],['hot','book']
['g','o'],['g','p'],['p','o']
Note that tuples will work instead of lists, e.g. ('summer','book').
I know in pandas I can do this:
df['col'].apply(lambda x: list(itertools.combinations(x, 2)))
but I'm not sure how to do it in PySpark.
You can use a UDF to do the same thing you would do in plain Python, declaring the return type as an array of arrays of strings.
import itertools
from pyspark.sql import functions as F
combinations_udf = F.udf(
    lambda x: list(itertools.combinations(x, 2)), "array<array<string>>"
)
df = spark.createDataFrame([(['hot', 'summer', 'book'],),
                            (['g', 'o', 'p'],),
                            ], ['col1'])
df1 = df.withColumn("new_col", combinations_udf(F.col("col1")))
display(df1)
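A quick check that is not in the original post: display only exists on Databricks, so on a plain PySpark session use show instead. The pairs follow the order produced by itertools.combinations:
df1.show(truncate=False)
# expected pairs per row:
#   [hot, summer, book] -> [[hot, summer], [hot, book], [summer, book]]
#   [g, o, p]           -> [[g, o], [g, p], [o, p]]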

How can I parse a string based on character count?

I am trying to parse a string and append the results to new fields in a dataframe. In SQL, it would work like this:
UPDATE myDF
SET theyear  = SUBSTRING(filename, 52, 4),
    themonth = SUBSTRING(filename, 57, 2),
    theday   = SUBSTRING(filename, 60, 2),
    thefile  = SUBSTRING(filename, 71, 99)
I want to use Scala to do the work because the dataframes I'm working with are really huge, and this will be orders of magnitude faster than doing the same in SQL. So, based on my research, I think it would look something like this, but I don't know how to count the number of characters in a field.
Here is some sample data:
abc://path_to_all_files_in_data_lake/2018/10/27/Parent/CPPP1027.Mid.414.gz
I want to get the year, the month, the day, and the file name, so in this example, I want the dataframe to have this.
val modifiedDF = df
.withColumn("theyear", )
.withColumn("themonth", )
.withColumn("theday", )
.withColumn("thefile", )
modifiedDF.show(false)
So, I want to append four fields to a dataframe: theyear, themonth, theday, and thefile. Then, do the parsing based on the count of characters in a string. Thanks.
I would probably rather use RegEx for pattern matching than string length. In this simple example, I extract the main date pattern using regexp_extract then build the other columns from there using substring:
%scala
import org.apache.spark.sql.functions._
val df = Seq( ( "abc://path_to_all_files_in_data_lake/2018/10/27/Parent/CPPP1027.Mid.414.gz" ), ( "abc://path_to_all_files_in_data_lake/2019/02/28/Parent/CPPP77.Mid.303.gz" ) )
.toDF("somePath")
.withColumn("theDate", regexp_extract($"somePath", "[0-9][0-9][0-9][0-9]/[0-9][0-9]/[0-9][0-9]", 0) )
.withColumn("theYear", substring($"theDate", 1, 4 ) )
.withColumn("theMonth", substring($"theDate", 6, 2 ) )
.withColumn("theDay", substring($"theDate", 9, 2 ) )
.withColumn("theFile", regexp_extract($"somePath", "[^/]+\\.gz", 0) )
df.show
Does that work for you?
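The answer above is Scala; since the rest of this thread is PySpark, here is a sketch of the same approach using the built-in regexp_extract and substring functions. Applied to the sample path from the question, it should yield theYear = 2018, theMonth = 10, theDay = 27 and theFile = CPPP1027.Mid.414.gz:
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("abc://path_to_all_files_in_data_lake/2018/10/27/Parent/CPPP1027.Mid.414.gz",)],
    ["somePath"])

modifiedDF = (df
    .withColumn("theDate", F.regexp_extract("somePath", r"[0-9]{4}/[0-9]{2}/[0-9]{2}", 0))
    .withColumn("theYear", F.substring("theDate", 1, 4))
    .withColumn("theMonth", F.substring("theDate", 6, 2))
    .withColumn("theDay", F.substring("theDate", 9, 2))
    .withColumn("theFile", F.regexp_extract("somePath", r"[^/]+\.gz", 0)))
modifiedDF.show(truncate=False)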
Using built-in functions on the data frame -
You can use length(Column) from org.apache.spark.sql.functions to find the size of the data in a column.
val modifiedDF = df
.withColumn("theyear", when(length($"columName"),??).otherwise(??))
Using Scala -
df.map { row =>
  val c = row.getAs[String]("columnName")
  // length of c: c.length()
  // build all the columns here
  // return a tuple like (column1, column2, ...)
}.toDF("column1", "column2")
Here is the final working solution!
%scala
import org.apache.spark.sql.functions._
val dfMod = df
.withColumn("thedate", regexp_extract($"filepath", "[0-9][0-9][0-9][0-9]/[0-9][0-9]/[0-9][0-9]", 0) )
.withColumn("theyear", substring($"thedate", 1, 4 ) )
.withColumn("themonth", substring($"thedate", 6, 2 ) )
.withColumn("theday", substring($"thedate", 9, 2 ) )
.withColumn("thefile", regexp_extract($"filepath", "[^/]+\\.gz", 0) )
dfMod.show(false)
Thanks for the assist wBob!!!

adding a new column to my spark dataframe and calculating the sum()

AttributeError: 'DataFrame' object has no attribute '_get_object_id'
First of all: It is really important that you give us a reproducible example of your dataframe. Nobody likes to look at screenshots to identify an error.
Your code is not working because Spark can't determine how the rows of your groupby result and your initial dataframe should be merged. It isn't aware that NUM_TIERS is some kind of key. Therefore you have to tell Spark which column(s) should be used to merge the groupby result with the initial dataframe.
import pyspark.sql.functions as F
from pyspark.sql import Window
l = [('OBAAAA7K2KBBO', 34),
     ('OBAAAA878000K', 138),
     ('OBAAAA878A2A0', 164),
     ('OBAAAA7K2KBBO', 496),
     ('OBAAAA878000K', 91)]
columns = ['NUM_TIERS', 'MONTAN_TR']
df = spark.createDataFrame(l, columns)
You have two options to do that. You can use a join:
df = df.join(df.groupby('NUM_TIERS').sum('MONTAN_TR'), 'NUM_TIERS')
df.show()
OR a window function:
w = Window.partitionBy('NUM_TIERS')
df = df.withColumn('SUM', F.sum('MONTAN_TR').over(w))
Output is the same for both ways:
+-------------+---------+---+
| NUM_TIERS|MONTAN_TR|SUM|
+-------------+---------+---+
|OBAAAA7K2KBBO| 34|530|
|OBAAAA7K2KBBO| 496|530|
|OBAAAA878000K| 138|229|
|OBAAAA878000K| 91|229|
|OBAAAA878A2A0| 164|164|
+-------------+---------+---+

Pyspark - Using reduceByKey on a spark dataframe column that contains lists

So I have a spark dataframe called ngram_df that looks something like this
+--------+----------------------+
| Name   | nGrams               |
+--------+----------------------+
| Alice  | [ALI, LIC, ICE]      |
| Alicia | [ALI, LIC, ICI, CIA] |
+--------+----------------------+
And I want to produce an output in a dictionary form such as:
ALI: 2, LIC: 2, ICE: 1, ICI: 1, CIA: 1
I've been trying to turn the nGrams column into an RDD so that I can use the reduceByKey function:
rdd = ngram_df.map(lambda row: row['nGrams'])
test = rdd.reduceByKey(add).collect()
However I get the error:
ValueError: too many values to unpack
Even using flatmap doesn't help as I get the error:
ValueError: need more than 1 value to unpack
This is possible with a combination of the flatMap and reduceByKey methods.
rdd = spark.sparkContext.parallelize([('Alice', ['ALI', 'LIC', 'ICE']), ('Alicia', ['ALI', 'LIC', 'ICI', 'CIA'])])
result = rdd.flatMap(lambda x: [(y, 1) for y in x[1]] ).reduceByKey(lambda x,y: x+y)
result.collect()
[('ICI', 1), ('CIA', 1), ('ALI', 2), ('ICE', 1), ('LIC', 2)]
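Starting from the dataframe itself and ending with the dictionary asked for in the question, a sketch using the Name/nGrams columns from the example:
# build (ngram, 1) pairs straight from the dataframe column, then count
pair_rdd = ngram_df.rdd.flatMap(lambda row: [(g, 1) for g in row['nGrams']])
counts = pair_rdd.reduceByKey(lambda a, b: a + b)
counts.collectAsMap()
# e.g. {'ALI': 2, 'LIC': 2, 'ICE': 1, 'ICI': 1, 'CIA': 1}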