'list' object has no attribute 'map' error in PySpark

""" df = sc.textFile("/content/Shakespeare.txt")
llist = df.collect()
for line in llist:
t= simple_tokenize(line)
rdd2 = t.map(lambda word: (word,1)) # error on this line
rdd3 = rdd2.reduceByKey(lambda a,b: a+b)
"""
I am facing an error on rdd2. Can someone please help?

I think you want a simple word count using RDDs. You can do it like this:
from pyspark.sql import SparkSession
from pyspark import SparkFiles

spark = SparkSession.builder.getOrCreate()

# read the file from a URL
url = "https://raw.githubusercontent.com/brunoklein99/deep-learning-notes/master/shakespeare.txt"
spark.sparkContext.addFile(url)
df = spark.read.csv(SparkFiles.get("shakespeare.txt"), header=True)
df.show(4, truncate=False)
+------------------------------------------+
|THE SONNETS                               |
+------------------------------------------+
|by William Shakespeare                    |
|From fairest creatures we desire increase |
|That thereby beauty's rose might never die|
|But as the riper should by time decease   |
+------------------------------------------+
only showing top 4 rows
# convert to rdd taking only the strings of the row
rdd=df.rdd.map(lambda x: x["THE SONNETS"])
rdd.take(4)
['by William Shakespeare',
'From fairest creatures we desire increase',
"That thereby beauty's rose might never die",
'But as the riper should by time decease']
# you can also parallelize a Python list of strings
data = ["From fairest creatures we desire increase",
        "That thereby beauty's rose might never die",
        "But as the riper should by time decease",
        "His tender heir might bear his memory"]
rdd = spark.sparkContext.parallelize(data)
Now run the basic three steps: split by words, count the words, and reduce by key.
rdd1=rdd.flatMap(lambda x: x.split(" "))
rdd2=rdd1.map(lambda word: (word,1))
rdd3=rdd2.reduceByKey(lambda a,b: a+b)
rdd3.take(20)
[('by', 66),
('William', 1),
('Shakespeare', 1),
('From', 14),
('fairest', 5),
('creatures', 2),
('we', 11),
('desire', 6),
('increase', 3),
('That', 83),
('thereby', 1),
("beauty's", 16),
('rose', 5),
('might', 19),
('never', 10),
('die', 5),
('But', 89),
('as', 66),
('the', 311),
('riper', 2)]
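As for the error in your snippet: simple_tokenize(line) returns a plain Python list, and a list has no .map method; .map and .reduceByKey only exist on RDDs. Assuming simple_tokenize(line) returns the list of words in a line, you can keep your file-based approach by applying flatMap to the RDD itself instead of looping over the collected list, roughly like this:
# sketch of a fix for the original snippet, assuming simple_tokenize(line) returns a list of words
rdd = sc.textFile("/content/Shakespeare.txt")
rdd2 = rdd.flatMap(simple_tokenize).map(lambda word: (word, 1))
rdd3 = rdd2.reduceByKey(lambda a, b: a + b)
rdd3.take(20)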

Related

Get sum of a column in a DF1 based on date range from DF2 in Spark

I have two dataframes, and I want to get the sum of value in dataframe 1 based on the date ranges (startDate and endDate) from dataframe 2, then sort the results from maximum to minimum in Spark.
import org.apache.spark.sql.functions.to_date

val df = sc.parallelize(Seq(
  ("2019-01-01", 100), ("2019-01-02", 150),
  ("2019-01-03", 120), ("2019-01-04", 38),
  ("2019-01-05", 200), ("2019-01-06", 381),
  ("2019-01-07", 220), ("2019-01-08", 183),
  ("2019-01-09", 160), ("2019-01-10", 109),
  ("2019-01-11", 130), ("2019-01-12", 282),
  ("2019-01-13", 10),  ("2019-01-14", 348),
  ("2019-01-15", 20),  ("2019-01-16", 190)
)).toDF("date", "value").withColumn("date", to_date($"date"))

val df_dates = sc.parallelize(Seq(
  ("2019-01-01", "2019-01-04"),
  ("2019-01-05", "2019-01-08"),
  ("2019-01-09", "2019-01-12"),
  ("2019-01-13", "2019-01-16")
)).toDF("startDate", "endDate")
  .withColumn("startDate", to_date($"startDate"))
  .withColumn("endDate", to_date($"endDate"))
The resulting output should add a sum_value column to the df_dates dataframe. I really do not know where to start. I searched the web and couldn't find a solution.
You first have to join date values to the date ranges, then aggregate:
df_dates
  .join(df, $"date".between($"startDate", $"endDate"), "left")
  .groupBy($"startDate", $"endDate")
  .agg(sum($"value").as("sum_value"))
  .orderBy($"sum_value".desc)
  .show()
+----------+----------+---------+
| startDate| endDate|sum_value|
+----------+----------+---------+
|2019-01-05|2019-01-08| 984|
|2019-01-09|2019-01-12| 681|
|2019-01-13|2019-01-16| 568|
|2019-01-01|2019-01-04| 408|
+----------+----------+---------+
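Since the rest of this thread is in PySpark, here is a rough equivalent of the same join-then-aggregate approach, assuming df and df_dates are PySpark DataFrames with the same columns as the Scala example above:
from pyspark.sql import functions as F

(df_dates
    .join(df, F.col("date").between(F.col("startDate"), F.col("endDate")), "left")
    .groupBy("startDate", "endDate")
    .agg(F.sum("value").alias("sum_value"))
    .orderBy(F.col("sum_value").desc())
    .show())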

Spark Dataframe: Calculate variance between groups

With R's dplyr I would calculate variance between groups like so:
df %>% group_by(group) %>% summarise(total = sum(value)) %>% summarise(variance_between_groups = var(total))
Trying to perform the same action with Sparks DataFrame API:
df.groupBy(group).agg(sum(value).alias("total")).agg(var_samp(total).alias("variance_between_groups"))
I receive an error in the second agg saying that it can't find total. I am clearly misunderstanding something so any help would be appreciated.
var_samp() takes a String-type column name, hence you need to provide a String as follows:
import org.apache.spark.sql.functions._

val df = Seq(
  ("a", 1.0),
  ("a", 2.5),
  ("a", 1.5),
  ("b", 2.0),
  ("b", 1.6)
).toDF("group", "value")

df.groupBy("group").
  agg(sum("value").alias("total")).
  agg(var_samp("total").alias("variance_between_groups")).
  show
// +-----------------------+
// |variance_between_groups|
// +-----------------------+
// | 0.9799999999999999|
// +-----------------------+
It can also take a column (of Column type), e.g. var_samp($"total"). See Spark's API doc for more details.
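For reference, the same two-step aggregation in PySpark looks like this (a sketch assuming an active SparkSession named spark and the same group / value columns as above):
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.5), ("a", 1.5), ("b", 2.0), ("b", 1.6)],
    ["group", "value"])

(df.groupBy("group")
   .agg(F.sum("value").alias("total"))
   .agg(F.var_samp("total").alias("variance_between_groups"))
   .show())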

efficiently using union in spark

I am new to Scala and Spark, and I have two RDDs, where A is [(1,2),(2,3)] and B is [(4,5),(5,6)], and I want to get an RDD like [(1,2),(2,3),(4,5),(5,6)]. The thing is my data is large; suppose both A and B are 10 GB. I use sc.union(A,B) but it is slow. I saw in the Spark UI that there are 28308 tasks in this stage.
Is there a more efficient way to do this?
Why don't you convert the two RDDs to dataframes and use the union function?
Converting to a dataframe is easy: you just need to import sqlContext.implicits._ and apply the .toDF() function with the column names.
For example:
val sparkSession = SparkSession.builder().appName("testings").master("local").config("", "").getOrCreate()
val sqlContext = sparkSession.sqlContext

var firstTableColumns = Seq("col1", "col2")
var secondTableColumns = Seq("col3", "col4")

import sqlContext.implicits._

var firstDF = Seq((1, 2), (2, 3), (3, 4), (2, 3), (3, 4)).toDF(firstTableColumns: _*)
var secondDF = Seq((4, 5), (5, 6), (6, 7), (4, 5)).toDF(secondTableColumns: _*)

firstDF = firstDF.union(secondDF)
It should be easier for you to work with dataframes than with RDDs. Changing a dataframe back to an RDD is quite easy too; just call the .rdd function:
val rddData = firstDF.rdd
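In PySpark the same idea looks roughly like this (a sketch, with rddA and rddB as hypothetical stand-ins for your two RDDs of pairs):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("testings").master("local").getOrCreate()

# hypothetical stand-ins for A and B
rddA = spark.sparkContext.parallelize([(1, 2), (2, 3)])
rddB = spark.sparkContext.parallelize([(4, 5), (5, 6)])

dfA = rddA.toDF(["col1", "col2"])
dfB = rddB.toDF(["col1", "col2"])

unionedDF = dfA.union(dfB)
backToRDD = unionedDF.rdd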

pyspark substring and aggregation

I am new to Spark and I've got a csv file with such data:
date, accidents, injured
2015/20/03 18:00 15, 5
2015/20/03 18:30 25, 4
2015/20/03 21:10 14, 7
2015/20/02 21:00 15, 6
I would like to aggregate this data by the specific hour when it happened. My idea is to substring date to 'year/month/day hh' with no minutes, so I can make it a key. I want to get the average of accidents and injured for each hour. Maybe there is a different, smarter way with pyspark?
Thanks guys!
Well, it depends on what you're going to do afterwards, I guess.
The simplest way would be to do as you suggest: substring the date string and then aggregate:
data = [('2015/20/03 18:00', 15, 5),
        ('2015/20/03 18:30', 25, 4),
        ('2015/20/03 21:10', 14, 7),
        ('2015/20/02 21:00', 15, 6)]
df = spark.createDataFrame(data, ['date', 'accidents', 'injured'])

df.withColumn('date_hr', df['date'].substr(1, 13)) \
  .groupby('date_hr') \
  .agg({'accidents': 'avg', 'injured': 'avg'}) \
  .show()
If you, however, want to do some more computation later on, you can parse the data to a TimestampType() and then extract the date and hour from that.
import pyspark.sql.types as typ
from pyspark.sql.functions import col, udf
from datetime import datetime

parseString = udf(lambda x: datetime.strptime(x, '%Y/%d/%m %H:%M'), typ.TimestampType())
getDate = udf(lambda x: x.date(), typ.DateType())
getHour = udf(lambda x: int(x.hour), typ.IntegerType())

df.withColumn('date_parsed', parseString(col('date'))) \
  .withColumn('date_only', getDate(col('date_parsed'))) \
  .withColumn('hour', getHour(col('date_parsed'))) \
  .groupby('date_only', 'hour') \
  .agg({'accidents': 'avg', 'injured': 'avg'}) \
  .show()
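As a side note, on a reasonably recent Spark version (2.2 or later, an assumption here) the built-in to_timestamp, to_date and hour functions can replace the UDFs entirely; a sketch of that variant:
from pyspark.sql import functions as F

# parse the 'year/day/month hour:minute' strings with built-in functions instead of UDFs
(df.withColumn('date_parsed', F.to_timestamp('date', 'yyyy/dd/MM HH:mm'))
   .withColumn('date_only', F.to_date('date_parsed'))
   .withColumn('hour', F.hour('date_parsed'))
   .groupby('date_only', 'hour')
   .agg(F.avg('accidents').alias('avg_accidents'),
        F.avg('injured').alias('avg_injured'))
   .show())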

How to convert a SQL query output (dataframe) into an array list of key value pairs in Spark Scala?

I created a dataframe in spark scala shell for SFPD incidents. I queried the data for Category count and the result is a datafame. I want to plot this data into a graph using Wisp. Here is my dataframe,
+--------------+--------+
| Category|catcount|
+--------------+--------+
| LARCENY/THEFT| 362266|
|OTHER OFFENSES| 257197|
| NON-CRIMINAL| 189857|
| ASSAULT| 157529|
| VEHICLE THEFT| 109733|
| DRUG/NARCOTIC| 108712|
| VANDALISM| 91782|
| WARRANTS| 85837|
| BURGLARY| 75398|
|SUSPICIOUS OCC| 64452|
+--------------+--------+
I want to convert this dataframe into an array list of key-value pairs. So I want a result like this, with (String, Int) type:
(LARCENY/THEFT,362266)
(OTHER OFFENSES,257197)
(NON-CRIMINAL,189857)
(ASSAULT,157529)
(VEHICLE THEFT,109733)
(DRUG/NARCOTIC,108712)
(VANDALISM,91782)
(WARRANTS,85837)
(BURGLARY,75398)
(SUSPICIOUS OCC,64452)
I tried converting this dataframe (t) into an RDD with val rddt = t.rdd and then used flatMapValues,
rddt.flatMapValues(x=>x).collect()
but still couldn't get the required result.
Or is there a way to directly give the dataframe output into Wisp?
In pyspark it'd be as below. Scala will be quite similar.
Creating test data
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,1), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])
Map the test data, reformatting from an RDD of Rows to an RDD of tuples. Then use collect to extract all the tuples as a list.
df.rdd.map(lambda x: (x[0], x[1])).collect()
[(0, 1), (0, 1), (0, 2), (1, 2), (1, 1), (1, 20), (3, 18), (3, 18), (3, 18)]
Here's the Scala Spark Row documentation that should help you convert this to Scala Spark code.
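If your dataframe keeps the Category and catcount columns from the question, the same pattern can select fields by name instead of position. A PySpark sketch (catdf is a hypothetical handle to that dataframe; in Scala the analogous calls would be row.getString / row.getLong):
# catdf is assumed to be the dataframe with Category and catcount columns
pairs = catdf.rdd.map(lambda row: (row["Category"], row["catcount"])).collect()
# pairs is a list of tuples such as ('LARCENY/THEFT', 362266)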