Unable to write to parquet file using fullSet.repartition(1).saveAsParquetFile("swift://notebooks.spark/tweetsFull.parquet") - scala

I am trying to build an application with Apache Spark on IBM Bluemix (ref: https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags/). I am using the streaming APIs to stream data and have successfully created the SQL table using Spark SQL. I then read the data back with a SQL select *, but I am unable to write the data to a parquet file in the Object Storage space on the IBM Bluemix platform. Following is the sample code:
.
.
.
var df = sqlContext.createDataFrame( workingRDD, schemaTweets )
df.registerTempTable("tweets_table")
df.printSchema()
root
|-- author: string (nullable = true)
|-- date: string (nullable = true)
|-- lang: string (nullable = true)
|-- text: string (nullable = true)
val results = sqlContext.sql("select * from tweets_table limit 5")
results.show
+--------------------+--------------------+----+--------------------+
| author| date|lang| text|
+--------------------+--------------------+----+--------------------+
| abc ?|Sun Nov 29 03:30:...| en|RT #fdfds: W........|
| fdsfsdf ?|Sun Nov 29 03:30:...| en|#NewsIndofsdfM R...|
| .fsdfdsf |Sun Nov 29 03:30:...| en|RT #Lsfddsfds. ..|
| Wsfsfd |Sun Nov 29 03:30:...| en|My gfsdfsdfdshtps...|
| Ffsdfsdf |Sun Nov 29 03:30:...| en|RT #Ayfsdfsdf : W...|
+--------------------+--------------------+----+--------------------+
results.repartition(1).saveAsParquetFile("swift://notebooks.spark/tweets_1.parquet")
In the Object Storage I can see that the file tweets_1.parquet is created, but it shows as 0 bytes. Can anyone tell me where I made a mistake?

When I ran through this same example, my Parquet file was saved in Object Storage but broken up into several files in a subdirectory with the same name:
tweetsFull.parquet 12/02/2015 1:48 PM 0 KB
tweetsFull.parquet/part-r-00000-c3709e95-8f23-4ec5-bdf0-f0940b2cd94b.gz.parquet 12/02/2015 1:49 PM 16 KB
tweetsFull.parquet/_common_metadata 12/02/2015 1:49 PM 1 KB
tweetsFull.parquet/_metadata 12/02/2015 1:49 PM 3 KB
tweetsFull.parquet/_SUCCESS 12/02/2015 1:49 PM 0 KB
It works if I read from this file. Is that what you are seeing?
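For example, reading the directory path back loads all of the part files together; a minimal sketch of what I mean (assuming the same sqlContext and swift container as above, on Spark 1.4+):
// Point the reader at the directory; Spark picks up every part-*.parquet beneath it.
val tweetsBack = sqlContext.read.parquet("swift://notebooks.spark/tweetsFull.parquet")
tweetsBack.printSchema()
tweetsBack.show(5)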

Sorry, I was misled by the folder name tweets_1.parquet, which shows as 0 bytes. I thought tweets_1.parquet would be the only file created, but it is actually a folder, and under it I can see all the files, which are valid.

Related

extracting day, week, hour, date, year in pyspark from a string column

I am trying to extract the day, week, hour, date, and year in pyspark; however, after using dayofweek it shows null as output.
The DataFrame is something like this:
Mailed Date
Wed, 09/29/10 03:52 PM
Tue, 09/21/10 11:51 PM
Tue, 09/21/10 11:51 PM
Tue, 09/21/10 11:51 PM
I am trying to create separate columns for day of week, month, year, and hour of day;
however, after using from pyspark.sql.functions import year, month, dayofweek, the Day output column shows null.
Code I have used:
df01 = emaildf.withColumn('Day', dayofweek('Mailed_Date')).show(5)
Converted into timestamp:
df01 = vdf.withColumn("Mailed_Date",col("Mailed_Date").cast("Timestamp"))
Output: the Day column comes back null.
Since the string datetime provided is not in the default format, you have to convert it to a timestamp using to_timestamp(). Also, you'll need to set timeParserPolicy to LEGACY if you're parsing in Spark 3.0+, due to the presence of the day name in the string.
from pyspark.sql import functions as func

spark.conf.set('spark.sql.legacy.timeParserPolicy', 'LEGACY')  # if spark 3.0+

ts_sdf = spark.sparkContext.parallelize([('Wed, 09/29/10 03:52 PM',)]).toDF(['ts_str']). \
    withColumn('ts', func.to_timestamp('ts_str', 'EEE, MM/dd/yy hh:mm a')). \
    withColumn('year', func.year('ts')). \
    withColumn('month', func.month('ts')). \
    withColumn('dayofweek', func.dayofweek('ts')). \
    withColumn('hour', func.hour('ts')). \
    withColumn('minute', func.minute('ts'))
ts_sdf.show(truncate=False)
# +----------------------+-------------------+----+-----+---------+----+------+
# |ts_str |ts |year|month|dayofweek|hour|minute|
# +----------------------+-------------------+----+-----+---------+----+------+
# |Wed, 09/29/10 03:52 PM|2010-09-29 15:52:00|2010|9 |4 |15 |52 |
# +----------------------+-------------------+----+-----+---------+----+------+
ts_sdf.printSchema()
# root
# |-- ts_str: string (nullable = true)
# |-- ts: timestamp (nullable = true)
# |-- year: integer (nullable = true)
# |-- month: integer (nullable = true)
# |-- dayofweek: integer (nullable = true)
# |-- hour: integer (nullable = true)
# |-- minute: integer (nullable = true)

pyspark rolling window timeframe

I am trying to implement a rolling window with a 30-minute timeframe, grouped by source_ip. The idea is to get the average for each source_ip. I'm not sure this is the right way to do it. The problem I have is the IP 192.168.1.3, which seems to average over more than the 30-minute window, since the packets value of 25 arrives a couple of days later.
import pyspark.sql.functions as F
from pyspark.sql import Window

df = sqlContext.createDataFrame([('192.168.1.1', 17, "2017-03-10T15:27:18+00:00"),
                                 ('192.168.1.2', 1, "2017-03-15T12:27:18+00:00"),
                                 ('192.168.1.2', 2, "2017-03-15T12:28:18+00:00"),
                                 ('192.168.1.2', 3, "2017-03-15T12:29:18+00:00"),
                                 ('192.168.1.3', 4, "2017-03-15T12:28:18+00:00"),
                                 ('192.168.1.3', 5, "2017-03-15T12:29:18+00:00"),
                                 ('192.168.1.3', 25, "2017-03-18T11:27:18+00:00")],
                                ["source_ip", "packets", "timestampGMT"])

w = (Window()
     .partitionBy("source_ip")
     .orderBy(F.col("timestampGMT").cast('long'))
     .rangeBetween(-1800, 0))

df = df.withColumn('rolling_average', F.avg("packets").over(w))
df.show(100, False)
This is the result I get. I would expect 4.5 for the first 2 entries and 25 for the third entry?
+-----------+-------+-------------------------+------------------+
|source_ip |packets|timestampGMT |rolling_average |
+-----------+-------+-------------------------+------------------+
|192.168.1.3|4 |2017-03-15T12:28:18+00:00|11.333333333333334|
|192.168.1.3|5 |2017-03-15T12:29:18+00:00|11.333333333333334|
|192.168.1.3|25 |2017-03-18T11:27:18+00:00|11.333333333333334|
|192.168.1.2|1 |2017-03-15T12:27:18+00:00|2.0 |
|192.168.1.2|2 |2017-03-15T12:28:18+00:00|2.0 |
|192.168.1.2|3 |2017-03-15T12:29:18+00:00|2.0 |
|192.168.1.1|17 |2017-03-10T15:27:18+00:00|17.0 |
+-----------+-------+-------------------------+------------------+
Convert the string to a timestamp first, and orderBy that column.
import pyspark.sql.functions as F
from pyspark.sql import Window
w = (Window()
     .partitionBy("source_ip")
     .orderBy(F.col("timestamp"))
     .rangeBetween(-1800, 0))
df = df.withColumn("timestamp", F.unix_timestamp(F.to_timestamp("timestampGMT"))) \
       .withColumn('rolling_average', F.avg("packets").over(w))
df.printSchema()
df.show(100,False)
root
|-- source_ip: string (nullable = true)
|-- packets: long (nullable = true)
|-- timestampGMT: string (nullable = true)
|-- timestamp: long (nullable = true)
|-- rolling_average: double (nullable = true)
+-----------+-------+-------------------------+----------+---------------+
|source_ip |packets|timestampGMT |timestamp |rolling_average|
+-----------+-------+-------------------------+----------+---------------+
|192.168.1.2|1 |2017-03-15T12:27:18+00:00|1489580838|1.0 |
|192.168.1.2|2 |2017-03-15T12:28:18+00:00|1489580898|1.5 |
|192.168.1.2|3 |2017-03-15T12:29:18+00:00|1489580958|2.0 |
|192.168.1.1|17 |2017-03-10T15:27:18+00:00|1489159638|17.0 |
|192.168.1.3|4 |2017-03-15T12:28:18+00:00|1489580898|4.0 |
|192.168.1.3|5 |2017-03-15T12:29:18+00:00|1489580958|4.5 |
|192.168.1.3|25 |2017-03-18T11:27:18+00:00|1489836438|25.0 |
+-----------+-------+-------------------------+----------+---------------+

Spark BinaryType to Scala/Java

I'm writing a Spark app in Scala with the following data:
+----------+--------------------+
| id| data|
+----------+--------------------+
| id1 |[AC ED 00 05 73 7...|
| id2 |[CF 33 01 61 88 9...|
+----------+--------------------+
The schema shows:
root
|-- id: string (nullable = true)
|-- data: binary (nullable = true)
I tried to convert this DataFrame into a Map, with id as the key and data as the value.
I have tried:
df.as[(String, BinaryType)].collect.toMap
but I got following error:
java.lang.UnsupportedOperationException: No Encoder found for org.apache.spark.sql.types.BinaryType
- field (class: "org.apache.spark.sql.types.BinaryType", name: "_2")
- root class: "scala.Tuple2"
BinaryType is a Spark DataType. It maps in Scala/Java to Array[Byte].
Try df.as[(String, Array[Byte])].collect.toMap.
Make sure you've imported your session's implicits, e.g. import spark.implicits._, so you gain the ability to create Encoder[T] instances implicitly.
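Putting the pieces together, a minimal sketch of what that looks like (assuming a SparkSession named spark and the DataFrame df from the question):
import spark.implicits._   // provides the Encoder for (String, Array[Byte])

// BinaryType columns come back as Array[Byte] on the JVM side.
val byteMap: Map[String, Array[Byte]] = df
  .as[(String, Array[Byte])]
  .collect()
  .toMap

// byteMap("id1") is the raw byte payload stored in the data column.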

PySpark - Get the size of each list in group by

I have a massive pyspark dataframe. I need to group by Person and then collect their Budget items into a list, to perform a further calculation.
As an example,
a = [('Bob', 562,"Food", "12 May 2018"), ('Bob',880,"Food","01 June 2018"), ('Bob',380,'Household'," 16 June 2018"), ('Sue',85,'Household'," 16 July 2018"), ('Sue',963,'Household'," 16 Sept 2018")]
df = spark.createDataFrame(a, ["Person", "Amount","Budget", "Date"])
Group By:
import pyspark.sql.functions as F
df_grouped = df.groupby('person').agg(F.collect_list("Budget").alias("data"))
Schema:
root
|-- person: string (nullable = true)
|-- data: array (nullable = true)
| |-- element: string (containsNull = true)
However, I am getting a memory error when I try to apply a UDF to each person. How can I get the size (in megabytes or gigabytes) of each list (data) for each person?
I have done the following, but I am getting nulls:
import sys
size_list_udf = F.udf(lambda data: sys.getsizeof(data)/1000, DoubleType())
df_grouped = df_grouped.withColumn("size",size_list_udf("data") )
df_grouped.show()
Output:
+------+--------------------+----+
|person| data|size|
+------+--------------------+----+
| Sue|[Household, House...|null|
| Bob|[Food, Food, Hous...|null|
+------+--------------------+----+
You just have one minor issue with your code. sys.getsizeof() returns the size of an object in bytes as an integer. You're dividing this by the integer value 1000 to get kilobytes; in Python 2, that division returns an integer. However, you defined your udf to return a DoubleType(). The simple fix is to divide by 1000.0 instead.
import sys
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

size_list_udf = F.udf(lambda data: sys.getsizeof(data) / 1000.0, DoubleType())
df_grouped = df_grouped.withColumn("size", size_list_udf("data"))
df_grouped.show(truncate=False)
#+------+-----------------------+-----+
#|person|data |size |
#+------+-----------------------+-----+
#|Sue |[Household, Household] |0.112|
#|Bob |[Food, Food, Household]|0.12 |
#+------+-----------------------+-----+
I have found that in cases where a udf is returning null, the culprit is very frequently a type mismatch.

Reading binary file in Spark Scala

I need to extract data from a binary file.
I used binaryRecords and get RDD[Array[Byte]].
From here I want to parse every record into
case class (Field1: Int, Field2: Short, Field3: Long)
How can I do this?
Assuming you have no delimiter: an Int in Scala is 4 bytes, a Short is 2 bytes, and a Long is 8 bytes. Assume that your binary data is structured (for each record) as Int, Short, Long. You should be able to take the bytes and convert them to the values you want.
import java.nio.ByteBuffer

val result = YourRDD.map(x => (ByteBuffer.wrap(x.take(4)).getInt,
                               ByteBuffer.wrap(x.drop(4).take(2)).getShort,
                               ByteBuffer.wrap(x.drop(6)).getLong))
This uses a Java library (java.nio.ByteBuffer) to convert bytes to Int/Short/Long; you can use other libraries if you want.
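If you want to land directly in the case class from the question instead of a tuple, here is a minimal sketch of the same idea (the class name Record, the SparkContext value sc, and the input path are assumptions, since the question does not name them; the record length of 14 bytes is just 4 + 2 + 8):
import java.nio.ByteBuffer

// Hypothetical name: the question's case class is unnamed, so it is called Record here.
case class Record(field1: Int, field2: Short, field3: Long)

// Fixed-length records of 4 (Int) + 2 (Short) + 8 (Long) = 14 bytes each.
val records = sc.binaryRecords("/path/to/data.bin", 14).map { bytes =>
  val buf = ByteBuffer.wrap(bytes)   // ByteBuffer reads big-endian by default
  Record(buf.getInt, buf.getShort, buf.getLong)
}
records.take(3).foreach(println)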
Since Spark 3.0, Spark has a “binaryFile” data source to read binary files.
I've found this, with more explanation, at How to read Binary file into DataFrame.
val df = spark.read.format("binaryFile").load("/tmp/binary/spark.png")
df.printSchema()
df.show()
This outputs the schema and DataFrame as below:
root
|-- path: string (nullable = true)
|-- modificationTime: timestamp (nullable = true)
|-- length: long (nullable = true)
|-- content: binary (nullable = true)
+--------------------+--------------------+------+--------------------+
| path| modificationTime|length| content|
+--------------------+--------------------+------+--------------------+
|file:/C:/tmp/bina...|2020-07-25 10:11:...| 74675|[89 50 4E 47 0D 0...|
+--------------------+--------------------+------+--------------------+
Thanks