pyspark parse user_agents into new columns

I am having a problem when trying to parse user agents. I am using this lib and have followed the instructions from a Medium article.
def parse_ua(ua_string):
    # parse library cannot parse None
    if ua_string is None:
        ua_string = ""
    parsed_string = parse(ua_string)
    output = [
        parsed_string.device.brand,
        parsed_string.device.family,
        parsed_string.device.model,
        parsed_string.os.family,
        parsed_string.os.version_string,
        parsed_string.browser.family,
        parsed_string.browser.version_string,
        (parsed_string.is_mobile or parsed_string.is_tablet),
        parsed_string.is_bot
    ]
    # If any of the columns has a None value it doesn't comply with the schema
    # and thus throws a NullPointerException
    for i in range(len(output)):
        if output[i] is None:
            print(output[i])
            output[i] = 'Unknown'
    return output
ua_parser_udf = F.udf(lambda z: parse_ua(z), StructType([
    StructField("device_brand", StringType(), False),
    StructField("device_family", StringType(), False),
    StructField("device_model", StringType(), False),
    StructField("os_family", StringType(), False),
    StructField("os_version", StringType(), False),
    StructField("browser_family", StringType(), False),
    StructField("browser_version", StringType(), False)
]))
df = df.withColumn('parsed', ua_parser_udf('meta_user_agent'))\
    .select('*',
            F.col('parsed.device_brand').alias('device_brand'),
            F.col('parsed.device_family').alias('device_family'),
            F.col('parsed.device_model').alias('device_model'),
            F.col('parsed.os_family').alias('os_family'),
            F.col('parsed.os_version').alias('os_version'),
            F.col('parsed.browser_family').alias('browser_family'),
            F.col('parsed.browser_version').alias('browser_version'),
            F.col('parsed.is_mobile').alias('is_mobile'),
            F.col('parsed.is_bot').alias('is_bot')
            )
My problem is that, when I call the function, I get this error:
Could not serialize object: Py4JError: An error occurred while calling o517.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
I operate in AWS, using S3 as HDFS, AWS Glue for reading from the catalog, and the AWS Glue Data Catalog as a metastore. Any tips or help on how to fix this issue? Thanks a lot!
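This Py4J __getstate__ error typically appears when something backed by a JVM object (a SparkSession, DataFrame, or similar) ends up being pickled, for example by being captured in the UDF's closure; it is not raised by the parsing logic itself. Also note that the declared schema has 7 fields while parse_ua returns 9 values, and the select references parsed.is_mobile and parsed.is_bot, which are not in that schema. For comparison, here is a minimal, self-contained sketch of a struct-returning UDF over the same fields (the parse_ua_udf name and the two boolean schema fields are my additions; whether this resolves the serialization error depends on what else is in scope):
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, BooleanType
from user_agents import parse  # the "lib" referenced in the question

ua_schema = StructType([
    StructField("device_brand", StringType(), True),
    StructField("device_family", StringType(), True),
    StructField("device_model", StringType(), True),
    StructField("os_family", StringType(), True),
    StructField("os_version", StringType(), True),
    StructField("browser_family", StringType(), True),
    StructField("browser_version", StringType(), True),
    StructField("is_mobile", BooleanType(), True),
    StructField("is_bot", BooleanType(), True),
])

@F.udf(returnType=ua_schema)
def parse_ua_udf(ua_string):
    # parse cannot handle None, so fall back to an empty string
    ua = parse(ua_string or "")
    return (
        ua.device.brand or "Unknown",
        ua.device.family or "Unknown",
        ua.device.model or "Unknown",
        ua.os.family or "Unknown",
        ua.os.version_string or "Unknown",
        ua.browser.family or "Unknown",
        ua.browser.version_string or "Unknown",
        bool(ua.is_mobile or ua.is_tablet),
        bool(ua.is_bot),
    )

# flatten the struct into top-level columns
df = df.withColumn("parsed", parse_ua_udf("meta_user_agent")).select("*", "parsed.*")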

Related

Creating a pyspark dataframe from exploded (nested) json values

I'm trying to get nested json values in a pyspark dataframe. I have easily solved this using pandas, but now I'm trying to get it working with just pyspark functions.
print(response)
{'ResponseMetadata': {'RequestId': 'PGMCTZNAPV677CWE', 'HostId': '/8qweqweEfpdegFSNU/hfqweqweqweSHtM=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': '/8yacqweqwe/hfjuSwKXDv3qweqweqweHtM=', 'x-amz-request-id': 'PqweqweqweE', 'date': 'Fri, 09 Sep 2022 09:25:04 GMT', 'x-amz-bucket-region': 'eu-central-1', 'content-type': 'application/xml', 'transfer-encoding': 'chunked', 'server': 'AmazonS3'}, 'RetryAttempts': 0}, 'IsTruncated': False, 'Contents': [{'Key': 'qweqweIntraday.csv', 'LastModified': datetime.datetime(2022, 7, 12, 8, 32, 10, tzinfo=tzutc()), 'ETag': '"qweqweqwe4"', 'Size': 1165, 'StorageClass': 'STANDARD'}], 'Name': 'test-bucket', 'Prefix': '', 'MaxKeys': 1000, 'EncodingType': 'url', 'KeyCount': 1}
With pandas I can parse this input into a dataframe with the following code:
object_df = pd.DataFrame()
for elem in response:
    if 'Contents' in elem:
        object_df = pd.json_normalize(response['Contents'])
print(object_df)
Key LastModified \
0 202207110000_qweIntraday.csv 2022-07-12 08:32:10+00:00
ETag Size StorageClass
0 "fqweqweqwee0cb4" 1165 STANDARD
(there are sometimes multiple "Contents", so I have to use recursion).
This was my attempt to replicate this with spark dataframe, and sc.parallelize:
object_df = spark.sparkContext.emptyRDD()
for elem in response:
    if 'Contents' in elem:
        rddjson = spark.read.json(sc.parallelize([response['Contents']]))
Also tried:
sqlc = SQLContext(sc)
rddjson = spark.read.json(sc.parallelize([response['Contents']]))
df = sqlc.read.json("multiline", "true").json(rddjson)
df.show()
+--------------------+
| _corrupt_record|
+--------------------+
|[{'Key': '2/3c6a6...|
+--------------------+
This is not working. I already saw some related posts saying that I can use explode like in this example (stackoverflow answer) instead of json_normalize, but I'm having trouble replicating the example.
Any suggestion on how I can solve this with pyspark or pyspark.sql (without adding additional libraries) is very welcome.
It looks like the issue is with the data containing a python datetime object (in the LastModified field).
One way around this might be (assuming you're ok with python standard libraries):
import json

sc = spark.sparkContext
for elem in response:
    if 'Contents' in elem:
        json_str = json.dumps(response['Contents'], default=str)
        object_df = spark.read.json(sc.parallelize([json_str]))
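Since default=str stringifies the datetime objects, LastModified comes back as a string column; if a real timestamp is wanted, it can be cast back afterwards. A small sketch (column name taken from the response above; an explicit format string may be needed depending on the exact string form):
from pyspark.sql import functions as F

# "2022-07-12 08:32:10+00:00"-style strings; the default cast handles ISO-like forms,
# otherwise pass a datetime pattern as the second argument to to_timestamp
object_df = object_df.withColumn("LastModified", F.to_timestamp("LastModified"))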

spark structured streaming pyspark, applyInPandas being called immediately and not waiting for window to expire

Spark Structured Streaming does not allow window functions to perform lag/lead operations, so I am trying to use the applyInPandas function.
I have a tumbling window of 5 minutes with the watermark set to 1 minute, in append mode. I need to wait until the window expires and then apply my UDF on it, so I am using applyInPandas.
The problem is that my custom UDF is being called immediately after data arrives, not waiting until the window expires.
My code
schema = StructType([
    StructField("date", TimestampType(), True),
    StructField("id", StringType(), True),
    StructField("x1", IntegerType(), True),
    StructField("y1", IntegerType(), True),
])

newschema = StructType([
    StructField("date", TimestampType(), True),
    StructField("id", StringType(), True),
    StructField("x1", DoubleType(), True),
    StructField("y1", DoubleType(), True),
    StructField("x2", DoubleType(), True),
    StructField("y2", DoubleType(), True),
    StructField("dis", DoubleType(), True)
])
spark = SparkSession.builder.appName("testing").getOrCreate()
spark.sparkContext.setLogLevel('WARN')

truck_data = spark.readStream.format("kafka")\
    .option("kafka.bootstrap.servers", KAFKA_BROKER)\
    .option("subscribe", KAFKA_INPUT_TOPIC)\
    .option("startingOffsets", "latest")\
    .load()

truck_data = truck_data.withColumn("value", col("value").cast(StringType()))\
    .withColumn("jsonData", from_json(col("value"), schema))\
    .select("jsonData.*")
def f(pdf):
    print("inside applyinpandas")
    pdf = pdf.sort_values(by='date')
    pdf = pdf.assign(x2=pdf.x1.shift(1))
    pdf = pdf.assign(y2=pdf.y1.shift(1))
    pdf = pdf.assign(dis=np.sqrt((pdf.x1 - pdf.x2)**2 + (pdf.y1 - pdf.y2)**2))
    pdf = pdf.fillna(0)
    return pdf
truck_data = truck_data.withWatermark("date", "1 minute")\
    .groupBy(window("date", "5 minute"), "id")\
    .applyInPandas(f, schema=newschema)

query = truck_data\
    .writeStream\
    .format("console")\
    .option("numRows", 300)\
    .option("truncate", False)\
    .start().awaitTermination()
Sample input data
{"date":"2022-03-23 09:04:32.242637","id":"B","x1":3,"y1":3}
{"date":"2022-03-23 09:04:32.242737","id":"A","x1":2,"y1":2}
{"date":"2022-03-23 09:04:29.242737","id":"A","x1":1,"y1":1}
{"date":"2022-03-23 09:04:55.242737","id":"A","x1":6,"y1":6}
{"date":"2022-03-23 09:04:40.242737","id":"B","x1":7,"y1":7}
{"date":"2022-03-23 09:04:29.242737","id":"B","x1":1,"y1":1}
{"date":"2022-03-23 09:04:44.242737","id":"A","x1":5,"y1":5}
{"date":"2022-03-23 09:04:35.242737","id":"B","x1":5,"y1":5}
{"date":"2022-03-23 09:04:35.242737","id":"A","x1":3,"y1":3}
{"date":"2022-03-23 09:04:40.242737","id":"A","x1":4,"y1":4}
{"date":"2022-03-23 09:04:44.242737","id":"B","x1":9,"y1":9}
{"date":"2022-03-23 09:04:55.242737","id":"B","x1":11,"y1":11}
{"date":"2022-03-23 09:06:55.242737","id":"B","x1":11,"y1":11}
Output required
[output image from PyCharm]
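No answer is recorded for this question, but one point of comparison: with built-in aggregations, a windowed groupBy with a watermark in append mode only emits a window's result after the watermark passes the end of that window. A minimal baseline sketch (my code, reusing the question's 5-minute window; "parsed" here stands for the stream right after the .select("jsonData.*") step, and this does not use applyInPandas):
from pyspark.sql import functions as F

# built-in aggregation over the same tumbling window, append mode
windowed = parsed.withWatermark("date", "1 minute")\
    .groupBy(F.window("date", "5 minute"), "id")\
    .agg(F.max("x1").alias("max_x1"))

query = windowed.writeStream\
    .outputMode("append")\
    .format("console")\
    .start()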

Extract Embedded AWS Glue Connection Credentials Using Scala

I have a glue job that reads directly from redshift, and to do that, one has to provide connection credentials. I have created an embedded glue connection and can extract the credentials with the following pyspark code. Is there a way to do this in Scala?
glue = boto3.client('glue', region_name='us-east-1')
response = glue.get_connection(
    Name='name-of-embedded-connection',
    HidePassword=False
)

table = spark.read.format(
    'com.databricks.spark.redshift'
).option(
    'url',
    'jdbc:redshift://prod.us-east-1.redshift.amazonaws.com:5439/db'
).option(
    'user',
    response['Connection']['ConnectionProperties']['USERNAME']
).option(
    'password',
    response['Connection']['ConnectionProperties']['PASSWORD']
).option(
    'dbtable',
    'db.table'
).option(
    'tempdir',
    's3://config/glue/temp/redshift/'
).option(
    'forward_spark_s3_credentials', 'true'
).load()
There is no Scala equivalent from AWS to issue this API call, but you can use Java SDK code inside Scala as mentioned in this answer.
This is the Java SDK call for getConnection. If you don't want to do that, you can follow the approach below:
Create an AWS Glue Python shell job and retrieve the connection information.
Once you have the values, call the other Scala Glue job with them as arguments from inside your Python shell job, as shown below:
glue = boto3.client('glue', region_name='us-east-1')
response = glue.get_connection(
    Name='name-of-embedded-connection',
    HidePassword=False
)

response = glue.start_job_run(
    JobName='my_scala_Job',
    Arguments={
        '--username': response['Connection']['ConnectionProperties']['USERNAME'],
        '--password': response['Connection']['ConnectionProperties']['PASSWORD']
    }
)
Then access these parameters inside your Scala job using getResolvedOptions, as shown below:
import com.amazonaws.services.glue.util.GlueArgParser

val args = GlueArgParser.getResolvedOptions(
  sysArgs, Array(
    "username",
    "password")
)
val user = args("username")
val pwd = args("password")

pySpark: java.lang.UnsupportedOperationException: Unimplemented type: StringType

While reading a group of parquet files that were written with inconsistent schemas, we have an issue with schema merging.
On switching to manually specifying the schema, I get the following error. Any pointer will be helpful.
java.lang.UnsupportedOperationException: Unimplemented type: StringType
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readDoubleBatch(VectorizedColumnReader.java:389)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:195)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
source_location = "{}/{}/{}/dt={}/{}/*_{}_{}.parquet".format(source_initial,
                                                             bucket,
                                                             source_prefix,
                                                             date,
                                                             source_file_pattern,
                                                             date,
                                                             source_file_pattern)
schema = StructType([
    StructField("Unnamed", StringType(), True),
    StructField("nanos", LongType(), True),
    StructField("book", LongType(), True),
    StructField("X_o", LongType(), True),
    StructField("Y_o", LongType(), True),
    StructField("Z_o", LongType(), True),
    StructField("Total", DoubleType(), True),
    StructField("P_v", DoubleType(), True),
    StructField("R_v", DoubleType(), True),
    StructField("S_v", DoubleType(), True),
    StructField("message_type", StringType(), True),
    StructField("symbol", StringType(), True),
    StructField("date", StringType(), True),
    StructField("__index_level_0__", StringType(), True)
])
print("Querying data from source location {}".format(source_location))
df_raw = spark.read.format('parquet').load(source_location, schema = schema, inferSchema = False,mergeSchema="true")
df_raw = df_raw.filter(df_raw.nanos.between(open_nano,close_nano))
df_raw = df_raw.withColumn("timeInWindow_nano",(fun.ceil(df_raw.nanos/(window_nano))).cast("int"))
df_core = df_raw.groupBy("date","symbol","timeInWindow_nano").agg(fun.sum("Total").alias("Total"),
fun.sum("P_v").alias("P_v"),
fun.sum("R_v").alias("R_v"),
fun.sum("S_v").alias("S_v"))
df_core = df_core.withColumn("P_v",fun.when(df_core.Total < 0,0).otherwise(df_core.P_v))
df_core = df_core.withColumn("R_v",fun.when(df_core.Total < 0,0).otherwise(df_core.R_v))
df_core = df_core.withColumn("S_v",fun.when(df_core.Total < 0,0).otherwise(df_core.S_v))
df_core = df_core.withColumn("P_pct",df_core.P_v*df_core.Total)
df_core = df_core.withColumn("R_pct",df_core.R_v*df_core.Total)
df_core = df_core.withColumn("S_pct",df_core.S_v*df_core.Total)
You cannot read parquet files in one load if the schemas are not compatible. My advice would be to separate this into two loads and then union the dataframes once you have made them compatible. See example code:
schema1_df = spark.read.parquet('path/to/files/with/string/field.parquet')
schema2_df = spark.read.parquet('path/to/files/with/double/field.parquet')
df = schema2_df.unionAll(schema1_df.withColumn('invalid_col', schema1_df.invalid_col.cast('double')))
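A close variant of the same idea is unionByName, which matches columns by name rather than by position, so the two loads do not need identical column order. A sketch using the same hypothetical paths and column name as above:
schema1_df = spark.read.parquet('path/to/files/with/string/field.parquet')
schema2_df = spark.read.parquet('path/to/files/with/double/field.parquet')

# cast the mismatched column first, then union by column name
df = schema2_df.unionByName(
    schema1_df.withColumn('invalid_col', schema1_df.invalid_col.cast('double'))
)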

How to create data frames from an RDD of a list of words

I have gone through all the answers on Stack Overflow and on the internet but nothing works. So I have this RDD of a list of words:
tweet_words = ['tweet_text',
               'RT',
               '#ochocinco:',
               'I',
               'beat',
               'them',
               'all',
               'for',
               '10',
               'straight',
               'hours']
What I have done till now:
Df = sqlContext.createDataFrame(tweet_words, ["tweet_text"])
and
tweet_words.toDF(['tweet_words'])
ERROR:
TypeError: Can not infer schema for type: <class 'str'>
Looking at the above code, you are trying to convert a list to a DataFrame. A good StackOverflow link on this is: https://stackoverflow.com/a/35009289/1100699.
That said, here's a working version of your code:
from pyspark.sql import Row
# Create RDD
tweet_wordsList = ['tweet_text', 'RT', '#ochocinco:', 'I', 'beat', 'them', 'all', 'for', '10', 'straight', 'hours']
tweet_wordsRDD = sc.parallelize(tweet_wordsList)
# Load each word and create row object
wordRDD = tweet_wordsRDD.map(lambda l: l.split(","))
tweetsRDD = wordRDD.map(lambda t: Row(tweets=t[0]))
# Infer schema (using reflection)
tweetsDF = tweetsRDD.toDF()
# show data
tweetsDF.show()
HTH!
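As a side note, the original TypeError: Can not infer schema for type: <class 'str'> comes from passing bare strings where Spark expects row-like objects; wrapping each word in a one-element tuple lets createDataFrame infer the schema directly, without going through an RDD of Row objects. A small sketch using the same data (assuming a SparkSession named spark):
tweet_wordsList = ['tweet_text', 'RT', '#ochocinco:', 'I', 'beat', 'them',
                   'all', 'for', '10', 'straight', 'hours']

# each word becomes a one-column row
tweetsDF = spark.createDataFrame([(w,) for w in tweet_wordsList], ["tweet_text"])
tweetsDF.show()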