Unable to run Spark jobs on Spark 3.2.1 on Kubernetes - Scala

I am trying to create a simple DataFrame with the below configuration, but the executors are spawning and terminating rapidly, giving the following error:
ERROR [dispatcher-CoarseGrainedScheduler] scheduler.TaskSchedulerImpl (Logging.scala:logError(73)) - Lost
executor X on X.X.X.X: The executor with id X exited with exit code
50(Uncaught exception).
"spark.executor.instances": 2,
"spark.executor.cores": 1,
"spark.executor.memory": "6g",
"spark.driver.memory": "4g",
"spark.kubernetes.authenticate.driver.serviceAccountName": "spark",
"spark.kubernetes.container.image.pullPolicy": "Always",
"spark.kubernetes.node.selector": "spark-worker",
"spark.reducer.maxReqsInFlight": 1,
"spark.shuffle.io.retryWait": "10s",
"spark.shuffle.io.maxRetries": 5,
"spark.dynamicAllocation.enabled": "True",
"spark.dynamicAllocation.shuffleTracking.enabled": "True",
"spark.dynamicAllocation.cachedExecutorIdleTimeout": "3600s",
"spark.dynamicAllocation.executorIdleTimeout": "120s",
"spark.dynamicAllocation.schedulerBacklogTimeout": "5s",
#spark.dynamicAllocation.shuffleTracking.timeout
"spark.dynamicAllocation.executorAllocationRatio": "0.8",
"spark.dynamicAllocation.maxExecutors": 10,
"spark.dynamicAllocation.initialExecutors": 2,
"spark.dynamicAllocation.minExecutors": 1,
# For UDF pyarrow issue
"spark.driver.extraJavaOptions": "-Dio.netty.tryReflectionSetAccessible=true",
"spark.executor.extraJavaOptions":"-Dio.netty.tryReflectionSetAccessible=true"
data = [{"Category": 'A', "ID": 1, "Value": 121.44, "Truth": True},
{"Category": 'B', "ID": 2, "Value": 300.01, "Truth": False},
{"Category": 'C', "ID": 3, "Value": 10.99, "Truth": None},
{"Category": 'E', "ID": 4, "Value": 33.87, "Truth": True}
]
spark.createDataFrame(data).show()
I am running Spark 3.2.1 and Hadoop 3.2.2 on Kubernetes.
Surprisingly, the same config works well with Spark 3.1.2 and Hadoop 2.8.5.
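For reference, here is a minimal sketch (an assumption on my part, not taken from the original post; the app name is arbitrary) of how a config dict like the one above might be applied when building the session, together with the DataFrame creation that triggers the failure:

from pyspark.sql import SparkSession

conf = {
    "spark.executor.instances": "2",
    "spark.executor.memory": "6g",
    # ... remaining settings from the list above
}

builder = SparkSession.builder.appName("k8s-repro")
for key, value in conf.items():
    builder = builder.config(key, str(value))
spark = builder.getOrCreate()

data = [{"Category": 'A', "ID": 1, "Value": 121.44, "Truth": True}]
spark.createDataFrame(data).show()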

Related

plotly mapbox - create clusters in mapview

I am building a Dash app that uses the plotly scattermapbox graph object. In the current map view each point is represented as a circle. As the user zooms in and out, I'd like to cluster the points and create groupings. Here's my code for reference.
import dash
from dash import dcc, html, Input, Output
import pandas as pd

app = dash.Dash(__name__)

df = pd.DataFrame({
    'x': [1, 2, 3],
    'Lat': [37.774322, 37.777035, 37.773033],
    'Long': [-122.489761, -122.485555, -122.491220]
})

layout = html.Div([
    dcc.Graph(id="map"),
    dcc.Input(id="inp")
])

@app.callback(
    Output('map', 'figure'),
    Input('inp', 'value')
)
def fin(val):
    # do something
    data = []
    data.append({
        "type": "scattermapbox",
        "lat": df["Lat"],
        "lon": df["Long"],
        "name": "Location",
        "showlegend": False,
        "hoverinfo": "text",
        "mode": "markers",
        "clickmode": "event+select",
        "customdata": df.loc[:, cd_cols].values,
        "marker": {
            "symbol": "circle",
            "size": 8,
            "opacity": 0.7,
            "color": "black"
        }
    })
    layout = {
        "autosize": True,
        "hovermode": "closest",
        "mapbox": {
            "accesstoken": MAPBOX_KEY,
            "bearing": 0,
            "center": {
                "lat": xxx,
                "lon": xxx
            },
            "pitch": 0,
            "zoom": zoom,
            "style": "satellite-streets",
        },
    }
    return {'data': data, 'layout': layout}
Try using plotly.graph_objects.scattermapbox.Cluster. Hope this helps:
from dash import dcc, html, Dash, Output, Input
import pandas as pd
import plotly.graph_objects as go

app = Dash(__name__)

df = pd.DataFrame({
    'x': [1, 2, 3],
    'Lat': [37.774322, 37.777035, 37.773033],
    'Long': [-122.489761, -122.485555, -122.491220]
})

@app.callback(
    Output('map', 'figure'),
    Input('inp', 'value')
)
def fin(val):
    data = []
    data.append({
        "type": "scattermapbox",
        "lat": df["Lat"],
        "lon": df["Long"],
        "name": "Location",
        "showlegend": False,
        "hoverinfo": "text",
        "mode": "markers",
        "clickmode": "event+select",
        "customdata": df.loc[:, ['Lat', 'Long']].values,
        "marker": {
            "symbol": "circle",
            "size": 8,
            "opacity": 0.7,
            "color": "black"
        },
        "cluster": {'maxzoom': 14}
    })
    layout = {
        "autosize": True,
        "hovermode": "closest",
        "mapbox": {
            "bearing": 0,
            "center": {
                "lat": 37.774322,
                "lon": -122.489761
            },
            "pitch": 0,
            "zoom": 7,
            "style": "open-street-map",
        },
    }
    return {'data': data, 'layout': layout}

app.layout = html.Div(
    [dcc.Graph(id="map"),
     dcc.Input(id="inp")]
)

if __name__ == '__main__':
    app.run_server(debug=True)
Notice the cluster parameter I added to data.
P.S. Make sure you are using a recent version of dash for this to work. I used the latest version, dash 2.7.1.
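For completeness, the same clustering can also be expressed with graph_objects directly. This is a minimal sketch assuming plotly >= 5.11 and the df defined above; the open-street-map style is used so no Mapbox token is needed:

import plotly.graph_objects as go

fig = go.Figure(go.Scattermapbox(
    lat=df["Lat"],
    lon=df["Long"],
    mode="markers",
    marker={"size": 8, "color": "black", "opacity": 0.7},
    cluster={"enabled": True, "maxzoom": 14},  # groups nearby points until the user zooms past maxzoom
))
fig.update_layout(
    mapbox_style="open-street-map",
    mapbox_zoom=7,
    mapbox_center={"lat": 37.774322, "lon": -122.489761},
)
fig.show()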

AWS Glue pySpark Filter & Manual Mapping of Several Columns

I'm using AWS Glue Studio with DynamicFrameCollections. I created a custom transformation where I am looking to filter by multiple columns and modify two columns in each row based on a static mapping list. I'm struggling to figure out the most efficient way to do this: pandas, UDFs, or something completely different?
Consider the sample dataframe:
data = [{"Category": 'A', "Subcategory": 2, "Value": 121.44, "Properties": {}},
{"Category": 'B', "Subcategory": 2, "Value": 300.01, "Properties": None},
{"Category": 'C', "Subcategory": 3, "Value": 10.99, "Properties": { "Active":True } },
{"Category": 'E', "Subcategory": 4, "Value": 33.87, "Properties": { "Active":True, "ReadOnly": False }},
{"Category": 'E', "Subcategory": 1, "Value": 11.37, "Properties": { "Active":True }}
]
df = spark.createDataFrame(data)
I need to filter and transform by Category and Subcategory. Below is the sample mapping: the key is the category and subcategory concatenated, the first value in the array must become a new column ActivityName, and the second value must be merged into the Properties column:
mapping= {"A2": ["EatingFood", { "Visible": True }],
"A3": ["DrinkingWater", { "Visible": False }],
"B2": ["Sleep", { "Visible": False }],
"C3": ["Exercise", { "Visible": False }],
"E4": ["Running", { "Visible": False }],
}
The output data I am expecting is:
resultingData = [{"Category": 'A', "Subcategory": 2, "ActivityName":"EatingFood", "Value": 121.44, "Properties": { "Visible": True }},
{"Category": 'B', "Subcategory": 2, "ActivityName":"Sleep", "Value": 300.01, "Properties": {"Visible": False}},
{"Category": 'C', "Subcategory": 3, "ActivityName":"Exercise", "Value": 10.99, "Properties": { "Active":True, "Visible": False } },
{"Category": 'E', "Subcategory": 4, "ActivityName":"Running", "Value": 33.87, "Properties": { "Active":True, "ReadOnly": False, "Visible": False }}
]
Note that the last data entry, E1, is missing because it was not in my mapping filter.
Is there any way to achieve this? I have a large list of items that I need to manually filter/map/transform like this. Thank you.
I got this working by converting the DynamicFrame into a DataFrame and processing it with PySpark functions. Here's what I did:
def FilterAndMap(glueContext, dfc) -> DynamicFrameCollection:
    from pyspark.sql.types import StringType, ArrayType
    from awsglue.dynamicframe import DynamicFrame
    import pyspark.sql.functions as f
    import json

    mapping = {"A2": ["EatingFood", json.dumps({"Visible": True})],
               "A3": ["DrinkingWater", json.dumps({"Visible": False})],
               "B2": ["Sleep", json.dumps({"Visible": False})],
               "C3": ["Exercise", json.dumps({"Visible": False})],
               "E4": ["Running", json.dumps({"Visible": False})],
               }

    df = dfc.select(list(dfc.keys())[0]).toDF()

    # Look up the [ActivityName, properties-JSON] pair for a concatenated key.
    def func_filter_udf(concat_str):
        return mapping[concat_str]

    # Merge the mapped properties JSON with the row's existing Properties.
    def func_map_udf(map_str):
        if map_str[1]:
            map_string = json.loads(map_str[0])
            ret_val = json.dumps({**map_string, **json.loads(map_str[1])})
        else:
            ret_val = map_str[0]
        return ret_val

    filter_udf = f.udf(func_filter_udf, ArrayType(StringType()))
    map_udf = f.udf(func_map_udf, StringType())

    # Keep only rows whose Category+Subcategory key appears in the mapping.
    df = df.filter(f.concat("Category", "Subcategory").isin([*mapping]))
    df = df.withColumn("concat_col", filter_udf(f.concat("Category", "Subcategory")))
    df = (df.withColumn("ActivityName", df.concat_col[0]).
          withColumn("Properties", map_udf(f.struct(df.concat_col[1], df.Properties))))
    df = df.drop("concat_col")

    dyf_processed = DynamicFrame.fromDF(df, glueContext, "filtered")
    return DynamicFrameCollection({"filtered": dyf_processed}, glueContext)
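As a side note, here is a UDF-free sketch of the same filter/lookup step; it is only an illustration, not the approach used above. It assumes the spark session and the sample df from the question, and the Properties merge is omitted for brevity:

import pyspark.sql.functions as f

# Express the static mapping as a small DataFrame and broadcast-join it
# instead of looking it up in a Python UDF per row.
mapping_rows = [("A2", "EatingFood"), ("A3", "DrinkingWater"),
                ("B2", "Sleep"), ("C3", "Exercise"), ("E4", "Running")]
mapping_df = spark.createDataFrame(mapping_rows, ["map_key", "ActivityName"])

result = (df
          .withColumn("map_key", f.concat(f.col("Category"), f.col("Subcategory").cast("string")))
          .join(f.broadcast(mapping_df), "map_key", "inner")  # inner join drops unmapped rows such as E1
          .drop("map_key"))
result.show(truncate=False)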

Exception in thread "main" org.apache.spark.SparkException: Detected yarn cluster mode, but isn't running on a cluster

I am trying to run a spark-submit job from Managed Workflows for Apache Airflow (MWAA). I am able to spin up the cluster but not able to trigger the Spark job.
Error:
Exception in thread "main" org.apache.spark.SparkException: Detected yarn cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.
Below is my config:
SPARK_STEPS = [  # Note the params values are supplied to the operator
    {
        "Name": "Run Spark Job",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "s3://eyx-dataplatform-staging/reporting/Airflow/Jars/Reporting_EMR1.jar",
            "MainClass": "com.eyxdp.spark.Test",
            "Args": ["spark-submit ",
                     "--packages",
                     "com.typesafe:config:1.3.3",
                     "--master",
                     "yarn-cluster",
                     "--deploy-mode",
                     "yarn",
                     "--class",
                     "com.eyxdp.spark.Test",
                     "--jars",
                     "s3://eyx-dataplatform-staging/reporting/Airflow/Jars/Reporting_EMR1.jar"
                     ],
        },
    },
]

JOB_FLOW_OVERRIDES = {
    "Name": "Spark Job Runner",
    "ReleaseLabel": "emr-5.36.0",
    "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],  # We want our EMR cluster to have HDFS and Spark
    "Configurations": [
        {
            "Classification": "spark-env",
            "Configurations": [
                {
                    "Classification": "export",
                    "Properties": {"PYSPARK_PYTHON": "/usr/bin/python3"},  # by default EMR uses py2, change it to py3
                }
            ],
        }
    ],
    "Instances": {
        "InstanceGroups": [
            {
                "Name": "Master node",
                "Market": "SPOT",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            },
            {
                "Name": "Core - 2",
                "Market": "SPOT",  # Spot instances are "use as available" instances
                "InstanceRole": "CORE",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 2,
            },
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,  # this lets us programmatically terminate the cluster
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

Calculate differences between consecutive Kafka messages in one Topic

I have some temperature sensors which generate Kafka messages to a Kafka topic (my-sensors-topic). The messages generally look like the ones below.
{"Offset": 7, "Id": 1, "Time": 1643718777898, "Value": 21}
{"Offset": 6, "Id": 1, "Time": 1643718768592, "Value": 20}
{"Offset": 5, "Id": 2, "Time": 1643718755443, "Value": 21}
{"Offset": 4, "Id": 3, "Time": 1643718746678, "Value": 21}
{"Offset": 3, "Id": 4, "Time": 1643718733408, "Value": 22}
{"Offset": 2, "Id": 2, "Time": 1643718709450, "Value": 20}
{"Offset": 1, "Id": 3, "Time": 1643718667375, "Value": 22}
{"Offset": 0, "Id": 1, "Time": 1643718386944, "Value": 19}
What I want to do, for a newly generated message like:
{"Offset": 8, "Id": 2, "Time": 1643719318393, "Value": 21}
Firstly, compare its "Time" (in milliseconds) with the last existing message that has the same Id. In this case:
{"Offset": 5, "Id": 2, "Time": 1643718755443, "Value": 21}
Because it is the last existing message with Id 2.
Secondly, I want to compute the "Time" difference between these two messages.
If the difference is greater than 60000, it counts as an error for this sensor, and I need to create a message recording the error and write it to another Kafka topic (my-sensors-error-topic).
The created message would perhaps look like:
{"Id": 2, "Time_lead": 1643719318393, "Time_lag": 1643718755443, "Letancy": 1562950}
//Latency is calculated by (Time_lead-Time_lag)
So later I can count the messages in my-sensors-error-topic by (sensor) Id and know how many errors occurred for each sensor.
From my own investigation, to implement this scenario I need to use the Kafka Processor API with a state store. Some examples mention implementing the Processor interface, while others mention using a Transformer.
Which way is better for my scenario, and how should I implement it?
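The Processor API itself is Java/Scala; as a language-neutral illustration of just the stateful logic (keep the last Time per Id, emit an error record when the gap exceeds 60000 ms), here is a small plain-Python sketch using a dict in place of the state store. The names are made up for illustration:

last_time_by_id = {}  # stands in for the per-key state store

def process(message, error_sink):
    """Compare a message's Time with the previous message that has the same Id."""
    sensor_id, time_ms = message["Id"], message["Time"]
    previous = last_time_by_id.get(sensor_id)
    if previous is not None and time_ms - previous > 60000:
        # In the real setup this record would be produced to my-sensors-error-topic.
        error_sink.append({
            "Id": sensor_id,
            "Time_lead": time_ms,
            "Time_lag": previous,
            "Latency": time_ms - previous,
        })
    last_time_by_id[sensor_id] = time_ms

errors = []
process({"Offset": 5, "Id": 2, "Time": 1643718755443, "Value": 21}, errors)
process({"Offset": 8, "Id": 2, "Time": 1643719318393, "Value": 21}, errors)
print(errors)  # [{'Id': 2, 'Time_lead': 1643719318393, 'Time_lag': 1643718755443, 'Latency': 562950}]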

Why doesn't this pymongo subdocument find work?

I'm looking at using MongoDB, and so far most things that I've tried work. But I don't know why this find doesn't work.
col = db.create_collection("test")
x = col.insert_many([
    {"item": "journal", "qty": 25, "size": {"h": 14, "w": 21, "uom": "cm"}, "status": "A"},
    {"item": "notebook", "qty": 50, "size": {"h": 8.5, "w": 11, "uom": "in"}, "status": "A"},
    {"item": "paper", "qty": 100, "size": {"h": 8.5, "w": 11, "uom": "in"}, "status": "D"},
    {"item": "planner", "qty": 75, "size": {"h": 22.85, "w": 30, "uom": "cm"}, "status": "D"},
    {"item": "postcard", "qty": 45, "size": {"h": 10, "w": 15.25, "uom": "cm"}, "status": "A"}
])
cursor = col.find({"size": {"h": 14, "w": 21, "uom": "cm"}})
if cursor.retrieved == 0:
    print("found nothing")  # <<<<<<<<< prints this
As explained in the docs, in the section Match an Embedded/Nested Document:
Equality matches on the whole embedded document require an exact match of the specified document, including the field order.
So, you have to pass the document to the find stage with its fields in the same order as they exist in the DB.
I really don't know whether keys inside objects follow a strict order (alphabetical or otherwise), but with this query the result comes out almost every time. Not always, though, so I think there is a "random" (or at least not controllable) aspect to how the data is stored, at least in the Mongo playground.
By the way, the correct way to ensure results is to use dot notation, so this query will always work:
coll.find({
    "size.h": 14,
    "size.w": 21,
    "size.uom": "cm"
})
I was thinking that cursor.retrieved would be non-zero if it found something. I guess not. I found that this works:
lst = list(cursor)
print(lst)
cursor.rewind()
print(list(cursor))
if len(lst) != 0:
    for d in lst:
        print(d)
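If the goal is only to check whether anything matched, counting on the server side is simpler than inspecting the cursor. A small sketch using the collection and fields from the example above:

n = col.count_documents({"size.h": 14, "size.w": 21, "size.uom": "cm"})
if n == 0:
    print("found nothing")
else:
    print(f"found {n} document(s)")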