So I recently started using Glue and PySpark for the first time. The task was to create a Glue job that does the following:
Load data from parquet files residing in an S3 bucket
Apply a filter to the data
Add a column, the value of which is derived from 2 other columns
Write the result to S3
Since the data is going from S3 to S3, I assumed that Glue DynamicFrames should be a decent fit for this, and I came up with the following code:
def AddColumn(r):
if r["option_type"] == 'S':
r["option_code_derived"]= 'S'+ r["option_code_4"]
elif r["option_type"] == 'P':
r["option_code_derived"]= 'F'+ r["option_code_4"][1:]
elif r["option_type"] == 'L':
r["option_code_derived"]= 'P'+ r["option_code_4"]
else:
r["option_code_derived"]= None
return r
glueContext = GlueContext(create_spark_context(role_arn=args['role_arn']))
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
inputGDF = glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": [source_path], "recurse" : True}, format = source_format, additional_options = {"useS3ListImplementation":True})
filtered_gdf = Filter.apply(frame = inputGDF, f = lambda x: x["my_filter_column"] in ['50','80'])
additional_column_gdf = Map.apply(frame = filtered_gdf, f = AddColumn)
gdf_mapped = ApplyMapping.apply(frame = additional_column_gdf, mappings = mappings, transformation_ctx = "gdf_mapped")
glueContext.purge_s3_path(full_target_path_purge, {"retentionPeriod": 0})
outputGDF = glueContext.write_dynamic_frame.from_options(frame = gdf_mapped, connection_type = "s3", connection_options = {"path": full_target_path}, format = target_format)
This works but takes a very long time (just short of 10 hours with 20 G1.X workers).
Now, the dataset is quite large (almost 2 billion records, over 400 GB), but this was still unexpected (to me at least).
Then I gave it another try, this time with PySpark DataFrames instead of DynamicFrames.
The code looks like the following:
glueContext = GlueContext(create_spark_context(role_arn=args['role_arn'], source_bucket=args['s3_source_bucket'], target_bucket=args['s3_target_bucket']))
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
df = spark.read.parquet(full_source_path)
df_filtered = df.filter( (df.model_key_status == '50') | (df.model_key_status == '80') )
df_derived = df_filtered.withColumn('option_code_derived',
when(df_filtered.option_type == "S", concat(lit('S'), df_filtered.option_code_4))
.when(df_filtered.option_type == "P", concat(lit('F'), df_filtered.option_code_4[2:42]))
.when(df_filtered.option_type == "L", concat(lit('P'), df_filtered.option_code_4))
.otherwise(None))
glueContext.purge_s3_path(full_purge_path, {"retentionPeriod": 0})
df_reorderered = df_derived.select(target_columns)
df_reorderered.write.parquet(full_target_path, mode="overwrite")
This also works, but with otherwise identical settings (20 workers of type G1.X, same dataset), this takes less than 20 minutes.
My question is: Where does this massive difference in performance between DynamicFrames and DataFrames come from? Was I doing something fundamentally wrong in the first try?
Related
I am using Pyspark in AWS cloud to extract the image features:
ImageSchema.imageFields
img2vec = F.udf(lambda x: DenseVector(ImageSchema.toNDArray(x).flatten()),
VectorUDT())
df_vec = df_cat.withColumn('original_vectors', img2vec("image"))
df_vec.show()
After having standardized the data:
standardizer = MinMaxScaler(inputCol="original_vectors",
outputCol="scaledFeatures",
min=-1.0,
max=1.0)
#withStd=True, withMean=True)
model_std = standardizer.fit(df_vec)
df_std = model_std.transform(df_vec)
df_std.show()
... when I apply PCA for dimension reduction, I receive an error that I could not debug for a couple of weeks :(
Error_1
Error_2
Could you please help me to solve that?
I use Pyspark spark-3.0.3-bin-hadoop2.7
img2vec = F.udf(lambda x : Vectors.dense(x), VectorUDT())
df = df.withColumn("data_as_vector", img2vec("data_as_resized_array"))
standardizer = StandardScaler(withMean=True, withStd=True, inputCol="data_as_vector", outputCol="scaledFeatures")
for image it needs to resize image data with this code and you must use the resized image data;
def resize_img(img_data, resize=True):
mode = 'RGBA' if (img_data.nChannels == 4) else 'RGB'
img = Image.frombytes(mode=mode, data=img_data.data, size=[img_data.width, img_data.height])
img = img.convert('RGB') if (mode == 'RGBA') else img
img = img.resize([224, 224], resample=Image.Resampling.BICUBIC) if (resize) else img
arr = convert_bgr_array_to_rgb_array(np.asarray(img))
arr = arr.reshape([224*224*3]) if (resize) else arr.reshape([img_data.width*img_data.height*3])
return arr
def resize_image_udf(dataframe_batch_iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
for dataframe_batch in dataframe_batch_iterator:
dataframe_batch["data_as_resized_array"] = dataframe_batch.apply(resize_img, args=(True,), axis=1)
dataframe_batch["data_as_array"] = dataframe_batch.apply(resize_img, args=(False,), axis=1)
yield dataframe_batch
resized_df = df_image.select("image.*").mapInPandas(resize_image_udf, schema)
then you can make standardscaler and PCA with;
model_std = standardizer.fit(df)
df = model_std.transform(df)
# algorithm
pca = PCA(k=n_components, inputCol='data_as_vector', outputCol='pcaFeatures')
model_pca = pca.fit(df)
# Transformation images
df = model_pca.transform(df)
i think, i am too late to answer your questions, sorry
Sorry, I need your help to make this manipulation in pyspark.
I have this dataframe
data = [['tom', True,False], ['nick', True,False], ['juli', False,True]]
df = pd.DataFrame(data, columns=['Name', 'Age','gender'])
and I want to have
data = [['tom', True,False,1], ['nick', True,False,1], ['juli', False,True,2]]
df = pd.DataFrame(data, columns=['Name', 'cond1','cond2',"stat"])
I mean if cond1 ==True, stat = 1 and if cond2 ==True, stat = 2
Thank in advance for your help
Hello StackOverflowers.
I have a pyspark dataframe that consists of a time_column and a column with values.
E.g.
+----------+--------------------+
| snapshot| values|
+----------+--------------------+
|2005-01-31| 0.19120256617637743|
|2005-01-31| 0.7972692479278891|
|2005-02-28|0.005236883665445502|
|2005-02-28| 0.5474099672222935|
|2005-02-28| 0.13077227571485905|
+----------+--------------------+
I would like to perform a KS test of each snapshot value with the previous one.
I tried to do it with a for loop.
import numpy as np
from scipy.stats import ks_2samp
import pyspark.sql.functions as F
def KS_for_one_snapshot(temp_df, snapshots_list, j, var = "values"):
sample1=temp_df.filter(F.col("snapshot")==snapshots_list[j])
sample2=temp_df.filter(F.col("snapshot")==snapshots_list[j-1]) # pick the last snapshot as the one to compare with
if (sample1.count() == 0 or sample2.count() == 0 ):
ks_value = -1 # previously "0 observations" which gave type error
else:
ks_value, p_value = ks_2samp( np.array(sample1.select(var).collect()).reshape(-1)
, np.array(sample2.select(var).collect()).reshape(-1)
, alternative="two-sided"
, mode="auto")
return ks_value
results = []
snapshots_list = df.select('snapshot').dropDuplicates().sort('snapshot').rdd.flatMap(lambda x: x).collect()
for j in range(len(snapshots_list) - 1 ):
results.append(KS_for_one_snapshot(df, snapshots_list, j+1))
results
But the data in reality is huge so it takes forever. I am using databricks and pyspark, so I wonder what would be a more efficient way to run it by avoiding the for loop and utilizing the available workers.
I tried to do it by using a udf but in vain.
Any ideas?
PS. you can generate the data with the following code.
from random import randint
df = (spark.createDataFrame( range(1,1000), T.IntegerType())
.withColumn('snapshot' ,F.array(F.lit("2005-01-31"), F.lit("2005-02-28"),F.lit("2005-03-30") ).getItem((F.rand()*3).cast("int")))
.withColumn('values', F.rand()).drop('value')
)
Update:
I tried the following by using an UDF.
var_used = 'values'
data_input_1 = df.groupBy('snapshot').agg(F.collect_list(var_used).alias('value_list'))
data_input_2 = df.groupBy('snapshot').agg(F.collect_list(var_used).alias("value_list_2"))
windowSpec = Window.orderBy("snapshot")
data_input_2 = data_input_2.withColumn('snapshot_2', F.lag("snapshot", 1).over(Window.orderBy('snapshot'))).filter('snapshot_2 is not NULL')
data_input_final = data_input_final = data_input_1.join(data_input_2, data_input_1.snapshot == data_input_2.snapshot_2)
def KS_one_snapshot_general(sample_in_list_1, sample_in_list_2):
if (len(sample_in_list_1) == 0 or len(sample_in_list_2) == 0 ):
ks_value = -1 # previously "0 observations" which gave type error
else:
print('something')
ks_value, p_value = ks_2samp( sample_in_list_1
, sample_in_list_2
, alternative="two-sided"
, mode="auto")
return ks_value
import pyspark.sql.types as T
KS_one_snapshot_general_udf = udf(KS_one_snapshot_general, T.FloatType())
data_input_final.select( KS_one_snapshot_general_udf('value_list', 'value_list_2')).display()
Which works fine if the dataset (per snapshot) is small. But If I increase the number of rows then I end up with an error.
PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
I'm trying to implement the Louvain algorihtm in pyspark using dataframes. The problem is that my implementation is reaaaally slow. This is how I do it:
I collect all vertices and communityIds into simple python lists
For each vertex - communityId pair I calculate the modularity gain using dataframes (just a fancy formula involving edge weights sums/differences)
Repeat untill no change
What am I doing wrong?
I suppose that if I could somehow parallelize the for each loop the performance would increase, but how can I do that?
LATER EDIT:
I could use vertices.foreach(changeCommunityId) instead of the for each loop, but then I'd have to compute the modularity gain (that fancy formula) without dataframes.
See the code sample below:
def louvain(self):
oldModularity = 0 # since intially each node represents a community
graph = self.graph
# retrieve graph vertices and edges dataframes
vertices = verticesDf = self.graph.vertices
aij = edgesDf = self.graph.edges
canOptimize = True
allCommunityIds = [row['communityId'] for row in verticesDf.select('communityId').distinct().collect()]
verticesIdsCommunityIds = [(row['id'], row['communityId']) for row in verticesDf.select('id', 'communityId').collect()]
allEdgesSum = self.graph.edges.groupBy().sum('weight').collect()
m = allEdgesSum[0]['sum(weight)']/2
def computeModularityGain(vertexId, newCommunityId):
# the sum of all weights of the edges within C
sourceNodesNewCommunity = vertices.join(aij, vertices.id == aij.src) \
.select('weight', 'src', 'communityId') \
.where(vertices.communityId == newCommunityId);
destinationNodesNewCommunity = vertices.join(aij, vertices.id == aij.dst) \
.select('weight', 'dst', 'communityId') \
.where(vertices.communityId == newCommunityId);
k_in = sourceNodesNewCommunity.join(destinationNodesNewCommunity, sourceNodesNewCommunity.communityId == destinationNodesNewCommunity.communityId) \
.count()
# the rest of the formula computation goes here, I just wanted to show you an example
# just return some value for the modularity
return 0.9
def changeCommunityId(vertexId, currentCommunityId):
maxModularityGain = 0
maxModularityGainCommunityId = None
for newCommunityId in allCommunityIds:
if (newCommunityId != currentCommunityId):
modularityGain = computeModularityGain(vertexId, newCommunityId)
if (modularityGain > maxModularityGain):
maxModularityGain = modularityGain
maxModularityGainCommunityId = newCommunityId
if (maxModularityGain > 0):
return maxModularityGainCommunityId
return currentCommunityId
while canOptimize:
while self.changeInModularity:
self.changeInModularity = False
for vertexCommunityIdPair in verticesIdsCommunityIds:
vertexId = vertexCommunityIdPair[0]
currentCommunityId = vertexCommunityIdPair[1]
newCommunityId = changeCommunityId(vertexId, currentCommunityId)
self.changeInModularity = False
canOptimize = False
I have a Spark script that establishes a connection to Hive and read Data from different databases and then writes the union into a CSV file. I tested it with two databases and it took 20 minutes. Now I am trying it with 11 databases and it has been running since yesterday evening (18 hours!). The script is supposed to get between 400000 and 800000 row per database.
My question is: is 18 hours normal for such jobs? If not, how can I optimize it? This is what my main does:
// This is a list of the ten first databases used:
var use_database_sigma = List( Parametre_vigiliste.sourceDbSigmaGca, Parametre_vigiliste.sourceDbSigmaGcm
,Parametre_vigiliste.sourceDbSigmaGge, Parametre_vigiliste.sourceDbSigmaGne
,Parametre_vigiliste.sourceDbSigmaGoc, Parametre_vigiliste.sourceDbSigmaGoi
,Parametre_vigiliste.sourceDbSigmaGra, Parametre_vigiliste.sourceDbSigmaGsu
,Parametre_vigiliste.sourceDbSigmaPvl, Parametre_vigiliste.sourceDbSigmaLbr)
val grc = Tables.getGRC(spark) // This creates the first dataframe
var sigma = Tables.getSIGMA(spark, use_database_sigma(0)) // This creates other dataframe which is the union of ten dataframes (one database each)
for(i <- 1 until use_database_sigma.length)
{
if (use_database_sigma(i) != "")
{
sigma = sigma.union(Tables.getSIGMA(spark, use_database_sigma(i)))
}
}
// writing into csv file
val grc_sigma=sigma.union(grc) // union of the 2 dataframes
grc_sigma.cache
LogDev.ecrireligne("total : " + grc_sigma.count())
grc_sigma.repartition(1).write.mode(SaveMode.Overwrite).format("csv").option("header", true).option("delimiter", "|").save(Parametre_vigiliste.cible)
val conf = new Configuration()
val fs = FileSystem.get(conf)
val file = fs.globStatus(new Path(Parametre_vigiliste.cible + "/part*"))(0).getPath().getName();
fs.rename(new Path(Parametre_vigiliste.cible + "/" + file), new Path(Parametre_vigiliste.cible + "/" + "FIC_PER_DATALAKE_.csv"));
grc_sigma.unpersist()
Not written in an IDE so it might be off somewhere, but you get the general idea.
val frames = Seq("table1", "table2).map{ table =>
spark.read.table(table).cache()
}
frames
.reduce(_.union(_)) //or unionByName() if the columns aren't in the same order
.repartition(1)
.write
.mode(SaveMode.Overwrite)
.format("csv")
.options(Map("header" -> "true", "delimiter" -> "|"))
.save("filePathName")