I have a Spark script that connects to Hive, reads data from several databases, and then writes the union into a CSV file. I tested it with two databases and it took 20 minutes. Now I am trying it with 11 databases and it has been running since yesterday evening (18 hours!). The script is supposed to fetch between 400,000 and 800,000 rows per database.
My question is: is 18 hours normal for such a job? If not, how can I optimize it? This is what my main does:
// This is the list of the first ten databases used:
var use_database_sigma = List( Parametre_vigiliste.sourceDbSigmaGca, Parametre_vigiliste.sourceDbSigmaGcm
,Parametre_vigiliste.sourceDbSigmaGge, Parametre_vigiliste.sourceDbSigmaGne
,Parametre_vigiliste.sourceDbSigmaGoc, Parametre_vigiliste.sourceDbSigmaGoi
,Parametre_vigiliste.sourceDbSigmaGra, Parametre_vigiliste.sourceDbSigmaGsu
,Parametre_vigiliste.sourceDbSigmaPvl, Parametre_vigiliste.sourceDbSigmaLbr)
val grc = Tables.getGRC(spark) // This creates the first dataframe
var sigma = Tables.getSIGMA(spark, use_database_sigma(0)) // This creates the second dataframe from the first database; the loop below unions in the remaining databases (one dataframe each)
for (i <- 1 until use_database_sigma.length) {
  if (use_database_sigma(i) != "") {
    sigma = sigma.union(Tables.getSIGMA(spark, use_database_sigma(i)))
  }
}
// Write the union into a CSV file
val grc_sigma = sigma.union(grc) // union of the two dataframes
grc_sigma.cache
LogDev.ecrireligne("total : " + grc_sigma.count())
grc_sigma.repartition(1).write.mode(SaveMode.Overwrite).format("csv").option("header", true).option("delimiter", "|").save(Parametre_vigiliste.cible)
val conf = new Configuration()
val fs = FileSystem.get(conf)
val file = fs.globStatus(new Path(Parametre_vigiliste.cible + "/part*"))(0).getPath().getName();
fs.rename(new Path(Parametre_vigiliste.cible + "/" + file), new Path(Parametre_vigiliste.cible + "/" + "FIC_PER_DATALAKE_.csv"));
grc_sigma.unpersist()
Not written in an IDE so it might be off somewhere, but you get the general idea.
val frames = Seq("table1", "table2").map { table =>
spark.read.table(table).cache()
}
frames
.reduce(_.union(_)) //or unionByName() if the columns aren't in the same order
.repartition(1)
.write
.mode(SaveMode.Overwrite)
.format("csv")
.options(Map("header" -> "true", "delimiter" -> "|"))
.save("filePathName")
So I recently started using Glue and PySpark for the first time. The task was to create a Glue job that does the following:
Load data from parquet files residing in an S3 bucket
Apply a filter to the data
Add a column, the value of which is derived from 2 other columns
Write the result to S3
Since the data is going from S3 to S3, I assumed that Glue DynamicFrames should be a decent fit for this, and I came up with the following code:
def AddColumn(r):
    if r["option_type"] == 'S':
        r["option_code_derived"] = 'S' + r["option_code_4"]
    elif r["option_type"] == 'P':
        r["option_code_derived"] = 'F' + r["option_code_4"][1:]
    elif r["option_type"] == 'L':
        r["option_code_derived"] = 'P' + r["option_code_4"]
    else:
        r["option_code_derived"] = None
    return r
glueContext = GlueContext(create_spark_context(role_arn=args['role_arn']))
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
inputGDF = glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": [source_path], "recurse" : True}, format = source_format, additional_options = {"useS3ListImplementation":True})
filtered_gdf = Filter.apply(frame = inputGDF, f = lambda x: x["my_filter_column"] in ['50','80'])
additional_column_gdf = Map.apply(frame = filtered_gdf, f = AddColumn)
gdf_mapped = ApplyMapping.apply(frame = additional_column_gdf, mappings = mappings, transformation_ctx = "gdf_mapped")
glueContext.purge_s3_path(full_target_path_purge, {"retentionPeriod": 0})
outputGDF = glueContext.write_dynamic_frame.from_options(frame = gdf_mapped, connection_type = "s3", connection_options = {"path": full_target_path}, format = target_format)
This works but takes a very long time (just short of 10 hours with 20 G1.X workers).
Now, the dataset is quite large (almost 2 billion records, over 400 GB), but this was still unexpected (to me at least).
Then I gave it another try, this time with PySpark DataFrames instead of DynamicFrames.
The code looks like the following:
glueContext = GlueContext(create_spark_context(role_arn=args['role_arn'], source_bucket=args['s3_source_bucket'], target_bucket=args['s3_target_bucket']))
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
df = spark.read.parquet(full_source_path)
df_filtered = df.filter( (df.model_key_status == '50') | (df.model_key_status == '80') )
df_derived = df_filtered.withColumn('option_code_derived',
when(df_filtered.option_type == "S", concat(lit('S'), df_filtered.option_code_4))
.when(df_filtered.option_type == "P", concat(lit('F'), df_filtered.option_code_4[2:42]))
.when(df_filtered.option_type == "L", concat(lit('P'), df_filtered.option_code_4))
.otherwise(None))
glueContext.purge_s3_path(full_purge_path, {"retentionPeriod": 0})
df_reorderered = df_derived.select(target_columns)
df_reorderered.write.parquet(full_target_path, mode="overwrite")
This also works, but with otherwise identical settings (20 workers of type G1.X, same dataset), this takes less than 20 minutes.
My question is: Where does this massive difference in performance between DynamicFrames and DataFrames come from? Was I doing something fundamentally wrong in the first try?
I am reading files one by one from a directory structure laid out as YY=18/MM=12/DD=10, and I need to read only the current date minus 60 days. Files are created every day, but on some days a file may not be created, in which case the folder for that day does not exist either.
I am writing the code below, but it is not working.
var datecalculate = {
  var days = 0
  do {
    val start = DateTime.now
    var start1 = DateTime.now.minusDays(days)
    days = days + 1
    var start2 = start1.toString
    datecalculatenow(start2)
  } while (days <= 90)
}
def datecalculatenow(start2: String): String = {
  var YY: String = start2.toString.substring(0, 4)
  var MM: String = start2.toString.substring(5, 7)
  var DD: String = start2.toString.substring(8, 10)
  var datepath = "YYYY=" + YY + "/MM=" + MM + "/DD=" + DD
  var datepath1 = datepath.toString
  org.apache.spark.sql.SparkSession.read.option("delimiter", "|").
    option("header", "true").option("inferSchema", "true").
    csv("/Table/Files" + datepath1)
}
I expect to read all files from the current date minus 60 days in the directory structure YY/MM/DD.
With Spark SQL you can use the following in a select statement to subtract 90 days:
date_sub(CAST(current_timestamp() as DATE), 90)
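For example (a minimal sketch; my_table and event_date are placeholder names, not from your post), it can be used directly in a filter:
val recent = spark.sql(
  """SELECT *
    |FROM my_table
    |WHERE event_date >= date_sub(CAST(current_timestamp() AS DATE), 90)""".stripMargin)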
Since it's possible to generate a dataframe from a list of paths, why not first generate the list of paths? Here is a simple and concise way to read data from multiple paths:
val paths = (0 until 90).map { days =>
  val tmpDate = DateTime.now.minusDays(days).toString
  val year = tmpDate.substring(0, 4)
  val month = tmpDate.substring(5, 7)
  val opdate = tmpDate.substring(8, 10)
  s"basepath/YY=$year/MM=$month/DD=$opdate"
}.toList
val df = spark.read
  .option("delimiter", "|")
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(paths: _*)
While generating the paths, you can filter out the paths that do not exist. I've reused some of your code with a few modifications. I have not tested it in my local setup, but the idea is the same. Hopefully it will help you.
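For the existence check, a minimal sketch (untested, assuming the paths list built above and an active spark session) could use the Hadoop FileSystem API:
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val existingPaths = paths.filter(p => fs.exists(new Path(p))) // drop days whose folder was never created

val df = spark.read
  .option("delimiter", "|")
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(existingPaths: _*)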
I'm trying to learn Spark, so don't judge harshly. I have the following problem. I can run basic Spark examples like this one:
import os
os.environ['PYSPARK_PYTHON'] = '/g/scb/patil/andrejev/python36/bin/python3'
import random
from pyspark import SparkConf, SparkContext
from pyspark.sql.types import *
from pyspark.sql import *
sc.stop()
conf = SparkConf().setAppName('').setMaster('spark://remotehost:7789').setSparkHome('/path/to/spark-2.3.0-bin-hadoop2.7/')
sc = SparkContext(conf=conf)
num_samples = 100
def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
but when I create a data frame I get an error that _jsc is NULL:
eDF = sqlContext.createDataFrame([Row(a=1, intlist=[1,2,3], mapfield={"a": "b"})])
/usr/local/spark/python/pyspark/traceback_utils.py in __enter__(self)
70 def __enter__(self):
71 if SCCallSiteSync._spark_stack_depth == 0:
---> 72 self._context._jsc.setCallSite(self._call_site)
73 SCCallSiteSync._spark_stack_depth += 1
Here are the environment variables that are set on the local machine:
SPARK_HOME: '/usr/local/spark/'
PYSPARK_DRIVER_PYTHON: '/usr/bin/python3'
PYSPARK_DRIVER_PYTHON_OPTS: 'notebook'
PYSPARK_PYTHON: '/g/scb/patil/andrejev/python36/bin/python3'
PATH: '...:/usr/lib/jvm/java-8-oracle/jre/bin:/usr/local/hadoop/bin:/usr/local/hadoop/sbin:/usr/local/spark/bin'
and on the remote machine:
PYSPARK_PYTHON=/g/scb/patil/andrejev/python36/bin/python3
PYSPARK_DIRVER_PYTHON=/g/scb/patil/andrejev/python36/bin/python3
In the end I figured out that I had two sessions (one default and one that I created) running at the same time. I ended up explicitly using my session to create the DataFrame.
sess = SparkSession(sc)
freq_signal = sess.createDataFrame([Row(a=1, intlist=[1,2,3], mapfield={"a": "b"})])
I'm using Spark 2.2.1 with Scala 2.11.12 to implement a recursive algorithm. First, I tried an implementation using RDDs, but it took too long when I used a lot of data. I have made a new version using DataFrames, but it takes too long even with very little data, even though each iteration handles less data than the previous one.
I have tried caching variables in different ways (including different persistence levels), using checkpoints at different moments, and using the repartition method with different values and in different functions, and nothing works.
The code starts by looking for the minimum distance between the points that make up the matrix (matrix is a DataFrame):
println("Finding minimum:")
val minDistRes = matrix.select(min("dist")).first().getFloat(0)
val clusterRes = matrix.where($"dist" === minDistRes)
println(s"New minimum:")
clusterRes.show(1)
Then, I save the coordinates of the points for later calculations:
val point1 = clusterRes.first().getInt(0)
val point2 = clusterRes.first().getInt(1)
After that, I make several filters to use for the new points generated in the next iteration (creating a broadcast variable is necessary to be able to access this data in a later map):
matrix = matrix.where("!(idW1 == " + point1 +" and idW2 ==" + point2 + " )").cache()
val dfPoints1 = matrix.where("idW1 == " + point1 + " or idW2 == " + point1).cache()
val dfPoints2 = matrix.where("idW1 == " + point2 + " or idW2 == " + point2).cache()
val dfPoints2Broadcast = spark.sparkContext.broadcast(dfPoints2)
val dfUnionPoints = dfPoints1.union(dfPoints2).cache()
val matrixSub = matrix.except(dfUnionPoints).cache()
I continue with the calculation of the new points and return the matrix that will be used recursively by the algorithm:
val newPoints = dfPoints1.map { r =>
  val distAux = dfPoints2Broadcast.value.where("idW1 == " + r.getInt(0) +
    " or idW1 == " + r.getInt(1) + " or idW2 == " + r.getInt(0) + " or idW2 == " +
    r.getInt(1)).first().getFloat(2)
  (newIndex.toInt, filterDF(r.getInt(0), r.getInt(1), point1, point2), math.min(r.getFloat(2), distAux))
}.asInstanceOf[Dataset[Row]]
matrix = matrixSub.union(newPoints)
I finish each iteration by caching the matrix variable and performing a checkpoint every so often:
matrix.cache()
if (a % 5 == 0)
matrix.checkpoint()
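As a side note on the checkpoint above (a sketch only, assuming a is the iteration counter of the loop): Dataset.checkpoint() returns a new, checkpointed Dataset rather than modifying the one it is called on, and it requires a checkpoint directory to be set once up front, so the usual pattern looks like this:
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints") // placeholder path, set once at startup

matrix = matrix.cache()
if (a % 5 == 0) {
  matrix = matrix.checkpoint() // reassign so the truncated lineage is actually used
}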
I am successfully loading files into Spark from S3 with the following code. It works, however I notice that there is a delay between one file and the next, and they are loaded sequentially. I would like to improve this by loading them in parallel.
// Load files that were loaded into firehose on this day
var s3Files = spark.sqlContext.read.schema(schema).json("s3n://" + job.awsaccessKey + ":" + job.awssecretKey + "#" + job.bucketName + "/" + job.awss3RawFileExpression + "/" + year + "/" + monthCheck + "/" + dayCheck + "/*/").rdd
// Apply the schema to the RDD, here we will have duplicates
val usersDataFrame = spark.createDataFrame(s3Files, schema)
usersDataFrame.createOrReplaceTempView("results")
// Clean and use partition by the keys to eliminate duplicates and get latest record
var results = spark.sql(buildCleaningQuery(job, "results"))
results.createOrReplaceTempView("filteredResults")
val records = spark.sql("select count(*) from filteredResults")
I have also tried loading through the textFile() method, however I then have problems converting RDD[String] to RDD[Row], because afterwards I need to use Spark SQL. I am using it in the following manner:
var s3Files = sparkContext.textFile("s3n://" + job.awsaccessKey + ":" + job.awssecretKey + "#" + job.bucketName + "/" + job.awss3RawFileExpression + "/" + year + "/" + monthCheck + "/" + dayCheck + "/*/").toJavaRDD()
What is the ideal way to load JSON files (multiple files of around 50 MB each) into Spark? I would like to validate the properties against a schema, so that I can later run Spark SQL queries to clean the data.
What's going on is that the DataFrame is being converted into an RDD and then back into a DataFrame again, which loses the partitioning information.
var s3Files = spark
.sqlContext
.read.schema(schema)
.json(...)
.createOrReplaceTempView("results")
should be sufficient, and the partitioning information should still be present, allowing the JSON files to be loaded concurrently.
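From there, the rest of the pipeline should work unchanged against the temp view (a sketch, reusing the buildCleaningQuery helper and job object from your question):
val results = spark.sql(buildCleaningQuery(job, "results"))
results.createOrReplaceTempView("filteredResults")
val records = spark.sql("select count(*) from filteredResults")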