Why could a modest Spark broadcast variable make a job hang?

Here's a code snippet (problem description below):
val broadcastVarbannedDNS = sc.broadcast(filterList)
val INPUT = hc.table(s"tableName")
  .where(s"DS BETWEEN 2016120100 AND 2016120100")
  .rdd
  .filter(x => !broadcastVarbannedDNS.value.map(str => x.getString(2).contains(str)).contains(true))
INPUT.count()
filterList is a CSV with 200k+ lines, which comes out to about 9 MB.
When I run with the whole filter list, the job hangs at the filter stage with no exception or hint of what the problem is, and nothing in the logs either. However, when I cut the filterList file down to a few hundred lines, it runs like a hot knife through butter. One could immediately conclude that the "bigger" file is the issue, although 9 MB is minuscule compared to something that should be able to handle GBs of cache. Any help is appreciated.

I would imagine the issue is that if you have N records in the RDD and M lines in the filter, then your processing is O(M*N), as you are checking each record against each filter line.
This means that if processing the N records against a single filter line takes 1 second and you have 200K lines, then you are looking at roughly 200K seconds, which is indistinguishable from hanging.
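One incremental improvement, sketched against the snippet in the question (same identifiers, only the filter body changed): exists short-circuits on the first matching entry, whereas map(...).contains(true) builds and scans the full 200K+ element result for every record.

val banned = sc.broadcast(filterList)
val kept = hc.table(s"tableName")
  .where(s"DS BETWEEN 2016120100 AND 2016120100")
  .rdd
  .filter { row =>
    // stop scanning the banned list as soon as one entry matches
    val col = row.getString(2)
    !banned.value.exists(str => col.contains(str))
  }
kept.count()

This keeps the worst case at O(M) per record when nothing matches, but records that do hit a banned entry are rejected without scanning the rest of the list.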


How to capture the last record in a file

I have a requirement to split a sequential file into 3 parts: Header, Data, Trailer. I have the Header and Data worked out.
Is there a way, in a Transformer, to determine whether you have the last record in a sequential file? I tried using LastRow(), but that gives me the last row for each node. I need to keep parallel execution on.
Thanks in advance for any help.
You have no a priori knowledge about which node the trailer row will come through on. There is therefore no solution in a Transformer stage if you want to retain parallel execution.
One way to do it is to have a reject link on the Sequential File stage. This will capture any row that does not match the defined metadata. Set up the stage with the metadata for your Data rows, then the Header and Trailer will be captured onto the reject link. It should be pretty obvious from their data which is which, and you can process them further and perhaps even rejoin them to your Data rows.
You could also capture the last row separately (e.g. via tail -1 filename) and compare it against every row processed to determine whether it is the last one. Computationally heavy for very little gain.

Massive time increase when segmenting column and writing parquet file

I work with clinical data, so I apologize that I can't display any output, as it is HIPAA regulated, but I'll do my best to fill in any gaps.
I am a recent graduate in data science, and I never really spent much time working with any Spark system, but I do now in my new role. We are working on collecting output from a function that I will call udf_function, which takes a clinical note (report) from a physician and returns output defined by the Python function call_function. Here is the code that I use to complete this task:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

def call_function(report):
    # Python code that generates the lists a, b, c, which I
    # join together to return strings of the combined list items
    a = ",".join(a)
    b = ",".join(b)
    c = ",".join(c)
    return [a, b, c]

udf_function = udf(lambda y: call_function(y), ArrayType(StringType()))

mid_frame = df.select('report',
                      udf_function('report').alias('udf_output'))
This returns an array of length 3, which contains strings about the information returned from the function. On a selection of 25,000 records, I was able to complete the run on a 30 node cluster on GCP (20 workers, 10 preemptive) in just a little over 3 hours the other day.
I changed my code a bit to parse out the three objects from the array, as the three objects contain different types of information that we want to analyze further, which I'll call a, b, c (again, sorry if this is vague; I'm trying to keep the actual data as surface level as possible). The previous 3-hour run didn't write out any files, as I was only testing how long the system would take.
output_frame = mid_frame.select('report',
                                mid_frame['udf_output'].getItem(0).alias('a'),
                                mid_frame['udf_output'].getItem(1).alias('b'),
                                mid_frame['udf_output'].getItem(2).alias('c'))
output_frame.show()
output_frame.write.parquet(data_bucket)
This task of parsing the output and writing the files took an additional 48 hours. I think I could stomach the lost time if I were dealing with HUGE files, but the output is 4 Parquet files which come out to 24.06 MB total. Looking at the job logs, the writing process itself took just about 20 hours.
Obviously I have introduced some extreme inefficiency, but since I'm new to this system and style of work, I'm not sure where I have gone awry.
Thank you to all that can offer some advice or guidance on this!
EDIT
Here is an example of what report might be and what the return would be from the function
This is a sentence I wrote myself, and thus, is not pulled from any real record
report = 'The patient showed up to the hospital, presenting with a heart attack and diabetes'
# code
return ['heart attack, diabetes', 'myocardial infarction, diabetes mellitus', 'X88989,B898232']
where the first item is any actual string in the sentence that is tagged by the code, the second item is the professional medical equivalent, and the third item is simply a code that helps us find the diagnosis hierarchy among other codes.
If you only end up with 4 Parquet output files, that says you have too few partitions; try repartitioning before you write out. For example:
output_frame= output_frame.repartition(500)
output_frame.write.parquet(data_bucket)

Duplicate jobs are being generated in DAG for the same action in Spark

I have a spark-streaming job in which I receive data from a message queue and process a bunch of records. In the process, I call take() on a dataset. Although the take action happens as expected, in the DAG visualization I see multiple job IDs created, and all of them have the same take action. This happens only when the data is on the order of hundreds of thousands of records. I didn't observe redundant jobs while running with tens of records on my local machine. Can anyone help me understand the reasoning behind this behavior?
The job IDs (91 to 95) are basically running the same action. The following is the code snippet corresponding to the action mentioned above:
val corruptedMessageArray: Array[String] = corruptedMessageDs.take(1)
if (!corruptedMessageArray.isEmpty) {
  val firstCorruptedMessage: String = corruptedMessageArray(0)
}
Your question seems to be whether duplicate jobs are created by Spark.
If you look at the screenshot, you will see that the jobs have a different number of tasks, hence it is not a simple matter of duplication.
I am not sure exactly what is happening, but it seems that for large datasets take() needs several quick successive jobs: it scans a few partitions first and, if it has not collected enough rows, launches follow-up jobs over progressively more partitions until it has what it needs.
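If those extra scans are expensive, one possible tweak (a sketch, assuming corruptedMessageDs is costly to recompute) is to persist the dataset before calling take(), so the follow-up jobs read cached partitions instead of recomputing the lineage:

// Hypothetical tweak: persist so the several internal jobs launched by
// take() reuse cached partitions rather than recomputing the lineage.
corruptedMessageDs.persist()
val corruptedMessageArray: Array[String] = corruptedMessageDs.take(1)
corruptedMessageDs.unpersist()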

How to keep temporary output files in Spark

I'm writing a map-only Spark SQL job which looks like:
val lines = sc.textFile(inputPath)
val df = lines.map { line => ... }.toDF("col0", "col1")
df.write.parquet(output)
As the job takes quite a long time to compute, I would like to save and keep the results of the tasks which successfully terminated, even if the overall job fails or gets killed.
I noticed that, during the computation, in the output directory some temporary files are created.
I inspected them and noticed that, since my job has only a mapper, what is saved there is the output of the successful tasks.
The problem is that the job failed and I couldn't analyse what it could compute because the temp files were deleted.
Does anyone have some idea how to deal with this situation?
Cheers!
Change the output committer to DirectParquetOutputCommitter.
sc.hadoopConfiguration.set("spark.sql.parquet.output.committer.class", "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
Note that if you've turned on speculative execution, then you have to turn it off to use a direct output committer.
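Putting the two together, here is a minimal standalone sketch, assuming a Spark 1.x setup where DirectParquetOutputCommitter still exists; spark.speculation is the standard flag controlling speculative execution, and the app name is hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

// disable speculative execution (unsafe with a direct committer), then set the committer
val conf = new SparkConf()
  .setAppName("map-only-parquet-job")  // hypothetical app name
  .set("spark.speculation", "false")
val sc = new SparkContext(conf)
sc.hadoopConfiguration.set(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")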

Spark read files recursively from all sub folders with same name

I have a process that pushes a bunch of data to the Blob store every hour, creating the following folder structure inside my storage container:
/year=16/Month=03/Day=17/Hour=16/mydata.csv
/year=16/Month=03/Day=17/Hour=17/mydata.csv
and so on
From inside my Spark context I want to access all the mydata.csv files and process them. I figured out that I needed to set sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true") so that I can use a recursive search like below:
val csvFile2 = sc.textFile("wasb://mycontainer#mystorage.blob.core.windows.net/*/*/*/mydata.csv")
but when I execute the following command to see how many files I have received, it gives me a really large number, as shown below:
csvFile2.count
res41: Long = 106715282
Ideally it should return 24*16 = 384. I also verified on the container that it only has 384 mydata.csv files, but for some reason I see it returning 106715282.
Can someone please help me understand where I went wrong?
Regards
Kiran
SparkContext has two similar methods: textFile and wholeTextFiles.
textFile loads each line of each file as a record in the RDD. So count() will return the total number of lines across all of the files (which in most cases, such as yours, will be a large number).
wholeTextFiles loads each entire file as a record in the RDD. So count() will return the total number of files (384 in your case).
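As a quick sketch using the same path as in the question, switching to wholeTextFiles gives one record per file, so count() returns the number of files rather than the number of lines:

// one (path, content) pair per file instead of one record per line
val csvFiles = sc.wholeTextFiles("wasb://mycontainer#mystorage.blob.core.windows.net/*/*/*/mydata.csv")
csvFiles.count()  // expected: 384, one per mydata.csv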