Can SparkContext.setCheckpointDir(hdfsPath) set the same hdfsPath in different Spark apps? - scala

From the docs at:
https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/SparkContext.html#setCheckpointDir-java.lang.String-
SparkContext:
setCheckpointDir
public void setCheckpointDir(String directory)
Set the directory under which RDDs are going to be checkpointed.
Parameters:
directory - path to the directory where checkpoint files will be stored (must be HDFS path if running in cluster)
Questions:
1) If different Spark apps call SparkContext.setCheckpointDir(hdfsPath) with the same hdfsPath, is there any conflict?
2) If there is no conflict, will the hdfsPath for the checkpoint dir be cleaned up automatically?

Questions:
1) If different Spark apps call SparkContext.setCheckpointDir(hdfsPath) with the same hdfsPath, is there any conflict?
Answer: No conflict, as per the example below. Multiple applications can use the same checkpoint directory; under it, a unique hash-like subfolder is created for each application to avoid conflicts.
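For reference, this matches what setCheckpointDir does internally: it appends a randomly generated UUID subdirectory per application. A minimal sketch of the idea (not the actual Spark source; the shared root path here is hypothetical):

import java.util.UUID
import org.apache.hadoop.fs.Path

// Sketch: each application derives its own UUID subdirectory under the shared
// checkpoint root, so two apps never write into the same folder.
val sharedRoot = "hdfs:///tmp/checkpoints" // hypothetical shared path
val perAppDir  = new Path(sharedRoot, UUID.randomUUID().toString)
println(perAppDir) // e.g. hdfs:/tmp/checkpoints/30e6f882-b49a-42cc-9e60-59adecf13166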
2) If there is no conflict, will the hdfsPath for the checkpoint dir be cleaned up automatically?
Answer: Yes, it happens. For the example below I used a local path for demonstration, but local or HDFS doesn't matter; the behaviour will be the same.
Let's go through an example (run multiple times with the same checkpoint directory):
package examples

import java.io.File

import org.apache.log4j.Level
import org.apache.spark.sql.{Dataset, SparkSession}

object CheckPointTest extends App {

  val spark = SparkSession.builder().appName("CheckPointTest").master("local").getOrCreate()
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  import spark.implicits._

  spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

  val csvData1: Dataset[String] = spark.sparkContext.parallelize(
    """
      |id
      | a
      | b
      | c
    """.stripMargin.lines.toList).toDS()

  // show returns Unit, so there is no point binding its result to a val
  spark.read.option("header", true).option("inferSchema", true).csv(csvData1).show

  val checkpointDir = spark.sparkContext.getCheckpointDir.get
  println(checkpointDir)
  println("Number of Files in Check Point Directory " + getListOfFiles(checkpointDir).length)

  def getListOfFiles(dir: String): List[File] = {
    val d = new File(dir)
    if (d.exists && d.isDirectory) {
      d.listFiles.filter(_.isFile).toList
    } else {
      List[File]()
    }
  }
}
Result :
+---+
| id|
+---+
| a|
| b|
| c|
+---+
file:/tmp/checkpoints/30e6f882-b49a-42cc-9e60-59adecf13166
Number of Files in Check Point Directory 0 // this indicates that once the application finished, all the RDD/DS information was removed
If you have a look at the checkpoint folder, it will look like this:
user@f0189843ecbe [~/Downloads]$ ll /tmp/checkpoints/
total 0
drwxr-xr-x 2 user wheel 64 Mar 27 14:08 a2396c08-14b6-418a-b183-a90a4ca7dba3
drwxr-xr-x 2 user wheel 64 Mar 27 14:09 65c8ef5a-0e64-4e79-a050-7d1ee1d0e03d
drwxr-xr-x 2 user wheel 64 Mar 27 14:09 5667758c-180f-4c0b-8b3c-912afca59f55
drwxr-xr-x 2 user wheel 64 Mar 27 14:10 30e6f882-b49a-42cc-9e60-59adecf13166
drwxr-xr-x 6 user wheel 192 Mar 27 14:10 .
drwxrwxrwt 5 root wheel 160 Mar 27 14:10 ..
user@f0189843ecbe [~/Downloads]$ du -h /tmp/checkpoints/
0B /tmp/checkpoints//a2396c08-14b6-418a-b183-a90a4ca7dba3
0B /tmp/checkpoints//5667758c-180f-4c0b-8b3c-912afca59f55
0B /tmp/checkpoints//65c8ef5a-0e64-4e79-a050-7d1ee1d0e03d
0B /tmp/checkpoints//30e6f882-b49a-42cc-9e60-59adecf13166
0B /tmp/checkpoints/
Conclusion:
1) Even when multiple applications run in parallel, there will be a unique hash-like subfolder under the checkpoint directory for each of them, and all the RDD/DS information will be stored there.
2) After successful execution of each Spark application, the context cleaner will remove its contents; that is what I observed from the practical example above.
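If you also want Spark to delete checkpoint files while the application is still running, once the checkpointed RDD reference goes out of scope, there is a flag for the context cleaner; a minimal sketch, assuming the standard spark.cleaner.referenceTracking.cleanCheckpoints property (false by default):

import org.apache.spark.sql.SparkSession

// Sketch: opt in to the ContextCleaner removing checkpoint files for RDDs
// whose references have gone out of scope (this is off by default).
val spark = SparkSession.builder()
  .appName("CheckPointTest")
  .master("local")
  .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
  .getOrCreate()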


I am trying to call a libpcap dynamic lib using dart ffi; it works fine without Flutter. However, it reports "cannot open BPF device" with Flutter

I wrote a C function to call libpcap and compiled it to a .dylib:
#include "capture.h"
#include <stdio.h>
#include <pcap.h>
int run_capture(const char * dev) {
printf("start capture\n");
pcap_t* pcap;
const u_char* pkt;
struct pcap_pkthdr ph;
char ebuf[PCAP_ERRBUF_SIZE];
pcap = pcap_open_live(dev, 65535, 0, 0, ebuf);
if (!pcap) {
fprintf(stderr, "%s\n", ebuf);
return -1;
}
pcap_dumper_t *dumper = pcap_dump_open(pcap, "./cap.pcap");
if (pcap_loop(pcap, 2000, &pcap_dump, (u_char *)dumper) < 0) {
/*
* Print out appropriate text, followed by the error message
* generated by the packet capture library.
*/
printf("Error reading packets from interface %s", "en0");
return 8;
}
pcap_dump_flush(dumper);
pcap_dump_close(dumper);
pcap_close(pcap);
printf("stoped capture\n");
return 0;
}
dart ffi binding code:
import 'dart:ffi' as ffi;
import 'package:ffi/ffi.dart';
typedef run_capture_native_t = ffi.Int Function(ffi.Pointer<Utf8> dev);
typedef run_capture_t = int Function(ffi.Pointer<Utf8> dev);
final dylib = ffi.DynamicLibrary.open('libpacket_worker.dylib');
final runCapture = dylib.lookupFunction<run_capture_native_t, run_capture_t>('run_capture');
main.dart code
import 'dart:ffi' as ffi;
import 'dart:isolate';
import 'libpcap/libpcap_bindings.dart';
import 'packet_worker/packet_worker_bindings.dart';
import 'dart:io' show Platform, Directory;
import 'package:ffi/ffi.dart';
// Define the function type for inet_ntoa()
typedef inet_ntoa_native_t = ffi.Pointer<Utf8> Function(ffi.Pointer<in_addr>);
typedef inet_ntoa_t = ffi.Pointer<Utf8> Function(ffi.Pointer<in_addr>);
int main(List<String> args) {
  final recvPort = ReceivePort();
  final worker = Isolate.spawn((message) {
    runCapture("en0".toNativeUtf8());
  }, recvPort.sendPort);
  print("Isolate spawnd.");
  return 0;
}
It works fine and outputs a .pcap file:
lake@LakedeMBP dart_pcap_test % dart compile exe main.dart -o sniff
Info: Compiling with sound null safety
Generated: /Users/lake/Projects/dart_pcap_test/sniff
lake@LakedeMBP dart_pcap_test % ./sniff
Isolate spawnd.
start capture
stoped capture
^C
lake@LakedeMBP dart_pcap_test % ls
cap.pcap libpcap packet_worker pubspec.yaml
libpacket_worker.dylib main.dart pubspec.lock sniff
lake@LakedeMBP dart_pcap_test % ls -al
total 11816
-rw-r--r-- 1 lake staff 886300 Jan 22 01:58 cap.pcap
-rwxr-xr-x 1 lake staff 34035 Jan 22 01:12 libpacket_worker.dylib
drwxr-xr-x 11 lake staff 352 Jan 20 23:03 libpcap
-rw-r--r-- 1 lake staff 624 Jan 21 22:39 main.dart
drwxr-xr-x 9 lake staff 288 Jan 21 22:09 packet_worker
-rw-r--r-- 1 lake staff 8261 Jan 9 15:35 pubspec.lock
-rw-r--r-- 1 lake staff 516 Jan 9 15:35 pubspec.yaml
-rwxr-xr-x 1 lake staff 5063216 Jan 22 01:57 sniff
But it has a permission problem when I move the same code to a Flutter project:
lake@LakedeMBP MacOS % ./flutter_pcap_test
flutter: err: en0: (cannot open BPF device) /dev/bpf0: Operation not permitted
I don't know why.
I have uploaded the two samples to GitHub; can anyone help me?
Environment: MacBook M1
Dart project that works fine:
https://github.com/dev-lake/dart_pcap_test
Flutter project that has the permission problem:
https://github.com/dev-lake/flutter_pcap_test
I have tried:
chmod 777 /dev/bpf*

pyspark error with reduceByKey call using simple wordcount from a file

I'm trying to run the pyspark wordcount program from this page: https://www.learntospark.com/2020/01/word-count-program-in-apache-spark.html
Here is my code:
import findspark
findspark.init()

# Create SparkSession and SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .master("spark://my-ubuntu.xxx.com:7077")\
    .appName('wordcount')\
    .getOrCreate()
sc = spark.sparkContext

# Read the input file and calculate the word counts
text_file = sc.textFile("peterpan.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda x, y: x + y)

# Print each word with its respective count
output = counts.collect()
for (word, count) in output:
    print("%s: %i" % (word, count))

# Stop the SparkSession and SparkContext
sc.stop()
spark.stop()
I am getting the following error: tuple index out of range.
I am trying to learn pyspark. Any help is appreciated.
PicklingError Traceback (most recent call last)
Cell In [9], line 16
12 # Read the input file and Calculating words count
13 text_file = sc.textFile("peterpan1.txt")
14 counts = text_file.flatMap(lambda line: line.split(" ")) \
15 .map(lambda word: (word, 1)) \
---> 16 .reduceByKey(lambda x, y: x + y)
18 # Printing each word with its respective count
19 output = counts.collect()
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pyspark\rdd.py:2275, in RDD.reduceByKey(self, func, numPartitions, partitionFunc)
2252 def reduceByKey(
2253 self: "RDD[Tuple[K, V]]",
2254 func: Callable[[V, V], V],
2255 numPartitions: Optional[int] = None,
2256 partitionFunc: Callable[[K], int] = portable_hash,
2257 ) -> "RDD[Tuple[K, V]]":
2258 """
2259 Merge the values for each key using an associative and commutative reduce function.
2260
(...)
2273 [('a', 2), ('b', 1)]
2274 """
-> 2275 return self.combineByKey(lambda x: x, func, func, numPartitions, partitionFunc)
..
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pyspark\rdd.py:3345, in _prepare_for_python_RDD(sc, command)
3342 def _prepare_for_python_RDD(sc: "SparkContext", command: Any) -> Tuple[bytes, Any, Any, Any]:
3343 # the serialized command will be compressed by broadcast
3344 ser = CloudPickleSerializer()
-> 3345 pickled_command = ser.dumps(command)
3346 assert sc._jvm is not None
3347 if len(pickled_command) > sc._jvm.PythonUtils.getBroadcastThreshold(sc._jsc): # Default 1M
3348 # The broadcast will have same life cycle as created PythonRDD
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pyspark\serializers.py:468, in CloudPickleSerializer.dumps(self, obj)
466 msg = "Could not serialize object: %s: %s" % (e.__class__.__name__, emsg)
467 print_exec(sys.stderr)
--> 468 raise pickle.PicklingError(msg)
PicklingError: Could not serialize object: IndexError: tuple index out of range
=== peterpan.txt is a story with 6k+ lines. A bunch of lines from the file are listed below. ===
The Project Gutenberg EBook of Peter Pan, by James M. Barrie
This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org
** This is a COPYRIGHTED Project Gutenberg eBook, Details Below **
** Please follow the copyright guidelines in this file. **
Title: Peter Pan
Peter Pan and Wendy
Author: James M. Barrie
Posting Date: June 25, 2008 [EBook #16]
Release Date: July, 1991
Last Updated: October 14, 2016
Language: English
Character set encoding: UTF-8
*** START OF THIS PROJECT GUTENBERG EBOOK PETER PAN ***
PETER PAN
[PETER AND WENDY]
By J. M. Barrie [James Matthew Barrie]
A Millennium Fulcrum Edition (c)1991 by Duncan Research
Contents:
Chapter 1 PETER BREAKS THROUGH
Chapter 2 THE SHADOW
..
..
EBooks posted prior to November 2003, with eBook numbers BELOW #10000,
are filed in directories based on their release date. If you want to
download any of these eBooks directly, rather than using the regular
search system you may utilize the following addresses and just
download by the etext year.
http://www.ibiblio.org/gutenberg/etext06
(Or /etext 05, 04, 03, 02, 01, 00, 99,
98, 97, 96, 95, 94, 93, 92, 92, 91 or 90)
EBooks posted since November 2003, with etext numbers OVER #10000, are
filed in a different way. The year of a release date is no longer part
of the directory path. The path is based on the etext number (which is
identical to the filename). The path to the file is made up of single
digits corresponding to all but the last digit in the filename. For
example an eBook of filename 10234 would be found at:
http://www.gutenberg.org/1/0/2/3/10234
or filename 24689 would be found at:
http://www.gutenberg.org/2/4/6/8/24689
An alternative method of locating eBooks:
http://www.gutenberg.org/GUTINDEX.ALL
*** END: FULL LICENSE ***

AWS Glue job failing with OOM exception when changing column names

I have an ETL job where I load some data from S3 into a dynamic frame, relationalize it, and iterate through the dynamic frames returned. I want to query the result of this in Athena later, so I want to change the names of the columns from having '.' in them to '_', and to lowercase them. When I do this transformation, I change the DynamicFrame into a Spark dataframe and have been doing it this way. I've also seen a problem in another SO question where it turned out there is a reported problem with the AWS Glue rename field transform, so I've stayed away from that.
I've tried a couple of things, including adding a load limit size of 50MB, repartitioning the dataframe, using both dataframe.schema.names and dataframe.columns, using reduce instead of loops, and using sparksql to change it, and nothing has worked. I'm fairly certain that it's this transformation that's failing, because I've put some print statements in and the print that I have right after the completion of this transformation never shows up. I used a UDF at one point, but that also failed. I've tried the actual transformation using df.toDF(new_column_names) and df.withColumnRenamed(), but it never gets that far because I've not seen it get past retrieving the column names. Here's the code I've been using. I've been changing the actual name transformation as I said above, but the rest of it has stayed pretty much the same.
I've seen some people try to use spark.executor.memory, spark.driver.memory, spark.executor.memoryOverhead and spark.driver.memoryOverhead. I've used those and set them to the maximum AWS Glue will allow, but to no avail.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import explode, col, lower, trim, regexp_replace
import copy
import json
import boto3
import botocore
import time
# ========================================================
# UTILITY FUNCTIONS
# ========================================================
def lower_and_pythonize(s=None):
    if s is not None:
        return s.replace('.', '_').lower()
    else:
        return None
# pyspark implementation of renaming
# exprs = [
#     regexp_replace(lower(trim(col(c))), '\.', '_').alias(c) if t == "string" else col(c)
#     for (c, t) in data_frame.dtypes
# ]
# ========================================================
# END UTILITY FUNCTIONS
# ========================================================
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
#my params
bucket_name = '<my-s3-bucket>' # name of the bucket. do not include 's3://' thats added later
output_key = '<my-output-path>' # key where all of the output is saved
input_keys = ['<root-directory-im-using>'] # highest level key that holds all of the desired data
s3_exclusions = "[\"*.orc\"]" # list of strings to exclude. Documentation: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-s3
s3_exclusions = s3_exclusions.replace('\n', '')
dfc_root_table_name = 'root' # name of the root table generated in the relationalize process
input_paths = ['s3://' + bucket_name + '/' + x for x in input_keys] # turn input keys into s3 paths
output_connection_opts = {"path": "s3://" + bucket_name + "/" + output_key} # dict of options. Documentation link found above the write_dynamic_frame.from_options line
s3_client = boto3.client('s3', 'us-east-1') # s3 client used for writing to s3
s3_resource = boto3.resource('s3', 'us-east-1') # s3 resource used for checking if key exists
group_mb = 50 # NOTE: 75 has proven to be too much when running on all of the april data
group_size = str(group_mb * 1024 * 1024)
input_connection_opts = {'paths': input_paths,
                         'groupFiles': 'inPartition',
                         'groupSize': group_size,
                         'recurse': True,
                         'exclusions': s3_exclusions} # dict of options. Documentation link found above the create_dynamic_frame_from_options line
print(sc._conf.get('spark.executor.cores'))
num_partitions = int(sc._conf.get('spark.executor.cores')) * 4
print('Loading all json files into DynamicFrame...')
loading_time = time.time()
df = glueContext.create_dynamic_frame_from_options(connection_type='s3', connection_options=input_connection_opts, format='json')
print('Done. Time to complete: {}s'.format(time.time() - loading_time))
# using the list of known null fields (at least on small sample size) remove them
#df = df.drop_fields(drop_paths)
# drop any remaining null fields. The above covers known problems that this step doesn't fix
print('Dropping null fields...')
dropping_time = time.time()
df_without_null = DropNullFields.apply(frame=df, transformation_ctx='df_without_null')
print('Done. Time to complete: {}s'.format(time.time() - dropping_time))
df = None
print('Relationalizing dynamic frame...')
relationalizing_time = time.time()
dfc = Relationalize.apply(frame=df_without_null, name=dfc_root_table_name, info="RELATIONALIZE", transformation_ctx='dfc', stageThreshold=3)
print('Done. Time to complete: {}s'.format(time.time() - relationalizing_time))
keys = dfc.keys()
keys.sort(key=lambda s: len(s))
print('Writting all dynamic frames to s3...')
writting_time = time.time()
for key in keys:
    good_key = lower_and_pythonize(s=key)
    data_frame = dfc.select(key).toDF()

    # lowercase all the names and remove '.'
    print('Removing . and _ from names for {} frame...'.format(key))
    df_fix_names_time = time.time()
    print('Repartitioning data frame...')
    data_frame = data_frame.repartition(num_partitions)  # repartition returns a new DataFrame; the result must be reassigned
    print('Done.')

    print('Changing names...')
    for old_name in data_frame.schema.names:
        data_frame = data_frame.withColumnRenamed(old_name, old_name.replace('.', '_').lower())
    print('Done.')

    df_now = DynamicFrame.fromDF(dataframe=data_frame, glue_ctx=glueContext, name='df_now')
    print('Done. Time to complete: {}'.format(time.time() - df_fix_names_time))

    # if a conflict of types appears, make it 2 columns
    # https://docs.aws.amazon.com/glue/latest/dg/built-in-transforms.html
    print('Fixing any type conflicts for {} frame...'.format(key))
    df_resolve_time = time.time()
    resolved = ResolveChoice.apply(frame=df_now, choice='make_cols', transformation_ctx='resolved')
    print('Done. Time to complete: {}'.format(time.time() - df_resolve_time))

    # check if key exists in s3. if not make one
    out_connect = copy.deepcopy(output_connection_opts)
    out_connect['path'] = out_connect['path'] + '/' + str(good_key)
    try:
        s3_resource.Object(bucket_name, output_key + '/' + good_key + '/').load()
    except botocore.exceptions.ClientError as e:
        if e.response['Error']['Code'] == '404' or 'NoSuchKey' in e.response['Error']['Code']:
            # object doesn't exist
            s3_client.put_object(Bucket=bucket_name, Key=output_key + '/' + good_key + '/')
        else:
            print(e)

    ## https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html
    print('Writing {} frame to S3...'.format(key))
    df_writing_time = time.time()
    datasink4 = glueContext.write_dynamic_frame.from_options(frame=df_now, connection_type="s3", connection_options=out_connect, format="orc", transformation_ctx="datasink4")
    out_connect = None
    datasink4 = None
    print('Done. Time to complete: {}'.format(time.time() - df_writing_time))
print('Done. Time to complete: {}s'.format(time.time() - writting_time))
job.commit()
Here is the error I'm getting
19/06/07 16:33:36 DEBUG Client:
client token: N/A
diagnostics: Application application_1559921043869_0001 failed 1 times due to AM Container for appattempt_1559921043869_0001_000001 exited with exitCode: -104
For more detailed output, check application tracking page:http://ip-172-32-9-38.ec2.internal:8088/cluster/app/application_1559921043869_0001Then, click on links to logs of each attempt.
Diagnostics: Container [pid=9630,containerID=container_1559921043869_0001_01_000001] is running beyond physical memory limits. Current usage: 5.6 GB of 5.5 GB physical memory used; 8.8 GB of 27.5 GB virtual memory used. Killing container.
Dump of the process-tree for container_1559921043869_0001_01_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 9630 9628 9630 9630 (bash) 0 0 115822592 675 /bin/bash -c LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native /usr/lib/jvm/java-openjdk/bin/java -server -Xmx5120m -Djava.io.tmpdir=/mnt/yarn/usercache/root/appcache/application_1559921043869_0001/container_1559921043869_0001_01_000001/tmp '-XX:+UseConcMarkSweepGC' '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' '-Djavax.net.ssl.trustStore=ExternalAndAWSTrustStore.jks' '-Djavax.net.ssl.trustStoreType=JKS' '-Djavax.net.ssl.trustStorePassword=amazon' '-DRDS_ROOT_CERT_PATH=rds-combined-ca-bundle.pem' '-DREDSHIFT_ROOT_CERT_PATH=redshift-ssl-ca-cert.pem' '-DRDS_TRUSTSTORE_URL=file:RDSTrustStore.jks' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1559921043869_0001/container_1559921043869_0001_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.deploy.PythonRunner' --primary-py-file runscript.py --arg 'script_2019-06-07-15-29-50.py' --arg '--JOB_NAME' --arg 'tss-json-to-orc' --arg '--JOB_ID' --arg 'j_f9f7363e5d8afa20784bc83d7821493f481a78352641ad2165f8f68b88c8e5fe' --arg '--JOB_RUN_ID' --arg 'jr_a77087792dd74231be1f68c1eda2ed33200126b8952c5b1420cb6684759cf233' --arg '--job-bookmark-option' --arg 'job-bookmark-disable' --arg '--TempDir' --arg 's3://aws-glue-temporary-059866946490-us-east-1/zmcgrath' --properties-file /mnt/yarn/usercache/root/appcache/application_1559921043869_0001/container_1559921043869_0001_01_000001/__spark_conf__/__spark_conf__.properties 1> /var/log/hadoop-yarn/containers/application_1559921043869_0001/container_1559921043869_0001_01_000001/stdout 2> /var/log/hadoop-yarn/containers/application_1559921043869_0001/container_1559921043869_0001_01_000001/stderr
|- 9677 9648 9630 9630 (python) 12352 2628 1418354688 261364 python runscript.py script_2019-06-07-15-29-50.py --JOB_NAME tss-json-to-orc --JOB_ID j_f9f7363e5d8afa20784bc83d7821493f481a78352641ad2165f8f68b88c8e5fe --JOB_RUN_ID jr_a77087792dd74231be1f68c1eda2ed33200126b8952c5b1420cb6684759cf233 --job-bookmark-option job-bookmark-disable --TempDir s3://aws-glue-temporary-059866946490-us-east-1/zmcgrath
|- 9648 9630 9630 9630 (java) 265906 3083 7916974080 1207439 /usr/lib/jvm/java-openjdk/bin/java -server -Xmx5120m -Djava.io.tmpdir=/mnt/yarn/usercache/root/appcache/application_1559921043869_0001/container_1559921043869_0001_01_000001/tmp -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError=kill -9 %p -Djavax.net.ssl.trustStore=ExternalAndAWSTrustStore.jks -Djavax.net.ssl.trustStoreType=JKS -Djavax.net.ssl.trustStorePassword=amazon -DRDS_ROOT_CERT_PATH=rds-combined-ca-bundle.pem -DREDSHIFT_ROOT_CERT_PATH=redshift-ssl-ca-cert.pem -DRDS_TRUSTSTORE_URL=file:RDSTrustStore.jks -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1559921043869_0001/container_1559921043869_0001_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class org.apache.spark.deploy.PythonRunner --primary-py-file runscript.py --arg script_2019-06-07-15-29-50.py --arg --JOB_NAME --arg tss-json-to-orc --arg --JOB_ID --arg j_f9f7363e5d8afa20784bc83d7821493f481a78352641ad2165f8f68b88c8e5fe --arg --JOB_RUN_ID --arg jr_a77087792dd74231be1f68c1eda2ed33200126b8952c5b1420cb6684759cf233 --arg --job-bookmark-option --arg job-bookmark-disable --arg --TempDir --arg s3://aws-glue-temporary-059866946490-us-east-1/zmcgrath --properties-file /mnt/yarn/usercache/root/appcache/application_1559921043869_0001/container_1559921043869_0001_01_000001/__spark_conf__/__spark_conf__.properties
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1559921462650
final status: FAILED
tracking URL: http://ip-172-32-9-38.ec2.internal:8088/cluster/app/application_1559921043869_0001
user: root
Here are the log contents from the job
LogType:stdout
Log Upload Time:Fri Jun 07 16:33:36 +0000 2019
LogLength:487
Log Contents:
4
Loading all json files into DynamicFrame...
Done. Time to complete: 59.5056920052s
Dropping null fields...
null_fields [<some fields that were dropped>]
Done. Time to complete: 529.95293808s
Relationalizing dynamic frame...
Done. Time to complete: 2773.11689401s
Writting all dynamic frames to s3...
Removing . and _ from names for root frame...
Repartitioning data frame...
Done.
Changing names...
End of LogType:stdout
As I said earlier, the Done. print after changing the names never appears in the logs. I've seen plenty of people getting the same error I'm seeing, and I've tried a fair few of their fixes with no success. Any help you can provide would be much appreciated. Let me know if you need any more information. Thanks
Edit
Prabhakar's comment reminded me that I have tried the memory worker type in AWS Glue and it still failed. As stated above, I have tried raising the amount of memory in the memoryOverhead from 5 to 12, but to no avail. Neither of these made the job complete successfully.
Update
I put in the following code for the column name change instead of the above code, for easier debugging:
print('Changing names...')
name_counter = 0
for old_name in data_frame.schema.names:
    print('Name number {}. name being changed: {}'.format(name_counter, old_name))
    data_frame = data_frame.withColumnRenamed(old_name, old_name.replace('.', '_').lower())
    name_counter += 1
print('Done.')
And I got the following output
Removing . and _ from names for root frame...
Repartitioning data frame...
Done.
Changing names...
End of LogType:stdout
So it must be a problem with the data_frame.schema.names part. Could it be this line with my loop through all of the DynamicFrames? Am I looping through the DynamicFrames from the relationalize transformation correctly?
Update 2
Glue recently added more verbose logs and I found this
ERROR YarnClusterScheduler: Lost executor 396 on ip-172-32-78-221.ec2.internal: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
This happens for more than just this executor too; it looks like almost all of them.
I can try to increase the executor memory overhead, but I would like to know why getting the column names results in an OOM error. I wouldn't think that something so trivial would take up that much memory.
Update
I attempted to run the job with both spark.driver.memoryOverhead=7g and spark.yarn.executor.memoryOverhead=7g and I again got an OOM error

Cocoapods pod repo push git

I am trying to push my pod to a local repo. Before that, I verified pod lib lint on my repo, and it works fine locally:
$ pod lib lint --swift-version=5.0 --allow-warnings
/System/Library/Frameworks/Ruby.framework/Versions/2.3/usr/lib/ruby/2.3.0/universal-darwin18/rbconfig.rb:215: warning: Insecure world writable dir /usr/local/sbin in PATH, mode 040777
-> SFLocationManager (1.0)
- WARN | source: Git SSH URLs will NOT work for people behind firewalls configured to only allow HTTP, therefore HTTPS is preferred.
- NOTE | xcodebuild: note: Using new build system
- NOTE | xcodebuild: note: Planning build
- NOTE | xcodebuild: note: Constructing build description
SFLocationManager passed validation.
After this, I created tags and pushed them to the server:
$ git tag
0.1.0
0.1.1
1.0
Then I tried the pod repo push command for the local repo, which failed:
$ pod repo push git@git.url.com:ankit.thakur/locationmanager.git SFLocationManager.podspec --allow-warnings --swift-version=5.0 --local-only
/System/Library/Frameworks/Ruby.framework/Versions/2.3/usr/lib/ruby/2.3.0/universal-darwin18/rbconfig.rb:215: warning: Insecure world writable dir /usr/local/sbin in PATH, mode 040777
Validating spec
-> SFLocationManager (1.0)
- WARN | source: Git SSH URLs will NOT work for people behind firewalls configured to only allow HTTP, therefore HTTPS is preferred.
- ERROR | file patterns: The `source_files` pattern did not match any file.
- NOTE | xcodebuild: note: Using new build system
- NOTE | xcodebuild: note: Planning build
- NOTE | xcodebuild: note: Constructing build description
[!] The `SFLocationManager.podspec` specification does not validate.
Then I removed the --local-only flag and ran it again, but it still failed:
$ pod repo push git@git.url.com:ankit.thakur/locationmanager.git SFLocationManager.podspec --allow-warnings --swift-version=5.0
/System/Library/Frameworks/Ruby.framework/Versions/2.3/usr/lib/ruby/2.3.0/universal-darwin18/rbconfig.rb:215: warning: Insecure world writable dir /usr/local/sbin in PATH, mode 040777
Validating spec
-> SFLocationManager (1.0)
- WARN | source: Git SSH URLs will NOT work for people behind firewalls configured to only allow HTTP, therefore HTTPS is preferred.
- ERROR | file patterns: The `source_files` pattern did not match any file.
- NOTE | xcodebuild: note: Using new build system
- NOTE | xcodebuild: note: Planning build
- NOTE | xcodebuild: note: Constructing build description
[!] The `SFLocationManager.podspec` specification does not validate.
Here is the pod version
$ pod --version
/System/Library/Frameworks/Ruby.framework/Versions/2.3/usr/lib/ruby/2.3.0/universal-darwin18/rbconfig.rb:215: warning: Insecure world writable dir /usr/local/sbin in PATH, mode 040777
1.6.0
Here is the podspec file:
#
# Be sure to run `pod lib lint SFLocationManager.podspec' to ensure this is a
# valid spec before submitting.
#
# Any lines starting with a # are optional, but their use is encouraged
# To learn more about a Podspec see https://guides.cocoapods.org/syntax/podspec.html
#

Pod::Spec.new do |spec|
  spec.name    = 'SFLocationManager'
  spec.version = '1.0'
  spec.summary = 'SFLocationManager is location based library for iOS and Mac'

  # This description is used to generate tags and improve search results.
  #   * Think: What does it do? Why did you write it? What is the focus?
  #   * Try to keep it short, snappy and to the point.
  #   * Write the description between the DESC delimiters below.
  #   * Finally, don't worry about the indent, CocoaPods strips it!
  spec.description = <<-DESC
Location library in beta test version to fetch location with scheduled interval.
  DESC

  spec.homepage = 'https://git.url.com/ankit.thakur/locationmanager'
  # spec.screenshots = 'www.example.com/screenshots_1', 'www.example.com/screenshots_2'
  spec.license  = { :type => 'MIT', :file => 'LICENSE' }
  spec.author   = { 'ankitthakur' => 'ankit.thakur@url.com' }
  spec.source   = { :git => 'git@git.url.com:ankit.thakur/locationmanager.git', :tag => spec.version.to_s }
  # spec.social_media_url = 'https://twitter.com/<TWITTER_USERNAME>'

  spec.requires_arc = true
  spec.ios.deployment_target = '10.0'
  spec.osx.deployment_target = '10.10'

  spec.source_files = 'SFLocationManager/Sources/Common/**/*.swift'
  # spec.ios.source_files = 'SFLocationManager/Sources/iOS/**/*.swift'
  # spec.osx.source_files = 'SFLocationManager/Sources/OSX/**/*.swift'

  # spec.resource_bundles = {
  #   'SFLocationManager' => ['SFLocationManager/Assets/*.png']
  # }

  spec.frameworks = 'CoreLocation'
  # spec.public_header_files = 'Pod/Classes/**/*.h'
  # spec.frameworks = 'UIKit', 'MapKit'
  # spec.dependency 'AFNetworking', '~> 2.3'
end
The files matched by the spec.source_files pattern are:
$ ls -al SFLocationManager/Sources/Common/**/*.swift
-rw-r--r--@ 1 ankitthakur staff 2710 Apr 25 18:02 SFLocationManager/Sources/Common/GeocoderUtils/Geocoder.swift
-rw-r--r--@ 1 ankitthakur staff 613 Apr 25 18:21 SFLocationManager/Sources/Common/LocationManager/LocationConfiguration.swift
-rw-r--r--@ 1 ankitthakur staff 324 Apr 25 18:02 SFLocationManager/Sources/Common/LocationManager/LocationError.swift
-rw-r--r--@ 1 ankitthakur staff 241 Apr 25 18:02 SFLocationManager/Sources/Common/LocationManager/LocationEventType.swift
-rw-r--r--@ 1 ankitthakur staff 7144 Apr 25 18:36 SFLocationManager/Sources/Common/LocationManager/LocationManager.swift
-rw-r--r--@ 1 ankitthakur staff 4649 Apr 25 18:02 SFLocationManager/Sources/Common/Model/Location.swift
-rw-r--r--@ 1 ankitthakur staff 3939 Apr 25 18:27 SFLocationManager/Sources/Common/Trigger/LocationTriggerManager.swift
As per the suggestions in the provided solutions, my updated podspec is:
#
# Be sure to run `pod lib lint SFLocationManager.podspec' to ensure this is a
# valid spec before submitting.
#
# Any lines starting with a # are optional, but their use is encouraged
# To learn more about a Podspec see https://guides.cocoapods.org/syntax/podspec.html
#

Pod::Spec.new do |spec|
  spec.name    = 'SFLocationManager'
  spec.version = '1.0'
  spec.summary = 'SFLocationManager is location based library for iOS and Mac'

  # This description is used to generate tags and improve search results.
  #   * Think: What does it do? Why did you write it? What is the focus?
  #   * Try to keep it short, snappy and to the point.
  #   * Write the description between the DESC delimiters below.
  #   * Finally, don't worry about the indent, CocoaPods strips it!
  spec.description = <<-DESC
Location library in beta test version to fetch location with scheduled interval.
  DESC

  spec.homepage = 'https://git.promobitech.com/ankit.thakur/locationmanager'
  # spec.screenshots = 'www.example.com/screenshots_1', 'www.example.com/screenshots_2'
  spec.license  = { :type => 'MIT', :file => 'LICENSE' }
  spec.author   = { 'ankitthakur' => 'ankit.thakur@promobitech.com' }
  spec.source   = { :git => 'git@git.promobitech.com:ankit.thakur/locationmanager.git', :tag => spec.version.to_s }
  # spec.social_media_url = 'https://twitter.com/<TWITTER_USERNAME>'

  spec.requires_arc = true
  spec.ios.deployment_target = '10.0'
  spec.osx.deployment_target = '10.10'

  spec.source_files = 'SFLocationManager/Sources/Common/GeocoderUtils/*.{swift}',
                      'SFLocationManager/Sources/Common/LocationManager/*.{swift}',
                      'SFLocationManager/Sources/Common/Model/*.{swift}',
                      'SFLocationManager/Sources/Common/Trigger/*.{swift}'
  # spec.ios.source_files = 'SFLocationManager/Sources/iOS/**/*.{swift}'
  # spec.osx.source_files = 'SFLocationManager/Sources/OSX/**/*.{swift}'

  # spec.resource_bundles = {
  #   'SFLocationManager' => ['SFLocationManager/Assets/*.png']
  # }

  spec.frameworks = 'CoreLocation'
  # spec.public_header_files = 'Pod/Classes/**/*.h'
  # spec.frameworks = 'UIKit', 'MapKit'
  # spec.dependency 'AFNetworking', '~> 2.3'
end
But it is still not working.
Here is my project directory listing:
Admin:locationmanager ankitthakur$ ls -al
total 40
drwxr-xr-x 10 ankitthakur staff 320 Apr 25 20:38 .
drwxr-xr-x 9 ankitthakur staff 288 Apr 25 20:38 ..
-rw-r--r-- 1 ankitthakur staff 6148 Apr 25 20:38 .DS_Store
drwxr-xr-x 14 ankitthakur staff 448 Apr 26 14:50 .git
drwxr-xr-x 10 ankitthakur staff 320 Apr 25 20:38 Example
-rw-r--r-- 1 ankitthakur staff 1086 Apr 25 20:38 LICENSE
-rw-r--r-- 1 ankitthakur staff 1029 Apr 25 20:38 README.md
drwxr-xr-x 4 ankitthakur staff 128 Apr 25 20:51 SFLocationManager
-rw-r--r-- 1 ankitthakur staff 2241 Apr 26 14:49 SFLocationManager.podspec
lrwxr-xr-x 1 ankitthakur staff 27 Apr 25 20:38 _Pods.xcodeproj -> Example/Pods/Pods.xcodeproj
The error says:
file patterns: The `source_files` pattern did not match any file.
This means that you have written a wrong pattern.
So you should correct your source_files like the following:
s.source_files = "FOLDERNAME/*.{swift}"
(This will include all the Swift files under the folder "FOLDERNAME".)
In case you have multiple folders, do it like the following:
s.source_files = "FOLDERNAME1/*.{swift}", "FOLDERNAME2/*.{swift}"

Spark: All RDD data not getting saved to Cassandra table

Hi, I am trying to load RDD data into a Cassandra column family using Scala. Out of a total of 50 rows, only 28 are getting stored in the Cassandra table.
Below is the code snippet:
val states = sc.textFile("state.txt") // list of all the 50 states of the USA
var n = 0 // corrected to var
val statesRDD = states.map { a =>
  n = n + 1
  (n, a)
}

scala> statesRDD.count
res2: Long = 50

cqlsh:brs> CREATE TABLE BRS.state(state_id int PRIMARY KEY, state_name text);

statesRDD.saveToCassandra("brs", "state", SomeColumns("state_id", "state_name"))
// this statement saves only 28 rows out of 50, not sure why!!!!
cqlsh:brs> select * from state;
state_id | state_name
----------+-------------
23 | Minnesota
5 | California
28 | Nevada
10 | Georgia
16 | Kansas
13 | Illinois
11 | Hawaii
1 | Alabama
19 | Maine
8 | Oklahoma
2 | Alaska
4 | New York
18 | Virginia
15 | Iowa
22 | Wyoming
27 | Nebraska
20 | Maryland
7 | Ohio
6 | Colorado
9 | Florida
14 | Indiana
26 | Montana
21 | Wisconsin
17 | Vermont
24 | Mississippi
25 | Missouri
12 | Idaho
3 | Arizona
(28 rows)
Can anyone please help me in finding where the issue is?
Edit:
I understood why only 28 rows are getting stored in Cassandra: it's because I made the first column a PRIMARY KEY, and it looks like in my code n is incremented to a maximum of 28 and then starts again from 1 up to 22 (50 in total).
val states = sc.textFile("states.txt")
var n = 0
var statesRDD = states.map { a =>
  n += 1
  (n, a)
}
I tried making n an accumulator variable as well (viz. val n = sc.accumulator(0, "Counter")), but I don't see any difference in the output.
scala> statesRDD.foreach(println)
[Stage 2:> (0 + 0) / 2]
(1,New Hampshire)
(2,New Jersey)
(3,New Mexico)
(4,New York)
(5,North Carolina)
(6,North Dakota)
(7,Ohio)
(8,Oklahoma)
(9,Oregon)
(10,Pennsylvania)
(11,Rhode Island)
(12,South Carolina)
(13,South Dakota)
(14,Tennessee)
(15,Texas)
(16,Utah)
(17,Vermont)
(18,Virginia)
(19,Washington)
(20,West Virginia)
(21,Wisconsin)
(22,Wyoming)
(1,Alabama)
(2,Alaska)
(3,Arizona)
(4,Arkansas)
(5,California)
(6,Colorado)
(7,Connecticut)
(8,Delaware)
(9,Florida)
(10,Georgia)
(11,Hawaii)
(12,Idaho)
(13,Illinois)
(14,Indiana)
(15,Iowa)
(16,Kansas)
(17,Kentucky)
(18,Louisiana)
(19,Maine)
(20,Maryland)
(21,Massachusetts)
(22,Michigan)
(23,Minnesota)
(24,Mississippi)
(25,Missouri)
(26,Montana)
(27,Nebraska)
(28,Nevada)
I am curious to know what is causing n to not get updated past 28. Also, in what ways can I create a counter that I can use when creating an RDD?
There are some misconceptions about distributed systems embedded in your question. The real heart of this is "How do I have a counter in a distributed system?"
The short answer is: you don't. For example, what you've done in your original code example is something like this:
Task One {
  var x = 0
  record 1: x = 1
  record 2: x = 2
}

Task Two {
  var x = 0
  record 20: x = 1
  record 21: x = 2
}
Each machine independently creates a new x variable set to 0, which gets incremented within its own context, independently of the other nodes.
For most use cases the "counter" question can be replaced with "How can I get a unique identifier per record in a distributed system?"
For this, most users end up using a UUID, which can be generated on independent machines with infinitesimal chances of conflicts.
If the question is "How can I get a monotonically increasing unique identifier?"
then you can use zipWithUniqueId, which does not run a counting pass but generates unique, increasing ids (or zipWithIndex, which gives consecutive ids at the cost of an extra job).
If you just want the numbers to start with, it's best to generate them on the local system.
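For the states example above, here is a minimal sketch using zipWithIndex (the input file name, keyspace and table come from the question; saveToCassandra assumes the Spark Cassandra Connector is on the classpath):

import com.datastax.spark.connector._ // provides saveToCassandra and SomeColumns

// zipWithIndex assigns consecutive 0-based ids (it runs one extra job to
// count partition sizes); zipWithUniqueId assigns unique but non-consecutive
// ids in a single pass. Either avoids the shared mutable counter entirely.
val states = sc.textFile("states.txt")
val statesRDD = states.zipWithIndex.map { case (name, i) => ((i + 1).toInt, name) }

statesRDD.saveToCassandra("brs", "state", SomeColumns("state_id", "state_name"))

Because the ids are derived from partition offsets rather than a per-task variable, all 50 rows get distinct state_id values.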
Edit: Why can't I use an accumulator?
Accumulators store their state (surprise) per task. You can see this with a little example:
val x = sc.accumulator(0, "x")
sc.parallelize(1 to 50).foreachPartition { it => it.foreach(y => x += 1); println(x) }
/*
6
7
6
6
6
6
6
7
*/
x.value
// res38: Int = 50
Accumulators combine their state only after their tasks finish, which means you can't use them as a live, global distributed counter.