I am writing a program that essentially pulls a line out of a logfile, parses it, and returns the parsed data in a simplified form. My main problem currently is deciding how to parse the datetime strings.
Example of the logfile:
2012-06-12 14:02:16,341 [main] INFO ---
2012-06-12 14:02:16,509 [main] INFO ---
2012-06-12 14:02:17,000 [main] INFO ---
2012-06-12 14:02:17,112 [main] INFO ---
2012-06-12 14:02:20,338 [main] INFO ---
2012-06-12 14:02:21,813 [main] INFO ---
My code to parse SO FAR (very rough):
import re
import time

class LogLine:
    SEVERITIES = ['EMERG', 'ALERT', 'CRIT', 'ERR', 'WARNING', 'NOTICE', 'INFO', 'DEBUG']
    severity = 1

    def __init__(self, line):
        try:
            # NOTE: the regex only captures the timestamp so far; the other four
            # fields (severity, filename, line number, message) still need their
            # own capture groups for this unpacking to work.
            t, s, self.filename, n, self.message = \
                re.match(r"^(\d\d\d\d-\d\d-\d\d[ \t]\d\d:\d\d:\d\d,\d\d\d)", line).groups()
            self.line = int(n)
            self.sev = self.SEVERITIES.index(s)
            self.time = time.strptime(t)
        except Exception:
            pass

    # getters not implemented yet
    def get_t(self):
        return

    def get_severity(self):
        return self.SEVERITIES.index(self)

    def get_message(self):
        return

    def get_filename(self):
        return

    def get_line(self):
        return
So basically (if you weren't able to deduce it from my terrible code) I am parsing the string with a regular expression to obtain the datetime. I have also been reading about strptime as a possible solution. ULTIMATELY, I need to convert the datetime into milliseconds and then add to it the milliseconds integer that follows the comma.
I'm sure this question is extremely convoluted and I apologize in advance. Thank you for your help.
The %f directive parses microseconds, so you can hand the whole timestamp to strptime by padding the three-digit millisecond field out to six digits:
>>> datetime.datetime.strptime('2012-06-12 14:02:16,341' + '000', '%Y-%m-%d %H:%M:%S,%f')
datetime.datetime(2012, 6, 12, 14, 2, 16, 341000)
Here's an example of how to parse a line:
>>> # A line from the logfile.
>>> line = "2012-06-12 14:02:16,341 [main] INFO ---"
>>> # Parse the line.
>>> m = re.match(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) \[([^]]*)\] (\S+) (.*)", line)
>>> timestamp, milliseconds, filename, severity, message = m.groups()
>>> # Show the various captured values.
>>> timestamp
'2012-06-12 14:02:16'
>>> milliseconds
'341'
>>> filename
'main'
>>> severity
'INFO'
>>> message
'---'
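To get what the question ultimately asks for (the timestamp as epoch milliseconds plus the comma-separated millisecond field), a minimal sketch building on the captures above looks like this; it assumes the log timestamps are in local time:
import datetime
timestamp, milliseconds = '2012-06-12 14:02:16', '341'
dt = datetime.datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S')
dt = dt.replace(microsecond=int(milliseconds) * 1000)   # fold the ',341' back in
epoch_ms = int(dt.timestamp() * 1000)                   # milliseconds since the Unix epoch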
I would like to know the right way to read float fields using QuickFIX (Python). I have been getting the value as a string and then casting it to float.
For instance:
>>> m = fix.Message()
>>> m.setField(fix.BidPx(1.12))
>>> m.getField(fix.BidPx()).getString()
'1.12'
>>> float(m.getField(fix.BidPx()).getString())
1.12
The way above works fine for floats with fewer than 15 digits of precision, but I got the following error for float numbers with more than 15 digits of precision:
>>> m = fix.Message()
>>> m.setField(fix.BidPx(1.123456789123456))
>>> m.getField(fix.BidPx()).getString()
'\x00\xe1}\xf5\x82U\x00\x0078912346'
>>> float(m.getField(fix.BidPx()).getString())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: could not convert string to float:
I am not sure that sample works; maybe you should explain how you import "fix".
Anyway, this sample works with Python 3.7 and quickfix 1.15.1:
>>> import quickfix as fix
>>> m = fix.Message()
>>> m.setField(fix.BidPx(1.123456789123456789123456789))
>>> m.getField(fix.BidPx().getField())
'1.12345678912346'
>>>
If you need more precision in the float number, you can set the field as a string:
>>> m.setField(fix.StringField(fix.BidPx().getField(),"1.123456789123456789123456789"))
>>> m.getField(fix.BidPx().getField())
'1.123456789123456789123456789'
>>>
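If you need to use the value numerically without losing that precision, a small sketch (assuming the field is read back as a string, as above) is to convert it with decimal.Decimal instead of float:
>>> from decimal import Decimal
>>> Decimal(m.getField(fix.BidPx().getField()))
Decimal('1.123456789123456789123456789')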
I hope I've helped
I have an ETL job where I load some data from S3 into a dynamic frame, relationalize it, and iterate through the dynamic frames returned. I want to query the result in Athena later, so I want to replace the '.' in the column names with '_' and lowercase them. To do this transformation I convert the DynamicFrame into a Spark DataFrame and have been doing it that way. I've also seen another SO question where it turned out there is a reported problem with the AWS Glue rename field transform, so I've stayed away from that.
I've tried a couple of things, including adding a load size limit of 50 MB, repartitioning the dataframe, using both dataframe.schema.names and dataframe.columns, using reduce instead of loops, and using Spark SQL to change it; nothing has worked. I'm fairly certain that it's this transformation that's failing, because a print statement I placed right after the completion of this transformation never shows up. I used a UDF at one point, but that also failed. I've tried the actual transformation using df.toDF(new_column_names) and df.withColumnRenamed(), but it never gets that far because I've never seen it get past retrieving the column names. Here's the code I've been using; I've been changing the actual name transformation as I said above, but the rest of it has stayed pretty much the same.
I've seen some people try using spark.executor.memory, spark.driver.memory, spark.executor.memoryOverhead and spark.driver.memoryOverhead. I've used those, set to the most AWS Glue will let you, but to no avail.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import explode, col, lower, trim, regexp_replace
import copy
import json
import boto3
import botocore
import time
# ========================================================
# UTILITY FUNCTIONS
# ========================================================
def lower_and_pythonize(s=None):
    if s is not None:
        return s.replace('.', '_').lower()
    else:
        return None

# pyspark implementation of renaming
# exprs = [
#     regexp_replace(lower(trim(col(c))), '\.', '_').alias(c) if t == "string" else col(c)
#     for (c, t) in data_frame.dtypes
# ]
# ========================================================
# END UTILITY FUNCTIONS
# ========================================================
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
#my params
bucket_name = '<my-s3-bucket>' # name of the bucket. do not include 's3://' thats added later
output_key = '<my-output-path>' # key where all of the output is saved
input_keys = ["<root-directory-i'm-using>"] # highest level key that holds all of the desired data
s3_exclusions = "[\"*.orc\"]" # list of strings to exclude. Documentation: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-s3
s3_exclusions = s3_exclusions.replace('\n', '')
dfc_root_table_name = 'root' # name of the root table generated in the relationalize process
input_paths = ['s3://' + bucket_name + '/' + x for x in input_keys] # turn input keys into s3 paths
output_connection_opts = {"path": "s3://" + bucket_name + "/" + output_key} # dict of options. Documentation link found above the write_dynamic_frame.from_options line
s3_client = boto3.client('s3', 'us-east-1') # s3 client used for writing to s3
s3_resource = boto3.resource('s3', 'us-east-1') # s3 resource used for checking if key exists
group_mb = 50 # NOTE: 75 has proven to be too much when running on all of the april data
group_size = str(group_mb * 1024 * 1024)
input_connection_opts = {'paths': input_paths,
'groupFiles': 'inPartition',
'groupSize': group_size,
'recurse': True,
'exclusions': s3_exclusions} # dict of options. Documentation link found above the create_dynamic_frame_from_options line
print(sc._conf.get('spark.executor.cores'))
num_paritions = int(sc._conf.get('spark.executor.cores')) * 4
print('Loading all json files into DynamicFrame...')
loading_time = time.time()
df = glueContext.create_dynamic_frame_from_options(connection_type='s3', connection_options=input_connection_opts, format='json')
print('Done. Time to complete: {}s'.format(time.time() - loading_time))
# using the list of known null fields (at least on small sample size) remove them
#df = df.drop_fields(drop_paths)
# drop any remaining null fields. The above covers known problems that this step doesn't fix
print('Dropping null fields...')
dropping_time = time.time()
df_without_null = DropNullFields.apply(frame=df, transformation_ctx='df_without_null')
print('Done. Time to complete: {}s'.format(time.time() - dropping_time))
df = None
print('Relationalizing dynamic frame...')
relationalizing_time = time.time()
dfc = Relationalize.apply(frame=df_without_null, name=dfc_root_table_name, info="RELATIONALIZE", transformation_ctx='dfc', stageThreshold=3)
print('Done. Time to complete: {}s'.format(time.time() - relationalizing_time))
keys = dfc.keys()
keys.sort(key=lambda s: len(s))
print('Writting all dynamic frames to s3...')
writting_time = time.time()
for key in keys:
    good_key = lower_and_pythonize(s=key)
    data_frame = dfc.select(key).toDF()

    # lowercase all the names and remove '.'
    print('Removing . and _ from names for {} frame...'.format(key))
    df_fix_names_time = time.time()
    print('Repartitioning data frame...')
    data_frame = data_frame.repartition(num_paritions)  # repartition returns a new DataFrame, so it must be reassigned
    print('Done.')
    #
    print('Changing names...')
    for old_name in data_frame.schema.names:
        data_frame = data_frame.withColumnRenamed(old_name, old_name.replace('.','_').lower())
    print('Done.')
    #
    df_now = DynamicFrame.fromDF(dataframe=data_frame, glue_ctx=glueContext, name='df_now')
    print('Done. Time to complete: {}'.format(time.time() - df_fix_names_time))

    # if a conflict of types appears, make it 2 columns
    # https://docs.aws.amazon.com/glue/latest/dg/built-in-transforms.html
    print('Fixing any type conficts for {} frame...'.format(key))
    df_resolve_time = time.time()
    resolved = ResolveChoice.apply(frame = df_now, choice = 'make_cols', transformation_ctx = 'resolved')
    print('Done. Time to complete: {}'.format(time.time() - df_resolve_time))

    # check if key exists in s3. if not make one
    out_connect = copy.deepcopy(output_connection_opts)
    out_connect['path'] = out_connect['path'] + '/' + str(good_key)
    try:
        s3_resource.Object(bucket_name, output_key + '/' + good_key + '/').load()
    except botocore.exceptions.ClientError as e:
        if e.response['Error']['Code'] == '404' or 'NoSuchKey' in e.response['Error']['Code']:
            # object doesn't exist
            s3_client.put_object(Bucket=bucket_name, Key=output_key+'/'+good_key + '/')
        else:
            print(e)

    ## https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html
    print('Writing {} frame to S3...'.format(key))
    df_writing_time = time.time()
    datasink4 = glueContext.write_dynamic_frame.from_options(frame = df_now, connection_type = "s3", connection_options = out_connect, format = "orc", transformation_ctx = "datasink4")
    out_connect = None
    datasink4 = None
    print('Done. Time to complete: {}'.format(time.time() - df_writing_time))

print('Done. Time to complete: {}s'.format(time.time() - writting_time))
job.commit()
Here is the error I'm getting
19/06/07 16:33:36 DEBUG Client:
client token: N/A
diagnostics: Application application_1559921043869_0001 failed 1 times due to AM Container for appattempt_1559921043869_0001_000001 exited with exitCode: -104
For more detailed output, check application tracking page:http://ip-172-32-9-38.ec2.internal:8088/cluster/app/application_1559921043869_0001Then, click on links to logs of each attempt.
Diagnostics: Container [pid=9630,containerID=container_1559921043869_0001_01_000001] is running beyond physical memory limits. Current usage: 5.6 GB of 5.5 GB physical memory used; 8.8 GB of 27.5 GB virtual memory used. Killing container.
Dump of the process-tree for container_1559921043869_0001_01_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 9630 9628 9630 9630 (bash) 0 0 115822592 675 /bin/bash -c LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native /usr/lib/jvm/java-openjdk/bin/java -server -Xmx5120m -Djava.io.tmpdir=/mnt/yarn/usercache/root/appcache/application_1559921043869_0001/container_1559921043869_0001_01_000001/tmp '-XX:+UseConcMarkSweepGC' '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' '-Djavax.net.ssl.trustStore=ExternalAndAWSTrustStore.jks' '-Djavax.net.ssl.trustStoreType=JKS' '-Djavax.net.ssl.trustStorePassword=amazon' '-DRDS_ROOT_CERT_PATH=rds-combined-ca-bundle.pem' '-DREDSHIFT_ROOT_CERT_PATH=redshift-ssl-ca-cert.pem' '-DRDS_TRUSTSTORE_URL=file:RDSTrustStore.jks' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1559921043869_0001/container_1559921043869_0001_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.deploy.PythonRunner' --primary-py-file runscript.py --arg 'script_2019-06-07-15-29-50.py' --arg '--JOB_NAME' --arg 'tss-json-to-orc' --arg '--JOB_ID' --arg 'j_f9f7363e5d8afa20784bc83d7821493f481a78352641ad2165f8f68b88c8e5fe' --arg '--JOB_RUN_ID' --arg 'jr_a77087792dd74231be1f68c1eda2ed33200126b8952c5b1420cb6684759cf233' --arg '--job-bookmark-option' --arg 'job-bookmark-disable' --arg '--TempDir' --arg 's3://aws-glue-temporary-059866946490-us-east-1/zmcgrath' --properties-file /mnt/yarn/usercache/root/appcache/application_1559921043869_0001/container_1559921043869_0001_01_000001/__spark_conf__/__spark_conf__.properties 1> /var/log/hadoop-yarn/containers/application_1559921043869_0001/container_1559921043869_0001_01_000001/stdout 2> /var/log/hadoop-yarn/containers/application_1559921043869_0001/container_1559921043869_0001_01_000001/stderr
|- 9677 9648 9630 9630 (python) 12352 2628 1418354688 261364 python runscript.py script_2019-06-07-15-29-50.py --JOB_NAME tss-json-to-orc --JOB_ID j_f9f7363e5d8afa20784bc83d7821493f481a78352641ad2165f8f68b88c8e5fe --JOB_RUN_ID jr_a77087792dd74231be1f68c1eda2ed33200126b8952c5b1420cb6684759cf233 --job-bookmark-option job-bookmark-disable --TempDir s3://aws-glue-temporary-059866946490-us-east-1/zmcgrath
|- 9648 9630 9630 9630 (java) 265906 3083 7916974080 1207439 /usr/lib/jvm/java-openjdk/bin/java -server -Xmx5120m -Djava.io.tmpdir=/mnt/yarn/usercache/root/appcache/application_1559921043869_0001/container_1559921043869_0001_01_000001/tmp -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError=kill -9 %p -Djavax.net.ssl.trustStore=ExternalAndAWSTrustStore.jks -Djavax.net.ssl.trustStoreType=JKS -Djavax.net.ssl.trustStorePassword=amazon -DRDS_ROOT_CERT_PATH=rds-combined-ca-bundle.pem -DREDSHIFT_ROOT_CERT_PATH=redshift-ssl-ca-cert.pem -DRDS_TRUSTSTORE_URL=file:RDSTrustStore.jks -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1559921043869_0001/container_1559921043869_0001_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class org.apache.spark.deploy.PythonRunner --primary-py-file runscript.py --arg script_2019-06-07-15-29-50.py --arg --JOB_NAME --arg tss-json-to-orc --arg --JOB_ID --arg j_f9f7363e5d8afa20784bc83d7821493f481a78352641ad2165f8f68b88c8e5fe --arg --JOB_RUN_ID --arg jr_a77087792dd74231be1f68c1eda2ed33200126b8952c5b1420cb6684759cf233 --arg --job-bookmark-option --arg job-bookmark-disable --arg --TempDir --arg s3://aws-glue-temporary-059866946490-us-east-1/zmcgrath --properties-file /mnt/yarn/usercache/root/appcache/application_1559921043869_0001/container_1559921043869_0001_01_000001/__spark_conf__/__spark_conf__.properties
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1559921462650
final status: FAILED
tracking URL: http://ip-172-32-9-38.ec2.internal:8088/cluster/app/application_1559921043869_0001
user: root
Here are the log contents from the job
LogType:stdout
Log Upload Time:Fri Jun 07 16:33:36 +0000 2019
LogLength:487
Log Contents:
4
Loading all json files into DynamicFrame...
Done. Time to complete: 59.5056920052s
Dropping null fields...
null_fields [<some fields that were dropped>]
Done. Time to complete: 529.95293808s
Relationalizing dynamic frame...
Done. Time to complete: 2773.11689401s
Writting all dynamic frames to s3...
Removing . and _ from names for root frame...
Repartitioning data frame...
Done.
Changing names...
End of LogType:stdout
As I said earlier, the Done. print after changing the names never appears in the logs. I've seen plenty of people getting the same error, and I've tried a fair number of their suggestions with no success. Any help you can provide would be much appreciated. Let me know if you need any more information. Thanks.
Edit
Prabhakar's comment reminded me that I have already tried the memory worker type in AWS Glue and it still failed. As stated above, I have tried raising the memoryOverhead from 5 to 12, but to no avail. Neither of these made the job complete successfully.
Update
I put in the following code for the column name change instead of the code above, for easier debugging:
print('Changing names...')
name_counter = 0
for old_name in data_frame.schema.names:
    print('Name number {}. name being changed: {}'.format(name_counter, old_name))
    data_frame = data_frame.withColumnRenamed(old_name, old_name.replace('.','_').lower())
    name_counter += 1
print('Done.')
And I got the following output
Removing . and _ from names for root frame...
Repartitioning data frame...
Done.
Changing names...
End of LogType:stdout
So it must be a problem with the data_frame.schema.names part. Could it be this line with my loop through all of the DynamicFrames? Am I looping through the DynamicFrames from the relationalize transformation correctly?
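For reference, the single-pass toDF() rename I mentioned trying earlier looks roughly like this (just a sketch; it builds all the new names once instead of chaining withColumnRenamed calls):
new_names = [c.replace('.', '_').lower() for c in data_frame.columns]
data_frame = data_frame.toDF(*new_names)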
Update 2
Glue recently added more verbose logs and I found this
ERROR YarnClusterScheduler: Lost executor 396 on ip-172-32-78-221.ec2.internal: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
This happens for more than just this executor too; it looks like almost all of them.
I can try to increase the executor memory overhead, but I would like to know why getting the column names results in an OOM error. I wouldn't think that something that trivial would take up that much memory?
Update
I attempted to run the job with both spark.driver.memoryOverhead=7g and spark.yarn.executor.memoryOverhead=7g and I again got an OOM error
Using linear regression, I am trying to predict the power generated based on the values of temperature, vacuum, pressure and humidity. The model is inspired by and adapted from "http://datascience-enthusiast.com", and I am applying it to real-time data from a Kafka topic. I generated the pickled .pkl.z file and converted it to PMML using JPMML as suggested at https://github.com/jpmml/jpmml-sklearn.
The Kafka producer, a Python program (kafka_producer.py), randomly generates data within some ranges as floats, converts them to strings, and sends them to a Kafka topic as bytes.
The Kafka consumer, a Python program (kafka_consumer.py) that acts as an Openscoring Python client, reads the data from the Kafka topic, converts the byte string to a string and finally to a dictionary, which forms the arguments (e.g. arguments = {"AT" : 9.2, "V" : 39.82, "AP" : 1013.19, "RH" : 91.25}) for the result = os.evaluate("CCPP", arguments) statement.
It works well and predicts power, but after showing the result correctly for 4 to 10 records, the Openscoring server throws:
SEVERE: INFO:
Received EvaluationRequest{id=null, arguments={AT=12.12, V=41.35, AP=1031.67, RH=66.32}}
Nov 20, 2017 6:39:16 AM org.openscoring.service.ModelResource evaluate
INFO: Returned EvaluationResponse{id=null, result={PE=472.152110955029}}
Nov 20, 2017 6:39:17 AM org.openscoring.service.ModelResource evaluate
INFO: Received EvaluationRequest{id=null, arguments={AT=34.06, V=51.53, AP=1016.22, RH=91.7}}
Nov 20, 2017 6:39:17 AM org.openscoring.service.ModelResource evaluate
INFO: Returned EvaluationResponse{id=null, result={PE=444.9147880324237}}
Nov 20, 2017 6:39:18 AM org.openscoring.service.ModelResource evaluate
INFO: Received EvaluationRequest{id=null, arguments={AT=20.41, V=50.33, AP=1018.19, RH=100.18}}
Nov 20, 2017 6:39:18 AM org.openscoring.service.ModelResource doEvaluate
**SEVERE: Failed to evaluate**
org.jpmml.evaluator.InvalidResultException (at or around line 130)
at org.jpmml.evaluator.FieldValueUtil.performInvalidValueTreatment(FieldValueUtil.java:178) at org.jpmml.evaluator.FieldValueUtil.prepareInputValue(FieldValueUtil.java:90)
at org.jpmml.evaluator.InputField.prepare(InputField.java:64)
The Kafka consumer then stops and shows:
raise Exception(self.message)
Exception: Bad Request
kafka_producer.py
import random
import time
from kafka import KafkaProducer
from kafka.errors import KafkaError
producer = KafkaProducer(bootstrap_servers='localhost:9092')
topic = "power"
for i in range(1000):
    # unused placeholder values
    AT = "19.651231"
    V = "54.305804"
    AP = "1013.259078"
    RH = "73.308978"

    def getAT():
        return str(round(random.uniform(2.0, 38.0), 2))

    def getV():
        return str(round(random.uniform(26.0, 81.5), 2))

    def getAP():
        return str(round(random.uniform(993.0, 1033.0), 2))

    def getRH():
        return str(round(random.uniform(26.0, 101.0), 2))

    # arguments = {"AT" : 9.2, "V" : 39.82, "AP" : 1013.19, "RH" : 91.25}
    message = '{"AT" : ' + getAT() + ', "V" : ' + getV() + ', "AP" : ' + getAP() + ', "RH" : ' + getRH() + '}'
    producer.send(topic, key=str.encode('key_{}'.format(i)), value=message.encode('utf-8'))
    time.sleep(1)
producer.close()
kafka_consumer.py
import ast
from kafka import KafkaConsumer
import openscoring
import os
os = openscoring.Openscoring("http://localhost:8080/openscoring")
kwargs = {"auth" : ("admin", "adminadmin")}
os.deploy("CCPP", "/home/gopinathankm/jpmml-sklearn-master/ccpp.pmml", **kwargs)
consumer = KafkaConsumer('power', bootstrap_servers='localhost:9092')
for message in consumer:
    arguments = message.value
    argsdict = arguments.decode("utf-8")
    dict = ast.literal_eval(argsdict)
    print(dict)
    result = os.evaluate("CCPP", dict)
    print(result)
For some of the generated data it does not work, and I really don't know why the bad request occurs.
Any assistance will be highly appreciated.
Regards
Gopinathan K.M
I got assistance and a solution from Villu Ruusmann, shared below so that others may benefit:
https://github.com/jpmml/jpmml-evaluator/issues/84
"Exception type InvalidResultException means that the evaluation of a model could not be completed successfully, because one or more input field values are outside of their declared range.
The value ranges of your PMML document and Python script do not match. Either delete PMML document value ranges (so that all input values are considered to be valid), or contract Python script value ranges." as suggested by Villu Ruusmann.
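A minimal sketch of the second option (contracting the Python script's value ranges) is to clamp each generated value to the range the model was trained on before sending it; the bounds below are placeholders, not the actual training ranges:
# Placeholder bounds -- replace with the min/max of each feature in your training data
TRAINING_RANGES = {"AT": (2.0, 37.0), "V": (26.0, 81.0), "AP": (993.0, 1033.0), "RH": (26.0, 100.0)}

def clamp(name, value):
    lo, hi = TRAINING_RANGES[name]
    return max(lo, min(hi, value))

# e.g. in kafka_producer.py:  AT = str(clamp("AT", round(random.uniform(2.0, 38.0), 2)))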
How do I handle geolocated data using the k-means clustering algorithm here? Can somebody please share your input? Thanks in advance.
Project_2_Dataset.txt file entries look like this
=================================================
33.68947543 -117.5433083
37.43210889 -121.4850296
39.43789083 -120.9389785
39.36351868 -119.4003347
33.19135811 -116.4482426
33.83435437 -117.3300009
Please review my Code here:
============================
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering.KMeans
val data = sc.textFile("Project_2_Dataset.txt")
val parsedData = data.map( line => Vectors.dense(line.split(',').map(_.toDouble)))
val kmmodel = KMeans.train(parsedData, 3, 5)   // 3 clusters, 5 iterations
17/06/17 13:12:20 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 2)
java.lang.NumberFormatException: For input string: "33.68947543 -117.5433083"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
Thanks
Amit K
I think it is because you try to split each line at char ',' instead of ' '.
# "33.19135811 -116.4482426".toDouble
java.lang.NumberFormatException: For input string: "33.19135811 -116.4482426"
...
# "33.19135811 -116.4482426".split(',').map(_.toDouble)
java.lang.NumberFormatException: For input string: "33.19135811 -116.4482426"
...
# "33.19135811 -116.4482426".split(' ').map(_.toDouble)
res3: Array[Double] = Array(33.19135811, -116.4482426)
In the previous case we were able to apply the split to a single value ("33.19135811 -116.4482426".split(' ').map(_.toDouble)), but it seems that when we apply the same split to multiple lines of data, I get this error:
33.68947543 -117.5433083
37.43210889 -121.4850296
39.43789083 -120.9389785
39.36351868 -119.4003347
scala> val kmmodel= KMeans.train(parsedData,3,5)
17/06/29 19:14:36 ERROR Executor: Exception in task 1.0 in stage 6.0 (TID 8)
java.lang.NumberFormatException: empty String
I have captured raw packets using Python's sockets:
import socket

s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.ntohs(0x0003))
while True:
    message = s.recv(4096)
    test = []
    print(len(message))
    print(repr(message))
I assumed that the packet returned would be in hex string format; however, print(repr(message)) gets me something like this:
b'\x00\x1b\xac\x00Gd\x00\x14\xd1+\x1f\x19\x05\n\x124VxC!UUUU\x00\x00\x00\x00\xcd\xcc\xcc=\xcd\xccL>\x9a\x99\x99>\xcd\xcc\xcc>\x00\x00\x00?\x9a\x......'
which has weird non-hex characters like !UUUU or =. What encoding is this, and how do I decode the packet?
I know what the packet looks like ahead of time for now, since I'm the one generating the packets using winpcapy:
from ctypes import *
from winpcapy import *
import zlib
import binascii
import sys
import time
from ChanPackets import base, FrMessage, FrTodSync, FrChanConfig, FlChan, RlChan

while (1):
    now = time.time()
    errbuf = create_string_buffer(PCAP_ERRBUF_SIZE)
    fp = pcap_t
    deviceName = b'\\Device\\NPF_{8F5BD2E9-253F-4659-8256-B3BCD882AFBC}'
    fp = pcap_open_live(deviceName, 65536, 1, 1000, errbuf)
    if not bool(fp):
        print("\nUnable to open the adapter. %s is not supported by WinPcap\n" % deviceName)
        sys.exit(2)
    # FrMessage is a custom class that creates the packet
    test = FrMessage('00:1b:ac:00:47:64', '00:14:d1:2b:1f:19', 0x12345678, 0x4321, 0x55555555, list(i/10 for i in range(320)))
    # test.get_Raw_Packet() returns a c_bytes array needed for winpcap to send the packet
    if (pcap_sendpacket(fp, test.get_Raw_Packet(), test.packet_size) != 0):
        print("\nError sending the packet: %s\n" % pcap_geterr(fp))
        sys.exit(3)
    elapsed = time.time() - now
    if elapsed < 0.02 and elapsed > 0:
        time.sleep(0.02 - elapsed)
pcap_close(fp)
Note: I would like to get an array of hex values representing each byte
What encoding is this, and how do I decode the packet?
What you see is the representation of a bytes object in Python. As you might have guessed, \xab represents the byte 0xab (171).
which has weird non-hex characters like !UUUU or =
Printable ASCII characters represent themselves, i.e., instead of \x55 the representation contains just U.
What you have is a sequence of bytes. How to decode them depends on your application. For example, to decode a data packet that contains an Ethernet frame, you could use scapy (Python 2):
>>> b = '\x00\x02\x157\xa2D\x00\xae\xf3R\xaa\xd1\x08\x00E\x00\x00C\x00\x01\x00\x00#\x06x<\xc0\xa8\x05\x15B#\xfa\x97\x00\x14\x00P\x00\x00\x00\x00\x00\x00\x00\x00P\x02 \x00\xbb9\x00\x00GET /index.html HTTP/1.0 \n\n'
>>> c = Ether(b)
>>> c.hide_defaults()
>>> c
<Ether dst=00:02:15:37:a2:44 src=00:ae:f3:52:aa:d1 type=0x800 |
<IP ihl=5L len=67 frag=0 proto=tcp chksum=0x783c src=192.168.5.21 dst=66.35.250.151 |
<TCP dataofs=5L chksum=0xbb39 options=[] |
<Raw load='GET /index.html HTTP/1.0 \n\n' |>>>>
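Once dissected, individual fields can be read as attributes of the layers, for example (values taken from the dissection above):
>>> c[IP].src
'192.168.5.21'
>>> c[TCP].dport
80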
I would like to get an array of hex values representing each byte
You could use binascii.hexlify():
>>> pkt = b'\x00\x1b\xac\x00Gd\x00'
>>> import binascii
>>> binascii.hexlify(pkt)
b'001bac00476400'
or, if you want a list of string hex values:
>>> hexvalue = binascii.hexlify(pkt).decode()
>>> [hexvalue[i:i+2] for i in range(0, len(hexvalue), 2)]
['00', '1b', 'ac', '00', '47', '64', '00']
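On Python 3.5+ the same result is available without binascii via bytes.hex(), and since Python 3.8 it accepts a separator, which makes the per-byte list a one-liner:
>>> pkt.hex()
'001bac00476400'
>>> pkt.hex(' ').split()   # the separator argument requires Python 3.8+
['00', '1b', 'ac', '00', '47', '64', '00']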
In Python, raw packet decoding can be done using scapy classes such as IP(), TCP(), and UDP():
import sys
import socket
from scapy.all import *
s = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_TCP)
while 1:
    packet = s.recvfrom(2000)
    packet = packet[0]
    ip = IP(packet)
    ip.show()