k-means Clustering geolocated data using Spark/Scala

How do I cluster geolocated data with the k-means algorithm? Can somebody please share their input here? Thanks in advance.
Project_2_Dataset.txt file entries look like this
=================================================
33.68947543 -117.5433083
37.43210889 -121.4850296
39.43789083 -120.9389785
39.36351868 -119.4003347
33.19135811 -116.4482426
33.83435437 -117.3300009
Please review my Code here:
============================
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering.KMeans
val data = sc.textFile("Project_2_Dataset.txt")
val parsedData = data.map( line => Vectors.dense(line.split(',').map(_.toDouble)))
val kmmodel = KMeans.train(parsedData, 3, 5) // 3 clusters, 5 iterations
17/06/17 13:12:20 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 2)
java.lang.NumberFormatException: For input string: "33.68947543 -117.5433083"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
Thanks
Amit K

I think it is because you try to split each line at char ',' instead of ' '.
# "33.19135811 -116.4482426".toDouble
java.lang.NumberFormatException: For input string: "33.19135811 -116.4482426"
...
# "33.19135811 -116.4482426".split(',').map(_.toDouble)
java.lang.NumberFormatException: For input string: "33.19135811 -116.4482426"
...
# "33.19135811 -116.4482426".split(' ').map(_.toDouble)
res3: Array[Double] = Array(33.19135811, -116.4482426)
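For reference, here is a minimal sketch of the corrected pipeline with the split character changed to a space (the file name and the cluster/iteration counts are taken from the question):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering.KMeans

val data = sc.textFile("Project_2_Dataset.txt")
// split on a space, not a comma, so each token parses as a Double
val parsedData = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble))).cache()
val kmmodel = KMeans.train(parsedData, 3, 5) // 3 clusters, 5 iterations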

In the previous case we were able to apply the split to a single record ("33.19135811 -116.4482426".split(' ').map(_.toDouble)), but when I apply the same split to the full set of records, I get this error:
33.68947543 -117.5433083
37.43210889 -121.4850296
39.43789083 -120.9389785
39.36351868 -119.4003347
scala> val kmmodel= KMeans.train(parsedData,3,5)
17/06/29 19:14:36 ERROR Executor: Exception in task 1.0 in stage 6.0 (TID 8)
java.lang.NumberFormatException: empty String
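The empty-String NumberFormatException usually means the input contains blank lines or runs of consecutive spaces, so the split produces empty tokens. A defensive sketch, assuming the same file and whitespace-separated values:
val data = sc.textFile("Project_2_Dataset.txt")
val parsedData = data
  .map(_.trim)
  .filter(_.nonEmpty)  // drop blank lines
  .map(line => Vectors.dense(line.split("\\s+").map(_.toDouble)))  // split on runs of whitespace
  .cache()
val kmmodel = KMeans.train(parsedData, 3, 5)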

Related

apache-spark count filtered words from textfilestream

Hi, I am new to Scala and I am using IntelliJ IDEA. I am trying to stream a text file with Spark Streaming on a Hadoop/Spark cluster. My main goal is to count only words (without any special characters) in a (key, value) format.
I found the "apache-spark regex extract words from rdd" article, where they use the findAllIn() function with a regex, but I am not sure I am using it correctly in my code.
When I build my project and generate the jar file to run it in Spark, I provide the text file manually. It seems to run and count words, but it also seems to enter a loop, processing the file indefinitely. I thought it should process it only once.
Can someone tell me why this may be happening, or is there a better way to achieve my goal?
Part of my text is:
It wlll only take a day,' he said. The others disagreed.
It's too fragile," 289 they said disapprovingly 23 age, but he refused to listen. Not
quite so lazy, the second little pig went in search of planks of seasoned 12 hola, 1256 23.
My code is:
package streaming
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
object NetworkWordCount {
def main(args: Array[String]): Unit = {
if (args.length < 1) {
System.err.println("Usage: HdfsWordCount <directory>")
System.exit(1)
}
//StreamingExamples.setStreamingLogLevels()
val sparkConf = new SparkConf().setAppName("HdfsWordCount").setMaster("local")
// Create the context
val ssc = new StreamingContext(sparkConf, Seconds(5))
// Create the FileInputDStream on the directory and use the
// stream to count words in new files created
val lines = ssc.textFileStream(args(0))
val sep_words = lines.flatMap("[a-zA-Z]+".r.findAllIn(_))
val wordCounts = sep_words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.saveAsTextFiles("hdfs:///user/test/task1.txt")
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
}
My expected output is something like below but as mentioned before it keeps repeating:
22/10/24 05:11:40 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
22/10/24 05:11:40 INFO Executor: Finished task 0.0 in stage 27.0 (TID 20). 1529 bytes result sent to driver
22/10/24 05:11:40 INFO TaskSetManager: Finished task 0.0 in stage 27.0 (TID 20) in 18 ms on localhost (executor driver) (1/1)
22/10/24 05:11:40 INFO TaskSchedulerImpl: Removed TaskSet 27.0, whose tasks have all completed, from pool
22/10/24 05:11:40 INFO DAGScheduler: ResultStage 27 (print at NetworkWordCount.scala:28) finished in 0.031 s
22/10/24 05:11:40 INFO DAGScheduler: Job 13 finished: print at NetworkWordCount.scala:28, took 0.036092 s
-------------------------------------------
Time: 1666588300000 ms
-------------------------------------------
(heaped,1)
(safe,1)
(became,1)
(For,1)
(ll,1)
(it,1)
(Let,1)
(Open,1)
(others,1)
(pack,1)
...
22/10/24 05:11:40 INFO JobScheduler: Finished job streaming job 1666588300000 ms.1 from job set of time 1666588300000 ms
22/10/24 05:11:40 INFO JobScheduler: Total delay: 0.289 s for time 1666588300000 ms (execution: 0.222 s)
22/10/24 05:11:40 INFO ShuffledRDD: Removing RDD 40 from persistence list
22/10/24 05:11:40 INFO BlockManager: Removing RDD 40
22/10/24 05:11:40 INFO MapPartitionsRDD: Removing RDD 39 from persistence list
22/10/24 05:11:40 INFO BlockManager: Removing RDD 39
Additionally, I notice that when I use only the split() function, without any regex or findAllIn(), it processes the file only once as expected, but of course it only splits the text by spaces.
Something like this:
val lines = ssc.textFileStream(args(0))
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
I almost forgot: if you can also explain the meaning of the underscores in the code, it will help me a lot. I am having some trouble understanding that part too.
Thanks in advance.
If your file doesn't change while you're parsing it, you don't need a StreamingContext.
Just create SparkContext:
val sc = new SparkContext(new SparkConf().setAppName("HdfsWordCount").setMaster("local"))
and then process your file in the way you need.
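A minimal batch sketch of that idea, keeping the same regex-based word count (the input and output paths below are placeholders):
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("HdfsWordCount").setMaster("local"))
// read the file once, keep only alphabetic words, and count them
val lines = sc.textFile("hdfs:///user/test/input/sample.txt")  // placeholder path
val words = lines.flatMap("[a-zA-Z]+".r.findAllIn(_))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.saveAsTextFile("hdfs:///user/test/output")  // placeholder path, written once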
You will need a StreamingContext if you have data sources that change and you need to process the changed delta continuously.
There are many different ways to use the underscore in Scala, but in your case the underscore is just syntactic sugar that shortens a lambda function. It stands for an element of the collection.
You can rewrite the underscore code like this:
//both lines doing the same
.flatMap("[a-zA-Z]+".r.findAllIn(_))
.flatMap((s: String) => "[a-zA-Z]+".r.findAllIn(s))
And
//both lines doing the same
.map(x => (x, 1)).reduceByKey(_ + _)
.map(x => (x, 1)).reduceByKey((l, r) => l + r)
For a better understanding of this case and other usages, read some articles on underscores in Scala.
Well, I think I found my main error. I was telling Spark to monitor the same directory where I was creating my output files, so my guess is that it was triggering a new event every time it saved the processed output. Once I created separate input and output directories, the repeating output stopped and the job worked as expected. It only generates new output when I manually provide a new file to the server.
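For illustration, a sketch of that fix with separate input and output directories (the paths are hypothetical), so the stream never watches the directory it writes to:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("HdfsWordCount").setMaster("local"), Seconds(5))
// watch only the input directory...
val lines = ssc.textFileStream("hdfs:///user/test/input")
val words = lines.flatMap("[a-zA-Z]+".r.findAllIn(_))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
// ...and write results to a directory the stream is not watching
wordCounts.saveAsTextFiles("hdfs:///user/test/output/task1")
ssc.start()
ssc.awaitTermination()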

FileNotFound Error in Spark

I am running a simple Spark program on a cluster:
val logFile = "/home/hduser/README.md" // Should be some file on your system
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println()
println()
println()
println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
println()
println()
println()
println()
println()
and I get the following error:
15/10/27 19:44:01 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 6) on
executor 192.168.0.19: java.io.FileNotFoundException (File
file:/home/hduser/README.md does not exist.) [duplicate 6]
15/10/27 19:44:01 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times;
aborting job
15/10/27 19:44:01 INFO TaskSetManager: Lost task 1.3 in stage 0.0 (TID 7)
on executor 192.168.0.19: java.io.FileNotFoundException (File
file:/home/hduser/README.md does not exist.) [duplicate 7]
15/10/27 19:44:01 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks
have all completed, from pool
15/10/27 19:44:01 INFO TaskSchedulerImpl: Cancelling stage 0
15/10/27 19:44:01 INFO DAGScheduler: ResultStage 0 (count at
SimpleApp.scala:55) failed in 7.636 s
15/10/27 19:44:01 INFO DAGScheduler: Job 0 failed: count at
SimpleApp.scala:55, took 7.810387 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 192.168.0.19): java.io.FileNotFoundException: File file:/home/hduser/README.md does not exist.
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:125)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:78)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:51)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The file is in the correct place. If I replace README.md with README.txt, it works fine. Can someone help with this?
Thanks
If you are running a multinode cluster, make sure all nodes have the file in the same given path, with respect to their own filesystem. Or, you know, just use HDFS.
In the multinode case the path "/home/hduser/README.md" is distributed to the worker nodes as well, but README.md probably exists only on the master node. When the workers try to access this file, they won't look into the master's filesystem; instead, each will try to find it on its own filesystem. If the same file exists at the same path on every node, the code is very likely to work. To achieve that, copy the file to every node's filesystem using the same path.
As you've already noticed, the solution above is cumbersome. Hadoop FS, HDFS, solves this issue and many more. You should look into it.
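A sketch of the HDFS route, assuming the file has first been copied into HDFS (for example with hdfs dfs -put /home/hduser/README.md /user/hduser/ — the target directory is just an example):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
// every executor can resolve an HDFS path, unlike a file:// path that exists on only one node
val logData = sc.textFile("hdfs:///user/hduser/README.md").cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))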
It is simply because a file with the extension .md contains plain text with formatting information. When you save this file with the .txt extension, the formatting information is removed or not considered. sc.textFile() works with plain text.

Scala Spark SQLContext Program throwing array out of bound exception

I am new to Apache Spark. I am trying to create a schema and load data from hdfs. Below is my code:
// importing sqlcontext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
//defining the schema
case class Author1(Author_Key: Long, Author_ID: Long, Author: String, First_Name: String, Last_Name: String, Middle_Name: String, Full_Name: String, Institution_Full_Name: String, Country: String, DIAS_ID: Int, R_ID: String)
val D_Authors1 =
sc.textFile("hdfs:///user/D_Authors.txt")
.map(_.split("\\|"))
.map(auth => Author1(auth(0).trim.toLong, auth(1).trim.toLong, auth(2), auth(3), auth(4), auth(5), auth(6), auth(7), auth(8), auth(9).trim.toInt, auth(10)))
//register the table
D_Authors1.registerAsTable("D_Authors1")
val auth = sqlContext.sql("SELECT * FROM D_Authors1")
sqlContext.sql("SELECT * FROM D_Authors").collect().foreach(println)
When I execute this code, it throws an array index out of bounds exception. Below is the error:
14/08/18 06:57:14 INFO Analyzer: Max iterations (2) reached for batch MultiInstanceRelations
14/08/18 06:57:14 INFO Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences
14/08/18 06:57:14 INFO SQLContext$$anon$1: Max iterations (2) reached for batch Add exchange
14/08/18 06:57:14 INFO SQLContext$$anon$1: Max iterations (2) reached for batch Prepare Expressions
14/08/18 06:57:14 INFO FileInputFormat: Total input paths to process : 1
14/08/18 06:57:14 INFO SparkContext: Starting job: collect at <console>:24
14/08/18 06:57:14 INFO DAGScheduler: Got job 5 (collect at <console>:24) with 2 output partitions (allowLocal=false)
14/08/18 06:57:14 INFO DAGScheduler: Final stage: Stage 5(collect at <console>:24)
14/08/18 06:57:14 INFO DAGScheduler: Parents of final stage: List()
14/08/18 06:57:14 INFO DAGScheduler: Missing parents: List()
14/08/18 06:57:14 INFO DAGScheduler: Submitting Stage 5 (SchemaRDD[26] at RDD at SchemaRDD.scala:98
== Query Plan ==
ExistingRdd [Author_Key#22L,Author_ID#23L,Author#24,First_Name#25,Last_Name#26,Middle_Name#27,Full_Name#28,Institution_Full_Name#29,Country#30,DIAS_ID#31,R_ID#32], MapPartitionsRDD[23] at mapPartitions at basicOperators.scala:174), which has no missing parents
14/08/18 06:57:14 INFO DAGScheduler: Submitting 2 missing tasks from Stage 5 (SchemaRDD[26] at RDD at SchemaRDD.scala:98
== Query Plan ==
ExistingRdd [Author_Key#22L,Author_ID#23L,Author#24,First_Name#25,Last_Name#26,Middle_Name#27,Full_Name#28,Institution_Full_Name#29,Country#30,DIAS_ID#31,R_ID#32], MapPartitionsRDD[23] at mapPartitions at basicOperators.scala:174)
14/08/18 06:57:14 INFO YarnClientClusterScheduler: Adding task set 5.0 with 2 tasks
14/08/18 06:57:14 INFO TaskSetManager: Starting task 5.0:0 as TID 38 on executor 1: orf-bat.int..com (NODE_LOCAL)
14/08/18 06:57:14 INFO TaskSetManager: Serialized task 5.0:0 as 4401 bytes in 1 ms
14/08/18 06:57:15 INFO TaskSetManager: Starting task 5.0:1 as TID 39 on executor 1: orf-bat.int..com (NODE_LOCAL)
14/08/18 06:57:15 INFO TaskSetManager: Serialized task 5.0:1 as 4401 bytes in 0 ms
14/08/18 06:57:15 WARN TaskSetManager: Lost TID 38 (task 5.0:0)
14/08/18 06:57:15 WARN TaskSetManager: Loss was due to java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException: 10
at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$1.next(Iterator.scala:853)
at scala.collection.Iterator$$anon$1.head(Iterator.scala:840)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:179)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:174)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:110)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
14/08/18 06:57:15 WARN TaskSetManager: Lost TID 39 (task 5.0:1)
14/08/18 06:57:15 WARN TaskSetManager: Loss was due to java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException: 9
at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$1.next(Iterator.scala:853)
at scala.collection.Iterator$$anon$1.head(Iterator.scala:840)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:179)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:174)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:110)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Your problem has nothing to do with Spark.
Format your code correctly (I have corrected it).
Don't mix camel case and underscore naming: use underscores for SQL fields and camel case for Scala vals.
When you get an exception, read it; it usually tells you what you are doing wrong. In your case it's probably that some of the records in hdfs:///user/D_Authors.txt are not in the shape you expect.
When you get an exception, debug it: try actually catching the exception and printing out the records that fail to parse.
_.split("\\|") drops trailing empty strings; use _.split("\\|", -1) to keep them.
In Scala you don't need magic indexes to manually access elements of an array; it's ugly and more error-prone. Use a pattern match instead.
Here is a simple example which includes handling of malformed records:
case class Author(author: String, authorAge: Int)
myData.map(_.split("\t", -1) match {
  case Array(author, authorAge) => Author(author, authorAge.toInt)
  case unexpectedArrayForm =>
    throw new RuntimeException("Record did not have correct number of fields: " +
      unexpectedArrayForm.mkString("\t"))
})
Now if you coded it like this, your exception would tell you straight away exactly what is wrong with your data.
One final point/concern: why are you using Spark SQL? Your data is in text form; are you trying to transform it into, say, Parquet? If not, why not just use the regular Scala API to perform your analysis? Moreover, it is type checked and compile-time checked, unlike SQL.
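For illustration only, a sketch (not the asker's exact code) that adapts the pattern-match parsing to the 11-field Author1 record and runs a simple analysis with the plain RDD API; the file path is the one from the question:
case class Author1(authorKey: Long, authorId: Long, author: String, firstName: String,
                   lastName: String, middleName: String, fullName: String,
                   institutionFullName: String, country: String, diasId: Int, rId: String)

val authors = sc.textFile("hdfs:///user/D_Authors.txt").map { line =>
  line.split("\\|", -1) match {
    case Array(key, id, author, first, last, middle, full, inst, country, dias, rid) =>
      Author1(key.trim.toLong, id.trim.toLong, author, first, last, middle, full, inst,
              country, dias.trim.toInt, rid)
    case unexpected =>
      throw new RuntimeException("Record did not have 11 fields: " + unexpected.mkString("|"))
  }
}

// example analysis without SQL: number of authors per country
authors.map(_.country).countByValue().foreach(println)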

Shark getting started: all queries hanging

I am a newbie with Shark, though I do have some experience with Spark. Every attempt to retrieve data from Shark hangs.
As a preliminary step, let's ensure that Spark is up and healthy:
spark>
val tf = sc.textFile("hdfs://10.213.39.125:8020/hadoop/example/20417.txt")
val c = tf.count
..
14/04/10 19:44:34 INFO SparkContext: Job finished: count at <console>:14, took 0.161135127 s
c: Long = 12761
I have checked carefully that shark-env.sh points to the Spark installation correctly.
Now let us go to Shark and try (a) the same file read and (b) a Shark table read:
(a)
shark>
val tf = sc.textFile("hdfs://10.213.39.125:8020/hadoop/example/20417.txt")
tf: org.apache.spark.rdd.RDD[String] = MappedRDD[4] at textFile at <console>:17
scala> val c2 = tf.count
(wait minutes .. finally do control -c)
shark>
sc.makeRDD("select * from dual")
res1: org.apache.spark.rdd.RDD[Char] = ParallelCollectionRDD[2] at makeRDD at <console>:18
scala> res1.collect
(Once again: wait minutes .. finally do control -c)
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:485)
at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:62)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:313)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:725)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:744)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:758)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:772)
at org.apache.spark.rdd.RDD.collect(RDD.scala:560)
More Details
Here are pertinent sections of the shark-env.sh
export SPARK_MEM=2g
# (Required) Set the master program's memory
export SHARK_MASTER_MEM=1g
# (Required) Point to your Scala installation.
export SCALA_HOME="/usr/local/scala-2.9.3"
# (Required) Point to the patched Hive binary distribution
export HIVE_HOME="/home/guest/shark-0.8.0-bin-hadoop1/hive-0.9.0-shark-0.8.0-bin"
# For running Shark in distributed mode, set the following:
export HADOOP_HOME="/usr/local/hadoop"
export SPARK_HOME="/home/guest/spark-0.8.0"
export MASTER="spark://swlab-r03-16L:17087"
From the Shark shell, let us ensure we are talking to the same Spark server:
scala> sc.sparkHome
res0: String = /home/guest/spark-0.8.0
scala> sc.isLocal
res1: Boolean = false
scala> sc.master
res2: String = spark://swlab-r03-16L:17087
It turned out there were Hive metastore configuration issues. The metastore parameters are under shark-hive-/conf/hive-site.xml.

Converting datetime string to integer (seconds ) then adding ms

I am writing a program that essentially pulls a line out of a logfile, parses it, and returns the parsed data in a simplified form. My main problem currently is how I should parse my datetime strings. Here is an example of lines from the log.
Example of Logfile:
2012-06-12 14:02:16,341 [main] INFO ---
2012-06-12 14:02:16,509 [main] INFO ---
2012-06-12 14:02:17,000 [main] INFO ---
2012-06-12 14:02:17,112 [main] INFO ---
2012-06-12 14:02:20,338 [main] INFO ---
2012-06-12 14:02:21,813 [main] INFO ---
My parsing code so far (very rough):
import re
import time

class LogLine:
    SEVERITIES = ['EMERG', 'ALERT', 'CRIT', 'ERR', 'WARNING', 'NOTICE', 'INFO', 'DEBUG']
    severity = 1

    def __init__(self, line):
        try:
            # this fails: re.match returns a match object, not a tuple of captured fields
            t, s, self.filename, n, self.message = \
                re.match(r"^(\d\d\d\d-\d\d-\d\d[ \t]\d\d:\d\d:\d\d,\d\d\d)", line)
            self.line = int(n)
            self.sev = self.SEVERITIES.index(s)
            self.time = time.strptime(t)
        except Exception:
            pass  # the original snippet did not include an except clause

    def get_t(self):
        return

    def get_severity(self):
        return self.SEVERITIES.index(self)

    def get_message(self):
        return

    def get_filename(self):
        return

    def get_line(self):
        return
So basically (if you weren't able to deduce it from my rough code) I am parsing the string with a regular expression to obtain the datetime. I have also been reading about strptime as a possible solution. Ultimately, I need to parse the datetime down to milliseconds and then add the millisecond integer that follows the comma in the timestamp.
I'm sure this question is extremely convoluted and I apologize in advance. Thank you for your help.
>>> datetime.datetime.strptime('2012-06-12 14:02:16,341' + '000', '%Y-%m-%d %H:%M:%S,%f')
datetime.datetime(2012, 6, 12, 14, 2, 16, 341000)
Here's an example of how to parse a line:
>>> # A line from the logfile.
>>> line = "2012-06-12 14:02:16,341 [main] INFO ---"
>>> # Parse the line.
>>> m = re.match(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) \[([^]]*)\] (\S+) (.*)", line)
>>> timestamp, line_number, filename, severity, message = m.groups()
>>> # Show the various captured values.
>>> timestamp
'2012-06-12 14:02:16'
>>> line_number
'341'
>>> filename
'main'
>>> severity
'INFO'
>>> message
'---'