I use PySpark to process data. The data looks like this:
8611060350280948828b33be803 4363 2017-10-01
8611060350280948828b33be803 4363 2017-10-02
4e5556e536714363b195eb8f88becbf8 365 2017-10-01
4e5556e536714363b195eb8f88becbf8 365 2017-10-02
4e5556e536714363b195eb8f88becbf8 365 2017-10-03
4e5556e536714363b195eb8f88becbf8 365 2017-10-04
I created a class to store this data. The code is as follows:
class LogInfo:
    def __init__(self, session_id, sku_id, request_tm):
        self.session_id = session_id
        self.sku_id = sku_id
        self.request_tm = request_tm
The processing code is as follows:
from classFile import LogInfo
from pyspark import SparkContext, SparkConf
conf = SparkConf().setMaster("local[*]")
sc = SparkContext(conf=conf)
orgData = sc.textFile(<dataPath>)
readyData = orgData.map(lambda x: x.split('\t')).\
    filter(lambda x: x[0].strip() != "" and x[1].strip() != "" and x[2].strip() != "").\
    map(lambda x: LogInfo(x[0], x[1], x[2])).groupBy(lambda x: x.session_id).\
    filter(lambda x: len(x[1]) > 3).filter(lambda x: len(x[1]) < 20).\
    map(lambda x: x[1]).sortBy(lambda x: x.request_tm).map(lambda x: x.sku_id)
But the code didn't work. The error message is as below:
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-
hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 177, in main
File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-
hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 172, in process
File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-
hadoop2.7\python\pyspark\rdd.py", line 2423, in pipeline_func
return func(split, prev_func(split, iterator))
File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\pyspark\rdd.py", line 2423, in pipeline_func
return func(split, prev_func(split, iterator))
File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\pyspark\rdd.py", line 2423, in pipeline_func
return func(split, prev_func(split, iterator))
File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\pyspark\rdd.py", line 346, in func
return f(iterator)
File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\pyspark\rdd.py", line 1041, in <lambda>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\pyspark\rdd.py", line 1041, in <genexpr>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\pyspark\rdd.py", line 2053, in <lambda>
return self.map(lambda x: (f(x), x))
File "D:<filePath>", line 15, in <lambda>
map(lambda x: x[1]).sortBy(lambda x:x.request_tm).map(lambda x: x.sku_id)
AttributeError: 'ResultIterable' object has no attribute 'request_tm'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
[Stage 1:> (0 + 5) / 10]17/12/01 17:54:15 WARN TaskSetManager: Lost task 3.0 in stage 1.0 (TID 13, localhost, executor driver): org.apache.spark.api.python.PythonException:
........
I think the main error information is as above. I couldn't figure out where I made a mistake. Could anybody help? Thank you very much!
I think you need to replace this:
map(lambda x: x[1])
with this:
flatMap(lambda x: list(x[1]))
Basically, after the groupBy, x[1] is a ResultIterable object, so if you want to sort its elements you first need to flatten it.
Edit:
If you need a list of sku_id inside the rdd then:
.map(lambda x: [y.sku_id for y in sorted(x[1], key=lambda y: y.request_tm)])
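Putting the pieces together, a minimal sketch of the corrected pipeline (names taken from the question; it yields one list of sku_ids per session, ordered by request_tm):
readyData = orgData.map(lambda x: x.split('\t')).\
    filter(lambda x: x[0].strip() != "" and x[1].strip() != "" and x[2].strip() != "").\
    map(lambda x: LogInfo(x[0], x[1], x[2])).groupBy(lambda x: x.session_id).\
    filter(lambda x: 3 < len(x[1]) < 20).\
    map(lambda x: [y.sku_id for y in sorted(x[1], key=lambda y: y.request_tm)])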
Related
import os
import signal
from subprocess import check_output

def get_pid(name):
    return check_output(["pidof", name])

def main():
    os.kill(get_pid(dsmcad), signal.SIGTERM)  # or signal.SIGKILL

if __name__ == "__main__":
    main()
Getting error:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "<stdin>", line 2, in main
NameError: global name 'dsmcad' is not defined
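For reference, a corrected sketch: dsmcad has to be passed as a string, and since pidof returns text that may contain several space-separated PIDs while os.kill expects an int, the output has to be parsed first.
import os
import signal
from subprocess import check_output

def get_pids(name):
    # pidof prints space-separated PIDs, e.g. "1234 5678\n"
    return [int(pid) for pid in check_output(["pidof", name]).decode().split()]

def main():
    for pid in get_pids("dsmcad"):    # process name as a string
        os.kill(pid, signal.SIGTERM)  # or signal.SIGKILL

if __name__ == "__main__":
    main()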
When my program is the following...
import queue
queue = queue.Queue()
queue = None
queue = queue.Queue()
...my output is the following:
AttributeError: 'NoneType' object has no attribute 'Queue'
But when my program is the following...
import queue
queue = queue.Queue()
queue = None
...no error messages are thrown.
Why is this the case? I need to reinitialize my queue.
When you imported the module queue, you actually created a variable queue referencing an object of type module.
Then, when you created a queue named queue, you redefined the variable queue to be an object of type queue.Queue.
No wonder you could not call queue.Queue() after that!
QED.
See the details:
>>> import queue
>>> type(queue)
<class 'module'>
>>> # Here you redefine the variable queue: the module queue won't be accessible after that
>>> queue = queue.Queue()
>>> type(queue)
<class 'queue.Queue'>
>>> queue
<queue.Queue object at ***>
>>> # Here I try to call Queue() on an object of type Queue...
>>> queue = queue.Queue()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'Queue' object has no attribute 'Queue'
>>> queue = None
>>> # And here I try to call Queue() on an object of type None...
>>> queue = queue.Queue()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'Queue'
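If you need to reinitialize the queue later, one way is to keep the class under its own name, for example by importing it directly (a minimal sketch):
>>> from queue import Queue
>>> q = Queue()
>>> q = None
>>> q = Queue()  # works: the name Queue still refers to the class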
I need to generate an SHA-256 checksum from a string that will be sent as a get param.
I found this link to generate the checksum.
Generating the checksum like so:
val digest = MessageDigest.getInstance("SHA-256");
private def getCheckSum() = {
println(new String(digest.digest(("Some String").getBytes(StandardCharsets.UTF_8))))
}
prints a checksum similar to this:
*║┼¼┬]9AòdJb:#↓o6↓T╞B5C♀¼O~╟╙àÿG
The API that we need to send this to says the checksum should look like this:
45e00158bc8454049b7208e76670466d49a5dfb2db4196
What am I doing wrong?
Please advise.
Thanks.
Equivalent, but a bit more efficient:
MessageDigest.getInstance("SHA-256")
.digest("some string".getBytes("UTF-8"))
.map("%02x".format(_)).mkString
java.security.MessageDigest#digest gives a byte array.
scala> import java.security.MessageDigest
scala> import java.math.BigInteger
scala> MessageDigest.getInstance("SHA-256").digest("some string".getBytes("UTF-8"))
res1: Array[Byte] = Array(97, -48, 52, 71, 49, 2, -41, -38, -61, 5, -112, 39, 112, 71, 31, -43, 15, 76, 91, 38, -10, -125, 26, 86, -35, -112, -75, 24, 75, 60, 48, -4)
To create the hex, use String.format,
scala> val hash = String.format("%032x", new BigInteger(1, MessageDigest.getInstance("SHA-256").digest("some string".getBytes("UTF-8"))))
hash: String = 61d034473102d7dac305902770471fd50f4c5b26f6831a56dd90b5184b3c30fc
You can verify hash with command line tool in linux, unix
$ echo -n "some string" | openssl dgst -sha256
61d034473102d7dac305902770471fd50f4c5b26f6831a56dd90b5184b3c30fc
NOTE:
In case Java returns a hash shorter than 64 characters (e.g. 39, because leading zero bytes are dropped by "%032x"), you can left-pad it with zeros:
def hash64(data: String) = {
val hash = String.format(
"%032x",
new BigInteger(1, MessageDigest.getInstance("SHA-256").digest(data.getBytes("UTF-8")))
)
val hash64 = hash.reverse.padTo(64, "0").reverse.mkString
hash64
}
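A small variation that avoids the manual padding is to ask String.format for 64 hex digits up front ("%064x" zero-pads to 64 characters), so the same hash64 could equivalently be written as:
def hash64(data: String): String =
  String.format(
    "%064x",
    new BigInteger(1, MessageDigest.getInstance("SHA-256").digest(data.getBytes("UTF-8")))
  )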
You can use DatatypeConverter.printHexBinary.
Something like:
import javax.xml.bind.DatatypeConverter

DatatypeConverter.printHexBinary(
  MessageDigest
    .getInstance(algorithm)
    .digest("some string".getBytes("UTF-8")))
Since JDK 17, we can use java.util.HexFormat:
import java.security.MessageDigest
import java.util.HexFormat
val bytes = MessageDigest.getInstance("SHA-256")
.digest("any string".getBytes("UTF-8"))
val sha256 = HexFormat.of().formatHex(bytes)
// 1e57a452a094728c291bc42bf2bc7eb8d9fd8844d1369da2bf728588b46c4e75
val another = HexFormat.ofDelimiter(":").withUpperCase().formatHex(bytes)
// 1E:57:A4:52:A0:94:72:8C:29:1B:C4:2B:F2:BC:7E:B8:D9:FD:88:44:D1:36:9D:A2:BF:72:85:88:B4:6C:4E:75
Given a sample text file, how can one use Akka ByteStrings and either convert it to plain text or run a "find" on the ByteString itself?
val file = new File("sample.txt")
val fileSource = SynchronousFileSource(file, 4096)
val messageStream = fileSource.map(chunk => sendMessage(chunk.toString()))
messageStream.to(Sink.foreach(println(_))).run
The "toString()" functionality above literally spits out a string containing the text "ByteString", followed by bytes represented as integers. For example:
chunk.toString() ==> "ByteString(111, 112, 119, 111)"
You can use containsSlice to find a sub-ByteString.
scala> import akka.util.ByteString;
import akka.util.ByteString
scala> val target = ByteString("hello world");
target: akka.util.ByteString = ByteString(104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100)
scala> val sub = ByteString("world")
sub: akka.util.ByteString = ByteString(119, 111, 114, 108, 100)
scala> target.containsSlice(sub)
res0: Boolean = true
If you want to convert an akka.util.ByteString to a String, you can use decodeString:
scala> ByteString("hello").decodeString("UTF-8")
res3: String = hello
See the doc for more detail: http://doc.akka.io/api/akka/2.3.13/index.html#akka.util.ByteString
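Applied to the stream in the question, a minimal sketch is to decode each chunk instead of calling toString() (assuming the file is UTF-8 text; note that a fixed-size chunk boundary can split a multi-byte character):
val messageStream = fileSource.map(chunk => sendMessage(chunk.decodeString("UTF-8")))
messageStream.to(Sink.foreach(println(_))).run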
I'm getting this error when importing scipy.stats:
import scipy.stats
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.6/site-packages/scipy/stats/__init__.py", line 322, in <module>
from stats import *
File "/usr/lib64/python2.6/site-packages/scipy/stats/stats.py", line 194, in <module>
import scipy.linalg as linalg
File "/usr/lib64/python2.6/site-packages/scipy/linalg/__init__.py", line 116, in <module>
from basic import *
File "/usr/lib64/python2.6/site-packages/scipy/linalg/basic.py", line 12, in <module>
from lapack import get_lapack_funcs
File "/usr/lib64/python2.6/site-packages/scipy/linalg/lapack.py", line 15, in <module>
from scipy.linalg import clapack
ImportError: /usr/lib64/python2.6/site-packages/scipy/linalg/clapack.so: undefined symbol: clapack_sgesv
Looks like clapack.so links against the full ATLAS version of libatlas:
ldd /usr/lib64/python2.6/site-packages/scipy/linalg/clapack.so
linux-vdso.so.1 => (0x00007fff232e6000)
liblapack.so.3 => /usr/lib64/liblapack.so.3 (0x00007f23b8ad7000)
libptf77blas.so.3 => /usr/lib64/atlas/libptf77blas.so.3 (0x00007f23b88b7000)
libptcblas.so.3 => /usr/lib64/atlas/libptcblas.so.3 (0x00007f23b8697000)
libatlas.so.3 => /usr/lib64/atlas/libatlas.so.3 (0x00007f23b8120000)
libpython2.6.so.1.0 => /usr/lib64/libpython2.6.so.1.0 (0x00007f23b7d65000)
libgfortran.so.3 => /usr/lib64/libgfortran.so.3 (0x00007f23b7a73000)
libm.so.6 => /lib64/libm.so.6 (0x00007f23b77da000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f23b75c3000)
libc.so.6 => /lib64/libc.so.6 (0x00007f23b7232000)
libblas.so.3 => /usr/lib64/libblas.so.3 (0x00007f23b6fdb000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f23b6dbd000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f23b6bb9000)
libutil.so.1 => /lib64/libutil.so.1 (0x00007f23b69b6000)
/lib64/ld-linux-x86-64.so.2 (0x00000032a2200000)
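One way to narrow this down is to check whether the libraries in that ldd output actually export the missing symbol (a diagnostic sketch; nm comes with binutils):
$ nm -D /usr/lib64/liblapack.so.3 | grep clapack_sgesv
$ nm -D /usr/lib64/atlas/libatlas.so.3 | grep clapack_sgesv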
Any ideas?