AttributeError: 'HashingTF' object has no attribute '_java_obj' - pyspark

When I use pyspark.ml.Pipeline to create a pipeline, the following error occurs:
File "/opt/module/spark-2.4.3-bin-hadoop2.7/Pipeline.py", line 18, in
hashingTF = HashingTF(ipnutCol=tokenizer.getOutputCol(),outputCol="features")
File "/opt/module/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/__init__.py", line 110, in wrapper
TypeError: __init__() got an unexpected keyword argument 'ipnutCol'
Exception ignored in:
Traceback (most recent call last):
File "/opt/module/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 40, in __del__
AttributeError: 'HashingTF' object has no attribute '_java_obj'
I guess the API has changed, but I am not certain.
# Build a machine learning pipeline
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml import Pipeline
# Create a SparkSession object
spark = SparkSession.builder.master("local").appName("WorldCount").getOrCreate()
# 1. Prepare training documents from a list of (id, text, label) tuples
training = spark.createDataFrame([
    (0, 'a b c d e spark', 1.0),
    (1, 'b d', 0.0),
    (2, 'spark f g h', 1.0),
    (3, 'hadoop mapreduce', 0.0)
], ['id', 'text', 'label'])
# 2. Define each PipelineStage of the pipeline
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(ipnutCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
# 3. Arrange the PipelineStages in processing order and create a Pipeline
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
# 4. Train the model
model = pipeline.fit(training)
# 5. Build the test data
test = spark.createDataFrame([
    (4, 'spark i j k'),
    (5, 'i m n'),
    (6, 'spark hadoop spark'),
    (7, 'apache hadoop')
], ['id', 'text'])
# 6. Call transform() on the trained PipelineModel so that the test data
# passes through the fitted pipeline in order and produces predictions
prediction = model.transform(test)
selected = prediction.select('id', 'text', 'probability', 'prediction')
for row in selected.collect():
    rid, text, prob, prediction = row
    print('({},{}) -> prob = {}, prediction={}'.format(rid, text, str(prob), prediction))
(4, spark i j k) --> prob=[0.155543713844,0.844456286156], prediction=1.000000
(5, l m n) --> prob=[0.830707735211,0.169292264789], prediction=0.000000
(6, spark hadoop spark) --> prob=[0.0696218406195,0.93037815938], prediction=1.000000
(7, apache hadoop) --> prob=[0.981518350351,0.018481649649], prediction=0.000000

You misspelled inputCol as ipnutCol in the HashingTF call, which is exactly what the first error reports:
TypeError: __init__() got an unexpected keyword argument 'ipnutCol'
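The second message (AttributeError: ... no attribute '_java_obj') is most likely only a side effect of the first one: the constructor failed before the attribute was set, and the object's __del__ then tried to read it. Below is a minimal pure-Python sketch of that mechanism on CPython. FakeStage is a hypothetical stand-in, not PySpark's actual class, and sys.unraisablehook (Python 3.8+) is used here only to capture the "Exception ignored" report.

```python
import sys


class FakeStage:
    """Hypothetical stand-in for a wrapper class; not PySpark's real code."""

    def __init__(self, inputCol=None, outputCol=None):
        # Only reached when every keyword argument is valid.
        self._java_obj = object()
        self.inputCol = inputCol
        self.outputCol = outputCol

    def __del__(self):
        # Raises AttributeError if __init__ never ran to completion.
        self._java_obj


ignored = []  # names of exceptions raised inside __del__
previous_hook = sys.unraisablehook
sys.unraisablehook = lambda info: ignored.append(info.exc_type.__name__)

try:
    FakeStage(ipnutCol="words", outputCol="features")  # misspelled keyword
except TypeError as exc:
    first_error = str(exc)

ok = FakeStage(inputCol="words", outputCol="features")  # correct spelling
sys.unraisablehook = previous_hook

print(first_error)  # message mentions the unexpected keyword 'ipnutCol'
print(ignored)      # ['AttributeError']
```

With the keyword spelled inputCol, the constructor completes and neither error appears.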

Related

How can I loop through combinations of values with `pytest.raises`, and find out which combination failed?

I use pytest.raises for tuples of values in an array:
import pytest

def test_division():
    pairs = [(1, 0), (0, 1), (0, 0)]
    for a, b in pairs:
        with pytest.raises(ZeroDivisionError):
            a / b
It fails as expected when one of the tuples does not cause the error.
But it does not tell me which tuple was the problem:
================================================ FAILURES ================================================
_____________________________________________ test_division ______________________________________________
    def test_division():
        pairs = [(1, 0), (0, 1), (0, 0)]
        for a, b in pairs:
            with pytest.raises(ZeroDivisionError):
>               a / b
E               Failed: DID NOT RAISE <class 'ZeroDivisionError'>
proj/divide_test.py:8: Failed
======================================== short test summary info =========================================
FAILED proj/divide_test.py::test_division - Failed: DID NOT RAISE <class 'ZeroDivisionError'>
====================================== 1 failed, 3 passed in 0.04s =======================================
In this case I would like to be told that the pair (0, 1) was the problem.
Is there a way to make that information accessible?
Instead of iterating over the values inside the test, use parametrization:
import pytest

@pytest.mark.parametrize("pair", ((1, 0), (0, 1), (0, 0)), ids=str)
def test_division(pair):
    with pytest.raises(ZeroDivisionError):
        pair[0] / pair[1]
This handles each value as a separate test, and you get:
collecting ... collected 3 items
div_by_zero.py::test_division[(1, 0)] PASSED [ 33%]
div_by_zero.py::test_division[(0, 1)] FAILED [ 66%]
div_by_zero.py:2 (test_division[(0, 1)])
pair = (0, 1)
    @pytest.mark.parametrize("pair", ((1, 0), (0, 1), (0, 0)), ids=str)
    def test_division(pair):
        with pytest.raises(ZeroDivisionError):
>           pair[0] / pair[1]
E           Failed: DID NOT RAISE <class 'ZeroDivisionError'>
div_by_zero.py:6: Failed
div_by_zero.py::test_division[(0, 0)] PASSED [100%]
================================== FAILURES ===================================
...
div_by_zero.py:6: Failed
========================= 1 failed, 2 passed in 0.08s =========================
EDIT: added ids=str to make the output more readable, as suggested by @hoefling

Reformatting Dataframe Containing Array to RowMatrix

I have this dataframe in the following format:
+----------+
| features |
+----------+
|[1,4,7,10]|
|[2,5,8,11]|
|[3,6,9,12]|
+----------+
Script to create sample dataframe:
from pyspark.mllib.linalg.distributed import IndexedRow

rows2 = sc.parallelize([
    IndexedRow(0, [1, 4, 7, 10]),
    IndexedRow(1, [2, 5, 8, 1]),
    IndexedRow(1, [3, 6, 9, 12]),
])
rows_df = rows2.toDF()
row_vec = rows_df.drop("index")
row_vec.show()
The features column contains 4 features, and there are 3 row ids. I want to convert this data to a RowMatrix, where the columns and rows are laid out as in the following mat:
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)])
# Convert to RowMatrix
mat = RowMatrix(rows)
# Calculate exact and approximate similarities
exact = mat.columnSimilarities()
approx = mat.columnSimilarities(0.05)
Basically, I want to transpose the dataframe into the new format so that I can run the columnSimilarities() function. I have a much larger dataframe that contains 50 features, and 39000 rows.
Is this what you are trying to do? I dislike using collect(), but I don't think it can be avoided here, since you want to reshape a structured object into a matrix, right?
import numpy as np

X = np.array(row_vec.select("_2").collect()).reshape(-1, 3)
X = sc.parallelize(X)
for i in X.collect():
    print(i)
[1 4 7]
[10 2 5]
[8 1 3]
[ 6 9 12]
I figured it out. I used the following:
from pyspark.mllib.linalg.distributed import RowMatrix
features_rdd = row_vec.select("features").rdd.map(lambda row: row[0])
features_mat = RowMatrix(features_rdd)
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry
coordmatrix_features = CoordinateMatrix(
    features_mat.rows.zipWithIndex().flatMap(
        lambda x: [MatrixEntry(x[1], j, v) for j, v in enumerate(x[0])]
    )
)
transposed_rowmatrix_features = coordmatrix_features.transpose().toRowMatrix()
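The idea behind this approach is to break the matrix into (row, column, value) entries and swap the two indices. It can be checked locally without Spark. The following is a plain-Python sketch of that logic only, using the values from the table in the question; it is not a replacement for the distributed code, and the variable names are illustrative.

```python
rows = [[1, 4, 7, 10], [2, 5, 8, 11], [3, 6, 9, 12]]

# Like zipWithIndex + flatMap: one (row, column, value) entry per cell.
entries = [(i, j, v) for i, row in enumerate(rows) for j, v in enumerate(row)]

# Transposing a coordinate matrix just swaps the two indices.
swapped = [(j, i, v) for i, j, v in entries]

# Rebuild dense rows from the swapped entries.
n_rows = max(i for i, _, _ in swapped) + 1
n_cols = max(j for _, j, _ in swapped) + 1
transposed = [[0] * n_cols for _ in range(n_rows)]
for i, j, v in swapped:
    transposed[i][j] = v

print(transposed)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
```

The result has the same 4 x 3 layout as the rows passed to RowMatrix in the question.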

Merging time distributed tensor gives 'inbound node error'

In my network, I have some time-distributed convolutions. The batch size is 1 image, which breaks down into 32 sub-images; for each sub-image there are 3 features of dimension 6x6x256. I need to merge all 3 features corresponding to a particular image.
The tensor definitions are:
out1 = TimeDistributed(Convolution2D(256, (3, 3), strides=(2, 2), padding='same', activation='relu'))(out1)
out2 = TimeDistributed(Convolution2D(256, (3, 3), strides = (2,2), padding='same', activation='relu'))(out2)
out3 = TimeDistributed(Convolution2D(256, (1, 1), padding='same', activation='relu'))(out3)
out1: <tf.Tensor 'time_distributed_3/Reshape_1:0' shape=(1, 32, 6, 6, 256) dtype=float32>
out2: <tf.Tensor 'time_distributed_5/Reshape_1:0' shape=(1, 32, 6, 6, 256) dtype=float32>
out3: <tf.Tensor 'time_distributed_6/Reshape_1:0' shape=(1, 32, 6, 6, 256) dtype=float32>
I have tried different techniques to merge them, like TimeDistributed(merge(... )), etc., but nothing works.
out = Lambda(lambda x:merge([x[0],x[1],x[2]],mode='concat'))([out1,out2,out3])
It gives a tensor of the correct dimension (1,32,6,6,768), which then passes through some Flatten and Dense layers. When I build the model with
model = Model( .... , .... ), it gives this error:
File "/home/adityav/.virtualenvs/cv/local/lib/python2.7/site-packages/keras/engine/topology.py", line 1664, in build_map_of_graph
next_node = layer.inbound_nodes[node_index]
AttributeError: 'NoneType' object has no attribute 'inbound_nodes'
Any idea how to do this time-distributed concatenation when the tensors are 5-dimensional?
Thanks

Implicit recommendation with Spark ML and data frames

I am trying to use the new ML libraries with Spark and Dataframes for building a recommender with implicit ratings.
My code
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import Row
from pyspark.ml.recommendation import ALS
sc = SparkContext()
sqlContext = SQLContext(sc)
# create the dataframe (user x item)
df = sqlContext.createDataFrame(
    [(0, 0), (0, 1), (1, 1), (1, 2), (2, 1), (2, 2)],
    ["user", "item"])
als = ALS() \
    .setRank(10) \
    .setImplicitPrefs(True)
model = als.fit(df)
print("Rank %i " % model.rank)
model.userFactors.orderBy("id").collect()
test = sqlContext.createDataFrame([(0, 2), (1, 0), (2, 0)], ["user", "item"])
predictions = sorted(model.transform(test).collect(), key=lambda r: r[0])
for p in predictions:
    print(p)
However, I run into this error:
pyspark.sql.utils.AnalysisException: cannot resolve 'rating' given input columns user, item;
So I am not sure how to define the data frame.
It appears you are trying to use (user, product) tuples, but you need (user, product, rating) triplets. Even for implicit ratings, you do need the ratings. You can use a constant like 1.0 if they are all the same.
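As a rough sketch of that suggestion, the (user, item) pairs from the question can be turned into triplets with a constant rating before the data frame is built. The constant name IMPLICIT_RATING and the choice of 1.0 are assumptions for illustration; the column name "rating" is taken from the error message.

```python
# (user, item) pairs from the question
pairs = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 1), (2, 2)]

# Assumed constant: every observed interaction gets the same implicit rating.
IMPLICIT_RATING = 1.0

# Add the rating as a third field so each record is (user, item, rating).
triplets = [(user, item, IMPLICIT_RATING) for user, item in pairs]
columns = ["user", "item", "rating"]

print(triplets[:2])  # [(0, 0, 1.0), (0, 1, 1.0)]
```

The triplets and columns would then be passed to sqlContext.createDataFrame in place of the two-column data, so that ALS can find the 'rating' column it is looking for.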
I am confused, because the MLlib API has a separate call for implicit feedback:
http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html
val alpha = 0.01
val lambda = 0.01
val model = ALS.trainImplicit(ratings, rank, numIterations, lambda, alpha)

How to filter an RDD according to a function based on another RDD in Spark?

I am a beginner with Apache Spark. I want to keep only the groups in an RDD whose total weight is larger than a constant value. The "weight" map is also an RDD. Here is a small demo: the groups to be filtered are stored in "groups", and the constant value is 12:
val groups = sc.parallelize(List("a,b,c,d", "b,c,e", "a,c,d", "e,g"))
val weights = sc.parallelize(Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)))
val wm = weights.toArray.toMap
def isheavy(inp: String): Boolean = {
  val allw = inp.split(",").map(wm(_)).sum
  allw > 12
}
val result = groups.filter(isheavy)
When the input data is very large (> 10GB, for example), I always encounter a "java heap out of memory" error. I suspect it is caused by "weights.toArray.toMap", because it converts a distributed RDD into a Java object in the JVM. So I tried to filter with the RDD directly:
val groups = sc.parallelize(List("a,b,c,d", "b,c,e", "a,c,d", "e,g"))
val weights = sc.parallelize(Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)))
def isheavy(inp: String): Boolean = {
  val items = inp.split(",")
  val wm = items.map(x => weights.filter(_._1 == x).first._2)
  wm.sum > 12
}
val result = groups.filter(isheavy)
When I ran result.collect after loading this script into the Spark shell, I got a "java.lang.NullPointerException" error. Someone told me that manipulating an RDD inside another RDD causes a NullPointerException, and suggested that I put the weights into Redis.
So how can I get the "result" without converting "weights" to a Map or putting it into Redis? Is there a way to filter an RDD based on another map-like RDD without the help of an external datastore service?
Thanks!
I assume your groups are unique. Otherwise, first make them unique with distinct, etc.
If either groups or weights is small, this should be easy. If both groups and weights are huge, you can try this, which may be more scalable but also looks complicated.
val groups = sc.parallelize(List("a,b,c,d", "b,c,e", "a,c,d", "e,g"))
val weights = sc.parallelize(Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)))
// map groups to (a, (a,b,c,d)), (b, (a,b,c,d)), (c, (a,b,c,d)), ...
val g1 = groups.flatMap(s => s.split(",").map(x => (x, Seq(s))))
// j will be (a, ((a,b,c,d), 3)), ...
val j = g1.join(weights)
// k will be ((a,b,c,d), 3), ((a,b,c,d), 2), ...
val k = j.map(x => (x._2._1, x._2._2))
// l will be ((a,b,c,d), (3,2,5,1)), ...
val l = k.groupByKey()
// keep only the groups whose weights sum to more than 12
val m = l.filter(x => x._2.sum > 12)
// we only need the original group
val result = m.map(x => x._1)
// don't do this in a real product, otherwise all results go to the driver; use saveAsTextFile, etc. instead
scala> result.foreach(println)
List(e,g)
List(b,c,e)
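To make the data flow of these steps easier to follow, here is a small plain-Python sketch of the same flatMap, join, group and filter sequence on the demo data. It runs on local lists rather than RDDs, so it only illustrates the logic and says nothing about scalability; the names item_group, groups_by_item and totals are made up for this sketch.

```python
from collections import defaultdict

groups = ["a,b,c,d", "b,c,e", "a,c,d", "e,g"]
weights = [("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)]
THRESHOLD = 12

# flatMap step: one (item, group) pair per item in each group.
item_group = [(item, g) for g in groups for item in g.split(",")]

# join step: match each weight with every group that contains its item.
groups_by_item = defaultdict(list)
for item, g in item_group:
    groups_by_item[item].append(g)
joined = [(g, w) for item, w in weights for g in groups_by_item.get(item, [])]

# groupByKey + sum step: total weight per group.
totals = defaultdict(int)
for g, w in joined:
    totals[g] += w

# filter step: keep only the groups heavier than the threshold.
result = [g for g in groups if totals[g] > THRESHOLD]
print(result)  # ['b,c,e', 'e,g']
```

The totals are 11, 16, 9 and 15, so only "b,c,e" and "e,g" pass, which matches the output shown above.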
The "java heap out of memory" error occurs because Spark uses its spark.default.parallelism property to determine the number of splits, and by default that is the number of cores available.
// From CoarseGrainedSchedulerBackend.scala
override def defaultParallelism(): Int = {
  conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))
}
When the input becomes large and you have limited memory, you should increase the number of splits.
You can do something like this:
val input = List("a,b,c,d", "b,c,e", "a,c,d", "e,g")
val splitSize = 10000 // specify some number of elements that fit in memory.
val numSplits = (input.size / splitSize) + 1 // has to be > 0.
val groups = sc.parallelize(input, numSplits) // specify the # of splits.
val weights = Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)).toMap
def isHeavy(inp: String) = inp.split(",").map(weights(_)).sum > 12
val result = groups.filter(isHeavy)
You may also consider increasing the executor memory size with spark.executor.memory.