How to debug a multi-level Spark SQL query? - scala

I am new to Scala. I tried to explode test_tracking and test_segment, but I still get the error below. Does anyone know how to debug this query? Thanks a lot!
An error was encountered:
org.apache.spark.sql.AnalysisException: cannot resolve '`test_segment`' given input columns: [test_tracking, customer_id]; line 1 pos 8;
'Project [customer_id#42, 'explode('test_segment) AS test_segment#94]
+- Project [customer_id#42, test_tracking#91]
The query:
val list = List(3, 4)
var testdata = data.selectExpr("id", "explode(test_tracking) as test_tracking")
  .selectExpr("id", "explode(test_segment) as test_segment")
  .select("id", "test_tracking.gcor_id", "test_tracking.propensity", "test_segment.test_ops")
  .withColumn("new_propensity", when($"propensity" > 2.0, 2.0).when($"propensity" < 1.0, 1.0).otherwise($"propensity"))
  .filter($"test_ops".isin(list: _*))
  .filter($"preference" >= 4)

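The problem is in this line:
.select("id", "test_tracking.gcor_id","test_tracking.propensity","test_segment.test_ops")
You are selecting test_segment.test_ops and not test_segment.
It should be called like this:
.select(col("id"), col("test_tracking.gcor_id"),col("test_tracking.propensity"),col("test_segment.test_ops").as("test_segment"))
Note also that the error message itself points at the second selectExpr: after the first selectExpr only the id column and the exploded test_tracking remain, so test_segment can no longer be resolved. Any column you want to explode or select later has to be carried through each selectExpr.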


isNotNull as a condition in WHEN

I'm migrating Scala code to PySpark. The original code snippet looks like this:
$"aa.*",
when($"bb".isNotNull, $"cc".multiply($"bb")).otherwise($"cc")
and my PySpark code is this:
"aa.*",
when(col("bb").isNotNull, col("cc") * col("bb")).otherwise(col("cc"))
And I have this error:
6 .select(
----> 7 col("clr.*"), when(col("bb").isNotNull, col("cc") * col("bb")).otherwise(col("cc")))
/usr/hdp/current/spark2-client/python/pyspark/sql/ in when(condition, value)
708 sc = SparkContext._active_spark_context
709 if not isinstance(condition, Column):
--> 710 raise TypeError("condition should be a Column")
711 v = value._jc if isinstance(value, Column) else value
712 jc = sc._jvm.functions.when(condition._jc, v)
TypeError: condition should be a Column
Please help me with:
Explain why this logic works on Scala, but not on Python
Suggest improvement
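One likely explanation, offered as a sketch rather than a definitive answer: in Scala a parameterless method can be invoked without parentheses, so $"bb".isNotNull already evaluates to a Column. In Python, col("bb").isNotNull without parentheses is the bound method object itself, not a Column, which is exactly what the isinstance check in the traceback rejects. The snippet below reproduces the mechanism with a hypothetical stand-in class (the Column and when defined here are illustrative, not PySpark's own); in PySpark the corresponding fix would be to write col("bb").isNotNull().

```python
class Column:
    """Hypothetical stand-in for pyspark.sql.Column, for illustration only."""

    def __init__(self, name):
        self.name = name

    def isNotNull(self):
        # Calling the method returns a new Column describing the condition.
        return Column(f"({self.name} IS NOT NULL)")


def when(condition, value):
    # Mirrors the isinstance check shown in the traceback above.
    if not isinstance(condition, Column):
        raise TypeError("condition should be a Column")
    return (condition.name, value)


bb = Column("bb")

# Without parentheses we pass the bound method, not a Column.
try:
    when(bb.isNotNull, 1)
    failed = False
except TypeError as exc:
    failed = True
    message = str(exc)

# With parentheses the method is called and a Column comes back.
ok = when(bb.isNotNull(), 1)
```

The only difference between the failing and the working call is the pair of parentheses after isNotNull.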

Py4JJava wrong columns error when calling PCA of

I am trying to visualize word2vec words using PySpark's PCA function, but I'm getting an unhelpful error message. It says the features column is of the wrong type, but it isn't. (Full message below.)
Scala 2.12.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191).
3.6.5 |Anaconda, Inc.
Ubuntu 16.04
My Code
maxWordsVis = 15
Feat = np.load('Gab_ai_posts_W2Vmatrix.npy')
words = np.load('Gab_ai_posts_WordList.npy')
# to rdd, avoid this with big matrices by reading them directly from hdfs
Feat = sc.parallelize(Feat)
Feat = Feat.map(lambda vec: (Vectors.dense(vec),))
# to dataframe
dfFeat = sqlContext.createDataFrame(Feat,["features"])
Row(features=DenseVector([-0.1282, 0.0699, -0.0891, -0.0437, -0.0915, -0.0557, 0.1432, -0.1564, 0.0058, -0.0603, 0.1383, -0.0359, -0.0306, -0.0415, -0.0191, 0.058, 0.0119, -0.0302, 0.0362, -0.0466, 0.0403, -0.1035, 0.0456, 0.0892, 0.0548, -0.0735, 0.1094, -0.0299, -0.0549, -0.1235, 0.0062, 0.1381, -0.0082, 0.085, -0.0083, -0.0346, -0.0226, -0.0084, -0.0463, -0.0448, 0.0285, -0.0013, 0.0343, -0.0056, 0.0756, -0.0068, 0.0562, 0.0638, 0.023, -0.0224, -0.0228, 0.0281, -0.0698, -0.0044, 0.0395, -0.021, 0.0228, 0.0666, 0.0362, 0.0116, -0.0088, 0.0949, 0.0265, -0.0293, -0.007, -0.0746, 0.0891, 0.0145, 0.0532, -0.0084, -0.0853, 0.0037, -0.055, -0.0706, -0.0296, 0.0321, 0.0495, -0.0776, -0.1339, -0.065, 0.0856, 0.0328, 0.0821, 0.036, -0.0179, -0.0006, -0.036, 0.0438, -0.0077, -0.0012, 0.0322, 0.0354, 0.0513, 0.0436, 0.0002, -0.0578, 0.1062, 0.019, 0.0346, -0.1261]))
numComponents = 3
pca = PCA(k = numComponents, inputCol = "features", outputCol = "pcaFeatures")
Error Message
Py4JJavaError: An error occurred while calling : java.lang.IllegalArgumentException: requirement failed:
Column features must be of type
struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually
at scala.Predef$.require(Predef.scala:224)

k-means Clustering geolocated data using Spark/Scala

How can I handle geolocated data with the k-means clustering algorithm? Can somebody please share their input? Thanks in advance.
Project_2_Dataset.txt file entries look like this
33.68947543 -117.5433083
37.43210889 -121.4850296
39.43789083 -120.9389785
39.36351868 -119.4003347
33.19135811 -116.4482426
33.83435437 -117.3300009
Please review my Code here:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering.KMeans
val data = sc.textFile("Project_2_Dataset.txt")
val parsedData = data.map(line => Vectors.dense(line.split(',').map(_.toDouble)))
val kmmodel = KMeans.train(parsedData, 3, 5) // 3 clusters, at most 5 iterations
17/06/17 13:12:20 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 2)
java.lang.NumberFormatException: For input string: "33.68947543 -117.5433083"
at sun.misc.FloatingDecimal.readJavaFormatString(
at sun.misc.FloatingDecimal.parseDouble(
at java.lang.Double.parseDouble(
at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
Amit K
I think it is because you split each line on ',' instead of ' '.
# "33.19135811 -116.4482426".toDouble
java.lang.NumberFormatException: For input string: "33.19135811 -116.4482426"
# "33.19135811 -116.4482426".split(',').map(_.toDouble)
java.lang.NumberFormatException: For input string: "33.19135811 -116.4482426"
# "33.19135811 -116.4482426".split(' ').map(_.toDouble)
res3: Array[Double] = Array(33.19135811, -116.4482426)
In the previous case we were able to apply the split to a single record ("33.19135811 -116.4482426".split(' ').map(_.toDouble)), but when I apply the same split to multiple records, I get this error:
33.68947543 -117.5433083
37.43210889 -121.4850296
39.43789083 -120.9389785
39.36351868 -119.4003347
scala> val kmmodel= KMeans.train(parsedData,3,5)
17/06/29 19:14:36 ERROR Executor: Exception in task 1.0 in stage 6.0 (TID 8)
java.lang.NumberFormatException: empty String
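The "empty String" error is consistent with a blank line, or two consecutive spaces, somewhere in the file: splitting on a single ' ' then yields an empty token that cannot be parsed as a number. A minimal sketch of a more tolerant parser, written in plain Python rather than Spark (parse_points and the sample lines are illustrative names and data, not part of the original code):

```python
def parse_points(lines):
    """Parse whitespace-separated coordinates, skipping blank lines."""
    points = []
    for line in lines:
        # split() with no argument splits on any run of whitespace
        # and never produces empty tokens.
        fields = line.split()
        if not fields:
            # Blank or whitespace-only line: nothing to parse.
            continue
        points.append([float(field) for field in fields])
    return points


sample = [
    "33.68947543 -117.5433083",
    "",                           # blank line, e.g. at the end of the file
    "37.43210889  -121.4850296",  # two spaces between the values
    "   ",
]
points = parse_points(sample)
```

In Scala the same idea would be to filter out blank lines and split the trimmed line on the regex "\\s+" instead of ' '.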

apache spark - delay on driver

I am attaching an image from the Spark UI. What is causing the delay (represented by the white space), given the description of my code below?
1) isEmpty: an action triggered on a Dataset DS1. It takes a few milliseconds: 60 ms.
2) The white space between "isEmpty" and "run at ThreadPool...".
3) "collect at graphUtil": collection of the Datasets created between 1) and 2).
The script is running on a YARN cluster.
Between 1) and 2) I declare Datasets that use sqlContext.implicits._. I am not collecting them, so this is supposed to be work on the driver. Those Datasets contain join/filter/... operations.
Given that I am not collecting them between 1) and 2), what could be causing this delay?
Code between 1) and 2):
import sqlContext.implicits._
val intermediateInputFlowsIdsDS= intermediateInputFlowsDS
val df_exch_flow_interm_out=df_exch_flow.filter(df_exch_flow("flow_type")==="PRODUCT_FLOW"
&&df_exch_flow("is_input")==="0" )
val allproducersExchDS= intermediateInputFlowsIdsDS.join(df_exch_flow_interm_out,
intermediateInputFlowsIdsDS("flowid")===df_exch_flow_interm_out("f_flow") )
df_proc.join(allproducersExchDS,df_proc("Id")=== allproducersExchDS("f_owner"))
.map(row => {
new FlowProducer( row.getInt(3),// flowid output of producer
row.getInt(0) ,// the process id of producer
row.getDouble(8),// value of the matrix A cell,
row.getString(17),//destination unit
row.getString(2)//process type

How to add meta_data to Pandas dataframe?

I use pandas DataFrames heavily and need to attach some data to a DataFrame, for example to record its creation time, an additional description, etc.
I can't find any reserved fields on the DataFrame class for keeping such data.
So I changed the core\ file to add a line _reserved_slot = {} to solve my issue. I am posting the question to find out whether it is OK to do so, or whether there is a better way to attach metadata to a DataFrame/column/row etc.
# DataFrame class
class DataFrame(NDFrame):
    _auto_consolidate = True
    _verbose_info = True
    _het_axis = 1
    _col_klass = Series
    _AXIS_NUMBERS = {
        'index': 0,
        'columns': 1
    }
    _reserved_slot = {}  # Added by bigbug to keep extra data for the dataframe
    _AXIS_NAMES = dict((v, k) for k, v in _AXIS_NUMBERS.iteritems())
EDIT : (Add demo msg for witingkuo's way)
>>> df = pd.DataFrame(np.random.randn(10,5), columns=list('ABCDEFGHIJKLMN')[0:5])
>>> df
0 0.5890 -0.7683 -1.9752 0.7745 0.8019
1 1.1835 0.0873 0.3492 0.7749 1.1318
2 0.7476 0.4116 0.3427 -0.1355 1.8557
3 1.2738 0.7225 -0.8639 -0.7190 -0.2598
4 -0.3644 -0.4676 0.0837 0.1685 0.8199
5 0.4621 -0.2965 0.7061 -1.3920 0.6838
6 -0.4135 -0.4991 0.7277 -0.6099 1.8606
7 -1.0804 -0.3456 0.8979 0.3319 -1.1907
8 -0.3892 1.2319 -0.4735 0.8516 1.2431
9 -1.0527 0.9307 0.2740 -0.6909 0.4924
>>> df._test = 'hello'
>>> df2 = df.shift(1)
>>> print df2._test
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\Python\lib\site-packages\pandas\core\", line 2051, in __getattr__
(type(self).__name__, name))
AttributeError: 'DataFrame' object has no attribute '_test'
This is not supported right now. The reason is that propagating these attributes is non-trivial. You can certainly assign data, but almost all pandas operations return a new object, on which the assigned data will be lost.
Your _reserved_slot will become a class variable. That will not work if you want to assign a different value to each DataFrame. You can probably assign what you want to the instance directly.
In [6]: import pandas as pd
In [7]: df = pd.DataFrame()
In [8]: df._test = 'hello'
In [9]: df._test
Out[9]: 'hello'
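The class-variable point can be illustrated without pandas. The sketch below uses a hypothetical Frame class (not the real DataFrame) to show that a dict defined in the class body is shared by every instance, while an attribute assigned on the instance is not:

```python
class Frame:
    """Hypothetical stand-in for a DataFrame-like class."""

    # Class attribute: a single dict shared by every instance.
    _reserved_slot = {}

    def __init__(self):
        # Instance attribute: a fresh dict for each object.
        self.meta = {}


a = Frame()
b = Frame()

# Writing through one instance mutates the shared class-level dict...
a._reserved_slot["born"] = "2018-01-01"
# ...while the per-instance dict stays private to that object.
a.meta["born"] = "2018-01-01"

shared = b._reserved_slot  # b sees a's entry
private = b.meta           # b's own dict is still empty
```

So metadata stored in a class-level dict leaks across all objects of that class, which is why assigning to the instance is the safer of the two options, even though the value is still lost when an operation returns a new object.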
I think a decent workaround is to put your dataframe into a dictionary, with your metadata under other keys. So if you have a dataframe with cashflows, like:
df = pd.DataFrame({'Amount': [-20, 15, 25, 30, 100]},index=pd.date_range(start='1/1/2018', periods=5))
You can create your dictionary with additional metadata and put the dataframe there
out = {'metadata': {'Name': 'Whatever', 'Account': 'Something else'}, 'df': df}
and then use it as out['df']