I am trying to embed column values into a narrative text for each row, but every row ends up with the same value (the one from the last row).
val hashColDf = rowmaxDF.select("min", "max", "Total")
// Collect each row as a Map of columnName -> value
val peopleArray = hashColDf.collect.map(r => Map(hashColDf.columns.zip(r.toSeq): _*))
val comstr = "shyam has max and min not Total"
// Build one narrative string per row by replacing each column name with its value
var mapArrayStr = List[String]()
for (eachrow <- peopleArray) {
  mapArrayStr = mapArrayStr :+ eachrow.foldLeft(comstr)((a, b) => a.replaceAllLiterally(b._1, b._2.toString))
}
// Problem: withColumn with lit() overwrites "compCols" on every iteration,
// so only the last narrative string survives for every row
for (eachCol <- mapArrayStr) {
  rowmaxDF = rowmaxDF.withColumn("compCols", lit(eachCol))
}
Source Dataframe:
|max|min|TOTAL|
|3 |1 |4 |
|5 |2 |7 |
|7 |3 |10 |
|8 |4 |12 |
|10 |5 |15 |
|10 |5 |15 |
Actual Result:
|max|min|TOTAL|compCols |
|3 |1 |4 |shyam has 10 and 5 not 15|
|5 |2 |7 |shyam has 10 and 5 not 15|
|7 |3 |10 |shyam has 10 and 5 not 15|
|8 |4 |12 |shyam has 10 and 5 not 15|
|10 |5 |15 |shyam has 10 and 5 not 15|
|10 |5 |15 |shyam has 10 and 5 not 15|
Expected Result:
|max|min|TOTAL|compCols |
|3 |1 |4 |shyam has 3 and 1 not 4 |
|5 |2 |7 |shyam has 5 and 2 not 7 |
|7 |3 |10 |shyam has 7 and 3 not 10 |
|8 |4 |12 |shyam has 8 and 4 not 12 |
|10 |5 |15 |shyam has 10 and 5 not 15|
|10 |5 |15 |shyam has 10 and 5 not 15|
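For reference, the same narrative can be built per row without collecting to the driver, using Spark's built-in format_string (a sketch, assuming the column names above):

import org.apache.spark.sql.functions.{col, format_string}

// Substitute each row's own max/min/Total values into the template
val withNarrative = rowmaxDF.withColumn(
  "compCols",
  format_string("shyam has %s and %s not %s", col("max"), col("min"), col("Total"))
)
withNarrative.show(false)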
Hi, I am totally new to Spark Scala. I need an idea or a sample solution. I have data like this:
tagid,timestamp,listner,orgid,suborgid,rssi
[4,1496745915,718,4,3,0.30]
[2,1496745915,3878,4,3,0.20]
[4,1496745918,362,4,3,0.60]
[4,1496745913,362,4,3,0.60]
[2,1496745918,362,4,3,0.10]
[3,1496745912,718,4,3,0.05]
[2,1496745918,718,4,3,0.30]
[4,1496745911,1901,4,3,0.60]
[4,1496745912,718,4,3,0.60]
[2,1496745915,362,4,3,0.30]
[2,1496745912,3878,4,3,0.20]
[2,1496745915,1901,4,3,0.30]
[2,1496745910,1901,4,3,0.30]
For each tagid and for each listner, I want to find the data from the last 10 seconds of timestamps. Then, for that 10-second window, I need to find the average of the rssi values, like this:
2,1496745918,718,4,3,0.60
2,1496745917,718,4,3,1.30
2,1496745916,718,4,1,2.20
2,1496745914,718,1,2,3.10
2,1496745911,718,1,2,6.10
4,1496745910,1901,1,2,0.30
4,1496745908,1901,1,2,1.30
..........................
..........................
This is what I need to produce. Any solution or suggestion is appreciated.
NOTE: I am working in Spark Scala.
I tried a Spark SQL query, but it does not work properly:
avg.registerTempTable("avg")
val avgfinal = sqlContext.sql("""
  SELECT tagid, timestamp, listner
  FROM (
    SELECT tagid, timestamp, listner,
           dense_rank() OVER (PARTITION BY _c6 ORDER BY _c5 ASC) AS rank
    FROM avg
  ) tmp
  WHERE rank <= 10
""")
avgfinal.collect.foreach(println)
I am also trying with arrays. Any help will be appreciated.
If you already have a dataframe such as
+-----+----------+-------+-----+--------+----+
|tagid|timestamp |listner|orgid|suborgid|rssi|
+-----+----------+-------+-----+--------+----+
|4 |1496745915|718 |4 |3 |0.30|
|2 |1496745915|3878 |4 |3 |0.20|
|4 |1496745918|362 |4 |3 |0.60|
|4 |1496745913|362 |4 |3 |0.60|
|2 |1496745918|362 |4 |3 |0.10|
|3 |1496745912|718 |4 |3 |0.05|
|2 |1496745918|718 |4 |3 |0.30|
|4 |1496745911|1901 |4 |3 |0.60|
|4 |1496745912|718 |4 |3 |0.60|
|2 |1496745915|362 |4 |3 |0.30|
|2 |1496745912|3878 |4 |3 |0.20|
|2 |1496745915|1901 |4 |3 |0.30|
|2 |1496745910|1901 |4 |3 |0.30|
+-----+----------+-------+-----+--------+----+
Doing the following should work for you:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{first, avg}
// plus spark.implicits._ (or sqlContext.implicits._) for the $"..." syntax

df.withColumn("firstValue", first("timestamp") over Window.partitionBy("tagid").orderBy($"timestamp".desc))
  .filter($"firstValue".cast("long") - $"timestamp".cast("long") < 10)
  .withColumn("average", avg("rssi") over Window.partitionBy("tagid"))
  .drop("firstValue")
  .show(false)
You should get the following output:
+-----+----------+-------+-----+--------+----+-------------------+
|tagid|timestamp |listner|orgid|suborgid|rssi|average |
+-----+----------+-------+-----+--------+----+-------------------+
|3 |1496745912|718 |4 |3 |0.05|0.05 |
|4 |1496745918|362 |4 |3 |0.60|0.54 |
|4 |1496745915|718 |4 |3 |0.30|0.54 |
|4 |1496745913|362 |4 |3 |0.60|0.54 |
|4 |1496745912|718 |4 |3 |0.60|0.54 |
|4 |1496745911|1901 |4 |3 |0.60|0.54 |
|2 |1496745918|362 |4 |3 |0.10|0.24285714285714288|
|2 |1496745918|718 |4 |3 |0.30|0.24285714285714288|
|2 |1496745915|3878 |4 |3 |0.20|0.24285714285714288|
|2 |1496745915|362 |4 |3 |0.30|0.24285714285714288|
|2 |1496745915|1901 |4 |3 |0.30|0.24285714285714288|
|2 |1496745912|3878 |4 |3 |0.20|0.24285714285714288|
|2 |1496745910|1901 |4 |3 |0.30|0.24285714285714288|
+-----+----------+-------+-----+--------+----+-------------------+
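The question asks for the window per tag and per listner; if the 10-second cutoff and the average should be computed per (tagid, listner) pair rather than per tagid alone, the same pattern should work with a composite partition (a sketch, untested):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{first, avg}

// Partition by both tagid and listner instead of tagid alone
val w = Window.partitionBy("tagid", "listner")
df.withColumn("firstValue", first("timestamp") over w.orderBy($"timestamp".desc))
  .filter($"firstValue".cast("long") - $"timestamp".cast("long") < 10)
  .withColumn("average", avg("rssi") over w)
  .drop("firstValue")
  .show(false)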
I obtain a resultant dataframe after performing some computations. Say the dataframe is result. When I write it to Amazon S3, there are specific cells which show up blank. The top 5 rows of my result dataframe are:
+--------+-------+-------+-----+-----+-----+-----+
|var30   |var31  |var32  |var33|var34|var35|var36|
+--------+-------+-------+-----+-----+-----+-----+
|-0.00586|0.13821|0      |     |1    |     |     |
|3.87635 |2.86702|2.51963|8    |11   |2    |14   |
|3.78279 |2.54833|2.45881|     |2    |     |     |
|-0.10092|0      |0      |1    |1    |3    |1    |
|8.08797 |6.14486|5.25718|     |5    |     |     |
+--------+-------+-------+-----+-----+-----+-----+
But when I run the result.show() command, I am able to see the values:
+--------+-------+-------+-----+-----+-----+-----+
|var30   |var31  |var32  |var33|var34|var35|var36|
+--------+-------+-------+-----+-----+-----+-----+
|-0.00586|0.13821|0      |2    |1    |1    |6    |
|3.87635 |2.86702|2.51963|8    |11   |2    |14   |
|3.78279 |2.54833|2.45881|2    |2    |2    |12   |
|-0.10092|0      |0      |1    |1    |3    |1    |
|8.08797 |6.14486|5.25718|20   |5    |5    |34   |
+--------+-------+-------+-----+-----+-----+-----+
Also, the blanks appear in the same cells every time I run it.
Use this to save the data to your S3:
result.repartition(1)
  .write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("s3n://Yourpath")
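Note that repartition(1) moves all the data to a single partition before the write, so the save produces one CSV part file instead of one file per partition; for large dataframes you may prefer to drop it and accept multiple output files.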
For anyone who might have come across this issue, I can tell you what worked for me.
I was joining one data frame (let's say inputDF) with another (deltaDF) based on some logic and storing the result in an output data frame (outDF). I was getting the same behavior, whereby I could see a record in outDF.show(), but while writing the dataframe into a Hive table, or when persisting outDF (using outDF.persist(StorageLevel.MEMORY_AND_DISK)), I wasn't able to see that particular record.
SOLUTION: I persisted inputDF (inputDF.persist(StorageLevel.MEMORY_AND_DISK)) before joining it with deltaDF. After that, the outDF.show() output was consistent with the Hive table where outDF was written.
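For reference, a minimal sketch of the pattern described above (the join key "id" and the table name are hypothetical, just for illustration):

import org.apache.spark.storage.StorageLevel

// Materialize inputDF before the join so every later action (show, write)
// reuses the same persisted data instead of recomputing the lineage
inputDF.persist(StorageLevel.MEMORY_AND_DISK)
val outDF = inputDF.join(deltaDF, Seq("id"))  // "id" is a hypothetical join key
outDF.show()
outDF.write.saveAsTable("output_table")       // hypothetical Hive table name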
P.S.: I am not sure how this solved the issue. It would be awesome if someone could explain it, but the above worked for me.
I'm using OrientDB's UI/Query tool to analyze some graph data, and I've spent a couple of days unsuccessfully trying to unwind two arrays.
The unwind clause works just fine for one array but I can't seem to get the output I'm looking for when trying to unwind two arrays.
Here's a simplified example of my data:
#class      | amt | storeID | customerID
transaction | $4  | 1       | 1
transaction | $2  | 1       | 1
transaction | $6  | 1       | 4
transaction | $3  | 1       | 4
transaction | $2  | 2       | 1
transaction | $7  | 2       | 1
transaction | $8  | 2       | 2
transaction | $3  | 2       | 2
transaction | $4  | 2       | 3
transaction | $9  | 2       | 3
transaction | $10 | 3       | 4
transaction | $3  | 3       | 4
transaction | $4  | 3       | 5
transaction | $10 | 3       | 5
Each customer is a document with the following information:
#class   | customerID | State
customer | 1          | NY
customer | 2          | NJ
customer | 3          | PA
customer | 4          | NY
customer | 5          | NY
Each store is a document with the following information:
#class | storeID | State | Zip
store  | 1       | NY    | 1
store  | 2       | NJ    | 3
store  | 3       | NY    | 2
Assuming I did not have storeID (nor wanted to create it), I want to recover a flattened table with the following distinct values: the store's State, its Zip, the customer IDs, and the sum spent.
The query would hopefully generate something like the table below (for a given depth value).
State | Zip | customerID
NY    | 1   | 4
NY    | 1   | 5
NY    | 2   | 1
NY    | 2   | 4
NJ    | 3   | 1
NJ    | 3   | 2
NJ    | 3   | 3
I've tried various expand/flatten/unwind operations but I can't seem to get my query to work.
Here's the query I have that recovers the State and Zip as two arrays and flattens the customerID:
SELECT out().State AS State,
       out().Zip AS Zip,
       customerID
FROM (
    SELECT EXPAND(IN())
    FROM (
        TRAVERSE * FROM (SELECT FROM transaction)
    )
);
Which yields:
State            | Zip       | customerID
[NY, NY, NJ, NJ] | [1,1,2,2] | 1
[NY, NY, NJ, NJ] | [1,1,2,2] | 1
[NY, NY, PA, PA] | [1,1,3,3] | 4
[NY, NY, PA, PA] | [1,1,3,3] | 4
...              | ...       | ...
Which is not what I'm looking for. Can someone provide a little help on how I can flatten/unwind these two arrays together?
I tried your case with this structure (based on your example). I used these queries to retrieve State, Zip and customerID (not as arrays):
Query 1:
SELECT State, Zip, in('transaction').customerID AS customerID FROM Store
ORDER BY Zip UNWIND customerID
----+------+-----+----+----------
# |#CLASS|State|Zip |customerID
----+------+-----+----+----------
0 |null |NY |1 |1
1 |null |NY |1 |1
2 |null |NY |1 |4
3 |null |NY |1 |4
4 |null |NY |2 |4
5 |null |NY |2 |4
6 |null |NY |2 |5
7 |null |NY |2 |5
8 |null |NJ |3 |1
9 |null |NJ |3 |1
10 |null |NJ |3 |2
11 |null |NJ |3 |2
12 |null |NJ |3 |3
13 |null |NJ |3 |3
----+------+-----+----+----------
Query 2:
SELECT inV('transaction').State AS State, inV('transaction').Zip AS Zip,
outV('transaction').customerID AS customerID FROM transaction ORDER BY Zip
----+------+-----+----+----------
# |#CLASS|State|Zip |customerID
----+------+-----+----+----------
0 |null |NY |1 |1
1 |null |NY |1 |1
2 |null |NY |1 |4
3 |null |NY |1 |4
4 |null |NY |2 |4
5 |null |NY |2 |4
6 |null |NY |2 |5
7 |null |NY |2 |5
8 |null |NJ |3 |1
9 |null |NJ |3 |1
10 |null |NJ |3 |2
11 |null |NJ |3 |2
12 |null |NJ |3 |3
13 |null |NJ |3 |3
----+------+-----+----+----------
EDITED
In the following example, the query retrieves the average and the total spent for every storeID, per customerID (this assumes amt is stored as a numeric value, without the $ prefix shown in the sample):
SELECT customerID, storeID, avg(amt) AS averagePerStore, sum(amt) AS totalPerStore
FROM transaction GROUP BY customerID,storeID ORDER BY customerID
----+------+----------+-------+---------------+-------------
# |#CLASS|customerID|storeID|averagePerStore|totalPerStore
----+------+----------+-------+---------------+-------------
0 |null |1 |1 |3.0 |6.0
1 |null |1 |2 |4.5 |9.0
2 |null |2 |2 |5.5 |11.0
3 |null |3 |2 |6.5 |13.0
4 |null |4 |1 |4.5 |9.0
5 |null |4 |3 |6.5 |13.0
6 |null |5 |3 |7.0 |14.0
----+------+----------+-------+---------------+-------------
Hope it helps
I have a crystal report that, when run, will look like the below. The fields are placed in the detail section:
Code|Jan|Feb|Mar|Apr|May|Jun|Jul|
405 |70 |30 |10 |45 |5 |76 |90 |
406 |10 |23 |30 |7 |1 |26 |10 |
488 |20 |30 |60 |7 |5 |44 |10 |
501 |40 |15 |90 |10 |8 |75 |40 |
502 |30 |30 |10 |7 |5 |12 |30 |
600 |60 |16 |50 |7 |9 |75 |20 |
I need to create a formula or a parameter that checks whether Code = 501 and then returns that row's Jun value ("75") in the report footer section.
I wrote this formula:
whileprintingrecords;
NumberVar COSValue;
If {ds_RevSBU.Code} = 501
Then COSValue := {ds_RevSBU.JUN}
Else 0;
If I place this formula within the detail section, it works: it gives me the value of 75. How can I get this value in the report footer section?
Please help. Thank you.
I finally figured out a way, but I'm not sure if it is the correct one. I created the below formula and suppressed it in the detail section:
Global NumberVar COSValue;
If {ds_RevSBU.Code} = 501
Then COSValue := {ds_RevSBU.JUN}
Else 0;
Then, in the footer section, I created the below formula:
WhileReadingRecords;
Global NumberVar COSValue;
(COSValue * 4.5)/100