How to store a bigint array in a file in PySpark?

I have a UDF that returns a bigint array. I want to store its output in a file on a PySpark cluster.
Sample Data -
[
Row(Id='ABCD505936', array=[0, 2, 5, 6, 8, 10, 12, 13, 14, 15, 18]),
Row(Id='EFGHI155784', array=[1, 2, 4, 10, 16, 22, 27, 32, 36, 38, 39, 40])
]
I tried saving it like this -
df.write.save("/home/data", format="text", header="true", mode="overwrite")
But it throws an error saying -
py4j.protocol.Py4JJavaError: An error occurred while calling
o101.save. : org.apache.spark.sql.AnalysisException: Text data source
does not support array data type.;
Can anyone please help me?

Try this:
from pyspark.sql import functions as F
df.withColumn(
    "array",
    F.col("array").cast("string")
).write.save(
    "/home/data", format="text", header="true", mode="overwrite"
)
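One extra note (my addition, not part of the original answer): the text source writes exactly one string column, so if the DataFrame keeps both Id and the casted array you may still need to collapse them into a single column, or simply pick a format that supports arrays natively. A rough sketch of both options, with illustrative output paths:
from pyspark.sql import functions as F

# Option 1 (sketch): collapse everything into one string column for the text source.
(df.select(F.concat_ws("|", "Id", F.col("array").cast("string")).alias("value"))
   .write.mode("overwrite").text("/home/data_text"))

# Option 2 (sketch): use a format that supports array columns natively.
df.write.mode("overwrite").json("/home/data_json")
df.write.mode("overwrite").parquet("/home/data_parquet")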

Related

How to parse datetime.datetime(,,,tzinfo) into a Spark DataFrame

I have the following JSON:
{'client_partition_id': 2, 'client_id': 'd7d3e26f-d105-4816-825d-d5858b9cf0d1', 'ingestion_timestamp': datetime.datetime(2020, 4, 10, 4, 3, 19, 654977, tzinfo=<UTC>)}
It is in an ndjson file.
I am trying to read it in Spark with spark.read.json(), but I get an error in PySpark when it tries to parse the datetime.
Please help me resolve it.
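One observation that may explain the error (my reading, not an accepted answer): the snippet above is a Python dict repr rather than valid JSON, since datetime.datetime(...) is not a JSON value, so spark.read.json() has nothing it can parse there. A hypothetical sketch, assuming you control how the ndjson file is produced: serialize the timestamp as an ISO-8601 string so the file is real JSON lines, then convert it back to a timestamp in Spark (the file path is illustrative).
import json
import datetime
from pyspark.sql import SparkSession, functions as F

# Write one valid ndjson line with the timestamp as an ISO-8601 string.
record = {
    "client_partition_id": 2,
    "client_id": "d7d3e26f-d105-4816-825d-d5858b9cf0d1",
    "ingestion_timestamp": datetime.datetime(
        2020, 4, 10, 4, 3, 19, 654977, tzinfo=datetime.timezone.utc
    ).isoformat(),
}
with open("/tmp/events.ndjson", "w") as f:
    f.write(json.dumps(record) + "\n")

# Read it back and turn the string into a proper timestamp column.
spark = SparkSession.builder.getOrCreate()
df = (spark.read.json("/tmp/events.ndjson")
      .withColumn("ingestion_timestamp", F.to_timestamp("ingestion_timestamp")))
df.printSchema()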

Spark DataFrame: finding employees who have a salary higher than the average salary of the organization

I am trying to run test Spark/Scala code to find the employees whose salary is higher than the average salary, using the test data in the Spark DataFrame below, but it fails while executing:
Exception in thread "main" java.lang.UnsupportedOperationException: Cannot evaluate expression: avg(input[4, double, false])
What is the correct syntax to achieve this?
val dataDF20 = spark.createDataFrame(Seq(
  (11, "emp1", 2, 45, 1000.0),
  (12, "emp2", 1, 34, 2000.0),
  (13, "emp3", 1, 33, 3245.0),
  (14, "emp4", 1, 54, 4356.0),
  (15, "emp5", 2, 76, 56789.0)
)).toDF("empid", "name", "deptid", "age", "sal")

val condition1 : Column = col("sal") > avg(col("sal"))
val d0 = dataDF20.filter(condition1)
println("------ d0.show()----", d0.show())
You can get this done in two steps:
val avgVal = dataDF20.select(avg($"sal")).take(1)(0)(0)
dataDF20.filter($"sal" > avgVal).show()
+-----+----+------+---+-------+
|empid|name|deptid|age| sal|
+-----+----+------+---+-------+
| 15|emp5| 2| 76|56789.0|
+-----+----+------+---+-------+

Why is the key always 0 when creating a map?

My code is supposed to extract a Map from a DataFrame. The map will be used later for some calculations (mapping a Credit to the best matching original Billing). However, the first step is already failing: the TransactionId is always retrieved as 0.
Simplified version of the code:
case class SalesTransaction(
  CustomerId : Int,
  Score : Int,
  Revenue : Double,
  Type : String,
  Credited : Double = 0.0,
  LinkedTransactionId : Int = 0,
  IsProcessed : Boolean = false
)
val df = Seq(
  (1, 1, 123, "Sales", 100),
  (1, 2, 122, "Credit", 100),
  (1, 3, 99, "Sales", 70),
  (1, 4, 101, "Sales", 77),
  (1, 5, 102, "Credit", 75),
  (1, 6, 98, "Sales", 71),
  (2, 7, 200, "Sales", 55),
  (2, 8, 220, "Sales", 55),
  (2, 9, 200, "Credit", 50),
  (2, 10, 205, "Sales", 50)
).toDF("CustomerId", "TransactionId", "TransactionAttributesScore", "TransactionType", "Revenue")
  .withColumn("Revenue", $"Revenue".cast(DoubleType))
  .repartition($"CustomerId")

//map generation:
val m2 : Map[Int, SalesTransaction] =
  df.map(row => (
    row.getAs("TransactionId")
    , new SalesTransaction(row.getAs("CustomerId")
      , row.getAs("TransactionAttributesScore")
      , row.getAs("Revenue")
      , row.getAs("TransactionType")
    )
  )
  ).collect.toMap
m2.foreach(m => println("key: " + m._1 + " Value: " + m._2))
The output has only the very last record, because every value captured by row.getAs("TransactionId") is null (which translates to 0 in the m2 map), so the tuple created in each iteration is (null, [current row's SalesTransaction]).
Could you please advise me on what might be wrong with my code? I'm quite new to Scala and must be missing some syntactic nuance here.
You can also use the typed row.getAs[Int]("TransactionId"), as shown below:
val m2 : Map[Int, SalesTransaction] =
  df.map(row => (
    row.getAs[Int]("TransactionId"),
    new SalesTransaction(row.getAs("CustomerId"),
      row.getAs("TransactionAttributesScore"),
      row.getAs("Revenue"),
      row.getAs("TransactionType"))
  )
  ).collect.toMap
It is always better to use the typed version of getAs to avoid errors like this.
The issue is related to the data type obtained from row.getAs("TransactionId"), despite the underlying $"TransactionId" being an integer. Forcing the type explicitly resolved the issue:
//… code above unchanged
val m2 : Map[Int, SalesTransaction] =
  df.map(row => {
    val mKey : Int = row.getAs("TransactionId") //forcing the value into an Int variable
    val mValue : SalesTransaction = new SalesTransaction(row.getAs("CustomerId")
      , row.getAs("TransactionAttributesScore")
      , row.getAs("Revenue")
      , row.getAs("TransactionType")
    )
    (mKey, mValue)
  }
  ).collect.toMap

Aggregate data in Scala

I have data in a file like:
2005, 08, 20, 50
2005, 08, 21, 52
2005, 08, 22, 38
2005, 08, 23, 70
The data is: year, month, date, temperature.
I want to read this data and output the temperatures grouped by year and month.
Example: 2005-08: 38, 50, 52, 70.
The temperatures are sorted in ascending order.
What should the Spark Scala code for this be? An answer using RDD transformations would be much appreciated.
This is what I have done so far:
val conf= new SparkConf().setAppName("demo").setMaster("local[*]")
val spark = new SparkContext(conf)
val input = spark.textFile("src/main/resources/someFile.txt")
val fields = input.flatMap(_.split(","))
My idea is to use year-month as the key and a list of temperatures as the value, but I am not able to get this into code.
val myData = sc.parallelize(Array((2005, 8, 20, 50), (2005, 8, 21, 52), (2005, 8, 22, 38), (2005, 8, 23, 70)))
myData.sortBy(_._4).collect
returns:
res1: Array[(Int, Int, Int, Int)] = Array((2005,8,22,38), (2005,8,20,50), (2005,8,21,52), (2005,8,23,70))
I leave the concat function to you.
From a file:
val filesRDD = sc.textFile("/FileStore/tables/Weather2.txt",1)
val linesRDD = filesRDD.map(line => (line.trim.split(","))).map(entries=>(entries(0).toInt,entries(1).toInt,entries(2).toInt,entries(3).toInt))
linesRDD.sortBy(_._4).collect
returns:
res13: Array[(Int, Int, Int, Int)] = Array((2005,7,22,7), (2005,7,15,10), (2005,8,22,38), (2005,8,20,50), (2005,7,19,50), (2005,8,21,52), (2005,7,21,52), (2005,8,23,70))
You can work out the concat yourself, and consider what happens when sort values are tied (you would need multiple sort keys), but I think this answers your first, less well-formed question.
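For the grouping step that both snippets leave out, here is a minimal sketch of the idea, written in PySpark for illustration (the Scala RDD API is analogous; map, reduceByKey and mapValues exist under the same names): key each record by year-month, gather the temperatures, and sort them.
from pyspark import SparkConf, SparkContext

# Sketch only: key by "year-month", collect temperatures per key, sort ascending.
conf = SparkConf().setAppName("demo").setMaster("local[*]")
sc = SparkContext(conf=conf)

grouped = (sc.textFile("src/main/resources/someFile.txt")
           .map(lambda line: [int(x) for x in line.split(",")])
           .map(lambda f: ("{:04d}-{:02d}".format(f[0], f[1]), [f[3]]))
           .reduceByKey(lambda a, b: a + b)
           .mapValues(sorted))

for ym, temps in grouped.collect():
    print("{}: {}".format(ym, ", ".join(str(t) for t in temps)))
# e.g. 2005-08: 38, 50, 52, 70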

Use psycopg2 to loop in PostgreSQL

I use PostgreSQL 8.4 to route a river network, and I want to use psycopg2 to loop through all the data points in my river network.
#set up python and postgresql connection
import psycopg2
query = """
select *
from driving_distance ($$
select
gid as id,
start_id::int4 as source,
end_id::int4 as target,
shape_leng::double precision as cost
from network
$$, %s, %s, %s, %s
)
;"""
conn = psycopg2.connect("dbname = 'routing_template' user = 'postgres' host = 'localhost' password = '****'")
cur = conn.cursor()
while True:
    i = 1
    if i <= 2:
        cur.execute(query, (i, 1000000, False, False))
        i = i + 1
    else:
        break
rs = cur.fetchall()
conn.close()
print rs
The code above takes a long time to run even though I have set the maximum iterator i to 2, and the output is an error message containing garbage.
I am wondering whether PostgreSQL can return only one result set at a time, so I tried to put this line in my loop:
rs(i) = cur.fetchall()
and the error message said that this line is buggy.
I know that I can't write code like rs(i), but I don't know what to use instead to validate my assumption.
So should I save one result to a file first, then run the loop with the next iterator, and so on?
I am working with PostgreSQL 8.4 and Python 2.7.6 under Windows 8.1 x64.
Update #1
I can do the loop using Clodoaldo Neto's code (thanks), and the result looks like this:
[(1, 2, 0.0), (2, 2, 4729.33082850235), (3, 19, 4874.27571718902), (4, 3, 7397.215962901), (5, 4,
6640.31749097187), (6, 7, 10285.3869655786), (7, 7, 14376.1087618696), (8, 5, 15053.164236979), (9, 10, 16243.5973710466), (10, 8, 19307.3024368889), (11, 9, 21654.8669532788), (12, 11, 23522.6224229233), (13, 18, 29706.6964721152), (14, 21, 24034.6792693279), (15, 18, 25408.306370489), (16, 20, 34204.1769580924), (17, 11, 26465.8348728118), (18, 20, 38596.7313209197), (19, 13, 35184.9925532175), (20, 16, 36530.059646027), (21, 15, 35789.4069722436), (22, 15, 38168.1750567026)]
[(1, 2, 4729.33082850235), (2, 2, 0.0), (3, 19, 144.944888686669), (4, 3, 2667.88513439865), (5, 4, 1910.98666246952), (6, 7, 5556.05613707624), (7, 7, 9646.77793336723), (8, 5, 10323.8334084767), (9, 10, 11514.2665425442), (10, 8, 14577.9716083866), (11, 9, 16925.5361247765), (12, 11, 18793.2915944209), (13, 18, 24977.3656436129), (14, 21, 19305.3484408255), (15, 18, 20678.9755419867), (16, 20, 29474.8461295901), (17, 11, 21736.5040443094), (18, 20, 33867.4004924174), (19, 13, 30455.6617247151), (20, 16, 31800.7288175247), (21, 15, 31060.0761437413), (22, 15, 33438.8442282003)]
But what if I want the output to look like this:
(1, 2, 7397.215962901)
(2, 2, 2667.88513439865)
(3, 19, 2522.94024571198)
(4, 3, 0.0)
(5, 4, 4288.98201949483)
(6, 7, 7934.05149410155)
(7, 7, 12024.7732903925)
(8, 5, 12701.828765502)
(9, 10, 13892.2618995696)
(10, 8, 16955.9669654119)
(11, 9, 19303.5314818018)
(12, 11, 21171.2869514462)
(13, 18, 27355.3610006382)
(14, 21, 21683.3437978508)
(15, 18, 23056.970899012)
(16, 20, 31852.8414866154)
(17, 11, 24114.4994013347)
(18, 20, 36245.3958494427)
(19, 13, 32833.6570817404)
(20, 16, 34178.72417455)
(21, 15, 33438.0715007666)
(22, 15, 35816.8395852256)
What small change should I make to the code?
rs = []
while True:
    i = 1
    if i <= 2:
        cur.execute(query, (i, 1000000, False, False))
        rs.extend(cur.fetchall())
        i = i + 1
    else:
        break
conn.close()
print rs
If it is just a counter that breaks the loop, then:
rs = []
i = 1
while i <= 2:
    cur.execute(query, (i, 1000000, False, False))
    rs.extend(cur.fetchall())
    i = i + 1
conn.close()
print rs
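As a small follow-up (my addition): the same loop reads a little more naturally as a for over range, and printing each row separately gives the one-tuple-per-line output shown in the update. The query and parameters are unchanged.
rs = []
for i in range(1, 3):  # i = 1, 2
    cur.execute(query, (i, 1000000, False, False))
    rs.extend(cur.fetchall())
conn.close()

for row in rs:
    print row  # one tuple per line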