PySpark RDD Transpose - pyspark

I have the following emp table in the Hive testing database:
1 ram 2000.0 101 market
2 shyam 3000.0 102 IT
3 sam 4000.0 103 finance
4 remo 1000.0 103 finance
I want to transpose this table in PySpark, keeping the first two columns as they are and stacking the last 3 columns.
I have done the following in the PySpark shell:
test = sqlContext.sql("select * from testing.emp")
data = test.flatMap(lambda row: [Row(id=row['id'], name=row['name'], column_name=col, column_val=row[col]) for col in ('sal', 'dno', 'dname')])
emp = sqlContext.createDataFrame(data)
emp.registerTempTable('mytempTable')
sqlContext.sql('create table testing.test(id int,name string,column_name string,column_val int) row format delimited fields terminated by ","')
sqlContext.sql('INSERT INTO TABLE testing.test select * from mytempTable')
the expected output is
1 ram sal 2000
1 ram dno 101
1 ram dname market
2 shyam sal 3000
2 shyam dno 102
2 shyam dname IT
3 sam sal 4000
3 sam dno 103
3 sam dname finance
4 remo sal 1000
4 remo dno 103
4 remo dname finance
But the output I get is
NULL 2000.0 1 NULL
NULL NULL 1 NULL
NULL NULL 1 NULL
NULL 3000.0 2 NULL
NULL NULL 2 NULL
NULL NULL 2 NULL
NULL 4000.0 3 NULL
NULL NULL 3 NULL
NULL NULL 3 NULL
NULL 1000.0 4 NULL
NULL NULL 4 NULL
NULL NULL 4 NULL
Also, please let me know how I can loop over the columns if the table has many columns.

Sorry, I just noticed it is a Hive table.
from pyspark import SparkConf
from pyspark.sql import Row, SparkSession

cfg = SparkConf().setAppName('MyApp')
spark = SparkSession.builder.config(conf=cfg).enableHiveSupport().getOrCreate()
df = spark.table("default.test").cache()
cols = df.columns[2:5]
df = df.rdd.map(lambda x: Row(id=x[0], name=x[1], val=dict(zip(cols, x[2:5]))))
df = spark.createDataFrame(df)
df.createOrReplaceTempView('mytempTable')
sql = """
select
id,
name,
explode(val) AS (k,v)
from mytempTable
"""
df = spark.sql(sql)
df.show()
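As a side note, if you are on Spark 2.x you can also un-pivot without the RDD round-trip by using the stack expression. A minimal sketch, assuming the source columns are id, name, sal, dno, dname as in your question (the casts are only needed because the stacked columns have different types):
# hypothetical sketch: un-pivot with stack() instead of flatMap + Row
unpivoted = spark.table("testing.emp").selectExpr(
    "id",
    "name",
    "stack(3, 'sal', cast(sal as string), 'dno', cast(dno as string), 'dname', dname) as (column_name, column_val)"
)
unpivoted.show()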
And in Hive:
> desc test;
OK
id string
somebody string
sal string
dno string
dname string
dt string
# Partition Information
# col_name data_type comment
dt string
P.S.
You can also do this with SQL only, without Spark, as:
select
a.id,
a.somebody,
b.k,
b.v
from (
select
id,
somebody,
map('sal',sal,
'dno',dno,
'dname',dname) as val
from default.test
) a
lateral VIEW explode(val) b as k,v
For your question about small parquet files:
cfg = SparkConf().setAppName('MyApp')
spark = SparkSession.builder.enableHiveSupport().config(conf=cfg).getOrCreate()
df = spark.sparkContext.parallelize(range(26))
df = df.map(lambda x: (x, chr(x + 97), '2017-01-12'))
df = spark.createDataFrame(df, schema=['idx', 'val', 'dt']).coalesce(1)
df.write.saveAsTable('default.testing', mode='overwrite', partitionBy='dt', format='parquet')
The number of small parquet files equals the number of DataFrame partitions.
You can use df.coalesce or df.repartition to reduce the number of DataFrame partitions.
But I am not sure whether reducing the DataFrame to only one partition brings a hidden problem (e.g. OOM).
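For example, instead of coalesce(1) you could repartition by the partition column, so that each dt value ends up in roughly one task and therefore one output file (a sketch under that assumption, not tested on your data):
# hypothetical sketch: one parquet file per distinct 'dt' value instead of one file overall
df.repartition('dt') \
    .write.saveAsTable('default.testing', mode='overwrite', partitionBy='dt', format='parquet')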
And there is another way to combine small files without Spark, using Hive SQL only:
set mapred.reduce.tasks=1;
insert overwrite table default.yourtable partition (dt='2017-01-13')
select col from tmp.yourtable where dt='2017-01-13';

Related

Overwriting a group of values in a column with values from another group, based on a grouping in another column

Input:
Name GroupId Processed NewGroupId NgId
Mike 1 N 9 NULL
Mikes 1 N 9 NULL
Miken 5 Y 9 5
Mikel 5 Y 9 5
Output:
Name GroupId Processed NewGroupId NgId
Mike 1 N 9 5
Mikes 1 N 9 5
Miken 5 Y 9 5
Mikel 5 Y 9 5
The query below worked in SQL Server, but because of the correlated subquery it does not work in Spark SQL.
Is there an alternative, either with Spark SQL or the PySpark DataFrame API?
SELECT Name,groupid,IsProcessed,ngid,
CASE WHEN ngid IS NULL THEN
COALESCE((SELECT top 1 ngid FROM temp D
WHERE D.NewGroupId = T.NewGroupId AND
D.ngid IS NOT NULL ), null)
ELSE ngid
END AS ngid
FROM temp T
It worked with the query below in Spark SQL:
spark.sql("select LKUP, groupid, IsProcessed, NewGroupId, coalesce((select Max(D.ngid) from test2 D where D.NewGroupId = T.NewGroupId AND D.ngid is not null), null) as ngid from test2 T")

PostgreSQL: multiply and sum rows using a window function?

I need to somehow use LAG together with SUM, accumulating after each returned row.
Table:
id  valor   data
1   1,0182  2022-01-01
2   1,0183  2022-02-01
3   1,0174  2022-03-01
Expected result:
id  valor   data
1   1,0182  2022-01-01
2   1,0368  2022-02-01
3   1,0548  2022-03-01
In the column "valor" I need to take the previous value, multiply it by the current value, and accumulate the result:
row 1: 1,0182
row 2: 1,0182 x 1,0183
row 3: (1,0182 x 1,0183) x 1,0174
row 4: ((1,0182 x 1,0183) x 1,0174) x ##,####
and so on for the following rows.
SELECT i.id,
       valor,
       COALESCE(LAG(valor) OVER (PARTITION BY indice_correcao_id ORDER BY data), 1) AS valor_anterior,
       SUM(valor) OVER (PARTITION BY indice_correcao_id ORDER BY data) AS cum_amt
FROM indice_correcao_itens AS i
WHERE i.indice_correcao_id = 1
  AND i."data" BETWEEN '2022-01-01' AND '2022-03-28'
ORDER BY i."data";
You can define your own aggregate function that returns the product of the input:
-- From https://stackoverflow.com/a/13156170/2650437
-- See this answer if your column is not a FLOAT but e.g. a NUMERIC
-- as you will need to adapt the aggregate a bit
CREATE AGGREGATE PRODUCT(DOUBLE PRECISION) (
SFUNC = float8mul,
STYPE = FLOAT8
);
You can use custom aggregates as window functions, so
CREATE TEMP TABLE t (
"id" INTEGER,
"valor" FLOAT,
"data" TIMESTAMP
);
INSERT INTO t ("id", "valor", "data")
VALUES ('1', 1.0182, '2022-01-01')
, ('2', 1.0183, '2022-02-01')
, ('3', 1.0174, '2022-03-01');
SELECT id, SUM(valor) OVER (ORDER BY data, id), PRODUCT(valor) OVER (ORDER BY data, id)
FROM t;
returns
+--+------------------+--------------+
|id|sum |product |
+--+------------------+--------------+
|1 |1.0182 |1.0182 |
|2 |2.0365 |1.03683306 |
|3 |3.0539000000000005|1.054873955244|
+--+------------------+--------------+

Overriding values in dataframe while joining 2 dataframes

In the below example I would like to override the values in Spark Dataframe A with corresponding value in Dataframe B (if it exists). Is there a way to do it using Spark (Scala)?
Dataframe A
ID Name Age
1 Paul 30
2 Sean 35
3 Rob 25
Dataframe B
ID Name Age
1 Paul 40
Result
ID Name Age
1 Paul 40
2 Sean 35
3 Rob 25
The combined use of a left join and coalesce should do the trick, something like:
import org.apache.spark.sql.functions.coalesce

dfA
  .join(dfB, Seq("ID"), "left")
  .select(
    dfA.col("ID"),
    dfA.col("Name"),
    coalesce(dfB.col("Age"), dfA.col("Age")).as("Age")
  )
Explanation: for a specific ID some_id, there are 2 cases:
If dfB does not contain some_id, the left join produces null for dfB.col("Age"), and coalesce returns the first non-null value among the expressions passed to it, i.e. the value of dfA.col("Age").
If dfB does contain some_id, the value from dfB.col("Age") is used.
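For reference, a rough PySpark equivalent of the Scala snippet above (assuming the two DataFrames are called dfA and dfB):
from pyspark.sql import functions as F

# join on ID, prefer dfB's Age when it exists, otherwise keep dfA's Age
result = (
    dfA.join(dfB.select("ID", F.col("Age").alias("Age_B")), on="ID", how="left")
       .withColumn("Age", F.coalesce(F.col("Age_B"), F.col("Age")))
       .drop("Age_B")
)
result.show()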

How to aggregate some dates and the data belonging to them into one row in pyspark?

I want to aggregate some dates (for example, one month per customer) and their data into one row in PySpark.
A simple example is the table below:
Customer_Id  Date        Data
id1          2021-01-01  2
id1          2021-01-02  3
id1          2021-01-03  4
I want to change it into:
Customer_Id  Date                       col1  col2  col3
id1          [2021-01-01 - 2021-01-03]  2     3     4
@matin, you can try the code below to replicate the output:
from pyspark.sql.functions import *

schema = ["Customer_Id", "Date", "Data"]
data = [["id1", "2021-01-01", 2], ["id1", "2021-01-02", 3], ["id1", "2021-01-03", 4]]
df = spark.createDataFrame(data, schema)
df2 = df.groupBy(["Customer_Id"]).agg(
    collect_list("Date").alias("list_date"),
    collect_list("Data").alias("list_data"),
)
df3 = (df2.withColumn("col1", df2.list_data[0])
          .withColumn("col2", df2.list_data[1])
          .withColumn("col3", df2.list_data[2])
          .drop("list_data"))
df3.show(truncate=False)
df3.printSchema()
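If you also want the Date column collapsed into a range like in the expected output, one option (an untested sketch building on the same imports) is to aggregate the min and max dates as well:
# hypothetical follow-up: build "[first - last]" from min/max of Date
# note: collect_list does not guarantee element order
df4 = df.groupBy("Customer_Id").agg(
    concat(lit("["), min("Date"), lit(" - "), max("Date"), lit("]")).alias("Date"),
    collect_list("Data").alias("list_data"),
)
df4 = (df4.withColumn("col1", df4.list_data[0])
          .withColumn("col2", df4.list_data[1])
          .withColumn("col3", df4.list_data[2])
          .drop("list_data"))
df4.show(truncate=False)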
Let me know if you need further modifications.

How to count the number of values in a column in a dataframe based on the values in the other dataframe

I have two dataframes. The first one is a raw dataframe, so its item_value column has all the item values. The other dataframe has columns named min, avg, max, which hold the min, avg and max values specified for the items in the first dataframe. I want to count the number of item values in the first dataframe based on the specified aggregate values in the second dataframe.
The first dataframe looks like this:
item_name  item_value
A          1.4
A          2.1
B          3.0
A          2.8
B          4.5
B          1.1
The second dataframe looks like this:
item_name  min  avg  max
A          1.1  2    2.7
B          2.1  3    4.0
I want to count the number of item values that are greater than the defined min, avg and max values in the other dataframe.
So the result I want is:
item_name  min  avg  max
A          3    2    1
B          2    1    1
Any help would be much appreciated
*please forgive my grammar
If you don't mind a SQL implementation, you can try:
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')
sql = """
select df2.item_name,
sum(case when df1.item_value > df2.min then 1 else 0 end) as min,
sum(case when df1.item_value > df2.avg then 1 else 0 end) as avg,
sum(case when df1.item_value > df2.max then 1 else 0 end) as max
from df2 join df1 on df2.item_name=df1.item_name
group by df2.item_name
"""
df = spark.sql(sql)
df.show()
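For completeness, roughly the same logic with the DataFrame API (a sketch, assuming df1 and df2 are as described above):
from pyspark.sql import functions as F

# count, per item_name, how many raw values exceed each threshold
result = (
    df1.join(df2, on="item_name")
       .groupBy("item_name")
       .agg(
           F.sum(F.when(F.col("item_value") > F.col("min"), 1).otherwise(0)).alias("min"),
           F.sum(F.when(F.col("item_value") > F.col("avg"), 1).otherwise(0)).alias("avg"),
           F.sum(F.when(F.col("item_value") > F.col("max"), 1).otherwise(0)).alias("max"),
       )
)
result.show()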