Sum columns of a dataset - pyspark

I'm working in PySpark and I have a dataset like this:
I want to create a new df like this, with the corresponding sums:
So I tried this code:
df = df.withColumnRenamed("month_actual_january", "monthjanuary")
fin=df.groupBy(["column1","column2"]).sum()
The problem is that I get the following error:
Attribute sum(column3) contains an invalid character among ,;{}()\n\t=. Please use an alias to rename it
Do you know how to fix this error? Thanks!

Aggregate every non-key column explicitly and give each sum an alias, for example with a list comprehension:
from pyspark.sql.functions import sum
df.groupBy(["column1","column2"]).agg(*[sum(x).alias(f"sum_{x}") for x in df.drop("column1","column2").columns]).show()
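To make the pattern concrete, here is a minimal, self-contained sketch with a made-up toy dataset (column names and values are assumptions, not the asker's data):
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as spark_sum
spark = SparkSession.builder.getOrCreate()
# Toy frame standing in for the asker's dataset
df = spark.createDataFrame(
    [("a", "x", 1, 10), ("a", "x", 2, 20), ("b", "y", 3, 30)],
    ["column1", "column2", "column3", "column4"])
# Sum every non-key column, aliasing each result so the generated name
# contains none of the characters ,;{}()\n\t= that Spark complains about
value_cols = [c for c in df.columns if c not in ("column1", "column2")]
fin = df.groupBy("column1", "column2").agg(
    *[spark_sum(c).alias(f"sum_{c}") for c in value_cols])
fin.show()
# Resulting columns: column1, column2, sum_column3, sum_column4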

How to remove the first 2 rows using zipWithIndex in Spark Scala

I have two header rows in the file and have to remove them. I tried zipWithIndex, which assigns an index starting from zero, but I get an error when applying a filter condition on it.
val data=spark.sparkContext.textFile(filename)
val s=data.zipWithIndex().filter(row=>row[0]>1) --> throwing error here
Any help here please.
Sample data:
===============
sno,empno,name --> need to remove
c1,c2,c3 ==> need to remove
1,123,ramakrishna
2,234,deepthi
Error: identifier expected but integer literal found
value row of type (String,Long) does not take type parameters.
not found: type <error>
The error comes from row[0]: in Scala, square brackets denote type parameters, so you have to use the tuple accessor instead. zipWithIndex gives you (value, index) pairs, so filter with row._2 > 1 to drop the first two rows (the index is zero-based). Alternatively, add a row number column and remove rows easily by filtering row_num > 2.
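For reference, here is the same idea sketched in PySpark rather than the asker's Scala (the file name is hypothetical):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# zipWithIndex pairs each line with its position; keep everything after the
# two header lines (indices 0 and 1), then strip the index again
data = spark.sparkContext.textFile("employees.csv")  # hypothetical file name
rows = (data.zipWithIndex()
            .filter(lambda pair: pair[1] > 1)
            .map(lambda pair: pair[0]))
print(rows.collect())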

How can I resolve the "need struct type but got struct"

I have a column named probability and I want to create a new column from it. I want to extract values from the probability column, which looks like an array, but when I try to do so I receive an error:
"Can't extract value from probability#52427: need struct type but got struct<type:tinyint,size:int,indices:array<int>,values:array<double>>"
Here is my extraction code:
preds_test = preds.withColumn("newCol", col("probability").getItem(3))
Can someone please tell me what I did wrong?
I figured it out. I used a lambda function. This is my code:
preds_subset = preds.select('CustomerID', 'prediction', probs_churn('probability')).orderBy(asc("probability"))
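The probability column produced by a Spark ML model is a vector type (the struct<type,size,indices,values> in the error message), which is why getItem fails; the values have to be pulled out with a UDF. The self-answer does not show probs_churn, but it is presumably something along these lines (the extracted index and return type are assumptions):
from pyspark.sql.functions import udf, asc
from pyspark.sql.types import DoubleType
# Hypothetical reconstruction of probs_churn: pull one element out of the
# ML vector and return it as a plain double so it can be selected and sorted
probs_churn = udf(lambda v: float(v[1]), DoubleType())  # index 1 is an assumption
preds_subset = preds.select(
    'CustomerID',
    'prediction',
    probs_churn('probability').alias('probability')
).orderBy(asc('probability'))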

How to keep the original column after applying data validation in the same column

I have a task to validate decimal and date fields. I am able to validate the decimal and date fields in place on the same column, but I am not able to keep the old column values.
Input:
id,amt1
1,123
2,321
3,345
4,543
5,789
Current Output:
id,amt1
1,12.3
2,32.1
3,34.5
4,54.3
5,78.9
Expected Output:
id,amt1,original_amt1_values
1,12.3,123
2,32.1,321
3,34.5,345
4,54.3,543
5,78.9,789
Below is the code. I am able to validate the decimal field, but not able to keep the original values. Kindly help me with this; I want to keep the original column in the dataframe itself.
SourceFileDF = SourceFileDF.withColumn("amt1", DecimalConversion(col("amt1")))
DecimalConversion is my UDF and SourceFileDF is my dataframe.
You can use a temporary column name for "amt1" and then rename the columns:
SourceFileDF = SourceFileDF.withColumn("amt1_converted", DecimalConversion(col("amt1")))
SourceFileDF = SourceFileDF.withColumnRenamed("amt1", "original_amt1_values")
SourceFileDF = SourceFileDF.withColumnRenamed("amt1_converted", "amt1")
You can also use select and provide the aliases in a single statement (keeping the id column so the output matches the expected columns):
sourceFileDF.select(
  $"id",
  DecimalConversion($"amt1").as("amt1"),
  $"amt1".as("original_amt1_values")
)
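For reference, a minimal end-to-end sketch of the first approach in PySpark; DecimalConversion is reconstructed here as a stand-in UDF that inserts the decimal point, which is only an assumption about what the asker's UDF does:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "123"), (2, "321"), (3, "345"), (4, "543"), (5, "789")],
    ["id", "amt1"])
# Stand-in for the asker's DecimalConversion UDF (assumed behaviour: "123" -> "12.3")
DecimalConversion = udf(lambda s: s[:-1] + "." + s[-1], StringType())
df = (df.withColumn("amt1_converted", DecimalConversion(col("amt1")))
        .withColumnRenamed("amt1", "original_amt1_values")
        .withColumnRenamed("amt1_converted", "amt1"))
df.show()
# Columns after the renames: id, original_amt1_values, amt1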

DataFrame keying using the pandas groupby method

I'm new to pandas and trying to learn how to work with it. I'm having a problem when trying to apply an example I saw in one of Wes's videos and notebooks to my own data. I have a csv file that looks like this:
filePath,vp,score
E:\Audio\7168965711_5601_4.wav,Cust_9709495726,-2
E:\Audio\7168965711_5601_4.wav,Cust_9708568031,-80
E:\Audio\7168965711_5601_4.wav,Cust_9702445777,-2
E:\Audio\7168965711_5601_4.wav,Cust_7023544759,-35
E:\Audio\7168965711_5601_4.wav,Cust_9702229339,-77
E:\Audio\7168965711_5601_4.wav,Cust_9513243289,25
E:\Audio\7168965711_5601_4.wav,Cust_2102513187,18
E:\Audio\7168965711_5601_4.wav,Cust_6625625104,-56
E:\Audio\7168965711_5601_4.wav,Cust_6073165338,-40
E:\Audio\7168965711_5601_4.wav,Cust_5105831247,-30
E:\Audio\7168965711_5601_4.wav,Cust_9513082770,-55
E:\Audio\7168965711_5601_4.wav,Cust_5753907026,-79
E:\Audio\7168965711_5601_4.wav,Cust_7403410322,11
E:\Audio\7168965711_5601_4.wav,Cust_4062144116,-70
I load it into a data frame and then group it by "filePath" and "vp"; the code is:
res = df.groupby(['filePath','vp']).size()
res.index
and the output is:
[E:\Audio\7168965711_5601_4.wav Cust_2102513187,
Cust_4062144116, Cust_5105831247,
Cust_5753907026, Cust_6073165338,
Cust_6625625104, Cust_7023544759,
Cust_7403410322, Cust_9513082770,
Cust_9513243289, Cust_9702229339,
Cust_9702445777, Cust_9708568031,
Cust_9709495726]
Now I'm trying to treat the index like a dict, as I saw in examples, but when I do
res['Cust_4062144116']
I get an error:
KeyError: 'Cust_4062144116'
I do get a result when I use the filepath, but as I understand it (and saw in previous examples) I should be able to use the vp keys as well, shouldn't I?
Sorry if it's a trivial one; I just can't understand why it works in one example but not in the other.
Rutger, you are not correct. It is possible to partially index a MultiIndex series; I simply did it the wrong way.
The first index level is the file name (e.g. E:\Audio\7168965711_5601_4.wav above) and the second level is vp. Meaning, for each file name I have multiple vps.
Now, this is correct:
res['E:\Audio\7168965711_5601_4.wav']
and will return:
Cust_2102513187 2
Cust_4062144116 8
....
but trying to index by the inner index (the Cust_ indexes) will fail.
You group by two columns and therefore get a MultiIndex in return. This means you also have to slice using those two columns, not a single index value.
Your .size() on the groupby object converts it into a Series. If you put it into a DataFrame you can use the .xs method to slice a single level:
res = pd.DataFrame(df.groupby(['filePath','vp']).size())
res.xs('Cust_4062144116', level=1)
That works. If you want to keep it as a series, boolean indexing can help, something like:
res[res.index.get_level_values(1) == 'Cust_4062144116']
The last option is a bit less readable, but sometimes also more flexible; you could test for multiple values at once, for example:
res[res.index.get_level_values(1).isin(['Cust_4062144116', 'Cust_6073165338'])]
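A minimal runnable sketch of the same slicing, using a toy frame in place of the asker's csv (the values are made up):
import pandas as pd
df = pd.DataFrame({
    "filePath": ["a.wav"] * 3 + ["b.wav"] * 2,
    "vp": ["Cust_1", "Cust_2", "Cust_1", "Cust_2", "Cust_3"],
    "score": [-2, -80, 25, 18, -56]})
res = df.groupby(["filePath", "vp"]).size()
# Partial indexing on the outer level works directly
print(res["a.wav"])
# Slicing on the inner level needs .xs or boolean indexing
print(res.xs("Cust_2", level=1))
print(res[res.index.get_level_values(1) == "Cust_2"])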

Merge multiple geometry lines into a single geometry line

Now I have 3 lines (LINESTRING) with 3 different geometry values.
Line#1 : "11AABB"
Line#2 : "22CCDD"
Line#3 : "33EEFF"
How can I merge these lines into a single line via a PostGIS function?
I know it involves ST_LineMerge or ST_Union, but I'm not sure how to use them with geometry values.
Could you be more specific?
With the examples you provided, you could try this:
SELECT ST_Union('11AABB'::geometry, ST_Union('22CCDD'::geometry, '33EEFF'::geometry));
Your geometry values are invalid, but this is the correct syntax for the query you want. With the example values you provided you would get this error: "ERROR: parse error - invalid geometry"