I am trying to parse a PySpark column which contains an "=" sign. The two expressions I've created for this purpose work individually:
DF=DF.withColumn("findEqual",instr(columnName,"="))
and also when I create a substring column:
DF=DF.withColumn("parsedString",substring(columnName,2,18))
However, when I combine the two functions:
DF=DF.withColumn("parsedString",2,instr(columnName,"="))
I receive an error:
TypeError: int() argument must be a string or a number, not 'Column'
The issue seems to be that "findEqual" isn't seen by PySpark as an integer, but rather as an "integer object".
Thanks for your help!
You are using functions defined on strings, not on PySpark columns; you can convert them using a udf:
from pyspark.sql.functions import udf
from pyspark.sql.types import *
def instr(x, s):
    return s in x
instr_udf = lambda s: udf(lambda x: instr(x, s), BooleanType())
DF=DF.withColumn("findEqual",instr_udf("=")("columnName"))
and
substring_udf = udf(substring, StringType())
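That substring_udf line is only a sketch; a complete version, assuming a plain Python helper (py_substring is a name made up here) and SQL-style 1-based positions, might look like:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def py_substring(x, start, length):
    # plain Python slicing; positions are 1-based to mirror SQL's substring
    return None if x is None else x[start - 1:start - 1 + length]

substring_udf = lambda start, length: udf(lambda x: py_substring(x, start, length), StringType())

DF = DF.withColumn("parsedString", substring_udf(2, 18)("columnName"))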
I would not recommend using UDFs when equivalent functions already exist in PySpark:
DF=DF.withColumn("findEqual",DF.columnName.like('%=%'))
DF=DF.withColumn("parsedString",DF.columnName[2:18])
This is my logic in PySpark:
df2 = spark.sql(f" SELECT tbl_name, column_name, data_type, current_count FROM {database_name}.{tablename}")
query_df = spark.sql(f"SELECT tbl_name, COUNT(column_name) as `num_cols` FROM {database_name}.{tablename} GROUP BY tbl_name")
df_join = df2.join(query_df,['tbl_name'])
Then I want to add another column to the DataFrame called 'column_case_lower', which indicates whether the column names are lower case, using the islower() function.
I'm using this logic to do the analysis:
df_join.withColumn("column_case_lower",
when((col("column_name").islower()) == 'true'.otherwise('false'))
The error is: unexpected EOF while parsing
I'm expecting something like this: a new column showing true or false for each column name.
islower() can't be applied to a Column type. Use the code below, which uses a UDF instead.
def checkCase(col_value):
    return col_value.islower()
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType
checkUDF = udf(lambda z: checkCase(z), BooleanType())
from pyspark.sql.functions import col,when
df.withColumn("new_col", when(checkUDF(col('column_name')) == True,"True")
.otherwise("False")).show()
I have one column in a DataFrame with the format
'[{jsonobject},{jsonobject}]'. Here the length will be 2.
I have to find the length of this array and store it in another column.
I've only worked with pySpark, but the Scala solution would be similar. Assuming the column name is input:
from pyspark.sql import functions as f, types as t
json_schema = t.ArrayType(t.MapType(t.StringType(), t.StringType()))
df.select(f.size(f.from_json(df.input, json_schema)).alias("num_objects"))
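A quick usage sketch on a made-up DataFrame (the JSON strings below are just illustrative, and an active spark session is assumed):
from pyspark.sql import functions as f, types as t

df = spark.createDataFrame(
    [('[{"a": "1"}, {"b": "2"}]',), ('[{"a": "1"}]',)],
    ["input"],
)
json_schema = t.ArrayType(t.MapType(t.StringType(), t.StringType()))
df.select(f.size(f.from_json(df.input, json_schema)).alias("num_objects")).show()
# num_objects is 2 for the first row and 1 for the second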
My Scala dataframe has a column with the datatype array(element: String). I want to display the rows of the dataframe that have the word "hello" in that column.
I have this:
display(df.filter($"my_column".contains("hello")))
I get an error because of a data type mismatch. It says that argument 1 requires string type; however, 'my_column' is of array<string> type.
You can use the array_contains function:
import org.apache.spark.sql.functions._
df.filter(array_contains(df.col("my_column"), "hello")).show
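For reference, the PySpark equivalent (assuming the same column name) would be:
from pyspark.sql.functions import array_contains

df.filter(array_contains(df.my_column, "hello")).show()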
I have a PySpark dataframe, and I wish to get the mean and std for all columns, rename the resulting columns, and cast their type. What is the easiest way to implement this? My current code is below:
test_mean=test.groupby('id').agg({'col1': 'mean',
'col2': 'mean',
'col3':'mean'
})
test_std=test.groupby('id').agg({'col1': 'std',
'col2': 'std',
'col3':'std'
})
## rename the columns one by one
## type cast decimal to float
May I know how to improve it?
Thanks.
You can try with column expressions:
from pyspark.sql import functions as F
expr1 = F.stddev(F.col('col1').cast('integer')).alias('col1')
expr2 = F.stddev(F.col('col2').cast('integer')).alias('col2')

test \
    .groupBy('id') \
    .agg(
        expr1,
        expr2
    )
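Building on that, here is a sketch that generates both the mean and std expressions for every column in one pass, handling the renaming with alias and the decimal-to-float cast with cast (the column list is assumed from the question):
from pyspark.sql import functions as F

cols = ['col1', 'col2', 'col3']  # columns taken from the question

exprs = []
for c in cols:
    exprs.append(F.mean(F.col(c)).cast('float').alias(c + '_mean'))
    exprs.append(F.stddev(F.col(c)).cast('float').alias(c + '_std'))

test_stats = test.groupBy('id').agg(*exprs)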
I am trying to perform the following operation:
var test = df.groupBy(keys.map(col(_)): _*).agg(sequence.head, sequence.tail: _*)
I know that the required parameter inside the agg should be a Seq[Column].
I then have a dataframe "expr" containing the following:
sequences
count(col("colname1"),"*")
count(col("colname2"),"*")
count(col("colname3"),"*")
count(col("colname4"),"*")
The sequences column is of string type, and I want to use the values of each row as input to the agg, but I am not able to reach them.
Any idea how to achieve this?
If you can change the strings in the sequences column to be SQL expressions, then this can be solved: Spark provides a function expr that takes a SQL string and converts it into a column. An example dataframe with working expressions:
val df2 = Seq("sum(case when A like 2 then A end) as A", "count(B) as B").toDF("sequences")
To convert the dataframe to a Seq[Column], do:
val seqs = df2.as[String].collect().map(expr(_))
Then the groupBy and agg:
df.groupBy(...).agg(seqs.head, seqs.tail:_*)
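If it helps, the same pattern carries over to PySpark; collect the strings and map them through expr (a sketch assuming keys is a list of grouping column names):
from pyspark.sql.functions import col, expr

# turn each SQL string in the sequences column into a Column expression
seqs = [expr(row.sequences) for row in df2.select("sequences").collect()]

test = df.groupBy(*[col(k) for k in keys]).agg(*seqs)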