How to get the right data from another dataframe - python-polars

print(
    (
        df1.lazy()
        .with_context(df2.lazy())
        .select(
            pl.col("df1_date")
            .apply(lambda s: pl.col("df2_date").filter(pl.col("df2_date") >= s).first())
            .alias("release_date")
        )
    ).collect()
)
Instead of the actual data, I get a df of query plans. Is there any other way to solve my problem? Thanks!!
In pandas, I can get what I want by using:
df1["release_date"] = df1.index.map(
lambda x: df2[df2.index < x].index[-1]
)
Edit:
Please try the code below and you will see that polars only returns query plans for this, while pandas gives the data I want.
import polars as pl

df1 = pl.DataFrame(
    {
        "df1_date": [20221011, 20221012, 20221013, 20221014, 20221016],
        "df1_col1": ["foo", "bar", "foo", "bar", "foo"],
    }
)
df2 = pl.DataFrame(
    {
        "df2_date": [20221012, 20221015, 20221018],
        "df2_col1": ["1", "2", "3"],
    }
)
print(
    (
        df1.lazy()
        .with_context(df2.lazy())
        .select(
            pl.col("df1_date")
            .apply(lambda s: pl.col("df2_date").filter(pl.col("df2_date") <= s).last())
            .alias("release_date")
        )
    ).collect()
)
df1 = df1.to_pandas().set_index("df1_date")
df2 = df2.to_pandas().set_index("df2_date")
df1["release_date"] = df1.index.map(
    lambda x: df2[df2.index <= x].index[-1] if len(df2[df2.index <= x]) > 0 else 0
)
print(df1)

It looks like you're trying to do an asof join, in other words a join where you take the last value that matched rather than requiring an exact match.
You can do
df1 = (df1.lazy()
    .join_asof(df2.lazy(), left_on="df1_date", right_on="df2_date")
    .select(["df1_date", "df1_col1",
             pl.col("df2_date").fill_null(0).alias("release_date")])
    .collect())
The first difference is that in polars you don't assign new columns; you assign the whole df, so it's always just the name of the df on the left side of the equals sign. The join_asof replaces your index/map/lambda construct. The last step is to replace the null value with 0 using fill_null and rename the column with alias. There was a bug in an old version of polars that prevented the collect from working at the end; that is fixed in at least 0.15.1 (maybe an earlier version too, but that's the version I checked with).
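With the example frames from your edit, df1 then looks roughly like this; the 20221011 row has no df2_date on or before it, so fill_null(0) turns that null into 0:
shape: (5, 3)
┌──────────┬──────────┬──────────────┐
│ df1_date ┆ df1_col1 ┆ release_date │
│ ---      ┆ ---      ┆ ---          │
│ i64      ┆ str      ┆ i64          │
╞══════════╪══════════╪══════════════╡
│ 20221011 ┆ foo      ┆ 0            │
│ 20221012 ┆ bar      ┆ 20221012     │
│ 20221013 ┆ foo      ┆ 20221012     │
│ 20221014 ┆ bar      ┆ 20221012     │
│ 20221016 ┆ foo      ┆ 20221015     │
└──────────┴──────────┴──────────────┘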

Related

How can I parse a string based on character count?

I am trying to parse a string and append the results to new fields in a dataframe. In SQL, it would work like this:
UPDATE myDF
SET theyear = SUBSTRING(filename, 52, 4),
    themonth = SUBSTRING(filename, 57, 2),
    theday = SUBSTRING(filename, 60, 2),
    thefile = SUBSTRING(filename, 71, 99)
I want to use Scala to do the work because the dataframes I'm working with are really huge, and this will be orders of magnitude faster than doing the same in SQL. So, based on my research, I think it would look something like the code below, but I don't know how to count the number of characters in a field.
Here is some sample data:
abc://path_to_all_files_in_data_lake/2018/10/27/Parent/CPPP1027.Mid.414.gz
I want to get the year, the month, the day, and the file name, so in this example I want the dataframe to end up with theyear 2018, themonth 10, theday 27, and thefile CPPP1027.Mid.414.gz.
val modifiedDF = df
  .withColumn("theyear", )
  .withColumn("themonth", )
  .withColumn("theday", )
  .withColumn("thefile", )
modifiedDF.show(false)
So, I want to append four fields to a dataframe: theyear, themonth, theday, and thefile. Then, do the parsing based on the count of characters in a string. Thanks.
I would probably rather use RegEx for pattern matching than string length. In this simple example, I extract the main date pattern using regexp_extract then build the other columns from there using substring:
%scala
import org.apache.spark.sql.functions._

val df = Seq(
    ("abc://path_to_all_files_in_data_lake/2018/10/27/Parent/CPPP1027.Mid.414.gz"),
    ("abc://path_to_all_files_in_data_lake/2019/02/28/Parent/CPPP77.Mid.303.gz")
  )
  .toDF("somePath")
  .withColumn("theDate", regexp_extract($"somePath", "[0-9][0-9][0-9][0-9]/[0-9][0-9]/[0-9][0-9]", 0))
  .withColumn("theYear", substring($"theDate", 1, 4))
  .withColumn("theMonth", substring($"theDate", 6, 2))
  .withColumn("theDay", substring($"theDate", 9, 2))
  .withColumn("theFile", regexp_extract($"somePath", "[^/]+\\.gz", 0))

df.show
My results:
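For the two sample paths, df.show prints roughly the following (the long paths are truncated in the display):
+--------------------+----------+-------+--------+------+-------------------+
|            somePath|   theDate|theYear|theMonth|theDay|            theFile|
+--------------------+----------+-------+--------+------+-------------------+
|abc://path_to_all...|2018/10/27|   2018|      10|    27|CPPP1027.Mid.414.gz|
|abc://path_to_all...|2019/02/28|   2019|      02|    28|  CPPP77.Mid.303.gz|
+--------------------+----------+-------+--------+------+-------------------+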
Does that work for you?
Using built-in functions on the data frame -
You can use length(Column) from org.apache.spark.sql.functions to find the number of characters in a column.
val modifiedDF = df
  .withColumn("theyear", when(length($"columnName") === ??, ??).otherwise(??))
Using plain Scala -
df.map { row =>
  val c = row.getAs[String]("columnName")
  // c.length gives the number of characters in c
  // build all the columns here and
  // return (column1, column2, ...)
}.toDF("column1", "column2")
Here is the final working solution!
%scala
import org.apache.spark.sql.functions._

val dfMod = df
  .withColumn("thedate", regexp_extract($"filepath", "[0-9][0-9][0-9][0-9]/[0-9][0-9]/[0-9][0-9]", 0))
  .withColumn("theyear", substring($"thedate", 1, 4))
  .withColumn("themonth", substring($"thedate", 6, 2))
  .withColumn("theday", substring($"thedate", 9, 2))
  .withColumn("thefile", regexp_extract($"filepath", "[^/]+\\.gz", 0))

dfMod.show(false)
Thanks for the assist wBob!!!

Pyspark isin with column in argument doesn't exclude rows

I need to exclude the rows whose name has a True value in the status column.
In my opinion this filter(isin() == False) construct should solve my problem, but it doesn't.
df = sqlContext.createDataFrame([( "A", "True"), ( "A", "False"), ( "B", "False"), ("C", "True")], ( "name", "status"))
df.registerTempTable("df")
df_t = df[df.status == "True"]
from pyspark.sql import functions as sf
df_f = df.filter(df.status.isin(df_t.name)== False)
I expect this row:
B | False
Any help is greatly appreciated!
First, I think in your last statement, you meant to use df.name instead of df.status.
df_f = df.filter(df.status.isin(df_t.name)== False)
Second, even if you use df.name, it still won't work.
That's because it mixes Column expressions from two different DataFrames (df_t and df) in a single statement, and I don't think that works in pyspark.
However, you can achieve the same effect using other methods.
If I understand correctly, you want to select 'A' and 'C' first through the 'status' column, then select the rows excluding ['A', 'C']. The trick is to extend the selection to the second row of 'A', which can be achieved with a Window. See below:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df = sqlContext.createDataFrame([( "A", "True"), ( "A", "False"), ( "B", "False"), ("C", "True")], ( "name", "status"))
df.registerTempTable("df")
# create an auxiliary column satisfying the condition
df = df.withColumn("flag", F.when(df['status']=="True", 1).otherwise(0))
df.show()
# extend the selection to other rows with the same 'name'
df = df.withColumn('flag', F.max(df['flag']).over(Window.partitionBy('name')))
df.show()
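# At this point both 'A' rows carry flag 1 (the window extends the max over each name),
# 'C' has flag 1, and only the 'B' row keeps flag 0, so a simple filter finishes the job.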
#filter is now easy
df_f = df.filter(df.flag==0)
df_f.show()
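With the sample data, the final output should contain just the row you expected:
+----+------+----+
|name|status|flag|
+----+------+----+
|   B| False|   0|
+----+------+----+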

Check every column in a spark dataframe has a certain value

Can we check whether every column in a Spark dataframe contains a certain string (for example "Y") using Spark SQL or Scala?
I have tried the following but don't think it is working properly.
df.select(df.col("*")).filter("'*' =='Y'")
Thanks,
Sai
You can do something like this to keep the rows that have 'Y' in at least one column:
//Get all columns
val columns: Array[String] = df.columns
//For each column, keep the rows with 'Y'
val seqDfs: Seq[DataFrame] = columns.map(name => df.filter(s"$name == 'Y'"))
//Union all the dataframes together into one final dataframe
val output: DataFrame = seqDfs.reduceRight(_ union _)
You can use the DataFrame method columns to get all the column names
val columnNames: Array[String] = df.columns
and then add all filters in a loop
var filteredDf = df.select(df.col("*"))
for(name <- columnNames) {
filteredDf = filteredDf.filter(s"$name =='Y'")
}
or you can build a single SQL query string using the same approach
If you want to keep every row in which any of the columns is equal to 1 (or anything else), you can dynamically create a query like this:
from functools import reduce
from pyspark.sql.functions import col, lit

cols = [col(c) == lit(1) for c in df.columns]
query = cols[0]
for c in cols[1:]:
    query |= c
df.filter(query).show()
It's a bit verbose, but it is very clear what is happening. A more elegant version would be:
res = df.filter(reduce(lambda x, y: x | y, (col(c) == lit(1) for c in df.columns)))
res.show()

Pyspark groupby then sort within group

I have a table which contains id, offset, text. Suppose input:
id  offset  text
1   1       hello
1   7       world
2   1       foo
I want output like:
id  text
1   hello world
2   foo
I'm using:
df.groupby("id").agg(concat_ws(" ", collect_list("text")))
But I don't know how to ensure the order within the concatenated text. I sorted the data before the groupby, but I've heard that a groupby might shuffle the data. Is there a way to sort within each group after the groupby?
This will create the required df:
df1 = sqlContext.createDataFrame([("1", "1","hello"), ("1", "7","world"), ("2", "1","foo")], ("id", "offset" ,"text" ))
display(df1)
then you can use the following code, which could be optimized further:
from pyspark.sql.functions import col, collect_list, concat, concat_ws, lit, udf

@udf
def sort_by_offset(offset_text):
    # split the "offset text" pairs, sort them by the numeric offset,
    # then join the text parts back together
    pairs = [item.split(" ") for item in offset_text.split("-")]
    pairs = sorted(pairs, key=lambda x: int(x[0]))
    return " ".join(pair[1] for pair in pairs)

df2 = df1.withColumn("offset_text", concat(col("offset"), lit(" "), col("text")))
df3 = df2.groupby(col("id")).agg(concat_ws("-", collect_list(col("offset_text"))).alias("offset_text"))
df4 = df3.withColumn("text", sort_by_offset(col("offset_text")))
display(df4)
Final Output:
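With the sample df1, df4 should look roughly like this (the order of the pieces inside offset_text may vary, which is exactly why the UDF re-sorts them by offset):
+---+---------------+-----------+
| id|    offset_text|       text|
+---+---------------+-----------+
|  1|1 hello-7 world|hello world|
|  2|          1 foo|        foo|
+---+---------------+-----------+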
Add sort_array:
from pyspark.sql.functions import collect_list, concat_ws, sort_array

df.groupby("id").agg(concat_ws(" ", sort_array(collect_list("text"))))
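Note that sort_array here orders the collected text values alphabetically, which happens to give the right answer for this example. If the order must follow offset, a common variation (just a sketch, using the df1 created above and casting offset to int so it sorts numerically) is to collect (offset, text) structs, sort the array, and pull the text field back out:
from pyspark.sql.functions import col, collect_list, concat_ws, sort_array, struct

result = (
    df1.withColumn("offset", col("offset").cast("int"))  # sort numerically, not lexically
    .groupby("id")
    .agg(concat_ws(" ", sort_array(collect_list(struct("offset", "text"))).getField("text")).alias("text"))
)
result.show()
# +---+-----------+
# | id|       text|
# +---+-----------+
# |  1|hello world|
# |  2|        foo|
# +---+-----------+
# (group order may vary)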

Scala code to label rows of data frame based on another data frame

I just started learning Scala to do data analytics and I encountered a problem when trying to label my data rows based on another data frame.
Suppose I have a df1 with columns "date", "id", "value", and "label", where "label" is set to "F" for all rows in df1 at the beginning. Then I have a df2, a smaller set of data with columns "date", "id", "value". I want to change the row label in df1 from "F" to "T" if that row appears in df2, i.e. some row in df2 has the same combination of ("date", "id", "value") as that row in df1.
I tried df.filter and df.join, but it seems that neither solves my problem.
I think this is what you are looking for.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._

// create DataFrame 1
val df1 = spark.sparkContext.parallelize(Seq(
    ("2016-01-01", 1, "abcd", "F"),
    ("2016-01-01", 2, "efg", "F"),
    ("2016-01-01", 3, "hij", "F"),
    ("2016-01-01", 4, "klm", "F")
  )).toDF("date", "id", "value", "label")

// create DataFrame 2
val df2 = spark.sparkContext.parallelize(Seq(
    ("2016-01-01", 1, "abcd"),
    ("2016-01-01", 3, "hij")
  )).toDF("date1", "id1", "value1")

val condition = $"date" === $"date1" && $"id" === $"id1" && $"value" === $"value1"

// join the two dataframes with the above condition
val result = df1.join(df2, condition, "left")

// check whether both sides matched, then drop the extra columns
val finalResult = result.withColumn("label", condition)
  .drop("date1", "id1", "value1")

// convert the boolean label (true/null) to "T" or "F"
finalResult.withColumn("label", when(col("label") === true, "T").otherwise("F")).show
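For the sample data, rows 1 and 3 appear in df2, so the output should look roughly like this (row order may vary):
+----------+---+-----+-----+
|      date| id|value|label|
+----------+---+-----+-----+
|2016-01-01|  1| abcd|    T|
|2016-01-01|  2|  efg|    F|
|2016-01-01|  3|  hij|    T|
|2016-01-01|  4|  klm|    F|
+----------+---+-----+-----+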
The basic idea is to join the two and then calculate the result. Something like this:
val df2Mod = df2.withColumn("tmp", lit(true))
val joined = df1.join(df2Mod, df1("date") <=> df2Mod("date") && df1("id") <=> df2Mod("id") && df1("value") <=> df2Mod("value"), "left_outer")
joined.withColumn("label", when(joined("tmp").isNull, "F").otherwise("T"))
The idea is that we add the "tmp" column and then do a left_outer join. "tmp" would be null for everything not in df2 and therefore we can use that to calculate the label.