When I try .collect(streaming=True) I get:
pyo3_runtime.PanicException: placeholder should be replaced
The code below works with streaming=False but panics with streaming=True.
import polars as pl
import pandas as pd
right_ldf = pl.from_pandas(pd.read_excel("file.xls")).lazy()
left_ldf = (
    pl.scan_csv("file.csv")
    .filter(pl.col("col2") == 2)
    .with_columns(col3=pl.when(pl.col("col1") == 1).then("result"))
    .join(right_ldf, on="col1", how="inner")
    .collect(streaming=True)
)
In the actual code there are many "with_columns" and "filter" nodes; the "join" is the last node before collect, and that code panics.
But if I move the join node before the "with_columns" and "filter" nodes, e.g. right after the scan_csv node, there is no panic.
I expected no panic wherever I place the join node, as long as no column calculation depends on right_ldf. Is this a bug, or is there a trick to resolve this issue?
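For reference, a sketch of the reordering described above (same columns as in the snippet; moving the join is only safe here because, as noted, none of the later expressions use columns from right_ldf):
left_ldf = (
    pl.scan_csv("file.csv")
    .join(right_ldf, on="col1", how="inner")   # join moved right after scan_csv
    .filter(pl.col("col2") == 2)
    .with_columns(col3=pl.when(pl.col("col1") == 1).then("result"))
    .collect(streaming=True)                   # no panic with this ordering
)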
Here is a trivial benchmark based on a real-life workload.
import gc
import time
import numpy as np
import polars as pl
df = (  # I have a dataframe like this from reading a csv.
    pl.Series(
        name="x",
        values=np.random.choice(
            ["ASPARAGUS", "BROCCOLI", ""], size=30_000_000
        ),
    )
    .to_frame()
    .with_column(
        # replace empty strings with null, keeping the column name "x"
        pl.when(pl.col("x") == "").then(None).otherwise(pl.col("x")).alias("x")
    )
)
start = time.time()
df.lazy().with_column(
pl.col("x").cast(pl.Categorical).fill_null("MISSING")
).collect()
end = time.time()
print(f"Cast then fill_null took {end-start:.2f} seconds.")
Cast then fill_null took 0.93 seconds.
gc.collect()
start = time.time()
df.lazy().with_column(
pl.col("x").fill_null("MISSING").cast(pl.Categorical)
).collect()
end = time.time()
print(f"Fill_null then cast took {end-start:.2f} seconds.")
Fill_null then cast took 1.36 seconds.
(1) Am I correct to think that casting to categorical then filling null will always be faster?
(2) Am I correct to think that the result will always be identical regardless of the order?
(3) If the answers are "yes" and "yes", is it possible that someday polars will do this rearrangement automatically? Or is it actually impossible to try all these sorts of permutations in a general query optimizer?
Thanks.
1: yes
2: Somewhat. The logical categorical representation will always be the same. The physical representation changes with the order of occurrence of the string values: doing fill_null before the cast means "MISSING" will be encountered earlier. But this should be seen as an implementation detail.
3: Yes, this is something we can automatically optimize. Just today we merged something similar: https://github.com/pola-rs/polars/pull/4883
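A minimal sketch of point 2, reusing the fill_null/cast pattern from the benchmark above (assuming the same polars version as the benchmark; Series.to_physical is used only to peek at the codes): the logical values come out identical either way, but the physical codes depend on when "MISSING" is first seen.
import polars as pl

s = pl.Series("x", ["BROCCOLI", None, "ASPARAGUS"])
a = s.cast(pl.Categorical).fill_null("MISSING")   # "MISSING" is encountered last
b = s.fill_null("MISSING").cast(pl.Categorical)   # "MISSING" is encountered second

print(a.to_list() == b.to_list())   # True: the logical values are identical
print(a.to_physical().to_list())    # physical codes, e.g. [0, 2, 1]
print(b.to_physical().to_list())    # physical codes, e.g. [0, 1, 2]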
This is my first time using TabPy, and I have the connections successfully set up. In the "Create Calculated Field" dialog in Tableau, I have tried
SCRIPT_REAL("
import numpy as np
import pandas as pd
")
which results in: "SCRIPT_REAL is being called with (string), did you mean (string, ...)?"
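For context, the error message is pointing at the SCRIPT_REAL signature: it expects the Python code string plus at least one aggregated field argument, which is exposed inside the script as _arg1 (then _arg2, and so on). A hedged sketch of the shape such a call takes (the field [Age] is a placeholder, not from the original post):
SCRIPT_REAL("
return [x * 2 for x in _arg1]
", SUM([Age]))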
Additionally, how do I refer to the dataset inside the script, as I did in Python, to execute the following?
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("C:/Users/.../dataset.csv")
data.head()
plt.figure(figsize=(7, 7))
plt.pie(data['stroke'].value_counts(sort=True),
        explode=(0.05, 0),
        labels=data['stroke'].value_counts(sort=True).index,
        colors=["blue", "green"],
        autopct='%1.1f%%')
plt.title('Pie Chart')
plt.show()
I am wondering if it's possible to obtain the result of percentile_rank using the QuantileDiscretizer transformer in pyspark.
The purpose is that I am trying to avoid computing percent_rank over the entire column, as it generates the following warning:
WARN WindowExec: No Partition Defined for Window operation!
Moving all data to a single partition, this can cause serious performance degradation.
The method I am following is to first use QuantileDiscretizer then normalize to [0,1]:
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.ml.feature import QuantileDiscretizer
from scipy.stats import gamma

X1 = gamma.rvs(0.2, size=1000)
df = spark.createDataFrame(pd.DataFrame(X1, columns=["x"]))
df = df.withColumn("perc_rank", F.percent_rank().over(Window.orderBy("x")))
df = QuantileDiscretizer(numBuckets=df.count() + 1,
                         inputCol="x",
                         outputCol="q_discretizer").fit(df).transform(df)
agg_values = df.agg(F.max(df["q_discretizer"]).alias("maxval"),
                    F.min(df["q_discretizer"]).alias("minval")).collect()[0]
xmax, xmin = agg_values["maxval"], agg_values["minval"]
normalize = F.udf(lambda x: (x - xmin) / (xmax - xmin))
df = df.withColumn("perc_discretizer", normalize("q_discretizer"))
df = df.withColumn("error", F.round(F.abs(F.col("perc_discretizer") - F.col("perc_rank")), 6))
df.select(F.max("error")).show()
df.show(5)
However, it seems that as the number of data points increases, the error grows, so I am not sure this is the right way to do it.
Is it possible to use QuantileDiscretizer to obtain the percentile_rank?
Alternatively is there a way to compute percentile_rank over an entire column in an efficient way?
Well, you can use the snippet below to avoid the warning message:
X1 = gamma.rvs(0.2, size=10)
df = spark.createDataFrame(pd.DataFrame(X1, columns=["x"]))
df = df.withColumn("dummyCol", F.lit("some_val"))
win = Window.partitionBy("dummyCol").orderBy("x")
df = df.withColumn("perc_rank", F.percent_rank().over(win)).drop("dummyCol")
but nonetheless the data would still be moved to a single worker; I don't think there is any better alternative to avoid the shuffle here, since the complete column needs to be rank-ordered.
In case you have multiple windows over the same column, you can try to pre-partition the data and then apply the ranking functions.
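A minimal sketch of one way to read that suggestion (the dummy partition key and the second ranking column are illustrative, not from the original answer): repartition by the window's partition key once, then reuse the same window spec for every ranking function so the sort and shuffle work is shared.
win = Window.partitionBy("dummyCol").orderBy("x")
df = (
    df.repartition("dummyCol")                        # pre-partition once on the window key
      .withColumn("perc_rank", F.percent_rank().over(win))
      .withColumn("dense_rank", F.dense_rank().over(win))
      .drop("dummyCol")
)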
I am trying to find similar users by vectorizing user features and sorting by distance between user vectors in PySpark. I'm running this in Databricks on Runtime 5.5 LTS ML cluster (Scala 2.11, Spark 2.4.3)
Following the code in the docs, I am using approxSimilarityJoin() method from the pyspark.ml.feature.BucketedRandomProjectionLSH model.
I have found similar users successfully using approxSimilarityJoin(), but every now and then I come across a user of interest that apparently has no users similar to them.
Usually when approxSimilarityJoin() doesn't return anything, I assume it's because the threshold parameter is set too low. That fixes the issue sometimes, but now I've tried using a threshold of 100000 and still get nothing back.
I define the model as
brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes", bucketLength=1.0)
I'm not sure whether changing bucketLength or numHashTables would help in obtaining results.
The following example shows a pair of users where approxSimilarityJoin() returned something (dataA, dataB) and a pair of users (dataC, dataD) where it didn't.
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col
dataA = [(0, Vectors.dense([0.7016968702094931,0.2636417660310031,4.155293362824633,4.191398632883099]),)]
dataB = [(1, Vectors.dense([0.3757117100334294,0.2636417660310031,4.1539923630906745,4.190086328785612]),)]
dfA = spark.createDataFrame(dataA, ["customer_id", "scaledFeatures"])
dfB = spark.createDataFrame(dataB, ["customer_id", "scaledFeatures"])
brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes", bucketLength=2.0,
numHashTables=3)
model = brp.fit(dfA)
# a threshold of 100000 is clearly overkill
# returns a dataframe with the dfA and dfB feature vectors
# and an EuclideanDistance of 0.32599039770730354
model.approxSimilarityJoin(dfA, dfB, 100000, distCol="EuclideanDistance").show()
dataC = [(0, Vectors.dense([1.1600056435954367,78.27652460873155,3.5535837780801396,0.0030949620591871887]),)]
dataD = [(1, Vectors.dense([0.4660731192450482,39.85571715054726,1.0679201943112886,0.012330725745062067]),)]
dfC = spark.createDataFrame(dataC, ["customer_id", "scaledFeatures"])
dfD = spark.createDataFrame(dataD, ["customer_id", "scaledFeatures"])
brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes", bucketLength=2.0,
numHashTables=3)
model = brp.fit(dfC)
# returns empty df
model.approxSimilarityJoin(dfC, dfD, 100000, distCol="EuclideanDistance").show()
I was able to obtain results to the second half of the example above by increasing the bucketLength parameter value to 15. The threshold could have been lowered because the Euclidean Distance was ~34.
Per the PySpark docs:
bucketLength = the length of each hash bucket, a larger bucket lowers the false negative rate
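For reference, a sketch of the fix described above applied to the second pair (bucketLength=15 is the value from the text; the best value depends on the scale of the feature vectors):
brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes",
                                  bucketLength=15.0, numHashTables=3)
model = brp.fit(dfC)
# now returns a row instead of an empty dataframe; EuclideanDistance is ~34,
# so the threshold could be far lower than 100000
model.approxSimilarityJoin(dfC, dfD, 100000, distCol="EuclideanDistance").show()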
Below is the code that is giving the TypeError:
import pandas as pd
import networkx as nx
datamuse = pd.read_csv('NetworkDatasheet.csv', index_col=0)
print(datamuse)
G = nx.DiGraph(datamuse.values)
nx.draw_random(G, with_labels=True)
dc= nx.degree_centrality(G)
bc=nx.betweenness_centrality(G,normalized = True)
ec=nx.eigenvector_centrality(G)
nx.set_node_attributes(G,'degree centrality',dc)
nx.set_node_attributes(G,'betweenness centrality',bc)
nx.set_node_attributes(G,'eigenvector centrality',ec)
G.nodes()[1]['degree centrality']
The values in the dictionaries (e.g. dc) are floats like 0.029411764705882353.
The last line of your code should be replaced by:
G.nodes(data=True)[1][1]['degree centrality']
You need data=True to get the associated attributes of your nodes; otherwise you only get the node ids.
Then, when you do G.nodes(data=True)[1], you actually get a tuple (nodeId, data_dict), so to access the attribute values you need to take the second element, hence the final [1].
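A minimal sketch of that indexing on a toy graph (assuming networkx 1.x, where G.nodes(data=True) returns a plain list of (nodeId, data_dict) tuples and set_node_attributes takes the attribute name before the value dict, as in the question):
import networkx as nx

G = nx.DiGraph()
G.add_edge(0, 1)
dc = nx.degree_centrality(G)
nx.set_node_attributes(G, 'degree centrality', dc)

print(G.nodes())                  # [0, 1] -- plain node ids, not subscriptable by attribute name
print(G.nodes(data=True)[1])      # (1, {'degree centrality': 1.0}) -- a (nodeId, data_dict) tuple
print(G.nodes(data=True)[1][1]['degree centrality'])  # 1.0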