Databricks Foundation - Exercise 3-d Reality Check failing with "Found 36 Partitions" - pyspark

I am currently doing the Databricks Foundational Course Exercises, where a Reality Check is failing in Exercise 3d.
Code -
from pyspark.sql.functions import *
from pyspark.sql.types import *

df_batch_temp_view = spark.sql(f"select * from {batch_temp_view}")

df_batch_temp_view = (df_batch_temp_view
    .withColumn("submitted_at", from_unixtime(col("submitted_at")).cast("timestamp"))
    .withColumn("submitted_yyyy_mm", date_format(col("submitted_at"), "yyyy-MM"))
    .withColumn("shipping_address_zip", col("shipping_address_zip").cast("Int"))
    .drop("sales_rep_ssn", "sales_rep_first_name", "sales_rep_last_name", "sales_rep_address",
          "sales_rep_city", "sales_rep_state", "sales_rep_zip", "product_id", "product_quantity",
          "product_sold_price")
    .dropDuplicates(["order_id", "customer_id", "sales_rep_id"])
    .select("submitted_at", "submitted_yyyy_mm", "order_id", "customer_id", "sales_rep_id",
            "shipping_address_attention", "shipping_address_address", "shipping_address_city",
            "shipping_address_state", "shipping_address_zip", "ingest_file_name", "ingested_at")
    .repartition(1)
)

df_batch_temp_view.write.format("delta").mode("overwrite").partitionBy("submitted_yyyy_mm").saveAsTable(f"{orders_table}")
My Testing
spark.sql(f"show partitions {orders_table}").count()
Result - 36
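(A quick sanity check on where the 36 comes from, not part of the original post: the table is partitioned by submitted_yyyy_mm, so SHOW PARTITIONS should return one row per distinct month present in the data.)
# Compare the partition count with the number of distinct months in the table
spark.sql(f"select count(distinct submitted_yyyy_mm) from {orders_table}").show()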
Problem:
When I run the reality check, it shows a failure at "Found 36 Partitions".
Any suggestions? I'm not sure what is wrong.

I was able to resolve the problem by dropping the database and the working directory. All I needed to do was reset everything and start again from the beginning. It was most likely a cache issue.
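For reference, the reset amounted to roughly the following (a sketch only; db_name and working_dir stand in for whatever variables the course setup notebook defines, not the course's actual names):
# Drop the exercise database and remove the working directory, then re-run
# the setup cell at the top of the notebook before retrying the exercise.
spark.sql(f"DROP DATABASE IF EXISTS {db_name} CASCADE")
dbutils.fs.rm(working_dir, True)  # recursive delete of the working directory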

Related

Clustering similar items into a new "cluster number" attribute, using K-prototypes clustering

So I have been doing a project on e-commerce customer segmentation, and I have reached the clustering of the data.
# Checking the optimal values of 'K'
import matplotlib.pyplot as plt
from kmodes.kprototypes import KPrototypes

cost = []
for num_clusters in list(range(2, 15)):
    kproto = KPrototypes(n_clusters=num_clusters, init='Cao')
    kproto.fit_predict(train, categorical=[0, 1, 4, 5, 6, 7])
    cost.append(kproto.cost_)
    labels = kproto.labels_
plt.plot(cost)
Now my main problem is that this code does not finish executing; it takes at least 2 hours to run.
The trouble starts when execution reaches
kproto.fit_predict(train, categorical=[0,1,4,5,6,7])
cost.append(kproto.cost_)
labels = kproto.labels_
plt.plot(cost)
Then it stops executing.
Please help; I could really use some expert advice.
Thank you
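Not from the original post, but one common way to make an elbow sweep like the one above tractable is to run it on a random subsample with fewer initialisations. A rough sketch, assuming train is a pandas DataFrame with the same column order as above (the subsample size and the n_init/n_jobs values are illustrative):
import numpy as np
import matplotlib.pyplot as plt
from kmodes.kprototypes import KPrototypes

# Sweep K on a random subsample so each fit stays cheap
rng = np.random.default_rng(42)
idx = rng.choice(len(train), size=min(5000, len(train)), replace=False)
train_sample = train.iloc[idx]  # use train[idx] if train is a NumPy array

cost = []
for num_clusters in range(2, 15):
    kproto = KPrototypes(n_clusters=num_clusters, init='Cao',
                         n_init=1, n_jobs=-1, random_state=42)
    kproto.fit_predict(train_sample, categorical=[0, 1, 4, 5, 6, 7])
    cost.append(kproto.cost_)

plt.plot(range(2, 15), cost, marker='o')
plt.xlabel('K')
plt.ylabel('cost')
plt.show()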

Audio widget within Jupyter notebook is **not** playing. How can I get the widget to play the audio?

I am writing my code in a Jupyter notebook in VS Code. I am hoping to play some of the audio within my data set. However, when I execute the cell, the console reports no errors and produces the widget, but the widget displays 0:00 / 0:00 (see below), indicating there is no sound to play.
Below, I have listed two ways to reproduce the error.
I have acquired data from the hub data store. Looking specifically at the spoken MNIST data set, I cannot get the data from the audio tensor to play:
import hub
from IPython.display import display, Audio
from ipywidgets import interactive
# Obtain the data using the hub module
ds = hub.load("hub://activeloop/spoken_mnist")
# Create widget
sample = ds.audio[0].numpy()
display(Audio(data=sample, rate = 8000, autoplay=True))
The second example is a test (copied from another post) that I ran to see if it was something wrong with the data or something wrong with my console, environment, etc.
# Same imports as shown above, plus numpy for the toy signal
import numpy as np

# Toy function to play beats in the notebook
def beat_freq(f1=220.0, f2=224.0):
    max_time = 5
    rate = 8000
    times = np.linspace(0, max_time, rate * max_time)
    signal = np.sin(2 * np.pi * f1 * times) + np.sin(2 * np.pi * f2 * times)
    display(Audio(data=signal, rate=rate))
    return signal

v = interactive(beat_freq, f1=(200.0, 300.0), f2=(200.0, 300.0))
display(v)
I believe that if something is wrong with the data (this is a well-known data set, so I doubt it), then only the second example will play. If it is something to do with the IDE or something else, then neither will work, which is the case now.
Apologies for the late reply! In the future, please tag the questions with activeloop so it's easier to sort through (or hit us up directly in community slack -> slack.activeloop.ai).
Regarding the Free Spoken Digit Dataset, I managed to track down the error in your usage of Activeloop Hub and the audio display.
Adding [:,0] to the 9th line will fix the display on Colab, as Audio expects one-dimensional data:
%matplotlib inline
import hub
from IPython.display import display, Audio
from ipywidgets import interactive
# Obtain the data using the hub module
ds = hub.load("hub://activeloop/spoken_mnist")
# Create widget
sample = ds.audio[0].numpy()[:,0]
display(Audio(data=sample, rate = 8000, autoplay=True))
(When we uploaded the dataset, we decided to store the audio as (N, C), where C is the number of channels, which happens to be 1 for this particular dataset. The added dimension isn't handled automatically.)
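A small generalisation of that fix (not from the original answer; the helper name play_mono is mine): squeeze out a trailing singleton channel axis before handing the array to Audio.
import numpy as np
from IPython.display import Audio, display

def play_mono(sample, rate=8000):
    # Audio expects mono data as a 1-D array, so drop an (N, 1) channel axis
    sample = np.asarray(sample)
    if sample.ndim == 2 and sample.shape[1] == 1:
        sample = sample[:, 0]
    display(Audio(data=sample, rate=rate))

# e.g. play_mono(ds.audio[0].numpy())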
Regarding VS Code: the audio, unfortunately, would still not work (not because of us, but because of VS Code), but you can still try visualizing the Free Spoken Digit Dataset (you can play the audio there, too). Hopefully this addresses your needs!
Let us know if you have further questions.
Mikayel from Activeloop

Spark::KMeans calls takeSample() twice?

I have a lot of data and I have experimented with partitions of cardinality [20k, 200k+].
I call it like this:
from pyspark.mllib.clustering import KMeans, KMeansModel
C0 = KMeans.train(first, 8192, initializationMode='random', maxIterations=10, seed=None)
C0 = KMeans.train(second, 8192, initializationMode='random', maxIterations=10, seed=None)
and I see that initRandom() calls takeSample() once.
The takeSample() implementation doesn't seem to call itself recursively or anything like that, so I would expect KMeans() to call takeSample() once. So why does the monitor show two takeSample()s per KMeans()?
Note: I run multiple KMeans() jobs and they all invoke two takeSample()s, regardless of whether the data are .cache()'d or not.
Moreover, the number of partitions doesn't affect the number of times takeSample() is called; it's constant at 2.
I am using Spark 1.6.2 (and I cannot upgrade) and my application is in Python, if that matters!
I brought this up on the Spark developers' mailing list, so I am updating:
Details of 1st takeSample():
Details of 2nd takeSample():
where one can see that the same code is executed.
As suggested by Shivaram Venkataraman in Spark's mailing list:
I think takeSample itself runs multiple jobs if the amount of samples collected in the first pass is not enough. The comment and code path at GitHub should explain when this happens. Also you can confirm this by checking if the logWarning shows up in your logs.
// If the first sample didn't turn out large enough, keep trying to take samples;
// this shouldn't happen often because we use a big multiplier for the initial size
var numIters = 0
while (samples.length < num) {
  logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
  samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
  numIters += 1
}
However, as one can see, the second comment says this shouldn't happen often, yet it happens every time for me, so if anyone has another idea, please let me know.
It was also suggested that this was a problem with the UI and that takeSample() was actually called only once, but that turned out not to be the case.
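For anyone who wants to reproduce the observation, a minimal sketch (assuming an existing SparkContext sc; the RDD size and partition count are arbitrary) that calls takeSample() directly, so you can count the resulting jobs in the Spark UI and look for the re-sample warning quoted above:
rdd = sc.parallelize(range(1000000), 200).cache()
rdd.count()  # materialise the cache first

# KMeans with initializationMode='random' samples the initial centers without replacement
sample = rdd.takeSample(False, 8192, seed=42)
print(len(sample))
# If the first sampling pass comes up short, Spark logs
# "Needed to re-sample due to insufficient sample size. Repeat #..."
# and runs a second collect job, which shows up as an extra takeSample in the UI.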

Controlling requests per second and timeout threshold in Gatling

I am working on a Gatling simulation. For the life of me, I cannot get my code to reach 10,000 requests per second. I have read the documentation and keep trying different methods, but my requests per second seem capped at 5,000. I have attached my current iteration of the code. The URL and path information is blurred out. Assume that I have no issue with the HTTP part of my simulation.
package computerdatabase

import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._
//import assertions._

class userSimulation extends Simulation {

  object Query {
    val feeder = csv("firstfileSHUF.txt").random
    val query = repeat(2000) {
      feed(feeder).
        exec(http("user")
          .get("/path/path/" + "${userID}" + "?fullData=true"))
    }
  }

  val baseUrl = "http:URL:7777"
  val httpConf = http
    .baseURL(baseUrl) // Here is the root for all relative URLs

  val scn = scenario("user") // A scenario is a chain of requests and pauses
    .exec(Query.query)

  setUp(scn.inject(rampUsers(1500) over (60 seconds)))
    .throttle(reachRps(10000) in (2 minute),
              holdFor(3 minutes))
    .protocols(httpConf)
}
Additionally, I would like to set the maximum timeout threshold to 100ms. I have tried to do this with assertions and also by editing the configuration files, but it never seems to show up during the tests or in my reports. How can I mark a request as KO if it took longer than 100ms? Thank you for your help with this matter!
I ended up figuring this out. My code above is correct, and I now understand what Stephane, one of the main contributors to Gatling, was explaining. The server at the time simply could not handle my RPS threshold; it was an unreachable upper bound. After making changes to the server, we could handle this sort of load. Additionally, I found a way to time out at 100ms in the configuration file: specifically, requestTimeout = 100 gives the timeout behavior I was looking for.
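For reference, this is roughly where that setting lives in gatling.conf (a sketch only; the exact nesting depends on the Gatling version, e.g. the 2.x defaults keep it under gatling.http.ahc):
gatling {
  http {
    ahc {
      requestTimeout = 100  # requests taking longer than 100 ms are failed (KO)
    }
  }
}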

Any reason why these instances could be misclassified?

I started off with two files, training and testing.
Then, using libsvm, I scaled both of those files to training.scale and testing.scale.
Then, using grid.py (part of libsvm), I ran training.scale and received some cross-validation values:
C = 512
gamma = 0.03125
5-fold cross-validation accuracy = 66.8421
Then, running svm-train with the parameters found by grid.py on training.scale, I got a new file called training.scale.model.
I then ran svm-predict, got a new file called testing.predict, and got a validation accuracy of 60.8333%.
Finally, comparing testing and testing.predict, I found that there were 47/120 misclassifications.
Link to the code: https://drive.google.com/folderview?id=0BxzgP5V6RPQHekRjZXdFYW9GX0U&usp=sharing
The real question is: is there any reason why these misclassifications occur?
PS: I apologise for the bad formatting of this question; I've been up for too long.
I am guessing you are new to machine learning. The results you've got are perfectly normal.
Why do these misclassifications occur? The features you've used don't separate your classes well. A 66% cross-validation score should have given you the hint: even a plain hit-or-miss guess gives you 50% accuracy, and the feature set you used only improves on that by another 16%. Try exploring new features.
I'm assuming your data set is clean.
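Not from the original thread, but for readers more comfortable in Python, the scale / grid-search / train / predict pipeline described in the question maps roughly onto scikit-learn as follows (the file names and parameter grid are illustrative, not the asker's exact setup):
import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# libsvm-format files, standing in for the question's `training` and `testing`
X_train, y_train = load_svmlight_file("training")
X_test, y_test = load_svmlight_file("testing", n_features=X_train.shape[1])

# svm-scale equivalent: rescale features to a fixed range
scaler = MinMaxScaler(feature_range=(-1, 1))
X_train = scaler.fit_transform(X_train.toarray())
X_test = scaler.transform(X_test.toarray())

# grid.py equivalent: 5-fold cross-validated search over C and gamma
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": 2.0 ** np.arange(-5, 16, 2),
                     "gamma": 2.0 ** np.arange(-15, 4, 2)},
                    cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)  # compare with C=512, gamma=0.03125

# svm-predict equivalent: accuracy on the held-out test set
print(grid.score(X_test, y_test))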