Cluster similar items into a new "cluster number" attribute using K-prototypes clustering - cluster-analysis

I have been working on an e-commerce customer segmentation project and have reached the stage of clustering the data.
# Checking the optimal value of 'K'
import matplotlib.pyplot as plt
from kmodes.kprototypes import KPrototypes

cost = []
for num_clusters in range(2, 15):
    kproto = KPrototypes(n_clusters=num_clusters, init='Cao')
    kproto.fit_predict(train, categorical=[0, 1, 4, 5, 6, 7])
    cost.append(kproto.cost_)
    labels = kproto.labels_
plt.plot(cost)
Now my main problem is that this code takes at least 2 hours to run and never finishes. The trouble starts when execution reaches
kproto.fit_predict(train, categorical=[0, 1, 4, 5, 6, 7])
cost.append(kproto.cost_)
labels = kproto.labels_
plt.plot(cost)
Then it stops executing.
Please help; I could really use some expert advice.
Thank you
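One way to debug the runtime, as a minimal sketch (assuming train is a NumPy array with the categorical columns at the indices above): time a single fit on a random subsample before running the full elbow loop. n_init, n_jobs, and verbose are kmodes options available in recent versions that reduce and expose the work per fit.
import time
import numpy as np

# Subsample (at most 10k rows, an arbitrary choice) to estimate per-fit cost
sample_idx = np.random.choice(train.shape[0], size=min(10000, train.shape[0]), replace=False)
train_sample = train[sample_idx]

start = time.time()
# n_init=1 runs a single initialization instead of the default several;
# n_jobs=-1 uses all cores; verbose=1 prints progress per iteration
kproto = KPrototypes(n_clusters=5, init='Cao', n_init=1, n_jobs=-1, verbose=1)
kproto.fit_predict(train_sample, categorical=[0, 1, 4, 5, 6, 7])
print(f'One fit on {len(train_sample)} rows took {time.time() - start:.1f}s')
If a single fit on 10k rows already takes minutes, then 13 fits on the full data will easily account for the hours you are seeing; running the elbow loop on a subsample is usually an acceptable way to pick K.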


How to display the time and space your program takes in VSCode?

It has been over a year since I started using VSCode. Almost every day I search the web for ways to display the time taken (speed) and space taken (memory during execution) by my program. This information is very important to me, but unfortunately I haven't found (or have missed) a way to display it. VSCode is cool to use, lightweight, etc., but these metrics were visible by default in some other IDEs (like Code::Blocks). Perhaps there is some extension or setting I missed in the many articles I went through. If someone could help me out here, I'll be super grateful.
Thank you in advance
if __name__ == "__main__":
    '''
    Example code showing how to display the time and space
    taken by a program (here, creating an N-gram language model)
    '''
    import os, psutil, time

    start = time.time()
    m = create_ngram_model(3, '/content/train_corpus.txt')  # Replace with your custom function
    process = psutil.Process(os.getpid())
    print('Memory usage in Mega Bytes: ', process.memory_info().rss / (1024 ** 2))  # rss is in bytes
    print(f'Time taken: {time.time() - start}')
You can call psutil.Process(os.getpid()) in any function (in the main thread, to see the memory usage of the entire program) and read process.memory_info().rss at any point during execution; an example output is Memory usage: 0.5774040222167969. The code above also shows how to measure the program's runtime.
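To avoid repeating that boilerplate, here is a small sketch of my own (the decorator name profiled is mine, not from any library) that wraps any function with the same time and memory reporting:
import functools
import os
import time

import psutil

def profiled(func):
    '''Report wall-clock time and resident memory after each call.'''
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        process = psutil.Process(os.getpid())
        start = time.time()
        result = func(*args, **kwargs)
        elapsed = time.time() - start
        mem_mb = process.memory_info().rss / (1024 ** 2)  # rss is in bytes
        print(f'{func.__name__}: {elapsed:.3f} s, {mem_mb:.2f} MB RSS')
        return result
    return wrapper

@profiled
def slow_sum(n):
    return sum(range(n))

slow_sum(10_000_000)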

Databricks Foundation - Number of Partitions on Exercise 3-d Reality Check is failing with "Found 36 Partitions"

I am currently doing the Databricks Foundational Course exercises, and a Reality Check is failing in Exercise 3d.
Code -
from pyspark.sql.functions import *
from pyspark.sql.types import *

df_batch_temp_view = spark.sql(f"select * from {batch_temp_view}")
df_batch_temp_view = (df_batch_temp_view
    .withColumn("submitted_at", from_unixtime(col("submitted_at")).cast("timestamp"))
    .withColumn("submitted_yyyy_mm", date_format(col("submitted_at"), "yyyy-MM"))
    .withColumn("shipping_address_zip", col("shipping_address_zip").cast("int"))
    .drop("sales_rep_ssn", "sales_rep_first_name", "sales_rep_last_name", "sales_rep_address",
          "sales_rep_city", "sales_rep_state", "sales_rep_zip", "product_id", "product_quantity",
          "product_sold_price")
    .dropDuplicates(["order_id", "customer_id", "sales_rep_id"])
    .select("submitted_at", "submitted_yyyy_mm", "order_id", "customer_id", "sales_rep_id",
            "shipping_address_attention", "shipping_address_address", "shipping_address_city",
            "shipping_address_state", "shipping_address_zip", "ingest_file_name", "ingested_at")
    .repartition(1)
)
df_batch_temp_view.write.format("delta").mode("overwrite").partitionBy("submitted_yyyy_mm").saveAsTable(f"{orders_table}")
My testing:
spark.sql(f"show partitions {orders_table}").count()
Result - 36
Problem:
When I run the reality check, it fails with "Found 36 Partitions".
Any suggestions? I am not sure what is wrong.
I was able to resolve the problem by dropping the database and the working directory. All I needed to do was reset the whole thing and start from the beginning; it was most likely a cache issue.
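For anyone hitting the same check, a sketch of the reset that worked for me (the variable names orders_table and working_dir come from the course notebooks, so treat them as assumptions):
# Drop the table created by earlier runs, then remove the stale Delta files
# so the next run of the exercise starts clean.
spark.sql(f"DROP TABLE IF EXISTS {orders_table}")
dbutils.fs.rm(working_dir, True)  # recursive delete of the working directory
After rerunning the exercise from the top, the partition count matched what the reality check expected.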

Audio widget within Jupyter notebook is **not** playing. How can I get the widget to play the audio?

I am writing my code within a Jupyter notebook in VS Code. I am hoping to play some of the audio in my data set. However, when I execute the cell, the console reports no errors and produces the widget, but the widget displays 0:00 / 0:00, indicating there is no sound to play.
Below, I have listed two ways to reproduce the error.
First, I acquired data from the hub data store. Looking specifically at the Spoken MNIST data set, I cannot get the data from the audio tensor to play:
import hub
from IPython.display import display, Audio
from ipywidgets import interactive
# Obtain the data using the hub module
ds = hub.load("hub://activeloop/spoken_mnist")
# Create widget
sample = ds.audio[0].numpy()
display(Audio(data=sample, rate=8000, autoplay=True))
The second example is a test (copied from another post) that I ran to see whether the problem was with the data or with my console, environment, etc.
# Same imports as shown above, plus numpy for generating the signal
import numpy as np

# Toy function to play beats in the notebook
def beat_freq(f1=220.0, f2=224.0):
    max_time = 5
    rate = 8000
    times = np.linspace(0, max_time, rate * max_time)
    signal = np.sin(2 * np.pi * f1 * times) + np.sin(2 * np.pi * f2 * times)
    display(Audio(data=signal, rate=rate))
    return signal

v = interactive(beat_freq, f1=(200.0, 300.0), f2=(200.0, 300.0))
display(v)
I believe that if something is wrong with the data (this is a well-known data set, so I doubt it), then only the second example will play. If it is something to do with the IDE or something else, then neither will work, which is the case now.
Apologies for the late reply! In the future, please tag such questions with activeloop so they are easier to sort through (or hit us up directly in our community Slack -> slack.activeloop.ai).
Regarding the Free Spoken Digit Dataset, I managed to track the error to your usage of Activeloop Hub and the audio display.
Adding [:, 0] on the 9th line fixes the display on Colab, since Audio expects one-dimensional data:
%matplotlib inline
import hub
from IPython.display import display, Audio
from ipywidgets import interactive
# Obtain the data using the hub module
ds = hub.load("hub://activeloop/spoken_mnist")
# Create widget
sample = ds.audio[0].numpy()[:, 0]
display(Audio(data=sample, rate=8000, autoplay=True))
(When we uploaded the dataset, we decided to store the audio as (N, C), where C is the number of channels, which happens to be 1 for this particular dataset. The extra channel dimension is not removed automatically.)
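An equivalent fix, as a small sketch of my own (assuming the (N, 1) shape described above), is to squeeze the channel axis with NumPy instead of indexing it:
import numpy as np

sample = np.squeeze(ds.audio[0].numpy())  # (N, 1) -> (N,)
display(Audio(data=sample, rate=8000, autoplay=True))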
Regarding VS Code: the audio, unfortunately, still will not play there (not because of us, but because of VS Code), but you can still try visualizing the Free Spoken Digit Dataset (you can play the audio there, too). Hopefully this addresses your needs!
Let us know if you have further questions.
Mikayel from Activeloop

Pulling ipywidget value manually

I have this IPython code:
import ipywidgets as widgets
from IPython.display import display
import time

w = widgets.Dropdown(
    options=['Addition', 'Multiplication', 'Subtraction'],
    value='Addition',
    description='Task:',
)

def on_change(change):
    print("changed to %s" % change['new'])

w.observe(on_change)
display(w)
It works as expected. When the value of the widget changes, the on_change function gets triggered. However, I want to run a long computation and periodically check for updates to the widget. For example:
for i in range(100):
    time.sleep(1)
    # poll for changes to w here.
    # if w.has_changed:
    #     print(w.value)
How can I achieve this?
For reference, I seem to be able to do the desired polling with
import IPython
ipython = IPython.get_ipython()
ipython.kernel.do_one_iteration()
(I'd still love to have some feedback on whether this works by accident or design.)
I think you need to use threads and hook into the ZMQ event loop. This gist illustrates an example:
https://gist.github.com/maartenbreddels/3378e8257bf0ee18cfcbdacce6e6a77e
Also see https://github.com/jupyter-widgets/ipywidgets/issues/642.
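For illustration, here is a minimal sketch of that idea (my own, not taken from the gist; long_computation is a hypothetical name and w is the dropdown from the question): run the long computation in a background thread so the kernel's main loop stays free to process widget messages, and read w.value from the worker.
import threading
import time

def long_computation():
    # The main thread keeps servicing ZMQ messages, so w.value stays current
    # while this worker thread runs.
    for i in range(100):
        time.sleep(1)
        print(f'iteration {i}: w.value = {w.value}')

threading.Thread(target=long_computation, daemon=True).start()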
To elaborate on the OP's self-answer: this does work. It forces the widgets to sync with the kernel at an arbitrary point in the loop, and it can be done right before accessing w.value.
So the full solution would be:
import IPython
import time

ipython = IPython.get_ipython()

old_val = w.value
for i in range(100):
    time.sleep(1)
    ipython.kernel.do_one_iteration()  # sync widget state with the kernel
    new_val = w.value
    if new_val != old_val:
        print(new_val)
        old_val = new_val
A slight improvement to the ipython.kernel.do_one_iteration call used above:
# Max iteration limit, in case I don't know what I'm doing here...
for _ in range(100):
    ipython.kernel.do_one_iteration()
    if ipython.kernel.msg_queue.empty():
        break
In my case, I had a number of UI elements that could be clicked multiple times between do_one_iteration calls; processed one at a time, with a one-second delay between each, that could get annoying. The loop above drains at most 100 queued messages at once. I tested it by mashing a button repeatedly, and now all the clicks get processed as soon as the sleep(1) ends.

Spark::KMeans calls takeSample() twice?

I have a lot of data and I have experimented with partitions of cardinality in the range [20k, 200k+].
I call it like this:
from pyspark.mllib.clustering import KMeans, KMeansModel
C0 = KMeans.train(first, 8192, initializationMode='random', maxIterations=10, seed=None)
C1 = KMeans.train(second, 8192, initializationMode='random', maxIterations=10, seed=None)
In the source I see that initRandom() calls takeSample() once, and the takeSample() implementation doesn't appear to call itself recursively, so I would expect each KMeans() run to invoke takeSample() once. Why, then, does the monitor show two takeSample() jobs per KMeans()?
Note: I run more KMeans() instances and they all invoke takeSample() twice, regardless of whether the data is .cache()'d or not.
Moreover, the number of partitions doesn't affect how many times takeSample() is called; it is constant at 2.
I am using Spark 1.6.2 (and I cannot upgrade), and my application is in Python, if that matters!
I brought this to the Spark developers' mailing list, so I am updating:
The job details of the 1st and 2nd takeSample() (screenshots omitted) show that the same code is executed in both.
As suggested by Shivaram Venkataraman on Spark's mailing list:
I think takeSample itself runs multiple jobs if the amount of samples collected in the first pass is not enough. The comment and code path at GitHub should explain when this happens. Also you can confirm this by checking if the logWarning shows up in your logs.
// If the first sample didn't turn out large enough, keep trying to take samples;
// this shouldn't happen often because we use a big multiplier for the initial size
var numIters = 0
while (samples.length < num) {
  logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
  samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
  numIters += 1
}
However, as one can see, the comment says this shouldn't happen often, yet it happens every time for me, so if anyone has another idea, please let me know.
It was also suggested that this was a UI problem and that takeSample() was actually called only once, but that was just hot air.
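To check Shivaram's hypothesis directly, a small sketch of my own (assuming the standard SparkContext variable sc and the same training call as above): raise the log level so the warning is visible, rerun the training, and search the driver output for the re-sample message.
# Make sure WARN-level messages reach the driver log
sc.setLogLevel("WARN")
C0 = KMeans.train(first, 8192, initializationMode='random', maxIterations=10, seed=None)
# Then search the driver log for:
#   "Needed to re-sample due to insufficient sample size"
# If that line appears, the second takeSample() job is the re-sampling pass.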