Deep Neural Networks - Improve performance of text classification - neural-network

I have an NLP classification problem to classify short phrases/words into one of two categories. The features are words and short phrases, and some of the words and phrases repeat within a given observation. For example:
| Observation | label |
|:--------------------------------------------------:|:------:|
| 'dog, jump, eat, drink water, jump' | animal |
| 'run, jump, travel, sleep, talk, grocery shopping' | human |
| 'swim, language, jump, eat, go to school' | human |
| 'bite, lick, growl, eat, run, lick, scratch' | animal |
I've tried using affine DNNs and ConvNets with tokenized input text (CountVectorizer and TfidfVectorizer), and also with tokenized input plus word embeddings, with and without regularization. Validation accuracy never seems to exceed 76%. Any suggestions on how to improve the performance of my model?
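A baseline worth comparing against before tuning deeper networks is word/bigram TF-IDF fed into a linear classifier. Below is a minimal sketch; the lists `train_texts`, `train_labels`, `val_texts` and `val_labels` are placeholders for your own split, not data from the question.

```python
# Minimal TF-IDF + logistic regression baseline (assumes train/validation
# lists already exist); often competitive with small DNNs on short,
# keyword-like phrases.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2),   # unigrams + bigrams such as "drink water"
                    sublinear_tf=True,    # dampen tokens repeated within one observation
                    min_df=2),
    LogisticRegression(max_iter=1000, C=1.0),
)
model.fit(train_texts, train_labels)
print("validation accuracy:", model.score(val_texts, val_labels))
```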

Related

Dataframe level computation in pySpark

I am using PySpark and want to take advantage of multiple nodes to improve processing time.
For example:
Suppose I have 3 columns and 1 million records:
Emp ID | Salary | % Increase | New Salary
     1 |    200 |       0.05 |
     2 |    500 |       0.15 |
     3 |    300 |       0.25 |
     4 |    700 |       0.10 |
I want to compute the New Salary column and want to use the power of multiple nodes in pyspark to reduce overall processing time.
I don't want to do an iterative row wise computation of New Salary.
Does df.withColumn do the computation at a dataframe level? Would it be able to give better performance as more nodes are used?
Spark's dataframes are basically a distributed collection of data. Spark manages this distribution and the operations on it (such as .withColumn), so a column-level expression like this is computed in parallel across the partitions on the worker nodes rather than row by row on the driver.
A quick web search for "Spark performance tuning" will also turn up plenty of guidance on increasing Spark's performance.
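As an illustration, here is a minimal PySpark sketch of that column-level computation; the sanitized column names and sample rows are stand-ins for the table above, not the asker's actual data.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 200, 0.05), (2, 500, 0.15), (3, 300, 0.25), (4, 700, 0.10)],
    ["EmpID", "Salary", "PctIncrease"],
)

# withColumn builds a lazy column expression; Spark evaluates it partition by
# partition on the executors, so no row-wise loop runs on the driver.
result = df.withColumn("NewSalary", F.col("Salary") * (1 + F.col("PctIncrease")))
result.show()
```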

Flow Balance in Chemical Engineering Process Flow Diagram

I am developing a chemical process simulation program that takes two user inputs:
Definitions of process units
A Process Flow Diagram (PFD) that depicts how the process units are connected and flow/mass stream directions;
The PFD may have recirculation loops. A simple example may look like this:
PFD:
Feed_Unit --> Chemical_Reactor --> Separator --> Product
    ^                                  |
    |                                  | (flow split)
    +<--------(recirculation)----------+
    |
    +------> Waste_Material
The flow of Waste_Material is a function of the Chemical_Reactor and changes during the simulation from one time step to the next.
I can balance the Feed, Waste_Material, and Product flows easily. What would be an efficient approach/algorithm to make sure the inner streams' flows are balanced too?
This seems like a 1000-level mass balance problem. It would be easiest to just set up a system of linear equations in MATLAB and use the rref() function, as long as it's not a transient problem.
Just write out all the balances and then plug them into MATLAB.
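As a rough sketch of that idea (NumPy's linear solver standing in for MATLAB's rref(); the split fractions below are made-up illustrative numbers, not values from the question):

```python
import numpy as np

# Steady-state mass balances, one equation per unit or junction.
# Unknowns x = [reactor_in, separator_in, product, recycle_out, waste, recycle_return]
# Assumed (illustrative) splits: 70% of the separator inlet goes to product,
# 30% to the recycle line, and 10% of the recycle line is purged to waste.
feed = 100.0
A = np.array([
    [ 1.0,  0.0, 0.0,  0.0, 0.0, -1.0],  # feed junction: reactor_in - recycle_return = feed
    [-1.0,  1.0, 0.0,  0.0, 0.0,  0.0],  # reactor:       separator_in = reactor_in
    [ 0.0, -0.7, 1.0,  0.0, 0.0,  0.0],  # separator:     product = 0.7 * separator_in
    [ 0.0, -0.3, 0.0,  1.0, 0.0,  0.0],  # separator:     recycle_out = 0.3 * separator_in
    [ 0.0,  0.0, 0.0, -0.1, 1.0,  0.0],  # flow split:    waste = 0.1 * recycle_out
    [ 0.0,  0.0, 0.0, -0.9, 0.0,  1.0],  # flow split:    recycle_return = 0.9 * recycle_out
])
b = np.array([feed, 0.0, 0.0, 0.0, 0.0, 0.0])

x = np.linalg.solve(A, b)
names = ["reactor_in", "separator_in", "product", "recycle_out", "waste", "recycle_return"]
for name, flow in zip(names, x):
    print(f"{name:>15s}: {flow:8.2f}")

# Overall balance check: everything that enters leaves as product or waste.
assert abs(feed - (x[2] + x[4])) < 1e-6
```

For a transient problem you would re-solve (or integrate) a system like this at each time step as the Waste_Material relation changes.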

Spark SQL: how to achieve parallel processing of a dataframe at the group level, but within each group, sequential processing of rows

Apply grouping on the data frame. Let us say it results in 100 groups with 10 rows each.
I have a function that has to be applied on each group. That can happen in parallel and in any order (i.e., it is up to Spark's discretion to choose any group in any order for execution).
But within a group, I need a guarantee of sequential processing of the rows, because after processing each row in a group I use its output in the processing of the rows remaining in the group.
We took the approach below, where everything ran on the driver and could not utilize the Spark cluster's parallel processing across nodes (as expected, performance was really bad):
1) Break the main DF into multiple dataframes, placed in an array:
val securityIds = allocStage1DF.select("ALLOCONEGROUP").distinct.collect.flatMap(_.toSeq)
val bySecurityArray = securityIds.map(securityId => allocStage1DF.where($"ALLOCONEGROUP" <=> securityId))
2) Loop through the array and pass each dataframe to a method that processes it row by row:
df.coalesce(1).sort($"PRIORITY".asc).collect().foreach { row =>
  AllocOneOutput.allocOneOutput(row)
}
What we are looking for is a combination of parallel and sequential processing:
Parallel processing at the group level, because these are all independent groups and can be parallelized.
Within each group, rows have to be processed one after the other in a sequence, which is very important for our use case.
Sample data (screenshot omitted)
Apply grouping on SECURITY_ID, CC, BU, MPU, which gives us two groups from the above (SECID_1,CC_A,BU_A,MPU_A and SECID_2,CC_A,BU_A,MPU_A).
With the help of a priority matrix (nothing but a ref table for assigning a rank to each row), we transpose each group into the below:
Transposed data (screenshot omitted)
Each row in the above group has a priority and the rows are sorted in that order. Now I want to process the rows one after the other by passing them to a function, and get an output like the below:
Output (screenshot omitted)
Detailed explanation of the use case:
The base data frame has all the trading position data of a financial firm. Some customers buy (long) a given financial product (uniquely identified by a securityId) and some sell (short) it.
The idea of our application is to identify/pair the long positions and short positions within a given securityId.
Since this pairing happens within a securityId, we said that the base data frame is divided into groups based on this securityId, and each group can be processed independently.
Why are we looking for sequential processing within a group? Because, when there are many long positions and many short positions in a given group (as in the example data), the reference table (priority matrix) decides which long position has to be paired against which short position; basically, it gives the order of processing.
The second reason is that, when a given long quantity and short quantity are not equal, the residual quantity is eligible for pairing. I.e., if long quantity is left over, it can be paired with the next short quantity available in the group as per the priority, or vice versa.
Because of those two reasons, we are looking to process rows one after another within a group.
The points above are illustrated using the dataset below.
Base DataFrame
ROW_NO | SECURITY_ID | ACCOUNT | POSITION
-------+-------------+---------+---------
     1 | secId       | Acc1    |     +100
     2 | secId       | Acc2    |     -150
     3 | secId       | Acc3    |      -25
     4 | secId2      | Acc3    |      -25
     5 | secId2      | Acc3    |      -25
The base data frame is divided into groups by SECURITY_ID. Let us use the secId group, as below:
ROW_NO | SECURITY_ID | ACCOUNT | POSITION
-------+-------------+---------+---------
     1 | secId       | Acc1    |     +100
     2 | secId       | Acc2    |     -150
     3 | secId       | Acc3    |      -25
In the above case the positive position of +100 can be paired with either -150 or -25. In order to break the tie, the following ref table, called the priority matrix, helps by defining the order.
+vePositionAccount | -vePositionAccount | RANK
-------------------+--------------------+-----
Acc1               | Acc3               |    1
Acc1               | Acc2               |    2
So, from the above matrix we know that rows 1 and 3 will be paired first and then rows 1 and 2. This is the order (sequential processing) that we are talking about. Let's pair them now, as below:
+veROW_NO | +veSECURITY_ID | +veACCOUNT | +vePOSITION | -veROW_NO | -veSECURITY_ID | -veACCOUNT | -vePOSITION
----------+----------------+------------+-------------+-----------+----------------+------------+------------
        1 | secId          | Acc1       |        +100 |         3 | secId          | Acc3       |         -25
        1 | secId          | Acc1       |        +100 |         2 | secId          | Acc2       |        -150
What happens when row 1 is processed first and then row 2? (This is what we need.)
1. After processing row 1: the position in Acc1 will be (100 - 25) = 75 and the position in Acc3 will be 0. The updated position of 75 in Acc1 is then used when processing the second row.
2. After processing row 2: the position in Acc1 will be 0 and the position in Acc2 will be (75 - 150) = -75.
Result dataframe:
ROW_NO | SECURITY_ID | ACCOUNT | POSITION
-------+-------------+---------+---------
     1 | secId       | Acc1    |        0
     2 | secId       | Acc2    |      -75
     3 | secId       | Acc3    |        0
What happens when row 2 is processed first and then row 1? (We don't want this.)
1. After processing row 2: the position in Acc1 will be 0 and the position in Acc2 will be (100 - 150) = -50. The updated position of 0 in Acc1 is then used when processing the first row.
2. After processing row 1: the position in Acc1 stays 0 and the position in Acc3 is unchanged at -25.
Result dataframe:
ROW_NO | SECURITY_ID | ACCOUNT | POSITION
-------+-------------+---------+---------
     1 | secId       | Acc1    |        0
     2 | secId       | Acc2    |      -50
     3 | secId       | Acc3    |      -25
As you can see above, the order of processing within a group determines our output.
I also wanted to ask: why doesn't Spark support sequential processing within a section of a dataframe? We do want the parallel processing capability of the cluster; that is why we divide the data frame into groups and ask the cluster to apply the logic on those groups in parallel. All we are saying is that if a group has, say, 100 rows, then let those 100 rows be processed one after the other, in order. Is this not supported by Spark?
If it is not, what other big data technology can help achieve that?
Alternate implementation:
1. Partition the dataframe into as many partitions as there are groups (50,000 in our case; the groups are numerous, but no group has more than a few hundred rows).
2. Run a foreachPartition action on the data frame, where the logic is executed on each partition independently.
3. Write the output from the processing of each partition to the cluster.
4. After the whole data frame is processed, a separate job reads the individual files from step 3 and writes them to a single file/dataframe.
I doubt that thousands of partitions is any good, but I would like to know whether the approach sounds reasonable.
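As a sketch of the "parallel across groups, sequential within a group" idea, here is a hypothetical PySpark version built on groupBy().applyInPandas: each SECURITY_ID group is handed to a task as one pandas DataFrame, so the cluster parallelizes across groups while the function walks each group's rows strictly in priority order. The pairing logic and the PRIORITY values below are simplified placeholders, not your actual allocOneOutput logic.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative data only; columns follow the example tables above, plus an
# assumed PRIORITY column derived from the priority matrix.
positions = spark.createDataFrame(
    [(1, "secId",  "Acc1",  100, 0),
     (2, "secId",  "Acc2", -150, 2),
     (3, "secId",  "Acc3",  -25, 1),
     (4, "secId2", "Acc3",  -25, 1),
     (5, "secId2", "Acc3",  -25, 2)],
    ["ROW_NO", "SECURITY_ID", "ACCOUNT", "POSITION", "PRIORITY"],
)

def pair_one_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Runs on an executor with *all* rows of one SECURITY_ID group.
    # Sort by priority, then walk the rows strictly in that order.
    pdf = pdf.sort_values("PRIORITY")
    longs = pdf[pdf.POSITION > 0].copy()
    shorts = pdf[pdf.POSITION < 0].copy()
    for li in longs.index:
        for si in shorts.index:
            matched = min(longs.at[li, "POSITION"], -shorts.at[si, "POSITION"])
            if matched <= 0:
                continue
            longs.at[li, "POSITION"] -= matched   # residual stays eligible
            shorts.at[si, "POSITION"] += matched  # for the next pairing
    return pd.concat([longs, shorts]).sort_values("ROW_NO")

# Groups are distributed across the cluster and processed in parallel;
# within each group the function above is strictly sequential.
result = positions.groupBy("SECURITY_ID").applyInPandas(
    pair_one_group, schema=positions.schema)
result.show()
```

With this sample data the sketch reproduces the desired result for the secId group (Acc1 = 0, Acc2 = -75, Acc3 = 0), because the priority sends the Acc3 short to the Acc1 long before the Acc2 short.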
The concept works well enough until this rule:
"The second reason is that, when a given long quantity and short quantity are not equal, the residual quantity is eligible for pairing. I.e., if long quantity is left, then it can be paired with the next short quantity available in the group as per the priority, or vice versa."
This is because you want iterative, looping-with-dependencies logic, which is difficult to code with Spark, which is more dataflow-oriented.
I also worked on a project where everything was stipulated as "do it in big data with Spark, Scala or PySpark". Being an architect as well as a coder, I looked at an algorithm for something similar to your area, but not quite the same, in which, for commodities, all the periods for a set of data points needed to be classified as bull, bear, or neither. Like your algorithm, but still different, I did not know up front how much looping to do. In fact I needed to do something, then potentially decide to repeat that something to the left and to the right of a period I had marked as bull, bear, or nothing. Termination conditions were required. It was sort of like a 'flat' binary tree traversal until all paths were exhausted. Not very Spark-ish.
I did actually solve my specific situation in Spark, but it was an academic exercise. The point of the matter is that this type of processing (both my example and yours) is a poor fit for Spark. We did these calculations in Oracle and simply Sqooped the results to the Hadoop datastore.
My advice is therefore that you not try this in Spark, as it does not fit the use case well enough. Trust me, it gets messy. To be honest, it was soon apparent to me that this type of processing was an issue, but it is a common question when starting out.

A ratio measured within one dimension shown across another dimension

For the sake of the exercise, let's assume that I'm monitoring the percentage of domestic or foreign auto sales across the US.
Assume my dataset looks like:
StateOfSale | Origin     | Sales
'CA'        | 'Foreign'  |  1200
'CA'        | 'Domestic' |   800
'TX'        | 'Foreign'  |   800
'TX'        | 'Domestic' |   800
How would I show the percentage of Foreign Sales, by State of Sale, with each State as a line/mark/bar in the visual?
So for CA, the Foreign Percentage is 60%. For TX, the Foreign Percentage is 50%.
This is what Tableau was born to do, and there are a lot of great ways to visualize this type of question.
Use a quick table calculation called "Percent of Total" and compute that percentage for each State. In the screenshot (omitted here), "StateofOrigin" is in Columns and "Sum(Sale)" is in Rows, computed using Table (Down).
You can also graph the raw sales numbers in addition to displaying the text percentage, to give additional context about the number of sales between states.
Finally, if you've got a lot of states, it can be nice to plot the data on a map. You can do this by creating a calculated field for the percentage and then filtering out the domestic sales.
Field Name: Percentage
SUM([Sale])/SUM({FIXED [StateofOrigin]: SUM([Sale])})
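For comparison, the same percent-of-total arithmetic outside Tableau, as a small pandas sketch over the data from the question:

```python
import pandas as pd

sales = pd.DataFrame({
    "StateOfSale": ["CA", "CA", "TX", "TX"],
    "Origin":      ["Foreign", "Domestic", "Foreign", "Domestic"],
    "Sales":       [1200, 800, 800, 800],
})

# Percent of each state's total, i.e. Sales divided by the state's summed Sales.
state_totals = sales.groupby("StateOfSale")["Sales"].transform("sum")
sales["PctOfState"] = sales["Sales"] / state_totals

print(sales[sales["Origin"] == "Foreign"])   # CA -> 0.60, TX -> 0.50
```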

Decision between an user-based and item-based filter

I'm working on an algorithm to generate recommendations for a platform where you can review restaurants. The database consists of 3 tables: 'Users', 'Restaurants' and 'Reviews'. Reviews have a rating of 1-5. Each restaurant can have multiple reviews, and users can have multiple reviews. A review has exactly 1 user and 1 restaurant.
First I'll explain my current research/conclusions
I have implemented most of a user-based system, but I'm encountering several issues that come with this way of generating recommendations.
First of all, the data sparsity issue: because users may have only a few reviews, on very different restaurants, it's very difficult to determine a correlation between users.
Second, I have encountered the cold-start problem: before being able to calculate a correlation between users, you need around 4 reviews per user on exactly the same restaurants.
I have found a 'solution' for this by using a content-based filter, but I personally don't like that way of filtering when there are better options (i.e., item-based filtering).
And last, the scalability issue: because (hopefully) the app will become a huge success, it's going to be very heavy on performance to calculate these correlations for each user.
Because of these issues I have decided to do some research on item-based filtering. I would just like some feedback, to be sure my recommendations will be correct and that I fully understand how this works.
First, you have to calculate a similarity correlation between restaurants, based on the reviews left on those restaurants.
Example of such a matrix (if I'm correct), where R1 = restaurant 1, etc.:
   | R1   | R2   | R3   | R4
R1 | 1    | 0.75 | 0.64 | 0.23
R2 | 0.75 | 1    | 0.45 | 0.98
R3 | 0.64 | 0.45 | 1    | 0.36
R4 | 0.23 | 0.98 | 0.36 | 1
To generate recommendations for a user, you create a list of restaurants the user has reviewed positively.
Then you check the matrix to see which restaurants are similar to those restaurants.
After this, you check which of these restaurants the user hasn't reviewed yet. These become the recommendations for the user.
Using the item-based filter solves most of the issues you encounter with the user-based filter:
Data sparsity: because you no longer rely on similar users, you won't have the 'not enough reviews' problem.
Cold-start problem: because you can base your recommendations on a single review by the user, you won't have the cold-start issues (apart from having no data at all about the user).
Scalability: you don't have to regenerate the similarity matrix very often; you can do this once a day or even once a week. To generate the recommendations you simply consult the matrix and retrieve restaurants from it.
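As a minimal sketch of that lookup step (the similarity values mirror the R1-R4 matrix above; the 'liked' list is a made-up example):

```python
import pandas as pd

# Item-item similarities, as in the matrix above.
sim = pd.DataFrame(
    [[1.00, 0.75, 0.64, 0.23],
     [0.75, 1.00, 0.45, 0.98],
     [0.64, 0.45, 1.00, 0.36],
     [0.23, 0.98, 0.36, 1.00]],
    index=["R1", "R2", "R3", "R4"], columns=["R1", "R2", "R3", "R4"],
)

liked = ["R2"]                # restaurants this user reviewed positively
already_reviewed = ["R2"]     # anything reviewed at all is excluded

# Score every restaurant by its similarity to the liked ones, drop the
# restaurants the user already reviewed, and recommend the highest scores.
scores = sim.loc[liked].sum(axis=0).drop(labels=already_reviewed)
recommendations = scores.sort_values(ascending=False)
print(recommendations)        # R4 (0.98), R1 (0.75), R3 (0.45)
```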
Now my questions:
I would really like some opinions. I'm not claiming any of my research is fact; I'm just wondering whether I'm doing things correctly.
Am I doing everything correctly? I have done plenty of research, but since every scenario is different, I find it difficult to determine if I'm doing the right thing.
Is it true that an item-based filter is much better than a user-based one?
How exactly would you calculate the similarity between restaurants? I understand the angle between vectors, but how do you determine the points on the vectors? You can't just put reviews up against other reviews, since then all of the highest-rated restaurants would be very similar. How do you set up these vectors?
In my scenario, what would be the best solution, and what similarity coefficient would be best?
Also, what would happen when my review matrix looks like this?
       | R1 | R2
User 1 | ?  | 5
User 2 | ?  | 3
User 3 | 5  | ?
User 4 | 3  | ?
Is it possible to calculate a similarity between these two restaurants? And if not, how could I fix that?
Any feedback on this would be greatly appreciated!
You look like you have the right questions on your mind. I'll try to give you a few pointers and some suggestions:
Cold Start:
You won't be able to solve the cold-start problem. Period. You can only mitigate it; an item-to-item approach suffers less from cold start, but you still need the restaurants to have some reviews and the users to have at least one.
If you have access to content information about users and restaurants, take advantage of it to make recommendations when you don't have enough data.
This leads me to an interesting insight: you don't need to use the same algorithm for all your recommendations. You can use different approaches for users with different amounts of data.
For instance, you can start by recommending the most popular restaurants in the local area if you have the user's location, or otherwise just the most popular restaurants in your database.
Item 2 Item vs User 2 User vs Something Else:
I2I and U2U collaborative filtering are recommendation algorithms that have achieved good results; however, they suffer from all the problems you mentioned. I2I usually performs better, but it still suffers from cold start, scalability and the other problems.
There is another class of algorithms that has been outperforming I2I and U2U and is also based on the same intuition of using the reviews to determine which items to recommend. Matrix factorization algorithms try to represent users and items by hidden (latent) factors and make recommendations based on those factors. You should definitely investigate them further.
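To make that concrete, here is a bare-bones matrix factorization sketch using stochastic gradient descent on the observed ratings only (toy data; in practice you would reach for a library such as Spark ALS or Surprise rather than hand-rolling it):

```python
import numpy as np

# Observed (user, restaurant, rating) triples; toy data.
ratings = [(0, 1, 5.0), (1, 1, 3.0), (2, 0, 5.0), (3, 0, 3.0)]
n_users, n_items, k = 4, 2, 2

rng = np.random.default_rng(0)
U = rng.normal(scale=0.3, size=(n_users, k))   # hidden user factors
V = rng.normal(scale=0.3, size=(n_items, k))   # hidden restaurant factors

lr, reg = 0.05, 0.02
for _ in range(500):
    for u, i, r in ratings:
        err = r - U[u] @ V[i]                  # error on one observed rating
        U[u] += lr * (err * V[i] - reg * U[u])
        V[i] += lr * (err * U[u] - reg * V[i])

# Predicted ratings for every (user, restaurant) pair, including unseen ones.
print(np.round(U @ V.T, 2))
```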
Similarity Calculation:
The starting point would definitely be cosine similarity, where each vector representing a restaurant is an array of the ratings of all users for that restaurant, with 0s where a user hasn't reviewed it.
A detailed explanation with an example can be found here.
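A minimal sketch of that calculation with scikit-learn (the rating matrix is made up; rows are users, columns are restaurants, and 0 means "not reviewed"):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows = users, columns = restaurants; 0 marks "not reviewed", as described above.
ratings = np.array([
    [0, 5, 4, 0],
    [0, 3, 0, 2],
    [5, 0, 4, 0],
    [3, 0, 0, 5],
])

# Each restaurant's vector is its column of user ratings, so compare columns.
item_similarity = cosine_similarity(ratings.T)
print(np.round(item_similarity, 2))
```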
Scaling:
Even though I2I scales better, it is still quadratic in the number of restaurants in both memory and computation.
So you should investigate other options, such as using Locality-Sensitive Hashing to calculate the similarities. I won't go into much detail, as I am not very comfortable with that algorithm, but I think you can apply it to find the most similar pairs without having to store the entire matrix.
Sources for further investigation:
A good introduction: Chapter 9 of Mining Massive Datasets
The Recommender Systems bible: Recommender Systems Handbook