Using dplyr correctly to combine shared values of a row into a new column of a table

How do I combine data from two tables based on certain shared values in a row?
I already tried using the which() function and it didn't work.

I think you will have the best luck using the dplyr package. Specifically, you can use right_join(). You can write it like this: right_join(df1, df2, by = "specification").
This joins the two tables on the shared specification column, keeping every row of df2 and adding the matching columns from df1.
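For example, here is a minimal sketch with two made-up data frames (df1, df2, and the specification values are placeholders, not from your data):

library(dplyr)

df1 <- data.frame(specification = c("A", "B", "C"), price = c(10, 20, 30))
df2 <- data.frame(specification = c("B", "C", "D"), stock = c(5, 8, 2))

# Keeps every row of df2 and adds the matching columns from df1;
# specifications only in df1 (like "A") are dropped, and unmatched
# df2 rows (like "D") get NA for the df1 columns.
right_join(df1, df2, by = "specification")
#   specification price stock
# 1             B    20     5
# 2             C    30     8
# 3             D    NA     2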
For future reference, it would help a lot if you included a screenshot of your code, just so it is easier to know exactly what you are asking.
Anyway, let me know if this answers your question!

Related

How to filter rows with a column constraint in Deequ ColumnProfilerRunner?

I am new to Scala and Spark. I am exploring the Amazon Deequ library for data profiling.
How do I get the count of rows having a particular value while using ColumnProfilerRunner()?
The AnalysisRunner has a "compliance" option; I am looking for a similar option to filter rows that comply with a given column constraint.
I have multiple columns, hence I want to check them dynamically instead of hard-coding column names.
Appreciate any help.
Thanks
Deequ's column profiler computes a fixed set of statistics. If you want to compute custom statistics on your data, you should use the VerificationSuite. Check out the examples on Deequ's GitHub page.
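For instance, here is a rough sketch of that route, with one compliance-style constraint added per column so nothing is hard-coded against column names; df is assumed to be an existing Spark DataFrame, and the 'someValue' condition and the 0.9 threshold are placeholders:

import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel}

// Fold over all column names, adding one satisfies() constraint each.
val check = df.columns.foldLeft(Check(CheckLevel.Warning, "per-column compliance")) {
  (chk, col) =>
    chk.satisfies(s"`$col` = 'someValue'", s"share of rows where $col matches", _ >= 0.9)
}

val result = VerificationSuite()
  .onData(df)
  .addCheck(check)
  .run()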

Scala Nested Iteration within RDD

I have to iterate through all rows to find values similar to one column's value. For example:
ID,FN,LN,Phone
-----------
1,James,Butt,872-232-1212
2,Josephine,Darakjy, 872-232-1213
3,Art,Venere,872-232-1214
4,Lenna,Paprocki,872-232-1215
5,Donette, Foller,872-232-1216
6,Jmes,Butt,666-232-1212
7,Donette, Foller,888-232-1216
8,Josphne,Darkjy, 555-232-1213
Inside the loop I take FN, which is 'James', and check whether a similar name exists anywhere in the data set using some kind of string distance (e.g. Levenshtein). In this case there is a match with ID 6, 'Jmes', so I create a bucket by adding a new GUID column, like this:
ID,FN,LN,Phone,GroupId
----------------------
1,James,Butt,872-232-1212,G1
2,Josephine,Darakjy, 872-232-1213,G2
3,Art,Venere,872-232-1214,G3
4,Lenna,Paprocki,872-232-1215,G4
5,Donette, Foller,872-232-1216,G5
6,Jmes,Butt,666-232-1212,G1
7,Donette, Foller,888-232-1216,G5
8,Josphne,Darkjy, 555-232-1213,G2
I have to do the same operation on multiple columns as well, like LN and Phone. Imagine if I have 1 million records.
Any thoughts, suggestions or links are appreciated. Thank you!
I would definitely not try anything pairwise; I would rather think towards coding a per-field Levenshtein-style index and accumulating results on the fly. I'd probably start from something suffix-tree-ish.
Will try to sketch a prototype as soon as I get to the laptop...
Update: after some reading I am leaning towards Affinity Clustering combined with pairwise (yes, I know) Levenshtein cached on a trie. Code in progress...
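In the meantime, here is a naive single-column prototype in plain Scala (no index, no caching; the maxDist threshold of 2 is a placeholder) that reproduces the FN grouping from the question:

import scala.collection.mutable

// Classic dynamic-programming Levenshtein edit distance.
def levenshtein(a: String, b: String): Int = {
  val d = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
    if (i == 0) j else if (j == 0) i else 0
  }
  for (i <- 1 to a.length; j <- 1 to b.length) {
    val cost = if (a(i - 1) == b(j - 1)) 0 else 1
    d(i)(j) = math.min(math.min(d(i - 1)(j) + 1, d(i)(j - 1) + 1), d(i - 1)(j - 1) + cost)
  }
  d(a.length)(b.length)
}

// Greedy bucketing: a value joins the first group whose representative
// is within maxDist edits; otherwise it starts a new group.
def assignGroups(values: Seq[String], maxDist: Int = 2): Seq[String] = {
  val reps = mutable.ArrayBuffer.empty[(String, String)] // (representative, groupId)
  values.map { v =>
    reps.find { case (rep, _) => levenshtein(v, rep) <= maxDist } match {
      case Some((_, gid)) => gid
      case None =>
        val gid = s"G${reps.size + 1}"
        reps += ((v, gid))
        gid
    }
  }
}

// assignGroups(Seq("James", "Josephine", "Art", "Lenna", "Donette", "Jmes", "Donette", "Josphne"))
// => List(G1, G2, G3, G4, G5, G1, G5, G2), matching the GroupId column above.

Note that this is still quadratic in the worst case (the pairwise trap mentioned above), so for a million records the linear scan over representatives would need to be replaced by a real index such as a trie or BK-tree, or by blocking on a cheap key first.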

MATLAB and DataFrame

Is anyone using the DataFrame package (https://github.com/rothnic/DataFrame)?
I use it because it also works with older MATLAB versions. However, I just have a very basic question:
How do I change a value in the DataFrame?
In MATLAB's table type it is straightforward. For example, t{1,{'prior'}} = {'a = 2000'}, where t is a table and I assign a cell to it. I cannot figure out how to do the same in the DataFrame package.
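For reference, here is the built-in table version spelled out (the initial value 'a = 1000' is made up; the variable name 'prior' comes from my example):

t = table({'a = 1000'}, 'VariableNames', {'prior'});  % one cell-valued column
t{1, {'prior'}} = {'a = 2000'};                       % replace the cell in row 1
t.prior{1}                                            % => 'a = 2000'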
The DataFrame author does not seem to be maintaining it anymore(?). I hope someone could give more examples of the DataFrame package's methods.
Thanks!

Scala: wrapper for Breeze DenseMatrix for column and row referencing

I am new to Scala. Looking at it as an alternative to MATLAB for some applications.
I would like to write in Scala a wrapper class that assigns column names ("QuantityQ" and "QuantityP" -> Range) and row names (dates -> Range) to Breeze DenseMatrices (http://www.scalanlp.org/), in order to reference columns and rows by name.
The usage should resemble Python Pandas or Scala Saddle (http://saddle.github.io).
Saddle is very interesting but its usage is limited to 2D matrices. A huge limitation.
My Ideas:
Columns:
I thought a Map would do the job for columns, but that may not be the best implementation.
Rows:
For rows, I could maintain a separate Breeze vector of timestamps and provide methods that convert dates into timestamps, doing the number crunching through Breeze (see the sketch below). This comes with a loss of generality, as a user may want to give arbitrary string names to rows.
Concerning dates, I would use nscala-time (a Scala wrapper for Joda-Time).
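A rough sketch of what I have in mind; the class and method names are placeholders, and only element and column access are shown:

import breeze.linalg.DenseMatrix

// Row and column keys are looked up through Maps, as suggested above.
class LabeledMatrix[R, C](
    val data: DenseMatrix[Double],
    rowKeys: Seq[R],
    colKeys: Seq[C]) {

  require(rowKeys.size == data.rows && colKeys.size == data.cols)

  private val rowIndex: Map[R, Int] = rowKeys.zipWithIndex.toMap
  private val colIndex: Map[C, Int] = colKeys.zipWithIndex.toMap

  // Element access by labels instead of integer indices.
  def apply(row: R, col: C): Double = data(rowIndex(row), colIndex(col))

  // Whole-column access returns a Breeze vector, so the number
  // crunching stays inside Breeze.
  def column(col: C) = data(::, colIndex(col))
}

// Usage with the column names from above and string row labels:
// val m = new LabeledMatrix(DenseMatrix((1.0, 2.0), (3.0, 4.0)),
//                           Seq("2014-01-01", "2014-01-02"),
//                           Seq("QuantityQ", "QuantityP"))
// m("2014-01-02", "QuantityP")  // => 4.0

Keeping the key types generic (R, C) avoids hard-wiring dates into the rows, so arbitrary string row names stay possible.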
What are the drawbacks of my implementation?
Would you design the data structure differently?
Thank you for your help.

TSQL question: how to iterate columns of a result set

I have a SELECT statement and a cursor to iterate over the rows it returns. The problem is that I have many columns (more than 500), so FETCH ... INTO @variable is impossible for me. How can I iterate over the columns one by one? I need to process the data.
Thanks in advance,
n.b
Two choices.
1/ Use SSIS or ADO.Net to work through your dataset row by row.
2/ Consider what you actually need to achieve and find a set-based approach.
My preference is option 2. Let us know what you need done and we'll find a way.
Rob
You can build a SQL string using sys.columns or INFORMATION_SCHEMA queries. Here's a post I wrote on that.
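For example, a minimal sketch of the idea (dbo.MyWideTable is a made-up table name; what you do with the generated statement is up to your processing):

DECLARE @cols nvarchar(max), @sql nvarchar(max);

-- Concatenate every column name of the table into one list.
SELECT @cols = COALESCE(@cols + ', ', '') + QUOTENAME(name)
FROM sys.columns
WHERE object_id = OBJECT_ID(N'dbo.MyWideTable')
ORDER BY column_id;

-- Build and run the statement dynamically, so none of the 500+
-- column names has to be typed out or fetched into variables.
SET @sql = N'SELECT ' + @cols + N' FROM dbo.MyWideTable;';
EXEC sys.sp_executesql @sql;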