Intersecting two columns with different lengths - match

I have a dataset1 containing 5000 user_ids from Twitter. I want to intersect the user_ids from dataset1 with another dataset2 containing other Twitter user_ids, and at the same time create a new column in dataset1 where each user_id gets the score '1' (if it intersects) or '0' (if it does not). I tried the code below, but in the new column 'intersect' I just get some (seemingly random) zeros and then a lot of NAs.
for(i in 1:ncol(data1)){
  # intersect with other data
  ids_intersect = intersect(data1$user_id, data2$user_id)
  if(length(ids_intersect == 0)){
    data1[i, "intersect"] <- 0 # no intersect
  } else {
    data1[i, "intersect"] <- 1 # intersect
  }
}
I also tried another approach, which I find more intuitive, but it won't work since the two datasets have different row lengths ("replacement has 3172 rows, data has 5181"). The intention is the same as above: each user_id gets the score 1 if it intersects, or 0/NA if it does not, in the new column 'intersect'. However, I'm not sure how to implement it:
data$intersect <- intersect(data1$user_id, data2$user_id)
Any way of assigning either 1 or 0 to the user_ids in a new column depending on whether there is an intersect/match?

A convenient option is mutate() from the dplyr package together with the base R %in% operator, as follows (the %<>% assignment pipe comes from magrittr).
Data
data1 <- data.frame(user_id = c("Test1", "Test2", "Test4", "Test5"))
data2 <- data.frame(user_id = c("Test1", "Test3", "Test4"))
Code
data1 %<>%
  mutate(Existence = ifelse(user_id %in% data2$user_id, 1, 0))
Output
> data1
user_id Existence
1 Test1 1
2 Test2 0
3 Test4 1
4 Test5 0
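For comparison, the same membership flag can be computed in pandas with Series.isin, which plays the role of %in% (a minimal sketch using the toy data above, not part of the original answer):

```python
import pandas as pd

data1 = pd.DataFrame({"user_id": ["Test1", "Test2", "Test4", "Test5"]})
data2 = pd.DataFrame({"user_id": ["Test1", "Test3", "Test4"]})

# isin returns a boolean Series; cast to int for the 1/0 score
data1["Existence"] = data1["user_id"].isin(data2["user_id"]).astype(int)
```

As with %in%, this is a membership test per row, so the two frames are free to have different lengths.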


Read a table and analyze its elements in MATLAB

I am trying to realize my idea in MATLAB.
I consider two columns, A and B:
A=data(:,1)
B=data(:,5)
the data look like:
A B
1 1
2 1
3 1
... ...
100 20
... ...
150 30
151 1
... ...
The values in column A are timepoints.
I start with the first element of column A, A(1,1), and look at the first element of column B, B(1,1). If B(1,1)==1 it is true; if not, it is false. Then I move on to the second row of column A and the second row of column B, and so on until the last rows of A and B.
How can I construct this loop?
You don't need a loop; you can just operate on B directly, like the following:
result = (B == 1);
result will be a logical vector the same size as B, which is what you want. Now you can use it to index into A, like the following:
valid_times = A(result);
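The same element-wise comparison and logical indexing carry over directly to NumPy, for anyone reading along in Python (sample values made up for illustration):

```python
import numpy as np

A = np.array([1, 2, 3, 100, 150, 151])  # timepoints
B = np.array([1, 1, 1, 20, 30, 1])

result = (B == 1)        # logical vector, same size as B
valid_times = A[result]  # timepoints at which B equals 1
```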

What is wrong with this Scala loop reading files?

I am using Scala to read data from 2 CSV files, and for each line of the first file I want to scan all lines of the second CSV file to do some calculations.
This is my code
object CSVProcess extends App {
  val dataMatlab = io.Source.fromFile("./data/data_matlab1.csv")
  val matchDataMatlab = io.Source.fromFile("./data/match_data_matlab1.csv")

  for ((line, count) <- dataMatlab.getLines.zipWithIndex) {
    for ((line1, count1) <- matchDataMatlab.getLines.zipWithIndex) {
      println(s"count count1 ${count} ${count1}")
    }
  }

  dataMatlab.close
  matchDataMatlab.close
}
However, the output is not what I expect: the loop stops once the first line of the first CSV file has scanned all lines of the second one.
For example, CSV 1 has 3 lines:
1,1
2,2
3,3
CSV 2 has 3 lines:
1,1,1
2,2,2
3,3,3
But the output is
count count1 0 0
count count1 0 1
count count1 0 2
The output should be
count count1 0 0
count count1 0 1
count count1 0 2
count count1 1 0
count count1 1 1
count count1 1 2
count count1 2 0
count count1 2 1
count count1 2 2
Could someone spot the problem in my code?
The problem is that io.Source.fromFile("path").getLines gives you an Iterator, and iterators are like socket buffers: once you have read the data out of one, there is no data left.
The official Scala documentation explains:
An iterator is not a collection, but rather a way to access the elements of a collection one by one. The two basic operations on an iterator it are next and hasNext. A call to it.next() will return the next element of the iterator and advance the state of the iterator. Calling next again on the same iterator will then yield the element one beyond the one returned previously...
The solution is to convert the iterators to any of the traversable collections. Here I have converted them to List for persistence.
val dataMatlab = io.Source.fromFile("./data/data_matlab1.csv").getLines().toList
val matchDataMatlab = io.Source.fromFile("./data/match_data_matlab1.csv").getLines().toList

for ((line, count) <- dataMatlab.zipWithIndex) {
  for ((line1, count1) <- matchDataMatlab.zipWithIndex) {
    println(s"count count1 ${count} ${count1}")
  }
}
Now you should get the expected output. I hope the explanation is clear and helpful.
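The same one-pass behavior can be reproduced in Python, whose iterators are also exhausted after a single traversal (a minimal sketch of the pitfall and the fix, not the original Scala):

```python
# the inner iterator is fully consumed on the first outer pass,
# so later outer iterations find nothing left to yield
inner = iter([10, 20, 30])
pairs = []
for x in [1, 2, 3]:
    for y in inner:
        pairs.append((x, y))

# materializing the data once (like .toList in the answer) fixes it:
# a list can be traversed any number of times
inner_list = [10, 20, 30]
pairs_fixed = [(x, y) for x in [1, 2, 3] for y in inner_list]
```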

Difference between these two count methods in Spark

I have been doing a count of "games" using spark-sql. The first way is like so:
val gamesByVersion = dataframe.groupBy("game_version", "server").count().withColumnRenamed("count", "patch_games")
val games_count1 = gamesByVersion.where($"game_version" === 1 && $"server" === 1)
The second is like this:
val gamesDf = dataframe.
  groupBy($"hero_id", $"position", $"game_version", $"server").count().
  withColumnRenamed("count", "hero_games")
val games_count2 = gamesDf.where($"game_version" === 1 && $"server" === 1).agg(sum("hero_games"))
For all intents and purposes dataframe just has the columns hero_id, position, game_version and server.
However games_count1 ends up being about 10, and games_count2 ends up being 50. Obviously these two counting methods are not equivalent or something else is going on, but I am trying to figure out: what is the reason for the difference between these?
I guess it is because in the first query you group by only 2 columns, while in the second you group by 4. With just two grouping columns you may have fewer distinct groups.

Joining multiple times in kdb

I have two tables
table 1 (orders) columns: (date,symbol,qty)
table 2 (marketData) columns: (date,symbol,close price)
I want to add the close for T+0 to T+5 to table 1.
{[nday]
  value "temp0::update date",string[nday],":mdDates[DateInd+",string[nday],"] from orders";
  value "temp::temp0 lj 2! select date",string[nday],":date,sym,close",string[nday],":close from marketData";
  table1::temp
} each (1+til 5)
I'm sure there is a better way to do this, but I get a 'loop error when I try to run this function. Any suggestions?
See here for common errors. Your 'loop error is because you're setting views with value, not globals. Inside a function, value evaluates as if it's outside the function, so you don't need the ::.
That said there's lots of room for improvement, here's a few pointers.
You don't need the value at all in your case. The first line can be reduced to the following (I'm assuming mdDates is some kind of function you're just dropping in to work out the date from an integer, and DateInd some kind of global):
{[nday]
  temp0:update date:mdDates[nday;DateInd] from orders;
  ....
} each (1+til 5)
In this bit it just looks like you're trying to append something to the column name:
select date",string[nday],":date
Remember that tables are flipped dictionaries... you can mess with their column names via the keys, as illustrated (very simply) below:
q)t:flip `a`b!(1 2; 3 4)
q)t
a b
---
1 3
2 4
q)flip ((`$"a","1"),`b)!(t`a;t`b)
a1 b
----
1 3
2 4
You can also use functional select, which is much neater IMO:
q)?[t;();0b;((`$"a","1"),`b)!(`a`b)]
a1 b
----
1 3
2 4
It seems you wanted pN columns with prices corresponding to the dates date+0, date+1, and so on.
Using the over adverb to iterate over days 0 to 4:
q)orders:([] date:(2018.01.01+til 5); sym:5?`A`G; qty:5?10)
q)data:([] date:20#(2018.01.01+til 10); sym:raze 10#'`A`G; price:20?10+10.)
q)delete d from {c:`$"p",string[y]; (update d:date+y from x) lj 2!(`d`sym,c )xcol 0!data}/[ orders;0 1 2 3 4]
date sym qty p0 p1 p2 p3 p4
---------------------------------------------------------------
2018.01.01 A 0 10.08094 6.027448 6.045174 18.11676 1.919615
2018.01.02 G 3 13.1917 8.515314 19.018 19.18736 6.64622
2018.01.03 A 2 6.045174 18.11676 1.919615 14.27323 2.255483
2018.01.04 A 7 18.11676 1.919615 14.27323 2.255483 2.352626
2018.01.05 G 0 19.18736 6.64622 11.16619 2.437314 4.698096
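For readers who think in pandas rather than q, the same T+n join can be sketched as a sequence of left merges: for each horizon n, re-key the market rows to date-n so that the close observed at date+n lines up with the order's own date. The toy data and the p0..p2 column names below are invented for illustration:

```python
import pandas as pd

# toy orders and market data, made up for this sketch
orders = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-01", "2018-01-02"]),
    "sym": ["A", "G"],
    "qty": [3, 5],
})
dates = pd.date_range("2018-01-01", periods=10)
market = pd.DataFrame({
    "date": dates.repeat(2),
    "sym": ["A", "G"] * 10,
    "close": range(20),
})

# for horizon n, shift the market dates back by n days so the close
# at date+n joins onto the order date, then left-join column p{n}
for n in range(3):
    shifted = market.assign(date=market["date"] - pd.Timedelta(days=n))
    shifted = shifted.rename(columns={"close": f"p{n}"})
    orders = orders.merge(shifted, on=["date", "sym"], how="left")
```

The loop plays the role of the over iteration in the q answer; each pass adds one more horizon column.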

select rows by comparing columns using HDFStore

How can I select rows by comparing two columns of an HDF5 file using pandas? The HDF5 file is too big to load into memory. For example, I want to select the rows where column A and column B are equal. The dataframe is saved in the file 'mydata.hdf5'. Thanks.
import pandas as pd
store = pd.HDFStore('mydata.hdf5')
df = store.select('mydf',where='A=B')
This doesn't work. I know that store.select('mydf',where='A==12') will work. But I want to compare column A and B. The example data looks like this:
A B C
1 1 3
1 2 4
. . .
2 2 5
1 3 3
You cannot do this directly, but the following will work.
In [23]: df = DataFrame({'A' : [1,2,3], 'B' : [2,2,2]})
In [24]: store = pd.HDFStore('test.h5',mode='w')
In [26]: store.append('df',df,data_columns=True)
In [27]: store.select('df')
Out[27]:
A B
0 1 2
1 2 2
2 3 2
In [28]: store.select_column('df','A') == store.select_column('df','B')
Out[28]:
0 False
1 True
2 False
dtype: bool
This should be pretty efficient.
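To then pull the matching rows back out of the store, the boolean mask can be converted into row coordinates and passed to select() via where. A minimal end-to-end sketch along the lines of the session above (the file name test.h5 is just an example):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [2, 2, 2]})
store = pd.HDFStore("test.h5", mode="w")
store.append("df", df, data_columns=True)

# compare the two columns without loading the whole table,
# then select only the row coordinates where the mask is True
mask = store.select_column("df", "A") == store.select_column("df", "B")
result = store.select("df", where=mask[mask].index)
store.close()
```

Only the two compared columns and the matching rows ever travel through memory, which is what makes this usable on a file too big to load.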