I have about a billion pieces of data that I would like to store in Cassandra. The data items are ordered by time, and one of the main queries I'll be doing is to find the items between two time ranges, in order. I'd really prefer to use the RandomParititioner, if at all possible. Is there a way to do this in Cassandra?
At first, since I'm coming from SQL, I assumed I should create each event as a row, but then it occurred to me that I was thinking about it the wrong way and I should really use columns. Columns in Cassandra seem to be ordered, but I'm confused as to just how ordered they are. If I use a time as the column name, is there a way for me to get all of the columns from one time to another in order?
Another thing I looked at was the 0.7 feature of secondary indices, but I've had trouble finding documentation for whether I can use these to view a range of things in order.
All I want is the Cassandra equivalent of this SQL: "Select * from Stuff where date > X and date < Y order by date asc". How can I do this?
The partitioner only affects the distribution of keys around the ring, not the order of columns within a key. Columns are always ordered according to the Column Comparator defined for the column family.
You can call get_slice with a SlicePredicate that specifies a SliceRange to get all the columns of a key within a range.
To model your data, you can create 1 row for each day (or suitable time shard) and have a column for each piece of data. Something like,
"yyyy-mm-dd" : { #key, one for each day
timeStampMillis1:dataid1 : "value1" # one column for each piece of data
timeStampMillis2:dataid2 : "value2"
timeStampMillis3:dataid3 : "value3"
}
The column names should be binary, using the binary comparator. The first 8 bytes are the timestamp, while the rest of the bytes are the id of the data.
Assuming X and Y are on the same day, to find all items between X and Y, do a do a get_slice on the day key, with a SlicePredicate with a SliceRange specifying a start of X and a finish of Y+1. Both start and finish are byte arrays of 8 bytes.
To find data over multiple days, read from multiple keys.
Related
I have a table which columns are location and credit, the location contains string rows which mainly is location_name and npl_of_location_name. the credit contains integer rows which mainly is credit_of_location_name and credit_npl_of_location_name. I need to make a column which calculates the ((odd rows of the credit - the even rows of the credit)*0.1). How do i do this?
When you specify "odd rows" and "even rows" are you referring to row numbers? Because, unless your query sorts the data, you have not control over row order; the database server returns rows however they are physically stored.
Once you are sure that your rows are properly sorted, then you can use a technique such as Mod(#INROWNUM,2) = 1 to determine "odd" and zero is even. This works best if the Transformer is executing in sequential mode; if it is executed in parallel mode then you need to use a partitioning algorithm that ensures that the odd and even rows for a particular location are in the same node.
I'm using Dataprep on GCP to wrangle a large file with a billion rows. I would like to limit the number of rows in the output of the flow, as I am prototyping a Machine Learning model.
Let's say I would like to keep one million rows out of the original billion. Is this possible to do this with Dataprep? I have reviewed the documentation of sampling, but that only applies to the input of the Transformer tool and not the outcome of the process.
You can do this, but it does take a bit of extra work in your Recipe--set up a formula in a new column using something like RANDBETWEEN to give you a random integer output between 1 and 1,000 (in this million-to-billion case). From there, you can filter rows based on whatever random integer between 1 and 1,000 as what you'll keep, and then your output will only have your randomized subset. Just have your last part of the recipe remove this temporary column.
So indeed there are 2 approaches to this.
As Courtney Grimes said, you can use one of the 2 functions that create random-number out of a range.
randbetween :
rand :
These methods can be used to slice an "even" portion of your data. As suggested, a randbetween(1,1000) , then pick 1<x<1000 to filter, because it's 1\1000 of data (million out of a billion).
Alternatively, if you just want to have million records in your output, but either
Don't want to rely on the knowledge of the size of the entire table
just want the first million rows, agnostic to how many rows there are -
You can just use 2 of these 3 row filtering methods: (top rows\ range)
P.S
By understanding the $sourcerownumber metadata parameter (can read in-product documentation), you can filter\keep a portion of the data (as per the first scenario) in 1 step (AKA without creating an additional column.
BTW, an easy way of "discovery" of how-to's in Trifacta would be to just type what you're looking for in the "search-transtormation" pane (accessed via ctrl-k). By searching "filter", you'll get most of the relevant options for your problem.
Cheers!
I have a calculation and it outputs multiple values. Then I am creating a table on those values. For example, in below data my formula is
if data is 1 then calculation is `one`
if data is 2 then calculation is `two`
if data is 3 then calculation is `three`
as three doesn't really appear in the output, when I create a table, three is not displayed. Is there any way to display it?
I tried table layout >> show empty rows and columns and it didn't work
data calculation
1 one
2 two
Tableau discovers the possible values for a dimension field dynamically from the query results.
If ‘three’ does not appear in your data, then how do you expect Tableau to know to make a column header for that non existent, but potential, value? It can’t read your mind.
This situation does occur often though - perhaps you want row or column headers to remain stable, even when you change filters in a way that causes some to no longer appear in the query results.
There are a few ways you can force Tableau to pad ** or **complete a domain:
one solution is to pad your data to make sure each value for your dimension field appears in at least one data row.
You can often do this easily by using a union to append some extra rows to your original data. You can often add padding rows that don’t impact any results by leaving all your Measure columns null since nulls are ignored by aggregation functions
Another common solution that is a bit more effort is to make what is known as scaffolding data source that is not much more than a list of your dimension members. You can then use that data source as a primary data source with data blending, making your original data source secondary.
There are two situations where Tableau can detect the absence of data and leave space for it in the visualization automatically
for numeric types, you can create a bin field that will automatically pad for missing bins
similarly, date fields can show missing values because, like bins, Tableau can tell when a month doesn’t appear in the data and leave room for it in the view
I have a scenario where my dataframe has 3 columns a,b and c. I need to validate if the length of all the columns is equal to 100. Based on validation I am creating status column like a_status,b_status,c_status with values 5 (Success) and 10 (Failure). In Failure scenarios I need to update count and create new columns a_sample,b_sample,c_sample with some 5 failure sample values separated by ",". For creating samples column I tried like this
df= df.select(df.columns.toList.map(col(_)) :::
df.columns.toList.map( x => (lit(getSample(df.select(x, x + "_status").filter(x + "_status=10" ).select(x).take(5))).alias(x + "_sample")) ).toList: _* )
getSample method will just get array of rows and concatenate as a string. This works fine for limited columns and data size. However if the number of columns > 200 and data is > 1 million rows it creates huge performance impact. Is there any alternate approach for same.
While the details of your problem statement are unclear, you can break up the task into two parts:
Transform data into a format where you identify several different types of rows you need to sample.
Collect sample by row type.
The industry jargon for "row type" is stratum/strata and the way to do (2), without collecting data to the driver, which you don't want to do when the data is large, is via stratified sampling, which Spark implements via df.stat.sampleBy(). As a statistical function, it doesn't work with exact row numbers but fractions. If you absolutely must get a sample with an exact number of rows there are two strategies:
Oversample by fraction and then filter unneeded rows, e.g., using the row_number() window function followed by a filter 'row_num < n.
Build a custom user-defined aggregate function (UDAF), firstN(col, n). This will be much faster but a lot more work. See https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html
An additional challenge for your use case is that you want this done per column. This is not a good fit with Spark's transformations such as grouping or sampleBy, which operate on rows. The simple approach is to make several passes through the data, one column at a time. If you absolutely must do this in a single pass through the data, you'll need to build a much more custom UDAF or Aggregator, e.g., the equivalent of takeFirstNFromAWhereBHasValueC(n, colA, colB, c).
I will try to explain the problem on an abstract level first:
I have X amount of data as input, which is always going to have a field DATE. Before, the dates that came as input (after some process) where put in a table as output. Now, I am asked to put both the input dates and any date between the minimun date received and one year from that moment. If there was originally no input for some day between this two dates, all fields must come with 0, or equivalent.
Example. I have two inputs. One with '18/03/2017' and other with '18/03/2018'. I now need to create output data for all the missing dates between '18/03/2017' and '18/04/2017'. So, output '19/03/2017' with every field to 0, and the same for the 20th and 21st and so on.
I know to do this programmatically, but on powercenter I do not. I've been told to do the following (which I have done, but I would like to know of a better method):
Get the minimun date, day0. Then, with an aggregator, create 365 fields, each has that "day0"+1, day0+2, and so on, to create an artificial year.
After that we do several transformations like sorting the dates, union between them, to get the data ready for a joiner. The idea of the joiner is to do an Full Outer Join between the original data, and the data that is going to have all fields to 0 and that we got from the previous aggregator.
Then a router picks with one of its groups the data that had actual dates (and fields without nulls) and other group where all fields are null, and then said fields are given a 0 to finally be written to a table.
I am wondering how can this be achieved by, for starters, removing the need to add 365 days to a date. If I were to do this same process for 10 years intead of one, the task gets ridicolous really quick.
I was wondering about an XOR type of operation, or some other function that would cut the number of steps that need to be done for what I (maybe wrongly) feel is a simple task. Currently I now need 5 steps just to know which dates are missing between two dates, a minimun and one year from that point.
I have tried to be as clear as posible but if I failed at any point please let me know!
Im not sure what the aggregator is supposed to do?
The same with the 'full outer' join? A normal join on a constant port is fine :) c
Can you calculate the needed number of 'dublicates' before the 'joiner'? In that case a lookup configured to return 'all rows' and a less-than-or-equal predicate can help make the mapping much more readable.
In any case You will need a helper table (or file) with a sequence of numbers between 1 and the number of potential dublicates (or more)
I use our time-dimension in the warehouse, which have one row per day from 1753-01-01 and 200000 next days, and a primary integer column with values from 1 and up ...
You've identified you know how to do this programmatically and to be fair this problem is more suited to that sort of solution... but that doesn't exclude powercenter by any means, just feed the 2 dates into a java transformation, apply some code to produce all dates between them and for a record to be output for each. Java transformation is ideal for record generation
You've identified you know how to do this programmatically and to be fair this problem is more suited to that sort of solution... but that doesn't exclude powercenter by any means, just feed the 2 dates into a java transformation, apply some code to produce all dates between them and for a record to be output for each. Java transformation is ideal for record generation
Ok... so you could override your source qualifier to achieve this in the selection query itself (am giving Oracle based example as its what I'm used to and I'm assuming your data in is from a table). I looked up the connect syntax here
SQL to generate a list of numbers from 1 to 100
SELECT (MIN(tablea.DATEFIELD) + levquery.n - 1) AS Port1 FROM tablea, (SELECT LEVEL n FROM DUAL CONNECT BY LEVEL <= 365) as levquery
(Check if the query works for you - haven't access to pc to test it at the minute)