apache cassandra - Inconsistency between number of records returned and count(*) result - import

I am importing some data into a table in Apache Cassandra using the COPY command. My CSV file has 7 rows, but after importing I have just 1 row instead of 7. What could cause this inconsistency?
Attached is an image of my cqlsh screen.

Possible issue:
the rows share the same clustering key, so each import overwrites the previous row.
Solution
try adding another (domain-specific) column as a clustering key that makes the rows unique.
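The collapse can be illustrated with a small sketch (plain Python, hypothetical data): Cassandra treats every INSERT as an upsert, so rows sharing one primary key overwrite each other instead of accumulating.

```python
# Sketch of Cassandra's upsert semantics (hypothetical data): an INSERT
# with an existing primary key overwrites the previous row, so a COPY of
# 7 CSV rows that all share one key leaves a single row behind.

def copy_into_table(csv_rows, key_columns):
    """Simulate COPY: the table is a map from primary key tuple to row."""
    table = {}
    for row in csv_rows:
        pk = tuple(row[c] for c in key_columns)  # partition + clustering key
        table[pk] = row                          # upsert: last write wins
    return table

csv_rows = [{"id": 1, "value": v} for v in range(7)]  # 7 rows, same id

# Keying only on "id" collapses everything to one row ...
assert len(copy_into_table(csv_rows, ["id"])) == 1
# ... while adding "value" as a clustering column keeps all 7.
assert len(copy_into_table(csv_rows, ["id", "value"])) == 7
```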

Related

AWS RDS postgresql performance

We have around 90 million rows in a new PostgreSQL table in an RDS instance. It contains two numbers, start_num and end_num (bigint, mostly finance related), and details related to those numbers. The PK is on (start_num, end_num) and the table is CLUSTERed on it. The query will always be a range query: the input is a number and the output is the range in which this number falls, along with the details. For example, there is a row with start_num = 112233443322 and end_num = 112233543322. The input comes in as 112233443645, so the row containing (112233443322, 112233543322) needs to be returned.
select start_num, end_num from ipinfo.ipv4 where input_value between start_num and end_num;
This always goes into a seq scan and the PK is not used. I have tried creating separate indexes on start_num and end_num desc, but with little change in query time. We are looking for a response time of less than 300 ms. Now I am wondering whether that is even possible in PostgreSQL for range queries on large data sets, or whether this is due to PostgreSQL being on AWS RDS.
Looking forward to some advice on steps to improve the performance.
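For intuition on why the composite B-tree index is skipped: `input BETWEEN start_num AND end_num` bounds start_num only from above and end_num only from below, which is not a prefix lookup on (start_num, end_num). If the ranges never overlap, a single descending probe on start_num plus one end_num check is enough; a sketch of that access pattern (plain Python, hypothetical data, illustrative only):

```python
import bisect

# Illustrative sketch (hypothetical data): with non-overlapping ranges
# sorted by start_num, find the last start_num <= input, then check its
# end_num. This mirrors "WHERE start_num <= input ORDER BY start_num
# DESC LIMIT 1" on a B-tree, followed by one end_num comparison.
ranges = sorted([(100, 199), (200, 299), (112233443322, 112233543322)])
starts = [s for s, _ in ranges]

def lookup(value):
    i = bisect.bisect_right(starts, value) - 1  # last range starting <= value
    if i >= 0 and ranges[i][1] >= value:        # does it also end at/after value?
        return ranges[i]
    return None

assert lookup(112233443645) == (112233443322, 112233543322)
assert lookup(50) is None  # falls before every range
```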

Copying untitled columns from tsv file to postgresql?

By tsv I mean a file delimited by tabs. I have a pretty large (6 GB) data file that I have to import into a PostgreSQL database. Out of 56 columns, the first 8 are meaningful; of the other 48, several columns (7 or so) have 1's sparsely distributed, with the rest being 0's.

Is there a way to specify which columns in the file you want to copy into the table? If not, I am fine with importing the whole file and just extracting the desired columns to use as data for my project, but I am concerned about allocating excessive space to a table in which less than 1/4 of the data is meaningful. Will this pose an issue, or will I be fine accommodating the meaningful columns in my table? I have considered using that table as a temp table and then importing the meaningful columns into another table, but I have been instructed to try to avoid an intermediary cleaning step, so I would rather use the large table directly if it won't cause any problems in PostgreSQL.
With PostgreSQL 9.3 or newer, COPY accepts a program as input (COPY ... FROM PROGRAM). This option is precisely meant for this kind of pre-processing. For instance, to keep only tab-separated fields 1 to 4 and 7 from a TSV file, you could run:
COPY destination_table FROM PROGRAM 'cut -f1-4,7 /path/to/file' (FORMAT csv, DELIMITER E'\t');
This also works with \copy in psql, in which case the program is executed client-side.
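The same column filtering can be done client-side before loading; a rough equivalent of `cut -f1-4,7` as a sketch (plain Python, hypothetical sample data):

```python
import csv
import io

# Sketch of the same preprocessing as `cut -f1-4,7`: cut's field numbers
# are 1-based, so fields 1-4 and 7 are indices 0-3 and 6 here.
KEEP = [0, 1, 2, 3, 6]

def filter_columns(tsv_text):
    """Return the TSV text reduced to the KEEP columns, in order."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        writer.writerow([row[i] for i in KEEP])
    return out.getvalue()

sample = "a\tb\tc\td\te\tf\tg\th\n1\t2\t3\t4\t5\t6\t7\t8\n"
assert filter_columns(sample) == "a\tb\tc\td\tg\n1\t2\t3\t4\t7\n"
```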

Incrementally adding to a Hive table w/Scala + Spark 1.3

Our cluster runs Spark 1.3 and Hive.
There is a large Hive table that I need to add randomly selected rows to.
I read a smaller table and check a condition; if that condition is true, I grab the variables I need in order to query for the random rows. I did a query on that condition, table.where(value < number), then made it an array using take(num_rows). Since all of these rows contain the information I need about which random rows to pull from the large Hive table, I iterate through the array.
When I do the query I use ORDER BY RAND() in the query (using sqlContext). I created a var Hive table (to be mutable), adding a column from the larger table. In the loop, I do a unionAll: newHiveTable = newHiveTable.unionAll(random_rows).
I have tried many different ways to do this, but I am not sure of the best way to avoid CPU and temp disk use. I know that DataFrames aren't intended for incremental adds.
One thing I have thought of trying now is to create a CSV file, write the random rows to that file incrementally in the loop, then when the loop is finished, load the CSV file as a table and do one unionAll to get my final table.
Any feedback would be great. Thanks
I would recommend that you create an external table in Hive, defining its location, and then let Spark write the output as CSV to that directory:
in Hive:
create external table test(key string, value string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
LOCATION '/SOME/HDFS/LOCATION'
Then from Spark, with the aid of https://github.com/databricks/spark-csv, write the dataframe to CSV files, appending to the existing ones. Note that spark-csv writes comma-delimited output by default, so set its delimiter to match the one declared in the Hive DDL:
df.save("/SOME/HDFS/LOCATION/", "com.databricks.spark.csv", SaveMode.Append)
(df.write.format("com.databricks.spark.csv").save(...) is the Spark 1.4+ form; Spark 1.3 uses DataFrame.save directly.)
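The asker's own fallback (append rows to one CSV inside the loop, load it once at the end) is essentially the same pattern as the external-table answer; a language-neutral sketch of it (plain Python, hypothetical data, illustrative only):

```python
import csv
import io

# Sketch of the append-then-load pattern (hypothetical data): instead of
# repeatedly union-ing DataFrames (which grows the lineage/plan on every
# iteration), append each batch of selected rows to one CSV buffer and
# parse the accumulated result a single time at the end.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";", lineterminator="\n")

for batch in ([["k1", "v1"]], [["k2", "v2"]], [["k3", "v3"]]):
    writer.writerows(batch)  # one cheap append per loop iteration

# One final "load": read the whole accumulated file once.
final_rows = list(csv.reader(io.StringIO(buffer.getvalue()), delimiter=";"))
assert final_rows == [["k1", "v1"], ["k2", "v2"], ["k3", "v3"]]
```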

How do I retrieve multiple rows from HBase using suffix glob, from a REST client?

I have the following rows in an HBase table called test:
ROW     COLUMN+CELL
row1    column=cf:a, timestamp=1429204170712, value=value1
row2    column=cf:b, timestamp=1429204196225, value=value2
row3    column=cf:c, timestamp=1429204213427, value=value3
I am trying to retrieve all the rows with a rowkey matching the prefix row using suffix globbing, as mentioned here.
But why do I get a Bad Request when I try http://localhost:8080/test/row*, where localhost:8080 is where the HBase REST server (Stargate) is listening, test is the table, and row is a partial rowkey? I executed it in a browser and in a REST client, Poster (a Firefox plugin). Executing the URL http://localhost:8080/test/row*/cf gives the response value1, but I would like to retrieve the values in all the rows with a rowkey matching the prefix row.
I am running HBase 0.94.26, Stargate (bundled with HBase), and Hadoop 1.2.1 on an Ubuntu 12.04 virtual machine.
Is it possible to retrieve all the rows programmatically, at least?
As per the doc, REST works fine for retrieving all the rows; you just need to modify the URL accordingly.
Try the combinations below; one of them should work. Please note that I have not tested them.
http://localhost:8080/test/row*
http://localhost:8080/test/row
Suffix Globbing
Multiple value queries of a row can optionally append a suffix glob on the row key. This is a restricted form of scanner which will return all values in all rows that have keys which contain the supplied key on their left hand side, for example:
org.someorg.*
-> org.someorg.blog
-> org.someorg.home
-> org.someorg.www
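The quoted suffix-glob semantics amount to a prefix scan on the row key; a few lines make the matching rule concrete (plain Python, using the doc's own example keys):

```python
# Sketch of the suffix-glob semantics from the quoted doc: "prefix*"
# returns every key whose left-hand side matches the part before '*'.
def suffix_glob(keys, pattern):
    assert pattern.endswith("*"), "glob must end with '*'"
    prefix = pattern[:-1]
    return sorted(k for k in keys if k.startswith(prefix))

keys = ["org.someorg.blog", "org.someorg.home", "org.someorg.www", "com.other"]
assert suffix_glob(keys, "org.someorg.*") == [
    "org.someorg.blog", "org.someorg.home", "org.someorg.www"]
# The question's case: "row*" matches every key starting with "row".
assert suffix_glob(["row1", "row2", "row3"], "row*") == ["row1", "row2", "row3"]
```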

Only row key in a Cassandra column family

I'm using Cassandra 1.1.8, and today I saw in my keyspace a column family with the following content:
SELECT * FROM challenge;
KEY
----------------------------
49feb2000100000a556522ed68
49feb2000100000a556522ed74
49feb2000100000a556522ed7a
49feb2000100000a556522ed72
49feb2000100000a556522ed76
49feb2000100000a556522ed6a
49feb2000100000a556522ed70
49feb2000100000a556522ed78
49feb2000100000a556522ed6e
49feb2000100000a556522ed6c
So, only row keys.
Yesterday those rows were there and I ran some deletions (exactly on those rows). I'm using Hector:
Mutator<byte[]> mutator = HFactory.createMutator(keyspace, BYTES_ARRAY_SERIALIZER)
    .addDeletion(challengeRowKey(...), CHALLENGE_COLUMN_FAMILY_NAME)
    .execute();
This is a small development and test environment on a single machine / single node so I don't believe the hardware details are relevant.
Probably I'm doing something stupid, or I haven't understood how things work, but as far as I understand the rows above are not valid... the column name and column value coordinates are missing, so there are no valid cells (rowkey / column name / column value)... is that right?
I read about ghost reads, but I thought that was a scenario for a distributed environment... can it happen after one day and on a single Cassandra node?
From http://www.datastax.com/docs/1.0/dml/about_writes#about-deletes
"The row key for a deleted row may still appear in range query results. When you delete a row in Cassandra, it marks all columns for that row key with a tombstone. Until those tombstones are cleared by compaction, you have an empty row key (a row that contains no columns). These deleted keys can show up in results of get_range_slices() calls. If your client application performs range queries on rows, you may want to have it filter out row keys that return empty column lists."