Things Which Represent Other Things - REST

Let's say I want to hand out things which represent other things via REST. We can call the representations tickets.
The goal is to create a thing which represents another thing but does not reveal the content of the thing it represents. For concreteness, let's pretend we're talking about phone numbers.
There is a choice here. One could create a ticket from random numbers and letters which are stored in the backend or one could calculate a ticket from the phone number.
PRO/CON of Ticket Creation:
PRO: cannot be cracked
PRO: you can expire a ticket - (why is this good?)
CON: you have one ticket per requested phone number stored in a backend server
Pros and Cons of Calculation
CON: A calculation can be cracked and reveal the original content.
CON: calculation mechanism cannot be changed - (true?)
PRO: No backend storage required.
PRO: A calculation maps to a REST lookup (GET), as opposed to a creation (POST)
Thoughts?

You can think of a "representation ticket" as a hash of the phone number. When you think of it this way, you'll lean towards Calculation, and you can also draw on a vast body of knowledge about hashing.
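To make the hashing framing concrete, a keyed hash (HMAC) is one way to calculate a ticket that can be re-derived on every request with no backend storage, while an outsider without the key can neither reverse it nor forge new tickets. A minimal sketch, assuming a Node-style backend; the function name, key handling and phone number are invented for illustration:

import { createHmac } from "crypto";

// The same phone number always yields the same ticket, so nothing has to be stored;
// without the secret key the mapping cannot be recomputed or inverted by an outsider.
function ticketFor(phoneNumber: string, secretKey: string): string {
  return createHmac("sha256", secretKey).update(phoneNumber).digest("hex");
}

// Handing out a ticket is then a pure lookup (GET): nothing is created or stored server-side.
const ticket = ticketFor("+15550100", "keep-this-secret");

Note that the CON above still applies: because the ticket is a pure function of the phone number and the key, expiring a single ticket means rotating the key (and with it every ticket) or reintroducing backend state.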

Related

Where should data be transformed for the database?

Where should data be transformed for the database? I believe this is called data normalization/sanitization.
In my database I have user-created shops, like an Etsy. Let's say a user inputs a price for an item in their shop as "1,000.00". But my database stores prices as an integer number of pennies: "100000". Where should "1,000.00" be converted to "100000"?
These are the two ways I thought of.
In the frontend: The input data is converted from "1,000.00" to "100000" in the frontend before the HTTP request. In this case, the backend would validate that the price is an integer.
In the backend: "1,000.00" is sent to the backend as is, then the backend validates that it is a price format, then the backend converts the price to an integer "100000" before being stored in the database.
It seems either would work, but is one way better than the other, or is there a standard way? I would think the second way is best to reduce code duplication, since there are likely to be multiple frontends (mobile, web, etc.) and one backend. But the first way also seems cleaner: just send what the API needs.
I'm working with a MERN application if that makes any difference, but I would think this would be language agnostic.
I would go with a mix between the two (which is a best practice, AFAIK).
The frontend has to do some kind of validation anyway, because you don't want to wait for the backend's response just to validate input. I would add the actual conversion here as well.
The validation code will be adapted to each frontend I guess, because each one has a different framework / language that it uses, so I don't necessarily see the code duplication here (the logic yes, but not the actual code). As a last resort, you can create a common validation library.
The backend should re-validate the converted value sent by the frontends, just as a double check, and then store it in the database. You never know if the backend will be integrated with other components in the future, and input sanitization is always a best practice.
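As a rough sketch of that split (helper names are invented; prices are integer pennies, as in the question): the frontend converts the display string before the request, and the backend only ever accepts a non-negative integer.

// Frontend: convert the display string to integer pennies before the HTTP request.
function toPennies(input: string): number {
  const normalized = input.replace(/,/g, "").trim(); // "1,000.00" -> "1000.00"
  const value = Number(normalized);
  if (!Number.isFinite(value) || value < 0) {
    throw new Error(`Invalid price: ${input}`);
  }
  return Math.round(value * 100); // 1000.00 -> 100000
}

// Backend: double-check that what arrives really is a non-negative integer.
function assertPennies(price: unknown): number {
  if (typeof price !== "number" || !Number.isInteger(price) || price < 0) {
    throw new Error("price must be a non-negative integer number of pennies");
  }
  return price;
}

// toPennies("1,000.00") === 100000; assertPennies(100000) passes, assertPennies("1,000.00") throws.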
Short version
Both ways would work and I think there is no standard way. I prefer formatting in the frontend.
Long version
What I do in such cases is to look at my expected business requirements and make a list of pros and cons. Here are just a few thoughts about this.
If you decide to do every transformation of the price (formatting, normalization, sanitization) in the frontend, your backend will stay smaller and you will have fewer endpoints, or endpoints with fewer options. Depending on the frontend, you can choose the perfect fit for your end user. The amount of code delivered also stays small, because the frontend application can be cached and does all the formatting work itself.
If you implement everything in the backend, you have full control over which format is delivered to your users. Especially when dealing with a lot of different languages, it can be helpful to get the correct display value directly from the server.
Furthermore, it can be helpful to take a look at some different APIs of well-known providers and how these handle prices.
The PayPal API uses an amount object to transfer prices as decimals together with a currency code.
"amount": {
  "value": "9.87",
  "currency": "USD"
}
It's up to you how to handle it in the frontend. Here is a link with an example request from the docs:
https://developer.paypal.com/docs/api/payments.payouts-batch/v1#payouts_post
Stripe uses a slightly different model.
{
  unit_amount: 1600,
  currency: 'usd',
}
It describes prices with an integer amount in the smallest unit of the currency (cents, in this case) and a currency code. Here are two examples to make it more clear:
https://stripe.com/docs/api/prices/create?lang=node
https://stripe.com/docs/checkout/integration-builder
In both cases, the normalization and sanitization have to be done before making requests. The response will also need formatting before showing it to the user. Of course, most of these requests are made by backend code. But if you look at the prebuilt checkout pages from Stripe or PayPal, these also use normalized and sanitized values for their frontend integrations: https://developer.paypal.com/docs/business/checkout/configure-payments/single-page-app
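Going from the stored integer back to a display value is the part that most naturally lives in each frontend, since it depends on the user's locale. A tiny sketch using the standard Intl API (amounts and locales are just examples):

// Format integer pennies for display; the locale decides separators and symbol placement.
function formatPrice(pennies: number, currency: string, locale: string): string {
  return new Intl.NumberFormat(locale, { style: "currency", currency }).format(pennies / 100);
}

formatPrice(100000, "USD", "en-US"); // "$1,000.00"
formatPrice(100000, "EUR", "de-DE"); // "1.000,00 €"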
Conclusion/My opinion
I would always prefer keeping the backend as simple as possible for security reasons. Less code (i.e. fewer endpoints) means a smaller attack surface. More configuration possibilities mean a lot more effort to make the application secure. Furthermore, you could write another (micro)service which takes over some transformation tasks, if you have a business requirement to deliver everything perfectly formatted from the backend. Example use cases may be: if you have a preference for backend code over frontend code (think about your/your team's skills), if you want to deploy a lot of different frontends and want to make sure that they all use a single source of truth for their display values, or maybe if you need to fulfill regulatory requirements to know exactly what is delivered to your user.
Thank you for your question and I hope I have given you some guidance for your own decision. In the end, you will always wish you had chosen a different approach anyway... ;)
For sanitisation, it has to be on the back end for security reasons. Sanitisation is concerned with ensuring that only certain fields' values from your web form are even entertained. It's like a guest list at an exclusive club. It's concerned not just with the field (e.g. username) but also the value (e.g. bobbycat, or "); DROP TABLE users;"). So it's about ensuring the security of your database.
Normalisation, on the other hand, is concerned with transforming the data before storing it in the database. It's the example you brought up: "1,000.00" to "100000", because you are storing prices as integers without decimals in the database.
Where does this code belong? I think there's no clear winner because it depends on your use case.
If it's a simple matter like making sure the value is an integer and not a string, you should offload that to the web form (i.e. the front end), since forms already have a "type" attribute to enforce these.
But imagine a more complicated scenario. Let's say you're building an app that allows users to construct a Facebook ads campaign (your app being the third party developer app, like Smartly.io). That means there will be something like 30 form fields that must be filled out before the user hits "create campaign". And the value in some form fields affect the validity of other parts of the form.
In such a situation, it might make sense to put at least some of the validation in the back end because there is a series of operations your back end needs to run (like create the Post, then create the Ad) in order to ensure validity. It wouldn't make sense for you to code those validations and normalisations in the front end.
So in short, it's a balance you'll need to strike. Offload the easy stuff to the front end, leveraging web APIs and form validations. Leave the more complex normalisation steps to the back end.
On a related note, there's a broader concept of ETL (extract, transform, load) that you'd use if you were trying to consume data from another service and then transforming it to fit the way you store info in your own database. In that case, it's usually a good idea to keep it as a repository on its own - using something like Apache Airflow to manage and visualise the cron jobs.

REST design principles: Referencing related objects vs Nesting objects

My team and I are refactoring a REST-API and a question has come up.
For the sake of brevity, let us assume that we have an SQL database with 4 tables: Teachers, Students, Courses and Classrooms.
Right now all the relations between the items are represented in the REST-API by referencing the URL of the related item. For example, for a course we could have the following:
{ "id":"Course1", "teacher": "http://server.com/teacher1", ... }
In addition, if I ask for a list of courses through a GET call to /courses, I get a list of references as shown below:
{
  ... // pagination details
  "items": [
    {"href": "http://server1.com/course1"},
    {"href": "http://server1.com/course2"},
    ...
  ]
}
All this is nice and clean, but if I want a list of all the course titles with the teachers' names, and I have 2000 courses and 500 teachers, I have to do the following:
Approximately 2500 queries just to read the data.
Implement the join between the teachers and courses
Optimize with caching etc, so that I will do it as fast as possible.
My problem is that this method creates a lot of network traffic with thousands of REST-API calls and that I have to re-implement the natural join that the database would do way more efficiently.
Colleagues say that this approach is the standard way of implementing a REST-API, but then a relatively simple query becomes a big hassle.
My question therefore is:
1. Is it wrong if we nest the teacher information in the courses?
2. Should the listing of items e.g. GET /courses return a list of references or a list of items?
Edit: After some research I would say the model I have in mind corresponds mainly to the one shown in jsonapi.org. Is this a good approach?
My problem is that this method creates a lot of network traffic with thousands of REST-API calls and that I have to re-implement the natural join that the database would do way more efficiently. Colleagues say that this approach is the standard way of implementing a REST-API but then a relatively simple query becomes a big hassle.
Your colleagues have lost the plot.
Here's your heuristic - how would you support this use case on a web site?
You would probably do it by defining a new web page that produces the report you need. You'd run the query, use the result set to generate a bunch of HTML, and ta-da! The client has the information they need in a standardized representation.
A REST-API is the same thing, with more emphasis on machine readability. Create a new document, with a schema so that your clients can understand the semantics of the document you return to them, tell the clients how to find the target uri for the document, and voila.
Creating new resources to handle new use cases is the normal approach to REST.
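As a sketch of what such a purpose-built resource might return (the /course-overview name, the field names and the sample values are invented for illustration, not taken from the question):

// GET /course-overview : one request, one cacheable document, join done server-side.
type CourseOverviewItem = {
  courseId: string;
  title: string;
  teacherName: string;
  teacher: string; // link kept for clients that still want the full teacher resource
};

type CourseOverview = {
  items: CourseOverviewItem[];
  next?: string; // pagination link if the list is long
};

const example: CourseOverview = {
  items: [
    {
      courseId: "Course1",
      title: "Some course title",
      teacherName: "Some teacher name",
      teacher: "http://server.com/teacher1",
    },
  ],
  next: "http://server.com/course-overview?page=2",
};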
Yes, I totally think you should design something similar to jsonapi.org. As a rule of thumb, I would say "prefer a solution that requires fewer network calls". That's especially true if the number of network calls drops by an order of magnitude.
Of course it doesn't eliminate the need to limit the request/response size if it becomes unreasonable.
Real life solutions must have a proper balance. Clean API is nice as long as it works.
So in your case I would do something like:
GET /courses?include=teachers
Or
GET /courses?includeTeacher=true
Or
GET /courses?includeTeacher=brief|full
In the last one the response can have only the teacher's id for brief and full teacher details for full.
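A rough sketch of how a handler could honor such an include parameter (fetchCourses and fetchTeachersByIds are hypothetical data-access functions standing in for your real queries); the point is that the join happens once, server-side, and only when asked for:

type Course = { id: string; title: string; teacherId: string };
type Teacher = { id: string; name: string };

// Hypothetical repository functions.
declare function fetchCourses(): Promise<Course[]>;
declare function fetchTeachersByIds(ids: string[]): Promise<Teacher[]>;

async function listCourses(includeTeachers: boolean) {
  const courses = await fetchCourses();
  if (!includeTeachers) {
    // Default behaviour: references only, as before.
    return courses.map(c => ({ href: `/courses/${c.id}` }));
  }
  // One extra query instead of one extra request per course.
  const teachers = await fetchTeachersByIds(courses.map(c => c.teacherId));
  const byId = new Map<string, Teacher>();
  for (const t of teachers) byId.set(t.id, t);
  return courses.map(c => ({
    href: `/courses/${c.id}`,
    title: c.title,
    teacher: byId.get(c.teacherId)?.name,
  }));
}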
My problem is that this method creates a lot of network traffic with thousands of REST-API calls and that I have to re-implement the natural join that the database would do way more efficiently. Colleagues say that this approach is the standard way of implementing a REST-API but then a relatively simple query becomes a big hassle.
Have you actually measured the overhead generated by each request? If not, how do you know that the overhead will be too high? From an object-oriented programmer's perspective it may sound bad to perform each call on its own; your design, however, lacks one important asset which helped the Web grow to its current size: caching.
Caching can occur on multiple levels. You can do it on the API level, the client might do something, or an intermediary server might do it. Fielding even made it a constraint of REST! So, if you want to comply with the REST architecture philosophy, you should also support caching of responses. Caching helps to reduce the number of requests that have to be calculated or even processed by a single server. With the help of stateless communication you might even introduce a multitude of servers that all perform calculations for billions of requests and act as one cohesive system to the client. An intermediary cache may further help to significantly reduce the number of requests that actually reach the server.
A URI as a whole (including any path, matrix or query parameters) is actually a key for a cache. Upon receiving a GET request, a cache checks whether it already contains a stored response for that URI and, if the stored data is "fresh enough", returns the stored response on behalf of the server directly to the client. If the stored data has exceeded the freshness threshold, the cache throws it away and routes the request to the next hop in line (which might be the actual server, or a further intermediary).
Spotting resources that are ideal for caching might not be easy at times, though the majority of data doesn't change so quickly that caching should be neglected entirely. Thus, it should at least be of general interest to introduce caching, especially the more traffic your API produces.
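To give a sense of what supporting caching amounts to on the wire, here is a minimal sketch using Node's built-in http module (the freshness lifetime and the resource body are assumptions for illustration): the server says how long a stored copy counts as fresh and hands out a validator that caches can revalidate against.

import { createServer } from "http";
import { createHash } from "crypto";

createServer((req, res) => {
  const body = JSON.stringify({ id: "Course1", teacher: "http://server.com/teacher1" });
  const etag = `"${createHash("sha1").update(body).digest("hex")}"`;

  // Any cache on the path may reuse this response for five minutes...
  res.setHeader("Cache-Control", "public, max-age=300");
  res.setHeader("ETag", etag);

  // ...and afterwards revalidate it cheaply instead of refetching the body.
  if (req.headers["if-none-match"] === etag) {
    res.statusCode = 304;
    res.end();
    return;
  }
  res.setHeader("Content-Type", "application/json");
  res.end(body);
}).listen(8080);

Everything between the client and the origin (browser cache, CDN, reverse proxy) can act on those two headers without knowing anything about the application itself.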
While certain media-types such as HAL JSON, jsonapi, ... allow you to embed content gathered from related resources into the response, embedding content has some potential drawbacks such as:
Utilization of the cache might be low due to mixing data that changes quickly with data that is more static
Server might calculate data the client won't need
One server calculates the whole response
If related resources are only linked to instead of directly embedded, a client of course has to fire off a further request to obtain that data, though that request is actually more likely to be (partly) served by a cache, which, as mentioned a couple of times now throughout the post, reduces the workload on the server. Besides that, a positive side effect could be that you gain more insight into what the clients are actually interested in (if, for example, an intermediary cache is run by you).
Is it wrong if we nest the teacher information in the courses?
It is not wrong, but it might not be ideal, as explained above.
Should the listing of items e.g. GET /courses return a list of references or a list of items?
It depends. There is no right or wrong.
As REST is just a generalization of the interaction model used in the Web, basically the same concepts apply to REST as well. Depending on the size of the "item" it might be beneficial to return a short summary of the item's content and add a link to the item. Similar things are done on the Web as well. For a list of students enrolled in a course, this might be the name and matriculation number, plus a link through which further details of that student can be requested, accompanied by a link-relation name that gives the link some semantic context which a client can use to decide whether invoking that URI makes sense or not.
Such link-relation names are either standardized by IANA, common approaches such as Dublin Core or schema.org, or custom extensions as defined in RFC 8288 (Web Linking). For the above-mentioned list of students enrolled in a course, you could, for instance, make use of the about relation name to hint to a client that further information on the current item can be found by following the link. If you want to enable pagination, the relations first, next, prev and last can and probably should be used as well, and so forth.
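As a small illustration (all URIs and student data invented), a collection page that uses those relation names for its item links and its pagination could look like this:

const studentsPage = {
  items: [
    {
      name: "Jane Doe", // invented example data
      matriculationNumber: "12345",
      links: [{ rel: "about", href: "http://server.com/students/12345" }],
    },
  ],
  links: [
    { rel: "first", href: "http://server.com/courses/Course1/students?page=1" },
    { rel: "prev", href: "http://server.com/courses/Course1/students?page=1" },
    { rel: "next", href: "http://server.com/courses/Course1/students?page=3" },
    { rel: "last", href: "http://server.com/courses/Course1/students?page=9" },
  ],
};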
This is actually what HATEOAS is all about. Linking data together and giving them meaningful relation names to span a kind of semantic net between resources. By simply embedding things into a response such semantic graphs might be harder to build and maintain.
In the end it basically boils down to an implementation choice whether you want to embed or reference resources. I hope I could shed some light on the usefulness of caching and the benefits it could yield, especially in large-scale systems, as well as on the benefit of providing link-relation names for URIs, which enhance the semantic context of the relations used within your API.

Consistent hashing on multiple machines

I've read the article: http://n00tc0d3r.blogspot.com/ about the idea for consistent hashing, but I'm confused about the method on multiple machines.
The basic process is:
Insert
Hash an input long URL into a single integer;
Locate a server on the ring and store the key→longUrl pair on that server;
Compute the shortened URL using base conversion (from base 10 to base 62) and return it to the user. (How does this step work? On a single machine, there is an auto-incremented id from which the shortened URL is calculated, but what value is used to calculate the shortened URL on multiple machines? There is no auto-incremented id.)
Retrieve
Convert the shortened URL back to the key using base conversion (from base 62 to base 10);
Locate the server containing that key and return the longUrl. (And how can we locate the server containing the key?)
I don't see any clear answer on that page for how the author intended it. I think this is basically an exercise for the reader. Here's some ideas:
Implement it as described, with hash-table style collision resolution. That is, when creating the URL, if it already matches something, deal with that in some way. Rehashing or arithmetic transformation (eg, add 1) are both possibilities. This means, naively, a theoretical worst case of having to hit a server n times trying to find an available key.
There are a lot of ways to take that basic idea and refine it, e.g., just search for another available key on the same server, say by rehashing iteratively until you find one that lands on that server.
Allow servers to talk to each other, and coordinate on the autoincrement id.
This is probably not a great solution, but it might work well in some situations: give each server (or set of servers) a separate namespace, e.g., the first 16 bits select a server. On creation, randomly choose one. Then you just need to figure out how you want that namespace to map. The namespaces only really matter for who is allowed to create what IDs, so if you want to add nodes or rebalance later, it is no big deal.
Let me know if you want more elaboration. I think there's a lot of ways that this one could go. It is annoying that the author didn't elaborate on this point; my experience with these sorts of algorithms is that collision resolution and similar problems tend to be at the very heart of a practical implementation of a distributed system.
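For the "how do we locate the server containing the key?" part, the usual consistent-hashing answer is: hash every server onto a ring of integers (several times, as virtual nodes), hash the key onto the same ring, and walk clockwise to the first server you meet. A compact sketch; the hash function and node counts are arbitrary choices for illustration:

import { createHash } from "crypto";

// Map a string onto a 32-bit position on the ring.
function ringPosition(value: string): number {
  return createHash("md5").update(value).digest().readUInt32BE(0);
}

class ConsistentHashRing {
  private ring: { position: number; server: string }[] = [];

  constructor(servers: string[], virtualNodes = 100) {
    for (const server of servers) {
      for (let i = 0; i < virtualNodes; i++) {
        this.ring.push({ position: ringPosition(`${server}#${i}`), server });
      }
    }
    this.ring.sort((a, b) => a.position - b.position);
  }

  // The first virtual node clockwise from the key's position owns the key.
  serverFor(key: string): string {
    const p = ringPosition(key);
    const node = this.ring.find(n => n.position >= p) ?? this.ring[0];
    return node.server;
  }
}

// Insert and retrieve both use the same lookup, so no central index is needed.
const ring = new ConsistentHashRing(["server-a", "server-b", "server-c"]);
ring.serverFor("http://example.com/some/long/url");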

What is a RESTful resource in the context of large data sets, e.g. weather data?

So I am working on a webservice to access our weather forecast data (10000 locations, 40 parameters each, hourly values for the next 14 days = about 130 million values).
So I read all about RESTful services and their ideology.
So I understand that a URL addresses a resource.
But what is a resource in my case?
The common use case is that you want to get the data for a couple of parameters over a timespan at one or more locations. So clearly giving every value its own URL is not practical and would result in hundreds of requests. I have the feeling that my specific problem doesn't exactly fit the RESTful pattern.
Update: To clarify, there are two usage patterns of the service:
1. Raw data; rows and rows of data for several locations and parameters.
2. Interpreted data; the raw data calculated into symbols (suns & clouds, for example) and other parameters.
There is not one 'forecast'. Different clients have different needs for data.
The reason I think this doesn't fit the REST pattern is that while I can actually have a 'forecast' resource, I still have to submit a lot of request parameters. So a simple GET request on a resource doesn't work; I end up POSTing data all over the place.
So I am working on a webservice to access our weather forecast data (10000 locations, 40 parameters each, hourly values for the next 14 days = about 130 million values). ... But what is a resource in my case?
That depends on the details of your problem domain. Simply having a large amount of data is not a good reason to avoid REST. There are smart ways and dumb ways to model and expose that data.
As you rightly see, your main goal at this point should be to understand what exactly a resource is. Knowing only enough about weather forecasting to follow the Weather Channel, I won't be much help here. It's for domain experts like yourself to make that call.
If you were to explain in a little more detail the major domain concepts you're working with, it might make it a little easier to give specific advice.
For example, one resource might be Forecast. When weatherpeople talk about Forecasts, what words keep coming up? When you think about breaking a forecast down into smaller elements, what words do you use to describe the pieces?
Do this process recursively, and you'll probably be able to make a list of important terms. Don't forget that these terms can describe things or actions. Think about what these terms really mean, what data you can use to model them, how they can be aggregated.
At this point you'll have the makings of something you can start building a RESTful system around - but not before.
Don't forget that a RESTful system is not a data dump wrapped in HTTP - it's a hypertext-driven system.
Also don't forget that media types are the point of contact between your server and its clients. A media type is only limited by your imagination and can model datasets of any size if you're clever about it. It can contain XML, JSON, YAML, binary elements such as a Bloom Filter, or whatever works for the problem.
Firstly, there is no once-and-for-all right answer.
Each valid URL is something that makes sense to query; think of them as equivalents to the query forms you would provide for people looking for your data - that might help you narrow down the scenarios.
It is a matter of personal taste and possibly the toolkit you use, as to what goes into the basic url path and what parameters are encoded. The debate is a bit like the XML debate over putting values in elements vs attributes. It is not always a rational or logically decided issue nor will everybody be kind in their comments on your decisions.
If you are using a backend like Rails, that implies certain conventions. Even if you're not using Rails, it makes sense to work in the same way unless you have a strong reason to change. That way, people writing clients to talk to Rails-based services will find yours easier to understand and it saves you on documentation time ;-)
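For instance (parameter names and values invented), one such "query form" could be a forecast collection filtered entirely through GET query parameters, which keeps requests cacheable and avoids POSTing query documents around:

// GET /forecasts?locations=10382,10384&parameters=temperature,precipitation
//               &from=2024-06-01T00:00Z&to=2024-06-03T00:00Z
// A possible response shape: one series per location/parameter pair.
const exampleResponse = {
  series: [
    {
      location: "10382",
      parameter: "temperature",
      unit: "degC",
      values: [{ time: "2024-06-01T00:00Z", value: 17.4 }],
    },
  ],
  links: [{ rel: "next", href: "/forecasts?locations=10382,10384&page=2" }],
};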
Maybe you can use forecast as the resource and go deeper into fine-grained services with XLink.
Would it be possible to do something like this? Since you have so many parameters, I was thinking you could relate them to a mix of id/parameter combinations to decrease the URL size:
/WeatherForeCastService//day/hour
www.weatherornot.com/today/days/x // (where x is number of days)
www.weatherornot.com/today/9am/hours/h // (where h is number of hours)

Best way to prevent duplicate use of credit cards

We have a system where we want to prevent the same credit card number being registered for two different accounts. As we don't store the credit card number internally - just the last four digits and expiration date - we cannot simply compare credit card numbers and expiration dates.
Our current idea is to store a hash (SHA-1) in our system of the credit card information when the card is registered, and to compare hashes to determine if a card has been used before.
Usually, a salt is used to avoid dictionary attacks. I assume we are vulnerable in this case, so we should probably store a salt along with the hash.
Do you guys see any flaws in this method? Is this a standard way of solving this problem?
Let's do a little math: Credit card numbers are 16 digits long. The first seven digits are 'major industry' and issuer numbers, and the last digit is the Luhn checksum. That leaves 8 digits 'free', for a total of 100,000,000 account numbers, multiplied by the number of potential issuer numbers (which is not likely to be very high). There are implementations that can do millions of hashes per second on everyday hardware, so no matter what salting you do, this is not going to be a big deal to brute force.
By sheer coincidence, when looking for something giving hash algorithm benchmarks, I found this article about storing credit card hashes, which says:
Storing credit cards using a simple single pass of a hash algorithm, even when salted, is fool-hardy. It is just too easy to brute force the credit card numbers if the hashes are compromised.
...
When hashing credit card number, the hashing must be carefully designed to protect against brute forcing by using strongest available cryptographic hash functions, large salt values, and multiple iterations.
The full article is well worth a thorough read. Unfortunately, the upshot seems to be that any circumstance that makes it 'safe' to store hashed credit card numbers will also make it prohibitively expensive to search for duplicates.
People are overthinking the design of this, I think. Use a salted, highly secure (e.g. computationally expensive) hash like SHA-256, with a per-record unique salt.
You should do a low-cost, high accuracy check first, then do the high-cost definitive check only if that check hits.
Step 1:
Look for matches on the last 4 digits (and possibly also the expiration date, though there are some subtleties there that may need addressing).
Step 2:
If the simple check hits, use the salt, get the hash value, do the in depth check.
The last 4 digits of the card number are the most distinguishing (partly because they include the Luhn check digit), so the percentage of in-depth checks that won't ultimately match (the false positive rate) will be very, very low (a fraction of a percent), which saves you a tremendous amount of overhead relative to the naive "do the hash check every time" design.
Do not store a simple SHA-1 of the credit card number, it would be way too easy to crack (especially since the last 4 digits are known). We had the same problem in my company: here is how we solved it.
First solution
For each credit card, we store the last 4 digits, the expiration date, a long random salt (50 bytes long), and the salted hash of the CC number. We use the bcrypt hash algorithm because it is very secure and can be tuned to be as CPU-intensive as you wish. We tuned it to be very expensive (about 1 second per hash on our server!). But I guess you could use SHA-256 instead and iterate as many times as needed.
When a new CC number is entered, we start by finding all the existing CC numbers that end with the same 4 digits and have the same expiration date. Then, for each matching CC, we check whether its stored salted hash matches the salted hash calculated from its salt and the new CC number. In other words, we check whether or not hash(stored_CC1_salt+CC2)==stored_CC1_hash.
Since we have roughly 100k credit cards in our database, we need to calculate about 10 hashes, so we get the result in about 10 seconds. In our case, this is fine, but you may want to tune bcrypt down a bit. Unfortunately, if you do, this solution will be less secure. On the other hand, if you tune bcrypt to be even more CPU-intensive, it will take more time to match CC numbers.
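A sketch of that two-step check, using Node's built-in scrypt in place of bcrypt (the cost parameters, field names and 50-byte salt are illustrative, not the poster's actual values):

import { scryptSync, timingSafeEqual } from "crypto";

// Stored per card: last 4 digits, expiry date, a long random salt, and a slow salted hash.
type StoredCard = {
  last4: string;
  expiry: string; // e.g. "2027-08"
  salt: Buffer;   // e.g. 50 bytes from crypto.randomBytes, generated at registration
  hash: Buffer;   // slowHash(cardNumber, salt), 64 bytes
};

// Cost parameters are illustrative; raise N until one hash takes as long as you can tolerate.
function slowHash(cardNumber: string, salt: Buffer): Buffer {
  return scryptSync(cardNumber, salt, 64, { N: 2 ** 14, r: 8, p: 1 });
}

function isDuplicate(newCard: { number: string; expiry: string }, stored: StoredCard[]): boolean {
  const last4 = newCard.number.slice(-4);
  return stored
    // Step 1: cheap pre-filter on last 4 digits and expiry date.
    .filter(c => c.last4 === last4 && c.expiry === newCard.expiry)
    // Step 2: expensive salted hash only for the few remaining candidates.
    .some(c => timingSafeEqual(slowHash(newCard.number, c.salt), c.hash));
}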
Even though I believe that this solution is way better than simply storing an unsalted hash of the CC number, it will not prevent a very motivated pirate (who manages to get a copy of the database) from breaking one credit card in an average time of 2 to 5 years. So if you have 100k credit cards in your database, and if the pirate has a lot of CPU, then he can recover a few credit card numbers every day!
This leads me to the belief that you should not calculate the hash yourself: you have to delegate that to someone else. This is the second solution (we are in the process of migrating to this second solution).
Second solution
Simply have your payment provider generate an alias for your credit card.
for each credit card, you simply store whatever you want to store (for example the last 4 digits & the expiration date) plus a credit card number alias.
when a new credit card number is entered, you contact your payment provider and give it the CC number (or you redirect the client to the payment provider, and he enters the CC number directly on the payment provider's web site). In return, you get the credit card alias! That's it. Of course you should make sure that your payment provider offers this option, and that the generated alias is actually secure (for example, make sure they don't simply calculate a SHA-1 on the credit card number!). Now the pirate has to break your system plus your payment provider's system if he wants to recover the credit card numbers.
It's simple, it's fast, it's secure (well, at least if your payment provider is). The only problem I see is that it ties you to your payment provider.
Hope this helps.
PCI DSS states that you can store PANs (credit card numbers) using a strong one-way hash. They don't even require that it be salted. That said you should salt it with a unique per card value. The expiry date is a good start but perhaps a bit too short. You could add in other pieces of information from the card, such as the issuer. You should not use the CVV/security number as you are not allowed to store it. If you do use the expiry date then when the cardholder gets issued a new card with the same number it will count as a different card. This could be a good or bad thing depending on your requirements.
An approach to make your data more secure is to make each operation computationally expensive. For instance if you md5 twice it will take an attacker longer to crack the codes.
It's fairly trivial to generate valid credit card numbers and to attempt to put a charge through for each possible expiry date. However, it is computationally expensive. If you make it more expensive to crack your hashes, then it wouldn't be worthwhile for anyone to bother, even if they had the salts, hashes, and the method you used.
#Cory R. King
SHA-1 isn't broken, per se. What the article shows is that it's possible to generate 2 strings which have the same hash value in less than brute-force time. You still aren't able to generate a string that equates to a SPECIFIC hash in a reasonable amount of time. There is a big difference between the two.
I believe I have found a fool-proof way to solve this problem. Someone please correct me if there is a flaw in my solution.
Create a secure server on EC2, Heroku, etc. This server will serve ONE purpose and ONLY one purpose: hashing your credit card.
Install a secure web server (Node.js, Rails, etc) on that server and set up the REST API call.
On that server, use a unique salt (1000 characters) and SHA512 it 1000 times.
That way, even if hackers get your hashes, they would need to break into your server to find your formula.
Comparing hashes is a good solution. Make sure that you don't just salt all the credit card numbers with the same constant salt, though. Use a different salt (like the expiration date) on each card. This should make you fairly impervious to dictionary attacks.
From this Coding Horror article:
Add a long, unique random salt to each password you store. The point of a salt (or nonce, if you prefer) is to make each password unique and long enough that brute force attacks are a waste of time. So, the user's password, instead of being stored as the hash of "myspace1", ends up being stored as the hash of 128 characters of random unicode string + "myspace1". You're now completely immune to rainbow table attack.
Almost a good idea.
Storing just the hash is a good idea, it has served in the password world for decades.
Adding a salt seems like a fair idea, and indeed makes a brute force attack that much harder for the attacker. But that salt is going to cost you a lot of extra effort when you actually check to ensure that a new CC is unique: You'll have to SHA-1 your new CC number N times, where N is the number of salts you have already used for all of the CCs you are comparing it to. If indeed you choose good random salts you'll have to do the hash for every other card in your system. So now it is you doing the brute force. So I would say this is not a scalable solution.
You see, in the password world a salt adds no cost because we just want to know if the clear text + salt hashes to what we have stored for this particular user. Your requirement is actually pretty different.
You'll have to weigh the trade-off yourself. Adding salt doesn't make your database secure if it does get stolen, it just makes decoding it harder. How much harder? If it changes the attack from requiring 30 seconds to requiring one day, you have achieved nothing -- it will still be decoded. If it changes it from one day to 30 years, you have achieved something worth considering.
Yes, comparing hashes should work fine in this case.
A salted hash should work just fine. Having a salt-per-user system should be plenty of security.
SHA-1 is broken. Of course, there isn't much information out there on what a good replacement is. SHA-2?
If you combine the last 4 digits of the card number with the card holder's name (or just last name) and the expiration date you should have enough information to make the record unique. Hashing is nice for security, but wouldn't you need to store/recall the salt in order to replicate the hash for a duplicate check?
I think a good solution, as hinted at above, would be to store a hash value of, say, the card number, expiration date, and name. That way you can still do quick comparisons...
SHA-1 being broken is not a problem here.
All "broken" means is that it's possible to calculate collisions (2 data sets that have the same SHA-1) more easily than you would expect.
This might be a problem for accepting arbitrary files based on their SHA-1, but it has no relevance for an internal hashing application.
If you are using a payment processor like Stripe / Braintree, let them do the "heavy lifting".
They both offer card fingerprinting that you can safely store in your db and compare later to see if a card already exists:
Stripe returns fingerprint string - see doc
Braintree returns unique_number_identifier string - see doc