Consider, if you will, two actors, Alice and Bob. Alice wants to store some data, so she pays Bob to store the data. Alice gives Bob the data, and deletes it on her own server, but she wants to know whether Bob still has the data. After all, it’s possible that Bob simply took her money and threw away her data.
One method would be for Alice to periodically ask Bob for all of her data, and she can hash the data to make sure that all of the data is present and untampered with (here I neglect the probability that Bob found a preimage for the data, so I assume a strong hash function). This method is clearly suboptimal in bandwidth, because it requires that all of the data be sent periodically.
Another obvious method would be for Alice to ask Bob for a hash of the data. However, Bob could simply compute the hash and throw away the data, replying with the remembered hash every time.
An extension of this technique is for Alice to specify what hash function Bob should use to hash the data. Each time she queries Bob, she invents a random hash function for Bob to use, and checks whether the hash is valid. In a formal space, this is specifying the seed for the hash function. In the real world this might be accomplished using a clever use of random number generators.
However, this poses a problem, because Alice needs to know the hash a priori in order to compare it to Bob’s reply. There are two means of doing this, one being Alice keeping the data, and the other being Alice precomputing a large number of hashes. In the latter case, Alice would precompute a large set of one-time-use hash functions, and each time she queries Bob she picks a different one. Assuming that the hash functions she chooses are not predictable, this is secure for as long as she has unused hash functions left. Eventually she will run out, and will not be able to tell whether Bob has the data.
This is not the worst flaw in the world, though. In a practical example, she could generate 1,000 different hashes, and ask Bob if he has the data once a week for 20 years.
But we can do better. Consider a homomorphic encryption scheme , with the property . According to Wikipedia, the Goldwasser-Micali scheme has this property.
Let Alice have a secret key to this scheme, so that Bob cannot decrypt data that Alice encrypts with the scheme. Now split the data into chunks, . Alice sends Bob , and she stores on her machine. She also chooses some hash function that maps bit strings to bit strings and stores such that does not use in its calculation.
Let us assume for now that each chunk has the same length , so that each chunk
When Alice wishes to verify whether Bob actually has all of the data, she randomly chooses some and some . She sends over to Bob.
Bob then computes and , and sends and to Alice.
Alice computes and decrypts the result, and then checks whether , and .
If Bob faithfully followed the protocol, and stored the data correctly, then these will be equal because , and the first check will succeed. The second check will succeed because will match Alice’s hash.
Assuming all is well, Alice is only required to store bits of data, and the bandwidth required is bits per verification.
In order for Bob to get away with storing less than bits on his machine (he could precompute for all possible values of , but that would not save him storage because he would still have to store bits), he would have to compute and such that their product was the encryption of the XOR of and all of the data blocks. If Bob precomputes , and simply sets , then he will fail Alice’s check for whether .
In order to pass the second check, Bob would have to find a valid preimage for , which if the hash function is sufficiently secure then this happens with negligible probability. Assuming that Bob is honest in computing , he must still compute correctly, and since is chosen randomly he cannot (except with negligible probability) precompute all values of .
So I claim that this is a secure proof-of-knowledge scheme that does not require the challenger to have complete knowledge. Also, this scheme is not limited in the linear-growth-in-time as the naive scheme above is, where in order to last for verifications one must store data. This scheme requires only data (any susceptibility is in collisions in with prior values for ).
This scheme does not require Bob to send Alice the data if she ever actually requests it in full. However, if Bob is determined to answer verification queries, this scheme can be used to reconstruct the data in full. That’s left as an exercise for the reader.