I know what you are thinking - "That text editor assignment I did in freshman year, if only I had put it up on the Web at the right time...could be playing golf with Larry and Zuck right now..."
Now maybe, and it's a big maybe, given enough years of object-oriented thinking and a few years' supply of Red Bull, you could come up with some text editor prototype. But it is guaranteed that you would not be able to "put it on the Web" without embarrassing yourself. Scaling an application to the Web is a very difficult task, and Google Docs stands on the shoulders of giants in this respect, as do all the other popular Web services like Amazon, Netflix, and Facebook (yes, that blue and white website also has mind-boggling technology underneath).
To get a feel for what's involved, consider the simple problem of storing data. The data lives on giant server farms, and one can go crazy just figuring out how to get the hot air out of these monsters. At this scale, machines fail and reboot all the time, and even entire data centers going down is not unheard of. Yet hosting an application on the Web means promising uptime with a long string of nines in it. How can you provide that kind of stability when the underlying hardware itself is crashing all the time?
The basic idea behind storing data reliably in the presence of machine and network failures is to make copies of it, across different machines and across different data centers.
Parameters of the problem:
- N: Number of copies of the data
- R: Number of copies read to respond to a read request
- W: Number of copies written while fulfilling a write request
Trade-offs:
- To provide high fault tolerance and high availability, N needs to be large.
- To guarantee strong consistency, the property that all reads reflect the latest written data, one needs R + W > N, so that every read quorum overlaps every write quorum.
- To ensure fast reads, R needs to be low.
- To ensure fast writes, W needs to be low.
While consistency has been the cornerstone of traditional databases, the new generation of Web-scale databases often sacrifices consistency in favor of availability by settling for R + W <= N. In a typical system, R == 1 and W < N. Once written, the value propagates to all N replicas lazily, using mechanisms such as gossip protocols. This scheme is said to provide eventual consistency: all N copies get the last data written, in the limit. That is just a nice way of saying inconsistent reads are possible.
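Here is a minimal sketch of the N/R/W trade-off, just to make the overlap rule concrete. The function name and the example numbers are illustrative, not any particular database's API.

```python
# A toy classifier for the R + W > N rule described above.

def consistency_mode(n: int, r: int, w: int) -> str:
    """Classify a replication setup by the read/write quorum overlap rule."""
    if r + w > n:
        # Every read quorum intersects every write quorum, so a read
        # always sees at least one copy of the latest write.
        return "strong"
    # Reads may miss the latest write until it propagates lazily
    # (e.g. via gossip) to all N replicas.
    return "eventual"

print(consistency_mode(n=3, r=2, w=2))  # strong: 2 + 2 > 3
print(consistency_mode(n=3, r=1, w=1))  # eventual: the typical R == 1, W < N setup
```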
Due to inconsistent reads, writing certain applications on top of the eventual consistency model becomes extremely hard. For example, if you get hired by Facebook to implement their Credits feature, and all you have is a database supporting eventual consistency, then you might as well consider yourself fired, because yelling "Those marketing gals! Always selling vaporware without checking feasibility with engineering!" is not going to work.
The current database challenge is to provide the scalability demanded by Web scale applications while maintaining strong consistency.
An attempt in this direction is Google Megastore, which uses W == N, R == 1. Whenever data is written, it is written to all the copies. While this sounds simple, in practice it is a very tough problem to solve. Consider what happens when you open your browser and start editing a file in Google Docs. The browser is a client that connects to the application server running Docs, which in turn connects to the database entrusted with storing all the files. Let's say N == 3: the database maintains three copies of each file so that it can tolerate one failure. (In general, to tolerate F failures, 2F + 1 copies are needed to ensure a majority is still alive after the failures.)
Consider the case when a single user edits a file and presses the save button. The application layer sends a write request to all three copies of the file. If it gets an acknowledgement that all three copies are updated, it considers the write successful.
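A toy version of this W == N write path looks something like the sketch below. The replica list and the in-memory dictionaries stand in for real file servers; they are assumptions for illustration, not Google's code.

```python
# Single-writer save path: the write counts only if every copy acknowledges it.

replicas = [{"name": "F0"}, {"name": "F1"}, {"name": "F2"}]  # N == 3

def write_to_replica(replica: dict, content: str) -> bool:
    """Pretend to persist the document on one replica and acknowledge."""
    replica["content"] = content
    return True  # a real replica could crash or time out instead

def save_document(content: str) -> bool:
    acks = [write_to_replica(r, content) for r in replicas]
    return all(acks)  # W == N: every copy must confirm the write

if save_document("my freshman text editor, now on the Web"):
    print("save successful: all", len(replicas), "copies updated")
```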
But now comes the killer feature of Google Docs: collaboration. Consider the case where multiple users working on a file press the save button at nearly the same time. For this discussion, let's assume three users and call their applications A0, A1, A2. Let's call the three copies of the file they are editing F0, F1, F2.
To write a particular piece of data to one file copy, we need to make sure the same data is written to all the other copies. If A0 ends up modifying F0 independently of A1 modifying F2, for instance, we end up in an inconsistent state. In other words, there needs to be consensus, which is where the Paxos algorithm comes into the picture.
Paxos solves the problem of achieving consensus in two parts: first by establishing a majority, i.e., making sure a single update is assigned to a majority of the files, and then by propagating that majority into a consensus.
Let's follow the sequence of messages between the apps and the files to decide which app's data should be written to the file copies.
Objective 1: Establish Majority
Consider what happens when each of the applications A0, A1, A2 tries to grab a majority of the files to force its changes through. Each application must grab at least two of the copies. Let's say A0 goes after F0 and F1, A1 tries F1 and F2, and A2 tries F2 and F0. This reduces to the dining philosophers problem, with the applications acting as the philosophers and the files standing in for the forks.
One way to solve the dining philosophers problem is to assign priorities to the philosophers, so that a philosopher with higher priority can snatch a fork away from a lower-priority philosopher. To bring priority into the picture, we number the requests from the apps, as sketched below. Let's say A0 numbers its requests starting at 10, A1 from 20, and A2 from 30 onwards.
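One simple way to get such numbers, matching the walkthrough, is to give each app its own block of integers so every request number is unique and comparable. The block size of 10 is just the convention used in this example.

```python
# Each app draws request numbers from its own block:
# A0 uses 10, 11, 12, ...; A1 uses 20, 21, 22, ...; A2 uses 30, 31, 32, ...

def request_numbers(app_id: int, block: int = 10):
    """Yield an ever-increasing, globally unique sequence for one app."""
    n = block * (app_id + 1)
    while True:
        yield n
        n += 1

a0 = request_numbers(0)
a1 = request_numbers(1)
print(next(a0), next(a0))  # 10 11  -- A0's first bid, then its retry
print(next(a1), next(a1))  # 20 21
```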
We are now ready to examine the messaging.
- Message 1.a (From: App, To: Files): Each app sends an intent-to-bid message, carrying a unique number, to a majority of the files.
- Example: A0 sends {F0, F1} intent-to-bid with #10, A1 sends it to {F1, F2} with #20, and A2 to {F2, F0} with #30.
- Message 1.b (From: File, To: App): A file replies to a requesting app with an okay-to-bid message if it has not already granted permission to bid to another app with a higher number.
- Example: F1 gets A0's request first and replies to it; then it gets A1's and replies to that too.
- F2 gets A1's request first and replies to it; then it gets A2's and replies to that too.
- F0 gets A2's request first and replies to it; then it gets A0's and rejects it, since #10 is lower than #30.
- Both A1 and A2 end up getting permission to bid from a majority. A0 realizes it has no chance.
- Message 1.c (From: App, To: Files): If an app receives permission to bid from a majority of files, it goes ahead with a message to acquire the majority and indicates the data it wants to write.
- Example: A1 attempts to acquire {F1, F2} and A2 attempts to acquire {F2, F0} for their updates.
- Message 1.d (From: File, To: App): A file accepts the bid for majority from the highest-numbered app it has granted permission to bid, and replies to each app telling it whether its bid was successful.
- Example: A2 succeeds and its update is considered the majority. A1 gets a failure message.
Barring catastrophic failures, the highest bidder ends up acquiring the majority, and the algorithm proceeds to its second part.
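The file's side of this first objective can be sketched in a few lines. This is a teaching aid under the simplification that no bid has been accepted yet; the class and method names are made up for illustration.

```python
# Each file remembers the highest-numbered intent-to-bid it has okayed and
# rejects anything lower -- which is exactly why F0 turns A0 (#10) away
# after okaying A2 (#30).

class FileCopy:
    def __init__(self, name: str):
        self.name = name
        self.highest_promised = 0   # highest request number okayed so far

    def intent_to_bid(self, number: int) -> bool:
        """Messages 1.a/1.b: okay-to-bid only for a number higher than any seen."""
        if number > self.highest_promised:
            self.highest_promised = number
            return True
        return False

f0 = FileCopy("F0")
print(f0.intent_to_bid(30))  # True  -- A2 gets okay-to-bid
print(f0.intent_to_bid(10))  # False -- A0 arrives later and is rejected
```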
Objective 2: Propagate Majority
- Message 2.a (From: App, To: Files): The apps that lost out now request permission to propagate the majority by asking the files which app they selected. This message again carries a unique number, higher than the number the app previously used (to indicate an increase in priority).
- Example: A0 sends {F0, F1} a message with #11, and A1 sends {F1, F2} a message with #21.
- Message 2.b (From: File, To: App): The files respond with information about the app whose bid they accepted.
- F0 responds that it has not accepted any bid.
- {F1, F2} respond that they have accepted A2's update as the majority.
- Message 2.c (From: App, To: File): The apps now join the consensus by choosing the highest accepted bid among the responses from the files they contacted.
- Example: A0, after comparing the responses from F0 and F1, decides to accept A2's update as the majority and replies back to {F0, F1}. A1, after comparing the responses from {F1, F2}, likewise decides to accept A2's update and replies back to {F1, F2}.
As a result, A2's update is accepted by everyone. A2 goes ahead and writes its data onto {F0, F1, F2}.
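The losing app's choice in Message 2.c can be sketched as follows. It assumes the replies arrive as (number, value) pairs, or None when a file has accepted nothing; that representation is mine, chosen only to keep the example short.

```python
# How a losing app (A0 or A1) joins the consensus: side with the value of the
# highest-numbered accepted bid, and only push your own update if no file has
# accepted anything yet.

def choose_value(replies, own_update):
    """Each reply is (accepted_number, accepted_value) or None."""
    accepted = [r for r in replies if r is not None]
    if accepted:
        return max(accepted)[1]   # value of the highest-numbered accepted bid
    return own_update             # nothing accepted yet: propose your own data

# A0's view after Message 2.b: F0 accepted nothing, F1 reports A2's bid (#30).
print(choose_value([None, (30, "A2's edit")], "A0's edit"))  # -> A2's edit
```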
While this sequence of messaging is easy to understand, the Paxos algorithm is a little more involved because of the following optimization: when a file receives Message 1.a from an application, and it has already accepted a bid from some other application, then instead of replying with Message 1.b it jumps directly to Message 2.b. The apps then proceed with either Message 1.c or Message 2.c, depending upon the reply they received. Hence, the Paxos algorithm is stated as:
Phase 1:
i. Apps (called Proposers) send Message 1.a (called Prepare) to the files (called Acceptors, actually an Acceptor/Learner combo).
ii. Apps get a response to the Prepare, which is either Message 1.b or Message 2.b, depending on whether the Acceptor has already accepted a proposal.
Phase 2:
i. Proposers who get a response to their Prepare message from a majority send out a second message, called Accept, which is either Message 1.c or Message 2.c depending on what they received in response to their Prepare.
ii. An Acceptor accepts a request unless it has already responded to a higher-numbered Prepare request.
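Putting the two phases together, an Acceptor's behaviour condenses to roughly the sketch below. It is illustrative only, with messaging, failures and the Learner role all left out; it is not the Megastore implementation.

```python
# A condensed Acceptor: answer Prepare with either a plain promise (Message 1.b)
# or the previously accepted proposal (Message 2.b), and accept a proposal only
# if no higher-numbered Prepare has been answered since.

class Acceptor:
    def __init__(self):
        self.promised = 0        # highest Prepare number responded to
        self.accepted = None     # (number, value) of the accepted proposal, if any

    def prepare(self, number: int):
        """Phase 1.ii: promise, reporting any previously accepted proposal."""
        if number > self.promised:
            self.promised = number
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, number: int, value):
        """Phase 2.ii: accept unless a higher-numbered Prepare was answered."""
        if number >= self.promised:
            self.promised = number
            self.accepted = (number, value)
            return True
        return False

f2 = Acceptor()
f2.prepare(20)                      # A1's Prepare
f2.prepare(30)                      # A2's higher-numbered Prepare wins
print(f2.accept(20, "A1's edit"))   # False -- outbid
print(f2.accept(30, "A2's edit"))   # True
```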
This is still far from the actual implementation, a sketch of which can be found in Section 4.4 of the Megastore paper. Paxos is really a family of algorithms, and they form the basic building blocks on top of which real-life applications are built.
One last thought before you rush off to put that freshman year project on the Web, now of course with a Paxos-managed replicated file store in the backend: consider the taken-for-granted feature of spell check. Even if you have the spell check algorithm figured out, consider running it for millions of users over the billions of words they type, all in real time...
Yeah, better to use Google Docs for now. If you still want to give Larry some competition, the least you can do is outsource the data management part to someplace like Amazon or Rackspace. Heck, even Larry may help you out.