Sunday, May 1, 2011

The Magic Behind Google Docs

I love Google Docs, and it's in no way fueled by any anti-Microsoft feeling. In fact, for natural-linguistically-challenged people like me, Word is a God-sent gift, not to mention that getting a negative margin on indented paragraphs doesn't seem possible in Google Docs quite yet. But the killer feature of Google Docs is, of course, collaboration. No emailing, saving, diffing, getting confused. The day you start using Google Docs, you instantly know this is how document editing should have been done all along. Fortunately, Google considers Docs a success, unlike some of their other experiments, and signs are it's going to stay.

I know what you are thinking - "That text editor assignment I did in freshman year, if only I had put it up on the Web at the right time...could be playing golf with Larry and Zuck right now..."

Now maybe, a big maybe at that, given enough years of object-oriented thinking and a few years' supply of Red Bull, you can come up with some text editor prototype. But it is guaranteed that you will not be able to "put it on the Web" without causing yourself embarrassment. Taking an application to Web scale is a very difficult task, and Google Docs stands on the shoulders of giants in this respect, as do all other popular Web services like Amazon, Netflix and Facebook (yes, that blue and white website too has mind-boggling technology underneath).

To get a feel for what's involved, consider the simple problem of storing data. The data is stored on giant server farms, and one can go crazy just figuring out how to get the hot air out of these monsters. At this scale, machines keep failing and rebooting all the time. Even entire data centers going down is not unheard of. But hosting an application on the net means promising uptime measured in a long string of nines. How can you provide such stability when the underlying hardware itself is crashing all the time?

The basic idea behind storing data reliably in the presence of machine and network failures is to make copies of it, over different machines, over different data centers.

Parameters of the problem :

N : Number of copies of the data
R : Number of copies read to respond to a Read request
W : Number of copies written while fulfilling a Write request

Trade offs :
  • To provide high fault tolerance and support high availability N needs to be large.
  • To guarantee strong consistency, the property that all reads should reflect the latest written data, one needs to ensure R + W > N so that there is some guaranteed overlap between all reads and writes.
  • To ensure fast read operations R needs to be low.
  • To provide fast write operations W needs to be low.

For systems that provide consistency based on R + W > N, a network partition can make some copies unreachable, so R reads or W writes may become impossible and the system becomes unavailable. The trade-off is summarized in the CAP theorem, which states that of the three requirements - Consistency, Availability and Partition tolerance - only two can be achieved at a time.
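
To make the R + W > N overlap concrete, here is a minimal sketch in Python (the replica list, version numbers and quorum sizes are all made up for illustration, not any real database's API). A write stamps a new version on W copies, and because any R copies must overlap any W copies, a read always sees the latest version:

    import random

    N, R, W = 3, 2, 2                       # R + W > N, so read and write quorums overlap

    # each copy holds (version, value); version 0 means "never written"
    replicas = [(0, None) for _ in range(N)]

    def write(value, version):
        """Update any W copies, stamping the new version."""
        for i in random.sample(range(N), W):
            replicas[i] = (version, value)

    def read():
        """Read any R copies and return the value carrying the highest version."""
        sampled = [replicas[i] for i in random.sample(range(N), R)]
        return max(sampled, key=lambda rec: rec[0])

    write("draft-1", version=1)
    write("draft-2", version=2)
    print(read())                           # always (2, 'draft-2'): the quorums overlap

Shrink R or W so that R + W <= N and the overlap, and with it the guarantee, quietly disappears.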

While consistency has been the cornerstone of traditional databases, the new generation of Web-scale databases often sacrifices consistency in favor of availability by employing R + W <= N. In a typical system, R == 1 and W < N. Once written, the value propagates lazily to all N replicas using mechanisms such as gossip protocols. This scheme is said to provide eventual consistency: in the limit, all N copies converge to the last data written. That is just a nice way of saying inconsistent reads are possible in this scheme.
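
Continuing the same toy model (again only a sketch; real gossip protocols are far more involved than one copy-the-newest step), R == 1 with lazy propagation makes the inconsistent read easy to see:

    # N = 3 copies, W = 1, R = 1: a read may hit a copy the write has not reached yet
    replicas = [(1, "draft-1")] * 3         # all copies start in sync

    replicas[0] = (2, "draft-2")            # write acknowledged after updating just one copy

    print(replicas[2])                      # (1, 'draft-1') -- a stale read

    # lazy propagation eventually spreads the newest version to every copy
    newest = max(replicas, key=lambda rec: rec[0])
    replicas = [newest] * 3                 # from now on every read returns (2, 'draft-2')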

Due to inconsistent reads, writing certain applications on top of the eventual consistency model becomes extremely hard. For example, if you get hired by Facebook to implement their Credits feature, and all you have is a database supporting eventual consistency, then you might as well consider yourself fired, because yelling "Those marketing gals! Always selling vaporware without checking feasibility with engineering!" is not going to work.

The current database challenge is to provide the scalability demanded by Web scale applications while maintaining strong consistency.

An attempt in this direction is Google's Megastore, which uses W == N, R == 1. Whenever data is written, it is written to all the copies. While this sounds simple, in practice it is a very tough problem to solve. Consider what happens when you open your browser and start editing a file in Google Docs. The browser is a client that connects to the application server running Docs, which in turn connects to the database that is trusted with storing all the files. Let's say N == 3: the database maintains three copies of each file so that it can tolerate one failure. (In general, to tolerate F failures, 2F+1 copies are needed to ensure a majority is still alive after the failures.)

Consider the case when a single user edits a file and presses the save button. The application layer sends a write request to all three copies of the file. If it gets an acknowledgement that all three copies are updated, it considers the write successful.

But now comes the killer feature of Google Docs, collaboration. Consider the case where multiple users working on a file press the save button at nearly the same time. For the discussion, let's assume three users and call their applications A0, A1, A2. Let's call the three copies of the file they are editing F0, F1, F2.

To write a particular piece of data to a file copy, we need to make sure that we write the same data to all the other copies. If A0 ends up modifying F0 independently of A1 modifying F2, for instance, then we will end up in an inconsistent state. In other words, there needs to be a consensus, which is where the Paxos algorithm comes into the picture.

Paxos solves the problem of achieving consensus in two parts: first by establishing a majority, i.e., making sure a single update is assigned to a majority of the files, and then by propagating that majority into a consensus.

Let's follow the sequence of messages between the apps and the files to decide which app's data should be written to the file copies.

Objective 1 : Establish majority

Consider what happens when each of the applications A0, A1, A2 tries to grab a majority of the files to force its changes. Each application must grab at least two of the copies. Let's say A0 goes after F0 and F1, A1 tries F1, F2 and A2 tries F2, F0. This problem reduces to the dining philosophers problem, with the applications acting as the philosophers and the files substituting for forks.

One way to solve the dining philosophers problem is to assign priority to the philosophers, so that a philosopher with higher priority can snatch away the fork from a lower priority philosopher. To bring priority into the picture, we number the requests from the apps. Let's say A0 numbers its requests starting at 10, A1 from 20 and A2 from 30 onwards.
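
A minimal sketch of this numbering convention (the bases 10, 20, 30 are just the convention of this example; real implementations typically interleave numbers, e.g. round * number_of_apps + app_id, so that any app can always outbid any other):

    class BidNumber:
        """Hands out request numbers for one app: A0 gets 10, 11, 12, ...,
        A1 gets 20, 21, ..., A2 gets 30, 31, ... Numbers stay unique across
        apps as long as no app retries more than `spacing` times."""
        def __init__(self, app_id, spacing=10):
            self.next_value = (app_id + 1) * spacing

        def next(self):
            n = self.next_value
            self.next_value += 1
            return n

    a0, a1, a2 = BidNumber(0), BidNumber(1), BidNumber(2)
    print(a0.next(), a1.next(), a2.next())  # 10 20 30 -- the first bids below
    print(a0.next(), a1.next())             # 11 21 -- the retry numbers used in Objective 2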

We are now ready to examine the messaging.
  • Message 1.a (From: App To: Files) : Each app sends an intent-to-bid message to a majority of the files with a unique number.
    • Example, A0 sends {F0,F1} intent-to-bid with #10, A1 to {F1,F2} with #20 and A2 to {F2,F0} with #30.
  • Message 1.b (From: File To: App) : A file replies to a requesting app with an okay-to-bid message if it has not already granted permission to bid to another app with a higher number.
    • Example, F1 gets A0 first, replies to it. Then gets A1 and replies to it.
    • F2 gets A1 first, replies to it; then gets A2 and replies to it.
    • F0 gets A2 first, replies to it; then gets A0 and rejects it.
    • Both A1 and A2 end up getting permission to bid with a majority response. A0 realizes it has no chance.
  • Message 1.c (From: App To: Files) : If an app receives permission to bid from a majority of files, it goes ahead with a message to acquire majority and indicates the data it wants to write.
    • Example, A1 attempts to acquire {F1,F2} and A2 attempts to acquire {F2,F0} for their updates.
  • Message 1.d (From: File To: App) : The file accepts the bid for majority from the highest-numbered app to which it has granted permission to bid, and replies to each app whether its bid was successful or not.
    • Example, A2 succeeds and its update is considered the majority. A1 gets a failed message.

Barring catastrophic failures, the highest bidder ends up acquiring the majority, and the algorithm proceeds to its second part.
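
The file-side rule behind Messages 1.a and 1.b fits in a few lines (a sketch; in Paxos terms this is the acceptor's promise, and the dictionary-based bookkeeping is invented for this walkthrough):

    def on_intent_to_bid(file_state, bid_number):
        """A file grants okay-to-bid only to the highest bid number it has seen so far."""
        if bid_number > file_state.get("highest_promised", 0):
            file_state["highest_promised"] = bid_number
            return "okay-to-bid"
        return "reject"

    f0 = {}
    print(on_intent_to_bid(f0, 30))         # F0 hears A2's #30 first: okay-to-bid
    print(on_intent_to_bid(f0, 10))         # then A0's lower-numbered #10: reject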

Objective 2 : Propagate majority

  • Message 2.a (From: App To: Files) : The apps which lost out now request permission to propagate the majority by asking the files about the app they selected. This message is again accompanied by a unique number, which is higher than the number the app previously used (to indicate an increase in priority).
    • Example, A0 sends {F0,F1} message with #11, A1 sends {F1,F2} message with #21.
  • Message 2.b (From: File To: App) : The files respond with the information regarding the app whose bid they accepted.
    • F0 responds it has not accepted any bid.
    • {F1, F2} respond they have accepted A2's update as the majority.
  • Message 2.c (From: App To: Files) : The apps now join the consensus by choosing the highest accepted bid they get in response from all the files they contacted.
    • Example, A0, after comparing the responses from F0 and F1, decides to accept A2 as the majority and replies back to {F0,F1}. A1, after comparing the responses from {F1,F2}, decides to accept A2 as the majority and replies back to {F1,F2}.

As a result, A2's update is accepted as the update by everyone. A2 goes ahead and writes its data onto {F0,F1,F2}.
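
The choice the losing apps make in Messages 2.b/2.c is mechanical: among the replies they gather, adopt the value attached to the highest-numbered accepted bid, if there is one (a sketch with made-up reply tuples):

    def choose_value(replies, my_value):
        """replies: one entry per contacted file, either (accepted_number,
        accepted_value) or None if that file has not accepted any bid."""
        accepted = [r for r in replies if r is not None]
        if accepted:
            return max(accepted, key=lambda rec: rec[0])[1]   # adopt the winner's data
        return my_value                                       # free to push our own data

    # A0's view after Message 2.b: F0 accepted nothing, F1 accepted A2's bid #30
    print(choose_value([None, (30, "A2's edit")], "A0's edit"))   # A2's edit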

While this sequence of messaging is easy to understand, the Paxos algorithm is more complicated because of the following optimization: when a file receives Message 1.a from an application and it has already accepted a bid from some other application, then instead of replying with Message 1.b, it directly jumps to Message 2.b. The apps then go to Message 1.c or Message 2.c depending upon the message received. Hence, the Paxos algorithm is stated as:

Phase 1:
i. Apps (called Proposers) send Message 1.a (called Prepare) to the files (called Acceptors, actually an Acceptor/Learner combo).
ii. Apps get a response to the Prepare, which is either Message 1.b or Message 2.b, depending on whether the Acceptor has already accepted a proposal.


Phase 2:
i. Proposers who get a response to their Prepare message from a majority send out a second message called Accept, which is either Message 1.c or Message 2.c depending on what they received in response to their Prepare.
ii. Acceptors accept the request unless they have already responded to a higher-numbered Prepare request.
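
Putting the two phases together, here is a bare-bones, single-round sketch in Python. It only maps the intent-to-bid/okay-to-bid/acquire vocabulary of this post onto Prepare/Promise/Accept; there are no failures, retries, learners or networking, and it is nothing like Megastore's actual implementation:

    class Acceptor:
        """One file copy: promises in Phase 1, accepts in Phase 2."""
        def __init__(self):
            self.promised = 0          # highest Prepare number promised so far
            self.accepted = None       # (number, value) of the accepted bid, if any

        def prepare(self, n):          # Message 1.a -> reply 1.b or 2.b
            if n > self.promised:
                self.promised = n
                return ("promise", self.accepted)
            return ("reject", None)

        def accept(self, n, value):    # Message 1.c / 2.c -> accepted or not
            if n >= self.promised:
                self.promised = n
                self.accepted = (n, value)
                return True
            return False

    def propose(acceptors, n, my_value):
        """One round for one proposer; returns the value the quorum accepted, or None."""
        promises = [a.prepare(n) for a in acceptors]
        granted = [acc for verdict, acc in promises if verdict == "promise"]
        if len(granted) <= len(acceptors) // 2:
            return None                                     # no majority granted
        already = [acc for acc in granted if acc is not None]
        value = max(already, key=lambda rec: rec[0])[1] if already else my_value
        votes = sum(a.accept(n, value) for a in acceptors)
        return value if votes > len(acceptors) // 2 else None

    files = [Acceptor(), Acceptor(), Acceptor()]            # F0, F1, F2
    print(propose(files, 30, "A2's edit"))                  # A2's edit wins
    print(propose(files, 21, "A1's edit"))                  # None: 21 < 30, rejected
    print(propose(files, 31, "A1's edit"))                  # A2's edit again -- A1 adopts it

Run in order, the three calls reproduce the walkthrough: A2's bid #30 wins, A1's stale #21 is rejected, and A1's retry with #31 is forced to adopt the value already chosen.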

This is still far from the actual implementation, a sketch of which can be found in Section 4.4 of the Megastore paper. Paxos is actually a family of algorithms, and they form the basic blocks on top of which real-life applications are built.

One last thought before you rush off to put that freshman-year project on the Web, of course now with a Paxos-managed replicated file server in the backend: consider the taken-for-granted feature of spell check. If you have the algorithm for spell check figured out, consider running that algorithm for millions of users and the billions of words they type, all in real time...

Ya, better use Google Docs for now. If you still want to give Larry some competition, the least you can do is outsource the data management part to someplace like Amazon or Rackspace. Heck, even Larry may help you out.

Saturday, April 30, 2011

Future of Electronic Design Automation (EDA)

EDA is a very connected community. If you work in EDA, chances are all the people you know work in EDA. Every social gathering is exclusively of EDA folks, and the discussions that start in the evening on varied topics like movies and sports reduce by night to a familiar one - the decaying state of EDA.

This has been the case for years; everyone has been talking about how 'EDA is not what it used to be' forever. Nothing wrong with that, the grass is always greener on the other side, and a periodic review of your profession can only be considered healthy. But these worries over the state of EDA are no longer some stray honk at a red light that you can ignore and keep driving; it's more like getting rear-ended by a pickup truck.

The last EDA vendor to go public was Magma. That was way back in 2001! A lot of bytes have flown through the net since then, but the state of EDA can at best be considered stagnant from that point onwards. The only significant product since then is perhaps the rapid prototyping tools. Not that attempts were not made to try out new things. The routing guys tried out the X architecture, in synthesis people tried to raise the abstraction level with various High Level Synthesis tools, and verification folks dabbled in formal technologies. But nothing gained any traction beyond selling the CAD teams at semiconductor companies a few evaluation copies (okay, maybe one or two licences to the design teams, but no more).

A significant indicator of the health of the industry is the annual Design Automation Conference. An interesting paper by Alberto Sangiovanni-Vincentelli summarizes the history of DAC till 2003. The curve on the second page continued to drop precipitously, and by 2009 the number of attendees had come down to around 3,000. At the height of things around 1999, DAC used to be a real party. Busty blondes stood in front of booths explaining "physical synthesis", giveaways meant real gizmos of the day, and I used to collect my year's supply of T-shirts there. Now all you get to see is engineers with droopy eyes sitting behind deserted stalls with a bowl of candies.

Profs in academia seem to have judged the situation much better than the engineers in the field. Veterans in the field, like Randy Bryant (inventor of BDDs), have moved from EDA to "the type of computing systems used for Internet search". Most working in the field now hedge their bets by maintaining parallel interests. Venture funding for EDA startups has dropped from the 2004 "high" of around $100 million to practically zero by 2010. In comparison, Internet startups continue to attract over a billion dollars per quarter in funding. Within ten years of Internet Bubble 1.0, people find enough reasons to put millions into Twitter clones, even though Twitter itself is a loss-making entity with no real revenue model as of now. Still, they don't want to put money in EDA anymore.

Stop! Stop! Stop! EDA serves electronic design, so as long as electronics is part of our life, EDA would flourish, right?

The answer is not very straightforward; it may even be a resounding No. The last few years have seen a slew of mega-hit electronic products - smartphones, e-readers, Kinect. The revenues of semiconductor companies like Samsung, TI, Apple, ARM and Xilinx have soared. But EDA revenues have remained frozen. The key to understanding the discrepancy is to track the number of ASIC design starts, which has steadily declined. For higher-end designs, the ones beyond 65nm, the decline is even sharper. The demand for bleeding-edge EDA tools is now minuscule, and the lower-end design flows have stabilized. The hardware design process has evolved from building everything from scratch to platform-based design, and cores like ARM have become ubiquitous. Unfortunately, in this scheme of things, EDA tools are no longer the innovation bottleneck, and they have been pushed into a maintenance-mode role.

So what's implied here, is EDA going to die? Will all of us be on the streets, resume in hand? Of course not. Job security and job growth are two entirely different things. EDA will continue to be a vital piece of the IT ecosystem. But much like plumbing, while essential, it will be relegated to the status of a utility, and the hectic days of the 90s will never return. Experts in the various tools can continue comfortably in their roles, maintaining their products by making incremental changes.

But is the situation comfortable? A myth that adds to the current comfort is that EDA jobs are high paying. While that was true a decade ago, a quick glance at the salary surveys on sites like GlassDoor reveals that is certainly not the case any longer. The other myth is that EDA has above-average-caliber people who, on an individual basis, add more value to their company compared to other software firms, and hence salaries will stay high. This again doesn't add up. Microsoft generates a revenue of $70 billion with 89,000 employees, an average of $786K per employee. Google generates $30 billion with only 26,000 employees, an average of $1.15 million. In comparison, Cadence's revenue of $935 million comes from 4,600 employees, an average of $203K. The average for Magma comes to around $197K and for Synopsys $211K. The productivity of the average EDA engineer is about one-fourth that of the established IT names. The downward pressure on salaries can only increase from this point onwards. With little new talent coming in and the existing talent trickling away, the notion of high-caliber people will also be difficult to maintain.

Where does all this lead? One book that comes to mind in this context is "Who Moved My Cheese?" Note that the people who like the book and those who hate it are about evenly split. So this is nothing earthshaking, and it is outright pedantic for the most part. But as it happens with anything that sells over ten million copies, there's something useful in it. My takeaway was that job growth trumps job security. The only way to stay alive is to stay hungry.