Cloudspace | Big Data for Fortune 100

Posted on September 19, 2012 by Tim Rosenblatt

GigaOm has a great piece recapping a presentation by Disney's Arun Jacob on their internal data platform.

Cost certainly played a role, but really it was flexibility that made the decision.

This bolsters my argument that custom software on top of open source packages is the best way to build software. Buying a whole solution means you’re paying for the packages, and the code to glue it all together, if you manage to get it working the way you want. Custom software means you’ll get it all, and you get a free ride on some parts because a bigger company has already produced a component you need and has made it available to everyone for free.

“We treated ourself like a small consulting organization and we had something to sell.” When a division wanted to use the platform for a particular function, Jacob would say yes and then get busy actually figuring out how to build it.

This is precisely how software groups should be run inside big companies.

The conversation should be “What do you want right now?”, followed by “OK, let’s make it do that”. The conversation should never be “Tell us everything you want it to do, ever, and we’ll make it do all of that at once”. Doing the latter is a guaranteed waste of time and money.

This is the essence of iterative development and Agile in a large organization. Build only the software that solves the issue in front of you; when something new comes up, modify the software to handle it. Don’t build functionality that you will probably need in October. Build the software you need now, now, and build October’s software in October.

Architecturally, it’s all about being able to recompose the path data takes through the platform and the components that are used for each particular purpose, or being able to easily replace pieces altogether if something better comes along.

For any type of platform, just build an API, then build everything else on top of the API. Everyone can use the API -- it’s a lingua franca. Then, if anything changes behind the scenes, it doesn’t matter. One set of changes behind the API and everyone is automatically upgraded to the new system.
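Here is a rough sketch of that idea in Python. None of it comes from Disney's platform; the names (`PlatformAPI`, `ListBackend`, `CounterBackend`, `report`) are made up for illustration. The point it tries to show is that code written against the API keeps working when the storage behind it is swapped out.

```python
from collections import Counter, defaultdict
from typing import Protocol


class Backend(Protocol):
    """What the API needs from whatever storage sits behind it."""

    def append(self, stream: str, event: dict) -> None: ...
    def count(self, stream: str) -> int: ...


class ListBackend:
    """First implementation: keep every event in a per-stream list."""

    def __init__(self) -> None:
        self._events = defaultdict(list)

    def append(self, stream: str, event: dict) -> None:
        self._events[stream].append(event)

    def count(self, stream: str) -> int:
        return len(self._events[stream])


class CounterBackend:
    """Later replacement: keeps only running totals, not the events."""

    def __init__(self) -> None:
        self._totals = Counter()

    def append(self, stream: str, event: dict) -> None:
        self._totals[stream] += 1

    def count(self, stream: str) -> int:
        return self._totals[stream]


class PlatformAPI:
    """The only surface other teams are allowed to call (hypothetical)."""

    def __init__(self, backend: Backend) -> None:
        self._backend = backend

    def record(self, stream: str, event: dict) -> None:
        self._backend.append(stream, event)

    def total(self, stream: str) -> int:
        return self._backend.count(stream)


def report(api: PlatformAPI) -> int:
    """Stand-in for another team's code; it knows only the API."""
    api.record("clicks", {"page": "/home"})
    api.record("clicks", {"page": "/about"})
    return api.total("clicks")
```

Calling `report(PlatformAPI(ListBackend()))` and `report(PlatformAPI(CounterBackend()))` gives the same answer, and `report` never has to change. That is the "automatic upgrade" in miniature.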

“You pay for [open-source projects] late at night, you pay for them by learning to run them, you pay for them by reading people’s source code who even if you could read it, it still doesn’t make any sense,” Jacob said. But those things can be overcome if you’re willing to put in the time.

This is a bit dramatic, and shouldn’t be misinterpreted as “see, this is why to not use open source software!”. The “late at night” comment shouldn’t be misinterpreted as “open source software crashes more often and needs more maintenance” (it doesn’t); “learning to run them” doesn’t mean that they are any harder to run than commercial software; and “reading people’s source code...[that] doesn’t make any sense” is cheaper than the support line where a ‘customer support engineer’ is going to be in the exact same situation.

In fact, the last point is completely destroyed with the knowledge that in an open source package, you can actually find the exact human who wrote a particular thing, and get their help on it. Business people, take note: all the human networking concepts you apply to getting advice are the same for finding out information about code in an open source package.

You also have to make systems built on open-source software consumable by everyone who needs to use them. That means it’s not enough to just build a scalable and stable system; the system also has to be easy enough for thousands of internal developers of all types and all skill levels to use. In a six-person startup, Jacob said, it’s easy enough for everyone to just learn Hadoop in a month and then start using it, but that’s not the case in a large enterprise.

This is actually the same point as "recomposing the data path" from before. He’s using a lot of words to say "build an API around the infrastructure underneath". The only really useful comment in a later paragraph is that they build client libraries so programmers don’t have to make API calls directly. This is a simple, logical extension of their idea to build an API.
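To make the client library idea concrete, here is a small Python sketch. It is an assumption about what such a library might look like, not a description of Disney's: the `PlatformClient` class, the `/streams/...` paths, and the `FakeServer` stand-in are all invented so the example runs without a network.

```python
import json
from typing import Callable

# A transport takes (method, path, body) and returns the decoded JSON reply.
Transport = Callable[[str, str, str], dict]


class PlatformClient:
    """Hypothetical client library; the paths below are invented."""

    def __init__(self, transport: Transport) -> None:
        self._send = transport

    def record_event(self, stream: str, **fields: object) -> None:
        # Callers pass keyword arguments; the client builds the request.
        self._send("POST", f"/streams/{stream}/events", json.dumps(fields))

    def event_count(self, stream: str) -> int:
        reply = self._send("GET", f"/streams/{stream}/count", "")
        return int(reply["count"])


class FakeServer:
    """In-memory stand-in for the real service so the sketch runs offline."""

    def __init__(self) -> None:
        self.events = {}

    def __call__(self, method: str, path: str, body: str) -> dict:
        parts = path.strip("/").split("/")  # ["streams", <name>, <action>]
        stream, action = parts[1], parts[2]
        if method == "POST" and action == "events":
            self.events.setdefault(stream, []).append(json.loads(body))
            return {}
        if method == "GET" and action == "count":
            return {"count": len(self.events.get(stream, []))}
        raise ValueError(f"unsupported call: {method} {path}")


client = PlatformClient(FakeServer())
client.record_event("clicks", page="/home")
client.record_event("clicks", page="/about")
```

A programmer using this only ever sees `record_event` and `event_count`. The URLs, HTTP verbs, and JSON encoding stay inside the library, so they can change without anyone's application code changing.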