Archive: October 2011

Trip report - XLDB Conference

Venue: SLAC (Stanford Linear Accelerator Center)
Conference: XLDB 2011

Overview

I attended the XLDB 2011 conference earlier in the week. XLDB seems to be an emerging conference. There were 280-300 people in attendance and more had to be turned away. In fact, there were people who turned up anyway, even though they hadn’t secured a place, hoping that there would be cancellations.

My understanding is that XLDB emerged from the need to discuss technology, issues, and requirements for Big Data in the scientific community, especially when using supercomputing facilities. However, it is clear to everyone that Big Data is no longer specific to those who are doing supercomputing. Even though we are moving into the exascale supercomputing era (the fastest supercomputer today can do >8 PFlops), having to deal with GBs, PBs, or even EBs of data is not a problem exclusive to those who have access to supercomputing facilities. Hence, it was not a big surprise to find out that the conference attracted the interest of the industry.

A number of industry leaders had a presence at the conference: Google, Microsoft, eBay, Netflix, LinkedIn, Facebook, Amazon (even though they decided at the last minute not to present), and IBM (to name a few). There were talks by scientists as well.

Most of the talks were great and I enjoyed them a lot. I chose to highlight the Netflix, Metamarkets, and Novartis ones as the driving examples for my observations. The conference organizers have promised to publish the slides and the videos of the presentations.

The value of data

In my mind, the Big Data space is not a niche any more. It’s not a space that any company offering enabling technologies, solutions, and services to its customers can afford to ignore. Many customers already have real problems, they already take advantage of Big Data processing infrastructures, and their competitiveness is based on their ability to extract value and insights from the data they collect.

Take Netflix for example. Their VP of Data Science and Engineering (highlighting “Data Science” in the title!!!) gave an excellent talk on how Netflix won the DVD-shipping game, how they became competitive. It was all because of the data they collected and then analyzed. They heavily instrumented their DVD-handling equipment. Every single aspect of a DVD’s route was recorded and sent to Netflix’s data warehouse. Decisions within msecs had to be made about how to best route each disc. Data was collected and then processed in order to optimize all aspects of their business. They had become so good at it that their only bottleneck became the post office. They had reached such a level of data-based business intelligence that they even went to the post office and started helping them optimize their operations.

The VP of Data Science and Engineering at Netflix was a happy person until Netflix decided to get into the streaming business. Their data collection and analysis requirements skyrocketed!

Here comes the cloud

Netflix needed to expand its ability to process data and make business decisions. They really wanted to move away from the business of managing infrastructure. They didn’t want to have to deal with operations, data centers, machines, and so on. They went through a migration period of progressively moving their entire data collection and processing infrastructure into Amazon’s cloud.

Granted, Netflix had to build their own pipeline based on open source technologies. They used the right tool for the job. They used a NoSQL solution for reliably gathering/recording their data at scale. They used an RDBMS where it made sense.

Netflix is a big company. They can build their own data processing infrastructure from the various pieces. However, what about all those smaller companies that want to collect and process data that is critical to their growth, competitiveness, and survival? Wouldn’t they benefit from cloud solutions that are scalable, reliable, and NOT managed by them?

Take Metamarkets as an example. They are doing predictive analytics that help advertisers around the world. Apparently the advertising game is following that of the financial markets. Advertisers need to be able to make decisions within a few seconds. They need to analyze large amounts of data (billions of microtransactions per day) very fast.

Their need for a very fast engine for doing almost real-time analytics was not addressed by any existing solution. Metamarkets was born in the cloud and continues to operate in the cloud. They didn’t have to transition to it like Netflix did. Nevertheless, they still had to build their own distributed, in-memory database (Druid) because none of the solutions they tried could meet their requirements. Given their domain of focus, that’s effort that could have been avoided. Rather than focusing on infrastructure, they could have directed their investment toward offering better services to their customers. As it turned out, they managed to build a very good infrastructure that serves them well today.

The data analytics ecosystem

Companies like Vertica provide solutions for companies like Metamarkets. The value proposition is obvious. If you want to build a service or a product that is based on, or depends upon, the processing of data at scale, then you don’t have to build the infrastructure yourself.

This is not about deploying a database management system. This is not about just deploying Hadoop or a NoSQL store. This is about getting a complete solution for your big data analytics needs, tailored to your specific requirements (e.g. close-to-real-time processing, batch processing, scale, cloud, etc.).

Novartis happens to concentrate on providing solutions for the genomics/life sciences community. They utilize SciDB, an array-oriented parallel database. There are many companies like Novartis out there addressing different domains. We’ve all heard about them and are already monitoring them. The point is that such companies are offering solutions for real customer needs today. They reuse open source technologies in order to build an ecosystem of tools and services for their customers.

In my mind, a great opportunity resides in democratizing the data analytics ecosystem by offering scalable solutions at scale; that is, solutions that meet the compute- and data-processing scalability requirements of customers while doing so for 100s of millions of customers at the same time. An ecosystem that addresses all aspects of the Big Data space… data collection, management, processing, visualization, analysis, data mining, machine-based reasoning, and many more!

 

Isn’t it a great time to be in the cloud + big data space? :-)

 

XLDB was a great conference.

I moved to the US in September of 2005. I set a goal for myself to stay organized with regard to my finances and collect as much data as possible for data analysis. Since then, every single entry in my credit card statements has been explicitly reviewed and categorized. I used Microsoft Money and then Quicken (after Money was discontinued) to collect and manage all the transactions.

A question that recently came into my mind was this…

Has the almost exclusive use of my motorcycle* for commuting had any positive impact on the consumption of gasoline over the years?

(* for reference: my BMW R1200GS :-)

I thought that I should be able to figure this out by doing some data analysis. Ok… it’s not a “big data” analysis problem but the exercise does incorporate the necessary steps one needs to undertake in order to get insight from some data, whether big or small.

Total spending per month

The first step is to figure out my total spending on gasoline per month, which should be easy. Indeed, Quicken allowed me to sort all the transactions from the last 6 years. I copied the ones that were under the “gasoline” category to Excel and voilà…

image

It was easy to calculate the monthly spending using Excel’s grouping function. I did have to play a bit with the dates so that I could make grouping work.*
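
For anyone who would rather script this step than use Excel's grouping, a minimal sketch in Python could look like the following. It assumes the transactions have already been exported as (date, amount) pairs; the `monthly_totals` helper and the sample rows are made up for illustration and are not the actual data behind the chart.

```python
from collections import defaultdict
from datetime import date


def monthly_totals(transactions):
    """Sum the amounts of (date, amount) pairs per (year, month)."""
    totals = defaultdict(float)
    for day, amount in transactions:
        totals[(day.year, day.month)] += amount
    # Return a plain dict ordered chronologically by (year, month).
    return dict(sorted(totals.items()))


# Made-up sample rows standing in for the exported "gasoline" transactions.
sample_transactions = [
    (date(2011, 9, 7), 40.00),
    (date(2011, 8, 3), 42.50),
    (date(2011, 8, 19), 38.25),
]

spending_per_month = monthly_totals(sample_transactions)
for (year, month), total in spending_per_month.items():
    print(f"{year}-{month:02d}: {total:.2f}")
```

Keying on (year, month) sidesteps the date massaging that grouping in a spreadsheet tends to require.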

There is definitely a trend towards less spending. Of course, the above doesn’t take into consideration the fluctuating price of gasoline.

Finding historical gasoline prices

So, I had to go find the gasoline prices over time for the state of Washington. It took me a while on Bing to find a free source. The Department of Energy maintains historical data. Since I always use premium gas (better for the environment), I didn’t have to worry about averaging between the different types. I downloaded the Seattle data.

Grouping the data

Now that I had the data, I had to bring it into the same shape as my monthly-spending data. Again, some massaging of the dates, grouping, and I have the monthly average price of premium gasoline in the Seattle area.
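
The same reshaping can be sketched in Python. This assumes the downloaded prices are a list of (date, price) observations, for example one per week; the actual layout of the downloaded file is not described here, so the `monthly_average_price` helper and the sample values are purely illustrative.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_average_price(observations):
    """Average (date, price) observations per (year, month)."""
    by_month = defaultdict(list)
    for day, price in observations:
        by_month[(day.year, day.month)].append(price)
    return {month: mean(prices) for month, prices in sorted(by_month.items())}


# Made-up price observations in dollars per gallon, not real Seattle data.
sample_prices = [
    (date(2011, 8, 1), 3.50),
    (date(2011, 8, 8), 3.75),
    (date(2011, 9, 5), 4.00),
]

average_price_per_month = monthly_average_price(sample_prices)
for (year, month), price in average_price_per_month.items():
    print(f"{year}-{month:02d}: {price:.3f}")
```

Using the same (year, month) keys as the spending data makes the two series easy to line up in the next step.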

Comparing the data

Unfortunately, I haven’t been recording the actual miles that I travelled on a per-month basis. This makes it difficult to get an accurate view of the monthly spending in relation to the actual miles travelled and when compared to the prices of gasoline. Some months I travelled many more miles than others (e.g. road trips, visitors).

I do know that I have 63,000mi and 18,000mi on my car and motorcycle respectively. If I distribute these miles throughout the months, I find that on average I have been driving/riding 1,094.5mi per month. WOW!!!
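
As a quick sanity check of that average, here is a small Python sketch. The mileage totals come from the text; the period is an assumption (September 2005 through October 2011, counting both months), and `months_inclusive` is a hypothetical helper.

```python
def months_inclusive(start_year, start_month, end_year, end_month):
    """Count calendar months from start to end, including both endpoints."""
    return (end_year - start_year) * 12 + (end_month - start_month) + 1


total_miles = 63_000 + 18_000  # car + motorcycle, as stated in the text

# Assumption: the period runs from September 2005 through October 2011.
months = months_inclusive(2005, 9, 2011, 10)
average_miles_per_month = total_miles / months

print(months)
print(round(average_miles_per_month, 1))
```

Under that assumption the result lands within a fraction of a mile of the figure quoted above.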

With that information at hand, I can now calculate my monthly gasoline consumption in gallons (again, given the even distribution of my total miles throughout the months**).
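
One way to read this calculation, sketched in Python: gallons per month are estimated as monthly spending divided by the monthly average price, and miles/gallon as the evenly distributed monthly miles divided by those gallons. The function names and sample numbers are illustrative assumptions, not the actual figures.

```python
def monthly_gallons(spending, average_price):
    """Estimate gallons bought per month as spending / average price."""
    return {
        month: spending[month] / average_price[month]
        for month in spending
        if average_price.get(month, 0) > 0  # skip months without a price
    }


def miles_per_gallon(gallons, miles_per_month):
    """Estimate miles/gallon, assuming the same mileage every month."""
    return {
        month: miles_per_month / used
        for month, used in gallons.items()
        if used > 0
    }


# Made-up monthly figures keyed by (year, month).
spending = {(2011, 8): 140.0, (2011, 9): 120.0}
average_price = {(2011, 8): 4.0, (2011, 9): 4.0}

gallons = monthly_gallons(spending, average_price)
efficiency = miles_per_gallon(gallons, miles_per_month=1094.5)

for month in gallons:
    print(month, round(gallons[month], 1), round(efficiency[month], 1))
```

Because the miles are spread evenly, any month-to-month change in the estimated efficiency comes entirely from the change in gallons bought.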

image

So, even though my car has been getting older, my miles/gallon efficiency has been going up. This is definitely a result of the heavier use of the motorcycle over the last two years. Also notice how the average price of gasoline has been going up in Seattle.

 

Lessons

Here are a few things I’ve learnt through this exercise…

  • Excel is a great tool for playing with small amounts of data, even though some necessary features (for this scenario) were difficult to discover.
  • The discovery of data that I didn’t have seemed to be the most difficult part.
  • Filtering and massaging the data took most of the time.
  • I felt that reporting and making sense of the data could be automated.
  • Visualizing the numbers makes all the difference in the world :-)

* Perhaps the most difficult part of the entire process was my inability to copy-paste only the results of the grouping. Finally I discovered the “Go to special…” feature in Excel that allows one to copy only the visible parts of a selection.

** I can probably figure out the miles per month if I first calculate the miles/gallon efficiency of my car and motorcycle.

Steve Jobs, 1955 - 2011
6 Oct 2011, Updated: 6 Oct 2011, Categories: Technology

To me, Steve represents the perfect example of a visionary, a creator, a leader, an artist, a world changer. In my eyes, the world is a better, more beautiful place because of Steve.

Steve Jobs

Source: apple.com, Oct 5, 2011
