Yahoo! is mulling whether to spin out its Hadoop-focused engineering group as a standalone company. The new company would compete directly with Cloudera.
More commercial competition would address the concerns raised in GigaOm’s article on why Hadoop innovation needs to pick up.
A Yahoo! spin-off would also compete with IBM, EMC, and business intelligence providers such as Teradata, all of which are rapidly moving into the unstructured and semi-structured data space.
The trend of moving IT infrastructure from private data centers to public or hybrid cloud infrastructure appears inevitable. The media has discussed cloud trends for years and now other terms such as big data are taking cloud computing’s place as “the new new thing” in business IT. But it’s possible big data will actually reverse some of the movement towards cloud computing.
To see why big data and the cloud are not entirely compatible, we have to look at technology limitations. Daily transfers of large files in the terabyte, petabyte, and beyond range remain slow and expensive. Bandwidth will continually increase, but so will the amount of data generated and saved, especially once sensor networks come into widespread use.
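To make the bandwidth constraint concrete, here is a back-of-the-envelope calculation. The link speed and data sizes below are illustrative assumptions, not figures from any of the linked articles:

```python
# Back-of-the-envelope: moving big data over the network is slow.
# A 1 Gbps dedicated link is an assumed, optimistic baseline.

link_bps = 10**9  # 1 Gbps

def transfer_hours(num_bytes, bps=link_bps):
    """Hours needed to move num_bytes over a link of bps bits/second."""
    return num_bytes * 8 / bps / 3600

terabyte_hours = transfer_hours(10**12)       # ~2.2 hours per terabyte
petabyte_days = transfer_hours(10**15) / 24   # ~93 days per petabyte

print(f"1 TB: {terabyte_hours:.1f} hours, 1 PB: {petabyte_days:.0f} days")
```

At those rates a daily petabyte-scale transfer to a remote cloud simply doesn't fit in a day, which is why the heaviest processing tends to stay near where the data is generated.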
So a “big data + cloud computing” model for the next 5-10 years will likely involve the following pieces:
Multiple Local Data Centers
Centralized Cloud Operations
The complexity is in determining which operations should be performed locally to minimize data transfer versus sending everything to the centralized infrastructure. But as long as there’s business value that can be gained from analyzing very large data sets to improve operations, the complexity will be overcome.
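One way to read the "local versus centralized" split is that each local data center reduces its raw data to compact summaries, and only those summaries cross the network to the centralized cloud. A minimal sketch of that pattern follows; the record format and the count-based aggregation are hypothetical, chosen only to illustrate the idea:

```python
from collections import Counter

def local_summarize(raw_events):
    """Runs in each local data center: collapse raw events into
    per-key counts so only a small summary crosses the network."""
    return Counter(event["key"] for event in raw_events)

def central_merge(summaries):
    """Runs in the centralized cloud: merge the per-site summaries."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total

# Each site reduces its raw event stream to a handful of counts.
site_a = local_summarize([{"key": "error"}, {"key": "ok"}, {"key": "ok"}])
site_b = local_summarize([{"key": "error"}, {"key": "error"}])
print(central_merge([site_a, site_b]))  # Counter({'error': 3, 'ok': 2})
```

The design choice is the same one the paragraph describes: anything that shrinks the data (filtering, aggregation) runs locally, while anything that needs the global view runs centrally.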
Further Resource: Big Data Is On A Collision Course With the Cloud (GigaOm)
High-end, niche technology skill sets are expensive, and that scarcity is part of what contributes to Hadoop’s slow adoption rate, reports GigaOm.
Like all useful technologies, the Hadoop hiring problem will get better as the tools mature and move into mainstream skill sets. Pig helps, as does Amazon’s Elastic MapReduce, by introducing (non-Lisp) programmers to the MapReduce paradigm.
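The MapReduce paradigm itself is simple enough to sketch in a few lines of plain Python. This is a toy, single-machine illustration of the map and reduce phases, not Hadoop or Elastic MapReduce code:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    """Shuffle + reduce: group pairs by word and sum the counts."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data big deal", "big data"]
print(reduce_phase(map_phase(docs)))  # {'big': 3, 'data': 2, 'deal': 1}
```

What Hadoop adds on top of this tiny pattern is the hard part: distributing the map and reduce tasks across machines, moving data between them, and recovering from failures.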
The bigger question raised in GigaOm’s article on whether Groupon can compete with Facebook isn’t the money being thrown at top-quality big data and Hadoop developers; it’s how much interesting data Groupon has to analyze. While every organization is collecting dramatically more data today, it’s questionable whether Groupon is as mature as Facebook in its collection, analysis, and visualization efforts. Some engineers would be excited to join a fledgling data analysis effort at Groupon, but Facebook has both the data collection maturity and the massive scale of data to excite any data scientist.
The question of data maturity will come up for mainstream companies as they try to get a handle on their big data challenges. It will be difficult to lure top data scientists away from organizations diligent in collecting and analyzing their data toward immature companies that “just don’t get it” when it comes to how data can inform business strategy.
Link: Can Groupon Compete With Facebook in Hadoop Hiring? (GigaOm)
If you’re interested in building web applications for mobile browsers and want to avoid worrying about the differences among Android, Apple, BlackBerry, and other platforms’ web browsers, check out this new book, jQuery Mobile First Look by Giulio Bai. I just finished a technical review of the book for Packt Publishing, and it’s a great introduction to the platform.
After reading the book you’ll have enough information to create mobile web applications like KhanApp, which Khan Academy just released. KhanApp was built with jQuery Mobile and is a great example of what the platform can do.
Mobile web applications created with jQuery Mobile can even be wrapped with PhoneGap to create platform-specific apps!
Atlanta-based ipTrust is using Cassandra to store trillions of log files and Hadoop to analyze those files to identify botnets. After identification, the botnets can be targeted for cleansing.
ipTrust’s goal is to build more intelligent firewalls and protection software by creating a reputation-based system for IP addresses. The relationship graph between IP addresses is analyzed much as user connections are analyzed on social networks. It’s a fascinating idea, similar in spirit to both social network analysis and Google’s PageRank algorithm, which use relationship data to derive contextual information.
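The PageRank analogy can be sketched in a few lines: start from known-bad addresses and iteratively propagate "badness" over the connection graph, so an address that talks mostly to bots inherits a poor reputation. This is a toy illustration of the idea, with made-up addresses and scoring, not ipTrust’s actual algorithm:

```python
def propagate_reputation(graph, seeds, iterations=10, damping=0.85):
    """Spread badness scores over an IP connection graph, PageRank-style.
    graph: {ip: [ips it communicates with]}; seeds: known-bad IPs."""
    score = {ip: (1.0 if ip in seeds else 0.0) for ip in graph}
    for _ in range(iterations):
        new = {}
        for ip, neighbors in graph.items():
            if ip in seeds:
                new[ip] = 1.0  # known bots stay maximally bad
            elif neighbors:
                # Inherit a damped average of the neighbors' badness.
                new[ip] = damping * sum(score[n] for n in neighbors) / len(neighbors)
            else:
                new[ip] = 0.0
        score = new
    return score

graph = {
    "10.0.0.1": ["10.0.0.2"],              # talks only to a known bot
    "10.0.0.2": ["10.0.0.1", "10.0.0.3"],  # known bot (seed)
    "10.0.0.3": ["10.0.0.4"],              # talks only to a clean host
    "10.0.0.4": ["10.0.0.3"],
}
scores = propagate_reputation(graph, seeds={"10.0.0.2"})
print([ip for ip, s in scores.items() if s > 0.5])
```

The host that communicates exclusively with the bot ends up flagged along with the bot itself, while the two clean hosts keep a zero score.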
Link: Hadoop Kills Zombies, Too! (GigaOm)
Enterprises need to differentiate between little and big data. “Big data” remains a moving target without an exact definition, but it can be distinguished from the “little data” that traditional tools already handle well.
Traditional data warehousing and business intelligence approaches are not enough for leading-edge companies to maintain their positions. Companies that are not yet feeling the data deluge, but will as big data goes mainstream, need to ensure their “little data” needs are covered by strong business intelligence practices. As the data deluge grows over the next several years, companies that have thought through the implications and can act on them will generate more business value from their data practices than their competitors.
Related link: Distinguishing Between Little and Big Data
The data revolution will affect every aspect of society including what we normally associate with the past, not our future: museums. O'Reilly’s Alex Howard has a quick summary of a presentation from Ignite Smithsonian on April 12th. The presentation focused on how big data can be used for research. Although big data is already a challenge for academics in physics and computer science, other disciplines will be affected as society goes through the data revolution.
Link: Ignite Smithsonian Summary (O'Reilly)
Link: Ignite Smithsonian Wiki and Presentations (Smithsonian)
Cloudera has a three-part walkthrough showing how to compute a simple moving average in Excel, R, and Hadoop.
Simple Moving Average Part 1: Excel
Simple Moving Average Part 2: R
Simple Moving Average Part 3: Hadoop
Anyone interested in R and Hadoop is likely already familiar with basic calculations in Excel, so this is a great tutorial for getting your feet wet with more advanced data analysis tools.
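For reference, the same calculation is easy to express in plain Python as well. This is a standalone sketch of a simple moving average, independent of Cloudera’s tutorials:

```python
def simple_moving_average(values, window):
    """Average each consecutive `window`-sized slice of values."""
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]

prices = [10, 11, 12, 13, 14, 15]
print(simple_moving_average(prices, 3))  # [11.0, 12.0, 13.0, 14.0]
```

The Hadoop version of the same idea is more involved mainly because the windowing has to be expressed in terms of keys, sort order, and reducers rather than a simple loop.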
Hadoop’s versatility allows any industry with unstructured data to take advantage of distributed MapReduce analysis. The linked GigaOm article shows how the news industry is using Hadoop for data journalism.
Link: Hadoop: From Boardrooms to Newsrooms (GigaOm)
The folks at ReadWriteWeb collected three great presentations covering R.
Link: 3 Presentations on R (ReadWriteWeb)