Yahoo! is mulling whether to spin out its Hadoop-focused engineering group as a standalone company. The new company would compete directly with Cloudera.
More commercial competition would address the concerns raised in GigaOm’s article on why Hadoop innovation needs to pick up.
A Yahoo! spin-off would also compete with IBM, EMC, and business intelligence providers such as Teradata, all of which are rapidly moving into the unstructured and semi-structured data space.
The trend of moving IT infrastructure from private data centers to public or hybrid cloud infrastructure appears inevitable. The media has discussed cloud trends for years and now other terms such as big data are taking cloud computing’s place as “the new new thing” in business IT. But it’s possible big data will actually reverse some of the movement towards cloud computing.
To see why big data and the cloud are not entirely compatible, we have to look at technology limitations. Daily transfers of large files in the terabyte, petabyte, and beyond range remain slow and expensive. Bandwidth will continually increase, but so will the amount of data generated and saved, especially once sensor networks come into widespread use.
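To make the bandwidth constraint concrete, here is a back-of-the-envelope calculation. The link speed and data sizes below are illustrative assumptions, not figures from any of the linked articles:

```python
# Back-of-the-envelope: moving big data over the network is slow.
# A 1 Gbps dedicated link is an assumed, optimistic baseline.

link_bps = 10**9  # 1 Gbps

def transfer_hours(num_bytes, bps=link_bps):
    """Hours needed to move num_bytes over a link of bps bits/second."""
    return num_bytes * 8 / bps / 3600

terabyte_hours = transfer_hours(10**12)       # ~2.2 hours per terabyte
petabyte_days = transfer_hours(10**15) / 24   # ~93 days per petabyte

print(f"1 TB: {terabyte_hours:.1f} hours, 1 PB: {petabyte_days:.0f} days")
```

At those rates a daily petabyte-scale transfer to a remote cloud simply doesn't fit in a day, which is why the heaviest processing tends to stay near where the data is generated.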
So a “big data + cloud computing” model for the next 5-10 years will likely involve the following pieces:
Multiple Local Data Centers
Centralized Cloud Operations
The complexity is in determining which operations should be performed locally to minimize data transfer versus sending everything to the centralized infrastructure. But as long as there’s business value that can be gained from analyzing very large data sets to improve operations, the complexity will be overcome.
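One way to read the "local versus centralized" split is that each local data center reduces its raw data to compact summaries, and only those summaries cross the network to the centralized cloud. A minimal sketch of that pattern follows; the record format and the count-based aggregation are hypothetical, chosen only to illustrate the idea:

```python
from collections import Counter

def local_summarize(raw_events):
    """Runs in each local data center: collapse raw events into
    per-key counts so only a small summary crosses the network."""
    return Counter(event["key"] for event in raw_events)

def central_merge(summaries):
    """Runs in the centralized cloud: merge the per-site summaries."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total

# Each site reduces its raw event stream to a handful of counts.
site_a = local_summarize([{"key": "error"}, {"key": "ok"}, {"key": "ok"}])
site_b = local_summarize([{"key": "error"}, {"key": "error"}])
print(central_merge([site_a, site_b]))  # Counter({'error': 3, 'ok': 2})
```

The design choice is the same one the paragraph describes: anything that shrinks the data (filtering, aggregation) runs locally, while anything that needs the global view runs centrally.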
Further Resource: Big Data Is On A Collision Course With the Cloud (GigaOm)
High-end, niche technology skill sets are expensive, and that scarcity is part of what contributes to Hadoop’s slow adoption rate, reports GigaOm.
Like all useful technologies, the Hadoop hiring problem will get better as the tools mature and move into mainstream skill sets. Pig helps, as does Amazon’s Elastic MapReduce, by introducing (non-Lisp) programmers to the MapReduce paradigm.
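The MapReduce paradigm itself is simple enough to sketch in a few lines of plain Python. This is a toy, single-machine illustration of the map and reduce phases, not Hadoop or Elastic MapReduce code:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    """Shuffle + reduce: group pairs by word and sum the counts."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data big deal", "big data"]
print(reduce_phase(map_phase(docs)))  # {'big': 3, 'data': 2, 'deal': 1}
```

What Hadoop adds on top of this tiny pattern is the hard part: distributing the map and reduce tasks across machines, moving data between them, and recovering from failures.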
The bigger question raised in GigaOm’s article on whether Groupon can compete with Facebook isn’t the money being thrown at top-quality big data and Hadoop developers; it’s how much interesting data Groupon has to analyze. While every organization is collecting dramatically more data today, it’s questionable whether Groupon is as mature as Facebook in its collection, analysis, and visualization efforts. Some engineers would be excited to join a fledgling data analysis effort at Groupon, but Facebook has both the data collection maturity and the massive scale of data to excite any data scientist.
The question of data maturity will come up for mainstream companies as they try to get a handle on their big data challenges. It will be difficult to lure top data scientists away from organizations diligent in collecting and analyzing their data toward immature companies that “just don’t get it” when it comes to how data can inform business strategy.
Link: Can Groupon Compete With Facebook in Hadoop Hiring? (GigaOm)
If you’re interested in building web applications for mobile browsers and want to avoid worrying about the differences among Android, Apple, BlackBerry, and other platforms’ web browsers, check out this new book, jQuery Mobile First Look by Giulio Bai. I just finished a technical review of the book for Packt Publishing, and it’s a great introduction to the platform.
After reading the book you’ll have enough information to create mobile web applications like KhanApp, which Khan Academy just released. KhanApp was built with jQuery Mobile and is a great example of what the platform can do.
Mobile web applications created with jQuery Mobile can even be wrapped with PhoneGap to create platform-specific apps!
Atlanta-based ipTrust is using Cassandra to store trillions of log files and Hadoop to analyze those files to identify botnets. After identification, the botnets can be targeted for cleansing.
ipTrust’s goal is to build more intelligent firewalls and protection software by creating a reputation-based system for IP addresses. The relationship graph between IP addresses is analyzed much as user connections are analyzed on social networks. It’s a fascinating idea, similar in spirit to both social network analysis and Google’s PageRank algorithm, which use relationship data to derive contextual information.
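The PageRank analogy can be sketched in a few lines: start from known-bad addresses and iteratively propagate "badness" over the connection graph, so an address that talks mostly to bots inherits a poor reputation. This is a toy illustration of the idea, with made-up addresses and scoring, not ipTrust’s actual algorithm:

```python
def propagate_reputation(graph, seeds, iterations=10, damping=0.85):
    """Spread badness scores over an IP connection graph, PageRank-style.
    graph: {ip: [ips it communicates with]}; seeds: known-bad IPs."""
    score = {ip: (1.0 if ip in seeds else 0.0) for ip in graph}
    for _ in range(iterations):
        new = {}
        for ip, neighbors in graph.items():
            if ip in seeds:
                new[ip] = 1.0  # known bots stay maximally bad
            elif neighbors:
                # Inherit a damped average of the neighbors' badness.
                new[ip] = damping * sum(score[n] for n in neighbors) / len(neighbors)
            else:
                new[ip] = 0.0
        score = new
    return score

graph = {
    "10.0.0.1": ["10.0.0.2"],              # talks only to a known bot
    "10.0.0.2": ["10.0.0.1", "10.0.0.3"],  # known bot (seed)
    "10.0.0.3": ["10.0.0.4"],              # talks only to a clean host
    "10.0.0.4": ["10.0.0.3"],
}
scores = propagate_reputation(graph, seeds={"10.0.0.2"})
print([ip for ip, s in scores.items() if s > 0.5])
```

The host that communicates exclusively with the bot ends up flagged along with the bot itself, while the two clean hosts keep a zero score.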
Link: Hadoop Kills Zombies, Too! (GigaOm)
Enterprises need to differentiate between little and big data. “Big data” remains a moving target without an exact definition, but it can be distinguished from the “little data” that traditional tools already handle well.
Traditional data warehousing and business intelligence approaches are not enough for leading-edge companies to maintain their positions. Companies that are not yet feeling the data deluge, but will as big data goes mainstream, need to ensure their “little data” needs are covered by strong business intelligence practices. As the data deluge grows over the next several years, companies that have thought through the implications and can act on them will generate more business value from their data practices than their competitors.
Related link: Distinguishing Between Little and Big Data
The data revolution will affect every aspect of society including what we normally associate with the past, not our future: museums. O'Reilly’s Alex Howard has a quick summary of a presentation from Ignite Smithsonian on April 12th. The presentation focused on how big data can be used for research. Although big data is already a challenge for academics in physics and computer science, other disciplines will be affected as society goes through the data revolution.
Link: Ignite Smithsonian Summary (O'Reilly)
Link: Ignite Smithsonian Wiki and Presentations (Smithsonian)
Cloudera has a three-part walkthrough showing how to compute a simple moving average in Excel, R, and Hadoop.
Simple Moving Average Part 1: Excel
Simple Moving Average Part 2: R
Simple Moving Average Part 3: Hadoop
Anyone interested in R and Hadoop is likely already familiar with basic calculations in Excel, so this is a great tutorial for getting your feet wet with more advanced data analysis tools.
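For reference, the same calculation is easy to express in plain Python as well. This is a standalone sketch of a simple moving average, independent of Cloudera’s tutorials:

```python
def simple_moving_average(values, window):
    """Average each consecutive `window`-sized slice of values."""
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]

prices = [10, 11, 12, 13, 14, 15]
print(simple_moving_average(prices, 3))  # [11.0, 12.0, 13.0, 14.0]
```

The Hadoop version of the same idea is more involved mainly because the windowing has to be expressed in terms of keys, sort order, and reducers rather than a simple loop.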
Hadoop’s versatility allows any industry with unstructured data to take advantage of distributed MapReduce analysis. The linked GigaOm article shows how the news industry is using Hadoop for data journalism.
Link: Hadoop: From Boardrooms to Newsrooms (GigaOm)
The folks at ReadWriteWeb collected three great presentations covering R.
Link: 3 Presentations on R (ReadWriteWeb)