This Information Management article reports that IBM is using open source technologies, including Hadoop, in its Watson system that will compete on Jeopardy in February.
Article: How IBM’s Watson Churns Analytics
Yury Izrailevsky, the Director of Cloud and Systems Infrastructure at Netflix, wrote a great post on how NoSQL systems are in use at the company. The post discusses the mindset adjustment required when moving away from traditional ACID database systems to systems that satisfy only two of the three CAP properties. Most big corporations would have a big job retraining their in-house IT developers to understand how, when, and why to decide on the trade-offs up front with NoSQL systems.
Yury also describes how the firm tries to use the right tool for the job instead of shoehorning existing “approved” enterprise IT tools into systems they were not designed to accommodate. Again, there’s a big gap between how the best technology companies like Netflix do their IT work and how most mainstream companies’ IT shops run.
Article: NoSQL at Netflix
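The mindset shift Yury describes can be felt in a few lines of code. This is a toy sketch, not anything Netflix runs: a key-value store with two replicas and asynchronous replication, so a read can briefly return stale data. That is the availability-over-consistency trade-off a developer must accept up front.

```python
import threading

class EventuallyConsistentStore:
    """Toy store with a primary and a secondary replica.

    Writes hit the primary immediately; a background timer copies
    them to the secondary after a delay, so reads served from the
    secondary can briefly return stale (or missing) data.
    """

    def __init__(self, replication_delay=0.05):
        self.primary = {}
        self.secondary = {}
        self.delay = replication_delay

    def write(self, key, value):
        self.primary[key] = value
        # Replicate asynchronously, simulating network lag.
        timer = threading.Timer(self.delay, self._replicate, args=(key, value))
        timer.start()
        return timer

    def _replicate(self, key, value):
        self.secondary[key] = value

    def read(self, key):
        # A client routed to the secondary may see an older view.
        return self.secondary.get(key)

store = EventuallyConsistentStore()
pending = store.write("user:42", "new-email@example.com")
print(store.read("user:42"))   # likely None: replication hasn't finished
pending.join()                 # wait for replication to complete
print(store.read("user:42"))   # now returns the new value
```

An ACID-trained developer expects the first read to return the written value; here the application itself has to tolerate the window of inconsistency.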
Here is a diagram of Google’s application programming interfaces (APIs) in a format that would be familiar to anyone who’s taken high school chemistry: a periodic table. Who knew Blogger had an API?
LinkedIn has one of the best teams of big data experts in the world. In this O'Reilly video, Pete Skomoroch from LinkedIn explains what skills are necessary to excel in the data scientist role.
Video: 3 Skills A Data Scientist Needs (O'Reilly)
Google, Facebook, LinkedIn, and Twitter have software engineers using big data analysis tools, but what other companies need these skill sets at the beginning of 2011?
GitHub: Software Engineer, Big Data
Groupon: Software Engineer, Big Data Infrastructure
Amazon: Data Engineer (Amazon Web Services)
Massive Data News also has a list of jobs that require big data skills: http://www.massivedatanews.com/jobs
Most of these jobs are with startup-type companies that track more than the average amount of information about their customers. Over the next two to five years we’ll see mainstream companies follow suit, as case studies and magazines show the power of extracting information from data sets that were previously useless because the analysis took too long to complete.
I never understood the W3C’s push for the semantic web. Yes, if done right, with accurate markup and advanced parsers in browsers and applications, it could provide much of the “intelligence” currently lacking when searching the web for more than just keywords. But it seemed like too much developer work for too little benefit. Also, how could anyone ensure the RDF semantic markup was correct?
Judging by the semantic web’s failure to take off, I’m not alone in my concerns. Now O'Reilly has a different take: linked data will provide the benefits that the semantic web promised but never delivered.
Article: Linked Data Will Succeed Where the Semantic Web Failed
ReadWriteWeb has a summary of the growth of published, connected, structured data, known as “linked data.” There are several great diagrams that show how major data sources continue to proliferate and integrate on the web.
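The core idea behind linked data is simple enough to sketch: facts are published as subject-predicate-object triples, and datasets interlink by reusing each other's identifiers. The URIs below are hypothetical stand-ins, not real published data:

```python
# "Linked data" as subject-predicate-object triples. The URIs are
# hypothetical; real linked data uses dereferenceable HTTP URIs so
# that independently published datasets can point at each other.

source_a = [  # e.g. a geography dataset
    ("http://example.org/city/london", "population", 7800000),
    ("http://example.org/city/london", "country", "http://example.org/country/uk"),
]
source_b = [  # e.g. a separate reference dataset
    ("http://example.org/country/uk", "label", "United Kingdom"),
]

triples = source_a + source_b

def objects(subject, predicate):
    """All object values for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# Follow a link across datasets: city -> country URI -> country label.
country_uri = objects("http://example.org/city/london", "country")[0]
print(objects(country_uri, "label"))  # ['United Kingdom']
```

Because both sources name the UK with the same URI, a query can hop from one dataset into the other without any prior coordination beyond the shared identifier.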
The military is producing and collecting very large amounts of data from unmanned aerial vehicles, spy satellites, communications channels, internal applications used by intelligence analysts, and reports from troops in the field. This article by the NY Times describes the result of attempting to handle all of that data: overload.
The military needs new visualization techniques and analysis algorithms to assist human operators with understanding and acting on data in a timely manner.
Article: Military Data Overload (NY Times)
This article is a great introduction to both NoSQL and MapReduce. The author’s goal is to explain the basic concepts, show code, and examine how MapReduce can be useful.
Article: MapReduce from the basics to the actually useful (in under 30 minutes)
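For readers who want the flavor before the full article: the classic MapReduce "hello world" is word counting, and its three phases fit in a few lines of plain Python. This sketch runs in-process; a real framework like Hadoop distributes the same three steps across machines:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine each key's values; here, sum the counts."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data is big", "data takes work"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'takes': 1, 'work': 1}
```

The power of the model is that the map and reduce functions are independent per key, so the framework can run them in parallel over arbitrarily large inputs.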
Enrico Bertini provides a list of 7 papers that influenced the data visualization field. Enrico admits there are some newer papers that are just as influential but he chose to only include older papers that set the foundation for the discipline.
Article: 7 Classic Visualization Papers
This is a great article that answers the question “How exactly can you derive meaningful information from big data using existing technologies?” BackType has three engineers and currently uses Hadoop and Cassandra, among other home-grown tools, to analyze Twitter, Facebook, blogs, and other user-generated content sources and provide useful information to the companies that use its services.
Of interest in their architecture is the distinction between the “speed” layer and the “batch” layer. One of the main complaints about Hadoop is that it is a batch system, not meant for real-time use, but the volume of data being analyzed is generally too great for real-time systems. BackType solves this problem by duplicating the work: in the speed layer, data is available for immediate use but stored transiently. Since the batch layer will eventually catch up to where the speed layer is at any given moment, the speed layer throws away older data in favor of the newest data.
Article: Secrets of BackType’s Data Engineers
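The speed/batch split described above can be sketched in a few lines. This is a toy illustration of the idea, not BackType's actual code: the batch layer is a slow, complete recomputation up to some horizon; the speed layer counts only events past that horizon; a query merges the two.

```python
from collections import Counter

# Hypothetical event log: (timestamp, item) pairs.
events = [(1, "a"), (2, "b"), (3, "a"), (4, "c"), (5, "a")]

def run_batch(events, up_to):
    """Batch layer: slow but complete recomputation over all data so far."""
    return Counter(item for ts, item in events if ts <= up_to), up_to

def speed_view(events, since):
    """Speed layer: counts only recent events, held transiently."""
    return Counter(item for ts, item in events if ts > since)

def query(batch_view, batch_horizon, events):
    """Merge the precomputed batch view with the speed layer's counts."""
    return batch_view + speed_view(events, batch_horizon)

# The last batch run processed events through timestamp 3; the speed
# layer covers timestamps 4-5. Once the next batch run advances past
# 5, the speed layer can discard those events.
batch_view, horizon = run_batch(events, up_to=3)
print(query(batch_view, horizon, events))  # Counter({'a': 3, 'b': 1, 'c': 1})
```

The key property is that the speed layer never needs to hold much data: everything older than the batch horizon is already covered by the batch view, so it can be thrown away.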
I spent a couple of hours today with the Dojo Toolkit trying to figure out how to get the value attribute (instead of the text value) from the dijit.form.ComboBox widget’s option elements. A bunch of Google searches turned up other people asking the same question, but no real answers.
So how do you get the value attribute from a Dojo ComboBox widget’s selected option element?
You can’t, except through some JavaScript manipulation. The widget wasn’t designed to use value attributes. See the official Dojo documentation:
note: ComboBox only has a single value that matches what is displayed while FilteringSelect incorporates a hidden value that corresponds to the displayed value. (source: http://dojotoolkit.org/reference-guide/dijit/form/ComboBox.html#id3)
So use FilteringSelect instead. It will return the value attribute as you expected.
The NoSQL Tapes site is a compilation of videos and case studies with influential people in the NoSQL field. The site just launched but already has several videos with many more in the “coming soon” list.
Website: The NoSQL Tapes
Google is granting limited access to its BigQuery functionality in Google Apps. BigQuery allows people to use the Spreadsheet application to run SQL-like queries against data sets using Google’s infrastructure. If BigQuery is opened to the general public and compelling use cases emerge, could this become the “killer app” that Google Spreadsheet has over Microsoft Excel?
Twitter currently handles about twelve terabytes of new data daily. A couple of years ago, when Twitter was mostly a Ruby on Rails and MySQL application, infrastructure stability was a major issue for the company. Those difficulties prompted Twitter to move to a NoSQL solution. Considering the tremendous growth they’ve had since then and the lack of serious downtime, the switch has been very successful. ReadWriteWeb has an overview of the new NoSQL components in Twitter’s technical architecture.
Article: How Twitter Uses NoSQL
O'Reilly gives an overview of four websites that provide raw data on Web traffic and site popularity. Aspiring data journalists can analyze and extract information from these sources to find interesting patterns and combine them with other sources to create original reports.
Google Refine 2.0 was released late last year as free software for cleaning up messy data sets. Refine is a powerful tool for working with unstructured data, extracting value from it, and linking it to other data sets. This tutorial is a great starting point beyond Google’s own documentation for how to get started.
In addition to Information Management’s 2011 prediction that big data will move further into the mainstream, one of ReadWriteWeb’s columnists posted a similar prediction. Audrey Watters describes “data scientist” as the hot new occupation, alongside growth in the data storage, processing, and analytics sectors.
Article: ReadWriteWeb: 2011 Predictions
Information Management magazine ranks big data as one of its six big IT trends for 2011. They expect big data to move further into the mainstream as companies throw away less of the data they produce, in anticipation of extracting value from it in the future.
Article: 6 Predictions For The Year Ahead