The quote in this post’s title is from Gnip’s CEO on their challenges of handling a sustained 35MB/sec stream of constant updates from Twitter. Gnip is the only partner that receives the entire Twitter data stream for analysis.
“Everyone is building custom stuff right now” echoes what I’m seeing at companies handling big data like Clearspring. Although NoSQL data stores and tools like Hadoop are gaining mainstream acceptance, the companies really handling big data don’t have tools they can use out of the box to perform analysis. The big data trends are just beginning to take shape and no one is yet offering the right solutions to handle them.
Link: Gnip CEO on the Challenges of Handling the Real-time, Big Data Firehose (ReadWriteWeb)
A few interesting articles came out today from mainstream sources such as The Economist, Forbes, and MIT’s Technology Review.
IBM is ratcheting up their big data PR push with a splashy $100 million investment in the field. While that $100 million will cover both basic R&D and system development, it remains to be seen whether IBM can successfully create enterprise-class products that actually add business value instead of useless features that sound good to technology executives, a la Rational Suite, WebSphere, and RAD. Surprisingly though, the time frame for practical Watson big data applications such as a medical assistant is 5-8 years, indicating Watson technology is not ready for prime time just yet.
Link: IBM’s Watson Now A Second Year Med Student (Forbes)
The Economist has an interesting fact-laden article on the data revolution. For example, 4 billion people have mobile phones, and 12% of them (480 million) are smartphone users. Much of The Economist’s article is based on McKinsey’s recent big data report.
Link: Building with Big Data (The Economist)
MIT’s Technology Review has an article on why big data needs a code of ethics, a topic few people consider because they do not understand the vast amount of data being collected about them.
Link: What Big Data Needs: A Code of Ethical Practices (Technology Review)
Monitis has a nice summary of Apache Cassandra up on their blog. It looks like they are doing a series of overviews on NoSQL solutions that will be worth checking out.
I hate corporate performance reviews. Trying to fit the work you performed over the last six months to a year into pre-defined generic boxes such as “flexibility”, “interpersonal skills”, “creativity”, and so on strikes me as really dull. I doubt I’m alone in this sentiment.
I created some consternation during my first performance assessment at Excella by not filling anything in on my performance self-assessment. All blanks in every field. It wasn’t that I didn’t care; spending 2-3 hours filling those generic buckets with the work I performed just struck me as a useless exercise. I’d prefer using those hours to learn more about Clojure or Hadoop.
Why go through performance reviews at all if they don’t provide value? Well, they are supposed to provide value through self-reflection. It’s just that most people (myself included) half-ass them and don’t perform the self-reflection part because it’s tedious and boring. But if the process is more enjoyable, maybe that will help people to think through their performance.
So here’s what I’m going to do in the future to make the process more enjoyable. I’m going to tell a story. Storytelling provides more value. Storytelling is more fun for both the writer and the reader. It's memorable. Give me a story over a bulleted list any day. And I need to be a better storyteller so I can get better at explaining the results of data analysis and visualizations.
Here’s an off-the-cuff attempt at a performance review story excerpt for the fall (without any sensitive client info).
I glanced at my watch. 2pm. Ready to start this client demo. I wasn’t happy with how the demo a week earlier turned out. Apparently I wasn’t clued in on some last minute changes the business expected. I got defensive, asking whether I should have followed the conflicting requirements document, the wireframes, or the numerous emails I received from our business analyst.
I made some changes in the time between that first demo and this one. This time my part of the system was working well, minus a few features I could explain away since we were more than a month out from delivery.
Fast forward an hour. It went great. Lots of congratulations on a job well done. It was nice to get some positive feedback directly from the client and people who’d be actually using the system. No defensiveness as I carefully took notes when the client asked for small changes to the system. What a difference a week and some self-reflection on how I should handle client feedback could make.
I don’t always handle my client interactions perfectly, but I’ve learned from a few failings so far this fall. This successful demo was the result of that learning process.
Definitely more fun writing that little excerpt than a bulleted list. Hopefully better for consumption as well. And while not everything will fit well within the context of a story, the outcome of self-reflection from writing the story will make the exercise worthwhile.
Someone I know asked me, “Is the ability to analyze large data sets driven by hardware or the availability of new software algorithms?”
Here’s my off-the-cuff answer.
It’s both hardware and software. Also, a third factor: we now have data sets that are large enough to warrant this type of analysis.
The algorithm part is driven by the maturity of Hadoop, an open-source distributed platform for running the map-reduce algorithm. The map part sends chunks of work out to a cluster of hundreds or even thousands of individual commodity machines. The reduce part collects those completed chunks back on a few machines that are responsible for rolling the results up into final answers.
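To make the map and reduce steps concrete, here’s a minimal single-machine sketch of the classic word-count example in plain Python. It only illustrates the shape of the idea; a real Hadoop job would typically be written against Hadoop’s Java API (or via Hadoop Streaming) and distributed across a cluster.

```python
from collections import defaultdict

documents = [
    "big data needs big tools",
    "hadoop distributes big jobs",
]

# Map: each chunk of input becomes a list of (key, value) pairs.
def map_words(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle: group the intermediate pairs by key. Hadoop does this for you
# across the cluster; here it's just a dictionary.
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_words(doc):
        grouped[word].append(count)

# Reduce: roll each group of values up into a final answer.
def reduce_counts(word, counts):
    return word, sum(counts)

word_counts = dict(reduce_counts(w, c) for w, c in grouped.items())
print(word_counts)  # e.g. {'big': 3, 'data': 1, ...}
```

The important point is that the map and reduce functions never need to know how many machines are involved; the framework handles splitting the input and shuffling the intermediate results.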
Hardware factors are driven by the decreasing cost of commodity servers. I can put together a server with several gigabytes of RAM, a terabyte hard drive, and a decent processor for $300. That’s unbelievable! The magic comes in when you purchase dozens of these machines and then have the software on top that allows you to distribute your data analysis job among them.
The third factor is the Internet. It’s the first step towards really, really large data sets (another factor will be sensor networks, which are not yet commonplace). We have petabytes of information on consumer trends, which can be big money to companies. I was just at a technology meetup on Wednesday night at Clearspring. Clearspring has a widget that goes on websites and allows people to share the page with people they know. 9 million websites use this little widget! It sends several terabytes of information back to Clearspring every single day! Their business is based on helping companies analyze consumer trends on the Internet, and they are doing really well (they just got $20 million from a venture capital firm, and since they were already cash flow positive they didn’t really need the money except to grow faster).
Data analysis is a growing business because the economics to it make sense. Companies gain business value when they understand more about their target customers and act on that resulting information. The software, hardware, and availability of data sets will only make this trend more powerful in the future.
McKinsey & Company just released their report entitled, “Big Data: The Next Frontier for Innovation, Competition, and Productivity.” I’ll have a summary of the most pertinent parts of the report, along with my own take, once I have a chance to read through it all.
Link: McKinsey on Big Data
NoSQL is about more than just key-value pairs. Some of the most interesting developments in non-relational data stores are occurring with graph databases such as Neo4j. In this linked article, Jim Webber discusses how graph databases are often the best way to store complex relationships between entities.
Article: Neo4j and Graph Databases Overview
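To give a flavor of the property-graph model these databases are built around (nodes, typed relationships, and properties on both), here’s a toy sketch in plain Python. It is purely illustrative and not Neo4j’s actual API; in a real graph database you would traverse these relationships natively instead of scanning a list.

```python
# Nodes carry properties; relationships are typed and can carry properties too.
nodes = {
    "alice": {"label": "Person", "name": "Alice"},
    "bob":   {"label": "Person", "name": "Bob"},
    "acme":  {"label": "Company", "name": "Acme Corp"},
}

# (from_node, relationship_type, to_node, properties)
relationships = [
    ("alice", "KNOWS",     "bob",  {"since": 2009}),
    ("alice", "WORKS_FOR", "acme", {"role": "engineer"}),
    ("bob",   "WORKS_FOR", "acme", {"role": "analyst"}),
]

def neighbors(node_id, rel_type):
    """Follow outgoing relationships of a given type from one node."""
    return [to for (frm, rel, to, _) in relationships
            if frm == node_id and rel == rel_type]

# Who does Alice know, and where do they work?
for person in neighbors("alice", "KNOWS"):
    companies = [nodes[c]["name"] for c in neighbors(person, "WORKS_FOR")]
    print(nodes[person]["name"], "works for", companies)
```

Modeling the same question in a relational database means join tables and multi-way joins; in a graph store the relationships are first-class, which is exactly the point Webber makes.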
For those in the DC area, Clearspring will be hosting the first Big Data DC meetup on Wednesday, May 11th. The group is focused on technical presentations on NoSQL, Hadoop, R, data analysis, and data visualization.
Link: Big Data DC First Meet Up (meetup.com)
Universities in Canada and elsewhere are rebelling against the high costs of proprietary, closed systems for publishing and retrieving scientific information. This development is a positive outgrowth of the reduction in government funding available to universities.
Open access to scientific research that is funded by taxpayer dollars is a critical component of advancing our data-driven society. Currently, large datasets and scientific research are not easily accessible outside of universities that pay hundreds of thousands of dollars in subscription fees a year. Not only does this limit the public’s ability to review scientific data themselves, it also makes it easier for fraudulent science to go undetected.
The trend towards open access of scientific research and results is a very positive step that will help transform higher education institutions into more transparent, accountable entities over the next several years.
It will be a great day for science when data produced by experiments are freely available on data websites such as InfoChimps.
NoSQL and big data often get lumped together because NoSQL solutions are designed to scale better than traditional relational database sharding strategies. However, NoSQL and big data are not synonymous.
For example, BigTable on Google App Engine is a NoSQL solution that values consistency and availability over partition tolerance. Simple CRUD applications on top of BigTable often do not contain large amounts of data. Instead, the scalability BigTable provides is about handling many read operations.
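As a rough illustration of that kind of small CRUD application, here’s what a simple entity looks like with App Engine’s classic Python datastore API, which sits on top of BigTable. The Bookmark entity is a made-up example, not anything from a real project.

```python
# Hypothetical Bookmark entity using App Engine's classic Python datastore
# API (google.appengine.ext.db), which stores entities in BigTable.
from google.appengine.ext import db

class Bookmark(db.Model):
    url = db.StringProperty(required=True)
    title = db.StringProperty()
    created = db.DateTimeProperty(auto_now_add=True)

# Create
bookmark = Bookmark(url="http://example.com", title="Example")
bookmark.put()

# Read: queries like this scale with read volume rather than data size
recent = Bookmark.all().order("-created").fetch(10)

# Update
bookmark.title = "A better title"
bookmark.put()

# Delete
bookmark.delete()
```

Nothing here is "big data"; the NoSQL model is doing work for scalability and schema flexibility, not for sheer volume.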
Another example is the new Couchbase platform for iOS. It appears Couchbase is targeting a database market traditionally held by SQLite. iOS devices are not appropriate platforms for storing large datasets. However, Couchbase’s document-oriented NoSQL approach could be perfect for developing dynamic applications that do not require strict relationship enforcement.
Further reading: Couchbase for iOS (GigaOm)