Makai's Blog

May 2011

"Is the ability to analyze large data sets driven by hardware or the availability of new software algorithms?"

Someone I know asked me, “Is the ability to analyze large data sets driven by hardware or the availability of new software algorithms?”

Here’s my off-the-cuff answer.

It’s both hardware and software. Also, a third factor: we now have data sets that are large enough to warrant this type of analysis.

The algorithm part is driven by the maturity of Hadoop, a distributed platform for running MapReduce jobs. The map step sends chunks of work out to a cluster of hundreds or even thousands of individual commodity machines. The reduce step gathers those completed chunks of work back onto a few machines that are responsible for rolling up the results into final answers.
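To make the map and reduce steps concrete, here’s a toy word count in Python that simulates both phases on a single machine. Hadoop distributes the same idea across a cluster; this is just a sketch of the shape of the computation.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Reduce: roll the mapped pairs up into final counts."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Pretend each string is a chunk sent to a different machine.
chunks = ["big data is big", "data drives decisions"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(mapped))  # {'big': 2, 'data': 2, ...}
```

In real Hadoop the mapped pairs are shuffled across the network so that all pairs for the same word land on the same reducer, but the map and reduce functions you write look much like these.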

Hardware factors are driven by the decreasing cost of commodity servers. I can put together a server with several gigabytes of RAM, a terabyte hard drive, and a decent processor for $300. That’s unbelievable! The magic comes in when you purchase dozens of these machines and then have the software on top that allows you to distribute your data analysis job among them.

The third factor is the Internet. It’s the first step towards really, really large data sets (another factor will be sensor networks, which are not yet commonplace). We have petabytes of information on consumer trends, which can be big money to companies. I was just at a technology meetup on Wednesday night at Clearspring. Clearspring has a widget that goes on websites and allows people to share the page with people they know. 9 million websites use this little widget! It sends several terabytes of information back to Clearspring every single day! Their business is based on helping companies analyze consumer trends on the Internet, and they are doing really well (they just got $20 million from a venture capital firm, and they were already cash flow positive, so they didn’t really need the money except to grow faster).

Data analysis is a growing business because the economics make sense. Companies gain business value when they understand more about their target customers and act on that information. The software, hardware, and availability of data sets will only make this trend more powerful in the future.

May 14, 2011 17 notes
#big data #mapreduce #Hadoop
McKinsey's Report on Big Data (mckinsey.com)

McKinsey & Company just released their report entitled “Big Data: The Next Frontier for Innovation, Competition, and Productivity.” I’ll have a summary of the most pertinent parts of the report, along with my own commentary, once I have a chance to read through it all.

Link: McKinsey on Big Data

May 13, 2011 2 notes
#big data #McKinsey
Great post on Neo4j and graph databases (jim.webber.name)

NoSQL is about more than just key-value pairs. Some of the most interesting developments in non-relational data stores are occurring with graph databases such as Neo4j. In this linked article, Jim Webber discusses how graph databases are often the best way to store complex relationships between entities.

Article: Neo4j and Graph Databases Overview

May 10, 2011 5 notes
#Neo4j #graph databases #NoSQL
Big Data DC meet up this Wednesday, May 11th (meetup.com)

For those in the DC area, Clearspring will be hosting the first Big Data DC meet up on Wednesday, May 11th. The group is focused on technical presentations of NoSQL, Hadoop, R, data analysis, and data visualization.

Link: Big Data DC First Meet Up (meetup.com)

May 6, 2011
#big data #DC
Opening the World's Scientific Data

Universities in Canada and elsewhere are rebelling against the high costs of proprietary, closed systems for publishing and retrieving scientific information. This development is a positive outgrowth of the reduction in government funding available to universities.

Open access to scientific research that is funded by taxpayer dollars is a critical component of advancing our data-driven society. Currently, large datasets and scientific research are not easily accessible outside of universities that pay hundreds of thousands of dollars a year in subscription fees. Not only does this limit the public’s ability to review scientific data themselves, it also allows more fraudulent science to go undetected.

The trend towards open access of scientific research and results is a very positive step that will help transform higher education institutions into more transparent, accountable entities in the next several years.

It will be a great day for science when data produced by experiments are freely available on data websites such as InfoChimps.

May 4, 2011
#big data #science
NoSQL Does Not Always Equal Big Data

NoSQL and big data often get lumped together because NoSQL solutions are designed to scale better than traditional relational database sharding strategies. However, NoSQL and big data are not synonymous.

For example, BigTable on Google App Engine is a NoSQL solution that values consistency and availability over partition tolerance. Simple CRUD applications on top of BigTable often do not contain large amounts of data. Instead, BigTable’s scalability is geared toward performing many read operations.

Another example is the new Couchbase platform for iOS. It appears Couchbase is targeting a database market traditionally held by SQLite. iOS devices are not the appropriate platforms for storing large datasets. However, the document-oriented NoSQL Couchbase could be perfect for developing dynamic applications that do not require strict relationship enforcement.

Further reading: Couchbase for iOS (GigaOm)

May 2, 2011 1 note
#big data #NoSQL #Google App Engine #BigTable #MongoDB

April 2011

Yahoo Spin Off to Compete With Cloudera?

Yahoo! is mulling whether to spin out its engineering group focused on Hadoop as its own firm. The new company would compete directly with Cloudera.

More commercial competition would address concerns from GigaOm’s article on why Hadoop innovation has to pick up.

The Yahoo! spin off would also compete with IBM, EMC, and business intelligence software providers such as Teradata that are rapidly moving into the unstructured and semi-structured data space.

Apr 27, 2011 10 notes
#Hadoop #Yahoo! #Cloudera
Could Big Data Reverse the Cloud Computing Trend?

The trend of moving IT infrastructure from private data centers to public or hybrid cloud infrastructure appears inevitable. The media has discussed cloud trends for years and now other terms such as big data are taking cloud computing’s place as “the new new thing” in business IT. But it’s possible big data will actually reverse some of the movement towards cloud computing.

To see why big data and the cloud are not entirely compatible, we have to look at technology limitations. Daily transfers of large files in the terabyte, petabyte, and beyond range remain slow and expensive. Bandwidth will continue to increase, but so will the amount of data generated and saved, especially after sensor networks come into widespread use.

So a “big data + cloud computing” model for the next 5-10 years will likely involve the following pieces:

Multiple Local Data Centers

  1. collect data
  2. store all collected data locally
  3. perform basic analysis
  4. compress data before transfer

Centralized Cloud Operations

  1. aggregate locally-processed data
  2. acquire data from a wide range of sources, both proprietary and public
  3. perform complex analysis on the entire data set

The complexity is in determining which operations should be performed locally to minimize data transfer versus sending everything to the centralized infrastructure. But as long as there’s business value that can be gained from analyzing very large data sets to improve operations, the complexity will be overcome.
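As a sketch of this division of labor, here’s a toy Python example (with made-up site data, not a real deployment) in which each local data center reduces its raw readings to a small summary and the centralized cloud only ever aggregates summaries:

```python
def process_locally(readings):
    # Each local data center summarizes its raw readings so that
    # only a tiny summary, not the raw data, crosses the network.
    return {"count": len(readings),
            "total": sum(readings),
            "max": max(readings)}

def aggregate_centrally(summaries):
    # The cloud combines per-site summaries into a global view
    # without ever seeing the underlying raw data.
    count = sum(s["count"] for s in summaries)
    total = sum(s["total"] for s in summaries)
    return {"count": count,
            "mean": total / count,
            "max": max(s["max"] for s in summaries)}

site_a = process_locally([1.0, 2.0, 3.0])
site_b = process_locally([4.0, 5.0])
print(aggregate_centrally([site_a, site_b]))
# {'count': 5, 'mean': 3.0, 'max': 5.0}
```

The design choice is exactly the trade-off described above: any statistic that can be computed from per-site summaries avoids the expensive raw-data transfer, while analyses that need the full data set still have to go to the centralized infrastructure.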

Further Resource: Big Data Is On A Collision Course With the Cloud (GigaOm)

Apr 22, 2011
#big data #cloud computing
Hiring Limits Hadoop Adoption

High-end, niche technology skill sets are expensive. That’s part of what contributes to Hadoop’s slow adoption rate, reports GigaOm.

Like all useful technologies, the Hadoop hiring problem will get better as the tools mature and move into mainstream skill sets. Pig helps, as does Amazon’s Elastic MapReduce by introducing (non-Lisp) programmers to the MapReduce paradigm.

A bigger question raised in GigaOm’s article on whether Groupon can compete with Facebook isn’t the money being thrown at top-quality big data and Hadoop developers. It’s how much interesting data Groupon has to analyze. While every organization is collecting a dramatically increasing amount of data today, it’s questionable whether Groupon is as mature in its collection, analysis, and visualization efforts as Facebook. While some engineers would be excited to join a fledgling data analysis effort at Groupon, Facebook has both the data collection maturity and the massive scale of data to excite any data scientist.

The question of data maturity will come up for mainstream companies as they try to get a handle on their big data challenges. It will be difficult to lure top data scientists away from organizations diligent in collecting and analyzing their data to immature companies that “just don’t get it” when it comes to how data can inform a business strategy.

Link: Can Groupon Compete With Facebook in Hadoop Hiring? (GigaOm)

Apr 21, 2011
#big data #Hadoop #recruiting #hiring #data scientists
jQuery Mobile First Look

If you’re interested in building web applications for mobile browsers and want to avoid worrying about differences in Android, Apple, BlackBerry, and other platforms’ web browsers, check out this new book, jQuery Mobile First Look by Giulio Bai. I just finished a technical review of the book for Packt Publishing and it’s a great introduction to the platform.

After reading the book you’ll have enough information to create mobile web applications like KhanApp, which Khan Academy just released. KhanApp was created with jQuery Mobile and it’s a great example of how the platform can be used.

The mobile web applications created with jQuery Mobile can even be wrapped with PhoneGap to create platform-specific apps!

Apr 20, 2011 1 note
#JQuery Mobile #mobile development #KhanApp #JQuery Mobile First Look
Using Hadoop to Identify Botnets

Atlanta-based ipTrust is using Cassandra to store trillions of log entries and Hadoop to analyze those entries to identify botnets. After identification, the botnets can be targeted for cleansing.

ipTrust’s goal is to build more intelligent firewalls and protection software by creating a reputation-based system for IP addresses. The relationship graph between IP addresses is analyzed much as users’ connections are analyzed on social networks. It’s a fascinating idea and appears similar to both social network analysis and Google’s PageRank algorithm in using relationship data for contextual information.
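For a flavor of that PageRank-style analysis, here’s a minimal power-iteration sketch in Python run over a hypothetical graph of IP addresses. This is an illustration of the general technique, not ipTrust’s actual algorithm.

```python
def pagerank(links, iters=50, d=0.85):
    """Tiny power-iteration PageRank over an adjacency dict."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - d) / len(nodes) for n in nodes}
        for n, outs in links.items():
            if outs:
                share = d * rank[n] / len(outs)
                for m in outs:
                    new[m] += share
            else:
                # Dangling node: spread its rank evenly.
                for m in nodes:
                    new[m] += d * rank[n] / len(nodes)
        rank = new
    return rank

# Hypothetical traffic graph between four IP addresses.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # "c" attracts the most links
```

In a reputation system the score flowing through the graph would represent trust (or suspicion) rather than link popularity, but the iterative propagation over relationships is the same idea.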

Link: Hadoop Kills Zombies, Too! (GigaOm)

Apr 20, 2011 1 note
#big data #Hadoop #Cassandra
Differentiating Between Little and Big Data

Enterprises need to differentiate between little and big data. “Big data” remains a moving target without an exact definition, but it has certain criteria:

  1. Collected over an extended period of time, often years and even decades
  2. A mix of unstructured, semi-structured, and structured data
  3. Large enough in size that it is difficult for traditional business intelligence platforms to analyze in real time
  4. Unknown data quality - often a mix of proprietary and public data sources

“Little data” is:

  1. Collected over a short duration of days, weeks, or months
  2. Mostly structured data
  3. Easily analyzed by traditional business intelligence platforms
  4. High data quality - usually proprietary data collected based on a well-defined business need

Traditional data warehousing and business intelligence approaches are not enough for leading-edge companies to maintain their positions. Companies that are not yet feeling the data deluge, but will be as big data goes mainstream, need to ensure their “little data” needs are covered by strong business intelligence practices. As the data deluge grows over the next several years, companies that have thought through the implications and can act on them will generate more business value than their competitors.

Related link: Distinguishing Between Little and Big Data 

Apr 17, 2011
#big data #little data
Museums, Education, and Big Data

The data revolution will affect every aspect of society including what we normally associate with the past, not our future: museums. O'Reilly’s Alex Howard has a quick summary of a presentation from Ignite Smithsonian on April 12th. The presentation focused on how big data can be used for research. Although big data is already a challenge for academics in physics and computer science, other disciplines will be affected as society goes through the data revolution.

Link: Ignite Smithsonian Summary (O'Reilly)

Link: Ignite Smithsonian Wiki and Presentations (Smithsonian)

Apr 15, 2011 1 note
#big data #museums #Ignite Smithsonian
Simple Moving Average in Excel, R, and Hadoop

Cloudera has a three part walk through for how to create a Simple Moving Average in Excel, R, and Hadoop. 

Simple Moving Average Part 1: Excel

Simple Moving Average Part 2: R

Simple Moving Average Part 3: Hadoop

Anyone interested in R and Hadoop should be familiar with executing basic calculations in Excel, so this is a great tutorial to get your feet wet with more advanced data analysis tools.
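As a point of reference, the same calculation is only a few lines of plain Python. This is my own sketch, not part of Cloudera’s tutorial:

```python
def simple_moving_average(values, window):
    """Average each sliding window of `window` consecutive points."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

prices = [10, 11, 12, 13, 14, 15]
print(simple_moving_average(prices, 3))  # [11.0, 12.0, 13.0, 14.0]
```

The Hadoop version in part 3 of the tutorial exists for when the series is far too large for one machine; the math itself never changes.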

Apr 12, 2011 2 notes
#Hadoop #Excel #R #Simple Moving Average
Hadoop Usage in the News Industry (gigaom.com)

Hadoop’s versatility allows all industries with unstructured data to take advantage of distributed MapReduce analysis capabilities. This linked article by GigaOm shows how the news industry is using Hadoop for data journalism.

Link: Hadoop: From Boardrooms to Newsrooms (GigaOm)

Apr 6, 2011
#Hadoop #big data #news industry
Three Great Presentations on R

The folks at ReadWriteWeb collected three great presentations on R covering:

  1. R for web development in 2011
  2. R packages the experts use, and
  3. R usage by the New York Times Graphics Department.

Link: 3 Presentations on R (ReadWriteWeb)

Apr 1, 2011
#R #statistics #big data

March 2011

Perspective on MongoDB (dirolf.com)

This linked blog post is written by a developer who worked on the MongoDB NoSQL data store itself and then transitioned to using MongoDB for another project.

Mar 31, 2011
#MongoDB #NoSQL
NoSQL Databases: What, Why, and When (ontwik.com)

NoSQL data stores are not the solution to every data storage problem! Using an SQL solution for structured data is vastly better than trying to shoehorn it into a NoSQL data store.

In this video, Lorenzo Alberton discusses the “what, why, and when” of using NoSQL solutions.

Video: NoSQL Databases: What, Why, and When (ontwik)

Mar 31, 2011
#NoSQL
Great Tutorial On Couchbase + jQuery Mobile (blog.couchbase.com)

Two of my favorite topics, NoSQL and mobile apps, are covered in this awesome tutorial by Todd Anderson. The multi-part tutorial is linked through the Couchbase blog post.

Link: Couchbase + jQuery Mobile

Mar 30, 2011 5 notes
#CouchBase #JQuery Mobile #tutorial
Opportunities for Big Data Startups in Vertical Markets

GigaOm is reviewing topics from last week’s Structure Big Data conference. One subject is that horizontal markets for big data analysis and visualization are quickly filling up with offerings. Those offerings may not be the right answers to the big data challenge but there is certainly a lot of competition in the space. The author of this post, “Why Big Data Startups Should Take a Narrow View,” advocates looking at vertical markets.

I agree with looking into vertical markets for a specific fundamental reason. Data exploration, analysis, and visualization generate value only when combined with domain-specific knowledge. Data visualizations in isolation may look interesting but value is only created when the information is acted upon. It is much easier to act on specific information created by big data from your own industry than across industries that are not pertinent to your line of business.

Mar 30, 2011
#big data #visualization #vertical markets
Disruptive Innovation in the Data Storage Industry

Disruptive innovation is occurring in the data storage industry. Some trends are evolutionary and help established businesses such as Oracle. For example, the predictable increase in data generation and corresponding rise in storage needs. Other trends are revolutionary and disruptive, such as the increasing importance of semi-structured and unstructured data sources. 

Oracle is well positioned to take advantage of structured enterprise data growth needs as evidenced by their positive latest earnings report. Oracle will remain entrenched in the structured data storage market despite competition from open source offerings. Most enterprises believe their structured data is too critical to be stored in open source solutions such as MySQL and PostgreSQL.

However, as NoSQL solutions from firms such as 10gen (MongoDB) and Couchbase continue to improve their offerings, Oracle will find profit margins decreasing due to competition in semi-structured and unstructured data storage. The NoSQL solutions present disruptive innovation through revolutionary change in data storage methods. NoSQL is a different paradigm in data storage: fundamental computer science principles such as the CAP Theorem force design constraints that cannot be reconciled with traditional relational databases.

Structured SQL products will coexist with semi-structured and unstructured NoSQL products in the near term. Yet there is overlap in data storage needs that will eat into Oracle’s traditional relational database business. For example, storing, searching, and extracting unstructured CLOB and BLOB values in databases is a frustrating experience for developers and database administrators. NoSQL products make unstructured data storage much easier due to trade-offs in the CAP Theorem.

SQL databases also scale differently with big data sets, which forces mutually exclusive design decisions. There are two ways to scale: vertically and horizontally.

  1. Vertical scaling is done by buying a bigger machine to house your database, such as replacing a commodity server with a mainframe. 
  2. Horizontal scaling is accomplished by buying additional commodity servers and sharding the database across those new servers.

SQL solutions are geared towards vertical scaling while NoSQL solutions are usually built for horizontal scaling [1]. Oracle prefers vertical scaling because customers are more likely to buy expensive Sun Microsystems servers (which Oracle now owns). Companies without the funds for big servers or the inclination to buy dedicated hardware often prefer the horizontal scaling route because it’s easier to match supply of your web services with demand by customers through service providers such as Amazon Web Services.
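The horizontal route hinges on sharding: every server must agree on which shard owns a given record. Here’s a minimal hash-based sharding sketch in Python (production systems like Cassandra use consistent hashing so that shards can be added without remapping everything, but the core idea is similar):

```python
import hashlib

NUM_SHARDS = 4  # four commodity servers instead of one big machine

def shard_for(key):
    """Route a record to a shard by hashing its key."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Because the routing is a pure function of the key, any node can
# locate a record without consulting a central lookup table.
for user_id in ["alice", "bob", "carol"]:
    print(user_id, "-> shard", shard_for(user_id))
```

Vertical scaling needs no such routing logic, which is part of its appeal; the price is that you eventually hit the ceiling of the biggest machine you can buy.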

Semi-structured and unstructured solutions will eventually compete in the structured data storage market. For example, EMC is not a traditional SQL database solution provider, but it is making major bets on big data and will soon be in direct competition with Oracle.

Oracle will compete against semi-structured and unstructured products with reduced motivation because it will cut into their server business. Oracle will be caught in a trap that may force it to move to the higher margin enterprise corner of the market where only structured data products exist.

Oracle could just buy out all of the NoSQL competitor companies. Disruptive innovation by commercial firms could be marginalized by Oracle’s massive war chest. Only open source NoSQL products would remain, which would serve the needs of the majority of companies, but enterprise customers could still be beholden to Oracle’s service guarantees, similar to the situation today. However, if a competitor refuses to be bought out, or anti-trust issues are raised by limited competition in the data storage industry, Oracle could be left with a much smaller piece of the storage market than it holds today.

[1] I’m generalizing here so if you’re a subject matter expert you may take issue with this description. I recognize that and I appreciate feedback since I’m sure there’s a better way to succinctly state the difference in scaling between SQL and NoSQL solutions.

Mar 28, 2011 6 notes
#10gen #Couchbase #MongoDB #NoSQL #Oracle #big data #disruptive innovation #relational databases #EMC
Common Problems Hadoop Can Aid in Solving

Cloudera has an interesting white paper entitled “Ten Common Hadoopable Problems.” The paper’s introduction has background information familiar to experienced Hadoop users. After the introduction, the paper presents problems and solutions for areas such as risk modeling, recommendations, and ad targeting.

The problem descriptions and solutions are high level, so don’t expect any MapReduce algorithms. The paper’s greatest strength is that it is targeted primarily at business decision makers, which makes it particularly useful when promoting Hadoop in an organization as part of the solution to any of these common problems. One question I often face when discussing Hadoop with non-technical users is, “do you have any case studies on how it can solve my problems?” Apparently Cloudera has encountered the same question, and now they’ve gone out of their way to make our lives easier by passing on these case study answers.

White paper: Ten Common Hadoopable Problems (Cloudera)

Mar 27, 2011 1 note
#Hadoop #big data #case studies
DataStax Brisk: Cassandra, Hadoop, and Hive Packaged Together

DataStax’s new product Brisk merges Cassandra, Hadoop, and Hive. While Apache HBase is the most common NoSQL data store used with Hadoop, Brisk attempts to combine Cassandra’s distributed data store with Hadoop’s MapReduce capabilities (and Hive’s query tools). The general concept is that Hadoop can be run against one of Cassandra’s multiple distributed data stores without impacting the performance of the others.

Brisk is an interesting concept but the proof is in whether the product delivers the best of both Cassandra and Hadoop without transferring the weaknesses. Since Cassandra is eventually consistent (trading consistency for availability and partition tolerance in the CAP Theorem), how that impacts Hadoop’s MapReduce jobs remains an open question.

I’ve requested a copy of Brisk through DataStax’s website and when I get my hands on it I’ll create a further technical write up.

Link: DataStax Brisk

Mar 25, 2011
#big data #Cassandra #Hadoop #DataStax Brisk
A History of the World Visualization through Wikipedia Articles

This amazing 100-second video of world events during human history was created by scraping all geo-tagged articles from Wikipedia. The visualization is obviously heavily based on Western world history since most English Wikipedia articles are written by contributors in the United States and Western Europe.

The video is worth watching for the final outline of the continents based on events alone.

Link: A History of the World in 100 Seconds

Mar 24, 2011
#visualization
Enthusiasm

I wrote the following piece four years ago while in my first job at Freddie Mac. The piece is advice I would give graduating college students in their first technology jobs.

Enthusiasm

Get excited about everything you do. Every job is an opportunity to become better. A manual maintenance task on a legacy system is a chance to help your colleagues by creating an automated fix. Simple programming assignments are an opportunity to learn how to rigorously unit test your solutions. If you work that hard on simple tasks your colleagues will want to work with you on harder challenges.

Enthusiasm is contagious but it is rarely found in the business world. Many of your coworkers are jaded by bad experiences and poor career choices. If you are seen by your colleagues as someone who takes pleasure in working hard and enjoys every day despite its numerous challenges, people will be drawn to you. 

You have the ability to make the workplace enjoyable by being enthusiastic. One reason businesses hire students out of college is that recent graduates are uplifting. New graduates are untainted by years of working in average organizations. Keep a positive attitude and take on new challenges. Negativity and sarcasm are poor alternatives.

Optimism and enthusiasm go hand-in-hand. Being optimistic during periods when projects are running smoothly is easy. But optimism is more important during challenging times. Software engineering is a difficult field! There are a myriad of reasons why projects can fail. Ambiguous requirements, poor project plans, incorrect architectures, faltering project sponsorship, and team member attrition are a small subset of the major issues that can occur.

Yet technology projects do succeed. Success is controlled by the optimism of a project team. Success must always be viewed as an option during the most difficult times. Be a source of optimism and your colleagues will remember you as a valuable contributor. People want to be around others who are positive. If you exude a positive vibe, your colleagues will be eager to work with you again in the future.

As a recent college graduate you will not have the same depth of technical expertise nor the breadth of project experience as your colleagues. Enthusiasm makes up for the knowledge you lack today and is just as valuable to the success of your projects.

Mar 15, 2011 3 notes
#enthusiasm #college hires
UVA Master's in Management of Information Technology Retrospective Summary

Here is a summary with links to my MS MIT retrospective posts:

  • Part 1 - Retrospective Introduction and Program Value Summary
  • Part 2 - MS MIT Curriculum Overview
  • Part 3 - Mod 0 and Mod 1
  • Part 4 - Mod 2 and Mod 3
  • Part 5 - Mod 4
  • Part 6 - Post-Graduation, Program Value, and Conclusion
Mar 13, 2011
#UVA #MS MIT
UVA Master's in Management of Information Technology - Retrospective (Part 6 of 6)

This is the final part of my retrospective on the UVA Master’s in Management of Information Technology program I graduated from in May 2010. See part 1, part 2, part 3, part 4, and part 5 for context.

Graduation and Beyond

Graduation at UVA is a beautiful tradition. Go to it! It provides closure to the MS MIT experience.

After wrapping up Mod 4 and graduating it takes time to transition back into a normal work/life balance. Here are some tips and lessons to ease the transition:

  • When you finish your project defense in May and then graduate, it will take a couple of months to unwind. You will wake up on Saturday mornings with a sense of urgency until you realize you don’t have to read and analyze several case studies over the weekend. A vacation can help a lot.
  • Keep applying the concepts learned in the program. It’s much easier to remember accounting and finance when you refresh your memory often.
  • Go to alumni events. Nicole Fitzwater is the Director of Alumni Relations and does a fantastic job of setting up events for networking and catching up with classmates.

Program Value

The MS MIT program provides most of its value by teaching business school topics in the context of technology. If you are in a technology field and are interested in MBA programs but do not want to study unrelated topics, the MS MIT program is a great choice.

The class is composed of a range of experience levels. My classmates had between 3 and 25+ years of experience, averaging 14 years. There was roughly a 50/50 split between commercial and government positions (either as consultants or as employees).

A big portion of value in the program is gained by learning from your classmates. The program, particularly the Charlottesville section, facilitates interaction between classmates to assist the learning process outside of the classroom.

My One Criticism of the Curriculum

Overall I had a great experience with the MS MIT program. However, I was disappointed in one subject area. Technology and entrepreneurship is not covered in the curriculum. We learned a disproportionate amount about issues in large organizations. Many of the challenges we discussed in class were symptoms of big bureaucratic organizations.

For example, one of our case studies was on implementing a CRM system at a Fortune 500 organization. The political challenges were more of an issue than the technology problems. Smaller companies would not face the same challenges because they would be more likely to implement a standard software-as-a-service solution.

However, most of the subject matter in the program could apply to established (not startup) organizations. If you are looking to learn more about applying technology to create startups, there are other Master’s programs out there with more of an emphasis on that subject.

Conclusion

That’s my retrospective on the MS MIT program. In hindsight it was a great boost to my career and such a pleasure to have been a part of despite the heavy workload.

I hope this retrospective is helpful for prospective and current MS MIT students. If you have further questions or feedback, please email me at [email protected].

Mar 12, 2011
#UVA #MS MIT
UVA Master's in Management of Information Technology - Retrospective (Part 5 of 6)

This is part 5 of my retrospective on the UVA Master’s in Management of Information Technology program I graduated from in May 2010. See part 1, part 2, part 3, and part 4 for context.

Mod 4

Mod 4 is by far the most difficult, intense, and valuable experience in the MS MIT program. Prepare for emotional ups and downs. One day you will have everything lined up with your company and course work; the next day your group has to scramble because a project deliverable was not acceptable.

Besides the Mod 4 Capstone Project where you work with a company, you will study corporate strategy, managerial accounting and finance, marketing, and behavioral event interviewing. After completing Mod 4, you will be able to hold your own against any top MBA student in corporate strategy discussions. We learned Michael Porter’s Five Forces and similar corporate strategy subject material. We read and discussed Harvard Business Review case studies for context. All of the material was grounded in technology subject matter. I felt the focused subject matter approach worked very well because we analyzed case studies that we had experience dealing with.

Accounting and finance was also beneficial. I never studied accounting or finance before the MS MIT program. Learning finance was difficult but I now have working knowledge of calculating discounted cash flows, net present value, weighted average cost of capital, and related concepts. Reading balance sheets, income statements, and statements of cash flows is important when working with publicly traded companies because you can fully understand and appreciate the business challenges they face.

I found the Behavioral Event Interviewing (BEI) class very beneficial. Every time I go into an interview and get asked BEI questions, I know how to structure my answers so they are appropriate. Studying BEI boosted my interview confidence because I know more about what the interviewer is looking for.

It’s easy to get lost in the Mod 4 class work because there is so much of it. But the capstone project is ultimately the biggest part of the grade. There are two major presentations in addition to written reports:

  1. Your company and its industry analysis. Tell a compelling story for what your company does, what aspects it does better and worse than rivals, why it is better or worse, and where the company’s strategy is taking it. This presentation takes place in late February or early March.
  2. Your revenue-generating IT initiative for your company. Quickly review your company’s strategy then dive into your IT initiative, why it is critical to achieving the company’s strategy, how the initiative will be implemented, and the financials around implementation and execution. This presentation occurs the second to last day of the program.

After the final presentation, the group returns on the last day to defend its work to the professors. Prepare to answer the questions there was no time for after the previous day’s presentation.

Advice for Mod 4:

  • Focus on the story of your capstone project. It’s easy to get lost in industry data analysis. You must tell a compelling story to create a truly great IT initiative for the company you are working with.
  • If you’ve never studied corporate strategy before, learn the fundamentals of Michael Porter’s Five Forces and industry analysis, then run with them on the capstone project. There is too much information to learn it all, so get the basics down and move on; time is precious in Mod 4.
  • Grab drinks once your group successfully defends its capstone project; you’ll need them!

My advice for adjusting to life after graduation, analysis of the MS MIT program’s value and conclusions are found in part 6.

Mar 11, 2011
#UVA #MS MIT
UVA Master's in Management of Information Technology - Retrospective (Part 4 of 6)

This is part 4 of my retrospective on the UVA Master’s in Management of Information Technology program I graduated from in May 2010. See part 1, part 2, and part 3 for context.

Mod 2

Mod 2 covers managing information technology projects. A lot of the material is related to PMBoK (Project Management Body of Knowledge). Some people in the class earn their Project Management Professional (PMP) certification by taking the PMP test after finishing Mod 2. I cannot speak to how easy or difficult Mod 2 makes the PMBoK material since I did not take the test.

By far the most interesting part of Mod 2 is the group project. Groups are assigned by classmate geography to facilitate interaction. For example, I lived in Charlottesville at the time and my four other group members lived within half an hour of me.

Groups are responsible for finding a completed (or failed) IT project and performing a retrospective. There is a laundry list of things that can go wrong on IT projects and groups have to analyze what went right, what went wrong, and what future projects can learn to do better.

Each group interviews project stakeholders, analyzes documentation and deliverables, and views demos of the system if it was completed. The analysis objective is to piece together the outcome and the project’s significant intermediate events. At the end of the three months, each group presents a retrospective on the project.

Things I wish I knew before going into Mod 2:

  • Learn to work well with your group. Your group members will change in Mod 3, but the lessons learned while working under the stress of a difficult project will be very beneficial for the remainder of the program.
  • Practice, practice, practice for the presentation. It makes a major difference in the Mod 2 grade! We ran through our presentation in full six or seven times in person and made adjustments as we went along. That’s at least four hours of time spent only on speaking our lines in the presentation plus several more hours for slide edits.
  • Do not wait until the night before your presentation to practice it for the first time. You will make changes to the slide deck and if the changes are drastic you will not have enough time to become comfortable with the section you are speaking on.  

Mod 3

The main topics for Mod 3 are enterprise integration, data warehousing, and business intelligence. Although I found Mod 2 interesting, Mod 3 was where the classwork really became fascinating because it focused on enterprise-wide issues.

The Mod 3 topics are all major challenges that frustrate even the best IT organizations. Class discussions on enterprise integration and business intelligence were interesting because many of my classmates were working on these large projects. Professors provided best practices and case studies while classmates provided concrete examples.

The group project for Mod 3 comes from a list of choices on relevant issues in IT organizations, such as social networks, cloud computing, and the “data deluge” (how to process and make sense of the exponentially increasing amount of data organizations produce). As a side note, the data deluge and big data are the topics this blog usually focuses on so if you have further interest in that area, please check my archive for relevant posts.

Advice for Mod 3:

  • Pick a topic everyone in the group is interested in. If one person dominates the topic choice then others will have less motivation to learn the subject matter.
  • Start pinging contacts in your network to find out what level of access they have to C-level executives. You want several potential companies lined up for Mod 4 by mid-December.
  • Take a few days off either during Thanksgiving or around Christmas. You will be working for four months straight on a very difficult project in Mod 4.

Mod 4 is introduced even before Mod 3 ends. Mentally prepare yourself for the most difficult part of the program.

My Mod 4 retrospective can be found in part 5.

Mar 10, 2011
#UVA #MS MIT
UVA Master's in Management of Information Technology - Retrospective (Part 3 of 6)

This is part 3 of my retrospective on the UVA Master’s in Management of Information Technology program I graduated from in May 2010. See part 1 and part 2 for context.

Mod 0

May 2009 was the first weekend of my cohort’s program. It was more than just a meet and greet with classmates. Mod 0 set the tone for program weekends with three 8-hour class days.

Topics included an introduction to corporate strategy, IT relevance, and the program’s tag line, “Delivering business value through IT.” We learned that the program is about how IT works when done well. IT departments can be a critical piece of corporate strategy and not just a cost center.

If you leave Mod 0 feeling like you did not get any value out of the topics and discussions, you should consider dropping the program. It is a major commitment not only for yourself but also to your classmates, who are paying $40,000 for an education. The cohort is only as strong as its weakest link. Everyone has to contribute for the program to produce maximum value.

Mod 1

Mod 1 is 10 days in June of rigorous class and project work. You get very little sleep. I slept four hours a night (2:30am to 6:30am) throughout Mod 1 and caught up as much as possible on the weekend break that divides the two 5-day sections. Coffee was crucial.

Topics in Mod 1 included enterprise architecture, computer network fundamentals, computer and network security, database modeling, and an introduction to data warehousing (covered in further detail in Mod 3). The idea behind Mod 1 is to give non-technical students a grasp of technical fundamentals so the concepts are no longer intimidating. The topics are high-level. You will not be doing any Java or .NET programming.

If you are technical or have a technical background, the subject matter in Mod 1 is straightforward. You can fill gaps in your knowledge or refresh areas you have forgotten. For example, I knew HTTP servers ran on port 80, but I had forgotten that a browser running on a client machine opens high-numbered ports for communication with an HTTP server. I should have known that concept but had not thought about it in a while, so class refreshed it for me. There were dozens of similar examples scattered throughout our coursework.
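That client-side behavior is easy to see for yourself with a couple of sockets. This is a small sketch of my own in Python, on the loopback interface rather than a real web server: the “server” listens on a fixed port, while the operating system automatically assigns the connecting client an ephemeral (high-numbered) source port.

```python
import socket

# "Server": bind to loopback. Port 0 tells the OS to pick a free listening
# port for this demo; a real HTTP server would bind to port 80.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
server_port = server.getsockname()[1]

# "Client": connect without binding first; the OS assigns an ephemeral
# source port, just like a browser talking to a web server.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", server_port))
client_port = client.getsockname()[1]

print(f"server listening on {server_port}, client speaking from {client_port}")

client.close()
server.close()
```

Run it a few times and the client’s port changes each run, which is the point: only the server side needs a well-known port.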

A few bits of advice for students about to experience Mod 1:

  • Set expectations with family, friends, and your company that during Mod 1 you will be incommunicado. You will not have time for anything outside the program during these two weeks.
  • Speak up in class when you have something valuable to add. Mod 1 is a combination of the Charlottesville and Northern Virginia sections so the classroom is crowded. Make your answers short and get to the point immediately.
  • Go to the bars on the Corner with your classmates after finishing up your group project work each night. Get to know your classmates! You’ll learn a lot from discussions over beers.
  • Plan to do a few hours of work on Saturday and Sunday during the weekend between the five day sections. Otherwise catch up on sleep as much as possible.
  • Get started on your Mod 2 reading as soon as Mod 1 ends. I made the mistake of taking a week off from all class-related reading after Mod 1 and put myself in a huge hole.

Mod 1 is a fantastic and intense experience. Go into it with the mindset of working incredibly hard the entire time and you can recover when the two weeks are over.

See part 4 for Mods 2 and 3.

Mar 9, 2011
#UVA #MS MIT
UVA Master's in Management of Information Technology - Retrospective (Part 2 of 6)

This post is part 2 of my retrospective on the UVA Master’s in Management of Information Technology (MS MIT) program I graduated from in May 2010. Please read part 1 for context.

Here’s an overview of the 12 month program that takes place in Charlottesville, Virginia on UVA Grounds:

  • Mod 0 (end of May) - Introduction to classmates, professors, program structure. A high-level overview of corporate strategy as it relates to information technology.
  • Mod 1 (2 weeks in June) - Intense 16+ hour days for 2 weeks. Classes on IT architecture, databases, networking, computer security, data warehousing, and IT governance. A group project is due at the end of the 2 weeks on a revenue-generating IT initiative for an assigned company.
  • Mod 2 (3 months from July to September) - 3 weekends of 8 hour classes primarily on IT project management. A group project involving a retrospective on a large IT initiative is due in September.
  • Mod 3 (3 months from October to December) - 3 weekends of 8 hour classes primarily on IT integration and data warehousing. A group project on an IT trend such as cloud computing is due in December.
  • Mod 4 (5 months from the end of Mod 3 in December to May) - 5 weekends of 8 hour classes and presentations on corporate strategy and CEO- and CIO-level issues. By far the most difficult part of the program. Group conference calls past midnight are common. I’ll review this in more detail in part 5.

The curriculum changes each year to incorporate material on new technology and trends. The professors did a great job of keeping the material fresh and relevant to the latest news. Classes were engaging and a lot of fun despite the 8 hour length.

Part 3 covers my experience with Mods 0 and 1.

Mar 8, 2011 1 note
#UVA #MS MIT
UVA Master's in Management of Information Technology - Retrospective (Part 1 of 6)

I graduated from the University of Virginia (UVA) Master’s in Management of Information Technology (MS MIT) program in May 2010. I worked full-time at Booz Allen Hamilton while going through the full-time 12-month program in Charlottesville, Virginia.

It’s been over 10 months since my group finished our program requirements by defending our capstone project to our professors.

A theme throughout the MS MIT program was performing retrospectives for iterative improvement. This six-part series of blog posts is my retrospective on the entire Master’s program.

Here is a summary of what my posts explain:

  • I slept much less than normal during the program’s 12 months. The lack of sleep was balanced by a tremendous amount of valuable learning.
  • I went in as a developer with four years of experience. I graduated with an appreciation and working knowledge of all areas related to IT including managerial finance, project management, corporate strategy, marketing, and behavioral event interviewing. 
  • I previously had a small network of contacts who were mostly in development roles. Now I have a much larger network of established professionals on all corporate levels.
  • I took out $40,000 in student loans. I got my money’s worth.

Part 2 is an overview of the MS MIT curriculum.

Mar 7, 2011 1 note
#UVA #MS MIT
Picking a Fight With An 800 Pound Gorilla (wepay.com)

I just had to link to this post by Rich Aberman at WePay. The post covers both marketing and corporate strategy from the perspective of a startup going up against an established large organization. It’s a fantastic read.

Post: Picking a Fight With An 800 Pound Gorilla (WePay blog)

Mar 1, 2011 2 notes
#marketing #strategy
Google Public Data Explorer (radar.oreilly.com)

O'Reilly has some more great information on Google Public Data Explorer. The post shows an example of visualizing public data on the U.S. unemployment rate for the last 20 years. The post also has an embedded presentation with more information on how to upload data to Public Data Explorer.

Article: Google Public Data Explorer (O'Reilly)

Mar 1, 2011
#big data #Google

February 2011

Google Labs Launches Dataset Publishing Language (code.google.com)

Google launched the Dataset Publishing Language (DSPL) to spur standardization in data visualization and metadata formatting. With any new technology, it’s important to figure out why a company is working on the new product or service. In this case, Google wants more data for people to interact with through its visualization tools, so those tools can become a de facto standard.

Link: DSPL (Google)

Feb 19, 2011
#big data #visualization #Google #DSPL
McKinsey Recommends Big Data for Reinvigorating the US Economy (economist.com)

One of McKinsey’s recommendations for reinvigorating the US economy through greater productivity and innovation is tapping the potential of big data.

The potential runs from Big Data—data-driven business decisions and actions—to cloud computing and the application of advances in biology and the life sciences.

Article: US Productivity (The Economist)

Feb 17, 2011
#big data
IBM's Watson and Storage System Developers (news.cnet.com)

After 3 days of competition, IBM’s Watson crushed the best human players who have ever appeared on Jeopardy. Now the debate begins over how relevant Watson’s software will be outside the niche game show setting.

CNET has an interesting article on what Watson means to storage system developers. Watson may not be a true big data analytics system because its actual memory bank of answers is less than 1 terabyte of data. Does that mean big data is irrelevant because storage capacities over 1 terabyte are unnecessary? In 2011 that may be true for many industries. But in 5 years we will have much more data for analysis due to sensor networks, and big data analytics systems will have to grow to meet that demand.

Article: What IBM’s Watson Means to Storage System Developers

Feb 17, 2011 1 note
#big data #IBM #Watson
O'Reilly's Strata Conference Review (sauria.com)

Ted Leung went to O'Reilly’s recent Strata conference and wrote a detailed review. Ted’s takeaway was that it was a good conference but that it will take several iterations to become a great one. Since data science is an interdisciplinary field, the leaders in the space are still emerging. The next Strata conference is in September in NYC, and I’m sure O'Reilly will be listening to the feedback to improve its offering.

Link: Strata Conference Review

Feb 15, 2011
#big data #Strata
User Experience with Big Data (businessinsider.com)

There’s a lot of information coming out of the Strata conference just held by O'Reilly. In this article the author discusses the role of user experience and user interfaces in helping users understand big data.

An effective user interface is one of the three main areas of an effective big data implementation (the other two are data collection and analysis).

Article: The Role of UX/UI in the Big Data Revolution

Feb 15, 2011
#big data #visualization
Big Companies Drain Employees' Productivity (cybaea.net)

This article explores how poor productivity is at large companies, where the problem is likely caused by communication overhead. These findings also help explain why a small team of half a dozen veteran programmers can outproduce a large team of 50+ mid-level developers.

Article: The 3/2 Rule of Employee Productivity

Feb 10, 2011
#productivity
Big Data in the Coming Decade (gigaom.com)

GigaOM makes four standard predictions for the coming decade that relate to big data:

  1. Employers will seek people with data analysis skills
  2. Data producers will figure out new ways to monetize their data
  3. The old sense of privacy, where individuals could remain anonymous, will end
  4. Some new companies will be based purely on commercializing data

I would add that academic research on advanced big data concepts will become a hot topic. Also, many schools will jump on the trend by creating offshoots of computer science and statistics majors that focus on programming data analysis algorithms.

Article: Mining the Tar Sands of Big Data (GigaOM)

Feb 10, 2011 1 note
#big data
Data marketplaces continue to emerge (blog.datamarket.com)

This article is a blog post by DataMarket, which just launched a data set marketplace. If DataMarket, InfoChimps, and other data markets prove profitable, they could emerge as an important source of public data to combine with proprietary data to create new business value.

DataMarket blog post: 13 Thousand Data Sets, 100 Million Time Series, 600 Million Facts

Feb 6, 2011
The Iron Triangle Revised To Reflect People

Arin Sime wrote a great piece on the iron triangle of software development. Arin proposes we add a fourth dimension to the iron triangle representing employee morale. Adding employee morale really fills out the model, and I’m surprised no one suggested it before.

But there was one question I was left with after reading Arin’s article. Why should business leaders care about employee morale, especially if the project is staffed with consultants?

Who cares if consultants burn out producing software as long as it’s on time, within budget, and fulfills the desired scope? Here are three quick reasons:

  1. Negative reputation. Running your business by exploiting others instead of building trust and respect will come back to haunt you. Your customers will find out. This is the Internet Age. Word travels fast. Your business reputation is difficult to build but easily destroyed.
  2. Troublesome future projects. A consulting firm that respects its employees will not do business with you again. Firms that do continue to work with you will staff future projects with less productive workers because good consultants will refuse the engagement.
  3. Poor software quality. What is “done” in application development? Software is only as good as the business value it produces. When your business changes, how easy is it to adapt that software? If the software foundation is built on sand and it requires a completely new architecture to handle any changes, the long term value of that software is jeopardized.

Software development is about people. If you screw over your people, it will come back to hurt you later in unpredictable ways.

Feb 4, 2011
#AgilityFeat #agile
Why 75,000 Applications in A Week Might Be Bad for Google

Google received 75,000 applications in a single week in a tight labor market for top technical talent. Here are several reasons why that may be a bad thing [1]:

  1. It takes a lot of time to sort, review, call, interview, decide on, and potentially extend offers to that many applicants. That’s a lot of resources that could be used for something else. If you have developers performing interviews, then “Maker’s Schedule, Manager’s Schedule” becomes an issue because the developers’ schedules are so broken up [2].
  2. Is the Google reputation for only hiring the best and brightest weakened because they are on a hiring binge? It’s one thing to say “we need more great people!” It’s another to set a public mandate that you’re going to hire 6,000 new employees for an increase of ~23% in total headcount. That’s a big slow company mentality. Doesn’t Google want to figure out a better way to do things?
  3. If the brand is damaged, is Google receiving many applications from people who are not qualified? That’s fine at a small company if you can weed out unqualified applicants. But at most big companies there are people who fall through the cracks, especially when recruiters are trying to hit a monthly hiring target.

In addition to the volume of applications, why does Google need to hire that many new people? The company built its core business with less than 6% (1,900/32,000) of the total employees it hopes to have by the end of 2011. Google’s product portfolio has expanded dramatically since its IPO. Yet it appears inertia is taking over a company that was a startup only 10 years ago.

[1] I am in Washington, DC but it’s really tough to find great available developers here. I’m assuming most other major U.S. and European cities are similar in that regard.

[2] Paul Graham - http://www.paulgraham.com/makersschedule.html

Feb 4, 2011
#google #hiring
Consulting Career Mismatch

Consulting firms and their consultants are locked in a struggle. In this post I define that struggle as consulting career mismatch.

Have you seen an “I’m an IBM-er” commercial? We learn about the interesting project the speaker in the commercial is working on. He’s employed by IBM. But we never learn the consultant’s name! The consultant’s career would benefit if he gained name recognition. Instead, IBM takes credit for the consultant’s hard work. IBM pays the consultant’s salary and for the commercial, so it wants the positive recognition for providing the resources to execute the work.

Consulting career mismatch appears when the best interest of your career is misaligned with your consulting firm’s strategic direction. For example, you want to build your reputation as an expert in business intelligence. You want executives to call you when they have a business intelligence question. But your firm, XYZ Consulting, wants you to be one of thousands of “XYZ Business Intelligence Consultants.” XYZ Consulting wants to be the firm that receives the phone call from executives for business intelligence solutions.

IBM is the most obvious example of where consulting career mismatch can occur, but it applies to all firms. The mismatch is most pronounced at large firms because they have greater leverage over individual consultants.

Large firms want you to be an interchangeable cog in their consulting machinery. If you are a nameless consultant whose proof of excellent work depends solely on the firm’s reputation then you are less valuable without employment at that firm. The firm won’t have to pay you a premium for your services because you do not create as much value on the open market without the firm’s name recognition.

Consulting career mismatch is not inherently bad. You can still have a great career at large consulting firms despite the mismatch. But mismatch can derail your ambitions when aiming to be a recognized expert at a large firm. There is less friction when building your niche reputation as an independent consultant or part of a small firm that depends upon your success to remain in business.

Career mismatch should be seriously considered by every consultant that wants to build a unique personal reputation. The environment at your firm impacts whether or not you are successful and how that reputation is received.

Feb 2, 2011
#consulting

January 2011

IBM's Watson Uses Hadoop (information-management.com)

This Information Management article reports that IBM is using open source technologies, including Hadoop, in its Watson system that will compete on Jeopardy in February.

Article: How IBM’s Watson Churns Analytics

Jan 31, 2011
#hadoop #big data #watson #ibm
NoSQL at Netflix (techblog.netflix.com)

Yury Izrailevsky, the Director of Cloud and Systems Infrastructure at Netflix, wrote a great post on how NoSQL systems are in use at the company. The post discusses the mindset adjustment when moving away from traditional ACID database systems to systems that only satisfy two of the three CAP properties. Most big corporations would have a big job retraining their in-house IT developers to understand how, when, and why you must decide on the trade-offs up front with NoSQL systems.
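One concrete example of those up-front trade-off decisions: Dynamo-style NoSQL stores let you tune consistency per operation. With N replicas, a read that contacts R of them is guaranteed to overlap the replica set of a write that went to W of them only when R + W > N. A toy sketch of that quorum rule (the replica counts here are hypothetical, not anything from the Netflix post):

```python
def read_sees_latest_write(n_replicas, r, w):
    """Quorum rule for Dynamo-style stores: a read is guaranteed to
    overlap the most recent write's replica set whenever R + W > N."""
    return r + w > n_replicas

# Consistency-leaning configuration: quorum reads and writes over 3 replicas.
assert read_sees_latest_write(3, r=2, w=2)

# Availability-leaning configuration: fast single-replica reads and writes,
# at the cost of possibly returning stale data.
assert not read_sees_latest_write(3, r=1, w=1)
```

Choosing R and W is exactly the kind of decision an ACID-trained developer never had to make: the database no longer promises consistency for free, so the application has to declare which operations can tolerate staleness.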

Yury also writes about how the firm tries to use the right tool for the job instead of shoehorning existing “approved” enterprise IT tools into systems they are not designed to accommodate. Again, there’s a big gap between how the best technology companies like Netflix do their IT work and how most mainstream companies’ IT shops operate.

Article: NoSQL at Netflix

Jan 29, 2011
The Periodic Table of Google APIs (code.google.com)

Here is a diagram of Google’s application programming interfaces (APIs) in a format that would be familiar to anyone who’s taken high school chemistry: a periodic table. Who knew Blogger had an API?

Link: Periodic Table of Google APIs

Jan 27, 2011 2 notes
#google #api
3 Skills a Data Scientist Needs (radar.oreilly.com)

LinkedIn has one of the best teams of big data experts in the world. In this O'Reilly video, Pete Skomoroch from LinkedIn explains what skills are necessary to excel in the data scientist role.

Video: 3 Skills A Data Scientist Needs (O'Reilly)

Jan 26, 2011