Makai's Blog: Archive

The Buzz on Redis (and NoSQL)

Redis is getting a lot of buzz for its fast read/write performance and its innovative use cases beyond just being a key/value store. For example, at the second Big Data DC meetup last week, Nick Kleinschmidt of Lucid Media discussed how his firm uses Redis for online display advertising.

Today the top link on Hacker News is how to add Redis to your current stack. It’s a great piece that explains what Redis is and how you can use it to augment your existing web application infrastructure.

Despite some consternation from the “you’re doing storage wrong” traditional SQL crowd, the “Not Only”-SQL movement is great for innovation in the storage space. More of these innovative use cases will continue to come up as NoSQL solutions with different flavors of storage formats and CAP Theorem choices proliferate.

Jun 28, 2011 11 notes

#Redis #NoSQL

The Blog Post Litmus Test for Prospective Employers and Clients

I use a simple test with every prospective employer or client to help gauge their culture:

Find a public article written by an interviewer (and potential colleague) at the prospective company
Write a summary of the article and some additional insightful comments based on prior related experience
Tell the interviewer her article was interesting and email the link to the new blog post
Observe and gauge her reaction

Responses range between two extremes:

Engagement with your ideas and an appreciation that you took time to add to the conversation
Direct feedback or intimation that you couldn’t possibly add anything of value

This is where it gets interesting. It’s a good sign if employees are interested in your outsider viewpoint and want to talk further. They care enough beyond their day-to-day tasks to discuss a topic related to their company or industry and actively seek outside perspectives. It’s similar to how there’s a correlation between better software developers and programming outside of work.

Proceed with caution where there is apathy or hostility towards your viewpoint. Even if you’re incorrect in what you wrote because you don’t have a clear picture of the company or industry, you should never be belittled for taking the time to write down your perspective.

That’s my litmus test for prospective employers and clients: do the employees care enough about their company and industry to actively engage me before I perform work for them? Do they value my input and my commitment to their mission? Will they view me as a respected peer or a butt in a seat they order around? An affirmative answer to these questions is critical to the success of highly motivated employees and can be found in part by performing this blog post litmus test.

Jun 21, 2011 3 notes

#interviewing #prospective clients #prospective employers

How to Really Hire Developers

Well, I can’t say it any better myself so I’ll just point you to this fantastic rant on how you should actually be interviewing software development candidates.

Jun 16, 2011 2 notes

#recruiting

HPCC: Competition for Hadoop in Big Data Analysis

LexisNexis just open sourced its data analysis platform, HPCC.

Competition in the data analysis space is a good thing. I don’t know enough about HPCC to compare it to Hadoop just yet. However, developers have to learn a new language, ECL, which has a relatively sophisticated syntax, to run analysis jobs on HPCC.

I find it unlikely that most developers will be willing to spend the time to learn that new language until a community springs up to show what advantages HPCC offers over Hadoop. The supposed advantage of ECL’s conciseness in expressing analysis jobs is in relation to Java. The real test comes when comparing ECL to Clojure and Scala, much better programming languages for concise MapReduce jobs.

Further reading: LexisNexis open sources Hadoop competitor (GigaOm)

Jun 15, 2011 5 notes

#big data #HPCC #ECL

MongoDB: An Introduction (Part 1)

Document-Oriented Data Stores

A document-oriented data store extends the key-value pair model by providing a structure that the data store understands.[1] Document-oriented data stores are inspired by Lotus Notes[2] and the simplicity of the JavaScript Object Notation (JSON) format. The current leading document-oriented data stores are MongoDB, CouchDB, and Riak.

MongoDB

MongoDB is an open source document-oriented data store that is usually classified as favoring the Consistency and Partition tolerance principles of the CAP Theorem. The term ‘Mongo’ comes from ‘humongous,’ as in the amount of data MongoDB allows you to store in its non-relational structure.

The company 10gen actively leads development on MongoDB and coordinates open source contributions. Core MongoDB functionality is written in C++ and official drivers are available for Java, C++, Python, Ruby, Scala and several other languages.[3] Drivers for Clojure, Groovy, R, Erlang, and many other languages are supported by community efforts. 10gen also provides commercial training and services to generate revenue which is partially reinvested in the data store’s development.[4]

MongoDB should only be run under 64-bit operating systems because of the way it addresses the data store. The limitation stems from MongoDB’s storage implementation using memory mapped files for performance reasons.[5] Running on 32-bit systems will work, but MongoDB will only be able to store about 2.5 gigabytes total - fine for some local development work but not for most production software.

Lingo

There are several terms commonly used in MongoDB literature:

Collection - roughly equivalent to a table in a relational database in that it contains zero to many documents.
Document - roughly equivalent to a row in a relational database in that it contains a logical grouping of data elements. Documents contain key-value pairs that represent stored data in MongoDB.
Schemaless - documents stored in the same MongoDB collection can have varying fields, and the elements within those documents are not held to a common structure.

Data Storage Structure

Data are stored and represented as documents in Binary JavaScript Object Notation (BSON).[6] The BSON notation is identical to standard JavaScript Object Notation (JSON) for most structures. For example, here is an order for a single coffee at a cafe:

{
  "_id" : ObjectId("4de2fefcfe376e36c3bc620b"),
  "coffee" : "Americano",
  "room_for_milk" : false,
  "price" : 3.95
}

In the preceding example there are four keys: "_id", "coffee", "room_for_milk", and "price". Each key has a single corresponding value: ObjectId("4de2fefcfe376e36c3bc620b"), "Americano", false, and 3.95, respectively. Each value has a data type. ObjectId("4de2fefcfe376e36c3bc620b") is an object identifier that is generated automatically when the document is inserted. "Americano" is a string. false is a Boolean. 3.95 is a 64-bit floating point number (note that floating point numbers should not be used to store monetary values in a production setting because they cannot represent most decimal fractions exactly, which leads to rounding errors). The four keys and values are wrapped in curly braces and the resulting structure is called a document.
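To make the warning about monetary values concrete, here is a minimal sketch in plain JavaScript (no MongoDB required). Storing integer cents is one common workaround, not something the MongoDB documentation prescribes, and the helper name formatCents is made up for illustration.

```javascript
// Binary floating point cannot represent most decimal fractions exactly.
const floatTotal = 0.1 + 0.2;
const isExact = floatTotal === 0.3; // false: the sum is slightly off

// Assumed workaround: keep money as an integer number of cents.
const priceInCents = 395; // the 3.95 Americano, in cents
const quantity = 3;
const totalInCents = priceInCents * quantity; // integer arithmetic stays exact

// Hypothetical helper: format non-negative cents for display only.
function formatCents(cents) {
  const dollars = Math.floor(cents / 100);
  const remainder = String(cents % 100).padStart(2, "0");
  return dollars + "." + remainder;
}

console.log(isExact);
console.log(formatCents(totalInCents));
```

The idea is to do all arithmetic on integers and convert to a decimal string only when showing a price to a person.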

There are six basic JSON data types as well as several additional data types in MongoDB. The original six JSON data types are:

null - represents both a null value and a nonexistent field
boolean - two values, true and false
numeric - 32-bit integer, 64-bit integer, and 64-bit floating point handled automatically by MongoDB
string - a UTF-8 string of characters
array - lists or sets of values that can be heterogeneous in type
object - a JSON object
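As a rough illustration of the six basic types, the sketch below parses a made-up coffee order with plain JavaScript and labels each value. The field names other than those used earlier in this post are invented for the example, and the jsonType helper is hypothetical. It classifies JSON values only and says nothing about how MongoDB stores them.

```javascript
// A made-up order that uses each of the six basic JSON types once.
const order = JSON.parse(`{
  "discount": null,
  "room_for_milk": false,
  "price": 3.95,
  "coffee": "Americano",
  "extras": ["cinnamon", 2, true],
  "customer": { "name": "Pat" }
}`);

// Hypothetical helper: name the JSON type of a parsed value.
function jsonType(value) {
  if (value === null) return "null";
  if (Array.isArray(value)) return "array";
  if (typeof value === "number") return "numeric";
  return typeof value; // "boolean", "string", or "object"
}

const types = {};
for (const key of Object.keys(order)) {
  types[key] = jsonType(order[key]);
}

console.log(types);
```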

MongoDB’s extended types beyond the basic JSON data types are:

embedded document
JavaScript code
minimum value
maximum value
object id
date
regular expression
symbol
binary data
undefined

MongoDB’s schema-less design allows the creation of documents with variable structure. The variable structure works well for rapid prototyping and avoids having to alter tables to add new attributes. However, the schema-less design also rules out the kind of constraints that SQL databases use to standardize data.
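The trade-off can be sketched with a plain JavaScript array standing in for a collection. This is an in-memory illustration under my own assumptions, not MongoDB code: the missingFields helper and the list of required fields are hypothetical, and they show the kind of check an application ends up doing itself when the data store does not enforce a schema.

```javascript
// In-memory stand-in for a collection: documents with differing fields.
const orders = [
  { coffee: "Americano", room_for_milk: false, price: 3.95 },
  { coffee: "Latte", price: 4.95, notes: "customer wants room for milk" },
  { coffee: "Drip" } // no price; nothing at the storage level rejects this
];

// Hypothetical application-side check for required fields.
function missingFields(doc, requiredFields) {
  return requiredFields.filter((field) => !(field in doc));
}

// Collect every document that lacks a required field.
const problems = orders
  .map((doc) => ({
    coffee: doc.coffee,
    missing: missingFields(doc, ["coffee", "price"])
  }))
  .filter((entry) => entry.missing.length > 0);

console.log(problems);
```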

There is less normalization involved in a typical MongoDB setup because there are no server-side joins.[7] Instead of joining separate relational tables, embedded objects can be inserted inside documents.
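Here is a small sketch of the difference, again in plain JavaScript with in-memory data. The customer and order structures are invented for the example, and joinOrder is a hypothetical helper that imitates what a relational join would do at read time.

```javascript
// Relational style: two "tables" linked by an id and joined when read.
const customers = [{ id: 1, name: "Pat" }];
const orderRows = [{ id: 10, customer_id: 1, coffee: "Latte", price: 4.95 }];

// Hypothetical helper that mimics a join between the two tables.
function joinOrder(orderRow) {
  const customer = customers.find((c) => c.id === orderRow.customer_id);
  return {
    coffee: orderRow.coffee,
    price: orderRow.price,
    customerName: customer.name
  };
}

// Document style: the customer is embedded, so a single read returns it all.
const orderDocument = {
  coffee: "Latte",
  price: 4.95,
  customer: { name: "Pat" }
};

console.log(joinOrder(orderRows[0]).customerName);
console.log(orderDocument.customer.name);
```

The embedded version trades duplication (the customer name is copied into each order) for a single lookup.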

Inserting

Data manipulation in MongoDB can be performed through the shell and its JavaScript syntax. For example, here is the syntax for inserting a document into the “mydb” collection:

> db.mydb.insert({"coffee" : "Latte", "price" : 4.95, "notes" : "customer wants room for milk"})

Inserts are non-blocking by default and do not wait for a response from the server. You can also specify “safe inserts” that wait for a response value from the server indicating whether the operation was successful or had an error.

Batch inserts are much faster than incremental data insertion. The MongoDB team recommends preallocating space with blank documents when performing numerous inserts of a predefined size.

Querying

10gen also touts MongoDB’s dynamic query language as a core feature that is critical to accelerating the development process. The query language is not SQL; instead, it is based on key and value matching. For example, here is a query to find all the documents with a value of “Latte” for the “coffee” key in the mydb collection:

> db.mydb.find({"coffee" : "Latte"})

The result of this command after executing the insertion from the previous section is:

{ "_id" : ObjectId("4df75a03d30a7515a35f5942"), "coffee" : "Latte", "price" : 4.95, "notes" : "customer wants room for milk" }

Note that querying on keys and values is case sensitive. If you instead used the following command…

> db.mydb.find({"COFFEE" : "latte"})

… the query would return no matching documents.
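To show why the second query comes back empty, here is a toy matcher in plain JavaScript. It is only an in-memory approximation of exact key and value matching, not how MongoDB implements find, and it ignores query operators entirely. The function and variable names are my own.

```javascript
// Toy matcher: a document matches when every key in the query is present
// in the document with a strictly equal value. Not MongoDB's implementation.
function find(collection, query) {
  return collection.filter((doc) =>
    Object.keys(query).every((key) => doc[key] === query[key])
  );
}

// In-memory stand-in for the mydb collection.
const mydb = [
  { coffee: "Latte", price: 4.95, notes: "customer wants room for milk" },
  { coffee: "Americano", room_for_milk: false, price: 3.95 }
];

const exact = find(mydb, { coffee: "Latte" });     // key and value both match
const wrongCase = find(mydb, { COFFEE: "latte" }); // neither key nor value matches

console.log(exact.length, wrongCase.length);
```

Because both the key lookup and the string comparison are exact, "COFFEE" does not find the "coffee" key and "latte" does not equal "Latte".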

That covers MongoDB’s background information, basic inserting, and querying. Next post I’ll cover updating, deleting, capped collections, and a few other things.

[1] http://stackoverflow.com/questions/3046001/what-does-document-oriented-vs-key-value-mean-when-talking-about-mongodb-vs-ca

[2] http://blogs.neotechnology.com/emil/2009/11/nosql-scaling-to-size-and-scaling-to-complexity.html

[3] http://www.mongodb.org/display/DOCS/Drivers

[4] http://www.10gen.com/

[5] http://blog.mongodb.org/post/137788967/32-bit-limitations

[6] http://www.mongodb.org/display/DOCS/BSON

[7] http://www.mongodb.org/display/DOCS/Schema+Design#SchemaDesign-Embedvs.Reference

Jun 14, 2011 1 note

#MongoDB #document-oriented data stores #NoSQL

"Beyond Data" in the Intelligence Sector

Bob Gleichau from In-Q-Tel wrote an interesting article entitled “Beyond Data.” As the article discusses, the intelligence sector adds complexity onto the difficult job of sorting, searching, and understanding large data sets. Some of the challenges Bob wrote about include:

The security clearance level maze between dozens of intelligence, homeland security, and financial regulatory agencies
Laws against collecting data on US citizens in certain cases, for example biometrics
Legacy government systems

There are also many additional challenges:

Disconnected security networks
Low network bandwidth and high latency in overseas locations, particularly war zones
Paper (no joke: in 2011 paper still holds a vast amount of the government’s data)
The people who sign the checks are not the system’s end users
Government contract structures favor waterfall over iterative development processes
No access to commodity cloud infrastructure services such as Google App Engine, Rackspace, or Amazon Web Services to perform distributed data analysis

One of the most interesting ideas from the article was embedding great developers with intelligence analysts to create and execute very complicated queries. I’m sure some agencies are doing this already but from my experience it’s not a common practice. (Private industry may need to do this in the future as well but that’s a different topic.)

A second important idea is the concept of allowing full search capabilities but masking search output when a user’s clearance is not high enough to see results. This is a very hard problem that involves user authority and access management, metadata markup, and clear, unambiguous rules for clearance resolution.
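A heavily simplified sketch of the masking idea follows. Everything in it is an assumption made for illustration: the three clearance labels, their ordering, the record fields, and the "[masked]" placeholder. Real clearance resolution involves far more than a single ordered list.

```javascript
// Hypothetical clearance labels, ordered from lowest to highest.
const LEVELS = ["unclassified", "secret", "top_secret"];

// Fail closed: an unknown label on either side means no access.
function canSee(userLevel, recordLevel) {
  const user = LEVELS.indexOf(userLevel);
  const record = LEVELS.indexOf(recordLevel);
  if (user === -1 || record === -1) return false;
  return user >= record;
}

// Search runs over every record; the output is masked per record afterwards.
function search(records, term, userLevel) {
  return records
    .filter((r) => r.text.includes(term))
    .map((r) =>
      canSee(userLevel, r.level)
        ? { id: r.id, text: r.text }
        : { id: r.id, text: "[masked]" }
    );
}

// Made-up records, each marked up with a clearance level as metadata.
const records = [
  { id: 1, level: "unclassified", text: "port schedule for harbor" },
  { id: 2, level: "top_secret", text: "harbor surveillance summary" },
  { id: 3, level: "secret", text: "unrelated memo" }
];

const results = search(records, "harbor", "secret");
console.log(results);
```

The user learns that a second matching record exists but not what it says, which is one possible reading of "full search with masked output."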

Finally, one last concept that isn’t in the article but is crucial: the government needs to be careful about throwing money at hard problems. Building information systems (including data analysis systems) isn’t like designing a new fighter jet. It’s amazing what a small team of six to eight capable software developers with a passion for intelligence community domain challenges can accomplish when given access to large data sets and the freedom to choose their own tools. That’s why companies like LinkedIn, Facebook, and Google are successful with using data to generate business value.

Article: Beyond Data (IQT Quarterly), see also Data Science in the U.S. Intelligence Community (IQT Quarterly)

Jun 8, 2011 1 note

#big data #intelligence community

Google App Engine In-depth Article

The Register has a great article on Google App Engine, Google’s scalable Platform-as-a-Service that will be removing the beta label later this year. App Engine is built upon BigTable, Google’s proprietary Column Family NoSQL data store. I’ve created several apps on Google App Engine, including http://www.mattmakai.com/ and http://scholarmaker.com/. Once you get past the standard Column Family data store quirks and understand the App Engine API (I used the Python version), it’s very easy to deploy an app and have it ready to scale to potentially millions of visitors.

Article: Google App Engine (The Register)

Jun 8, 2011 1 note

#Google App Engine #BigTable #NoSQL

On College and the IT Field

I needed to go to college to be successful. I required the disciplined studying, mentoring from my professors, social learning through meeting new friends, and enriching experiences from the general community at James Madison University. Even though I’ve been using computers since I was 3 years old and programming since sixth grade, I needed classes on operating systems, programming languages, computer networking, and information security to be successful in my career.

So I watch with some dismay as influential figures rail against the college model. I agree with a lot of the things Peter Thiel discusses. It’s important to provide an alternative development model for insanely smart people. Some people don’t need college because they already have all of the drive and intelligence to get started now. College simply slows those insanely smart people down!

But I’m not that smart. I simply was not ready to be a full-time member of the real world until I spent countless hours studying in the library and in front of a computer learning computer science.

I don’t think I’m unusual. Sure, in information technology I would hire a better software developer with a high school degree over a developer with a college degree any day. But I rarely see that. While it’s possible in theory to be successful without a college degree in information technology, completing a computer science degree at a good college is a strong signaling mechanism. The degree is neither necessary nor sufficient, but it provides a starting point for discussions about background in software development.

Side projects, technical blogging, past experience, open source contributions, enthusiasm, and dedication to constant learning should make or break decisions on whether or not to hire a software developer. But often those topics are so heavily influenced by learning from college in programming language theory, algorithms, and software engineering practices that it’s hard to pull them apart.

There are many issues with the college model besides holding back really smart people: extraordinary costs, massive student loans, majors of questionable value, and grade inflation. But in IT, while in theory you can be successful without a college degree, the degree is a strong signaling mechanism that you can set your mind to finishing a major commitment to learning and education.

Jun 4, 2011

#college