The Story So Far
You can get a sense of my background in the previous post: I’ve done a lot of work developing analytics and processes to help businesses get better, faster, and more efficient results. This was usually in a garden-variety enterprise IT landscape: data warehouses, production systems, and PC-based tools. Now I’m learning how companies make sense of massive quantities of data that were never available before, and how they use that data to make better decisions.
I’m taking a series of online courses from UC San Diego through Coursera. The content can be uneven, but it is mostly pretty good and keeps me moving through a structured introduction that’s a couple of steps above “hello world” for big data. That alone makes it worth it to me.
If you start looking into big data, one of the first things you’ll run into is Hadoop. Hadoop means different things to different people, but first and foremost it means using the Hadoop Distributed File System (HDFS). As far as I can tell, the vast majority of people doing big data are using Hadoop in one configuration or another. There are a lot of configurations available, but they all use HDFS.
HDFS is an open source file system written in Java. It splits large datasets into pieces (“blocks”) and spreads them across an array of servers (“data nodes”). The massive job of analyzing the data can then be done in parallel across the array, with the partial results combined into the final results. HDFS also manages redundancy, making sure each piece of data is stored on multiple data nodes (the default is 3) so that the job can be finished even if a server in the array fails completely.
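To make the splitting and redundancy idea concrete, here is a toy Python sketch. It is not how HDFS is actually implemented: the function names, the tiny block size, the node names, and the simple round-robin placement are all my own assumptions for illustration (real HDFS uses much larger blocks and a smarter placement policy). It only shows the principle that each piece is stored on 3 nodes, so losing any one node loses no data.

```python
# Toy model of splitting a dataset into pieces and replicating each piece.
# All names here are hypothetical; this is not the HDFS API.

def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the data into consecutive pieces of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int = 3) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes, round-robin (an assumption)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_after_failure(placement: dict[int, list[str]], failed_node: str) -> bool:
    """True if every block still has at least one copy on a surviving node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


data = b"x" * 1000                      # stand-in for a large dataset
blocks = split_into_blocks(data, 256)   # tiny block size, for illustration only
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_replicas(len(blocks), nodes)

for block_id, replicas in placement.items():
    print(block_id, replicas)
print(all(readable_after_failure(placement, n) for n in nodes))
```

The last line checks every possible single-node failure in this small example; with three copies of each piece, the data stays readable in each case.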
On top of the HDFS foundation, a bunch of different applications can be part of the Hadoop stack in a particular installation, many with funny names like Pig, Hive, Spark, YARN, Sqoop, HBase (OK, that one’s not so funny), and others. These applications build on HDFS and basically create processing instructions that can be sent to the data nodes and executed in parallel.
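The "send instructions to the data, run them in parallel, combine the results" pattern can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how any of these tools work internally: the function names are hypothetical, the three strings stand in for pieces of a dataset, and a local thread pool stands in for the data nodes.

```python
# Toy illustration of running the same instruction on every piece in parallel
# and then combining the partial results. Names are hypothetical.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(piece: str) -> Counter:
    """The 'processing instruction': count the words in one piece of data."""
    return Counter(piece.split())


def parallel_word_count(pieces: list[str]) -> Counter:
    """Run count_words on each piece concurrently, then merge the counts."""
    with ThreadPoolExecutor(max_workers=max(1, len(pieces))) as pool:
        partial_counts = list(pool.map(count_words, pieces))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # adds counts together
    return total


pieces = ["pig hive spark", "spark hbase", "hive spark"]
totals = parallel_word_count(pieces)
print(totals.most_common())
```

Each worker only ever sees its own piece, and the final answer comes from merging the partial counts, which is the same shape as the split-process-combine flow described above.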
As I said, I’m finding the UC San Diego courses very helpful, and I would recommend them if you want some help getting your feet wet, but they aren’t free. You can check the prices on the Coursera site. It would be helpful to have some basic programming skills if you want to move beyond the intro course.
2 thoughts on “The Story So Far”
Hey Mark, I’m using Stata (new to it) to access small data but we should talk about this sometime.
I’ll check out Stata. Would be happy to chat. Thanks!