Working on a few things this week. In addition to keeping up with the Hadoop course I mentioned in a previous post, I’m working on prepping some data to create visualizations related to commuting in Michigan. This is publicly available data and I’ll post more about it soon.
In addition, I stumbled across a website called kaggle.com. If you’re interested in the topic of big data analytics, you should take a look at their site. They host worldwide competitions, many with cash prizes, where individuals or teams strive to develop the best predictive algorithms given a dataset. They also have great info for learning the associated techniques. I’ve posted a quick video overview below.
You can get a sense of my background in the previous post: I’ve done a lot of work developing analytics and processes to help businesses get better, faster, and more efficient results. This was usually in a garden-variety enterprise IT landscape: data warehouses, production systems, and PC-based tools. Now I’m working on learning how companies make sense of the massive quantities of data never before available to help make better decisions.
I’m taking a series of online courses from UC San Diego through Coursera. The content can be uneven, but it is mostly pretty good and keeps me moving through a structured introduction that’s a couple of steps above “hello world” for big data. That alone makes it worth it to me.
If you start looking into big data, one of the first things you’ll run into is Hadoop. Hadoop means a lot of things to different people, but first and foremost it means using the Hadoop Distributed File System (HDFS). As far as I can tell, the vast majority of people doing big data are using Hadoop in one configuration or another; there are a lot of configurations available, but they all rely on HDFS.
HDFS is an open source file system written in Java that splits large datasets up into pieces and spreads them out onto an array of servers (“data nodes”), where the massive job of analyzing the data can be done in parallel across the array and the partial results reassembled into a final answer. It also manages redundancy, making sure each piece of data is stored on multiple data nodes (the default is 3) so that the job can be finished even if a server in the array fails completely.
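To make the splitting-and-replication idea concrete, here’s a toy sketch in Python. This is not real HDFS (which works in 128 MB blocks and uses rack-aware placement); the block size, node names, and round-robin placement are all made up for illustration.

```python
# Toy illustration of the HDFS idea: cut a dataset into fixed-size
# blocks, then store each block on 3 different "data nodes" so the
# data survives a node failure. All names/sizes here are invented.

BLOCK_SIZE = 16          # bytes per block (real HDFS defaults to 128 MB)
REPLICATION = 3          # HDFS's default replication factor
NODES = ["node1", "node2", "node3", "node4", "node5"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Cut the raw data into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes=NODES, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes.
    (Simple round-robin here; real HDFS placement is rack-aware.)"""
    placement = {}
    for b in range(len(blocks)):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

data = b"commuting data for Michigan, split across the cluster!"
blocks = split_into_blocks(data)
placement = place_blocks(blocks)
for b, ns in placement.items():
    print(f"block {b}: stored on {ns}")
```

Because every block lives on three nodes, losing any single node still leaves two copies of each block, which is the property that lets a job finish through a hardware failure.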
On top of the HDFS foundation, a bunch of different applications can be part of a Hadoop stack in a particular installation, many with funny names like Pig, Hive, Spark, YARN, Sqoop, HBase (ok, that one’s not so funny), and others. These applications leverage HDFS and basically create processing instructions that can be sent to the data nodes and executed in parallel.
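The classic example of those parallel processing instructions is MapReduce word count. Below is a minimal sketch of the map → shuffle → reduce pattern in plain Python; it runs on one machine, but each phase is the kind of step Hadoop ships out to the data nodes (and that tools like Pig and Hive ultimately compile down to).

```python
# A minimal MapReduce-style word count, sketched in plain Python
# to show the map -> shuffle -> reduce pattern. On a real cluster,
# map and reduce tasks would run in parallel on many data nodes.
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in the line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data is big", "hadoop handles big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)
```

The point isn’t the word counting itself; it’s that each phase touches data independently, so the work can be scattered across an array of servers and the results merged at the end.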
As I said, I’m finding the UC San Diego courses very helpful and I would recommend them if you want some help in getting your feet wet, but they aren’t free. You can check the prices on the Coursera site. It would be helpful to have some basic programming skills if you want to move beyond the intro course.
Since I got my first real job, I’ve always gravitated to the most technical aspects of the work. Over time, as my aptitude and skills became clearer, I moved into roles that were squarely in the technical realm, and I always felt comfortable there. Being really good at something is a surefire way to fall in love with doing it, and that’s what happened to me.
In the beginning, there was no Internet. PCs were connected via a LAN and internal office email was new. Fax machines chugged away, and documents were FedExed all over the place all the time. Most heavy lifting was done on mainframes or other production systems, which held all the data. Some of us who had the aptitude and ability started using PC database applications like dBase or Paradox and spreadsheets like Lotus 1-2-3 (Excel wasn’t out yet) to make our offices smarter and more efficient. There was a lot of MacGyvering going on to extract and manipulate data out of and back into production systems.
Now I’m setting out to explore a new personal frontier: big data, shorthand for the challenge of turning mind-boggling amounts of data into value. I’ll be posting my progress and thoughts here. Feel free to email me and thanks for stopping by.