I haven’t posted in a while, and my new project isn’t in the sweet-spot of the blog’s focus, but I wanted to put some information up about developing for the Google Home, Echo, and text-based chatbots. These kinds of devices are certainly contributing to the massive growth of data that businesses have to work with, and the foundations of these devices (natural language processing, intents, and contexts) are leaning on machine learning to improve and expand their capabilities.
When the Google Home went on sale on Black Friday last November, I had to get one. Then shortly after that, over beers with a good friend who also has a Google Home, we made a list of ideas for apps to develop for the Google Home. The one that floated to the top was a “Choose Your Own Adventure” (CYOA) style app that would allow children (and potentially adults) to listen to a story and direct the choices by voice. This idea had a nice combination of simplicity, fun, and potential appeal.
Google opened up an API to developers in December via their acquisition, api.ai, and I started work in January. Api.ai has a great set of learning tools, documentation, and turnkey integrations. Luckily, I have a neighbor who is a children’s writer, and he was really excited about participating in our little experiment. He started writing a custom story in the CYOA style while I completed the code to deliver the story via the Google Home device, along with a tool to import the stories from a Google Sheet into a NoSQL database (RavenDB, which is written in C# and designed to work with .NET applications).
Once the basic application was set up in api.ai, turning on and configuring the app for Facebook Messenger (not voice-directed, but similar text-based chatbot interface) was a pretty straightforward process and only took the better part of a day. The next steps will include adapting the application to work with Amazon’s Echo device which will dramatically expand the potential audience for StoryTree.
I may share more details on how the app was put together, and on future developments, in another blog post; in the meantime, if you have any questions, feel free to ask. Mark.
CSV File of Census Tract Shapefile Data for Entire US
I’ve seen this request a few times, and saw it again today. The Tableau web data connector I created for census tract shapefiles requires the user to pull the data state by state. How can someone get all states at once?
I looked at adding this through the normal interface, and I’m sure that would still be ideal, but I think the easiest route (for me) is to just export the entire dataset to one file that you can download and import into your Tableau workbook. That’s what I did.
If you want to download the entire US dataset in CSV (about 72MB), use the link below:
I’ve been distracted by another project I’m working on, so I haven’t been able to update this blog as often as I’d like, but I am happy that people have been using and benefiting from some of the learning projects I’ve posted here. I thought I’d put up a few numbers.
Number of commuter animations viewed (US): 327,292
Number of commuter animations viewed (UK): 22,404
Number of Tableau-ready census datasets and shapefiles requested: 3,560
Please feel free to leave a comment if you find the info in this blog helpful!
Also, some people have requested my Tableau viz file that contained the commuter data at the country and state level. Here’s a link to the file on my public Dropbox folder.
After creating the Tableau Web Data Connectors (WDCs) in the previous posts, I thought I’d learn how to create some basic visualizations in Tableau using them to extract US Census data. In keeping with another popular theme of the blog, I focused on the commuting data available in the US Census American Community Survey.
I’m not sure exactly what I expected from Tableau. It is really popular right now and is, no doubt, a slick presentation tool. Like any complex app, it has a learning curve, and I can say that I did not find it intuitive at times. I frequently felt like I was having to kludge or hack my way to what I wanted. After a stretch of spinning my wheels, using it started to feel easier and easier as the concepts sank in. I got Tableau to do most of what I wanted, but all too often, things that felt as if they should have been simple weren’t at all.
To give but one example of this, I had a table with columns (as you do) and wanted the columns to have different, column-specific widths based on the contents and the heading labels. However, changing one column width changed all of them simultaneously and uniformly. After trying a few things, I searched and found this solution to having variable column widths in a Tableau table. It’s possible that I’m too used to Excel, but that surprised me.
In any case, there’s no doubt that Tableau has a lot to offer and provides an amazing, browser-based data presentation toolset. This is clearly why it is so popular and growing.
A couple of quick things I got hung up on while learning to work effectively in Tableau. One is that it is much, much easier to work with data organized downwards in rows rather than across in columns. For example, rather than a data table with columns for coffee sales, tea sales, and soda pop sales, with each row representing a day, it is much better to have a “Sales” column and a “Category” column, where Category is “Coffee”, “Tea”, or “Soda Pop” and there is a row for each Category/Day combination.
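The wide-to-long reshaping described above can be sketched in a few lines. Here’s a hypothetical pure-Python version with made-up sales figures (pandas users would reach for `melt`):

```python
# Reshape a "wide" table (one column per product) into the "long"
# form Tableau prefers: one Sales value per Category/Day row.
wide_rows = [
    {"Day": "Mon", "Coffee": 120, "Tea": 45, "Soda Pop": 80},
    {"Day": "Tue", "Coffee": 95,  "Tea": 60, "Soda Pop": 70},
]

def to_long(rows, id_col, cat_name="Category", value_name="Sales"):
    long_rows = []
    for row in rows:
        for col, val in row.items():
            if col == id_col:
                continue
            long_rows.append({id_col: row[id_col], cat_name: col, value_name: val})
    return long_rows

long_rows = to_long(wide_rows, "Day")
# Each Category/Day combination is now its own row.
print(long_rows[0])  # {'Day': 'Mon', 'Category': 'Coffee', 'Sales': 120}
```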
Secondly, know that a Tableau dashboard will not know about any variable (measure value or measure name) unless it exists in the worksheet visualization or is listed on the “Marks” card. That last one is a bit specific, but it was helpful once I got it.
Thanks for visiting! Email me at firstname.lastname@example.org if you have any questions.
As a follow up to my previous post on the US Census Tableau web data connector I created, I wanted to also share another web data connector that allows Tableau users to selectively import Census Tract shapes for mapping within Tableau.
The Census Tract is a key geography within US Census data, and it is the one that my commute map was based on. I wanted to figure out how to map at this level as I played with the census data so I could drill down below the county level using the built-in Tableau maps.
The elusive census tract “shapefiles” seemed mysterious, but it also seemed like there could only be one real logical way to store and communicate this info: with a series of coordinates (latitude and longitude in this case) that define a shape. After finding a website dedicated to this topic, I downloaded its very helpful Tableau workbook that had shapefiles for all census tracts in the US. This can be used to get the info you need, and it is what I imported into my SQL server and used to feed my web data connector.
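That “one logical way” can be illustrated with a toy snippet. The field names below mirror the connector’s columns (GEOID, PointOrder, PointLatitude, PointLongitude), but the coordinates themselves are invented:

```python
# A census-tract shape is just an ordered list of lat/long points.
# Field names mirror the connector's columns; coordinates are made up.
tract_shape = [
    {"GEOID": "26161400100", "PointOrder": 2, "PointLatitude": 42.31, "PointLongitude": -83.74},
    {"GEOID": "26161400100", "PointOrder": 1, "PointLatitude": 42.30, "PointLongitude": -83.75},
    {"GEOID": "26161400100", "PointOrder": 3, "PointLatitude": 42.29, "PointLongitude": -83.73},
]

# Sorting by PointOrder recovers the drawing path for the polygon.
path = [(p["PointLatitude"], p["PointLongitude"])
        for p in sorted(tract_shape, key=lambda p: p["PointOrder"])]
print(path)
```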
However, I thought this web data connector might be slightly more elegant and easier to use than pulling the info from that master file. In the example below, I’m adding census tract mapping data and building a heat map in a Tableau worksheet that I used to import census data at the census tract level for the state of Michigan. I imported the percentage of homes built before 1939. See my previous post on how to do this.
Mapping Census Tracts Using the Web Data Connector
To use the census tract shapefile web data connector, “add a connection” to your workbook. For the URL, enter:
After a few seconds, an interface will pop up and allow you to choose the state, county, or individual census tract you’d like to download shapefiles for.
Select your criteria and click on “Retrieve Census Tract Shapefile Data”.
As the data is loaded, Tableau will ask you to define the relationship between the tract level data you had in your workbook and the new shapefile information. The GEOID column contains the identifier for the shape information. If you used my census data connector to import the census data, then use the value “FIPS” from that table and, as indicated, “GEOID” for the shape table info.
This will establish the link between the two tables you now have in your Tableau workbook. Click on “update now” to finish creating the table with the shape information.
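The FIPS-to-GEOID relationship Tableau asks about is an ordinary key join. A rough pure-Python equivalent, with illustrative data only:

```python
# Census data is keyed by FIPS, shape data by GEOID; the join matches
# rows where the two identifiers are equal, so every shape point
# carries the measure for its tract.
census_rows = [{"FIPS": "26161400100", "PctPre1939": 41.2}]
shape_rows = [
    {"GEOID": "26161400100", "PointOrder": 1, "PointLatitude": 42.30, "PointLongitude": -83.75},
    {"GEOID": "26161400100", "PointOrder": 2, "PointLatitude": 42.31, "PointLongitude": -83.74},
]

by_fips = {row["FIPS"]: row for row in census_rows}
joined = [
    {**shape, **by_fips[shape["GEOID"]]}
    for shape in shape_rows
    if shape["GEOID"] in by_fips
]
print(len(joined))  # one output row per matched shape point
```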
Now create a sheet. Right-click on “PointOrder” and change it to a dimension. Right-click on “PointLatitude” and change it to “Geographic Role –> Latitude”, then do the same for “PointLongitude”, changing it to “Geographic Role –> Longitude”. Double-click on each of these to add them to the sheet.
Right click on the measure you want to map (we’ll assume a heat map) and select “Add to sheet”.
On the “Marks” card, change the dropdown value at the top to “Polygon”. Drag the dimension “PointOrder” and drop it on the “Path” icon on the Marks card. This tells Tableau the order in which to draw connected points on the map.
Drag the GEOID dimension onto the bottom area of the Marks card.
Drag the SubPolygonId dimension onto the bottom area of the Marks card. As a quick aside, I wondered why this value is needed: it distinguishes the separate polygons that make up a shape, and it only really matters when a particular census tract has non-contiguous regions, that is, completely separate areas that need to be defined and shaded but are part of the same defined census tract.
Finally, click on the control directly to the left of the measure you added to the Marks card and select “Color”.
Your “Marks” card should look like the example below.
If everything went according to plan, you should have a heat map displayed like the one shown above. In the example above, the dark blue regions have higher percentages of homes built before 1939 and the lighter regions have lower percentages. If you don’t have this… well, leave a comment or email me and I’ll see if we can figure it out.
Hope this new shapefile source adds a tiny bit of extra functionality to Tableau for those of you who work with data by census tract. Please feel free to comment or email me if you have questions or suggestions. Thanks.
Tableau seems to be very hot right now. At least I seem to see it listed frequently when I’m looking through job postings. From my perspective, Tableau takes a lot of the presentation power of Excel (not so much the high power modeling features of Excel) and moves it into a more contemporary platform, by which I mean it is browser-based. I’ve developed many, many scorecards, charts, visualizations, reports, and analytical tools within Excel and it makes perfect sense to adapt these kinds of deliverables into a browser-based tool.
Tableau has a public version that you can download for free and use to create and publish workbooks to Tableau Public for everyone to see. I’m still learning the features, and there are a lot of learning resources on the web if you search around.
One thing I wanted to do was to tap into a great, diverse data source and create some Tableau workbooks to play around with. My commute map used US Census data from the American Community Survey, and when it was making the rounds back in June I was contacted by the Census Bureau asking if I was aware of their API. I wasn’t, though I wasn’t surprised to learn of one. I decided to tap into this data as I learn Tableau, because the Census Data has great breadth of information but also has a terrific multi-tiered geographic component.
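For anyone curious what a raw query to the Census API looks like, here is a hedged sketch that builds a request URL for the ACS 5-year profile data. The variable code and geography below are illustrative; check the Census API documentation for the exact dataset paths and variable names:

```python
from urllib.parse import urlencode

# Build a request URL for the Census API's ACS 5-year profile data.
# DP02_0001E is an illustrative variable code; "26" is Michigan's
# state FIPS code. (A free API key is required for heavy use.)
base = "https://api.census.gov/data/2014/acs5/profile"
params = {
    "get": "NAME,DP02_0001E",  # variable(s) to retrieve
    "for": "tract:*",          # one row per census tract...
    "in": "state:26",          # ...within Michigan
}
url = base + "?" + urlencode(params, safe=":,*")
print(url)
# Fetching this URL returns JSON: a header row of column names
# followed by one row per geography.
```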
Tableau has the ability out of the box to accept data from a wide variety of data sources, but it also has a way for developers to build a conduit into Tableau from any data source. This standardized data conduit into Tableau is called a “Web Data Connector” and a number of organizations have built these to enable Tableau users to import data. I looked for a Web Data Connector (WDC) for the Census API, but wasn’t able to find a working version. Always one to support scope-creep, I decided to build one myself and I did. It connects to the ACS Profile data for the five year estimate ending in 2014.
Unfortunately, I built it based on their WDC version 1.1 just as they were transitioning to a new version (2.0) with more features. If this gets any real use, I’ll look at upgrading the connector to 2.0. Below is a step by step on how to retrieve Census Data into Tableau using my WDC:
Importing Census Data into Tableau
Obviously, in order to use the WDC, you have to have Tableau loaded and if you do and you start a new workbook, you should see that you can choose “Web Data Connector.”
When you do choose “web data connector”, you’ll see a screen that asks for the WDC’s URL. To use my Census Data WDC, you’ll want to enter:
Once the data connector interface loads, you can choose a Geography (State, County, SubCounty, Tract, Zip Code, etc.) that you would like to retrieve and choose which variables you would like to include. The variables are one of the harder things to manage here. There are lots of them (about 2,500) and the names are cryptic. The list of variables for this particular Census dataset can be used as a reference.
You can use the hyperlinked text just above the variable list to include or exclude variable types. For example, if you want to include only “Estimates” then unselect the other three types “MarginOfError”, “PercentEstimate”, and “PercentMarginOfError”.
“Estimate” is the nominal value you’re probably most interested in. For example, if the variable is “population”, then the “estimate” would be the estimated number of people living in the selected geography.
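The four variable types map to suffixes on the ACS variable codes, E, M, PE, and PM for Estimate, MarginOfError, PercentEstimate, and PercentMarginOfError, if I recall the convention correctly (verify against the variable list). A quick way to keep only estimates from a list of codes:

```python
# ACS profile variable codes end in a type suffix: E = Estimate,
# M = MarginOfError, PE = PercentEstimate, PM = PercentMarginOfError.
# (Suffix convention as I understand it; check the variable list.)
variables = ["DP02_0001E", "DP02_0001M", "DP02_0001PE", "DP02_0001PM"]

def is_estimate(code):
    # "PE" also ends in "E", so exclude percent variables first.
    return code.endswith("E") and not code.endswith("PE")

estimates = [v for v in variables if is_estimate(v)]
print(estimates)  # ['DP02_0001E']
```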
Once you’ve selected your criteria, click on the “Retrieve Census Data” button at the top of the web data connector interface. The system will then retrieve the data from the Census API and convert it into a format that Tableau can readily accept. You can then click on “UPDATE NOW” when the data is returned to finish creating the data table inside your new Tableau workbook.
Many of the geography types provided by this WDC are supported for mapping in Tableau so you can create heat maps and other visual representations quickly and easily once you have the data imported. Some of them work automatically, others need a little help. Feel free to email me or post a comment if you have questions on this.
I also created a shapefile web data connector for the US Census Tracts, a geography that isn’t natively supported within Tableau. I’ll be posting on how to use this to map Census data by Census Tract.
If this tool gets any significant use, I’ll consider adding a dropdown at the top to allow users to choose data from other US Census datasets (10 year full census or other ACS datasets.) Let me know if you have any suggestions on how to extend this functionality. Here’s a link to a quick map I put together using this data. It shows Census Tracts in Washtenaw County, Michigan by percentage of residents with graduate degrees. The bluer regions have higher levels of the population with grad degrees and the greener areas have lower levels. Can you guess where the University of Michigan is on this map?
Azure Machine Learning, Knime, and Spinning Your Own Hadoop Cluster
As part of learning about Big Data, I took an online course on machine learning and played around with some of the concepts. Big Data and machine learning are two different things that frequently get conflated. Big Data is the field of deriving value from and managing huge amounts of data, levels of data beyond what organizations have ever had to deal with before. Machine learning is a discipline that uses algorithms and statistical methods to find patterns in training data that can then be applied to new data to make predictions. It is frequently used in tandem with Big Data because part of the value of all that data is finding ways to learn and predict from it. The two overlap, but machine learning can involve data volumes that are not really “big”, and Big Data encompasses a lot more than just machine learning.
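The learn-from-training-data, predict-on-new-data loop can be shown with a toy nearest-neighbor classifier. This is pure Python with made-up points, just to make the pattern concrete:

```python
import math

# Toy 1-nearest-neighbor classifier: "training" is just memorizing
# labeled points; "prediction" finds the closest memorized point
# and returns its label.
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
            ((5.0, 5.0), "B"), ((4.8, 5.2), "B")]

def predict(point):
    def dist(example):
        (x, y), _label = example
        return math.hypot(point[0] - x, point[1] - y)
    return min(training, key=dist)[1]

print(predict((1.1, 0.9)))  # near the "A" cluster -> A
print(predict((5.1, 4.9)))  # near the "B" cluster -> B
```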
While taking my machine learning course, I was introduced to an open source tool called Knime, a GUI toolkit (and probably much more, but that was my take) for machine learning that I really liked. In fact, while experimenting with machine learning on Amazon’s AWS (cloud servers) I kept thinking how nice it would be to have a tool like Knime that could link directly to my datasets in the cloud. It’s entirely possible that Knime supports this, but Azure has similar features built in.
As an aside, I created my own Hadoop cluster in AWS from scratch from generic Linux servers using this handy blog post. I don’t really recommend this except as a learning exercise (or if you know something I don’t) since Amazon offers its own flavor of Hadoop as a turnkey option. Amazon also offers specific machine learning instances, but I have to say I didn’t find it particularly intuitive or useful, at least in my use case.
That brings me to Azure’s machine learning solution. I assumed that AWS would be ahead in this area because of their renown in the area of cloud computing, but that doesn’t seem to be the case as I recently discovered at a tech meetup at Ann Arbor’s DUO offices.
Jennifer Marsman, an evangelist for Microsoft, presented a great demo of using Azure’s Machine Learning tools to create a web service that predicts whether a Titanic passenger would survive, a scenario based on Kaggle’s well-known learning exercise. You can see the full presentation in the link below; if you’re playing with machine learning, I highly recommend you watch it, or at least check out the Azure site. That’s all for now. Cheers.
In addition to spending a week in Italy on a somewhat deserved vacation, I’ve been working on some projects that haven’t made it here yet. I’ve neglected this blog a bit, but I have a few different things that will shortly come to fruition that I will post about soon. The commute map got picked up by Wired.com and then it exploded (to be clear, it exploded in a modest “map of commuters” way, not in a global dance sensation “Gangnam Style” way). At this point, I have over 300,000 page hits on the commute map page. See here for a list of all the wonderful online publications that helped to connect web surfers with my hypnotizing animated dots.
Now, like a hit movie that results in the inevitable sequel, a new version of the commute map is back that is hopefully more Godfather II than Phantom Menace. I’ve just published a new interactive commute map for England and Wales that is very similar to the original US version. This is fully thanks to my long-distance colleague Alasdair Rae who inspired the original commute map with his GIFs, and who recently contacted me to propose this new map using data he has available. Unfortunately, Scotland always has to be different and their data is issued separately with a different methodology. Having a Scottish wife and knowing many Scots well, I’m certain that their version, having been perfected when they undoubtedly first invented the concept of a commuter census, is far superior to the southern regions’ attempt.
Not sure there’s a lot more to say that I didn’t say in the original posting about how it was put together. Obviously, in this case the data didn’t come from the US Census but from the UK equivalent. Instead of counties, this version uses Local Authority Districts (LADs), and instead of census tracts, it uses something called a “middle layer super output area” (MSOA) as the most granular level of geography.
Please feel free to send me an email or leave a comment. Thanks for visiting!
You can get a sense of my background in the previous post: I’ve done a lot of work developing analytics and processes to help businesses get better, faster, and more efficient results. This was usually in a garden-variety enterprise IT landscape: data warehouses, production systems, and PC-based tools. Now I’m working on learning how companies make sense of the massive quantities of data never before available to help make better decisions.
I’m taking a series of online courses from UC San Diego through Coursera. The content can be uneven, but it is mostly pretty good and keeps me moving through a structured introduction that’s a couple of steps above “hello world” for big data. That alone makes it worth it to me.
If you start looking into big data, one of the first things you’ll run into is Hadoop. Hadoop means different things to different people, but first and foremost it means using the Hadoop Distributed File System (HDFS). As far as I can tell, the vast majority of people doing big data are using Hadoop in one configuration or another; there are a lot of configurations available, but they all utilize HDFS.
HDFS is an open source file system written in Java that splits large datasets up into pieces and spreads them out onto an array of servers (“data nodes”) where the massive job of analyzing the data can be done in parallel across the array and then reassembled into the final results. It also manages redundancy, making sure each piece of data is stored on multiple data nodes (the default is 3) so that the job can be finished even if a server in the array fails completely.
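A toy sketch of the split-and-replicate idea described above. The block size, node names, and round-robin placement are all invented for illustration (real HDFS blocks default to 64 or 128 MB, and placement is rack-aware):

```python
import itertools

# Toy model of HDFS placement: split a file into fixed-size blocks,
# then record 3 data nodes (the default replication factor) for each.
DATA_NODES = ["node1", "node2", "node3", "node4"]
REPLICATION = 3
BLOCK_SIZE = 8  # bytes here; real HDFS defaults are 64/128 MB

def place_blocks(data):
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    nodes = itertools.cycle(DATA_NODES)
    placement = {}
    for idx, block in enumerate(blocks):
        # Round-robin here; real HDFS chooses nodes rack-awarely.
        placement[idx] = (block, [next(nodes) for _ in range(REPLICATION)])
    return placement

placement = place_blocks(b"a large dataset split into pieces")
for idx, (block, nodes) in placement.items():
    print(idx, block, nodes)
```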
On top of the HDFS foundation, a bunch of different applications can be part of a Hadoop stack in a particular installation, many with funny names like Pig, Hive, Spark, YARN, Sqoop, HBase (ok, that one’s not so funny), and others. These applications leverage HDFS and basically create processing instructions that can be sent to the data nodes and executed in parallel.
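The “processing instructions sent to the data nodes” idea is essentially MapReduce. Here’s a minimal single-machine word count in that style, purely illustrative and not Hadoop’s actual API:

```python
from collections import defaultdict

# MapReduce-style word count: map each chunk to (word, 1) pairs,
# shuffle (group) by key, then reduce by summing. In Hadoop the map
# and reduce steps run in parallel across data nodes; here they run
# in sequence on one machine.
def map_phase(chunk):
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    return {word: sum(counts) for word, counts in grouped.items()}

chunks = ["big data big ideas", "big data tools"]  # one chunk per "node"
pairs = [p for chunk in chunks for p in map_phase(chunk)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"])  # 3
```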
As I said, I’m finding the UC San Diego courses very helpful, and I would recommend them if you want some help getting your feet wet, but they aren’t free. You can check the prices on the Coursera site. It would be helpful to have some basic programming skills if you want to move beyond the intro course.
Since I got my first real job, I’ve always gravitated to the most technical aspects of the work. Over time, as my aptitude and skills became clearer I moved into roles that were specifically in the technical realm and I always felt comfortable there. Being really good at something is a surefire way to fall in love with doing it, and that’s what happened to me.
In the beginning, there was no Internet. PCs were connected via a LAN, and internal office email was new. Fax machines chugged away, and documents were FedExed all over the place all the time. Most heavy lifting was done on mainframes or other production systems, which held all the data. Some of us who had the aptitude and ability started using PC database applications like dBase or Paradox and spreadsheets like Lotus 1-2-3 (Excel wasn’t out yet) to make our offices smarter and more efficient. There was a lot of MacGyvering going on to extract and manipulate data out of and back into production systems.
Now I’m setting out to explore a new personal frontier: big data, shorthand for the challenge of turning mind-boggling amounts of data into value. I’ll be posting my progress and thoughts here. Feel free to email me and thanks for stopping by.