It’s been a long time since I posted on this blog. That’s partly because I got a real job that keeps me pretty busy, and partly because I can no longer blithely share the things I work on the way I used to.
I started working in what I eventually realized was a data engineer role. To be honest, I’m not sure I even knew what a data engineer was before the title was bestowed upon me, but it turns out that a lot of my skills and innate abilities make me a pretty good one.
The project I was hired to work on involved a few hundred million records of transaction data – fairly ho-hum by today’s standards in terms of volume, but enough to chew on. The deliverable involved cleaning, loading, and distilling this data down to a size that could comfortably fit in an analytical tool for use by internal and external clients. We did all of this on an on-prem MySQL server that was, for its time, pretty tricked out. Unfortunately, it was a bit of a nightmare, and any problem or tweak that required rerunning from the beginning was a day-long setback. In the end, we made it work and we made the deadline, but it involved a lot of babysitting processes overnight and relentlessly looking for ways to re-engineer the process to make it more efficient.
My partner, who was mainly handling the visualizations, kept lobbying to hire a DBA who was an expert at fine-tuning performance in the MySQL environment. I was still learning about the environment and didn’t feel ready to advocate for that.
The process chugged along in MySQL, but any restatement or any change to the tables or logic was difficult due to the costs (mainly in processing time) involved.
A Bigger Project
A few months later, I learned that we would be receiving Adobe Analytics clickstream data on behalf of the client. The initial history load would be about 3.5b records with 970 fields, though we only really wanted about 75 of those fields. Again, these would need to be loaded into fact tables and then summarized to a level that would fit within Power BI for analytical purposes.
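The column-trimming step can be sketched roughly like this. Adobe delivers hit data as wide, tab-delimited files with the column names in a separate lookup file; the field names below are illustrative placeholders, not the real Adobe Analytics column list or the list we actually used.

```python
# Hypothetical subset of the ~75 fields we actually wanted; names are
# placeholders standing in for the real Adobe Analytics columns.
WANTED = ["date_time", "post_visid_high", "post_visid_low", "post_page_url"]

def trim_columns(header, rows, wanted=WANTED):
    """Keep only the wanted columns, in order, from a wide delimited feed."""
    idx = [header.index(col) for col in wanted]  # positions of kept fields
    for row in rows:
        yield [row[i] for i in idx]

# Usage: the full 970-column header and one sample row, faked here.
header = ["date_time", "ip", "post_visid_high", "post_visid_low",
          "user_agent", "post_page_url"]
rows = [["2019-01-01 00:00:00", "1.2.3.4", "123", "456",
         "Mozilla/5.0", "https://example.com/"]]

trimmed = list(trim_columns(header, rows))
# trimmed → [["2019-01-01 00:00:00", "123", "456", "https://example.com/"]]
```

In practice you would do this narrowing either before the load or in the load step itself, so the fact tables only ever carry the fields the reporting layer needs.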
This was a comparatively big lift for us and not conceivable within our MySQL setup.
We began a proof of concept in AWS. While I’m always skeptical of third-party services and their focus on entanglement, I was certainly no stranger to cloud computing and its explosive growth in recent years. I’ve used these services in smaller ways a number of times.
We used a scheduled nightly job in AWS Data Pipeline to bring the raw Adobe data from the inbound folder in S3 into the Redshift tables. The summary queries for Power BI were also handled by Data Pipeline.
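The core of that nightly load is a Redshift COPY pointed at the inbound S3 prefix. Here is a minimal sketch of how such a statement could be assembled; the table, bucket, prefix, and IAM role names are all placeholders, not our actual configuration.

```python
def build_copy_sql(table, bucket, prefix, iam_role):
    """Build a Redshift COPY statement for gzipped, tab-delimited files."""
    return (
        f"COPY {table}\n"
        f"FROM 's3://{bucket}/{prefix}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "DELIMITER '\\t' GZIP\n"           # Adobe feeds arrive gzipped, tab-delimited
        "TRUNCATECOLUMNS ACCEPTINVCHARS;"  # tolerate overlong/invalid values
    )

# Hypothetical names for illustration only.
sql = build_copy_sql(
    table="clickstream.hit_data",
    bucket="client-inbound",
    prefix="adobe/2019-06-01/",
    iam_role="arn:aws:iam::123456789012:role/redshift-copy",
)
```

Data Pipeline then just runs this kind of statement (and the downstream summary queries) on a schedule, with retries and failure alerts, which is exactly the babysitting we used to do by hand.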
My skepticism melted away. The amount of value provided by AWS was staggering. We delivered a robust, high performance reporting platform with nightly updates, backups, and all of the features around stability and redundancy that AWS has to offer for under $40k a year.
That’s with 24/7 availability. The cluster hosts over 4b rows of detail data as well as a number of summary tables to support Power BI. Depending on your vantage point, that may sound like a lot for one reporting solution, but it doesn’t take many person-hours to eat up $40k in effort. The DBA we had considered hiring to tune our MySQL setup would have cost as much as three of these installs.
I know this is probably preaching to the choir, but as someone who’s seen a lot of technology changes over the years, I can tell you that they don’t always deliver, even when a wave of companies is adopting or trying them.
This one delivers.