Spry was recently given the opportunity to be a guest author for the Hortonworks blog. The post is available in its entirety here. A sneak peek of the blog is given below!
In early 2014, Spry developed a solution that relied heavily on Hive for data transformations. When the project was complete, three distinct data sources had been integrated through a series of HiveQL queries using Hive 0.11 on HDP 2.0. While the project was ultimately successful, the workflow took an astounding two full days to execute, with a single query taking 11 hours.
Apache Kafka is a distributed commit log service. It uses a language-independent TCP protocol to provide messaging over partitioned, replicated feeds called "topics". The partitioned logs are the unit of distribution: each active node runs a Kafka server and is responsible for handling the data and requests for a subset of the partitions.
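To make the partitioning idea concrete, here is a toy sketch (not the real Kafka client, and using CRC32 as a stand-in for Kafka's actual hash function): records are routed to a partition by hashing their key, so all records with the same key land in the same partition and per-key ordering is preserved.

```python
import zlib

NUM_PARTITIONS = 4

def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    # Hash the record key and take it modulo the partition count,
    # mirroring how a default key-based partitioner assigns records.
    return zlib.crc32(key) % num_partitions

# One append-only log per partition stands in for a "topic".
topic = {p: [] for p in range(NUM_PARTITIONS)}

for key, value in [(b"user-1", "login"), (b"user-2", "click"), (b"user-1", "logout")]:
    topic[partition_for(key)].append((key, value))

# All of user-1's events share one partition, so a consumer of that
# partition sees them in the order they were produced.
```

In a real deployment each partition would live on (and be replicated across) Kafka brokers rather than in a local dict, but the key-to-partition mapping works the same way.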
This post will provide an overview of these concepts and give you more insight into how Kafka functions.
In many of our use cases, the data we work with does not arrive ready to be fed into an analytics workflow. It must first be ingested and prepared: renaming and/or reordering fields, changing data types, filtering out invalid values, and combining different parts of the same data source. In this post, we will cover how to perform these steps using a data pipeline tool called Alteryx, walking through a workflow built for one of our clients.
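The preparation steps listed above can be sketched in a few lines of plain Python (the field names, the rename map, and the "valid value" rule below are hypothetical examples, not the client's actual schema):

```python
# Hypothetical raw records as they might arrive from a source system.
raw_records = [
    {"CUST_ID": "101", "amt": "250.00", "Region": "east"},
    {"CUST_ID": "102", "amt": "-1",     "Region": "west"},  # invalid amount
    {"CUST_ID": "103", "amt": "75.50",  "Region": "east"},
]

RENAMES = {"CUST_ID": "customer_id", "amt": "amount", "Region": "region"}

def prepare(record):
    # Rename fields and change data types (amount: string -> float).
    clean = {RENAMES[k]: v for k, v in record.items()}
    clean["amount"] = float(clean["amount"])
    return clean

prepared = [prepare(r) for r in raw_records]

# Filter out invalid values (here: non-positive amounts).
prepared = [r for r in prepared if r["amount"] > 0]

# "Combining different parts of the same data source" might then be a
# grouped aggregation, e.g. total amount per region.
totals = {}
for r in prepared:
    totals[r["region"]] = totals.get(r["region"], 0.0) + r["amount"]
```

In Alteryx, each of these steps would typically map to a tool on the canvas (Select for renames and types, Filter for validity, Summarize for the aggregation) rather than hand-written code.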