Big Data


A massive amount of data is being produced every day (an estimated 2.5+ quintillion bytes) by sources such as social media platforms, websites, IoT devices, and corporate databases. Millions of users are connected around the clock, sharing information and uploading images and videos to social media platforms and other websites and databases. So the question arises: how can this huge amount of data be managed and leveraged for business decisions? That’s where Databricks comes into the picture.

Databricks is a unified data analytics platform for data engineering, data science, machine learning, and analytics. It allows business analysts and data engineers to build models and ETL pipelines and to deploy business process workflows on the platform. Apache Spark, which is widely used in the industry for developing big data projects, is the core of Databricks. Databricks is available as a service on Microsoft Azure, Amazon Web Services, and Google Cloud Platform.


Big Data

In the modern world, data is far larger and more complex. […]

Data Lakes – Is it Time for Your Business to Wade In?

As data continues to grow in both volume and structural variety, traditional relational database approaches fall increasingly short in providing the flexibility, agility, scalability, and economy needed to process it. Alternative and complementary approaches for managing information have been pioneered, and given time to mature, over the last few years to satisfy today’s big data storage and processing needs. Most prominent among them for centrally managing the onslaught of information a business needs to process and store are Data Lakes.

What is the purpose of a Data Lake?

Data Lakes offer a far more economical and eminently scalable approach for ingesting and assimilating an ever-changing range of input data, primarily because they can be implemented on top of the open source Hadoop ecosystem. Hadoop provides an architecture that can scale as needed simply by adding commodity servers to the cluster for increased parallel processing and storage. Due to […]

ThoughtSpot – For Near Instant Analytics Gratification

ThoughtSpot ups the ante when it comes to rapidly and effortlessly delivering insightful, completely ad-hoc data analytics and visuals to your business, even for large, multi-terabyte data sets.

ThoughtSpot has trailblazed a new area of BI called Search BI. This type of BI differs from the current genre of more established BI tools, such as Tableau, in that it embeds and applies knowledge about how data of different categories is generally analyzed and most effectively visualized. This knowledge is then mapped onto your business’s specific domain data. The alignment and cataloguing of the business domain data and metadata is then used to provide an optimized, intelligent, and guided search capability through it. A business user simply begins typing what they are looking for into the search box, and ThoughtSpot offers completions of the search as the user types. The suggested completions are offered in the order that […]
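ThoughtSpot’s actual ranking and completion engine is proprietary, but the basic guided-search idea, suggesting catalog terms that complete what the user has typed so far, can be illustrated with a toy sketch (the function and catalog names here are hypothetical, not ThoughtSpot’s API):

```python
import bisect

def suggest(catalog, prefix, limit=3):
    """Return up to `limit` catalog terms that complete `prefix`.

    A toy prefix matcher: sort the term catalog once, binary-search
    to the first candidate, then collect terms while they still
    start with the typed prefix.
    """
    terms = sorted(catalog)
    i = bisect.bisect_left(terms, prefix)
    completions = []
    while i < len(terms) and terms[i].startswith(prefix) and len(completions) < limit:
        completions.append(terms[i])
        i += 1
    return completions

# As the user types "rev", only revenue-related terms are offered.
print(suggest(["revenue", "revenue by region", "region", "returns"], "rev"))
```

A real Search BI engine would of course rank completions by relevance to the business domain rather than alphabetically; this sketch only shows the prefix-completion mechanic.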

HBase Data Extraction

Our client is a Northeast-based data solution provider in the healthcare industry. The client manages a single-node CDH5 cluster, version 5.3.2, on Ubuntu (Trusty Tahr). The client had two main concerns. The first was extracting data from HBase. Each table in HBase has its own metadata file, which provides information about the table, including which columns to include in and exclude from the output. The second was converting the output data to JSON format.

Pig is used to extract the data from HBase. Several approaches to interacting with HBase were originally explored, but Pig proved the apt solution for this project due to its built-in functions and UDF flexibility. Only one UDF is used in the project, and it is written in Python.

To read the full article, click the link here: HbaseExtractionFinal

Pivotal GemFire

Hello everyone. This is an installation guide for Pivotal GemFire, a “distributed data management platform”. Pivotal is a company spun out of VMware and EMC. It is a relatively new company, founded in 2013, but with GE’s $105 million investment it has been running strong with its own Hadoop distribution, Pivotal HD.

On February 17th, 2015, Pivotal announced a partnership with Hortonworks and the availability of its products, such as Pivotal HD, GemFire, HAWQ, and Greenplum, under open-source licenses.

Due to Pivotal’s change of course and the partnership with Hortonworks, many big data users are intrigued but also questioning what Pivotal’s products offer. To help answer these questions, we decided to give GemFire a try. Its high performance and its HTML5-based dashboard, which visualizes and monitors the health and performance of GemFire clusters in real time, are what keep GemFire ahead of its competitors. Does that sound interesting enough? If it does, let’s […]

Hortonworks Sandbox

In this tutorial, students will learn how to set up the environment to use the Sandbox. We are using CentOS 6.6 for this tutorial. Although we used the CentOS 6.6 GUI, the tutorial is written so that even users of a server operating system can follow it without any problem. Hope you all enjoy.

Click to download the Hortonworks Sandbox tutorial here —Sandbox

Hadoop 2.x on Amazon EC2

This is an Amazon EC2 tutorial. It will help students understand the current stable Hadoop 2.x release and how it can be deployed on an Amazon EC2 instance. In the tutorial, students will learn to create a production-ready Hadoop cluster.

Click to download the Amazon EC2 tutorial here —AmazonEC2

Hadoop Multi-node Cluster Installation on Centos 6

This is a Hadoop multi-node cluster installation guide that will help you understand how each node operates in Hadoop. Everything in this guide is straightforward. We are using CentOS 6.6 since it is widely used on production servers. Every step is explained with pictures and comments. Just follow all the steps and you shouldn’t have any problem. If you hit a wall because of an error during the installation process, check for spelling or indentation mistakes; these small errors can prevent Hadoop from running properly. I mainly used the “Hadoop Cluster Setup On Centos” video from Edureka to install Hadoop and create this guide. To open the guide as a PDF, simply click the “HadoopCentosMulitnode″ link below. Please enjoy.
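The full guide walks through each file in detail, but the heart of a multi-node setup is pointing every node at the same master. A minimal sketch of the kind of configuration involved, assuming a Hadoop 2.x cluster whose master node is reachable by the hypothetical hostname `master`:

```xml
<!-- core-site.xml (on every node): point HDFS at the master's NameNode.
     "master" is a placeholder hostname; use your own, and make sure it
     resolves identically on all nodes (e.g. via /etc/hosts). -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: replicate each block across the worker nodes.
     A replication factor above 1 only makes sense once you actually
     have that many DataNodes. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```

The master additionally lists the worker hostnames (one per line) in its slaves file, which is exactly the kind of spot where the spelling mistakes mentioned above will silently keep a node out of the cluster.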

Click to download the Hadoop Multi-node Cluster Installation Guide here […]