Project Description

  • Legacy Community Health was founded in 1978. It is a federally qualified Health Center and they provide services such as adult primary care, dental care, vision services, nutrition services, HIV/AIDS care, behavioural health services, etc. Currently, traditional RDBMS is used to manage all the data regarding to patient, medication, transaction, and more. Our goal of this project was to load the current existing data in RDBMS to HBase and extract the data using conditions such as prefixes, base column names and etc.
  • Pig is mainly used for this project since its built in functions found to be suitable. Using UDF was necessary due to Pig Latin being a data flow language. Some of the functions didn’t work as expected in the early stage of the project. However, it didn’t take a long time to find a way to work around it.  It was convenient to use Pig to interact with HBase. Working with positive arbitrary integer prefixed columns seemed somewhat challenging at first but the challenge was cracked by using a few built in functions.

Challenges

  • Find optimal Hadoop ecosystem and programming language to use for HBase data extraction.
  • Filter HBase columns based on first, second prefix and base column names, while the second prefix being arbitrary positive integer.
  • Manipulate tuple and databag.
  • Convert result output to JSON format.


Solutions

  • Interacted with HBase using Pig Latin due to its built in functions and ETL capabilities using HDFS.
  • Python UDF used to flatten data.


Results

  • All the appropriate columns filtered for each entity in HBase.  The output for each entity formatted in JSON.


Highlights

  • Pig
  • CDH 5.4.X
  • Python
  • HBase
  • MapReduce
  • HDFS