Data Lakes

Introduction - How are data lakes different from data warehouses?

As a follow-up to my previous post about data warehouses, I want to introduce another important part of an organization's data analytics process: the data lake.

At its core, a data lake is called a "lake" because it holds such a breadth of information. Whether the data comes from a transactional database, an enterprise application, or your social media channels, it is all kept together in one place so that predictive analytics, machine learning, and other kinds of analysis are easier to run.

Data

AWS offers a range of cloud services for computing and data warehousing, along with training resources that can get you up to speed on big data quickly. I highly recommend checking them out. The following information is paraphrased from their content.

"Data lakes store data in a non-relational and relational schema from sources such as IoT devices, websites, mobile apps, social media, and corporate applications." This data can arrive in a variety of formats, so you need to know what your data looks like in the data lake before you can take advantage of it.

Schema

"Data lakes have schemas that are written at the time of analysis." A schema defines the format and structure data must take before a SQL-based tool or other BI tool can process it. In a data lake, the schema is not enforced when the data is stored; it is applied when the data is read for analysis, an approach often called schema-on-read.
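
To make "written at the time of analysis" concrete, here is a minimal sketch in plain Python. The raw events, the field names, and the read_with_schema helper are all hypothetical and made up for illustration; a real data lake would use a query engine rather than a hand-written loop. The point is only that the raw records are stored untouched and the structure is imposed when they are read.

```python
import json

# Hypothetical raw events as they might land in a lake: JSON lines, no enforced structure.
RAW_EVENTS = [
    '{"user_id": "42", "amount": "19.99", "source": "mobile"}',
    '{"user_id": 7, "amount": 5, "source": "web", "campaign": "spring"}',
    '{"user_id": "13", "source": "iot"}',
]

# Schema chosen at analysis time: field name -> (type caster, default if missing).
PURCHASE_SCHEMA = {
    "user_id": (int, None),
    "amount": (float, 0.0),
    "source": (str, "unknown"),
}


def read_with_schema(raw_lines, schema):
    """Parse raw JSON lines and project each record onto the given schema."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, (cast, default) in schema.items():
            value = record.get(field)
            # Cast present values; fall back to the default for missing ones.
            row[field] = cast(value) if value is not None else default
        rows.append(row)
    return rows


rows = read_with_schema(RAW_EVENTS, PURCHASE_SCHEMA)
for row in rows:
    print(row)
```

A different analysis could read the same raw events with a different schema (for example, one that keeps the campaign field) without rewriting anything in storage.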

Price/Performance

This ratio matters for large and small enterprises alike. You want to make sure your data is stored in a way that fits your use case. A data lake uses very low-cost storage in exchange for slower queries, although query speeds keep improving.

Data Quality

This is where you have to be most careful when using a data lake. Data lakes store raw data, and that raw data has to be cleaned and formatted before your application can process it.
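
As a rough illustration of what preparing raw data can involve, the sketch below splits raw records into cleaned rows and rejects. The records, the required fields, and the clean_records helper are hypothetical; real pipelines usually apply many more checks than this.

```python
def clean_records(raw_records, required_fields):
    """Split raw records into cleaned rows and (record, missing_fields) rejects."""
    cleaned, rejected = [], []
    for record in raw_records:
        # A required field counts as missing if it is absent, None, or an empty string.
        missing = [f for f in required_fields if record.get(f) in (None, "")]
        if missing:
            rejected.append((record, missing))
            continue
        # Normalize string values by stripping surrounding whitespace.
        cleaned.append(
            {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
        )
    return cleaned, rejected


# Hypothetical raw order records with typical problems.
RAW_ORDERS = [
    {"order_id": " A-100 ", "email": "pat@example.com "},
    {"order_id": "", "email": "lee@example.com"},
    {"email": "sam@example.com"},
]

cleaned, rejected = clean_records(RAW_ORDERS, ["order_id", "email"])
print("cleaned:", cleaned)
print("rejected:", rejected)
```

Keeping the rejected records next to the reason they failed, rather than silently dropping them, makes it easier to trace quality problems back to their source.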

Users

These are the people in the organization who work hands-on with the data stored in the data lake: data scientists, data engineers, and business analysts (who usually work only with curated data in the data lake).

Analytics

Data in a data lake is typically used for machine learning, predictive analytics, and data discovery and profiling. It is important to keep these use cases in mind when building a data lake.
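
Data discovery and profiling can start very simply. The sketch below, which assumes records have already been parsed into Python dictionaries, counts how often each field appears and which value types it holds. The sample records and the profile helper are hypothetical and only meant to show the idea.

```python
from collections import Counter, defaultdict


def profile(records):
    """Return, per field, how many records contain it and which value types appear."""
    presence = Counter()
    types = defaultdict(Counter)
    for record in records:
        for field, value in record.items():
            presence[field] += 1
            types[field][type(value).__name__] += 1
    return {f: {"count": presence[f], "types": dict(types[f])} for f in presence}


# Hypothetical records from mixed sources.
SAMPLE = [
    {"user_id": 1, "source": "web"},
    {"user_id": "2"},
    {"user_id": 3, "source": "iot", "temp": 21.5},
]

report = profile(SAMPLE)
for field, stats in report.items():
    print(field, stats)
```

A report like this quickly shows which fields are inconsistently typed or only sometimes present, which is useful to know before feeding the data into a model.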

Conclusion

Although data can be large, confusing, and seemingly incomprehensible, a data lake is a big step toward making use of the data an organization has at its fingertips. Used well, a data lake can help you stay at the forefront of your field and make better recommendations to your customers.