By: Kris Brandt, Cloud Engineer and Developer
“Cloud,” “Machine Learning,” “Serverless,” “DevOps” – technical terms used as buzzwords by marketing to get people excited, interested, and invested in the world of cloud architecture. And now we have a new one – “Data Lake.” So, what is it? Why do we care? And how are lakes better than rivers and oceans? For one, it might be harder to get swept away by the current in a lake (literally, not metaphorically).
A Data Lake is a place where data is stored regardless of type – structured or unstructured. Analytics or queries can then be run against that data. A good analogy for a data lake is the internet itself. The internet, by design, is a collection of servers identified by IP addresses so that they can communicate with each other. Search engine web crawlers visit the websites hosted on these servers, accumulating data that can then be analyzed with complex algorithms. The results allow a person to type a few words into a search engine and receive the most relevant information. This kind of indiscriminate data accumulation, followed by the presentation of contextually relevant results, is the goal of using a data lake.
However, anyone who wants to manage and present data in such a manner first needs a data store in which to create their data lake. A prime example of such a store is Amazon S3 (Simple Storage Service), where documents, images, files, and other objects are stored indiscriminately. Have logs from servers and services in your cloud environments? Dump them here. Do you have documents that relate to one subject but are in different formats? Place them in S3. The file type does not really matter for a data lake.
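As a rough illustration of "dump it in S3," here is a minimal Python sketch using boto3. The `raw/<source>/year=/month=/day=` key layout, the function names, and the bucket name are all assumptions made for this example, not anything S3 requires; S3 will accept any key. A date-based layout like this one simply makes the data easier to query later.

```python
from datetime import date
from pathlib import PurePosixPath


def build_object_key(source: str, day: date, filename: str) -> str:
    """Build a date-partitioned object key (hypothetical layout for this sketch)."""
    name = PurePosixPath(filename).name  # keep only the file name, drop local directories
    return f"raw/{source}/year={day:%Y}/month={day:%m}/day={day:%d}/{name}"


def upload_to_lake(bucket: str, local_path: str, source: str, day: date) -> str:
    """Upload one local file of any type to the bucket and return the key used."""
    import boto3  # imported here so the key helper works without AWS access

    key = build_object_key(source, day, local_path)
    boto3.client("s3").upload_file(local_path, bucket, key)
    return key
```

Calling `upload_to_lake("my-data-lake", "/var/log/server.log", "app-logs", date.today())` would store the file under a dated prefix, assuming a bucket named `my-data-lake` exists and AWS credentials are configured.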
So once the data is in S3, how can it be used for more than just a data dump? How do we get the information we desire from the data lake? In Amazon Web Services, we need to use services and tools that provide analysis and visualization of the data. Elasticsearch, Athena, and Macie are some of the services that can do that:
Elasticsearch can be fed data from S3, indexing it according to mappings you define and giving you ways to read and access that data with your own queries. It is a service designed to provide search capability without the need to build your own search algorithms.
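To make "indexing" and "your own queries" a little more concrete, the sketch below builds the newline-delimited JSON body that the Elasticsearch bulk endpoint accepts, plus a simple `match` query body. The index name `logs`, the `message` field, and both function names are hypothetical. Sending these payloads to a cluster (and reading the objects out of S3 first) is left out.

```python
import json
from typing import Iterable


def to_bulk_payload(index_name: str, documents: Iterable[dict]) -> str:
    """Return an NDJSON body for the bulk API: one action line, then one document line."""
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""


def match_query(field: str, text: str) -> dict:
    """Return a full-text search body that matches `text` against one field."""
    return {"query": {"match": {field: text}}}
```

For example, `to_bulk_payload("logs", [{"message": "disk full"}])` produces two lines: an action line naming the index, followed by the document itself.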
Athena is a “serverless interactive query service.” What does this mean? It means I can load countless CSVs into S3 buckets and have Athena return the queried data as a table. Think database queries without the database server. In practice, you would need to implement cost management techniques (such as data partitioning) to limit the cost per query, because you are charged by the amount of data a query scans.
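Here is a small sketch of why partitioning matters for cost. It assumes a table partitioned by string columns `year` and `month`, a price of 5 USD per TB scanned, and a 10 MB minimum charge per query; the price and minimum are assumptions based on commonly published Athena pricing and should be checked for your region. The table, database, and function names are hypothetical.

```python
def build_partition_query(table: str, year: int, month: int) -> str:
    """Build a query that filters on partition columns so less data is scanned."""
    return (
        f'SELECT * FROM "{table}" '
        f"WHERE year = '{year:04d}' AND month = '{month:02d}'"
    )


def estimate_query_cost(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Estimate the charge for one query (assumed price and 10 MB minimum)."""
    minimum_bytes = 10 * 1024**2
    billable = max(bytes_scanned, minimum_bytes)
    return billable / 1024**4 * usd_per_tb


def run_query(sql: str, database: str, output_location: str) -> str:
    """Start the query in Athena and return its execution ID."""
    import boto3  # imported here so the helpers above work without AWS access

    response = boto3.client("athena").start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

Under these assumptions, a query that scans a full terabyte costs about 5 USD, while the same query restricted to a single month's partition is charged only for that month's files.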
Macie is an AWS service that ingests logs and content from across AWS and analyzes that data for security risks. From personally identifiable information (PII) in S3 buckets to high-risk IAM users, Macie is an example of the kinds of analysis and visualization you can do when you have a data lake.
These are just some examples of how to augment your data in the cloud. S3, by itself, is already a data lake – ‘infinite’, unorganized, and unstructured data storage. And the service is already hooked into numerous other AWS services. The data lake is here to stay and is a stepping stone to using the full suite of technologies available now and in the future. Start with S3, add your data files, and use Lambda, Elasticsearch, Athena, and traditional web pages to display the results of those services. No servers, no OS configurations or OS-level security patching; just development of queries, Lambda functions, API calls, and data presentation – serverless.
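As one possible shape for the Lambda piece, the sketch below is a handler that reacts to S3 "object created" notifications and lists the new objects. It assumes the bucket is configured to send those events to the function. The helper name and the print-only behavior are placeholders for whatever processing you actually need.

```python
from typing import List, Tuple
from urllib.parse import unquote_plus


def extract_s3_objects(event: dict) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 event notification."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in the event, so decode them first.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


def lambda_handler(event, context):
    """Entry point: log each new object and report how many were seen."""
    objects = extract_s3_objects(event)
    for bucket, key in objects:
        print(f"New object in lake: s3://{bucket}/{key}")
    return {"processed": len(objects)}
```

From here, the handler could forward each object to Elasticsearch or kick off an Athena query instead of just printing.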
Our team is building and managing data lakes and the associated capabilities for multiple organizations and can help yours as well. Reach out to our team at [email protected] for some initial discovery.