By: Kris Brandt, Cloud Engineer and Developer
“Cloud,” “Machine Learning,” “Serverless,” “DevOps”: technical terms that marketing uses as buzzwords to get people excited about, interested in, and invested in the world of cloud architecture. And now we have a new one: “Data Lake.” So, what is it? Why do we care? And how are lakes better than rivers and oceans? For one, it might be harder to get swept away by the current in a lake (literally, not metaphorically).
A Data Lake is a place where data is stored regardless of type, structured or unstructured. That data can then have analytics or queries run against it. A good analogy for a data lake is the internet itself. The internet, by design, is a collection of servers identified by IP addresses so that they can communicate with each other. Search engine web crawlers visit the websites hosted on these servers, accumulating data that can then be analyzed with complex algorithms. The results allow a person to type a few words into a search engine and receive the most relevant information. This kind of indiscriminate data accumulation, followed by the presentation of contextually relevant results, is the goal of a data lake.
However, anyone who wants to manage and present data in such a manner first needs a data store in which to create their data lake. A prime example of such a store is Amazon S3 (Simple Storage Service), where documents, images, files, and other objects are stored indiscriminately. Have logs from the servers and services in your cloud environments? Dump them here. Do you have documents that relate to one subject but come in different formats? Place them in S3. The file type does not really matter for a data lake.
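As a rough illustration of "dumping" mixed file types into S3, the sketch below builds the arguments for an S3 `put_object` call. The `raw/<source>/<date>/` key layout, the function names, and the idea of passing in a client created elsewhere (for example with `boto3.client("s3")`) are assumptions made for this example, not a prescribed layout.

```python
import mimetypes
from datetime import date


def build_put_request(bucket: str, source: str, filename: str,
                      body: bytes, day: date) -> dict:
    """Build keyword arguments for S3 put_object.

    The key layout raw/<source>/<yyyy>/<mm>/<dd>/<filename> is a
    hypothetical convention; S3 itself imposes no structure.
    """
    # Guess a content type from the extension; fall back to generic bytes.
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    key = f"raw/{source}/{day:%Y/%m/%d}/{filename}"
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ContentType": content_type,
    }


def upload_to_lake(s3_client, request: dict) -> str:
    """Send the request with an S3 client supplied by the caller."""
    s3_client.put_object(**request)
    return request["Key"]
```

Logs, text documents, and images all go through the same call; only the guessed content type and the key differ, which is the point of an indiscriminate store.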
So once the data is in S3, how can it be used for more than just a data dump? How do we get the information we want out of the data lake? In Amazon Web Services, we need services and tools that provide analysis and visualization of the data. Elasticsearch, Athena, and Macie are some of the services that can do that.
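To make the querying step more concrete, here is a minimal sketch of preparing an Athena query over data held in S3. It assumes a table has already been defined over the objects; the table name `server_logs`, its `status` column, and the database and output location passed in are all hypothetical. The client is expected to be created by the caller, for example with `boto3.client("athena")`.

```python
def build_athena_request(database: str, output_location: str,
                         limit: int = 10) -> dict:
    """Build keyword arguments for Athena start_query_execution.

    'server_logs' and 'status' are placeholder names for this sketch.
    """
    query = (
        "SELECT status, COUNT(*) AS hits "
        "FROM server_logs "
        "GROUP BY status "
        "ORDER BY hits DESC "
        f"LIMIT {int(limit)}"  # int() keeps the limit strictly numeric
    )
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        # Athena writes its result files to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def start_query(athena_client, request: dict) -> str:
    """Submit the query and return its execution id for later polling."""
    response = athena_client.start_query_execution(**request)
    return response["QueryExecutionId"]
```

The query runs asynchronously, so a real caller would poll the execution id until the query finishes and then read the results from the output location.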
These are just some examples of how to augment your data in the cloud. S3, by itself, is already a data lake: ‘infinite’, unorganized, and unstructured data storage. The service is also already hooked into numerous other AWS services. The data lake is here to stay, and it is a stepping stone to utilizing the full suite of technologies available now and in the future. Start with S3, add your data files, and use Lambda, Elasticsearch, Athena, and traditional web pages to display the results of those services. No servers to run and no OS configuration or patching to worry about; just development of queries, Lambda functions, API calls, and data presentation. In a word, serverless.
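As one possible shape for the Lambda piece of that pipeline, the sketch below shows a handler that reacts to S3 "object created" notifications and lists the new objects. The function names and what is done with each object (just printing it) are assumptions for illustration; a real function would hand the object to whatever indexing or analysis step follows.

```python
from urllib.parse import unquote_plus


def objects_from_event(event: dict) -> list[tuple[str, str]]:
    """Extract (bucket, key) pairs from an S3 event notification."""
    found = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Keys arrive URL-encoded in the notification, so decode them.
        key = unquote_plus(s3_info["object"]["key"])
        found.append((bucket, key))
    return found


def handler(event, context):
    """Lambda entry point: report each newly added object."""
    objects = objects_from_event(event)
    for bucket, key in objects:
        print(f"new object in lake: s3://{bucket}/{key}")
    return {"processed": len(objects)}
```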
The JHC Technology team is building and managing data lakes and the associated capabilities for multiple organizations and can help yours as well. Reach out to our team at firstname.lastname@example.org for some initial discovery.