Data Lakes
Curious Joe was eager to pick up the conversation with Genius Jane where they had last left off on their dive into data. He wanted to know where big data typically gets stored, and to learn about the compute needed to process and analyze all this data to draw insights and support business decisions. He had just heard the term Data lake and wanted to connect the dots.
“Could you tell me a little about data lakes?” Curious Joe asked Genius Jane. With a broad smile, Genius Jane replied, “Sure, let’s meet over at the data center…”

Genius Jane: You’ve heard that with businesses going digital, the widespread use of social media, and the growing adoption of the Internet of Things, there’s been exponential growth in data generation. The rate, volume, and variety of data generated each day are simply phenomenal.
Want to hear an example? Consider a GPS tracking device fitted to a delivery truck that captures speed, distance, and altitude data every 10 seconds. By the end of the day, the collected data from that truck would amount to around 30 megabytes. If all 50 vehicles in the fleet did the same, we’re talking about roughly 1.5 gigabytes of data a day, around 45 gigabytes a month, or 540 gigabytes a year. Now consider an IoT camera fixed to the rear of each delivery truck that takes a short video recording every time the rear door is opened. At 10 megabytes per 10-second clip and an average of 100 package deliveries per truck per day, that’s around 1.5 terabytes of video clip footage per month across the fleet, or 18 terabytes of data a year.
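A quick back-of-the-envelope check of those volumes can be sketched in a few lines of Python. The truck count, sample sizes, and delivery counts below are the assumptions from the example above:

```python
# Back-of-the-envelope fleet data volumes, using the assumptions from
# the example: 50 trucks, ~30 MB of GPS samples per truck per day, and
# 10 MB video clips for ~100 door openings per truck per day.
TRUCKS = 50

gps_mb_per_truck_per_day = 30
gps_gb_per_day = TRUCKS * gps_mb_per_truck_per_day / 1000  # decimal GB
gps_gb_per_month = gps_gb_per_day * 30
gps_gb_per_year = gps_gb_per_month * 12

clip_mb = 10
clips_per_truck_per_day = 100
video_gb_per_day = TRUCKS * clips_per_truck_per_day * clip_mb / 1000
video_tb_per_month = video_gb_per_day * 30 / 1000
video_tb_per_year = video_tb_per_month * 12

print(f"GPS:   {gps_gb_per_day:.1f} GB/day, {gps_gb_per_month:.0f} GB/month, {gps_gb_per_year:.0f} GB/year")
print(f"Video: {video_gb_per_day:.0f} GB/day, {video_tb_per_month:.1f} TB/month, {video_tb_per_year:.0f} TB/year")
```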
Curious Joe: Why would anyone want to record this? And that’s really a lot of data to store! We’ll need to order a lot more capacity for our data center if we’re going to handle big data.
Genius Jane: Yes, that’s big data for you. And what we capture, within what we’re legally allowed to capture, has to be business-driven. Here the logistics company had a use case: applying computer vision to monitor the quality of deliveries and the conditions in which they are made. We can’t use any of this data directly unless we store and process it first.
Curious Joe: And that is where Data lakes come in…
Genius Jane: That’s right, Joe. There’s no one way to perform big data analytics on the data collected. Different departments in the same enterprise may want to extract insights from this data for their own needs and goals. Data Marts and Data Warehouses traditionally gave analysts processed data that had been cleaned and structured, from which they created reports, dashboards, and the like, while the raw data was archived or trashed. Data Scientists, however, need access to the raw data. Well before 2010, the industry felt the need for something beyond Data Warehouses: a store based on schema on read rather than schema on write. James Dixon of Pentaho coined the term Data lake on his blog in 2010. Some interpretations of the term arose that James went on to clarify a few years later. These days, the most common layout of a Data lake is a set of zones.
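Schema on read means raw records are stored exactly as they arrive, and a structure is imposed only at the moment they are read. A minimal sketch, assuming JSON-line telemetry records with hypothetical field names:

```python
# Schema-on-read sketch: raw events are stored as-is (JSON lines here),
# and the reader projects them onto the schema it cares about.
# Field names (truck_id, speed_kmh, altitude_m) are illustrative.
import json

raw_events = [
    '{"truck_id": "T-07", "speed_kmh": 62.5, "altitude_m": 110}',
    '{"truck_id": "T-12", "speed_kmh": 48.0}',  # altitude missing: fine at write time
]

def read_with_schema(lines):
    """Apply a schema only when the data is read, not when it is stored."""
    for line in lines:
        record = json.loads(line)
        yield {
            "truck_id": record["truck_id"],
            "speed_kmh": float(record["speed_kmh"]),
            "altitude_m": record.get("altitude_m"),  # tolerate missing fields
        }

rows = list(read_with_schema(raw_events))
```

Because no schema is enforced at write time, the second record with a missing field is still accepted into storage; only the reader decides how to handle the gap.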
Curious Joe: Like general-access, restricted, and no-entry street zones?
(Photo by Sangga Rima Roman Selia on Unsplash)

Genius Jane: Possibly that’s where the inspiration came from. In a Data lake there would be a “Landing Zone”. This is where software tools ingest the structured and unstructured data from various sources. To keep costs manageable and to provide fast search and indexing capabilities, parallel distributed file systems such as Hadoop’s HDFS are typically used. They provide high availability, scalability, and resiliency, and can be located either on-premises or in the cloud. Big data can also be ingested directly into the cloud.
Then there is the “Gold Zone”, where clean, processed data is kept for self-service, usually accessible to users via a set of production APIs. Other zones can include a “Sensitive Zone”, containing highly confidential data, and a “Work Zone”, where Data Scientists use data from the “Landing Zone” and “Gold Zone” to produce additional clean datasets that are then stored in the “Gold Zone”.
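One common convention, sketched below with illustrative (not standard) paths, is to partition the Landing Zone by source and ingestion date, with the other zones living alongside it:

```python
# A sketch of one common Data lake layout: zones as top-level folders,
# with raw drops under landing/<source>/<YYYY>/<MM>/<DD>/. The root,
# zone names, and path convention are illustrative assumptions.
from datetime import date
from pathlib import PurePosixPath

LAKE_ROOT = PurePosixPath("/datalake")
ZONES = ("landing", "gold", "sensitive", "work")

def landing_path(source: str, ingest_date: date) -> PurePosixPath:
    """Date-partitioned Landing Zone path for a raw file drop."""
    d = ingest_date
    return LAKE_ROOT / "landing" / source / f"{d:%Y}" / f"{d:%m}" / f"{d:%d}"

p = landing_path("truck-gps", date(2021, 3, 14))
```

Date partitioning keeps each day's raw drop isolated, which makes reprocessing or expiring a single day of data straightforward.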
Curious Joe: Is storing and organizing the incoming data all there is to Data lakes?
Genius Jane: Not at all, Joe. Zoning is part of a larger function that comes under Data Governance, which is all about data quality and reliability. Then there is also Data Management, which refers to all the functions necessary to collect, control, safeguard, manipulate, and deliver data. For anyone to use the data that’s in a Data lake, a Data Catalog is essential. That is where one can find metadata about the data: where it came from (for example, from which IoT devices), short- and long-form descriptions of the data, update frequency, business purpose, contact information, and so on.
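A Data Catalog entry could be modeled along these lines; the field names and example values are illustrative assumptions, and real catalog systems differ:

```python
# A minimal sketch of a Data Catalog entry carrying the metadata
# mentioned above. All field names and values are illustrative.
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    dataset: str
    source: str              # where the data came from
    description: str         # short-form description
    long_description: str    # long-form description
    update_frequency: str
    business_purpose: str
    contact: str
    zone: str = "landing"    # which Data lake zone holds it

entry = CatalogEntry(
    dataset="truck_gps_telemetry",
    source="fleet GPS tracking devices",
    description="Speed, distance, and altitude samples from delivery trucks",
    long_description="One record per truck every 10 seconds while in service.",
    update_frequency="every 10 seconds",
    business_purpose="monitoring delivery quality and conditions",
    contact="data-engineering@example.com",
)
```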
Curious Joe: Okay, you mentioned Data Scientists and Data Analysts. Are there other roles as well?
Genius Jane: The roles of Data Scientists and Data Analysts do overlap to some extent, but there are differences. The people who help get the data into the “Landing Zone”, as I like to say, are the Data Engineers. They help architect distributed systems, create data pipelines, combine data sources, architect data stores, and collaborate with data science teams, helping build the right solutions for them. Data Engineers are the data superheroes.
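A toy sketch of the kind of cleaning step a Data Engineer might wire into a pipeline from the Landing Zone to the Gold Zone; the field names are hypothetical:

```python
# Toy pipeline step: read raw Landing Zone records, drop malformed
# ones, and emit cleaned rows destined for the Gold Zone.
def clean(raw_records):
    """Landing-to-Gold cleaning step: keep only well-formed rows."""
    cleaned = []
    for r in raw_records:
        if "truck_id" not in r or r.get("speed_kmh") is None:
            continue  # skip malformed records instead of failing the run
        cleaned.append({
            "truck_id": r["truck_id"],
            "speed_kmh": float(r["speed_kmh"]),  # normalize type on the way in
        })
    return cleaned

gold = clean([
    {"truck_id": "T-07", "speed_kmh": "62.5"},
    {"speed_kmh": 40},        # missing truck_id: dropped
    {"truck_id": "T-12"},     # missing speed: dropped
])
```

In a real pipeline the dropped records would typically be quarantined for inspection rather than silently discarded.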
Curious Joe: Wow, amazing! I didn’t know so many people were involved. Data lakes do seem to help democratize access to data across the enterprise. I do understand now that it needs a whole organization around the data itself to keep the data healthy, discoverable, secure, and accessible.
Jane, you mentioned building a Data lake in the cloud. What’s that all about?
Genius Jane: Let’s set up some time to talk about that; there are lots of topics to cover. That’s some amazing variety of clouds, isn’t it?
(Photo by Tom Barrett on Unsplash)
