Companies have recently come to realize that the real value of their business lies in its data. Many have rushed to build huge Data Lakes to store the enormous amounts of data available inside each company. The idea behind a Data Lake is a low-cost but highly scalable infrastructure in which all types of data can be stored.
This sounds good, but creating a Data Lake is not easy and a good design is a must.
Data Lake concept
Although a Data Lake is basically defined as a storage repository that holds a vast amount of raw data in its native format, it is much more than that. Its core function is storage, but it must also be able to process this data rapidly in the same place, so that different users can later access it from a single location.
Data can come in different formats: raw or structured. Depending on the user, it may be wise to store raw data in distributed file storage, such as HDFS or S3. Raw data can be handled by a user with technical knowledge (a developer or data scientist), but business users will find it more difficult. For business users, (semi-)structured data in SQL or NoSQL databases would be a better choice.
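As a minimal sketch of these two storage styles, the example below keeps events in their native JSON form in a file (a local stand-in for distributed storage such as HDFS or S3) and then copies selected fields into a SQL table that business users could query. The file name, table name and field names are hypothetical, and SQLite is used only because it ships with Python.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical raw events, kept exactly as they arrived.
RAW_EVENTS = [
    {"user_id": "u1", "action": "view", "page": "/home"},
    {"user_id": "u2", "action": "buy", "page": "/cart", "amount": 30.0},
    {"user_id": "u1", "action": "buy", "page": "/cart", "amount": 12.5},
]


def store_raw(events, path):
    """Write events as JSON Lines: one untouched record per line."""
    with open(path, "w", encoding="utf-8") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")


def load_structured(path, conn):
    """Copy the fields business users need into a SQL table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchases (user_id TEXT, amount REAL)"
    )
    with open(path, encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            if event.get("action") == "buy":
                conn.execute(
                    "INSERT INTO purchases VALUES (?, ?)",
                    (event["user_id"], event["amount"]),
                )
    conn.commit()


def total_per_user(conn):
    """The kind of query a business user might run."""
    rows = conn.execute(
        "SELECT user_id, SUM(amount) FROM purchases "
        "GROUP BY user_id ORDER BY user_id"
    )
    return rows.fetchall()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw_path = Path(tmp) / "events.jsonl"
        store_raw(RAW_EVENTS, raw_path)
        conn = sqlite3.connect(":memory:")
        load_structured(raw_path, conn)
        print(total_per_user(conn))
        conn.close()
```

The raw file keeps every field, including ones nobody needs yet, while the table exposes only what the query requires. That split is the point of the sketch, not the specific tools.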
A computing framework should be placed around the data in order to read it, transform it and create new data. To avoid redundant systems, it is advisable to use an all-in-one solution such as Apache Spark, which supports batch, SQL, streaming and machine learning operations within the same platform and API.
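The read, transform and create cycle can be illustrated without any framework. The sketch below is plain Python, not Spark; it only shows the shape of a batch step that a framework such as Apache Spark would run at scale. The record fields and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical input records, as if already stored in the Data Lake.
PAGE_VIEWS = [
    {"user_id": "u1", "page": "/home", "seconds": 12},
    {"user_id": "u2", "page": "/home", "seconds": 30},
    {"user_id": "u1", "page": "/pricing", "seconds": 45},
    {"user_id": "u3", "page": "/pricing", "seconds": 0},
]


def read(records):
    """Read step: yield records one at a time."""
    for record in records:
        yield record


def transform(records):
    """Transform step: drop empty visits, then total them per page."""
    totals = defaultdict(lambda: {"visits": 0, "seconds": 0})
    for record in records:
        if record["seconds"] <= 0:
            continue  # a zero-second visit carries no information here
        page = totals[record["page"]]
        page["visits"] += 1
        page["seconds"] += record["seconds"]
    return totals


def create(totals):
    """Create step: build a new, derived dataset sorted by page."""
    return [
        {
            "page": page,
            "visits": t["visits"],
            "avg_seconds": t["seconds"] / t["visits"],
        }
        for page, t in sorted(totals.items())
    ]


if __name__ == "__main__":
    for row in create(transform(read(PAGE_VIEWS))):
        print(row)
```

Keeping the three steps as separate functions mirrors the way the same logic would be split into stages in a larger framework, and makes each stage easy to test on its own.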
To be able to monetize the data, new applications and services will be created on top of the Data Lake. A good design is therefore important to ensure flexibility.
Sources
A Data Lake is valuable for the data it stores, which can come from a wide variety of sources. These can be divided into two types: internal and external sources.
Internal Sources
Internal sources are the easiest and generally the most commonly used ones. All data comes from inside the company, so the cost will be lower than for external sources. Here are some examples:
- CRM: A customer relationship management (CRM) system manages a company’s interactions with current and future customers. It often involves using technology to organize, automate and synchronize sales, marketing, customer service and technical support.
- Website Tracking: Tracking all users with a single system, instead of searching through logs, can have many benefits. It requires a small JS script on the company websites and a simple collector server. In addition, a single system that stores all the customers’ behaviour makes analyses faster, easier and more reliable.
- Application Logs: Many companies generate gigabytes of log data from their applications every hour. Although most of this data is hard to read and not very useful, BI users can take advantage of this raw data to learn about customer behaviour or to find gaps in their business process flows.
- Internal ETLs: This source covers all the company's custom ETL (extract, transform, load) jobs that the Data Lake users require.
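The website tracking source above mentions a simple collector server. A minimal sketch of one, using only Python's standard library, might look like the following. The `/collect` path, the `uid` and `page` parameter names and the in-memory event list are assumptions made for illustration; a real collector would write to durable storage and would need the protections any public endpoint requires.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Hypothetical in-memory sink; a real collector would use durable storage.
EVENTS = []


def parse_event(url, received_at=None):
    """Turn a tracking URL such as /collect?uid=u1&page=/home into a record.

    Returns None when the path is not /collect or a required field is missing.
    """
    parsed = urlparse(url)
    if parsed.path != "/collect":
        return None
    params = parse_qs(parsed.query)
    if "uid" not in params or "page" not in params:
        return None
    return {
        "uid": params["uid"][0],
        "page": params["page"][0],
        "ts": received_at if received_at is not None else time.time(),
    }


class CollectorHandler(BaseHTTPRequestHandler):
    """Answers the GET request that the tracking script would send."""

    def do_GET(self):
        event = parse_event(self.path)
        if event is None:
            self.send_response(400)  # malformed or unknown request
            self.end_headers()
            return
        EVENTS.append(json.dumps(event))
        self.send_response(204)  # No Content: nothing to send back
        self.end_headers()


def serve(port=8080):
    """Run the collector locally; the port is an arbitrary choice."""
    HTTPServer(("127.0.0.1", port), CollectorHandler).serve_forever()
```

Calling `serve()` starts the collector on the local machine. The parsing logic lives in `parse_event` rather than in the handler so that it can be checked without starting a server.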
External Sources
As you can imagine, external sources bring in data from the outside world, generally from third-party companies. Their cost is usually considerably higher than that of internal sources. Many companies on the market sell valuable data. Here are some examples:
- DMP: A data management platform (DMP) is a centralized computing system for collecting, integrating and managing large sets of structured and unstructured data from disparate sources.
- Online Advertisement: The most widespread marketing strategy that companies use to increase sales. It can be seen as a direct way to get a better return on the investment and to retrieve valuable information about customers. This data can answer questions about the behaviour and tastes of millions of customers. However, this is a rather opaque and expensive market, and accessing this data may not be easy.
Depending on its requirements and needs, a Data Lake will therefore draw on these kinds of sources to a greater or lesser extent.
Useful Tips for Creating a Data Lake
Creating a Data Lake is not easy, and a few pointers can be of great help. Here is some advice for building a consistent Data Lake:
- Keep it simple: this should be a daily rule. The more frameworks, databases and systems involved, the more development and maintenance will be needed. Sometimes simple batch scripts do a good job.
- An unsecured system is a real threat: security must be a consideration from the beginning, because configuring it while the Data Lake is already in production can be a nightmare.
- Be sure about what you are doing: knowing the limits of the technologies you are using is very important. For example, if a NoSQL database is chosen and it is not compatible with ODBC, it is very likely that business users will not be able to use it.
- Document as much as you can: documentation is important because there will be many different sources in the Data Lake. This means that an increasing number of users will access the same data and several teams will be developing the Data Lake at the same time.