With data volumes growing, businesses are looking for infrastructure models that can process their big data or store it effectively for future uses they cannot yet predict. A data lake is a file-based storage system that holds both structured and unstructured data. Raw data from many sources can be stored in a data lake without defining a schema in advance. Data lakes are agile and flexible: the user decides how the data is stored. Data in a data lake is also straightforward to process, because it can be accessed from a wide variety of processing engines while taking advantage of parallel computing.
A data lake stores data in its native format and handles the three Vs of big data (volume, velocity, and variety) while offering tools for analytics, querying, and processing. It removes many constraints of a traditional data management system by offering unrestricted file sizes, effectively unlimited space, schema-on-read, and several ways to access data (including programming, SQL queries, and REST calls).
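The idea of storing raw data without a predefined schema and applying one only when the data is read can be illustrated with a minimal Python sketch. It uses only the standard library and a temporary directory as a stand-in for the lake; the sample records, the file name, and the helper functions `write_raw` and `read_with_schema` are assumptions made for illustration and are not part of any Azure API.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical raw events, kept exactly as they arrived (no schema enforced on write).
RAW_EVENTS = [
    '{"user": "ana", "amount": "12.50", "ts": "2020-01-05"}',
    '{"user": "bo", "amount": "7", "channel": "web"}',
    '{"user": "cy"}',
]


def write_raw(lake_dir: Path, lines: list[str]) -> Path:
    """Land the records in the stand-in 'lake' untouched, in their native format."""
    path = lake_dir / "events.jsonl"
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


def read_with_schema(path: Path, schema: dict) -> list[dict]:
    """Apply a schema at read time: keep only the requested fields and cast them."""
    rows = []
    for line in path.read_text(encoding="utf-8").splitlines():
        record = json.loads(line)
        rows.append({
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        })
    return rows


with tempfile.TemporaryDirectory() as tmp:
    raw_path = write_raw(Path(tmp), RAW_EVENTS)
    # Two readers apply two different schemas to the same raw file.
    billing_rows = read_with_schema(raw_path, {"user": str, "amount": float})
    channel_rows = read_with_schema(raw_path, {"user": str, "channel": str})
```

The same file serves both readers: neither had to agree on a table layout before the data was written, which is the flexibility a data lake trades against the up-front validation of a traditional database.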
With the rise of the Hadoop ecosystem (including HDFS and YARN), the benefits of a data lake, previously available only to the most resource-rich companies such as Google, Yahoo, and Facebook, have become a practical reality for almost anyone. Today there are more options for businesses that have been creating and collecting data at scale but have failed to preserve and manage it in a meaningful way.
What is Azure Data Lake.
Microsoft Azure Data Lake is a highly scalable public cloud service that lets developers, data scientists, business professionals, and other Microsoft customers gain insights from large, complex data sets. As with most data lake offerings, the service consists of two parts: data storage and data analysis. Customers can provision Azure Data Lake to store an unlimited amount of structured, semi-structured, or unstructured data from a variety of sources. The service places no limit on account size, file size, or the volume of data stored in a data lake.
On the analytics side, Azure Data Lake customers can write their own code for specific transformation and analysis tasks. Existing tools such as Microsoft's Analytics Platform System and Azure Data Lake Analytics can also be used to query data sets. Azure Data Lake is based on the Apache Hadoop YARN cluster management framework and is intended to scale dynamically across SQL servers in the data lake as well as servers in Azure SQL Database and Azure SQL Data Warehouse. This centralized approach within the Hadoop ecosystem helps the service meet the needs of compute-intensive big data projects, which often have distributed data sources.
Azure Data Lake can be broadly divided into three parts:
Azure Data Lake Store – The Data Lake Store offers a centralized repository to which businesses can upload data of any volume. The store is designed to work with HDFS applications and tools and is built for high-performance processing and analytics, including support for low-latency workloads. Data in the store can be shared for collaboration with enterprise-level security.
Azure Data Lake Analytics – Data Lake Analytics is a distributed analytics service, built on Apache YARN, that complements the Data Lake Store. It offers on-demand processing capacity with pay-as-you-go pricing, which makes it cost-efficient for short-term or on-demand jobs at any scale. It includes U-SQL, a language that combines the benefits of SQL with the expressive power of user code and runs on a scalable distributed runtime.
Azure HDInsight – Azure HDInsight is a full-stack Hadoop platform as a service on Azure. Its Apache Hadoop, Spark, HBase, and Storm clusters are built on the Hortonworks Data Platform (HDP).
Features and benefits.
Azure Data Lake provides the features developers, data scientists, and analysts need to store data of any size, shape, and speed and to run all kinds of processing and analytics across platforms and languages. It removes the complexity of ingesting and storing all of your data while making it easier to get up and running with batch, streaming, and interactive analytics. Azure Data Lake works with existing IT investments for identity, management, and security, which streamlines data management and governance. It also integrates smoothly with existing data stores and data warehouses so that current data applications can be extended. The service draws on Microsoft's experience of working with enterprise customers and of running some of the largest-scale processing and analytics in the world for Microsoft businesses such as Office 365, Xbox Live, Azure, Windows, Bing, and Skype. Designed to meet both existing and future business needs, Azure Data Lake addresses many of the usability and scalability problems that keep organizations from getting full value out of their data assets.
Azure Data Lake offers:
- The ability to store and analyze data of any type and size.
- Multiple ways to work with the data, including U-SQL, Spark, Hive, HBase, and Storm.
- A design built on YARN and HDFS.
- Dynamic scaling to match business priorities.
- Enterprise-grade security with Microsoft Azure Active Directory.
- Management and support backed by an enterprise-grade SLA.
Pricing for Azure Data Lake depends on several factors, such as storage capacity, the number of analytics units (AUs) consumed per minute, and the number and cost of managed Hadoop and Spark clusters. As of this writing, Azure Data Lake Store costs $0.039 per GB per month, with capacity-based discounts of up to 33 percent for monthly commitments. Customers can use the Azure Pricing Calculator to estimate their data lake costs.
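As a rough illustration of the storage portion of that pricing, the short Python sketch below multiplies a stored volume by the quoted per-GB rate and applies an optional commitment discount. The source gives only the rate and the maximum discount, not the discount tiers, so the `discount` parameter and the 10,000 GB example volume are assumptions; analytics units and cluster costs are left out, and the Azure Pricing Calculator remains the authoritative source.

```python
PRICE_PER_GB_MONTH = 0.039       # USD per GB per month, the rate quoted in the text
MAX_COMMITMENT_DISCOUNT = 0.33   # upper bound quoted in the text; actual tiers are not given


def monthly_storage_cost(stored_gb: float, discount: float = 0.0) -> float:
    """Estimate the monthly storage cost in USD (storage only, no AUs or clusters).

    `discount` is a hypothetical fraction between 0 and the quoted 33 percent maximum.
    """
    if stored_gb < 0:
        raise ValueError("stored_gb must be non-negative")
    if not 0.0 <= discount <= MAX_COMMITMENT_DISCOUNT:
        raise ValueError("discount must be between 0 and 0.33")
    return round(stored_gb * PRICE_PER_GB_MONTH * (1.0 - discount), 2)


# Example: 10,000 GB with no commitment, and with the maximum quoted discount.
pay_as_you_go = monthly_storage_cost(10_000)
committed = monthly_storage_cost(10_000, discount=0.33)
```

Under these assumptions, 10,000 GB works out to about $390 per month at the listed rate and about $261 per month at the maximum discount.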
Azure Data Lake is a significant new component of Microsoft's ambitious cloud offering. With Data Lake, Microsoft provides a service for storing and analyzing data of any size at an affordable price.