IT PARK
    Big Data

    Talking about data lake and data warehouse

    Updated: Apr 16, 2025

    Data lake is a term that emerged in the past decade to describe an important part of the data analysis pipeline in the big data world. The idea is to have a single store for all of the raw data that anyone in an organization might need to analyze. People usually use Hadoop to work with data in the lake, but the concept is broader than Hadoop.

    Putting all of the data an organization wants to analyze in a single place immediately brings to mind the notions of data warehouse and data mart. But there is a vital distinction between a data lake and a data warehouse: the data lake stores raw data, in whatever form the data source provides. There are no assumptions about the schema of the data; each source can use whatever schema it likes. It is up to the consumers of that data to make sense of it for their own purposes.
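This "schema-on-read" idea can be sketched in a few lines. A minimal, hypothetical example: two source systems dump order records into the lake in different shapes, and it is the consumer, not the lake, that decides how to interpret each one (the field names and the `read_order` helper are invented for illustration).

```python
import io
import json

# Two hypothetical source systems write raw records in whatever shape
# they like -- nothing is enforced on write.
raw_lake = io.StringIO(
    '{"order_id": 1, "total": 19.99, "ts": "2025-04-01T10:00:00"}\n'
    '{"id": "A-7", "amount_cents": 450, "created": "2025-04-01"}\n'
)

def read_order(record):
    """Schema-on-read: the consumer decides how to interpret each record."""
    if "total" in record:            # shape used by source system A
        return record["total"]
    if "amount_cents" in record:     # shape used by source system B
        return record["amount_cents"] / 100
    raise ValueError(f"unrecognized record shape: {record}")

totals = [read_order(json.loads(line)) for line in raw_lake]
print(totals)  # [19.99, 4.5]
```

The point of the sketch is that the interpretation logic lives with the analysis, not with the storage, so two analysts with different purposes can read the same raw records differently.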

    Many data warehouses have foundered on exactly this question of schema. A warehouse tends to adopt a single schema that is meant to satisfy all analytical needs, but a unified data model is impractical for anything but the smallest organization. Modeling even a slightly complex domain requires multiple bounded contexts, each with its own data model. Analytically, each user needs a model that makes sense for the analysis they are doing. By shifting to storing only raw data, the lake puts this responsibility squarely on the data analyst.

    Another problem with data warehouses is ensuring data quality. Trying to derive an authoritative single source of truth requires a lot of analysis of how different systems acquire and use data. System A may be good for some data, system B for other data. You end up with rules such as: system A is better for recent orders, but system B is better for orders a month old or more, unless returns are involved. On top of this, data quality is often a subjective matter: different analysts have different tolerances for quality problems, and sometimes even different notions of what good quality means.

    This leads to a common criticism of the data lake: that it is just a dumping ground for data of wildly varying quality, better named a data swamp. The criticism is both valid and irrelevant. The fashionable title for the new breed of analyst is "data scientist", and although the title is often abused, many of these people do have a solid scientific background. Any serious scientist knows about data quality problems. Consider the apparently simple task of analyzing temperature readings over time: you have to account for weather stations being relocated in ways that subtly affect readings, for anomalies caused by equipment problems, and for missing periods when the sensors were not working. Many sophisticated statistical techniques exist precisely to cope with data quality problems. Scientists are always skeptical of data quality and are used to dealing with questionable data. For them the lake is important because they get to work with the raw data and carefully apply techniques to make sense of it, rather than relying on some opaque data-cleansing mechanism that may well do more harm than good.

    Data warehouses usually not only clean up the data but also aggregate it into a form that is easier to work with. But scientists tend to object to this too, because aggregation means throwing data away. The data lake should contain all the data, because you never know what people will find valuable, whether today or a few years from now.

    One team, for example, wanted to compare current data against historical forecasts, only to find that the forecast values stored in the data warehouse had been modified by month-end processing. In short, those warehouse values were useless, and the scientists feared the comparison could not be made at all. After more digging it turned out that the original reports had been archived, so the real forecasts made at the time could be extracted.

    The complexity of raw data does mean there is room for something that curates it into a more manageable structure, and that reduces the considerable volume of data. The data lake should not be accessed directly: because the data is raw, it takes a lot of skill to make sense of it. A relatively small number of people work in the lake, and as they discover views of the data that are broadly useful, they can create a number of data marts, each with a specific model for a single bounded context. A larger number of downstream users can then treat these marts as the authoritative source for that context.
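The lake-to-mart flow can be sketched concretely. In this hypothetical example, raw events stay in the lake untouched while a curated "daily sales" mart is derived from them for one bounded context; the record fields and the `build_daily_sales_mart` helper are invented for illustration.

```python
from collections import defaultdict

# Raw events in the lake: never modified, provenance kept on each record.
lake = [
    {"source": "pos", "ts": "2025-04-01", "sku": "X", "qty": 2},
    {"source": "web", "ts": "2025-04-01", "sku": "X", "qty": 1},
    {"source": "web", "ts": "2025-04-02", "sku": "Y", "qty": 5},
]

def build_daily_sales_mart(events):
    """Curate raw events into a mart for one bounded context (daily sales).
    Aggregation discards detail -- but the detail still lives in the lake."""
    mart = defaultdict(int)
    for e in events:
        mart[(e["ts"], e["sku"])] += e["qty"]
    return dict(mart)

mart = build_daily_sales_mart(lake)
print(mart)  # {('2025-04-01', 'X'): 3, ('2025-04-02', 'Y'): 5}
```

Downstream users query `mart` as the authoritative view for this context; if a new question arises, a different mart can always be rebuilt from the same raw events.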

    Today the data lake is often treated as a single point of integration for data across an enterprise, but it is worth noting that this was not the original intent. James Dixon coined the term in 2010, and he intended a data lake to hold a single data source; multiple data sources would instead form a "water garden". Despite that original formulation, the prevailing usage now treats the data lake as an integration point for many sources.

    The data lake should be used for analytical purposes, not for collaboration between operational systems. When operational systems need to collaborate, they should do so through services designed for that purpose, such as RESTful HTTP calls or asynchronous messaging.

    It is important that everything put into the lake has a clear provenance in place and time: each data item should record which system it came from and when the data was produced. The data lake thus holds a historical record. This may come from feeding business-system events into the lake, or from systems periodically dumping their current state into the lake, an approach that is valuable when the source system has no temporal capability but you want to do temporal analysis of its data.
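A provenance-stamping ingest step might look like the following minimal sketch. The `ingest` helper and the source-system names are hypothetical; the point is that every item entering the lake carries its source and a timestamp, falling back to the dump time when the source system has no notion of time itself.

```python
from datetime import datetime, timezone

def ingest(lake, source_system, payload, event_time=None):
    """Stamp every incoming item with provenance: where it came from and
    when it was produced. If the source has no temporal capability, fall
    back to the time of the dump itself."""
    lake.append({
        "source": source_system,
        "event_time": event_time or datetime.now(timezone.utc).isoformat(),
        "payload": payload,   # kept raw, exactly as the source provided it
    })

lake = []
# A source that knows when its events happened:
ingest(lake, "crm", {"customer": 42, "status": "active"},
       event_time="2025-04-16T09:00:00Z")
# A legacy source with no time capability: stamped at dump time.
ingest(lake, "legacy-erp", {"stock": {"X": 7}})
```

Because the payload is stored untouched, the lake accumulates a replayable history rather than only the latest state.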

    The data lake is schemaless: each source system decides what schema to use, and consumers must deal with the resulting mess in their own way. Moreover, source systems are free to change the schema of their incoming data at will, and again consumers have to cope. Obviously we prefer such changes to be as minimally disruptive as possible, but scientists prefer messy, comprehensive data to missing data.
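Coping with a schema change usually means a tolerant reader on the consumer side. A hypothetical sketch: a source renamed a field between dumps (`temp_c` to `temperature_c`, both names invented here), and the consumer accepts either generation rather than demanding the source stay frozen.

```python
# Two generations of records from the same source system.
old_style = {"station": "b17", "temp_c": 11.9}
new_style = {"station": "b17", "temperature_c": 12.2}

def temperature(record):
    """Tolerant reader: accept any known field name for the value."""
    for key in ("temperature_c", "temp_c"):
        if key in record:
            return record[key]
    return None   # comprehensive-but-messy beats silently dropped data

vals = [temperature(old_style), temperature(new_style)]
print(vals)  # [11.9, 12.2]
```

The trade-off matches the text: the reader absorbs the disruption so that no data is excluded from the lake merely because its shape changed.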

    Data lakes get very big, and much of the storage is oriented around the notion of a large schemaless structure, which is why Hadoop and HDFS are usually the technologies people use for data lakes. An important job of the data marts built from the lake is to reduce the amount of data you have to work with, so that big data analytics does not constantly have to deal with big data.

    Storing large amounts of raw data raises awkward questions about privacy and security. The data lake is a tempting target for crackers, who might love to siphon choice pieces of it into the public ocean. Restricting direct access to the lake to a small data science group may reduce this threat, but it does not answer the question of how that group takes responsibility for the privacy of the data it holds.

    Tags: big data, data lake, data warehouse
    Copyright © 2025 itheroe.com. All rights reserved. User Agreement | Privacy Policy