IT PARK

    Talking about data lake and data warehouse

    Updated: Mar 14, 2023

    "Data lake" is a term that has emerged in the past decade to describe an important part of the data analysis pipeline in the big data world. The idea is to have a single store for all the raw data that anyone in the organization might need to analyze. People commonly use Hadoop to work on the data in the lake, but the concept is broader than Hadoop.

    When I hear about a single point that pulls together all the data an organization wants to analyze, I immediately think of the data warehouse and the data mart. But there is an important difference between a data lake and a data warehouse. The data lake stores raw data in whatever form the data source provides. There are no assumptions about the schema of the data, and each data source can use whatever schema it likes. It is up to the consumers of the data to make sense of it for their own purposes.
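
    The idea of leaving interpretation to the consumer can be illustrated with a minimal sketch. The source names (`orders_db`, `web_shop`), the field names, and the values are all hypothetical; the point is only that the lake keeps each payload as emitted and the reader applies its own schema.

```python
import json

# Hypothetical raw payloads, stored exactly as each source emitted them.
RAW_LAKE = [
    {"source": "orders_db", "payload": '{"order_id": 17, "total_cents": 2599}'},
    {"source": "web_shop", "payload": '{"id": "A-9", "amount": "12.50", "currency": "EUR"}'},
]


def read_order_totals(lake):
    """Apply this consumer's own schema at read time."""
    totals = []
    for record in lake:
        data = json.loads(record["payload"])
        if record["source"] == "orders_db":
            # This source reports integer cents.
            totals.append(data["total_cents"] / 100)
        elif record["source"] == "web_shop":
            # This source reports a decimal string.
            totals.append(float(data["amount"]))
    return totals


print(read_order_totals(RAW_LAKE))
```

    A different analyst could read the same payloads with a different mapping, for example keeping the currency, without anything in the lake having to change.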

    Many data warehouse initiatives have not got very far because of schema problems. Data warehouses tend to aim for a single schema that serves all analysis needs, but a single unified data model is impractical for all but the smallest organizations. To model even a slightly complex domain you need multiple bounded contexts, each with its own data model. For analysis, this means each analyst should work with a model that makes sense for the analysis they are doing. Storing only raw data shifts the responsibility for making sense of it onto the data analyst.

    Another problem with data warehouses is ensuring data quality. Trying to get an authoritative single source of data requires a lot of analysis of how different systems acquire and use data. System A may be better for some data, while system B may be better for other data. You end up with rules such as: system A is better for recent orders, while system B is better for orders a month old or more, unless returns are involved. On top of this, data quality is often a subjective matter. Different analysts have different tolerances for data quality problems, and even different notions of what good quality is.

    This leads to a common criticism of the data lake: that it is just a dumping ground for data of widely varying quality, better named a data swamp. The criticism is both valid and irrelevant. The popular title for the new analysts is "data scientist". Although the title is often abused, many of these people do have a solid scientific background, and any serious scientist knows about data quality problems. Consider the apparently simple matter of analyzing temperature readings over time. You have to take into account weather stations that were relocated in ways that subtly affect the readings, anomalies caused by equipment problems, and missing periods when the sensors were not working. Many sophisticated statistical techniques were created to deal with data quality problems. Scientists are always skeptical about the quality of data and are used to dealing with questionable data. For them the lake is important because they get to work with the raw data and can be deliberate about the techniques they apply to make sense of it, rather than relying on some opaque data cleansing mechanism that may do more harm than good.

    Data warehouses usually not only cleanse data but also aggregate it into a form that is easier to analyze. Scientists tend to object to this too, because aggregation means throwing data away. The data lake should contain all the data, because nobody knows what people will find valuable, today or a few years from now.
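
    A small sketch of why aggregation discards data, using made-up readings: two days with very different behavior collapse to the same daily mean, and only the raw values can still answer a later question about the range.

```python
from statistics import mean

# Hypothetical readings for two days; the values are invented for illustration.
day_a = [10.0, 10.0, 10.0, 10.0]
day_b = [0.0, 0.0, 20.0, 20.0]

daily_mean_a = mean(day_a)
daily_mean_b = mean(day_b)

# Both days aggregate to the same value...
print(daily_mean_a == daily_mean_b)

# ...but only the raw readings can still answer a question about the range.
range_a = max(day_a) - min(day_a)
range_b = max(day_b) - min(day_b)
print(range_a, range_b)
```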

     

    In one case, values in a data warehouse were being modified by month-end processing reports. In short, those values were useless for the analysis, and the scientists feared that the comparison they wanted could not be made. After more digging, it turned out that the reports themselves had been stored, so the real predictions made at the time could be extracted.

    The complexity of this raw data means there is room to organize it into a more manageable structure and to reduce its volume. Data lakes should not be accessed directly. Because the data is raw, it takes a lot of skill to make sense of it. Relatively few people work in the data lake itself; as they discover views of the data that are generally useful, they can create a number of data marts, each of which has a specific model for a single bounded context. A larger number of downstream users can then treat these marts as the authoritative source for that context.
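
    Building a data mart from the lake can be sketched as a filter plus a projection into one context's model. Everything here is hypothetical: the source names (`pos`, `web`, `hr`), the field names, and the `SaleLine` model are assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical raw lake records; field names differ per source system.
RAW_LAKE = [
    {"source": "pos", "sku": "X1", "qty": 2, "store": "north", "cashier": "c7"},
    {"source": "web", "item": "X1", "quantity": 1, "session": "s-42"},
    {"source": "hr", "employee": "c7", "shift": "late"},
]


@dataclass(frozen=True)
class SaleLine:
    """The mart's model: only what the sales context needs."""
    sku: str
    quantity: int
    channel: str


def build_sales_mart(lake):
    """Project raw records into the sales model and drop everything else."""
    mart = []
    for rec in lake:
        if rec["source"] == "pos":
            mart.append(SaleLine(rec["sku"], rec["qty"], "store"))
        elif rec["source"] == "web":
            mart.append(SaleLine(rec["item"], rec["quantity"], "online"))
        # Records from other contexts (such as "hr") stay in the lake only.
    return mart


sales_mart = build_sales_mart(RAW_LAKE)
print(sum(line.quantity for line in sales_mart))
```

    Downstream users of such a mart see one small, consistent model and never need to know how each source spelled its fields.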

    It is now common to regard the data lake as a single point for integrating data across the enterprise, but it should be pointed out that this was not the original intention. The term was coined by James Dixon in 2010. At that time he intended a data lake to hold a single data source, with multiple data sources forming a "water garden". Despite that original formulation, the prevalent usage now treats a data lake as combining many sources.

    The data lake should be used for analysis purposes, not for collaboration between business systems. When business systems collaborate, they should do so through services designed for that purpose, such as RESTful HTTP calls or asynchronous messaging.

    It is important that all data put into the lake has clear provenance in place and time. Each data item should clearly record which system it came from and when it was produced. The data lake therefore contains a historical record. This may come from feeding business system events into the lake, or from systems that periodically dump their current state into the lake. The second approach is valuable when the source system has no temporal capabilities of its own but a temporal analysis of its data is wanted.
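
    One way to picture this provenance is an envelope around each raw payload. The sketch below is an assumption, not a standard format: the function names and the `source_system` and `produced_at` fields are hypothetical. It also shows how periodic state dumps allow a simple "as of" query.

```python
from datetime import datetime, timezone


def wrap_for_lake(payload, source_system, produced_at):
    """Attach provenance without touching the raw payload itself."""
    if produced_at.tzinfo is None:
        raise ValueError("produced_at must be timezone-aware")
    return {
        "source_system": source_system,
        "produced_at": produced_at.astimezone(timezone.utc).isoformat(),
        "payload": payload,
    }


def latest_snapshot_as_of(lake, source_system, moment):
    """Return the newest record from one source at or before `moment`."""
    candidates = [
        r for r in lake
        if r["source_system"] == source_system
        and datetime.fromisoformat(r["produced_at"]) <= moment
    ]
    return max(
        candidates,
        key=lambda r: datetime.fromisoformat(r["produced_at"]),
        default=None,
    )


# Two hypothetical daily state dumps from a system with no history of its own.
lake = [
    wrap_for_lake({"stock": 5}, "inventory", datetime(2023, 3, 1, tzinfo=timezone.utc)),
    wrap_for_lake({"stock": 3}, "inventory", datetime(2023, 3, 2, tzinfo=timezone.utc)),
]
hit = latest_snapshot_as_of(
    lake, "inventory", datetime(2023, 3, 1, 12, tzinfo=timezone.utc)
)
print(hit["payload"])
```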

    The data lake is schemaless. The source systems decide which schema to use, and consumers decide how to deal with the resulting chaos. Furthermore, a source system is free to change the schema of the data it sends at will, and consumers again have to cope. Obviously such changes should be as minimally disruptive as possible, but scientists would rather have comprehensive data than missing data.
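
    Coping with a source that changes the shape of its data can be as simple as a reader that accepts every shape it knows and skips the rest. This is only a sketch: the record shapes and field names below are hypothetical.

```python
# Hypothetical records from one source whose field names changed over time.
RECORDS = [
    {"customer": "ada", "amount": 10},               # older shape
    {"customer_name": "bob", "amount_cents": 550},   # newer shape
    {"note": "heartbeat"},                           # shape this consumer does not know
]


def read_amount(record):
    """Return the amount in whole currency units, or None if unrecognised."""
    if "amount" in record:
        return float(record["amount"])
    if "amount_cents" in record:
        return record["amount_cents"] / 100
    return None


amounts = [read_amount(r) for r in RECORDS]
print(amounts)
```

    Returning `None` rather than raising keeps the unknown record available for later inspection instead of dropping the whole batch.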

    The data lake will become very large, and much of the storage is organized around the notion of a large schemaless structure, which is why Hadoop and HDFS are the technologies people usually use for data lakes. An important task of the data marts built from the lake is to reduce the amount of data that has to be dealt with, so that big data analytics does not have to process large amounts of data.

    Storing a large amount of raw data in the data lake raises awkward questions about privacy and security. The data lake is an attractive target for hackers, who may be keen to siphon selected chunks of data into the public ocean. Restricting direct access to the lake to a small data science group may reduce this threat, but it does not remove the question of how that group is held accountable for the privacy of the data it handles.

    Copyright © 2023 itheroe.com. All rights reserved. | Privacy Policy
