Database
What is a data warehouse? The source of business intelligence
Enterprise data warehouses are comprehensive structured data stores designed for analysis. They often serve as the data sources for BI systems and machine learning.
Databases are typically classified as relational (SQL) or NoSQL, and transactional (OLTP), analytic (OLAP), or hybrid (HTAP). Departmental and special-purpose databases were initially considered huge improvements to business practices, but later derided as “islands.” Attempts to create unified databases for all data across an enterprise are classified as data lakes if the data is left in its native format, and data warehouses if the data is brought into a common format and schema. Subsets of a data warehouse are called data marts.
Data warehouse defined
Essentially, a data warehouse is an analytic database, usually relational, that is created from two or more data sources, typically to store historical data, which may have a scale of petabytes. Data warehouses often have significant compute and memory resources for running complicated queries and generating reports. They are often the data sources for business intelligence (BI) systems and machine learning.
Why use a data warehouse?
One major motivation for using an enterprise data warehouse, or EDW, is that your operational (OLTP) database limits the number and kind of indexes you can create, and therefore slows down your analytic queries. Once you have copied your data into the data warehouse, you can index everything you care about in the data warehouse for good analytic query performance, without affecting the write performance of the OLTP database.
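The indexing trade-off above can be sketched in a few lines, here using SQLite as a stand-in for a warehouse engine; the table and column names are illustrative, not from any particular system:

```python
import sqlite3

# Warehouse-side indexing, sketched with SQLite. In the OLTP source, each
# extra index would slow every write; in the warehouse copy, we can index
# every column analysts filter on without affecting the source system.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, product TEXT,
                    amount REAL, sold_at TEXT);
CREATE INDEX idx_sales_region  ON sales (region);
CREATE INDEX idx_sales_product ON sales (product);
CREATE INDEX idx_sales_date    ON sales (sold_at);
""")
conn.executemany(
    "INSERT INTO sales (region, product, amount, sold_at) VALUES (?,?,?,?)",
    [("EMEA", "widget", 100.0, "2023-01-05"),
     ("APAC", "widget", 250.0, "2023-01-06"),
     ("EMEA", "gadget", 75.0, "2023-02-01")])
# An analytic filter that can now use idx_sales_region.
total = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE region = 'EMEA'").fetchone()[0]
print(total)  # 175.0
```

In a production warehouse the same idea applies at far larger scale, and column stores achieve a similar effect without explicit per-column indexes.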
Another reason to have an enterprise data warehouse is to enable joining data from multiple sources for analysis. For example, your sales OLTP application probably has no need to know about the weather at your sales locations, but your sales predictions could take advantage of that data. If you add historical weather data to your data warehouse, it would be easy to factor it into your models of historical sales data.
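The sales-plus-weather scenario amounts to a join across sources once both are loaded into the warehouse. A minimal sketch, again using SQLite with illustrative schemas:

```python
import sqlite3

# Cross-source analysis: daily sales joined to external weather data
# after both have been loaded into the warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily_sales   (sale_date TEXT, store TEXT, revenue REAL);
CREATE TABLE daily_weather (obs_date TEXT, store TEXT, rainfall_mm REAL);
""")
conn.executemany("INSERT INTO daily_sales VALUES (?,?,?)",
                 [("2023-06-01", "NYC", 1200.0),
                  ("2023-06-02", "NYC", 800.0)])
conn.executemany("INSERT INTO daily_weather VALUES (?,?,?)",
                 [("2023-06-01", "NYC", 0.0),
                  ("2023-06-02", "NYC", 14.5)])
# Revenue alongside rainfall -- input for a weather-aware sales model.
rows = conn.execute("""
    SELECT s.sale_date, s.revenue, w.rainfall_mm
    FROM daily_sales s
    JOIN daily_weather w
      ON w.obs_date = s.sale_date AND w.store = s.store
    ORDER BY s.sale_date
""").fetchall()
print(rows)
```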
Data warehouse vs. data lake
Data lakes, which store files of data in its native format, are essentially “schema on read,” meaning that any application that reads data from the lake will need to impose its own types and relationships on the data. Data warehouses, on the other hand, are “schema on write,” meaning that data types, indexes, and relationships are imposed on the data as it is stored in the EDW.
“Schema on read” is good for data that may be used in several contexts, and poses little risk of losing data, although the danger is that the data will never be used at all. (Qubole, a vendor of cloud data warehouse tools for data lakes, estimates that 90% of the data in most data lakes is inactive.) “Schema on write” is good for data that has a specific purpose, and good for data that must relate properly to data from other sources. The danger is that mis-formatted data may be discarded on import because it doesn’t convert properly to the desired data type.
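The schema-on-read risk described above can be made concrete: raw records stay as loosely typed text, and types are imposed only when a reader touches them, which is where bad rows finally surface. The field names below are illustrative:

```python
import json
from datetime import date

# Schema on read, sketched: records sit in the "lake" as raw text,
# and each reader imposes its own types at read time.
raw_records = ['{"sold_at": "2023-06-01", "amount": "19.99"}',
               '{"sold_at": "2023-06-02", "amount": "oops"}']

def read_with_schema(line):
    """Impose types at read time; bad data surfaces only here."""
    rec = json.loads(line)
    return {"sold_at": date.fromisoformat(rec["sold_at"]),
            "amount": float(rec["amount"])}

good, rejected = [], []
for line in raw_records:
    try:
        good.append(read_with_schema(line))
    except ValueError:
        # A schema-on-write pipeline would have caught (or discarded)
        # this row at load time instead.
        rejected.append(line)

print(len(good), len(rejected))  # 1 1
```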
Data warehouse vs. data mart
Data warehouses contain enterprise-wide data, while data marts contain data oriented towards a specific business line. Data marts may be dependent on the data warehouse, independent of the data warehouse (i.e. drawn from an operational database or external source), or a hybrid of the two.
Reasons to create a data mart include using less space, returning query results faster, and costing less to run than a full data warehouse. Often a data mart contains summarized and selected data, instead of or in addition to the detailed data found in the data warehouse.
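Deriving a dependent data mart is often just a summarization query over the warehouse's detail tables. A sketch with SQLite and illustrative names:

```python
import sqlite3

# A marketing-oriented data mart derived from warehouse detail:
# only regional totals are kept, so it is smaller and faster to query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (region TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO fact_sales VALUES (?,?,?)",
                 [("EMEA", "widget", 100.0),
                  ("EMEA", "widget", 50.0),
                  ("APAC", "gadget", 200.0)])
conn.execute("""
    CREATE TABLE mart_regional_sales AS
    SELECT region, SUM(amount) AS total_amount
    FROM fact_sales
    GROUP BY region
""")
mart = dict(conn.execute("SELECT region, total_amount FROM mart_regional_sales"))
print(mart)
```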
Data warehouse architectures
In general, data warehouses have a layered architecture: source data, a staging database, ETL (extract, transform, and load) or ELT (extract, load, and transform) tools, the data storage proper, and data presentation tools. Each layer serves a different purpose.
The source data often includes operational databases from sales, marketing, and other parts of the business. It may also include social media and external data, such as surveys and demographics.
The staging layer stores the data retrieved from the data sources; if a source is unstructured, such as social media text, this is where a schema is imposed. This is also where quality checks are applied, to remove poor quality data and to correct common mistakes. ETL tools pull the data, perform any desired mappings and transformations, and load the data into the data storage layer.
ELT tools store the data first and transform later. When you use ELT tools, you may also use a data lake and skip the traditional staging layer.
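The difference between the two orderings is only where the transform step runs. A minimal sketch in plain Python, with an illustrative transform (trimming text, casting amounts):

```python
# ETL vs. ELT, sketched: same raw input, same transform, different order.
raw = [{"name": "  Alice ", "amount": "10.5"},
       {"name": "Bob",      "amount": "3"}]

def transform(rec):
    """Clean and type one record."""
    return {"name": rec["name"].strip(), "amount": float(rec["amount"])}

# ETL: transform in the staging layer, then load only clean rows.
etl_store = [transform(r) for r in raw]

# ELT: load the raw rows first, transform later inside the warehouse.
elt_store = list(raw)                          # load as-is
elt_views = [transform(r) for r in elt_store]  # transform on demand

print(etl_store == elt_views)  # True
```

Either way the analysts see the same cleaned rows; ELT simply defers the work to the warehouse's own compute.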
The data storage layer of a data warehouse contains cleaned, transformed data ready for analysis. It will often be a row-oriented relational store, but may also be column-oriented or have inverted-list indexes for full-text search. Data warehouses often have many more indexes than operational data stores, to speed analytic queries.
Data presentation from a data warehouse is often done by running SQL queries, which may be constructed with the help of a GUI tool. The output of the SQL queries is used to create display tables, charts, dashboards, reports, and forecasts, often with the help of BI (business intelligence) tools.
Of late, data warehouses have started to support machine learning to improve the quality of models and forecasts. Google BigQuery, for example, has added SQL statements to support linear regression models for forecasting and binary logistic regression models for classification. Some data warehouses have even integrated with deep learning libraries and automated machine learning (AutoML) tools.
Cloud data warehouse vs. on-prem data warehouse
A data warehouse can be implemented on-premises, in the cloud, or as a hybrid. Historically, data warehouses were always on-prem, but the capital cost and lack of scalability of on-prem servers in data centers was sometimes an issue. EDW installations grew when vendors started offering data warehouse appliances. Now, however, the trend is to move all or part of your data warehouse to the cloud to take advantage of the inherent scalability of cloud EDW, and the ease of connecting to other cloud services.
The downside of putting petabytes of data in the cloud is the operational cost, both for cloud data storage and for cloud data warehouse compute and memory resources. You might think that the time to upload petabytes of data to the cloud would be a huge barrier, but the hyperscale cloud vendors now offer high-capacity, disk-based data transfer services.
Top-down vs. bottom-up data warehouse design
There are two major schools of thought about how to design a data warehouse. The difference between the two has to do with the direction of data flow between the data warehouse and the data marts.
Top-down design (known as the Inmon approach) treats the data warehouse as the centralized data repository for the whole enterprise. Data marts are derived from the data warehouse.
Bottom-up design (known as the Kimball approach) treats the data marts as primary, and combines them into the data warehouse. In Kimball’s definition, the data warehouse is “a copy of transaction data specifically structured for query and analysis.”
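Kimball-style marts are typically built as dimensional (star) schemas: one fact table keyed to small dimension tables, structured for query and analysis. A minimal sketch with SQLite and illustrative names:

```python
import sqlite3

# A tiny star schema: fact_sales references dim_date and dim_product.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, cal_date TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales  (date_key INTEGER, product_key INTEGER, amount REAL);
""")
conn.execute("INSERT INTO dim_date VALUES (1, '2023-06-01')")
conn.execute("INSERT INTO dim_product VALUES (10, 'widget')")
conn.execute("INSERT INTO fact_sales VALUES (1, 10, 99.0)")
# Analytic queries fan out from the fact table to its dimensions.
row = conn.execute("""
    SELECT d.cal_date, p.name, f.amount
    FROM fact_sales f
    JOIN dim_date d    ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
""").fetchone()
print(row)  # ('2023-06-01', 'widget', 99.0)
```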
Insurance and manufacturing applications of the EDW tend to favor the Inmon top-down design methodology. Marketing tends to favor the Kimball approach.
Data lake, data mart, or data warehouse?
Ultimately, all of the decisions associated with enterprise data warehouses boil down to your company’s goals, resources, and budget. The first question is whether you need a data warehouse at all. The next task, assuming you do, is to identify your data sources, their size, their current growth rate, and what you’re currently doing to utilize and analyze them. After that, you can start to experiment with data lakes, data marts, and data warehouses to see what works for your organization.
I’d suggest doing your proof of concept with a small subset of data, hosted either on existing on-prem hardware or on a small cloud installation. Once you have validated your designs and demonstrated the benefits to the organization, you can scale up to a full-blown installation with full management support.
How to choose a cloud database
Transactional cloud databases come in all shapes and sizes, from simple key-value stores to planet-scale distributed relational databases. Here’s how to choose the right cloud database for your application.
There are numerous cloud databases to weed through when deciding which is the best fit for you. The list includes Amazon Web Services, EnterpriseDB, Garantia Data, Google Cloud SQL, MongoLab and Microsoft Azure. Each option offers its own benefits and its share of disadvantages. To make the best choice, you have to base it on your unique business needs.
As your data needs continue to grow, so will your management needs. You will be responsible for storage, servers, networks, users, applications, customer support and troubleshooting. All of this is what makes database management difficult. However, thanks to the availability of cloud databases, much of that work can be made easier.
But how exactly do you ensure that you select the best service for your business? Choosing the right cloud database is the key to a successful cloud experience, so before you sign your first deal with a provider, it is imperative that you learn to ask the basic questions that will help you find the right service. The purpose of this article is to lay out what you need to look at when making your choice.
Assess your needs
Before you start considering the offerings of different cloud database providers, it is imperative that you first assess your needs. Only by doing this will you be able to prepare efficiently for an eventual migration. There are a few questions you need to ask yourself at this point:
What will you use the database for? Do you need it for a website, applications, direct access, or a more complex environment?
Who will be using the database?
Which functions are most important to you? Will you be focusing on basic or advanced management?
What type of records will you be storing? What is the volume of the records that you need to store?
Answering these questions will help you pin down your cloud database requirements and decide which service will best meet your business needs. Knowing exactly what you need will also keep you from being misled when choosing a service.
Research cloud database providers
Now that you know what to look for, it is time to take a look at what different database providers have to offer. Database providers offer varying products and services. Some are a one-stop shop for all your database needs, while others specialize in specific areas. Dealing with a provider that offers a range of services will enable you to simplify database management and help save costs.
Simple research will enable you to make a list of cloud service providers you can work with. Focus mostly on what you wish to see at the end of the day. For instance, if your objective is application management, you want that service provider who specializes in that.
It is always good to compare the services of different providers before making the final decision. This is important because migrating from one service provider to another is never easy, so you should choose the most reliable service the first time round. Make a list of promising providers, then compare their services. Remember that cloud database services are typically priced by usage over time and billed monthly, so make sure you can afford the service.
Important considerations before choosing a database
Although cloud database providers need to satisfy strict standards to work in the industry, there are three important considerations you need to make before selecting a service. While you should ask questions regarding how the company keeps up with growing technologies, here are crucial points you must never skip.
Security
Security is important, all the more so when you consider that data loss can lead to a myriad of problems, including your customers suing you. You therefore want to work with the provider that offers the most secure service in the industry. But how can you ensure that the service is secure?
Regular update deployment
Database threats keep evolving. You need a service provider at the forefront, helping you stay ahead of the rest. The service you select should deploy regular updates to the database management system, and all updates should be tested before they are deployed.
Role-based access control
Role-based access control lets you define different levels of access, enabling delegation and segregation of duties among users.
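The idea can be sketched in a few lines of Python: permissions attach to roles, and users get roles rather than individual grants. The role and permission names below are illustrative:

```python
# Minimal role-based access control sketch: permissions belong to roles,
# users are assigned roles, and a check walks the user's roles.
ROLE_PERMISSIONS = {
    "analyst": {"SELECT"},
    "etl_job": {"SELECT", "INSERT", "UPDATE"},
    "dba":     {"SELECT", "INSERT", "UPDATE", "DELETE", "ALTER"},
}
USER_ROLES = {"carol": {"analyst"}, "etl_service": {"etl_job"}}

def is_allowed(user, action):
    """A user may perform an action if any of their roles grants it."""
    return any(action in ROLE_PERMISSIONS[role]
               for role in USER_ROLES.get(user, ()))

print(is_allowed("carol", "SELECT"), is_allowed("carol", "DELETE"))  # True False
```

Delegation and segregation of duties then become a matter of editing role assignments, not touching each user's individual permissions.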
Authentication method
Understanding the authentication method will help you judge how strong the cloud database's security is. Microsoft SQL Server, for example, supports two authentication modes: SQL Server authentication and Windows authentication.
Encryption service
For added security, the service you select should come with encryption. This secures your data in memory, in transit, on disk, and during query processing.
Certified for technology
Last but not least, ensure the service provider holds the relevant industry and technology certifications.
Performance
Performance has to be a key element you consider when picking a database or server solution, since server performance determines how quickly your queries return data. Ask:
- How good are the physical components?
- How many connections can you have to the cloud database?
- How much bandwidth do you get?
- What is the connection speed?
- How easy is the service to manage?
Open source power
Open source cloud computing software can offer distinct advantages to organisations, often leveraging strong user and developer communities and aggressive release cycles. Open source databases provide scalability, high availability and fault tolerance on physical hardware, virtual systems or cloud infrastructure.
Some open source databases include:
- Apache Cassandra
- CouchDB
- HBase
- Hypertable
- MongoDB