At my new job as a data science consultant, I immediately needed to learn big data system structures. Here is a primer from my research to get the basics down and get up to speed on key terminologies.
What is a Data Warehouse?
First, what is a warehouse in general terms? It is a large space to store things in. Instead of an Amazon Warehouse holding many physical products inside the space, for example, data warehouses (DWH) are just digital spaces to store data in.
More specifically, the process of creating a DWH can be seen as moving raw data input via Extract-Transform-Load (ETL) actions into a consolidated storage system to be used for analysis.
A DWH is:
- Subject Oriented: Data is categorized and stored by business subject rather than by application.
- Integrated: Data on a given subject is collected from disparate sources and stored in a single place.
- Time-variant: Data is stored as a series of snapshots each representing a period of time.
- Non-volatile: Typically data in the data warehouse is not updated or deleted.
What makes up a DWH?
Well, yeah, data, but specifically it holds legacy data or historical data. With this historical data, strategic questions can be answered with trends and visualizations (because humans are pretty awful at interpreting data in its tabular form for the most part).
DWH can relate multiple databases together from source legacy data essentially creating these relational schemas. With schemas in place, DWH will be faster and more accurate.
What exactly are schemas?
A schema in its simplest form just means the definition of the records in the database e.g. the schema can say a record will consist of the following:
- PersonID (unique index number),
- FamilyName (40 chars),
- FirstName (40 chars),
- DateOfBirth (Date)
If you know the schema, then your programming becomes much easier because you know exactly what format to expect when you query the database. But, don’t EVER trust database queries returns alone. Verify the return through code each and every time.
Essentially, a schema is something you define in a database and has nothing to do with the database software itself. When you create a database, you have to define the names of the tables, the columns in those tables, the types of data those columns will hold, as well as the relationships between the tables and columns. That is a schema.
What are DB Views?
Views are handy for several reasons. The simplest way to understand what a view is, is to think of it as a generic select statement (e.g. SELECT NAME FROM EMP WHERE ID > 3 AND CITY = “OTTAWA”) made to appear like a regular table.
Underlying any view is a SELECT…that is how a view is created. Views allow you to take complex select statements and turn them into what appears to other users as a nice clean table.
A properly designed view also allows a skilled developer or usually DBA to optimize a specific query for others to use…without knowing exactly how the query was made. On top of that, the creator can then tune that view (make it faster or correct it), and now anyone using it never has to change anything.
Views can also be used to restrict certain information from different user groups…you might have a lot of data that only some users can see all of it, while others need only see specific info. Creating views for each of these teams allow them to see only the data they need, without relying on building out many extra tables. Think of these as pre-defined filters…very handy. Some caution though! As I mentioned, views are really only selected under the hood….so poorly constructed queries on views (as they look like regular tables!) can result in poor performance.
CREATE VIEW PopularBooks AS SELECT ISBN, Title, Author, PublishDate FROM Books WHERE IsPopular = 1/* an example of a SQL View creation */
In DWH, materialized views can be used to pre-compute and store aggregated data such as a sum of sales. Materialized views in these environments are typically referred to as summaries since they store summarized data. They can also be used to pre-compute joins with or without aggregations. A materialized view is used to eliminate the overhead associated with expensive joins or aggregations for a large or important class of queries.
In distributed environments, materialized views are used to replicate data at distributed sites and synchronize updates done at several sites with conflict resolution methods. The materialized views as replicas provide local access to data which otherwise would have to be accessed from remote sites.
The difference between the two being:
- A normal view is a query that defines a virtual table — you don’t actually have the data sitting in the table, you create it on the fly by executing.
- A materialized view is a view where the query gets to run and the data is saved in an actual table.
The data in the materialized view gets refreshed when you tell it to.
- Let’s say you have a fact table in a data warehouse with every book ever borrowed from a library, with dates and borrowers. And that staff regularly want to know how many times a book has been borrowed. Then build a materialized view as
select book_id, book_name, count(*) as borrowings from book_trans group by book_id, book_name, set it for whatever update frequency you want — usually the update frequency for the warehouse itself. Now if somebody runs a query like that for a particular book against the
book_transtable, the query rewrite capability in Oracle will be smart enough to look at the materialized view rather than walking through the millions of rows in
Usually, you’re building materialized views for performance and stability reasons — flaky networks, or doing long queries off hours.
OLTP vs OLAP
Typically, data flows from one or more online transaction processing (OLTP) databases into the data warehouse on a monthly, weekly, or daily basis. OLTP is traditionally associated with relational databases (RDB), while OLAP is usually associated with DWH.
The data is usually processed in a staging file before being added to the data warehouse. Data warehouses typically range in size from tens of gigabytes to a few terabytes, usually with the vast majority of the data stored in a few very large fact tables.
OLAP tends to conduct more aggregated actions across a large amount of data (i.e. sum of sales for region X). We can also have OLAP schemas:
What is a Fact or Dimension?
Facts are your measures. (E.g. If we were using a DWH for an entertainment company your facts could be seat count, num theaters, tickets sold, etc). Dimensions are like your objects or categorical information. (e.g. theater name, theater city).
Important to remember: Anything in a database is an object.
DWH High-Level Process when Business User Sends a Query:
Take data from operational systems -> Integrate the data from multiple sources -> Standardize the data & remove inconsistencies -> Store the data in a format suitable for easy access & analysis on any time axis from integrated sources.
What is data performance tuning?
Performance tuning is very subjective and a wide open statement. The very first step in performance tuning is to answer the question, “Do we really need to performance tune our jobs?”. SQL indexing is the most effective tuning method — yet it is often neglected during development.
During the ETL phase, large volumes of data are loaded into the DWH. The DBA should take this into account during database design.
The DBA can use key clustering to ensure that inserted rows do not create hot spots in the data. One common way is when new records are assigned a key based on a sequentially ascending value (such as an Identity column or a Sequence), and the table is clustered on this key. The result is that newly-inserted rows are added together in a section of the table.
These options must be coordinated with index design, because physical clustering of the table may depend on designating one of the indexes as the clustering index. Since adding a surrogate key to a table in order to cluster it necessarily means creating a clustering index for that key, this may affect the total number of indexes for the table.
Data performance tuning is a bit outside the scope of this post though. For more info on SQL Indexing for performance tuning check out https://use-the-index-luke.com/
Are they sitting in physical memory or transient memory?
Transient data is data that is created within an application session, which is not saved in the database after the application is terminated.
Whereas physical memory is actually saved in RAM or on the hard-drive.
What is a Data Mart?
A data mart contains a subset of corporate data that is of value to a specific business unit, department, or set of users. Typically, a data mart is derived from an enterprise data warehouse. (i.e. one data mart for financial data, selling data, marketing data, social media data, and one for operational data).
There is a lot that goes into DWHing,
but I hope this serves as a good primer for those just getting introduced/getting their feet wet with databases, DWHing, and system design. Understanding databases and DWH is crucial for data scientists today in the age of ‘big data’.