Monday, April 22, 2019

NoSQL Databases Benefits & Types


NoSQL database, also called Not Only SQL, is an approach to data management and database design that’s useful for very large sets of distributed data. NoSQL, which encompasses a wide range of technologies and architectures, seeks to solve the scalability and big data performance issues that relational databases weren’t designed to address. NoSQL is especially useful when an enterprise needs to access and analyse massive amounts of unstructured data or data that’s stored remotely on multiple virtual servers in the cloud.
NoSQL technology was originally created and used by Internet leaders such as Facebook, Google, Amazon and others, who required database management systems that could write and read data anywhere in the world, while scaling and delivering performance across massive data sets and millions of users.

Benefits of NoSQL databases:

 NoSQL databases provide various important advantages over traditional relational databases. A few core features of NoSQL are listed here, which apply to most NoSQL databases.

Schema agnostic:  NoSQL databases are schema agnostic. You aren’t required to do a lot on designing your schema before you can store data in NoSQL databases. You can start coding, and store and retrieve data without knowing how the database stores and works internally. If you need advanced functionality, then you can customize the schema manually before indexing the data. Schema agnosticism may be the most significant difference between NoSQL and relational databases.

Scalability: NoSQL databases support horizontal scaling methodology that makes it easy to add or reduce capacity quickly without tinkering with commodity hardware. This eliminates the tremendous cost and complexity of manual shading that is necessary when attempting to scale RDBMS.

Performance: With a NoSQL database, you can increase performance by simply adding cheaper servers, called commodity servers. This helps organizations to continue to deliver reliably fast user experiences with a predictable return on investment for adding resources again, without the overhead associated with manual shading.

High availability:  NoSQL databases are generally designed to ensure high availability and avoid the complexity that comes with a typical RDBMS architecture, which relies on primary and secondary nodes. Some ‘distributed’ NoSQL databases use a masterless architecture that automatically distributes data equally among multiple resources so that the application remains available for both read and write operations, even when one node fails.

Global availability By automatically replicating data across multiple servers, data centres or cloud resources, distributed NoSQL databases can minimise latency and ensure a consistent application experience wherever users are located. An added benefit is a significantly reduced database management burden of manual RDBMS configuration, freeing operations teams to focus on other business priorities.

Types of NoSQL databases:

Several different varieties of NoSQL databases have been created to support specific needs and use cases. These databases can broadly be categorized into four types.


Key-value store NoSQL database:  From an API perspective, key-value stores are the simplest NoSQL data stores to use. The client can either get the value for the key, assign a value for a key or delete a key from the data store. The value is a blob that the data store just stores, without caring or knowing what’s inside; it’s the responsibility of the application to understand what was stored. Since key-value stores always use primary-key access, they generally have great performance and can be easily scaled. The key-value database uses a hash table to store unique keys and pointers (in some databases it’s also called the inverted index) with respect to each data value it stores. There are no column type relations in the database; hence, its implementation is easy. Key-value databases give great performance and can be very easily scaled as per business needs.
Use cases: Here are some popular use cases of the key-value databases:
  • For storing user session data
  • Maintaining schema-less user profiles
  • Storing user preferences
  • Storing shopping cart data
However key-value databases are not the ideal choice for every use case when:
  • We have to query the database by specific data value.
  • We need relationships between data values.
  • We need to operate on multiple unique keys.
  • Our business needs updating a part of the value frequently.
Examples of this database are Redis, MemcacheDB and Riak.

Document store NoSQL database :
Document store NoSQL databases are similar to key-value databases in that there’s a key and a value. Data is stored as a value. Its associated key is the unique identifier for that value. The difference is that, in a document database, the value contains structured or semi-structured data. This structured/semi-structured value is referred to as a document and can be in XML, JSON or BSON format.
Use cases: Document store databases are preferable for:
  • E-commerce platforms
  • Content management systems
  • Analytics platforms
  • Blogging platforms
Document store NoSQL databases are not the right choice if you have to run complex search queries or if your application requires complex multiple operation transactions.
Examples of document store NoSQL databases are MongoDB, Apache CouchDB and Elasticsearch.

Column store NoSQL database :  In column-oriented NoSQL databases, data is stored in cells grouped in columns of data rather than as rows of data. Columns are logically grouped into column families. Column families can contain a virtually unlimited number of columns that can be created at runtime or while defining the schema. Read and write is done using columns rather than rows. Column families are groups of similar data that is usually accessed together. As an example, we often access customers’ names and profile information at the same time, but not the information on their orders.
The main advantages of storing data in columns over relational DBMS are fast search/access and data aggregation. Relational databases store a single row as a continuous disk entry. Different rows are stored in different places on the disk while columnar databases store all the cells corresponding to a column as a continuous disk entry, thus making the search/access faster.
Each column family can be compared to a container of rows in an RDBMS table, where the key identifies the row and the row consists of multiple columns. The difference is that various rows do not have to have the same columns, and columns can be added to any row at any time without having to add them to other rows.
Use cases: Developers mainly use column databases in:
  • Content management systems
  • Blogging platforms
  • Systems that maintain counters
  • Services that have expiring usage
  • Systems that require heavy write requests (like log aggregators)
Column store databases should be avoided if you have to use complex querying or if your querying patterns frequently change. Also avoid them if you don’t have an established database requirement, a trend which we are beginning to see in new systems.
Examples of column store NoSQL databases are Cassandra and Apache Hadoop Hbase.

Graph base NoSQL database : Graph databases are basically built upon the Entity – Attribute – Value model. Entities are also known as nodes, which have properties. It is a very flexible way to describe how data relates to other data. Nodes store data about each entity in the database, relationships describe a relationship between nodes, and a property is simply the node on the opposite end of the relationship. Whereas a traditional database stores a description of each possible relationship in foreign key fields or junction tables, graph databases allow for virtually any relationship to be defined on-the-fly.
Use cases: Graph base NoSQL databases are usually used in:
  • Fraud detection
  • Graph based search
  • Network and IT operations
  • Social networks, etc
Examples of graph base NoSQL databases are Neo4j, ArangoDB and OrientDB.




Wednesday, February 13, 2019

Data Modelling Terminology

Data Modelling is the diagrammatic representation showing how the entities are related to each other. It is the initial step towards database design. We first create the conceptual model, then logical model and finally move to the physical model.
Generally, the data models are created in data analysis & design phase of software development life cycle.

There are three types of data models – conceptual, logical and physical. The level of complexity and detail increases from conceptual to logical to a physical data model.
  • Conceptual model will be just portraying entity names and entity relationships. Figure 1 shown in the later part of this article depicts a conceptual model.
  • Logical model will be showing up entity names, entity relationships, attributes, primary keys and foreign keys in each entity. Figure 2 shown inside question#4 in this article depicts a logical model.
  • Physical data model will be showing primary keys, foreign keys, table names, column names and column data types. This view actually elaborates how the model will be actually implemented in the database.
Design sachems in Data Modelling:
  • Star Schema
  • Snowflake Schema
Star Schema The simplest of the schemes is star schema where we have a fact table in the center which references multiple dimension tables around it. All the dimension tables are connected to the fact table. The primary key in all dimension tables acts as a foreign key in the fact table.
The star schema is quite simple, flexible and it is in de-normalized form


Snowflake Schema  In a snowflake schema, the level of normalization increases. The fact table here remains the same as in star schema. However, the dimension tables are normalized.  Due to several layers of dimension tables, it looks like a snowflake and thus it is named as snowflake schema.





Dimension and Attribute Dimensions represent qualitative data. For example– plan, product, class are all dimensions. A dimension table contains descriptive or textual attributes. For example, product category & product name are the attributes of product dimension.

Fact and Fact table Facts represent quantitative data. For example – net amount due is a fact. A fact table contains numerical data and foreign keys from related dimensional tables.

Different Types of Dimensions There are typically five types of dimensions.

1) Conformed dimensions: A Dimension that is utilized as a part of different areas is called as conformed dimension. It might be utilized with different fact tables in a single database or over numerous data marts/warehouses. For example, if subscriber dimension is connected to two fact tables – billing and claim then the subscriber dimension would be treated as conformed dimension.
2) Junk Dimension: It is a dimension table comprising of attributes that don’t have a place in the fact table or in any of the current dimension tables. Generally, these are the properties like flags or indicators. For example, it can be member eligibility flag set as ‘Y’ or ‘N’ or any other indicator set as true/false, any specific comments, etc. if we keep all such indicator attributes in the fact table then its size gets increased. So, we combine all such attributes and put in a single dimension table called as junk dimension having unique junk IDs with a possible combination of all the indicator values.
3) Role Playing Dimension: These are the dimensions which are utilized for multiple purposes in the same database. For example, a date dimension can be used for “Date of Claim”, “Billing date” or “Plan Term date”. So, such a dimension will be called as Role playing dimension. The primary key of Date dimension will be associated with multiple foreign keys in the fact table.
4) Slowly Changing Dimension (SCD): These are most important amongst all the dimensions. These are the dimensions where attribute values vary with time. Below are the varies types of SCDs
  • Type-0: These are the dimensions where attribute value remains steady with time. For example, Subscriber’s DOB is a type-0 SCD because it will always remain the same irrespective of the time.
  • Type-1: These are the dimensions where previous value of the attribute is replaced by the current value. No history is maintained in Type-1 dimension. For example, Subscriber’s address (where the business requires to keep the only current address of subscriber) can be a Type-1 dimension.
  • Type-2: These are the dimensions where unlimited history is preserved. For example, Subscriber’s address (where the business requires to keep a record of all the previous addresses of the subscriber). In this case, multiple rows for a subscriber will be inserted in the table with his/her different addresses.
  • There will be some column(s) that will identify the current address. For example, ‘start date’ and ‘End date’. The row where ‘End date’ value will be blank would contain subscriber’s current address and all other rows will be having previous addresses of the subscriber.
  • Type-3: These are the type of dimensions where limited history is preserved. And we use an additional column to maintain the history. For example, Subscriber’s address (where the business requires to keep a record of current & just one previous address). In this case, we can dissolve the ‘address’ column into two different columns – ‘current address’ and ‘previous address’.
  • So, instead of having multiple rows, we will be having just one row showing current as well as the previous address of the subscriber.
  • Type-4: In this type of dimension, the historical data is preserved in a separate table. The main dimension table holds only the current data.
  • For example, the main dimension table will have only one row per subscriber holding its current address. All other previous addresses of the subscriber will be kept in the separate history table. This type of dimension is hardly ever used.
5) Degenerated Dimension: A degenerated dimension is a dimension which is not a fact but presents in the fact table as a primary key. It does not have its own dimension table. We can also call it as a single attribute dimension table.

Data mart Data marts are for the most part intended for a solitary branch of business. They are designed for the individual departments. For example, I used to work for a health insurance provider company which had different departments in it like Finance, Reporting, Sales and so forth.
We had a data warehouse that was holding the information pertaining to all these departments and then we have few data marts built on top of this data warehouse. These DataMart were specific to each department. In simple words, you can say that a DataMart is a subset of a data warehouse.

Measures
  • Non- additive measures
  • Semi-additive measures
  • Additive measures
Non-additive measures are the ones on top of which no aggregation function can be applied. For example, a ratio or a percentage column; a flag or an indicator column present in fact table holding values like Y/N, etc. is a non-additive measure.
Semi-additive measures are the ones on top of which some (but not all) aggregation functions can be applied. Example – fee rate or account balance.
Additive measures are the ones on top of which all aggregation functions cab be applied. Example- units purchased.

Wednesday, January 09, 2019

Techniques to Boost Your Data Modeling

Techniques to Boost Your Data Modeling:

·         Understand the Business Requirements and Results Needed - get a clear understanding of the requirements by asking people about the results they need from the data. Then start organizing your data with those ends in mind.

·         Visualize the Data to Be Modeled - Most people are far more comfortable looking at graphical representations of data that make it quick to see any anomalies or using intuitive drag-and-drop screen interfaces to rapidly inspect and join data tables. Data visualization approaches like these help you clean your data to make it complete, consistent, and free from error and redundancy.

·         Start with Simple Data Modeling and Extend Afterwards: Keeping data models small and simple at the start makes it easier to correct any problems or wrong turns. When you are sure your initial models are accurate and meaningful you can bring in more datasets, eliminating any inconsistencies as you go.

·         Break Business Enquiries Down into Facts, Dimensions, Filters, and Order: organizing your data using individual tables for facts and for dimensions, you facilitate the analysis for finding the top sales performers per sales period, and for answering other business intelligence questions as well.

·         Use Just the Data You Need, Rather Than All the Data Available: Portions of the data are needed to answer business questions. Ideally, you should be able to simply check boxes on-screen to indicate which parts of datasets are to be used, letting you avoid data modeling waste and performance issues.

·         Make Calculations in Advance to Prevent End User Disagreements: A key goal of data modeling is to establish one version of the truth, against which users can ask their business questions

·         Verify Each Stage of Your Data Modeling Before Continuing : Each action should be checked before moving to the next step, starting with the data modeling priorities from the business requirements.

·         Look for Causation, Not Just Correlation: Data modeling includes guidance in the way the modeled data is used. While empowering end users to access business intelligence for themselves is a big step forwards, it is also important that they avoid jumping to wrong conclusions. For example, perhaps they see that sales of two different products appear to rise and fall together. Are sales of one product driving sales of the other one (a cause and effect relationship), or do they just happen to rise and fall together (simple correlation) because of another factor such as the economy or the weather? Confusing causation and correlation here could lead to targeting wrong or non-existent opportunities, and thus wasting business resources.

·         Use Smart Tools to Do the Heavy Lifting: complex data modeling may require coding or other actions to process data before analysis begins. However, if such “heavy lifting” can be done for you by a software application A suitable software product can facilitate or automate all the different stages of data ETL (extracting, transforming, and loading). Data can be accessed visually without any coding required, different data sources can be brought together using a simple drag-and-drop interface, and data modeling can even be done automatically based on the query type.

·         Make Your Data Models Evolve:  Data models in business are never carved in stone because data sources and business priorities change continually. Therefore, you must plan on updating or changing them over time. For this, store your data models in a repository that makes them easy to access for expansion and modification, and use a data dictionary or “ready reference” with clear, up-to-date information about the purpose and format of each type of data.

Basics of Kubernetes

 Kubernetes, often abbreviated as K8s , is an open-source platform designed to automate the deployment, scaling, and management of container...