Hello Guys,
This blog is part three of an eight-part series on porting your RDBMS database to a NoSQL wide-column key-value database. For the introduction, or part zero, you can click here.
Well, the question arises: why did I provide stored procedures in Part One, Sample Use Cases? The stored procedures give us the output that has to be reported. When we learned software engineering in our colleges or institutes, we were told to start with the output: the output drives the input, and hence the data model too. With more experience we started creating data models meant to be future proof, meaning we started adding fields that are not used currently but that we think might make sense in the future. The NoSQL data model somewhat restricts that kind of thinking. Anyway, the stored procedures, which are generally used to generate the reports, are analyzed here to derive the data model and, later on, the input.
Also note that in NoSQL and other non-RDBMS databases, data is usually modeled in denormalized form. Since joins are more costly, denormalization helps improve performance. On the other hand, denormalization produces redundancy and data duplication, and hence consumes a lot of space. If performance is the priority and resource availability is not an issue, then redundancy and data duplication become irrelevant. In this blog we will discuss data modeling; how the data is physically stored is out of scope for this series. For that, you can refer to the storage documentation of your respective NoSQL database.
With the stored procedures from part one, we could think of only two data models:
- Multiple Column Families
- Multiple Tables
Multiple Column Families: MapR-DB, a wide-column key-value database, allows defining multiple column families in a single table. In Cassandra the comparable concept is called a super column family, or composite columns. I picture the key-value pairs in the following way:
Many NoSQL databases, such as HBase, are optimized to handle only two to three column families. But after confirmation with the MapR professional services (PS) team, MapR-DB can handle a larger number of column families. Anyway, we can leave that discussion for a later debate.
Now let's see what the first table with multiple column families looks like:
CustomerOrder Table:
A few things to note here:
- The data here is denormalized into different column families.
- Column family names are just one or two letters; the smaller the better. MapR-DB and HBase store the column family name with every column, so a shorter name consumes less space.
- ItemCount is stored in the O (Order) column family. That saves us an aggregation over the OrderDetails table.

This table provides data for the following use cases:
- Fetch list of customer payments
- Fetch customer orders
- Fetch customer order summary
- Total discounts applied by discount codes
- Tableau connectivity
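To make the ItemCount point concrete, here is a minimal Python sketch (not MapR-DB/HBase client code; plain dicts stand in for the wide-column store, and the helper name is mine) of maintaining the aggregate in the O column family at write time, so reads never have to aggregate over OrderDetails:

```python
# Simulated CustomerOrder table: row key -> {column family -> {column -> value}}
customer_order = {}

def add_order_item(order_id, item_id, quantity):
    """Write the item detail into OD and bump the precomputed ItemCount in O."""
    row = customer_order.setdefault(order_id, {"O": {"ItemCount": 0}, "OD": {}})
    row["OD"][item_id] = {"Quantity": quantity}
    row["O"]["ItemCount"] += 1  # denormalized aggregate, maintained on write

add_order_item("ORD-1001", "ITEM-7", 2)
add_order_item("ORD-1001", "ITEM-9", 1)
print(customer_order["ORD-1001"]["O"]["ItemCount"])  # 2, no scan of OD needed
```

The trade-off is exactly the one discussed above: the write path does a little more work so the read path becomes a single-row lookup.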
CustomerDetails Table:
The CustomerDetails table provides the data for the 'Fetch list of customers' use case.
OrderDetails Table:
| Row Key | OrderDetailsID |
|---|---|
| **Column Family O** | **Column Family OD** |
| OrderID | ItemID |
| CustomerID | ItemName |
| OrderNo | UnitCost |
| OrderDateTime | Quantity |
| TotalAmount | Amount |
| GrossAmount | NetAmount |
| OrderStatus | Status |
The OrderDetails table does not have any use case to provide data for, but we still have to store the data.
ItemDetails Table:
| Row Key | ItemID |
|---|---|
| **Column Family I** | **Column Family ID** |
| ItemName | ItemName |
| Manufacturer | LastOrderNo |
| Unit | LastOrderDateTime |
| UnitCost | LastQuantitySold |
| Discount | LastCustomerName |
| Quantity | LastOrderID |
| | LastPaymentReceived |
Finally, the ItemDetails table provides the data for the 'Fetch last item status' use case. The ID column family holds all the aggregated values required for that use case.
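Here is a Python sketch (dicts in place of the store; column names follow the ID column family above, sample values are illustrative) of how those "last" columns stay current: every order write also overwrites the denormalized columns on the ItemDetails row, so the use case becomes a single-row read:

```python
# Simulated ItemDetails table: row key (ItemID) -> {column family -> {column -> value}}
item_details = {}

def record_sale(item_id, order_id, order_no, order_dt, qty, customer_name):
    """Overwrite the ID family's 'last order' aggregates on every sale."""
    row = item_details.setdefault(item_id, {"I": {}, "ID": {}})
    row["ID"].update({
        "LastOrderID": order_id,
        "LastOrderNo": order_no,
        "LastOrderDateTime": order_dt,
        "LastQuantitySold": qty,
        "LastCustomerName": customer_name,
    })

record_sale("ITEM-7", "ORD-1001", "1001", "2016-05-01T10:00:00", 2, "Acme Inc")
record_sale("ITEM-7", "ORD-1002", "1002", "2016-05-02T09:30:00", 5, "Globex")
print(item_details["ITEM-7"]["ID"]["LastOrderNo"])  # 1002, latest write wins
```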
The above data model is debatable! In the first use case, 'Fetch list of customer payments', the query in the stored procedure filters between dates. If we try to do that here, there will be a huge table scan, as OrderDateTime is not indexed. This is true for all use cases where the WHERE clause has search parameters other than the row key. In HBase and MapR-DB, only the row key is indexed. Cassandra has a primary key as well as secondary indexes, but this model fits the super column data model, where secondary indexes again are not possible. With the advent of composite columns, Cassandra has resolved the issue of secondary indexes; maybe HBase and MapR-DB should follow. Not to worry, though: the huge table scans can be avoided by creating secondary index tables whose row keys contain the search parameters and whose values are the row keys of the main table. It is like building our own index. We will discuss this in the next, important part of this series, Row Key Design.
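The hand-built index idea can be sketched in a few lines of Python. Plain dicts and a sorted list stand in for the store (a real wide-column store keeps index rows key-ordered for us); here the target row key rides inside the index key itself, and all names and dates are illustrative:

```python
from bisect import bisect_left, bisect_right

# Main table: OrderID -> row
orders = {
    "ORD-1": {"OrderDateTime": "2016-05-01", "TotalAmount": 120},
    "ORD-2": {"OrderDateTime": "2016-05-03", "TotalAmount": 80},
    "ORD-3": {"OrderDateTime": "2016-05-07", "TotalAmount": 45},
}

# Index table: row key = OrderDateTime_OrderID; the main row key is embedded.
index = sorted(f'{row["OrderDateTime"]}_{oid}' for oid, row in orders.items())

def orders_between(start_date, end_date):
    """Date-range query as a range scan over sorted index keys, not a full scan."""
    lo = bisect_left(index, start_date)
    hi = bisect_right(index, end_date + "~")  # '~' sorts after '_' and digits
    return [orders[key.split("_", 1)[1]] for key in index[lo:hi]]

print([o["TotalAmount"] for o in orders_between("2016-05-01", "2016-05-03")])
# [120, 80]
```

The index table costs an extra write per order, but turns a WHERE-on-OrderDateTime query into a bounded scan over contiguous keys.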
Multiple Tables: Another way of optimizing is denormalizing the data into multiple flat tables with an appropriate row key design. We will discuss row key design in the next part of this series. I will give only a few examples here to keep this topic short. If you want the data model for the remaining use cases, please add a comment and I will try to put up my best response.
Data model for 'Fetch list of Customer Payments'
| Row Key | PaymentDateTime_CustomerID_PaymentID |
|---|---|
| **Column Family P** | |
| FirstName | |
| LastName | |
| BusinessName | |
| PaymentDateTime | |
| PaymentMode | |
| Name | |
| Number | |
| Amount | |
| ExpiryDate | |
| AuthDetails | |
| Status | |
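A quick Python sketch of composing and decomposing the PaymentDateTime_CustomerID_PaymentID row key above. Using a fixed-width ISO-8601 timestamp means lexicographic order equals chronological order, which is what makes date-range scans on this row key work; the ID formats are illustrative assumptions:

```python
from datetime import datetime

def make_row_key(payment_dt: datetime, customer_id: str, payment_id: str) -> str:
    """Fixed-width timestamp first, so byte order of keys is chronological."""
    return f'{payment_dt.strftime("%Y-%m-%dT%H:%M:%S")}_{customer_id}_{payment_id}'

def parse_row_key(row_key: str):
    """Split the composite key back into its three parts."""
    dt, customer_id, payment_id = row_key.split("_", 2)
    return datetime.fromisoformat(dt), customer_id, payment_id

k1 = make_row_key(datetime(2016, 5, 1, 9, 0), "CUST-42", "PAY-1")
k2 = make_row_key(datetime(2016, 5, 2, 8, 0), "CUST-42", "PAY-2")
assert k1 < k2  # the earlier payment sorts first, so a scan returns them in order
```

Note this assumes customer and payment IDs never contain the `_` separator; field ordering within the key is exactly the topic of the next part.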
Data model for 'Fetch customer orders'
| Row Key | OrderDateTime_CustomerID_OrderID |
|---|---|
| **Column Family O** | |
| OrderNo | |
| OrderDateTime | |
| OrderStatus | |
| TotalAmount | |
| FirstName | |
| LastName | |
| BusinessName | |
| ShippingAddress1 | |
| ShippingAddress2 | |
| ShippingCity | |
| ShippingState | |
| ShippingZip | |
| PaymentDateTime | |
| PaymentMode | |
| Name | |
| Number | |
| Amount | |
| Status | |
| BillingAddress1 | |
| BillingAddress2 | |
| BillingCity | |
| BillingState | |
| BillingZip | |
Data model for 'Fetch last item status'
| Row Key | ItemID |
|---|---|
| **Column Family ID** | |
| ItemName | |
| LastOrderNo | |
| LastOrderDateTime | |
| LastQuantitySold | |
| LastCustomerName | |
| LastOrderID | |
| LastPaymentReceived | |
We have created three tables for three use cases. That said, chances are we will have to create a table for each use case. Note that the row key structure has changed, and note the order in which each field has been positioned in the row key. There is a reason for that, which we will surely discuss in more detail in the next part.
Now, if we compare Multiple Column Families and Multiple Tables, here are a few things to ponder:
- Writes: Multiple Column Families may reduce the number of writes compared to Multiple Tables.
- Reads: if Multiple Column Families is not modeled properly, it can end up requiring multiple reads and joins on the code or business-logic side, whereas a Multiple Tables read tends to be straightforward.
- Deserialization: with Multiple Column Families, fetching one or two columns from each of several column families can force unnecessary deserialization of all columns in those families, compared to reading just one or two columns from Multiple Tables. So while creating a data model, it makes sense not to add unnecessary columns; this, in turn, makes the data model less future proof.
- Redundancy: Multiple Column Families resolve redundancy to some extent, though it all depends on how the data is modeled. Multiple Tables, on the other hand, increase redundancy and data duplication, as the same data is written to many tables.
- Consistency: Multiple Column Families are to some extent more consistent than Multiple Tables. If, on an update, the value is not replaced in all tables of the Multiple Tables model (say a few tables are missed by mistake), the data ends up inconsistent. The same goes for inserts if a few tables are missed.
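The consistency point can be sketched in Python: when one logical fact (here a customer's first name) is denormalized into several tables, a single update helper should fan the write out to every table that carries it, so no table is missed by mistake. The dict "tables", row keys, and helper are illustrative:

```python
# Two denormalized tables that both carry the customer's FirstName.
customer_payments = {"2016-05-01_CUST-42_PAY-1": {"P": {"FirstName": "Ann"}}}
customer_orders = {"2016-05-01_CUST-42_ORD-9": {"O": {"FirstName": "Ann"}}}

# One registry of every (table, column family) that denormalizes the field.
TABLES_WITH_FIRST_NAME = [(customer_payments, "P"), (customer_orders, "O")]

def update_first_name(customer_id, new_name):
    """Rewrite FirstName everywhere it is denormalized, in one place in code."""
    for table, family in TABLES_WITH_FIRST_NAME:
        for row_key, row in table.items():
            if f"_{customer_id}_" in row_key:  # customer id is part of the key
                row[family]["FirstName"] = new_name

update_first_name("CUST-42", "Anne")
```

Centralizing the fan-out in one helper is what keeps the Multiple Tables model from drifting into the inconsistency described above.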
Thanks for reading!
Ashfaq Chougle
Previous: Part Two: Data Model - From SQL Server to Wide Column No-SQL
Next: Part Four: Row Key Design - From SQL Server to Wide Column NoSQL