Hello Guys,
This blog is part one of eight part series of porting your RDBMS database to a NO-SQL wide column key value pair database. For introduction or part zero you can click here.
In this blog, I am presenting the thought process involved in selecting our sample use cases. As pointed out in the introduction the main intention of the POC was to make sure that Hadoop ecosystem can help us in improving the performance of our reporting web application and analytics application. Also, to recollect we had lots of intelligent and complex stored procedures. Again, I will be using this sample data model to explain the use cases. It is assumed that the sample database is a high transaction volume database as the database handles millions of orders a month.
Following are the list of sample use cases we selected:
- Fetch large number of records, in the order of tens of millions of records.
We wanted to make sure that the new data architecture in the Hadoop ecosystem is capable of fetching large number of data set. Normally these are cases for a data dump requested by the users.
Fetch list of customer payments:
CREATE PROCEDURE prcGetCustomerPayments
(
@StartDateTime DateTime
,@EndDateTime DateTime
)
AS
BEGIN
SET NOCOUNT ON;
SELECT
C.FirstName
,C.LastName
,C.BusinessName
,PaymentDateTime
,PaymentMode
,Name
,Number
,Amount
,ExpiryDate
,AuthDetails
,Status
FROM Customer C WITH (NOLOCK)
JOIN Payments P WITH (NOLOCK) ON C.CustomerID = P.CustomerID
AND PaymentDateTime BETWEEN @StartDateTime AND @EndDateTime
JOIN PaymentDetails PD WITH (NOLOCK) ON P.PaymentsID = PD.PaymentsID
END - Fetch small number of records, in the order of tens or hundreds of records.
We also wanted to make sure that implemented architecture not only fetches large data sets it also is capable of fetching smaller data sets faster.
Fetch list of customers:
CREATE PROCEDURE prcGetCustomerList
AS
BEGIN
SET NOCOUNT ON
SELECT
C.CustomerID
,C.FirstName
,C.LastName
,C.BusinessName
,C.EmailAddress ,A.Address1
,A.Address2
,A.City
,A.State
,A.Zip
FROM Customer C WITH (NOLOCK)
JOIN Address A WITH (NOLOCK) ON C.CustomerID = A.CustomerID
AND A.AddressType = 1 --Personal Address
END - Fetch data from multiple tables.
Just wanted to make sure if architecture is efficiently able to fetch data from multiple tables with optimum performance.
Fetch last item status:
CREATE PROCEDURE prcGetLastItemStatus
(
@ItemID INT
)
AS
BEGIN
SET NOCOUNT ON
DECLARE @ItemName VARCHAR(100)
DECLARE @LastOrderNo VARCHAR(15)
DECLARE @LastOrderDateTime SMALLDATETIME
DECLARE @LastQuantitySold INT
DECLARE @LastCustomerName VARCHAR(100)
DECLARE @ShippingLastAddress1 VARCHAR(50)
DECLARE @ShippingLastAddress2 VARCHAR(50)
DECLARE @ShippingLastCity VARCHAR(50)
DECLARE @ShippingLastState VARCHAR(50)
DECLARE @ShippingLastZip VARCHAR(10)
DECLARE @LastPaymentReceived SMALLMONEY
SELECT
@ItemName = Name
FROM Item WITH (NOLOCK)
WHERE ItemID = @ItemID
DECLARE @OrderID INT
DECLARE @CustomerID INT
DECLARE @ShippingAddressID INT
SELECT
@OrderID = MAX(OrderID)
FROM OrderDetails WITH (NOLOCK)
WHERE ItemID = @ItemID
SELECT
@LastOrderNo = OrderNo
,@LastOrderDateTime = OrderDateTime
,@LastQuantitySold = Quantity
,@CustomerID = CustomerID
,@ShippingAddressID = ShippingAddressID
FROM Order O WITH (NOLOCK)
JOIN OrderDetails OD WITH (NOLOCK) ON O.OrderID = OD.OrderID AND OD.ItemID = @ItemID AND O.OrderID = @OrderID
SELECT
@LastCustomerName = FirstName + ' ' + ISNULL(MiddleName,'') + ' ' + LastName
FROM Customer C WITH (NOLOCK)
WHERE CustomerID = @CustomerID
SELECT
@ShippingLastAddress1 = Address1
,@ShippingLastAddress2 = Address2
,@ShippingLastCity = City
,@ShippingLastState = State
,@ShippingLastZip = Zip
FROM Address WITH (NOLOCK)
WHERE AddressID = @ShippingAddressID
DECLARE @PaymentDetailsID INT
SELECT @PaymentDetailsID = MAX(PaymentDetailsID)
FROM PaymentDetails PD WITH (NOLOCK)
JOIN Payments P WITH (NOLOCK) ON PD.PaymentsID = P.PaymentsID AND P.OrderID = @OrderID
SELECT
@LastPaymentReceived = Amount
FROM PaymentDetails WITH (NOLOCK)
WHERE PaymentDetailsID = @PaymentDetailsID
SELECT
ItemName = @ItemName
,LastOrderNo = @LastOrderNo
,LastOrderDateTime = @LastOrderDateTime
,LastQuantitySold = @LastQuantitySold
,LastCustomerName = @LastCustomerName
,ShippingLastAddress1= @ShippingLastAddress1
,ShippingLastAddress2= @ShippingLastAddress2
,ShippingLastCity = @ShippingLastCity
,ShippingLastState = @ShippingLastState
,ShippingLastZip = @ShippingLastZip
,LastPaymentReceived = @LastPaymentReceived
END - Fetch large number of records (multiple table joins), in the order of tens of millions of records.
Another thing we wanted to know, how the architecture helps in the optimizing multiple joins.
Fetch customer orders:
CREATE PROCEDURE prcGetCustomerOrders
(
@StartDateTime DateTime
,@EndDateTime DateTime
)
AS
BEGIN
SET NOCOUNT ON;
SELECT
O.OrderNo
,O.OrderDateTime
,O.OrderStatus
,O.TotalAmount
,C.FirstName
,C.LastName
,C.BusinessName
,A.Address1
,A.Address2
,A.City
,A.State
,A.Zip
,PaymentDateTime
,PaymentMode
,Amount
,Status
,BAddress1 = BA.Address1
,BAddress2 = BA.Address2
,BCity = BA.City
,BState = BA.State
,BZip = BA.Zip
FROM Customer C WITH (NOLOCK)
JOIN Order O WITH (NOLOCK)
ON O.CustomerID = C.CustomerID
AND OrderDateTime BETWEEN @StartDateTime AND @EndDateTime
JOIN Address A WITH (NOLOCK)
ON O.ShippingAddressID = A.AddressID
LEFT JOIN Payments P WITH (NOLOCK) ON O.OrderID = P.OrderID
LEFT JOIN PaymentDetails PD WITH (NOLOCK) ON P.PaymentsID = PD.PaymentsID
LEFT JOIN Address B WITH (NOLOCK)
ON P.BillingAddressID = B.AddressID
END - In memory aggregation large number of records.
Is the architecture able to handle in memory aggregations for large volume of data?
Fetch customer order summary:
CREATE PROCEDURE prcGetOrderSummary
(
@StartDateTime DateTime
,@EndDateTime DateTime
)
AS
BEGIN
SET NOCOUNT ON;
SELECT
O.CustomerID
,TotalDiscount = SUM(CASE WHEN DiscountCode IS NOT NULL THEN O.GrossAmount - O.TotalAmount ELSE 0 END)
,TotalAmount = SUM(O.TotalAmount)
,OrderCount = COUNT(1)
,FisrtOrder = MIN(OrderDateTime)
,LastOrder = MAX(OrderDateTime)
INTO #TempOrders
FROM Orders O WITH (NOLOCK)
WHERE O.OrderDateTime BETWEEN @StartDateTime AND @EndDateTime
SELECT
O.CustomerID
,ItemCount = COUNT(1)
INTO #TempOrderDetails
FROM Orders O WITH (NOLOCK)
JOIN OrderDetails OD WITH (NOLOCK) ON O.OrderID = OD.OrderID
AND O.OrderDateTime BETWEEN @StartDateTime AND @EndDateTime
SELECT
O.CustomerID
,PaymentCount = COUNT(1)
,FisrtPayment = MIN(PaymentDateTime)
,LastPayment = MAX(PaymentDateTime)
INTO #TempPayments
FROM Orders O WITH (NOLOCK)
JOIN Payments P WITH (NOLOCK) ON O.OrderID = P.OrderID
AND O.OrderDateTime BETWEEN @StartDateTime AND @EndDateTime
SELECT
O.CustomerID
,TotalPayment = SUM(CASE WHEN Status = 1 THEN Amount ELSE 0 END)
INTO #TempPaymentDetails
FROM Orders O WITH (NOLOCK)
JOIN Payments P WITH (NOLOCK) ON O.OrderID = P.OrderID
AND O.OrderDateTime BETWEEN @StartDateTime AND @EndDateTime
JOIN PaymentDetails P WITH (NOLOCK) ON P.PaymentsID = PD.PaymentsID
SELECT
C.FirstName
,C.LastName
,C.BusinessName
,A.Address1
,A.Address2
,A.City
,A.State
,A.Zip
,O.OrderCount
,O.TotalAmount
,O.TotalDiscount
,O.FisrtOrder
,O.LastOrder
,OD.ItemCount
,P.PaymentCount
,PD.TotalPayment
,P.FisrtPayment
,P.LastPayment
,Balance = O.TotalAmount - PD.TotalPayment
FROM Customer C WITH (NOLOCK)
JOIN Address A WITH (NOLOCK)
ON A.CustomerID = C.AddressID
JOIN #TempOrders O WITH (NOLOCK) ON O.CustomerID = C.CustomerID
JOIN #TempOrderDetails OD WITH (NOLOCK) ON OD.CustomerID = C.CustomerID
LEFT JOIN #TempPayments P WITH (NOLOCK) ON P.CustomerID = C.CustomerID
LEFT JOIN #TempPaymentDetails PD WITH (NOLOCK) ON PD.CustomerID = C.CustomerID
END - In memory aggregation less number of records.
How does it perform for smaller data set but heavy aggregation.
Total discounts applied by discount codes:
CREATE PROCEDURE prcGetDiscountsAppliedByCode
(
@StartDateTime DateTime
,@EndDateTime DateTime
)
AS
BEGIN
SET NOCOUNT ON
SELECT
DiscountCode
,Amount = SUM(GrossAmount - TotalAmount)
,Count = COUNT(1) FROM Order WITH (NOLOCK)
WHERE OrderDateTime BETWEEN @StartDateTime AND @EndDateTime
GROUP BY DiscountCode
END - Finally a data set for Tableau connectivity
Just want to make sure how the Tableau can be connected to the new data model.
CREATE VIEW vwAllCustomerOrders
AS
SELECT
O.OrderNo
,O.OrderDateTime
,O.OrderStatus
,O.TotalAmount
,C.FirstName
,C.LastName
,C.BusinessName
,A.Address1
,A.Address2
,A.City
,A.State
,A.Zip
,PaymentDateTime
,PaymentMode
,Amount
,Status
,BAddress1 = BA.Address1
,BAddress2 = BA.Address2
,BCity = BA.City
,BState = BA.State
,BZip = BA.Zip
FROM Customer C WITH (NOLOCK)
JOIN Order O WITH (NOLOCK)
ON O.CustomerID = C.CustomerID
JOIN Address A WITH (NOLOCK)
ON O.ShippingAddressID = A.AddressID
LEFT JOIN Payments P WITH (NOLOCK) ON O.OrderID = P.OrderID
LEFT JOIN PaymentDetails PD WITH (NOLOCK) ON P.PaymentsID = PD.PaymentsID
LEFT JOIN Address B WITH (NOLOCK)
ON P.BillingAddressID = B.AddressID
Thanks for reading!
Ashfaq Chougle
Previous: Part Zero: Introduction - From SQL Server to Wide Column No-SQL
Next: Part Two: Data Model - From SQL Server to Wide Column No-SQL