To SQL Or NoSQL: Part One: Sample Use Cases - From SQL Server to Wide Column No-SQL

Hello Guys,

This blog is part one of eight part series of porting your RDBMS database to a NO-SQL wide column key value pair database. For introduction or part zero you can click here.

In this blog, I am presenting the thought process involved in selecting our sample use cases. As pointed out in the introduction the main intention of the POC was to make sure that Hadoop ecosystem can help us in improving the performance of our reporting web application and analytics application. Also, to recollect we had lots of intelligent and complex stored procedures. Again, I will be using this sample data model to explain the use cases. It is assumed that the sample database is a high transaction volume database as the database handles millions of orders a month.

Following are the list of sample use cases we selected:

Fetch large number of records, in the order of tens of millions of records.
We wanted to make sure that the new data architecture in the Hadoop ecosystem is capable of fetching large number of data set. Normally these are cases for a data dump requested by the users.
Fetch list of customer payments:

CREATE PROCEDURE prcGetCustomerPayments
(
    @StartDateTime DateTime
    ,@EndDateTime DateTime
)
AS
BEGIN
    SET NOCOUNT ON;
    SELECT
         C.FirstName
        ,C.LastName
        ,C.BusinessName
        ,PaymentDateTime
        ,PaymentMode
        ,Name
        ,Number
        ,Amount
        ,ExpiryDate
        ,AuthDetails
        ,Status
    FROM Customer C WITH (NOLOCK)
    JOIN Payments P WITH (NOLOCK) ON C.CustomerID = P.CustomerID
        AND PaymentDateTime BETWEEN @StartDateTime AND @EndDateTime
    JOIN PaymentDetails PD WITH (NOLOCK) ON P.PaymentsID = PD.PaymentsID
END
Fetch small number of records, in the order of tens or hundreds of records.
We also wanted to make sure that implemented architecture not only fetches large data sets it also is capable of fetching smaller data sets faster.
Fetch list of customers:

CREATE PROCEDURE prcGetCustomerList
AS
BEGIN
    SET NOCOUNT ON
    SELECT
        C.CustomerID
        ,C.FirstName
        ,C.LastName
        ,C.BusinessName
        ,C.EmailAddress        ,A.Address1
        ,A.Address2
        ,A.City
        ,A.State
        ,A.Zip
    FROM Customer C WITH (NOLOCK)
    JOIN Address A WITH (NOLOCK) ON C.CustomerID = A.CustomerID
        AND A.AddressType = 1 --Personal Address

END
Fetch data from multiple tables.
Just wanted to make sure if architecture is efficiently able to fetch data from multiple tables with optimum performance.
Fetch last item status:

CREATE PROCEDURE prcGetLastItemStatus
(
    @ItemID INT
)
AS
BEGIN
    SET NOCOUNT ON

    DECLARE @ItemName VARCHAR(100)
    DECLARE @LastOrderNo VARCHAR(15)
    DECLARE @LastOrderDateTime SMALLDATETIME
    DECLARE @LastQuantitySold INT
    DECLARE @LastCustomerName VARCHAR(100)
    DECLARE @ShippingLastAddress1 VARCHAR(50)
    DECLARE @ShippingLastAddress2 VARCHAR(50)
    DECLARE @ShippingLastCity VARCHAR(50)
    DECLARE @ShippingLastState VARCHAR(50)
    DECLARE @ShippingLastZip VARCHAR(10)
    DECLARE @LastPaymentReceived SMALLMONEY

    SELECT
        @ItemName = Name
    FROM Item WITH (NOLOCK)
    WHERE ItemID = @ItemID

    DECLARE @OrderID INT
    DECLARE @CustomerID INT
    DECLARE @ShippingAddressID INT

    SELECT
        @OrderID = MAX(OrderID)
    FROM OrderDetails WITH (NOLOCK)
    WHERE ItemID = @ItemID

    SELECT
        @LastOrderNo = OrderNo
        ,@LastOrderDateTime = OrderDateTime
        ,@LastQuantitySold = Quantity
        ,@CustomerID = CustomerID
        ,@ShippingAddressID = ShippingAddressID
    FROM Order O WITH (NOLOCK)
    JOIN OrderDetails OD WITH (NOLOCK) ON O.OrderID = OD.OrderID AND OD.ItemID = @ItemID AND O.OrderID = @OrderID

    SELECT
        @LastCustomerName = FirstName + ' ' + ISNULL(MiddleName,'') + ' ' + LastName
    FROM Customer C WITH (NOLOCK)
    WHERE CustomerID = @CustomerID

    SELECT
         @ShippingLastAddress1 = Address1
        ,@ShippingLastAddress2 = Address2
        ,@ShippingLastCity = City
        ,@ShippingLastState = State
        ,@ShippingLastZip = Zip
    FROM Address WITH (NOLOCK)
    WHERE AddressID = @ShippingAddressID

    DECLARE @PaymentDetailsID INT

    SELECT @PaymentDetailsID = MAX(PaymentDetailsID)
    FROM PaymentDetails PD WITH (NOLOCK)
    JOIN Payments P WITH (NOLOCK) ON PD.PaymentsID = P.PaymentsID AND P.OrderID = @OrderID

    SELECT
        @LastPaymentReceived = Amount
    FROM PaymentDetails WITH (NOLOCK)
    WHERE PaymentDetailsID = @PaymentDetailsID

    SELECT
        ItemName                    = @ItemName
        ,LastOrderNo               = @LastOrderNo
        ,LastOrderDateTime    = @LastOrderDateTime
        ,LastQuantitySold        = @LastQuantitySold
        ,LastCustomerName    = @LastCustomerName
        ,ShippingLastAddress1= @ShippingLastAddress1
        ,ShippingLastAddress2= @ShippingLastAddress2
        ,ShippingLastCity        = @ShippingLastCity
        ,ShippingLastState       = @ShippingLastState
        ,ShippingLastZip          = @ShippingLastZip
        ,LastPaymentReceived = @LastPaymentReceived

END
Fetch large number of records (multiple table joins), in the order of tens of millions of records.
Another thing we wanted to know, how the architecture helps in the optimizing multiple joins.
Fetch customer orders:

CREATE PROCEDURE prcGetCustomerOrders
(
     @StartDateTime DateTime
    ,@EndDateTime DateTime
)
AS
BEGIN
    SET NOCOUNT ON;
    SELECT
        O.OrderNo
        ,O.OrderDateTime
        ,O.OrderStatus
        ,O.TotalAmount
        ,C.FirstName
        ,C.LastName
        ,C.BusinessName
        ,A.Address1
        ,A.Address2
        ,A.City
        ,A.State
        ,A.Zip
        ,PaymentDateTime
        ,PaymentMode
        ,Amount
        ,Status
        ,BAddress1 = BA.Address1
        ,BAddress2 = BA.Address2
        ,BCity = BA.City
        ,BState = BA.State
        ,BZip = BA.Zip
    FROM Customer C WITH (NOLOCK)
    JOIN Order O WITH (NOLOCK)
        ON O.CustomerID = C.CustomerID
            AND OrderDateTime BETWEEN @StartDateTime AND @EndDateTime
    JOIN Address A WITH (NOLOCK)
        ON O.ShippingAddressID = A.AddressID
    LEFT JOIN Payments P WITH (NOLOCK) ON O.OrderID = P.OrderID
    LEFT JOIN PaymentDetails PD WITH (NOLOCK) ON P.PaymentsID = PD.PaymentsID
    LEFT JOIN Address B WITH (NOLOCK)
        ON P.BillingAddressID = B.AddressID
END
In memory aggregation large number of records.
Is the architecture able to handle in memory aggregations for large volume of data?
Fetch customer order summary:

CREATE PROCEDURE prcGetOrderSummary
(
     @StartDateTime DateTime
    ,@EndDateTime DateTime
)
AS
BEGIN
    SET NOCOUNT ON;

    SELECT
        O.CustomerID
        ,TotalDiscount = SUM(CASE WHEN DiscountCode IS NOT NULL THEN O.GrossAmount - O.TotalAmount ELSE 0 END)
        ,TotalAmount = SUM(O.TotalAmount)
        ,OrderCount = COUNT(1)
        ,FisrtOrder = MIN(OrderDateTime)
        ,LastOrder = MAX(OrderDateTime)
    INTO #TempOrders
    FROM Orders O WITH (NOLOCK)
    WHERE O.OrderDateTime BETWEEN @StartDateTime AND @EndDateTime

    SELECT
        O.CustomerID
        ,ItemCount = COUNT(1)
    INTO #TempOrderDetails
    FROM Orders O WITH (NOLOCK)
    JOIN OrderDetails OD WITH (NOLOCK) ON O.OrderID = OD.OrderID
        AND O.OrderDateTime BETWEEN @StartDateTime AND @EndDateTime

    SELECT
        O.CustomerID
        ,PaymentCount = COUNT(1)
        ,FisrtPayment = MIN(PaymentDateTime)
        ,LastPayment = MAX(PaymentDateTime)
    INTO #TempPayments
    FROM Orders O WITH (NOLOCK)
    JOIN Payments P WITH (NOLOCK) ON O.OrderID = P.OrderID
        AND O.OrderDateTime BETWEEN @StartDateTime AND @EndDateTime

    SELECT
        O.CustomerID
        ,TotalPayment = SUM(CASE WHEN Status = 1 THEN Amount ELSE 0 END)
    INTO #TempPaymentDetails
    FROM Orders O WITH (NOLOCK)
    JOIN Payments P WITH (NOLOCK) ON O.OrderID = P.OrderID
        AND O.OrderDateTime BETWEEN @StartDateTime AND @EndDateTime
    JOIN PaymentDetails P WITH (NOLOCK) ON P.PaymentsID = PD.PaymentsID

    SELECT
        C.FirstName
        ,C.LastName
        ,C.BusinessName
        ,A.Address1
        ,A.Address2
        ,A.City
        ,A.State
        ,A.Zip
        ,O.OrderCount
        ,O.TotalAmount
        ,O.TotalDiscount
        ,O.FisrtOrder
        ,O.LastOrder
        ,OD.ItemCount
        ,P.PaymentCount
        ,PD.TotalPayment
        ,P.FisrtPayment
        ,P.LastPayment
        ,Balance = O.TotalAmount - PD.TotalPayment
    FROM Customer C WITH (NOLOCK)
    JOIN Address A WITH (NOLOCK)
        ON A.CustomerID = C.AddressID
    JOIN #TempOrders O WITH (NOLOCK) ON O.CustomerID = C.CustomerID
    JOIN #TempOrderDetails OD WITH (NOLOCK) ON OD.CustomerID = C.CustomerID
    LEFT JOIN #TempPayments P WITH (NOLOCK) ON P.CustomerID = C.CustomerID
    LEFT JOIN #TempPaymentDetails PD WITH (NOLOCK) ON PD.CustomerID = C.CustomerID

END
In memory aggregation less number of records.
How does it perform for smaller data set but heavy aggregation.
Total discounts applied by discount codes:

CREATE PROCEDURE prcGetDiscountsAppliedByCode
(
     @StartDateTime DateTime
    ,@EndDateTime DateTime
)
AS
BEGIN
    SET NOCOUNT ON

    SELECT
         DiscountCode
        ,Amount = SUM(GrossAmount - TotalAmount)
        ,Count = COUNT(1)    FROM Order WITH (NOLOCK)
    WHERE OrderDateTime BETWEEN @StartDateTime AND @EndDateTime
    GROUP BY DiscountCode
END
Finally a data set for Tableau connectivity
Just want to make sure how the Tableau can be connected to the new data model.
CREATE VIEW vwAllCustomerOrders
AS

    SELECT
         O.OrderNo
        ,O.OrderDateTime
        ,O.OrderStatus
        ,O.TotalAmount
        ,C.FirstName
        ,C.LastName
        ,C.BusinessName
        ,A.Address1
        ,A.Address2
        ,A.City
        ,A.State
        ,A.Zip
        ,PaymentDateTime
        ,PaymentMode
        ,Amount
        ,Status
        ,BAddress1 = BA.Address1
        ,BAddress2 = BA.Address2
        ,BCity = BA.City
        ,BState = BA.State
        ,BZip = BA.Zip
    FROM Customer C WITH (NOLOCK)
    JOIN Order O WITH (NOLOCK)
        ON O.CustomerID = C.CustomerID
    JOIN Address A WITH (NOLOCK)
        ON O.ShippingAddressID = A.AddressID
    LEFT JOIN Payments P WITH (NOLOCK) ON O.OrderID = P.OrderID
    LEFT JOIN PaymentDetails PD WITH (NOLOCK) ON P.PaymentsID = PD.PaymentsID
    LEFT JOIN Address B WITH (NOLOCK)
        ON P.BillingAddressID = B.AddressID

Thanks for reading!

Ashfaq Chougle

Previous: Part Zero: Introduction - From SQL Server to Wide Column No-SQL
Next: Part Two: Data Model - From SQL Server to Wide Column No-SQL

To SQL Or NoSQL

Sunday, December 25, 2016

Part One: Sample Use Cases - From SQL Server to Wide Column No-SQL

Total Pageviews