Sunday, December 25, 2016

Part One: Sample Use Cases - From SQL Server to Wide Column No-SQL




Hello Guys,

This blog is part one of eight part series of porting your RDBMS database to a NO-SQL wide column key value pair database. For introduction or part zero you can click here.

In this blog, I am presenting the thought process involved in selecting our sample use cases. As pointed out in the introduction the main intention of the POC was to make sure that Hadoop ecosystem can help us in improving the performance of our reporting web application and analytics application. Also, to recollect we had lots of intelligent and complex stored procedures. Again, I will be using this sample data model to explain the use cases. It is assumed that the sample database is a high transaction volume database as the database handles millions of orders a month.

Following are the list of sample use cases we selected:
  1. Fetch large number of records, in the order of tens of millions of records.
    We wanted to make sure that the new data architecture in the Hadoop ecosystem is capable of fetching large number of data set. Normally these are cases for a data dump requested by the users.
    Fetch list of customer payments:

    CREATE PROCEDURE prcGetCustomerPayments
    (
         @StartDateTime DateTime
        ,@EndDateTime DateTime
    )
    AS
    BEGIN
        SET NOCOUNT ON;
        SELECT
             C.FirstName
            ,C.LastName
            ,C.BusinessName
            ,PaymentDateTime
            ,PaymentMode
            ,Name
            ,Number
            ,Amount
            ,ExpiryDate
            ,AuthDetails
            ,Status
        FROM Customer C WITH (NOLOCK)
        JOIN Payments P WITH (NOLOCK) ON C.CustomerID = P.CustomerID
            AND PaymentDateTime BETWEEN @StartDateTime AND @EndDateTime
        JOIN PaymentDetails PD WITH (NOLOCK) ON P.PaymentsID = PD.PaymentsID
    END
  2. Fetch small number of records, in the order of tens or hundreds of records.
    We also wanted to make sure that implemented architecture not only fetches large data sets it also is capable of fetching smaller data sets faster.
    Fetch list of customers:

    CREATE PROCEDURE prcGetCustomerList
    AS
    BEGIN
        SET NOCOUNT ON
        SELECT
             C.CustomerID
            ,C.FirstName
            ,C.LastName
            ,C.BusinessName

            ,C.EmailAddress
            ,A.Address1
            ,A.Address2
            ,A.City
            ,A.State
            ,A.Zip
        FROM Customer C WITH (NOLOCK)
        JOIN Address A WITH (NOLOCK) ON C.CustomerID = A.CustomerID
            AND A.AddressType = 1 --Personal Address

    END
  3. Fetch data from multiple tables.
    Just wanted to make sure if architecture is efficiently able to fetch data from multiple tables with optimum performance.
    Fetch last item status:

    CREATE PROCEDURE prcGetLastItemStatus
    (
        @ItemID INT
    )
    AS
    BEGIN
        SET NOCOUNT ON
       
        DECLARE @ItemName VARCHAR(100)
        DECLARE @LastOrderNo VARCHAR(15)
        DECLARE @LastOrderDateTime SMALLDATETIME
        DECLARE @LastQuantitySold INT
        DECLARE @LastCustomerName VARCHAR(100)
        DECLARE @ShippingLastAddress1 VARCHAR(50)
        DECLARE @ShippingLastAddress2 VARCHAR(50)
        DECLARE @ShippingLastCity VARCHAR(50)
        DECLARE @ShippingLastState VARCHAR(50)
        DECLARE @ShippingLastZip VARCHAR(10)
        DECLARE @LastPaymentReceived SMALLMONEY

        SELECT
            @ItemName = Name
        FROM Item WITH (NOLOCK)
        WHERE ItemID = @ItemID

        DECLARE @OrderID INT
        DECLARE @CustomerID INT
        DECLARE @ShippingAddressID INT

        SELECT
            @OrderID = MAX(OrderID)
        FROM OrderDetails WITH (NOLOCK)
        WHERE ItemID = @ItemID

        SELECT
             @LastOrderNo = OrderNo
            ,@LastOrderDateTime = OrderDateTime
            ,@LastQuantitySold = Quantity
            ,@CustomerID = CustomerID
            ,@ShippingAddressID = ShippingAddressID
        FROM Order O WITH (NOLOCK)
        JOIN OrderDetails OD WITH (NOLOCK) ON O.OrderID = OD.OrderID AND OD.ItemID = @ItemID AND O.OrderID = @OrderID

        SELECT
            @LastCustomerName = FirstName + ' ' + ISNULL(MiddleName,'') + ' ' + LastName
        FROM Customer C WITH (NOLOCK)
        WHERE CustomerID = @CustomerID

        SELECT
             @ShippingLastAddress1 = Address1
            ,@ShippingLastAddress2 = Address2
            ,@ShippingLastCity = City
            ,@ShippingLastState = State
            ,@ShippingLastZip = Zip
        FROM Address WITH (NOLOCK)
        WHERE AddressID = @ShippingAddressID

        DECLARE @PaymentDetailsID INT

        SELECT @PaymentDetailsID = MAX(PaymentDetailsID)
        FROM PaymentDetails PD WITH (NOLOCK)
        JOIN Payments P WITH (NOLOCK) ON PD.PaymentsID = P.PaymentsID AND P.OrderID = @OrderID

        SELECT
            @LastPaymentReceived = Amount
        FROM PaymentDetails WITH (NOLOCK)
        WHERE PaymentDetailsID = @PaymentDetailsID

        SELECT 
             ItemName                    = @ItemName
            ,LastOrderNo               = @LastOrderNo
            ,LastOrderDateTime    = @LastOrderDateTime
            ,LastQuantitySold        = @LastQuantitySold
            ,LastCustomerName    = @LastCustomerName
            ,ShippingLastAddress1= @ShippingLastAddress1
            ,ShippingLastAddress2= @ShippingLastAddress2
            ,ShippingLastCity        = @ShippingLastCity
            ,ShippingLastState       = @ShippingLastState
            ,ShippingLastZip          = @ShippingLastZip
            ,LastPaymentReceived = @LastPaymentReceived

    END
  4. Fetch large number of records (multiple table joins), in the order of tens of millions of records.
    Another thing we wanted to know, how the architecture helps in the optimizing multiple joins.
    Fetch customer orders:

    CREATE PROCEDURE prcGetCustomerOrders
    (
         @StartDateTime DateTime
        ,@EndDateTime DateTime
    )
    AS
    BEGIN
        SET NOCOUNT ON;
        SELECT
             O.OrderNo
            ,O.OrderDateTime
            ,O.OrderStatus
            ,O.TotalAmount
            ,C.FirstName
            ,C.LastName
            ,C.BusinessName
            ,A.Address1
            ,A.Address2
            ,A.City
            ,A.State
            ,A.Zip
            ,PaymentDateTime
            ,PaymentMode
            ,Amount
            ,Status
            ,BAddress1 = BA.Address1
            ,BAddress2 = BA.Address2
            ,BCity = BA.City
            ,BState = BA.State
            ,BZip = BA.Zip
        FROM Customer C WITH (NOLOCK)
        JOIN Order O WITH (NOLOCK)
            ON O.CustomerID = C.CustomerID
                AND OrderDateTime BETWEEN @StartDateTime AND @EndDateTime
        JOIN Address A WITH (NOLOCK)
            ON O.ShippingAddressID = A.AddressID
        LEFT JOIN Payments P WITH (NOLOCK) ON O.OrderID = P.OrderID
        LEFT JOIN PaymentDetails PD WITH (NOLOCK) ON P.PaymentsID = PD.PaymentsID
        LEFT JOIN Address B WITH (NOLOCK)
            ON P.BillingAddressID = B.AddressID
    END
  5. In memory aggregation large number of records.
    Is the architecture able to handle in memory aggregations for large volume of data?
    Fetch customer order summary:

    CREATE PROCEDURE prcGetOrderSummary
    (
         @StartDateTime DateTime
        ,@EndDateTime DateTime
    )
    AS
    BEGIN
        SET NOCOUNT ON;

        SELECT
             O.CustomerID
            ,TotalDiscount = SUM(CASE WHEN DiscountCode IS NOT NULL THEN O.GrossAmount - O.TotalAmount ELSE 0 END)
            ,TotalAmount = SUM(O.TotalAmount)
            ,OrderCount = COUNT(1)
            ,FisrtOrder = MIN(OrderDateTime)
            ,LastOrder = MAX(OrderDateTime)
        INTO #TempOrders
        FROM Orders O WITH (NOLOCK)
        WHERE O.OrderDateTime BETWEEN @StartDateTime AND @EndDateTime

        SELECT
             O.CustomerID
            ,ItemCount = COUNT(1)
        INTO #TempOrderDetails
        FROM Orders O WITH (NOLOCK)
        JOIN OrderDetails OD WITH (NOLOCK) ON O.OrderID = OD.OrderID
            AND O.OrderDateTime BETWEEN @StartDateTime AND @EndDateTime


        SELECT
             O.CustomerID
            ,PaymentCount = COUNT(1)
            ,FisrtPayment = MIN(PaymentDateTime)
            ,LastPayment = MAX(PaymentDateTime)
        INTO #TempPayments
        FROM Orders O WITH (NOLOCK)
        JOIN Payments P WITH (NOLOCK) ON O.OrderID = P.OrderID
            AND O.OrderDateTime BETWEEN @StartDateTime AND @EndDateTime

        SELECT
             O.CustomerID
            ,TotalPayment = SUM(CASE WHEN Status = 1 THEN Amount ELSE 0 END)
        INTO #TempPaymentDetails
        FROM Orders O WITH (NOLOCK)
        JOIN Payments P WITH (NOLOCK) ON O.OrderID = P.OrderID
            AND O.OrderDateTime BETWEEN @StartDateTime AND @EndDateTime
        JOIN PaymentDetails P WITH (NOLOCK) ON P.PaymentsID = PD.PaymentsID

        SELECT
             C.FirstName
            ,C.LastName
            ,C.BusinessName
            ,A.Address1
            ,A.Address2
            ,A.City
            ,A.State
            ,A.Zip
            ,O.OrderCount
            ,O.TotalAmount
            ,O.TotalDiscount
            ,O.FisrtOrder
            ,O.LastOrder
            ,OD.ItemCount
            ,P.PaymentCount
            ,PD.TotalPayment
            ,P.FisrtPayment
            ,P.LastPayment
            ,Balance = O.TotalAmount - PD.TotalPayment
        FROM Customer C WITH (NOLOCK)
        JOIN Address A WITH (NOLOCK)
            ON A.CustomerID = C.AddressID
        JOIN #TempOrders O WITH (NOLOCK) ON O.CustomerID = C.CustomerID
        JOIN #TempOrderDetails OD WITH (NOLOCK) ON OD.CustomerID = C.CustomerID
        LEFT JOIN #TempPayments P WITH (NOLOCK) ON P.CustomerID = C.CustomerID
        LEFT JOIN #TempPaymentDetails PD WITH (NOLOCK) ON PD.CustomerID = C.CustomerID
       
    END
  6. In memory aggregation less number of records.
    How does it perform for smaller data set but heavy aggregation.
    Total discounts applied by discount codes:

    CREATE PROCEDURE prcGetDiscountsAppliedByCode
    (
         @StartDateTime DateTime
        ,@EndDateTime DateTime
    )
    AS
    BEGIN
        SET NOCOUNT ON

        SELECT
              DiscountCode
            ,Amount = SUM(GrossAmount - TotalAmount)

            ,Count = COUNT(1)
        FROM Order WITH (NOLOCK)
        WHERE OrderDateTime BETWEEN @StartDateTime AND @EndDateTime
        GROUP BY DiscountCode
    END
  7. Finally a data set for Tableau connectivity
    Just want to make sure how the Tableau can be connected to the new data model.
    CREATE VIEW vwAllCustomerOrders
    AS

        SELECT
             O.OrderNo
            ,O.OrderDateTime
            ,O.OrderStatus
            ,O.TotalAmount
            ,C.FirstName
            ,C.LastName
            ,C.BusinessName
            ,A.Address1
            ,A.Address2
            ,A.City
            ,A.State
            ,A.Zip
            ,PaymentDateTime
            ,PaymentMode
            ,Amount
            ,Status
            ,BAddress1 = BA.Address1
            ,BAddress2 = BA.Address2
            ,BCity = BA.City
            ,BState = BA.State
            ,BZip = BA.Zip
        FROM Customer C WITH (NOLOCK)
        JOIN Order O WITH (NOLOCK)
            ON O.CustomerID = C.CustomerID
        JOIN Address A WITH (NOLOCK)
            ON O.ShippingAddressID = A.AddressID
        LEFT JOIN Payments P WITH (NOLOCK) ON O.OrderID = P.OrderID
        LEFT JOIN PaymentDetails PD WITH (NOLOCK) ON P.PaymentsID = PD.PaymentsID
        LEFT JOIN Address B WITH (NOLOCK)
            ON P.BillingAddressID = B.AddressID


Thanks for reading!

Ashfaq Chougle

Previous: Part Zero: Introduction - From SQL Server to Wide Column No-SQL
Next: Part Two: Data Model - From SQL Server to Wide Column No-SQL