Home Forums Power Pivot in line data scrub

This topic contains 15 replies, has 2 voices, and was last updated by  Mike Church 8 years, 10 months ago.

Viewing 15 posts - 1 through 15 (of 16 total)
  • Author
    Posts
  • #679

    Mike Church
    Participant
    • Started: 7
    • Replies: 24
    • Total: 31

    First a dislosure, I am new to Powerpivot and have been on a Powerpivot binge since discovering it 60 days ago.  Have since read Rob’s book and tons of other information.  With that said,  I have a desire to solve for an issue in a dataset that I use as my source data table and it’s baffled me to the point of not even knowing where to begin in Powerpivot.  I could solve the issue at the data table level by overwriting some data using index/match but want to solve it in Powerpivot b/c I believe the skill would be beneficial for similar data scrub work.

    I have a file that provides $ totals by month end dates going back 13 months.  I also have a certain code that I use as my rows measure in a pivot table.  A large 3 month data scrub on this certain code recently occurred which had the affect of creating an appearance of migration of $ between categories based on actual movement of dollars in those areas when in fact it the movements were artificial due to corrections in coding.

    Is anyone aware of a way to have powerpivot summarize the dollars for each category based on the most recent code for a record.  Here is an example, customer 123 first appears in the database in January 31st with a code of “A” however several months later it was determined that the correct code should have been “B”.  My pivot table by month end date would account for that record by classifying the total dollars for that particular customer as an “A” for January and through the months just prior to the recoding with the subsequent month end periods showing the $ as “B”.  What I really want is for the total $ to show as “B” for all month end reporting periods.  The logic is that we are improving the accuracy of this coding and the most accurate coding will be code found in the most recent month end data set.

    Hopefully my description wasn’t terribly confusing, thanks in advance for any assistance provided.

    #690

    tomallan
    Keymaster
    • Started: 0
    • Replies: 417
    • Total: 417

    Hello sans holo,

    I can think of several possibilities; do you have a sample dataset?

     

    #706

    Mike Church
    Participant
    • Started: 7
    • Replies: 24
    • Total: 31

    Thanks Tom!

    I work for a bank so I had to create a dataset for you.  It is a very simple dataset however I believe it gives you what you need to generate a solution.  With that said, all of the columns provided are actual columns found in my fact table along with 20 or so additional columns of data not pertinent to this request.  I use this fact table to trend our loan portfolio over many months worth of reporting data by balance and other metrics.

    In reference to the problem at hand, the example provided is a very prevalent issue whereby we have determined that a NAICS code is incorrect and have corrected the coding as is seen in the sample dataset where the NAICS code changes somewhere in the month of April.  I would like to be able to create a measure that overwrites the NAICS code for previous reporting periods that contained the incorrect code in order to limit the noise in the reporting.  I still want the incorrect coding in place because even though it is incorrect I need it to report as such for audit purposes.  Could even contrast the two (actual vs. retro corrected) to report on the total $ worth of accounts that have been scrubbed/corrected over a period of time.

     

    Thanks for your help.

    Attachments:
    You must be logged in to view attached files.
    #712

    tomallan
    Keymaster
    • Started: 0
    • Replies: 417
    • Total: 417

    Hello sans holo,

    Do you have Excel 2013 or Excel 2010? 2013 has an option that 2010 does not have.

    #713

    Mike Church
    Participant
    • Started: 7
    • Replies: 24
    • Total: 31

    64 bit 2013

    #714

    tomallan
    Keymaster
    • Started: 0
    • Replies: 417
    • Total: 417

    In the attached workbook there is a calculated column named:

    Latest NAICS # For This Account

    Let me know if you have any questions; most of the heavy work is done by the TOPN clause, the last non-blank and max are there to suppress data showing up in totals and to satisfy the need for an aggregation in the CALCULATE.

    Attachments:
    You must be logged in to view attached files.
    #718

    tomallan
    Keymaster
    • Started: 0
    • Replies: 417
    • Total: 417

    In the previous entry, there are two files under attachments, except for file name, one is a duplicate of the other.

    #720

    Mike Church
    Participant
    • Started: 7
    • Replies: 24
    • Total: 31

    Thanks Tom, this is impressive.  Exactly what I needed.  I will definitely be adding this arrow to my DAX quiver.

    Just so I can make sure I understand what I have here, I would like to describe what you have written and get your feedback my understanding of each component of the formula.

    Starting from left to right, CALCULATE is as I understand it the gateway to all good things in DAX (I know there is more than that but I know it allows for a summary calc with filter capabilities).  HASONEVALUE was used as the true false test for the if statement.  It is testing how many variations of the “NAICS” filter are present.  If it is a distinct value the the true result caluclates the max “NAICS” per the MAX operator.  MAX operator is there as you said to keep everyting kosher per the CALCULATE operator requirements and to supress dupes.  If the statement were false, and there are multiple “NAICS” present then the TOPN operator is depolyed to return the nth record, which you have specified as the first per the “1” at the beginning of the equation.  The FILTER and ALL sets directs the TOPN to the table to be analyzed and the ALL clears any filters in place.  The [Account#]=EARLIER([Account #]) section sets the account # as the expression that will drive the ordering which will be done by the [Data Refresh] field at the end of the formula.  In doing some research on EARLIER which was new to me, I found a post http://www.powerpivotpro.com/2012/03/the-correct-usage-of-earlier/ that describes EARLIER as the escape hatch for the ALL operator as the ALL operator sets the context to a column order and EARLIER can be thought of as CURRENTROW as it sets the context back to a row order which is what TOPN needs.

    Let me know where I missed the mark with my comprehension of these concepts.  So much to learn!!

    #737

    tomallan
    Keymaster
    • Started: 0
    • Replies: 417
    • Total: 417

    Hello sans holo,

    Larger formulas can be easier to understand going from the inside and working outwards.

    Innermost is a FILTER that separates rows of the same account number as the current row’s account number, and the ALL function ensures that external filters (coming from rows, columns, slicers, etc.) will be ignored — so in the end we have a complete set of rows with the same account number.

    To understand EARLIER you could picture the Power Pivot engine scanning through each record of  a copy of the Loans table (I will call this copy the “EARLIER” table), and for each record it scans, the engine makes another copy of the Loans table to compare–record by record–with the current record in the EARLIER table.  The EARLIER function is just a tool so your formula can tell the Power Pivot engine which record’s value you want to use.

    The TOPN function( which has a default descending order) using a 1 in the first position ensures that we have the most recent data refresh record of the set.

    After the TOPN has finished its work, we have the row we want, but CALCULATE expects an aggregation.  MAX, MIN and AVERAGE for a single numeric value all return the same result, so I just chose MAX out of habit.

    The IF ( HASONEVALUE() ) suppresses the effort to evaluate grand totals and subtotals with the formula.  Behind the scenes in Power Pivot, totals are collections of values.

    So there you have it.  The grand tour from the inside out.

    #755

    Mike Church
    Participant
    • Started: 7
    • Replies: 24
    • Total: 31

    Thanks for the tour!  Very helpful.

    I do have a follow up question related to the formula at hand.  How would I exclude a certain NAICS codes from being processed in this manner.  I tried a nested if statement ahead of the formula to preempt the formula.  For example, I said that if a NAICS =555555 then rely on the NAICS code provided and if false leads to your equation.   As I think of it though I believe what I really need is something more complex because what I really need is to keep the NAICS codes as they are for each refresh date when code 555555 exists in any of the refresh dates and if not then follow your coding.

    Thanks for your help.

     

    #760

    tomallan
    Keymaster
    • Started: 0
    • Replies: 417
    • Total: 417

    Hello sans holo,

    Let me put this is my own words, and you can tell me if I have it:

    What you need is to keep the NAICS codes as they are for each refresh date of an Account #, when the code for 555555 exists in any of that Account #’s NAICS codes.

    Right? or No…

    #764

    Mike Church
    Participant
    • Started: 7
    • Replies: 24
    • Total: 31

    Exactly.  Probably more info that what you need but to give you a background of the application here this is not for NAICS codes as the formula you sent me is applicable for many other system codes beyond just NAICS codes.  This new request is for the code that we use to determine the type of loan which is also an area where a good bit of data scrub work gets done which is why your formula is also very useful in this other type of coding.  The catch that I ran into when using your formula for this other coding was that we have some codes that are intentionally transient in that they start as one type of loan and then at some point they switch.  A construction loan is an example, where we start off with a loan that is coded as a construction loan, which is the correct coding, and then the construction is complete and we now have a functioning apartment complex for example.  If I use the coding you provided then the loan type code would get overwritten as a finished apartment complex from day 1 when in reality it should show as construction for the first nine months with a code switch occuring once the project is complete.  In this case I will use this additional logic in front of the formula you previously provided for the logic to return the code on the system for each refresh date if there is a code of 120 (which is the construction code) for any of the refresh dates.

    Once again, way more than you probably want to know but thought it might help explain what I am trying to do and why I would want to suspend the logic for certain codes.

    #811

    tomallan
    Keymaster
    • Started: 0
    • Replies: 417
    • Total: 417

    Hello sans holo,

    Attached is workbook for you with this formula:

    =

    IF (

    COUNTROWS (

    FILTER (

    ALL ( MyLoans ),

    [Account #] = EARLIER ( MyLoans[Account #] )

    && [NAICS #] = 555555

    )

    )

    > 0,

    CALCULATE ( MAX ( MyLoans[NAICS #] ) , TOPN ( 1, FILTER ( ALL( MyLoans), [Account #] = EARLIER( MyLoans[Account #] ) && [NAICS #] = EARLIER ( [NAICS #] ) ), MyLoans[NAICS #] ) ) ,

    CALCULATE (

    MAX ( MyLoans[NAICS #] ),

    TOPN (

    1,

    FILTER (

    ALL ( MyLoans ),

    [Account #] = EARLIER ( [Account #] )

    ),

    MyLoans[Data Refresh]

    )

    )

    )

     

    It should work well for you.  A word of caution: Make sure that there are sufficient columns when you use these calculations to uniquely identify each row of data.  If there is not sufficient information to uniquely identify each row and you drop the measure in the Values drop zone for a pivot, Excel will sum the values from non-unique rows.  However, if you drop one of these measures on the Rows drop zone, and there duplicated rows, the measure in the Rows drop zone will not be summed.

    Attachments:
    You must be logged in to view attached files.
    #872

    Mike Church
    Participant
    • Started: 7
    • Replies: 24
    • Total: 31

    Thanks, this worked wonderfully!  You can really see it on a 13 month trending graph which is now smooth ups and downs as opposed to the sharp jumps that were present before.

    Thanks for your help.

    #874

    tomallan
    Keymaster
    • Started: 0
    • Replies: 417
    • Total: 417

    Your welcome; I learn from helping.

Viewing 15 posts - 1 through 15 (of 16 total)

The forum ‘Power Pivot’ is closed to new topics and replies.