Home Forums Power Pivot Need method to prep combined Name & Activity & Subj column

This topic contains 18 replies, has 2 voices, and was last updated by  tomallan 8 years, 5 months ago.

Viewing 15 posts - 1 through 15 (of 19 total)
  • Author
    Posts
  • #2326

    rsiegmund
    Participant
    • Started: 3
    • Replies: 13
    • Total: 16

    Completely new to Power Pivot, on my 2nd reading of ebooks by Rob Collie and Kasper de Jonge… there’s a good chance I’m wanting something not well suited to Power Pivot, but hoping for a solution that’s simpler than VBA or Perl.

    My data is (only) available from an exported Excel spreadsheet that has 3 columns:

    col1 = Activity
    col2 =  User
    col3 =  Timestamp

    The Activity column is the issue. It’s actually a merged set of three fields like this:

    Field1 = Customer Name
    Field2 = Activity Type
    Field3 = Subject

    These are some example text values from the Activity column:
    JIM SMITH To-Do Created May need to convert HO to LLP
    Alexander K Jones Contact Edited
    JAMES MORRI New Policy Created

    Field2(Activity Type) can be anything from a list of about 40 types (see partial list pasted at end of post)

    Should I be attempting to split the Activity column into the 3 fields using Excel? Or Power Query? Or can this be done in Power Pivot?

    I’ve attempted the Power Pivot approach based on this post by Rob Collie which gets me really close ….
    http://www.powerpivotpro.com/2014/01/containsx-finding-if-a-value-in-table-1-has-a-matching-value-in-table-2/
    …. Based on that post I’m now able to split the Customer Name off the front end of the Activity string, but that approach won’t work for splitting off whatever is the Subject that sometimes follows the Activity Type. In other words,

    JIM SMITH To-Do Created May need to convert HO to LLP

    needs to be split into three columns so I can create counts of the various Activity Types by User

    JIM SMITH
    To-Do Created
    May need to convert HO to LLP

    If it helped I could sanitize my datafile and upload it. The Activity column will sometimes not have a Customer Name, and sometimes not have a Subject, but it will always have an Activity Type.  An example with not Customer Name or Subject looks like this:

    Logged In

     
    Example Activity Types
    <table width=”345″>
    <tbody>
    <tr>
    <td width=”345″>Comment Added</td>
    </tr>
    <tr>
    <td width=”345″>Comment Deleted</td>
    </tr>
    <tr>
    <td>Contact Created</td>
    </tr>
    <tr>
    <td width=”345″>Contact Deleted</td>
    </tr>
    <tr>
    <td width=”345″>Contact Edited</td>
    </tr>
    <tr>
    <td width=”345″>Created An Email Template</td>
    </tr>
    <tr>
    <td width=”345″>Customer Deleted</td>
    </tr>
    <tr>
    <td width=”345″>Deleted An Email Template</td>
    </tr>
    </tbody>
    </table>

    #2327

    rsiegmund
    Participant
    • Started: 3
    • Replies: 13
    • Total: 16

    I’m using Excel 2013 MS Office 365 Pro Plus, and all of my data files at this point are exported Excel files from various tools like agency management systems, lead management systems, and voip phone systems.

    The end goal is to be using all these sources of data to establish dashboards for daily activities that happen in an insurance agency.

    #2328

    tomallan
    Keymaster
    • Started: 0
    • Replies: 417
    • Total: 417

    Hi,

    Tasks like dividing a column into its component parts should be a Power Query task, not Power Pivot.

    Some questions:

    Do you have control over the merging of Customer Name, Activity Type, and Subject in the Activity column? (if not now, in the future?)

    If there is a customer name in Activity, is it always to the left of the activity type?

    If there is a subject in Activity, is it always to the right of the activity type?

    What is the chance of getting sample data to work with (it can be anonymized, but the structure should be valid)?

    Tom

    #2329

    rsiegmund
    Participant
    • Started: 3
    • Replies: 13
    • Total: 16

    The chance of an example file is very good 😉  See attached

    I have no control over the merging of Customer Name, Activity Type, and Subject in the Activity column now, and very unlikely in the future. It comes from a legacy application and I have no influence on those responsible.

    But yes, you pointed out some rules I didn’t mention.

    • if there is a customer name, it’s always on the left side of the activity type
    • if there is a subject, there is always a customer, and the subject is to the right
    • there is always an activity type

    The reason for their being a subject in some cases is that the activity represents something that has a subject line, such as an email or a ToDo where a staff person chooses to add a subject. But it’s not required.

    The attached file has two tabs:

    Activity Report shows 21 example rows.
    Type shows all of the possible activity types that I’ve been able to discover

    It’s possible there will be more activity types, but it’s not critical that I can match against all activity types, just the interesting ones.

     

    Attachments:
    You must be logged in to view attached files.
    #2331

    rsiegmund
    Participant
    • Started: 3
    • Replies: 13
    • Total: 16

    … the dataset is currently at 65K rows with a timeframe of 2015 YTD.  I have data going back to 2010, so conservatively the dataset could be in the range of 300-400K rows.

    #2332

    tomallan
    Keymaster
    • Started: 0
    • Replies: 417
    • Total: 417

    A long answer made short: Yes, it can be done. Please see attached workbook that successfully parsed all 3 fields out of an Activity column.

    The solution in the workbook will be successful as long as none of the activity types can be substrings of another activity type (if such is the case, the answer is still a “Yes, it can be done”, but with some different steps).

    If there are any questions regarding the attached queries, please feel free to ask.

    Tom

    Note: Technique for creating a cross-join in Power Query was learned from Chris Webb, who discusses the technique in detail at:

    http://blog.crossjoin.co.uk/2014/06/04/join-conditions-in-power-query-part-2-events-in-progress-performance-and-query-folding/

    but is also summarized in (you will have to scroll way down to the bottom of the url below to read):

    https://social.technet.microsoft.com/Forums/en-US/849833aa-7767-4a7f-8ec0-18fb342dc84d/cross-join-in-power-query?forum=powerquery

    Attachments:
    You must be logged in to view attached files.
    #2334

    tomallan
    Keymaster
    • Started: 0
    • Replies: 417
    • Total: 417

    It looks like we exchanged workbooks at the same time.

    In your sample data, you do have at least one activity type that is a substring of another: “Created” and “Contact Created”.

    Currently working out details.

    Tom

    #2335

    tomallan
    Keymaster
    • Started: 0
    • Replies: 417
    • Total: 417

    Here it is and the unsung hero here is forum participant MikeChina, who posted on October 2 at 5:02 pm and researched an answer that became they key to handling this case (some ActityType names are substrings of other ActivityTypes).

    To get the link that MikeChina posted way back then, please follow the url below to reach the earlier forum topic:

    Power Query – Remove duplicates – keep highest ID

    Attachments:
    You must be logged in to view attached files.
    #2339

    rsiegmund
    Participant
    • Started: 3
    • Replies: 13
    • Total: 16

    Thanks so much for giving me the links and the workbook. Diving into it now!

    #2340

    rsiegmund
    Participant
    • Started: 3
    • Replies: 13
    • Total: 16

    Tom,

    When I open the attached prep file I can see errors because associated .txt files aren’t present in the expected locations for each of three queries. I presume this is because you established one input file for the activity types and another for the activities (ie: fact table). And then the 3rd must have been an output .txt for the result?

    In the Advanced Editor I can see M language code for each of the three different queries. Were these completely hand-written by you? Or was some/all of the code resulting from menu/buttons in Power Query. I’m asking so I can determine whether I need to be looking at Power Query M language much.  At this moment I don’t even know what a comment character is in M language 😉

    I can experiment, but it would help me to know the broad steps I need to get the Prep worksheet functioning so I can see how it operates. I tried the obvious step of editing the Activity Types Query to point at my local workbook, but the Query Editor window then displays something I’m not expecting (a sheet reference and something about a filter …) and when I close the Query Editor and hover over the ‘connection only’ query at the right side pane instead of seeing a list of the activity types I see the same kind of information (sheet reference, filter, etc)

    By the way, I was able to download a PowerQueryExample.xlsx from Chris Webb’s blog and had similar issues with disconnected source files, but once I added the two supporting workbooks and edit the query settings to point to them that example seems to work.  So I suspect I’m just missing something simple in getting your Prep file working.

    #2342

    tomallan
    Keymaster
    • Started: 0
    • Replies: 417
    • Total: 417

    Attached is a zip file with a slightly revised Excel file and two text files for importing data. The queries in the xlsx file now expect to file the text files within a directory path of c:\PowerQuery\. If you install the folder in another location, just update the paths in the Power Query queries.

    So you will have something to work with, I will send this off now and will continue in another post.

    There are only two text files for import; the third query’s error is probably the result of the first two queries showing an error.

    Tom

    Attachments:
    You must be logged in to view attached files.
    #2344

    tomallan
    Keymaster
    • Started: 0
    • Replies: 417
    • Total: 417

    I have been working with Power Query for about the last 6 months (originally off-and-on, now more steadily).

    A recent game changer for me was getting the ebook of “M is for (Data) Monkey” on Bill Jelen’s website (then it had, and perhaps still does have, the lowest price):

    http://www.mrexcel.com/store/index.php?l=product_detail&p=323

    From my point of view, the book is well-written and is divided into about 25 short chapters that consist of a work-along-with-me-as-you-go-through-the-practice-workbooks approach. I was also impressed with the quantity of practice files which were available as separate download. I felt the overall style of the book was more like working with mentors than with instructors.

    Let me know if you have any grief after extracting the contents of PowerQuery.zip attached to the previous comment.

    The queries ActivityTypes and Data were created using the PQ interface; the Results Query was produced partly through user-interface, partly through custom M.

    Also, if you are interested, I put together a calendar table generator using Power Query based on a post by Kasper de Jonge and using insights from the “M is for (Data) Monkey” book.

    #2345

    tomallan
    Keymaster
    • Started: 0
    • Replies: 417
    • Total: 417

    ** This discussion addresses a scenario where none of the individual values within
    ** the ActivityType column could be evaluated as a substring of the content
    ** of a different value in the same column. This was the case in Prep.xlsx.
    **
    ** The workbook upload-example-raw-activity-report.xlsx addresses the more
    ** advanced case when an individual activity type (for example, “Created”,
    ** could be evaluated as a substring of another activity type such as “Contact Created”).
    **
    ** Another factor that could be addressed is possible case-sensitivity issues that might
    ** arise from occurrences where “Comment Added” might be typed out as such in one
    ** row, but in another row was entered as “comment added”. If such is the case,
    ** let me know.
    **

    Discussion of the steps in Prep.xlsx:

    Because there are no delimiters separating the Customer Name, Activity Type and Subject, I felt it necessary to bring in one of the heaviest pieces from my personal data parsing arsenal: the cross join. If there had been delimiters within the Activity column, the required solution would have been totally vanilla: do-able with PQ’s interface alone with little more that a few clicks and entering a couple values for a split column.

    A side note: My delimiter of preference is the vertical bar, |, also known as the pipe. I find it less problematic than commas or tabs since the values in text fields all too often contain commas and tabs, but almost never contain the vertical bar (except as a delimiter).

    In both ActivityType and Data queries, a custom column (“Cross Join ID”) was created that consisted only of the number 1 for all rows in each table. A cross-join will match up every row from the ActivityTypes table with every row from the Data table. This will give an opportunity to find out which ActivityType can be found within the Activity column.

    The Table.SelectRows only keeps rows in the temporary result set where ActivityType is found in the Activity column.

    The next step, was to create a new custom column (“Parsed Activity”), that contains the same value as the Activity column, with one important difference: Text.Replace replaced the ActivityType with the ActivityType delimited by vertical bars.

    Now we have a vanilla solution: In the Home tab of the PQ query ribbon, in the Transform group, there is a Split Column option (entered a custom delimiter of the vertical bar, split at each occurrence of the delimiter; under the Advanced options saw that the “Number of columns to split into” was 3, and (perhaps unnecessarily) changed the quote style to none).

    The rest is a matter of renaming, reordering, and removing columns (more vanilla UI options).

    If you have the patience here to wade through the use of the cross-join (which is a rare occurrence), you will go far in the world of Power Query.

    Tom

    #2347

    rsiegmund
    Participant
    • Started: 3
    • Replies: 13
    • Total: 16

    Tom,

    You have my heartfelt thanks for taking the time to spell things out so clearly.   I had the misfortune to stumble upon this non-vanilla problem with my first foray into Power Pivot (and now Power Query) but you’ve given me lots to ponder, and I very much appreciate the suggestion to read “M is for (Data) Monkey” … will be starting that this evening 😉

    Thanks!

    #2488

    rsiegmund
    Participant
    • Started: 3
    • Replies: 13
    • Total: 16

    Tom,

    Finally got time to work on this with my own dataset and ran into a few issues…

    a) wasn’t obvious to me how to create the third query which performed the Table.Join. I I knew that I could do it manually by creating any query and then editing in the advanced editor, but I thought that joining two existing query tables should work by clicking. I tried using the combine/Merge but that did something other than Table.Join.  So my question is how best to create the query that does the join.  I’m past this because I ended up manually editing the m language which of course worked.

    NOTE – For question b) below, right after sending I realized that I had just done the table.join but not the reduction, so what was being loaded to Power Pivot was **way** larger than it needed to be.  Assuming that Power Query can handle this I’ll probably be fine. But I’m still curious about whether you think 32-bit will be an issue.

    b) My dataset is about 64K rows, and with about 40 different activity types there would be 2.4M rows.  The data started loading but I eventually got an error saying that I should consider 64-bit Excel. I might have to just do that, but my existing office laptop (T540P) was running 32-bit MS Office and I think 32-bit Win8.1 as well, haven’t thought about it in awhile.  Should a 2.4M row dataset with about 7 columns work?   My ultimate use case will have much larger datasets.

Viewing 15 posts - 1 through 15 (of 19 total)

The forum ‘Power Pivot’ is closed to new topics and replies.