Home › Forums › Power Pivot › Need method to prep combined Name & Activity & Subj column
Tagged: Extract, parse, Power Query, Table.Buffer, Transform
This topic contains 18 replies, has 2 voices, and was last updated by tomallan 8 years, 5 months ago.
-
AuthorPosts
-
November 7, 2015 at 9:47 pm #2326
Completely new to Power Pivot, on my 2nd reading of ebooks by Rob Collie and Kasper de Jonge… there’s a good chance I’m wanting something not well suited to Power Pivot, but hoping for a solution that’s simpler than VBA or Perl.
My data is (only) available from an exported Excel spreadsheet that has 3 columns:
col1 = Activity
col2 = User
col3 = TimestampThe Activity column is the issue. It’s actually a merged set of three fields like this:
Field1 = Customer Name
Field2 = Activity Type
Field3 = SubjectThese are some example text values from the Activity column:
JIM SMITH To-Do Created May need to convert HO to LLP
Alexander K Jones Contact Edited
JAMES MORRI New Policy CreatedField2(Activity Type) can be anything from a list of about 40 types (see partial list pasted at end of post)
Should I be attempting to split the Activity column into the 3 fields using Excel? Or Power Query? Or can this be done in Power Pivot?
I’ve attempted the Power Pivot approach based on this post by Rob Collie which gets me really close ….
http://www.powerpivotpro.com/2014/01/containsx-finding-if-a-value-in-table-1-has-a-matching-value-in-table-2/
…. Based on that post I’m now able to split the Customer Name off the front end of the Activity string, but that approach won’t work for splitting off whatever is the Subject that sometimes follows the Activity Type. In other words,JIM SMITH To-Do Created May need to convert HO to LLP
needs to be split into three columns so I can create counts of the various Activity Types by User
JIM SMITH
To-Do Created
May need to convert HO to LLPIf it helped I could sanitize my datafile and upload it. The Activity column will sometimes not have a Customer Name, and sometimes not have a Subject, but it will always have an Activity Type. An example with not Customer Name or Subject looks like this:
Logged In
Example Activity Types
<table width=”345″>
<tbody>
<tr>
<td width=”345″>Comment Added</td>
</tr>
<tr>
<td width=”345″>Comment Deleted</td>
</tr>
<tr>
<td>Contact Created</td>
</tr>
<tr>
<td width=”345″>Contact Deleted</td>
</tr>
<tr>
<td width=”345″>Contact Edited</td>
</tr>
<tr>
<td width=”345″>Created An Email Template</td>
</tr>
<tr>
<td width=”345″>Customer Deleted</td>
</tr>
<tr>
<td width=”345″>Deleted An Email Template</td>
</tr>
</tbody>
</table>November 7, 2015 at 9:53 pm #2327I’m using Excel 2013 MS Office 365 Pro Plus, and all of my data files at this point are exported Excel files from various tools like agency management systems, lead management systems, and voip phone systems.
The end goal is to be using all these sources of data to establish dashboards for daily activities that happen in an insurance agency.
November 8, 2015 at 1:47 am #2328Hi,
Tasks like dividing a column into its component parts should be a Power Query task, not Power Pivot.
Some questions:
Do you have control over the merging of Customer Name, Activity Type, and Subject in the Activity column? (if not now, in the future?)
If there is a customer name in Activity, is it always to the left of the activity type?
If there is a subject in Activity, is it always to the right of the activity type?
What is the chance of getting sample data to work with (it can be anonymized, but the structure should be valid)?
Tom
November 8, 2015 at 2:28 am #2329The chance of an example file is very good 😉 See attached
I have no control over the merging of Customer Name, Activity Type, and Subject in the Activity column now, and very unlikely in the future. It comes from a legacy application and I have no influence on those responsible.
But yes, you pointed out some rules I didn’t mention.
- if there is a customer name, it’s always on the left side of the activity type
- if there is a subject, there is always a customer, and the subject is to the right
- there is always an activity type
The reason for their being a subject in some cases is that the activity represents something that has a subject line, such as an email or a ToDo where a staff person chooses to add a subject. But it’s not required.
The attached file has two tabs:
Activity Report shows 21 example rows.
Type shows all of the possible activity types that I’ve been able to discoverIt’s possible there will be more activity types, but it’s not critical that I can match against all activity types, just the interesting ones.
Attachments:
You must be logged in to view attached files.November 8, 2015 at 2:30 am #2331… the dataset is currently at 65K rows with a timeframe of 2015 YTD. I have data going back to 2010, so conservatively the dataset could be in the range of 300-400K rows.
November 8, 2015 at 3:32 am #2332A long answer made short: Yes, it can be done. Please see attached workbook that successfully parsed all 3 fields out of an Activity column.
The solution in the workbook will be successful as long as none of the activity types can be substrings of another activity type (if such is the case, the answer is still a “Yes, it can be done”, but with some different steps).
If there are any questions regarding the attached queries, please feel free to ask.
Tom
Note: Technique for creating a cross-join in Power Query was learned from Chris Webb, who discusses the technique in detail at:
but is also summarized in (you will have to scroll way down to the bottom of the url below to read):
Attachments:
You must be logged in to view attached files.November 8, 2015 at 4:13 am #2334It looks like we exchanged workbooks at the same time.
In your sample data, you do have at least one activity type that is a substring of another: “Created” and “Contact Created”.
Currently working out details.
Tom
November 8, 2015 at 4:59 am #2335Here it is and the unsung hero here is forum participant MikeChina, who posted on October 2 at 5:02 pm and researched an answer that became they key to handling this case (some ActityType names are substrings of other ActivityTypes).
To get the link that MikeChina posted way back then, please follow the url below to reach the earlier forum topic:
Attachments:
You must be logged in to view attached files.November 8, 2015 at 1:22 pm #2339Thanks so much for giving me the links and the workbook. Diving into it now!
November 8, 2015 at 3:24 pm #2340Tom,
When I open the attached prep file I can see errors because associated .txt files aren’t present in the expected locations for each of three queries. I presume this is because you established one input file for the activity types and another for the activities (ie: fact table). And then the 3rd must have been an output .txt for the result?
In the Advanced Editor I can see M language code for each of the three different queries. Were these completely hand-written by you? Or was some/all of the code resulting from menu/buttons in Power Query. I’m asking so I can determine whether I need to be looking at Power Query M language much. At this moment I don’t even know what a comment character is in M language 😉
I can experiment, but it would help me to know the broad steps I need to get the Prep worksheet functioning so I can see how it operates. I tried the obvious step of editing the Activity Types Query to point at my local workbook, but the Query Editor window then displays something I’m not expecting (a sheet reference and something about a filter …) and when I close the Query Editor and hover over the ‘connection only’ query at the right side pane instead of seeing a list of the activity types I see the same kind of information (sheet reference, filter, etc)
By the way, I was able to download a PowerQueryExample.xlsx from Chris Webb’s blog and had similar issues with disconnected source files, but once I added the two supporting workbooks and edit the query settings to point to them that example seems to work. So I suspect I’m just missing something simple in getting your Prep file working.
November 8, 2015 at 3:56 pm #2342Attached is a zip file with a slightly revised Excel file and two text files for importing data. The queries in the xlsx file now expect to file the text files within a directory path of c:\PowerQuery\. If you install the folder in another location, just update the paths in the Power Query queries.
So you will have something to work with, I will send this off now and will continue in another post.
There are only two text files for import; the third query’s error is probably the result of the first two queries showing an error.
Tom
Attachments:
You must be logged in to view attached files.November 8, 2015 at 4:35 pm #2344I have been working with Power Query for about the last 6 months (originally off-and-on, now more steadily).
A recent game changer for me was getting the ebook of “M is for (Data) Monkey” on Bill Jelen’s website (then it had, and perhaps still does have, the lowest price):
http://www.mrexcel.com/store/index.php?l=product_detail&p=323
From my point of view, the book is well-written and is divided into about 25 short chapters that consist of a work-along-with-me-as-you-go-through-the-practice-workbooks approach. I was also impressed with the quantity of practice files which were available as separate download. I felt the overall style of the book was more like working with mentors than with instructors.
Let me know if you have any grief after extracting the contents of PowerQuery.zip attached to the previous comment.
The queries ActivityTypes and Data were created using the PQ interface; the Results Query was produced partly through user-interface, partly through custom M.
Also, if you are interested, I put together a calendar table generator using Power Query based on a post by Kasper de Jonge and using insights from the “M is for (Data) Monkey” book.
November 8, 2015 at 5:58 pm #2345** This discussion addresses a scenario where none of the individual values within
** the ActivityType column could be evaluated as a substring of the content
** of a different value in the same column. This was the case in Prep.xlsx.
**
** The workbook upload-example-raw-activity-report.xlsx addresses the more
** advanced case when an individual activity type (for example, “Created”,
** could be evaluated as a substring of another activity type such as “Contact Created”).
**
** Another factor that could be addressed is possible case-sensitivity issues that might
** arise from occurrences where “Comment Added” might be typed out as such in one
** row, but in another row was entered as “comment added”. If such is the case,
** let me know.
**Discussion of the steps in Prep.xlsx:
Because there are no delimiters separating the Customer Name, Activity Type and Subject, I felt it necessary to bring in one of the heaviest pieces from my personal data parsing arsenal: the cross join. If there had been delimiters within the Activity column, the required solution would have been totally vanilla: do-able with PQ’s interface alone with little more that a few clicks and entering a couple values for a split column.
A side note: My delimiter of preference is the vertical bar, |, also known as the pipe. I find it less problematic than commas or tabs since the values in text fields all too often contain commas and tabs, but almost never contain the vertical bar (except as a delimiter).
In both ActivityType and Data queries, a custom column (“Cross Join ID”) was created that consisted only of the number 1 for all rows in each table. A cross-join will match up every row from the ActivityTypes table with every row from the Data table. This will give an opportunity to find out which ActivityType can be found within the Activity column.
The Table.SelectRows only keeps rows in the temporary result set where ActivityType is found in the Activity column.
The next step, was to create a new custom column (“Parsed Activity”), that contains the same value as the Activity column, with one important difference: Text.Replace replaced the ActivityType with the ActivityType delimited by vertical bars.
Now we have a vanilla solution: In the Home tab of the PQ query ribbon, in the Transform group, there is a Split Column option (entered a custom delimiter of the vertical bar, split at each occurrence of the delimiter; under the Advanced options saw that the “Number of columns to split into” was 3, and (perhaps unnecessarily) changed the quote style to none).
The rest is a matter of renaming, reordering, and removing columns (more vanilla UI options).
If you have the patience here to wade through the use of the cross-join (which is a rare occurrence), you will go far in the world of Power Query.
Tom
November 8, 2015 at 6:06 pm #2347Tom,
You have my heartfelt thanks for taking the time to spell things out so clearly. I had the misfortune to stumble upon this non-vanilla problem with my first foray into Power Pivot (and now Power Query) but you’ve given me lots to ponder, and I very much appreciate the suggestion to read “M is for (Data) Monkey” … will be starting that this evening 😉
Thanks!
November 23, 2015 at 1:32 am #2488Tom,
Finally got time to work on this with my own dataset and ran into a few issues…
a) wasn’t obvious to me how to create the third query which performed the Table.Join. I I knew that I could do it manually by creating any query and then editing in the advanced editor, but I thought that joining two existing query tables should work by clicking. I tried using the combine/Merge but that did something other than Table.Join. So my question is how best to create the query that does the join. I’m past this because I ended up manually editing the m language which of course worked.
NOTE – For question b) below, right after sending I realized that I had just done the table.join but not the reduction, so what was being loaded to Power Pivot was **way** larger than it needed to be. Assuming that Power Query can handle this I’ll probably be fine. But I’m still curious about whether you think 32-bit will be an issue.
b) My dataset is about 64K rows, and with about 40 different activity types there would be 2.4M rows. The data started loading but I eventually got an error saying that I should consider 64-bit Excel. I might have to just do that, but my existing office laptop (T540P) was running 32-bit MS Office and I think 32-bit Win8.1 as well, haven’t thought about it in awhile. Should a 2.4M row dataset with about 7 columns work? My ultimate use case will have much larger datasets.
-
AuthorPosts
The forum ‘Power Pivot’ is closed to new topics and replies.