Cleaning data with single column of values

Tagged: Data Cleaning, Power Query

This topic contains 9 replies, has 3 voices, and was last updated by RonBarrett 6 years, 4 months ago.

Viewing 10 posts - 1 through 10 (of 10 total)

Author

Posts
December 19, 2017 at 5:39 pm #8745
bdeuce14
Participant
- Started: 11
- Replies: 14
- Total: 25
I have a single column table that I need help with to clean and extract the data from the rows and place them into columns.
1. The column headers are in row 2 and are one long text string separated by a space and I need to split this into individual column headers.
2. Each row is separated by a blank space and follows with 4 consecutive rows of information that belong under the corresponding column value
I tried using Power Query and excel LEFT and FIND functions to extract text but I was confusing myself. I attached a copy of the file with the original data and next to it put the desired outcome. Columns A and B are the output of the raw data and C through K are the desired output.

Thank you,

Brian

Attachments:
You must be logged in to view attached files.
December 19, 2017 at 8:07 pm #8748
tomallan
Keymaster
- Started: 0
- Replies: 417
- Total: 417
Brian,

Please see attached workbook.

After using first row as header, and removing blank rows, an Index column was created (important to start at 0, not 1) and then a ColumnGroup column was created (using Number.Mod to get the modulus of the index column) and then saved as a connection only.

Using the first query as a data source, separate tables were created for each of the column groups. These separate tables were then merged together to get the basic row structure.

Then, with the basic row structure, there were several split columns from the *right* of the long string column to get the Amount, Age, Liability Bucket, CLP ID. There was one split from the left to get the Activity date and time info. What was left was the comment.

There was no use of the Reason information in your target output.

Someone else may have a better idea, but with a background in SQL, this approach was intuitive to me.

Tom

Attachments:
You must be logged in to view attached files.
December 19, 2017 at 9:17 pm #8752
bdeuce14
Participant
- Started: 11
- Replies: 14
- Total: 25
Hi Tom,

Thanks very much for your help that’s exactly how I need the final output. The original file I sent was only a portion of the full data set. I just tried recreating your steps with the full data set and I’m having a problem with the column group now because there are rows in the full data set that have a value other than “Tickled” for Column Group 1.

In the query editor the issue starts at Row 17. Power query is expecting the value in the second column to be “Tickled” and assigned it a value of 1 for the column group.

However there was an extra row of data to the entry above it in row 16 which threw off the numbering.

How can I adjust the Number.Mod to account for this?

I attached my updated file.

Attachments:
You must be logged in to view attached files.
December 20, 2017 at 12:30 am #8758
tomallan
Keymaster
- Started: 0
- Replies: 417
- Total: 417
OK.

The original raw data appeared to have a blank row on every 5th row. Maybe the blank 5th row was meant to handle an “optional” extra line of information.

If we can go back to the original data set, can you confirm that there will always be only 5 rows per claim? Or is it possible that there could be 6, 7, or more rows per claim?
December 20, 2017 at 1:30 am #8759
bdeuce14
Participant
- Started: 11
- Replies: 14
- Total: 25
After you sent me the output from my first post I reviewed your steps and then went back to the full data set and removed the 5th blank row before adding it to Power Query.

I attached the original raw data that was not altered and labeled it “OriginalData_Not Altered” and labeled the sheet with the raw data and blanks that I removed as “OrignalData_No Blank Rows.”

In looking at the first sheet “Original Data_Not Altered” it appears that if the word “Tickled” is the value then every 5th row will have a blank. If the word “Transferred” is the value then every 3rd row will be a blank.

So it appears there will never be more than 5 rows per claim.

Attachments:
You must be logged in to view attached files.
December 20, 2017 at 6:34 pm #8774
tomallan
Keymaster
- Started: 0
- Replies: 417
- Total: 417
Brian,

For transfers, in the source data, is every 3rd row (the “Reason” row) a blank, or is it omitted? In the last workbook, in the OriginalData_Not Altered tab, transfers appear to omit the reason row (not leave a blank).

Please advise.

Tom
December 21, 2017 at 3:11 pm #8778
bdeuce14
Participant
- Started: 11
- Replies: 14
- Total: 25
Hi Tom,

I believe that for transfers, there is no reason it’s blank. So for transfers, the first two rows “Transfered” and to “BWH Coding – TA TRANSFER” should all fall under the first column Claim Action. The text string after Activity Time in the third column “now supports…” is the comment.

The “reason” row is not a show stopper for me if it would be easier to remove it so we have 3 rows for Tickled and 3 rows for a Transfer we can do that instead.
December 22, 2017 at 3:12 am #8787
tomallan
Keymaster
- Started: 0
- Replies: 417
- Total: 417
Brian,

On the tab “OriginalData_No Blank Rows”, take a look at the transfer data that starts at row 378 and runs until row 384: there are 4 resubmittals back-to-back, which do not follow the rules for data grouping that we have developed so far.

At this point, since it appears that there are no consistent rules that Power Query can easily work with, this dataset would be better handled using a procedural language like VB.

Tom
December 22, 2017 at 5:45 pm #8791
tomallan
Keymaster
- Started: 0
- Replies: 417
- Total: 417
One more thought:

I think the issue here can be resolved by working with the author of the dataset you have imported into Excel.

I think their intent was to produce a table that was easy to work with, but Excel is misinterpreting their field delimiter. Also, there appears to be an issue with how Excel is interpreting the end-of-line.

See if you can find out if there is way the author of the dataset can specify a different field delimiter that Excel can better work with. Since some fields contain spaces, using a space as a field delimiter would be not a good option.

Tom
December 24, 2017 at 1:45 am #8794
RonBarrett
Participant
- Started: 13
- Replies: 12
- Total: 25
Hi BDeuce14,

Quick question: Is the data set static or do you need to be updating it from an external source (e.g. API, transaction database)?

I had a similar issue with a download from Yelp. In the table’s “Category” column each record was essentially an array of values. I used Excel to parse out the column’s records into a of list unique values. Then, created a “key” column. See the attached workbook, TestYelpData.xlsm.
Consider the following list of restaurant businesses and their associated category values:
– Restaurant 1: Japanese
– Restaurant 2: Chinese
– Restaurant 3: Bar
– Restaurant 4: Japanese,Bar
– Restaurant 5: Japanese,Chinese,Bar

If the user or report selects “Japanese”, then restaurants 1, 4, and 5 should be included. If both “Japanese” and “Bar” are the selected options, then restaurants 4 and 5 should be included.

I’m trying to work out the filtering logic and how it might work.
– A category key was created for each business based upon the values in its Category column. The key is a simple concatenation of 0’s and 1’s. Since the edited category values are sorted, each value has a specific position within the key. For example, say “Bar” has the 2nd key position and “Japanese” holds the 5th position. Then a business with the category values of “Japanese,Bar” has a category key of “01001”. (The business’ order of the comma-delimited category values doesn’t matter.)
– So, I know if the user or report wants “Bar” and “Japanese”, the Business table is simply filtered upon the CategoryKey where the 2nd position is “1” (for “Bar”) and the 5th position is “1” (for “Japanese”).

I’m going to be submitting a forum post/question on this myself, but I thought at least the direction of the approach might be of help for you.

Ron

Attachments:
You must be logged in to view attached files.
Author

Posts

Viewing 10 posts - 1 through 10 (of 10 total)

The forum ‘Power Pivot’ is closed to new topics and replies.

Attachments:

Attachments:

Attachments:

Attachments:

Attachments: