Performances

This component was designed to allow the SSIS developer to trim all the column of a table in a couple of clicks, with no care for non-textual data. However, performances do matter: I wanted something at least as fast as the standard trim function, especially when we have many columns and many columns. Indeed this component is most useful when the developer needs to trim a wide table and he has to spend much time to trim manually all the columns. If the developer has to trim a table composed by only on column, there would not be great time earning between these two practices. So the objective is to reach good performances with many cols, even with many rows in order to be a valid replacement for the standard Derived Column component.
In order to test this component, I’ve created a simple C# program to assemble CSV files by specifying the number of columns and the number of columns. The script will create a sort of matrix with values padded with a random number of blank spaces at the beginning and at the end of the value. So I was able to simulate the “Mass Trim Operation” of data composed from thousands rows up to 10 million rows and from 1 single column up to 100 columns.
I created two SSIS packages, which has been ran separately as many times as the number of prepared csv files, each time with the relative flat file connector. We measured the execution time, taken from inside visual studio, from the “Execution Results tab”.

Summary results

The following two charts show the summary of all the collected results for all the tests we ran.

ExecutionTime StandardTrimmer.png

Execution Time WS Mass Trimmer.png

As I was expecting, the WS Trimmer behaves very similar to the standard derived column component with the TRIM calculation. In both cases the worst scenario is registered when trimming a 100 columns x 10M rows (a CSV file of about 16 Gb). This operation has taken about 3 minutes for the WS Mass Trimmer to complete, and about 3 minutes and 10 seconds for the standard derived column to complete. This is a very good result: this component behaves correctly, especially with huge tables.

Row Scalability

We could see the same situation by having a look to the following detailed charts, which stress differences between the two components in different situations.

row_scalability_1.png

When performing trim operation on a vector-table (1 column table), the standard Trim components achieves better performances. For huge columns with more then 100k rows there’s almost no performance differences; when dealing with short tables the WS Trimmer takes more time to complete. This is not a real problem, since using the mass trimmer on a vector table is pointless.

row_scalability_2.png

On the other hand, the picture is different when scaling the number of rows on a table with 10 columns. Here, when passing 10k rows the WS Mass Trimmer becomes faster. We can see that having to trim a 10M rows table there are more than 1,5 seconds of difference in execution time, in favor of the WS Trimmer.
When dealing with very huge tables, composed by 100 columns and scaling by number of rows, the Mass Trimmer becomes the best candidate. We can see that by having a look to the following table:

row_scalability_3.png

Last edited Apr 26, 2014 at 9:47 PM by webking, version 3