Learn how to manage your Vertica Data load Streams

   Many times when we load data in HP Vertica we really do not know what is happening and unless we have a tool such as Vertica Management Console installed and configured capturing information about the load streams is quite hard.

  I am going to walk you thru some simple steps that will help manage and understand better at what point is my load stream. The table that captures the information about the Data load streams is called load_streams. This table contains all the information you require to understand the status of your loads. So before you jump in and COPY+PASTE the script go thru some of this bullet points below: 1 - Make sure you setup the STREAM NAME of the Data Load, this will allow you to easy track and measure your data loads over periods of time. 2 - Use DIRECT option when you load large amounts of data.      Why DIRECT ?  - the DIRECT option enables you to load data straight into the ROS containers. 3 - Put your Rejected data into a table. - i find this very useful as you can query it and work with easy on it to fix any loading issues. Here is a basic syntax of COPY command:
copy  schema.Table_Name FROM '/tmp/file.csv' SKIP 1 DELIMITER '|'  NULL as 'null' DIRECT STREAM NAME 'Load Stream Identifier'  REJECTED DATA AS TABLE staging.Table_Name_rej;
So while loading the data this is the script i use to monitor the load_streams. 
  • this script will consider all loading sessions that are executing and will output details about Accepted_Row_Count, Duration of the load since load start and what is the load average per secound.
SELECT
    stream_name,
    table_name,
    is_executing,
    accepted_row_count,
    rejected_row_count,
    DATEDIFF(ss,load_start::TIMESTAMP,GETDATE()::TIMESTAMP) AS DurationSec,
    ROUND((accepted_row_count+rejected_row_count)/DATEDIFF(ss,load_start::TIMESTAMP,GETDATE()::TIMESTAMP),3.0) AS RowsPerSec
FROM
    load_streams
WHERE
    is_executing='true';

                      stream_name                      |     table_name     | is_executing | accepted_row_count | rejected_row_count | DurationSec | RowsPerSec
-------------------------------------------------------+--------------------+--------------+--------------------+--------------------+-------------+------------

 Loading dev.Mapping from Sql Server                   | UAT_Mapping        | t            |           29885952 |                  1 |         757 |  39479.462
 Loading dev.Item from Sql Server                      | UAT_Item           | t            |           23785472 |               3919 |         754 |  31550.916
(2 rows)

I hope this will be useful and if any can improve this script fell free to do so and drop me a message with any ideas or requests related to this topic.