HP Vertica compute_flextable_keys() function

Using the created flex tables in the Working with HP Vertica Flex Tables article will demonstrate how to use the compute_flextable_keys() function.

  • Running the compute_flextable_keys() on 'row' Flex Table.
  • dbadmin=> select compute_flextable_keys('test');
                compute_flextable_keys
    ----------------------------------------------
     Please see public.test_keys for updated keys
    (1 row)
  • Check the public.test_keys table definition.
  • dbadmin=> d public.test_keys
                                                   List of Fields by Tables
     Schema |   Table   |     Column      |         Type         |  Size  | Default | Not Null | Primary Key | Foreign Key
    --------+-----------+-----------------+----------------------+--------+---------+----------+-------------+-------------
     public | test_keys | key_name        | long varchar(130000) | 130000 |         | f        | f           |
     public | test_keys | frequency       | int                  |      8 |         | f        | f           |
     public | test_keys | data_type_guess | varchar(200)         |    200 |         | f        | f           |
    (3 rows)

    The table containt 3 columns(no matter the json,csv,etc file size or type):

    key_name : it will receive the column name.

    frequency : this will hold the number of times the virtual column occurs in the map.

    data_type_guess: it holds the data type for each virtual column, cast to VARCHAR, LONG VARCHAR or LONG VARBINARY, depending on the length of the key, and whether the key includes one or more nested maps.

  • Check the public.test_keys content.
  • dbadmin=> select * from public.test_keys;
       key_name    | frequency |   data_type_guess
    ---------------+-----------+---------------------
     about         |    122450 | varchar(1270)
     address       |    122450 | varchar(120)
     age           |    122450 | varchar(20)
     balance       |    122450 | varchar(20)
     company       |    122450 | varchar(24)
     email         |    122450 | varchar(66)
     favoriteFruit |    122450 | varchar(20)
     friends       |    122450 | long varbinary(304)
     gender        |    122450 | varchar(20)
     greeting      |    122450 | varchar(114)
     guid          |    122450 | varchar(72)
     id            |    122450 | varchar(20)
     isActive      |    122450 | varchar(20)
     latitude      |    122450 | varchar(20)
     longitude     |    122450 | varchar(20)
     name          |    122450 | varchar(40)
     phone         |    122450 | varchar(34)
     picture       |    122450 | varchar(50)
     registered    |    122450 | varchar(52)
     tags          |    122450 | long varbinary(288)
    (20 rows)

    Not everytime the function will choose the best data types and their optimum length.

  • This is the action done by the HP Vertica engine when creating the public.test_keys table.
  • INSERT INTO public.test_keys
      SELECT keys,
             frequency,
             CASE
               WHEN data_type IS NOT NULL THEN
                data_type
               WHEN type_oid = 116 and (length * 2) ::int <= 20 THEN
                'varchar(20)'
               WHEN type_oid = 116 and (length * 2) ::int <= 65000 THEN
                'varchar(' || (length * 2) ::int || ')'
               WHEN type_oid = 116 and (length * 2) ::int > 130000 THEN
                'long varchar(130000)'
               WHEN type_oid = 116 THEN
                'long varchar(' || (length * 2) ::int || ')'
               WHEN type_oid = 199 and (length * 2) ::int <= 20 THEN
                'long varbinary(20)'
               WHEN type_oid = 199 and (length * 2) ::int > 130000 THEN
                'long varbinary(130000)'
               WHEN type_oid = 199 THEN
                'long varbinary(' || (length * 2) ::int || ')'
               ELSE
                NULL
             END as data_type_guess
        FROM (SELECT keys, frequency, type_oid, length, data_type
                FROM (SELECT keys,
                             COUNT(*) as frequency,
                             max(length) as length,
                             max(type_oid) as type_oid
                        FROM (SELECT mapkeysInfo(__raw__) OVER() FROM public.test) all_keyinfo
                       GROUP BY keys) aggregated_keyinfo
                LEFT OUTER JOIN V_CATALOG.COLUMNS ON keys = columns.column_name
                                                 and columns.table_id =
                                                     '45035996273869954') aggregated_keyinfo_and_types
       order by frequency DESC, keys;
  • Here is the explain plan.
  • Access Path:
     +-SORT [Cost: 8M, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
     |  Order: aggregated_keyinfo_and_types.frequency DESC, aggregated_keyinfo_and_types.keys ASC
     | +---> JOIN HASH [RightOuter] [Cost: 7M, Rows: 10K (NO STATISTICS)] (PATH ID: 3)
     | |      Join Cond: (aggregated_keyinfo.keys = COLUMNS.column_name)
     | | +-- Outer -> SELECT [Cost: 3K, Rows: 10K (NO STATISTICS)] (PATH ID: 4)
     | | | +---> STORAGE ACCESS for vcs [Cost: 3K, Rows: 10K (NO STATISTICS)] (PATH ID: 5)
     | | | |      Projection: v_internal.vs_columns_p
     | | | |      Materialize: vcs.attname, vcs.typname
     | | | |      Filter: (vcs.t_oid = 45035996273869954)
     | | | |      Filter: ((vcs.nspname <> 'v_internal') AND (vcs.nspname <> 'v_monitor') AND (vcs.nspname <> 'v_catalog'))
     | | | |      Runtime Filters: (SIP1(HashJoin): COLUMNS.column_name), (SIP1(HashJoin): COLUMNS.column_name)
     | | +-- Inner -> SELECT [Cost: 7M, Rows: 1 (NO STATISTICS)] (PATH ID: 6)
     | | | +---> GROUPBY HASH (LOCAL RESEGMENT GROUPS) [Cost: 7M, Rows: 1 (NO STATISTICS)] (PATH ID: 7)
     | | | |      Aggregates: count(*), max(all_keyinfo.length), max(all_keyinfo.type_oid)
     | | | |      Group By: all_keyinfo.keys
     | | | | +---> ANALYTICAL [Cost: 7M, Rows: 122K (NO STATISTICS)] (PATH ID: 9)
     | | | | |      Analytic Group
     | | | | |       Functions: MapKeysInfo()
     | | | | | +---> STORAGE ACCESS for test [Cost: 782K, Rows: 122K (NO STATISTICS)] (PATH ID: 10)
     | | | | | |      Projection: public.test_super
     | | | | | |      Materialize: test.__raw__

    We can see the (NO STATISTICS) is becouse we cannot run or collect statistics on Flex Tables.

    Actions made by Vertica engine during the execution of compute_flextable_keys() function.

  • 1- The Vertica engine starts by processing the row content of the flex table using the mapkeysInfo() function.
  • dbadmin=> SELECT mapkeysInfo(__raw__) OVER() FROM public.test;
         keys      | length | type_oid | row_num | field_num
    ---------------+--------+----------+---------+-----------
     about         |    239 |      116 |       1 |         0
     address       |     41 |      116 |       1 |         1
     age           |      2 |      116 |       1 |         2
     balance       |      9 |      116 |       1 |         3
     company       |      5 |      116 |       1 |         4
     email         |     20 |      116 |       1 |         5
     favoriteFruit |     10 |      116 |       1 |         6
     friends       |    140 |      199 |       1 |         7
     gender        |      4 |      116 |       1 |         8
     greeting      |     47 |      116 |       1 |         9
     guid          |     36 |      116 |       1 |        10
     id            |      4 |      116 |       1 |        11
     isActive      |      1 |      116 |       1 |        12
     latitude      |      3 |      116 |       1 |        13
     longitude     |      2 |      116 |       1 |        14
     name          |     11 |      116 |       1 |        15
     phone         |     17 |      116 |       1 |        16
     picture       |     25 |      116 |       1 |        17
     registered    |     26 |      116 |       1 |        18
     tags          |    116 |      199 |       1 |        19
  • 2-The data generated by the mapkeysInfo() is then computed:
  • SELECT keys, frequency, type_oid, length, data_type
                FROM (SELECT keys,
                             COUNT(*) as frequency,
                             max(length) as length,
                             max(type_oid) as type_oid
                        FROM (SELECT mapkeysInfo(__raw__) OVER() FROM public.test) all_keyinfo
                       GROUP BY keys) aggregated_keyinfo
                LEFT OUTER JOIN V_CATALOG.COLUMNS ON keys = columns.column_name
                                                 and columns.table_id ='45035996273869954';
         keys      | frequency | type_oid | length | data_type
    ---------------+-----------+----------+--------+-----------
     about         |    122450 |      116 |    635 |
     address       |    122450 |      116 |     60 |
     age           |    122450 |      116 |      2 |
     balance       |    122450 |      116 |      9 |
     company       |    122450 |      116 |     12 |
     email         |    122450 |      116 |     33 |
     favoriteFruit |    122450 |      116 |     10 |
     friends       |    122450 |      199 |    152 |
     gender        |    122450 |      116 |      6 |
     greeting      |    122450 |      116 |     57 |
     guid          |    122450 |      116 |     36 |
     id            |    122450 |      116 |      4 |
     isActive      |    122450 |      116 |      1 |
     latitude      |    122450 |      116 |      3 |
     longitude     |    122450 |      116 |      4 |
     name          |    122450 |      116 |     20 |
     phone         |    122450 |      116 |     17 |
     picture       |    122450 |      116 |     25 |
     registered    |    122450 |      116 |     26 |
     tags          |    122450 |      199 |    144 |
  • 3- Then the result set is computed by the query makeing available the data type corespondig to it's calculated algoritm.
  • Where the data type lenght is going to be *2 it's actual value

     dbadmin=> select * from test_keys;
       key_name    | frequency |   data_type_guess
    ---------------+-----------+---------------------
     about         |    122450 | varchar(1270)
     address       |    122450 | varchar(120)
     age           |    122450 | varchar(20)
     balance       |    122450 | varchar(20)
     company       |    122450 | varchar(24)
     email         |    122450 | varchar(66)
     favoriteFruit |    122450 | varchar(20)
     friends       |    122450 | long varbinary(304)
     gender        |    122450 | varchar(20)
     greeting      |    122450 | varchar(114)
     guid          |    122450 | varchar(72)
     id            |    122450 | varchar(20)
     isActive      |    122450 | varchar(20)
     latitude      |    122450 | varchar(20)
     longitude     |    122450 | varchar(20)
     name          |    122450 | varchar(40)
     phone         |    122450 | varchar(34)
     picture       |    122450 | varchar(50)
     registered    |    122450 | varchar(52)
     tags          |    122450 | long varbinary(288)
  • Here is the exaplin plan in a graphical manner - created by Graphiz.