Using the created flex tables in the Working with HP Vertica Flex Tables article will demonstrate how to use the compute_flextable_keys() function.
dbadmin=> select compute_flextable_keys('test');
compute_flextable_keys
----------------------------------------------
Please see public.test_keys for updated keys
(1 row)
dbadmin=> d public.test_keys
List of Fields by Tables
Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
--------+-----------+-----------------+----------------------+--------+---------+----------+-------------+-------------
public | test_keys | key_name | long varchar(130000) | 130000 | | f | f |
public | test_keys | frequency | int | 8 | | f | f |
public | test_keys | data_type_guess | varchar(200) | 200 | | f | f |
(3 rows)
The table containt 3 columns(no matter the json,csv,etc file size or type):
key_name : it will receive the column name. frequency : this will hold the number of times the virtual column occurs in the map. data_type_guess: it holds the data type for each virtual column, cast to VARCHAR, LONG VARCHAR or LONG VARBINARY, depending on the length of the key, and whether the key includes one or more nested maps.dbadmin=> select * from public.test_keys;
key_name | frequency | data_type_guess
---------------+-----------+---------------------
about | 122450 | varchar(1270)
address | 122450 | varchar(120)
age | 122450 | varchar(20)
balance | 122450 | varchar(20)
company | 122450 | varchar(24)
email | 122450 | varchar(66)
favoriteFruit | 122450 | varchar(20)
friends | 122450 | long varbinary(304)
gender | 122450 | varchar(20)
greeting | 122450 | varchar(114)
guid | 122450 | varchar(72)
id | 122450 | varchar(20)
isActive | 122450 | varchar(20)
latitude | 122450 | varchar(20)
longitude | 122450 | varchar(20)
name | 122450 | varchar(40)
phone | 122450 | varchar(34)
picture | 122450 | varchar(50)
registered | 122450 | varchar(52)
tags | 122450 | long varbinary(288)
(20 rows)
Not everytime the function will choose the best data types and their optimum length.
INSERT INTO public.test_keys
SELECT keys,
frequency,
CASE
WHEN data_type IS NOT NULL THEN
data_type
WHEN type_oid = 116 and (length * 2) ::int <= 20 THEN
'varchar(20)'
WHEN type_oid = 116 and (length * 2) ::int <= 65000 THEN
'varchar(' || (length * 2) ::int || ')'
WHEN type_oid = 116 and (length * 2) ::int > 130000 THEN
'long varchar(130000)'
WHEN type_oid = 116 THEN
'long varchar(' || (length * 2) ::int || ')'
WHEN type_oid = 199 and (length * 2) ::int <= 20 THEN
'long varbinary(20)'
WHEN type_oid = 199 and (length * 2) ::int > 130000 THEN
'long varbinary(130000)'
WHEN type_oid = 199 THEN
'long varbinary(' || (length * 2) ::int || ')'
ELSE
NULL
END as data_type_guess
FROM (SELECT keys, frequency, type_oid, length, data_type
FROM (SELECT keys,
COUNT(*) as frequency,
max(length) as length,
max(type_oid) as type_oid
FROM (SELECT mapkeysInfo(__raw__) OVER() FROM public.test) all_keyinfo
GROUP BY keys) aggregated_keyinfo
LEFT OUTER JOIN V_CATALOG.COLUMNS ON keys = columns.column_name
and columns.table_id =
'45035996273869954') aggregated_keyinfo_and_types
order by frequency DESC, keys;
Access Path:
+-SORT [Cost: 8M, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
| Order: aggregated_keyinfo_and_types.frequency DESC, aggregated_keyinfo_and_types.keys ASC
| +---> JOIN HASH [RightOuter] [Cost: 7M, Rows: 10K (NO STATISTICS)] (PATH ID: 3)
| | Join Cond: (aggregated_keyinfo.keys = COLUMNS.column_name)
| | +-- Outer -> SELECT [Cost: 3K, Rows: 10K (NO STATISTICS)] (PATH ID: 4)
| | | +---> STORAGE ACCESS for vcs [Cost: 3K, Rows: 10K (NO STATISTICS)] (PATH ID: 5)
| | | | Projection: v_internal.vs_columns_p
| | | | Materialize: vcs.attname, vcs.typname
| | | | Filter: (vcs.t_oid = 45035996273869954)
| | | | Filter: ((vcs.nspname <> 'v_internal') AND (vcs.nspname <> 'v_monitor') AND (vcs.nspname <> 'v_catalog'))
| | | | Runtime Filters: (SIP1(HashJoin): COLUMNS.column_name), (SIP1(HashJoin): COLUMNS.column_name)
| | +-- Inner -> SELECT [Cost: 7M, Rows: 1 (NO STATISTICS)] (PATH ID: 6)
| | | +---> GROUPBY HASH (LOCAL RESEGMENT GROUPS) [Cost: 7M, Rows: 1 (NO STATISTICS)] (PATH ID: 7)
| | | | Aggregates: count(*), max(all_keyinfo.length), max(all_keyinfo.type_oid)
| | | | Group By: all_keyinfo.keys
| | | | +---> ANALYTICAL [Cost: 7M, Rows: 122K (NO STATISTICS)] (PATH ID: 9)
| | | | | Analytic Group
| | | | | Functions: MapKeysInfo()
| | | | | +---> STORAGE ACCESS for test [Cost: 782K, Rows: 122K (NO STATISTICS)] (PATH ID: 10)
| | | | | | Projection: public.test_super
| | | | | | Materialize: test.__raw__
We can see the (NO STATISTICS) is becouse we cannot run or collect statistics on Flex Tables.
Actions made by Vertica engine during the execution of compute_flextable_keys() function.
dbadmin=> SELECT mapkeysInfo(__raw__) OVER() FROM public.test;
keys | length | type_oid | row_num | field_num
---------------+--------+----------+---------+-----------
about | 239 | 116 | 1 | 0
address | 41 | 116 | 1 | 1
age | 2 | 116 | 1 | 2
balance | 9 | 116 | 1 | 3
company | 5 | 116 | 1 | 4
email | 20 | 116 | 1 | 5
favoriteFruit | 10 | 116 | 1 | 6
friends | 140 | 199 | 1 | 7
gender | 4 | 116 | 1 | 8
greeting | 47 | 116 | 1 | 9
guid | 36 | 116 | 1 | 10
id | 4 | 116 | 1 | 11
isActive | 1 | 116 | 1 | 12
latitude | 3 | 116 | 1 | 13
longitude | 2 | 116 | 1 | 14
name | 11 | 116 | 1 | 15
phone | 17 | 116 | 1 | 16
picture | 25 | 116 | 1 | 17
registered | 26 | 116 | 1 | 18
tags | 116 | 199 | 1 | 19
SELECT keys, frequency, type_oid, length, data_type
FROM (SELECT keys,
COUNT(*) as frequency,
max(length) as length,
max(type_oid) as type_oid
FROM (SELECT mapkeysInfo(__raw__) OVER() FROM public.test) all_keyinfo
GROUP BY keys) aggregated_keyinfo
LEFT OUTER JOIN V_CATALOG.COLUMNS ON keys = columns.column_name
and columns.table_id ='45035996273869954';
keys | frequency | type_oid | length | data_type
---------------+-----------+----------+--------+-----------
about | 122450 | 116 | 635 |
address | 122450 | 116 | 60 |
age | 122450 | 116 | 2 |
balance | 122450 | 116 | 9 |
company | 122450 | 116 | 12 |
email | 122450 | 116 | 33 |
favoriteFruit | 122450 | 116 | 10 |
friends | 122450 | 199 | 152 |
gender | 122450 | 116 | 6 |
greeting | 122450 | 116 | 57 |
guid | 122450 | 116 | 36 |
id | 122450 | 116 | 4 |
isActive | 122450 | 116 | 1 |
latitude | 122450 | 116 | 3 |
longitude | 122450 | 116 | 4 |
name | 122450 | 116 | 20 |
phone | 122450 | 116 | 17 |
picture | 122450 | 116 | 25 |
registered | 122450 | 116 | 26 |
tags | 122450 | 199 | 144 |
Where the data type lenght is going to be *2 it's actual value
dbadmin=> select * from test_keys;
key_name | frequency | data_type_guess
---------------+-----------+---------------------
about | 122450 | varchar(1270)
address | 122450 | varchar(120)
age | 122450 | varchar(20)
balance | 122450 | varchar(20)
company | 122450 | varchar(24)
email | 122450 | varchar(66)
favoriteFruit | 122450 | varchar(20)
friends | 122450 | long varbinary(304)
gender | 122450 | varchar(20)
greeting | 122450 | varchar(114)
guid | 122450 | varchar(72)
id | 122450 | varchar(20)
isActive | 122450 | varchar(20)
latitude | 122450 | varchar(20)
longitude | 122450 | varchar(20)
name | 122450 | varchar(40)
phone | 122450 | varchar(34)
picture | 122450 | varchar(50)
registered | 122450 | varchar(52)
tags | 122450 | long varbinary(288)