Wednesday, September 4, 2013

Hive table design tricks

In Hive, we can create tables like:
create table test (
    name STRING,
    age STRING
)
stored as textfile;


This is the simplest table we can create in Hive, much like a MySQL table.
However, Hive offers many more options when creating tables:

1. internal/external table
An internal table is like a MySQL table: the data is stored in a Hive-managed location and managed by Hive.
In other words, if we create an internal table `test`, and then use
hive>drop table test;

both the table schema and the data will be deleted at the same time.
Not just that: if we want to load data into that table, we have to do things like:
hive>load data local inpath '/test/test.txt' overwrite into table test;

or an equivalent query, to load data into the table.
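For instance (a sketch, assuming a staging table raw_test with matching columns already exists; the name is made up for illustration), we can also populate the table from another table:
hive>insert overwrite table test
         select name, age from raw_test;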

However, Hive also offers the external table type, which lets us simply point at the location of the files and have Hive read them in place.
We call that an external table, and it can be created like:
hive>create external table test (...) ;

Usually we will also specify the data format of the table, like:
hive>   CREATE TABLE user(id INT, name STRING) ROW FORMAT
              DELIMITED FIELDS TERMINATED BY ','
              LINES TERMINATED BY '\n' STORED AS TEXTFILE;
so Hive will parse the data based on the format you specified.
e.g. you can create the external table schema and then use the LOCATION clause to point Hive at the data.
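For example, here is a minimal sketch of an external table that reads files in place (the HDFS path is hypothetical):
hive>create external table test (name STRING, age STRING)
         row format delimited fields terminated by ','
         lines terminated by '\n'
         stored as textfile
         location '/data/test';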

Usually we want to add more data into Hive every day without using 'load data local', or sometimes we want to copy the files to a certain location first but load them into Hive only when we want.
Then we can use:
hive>alter table test add partition (dt = '2013-08-31') location '/data/test/dt=2013-08-31';
(the location should be the directory holding that day's files, not a single file.)
In order to do that, we have to create the table schema with a partition clause, to let Hive know about the partitions:
hive>create table test (....) partitioned by (dt STRING);
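A concrete version of that schema (the columns are illustrative) would be:
hive>create table test (name STRING, age STRING)
         partitioned by (dt STRING)
         row format delimited fields terminated by ','
         stored as textfile;
Note that the partition column dt is not in the regular column list; Hive treats it as a virtual column which you can still select and filter on.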

Hive partition:
As mentioned above, Hive can handle partitions like MySQL.
However, partitioning matters even more in Hive, for a few reasons:
Hive does not support updating a table, because all the data is stored as plain text (or sequence files at best).
Unlike a real DB, Hive is just an abstraction layer: we issue queries to it, and it translates those queries into Hadoop map/reduce jobs.
So without update support, how can we manage a table partially, e.g. delete one day's data or add just one day's data?
This is where Hive partitions come in. We can do:
hive>load data local inpath 'test/test.txt' overwrite into table test partition (dt = '2013-08-31');
By doing this, we overwrite only the dt = '2013-08-31' partition, leaving the other partitions' data untouched.
This is the most common way to manage data by date.
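Under the hood, each partition is just a subdirectory named after the partition column value, so after a few daily loads the table's directory looks roughly like this (assuming the default warehouse location; yours may differ):
/user/hive/warehouse/test/dt=2013-08-29/
/user/hive/warehouse/test/dt=2013-08-30/
/user/hive/warehouse/test/dt=2013-08-31/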

Note: dropping a partition will not affect the other partitions or the table schema.
Note: dropping an external table will not delete the actual data on disk or on the Hadoop cluster, so we have to remove it manually with 'hadoop fs -rmr' on the data's real location.
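For example, to drop one day's partition and then (for an external table) clean up the files by hand:
hive>alter table test drop partition (dt = '2013-08-31');
then, from the shell:
hadoop fs -rmr /data/test/dt=2013-08-31
(the path is the hypothetical partition location used above.)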

dynamic partition:
Hive offers dynamic partitions, meaning you can insert into a partition chosen per row by the result of another SELECT query, like:
hive>insert overwrite table test partition (dt = '2013-08-31', siteid)
             select name, age, siteid from dummy_table where ...;
Here the siteid value from the SELECT decides which partition each row is written into.
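Keep in mind that dynamic partitioning is disabled by default, so before running an insert like the one above you typically need to enable it first (nonstrict mode is only required when every partition column is dynamic):
hive>set hive.exec.dynamic.partition=true;
hive>set hive.exec.dynamic.partition.mode=nonstrict;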

NOTE: Hive only accepts dynamic partition columns after static ones, so this won't work:
hive>insert overwrite table test partition (siteid, dt = '2013-08-31')
           select name, age, siteid from dummy_table where ...;

So when we create the Hive table schema, we have to consider this as well; otherwise, after the table is created, the only way to change it is to drop the table and redo everything!
That is because Hive uses directories to implement the partition hierarchy (see the Hive documentation).

Note: don't use a column with thousands of distinct values as the dynamic partition column; it will create thousands of directories in your filesystem!
Imagine you have 10000 different siteids and the table above has to be populated every day based on dt.
Then you will get 10000 more directories every day! (which is probably not what you want...)
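Hive does have guard rails for this: an insert fails once it tries to create more dynamic partitions than the configured limits. The limits can be adjusted (the values below are roughly the shipped defaults; check your version):
hive>set hive.exec.max.dynamic.partitions=1000;
hive>set hive.exec.max.dynamic.partitions.pernode=100;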

Another advantage of partitions: when you run
hive>select * from test where dt = '2013-08-31';
and dt is the partition column, Hive does not even have to launch a Hadoop job, because there is no map/reduce to do. Since Hive uses directories to manage partitions, it only needs to read the files in that partition's directory and print them out, which is much faster than running a map/reduce job.
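You can check this with EXPLAIN: when the filter touches only the partition column, the plan ends in a plain fetch stage instead of a map/reduce stage:
hive>explain select * from test where dt = '2013-08-31';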





Tuesday, September 3, 2013

How to use lateral view explode() function in Hive

In Hive, we can create tables which have a MAP structure inside, like:

create table test (
    item MAP<STRING, STRING>
);
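Note that to load such a table from delimited text files, the map delimiters usually need to be spelled out in the table definition; a sketch (these particular delimiters are a common choice, not a requirement):

create table test (
    item MAP<STRING, STRING>
)
row format delimited
    fields terminated by '\t'
    collection items terminated by ','
    map keys terminated by ':';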

Sometimes we want to iterate over all the items inside the MAP as key-value pairs.
Hive offers a function for this called explode():

explode() takes in an array as an input and outputs the elements of the array as separate rows. UDTF's can be used in the SELECT expression list and as a part of LATERAL VIEW.
An example use of explode() in the SELECT expression list is as follows:
Consider a table named myTable that has a single column (myCol) and two rows:
Array<int> myCol
[1,2,3]
[4,5,6]
Then running the query:
SELECT explode(myCol) AS myNewCol FROM myTable;
Will produce:
(int) myNewCol
1
2
3
4
5
6
The above is extracted from the official guide.
Here I am going to take a tour through how to explode a MAP structure, and also how to explode multiple MAP structures.

Taking the `test` table as an example:
hive>select * from test;
{"123":"abc"}
{"234":"bcd"}

Now if we do:
hive>select key, value from test 
    lateral view explode(item) dummy_table as key, value;
123    abc
234    bcd

As we can see, explode() expands the MAP into multiple rows, and of course we can use key and value in any clause like GROUP BY or SORT BY, etc.
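For example, counting how many rows contain each key (the same table, just with a GROUP BY on the exploded key):
hive>select key, count(*) from test
         lateral view explode(item) dummy_table as key, value
     group by key;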

And what about if we have multiple columns which are all MAP structures, like:

create table test2 (
    item1 MAP<STRING, STRING>,
    item2 MAP<STRING, STRING>
);

hive>select * from test2;
{"123":"abc","234":"bcd"}  {"123":"aaa","234":"bbb"}

Now if we do the same kind of query on test2:
hive>select key1, value1, key2, value2 from test2 
                  lateral view explode(item1) dummy1 as key1, value1
                  lateral view explode(item2) dummy2 as key2, value2;
123 abc 123 aaa
123 abc 234 bbb
234 bcd 123 aaa
234 bcd 234 bbb

We see that Hive does not output just two rows; instead, it tries all combinations (a cross product of the two exploded views).
So now we have a problem: what if I want to sum up item1["123"]?
If the value for key "123" were a number instead of letters, I should be able to sum the values by key, right?

But, as we showed above, Hive builds the cross product, so the values will be duplicated!

Here is my solution, and it's simple:
hive>select key1, value1, key2, value2 from test2 
                  lateral view explode(item1) dummy1 as key1, value1
                  lateral view explode(item2) dummy2 as key2, value2
          where key1 = key2;
123 abc 123 aaa
234 bcd 234 bbb

So now if we do the sum like:
hive>select key1, SUM(value1), SUM(value2) from test2
                  lateral view explode(item1) dummy1 as key1, value1
                  lateral view explode(item2) dummy2 as key2, value2
          where key1 = key2
          group by key1;

we will get the correct sum for every key.
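To make that concrete with made-up numeric data: if test2 held a single row with item1 = {"123":"1","234":"2"} and item2 = {"123":"10","234":"20"}, the query above would return (123, 1.0, 10.0) and (234, 2.0, 20.0); with more rows, the sums accumulate per key.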