Lateral view and explode function in Hive

Lateral view and explode function are mostly used for analysing text data using Hive. Let us understand both these concepts with an example.

Explode function

Explode function is known as “User defined Tabular function (UDTF)”, which takes one row as input and returns multiple rows as output.

Say, we have a table ‘all_tweets’, with id and tweet as its columns.

hive > select id , tweet from all_tweets ;

Fig1 : all_tweets table

Let us break down the tweet column into array of sentences by converting all tweets to lower case and then pass it to sentences function

hive>select id, tweet, sentences(lower(tweet)) as sentence from all_tweets ;

Fig 2 : sentences function

Here, we can see that the tweets are split into array of sentences. The 1st tweet(tweet_id=1) is split into ‘array<array<string>>’, since, the function sentences( ) breaks each sentence based on the exclamation mark (!) and other punctuations like periods and commas are stripped automatically.

Now, we will apply explode function, which takes each tweet as input ,flattens the array of sentences and outputs the elements of the array as separate rows

hive > select explode(sentences(lower(tweet))) from all_tweets;

Fig 3: explode function

Here, we know from Fig2 , that tweet_id=1, had array<array<string>> ([[“wow”],[“the”,”movie”,”was”,”awesome”]]), which is now flattened to two rows each having data type, array<string> - ([“wow”] and [“the”,”movie”,”was”,”awesome”]).

NOTE : explode function returns output rows and these output rows cannot be joined with input rows i.e, we cannot perform

hive > select tweet_id, explode(sentences(lower(tweet))) from all_tweets;

since, the explode function returns 2 rows(output) for single tweet_id(input)., eg: tweet_id=1. To solve this, we make use of Lateral View

Lateral View

Lateral view distributes the array elements retaining the common rows. It is used with UDTF (eg: explode( )). Lateral view creates a virtual table(with given table alias) by first applying the explode function to each row of base table and then joining the resulting output rows with input rows of base table. Let us understand this concept using our example.

hive > select * from all_tweets lateral view explode(sentences(lower(tweet))) dummy_view as words ;

Fig 4: lateral view 1

Here, we see that lateral view distributed the array elements — [“wow”] , [“the”,”movie”,”was”,”awesome”] retaining the common rows i.e id(=1) and tweet (‘Wow! The movie was awesome’). Lateral view creates a virtual table (dummy_view) with column name words obtained from explode function which is then joined with input rows -id and tweet.

Now that we have split tweets into array of words , we can further create separate records for every word using lateral view and explode.

First we need to create a view(v1) for our previous task

hive> create view v1 as select * from all_tweets lateral view explode(sentences(lower(tweet))) dummy_view as words ;

To create separate record for each word, we can apply lateral view onto the existing view v1

hive>create view v2 as select * from v1 lateral view explode(words) dummy_view as word;

Fig 5: lateral view 2

Next Steps

Till now, we have created separate records for each word in the tweet. We can perform sentiment analysis on each tweet using the word column. We can create a dictionary table which maps all the words in english to a sentiment (neutral(0), positive (+1) , negative (-1)). Each sentiment has a polarity value (0,+1,-1) associated with it. Using the dictionary table, we can identify polarity value for each word. Finally, we can sum the polarity values by grouping them on tweet id. The tweet sentiment is positive if sum of polarity is >1, negative if the sum is <1 , otherwise neutral.

References:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView

Freelancer | Aspiring Data Scientist | Software Developer