Lateral view and explode function are mostly used for analysing text data using Hive. Let us understand both these concepts with an example.
Explode function is known as “User defined Tabular function (UDTF)”, which takes one row as input and returns multiple rows as output.
Say, we have a table ‘all_tweets’, with id and tweet as its columns.
hive > select id , tweet from all_tweets ;
Let us break down the tweet column into array of sentences by converting all tweets to lower case and then pass it to sentences function
hive>select id, tweet, sentences(lower(tweet)) as sentence from all_tweets ;
Here, we can see that the tweets are split into array of sentences. The 1st tweet(tweet_id=1) is split into ‘array<array<string>>’, since, the function sentences( ) breaks each sentence based on the exclamation mark (!) and other punctuations like periods and commas are stripped automatically.
Now, we will apply explode function, which takes each tweet as input ,flattens the array of sentences and outputs the elements of the array as separate rows
hive > select explode(sentences(lower(tweet))) from all_tweets;
Here, we know from Fig2 , that tweet_id=1, had array<array<string>> ([[“wow”],[“the”,”movie”,”was”,”awesome”]]), which is now flattened to two rows each having data type, array<string> - ([“wow”] and [“the”,”movie”,”was”,”awesome”]).
NOTE : explode function returns output rows and these output rows cannot be joined with input rows i.e, we cannot perform
hive > select tweet_id, explode(sentences(lower(tweet))) from all_tweets;
since, the explode function returns 2 rows(output) for single tweet_id(input)., eg: tweet_id=1. To solve this, we make use of Lateral View
Lateral view distributes the array elements retaining the common rows. It is used with UDTF (eg: explode( )). Lateral view creates a virtual table(with given table alias) by first applying the explode function to each row of base table and then joining the resulting output rows with input rows of base table. Let us understand this concept using our example.
hive > select * from all_tweets lateral view explode(sentences(lower(tweet))) dummy_view as words ;
Here, we see that lateral view distributed the array elements — [“wow”] , [“the”,”movie”,”was”,”awesome”] retaining the common rows i.e id(=1) and tweet (‘Wow! The movie was awesome’). Lateral view creates a virtual table (dummy_view) with column name words obtained from explode function which is then joined with input rows -id and tweet.
Now that we have split tweets into array of words , we can further create separate records for every word using lateral view and explode.
First we need to create a view(v1) for our previous task
hive> create view v1 as select * from all_tweets lateral view explode(sentences(lower(tweet))) dummy_view as words ;
To create separate record for each word, we can apply lateral view onto the existing view v1
hive>create view v2 as select * from v1 lateral view explode(words) dummy_view as word;
Till now, we have created separate records for each word in the tweet. We can perform sentiment analysis on each tweet using the word column. We can create a dictionary table which maps all the words in english to a sentiment (neutral(0), positive (+1) , negative (-1)). Each sentiment has a polarity value (0,+1,-1) associated with it. Using the dictionary table, we can identify polarity value for each word. Finally, we can sum the polarity values by grouping them on tweet id. The tweet sentiment is positive if sum of polarity is >1, negative if the sum is <1 , otherwise neutral.