Voice Intelligence Based Wake Word Detection of Regional Dialects Using 1D Convolutional Neural Network

Voice-based apps can be effective among rural farmers if they work in the farmers' own spoken language or dialect. Many voice-based apps have been developed in the agricultural sector, and in each case farmers had to either type in their queries or communicate with a device that used standard speech to deliver the solution, which made the information harder to comprehend. This paper presents our research work in developing a wake word detection system for major dialects of 5 different regions in Karnataka, namely Dharwad, Dogganal, Tulu, Kodagu and Urban Kannada. The customized wake word system is designed using a 1D CNN model with 98% accuracy, which showed better results over ANNs (14.1% accuracy) and RNNs (48.1% accuracy). The diversity in regional dialects has been well identified by the Conv1D model; in a comparative analysis with RNNs on the sequential data, the predicted labels were compared and the performance of Conv1D reconciles well with the Dialect Dataset.


INTRODUCTION
Majority of the Indian population, around 58%, depend on agriculture as the major source of income. Information management is the main challenge faced by rural farmers: acquiring relevant information, comprehending it and implementing it in the right way can yield good production [19]. [11] mentions how understanding the farmer's conception of these modern facilities can play a major role in analyzing the factors which influence the adoption of technology, such as education, financial conditions, social constraints, dependency on the mandi (local market) for price information, and the social network among the farming community. The transformation from feature/basic phones to smartphones has paved the way for different approaches in the technology series [12]. In recent years, with the advancement of technology, especially in the fields of Artificial Intelligence and Data Science, various brands have come up with voice-based apps to provide a flawless experience for their consumers [19]. Mobile apps can become one of the most powerful tools for delivering relevant information to rural farmers regarding agricultural needs without any third-party influence [11]. Various agriculture-related apps have been released by the Government and private companies, including both voice assistants and chatbots. A few popular apps include Krishify, where farmers can search for any agriculture-related topic, and AgNext, which integrates various agriculture-related services along with the e-Nam platform [22]. Agri Media Video App is an online retail market; it also provides online chat services connecting farmers with field experts [12]. FarmBee, available in 10 distinct Indian languages, can provide information at different stages of the crop life cycle. Kisan Yojana provides information regarding various government schemes and policies in the agricultural sector [23]. Kumar [14] has mentioned the difficulties faced by farmers in using these applications: farmers had to either type in their queries or communicate with a device that used standard speech to deliver the solution [21][14]. Based on the statistics provided in that research paper, around 65% of the blockades are related to language, since most of these apps use standard speech to deliver the solution [23]. With respect to field experts, human error may occur, or they may lack the expertise to provide the right solution to the farmers' issues [20]. Out of the 58%, only around 15 to 20% of farmers are ready to shift to online platforms, and this gap may affect the agriculture sector in the coming years [14]. India is a country with 120 major languages and 1600 dialects, and if technology must reach the granular level to suit rural farmers' needs, the language barrier cannot be left unnoticed [13].
There is a dire need to create a bond between technology and the farmer, through which farmers can easily communicate with the device; this is only possible when the communication is in their native language and dialect [15]. The facts mentioned above lead to one important conclusion: the farming community may find it easier to speak to the device in their own native language and dialect than to type. Vernacular voice-based apps can make farmers feel connected to the technology, through which they can leverage various solutions for their agricultural needs and gain potential benefits [11]. They only need a simple logical solution: speaking to the device and connecting to different platforms without any intervention from middlemen [13]. Dialect identification is one of the upcoming topics in the world of speech recognition tasks. It is among the most challenging, in terms of differentiating a dialect from the spoken language, because of complexity and overlapping phonetic systems [1]. In [5] the authors mention the deficiency of resources for Dialect Identification (DID) modelling. There are very small variations in the parameters related to the utterance of the same word in different styles of the same language [4]. For any DID task a few parameters should be closely monitored, such as prosodic features including intonation, phonology, vocabulary and grammar. Using TensorFlow Lite makes it much easier to deploy the model on Android devices.

LITERATURE REVIEW
In this paper, we mainly focus on the wake word detection method, which is phase 1 of our research work in providing a voice-based solution to farmers, mainly focusing on dialects. Most of the work in recent years is based on different deep learning techniques. Tsai T H [8] proposed a real-time wake word system based on Convolutional Neural Networks (CNN). After preprocessing the data with MFCC (Mel-Frequency Cepstral Coefficient) features, they used a GMM (Gaussian Mixture Model) to train the speaker identification model, which uses a likelihood function to identify the true speaker. Next, the posterior probability is predicted for each GMM model, and the state sequence is compared using Hidden Markov Models (HMM). The Hidden Markov Model is efficient in partitioning human speech into different syllables, and the wake-up action is triggered only when both the state sequence and the posterior probability pass the threshold. A probabilistic model, rather than assuming the entire distribution, assumes only some moments, which makes it more accurate than conventional machine learning. The paper [2] proposed an LSTM (Long Short-Term Memory) based method for trigger word or wake word detection on speech data. LSTM is a variant of the Recurrent Neural Network (RNN) which supports long-term dependencies among the timesteps of the data (computed on the spectrogram); at each timestep, backpropagation uses the current as well as previous inputs as input to the neuron. A CNN-based wake word system works on fixed input sizes, which can cause errors with long durations as it may also consider non-relevant utterances. The authors have also explained that LSTM was developed to handle the vanishing and exploding gradient issues encountered while training RNNs, and LSTM techniques are good at handling lengthy speech data. In [18] the authors proposed a wake word detection system using Transformers, which accomplished better results over LSTM and CNN sequence
modelling tasks. One of the main points highlighted in the paper is that, since wake word detection is a short-range temporal task, large sequence models like Transformers may not be a viable option. Transformers use an attention mechanism to maintain long-term memory. They have an attention-based encoder and decoder, where the encoder holds all the information learned from the input sequence and the decoder then takes that sequence and gives a single output, also considering the previous output. The model can "attend" to all previously generated tokens. The authors adopted an LF-MMI (Lattice-Free Maximum Mutual Information) system which includes gradient stopping, looking ahead to the next piece of data, position-based embedding methods, and layer dependencies. It outperformed CNN by 25% in the false rejection rate and sustains linear complexity in the segment length. They also mention using tensors, as this may be more efficient for short-range word sequences, in which only the target word is considered instead of the entire utterance. [17] proposed a system deployed with Res2Net, which is an improved variation of ResNet; it enhances the ability to detect wake words of different durations. It is applied to the Mobvoi data, which consist of two wake words, and it improves the false rejection rate by 12% over other systems. Res2Net is a classification model with a broadened receptive field, which increases the detection capability of the model with fewer parameters. This is done by extracting the exact features by considering the local features and then capturing the global feature of the same size from the given region of varied lengths.

PROPOSED WORK
The proposed work is based on a multi-class classification problem wherein every input belongs to only one class. Wake words are used to start the conversation and wake up the device to respond to our queries. The device cannot continuously listen to the conversation: it may cause a security breach and may also place a huge load on the servers to process each audio signal. It only starts listening to our commands once the wake word is detected and the device is woken up. The wake word detection system is a 5-step process. First, we prepare the dataset by recording a few seconds of audio containing the wake word in different dialects, and also recording audio which does not contain the wake word. Next, we convert the raw audio data into a waveform in the time domain; but to better analyze and extract the important features in the audio, we need to transform the waveform from the time domain to the frequency domain, which in our work is done mainly with MFCCs. Each MFCC is labelled, and these labels and features are saved into a pickle file for later use. Based on the dataset and the problem statement, we choose the deep learning model which yields the best predictions; in this scenario of dialect identification for wake word detection, the Conv1D technique suits well. We convert every MFCC into a 1-dimensional array and give it as input to the first convolution layer, then train the model using TensorFlow with Keras. Later, we evaluate the trained model for prediction, where the system listens to the audio, classifies the input into one of the classes, and writes the result to a CSV file; the result indicates whether the specific audio contains the wake word or not. The samples include audio files containing the recorded wake words. The audio is recorded for a short 3 seconds, saying "Namaskara" in the respective regional dialect along with 2 or more words which go with the greeting and are specific to the region. For the non-wake-word class, data
was collected from crowded places like restaurants and local marketplaces, making sure that wake words were not present in these audio clips. Sounddevice is used for recording the audio and creating a NumPy array, and then scipy.io.wavfile saves the NumPy array as an audio file in .wav format. Each dialect has 100 recordings, with audio clips of both male and female voices across different age groups. Data augmentation plays a crucial role in increasing the volume of the dataset by varying the playback speed from 0.7 to 1.4.
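The recording and saving steps above can be sketched as follows. This is a minimal, hypothetical sketch, not the authors' exact code: the file name is an invented example, a synthetic sine wave stands in for a live microphone take so the snippet runs offline, and the naive interpolation-based speed change is a simple stand-in for the augmentation described.

```python
import numpy as np
from scipy.io import wavfile

SAMPLE_RATE = 44100  # Hz, as used in the paper
DURATION = 3         # seconds per wake-word clip

def record_clip(seconds=DURATION, sr=SAMPLE_RATE):
    """Record a mono clip from the microphone (needs the sounddevice package)."""
    import sounddevice as sd  # imported here so the rest runs without a mic
    audio = sd.rec(int(seconds * sr), samplerate=sr, channels=1, dtype="float32")
    sd.wait()  # block until the recording is finished
    return audio.ravel()

def change_speed(signal, factor):
    """Naive speed change by linear-interpolation resampling (a simple
    stand-in for the 0.7x-1.4x augmentation described in the text)."""
    n_out = int(len(signal) / factor)
    return np.interp(np.linspace(0, len(signal) - 1, n_out),
                     np.arange(len(signal)), signal)

# A 440 Hz sine stands in for a recorded clip; the file name is hypothetical.
t = np.linspace(0, DURATION, DURATION * SAMPLE_RATE, endpoint=False)
clip = 0.5 * np.sin(2 * np.pi * 440 * t)

wavfile.write("namaskara_dharwad_001.wav", SAMPLE_RATE, clip.astype(np.float32))
fast = change_speed(clip, 1.4)   # 1.4x speed: shorter clip
slow = change_speed(clip, 0.7)   # 0.7x speed: longer clip
```

A speed factor above 1 shortens the clip and one below 1 lengthens it, so each recording yields several augmented variants.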
All the audio files are stored in the audio data directory, where each dialect has 100 recorded audio clips. Each time the audio is recorded, it is saved under a unique file name, which is later helpful when iterating over the files.
The voice data collected during the data collection process can also be used as input for recording the wake word. The actual audios were recorded in the field while conducting the survey about dialect variations in the respective regions, and the same audio is used for preparing the dataset to get accurate results.

B. PREPROCESSING THE RAW AUDIO FILE:
While recording the audio, two parameters are considered: save_path, the empty directory where all the audio files must be saved, and n_times, the number of times the audio is recorded. Once the audio file is ready, the sample rate must be initialized. [8] The sample rate is the rate at which the sound is sampled per second; for the audio signal, a sample rate of 44100 hertz is considered. For recording the wake word, a minimum of 3 seconds is initialized for each dialect.
Librosa (a Python library) is extensively used for examining audio data. Audio preprocessing includes 3 major steps: first, the raw audio file is loaded; next, it must be converted into a .wav file; and third, the useful pattern is extracted from that spectrum. librosa.load takes the path of the audio file and returns the audio as a NumPy array along with its sample rate. librosa.display is an API for visualizing the spectrogram; it is built on top of matplotlib.
Figure 2 shows the waveform plotted for one of the files in Non-Wake Word and Figure 3 shows the MFCC of the Waveform.
The same procedure is applied to all the audio files in the dataset; after preprocessing, they are classified into their respective labels. Signals are framed into 20-40 ms windows, as there is continuous variation in the audio signal, so we need to consider a short time range where the audio signal has little variation, or is static. After loading the files using librosa, the next step of the preprocessing phase is to extract a useful pattern from the audio files. For this purpose, we use MFCC (Mel-Frequency Cepstral Coefficient) features, as they yield better performance in identifying low-frequency regions than high-frequency regions. MFCC can easily be applied to examine the patterns in lower frequencies and analyze the resonances created by the vocal tract, which lets us spot only the linguistic content while excluding the noise. The very small yet important variations in the speech signal observed between dialects exist in the changing amplitude, pitch, speaker identity, duration and timbre (which comes from the uniqueness of each speaker, describing the quality of the tone). MFCC gives information about the changing rates in spectral bands. Mainly, the signal must shift from the time domain to the frequency domain, which is done using the Fourier transform to examine the spectral and power components of the signal; this encoding of sound into numbers is analogous to text vectorization, where words (usually whole sentences) are mapped into vectors. Figure 4 depicts the process of extracting MFCC for the audio file. MFCC is used to extract the pattern from the wave file, and the procedure is followed for all the files in the dataset. We use the mean of the MFCCs to reduce the dimensionality of the data; it helps remove the convolutional effects caused either by the recording device or by the participant's vocal tract response, and it may add features which yield better results when given as input to the model.
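The paper uses librosa's MFCC implementation; purely as an illustration of the steps just described (framing into short near-static windows, Fourier transform to the frequency domain, mel filtering, DCT, then averaging the 40 coefficients), here is a from-scratch NumPy/SciPy sketch. The frame sizes and filterbank construction are textbook defaults assumed for illustration, not taken from the paper.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_mean(signal, sr=44100, n_mfcc=40, n_fft=2048, frame_ms=25, hop_ms=10):
    """Mean MFCC vector: frame -> window -> power spectrum -> mel filterbank
    -> log -> DCT, then average over frames (mirroring the steps in the text)."""
    frame_len = int(sr * frame_ms / 1000)          # ~25 ms frames ...
    hop = int(sr * hop_ms / 1000)                  # ... with a ~10 ms hop
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)   # short, near-static windows
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft  # time -> frequency
    # Triangular mel filterbank: filters spaced evenly on the mel scale,
    # which gives finer resolution at low frequencies.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mfcc + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mfcc, n_fft // 2 + 1))
    for m in range(1, n_mfcc + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    coeffs = dct(log_mel, type=2, axis=1, norm="ortho")[:, :n_mfcc]
    return coeffs.mean(axis=0)                     # one 40-dim vector per clip
```

Averaging over frames is what collapses a variable-length clip into the fixed 40-coefficient vector that is later fed to the Conv1D model.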
Next, we need to label the data. The label encoding technique is used to assign integer labels from 0 to 5 to the audios of the respective classes. It is treated as a multiclass classification problem.
The next step is creating a pandas DataFrame of the final data, along with a dictionary in which this final DataFrame is saved; the DataFrame can then be easily accessed during the training phase. The DataFrame is saved as a pickle file.
Pickle is mainly used to save the labelled dataset so it can be applied later in further experiments [2]. By doing this, we can transfer the pickled file to other users working on a similar dataset instead of transferring the entire dataset, which may cause storage problems when handling large datasets. Sometimes we may also lose the original dataset.
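The labelling and saving steps might look like the following hypothetical sketch. The dialect-to-label mapping, column names and file name are assumptions for illustration, and `DataFrame.to_pickle` plays the role of the pickle step described above.

```python
import numpy as np
import pandas as pd

# Hypothetical label map: five dialect classes plus non-wake-word audio.
LABELS = {"Dharwad": 0, "Dogganal": 1, "Tulu": 2,
          "Kodagu": 3, "Urban Kannada": 4, "Non-Wake Word": 5}

# Each row holds one clip's 40-dim mean-MFCC feature vector and its label.
rows = [
    {"feature": np.random.rand(40), "class_label": LABELS["Dharwad"]},
    {"feature": np.random.rand(40), "class_label": LABELS["Non-Wake Word"]},
]
final_data = pd.DataFrame(rows)

# Pickle the labelled dataset so training can reload it without re-extraction.
final_data.to_pickle("dialect_features.pkl")
reloaded = pd.read_pickle("dialect_features.pkl")
```

Shipping the single pickle file, rather than hundreds of .wav recordings, is what avoids the storage and transfer problems mentioned above.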

C.MODEL ARCHITECTURE
Based on the dataset, the deep learning model which fits best is Conv1D. Here the input is a 1-dimensional array: the MFCCs are fed in as a 1-dimensional array with 40 coefficients, so the input shape is (40, 1). The model has two convolution layers. The first layer has 64 filters with kernel size 3 (the kernel is also 1-dimensional), followed by the ReLU activation function, which converts negative values to zero, max pooling with pool_size 2, and a dropout rate of 0.25.
The second layer has 128 filters, again followed by ReLU, max pooling and a dropout rate of 0.25. Next, the output from the second max-pooling layer is flattened and connected to a dense layer with 512 neurons, followed by dropout at a rate of 0.5, and then a final dense layer with 6 neurons (as per our labels, the number of classes) with the SoftMax activation function. Each layer is thus followed by a dropout layer.
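The architecture just described can be sketched in Keras as follows. The layer sizes mirror the text; the optimizer and loss are assumptions, since the paper does not state them.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(n_mfcc=40, n_classes=6):
    """Conv1D architecture as described: (40, 1) MFCC input, two conv blocks,
    a 512-unit dense layer, and a 6-way SoftMax output."""
    model = models.Sequential([
        layers.Input(shape=(n_mfcc, 1)),
        layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Dropout(0.25),
        layers.Conv1D(128, kernel_size=3, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),
    ])
    # Optimizer and loss are assumed; the paper does not specify them.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()
```

With 'same' padding the two pooling layers halve the length from 40 to 20 to 10, so the flattened tensor feeding the dense layer has 10 x 128 = 1280 values.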
Padding is set to 'same', which adds zeros at the two ends of the input array and is used to get an output with the same dimension as the input. The output shape is calculated as ((n + 2p - f) / stride) + 1, wherein n is the input length, p is the number of layers of zeros added at the border of the input data, f is the filter size, and the stride is taken as 1 in this customized model architecture.
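The output-length formula can be checked in a few lines; `conv_output_len` is an illustrative helper, not code from the paper.

```python
def conv_output_len(n, p, f, stride=1):
    """Output length of a 1-D convolution: ((n + 2p - f) / stride) + 1."""
    return (n + 2 * p - f) // stride + 1

# With kernel size 3 and stride 1, 'same' padding (p = 1) keeps the length 40:
assert conv_output_len(n=40, p=1, f=3) == 40
# Without padding (p = 0) the same convolution would shrink it to 38:
assert conv_output_len(n=40, p=0, f=3) == 38
```

This is why the convolution layers preserve the 40-coefficient length and only the pooling layers reduce it.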
ReLU (Rectified Linear Unit) is an activation function used in the convolution layers which returns 0 whenever it receives a negative value in the feature map (obtained by the dot product and summation of the input data and the filter), but returns any positive value x unchanged. It is represented as f(x) = max(0, x), as shown graphically in the figure. Max-pooling layer: this is applied on the convolution layer; the sliding window (kernel) slides over the feature map and takes the maximum value from each region. The size of the filter in the pooling operation is smaller than the feature map. Here the stride is taken as 2, and based on this we can calculate the output shape.
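ReLU and max pooling can be demonstrated directly in NumPy; the toy feature map below is illustrative only.

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x): negatives in the feature map become 0."""
    return np.maximum(0.0, x)

def max_pool_1d(x, pool_size=2, stride=2):
    """Slide a window over the feature map, keeping each region's maximum."""
    n_out = (len(x) - pool_size) // stride + 1
    return np.array([x[i * stride:i * stride + pool_size].max()
                     for i in range(n_out)])

feature_map = np.array([-1.0, 2.0, -3.0, 4.0, 0.5, -0.5])
activated = relu(feature_map)      # [0., 2., 0., 4., 0.5, 0.]
pooled = max_pool_1d(activated)    # [2., 4., 0.5]
```

Pooling with stride 2 halves the length while keeping the strongest activation in each region, which is how the model's shapes go from 40 to 20 to 10.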
Dropout layer: this can be applied both at the convolutional layers and at the dense layers, but with different effects.
1. At the convolutional layers, dropout is applied at a very low rate of about 0.2, which improves performance without affecting the extraction of important features in the feature maps.

Graphically, SoftMax follows an S-shaped curve between 0 and 1 with 0.5 as its midpoint: the output approaches 1 for large values and 0 for small or negative values. (For a single output this curve is the logistic sigmoid; SoftMax generalizes it to multiple classes.)
Figure 5 gives the visualization of the model using NetViz software tool.
There is no fixed rule for the specific number of filters or layers; we build whichever architecture best suits the problem statement and gives the best predictions.

RESULTS
The output of the proposed Conv1D model is one of the identified dialects (classes). Given the input, either through an audio file or by uttering the words for 3 seconds, the model predicts which of the 6 dialect classes the input belongs to and writes the predicted output to the CSV file. The accuracy of the prediction is 0.9917. Another way is to record a new audio clip that is unseen in both the training and testing data and use the pretrained model (saved as .hdf5); the prediction is one of the classes and is saved in the CSV file. The dataset for our problem statement reflects the diversity among dialects: each dialect has its own phrase for greeting, and the phrases vary in their use of nouns, verbs and pronouns in the respective dialects.
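Writing the predicted class to a CSV file might look like the following sketch. The class order in `DIALECTS`, the function name and the output columns are assumptions for illustration; the real mapping comes from the label encoder.

```python
import csv
import numpy as np

# Hypothetical class order; the real mapping comes from the label encoder.
DIALECTS = ["Dharwad", "Dogganal", "Tulu", "Kodagu",
            "Urban Kannada", "Non-Wake Word"]

def log_prediction(probs, out_path="predictions.csv"):
    """Pick the arg-max class from the model's softmax output and append
    the encoded label, dialect name and confidence to a CSV file."""
    label = int(np.argmax(probs))
    with open(out_path, "a", newline="") as f:
        csv.writer(f).writerow([label, DIALECTS[label], f"{probs[label]:.4f}"])
    return DIALECTS[label]

probs = np.array([0.01, 0.02, 0.90, 0.03, 0.02, 0.02])  # e.g. model.predict(...)[0]
predicted = log_prediction(probs)  # appends one row; returns "Tulu"
```

Appending (mode "a") rather than overwriting keeps one row per evaluated clip, matching the per-prediction CSV logging described above.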
The figure below shows the breakdown of the wake word/sequence of words or phrase and their parts of speech with their positions. The works proposed in [8], [18] and [2] are based on keywords of at most two words, which are very short, with no variation among the multiple keywords. In contrast, the dataset for this paper has variation in each category and more than two words, which can be considered a phrase but not a sentence; there is no sequential structure in these phrases, which argues against the use of RNNs for our experiment.
A comparative analysis is carried out between ANN and RNN models and Conv1D to validate the accuracy of the proposed Conv1D model on the dialect dataset. The ANN's performance was deficient, as its classification was fragmentary, which was concerning in terms of predictions.
The confusion matrix shows the deficiency in predictions for both models. We can more closely compare the RNN-LSTM model with Conv1D.
As shown in the part-of-speech breakdown of the words, the dialects of Dharwad and Dogganal are similar but not the same. Because the words occupy similar positions in the phrase with respect to their parts of speech, the RNN model misclassified, or made faulty predictions for, labels [0,1] and [2,4].
Comparing the predicted labels clearly shows the superior performance of Conv1D over the RNN-LSTM model.
The below figure shows the labels which are wrongly predicted by RNN-LSTM model.
The overall implementation of the Conv1D model yields noticeable results over the other two models and accomplishes the task.
The figure above compares the performance metrics of the two models against the proposed Conv1D model on the dialect dataset, wherein Conv1D shows satisfactory results in correctly matching the wake word to the correct labels.

CONCLUSION
Farmers in rural areas still face a huge challenge in adopting new technological trends such as voice-based apps. One of the major obstacles is communicating with, and comprehending information in, the standard language used by mobile apps. To tackle this issue, apps should be more friendly and approachable; above all, farmers should feel connected to the app, which magnifies the need for communication in their regional dialects. The dialect identification task is slowly picking up pace with advancements in AI and Data Science. In this paper, we have proposed a simple wake word detection system built on five major dialects of Karnataka using TensorFlow, Keras and a CNN, which works efficiently on short word ranges compared with other deep learning techniques, and the model architecture is well built to provide accurate predictions. For further work, the TensorFlow quantization API can offer more flexibility in deploying the model; quantization is a method designed to make models smaller, faster, and less dependent on the settings of the environment where they are deployed.
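The TensorFlow Lite conversion mentioned for future work could be sketched as below, using dynamic-range quantization on a small stand-in model. This is an assumed workflow, not the authors' deployment code; in practice the trained Conv1D model would replace the stand-in.

```python
import tensorflow as tf

# A small stand-in model; in practice this would be the trained Conv1D model.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(40, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation="softmax"),
])

# Convert to TensorFlow Lite with dynamic-range quantization, which shrinks
# the model and speeds up on-device inference for Android deployment.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_bytes = converter.convert()

with open("wake_word_model.tflite", "wb") as f:
    f.write(tflite_bytes)
```

The resulting .tflite flatbuffer can be bundled with an Android app and run through the TensorFlow Lite interpreter, in line with the on-device deployment goal stated above.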

Figure 1
Figure 1 gives the step-by-step process of classification, which is explained in detail in the following sections.

2. At the dense layer, dropout is applied at a rate of 0.5, which increases the classification accuracy.

The SoftMax activation function is applied to the outputs of the dense layer, mainly at the last layer of the neural network, for a multiclass classification problem with n classes. It returns the output vector as probability scores, highlighting the maximum value and suppressing the lower ones:

softmax(y)_i = exp(y_i) / Σ_j exp(y_j)

where y is the input vector of the SoftMax function, consisting of n elements for n classes; exp(y_i) results in a small value close to 0 (but not 0) for negative inputs; and the sum over exp(y_j) is the normalization term which makes sure the output values range from 0 to 1.
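The SoftMax computation above can be written in a few lines of NumPy; the example scores are arbitrary.

```python
import numpy as np

def softmax(y):
    """softmax(y)_i = exp(y_i) / sum_j exp(y_j).
    Shifting by max(y) before exponentiating avoids numeric overflow."""
    e = np.exp(y - np.max(y))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1, -1.0, 0.5, 1.5])  # e.g. final dense-layer outputs
probs = softmax(scores)
# probs sums to 1, every entry lies in (0, 1), and the largest score dominates
```

The arg-max of `probs` gives the predicted class, which is exactly how the final 6-neuron layer selects one dialect label.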

Figure 7
Figure 7 gives the summary of the model, and Figure 8 is the table depicting in detail the calculations for the trainable parameters.