Entries |
Document | Title | Date |
20080201149 | VARIABLE VOICE RATE APPARATUS AND VARIABLE VOICE RATE METHOD - A variable voice rate apparatus to control a reproduction rate of voice, includes a voice data generation unit configured to generate voice data from the voice, a text data generation unit configured to generate text data indicating a content of the voice data, a division information generation unit configured to generate division information used for dividing the text data into a plurality of linguistic units each of which is characterized by a linguistic form, a reproduction information generation unit configured to generate reproduction information set for each of the linguistic units, and a voice reproduction controller which controls reproduction of each of the linguistic units, based on the reproduction information and the division information. | 08-21-2008 |
20080208584 | Pausing A VoiceXML Dialog Of A Multimodal Application - Pausing a VoiceXML dialog of a multimodal application, including generating by the multimodal application a pause event; responsive to the pause event, temporarily pausing the dialogue by the VoiceXML interpreter; generating by the multimodal application a resume event; and responsive to the resume event, resuming the dialog. Embodiments are implemented with the multimodal application operating on a multimodal device supporting multiple modes of interaction including a voice mode and one or more non-voice modes, the multimodal application is operatively coupled to a VoiceXML interpreter, and the VoiceXML interpreter is interpreting the VoiceXML dialog to be paused. | 08-28-2008 |
20080221894 | SYNTHESIZING SPEECH FROM TEXT - Speech is synthesized for a given text by determining a sequence of phonetic components based on the text, determining a sequence of target phonetic elements associated with the phonetic components, determining a sequence of target event types associated with the phonetic components, and determining a sequence of speech units from a plurality of stored speech unit candidates by use of a cost function. The cost function comprises a unit cost, a concatenation cost, and an event type cost for each speech unit in the sequence of speech units. The unit cost of a speech unit is determined with respect to the corresponding target phonetic element, while the concatenation cost of a speech unit is determined with respect to adjacent speech units and the event type cost of each speech unit is determined with respect to the corresponding target event type. | 09-11-2008 |
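The cost-minimizing unit selection described in the abstract above can be sketched as a small dynamic program. This is an illustrative sketch, not the patented implementation: each candidate's unit cost and event-type cost are folded into one per-candidate number, the concatenation cost is a caller-supplied function, and the candidate lists are assumptions.

```python
def select_units(candidates, concat_cost):
    """Pick one speech unit per target position minimizing total cost.

    candidates: list (one entry per target position) of (unit_id, cost) pairs,
        where cost already sums the unit cost and the event-type cost.
    concat_cost: function(prev_unit_id, next_unit_id) -> float.
    Returns (best_total_cost, best_unit_sequence).
    """
    # best[u] = (cost of cheapest path ending in unit u, that path)
    best = {uid: (c, [uid]) for uid, c in candidates[0]}
    for column in candidates[1:]:
        new_best = {}
        for uid, c in column:
            # extend the cheapest surviving path with this candidate unit
            prev_cost, prev_path = min(
                ((pc + concat_cost(p, uid), path) for p, (pc, path) in best.items()),
                key=lambda t: t[0],
            )
            new_best[uid] = (prev_cost + c, prev_path + [uid])
        best = new_best
    return min(best.values(), key=lambda t: t[0])
```

A two-position example with hypothetical unit ids: `select_units([[("a1", 1.0), ("a2", 2.0)], [("b1", 1.0), ("b2", 0.5)]], cc)` picks the pair whose summed unit, event-type, and concatenation costs are lowest.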
20080235024 | METHOD AND SYSTEM FOR TEXT-TO-SPEECH SYNTHESIS WITH PERSONALIZED VOICE - A method and system are provided for text-to-speech synthesis with personalized voice. The method includes receiving an incidental audio input (…) | 09-25-2008 |
20080235025 | PROSODY MODIFICATION DEVICE, PROSODY MODIFICATION METHOD, AND RECORDING MEDIUM STORING PROSODY MODIFICATION PROGRAM - A prosody modification device includes: a real voice prosody input part that receives real voice prosody information extracted from an utterance of a human; a regular prosody generating part that generates regular prosody information having a regular phoneme boundary that determines a boundary between phonemes and a regular phoneme length of a phoneme by using data representing a regular or statistical phoneme length in an utterance of a human with respect to a section including at least a phoneme or a phoneme string to be modified in the real voice prosody information; and a real voice prosody modification part that resets a real voice phoneme boundary by using the generated regular prosody information so that the real voice phoneme boundary and a real voice phoneme length of the phoneme or the phoneme string to be modified in the real voice prosody information are approximate to an actual phoneme boundary and an actual phoneme length of the utterance of the human, thereby modifying the real voice prosody information. | 09-25-2008 |
20080243511 | Speech synthesizer - The present invention is a speech synthesizer that generates speech data for text including a fixed part and a variable part by combining recorded speech and rule-based synthetic speech. The speech synthesizer produces high-quality speech in which recorded speech and synthetic speech are concatenated without perceptible discontinuity of timbre or prosody. The speech synthesizer includes: a recorded speech database that previously stores recorded speech data including a recorded fixed part; a rule-based synthesizer that generates rule-based synthetic speech data including a variable part and at least part of the fixed part, from received text; a concatenation boundary calculator that calculates a concatenation boundary position in a region in which the recorded speech data and the rule-based synthetic speech data overlap, based on acoustic characteristics of the recorded speech data and the rule-based synthetic speech data that correspond to the text; and a concatenative synthesizer that generates synthetic speech data corresponding to the text by concatenating the recorded speech data and the rule-based synthetic speech data that are segmented at the concatenation boundary position. | 10-02-2008 |
20080249776 | Methods and Arrangements for Enhancing Machine Processable Text Information - The invention relates to methods and arrangements for enhancing machine processable text information which is provided by at least machine processable text data. On the basis of synthetic speech, i.e. speech generated by a machine, prosody-related information and/or text-related information is determined and added to given text information. | 10-09-2008 |
20080262845 | METHOD TO TRANSLATE, CACHE AND TRANSMIT TEXT-BASED INFORMATION CONTAINED IN AN AUDIO SIGNAL - A method, system and computer-readable medium for generating, caching and transmitting textual equivalents of information contained in an audio signal are presented. The method includes generating, in a first device, a textual equivalent of at least a portion of a speech-based audio signal, storing a portion of the textual equivalent in the first device's memory, and transmitting the stored textual equivalent to another device. | 10-23-2008 |
20080262846 | Wireless server based text to speech email - An email system for mobile devices, such as cellular phones and PDAs, is disclosed which allows email messages to be played back on the mobile device as voice messages on demand by way of a media player, thus eliminating the need for a unified messaging system. Email messages are received by the mobile device in a known manner. In accordance with an important aspect of the invention, the email messages are identified by the mobile device as they are received. After the message is identified, the mobile device sends the email message in text format to a server for conversion to speech or voice format. After the message is converted to speech format, the server sends the messages back to the user's mobile device and notifies the user of the email message and then plays the message back to the user through a media player upon demand. | 10-23-2008 |
20080270137 | TEXT TO SPEECH INTERACTIVE VOICE RESPONSE SYSTEM - A text to speech interactive voice response system is operable within a personal computer having a processor, data storage means and an operating system. The system comprises an input subsystem for receiving a text data stream from a source device in a predetermined format; a process control subsystem for converting the text data stream into corresponding output data items; an audio record subsystem for recording audio data to be associated with each output data item; and, a broadcast control subsystem for generating an audio broadcast based on the output data items. There is also disclosed a system management and control subsystem for user interface with the system. | 10-30-2008 |
20080270138 | AUDIO CONTENT SEARCH ENGINE - A method of generating an audio content index for use by a search engine includes determining a phoneme sequence based on recognized speech from an audio content time segment. The method also includes identifying k-phonemes which occur within the phoneme sequence. The identified k-phonemes are stored within a data structure such that the identified k-phonemes are capable of being compared with k-phonemes from a search query. | 10-30-2008 |
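The k-phoneme indexing idea in the abstract above can be sketched with a sliding window: record where each k-gram of phonemes starts, so query k-grams can be looked up directly. The index layout, the example phoneme labels, and the choice k=3 are illustrative assumptions, not details from the patent.

```python
from collections import defaultdict

def build_k_phoneme_index(phonemes, k=3):
    """Map each k-phoneme (tuple of k consecutive phonemes) to its start positions."""
    index = defaultdict(list)
    for i in range(len(phonemes) - k + 1):
        index[tuple(phonemes[i:i + k])].append(i)
    return index

def lookup(index, query_phonemes, k=3):
    """Return start positions in the indexed content matching any k-phoneme of the query."""
    hits = set()
    for i in range(len(query_phonemes) - k + 1):
        hits.update(index.get(tuple(query_phonemes[i:i + k]), []))
    return sorted(hits)
```

Storing k-grams rather than single phonemes trades index size for query selectivity: a search query only has to intersect candidate positions instead of scanning the whole recognized phoneme stream.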
20080270139 | Converting text-to-speech and adjusting corpus - The present invention provides a method and apparatus for text-to-speech conversion, and a method and apparatus for adjusting a corpus. The method for text-to-speech conversion comprises: a text analysis step for parsing the text to obtain descriptive prosody annotations of the text based on a TTS model generated from a first corpus; a prosody parameter prediction step for predicting the prosody parameters of the text according to the result of the text analysis step; and a speech synthesis step for synthesizing speech of the text based on the prosody parameters of the text. The descriptive prosody annotations of the text include a prosody structure for the text, and the prosody structure is adjusted according to a target speech speed for the synthesized speech. Because the present invention adjusts the prosody structure of the text according to the target speech speed, the synthesized speech has improved quality. | 10-30-2008 |
20080288256 | REDUCING RECORDING TIME WHEN CONSTRUCTING A CONCATENATIVE TTS VOICE USING A REDUCED SCRIPT AND PRE-RECORDED SPEECH ASSETS - The present invention discloses a system and a method for creating a reduced script, which is read by a voice talent to create a concatenative text-to-speech (TTS) voice. The method can automatically process pre-recorded audio to derive speech assets for a concatenative TTS voice. The pre-recorded audio can include sets of recorded phrases used by a speech user interface (SUI). A set of unfulfilled speech assets needed for full phonetic coverage of the concatenative TTS voice can be determined. A reduced script can be constructed that includes a set of phrases, which when read by a voice talent results in a reduced corpus. When the reduced corpus is automatically processed, a reduced set of speech assets results. The reduced set includes each of the unfulfilled speech assets. When this reduced set is combined with the existing speech assets, the result is a voice with a complete set of speech assets. | 11-20-2008 |
20080288257 | APPLICATION OF EMOTION-BASED INTONATION AND PROSODY TO SPEECH IN TEXT-TO-SPEECH SYSTEMS - A text-to-speech system that includes an arrangement for accepting text input, an arrangement for providing synthetic speech output, and an arrangement for imparting emotion-based features to synthetic speech output. The arrangement for imparting emotion-based features includes an arrangement for accepting instruction for imparting at least one emotion-based paradigm to synthetic speech output, as well as an arrangement for applying at least one emotion-based paradigm to synthetic speech output. | 11-20-2008 |
20080294442 | APPARATUS, METHOD AND SYSTEM - A method includes obtaining digital content comprising text content; obtaining at least one speech parameter associated with the digital content; and, using the at least one speech parameter as an input, generating a speech output corresponding to at least part of the text content. Corresponding apparatuses, systems and computer program products are also presented. | 11-27-2008 |
20080294443 | APPLICATION OF EMOTION-BASED INTONATION AND PROSODY TO SPEECH IN TEXT-TO-SPEECH SYSTEMS - A text-to-speech system that includes an arrangement for accepting text input, an arrangement for providing synthetic speech output, and an arrangement for imparting emotion-based features to synthetic speech output. The arrangement for imparting emotion-based features includes an arrangement for accepting instruction for imparting at least one emotion-based paradigm to synthetic speech output, as well as an arrangement for applying at least one emotion-based paradigm to synthetic speech output. | 11-27-2008 |
20080300882 | Methods and Apparatus for Conveying Synthetic Speech Style from a Text-to-Speech System - A technique for producing speech output in a text-to-speech system is provided. A message is created for communication to a user in a natural language generator of the text-to-speech system. The message is annotated in the natural language generator with a synthetic speech output style. The message is conveyed to the user through a speech synthesis system in communication with the natural language generator, wherein the message is conveyed in accordance with the synthetic speech output style. | 12-04-2008 |
20080312929 | USING FINITE STATE GRAMMARS TO VARY OUTPUT GENERATED BY A TEXT-TO-SPEECH SYSTEM - The present invention discloses a text-to-speech system that provides output variability. The system can include a finite state grammar, a variability engine and a text-to-speech engine. The finite state grammar can contain a phrase rule consisting of one or more phrase elements. The phrase rule can deterministically generate a variable text phrase based upon at least one random number. The phrase rule can include a definition for each of the phrase elements. Each definition can be associated with at least one defined text string. The variability engine can construct a random text phrase responsive to receiving an action command, wherein said finite state grammar is used to create the text phrase. The variability engine can also rely on user-specified weights to adjust the output probabilities. The text-to-speech engine can convert the text phrase generated by the variability engine into speech output. | 12-18-2008 |
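The weighted grammar expansion described in the abstract above can be sketched as a recursive expansion of rules, with user-specified weights biasing which alternative is picked. The grammar encoding (symbols mapping to weighted tuples of sub-symbols) and the example rule names are assumptions for illustration only.

```python
import random

def generate_phrase(grammar, symbol, rng=random):
    """Expand `symbol` into a text phrase by recursively picking weighted alternatives.

    grammar: {symbol: [(expansion_tuple, weight), ...]}.
    Symbols absent from the grammar are terminals (plain text strings).
    """
    if symbol not in grammar:
        return symbol  # terminal: emit the string as-is
    expansions, weights = zip(*grammar[symbol])
    chosen = rng.choices(expansions, weights=weights, k=1)[0]
    return " ".join(generate_phrase(grammar, s, rng) for s in chosen)
```

With hypothetical rules such as `{"GREETING": [(("HELLO", "NAME"), 1.0)], "HELLO": [(("Hello",), 3.0), (("Hi",), 1.0)], "NAME": [(("there",), 1.0)]}`, the weights make "Hello there" three times as likely as "Hi there", while a seeded random source makes the choice reproducible (matching the abstract's "deterministically generate ... based upon at least one random number").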
20080312930 | METHOD AND SYSTEM FOR ALIGNING NATURAL AND SYNTHETIC VIDEO TO SPEECH SYNTHESIS - According to MPEG- | 12-18-2008 |
20080312931 | SPEECH SYNTHESIS METHOD, SPEECH SYNTHESIS SYSTEM, AND SPEECH SYNTHESIS PROGRAM - A speech synthesis system stores a group of speech units in a memory and selects a plurality of speech units from the group based on prosodic information of target speech, the selected speech units corresponding to each of the segments obtained by segmenting a phoneme string of the target speech and minimizing, relative to the target speech, the distortion of synthetic speech generated from the selected speech units. The system generates a new speech unit corresponding to each of the segments by fusing the selected speech units, to obtain a plurality of new speech units corresponding to the segments respectively, and generates synthetic speech by concatenating the new speech units. | 12-18-2008 |
20080319753 | TECHNIQUE FOR TRAINING A PHONETIC DECISION TREE WITH LIMITED PHONETIC EXCEPTIONAL TERMS - The present invention discloses a method for training an exception-limited phonetic decision tree. An initial subset of data can be selected and used for creating an initial phonetic decision tree. Additional terms can then be incorporated into the subset. The enlarged subset can be used to evaluate the phonetic decision tree, with the results being categorized as either correctly or incorrectly phonetized. An exception-limited phonetic tree can be generated from the set of correctly phonetized terms. If the termination conditions for the method have not yet been met, the steps of the method can be repeated. | 12-25-2008 |
20080319754 | Text-to-speech apparatus - According to an aspect of an embodiment, an apparatus for converting text data into a sound signal comprises: a phoneme determiner for determining phoneme data corresponding to a plurality of phonemes and pause data corresponding to a plurality of pauses to be inserted among a series of phonemes in the text data to be converted into the sound signal; a phoneme length adjuster for modifying the phoneme data and the pause data by determining lengths of the phonemes, respectively, in accordance with a speed of the sound signal and selectively adjusting the length of at least one of the phonemes which is a fricative in the text data so that the at least one fricative phoneme is relatively extended timewise as compared to other phonemes; and an output unit for outputting the sound signal on the basis of the phoneme data and pause data adjusted by the phoneme length adjuster. | 12-25-2008 |
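The speed-dependent length adjustment described in the abstract above can be sketched as: scale every phoneme's base length by the speech rate, then extend fricatives relative to other phonemes. The fricative set, the base length, and the 1.2 extension factor below are assumptions, not values from the patent.

```python
FRICATIVES = {"f", "v", "s", "z", "sh", "zh", "th", "h"}  # assumed inventory

def adjust_lengths(phonemes, base_ms=80, speed=1.0, fricative_boost=1.2):
    """Return per-phoneme lengths in milliseconds for a given speech speed.

    Faster speech (speed > 1) shortens every phoneme, but fricatives are
    relatively extended so they remain intelligible.
    """
    out = []
    for p in phonemes:
        length = base_ms / speed  # faster speech -> shorter phonemes overall
        if p in FRICATIVES:
            length *= fricative_boost  # fricatives extended relative to others
        out.append(length)
    return out
```

At double speed, a fricative keeps 20% more of its duration than a vowel in this sketch, which is the relative extension the abstract describes.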
20090006096 | Voice persona service for embedding text-to-speech features into software programs - Described is a voice persona service by which users convert text into speech waveforms, based on user-provided parameters and voice data from a service data store. The service may be remotely accessed, such as via the Internet. The user may provide text tagged with parameters, with the text sent to a text-to-speech engine along with base or custom voice data, and the resulting waveform morphed based on the tags. The user may also provide speech. Once created, a voice persona corresponding to the speech waveform may be persisted, exchanged, made public, shared and so forth. In one example, the voice persona service receives user input and parameters, and retrieves a base or custom voice that may be edited by the user via a morphing algorithm. The service outputs a waveform, such as a .wav file for embedding in a software program, and persists the voice persona corresponding to that waveform. | 01-01-2009 |
20090006097 | Pronunciation correction of text-to-speech systems between different spoken languages - Pronunciation correction for text-to-speech (TTS) systems and speech recognition (SR) systems between different languages is provided. If a word requiring pronunciation by a target language TTS or SR is from a same language as the target language, but is not found in a lexicon of words from the target language, a letter-to-speech (LTS) rules set of the target language is used to generate a letter-to-speech output for the word for use by the TTS or SR configured according to the target language. If the word is from a different language as the target language, phonemes comprising the word according to its native language are mapped to phonemes of the target language. The phoneme mapping is used by the TTS or SR configured according to the target language for generating or recognizing an audible form of the word according to the target language. | 01-01-2009 |
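The cross-language fallback in the abstract above can be sketched as a two-branch lookup: use the target-language lexicon when the word is known, otherwise map the word's native-language phonemes to target-language phonemes. The lexicon shapes, the example words, and the mapping table are illustrative assumptions (a real system would also need the LTS-rules branch for same-language out-of-lexicon words).

```python
def pronounce(word, target_lexicon, native_phonemes, phoneme_map):
    """Return a target-language phoneme sequence for `word`.

    target_lexicon: {word: [target phonemes]} for known target-language words.
    native_phonemes: {word: [native phonemes]} for foreign words.
    phoneme_map: native phoneme -> closest target-language phoneme.
    """
    if word in target_lexicon:
        return target_lexicon[word]  # known word: lexicon wins
    # foreign word: map each native phoneme; pass through ones shared by both languages
    return [phoneme_map.get(p, p) for p in native_phonemes[word]]
```

Phonemes absent from the mapping are passed through unchanged here, on the assumption that they exist in both languages; a fuller mapping would substitute the acoustically nearest target phoneme instead.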
20090006098 | Text-to-speech apparatus - According to an aspect of an embodiment, an apparatus for converting text data into a sound signal comprises: a phoneme determiner for determining phoneme data corresponding to a plurality of phonemes and pause data corresponding to a plurality of pauses to be inserted among a series of phonemes in the text data to be converted into the sound signal; a phoneme length adjuster for modifying the phoneme data and the pause data by determining lengths of the phonemes, respectively, in accordance with a speed of the sound signal and selectively reducing the length of at least one of the pauses in the text data to a pause length which is less than the pause length corresponding to the speed of the sound signal; and an output unit for outputting the sound signal on the basis of the phoneme data and pause data adjusted by the phoneme length adjuster. | 01-01-2009 |
20090012793 | TEXT-TO-SPEECH ASSIST FOR PORTABLE COMMUNICATION DEVICES - The present invention provides a text-to-speech assist for portable communication devices. A method for communicating text data using a portable communication device in accordance with the present invention includes: displaying text data on a display of the portable communication device while communicating with a party; selecting at least a portion of the displayed text data; converting the selected text data into synthesized speech; and providing the synthesized speech to the party using the portable communication device. | 01-08-2009 |
20090018836 | SPEECH SYNTHESIS SYSTEM AND SPEECH SYNTHESIS METHOD - In a speech synthesis, a selecting unit selects one string from first speech unit strings corresponding to a first segment sequence obtained by dividing a phoneme string corresponding to target speech into segments. The selecting unit repeatedly generates, based on at most W second speech unit strings corresponding to a second segment sequence that is a partial sequence of the first sequence, third speech unit strings corresponding to a third segment sequence obtained by adding a segment to the second sequence, and selects at most W strings from the third strings based on an evaluation value of each of the third strings. The evaluation value is obtained by correcting the total cost of each third string candidate with a penalty coefficient for each of the third strings. The coefficient is based on a restriction concerning quickness of speech unit data acquisition, and depends on the extent to which the restriction is approached. | 01-15-2009 |
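The selection procedure in the abstract above is a beam search: extend the at-most-W surviving unit strings segment by segment, rank extensions by a penalty-corrected score, and keep the best W. This is a generic beam-search sketch under assumptions of my own; in particular, applying the penalty as a multiplicative coefficient on the total cost is a loose reading of the abstract, not the patented formula.

```python
def beam_select(columns, step_cost, W=2, penalty=lambda path: 1.0):
    """Beam search over candidate columns, pruning to W strings per step.

    columns: list of candidate lists, one per added segment.
    step_cost: function(path_so_far, candidate) -> incremental cost.
    penalty: function(path) -> coefficient correcting the total cost.
    Returns the (path, total_cost) pair with the lowest final cost.
    """
    beams = [([], 0.0)]  # (unit string so far, accumulated cost)
    for cands in columns:
        extended = [
            (path + [c], cost + step_cost(path, c))
            for path, cost in beams
            for c in cands
        ]
        # rank by the penalty-corrected evaluation value, keep the best W
        extended.sort(key=lambda pc: pc[1] * penalty(pc[0]))
        beams = extended[:W]
    return min(beams, key=lambda pc: pc[1])
```

With W=1 this degenerates to greedy selection; larger W trades computation for a better chance of finding the globally cheapest unit string.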
20090018837 | SPEECH PROCESSING APPARATUS AND METHOD - A speech processing apparatus which can play back a sentence using recorded-speech-playback or text-to-speech is provided. It is determined whether each of a plurality of words or phrases constituting a sentence is a word or phrase to be played back by recorded-speech-playback or a word or phrase to be played back by text-to-speech. When each of the plurality of words or phrases is to be played back in a first sequence using the determined synthesis method, it is selected whether to play back each of the plurality of words or phrases in the first sequence or in a sequence different from the first sequence, based on the number of times playback alternates between recorded-speech-playback and text-to-speech. Each of the plurality of words or phrases is then played back in the selected sequence using the selected synthesis method. | 01-15-2009 |
20090018838 | Media interface - Provided are a user interface for processing digital data, a method for processing a media interface, and a recording medium thereof. The user interface is used for converting a selected script into voice to generate digital data having a form of a voice file corresponding to the script, or for managing the generated digital data. In the method, the user interface is displayed. The user interface includes at least a text window on which a script to be converted into voice is written, and an icon to be selected for converting the script written on the text window into voice. | 01-15-2009 |
20090018839 | Personal Virtual Assistant - A computer-based virtual assistant includes a virtual assistant application running on a computer capable of receiving human voice communications from a user of a remote user interface and transmitting a vocalization to the remote user interface, the virtual assistant application enabling the user to access email and voicemail messages of the user, the virtual assistant application selecting a responsive action to a verbal query or instruction received from the remote user interface and transmitting a vocalization characterizing the selected responsive action to the remote user interface, and the virtual assistant waiting a predetermined period of time, and if no canceling indication is received from the remote user interface, proceeding to perform the selected responsive action, and if a canceling indication is received from the remote user interface halting the selected responsive action and transmitting a new vocalization to the remote user interface. Also a method of using the virtual assistant. | 01-15-2009 |
20090024393 | Speech synthesizer and speech synthesis system - A speech synthesizer conducts a dialogue among a plurality of synthesized speakers, including a self speaker and one or more partner speakers, by use of a voice profile table describing emotional characteristics of synthesized voices, a speaker database storing feature data for different types of speakers and/or different speaking tones, a speech synthesis engine that synthesizes speech from input text according to feature data fitting the voice profile assigned to each synthesized speaker, and a profile manager that updates the voice profiles according to the content of the spoken text. The voice profiles of partner speakers are initially derived from the voice profile of the self speaker. A synthesized dialogue can be set up simply by selecting the voice profile of the self speaker. | 01-22-2009 |
20090037178 | ANSWER AN INCOMING VOICE CALL WITHOUT REQUIRING A USER TO SPEAK - A system comprises a wireless transceiver and logic coupled to the wireless transceiver. The logic is adapted to answer a phone call from a calling party with an automated voice message and then, in the same phone call, to enable a user to have a two-way conversation with the calling party without requiring the user to speak. | 02-05-2009 |
20090037179 | Method and Apparatus for Automatically Converting Voice - The invention proposes a method and apparatus for significantly improving the quality of voice morphing and guaranteeing the similarity of converted voice. The invention sets several standard speakers in a TTS database, and selects the voices of different standard speakers for speech synthesis according to different roles, wherein the voice of the selected standard speaker is similar to the original role to a certain extent. Then the invention further performs voice morphing on the standard voice similar to the original voice to a certain extent, in order to accurately mimic the voice of the original speaker, so as to make the converted voice closer to the original voice features while guaranteeing the similarity. | 02-05-2009 |
20090043583 | DYNAMIC MODIFICATION OF VOICE SELECTION BASED ON USER SPECIFIC FACTORS - The present invention discloses a solution for customizing synthetic voice characteristics in a user specific fashion. The solution can establish a communication between a user and a voice response system. A data store can be searched for a speech profile associated with the user. When a speech profile is found, a set of speech output characteristics established for the user from the profile can be determined. Parameters and settings of a text-to-speech engine can be adjusted in accordance with the determined set of speech output characteristics. During the established communication, synthetic speech can be generated using the adjusted text-to-speech engine. Thus, each detected user can hear a synthetic speech generated by a different voice specifically selected for that user. When no user profile is detected, a default voice or a voice based upon a user's speech or communication details can be used. | 02-12-2009 |
20090043584 | System and method for phonetic representation - A method for generating an Approximate Phonetic Representation (APR) of a given word, the word having a sequence of characters, the method comprising: receiving the word; generating the APR by applying at least one metaphone3 translation rule to encode one or more of the characters of the given word into a resulting APR; and returning the generated APR and/or one or more words matching the APR from a dictionary of words. | 02-12-2009 |
20090048840 | DEVICE FOR CONVERTING INSTANT MESSAGE INTO AUDIO OR VISUAL RESPONSE - The conversion device is connected, by a wired or wireless means, to an input/output port of a computing device installed with an instant messaging software. As instant messages are exchanged, the conversion device is activated by the instant messaging software to produce audio and/or visual responses in accordance with specific texts, symbols, and graphical images contained in the messages received. The conversion device could have an appealing appearance such as a doll, a puppet, or a toy figure. The conversion device can further contain at least an actuation mechanism such that, when activated, the conversion device sends a specific signal to the instant messaging software which encodes and packages the signal into a message and delivers the message to a remote computing device. | 02-19-2009 |
20090048841 | Synthesis by Generation and Concatenation of Multi-Form Segments - A speech synthesis system and method is described. A speech segment database references speech segments having various different speech representational structures. A speech segment selector selects from the speech segment database a sequence of speech segment candidates corresponding to a target text. A speech segment sequencer generates from the speech segment candidates sequenced speech segments corresponding to the target text. A speech segment synthesizer combines the selected sequenced speech segments to produce a synthesized speech signal output corresponding to the target text. | 02-19-2009 |
20090048842 | Generalized Object Recognition for Portable Reading Machine - Techniques for operating a reading machine are disclosed. The techniques include forming an N-dimensional features vector based on features of an image, the features corresponding to characteristics of at least one object depicted in the image, representing the features vector as a point in n-dimensional space, where n corresponds to N, the number of features in the features vector and comparing the point in n-dimensional space to a centroid that represents a cluster of points in the n-dimensional space corresponding to a class of objects to determine whether the point belongs in the class of objects corresponding to the centroid. | 02-19-2009 |
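The class membership test in the abstract above reduces to a nearest-centroid check: treat the N features as a point in n-dimensional space and ask whether it lies close enough to a class centroid. The Euclidean distance metric and the threshold-based membership rule below are plausible readings of the abstract, not details confirmed by the patent.

```python
import math

def distance(p, q):
    """Euclidean distance between two points in n-dimensional space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def belongs_to_class(point, centroid, threshold):
    """True if the feature point is within `threshold` of the class centroid."""
    return distance(point, centroid) <= threshold
```

Classifying against several object classes would then mean comparing the point to each class's centroid and taking the nearest one within its threshold.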
20090048843 | SYSTEM-EFFECTED TEXT ANNOTATION FOR EXPRESSIVE PROSODY IN SPEECH SYNTHESIS AND RECOGNITION - The inventive system can automatically annotate the relationship of text and acoustic units for the purposes of: (a) predicting how the text is to be pronounced as expressively synthesized speech, and (b) improving the proportion of expressively uttered speech as correctly identified text representing the speaker's message. The system can automatically annotate text corpora for relationships of uttered speech for a particular speaking style and for acoustic units in terms of context and content of the text to the utterances. The inventive system can use kinesthetically defined expressive speech production phonetics that are recognizable and controllable according to kinesensic feedback principles. In speech synthesis embodiments of the invention, the text annotations can specify how the text is to be expressively pronounced as synthesized speech. Also, acoustically-identifying features for dialects or mispronunciations can be identified so as to expressively synthesize alternative dialects or stylistic mispronunciations for a speaker from a given text. In speech recognition embodiments of the invention, each text annotation can be uniquely identified from the corresponding acoustic features of a unit of uttered speech to correctly identify the corresponding text. By employing a method of rules-based text annotation, the invention enables expressiveness to be altered to reflect syntactic, semantic, and/or discourse circumstances found in text to be synthesized or in an uttered message. | 02-19-2009 |
20090055186 | METHOD TO VOICE ID TAG CONTENT TO EASE READING FOR VISUALLY IMPAIRED - A method for providing information to generate distinguishing voices for text content attributable to different authors includes receiving a plurality of text sections each attributable to one of a plurality of authors; identifying which author authored each text section; assigning a unique voice tag id to each author; associating a distinct set of descriptive metadata with each unique voice tag id; and generating a set of speech information for each text section. The set of speech information generated for each text section is based upon the distinct set of descriptive metadata associated with the unique voice tag id assigned to the corresponding author of the text section. The set of speech information generated for each text section is configured to be used by a speech synthesizer to translate the text section into speech in a distinguishing computer-generated voice for the author of the text section. | 02-26-2009 |
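The per-author voice tagging in the abstract above can be sketched as: assign each newly seen author the next unused voice tag id, attach a distinct set of descriptive metadata to that id, and emit per-section speech information. The metadata fields (pitch presets) and the round-robin preset assignment are assumptions for illustration.

```python
def assign_voice_tags(sections, voice_presets):
    """Annotate text sections with per-author voice tags and metadata.

    sections: list of (author, text) pairs.
    voice_presets: list of metadata dicts; one is associated with each tag id.
    Returns (tag_by_author, annotated_sections).
    """
    tag_by_author = {}
    annotated = []
    for author, text in sections:
        if author not in tag_by_author:
            # hand out the next unused unique voice tag id
            tag_by_author[author] = len(tag_by_author)
        tag = tag_by_author[author]
        # associate a distinct metadata set with this tag (round-robin if short)
        meta = voice_presets[tag % len(voice_presets)]
        annotated.append({"text": text, "voice_tag": tag, "metadata": meta})
    return tag_by_author, annotated
```

A downstream speech synthesizer would then render each section with the voice described by its metadata, so every author is heard in a consistent, distinguishable voice.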
20090055187 | Conversion of text email or SMS message to speech spoken by animated avatar for hands-free reception of email and SMS messages while driving a vehicle - Subscribers can access and listen to their email hands-free while they drive. In further accord with the present invention, a selectable avatar speaks the email message. The invention also provides unified messaging such that SMS and email messages are unified, presented, and spoken by the avatar, so the subscriber need not access two devices (an instant message device and an email device). Additionally, the invention can convert natural language in a message to an acronym to be spoken by the avatar, and can convert acronyms in a message to natural language spoken by the avatar; the subscriber selects the desired one of these two. | 02-26-2009 |
20090055188 | PITCH PATTERN GENERATION METHOD AND APPARATUS THEREOF - The prosody control unit pattern generation module generates pitch patterns in respective prosody control units based on language attribute information, the phoneme duration and emphasis degree information, the modification method decision module decides a modification method by smoothing processing with respect to the pitch pattern in a connection portion between the prosody control unit and at least one of previous and next prosody control units based on at least emphasis degree information to generate modification method information, and the pattern connection module modifies pitch patterns generated in respective prosody control units by smoothing processing according to the modification method information and connects them to generate a sentence pitch pattern corresponding to a text to be a target for speech synthesis. | 02-26-2009 |
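The connection-portion smoothing described above can be illustrated with a toy overlap cross-fade between two frame-based pitch patterns. The linear weighting is a simplifying assumption; in the patent, the modification method is decided from emphasis degree information, which in this sketch would only influence the smoothing length:

```python
def connect_pitch_patterns(left, right, smooth_len=3):
    """Concatenate two pitch patterns (one Hz value per frame), smoothing
    the connection portion by linearly cross-fading the overlapping
    `smooth_len` frames of the left tail and the right head."""
    if smooth_len == 0 or not left or not right:
        return left + right
    n = min(smooth_len, len(left), len(right))
    out = left[:len(left) - n]
    for i in range(n):
        w = (i + 1) / (n + 1)  # weight shifts from the left pattern to the right
        out.append((1 - w) * left[len(left) - n + i] + w * right[i])
    out.extend(right[n:])
    return out
```

A sentence pitch pattern would then be built by folding this connection over the sequence of per-prosody-control-unit patterns.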
20090063152 | AUDIO REPRODUCING METHOD, CHARACTER CODE USING DEVICE, DISTRIBUTION SERVICE SYSTEM, AND CHARACTER CODE MANAGEMENT METHOD - A character code is associated with sound as well as character or sign so as to enhance expressiveness on the Internet or in electronic mail. Sound data is recorded in the character code using device in association with the character code. The user can reproduce an intended sound in the same way as he or she displays a character on the character code using device, whereby the user can enhance his or her expressiveness on the Internet or in electronic mail, for example. | 03-05-2009 |
20090063153 | SYSTEM AND METHOD FOR BLENDING SYNTHETIC VOICES - A system and method for generating a synthetic text-to-speech TTS voice are disclosed. A user is presented with at least one TTS voice and at least one voice characteristic. A new synthetic TTS voice is generated by blending a plurality of existing TTS voices according to the selected voice characteristics. The blending of voices involves interpolating segmented parameters of each TTS voice. Segmented parameters may be, for example, prosodic characteristics of the speech such as pitch, volume, phone durations, accents, stress, mis-pronunciations and emotion. | 03-05-2009 |
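The interpolation of segmented parameters described above reduces, in the simplest case, to a weighted average of each prosodic parameter of the two voices. The parameter names and units below are illustrative assumptions:

```python
def blend_voices(voice_a, voice_b, weight):
    """Blend two TTS voices by linearly interpolating their segmented
    prosodic parameters. weight=0 yields voice_a, weight=1 yields voice_b."""
    return {
        param: (1 - weight) * voice_a[param] + weight * voice_b[param]
        for param in voice_a
    }

# Hypothetical per-voice prosodic parameters (pitch in Hz, volume 0-1, duration in ms).
a = {"pitch": 120.0, "volume": 0.8, "duration": 90.0}
b = {"pitch": 220.0, "volume": 0.6, "duration": 110.0}
mix = blend_voices(a, b, 0.5)
```

In a full system the interpolation would run per segment rather than globally, so characteristics such as accents and stress can be blended locally.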
20090063154 | EMOTIVE TEXT-TO-SPEECH SYSTEM AND METHOD - Information about a device may be emotively conveyed to a user of the device. Input indicative of an operating state of the device may be received. The input may be transformed into data representing a simulated emotional state. Data representing an avatar that expresses the simulated emotional state may be generated and displayed. A query from the user regarding the simulated emotional state expressed by the avatar may be received. The query may be responded to. | 03-05-2009 |
20090070114 | AUDIBLE METADATA - This disclosure describes systems and methods for audibly presenting metadata. Audibly presentable metadata is referred to as audible metadata. Audible metadata may be associated with one or more media objects. In one embodiment, audible metadata is pre-recorded requiring little or no processing before it can be rendered. In another embodiment, audible metadata is text, and a text-to-speech conversion device may be used to convert the text into renderable audible metadata. Audible metadata may be rendered at any point before or after rendering of a media object, or may be rendered during rendering of a media object via a dynamic user request. | 03-12-2009 |
20090070115 | SPEECH SYNTHESIS SYSTEM, SPEECH SYNTHESIS PROGRAM PRODUCT, AND SPEECH SYNTHESIS METHOD - It is an objective of the present invention to provide waveform concatenation speech synthesis with high sound quality utilizing its advantages in the case where there is a large quantity of speech segments while providing waveform concatenation speech synthesis with accurate accents in other cases. Prosody with both high accuracy and high sound quality is achieved by performing a two-path search including a speech segment search and a prosody modification value search. In the preferred embodiment of the present invention, an accurate accent is secured by evaluating the consistency of the prosody by using a statistical model of prosody variations (the slope of fundamental frequency) for both of two paths of the speech segment selection and the modification value search. In the prosody modification value search, a prosody modification value sequence that minimizes a modified prosody cost is searched for. This allows a search for a modification value sequence that can increase the likelihood of absolute values or variations of the prosody to the statistical model as high as possible with minimum modification values. | 03-12-2009 |
20090070116 | FUNDAMENTAL FREQUENCY PATTERN GENERATION APPARATUS AND FUNDAMENTAL FREQUENCY PATTERN GENERATION METHOD - A fundamental frequency pattern generation apparatus includes a first storage including representative vectors each corresponding to a prosodic control unit and having a section for changing the number of phonemes, a second storage unit including a rule to select a vector corresponding to an input context, a selection unit configured to select a vector from the representative vectors by applying the rule to the context and output the selected vector, a calculation unit configured to calculate an expansion/contraction ratio of the section of the selected vector in a time-axis direction based on a designated value for a specific feature amount related to a length of a fundamental frequency pattern to be generated, the designated value of the feature amount being required of the fundamental frequency pattern to be generated, and an expansion/contraction unit configured to expand/contract the selected vector based on the expansion/contraction ratio to generate the fundamental frequency pattern. | 03-12-2009 |
20090076819 | Text to speech synthesis - An input linguistic description is converted into a speech waveform by deriving at least one target unit sequence corresponding to the linguistic description, selecting from a waveform unit database for the target unit sequences a plurality of alternative unit sequences approximating the target unit sequences, concatenating the alternative unit sequences to alternative speech waveforms and presenting the alternative speech waveforms to an operating person and enabling the choice of one of the presented alternative speech waveforms. There are no iterative cycles of manual modification and automatic selection, which enables a fast way of working. The operator does not need knowledge of units, targets, and costs, but chooses from a set of given alternatives. The fine-tuning of TTS prompts therefore becomes accessible to non-experts. | 03-19-2009 |
20090076820 | METHOD AND APPARATUS FOR TAGTOE REMINDERS - A network-based text-to-speech (TTS) TagToe alert system is configured to take a user's textual and/or multimedia input to a TagToe user interface to schedule delivery of text-to-speech-converted TagToe information to one or more telephone call recipients. The text-to-speech converted TagToe information optionally includes e-commerce specific, location-specific, and/or product-specific information which may be presented to the one or more call recipients as additional voice information or interactive voice response (IVR) information. The TagToe alert system can be configured to provide an advanced level of integration between IP telephony and electronic transactions and online services for optimized efficiency and improved revenue to e-commerce. | 03-19-2009 |
20090076821 | METHOD AND APPARATUS TO CONTROL OPERATION OF A PLAYBACK DEVICE - Media metadata is accessible for a plurality of media items (See FIG. | 03-19-2009 |
20090083035 | TEXT PRE-PROCESSING FOR TEXT-TO-SPEECH GENERATION - A system and method are provided for improved speech synthesis, wherein text data is pre-processed according to updated grammar rules or a selected group of grammar rules. In one embodiment, the TTS system comprises a first memory adapted to store a text information database, a second memory adapted to store grammar rules, and a receiver adapted to receive update data regarding the grammar rules. The system also includes a TTS engine adapted to retrieve at least one text entry from the text information database, pre-process the at least one text entry by applying the updated grammar rules to the at least one text entry, and generate speech based at least in part on the at least one pre-processed text entry. | 03-26-2009 |
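Grammar-rule pre-processing of this kind is commonly implemented as ordered pattern substitutions applied to the text before it reaches the synthesis engine. The rules below are hypothetical examples, not the patent's rule set:

```python
import re

# Hypothetical grammar rules: each maps a written form to its spoken form.
GRAMMAR_RULES = [
    (re.compile(r"\bDr\.\s"), "Doctor "),
    (re.compile(r"\bSt\.\s"), "Saint "),
    (re.compile(r"\$(\d+)\b"), r"\1 dollars"),
]

def preprocess(text, rules=GRAMMAR_RULES):
    """Apply each grammar rule in order, producing TTS-ready text."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text
```

Updating the grammar rules (as the patent's receiver does) then amounts to replacing or extending this rule list without touching the TTS engine itself.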
20090083036 | Unnatural prosody detection in speech synthesis - Described is a technology by which synthesized speech generated from text is evaluated against a prosody model (trained offline) to determine whether the speech will sound unnatural. If so, the speech is regenerated with modified data. The evaluation and regeneration may be iterative until deemed natural sounding. For example, text is built into a lattice that is then (e.g., Viterbi) searched to find a best path. The sections (e.g., units) of data on the path are evaluated via a prosody model. If the evaluation deems a section to correspond to unnatural prosody, that section is replaced, e.g., by modifying/pruning the lattice and re-performing the search. Replacement may be iterative until all sections pass the evaluation. Unnatural prosody detection may be biased such that during evaluation, unnatural prosody is falsely detected at a higher rate relative to a rate at which unnatural prosody is missed. | 03-26-2009 |
20090083037 | INTERACTIVE DEBUGGING AND TUNING OF METHODS FOR CTTS VOICE BUILDING - A method, a system, and an apparatus for identifying and correcting sources of problems in synthesized speech which is generated using a concatenative text-to-speech (CTTS) technique. The method can include the step of displaying a waveform corresponding to synthesized speech generated from concatenated phonetic units. The synthesized speech can be generated from text input received from a user. The method further can include the step of displaying parameters corresponding to at least one of the phonetic units. The method can include the step of displaying the original recordings containing selected phonetic units. An editing input can be received from the user and the parameters can be adjusted in accordance with the editing input. | 03-26-2009 |
20090089061 | Audio Reader Device - An audio reader device for reading printed infrared media includes a linear sensor device sensitive to infra-red. A processor is operatively connected to the sensor device and is configured to read and decode infra-red audio data on the media. A memory is operatively connected to the processor for storing the audio data. A sound processing integrated circuit and speaker arrangement is operatively connected to the memory for playback of the audio data. A roller arrangement feeds the media past the linear sensor device. | 04-02-2009 |
20090094034 | VOICE INFORMATION RECORDING APPARATUS - A link table is generated, voice information is associated by dot patterns, and then, voice information associated with the dot pattern is reproduced from a speaker when the dot pattern is read by means of a scanner. In this manner, the dot pattern is printed on a surface of a material such as a picture book or a card, making it possible to play back voice information corresponding to a pattern or a story of a picture book and to play back voice information corresponding to a character described on the card. In addition, by means of a link table, new voice information can be associated with, dissociated from, or changed to, a new dot pattern. | 04-09-2009 |
20090094035 | METHOD AND SYSTEM FOR PRESELECTION OF SUITABLE UNITS FOR CONCATENATIVE SPEECH - A system and method for improving the response time of text-to-speech synthesis utilizes “triphone contexts” (i.e., triplets comprising a central phoneme and its immediate context) as the basic unit, instead of performing phoneme-by-phoneme synthesis. The method comprises generating a triphone preselection cost database for use in speech synthesis, comprising 1) selecting a triphone sequence u | 04-09-2009 |
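A triphone preselection index of the sort described above can be sketched as a mapping from each (left, central, right) phoneme triplet to the unit candidates observed in that context, so synthesis only scores candidates whose context matches. The '#' edge marker is an assumption for the example:

```python
def triphones(phonemes):
    """Split a phoneme sequence into triphone contexts: each central
    phoneme with its immediate neighbours ('#' marks an utterance edge)."""
    padded = ["#"] + phonemes + ["#"]
    return [tuple(padded[i - 1:i + 2]) for i in range(1, len(padded) - 1)]

def build_preselection_db(utterances):
    """Index unit candidates by triphone context; each entry records the
    utterance id and phoneme position of a matching candidate."""
    db = {}
    for unit_id, phonemes in utterances:
        for pos, tri in enumerate(triphones(phonemes)):
            db.setdefault(tri, []).append((unit_id, pos))
    return db

db = build_preselection_db([("u1", ["h", "e", "l"]), ("u2", ["e", "l", "o"])])
```

A real system would also precompute preselection costs per entry; this sketch shows only the context-keyed lookup that makes preselection fast.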
20090099846 | METHOD AND APPARATUS FOR PREPARING A DOCUMENT TO BE READ BY TEXT-TO-SPEECH READER - There is disclosed a method and system for preparing a document to be read by a text-to-speech reader. The method can include identifying two or more voice types available to the text-to-speech reader, identifying the text elements within the document, grouping related text elements together, and classifying the text elements according to voice types available to the text-to-speech reader. The method of grouping the related text elements together can include syntactic and intelligent clustering. The classification of text elements can include performing latent semantic analysis on the text elements and characteristics of the available voice types. | 04-16-2009 |
20090112596 | SYSTEM AND METHOD FOR IMPROVING SYNTHESIZED SPEECH INTERACTIONS OF A SPOKEN DIALOG SYSTEM - A system and method are disclosed for synthesizing speech based on a selected speech act. A method includes modifying synthesized speech of a spoken dialog system by (1) receiving a user utterance, (2) analyzing the user utterance to determine an appropriate speech act, and (3) generating a response of a type associated with the appropriate speech act, wherein linguistic variables in the response are selected based on the appropriate speech act. | 04-30-2009 |
20090112597 | PREDICTING A RESULTANT ATTRIBUTE OF A TEXT FILE BEFORE IT HAS BEEN CONVERTED INTO AN AUDIO FILE - An apparatus for predicting a resultant attribute of a text file before it has been converted to an audio file by a text-to-speech converter application. In accordance with an embodiment, the apparatus includes: a receiver component for receiving a text file and a request to determine a resultant attribute of the text file before it is converted to an audio file, by a text-to-speech converter component; a calculation component for determining a file type associated with the received text file and the size of the received text file; a calculation component for identifying an attribute associated with the determined file type; and a calculation component for determining from the identified attribute and the size of the received text file a resultant attribute of the text file before it is converted to an audio file by the text-to-speech converter component. | 04-30-2009 |
20090119108 | AUDIO-BOOK PLAYBACK METHOD AND APPARATUS - An audio-book playback method includes buffering text data that is to be played back by speech, converting the buffered text data to speech data, performing speech-playback by using the speech data, and buffering next text data for continuous playback. The provided audio-book playback method and an apparatus enable a user to enjoy reading a book while also listening to content of the book being voiced by a multimedia playback device. Moreover, double buffering technology is employed to provide seamless text and speech-playback services. | 05-07-2009 |
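The double-buffering idea above can be sketched sequentially. In a real player the conversion of the next text buffer would run on a worker thread while the current speech buffer plays; the buffer hand-off logic, shown here, is the same either way:

```python
def playback(pages, convert):
    """Double-buffered audio-book playback: while one buffer of speech
    data is played, the next page of text is converted into the other
    buffer, so speech output is seamless across page boundaries."""
    log = []
    next_buf = convert(pages[0]) if pages else None  # prime the first buffer
    for i in range(len(pages)):
        current = next_buf
        # Buffer the next page before (in a real player: while) playing.
        next_buf = convert(pages[i + 1]) if i + 1 < len(pages) else None
        log.append(("play", current))
    return log

log = playback(["page one", "page two"], str.upper)
```

Here `convert` stands in for the text-to-speech step, and the returned log records the order in which buffers would reach the audio device.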
20090125309 | Methods, Systems, and Products for Synthesizing Speech - Methods, Systems, and Products are disclosed for synthesizing speech. Text is received for translation to speech. The text is correlated to phrases, and each phrase is converted into a corresponding string of phonemes. A phoneme identifier is retrieved that uniquely represents each phoneme in the string of phonemes. Each phoneme identifier is concatenated to produce a sequence of phoneme identifiers with each phoneme identifier separated by a comma. Each sequence of phoneme identifiers is concatenated and separated by a semi-colon. | 05-14-2009 |
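The comma/semicolon concatenation of phoneme identifiers described above is straightforward to sketch. The phoneme-to-identifier table below is hypothetical, since the patent does not publish one:

```python
# Hypothetical phoneme-to-identifier table.
PHONEME_IDS = {"HH": 17, "EH": 4, "L": 23, "OW": 31}

def encode(phrases):
    """Concatenate each phrase's phoneme identifiers with commas and
    separate the per-phrase sequences with semicolons."""
    return ";".join(
        ",".join(str(PHONEME_IDS[p]) for p in phrase) for phrase in phrases
    )
```

The resulting string compactly represents the phoneme content of the correlated phrases for downstream synthesis.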
20090138268 | DATA PROCESSING DEVICE AND COMPUTER-READABLE STORAGE MEDIUM STORING SET OF PROGRAM INSTRUCTIONS EXECUTABLE ON DATA PROCESSING DEVICE - A data processing device includes a displaying unit, a receiving unit, a determining unit, and a controlling unit. The displaying unit displays one of a first operation screen and a second operation screen. Input data is inputted into the receiving unit by a user. The determining unit determines, based on at least one of the input data and settings of an OS, which of the first operation screen and the second operation screen should be displayed on the displaying unit. The controlling unit controls the displaying unit to display the first operation screen if the determining unit determines that the first operation screen should be displayed on the displaying unit, and controls the displaying unit to display the second operation screen if the determining unit determines that the second operation screen should be displayed on the displaying unit. | 05-28-2009 |
20090144060 | System and Method for Generating a Web Podcast Service - Disclosed is a system and method for generating a web podcast interview that allows a single user to create his own multi-voice interview from his computer. The method allows the user to enter a set of questions from a text file using a text editor. (Answers may also be entered from a text file, although this is not the preferred embodiment.) For each question, the user may select one particular interviewer voice among a plurality of predefined interviewer voices, and by using a text-to-speech module in a text-to-speech server, each question is converted into an audio question having the selected interviewer voice. The user then preferably records answers to each audio question using a telephone, and a questions/answers sequence in a podcast-compliant format is generated. | 06-04-2009 |
20090150157 | SPEECH PROCESSING APPARATUS AND PROGRAM - A word dictionary including sets of a character string which constitutes a word, a phoneme sequence which constitutes pronunciation of the word and a part of speech of the word is referenced, an entered text is analyzed, the entered text is divided into one or more subtexts, a phoneme sequence and a part of speech sequence are generated for each subtext, the part of speech sequence of the subtext and a list of part of speech sequence are collated to determine whether the phonetic sound of the subtext is to be converted or not, and the phonetic sounds of the phoneme sequence in the subtext whose phonetic sounds are determined to be converted are converted. | 06-11-2009 |
20090157407 | Methods, Apparatuses, and Computer Program Products for Semantic Media Conversion From Source Files to Audio/Video Files - An apparatus for semantic media conversion from source data to audio/video data may include a processor. The processor may be configured to parse source data having text and one or more tags and create a semantic structure model representative of the source data, and generate audio data comprising at least one of speech converted from parsed text of the source data contained in the semantic structure model and applied audio effects. Corresponding methods and computer program products are also provided. | 06-18-2009 |
20090157408 | SPEECH SYNTHESIZING METHOD AND APPARATUS - The present invention relates to a speech synthesizing method and apparatus based on a hidden Markov model (HMM). Among code words that are obtained by quantizing speech parameter instances for each state of an HMM model, a code word closest to a speech parameter generated from an input text using a known method is searched. When the distance between the searched code word and the speech parameter generated by the known method is smaller than or equal to a threshold value, the searched code word is output as a final speech parameter. When the distance exceeds the threshold value, the speech parameter generated by the known method is output as the final speech parameter. The final speech parameter is processed to generate final synthesized speech for the input text. | 06-18-2009 |
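The threshold test between the searched code word and the baseline-generated parameter can be sketched in one dimension. Real systems compare vector-valued speech parameters, so the scalar distance here is a simplification:

```python
def select_parameter(generated, codebook, threshold):
    """Return the nearest code word if it lies within `threshold` of the
    parameter generated by the known method; otherwise fall back to the
    generated parameter itself."""
    nearest = min(codebook, key=lambda c: abs(c - generated))
    return nearest if abs(nearest - generated) <= threshold else generated
```

The fallback branch is what keeps synthesis robust when the quantized codebook has no sufficiently close entry for a given state.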
20090157409 | METHOD AND APPARATUS FOR TRAINING DIFFERENCE PROSODY ADAPTATION MODEL, METHOD AND APPARATUS FOR GENERATING DIFFERENCE PROSODY ADAPTATION MODEL, METHOD AND APPARATUS FOR PROSODY PREDICTION, METHOD AND APPARATUS FOR SPEECH SYNTHESIS - A method includes, generating, for each parameter of the prosody vector, an initial parameter prediction model with a plurality of attributes related to difference prosody prediction and at least part of attribute combinations of the plurality of attributes, in which each of the plurality of attributes and the attribute combinations is included as an item, calculating importance of each item in the parameter prediction model, deleting the item having the lowest importance calculated, re-generating a parameter prediction model with the remaining items, determining whether the re-generated parameter prediction model is an optimal model, and repeating the step of calculating importance and the steps following the step of calculating importance with the re-generated parameter prediction model, if the re-generated parameter prediction model is determined as not an optimal model, wherein the difference prosody vector and all parameter prediction models of the difference prosody vector constitute the difference prosody adaptation model. | 06-18-2009 |
20090177473 | APPLYING VOCAL CHARACTERISTICS FROM A TARGET SPEAKER TO A SOURCE SPEAKER FOR SYNTHETIC SPEECH - A computer implemented method, system and computer usable program code for synthesizing speech. A computer implemented method for synthesizing speech includes providing a database of speech of a source speaker, and providing a prosody model of speech of a target speaker different from the source speaker. Text input to be synthesized is received, and the prosody model of speech of the target speaker is applied to the text input to select segments of the speech of the source speaker in the database to form synthesized speech of the text input. The synthesized speech of the text input is then output. | 07-09-2009 |
20090177474 | SPEECH PROCESSING APPARATUS AND PROGRAM - A speech synthesizer includes a periodic component fusing unit and an aperiodic component fusing unit, which respectively fuse the periodic components and the aperiodic components of a plurality of speech units selected for each segment by a unit selector. The speech synthesizer is further provided with an adder, so that the adder adds, edits, and concatenates the periodic components and the aperiodic components of the fused speech units to generate a speech waveform. | 07-09-2009 |
20090177475 | SPEECH SYNTHESIS DEVICE, METHOD, AND PROGRAM - Even when a pitch cycle has a large fluctuation and the pitch cycle string changes abruptly, it is possible to suppress the effect of the pitch cycle fluctuation and generate high-quality synthesized speech. A speech synthesis device generates a synthesized speech corresponding to an input text sentence according to an original speech waveform stored in original speech waveform information storage unit ( | 07-09-2009 |
20090187407 | System and methods for reporting - The present invention relates to a system and methods for preparing reports, such as medical reports. The system and methods advantageously can verbalize information, using speech synthesis (text-to-speech), to support a dialogue between a user and the reporting system during the course of the preparation of the report in order that the user can avoid inefficient visual distractions. | 07-23-2009 |
20090187408 | SPEECH INFORMATION PROCESSING APPARATUS AND METHOD - A temporary child set is generated. An elastic ratio of an elastic section of a model pattern is calculated. A temporary typical pattern of the set is generated by combining the pattern belonging to the set with the model pattern having the elastic pattern expanded or contracted. A distortion between the temporary typical pattern of the set and the pattern belonging to the set is calculated, and a child set is determined as the set when the distortion is below a threshold. A typical pattern as the temporary typical pattern of the child set is stored with a classification rule as the classification item of the context of the pattern belonging to the child set. | 07-23-2009 |
20090198497 | METHOD AND APPARATUS FOR SPEECH SYNTHESIS OF TEXT MESSAGE - Provided is a method and apparatus for speech synthesis of a text message. The method includes receiving input of voice parameters for a text message, storing each of the text message and the input voice parameters in a data packet, and transmitting the data packet to a receiving terminal. | 08-06-2009 |
20090204401 | SPEECH PROCESSING SYSTEM, SPEECH PROCESSING METHOD, AND SPEECH PROCESSING PROGRAM - Provided is a speech translation system for receiving an input of the original speech in a first language, translating an input content into a second language, and outputting a result of the translating as a speech, including: an input processing part for receiving the input of the original speech, and generating, from the original speech, an original language text and the prosodic information of the original speech; a translation part for generating a translated sentence by translating the first language into the second language; prosodic feature transform information including associated prosodic information between the first language and the second language; a prosodic feature transform part for transforming the prosodic information of the original speech into prosodic information of the speech to be output; and a speech synthesis part for outputting the translated sentence as a speech synthesized based on the prosodic information of the speech to be output. | 08-13-2009 |
20090204402 | METHOD AND APPARATUS FOR CREATING CUSTOMIZED PODCASTS WITH MULTIPLE TEXT-TO-SPEECH VOICES - Method and apparatus for creating customized podcasts with multiple voices, where text content is converted into audio content, and where the voices are selected based at least in part on words in the text content suggestive of the type of voice. Types of voice include at least male and female, accent, language, and speed. | 08-13-2009 |
20090204403 | SPEECH GENERATING MEANS FOR USE WITH SIGNAL SENSORS - An apparatus includes receiving circuitry for receiving a signal; and a speech module for converting the signal into speech. | 08-13-2009 |
20090204404 | METHOD AND APPARATUS FOR CONTROLLING PLAY OF AN AUDIO SIGNAL - Apparatus and methods conforming to the present invention comprise a method of controlling playback of an audio signal through analysis of a corresponding closed caption signal in conjunction with analysis of the corresponding audio signal. Objectionable text or other specified text in the closed caption signal is identified through comparison with user-identified objectionable text. Upon identification of the objectionable text, the audio signal is analyzed to identify the audio portion corresponding to the objectionable text. Upon identification of the audio portion, the audio signal may be controlled to mute the audible objectionable text. | 08-13-2009 |
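Mapping blocklisted closed-caption words to mute intervals, as described above, can be sketched assuming each caption word already carries start/end timestamps. That is an assumption for the example; in practice the caption text usually has to be aligned against the audio to recover per-word times:

```python
def mute_intervals(caption_words, blocklist):
    """Given closed-caption words as (word, start, end) with times in
    seconds, return the intervals during which audio should be muted."""
    blocked = {w.lower() for w in blocklist}  # case-insensitive comparison
    return [(start, end) for word, start, end in caption_words
            if word.lower() in blocked]

captions = [("this", 0.0, 0.4), ("darn", 0.4, 0.8), ("show", 0.8, 1.2)]
```

The playback controller would then silence the output channel for each returned interval.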
20090216536 | IMAGE PROCESSING APPARATUS, IMAGE PROCESSING METHOD AND RECORDING MEDIUM - An image processing apparatus comprises an image data input portion that inputs image data and a text data input portion that inputs text data. The text data inputted by the text data input portion is converted into voice data by a voice data converter, and this obtained voice data and the image data inputted by the image data input portion are connected to each other by a connector, and then a file including the voice data and the image data connected to each other is created. | 08-27-2009 |
20090240501 | AUTOMATICALLY GENERATING NEW WORDS FOR LETTER-TO-SOUND CONVERSION - Described is a technology by which artificial words are generated based on seed words, and then used with a letter-to-sound conversion model. To generate an artificial word, a stressed syllable of a seed word is replaced with a different syllable, such as a candidate (artificial) syllable, when the phonemic structure and/or graphonemic structure of the stressed syllable and the candidate syllable match one another. In one aspect, the artificial words are provided for use with a letter-to-sound conversion model, which may be used to generate artificial phonemes from a source of words, such as in conjunction with other models. If the phonemes provided by the various models for a selected source word are in agreement relative to one another, the selected source word and an associated artificial phoneme may be added to a training set which may then be used to retrain the letter-to-sound conversion model. | 09-24-2009 |
20090248417 | SPEECH PROCESSING APPARATUS, METHOD, AND COMPUTER PROGRAM PRODUCT - A method to generate a pitch contour for speech synthesis is proposed. The method is based on finding the pitch contour that maximizes a total likelihood function created by the combination of all the statistical models of the pitch contour segments of an utterance, at one or multiple linguistic levels. These statistical models are trained from a database of spoken speech, by means of a decision tree that for each linguistic level clusters the parametric representation of the pitch segments extracted from the spoken speech data with some features obtained from the text associated with that speech data. The parameterization of the pitch segments is performed in such a way that the likelihood function of any linguistic level can be expressed in terms of the parameters of one of the levels, thus allowing the maximization to be calculated with respect to the parameters of that level. Moreover, the parameterization of that main level has to be invertible so that the final pitch contour is obtained from the parameters of that level by means of an inverse transformation. | 10-01-2009 |
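Under the simplifying assumption that each linguistic level contributes an independent one-dimensional Gaussian over the same pitch parameter, the total log-likelihood in the entry above is maximized by the precision-weighted mean of the model means. This toy analogue shows only that maximization step, not the decision-tree training or the inverse transformation:

```python
def max_likelihood_pitch(models):
    """Each model is a (mean, variance) Gaussian over a shared pitch
    parameter. The sum of Gaussian log-likelihoods is maximized at the
    precision-weighted mean of the individual means."""
    num = sum(mean / var for mean, var in models)
    den = sum(1.0 / var for _, var in models)
    return num / den
```

Levels with smaller variance (sharper models) pull the resulting parameter more strongly, which mirrors how the combined likelihood trades off evidence across linguistic levels.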
20090254345 | Intelligent Text-to-Speech Conversion - Techniques for improved text-to-speech processing are disclosed. The improved text-to-speech processing can convert text from an electronic document into an audio output that includes speech associated with the text as well as audio contextual cues. One aspect provides audio contextual cues to the listener when outputting speech (spoken text) pertaining to a document. The audio contextual cues can be based on an analysis of a document prior to a text-to-speech conversion. Another aspect can produce an audio summary for a file. The audio summary for a document can thereafter be presented to a user so that the user can hear a summary of the document without having to process the document to produce its spoken text via text-to-speech conversion. | 10-08-2009 |
20090254346 | AUTOMATED VOICE ENABLEMENT OF A WEB PAGE - Embodiments of the present invention provide a method, system and computer program product for the automated voice enablement of a Web page. In an embodiment of the invention, a method for voice enabling a Web page can include selecting an input field of a Web page for speech input, generating a speech grammar for the input field based upon terms in a core attribute of the input field, receiving speech input for the input field, posting the received speech input and the grammar to an automatic speech recognition (ASR) engine and inserting a textual equivalent to the speech input provided by the ASR engine into a document object model (DOM) for the Web page. | 10-08-2009 |
20090254347 | PROACTIVE COMPLETION OF INPUT FIELDS FOR AUTOMATED VOICE ENABLEMENT OF A WEB PAGE - Embodiments of the present invention provide a method and computer program product for the proactive completion of input fields for automated voice enablement of a Web page. In an embodiment of the invention, a method for proactively completing empty input fields for voice enabling a Web page can be provided. The method can include receiving speech input for an input field in a Web page and inserting a textual equivalent to the speech input into the input field in a Web page. The method further can include locating an empty input field remaining in the Web page and generating a speech grammar for the input field based upon permitted terms in a core attribute of the empty input field and prompting for speech input for the input field. Finally, the method can include posting the received speech input and the grammar to an automatic speech recognition (ASR) engine and inserting a textual equivalent to the speech input provided by the ASR engine into the empty input field. | 10-08-2009 |
20090254348 | FREE FORM INPUT FIELD SUPPORT FOR AUTOMATED VOICE ENABLEMENT OF A WEB PAGE - Embodiments of the present invention provide a method and computer program product for the automated voice enablement of a Web page with free form input field support. In an embodiment of the invention, a method for voice enabling a Web page with free form input field support can be provided. The method can include receiving speech input for an input field in a Web page, parsing a core attribute for the input field and identifying an external statistical language model (SLM) referenced by the core attribute of the input field, posting the received speech input and the SLM to an automatic speech recognition (ASR) engine, and inserting a textual equivalent to the speech input provided by the ASR engine in conjunction with the SLM into the input field. | 10-08-2009 |
20090254349 | SPEECH SYNTHESIZER - A speech synthesizer can execute speech content editing at high speed and generate speech content easily. The speech synthesizer includes a small speech element DB | 10-08-2009 |
20090259471 | DISTANCE METRICS FOR UNIVERSAL PATTERN PROCESSING TASKS - A universal pattern processing system receives input data and produces output patterns that are best associated with said data. The system uses input means receiving and processing input data, a universal pattern decoder means transforming models using the input data and associating output patterns with original models that are changed least during transforming, and output means outputting best associated patterns chosen by a pattern decoder means. | 10-15-2009 |
20090259472 | SYSTEM AND METHOD FOR ANSWERING A COMMUNICATION NOTIFICATION - Disclosed herein are systems, methods, and computer-readable media for answering a communication notification. The method for answering a communication notification comprises receiving a notification of communication from a user, converting information related to the notification to speech, outputting the information as speech to the user, and receiving from the user an instruction to accept or ignore the incoming communication associated with the notification. In one embodiment, information related to the notification comprises one or more of a telephone number, an area code, a geographic origin of the request, caller id, a voice message, address book information, a text message, an email, a subject line, an importance level, a photograph, a video clip, metadata, an IP address, or a domain name. Another embodiment involves a notification assigned an importance level and repeated attempts at notification if it is of high importance. | 10-15-2009 |
20090259473 | METHODS AND APPARATUS TO PRESENT A VIDEO PROGRAM TO A VISUALLY IMPAIRED PERSON - Methods and apparatus to present a video program to a visually impaired person are disclosed. An example method comprises receiving a video stream and an associated audio stream of a video program, detecting a portion of the video program that is not readily consumable by a visually impaired person, obtaining text associated with the portion of the video program, converting the text to a second audio stream, and combining the second audio stream with the associated audio stream. | 10-15-2009 |
20090265172 | INTEGRATED SYSTEM AND METHOD FOR MOBILE AUDIO PLAYBACK AND DICTATION - A method and system provide for a single-pass review and feedback of a document. During audio playback of the document to be reviewed, voice-activated recording of feedback and submission of feedback relative to the location in the original document are accomplished. This allows a fully integrated, single-pass review and feedback of documentation to occur. | 10-22-2009 |
20090271202 | SPEECH SYNTHESIS APPARATUS, SPEECH SYNTHESIS METHOD, SPEECH SYNTHESIS PROGRAM, PORTABLE INFORMATION TERMINAL, AND SPEECH SYNTHESIS SYSTEM - A speech synthesis apparatus includes a content selection unit that selects a text content item to be converted into speech; a related information selection unit that selects related information which can be at least converted into text and which is related to the text content item selected by the content selection unit; a data addition unit that converts the related information selected by the related information selection unit into text and adds text data of the text to text data of the text content item selected by the content selection unit; a text-to-speech conversion unit that converts the text data supplied from the data addition unit into a speech signal; and a speech output unit that outputs the speech signal supplied from the text-to-speech conversion unit. | 10-29-2009 |
20090299746 | METHOD AND SYSTEM FOR SPEECH SYNTHESIS - A method for performing speech synthesis to a textual content at a client. The method includes the steps of: performing speech synthesis to the textual content based on a current acoustical unit set S | 12-03-2009 |
20090306986 | Method and system for providing speech synthesis on user terminals over a communications network - Service architecture for providing to a user terminal of a communications network textual information and relative speech synthesis, the user terminal being provided with a speech synthesis engine and a basic database of speech waveforms, includes: a content server for downloading textual information requested by means of a browser application on the user terminal; a context manager for extracting context information from the textual information requested by the user terminal; a context selector for selecting an incremental database of speech waveforms associated with extracted context information and for downloading the incremental database into the user terminal; and a database manager on the user terminal for managing the composition of an enlarged database of speech waveforms for the speech synthesis engine including the basic and the incremental databases of speech waveforms. | 12-10-2009 |
20090306987 | SINGING SYNTHESIS PARAMETER DATA ESTIMATION SYSTEM - There is provided a singing synthesis parameter data estimation system that automatically estimates singing synthesis parameter data for automatically synthesizing a human-like singing voice from an audio signal of input singing voice. A pitch parameter estimating section | 12-10-2009 |
20090313020 | TEXT-TO-SPEECH USER INTERFACE CONTROL - A system and method includes detecting computer readable text associated with a device, detecting a starting point for a text-to-speech conversion of text, beginning the text-to-speech conversion upon detection of movement of a pointing device in a direction of text flow, and controlling a rate of the text-to-speech conversion based on a rate of movement of the pointing device in relation to the text to be converted. | 12-17-2009 |
20090313021 | METHODS AND SYSTEMS FOR SIGHT IMPAIRED WIRELESS CAPABILITY - A method for sending data to a sight impaired user, the method comprising, receiving data from a data resource, determining whether the data is compatible with a Symbian API, transcoding the data into a first format compatible with the Symbian API, determining whether the data is compatible with a TALKS filter, transcoding the data into a second format compatible with the TALKS filter, determining whether the data is usable by a sight impaired user, transcoding the data into a third format usable by a sight impaired user responsive to determining that the data is not usable by a sight impaired user, converting a data type definition associated with the data into a format compatible with a user profile, sending the received data to a user mobile device, wherein the mobile device is operative to convert the data into an audible output. | 12-17-2009 |
20090313022 | SYSTEM AND METHOD FOR AUDIBLY OUTPUTTING TEXT MESSAGES - A method and system for audibly outputting text messages includes: setting a vocalizing function for audibly outputting text messages, searching a character speech library for each character of a received text message, and acquiring pronunciation data of each character of the received text message. The method and the system further include vocalizing the pronunciation data of each character of the received text message, generating a voice message, and audibly outputting the generated voice message. | 12-17-2009 |
20090313023 | Multilingual text-to-speech system - The invention converts raw data in a base language (e.g. English) into conversational formatted messages in multiple languages. The process converts input data rows into related sequences to a set of prerecorded audio phrase files. The sequences reference both recorded phrases of input data components and user-created text phrases inserted before and after the input data. When the audio sequences are played in sequence, a coherent conversational message in the language of the caller results. An IVR server responding to a caller's menu selection uses the invention's output data to generate the coherent response. Two embodiments are presented: a simple embodiment that responds to messages, and a more complex embodiment that converts enterprise demographic and member-event data collected over a period into audio sentences played in response to a menu item selection by a caller in the caller's language. | 12-17-2009 |
20090319273 | AUDIO CONTENT GENERATION SYSTEM, INFORMATION EXCHANGING SYSTEM, PROGRAM, AUDIO CONTENT GENERATING METHOD, AND INFORMATION EXCHANGING METHOD - An audio content generation system is a system for generating audio contents including a voice synthesis unit | 12-24-2009 |
20090319274 | System and Method for Verifying Origin of Input Through Spoken Language Analysis - An audible-based electronic challenge system is used to control access to a computing resource by using a test to identify an origin of a voice. The test is based on analyzing a spoken utterance to determine if it was articulated by an unauthorized human or a text-to-speech (TTS) system. | 12-24-2009 |
20090326948 | Automated Generation of Audiobook with Multiple Voices and Sounds from Text - A method, system and computer-usable medium are disclosed for the transcoding of annotated text to speech and audio. Source text is parsed into spoken text passages and sound description passages. A speaker identity is determined for each spoken text passage and a sound element for each sound description passage. The speaker identities and sound elements are automatically referenced to a voice and sound effects schema. A voice effect is associated with each speaker identity and a sound effect with each sound element. Each spoken text passage is then annotated with the voice effect associated with its speaker identity and each sound description passage is annotated with the sound effect associated with its sound element. The resulting annotated spoken text and sound description passages are processed to generate output text operable to be transcoded to speech and audio. | 12-31-2009 |
20090326949 | SYSTEM AND METHOD FOR EXTRACTION OF META DATA FROM A DIGITAL MEDIA STORAGE DEVICE FOR MEDIA SELECTION IN A VEHICLE - A method is provided for extracting meta data from a digital media storage device in a vehicle over a communication link between a control module of the vehicle and the digital media storage device. The method includes establishing a communication link between control module of the vehicle and the digital media storage device, identifying a media file on the digital media storage device, and retrieving meta data from a media file, the meta data including a plurality of entries, wherein at least one of the plurality of entries includes text data. The method further includes identifying the text data in an entry of the media file and storing the plurality of entries in a memory. | 12-31-2009 |
20100004933 | VOICE DIRECTED SYSTEM AND METHOD CONFIGURED FOR ASSURED MESSAGING TO MULTIPLE RECIPIENTS - A communications system transmits messages via a wireless network to multiple users nearly simultaneously in real-time. Each user has a terminal that receives a message and plays the message for the user. The terminal may also wait for the user to verbally acknowledge the arrival of the message before continuing with its normally executing application. The sender of the message may track, for each intended recipient, the delivery of the message, the accessing of the message by the user, and the acknowledgement by the user that the message was understood. | 01-07-2010 |
20100010815 | FACILITATING TEXT-TO-SPEECH CONVERSION OF A DOMAIN NAME OR A NETWORK ADDRESS CONTAINING A DOMAIN NAME - To facilitate text-to-speech conversion of a username, a first or last name of a user associated with the username may be retrieved, and a pronunciation of the username may be determined based at least in part on whether the name forms at least part of the username. To facilitate text-to-speech conversion of a domain name having a top level domain and at least one other level domain, a pronunciation for the top level domain may be determined based at least in part upon whether the top level domain is one of a predetermined set of top level domains. Each other level domain may be searched for one or more recognized words therewithin, and a pronunciation of the other level domain may be determined based at least in part on an outcome of the search. The username and domain name may form part of a network address such as an email address, URL or URI. | 01-14-2010 |
20100010816 | FACILITATING TEXT-TO-SPEECH CONVERSION OF A USERNAME OR A NETWORK ADDRESS CONTAINING A USERNAME - To facilitate text-to-speech conversion of a username, a first or last name of a user associated with the username may be retrieved, and a pronunciation of the username may be determined based at least in part on whether the name forms at least part of the username. To facilitate text-to-speech conversion of a domain name having a top level domain and at least one other level domain, a pronunciation for the top level domain may be determined based at least in part upon whether the top level domain is one of a predetermined set of top level domains. Each other level domain may be searched for one or more recognized words therewithin, and a pronunciation of the other level domain may be determined based at least in part on an outcome of the search. The username and domain name may form part of a network address such as an email address, URL or URI. | 01-14-2010 |
20100030561 | ANNOTATING PHONEMES AND ACCENTS FOR TEXT-TO-SPEECH SYSTEM - A system that outputs phonemes and accents of texts. The system has a storage section storing a first corpus in which spellings, phonemes, and accents of a text input beforehand are recorded separately for individual segmentations of the words that are contained in the text. A text for which phonemes and accents are to be output is acquired and the first corpus is searched to retrieve at least one set of spellings that match the spellings in the text from among sets of contiguous spellings. Then, the combination of a phoneme and an accent that has a higher probability of occurrence in the first corpus than a predetermined reference probability is selected as the phonemes and accent of the text. | 02-04-2010 |
20100042410 | Training And Applying Prosody Models - Techniques for training and applying prosody models for speech synthesis are provided. A speech recognition engine processes audible speech to produce text annotated with prosody information. A prosody model is trained with this annotated text. After initial training, the model is applied during speech synthesis to generate speech with non-standard prosody from input text. Multiple prosody models can be used to represent different prosody styles. | 02-18-2010 |
20100057464 | SYSTEM AND METHOD FOR VARIABLE TEXT-TO-SPEECH WITH MINIMIZED DISTRACTION TO OPERATOR OF AN AUTOMOTIVE VEHICLE - A text-to-speech (TTS) system implemented in an automotive vehicle is dynamically tuned to increase intelligibility over a wide variety of vehicle operating states and environmental conditions by tuning characteristics of the synthesized voice in response to measured operating states. To decrease distractions to an operator of the vehicle, an embodiment of the invention prevents updates to the synthesized voice character from taking effect while a message phrase is being played. Instead, voice characteristics are updated only during natural phrase breaks. In another embodiment of the invention, a damping filter is applied to calculated changes in voice characteristics to prevent excessively rapid changes from being applied, reducing the likelihood of distracting the vehicle operator. In another embodiment of the invention, both phrase-break detectors and damping filters are employed. | 03-04-2010 |
20100057465 | VARIABLE TEXT-TO-SPEECH FOR AUTOMOTIVE APPLICATION - A text-to-speech (TTS) system implemented in an automotive vehicle is dynamically tuned to improve intelligibility over a wide variety of vehicle operating states and environmental conditions. In one embodiment of the present invention, a TTS system is interfaced to one or more vehicle sensors to measure parameters including vehicle speed, interior noise, visibility conditions, and road roughness, among others. In response to measurements of these operating parameters, TTS voice volume, pitch, and speed, among other parameters, may be tuned in order to improve intelligibility of the TTS voice system and increase its effectiveness for the operator of the vehicle. | 03-04-2010 |
20100057466 | METHOD AND APPARATUS FOR SCROLLING TEXT DISPLAY OF VOICE CALL OR MESSAGE DURING VIDEO DISPLAY SESSION - A method and communication device disclosed includes displaying a video on a display, converting voice audio data to textual data by applying voice-to-text conversion, and displaying the textual data as scrolling text displayed along with the video on the display and either above, below or across the video. The method may further include receiving a voice call indication from a network, providing the voice call indication to a user interface where the voice call indication corresponds to an incoming voice call; and receiving a user input for receiving the voice call and displaying the voice call as scrolling text. In another embodiment, a method includes displaying application related data on a display; converting voice audio data to textual data by applying voice-to-text conversion; converting the textual data to a video format; and displaying the textual data as scrolling text over the application related data on said display. | 03-04-2010 |
20100063821 | Hands-Free and Non-Visually Occluding Object Information Interaction System - Technologies are described herein for providing a hands-free and non-visually occluding interaction with object information. In one method, a visual capture of a portion of an object is received through a hands-free and non-visually occluding visual capture device. An audio capture is also received from a user through a hands-free and non-visually occluding audio capture device. The audio capture may include a request for information about a portion of the object in the visual capture. The information is retrieved and is transmitted to the user for playback through a hands-free and non-visually occluding audio output device. | 03-11-2010 |
20100070281 | SYSTEM AND METHOD FOR AUDIBLY PRESENTING SELECTED TEXT - Disclosed herein are methods for presenting speech from a selected text that is on a computing device. This method includes presenting text on a touch-sensitive display with the text size kept within a threshold level so that the computing device can accurately determine the intent of the user when the user touches the touch screen. Once the user touch has been received, the computing device identifies and interprets the portion of text that is to be selected, and subsequently presents the text audibly to the user. | 03-18-2010 |
20100070282 | METHOD AND APPARATUS FOR IMPROVING TRANSACTION SUCCESS RATES FOR VOICE REMINDER APPLICATIONS IN E-COMMERCE - Methods and apparatuses are disclosed for improving transaction success rates for voice reminder applications in e-commerce. In one embodiment of the invention, the voice reminder applications in e-commerce utilizes a network-based text-to-speech (TTS) alert system, which can generate a purchase reminder associated with a recipient's potential purchase. The network-based text-to-speech (TTS) alert system can also deliver the purchase reminder to a recipient's voicemail and leave a transaction identifier number and a centralized or a recipient-specific call-back phone number to the recipient's voicemail. A recipient can utilize the transaction identifier number, the centralized or the recipient-specific call-back phone number, and optionally a recipient-specific password to make a phone call to retrieve the purchase reminder previously delivered to the recipient's voicemail by the network-based text-to-speech (TTS) alert system. Then, the recipient can authorize and/or complete a transaction related to the purchase reminder over the same phone call. | 03-18-2010 |
20100076766 | Method for producing indicators and processing apparatus and system utilizing the indicators - The present invention discloses a method for producing graphical indicators and interactive systems for utilizing the graphical indicators. On the surface of an object, visually negligible graphical indicators are provided. The graphical indicators and main information, i.e. text or pictures, co-exist on the surface of the object. The graphical indicators do not interfere with the main information where the perception of human eyes is concerned. With the graphical indicators, further information other than the main information on the surface of the object is carried. In addition to the main information on the surface of the object, one is able to obtain additional information through an auxiliary electronic device or trigger an interactive operation. | 03-25-2010 |
20100076767 | TEXT TO SPEECH CONVERSION OF TEXT MESSAGES FROM MOBILE COMMUNICATION DEVICES - A method includes providing a user interface, at a mobile communication device, that includes a first area to receive text input and a second area to receive an identifier associated with an addressee device. The text input and the identifier are received via the user interface. A short message service (SMS) message including the text input is transmitted to a Text to Speech (TTS) server for conversion into an audio message and for transmission of the audio message to the addressee device associated with the identifier. An acknowledge message transmitted from the TTS server permits the addressee device to allow delivery of the audio message or to decline delivery of the audio message. The TTS server transmits the audio message in response to the addressee device allowing delivery of the audio message. A confirmation message is received from the TTS server that indicates that a reply voice message has been received from the addressee device in response to the audio message. | 03-25-2010 |
20100082345 | SPEECH AND TEXT DRIVEN HMM-BASED BODY ANIMATION SYNTHESIS - An “Animation Synthesizer” uses trainable probabilistic models, such as Hidden Markov Models (HMM), Artificial Neural Networks (ANN), etc., to provide speech and text driven body animation synthesis. Probabilistic models are trained using synchronized motion and speech inputs (e.g., live or recorded audio/video feeds) at various speech levels, such as sentences, phrases, words, phonemes, sub-phonemes, etc., depending upon the available data, and the motion type or body part being modeled. The Animation Synthesizer then uses the trainable probabilistic model for selecting animation trajectories for one or more different body parts (e.g., face, head, hands, arms, etc.) based on an arbitrary text and/or speech input. These animation trajectories are then used to synthesize a sequence of animations for digital avatars, cartoon characters, computer generated anthropomorphic persons or creatures, actual motions for physical robots, etc., that are synchronized with a speech output corresponding to the text and/or speech input. | 04-01-2010 |
20100082346 | SYSTEMS AND METHODS FOR TEXT TO SPEECH SYNTHESIS - Algorithms for synthesizing speech used to identify media assets are provided. Speech may be selectively synthesized from text strings associated with media assets. A text string may be normalized and its native language determined for obtaining a target phoneme for providing human-sounding speech in a language (e.g., dialect or accent) that is familiar to a user. The algorithms may be implemented on a system including several dedicated render engines. The system may be part of a back end coupled to a front end including storage for media assets and associated synthesized speech, and a request processor for receiving and processing requests that result in providing the synthesized speech. The front end may communicate media assets and associated synthesized speech content over a network to host devices coupled to portable electronic devices on which the media assets and synthesized speech are played back. | 04-01-2010 |
20100082347 | SYSTEMS AND METHODS FOR CONCATENATION OF WORDS IN TEXT TO SPEECH SYNTHESIS - Algorithms for synthesizing speech used to identify media assets are provided. Speech may be selectively synthesized from text strings associated with media assets. A text string may be normalized and its native language determined for obtaining a target phoneme for providing human-sounding speech in a language (e.g., dialect or accent) that is familiar to a user. The algorithms may be implemented on a system including several dedicated render engines. The system may be part of a back end coupled to a front end including storage for media assets and associated synthesized speech, and a request processor for receiving and processing requests that result in providing the synthesized speech. The front end may communicate media assets and associated synthesized speech content over a network to host devices coupled to portable electronic devices on which the media assets and synthesized speech are played back. | 04-01-2010 |
20100082348 | SYSTEMS AND METHODS FOR TEXT NORMALIZATION FOR TEXT TO SPEECH SYNTHESIS - Algorithms for synthesizing speech used to identify media assets are provided. Speech may be selectively synthesized from text strings associated with media assets. A text string may be normalized and its native language determined for obtaining a target phoneme for providing human-sounding speech in a language (e.g., dialect or accent) that is familiar to a user. The algorithms may be implemented on a system including several dedicated render engines. The system may be part of a back end coupled to a front end including storage for media assets and associated synthesized speech, and a request processor for receiving and processing requests that result in providing the synthesized speech. The front end may communicate media assets and associated synthesized speech content over a network to host devices coupled to portable electronic devices on which the media assets and synthesized speech are played back. | 04-01-2010 |
20100082349 | SYSTEMS AND METHODS FOR SELECTIVE TEXT TO SPEECH SYNTHESIS - Algorithms for synthesizing speech used to identify media assets are provided. Speech may be selectively synthesized from text strings associated with media assets. A text string may be normalized and its native language determined for obtaining a target phoneme for providing human-sounding speech in a language (e.g., dialect or accent) that is familiar to a user. The algorithms may be implemented on a system including several dedicated render engines. The system may be part of a back end coupled to a front end including storage for media assets and associated synthesized speech, and a request processor for receiving and processing requests that result in providing the synthesized speech. The front end may communicate media assets and associated synthesized speech content over a network to host devices coupled to portable electronic devices on which the media assets and synthesized speech are played back. | 04-01-2010 |
20100082350 | METHOD AND SYSTEM FOR PROVIDING SYNTHESIZED SPEECH - An approach providing the efficient use of speech synthesis in rendering text content as audio in a communications network. The communications network can include a telephony network and a data network in support of, for example, Voice over Internet Protocol (VoIP) services. A speech synthesis system receives a text string from either a telephony network or a data network. The speech synthesis system determines whether a rendered audio file of the text string is stored in a database and renders the text string to output the rendered audio file if the rendered audio is determined not to exist. The rendered audio file is stored in the database for re-use according to a hash value generated by the speech synthesis system based on the text string. | 04-01-2010 |
20100088099 | Reducing Processing Latency in Optical Character Recognition for Portable Reading Machine - A portable reading device includes a computing device and a computer readable medium storing a computer program product to receive an image and select a section of the image to process. The product processes the section of the image with a first process and, when the first process is finished processing the section of the image, processes a result of the first process with a second process. While the second process is processing, the product repeats the first process on another section of the image. | 04-08-2010 |
20100094632 | System and Method of Developing A TTS Voice - Disclosed herein are various aspects of a toolkit used for generating a TTS voice for use in a spoken dialog system. The embodiments in each case may be in the form of the system, a computer-readable medium or a method for generating the TTS voice. An embodiment of the invention relates to a method of tracking progress in developing a text-to-speech (TTS) voice. The method comprises ensuring that a corpus of recorded speech contains no reading errors and matches an associated written text, creating a tuple for each utterance in the corpus and tracking progress for each utterance utilizing the tuple. Various parameters may be tracked using the tuple, but the tuple provides a means for enabling multiple workers to efficiently process a database of utterances in preparation of a TTS voice. | 04-15-2010 |
20100100385 | System and Method for Testing a TTS Voice - Disclosed are various elements of a toolkit used for generating a TTS voice for use in a spoken dialog system. The invention in each case may be in the form of the system, a computer-readable medium or a method for generating the TTS voice. An embodiment of the invention relates to a method for preparing a text-to-speech (TTS) voice for testing and verification. The method comprises processing a TTS voice to be ready for testing, synthesizing words utilizing the TTS voice, presenting to a person a smallest possible subset that contains at least N instances of a group of units in the TTS voice, receiving information from the person associated with corrections needed to the TTS voice and making corrections to the TTS voice according to the received information. | 04-22-2010 |
20100106506 | SYSTEMS AND METHODS FOR DOCUMENT NAVIGATION WITH A TEXT-TO-SPEECH ENGINE - A system for visually navigating a document in conjunction with a text-to-speech (“TTS”) engine presents a visual display of a region of interest that is related to the text of the document that is being audibly presented as speech to a user. When the TTS engine converts the text to speech and presents the speech to the user, the system presents the corresponding section of text on a display. During the presentation, if the system encounters a linked section of text, the visual display changes to display a linked region of interest that corresponds to the linked section of text. | 04-29-2010 |
20100114578 | Method and Apparatus for Improving Voice Recognition Performance in a Voice Application Distribution System - A vocabulary management system for constraining voice recognition processing associated with text-to-speech and speech-to-text rendering associated with use of a voice application in progress between a user accessing a data source through a voice portal has a vocabulary management server connected to a voice application server and to a telephony server, and an instance of vocabulary management software running on the management server for enabling vocabulary establishment and management for voice recognition software. The system is characterized in that an administrator accessing the vocabulary management server uses the software to create unique vocabulary sets that are specific to selected portions of vocabulary associated with target data sources, the vocabulary sets differing in content according to administrator direction. | 05-06-2010 |
20100114579 | System and Method of Controlling Sound in a Multi-Media Communication Application - A computing device and computer-readable medium storing instructions for controlling a computing device to customize a voice in a multi-media message created by a sender for a recipient, the multi-media message comprising a text message from the sender to be delivered by an animated entity. The instructions comprise receiving from the sender voice emoticons, which may be repeated, inserted into the text message and associated with parameters of a voice used by the animated entity to deliver the text message; and transmitting the text message such that a recipient device can deliver the multi-media message at a variable level associated with the number of times a respective voice emoticon is repeated. | 05-06-2010 |
20100125459 | STOCHASTIC PHONEME AND ACCENT GENERATION USING ACCENT CLASS - Exemplary embodiments provide for determining a sequence of words in a TTS system. An input text is analyzed using two models, a word n-gram model and an accent class n-gram model. A list of all possible words for each word in the input is generated for each model. Each word in each list for each model is given a score based on the probability that the word is the correct word in the sequence, according to the particular model. The two lists are combined and the two scores are combined for each word. A set of sequences of words is generated, each sequence comprising a unique combination of an attribute and associated word for each word in the input. The combined scores of the words in each sequence are then combined, and the sequence of words having the highest score is selected and presented to a user. | 05-20-2010 |
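The per-word score combination described above can be sketched as a simple interpolation of the two models' scores, followed by an argmax. The candidate names, probabilities, and interpolation weight below are invented for illustration:

```python
def combine_scores(word_ngram, accent_ngram, weight=0.5):
    """Interpolate the word n-gram and accent-class n-gram scores
    for each candidate word; missing accent scores default to 0."""
    return {cand: weight * word_ngram[cand]
                  + (1.0 - weight) * accent_ngram.get(cand, 0.0)
            for cand in word_ngram}

def best_candidate(word_ngram, accent_ngram, weight=0.5):
    """Candidate with the highest combined score."""
    combined = combine_scores(word_ngram, accent_ngram, weight)
    return max(combined, key=combined.get)
```

A full system would apply this per position and then search over whole sequences; this sketch shows only the per-word combination step.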
20100145703 | Portable Code Recognition Voice-Outputting Device - The present invention relates to a code recognition voice-outputting device, in which a digital code image of a predetermined compression type is recognized, and the recognized image is converted into voice to be output to the outside. The apparatus includes a reader as a scanning unit for recognizing a compressed digital code image, and a player for processing the digital code image read from the reader and converting the processed code image into voice to be output to the outside, wherein the reader and the player are configured to be separable from each other. The present invention further provides a code recognition voice-outputting device which supports a variety of functions and provides a voice guide for all menus and operating statuses supporting those functions, for the benefit of visually impaired, illiterate, and elderly users, thereby promoting user convenience. | 06-10-2010 |
20100145704 | SYSTEM AND METHOD FOR INCREASING RECOGNITION RATES OF IN-VOCABULARY WORDS BY IMPROVING PRONUNCIATION MODELING - Disclosed herein are systems, methods, and computer readable-media for generating a lexicon for use with speech recognition. The method includes receiving symbolic input as labeled speech data, overgenerating potential pronunciations based on the symbolic input, identifying best potential pronunciations in a speech recognition context, and storing the identified best potential pronunciations in a lexicon. Overgenerating potential pronunciations can include establishing a set of conversion rules for short sequences of letters, converting portions of the symbolic input into a number of possible lexical pronunciation variants based on the set of conversion rules, modeling the possible lexical pronunciation variants in one of a weighted network and a list of phoneme lists, and iteratively retraining the set of conversion rules based on improved pronunciations. Symbolic input can include multiple examples of a same spoken word. Speech data can be labeled explicitly or implicitly and can include words as text and recorded audio. | 06-10-2010 |
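The overgeneration step above can be sketched with a toy rule set: each letter-sequence rule may map to several alternative phoneme strings, and every combination is kept as a candidate pronunciation. The rules below are invented examples, not the trained rules the method would learn:

```python
def overgenerate(word, rules):
    """Expand a spelling into candidate phoneme strings by applying the
    first matching letter-sequence rule at each position; a rule maps a
    letter sequence to one or more alternative phoneme strings."""
    variants = [""]
    i = 0
    while i < len(word):
        for seq, phone_options in rules:
            if word.startswith(seq, i):
                variants = [v + p for v in variants for p in phone_options]
                i += len(seq)
                break
        else:  # no rule matched: pass the letter through unchanged
            variants = [v + word[i] for v in variants]
            i += 1
    return variants
```

A recognizer would then score each variant against labeled audio and keep the best-performing pronunciations for the lexicon, as the abstract describes.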
20100145705 | AUDIO WITH SOUND EFFECT GENERATION FOR TEXT-ONLY APPLICATIONS - A method of generating audio for a text-only application comprises the steps of adding a tag to an input text, said tag being usable for adding a sound effect to the generated audio; processing the tag to form instructions for generating the audio; and generating audio with said effect based on the instructions while the text is being presented. The present invention adds entertainment value to text applications, provides a very compact format compared to conventional multimedia, and uses entertaining sound to make text-only applications such as SMS and email more fun and engaging. | 06-10-2010 |
20100153114 | AUDIO OUTPUT OF A DOCUMENT FROM MOBILE DEVICE - Architecture for playing a document converted into an audio format to a user of an audio-output capable device. The user can interact with the device to control play of the audio document such as pause, rewind, forward, etc. In a more robust implementation, the audio-output capable device is a mobile device (e.g., cell phone) having a microphone for processing voice input. Voice commands can then be input to control play (“reading”) of the document audio file to pause, rewind, read paragraph, read next chapter, fast forward, etc. A communications server (e.g., email, attachments to email, etc.) transcodes text-based document content into an audio format by leveraging a text-to-speech (TTS) engine. The transcoded audio files are then transferred to mobile devices through viable transmission channels. Users can then play the audio-formatted document while freeing hand and eye usage for other tasks. | 06-17-2010 |
20100153115 | Human-Assisted Pronunciation Generation - Pronunciation generation may be provided. First, a pronunciation interface may be provided. The pronunciation interface may be configured to display a word and a plurality of alternatives corresponding to a one of a plurality of parts of the word. The plurality of parts may comprise phonemes or syllables of the word. Next, pronunciation data may be received through the pronunciation interface. The pronunciation data may indicate one of the plurality of alternatives. Then a pronunciation of the word may be generated based upon the received pronunciation data. The pronunciation may correspond to the indicated one of the plurality of alternatives. In addition, the pronunciation data may indicate which one of the plurality of parts of the word is stressed. This stress indication may be received in response to a user sliding a user selectable element to indicate which one of the plurality of parts of the word is stressed. | 06-17-2010 |
20100153116 | METHOD FOR STORING AND RETRIEVING VOICE FONTS - The present invention is a system for storing text-to-speech files which includes a means for storing a plurality of voice fonts wherein each voice font has associated therewith a universal voice identifier (UVI). The invention includes delivering a voice font to a receiver of a message containing text wherein the message contains the UVI and the receiver requests the voice font associated with the UVI from the means for storing. | 06-17-2010 |
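The storage-and-delivery scheme above amounts to a font store keyed by the universal voice identifier (UVI). A minimal sketch; the class name, method names, and UVI format are assumptions for illustration:

```python
class VoiceFontStore:
    """Stores voice fonts keyed by a universal voice identifier (UVI)."""

    def __init__(self):
        self._fonts = {}

    def register(self, uvi, font_data):
        """Associate a voice font with its UVI."""
        self._fonts[uvi] = font_data

    def request(self, uvi):
        """What a message receiver calls after reading the UVI out of an
        incoming text message; None means the font is not held here and
        must be obtained elsewhere."""
        return self._fonts.get(uvi)
```

The receiver of a text message carrying a UVI would call `request` to fetch the sender's voice font before synthesizing the message.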
20100169096 | Instant communication with instant text data and voice data - Embodiments of the invention relate to an instant communication method, an instant communication server, a speech server and a system thereof. The instant communication method includes: receiving, by a speech server, text data sent via instant communication software by a first user terminal; transforming, by the speech server, the text data into first speech data; sending, by the speech server, the first speech data via a preconfigured phone number to a corresponding second user terminal; receiving, by the speech server, second speech data sent by the second user terminal; and sending, by the speech server, the second speech data to the first user terminal via the instant communication software. Using embodiments of the invention, website owners can communicate with visitors via a mobile phone or a fixed telephone anytime and anywhere, which may improve the reception of Internet marketing, reduce prerequisite requirements for e-commerce; and connect the Internet and the telecommunication network. | 07-01-2010 |
20100174544 | SYSTEM, METHOD AND END-USER DEVICE FOR VOCAL DELIVERY OF TEXTUAL DATA - System and method for receiving documents of different formats from external sources, analyzing the documents and transforming them into an internal format comprising tokens for effective browsing and referencing, communicating data volumes of transformed documents to a user device, browsing and vocalizing tokens from the documents to the user, receiving and processing verbal user commands pertaining to said vocalized tokens, retrieving documents pertaining to the user command and vocalizing the retrieved documents to said user. | 07-08-2010 |
20100174545 | INFORMATION PROCESSING APPARATUS AND TEXT-TO-SPEECH METHOD - An information processing apparatus for playing back data includes an oral reading unit, a storage unit storing text templates for responses to questions from a user and text template conversion rules, an input unit for inputting a question from a user, and a control unit for retrieving data and items of information associated with the data. The control unit analyzes a question about data from a user, for example, a question about a tune, to select a text template for a response to the question and detects the characters in items of tune information of the tune. The characters are designated to replace replacement symbols included in the text template. The control unit also converts the text template based on whether the characters can be read aloud, generates a text to be read aloud using the converted text template, and causes the oral reading unit to read the text aloud. | 07-08-2010 |
20100191533 | CHARACTER INFORMATION PRESENTATION DEVICE - The text information presentation device calculates an optimum readout speed on the basis of the content of the text information being input, its arrival time, and the previous arrival time; speech-synthesizes the input text information at the calculated readout speed and outputs it as an audio signal; or, alternatively, controls the speed at which a video signal is output according to an output state of the speech synthesizing unit. | 07-29-2010 |
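One plausible reading of the speed calculation above: pace the synthesis so each text item finishes before the next item is expected to arrive, with the expected interval inferred from the previous arrival gap, clamped to an intelligible range. All constants and the pacing policy are illustrative assumptions, not the patent's formula:

```python
def readout_rate(num_chars, arrival_time, prev_arrival_time,
                 min_cps=5.0, max_cps=20.0):
    """Characters-per-second rate for the current text item, chosen so it
    finishes within the observed inter-arrival interval and clamped to a
    range where synthesized speech stays intelligible."""
    interval = max(arrival_time - prev_arrival_time, 1e-6)  # avoid /0
    return min(max(num_chars / interval, min_cps), max_cps)
```
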
20100198599 | DISPLAY APPARATUS - In a display apparatus, a text code input section outputs externally-supplied text code information to a font conversion section and a voice synthesizer section. The font conversion section converts the input text code into a corresponding font, and transmits the font to a display drive section via a video signal input section, and the display drive section causes a display section to display the font. Meanwhile, the voice synthesizer section converts the input text code into corresponding voice data, and transmits the voice data to a voice device where the voice data is outputted. With this structure, superior convenience is ensured for a display apparatus which serves only as an individual displaying apparatus and relies on an external device (server) for the major functions of the system. | 08-05-2010 |
20100211392 | SPEECH SYNTHESIZING DEVICE, METHOD AND COMPUTER PROGRAM PRODUCT - The speech synthesizing device acquires numerical data at regular time intervals, each piece of the numerical data representing a value having a plurality of digits, detects a change between two values represented by the numerical data that is acquired at two consecutive times, determines which digit of the value represented by the numerical data is used to generate speech data depending on the detected change, generates numerical information that indicates the determined digit of the value represented by the numerical data, and generates speech data from the digit indicated by the numerical information. | 08-19-2010 |
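The digit-selection step above can be sketched as follows, assuming zero-padded fixed-width readings and a "speak from the highest changed digit down" policy (both are my assumptions about the described behaviour):

```python
def digits_to_speak(prev, curr, width=5):
    """Return only the digits of the new value from the highest changed
    digit down; an empty string means the value is unchanged and nothing
    needs to be spoken."""
    p, c = str(prev).zfill(width), str(curr).zfill(width)
    for i, (a, b) in enumerate(zip(p, c)):
        if a != b:
            return c[i:]
    return ""
```

For a sensor reading that ticks from 12345 to 12395, only "95" would be passed to the speech generator instead of the full five-digit value.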
20100211393 | SPEECH SYNTHESIS DEVICE, SPEECH SYNTHESIS METHOD, AND SPEECH SYNTHESIS PROGRAM - A speech synthesis device is provided with: a central segment selection unit for selecting a central segment from among a plurality of speech segments; a prosody generation unit for generating prosody information based on the central segment; a non-central segment selection unit for selecting a non-central segment, which is a segment outside of a central segment section, based on the central segment and the prosody information; and a waveform generation unit for generating a synthesized speech waveform based on the prosody information, the central segment, and the non-central segment. The speech synthesis device first selects a central segment that forms a basis for prosody generation and generates prosody information based on the central segment so that it is possible to sufficiently reduce both concatenation distortion and sound quality degradation accompanying prosody control in the section of the central segment. | 08-19-2010 |
20100217600 | ELECTRONIC DEVICE AND METHOD OF ASSOCIATING A VOICE FONT WITH A CONTACT FOR TEXT-TO-SPEECH CONVERSION AT THE ELECTRONIC DEVICE - A method of associating a voice font with a contact for text-to-speech conversion at an electronic device includes obtaining, at the electronic device, the voice font for the contact, and storing the voice font in association with a contact data record stored in a contacts database at the electronic device. The contact data record includes contact data for the contact. | 08-26-2010 |
20100223058 | SPEECH SYNTHESIS DEVICE, SPEECH SYNTHESIS METHOD, AND SPEECH SYNTHESIS PROGRAM - A speech synthesis device includes a pitch pattern generation unit. | 09-02-2010 |
20100228549 | SYSTEMS AND METHODS FOR DETERMINING THE LANGUAGE TO USE FOR SPEECH GENERATED BY A TEXT TO SPEECH ENGINE - Algorithms for synthesizing speech used to identify media assets are provided. Speech may be selectively synthesized from text strings associated with media assets, where each text string can be associated with a native string language (e.g., the language of the string). When several text strings are associated with at least two distinct languages, a series of rules can be applied to the strings to identify a single voice language to use for synthesizing the speech content from the text strings. In some embodiments, a prioritization scheme can be applied to the text strings to identify the more important text strings. The rules can include, for example, selecting a voice language based on the prioritization scheme, a default language associated with an electronic device, the ability of a voice language to speak text in a different language, or any other suitable rule. | 09-09-2010 |
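The rule cascade described above might be sketched like this, where metadata field names, the priority order, and the device default are invented for illustration:

```python
def pick_voice_language(field_langs, priority_order, default_lang="en"):
    """field_langs maps a metadata field (e.g. 'title', 'album') to its
    detected language. Apply the rules in order: agreement among the
    strings, then the prioritization scheme, then the device default."""
    langs = set(field_langs.values())
    if len(langs) == 1:              # all strings agree: no rule needed
        return langs.pop()
    for field in priority_order:     # prioritization-scheme rule
        if field in field_langs:
            return field_langs[field]
    return default_lang              # fall back to the device default
```

For example, a French title paired with an English album name yields a French voice when the title is the highest-priority field.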
20100241432 | PROVIDING DESCRIPTIONS OF VISUALLY PRESENTED INFORMATION TO VIDEO TELECONFERENCE PARTICIPANTS WHO ARE NOT VIDEO-ENABLED - Descriptions of visually presented material are provided to one or more conference participants that do not have video capabilities. This presented material could be any one or more of a document, PowerPoint® presentation, spreadsheet, Webex® presentation, whiteboard, chalkboard, interactive whiteboard, description of a flowchart, picture, or in general, any information visually presented at a conference. For this visually presented information, descriptions thereof are assembled and forwarded via one or more of a message, SMS message, whisper channel, text information, non-video channel, MSRP, or the like, to one or more conference participant endpoints. These descriptions of visually presented information, such as a document, spreadsheet, spreadsheet presentation, multi-media presentation, or the like, can be assembled in cooperation with one or more of OCR recognition and text-to-speech conversion, human input, or the like. | 09-23-2010 |
20100250253 | CONTEXT AWARE, SPEECH-CONTROLLED INTERFACE AND SYSTEM - A speech-directed user interface system includes at least one speaker for delivering an audio signal to a user and at least one microphone for capturing speech utterances of a user. An interface device interfaces with the speaker and microphone and provides a plurality of audio signals to the speaker to be heard by the user. A control circuit is operably coupled with the interface device and is configured for selecting at least one of the plurality of audio signals as a foreground audio signal for delivery to the user through the speaker. The control circuit is operable for recognizing speech utterances of a user and using the recognized speech utterances to control the selection of the foreground audio signal. | 09-30-2010 |
20100250254 | SPEECH SYNTHESIZING DEVICE, COMPUTER PROGRAM PRODUCT, AND METHOD - An acquiring unit acquires pattern sentences, which are similar to one another and include fixed segments and non-fixed segments, and substitution words that are substituted for the non-fixed segments. A sentence generating unit generates target sentences by replacing the non-fixed segments with the substitution words for each of the pattern sentences. A first synthetic-sound generating unit generates a first synthetic sound, a synthetic sound of the fixed segment, and a second synthetic-sound generating unit generates a second synthetic sound, a synthetic sound of the substitution word, for each of the target sentences. A calculating unit calculates a discontinuity value of a boundary between the first synthetic sound and the second synthetic sound for each of the target sentences and a selecting unit selects the target sentence having the smallest discontinuity value. A connecting unit connects the first synthetic sound and the second synthetic sound of the target sentence selected. | 09-30-2010 |
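The selection step above reduces to an argmin over boundary costs. A toy sketch using an absolute pitch gap at the fixed-segment/substitution-word join as the discontinuity value; the measure, the tuple layout, and the numbers are illustrative assumptions:

```python
def boundary_discontinuity(fixed_tail_pitch, subst_head_pitch):
    """Toy discontinuity: absolute pitch gap (Hz) at the join between
    the fixed-segment sound and the substitution-word sound."""
    return abs(fixed_tail_pitch - subst_head_pitch)

def select_target_sentence(candidates):
    """candidates: (sentence, fixed_tail_pitch, subst_head_pitch) tuples.
    Return the sentence whose boundary is smoothest, i.e. has the
    smallest discontinuity value."""
    return min(candidates, key=lambda c: boundary_discontinuity(c[1], c[2]))[0]
```
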
20100262426 | Interactive speech synthesizer for enabling people who cannot talk but who are familiar with use of anonym moveable picture communication to autonomously communicate using verbal language - A method for enabling a person, who cannot talk but who is familiar with use of anonym moveable picture communication, to autonomously communicate speech sound automatically in a sequence. The method includes the steps of choosing a plurality of selected encoded tags, each having means by which it can be identified by the person; providing an interactive speech synthesizer; arranging each of the plurality of selected encoded tags to be movable between a ready mode, wherein it is proximate to an associated tag reader, and a go mode, wherein it is in operative association with the associated tag reader; providing, in the go mode, for each tag reader to read data from its associated encoded tag in the sequence to provide a series of coded signals; transmitting the series of coded signals to a microcontroller; causing the microcontroller to organize a sound file corresponding to the series of coded signals; and transmitting the sound file to an audio output device to convert the sound file automatically into the speech sound. | 10-14-2010 |
20100268539 | SYSTEM AND METHOD FOR DISTRIBUTED TEXT-TO-SPEECH SYNTHESIS AND INTELLIGIBILITY - A method and system for distributed text-to-speech synthesis and intelligibility, and more particularly distributed text-to-speech synthesis on handheld portable computing devices that can be used, for example, to generate intelligible audio prompts that help a user interact with a user interface of the handheld portable computing device. | 10-21-2010 |
20100299149 | Character Models for Document Narration - Disclosed are techniques and systems to provide a narration of a text in multiple different voices where the portions of the text narrated using the different voices are selected by a user. Also disclosed are techniques and systems for associating characters with portions of a sequence of words selected by a user. Different characters having different voice models can be associated with different portions of a sequence of words. | 11-25-2010 |
20100312562 | HIDDEN MARKOV MODEL BASED TEXT TO SPEECH SYSTEMS EMPLOYING ROPE-JUMPING ALGORITHM - A rope-jumping algorithm is employed in a Hidden Markov Model based text to speech system to determine start and end models and to modify the start and end models by setting small co-variances. The modification avoids disordered acoustic parameters due to violation of parameter constraints and results in stable line spectral frequencies for the generated speech. | 12-09-2010 |
20100312563 | TECHNIQUES TO CREATE A CUSTOM VOICE FONT - Techniques to create and share custom voice fonts are described. An apparatus may include a preprocessing component to receive voice audio data and a corresponding text script from a client and to process the voice audio data to produce prosody labels and a rich script. The apparatus may further include a verification component to automatically verify the voice audio data and the text script. The apparatus may further include a training component to train a custom voice font from the verified voice audio data and rich script and to generate custom voice font data usable by the TTS component. Other embodiments are described and claimed. | 12-09-2010 |
20100312564 | LOCAL AND REMOTE FEEDBACK LOOP FOR SPEECH SYNTHESIS - A local text to speech feedback loop is utilized to modify algorithms used in speech synthesis to provide a user with an improved experience. A remote text to speech feedback loop is utilized to aggregate local feedback loop data and incorporate best solutions into new improved text to speech engine for deployment. | 12-09-2010 |
20100318360 | METHOD AND SYSTEM FOR EXTRACTING MESSAGES - The present invention is a method and system for extracting messages from a person using the body features presented by a user. The present invention captures a set of images and extracts a first set of body features, along with a set of contexts, and a set of meanings. From the first set of body features, the set of contexts, and the set of meanings, the present invention generates a set of words corresponding to the message that the person is attempting to convey. The present invention can also use the body features of the person in addition to the voice of the person to further improve the accuracy of extracting the person's message. | 12-16-2010 |
20100318361 | Context-Relevant Images - Assistive, context-relevant images may be provided. First, text may be received. Then a spell check indication may be received and a spelling check may be performed on the received text in response to the received spell check indication. Next, in response to the performed spelling check, a misspelling indication may be provided configured to indicate that at least one word in the received text is misspelled. A selection of the misspelling indication may then be received. Then, on a display device in response to the received selection of the misspelling indication, a plurality of suggested spellings for the at least one word and an image corresponding to a first one of the plurality of suggested spellings for the at least one word may be displayed. | 12-16-2010 |
20100318362 | Systems and Methods for Multiple Voice Document Narration - Disclosed are techniques and systems to provide a narration of a text in multiple different voices where the portions of the text narrated using the different voices are selected by a user. | 12-16-2010 |
20100318363 | SYSTEMS AND METHODS FOR PROCESSING INDICIA FOR DOCUMENT NARRATION - Disclosed are techniques and systems to provide a narration of a text in multiple different voices. Further disclosed are techniques and systems for processing indicia in a document to determine a portion of words and associating a particular voice model with the portion of words based on the indicia. During a readback process, an audible output corresponding to the words in the portion of words is generated using the voice model associated with the portion of words. | 12-16-2010 |
20100318364 | SYSTEMS AND METHODS FOR SELECTION AND USE OF MULTIPLE CHARACTERS FOR DOCUMENT NARRATION - Disclosed are techniques and systems to provide a narration of a text in multiple different voices. Further disclosed are techniques and systems for generating an audible output in which different portions of a text are narrated using voice models associated with different characters. | 12-16-2010 |
20100324902 | Systems and Methods for Document Narration - Disclosed are techniques and systems to provide a narration of a text in multiple different voices. In some aspects, systems and methods described herein can include receiving a user-based selection of a first portion of words in a document, where the document has a pre-associated first voice model, and overwriting the association of the first voice model, by the one or more computers, with a second voice model for the first portion of words. | 12-23-2010 |
20100324903 | SYSTEMS AND METHODS FOR DOCUMENT NARRATION WITH MULTIPLE CHARACTERS HAVING MULTIPLE MOODS - Disclosed are techniques and systems to provide a narration of a text in multiple different voices. Further disclosed are techniques and systems for providing a plurality of characters at least some of the characters having multiple associated moods for use in document narration. | 12-23-2010 |
20100324904 | SYSTEMS AND METHODS FOR MULTIPLE LANGUAGE DOCUMENT NARRATION - Disclosed are techniques and systems to provide a narration of a text in multiple different languages where the portions of the text narrated using the different voices associated with different languages are selected by a user. | 12-23-2010 |
20100324905 | VOICE MODELS FOR DOCUMENT NARRATION - Disclosed are techniques and systems to provide a narration of a text in multiple different voices. Further disclosed are techniques and systems for modifying a voice model associated with a selected character based on data received from a user. | 12-23-2010 |
20110010178 | SYSTEM AND METHOD FOR TRANSFORMING VERNACULAR PRONUNCIATION - Provided is a system and method for transforming vernacular pronunciation with respect to Hanja using a statistical method. In a system for transforming vernacular pronunciation, a vernacular pronunciation extracting unit extracts a vernacular pronunciation with respect to a Hanja character string, a statistical data determining unit determines statistical data with respect to the Hanja character string by using statistical data of features related to Hanja-vernacular pronunciation transformation, and a vernacular pronunciation transforming unit transforms the Hanja character string into a vernacular pronunciation using the extracted vernacular pronunciation and the determined statistical data. | 01-13-2011 |
20110015929 | TRANSFORMING A TACTUALLY SELECTED USER INPUT INTO AN AUDIO OUTPUT - A contextual input device includes a plurality of tactually discernable keys disposed in a predetermined configuration which replicates a particular relationship among a plurality of items associated with a known physical object. The tactually discernable keys are typically labeled with Braille type. The known physical object is typically a collection of related items grouped together by some common relationship. A computer-implemented process determines whether an input signal represents a selection of an item from among a plurality of items or an attribute pertaining to an item among the plurality of items. Once the selected item or attribute pertaining to an item is determined, the computer-implemented process transforms a user's selection from the input signal into an analog audio signal which is then audibly output as human speech with an electro-acoustic transducer. | 01-20-2011 |
20110015930 | UNIFIED COMMUNICATION SYSTEM - A unified communication system is disclosed that allows a variety of end point types to participate in a communication event using a common, unified communication system. In some implementations, a calling party interacts with a client application residing on an endpoint to make a communication request to another endpoint. A communication event manager residing in the unified communication system selects a script from a repository of scripts based on the communication event and the capabilities of the endpoints. A communication event execution engine receives a user profile associated with at least one of the endpoints. The user profile can be configured by the user to describe the user's preferences for how the communication should be processed by the unified communication system. | 01-20-2011 |
20110035222 | SELECTING FROM A PLURALITY OF AUDIO CLIPS FOR ANNOUNCING MEDIA - Systems and methods for selecting one of several audio clips associated with a text item for playback are provided. The electronic device can determine which audio clip to play back at any point in time using different approaches, including for example receiving a user selection or randomly selecting audio clips. In some embodiments, the electronic device can intelligently select audio clips based on attributes of the media item, the electronic device operations, or the environment of the electronic device. The attributes can include, for example, metadata values of the media item, the type of ongoing operations of the electronic device, and environmental characteristics that can be measured or detected using sensors of or coupled to the electronic device. Different audio clips can be associated with particular attribute values, such that an audio clip corresponding to the detected or received attribute values is played back. | 02-10-2011 |
20110035223 | AUDIO CLIPS FOR ANNOUNCING REMOTELY ACCESSED MEDIA ITEMS - Systems and methods for retrieving and playing back audio clips for streamed or remotely received media items are provided. An electronic device can provide audio clips identifying media items at any suitable time, including for example to identify media items that are currently played back or available for playback. When the media items played back are not locally stored, the electronic device may not have a corresponding audio clip locally stored. In such cases, the electronic device can identify a streamed media item, and retrieve an audio clip corresponding to text items associated with the media item. For example, the electronic device can retrieve audio clips corresponding to the artist, title and album of the received media item. The electronic device can retrieve audio clips from any suitable source, such as a dedicated audio clip server or other remote source, a remote text-to-speech engine, or a locally stored text-to-speech engine. | 02-10-2011 |
20110046955 | SPEECH PROCESSING APPARATUS, SPEECH PROCESSING METHOD AND PROGRAM - There is provided a speech processing apparatus including: a data obtaining unit which obtains music progression data defining a property of one or more time points or one or more time periods along progression of music; a determining unit which determines an output time point at which a speech is to be output during reproducing the music by utilizing the music progression data obtained by the data obtaining unit; and an audio output unit which outputs the speech at the output time point determined by the determining unit during reproducing the music. | 02-24-2011 |
20110046956 | System And Method For Improved Dynamic Allocation Of Application Resources - A self-help application platform such as one hosting an interactive voice response (IVR) has a browser that executes application scripts to implement the self-help application. The execution of the application scripts is performed by utilizing various application resources, such as media conversions from text to speech (TTS) and speech to text (automatic speech recognition ASR) and other media servers. The platform is provided with a dynamic resource selection mechanism in which the application is executed with an updated optimum set of application resources distributed over different locations. The selection is based on the profiles of the browser, users, route, and quality of service. The selection is further modulated by the browser's previous experiences with the individual resources. The selection is made dynamically during the executing of the application script. | 02-24-2011 |
20110054903 | RICH CONTEXT MODELING FOR TEXT-TO-SPEECH ENGINES - Embodiments of rich text modeling for speech synthesis are disclosed. In operation, a text-to-speech engine refines a plurality of rich context models based on decision tree-tied Hidden Markov Models (HMMs) to produce a plurality of refined rich context models. The text-to-speech engine then generates synthesized speech for an input text based at least on some of the plurality of refined rich context models. | 03-03-2011 |
20110060590 | SYNTHETIC SPEECH TEXT-INPUT DEVICE AND PROGRAM - A synthetic speech text-input device is provided that allows a user to intuitively know an amount of an input text that can be fit in a desired duration. A synthetic speech text-input device | 03-10-2011 |
20110071835 | SMALL FOOTPRINT TEXT-TO-SPEECH ENGINE - Embodiments of small footprint text-to-speech engine are disclosed. In operation, the small footprint text-to-speech engine generates a set of feature parameters for an input text. The set of feature parameters includes static feature parameters and delta feature parameters. The small footprint text-to-speech engine then derives a saw-tooth stochastic trajectory that represents the speech characteristics of the input text based on the static feature parameters and the delta parameters. Finally, the small footprint text-to-speech engine produces a smoothed trajectory from the saw-tooth stochastic trajectory, and generates synthesized speech based on the smoothed trajectory. | 03-24-2011 |
20110106537 | TRANSFORMING COMPONENTS OF A WEB PAGE TO VOICE PROMPTS - Embodiments of the invention address the deficiencies of the prior art by providing a method, apparatus, and program product for converting components of a web page to voice prompts for a user. In some embodiments, the method comprises selectively determining at least one HTML component from a plurality of HTML components of a web page to transform into a voice prompt for a mobile system based upon a voice attribute file associated with the web page. The method further comprises transforming the at least one HTML component into parameterized data suitable for use by the mobile system based upon at least a portion of the voice attribute file associated with the at least one HTML component and transmitting the parameterized data to the mobile system. | 05-05-2011 |
20110106538 | SPEECH SYNTHESIS SYSTEM - This speech synthesis system includes a server device and a client device. The client device accepts text information representing text, and transmits a speech element request to the server device. The server device stores speech element information. The server device receives the speech element request transmitted by the client device and, in response to the received speech element request, transmits speech element information to the client device so that the speech element information is received by the client device in a different order from an order of arrangement of speech elements in speech corresponding to the text. The client device executes a speech synthesis process by rearranging the speech element information so that speech elements represented by the received speech element information are arranged in the same order as the order of arrangement of the speech elements in the speech corresponding to the text. | 05-05-2011 |
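The client-side step described in 20110106538 can be sketched in a few lines: speech element information arrives from the server in an order different from the text order, and the client rearranges it before synthesis. This is a minimal illustration assuming each element carries its position in the target utterance; the patent does not specify the wire format.

```python
# Hypothetical sketch: speech element info arrives out of order, each
# item tagged with its position in the target utterance; the client
# restores text order before concatenating the waveform chunks.

def reassemble(elements):
    """elements: list of (position_in_text, waveform_chunk) tuples,
    received in any order. Returns chunks concatenated in text order."""
    ordered = sorted(elements, key=lambda e: e[0])
    return b"".join(chunk for _, chunk in ordered)

# Example: three chunks received out of order.
received = [(2, b"speech."), (0, b"hello "), (1, b"synthetic ")]
print(reassemble(received))  # b'hello synthetic speech.'
```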
20110137655 | SPEECH SYNTHESIS SYSTEM - A speech synthesis system includes a server device and a client device. The server device stores speech element information and speech element identification information in association with each other so that, in a case that speech element information representing respective speech elements included in speech uttered by a speech registering user are arranged in the order of arrangement of the speech elements in the speech, at least one of speech element identification information identifying the respective speech element information has different information from information arranged in accordance with a predetermined rule. The client device transmits speech element identification information to the server device based on accepted text information. The client device executes a speech synthesis process based on the speech element information received from the server device. | 06-09-2011 |
20110153330 | SYSTEM AND METHOD FOR RENDERING TEXT SYNCHRONIZED AUDIO - One or more computing devices include software- and/or hardware-implemented processing units that synchronize textual content with audio content, where the textual content is made up of a sequence of textual units and the audio content is made up of a sequence of sound units. The system and/or method matches each of the sequence of sound units with a corresponding textual unit. The system and/or method determines a corresponding time of occurrence for each sound unit in the audio content relative to a time reference. Each matched textual unit is then associated with a tag that corresponds to the time of occurrence of the sound unit matched with the textual unit. | 06-23-2011 |
20110161085 | METHOD AND APPARATUS FOR AUDIO SUMMARY OF ACTIVITY FOR USER - Techniques for audio summary of activity for a user include tracking activity at one or more network sources associated with a user. One audio stream that summarizes the activity over a particular time period is generated. The audio stream is caused to be delivered to a particular device associated with the user. A duration of a complete rendering of the audio stream is shorter than the particular time period. In some embodiments, a link to content related to at least a portion of the audio stream is also caused to be delivered for a user. | 06-30-2011 |
20110166861 | METHOD AND APPARATUS FOR SYNTHESIZING A SPEECH WITH INFORMATION - According to one embodiment, an apparatus for synthesizing a speech, comprises an inputting unit configured to input a text sentence, a text analysis unit configured to analyze the text sentence so as to extract linguistic information, a parameter generation unit configured to generate a speech parameter by using the linguistic information and a pre-trained statistical parameter model, an embedding unit configured to embed information into the speech parameter, and a speech synthesis unit configured to synthesize the speech parameter with the information embedded by the embedding unit into a speech with the information. | 07-07-2011 |
20110184738 | NAVIGATION AND ORIENTATION TOOLS FOR SPEECH SYNTHESIS - TTS is a technology that has been used for decades in applications ranging from artificial call-center attendants to PC software that allows people with visual impairments or reading disabilities to listen to written works on a home computer. To date, however, TTS has not been widely adopted by PC and mobile users for daily reading tasks such as reading emails, PDF and Word documents, website content, and books. The present invention offers a new user experience for operating TTS in day-to-day usage. More specifically, this invention describes a synchronization technique for following text being read by TTS engines, and specific interfaces for touch pads and touch and multi-touch screens. This invention also describes the use of other input methods such as the touchpad, mouse, and keyboard. | 07-28-2011 |
20110184739 | COMMUNICATIONS SYSTEM PROVIDING AUTOMATIC TEXT-TO-SPEECH CONVERSION FEATURES AND RELATED METHODS - A communications system may include at least one mobile wireless communications device, and a wireless communications network for sending text messages thereto. More particularly, the at least one mobile wireless communications device may include a wireless transceiver and a controller for cooperating therewith for receiving text messages from the wireless communications network. It may further include a headset output connected to the controller. The controller may be for switching between a normal message mode and an audio message mode based upon a connection between the headset output and a headset. Moreover, when in the audio message mode, the controller may output at least one audio message including speech generated from at least one of the received text messages via the headset output. | 07-28-2011 |
20110196679 | Systems And Methods For Machine To Operator Communications - Systems and methods for machine to operator communications are disclosed. For example, one disclosed system includes a concentrator having a memory; a radio transmitter; and a processor in communication with the memory and the radio transmitter, the processor configured to: request information associated with a status of a machine; receive information associated with the request; determine a message based on the received information; generate an audio signal based on the message; and transmit the audio signal to the radio transmitter. | 08-11-2011 |
20110202344 | METHOD AND APPARATUS FOR PROVIDING SPEECH OUTPUT FOR SPEECH-ENABLED APPLICATIONS - Techniques for providing speech output for speech-enabled applications. A synthesis system receives from a speech-enabled application a text input including a text transcription of a desired speech output. The synthesis system selects one or more audio recordings corresponding to one or more portions of the text input. In one aspect, the synthesis system selects from audio recordings provided by a developer of the speech-enabled application. In another aspect, the synthesis system selects an audio recording of a speaker speaking a plurality of words. The synthesis system forms a speech output including the one or more selected audio recordings and provides the speech output for the speech-enabled application. | 08-18-2011 |
20110202345 | METHOD AND APPARATUS FOR GENERATING SYNTHETIC SPEECH WITH CONTRASTIVE STRESS - Techniques for generating synthetic speech with contrastive stress. In one aspect, a speech-enabled application generates a text input including a text transcription of a desired speech output, and inputs the text input to a speech synthesis system. The synthesis system generates an audio speech output corresponding to at least a portion of the text input, with at least one portion carrying contrastive stress, and provides the audio speech output for the speech-enabled application. In another aspect, a speech-enabled application inputs a plurality of text strings, each corresponding to a portion of a desired speech output, to a software module for rendering contrastive stress. The software module identifies a plurality of audio recordings that render at least one portion of at least one of the text strings as speech carrying contrastive stress. The speech-enabled application generates an audio speech output corresponding to the desired speech output using the audio recordings. | 08-18-2011 |
20110202346 | METHOD AND APPARATUS FOR GENERATING SYNTHETIC SPEECH WITH CONTRASTIVE STRESS - Techniques for generating synthetic speech with contrastive stress. In one aspect, a speech-enabled application generates a text input including a text transcription of a desired speech output, and inputs the text input to a speech synthesis system. The synthesis system generates an audio speech output corresponding to at least a portion of the text input, with at least one portion carrying contrastive stress, and provides the audio speech output for the speech-enabled application. In another aspect, a speech-enabled application inputs a plurality of text strings, each corresponding to a portion of a desired speech output, to a software module for rendering contrastive stress. The software module identifies a plurality of audio recordings that render at least one portion of at least one of the text strings as speech carrying contrastive stress. The speech-enabled application generates an audio speech output corresponding to the desired speech output using the audio recordings. | 08-18-2011 |
20110202347 | COMMUNICATION CONVERTER FOR CONVERTING AUDIO INFORMATION/TEXTUAL INFORMATION TO CORRESPONDING TEXTUAL INFORMATION/AUDIO INFORMATION - A communication converter is described for converting among speech signals and textual information, permitting communication between telephone users and textual instant communications users. | 08-18-2011 |
20110218809 | VOICE SYNTHESIS DEVICE, NAVIGATION DEVICE HAVING THE SAME, AND METHOD FOR SYNTHESIZING VOICE MESSAGE - A voice synthesis device includes: a memory for storing a plurality of recorded voice data; a dividing unit for dividing a text into a plurality of words or phrases, wherein the text is to be converted into a voice message; a verifying unit for verifying whether one of the recorded voice data corresponding to each word or phrase is disposed in the memory; and a voice synthesizing unit for preparing a whole of the text with the recorded voice data when all of the recorded voice data corresponding to all of the plurality of words or phrases are disposed in the memory, and for preparing the whole of the text with rule-based synthesized voice data when at least one of the recorded voice data corresponding to one of the plurality of words or phrases is not disposed in the memory. | 09-08-2011 |
20110231192 | System and Method for Audio Content Generation - A system and method for generating audio content. Content is automatically retrieved from an original website according to a predetermined schedule to generate retrieved content. The retrieved content is converted to one or more audio files. A hierarchy is assigned to the one or more audio files to provide an audible website that mimics the hierarchy of the retrieved content as represented at the original website. The audible website is stored in a database for retrieval by one or more users. A first user input is received indicating an attempt to access the original website. The audible website is indicated as being associated with the original website in response to the first user input. Portions of the audible website are played in response to a second user input. | 09-22-2011 |
20110231193 | SYNTHESIZED SINGING VOICE WAVEFORM GENERATOR - Various technologies for generating a synthesized singing voice waveform. In one implementation, the computer program may receive a request from a user to create a synthesized singing voice using the lyrics of a song and a digital file containing its melody as inputs. The computer program may then dissect the lyrics' text and its melody file into its corresponding sub-phonemic units and musical score respectively. The musical score may be further dissected into a sequence of musical notes and duration times for each musical note. The computer program may then determine a fundamental frequency (F0) | 09-22-2011 |
20110238420 | METHOD AND APPARATUS FOR EDITING SPEECH, AND METHOD FOR SYNTHESIZING SPEECH - According to one embodiment, a method for editing speech is disclosed. The method can generate speech information from a text. The speech information includes phonologic information and prosody information. The method can divide the speech information into a plurality of speech units, based on at least one of the phonologic information and the prosody information. The method can search at least two speech units from the plurality of speech units. At least one of the phonologic information and the prosody information in the at least two speech units are identical or similar. In addition, the method can store a speech unit waveform corresponding to one of the at least two speech units as a representative speech unit into a memory. | 09-29-2011 |
20110238421 | Speech Output Device, Control Method For A Speech Output Device, Printing Device, And Interface Board - A speech output device, a control method for a speech output device, a printer, and an interface board can improve the productivity of foreign language speaking workers in industries such as retailing and food services. A data communication unit | 09-29-2011 |
20110246200 | PRE-SAVED DATA COMPRESSION FOR TTS CONCATENATION COST - Pre-saved concatenation cost data is compressed through speech segment grouping. Speech segments are assigned to a predefined number of groups based on their concatenation cost values with other speech segments. A representative segment is selected for each group. The concatenation cost between two segments in different groups may then be approximated by that between the representative segments of their respective groups, thereby reducing an amount of concatenation cost data to be pre-saved. | 10-06-2011 |
20110246201 | SYSTEM FOR PROVIDING AUDIO MESSAGES ON A MOBILE DEVICE - While performing a function, a mobile device identifies that it is idle while it is downloading content or performing another task. During that idle time, it gathers one or more parameters (e.g., location, time, gender of user, age of user, etc.) and sends a request for an audio message (e.g., audio advertisement). One or more servers at a remote facility receive the request with the one or more parameters, and use the parameters to identify a targeted message. In some cases, the targeted message will include one or more dynamic variables (e.g., distance to store, time to event, etc.) that will be replaced based on the parameters received from the mobile device, so that the audio message is dynamically updated and customized for the mobile device. In one embodiment, the targeted message is transmitted to the mobile device as text. After being received at the mobile device, the text is optionally displayed and converted to an audio format and played for the user. | 10-06-2011 |
20110264452 | AUDIO OUTPUT OF TEXT DATA USING SPEECH CONTROL COMMANDS - Example embodiments disclosed herein relate to audio output of speech data using speech control commands. In particular, example embodiments include a mechanism for accessing text data. Example embodiments may also include a mechanism for outputting the text data as audio by converting the text data to speech audio data and transmitting the speech audio data over an audio output. Example embodiments may also include a mechanism for receiving speech control commands that allow for voice control of the output of the audio data. | 10-27-2011 |
20110270613 | INFERRING SWITCHING CONDITIONS FOR SWITCHING BETWEEN MODALITIES IN A SPEECH APPLICATION ENVIRONMENT EXTENDED FOR INTERACTIVE TEXT EXCHANGES - The disclosed solution includes a method for dynamically switching modalities based upon inferred conditions in a dialogue session involving a speech application. The method establishes a dialogue session between a user and the speech application. During the dialogue session, the user interacts using an original modality and a second modality. The speech application interacts using a speech modality only. A set of conditions indicative of interaction problems using the original modality can be inferred. Responsive to the inferring step, the original modality can be changed to the second modality. A modality transition to the second modality can be transparent to the speech application and can occur without interrupting the dialogue session. The original modality and the second modality are different modalities, one being a text exchange modality and the other a speech modality. | 11-03-2011 |
20110276332 | SPEECH PROCESSING METHOD AND APPARATUS - A speech synthesis method comprising: | 11-10-2011 |
20110282668 | SPEECH ADAPTATION IN SPEECH SYNTHESIS - A method of and system for speech synthesis. First and second text inputs are received in a text-to-speech system, and processed into respective first and second speech outputs corresponding to stored speech respectively from first and second speakers using a processor of the system. The second speech output of the second speaker is adapted to sound like the first speech output of the first speaker. | 11-17-2011 |
20110295606 | CONTEXTUAL CONVERSION PLATFORM - A contextual conversion platform, and method for converting text-to-speech, are described that can convert content of a target to spoken content. Embodiments of the contextual conversion platform can identify certain contextual characteristics of the content, from which can be generated a spoken content input. This spoken content input can include tokens, e.g., words and abbreviations, to be converted to the spoken content, as well as substitution tokens that are selected from contextual repositories based on the context identified by the contextual conversion platform. | 12-01-2011 |
20110313772 | SYSTEM AND METHOD FOR UNIT SELECTION TEXT-TO-SPEECH USING A MODIFIED VITERBI APPROACH - Disclosed herein are systems, methods, and non-transitory computer-readable storage media for speech synthesis. A system practicing the method receives a set of ordered lists of speech units, for each respective speech unit in each ordered list in the set of ordered lists, constructs a sublist of speech units from a next ordered list which are suitable for concatenation, performs a cost analysis of paths through the set of ordered lists of speech units based on the sublist of speech units for each respective speech unit, and synthesizes speech using a lowest cost path of speech units through the set of ordered lists based on the cost analysis. The ordered lists can be ordered based on the respective pitch of each speech unit. In one embodiment, speech units which do not have an assigned pitch can be assigned a pitch. | 12-22-2011 |
20110320204 | SYSTEMS AND METHODS FOR INPUT DEVICE AUDIO FEEDBACK - Systems, methods, apparatuses and computer program products configured to provide sound feedback for input devices are described. Embodiments take input from a digitizer, such as input using a stylus/pen, and produce sound feedback to enhance the user's input interface experience. Embodiments thus provide a user with a more realistic interface with an electronic device, emulating use of conventional writing implements. | 12-29-2011 |
20110320205 | ELECTRONIC BOOK READER - An electronic book reader includes a display, an audio output device, a text obtaining module, a storing module, a text displaying module, a text analyzing module, a text highlighting module, a speech synthesis module, a player module, and a synchronization control module. The text obtaining module obtains a text from a text source. The storing module stores the text. The text displaying module displays the text on the display. The text analyzing module divides the text into a plurality of segments according to the punctuation of the text, and reads a selected segment. The speech synthesis module converts the selected segment into speech. The synchronization control module sends a command to the text analyzing module for reading the segment, and sends the segment to the text highlighting module and speech synthesis module synchronously. | 12-29-2011 |
20110320206 | ELECTRONIC BOOK READER AND TEXT TO SPEECH CONVERTING METHOD - An electronic book reader includes a text obtaining module, a text highlighting module, a speech synthesis module, a player module, and a synchronization control module. The text obtaining module obtains a selected segment of a text. The text highlighting module highlights the selected segment. The speech synthesis module converts the selected segment into speech. The player module plays the speech. The synchronization control module sends the selected segment to the text highlighting module and speech synthesis module synchronously. | 12-29-2011 |
20110320207 | CODING, MODIFICATION AND SYNTHESIS OF SPEECH SEGMENTS - The invention relates to a method for speech signal analysis, modification and synthesis comprising a phase for the location of analysis windows by means of an iterative process for the determination of the phase of the first sinusoidal component and comparison between the phase value of said component and a predetermined value, a phase for the selection of analysis frames corresponding to an allophone and readjustment of the duration and the fundamental frequency according to certain thresholds and a phase for the generation of synthetic speech from synthesis frames taking the information of the closest analysis frame as spectral information of the synthesis frame and taking as many synthesis frames as periods that the synthetic signal has. The method allows a coherent location of the analysis windows within the periods of the signal and the exact generation of the synthesis instants in a manner synchronous with the fundamental period. | 12-29-2011 |
20120010888 | Method and System for Speech Synthesis and Advertising Service - Methods and systems for providing a network-accessible text-to-speech synthesis service are provided. The service accepts content as input. After extracting textual content from the input content, the service transforms the content into a format suitable for high-quality speech synthesis. Additionally, the service produces audible advertisements, which are combined with the synthesized speech. The audible advertisements themselves can be generated from textual advertisement content. | 01-12-2012 |
20120016675 | BROADCAST SYSTEM USING TEXT TO SPEECH CONVERSION - A broadcast signal receiver comprises a text data receiver for receiving broadcast text data for display to a user in relation to a user interface; a text-to-speech (TTS) converter for converting received text data into an audio speech signal, the TTS converter being operable to detect whether a word for conversion is included in a stored list of words for conversion and, if so, to convert that word according to a conversion defined by the stored list; and if not, to convert that word according to a set of predetermined conversion rules; a conversion memory storing the list of words for conversion by the TTS converter; and an update receiver for receiving additional words and associated conversions for storage in the conversion memory. | 01-19-2012 |
20120029920 | Cooperative Processing For Portable Reading Machine - A handheld device includes an image input device capable of acquiring images, circuitry to send a representation of the image to a remote computing system that performs at least one processing function related to processing the image and circuitry to receive from the remote computing system data based on processing the image by the remote system. | 02-02-2012 |
20120035933 | SYSTEM AND METHOD FOR SYNTHETIC VOICE GENERATION AND MODIFICATION - Disclosed herein are systems, methods, and non-transitory computer-readable storage media for generating a synthetic voice. A system configured to practice the method combines a first database of a first text-to-speech voice and a second database of a second text-to-speech voice to generate a combined database, selects from the combined database, based on a policy, voice units of a phonetic category for the synthetic voice to yield selected voice units, and synthesizes speech based on the selected voice units. The system can synthesize speech without parameterizing the first text-to-speech voice and the second text-to-speech voice. A policy can define, for a particular phonetic category, from which text-to-speech voice to select voice units. The combined database can include multiple text-to-speech voices from different speakers. The combined database can include voices of a single speaker speaking in different styles. The combined database can include voices of different languages. | 02-09-2012 |
20120035934 | SPEECH GENERATION DEVICE WITH A PROJECTED DISPLAY AND OPTICAL INPUTS - In several embodiments, a speech generation device is disclosed. The speech generation device may generally include a projector configured to project images in the form of a projected display onto a projection Surface, an optical input device configured to detect an input directed towards the projected display and a speaker configured to generate an audio output. In addition, the speech generation device may include a processing unit communicatively coupled to the projector, the optical input device and the speaker. The processing unit may include a processor and related computer readable medium configured to store instructions executable by the processor, wherein the instructions stored on the computer readable medium configure the speech generation device to generate text-to-speech output. | 02-09-2012 |
20120041765 | ELECTRONIC BOOK READER AND TEXT TO SPEECH CONVERTING METHOD - An electronic book reader includes a text obtaining module, a text analysis module, a speech synthesis module, a control module, and an audio output device. The text obtaining module is used for obtaining a selected segment of a text. The text analysis module is used for analyzing a time phrase of the selected segment to obtain a waiting time period according to the meaning of the time phrase in the selected segment. The speech synthesis module is used for converting the selected segment into speech. The control module is used for sending the content of the selected segment to the speech synthesis module, wherein the control module waits for the waiting time period after sending the time phrase to the speech synthesis module. The audio output device is used for playing the speech. | 02-16-2012 |
20120046947 | Assisted Reader - An electronic reading device for reading ebooks and other digital media items combines a touch surface electronic reading device with accessibility technology to provide a visually impaired user more control over his or her reading experience. In some implementations, the reading device can be configured to operate in at least two modes: a continuous reading mode and an enhanced reading mode. | 02-23-2012 |
20120046948 | METHOD AND APPARATUS FOR GENERATING AND DISTRIBUTING CUSTOM VOICE RECORDINGS OF PRINTED TEXT - A speech analysis module compares a subject text to the voice of a subject person reciting the text, and generates a personal voice library of the subject's voice. The library includes audio files of actual words spoken by the subject person, as well as morphological, syntactical and grammatical considerations affecting the pronunciation of words and pauses. Words not actually spoken by the subject can be artificially synthesized by an analysis of the subject's speech and pronunciation, and utilizing sounds and portions of words spoken by the subject. Upon request for an audio recording of an object text in the voice of the subject, an integration module retrieves discrete audio files from the personal voice library and artificially generates a voice recording of the object text in the voice of the subject. The generation and transmission of custom audio files can be part of a commercial transaction. | 02-23-2012 |
20120046949 | METHOD AND APPARATUS FOR GENERATING AND DISTRIBUTING A HYBRID VOICE RECORDING DERIVED FROM VOCAL ATTRIBUTES OF A REFERENCE VOICE AND A SUBJECT VOICE - A first person narrates a selected written text to generate a reference audio file. One or more parameters are selected from the sounds of the reference audio file, including the duration of a sound, the duration of a pause, the rise and fall of frequency relative to a reference frequency, and/or the volume differential between select sounds. A voice profile library contains a phonetic library of sounds spoken by a subject speaker. An integration module generates a preliminary audio file of the selected text in the voice of the subject speaker and then modifies individual sounds by the parameters from the reference file, forming a hybrid audio file. The hybrid audio file retains the tonality of the subject voice, but incorporates the rhythm, cadence and inflections of the reference voice. The reference audio file and/or the hybrid audio file are licensed or sold as part of a commercial transaction. | 02-23-2012 |
20120065979 | METHOD AND SYSTEM FOR TEXT TO SPEECH CONVERSION - A system and method for text to speech conversion. The method of performing text to speech conversion on a portable device includes: identifying a portion of text for conversion to speech format, wherein the identifying includes performing a prediction based on information associated with a user. While the portable device is connected to a power source, a text to speech conversion is performed on the portion of text to produce converted speech. The converted speech is stored into a memory device of the portable device. A reader application is executed, wherein a user request is received for narration of the portion of text. During the executing, the converted speech is accessed from the memory device and rendered to the user, responsive to the user request. | 03-15-2012 |
20120072224 | METHOD OF SPEECH SYNTHESIS - The present invention relates to a method of text-based speech synthesis, wherein at least one portion of a text is specified; the intonation of each portion is determined; target speech sounds are associated with each portion; physical parameters of the target speech sounds are determined; speech sounds most similar in terms of the physical parameters to the target speech sounds are found in a speech database; and speech is synthesized as a sequence of the found speech sounds. The physical parameters of said target speech sounds are determined in accordance with the determined intonation. The present method, when used in a speech synthesizer, allows improved quality of synthesized speech due to precise reproduction of intonation. | 03-22-2012 |
20120078633 | READING ALOUD SUPPORT APPARATUS, METHOD, AND PROGRAM - According to one embodiment, a reading aloud support apparatus includes a reception unit, a first extraction unit, a second extraction unit, an acquisition unit, a generation unit, and a presentation unit. The reception unit is configured to receive an instruction. The first extraction unit is configured to extract, as a partial document, a part of a document which corresponds to a range of words. The second extraction unit is configured to perform morphological analysis and to extract words as candidate words. The acquisition unit is configured to acquire attribute information items related to the candidate words. The generation unit is configured to perform weighting relating to a value corresponding to a distance, and to determine each of the candidate words to be preferentially presented, to generate a presentation order. The presentation unit is configured to present the candidate words and the attribute information items in accordance with the presentation order. | 03-29-2012 |
20120089400 | SYSTEMS AND METHODS FOR USING HOMOPHONE LEXICONS IN ENGLISH TEXT-TO-SPEECH - The present invention relates to information systems. More specifically, the present invention relates to infrastructure and techniques for improving Text-to-Speech-enabled applications. | 04-12-2012 |
20120089401 | METHODS AND APPARATUS TO AUDIBLY PROVIDE MESSAGES IN A MOBILE DEVICE - Methods and apparatus to audibly provide messages in a mobile device are described. An example method includes receiving a message at a mobile device, wherein the message includes an identification of a sender, an identification of a recipient, and message contents; determining that the message contents include a predetermined phrase; and, in response to determining that the message contents include the predetermined phrase, audibly presenting the message contents. | 04-12-2012 |
20120089402 | SPEECH SYNTHESIZER, SPEECH SYNTHESIZING METHOD AND PROGRAM PRODUCT - According to one embodiment, a speech synthesizer includes an analyzer, a first estimator, a selector, a generator, a second estimator, and a synthesizer. The analyzer analyzes text and extracts a linguistic feature. The first estimator selects a first prosody model adapted to the linguistic feature and estimates prosody information that maximizes a first likelihood representing probability of the selected first prosody model. The selector selects speech units that minimize a cost function determined in accordance with the prosody information. The generator generates a second prosody model that is a model of the prosody information of the speech units. The second estimator estimates prosody information that maximizes a third likelihood calculated on the basis of the first likelihood and a second likelihood representing probability of the second prosody model. The synthesizer generates synthetic speech by concatenating the speech units on the basis of the prosody information estimated by the second estimator. | 04-12-2012 |
20120109654 | METHODS AND APPARATUSES FOR FACILITATING SPEECH SYNTHESIS - Methods and apparatuses are provided for facilitating speech synthesis. A method may include generating a plurality of input models representing an input by using a statistical model synthesizer to statistically model the input. The method may further include determining a speech unit sequence representing at least a portion of the input by using the input models to influence selection of one or more pre-recorded speech units having parameter representations. The method may additionally include identifying one or more bad units in the unit sequence. The method may also include replacing the identified one or more bad units with one or more parameters generated by the statistical model synthesizer. Corresponding apparatuses are also provided. | 05-03-2012 |
20120109655 | WIRELESS SERVER BASED TEXT TO SPEECH EMAIL - An email system for mobile devices, such as cellular phones and PDAs, is disclosed which allows email messages to be played back on the mobile device as voice messages on demand by way of a media player, thus eliminating the need for a unified messaging system. Email messages are received by the mobile device in a known manner. In accordance with an important aspect of the invention, the email messages are identified by the mobile device as they are received. After the message is identified, the mobile device sends the email message in text format to a server for conversion to speech or voice format. After the message is converted to speech format, the server sends the messages back to the user's mobile device and notifies the user of the email message and then plays the message back to the user through a media player upon demand. | 05-03-2012 |
20120109656 | AUDIO OUTPUT OF A DOCUMENT FROM MOBILE DEVICE - Architecture for playing a document converted into an audio format to a user of an audio-output capable device. The user can interact with the device to control play of the audio document such as pause, rewind, forward, etc. In a more robust implementation, the audio-output capable device is a mobile device (e.g., cell phone) having a microphone for processing voice input. Voice commands can then be input to control play (“reading”) of the document audio file to pause, rewind, read paragraph, read next chapter, fast forward, etc. A communications server (e.g., email, attachments to email, etc.) transcodes text-based document content into an audio format by leveraging a text-to-speech (TTS) engine. The transcoded audio files are then transferred to mobile devices through viable transmission channels. Users can then play the audio-formatted document while freeing hand and eye usage for other tasks. | 05-03-2012 |
20120123781 | TOUCH SCREEN DEVICE FOR ALLOWING BLIND PEOPLE TO OPERATE OBJECTS DISPLAYED THEREON AND OBJECT OPERATING METHOD IN THE TOUCH SCREEN DEVICE - A touch screen device allowing blind people to operate objects displayed thereon, and an object operating method in the touch screen device, are provided. The touch screen device includes a touch sensing unit which, when sensing touches of a virtual keyboard while the virtual keyboard is activated, generates key values corresponding to the touched positions of the virtual keyboard for controlling application software being executed, the number of touches, and the touch time, and transmits the key values to the application software; an object determination unit which reads text information of a focused object using a hooking mechanism when the application software is executed based on the key values received from the touch sensing unit and an object among the objects included in the application software is focused; and a speech synthesis unit which converts the text information read by the object determination unit into speech data using a text-to-speech engine and outputs the speech data. | 05-17-2012 |
20120130718 | METHOD AND SYSTEM FOR COLLECTING AUDIO PROMPTS IN A DYNAMICALLY GENERATED VOICE APPLICATION - A prompt collecting tool | 05-24-2012 |
20120136664 | SYSTEM AND METHOD FOR CLOUD-BASED TEXT-TO-SPEECH WEB SERVICES - Disclosed herein are systems, methods, and non-transitory computer-readable storage media for generating speech. One variation of the method is from a server side, and another variation of the method is from a client side. The server side method, as implemented by a network-based automatic speech processing system, includes first receiving, from a network client independent of knowledge of internal operations of the system, a request to generate a text-to-speech voice. The request can include speech samples, transcriptions of the speech samples, and metadata describing the speech samples. The system extracts sound units from the speech samples based on the transcriptions and generates an interactive demonstration of the text-to-speech voice based on the sound units, the transcriptions, and the metadata, wherein the interactive demonstration hides a back end processing implementation from the network client. The system provides access to the interactive demonstration to the network client. | 05-31-2012 |
20120136665 | ELECTRONIC DEVICE AND CONTROL METHOD THEREOF - Disclosed are an electronic device and a control method thereof. The electronic device includes a text-to-speech unit which converts a text into an audio signal; an audio output unit which outputs audio corresponding to the converted audio signal; and a controller which controls the audio output unit to re-output at least one of the audios whose output was not completed, if there is at least one audio which was not completely output among a plurality of audios output by the audio output unit. | 05-31-2012 |
20120143611 | Trajectory Tiling Approach for Text-to-Speech - Hidden Markov Model (HMM) trajectory tiling (HTT)-based approaches may be used to synthesize speech from text. In operation, a set of HMMs and a set of waveform units may be obtained from a speech corpus. The set of HMMs is further refined via minimum generation error (MGE) training to generate a refined set of HMMs. Subsequently, a speech parameter trajectory may be generated by applying the refined set of HMMs to an input text. A unit lattice of candidate waveform units may be selected from the set of waveform units based at least on the speech parameter trajectory. A normalized cross-correlation (NCC)-based search on the unit lattice may be performed to obtain a minimal concatenation cost sequence of candidate waveform units, which are concatenated into a concatenated waveform sequence that is synthesized into speech. | 06-07-2012 |
20120150543 | Personality-Based Device - A personality-based theme may be provided. An application program may query a personality resource file for a prompt corresponding to a personality. Then the prompt may be received at a speech synthesis engine. Next, the speech synthesis engine may query a personality voice font database for a voice font corresponding to the personality. Then the speech synthesis engine may apply the voice font to the prompt. The voice font applied prompt may then be produced at an output device. | 06-14-2012 |
20120158406 | FACILITATING TEXT-TO-SPEECH CONVERSION OF A USERNAME OR A NETWORK ADDRESS CONTAINING A USERNAME - To facilitate text-to-speech conversion of a username, a first or last name of a user associated with the username may be retrieved, and a pronunciation of the username may be determined based at least in part on whether the name forms at least part of the username. To facilitate text-to-speech conversion of a domain name having a top level domain and at least one other level domain, a pronunciation for the top level domain may be determined based at least in part upon whether the top level domain is one of a predetermined set of top level domains. Each other level domain may be searched for one or more recognized words therewithin, and a pronunciation of the other level domain may be determined based at least in part on an outcome of the search. The username and domain name may form part of a network address such as an email address, URL or URI. | 06-21-2012 |
20120166198 | CONTROLLABLE PROSODY RE-ESTIMATION SYSTEM AND METHOD AND COMPUTER PROGRAM PRODUCT THEREOF - In one embodiment of a controllable prosody re-estimation system, a TTS/STS engine consists of a prosody prediction/estimation module, a prosody re-estimation module and a speech synthesis module. The prosody prediction/estimation module generates predicted or estimated prosody information. The prosody re-estimation module then re-estimates the predicted or estimated prosody information and produces new prosody information, according to a set of controllable parameters provided by a controllable prosody parameter interface. The new prosody information is provided to the speech synthesis module to produce a synthesized speech. | 06-28-2012 |
20120166199 | HOSTED VOICE RECOGNITION SYSTEM FOR WIRELESS DEVICES - Methods, systems, and software for converting the audio input of a user of a hand-held client device or mobile phone into a textual representation by means of a backend server accessed by the device through a communications network. The text is then inserted into or used by an application of the client device to send a text message, instant message, email, or to insert a request into a web-based application or service. In one embodiment, the method includes the steps of initializing or launching the application on the device; recording and transmitting the recorded audio message from the client device to the backend server through a client-server communication protocol; converting the transmitted audio message into the textual representation in the backend server; and sending the converted text message back to the client device or forwarding it on to an alternate destination directly from the server. | 06-28-2012 |
20120173241 | MULTI-LINGUAL TEXT-TO-SPEECH SYSTEM AND METHOD - A multi-lingual text-to-speech system and method processes a text to be synthesized via an acoustic-prosodic model selection module and an acoustic-prosodic model mergence module, and obtains a phonetic unit transformation table. In an online phase, the acoustic-prosodic model selection module, according to the text and a phonetic unit transcription corresponding to the text, uses at least one set of controllable accent weighting parameters to select a transformation combination and find a second and a first acoustic-prosodic model. The acoustic-prosodic model mergence module merges the two acoustic-prosodic models into a merged acoustic-prosodic model according to the at least one set of controllable accent weighting parameters, processes all transformations in the transformation combination, and generates a merged acoustic-prosodic model sequence. A speech synthesizer and the merged acoustic-prosodic model sequence are further applied to synthesize the text into an L1-accent L2 speech. | 07-05-2012 |
20120173242 | SYSTEM AND METHOD FOR EXCHANGE OF SCRIBBLE DATA BETWEEN GSM DEVICES ALONG WITH VOICE - A method for transferring scribble data along with voice includes connecting at least two electronic devices through a GSM network, accumulating and down sampling the scribble coordinates, which are converted to a speech-like signal that is sent along with voice data packets simultaneously in the GSM network. | 07-05-2012 |
20120179468 | Automatic Dominant Orientation Estimation In Text Images Based On Steerable Filters - Briefly, in accordance with one or more embodiments, an image processing system is capable of receiving an image containing text, applying optical character recognition to the image, and then audibly reproducing the text via text-to-speech synthesis. Prior to optical character recognition, an orientation corrector is capable of detecting an amount of angular rotation of the text in the image with respect to horizontal, and then rotating the image by an appropriate amount to sufficiently align the text with respect to horizontal for optimal optical character recognition. The detection may be performed using steerable filters to provide an energy versus orientation curve of the image data. A maximum of the energy curve may indicate the amount of angular rotation that may be corrected by the orientation corrector. | 07-12-2012 |
20120185253 | EXTRACTING TEXT FOR CONVERSION TO AUDIO - Embodiments are disclosed that relate to converting markup content to an audio output. For example, one disclosed embodiment provides, in a computing device, a method including partitioning a markup document into a plurality of content panels, and forming a subset of content panels by filtering the plurality of content panels based upon geometric and/or location-based criteria of each panel relative to an overall organization of the markup document. The method further includes determining a document object model (DOM) analysis value for each content panel of the subset of content panels, identifying a set of content panels determined to contain text body content by filtering the subset of content panels based upon the DOM analysis value of each of the content panels of the subset of content panels, and converting text in a selected content panel determined to contain text body content to an audio output. | 07-19-2012 |
20120191457 | METHODS AND APPARATUS FOR PREDICTING PROSODY IN SPEECH SYNTHESIS - Techniques for predicting prosody in speech synthesis may make use of a data set of example text fragments with corresponding aligned spoken audio. To predict prosody for synthesizing an input text, the input text may be compared with the data set of example text fragments to select a best matching sequence of one or more example text fragments, each example text fragment in the sequence being paired with a portion of the input text. The selected example text fragment sequence may be aligned with the input text, e.g., at the word level, such that prosody may be extracted from the audio aligned with the example text fragments, and the extracted prosody may be applied to the synthesis of the input text using the alignment between the input text and the example text fragments. | 07-26-2012 |
20120197646 | Open Architecture For a Voice User Interface - A system and method for processing voice requests from a user for accessing information on a computerized network and delivering information from a script server and an audio server in the network in audio format. A voice user interface subsystem includes: a dialog engine that is operable to interpret requests from users from the user input, communicate the requests to the script server and the audio server, and receive information from the script server and the audio server; a media telephony services (MTS) server, wherein the MTS server is operable to receive user input via a telephony system, and to transfer the user input to the dialog engine; and a broker coupled between the dialog engine and the MTS server. The broker establishes a session between the MTS server and the dialog engine and controls telephony functions with the telephony system. | 08-02-2012 |
20120203554 | SYSTEMS AND METHODS FOR PROVIDING EMERGENCY INFORMATION - In one general aspect, emergency information for a person is received from a user. A unique identifier for the person is generated. The unique identifier is associated with the emergency information. The emergency information is stored on an emergency information device. The unique identifier is associated with the emergency information device. The emergency information device is sent to the user. | 08-09-2012 |
20120215540 | METHOD FOR CONVERTING CHARACTER TEXT MESSAGES TO AUDIO FILES WITH RESPECTIVE TITLES FOR THEIR SELECTION AND READING ALOUD WITH MOBILE DEVICES - The present invention relates to a method for selecting and downloading content from a content provider which is accessible via an IP/DNS/URL address to a mobile device, the content being any text information data, for converting the text information data to at least one audio message and for storing the at least one audio message as at least one audio file on the mobile device, wherein the at least one audio file is playable and discernable as a music file. Said method, implemented on a mobile phone, enables controlling and playing the audio messages as music files, for instance in a car environment with a car kit enabling control and selection of one or more of said audio files for playing from the mobile phone. | 08-23-2012 |
20120221338 | AUTOMATICALLY GENERATING AUDIBLE REPRESENTATIONS OF DATA CONTENT BASED ON USER PREFERENCES - A custom-content audible representation of selected data content is automatically created for a user. The content is based on content preferences of the user (e.g., one or more web browsing histories). The content is aggregated, converted using text-to-speech technology, and adapted to fit in a desired length selected for the personalized audible representation. The length of the audible representation may be custom for the user, and may be determined based on the amount of time the user is typically traveling. | 08-30-2012 |
20120221339 | METHOD, APPARATUS FOR SYNTHESIZING SPEECH AND ACOUSTIC MODEL TRAINING METHOD FOR SPEECH SYNTHESIS - According to one embodiment, a method and apparatus for synthesizing speech, and a method for training an acoustic model used in speech synthesis, are provided. The method for synthesizing speech may include determining data generated by text analysis as fuzzy heteronym data, performing fuzzy heteronym prediction on the fuzzy heteronym data to output a plurality of candidate pronunciations of the fuzzy heteronym data and probabilities thereof, generating fuzzy context feature labels based on the plurality of candidate pronunciations and probabilities thereof, determining model parameters for the fuzzy context feature labels based on an acoustic model with a fuzzy decision tree, generating speech parameters from the model parameters, and synthesizing the speech parameters via a synthesizer as speech. | 08-30-2012 |
20120221340 | SCRIPTING SUPPORT FOR DATA IDENTIFIERS, VOICE RECOGNITION AND VOICE INPUT IN A TELNET SESSION - Methods of adding data identifiers and speech/voice recognition functionality are disclosed. A telnet client runs one or more scripts that add data identifiers to data fields in a telnet session. The input data is inserted in the corresponding fields based on data identifiers. Scripts run only on the telnet client without modifications to the server applications. Further disclosed are methods for providing speech recognition and voice functionality to telnet clients. Portions of input data are converted to voice and played to the user. A user also may provide input to certain fields of the telnet session by using his voice. Scripts running on the telnet client convert the user's voice into text, which is inserted into the corresponding fields. | 08-30-2012 |
20120226500 | SYSTEM AND METHOD FOR CONTENT RENDERING INCLUDING SYNTHETIC NARRATION - A system and method for capturing voice information and using the voice information to modulate a content output signal. The method for capturing voice information includes receiving a request to create speech modulation and presenting a piece of textual content operable for use in creating the speech modulation based on the textual input. The method further includes receiving a first voice sample and determining a voice fingerprint based on said first voice sample. The voice fingerprint is operable for modulating speech during content rendering (e.g., audio output) such that a synthetic narration is performed based on the textual input. The voice fingerprint may then be stored and used for modulating the output. | 09-06-2012 |
20120226501 | Document Navigation Method - A document navigation tool that automatically navigates a document based on previous input from the user. The document navigation tool is utilized each time a page loads. The method recognizes user behavior on pages using patterns, which are based on four criteria: location, frequency, consistency, and scope. If the user has visited the page previously and has established a pattern, the method automatically focuses on the portion of the page indicated by the pattern, e.g. the location on a web page of the link clicked by the user in the user's last three visits to the page. If the user has not visited the page previously, the method logs the events that occur during this visit to the page. | 09-06-2012 |
20120239404 | APPARATUS AND METHOD FOR EDITING SPEECH SYNTHESIS, AND COMPUTER READABLE MEDIUM - An acquisition unit analyzes a text, and acquires phonemic and prosodic information. An editing unit edits a part of the phonemic and prosodic information. A speech synthesis unit converts the phonemic and prosodic information before editing the part to a first speech waveform, and converts the phonemic and prosodic information after editing the part to a second speech waveform. A period calculation unit calculates a contrast period corresponding to the part in the first speech waveform and the second speech waveform. A speech generation unit generates an output waveform by connecting a first partial waveform and a second partial waveform. The first partial waveform contains the contrast period of the first speech waveform. The second partial waveform contains the contrast period of the second speech waveform. | 09-20-2012 |
20120239405 | SYSTEM AND METHOD FOR GENERATING AUDIO CONTENT - A system and method for generating audio content. Content is automatically retrieved from a website. The content is converted to audio files. The audio files are associated with a hierarchy. The hierarchy is determined from the website. One or more audio files are communicated to an electronic device utilized by a user in response to a request from the user. | 09-20-2012 |
20120253814 | SYSTEM AND METHOD FOR WEB TEXT CONTENT AGGREGATION AND PRESENTATION - A system and method for aggregating text-based content and presenting the text-based content as spoken audio is described herein, where a server module retrieves and aggregates web content from web content providers that may include text-based web content that is then extracted, filtered, and categorized for a client module to retrieve and play as spoken audio. | 10-04-2012 |
20120253815 | TALKING PAPER AUTHORING TOOLS - A range of unified software authoring tools for creating a talking paper application for integration in an end user platform are described herein. The authoring tools are easy to use and are interoperable to provide an easy and cost-effective method of creating a talking paper application. The authoring tools provide a framework for creating audio content and image content and interactively linking the audio content and the image content. The authoring tools also provide for verifying the interactively linked audio and image content, reviewing the audio content, the image content and the interactive linking on a display device. Finally, the authoring tools provide for saving the audio content, the video content and the interactive linking for publication to a manufacturer for integration in an end user platform or talking paper platform. | 10-04-2012 |
20120253816 | TEXT-TO-SPEECH USER'S VOICE COOPERATIVE SERVER FOR INSTANT MESSAGING CLIENTS - A system and method to allow an author of an instant message to enable and control the production of audible speech to the recipient of the message. The voice of the author of the message is characterized into parameters compatible with a formant-based or articulatory text-to-speech engine such that upon receipt, the receiving client device can generate audible speech signals from the message text according to the characterization of the author's voice. Alternatively, the author can store samples of his or her actual voice in a server so that, upon transmission of a message by the author to a recipient, the server extracts the samples needed only to synthesize the words in the text message, and delivers those to the receiving client device so that they are used by a client-side concatenative text-to-speech engine to generate audible speech signals having a close likeness to the actual voice of the author. | 10-04-2012 |
20120265532 | System For Natural Language Assessment of Relative Color Quality - Embodiments of the invention include a system for providing a natural language objective assessment of relative color quality between a reference and a source image. The system may include a color converter that receives a difference measurement between the reference image and source image and determines a color attribute change based on the difference measurement. The color attributes may include hue shift, saturation changes, and color variation, for instance. Additionally, a magnitude index facility determines a magnitude of the determined color attribute change. Further, a natural language selector maps the color attribute change and the magnitude of the change to natural language and generates a report of the color attribute change and the magnitude of the color attribute change. The output can then be communicated to a user in either text or audio form, or in both text and audio forms. | 10-18-2012 |
20120265533 | VOICE ASSIGNMENT FOR TEXT-TO-SPEECH OUTPUT - Text can be obtained at a device from various forms of communication such as e-mails or text messages. Metadata can be obtained directly from the communication or from a secondary source identified by the directly obtained metadata. The metadata can be used to create a speaker profile. The speaker profile can be used to select voice data. The selected voice data can be used by a text-to-speech (TTS) engine to produce speech output having voice characteristics that best match the speaker profile. | 10-18-2012 |
20120278081 | TEXT TO SPEECH METHOD AND SYSTEM - A text-to-speech method for use in a plurality of languages, including: inputting text in a selected language; dividing the inputted text into a sequence of acoustic units; converting the sequence of acoustic units to a sequence of speech vectors using an acoustic model, wherein the model has a plurality of model parameters describing probability distributions which relate an acoustic unit to a speech vector; and outputting the sequence of speech vectors as audio in the selected language. A parameter of a predetermined type of each probability distribution in the selected language is expressed as a weighted sum of language independent parameters of the same type. The weighting used is language dependent, such that converting the sequence of acoustic units to a sequence of speech vectors includes retrieving the language dependent weights for the selected language. | 11-01-2012 |
20120278082 | COMBINING WEB BROWSER AND AUDIO PLAYER FUNCTIONALITY TO FACILITATE ORGANIZATION AND CONSUMPTION OF WEB DOCUMENTS - The invention is directed to combining web browser and audio player functionality for the organization and consumption of web documents. Specifically, the invention identifies a set of web documents via a web browser, extracts content from the web documents, and adds the set of web documents to a playlist. In this way, users can build a playlist of web documents and utilize the functionality and convenience of an audio player and listen to the content of the playlist. | 11-01-2012 |
20120284028 | METHODS AND APPARATUS TO PRESENT A VIDEO PROGRAM TO A VISUALLY IMPAIRED PERSON - Methods and apparatus to present a video program to a visually impaired person are disclosed. An example method comprises detecting a text portion of a media stream including a video stream, the text portion not being consumable by a blind person, retrieving text associated with the text portion of the media stream, and converting the text to a first audio stream based on a first type of a first program in the media stream, and converting the text to a second audio stream based on a second type of a second program in the media stream. | 11-08-2012 |
20120290304 | Electronic Holder for Reading Books - A book support and optical scanner assembly for converting printed text to an audio output includes a support for supporting an open book and a pair of optical scanners adapted to scan opposite pages. The assembly also includes means for moving the scanners from the top to the bottom of a page. Further, both scanners can be rotated off of the book for turning a page. In addition, the assembly includes a text to audio converter for converting the scanned text into spoken words and, in one embodiment, a translator to translate the scanned text into a pre-selected language. | 11-15-2012 |
20120296654 | SYSTEMS AND METHODS FOR DYNAMICALLY IMPROVING USER INTELLIGIBILITY OF SYNTHESIZED SPEECH IN A WORK ENVIRONMENT - Method and apparatus that dynamically adjusts operational parameters of a text-to-speech engine in a speech-based system. A voice engine or other application of a device provides a mechanism to alter the adjustable operational parameters of the text-to-speech engine. In response to one or more environmental conditions, the adjustable operational parameters of the text-to-speech engine are modified to increase the intelligibility of synthesized speech. | 11-22-2012 |
20120303371 | METHODS AND APPARATUS FOR ACOUSTIC DISAMBIGUATION - Techniques for disambiguating at least one text segment from at least one acoustically similar word and/or phrase. The techniques include identifying at least one text segment, in a textual representation having a plurality of text segments, having at least one acoustically similar word and/or phrase, annotating the textual representation with disambiguating information to help disambiguate the at least one text segment from the at least one acoustically similar word and/or phrase, and synthesizing a speech signal, at least in part, by performing text-to-speech synthesis on at least a portion of the textual representation that includes the at least one text segment, wherein the speech signal includes speech corresponding to the disambiguating information located proximate the portion of the speech signal corresponding to the at least one text segment. | 11-29-2012 |
20120310649 | SWITCHING BETWEEN TEXT DATA AND AUDIO DATA BASED ON A MAPPING - Techniques are provided for creating a mapping that maps locations in audio data (e.g., an audio book) to corresponding locations in text data (e.g., an e-book). Techniques are provided for using a mapping between audio data and text data, whether the mapping is created automatically or manually. A mapping may be used for bookmark switching where a bookmark established in one version of a digital work (e.g., e-book) is used to identify a corresponding location with another version of the digital work (e.g., an audio book). Alternatively, the mapping may be used to play audio that corresponds to text selected by a user. Alternatively, the mapping may be used to automatically highlight text in response to audio that corresponds to the text being played. Alternatively, the mapping may be used to determine where an annotation created in one media context (e.g., audio) will be consumed in another media context. | 12-06-2012 |
20120316881 | SPEECH SYNTHESIZER, SPEECH SYNTHESIS METHOD, AND SPEECH SYNTHESIS PROGRAM - A normalized spectrum storage unit | 12-13-2012 |
20120323578 | Text-to-Speech Device and Text-to-Speech Method - A sound control section ( | 12-20-2012 |
20120330665 | PRESCRIPTION LABEL READER - A system is configured to read a prescription label and output audio information corresponding to prescription information present on or linked to the prescription label. The system may have knowledge about prescription labels and prescription information, and use this knowledge to present the audio information in a structured form to the user. | 12-27-2012 |
20120330666 | METHOD, SYSTEM AND PROCESSOR-READABLE MEDIA FOR AUTOMATICALLY VOCALIZING USER PRE-SELECTED SPORTING EVENT SCORES - A method and system for vocalizing user-selected sporting event scores. A customized spoken score application module can be configured in association with a device. A real-time score can be preselected by a user from an existing sporting event website for automatically vocalizing the score in a multitude of languages utilizing a speech synthesizer and a translation engine. An existing text-to-speech engine can be integrated with the spoken score application module and controlled by the application module to automatically vocalize the preselected scores listed on the sporting event site. The synthetically-voiced, real-time score can be transmitted to the device at a predetermined time interval. Such an approach automatically and instantly pushes the real-time vocal alerts, thereby permitting the user to continue multitasking without activating the pre-selected vocal alerts. | 12-27-2012 |
20120330667 | SPEECH SYNTHESIZER, NAVIGATION APPARATUS AND SPEECH SYNTHESIZING METHOD - Included in a speech synthesizer, a natural language processing unit divides text data, input from a text input unit, into a plurality of components (particularly, words). An importance prediction unit estimates an importance level of each component according to the degree to which each component contributes to understanding when a listener hears synthesized speech. Then, the speech synthesizer determines a processing load based on the device state when executing synthesis processing and the importance level. Included in the speech synthesizer, a synthesizing control unit and a wave generation unit reduce the processing time for a phoneme with a low importance level by curtailing its processing load (relatively degrading its sound quality), allocate a part of the processing time, made available by this reduction, to the processing time of a phoneme with a high importance level, and generate synthesized speech in which important words are easily audible. | 12-27-2012 |
20120330668 | AUTOMATED METHOD AND SYSTEM FOR OBTAINING USER-SELECTED REAL-TIME INFORMATION ON A MOBILE COMMUNICATION DEVICE - A customized live tile application module can be configured in association with the mobile communication device in order to automatically vocalize the real-time information preselected by a user in a multitude of languages. A text-to-speech application module can be integrated with the customized live tile application module to automatically vocalize the preselected real-time information. The real-time information can be obtained from a tile and/or a website integrated with a remote server and announced after a text-to-speech conversion process without opening the tile, if the tiles are selected for announcement of information by the device. Such an approach automatically and instantly pushes a vocal alert with respect to the user-selected real-time information on the mobile communication device, thereby permitting the user to continue multitasking. Information from tiles can also be rendered on second screens from a mobile device. | 12-27-2012 |
20130013313 | STATISTICAL ENHANCEMENT OF SPEECH OUTPUT FROM A STATISTICAL TEXT-TO-SPEECH SYNTHESIS SYSTEM - A method, system and computer program product are provided for enhancement of speech synthesized by a statistical text-to-speech (TTS) system employing a parametric representation of speech in a space of acoustic feature vectors. The method includes: defining a parametric family of corrective transformations operating in the space of the acoustic feature vectors and dependent on a set of enhancing parameters; and defining a distortion indicator of a feature vector or a plurality of feature vectors. The method further includes: receiving a feature vector output by the system; and generating an instance of the corrective transformation by: calculating a reference value of the distortion indicator attributed to a statistical model of the phonetic unit emitting the feature vector; calculating an actual value of the distortion indicator attributed to feature vectors emitted by the statistical model of the phonetic unit emitting the feature vector; calculating the enhancing parameter values depending on the reference value of the distortion indicator, the actual value of the distortion indicator and the parametric corrective transformation; and deriving an instance of the corrective transformation corresponding to the enhancing parameter values from the parametric family of the corrective transformations. The instance of the corrective transformation may be applied to the feature vector to provide an enhanced feature vector. | 01-10-2013 |
20130013314 | MOBILE COMPUTING APPARATUS AND METHOD OF REDUCING USER WORKLOAD IN RELATION TO OPERATION OF A MOBILE COMPUTING APPARATUS - A mobile computing apparatus comprises a processing resource arranged to support, when in use, an operational environment, the operational environment supporting receipt of textual content, a workload estimator arranged to estimate a cognitive workload for a user, and a text-to-speech engine. The text-to-speech engine is arranged to translate at least part of the received textual content to a signal reproducible as audible speech in accordance with a predetermined relationship between the amount of the textual content to be translated and a cognitive workload level in a range of cognitive workload levels, the range of cognitive workload levels comprising at least one cognitive workload level between end values. | 01-10-2013 |
20130018658 | DYNAMICALLY EXTENDING THE SPEECH PROMPTS OF A MULTIMODAL APPLICATION - A prompt generation engine operates to dynamically extend prompts of a multimodal application. The prompt generation engine receives a media file having a metadata container. The prompt generation engine operates on a multimodal device that supports a voice mode and a non-voice mode for interacting with the multimodal device. The prompt generation engine retrieves from the metadata container a speech prompt related to content stored in the media file for inclusion in the multimodal application. The prompt generation engine modifies the multimodal application to include the speech prompt. | 01-17-2013 |
20130030810 | FRUGAL METHOD AND SYSTEM FOR CREATING SPEECH CORPUS - The present invention provides a frugal method for extraction of speech data and associated transcription from a plurality of web resources (internet) for speech corpus creation, characterized by an automation of the speech corpus creation and cost reduction. An integration of existing speech corpus with extracted speech data and its transcription from the web resources builds an aggregated rich speech corpus that is effective and easy to adapt for generating acoustic and language models for Automatic Speech Recognition (ASR) systems. | 01-31-2013 |
20130041668 | VOICE LEARNING APPARATUS, VOICE LEARNING METHOD, AND STORAGE MEDIUM STORING VOICE LEARNING PROGRAM - A voice learning apparatus includes a learning-material voice storage unit that stores learning material voice data including example sentence voice data; a learning text storage unit that stores a learning material text including an example sentence text; a learning-material text display controller that displays the learning material text; a learning-material voice output controller that performs voice output based on the learning material voice data; an example sentence specifying unit that specifies the example sentence text during the voice output; an example-sentence voice output controller that performs voice output based on the example sentence voice data associated with the specified example sentence text; and a learning-material voice output restart unit that restarts the voice output from a position where the voice output is stopped last time, after the voice output is performed based on the example sentence voice data. | 02-14-2013 |
20130041669 | SPEECH OUTPUT WITH CONFIDENCE INDICATION - A method, system, and computer program product are provided for speech output with confidence indication. The method includes receiving a confidence score for segments of speech or text to be synthesized to speech. The method includes modifying a speech segment by altering one or more parameters of the speech proportionally to the confidence score. | 02-14-2013 |
20130046541 | APPARATUS FOR ASSISTING VISUALLY IMPAIRED PERSONS TO IDENTIFY PERSONS AND OBJECTS AND METHOD FOR OPERATION THEREOF - An apparatus for assisting visually impaired persons includes a headset. A camera is mounted on the headset. A microprocessor communicates with the camera for receiving an optically read code captured by the camera and converting the optically read code to an audio signal as a function of a trigger contained within the optical code. A speaker communicating with the processor outputs the audio signal. | 02-21-2013 |
20130054244 | METHOD AND SYSTEM FOR ACHIEVING EMOTIONAL TEXT TO SPEECH - A method and system for achieving emotional text to speech. The method includes: receiving text data; generating an emotion tag for the text data by a rhythm piece; and achieving TTS to the text data corresponding to the emotion tag, where the emotion tag is expressed as a set of emotion vectors; where each emotion vector includes a plurality of emotion scores given based on a plurality of emotion categories. A system for the same includes: a text data receiving module; an emotion tag generating module; and a TTS module for achieving TTS, wherein the emotion tag is expressed as a set of emotion vectors; and wherein each emotion vector includes a plurality of emotion scores given based on a plurality of emotion categories. | 02-28-2013 |
20130066632 | SYSTEM AND METHOD FOR ENRICHING TEXT-TO-SPEECH SYNTHESIS WITH AUTOMATIC DIALOG ACT TAGS - Disclosed herein are systems, methods, and non-transitory computer-readable storage media for modifying the prosody of synthesized speech based on an associated speech act. A system configured according to the method embodiment (1) receives text, (2) performs an analysis of the text to determine and assign a speech act label to the text, and (3) converts the text to speech, where the speech prosody is based on the speech act label. The analysis performed compares the text to a corpus of previously tagged utterances to find a close match, determines a confidence score from a correlation of the text and the close match, and, if the confidence score is above a threshold value, retrieves the speech act label of the close match and assigns it to the text. | 03-14-2013 |
20130073287 | VOICE PRONUNCIATION FOR TEXT COMMUNICATION - A method, computer program product, and system for voice pronunciation for text communication is described. A selected portion of a text communication is determined. A prompt to record a pronunciation relating to the selected portion of the text communication is provided at a first computing device. The recorded pronunciation is associated with the selected portion of the text communication. A visual indicator, relating to the selected portion of the text communication and the recorded pronunciation, is displayed. | 03-21-2013 |
20130073288 | Wireless Server Based Text to Speech Email - An email system for mobile devices, such as cellular phones and PDAs, is disclosed which allows email messages to be played back on the mobile device as voice messages on demand by way of a media player, thus eliminating the need for a unified messaging system. Email messages are received by the mobile device in a known manner. In accordance with an important aspect of the invention, the email messages are identified by the mobile device as they are received. After the message is identified, the mobile device sends the email message in text format to a server for conversion to speech or voice format. After the message is converted to speech format, the server sends the messages back to the user's mobile device and notifies the user of the email message and then plays the message back to the user through a media player upon demand. | 03-21-2013 |
20130080172 | OBJECTIVE EVALUATION OF SYNTHESIZED SPEECH ATTRIBUTES - A method of evaluating attributes of synthesized speech. The method includes processing a text input into a synthesized speech utterance using a processor of a text-to-speech system, applying a human speech utterance to a speech model to obtain a reference wherein the human speech utterance corresponds to the text input, applying the synthesized speech utterance to at least one of the speech model or an other speech model to obtain a test, and calculating a difference between the test and the reference. The method also can be used in a speech synthesis method. | 03-28-2013 |
20130080173 | CORRECTING UNINTELLIGIBLE SYNTHESIZED SPEECH - A method and system of speech synthesis. A text input is received in a text-to-speech system and, using a processor of the system, the text input is processed into synthesized speech which is established as unintelligible. The text input is reprocessed into subsequent synthesized speech and output to a user via a loudspeaker to correct the unintelligible synthesized speech. In one embodiment, the synthesized speech can be established as unintelligible by predicting intelligibility of the synthesized speech, and determining that the predicted intelligibility is lower than a minimum threshold. In another embodiment, the synthesized speech can be established as unintelligible by outputting the synthesized speech to the user via the loudspeaker, and receiving an indication from the user that the synthesized speech is not intelligible. | 03-28-2013 |
20130080174 | RETRIEVING DEVICE, RETRIEVING METHOD, AND COMPUTER PROGRAM PRODUCT - In an embodiment, a retrieving device includes: a text input unit, a first extracting unit, a retrieving unit, a second extracting unit, an acquiring unit, and a selecting unit. The text input unit inputs a text including unknown word information representing a phrase that a user was unable to transcribe. The first extracting unit extracts related words representing a phrase related to the unknown word information among phrases other than the unknown word information included in the text. The retrieving unit retrieves a related document representing a document including the related words. The second extracting unit extracts candidate words representing candidates for the unknown word information from a plurality of phrases included in the related document. The acquiring unit acquires reading information representing estimated pronunciation of the unknown word information. The selecting unit selects at least one candidate word whose pronunciation is similar to the reading information. | 03-28-2013 |
20130080175 | MARKUP ASSISTANCE APPARATUS, METHOD AND PROGRAM - According to one embodiment, a markup assistance apparatus includes an acquisition unit, a first calculation unit, a detection unit and a presentation unit. The acquisition unit acquires feature amounts for respective tags, each of the tags being used to control text-to-speech processing of a markup text. The first calculation unit calculates, for respective character strings, a variance of feature amounts of the tags which are assigned to the character string in a markup text. The detection unit detects a first character string, assigned a first tag having a variance not less than a first threshold value, as a first candidate including a tag to be corrected. The presentation unit presents the first candidate. | 03-28-2013 |
20130080176 | Methods and Apparatus for Rapid Acoustic Unit Selection From a Large Speech Corpus - A speech synthesis system can select recorded speech fragments, or acoustic units, from a very large database of acoustic units to produce artificial speech. The selected acoustic units are chosen to minimize a combination of target and concatenation costs for a given sentence. However, as concatenation costs, which are measures of the mismatch between sequential pairs of acoustic units, are expensive to compute, processing can be greatly reduced by pre-computing and caching the concatenation costs. The number of possible sequential pairs of acoustic units makes such caching prohibitive. Statistical experiments reveal that while about 85% of the acoustic units are typically used in common speech, less than 1% of the possible sequential pairs of acoustic units occur in practice. The system synthesizes a large body of speech, identifies the acoustic unit sequential pairs generated and their respective concatenation costs, and stores those concatenation costs likely to occur. | 03-28-2013 |
20130085758 | Telecare and/or telehealth communication method and system - A telecare and/or telehealth communication method is described. The method comprises providing predetermined voice messages configured to ask questions to or to give instructions to an assisted individual, providing an algorithm configured to communicate with the assisted individual, and communicating at least one of the predetermined voice messages configured to ask questions to or to give instructions to the assisted individual. The method further comprises analyzing a responsiveness and/or compliance characteristics of the assisted individual, and providing the assisted individual with voice messages in a form most acceptable and effective for the assisted individual on the basis of the analyzed responsiveness and/or the analyzed compliance characteristics. | 04-04-2013 |
20130085759 | SPEECH SAMPLES LIBRARY FOR TEXT-TO-SPEECH AND METHODS AND APPARATUS FOR GENERATING AND USING SAME - A method for converting text into speech with a speech sample library is provided. The method comprises converting an input text to a sequence of triphones; determining musical parameters of each phoneme in the sequence of triphones; detecting, in the speech sample library, speech segments having at least the determined musical parameters; and concatenating the detected speech segments. | 04-04-2013 |
20130085760 | TRAINING AND APPLYING PROSODY MODELS - Techniques for training and applying prosody models for speech synthesis are provided. A speech recognition engine processes audible speech to produce text annotated with prosody information. A prosody model is trained with this annotated text. After initial training, the model is applied during speech synthesis to generate speech with non-standard prosody from input text. Multiple prosody models can be used to represent different prosody styles. | 04-04-2013 |
20130096920 | FACILITATING TEXT-TO-SPEECH CONVERSION OF A USERNAME OR A NETWORK ADDRESS CONTAINING A USERNAME - To facilitate text-to-speech conversion of a username, a first or last name of a user associated with the username may be retrieved, and a pronunciation of the username may be determined based at least in part on whether the name forms at least part of the username. To facilitate text-to-speech conversion of a domain name having a top level domain and at least one other level domain, a pronunciation for the top level domain may be determined based at least in part upon whether the top level domain is one of a predetermined set of top level domains. Each other level domain may be searched for one or more recognized words therewithin, and a pronunciation of the other level domain may be determined based at least in part on an outcome of the search. The username and domain name may form part of a network address such as an email address, URL or URI. | 04-18-2013 |
20130096921 | INFORMATION PROVIDING SYSTEM AND VEHICLE-MOUNTED APPARATUS - A portable terminal apparatus is configured to obtain provided information including character data from an information distribution server apparatus, transmit partial data, which is a portion of the character data, to a voice synthesizing server apparatus, and obtain, from the voice synthesizing server apparatus, voice data produced by converting the partial data into voice. When a predetermined notification is received from a vehicle-mounted apparatus, a command is given to cause the vehicle-mounted apparatus to display the provided information corresponding to the voice data. The vehicle-mounted apparatus displays the information given by the portable terminal apparatus and plays the voice data, and when a selection operation performed by a user is received, the portable terminal apparatus is notified that the selection operation has been performed. | 04-18-2013 |
20130110512 | FACILITATING TEXT-TO-SPEECH CONVERSION OF A DOMAIN NAME OR A NETWORK ADDRESS CONTAINING A DOMAIN NAME | 05-02-2013 |
20130117025 | APPARATUS AND METHOD FOR REPRESENTING AN IMAGE IN A PORTABLE TERMINAL - An apparatus for displaying an image in a portable terminal includes a camera to photograph the image, a touch screen to display the image and to allow selecting an object area of the displayed image, a memory to store the image, a controller to detect at least one object area within the image when displaying the image of the camera or the memory and to recognize object information of the detected object area to be converted into a voice, and an audio processing unit to output the voice. | 05-09-2013 |
20130117026 | SPEECH SYNTHESIZER, SPEECH SYNTHESIS METHOD, AND SPEECH SYNTHESIS PROGRAM - State duration creation means creates a state duration indicating a duration of each state in a hidden Markov model, based on linguistic information and a model parameter of prosody information. Duration correction degree computing means derives a speech feature from the linguistic information, and computes a duration correction degree which is an index indicating a degree of correcting the state duration, based on the derived speech feature. State duration correction means corrects the state duration based on a phonological duration correction parameter and the duration correction degree, the phonological duration correction parameter indicating a correction ratio of correcting a phonological duration. | 05-09-2013 |
20130132087 | AUDIO INTERFACE - Methods, systems, and apparatus are generally described for providing an audio interface. | 05-23-2013 |
20130144624 | SYSTEM AND METHOD FOR LOW-LATENCY WEB-BASED TEXT-TO-SPEECH WITHOUT PLUGINS - Disclosed herein are systems, methods, and non-transitory computer-readable storage media for reducing latency in web-browsing TTS systems without the use of a plug-in or Flash® module. A system configured according to the disclosed methods allows the browser to send prosodically meaningful sections of text to a web server. A TTS server then converts intonational phrases of the text into audio and responds to the browser with the audio file. The system saves the audio file in a cache, with the file indexed by a unique identifier. As the system continues converting text into speech, when identical text appears the system uses the cached audio corresponding to the identical text without the need for re-synthesis via the TTS server. | 06-06-2013 |
20130144625 | SYSTEMS AND METHODS DOCUMENT NARRATION - Disclosed are techniques and systems to provide a narration of a text in multiple different voices. In some aspects, systems and methods described herein can include receiving a user-based selection of a first portion of words in a document where the document has a pre-associated first voice model and overwriting the association of the first voice model, by the one or more computers, with a second voice model for the first portion of words. | 06-06-2013 |
20130166304 | SYNCHRONISE AN AUDIO CURSOR AND A TEXT CURSOR DURING EDITING - A speech recognition device ( | 06-27-2013 |
20130179170 | CROWD-SOURCING PRONUNCIATION CORRECTIONS IN TEXT-TO-SPEECH ENGINES - Technologies are described herein for providing validated text-to-speech correction hints from aggregated pronunciation corrections received from text-to-speech applications. A number of pronunciation corrections are received by a Web service. The pronunciation corrections may be provided by users of text-to-speech applications executing on a variety of user computer systems. Each of the plurality of pronunciation corrections includes a specification of a word or phrase and a suggested pronunciation provided by the user. The pronunciation corrections are analyzed to generate validated correction hints, and the validated correction hints are provided back to the text-to-speech applications to be used to correct pronunciation of words and phrases in the text-to-speech applications. | 07-11-2013 |
20130191130 | SPEECH SYNTHESIS METHOD AND APPARATUS FOR ELECTRONIC SYSTEM - A speech synthesis method for an electronic system and a speech synthesis apparatus are provided. In the speech synthesis method, a speech signal file including text content is received. The speech signal file is analyzed to obtain prosodic information of the speech signal file. The text content and the corresponding prosodic information are automatically tagged to obtain a text tag file. A speech synthesis file is obtained by synthesizing a human voice profile and the text tag file. | 07-25-2013 |
20130204623 | ELECTRONIC APPARATUS AND FUNCTION GUIDE METHOD THEREOF - In an electronic apparatus having a plurality of functions, a connecting unit connects the electronic apparatus to an external device which presents text information in a form recognizable by a visually impaired user. A function selection unit selects a function to be executed. A storage unit stores a table defining correspondence between the plurality of functions and a plurality of text files each containing text information. A text file selection unit selects a text file corresponding to the selected function with reference to the table. An acquisition unit acquires file information from the selected text file. A transmission unit transmits the acquired file information to the external device. | 08-08-2013 |
20130204624 | CONTEXTUAL CONVERSION PLATFORM FOR GENERATING PRIORITIZED REPLACEMENT TEXT FOR SPOKEN CONTENT OUTPUT - A contextual conversion platform, and method for converting text-to-speech, are described that can convert content of a target to spoken content. Embodiments of the contextual conversion platform can identify certain contextual characteristics of the content, from which can be generated a spoken content input. This spoken content input can include tokens, e.g., words and abbreviations, to be converted to the spoken content, as well as substitution tokens that are selected from contextual repositories based on the context identified by the contextual conversion platform. | 08-08-2013 |
20130211837 | SYSTEM AND METHOD FOR MAKING AN ELECTRONIC HANDHELD DEVICE MORE ACCESSIBLE TO A DISABLED PERSON - An electronic handheld device is described having an options module for providing a user with at least one option in the handheld device, each option associated with an enabling mode of operation of the handheld device. The device also includes an enabling module for implementing, in response to a particular option being selected by a user, an associated enabling mode of operation. Each enabling mode of operation makes the handheld device more accessible to a person having a corresponding disability. | 08-15-2013 |
20130211838 | APPARATUS AND METHOD FOR EMOTIONAL VOICE SYNTHESIS - The present disclosure provides an emotional voice synthesis apparatus and an emotional voice synthesis method. The emotional voice synthesis apparatus includes a word dictionary storage unit for storing emotional words in an emotional word dictionary after classifying the emotional words into items each containing at least one of an emotion class, similarity, positive or negative valence, and sentiment strength; a voice DB storage unit for storing voices in a database after classifying the voices according to at least one of emotion class, similarity, positive or negative valence, and sentiment strength in correspondence to the emotional words; an emotion reasoning unit for inferring an emotion matched with the emotional word dictionary with respect to at least one of each word, phrase, and sentence of a document, including text and e-books; and a voice output unit for selecting and outputting a voice corresponding to the document from the database according to the inferred emotion. | 08-15-2013 |
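As an illustrative sketch of the dictionary-based emotion inference this entry describes: the toy dictionary, its fields, and the function below are assumptions for illustration, not the application's actual data model.

```python
# Hypothetical emotional-word dictionary: word -> (emotion class,
# positive/negative valence, sentiment strength). Entries are made up.
EMOTION_DICT = {
    "delighted": ("joy", "positive", 0.9),
    "gloomy": ("sadness", "negative", 0.7),
    "furious": ("anger", "negative", 0.95),
}

def infer_emotion(sentence: str) -> tuple[str, str, float]:
    """Pick the strongest dictionary emotion found in the sentence."""
    hits = [EMOTION_DICT[w] for w in sentence.lower().split()
            if w in EMOTION_DICT]
    if not hits:
        return ("neutral", "neutral", 0.0)
    # Resolve competing matches by sentiment strength.
    return max(hits, key=lambda h: h[2])
```

The inferred tuple would then index into the voice database to select a matching voice, per the abstract.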
20130218566 | AUDIO HUMAN INTERACTIVE PROOF BASED ON TEXT-TO-SPEECH AND SEMANTICS - The text-to-speech audio HIP technique described herein in some embodiments uses different correlated or uncorrelated words or sentences generated via a text-to-speech engine as audio HIP challenges. The technique can apply different effects in the text-to-speech synthesizer speaking a sentence to be used as a HIP challenge string. The different effects can include, for example, spectral frequency warping; vowel duration warping; background addition; echo addition; and varying the time duration between words, among others. In some embodiments the technique varies the set of parameters to prevent Automated Speech Recognition tools from using previously used audio HIP challenges to learn a model which can then be used to recognize future audio HIP challenges generated by the technique. Additionally, in some embodiments the technique introduces the requirement of semantic understanding in HIP challenges. | 08-22-2013 |
20130218567 | APPARATUS FOR TEXT-TO-SPEECH DELIVERY AND METHOD THEREFOR - A method and apparatus for determining the manner in which a processor-enabled device should produce sounds from data are described. The device ideally synthesizes sounds digitally and reproduces pre-recorded sounds, together with an audible delivery thereof. It includes a memory in which are stored a database of a plurality of data, at least some of which is in the form of text-based indicators, and one or more pre-recorded sounds. The device is further capable of repeatedly determining one or more physical conditions, e.g. the current GPS location, which is compared with one or more reference values provided in memory such that a positive result of the comparison gives rise to an event requiring a sound to be produced by the device. | 08-22-2013 |
20130218568 | SPEECH SYNTHESIS DEVICE, SPEECH SYNTHESIS METHOD, AND COMPUTER PROGRAM PRODUCT - According to an embodiment, a speech synthesis device includes a first storage, a second storage, a first generator, a second generator, a third generator, and a fourth generator. The first storage is configured to store therein first information obtained from a target uttered voice. The second storage is configured to store therein second information obtained from an arbitrary uttered voice. The first generator is configured to generate third information by converting the second information so as to be close to a target voice quality or prosody. The second generator is configured to generate an information set including the first information and the third information. The third generator is configured to generate fourth information used to generate a synthesized speech, based on the information set. The fourth generator is configured to generate the synthesized speech corresponding to input text using the fourth information. | 08-22-2013 |
20130218569 | TEXT-TO-SPEECH USER'S VOICE COOPERATIVE SERVER FOR INSTANT MESSAGING CLIENTS - A system and method to allow an author of an instant message to enable and control the production of audible speech to the recipient of the message. The voice of the author of the message is characterized into parameters compatible with a formative or articulative text-to-speech engine such that upon receipt, the receiving client device can generate audible speech signals from the message text according to the characterization of the author's voice. Alternatively, the author can store samples of his or her actual voice in a server so that, upon transmission of a message by the author to a recipient, the server extracts the samples needed only to synthesize the words in the text message, and delivers those to the receiving client device so that they are used by a client-side concatenative text-to-speech engine to generate audible speech signals having a close likeness to the actual voice of the author. | 08-22-2013 |
20130226584 | SPEECH SYNTHESIS APPARATUS AND METHOD - A speech synthesizing apparatus includes a selector configured to select a plurality of speech units for synthesizing a speech of an input phoneme sequence by referring to speech unit information stored in an information memory. Speech unit waveforms corresponding to the speech units are acquired from a plurality of speech unit waveforms stored in a waveform memory, and the speech is synthesized by concatenating the speech unit waveforms acquired. When acquiring the speech unit waveforms, at least two speech unit waveforms from a continuous region of the waveform memory are copied onto a buffer by one access, wherein a data quantity of the at least two speech unit waveforms is less than or equal to a size of the buffer. | 08-29-2013 |
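The batched waveform access in the entry above can be sketched as grouping reads of units that are contiguous in the waveform memory, so that a run of units fitting in the buffer is fetched with a single access. The data layout and function below are hypothetical, not the application's actual implementation:

```python
def plan_reads(offsets_and_sizes: list[tuple[int, int]],
               buffer_size: int) -> list[tuple[int, int]]:
    """Group waveform-unit reads that are contiguous in storage and
    fit the buffer, so each returned (offset, size) read covers one
    or more units in a single access."""
    reads = []
    cur_off, cur_size = offsets_and_sizes[0]
    for off, size in offsets_and_sizes[1:]:
        contiguous = off == cur_off + cur_size
        if contiguous and cur_size + size <= buffer_size:
            cur_size += size  # extend the batched read
        else:
            reads.append((cur_off, cur_size))  # flush and start a new read
            cur_off, cur_size = off, size
    reads.append((cur_off, cur_size))
    return reads
```

For example, two adjacent 100- and 50-byte units would be fetched in one 150-byte read, while a unit elsewhere in memory triggers a separate access.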
20130231935 | METHOD AND APPARATUS FOR GENERATING SYNTHETIC SPEECH WITH CONTRASTIVE STRESS - Techniques for generating synthetic speech with contrastive stress. In one aspect, a speech-enabled application generates a text input including a text transcription of a desired speech output, and inputs the text input to a speech synthesis system. The synthesis system generates an audio speech output corresponding to at least a portion of the text input, with at least one portion carrying contrastive stress, and provides the audio speech output for the speech-enabled application. In another aspect, a speech-enabled application inputs a plurality of text strings, each corresponding to a portion of a desired speech output, to a software module for rendering contrastive stress. The software module identifies a plurality of audio recordings that render at least one portion of at least one of the text strings as speech carrying contrastive stress. The speech-enabled application generates an audio speech output corresponding to the desired speech output using the audio recordings. | 09-05-2013 |
20130238338 | METHOD AND APPARATUS FOR PHONETIC CHARACTER CONVERSION - A method and apparatus providing improved approaches for uttering the spelling of words and phrases over a communication session are described. The method includes determining a character to produce a first audio signal representing a phonetic utterance of the character, determining a code word that starts with a code word character identical to the character, and generating a second audio signal representing an utterance of the code word, wherein the first audio signal and the second audio signal are provided over a communication session for detection of the character. | 09-12-2013 |
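A minimal sketch of the character-to-code-word step this entry describes, using the NATO phonetic alphabet as an assumed code-word set (the application does not prescribe a particular alphabet, and the function name is hypothetical):

```python
# Assumed code-word vocabulary: each code word starts with the character
# it disambiguates, as the abstract requires.
NATO = {
    "A": "Alpha", "B": "Bravo", "C": "Charlie", "D": "Delta",
    "E": "Echo", "F": "Foxtrot", "G": "Golf", "H": "Hotel",
    "I": "India", "J": "Juliett", "K": "Kilo", "L": "Lima",
    "M": "Mike", "N": "November", "O": "Oscar", "P": "Papa",
    "Q": "Quebec", "R": "Romeo", "S": "Sierra", "T": "Tango",
    "U": "Uniform", "V": "Victor", "W": "Whiskey", "X": "X-ray",
    "Y": "Yankee", "Z": "Zulu",
}

def spell_with_code_words(word: str) -> list[str]:
    """Return 'C as in Charlie'-style utterance texts, one per character;
    each would then be rendered as the two audio signals the entry pairs."""
    out = []
    for ch in word.upper():
        code = NATO.get(ch)
        out.append(f"{ch} as in {code}" if code else ch)
    return out
```

Each returned string corresponds to the pair of audio signals (the character's phonetic utterance plus the code-word utterance) that the abstract sends over the communication session.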
20130238339 | HANDLING SPEECH SYNTHESIS OF CONTENT FOR MULTIPLE LANGUAGES - Techniques that enable a user to select, from among multiple languages, a language to be used for performing text-to-speech conversion. In some embodiments, upon determining that multiple languages may be used to perform text-to-speech conversion for a portion of text, the multiple languages may be displayed to the user. The user may then select a particular language to be used from the multiple languages. The portion of text may then be converted to speech in the user-selected language. | 09-12-2013 |
20130238340 | Wearing State Based Device Operation - Methods and apparatuses for wearing state device operation are disclosed. In one example, a headset includes a sensor for detecting a headset donned state or a headset doffed state. The headset operation is modified based on whether the headset is donned or doffed. | 09-12-2013 |
20130246067 | USER INTERFACE FOR PRODUCING AUTOMATED MEDICAL REPORTS AND A METHOD FOR UPDATING FIELDS OF SUCH INTERFACE ON THE FLY - A system for producing automated medical reports. The interface includes a menu area and a medical report area which is distinct from the menu area. The menu area includes a list of names representing medical conditions. The doctor may make different selections of names from the menu area as the medical service is being rendered to build a report in the medical report area. If a medical condition is not listed in the menu area, the doctor may add a new field for it and select/enter a name and a descriptor for the new field. Whereby, the field is automatically added in the menu area, and the name is automatically displayed in the new field without exiting the report/interface. Upon receiving a user selection of the new name, the descriptor associated therewith is retrieved from the memory and added in the medical report area without exiting the report/interface. | 09-19-2013 |
20130253935 | Indicating A Page Number Of An Active Document Page Within A Document - Methods, apparatuses, and computer program products for indicating a page number of an active document page within a document are provided. Embodiments include detecting, by a presentation controller, activation of a document page on a presentation device; in response to detecting the activation of the document page on the presentation device, tracking, by the presentation controller, an amount of time that the document page is consecutively active on the presentation device; determining, by the presentation controller, that the amount of time that the document page is consecutively active on the presentation device exceeds a predetermined threshold; and in response to determining that the predetermined threshold has been exceeded, providing to a target source, by the presentation controller, an output indicating a page number of the document page while the document page is active on the presentation device. | 09-26-2013 |
20130262118 | PLAYBACK CONTROL APPARATUS, PLAYBACK CONTROL METHOD, AND PROGRAM - A playback control apparatus includes a playback controller configured to control playback of first content and second content. The first content is to output first sound which is generated based on text information using speech synthesis processing. The second content is to output second sound which is generated not using the speech synthesis processing. The playback controller causes an attribute of content to be played back to be displayed on the screen, the attribute indicating whether or not the content is to output sound which is generated based on text information using speech synthesis processing. | 10-03-2013 |
20130262119 | TEXT TO SPEECH SYSTEM - A text-to-speech method configured to output speech having a selected speaker voice and a selected speaker attribute, including: inputting text; dividing the inputted text into a sequence of acoustic units; selecting a speaker for the inputted text; selecting a speaker attribute for the inputted text; converting the sequence of acoustic units to a sequence of speech vectors using an acoustic model; and outputting the sequence of speech vectors as audio with the selected speaker voice and a selected speaker attribute. The acoustic model includes a first set of parameters relating to speaker voice and a second set of parameters relating to speaker attributes, which parameters do not overlap. The selecting a speaker voice includes selecting parameters from the first set of parameters and the selecting the speaker attribute includes selecting the parameters from the second set of parameters. | 10-03-2013 |
20130262120 | SPEECH SYNTHESIS DEVICE AND SPEECH SYNTHESIS METHOD - A speech synthesis device includes: a mouth-opening-degree generation unit which generates, for each of phonemes generated from input text, a mouth-opening-degree corresponding to oral-cavity volume, using information generated from the text and indicating the type and position of the phoneme within the text, such that the generated mouth-opening-degree is larger for a phoneme at the beginning of a sentence in the text than for a phoneme at the end of the sentence; a segment selection unit which selects, for each of the generated phonemes, segment information corresponding to the phoneme from among pieces of segment information stored in a segment storage unit and including phoneme type, mouth-opening-degree, and speech segment data, based on the type of the phoneme and the generated mouth-opening-degree; and a synthesis unit which generates synthetic speech of the text, using the selected pieces of segment information and pieces of prosody information generated from the text. | 10-03-2013 |
20130268275 | SPEECH SYNTHESIS SYSTEM, SPEECH SYNTHESIS PROGRAM PRODUCT, AND SPEECH SYNTHESIS METHOD - Waveform concatenation speech synthesis with high sound quality. Prosody with both high accuracy and high sound quality is achieved by performing a two-path search including a speech segment search and a prosody modification value search. An accurate accent is secured by evaluating the consistency of the prosody by using a statistical model of prosody variations (the slope of fundamental frequency) for both of two paths of the speech segment selection and the modification value search. In the prosody modification value search, a prosody modification value sequence that minimizes a modified prosody cost is searched for. This allows a search for a modification value sequence that can increase the likelihood of absolute values or variations of the prosody to the statistical model as high as possible with minimum modification values. | 10-10-2013 |
20130275137 | WARNING SYSTEM WITH SYNTHESIZED VOICE DIAGNOSTIC ANNOUNCEMENT CAPABILITY FOR FIELD DEVICES - Field devices, including sensors and final elements, are provided with a speech synthesizer and optionally a speech control chip, to sound audible voice maintenance and fault alarms to alert field personnel and, optionally, a voice message upon manual activation of a pushbutton or other switch directing them how to perform the maintenance task or clear the fault. | 10-17-2013 |
20130275138 | Hands-Free List-Reading by Intelligent Automated Assistant - Systems and methods for providing hands-free reading of content comprising: identifying a plurality of data items for presentation to a user, the plurality of data items associated with a domain-specific item type and sorted according to a particular order; based on the domain-specific item type, generating a speech-based overview of the plurality of data items; for each of the plurality of data items, generating a respective speech-based, item-specific paraphrase for the data item based on respective content of the data item; and providing, to a user through the speech-enabled dialogue interface, the speech-based overview, followed by the respective speech-based, item-specific paraphrases for at least a subset of the plurality of data items in the particular order. | 10-17-2013 |
20130282375 | Vehicle-Based Message Control Using Cellular IP - Architecture for playing back personal text-based messages such as email and voicemail over a vehicle-based media system. The user can use a cell phone that registers over a cellular network to an IMS (IP multimedia subsystem) to obtain an associated IP address. The personal messages are then converted into audio signals using a remote text-to-voice (TTV) converter and transmitted to the phone based on the IP address. The phone then transmits the audio signals to the vehicle media system for playback using an unlicensed wireless technology (e.g., Bluetooth, Wi-Fi, etc.). Other alternative embodiments include transmitting converted messages directly to the media system via a satellite channel, converting the messages via a TTV converter on the cell phone, and streaming the converted messages to the phone and/or the media system for playback. | 10-24-2013 |
20130282376 | FILE FORMAT, SERVER, VIEWER DEVICE FOR DIGITAL COMIC, DIGITAL COMIC GENERATION DEVICE - A viewer device for a digital comic, comprising: an information acquisition unit that acquires a digital comic in a file format for a digital comic viewed on a viewer device, the file format including speech balloon information including information of a speech balloon region that indicates a region of a speech balloon, first text information indicating a dialogue within each speech balloon, the first text information being correlated with each speech balloon, and first display control information including positional information and a transition order of an anchor point so as to enable the image of the entire page to be viewed on a monitor of the viewer device in a scroll view; and a voice reproduction section that synthesizes a voice for reading the letters corresponding to the text information based on an attribute of the character, an attribute of the speech balloon or the dialogue, and outputs the voice. | 10-24-2013 |
20130282377 | COMMUNICATION DEVICE TRANSFORMING TEXT MESSAGE INTO SPEECH - The application discloses a communication device and a method of processing a text message in the communication device. An aspect of the present application is a method of processing a text message in a communication device, the method including receiving a text message from an external sender, receiving a request to transform the text message into voice data, transforming the received text message into voice data according to the request, and transmitting the voice data to an external sound reproduction device through a wireless communication module. | 10-24-2013 |
20130289998 | Realistic Speech Synthesis System - A system and method for realistic speech synthesis which converts text into synthetic human speech with qualities appropriate to the context such as the language and dialect of the speaker, as well as expanding a speaker's phonetic inventory to produce more natural sounding speech. | 10-31-2013 |
20130304474 | SYSTEM AND METHOD FOR AUDIBLY PRESENTING SELECTED TEXT - Disclosed herein are methods for presenting speech from a selected text that is on a computing device. The method includes presenting text on a touch-sensitive display with a text size within a threshold level, so that the computing device can accurately determine the intent of the user when the user touches the touch screen. Once the user touch has been received, the computing device identifies and interprets the portion of text that is to be selected, and subsequently presents the text audibly to the user. | 11-14-2013 |
20130311186 | METHOD AND ELECTRONIC DEVICE FOR EASY SEARCH DURING VOICE RECORD - An electronic device that allows the user to check the content corresponding to a recorded portion in real time without terminating the audio recording, and a method of reproducing the audio signal. An electronic device according to an embodiment disclosed in the present disclosure may include a memory configured to store an audio signal being input; a display unit configured to display at least one of an item indicating a progressive state in which the audio signal is stored therein and an STT-based text for the audio signal; a user interface unit configured to receive the selection of a predetermined portion of the item indicating the progressive state or the selection of a partial character string of the text from the user; and a controller configured to reproduce an audio signal corresponding to the selected portion or the selected character string. | 11-21-2013 |
20130311187 | Electronic Apparatus - An electronic apparatus comprises a storage module, a manipulation module, a voice output control module, and a display module. The storage module is configured to store book data. The manipulation module is configured to convert a manipulation of a user into an electrical signal, while the voice output control module is configured to reproduce a voice by reading the book data in the storage module based on the manipulation, and the display module is configured to display the book data. When it is determined that a part to be reproduced includes an illustration or a figure, the user is urged to view the display module and the illustration or the figure is displayed at the display module. | 11-21-2013 |
20130311188 | Text-to-speech device, speech output device, speech output system, text-to-speech methods, and speech output method - An audio read-out device comprises an audio signal generator, a first information receiver, a first information transmitter, a first controller, and a mixed audio signal generator, and when the first information receiver receives audio output enablement information indicating that audio output is disabled, the first controller causes the mixed audio signal generator to generate a mixed audio signal composed of a broadcast audio signal and causes the first information transmitter to transmit the mixed audio signal until the first information receiver receives audio output enablement information indicating that audio output is enabled; and when the first information receiver receives audio output enablement information indicating that audio output is enabled, the first controller causes the mixed audio signal generator to generate a mixed audio signal obtained by mixing a read-out audio signal and a broadcast audio signal, and causes the first information transmitter to transmit the mixed audio signal. | 11-21-2013 |
20130332169 | Method and System for Enhancing a Speech Database - A system, method and computer readable medium that enhances a speech database for speech synthesis is disclosed. The method may include labeling audio files in a primary speech database, identifying segments in the labeled audio files that have varying pronunciations based on language differences, identifying replacement segments in a secondary speech database, enhancing the primary speech database by substituting the identified secondary speech database segments for the corresponding identified segments in the primary speech database, and storing the enhanced primary speech database for use in speech synthesis. | 12-12-2013 |
20130332170 | METHOD AND SYSTEM FOR PROCESSING CONTENT - Provided are a method and system for processing user input and web-based content by transforming content to metadata and by using a plurality of vocabularies, including specific vocabularies (e.g. location-dependent, culture-dependent, personalized, non-formal, and more), and other methods to process voice or non-voice content. | 12-12-2013 |
20130346081 | DEVICE FOR AIDING COMMUNICATION IN THE AERONAUTICAL DOMAIN - The device ( | 12-26-2013 |
20140006030 | Device, Method, and User Interface for Voice-Activated Navigation and Browsing of a Document | 01-02-2014 |
20140006031 | SOUND SYNTHESIS METHOD AND SOUND SYNTHESIS APPARATUS | 01-02-2014 |
20140006032 | SYSTEM AND METHOD FOR DYNAMICALLY INTERACTING WITH A MOBILE COMMUNICATION DEVICE | 01-02-2014 |
20140012583 | METHOD AND APPARATUS FOR RECORDING AND PLAYING USER VOICE IN MOBILE TERMINAL - A method and an apparatus for recording and playing a user voice in a mobile terminal are provided. The method for recording and storing a user voice in a mobile terminal includes entering a page by executing an electronic book, identifying whether a user voice record file related to the page exists, generating a user voice record file related to the page by recording a text included in the page to a user voice if the user voice record file does not exist, and playing by synchronizing the user voice stored in the user voice record file with the text if the user voice record file exists. Accordingly, a user voice can be recorded corresponding to a text of a page when recording a specific record of an electronic book, and the text corresponding to the user voice being played can be highlighted by synchronizing the user voice and the text. | 01-09-2014 |
20140019134 | BLENDING RECORDED SPEECH WITH TEXT-TO-SPEECH OUTPUT FOR SPECIFIC DOMAINS - A text-to-speech (TTS) engine combines recorded speech with synthesized speech from a TTS synthesizer based on text input. The TTS engine receives the text input and identifies the domain for the speech (e.g. navigation, dialing, . . . ). The identified domain is used in selecting domain specific speech recordings (e.g. pre-recorded static phrases such as “turn left”, “turn right” . . . ) from the input text. The speech recordings are obtained based on the static phrases for the domain that are identified from the input text. The TTS engine blends the static phrases with the TTS output to smooth the acoustic trajectory of the input text. The prosody of the static phrases is used to create similar prosody in the TTS output. | 01-16-2014 |
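The phrase-selection step in the entry above can be sketched as splitting the input text into spans covered by a pre-recorded static phrase versus spans sent to the TTS synthesizer. The per-domain phrase table and the function below are illustrative assumptions; the blending and prosody-matching stages are not shown:

```python
# Hypothetical domain -> pre-recorded static phrases table.
DOMAIN_PHRASES = {
    "navigation": ["turn left", "turn right", "in one mile"],
}

def segment_for_domain(text: str, domain: str) -> list[tuple[str, str]]:
    """Split text into ('recording', phrase) and ('tts', span) segments,
    matching the earliest-starting known static phrase each pass."""
    phrases = DOMAIN_PHRASES.get(domain, [])
    spans = []
    rest = text.lower()
    while rest:
        hits = [(rest.find(p), p) for p in phrases if p in rest]
        if not hits:
            spans.append(("tts", rest.strip()))
            break
        idx, phrase = min(hits)  # earliest occurrence in remaining text
        if rest[:idx].strip():
            spans.append(("tts", rest[:idx].strip()))
        spans.append(("recording", phrase))
        rest = rest[idx + len(phrase):]
    return [s for s in spans if s[1]]  # drop empty spans
```

Per the abstract, the prosody of the matched recordings would then guide the prosody of the synthesized spans so the acoustic trajectory stays smooth across the joins.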
20140019135 | SENDER-RESPONSIVE TEXT-TO-SPEECH PROCESSING - A method of speech synthesis including receiving a text input sent by a sender, processing the text input responsive to at least one distinguishing characteristic of the sender to produce synthesized speech that is representative of a voice of the sender, and communicating the synthesized speech to a recipient user of the system. | 01-16-2014 |
20140019136 | ELECTRONIC DEVICE, INFORMATION PROCESSING APPARATUS, AND METHOD FOR CONTROLLING THE SAME - The present invention provides a technology for enabling natural voice reproduction in which, depending on a gazed character position, the position of the voice output character follows, but does not excessively react to, the gazed character position. In an electronic device provided with a display unit for displaying text on a screen, a voice outputting unit for outputting the text as voice, and a sight-line detection unit for detecting a sight-line direction of a user, a control unit changes the starting position at which the voice outputting unit starts voice output if the distance between the position of the current output character and the position of the currently gazed character is a preset threshold or more. | 01-16-2014 |
20140019137 | METHOD, SYSTEM AND SERVER FOR SPEECH SYNTHESIS - A speech synthesis system synthesizes speech using a reading text and a speech dictionary set, and includes a server apparatus. The server apparatus includes an interface unit open to the public; a speech input reception unit that receives an input of speech from an external terminal through the interface unit to generate a speech dictionary set; a registration information reception unit that receives registration information relating to a speech owner who inputs the speech from the external terminal through the interface unit; a speech dictionary set maintaining unit that maintains a speech dictionary set generated from the speech of which the input has been received in association with the registration information of a person inputting the speech; and a speech dictionary set selecting unit that allows selection of a speech dictionary set maintained in the speech dictionary set maintaining unit from the external terminal through the interface unit. | 01-16-2014 |
20140019138 | Training and Applying Prosody Models - Techniques for training and applying prosody models for speech synthesis are provided. A speech recognition engine processes audible speech to produce text annotated with prosody information. A prosody model is trained with this annotated text. After initial training, the model is applied during speech synthesis to generate speech with non-standard prosody from input text. Multiple prosody models can be used to represent different prosody styles. | 01-16-2014 |
20140025381 | EVALUATING TEXT-TO-SPEECH INTELLIGIBILITY USING TEMPLATE CONSTRAINED GENERALIZED POSTERIOR PROBABILITY - Instead of relying on humans to subjectively evaluate speech intelligibility of a subject, a system objectively evaluates the speech intelligibility. The system receives speech input and calculates confidence scores at multiple different levels using a Template Constrained Generalized Posterior Probability algorithm. One or multiple intelligibility classifiers are utilized to classify the desired entities on an intelligibility scale. A specific intelligibility classifier utilizes features such as the various confidence scores. The scale of the intelligibility classification can be adjusted to suit the application scenario. Based on the confidence score distributions and the intelligibility classification results at multiple levels an overall objective intelligibility score is calculated. The objective intelligibility scores can be used to rank different subjects or systems being assessed according to their intelligibility levels. The speech that is below a predetermined intelligibility (e.g. utterances with low confidence scores and most severe intelligibility issues) can be automatically selected for further analysis. | 01-23-2014 |
20140025382 | SPEECH PROCESSING SYSTEM - A text to speech method, the method comprising: | 01-23-2014 |
20140025383 | Voice Outputting Method, Voice Interaction Method and Electronic Device - A voice outputting method, a voice interaction method and an electronic device are described. The method includes acquiring a first content to be output; analyzing the first content to acquire a first emotion information for expressing the emotion carried by the first content to be output; acquiring a first voice data to be output corresponding to the first content; processing the first voice data to be output based on the first emotion information to generate a second voice data to be output with a second emotion information, wherein the second emotion information is used to express the emotion of the electronic device outputting the second voice data to be output to enable the user to acquire the emotion of the electronic device, and wherein the first and the second emotion information are matched to and/or correlated with each other; and outputting the second voice data to be output. | 01-23-2014 |
20140025384 | METHOD AND APPARATUS FOR GENERATING SYNTHETIC SPEECH WITH CONTRASTIVE STRESS - Techniques for generating synthetic speech with contrastive stress. In one aspect, a speech-enabled application generates a text input including a text transcription of a desired speech output, and inputs the text input to a speech synthesis system. The synthesis system generates an audio speech output corresponding to at least a portion of the text input, with at least one portion carrying contrastive stress, and provides the audio speech output for the speech-enabled application. In another aspect, a speech-enabled application inputs a plurality of text strings, each corresponding to a portion of a desired speech output, to a software module for rendering contrastive stress. The software module identifies a plurality of audio recordings that render at least one portion of at least one of the text strings as speech carrying contrastive stress. The speech-enabled application generates an audio speech output corresponding to the desired speech output using the audio recordings. | 01-23-2014 |
20140052446 | PROSODY EDITING APPARATUS AND METHOD - According to one embodiment, a prosody editing apparatus includes a storage, a first selection unit, a search unit, a normalization unit, a mapping unit, a display, a second selection unit, a restoring unit and a replacing unit. The search unit searches the storage for one or more second prosodic patterns corresponding to attribute information that matches attribute information of the selected phrase. The mapping unit maps each of the normalized second prosodic patterns onto a low-dimensional space. The restoring unit restores a prosodic pattern according to the selected coordinates. The replacing unit replaces the prosody of synthetic speech generated based on the selected phrase with the restored prosodic pattern. | 02-20-2014 |
20140052447 | SPEECH SYNTHESIS APPARATUS, METHOD, AND COMPUTER-READABLE MEDIUM - According to one embodiment, a speech synthesis apparatus is provided with generation, normalization, interpolation and synthesis units. The generation unit generates a first parameter using a prosodic control dictionary of a target speaker and one or more second parameters using a prosodic control dictionary of one or more standard speakers, based on language information for an input text. The normalization unit normalizes the one or more second parameters based on a normalization parameter. The interpolation unit interpolates the first parameter and the one or more normalized second parameters based on weight information to generate a third parameter, and the synthesis unit generates synthesized speech using the third parameter. | 02-20-2014 |
20140058733 | SCREEN READER WITH FOCUS-BASED SPEECH VERBOSITY - The amount of speech output to a blind or low-vision user using a screen reader application is automatically adjusted based on how the user navigates to a control in a graphic user interface. Navigation by mouse presumes the user has greater knowledge of the identity of the control than navigation by tab keystroke, which is more indicative of a user searching for a control. In addition, accelerator keystrokes indicate a higher level of specificity to set focus on a control, and thus less verbosity is required to sufficiently inform the screen reader user. | 02-27-2014 |
20140058734 | SYSTEM FOR TUNING SYNTHESIZED SPEECH - An embodiment of the invention is a software tool used to convert text, speech synthesis markup language (SSML), and/or extended SSML to synthesized audio. Provisions are made to create, view, play, and edit the synthesized speech, including editing pitch and duration targets, speaking type, paralinguistic events, and prosody. Prosody can be provided by way of a sample recording. Users can interact with the software tool by way of a graphical user interface (GUI). The software tool can produce synthesized audio file output in many file formats. | 02-27-2014 |
20140067397 | USING EMOTICONS FOR CONTEXTUAL TEXT-TO-SPEECH EXPRESSIVITY - Techniques disclosed herein include systems and methods that improve audible emotional characteristics used when synthesizing speech from a text source. Systems and methods herein use emoticons identified from a source text to provide contextual text-to-speech expressivity. In general, techniques herein analyze text and identify emoticons included within the text. The source text is then tagged with corresponding mood indicators. For example, if the system identifies an emoticon at the end of a sentence, then the system can infer that this sentence has a specific tone or mood associated with it. Depending on whether the emoticon is a smiley face, angry face, sad face, laughing face, etc., the system can infer tone or mood from the various emoticons and then change or modify the expressivity of the TTS output, such as by changing intonation, prosody, speed, pauses, and other expressivity characteristics. | 03-06-2014 |
20140067398 | METHOD, SYSTEM AND PROCESSOR-READABLE MEDIA FOR AUTOMATICALLY VOCALIZING USER PRE-SELECTED SPORTING EVENT SCORES - A method and system for vocalizing user-selected sporting event scores. A customized spoken score application module can be configured in association with a device. A real-time score can be preselected by a user from an existing sporting event website for automatically vocalizing the score in a multitude of languages utilizing a speech synthesizer and a translation engine. An existing text-to-speech engine can be integrated with the spoken score application module and controlled by the application module to automatically vocalize the preselected scores listed on the sporting event site. The synthetically-voiced, real-time score can be transmitted to the device at a predetermined time interval. Such an approach automatically and instantly pushes the real time vocal alerts thereby permitting the user to continue multitasking without activating the pre-selected vocal alerts. | 03-06-2014 |
20140067399 | METHOD AND SYSTEM FOR REPRODUCTION OF DIGITAL CONTENT - The present invention relates to a method and system of aurally reproducing visually structured content by associating specific audio formatting elements with visual formatting elements of the content. A method and system for reproducing visually structured content by associating abstract visual elements with visual formatting elements of the content is also described. | 03-06-2014 |
20140067400 | PHONETIC INFORMATION GENERATING DEVICE, VEHICLE-MOUNTED INFORMATION DEVICE, AND DATABASE GENERATION METHOD - In a word string information DB, when phonetic information automatically generated from written notation information matches regular phonetic information, only the written notation information is registered, or, when the phonetic information automatically generated does not match the regular phonetic information, the written notation information and the regular phonetic information are registered. A word string information retrieving unit | 03-06-2014 |
20140067401 | PROVIDE SERVICES USING UNIFIED COMMUNICATION CONTENT - Example embodiments disclosed herein relate to using intelligence within unified communication content to facilitate services. A semantic store including unified communication content is queried. Then, results of the query are determined. | 03-06-2014 |
20140074478 | SYSTEM AND METHOD FOR DIGITALLY REPLICATING SPEECH - A speech replication system including a speech generation unit having a program running in a memory of the speech generation unit, the program executing the steps of receiving an audio stream, identifying words within the audio stream, analyzing each word to determine the audio characteristics of the speaker's voice, storing the audio characteristics of the speaker's voice in the memory, receiving text information, converting the text information into an output audio stream using the audio characteristics of the speaker stored in the memory, and playing the output audio stream. | 03-13-2014 |
20140088969 | AUTOMATED METHOD AND SYSTEM FOR OBTAINING USER-SELECTED INFORMATION ON A MOBILE COMMUNICATION DEVICE - A customized live tile application module can be configured in association with the mobile communication device in order to automatically vocalize the information preselected by a user in a multitude of languages. A text-to-speech application module can be integrated with the customized live tile application module to automatically vocalize the preselected information. The information can be obtained from a tile and/or a website integrated with a remote server and announced after a text-to-speech conversion process without opening the tile, if the tiles are selected for announcement of information by the device. The information can be obtained in real-time. Such an approach automatically and instantly pushes a vocal alert with respect to the user-selected information on the mobile communication device, thereby permitting the user to continue multitasking. Information from tiles can also be rendered on second screens from a mobile device. | 03-27-2014 |
20140088970 | METHOD AND DEVICE FOR USER INTERFACE - A method for user interface according to one embodiment of the present invention comprises the steps of: displaying text on a screen; receiving a character selection command from a user selecting at least one character included in the text; receiving a speech command from the user designating a selected range in the text including the at least one character; specifying the selected range according to the character selection command and the speech command; and receiving an editing command from the user for the selected range. | 03-27-2014 |
20140095164 | MESSAGE ORIGINATING SERVER, MESSAGE ORIGINATING METHOD, TERMINAL, ELECTRIC APPLIANCE CONTROL SYSTEM, AND ELECTRIC APPLIANCE - A control server ( | 04-03-2014 |
20140095165 | SYSTEM AND METHOD FOR SYNCHRONIZING SOUND AND MANUALLY TRANSCRIBED TEXT - A method for synchronizing sound data and text data, said text data being obtained by manual transcription of said sound data during playback of the latter. The proposed method comprises the steps of repeatedly querying said sound data and said text data to obtain a current time position corresponding to a currently played sound datum and a currently transcribed text datum, respectively, correcting said current time position by applying a time correction value in accordance with a transcription delay, and generating at least one association datum indicative of a synchronization association between said corrected time position and said currently transcribed text datum. Thus, the proposed method achieves cost-effective synchronization of sound and text in connection with the manual transcription of sound data. | 04-03-2014 |
20140100852 | DYNAMIC SPEECH AUGMENTATION OF MOBILE APPLICATIONS - Speech functionality is dynamically provided for one or more applications by a narrator application. A plurality of shared data items are received from the one or more applications, with each shared data item including text data that is to be presented to a user as speech. The text data is extracted from each shared data item to produce a plurality of playback data items. A text-to-speech algorithm is applied to the playback data items to produce a plurality of audio data items. The plurality of audio data items are played to the user. | 04-10-2014 |
20140108014 | INFORMATION PROCESSING APPARATUS AND METHOD FOR CONTROLLING THE SAME - The present invention is configured to display a screen that includes a voice output position with a simple operation, even when another text that does not include the voice output position is displayed by manipulation during output of a text as voice. Therefore, when an input unit 101 detects an operation by a user while outputting a text as voice, a display control unit executes processing that corresponds to this operation such as scrolling, and displays the designated part of the text. Thereafter, when the input unit 101 further detects an operation and if the detected operation and the immediately previous operation are opposite operations to each other, a text that includes a current voice output position is displayed. | 04-17-2014 |
20140114663 | GUIDED SPEAKER ADAPTIVE SPEECH SYNTHESIS SYSTEM AND METHOD AND COMPUTER PROGRAM PRODUCT - According to an exemplary embodiment of a guided speaker adaptive speech synthesis system, a speaker adaptive training module generates adaptation information and a speaker-adapted model based on inputted recording text and recording speech. A text to speech engine receives the recording text and the speaker-adapted model and outputs synthesized speech information. A performance assessment module receives the adaptation information and the synthesized speech information to generate assessment information. An adaptation recommendation module selects at least one subsequent recording text from at least one text source as a recommendation for the next adaptation process, according to the adaptation information and the assessment information. | 04-24-2014 |
20140122079 | GENERATING PERSONALIZED AUDIO PROGRAMS FROM TEXT CONTENT - Features are disclosed for generating text-to-speech (TTS) audio programs from textual content received from multiple sources. A TTS system may assemble an audio program from several individual audio presentations of user-selected network-accessible content. Users may configure the TTS system to retrieve personal content as well as publically accessible content. The audio program may include segues, introductions, summaries, and the like. Voices may be selected for individual content items based on user selections or on characteristics of the content or content source. | 05-01-2014 |
20140122080 | SINGLE INTERFACE FOR LOCAL AND REMOTE SPEECH SYNTHESIS - Features are disclosed for providing a consistent interface for local and distributed text to speech (TTS) systems. Some portions of the TTS system, such as voices and TTS engine components, may be installed on a client device, and some may be present on a remote system accessible via a network link. Determinations can be made regarding which TTS system components to implement on the client device and which to implement on the remote server. The consistent interface facilitates connecting to or otherwise employing the TTS system through use of the same methods and techniques regardless of which TTS system configuration is implemented. | 05-01-2014 |
20140122081 | AUTOMATED TEXT TO SPEECH VOICE DEVELOPMENT - A group of users may be presented with text and a synthesized speech recording of the text. The users can listen to the synthesized speech recording and submit feedback regarding errors or other issues with the synthesized speech. A system of one or more computing devices can analyze the feedback, modify the voice or language rules, and recursively test the modifications. The modifications may be determined through the use of machine learning algorithms or other automated processes. | 05-01-2014 |
20140122082 | APPARATUS AND METHOD FOR GENERATION OF PROSODY ADJUSTED SOUND RESPECTIVE OF A SENSORY SIGNAL AND TEXT-TO-SPEECH SYNTHESIS - A method for generation of a prosody adjusted digital sound. The method comprises receiving at least a sensory signal from at least one sensor; generating a digital sound respective of an input text content and a text-to-speech content retrieved from a memory unit; and modifying the generated digital sound respective of the at least the sensory signal to create the prosody adjusted digital sound. | 05-01-2014 |
20140129228 | Method, System, and Relevant Devices for Playing Sent Message - A method and a system for playing a message that are applicable to the field of communications technologies. The message playing method includes: receiving, by a receiving terminal, a message that includes a user identifier and text information, obtaining a speech identifier and an image identifier corresponding to the user identifier, generating or obtaining a speech animation stream according to a speech characteristic parameter indicated by the speech identifier, an image characteristic parameter indicated by the image identifier, and the text information, and playing the speech animation stream. In this way, the text information in the message can be played as a speech animation stream according to the user identifier, the text information in the message can be presented vividly, and the message can be presented in a personalized manner according to the speech identifier and the image identifier corresponding to the user identifier. | 05-08-2014 |
20140129229 | PERSONAL AUDIO ASSISTANT DEVICE AND METHOD - A wearable device includes a housing for the wearable device, a first microphone within or on the housing, a communication module within or on the housing, a logic circuit communicatively coupled to the first microphone, a memory storage unit communicatively coupled to the logic circuit and an interaction element. The interaction element and logic circuit cooperatively initiate control of media content or initiate operations of the wearable device. Other embodiments are disclosed. | 05-08-2014 |
20140129230 | METHOD AND APPARATUS FOR GENERATING SYNTHETIC SPEECH WITH CONTRASTIVE STRESS - Techniques for generating synthetic speech with contrastive stress. In one aspect, a speech-enabled application generates a text input including a text transcription of a desired speech output, and inputs the text input to a speech synthesis system. The synthesis system generates an audio speech output corresponding to at least a portion of the text input, with at least one portion carrying contrastive stress, and provides the audio speech output for the speech-enabled application. In another aspect, a speech-enabled application inputs a plurality of text strings, each corresponding to a portion of a desired speech output, to a software module for rendering contrastive stress. The software module identifies a plurality of audio recordings that render at least one portion of at least one of the text strings as speech carrying contrastive stress. The speech-enabled application generates an audio speech output corresponding to the desired speech output using the audio recordings. | 05-08-2014 |
20140136208 | SECURE MULTI-MODE COMMUNICATION BETWEEN AGENTS - A system to assist in secure multi-mode communication is described. The system can receive communications from a source agent in a source communication mode, authenticate the source agent, and determine a recipient communication mode associated with the recipient agent. The system then transforms the communication from the source communication mode to the recipient communication mode based on the recipient agent and/or the content of the communication and provides the communication to the recipient agent in the recipient communication mode. | 05-15-2014 |
20140149119 | SPEECH TRANSCRIPTION INCLUDING WRITTEN TEXT - Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for transcribing utterances into written text are disclosed. The methods, systems, and apparatus include actions of obtaining a lexicon model that maps phones to spoken text and obtaining a language model that assigns probabilities to written text. The actions further include generating a transducer that maps the written text to the spoken text, the transducer mapping multiple items of the written text to an item of the spoken text. Additionally, the actions include constructing a decoding network for transcribing utterances into written text by composing the lexicon model, the inverse of the transducer, and the language model. | 05-29-2014 |
20140156280 | SPEECH PROCESSING SYSTEM - A method of deriving speech synthesis parameters from an audio signal, the method comprising: | 06-05-2014 |
20140163992 | WIRELESS SERVER BASED TEXT TO SPEECH EMAIL - An email system for mobile devices, such as cellular phones and PDAs, is disclosed which allows email messages to be played back on the mobile device as voice messages on demand by way of a media player, thus eliminating the need for a unified messaging system. Email messages are received by the mobile device in a known manner. In accordance with an important aspect of the invention, the email messages are identified by the mobile device as they are received. After the message is identified, the mobile device sends the email message in text format to a server for conversion to speech or voice format. After the message is converted to speech format, the server sends the messages back to the user's mobile device and notifies the user of the email message and then plays the message back to the user through a media player upon demand. | 06-12-2014 |
20140163993 | FACILITATING TEXT-TO-SPEECH CONVERSION OF A DOMAIN NAME OR A NETWORK ADDRESS CONTAINING A DOMAIN NAME - To facilitate text-to-speech conversion of a username, a first or last name of a user associated with the username may be retrieved, and a pronunciation of the username may be determined based at least in part on whether the name forms at least part of the username. To facilitate text-to-speech conversion of a domain name having a top level domain and at least one other level domain, a pronunciation for the top level domain may be determined based at least in part upon whether the top level domain is one of a predetermined set of top level domains. Each other level domain may be searched for one or more recognized words therewithin, and a pronunciation of the other level domain may be determined based at least in part on an outcome of the search. The username and domain name may form part of a network address such as an email address, URL or URI. | 06-12-2014 |
20140188479 | AUDIO EXPRESSION OF TEXT CHARACTERISTICS - In a method for communicating characteristics of an electronic document, a coefficient representative of predetermined characteristics of the electronic document is determined. The coefficient is associated with a corresponding audio rendering parameter. A speech signal communicating content of the electronic document is generated. The speech signal includes predetermined text content items audio formatted based on the audio rendering parameter. The speech signal is rendered to the user. | 07-03-2014 |
20140188480 | SYSTEM AND METHOD FOR GENERATING CUSTOMIZED TEXT-TO-SPEECH VOICES - A system and method are disclosed for generating customized text-to-speech voices for a particular application. The method comprises generating a custom text-to-speech voice by selecting a voice for generating a custom text-to-speech voice associated with a domain, collecting text data associated with the domain from a pre-existing text data source and using the collected text data, generating an in-domain inventory of synthesis speech units by selecting speech units appropriate to the domain via a search of a pre-existing inventory of synthesis speech units, or by recording the minimal inventory for a selected level of synthesis quality. The text-to-speech custom voice for the domain is generated utilizing the in-domain inventory of synthesis speech units. Active learning techniques may also be employed to identify problem phrases wherein only a few minutes of recorded data is necessary to deliver a high quality TTS custom voice. | 07-03-2014 |
20140195240 | VISUAL CONTENT FEED PRESENTATION - A method, system, and computer program product for a visual content feed presentation. The method includes receiving different streams of content from different sources of the streams, characterizing the content in each of the different streams, and determining based upon the characterization of the content of each of the streams, a visual arrangement of the content for presentation in a graphical user interface (GUI). The method further includes presenting the content in the determined visual arrangement in the GUI and text to speech (TTS) converting content in one of the streams and playing back the TTS converted content in synchronization with a display of an avatar in the GUI. | 07-10-2014 |
20140195241 | Synchronizing the Playing and Displaying of Digital Content - The techniques disclosed herein allow a user to synchronize the playing and displaying of digital content on an electronic device. The device may render a first portion of digital content so it may be displayed. The device may also play a segment of the digital content as audio using text to speech software. The device may also render a second portion of digital content for display depending on whether the position of the last word read is greater than the last position in the first portion of digital content. | 07-10-2014 |
20140195242 | Prosody Generation Using Syllable-Centered Polynomial Representation of Pitch Contours - The present invention discloses a parametric representation of prosody based on polynomial expansion coefficients of the pitch contour near the center of each syllable. These syllable pitch expansion coefficients are generated from a recorded speech database, read from a number of sentences by a reference speaker. By correlating the stress level and context information of each syllable in the text with the polynomial expansion coefficients of the corresponding spoken syllable, a correlation database is formed. To generate prosody for an input text, the stress level and context information of each syllable in the text is identified. The prosody is generated by using the said correlation database to find the best set of pitch parameters for each syllable. By adding global pitch contours and using interpolation formulas, a complete pitch contour for the input text is generated. Duration and intensity profiles are generated using a similar procedure. | 07-10-2014 |
20140200894 | DISTRIBUTED SPEECH UNIT INVENTORY FOR TTS SYSTEMS - In a text-to-speech (TTS) system, a database including sample speech units for unit selection may be configured for use by a local device. The local unit database may be created from a more comprehensive unit database. The local unit database may include units which provide sufficient TTS results for frequently input text. Speech synthesis may then be performed by concatenating locally available units with units from a remote device including the comprehensive unit database. Aspects of the speech synthesis may be performed by the remote device and/or the local device. | 07-17-2014 |
20140200895 | Systems and Methods for Automated Media Commentary - Techniques for providing automated media commentary are provided. A user agent requests audio commentary for media. In response, a service searches data sources to identify the specified media, finds information related to the entities in that media, generates text representing that information, combines the text into a textual monologue, and synthesizes speech audio from that textual monologue. The service selects relevant information that is likely unknown to the user while also being desired by the user. | 07-17-2014 |
20140207461 | CAR A/V SYSTEM WITH TEXT MESSAGE VOICE OUTPUT FUNCTION - A car A/V system head unit includes a text message receiving and decoding device for receiving a text message sent by an external communication device and decoding the text message into a decoded data signal, a display device including a display zone for displaying the decoded data signal and function selection touch zones selectively touchable by a person to select different functions, a text/voice converter for converting the decoded data signal into a voice signal, and a voice output system for outputting the voice signal. Thus, the car A/V system head unit can convert each received text message into a voice signal and output the voice signal directly in the car, avoiding distracted driving and enhancing driver safety. | 07-24-2014 |
20140207462 | System and Method of Providing a Spoken Dialog Interface to a Website - Disclosed is a method for training a spoken dialog service component from website data. Spoken dialog service components typically include an automatic speech recognition module, a language understanding module, a dialog management module, a language generation module and a text-to-speech module. The method includes selecting anchor texts within a website based on a term density, weighting those anchor texts based on a percent of salient words to total words, and incorporating the weighted anchor texts into a live spoken dialog interface, the weights determining a level of incorporation into the live spoken dialog interface. | 07-24-2014 |
20140222433 | System and Method for Evaluating Intent of a Human Partner to a Dialogue Between Human User and Computerized System - A system and method for assigning relative scores to various possible intents on the part of a user approaching a virtual agent, the method comprising predicting priority topics, including gathering first data and employing the first data to discern and seek user confirmation of at least one possible intent on the part of the user; and subsequent to receipt of the confirmation, gathering second data and employing the second data to provide service to the user to suit the user's confirmed intent. | 08-07-2014 |
20140244263 | METHOD AND SYSTEM FOR CONTROLLING A USER RECEIVING DEVICE USING VOICE COMMANDS - A system and method includes a language processing module converting an electrical signal corresponding to an audible signal into a textual signal. The system further includes a command generation module converting the textual signal into a user receiving device control signal. A controller controls a function of a user receiving device in response to the user receiving device control signal. | 08-28-2014 |
20140257815 | SPEECH RECOGNITION ASSISTED EVALUATION ON TEXT-TO-SPEECH PRONUNCIATION ISSUE DETECTION - Pronunciation issues for synthesized speech are automatically detected using human recordings as a reference within a Speech Recognition Assisted Evaluation (SRAE) framework including a Text-To-Speech (TTS) flow and a Speech Recognition (SR) flow. A pronunciation issue detector evaluates results obtained at multiple levels of the TTS flow and the SR flow (e.g. phone, word, and signal level) by using the corresponding human recordings as the reference for the synthesized speech, and outputs possible pronunciation issues. A signal level may be used to determine similarities/differences between the recordings and the TTS output. A model level checker may provide results to the pronunciation issue detector to check the similarities of the TTS and the SR phone set, including mapping relations. Results from a comparison of the SR output and the recordings may also be evaluated by the pronunciation issue detector, which outputs a list of potential pronunciation issue candidates. | 09-11-2014 |
20140257816 | SPEECH SYNTHESIS DICTIONARY MODIFICATION DEVICE, SPEECH SYNTHESIS DICTIONARY MODIFICATION METHOD, AND COMPUTER PROGRAM PRODUCT - According to an embodiment, a speech synthesis dictionary modification device includes an extracting unit, a display unit, an acquiring unit, a modification unit, and an updating unit. The extracting unit extracts synthesis information containing a feature sequence of a synthetic speech from the synthetic speech generated by using a speech synthesis dictionary containing probability distributions of speech features. The display unit displays an image prompting modification of a probability distribution contained in the speech synthesis dictionary on the basis of the synthesis information extracted by the extracting unit. The acquiring unit acquires an instruction to modify the probability distribution contained in the speech synthesis dictionary. The modification unit modifies the probability distribution contained in the speech synthesis dictionary according to the instruction. The updating unit updates the speech synthesis dictionary on the basis of the result of the modification to generate a new speech synthesis dictionary. | 09-11-2014 |
20140257817 | System and Method for Synthetic Voice Generation and Modification - Disclosed herein are systems, methods, and non-transitory computer-readable storage media for generating a synthetic voice. A system configured to practice the method combines a first database of a first text-to-speech voice and a second database of a second text-to-speech voice to generate a combined database, selects from the combined database, based on a policy, voice units of a phonetic category for the synthetic voice to yield selected voice units, and synthesizes speech based on the selected voice units. The system can synthesize speech without parameterizing the first text-to-speech voice and the second text-to-speech voice. A policy can define, for a particular phonetic category, from which text-to-speech voice to select voice units. The combined database can include multiple text-to-speech voices from different speakers. The combined database can include voices of a single speaker speaking in different styles. The combined database can include voices of different languages. | 09-11-2014 |
20140257818 | System and Method for Unit Selection Text-to-Speech Using A Modified Viterbi Approach - Disclosed herein are systems, methods, and non-transitory computer-readable storage media for speech synthesis. A system practicing the method receives a set of ordered lists of speech units, for each respective speech unit in each ordered list in the set of ordered lists, constructs a sublist of speech units from a next ordered list which are suitable for concatenation, performs a cost analysis of paths through the set of ordered lists of speech units based on the sublist of speech units for each respective speech unit, and synthesizes speech using a lowest cost path of speech units through the set of ordered lists based on the cost analysis. The ordered lists can be ordered based on the respective pitch of each speech unit. In one embodiment, speech units which do not have an assigned pitch can be assigned a pitch. | 09-11-2014 |
20140257819 | METHOD AND DEVICE FOR SWITCHING CURRENT INFORMATION PROVIDING MODE - A method for switching a current information providing mode is provided, wherein the method comprises the following steps: user context information related to a user device is first collected. A current user context of the user device is then identified in accordance with the collected user context information, so that consequence data identifying that context is generated. A current information providing mode suitable for the current user context of the user device is subsequently switched to according to the consequence data. | 09-11-2014 |
20140278429 | IDENTIFYING CORRESPONDING POSITIONS IN DIFFERENT REPRESENTATIONS OF A TEXTUAL WORK - Described herein are techniques for determining corresponding positions between different representations of a textual work. In some of the techniques, portions of one or more representations may be processed. A determination of a corresponding position may be made in response to a request received from a user, such as a reader that desires to switch between representations. The request may indicate a position in one representation and the representation to which the user would like to switch. In response to receiving the request, one or more portions of one or more representations of a textual work may be processed. In some techniques, a corresponding position between different representations may be determined without processing the entirety of one or more representations of the textual work. For example, a corresponding position may be determined without processing an entire audio representation. | 09-18-2014 |
20140278430 | APPARATUS, METHOD, AND COMPUTER READABLE MEDIUM FOR EXPEDITED TEXT READING USING STAGED OCR TECHNIQUE - A system and method are provided for accelerating machine reading of text. In one embodiment, the system comprises at least one processor device. The processor device is configured to receive at least one image of text to be audibly read. The text includes a first portion and a second portion. The processor device is further configured to initiate optical character recognition (OCR) to recognize the first portion. The processor device is further configured to initiate an audible presentation of the first portion prior to initiating OCR of the second portion, and simultaneously perform OCR to recognize the second portion of the text to be audibly read during presentation of at least part of the first portion. The processor device is further configured to automatically cause the second portion of the text to be audibly presented immediately upon completion of the presentation of the first portion. | 09-18-2014 |
20140278431 | Method and System for Enhancing a Speech Database - A system, method, and computer-readable medium for enhancing a speech database for speech synthesis are disclosed. The method may include labeling audio files in a primary speech database, identifying segments in the labeled audio files that have varying pronunciations based on language differences, identifying replacement segments in a secondary speech database, enhancing the primary speech database by substituting the identified secondary speech database segments for the corresponding identified segments in the primary speech database, and storing the enhanced primary speech database for use in speech synthesis. | 09-18-2014 |
20140297285 | AUTOMATIC PAGE CONTENT READING-ALOUD METHOD AND DEVICE THEREOF - The present disclosure discloses a page content reading method and a device thereof. The method includes obtaining page content requested for browsing and determining whether a format of the page content meets a pre-determined requirement; if the format of the page content meets the pre-determined requirement, displaying the page content, and, upon receiving a reading-aloud request, processing the page content into a form adapted for reading aloud and automatically reading aloud the processed page content; if the format of the page content does not meet the pre-determined requirement, displaying the page content after its format has been converted into a format that meets the pre-determined requirement, and, upon receiving a reading-aloud request from the user, processing the page content into a form adapted for reading aloud and automatically reading aloud the processed page content. The embodiment of the present invention can be widely applied and can reduce the cost of realization. | 10-02-2014 |
20140303979 | SYSTEM AND METHOD FOR CONCATENATE SPEECH SAMPLES WITHIN AN OPTIMAL CROSSING POINT - A method for identifying an optimal crossing point for concatenation of speech samples within an overlap area is provided. The method includes retrieving a first speech sample and a second speech sample, the second speech sample being concatenated immediately after the first; determining a first region within the ending of the first speech sample and a second region within the beginning of the second speech sample, the first region and the second region being determined based on relatively high spectral similarity over time between the first speech sample and the second speech sample; identifying an overlap region between the first region and the second region; determining an optimal crossing point between the first speech sample and the second speech sample, the optimal crossing point having a maximum correlation over time; and concatenating the first speech sample and the second speech sample at the optimal crossing point. | 10-09-2014 |
20140316786 | Creating statistical language models for audio CAPTCHAs - Methods for creating statistical language models (SLMs) for audio Completely Automated Public Turing tests to tell Computers and Humans Apart (CAPTCHAs) are disclosed. In these methods, candidate challenge items including one or more words are automatically selected from a document corpus. Selected ones of the challenge items are articulated by a machine text-to-speech (TTS) system as candidate articulations. Those articulations are ranked based on a human listener score indicating whether a candidate articulation originated from a machine. The SLM is then trained to recognize machine TTS articulations according to those rankings, by using a subset of the plurality of candidate challenge items identified as machine articulations as a seed set. | 10-23-2014 |
20140324435 | COMBINED STATISTICAL AND RULE-BASED PART-OF-SPEECH TAGGING FOR TEXT-TO-SPEECH SYNTHESIS - In response to a word of a text sequence, a first part-of-speech (POS) tag is generated using a statistical part-of-speech (POS) tagger based on a corpus of trained text sequences, each representing a likely POS of a word for a given text sequence. A second POS tag is generated using a rule-based POS tagger based on a set of one or more rules associated with a type of an application associated with the text sequence. A final POS tag is assigned to the word of the text sequence for TTS synthesis based on the first POS tag and the second POS tag. | 10-30-2014 |
20140324436 | METHOD AND APPARATUS FOR AUDIO PLAYING - A method and apparatus for audio playing are provided. The method includes receiving an audio conversion request carrying a first text identifier, and obtaining a first electronic text corresponding to the first text identifier; obtaining the audio data corresponding to the characters in the first electronic text according to a correspondence between characters and audio data which is stored in advance; and playing the audio data in the order of corresponding characters in the first electronic text. By applying the present disclosure, it is possible to improve the efficiency of obtaining information from electronic text. | 10-30-2014 |
20140324437 | COMMUNICATION DEVICE TRANSFORMING TEXT MESSAGE INTO SPEECH - The application discloses a communication device and method of processing a text message in the communication device. An aspect of the present application is a method of receiving text messages from a counterpart; determining whether to transform the received text messages into voice data according to control information, wherein the control information corresponds to a condition for transforming the text messages into the voice data; selectively transforming a text message among the text messages into the voice data of the counterpart according to a result of the determining step; and transmitting the transformed voice data to a sound reproduction device. | 10-30-2014 |
20140324438 | SCREEN READER HAVING CONCURRENT COMMUNICATION OF NON-TEXTUAL INFORMATION - A screen reader software product for low-vision users, the software having a reader module collecting textual and non-textual display information generated by a web browser or word processor. Font styling, interface layout information and the like are communicated to the end user by sounds broadcast simultaneously rather than serially with the synthesized speech to improve the speed and efficiency in which information may be digested by the end user. | 10-30-2014 |
20140330567 | SPEECH SYNTHESIS FROM ACOUSTIC UNITS WITH DEFAULT VALUES OF CONCATENATION COST - A speech synthesis system can select recorded speech fragments, or acoustic units, from a very large database of acoustic units to produce artificial speech. When a pair of acoustic units in the database does not have an associated concatenation cost, the system assigns a default concatenation cost. The system then synthesizes speech, identifies the acoustic unit sequential pairs generated and their respective concatenation costs, and stores those concatenation costs likely to occur. | 11-06-2014 |
20140337033 | ELECTRONIC DEVICE FOR PROVIDING INFORMATION TO USER - The present disclosure relates to an electronic device and a method which may visually provide information to a user, and notify the user of the information through other senses (e.g., a tactile sense, a hearing sense, etc.). The method includes performing voice guidance of information displayed on the touch screen in a predetermined order, detecting a user's input through the touch screen, and changing the order and performing the voice guidance in the changed order, when the detected user's input is a direction change input. | 11-13-2014 |
20140358548 | VOICE PROCESSOR, VOICE PROCESSING METHOD, AND COMPUTER PROGRAM PRODUCT - According to an embodiment, a voice processor includes a presenting unit to present text to an operator; a voice acquisition unit to acquire a voice of the operator reading aloud the text; an identifying unit to identify output intervals of phonemes included in the voice; a determination unit to determine whether each of time lengths of the output intervals is normal; a frequency acquisition unit to acquire frequency values respectively representing occurrence frequencies of contexts, respectively corresponding to the phonemes, the context including the phoneme and another phoneme adjacent to at least one side of the phoneme; and a score calculator to calculate a score representing correctness of the voice on the basis of the determination results of the time lengths of the output intervals and the frequency values of the contexts acquired respectively corresponding to the phonemes. | 12-04-2014 |
20140372123 | ELECTRONIC DEVICE AND METHOD FOR CONVERSION BETWEEN AUDIO AND TEXT - A method for outputting a text as audio, the method includes detecting a request for outputting a text as audio, searching for the text in a user input storage unit, searching for pronunciation data corresponding to the found text in the user input storage unit, and outputting an audio signal corresponding to the found pronunciation data. Other embodiments including an electronic device for converting audio into a text are disclosed. | 12-18-2014 |
20150012277 | Training and Applying Prosody Models - Techniques for training and applying prosody models for speech synthesis are provided. A speech recognition engine processes audible speech to produce text annotated with prosody information. A prosody model is trained with this annotated text. After initial training, the model is applied during speech synthesis to generate speech with non-standard prosody from input text. Multiple prosody models can be used to represent different prosody styles. | 01-08-2015 |
20150025891 | METHOD AND SYSTEM FOR TEXT-TO-SPEECH SYNTHESIS WITH PERSONALIZED VOICE - A method and system are provided for text-to-speech synthesis with personalized voice. The method includes receiving an incidental audio input. | 01-22-2015 |
20150046164 | METHOD, APPARATUS, AND RECORDING MEDIUM FOR TEXT-TO-SPEECH CONVERSION - A text-to-speech conversion method includes receiving a message including text and originator identification information, retrieving stored voice data corresponding to an originator identified by the originator identification information, and synthesizing speech from the text included in the message based on the retrieved voice data. A text-to-speech conversion apparatus is also disclosed, where the voice data can be updated using a voice signal obtained during a telephone conversation including the originator. The speech can be synthesized using a statistical parametric speech synthesis method, and the voice data can include a statistical acoustic voice model. The speech can also be synthesized according to an emotion detected from the text in the received message. | 02-12-2015 |
20150058019 | SPEECH PROCESSING SYSTEM AND METHOD - A method of training an acoustic model for a text-to-speech system. | 02-26-2015 |
20150066510 | VARIABLE-DEPTH AUDIO PRESENTATION OF TEXTUAL INFORMATION - A respective sequence of tracks of Internet content of common subject matter is queued to each of a plurality of stations, where each of the tracks of Internet content resides on a respective Internet resource in textual form. In response to receiving a sample input, snippets of each of multiple tracks queued to a selected station among the plurality of stations are transmitted for audible presentation as synthesized human speech, where each of the snippets includes only a subset of a corresponding track. Thereafter, one or more complete tracks among the multiple tracks for which snippets were previously transmitted are transmitted for audio presentation as synthesized human speech. | 03-05-2015 |
20150066511 | IMAGE PROCESSING METHOD AND ELECTRONIC DEVICE THEREOF - A method and an electronic device for processing an image in an electronic device are provided. The method includes determining whether the electronic device is mounted in a cradle comprising at least one guide region, scanning an image in the guide region using a camera, and outputting the scanned image or image information based on the scanned image. The method can easily process the image and provide the user with the output information. Therefore, the output information is favorable to the blind people or the illiterate, and the usability and the reliability of the electronic device can be enhanced. | 03-05-2015 |
20150073805 | SYSTEM AND METHOD FOR DISTRIBUTED VOICE MODELS ACROSS CLOUD AND DEVICE FOR EMBEDDED TEXT-TO-SPEECH - Disclosed herein are systems, methods, and computer-readable storage media for intelligent caching of concatenative speech units for use in speech synthesis. A system configured to practice the method can identify a speech synthesis context, and determine, based on a local cache of text-to-speech units for a text-to-speech voice and based on the speech synthesis context, additional text-to-speech units which are not in the local cache. The system can request from a server the additional text-to-speech units, and store the additional text-to-speech units in the local cache. The system can then synthesize speech using the text-to-speech units and the additional text-to-speech units in the local cache. The system can prune the cache as the context changes, based on availability of local storage, or after synthesizing the speech. The local cache can store a core set of text-to-speech units associated with the text-to-speech voice that cannot be pruned from the local cache. | 03-12-2015 |
20150081306 | PROSODY EDITING DEVICE AND METHOD AND COMPUTER PROGRAM PRODUCT - According to an embodiment, a prosody editing device includes an approximate contour generator, a setter, a display controller, an operation receiver, and an updater. The approximate contour generator approximates a contour representing a time series of prosody information with a parametric curve including a control point to generate an approximate contour. The setter sets, on the approximate contour, an operation point corresponding to the control point. The display controller displays, on a display device, an operation screen including the approximate contour on which the operation point is shown. The operation receiver receives an operation to move the operation point optionally selected on the operation screen. The updater calculates a position of the control point from a moving amount of the operation point and updates the approximate contour. | 03-19-2015 |
20150081307 | ARRANGEMENT AND A METHOD FOR CREATING A SYNTHESIS FROM NUMERICAL DATA AND TEXTUAL INFORMATION - An arrangement and a method for creating a synthesis from numerical data and textual information, and more particularly, relating to wellbeing of an individual. The arrangement includes a first information in a numerical form provided by a wellbeing device, for example, a second information including free-form text in natural language format provided by care-giving personnel, for example, and a control entity for obtaining the first information and the second information, wherein the control entity is arranged to semantically analyze the free-form text in natural language format of the second information in order to create a synthesis from the first information and the second information. | 03-19-2015 |
20150088521 | SPEECH SERVER, SPEECH METHOD, RECORDING MEDIUM, SPEECH SYSTEM, SPEECH TERMINAL, AND MOBILE TERMINAL - A speech server includes: a speech terminal-specifying information management unit configured to manage speech terminal-specifying information; a reception unit configured to receive, from an external server, (i) the speech terminal-specifying information or user-specifying information and (ii) speech information indicative of speech content to be outputted as speech; and a speech instruction unit configured to instruct a speech terminal specified by the speech terminal-specifying information to output the speech content as speech. | 03-26-2015 |
20150088522 | SYSTEMS AND METHODS FOR DYNAMICALLY IMPROVING USER INTELLIGIBILITY OF SYNTHESIZED SPEECH IN A WORK ENVIRONMENT - A method and apparatus that dynamically adjust operational parameters of a text-to-speech engine in a speech-based system are disclosed. A voice engine or other application of a device provides a mechanism to alter the adjustable operational parameters of the text-to-speech engine. In response to one or more environmental conditions, the adjustable operational parameters of the text-to-speech engine are modified to increase the intelligibility of synthesized speech. | 03-26-2015 |
20150095034 | PERSONALIZED TEXT-TO-SPEECH SERVICES - A personalized text-to-speech (pTTS) system provides a method for converting text data to speech data utilizing a pTTS template representing the voice characteristics of an individual. A memory stores executable program code that converts text data to speech data. Text data represents a textual message directed to a system user and speech data represents a spoken form of text data having the characteristics of an individual's voice. A processor executes the program code, and a storage device stores a pTTS template and may store speech data. The pTTS system can be used to provide various services that provide immediate spoken presentation of the speech data converted from text data and/or combine stored speech data with generated speech data for spoken presentation. | 04-02-2015 |
20150106101 | METHOD AND APPARATUS FOR PROVIDING SPEECH OUTPUT FOR SPEECH-ENABLED APPLICATIONS - Techniques for providing speech output for speech-enabled applications. A synthesis system receives from a speech-enabled application a text input including a text transcription of a desired speech output. The synthesis system selects one or more audio recordings corresponding to one or more portions of the text input. In one aspect, the synthesis system selects from audio recordings provided by a developer of the speech-enabled application. In another aspect, the synthesis system selects an audio recording of a speaker speaking a plurality of words. The synthesis system forms a speech output including the one or more selected audio recordings and provides the speech output for the speech-enabled application. | 04-16-2015 |
20150112687 | METHOD FOR RERECORDING AUDIO MATERIALS AND DEVICE FOR IMPLEMENTATION THEREOF - The inventive method and apparatus improve the quality of the teaching phase, improve the degree of match of the user's voice in a converted speech signal, and ensure that the teaching phase need be carried out only once for different audio materials. A program-controlled electronic information processing device (PCEIPD) generates an acoustic base of initial audio materials (ABIA) and an acoustic teaching base (ATB). Upon selecting at least one audio material from the ABIA list, this material is transmitted to the PCEIPD RAM for storage. Files of the speaker's teaching phrases are selected from the ATB and converted into audio phrases transmitted to a sound playback device. The user repeats the audio phrases into a microphone while the text of the repeated phrase is displayed, and a cursor moves along the phrase text in accordance with how the user should repeat the phrase. | 04-23-2015 |
20150127348 | DOCUMENT DISTRIBUTION AND INTERACTION - Workflows are provided that enable documents to be distributed, assented to, and otherwise interacted with on an aural and/or oral basis. Such workflows can be implemented so as to allow a recipient to receive, understand, and interact with a document using conventional components such as the microphone and speaker provided by a telephone. For instance, in one embodiment a document originator may send a document to a recipient with a request for an electronic signature. The document may include an audio version of the document terms. The recipient can listen to the audio version of the document terms and record an electronic signature that represents assent to such terms. An electronic signature server can record the recipient's electronic signature and incorporate it into the document, such that it forms part of the electronic document just as a traditional handwritten signature forms part of a signed paper document. | 05-07-2015 |
20150134338 | FOREIGN LANGUAGE LEARNING APPARATUS AND METHOD FOR CORRECTING PRONUNCIATION THROUGH SENTENCE INPUT - Provided are a foreign language learning apparatus and method using a function of reading an input sentence in voice via a Text To Speech (TTS) engine. The foreign language learning apparatus and method correct pronunciation through sentence input. The foreign language learning apparatus includes a sentence input unit for receiving a first sentence from a user; a linked letter detection unit for detecting at least one letter corresponding to at least one linking rule; a linked letter removal unit for removing the letter and generating a second sentence by inserting a linking code; a partial waveform generation unit for generating one or more partial waveforms using the TTS engine; an input waveform generation unit for converting a voice corresponding to the first sentence into an input waveform; and a matching degree calculation unit for calculating a matching degree and a partial matching degree. | 05-14-2015 |
20150134339 | Devices and Methods for Weighting of Local Costs for Unit Selection Text-to-Speech Synthesis - A device may determine a representation of text that includes a first linguistic term associated with a first set of speech sounds and a second linguistic term associated with a second set of speech sounds. The device may determine a plurality of joins between the first set and the second set. A given join may be indicative of concatenating a first speech sound from the first set with a second speech sound from the second set. A given local cost of the given join may correspond to a weighted sum of individual costs. A given individual cost may be weighted based on a variability of the given individual cost in the plurality of joins. The device may provide a sequence of speech sounds indicative of a pronunciation of the text based on a minimization of a sum of local costs of adjacent speech sounds in the sequence. | 05-14-2015 |
20150142444 | AUDIO RENDERING ORDER FOR TEXT SOURCES - A method includes loading text content into at least one user device; applying at least one reading order to at least one text section of the text content to change a presentation order; converting the at least one text section to an audio output based upon the presentation order; and playing the audio output using the presentation order on the at least one user device. | 05-21-2015 |
20150149178 | SYSTEM AND METHOD FOR DATA-DRIVEN INTONATION GENERATION - Systems, methods, and computer-readable storage media for text-to-speech processing having an improved intonation. The system first receives text to be converted to speech, the text having a first segment and a second segment. The system then compares the text to a database of stored utterances, identifying in the database a first utterance corresponding to the first segment and determining an intonation of the first utterance. When the database does not contain a second utterance corresponding to the second segment, the system generates the speech corresponding to the text by combining the first utterance with a generated second utterance corresponding to the second segment, the generated second utterance having the intonation matching, or based on, the first utterance. These actions lead to an improved, smoother, more human-like synthetic speech output from the system. | 05-28-2015 |
20150149179 | SYSTEMS AND METHODS FOR PRESENTING SOCIAL NETWORK COMMUNICATIONS IN AUDIBLE FORM BASED ON USER ENGAGEMENT WITH A USER DEVICE - Methods and systems are described herein for generating an audible presentation of a communication received from a remote server. A presentation of a media asset on a user equipment device is generated for a first user. A textual-based communication is received, at the user equipment device from the remote server. The textual-based communication is transmitted to the remote server by a second user and the remote server transmits the textual-based communication to the user equipment device responsive to determining that the second user is on a list of users associated with the first user. An engagement level of the first user with the user equipment device is determined. Responsive to determining that the engagement level does not exceed a threshold value, a presentation of the textual-based communication is generated in audible form. | 05-28-2015 |
20150149180 | MOBILE TERMINAL AND CONTROL METHOD THEREOF - A mobile terminal and a control method of the mobile terminal are provided. The mobile terminal includes: a memory configured to store event information; and a controller configured to retrieve at least one event information entered for the time between specified points from the memory, create a frame screen for displaying the retrieved at least one event information and a notepad for storing at least one keyword extracted from each of the retrieved at least one event information contained in the frame screen, and create a diary by interfacing the frame screen with the notepad. | 05-28-2015 |
20150149181 | METHOD AND SYSTEM FOR VOICE SYNTHESIS - Method and system for generating audio signals. | 05-28-2015 |
20150293745 | TEXT-READING DEVICE AND TEXT-READING METHOD - A text-reading device includes: a visual line direction detection device for a driver; a memory that stores the visual line direction when the driver looks at a display device; a gaze determination device that determines that the driver gazes at the display device when a state in which the detected visual line direction coincides with the stored visual line direction continues for a predetermined time or longer; a voice conversion device that outputs text information of the display device as a voice signal based on an instruction; and a reading control device that inputs the instruction when the driver gazes at the display device while the display device displays the text information and the vehicle starts to move. | 10-15-2015 |
20150294664 | NAVIGATION DEVICE AND INFORMATION PROVIDING METHOD - A navigation device includes: an acquisition section for acquiring a plurality of updated content introductory information items, each representing the latest updated content in a predetermined web site, based on the user's preference information; a detection section for detecting surrounding position information covering an area around the current position from a large number of position information items stored in a predetermined storage section; and a search section for searching for particular updated content introductory information corresponding to the surrounding position information detected by the detection section, as surrounding updated content introductory information, from the plurality of updated content introductory information items acquired by the acquisition section. | 10-15-2015 |
20150324169 | System And Method For Audibly Presenting Selected Text - Disclosed herein are methods for presenting speech from a selected text that is on a computing device. The method includes presenting text on a touch-sensitive display at a size within a threshold level so that the computing device can accurately determine the intent of the user when the user touches the touch screen. Once the user touch has been received, the computing device identifies and interprets the portion of text that is to be selected, and subsequently presents the text audibly to the user. | 11-12-2015 |
20150325231 | FACILITATING TEXT-TO-SPEECH CONVERSION OF A DOMAIN NAME OR A NETWORK ADDRESS CONTAINING A DOMAIN NAME - To facilitate text-to-speech conversion of a username, a first or last name of a user associated with the username may be retrieved, and a pronunciation of the username may be determined based at least in part on whether the name forms at least part of the username. To facilitate text-to-speech conversion of a domain name having a top level domain and at least one other level domain, a pronunciation for the top level domain may be determined based at least in part upon whether the top level domain is one of a predetermined set of top level domains. Each other level domain may be searched for one or more recognized words therewithin, and a pronunciation of the other level domain may be determined based at least in part on an outcome of the search. The username and domain name may form part of a network address such as an email address, URL or URI. | 11-12-2015 |
20150325233 | METHOD AND SYSTEM FOR ACHIEVING EMOTIONAL TEXT TO SPEECH - A method and system for achieving emotional text to speech. The method includes: receiving text data; generating an emotion tag for the text data by a rhythm piece; and achieving TTS of the text data corresponding to the emotion tag, where the emotion tag is expressed as a set of emotion vectors, each emotion vector including a plurality of emotion scores given based on a plurality of emotion categories. A system for the same includes: a text data receiving module; an emotion tag generating module; and a TTS module for achieving TTS, wherein the emotion tag is expressed as a set of emotion vectors, and wherein each emotion vector includes a plurality of emotion scores given based on a plurality of emotion categories. | 11-12-2015 |
20150332664 | ELECTRONIC BOOK WITH VOICE EMULATION FEATURES - A method and system for providing text-to-audio conversion of an electronic book displayed on a viewer. A user selects a portion of displayed text and converts it into audio. The text-to-audio conversion may be performed via a header file and pre-recorded audio for each electronic book, via text-to-speech conversion, or other available means. The user may select manual or automatic text-to-audio conversion. The automatic text-to-audio conversion may be performed by automatically turning the pages of the electronic book or by the user manually turning the pages. The user may also select to convert the entire electronic book, or portions of it, into audio. The user may also select an option to receive an audio definition of a particular word in the electronic book. The present invention allows a user to control the system by selecting options from a screen or by entering voice commands. | 11-19-2015 |
20150348532 | METHOD AND SYSTEM FOR MAKING AND PLAYING SOUNDTRACKS - A composite variable duration soundtrack for a user to play while reading a text source, the soundtrack duration being defined by a soundtrack timeline. The soundtrack comprises multiple sound layers configured to play concurrently through the soundtrack timeline, each sound layer having an arrangement of one or more audio features that are configured to play at preset start times in the soundtrack timeline. At least one sound layer is adapted for modifying the preset start and stop times of its audio features to match the reading speed of a user based on a reading speed input. | 12-03-2015 |
20150348533 | DOMAIN SPECIFIC LANGUAGE FOR ENCODING ASSISTANT DIALOG - Systems and processes for generating output dialogs for virtual assistants are provided. An output dialog can be generated from multiple output segments that can each include a string of one or more characters or words. The contents of an output segment can be selected from multiple possible outputs based on a predetermined order, conditional logic, or a random selection. The output segments can be concatenated to form the output dialog. In one example, a dialog generation file that defines the possible outputs for each output segment, an ordering of the output segments within the output dialog, and format for the output dialog can be used to generate the output dialog. The dialog generation file can include any number of functional blocks, which can each output an output segment, that can be arranged hierarchically and in a particular order to generate a desired output dialog. | 12-03-2015 |
20150348534 | AUDIO OUTPUT OF A DOCUMENT FROM MOBILE DEVICE - Architecture for playing a document converted into an audio format to a user of an audio-output capable device. The user can interact with the device to control play of the audio document such as pause, rewind, forward, etc. In a more robust implementation, the audio-output capable device is a mobile device (e.g., cell phone) having a microphone for processing voice input. Voice commands can then be input to control play (“reading”) of the document audio file to pause, rewind, read paragraph, read next chapter, fast forward, etc. A communications server (e.g., email, attachments to email, etc.) transcodes text-based document content into an audio format by leveraging a text-to-speech (TTS) engine. The transcoded audio files are then transferred to mobile devices through viable transmission channels. Users can then play the audio-formatted document while freeing hand and eye usage for other tasks. | 12-03-2015 |
20150356967 | Generating Narrative Audio Works Using Differentiable Text-to-Speech Voices - An approach is provided in which a voice management system generates multiple audio test recordings using multiple text-to-speech (TTS) voices that have different acoustic properties. The voice management system determines that a comparison between a first one of the TTS voices and a second one of the TTS voices reaches an acoustic differentiation threshold and, as a result, assigns the first TTS voice to a first character and assigns the second TTS voice to a second character. In turn, the voice management system generates a narrative audio work utilizing the first TTS voice corresponding to the first character and the second TTS voice corresponding to the second character. | 12-10-2015 |
20150364126 | ON-SITE SPEAKER DEVICE, ON-SITE SPEECH BROADCASTING SYSTEM AND METHOD THEREOF - Embodiments of the present disclosure provide a method of on-site speech broadcasting. The method includes receiving a text signal, wherein the text signal is generated in response to a parameter reaching a predetermined value sensed by an on-site sensor. | 12-17-2015 |
20150371626 | METHOD AND APPARATUS FOR SPEECH SYNTHESIS BASED ON LARGE CORPUS - The present invention discloses a method and apparatus for speech synthesis based on a large corpus. The method for speech synthesis based on a large corpus comprises: utilizing a prosodic structure prediction model to carry out prosodic structure prediction processing on input text to provide at least one alternative prosodic boundary partitioning solution; determining a prosodic boundary partitioning solution according to structure probability information about a prosodic unit in a speech corpus in the at least one alternative prosodic boundary partitioning solution; and carrying out speech synthesis according to the determined prosodic boundary partitioning solution. The method and apparatus for speech synthesis based on a large corpus provided by the embodiments of the present invention improve the naturalness and flexibility of speech synthesis. | 12-24-2015 |
20150379981 | AUTOMATICALLY PRESENTING DIFFERENT USER EXPERIENCES, SUCH AS CUSTOMIZED VOICES IN AUTOMATED COMMUNICATION SYSTEMS - An automated communication system with an associated method for presenting customized voices is disclosed. The system, which performs a predetermined task, accepts information regarding an intended user indicating the intended user's identity, preferences, etc. Next, the system customizes one or more voices for the intended user based on the accepted information. The system then presents to the intended user one or more audible communications converted from text associated with a predetermined task performed by the system using the one or more customized voices. | 12-31-2015 |
20150379982 | SYNTHESIZED AUDIO MESSAGE OVER COMMUNICATION LINKS - A communication device establishes an audio connection with a far-end user via a communication network. The communication device receives text input from a near-end user, and converts the text input into speech signals. The speech signals are transmitted to the far-end user using the established audio connection while muting audio input to its microphone. Other embodiments are also described and claimed. | 12-31-2015 |
20150379994 | Personalized Sound Management and Method - A personalized sound management system for an acoustic space includes at least one transducer, a data communication system, one or more processors operatively coupled to the data communication system and the at least one transducer, and a medium coupled to the one or more processors. The processors access a database of sonic signatures and display a plurality of personalized sound management applications that perform at least one or more tasks among identifying a sonic signature, calculating a sound pressure level, storing metadata related to a sonic signature, monitoring sound pressure level dosage levels, switching to an ear canal microphone in a noisy environment, recording a user's voice, storing the user's voice in a memory of an earpiece device or in a memory of a server system, or converting text received in texts or emails to voice using text-to-speech conversion. Other embodiments are disclosed. | 12-31-2015 |
20160005391 | Devices and Methods for Use of Phase Information in Speech Processing Systems - A device may receive a speech signal. The device may determine acoustic feature parameters for the speech signal. The acoustic feature parameters may include phase data. The device may determine circular space representations for the phase data based on an alignment of the phase data with given axes of the circular space representations. The device may map the phase data to linguistic features based on the circular space representations. The linguistic features may be associated with linguistic content that includes phonemic content or text content. The device may provide a synthetic audio pronunciation of the linguistic content based on the mapping. | 01-07-2016 |
20160005393 | Voice Prompt Generation Combining Native and Remotely-Generated Speech Data - An electronic device includes a processor and a memory coupled to the processor. The memory stores instructions that, when executed by the processor, cause the processor to perform operations including determining whether a text prompt received from a wireless device corresponds to first synthesized speech data stored at the memory. The operations include, in response to a determination that the text prompt does not correspond to the first synthesized speech data, determining whether a network is accessible. The operations include, in response to a determination that the network is accessible, sending a text-to-speech (TTS) conversion request to a server via the network. The operations further include, in response to receiving second synthesized speech data from the server, storing the second synthesized speech data at the memory. | 01-07-2016 |
20160035343 | METHOD AND APPARATUS FOR LIVE CALL TEXT-TO-SPEECH - A method and apparatus provide live call text-to-speech. The method can include entering an ongoing voice call at a first communication device with a second communication device. The method can include receiving text input at the first communication device during the ongoing voice call. The method can include converting the text input to speech during the ongoing voice call to generate a text-to-speech audible signal. The method can include sending the text-to-speech audible signal from the first communication device to the second communication device during the ongoing voice call. | 02-04-2016 |
20160049145 | MOBILE TERMINAL DEVICE - A mobile terminal device able to automatically set suitable field break positions in accordance with the situation, able to realize a skip operation and back skip operation by specific operations, able to efficiently utilize a readout function, and able to improve convenience to a user is provided. It has an operation unit. | 02-18-2016 |
20160055838 | SYSTEM AND METHOD FOR AUTOMATICALLY CONVERTING TEXTUAL MESSAGES TO MUSICAL COMPOSITIONS - A method for converting textual messages to musical messages comprising receiving a text input and receiving a musical input selection. The method includes analyzing the text input to determine text characteristics and analyzing a musical input corresponding to the musical input selection to determine musical characteristics. Based on the text characteristic and the musical characteristic, the method includes correlating the text input with the musical input to generate a synthesizer input, and sending the synthesizer input to a voice synthesizer. The method includes receiving a vocal rendering of the text input from the voice synthesizer, generating a musical message from the vocal rendering and the musical input, and outputting the musical message. | 02-25-2016 |
20160055843 | System and Method for Enhancing Locative Response Abilities of Autonomous and Semi-Autonomous Agents - A computer system and method according to the present invention can receive multi-modal inputs such as natural language, gesture, text, sketch and other inputs in order to simplify and improve locative question answering in virtual worlds, among other tasks. The components of an agent as provided in accordance with one embodiment of the present invention can include one or more sensors, actuators, and cognition elements, such as interpreters, executive function elements, working memory, long term memory and reasoners for responses to locative queries, for example. Further, the present invention provides, in part, a locative question answering algorithm, along with the command structure, vocabulary, and the dialog that an agent is designed to support in accordance with various embodiments of the present invention. | 02-25-2016 |
20160071509 | TEXT-TO-SPEECH PROCESSING BASED ON NETWORK QUALITY - A method is disclosed that provides text-to-speech (TTS) functionality to a telematics unit of a telematics-equipped vehicle. The method includes: receiving text content to be played back by an audio system of the telematics-equipped vehicle; determining, by a processor, a TTS rendering process to be used for the text content from a plurality of TTS rendering processes, wherein the plurality of TTS rendering processes include local TTS rendering using a local TTS engine at the telematics-equipped vehicle and remote TTS rendering using a remote TTS engine at a communications center; and causing, by the processor, the text content to be rendered as an audio signal for playback by the telematics-equipped vehicle using the determined TTS rendering process. | 03-10-2016 |
20160071510 | VOICE GENERATION WITH PREDETERMINED EMOTION TYPE - Techniques for generating voice with predetermined emotion type. In an aspect, semantic content and emotion type are separately specified for a speech segment to be generated. A candidate generation module generates a plurality of emotionally diverse candidate speech segments, wherein each candidate has the specified semantic content. A candidate selection module identifies an optimal candidate from amongst the plurality of candidate speech segments, wherein the optimal candidate most closely corresponds to the predetermined emotion type. In further aspects, crowd-sourcing techniques may be applied to generate the plurality of speech output candidates associated with a given semantic content, and machine-learning techniques may be applied to derive parameters for a real-time algorithm for the candidate selection module. | 03-10-2016 |
20160071511 | METHOD AND APPARATUS OF SMART TEXT READER FOR CONVERTING WEB PAGE THROUGH TEXT-TO-SPEECH - A method and an apparatus for outputting a full-name voice of a unit or an abbreviation are provided. The method includes detecting a unit or an abbreviation in a text to be output as a voice, searching a full name database for the detected unit or abbreviation to acquire its full name, converting the acquired full name of the unit or abbreviation into a voice and outputting the voice. The context of the text content is parsed so that it is converted into a voice using appropriate terms, conveying accurate meaning appropriate for the situation. This is a significant help to users, particularly visually impaired users for whom webpages have low accessibility. Webpages and mobile devices can thereby provide a smart talkback service for the accessibility of visually impaired users. | 03-10-2016 |
20160078859 | TEXT-TO-SPEECH WITH EMOTIONAL CONTENT - Techniques for converting text to speech having emotional content. In an aspect, an emotionally neutral acoustic trajectory is predicted for a script using a neutral model, and an emotion-specific acoustic trajectory adjustment is independently predicted using an emotion-specific model. The neutral trajectory and emotion-specific adjustments are combined to generate a transformed speech output having emotional content. In another aspect, state parameters of a statistical parametric model for neutral voice are transformed by emotion-specific factors that vary across contexts and states. The emotion-dependent adjustment factors may be clustered and stored using an emotion-specific decision tree or other clustering scheme distinct from a decision tree used for the neutral voice model. | 03-17-2016 |
20160086598 | SYSTEM AND METHOD FOR DISTRIBUTED VOICE MODELS ACROSS CLOUD AND DEVICE FOR EMBEDDED TEXT-TO-SPEECH - Systems, methods, and computer-readable storage media for intelligent caching of concatenative speech units for use in speech synthesis. A system configured to practice the method can identify a speech synthesis context, and determine, based on a local cache of text-to-speech units for a text-to-speech voice and based on the speech synthesis context, additional text-to-speech units which are not in the local cache. The system can request from a server the additional text-to-speech units, and store the additional text-to-speech units in the local cache. The system can then synthesize speech using the text-to-speech units and the additional text-to-speech units in the local cache. The system can prune the cache as the context changes, based on availability of local storage, or after synthesizing the speech. The local cache can store a core set of text-to-speech units associated with the text-to-speech voice that cannot be pruned from the local cache. | 03-24-2016 |
20160093284 | METHOD AND APPARATUS TO SYNTHESIZE VOICE BASED ON FACIAL STRUCTURES - Disclosed are embodiments for use in an articulatory-based text-to-speech conversion system configured to establish an articulatory speech synthesis model of a person's voice based on facial characteristics defining exteriorly visible articulatory speech synthesis model parameters of the person's voice and on a predefined articulatory speech synthesis model selected from among stores of predefined models. | 03-31-2016 |
20160093285 | SYSTEMS AND METHODS FOR PROVIDING NON-LEXICAL CUES IN SYNTHESIZED SPEECH - Systems and methods are disclosed for providing non-lexical cues in synthesized speech. Original text is analyzed to determine characteristics of the text and/or to derive or augment an intent (e.g., an intent code). Non-lexical cue insertion points are determined based on the characteristics of the text and/or the intent. One or more non-lexical cues are inserted at insertion points to generate augmented text. The augmented text is synthesized into speech, including converting the non-lexical cues to speech output. | 03-31-2016 |
20160093286 | SYNTHESIZING AN AGGREGATE VOICE - A system and computer-implemented method for synthesizing multi-person speech into an aggregate voice is disclosed. The method may include crowd-sourcing a data message configured to include a textual passage. The method may include collecting, from a plurality of speakers, a set of vocal data for the textual passage. Additionally, the method may also include mapping a source voice profile to a subset of the set of vocal data to synthesize the aggregate voice. | 03-31-2016 |
20160093287 | SYSTEM AND METHOD FOR GENERATING CUSTOMIZED TEXT-TO-SPEECH VOICES - A system and method are disclosed for generating customized text-to-speech voices for a particular application. The method comprises generating a custom text-to-speech voice by selecting a voice for generating a custom text-to-speech voice associated with a domain, collecting text data associated with the domain from a pre-existing text data source and using the collected text data, generating an in-domain inventory of synthesis speech units by selecting speech units appropriate to the domain via a search of a pre-existing inventory of synthesis speech units, or by recording the minimal inventory for a selected level of synthesis quality. The text-to-speech custom voice for the domain is generated utilizing the in-domain inventory of synthesis speech units. Active learning techniques may also be employed to identify problem phrases wherein only a few minutes of recorded data is necessary to deliver a high quality TTS custom voice. | 03-31-2016 |
20160093288 | Recording Concatenation Costs of Most Common Acoustic Unit Sequential Pairs to a Concatenation Cost Database for Speech Synthesis - A speech synthesis system can record concatenation costs of the most common acoustic unit sequential pairs to a concatenation cost database for speech synthesis by synthesizing speech from a text, identifying a most common acoustic unit sequential pair in the speech, assigning a concatenation cost to the most common acoustic unit sequential pair, and recording the concatenation cost of the most common acoustic unit sequential pair to a concatenation cost database. | 03-31-2016 |
20160093289 | SYSTEMS AND METHODS FOR MULTI-STYLE SPEECH SYNTHESIS - Techniques for performing multi-style speech synthesis. The techniques include using at least one computer hardware processor to perform: obtaining input comprising text and an identification of a first speaking style to use in rendering the text as speech; identifying a plurality of speech segments for use in rendering the text as speech, the identified plurality of speech segments comprising a first speech segment having the first speaking style and a second speech segment having a second speaking style different from the first speaking style; and rendering the text as speech having the first speaking style, at least in part, by using the identified plurality of speech segments. | 03-31-2016 |
20160098985 | SYSTEM AND METHOD FOR LOW-LATENCY WEB-BASED TEXT-TO-SPEECH WITHOUT PLUGINS - Disclosed herein are systems, methods, and non-transitory computer-readable storage media for reducing latency in web-browsing TTS systems without the use of a plug-in or Flash® module. A system configured according to the disclosed methods allows the browser to send prosodically meaningful sections of text to a web server. A TTS server then converts intonational phrases of the text into audio and responds to the browser with the audio file. The system saves the audio file in a cache, with the file indexed by a unique identifier. As the system continues converting text into speech, when identical text appears the system uses the cached audio corresponding to the identical text without the need for re-synthesis via the TTS server. | 04-07-2016 |
20160111081 | Med Say - After years of mispronouncing medication names and listening to other people struggle with trying to say a medication that they have been on for several years, I was inspired to invent Med Say. Med Say is a useful process that provides the proper pronunciation of brand and generic medication names. Soon I realized that anyone who needs this information should have quick and easy access to having a medication properly pronounced for them from an electronic device such as a smart phone, tablet, computer, etc. Finally a simple process is available that allows for medications to be searched, retrieved, and properly pronounced via an electronic audio file. | 04-21-2016 |
20160125872 | SYSTEM AND METHOD FOR TEXT NORMALIZATION USING ATOMIC TOKENS - A system, method and computer-readable storage devices are disclosed for normalizing text for ASR and TTS in a language-neutral way. The system described herein divides Unicode text into meaningful chunks called “atomic tokens.” The atomic tokens strongly correlate to their actual pronunciation, and not to their meaning. The system combines the tokenization with a data-driven classification scheme, followed by class-determined actions to convert text to normalized form. The classification labels are based on pronunciation, unlike alternative approaches that typically employ Named Entity-based categories. Thus, this approach is relatively simple to adapt to new languages. Non-experts can easily annotate training data because the tokens are based on pronunciation alone. | 05-05-2016 |
20160133246 | VOICE SYNTHESIS DEVICE, VOICE SYNTHESIS METHOD, AND RECORDING MEDIUM HAVING A VOICE SYNTHESIS PROGRAM RECORDED THEREON - Provided is a voice synthesis device, including: a voice synthesis information acquisition unit configured to acquire voice synthesis information for specifying a sound generating character; a replacement unit configured to replace at least a part of sound generating characters specified by the voice synthesis information with an alternative sound generating character different from the sound generating character; and a voice synthesis unit configured to execute a second synthesis process for generating a voice signal of an utterance sound obtained by the replacing. | 05-12-2016 |
20160140951 | Method and System for Building Text-to-Speech Voice from Diverse Recordings - A method and system are disclosed for building a speech database for a text-to-speech (TTS) synthesis system from multiple speakers recorded under diverse conditions. For a plurality of utterances of a reference speaker, a set of reference-speaker vectors may be extracted, and for each of a plurality of utterances of a colloquial speaker, a respective set of colloquial-speaker vectors may be extracted. A matching procedure, carried out under a transform that compensates for speaker differences, may be used to match each colloquial-speaker vector to a reference-speaker vector. The colloquial-speaker vector may be replaced with the matched reference-speaker vector. The matching-and-replacing can be carried out separately for each set of colloquial-speaker vectors. A conditioned set of speaker vectors can then be constructed by aggregating all the replaced speaker vectors. The conditioned set of speaker vectors can be used to train the TTS system. | 05-19-2016 |
20160140952 | Method For Adding Realism To Synthetic Speech - The present disclosure provides a method for adding realism to synthetic speech. The method includes receiving text. | 05-19-2016 |
20160140953 | SPEECH SYNTHESIS APPARATUS AND CONTROL METHOD THEREOF - A speech synthesis apparatus and method is provided. The speech synthesis apparatus includes a speech parameter database configured to store a plurality of parameters respectively corresponding to speech synthesis units constituting a speech file, an input unit configured to receive a text including a plurality of speech synthesis units, and a processor configured to select a plurality of candidate unit parameters respectively corresponding to a plurality of speech synthesis units constituting the input text, from the speech parameter database, to generate a parameter unit sequence of a partial or entire portion of the text according to probability of concatenation between consecutively concatenated candidate unit parameters, and to perform a synthesis operation based on hidden Markov model (HMM) using the parameter unit sequence to generate an acoustic signal corresponding to the text. | 05-19-2016 |
20160163332 | EMOTION TYPE CLASSIFICATION FOR INTERACTIVE DIALOG SYSTEM - Techniques for selecting an emotion type code associated with semantic content in an interactive dialog system. In an aspect, fact or profile inputs are provided to an emotion classification algorithm, which selects an emotion type based on the specific combination of fact or profile inputs. The emotion classification algorithm may be rules-based or derived from machine learning. A previous user input may be further specified as input to the emotion classification algorithm. The techniques are especially applicable in mobile communications devices such as smartphones, wherein the fact or profile inputs may be derived from usage of the diverse function set of the device, including online access, text or voice communications, scheduling functions, etc. | 06-09-2016 |
20160171970 | SYSTEM AND METHOD FOR AUTOMATIC DETECTION OF ABNORMAL STRESS PATTERNS IN UNIT SELECTION SYNTHESIS | 06-16-2016 |
20160171971 | GUIDED PERSONAL COMPANION | 06-16-2016 |
20160171972 | System and Method of Synthetic Voice Generation and Modification | 06-16-2016 |
20160179947 | SYSTEM AND METHOD OF LATTICE-BASED SEARCH FOR SPOKEN UTTERANCE RETRIEVAL | 06-23-2016 |
20160180155 | ELECTRONIC DEVICE AND METHOD FOR PROCESSING VOICE IN VIDEO | 06-23-2016 |
20160180833 | SOUND SYNTHESIS DEVICE, SOUND SYNTHESIS METHOD AND STORAGE MEDIUM | 06-23-2016 |
20160189704 | VOICE SELECTION SUPPORTING DEVICE, VOICE SELECTION METHOD, AND COMPUTER-READABLE RECORDING MEDIUM - A voice selection supporting device according to an embodiment of the present invention includes an acceptance unit that accepts input of a text, an analysis knowledge storage unit that stores therein text analysis knowledge to be used for characteristic analysis for the input text, an analysis unit that analyzes a characteristic of the text by referring to the text analysis knowledge, a voice attribute storage unit that stores therein a voice attribute of each voice dictionary, an evaluation unit that evaluates similarity between the voice attribute of the voice dictionary and the characteristic of the text, and a candidate presentation unit that presents, based on the similarity, a candidate for the voice dictionary suitable for the text. | 06-30-2016 |
20160203814 | ELECTRONIC DEVICE AND METHOD FOR REPRESENTING WEB CONTENT FOR THE ELECTRONIC DEVICE | 07-14-2016 |
20160379622 | AGING A TEXT-TO-SPEECH VOICE - A voice recipient may request a text-to-speech (TTS) voice that corresponds to an age or age range. An existing TTS voice or existing voice data may be used to create a TTS voice corresponding to the requested age by encoding the voice data to voice parameter values, transforming the voice parameter values using a voice-aging model, synthesizing voice data using the transformed parameter values, and then creating a TTS voice using the transformed voice data. The voice-aging model may model how one or more voice parameters of a voice change with age and may be created from voice data stored in a voice bank. | 12-29-2016 |
20160379623 | VOICE FONT SPEAKER AND PROSODY INTERPOLATION - Multi-voice font interpolation is provided. A multi-voice font interpolation engine allows the production of computer generated speech with a wide variety of speaker characteristics and/or prosody by interpolating speaker characteristics and prosody from existing fonts. Using prediction models from multiple voice fonts, the multi-voice font interpolation engine predicts values for the parameters that influence speaker characteristics and/or prosody for the phoneme sequence obtained from the text to be spoken. For each parameter, additional parameter values are generated by a weighted interpolation from the predicted values. Modifying an existing voice font with the interpolated parameters changes the style and/or emotion of the speech while retaining the base sound qualities of the original voice. The multi-voice font interpolation engine allows the speaker characteristics and/or prosody to be transplanted from one voice font to another or entirely new speaker characteristics and/or prosody to be generated for an existing voice font. | 12-29-2016 |
20170236509 | SYSTEM AND METHOD FOR INTELLIGENT LANGUAGE SWITCHING IN AUTOMATED TEXT-TO-SPEECH SYSTEMS | 08-17-2017 |
20190147838 | SYSTEMS AND METHODS FOR GENERATING ANIMATED MULTIMEDIA COMPOSITIONS | 05-16-2019 |
20220137917 | METHOD AND SYSTEM FOR ASSIGNING UNIQUE VOICE FOR ELECTRONIC DEVICE - A method in an interactive computing-system includes pre-processing an input natural-language (NL) from a user command based on natural language processing (NLP) for classifying speech information and non-speech information, obtaining an NLP result from the user command, fetching a device specific information from one or more IoT devices operating in an environment based on the NLP result, generating one or more contextual parameters based on the NLP result and the device specific information, selecting at least one speaker embedding stored in a database for the one or more IoT devices based on the one or more contextual parameters, and outputting the selected at least one speaker embedding for playback to the user. | 05-05-2022 |
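Several entries above (e.g., 20080221894, 20160086598, 20160093288) concern unit-selection synthesis, where a sequence of speech units is chosen to minimize the sum of per-unit costs and concatenation (join) costs, with frequently computed join costs cached or recorded to a database. A minimal illustrative sketch, not drawn from any of the patents themselves: the inventory, cost values, and join-cost rule below are all hypothetical, and `lru_cache` stands in for the concatenation-cost database.

```python
# Illustrative dynamic-programming unit selection with cached join costs.
# All unit names and cost values are invented for this sketch.
from functools import lru_cache

# Hypothetical inventory: candidate units per target phoneme, each paired
# with a made-up "unit cost" measuring mismatch against the target.
INVENTORY = {
    "h":  [("h1", 0.2), ("h2", 0.5)],
    "ai": [("ai1", 0.1), ("ai2", 0.3)],
}

@lru_cache(maxsize=None)  # the cache plays the role of the cost database
def concat_cost(left: str, right: str) -> float:
    # Stand-in for an acoustic join-cost computation.
    return 0.0 if left[0] == right[0] else 0.4

def select_units(targets):
    """Pick the unit sequence minimizing unit + concatenation cost."""
    best = {u: c for u, c in INVENTORY[targets[0]]}   # cost so far per unit
    path = {u: [u] for u, _ in INVENTORY[targets[0]]}
    for phone in targets[1:]:
        nbest, npath = {}, {}
        for unit, ucost in INVENTORY[phone]:
            # Best predecessor given the cached join cost.
            prev = min(best, key=lambda p: best[p] + concat_cost(p, unit))
            nbest[unit] = best[prev] + concat_cost(prev, unit) + ucost
            npath[unit] = path[prev] + [unit]
        best, path = nbest, npath
    final = min(best, key=best.get)
    return path[final], best[final]
```

For the two-phoneme target `["h", "ai"]`, the search picks `h1` then `ai1` (unit costs 0.2 and 0.1 plus one 0.4 join), matching the described scheme of recording the cost of each common unit pair once and reusing it.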
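Entry 20160379623 describes blending per-parameter predictions from multiple voice fonts by weighted interpolation. A minimal sketch under stated assumptions: the parameter names (`f0`, `duration`), the two font profiles, and the weights are hypothetical, chosen only to show the weighted-average step.

```python
# Illustrative weighted interpolation of prosody parameters predicted by
# several voice fonts; all names and values here are hypothetical.
def interpolate_fonts(predictions: dict, weights: dict) -> dict:
    """Blend each parameter across fonts by normalized weight."""
    total = sum(weights.values())
    params = next(iter(predictions.values())).keys()
    return {
        p: sum(weights[f] * predictions[f][p] for f in predictions) / total
        for p in params
    }

# Two hypothetical fonts predicting pitch (Hz) and duration (ms) for a phoneme.
fonts = {
    "calm":    {"f0": 110.0, "duration": 90.0},
    "excited": {"f0": 180.0, "duration": 60.0},
}
blended = interpolate_fonts(fonts, {"calm": 0.75, "excited": 0.25})
```

Sliding the weights between the two fonts moves the blended prosody between the styles while, per the abstract, the base sound qualities would come from whichever font the interpolated parameters are applied to.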