Patent application number | Description | Published |
20090112945 | DATA PROCESSING APPARATUS AND METHOD OF PROCESSING DATA - Data processing apparatus comprising: a chunk store containing specimen data chunks, a manifest store containing a plurality of manifests, each of which represents at least a part of a data set and each of which comprises at least one reference to at least one of said specimen data chunks, a sparse chunk index containing information on only some specimen data chunks, the processor being operable to: process input data into input data chunks; identify manifests having at least one reference to one of said specimen data chunks that corresponds to one of said input data chunks and on which there is information contained in the sparse chunk index; and prioritize the identified manifests for subsequent operation. | 04-30-2009 |
20090113167 | DATA PROCESSING APPARATUS AND METHOD OF PROCESSING DATA - Data processing apparatus comprising: a chunk store containing specimen data chunks, a manifest store containing at least one manifest that represents at least a part of a data set and that comprises at least one reference to at least one of said specimen data chunks, a sparse chunk index containing information on only those specimen data chunks having a predetermined characteristic, the processing apparatus being operable to process input data into input data chunks and to use the sparse chunk index to identify at least one of said at least one manifest that includes at least one reference to one of said specimen data chunks that corresponds to one of said input data chunks having the predetermined characteristic. | 04-30-2009 |
20100030780 | IDENTIFYING RELATED OBJECTS IN A COMPUTER DATABASE - Provided are, among other things, systems, methods and techniques for identifying related objects in a computer database. In one representative implementation: (a) a feature vector that describes an existing object is obtained; (b) comparison scores are generated between the feature vector and various sample vectors; (c) a set that includes at least one designated vector is identified from among the sample vectors by evaluating the generated comparison scores; (d) a computer database is searched for matches between label(s) for the designated vector(s) and labels for representative vectors for other objects represented in the computer database; and (e) at least one related object is identified based on the identified match(es). | 02-04-2010 |
20100077015 | Generating a Hash Value from a Vector Representing a Data Object - To generate at least one hash value for a feature vector that represents a data object, a discrete orthogonal transform is applied on a second vector produced from the feature vector. Applying the discrete orthogonal transform on the second vector produces a third vector. At least one value is selected from the third vector to produce the hash value. The at least one hash value is used to perform an action. | 03-25-2010 |
20100082562 | Managing Storage Of Data In A Data Structure - To manage storing of data in a data structure, a particular data value is represented as a group of segments stored in corresponding entries of the data structure. Additional data values represented by corresponding groups of segments are written into the data structure. A probability of overwriting segments representing the particular data value increases as the number of additional data values increases. A correct version of the particular data value is retrieved even though one or more segments representing the particular data value have been overwritten. | 04-01-2010 |
20100082907 | System For And Method Of Data Cache Management - The present invention provides a system for and a method of data cache management. In accordance with an embodiment of the present invention, a method of cache management is provided. A request for access to data is received. A sample value is assigned to the request, the sample value being randomly selected according to a probability distribution. The sample value is compared to another value. The data is selectively stored in the cache based on results of the comparison. | 04-01-2010 |
20100083346 | Information Scanning Across Multiple Devices - Provided are, among other things, systems, methods and techniques for scanning information across multiple different devices. In one representative system, remote data-processing devices are provided with scanning applications that repeatedly scan information on their respective data-processing devices to identify matching data units that satisfy a specified matching criterion, the specified matching criterion including required matches against a set of screening digests, and then transmit characteristic information regarding the matching data units; and a central processing facility receives the characteristic information from the remote data-processing devices and determines whether the corresponding matching data units satisfy a policy criterion. | 04-01-2010 |
20100114842 | Detecting Duplicative Hierarchical Sets Of Files - To detect duplicative hierarchically arranged sets of files in a storage system, a method includes generating, for hierarchically arranged plural sets of files, respective collections of values computed based on files in corresponding sets of files. For a further set of files that is an ancestor of at least one of the plural sets of files, a respective collection of values that is based on the collection of values computed for the at least one set is generated. Duplicative sets are identified according to comparisons of the collections of values. | 05-06-2010 |
20100205163 | SYSTEM AND METHOD FOR SEGMENTING A DATA STREAM - A method of limiting redundant storage of data comprises receiving a data stream and partitioning the data stream into a series of data chunks. At least one content hash value for a set of data chunks is generated based on data content of the set of data chunks. One or more data chunks are grouped into a segment with at least one boundary of the segment defined based on an evaluation of content hash values of data chunks. Content hash values of data chunks of the segment are compared to content hash values of data chunks of segments stored on a backup mass storage device. A pointer to a stored data chunk of an existing segment is stored on the backup mass storage device if a content hash value of a data chunk of the segment matches the content hash value of the stored data chunk. | 08-12-2010 |
20100246709 | PRODUCING CHUNKS FROM INPUT DATA USING A PLURALITY OF PROCESSING ELEMENTS - Input data is divided into multiple segments that are processed by processing elements of a computer. The processing of the segments produces a plurality of tentative sets of chunks. The plurality of tentative sets of chunks are stitched together to produce an output set of chunks. | 09-30-2010 |
20100280997 | COPYING A DIFFERENTIAL DATA STORE INTO TEMPORARY STORAGE MEDIA IN RESPONSE TO A REQUEST - A plurality of differential data stores are stored in persistent storage media. In response to receiving a first request to store a particular data object, one of the differential data stores that are stored in the persistent storage media is selected, wherein selecting the one differential data store is according to a criterion relating to compression of data objects in the differential data stores. The selected differential data store is copied into temporary storage media, where the copying is not delayed after receiving the first request to await receipt of more requests. The particular data object is inserted into the copy of the selected differential data store in the temporary storage media, where the inserting is performed without having to retrieve more data from the selected differential store in the persistent storage media. The selected differential data store in the persistent storage media is replaced with the copy of the selected differential data store in the temporary storage media that has been modified. | 11-04-2010 |
20100281077 | BATCHING REQUESTS FOR ACCESSING DIFFERENTIAL DATA STORES - Data objects are selectively stored across a plurality of differential data stores, where selection of the differential data stores for storing respective data objects is according to a criterion relating to compression of the data objects in each of the data stores, and where the differential data stores are stored in persistent storage media. Plural requests for accessing the differential data stores are batched, and one of the differential data stores is selected to page into temporary storage from the persistent storage media. The batched plural requests for accessing the selected differential data store that has been paged into the temporary storage are executed. | 11-04-2010 |
20110182513 | WORD-BASED DOCUMENT IMAGE COMPRESSION - Locations of word images corresponding to words in a document image are ascertained. The word images are grouped into clusters. For each of multiple of the clusters, a respective compressed word image cluster is determined based on a joint compression of respective ones of the word images that are grouped into the cluster. The positions of the word images in the document image are associated with the respective ones of the compressed word image clusters corresponding to the clusters respectively containing the word images. | 07-28-2011 |
20120020561 | METHOD AND SYSTEM FOR OPTICAL CHARACTER RECOGNITION USING IMAGE CLUSTERING - The present disclosure provides a computer-implemented method of translating an image-based electronic document into a text-based electronic document. The method includes electronically scanning an image-based document to determine positions of word images in the image-based document. The method also includes extracting the word images from the image-based document and storing the word images to an electronic storage device. The method also includes grouping a subset of the word images into a word cluster based on a similarity of the word images, wherein the word images in the word cluster correspond to a same actual word. The method also includes generating a character-encoded transcription for the word cluster based on the word images in the word cluster. The method also includes adding the character-encoded transcription to a text-based electronic document at locations corresponding to the positions of the word images in the image-based document. | 01-26-2012 |
20120143715 | SPARSE INDEX BIDDING AND AUCTION BASED STORAGE - Illustrated is a system and method that includes a receiving module, which resides on a back end node, to receive a set of hashes that is generated from a set of chunks associated with a segment of data. Additionally, the system and method further includes a lookup module, which resides on the back end node, to search for at least one hash in the set of hashes as a key value in a sparse index. The system and method also includes a bid module, which resides on the back end node, to generate a bid, based upon a result of the search. | 06-07-2012 |
20120239815 | DISTRIBUTED DIFFERENTIAL STORE WITH NON-DISTRIBUTED OBJECTS AND COMPRESSION-ENHANCING DATA-OBJECT ROUTING - One embodiment of the present invention provides a distributed, differential electronic-data storage system that includes client computers, component data-storage systems, and a routing component. Client computers direct data objects to component data-storage systems within the distributed, differential electronic-data storage system. Component data-storage systems provide data storage for the distributed, differential electronic-data storage system. The routing component directs data objects, received from the client computers, through logical bins to component data-storage systems by a compression-enhancing routing method. | 09-20-2012 |
20130236068 | CALCULATING FACIAL IMAGE SIMILARITY - In one embodiment, for a first image, a first vector of similarity to a set of reference images is calculated as a first face descriptor, and for a second image, a second vector of similarity to the set of reference images is calculated as a second face descriptor. A similarity measure between the first face descriptor and the second face descriptor is then calculated. | 09-12-2013 |
20140324742 | SUPPORT VECTOR MACHINE - A method of building a classification model using a SVM training module comprising, with a processor, computing a mean value of a number of training vectors received by the processor, subtracting the mean value of the number of training vectors from each training vector received by the processor to obtain a number of difference vectors, applying a hash function to each of the difference vectors to obtain a number of hashed vectors, and applying a linear training formula to the hashed vectors to obtain a classifier model. Classifying a sample vector comprises, with a processor, subtracting a mean value of a number of support vector machine training vectors from the sample vector to obtain a sample difference vector, with a processor, applying a hash function to the sample difference vector to obtain a hashed sample vector, and classifying the hashed sample vector using a classifier model. | 10-30-2014 |
20140344229 | SYSTEMS AND METHODS FOR DATA CHUNK DEDUPLICATION - A method includes receiving information about a plurality of data chunks and determining if one or more of a plurality of back-end nodes already stores more than a threshold amount of the plurality of data chunks where one of the plurality of back-end nodes is designated as a sticky node. The method further includes, responsive to determining that none of the plurality of back-end nodes already stores more than a threshold amount of the plurality of data chunks, deduplicating the plurality of data chunks against the back-end node designated as the sticky node. Finally, the method includes, responsive to an amount of data being processed, designating a different back-end node as the sticky node. | 11-20-2014 |
20150088840 | DETERMINING SEGMENT BOUNDARIES FOR DEDUPLICATION - A sequence of hashes is received. Each hash corresponds to a data chunk of data to be deduplicated. Locations of previously stored copies of the data chunks are determined, the locations determined based on the hashes. A breakpoint in the sequence of data chunks is determined based on the locations, the breakpoint forming a boundary of a segment of data chunks. | 03-26-2015 |
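
Several entries above (20090112945, 20090113167, 20120143715) rely on a sparse chunk index: an index that holds information on only some chunks and is used to find and prioritize manifests that probably share data with the input. The following is a minimal Python sketch of that idea, not the method claimed in any of the applications. The class name `SparseChunkIndex`, the SHA-1 chunk hash, and the sampling rule (keep a hash only when its low `sample_bits` bits are zero) are assumptions chosen for illustration; the abstracts only say that indexed chunks have "a predetermined characteristic".

```python
import hashlib
from collections import Counter


def chunk_hash(chunk: bytes) -> str:
    """Content hash of a chunk (SHA-1 is an assumption made for this sketch)."""
    return hashlib.sha1(chunk).hexdigest()


def is_sampled(digest: str, sample_bits: int) -> bool:
    """Hypothetical 'predetermined characteristic': low bits of the hash are zero."""
    return int(digest, 16) & ((1 << sample_bits) - 1) == 0


class SparseChunkIndex:
    """Maps sampled chunk hashes to the manifests that reference them."""

    def __init__(self, sample_bits: int = 2):
        self.sample_bits = sample_bits
        self.entries = {}  # digest -> set of manifest ids

    def add_manifest(self, manifest_id: str, chunks) -> None:
        # Only chunks whose hash passes the sampling rule are indexed.
        for chunk in chunks:
            digest = chunk_hash(chunk)
            if is_sampled(digest, self.sample_bits):
                self.entries.setdefault(digest, set()).add(manifest_id)

    def rank_manifests(self, input_chunks):
        """Return manifest ids ordered by how many distinct indexed chunks they share."""
        hits = Counter()
        for digest in {chunk_hash(c) for c in input_chunks}:
            for manifest_id in self.entries.get(digest, ()):
                hits[manifest_id] += 1
        # Most hits first; ties broken by manifest id for a stable order.
        return sorted(hits, key=lambda m: (-hits[m], m))


if __name__ == "__main__":
    index = SparseChunkIndex(sample_bits=0)  # 0 bits: every chunk is indexed
    index.add_manifest("m1", [b"a", b"b", b"c"])
    index.add_manifest("m2", [b"c", b"d"])
    print(index.rank_manifests([b"a", b"b", b"c"]))
```

With a larger `sample_bits`, the index stays small because most chunk hashes are never stored, at the cost of missing manifests whose only shared chunks were not sampled.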