Character Equivalency

Software: GlobalVision Desktop - Software version 5.0 - 5.5

The character equivalency list identifies characters, or sequences of characters, that should be treated as equivalent during a comparison. Typical examples include the different flavors of “bullet” characters and specific accented characters (mostly found in languages such as Czech and Polish). An accented character can also be a composite character, built from two distinct characters (a base letter followed by a combining accent). With a character equivalency list in place, such minor differences (in some cases false positives) are no longer reported, and better word matching improves the overall accuracy of the comparison.

The character equivalency list is saved with the project and restored from the project.

Location, format and hexadecimal notation

The character equivalency list can be specified in the docuproof.ini file (found in C:\Users\Public\GlobalVision\Resource\) using the following format and syntax:

CharEquivalencyList=<0x0079 + 0x0301 : 0x00fd>; <0x0065 : 0x0066>; <0x2022, 0x2023 :0x2024 >

Character codes are written in hexadecimal because the equivalency list usually contains special characters that cannot be typed with a single key press.
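To check which character a hexadecimal code stands for before adding it to the list, a short script can help. The following is a minimal sketch using Python's standard unicodedata module; the describe helper is a hypothetical name introduced here for illustration and is not part of GlobalVision.

```python
import unicodedata

# Code points taken from the CharEquivalencyList example above.
codes = ["0x0079", "0x0301", "0x00fd", "0x2022", "0x2023", "0x2024"]


def describe(hex_code: str) -> str:
    """Return 'U+XXXX NAME' for a hex string such as '0x2022' (hypothetical helper)."""
    code_point = int(hex_code, 16)  # base 16 accepts the '0x' prefix
    name = unicodedata.name(chr(code_point), "<unnamed>")
    return f"U+{code_point:04X} {name}"


for code in codes:
    print(describe(code))
```

Looking up the official Unicode name this way makes it easier to confirm that an entry refers to the intended character, especially for combining marks that are invisible on their own.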

Syntax

Each equivalency token is enclosed between “<” (at the beginning) and “>” (at the end). Tokens are separated by “;”.

Explanation of the code

The example above contains three equivalency tokens.

The syntax and format of one equivalency token is as follows:

“Source part” : “equivalent part”

The source part can use “,” (comma) to list several alternative sequences, or “+” (plus sign) to specify a contiguous sequence.

The equivalent part can be a single character or a contiguous sequence joined with “+” (plus sign); it cannot use “,”.

Example: the last token in the above example, <0x2022, 0x2023 :0x2024 >, is read as follows:

0x2022 is equivalent to 0x2024

0x2023 is equivalent to 0x2024

In plain English, this means:

A regular bullet (0x2022) is equivalent to the “one dot leader” character (0x2024)

A triangular bullet (0x2023) is equivalent to the “one dot leader” character (0x2024)

Using the same example, but slightly modified:

<0x2022 + 0x2023 :0x2024 >

This is now read as: a sequence consisting of a regular bullet (0x2022) immediately followed by a triangular bullet (0x2023) is equivalent to the “one dot leader” character (0x2024).
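The syntax rules above can be sketched as a small parser. This is an illustrative reading of the format as described in this article, not GlobalVision's actual implementation; the function name parse_equivalency_list and its return shape are assumptions made for this example.

```python
import re


def parse_equivalency_list(value: str) -> list[tuple[str, str]]:
    """Parse a CharEquivalencyList value into (source, equivalent) string pairs.

    Hypothetical helper: it follows the syntax described in this article only.
    """

    def sequence(part: str) -> str:
        # '+' joins code points into one contiguous sequence.
        return "".join(chr(int(code.strip(), 16)) for code in part.split("+"))

    pairs = []
    # Each token is the text between '<' and '>'; the ';' separators are skipped.
    for token in re.findall(r"<([^<>]*)>", value):
        source_part, equivalent_part = token.split(":")
        if "," in equivalent_part:
            raise ValueError("the equivalent part may not contain ','")
        # ',' in the source part lists several alternative sequences.
        for alternative in source_part.split(","):
            pairs.append((sequence(alternative), sequence(equivalent_part)))
    return pairs


example = "<0x0079 + 0x0301 : 0x00fd>; <0x0065 : 0x0066>; <0x2022, 0x2023 :0x2024 >"
for source, equivalent in parse_equivalency_list(example):
    print([hex(ord(c)) for c in source], "->", [hex(ord(c)) for c in equivalent])
```

Under this reading, the three tokens of the example expand to four pairs, because the comma in the last token yields one pair per bullet character, while the “+” variant yields a single pair whose source is a two-character sequence.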

Limitations

The equivalency logic currently replaces the character visually in the difference grid. For example, suppose the small letter o (Unicode value: 006f) is used as a bullet in the original file, whereas the revision uses Unicode value 00ba (º).
Once the equivalency is added to the ini file, the one-to-one differences are no longer detected. However, the Unicode character from the revision replaces the character from the original file in the grid, e.g. tools is displayed as tººls.
The equivalency list in the ini file should be populated with Unicode values. The change-grid displays the Unicode value of each character detected as a difference.
However, not all characters display as expected, and some might not display at all. For example, the ini entry for ý needs to be set up as:
CharEquivalencyList=<0x00fd: 0x0079 + 0x0301>;
where:
0x00fd: ý
0x0079: y
0x0301: ́ (combining acute accent)

The combining accent (0x0301) does not display in the change-grid. To determine its Unicode value, hover over the blank field next to the <y> character.
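When a combining character is hard to identify on screen, its code points can also be checked outside the application. The sketch below uses Python's standard unicodedata module to show that ý (0x00fd) decomposes into y (0x0079) followed by the combining acute accent (0x0301); the code_points helper is a hypothetical name used only for this example.

```python
import unicodedata

precomposed = "\u00fd"  # ý stored as a single code point
decomposed = "y\u0301"  # y followed by the combining acute accent


def code_points(text: str) -> list[str]:
    """List each character of text in the 0x.... form used in the ini file."""
    return [f"0x{ord(ch):04x}" for ch in text]


# NFD normalization splits a precomposed character into base letter + accent.
print(code_points(unicodedata.normalize("NFD", precomposed)))

# NFC normalization recombines the sequence into the single precomposed character.
print(code_points(unicodedata.normalize("NFC", decomposed)))
```

The values printed in the 0x form can be copied directly into a CharEquivalencyList entry.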