Xtrans
Latest revision as of 11:26, 5 January 2023
Xtrans transforms binary or text files in various formats to and from XML. The conversion produces - where possible - identical results in both directions. The scanners for individual input formats are implemented in Java in the form of a SAX parser, and the output is generated by a serializer. For XML output the parser tries to maintain as much structure as possible.
The project can be cloned from the GitHub repository gfis/xtrans (https://github.com/gfis/xtrans).
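The split between a scanner that emits SAX events and a serializer that writes them out can be sketched with the JDK's own SAX and JAXP classes. This is only an illustration, not Xtrans code: the class name `LineScannerSketch` and the element names `lines` and `line` are assumptions made up for the example.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

/** Hypothetical sketch: a scanner for plain text lines that fires SAX events. */
final class LineScannerSketch {

    /** "Scanner": reports every input line as a <line> element to the handler. */
    static void scan(String text, ContentHandler handler) throws SAXException {
        handler.startDocument();
        handler.startElement("", "lines", "lines", new AttributesImpl());
        for (String line : text.split("\n")) {
            handler.startElement("", "line", "line", new AttributesImpl());
            handler.characters(line.toCharArray(), 0, line.length());
            handler.endElement("", "line", "line");
        }
        handler.endElement("", "lines", "lines");
        handler.endDocument();
    }

    /** "Serializer": the JDK identity TransformerHandler writes the events as XML. */
    static String toXml(String text) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler serializer = factory.newTransformerHandler();
            serializer.getTransformer()
                    .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.setResult(new StreamResult(out));
            scan(text, serializer);
            return out.toString();
        } catch (SAXException | TransformerConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml("first line\nsecond line"));
    }
}
```

Because the scanner only talks to a `ContentHandler`, the same events could be sent to any other SAX consumer instead of the XML writer.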
Implemented Formats
Package | Format | Type | Description |
config | ini | char | Windows .ini file |
config | make | char | Unix makefiles |
config | manifest | char | Java MANIFEST.MF file |
config | props | char | Java Properties |
database | dbat | char | generate tables |
edi | edifact | char | UN/Edifact message interchange |
edi | x12 | char | ANSI ASC X12 message interchange |
finance | aeb43 | char | AEB43 Spanish payments exchange file |
finance | datev | byte | DATEV accounting file (DE001) |
finance | dta | byte | DTA MCV German payments exchange file |
finance | dta2 | byte | DTA2 MCV German payments exchange file |
finance | mt103 | char | SWIFT FIN MT103 message (Payment) |
finance | mt940 | char | SWIFT FIN MT940 message (Report) |
finance | swift | char | SWIFT FIN message (MTnnn) |
general | col1 | char | tags in column 1 of text lines |
general | column | char | character file with fields of fixed width |
general | hexdump | byte | Hexadecimal Dump |
general | json | char | JavaScript Object Notation |
general | line | char | character file consisting of lines |
general | pyx | char | line-oriented representation of XML |
general | separ | char | file with fields separated or delimited by character strings |
general | 6ml | char | Simplified XML Notation |
geo | nmea | char | NMEA-0183 GPS data for geocoding |
grammar | extra | char | Extensible Translator grammar |
grammar | yacc | char | Yet Another Compiler Compiler |
image.raster | exif | byte | Exif Metadata |
image.vector | wmf | byte | Windows Meta File |
misc | gedcom | char | Genealogical Data Communication |
misc | morse | char | Morse Code |
net | base64 | byte | Base64 encoding of a binary file |
net | ldif | char | LDAP Data Interchange File |
net | qp | byte | Quoted Printable format (RFC 2045, 6.7) |
net | uri | char | Uniform Resource Identifiers on single lines |
office.data | dbase | byte | dBase Database File |
office.data | dif | char | DIF - Data Interchange Format |
office.text | hit | byte | Siemens Hit |
office.text | rtf | char | RTF - Rich Text Format for MS-Word et al. |
office.text | tex | char | TeX, LaTeX - Typesetting System |
organizer | ical | char | Calendars and schedules |
organizer | vcard | char | VCard address/phone book entries |
parse | parse | char | parser for programs |
proglang | c | char | C Programming Language |
proglang | cobol | char | Cobol Programming Language |
proglang | cpp | char | C++ Programming Language |
proglang | css | char | Cascading Style Sheets |
proglang | fortran | char | Fortran Programming Language |
proglang | java | char | Java Programming Language |
proglang | javascript | char | JavaScript Programming Language |
proglang | jcl | char | IBM z/OS Job Control Language |
proglang | pascal | char | (Turbo) Pascal Programming Language |
proglang | pl1 | char | PL/1 Programming Language |
proglang | ps | char | PostScript (Adobe) |
proglang | progser | char | serializer for programs |
proglang | rexx | char | REXX Programming Language |
proglang | ruby | char | Ruby Programming Language |
proglang | sql | char | Structured Query Language |
proglang | sqlpretty | char | pretty print SQL |
proglang | token | char | Transformer for Parser Tokens |
proglang | vba | char | Visual Basic (for Applications) |
pseudo | count | char | count XML elements |
pseudo | dir | char | nested file directory listing |
pseudo | jimp | char | check Java imports |
pseudo | level | char | add level attribute |
pseudo | seq | char | generate a sequence |
pseudo | system | char | show system information |
Applications
JavaImportChecker
This class is a pseudo serializer to be applied after JavaTransformer. It checks the import statements in a Java source file and reports:
- superfluous imports which are never used
- missing imports, reported either because:
  - the class is in the same package
  - classes prefixed with their package name are not properly recognized by the tool (for example java.util.Date); these should also be imported explicitly
  - inherited enums are not properly recognized by the tool
The tool treats an identifier as a class name when it starts with an uppercase letter [A-Z] and contains a lowercase letter [a-z] somewhere after it.
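The class-name rule and the two kinds of report can be sketched roughly as follows. This is not the JavaImportChecker source: the class `ImportCheckSketch`, its method names and both regular expressions are assumptions, and the sketch ignores comments, string literals and wildcard imports.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the import check; not the Xtrans implementation. */
final class ImportCheckSketch {
    // Uppercase first letter, with at least one lowercase letter somewhere after it.
    private static final Pattern CLASS_NAME =
            Pattern.compile("\\b[A-Z][A-Za-z0-9_]*[a-z][A-Za-z0-9_]*\\b");
    // Single-class imports only; captures the simple class name.
    private static final Pattern IMPORT =
            Pattern.compile("^\\s*import\\s+[\\w.]*\\.(\\w+)\\s*;", Pattern.MULTILINE);

    static boolean looksLikeClassName(String identifier) {
        return CLASS_NAME.matcher(identifier).matches();
    }

    /** Simple names of all single-class imports in the source text. */
    static Set<String> importedNames(String source) {
        Set<String> names = new TreeSet<>();
        Matcher m = IMPORT.matcher(source);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    /** Class-like identifiers found outside the import statements. */
    static Set<String> usedNames(String source) {
        String body = IMPORT.matcher(source).replaceAll("");
        Set<String> names = new TreeSet<>();
        Matcher m = CLASS_NAME.matcher(body);
        while (m.find()) {
            names.add(m.group());
        }
        return names;
    }

    /** Imported but never used. */
    static Set<String> superfluous(String source) {
        Set<String> result = importedNames(source);
        result.removeAll(usedNames(source));
        return result;
    }

    /** Used but not imported; may include same-package classes and the class itself. */
    static Set<String> missingCandidates(String source) {
        Set<String> result = usedNames(source);
        result.removeAll(importedNames(source));
        return result;
    }
}
```

A purely lexical check like this produces the same kind of false positives listed above: a class from the same package shows up as a missing import even though no import is needed.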
All sources in a project can be checked with a shell command line like:
find ../xtool/src -iname "*.java" | xargs -l -ißß java -jar dist/xtrans.jar -java ßß -jimp
The corresponding output was:
SchemaBean
PathStack
XPathLink
XPathSelect
XmlnsPrefix
XtoolServlet
import only: Enumeration
import only: InputStream
import only: ServletConfig
import only: ServletContext
import only: ZipFile
SchemaBeanBase
import only: Date
import only: Timestamp
PathElement
IndexPage
import only: HttpSession
import only: Iterator
Messages
SchemaList
XmlnsXref
NonClosingInputStream
SchemaArray
use only: Date
Bugs
General Problems
- Though most transformers convert from the raw (specific) format to an XMLized representation, there are a few exceptions where general binary or text files are converted to the specific format, which is then wrapped into XML. Examples are Base64, Quoted Printable and Morse Code.
- Most transformers store values in XML elements, but sometimes it seemed easier to store them in attributes of elements. DTA and Datev are examples of the latter case.
- For formats with many different tags (SWIFT for example) the question arises whether such tags are syntax or data. These tags can be converted to id attributes of a generalized XML "field" element, or a separate element can be generated for each such tag. The SwiftTransformer takes the latter approach.
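The two ways of representing a tagged field can be compared in a small sketch. Everything here is illustrative: the class `FieldStyleSketch`, the element name `field` with its `id` attribute, and the `F` prefix for per-tag element names are assumptions, not the names that SwiftTransformer actually generates.

```java
/** Hypothetical sketch of the two XML representations of a tagged field. */
final class FieldStyleSketch {

    /** Minimal XML escaping for element content and attribute values. */
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }

    /** Tag treated as data: one generic element, the tag goes into an id attribute. */
    static String asGenericField(String tag, String value) {
        return "<field id=\"" + escape(tag) + "\">" + escape(value) + "</field>";
    }

    /**
     * Tag treated as syntax: a separate element per tag. XML names must not
     * start with a digit, so this sketch prepends the (assumed) letter "F".
     */
    static String asTagElement(String tag, String value) {
        if (!tag.matches("[A-Za-z0-9]+")) {
            throw new IllegalArgumentException("tag is not usable as an XML name: " + tag);
        }
        String name = "F" + tag;
        return "<" + name + ">" + escape(value) + "</" + name + ">";
    }

    public static void main(String[] args) {
        System.out.println(asGenericField("20", "REF123"));
        System.out.println(asTagElement("20", "REF123"));
    }
}
```

The generic form keeps the XML vocabulary small and works for unknown tags, while the per-tag form lets a schema or XPath expression address each field by element name.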
Test
- Not all format conversions are precisely reversible.
- There are only a few test cases.
Incomplete Transformers
- general.XMLTransformer - insufficient serialization of entities; serializer should be replaced by Apache's
- general.CountingTransformer - cannot generate, but serializes any XML to a sorted list with counts for all elements, and the accumulated length of their direct character content
- net.URITransformer - the set of supported URI schemes is incomplete, and serializing is not implemented.
- organizer.LDIFTransformer - not well tested, and serializing is not implemented.
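The behaviour described for general.CountingTransformer (a count per element name plus the accumulated length of the direct character content) can be approximated with a plain SAX handler. This is a sketch under assumptions, not the Xtrans class: the name `ElementCountSketch` and the choice to sort by element name are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: counts elements and their direct character content. */
final class ElementCountSketch extends DefaultHandler {
    // Element name -> {number of occurrences, accumulated direct text length};
    // a TreeMap keeps the names sorted.
    private final Map<String, int[]> stats = new TreeMap<>();
    // Names of the currently open elements, innermost first.
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        stats.computeIfAbsent(qName, key -> new int[2])[0]++;
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.pop();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text is credited to the innermost open element only (direct content).
        if (!open.isEmpty()) {
            stats.get(open.peek())[1] += length;
        }
    }

    /** Parses the XML text and returns the statistics, sorted by element name. */
    static Map<String, int[]> count(String xml) {
        try {
            ElementCountSketch handler = new ElementCountSketch();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.stats;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        count("<a><b>xy</b><b>z</b>!</a>").forEach((name, s) ->
                System.out.println(name + " count=" + s[0] + " chars=" + s[1]));
    }
}
```

Whitespace between elements counts as character content in this sketch; whether the real transformer does the same is not stated in the text above.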
Hints for Developers
Xtrans currently processes only a limited set of formats. You are encouraged to:
- play with the format transformer classes,
- email any suggestions for improvement,
- contribute patches for corrections,
- contribute new transformer classes.
Coding conventions
Please try to remain close to the current programming style:
- Write Javadoc comments before all methods and public members.
- Note that the Java sources are compiled with UTF-8 source encoding:
<javac srcdir="${src.home}" destdir="${build.classes}" listfiles="yes" encoding="utf8" source="1.4" target="1.4" debug="${javac.debug}" debuglevel="${javac.debuglevel}">
- Determine the proper accents and non-ASCII characters, and write them in Unicode in the Java source files. Use a Unicode-enabled editor that handles UTF-8 properly; write some Unicode characters in the header comment so that the editor can detect the UTF-8 encoding.
- Use reliable sources for the format definition like RFCs or ISO standards, and document them in the Javadoc header of the class.
Reversibility
The transformers should try to serialize XML to exactly the same specific format from which they are able to generate XML. The test Ant targets perform a "generate - serialize - binary compare" sequence to check the reversibility of the transformation.
Some formats don't have a well-defined canonical representation. In JCL, for example, the line breaks and the spaces used for field separation are lost in the XML representation and cannot be reproduced exactly by the serializer. In these cases, repeated "generate - serialize" sequences should eventually produce an identical result.
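The difference between exact reversibility and convergence after repeated passes can be illustrated with a toy format. The sketch below is an assumption-laden stand-in, not the Ant test targets: `RoundTripSketch` and its methods are hypothetical, and the "format" is simply a line of fields separated by arbitrary whitespace, loosely modelled on the JCL case.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.function.UnaryOperator;

/** Hypothetical sketch of the "generate - serialize - binary compare" idea. */
final class RoundTripSketch {

    /**
     * One "generate - serialize" pass over a toy format: splitting into fields
     * loses the original spacing, and joining writes a canonical single space.
     */
    static String roundTrip(String raw) {
        String[] fields = raw.trim().split("\\s+");
        return String.join(" ", fields);
    }

    /** Exact reversibility: the pass reproduces the input byte for byte. */
    static boolean isReversible(String raw, UnaryOperator<String> pass) {
        byte[] before = raw.getBytes(StandardCharsets.UTF_8);
        byte[] after = pass.apply(raw).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(before, after);
    }

    /** Weaker property: a second pass no longer changes the result of the first. */
    static boolean converges(String raw, UnaryOperator<String> pass) {
        String once = pass.apply(raw);
        return once.equals(pass.apply(once));
    }

    public static void main(String[] args) {
        String spaced = "//JOB1   JOB  CLASS=A";
        System.out.println(isReversible(spaced, RoundTripSketch::roundTrip));
        System.out.println(converges(spaced, RoundTripSketch::roundTrip));
    }
}
```

Input that is already in the canonical spacing passes the strict byte comparison; input with extra spaces fails it but still passes the convergence check.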
Future Extensions
- more text processing formats:
  - (La)TeX - similar to RTF
  - dot instruction oriented formats: IBM DCF, nroff, troff, perldoc
  - binary formats like IBM DCA/RFT, Siemens Hit, WordPerfect
  - common tagset for text processing features
- raster image processing formats:
  - TIFF
  - EXIF - at least the header
  - GIF, BMP etc.
- vector image processing formats with target SVG:
  - WMF
  - Flash?
  - RTF DO, AmiPro, WordPerfect Graphics ...
- ZIP file tree pseudo transformer