Xtrans


Xtrans transforms binary or text files in various formats to and from XML. The conversion produces - where possible - identical results in both directions. The scanners for individual input formats are implemented in Java in the form of a SAX parser, and the output is generated by a serializer. For XML output the parser tries to maintain as much structure as possible.

The project can be cloned from the GitHub repository gfis/xtrans.
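The scanner/serializer pairing described above can be illustrated with a minimal sketch for a line-oriented text file. It is not taken from the Xtrans sources: the class name LineRoundTripSketch, the element names "lines" and "line", and the methods generate and roundTrip are assumptions made for this example, and only the standard org.xml.sax interfaces are used.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch, not the Xtrans API: a line scanner and its serializer. */
final class LineRoundTripSketch {

    /** Scanner side: reads text lines and reports each one as a SAX "line" element. */
    static void generate(String text, ContentHandler handler) throws IOException, SAXException {
        AttributesImpl noAttrs = new AttributesImpl();
        handler.startDocument();
        handler.startElement("", "lines", "lines", noAttrs);
        BufferedReader reader = new BufferedReader(new StringReader(text));
        String line;
        while ((line = reader.readLine()) != null) {
            handler.startElement("", "line", "line", noAttrs);
            handler.characters(line.toCharArray(), 0, line.length());
            handler.endElement("", "line", "line");
        }
        handler.endElement("", "lines", "lines");
        handler.endDocument();
    }

    /** Serializer side: rebuilds the text file from the SAX events. */
    static final class LineSerializer extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("line".equals(qName)) {
                out.append('\n'); // every line element ends with one newline
            }
        }

        String result() {
            return out.toString();
        }
    }

    /** Runs scanner and serializer back to back and returns the regenerated text. */
    static String roundTrip(String text) {
        LineSerializer serializer = new LineSerializer();
        try {
            generate(text, serializer);
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return serializer.result();
    }

    public static void main(String[] args) {
        String input = "first line\nsecond line\n";
        System.out.println(input.equals(roundTrip(input)));
    }
}
```

In this sketch the round trip is only exact for input that uses "\n" line ends and finishes with a newline. Input with "\r\n" line ends or without a final newline comes back normalized, which is the kind of loss discussed under Reversibility.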

Implemented Formats

Package Format Type Description
config ini char Windows .ini file
  make char Unix makefiles
  manifest char Java MANIFEST.MF file
  props char Java Properties
database dbat char generate tables
edi edifact char UN/Edifact message interchange
  x12 char ANSI ASC X12 message interchange
finance aeb43 char AEB43 Spanish payments exchange file
  datev byte DATEV accounting file (DE001)
  dta byte DTA MCV German payments exchange file
  dta2 byte DTA2 MCV German payments exchange file
  mt103 char SWIFT FIN MT103 message (Payment)
  mt940 char SWIFT FIN MT940 message (Report)
  swift char SWIFT FIN message (MTnnn)
general col1 char tags in column 1 of text lines
  column char character file with fields of fixed width
  hexdump byte Hexadecimal Dump
  json char JavaScript Object Notation
  line char character file consisting of lines
  pyx char line oriented representation of XML
  separ char file with fields separated or delimited by character strings
  6ml char Simplified XML Notation
geo nmea char NMEA-0183 GPS data for geocoding
grammar extra char Extensible Translator grammar
  yacc char Yet Another Compiler Compiler
image.raster exif byte Exif Metadata
image.vector wmf byte Windows Meta File
misc gedcom char Genealogical Data Communication
  morse char Morse Code
net base64 byte Base64 encoding of a binary file
  ldif char LDAP Data Interchange File
  qp byte Quoted Printable format (RFC 2045, 6.7)
  uri char Uniform Resource Identifiers on single lines
office.data dbase byte dBase Database File
  dif char DIF - Data Interchange Format
office.text hit byte Siemens Hit
  rtf char RTF - Rich Text Format for MS-Word et al.
  tex char TeX, LaTeX - Typesetting System
organizer ical char Calendars and schedules
  vcard char VCard address/phone book entries
parse parse char parser for programs
proglang c char C Programming Language
  cobol char Cobol Programming Language
  cpp char C++ Programming Language
  css char Cascading Style Sheets
  fortran char Fortran Programming Language
  java char Java Programming Language
  javascript char JavaScript Programming Language
  jcl char IBM z/OS Job Control Language
  pascal char (Turbo) Pascal Programming Language
  pl1 char PL/1 Programming Language
  ps char PostScript (Adobe)
  progser char serializer for programs
  rexx char REXX Programming Language
  ruby char Ruby Programming Language
  sql char Structured Query Language
  sqlpretty char pretty print SQL
  token char Transformer for Parser Tokens
  vba char Visual Basic (for Applications)
pseudo count char count XML elements
  dir char nested file directory listing
  jimp char check Java imports
  level char add level attribute
  seq char generate a sequence
  system char show system information

Applications

JavaImportChecker

This class is a pseudo serializer to be applied after JavaTransformer. It checks the import statements in a Java source file and reports:

  • superfluous imports which are never used
  • missing imports, either because:
    • the class is in the same package
    • classes prefixed with their package name (for example java.util.Date) are not properly recognized by the tool - these should also be imported explicitly
    • inherited enums are not properly recognized by the tool

The tool checks class names when they start with an uppercase letter [A-Z], somewhere followed by a lowercase letter [a-z].
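The naming rule above can be approximated by a regular expression. The following is a minimal sketch of that rule only, not the code of JavaImportChecker; the class ClassNameHeuristic and the method looksLikeClassName are hypothetical names, and the pattern assumes identifiers made of ASCII letters, digits and underscores.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of the naming rule; not the JavaImportChecker source. */
final class ClassNameHeuristic {

    // Assumed pattern: an uppercase first letter and at least one lowercase letter later on.
    private static final Pattern CLASS_NAME = Pattern.compile("[A-Z]\\w*[a-z]\\w*");

    /** Returns true if the whole identifier matches the assumed class name pattern. */
    static boolean looksLikeClassName(String identifier) {
        return CLASS_NAME.matcher(identifier).matches();
    }

    public static void main(String[] args) {
        for (String id : new String[] {"Date", "XPathLink", "MAX_VALUE", "date", "T"}) {
            System.out.println(id + " -> " + looksLikeClassName(id));
        }
    }
}
```

Under this rule Date and XPathLink are treated as class names, while constants such as MAX_VALUE, lowercase identifiers and single-letter type parameters such as T are ignored.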

All sources in a project can be checked with a shell command line like:

find ../xtool/src -iname "*.java" | xargs -l -ißß java -jar dist/xtrans.jar -java ßß -jimp

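The corresponding output was as follows ("import only" marks a superfluous import, "use only" a class that is used but not imported):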

SchemaBean
PathStack
XPathLink
XPathSelect
XmlnsPrefix
XtoolServlet
  import only:	Enumeration
  import only:	InputStream
  import only:	ServletConfig
  import only:	ServletContext
  import only:	ZipFile
SchemaBeanBase
  import only:	Date
  import only:	Timestamp
PathElement
IndexPage
  import only:	HttpSession
  import only:	Iterator
Messages
SchemaList
XmlnsXref
NonClosingInputStream
SchemaArray
  use only:	Date

Bugs

General Problems

  • Though most transformers convert from the raw (specific) format to an XMLized representation, there are a few exceptions where general binary or text files are converted to the specific format, which is then wrapped into XML. Examples are Base64, Quoted Printable and Morse Code.
  • Most transformers store values in XML elements, but sometimes it seemed easier to store them in attributes of elements. DTA and Datev are examples of the latter case.
  • For formats with many different tags (SWIFT for example) the question arises whether such tags are syntax or data. These tags can be converted to id attributes of a generalized XML "field" element, or a separate element can be generated for each such tag. The SwiftTransformer made the latter decision.
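
The two tag mappings from the last item can be compared side by side in a small sketch. The class TagMappingSketch, its methods, the "f" prefix for element names and the sample tag and value are assumptions made for this example; the element names that SwiftTransformer actually generates are not shown here.

```java
/** Hypothetical sketch of the two tag mappings; not the SwiftTransformer source. */
final class TagMappingSketch {

    /** Escapes the characters that are special in XML character content. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Variant 1: the tag is data, carried in the id attribute of a generic element. */
    static String asGenericField(String tag, String value) {
        // Assumes an alphanumeric tag, so the attribute value needs no escaping.
        return "<field id=\"" + tag + "\">" + escape(value) + "</field>";
    }

    /** Variant 2: the tag is syntax, mapped to an element name of its own. */
    static String asOwnElement(String tag, String value) {
        String name = "f" + tag; // assumed prefix: XML names cannot start with a digit
        return "<" + name + ">" + escape(value) + "</" + name + ">";
    }

    public static void main(String[] args) {
        System.out.println(asGenericField("20", "REF123")); // <field id="20">REF123</field>
        System.out.println(asOwnElement("20", "REF123"));   // <f20>REF123</f20>
    }
}
```

The first variant keeps the XML vocabulary small and leaves the interpretation of tags to the consumer. The second makes each tag addressable by element name at the cost of a larger vocabulary.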

Test

  • Not all format conversions are precisely reversible.
  • There are only a few test cases.

Incomplete Transformers

  • general.XMLTransformer - insufficient serialization of entities; serializer should be replaced by Apache's
  • general.CountingTransformer - cannot generate, but serializes any XML to a sorted list with counts for all elements, and the accumulated length of their direct character content
  • net.URITransformer - the set of supported schemes is incomplete, and serializing is not implemented.
  • organizer.LDIFTransformer - not well tested, and serializing is not implemented.

Hints for Developers

Xtrans currently processes only a limited set of formats. You are encouraged to:

  • play with the format transformer classes,
  • email any suggestions for improvement,
  • contribute patches for corrections,
  • contribute new transformer classes.

Coding conventions

Please try to remain close to the current programming style:

  • Write Javadoc comments before all methods and public members.
  • Note that the Java sources are compiled with UTF-8 source encoding:
   <javac  srcdir="${src.home}" destdir="${build.classes}" listfiles="yes"
           encoding="utf8"
           source="1.4" target="1.4"
           debug="${javac.debug}" debuglevel="${javac.debuglevel}">
Determine the proper accents and non-ASCII characters, and write them as Unicode characters in the Java source files. Use a Unicode-enabled editor that handles UTF-8 properly, and put some non-ASCII characters in the header comment so that the editor can detect the UTF-8 encoding.
  • Use reliable sources for the format definition like RFCs or ISO standards, and document them in the Javadoc header of the class.

Reversibility

The transformers should try to serialize XML to exactly the same specific format from which they are able to generate XML. The test Ant targets perform a "generate - serialize - binary compare" sequence to check the reversibility of the transformation.

Some formats don't have a well-defined canonical representation. In JCL, for example, the line breaks and the spaces for field separation are lost in the XML representation, and cannot be reproduced exactly by the serializer. In these cases, repeated "generate - serialize" sequences should eventually produce an identical result.
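
Both checks can be written down as a short sketch. This is not the code behind the Ant test targets: the class ReversibilityCheck, its methods and the toy transformer are hypothetical, the generated XML is simplified to a String, and the relaxed check assumes that a single normalizing pass is enough to reach a stable result.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.function.Function;

/** Hypothetical sketch of the two reversibility checks; not the Ant test code. */
final class ReversibilityCheck {

    /** Strict check: generate, serialize, then compare with the input byte for byte. */
    static boolean isReversible(byte[] raw, Function<byte[], String> generate,
            Function<String, byte[]> serialize) {
        return Arrays.equals(raw, serialize.apply(generate.apply(raw)));
    }

    /** Relaxed check: after one normalizing pass, a second pass must change nothing. */
    static boolean isStable(byte[] raw, Function<byte[], String> generate,
            Function<String, byte[]> serialize) {
        byte[] first = serialize.apply(generate.apply(raw));
        byte[] second = serialize.apply(generate.apply(first));
        return Arrays.equals(first, second);
    }

    // Toy transformer for the demo: it loses runs of spaces, similar to the JCL case.
    // It does no XML escaping, so it is only meant for input without markup characters.
    static final Function<byte[], String> TOY_GENERATE = raw ->
            "<rec>" + new String(raw, StandardCharsets.US_ASCII).replaceAll(" +", " ") + "</rec>";
    static final Function<String, byte[]> TOY_SERIALIZE = xml ->
            xml.substring("<rec>".length(), xml.length() - "</rec>".length())
                    .getBytes(StandardCharsets.US_ASCII);

    public static void main(String[] args) {
        byte[] raw = "//JOB   EXEC  PGM".getBytes(StandardCharsets.US_ASCII);
        System.out.println(isReversible(raw, TOY_GENERATE, TOY_SERIALIZE)); // false
        System.out.println(isStable(raw, TOY_GENERATE, TOY_SERIALIZE));     // true
    }
}
```

A format that passes the strict check also passes the relaxed one, so the relaxed check is only needed for formats without a canonical representation.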

Future Extensions

  • more text processing formats:
    • (La)TeX - similar to RTF
    • dot instruction oriented formats: IBM DCF, nroff, troff, perldoc
    • binary formats like IBM DCA/RFT, Siemens Hit, WordPerfect
    • common tagset for text processing features
  • raster image processing formats:
    • TIFF
    • EXIF - at least the header
    • GIF, BMP etc.
  • vector image processing formats with target SVG:
    • WMF
    • Flash?
    • RTF DO, AmiPro, WordPerfect Graphics ...
  • ZIP file tree pseudo transformer