angleto / m2m-aligner

Automatically exported from code.google.com/p/m2m-aligner

License: MIT License

m2m-aligner's Introduction

Many-to-Many alignment model (m2m-aligner)

m2m-aligner was implemented by Sittichai Jiampojamarn during his PhD studies at the Department of Computing Science, University of Alberta.
The algorithm has been applied to letter-to-phoneme conversion, name transliteration, and other tasks;
for examples, see the list of known publications that used m2m-aligner below.
In general, the algorithm creates lexicon alignments without requiring annotated data or linguistic knowledge.
Its core algorithm is based on the stochastic transducer of Ristad and Yianilos (1998), described in:

@Article{RYsed98,
  author  =  {Eric Sven Ristad and Peter N. Yianilos},
  title   =  {Learning String Edit Distance},
  journal =  {IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year    =  1998,
  volume  =  20,
  number  =  5,
  pages   =  {522--532},
  month   =  {May}
}
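
As a rough illustration of the many-to-many idea, the sketch below runs a forward pass that sums the probability of every way to split a pair (x, y) into aligned chunks. It is a minimal sketch under stated assumptions, not the code in mmEM.cpp: tokens are single characters, the names ProbTable, pairProb and forwardScore are hypothetical, the probability table is supplied by the caller, and an empty target string stands for a deleted source chunk.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model: probability of aligning source chunk x with target chunk y.
// An empty y means "x is deleted".
typedef std::map<std::pair<std::string, std::string>, long double> ProbTable;

// Look up P(x, y); unseen pairs get probability 0.
long double pairProb(const ProbTable& table, const std::string& x, const std::string& y) {
    ProbTable::const_iterator it = table.find(std::make_pair(x, y));
    return it == table.end() ? 0.0L : it->second;
}

// Forward pass: alpha[t][v] is the summed probability of all alignments of
// the first t symbols of x with the first v symbols of y, using chunks of
// at most maxX source symbols and maxY target symbols.
long double forwardScore(const std::string& x, const std::string& y,
                         const ProbTable& table,
                         std::size_t maxX, std::size_t maxY, bool delX) {
    assert(maxX >= 1 && maxY >= 1);
    std::vector<std::vector<long double> > alpha(
        x.size() + 1, std::vector<long double>(y.size() + 1, 0.0L));
    alpha[0][0] = 1.0L;
    for (std::size_t t = 0; t <= x.size(); ++t) {
        for (std::size_t v = 0; v <= y.size(); ++v) {
            if (t == 0 && v == 0) continue;
            long double sum = 0.0L;
            for (std::size_t i = 1; i <= maxX && i <= t; ++i) {
                // Source chunk of length i aligned to a target chunk of length j.
                for (std::size_t j = 1; j <= maxY && j <= v; ++j) {
                    sum += alpha[t - i][v - j] *
                           pairProb(table, x.substr(t - i, i), y.substr(v - j, j));
                }
                // Optionally let the source chunk align to nothing.
                if (delX) {
                    sum += alpha[t - i][v] * pairProb(table, x.substr(t - i, i), "");
                }
            }
            alpha[t][v] = sum;
        }
    }
    return alpha[x.size()][y.size()];
}
```

In an EM-style trainer, a score like this would typically be paired with a matching backward pass to collect expected counts for each chunk pair; that part is omitted here.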

Tarek Sherif originally proposed this algorithm as part of his Master's thesis at the University of Alberta, completed in 2007.
I later reimplemented the algorithm as the first version of m2m-aligner, based on the paper we published together at
NAACL 2007. Since then, many refinements, improvements, and features have been added for later tasks.

You are welcome to use the code for research, commercial, and other purposes; however, please acknowledge its use with a citation to: 

@InProceedings{jiampojamarn2007:,
  author    = {Jiampojamarn, Sittichai  and  Kondrak, Grzegorz  and  Sherif, Tarek},
  title     = {Applying Many-to-Many Alignments and Hidden Markov Models to Letter-to-Phoneme Conversion},
  booktitle = {Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics;
Proceedings of the Main Conference},
  month     = {April},
  year      = {2007},
  address   = {Rochester, New York},
  publisher = {Association for Computational Linguistics},
  pages     = {372--379},
  url       = {http://www.aclweb.org/anthology/N/N07/N07-1047}
}

VERSIONS: 

 1.0 : The first version of m2m-aligner released to the public.
 All previous versions were in-house releases, available upon request.

INSTALL: 

m2m-aligner has been tested on Linux systems with gcc version 4.1.2. It should be compatible with other gcc versions and other C++ compilers.
The default makefile is "makefile.default". Build with the "make" command:

$ make 

For a faster m2m-aligner, you can replace "makefile" with "makefile.stlport", which uses the STLport library instead of the default one.
You can obtain the STLport library from http://www.stlport.org/
Then specify the STLport path in the makefile.
The major difference lies in the use of the <map> and <hash_map> data structures.
You may also use the <hash_map> implementations shipped with gcc and other compilers. The interface should be the same, but I have not tested them yet.
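
To illustrate the kind of switch described above, here is a minimal sketch of selecting an ordered map or a hash table at compile time. It is not taken from the m2m-aligner sources: the USE_HASH_TABLE macro, the CountTable typedef and the addCount helper are hypothetical, and std::unordered_map (C++11) stands in for the non-standard <hash_map>.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <unordered_map>

// Pick the container at compile time, e.g. with -DUSE_HASH_TABLE.
// Both containers expose the same operator[] interface used below.
#ifdef USE_HASH_TABLE
typedef std::unordered_map<std::string, long double> CountTable;
#else
typedef std::map<std::string, long double> CountTable;
#endif

// Accumulate a (non-negative) count for a key; missing keys start at 0.
void addCount(CountTable& table, const std::string& key, long double amount) {
    assert(amount >= 0.0L);
    table[key] += amount;
}
```

A hash table usually gives faster average lookups than a tree-based map, at the cost of losing sorted iteration order.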


USAGE:

   ./m2m-aligner  [--limit] [--errorInFile] [--initProb <long double>]
                  [--init <string>] [--nBest <int>] [--inFormat <l2p|news>]
                  [--sepInChar <string>] [--sepChar <string>] [--nullChar
                  <string>] [--prefixProcess <string>] [--printScore] [--cutoff
                  <double>] [--maxFn <conXY|conYX|joint>] [--eqMap]
                  [--delY] [--delX] [--maxY <int>] [--maxX <int>]
                  [--alignerIn <string>] [--alignerOut <string>] [-o
                  <string>] -i <string> [--version] [-h]


Where:

   --limit
     Limit the alignment pair to used only from the initFile only (default false)

   --errorInFile
     Keep unaligned item in the output file (default false)

   --initProb <long double>
     Cut-off sum prior probability (default 0.5)

   --init <string>
     Initial mapping (model) filename (default null)

   --nBest <int>
     Generate n-best alignments (default n=1)

   --inFormat <l2p|news>
     Input file format [l2p, news] (default news)

   --sepInChar <string>
     Separated in-character used (default :)

   --sepChar <string>
     Separated character used (default |)

   --nullChar <string>
     Null character used (default _)

   --prefixProcess <string>
     Specify prefix output files

   --printScore
     Report score of each alignment (default false)

   --cutoff <double>
     Training threshold (default 0.01)

   --maxFn <conXY|conYX|joint>
     Maximization function [conXY, conYX, joint] (default conYX)

   --eqMap
     Allow mapping of |x| == |y| > 1 (default false)

   --delY
     Allow deletion of substring y (default false)

   --delX
     Allow deletion of substring x (default false)

   --maxY <int>
     Maximum length of substring y (default 2)

   --maxX <int>
     Maximum length of substring x (default 2)

   --alignerIn <string>
     Aligner model input filename

   --alignerOut <string>
     Aligner model output filename

   -o <string>,  --output <string>
     Output filename

   -i <string>,  --input <string>
     (required)  Input filename

   --version
     Displays version information and exits.

   -h,  --help
     Displays usage information and exits.


File formats: 
	m2m-aligner accepts two input formats, called "l2p" and "news".

	news format: tokens are separated by a space,
	             a tab (\t) separates source x from target y,
	             one line per (x,y) pair.

	l2p format : each character (byte) is a token,
	             whitespace separates source x from target y,
	             one line per (x,y) pair.

	Please see the example file "toAlignEx".
	This example file is a random sample of about 1k words from the CMU Pronouncing Dictionary:
	http://www.speech.cs.cmu.edu/cgi-bin/cmudict
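
The two input formats can be read along the following lines. This is a minimal sketch based only on the format description above, not the parser used by m2m-aligner; the names Tokens, splitOnSpaces, parseNews and parseL2p are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

typedef std::vector<std::string> Tokens;

// Split a string into whitespace-separated tokens.
Tokens splitOnSpaces(const std::string& s) {
    std::istringstream in(s);
    Tokens out;
    std::string tok;
    while (in >> tok) out.push_back(tok);
    return out;
}

// news format: "x1 x2 ...<TAB>y1 y2 ...", tokens separated by spaces.
// Returns false if the tab is missing or either side is empty.
bool parseNews(const std::string& line, Tokens& x, Tokens& y) {
    std::string::size_type tab = line.find('\t');
    if (tab == std::string::npos) return false;
    x = splitOnSpaces(line.substr(0, tab));
    y = splitOnSpaces(line.substr(tab + 1));
    return !x.empty() && !y.empty();
}

// l2p format: "source<whitespace>target", each byte is one token.
// Returns false if the line does not contain two fields.
bool parseL2p(const std::string& line, Tokens& x, Tokens& y) {
    std::istringstream in(line);
    std::string xs, ys;
    if (!(in >> xs >> ys)) return false;
    x.clear();
    y.clear();
    for (std::size_t i = 0; i < xs.size(); ++i) x.push_back(std::string(1, xs[i]));
    for (std::size_t i = 0; i < ys.size(); ++i) y.push_back(std::string(1, ys[i]));
    return true;
}
```

Because l2p treats every byte as a token, multi-character symbols (for example phoneme labels such as "AH") need the news format.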

Example run:

	$ ./m2m-aligner --delX --maxX 2 --maxY 2 -i toAlignEx
	
	--delX : allow deletion in the source side.
	--maxX <value> : the maximum size of sub-alignments in the source side.
	--maxY <value> : the maximum size of sub-alignments in the target side.
	-i <inputfile> : unaligned lexical file to train a model
	
Example outputs: 

toAlignEx.m-mAlign.2-2.1-best.conYX.align
	alignment output file of "toAlignEx":
		Tokens are separated by ":", sub-alignments are separated by "|", and a tab (\t) separates aligned x from aligned y.

toAlignEx.m-mAlign.2-2.1-best.conYX.align.err
	contains those examples from "toAlignEx" that can't be aligned with the current model.

toAlignEx.m-mAlign.2-2.1-best.conYX.align.model
	aligner's model file.
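
A line of the ".align" output can be split back into sub-alignments roughly as follows. This is a minimal sketch, assuming the default single-character separators ("|" and ":"), a line shaped like "a|b:c|<TAB>A|B|", and the same number of sub-alignments on both sides; the names Chunks, split, parseSide and parseAligned are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

typedef std::vector<std::vector<std::string> > Chunks;

// Split on a single-character delimiter; a trailing delimiter adds no empty field.
std::vector<std::string> split(const std::string& s, char delim) {
    std::vector<std::string> out;
    std::string cur;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] == delim) {
            out.push_back(cur);
            cur.clear();
        } else {
            cur += s[i];
        }
    }
    if (!cur.empty()) out.push_back(cur);
    return out;
}

// One side of the line: sub-alignments split by sepChar, tokens by sepInChar.
Chunks parseSide(const std::string& side, char sepChar, char sepInChar) {
    Chunks chunks;
    std::vector<std::string> parts = split(side, sepChar);
    for (std::size_t i = 0; i < parts.size(); ++i) {
        chunks.push_back(split(parts[i], sepInChar));
    }
    return chunks;
}

// Parse "x-side<TAB>y-side". Returns false if the tab is missing or the
// two sides do not contain the same number of sub-alignments.
bool parseAligned(const std::string& line, Chunks& x, Chunks& y,
                  char sepChar = '|', char sepInChar = ':') {
    std::string::size_type tab = line.find('\t');
    if (tab == std::string::npos) return false;
    x = parseSide(line.substr(0, tab), sepChar, sepInChar);
    y = parseSide(line.substr(tab + 1), sepChar, sepInChar);
    return !x.empty() && x.size() == y.size();
}
```

With this sketch, x[k] and y[k] hold the tokens of the k-th sub-alignment, so a downstream tool can iterate over the pairs in order. If --printScore or --nBest is used, the line may carry extra fields that this sketch does not handle.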

	
Acknowledgments:

This work was supported by Alberta Ingenuity, the Informatics
Circle of Research Excellence (iCORE), and the Alberta Ingenuity Fund through
the Alberta Ingenuity Graduate Student Scholarship and the
iCORE ICT Graduate Student Scholarship.

The list of known publications that utilized the m2m-aligner:
(Please contact me to include your usage of the m2m-aligner in this list)

Sittichai Jiampojamarn, Colin Cherry and Grzegorz Kondrak. "Integrating Joint n-gram Features into a Discriminative Training Framework". In Proceedings of
the 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2010), June 2010.

Sittichai Jiampojamarn and Grzegorz Kondrak. "Online Discriminative Training for Grapheme-to-Phoneme Conversion". In Proceedings of the 10th Annual
Conference of the International Speech Communication Association (INTERSPEECH), Brighton, U.K., September 2009, pp.1303-1306.

Sittichai Jiampojamarn, Aditya Bhargava, Qing Dou, Kenneth Dwyer and Grzegorz Kondrak "DIRECTL: a Language-Independent Approach to Transliteration".
In Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration (NEWS 2009), Singapore, August 2009, pp.28-31.

Qing Dou, Shane Bergsma, Sittichai Jiampojamarn and Grzegorz Kondrak "A Ranking Approach to Stress Prediction for Letter-to-Phoneme Conversion".
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing
of the AFNLP, Singapore, August 2009, pp.118-126.

P. Cook and S. Stevenson. "An Unsupervised Model for Text Message Normalization". In Proceedings of the Workshop on Computational Approaches to
Linguistic Creativity, Boulder, Colorado, June 2009, pp.71-78. Association for Computational Linguistics.

Sittichai Jiampojamarn, Colin Cherry and Grzegorz Kondrak. "Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion". In
Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-08: HLT), Columbus, OH, June
2008, pp.905-913.

Sittichai Jiampojamarn, Grzegorz Kondrak and Tarek Sherif. "Applying Many-to-Many Alignments and Hidden Markov Models to Letter-to-Phoneme
Conversion". Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2007),
Rochester, NY, April 2007, pp.372-379.


Author: Sittichai Jiampojamarn
Date  : March 25th, 2010
http://code.google.com/p/m2m-aligner/

m2m-aligner's Issues

‘logl’ was not declared in this scope

What steps will reproduce the problem?
1.originally got output described in 
http://code.google.com/p/m2m-aligner/issues/detail?id=1
2.implemented prescribed solution
3.executed 'make'

What is the expected output? What do you see instead?
My output:

g++ -O3 -ffast-math -funroll-all-loops -fpeel-loops -ftracer -funswitch-loops -funit-at-a-time -pthread  -c -I./tclap1.1.0/include/   mmEM.cpp -o mmEM.o
g++: unrecognized option '-pthread'
mmEM.cpp: In member function ‘std::vector<long double> 
mmEM::nViterbi_align(param, vector_str, vector_str, vector_2str&, 
vector_2str&)’:
mmEM.cpp:752:73: error: ‘logl’ was not declared in this scope
mmEM.cpp:771:73: error: ‘logl’ was not declared in this scope
mmEM.cpp:801:61: error: ‘logl’ was not declared in this scope
mmEM.cpp: In member function ‘long double mmEM::viterbi_align(param, 
vector_str, vector_str, std::vector<std::basic_string<char> >*, 
std::vector<std::basic_string<char> >*)’:
mmEM.cpp:937:61: error: ‘logl’ was not declared in this scope
mmEM.cpp:954:61: error: ‘logl’ was not declared in this scope
mmEM.cpp:983:49: error: ‘logl’ was not declared in this scope
makefile:23: recipe for target `mmEM.o' failed
make: *** [mmEM.o] Error 1


What version of the product are you using? On what operating system?
Using m2m-aligner v1.0. Running cygwin on windows 7. 
Thread model: posix
gcc version 4.5.3 (GCC)

Please provide any additional information below.

I found that 'logl' is declared in math.h. I included math.h in mmEM.h but still got
the same error.
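
One possible workaround, when the platform's C library does not provide logl at all, is to route the call through a small wrapper that computes the logarithm in double precision. This is only a sketch under that assumption and has not been tested against this Cygwin setup; the name portableLog is hypothetical, and the result carries double rather than long double precision.

```cpp
#include <cassert>
#include <cmath>

// Natural logarithm of a long double value, computed in double precision.
// Avoids any dependency on the C function logl, at the cost of accuracy
// and range beyond what double can represent.
long double portableLog(long double p) {
    assert(p > 0.0L);
    return static_cast<long double>(std::log(static_cast<double>(p)));
}
```

Calls such as logl(p) in mmEM.cpp could then be replaced by portableLog(p). Very small probabilities that underflow a double would need separate handling.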

Original issue reported on code.google.com on 24 May 2012 at 12:40

Build problem on gcc 4.4


Thank you very much for making this code available. I've succeeded in building
m2m-aligner with
g++ 4.1.2. With the more recent 4.4, I get this error message:

$ make
g++ -O3 -ffast-math -funroll-all-loops -fpeel-loops -ftracer -funswitch-loops -funit-at-a-time -pthread  -c -I./tclap1.1.0/include/   mmAligner.cpp -o mmAligner.o
In file included from ./tclap1.1.0/include/tclap/UnlabeledValueArg.h:30,
                 from ./tclap1.1.0/include/tclap/CmdLine.h:28,
                 from mmAligner.cpp:48:
./tclap1.1.0/include/tclap/ValueArg.h: In member function ‘int 
TCLAP::VALUE_ARG_HELPER::ValueExtractor<T>::extractValue(const std::string&)’:
./tclap1.1.0/include/tclap/ValueArg.h:103: error: ‘EOF’ was not declared in 
this scope
In file included from ./tclap1.1.0/include/tclap/UnlabeledMultiArg.h:29,
                 from ./tclap1.1.0/include/tclap/CmdLine.h:29,
                 from mmAligner.cpp:48:
./tclap1.1.0/include/tclap/MultiArg.h: In member function ‘int 
TCLAP::MULTI_ARG_HELPER::ValueExtractor<T>::extractValue(const std::string&)’:
./tclap1.1.0/include/tclap/MultiArg.h:103: error: ‘EOF’ was not declared in 
this scope
make: *** [mmAligner.o] Error 1

$ uname -a
Linux tyr 2.6.31-20-server #58-Ubuntu SMP Fri Mar 12 05:40:05 UTC 2010 x86_64 
GNU/Linux

$ g++ --version
g++ (Ubuntu 4.4.1-4ubuntu9) 4.4.1

If you need any further information, please let me know. Thanks!

Anders
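
A likely cause of this kind of error is that newer gcc releases no longer pull in <cstdio> indirectly through other standard headers, so code that uses EOF must include it explicitly. The sketch below shows the pattern; it is a hypothetical helper (extractInt), not the code from the tclap headers.

```cpp
#include <cassert>
#include <cstdio>   // defines EOF; without it the comparison below may not compile
#include <sstream>
#include <string>

// Read one int from text. Returns false on empty input or a failed parse.
bool extractInt(const std::string& text, int& value) {
    std::istringstream is(text);
    if (is.peek() == EOF) return false;  // nothing to read
    is >> value;
    return !is.fail();
}
```

Under that assumption, adding an explicit #include <cstdio> to the headers named in the error messages would be the corresponding change.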

Original issue reported on code.google.com on 29 Apr 2010 at 12:10

missing "std::" in util.h

What steps will reproduce the problem?
1. make -f makefile.stlport

What is the expected output? What do you see instead?
Compiler errors of the form
util.h:126: error: reference to ‘istringstream’ is ambiguous
./tclap-1.2.1/include/tclap/Arg.h:43: error: candidates are: typedef struct 
stlp_std::istringstream istringstream

What version of the product are you using? On what operating system?
m2m-aligner-1.2
Linux 2.6.32-5-amd64
gcc (Debian 4.4.5-8) 4.4.5

Please provide any additional information below.
Changing line 126 in util.h from
     istringstream i(s);
to
     std::istringstream i(s);
fixes the problem.

Original issue reported on code.google.com on 9 Aug 2013 at 2:14
