Zentrum für Molekularbiologie der Pflanzen (ZMBP)

VBA/VB2007/.NET

 


*NEW* Motif Mapper .NET

 

 

5.2.4.0 (May 2012)

 

P-value correction was removed.

5.2.3.0 (Jan 2012)

 

Fixed a small bug: when a motif had no matches in the background,
the variance still showed a value. It is now properly reset.
(This did not affect the data.)

Added corrected p-values, based on Type I error tests
on real data, for Significant Motifs in Promoters.

Fixed a small newline bug in the mmtb output.

5.2.2.1 (Mar 2011)

Corrected small bugs: division by zero, Int-Str conversions. Converted Integer
variables to Long variables for large sequences that exceed the 32-bit Integer range.
Removed the "." (period / dot) file name error call from Fasta_From_Fasta.

5.2.2.0 (Aug 2010)

GFF3 parser: for a GFF3 file, choose the appropriate columns, etc.
The sequence files must be supplied individually in
FASTA format (separate files); the program must be able to
read each one to process the information, otherwise it aborts.
The program needs to know which folder to search for the files.
It scans all files there and tries to match each file name with the
chromosome name: NOT the FASTA name, just the file name!
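
The file-name matching described above can be sketched roughly as follows. This is a Python illustration, not the VB.NET code of Motif Mapper; the function names are hypothetical, and the exact, case-sensitive comparison of the file name (without extension) against the GFF3 chromosome column is an assumption.

```python
from pathlib import Path


def index_sequence_files(folder):
    """Map file names without extension (e.g. 'Chr1' for 'Chr1.fasta') to paths."""
    return {p.stem: p for p in Path(folder).iterdir() if p.is_file()}


def chromosomes_in_gff3(lines):
    """Collect the chromosome name (column 1) of every GFF3 feature line."""
    seqids = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        seqid = line.split("\t", 1)[0]
        if seqid not in seqids:
            seqids.append(seqid)
    return seqids


def match_chromosomes_to_files(gff3_lines, folder):
    """Return {chromosome: path}; abort if a chromosome has no matching file.

    Only the file name is compared; the FASTA header inside the file is ignored.
    """
    files = index_sequence_files(folder)
    matched = {}
    for seqid in chromosomes_in_gff3(gff3_lines):
        if seqid not in files:
            raise FileNotFoundError(f"no sequence file named after {seqid!r}")
        matched[seqid] = files[seqid]
    return matched
```

With a file Chr1.fasta in the folder, a GFF3 line starting with "Chr1" is matched even if the FASTA header inside the file says something else.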

 

 

Removed the expected frequency calculations from MapNCtr.
Fixed the RemoveRedundancy algorithm.
Added an INPUT note to the SigCluster output.
Fixed small output bugs where I found them.

 


Notes:
motif{x}motif will not trigger the Dyad all option.
If you need to test a fixed spacing with Dyad all,
use motif{x,x}motif.

 


5.1.1.39 (May 2010)

Motif Mapper now has a graphical interface using Visual Basic 2007 and .NET.
Most programs have been ported over and some improvements have been added. New
programs include ones for calculating the percent overlaps between large sets
of gene lists for visualization in graph editors and for calculating motif enrichments
in clusters of genes using real background permutations. The graphical outputs MMGraphicsP
and ExonIntronMickey have been discarded.


VB Script Versions


3.6.5

*I added Control Tip text to the large GUI and more click indication to the file GUI.

*aGBSQL
I noticed that a few gene annotations have alternate annotation schemes.
To correct for this, I have added a control to make sure that the
mRNA precedes the CDS annotation in case of alternate entry types.

3.6.0

*Modifications and Additions

-aGBSQL(version for TAIR v.7.0)

Improved regular expression name extraction, since the heavy annotating in the TAIRv7 release
caused many problems. There still may be an occasional glitch when an annotation carries
"illegal" characters confusing the RegEx. I am not sure why it worked for a while and then
"decided" not to work later. Be careful, since I do not extract all loci/gene identifiers like before!

Also, I added a check box to frmGBSQL to offer something that I had only offered
as a complete code change, and that is extracting only genes with a 5'UTR
(and therefore a presumable annotated transcription start site).

Made a visual enhancement to frmFile to show the file extension change option.

-MMGraphicsP

The MMgraphicsP was altered to return the Sense as a bar ABOVE the "sequence line",
and the antisense BELOW the sequence line. This allows more motifs to be mapped
and visualized.


3.5.2

*Cleaned Code

Motif Mapper Project (Module)
If the ITextStream object and the String variable holding the entire chromosome
read are not disposed of (a common problem with the script versions of Visual Basic),
the ITextStream object crashes once memory is overloaded, since the
stack fills up (there is no automatic garbage collection). I moved the ITextStream
read into a function (so that the local variable scope would be purged, even
though the local string object is not), and that seems to have solved the problem.
I don't think that there is a virtual memory option (or is mine off?), so
once the RAM is exceeded for the ITextStream object, it still crashes.
In such cases the file (usually a really big one) is skipped by error handling.
The function for the read is placed in MM_Basic_Class.


3.5.1 (April 2007)

*Modifications and Additions

-aGBSQL

*Added a regular expression for extracting Gene/Locus names (before partial names were taken)

*Corrected the Name check: when either the Gene or Locus annotation was missing, it caused
a NULL comparison and the sequences were not extracted

*Added a function for returning the Gene/Locus name for the aGBSQL


-MotifMapper Promoter

*changed the sorting histogram algorithm (full read through) to calling
the binary sorting algorithm in MMdyad.

-Point Mapper

*length shift bug found in Dyad mapping, wrong variable call for placing the motif mark
with fixed motifs, e.g. nnn{2}nnn{2}nnn should be length 13
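
The expected length in the example above follows from adding up the motif pieces and the fixed spacers: 3 + 2 + 3 + 2 + 3 = 13. A minimal Python sketch of that arithmetic (the helper name is hypothetical; it only handles fixed spacers of the form {k}, not the variable {a,b} form):

```python
import re


def fixed_pattern_length(pattern):
    """Total length covered by a fixed-spacing pattern such as 'nnn{2}nnn{2}nnn'."""
    total = 0
    # Split the pattern into '{k}' spacer pieces and motif pieces.
    for piece in re.findall(r"\{\d+\}|[^{}]+", pattern):
        if piece.startswith("{"):
            total += int(piece[1:-1])  # a spacer contributes k positions
        else:
            total += len(piece)        # a motif contributes its own length
    return total


print(fixed_pattern_length("nnn{2}nnn{2}nnn"))  # 13
```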


3.5.0 (March 2007)

*Modification
The Motif Mapper suite has been rewritten into class structures, a cover GUI to choose all of the available
programs has been added, and a new multi-dyad algorithm has been developed. Various other improvements have been made,
dead code has been removed, and any bugs found have been fixed (I would love to say permanently eliminated).

*aGBSQLv3.5.0 now can cut off overlapping upstream genes (for promoter extraction) or the entire length
from gene to gene. The extraction algorithm for promoters was improved and placed as its own function.

*dyads
Multi-spaced elements can be analyzed up to 5 elements (including overlapping sequence) in FASTA format.
Multi-spaced elements can be analyzed up to 5 elements (non-overlapping) for chromosome sequences (otherwise
the program is too slow). Chromosome analysis no longer chops the sequence up; instead,
the entire chromosome (long sequence) is loaded into memory and a histogram is written out using
the user-defined window size. No elements are missed for non-multispaced elements. One caveat:
elem1{x4}elem1 is converted to elem1NNNNelem1. If elem1x4elem1x4elem1 occurs in the
search string, only the first 5' match is found. To find both, you need to write
elem1{x4,x4}elem1, and the search will then allow overlaps. This is not available for chromosome searches (or, you
can try, but it takes too long since I scan the entire string each time; there are various tricks
to get around it, one of which is to use a better RegEx library, which is not available for VB6).
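
The difference between the plain and the overlapping search can be illustrated with Python's re module. This is only a sketch of the idea, not the VB6 RegExp code; the function names are hypothetical, the spacer is written as a plain integer, and a lookahead is used here as one of the "tricks" a richer RegEx library allows.

```python
import re


def fixed_dyad_regex(five_prime, spacer, three_prime):
    """Build a regex in which the spacer is `spacer` arbitrary nucleotides."""
    return f"{five_prime}.{{{spacer}}}{three_prime}"


def find_non_overlapping(regex, sequence):
    """Start positions from a plain left-to-right scan: hits cannot share nucleotides."""
    return [m.start() for m in re.finditer(regex, sequence)]


def find_overlapping(regex, sequence):
    """Start positions when hits may overlap (zero-width lookahead at every position)."""
    return [m.start() for m in re.finditer(f"(?=({regex}))", sequence)]


# elem1, 4 nt, elem1, 4 nt, elem1: the middle elem1 belongs to two dyads.
sequence = "ACGT" + "TTTT" + "ACGT" + "TTTT" + "ACGT"
regex = fixed_dyad_regex("ACGT", 4, "ACGT")
print(find_non_overlapping(regex, sequence))  # [0]
print(find_overlapping(regex, sequence))      # [0, 8]
```

The plain scan consumes the middle element with the first hit, so the second dyad is only reported when overlaps are allowed.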

*Clearing Lists
I have two small programs for clearing lists. One is made for small lists (a few thousand members) and the
other is slower but works for really large lists. The smaller one uses a recursive array function that can
split tab-delimited lines and remove redundancy by letting the user choose which column is the index.

The slower version writes to and reads from the hard disk instead of recursing in memory (which is why it is slow),
but it considers each entire line as the index for comparison. The advantage of this program is that it won't
crash before it is finished. Which one you use depends on what you need to do.

*Addition
Element Distributions
Have you ever wanted to get the distribution values for two elements in your set of sequences? The small
program ElementDistributions automatically collects the sense and antisense values for all events found
in a FASTA sequence and returns a tab delimited file for analysis.
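
As a rough illustration of collecting sense and antisense events into tab-delimited rows, here is a Python sketch. It is not the ElementDistributions code: the column layout, the 1-based positions, and the reporting of antisense hits at their sense-strand coordinate are all assumptions, and only plain A/C/G/T motifs are handled.

```python
def reverse_complement(seq):
    """Reverse complement for plain A/C/G/T sequences."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def element_positions(name, sequence, motif):
    """Return tab-delimited rows: name, motif, 1-based start, strand (+ or -)."""
    sequence = sequence.upper()
    motif = motif.upper()
    strands = [("+", motif)]
    antisense = reverse_complement(motif)
    if antisense != motif:  # a palindromic motif would otherwise be reported twice
        strands.append(("-", antisense))
    hits = []
    for strand, pattern in strands:
        start = sequence.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = sequence.find(pattern, start + 1)  # allow overlapping events
    hits.sort()
    return [f"{name}\t{motif}\t{pos + 1}\t{strand}" for pos, strand in hits]


for row in element_positions("seq1", "TTGACAAAGTCAA", "TGAC"):
    print(row)
```

For the example sequence this prints one sense hit at position 2 and one antisense hit at position 9.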


3.4.1 (Sep 2005)
*Additions

1. A scanning algorithm called ElemDistSingle returns the position and orientation of motifs in
sequences formatted in FASTA, the sense and antisense are automatically handled unlike the other
mapping algorithms.

2. MAIN_SeqFASTA_Klear which extracts FASTA sequences to individual files is re-introduced into
the package.


*Modifications
1. Fasta_extractor when comparing with a name list takes the first contiguous name until the first blank
space or end of line, furthermore it removes one space "> NAME" that may precede a name for whatever
annoying reason.

2. ClearRedundancy was changed so that it no longer reads from and writes to the hard drive, and was designed so that any
tab-delimited data using the first column as an index can be cleared using the index name (case-insensitive).
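
The in-memory approach for ClearRedundancy in item 2 can be sketched as follows. This is a Python illustration under the assumption that the first line seen for each index name is the one kept; the function name and the optional column argument are hypothetical.

```python
def clear_redundancy(lines, index_column=0):
    """Keep the first line seen for each index value, compared case-insensitively."""
    seen = set()
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore empty lines
        key = line.split("\t")[index_column].lower()
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


rows = ["At1g01010\t1.5", "AT1G01010\t2.0", "At2g02020\t0.3"]
print(clear_redundancy(rows))  # ['At1g01010\t1.5', 'At2g02020\t0.3']
```

A set lookup avoids both the recursion and the disk round trip, at the cost of holding all index names in memory.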

*NOTE
The dyad function was still poorly written. I have attempted to outdo myself, and to get it right this time,
in version 3.5.0. The entire way of thinking about analysing dyads has been changed to make use
of the RegExp library as much as possible, and I now work with the positions instead of chopping the line
up like before (an almost reasonable solution for the RegEx library from VB6). For this reason, v.3.4.1
was not made available, but it was critical in leading me to abandon this approach.

3.4.0 (Aug 2005)


*Corrections

1. found a bug in aGBSQL that prevented the Gene-Toggle function from working
(simple code positional error).


*Modifications

1. removed the overlap option for variably spaced motifs (the user can type this
in themselves for simple motifs) but this was a good option for degenerate motifs!!!
-might re-implement it later.

2. Sequence name extraction is everything until the first space in the name is reached,
(excluding the space) for MM_Fasta and MAIN_Extract_Once_FASTA.
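
The name extraction rule in item 2 amounts to the following, shown as a small Python sketch (the function name is hypothetical; tolerating a leading space as in "> NAME" follows the Fasta_extractor note under 3.4.1, and treating any whitespace as the terminator is an assumption):

```python
def fasta_name(header_line):
    """Return the sequence name: the text after '>' up to the first whitespace."""
    if not header_line.startswith(">"):
        raise ValueError("not a FASTA name line")
    # split() without arguments skips leading whitespace, so "> NAME" also works.
    parts = header_line[1:].split(None, 1)
    return parts[0] if parts else ""


print(fasta_name(">At1g01010 hypothetical protein"))  # At1g01010
print(fasta_name("> NAME with a stray leading space"))  # NAME
```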


*Major additions

1. changed the dyad splicing algorithm to a recursive algorithm so that
an unlimited number of variably spaced motifs can be searched for.

2. added the Dyadic (variable spaced motif) option to REG_Pro_Point_Mapper_340;
the only output problem is that a 3p overlap is counted as many times as it is used
when the 3p overlap is scored.

3. pSUM scan to log peaks and their stats. This version takes 1 min
vs. 8 hrs for a PERL version for the 4096 hexamers!!!. Please anyone tell me
what I was doing wrong!!!

4. added the intergenic extraction option in aGBSQL finally.
See documentation for usage notes.


*Minor additions/notes

1. Alternative transcripts are still not supported during GenBank extraction.
If you want this information, you only have to un-hide the Previous name skip.

2. A line of code in aGBSQL to skip genes whose 5' ends are not well annotated is hidden,
and can easily be unhidden to activate it.

3. Name comparison for aGBSQL uses the "Name" or "Locus Tag". If in doubt,
extract everything and see what is going on to make an appropriate name list.

4. removed Tabs in the names outputted from the aGBSQL and added a check for
Locus Tag and/or Gene Names, instead of just Gene Names.

5. MAIN_Extract_Once_FASTA write out to individual files has been removed,
use the 3.1.2 version.


3.3.2

*Corrections

1. Count_text_files has a variable name typo in it.

2. Extract_Indv_seq_FASTA has flow control problems; corrected.

*Modification

1. File input allows the user to queue all files from a folder.
User must enter the folder path name manually, however.


3.3.1

*Modification

1. By using the "locus_tag: " annotation, one is now able to use
AGI lists to extract sequences. This assumes that the locus tag
is the most desired version - but since the version for At Release 4.0
uses the first entry to prevent redundancy, this is probably a better choice.
KEEP in mind however - the AGI is only pulled out when it is searched for,
otherwise the original default is still present. Why? Turns out that
most of the locus_tagged sequences are the first one in the list of alternative
transcripts. When one pulls out promoters (upstream), the output sequence
name identifier returns both the BAC name and the AGI name, if there is one
(and there usually is).

2. MickeydaExonIntron now automatically flips reverse-complement annotation
numbers and image so that it is 5p->3p.

3.2.3

*Correction

1. Non-reinitialization of the upstream and downstream variables
caused redundant, illegal extraction. This has been corrected
by setting them to zero in the UTR calls.

2. SeqConkatFasta was dumb since I read in a new line
before I could check to see if it was a FASTA name
line or not. Fixed.

3.2.2 (Mar2004)

*Modification

1. Promoter_Point_Mapper is enabled for IUPAC handling with
regular expression algorithm and renamed to REG_Pro_Point_Mapper.
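
IUPAC handling with regular expressions usually comes down to expanding each ambiguity code into a character class. A minimal Python sketch of that idea follows (not the REG_Pro_Point_Mapper code; the function names are hypothetical, the table is the standard IUPAC nucleotide code):

```python
import re

# Standard IUPAC nucleotide codes as regex character classes.
# Note: no commas inside the brackets, a comma would be matched literally.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(motif):
    """Translate an IUPAC motif such as 'CANNTG' into a regular expression."""
    return "".join(IUPAC[base] for base in motif.upper())


def find_motif(motif, sequence):
    """Return the 0-based start positions of non-overlapping matches."""
    pattern = re.compile(iupac_to_regex(motif))
    return [m.start() for m in pattern.finditer(sequence.upper())]


print(iupac_to_regex("CANNTG"))                    # CA[ACGT][ACGT]TG
print(find_motif("CANNTG", "ttCACGTGaaCATATGcc"))  # [2, 10]
```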

3.2.1 (Mar2004)

*Corrections

1. Dyad splicing algorithm was bad. It is replaced and debugged
with the series gg{2}g / gg{0,2}g / gg{2,2}g. It works as desired now.
Careful with MMapper_Project since I use "i" there as a master control
variable and in MMapper_Fasta it is only a local control variable for
the dyad part. I wasted lots of time again, but still didn't change the name.

3.1.2 (Feb2004)

*Corrections

1. ClearSeq still had commas in the "LIKE" command.

2. NamesFromFasta was pruned of useless variables.

3. MAIN_Extract_Indv_seq_FASTA was not corrected for the InList function
due to variable calling errors. Corrected.

5. MTBS and MMTB output file name echo returned the wrong file name.

*Modification

1. Error-Handling for Reg_MM GC_Content input added. The handler checks
for a numeric value with 0 <= value <= 1.
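
A check of this kind might look like the following Python sketch. The function names are hypothetical, and splitting the GC fraction evenly between G and C (and the remainder evenly between A and T) is an assumption about how the value is used afterwards.

```python
def parse_gc_content(text):
    """Parse a manually entered GC content and insist on 0 <= value <= 1."""
    try:
        value = float(text)
    except ValueError:
        raise ValueError(f"GC content must be numeric, got {text!r}") from None
    if not 0.0 <= value <= 1.0:  # also rejects NaN
        raise ValueError(f"GC content must be between 0 and 1, got {value}")
    return value


def base_frequencies(gc):
    """Assumed background model: G = C = gc / 2 and A = T = (1 - gc) / 2."""
    return {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}


print(base_frequencies(parse_gc_content("0.5")))
```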

3.1.1 (Jan2004)

*Modifications

1. ALL older algorithms are now replaced with
VisualBasic5.5 regular expression Object enabling
IUPAC usage embedded by template activation.

2. Single dyads with variable spacing are implemented
to also allow overlapping of 3p motifs
with 5p motifs, but autocorrelation is still
prevented by advancing the entire length of the 5p motif.
MMfasta and MMProject are complete.
Perhaps, GrepShuffle will be integrated.

2.3.1 (Jan2004)

*Bugs known or fixed.

1. When comparing names, InList is now employed in the
MM_FASTA versions (see 1.2.2 pt.4) that employs adding
a division character "%" to search for hits!

2. Output breaks at 60000 lines with MM-FASTA corrected
by >= 60000, since there are more than one line outputted
when .........

1.3.2 (Jan2004)

*Bugs known or fixed.

1. Output and retrieval problems with introns. The
selection option was defective due to no check with sense
or antisense. All exons pulls were correct except for
the right exon when choosing a selected exon.

1.2.2 (Jan 2004)

*Bugs known or fixed.

1. All IUPAC converters were missing the D -> H conversion for reverse-complements.

2. When using the -like- command and one sets something within
the brackets [], then that means that every letter should be taken
individually and if you put a comma -,- inside, then it also
looks for commas too. That is a list in brackets is handled
individually for all characters present and the comma had been
taken for a character in all instances where I have used it.
Not any more.

3. The Rootfolder of the Conkat could not be used. Fixed,
assuming that it is the root folder that is being searched
for files in it; a logical assumption.

4. Name comparison in GenBankSQL not appropriate for
sub-nest string names, corrected by InList function. InList
appends an "%" to the end of each sequence name.
A "%" in a sequence name is therefore prohibited.

5. Same problem like (4.) for MAIN_Extract_Indv_seq_FASTA,
corrected using InList function. Fasta Name truncation
only employed when writing out individual text
files for each sequence.

6. The Jan-2004 annotation of the Arabidopsis genome retains
multiple entries for alternative transcripts/translation products.
The first entry is taken by default as the desired annotation
when extracting information with GenBankSQL.
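
For item 1 above, a complete IUPAC complement table pairs D with H (and B with V), since D means "not C" and H means "not G". A short Python sketch of such a converter (illustrative only, not the original VB code):

```python
# Complement of every IUPAC nucleotide code; D <-> H and B <-> V are the
# pairs that are easy to forget.
COMPLEMENT = {
    "A": "T", "T": "A", "C": "G", "G": "C",
    "R": "Y", "Y": "R", "K": "M", "M": "K",
    "S": "S", "W": "W", "N": "N",
    "B": "V", "V": "B", "D": "H", "H": "D",
}


def reverse_complement(motif):
    """Reverse-complement a motif written with IUPAC codes."""
    return "".join(COMPLEMENT[base] for base in reversed(motif.upper()))


print(reverse_complement("GATD"))  # HATC
```

Applying the function twice should return the original motif, which makes a quick sanity check for the table.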

1.2.1 (Nov 2003)

*Bugs known or fixed.

1. Manual GC input was incorrectly handled after input because
the value was divided to yield the G or C content before
the A or T content was calculated.

2.The GBsql_form allowed an exit when the list option was chosen.

3.The ClearList function was originally written as a recursive
function, but crashed with long lists by filling the stack.
This has been corrected by writing to the disk. All
associative programs using ClearList have been updated. This
of course means that long lists never crash.

4.MMgraphicP still has an output name error when it reaches the last
FASTA sequence.

5.SetGrouping was a bit stupid and hung when no match was found.

6.Map_N_Count dyads were searched as absolute for the 5 prime motif,
meaning that if one has ttttgggggggaaaa and searches for tttt{8}aaaa
and ttt{8}aaaa only the hit for the former was returned because it
advanced by one hit once found starting at the next nucleotide. This
has been corrected to advance by one nucleotide until a true hit is found,
then the search is advanced by one 5prime motif length.
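
The corrected advance rule from item 6 can be sketched as below: step one nucleotide at a time until a true hit is found, then jump ahead by the length of the 5 prime motif. This is a Python illustration with a hypothetical function name; the spacer is taken as an exact number of nucleotides between the two motifs, which is an assumption about the notation.

```python
def scan_dyad(sequence, five_prime, spacer, three_prime):
    """Return 0-based starts of five_prime + `spacer` nucleotides + three_prime."""
    hits = []
    span = len(five_prime) + spacer + len(three_prime)
    i = 0
    while i + span <= len(sequence):
        three_start = i + len(five_prime) + spacer
        if (sequence.startswith(five_prime, i)
                and sequence.startswith(three_prime, three_start)):
            hits.append(i)
            i += len(five_prime)  # true hit: advance by one 5 prime motif length
        else:
            i += 1                # no hit: advance by one nucleotide
    return hits


# The first two "tt" candidates have no matching 3 prime motif, the third does.
print(scan_dyad("ttttggaa", "tt", 2, "aa"))  # [2]
```

Advancing only after a true hit means that a 5 prime match without its partner no longer hides a valid dyad starting one nucleotide later.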

1.1 (Aug 2003)

Programs added:

*ClearSeq

Clears any text file of all non-IUPAC characters,
to a new text file under the c:\ directory.
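
The filtering step of ClearSeq can be pictured with this small Python sketch. It is not the original program: the function name is hypothetical, and it is an assumption that the allowed alphabet is the IUPAC nucleotide code (including U) and that everything else, including digits, whitespace and line breaks, is dropped.

```python
# Assumed alphabet: IUPAC nucleotide codes, upper or lower case.
IUPAC_LETTERS = set("ACGTURYSWKMBDHVN")


def clear_seq(text):
    """Remove every character that is not an IUPAC nucleotide code."""
    return "".join(ch for ch in text if ch.upper() in IUPAC_LETTERS)


print(clear_seq("1 acgt nnRY\n61 xx-gg*"))  # acgtnnRYgg
```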

1.0
The initial finished package with guide describing the
programs, individual .bas files, .frm/.frx files, and
the complete suite as a "template" file were compressed
into a WinZip file. April 2003.




 



