maxr1876 February 2016

Java String variable becomes corrupt at runtime

Hey all, I've got a weird bug in a small Java program I'm writing for a school project. I am well aware of how sloppy the code is (it is still a work in progress), but somehow my String variable "year" becomes corrupted after breaking out of a loop. I am using Java with Hadoop MapReduce to count unigrams and bigrams and sort them by year/author. Using print statements, I have determined that "year" is indeed set when I assign temp to it, but at any point after the loop in which it is set, the variable is corrupted somehow. The year number is replaced with a huge amount of whitespace (at least that's how it appears in the console). I have tried year=year.trim() and the regex year=year.replaceAll("[^0-9]",""); neither works. Anybody have any ideas? I have only included the map class, as that is where the problem is. Note that the text files being parsed are from Project Gutenberg; I am working with a small sample of about 40 random texts from the project.

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
 public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text(); 
    public synchronized void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        line = line.toLowerCase();
        line = line.replaceAll("[^0-9a-z\\s-*]", "").replaceAll("\\s+", " "); 
        Str        

Answers


Gavriel February 2016

You have year = temp in your code, so what ends up in year depends on your input.

Possible bug:

for (int i = 0; i<temp.length();i++){
    if (Character.isDigit(temp.charAt(0))){

I think you mean i instead of 0 in charAt:

for (int i = 0; i<temp.length();i++){
    if (Character.isDigit(temp.charAt(i))){
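
To see why the index matters, here is a minimal sketch comparing the two loops. The loop body (appending the character to a StringBuilder) is only an assumption for illustration, since the rest of the original method is not shown; the class and method names are hypothetical too.

```java
class DigitScan {
    // Mirrors the loop quoted above: the test always inspects index 0,
    // so the first character decides the outcome for every position.
    static String keepDigitsCheckingIndexZero(String temp) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < temp.length(); i++) {
            if (Character.isDigit(temp.charAt(0))) {
                out.append(temp.charAt(i)); // hypothetical loop body
            }
        }
        return out.toString();
    }

    // Same loop, but the test inspects the character at the current index.
    static String keepDigits(String temp) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < temp.length(); i++) {
            if (Character.isDigit(temp.charAt(i))) {
                out.append(temp.charAt(i)); // hypothetical loop body
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // First character is a digit: the index-0 check lets everything through.
        System.out.println(keepDigitsCheckingIndexZero("1876a")); // 1876a
        System.out.println(keepDigits("1876a"));                  // 1876

        // First character is not a digit: the index-0 check rejects everything.
        System.out.println(keepDigitsCheckingIndexZero("a1876").isEmpty()); // true
        System.out.println(keepDigits("a1876"));                            // 1876
    }
}
```

With charAt(0) the result is all-or-nothing depending on the first character, which would fit a value that looks fine for some inputs and wrong for others.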

Also, consider not using StringTokenizer. Its Javadoc says:

StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.

The following example illustrates how the String.split method can be used to break up a string into its basic tokens:

 String[] result = "this is a test".split("\\s");
 for (int x=0; x<result.length; x++)
     System.out.println(result[x]);
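
One thing to watch with split: "\\s" matches a single whitespace character, so runs of whitespace produce empty tokens, which StringTokenizer would have skipped. A rough sketch of a helper that avoids this is below; the name tokens and the class name are made up for illustration.

```java
import java.util.Arrays;

class SplitTokens {
    // Hypothetical helper: split a line on runs of whitespace,
    // returning no tokens at all for a blank line.
    static String[] tokens(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        // Two spaces between a and b: "\\s" yields an empty token in the middle.
        System.out.println(Arrays.toString("a  b".split("\\s"))); // [a, , b]

        // Trimming first and splitting on "\\s+" yields only the words.
        System.out.println(Arrays.toString(tokens("  this is   a test "))); // [this, is, a, test]
    }
}
```

The trim() call is there because split keeps an empty leading token when the string starts with a delimiter.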


obi1 February 2016

Found your white space...

The two statements that output the year variable add a couple of newlines:

System.out.println("\n"+year+"\n")

or a tab:

word.set(temp+"\t"+year);   
context.write(word,one); 

Try removing the \n and \t.
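
If it is still unclear where the whitespace comes from, one way to check is to print the value with its invisible characters made visible. This is just a debugging sketch, not part of the original program; the helper name visible and the class name are made up.

```java
class ShowHidden {
    // Hypothetical debugging helper: wraps the string in brackets and
    // spells out newlines, tabs and any other non-printable-ASCII character.
    static String visible(String s) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\n') {
                sb.append("\\n");
            } else if (c == '\t') {
                sb.append("\\t");
            } else if (c == '\r') {
                sb.append("\\r");
            } else if (c < 0x20 || c > 0x7E) {
                sb.append(String.format("<U+%04X>", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        String year = "1876";
        // The value itself is clean...
        System.out.println(visible(year));               // [1876]
        // ...but the string that actually gets printed is not.
        System.out.println(visible("\n" + year + "\n")); // [\n1876\n]
        System.out.println(visible("word" + "\t" + year)); // [word\t1876]
    }
}
```

Calling visible(year) right after the loop and again just before context.write should show whether year itself changed or only the text around it. It also reveals characters such as a non-breaking space (U+00A0), which trim() does not remove.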
