I got some code from the question Java HttpURLConnection cutting off HTML, and I am using pretty much the same code to fetch HTML from websites in Java. It works everywhere except for one particular website that I am unable to make this code work with.

I am trying to get HTML from this website:

http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289

But I keep getting junk characters, although the same code works perfectly well with any other website, such as http://www.google.com.

This is the code that I am using:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public static String PrintHTML() {
    StringBuilder builder = new StringBuilder();
    try {
        URL url = new URL("http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289");
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6");
        System.out.println(connection.getResponseCode());

        // Read the response body line by line
        BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
        String line;
        while ((line = reader.readLine()) != null) {
            builder.append(line).append("\n");
        }
    } catch (IOException e) {
        // MalformedURLException is a subclass of IOException, so one catch covers both
        e.printStackTrace();
    }
    String html = builder.toString();
    System.out.println("HTML " + html);
    return html;
}

I don't understand why it doesn't work with the URL that I mentioned above.

Any help will be appreciated.

1 Answer

That site is incorrectly gzipping the response regardless of the client's capabilities. Normally, a server should only gzip the response when the client declares support for it (via the Accept-Encoding: gzip request header). You need to decompress it using GZIPInputStream.

reader = new BufferedReader(new InputStreamReader(new GZIPInputStream(connection.getInputStream()), "UTF-8"));

Note that I also passed the right charset to the InputStreamReader constructor. Normally you would extract it from the Content-Type header of the response.
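For example, here is a minimal sketch of how you could do both defensively: only unwrap the stream when the response is actually gzipped (assuming the site at least sets the Content-Encoding: gzip header), and extract the charset from the Content-Type header. The UTF-8 fallback and the hand-rolled header parsing are my own assumptions, not something this particular site guarantees:

import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// ... inside the method, replacing the reader setup:
InputStream in = connection.getInputStream();
if ("gzip".equalsIgnoreCase(connection.getContentEncoding())) {
    in = new GZIPInputStream(in); // only decompress when the response says it is gzipped
}

String charset = "UTF-8"; // assumed fallback when no charset is declared
String contentType = connection.getContentType(); // e.g. "text/html; charset=UTF-8"
if (contentType != null) {
    for (String param : contentType.split(";")) {
        param = param.trim();
        if (param.toLowerCase().startsWith("charset=")) {
            charset = param.substring("charset=".length());
        }
    }
}
reader = new BufferedReader(new InputStreamReader(in, charset));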

For more hints, see also How to use URLConnection to fire and handle HTTP requests? If all you ultimately want is to parse or extract information from the HTML, then I strongly recommend using an HTML parser like Jsoup instead.
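If you go that route, a short sketch of fetching and parsing the page with Jsoup might look like this (the user-agent value and printing the title are purely illustrative):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class GeniExample {
    public static void main(String[] args) throws Exception {
        // Jsoup handles the connection, gzip decompression and charset detection itself.
        Document doc = Jsoup.connect("http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289")
                .userAgent("Mozilla/5.0") // some sites reject requests without a user agent
                .get();

        System.out.println(doc.title()); // e.g. print the page title
    }
}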


2 Comments

Wow, it worked. Thanks for the explanation, and a big thanks for the snippet as well. I initially tried using HTMLCleaner as my parser, but I was getting the same issue. Now I am going to feed this HTML string into HTMLCleaner.
BTW, jsoup (1.3.1) now deals with that gzipped output correctly when using Jsoup.connect(url).get();
