datameet-pune / datameet-pune.github.io


Common repo and documentation space for DataMeet Pune chapter

Home Page: https://sites.google.com/view/datameetpune/home

License: GNU General Public License v3.0

HTML 100.00%

pune maharashtra village datameet bus volunteers city india

datameet-pune.github.io's People

shivamordanny ajinkyayelai gayatrivenugopal miaj22 devendrabhat sumedh3 geokaka6 knickhill shivdahiphale poulomeeghosh ajainf pranav-gairola udaykeith vivek-bombatkar vrushali-d adedhe ishikapal abirmetya bgis-abhishek

datameet-pune.github.io's Issues

multipolygon incorrect district and other attributes

GML_ID Maharashtra_village.40557
21.289816, 77.502907 - I do not believe this is the correct village name or parent district.

I think the village name should be Kheltapmali, unless Achalpur took over the entire area (but this does not look urban). The district should be Amravati.

Table unpivot script done for Rainfall Data

Task name: Table Un Pivot
Task description: https://github.com/datameet-pune/datameet-pune.github.io/wiki/Table-Un-Pivot
Task status: Completed!

Data source: http://www.indiawaterportal.org/articles/district-wise-monthly-rainfall-data-list-raingauge-stations-india-meteorological-department

Script available at: https://github.com/kshithijiyer/DataMeet-Pune-Rainfall

Processed Data available at: https://github.com/kshithijiyer/DataMeet-Pune-Rainfall/tree/master/Processed-Data

Have a look at the ReadMe.md to get some ideas on what can be done with the data!
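For readers unfamiliar with the term, here is a minimal sketch of what a table un-pivot does. It is not the script linked above. The column names (`district`, `year`, `Jan`, `Feb`) and the numbers are placeholders made up for illustration, not real rainfall figures.

```python
def unpivot(rows, id_fields, var_name="month", value_name="rainfall_mm"):
    """Turn wide rows (one column per month) into long rows (one row per month)."""
    long_rows = []
    for row in rows:
        for field, value in row.items():
            if field in id_fields:
                continue  # identifier columns are copied, not un-pivoted
            record = {key: row[key] for key in id_fields}
            record[var_name] = field
            record[value_name] = value
            long_rows.append(record)
    return long_rows


# Placeholder input: two districts, two month columns each.
wide = [
    {"district": "Pune", "year": 2010, "Jan": 0.5, "Feb": 1.2},
    {"district": "Satara", "year": 2010, "Jan": 0.0, "Feb": 0.3},
]
long_table = unpivot(wide, id_fields=("district", "year"))
for record in long_table:
    print(record)
```

Each wide row with N month columns becomes N long rows, which is usually the easier shape for filtering and charting tools.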

sliver polygon

GML_ID Maharashtra_village.16756 in MH Villages v2W2 shapefile

Problem: Many maps, such as development plans and ward maps, are typically released by government agencies as PDFs exported from AutoCAD or similar software. When you open such a PDF, you can sometimes see it loading in layers: first a background image, then border lines, then labels, and so on. This is because the PDF holds multiple layers together, like a stack of transparent sheets.

I have outlined here some ways to extract these layers from the PDF, using a free open source vector editing tool called Inkscape:

https://superuser.com/a/1296609/487360 Answer to forum question: How can I split PDF file into layers

Problem statements for a hackathon

25.10.18: Edit:

Link to the Slides shown onscreen at the meetup

Gathering problem statements for a hackathon.
Guidelines:

One problem statement per post. If you have multiple, then make separate posts.
Clearly show skillsets required to qualify for participating in the task.
Provide reference links if any at bottom.
Make it small, not big. We want people to actually finish the task and put out the solution within a half-day or whatever the hackathon's action duration is. If the task is looking big, break it down into multiple tasks.
We want to solve real world problems with the task. Hence, provide a realistic use case scenario or, better, a real life need (mention who needs it).

Bank accounts data from PMJDY website

Datameet group thread: https://groups.google.com/forum/#!searchin/datameet/pdfs%7Csort:date/datameet/ErNY82gA7dw/mmBUxH5DAgAJ

Site: https://www.pmjdy.gov.in/archive

Co-ordination for Business Intelligence Hackathon, SICSR, Jan 2019

Schedule

Saturday 5th Jan 2019, 2 to 6 pm.

Participant audience

Students doing an MBA course, learning Business Intelligence tools like Tableau / PowerBI / SuperSet / other (will update here if we get to know precisely)

Things to get together

People with good work experience in these tools from DM side who will conduct the session and mingle with participants, provide ideas and tips, co-ordinate between teams.
Datasets to work on. (Can be Pune level, MH level, India level. Real world data only.)
Exploratory questions on each dataset that participants can make data-viz's etc to answer.

Desired output of the event

Best data-visualizations created by the participants + DMers will be featured by Pune Open Data Portal.
No ranking or limit on viz's, because likely it will be multiple creations assembled together which will be more impressive than individual viz's.
But of course there will be a selection and only top quality work, decided by core team, will be chosen for publishing.

Inviting inputs for which datasets to use at the event.

Should highlight outputs of this chapter

What are some things that have been created by members of this chapter, that datameet can claim some sort of involvement in? (like : connected with partners through datameet, found a particular dataset or code through datameet, etc)
We should feature them, to show a kind of output or result of doing this whole thing.

PDF data extraction related

This pertains to extracting text in Unicode Devanagari or other Indian language scripts from PDFs.
Just gathering some links.

https://bugs.documentfoundation.org/show_bug.cgi?id=66597 : Problems with copying and extracting text from generated PDF, LibreOffice bug tracker
Tabula (which is a popular opensource pdf to table extractor) issue discussing this
A patent document filed by Wipro that talks about a solution for this that they might have engineered (haven't read fully for now).

Providing descriptive captions for images on (educational/government) websites in India to improve their accessibility

Accessibility refers to the practice of making a product easy to use for users irrespective of their
abilities. In the case of web accessibility, the aim is to make it easy for users to read and understand
the content on a website. In this proposal, we focus on the accessibility of images to visually
impaired users. WCAG 2.0 is a set of guidelines published by the World Wide Web Consortium (W3C)
to make the content on websites more accessible to its consumers.
In order to improve the accessibility of images on websites, WCAG 2.0 provides different
solutions for different types of images. For instance, informative images such as photos should
be given a short description. Decorative images should not have any alternate text since their
purpose is only to make the page more attractive. Functional images are those that are displayed
on buttons and other controls that have associated actions. In such cases, the alternate text should
describe the action and not the image. Complex images such as graphs should be given elaborate
descriptions. Informative text should be avoided in images as much as possible. An image in a
group of images conveying the same meaning should have an alternative text such that it should
convey the meaning of all the images in the group. Image maps should have an alternative text
for each region.
Unfortunately, many websites do not follow these guidelines, which harms the user experience
because users are unable to acquire the information they were searching for.
Various tools have been created as part of studies carried out in the
area of image recognition. These tools perform one or more of the following:

Extracting text from images - Here, the focus is on images containing text. OCR software
solutions extract the text from the image. However, not many successful versions of OCR
could be found for Hindi.
Rendering the image without the text
One popular example is https://cloud.google.com/vision/ by Google. The tool accepts an image
and returns its auto-generated description. We tried to use this tool to read the following image:

The generated text is given below:
Mess mmm oir REGISTER For Open Governemnt Data (0GD) Platform India -· CATEG0RY 3 ·
Last Date: 18th February 2018
Another site, https://egreetings.gov.in/, which is used to share e-greeting cards, has a range of cards
to choose from but with no understandable alternate text. Two images of cards that fall in the
‘Holi greetings’ category are given below:

The alt text provided is ‘Holi | Greetings Portal’. Since images in such categories carry
cultural/religious/mythological graphics, it may not be possible for a generic image descriptor
tool to render the image and generate appropriate text; therefore manual intervention is required
in such scenarios.
To summarize, we propose to:

Select websites and identify the non-decorative images on all the pages of the site.
Categorize the images according to their type (as specified in WCAG 2.0).
Label each image with a tag (e.g. content, information, festival etc.).
Manually provide a suitable description in English/Hindi/other regional languages.
A repository of such images and their metadata could be created which could be used to map
with the websites they are available on.
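As a possible starting point for the first step of the proposal, the sketch below lists the images on a page together with the state of their alt text, using only the Python standard library. The class and function names and the sample HTML are hypothetical. Treating an empty `alt=""` as "marked decorative" is an assumption that follows the guideline above about decorative images; deciding whether an image really is decorative still needs a human.

```python
from html.parser import HTMLParser


class ImgAltAudit(HTMLParser):
    """Collects the src and alt attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        alt = attr.get("alt")
        if alt is None:
            status = "missing"   # no alt attribute at all
        elif alt.strip() == "":
            status = "empty"     # alt="" is how decorative images are usually marked
        else:
            status = "present"   # text exists, but may still be unhelpful
        self.images.append({"src": attr.get("src"), "alt": alt, "status": status})


def audit_images(html):
    parser = ImgAltAudit()
    parser.feed(html)
    parser.close()
    return parser.images


# Hypothetical page fragment for illustration.
sample_html = (
    '<p><img src="a.png" alt="Holi | Greetings Portal">'
    '<img src="b.png"><img src="c.png" alt=""/></p>'
)
report = audit_images(sample_html)
for item in report:
    print(item["src"], item["status"])
```

Images reported as "present" still need manual review, since generic text such as a site title repeated on every card carries no description of the image itself.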


Geohash idea for Bus stops (and other location redundancy) de-duplication

From Pune Open Data portal, we have lat-long data of bus stops, but it is non-unique and heavily repeating in some cases. The BRT stops were there in a separate unique list so they are easy to pry out, but the larger dataset of non-BRT stops needs work.

Geohashes resolve lat-long values into rectangular grid cells. So, a pair of lat-longs that are very close to each other but not identical can resolve to the same geohash. This could be a way of clustering the stops data. Links:

http://www.movable-type.co.uk/scripts/geohash.html - this is shorter than pluscodes even
https://plus.codes/map/
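The clustering idea can be sketched as follows: encode each stop to a geohash and group stops that share one. This is a plain implementation of the standard geohash encoding, not code from the portal. The stop names and coordinates are hypothetical, and the precision of 7 characters (cells roughly 150 m across) is an assumption to tune.

```python
from collections import defaultdict

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=7):
    """Encode a lat-long pair as a geohash string of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    use_lon = True  # bits alternate, starting with longitude
    while len(bits) < precision * 5:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits.append(1)
            rng[0] = mid
        else:
            bits.append(0)
            rng[1] = mid
        use_lon = not use_lon
    chars = []
    for i in range(0, len(bits), 5):
        index = 0
        for bit in bits[i:i + 5]:
            index = index * 2 + bit
        chars.append(BASE32[index])
    return "".join(chars)


def group_by_geohash(stops, precision=7):
    """Group (name, lat, lon) tuples by the geohash cell they fall in."""
    groups = defaultdict(list)
    for name, lat, lon in stops:
        groups[geohash_encode(lat, lon, precision)].append(name)
    return dict(groups)


# Hypothetical stops for illustration only.
stops = [
    ("Stop A", 18.52040, 73.85670),
    ("Stop A (duplicate entry)", 18.52041, 73.85672),
    ("Stop B", 18.60000, 73.75000),
]
groups = group_by_geohash(stops)
for cell, names in groups.items():
    print(cell, names)
```

One caveat with this approach: two stops a few metres apart can still land in different cells if a cell edge runs between them, so candidates in neighbouring cells may also need checking.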

Build a mapped data explorer for DKAN portals like Telangana Open Data Portal

Build a map + table interface like this that enables the user to pull in data from different sources.

Example:
post: Telangana Temperature Data from 2013 to 2017
file/resource: Monthly maximum temperature
There, see data API tab

Sample API query:
https://www.data.telangana.gov.in/api/action/datastore/search.json?resource_id=cc9950ce-89aa-455b-847b-d87756db8f91&limit=5

A query for district=adilabad and limit=2:
https://www.data.telangana.gov.in/api/action/datastore/search.json?resource_id=cc9950ce-89aa-455b-847b-d87756db8f91&district=adilabad&limit=2
(suggestion: copy-paste the json output to codebeautify, see in tree viewer mode)
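A small sketch of how a front end might assemble these query URLs programmatically is given below. The endpoint, `resource_id`, `district` and `limit` parameters are taken from the sample queries above; the function names are hypothetical, and the shape of the JSON response is deliberately not assumed here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://www.data.telangana.gov.in/api/action/datastore/search.json"


def build_query(resource_id, limit=5, **filters):
    """Build a datastore search URL; extra keyword arguments become column filters."""
    params = {"resource_id": resource_id, **filters, "limit": limit}
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_json(url, timeout=30):
    """Fetch and decode a query result (needs network access; not called below)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


RESOURCE_ID = "cc9950ce-89aa-455b-847b-d87756db8f91"
first_five = build_query(RESOURCE_ID)
adilabad = build_query(RESOURCE_ID, limit=2, district="adilabad")
print(first_five)
print(adilabad)
```

Keeping URL construction separate from fetching makes it easy to run several such queries from one page and to show the user the exact URL behind each table.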

Wanted: One page where multiple such queries can be run, and the output is displayed on inter-linked map, and table for that dataset. Multiple datasets > multiple tables loaded, but all on same map.

Clicking on a row on the table will make map zoom to it.
Selecting something on map will highlight the corresponding row on table.
Multiple selections possible
Filtering data on table will filter it on the map
Have a "constrain to map view" function that filters the tables to show only the data visible on the map.

MH Villages v2W2 shapefile District error

18.483120, 76.050704 (Area)

Should these have Yavatmal District assigned to them? I believe this is an error and it should be Osmanabad District.

Project: NoSQL database

Starting Idea: Get some data like Census data into a NoSQL database, and then build a web tool to query the database.

Reference links:

How to use mongoimport to import csv
MongoDB : Free cluster offering

Later: Build a web-based way for people to add in data, work towards a NoSQL data portal. See this conversation on datameet mailing list.

Gathering tasks for SPPU Statistics dept hackathon, 9 Feb 2019

Gathering tasks for COEP hackathon, 2 Feb 2019

Main participant audience: 3rd year computer engineering and IT students of COEP. However, attendance will be optional and the event will be open to others.

project: Adding Indian place names to Spellcheck Dictionaries

Where this is coming from:
https://etherpad.net/p/LibreOffice-Hackathon-Gnunify
Gnunify 2018 event, 17 Feb: Session on hacking LibreOffice conducted by @geekgod, where we talked about this.

Initial task list:

District Census Handbook page: http://www.censusindia.gov.in/2011census/dchb/DCHB.html
Download excel files for each state under "Town Amenities" and "Village Amenities" headings.
Find the worksheet & column for a. Districts , b. Sub-districts. And if desired, c. Towns, and d. Villages.
Extract the data. Take care to exclude headers.
Remove duplicates.
Remove artefacts like "(MC)", hyphens, asterisk etc.
Isolate entries having multiple words and figure out what to do with them. One option is to add those words in distinct entries, and remove the duplicates.
Diff with existing dictionary to get the place words that aren't present in dictionary.
Push this list to update the dictionary on LibreOffice and possibly other places.
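The cleaning, de-duplication and diff steps in the task list can be sketched roughly as below. The function names are hypothetical, the sample names are illustrative rather than actual Census handbook rows, and the choice to split multi-word and hyphenated entries into separate words is just one of the options mentioned above.

```python
import re


def clean_place_names(raw_names):
    """Strip artefacts such as "(MC)", hyphens and asterisks; return unique words."""
    words = set()
    for name in raw_names:
        name = re.sub(r"\([^)]*\)", " ", name)  # drop bracketed tags like "(MC)"
        name = re.sub(r"[-*]", " ", name)       # hyphens and asterisks become spaces
        for word in name.split():
            if word.isalpha():                  # skip stray numbers and symbols
                words.add(word)
    return words


def missing_from_dictionary(place_words, dictionary_words):
    """Return place words that the existing dictionary does not already contain."""
    known = {word.lower() for word in dictionary_words}
    return sorted(word for word in place_words if word.lower() not in known)


# Illustrative input only.
raw = ["Achalpur (MC)", "Pimpri-Chinchwad *", "Osmanabad", "Osmanabad", "Pune"]
existing_dictionary = ["pune", "India"]

cleaned = clean_place_names(raw)
to_add = missing_from_dictionary(cleaned, existing_dictionary)
print(to_add)
```

The comparison is case-insensitive here, which is an assumption; a spellcheck dictionary that treats capitalised place names as distinct entries would need a case-sensitive diff instead.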

GSDA groundwater site

GSDA : Groundwater Surveys and Development Agency, Maharashtra
Page where one can download Ground Water Recharge Priority maps of individual villages: https://gsda.maharashtra.gov.in/english/index.php/GWRechargePriorityMap

This page also has a state-wide map shown as a thumbnail that seemed heavy to load. The 'thumbnail' turned out to be about 4 MB in size and quite high resolution. So, I put it in mapwarper:
http://mapwarper.net/maps/28327#Preview_tab

Calculate open location codes for tabular data

Input : A table having lat and long values.

Desired Output : Same table, with columns added carrying OpenLocationCodes at varying precision levels: 4,6,8,10.

Target Audience : Knows excel and copy paste. No coding.

Two possible ways to do it:

Macro / script in spreadsheets. Errors encountered in making macros for LibreOffice
Webpage based script where user can copy-paste their data in or load a file, and can copy out the output or download it as a csv or so.
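Whichever route is taken, the core calculation is small. Below is a minimal sketch of Open Location Code encoding for the pair-based lengths asked for here (4, 6, 8, 10), written from the published description of the format rather than copied from the official library. The function names and the output column names (`olc_4` and so on) are hypothetical, and the result should be cross-checked against the official implementation before real use.

```python
OLC_ALPHABET = "23456789CFGHJMPQRVWX"  # 20 digits used by Open Location Code


def olc_encode(lat, lon, length=10):
    """Encode a lat-long pair as an Open Location Code of 2 to 10 digits."""
    if length not in (2, 4, 6, 8, 10):
        raise ValueError("this sketch supports only lengths 2, 4, 6, 8 and 10")
    lat = min(max(lat, -90.0), 90.0)          # clip latitude
    lon = (lon + 180.0) % 360.0 - 180.0       # wrap longitude into [-180, 180)
    # Work in integer units of 1/8000 degree, the resolution of a 10-digit code.
    lat_val = min(int(round((lat + 90.0) * 8000, 6)), 180 * 8000 - 1)
    lon_val = int(round((lon + 180.0) * 8000, 6)) % (360 * 8000)
    digits = []
    for _ in range(5):                        # five base-20 digit pairs
        digits.append(OLC_ALPHABET[lon_val % 20])
        digits.append(OLC_ALPHABET[lat_val % 20])
        lat_val //= 20
        lon_val //= 20
    code = "".join(reversed(digits))[:length]  # latitude digit leads each pair
    code = code.ljust(8, "0")                  # short codes are padded with zeros
    return code[:8] + "+" + code[8:]


def add_olc_columns(rows, lat_key="lat", lon_key="long", lengths=(4, 6, 8, 10)):
    """Return copies of the rows with one extra column per requested code length."""
    out = []
    for row in rows:
        new_row = dict(row)
        for n in lengths:
            new_row[f"olc_{n}"] = olc_encode(float(row[lat_key]), float(row[lon_key]), n)
        out.append(new_row)
    return out


table = [{"name": "sample point", "lat": 47.0000625, "long": 8.0000625}]
result = add_olc_columns(table)
print(result[0])
```

The same logic would port to a browser script for the copy-paste webpage option, since it needs nothing beyond integer arithmetic and string handling.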

Gathering tasks for SICSR data session Jan 23 2019

Marathi localisation for QGIS project : possibility to pull already done work from other open source projects

@geekgod (Karunakar Sir's) suggestion:
Pull in existing Marathi localisation from other projects like KDE, Mozilla etc.
These translations are typically in .po format.

The .ts file of QGIS can be converted to .po
Then, existing translations of recurring phrases like "Save Project As" can be pulled in from other projects. For the translation exercise, we are then left with only the remaining phrases, i.e. those that appear in QGIS but not in the other operating systems or software.

Look online for Translate Toolkit
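Once the strings are in a common format, the reuse step amounts to a lookup by source phrase. The sketch below shows that logic on plain dictionaries. It does not read .ts or .po files, the function name is hypothetical, and the translated values are placeholders rather than real Marathi strings.

```python
def reuse_translations(qgis_sources, existing_catalogues):
    """Fill in translations found in other projects; list what is still untranslated."""
    translated = {}
    remaining = []
    for source in qgis_sources:
        for catalogue in existing_catalogues:
            if catalogue.get(source):          # first non-empty match wins
                translated[source] = catalogue[source]
                break
        else:
            remaining.append(source)           # no other project has this phrase
    return translated, remaining


# Placeholder catalogues: source phrase -> translation from another project.
kde_catalogue = {"Save Project As": "<Marathi text from KDE>", "Open": "<Marathi text from KDE>"}
mozilla_catalogue = {"Open": "<Marathi text from Mozilla>", "Print": "<Marathi text from Mozilla>"}

qgis_phrases = ["Save Project As", "Print", "Add Vector Layer"]
done, todo = reuse_translations(qgis_phrases, [kde_catalogue, mozilla_catalogue])
print(done)
print(todo)
```

Exact-match lookup is the simplest policy; phrases that differ only in punctuation or keyboard accelerator markers would be missed and may deserve a normalisation step.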

@craigdsouza