To gather domain-specific data, the most critical objective of data collection is ensuring that information-rich and reliable data is collected.

  

To gather domain-specific data, the most critical objective of data collection is ensuring that information-rich and reliable data is collected. The approach to data collection will not be the same for every domain; it depends largely on the type of information/data available for that domain. For example, some domains have readily available data sources such as spreadsheets and CSV files, while others require scraping from different sources to build a reliable database. Irrespective of that, the following is a generic approach you can incorporate while collecting data.

- First, estimate the number of attributes (columns) you would like to have in your data collection. This number should be reasonable, neither too high nor too low. For example, for a domain like movies, the attributes can be the year of release, title, country, language, budget, production company, plot, etc. A range of 15-50 attributes is a good start. Note: the attribute count can vary based on the selected domain and the data.
- Start searching for data sources. These can be websites, APIs, spreadsheets, CSV files, PDFs, Wiki dumps, databases, etc. When a data source is found:
  - Check whether the data is meaningful and can be used for article generation. If you find it meaningful, check for the attributes that can be used.
  - Extract the data.
- Scraping libraries/tools: BeautifulSoup, Scrapy, Tabula, Selenium, etc. are some starter libraries/tools you can use for scraping. Note: apart from these, there are many other ways to perform this task; feel free to explore and make use of them.
- You can also rely on Wikipedia articles and Wikidata for scraping the data you want to use in your dataset.
- In the case of APIs, spreadsheets, and CSV files, the data can be used directly after analyzing it.
- Images can also be collected if required for the data.
- The steps of searching for data sources and extracting data repeat until you have enough attributes.
- If the data is collected from multiple sources, merge all of it into a unified knowledge base based on a primary key (such as an ID or name).
- The data can be stored as JSON, Excel sheets, or any key-value data type.
- After collecting the data, analyze and clean it before making use of it. Look for missing values and duplicate entries in the dataset, and make sure there are no duplicate entries. For missing values, you can either remove the records with missing fields or fill in the missing values, depending on the data.
- To perform Exploratory Data Analysis (EDA), you can make use of libraries like Sweetviz (an open-source Python library). You can refer to that here.

Data set link: https://docs.google.com/spreadsheets/d/1NWOb2_KumNqx34CoEtJNQUTlNhF60fkk4pL1-Zoe8kQ/edit?usp=sharing
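The extraction step above can be sketched with BeautifulSoup. This is a minimal example that parses an inline HTML table instead of a live website; the table, the movie titles, and the attribute names are all hypothetical stand-ins for whatever your chosen source actually contains:

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for a scraped page (hypothetical data).
html = """
<table id="movies">
  <tr><th>title</th><th>year</th><th>language</th></tr>
  <tr><td>Movie A</td><td>1999</td><td>English</td></tr>
  <tr><td>Movie B</td><td>2005</td><td>French</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table", id="movies").find_all("tr")

# The header row gives the attribute names; each later row is one record.
headers = [th.get_text() for th in rows[0].find_all("th")]
records = [
    dict(zip(headers, [td.get_text() for td in row.find_all("td")]))
    for row in rows[1:]
]
print(records)
```

For a real website you would fetch the page first (e.g. with `requests`) and pass the response text to `BeautifulSoup`; the parsing logic stays the same.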
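The merge-and-clean steps above (unifying sources on a primary key, removing duplicates, handling missing values) can be sketched with pandas. The two small DataFrames here are hypothetical sources sharing "title" as the primary key:

```python
import pandas as pd

# Two hypothetical sources that share "title" as a primary key.
basics = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie B", "Movie C"],  # duplicate Movie B
    "year": [1999, 2005, 2005, None],                       # Movie C's year missing
})
budgets = pd.DataFrame({
    "title": ["Movie A"],
    "budget": [1_000_000],
})

# Merge into a unified knowledge base on the primary key.
merged = basics.merge(budgets, on="title", how="left")

# Remove duplicate entries, drop records missing a key attribute,
# then fill the remaining missing values with a placeholder.
cleaned = merged.drop_duplicates(subset="title")
cleaned = cleaned.dropna(subset=["year"])
cleaned = cleaned.fillna({"budget": 0})
print(cleaned)
```

Whether to drop a record or fill its missing values (and with what) depends on the data, as noted above; `dropna` and `fillna` here just show both options on one frame.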
